The Real Problem: PPO Costs Two Models
Proximal Policy Optimization is the algorithm that made RLHF practical. OpenAI used it for InstructGPT and ChatGPT; Anthropic and Meta used variants of it for Claude and LLaMA-Chat; nearly every commercial LLM deployed between 2022 and 2024 owes its alignment to a PPO loop. It works. It is stable. It is well-understood. And it has a problem that only becomes obvious when you try to scale it to a 70B parameter policy: PPO trains two models at the same time.
The first model is the one you actually want: the policy $\pi_\theta$, which generates the responses. The second is the critic $V_\phi$, which predicts how much reward each intermediate state is going to accumulate. PPO needs the critic because it relies on a quantity called the advantage $A_t$ (how much better the action at step $t$ was than the critic expected) to do the variance-reduction trick that makes the gradient learnable. No advantage, no PPO.
In the original Atari and MuJoCo papers, the critic was tiny: a two-layer MLP feeding off a shared CNN trunk. Free, basically. In LLM RLHF the critic is a copy of the language model with a scalar head bolted on. A 70B policy comes with a 70B critic. Mixed precision and AdamW push the per-parameter training cost to ~16 bytes, so the critic alone is over 1 TB of GPU memory before you have stored a single activation. That is the bottleneck this section is named for.
What this section is going to prove. The critic is not a polite optional ingredient of PPO — it is structurally necessary for the algorithm as written. Removing it changes the math. The next section (15.2) shows how GRPO replaces it with a group baseline that achieves the same variance-reduction effect without the model. Before we can earn that result, we have to understand exactly what the critic is doing and exactly what it costs.
Intuition: The Coach Who Must Be as Smart as the Player
Imagine a chess match. The player (the policy) makes moves. After the whole game, an external judge says "you won" or "you lost" — that is the reward signal from the reward model. The player wants to learn from this verdict, but there is a problem: the verdict applies to the entire game, not to any individual move. Was the queen sacrifice on move 12 brilliant or terrible? The scalar "you won" cannot tell us.
Plain policy gradient handles this by being naïve: it credits every move in a winning game with the full reward. This is unbiased but absurdly noisy — most of the moves had nothing to do with the win. Training is technically possible but takes oceans of data and is prone to wild swings.
PPO solves it by hiring a coach — the critic. After every move, the coach whispers: "you are now in a position I'd expect to win 73% of the time." The next move's coach-estimate is 81%. So the policy now has a clean signal: that move was a +8% surprise, much smaller variance than a blunt "you won". Every move now gets its own credit assignment, derived from how much it shifted the coach's opinion.
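The coach's per-move credit can be sketched in a few lines. This is only an illustration of the analogy: the list `coach_estimates` and the helper `per_move_surprise` are hypothetical, and only the 0.73 and 0.81 estimates come from the text above.

```python
# Hypothetical coach (critic) win-probability estimates before each move.
# Only the 0.73 -> 0.81 step comes from the text; the rest are made up.
coach_estimates = [0.50, 0.55, 0.73, 0.81, 0.78, 0.90]
final_outcome = 1.0  # the judge's verdict: "you won"


def per_move_surprise(estimates, outcome):
    """Credit for each move = how much it shifted the coach's estimate.

    The last move is compared against the actual outcome, because there is
    no later position left to estimate.
    """
    targets = estimates[1:] + [outcome]
    return [nxt - cur for cur, nxt in zip(estimates, targets)]


surprises = per_move_surprise(coach_estimates, final_outcome)
for move, s in enumerate(surprises):
    print(f"move {move}: {s:+.2f}")

# The credits telescope: they add up to the outcome minus the coach's
# opening estimate, so the verdict is redistributed, not invented.
total_credit = sum(surprises)
print(f"total credit: {total_credit:+.2f}")
```

Each move gets its own small signal, and the signals still sum to the part of the verdict the coach did not already expect.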
The catch: the coach has to actually understand the game. A coach who knows nothing about chess gives noisy whispers, and the whole scheme falls apart. For chess this is fine — winning probability has been studied for centuries and a small model can estimate it. For an LLM completing an arbitrary prompt under an arbitrary reward model, "winning probability from this partial response" is a problem on the same complexity scale as generating the response. The coach needs to be roughly as smart as the player.
GRPO's insight (which we will derive properly in 15.2) is that there is another way to get a baseline: sample the same prompt many times, then use the group's mean reward as the baseline for each individual sample. A response that scored above the group mean gets positive advantage; below mean gets negative. No critic, no value function, no second neural network. The variance reduction is comparable, and the memory cost is gone.
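As a rough preview of the group baseline (the proper derivation is left to 15.2), here is a minimal sketch. The function name `group_baseline_advantages` and the reward values are assumptions made up for illustration.

```python
from statistics import mean


def group_baseline_advantages(rewards):
    """Advantage of each sampled response = its reward minus the group mean."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical reward-model scores for four responses to the same prompt.
group_rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_baseline_advantages(group_rewards)

for reward, adv in zip(group_rewards, advantages):
    print(f"reward {reward:.1f} -> advantage {adv:+.2f}")
```

The baseline here costs one call to `mean`. No parameters are trained to produce it, which is the whole point.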
The Mathematics of PPO and Its Critic
We need three equations to see the bottleneck. First, the policy gradient with a baseline:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\right]$$
Here $\tau$ is a trajectory (in LLM-land, a full response of $T$ tokens), $a_t$ is the action at step $t$ (a sampled token), and $s_t$ is the state (the prompt plus all tokens generated so far). The advantage $A_t$ is what gets multiplied with the score function $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$. Without a baseline it would just be the cumulative reward from $t$ onward, a random variable with enormous variance for sparse, terminal rewards.
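The baseline-corrected advantage subtracts a state-conditional estimate of expected return, called the value function:
$$A_t = G_t - V^{\pi}(s_t), \qquad G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k, \qquad V^{\pi}(s_t) = \mathbb{E}_{\pi}\left[G_t \mid s_t\right]$$
The value function $V^{\pi}$ is the quantity the critic $V_\phi$ is trained to predict. With it in hand, we can build the generalized advantage estimator (GAE):
$$\hat{A}_t = \sum_{l=0}^{T-1-t} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
The term $\delta_t$ is the temporal-difference (TD) error: the difference between the immediate reward plus the critic's estimate of the next state and the critic's estimate of the current state (with $V_\phi(s_T) = 0$ past the end of the response). GAE compounds these errors with discount factor $\gamma$ (typically 1 for LLMs) and an interpolation knob $\lambda$ (typically 0.95). At $\lambda = 0$ the advantage reduces to a single TD error (low variance, biased by an inaccurate critic); at $\lambda = 1$ it collapses to the Monte-Carlo return minus $V_\phi(s_t)$ (unbiased but high variance).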
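The two limits of the interpolation knob are easy to check numerically. The sketch below is a plain-Python GAE under stated assumptions: the function name `gae` and the reward and value numbers are illustrative, not taken from any particular library or from this chapter's worked example.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation by backward recursion.

    rewards[t] is the reward at step t, values[t] the critic's estimate
    V(s_t). The value after the final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # V(s_T) = 0 at the end of the response
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        advantages[t] = delta + gamma * lam * next_adv
        next_value = values[t]
        next_adv = advantages[t]
    return advantages


# Illustrative inputs: sparse terminal reward, critic estimates rising.
rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
values = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

adv_td = gae(rewards, values, gamma=1.0, lam=0.0)   # pure TD errors
adv_mc = gae(rewards, values, gamma=1.0, lam=1.0)   # Monte-Carlo return minus V
adv_mix = gae(rewards, values, gamma=1.0, lam=0.95)

print("lam=0.00:", [round(a, 4) for a in adv_td])
print("lam=0.95:", [round(a, 4) for a in adv_mix])
print("lam=1.00:", [round(a, 4) for a in adv_mc])
```

With `lam=1.0` and `gamma=1.0` every entry equals the terminal reward minus the critic's estimate at that step; with `lam=0.0` every entry is just that step's TD error.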
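Finally, the PPO clipped surrogate loss:
$$L^{\text{CLIP}}(\theta) = -\,\mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
And alongside it, the value-function loss that trains the critic:
$$L^{\text{VF}}(\phi) = \mathbb{E}_t\left[\left(V_\phi(s_t) - \hat{R}_t\right)^2\right], \qquad \hat{R}_t = \hat{A}_t + V_\phi(s_t)$$
The full PPO step is a weighted sum: $L = L^{\text{CLIP}} + c_1 L^{\text{VF}} - c_2 H[\pi_\theta]$, with a value-loss coefficient $c_1$ and an entropy bonus weighted by $c_2$. Note where $V_\phi$ appears in this system: in $\hat{A}_t$ (which enters the policy loss), in $L^{\text{VF}}$, and implicitly in the bootstrap target $\hat{R}_t$ via $\hat{A}_t + V_\phi(s_t)$. If you delete $V_\phi$, you have to provide an alternative for every one of these, which is exactly what GRPO does.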
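To see how the clipped surrogate and the value loss combine, here is a minimal plain-Python sketch. The function name `ppo_losses`, the default `c1=0.5`, and the omission of the entropy bonus are assumptions made for brevity; the inputs in the usage lines are contrived to make clipping visible.

```python
import math


def ppo_losses(logp_new, logp_old, advantages, values, returns,
               eps=0.2, c1=0.5):
    """Clipped policy loss, value loss, and their weighted sum.

    The entropy bonus is left out to keep the sketch short; c1=0.5 is an
    assumed coefficient, not a value taken from the text.
    """
    n = len(advantages)
    policy_terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # importance ratio
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        policy_terms.append(min(ratio * adv, clipped * adv))
    policy_loss = -sum(policy_terms) / n
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n
    return policy_loss, value_loss, policy_loss + c1 * value_loss


# Token 0: ratio 1.5 with positive advantage, so the clip caps it at 1.2.
# Token 1: ratio 1.0, so clipping is inactive and the term is the advantage.
policy_loss, value_loss, total_loss = ppo_losses(
    logp_new=[math.log(1.5), -1.0],
    logp_old=[0.0, -1.0],
    advantages=[1.0, 0.5],
    values=[0.0, 0.5],
    returns=[1.0, 1.0],
)
print(policy_loss, value_loss, total_loss)
```

Note that `values` and `returns` only exist because there is a critic; the policy term needs nothing from it except the advantages.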
Manual Numerical Walkthrough
Let us compute one PPO step end-to-end on a trajectory of 6 tokens, by hand, to see exactly where the critic enters every line of the calculation.
Manual PPO step on a 6-token trajectory
Step 0 — Setup. We have a single response of 6 tokens. The reward model assigns a sparse, terminal reward: . The old policy log-probs (from rollout time) and the current log-probs (after a couple of gradient steps) are: and . The critic predicts . Take , .
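Step 1 — Bootstrap value. For each step we need $V_\phi(s_{t+1})$. Shift $V$ left by one position and append 0 for the terminal step.
Step 2 — TD errors. $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$, evaluated token by token for tokens 0 through 5. Tokens 0 to 4 carry no reward, so their TD errors reduce to $\gamma V_\phi(s_{t+1}) - V_\phi(s_t)$; at the terminal token 5 the bootstrap value is 0, so $\delta_5 = r_5 - V_\phi(s_5)$.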
Step 3 — Run GAE backwards. With :
So . Notice that every token has a positive advantage even though only the last token actually received a reward. That is GAE doing its job — credit propagated backward through the critic's estimates.
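Step 4 — Importance ratio. $\rho_t = \exp\left(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_{\text{old}}}(a_t \mid s_t)\right)$: take the per-token differences of the log-probs, then exponentiate. Every ratio sits inside the 0.8–1.2 clip window ($\epsilon = 0.2$), so clipping is inactive for this trace: a clean PPO update.
Step 5 — Policy loss. $L^{\text{CLIP}} = -\frac{1}{T}\sum_t \min\left(\rho_t \hat{A}_t,\ \operatorname{clip}(\rho_t, 0.8, 1.2)\,\hat{A}_t\right)$. With clipping inactive, the per-token terms are just the products $\rho_t \hat{A}_t$. Sum 2.902, mean 0.484, so $L^{\text{CLIP}} = -0.484$ (a negative loss means the policy is improving in the direction of the advantage, which is exactly what we want).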
Step 6 — Value loss. Returns $\hat{R}_t = \hat{A}_t + V_\phi(s_t)$. Squared error per step $\left(V_\phi(s_t) - \hat{R}_t\right)^2$. Mean: 0.216. So $L^{\text{VF}} = 0.216$. The critic is consistently under-predicting the returns at every step. It's late to the party, which is typical critic behaviour and what the value loss is trying to correct.
Step 7 — Total loss. . Two networks contributed gradients to this number. Two networks received parameter updates. Two networks need to be evaluated on the next rollout. Multiply by every PPO step in a training run, then multiply by the parameter count of a frontier policy.
Step 8 — What GRPO will replace. Steps 1–3 (the entire GAE machinery) and step 6 (the value loss) all go away. In GRPO we would instead sample $G$ responses from the same prompt, give each its terminal reward, normalize those rewards across the group, and use the normalized scalar directly as $\hat{A}_t$ for every token of that response. Steps 4 (ratio) and 5 (clipped surrogate) survive unchanged.
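The replacement described in step 8 can be sketched as follows. This is a hedged illustration only: the function name `grpo_token_advantages`, the use of the population standard deviation, the small `eps` guard, and the reward and length values are all assumptions; Section 15.2 gives the actual derivation.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-8):
    """Group-normalized advantages, broadcast to every token.

    rewards[i] is the terminal reward of response i, lengths[i] its token
    count. Each response gets one scalar, repeated once per token.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumed: population std over the group
    scalars = [(r - mu) / (sigma + eps) for r in rewards]
    return [[a] * n for a, n in zip(scalars, lengths)]


# Hypothetical group of G = 4 responses to one prompt.
group_rewards = [1.0, 0.0, 0.0, 1.0]
group_lengths = [3, 2, 4, 3]
token_advantages = grpo_token_advantages(group_rewards, group_lengths)

for i, advs in enumerate(token_advantages):
    print(f"response {i}: {[round(a, 3) for a in advs]}")
```

These per-token advantages would then feed the same ratio and clipped-surrogate computation as in steps 4 and 5, with no value network and no value loss.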
Visualizing the Critic Bottleneck
The ledger stacks the GPU memory PPO consumes during one training step. Trainable weights (policy and critic) cost ~16 bytes per parameter under mixed-precision AdamW: 2 bytes for the fp16 forward copy, 2 for the fp16 gradient, 4 for the fp32 master copy, and 4 + 4 for the AdamW first- and second-moment buffers. Frozen models (reference policy and reward model) cost only the 2 bytes for fp16 forward. Switching from PPO to GRPO deletes the critic, and with it its share of the bill, typically the largest share.
Three things to internalize from the ledger. First, at every model size the critic and its optimizer state together cost far more than the policy's fp16 weights. The policy and critic share an architecture, so their fp16 weights are equal, but the critic is also trainable, so its gradient, fp32 master copy, and AdamW moments are added on top. Second, the saving is not 25%: it is closer to 50% of the trainable budget and ~40% of the total budget once frozen models and activations are counted. Third, doubling the H100 count to host the critic is just the memory story; the compute story (FLOPs per step) and the comms story (gradient all-reduce volume) also double. The bottleneck is multi-dimensional.
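The per-parameter accounting can be turned into a small calculator. The sketch below counts weights, gradients, and optimizer state only (no activations or framework overhead) and assumes all four models have the same parameter count; the function name `weight_ledger_gb` and the list of sizes are illustrative.

```python
# Bytes per parameter under mixed-precision AdamW, as itemized in the text.
BYTES_TRAINABLE = 2 + 2 + 4 + 4 + 4  # fp16 weights, fp16 grad, fp32 master, m, v
BYTES_FROZEN = 2                     # fp16 weights only


def weight_ledger_gb(n_params_billion, with_critic):
    """Memory in GB (1 GB = 1e9 bytes) for weights and optimizer state."""
    trainable_models = 2 if with_critic else 1   # policy (+ critic)
    frozen_models = 2                            # reference policy + reward model
    trainable = trainable_models * BYTES_TRAINABLE * n_params_billion
    frozen = frozen_models * BYTES_FROZEN * n_params_billion
    return {"trainable": trainable, "frozen": frozen,
            "total": trainable + frozen}


for size in (7, 13, 70):
    ppo = weight_ledger_gb(size, with_critic=True)
    grpo = weight_ledger_gb(size, with_critic=False)
    trainable_saving = 1 - grpo["trainable"] / ppo["trainable"]
    total_saving = 1 - grpo["total"] / ppo["total"]
    print(f"{size}B: PPO {ppo['total']} GB, GRPO {grpo['total']} GB, "
          f"trainable saving {trainable_saving:.0%}, "
          f"weights-only total saving {total_saving:.0%}")
```

Because every term scales linearly with the parameter count, the fractions come out the same at every model size; the total saving moves toward ~40% once activations and overhead are added to both sides.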
Plain Python: PPO With a Tiny Critic
Here is the full PPO step in NumPy. Six tokens, hand-coded GAE, the clipped surrogate loss, and the value loss. The numbers from this script reproduce the manual walkthrough above to four decimal places.
Two implementation details to lock into memory. First, GAE is computed by a backward recursion over time, not a forward one. This is because the discount factor compounds away from the terminal state: only after we know $\hat{A}_{t+1}$ can we compute $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. Forgetting this and writing a forward loop is one of the most common bugs in homegrown PPO implementations. Second, the value loss is computed against $\hat{R}_t = \hat{A}_t + V_\phi(s_t)$, not against the raw cumulative reward. The critic is being trained to predict a quantity that itself depends on the critic's current predictions. It is a bootstrap, and bootstraps can diverge when the policy is improving rapidly, which is precisely when you most want stable training.
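Both details can be checked with a short plain-Python sketch. The helper names `gae_and_value_targets` and `reward_to_go` and the input numbers are illustrative assumptions; they are not the values of the manual walkthrough.

```python
def gae_and_value_targets(rewards, values, gamma=1.0, lam=0.95):
    """Backward GAE recursion, plus the bootstrapped critic targets A_t + V_t."""
    advantages = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0
    for t in reversed(range(len(rewards))):      # backward, not forward
        delta = rewards[t] + gamma * next_value - values[t]
        advantages[t] = delta + gamma * lam * next_adv
        next_value, next_adv = values[t], advantages[t]
    targets = [a + v for a, v in zip(advantages, values)]
    return advantages, targets


def reward_to_go(rewards, gamma=1.0):
    """Raw discounted cumulative reward from each step onward."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
values = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

_, targets = gae_and_value_targets(rewards, values, lam=0.95)
_, targets_mc = gae_and_value_targets(rewards, values, lam=1.0)
raw = reward_to_go(rewards)

print("critic targets (lam=0.95):", [round(x, 4) for x in targets])
print("critic targets (lam=1.00):", [round(x, 4) for x in targets_mc])
print("raw reward-to-go:         ", raw)
```

With `lam=0.95` the early targets sit below the raw reward-to-go because they lean on the critic's own estimates; only at `lam=1.0` do the two coincide.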
PyTorch: The Real PPO Loss
Now the production shape. Two transformers — one with a vocab-sized output head (the policy), one with a scalar head (the critic). Identical layer counts, identical hidden sizes, identical attention. The only architectural difference is the final linear layer.
Pay attention to the two forward passes through transformer stacks of the same size. In the toy version (4 layers, 256 hidden, batch of 2 × 16 tokens) the wall-clock difference is invisible. At the scale of a 70B policy, those two passes are the dominant per-step cost of RLHF, and they are unavoidable in PPO.
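Sanity-check yourself. Print the parameter counts of policy and critic. The two differ only in the head, whose shape differs by a factor of the vocabulary size $|\mathcal{V}|$. At production scale the head is a tiny fraction of total params, so the counts should agree to within a fraction of a percent; in a toy model with a large vocabulary the head is a much larger share, so expect a bigger gap there. This near-equality is the bottleneck, expressed in a single number.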
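A back-of-the-envelope version of this check needs no framework at all. The sketch below uses a rough GPT-style estimate (about 12·d² parameters per layer, untied input embedding and output head, biases and layer norms ignored); the function name `approx_transformer_params`, the 80-layer, 8192-wide, 32,000-token configuration, and the toy vocabulary size are assumptions chosen for illustration.

```python
def approx_transformer_params(n_layers, d_model, vocab_size, head_out):
    """Rough parameter count: embedding + transformer blocks + output head."""
    embedding = vocab_size * d_model
    blocks = 12 * n_layers * d_model ** 2   # ~4 d^2 attention + ~8 d^2 MLP
    head = d_model * head_out
    return embedding + blocks + head


def relative_gap(n_layers, d_model, vocab_size):
    """Fraction by which the policy outweighs the critic."""
    policy = approx_transformer_params(n_layers, d_model, vocab_size, vocab_size)
    critic = approx_transformer_params(n_layers, d_model, vocab_size, 1)
    return (policy - critic) / policy


# Hypothetical large configuration, roughly in the 70B class.
large_gap = relative_gap(n_layers=80, d_model=8192, vocab_size=32000)
# Toy configuration from the text (4 layers, 256 hidden), vocab size assumed.
toy_gap = relative_gap(n_layers=4, d_model=256, vocab_size=32000)

print(f"large model: policy is {large_gap:.2%} bigger than the critic")
print(f"toy model:   policy is {toy_gap:.2%} bigger than the critic")
```

At scale the scalar head saves almost nothing: the critic is, for memory purposes, a second copy of the policy.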
At Massive Scale: The 70B PPO Memory Bill
Let us put real numbers on the page. A 70B parameter policy trained with PPO RLHF, mixed-precision AdamW, ZeRO-3 sharded across a multi-node cluster of H100s:
| Component | Bytes / param | Total (70B) |
|---|---|---|
| Policy fp16 weights | 2 | 140 GB |
| Critic fp16 weights | 2 | 140 GB |
| Reference policy fp16 (frozen) | 2 | 140 GB |
| Reward model fp16 (frozen) | 2 | 140 GB |
| AdamW state (policy): fp32 master + (m,v) + grad fp16 | 14 | 980 GB |
| AdamW state (critic): fp32 master + (m,v) + grad fp16 | 14 | 980 GB |
| Activations (batch 8, seq 4096, recompute on) | ~ | ~120 GB |
| NCCL buffers + framework overhead | ~ | ~80 GB |
| Total | ~ | ~2.7 TB |
2.7 TB of GPU memory. An 8×H100 node provides 640 GB. ZeRO-3 shards the optimizer state, gradients, and parameters across the data-parallel ranks, but the activation memory and frozen models are replicated per data-parallel replica. Even with perfect sharding, 2.7 TB needs at least 34 H100s at 80 GB each, so a 70B PPO RLHF run on 16–32 H100s only fits by offloading part of this ledger to host memory or otherwise shrinking it. Either way, most of those GPUs are sitting on bytes of critic state that exist for the sole purpose of producing a scalar baseline.
Now look at the same ledger for GRPO. The critic row disappears (140 GB saved). The critic's AdamW state disappears (980 GB saved). The total drops to ~1.6 TB — a 40% reduction, and it is the cleanest 40% you will ever cut from a training bill because it requires changing exactly one part of the algorithm and leaves the rest of the RLHF pipeline (reference policy, KL anchor, reward model, sampling code, sequence-packing utilities) completely untouched.
- Memory. ~40% lower per-replica footprint, allowing larger batches or smaller GPU pools.
- Compute. ~50% fewer FLOPs per training step (no critic forward, no critic backward).
- Communication. ~50% less gradient all-reduce volume, which is often the latency bottleneck on multi-node training.
- Engineering. One fewer model to checkpoint, monitor, debug, and tune. Eliminates the value-learning-rate hyperparameter, which was a perennial source of instability.
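The arithmetic behind the 70B ledger and the GRPO saving can be reproduced directly from the table. In this sketch the row values are copied from the table above; the 80 GB per-GPU figure follows from the 640 GB node, and the GPU counts are lower bounds that assume every byte shards perfectly, which real runs will not achieve.

```python
import math

# Rows of the 70B PPO ledger, in GB, copied from the table above.
ledger_gb = {
    "policy fp16 weights": 140,
    "critic fp16 weights": 140,
    "reference policy fp16 (frozen)": 140,
    "reward model fp16 (frozen)": 140,
    "AdamW state (policy)": 980,
    "AdamW state (critic)": 980,
    "activations": 120,
    "NCCL buffers + framework overhead": 80,
}
CRITIC_ROWS = ("critic fp16 weights", "AdamW state (critic)")
H100_GB = 80  # 640 GB per 8-GPU node

ppo_total_gb = sum(ledger_gb.values())
critic_gb = sum(ledger_gb[row] for row in CRITIC_ROWS)
grpo_total_gb = ppo_total_gb - critic_gb
reduction = 1 - grpo_total_gb / ppo_total_gb

# Lower bounds only: assumes perfect sharding with nothing replicated.
min_h100s_ppo = math.ceil(ppo_total_gb / H100_GB)
min_h100s_grpo = math.ceil(grpo_total_gb / H100_GB)

print(f"PPO total:  {ppo_total_gb} GB (>= {min_h100s_ppo} H100s)")
print(f"GRPO total: {grpo_total_gb} GB (>= {min_h100s_grpo} H100s)")
print(f"critic share removed: {critic_gb} GB ({reduction:.0%})")
```

The two critic rows account for the entire difference between the two totals; every other row is untouched.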
Engineering Reality: Three Failure Modes
The memory bill alone is enough to motivate GRPO, but the critic causes three more practical headaches that anyone who has tried to run PPO RLHF at scale will recognize.
1. The critic lags the policy
The critic is trained to predict the value function of the current policy. But the policy is changing every step. If the policy improves faster than the critic can keep up, the critic's predictions become systematically wrong, producing biased advantages that point the policy in the wrong direction. The classic symptom: training loss looks fine for several hundred steps, then the reward score collapses suddenly. The fix is to give the critic a higher learning rate than the policy, but the right ratio is dataset-dependent and is one of those hyperparameters you can only find by burning compute. GRPO does not have this knob because it does not have a critic.
2. The critic saturates on early-trajectory tokens
For LLM RLHF, the first few tokens of every response are nearly deterministic given the prompt (think: "Sure, I can help with that..."). The critic learns to predict their value with extreme confidence and a tiny gradient. The last few tokens, where the reward actually gets assigned, have much higher entropy — but the critic has spent its capacity on the easy beginnings. The result is poorly calibrated advantages near the end of the sequence, which is exactly where they matter most. Practitioners work around this with various attention-mask tricks and per-token loss weights. GRPO sidesteps it entirely by using a single sequence-level advantage.
3. Reward hacking through the critic
Because the critic is part of the gradient computation, the policy can — and in long enough training runs, will — find ways to manipulate the critic's predictions to inflate advantages without actually improving the reward. This is Goodhart's Law applied to value learning: the moment your proxy (the critic) becomes part of the optimization loop, it stops being a clean measurement. Detecting this is hard because the proxy reward keeps going up; the true reward (held-out evaluations) is what reveals the deception. GRPO is not immune to reward hacking, but it removes one entire surface — the critic — from the set of things the policy can learn to exploit.
In Section 15.2 we will derive GRPO formally — show how the group-relative advantage falls out of the same policy-gradient objective, prove it is a valid baseline, and trace where every piece of PPO is replaced. The math, you will see, is shorter than what we just did.