The Real Problem: PPO Needs a Critic It Cannot Afford
The previous section laid out PPO's machinery: a clipped importance ratio, a value head that supplies the per-token baseline, and a per-token KL anchor to the reference policy. The algorithm works. It also pays a heavy price. A 70B policy demands a 70B value head running in lockstep: the value network is a second model the same size as the policy, with its own forward pass, its own backward pass, and its own slice of AdamW state. At full mixed precision that is roughly 16 bytes per trainable parameter (bf16 weights and gradients, plus fp32 master weights and the two AdamW moments), doubled because there are two such models. For a 70B run the value model alone adds over a terabyte of memory before a single activation is allocated.
Worse, the value head is dead weight in the most literal sense. Nothing in the policy's output depends on $V(s_t)$; its only job is to subtract a per-state expected return from the actual return so the policy gradient sees a centred signal. The value head is a variance reducer, not a behavioural component. We are paying 70 billion parameters to compute one scalar per token whose only function is to make another loss less noisy.
And the value head is hard to train. It chases a moving target: the policy keeps changing, so the return distribution keeps shifting, so the value head's MSE target keeps drifting. PPO ships with a clipped value loss precisely because uncontrolled value updates routinely blow up. Every engineer who has stood up a from-scratch PPO-RLHF run has felt this pain — value loss oscillating, advantages collapsing, the run wedging on a plateau that has nothing to do with the policy itself.
GRPO was born from a single question: can we get a usable per-token advantage without a value head? The DeepSeek-Math team (2024) answered yes, with a one-line trick that has since propagated through DeepSeek-R1, Qwen-2-Math, Llama-3 math fine-tunes, and most open-source rule-based RL recipes. The trick: for every prompt, sample G rollouts at once, use the mean of their final rewards as the baseline, and use their spread as the scale. The group is its own critic.
Intuition: Let the Group Be Its Own Baseline
Imagine you are grading G = 8 students who all answered the same essay prompt. You do not have an objective rubric; you only have your gut ranking. The natural way to grade is relative: praise the essays that scored above the average of these eight, criticise the ones that scored below. Whether the average is “objectively good” is irrelevant; what matters is which essays stood out from this group.
GRPO is that intuition mechanised. For every prompt, sample G responses. Score each one. Subtract the group mean. Now you have a signed advantage per response — positive means “do more of this”, negative means “do less of this”. There is no value network because the group itself supplies the answer to “what was expected of this prompt?”. The expected reward of the current policy on this prompt is, well, the empirical mean of the G samples we just drew from it.
Why does this work? Because the policy gradient does not need an absolute baseline; it only needs a baseline that leaves the gradient unbiased. Any baseline that does not depend on the action does exactly that (this is the classical baseline-subtraction theorem from REINFORCE). In expectation, the group mean of rewards on the same prompt is such a baseline: it depends only on the prompt and on the current policy, not on any individual action. So we are allowed to subtract it, and the gradient stays pointing in the right direction.
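The baseline-subtraction argument can be checked numerically. The sketch below is an illustration rather than anything from the GRPO codebases: it assumes a hypothetical three-action softmax policy with made-up logits and rewards, computes the exact expected policy gradient by summing over actions, and compares three different constant baselines.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_policy_gradient(logits, rewards, baseline):
    """Exact E_a[(R(a) - b) * d log pi(a) / d logits] for a softmax policy."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            # d log pi(a) / d logit_k = 1[k == a] - pi(k)
            score = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += probs[a] * (rewards[a] - baseline) * score
    return grad

# Hypothetical toy numbers, chosen only for illustration.
logits = [0.2, -0.5, 1.0]
rewards = [0.9, 0.3, -0.1]

expected_reward = sum(p * r for p, r in zip(softmax(logits), rewards))
g_no_baseline = expected_policy_gradient(logits, rewards, baseline=0.0)
g_mean_baseline = expected_policy_gradient(logits, rewards, baseline=expected_reward)
g_arbitrary = expected_policy_gradient(logits, rewards, baseline=123.0)

print(g_no_baseline)
print(g_mean_baseline)
print(g_arbitrary)
```

All three printed gradients agree up to floating-point error, because the score function has zero mean under the policy. The baseline changes the variance of a sampled estimate, not its expectation.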
One-line mental model: PPO uses the value function $V(s)$ as a baseline because $V(s) \approx \mathbb{E}_{\pi_\theta}[R \mid s]$. GRPO replaces $V(s)$ with the empirical Monte Carlo estimator of that same expectation: the mean reward of G samples drawn at $s$. The G samples are the value network.
A physical analogy: PPO carries an absolute thermometer that reads “true” expected reward at every state. GRPO carries no thermometer at all; instead it compares the temperature of G samples against each other and trusts that one of them is above average. The first is more informative when calibrated; the second is dramatically cheaper and cannot drift out of calibration because there is nothing to calibrate.
Deriving GRPO from PPO, One Substitution at a Time
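We start exactly where § 14.6 ended. PPO's RLHF objective on a single prompt is:
$$J_{\text{PPO}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat A_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon_{\text{clip}},\,1+\epsilon_{\text{clip}})\,\hat A_t\big) - \beta_{\text{KL}}\,\widehat{\mathrm{KL}}_t\Big]$$
where $r_t(\theta) = \pi_\theta(a_t \mid s_t)\,/\,\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the importance ratio, $\widehat{\mathrm{KL}}_t$ is the per-token KL penalty against the reference policy, and the advantage $\hat A_t$ is the GAE estimate built on top of a learned value head $V(s_t)$.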
Substitution one: replace GAE with a single trajectory-level reward. For LLM tasks the per-token reward is zero everywhere except at the EOS token, where it is the reward model's score $R$. The GAE estimate over such a sparse reward stream is dominated by that one terminal value; in the limit $\gamma = \lambda = 1$ it collapses to
$$\hat A_t = R - V(s_t)$$
which is just the Monte-Carlo return minus the value baseline. Already we have one clean expression for the advantage that does not need the GAE bookkeeping; we keep it.
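The collapse of GAE under a terminal-only reward can be verified directly. The following is a minimal sketch assuming the textbook GAE recursion with $\gamma = \lambda = 1$; the reward and the per-step value estimates are made-up numbers.

```python
def gae(rewards, values, gamma, lam):
    """Textbook GAE; the value after the final step is taken to be 0."""
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

R = 0.9                               # terminal reward (hypothetical)
rewards = [0.0, 0.0, 0.0, R]          # zero everywhere except at EOS
values = [0.40, 0.55, 0.20, 0.70]     # hypothetical V(s_t) from a critic

adv = gae(rewards, values, gamma=1.0, lam=1.0)
monte_carlo = [R - v for v in values]

print(adv)
print(monte_carlo)
```

The two printed lists match: with $\gamma = \lambda = 1$ the TD residuals telescope, so every per-token advantage is the terminal reward minus the value estimate at that token.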
Substitution two: replace the value baseline with the group mean. Drop the learned $V(s_t)$ entirely. In its place, sample G rollouts from the OLD policy on the same prompt and use the empirical mean of their rewards as the baseline:
$$A_i = R_i - \bar R, \qquad \bar R = \frac{1}{G}\sum_{j=1}^{G} R_j$$
This keeps the gradient direction unbiased: the baseline-subtraction theorem applies to any baseline that does not depend on the rollout being scored. Strictly, $\bar R$ includes $R_i$ itself, which shrinks the expected gradient by a factor of $(G-1)/G$ without changing its direction. It is also high variance: for a small G the empirical mean is a noisy estimator of the true expected reward. GRPO addresses the variance not by adding parameters back, but by standardising: divide by the group standard deviation.
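Substitution three: standardise by the group std. Compute the group standard deviation alongside the mean and scale:
$$A_i = \frac{R_i - \bar R}{\sigma_R + \varepsilon}, \qquad \sigma_R = \sqrt{\frac{1}{G}\sum_{j=1}^{G}\big(R_j - \bar R\big)^2}$$
where $\varepsilon$ is a small constant that guards against division by zero.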
Two things happen at once. First, the scale of the advantage is now per-prompt normalised: an $A_i = +1$ means “one std above the group mean for this prompt”, regardless of the numeric range this prompt's rewards happen to live in. Second, the loss magnitude across batches becomes comparable, which is what lets a single learning rate and a single KL coefficient work for prompts of wildly different difficulty.
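The per-prompt normalisation can be seen by rescaling a group's rewards and recomputing the advantages. This is a small sketch, not library code: the function name `group_advantages`, the floor `eps=1e-8`, and the affine rescaling are all assumptions made for illustration.

```python
import math

def group_advantages(rewards, eps=1e-8, use_std=True):
    """Centre by the group mean; optionally divide by the population std."""
    G = len(rewards)
    mean = sum(rewards) / G
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    scale = (std + eps) if use_std else 1.0
    return [(r - mean) / scale for r in rewards]

rewards = [0.9, 0.3, -0.1, 0.7]
rescaled = [100.0 * r + 5.0 for r in rewards]   # same ranking, different units

standardised_a = group_advantages(rewards)
standardised_b = group_advantages(rescaled)
centred_a = group_advantages(rewards, use_std=False)
centred_b = group_advantages(rescaled, use_std=False)

print(standardised_a)
print(standardised_b)   # essentially identical to the line above
print(centred_a)
print(centred_b)        # 100 times larger than centred_a
```

With the std division the two groups produce the same advantages (up to the tiny effect of the floor), while mean-centring alone leaves the second group's advantages 100 times larger.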
Substitution four: broadcast the scalar advantage to every token. The advantage we just defined is one number per rollout. The policy update needs one number per token. GRPO makes the simplest possible choice: copy the same $A_i$ to every token of rollout $i$:
$$A_{i,t} = A_i \quad \text{for } t = 1, \dots, T_i$$
This is a deliberately coarse credit assignment. A 200-token response gets credit for every token equally — the one tactically correct token gets the same advantage as 199 boilerplate tokens. PPO with GAE would give each token a different advantage based on local TD error; GRPO trades that resolution for the elimination of the value head. In practice, with large enough G and a reward signal that distinguishes responses on overall quality (not on token-level details), this trade pays off.
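In a padded batch the broadcast usually comes with a mask so that padding positions carry no advantage. The helper below is a hypothetical sketch of that bookkeeping using plain lists; the function name and the padding convention are assumptions.

```python
def broadcast_advantages(advantages, lengths, pad_to):
    """Copy each rollout's scalar advantage to its real tokens.

    Returns (A, mask), both shaped [G][pad_to]. Padding positions get
    advantage 0.0 and mask 0 so they drop out of any masked mean.
    """
    A, mask = [], []
    for a_i, T_i in zip(advantages, lengths):
        pad = pad_to - T_i
        A.append([a_i] * T_i + [0.0] * pad)
        mask.append([1] * T_i + [0] * pad)
    return A, mask

# Three rollouts of different lengths (hypothetical numbers).
A, mask = broadcast_advantages([1.0, -0.5, 0.25], lengths=[3, 1, 2], pad_to=3)
for row, m in zip(A, mask):
    print(row, m)
```

Every real token in a row sees the same scalar, which is exactly the coarse credit assignment described above.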
The Group-Relative Advantage
Collecting the substitutions, the GRPO advantage in its final form is one line:
$$A_{i,t} = \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}{\operatorname{std}(R_1, \dots, R_G) + \varepsilon}$$
Read it slowly. The numerator is “this rollout's reward minus the group mean”, the centred signal. The denominator is the group standard deviation (population formula: divide by G, not G−1), regularised by a small $\varepsilon$ to prevent division by zero on degenerate groups. The result is broadcast across all tokens of rollout $i$, so every response token sees the same scalar advantage.
| Property | PPO + GAE | GRPO |
|---|---|---|
| Per-token advantage source | TD residuals through a learned V(s_t) | Group-mean and group-std over G rollouts on the same prompt |
| Extra trainable parameters | ≈ N (a second model the size of the policy) | 0 |
| Per-token variance reduction | Strong — V(s_t) is per-state | Weaker — same scalar per token in a rollout |
| Sensitivity to reward scale across prompts | Depends on V's calibration | Eliminated by std normalisation |
| Rollouts needed per prompt | 1 (sometimes 2 for chosen/rejected RLHF) | G (typically 4 – 64) |
| Failure mode if rewards collapse | V learns a constant; gradient continues | std → 0 and R_i − R̄ → 0, so every advantage ≈ 0 and the gradient signal vanishes (see § engineering) |
Two consequences of this choice are worth naming up front. One: GRPO trades model parameters (the value head) for inference samples (the G rollouts). If your inference is fast and your reward is cheap to compute, GRPO is usually the better trade. If your inference is slow or your reward call is expensive, GRPO can be worse than PPO in wall-clock terms. Two: the std normalisation makes the gradient scale-free, which means the choice of $\beta_{\text{KL}}$ and the learning rate become more universal: DeepSeek-Math uses the same hyperparameters across math problems of vastly different reward distributions.
Importing PPO's Clip — Why GRPO Keeps the Ratio
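With the advantage defined, the rest of GRPO is PPO almost verbatim. Define the same importance ratio:
$$r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\text{old}}}(a_{i,t} \mid s_{i,t})}$$
and the same clipped surrogate, replacing PPO's per-token advantage $\hat A_t$ with GRPO's broadcast advantage $A_{i,t} = A_i$:
$$L^{\text{clip}}_{i,t}(\theta) = \min\big(r_{i,t}(\theta)\,A_i,\ \operatorname{clip}(r_{i,t}(\theta),\,1-\epsilon_{\text{clip}},\,1+\epsilon_{\text{clip}})\,A_i\big)$$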
Why keep the clip at all if the value head is gone? Because the ratio clip protects against the OTHER PPO failure mode: taking K gradient steps on the same rollout (the inner optimisation loop of PPO/GRPO) and having the policy drift far enough between steps that the importance correction breaks down. The clip prevents a runaway feedback loop in which a high-ratio token gets pushed higher on the next minibatch, raising its ratio further, and compounding. Even without a critic, that feedback loop is real; the clip is the cheap insurance against it.
Crucially, $\epsilon_{\text{clip}} = 0.2$ remains the right default. The PPO ablations that fixed this value were about ratio dynamics, not about whether a critic was present. Some recent variants (Dr. GRPO, OpenRLHF's REINFORCE++) experiment with other clip settings, but the differences are second-order: get the advantage and the KL right first, then think about $\epsilon_{\text{clip}}$.
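When the clip actually blocks the gradient depends on the sign of the advantage as well as the ratio. The sketch below enumerates the four combinations; the ratios 1.5, 0.7, 0.9 and 1.3 are hypothetical, and `clip_eps=0.2` is assumed as the clip range.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Return (surrogate value, whether gradient flows through the ratio)."""
    unclipped = ratio * advantage
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    clipped = clipped_ratio * advantage
    value = min(unclipped, clipped)
    # The clipped branch is constant in the ratio, so the gradient flows
    # only when the min selects the unclipped branch.
    grad_flows = unclipped <= clipped
    return value, grad_flows

cases = [
    ("A > 0, ratio above range", 1.5, +1.171),
    ("A > 0, ratio below range", 0.7, +1.171),
    ("A < 0, ratio inside range", 0.9, -1.431),
    ("A < 0, ratio below range", 0.7, -1.431),
    ("A < 0, ratio above range", 1.3, -1.431),
]
for label, ratio, adv in cases:
    value, flows = clipped_surrogate(ratio, adv)
    print(f"{label:28s} value={value:+.4f} gradient_flows={flows}")
```

The gradient is cut in exactly two situations: a rewarded token whose ratio has already risen past the upper edge, and a penalised token whose ratio has already fallen past the lower edge. In every other case the update is free to continue.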
Anchoring to the Reference: The k3 KL Term
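Per-step drift is bounded by the clip; cumulative drift over thousands of steps is bounded by a separate per-token KL penalty against the frozen reference (SFT) policy. GRPO keeps PPO's formulation here, including the k3 unbiased estimator:
$$\widehat{\mathrm{KL}}_{i,t} = \rho_{i,t} - \log\rho_{i,t} - 1$$
where $\rho_{i,t} = \pi_{\text{ref}}(a_{i,t} \mid s_{i,t})\,/\,\pi_\theta(a_{i,t} \mid s_{i,t})$. The k3 estimator is always non-negative, unbiased in expectation, and has lower variance than the naive $-\log\rho_{i,t}$ estimator that early policy-gradient code used. It costs nothing extra to compute: the ingredients are the same $\log\pi_\theta$ and $\log\pi_{\text{ref}}$ that the naive estimator needs, and $\log\pi_\theta$ is already computed for the ratio.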
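Both properties of the k3 estimator can be checked exactly on a small discrete distribution. The sketch below assumes two hypothetical three-token distributions for the live policy and the reference, and compares the naive estimator with k3 sample by sample and in expectation under the live policy.

```python
import math

def k1(logp_theta, logp_ref):
    """Naive per-sample estimator: log pi_theta - log pi_ref."""
    return logp_theta - logp_ref

def k3(logp_theta, logp_ref):
    """k3 per-sample estimator: rho - log(rho) - 1 with rho = pi_ref / pi_theta."""
    log_rho = logp_ref - logp_theta
    return math.exp(log_rho) - log_rho - 1.0

# Hypothetical next-token distributions over a 3-token vocabulary.
pi_theta = [0.7, 0.2, 0.1]
pi_ref = [0.5, 0.3, 0.2]

true_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
per_sample_k1 = [k1(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref)]
per_sample_k3 = [k3(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref)]
expected_k1 = sum(p * v for p, v in zip(pi_theta, per_sample_k1))
expected_k3 = sum(p * v for p, v in zip(pi_theta, per_sample_k3))

print("true KL    :", true_kl)
print("E[k1], E[k3]:", expected_k1, expected_k3)
print("k1 samples :", per_sample_k1)   # some entries are negative
print("k3 samples :", per_sample_k3)   # every entry is >= 0
```

Both estimators recover the true KL in expectation, but only k3 is non-negative for every individual token, which is what keeps the per-token penalty from ever acting as a bonus.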
The coefficient $\beta_{\text{KL}}$ tends to be higher in GRPO than in PPO-RLHF: DeepSeek-Math uses $\beta_{\text{KL}} = 0.04$, roughly double the PPO-RLHF default of 0.02. The reason: in PPO the value head provides indirect smoothing of the policy (a noisy value estimate produces noisy advantages which damp the policy gradient); GRPO has no such smoothing, so the KL term has to do more anchoring on its own.
| β_KL | Behaviour in GRPO |
|---|---|
| 0 | No anchor. Policy escapes to reward-hacked nonsense in 1–5k steps. |
| 0.01 – 0.02 | Aggressive. OK if the reward is robust (a strong verifier). |
| 0.04 (DeepSeek-Math default) | Standard. Realised KL ≈ 10 – 30 nats per response. |
| 0.05 – 0.1 | Conservative. Works for noisy or partial reward signals. |
| >0.5 | Policy barely moves. Use only when reward is very suspect. |
Putting It Together: The Full GRPO Objective
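The complete GRPO loss for one prompt's group is the weighted sum of the clipped policy objective and the KL penalty, averaged over tokens within each rollout and then over rollouts within the group:
$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\Big[\min\big(r_{i,t}(\theta)\,A_i,\ \operatorname{clip}(r_{i,t}(\theta),\,1-\epsilon_{\text{clip}},\,1+\epsilon_{\text{clip}})\,A_i\big) - \beta_{\text{KL}}\,\widehat{\mathrm{KL}}_{i,t}\Big]$$
Read the three layers of averaging from the inside out. The innermost is the per-token term combining the clipped surrogate and the KL penalty. The middle is $\frac{1}{T_i}\sum_{t=1}^{T_i}$: average over the response tokens of rollout $i$. The outermost is $\frac{1}{G}\sum_{i=1}^{G}$: average over rollouts in the group. The leading minus turns the objective into a loss that gradient descent will minimise.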
The double-mean (sequence-mean then group-mean) is a deliberate bookkeeping choice that gives every rollout equal weight in the final loss regardless of its length. It also introduces a length-bias artefact that the Dr. GRPO paper analyses in detail (short responses get effectively up-weighted relative to long ones because a small denominator inflates the contribution of every token in that rollout). The principal variants in practice are:
| Variant | Token aggregation | Effect |
|---|---|---|
| Vanilla GRPO (DeepSeek-Math) | mean over tokens, then mean over rollouts | Length-bias: shorter rollouts contribute more per token than longer ones. |
| Dr. GRPO | sum over tokens, divide by a fixed max-length constant | Removes length-bias by giving every token equal weight; common in math RL recipes since 2025. |
| REINFORCE++ / OpenRLHF | sum over tokens, no per-sequence normalisation | Closer to vanilla REINFORCE; advantages must be carefully scaled. |
For the rest of this section we use vanilla GRPO — the formula above with sequence-mean then group-mean — because it is what DeepSeek-Math, DeepSeek-R1, and Qwen-2-Math ship by default. The engineering subsection below documents the length-bias trade-off you accept by choosing it.
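The three aggregation rules in the table differ only in the weight each token receives. The sketch below makes that weight explicit for a short and a long rollout; the mode names `seq_mean`, `fixed_len` and `token_sum`, and the constant `max_len=8`, are labels invented here for illustration.

```python
def aggregate(token_losses, mode, max_len=None):
    """Reduce ragged per-token losses [G][T_i] to one scalar."""
    G = len(token_losses)
    if mode == "seq_mean":        # vanilla GRPO: mean over tokens, then over rollouts
        per_seq = [sum(row) / len(row) for row in token_losses]
    elif mode == "fixed_len":     # Dr. GRPO style: sum over tokens / fixed constant
        per_seq = [sum(row) / max_len for row in token_losses]
    elif mode == "token_sum":     # plain sum over tokens, no per-sequence normalisation
        per_seq = [sum(row) for row in token_losses]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sum(per_seq) / G

def per_token_weight(length, G, mode, max_len=None):
    """Weight that one token of a rollout of this length gets in the final loss."""
    denom = {"seq_mean": length, "fixed_len": max_len, "token_sum": 1}[mode]
    return 1.0 / (G * denom)

# Two rollouts with the same per-token loss but different lengths.
token_losses = [[-1.0] * 2, [-1.0] * 8]
for mode in ("seq_mean", "fixed_len", "token_sum"):
    total = aggregate(token_losses, mode, max_len=8)
    w_short = per_token_weight(2, 2, mode, max_len=8)
    w_long = per_token_weight(8, 2, mode, max_len=8)
    print(f"{mode:10s} loss={total:+.4f} w_short={w_short:.4f} w_long={w_long:.4f}")
```

Under `seq_mean` a token in the 2-token rollout carries four times the weight of a token in the 8-token rollout; under the other two rules every token carries the same weight, and only the overall scale differs.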
Manual Numerical Walkthrough
We trace one full GRPO update by hand for G = 4 rollouts of T = 3 tokens each, all from the same prompt. Numbers small enough to check on paper; structure identical to a production step.
Step 1: the group of rollouts
Four rollouts have been sampled from the old policy on a single prompt. Each rollout consists of three response tokens; we list the action sequences and the log-probabilities the old policy assigned to them, plus the scalar reward each response received from the reward model (or verifier).
| rollout i | tokens (a_1, a_2, a_3) | old log π (sum over tokens) | reward R_i |
|---|---|---|---|
| 1 | (0, 2, 1) | −3.22 | +0.90 |
| 2 | (1, 0, 2) | −2.99 | +0.30 |
| 3 | (2, 1, 0) | −2.85 | −0.10 |
| 4 | (0, 1, 2) | −3.15 | +0.70 |
Step 2: the group baseline and standard deviation
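Mean of the four rewards: $\bar R = (0.90 + 0.30 - 0.10 + 0.70)/4 = 0.45$.
Population variance: $\sigma_R^2 = (0.45^2 + 0.15^2 + 0.55^2 + 0.25^2)/4 = 0.59/4 = 0.1475$.
Standard deviation: $\sigma_R = \sqrt{0.1475} \approx 0.384$.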
Step 3: the per-rollout advantages
Subtract the mean and divide by the std (with the floor $\varepsilon$ in the denominator):
| rollout i | R_i | R_i − R̄ | A_i = (R_i − R̄) / σ_R |
|---|---|---|---|
| 1 | +0.90 | +0.450 | +1.171 |
| 2 | +0.30 | −0.150 | −0.390 |
| 3 | −0.10 | −0.550 | −1.431 |
| 4 | +0.70 | +0.250 | +0.651 |
Two checks worth doing in your head: $\sum_i A_i = 1.171 - 0.390 - 1.431 + 0.651 = 0.001 \approx 0$ (round-off). And the std of the advantages is $\approx 1$. Both are guaranteed by the standardisation, but worth confirming once to internalise that the advantage scale is per-prompt comparable.
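Steps 2 and 3 take a few lines with Python's `statistics` module. This sketch leaves out the small floor in the denominator, so the last printed digit can differ slightly from the table.

```python
import statistics

rewards = [0.90, 0.30, -0.10, 0.70]        # the four rewards from Step 1

mean = statistics.fmean(rewards)           # group baseline
std = statistics.pstdev(rewards)           # population std (divide by G)
advantages = [(r - mean) / std for r in rewards]

print(f"mean={mean:.4f} var={std ** 2:.4f} std={std:.4f}")
print([round(a, 3) for a in advantages])

# The two sanity checks: zero mean and unit population std.
print(f"sum={sum(advantages):.2e} std={statistics.pstdev(advantages):.6f}")
```

Computed without intermediate rounding, the advantages sum to zero up to floating-point error, which confirms that the small residual in the hand calculation comes from rounding the table entries.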
Step 4: broadcast to tokens
Every token of rollout $i$ receives the same scalar advantage $A_i$. The advantage tensor is now of shape $(G, T) = (4, 3)$ with each row constant:
| rollout i | A_{i,1} | A_{i,2} | A_{i,3} |
|---|---|---|---|
| 1 | +1.171 | +1.171 | +1.171 |
| 2 | −0.390 | −0.390 | −0.390 |
| 3 | −1.431 | −1.431 | −1.431 |
| 4 | +0.651 | +0.651 | +0.651 |
Step 5: the policy ratio at one token
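Take token $(i, t) = (1, 1)$, the first token of rollout 1. The old policy assigned it log-probability $\log\pi_{\theta_{\text{old}}}(a_{1,1} \mid s_{1,1})$. The live policy (after one gradient step) now assigns it $\log\pi_\theta(a_{1,1} \mid s_{1,1})$ (computed from the new logits via log-softmax; we will see this exact computation in the code below).
Log ratio: $\log r_{1,1} = \log\pi_\theta(a_{1,1} \mid s_{1,1}) - \log\pi_{\theta_{\text{old}}}(a_{1,1} \mid s_{1,1})$.
Ratio: $r_{1,1} = \exp(\log r_{1,1})$. In this example it lands above $1.2$, way outside the trust region $[1 - \epsilon_{\text{clip}},\ 1 + \epsilon_{\text{clip}}] = [0.8, 1.2]$.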
Step 6: clipped surrogate at this one token
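The two surrogates: the unclipped $r_{1,1}\,A_1$ and the clipped $\operatorname{clip}(r_{1,1}, 0.8, 1.2)\,A_1 = 1.2 \times 1.171 \approx 1.405$. Because $r_{1,1} > 1.2$ and $A_1 > 0$, the min picks the smaller, clipped value: $1.405$. The clip fired, so the gradient through $r_{1,1}$ at this token is zero: we have already pushed this rewarded token's probability far enough; do not push further.
Compare to rollout 3's first token, where $A_3 = -1.431$. Suppose its ratio after one step is $r_{3,1}$, below 1 but still inside $[0.8, 1.2]$. Then the unclipped and clipped surrogates are both $r_{3,1}\,A_3$, and the min picks that shared value. The clip does NOT fire: gradient flows fully because the policy is moving in the rewarded direction (decreasing a bad action's probability) and has not yet left the trust region. Only once $r_{3,1}$ drops below $0.8$ would the min select the clipped branch and stop the push. This is exactly the PPO behaviour from § 14.6, carried over into GRPO unchanged.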
Step 7: per-token KL at the same first token
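The reference model assigned this token log-probability $\log\pi_{\text{ref}}(a_{1,1} \mid s_{1,1})$. Then $\log\rho_{1,1} = \log\pi_{\text{ref}}(a_{1,1} \mid s_{1,1}) - \log\pi_\theta(a_{1,1} \mid s_{1,1})$. The k3 estimator gives $\widehat{\mathrm{KL}}_{1,1} = \rho_{1,1} - \log\rho_{1,1} - 1 \ge 0$. With $\beta_{\text{KL}} = 0.04$, the KL contribution at this token is $0.04 \times \widehat{\mathrm{KL}}_{1,1}$.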
Step 8: per-token loss and aggregation
Per-token loss at $(1, 1)$: $\ell_{1,1} = -\big(1.405 - 0.04\,\widehat{\mathrm{KL}}_{1,1}\big)$. Negative because the clipped surrogate is positive (rewarded action) and dominates the small positive KL.
Repeating this calculation across all tokens (every token of rollouts 1 and 4 yields a negative loss with magnitude bounded by the clip; rollouts 2 and 3 yield positive losses — the model is penalised for sampling them and will reduce their probability), averaging within each rollout, then averaging across rollouts, produces the scalar loss that gets backpropped through the policy.
The walkthrough headline: the entire GRPO update is computable from the four rewards, the per-token old-policy log-probs, the per-token reference log-probs, and the new policy's per-token log-probs. No value head, no MSE target, no GAE bookkeeping. The advantage is identical for every token within a rollout; what varies from token to token is only the ratio and the KL term. That is the entire algorithm.
Interactive Visualization: The Group Baseline
Drag the reward sliders below. Each row is one rollout from the same prompt; the live readouts show the group mean, the group std, and the per-rollout advantage. Toggle divide by std (vanilla GRPO) off to see the unscaled baseline-only variant: the colours stay the same (the sign of each advantage is unchanged) but the magnitudes now follow the raw reward scale. Drag a single reward up until it dominates: every other rollout's advantage flips red because the group mean shifts upward by that single change.
Two experiments worth running in the widget. First: set all four rewards equal (say all to 0.5). The std collapses to zero, every numerator $R_i - \bar R$ is zero, only the $\varepsilon$ floor is left in the denominator, so every advantage is zero and the gradient signal vanishes. This is the “std collapse” failure mode and is exactly what happens in a real GRPO run when a binary verifier returns all-correct or all-wrong for a group. Recovery: raise sampling temperature so the rollouts diversify. Second: set one reward to +1 and the other three to 0. The +1 rollout gets a strongly positive advantage, the other three get equal small negative advantages, and the gradient will push the policy hard toward the +1 rollout. This is the canonical “one winner in a group of G” pattern that DeepSeek-R1 reasoning training operates in for hours at a time.
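The two experiments can also be run offline. This is a small sketch that assumes a floor of `eps=1e-8`; the function is a stand-in for whatever the widget computes, not its actual code.

```python
import math

def group_advantages(rewards, eps=1e-8):
    G = len(rewards)
    mean = sum(rewards) / G
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    return [(r - mean) / (std + eps) for r in rewards], std

# Experiment 1: every rollout gets the same reward.
collapsed, collapsed_std = group_advantages([0.5, 0.5, 0.5, 0.5])
print(collapsed_std, collapsed)      # std is 0 and every advantage is 0

# Experiment 2: one winner in a group of four.
winner, winner_std = group_advantages([1.0, 0.0, 0.0, 0.0])
print(winner_std, [round(a, 3) for a in winner])
```

In the second experiment the winner's advantage is close to $\sqrt{3}$ and each loser's is close to $-1/\sqrt{3}$; with a population std, $\sqrt{G-1}$ is the largest magnitude a single advantage can reach in a group of size G.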
Plain Python: One GRPO Step From Scratch
We implement the GRPO loss in pure NumPy on a toy group: G = 4 rollouts, T = 3 tokens, V = 3 vocabulary entries. No autograd, no transformer — just the arithmetic an RL trainer runs every minibatch. Swap the hard-coded logits for a transformer forward pass and the code is unchanged.
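A minimal sketch of such a step follows, written with only the Python standard library (plain lists in place of NumPy arrays) so it runs anywhere. The tokens and rewards come from the walkthrough table; the logits are hypothetical, the reference is assumed equal to the old policy, and `clip_eps=0.2`, `beta=0.04` and `std_eps=1e-8` are assumed defaults.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def token_logps(logits, tokens):
    """logits [G][T][V], tokens [G][T] -> log-prob of each sampled token [G][T]."""
    return [[log_softmax(logits[i][t])[tok] for t, tok in enumerate(seq)]
            for i, seq in enumerate(tokens)]

def group_advantages(rewards, std_eps=1e-8):
    G = len(rewards)
    mean = sum(rewards) / G
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    return [(r - mean) / (std + std_eps) for r in rewards]

def grpo_loss(new_lp, old_lp, ref_lp, adv, clip_eps=0.2, beta=0.04):
    seq_losses = []
    for i, a_i in enumerate(adv):
        tok_losses = []
        for lp, lp_old, lp_ref in zip(new_lp[i], old_lp[i], ref_lp[i]):
            ratio = math.exp(lp - lp_old)
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            surrogate = min(ratio * a_i, clipped * a_i)
            log_rho = lp_ref - lp
            kl = math.exp(log_rho) - log_rho - 1.0           # k3 estimator
            tok_losses.append(-(surrogate - beta * kl))
        seq_losses.append(sum(tok_losses) / len(tok_losses))  # sequence mean
    return sum(seq_losses) / len(seq_losses)                  # group mean

G, T, V = 4, 3, 3
tokens = [(0, 2, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2)]
rewards = [0.90, 0.30, -0.10, 0.70]

# Hypothetical logits; a transformer forward pass would produce these.
old_logits = [[[0.1 * ((i + 2 * t + 3 * v) % 5) for v in range(V)]
               for t in range(T)] for i in range(G)]
adv = group_advantages(rewards)

# Emulate one small update: nudge each sampled token's logit in the
# direction of its rollout's advantage.
new_logits = [[list(row) for row in seq] for seq in old_logits]
for i, seq in enumerate(tokens):
    for t, tok in enumerate(seq):
        new_logits[i][t][tok] += 0.1 if adv[i] > 0 else -0.1

old_lp = token_logps(old_logits, tokens)
ref_lp = old_lp                      # assumption: reference equals old policy
loss_start = grpo_loss(old_lp, old_lp, ref_lp, adv)
loss_after = grpo_loss(token_logps(new_logits, tokens), old_lp, ref_lp, adv)
print(loss_start, loss_after)
```

Before any update the ratio is 1 and the KL is 0 everywhere, so the loss is just the negated mean advantage, which is zero. After the nudge the loss becomes negative, meaning the surrogate objective went up.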
PyTorch: A Production-Shaped GRPO Step
Now the full thing: clipped surrogate, per-token KL to a reference model, gradient clipping, sequence-mean then batch-mean, and the four diagnostics that tell you whether the run is healthy. This has the same shape as the inner loop inside OpenRLHF's GRPO trainer and TRL's GRPOTrainer, and as the DeepSeek-R1 recipe as far as it has been publicly described. Drop in any HF causal LM for policy and any frozen copy of the SFT model for ref_policy and this trains a real GRPO run.
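The health diagnostics are cheap to compute from the same per-token log-probs the loss already uses. The sketch below is a framework-free stand-in, not the PyTorch listing: it uses the metric names that appear later in this section (clipfrac, kl_ref, adv_var) plus a mean ratio, and the exact definitions are assumptions. Here clipfrac counts tokens where the clipped branch is active, and kl_ref is the k3 estimate summed over each response and averaged over rollouts.

```python
import math

def grpo_diagnostics(new_lp, old_lp, ref_lp, adv, clip_eps=0.2):
    n_tokens, n_clipped, ratio_sum = 0, 0, 0.0
    kl_per_response = []
    for i, a_i in enumerate(adv):
        response_kl = 0.0
        for lp, lp_old, lp_ref in zip(new_lp[i], old_lp[i], ref_lp[i]):
            ratio = math.exp(lp - lp_old)
            too_high = ratio > 1.0 + clip_eps and a_i > 0
            too_low = ratio < 1.0 - clip_eps and a_i < 0
            n_clipped += int(too_high or too_low)
            ratio_sum += ratio
            log_rho = lp_ref - lp
            response_kl += math.exp(log_rho) - log_rho - 1.0
            n_tokens += 1
        kl_per_response.append(response_kl)
    mean_adv = sum(adv) / len(adv)
    return {
        "clipfrac": n_clipped / n_tokens,
        "kl_ref": sum(kl_per_response) / len(kl_per_response),
        "adv_var": sum((a - mean_adv) ** 2 for a in adv) / len(adv),
        "ratio_mean": ratio_sum / n_tokens,
    }

# Hypothetical two-rollout, two-token batch.
adv = [1.0, -1.0]
old_lp = [[-1.0, -1.0], [-1.0, -1.0]]
new_lp = [[-1.0 + math.log(1.5), -1.0], [-1.0, -1.0 + math.log(0.5)]]
stats = grpo_diagnostics(new_lp, old_lp, ref_lp=old_lp, adv=adv)
print(stats)
```

In this toy batch two of the four tokens have left the trust region in the direction their advantage was pushing, so clipfrac is 0.5, the threshold the engineering notes treat as a warning sign.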
At Massive Scale: What Happens at 70B+
The same code scales to a 70B GRPO run, but several forces change in character.
Memory: the value head was half the cost
At 70B, each model is ~140 GB in bf16. PPO needs four such models (policy, value, ref, reward) plus optimiser state for the two trainable ones: 4 × 140 GB + 2 × 840 GB ≈ 2.2 TB before activations. GRPO needs two (policy, ref) plus optimiser state for just the policy: 2 × 140 GB + 840 GB ≈ 1.1 TB. Cutting the trainable-model count from 2 to 1 is the savings that matters: AdamW state is ~12 bytes per trainable parameter (fp32 master copy plus fp32 m and v), so eliminating the value head's optimiser state alone saves ~840 GB. This is what makes 70B GRPO runs feasible on the same hardware that struggles with 70B PPO-RLHF.
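The accounting is simple enough to script. The sketch below counts only bf16 weights for every resident model and AdamW state for the trainable ones; gradients and activations are deliberately left out, and the function name and defaults are illustrative assumptions.

```python
def resident_memory_gb(n_params, n_models, n_trainable,
                       weight_bytes=2, optimizer_bytes=12):
    """Weights for every resident model plus AdamW state for trainable ones.

    weight_bytes=2 assumes bf16 weights; optimizer_bytes=12 assumes an fp32
    master copy plus fp32 first and second moments.
    """
    weights = n_models * n_params * weight_bytes
    optimizer = n_trainable * n_params * optimizer_bytes
    return (weights + optimizer) / 1e9

N = 70e9
ppo_gb = resident_memory_gb(N, n_models=4, n_trainable=2)    # policy, value, ref, reward
grpo_gb = resident_memory_gb(N, n_models=2, n_trainable=1)   # policy, ref
print(f"PPO  ~ {ppo_gb:,.0f} GB")
print(f"GRPO ~ {grpo_gb:,.0f} GB")
print(f"saved ~ {ppo_gb - grpo_gb:,.0f} GB")
```

Changing `n_params` or the byte counts shows how the gap scales: the optimiser state of the second trainable model is the dominant term at any size.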
The rollout-vs-train balance shifts further toward rollout
PPO already spent ~85% of wall-clock time on rollout (§ 14.6). GRPO multiplies the rollout cost by G — for G = 16 a single prompt now demands 16 full responses. The optimisation phase, on the other hand, is cheaper because there is no value-head gradient and no value-loss computation. Production GRPO systems push the rollout fraction to 90%+ and architect around that: dedicated vLLM rollout clusters, async weight syncs, and batched-prompt servers that emit all G rollouts of a prompt in one forward call. The architectural choice that PPO made under pressure (separate rollout and training clusters) becomes the only viable choice in GRPO.
Distributed group statistics: an all-reduce per group
If the G rollouts of a single prompt are split across FSDP ranks, the group mean and std must be computed with an all-reduce within each group's rank set, not over the global batch. In practice every production GRPO collator keeps each prompt's G rollouts on a single rank so the group statistics are a local operation — the “contiguous-group” layout rule. Get this wrong and you compute statistics over a mixture of prompts, which is mathematically a different and much worse algorithm. The first thing to assert in any custom GRPO collator is prompt_id[i*G:(i+1)*G].unique().numel() == 1 for every group.
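The contiguous-group rule is easy to enforce before any statistics are computed. Here is a framework-free sketch of that check together with the per-group statistics it protects; the function names are hypothetical and plain lists stand in for tensors.

```python
import math

def check_contiguous_groups(prompt_ids, G):
    """Raise if any block of G consecutive rows mixes prompts."""
    if len(prompt_ids) % G != 0:
        raise ValueError("batch size must be a multiple of G")
    for start in range(0, len(prompt_ids), G):
        group = prompt_ids[start:start + G]
        if len(set(group)) != 1:
            raise ValueError(f"rows {start}..{start + G - 1} mix prompts: {group}")

def per_group_mean_std(rewards, G):
    """Mean and population std for each contiguous block of G rewards."""
    stats = []
    for start in range(0, len(rewards), G):
        block = rewards[start:start + G]
        mean = sum(block) / G
        std = math.sqrt(sum((r - mean) ** 2 for r in block) / G)
        stats.append((mean, std))
    return stats

prompt_ids = [7, 7, 9, 9]            # two prompts, G = 2 rollouts each
rewards = [1.0, 0.0, 0.2, 0.4]
check_contiguous_groups(prompt_ids, G=2)
print(per_group_mean_std(rewards, G=2))
```

An interleaved layout such as `[7, 9, 7, 9]` fails the check immediately, which is far cheaper than discovering the bug from a degrading eval curve.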
Reward bottleneck disappears when the reward is a verifier
For DeepSeek-R1 the reward is not a learned model: it is a deterministic verifier (math answer checker, code executor, JSON validator). That verifier typically runs on CPU at a tiny fraction of the cost of a model forward pass, with no GPU memory at all. The “reward model bottleneck” that defines PPO-RLHF (the fourth 70B model that must be queried per rollout) simply vanishes. This is why GRPO + verifier is the dominant recipe for reasoning RL in 2024-2025: the cost structure is fundamentally simpler than PPO-RLHF, and the per-example signal-to-noise is higher because a deterministic verifier is much harder to reward-hack than a learned reward model.
Sampling temperature becomes a load-bearing hyperparameter
In PPO, sampling temperature affects exploration but not the algorithm's stability. In GRPO it affects both. If temperature is too low, every rollout in a group is nearly identical, the reward variance collapses, the std $\sigma_R \to 0$, and the gradient signal vanishes. If temperature is too high, rollouts become garbage, rewards are uniformly low, and the group again carries almost no signal to learn from. The DeepSeek-Math sweet spot is a moderate temperature range with a small top-p cutoff; every production GRPO trainer logs adv_var per step and is prepared to bump temperature if it collapses.
Engineering Reality: Length Bias, Std Collapse, and Other Knobs
After a season of GRPO runs every team converges on the same set of failure modes and the same set of fixes. The list is shorter than PPO's because the algorithm has fewer moving parts.
- Std collapse. Symptom: training stalls and adv_var drops to ~1e-6 in the diagnostics. Cause: every rollout in a group received the same reward (binary verifier returns all-correct or all-wrong), so $\sigma_R = 0$ and every advantage collapses to zero, with only the $\varepsilon$ floor left in the denominator. Fix: raise sampling temperature so the rollouts diversify; or use a richer reward (partial credit instead of binary) so equal scores are rarer; or filter out collapsed groups entirely from the loss (most production trainers do this).
- Length bias. Symptom: the policy converges to short, terse responses regardless of prompt. Cause: vanilla GRPO's sequence-mean normalisation upweights short rollouts in the loss; a short rewarded rollout contributes more per-token gradient than a long rewarded one, so the policy learns to prefer brevity. The same normalisation also under-penalises long negative-advantage rollouts, so wrong answers can drift longer. Fix: switch to Dr. GRPO (sum over tokens divided by a fixed max-length constant); or add an explicit length-target reward.
- KL exploding. Symptom: kl_ref grows above 100 nats per response and never recovers, eval quality collapses. Cause: $\beta_{\text{KL}}$ is too small for the chosen reward signal. Fix: raise $\beta_{\text{KL}}$ to 0.05 – 0.1 and restart from a pre-collapse checkpoint. Many teams use an adaptive controller that adjusts $\beta_{\text{KL}}$ to hold realised KL at a target value (e.g. 20 nats).
- KL stuck near zero. Symptom: reward barely moves, kl_ref stays under 1 nat. Cause: policy is not changing. Either LR is too low, or grad_norm is being clipped to near zero, or there is a reference-vs-policy aliasing bug (both modules point to the same weights). Fix: assert ref_policy is not policy, check that corresponding parameters of the two models do not share storage (compare p.data_ptr() across them; comparing id(...state_dict()) proves nothing, because state_dict() returns a fresh dict on every call), and assert that the reference's parameters have requires_grad=False.
- Group layout wrong. Symptom: reward seems to go up but eval gets worse. Cause: the batch is not laid out contiguously by prompt, so the group statistics are computed over a mixture of prompts. Fix: explicitly enforce contiguous-group layout in the collator and assert that prompt_id[i*G:(i+1)*G] is a single value for every group index i.
- Clipfrac too high. Symptom: clipfrac > 0.5 per step. Cause: the policy is moving too fast, usually too many GRPO epochs per rollout (K > 2) or LR too high. Fix: drop K to 1 (DeepSeek-R1 uses a single epoch) or halve LR. The clip is doing its job, but you are wasting compute on gradients that are being clipped to zero.
- Rollout decoding mismatch. Symptom: ratio distribution centred far from 1.0 on the very first step before any gradient has been taken. Cause: the rollout used a sampler that the trainer cannot reproduce; top-p, temperature, or stop-token logic differs between vLLM and the live forward pass. Fix: the trainer does not need to re-sample anything, it only needs to re-score the sampled tokens, so ensure its logit transform (temperature scaling, masking, stop-token handling) matches what the rollout engine applied. TRL, OpenRLHF, and DeepSpeed-Chat all have explicit code for this.
- Memory blowup on the reference model. Symptom: OOM after enabling reference KL. Cause: holding two full bf16 copies of a 70B policy on the same GPUs. Fix: shard the reference with FSDP or place it on CPU and run reference forwards once per rollout (it is not in the gradient path, so the latency hit is tolerable). LoRA tricks that share weights between the policy and the reference (with a different adapter for each) are increasingly common.
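Two of the fixes in the list above, dropping collapsed groups and adapting the KL coefficient, fit in a few lines each. Both helpers below are hypothetical sketches: the threshold `min_std`, the proportional update rule, and its `step_size` and `max_err` constants are assumptions in the style of earlier PPO-RLHF controllers, not values taken from any specific trainer.

```python
import math

def informative_groups(rewards, G, min_std=1e-6):
    """Indices of groups whose reward std is large enough to carry signal."""
    keep = []
    for g, start in enumerate(range(0, len(rewards), G)):
        block = rewards[start:start + G]
        mean = sum(block) / G
        std = math.sqrt(sum((r - mean) ** 2 for r in block) / G)
        if std > min_std:
            keep.append(g)
    return keep

def update_beta(beta, observed_kl, target_kl, step_size=0.1, max_err=0.2):
    """Proportional controller: raise beta when KL overshoots, lower it otherwise."""
    error = observed_kl / target_kl - 1.0
    error = max(-max_err, min(max_err, error))   # bound the per-step change
    return beta * (1.0 + step_size * error)

# Three groups of G = 4 from a binary verifier: all-correct, mixed, all-wrong.
rewards = [1, 1, 1, 1,  0, 1, 0, 1,  0, 0, 0, 0]
print(informative_groups(rewards, G=4))          # only the mixed group survives

beta = 0.04
print(update_beta(beta, observed_kl=40.0, target_kl=20.0))   # KL too high: beta rises
print(update_beta(beta, observed_kl=10.0, target_kl=20.0))   # KL too low: beta falls
```

Filtering changes the effective batch size from step to step, so it is worth logging how many groups survive alongside the other diagnostics.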
The mental model that unifies the section: GRPO is PPO with one substitution. Replace the value head with the group mean. Replace the value-head's normalisation with the group std. Everything else (the importance ratio, the clipped surrogate, the k3 KL to a reference) survives unchanged. Two models live in memory instead of four. The cost moves from per-step trainable parameters to per-prompt inference samples. For LLM tasks where the per-prompt rollout is cheap (vLLM, async batching) and the reward is a verifier (no learned reward model), GRPO is usually the better choice than PPO. For everything else it is a one-line decision: how much variance can you tolerate in exchange for a free 70B model?