The Real Problem: PPO's Critic Is the Bottleneck
In Section 15.2 we derived GRPO as PPO with the value head replaced by an in-group baseline. In Section 15.3 we pinned the hyperparameters DeepSeek used for R1. We now turn theory into code. The question is narrow and practical: given a rollout of G responses per prompt with a scalar reward each, what is the exact tensor program that updates the policy?
The four-model dance of PPO — policy, value head, reference, reward — is a memory monster at scale. At 70B parameters each model is roughly 140 GB in bf16. The value head alone, plus its gradients, plus its AdamW state, costs the same as a second policy replica. For DeepSeek V3 at 671B the value head was simply not affordable on the cluster they had. GRPO is the answer that fell out of that constraint, and its implementation is what makes the savings real.
Intuition: Compare Each Response to Its Siblings
Picture a class of four students answering the same hard math problem. Each turns in a different attempt. You grade them and obtain rewards $r_1, r_2, r_3, r_4$. How do you tell each student whether their answer was good?
You could grade against an absolute rubric — this is PPO with a learned value head, where the rubric is a neural net that predicts “what reward would I expect from a typical attempt at this problem?” The trouble is that the rubric itself has to be learned, and at 70B+ the rubric is as expensive as the student.
Or you can grade relative to the rest of the class: this is GRPO. The baseline for student $i$ is the average reward of the whole class, student $i$ included. If you outscored the class average you did well; if you fell below it you did poorly; if everyone scored the same, the class learned nothing from this problem and we move on.
The GRPO Objective in Equations
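Given a prompt $q$ and a group of $G$ responses $\{o_1, \dots, o_G\}$ sampled from the old policy $\pi_{\theta_{\text{old}}}$, with scalar rewards $r_1, \dots, r_G$, the group-relative advantage of response $o_i$ is
$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$
Here $A_i$ is a SINGLE scalar broadcast to every token of $o_i$: for any token at position $t$ inside response $o_i$, the advantage used in the gradient is $A_{i,t} = A_i$. There is no per-token credit assignment beyond what the PPO clip already does.
The per-token probability ratio is the standard off-policy correction:
$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$$
The clipped surrogate is PPO's, applied per token:
$$L^{\text{clip}}_{i,t}(\theta) = \min\Big( r_{i,t}(\theta)\, A_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, A_i \Big)$$
The per-token KL penalty against the frozen reference policy $\pi_{\text{ref}}$ uses Schulman's k3 estimator, which is unbiased and sample-wise non-negative:
$$D^{\text{KL}}_{i,t} = \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1$$
Putting it together, vanilla GRPO maximises (equivalently, minimises the negative of)
$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big( L^{\text{clip}}_{i,t}(\theta) - \beta\, D^{\text{KL}}_{i,t} \Big) \right]$$
Symbols, in order: $q$ the prompt; $G$ the group size; $o_i$ the $i$-th sampled response; $r_i$ its scalar reward; $A_i$ its group-relative advantage; $r_{i,t}(\theta)$ the per-token importance ratio of the live vs. old policy (two indices, not to be confused with the reward $r_i$); $\varepsilon$ the PPO clip range (0.2 in DeepSeek R1); $\pi_{\text{ref}}$ the frozen reference (SFT) policy; $\beta$ the per-token KL coefficient (0.04 in DeepSeek R1).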
Manual Numerical Walkthrough: One GRPO Update
We trace one GRPO step on a single prompt with $G = 4$ short responses. Every intermediate number is shown; the toy is set up so that no clipping happens for most tokens and exactly one clip event happens for response 3.
Step 1 — rewards and group statistics
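Rewards from the reward function: $r = (0.90,\ 0.30,\ -0.10,\ 0.70)$. Group mean and group standard deviation (population form, dividing by $G$):
$$\mu = \tfrac{1}{4}(0.90 + 0.30 - 0.10 + 0.70) = 0.45, \qquad \sigma = \sqrt{\tfrac{1}{4}\left(0.45^2 + 0.15^2 + 0.55^2 + 0.25^2\right)} = \sqrt{0.1475} \approx 0.384$$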
Step 2 — per-response advantages
| response | reward r_i | r_i - mean | advantage A_i |
|---|---|---|---|
| o_1 | +0.90 | +0.45 | +1.172 |
| o_2 | +0.30 | -0.15 | -0.391 |
| o_3 | -0.10 | -0.55 | -1.432 |
| o_4 | +0.70 | +0.25 | +0.651 |
Notice the advantages sum to zero by construction. The most-rewarded response ($o_1$) gets the largest positive advantage; the least-rewarded ($o_3$) gets the largest negative advantage.
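Steps 1 and 2 can be checked in a few lines of plain Python. This is a minimal sketch, assuming the population standard deviation (divide by G), which is the convention that reproduces the table; the helper name `group_advantages` is illustrative, not taken from any library.

```python
import math

rewards = [0.90, 0.30, -0.10, 0.70]  # toy rewards from Step 1


def group_advantages(rewards, eps=0.0):
    """Return (mean, std, advantages) with advantages = (r_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Population variance: divide by G, not G - 1 (assumed convention).
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var)
    advantages = [(r - mean) / (std + eps) for r in rewards]
    return mean, std, advantages


mean, std, advantages = group_advantages(rewards)
print(f"mean={mean:.3f}  std={std:.3f}")
print("advantages:", [round(a, 3) for a in advantages])
print("sum of advantages:", round(sum(advantages), 12))
```

In a real trainer a small `eps` is usually added to the denominator so that an all-equal group does not divide by zero.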
Step 3 — per-token ratios for one response
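Take response 3, which has 2 tokens. Suppose the log-ratios (live vs old) are $(+0.25,\ -0.30)$. Exponentiating: $(e^{0.25},\ e^{-0.30}) \approx (1.284,\ 0.741)$. Both lie OUTSIDE the trust region $[1-\varepsilon,\ 1+\varepsilon] = [0.8,\ 1.2]$.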
Step 4 — the asymmetric clip in action
With $\varepsilon = 0.2$ and $A_3 = -1.432$:
| token | ratio r | surr1 = r·A | clip(r)·A | min (objective) | clipped? |
|---|---|---|---|---|---|
| t=1 | 1.284 | -1.839 | 1.2·(-1.432) = -1.718 | -1.839 | no (A<0, r>1+ε → full penalty) |
| t=2 | 0.741 | -1.061 | 0.8·(-1.432) = -1.146 | -1.146 | yes (A<0, r<1-ε → bounded) |
The mean objective for response 3 is $\tfrac{1}{2}(-1.839 - 1.146) \approx -1.492$. The clip did exactly what it should: it kept the full penalty, and its gradient, when the live policy over-assigned probability to a bad token ($t=1$), but froze the term at its clipped value, with zero gradient, when the live policy had already moved away from the bad token by more than the trust region allows ($t=2$).
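The two rows of the clip table can be reproduced with the sketch below. The log-ratios 0.25 and -0.30 are the values implied by the ratios 1.284 and 0.741 in the table, and the advantage is the rounded -1.432 from Step 2; the function name `clipped_surrogate` is illustrative only.

```python
import math

A3 = -1.432                  # advantage of response 3 (rounded, from Step 2)
log_ratios = [0.25, -0.30]   # log pi_theta - log pi_old, implied by the table's ratios
eps = 0.2                    # PPO clip range


def clipped_surrogate(ratio, advantage, eps):
    """Return (surr1, surr2, objective, is_clipped) for one token."""
    surr1 = ratio * advantage
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    surr2 = clipped_ratio * advantage
    objective = min(surr1, surr2)
    # The clip is "active" when the clipped branch wins the min:
    # the objective is then constant in the ratio, so its gradient is zero.
    is_clipped = surr2 < surr1
    return surr1, surr2, objective, is_clipped


results = [clipped_surrogate(math.exp(lr), A3, eps) for lr in log_ratios]
for t, (s1, s2, obj, clipped) in enumerate(results, start=1):
    print(f"t={t}  surr1={s1:.3f}  surr2={s2:.3f}  objective={obj:.3f}  clipped={clipped}")

mean_objective = sum(r[2] for r in results) / len(results)
print(f"mean objective for response 3: {mean_objective:.3f}")
```

Flipping the sign of `A3` shows the asymmetry: with a positive advantage it is the ratio above 1.2 that gets clipped, while the ratio below 0.8 keeps its gradient.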
Step 5 — per-token KL (Schulman k3)
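For response 1, token 1, let $\delta = \log \pi_{\text{ref}}(o_{1,1} \mid q) - \log \pi_\theta(o_{1,1} \mid q)$ be the log-probability gap between the reference and the live policy. Then $\pi_{\text{ref}} / \pi_\theta = e^{\delta}$ and
$$D^{\text{KL}}_{1,1} = e^{\delta} - \delta - 1$$
For a small gap this is small and positive (approximately $\delta^2 / 2$): the policy has drifted only slightly from the reference. Multiplied by $\beta = 0.04$, the KL penalty per token is smaller still: a gentle leash, not a tight one.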
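A short sketch of the k3 arithmetic for a single token follows. The two log-probabilities are hypothetical values picked only for illustration (they are not the walkthrough's numbers); β = 0.04 is the coefficient quoted earlier in this section.

```python
import math


def k3_kl(logp_ref, logp_policy):
    """Schulman's k3 estimate for one token sampled from the live policy."""
    delta = logp_ref - logp_policy          # log(pi_ref / pi_theta)
    return math.exp(delta) - delta - 1.0    # >= 0 for every delta, 0 only at delta = 0


# Hypothetical per-token log-probs (assumption, for illustration only).
logp_policy = -1.20
logp_ref = -1.30

beta = 0.04
kl = k3_kl(logp_ref, logp_policy)
print(f"k3 KL = {kl:.6f}")
print(f"beta * KL = {beta * kl:.6f}")

# Non-negativity holds sample by sample, whichever side is larger.
gaps = [-1.0, -0.1, 0.0, 0.1, 1.0]
print([round(k3_kl(g, 0.0), 4) for g in gaps])
```

The estimate is zero exactly when the two log-probs agree and grows on both sides of that point, which is why it works as a per-token penalty without ever rewarding drift.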
Step 6 — aggregate into one scalar loss
Compute the per-response mean objective for each of the four responses (with ratios mostly near 1), negate, add $\beta$ times the per-response mean KL, and average over the group. With ratios this close to 1 the loss is small; the policy will only start moving once enough gradient steps have shifted the per-token log-probs meaningfully.
Sanity reading of the numbers: on a fresh rollout every ratio is very close to 1, every KL term is close to 0, and the loss is small. That is healthy. A first-step loss that is large in magnitude typically means the rollout and the trainer are computing log-probs differently: check sampling temperature, dtype, and the prompt/response mask alignment.
Interactive: The Group-Relative Advantage
Vary the four rewards and watch how the group turns them into four advantages. Notice three things in particular:
- If you set all four rewards to the same value, every advantage collapses to zero. This group contributes nothing to the gradient: GRPO learns from disagreement inside a group, not from absolute reward.
- Shifting all four rewards up or down by the same amount leaves the advantages unchanged. Only the spread inside the group matters.
- Switch off “divide by std” to get the Dr.GRPO variant (Section 15.5): the advantages now scale with the spread, which avoids vanilla GRPO's subtle bias toward low-variance groups.
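The three observations above can be checked numerically with a small sketch. The `advantages` helper and its `divide_by_std` flag are illustrative names; the flag set to `False` simply centres the rewards without rescaling them, which is the behaviour the third bullet describes.

```python
import math


def advantages(rewards, divide_by_std=True, eps=1e-8):
    """Group-relative advantages; divide_by_std=False keeps only the centring."""
    g = len(rewards)
    mean = sum(rewards) / g
    centred = [r - mean for r in rewards]
    if not divide_by_std:
        return centred
    std = math.sqrt(sum(c * c for c in centred) / g)  # population std
    return [c / (std + eps) for c in centred]


base = [0.90, 0.30, -0.10, 0.70]

# 1. All-equal rewards: every advantage is zero, the group carries no signal.
equal = advantages([0.5, 0.5, 0.5, 0.5])

# 2. Shifting every reward by the same constant changes nothing.
reference = advantages(base)
shifted = advantages([r + 10.0 for r in base])

# 3. Doubling the spread: invisible with std-normalisation, visible without it.
doubled_std = advantages([2.0 * r for r in base])
plain = advantages(base, divide_by_std=False)
doubled_plain = advantages([2.0 * r for r in base], divide_by_std=False)

print("equal:        ", equal)
print("reference:    ", [round(a, 3) for a in reference])
print("shifted:      ", [round(a, 3) for a in shifted])
print("doubled (std):", [round(a, 3) for a in doubled_std])
print("doubled (raw):", [round(a, 3) for a in doubled_plain])
```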
Plain Python: GRPO Loss from Scratch
We re-implement the GRPO objective in pure NumPy on the toy rollout from the walkthrough above. No autograd, no transformer — just the loss arithmetic. Swap the hard-coded logits for a real transformer's output and the code is unchanged.
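A minimal NumPy sketch of that loss arithmetic for one prompt is given here, under stated assumptions. Only the rewards and the response-3 log-ratios (0.25, -0.30) come from the walkthrough; every other log-prob, the response lengths, and the reference offset are hypothetical values chosen for illustration, and `grpo_loss` is an illustrative name.

```python
import numpy as np


def grpo_loss(logp_new, logp_old, logp_ref, rewards, mask,
              clip_eps=0.2, beta=0.04, std_eps=1e-8):
    """GRPO loss for ONE prompt. Log-prob arrays and mask are (G, S); rewards is (G,)."""
    # Group-relative advantage, one scalar per response (np.std is the population std).
    adv = (rewards - rewards.mean()) / (rewards.std() + std_eps)
    adv_tok = adv[:, None]                      # broadcast over the token axis

    # PPO clipped surrogate, per token.
    ratio = np.exp(logp_new - logp_old)
    surr1 = ratio * adv_tok
    surr2 = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv_tok
    objective = np.minimum(surr1, surr2)

    # Schulman k3 KL estimate against the reference, per token.
    delta = logp_ref - logp_new
    kl = np.exp(delta) - delta - 1.0

    # Mean over each response's real tokens, then mean over the group.
    per_token = (objective - beta * kl) * mask
    per_response = per_token.sum(axis=1) / mask.sum(axis=1)
    return -per_response.mean(), adv, per_response


rewards = np.array([0.90, 0.30, -0.10, 0.70])
# Hypothetical lengths 3, 2, 2, 3; padded positions are masked out.
mask = np.array([[1, 1, 1], [1, 1, 0], [1, 1, 0], [1, 1, 1]], dtype=float)
logp_old = np.array([[-1.2, -0.8, -2.0],
                     [-1.5, -0.9,  0.0],
                     [-2.1, -1.1,  0.0],
                     [-0.7, -1.3, -1.0]])       # hypothetical rollout log-probs
log_ratio = np.array([[ 0.02, -0.01, 0.00],
                      [ 0.01,  0.03, 0.00],
                      [ 0.25, -0.30, 0.00],     # response 3: values from the walkthrough
                      [-0.02,  0.01, 0.02]])
logp_new = logp_old + log_ratio
logp_ref = logp_old - 0.05                      # hypothetical reference log-probs

loss, adv, per_response = grpo_loss(logp_new, logp_old, logp_ref, rewards, mask)
_, _, per_response_no_kl = grpo_loss(logp_new, logp_old, logp_ref, rewards, mask, beta=0.0)
loss_on_policy, _, _ = grpo_loss(logp_old, logp_old, logp_old, rewards, mask)

print("advantages:", np.round(adv, 3))
print("response-3 objective without KL:", round(float(per_response_no_kl[2]), 3))
print("loss:", round(float(loss), 6))
print("loss with new = old = ref:", float(loss_on_policy))
```

The last call is the sanity check from the walkthrough: when the live, old, and reference log-probs coincide, every ratio is 1, every KL term is 0, and the loss is the negated mean advantage, which is zero up to rounding.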
PyTorch: A Production-Shaped GRPO Step
Now the full thing: tensorised across the (B, G, S) layout that every production GRPO trainer uses, with the group-relative normalisation as a single mean/std along the G axis, gradient clipping, and the diagnostics that go straight onto a training dashboard. Drop in any HuggingFace causal LM for policy and a frozen copy of the SFT checkpoint for ref_policy and this trains a real reasoning model.
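What follows is a sketch of such a step in PyTorch, not a reference implementation. It assumes the per-token log-probs of the sampled tokens have already been gathered into (B, G, S) tensors; `grpo_step_loss`, its argument names, and the diagnostic keys are hypothetical, and a leaf tensor stands in for the policy's forward pass so that the example runs without a model.

```python
import torch


def grpo_step_loss(logp_new, logp_old, logp_ref, rewards, mask,
                   clip_eps=0.2, beta=0.04, std_eps=1e-6):
    """Sketch of a GRPO loss over a (B, G, S) batch.

    logp_new / logp_old / logp_ref: (B, G, S) log-probs of the sampled tokens
    rewards: (B, G) one scalar per response
    mask:    (B, G, S) 1.0 on response tokens, 0.0 on prompt or padding
    """
    # Group-relative advantage: normalise along the G axis only.
    mean = rewards.mean(dim=1, keepdim=True)
    std = ((rewards - mean) ** 2).mean(dim=1, keepdim=True).sqrt()  # population std
    adv = ((rewards - mean) / (std + std_eps)).unsqueeze(-1)        # (B, G, 1)

    # PPO clipped surrogate, per token.
    ratio = torch.exp(logp_new - logp_old)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    objective = torch.minimum(surr1, surr2)

    # Schulman k3 KL estimate against the frozen reference, per token.
    delta = logp_ref - logp_new
    kl = torch.exp(delta) - delta - 1.0

    # Mean over each response's tokens, then mean over all B*G responses.
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    per_response = ((objective - beta * kl) * mask).sum(dim=-1) / lengths
    loss = -per_response.mean()

    with torch.no_grad():
        n_tokens = mask.sum().clamp(min=1.0)
        diagnostics = {
            "clipfrac": (((surr2 < surr1).float() * mask).sum() / n_tokens).item(),
            "kl": ((kl * mask).sum() / n_tokens).item(),
            "ratio_mean": ((ratio * mask).sum() / n_tokens).item(),
            "adv_std": adv.std().item(),
        }
    return loss, diagnostics


torch.manual_seed(0)
B, G, S = 2, 4, 5
logp_old = -(torch.rand(B, G, S) + 0.1)            # hypothetical rollout log-probs
logp_new = logp_old.clone().requires_grad_(True)   # stand-in for the live policy's output
logp_ref = logp_old.clone()                        # reference equals policy at step 0
rewards = torch.rand(B, G)
mask = torch.ones(B, G, S)
mask[:, :, -1] = 0.0                               # pretend the last position is padding

loss, diag = grpo_step_loss(logp_new, logp_old, logp_ref, rewards, mask)
loss.backward()
grad_norm = torch.nn.utils.clip_grad_norm_([logp_new], max_norm=1.0)
print(f"loss={loss.item():.6f}  grad_norm={grad_norm.item():.4f}")
print(diag)
```

In a real trainer `logp_new` would come from a forward pass of the policy and the gradient clip would be applied to the policy's parameters. On this freshly cloned setup the diagnostics show the healthy step-0 pattern: ratio mean of 1, zero KL, zero clip fraction.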
At Massive Scale: What Changes for a 70B+ Run
Every line of code above scales to a 70B-parameter GRPO run, but four things change in spirit.
Memory: GRPO's biggest gift to production training
At 70B parameters in bf16, the policy and the reference together are ~280 GB of weights. Adding a value head (PPO) raises the weights to ~420 GB, plus another ~140 GB of value-head gradients, plus ~280 GB of AdamW optimiser state for the value head: an extra ~560 GB relative to GRPO. On an 8×H100 node with 80 GB per GPU (640 GB in total), that extra 560 GB is the difference between a run that fits with FSDP sharding and a run that requires a second node just for the value head. This is why DeepSeek picked GRPO for R1 and why every post-R1 reasoning recipe (Qwen-Math-RL, AceMath-RL, Olmo 3) followed suit.
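The arithmetic behind these figures is spelled out below as a back-of-the-envelope sketch. It assumes 2 bytes per parameter for weights and gradients and two bf16 AdamW moment buffers, which is what the 280 GB figure implies; fp32 moments would double that term. Activations and KV caches are ignored.

```python
PARAMS = 70e9        # 70B parameters
BYTES_BF16 = 2       # bytes per bf16 value


def gb(n_bytes):
    """Bytes to decimal gigabytes."""
    return n_bytes / 1e9


weights = gb(PARAMS * BYTES_BF16)   # one model copy in bf16
grads = weights                     # one bf16 gradient per parameter
adamw = 2 * weights                 # two moment buffers, assumed bf16

policy_plus_ref = 2 * weights                 # weights only: policy + frozen reference
value_head_extra = weights + grads + adamw    # what PPO adds on top of GRPO
node_capacity = 8 * 80                        # 8 x H100 at 80 GB each

print(f"one model copy:           {weights:.0f} GB")
print(f"policy + reference:       {policy_plus_ref:.0f} GB")
print(f"extra for the value head: {value_head_extra:.0f} GB")
print(f"one 8xH100 node:          {node_capacity} GB")
print(f"value head as share of a node: {value_head_extra / node_capacity:.0%}")
```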
The (B, G) axis lives across ranks
At scale, the natural FSDP sharding is to split the prompt axis B across data-parallel ranks (every rank holds the same parameters and a different slice of prompts) and shard the parameters across model-parallel ranks. The group-relative mean/std becomes a careful primitive: each rank computes partial sum and partial sum-of-squares along the G axis for its own prompts; an all-reduce is NOT needed for the mean and std as long as every prompt's entire group is on the same rank. The most common GRPO bug at scale is splitting a single prompt's group across ranks: each rank then sees a different baseline, the effective advantage scale varies across ranks, and the run becomes noisy in a way that takes a week to diagnose. Every production GRPO library has unit tests that assert all G responses for a prompt live on the same rank.
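A toy version of that invariant, and of the unit test that guards it, might look like the sketch below. It uses plain Python containers instead of a real distributed sampler; `assign_groups_to_ranks` and `assert_groups_intact` are hypothetical helpers, and the modulo rule is just one simple way to keep a group together.

```python
from collections import defaultdict


def assign_groups_to_ranks(prompt_ids, group_size, num_ranks):
    """Return {rank: [(prompt_id, sample_idx), ...]} with whole groups per rank."""
    shards = defaultdict(list)
    for pid in prompt_ids:
        rank = pid % num_ranks              # every sample of a prompt shares this rank
        for k in range(group_size):
            shards[rank].append((pid, k))
    return dict(shards)


def assert_groups_intact(shards, group_size):
    """Fail if any prompt's group is split across ranks or is incomplete."""
    owner = {}
    counts = defaultdict(int)
    for rank, items in shards.items():
        for pid, _ in items:
            assert owner.setdefault(pid, rank) == rank, f"prompt {pid} split across ranks"
            counts[pid] += 1
    assert all(c == group_size for c in counts.values()), "incomplete group"


shards = assign_groups_to_ranks(range(8), group_size=16, num_ranks=4)
assert_groups_intact(shards, group_size=16)
print({rank: len(items) for rank, items in sorted(shards.items())})

# The bug described above: 8 of a prompt's 16 responses on each of two ranks.
broken = {0: [(5, k) for k in range(8)], 1: [(5, k) for k in range(8, 16)]}
try:
    assert_groups_intact(broken, group_size=16)
except AssertionError as err:
    print("caught:", err)
```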
Sampling-vs-training imbalance is even worse than PPO
GRPO requires G× more rollout tokens per prompt than PPO did. With G = 16 and responses of length 1024 tokens that is 16 384 tokens per prompt to sample before a single gradient step can happen. At 500 tok/s/GPU with vLLM, a 1024-prompt minibatch is about 16.8M tokens, or roughly 9 GPU-hours of generation, so it takes minutes to roll out even when spread over dozens of GPUs. Production GRPO infrastructure runs the rollout cluster (many GPUs, prioritised for inference throughput with vLLM and continuous batching) entirely separately from the training cluster (FSDP, fewer GPUs, prioritised for memory bandwidth) and ships generations over the network. Weights sync after each optimisation phase.
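The rollout budget is easy to sanity-check. This sketch uses the figures quoted above (G = 16, 1024-token responses, a 1024-prompt minibatch, 500 tok/s per GPU) and assumes perfectly linear scaling across GPUs; the 64-GPU rollout cluster is a hypothetical size chosen only to get a wall-clock number.

```python
G = 16                       # responses per prompt
RESPONSE_LEN = 1024          # tokens per response
PROMPTS = 1024               # prompts per minibatch
TOK_PER_S_PER_GPU = 500      # generation throughput per GPU

tokens_per_prompt = G * RESPONSE_LEN
total_tokens = PROMPTS * tokens_per_prompt
gpu_seconds = total_tokens / TOK_PER_S_PER_GPU


def wall_clock_minutes(num_gpus):
    """Rollout time in minutes, assuming perfect scaling (an idealisation)."""
    return gpu_seconds / num_gpus / 60


print(f"tokens per prompt: {tokens_per_prompt}")
print(f"tokens per batch:  {total_tokens}")
print(f"GPU-hours:         {gpu_seconds / 3600:.2f}")
print(f"minutes on a hypothetical 64-GPU rollout cluster: {wall_clock_minutes(64):.1f}")
```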
Reward model bottleneck disappears for verifier rewards
DeepSeek R1 uses RULE-BASED rewards for math and code (the verifier is the unit test or the symbolic check), which is essentially free to compute. For tasks that still need a learned RM, the same rollout/score/train split as PPO applies, and the RM becomes the new cost bottleneck.
Engineering Reality: The Knobs That Break Runs
After enough GRPO runs every team converges on the same set of failure modes and the same set of high-leverage fixes.
- All-equal-reward groups dominate the batch. Symptom: loss is flat, gradient norm collapses, advantage std drops near zero. Cause: for hard math/code prompts every response in a group is wrong (reward = 0) so every advantage is zero. The group contributes nothing. Fix: oversample these prompts with higher temperature, or skip them entirely from the gradient and log them separately for later curriculum work. DeepSeek R1 explicitly filters such groups from the loss.
- Group split across ranks. Symptom: loss noise is much higher than a single-GPU debug run. Cause: a careless DataLoader put 8 of a prompt's 16 responses on rank 0 and 8 on rank 1. The baseline computed on rank 0 differs from rank 1 and the advantages are inconsistent. Fix: a custom sampler that guarantees `prompt_id % num_ranks == rank_id` for the entire group. This is a one-time setup that prevents weeks of pain.
- Length bias toward longer responses. Symptom: mean response length climbs steadily over training even though rewards do not require length. Cause: vanilla GRPO's per-response normalisation gives long responses a smaller per-token loss; the policy can game this by lengthening responses to dilute the gradient. Fix: switch to Dr.GRPO's normalisation (Section 15.5), or apply an explicit length penalty inside the reward function.
- KL coefficient too small. Symptom: reward climbs for 200 steps then the policy degenerates (repeats a single phrase, drops English, emits gibberish that scores high on a mis-calibrated RM). Cause: insufficient KL pressure against the reference SFT model. Fix: raise $\beta$ by 2× and restart from a pre-collapse checkpoint. Many teams use adaptive KL (raise $\beta$ when realised KL exceeds a target, lower it when below).
- Rollout temperature mismatch. Symptom: ratio distribution is centred far from 1.0 on step 0 even with the freshly-cloned policy. Cause: the rollout sampled with `temperature=1.0` but the trainer computes log-probs from logits scaled by a different temperature. The two distributions are different and the ratio is meaningless. Fix: apply the SAME temperature inside the trainer when computing per-token log-probs, or sample with `temperature=1.0` throughout.
- Action mask off by one. Symptom: training looks fine, reward inches up, but eval shows no change from the SFT model. Cause: the mask includes prompt tokens (which carry no gradient by construction since their log-probs are not part of the decision), or excludes the first response token. Fix: print one example's mask alongside its decoded tokens before launch and eyeball that mask=1 lines up with the response tokens only. The off-by-one between `input_ids[:, :, :-1]` and `input_ids[:, :, 1:]` is the most common variant.
- Old log-probs not refreshed between PPO epochs. Symptom: ratio drifts far from 1 after the second optimisation epoch on the same rollout, clipfrac spikes. Cause: GRPO typically uses 1 epoch per rollout for exactly this reason, but if you set PPO-style 4 epochs you must NOT re-use the `old_logp` from epoch 1 on epoch 4, because the live policy has moved. Fix: either stick with 1 epoch (the DeepSeek R1 choice) or refresh `old_logp` at the start of every epoch.
- Gradient norm explodes after a few hundred steps. Symptom: the gradient norm climbs from 0.5 to 50 in 100 steps. Cause: either KL coefficient too small (policy escaping to extreme outputs) or a reward model bug that gives unboundedly large rewards on some inputs. Fix: raise $\beta$ or clip the reward distribution at the source.
- Reference accidentally on the same fork as policy. Symptom: KL goes to zero, policy never moves. Cause: the “reference” was initialised as a Python reference to the same tensors as the live policy. Fix: assert `id(ref_policy) != id(policy)` at startup; load the reference from a separate checkpoint.
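The action-mask check from the list above is cheap to script. The sketch below works on a single unpadded sequence held in a plain Python list; the token ids are made up and `shifted_targets_and_mask` is a hypothetical helper. It relies only on the standard causal-LM convention that the logits at position t predict the token at position t+1.

```python
def shifted_targets_and_mask(input_ids, prompt_len):
    """Align targets and response mask with logits[:-1] of a causal LM.

    Logits at position t predict input_ids[t + 1], so the targets are
    input_ids[1:] and target j is the token at original position j + 1.
    """
    targets = input_ids[1:]
    # Target j is a response token iff its original position j + 1 >= prompt_len.
    mask = [1 if (j + 1) >= prompt_len else 0 for j in range(len(targets))]
    return targets, mask


prompt = [101, 7, 8, 9]      # hypothetical prompt token ids
response = [42, 43, 44]      # hypothetical response token ids
input_ids = prompt + response

targets, mask = shifted_targets_and_mask(input_ids, len(prompt))
scored = [tok for tok, m in zip(targets, mask) if m]

# Eyeball check: mask=1 must line up with the response tokens only.
for tok, m in zip(targets, mask):
    print(f"target={tok:>4}  mask={m}")
print("scored tokens:", scored)
```

The first response token is predicted by the logits at the last prompt position, so the first 1 in the mask sits at index `prompt_len - 1` of the shifted targets; a mask built on unshifted positions drops exactly that token.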
The Mental Model That Unifies This Section
GRPO is PPO with one substitution and two deletions. Substitution: replace the critic-based advantage estimate with the group-normalised reward $A_i = (r_i - \operatorname{mean}(r)) / \operatorname{std}(r)$, broadcast over every token of response $o_i$. Deletion 1: remove the value head, its gradient, and its optimiser state (saving ~4 model-replicas' worth of bf16 memory at scale, the ~560 GB counted above for 70B). Deletion 2: remove the entropy bonus (the per-group reward diversity replaces it as an exploration signal). Everything else (the clipped surrogate, the per-token KL to the reference, the gradient clip, the diagnostics) is the PPO code unchanged. Get the (B, G, S) shape right, keep every group on one rank, and a 70B reasoning model can be RL-trained for a thousand steps on a single 8×H100 node.
With this section closed, Chapter 15 is complete. The next chapter opens DeepSeek R1-Zero — the pure-RL experiment that ran this exact GRPO loop with rule-based math rewards and discovered emergent chain-of-thought reasoning along the way.