Chapter 15

GRPO Derivation

GRPO: Group Relative Policy Optimisation

The Real Problem: PPO Needs a Critic It Cannot Afford

The previous section laid out PPO's machinery: a clipped importance ratio, a value head that supplies the per-token baseline, and a per-token KL anchor to the reference policy. The algorithm works. It also pays a heavy price. A 70B policy demands a 70B value head running in lockstep: the value network is a second model the same size as the policy, with its own forward pass, its own backward pass, and its own slice of AdamW state. With mixed-precision AdamW that is roughly 16 bytes per trainable parameter, or $16N$ bytes for an $N$-parameter model, doubled because there are two such models. For a 70B run this is about 1.1 TB per model and over 2 TB for the pair, before a single activation is allocated.
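The arithmetic behind that figure can be checked in a few lines. This is a back-of-envelope sketch: the per-parameter breakdown below is an assumption (a common mixed-precision AdamW layout), not a number taken from any particular framework, and activations are ignored.

```python
# Back-of-envelope memory for mixed-precision AdamW training state.
# ASSUMPTION: this 16 bytes/parameter breakdown is a common convention,
# not a figure taken from a specific training framework.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def training_state_bytes(n_params: float, n_models: int = 1) -> float:
    """Bytes of weights + gradients + optimizer state, excluding activations."""
    per_param = sum(BYTES_PER_PARAM.values())   # 16 bytes
    return per_param * n_params * n_models

N = 70e9
policy_only = training_state_bytes(N, n_models=1)
policy_plus_critic = training_state_bytes(N, n_models=2)
print(f"policy only:     {policy_only / 1e12:.2f} TB")
print(f"policy + critic: {policy_plus_critic / 1e12:.2f} TB")
```

Dropping the critic removes the second of those two terms, which is the saving GRPO is after.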

Worse, the value head is dead weight in the most literal sense. Nothing in the policy's output depends on $V(s_t)$; its only job is to subtract a per-state expected return from the actual return so the policy gradient sees a centred signal. The value head is a variance reducer, not a behavioural component. We are paying 70 billion parameters to compute one scalar per token whose only function is to make another loss less noisy.

And the value head is hard to train. It chases a moving target: the policy keeps changing, so the return distribution keeps shifting, so the value head's MSE target keeps drifting. PPO ships with a clipped value loss precisely because uncontrolled value updates routinely blow up. Every engineer who has stood up a from-scratch PPO-RLHF run has felt this pain — value loss oscillating, advantages collapsing, the run wedging on a plateau that has nothing to do with the policy itself.

GRPO was born from a single question: can we get a usable per-token advantage without a value head? The DeepSeek-Math team (2024) answered yes, with a one-line trick that has since propagated through DeepSeek-R1, Qwen-2-Math, Llama-3 math fine-tunes, and most open-source rule-based RL recipes. The trick: for every prompt, sample G rollouts at once, and use the mean and spread of their final rewards in place of a learned baseline. The group is its own critic.

What this section will show, in one paragraph

GRPO keeps every piece of PPO except the value head. The clipped surrogate stays. The KL-to-reference stays. The importance ratio stays. The only thing that changes is how the per-token advantage is computed: instead of $A_t = R(\tau) - V(s_t)$ with a learned $V$, GRPO uses $A_i = (R_i - \bar R) / \sigma_R$, where $\bar R$ and $\sigma_R$ are the mean and std of the rewards across the G rollouts of the same prompt, and that scalar advantage is broadcast to every token of the rollout. No value network. No GAE. Two models in memory (policy and reference) instead of four when the reward is a rule-based verifier; three when it is a learned reward model.

Intuition: Let the Group Be Its Own Baseline

Imagine you are grading G = 8 students who all wrote the same essay. You do not have an objective rubric; you only have your gut ranking. The natural way to teach the class is relatively: praise the essays that scored above the average of these eight, criticise the ones that scored below. Whether the average is “objectively good” is irrelevant — what matters is which essays stood out from this group.

GRPO is that intuition mechanised. For every prompt, sample G responses. Score each one. Subtract the group mean. Now you have a signed advantage per response — positive means “do more of this”, negative means “do less of this”. There is no value network because the group itself supplies the answer to “what was expected of this prompt?”. The expected reward of the current policy on this prompt is, well, the empirical mean of the G samples we just drew from it.

Why does this work? Because the policy gradient does not need an absolute baseline; it only needs an unbiased one. Any baseline $b(s)$ that does not depend on the action $a$ leaves the gradient unbiased (this is the classical baseline-subtraction theorem from REINFORCE). The group mean of rewards on the same prompt is a perfectly valid such baseline: it depends only on the prompt and on the current policy, not on any individual action. So we are allowed to subtract it, and the gradient stays pointing in the right direction.
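The baseline-subtraction theorem is easy to verify numerically. The sketch below uses a hypothetical 3-action softmax bandit (the logits and rewards are made up for illustration) and computes the expected policy gradient exactly, by summing over actions, for three different constant baselines.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical 3-action bandit: logits and per-action rewards are made up.
logits = np.array([0.2, -0.5, 1.0])
rewards = np.array([1.0, 0.0, 0.3])
pi = softmax(logits)

def expected_policy_gradient(baseline: float) -> np.ndarray:
    """Exact E_a[ grad_logits log pi(a) * (R(a) - b) ], summed over actions."""
    grad = np.zeros_like(logits)
    for a in range(len(logits)):
        # For a softmax policy, d log pi(a) / d logits = onehot(a) - pi
        score = -pi.copy()
        score[a] += 1.0
        grad += pi[a] * score * (rewards[a] - baseline)
    return grad

g_no_baseline = expected_policy_gradient(0.0)
g_mean_baseline = expected_policy_gradient(float(pi @ rewards))
g_arbitrary = expected_policy_gradient(123.4)
print(np.allclose(g_no_baseline, g_mean_baseline))
print(np.allclose(g_no_baseline, g_arbitrary))
```

All three gradients coincide in expectation; the baseline changes only the variance of the sampled estimate, which is why a cheap baseline can stand in for an expensive one.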

One-line mental model: PPO uses the value function $V(s_t)$ as a baseline because $\mathbb{E}_{a}[R \mid s_t] = V(s_t)$. GRPO replaces $V(s_t)$ with the empirical Monte Carlo estimator of that same expectation: the mean reward of G samples drawn at $s_t = \text{prompt}$. The G samples are the value network.

A physical analogy: PPO carries an absolute thermometer that reads “true” expected reward at every state. GRPO carries no thermometer at all; instead it compares the temperature of G samples against each other and trusts that one of them is above average. The first is more informative when calibrated; the second is dramatically cheaper and cannot drift out of calibration because there is nothing to calibrate.

Deriving GRPO from PPO, One Substitution at a Time

We start exactly where § 14.6 ended. PPO's RLHF objective on a single prompt is:

$$J^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t,\; \mathrm{clip}\big(r_t(\theta), 1{-}\varepsilon, 1{+}\varepsilon\big) A_t \right) \right] - \beta\, D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the importance ratio and the advantage $A_t$ is the GAE estimate built on top of a learned value head $V_\phi(s_t)$.

Substitution one: replace GAE with a single trajectory-level reward. For LLM tasks the per-token reward is zero everywhere except at the EOS token, where it is the reward model's score $R(\tau)$. The GAE estimate over such a sparse reward stream is dominated by that one terminal value; in the limit $\lambda \to 1$ it collapses to

$$A_t^{\text{MC}} = R(\tau) - V_\phi(s_t)$$

which is just the Monte-Carlo return minus the value baseline. Already we have one clean expression for the advantage that does not need the GAE bookkeeping; we keep it.
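The collapse can be checked with a short sketch. It assumes no discounting ($\gamma = 1$) and a terminal-state value of zero; the reward and the value estimates are made-up numbers, and `gae` is an illustrative helper, not code from any library.

```python
import numpy as np

def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalised advantage estimate via the usual backward recursion.
    `values` has length T + 1; the last entry is the terminal-state value."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Hypothetical 5-token response: reward only at the final (EOS) token.
R = 0.8
rewards = np.array([0.0, 0.0, 0.0, 0.0, R])
values = np.array([0.3, 0.5, 0.2, 0.6, 0.7, 0.0])   # made-up V(s_t); terminal value 0

adv_lambda_1 = gae(rewards, values, gamma=1.0, lam=1.0)
adv_mc = R - values[:-1]          # R(tau) - V(s_t)
print(np.allclose(adv_lambda_1, adv_mc))
```

With $\lambda < 1$ the two no longer agree, because the recursion then leans on the intermediate value estimates rather than on the terminal reward alone.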

Substitution two: replace the value baseline with the group mean. Drop the learned $V_\phi(s_t)$ entirely. In its place, sample G rollouts $\tau_1, \ldots, \tau_G$ from the OLD policy on the same prompt and use the empirical mean of their rewards as the baseline:

$$\bar R = \frac{1}{G} \sum_{i=1}^{G} R(\tau_i), \qquad A_i = R(\tau_i) - \bar R$$

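This keeps the gradient pointed in the right direction: the baseline-subtraction theorem applies to any baseline that does not depend on the rollout's own actions. Strictly, $\bar R$ includes rollout $i$'s own reward, so the expected gradient is scaled by $(G-1)/G$ rather than left exactly unchanged; the direction is unaffected. The estimate is also noisy: for a small G the empirical mean is a high-variance estimator of the true expected reward. GRPO does not add parameters back to deal with this; its next step is to standardise by the group standard deviation.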

Substitution three: standardise by the group std. Compute the group standard deviation $\sigma_R$ alongside the mean and scale:

$$A_i = \frac{R(\tau_i) - \bar R}{\sigma_R + \epsilon}$$

Two things happen at once. First, the scale of the advantage is now per-prompt normalised: an $A_i = +1$ means “one std above the group mean for this prompt”, regardless of whether this prompt's rewards live in $[0, 1]$ or $[-100, 100]$. Second, the loss magnitude across batches becomes comparable, which is what lets a single learning rate and a single KL coefficient work for prompts of wildly different difficulty.
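A quick sketch of the scale-invariance claim: the same four rollouts scored on a $[0, 1]$ scale and on a $[-100, 100]$ scale produce the same standardised advantages. The reward values and the helper name `standardise` are illustrative assumptions.

```python
import numpy as np

def standardise(rewards, eps=1e-8):
    """(R - mean) / (std + eps) with the population std (ddof=0)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for one prompt on a [0, 1] scale ...
r_unit = np.array([0.9, 0.3, 0.1, 0.7])
# ... and the same ranking expressed on a [-100, 100] scale.
r_wide = 200.0 * r_unit - 100.0

a_unit = standardise(r_unit)
a_wide = standardise(r_wide)
print(np.round(a_unit, 3))
print(np.allclose(a_unit, a_wide, atol=1e-6))
```

Any positive affine rescaling of the rewards leaves the advantages unchanged up to the tiny effect of $\epsilon$, so the optimiser sees the same signal either way.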

Substitution four: broadcast the scalar advantage to every token. The advantage we just defined is one number per rollout. The policy update needs one number per token. GRPO makes the simplest possible choice: copy the same $A_i$ to every token of rollout $i$:

$$A_{i, t} = A_i \quad \text{for all } t \in \text{response tokens of rollout } i$$

This is a deliberately coarse credit assignment. A 200-token response gets credit for every token equally — the one tactically correct token gets the same advantage as 199 boilerplate tokens. PPO with GAE would give each token a different advantage based on local TD error; GRPO trades that resolution for the elimination of the value head. In practice, with large enough G and a reward signal that distinguishes responses on overall quality (not on token-level details), this trade pays off.

The bias-variance picture in one line

PPO has lower per-token variance (the value head smooths) but higher cost (an extra 70B model). GRPO has higher per-token variance (one scalar per response) but zero extra parameters. The std normalisation fixes the variance of the advantages within each group at 1; dropping the critic fixes the extra parameter count at 0. For LLM tasks where rewards are sparse and the per-token attribution is already a crude approximation, the GRPO trade is generally a win, and the baseline estimate gets less noisy as G grows, at the price of G rollouts per prompt.

The Group-Relative Advantage

Collecting the substitutions, the GRPO advantage in its final form is one line:

$$A_{i, t} \;=\; \frac{R(\tau_i) - \frac{1}{G} \sum_{j=1}^{G} R(\tau_j)}{\sqrt{\frac{1}{G} \sum_{j=1}^{G} \big(R(\tau_j) - \bar R\big)^2} + \epsilon}$$

Read it slowly. The numerator is “this rollout's reward minus the group mean”, the centred signal. The denominator is the group standard deviation (population formula: divide by G, not G−1), regularised by a small $\epsilon$ to prevent division by zero on degenerate groups. The result is broadcast across all tokens of rollout $i$: every response token sees the same scalar advantage.

| Property | PPO + GAE | GRPO |
| --- | --- | --- |
| Per-token advantage source | TD residuals through a learned V(s_t) | Group-mean and group-std over G rollouts on the same prompt |
| Extra trainable parameters | ≈ N (a second model the size of the policy) | 0 |
| Per-token variance reduction | Strong: V(s_t) is per-state | Weaker: same scalar per token in a rollout |
| Sensitivity to reward scale across prompts | Depends on V's calibration | Eliminated by std normalisation |
| Rollouts needed per prompt | 1 (sometimes 2 for chosen/rejected RLHF) | G (typically 4 – 64) |
| Failure mode if rewards collapse | V learns a constant; gradient continues | std → 0: identical rewards give zero advantages and no gradient; near-identical rewards get tiny differences amplified to unit scale (see § engineering) |

Two consequences of this choice are worth naming up front. One: GRPO trades model parameters (the value head) for inference samples (the G rollouts). If your inference is fast and your reward is cheap to compute, GRPO is usually the better trade. If your inference is slow or your reward call is expensive, GRPO can be worse than PPO in wall-clock terms. Two: the std normalisation makes the gradient scale-free, which means the choice of $\beta_{\text{KL}}$ and the learning rate become more universal: DeepSeek-Math uses the same hyperparameters across math problems of vastly different reward distributions.

Importing PPO's Clip — Why GRPO Keeps the Ratio

With the advantage defined, the rest of GRPO is PPO almost verbatim. Define the same importance ratio:

$$r_{i, t}(\theta) = \frac{\pi_\theta(a_{i, t} \mid s_{i, t})}{\pi_{\theta_{\text{old}}}(a_{i, t} \mid s_{i, t})}$$

and the same clipped surrogate, replacing PPO's per-token advantage $A_t$ with GRPO's broadcast advantage $A_{i, t}$:

$$L_{i, t}^{\text{CLIP}}(\theta) = \min\!\left( r_{i, t}(\theta) A_{i, t},\; \mathrm{clip}\big(r_{i, t}(\theta), 1{-}\varepsilon, 1{+}\varepsilon\big) A_{i, t} \right)$$

Why keep the clip at all if the value head is gone? Because the ratio clip protects against the OTHER PPO failure mode: taking K gradient steps on the same rollout (the inner optimisation loop of PPO/GRPO) and having the policy drift far enough between steps that the importance correction breaks down. The clip prevents a runaway feedback loop in which a high-ratio token gets pushed higher on the next minibatch, raising its ratio further, and compounding. Even without a critic, that feedback loop is real; the clip is the cheap insurance against it.

Crucially, $\varepsilon = 0.2$ remains the right default. The PPO ablations that fixed this value were about ratio dynamics, not about whether a critic was present. Some recent variants (Dr. GRPO, Open-RLHF's “reinforce++”) experiment with $\varepsilon \in [0.1, 0.3]$, but the differences are second-order: get the advantage and the KL right first, then think about $\varepsilon$.

Symmetry check: when does the clip fire in GRPO?

Exactly the same conditions as in PPO. For a positive group advantage $A_i > 0$, the clip fires when the new policy has already grown the ratio past $1 + \varepsilon$: the gradient becomes zero, so there is no further pushing of an already-rewarded action. For a negative advantage, the clip fires on the other side, once the ratio has fallen below $1 - \varepsilon$. The asymmetric ‘safety brake’ that makes PPO trustworthy carries over verbatim because it depends only on the sign of the advantage and the position of the ratio, not on how the advantage was computed.
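The four sign-and-side combinations can be probed directly. The sketch below evaluates the clipped surrogate at hypothetical (ratio, advantage) pairs and estimates its derivative with respect to the ratio by central finite differences; the function names are illustrative.

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A), elementwise."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

def surrogate_grad_wrt_ratio(ratio, adv, eps=0.2, h=1e-6):
    """Central finite difference of the surrogate with respect to the ratio."""
    up = clipped_surrogate(ratio + h, adv, eps)
    down = clipped_surrogate(ratio - h, adv, eps)
    return (up - down) / (2 * h)

# Hypothetical (ratio, advantage) pairs covering the four interesting cases.
cases = {
    "A>0, r above 1+eps": (1.50, +1.0),
    "A>0, r below 1-eps": (0.50, +1.0),
    "A<0, r above 1+eps": (1.50, -1.0),
    "A<0, r below 1-eps": (0.50, -1.0),
}
for name, (r, a) in cases.items():
    g = surrogate_grad_wrt_ratio(np.array(r), np.array(a))
    print(f"{name}: d(surrogate)/d(ratio) = {float(g):+.1f}")
```

The derivative is zero only in the first and last cases, where the policy has already moved past the trust region in the direction the advantage favours. In the two middle cases the policy has moved the wrong way, and the full gradient is kept so the update can correct it.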

Anchoring to the Reference: The k3 KL Term

Per-step drift is bounded by the clip; cumulative drift over thousands of steps is bounded by a separate per-token KL penalty against the frozen reference (SFT) policy. GRPO keeps PPO's formulation here, including the k3 unbiased estimator:

$$\hat D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)_{i, t} = \exp\!\big(\log r_{i, t}^{\text{ref}}\big) - \log r_{i, t}^{\text{ref}} - 1$$

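where $\log r_{i, t}^{\text{ref}} = \log \pi_\theta(a_{i, t} \mid s_{i, t}) - \log \pi_{\text{ref}}(a_{i, t} \mid s_{i, t})$. The k3 estimator is always non-negative and has lower variance than the naive $\log r$ estimator that early policy-gradient code used. One caveat on orientation: for tokens sampled from $\pi_\theta$, the form whose expectation equals $D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ exactly takes the log-ratio as $\log \pi_{\text{ref}} - \log \pi_\theta$ (this is how the DeepSeek-Math paper writes it). With the sign used here the two agree to second order in the log-ratio, so they stay close while the policy remains near the reference. It costs nothing extra to compute: the ingredients are the same $\log \pi_\theta$ and $\log \pi_{\text{ref}}$ we already need for the ratio.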
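A small exact check of the estimator's properties, on a hypothetical 3-token vocabulary (both distributions are made up). Expectations are computed by summing over the vocabulary rather than by sampling. The sketch evaluates the sign convention written above and, for comparison, the negated log-ratio $\log \pi_{\text{ref}} - \log \pi_\theta$.

```python
import numpy as np

def k3(log_r):
    """exp(log_r) - log_r - 1, elementwise; non-negative for any real log_r."""
    return np.exp(log_r) - log_r - 1.0

# Hypothetical next-token distributions over a 3-token vocabulary.
pi_theta = np.array([0.70, 0.20, 0.10])
pi_ref = np.array([0.50, 0.30, 0.20])

exact_kl = float(np.sum(pi_theta * np.log(pi_theta / pi_ref)))

# Exact expectations under tokens sampled from pi_theta (sum over the vocab).
log_r_text = np.log(pi_theta) - np.log(pi_ref)    # sign convention used in the text
log_r_flip = np.log(pi_ref) - np.log(pi_theta)    # negated log-ratio
e_k3_text = float(np.sum(pi_theta * k3(log_r_text)))
e_k3_flip = float(np.sum(pi_theta * k3(log_r_flip)))

print(f"exact KL(pi_theta || pi_ref) = {exact_kl:.4f}")
print(f"E[k3], text convention       = {e_k3_text:.4f}")
print(f"E[k3], negated log-ratio     = {e_k3_flip:.4f}")
print("all per-token values >= 0:", bool(np.all(k3(log_r_text) >= 0)))
```

Both orientations are non-negative per token. The negated log-ratio reproduces the exact KL in expectation, and the other one lands close to it when the two distributions are close, which is the regime the KL penalty is meant to keep the policy in.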

The coefficient $\beta_{\text{KL}}$ tends to be slightly higher in GRPO than in PPO-RLHF: DeepSeek-Math uses $\beta_{\text{KL}} = 0.04$, roughly double the PPO-RLHF default of 0.02. The reason: in PPO the value head provides indirect smoothing of the policy (a noisy value estimate produces noisy advantages which damp the policy gradient); GRPO has no such smoothing, so the KL term has to do more anchoring on its own.

| β_KL | Behaviour in GRPO |
| --- | --- |
| 0 | No anchor. Policy escapes to reward-hacked nonsense in 1–5k steps. |
| 0.01 – 0.02 | Aggressive. OK if the reward is robust (a strong verifier). |
| 0.04 (DeepSeek-Math default) | Standard. Realised KL ≈ 10 – 30 nats per response. |
| 0.05 – 0.1 | Conservative. Works for noisy or partial reward signals. |
| >0.5 | Policy barely moves. Use only when reward is very suspect. |

Putting It Together: The Full GRPO Objective

The complete GRPO loss for one prompt's group is the weighted sum of the clipped policy objective and the KL penalty, averaged over tokens within each rollout and then over rollouts within the group:

$$\mathcal{L}^{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\tau_i|} \sum_{t \in \tau_i} \left[ L_{i, t}^{\text{CLIP}}(\theta) - \beta_{\text{KL}} \, \hat D_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)_{i, t} \right]$$

Read the three layers of averaging from the inside out. The innermost is the per-token loss combining the clipped surrogate and the KL penalty. The middle is $\frac{1}{|\tau_i|}$, the average over the response tokens of rollout $i$. The outermost is $\frac{1}{G}$, the average over rollouts in the group. The leading minus turns the objective into a loss that gradient descent will minimise.

The double-mean (sequence-mean then group-mean) is a deliberate bookkeeping choice that gives every rollout equal weight in the final loss regardless of its length. It also introduces a length-bias artefact that the Dr. GRPO paper analyses in detail (short responses get effectively up-weighted relative to long ones because a small $|\tau_i|$ denominator inflates the contribution of every token in that rollout). The two principal variants in practice are:

| Variant | Token aggregation | Effect |
| --- | --- | --- |
| Vanilla GRPO (DeepSeek-Math) | mean over tokens, then mean over rollouts | Length-bias: shorter rollouts contribute more per token than longer ones. |
| Dr. GRPO | sum over tokens, divide by a fixed max-length constant | Removes length-bias by giving every token equal weight; common in recent math RL recipes. |
| Reinforce++ / Open-RLHF | sum over tokens, no per-sequence normalisation | Closer to vanilla REINFORCE; advantages must be carefully scaled. |
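The length-bias can be made concrete by computing the weight each token receives in the final scalar under the first two aggregation schemes. This is a minimal sketch on a hypothetical padded group of two rollouts; the function names are illustrative, and the fixed-length variant follows the table's description rather than any specific codebase.

```python
import numpy as np

def aggregate_vanilla(per_token_loss, mask):
    """Mean over each rollout's response tokens, then mean over rollouts."""
    per_rollout = (per_token_loss * mask).sum(axis=1) / mask.sum(axis=1)
    return per_rollout.mean()

def aggregate_fixed_length(per_token_loss, mask, max_len):
    """Sum over tokens divided by a fixed constant, then mean over rollouts."""
    per_rollout = (per_token_loss * mask).sum(axis=1) / max_len
    return per_rollout.mean()

# Hypothetical group of G = 2 padded rollouts: lengths 2 and 6, max length 6.
mask = np.array([[1, 1, 0, 0, 0, 0],
                 [1, 1, 1, 1, 1, 1]], dtype=np.float64)
per_token_loss = np.ones_like(mask)          # same loss at every real token

# Weight each token receives in the final scalar under the two schemes.
G, max_len = mask.shape
w_vanilla = mask / mask.sum(axis=1, keepdims=True) / G
w_fixed = mask / max_len / G
print("vanilla weights per token:", w_vanilla[:, 0])
print("fixed-length weights:     ", w_fixed[:, 0])
print(aggregate_vanilla(per_token_loss, mask),
      aggregate_fixed_length(per_token_loss, mask, max_len))
```

Under the double mean, a token in the 2-token rollout carries three times the weight of a token in the 6-token rollout; dividing by a fixed constant gives every real token the same weight.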

For the rest of this section we use vanilla GRPO — the formula above with sequence-mean then group-mean — because it is what DeepSeek-Math, DeepSeek-R1, and Qwen-2-Math ship by default. The engineering subsection below documents the length-bias trade-off you accept by choosing it.

Manual Numerical Walkthrough

We trace one full GRPO update by hand for G = 4 rollouts of T = 3 tokens each, all from the same prompt. Numbers small enough to check on paper; structure identical to a production step.

Step 1: the group of rollouts

Four rollouts have been sampled from the old policy on a single prompt. Each rollout consists of three response tokens; we list the action sequences and the log-probabilities the old policy assigned to them, plus the scalar reward each response received from the reward model (or verifier).

| rollout i | tokens (a_1, a_2, a_3) | old log π (sum over tokens) | reward R_i |
| --- | --- | --- | --- |
| 1 | (0, 2, 1) | −3.22 | +0.90 |
| 2 | (1, 0, 2) | −2.99 | +0.30 |
| 3 | (2, 1, 0) | −2.85 | −0.10 |
| 4 | (0, 1, 2) | −3.15 | +0.70 |

Step 2: the group baseline and standard deviation

Mean of the four rewards: $\bar R = (0.90 + 0.30 - 0.10 + 0.70) / 4 = 0.450$.

Population variance: $\sigma^2_R = \frac{1}{4}\big((0.90 - 0.45)^2 + (0.30 - 0.45)^2 + (-0.10 - 0.45)^2 + (0.70 - 0.45)^2\big) = \frac{1}{4}(0.2025 + 0.0225 + 0.3025 + 0.0625) = 0.1475$.

Standard deviation: $\sigma_R = \sqrt{0.1475} \approx 0.3841$.

Step 3: the per-rollout advantages

Subtract the mean and divide by the std (with $\epsilon = 10^{-8}$ in the denominator):

| rollout i | R_i | R_i − R̄ | A_i = (R_i − R̄) / σ_R |
| --- | --- | --- | --- |
| 1 | +0.90 | +0.450 | +1.171 |
| 2 | +0.30 | −0.150 | −0.390 |
| 3 | −0.10 | −0.550 | −1.431 |
| 4 | +0.70 | +0.250 | +0.651 |

Two checks worth doing in your head: $\sum_i A_i = +1.171 - 0.390 - 1.431 + 0.651 = +0.001 \approx 0$ (round-off). And the std of the advantages is $\sqrt{(1.171^2 + 0.390^2 + 1.431^2 + 0.651^2) / 4} \approx 1.00$. Both are guaranteed by the standardisation, but worth confirming once to internalise that the advantage scale is per-prompt comparable.

Step 4: broadcast to tokens

Every token of rollout $i$ receives the same scalar advantage $A_i$. The advantage tensor is now of shape $(G, T) = (4, 3)$ with each row constant:

| rollout i | A_{i,1} | A_{i,2} | A_{i,3} |
| --- | --- | --- | --- |
| 1 | +1.171 | +1.171 | +1.171 |
| 2 | −0.390 | −0.390 | −0.390 |
| 3 | −1.431 | −1.431 | −1.431 |
| 4 | +0.651 | +0.651 | +0.651 |
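Steps 2 to 4 can be reproduced in a few lines of NumPy. This is a sketch using the four rewards from Step 1; the variable names are illustrative.

```python
import numpy as np

rewards = np.array([0.90, 0.30, -0.10, 0.70])   # the four rewards from Step 1
T = 3                                            # tokens per rollout
EPS = 1e-8

mean = rewards.mean()
std = rewards.std()                  # population std (ddof=0), as in the text
adv = (rewards - mean) / (std + EPS)

# Step 4: copy each rollout's scalar to all T of its tokens -> shape (G, T).
adv_tokens = np.repeat(adv[:, None], T, axis=1)

print(f"mean = {mean:.3f}, std = {std:.4f}")
print("advantages:", np.round(adv, 2))
print("token-level shape:", adv_tokens.shape)
```

To two decimals the advantages come out as +1.17, −0.39, −1.43 and +0.65, matching the table up to rounding in the third decimal.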

Step 5: the policy ratio at one token

Take token $(i, t) = (1, 1)$, the first token of rollout 1. The old policy assigned it $\log \pi_{\text{old}} = -1.10$. The live policy (after one gradient step) now assigns it $\log \pi_{\text{new}} = -0.21$ (computed from the new logits via log-softmax; we will see this exact computation in the code below).

Log ratio: $\log r = -0.21 - (-1.10) = +0.89$.

Ratio: $r = e^{0.89} \approx 2.43$. Way outside the trust region $[0.8, 1.2]$.

Step 6: clipped surrogate at this one token

The two surrogates: $r \cdot A = 2.43 \cdot 1.171 = +2.845$ and $\mathrm{clip}(r, 0.8, 1.2) \cdot A = 1.2 \cdot 1.171 = +1.405$. The min picks the smaller: $+1.405$. The clip fired, so the gradient through $r_{1,1}$ at this token is zero: we have already pushed this rewarded token's probability far enough; do not push further.

Compare to rollout 3's first token where $A = -1.431$. Suppose its ratio after one step is $r = 1.6$. Then $r \cdot A = -2.290$ and $\mathrm{clip}(r, 0.8, 1.2) \cdot A = 1.2 \cdot (-1.431) = -1.717$; the min picks $-2.290$. The clip does NOT fire. The policy has moved in the wrong direction here (it raised the probability of a penalised token), and the pessimistic min keeps the full gradient so the update can push that probability back down. The clip only blocks further movement in the direction the advantage already favours; it never blocks a correction. This is exactly the PPO behaviour from § 14.6, carried over into GRPO unchanged.

Step 7: per-token KL at the same first token

The reference model assigned this token $\log \pi_{\text{ref}} = -1.05$. Then $\log r^{\text{ref}} = -0.21 - (-1.05) = +0.84$. The k3 estimator gives $\hat D_{\text{KL}} = e^{0.84} - 0.84 - 1 = 2.316 - 0.84 - 1 = 0.476$. With $\beta_{\text{KL}} = 0.04$, the KL contribution at this token is $0.04 \cdot 0.476 = 0.0190$.

Step 8: per-token loss and aggregation

Per-token loss at $(1, 1)$: $\ell = -L^{\text{CLIP}} + \beta_{\text{KL}} \hat D_{\text{KL}} = -1.405 + 0.0190 = -1.386$. Negative because the clipped surrogate is positive (rewarded action) and dominates the small positive KL.
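Steps 5 to 8 for this one token fit in a short script. It is a sketch that plugs in the walkthrough's own numbers (the advantage from Step 3 and the three log-probabilities from Steps 5 and 7) and uses the log-ratio sign convention of the text.

```python
import numpy as np

EPS_CLIP, BETA_KL = 0.2, 0.04

# Values for token (1, 1) taken from Steps 3, 5 and 7 of the walkthrough.
advantage = 1.171
logp_old, logp_new, logp_ref = -1.10, -0.21, -1.05

ratio = np.exp(logp_new - logp_old)                       # Step 5
surr_unclipped = ratio * advantage                        # Step 6
surr_clipped = np.clip(ratio, 1 - EPS_CLIP, 1 + EPS_CLIP) * advantage
l_clip = min(surr_unclipped, surr_clipped)

log_r_ref = logp_new - logp_ref                           # Step 7
kl = np.exp(log_r_ref) - log_r_ref - 1.0

loss = -l_clip + BETA_KL * kl                             # Step 8
print(f"ratio={ratio:.3f}  L_clip={l_clip:.3f}  kl={kl:.3f}  loss={loss:.3f}")
```

The clipped branch is the one selected by the min, so this token contributes a constant to the loss and no gradient through the ratio; only the KL term still pulls on it.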

Repeating this calculation across all $G \cdot T = 12$ tokens (every token of rollouts 1 and 4 yields a negative loss with magnitude bounded by the clip; rollouts 2 and 3 yield positive losses, so the model is penalised for sampling them and will reduce their probability), averaging within each rollout, then averaging across rollouts, produces the scalar loss that gets backpropped through the policy.

The walkthrough headline: the entire GRPO update is computable from the four rewards plus three sets of per-token log-probs (old policy, reference, and new policy). No value head, no MSE target, no GAE bookkeeping. The advantage is identical for every token within a rollout; tokens differ only through their own ratio and KL term. That is the entire algorithm.

Interactive Visualization: The Group Baseline

Drag the reward sliders below. Each row is one rollout from the same prompt; the live readouts show the group mean, the group std, and the per-rollout advantage. Toggle divide by std (vanilla GRPO) off to see the unscaled baseline-only variant — the colours stay (same sign of advantage) but the magnitudes spread out wildly. Drag a single reward up until it dominates: every other rollout's advantage flips red because the group mean shifts upward by that single change.

Two experiments worth running in the widget. First: set all four rewards equal (say all to 0.5). The std collapses to zero and every centred reward is zero as well, so every advantage is zero and the gradient signal vanishes; the $\epsilon$ in the denominator only keeps the division well defined. This is the “std collapse” failure mode and is exactly what happens in a real GRPO run when a binary verifier returns all-correct or all-wrong for a group. Recovery: raise sampling temperature so the rollouts diversify. Second: set one reward to +1 and the other three to 0. The +1 rollout gets a strongly positive advantage (about +1.73), the other three get equal, smaller negative advantages (about −0.58 each), and the gradient will push the policy hard toward the +1 rollout. This is the canonical “one winner in a group of G” pattern that DeepSeek-R1 reasoning training operates in for hours at a time.
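Both experiments can be run without the widget. The sketch below also adds a third, hypothetical case (a near-tie in the fourth decimal) to show what happens when the std is tiny but not zero; `group_advantages` is an illustrative helper.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

all_equal = group_advantages([0.5, 0.5, 0.5, 0.5])
one_winner = group_advantages([1.0, 0.0, 0.0, 0.0])
# Hypothetical near-tie: a reward model that differs only in the 4th decimal.
near_tie = group_advantages([0.5001, 0.5000, 0.5000, 0.5000])

print("all equal: ", all_equal)
print("one winner:", np.round(one_winner, 3))
print("near tie:  ", np.round(near_tie, 3))
```

Exactly equal rewards give all-zero advantages. The near-tie is standardised to roughly the same pattern as the one-winner group, so a difference that may be pure reward noise is treated as a full-strength signal; that is a reason to filter or down-weight groups whose std is very small.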

Plain Python: One GRPO Step From Scratch

We implement the GRPO loss in pure NumPy on a toy group: G = 4 rollouts, T = 3 tokens, V = 3 vocabulary entries. No autograd, no transformer — just the arithmetic an RL trainer runs every minibatch. Swap the hard-coded logits for a transformer forward pass and the code is unchanged.

GRPO loss for one group — pure NumPy
🐍grpo_loss_plain.py
Line 19: The group is the unit of work (G rollouts, all from ONE prompt)

📚 The defining choice of GRPO. Where PPO collects rollouts from many prompts and uses a value head to score each token's state, GRPO samples G full responses to the SAME prompt and uses the dispersion of their final rewards as the baseline. G is a knob (4 in toy experiments, 64 in DeepSeek-Math, 16 is a sweet spot). The cost: every prompt must be rolled out G times. The benefit: zero value-network parameters, zero value-network gradients, zero value-network optimizer state.

EXAMPLE
G=4, T=3, V=3 → tokens.shape = (4, 3); rewards.shape = (4,)
Line 21: tokens, the actions actually sampled by the old policy

📚 Per rollout i, tokens[i, t] is the token id sampled at position t. This is the same data shape PPO uses — what changes downstream is how it gets scored. In a real run these come from an inference engine (vLLM) producing G completions per prompt in one call; the engine returns the token ids and the per-token old_logp simultaneously.

EXAMPLE
rollout 1 sampled the sequence [0, 2, 1]; that is the action sequence we will grade
Line 27: old_logp, log probabilities under the OLD policy, frozen

📚 Shape (G, T). old_logp[i, t] = log pi_old(tokens[i, t] | tokens[i, <t], prompt). Frozen — no gradient flows through these numbers. They are needed for the importance ratio r_t = pi_new / pi_old. In production the inference engine returns these alongside the sampled tokens so the trainer does not have to re-forward the rollout through the old policy to get them.

Line 34: ref_logp, log probabilities under the REFERENCE (SFT) policy, also frozen

📚 Same shape (G, T) as old_logp, but computed from a frozen reference model — usually the SFT checkpoint that initialised the policy. Required for the per-token KL penalty that prevents the policy from drifting too far from the SFT manifold. In RLHF you compute it once per rollout (one forward pass of the reference model on the same tokens). The reference model is read-only; many implementations keep it on a separate device or even on CPU to save HBM.

Line 44: ONE scalar reward per rollout, the trajectory-level reward signal

📚 Shape (G,). This is the entire 'where does signal come from?' surface area in GRPO. For DeepSeek-R1-style rule-based training it is 1.0 if the final answer is correct and 0.0 otherwise. For RLHF it is the reward model's scalar score. For chain-of-thought rule-based RL it is a verifier's pass/fail. The crucial property is that the reward is ONE number per rollout, not per token, and yet the policy update still applies per token. The group baseline is what makes this work.

EXAMPLE
rewards = [+0.90, +0.30, -0.10, +0.70] → rollout 3 was the worst, rollout 1 the best
Line 52: Group mean, the GRPO baseline, computed without a neural network

📚 mean(R) is the trivial estimator of the policy's expected reward on this prompt. PPO uses V(s_t), a learned value head whose forward and backward passes cost as much as the policy itself; GRPO replaces V(s_t) with this one-line arithmetic. The trade-off: V(s_t) gives per-state precision and reduces variance more than a single mean, but it costs a critic. GRPO bets that with large enough G the group mean is precise enough.

EXAMPLE
mean = (0.90 + 0.30 - 0.10 + 0.70) / 4 = 0.450
Line 53: Group std, the per-prompt normaliser

📚 std(R) is the within-group dispersion. Dividing by it makes the advantage scale-free: an advantage of +1 means 'one std above the group mean for this prompt', regardless of whether this prompt's rewards live in [0, 1] or [-100, 100]. This is what lets a single KL coefficient and a single learning rate work across prompts of wildly different difficulty — the loss landscape is uniformised by per-prompt standardisation.

Line 57: The group-relative advantage, one scalar per rollout

📚 A_i = (R_i - mean) / (std + eps). The +eps prevents division by zero on a degenerate group where every rollout scored identically (a real edge case when the reward is binary and all G samples are correct OR all G are wrong; see 'std collapse' in the engineering section). A_i is positive for above-average rollouts, negative for below-average ones, and the group as a whole sums to ≈ 0 by construction.

EXAMPLE
advantages = [(0.90-0.45)/0.384, (0.30-0.45)/0.384, (-0.10-0.45)/0.384, (0.70-0.45)/0.384] ≈ [+1.17, -0.39, -1.43, +0.65]
Line 60: Broadcast the scalar advantage to EVERY token of the rollout

📚 The same A_i is assigned to all T tokens of rollout i. This is GRPO's other defining choice: a uniform per-token credit assignment. PPO with GAE gives each token its own advantage depending on local TD error; GRPO says 'this entire response was good, so credit every token in it equally'. The credit-assignment loss is real (a 200-token response might have one critical token and 199 noise tokens, but they all get the same A); the simplification is what makes the algorithm fit on one page.

EXAMPLE
advantages_per_rollout.shape = (4,); after broadcasting, advantages.shape = (4, 3) with each row constant
Line 69: new_logits, the NEW policy's scores at the SAME tokens

📚 In real code this is the live model's output: a transformer forward pass on the rollout's input_ids. Shape (G, T, V) where V is the vocab size. We then log_softmax and gather at the actions that were actually taken. Gradient flows from new_logp back through the model to its parameters — this is the ONLY tensor in the GRPO loss that the optimiser sees a gradient for.

EXAMPLE
new_logits.shape = (4, 3, 3); after log_softmax + gather → new_logp.shape = (4, 3)
Line 79: Numerically stable log-softmax (subtract max first)

📚 Standard log-sum-exp trick. Without subtracting the row max, large logits would overflow exp() (above roughly 88 in float32, roughly 709 in float64), producing inf and then NaN across the tensor. log_probs_all[i, t, v] is then log pi_new(v | s_{i,t}) for every (rollout, position, vocab) cell. After gathering at the sampled tokens we collapse this to new_logp[i, t] = log pi_new(tokens[i, t] | s_{i, <t}).

Line 87: Probability ratio r_t = pi_new / pi_old, same as PPO

📚 GRPO keeps PPO's importance ratio unchanged. Computed in log space first (log_ratio = log pi_new - log pi_old, bounded and safe), then exponentiated. r_t > 1 means the new policy now assigns this token MORE probability than the old; r_t < 1 means less; r_t = 1 means no change. The ratio is the off-policy correction that lets us take K gradient steps on data sampled by pi_old.

EXAMPLE
log pi_new[0,0] = log_softmax(new_logits[0,0])[tokens[0,0]] ≈ -0.207; log pi_old[0,0] = -1.10 → log_ratio = +0.893, ratio = 2.443
Line 94: The PPO clipped surrogate, borrowed wholesale into GRPO

📚 surr1 = ratio * advantage is the unclipped importance-weighted gradient. surr2 = clamp(ratio, 1-eps, 1+eps) * advantage caps the ratio inside the trust region BEFORE multiplying by the advantage. np.minimum picks the more pessimistic of the two. The clip is GRPO's only per-step safety brake (there is no clipped-value loss because there is no value head). EPSILON = 0.2 is the same default as PPO; the DeepSeek-Math paper reuses it without modification.

EXAMPLE
ratio[0,0] = 2.443, A[0,0] ≈ +1.17 → surr1 ≈ 2.86, surr2 = 1.2 * 1.17 ≈ 1.41; min ≈ 1.41 (clip fires)
Line 102: k3 KL estimator, to the REFERENCE policy, not the old policy

📚 D_KL(pi_new || pi_ref) per token, approximated by the unbiased k3 estimator: exp(log_r) - log_r - 1 where log_r = log pi_new - log pi_ref. Always non-negative, low variance, exactly KL in expectation. Crucially, the KL is computed against the REFERENCE (frozen SFT) model — NOT against the old policy. The clip already handles per-step old-vs-new drift; this KL handles the long-term cumulative drift from the SFT manifold that would otherwise drag the model into reward-hacked nonsense.

EXAMPLE
log pi_new[0,0] ≈ -0.207, ref_logp[0,0] = -1.05 → log_r_ref = 0.843, kl = exp(0.843) - 0.843 - 1 = 0.481
Line 113: Two means, not one: sequence-mean THEN group-mean

📚 First we average over T tokens within each rollout (axis=-1), giving G scalars — one loss per rollout. Then we average over G rollouts. This double-mean is deliberate: it makes the per-rollout contribution to the loss equal regardless of rollout length, which is the vanilla GRPO convention. The Dr. GRPO variant drops the sequence-mean and uses a sum over tokens (sum, divided by a fixed max length) — that variant trades the length bias for a different one. See 'engineering reality' below.

EXAMPLE
per_token_loss.shape = (4, 3); per_rollout_loss.shape = (4,); loss = scalar
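
The walkthrough uses equal-length rollouts, so the two normalisations cannot be told apart there. The sketch below uses two rollouts of different length to show how the per-token weight differs between the vanilla two-means convention and a fixed-length divisor in the style of Dr. GRPO. The loss values, the mask, and `MAX_LEN` are hypothetical and chosen only to make the arithmetic visible.

```python
import numpy as np

# Hypothetical per-token losses for G = 2 rollouts padded to T = 4.
per_token_loss = np.array([
    [1.0, 1.0, 0.0, 0.0],    # short rollout: 2 real tokens, 2 pad
    [1.0, 1.0, 1.0, 1.0],    # long rollout: 4 real tokens
])
mask = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
])
MAX_LEN = 4  # assumed fixed normaliser for the Dr. GRPO-style variant

masked = per_token_loss * mask
seq_len = mask.sum(axis=-1)                       # (G,) real tokens per rollout

# Vanilla GRPO: divide each rollout by its own length, then average rollouts.
vanilla_loss = (masked.sum(axis=-1) / seq_len).mean()
vanilla_weight_per_token = 1.0 / seq_len          # weight carried by each real token

# Dr. GRPO-style: divide every rollout by the same constant.
fixed_loss = (masked.sum(axis=-1) / MAX_LEN).mean()
fixed_weight_per_token = np.full_like(seq_len, 1.0 / MAX_LEN)

print("vanilla per-token weights:", vanilla_weight_per_token)
print("fixed   per-token weights:", fixed_weight_per_token)
print("vanilla loss:", vanilla_loss, " fixed loss:", fixed_loss)
```

Under the vanilla convention a token in the short rollout carries twice the weight of a token in the long one; with a fixed divisor every real token carries the same weight, so longer rollouts contribute more in total.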
"""
GRPO loss for ONE prompt with G rollouts: pure NumPy, no autograd.

A 'group' is the G full responses sampled from the OLD policy for the same
prompt. Each response gets a single scalar reward at the end (from a reward
model, a verifier, or any oracle). GRPO turns that group of scalar rewards
into per-TOKEN advantages by subtracting the group mean and dividing by the
group std. No value network, no GAE, no critic.

We pick a tiny vocab (3 tokens) and G = 4 rollouts of length T = 3 each so
that every intermediate tensor fits on the page.
"""

import numpy as np

# ---------------------------------------------------------------------------
# 1. One group: G = 4 rollouts of length T = 3 tokens each from one prompt.
#    tokens_i is the action sequence the OLD policy actually sampled.
#    old_logp_i is log pi_old(a_t | s_t) per token (frozen, no gradient).
#    ref_logp_i is log pi_ref(a_t | s_t) per token (frozen SFT model).
# ---------------------------------------------------------------------------

G, T, V = 4, 3, 3                    # group size, tokens per rollout, vocab

tokens   = np.array([
    [0, 2, 1],                       # rollout 1
    [1, 0, 2],                       # rollout 2
    [2, 1, 0],                       # rollout 3
    [0, 1, 2],                       # rollout 4
])

old_logp = np.array([
    [-1.10, -0.92, -1.20],
    [-0.69, -1.20, -1.10],
    [-1.00, -0.80, -1.05],
    [-1.05, -0.95, -1.15],
])

ref_logp = np.array([
    [-1.05, -0.90, -1.10],
    [-0.65, -1.18, -1.05],
    [-0.98, -0.82, -1.02],
    [-1.00, -0.92, -1.10],
])

# ---------------------------------------------------------------------------
# 2. ONE scalar reward per rollout. In real RLHF this is the reward model's
#    score; in DeepSeek-R1 it is 1.0 if the math answer is correct, else 0.
# ---------------------------------------------------------------------------

rewards = np.array([+0.90, +0.30, -0.10, +0.70])     # shape (G,)

# ---------------------------------------------------------------------------
# 3. GROUP-RELATIVE ADVANTAGE: the heart of GRPO.
#    For every rollout, A_i = (R_i - mean(R)) / (std(R) + eps).
#    Broadcast the SAME scalar A_i to all T tokens of that rollout.
# ---------------------------------------------------------------------------

mean = rewards.mean()
std  = rewards.std()                 # population std (divide by G)
eps  = 1e-8

advantages_per_rollout = (rewards - mean) / (std + eps)     # shape (G,)
advantages = np.broadcast_to(
    advantages_per_rollout[:, None],                        # (G, 1)
    (G, T),                                                 # (G, T)
).copy()

# ---------------------------------------------------------------------------
# 4. The NEW policy's logits at the SAME tokens (gradient flows here in
#    real code). Hard-coded "drifted" logits make the walkthrough readable.
# ---------------------------------------------------------------------------

new_logits = np.array([
    [[ 1.50, -0.10,  0.30], [ 0.20,  0.30,  1.50], [-0.10,  1.60,  0.20]],
    [[ 0.10,  1.50,  0.30], [ 1.40, -0.10,  0.30], [ 0.20,  0.30,  1.20]],
    [[-0.10,  0.30,  1.50], [ 0.30,  1.40, -0.10], [ 1.50, -0.10,  0.30]],
    [[ 1.40, -0.10,  0.30], [ 0.20,  1.30, -0.10], [-0.10,  0.30,  1.40]],
])

# Stable log-softmax, then pick log pi_new for the action that was actually taken.
z = new_logits - new_logits.max(axis=-1, keepdims=True)
log_probs_all = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
new_logp = np.take_along_axis(
    log_probs_all, tokens[..., None], axis=-1,
).squeeze(-1)                                               # shape (G, T)

# ---------------------------------------------------------------------------
# 5. Probability ratio r_t = pi_new / pi_old, computed in log space.
# ---------------------------------------------------------------------------

log_ratio = new_logp - old_logp
ratio     = np.exp(log_ratio)                               # shape (G, T)

# ---------------------------------------------------------------------------
# 6. PPO-style clipped surrogate, but with the GROUP advantage.
# ---------------------------------------------------------------------------

EPSILON = 0.2
surr1 = ratio * advantages
surr2 = np.clip(ratio, 1.0 - EPSILON, 1.0 + EPSILON) * advantages
clipped_surrogate = np.minimum(surr1, surr2)

# ---------------------------------------------------------------------------
# 7. k3 KL estimator to the REFERENCE policy (the SFT model), per token.
#    kl ≈ exp(log_r) - log_r - 1 with log_r = log pi_ref - log pi_new
#    (reference over policy); never negative, unbiased for
#    KL(pi_new || pi_ref) when tokens come from pi_new.
# ---------------------------------------------------------------------------

log_r_ref  = ref_logp - new_logp                            # log(pi_ref / pi_new)
kl_per_tok = np.exp(log_r_ref) - log_r_ref - 1              # shape (G, T)

# ---------------------------------------------------------------------------
# 8. Per-token loss, then sequence-mean within each rollout, then group-mean
#    across rollouts. Two means are deliberate; see "length bias" below.
# ---------------------------------------------------------------------------

BETA_KL = 0.04

per_token_loss = -clipped_surrogate + BETA_KL * kl_per_tok  # (G, T)
per_rollout_loss = per_token_loss.mean(axis=-1)             # mean over T, shape (G,)
loss = per_rollout_loss.mean()                              # mean over G, scalar

print(f"advantages = {advantages_per_rollout.round(3)}")
print(f"ratios     =\n{ratio.round(3)}")
print(f"loss       = {loss:.4f}")
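
One property of the group-relative advantage is worth checking by hand: because each group is standardised, rescaling or shifting the rewards leaves the advantages essentially unchanged. The sketch below reuses the walkthrough's four rewards; the helper name `group_advantage` and the 10x scale and +5 offset are illustrative choices, not part of the original listing.

```python
import numpy as np

def group_advantage(rewards, eps=1e-8):
    """Standardise one group's rewards: (R - mean) / (std + eps), population std."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([0.90, 0.30, -0.10, 0.70])    # the group from the walkthrough
adv = group_advantage(rewards)

# Hypothetical affine change of reward units: 10x scale, +5 offset.
adv_rescaled = group_advantage(10.0 * rewards + 5.0)

print("advantages          :", adv.round(3))
print("mean, std           :", adv.mean().round(6), adv.std().round(6))
print("max change on rescale:", np.abs(adv - adv_rescaled).max())
```

The only residual difference comes from the small `eps` in the denominator, which is why reward scale is not a tuning knob in GRPO the way it is for a learned value baseline.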

PyTorch: A Production-Shaped GRPO Step

Now the full thing: clipped surrogate, per-token KL to a reference model, gradient clipping, sequence-mean then batch-mean, and the four diagnostics that tell you whether the run is healthy. This is the shape of the inner loop found in trainers such as OpenRLHF's GRPO trainer and TRL's GRPOTrainer. Supply any HF causal LM as policy and a frozen copy of the SFT model as ref_policy, and each call performs one real GRPO update.

GRPO step — group advantage + PPO clip + k3 KL
🐍grpo_step_torch.py
Four hyperparameters define a GRPO run

📚 EPSILON = 0.2 is the same PPO clip range; GRPO papers reuse it without modification. BETA_KL = 0.04 is DeepSeek-Math's chosen KL coefficient — twice the typical PPO-RLHF value (0.02) because GRPO has no value head smoothing the policy and the KL term is doing more work alone. MAX_GRAD = 1.0 is the universal clip-norm default. GROUP_G = 16 is the most common group size in production — small enough that the rollout cost stays manageable, large enough that mean(R) and std(R) are not noise.

The GRPO step signature: two models, one optimizer

📚 Only two models live in memory on the training cluster: 'policy' (trainable) and 'ref_policy' (frozen). No value head and no reward model. The reward model runs on a separate device or has already been replaced by a rule-based verifier (DeepSeek-R1) — its score arrives as the pre-computed reward_per_seq tensor. This is the 'two-model dance' that GRPO promises versus PPO's 'four-model dance' (see § 14.6).

Unpack the rollout minibatch: note B = N_prompts * G

📚 Every batch row is one (prompt, response) pair. To compute the group statistics we will reshape the flat batch into (N_prompts, G) so each row contains exactly one prompt's G rollouts. The collator must guarantee that the G rollouts for the same prompt are contiguous in batch dimension — getting this layout wrong silently corrupts the group baseline and is the #1 GRPO-specific bug.

EXAMPLE
If batch has N=4 prompts × G=16 rollouts → B=64 sequences; input_ids.shape = (64, S)
Reshape rewards to (N, G): the group is now explicit

📚 reward_per_seq has shape (B,); reshape to (N, G) so each row is one prompt's group. Then mean(dim=1) and std(dim=1) compute the per-prompt baseline and dispersion in a single vectorised call across the whole batch. unbiased=False uses the population std (divide by G), matching the DeepSeek-Math reference implementation; using unbiased=True (divide by G-1) is a common drop-in but subtly changes the advantage scale and shifts the effective KL coefficient — pick one and stick to it.

EXAMPLE
B=64, GROUP_G=16 → N=4; r.shape = (4, 16); r_mean.shape = (4, 1); r_std.shape = (4, 1)
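
How much does the choice between population and sample std actually move the advantages? A minimal sketch, assuming one group of G = 16 randomly generated rewards (the values are synthetic, drawn from a seeded generator): the two conventions differ by a constant factor that depends only on G.

```python
import numpy as np

G = 16
rng = np.random.default_rng(0)
rewards = rng.normal(size=G)                      # synthetic rewards for one group

centred = rewards - rewards.mean()
adv_population = centred / rewards.std(ddof=0)    # divide by G     (unbiased=False)
adv_sample     = centred / rewards.std(ddof=1)    # divide by G - 1 (unbiased=True)

scale = adv_sample / adv_population               # same value for every rollout
print("observed scale :", scale[0])
print("sqrt((G-1)/G)  :", np.sqrt((G - 1) / G))
```

Since the factor is sqrt((G-1)/G), the difference shrinks as G grows, but at small group sizes it is large enough to shift the balance between the surrogate term and the KL penalty.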
Group-relative advantage: broadcast per-sequence to per-token

📚 adv_per_seq has shape (N, G) — one scalar advantage per (prompt, rollout). view(B, 1).expand(B, S) repeats that scalar to every token of the sequence. Every token of the same rollout shares one advantage; every token of a different rollout (even on the same prompt) sees a different one. The expand is memory-free — PyTorch broadcasts without copying.

EXAMPLE
adv_per_seq.shape = (4, 16) → reshape to (64, 1) → expand to (64, S). On row 0 every column is the same scalar.
Forward the live policy: gradient flows from here only

📚 use_cache=False avoids building a KV cache that a training forward pass never reads (Hugging Face models also warn that the cache is incompatible with gradient checkpointing). The (B, S-1, V) logits are bf16; with V=128k and S=1024 that is about 0.26 GB per sequence, so roughly 4 GB for a 16-sequence minibatch. We slice [:, :-1, :] because the model at position t predicts the token at position t+1: there is no successor to predict for the last position.

Per-token log-probability of the action that was taken

📚 log_softmax → gather is the standard two-step. gather along the vocab dim with targets gives us, for each (batch, time) cell, the log-prob of the token that was sampled at that position. Squeezing the last dim collapses the shape to (B, S-1). new_logp[b, t] = log pi_new(a_t | s_{b, <t}). The same tensor feeds the ratio AND the KL — one forward pass, two uses.

Importance ratio: log space first, exponentiate second

📚 log_ratio = log pi_new − log pi_old is bounded and numerically safe. Exponentiating gives the actual ratio. NEVER compute ratio = pi_new / pi_old directly — both can underflow to zero on long sequences and you lose precision exactly where the ratio matters most (near 1). Note we slice old_logp[:, 1:] to align with the predict-next-token shape of new_logp.

EXAMPLE
Typical first GRPO step: log_ratio ≈ 0 ± 0.05 → ratio ≈ 1.0 ± 0.05
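
The underflow concern is easiest to see at the sequence level, where log-probabilities are summed over many tokens. The sketch below uses two hypothetical summed log-probs (`-800.0` and `-800.5`, invented for illustration) and compares dividing probabilities directly with subtracting in log space.

```python
import numpy as np

# Hypothetical summed log-probs of one long response under each policy.
seq_logp_new = -800.0
seq_logp_old = -800.5

# Direct division: both probabilities underflow to 0.0 in float64.
p_new = np.exp(seq_logp_new)
p_old = np.exp(seq_logp_old)
with np.errstate(invalid="ignore", divide="ignore"):
    naive_ratio = np.float64(p_new) / np.float64(p_old)   # 0 / 0

# Log space: subtract first, exponentiate once.
safe_ratio = np.exp(seq_logp_new - seq_logp_old)

print("p_new, p_old :", p_new, p_old)
print("naive ratio  :", naive_ratio)
print("safe ratio   :", safe_ratio)
```

Per-token probabilities rarely underflow on their own, but lower-precision dtypes and sequence-level ratios make the log-space form the safer default everywhere.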
Clipped surrogate: same as PPO, advantage is the only thing that changed

📚 surr1 = ratio * advantage; surr2 = clamp(ratio, 1-ε, 1+ε) * advantage; min of the two is the pessimistic bound. The asymmetry of the clip (no gradient when the ratio drifts in the rewarded direction past 1±ε; full gradient when drifting away from a penalty) is unchanged from PPO — GRPO inherits PPO's safety brake free. We slice advantages[:, 1:] to match the (B, S-1) shape after the predict-next-token offset.
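
The asymmetry is easier to follow as a four-way case table. The sketch below evaluates the clipped surrogate for one token at a time and reports whether any gradient would reach the ratio; the helper `clipped_term` and the sample ratios 1.5 and 0.5 are illustrative, not part of the original listing.

```python
import numpy as np

EPSILON = 0.2

def clipped_term(ratio, advantage, eps=EPSILON):
    """Return (objective, gradient_flows) for one token of the clipped surrogate."""
    surr1 = ratio * advantage
    surr2 = float(np.clip(ratio, 1.0 - eps, 1.0 + eps)) * advantage
    objective = min(surr1, surr2)
    # The gradient w.r.t. the ratio vanishes only when the ratio is outside
    # the clip range AND the clipped branch is the one selected by min().
    outside = not (1.0 - eps <= ratio <= 1.0 + eps)
    gradient_flows = not (outside and surr2 <= surr1)
    return objective, gradient_flows

cases = [
    ("A > 0, ratio above range", 1.5, +1.0),
    ("A > 0, ratio below range", 0.5, +1.0),
    ("A < 0, ratio above range", 1.5, -1.0),
    ("A < 0, ratio below range", 0.5, -1.0),
]
results = {name: clipped_term(r, a) for name, r, a in cases}
for name, (obj, flows) in results.items():
    print(f"{name:28s} objective = {obj:+.2f}  gradient flows: {flows}")
```

The two cases without gradient are the ones where the policy has already moved far in the direction the advantage favours; the two with gradient are the ones where it has moved the wrong way and still needs correcting.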

k3 KL to the reference: the long-term anchor

📚 D_KL(pi_new || pi_ref) per token, k3 estimator: (exp(log_r) - 1) - log_r where log_r = log pi_ref - log pi_new (reference over policy, as in the DeepSeek-Math objective). Always non-negative, low variance, and unbiased for the true KL when tokens are sampled from the current policy. The reference is the frozen SFT model, same as PPO-RLHF. This penalty is the second safety brake (the first is the clip): without it the policy escapes to rewarded-but-unnatural text. With BETA_KL = 0.04, a steady-state KL of 10–30 nats per response is healthy; >100 nats means runaway and you should raise BETA_KL.

Per-token loss = −objective + β · KL: one tensor combines both signals

📚 The negative of the clipped surrogate IS the policy loss (we MAXIMISE the surrogate, MINIMISE its negation). Adding BETA_KL * kl_per_tok glues the two signals into a single per-token loss that we can mask, average, and backprop in one shot. Both terms are (B, S-1) so the elementwise sum works without reshapes.

Sequence-mean then batch-mean: vanilla GRPO's two-means convention

📚 First sum over masked tokens divided by per-sequence response length → one loss per sequence (B,). Then mean over the batch → scalar. This gives every rollout equal weight in the final loss regardless of its length. The Dr. GRPO variant replaces the per-sequence divisor with a fixed max-length constant, removing the implicit upweighting of short rollouts that vanilla GRPO carries. Both are used in practice — pick one and document it.

Single combined backward + clip + step

📚 loss.backward() populates .grad on every parameter of the policy (and only the policy: ref_policy has requires_grad=False and never accumulates). clip_grad_norm_ rescales gradients so the joint L2 norm is at most MAX_GRAD = 1.0. Always pass a positive max_norm; without grad-norm clipping a GRPO run is prone to diverging to NaN even with a well-tuned KL and LR.

Diagnostics: the four numbers that tell you if GRPO is healthy

📚 approx_kl is the empirical per-step policy drift (old vs new). clipfrac is the fraction of tokens where the ratio left the clip range; healthy at 0.10 to 0.30. kl_ref is the realised drift from the SFT reference; healthy at 5 to 30 nats per response. Note that the code logs kl_ref as a per-token mean, so multiply it by the mean response length before comparing against that range. adv_var is the within-group variance of advantages: if it COLLAPSES to ~0 across the batch, every group has identical rewards (binary verifier, all-correct or all-wrong) and the gradient signal vanishes; raise temperature or change the prompt distribution. These four numbers, watched together, are the difference between debugging in real time and debugging from a post-mortem.

EXAMPLE
Healthy step: approx_kl ≈ 0.02, clipfrac ≈ 0.18, kl_ref ≈ 12 nats, adv_var ≈ 0.5
"""
PyTorch GRPO step: the inner loop of DeepSeek-R1, Qwen-2 math, Llama-3 RLHF
when run in critic-free mode. Two models live in memory (policy, ref):
no critic, and no reward model on the training cluster.

A 'rollout batch' is a list of N prompts, each rolled out to G full responses
by the OLD policy (frozen snapshot). For each (prompt, response) token we have:
   - old_logp      : log pi_old(a_t | s_t)            (no gradient)
   - ref_logp      : log pi_ref(a_t | s_t)            (no gradient)
   - reward_per_seq : scalar reward per FULL response (no gradient)

K epochs of mini-batched updates on this same rollout follow. Each minibatch
re-forwards the rollout tokens through the LIVE policy and computes the loss
below.
"""

import torch
import torch.nn.functional as F

EPSILON  = 0.2          # PPO-style ratio clip (same default as PPO)
BETA_KL  = 0.04         # per-token KL to reference; DeepSeek-Math uses 0.04
MAX_GRAD = 1.0          # global grad-norm clip
GROUP_G  = 16           # G rollouts per prompt

def grpo_step(batch, policy, ref_policy, optimizer):
    """
    batch is one minibatch from the rollout, already on device. Tensors are
    grouped: B = N_prompts * G, then reshape (N, G, S) to compute group stats.

      input_ids       : (B, S)        prompt + response tokens
      attn_mask       : (B, S)
      action_mask     : (B, S)        1 on RESPONSE tokens, 0 on prompt/pad
      old_logp        : (B, S)        log pi_old at each token (frozen)
      ref_logp        : (B, S)        log pi_ref at each token (frozen)
      reward_per_seq  : (B,)          one scalar per (prompt, response)
      prompt_id       : (B,)          which prompt each sequence belongs to

    ref_policy is not called here: its log-probs arrive precomputed as
    ref_logp. It stays in the signature to make the two-model setup explicit.
    """
    input_ids      = batch["input_ids"]
    attn_mask      = batch["attn_mask"]
    action_mask    = batch["action_mask"]
    old_logp       = batch["old_logp"]
    ref_logp       = batch["ref_logp"]
    reward_per_seq = batch["reward_per_seq"]
    B, S = input_ids.shape

    # --- Group-relative advantage (the GRPO core) ------------------------
    # Reshape (B,) -> (N, G) so each row is one prompt's G rollouts.
    assert B % GROUP_G == 0, "batch must be a whole number of groups"
    N = B // GROUP_G
    r = reward_per_seq.view(N, GROUP_G)                          # (N, G)
    r_mean = r.mean(dim=1, keepdim=True)                         # (N, 1)
    r_std  = r.std(dim=1, keepdim=True, unbiased=False)          # (N, 1)
    adv_per_seq = (r - r_mean) / (r_std + 1e-8)                  # (N, G)
    advantages  = adv_per_seq.view(B, 1).expand(B, S)            # (B, S)

    # --- Forward the LIVE policy on the rollout --------------------------
    out = policy(input_ids=input_ids, attention_mask=attn_mask, use_cache=False)
    logits = out.logits[:, :-1, :]                               # predict t+1 from <=t
    targets = input_ids[:, 1:]
    am      = action_mask[:, 1:].float()

    log_probs_all = F.log_softmax(logits, dim=-1)
    new_logp = log_probs_all.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # --- PPO-style clipped surrogate, group advantage --------------------
    log_ratio = new_logp - old_logp[:, 1:]
    ratio     = log_ratio.exp()

    surr1 = ratio * advantages[:, 1:]
    surr2 = torch.clamp(ratio, 1 - EPSILON, 1 + EPSILON) * advantages[:, 1:]
    policy_objective = torch.min(surr1, surr2)

    # --- k3 KL to reference, per token -----------------------------------
    # log_r = log pi_ref - log pi_new (reference over policy).
    log_r_ref   = ref_logp[:, 1:] - new_logp
    kl_per_tok  = (log_r_ref.exp() - 1) - log_r_ref              # ≥ 0
    per_token   = -policy_objective + BETA_KL * kl_per_tok       # (B, S-1)

    # --- Two means: sequence-mean THEN group/batch-mean ------------------
    # Masked mean over tokens within each sequence.
    seq_len     = am.sum(dim=-1).clamp(min=1.0)                  # (B,)
    seq_loss    = (per_token * am).sum(dim=-1) / seq_len         # (B,)
    loss        = seq_loss.mean()                                # scalar

    # --- One gradient step ------------------------------------------------
    loss.backward()
    gn = torch.nn.utils.clip_grad_norm_(policy.parameters(), MAX_GRAD)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    # --- Diagnostics: the four numbers an RL engineer watches ------------
    # kl_ref and approx_kl are per-token means over response tokens.
    with torch.no_grad():
        approx_kl = ((old_logp[:, 1:] - new_logp) * am).sum() / am.sum().clamp(min=1.0)
        clipfrac  = (((ratio - 1.0).abs() > EPSILON).float() * am).sum() / am.sum().clamp(min=1.0)
        kl_ref    = (kl_per_tok * am).sum() / am.sum().clamp(min=1.0)
        adv_var   = adv_per_seq.var(dim=1).mean()                # within-group spread

    return {
        "loss":       loss.detach(),
        "kl_ref":     kl_ref.detach(),
        "approx_kl":  approx_kl.detach(),
        "clipfrac":   clipfrac.detach(),
        "grad_norm":  gn.detach(),
        "adv_var":    adv_var.detach(),
        "reward":     reward_per_seq.mean().detach(),
    }

At Massive Scale: What Happens at 70B+

The same code scales to a 70B GRPO run, but five forces change in character.

Memory: the value head was half the cost

At 70B, each model is ~140 GB in bf16. PPO needs four such models (policy, value, ref, reward) plus optimiser state for the two trainable ones: 4 × 140 GB + 2 × 840 GB ≈ 2.2 TB before gradients and activations. GRPO needs two (policy, ref) plus optimiser state for just the policy: 2 × 140 GB + 840 GB ≈ 1.1 TB. Cutting the trainable-model count from 2 to 1 is the saving that matters: AdamW state is ~12 bytes per trainable parameter (fp32 master copy plus fp32 m and v), so eliminating the value head's optimiser state alone saves ~840 GB. This is what makes 70B GRPO runs feasible on the same hardware that struggles with 70B PPO-RLHF.
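
The accounting in this paragraph can be reproduced in a few lines. This is a back-of-envelope sketch under the paragraph's own assumptions: bf16 weights at 2 bytes per parameter, AdamW state at 12 bytes per trainable parameter, and nothing else counted (no gradients, no activations, no sharding overhead). The function name `footprint_gb` is illustrative.

```python
N_PARAMS = 70e9        # 70B parameters per model
BYTES_WEIGHTS = 2      # bf16 weights
BYTES_ADAMW = 12       # fp32 master copy + fp32 m + fp32 v

def footprint_gb(n_models, n_trainable):
    """Weights for every resident model plus AdamW state for trainable ones, in GB."""
    weights = n_models * N_PARAMS * BYTES_WEIGHTS
    optimiser = n_trainable * N_PARAMS * BYTES_ADAMW
    return (weights + optimiser) / 1e9

ppo_gb = footprint_gb(n_models=4, n_trainable=2)    # policy, value, ref, reward
grpo_gb = footprint_gb(n_models=2, n_trainable=1)   # policy, ref
value_head_optimiser_gb = N_PARAMS * BYTES_ADAMW / 1e9

print(f"PPO  : {ppo_gb:,.0f} GB")
print(f"GRPO : {grpo_gb:,.0f} GB")
print(f"value-head optimiser state alone: {value_head_optimiser_gb:,.0f} GB")
```

Adding bf16 gradients for each trainable model would raise both totals, but the gap between the two stays dominated by the second set of optimiser state.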

The rollout-vs-train balance shifts further toward rollout

PPO already spent ~85% of wall-clock time on rollout (§ 14.6). GRPO multiplies the rollout cost by G — for G = 16 a single prompt now demands 16 full responses. The optimisation phase, on the other hand, is cheaper because there is no value-head gradient and no value-loss computation. Production GRPO systems push the rollout fraction to 90%+ and architect around that: dedicated vLLM rollout clusters, async weight syncs, and batched-prompt servers that emit all G rollouts of a prompt in one forward call. The architectural choice that PPO made under pressure (separate rollout and training clusters) becomes the only viable choice in GRPO.

Distributed group statistics: an all-reduce per group

If the G rollouts of a single prompt are split across FSDP ranks, the group mean and std must be computed with an all-reduce within each group's rank set, not over the global batch. In practice every production GRPO collator keeps each prompt's G rollouts on a single rank so the group statistics are a local operation — the “contiguous-group” layout rule. Get this wrong and you compute statistics over a mixture of prompts, which is mathematically a different and much worse algorithm. The first thing to assert in any custom GRPO collator is prompt_id[i*G:(i+1)*G].unique().numel() == 1 for every group.
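
The contiguous-group assertion described above can be sketched as a small collator-side check. This version uses NumPy so it runs without a GPU; the helper names `check_group_layout` and `regroup` are hypothetical, and a real collator would apply the same permutation to every tensor in the batch.

```python
import numpy as np

def check_group_layout(prompt_id, group_size):
    """Raise if any block of `group_size` consecutive rows mixes prompts."""
    prompt_id = np.asarray(prompt_id)
    if prompt_id.size % group_size != 0:
        raise ValueError("batch must be a whole number of groups")
    groups = prompt_id.reshape(-1, group_size)
    mixed = np.flatnonzero((groups != groups[:, :1]).any(axis=1))
    if mixed.size:
        raise ValueError(f"groups {mixed.tolist()} mix rollouts from different prompts")

def regroup(prompt_id):
    """Return a row permutation that makes same-prompt rollouts contiguous."""
    return np.argsort(np.asarray(prompt_id), kind="stable")

good = np.array([7, 7, 3, 3])
bad = np.array([7, 3, 7, 3])

check_group_layout(good, group_size=2)          # passes silently
try:
    check_group_layout(bad, group_size=2)
    bad_detected = False
except ValueError:
    bad_detected = True

fixed = bad[regroup(bad)]                       # apply the same order to all tensors
check_group_layout(fixed, group_size=2)
print("interleaved layout detected:", bad_detected, "| regrouped:", fixed)
```

Running the check once per batch is cheap compared with a forward pass, and it turns a silent statistical error into an immediate failure.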

Reward bottleneck disappears when the reward is a verifier

For DeepSeek-R1 the reward is not a learned model; it is a deterministic verifier (math answer checker, code executor, JSON validator). That verifier typically runs on CPU, with no GPU memory at all. The “reward model bottleneck” that defines PPO-RLHF (the fourth 70B model that must be queried per rollout) simply vanishes. This is why GRPO + verifier is the dominant recipe for reasoning RL in 2024-2025: the cost structure is fundamentally simpler than PPO-RLHF, and the per-example signal-to-noise is higher because a verifier is much harder to reward-hack than a learned reward model.

Sampling temperature becomes a load-bearing hyperparameter

In PPO, sampling temperature affects exploration but not the algorithm's stability. In GRPO it affects both. If temperature is too low, every rollout in a group is nearly identical, the reward variance collapses, the std $\sigma_R \to 0$, and the gradient signal vanishes. If temperature is too high, rollouts become garbage, rewards are uniformly low, and the policy gets pushed toward those uniform-low responses. The DeepSeek-Math sweet spot is $T = 0.6$ to $T = 1.0$ with a small top-p cutoff; every production GRPO trainer logs adv_var per step and is prepared to bump temperature if it collapses.

Engineering Reality: Length Bias, Std Collapse, and Other Knobs

After a season of GRPO runs every team converges on the same set of failure modes and the same set of fixes. The list is shorter than PPO's because the algorithm has fewer moving parts.

  • Std collapse. Symptom: training stalls and adv_var drops to ~1e-6 in the diagnostics. Cause: every rollout in a group received the same reward (binary verifier returns all-correct or all-wrong), so $\sigma_R \to 0$ and every advantage in the group is zero; the $\epsilon$ floor only prevents a division by zero. Fix: raise sampling temperature so the rollouts diversify; or use a richer reward (partial credit instead of binary) so equal scores are rarer; or filter out collapsed groups entirely from the loss (most production trainers do this).
  • Length bias. Symptom: the policy converges to short, terse responses regardless of prompt. Cause: vanilla GRPO's sequence-mean normalisation upweights short rollouts in the loss; a short rewarded rollout contributes more per-token gradient than a long rewarded one, so the policy learns to prefer brevity. Fix: switch to Dr. GRPO (sum over tokens divided by a fixed max-length constant); or add an explicit length-target reward.
  • KL exploding. Symptom: kl_ref grows above 100 nats per response and never recovers, eval quality collapses. Cause: $\beta_{\text{KL}}$ is too small for the chosen reward signal. Fix: raise it to 0.05 – 0.1 and restart from a pre-collapse checkpoint. Many teams use an adaptive controller that adjusts $\beta_{\text{KL}}$ to hold realised KL at a target value (e.g. 20 nats).
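  • KL stuck near zero. Symptom: reward barely moves, kl_ref stays under 1 nat. Cause: policy is not changing; either LR is too low, or grad_norm is being clipped to near zero, or there is a reference-vs-policy aliasing bug (both models point to the same weights). Fix: assert that ref_policy is not policy and that the two do not share parameter storage, for example next(ref_policy.parameters()).data_ptr() != next(policy.parameters()).data_ptr(). Comparing id(model.state_dict()) proves nothing, because state_dict() builds a new dict on every call. Also check that the reference's parameters have requires_grad=False.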
  • Group layout wrong. Symptom: reward seems to go up but eval gets worse. Cause: the batch is not laid out contiguously by prompt, so the group statistics are computed over a mixture of prompts. Fix: explicitly enforce contiguous-group layout in the collator and assert that prompt_id[i*G:(i+1)*G] is a single value for every $i$.
  • Clipfrac too high. Symptom: clipfrac > 0.5 per step. Cause: the policy is moving too fast — usually too many GRPO epochs per rollout (K > 2) or LR too high. Fix: drop K to 1 (DeepSeek-R1 uses a single epoch) or halve LR. The clip is doing its job, but you are wasting compute on gradients that are being clipped to zero.
  • Rollout decoding mismatch. Symptom: ratio distribution centred far from 1.0 on the very first step before any gradient has been taken. Cause: the rollout used a sampler that the trainer cannot reproduce: top-p, temperature, or stop-token logic differs between vLLM and the live forward pass. Fix: the trainer does not need to re-sample anything; it only scores the tokens that were sampled, so make sure the logit transform it applies (temperature, masking) matches what the rollout engine used when it emitted old_logp. trl, OpenRLHF, and DeepSpeed-Chat all have explicit code for this.
  • Memory blowup on the reference model. Symptom: OOM after enabling reference KL. Cause: holding two full bf16 copies of a 70B policy on the same GPUs. Fix: shard the reference with FSDP or place it on CPU and run reference forwards once per rollout (it is not in the gradient path, so the latency hit is tolerable). LoRA tricks that share weights between the policy and the reference (with a different adapter for each) are increasingly common.
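
Of the fixes listed above, filtering collapsed groups is the one most easily expressed in code. The following is a minimal NumPy sketch rather than any trainer's actual implementation: the function names, the `min_std` threshold, and the per-group loss values are hypothetical, and `collapse_probability` assumes each rollout succeeds independently with the same probability, which real rollouts only approximate.

```python
import numpy as np

def group_advantages_with_filter(rewards, eps=1e-8, min_std=1e-6):
    """rewards: (N, G). Returns advantages (N, G) and a keep-mask (N,)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    adv = (rewards - mean) / (std + eps)
    keep = std[:, 0] > min_std           # False for groups with identical rewards
    return adv, keep

def collapse_probability(p_correct, group_size):
    """Chance a binary-reward group is all-correct or all-wrong (independence assumed)."""
    return p_correct ** group_size + (1.0 - p_correct) ** group_size

rewards = np.array([
    [1.0, 0.0, 1.0, 0.0],   # mixed outcomes: carries signal
    [1.0, 1.0, 1.0, 1.0],   # all correct: collapsed
    [0.0, 0.0, 0.0, 0.0],   # all wrong: collapsed
])
adv, keep = group_advantages_with_filter(rewards)

per_group_loss = np.array([0.3, 0.1, 0.2])        # hypothetical per-group losses
filtered_loss = per_group_loss[keep].mean() if keep.any() else 0.0

print("keep mask     :", keep)
print("filtered loss :", filtered_loss)
print("P(collapse) at p=0.5, G=4:", collapse_probability(0.5, 4))
```

Averaging only over kept groups stops collapsed groups from diluting the gradient with zeros; the collapse probability shows why very easy or very hard prompts waste rollouts, since it approaches one as the success rate nears 0 or 1.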
The mental model that unifies the section: GRPO is PPO with one substitution. Replace the value head with the group mean. Replace the value head's normalisation with the group std. Everything else (the importance ratio, the clipped surrogate, the k3 KL to a reference) survives unchanged. Two models live in memory instead of four. The cost moves from per-step trainable parameters to per-prompt inference samples. For LLM tasks where the per-prompt rollout is cheap (vLLM, async batching) and the reward is a verifier (no learned reward model), GRPO is usually the better trade than PPO. For everything else it is a one-line decision: how much variance can you tolerate in exchange for a free 70B model?