Section 15.2 derived GRPO from PPO by eliminating the critic and replacing the value baseline with a per-prompt z-score over a group of sampled completions. That derivation gives you a loss function with four exposed knobs: the group size $G$, the PPO clip range $\epsilon$, the KL coefficient $\beta$, and the learning rate $\eta$. The algorithm is now exactly as good as the numbers you put into those knobs. DeepSeek R1 picked a specific combination, parts of which are unusual compared to classic PPO defaults, and that combination is much of what made the run successful. This section is about why those numbers.
The thesis of this section. GRPO is not a single algorithm; it is a family of algorithms parameterised by $(G, \epsilon, \beta, \eta)$. DeepSeek R1 sits at $G = 16$, $\epsilon = 0.2$, $\beta = 10^{-3}$, $\eta = 10^{-6}$. Each of those values trades a specific failure mode against a specific cost, and the four together define a stable corridor in which long-chain reasoning can emerge from a rule-based reward.
The Real Problem: Hyperparameters Are the Whole Algorithm
Most papers describe an algorithm in equations and bury the hyperparameters in an appendix. For RL fine-tuning of large language models, that ordering is wrong. The equations are common to PPO, GRPO, and their cousins; the numbers determine whether the run converges to a frontier reasoning model or collapses to repetitive nonsense within a few hundred steps. The R1 release made this explicit: the supplementary material is dominated not by new math but by the exact numerical settings of an otherwise standard GRPO loop. Reproducing R1 means matching the numbers, not just the symbols.
Five failure modes lurk at the boundaries:
| Failure mode | Hyperparameter regime | What it looks like |
|---|---|---|
| Reward hacking / format collapse | beta too small AND epsilon too large | Model finds a degenerate response that passes the checker and repeats it |
| No learning | beta too large OR LR too small | Loss curve is flat; KL hugs zero; policy never moves |
| Variance blow-up | G too small | Loss bounces 10x; gradient norm spikes; updates fight noise |
| Long-tail truncation | MAX_LEN too small | Aha-moment reasoning gets cut off; reward is wrongly attributed to truncation |
| KL drift | beta reasonable but pi_ref never refreshed | After ~20K steps the KL penalty dwarfs the surrogate; learning stalls |
Every one of these failure modes is visible in the loss curve within a few hundred steps if you know which curve to watch. The R1 hyperparameter choices were calibrated to keep the run inside a narrow band where none of them fires. Understanding those choices is the difference between a paper you can read and a paper you can reproduce.
Intuition: Four Knobs That Move Together
Think of GRPO as a sailing problem. You are steering a boat (the policy) across a noisy sea (sampled rewards) toward an island (the reasoning behaviour you want to reinforce). The four hyperparameters are four physical levers on the boat, and they have to be balanced:
- $G$ is your sample size against the waves. Each prompt generates $G$ trajectories. The mean and std of their rewards are the signal you steer by. Bigger $G$ means less noise but more forward-pass cost. $G = 16$ is the sweet spot where the std estimate stabilises at order-of-magnitude accuracy without burning four times the compute that $G = 64$ would.
- $\epsilon$ is how hard you are allowed to push the rudder in one step. The PPO clip forces each per-token policy ratio to stay inside $[1 - \epsilon, 1 + \epsilon]$. Bigger $\epsilon$ lets the policy move further per update but risks instability; smaller $\epsilon$ is safe but slow. $\epsilon = 0.2$ is the literal copy-paste from PPO. It works because GRPO uses only one optimisation epoch per rollout, so the ratio rarely walks far enough to need a tighter clip.
- $\beta$ is the rope tethering the boat to its starting point. The KL penalty pulls the policy toward the reference (post-SFT) model. Heavy $\beta$ keeps you safe but prevents you from discovering anything the reference cannot already do. R1 uses a deliberately tiny $\beta = 10^{-3}$, almost no tether, so the policy is free to invent the long-chain reasoning behaviours that the SFT base would never have produced.
- $\eta$ is the throttle. The learning rate $\eta = 10^{-6}$ is two orders of magnitude smaller than the SFT learning rate, because every RL gradient is amplified by a Monte-Carlo estimator with variance of order $\sigma^2 / G$. A bigger throttle on a noisier engine is how you tear the gearbox apart.
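
The claim that a larger group steadies the signal can be checked numerically. The sketch below assumes a binary rule-based reward with a hypothetical pass rate `p` (so the per-sample reward variance is `p * (1 - p)`) and compares the analytic variance of the group-mean baseline with a small simulation. The function names and the pass rate are illustrative choices, not values from the R1 report.

```python
import random
import statistics


def analytic_variance(p: float, group_size: int) -> float:
    """Variance of the mean of `group_size` i.i.d. Bernoulli(p) rewards."""
    return p * (1.0 - p) / group_size


def simulated_variance(p: float, group_size: int,
                       trials: int = 5000, seed: int = 0) -> float:
    """Empirical variance of the group-mean reward over many resampled groups."""
    rng = random.Random(seed)
    means = []
    for _trial in range(trials):
        hits = sum(1 for _sample in range(group_size) if rng.random() < p)
        means.append(hits / group_size)
    return statistics.pvariance(means)


# Assumed pass rate of 0.5; each 4x increase in G cuts the variance by 4x.
for g in (1, 4, 16, 64):
    print(f"G={g:>2}  analytic={analytic_variance(0.5, g):.5f}  "
          f"simulated={simulated_variance(0.5, g):.5f}")
```

Going from a group of 16 to a group of 64 buys another factor of four in variance, at four times the rollout cost.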
The Mathematical Form of the GRPO Update
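Recall the GRPO objective derived in Section 15.2. For a prompt $x$ and a group of $G$ sampled completions $y_1, \dots, y_G$ with rewards $r_1, \dots, r_G$, the per-prompt objective is:

$$J(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \Big[ \min\big(\rho_{i,t} A_i,\ \mathrm{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\, A_i\big) - \beta\, \mathrm{KL}_{i,t} \Big]$$

where each symbol carries one of the hyperparameters: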
| Symbol | Definition | Hyperparameter touched |
|---|---|---|
| A_i | (r_i - mu) / sigma over the group of G completions | G (group size shapes mu, sigma) |
| rho_{i,t} | pi_theta(y_{i,t} \| x, y_{i,<t}) / pi_old(y_{i,t} \| x, y_{i,<t}) | epsilon (clip range) |
| clip(., 1-eps, 1+eps) | Hard cap on per-token policy drift | epsilon |
| KL_{i,t} | exp(log pi_ref - log pi_theta) - (log pi_ref - log pi_theta) - 1 | beta (KL coefficient) |
| theta <- theta + eta * grad J | AdamW step on the objective | eta (learning rate) |
The advantage with no critic
The single line that defines GRPO is $A_i = (r_i - \mu) / \sigma$, where $\mu$ and $\sigma$ are computed over the $G$ completions of one prompt. There is no value head, no GAE, no Monte-Carlo return: just the empirical mean and standard deviation of the group's rewards. The variance of this estimator is of order $\sigma^2 / G$, which is exactly why $G$ sits at the centre of every hyperparameter discussion.
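
A minimal sketch of the group-relative advantage, assuming the population (divide-by-$G$) standard deviation and a small floor on $\sigma$; both conventions are assumptions, since implementations differ. The function name `group_advantages` is illustrative.

```python
import math


def group_advantages(rewards, std_floor=1e-6):
    """Z-score each reward against its own group: A_i = (r_i - mu) / sigma."""
    g = len(rewards)
    mu = sum(rewards) / g
    # Population variance (divide by G); an assumed convention.
    variance = sum((r - mu) ** 2 for r in rewards) / g
    sigma = max(math.sqrt(variance), std_floor)
    return [(r - mu) / sigma for r in rewards]


binary = group_advantages([1.0, 1.0, 0.0, 1.0])
scaled = group_advantages([10.0, 10.0, 0.0, 10.0])

print([round(a, 3) for a in binary])
print([round(a, 3) for a in scaled])
```

Both calls print the same advantages: the z-score discards the absolute scale of the reward and keeps only each completion's standing within its group, which is why no critic is needed.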
The clip with one optimisation epoch
Classic PPO uses 4–10 epochs over each rollout, which lets the ratio drift several times and makes a tight clip essential. GRPO as run in R1 uses one epoch (one optimiser step per rollout), so the ratio rarely wanders. That choice changes the role of $\epsilon$: it is no longer the binding constraint; it is a safety rail. R1's $\epsilon = 0.2$ is therefore the inherited PPO default; DAPO, which uses multiple epochs and asymmetric clipping, has to set separate lower and upper bounds $\epsilon_{\text{low}}$ and $\epsilon_{\text{high}}$ for the same reason classic PPO needs a tight clip.
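
The clip only binds in two of the four sign combinations, which is easy to miss when reading the formula. A small sketch for a single token (the function name and the test ratios are illustrative):

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO/GRPO per-token term: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio above / below the trust region, combined with positive / negative advantage.
for ratio, advantage in [(1.30, 1.0), (0.70, 1.0), (1.30, -1.0), (0.70, -1.0)]:
    value = clipped_surrogate(ratio, advantage)
    binding = abs(value - ratio * advantage) > 1e-12
    print(f"rho={ratio:.2f}  A={advantage:+.0f}  term={value:+.2f}  clip binding={binding}")
```

The clip caps the gain when a positively rewarded token has already been pushed up too far, and caps the gain when a negatively rewarded token has already been pushed down too far. In the other two cases the unclipped term is the smaller one and passes through untouched.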
The KL estimator and why beta is so small
DeepSeek's unbiased estimator $\mathrm{KL}_{i,t} = r - \log r - 1$ with $r = \pi_{\text{ref}}(y_{i,t} \mid x, y_{i,<t}) / \pi_\theta(y_{i,t} \mid x, y_{i,<t})$ is non-negative by construction (it is a Bregman divergence) and zero at agreement. With $\beta = 10^{-3}$, even a per-token KL as large as a typical surrogate term contributes three orders of magnitude less to the objective than that surrogate term does. The KL is therefore decorative early on. It bites only after thousands of steps, when the policy has wandered far enough for the per-token KL to grow large; at that point the integrated penalty becomes the dominant force preventing collapse.
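
The two properties claimed for this estimator, non-negativity and unbiasedness, can be verified exactly on a toy vocabulary. The sketch below uses made-up three-token distributions for $\pi_\theta$ and $\pi_{\text{ref}}$; the helper names are illustrative.

```python
import math


def k3_kl(logp_ref: float, logp_theta: float) -> float:
    """Per-sample estimator exp(d) - d - 1 with d = log pi_ref - log pi_theta."""
    delta = logp_ref - logp_theta
    return math.exp(delta) - delta - 1.0


def expected_k3(p_theta, p_ref):
    """Exact expectation of the estimator when tokens are drawn from pi_theta."""
    return sum(pt * k3_kl(math.log(pr), math.log(pt))
               for pt, pr in zip(p_theta, p_ref))


def exact_kl(p_theta, p_ref):
    """KL(pi_theta || pi_ref) computed directly from the distributions."""
    return sum(pt * math.log(pt / pr) for pt, pr in zip(p_theta, p_ref))


p_theta = [0.7, 0.2, 0.1]  # hypothetical policy distribution
p_ref = [0.5, 0.3, 0.2]    # hypothetical reference distribution

print("expected estimator:", expected_k3(p_theta, p_ref))
print("exact KL          :", exact_kl(p_theta, p_ref))
print("beta * KL         :", 1e-3 * exact_kl(p_theta, p_ref))
```

The first two printed values agree up to floating-point rounding, and every individual sample of the estimator is non-negative even though the plain log-ratio can take either sign.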
The DeepSeek R1 Recipe
The published numbers, in one table:
| Hyperparameter | Value | Role | Why this number |
|---|---|---|---|
| Group size G | 16 | Variance reduction | sigma^2/G drops 16x vs G=1; cost is 16 forward passes per prompt |
| Clip range epsilon | 0.2 | Trust region | Inherited PPO default; safe with one epoch |
| KL coefficient beta | 1e-3 | Drift damper | Small enough to allow reasoning emergence |
| Learning rate eta | 1e-6 | Throttle (AdamW) | Two orders below SFT LR; matches MC gradient noise |
| Sampling temp | 1.0 | Exploration | Don't truncate the policy's own distribution |
| Max length | 32768 | Reasoning headroom | Long enough to host multi-step CoT |
| PPO epochs / rollout | 1 | Off-policy distance | Keeps ratio close to 1; lets epsilon be loose |
| Batch size (prompts) | 1024 | Gradient noise floor | Effective batch = 1024 x 16 = 16k completions |
| Reward type | Rule-based | Signal | Accuracy check + format check; no reward model |
| LR schedule | Constant | Stability | No warmup, no decay; RL is too short for cosine to help |
Three of these are unusual; the rest are PPO defaults. The unusual ones are $G = 16$ (PPO classically uses a single sample per prompt, $G = 1$, with a learned critic), $\beta = 10^{-3}$ (PPO classically uses a much heavier KL coefficient), and one PPO epoch (PPO classically uses 4–10). Together these three changes encode the entire GRPO philosophy: trade off-policy correction for sample-mean variance reduction, and trade KL safety for exploration freedom.
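
For experiments it helps to keep the recipe in one immutable object so that a run log records exactly which corridor it was in. A minimal sketch, with the values copied from the table above; the class and field names are hypothetical and do not correspond to any released DeepSeek configuration file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRPORecipe:
    """Hyperparameters from the recipe table (field names are illustrative)."""
    group_size: int = 16
    clip_eps: float = 0.2
    kl_beta: float = 1e-3
    learning_rate: float = 1e-6
    temperature: float = 1.0
    max_len: int = 32768
    epochs_per_rollout: int = 1
    prompts_per_batch: int = 1024

    @property
    def completions_per_step(self) -> int:
        """Effective batch: every prompt contributes `group_size` completions."""
        return self.group_size * self.prompts_per_batch

    def clip_interval(self):
        """Trust region [1 - eps, 1 + eps] for the per-token ratio."""
        return (1.0 - self.clip_eps, 1.0 + self.clip_eps)


recipe = GRPORecipe()
print(recipe.completions_per_step)  # 16384
print(recipe.clip_interval())
```

Because the dataclass is frozen, a variant run has to construct a new object, for example `GRPORecipe(group_size=4)`, which makes every deviation from the baseline explicit.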
Manual Numerical Walkthrough
One prompt, four completions, every number computed by hand.
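Setup. Prompt: a short arithmetic question whose correct answer is 56. The policy samples $G = 4$ completions; the rule-based checker scores them. Hyperparameters: $\epsilon = 0.2$, $\beta = 10^{-3}$, $\eta = 10^{-6}$.
Step 1: Rewards. $r = (1, 1, 0, 1)$. Three of the four completions reached "56"; one did not.
Step 2: Group statistics. $\mu = 0.75$. $\sigma^2 = 0.1875$ (population variance, dividing by $G$). $\sigma \approx 0.433$.
Step 3: Advantages. $A_1 = A_2 = A_4 = 0.25 / 0.433 \approx +0.577$ and $A_3 = -0.75 / 0.433 \approx -1.732$. The single wrong completion gets a large negative weight; the three correct ones share a moderate positive weight. This is the "relative" in group-relative: scoring is against the group, not against an external baseline.
Step 4: Policy ratios. Suppose the per-completion ratios after one optimiser step on this batch came out as $\rho = (\rho_1, 1.30, 0.85, \rho_4)$, with $\rho_1$ and $\rho_4$ both inside $[0.8, 1.2]$. Completion 2 has drifted aggressively upward (ratio 1.30) and completion 3 has drifted downward (ratio 0.85).
Step 5: Apply the clip, $\epsilon = 0.2$. $\mathrm{clip}(\rho, 0.8, 1.2)$ gives $(\rho_1, 1.20, 0.85, \rho_4)$. Only completion 2 is clipped, because its ratio was outside the trust region.
Step 6: Per-completion surrogate. $L_i = \min\big(\rho_i A_i,\ \mathrm{clip}(\rho_i, 0.8, 1.2)\, A_i\big)$:
- $i = 1$: $0.577\,\rho_1$ (clip not binding)
- $i = 2$: $\min(1.30 \times 0.577,\ 1.20 \times 0.577) = \min(0.751,\ 0.693) = 0.693$ (clip wins and caps the upside)
- $i = 3$: $0.85 \times (-1.732) = -1.472$ (clip not binding; full negative update kept)
- $i = 4$: $0.577\,\rho_4$ (clip not binding)
Step 7: KL estimator, $\mathrm{KL} = e^{\delta} - \delta - 1$ with $\delta = \log \pi_{\text{ref}} - \log \pi_\theta$. For small $|\delta|$ this is approximately $\delta^2 / 2$. Multiply by $\beta = 10^{-3}$ and the penalty shrinks by a further three orders of magnitude. Effectively zero, which is exactly the point of small beta.
Step 8: GRPO objective for this prompt. $J = \frac{1}{4} \sum_i L_i - \beta\, \mathrm{KL} = \frac{1}{4}\big(0.577\,(\rho_1 + \rho_4) + 0.693 - 1.472\big) - \beta\, \mathrm{KL}$. The gradient of $J$ with respect to $\theta$ reinforces tokens that appeared in completions 1, 2, 4 and discourages the tokens that appeared in completion 3. With learning rate $\eta = 10^{-6}$, the resulting weight update is tiny per step, but compounded over thousands of steps the policy moves substantially.
What this reveals. The clip bit on exactly one of four rows. The KL term is essentially zero. The single wrong completion carries an advantage three times the magnitude of any correct one, so a single negative example dominated the update. That dominance is exactly why $G$ needs to be 16, not 4: if you only ever update on four samples and one of them happens to be a wild outlier, every step is a fight with that outlier.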
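
The hand computation can be replayed in a few lines. In the sketch below the ratios for completions 2 and 3 (1.30 and 0.85) come from the walkthrough; the ratios for completions 1 and 4 and the log-ratio used for the KL term are assumed values chosen only so that the script runs end to end. The population standard deviation is likewise an assumed convention.

```python
import math

EPS, BETA = 0.2, 1e-3
rewards = [1.0, 1.0, 0.0, 1.0]
ratios = [1.05, 1.30, 0.85, 0.95]   # 1.05 and 0.95 are assumed, inside [0.8, 1.2]
log_ratio_ref = 0.1                 # assumed log(pi_ref / pi_theta) for the KL term

# Steps 2-3: group statistics and advantages (population std assumed).
mu = sum(rewards) / len(rewards)
sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
advantages = [(r - mu) / sigma for r in rewards]

# Steps 5-6: clip each ratio, then take the pessimistic (min) surrogate.
clipped = [min(max(rho, 1.0 - EPS), 1.0 + EPS) for rho in ratios]
surrogates = [min(rho * a, c * a) for rho, c, a in zip(ratios, clipped, advantages)]
binding = [c != rho and c * a < rho * a
           for rho, c, a in zip(ratios, clipped, advantages)]

# Steps 7-8: KL estimator and the per-prompt objective.
kl = math.exp(log_ratio_ref) - log_ratio_ref - 1.0
objective = sum(surrogates) / len(surrogates) - BETA * kl

print("advantages:", [round(a, 3) for a in advantages])
print("surrogates:", [round(s, 3) for s in surrogates])
print("clip binding:", binding)
print("beta * KL:", BETA * kl)
print("objective:", objective)
```

Changing the two assumed ratios moves the objective but not the pattern: the clip binds on completion 2 only, and the KL contribution stays several orders of magnitude below the surrogate terms.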
Visualizing the Hyperparameter Regimes
The widget below runs the same GRPO update for one prompt under four preset hyperparameter regimes (DeepSeek R1, DAPO, classic-PPO defaults, and an over-regularised collapse), plus the ability to drag $G$, $\epsilon$, and $\beta$ continuously and watch the per-completion bars rearrange.
Four movements worth performing in the widget. First, drag $G$ from 16 down to 2: the std of the group rewards becomes unstable and the advantages start swinging wildly between resamples (use the seed slider). Second, drag $\epsilon$ from 0.2 down to 0.05: the "Fraction clipped" card climbs and the mean surrogate flattens, because the policy is forbidden from moving. Third, drag $\beta$ from 0.001 up to 0.1: the sky-blue KL overlay starts eating the green surrogate bars from the right; the policy is being pulled back to the reference faster than the reward signal can push it forward. Fourth, click the "Classic PPO defaults" preset and observe that both small G and large beta are applied at once. That is the regime in which a vanilla PPO loop would refuse to discover R1-style reasoning, and it is precisely what DeepSeek had to walk away from.
Plain Python: One GRPO Update Step
The plain-Python version performs one full GRPO update step on a single prompt with $G$ completions, using only the standard library. No PyTorch, no batching, no GPU: just the arithmetic that the R1 training loop executes for every prompt in every batch.
Two structural details. First, the order of operations is fixed: rewards → group-normalised advantages → per-completion ratio → clipped surrogate → KL drag → objective. Swap any two of those and the loss is wrong in a way that does not always raise an exception; the run will silently fail to converge. Second, the KL estimator looks unfamiliar, but it is the one the GRPO paper adopts: a non-negative, unbiased estimator of $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ that you can compute from a single sample. Using the plain log-ratio $\log(\pi_\theta / \pi_{\text{ref}})$ directly is also unbiased, but it is signed, so individual samples can report a negative penalty.
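
A minimal standard-library sketch of that computation at token level follows. It evaluates the objective only (no gradient), assumes the population standard deviation with a small floor, and takes per-token log-probabilities as nested lists; the function name, argument layout, and toy numbers are all illustrative rather than taken from the R1 code.

```python
import math


def grpo_objective(rewards, logp_new, logp_old, logp_ref,
                   eps=0.2, beta=1e-3, std_floor=1e-6):
    """Per-prompt GRPO objective; logp_x[i][t] is the log-prob of response
    token t of completion i under the corresponding policy."""
    g = len(rewards)
    # 1) rewards -> group-normalised advantages
    mu = sum(rewards) / g
    sigma = max(math.sqrt(sum((r - mu) ** 2 for r in rewards) / g), std_floor)
    total = 0.0
    for i in range(g):
        advantage = (rewards[i] - mu) / sigma
        token_terms = []
        for new, old, ref in zip(logp_new[i], logp_old[i], logp_ref[i]):
            # 2) importance ratio against the rollout policy
            ratio = math.exp(new - old)
            # 3) clipped surrogate
            clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
            surrogate = min(ratio * advantage, clipped * advantage)
            # 4) KL drag toward the reference policy
            delta = ref - new
            kl = math.exp(delta) - delta - 1.0
            token_terms.append(surrogate - beta * kl)
        # 5) length-normalise each completion, then average over the group
        total += sum(token_terms) / len(token_terms)
    return total / g


rewards = [1.0, 1.0, 0.0, 1.0]
logp_old = [[-1.0, -0.5], [-0.7, -0.9, -0.3], [-1.2], [-0.4, -0.6]]
# Toy "updated" policy: correct completions slightly more likely, the wrong one less.
shift = [0.05, 0.05, -0.05, 0.05]
logp_new = [[lp + s for lp in row] for row, s in zip(logp_old, shift)]

print(grpo_objective(rewards, logp_old, logp_old, logp_old))  # on-policy baseline
print(grpo_objective(rewards, logp_new, logp_old, logp_old))  # after the toy update
```

On-policy, every ratio is 1 and the KL is 0, so the objective is the mean advantage, which is zero up to rounding. The toy update moves probability in the direction the advantages point, so the second value is positive.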
PyTorch: The Production GRPO Loss
The production version vectorises everything to shape $(B, G, T)$: batch of prompts, group of completions, response tokens. Three things become necessary at this scale: a response-token mask (so the prompt doesn't contribute gradient), a reference policy distinct from the old policy (because R1 refreshes $\pi_{\text{old}}$ every step but $\pi_{\text{ref}}$ only rarely), and per-token advantage broadcasting.
Three observations on what makes this code production-ready vs. toy:
- Two policies, not one. $\pi_{\text{old}}$ is for the importance ratio; $\pi_{\text{ref}}$ is for the KL anchor. R1 keeps them separate because the old policy is refreshed every gradient step (so the ratio stays near 1) while the reference is refreshed every 100–500 steps (so the KL penalty stays meaningful). Conflating them collapses GRPO to standard PPO with a tiny KL coefficient, which loses the entire point.
- Token-level masking is non-negotiable. The objective is averaged over response tokens only — never the prompt tokens or pad tokens. Skip the mask and the prompt's log-probability contributes to the gradient, which is exactly backward (you would be reinforcing the prompt instead of the response). Most reported "GRPO doesn't work" failures trace to a missing or wrong mask.
- Group statistics use a clamped standard deviation. Early in R1 training many prompts return all-zero or all-one reward groups (model is too weak or too strong). Without the clamp the divide is NaN, which propagates everywhere via the optimiser's second moment. The clamp keeps the advantage at zero for those prompts, so they contribute no gradient, which is the correct behaviour.
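
The masking point is the easiest of the three to get wrong, and it can be illustrated without any tensor library. The sketch below is framework-free and uses made-up per-token values for one padded sequence; a vectorised implementation would apply the same mask across the whole $(B, G, T)$ array. The helper names are hypothetical.

```python
def response_mask(prompt_len: int, response_len: int, total_len: int):
    """1 for response tokens, 0 for prompt tokens and trailing padding."""
    return [1 if prompt_len <= t < prompt_len + response_len else 0
            for t in range(total_len)]


def masked_mean(values, mask):
    """Average only the positions where the mask is 1."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# One padded sequence: 2 prompt tokens, 3 response tokens, 2 pad tokens.
per_token_term = [9.0, 9.0, 0.2, 0.4, 0.6, 0.0, 0.0]  # illustrative values
mask = response_mask(prompt_len=2, response_len=3, total_len=7)

print("mask         :", mask)
print("masked mean  :", masked_mean(per_token_term, mask))
print("unmasked mean:", sum(per_token_term) / len(per_token_term))
```

Without the mask, the two prompt positions dominate the average and the padding dilutes it, so the number being optimised has little to do with the response.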
At Massive Scale: Why These Numbers and Not Others
Match the hyperparameters to the bottlenecks they relieve at the 671B-parameter scale R1 was trained on:
| Bottleneck | Hyperparameter that addresses it | Why this value at frontier scale |
|---|---|---|
| Sampling cost (forward passes per prompt) | G = 16 | Compute per step = G x forward-pass-cost; going from G = 16 to G = 32 doubles the rollout phase but shrinks the noise (standard deviation) only by a factor of sqrt(2) |
| KV-cache memory at long context | max_len = 32K | Long CoT requires the headroom; KV-cache for 32K tokens at 671B x MoE ~ 80GB per GPU |
| Gradient noise (variance of MC estimator) | LR = 1e-6 | AdamW step size eta * g; with sigma ~ 1 on the surrogate, eta = 1e-6 gives parameter drift ~1e-6 per step |
| Off-policy drift from rollout to update | 1 PPO epoch / rollout | Each epoch widens rho; one epoch keeps clip almost inactive and saves the gradient bias correction |
| Catastrophic forgetting of SFT skills | beta = 1e-3 plus periodic pi_ref refresh | Heavy beta kills exploration; refreshing pi_ref re-anchors after the policy has discovered new behaviours |
| Communication overhead (sync across nodes) | Batch = 1024 prompts | Each gradient step amortises one all-reduce; small batches make communication dominate |
Two further constraints become visible only at this scale. First, the rollout phase is sequential per prompt (you cannot generate token 5 before token 4) but parallel across prompts. With 1024 prompts and 16 completions each, R1 generates 16,384 sequences in parallel, saturating the inference cluster the rollouts run on. Increasing $G$ beyond 16 would either require more inference GPUs or longer wall-clock per step. Second, the gradient phase is one optimiser step on the entire 16,384-completion batch. Larger effective batches would amortise communication better but run out of GPU memory for the activations; $1024 \times 16$ sits at the upper edge of what fits.
Engineering Reality and Gotchas
Three failure modes earn their flag in every production GRPO log:
- Reward-zero groups silently kill the gradient. When the model is too weak to solve a prompt, all $G$ completions get reward 0, the group std is zero, the clamp returns its floor value, and the advantage becomes a vector of zeros, which contributes no gradient. Early in R1 training this happens to ~30% of prompts. The R1 fix is curriculum-style prompt filtering: drop prompts whose recent group-pass-rate is at 0% or 100% from the training mix, because they contribute no learning signal in either case. Without this filter, ~40% of compute is spent on prompts that contribute nothing.
- The reference policy must be refreshed, but not too often. Refresh every step and GRPO collapses to standard PPO; refresh never and KL grows without bound until $\beta \cdot \mathrm{KL}$ dominates the surrogate and learning stalls. R1's reported schedule refreshes every 100–500 steps; the right number depends on how fast per-token KL is growing. A practical rule: refresh when $\beta$ times the mean per-token KL reaches roughly half the surrogate signal.
- Length normalisation is the single biggest hidden knob. The $1/|y_i|$ factor in the objective makes long completions contribute the same gradient weight as short ones. This is R1's choice, but Dr.GRPO (Section 15.5) argues it implicitly rewards verbosity and drops it. Switching the normalisation between these two forms is one line of code, and it changes whether the model learns to think more or less. There is no universally right answer; it depends on what the reward measures and what trade-off you want.
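
To make the length-normalisation point concrete, the sketch below compares the per-sequence $1/|y_i|$ averaging with one length-independent alternative that divides every completion's token sum by a fixed constant. Treating the alternative as "divide by a constant `max_len`" is an assumption made for illustration; the function names and numbers are likewise hypothetical.

```python
def per_sequence_mean(token_terms):
    """Each completion is averaged over its own length, then over the group."""
    return sum(sum(seq) / len(seq) for seq in token_terms) / len(token_terms)


def constant_normalised(token_terms, max_len):
    """Assumed alternative: divide each completion's token sum by a constant."""
    return sum(sum(seq) / max_len for seq in token_terms) / len(token_terms)


def token_weight(seq_len, group_size, max_len=None):
    """Weight one token receives in the objective under either scheme."""
    denom = seq_len if max_len is None else max_len
    return 1.0 / (group_size * denom)


short, long_ = [1.0] * 2, [1.0] * 8   # same per-token term, different lengths
group = [short, long_]

print(per_sequence_mean(group))            # 1.0
print(constant_normalised(group, 8))       # 0.625
print(token_weight(2, 2), token_weight(8, 2))                        # 0.25 0.0625
print(token_weight(2, 2, max_len=8), token_weight(8, 2, max_len=8))  # 0.0625 0.0625
```

Under per-sequence averaging a token in the short completion carries four times the weight of a token in the long one; under the constant divisor every token carries the same weight, so a long completion contributes proportionally more to the update.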
The one sentence to carry forward: GRPO's magic is in the recipe, not the equation. $(G, \epsilon, \beta, \eta) = (16, 0.2, 10^{-3}, 10^{-6})$ is the single point in hyperparameter space where rule-based rewards on math prompts can drive a 671B model from imitation into emergent reasoning, and the rest of this chapter is about why that point exists, what perturbs it, and how the variants in Section 15.5 move along its frontier.