Boo-AI — Master Artificial Intelligence by Building from Scratch

Section 15.2 derived GRPO from PPO by eliminating the critic and replacing the value-baseline with a per-prompt z-score over a group of sampled completions. That derivation gives you a loss function with four exposed knobs: the group size $G$ , the PPO clip range $\epsilon$ , the KL coefficient $\beta$ , and the learning rate. The algorithm is now exactly as good as the numbers you put into those knobs. DeepSeek R1 picked a specific combination — most of which is unusual compared to classic PPO defaults — and that combination is much of what made the run successful. This section is about why those numbers.

The thesis of this section. GRPO is not a single algorithm, it is a family of algorithms parameterised by $(G, \epsilon, \beta, \eta)$ . DeepSeek R1 sits at $G = 16$ , $\epsilon = 0.2$ , $\beta = 10^{-3}$ , $\eta = 10^{-6}$ . Each of those values trades a specific failure mode against a specific cost, and the four together define a stable corridor in which long-chain reasoning can emerge from a rule-based reward.

The Real Problem: Hyperparameters Are the Whole Algorithm

Most papers describe an algorithm in equations and bury the hyperparameters in an appendix. For RL fine-tuning of large language models, that ordering is wrong. The equations are common to PPO, GRPO, and their cousins; the numbers determine whether the run converges to a frontier reasoning model or collapses to repetitive nonsense within a few hundred steps. The R1 release made this explicit: the supplementary material is dominated not by new math but by the exact numerical settings of an otherwise standard GRPO loop. Reproducing R1 means matching the numbers, not just the symbols.

Five failure modes lurk at the boundaries:

Failure mode	Hyperparameter regime	What it looks like
Reward hacking / format collapse	beta too small AND epsilon too large	Model finds a degenerate response that passes the checker and repeats it
No learning	beta too large OR LR too small	Loss curve is flat; KL hugs zero; policy never moves
Variance blow-up	G too small	Loss bounces 10x; gradient norm spikes; updates fight noise
Long-tail truncation	MAX_LEN too small	Aha-moment reasoning gets cut off; reward is wrongly attributed to truncation
KL drift	beta reasonable but pi_ref never refreshed	After ~20K steps the KL penalty dwarfs the surrogate; learning stalls

Every one of these failure modes is visible in the loss curve within a few hundred steps if you know which curve to watch. The R1 hyperparameter choices were calibrated to keep the run inside a narrow band where none of them fires. Understanding those choices is the difference between a paper you can read and a paper you can reproduce.

Why this matters now. Every open-source GRPO recipe released since R1 — DAPO, Dr.GRPO, Olmo 3, Qwen-3 — diverges from the R1 hyperparameters in at least one place, and each divergence is a deliberate response to a failure mode that bit the author. Section 15.5 catalogues those variants; this section establishes the baseline they all start from.

Intuition: Four Knobs That Move Together

Think of GRPO as a sailing problem. You are steering a boat (the policy) across a noisy sea (sampled rewards) toward an island (the reasoning behaviour you want to reinforce). The four hyperparameters are four physical levers on the boat, and they have to be balanced:

$G$ is your sample size against the waves. Each prompt generates $G$ trajectories. The mean and std of their rewards are the signal you steer by. Bigger $G$ means less noise but more forward-pass cost. $G = 16$ is the sweet spot where the std estimate stabilises at order-of-magnitude accuracy without burning four times the compute that $G = 64$ would.
$\epsilon$ is how hard you are allowed to push the rudder in one step. The PPO clip forces each per-token policy ratio to stay inside $(1 - \epsilon, 1 + \epsilon)$ . Bigger $\epsilon$ lets the policy move further per update but risks instability; smaller $\epsilon$ is safe but slow. $\epsilon = 0.2$ is the literal copy-paste from PPO — it works because GRPO uses only one optimisation epoch per rollout, so the ratio rarely walks far enough to need a tighter clip.
$\beta$ is the rope tethering the boat to its starting point. The KL penalty pulls the policy toward the reference (post-SFT) model. Heavy $\beta$ keeps you safe but prevents you from discovering anything the reference cannot already do. R1 uses a deliberately tiny $\beta = 10^{-3}$ — almost no tether — so the policy is free to invent the long-chain reasoning behaviours that the SFT base would never have produced.
$\eta$ is the throttle. Learning rate $10^{-6}$ is two orders of magnitude smaller than the SFT learning rate, because every RL gradient is amplified by a Monte-Carlo estimator with variance $\sigma^2 / G$ . A bigger throttle on a noisier engine is how you tear the gearbox apart.

Why all four must move together. The honest answer is that they trade against each other. If you double $G$ you halve the gradient variance and can afford to double $\eta$. If you halve $\beta$ the policy explores more, which widens the per-token ratio distribution, which forces you to tighten $\epsilon$. The R1 numbers are one stable point in a four-dimensional space, and DAPO, Dr.GRPO, and Olmo 3 are other stable points in the same space.

The Mathematical Form of the GRPO Update

Recall the GRPO objective derived in Section 15.2. For a prompt $x$ and a group of $G$ sampled completions $\{y_i\}_{i=1}^{G}$ with rewards $r_i$ , the per-prompt objective is:

$J_{\text{GRPO}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \left[ \min(\rho_{i,t} A_i, \; \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) A_i) - \beta \, \widehat{\text{KL}}_{i,t} \right]$

where each symbol carries one of the hyperparameters:

Symbol	Definition	Hyperparameter touched
A_i	(r_i - mu) / sigma over the group of G completions	G (group size shapes mu, sigma)
rho_{i,t}	pi_theta(y_{i,t} \| x, y_{i,<t}) / pi_old(y_{i,t} \| x, y_{i,<t})	epsilon (clip range)
clip(., 1-eps, 1+eps)	Hard cap on per-token policy drift	epsilon
KL_{i,t}	exp(log pi_ref - log pi_theta) - (log pi_ref - log pi_theta) - 1	beta (KL coefficient)
theta <- theta + eta * grad J	AdamW step on the objective	eta (learning rate)

The advantage with no critic

The single line that defines GRPO is $A_i = (r_i - \mu_g) / \sigma_g$ where $\mu_g, \sigma_g$ are computed over the $G$ completions of one prompt. There is no value head, no GAE, no Monte-Carlo return — just the empirical mean and standard deviation of the group's rewards. The variance of this estimator is $\text{Var}(\mu_g) \approx \sigma_r^2 / G$ , which is exactly why $G$ sits at the centre of every hyperparameter discussion.
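The $\sigma_r^2 / G$ scaling can be checked empirically. The following is a minimal sketch, assuming binary rewards drawn independently with a hypothetical per-completion pass rate `p = 0.55`; the function name `group_mean_variance` and the trial count are illustrative choices, not part of any R1 code.

```python
import random
import statistics

def group_mean_variance(G, p=0.55, trials=10000, seed=0):
    """Empirical variance of the mean of G Bernoulli(p) rewards."""
    rng = random.Random(seed)
    means = [sum(rng.random() < p for _ in range(G)) / G for _ in range(trials)]
    return statistics.pvariance(means)

p = 0.55  # assumed pass rate, for illustration only
for G in (1, 4, 16, 64):
    empirical = group_mean_variance(G, p)
    predicted = p * (1 - p) / G   # sigma_r^2 / G for a Bernoulli reward
    print(f"G={G:>2}  empirical={empirical:.5f}  predicted={predicted:.5f}")
```

Each fourfold increase in $G$ should cut the variance of the group mean by roughly four, which is the trade the text describes between noise and forward-pass cost.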

The clip with one optimisation epoch

Classic PPO uses 4–10 epochs over each rollout, which lets the ratio $\rho$ drift several times and makes a tight clip essential. GRPO as run in R1 uses one epoch — one optimiser step per rollout — so the ratio rarely wanders. That choice changes the role of $\epsilon$ : it is no longer the binding constraint, it is a safety rail. R1's $\epsilon = 0.2$ is therefore the inherited PPO default; DAPO, which uses multiple epochs and asymmetric clipping, has to set $\epsilon_{\text{high}} = 0.28$ and $\epsilon_{\text{low}} = 0.2$ for the same reason classic PPO needs a tight clip.
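To make the "safety rail" role concrete, here is a small sketch of the clipped surrogate for a single ratio, together with a helper that reports when the clip is actually binding. The helper name `clip_binds` is hypothetical; the arithmetic is the standard PPO min/clip form used in the objective above.

```python
def clipped_surrogate(rho, advantage, epsilon=0.2):
    """PPO-style clipped surrogate for one ratio and one advantage."""
    clipped_rho = max(min(rho, 1 + epsilon), 1 - epsilon)
    return min(rho * advantage, clipped_rho * advantage)

def clip_binds(rho, advantage, epsilon=0.2):
    """True when min() picks the clipped branch, so the gradient through rho is zero."""
    return (advantage > 0 and rho > 1 + epsilon) or (advantage < 0 and rho < 1 - epsilon)

# Four corner cases plus one ratio that stays inside the trust region.
for rho, adv in [(1.3, 1.0), (0.7, 1.0), (1.3, -1.0), (0.7, -1.0), (1.05, 1.0)]:
    print(f"rho={rho:.2f} A={adv:+.1f} "
          f"surrogate={clipped_surrogate(rho, adv):+.3f} binds={clip_binds(rho, adv)}")
```

The clip binds only when the ratio has already moved in the direction the advantage favours; with ratios near 1, as after a single optimiser step, none of these branches is reached.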

The KL estimator and why beta is so small

DeepSeek's unbiased estimator $\widehat{\text{KL}} = e^{r} - r - 1$ with $r = \log \pi_{\text{ref}} - \log \pi_{\theta}$ is non-negative by construction (it is a Bregman divergence) and zero at agreement. With $\beta = 10^{-3}$ , even a large per-token KL of $0.05$ contributes only $5 \cdot 10^{-5}$ to the objective — three orders of magnitude smaller than a typical surrogate term of $\sim 0.05$ . The KL is therefore decorative early on. It bites only after thousands of steps when the policy has wandered far enough that per-token KL sits in the $0.5 - 2.0$ range; at that point the integrated penalty becomes the dominant force preventing collapse.
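The two properties claimed for the estimator (non-negative per token, and equal in expectation to the true KL) can be verified exactly on a small example. This is a sketch under stated assumptions: `pi_theta` and `pi_ref` are hypothetical next-token distributions over a four-token vocabulary, and the expectation is taken over tokens drawn from `pi_theta` itself.

```python
import math

def k3(logp_ref, logp_theta):
    """Per-token estimator exp(r) - r - 1 with r = log pi_ref - log pi_theta."""
    r = logp_ref - logp_theta
    return math.exp(r) - r - 1.0

# Hypothetical distributions, for illustration only.
pi_theta = [0.70, 0.15, 0.10, 0.05]
pi_ref   = [0.40, 0.30, 0.20, 0.10]

# Exact KL[pi_theta || pi_ref], summed over the vocabulary.
exact_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))

# Estimator value for each possible token, and its expectation under pi_theta.
per_token = [k3(math.log(q), math.log(p)) for p, q in zip(pi_theta, pi_ref)]
expected_k3 = sum(p * k for p, k in zip(pi_theta, per_token))

BETA = 1e-3
print(f"exact KL = {exact_kl:.6f}, E[k3] = {expected_k3:.6f}")
print(f"beta * 0.05 = {BETA * 0.05:.1e}")  # the per-token penalty quoted in the text
```

The expectation matches the exact KL because the $e^{r}$ and $-1$ terms cancel when averaged over $\pi_\theta$, leaving only $-\mathbb{E}[r]$.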

Why R1 picked an unusually small beta. Long-chain reasoning behaviours (backtracking, self-correction, the famous "wait, let me reconsider" aha-moment) require the policy to drift substantially from the SFT reference. A standard $\beta = 0.04$ would pin the policy near the reference and silently prevent these behaviours from emerging. DeepSeek's deliberate choice was to remove the KL leash for the first phase of training, then re-anchor by occasionally refreshing $\pi_{\text{ref}}$ to the current policy.
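One way to express "occasionally refreshing" as code is a trigger that combines a KL threshold with a step window. The sketch below is an assumption-laden illustration, not R1's actual schedule: the function name is hypothetical, and the default values (a mean per-token KL of 0.5 and a 100 to 500 step window) are the rule-of-thumb figures quoted later in this section.

```python
def should_refresh_reference(mean_token_kl, steps_since_refresh,
                             kl_threshold=0.5, min_interval=100, max_interval=500):
    """Hypothetical rule: refresh pi_ref on a KL trigger, bounded by a step window."""
    if steps_since_refresh < min_interval:
        return False              # too soon: refreshing every step removes the anchor
    if steps_since_refresh >= max_interval:
        return True               # too long: the KL penalty would keep growing
    return mean_token_kl > kl_threshold

print(should_refresh_reference(mean_token_kl=0.6, steps_since_refresh=200))
print(should_refresh_reference(mean_token_kl=0.1, steps_since_refresh=200))
```

The lower bound guards against refreshing so often that the penalty never bites; the upper bound guards against never refreshing at all.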

The DeepSeek R1 Recipe

The published numbers, in one table:

Hyperparameter	Value	Role	Why this number
Group size G	16	Variance reduction	sigma^2/G drops 16x vs G=1; cost is 16 forward passes per prompt
Clip range epsilon	0.2	Trust region	Inherited PPO default; safe with one epoch
KL coefficient beta	1e-3	Drift damper	Small enough to allow reasoning emergence
Learning rate eta	1e-6	Throttle (AdamW)	Two orders below SFT LR; matches MC gradient noise
Sampling temp	1.0	Exploration	Don't truncate the policy's own distribution
Max length	32768	Reasoning headroom	Long enough to host multi-step CoT
PPO epochs / rollout	1	Off-policy distance	Keeps ratio close to 1; lets epsilon be loose
Batch size (prompts)	1024	Gradient noise floor	Effective batch = 1024 x 16 = 16k completions
Reward type	Rule-based	Signal	Accuracy check + format check; no reward model
LR schedule	Constant	Stability	No warmup, no decay; RL is too short for cosine to help

Three of these are unusual; the rest are PPO defaults. The unusual ones are $G = 16$ (PPO classically uses $G = 4$ ), $\beta = 10^{-3}$ (PPO classically uses $0.04$ ), and one PPO epoch (PPO classically uses 4–10). Together these three changes encode the entire GRPO philosophy: trade off-policy correction for sample-mean variance reduction, and trade KL safety for exploration freedom.
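For experiments it helps to keep the recipe in one immutable object. The following sketch simply transcribes the table above into a dataclass; the class and field names are hypothetical, and the values are the ones reported in this section rather than read from any released configuration file.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GRPOConfig:
    """Values transcribed from the recipe table; names are illustrative."""
    group_size: int = 16
    clip_epsilon: float = 0.2
    kl_beta: float = 1e-3
    learning_rate: float = 1e-6
    temperature: float = 1.0
    max_length: int = 32768
    epochs_per_rollout: int = 1
    prompts_per_batch: int = 1024

    @property
    def completions_per_step(self) -> int:
        # Effective batch: every prompt contributes group_size completions.
        return self.prompts_per_batch * self.group_size

r1 = GRPOConfig()
small_lab = GRPOConfig(prompts_per_batch=64)   # a scaled-down variant
print(r1.completions_per_step, small_lab.completions_per_step)
```

Freezing the dataclass makes accidental mid-run mutation of a hyperparameter an error rather than a silent change.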

Manual Numerical Walkthrough

One prompt, four completions, every number computed by hand.

One GRPO update step on G=4 completions of a math prompt

Setup. Prompt: "What is $7 \times 8$ ?". The policy samples $G = 4$ completions; the rule-based checker scores them. Hyperparameters: $\epsilon = 0.2, \beta = 10^{-3}$ .

Step 1 — Rewards. $r = [1, 1, 0, 1]$ . Three of the four completions reached "56"; one did not.

Step 2 — Group statistics. $\mu = (1+1+0+1)/4 = 0.75$ . $\sigma^2 = ((1-0.75)^2 \cdot 3 + (0-0.75)^2)/4 = (0.1875 + 0.5625)/4 = 0.1875$ . $\sigma \approx 0.433$ .

Step 3 — Advantages. $A_1 = A_2 = A_4 = (1 - 0.75)/0.433 \approx +0.577$ and $A_3 = (0 - 0.75)/0.433 \approx -1.732$ . The single wrong completion gets a large negative weight; the three correct ones share a moderate positive weight. This is the "relative" in group-relative: scoring is against the group, not against an external baseline.

Step 4 — Policy ratios. Suppose the per-token ratios after one optimiser step on this batch came out as $\rho = [1.05, 1.30, 0.85, 1.10]$ . Completion 2 has drifted aggressively upward (ratio 1.30) and completion 3 has drifted downward (ratio 0.85).

Step 5 — Apply the clip, $\epsilon = 0.2$ . $\text{clip}(\rho, 0.8, 1.2)$ gives $[1.05, 1.20, 0.85, 1.10]$ . Only completion 2 is clipped — its ratio was outside the trust region.

Step 6 — Per-completion surrogate. $\text{surr}_i = \min(\rho_i A_i, \text{clip}(\rho_i) A_i)$ :

$i=1$ : $\min(1.05 \cdot 0.577, 1.05 \cdot 0.577) = +0.606$
$i=2$ : $\min(1.30 \cdot 0.577, 1.20 \cdot 0.577) = +0.692$ (clip wins — caps the upside)
$i=3$ : $\min(0.85 \cdot -1.732, 0.85 \cdot -1.732) = -1.472$ (clip not binding; full negative update kept)
$i=4$ : $\min(1.10 \cdot 0.577, 1.10 \cdot 0.577) = +0.635$

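Step 7 — KL estimator, $\beta = 10^{-3}$ . As a toy stand-in, apply the same $x - \log x - 1$ form to the ratio $\rho_i$ (the full objective applies it to $\pi_{\text{ref}} / \pi_\theta$ ): $\widehat{\text{KL}}_i = \rho_i - \log \rho_i - 1$ gives $[0.0012, 0.0376, 0.0125, 0.0047]$ . Multiply by $\beta$ : $[1.2 \cdot 10^{-6}, 3.8 \cdot 10^{-5}, 1.3 \cdot 10^{-5}, 4.7 \cdot 10^{-6}]$ . Effectively zero, which is exactly the point of small beta.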

Step 8 — GRPO objective for this prompt. $J = \frac{1}{4}(0.606 + 0.692 - 1.472 + 0.635) - \beta \cdot \overline{\text{KL}} \approx \frac{0.461}{4} - 1.4 \cdot 10^{-5} \approx +0.115$ . The gradient of $J$ with respect to $\theta$ reinforces tokens that appeared in completions 1, 2, 4 and discourages the tokens that appeared in completion 3. With learning rate $\eta = 10^{-6}$ , the resulting weight update has norm on the order of $10^{-7}$ : tiny per step, but compounded over $8000$ steps the policy moves substantially.

What this reveals. The clip bit on exactly one of four rows. The KL term is essentially zero. The single wrong completion contributed $|-1.472| / (0.606 + 0.692 + 1.472 + 0.635) \approx 43\%$ of the total absolute surrogate, so a single negative example dominated the update. That dominance is exactly why $G$ needs to be 16, not 4: if you only ever update on four samples and one of them happens to be a wild outlier, every step is a fight with that outlier.
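The walkthrough above can be replayed in a few lines of standard-library Python. This is a sketch that hard-codes the rewards and ratios assumed in Steps 1 and 4; because it does not round intermediate values, its results may differ from the hand-rounded figures in the last digit.

```python
import math

EPSILON, BETA = 0.2, 1e-3
rewards = [1.0, 1.0, 0.0, 1.0]        # Step 1
rho     = [1.05, 1.30, 0.85, 1.10]    # Step 4 (assumed ratios)

# Steps 2-3: population mean/std over the group, then z-score advantages.
G  = len(rewards)
mu = sum(rewards) / G
sd = math.sqrt(sum((r - mu) ** 2 for r in rewards) / G)
A  = [(r - mu) / sd for r in rewards]

# Steps 5-6: clipped surrogate per completion.
def surrogate(rho_i, a_i):
    clipped = max(min(rho_i, 1 + EPSILON), 1 - EPSILON)
    return min(rho_i * a_i, clipped * a_i)

surr = [surrogate(p, a) for p, a in zip(rho, A)]

# Step 7: the x - log(x) - 1 form applied to rho, as in the walkthrough.
kl = [p - math.log(p) - 1 for p in rho]

# Step 8: objective, plus the share of absolute surrogate owed to the wrong answer.
J = sum(surr) / G - BETA * sum(kl) / G
share_wrong = abs(surr[2]) / sum(abs(s) for s in surr)

print(f"mu={mu:.2f} sd={sd:.3f} A={[round(a, 3) for a in A]}")
print(f"surrogates={[round(s, 3) for s in surr]}")
print(f"J={J:+.4f}  wrong-completion share={share_wrong:.1%}")
```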

Visualizing the Hyperparameter Regimes

The widget below runs the same GRPO update for one prompt under four preset hyperparameter regimes — DeepSeek R1, DAPO, classic-PPO defaults, and an over-regularised collapse — plus the ability to drag $G$ , $\epsilon$ , and $\beta$ continuously and watch the per-completion bars rearrange.

[Interactive widget: GRPO hyperparameters visualizer]

Four movements worth performing in the widget. First, drag $G$ from 16 down to 2 — the std of the group rewards becomes unstable and the advantages start swinging wildly between resamples (use the seed slider). Second, drag $\epsilon$ from 0.2 down to 0.05 — the "Fraction clipped" card climbs and the mean surrogate flattens, because the policy is forbidden from moving. Third, drag $\beta$ from 0.001 up to 0.1 — the sky-blue KL overlay starts eating the green surrogate bars from the right; the policy is being pulled back to the reference faster than the reward signal can push it forward. Fourth, click the "Classic PPO defaults" preset and observe that both small G and large beta are applied at once — that is the regime in which a vanilla PPO loop would refuse to discover R1-style reasoning, and it is precisely what DeepSeek had to walk away from.

Plain Python: One GRPO Update Step

The plain-Python version below performs one full GRPO update step on a single prompt with $G = 16$ completions, using only the standard library. No PyTorch, no batching, no GPU — just the arithmetic that the R1 training loop executes for every prompt in every batch.

grpo_one_step.py (annotations first, then the listing)

Line 4: The hyperparameters DeepSeek R1 actually used

Five numbers do almost all the work. G = 16 sets the noise floor of the advantage estimate; EPSILON = 0.2 is the trust-region radius the update is allowed to move in one step; BETA = 0.001 is unusually small, a deliberate choice that lets the policy drift from the reference, which is what makes R1's reasoning traces visibly evolve over training; LR = 1e-6 is two orders of magnitude smaller than supervised LRs because every step is amplified by a high-variance Monte-Carlo gradient; MAX_LEN = 32768 sets how long a chain-of-thought can grow before truncation.

Execution state: G = 16; EPSILON = 0.2; BETA = 0.001; LR = 1e-6

Line 13: Rule-based rewards in {0, 1}

R1 uses no reward model. Each completion is graded by a deterministic checker (correct answer? format ok?) and given a binary score. The crude reward signal is exactly what makes group normalisation necessary: every completion gets either 0 or 1, so without normalisation the gradient would have catastrophic variance.

Execution state: rewards = G = 16 floats in {0, 1}

Line 19: Group-normalised advantage replaces the critic

This is the line that defines GRPO. mu and sd are computed over the group of G completions for ONE prompt. The advantage A_i is a z-score within that group: a completion that scored above the group mean gets positive weight, below gets negative. No value network, no GAE, no bootstrapping, just the per-prompt sample statistics. This is why G matters so much: with G = 1 there is nothing to normalise against, and with G = 4 the std estimate is so noisy that half of training is fighting that noise.

Execution state: mu ≈ 0.5 (depends on seed); sd ≈ 0.5; A = z-scores with mean 0 and std 1

Line 25: Log-probability ratios from the two policies

rho_i = pi_new(y_i | x) / pi_old(y_i | x). In the real loop, pi_old is the model that produced the rollout and pi_new is the current trainable model. With a fresh rollout and one optimisation step, rho is very close to 1, and the clip almost never bites. The point of computing it anyway is to leave room for multiple PPO epochs over the same rollout. R1 uses only one, which is part of why the inherited EPSILON = 0.2 is loose enough to act as a safety rail.

Execution state: rho = G values, ~ 1.0 +/- 0.1

Line 30: PPO-style clipped surrogate, per row

min(unclipped, clipped) is the standard PPO trick: if the unclipped ratio would update the policy too aggressively, the gradient is clipped to zero in that direction. Crucially, the clip only bites in the direction that would INCREASE the surrogate; that is, it stops over-confident moves but lets you walk away from bad completions freely. With EPSILON = 0.2, a rho of 1.3 on a positive advantage is forced down to 1.2 (gradient zero past that), while a rho of 0.7 on a negative advantage is forced up to 0.8.

Execution state: EPSILON = 0.2 (R1); (1-eps, 1+eps) = (0.8, 1.2)

Line 38: The KL estimator from the GRPO paper

The DeepSeek-Math GRPO paper uses a non-negative KL estimator of the form x - log(x) - 1. The toy listing applies that form to rho_i as a stand-in: KL_i ~ rho_i - log(rho_i) - 1. It is zero when rho = 1 and strictly positive everywhere else. Why not just use the log-ratio as a KL? Because individual samples of that estimator can go negative and unbalance the gradient. This form is mathematically clean: its expectation under pi_old equals the true KL[pi_old || pi_new], so it is unbiased while every sample stays non-negative.

Execution state: kl_i (rho=1.0) = 0.0; kl_i (rho=1.1) ~ 0.0047; kl_i (rho=1.3) ~ 0.0376

Line 41: The full GRPO objective per group

J = mean over G of (surrogate - beta * KL). With beta = 0.001, the KL term is essentially decorative for the first thousand steps; it only becomes load-bearing once the policy has drifted enough that mean KL exceeds ~1, which, in R1, happens after roughly 8K steps. This is the deliberate engineering choice: a heavier beta would freeze the policy near the SFT reference and prevent the long-reasoning behaviours from emerging at all.

Execution state: J = scalar in roughly [-0.5, +0.5]; BETA * mean_KL ~ 0.001 * 0.01 ~ 1e-5

import math
import random

# DeepSeek R1 GRPO hyperparameters (publicly reported)
G        = 16          # group size: completions per prompt
EPSILON  = 0.2         # PPO clip range
BETA     = 0.001       # KL coefficient (very small on purpose)
LR       = 1e-6        # AdamW learning rate (listed for reference; unused below)
TEMP     = 1.0         # sampling temperature during rollout (unused below)
MAX_LEN  = 32768       # max generation length in tokens (unused below)

# One math question; assume we already sampled G completions and graded them
# with a rule-based reward in {0, 1} (1 = correct, 0 = wrong).
random.seed(0)
rewards = [random.random() < 0.55 for _ in range(G)]
rewards = [1.0 if r else 0.0 for r in rewards]

# Step 1: group-normalised advantages. No critic; just a z-score over the group.
mu   = sum(rewards) / G
var  = sum((r - mu) ** 2 for r in rewards) / G   # population variance
sd   = math.sqrt(var) + 1e-8                     # epsilon guards an all-equal group
A    = [(r - mu) / sd for r in rewards]

# Step 2: log-prob ratios between the new and old policies (one per completion here).
# In a real run these come from running both models on the rollout. Here we
# fake a tiny shift so the clip mechanism is visible.
log_ratios = [random.gauss(0.0, 0.1) for _ in range(G)]
rho        = [math.exp(lr) for lr in log_ratios]

# Step 3: PPO-style clipped surrogate, per completion.
def surrogate(rho_i, A_i):
    unclipped = rho_i * A_i
    clipped   = max(min(rho_i, 1 + EPSILON), 1 - EPSILON) * A_i
    return min(unclipped, clipped)

surr = [surrogate(rho[i], A[i]) for i in range(G)]

# Step 4: x - log(x) - 1 penalty applied to rho (toy stand-in); >= 0, zero at rho=1.
kl   = [r - math.log(r) - 1 for r in rho]

# Step 5: GRPO objective per completion = surrogate - beta * KL. Mean over group.
J    = sum(surr[i] - BETA * kl[i] for i in range(G)) / G

clipped_frac = sum(1 for i in range(G)
                   if rho[i] != max(min(rho[i], 1 + EPSILON), 1 - EPSILON)
                   and A[i] != 0) / G

print(f"Group reward mean / std   : {mu:.3f} / {sd:.3f}")
print(f"Mean surrogate            : {sum(surr)/G:+.4f}")
print(f"Mean KL drag (beta * KL)  : {BETA * sum(kl)/G:.6f}")
print(f"GRPO objective J          : {J:+.4f}")
print(f"Fraction of rows clipped  : {clipped_frac:.2%}")

Two structural details. First, the order of operations is fixed: rewards → group-normalised advantages → per-completion ratio → clipped surrogate → KL drag → objective. Swap any two of those and the loss is wrong in a way that does not always raise an exception; the run will silently fail to converge. Second, the KL estimator $\rho - \log \rho - 1$ looks unfamiliar but is the form the GRPO paper adopts: computed from a single sample drawn from $\pi_{\text{old}}$ , it is a non-negative, unbiased estimator of $\text{KL}[\pi_{\text{old}} \| \pi_{\text{new}}]$ . The naive single-sample alternative, $-\log \rho$ , is also unbiased but can be negative on individual samples, so on some tokens it rewards drift instead of penalising it.

PyTorch: The Production GRPO Loss

The production version vectorises everything to $(B, G, T)$ — batch of prompts, group of completions, response tokens. Three things become necessary at this scale: a response-token mask (so the prompt doesn't contribute gradient), a reference policy distinct from the old policy (because R1 refreshes $\pi_{\text{old}}$ every step but $\pi_{\text{ref}}$ rarely), and per-token advantage broadcasting.

grpo_loss.py (annotations first, then the listing)

Line 4: Three log-probability tensors and one reward tensor

logp_new comes from the policy currently being trained. logp_old comes from the same policy at the moment of the rollout (frozen for this update). logp_ref comes from the post-SFT base model, the immutable anchor that the KL penalty pulls toward. For R1, pi_ref is updated rarely or not at all; some recipes refresh it every ~100 GRPO steps to let the policy drift further without dragging the KL term up indefinitely.

Execution state: logp_new.shape = (B, G, T); logp_old.shape = (B, G, T); logp_ref.shape = (B, G, T); rewards.shape = (B, G)

Line 18: Per-prompt mean and std for advantage normalisation

dim=1 is the group axis. We compute mu and sd over the G completions of EACH prompt independently. Clamping sd to 1e-8 protects the divide when all completions in a group got the same reward (very common at the start of training when the model is too weak to ever solve a prompt: every row is 0, std is 0, and without the clamp the advantage is NaN).

Execution state: mu.shape = (B, 1); sd.shape = (B, 1)

Line 21: The advantage broadcasts to every response token

A has shape (B, G); unsqueezing to (B, G, 1) lets it broadcast over the T-axis. Every response token in completion i inherits the same scalar advantage. This is the second crucial deviation from PPO: no per-token credit assignment, no GAE, no temporal-difference smoothing. The whole trajectory is rewarded or penalised as a unit. For reasoning tasks this is fine because the reward only depends on the final answer; for tasks with per-step rewards GRPO would be a bad fit.

Line 24: Compute the ratio in log space, then exponentiate

log_ratio = logp_new - logp_old is numerically stable; computing the ratio directly as exp(logp_new)/exp(logp_old) underflows for long sequences. Exponentiating gives ratio of shape (B, G, T); each entry is per-token rho. With one PPO epoch (R1's choice) ratio is very close to 1 across the board: typically ratio.std() ~ 0.01 for the first update, growing to ~0.05 after a few epochs if R1 had used more than one (it does not).

Execution state: ratio.shape = (B, G, T)

Line 29: PPO clipped surrogate at the token granularity

Per token, take the smaller of (unclipped, clipped). torch.minimum is element-wise, so this is vectorised across the entire (B, G, T) tensor in one op. The clip range (1-eps, 1+eps) = (0.8, 1.2) for eps=0.2. Tokens whose ratio walks outside this range get their gradient zeroed in the dangerous direction. Note that 'dangerous' depends on the sign of A: for A>0 the clip caps the upside (we can't reward too aggressively), and for A<0 the clip caps the downside (we can't punish too aggressively). Both protect against runaway updates.

Execution state: surrogate.shape = (B, G, T)

Line 34: Unbiased KL estimator vs the SFT reference

This is the GRPO paper's estimator: KL ~ exp(r) - r - 1 where r = log pi_ref - log pi_new. Non-negative everywhere, zero when policies agree on a token. The estimator is computed per token. With BETA = 1e-3 (R1) the per-token contribution is in the 1e-5 range; integrated over all the response tokens in a batch it adds up to maybe 1e-2 of the total objective. Crank BETA to 4e-2 (classic PPO defaults) and KL dominates, so the policy stops moving.

Execution state: kl_token.shape = (B, G, T); kl_token at agreement = 0.0

Line 38: Mask out the prompt and padding, then average

mask is 1 only on response tokens (not prompt, not pad). We sum the masked objective along T and divide by the response length to get a per-completion scalar, then take the mean over G and B. The mean over T means longer completions don't dominate the gradient by virtue of length alone, a subtle but important choice. (Dr.GRPO disagrees with this normalisation; see Section 15.5.) Final return: negative of the mean objective, because we MINIMISE in PyTorch.

Execution state: objective shape after mean = (B, G) -> scalar

import torch


def grpo_loss(
    logp_new,   # (B, G, T)  log pi_new(y_t | x, y_{<t})
    logp_old,   # (B, G, T)  log pi_old(y_t | x, y_{<t})
    logp_ref,   # (B, G, T)  log pi_ref(y_t | x, y_{<t})
    rewards,    # (B, G)     scalar reward per completion
    mask,       # (B, G, T)  1 for response tokens, 0 for prompt / padding
    *,
    epsilon: float = 0.2,
    beta: float = 1e-3,
):
    """
    DeepSeek R1's GRPO loss. No critic, no value head, no GAE.
    Reference (pi_ref) is the post-SFT model that the policy started from.
    """
    # --- Step 1: group-normalised advantages (no critic) -----------------
    mu = rewards.mean(dim=1, keepdim=True)                  # (B, 1)
    sd = rewards.std(dim=1, keepdim=True, unbiased=False).clamp(min=1e-8)  # (B, 1), population std as in the plain-Python listing
    A  = (rewards - mu) / sd                                # (B, G)
    A  = A.unsqueeze(-1)                                    # (B, G, 1) -> broadcast over T

    # --- Step 2: token-level log-ratio and policy ratio ------------------
    log_ratio = logp_new - logp_old                         # (B, G, T)
    ratio     = log_ratio.exp()                             # (B, G, T)

    # --- Step 3: PPO clipped surrogate, per token ------------------------
    unclipped = ratio * A
    clipped   = ratio.clamp(1 - epsilon, 1 + epsilon) * A
    surrogate = torch.minimum(unclipped, clipped)           # (B, G, T)

    # --- Step 4: GRPO unbiased KL estimator (vs reference) ---------------
    # KL = pi_ref/pi_new - log(pi_ref/pi_new) - 1, non-negative, zero at equality
    log_ref_new = logp_ref - logp_new
    kl_token    = log_ref_new.exp() - log_ref_new - 1.0     # (B, G, T)

    # --- Step 5: mask + mean over response tokens, then over group/batch -
    objective = surrogate - beta * kl_token                 # (B, G, T)
    objective = (objective * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    return -(objective.mean())                              # minimise -J

Three observations on what makes this code production-ready vs. toy:

Two policies, not one. $\pi_{\text{old}}$ is for the importance ratio; $\pi_{\text{ref}}$ is for the KL anchor. R1 keeps them separate because the old policy is refreshed every gradient step (so the ratio stays near 1) while the reference is refreshed every 100–500 steps (so the KL penalty stays meaningful). Conflating them collapses GRPO to standard PPO with a tiny KL coefficient, which loses the entire point.
Token-level masking is non-negotiable. The objective is averaged over response tokens only — never the prompt tokens or pad tokens. Skip the mask and the prompt's log-probability contributes to the gradient, which is exactly backward (you would be reinforcing the prompt instead of the response). Most reported "GRPO doesn't work" failures trace to a missing or wrong mask.
Group statistics use std.clamp(min=1e-8). Early in R1 training many prompts return all-zero or all-one reward groups (model is too weak or too strong). Without the clamp the divide is 0/0 = NaN, which propagates everywhere via the optimiser's second moment. The clamp keeps the advantage at zero for those prompts, so they contribute no gradient, which is the correct behaviour.
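The third observation is easy to confirm without PyTorch. The sketch below uses plain Python and a population standard deviation; `group_advantages` mirrors the clamp described above, while `has_learning_signal` is a hypothetical helper showing how a training loop might detect groups that contribute no gradient.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Population z-score over one prompt's group, with a floor on the std."""
    g = len(rewards)
    mu = sum(rewards) / g
    sd = max(math.sqrt(sum((r - mu) ** 2 for r in rewards) / g), eps)
    return [(r - mu) / sd for r in rewards]

def has_learning_signal(rewards):
    """Hypothetical check: a group is informative only if its rewards differ."""
    return max(rewards) != min(rewards)

all_wrong = [0.0] * 16
all_right = [1.0] * 16
mixed     = [1.0] * 9 + [0.0] * 7

for name, group in [("all wrong", all_wrong), ("all right", all_right), ("mixed", mixed)]:
    adv = group_advantages(group)
    print(f"{name:>9}: signal={has_learning_signal(group)} "
          f"max|A|={max(abs(a) for a in adv):.3f}")
```

Uniform groups come out with every advantage exactly zero rather than NaN, so they are harmless to the optimiser but also wasted compute.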

Implementation note. Computing $\text{logp\_new}, \text{logp\_old}, \text{logp\_ref}$ on the same $(B, G, T)$ tensor takes three forward passes per update. For $B = 1024, G = 16, T \approx 4000$ each pass covers roughly $6.5 \cdot 10^{7}$ tokens (about $2 \cdot 10^{8}$ across all three), comparable to a small SFT epoch. The trick R1 uses is to keep $\pi_{\text{ref}}$ frozen for hundreds of steps and cache its log-probs, eliminating one of the three forward passes most of the time.
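The token count quoted in the note can be checked with direct arithmetic. This is a minimal sketch; the helper name `forward_tokens` is hypothetical and $T = 4000$ is the approximate response length given in the text, not a measured value.

```python
def forward_tokens(prompts, group_size, mean_len, passes):
    """Tokens pushed through the model per update for a given number of passes."""
    return prompts * group_size * mean_len * passes

B, G, T = 1024, 16, 4000   # T is the approximate length quoted in the text

one_pass  = forward_tokens(B, G, T, passes=1)   # a single log-prob pass
all_three = forward_tokens(B, G, T, passes=3)   # logp_new, logp_old, logp_ref

print(f"{one_pass:.2e} tokens per pass, {all_three:.2e} for three passes")
```

Dropping one of the three passes therefore saves about a third of the log-prob computation in each update.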

At Massive Scale: Why These Numbers and Not Others

Match the hyperparameters to the bottlenecks they relieve at the 671B-parameter scale R1 was trained on:

Bottleneck	Hyperparameter that addresses it	Why this value at frontier scale
Sampling cost (forward passes per prompt)	G = 16	Compute per step = G x forward-pass-cost; going from G = 16 to G = 32 doubles the rollout phase but shrinks the standard error of the group mean only by a factor of sqrt(2)
KV-cache memory at long context	max_len = 32K	Long CoT requires the headroom; KV-cache for 32K tokens at 671B x MoE ~ 80GB per GPU
Gradient noise (variance of MC estimator)	LR = 1e-6	AdamW step size eta * g; with sigma ~ 1 on the surrogate, eta = 1e-6 gives parameter drift ~1e-6 per step
Off-policy drift from rollout to update	1 PPO epoch / rollout	Each epoch widens rho; one epoch keeps clip almost inactive and saves the gradient bias correction
Catastrophic forgetting of SFT skills	beta = 1e-3 plus periodic pi_ref refresh	Heavy beta kills exploration; refreshing pi_ref re-anchors after the policy has discovered new behaviours
Communication overhead (sync across nodes)	Batch = 1024 prompts	Each gradient step amortises one all-reduce; small batches make communication dominate

Two further constraints become visible only at this scale. First, the rollout phase is sequential per prompt (you cannot generate token 5 before token 4) but parallel across prompts. With $B = 1024$ prompts and $G = 16$ completions each, R1 generates $16384$ sequences in parallel — saturating the inference cluster the rollouts run on. Increasing $G$ beyond 16 would either require more inference GPUs or longer wall-clock per step. Second, the gradient phase is one optimiser step on the entire $16384$ -completion batch. Larger effective batches would amortise communication better but run out of GPU memory for the activations; $G = 16$ sits at the upper edge of what fits.

Why open-source GRPO recipes pick different numbers. Smaller labs cannot afford R1's $16384$-completion batch. Open-source GRPO recipes typically shrink to $B = 64$–$128$ with the same $G = 16$, producing an effective batch of $1024$–$2048$. To compensate for the noisier gradient they often add a small amount of weight decay, raise $\beta$ to $10^{-2}$, or drop the LR to $5 \cdot 10^{-7}$. Each of those adjustments is a downscaling of R1's recipe to the available compute, not an improvement on it.

Engineering Reality and Gotchas

Three failure modes earn their flag in every production GRPO log:

Reward-zero groups silently kill the gradient. When the model is too weak to solve a prompt, all $G$ completions get reward 0, the group std is zero, the clamp returns $10^{-8}$ , and the advantage becomes a vector of zeros — which contributes no gradient. Early in R1 training this happens to ~30% of prompts. The R1 fix is curriculum-style prompt filtering: drop prompts whose recent group-pass-rate is at 0% or 100% from the training mix, because they contribute no learning signal in either case. Without this filter, ~40% of compute is spent on prompts that contribute nothing.
The reference policy must be refreshed, but not too often. Refresh $\pi_{\text{ref}}$ every step and GRPO collapses to standard PPO; refresh never and KL grows without bound until $\beta \cdot KL$ dominates the surrogate and learning stalls. R1's reported schedule refreshes every 100–500 steps; the right number depends on how fast per-token KL is growing. A practical rule: refresh when mean per-token KL exceeds $0.5$ (corresponding to $\beta \cdot KL \approx 5 \cdot 10^{-4}$ , half the surrogate signal).
Length normalisation is the single biggest hidden knob. The factor $1/|y_i|$ in the objective makes long completions contribute the same gradient weight as short ones. This is R1's choice, but Dr.GRPO (Section 15.5) argues it implicitly rewards verbosity and drops it. Switching the normalisation between these two forms — one line of code — changes whether the model learns to think more or less. There is no universally right answer; it depends on what the reward measures and what trade-off you want.
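The effect of the $1/|y_i|$ factor is easiest to see on two completions of very different length. The sketch below contrasts the per-sequence mean used in the objective above with one way of dropping it, dividing every completion by the same constant. Both function names are hypothetical, and the constant-divisor form is only an illustration in the spirit of Dr.GRPO, not a transcription of that method.

```python
def per_sequence_mean(token_objectives):
    """Average within each completion, then across completions (the 1/|y_i| form)."""
    return sum(sum(seq) / len(seq) for seq in token_objectives) / len(token_objectives)

def constant_normalised(token_objectives, max_len):
    """Divide every completion by the same constant, so each token weighs the same."""
    return sum(sum(seq) / max_len for seq in token_objectives) / len(token_objectives)

# Two wrong completions with a per-token objective of -1: one short, one long.
short_wrong = [-1.0] * 10
long_wrong  = [-1.0] * 1000
group = [short_wrong, long_wrong]

print(per_sequence_mean(group))            # both completions count equally
print(constant_normalised(group, 1000))    # the long completion dominates
```

Under the per-sequence mean, each token of the 1000-token answer carries one hundredth of the weight of a token in the 10-token answer; under a constant divisor every token carries the same weight. Which behaviour is preferable depends on what the reward is meant to encourage.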

How a frontier team validates a hyperparameter set before committing. Three checks. (a) Run one prompt through the update loop and inspect the per-completion advantage, ratio, clip flag, and KL by hand — if any are NaN or out of plausible range, the loss is wrong before scale matters. (b) Run a 1k-step pilot at full scale and watch four curves: reward mean, surrogate mean, KL mean, fraction clipped. If KL mean exceeds 1 before step 1000, or fraction clipped exceeds 30%, or surrogate stays at zero, abort and re-tune. (c) Only after these two checks pass do you commit to the full training run. R1's reported numbers passed all three; most attempts to reproduce R1 with classic-PPO defaults fail check (b) within 100 steps.
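The abort criteria in check (b) translate directly into a small guard function. This is a sketch under stated assumptions: the function name, argument names, and the tolerance used to decide that the surrogate "stays at zero" are hypothetical; the thresholds themselves (KL mean above 1 before step 1000, more than 30% clipped) are the ones given above.

```python
def pilot_abort_reasons(step, kl_mean, frac_clipped, surrogate_history, tol=1e-6):
    """Return a list of reasons to abort the pilot; an empty list means continue."""
    reasons = []
    if step < 1000 and kl_mean > 1.0:
        reasons.append("KL mean exceeded 1 before step 1000")
    if frac_clipped > 0.30:
        reasons.append("fraction clipped exceeded 30%")
    if surrogate_history and all(abs(s) < tol for s in surrogate_history):
        reasons.append("surrogate stayed at zero")
    return reasons

healthy = pilot_abort_reasons(step=400, kl_mean=0.02, frac_clipped=0.03,
                              surrogate_history=[0.01, 0.04, 0.03])
broken = pilot_abort_reasons(step=400, kl_mean=1.7, frac_clipped=0.45,
                             surrogate_history=[0.0, 0.0, 0.0])
print(healthy)
print(broken)
```

Returning the reasons rather than a bare boolean keeps the log readable when more than one curve goes wrong at once.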

The one sentence to carry forward: GRPO's magic is in the recipe, not the equation. $G = 16, \epsilon = 0.2, \beta = 10^{-3}, \eta = 10^{-6}$ is the stable point in hyperparameter space at which rule-based rewards on math prompts drove a 671B model from imitation into emergent reasoning, and the rest of this chapter is about why that point exists, what perturbs it, and how the variants in Section 15.5 move along its frontier.

Where we go from here. Section 15.4 turns to reward design. The rule-based $r \in \{0, 1\}$ we assumed throughout this section is a one-line description of an artefact that, in practice, accounts for as much of R1's success as the GRPO algorithm itself. We will see why DeepSeek deliberately rejected learned reward models, what the format-reward looks like, and how rule-based rewards interact with the hyperparameters fixed here.