Chapter 15

GRPO Variants: DAPO, Dr.GRPO, and Olmo 3

GRPO: Group Relative Policy Optimisation

The Real Problem: Vanilla GRPO Has Three Biases

The previous section implemented GRPO from scratch. It worked. It also has three biases that nobody noticed for the first six months of the algorithm's life, and that turn out to compound viciously as you scale to longer chains of thought, harder problems, and bigger models.

Here is the picture concretely. You finish a DeepSeek-R1-style training run on a 7B base. AIME accuracy goes from 13% to 31%. You scale the same recipe to 32B and re-run on the same data. Accuracy improves to 38% — but the chains of thought are now 3000 tokens long, the model occasionally outputs the same paragraph six times, and a quarter of your gradient steps over the last 20% of training produced zero net change because every rollout in those groups happened to score identically. The recipe works, but it has plateaued in a way that the original GRPO paper does not predict.

Between mid-2024 and early-2026 three groups identified what was going wrong, each from a different angle, each with a different patch:

  1. Length bias. Vanilla GRPO averages the per-token loss within each rollout before averaging across the group. A 1000-token rollout contributes the same total gradient as a 100-token rollout — meaning each individual token in the longer rollout pulls 10x less than a token in the short one. Long chains of thought are systematically under-weighted. ByteDance's DAPO paper named this and fixed it.
  2. Difficulty bias. The $\sigma_G$ divisor in the advantage formula inflates gradients on easy prompts (small variance) and shrinks gradients on hard prompts (large variance). That is exactly backwards from what you want: easy prompts have nothing to teach and hard prompts have everything to teach. NUS/Sea's Dr.GRPO paper named this and fixed it.
  3. Wasted-group bias. When every rollout in a group gets the same reward (all wrong OR all right) the advantage is zero, the gradient is zero, and the entire forward+backward pass on that group was paid for with no learning. Vanilla GRPO does nothing about this. DAPO's dynamic sampling rejects those groups and resamples another prompt. Allen AI's Olmo 3 recipe wraps these fixes into a production-grade RLVR pipeline.

The unifying observation: none of these variants changes the underlying RL objective. They are all reweightings of the same gradient signal. The signal was there in vanilla GRPO; it was just allocated wrong. The fixes are in how the per-token contributions are summed, which advantages get the σ divisor, and which groups get a gradient step at all.

The lesson for anyone running RL at scale: when an algorithm plateaus, look at the reduction step — how per-token terms become per-rollout terms become per-batch terms. That is where the silent bias accumulates, and that is the lever the post-GRPO generation pulls.

Intuition: Each Variant Is a Surgical Patch

Imagine a teacher grading homework from four students. Each student turns in an answer of different length. The teacher has a fixed amount of red-pen attention to spend.

  • Vanilla GRPO says: "I will spend the same attention on each student's page, regardless of length." The student who wrote 10 lines gets 10x more red-pen-per-line than the student who wrote 100 lines. Each individual sentence in the long answer gets shallowly corrected; the short answer gets deeply critiqued line-by-line.
  • DAPO says: "I will spend the same attention per line, across every student." The long answer gets more total ink, the short answer gets less total ink, but each sentence is read with equal care. Sentence-level feedback is fair.
  • Dr.GRPO says: "The grading scale should stay constant. I will not silently make easy problems louder than hard problems just because the class agreed on easy ones." When the variance in the class is small (everyone scored similarly) the grading is naturally subdued. When the variance is large (there is real disagreement) the grading stays at the same scale, so hard problems get their fair share of correction.

DAPO and Dr.GRPO are complementary, not competing. DAPO fixes the per-token weighting; Dr.GRPO fixes the per-prompt weighting. You can apply both at once — and Olmo 3 does, plus production engineering for throughput.

The pattern these papers exemplify: a published RL algorithm is almost never the optimum. It is the version that was good enough to be published. Within 6–12 months somebody finds that one of its arithmetic conveniences is a load-bearing bug, names the bug, and ships a patch. GRPO is no exception. Watch for the same trajectory with whatever supersedes GRPO in 2027.

The Math: One Loss, Three Reweightings

Recall the GRPO loss from the previous section. For a group of $G$ rollouts $a_1, \ldots, a_G$ sampled from a single prompt $s$, with per-token log-probabilities $\log \pi_\theta(a_{i,t} \mid s, a_{i,<t})$ and per-token importance ratios $\rho_{i,t} = \pi_\theta(a_{i,t}) / \pi_{\theta_{\text{old}}}(a_{i,t})$, the loss is:

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{L_i} \sum_{t=1}^{L_i} \min\!\big(\rho_{i,t} A_i,\; \mathrm{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\, A_i\big) + \beta\, \mathrm{KL}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$

and the advantage is $A_i = (R_i - \mu_G) / (\sigma_G + \varepsilon)$ with $\mu_G = \tfrac{1}{G}\sum_j R_j$, $\sigma_G^2 = \tfrac{1}{G}\sum_j (R_j - \mu_G)^2$. The clip range $\epsilon = 0.2$ is symmetric.

Look at the two reductions: $\tfrac{1}{G} \sum_i \tfrac{1}{L_i} \sum_t$. Each token contributes with weight $1 / (G \cdot L_i)$. A token in a 100-token rollout has weight $1/(G \cdot 100)$; a token in a 1000-token rollout has weight $1/(G \cdot 1000)$. That is the length bias, in one line.
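The weights can be checked numerically. The sketch below is illustrative only: the helper names and the two-rollout group are assumptions, and the second helper shows, for contrast, what a single group-wide token mean would assign.

```python
def grpo_token_weights(lengths):
    """Per-token weight under vanilla GRPO: 1 / (G * L_i) for rollout i."""
    group_size = len(lengths)
    return [1.0 / (group_size * length) for length in lengths]


def token_mean_weights(lengths):
    """Per-token weight under a group-wide token mean: 1 / sum_j L_j."""
    total_tokens = sum(lengths)
    return [1.0 / total_tokens for _ in lengths]


lengths = [100, 1000]  # hypothetical group: one short and one long rollout
grpo_w = grpo_token_weights(lengths)
flat_w = token_mean_weights(lengths)

# How much harder a token in the short rollout pulls than one in the long rollout.
print(grpo_w[0] / grpo_w[1])  # about 10 under vanilla GRPO
print(flat_w[0] / flat_w[1])  # exactly 1.0 under the token mean

# Total weight per rollout = per-token weight * token count.
grpo_totals = [w * n for w, n in zip(grpo_w, lengths)]
flat_totals = [w * n for w, n in zip(flat_w, lengths)]
print(grpo_totals)  # equal totals, regardless of length
print(flat_totals)  # totals proportional to length
```

The per-rollout mean equalises total weight across rollouts, which is exactly what makes each token of a long rollout count for less.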

DAPO's loss

DAPO replaces the per-rollout mean with a group-wide token-count normaliser:

$$\mathcal{L}_{\text{DAPO}}(\theta) = -\frac{1}{\sum_{j=1}^{G} L_j} \sum_{i=1}^{G} \sum_{t=1}^{L_i} \min\!\big(\rho_{i,t} A_i,\; \mathrm{clip}(\rho_{i,t}, 1-\epsilon_{\text{lo}}, 1+\epsilon_{\text{hi}})\, A_i\big) + \beta\, \mathrm{KL}$$

Two structural changes from vanilla GRPO:

  1. The denominator is $\sum_j L_j$, the total number of tokens in the entire group. Every token now has weight $1 / \sum_j L_j$ regardless of which rollout it came from. Length bias removed.
  2. The clip becomes asymmetric: $\epsilon_{\text{lo}} = 0.2$, $\epsilon_{\text{hi}} = 0.28$. The lower bound stays conservative; the upper bound opens up, so unlikely-but-good tokens can climb faster.

DAPO also adds two non-loss patches — dynamic sampling (reject groups where every rollout has the same reward) and overlong reward shaping (linearly decay reward beyond a soft length budget). Those live outside the loss formula but inside the training loop.
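As a rough illustration of the formula above, here is a minimal pure-Python sketch of the token-level clipped surrogate. The function name and argument layout are assumptions made for this example, the KL term is left out for brevity, and a real implementation would operate on tensors with gradients rather than lists of floats.

```python
import math


def token_level_clipped_loss(logp_new, logp_old, advantages,
                             eps_low=0.2, eps_high=0.28):
    """Sketch of the token-level surrogate (KL term omitted).

    logp_new, logp_old: one list of per-token log-probs per rollout.
    advantages: one scalar advantage per rollout.
    """
    total_tokens = sum(len(seq) for seq in logp_new)
    accum = 0.0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)
            # Asymmetric clip: [1 - eps_low, 1 + eps_high].
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            accum += min(ratio * adv, clipped * adv)
    # One normaliser for the whole group: the total token count.
    return -accum / total_tokens


# Right after sampling the ratio is 1, so the loss is a token-weighted
# average of the (negated) advantages.
old = [[-1.0, -1.0], [-2.0, -2.0, -2.0]]
print(token_level_clipped_loss(old, old, advantages=[1.0, -1.0]))
```

With a 2-token rollout at advantage +1 and a 3-token rollout at advantage −1, the longer rollout dominates the sum, which is the intended effect of the group-wide denominator.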

Dr.GRPO's loss

Dr.GRPO keeps DAPO's reduction but changes the advantage:

$$A_i^{\text{Dr}} = R_i - \mu_G$$

No $\sigma_G$ in the denominator. The full Dr.GRPO loss is:

$$\mathcal{L}_{\text{Dr.GRPO}}(\theta) = -\frac{1}{\sum_{j=1}^{G} L_j} \sum_{i, t} \min\!\big(\rho_{i,t} A_i^{\text{Dr}},\; \mathrm{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\, A_i^{\text{Dr}}\big) + \beta\, \mathrm{KL}$$

Why removing $\sigma_G$ matters: the std divisor renormalises the advantage scale to $|A_i| \sim O(1)$ regardless of how large or small the underlying reward differences are. That sounds good, until you notice that for an easy prompt where 7/8 rollouts succeed, $\sigma_G$ is small, so every reward gap on that prompt is scaled up, while for a hard prompt with mixed outcomes, $\sigma_G$ is large, so the same gap is scaled down. The policy ends up spending more of its gradient budget on prompts it already mostly solves.
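One way to see the effect is to compare how much total advantage magnitude an easy prompt receives relative to a mixed-outcome prompt, with and without the divisor. This is a sketch under stated assumptions: the two reward lists are made up, and `advantage_mass` (the sum of $|A_i|$ over the group) is a hypothetical proxy for gradient budget, not a quantity from either paper.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Centred rewards divided by the population std of the group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Centred rewards only, left in reward units."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def advantage_mass(advantages):
    """Hypothetical proxy for gradient budget: sum of |A_i| over the group."""
    return sum(abs(a) for a in advantages)


easy = [1.1] * 7 + [0.0]        # 7 of 8 rollouts succeed
mixed = [1.1] * 4 + [0.0] * 4   # outcomes split evenly

grpo_ratio = advantage_mass(grpo_advantages(easy)) / advantage_mass(grpo_advantages(mixed))
dr_ratio = advantage_mass(dr_grpo_advantages(easy)) / advantage_mass(dr_grpo_advantages(mixed))
print(f"easy/mixed advantage mass with sigma:    {grpo_ratio:.2f}")
print(f"easy/mixed advantage mass without sigma: {dr_ratio:.2f}")
```

The ratio is larger with the divisor than without it, meaning the easy prompt claims a bigger share of the update when advantages are std-normalised.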

Olmo 3's composite recipe

Allen AI's Olmo 3 RLVR recipe (early 2026) does not introduce a new loss. It picks the best pieces of DAPO and Dr.GRPO, fixes the engineering, and ships:

  • Dr.GRPO advantage (no $\sigma_G$) plus token-level reduction (the DAPO denominator).
  • Dynamic sampling at the prompt level: a curriculum-aware sampler that retires prompts whose policy success rate has hit either 0% or above 95%, and biases the buffer toward 30–70% success-rate prompts.
  • Asymmetric clip ($\epsilon_{\text{lo}} = 0.2$, $\epsilon_{\text{hi}} = 0.3$), a touch looser on the upper bound than DAPO.
  • No KL term on rule-verified tasks (math, code). The bounded reward and the dynamic-sampling filter are enough to prevent reward hacking; the KL term mostly slowed convergence. KL is kept for taste-based tasks (helpfulness, safety) where there is no programmatic ground truth.
  • Streaming rollouts: rollouts that finish early are processed as they arrive instead of padded to the group's max length. This is a throughput patch, not a math patch, but it interacts with token-level reduction in a load-bearing way.

The Olmo 3 lesson: by the time you ship a real RL run for a frontier model, "the algorithm" is twenty named tricks layered on a textbook objective. Each individual trick gains 5–15% in training efficiency or final score. The composite gain is multiplicative: Olmo 3 reports roughly 3x improvement in steps-to-target over a vanilla GRPO baseline on the same hardware.

DAPO: Decoupled Clip and Dynamic Sampling

ByteDance released DAPO in early 2025 alongside their Qwen-2.5-32B post-training run. The paper's headline benchmark: 50% on AIME 2024 with a 32B model trained on half the FLOPs of the DeepSeek-R1 recipe. The gain decomposes into four named patches.

Patch 1: Clip-Higher

Vanilla PPO/GRPO clips the importance ratio symmetrically: $\rho \in [1-0.2, 1+0.2]$. The ByteDance team observed that this symmetric clip systematically suppresses the upward direction. Concretely: a token that the old policy assigned probability 0.05 to, but that turned out to be the right token (positive advantage), can have its probability raised by at most 20% per step: to 0.06, then 0.072, then 0.086, and so on. After 7 steps it has only reached about 0.18. The policy cannot climb out of a confident-but-wrong basin fast enough.

Clip-higher widens the upper bound asymmetrically: $\rho \in [1-0.2, 1+0.28]$. The lower bound stays at 0.8, so you still cannot drop a token's probability by more than 20% per step, and the policy cannot collapse away from a currently-good token too quickly. But the upper bound at 1.28 lets the same low-probability-but-good token climb 28% per step, which compounds to 0.05 → 0.064 → 0.082 and reaches about 0.28 after the same 7 steps: measurably faster.

Why not just use a larger symmetric clip? Because you would also let the policy drop currently-good-tokens 28% per step, which destabilises any rollout that contains a single unlucky token. The asymmetry exists because the FALL direction carries more risk than the CLIMB direction — falling means forgetting; climbing means exploring.
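The compounding can be made concrete with a small sketch. It rests on an idealised assumption: every update moves the token's probability by the full amount the upper clip bound allows, which real training rarely does. `steps_to_reach` is a hypothetical helper and the 0.5 target is chosen only for illustration.

```python
def steps_to_reach(p0, target, max_ratio):
    """Updates needed for probability p0 to reach target when each update
    multiplies it by max_ratio (idealised: the clip saturates every step)."""
    steps, p = 0, p0
    while p < target:
        p = min(1.0, p * max_ratio)
        steps += 1
    return steps


print(steps_to_reach(0.05, 0.5, 1.20))  # symmetric clip      → 13
print(steps_to_reach(0.05, 0.5, 1.28))  # clip-higher (DAPO)  → 10
```

Under this idealisation the wider upper bound saves roughly a quarter of the updates needed to lift a 5% token to 50%.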

Patch 2: Dynamic Sampling

In the previous section we noted that vanilla GRPO gives zero gradient on groups where every rollout has the same reward. In practice this is not rare — for a model partway through training on a math problem, the "all wrong" case is common (the policy has not figured this problem out yet) and so is the "all right" case (this problem is already in the training distribution). DAPO measured the rate empirically on their 32B run and found 30–40% of groups were dead for most of training.

The fix is straightforward: before taking the gradient step, check whether the group has nonzero variance. If not, reject it and sample a different prompt. The rollout cost was sunk; the backward-pass cost is saved. Across a million-step run this recovers roughly half the wasted FLOPs.

At larger scale the implementation gets fancier — you keep a rolling per-prompt success-rate estimate, drop prompts that have gone to 0% or 100% success over the last N rollouts, and weight the prompt buffer toward the 30–70% success-rate band where the gradient signal is densest.
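A minimal sketch of the basic filter-and-resample loop follows. `is_useful_group` mirrors the check described above; `collect_useful_groups`, `rollout_fn`, and the toy reward table are hypothetical stand-ins for the real generation and verification stack, not DAPO's actual implementation.

```python
import random


def is_useful_group(rewards, tol=1e-8):
    """True when the rewards in a group are not all (numerically) identical."""
    return max(rewards) - min(rewards) > tol


def collect_useful_groups(prompts, rollout_fn, n_groups, max_tries=1000, rng=random):
    """Keep sampling prompts until n_groups non-degenerate groups are found."""
    kept, tries = [], 0
    while len(kept) < n_groups and tries < max_tries:
        prompt = rng.choice(prompts)
        rewards = rollout_fn(prompt)   # assumed: generate G rollouts and score them
        tries += 1
        if is_useful_group(rewards):
            kept.append((prompt, rewards))
    return kept


# Toy stand-in for generation + verification: fixed rewards per prompt.
TOY_REWARDS = {
    "solved": [1.1, 1.1, 1.1, 1.1],
    "unsolved": [0.0, 0.0, 0.0, 0.0],
    "mixed": [1.1, 0.1, 0.0, 1.1],
}
groups = collect_useful_groups(list(TOY_REWARDS), TOY_REWARDS.get,
                               n_groups=3, rng=random.Random(0))
print([prompt for prompt, _ in groups])  # only "mixed" passes the filter
```

The `max_tries` cap is a design choice for the sketch: late in training most prompts may be degenerate, and an uncapped loop could spin indefinitely.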

Patch 3: Token-Level Policy Gradient Loss

The length bias we keep mentioning lives in vanilla GRPO's reduction formula: $\tfrac{1}{G} \sum_i \tfrac{1}{L_i} \sum_t (\ldots)$. DAPO replaces it with $\tfrac{1}{\sum_j L_j} \sum_{i,t} (\ldots)$. Two consequences:

  1. Each token in the group has equal weight in the gradient regardless of which rollout it came from. Long chains of thought are no longer down-weighted per-token.
  2. A long rollout contributes proportionally more total gradient than a short rollout of the same advantage. This is the intuitively-correct outcome but it has a side effect: the model gets a marginal incentive to generate longer outputs. Hence patch 4.

Patch 4: Overlong Reward Shaping

With token-level loss, a model can game its reward by padding correct-format tokens into a long, rambling answer. DAPO penalises this with a soft length budget. Let $L_{\text{soft}}$ be the budget (typically 4096 tokens), $L_{\text{max}}$ the hard cap (typically 8192). The shaped reward is:

$$\tilde R(s, a) = R(s, a) \cdot \begin{cases} 1 & L \leq L_{\text{soft}} \\ \dfrac{L_{\text{max}} - L}{L_{\text{max}} - L_{\text{soft}}} & L_{\text{soft}} < L < L_{\text{max}} \\ 0 & L \geq L_{\text{max}} \end{cases}$$

Linear ramp from full reward at $L_{\text{soft}}$ to zero reward at $L_{\text{max}}$. The verifier still says the answer is right; the shaped reward says "but you took too long to say it". This is the cheapest possible anti-rambling regulariser: no extra model, no extra forward pass, just a multiplicative scaling on the existing reward.
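The piecewise formula translates directly into a few lines of Python. This is a sketch of the multiplicative shaping as written in this section; the function name is an assumption and the defaults are the typical budgets quoted above.

```python
def shape_overlong(reward, length, l_soft=4096, l_max=8192):
    """Scale reward by a linear ramp between the soft budget and the hard cap."""
    if length <= l_soft:
        return reward            # within budget: reward unchanged
    if length >= l_max:
        return 0.0               # at or past the hard cap: reward zeroed
    return reward * (l_max - length) / (l_max - l_soft)


for length in (2000, 4096, 6144, 8192):
    print(length, shape_overlong(1.1, length))
```

A rollout halfway between the two budgets (6144 tokens) keeps half of its reward; anything at or beyond the hard cap keeps none.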

Composition of the four patches: Clip-higher (1) is independent of the others. Dynamic sampling (2) is independent of the others. Token-level loss (3) introduces a length incentive that overlong reward shaping (4) cancels. Patches 3 and 4 must be applied together; patches 1 and 2 can be added piecemeal to any GRPO baseline as drop-in improvements.

Dr.GRPO: GRPO Done Right

The Sea/NUS team published Dr.GRPO in mid-2025 ("Dr." for "done right"). Their thesis is one sentence: the $\sigma_G$ divisor in vanilla GRPO is mathematically convenient but empirically harmful. Remove it.

The argument unfolds in three steps.

Step 1: Why σ was added in the first place

DeepSeek's GRPO paper motivated the std divisor as "normalisation": they wanted advantages with $|A_i| \sim O(1)$ regardless of the underlying reward scale. With a bounded reward in $[0, R_{\max}]$, the divisor turns reward units into std-units so the optimiser's learning rate is scale-invariant. Plausible argument; consistent with PPO's per-batch advantage normalisation; very natural to write down.

Step 2: What σ actually does at training time

Consider two prompts in the same batch:

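| Prompt | Rollouts that succeed | $\mu_G$ | $\sigma_G$ | Advantage of a winner |
| --- | --- | --- | --- | --- |
| Easy | 7 out of 8 | 0.9625 | 0.364 | (1.1 − 0.9625)/0.364 ≈ +0.38 |
| Hard | 3 out of 8 | 0.4125 | 0.533 | (1.1 − 0.4125)/0.533 ≈ +1.29 |

Both rows assume winners score 1.1 and losers score 0.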

That table looks reasonable. The hard prompt's winner gets a bigger advantage. But now flip it: consider a partial-credit prompt where most rollouts got format-only (reward 0.1) and a couple got full credit (reward 1.1):

| Prompt | Rewards | $\mu_G$ | $\sigma_G$ | $A$ of the top-scoring rollout |
| --- | --- | --- | --- | --- |
| Easy partial-credit | [1.1, 1.1, 1.1, 0.1, 0.1, 1.1, 1.1, 1.1] | 0.85 | 0.43 | +0.58 |
| Hard partial-credit | [0.1, 1.1, 0.1, 1.1, 0.1, 0.1, 1.1, 0.1] | 0.475 | 0.48 | +1.29 |
| Bimodal hard | [0, 1.1, 0, 1.1, 0, 1.1, 0, 1.1] | 0.55 | 0.55 | +1.00 |
| Tight cluster (easy) | [1.05, 1.07, 1.10, 1.10, 1.04, 1.08, 1.09, 1.10] | 1.079 | 0.022 | +0.96 (huge push for a 0.02 reward gap) |

The last row is the failure mode. A prompt where the policy is already getting near-maximum reward, with all 8 rollouts clustered in $[1.04, 1.10]$, produces an advantage of +0.96 with $\sigma_G = 0.022$: a reward gap of about 0.02 is pushed almost as hard as the 0.55 gap in the bimodal row. The policy gradient will spend disproportionate effort on this nearly-solved prompt instead of on the hard ones. Conversely, a prompt with high variance, exactly where you want a strong signal, gets its signal divided down.

Step 3: The fix

Remove the divisor. Use $A_i = R_i - \mu_G$. The advantage scale now lives in the original reward units. A clustered easy prompt produces tiny advantages (correct: there is little to learn). A bimodal hard prompt produces large advantages (correct: there is a lot to learn). The optimiser's learning rate now has to absorb the reward scale, which is a one-time tuning rather than a per-prompt artefact.
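The per-row statistics can be recomputed directly from the reward lists. The sketch below uses a hypothetical helper, `winner_stats`, which reports the group mean, the population standard deviation, and the top-scoring rollout's advantage both in raw reward units and divided by $\sigma_G$.

```python
from statistics import mean, pstdev


def winner_stats(rewards, eps=1e-6):
    """Return (mu, sigma, raw advantage, std-normalised advantage) of the best rollout."""
    mu = mean(rewards)
    sigma = pstdev(rewards)          # population std: divides by G, not G - 1
    gap = max(rewards) - mu
    return mu, sigma, gap, gap / (sigma + eps)


groups = {
    "easy partial":  [1.1, 1.1, 1.1, 0.1, 0.1, 1.1, 1.1, 1.1],
    "hard partial":  [0.1, 1.1, 0.1, 1.1, 0.1, 0.1, 1.1, 0.1],
    "bimodal hard":  [0.0, 1.1, 0.0, 1.1, 0.0, 1.1, 0.0, 1.1],
    "tight cluster": [1.05, 1.07, 1.10, 1.10, 1.04, 1.08, 1.09, 1.10],
}
for name, rewards in groups.items():
    mu, sigma, raw, normalised = winner_stats(rewards)
    print(f"{name:14s} mu={mu:.3f} sigma={sigma:.3f} "
          f"raw={raw:+.3f} normalised={normalised:+.3f}")
```

Comparing the `raw` and `normalised` columns for the tight cluster against the bimodal group shows the point of the fix: in reward units the two gaps differ by more than an order of magnitude, while after dividing by $\sigma_G$ they are nearly the same size.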

The empirical headline: on the same hardware and the same data, Dr.GRPO matched or beat vanilla GRPO on 12 of 14 reasoning benchmarks the paper tested, with the largest gains on out-of-distribution math problems — exactly where the difficulty-bias of vanilla GRPO would hurt most.

Olmo 3: A Production RLVR Recipe

Allen AI's Olmo 3 (late 2025 / early 2026) is the first frontier open-source model trained end-to-end with the variants stack. It does not introduce new math. It introduces a production engineering recipe for combining the existing patches, plus a curriculum that exploits dynamic sampling at the batch level.

The Olmo 3 loss

$$\mathcal{L}_{\text{Olmo}}(\theta) = -\frac{1}{\sum_j L_j} \sum_{i, t} \min\!\big(\rho_{i,t} (R_i - \mu_G),\; \mathrm{clip}(\rho_{i,t}, 0.8, 1.3)\, (R_i - \mu_G)\big)$$

No KL term for verifiable tasks. No $\sigma_G$. DAPO-style token-level reduction. Asymmetric clip slightly looser than DAPO's. Plus a prompt curriculum built on top of dynamic sampling:

  1. Maintain a per-prompt success-rate estimate over a rolling window of $N = 64$ rollouts.
  2. Bucket prompts into [0–10%, 10–30%, 30–70%, 70–90%, 90–100%] success bands.
  3. Sampling weight is roughly $w(\text{band}) \propto [0.05, 0.20, 0.50, 0.20, 0.05]$, so most of the gradient budget goes to the 30–70% band, where the policy is on the verge of solving the prompt and has the most to learn.
  4. Prompts in the 0–10% band are paused for $K = 10$ epochs in case the policy needs more general capability before retrying. Prompts in the 90–100% band are retired entirely.

The result: more than 80% of GPU time is spent on prompts where the policy can both succeed and fail — i.e. where the gradient signal is most informative. Compared to a uniform sampler with the same total dataset, Olmo 3 reports roughly 3x reduction in training tokens to reach the same target score.
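The band-weighted sampling step might look like the following sketch. It is not Olmo 3's implementation: the names are hypothetical, the band weight is applied per prompt (so a crowded band still collects more total mass), the handling of values exactly on a band edge is an arbitrary choice, and the pause and retire rules are left out.

```python
import random

BAND_EDGES = [0.10, 0.30, 0.70, 0.90]             # upper edges of the first four bands
BAND_WEIGHTS = [0.05, 0.20, 0.50, 0.20, 0.05]     # one weight per success band


def band_index(success_rate):
    """Map a rolling success rate in [0, 1] to a band index 0..4."""
    for i, edge in enumerate(BAND_EDGES):
        if success_rate < edge:
            return i
    return len(BAND_EDGES)


def sample_prompt(success_rates, rng=random):
    """success_rates: dict of prompt id -> rolling success rate."""
    ids = list(success_rates)
    weights = [BAND_WEIGHTS[band_index(success_rates[p])] for p in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


rates = {"too_hard": 0.02, "learnable": 0.50, "nearly_solved": 0.97}
rng = random.Random(0)
counts = {p: 0 for p in rates}
for _ in range(2000):
    counts[sample_prompt(rates, rng)] += 1
print(counts)  # "learnable" should dominate the draws
```

With one prompt per band here, the expected split is 0.05 : 0.50 : 0.05, so the mid-band prompt should receive roughly five out of every six draws.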

Why Olmo 3 drops the KL term: on rule-verified tasks, reward hacking against a deterministic verifier is structurally hard — the verifier is the ground truth, not an approximation. The KL term was originally a guard against the reward MODEL being gamed. With no reward model and only verifier code, the KL leash mostly slowed convergence without preventing misbehaviour. On taste-based tasks (RLHF for helpfulness/safety) Olmo 3 still uses KL — it is the verifiability of the reward, not a doctrine about KL, that decides whether to include it.

Side-by-Side Comparison

The four innovations distribute across the variants as follows:

| Feature | Vanilla GRPO | DAPO | Dr.GRPO | Olmo 3 |
| --- | --- | --- | --- | --- |
| Advantage formula | (R − μ) / σ | (R − μ) / σ | R − μ | R − μ |
| Loss reduction | mean per-rollout, mean per-group | sum over tokens / total tokens | sum over tokens / total tokens | sum over tokens / total tokens |
| Clip range | [0.8, 1.2] symmetric | [0.8, 1.28] asymmetric | [0.8, 1.2] symmetric | [0.8, 1.3] asymmetric |
| Dynamic sampling | No | Yes (per-group) | No (paper-level) | Yes (per-prompt curriculum) |
| Overlong reward shaping | No | Yes | No | Yes |
| KL to reference | Yes (β = 0.04) | Yes (β = 0.04) | Yes (β = 0.04) | No (verifiable tasks) |
| Notable user | DeepSeek-R1 | Qwen-2.5-32B reasoning | Sea-Reasoner | Olmo 3 base RL |
| Reported gain over GRPO | n/a | ≈ 2× tokens-to-target | ≈ 1.5× generalisation | ≈ 3× tokens-to-target |

A useful way to read the table: DAPO is the "fix the reduction" family, Dr.GRPO is the "fix the advantage" family, Olmo 3 is "apply both, plus engineering". None of these is a finished story — by the time you read this, expect 2–3 further variants to have appeared, each isolating a new bias that the current generation overlooks.
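The comparison can also be read as a set of switches on one training loop. The sketch below encodes each column as a hypothetical `VariantConfig`; the class and field names are assumptions made for illustration, and the values are copied from the comparison in this section rather than from any released training configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantConfig:
    normalise_by_std: bool        # divide the centred reward by sigma_G?
    token_level_reduction: bool   # sum over tokens / total tokens?
    eps_low: float                # lower clip offset
    eps_high: float               # upper clip offset
    dynamic_sampling: bool        # drop or avoid zero-variance groups?
    overlong_shaping: bool        # apply the soft length penalty?
    kl_beta: Optional[float]      # None means no KL term


VARIANTS = {
    "grpo": VariantConfig(normalise_by_std=True, token_level_reduction=False,
                          eps_low=0.2, eps_high=0.2, dynamic_sampling=False,
                          overlong_shaping=False, kl_beta=0.04),
    "dapo": VariantConfig(normalise_by_std=True, token_level_reduction=True,
                          eps_low=0.2, eps_high=0.28, dynamic_sampling=True,
                          overlong_shaping=True, kl_beta=0.04),
    "dr_grpo": VariantConfig(normalise_by_std=False, token_level_reduction=True,
                             eps_low=0.2, eps_high=0.2, dynamic_sampling=False,
                             overlong_shaping=False, kl_beta=0.04),
    "olmo3": VariantConfig(normalise_by_std=False, token_level_reduction=True,
                           eps_low=0.2, eps_high=0.3, dynamic_sampling=True,
                           overlong_shaping=True, kl_beta=None),
}

for name, cfg in VARIANTS.items():
    print(f"{name:8s} clip=[{1 - cfg.eps_low:.2f}, {1 + cfg.eps_high:.2f}] "
          f"std={cfg.normalise_by_std} token_level={cfg.token_level_reduction}")
```

Keeping the variants as data rather than as separate code paths makes it easy to ablate one switch at a time against the same baseline.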

Manual Numerical Walkthrough

Same setup as the previous section: prompt asks for the smaller root of $x^2 - 5x + 6 = 0$, ground truth is 2, reward formula $R = 1.0 \cdot r_{\text{answer}} + 0.1 \cdot r_{\text{format}}$. We will compute one training step under each variant and see how the same four rollouts produce three different gradient allocations.

One training step under GRPO, DAPO, and Dr.GRPO with the same rollouts

Setup. Four rollouts, all from the same prompt. Reward $R$ and token length $L$:

| i | Description | R | L (tokens) |
| --- | --- | --- | --- |
| 1 | Correct + well-formatted (short) | 1.10 | 220 |
| 2 | Wrong answer + correct format (rambling) | 0.10 | 950 |
| 3 | No tags, no extractable answer (rambling) | 0.00 | 1100 |
| 4 | Correct + well-formatted (short) | 1.10 | 180 |

Group statistics.

$\mu_G = \tfrac{1}{4}(1.10 + 0.10 + 0.00 + 1.10) = 0.575$

$\sigma_G^2 = \tfrac{1}{4}\sum_i (R_i - 0.575)^2 = \tfrac{1}{4}(0.2756 + 0.2256 + 0.3306 + 0.2756) = 0.2769$, so $\sigma_G \approx 0.526$.

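Step 1: Advantages under each variant.

| i | R_i | R_i − μ | GRPO/DAPO A_i = (R−μ)/σ | Dr.GRPO A_i = R−μ |
| --- | --- | --- | --- | --- |
| 1 | 1.10 | +0.525 | +0.998 | +0.525 |
| 2 | 0.10 | −0.475 | −0.903 | −0.475 |
| 3 | 0.00 | −0.575 | −1.093 | −0.575 |
| 4 | 1.10 | +0.525 | +0.998 | +0.525 |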

GRPO and DAPO share the same per-rollout advantages — they differ only in the loss reduction. Dr.GRPO's advantages are smaller in magnitude and stay in REWARD units, not std-units.

Step 2: Per-token gradient contribution. What pulls each token in each rollout. Assume the per-token ratio $\rho_{i,t} \approx 1$ at the first step (right after sampling) so we can ignore the clip-min for now.

| i | L_i | GRPO: A_i / L_i | DAPO: A_i (per token) | Dr.GRPO: A_i (per token) |
| --- | --- | --- | --- | --- |
| 1 | 220 | +0.00454 | +0.998 | +0.525 |
| 2 | 950 | −0.00095 | −0.903 | −0.475 |
| 3 | 1100 | −0.00099 | −1.093 | −0.575 |
| 4 | 180 | +0.00554 | +0.998 | +0.525 |

Read the GRPO column. Rollout 4 (short, good) pulls each of its 180 tokens with +0.00554. Rollout 3 (long, bad) pulls each of its 1100 tokens with only −0.00099. The magnitudes are off by a factor of 5.6 even though the underlying advantages differ in magnitude by less than 10% (0.998 vs 1.093). Length bias, visible in two columns of numbers.

Read the DAPO column. Per-token pulls are unaffected by length. A token in the long-bad rollout pulls with −0.903; a token in the short-good rollout pulls with +0.998. Now the gradient flows in proportion to the advantage, not the rollout-length artefact.

Read the Dr.GRPO column. Same length behaviour as DAPO. Smaller magnitudes because σ is no longer in the denominator. The optimiser learning rate has to be tuned a bit higher to compensate: Olmo 3 uses $5 \times 10^{-7}$ where vanilla GRPO used $1 \times 10^{-7}$.

Step 3: Total gradient share. Multiply the magnitude of the per-token contribution by the rollout length (the denominator's effect washes out across rollouts because it is the same constant). Normalise so the four shares sum to 100%:

| i | L_i | GRPO share | DAPO share | Dr.GRPO share |
| --- | --- | --- | --- | --- |
| 1 | 220 | 25% | 9% | 9% |
| 2 | 950 | 23% | 35% | 35% |
| 3 | 1100 | 27% | 49% | 49% |
| 4 | 180 | 25% | 7% | 7% |

The diagnosis. Under GRPO, a rollout's total share depends only on $|A_i|$ and never on its length (roughly a quarter each here), because the $1/L_i$ reduction makes long rollouts shed per-token signal to compensate for having more tokens. Under DAPO and Dr.GRPO, total gradient mass flows toward the longer rollouts even when their per-token advantages are negative. The two share columns are identical because $\sigma_G$ is a single constant within the group and cancels when the shares are normalised. This is the correct behaviour: a long confidently-wrong rollout should pull the policy strongly away from that mode, not get the same nudge as a short confidently-right one.
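The share computation can be reproduced in a few lines. This sketch assumes $\rho_{i,t} \approx 1$ as in Step 2, so each token's pull is just its rollout's advantage times the reduction weight; the `shares` helper is a hypothetical name.

```python
from statistics import mean, pstdev

rewards = [1.10, 0.10, 0.00, 1.10]
lengths = [220, 950, 1100, 180]

mu, sigma = mean(rewards), pstdev(rewards)
adv_std = [(r - mu) / sigma for r in rewards]   # GRPO and DAPO advantages
adv_raw = [r - mu for r in rewards]             # Dr.GRPO advantages


def shares(masses):
    """Normalise non-negative gradient masses so they sum to 1."""
    total = sum(masses)
    return [m / total for m in masses]


# Vanilla GRPO: per-token pull |A_i| / L_i, times L_i tokens, leaves |A_i|.
grpo = shares([abs(a) for a in adv_std])
# Token-level reduction: per-token pull |A_i|, times L_i tokens.
dapo = shares([abs(a) * n for a, n in zip(adv_std, lengths)])
dr_grpo = shares([abs(a) * n for a, n in zip(adv_raw, lengths)])

for i, n in enumerate(lengths, start=1):
    print(f"rollout {i} (L={n:4d}): GRPO {grpo[i-1]:.1%}  "
          f"DAPO {dapo[i-1]:.1%}  Dr.GRPO {dr_grpo[i-1]:.1%}")
```

Changing `lengths` while holding `rewards` fixed leaves the GRPO column untouched and moves the other two, which isolates the reduction as the source of the difference.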

Step 4: Dynamic sampling check (DAPO and Olmo 3). Rewards are $[1.1, 0.1, 0.0, 1.1]$. They are not all equal, so $\sigma_G > 0$, so the group is KEPT. We take the gradient step. Compare with a degenerate case: if rewards were $[1.1, 1.1, 1.1, 1.1]$, DAPO and Olmo 3 would discard this group and resample. Vanilla GRPO would compute zero advantages and pay the backward-pass cost for nothing.

Step 5: Overlong shaping (DAPO). The rambling rollouts have $L = 950$ and $L = 1100$. With $L_{\text{soft}} = 1024$, $L_{\text{max}} = 2048$, rollout 3 (length 1100) gets reward shaped to $0.0 \cdot (2048-1100)/(2048-1024) = 0.0$ (already zero, so no change). Rollout 2 (length 950) is under the soft budget, so no penalty. If the policy had output 1500 tokens with reward 1.1 (correct but rambling), the shaped reward would drop to $1.1 \cdot (2048-1500)/(2048-1024) \approx 0.59$, a meaningful penalty that the gradient picks up.

The whole compute budget. Four rollouts on a 7B policy with chain-of-thought averaging ~600 tokens: roughly 24 seconds of H100 time at 50 tok/s decoded generation. Four verifier calls: under a millisecond on CPU. One backward pass: roughly 1.5 seconds. Variant switch costs zero — flipping from GRPO to Dr.GRPO is two lines of code and produces measurably different training dynamics.

Interactive: Watch the Three Variants Diverge

The widget below shows four rollouts of adjustable reward and length. The three coloured bars under each rollout show the per-token gradient contribution each variant assigns to that rollout. The table at the bottom shows the total gradient share across the group. Try the "Length-bias preset" — long bad rollouts barely show up on the GRPO bar while dominating the DAPO and Dr.GRPO bars. Try the "All-correct" preset and watch every variant collapse to zero (DAPO + Olmo 3 would also discard the group via dynamic sampling). Try the "Uniform length" preset and watch GRPO match DAPO exactly — length bias only manifests when rollouts have different lengths.

What to look for: the bar for rollout $i$ under variant $v$ shows per-token magnitude; the "sum" readout next to it shows total gradient mass (per-token × length). Under GRPO, each rollout's sum tracks $|A_i|$ alone, whatever its length; under DAPO and Dr.GRPO, a longer rollout with the same advantage always pulls more total mass. That single visual difference is the entire conceptual gap between the variants.

Plain Python: Three Advantage Functions, One File

The full variant stack — three advantage formulas, four DAPO patches, the Dr.GRPO removal of σ — in 120 lines of pure Python. No torch. Reads like a math derivation, runs on a laptop.

GRPO, DAPO, and Dr.GRPO: three advantage functions in one file
grpo_variants.py

Rollout dataclass: the unit each variant operates on

A rollout carries the bounded reward R from the rule-based verifier, plus three per-token log-prob arrays. logp_old is captured at sampling time (no grad), logp_new is recomputed under the CURRENT policy (this is the gradient pathway), logp_ref is captured against the frozen reference for the KL penalty. The length L is the structural variable the variants disagree about — vanilla GRPO divides by L, DAPO does not.

EXECUTION STATE
reward = float in [0, 1.1] from the rule verifier
logp_old = list[float], length L — log-probs at sample time
logp_new = list[float], length L — log-probs with grad on
length = int — equals len(logp_old)

grpo_advantages: the original (R - μ)/σ recipe

This is the formula DeepSeek-R1 used: per-rollout centred reward divided by group standard deviation. The pstdev call is the population std (divides by G, not G-1) — matches the DeepSeek implementation. The `or eps` floor prevents divide-by-zero when every rollout in the group has the same reward (all correct or all wrong). The output is a vector of advantages, one per rollout, that sums to zero by construction.

EXECUTION STATE
mu = mean(rewards) — Monte-Carlo baseline for E[R|s]
sigma = pstdev(rewards) — group-level scale
return = list[float], length G, sums to 0

dapo_advantages: identical formula, different intent

DAPO does NOT change the advantage formula. It changes the loss REDUCTION (sum over tokens instead of mean) and adds three other patches. We keep this stub function to make the variant-switching logic in the training loop uniform — every variant exposes the same advantages() API even when the math is the same.

dr_grpo_advantages: the surgical removal of σ

Dr.GRPO's central claim: dividing by sigma_G introduces a difficulty bias. On easy prompts where most rollouts succeed, sigma is small, so |A_i| is large — but the policy doesn't NEED a large gradient on a prompt it already mostly solves. On hard prompts where rollouts have wildly mixed rewards, sigma is large, so |A_i| is small — but a hard prompt is exactly where you want a sharp signal. Removing sigma keeps the gradient scale tied to actual reward differences. The formula reduces to A_i = R_i - mu_G — Monte-Carlo baseline only, no normalisation.

EXECUTION STATE
return = list[float], length G, scale is in REWARD UNITS not std-units

clip_higher: the asymmetric PPO clip

Standard PPO uses one clip range on both sides. DAPO's clip-higher uses a tighter LOWER bound (0.20 — don't let the policy collapse away from a good rollout too fast) and a looser UPPER bound (0.28 — let unlikely-but-correct tokens climb faster). The empirical motivation: in vanilla GRPO with both bounds at 0.20, low-probability tokens that turned out to be good were clipped before their probability could rise enough to matter, so the policy lost the ability to climb out of confident-but-wrong basins. Clip-higher widens the climb-up lane without widening the fall-off lane.

EXECUTION STATE
eps_low = 0.20 — same as vanilla PPO
eps_high = 0.28 — DAPO's empirical sweet spot for 7B–32B models
lo = 0.80 — minimum ratio
hi = 1.28 — maximum ratio

is_useful_group: dynamic sampling, the filter

If every rollout in a group has the same reward (all 0 OR all R_max), then mu_G equals every R_i, every advantage is zero, every gradient is zero. The group contributes ZERO information to the policy update — but you still paid the GPU cost to generate it. DAPO rejects these groups and resamples a different prompt. Over a million-step training run this typically rescues 20–40 percent of the wasted gradient steps that vanilla GRPO would have produced, which is the single largest source of DAPO's wall-clock speedup over vanilla GRPO.

EXECUTION STATE
return = bool — False means caller MUST resample

token_level_loss: DAPO's loss reduction

Two reductions matter here. Vanilla GRPO computes loss_i = mean_t(...) per rollout, then mean_i over the group. That is a (1/L_i)*(1/G) weighting per token, which means a 100-token rollout gets the same total weight as a 1000-token rollout — but each individual token in the long rollout has 10x less influence. DAPO computes loss = sum_{i,t}(...) / sum_i L_i — every token across the group gets equal weight. Long rollouts contribute more total gradient, short rollouts less, and each individual token's pull is unbiased by its rollout's length.

EXECUTION STATE
total_tokens = sum of all rollout lengths in the group
ratio = exp(logp_new - logp_old) — importance-sampling correction
accum = running sum of -clip(ratio, A) over every token in the group

overlong_shaping: DAPO's length penalty

DAPO's token-level loss removes the 1/L penalty, which inadvertently incentivises rambling: a long rollout with the same reward as a short one accumulates more gradient mass on its (more numerous) correct-format tokens. DAPO patches this by linearly shrinking reward when length exceeds a soft budget L_soft, dropping to zero at the hard budget L_max. Used in training the 32B reasoning models where chain-of-thought regularly hits 4k–8k tokens. Without this, training will produce models that pad reasoning indefinitely.

EXECUTION STATE
L_soft = 4096 — soft budget, reward unchanged below this
L_max = 8192 — hard budget, reward forced to 0 above this
frac = linear ramp in (0, 1) between L_soft and L_max
dapo_loss: the four-patch loss, fused

The whole DAPO algorithm is this five-line function. Filter out degenerate groups (patch 2). Compute advantages (same as GRPO). Reduce loss over tokens with sum-then-divide-by-total-tokens (patch 3). The clip-higher (patch 1) is applied inside token_level_loss. Overlong shaping (patch 4) is applied BEFORE this function sees the rollout, in the reward stage. None of these patches individually is a paper-worthy innovation. Together, the DAPO paper reports 50 points on AIME 2024 from a Qwen2.5-32B base, against 30 points for its naive GRPO baseline and 47 points for DeepSeek-R1-Zero-Qwen-32B, while using half the training steps of the latter.

EXECUTION STATE
return = scalar loss OR None (caller resamples)
dr_grpo_loss: minimal patch, big effect

Dr.GRPO is the smallest variant change: replace (R - μ)/σ with (R - μ), keep DAPO's token-level reduction. That is it. Two lines. The Sea/NUS paper shows this beats both vanilla GRPO and DAPO on out-of-distribution generalisation — because removing sigma stops the policy from spending most of its gradient budget on easy prompts. For production RLVR runs where you have a curriculum of math/code problems of mixed difficulty, Dr.GRPO consistently allocates more gradient to the medium-hard problems where progress is actually possible.

EXECUTION STATE
advs = list[float] — centred rewards, NOT std-normalised
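The effect of dropping sigma shows up when you compare how much total advantage magnitude an easy group receives relative to a balanced one. The sketch below assumes binary rewards and a group size of 8; the helper names (`grpo_adv`, `dr_grpo_adv`, `total_abs`) are illustrative only.

```python
import statistics

def grpo_adv(rewards, eps=1e-6):
    """Std-normalised advantages: (R - mu) / (sigma + eps)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_adv(rewards):
    """Centred advantages: R - mu, no std division."""
    mu = statistics.mean(rewards)
    return [r - mu for r in rewards]

def total_abs(advs):
    """Sum of |A_i|: a rough proxy for how much gradient the group receives."""
    return sum(abs(a) for a in advs)

easy = [1.0] * 7 + [0.0]          # 7 of 8 rollouts correct, low variance
balanced = [1.0] * 4 + [0.0] * 4  # 4 of 8 rollouts correct, maximum variance

share_normalised = total_abs(grpo_adv(easy)) / total_abs(grpo_adv(balanced))
share_centred = total_abs(dr_grpo_adv(easy)) / total_abs(dr_grpo_adv(balanced))

print(f"easy/balanced, std-normalised: {share_normalised:.3f}")
print(f"easy/balanced, centred only  : {share_centred:.3f}")
```

With sigma division the easy group receives roughly two thirds of the balanced group's advantage mass; with centring alone it receives less than half. The low-variance group is the one that the division boosts.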
"""
GRPO, DAPO, and Dr.GRPO: three advantage functions in one file.

The training loop is identical for all three. The only things that change are:
  (a) how we compute the per-rollout advantage,
  (b) how we reduce the per-token loss to a per-rollout scalar,
  (c) whether we filter out degenerate groups before the gradient step,
  (d) what the clip range looks like.

We treat "rollout" as a black box that contains a reward R and a list of
per-token log-probs. The mechanics of generation, KV-caching, and the
policy itself are unchanged from the previous section; what's different
is the SHAPING of the gradient signal.
"""
from __future__ import annotations
import math
import statistics
from dataclasses import dataclass

# ─── 0. A rollout is (reward, token_logps, length) ──────────────────────
@dataclass
class Rollout:
    reward:    float          # rule-based reward in [0, R_max]
    logp_old:  list[float]    # per-token log prob at sampling time
    logp_new:  list[float]    # per-token log prob under CURRENT policy
    logp_ref:  list[float]    # per-token log prob under FROZEN ref
    length:    int            # = len(logp_old)

# ─── 1. VANILLA GRPO advantage (DeepSeek-R1 paper) ──────────────────────
# A_i = (R_i - mu_G) / (sigma_G + eps). The PER-TOKEN gradient is then
# A_i / L_i because the loss reduces with MEAN over tokens. The 1/L is
# the "length bias": longer rollouts get diluted gradient per token.
def grpo_advantages(group: list[Rollout], eps: float = 1e-6) -> list[float]:
    rewards = [r.reward for r in group]
    mu      = statistics.mean(rewards)
    sigma   = statistics.pstdev(rewards)           # population std (divides by G)
    return [(r - mu) / (sigma + eps) for r in rewards]

# ─── 2. DAPO advantage (ByteDance, 2025) ────────────────────────────────
# Same A_i formula, but DAPO changes the LOSS REDUCTION from mean to sum
# (no 1/L), so longer rollouts contribute proportionally more gradient.
# DAPO also adds three other patches; this function only owns the
# advantage piece. See dapo_loss below for the four-piece total.
def dapo_advantages(group: list[Rollout], eps: float = 1e-6) -> list[float]:
    return grpo_advantages(group, eps)   # advantage formula unchanged

# ─── 3. Dr.GRPO advantage (NUS/Sea, 2025) ───────────────────────────────
# A_i = R_i - mu_G. No std division. The argument: dividing by sigma_G
# inflates gradients on EASY prompts (small variance => small sigma =>
# large |A|) and shrinks gradients on HARD prompts (high variance).
# Removing sigma keeps the gradient scale tied to the underlying reward
# scale, which is what we wanted in the first place.
def dr_grpo_advantages(group: list[Rollout]) -> list[float]:
    rewards = [r.reward for r in group]
    mu      = statistics.mean(rewards)
    return [r - mu for r in rewards]

# ─── 4. DAPO patch 1: clip-higher (asymmetric clip) ─────────────────────
# Vanilla PPO/GRPO uses one clip range eps_clip on both sides:
#   ratio in [1 - eps_clip, 1 + eps_clip]
# DAPO splits it: eps_low keeps the "down" side conservative (don't kill
# good rollouts), eps_high opens the "up" side (let unlikely-but-good
# tokens climb faster). Default: eps_low=0.2, eps_high=0.28.
def clip_higher(ratio: float, adv: float,
                eps_low: float = 0.20, eps_high: float = 0.28) -> float:
    lo, hi = 1.0 - eps_low, 1.0 + eps_high
    clipped = max(lo, min(hi, ratio))
    return min(ratio * adv, clipped * adv)        # PPO-style min

# ─── 5. DAPO patch 2: dynamic sampling (drop degenerate groups) ─────────
# If all G rollouts in a group succeed (sigma=0, all advantages=0) or all
# fail, the gradient on that group is exactly zero. DAPO REJECTS the
# group and re-samples a new prompt; this keeps every gradient step
# carrying signal, which compounds across training.
def is_useful_group(group: list[Rollout], R_max: float = 1.1) -> bool:
    rewards = [r.reward for r in group]
    if all(abs(r - 0.0)   < 1e-9 for r in rewards): return False
    if all(abs(r - R_max) < 1e-9 for r in rewards): return False
    return True

# ─── 6. DAPO patch 3: token-level loss reduction ───────────────────────
# Vanilla GRPO computes per-rollout loss as MEAN over tokens, then MEAN
# over the group. DAPO computes SUM over tokens, then DIVIDE BY THE
# TOTAL TOKEN COUNT IN THE GROUP, so long rollouts and short rollouts
# contribute fairly per-token, instead of per-rollout.
def token_level_loss(group: list[Rollout],
                     advs:  list[float]) -> float:
    """sum_i sum_t -clip_higher(ratio_t, A_i)  /  sum_i L_i"""
    total_tokens = sum(r.length for r in group)
    if total_tokens == 0: return 0.0
    accum = 0.0
    for r, A in zip(group, advs):
        for lpo, lpn in zip(r.logp_old, r.logp_new):
            ratio = math.exp(lpn - lpo)
            # clip_higher already multiplies by A; negate to get a loss
            accum += -clip_higher(ratio, A)        # PG term per token
    return accum / total_tokens

# ─── 7. DAPO patch 4: overlong reward shaping ──────────────────────────
# Long-context training pushes models to ramble. DAPO penalises rollouts
# that exceed a soft length limit so they cannot game the token-level
# loss by generating filler. L_soft is the budget; beyond it, reward
# shrinks linearly until L_max, where it is clipped to 0.
def overlong_shaping(reward: float, length: int,
                     L_soft: int = 4096, L_max: int = 8192) -> float:
    if length <= L_soft: return reward
    if length >= L_max:  return 0.0
    frac = (L_max - length) / (L_max - L_soft)     # in (0, 1)
    return reward * frac

# ─── 8. The full DAPO loss, glued together ─────────────────────────────
def dapo_loss(group: list[Rollout]) -> float | None:
    if not is_useful_group(group):
        return None                                # caller re-samples
    advs = dapo_advantages(group)
    return token_level_loss(group, advs)

# ─── 9. The Dr.GRPO loss: same skeleton, different advantage ───────────
def dr_grpo_loss(group: list[Rollout]) -> float:
    advs = dr_grpo_advantages(group)
    # Dr.GRPO also drops the 1/L; same token-level reduction as DAPO.
    return token_level_loss(group, advs)

PyTorch: A Variant-Switchable Training Step

In production code you do not want three parallel training loops. You want one loop that flips between variants on a config flag, so you can A/B test variants on the same data and the same hardware. The function below is exactly that: one signature, three behaviours, every variant-specific line gated on the variant argument.

One training step, switchable between GRPO / DAPO / Dr.GRPO
🐍variant_step.py
The variant flag: one function, three behaviours

The signature exposes every variant-specific knob as a kwarg: variant picks the advantage formula and reduction, eps_low/eps_high control clip-higher, R_max bounds the reward for the all-correct filter. In production code this would be a config dict; here we keep them explicit so every dial is visible. The defaults give vanilla GRPO behaviour; eps_high=0.28 is harmless there because the clip range is hard-coded to a symmetric 0.2 for every variant except DAPO.

EXECUTION STATE
variant = 'grpo' | 'dapo' | 'dr_grpo' — the dispatch flag
kl_coef = 0.04 — same coefficient works for all three variants
Dynamic sampling: kill the dead groups

Only DAPO runs this check. If every rollout's reward is zero, the policy failed completely on this prompt — there is nothing to learn from. If every reward equals R_max, the policy fully solved it — also nothing to learn. Returning None signals the training loop to resample a different prompt without paying the backward-pass cost. In a batched implementation you would filter at the prompt level and pack remaining survivors; the principle is identical.

EXECUTION STATE
all_zero = bool — every rollout failed
all_max = bool — every rollout succeeded
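On the caller's side, "resample a different prompt" amounts to a loop that keeps drawing prompts until enough non-degenerate groups have been collected. The following is a minimal sketch with a toy reward sampler; `collect_useful_groups`, `is_useful`, and `toy_sample_rewards` are hypothetical names introduced for illustration, and the toy uses R_max = 1.0 rather than this section's default.

```python
import random

def is_useful(rewards, R_max=1.0, tol=1e-9):
    """False when every reward is 0 or every reward is R_max."""
    all_zero = all(abs(r) < tol for r in rewards)
    all_max = all(abs(r - R_max) < tol for r in rewards)
    return not (all_zero or all_max)

def collect_useful_groups(prompts, sample_rewards, n_groups, max_tries=1000):
    """Draw prompts until n_groups non-degenerate groups are collected."""
    kept, tries = [], 0
    while len(kept) < n_groups and tries < max_tries:
        prompt = random.choice(prompts)
        rewards = sample_rewards(prompt)   # stands in for generate + score
        tries += 1
        if is_useful(rewards):
            kept.append((prompt, rewards))
    return kept, tries

def toy_sample_rewards(pass_rate, G=8):
    """Toy stand-in: a 'prompt' is just its pass rate; rewards are 0/1 draws."""
    return [1.0 if random.random() < pass_rate else 0.0 for _ in range(G)]

random.seed(0)
# Pass rates 0.0 and 1.0 always give degenerate groups; 0.5 almost never does.
kept, tries = collect_useful_groups([0.0, 0.5, 1.0], toy_sample_rewards, n_groups=4)
print(f"kept {len(kept)} groups after {tries} sampled prompts")
```

The `max_tries` cap is a design choice for the sketch: without it, a prompt pool that is entirely solved or entirely unsolved would loop forever.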
Advantage: Dr.GRPO branches here

For Dr.GRPO we compute adv = rewards - mu and stop. No std division. For vanilla GRPO and DAPO we compute the classic (R - mu) / sigma. The unbiased=False matches DeepSeek's reference implementation (population std, divides by G). The clamp_min(1e-6) on sigma is the safety net for groups where rewards happen to all coincide but did not trip the dynamic-sampling filter (e.g. all rewards equal 0.5 from partial credit). adv.detach() is critical — the advantage is a TARGET, not a function of the current parameters, so it must not feed gradient back into rewards.

EXECUTION STATE
mu = (scalar) group mean reward
sigma = (scalar) population std, >= 1e-6
adv = (G,) detached, mean=0 (and std=1 for grpo/dapo)
Re-score under the current policy: the gradient pathway

We push the same generated tokens through the current policy and gather log-probs at the chosen ids. This is THE expensive step — a full forward pass through the policy network on (G, T) tokens — but it is where the gradient lives. logp_old (no grad) was captured at sampling time; logp_new (grad) is computed here. Their difference is what the importance ratio amplifies or shrinks. Shape note: log_softmax returns (G, T, V), gather collapses V down to 1, squeeze brings it back to (G, T).

EXECUTION STATE
logits = (G, T, V) — policy outputs WITH gradient
logp_new = (G, T) — log-prob of each actually-sampled token
ratio = (G, T) — exp(logp_new - logp_old)
Clip: the only structural change between GRPO and DAPO clip

GRPO clips ratio into [0.8, 1.2]. DAPO clips into [0.8, 1.28] — same lower bound, wider upper bound. The clip-higher patch is geometrically asymmetric: it forbids the policy from dropping a token's probability by more than 20 percent in one step (the conservative direction) but allows it to raise a token's probability by up to 28 percent (the exploratory direction). This biases learning toward INCLUSION of low-probability-but-good tokens rather than EXCLUSION of currently-favoured tokens.

EXECUTION STATE
lo = 0.8 — same for all three variants
hi = 1.2 (grpo, dr_grpo) | 1.28 (dapo)
clipped = (G, T) — ratio after clamp
Per-token surrogate with PPO min trick

torch.min(ratio*A, clipped*A) is the PPO clipped surrogate at token granularity. When A > 0 (good rollout) the min selects the SMALLER of the two products — clipping kicks in if the ratio went too high, capping the gradient. When A < 0 (bad rollout) the min still selects the smaller (more negative) product — clipping kicks in if the ratio went too low, again capping the gradient. The mask multiplication zeros out padding tokens so they do not contribute to the loss.

EXECUTION STATE
A = (G, 1) — advantage broadcast over time axis
surr = (G, T) — per-token contribution to policy gradient
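The interaction between the clip range and the min is easiest to see one scalar at a time. This sketch evaluates the clipped surrogate for a single token and reports whether the min selected the unclipped term, which is the only case in which gradient still flows through the ratio. The function name and the `grad_flows` flag are illustrative, not part of the training step.

```python
def clipped_surrogate(ratio, adv, eps_low=0.20, eps_high=0.28):
    """Return (surrogate value, True if gradient still flows through ratio)."""
    lo, hi = 1.0 - eps_low, 1.0 + eps_high
    clipped = max(lo, min(hi, ratio))
    unclipped_term = ratio * adv
    clipped_term = clipped * adv
    value = min(unclipped_term, clipped_term)
    # The clipped term is constant in the policy parameters whenever the
    # clamp is active, so gradient flows only if the unclipped term is chosen.
    grad_flows = unclipped_term <= clipped_term
    return value, grad_flows

cases = {
    "A>0, ratio=1.25, symmetric 0.2": clipped_surrogate(1.25, 1.0, eps_high=0.20),
    "A>0, ratio=1.25, clip-higher  ": clipped_surrogate(1.25, 1.0, eps_high=0.28),
    "A<0, ratio=0.70               ": clipped_surrogate(0.70, -1.0),
    "A<0, ratio=1.50               ": clipped_surrogate(1.50, -1.0),
}
for name, (value, grad_flows) in cases.items():
    print(f"{name}  value={value:+.2f}  grad_flows={grad_flows}")
```

A good token whose ratio has reached 1.25 is frozen under the symmetric range but keeps learning under clip-higher. For a bad rollout, a ratio that has already dropped to 0.70 is frozen, while a ratio that has risen to 1.50 is still pushed down.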
Reduction: this is THE structural split

Vanilla GRPO: per_rollout = sum_t(surr) / L_i, then mean over G. Notice the (1/L_i) inside the reduction: every long rollout has its per-token contribution diluted by length. DAPO and Dr.GRPO: divide by TOTAL token count in the group, not per-rollout count, so every token gets equal weight regardless of which rollout it came from. This single line is responsible for the entire length-bias phenomenon: a 1000-token rollout under GRPO pulls 10x less per token than a 100-token rollout with the same advantage. Under DAPO/Dr.GRPO they pull equally per token.

EXECUTION STATE
per_rollout = (G,) — GRPO only, length-diluted
total_tokens = scalar — DAPO/Dr.GRPO normaliser
pg_loss = scalar policy-gradient loss
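Written out, the two reductions differ only in where the length normaliser sits. The notation below is a sketch in this section's symbols: $s_{i,t}$ is the per-token surrogate (the surr tensor), $r_{i,t}$ the importance ratio, $A_i$ the rollout advantage, $L_i$ the unpadded length of rollout $i$, and $G$ the group size.

```latex
s_{i,t} = \min\Bigl( r_{i,t}\, A_i,\;
          \operatorname{clip}\bigl(r_{i,t},\, 1-\epsilon_{\text{lo}},\, 1+\epsilon_{\text{hi}}\bigr)\, A_i \Bigr)

\mathcal{L}_{\text{GRPO}}
  = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{L_i} \sum_{t=1}^{L_i} s_{i,t}

\mathcal{L}_{\text{token}}
  = -\frac{1}{\sum_{i=1}^{G} L_i} \sum_{i=1}^{G} \sum_{t=1}^{L_i} s_{i,t}
```

In the first form each token's coefficient is $1/(G \cdot L_i)$ and depends on its own rollout's length; in the second it is $1/\sum_i L_i$ for every token in the group.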
KL to reference: anchored to the same split

We use the same reduction philosophy for the KL term as for the policy-gradient term. Vanilla GRPO averages within-rollout then across-group; DAPO/Dr.GRPO sum across all tokens then divide by total token count. If you mixed the two reductions you would accidentally re-introduce length bias through the KL term and undo the patch you just applied. The kl_coef=0.04 is unchanged from the previous section — it is an out-of-band hyperparameter that all three variants share.

EXECUTION STATE
kl = (G, T) — per-token log-prob ratio policy/ref
kl_loss = scalar
Total loss: same as PPO, same as GRPO

pg_loss + kl_coef * kl_loss. Identical signature in all three variants. This is what gets backpropped to the policy parameters. None of these variants changes the optimiser, the learning rate, or the gradient-clipping at the parameter level — the only changes are in how the per-token contributions are computed and summed. That is why you can A/B test variants on the same codebase by flipping the flag, and why an entire research line emerged from re-examining a handful of arithmetic lines.

"""
A variant-switchable RL training step. One function, three behaviours.
Select with variant='grpo' | 'dapo' | 'dr_grpo'.

Shapes throughout:
  gens      (G, T) int     - sampled token ids
  logp_old  (G, T) float32 - log-probs at sample time, no grad
  logp_ref  (G, T) float32 - log-probs under frozen reference, no grad
  rewards   (G,)   float32 - rule-based scalar reward per rollout
  mask      (G, T) float32 - 1 where token is part of the rollout, else 0

  G = group size (rollouts per prompt). Typical: 8 (DAPO), 16 (R1).
  T = padded max length in the group.
"""
import torch
from torch.nn.functional import log_softmax

def variant_step(policy, gens, logp_old, logp_ref, rewards, mask,
                 variant: str = "grpo",
                 kl_coef: float = 0.04,
                 eps_low: float = 0.20,
                 eps_high: float = 0.28,
                 R_max: float = 1.1):
    """One gradient-bearing step for GRPO / DAPO / Dr.GRPO."""
    if variant not in ("grpo", "dapo", "dr_grpo"):
        raise ValueError(f"unknown variant: {variant!r}")

    # ─── 1. Group filter (DAPO patch 2: dynamic sampling) ──────────
    if variant == "dapo":
        all_zero = (rewards.abs() < 1e-9).all()
        all_max  = (rewards.sub(R_max).abs() < 1e-9).all()
        if all_zero or all_max:
            return None       # caller resamples a new prompt

    # ─── 2. Advantages: the variant-specific formula ───────────────
    mu = rewards.mean()
    if variant == "dr_grpo":
        adv = rewards - mu                            # no std
    else:                                             # grpo, dapo
        sigma = rewards.std(unbiased=False).clamp_min(1e-6)
        adv = (rewards - mu) / sigma
    adv = adv.detach()                                # (G,), no grad

    # ─── 3. Re-score under current policy ──────────────────────────
    # NOTE: this assumes logits[:, t] scores gens[:, t], i.e. any
    # one-position shift of a raw causal LM has already been applied
    # in the same way it was when logp_old and logp_ref were computed.
    logits   = policy(gens).logits                    # (G, T, V)
    logp_new = log_softmax(logits, dim=-1).gather(
        -1, gens.unsqueeze(-1)
    ).squeeze(-1)                                     # (G, T)

    # Importance ratio per token
    ratio = torch.exp(logp_new - logp_old)            # (G, T)

    # ─── 4. Clip: symmetric (GRPO) or asymmetric (DAPO) ────────────
    if variant == "dapo":
        lo, hi = 1.0 - eps_low, 1.0 + eps_high        # clip-higher
    else:
        lo, hi = 1.0 - 0.2,     1.0 + 0.2             # symmetric
    clipped = torch.clamp(ratio, lo, hi)              # (G, T)

    # Per-token surrogate (we minimise -surrogate)
    A = adv.unsqueeze(-1)                             # (G, 1) broadcast
    surr = torch.min(ratio * A, clipped * A) * mask   # (G, T)

    # ─── 5. Reduction: mean-per-rollout (GRPO) vs token-level (else)
    if variant == "grpo":
        # mean over tokens within rollout, then mean over group
        per_rollout = surr.sum(dim=1) / mask.sum(dim=1).clamp_min(1)
        pg_loss = -per_rollout.mean()
    else:
        # SUM over all tokens, divide by total token count in group
        total_tokens = mask.sum().clamp_min(1)
        pg_loss = -surr.sum() / total_tokens

    # ─── 6. KL-to-reference (same estimator; reduction follows the variant)
    kl = (logp_new - logp_ref) * mask                 # (G, T)
    if variant == "grpo":
        kl_loss = (kl.sum(1) / mask.sum(1).clamp_min(1)).mean()
    else:
        kl_loss = kl.sum() / mask.sum().clamp_min(1)

    return pg_loss + kl_coef * kl_loss                # scalar

At Massive Scale: Why These Fixes Matter

The four bias patches look like small arithmetic changes. At scale they have outsized impact, for three reasons that compound:

Reason 1: Chain-of-thought is long

Reasoning-tuned models at the 32B–70B scale routinely produce chains of thought in the 2k–8k token range. The length bias of vanilla GRPO becomes severe: a 4000-token rollout pulls each of its tokens with $1/(G \cdot 4000)$ weight, against $1/(G \cdot 400)$ for a 400-token rollout, a 10x per-token disparity. The model is taught to prefer brevity over depth, exactly opposite to what reasoning training is supposed to produce. DAPO's token-level reduction is essential for any model whose chains of thought routinely exceed 1k tokens.

Reason 2: GPU hours are the budget

A frontier RL run costs millions of GPU-hours. Dynamic sampling's 20–40% reduction in wasted backward-passes translates directly into months of saved compute. On Olmo 3's reported budget the savings were roughly 600k H100-hours — comparable to the entire budget of a smaller 7B run. At this scale, an arithmetic patch is an infrastructure decision.

Reason 3: Distribution shift compounds

The difficulty bias that Dr.GRPO removes is small per step but consistent. Over a million-step run, the policy effectively learns the easy half of the curriculum well and underweights the hard half. The out-of-distribution evaluation gap that Dr.GRPO closes is small (typically 3–6 percentage points on AIME) but sits exactly in the regime that matters for benchmark-defining reasoning capability. At the post-training stage, where the cheap-to-improve gains have already been captured, fixing difficulty bias is one of the few remaining ways to move in-the-wild capability.

The compounding catch: these biases interact. Length bias and difficulty bias both push the gradient budget away from the long, hard, partially-solved chains of thought that are most informative. Vanilla GRPO drains gradient from exactly the regime where reasoning capability lives. The variants together restore it, and the combined gain is bigger than the sum of the gains from each fix applied alone, because both biases were pulling in the same direction.

Engineering Reality: Which Variant Should You Use?

After all this — DAPO, Dr.GRPO, Olmo 3, four patches and a curriculum — the right question for a practitioner is what to actually deploy. Practical guidance:

If you are starting fresh

Use the Olmo 3 recipe: Dr.GRPO advantage (no σ), DAPO token-level reduction, asymmetric clip $[0.8, 1.3]$, per-prompt dynamic sampling, no KL on verifiable tasks. This is the current production sweet spot. The four patches together will give you roughly 3x faster convergence than a vanilla GRPO baseline on the same data.
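As a configuration, that recipe might be captured along the following lines. This is a sketch only: the class name `RLVRConfig` and its field names are hypothetical and do not correspond to any particular library's config schema; the values are the ones listed in the paragraph above.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RLVRConfig:
    """Hypothetical config object for the recipe described above."""
    advantage: str = "centered"     # R - mu, no sigma division (Dr.GRPO)
    reduction: str = "token"        # sum over tokens / total tokens (DAPO)
    eps_low: float = 0.20           # lower clip bound 1 - 0.20 = 0.8
    eps_high: float = 0.30          # upper clip bound 1 + 0.30 = 1.3
    dynamic_sampling: bool = True   # drop all-wrong / all-right groups
    kl_coef: float = 0.0            # no KL term on verifiable tasks

    def clip_range(self):
        return (1.0 - self.eps_low, 1.0 + self.eps_high)

verifiable = RLVRConfig()
# For reward-model ("taste-based") tasks, keep the KL leash switched on.
taste_based = replace(verifiable, kl_coef=0.04)

print(verifiable.clip_range(), verifiable.kl_coef)
print(taste_based.kl_coef)
```

Keeping the object frozen and deriving variants with `replace` makes it harder for an A/B run to mutate a shared config by accident.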

If you are migrating an existing GRPO codebase

Apply the patches in this order, validating after each:

  1. Dynamic sampling first. Cheapest fix. Zero interaction with other patches. Wins on its own. Implementation: one if-statement before the gradient step. If you cannot deploy anything else, deploy this.
  2. Token-level reduction second. Implementation: change a single denominator in the loss. Required prerequisite for any model with long chains of thought. Pairs naturally with:
  3. Overlong reward shaping (concurrent with token-level). If you adopt token-level reduction without overlong shaping you will eventually see rambling. Apply them together.
  4. Clip-higher. Cheapest hyperparameter sweep: try $\epsilon_{\text{hi}} \in \{0.25, 0.28, 0.30\}$, keep what wins on validation. Do not bother with $\epsilon_{\text{hi}} > 0.35$; it tends to introduce instability that washes out the gain.
  5. Dr.GRPO advantage (no σ) last. Most disruptive change because it changes the gradient SCALE, so you have to retune the learning rate. Roughly 5x larger LR is the typical adjustment. Run a small LR sweep before declaring it works.
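For the last step, one way to pick a starting point for the learning-rate sweep is to measure how much the advantage magnitude shrinks on reward groups you have already logged. The heuristic below is an assumption of this sketch, not a rule from any of the papers, and `advantage_scale_ratio` is a hypothetical helper; the right multiplier still has to be confirmed empirically.

```python
import statistics

def advantage_scale_ratio(reward_groups, eps=1e-6):
    """Mean |A| with sigma division, divided by mean |A| without it."""
    normalised, centred = [], []
    for rewards in reward_groups:
        mu = statistics.mean(rewards)
        sigma = statistics.pstdev(rewards)
        if sigma == 0:
            continue  # degenerate group: zero advantage under both formulas
        for r in rewards:
            centred.append(abs(r - mu))
            normalised.append(abs(r - mu) / (sigma + eps))
    return statistics.mean(normalised) / statistics.mean(centred)

# Toy log of binary rewards for three prompts (1/8, 4/8 and 7/8 correct).
logged = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
]
ratio = advantage_scale_ratio(logged)
print(f"advantages shrink by about {ratio:.2f}x when sigma is removed")
```

On this toy log the ratio comes out near 2.5. Logs dominated by very easy or very hard prompts have smaller sigma and therefore a larger ratio, which is why the number is only a centre for the sweep.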

If your task is taste-based, not verifiable

Keep the KL term. Reward hacking against a neural reward model is the central failure mode of taste-based RLHF and the KL leash prevents it. The variant patches still help — token-level reduction in particular is independent of reward-model vs verifier — but the "no KL" recipe of Olmo 3 is specific to rule-verified rewards.

What this section leaves on the table: we have focused on the loss-side patches. There is a second axis of variants — sampler-side improvements like REINFORCE++ (no clip, per-prompt baselines), GSPO (group-sequence policy optimisation), and prompt-level importance correction (PIPO) — that we do not cover here. By the time you read this, expect 2–3 of those variants to be production-grade. The pattern repeats: a published algorithm is the starting line, not the finish line.

The deeper lesson the post-GRPO generation teaches is that RL algorithms are reducible to the way per-token contributions are weighted into a scalar loss. The textbook objective almost never tells you how to do the reduction; the choice is buried in implementation details that nobody publishes, until somebody notices the choice was load-bearing all along. When you train your own model, scrutinise the reductions. That is where the next 2x improvement is hiding.
