The Real Problem: Vanilla GRPO Has Three Biases
The previous section implemented GRPO from scratch. It worked. It also has three biases that nobody noticed for the first six months of the algorithm's life, and that turn out to compound viciously as you scale to longer chains of thought, harder problems, and bigger models.
Here is the picture concretely. You finish a DeepSeek-R1-style training run on a 7B base. AIME accuracy goes from 13% to 31%. You scale the same recipe to 32B and re-run on the same data. Accuracy improves to 38% — but the chains of thought are now 3000 tokens long, the model occasionally outputs the same paragraph six times, and a quarter of your gradient steps over the last 20% of training produced zero net change because every rollout in those groups happened to score identically. The recipe works, but it has plateaued in a way that the original GRPO paper does not predict.
Between mid-2024 and early 2026, three groups identified what was going wrong, each from a different angle, each with a different patch:
- Length bias. Vanilla GRPO averages the per-token loss within each rollout before averaging across the group. A 1000-token rollout contributes the same total gradient as a 100-token rollout — meaning each individual token in the longer rollout pulls 10x less than a token in the short one. Long chains of thought are systematically under-weighted. ByteDance's DAPO paper named this and fixed it.
- Difficulty bias. The σ divisor in the advantage formula inflates gradients on easy prompts (small variance) and shrinks gradients on hard prompts (large variance). Exactly backwards from what you want: easy prompts have nothing to teach and hard prompts have everything to teach. NUS/Sea's Dr.GRPO paper named this and fixed it.
- Wasted-group bias. When every rollout in a group gets the same reward (all wrong OR all right) the advantage is zero, the gradient is zero, and the entire forward+backward pass on that group was paid for with no learning. Vanilla GRPO does nothing about this. DAPO's dynamic sampling rejects those groups and resamples another prompt. Allen AI's Olmo 3 recipe wraps these fixes into a production-grade RLVR pipeline.
The unifying observation: none of these variants changes the underlying RL objective. They are all reweightings of the same gradient signal. The signal was there in vanilla GRPO; it was just allocated wrong. The fixes are in how the per-token contributions are summed, which advantages get the σ divisor, and which groups get a gradient step at all.
The lesson for anyone running RL at scale: when an algorithm plateaus, look at the reduction step — how per-token terms become per-rollout terms become per-batch terms. That is where the silent bias accumulates, and that is the lever the post-GRPO generation pulls.
Intuition: Each Variant Is a Surgical Patch
Imagine a teacher grading homework from four students. Each student turns in an answer of different length. The teacher has a fixed amount of red-pen attention to spend.
- Vanilla GRPO says: "I will spend the same attention on each student's page, regardless of length." The student who wrote 10 lines gets 10x more red-pen-per-line than the student who wrote 100 lines. Each individual sentence in the long answer gets shallowly corrected; the short answer gets deeply critiqued line-by-line.
- DAPO says: "I will spend the same attention per line, across every student." The long answer gets more total ink, the short answer gets less total ink, but each sentence is read with equal care. Sentence-level feedback is fair.
- Dr.GRPO says: "The grading scale should stay constant. I will not silently make easy problems louder than hard problems just because the class agreed on easy ones." When the variance in the class is small (everyone scored similarly) the grading is naturally subdued. When the variance is large (there is real disagreement) the grading stays at the same scale, so hard problems get their fair share of correction.
DAPO and Dr.GRPO are complementary, not competing. DAPO fixes the per-token weighting; Dr.GRPO fixes the per-prompt weighting. You can apply both at once — and Olmo 3 does, plus production engineering for throughput.
The Math: One Loss, Three Reweightings
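Recall the GRPO loss from the previous section. For a group of $G$ rollouts $o_1, \dots, o_G$ sampled from a single prompt $q$, with per-token log-probabilities $\log \pi_\theta(o_{i,t} \mid q, o_{i,<t})$ and per-token importance ratios $r_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$, the loss is:
$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{L_i} \sum_{t=1}^{L_i} \min\!\big( r_{i,t} A_i,\ \operatorname{clip}(r_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_i \big)$$
where $L_i$ is the number of tokens in rollout $o_i$, and the advantage is $A_i = (R_i - \mu)/\sigma$ with $\mu = \frac{1}{G}\sum_i R_i$, $\sigma = \sqrt{\frac{1}{G}\sum_i (R_i - \mu)^2}$. The clip range $[1-\epsilon, 1+\epsilon]$ is symmetric. The KL penalty toward the reference policy is omitted from the formulas in this section; the reweightings below act only on the clipped term.
Look at the two reductions: $\frac{1}{G}\sum_i \frac{1}{L_i}\sum_t$. Each token contributes with weight $1/(G L_i)$. A token in a 100-token rollout has weight $1/(100G)$; a token in a 1000-token rollout has weight $1/(1000G)$. That is the length bias, in one line.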
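The per-token weight implied by the two nested means can be checked in a few lines. This is a minimal sketch: the function name `vanilla_token_weights` is made up for illustration, and the two lengths are the 100-token and 1000-token rollouts from the paragraph above.

```python
def vanilla_token_weights(lengths):
    """Per-token weight 1 / (G * L_i) under mean-per-rollout, then mean-per-group."""
    group_size = len(lengths)
    return [1.0 / (group_size * length) for length in lengths]


lengths = [100, 1000]  # one short and one long rollout
weights = vanilla_token_weights(lengths)

# Per-token disparity between the short and the long rollout.
disparity = weights[0] / weights[1]

# Total weight carried by each rollout: identical, whatever its length.
rollout_totals = [w * n for w, n in zip(weights, lengths)]

print(f"per-token weights: {weights}")
print(f"short/long per-token disparity: {disparity:.1f}x")
print(f"total weight per rollout: {rollout_totals}")
```

Each rollout ends up with the same total weight, which is exactly why a token in the longer rollout is pulled less.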
DAPO's loss
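DAPO replaces the per-rollout mean with a group-wide token-count normaliser:
$$\mathcal{L}_{\text{DAPO}}(\theta) = -\frac{1}{\sum_{i=1}^{G} L_i} \sum_{i=1}^{G} \sum_{t=1}^{L_i} \min\!\big( r_{i,t} A_i,\ \operatorname{clip}(r_{i,t},\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}})\, A_i \big)$$
Two structural changes from vanilla GRPO:
- The denominator is $\sum_i L_i$, the total number of tokens in the entire group. Every token now has weight $1/\sum_i L_i$ regardless of which rollout it came from. Length bias removed.
- The clip becomes asymmetric: $\epsilon_{\text{low}} = 0.2$, $\epsilon_{\text{high}} = 0.28$. The lower bound stays conservative; the upper bound opens up, so unlikely-but-good tokens can climb faster.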
DAPO also adds two non-loss patches — dynamic sampling (reject groups where every rollout has the same reward) and overlong reward shaping (linearly decay reward beyond a soft length budget). Those live outside the loss formula but inside the training loop.
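The two loss-level changes (group-wide token normaliser, asymmetric clip) can be sketched in plain Python. This is an illustration under stated assumptions, not the DAPO reference code: `clipped_term` and `token_level_loss` are hypothetical names, ratios are passed in as plain floats, and the bounds 0.2 / 0.28 are the clip values quoted in this section.

```python
def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for one token, with separate lower/upper bounds."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def token_level_loss(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Sum every token's clipped term, then divide by the group's total token count.

    ratios:     one list of per-token importance ratios per rollout
    advantages: one scalar advantage per rollout
    """
    total_tokens = sum(len(rollout) for rollout in ratios)
    total = sum(
        clipped_term(r, adv, eps_low, eps_high)
        for rollout, adv in zip(ratios, advantages)
        for r in rollout
    )
    return -total / total_tokens


# A positive-advantage token whose ratio overshoots is capped at the upper bound.
print(clipped_term(1.5, 1.0))                # asymmetric upper bound
print(clipped_term(1.5, 1.0, eps_high=0.2))  # symmetric upper bound
print(token_level_loss([[1.0, 1.0], [1.0]], [1.0, -1.0]))
```

Passing `eps_high=0.2` recovers the symmetric clip, so the same function can serve as a baseline when comparing the two settings.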
Dr.GRPO's loss
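Dr.GRPO keeps DAPO's reduction but changes the advantage:
$$A_i = R_i - \mu$$
No σ in the denominator. The full Dr.GRPO loss is:
$$\mathcal{L}_{\text{Dr.GRPO}}(\theta) = -\frac{1}{\sum_{i=1}^{G} L_i} \sum_{i=1}^{G} \sum_{t=1}^{L_i} \min\!\big( r_{i,t} (R_i - \mu),\ \operatorname{clip}(r_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, (R_i - \mu) \big)$$
Why removing σ matters: the std divisor renormalises the advantage scale to unit variance regardless of how large or small the underlying reward differences are. That sounds good, until you notice that for an easy prompt where 7/8 rollouts succeed, σ is small and 1/σ is large, while for a hard prompt with mixed outcomes, σ is large and 1/σ is small. The policy ends up spending most of its gradient budget on prompts it already mostly solves.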
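To see how the std divisor scales with prompt difficulty, the sketch below computes the group standard deviation for a group of 8 rollouts. It assumes a simplified 0/1 reward (an assumption made here for clarity; the article's reward also carries a format bonus) and uses the population standard deviation. `group_sigma` is a hypothetical helper name.

```python
from statistics import pstdev


def group_sigma(successes, group_size=8, win=1.0, lose=0.0):
    """Population std of a group in which `successes` rollouts win and the rest lose."""
    rewards = [win] * successes + [lose] * (group_size - successes)
    return pstdev(rewards)


for k in (7, 4):
    sigma = group_sigma(k)
    # 1/sigma is the factor the std divisor multiplies every advantage by.
    print(f"{k}/8 succeed: sigma = {sigma:.3f}, 1/sigma = {1.0 / sigma:.2f}")
```

The nearly-solved group (7 of 8) gets the larger multiplier, the evenly split group the smaller one.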
Olmo 3's composite recipe
Allen AI's Olmo 3 RLVR recipe (early 2026) does not introduce a new loss. It picks the best pieces of DAPO and Dr.GRPO, fixes the engineering, and ships:
- Dr.GRPO advantage (no σ) plus token-level reduction (the DAPO denominator).
- Dynamic sampling at the prompt level: a curriculum-aware sampler that retires prompts whose policy success rate has hit either 0% or above 95%, and biases the buffer toward 30–70% success-rate prompts.
- Asymmetric clip ($\epsilon_{\text{low}} = 0.2$, $\epsilon_{\text{high}} = 0.3$), a touch looser on the upper bound than DAPO.
- No KL term on rule-verified tasks (math, code). The bounded reward and the dynamic-sampling filter are enough to prevent reward hacking; the KL term mostly slowed convergence. KL is kept for taste-based tasks (helpfulness, safety) where there is no programmatic ground truth.
- Streaming rollouts: rollouts that finish early are processed as they arrive instead of padded to the group's max length. This is a throughput patch, not a math patch, but it interacts with token-level reduction in a load-bearing way.
DAPO: Decoupled Clip and Dynamic Sampling
ByteDance released DAPO in early 2025 alongside their Qwen-2.5-32B post-training run. The paper's headline benchmark: 50% on AIME 2024 with a 32B model, using half the training steps of DeepSeek's R1-Zero run on the same base model. The gain decomposes into four named patches.
Patch 1: Clip-Higher
Vanilla PPO/GRPO clips the importance ratio symmetrically: $r \in [0.8, 1.2]$. The ByteDance team observed that this symmetric clip systematically suppresses the upward direction. Concretely: a token that the old policy assigned probability 0.05 to, but that turned out to be the right token (positive advantage), can have its probability raised by at most 20% per step: to 0.06, then 0.072, then 0.086, etc. After 7 steps you are only at about 0.18. The policy cannot climb out of a confident-but-wrong basin fast enough.
Clip-higher widens the upper bound asymmetrically: $r \in [0.8, 1.28]$. The lower bound stays at 0.8, so you still cannot drop a token's probability by more than 20% per step, and the policy cannot collapse away from a currently-good token too quickly. But the upper bound at 1.28 lets the same low-probability-but-good token climb 28% per step, which compounds to 0.05 → 0.064 → 0.082, etc., reaching about 0.28 after the same 7 steps.
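The compounding argument is easy to check numerically. The sketch below takes the simplification used in the text (the upper clip bound acts as a hard cap on per-step probability growth, and the token hits that cap every step), which is an idealisation of real training dynamics. Both function names are hypothetical.

```python
def probability_after(p0, max_ratio, steps):
    """Probability after `steps` updates that each multiply it by `max_ratio` (capped at 1)."""
    return min(1.0, p0 * max_ratio ** steps)


def steps_to_reach(p0, target, max_ratio):
    """Number of capped updates needed to lift a token from p0 to at least `target`."""
    if not (0.0 < p0 <= target <= 1.0) or max_ratio <= 1.0:
        raise ValueError("need 0 < p0 <= target <= 1 and max_ratio > 1")
    steps, p = 0, p0
    while p < target:
        p = min(1.0, p * max_ratio)
        steps += 1
    return steps


for upper in (1.2, 1.28):
    trajectory = [round(probability_after(0.05, upper, n), 3) for n in range(4)]
    print(f"upper bound {upper}: {trajectory}, "
          f"steps to reach 0.5 = {steps_to_reach(0.05, 0.5, upper)}")
```

Under these assumptions the wider bound saves a few updates on the way to a probability of 0.5, and the gap widens the lower the starting probability is.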
Patch 2: Dynamic Sampling
In the previous section we noted that vanilla GRPO gives zero gradient on groups where every rollout has the same reward. In practice this is not rare — for a model partway through training on a math problem, the "all wrong" case is common (the policy has not figured this problem out yet) and so is the "all right" case (this problem is already in the training distribution). DAPO measured the rate empirically on their 32B run and found 30–40% of groups were dead for most of training.
The fix is straightforward: before taking the gradient step, check whether the group has nonzero variance. If not, reject it and sample a different prompt. The rollout cost was sunk; the backward-pass cost is saved. Across a million-step run this recovers roughly half the wasted FLOPs.
At larger scale the implementation gets fancier — you keep a rolling per-prompt success-rate estimate, drop prompts that have gone to 0% or 100% success over the last N rollouts, and weight the prompt buffer toward the 30–70% success-rate band where the gradient signal is densest.
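The basic per-group filter described above fits in a few lines. This is a minimal sketch: `has_signal` and `fill_batch` are illustrative names, and `sample_group` stands in for whatever rollout-plus-verifier call a real pipeline would make.

```python
def has_signal(rewards, tol=1e-8):
    """True when the group's rewards are not all (numerically) identical."""
    return max(rewards) - min(rewards) > tol


def fill_batch(sample_group, batch_size, max_draws=1000):
    """Keep drawing groups until `batch_size` of them carry a nonzero advantage signal.

    sample_group: zero-argument callable returning one group's list of rewards
                  (a placeholder for the real rollout + verifier step).
    Returns the kept groups and the number of draws that were needed.
    """
    batch, draws = [], 0
    while len(batch) < batch_size and draws < max_draws:
        rewards = sample_group()
        draws += 1
        if has_signal(rewards):
            batch.append(rewards)
    return batch, draws


# Toy stream: all-wrong, mixed, all-right, mixed.
stream = iter([
    [0.0, 0.0, 0.0, 0.0],
    [1.1, 0.1, 0.0, 1.1],
    [1.1, 1.1, 1.1, 1.1],
    [0.1, 1.1, 0.1, 0.1],
])
batch, draws = fill_batch(lambda: next(stream), batch_size=2)
print(f"kept {len(batch)} groups out of {draws} draws")
```

The `max_draws` guard is a design choice made here so that the loop terminates even if the sampler only ever returns dead groups.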
Patch 3: Token-Level Policy Gradient Loss
The length bias we keep mentioning lives in vanilla GRPO's reduction formula: $\frac{1}{G}\sum_i \frac{1}{L_i}\sum_t$. DAPO replaces it with $\frac{1}{\sum_i L_i}\sum_i \sum_t$. Two consequences:
- Each token in the group has equal weight in the gradient regardless of which rollout it came from. Long chains of thought are no longer down-weighted per-token.
- A long rollout contributes proportionally more total gradient than a short rollout of the same advantage. This is the intuitively-correct outcome but it has a side effect: the model gets a marginal incentive to generate longer outputs. Hence patch 4.
Patch 4: Overlong Reward Shaping
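With token-level loss, a model can game its reward by padding correct-format tokens into a long, rambling answer. DAPO penalises this with a soft length budget. Let $L_{\text{soft}}$ be the budget (typically 4096 tokens), $L_{\text{hard}}$ the hard cap (typically 8192). The shaped reward for a rollout of length $L$ is:
$$R' = \begin{cases} R & L \le L_{\text{soft}} \\ R \cdot \dfrac{L_{\text{hard}} - L}{L_{\text{hard}} - L_{\text{soft}}} & L_{\text{soft}} < L < L_{\text{hard}} \\ 0 & L \ge L_{\text{hard}} \end{cases}$$
Linear ramp from full reward at $L_{\text{soft}}$ to zero reward at $L_{\text{hard}}$. The verifier still says the answer is right; the shaped reward says "but you took too long to say it". This is the cheapest possible anti-rambling regulariser: no extra model, no extra forward pass, just a multiplicative scaling on the existing reward.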
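The ramp described above (full reward up to the soft budget, linear decay to zero at the hard cap, applied multiplicatively) can be sketched as a single function. `shape_reward` is a hypothetical name, the defaults are the typical 4096 / 8192 values quoted in the text, and the multiplicative form follows this article's description rather than any particular codebase.

```python
def shape_reward(reward, length, soft=4096, hard=8192):
    """Scale `reward` by a linear ramp: 1 at or below `soft`, 0 at or above `hard`."""
    if hard <= soft:
        raise ValueError("hard cap must exceed the soft budget")
    if length <= soft:
        return reward
    if length >= hard:
        return 0.0
    return reward * (hard - length) / (hard - soft)


for length in (2000, 4096, 6144, 8192):
    print(length, shape_reward(1.1, length))
```

Because the scaling only touches the scalar reward, it can be applied right after the verifier call and before advantages are computed.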
Dr.GRPO: GRPO Done Right
The Sea/NUS team published Dr.GRPO in mid-2025, "Dr." for "done right". Their thesis is one sentence: the σ divisor in vanilla GRPO is mathematically convenient but empirically harmful. Remove it.
The argument unfolds in three steps.
Step 1: Why σ was added in the first place
DeepSeek's GRPO paper motivated the std divisor as "normalisation": they wanted advantages with zero mean and unit variance regardless of the underlying reward scale. With a bounded reward in $[0, 1.1]$, the divisor turns reward units into std-units so the optimiser's learning rate is scale-invariant. Plausible argument; consistent with PPO's per-batch advantage normalisation; very natural to write down.
Step 2: What σ actually does at training time
Consider two prompts in the same batch, where a winning rollout scores 1.1 and a losing rollout scores 0 (σ is the population standard deviation throughout):
| Prompt | Rollouts that succeed | μ | σ | Advantage of a winner |
|---|---|---|---|---|
| Easy | 7 out of 8 | 0.9625 | 0.364 | (1.1 − 0.9625)/0.364 = +0.38 |
| Hard | 3 out of 8 | 0.4125 | 0.533 | (1.1 − 0.4125)/0.533 = +1.29 |
That table looks reasonable. The hard prompt's winner gets a bigger advantage. But now flip it: consider a partial-credit prompt where most rollouts got format-only (reward 0.1) and a couple got full credit (reward 1.1):
| Prompt | Rewards | μ | σ | A of a 1.1-winner |
|---|---|---|---|---|
| Easy partial-credit | [1.1, 1.1, 1.1, 0.1, 0.1, 1.1, 1.1, 1.1] | 0.85 | 0.43 | +0.58 |
| Hard partial-credit | [0.1, 1.1, 0.1, 1.1, 0.1, 0.1, 1.1, 0.1] | 0.475 | 0.48 | +1.29 |
| Bimodal hard | [0, 1.1, 0, 1.1, 0, 1.1, 0, 1.1] | 0.55 | 0.55 | +1.00 |
| Tight cluster (easy) | [1.05, 1.07, 1.10, 1.10, 1.04, 1.08, 1.09, 1.10] | 1.079 | 0.022 | +0.96 (huge per-token push) |
The last row is the failure mode. A prompt where the policy is already getting near-maximum reward, with all 8 rollouts clustered in $[1.04, 1.10]$, produces an advantage of +0.96 with $\sigma = 0.022$. That is almost the same push as the bimodal hard prompt's +1.00, for a reward gap roughly 25 times smaller. The policy gradient will spend disproportionate effort on this nearly-solved prompt instead of on the hard ones. Conversely, a prompt with high variance, exactly where you want a strong signal, gets its signal divided down.
Step 3: The fix
Remove the divisor. Use $A_i = R_i - \mu$. The advantage scale now lives in the original reward units. A clustered easy prompt produces tiny advantages (correct: there is little to learn). A bimodal hard prompt produces large advantages (correct: there is a lot to learn). The optimiser's learning rate now has to absorb the reward scale, which is a one-time tuning rather than a per-prompt artefact.
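The effect of dropping the divisor can be checked on the tight-cluster and bimodal reward lists from the tables above. This is a sketch under one stated assumption: σ is taken as the population standard deviation, with a small epsilon added to avoid division by zero. The function names are illustrative.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Vanilla GRPO: centre by the group mean, then divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr.GRPO: centre by the group mean only, staying in reward units."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


tight = [1.05, 1.07, 1.10, 1.10, 1.04, 1.08, 1.09, 1.10]
bimodal = [0.0, 1.1, 0.0, 1.1, 0.0, 1.1, 0.0, 1.1]

for name, rewards in (("tight cluster", tight), ("bimodal hard", bimodal)):
    with_sigma = max(grpo_advantages(rewards))
    without_sigma = max(dr_grpo_advantages(rewards))
    print(f"{name}: best-rollout advantage with sigma = {with_sigma:+.3f}, "
          f"without sigma = {without_sigma:+.3f}")
```

With the divisor, the two prompts give their best rollout nearly the same advantage; without it, the advantages differ by the same factor as the underlying reward gaps.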
The empirical headline: on the same hardware and the same data, Dr.GRPO matched or beat vanilla GRPO on 12 of 14 reasoning benchmarks the paper tested, with the largest gains on out-of-distribution math problems — exactly where the difficulty-bias of vanilla GRPO would hurt most.
Olmo 3: A Production RLVR Recipe
Allen AI's Olmo 3 (late 2025 / early 2026) is the first frontier open-source model trained end-to-end with the variants stack. It does not introduce new math. It introduces a production engineering recipe for combining the existing patches, plus a curriculum that exploits dynamic sampling at the batch level.
The Olmo 3 loss
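$$\mathcal{L}_{\text{Olmo 3}}(\theta) = -\frac{1}{\sum_{i=1}^{G} L_i} \sum_{i=1}^{G} \sum_{t=1}^{L_i} \min\!\big( r_{i,t} (R_i - \mu),\ \operatorname{clip}(r_{i,t},\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}})\, (R_i - \mu) \big)$$
No KL term for verifiable tasks. No σ. DAPO-style token-level reduction. Asymmetric clip slightly looser than DAPO's. Plus a prompt curriculum built on top of dynamic sampling: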
- Maintain a per-prompt success-rate estimate over a rolling window of rollouts.
- Bucket prompts into [0–10%, 10–30%, 30–70%, 70–90%, 90–100%] success bands.
- Sampling weight peaks in the 30–70% band, so most of the gradient budget goes where the policy is on the verge of solving the prompt and has the most to learn.
- Prompts in the 0–10% band are paused for a number of epochs in case the policy needs more general capability before retrying. Prompts in the 90–100% band are retired entirely.
The result: more than 80% of GPU time is spent on prompts where the policy can both succeed and fail — i.e. where the gradient signal is most informative. Compared to a uniform sampler with the same total dataset, Olmo 3 reports roughly 3x reduction in training tokens to reach the same target score.
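A curriculum sampler of the kind described in the list above might look like the sketch below. Everything here is illustrative: the class name, the window size, and especially the per-band weights are assumptions chosen for the example (zero for the paused and retired bands, a peak in the 30–70% band), not values taken from the Olmo 3 recipe.

```python
import random
from collections import defaultdict, deque

# Upper edges of the five success-rate bands: 0-10%, 10-30%, 30-70%, 70-90%, 90-100%.
BAND_EDGES = [0.1, 0.3, 0.7, 0.9, 1.0]
# Hypothetical relative sampling weights per band (paused / low / peak / low / retired).
BAND_WEIGHTS = [0.0, 1.0, 4.0, 1.0, 0.0]


def band_index(rate):
    """Index of the band that a success rate in [0, 1] falls into."""
    for i, upper in enumerate(BAND_EDGES):
        if rate < upper:
            return i
    return len(BAND_EDGES) - 1  # rate == 1.0 belongs to the last band


class PromptCurriculum:
    def __init__(self, window=32):
        # Rolling window of recent pass/fail outcomes per prompt.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, prompt_id, outcomes):
        self.history[prompt_id].extend(bool(o) for o in outcomes)

    def success_rate(self, prompt_id):
        outcomes = self.history.get(prompt_id)
        return sum(outcomes) / len(outcomes) if outcomes else None

    def weight(self, prompt_id):
        rate = self.success_rate(prompt_id)
        if rate is None:
            return 1.0  # unseen prompts get a neutral weight (an assumption)
        return BAND_WEIGHTS[band_index(rate)]

    def sample(self, prompt_ids, k, rng=random):
        weights = [self.weight(p) for p in prompt_ids]
        if sum(weights) == 0:
            return []  # nothing eligible right now
        return rng.choices(prompt_ids, weights=weights, k=k)


curriculum = PromptCurriculum(window=8)
curriculum.record("mixed", [True] * 4 + [False] * 4)
curriculum.record("solved", [True] * 8)
curriculum.record("unsolved", [False] * 8)
print(curriculum.sample(["mixed", "solved", "unsolved"], k=5))
```

A real implementation would also need the timed "pause" for the lowest band; here that band simply gets weight zero until new outcomes move the prompt into another band.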
Side-by-Side Comparison
The four innovations distribute across the variants as follows:
| Feature | Vanilla GRPO | DAPO | Dr.GRPO | Olmo 3 |
|---|---|---|---|---|
| Advantage formula | (R − μ) / σ | (R − μ) / σ | R − μ | R − μ |
| Loss reduction | mean per-rollout, mean per-group | sum over tokens / total tokens | sum over tokens / total tokens | sum over tokens / total tokens |
| Clip range | [0.8, 1.2] symmetric | [0.8, 1.28] asymmetric | [0.8, 1.2] symmetric | [0.8, 1.3] asymmetric |
| Dynamic sampling | No | Yes (per-group) | No (paper-level) | Yes (per-prompt curriculum) |
| Overlong reward shaping | No | Yes | No | Yes |
| KL to reference | Yes (β = 0.04) | Yes (β = 0.04) | Yes (β = 0.04) | No (verifiable tasks) |
| Notable user | DeepSeek-R1 | Qwen-2.5-32B reasoning | Sea-Reasoner | Olmo 3 base RL |
| Reported gain over GRPO | — | ≈ 2× tokens-to-target | ≈ 1.5× generalisation | ≈ 3× tokens-to-target |
A useful way to read the table: DAPO is the "fix the reduction" family, Dr.GRPO is the "fix the advantage" family, Olmo 3 is "apply both, plus engineering". None of these is a finished story — by the time you read this, expect 2–3 further variants to have appeared, each isolating a new bias that the current generation overlooks.
Manual Numerical Walkthrough
Same setup as the previous section: the prompt asks for the smaller root of a quadratic, ground truth is 2, and the reward formula is $R = 1.0 \cdot [\text{answer correct}] + 0.1 \cdot [\text{format correct}]$. We will compute one training step under each variant and see how the same four rollouts produce three different gradient allocations.
One training step under GRPO, DAPO, and Dr.GRPO with the same rollouts
Setup. Four rollouts, all from the same prompt. Reward $R_i$ and token length $L_i$:
| i | Description | R | L (tokens) |
|---|---|---|---|
| 1 | Correct + well-formatted (short) | 1.10 | 220 |
| 2 | Wrong answer + correct format (rambling) | 0.10 | 950 |
| 3 | No tags, no extractable answer (rambling) | 0.00 | 1100 |
| 4 | Correct + well-formatted (short) | 1.10 | 180 |
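Group statistics.
$\mu = (1.10 + 0.10 + 0.00 + 1.10)/4 = 0.575$ and $\sigma^2 = \frac{1}{4}\sum_i (R_i - \mu)^2 = 0.2769$, so $\sigma = 0.526$.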
Step 1 — Advantages under each variant.
| i | R_i | R_i − μ | GRPO/DAPO A_i = (R−μ)/σ | Dr.GRPO A_i = R−μ |
|---|---|---|---|---|
| 1 | 1.10 | +0.525 | +0.998 | +0.525 |
| 2 | 0.10 | −0.475 | −0.903 | −0.475 |
| 3 | 0.00 | −0.575 | −1.093 | −0.575 |
| 4 | 1.10 | +0.525 | +0.998 | +0.525 |
GRPO and DAPO share the same per-rollout advantages — they differ only in the loss reduction. Dr.GRPO's advantages are smaller in magnitude and stay in REWARD units, not std-units.
Step 2 — Per-token gradient contribution. What pulls each token in each rollout. Assume the per-token ratio $r_{i,t} = 1$ at the first step (right after sampling) so we can ignore the clip-min for now.
| i | L_i | GRPO: A_i / L_i | DAPO: A_i (per token) | Dr.GRPO: A_i (per token) |
|---|---|---|---|---|
| 1 | 220 | +0.00454 | +0.998 | +0.525 |
| 2 | 950 | −0.00095 | −0.903 | −0.475 |
| 3 | 1100 | −0.00099 | −1.093 | −0.575 |
| 4 | 180 | +0.00554 | +0.998 | +0.525 |
Read the GRPO column. Rollout 4 (short, good) pulls each of its 180 tokens with +0.00554. Rollout 3 (long, bad) pulls each of its 1100 tokens with only −0.00099. The magnitudes are off by a factor of 5.6 even though the underlying advantages were within a factor of 2 of each other. Length bias, visible in two columns of numbers.
Read the DAPO column. Per-token pulls are unaffected by length. A token in the long-bad rollout pulls with −0.903; a token in the short-good rollout pulls with +0.998. Now the gradient flows in proportion to the advantage, not the rollout-length artefact.
Read the Dr.GRPO column. Same length behaviour as DAPO. Smaller magnitudes because σ is no longer in the denominator. The optimiser learning rate has to be tuned a bit higher to compensate; Olmo 3 uses a higher learning rate than vanilla GRPO did.
Step 3 — Total gradient share. Multiply the per-token contribution by the rollout length (the denominator's effect washes out across rollouts because it is the same constant). Normalise so the four shares sum to 100%:
| i | L_i | GRPO share | DAPO share | Dr.GRPO share |
|---|---|---|---|---|
| 1 | 220 | 25.0% | 8.9% | 8.9% |
| 2 | 950 | 22.6% | 34.9% | 34.9% |
| 3 | 1100 | 27.4% | 48.9% | 48.9% |
| 4 | 180 | 25.0% | 7.3% | 7.3% |
The diagnosis. Under GRPO, each rollout's total gradient share depends only on the magnitude of its advantage, so all four land near 25% regardless of length, because the (1/L_i) reduction makes long rollouts shed per-token signal to compensate for having more tokens. Under DAPO and Dr.GRPO, total gradient mass flows toward the longer rollouts even when their per-token advantages are negative: the two long rollouts together take about 84% of the total. This is the correct behaviour: a long confidently-wrong rollout should pull the policy strongly away from that mode, not get the same nudge as a short confidently-right one.
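The walkthrough's advantages and gradient shares can be recomputed directly from the four rewards and lengths in the setup table. This is a sketch with two assumptions spelled out: σ is the population standard deviation, and a rollout's "share" is the absolute per-token pull times its length, normalised over the group. The helper name `shares` is made up for the example.

```python
from statistics import mean, pstdev

rewards = [1.10, 0.10, 0.00, 1.10]
lengths = [220, 950, 1100, 180]

mu, sigma = mean(rewards), pstdev(rewards)
adv_std = [(r - mu) / sigma for r in rewards]  # GRPO and DAPO advantages
adv_raw = [r - mu for r in rewards]            # Dr.GRPO advantages


def shares(per_token_pull, lengths):
    """Fraction of total gradient mass per rollout: |per-token pull| * length, normalised."""
    mass = [abs(p) * n for p, n in zip(per_token_pull, lengths)]
    total = sum(mass)
    return [m / total for m in mass]


# Vanilla GRPO divides each rollout's advantage by its own length.
grpo_share = shares([a / n for a, n in zip(adv_std, lengths)], lengths)
# Token-level reduction: every token is pulled by the rollout's advantage itself.
dapo_share = shares(adv_std, lengths)
dr_grpo_share = shares(adv_raw, lengths)

print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
for i in range(len(rewards)):
    print(f"rollout {i + 1}: GRPO {grpo_share[i]:.1%}, "
          f"DAPO {dapo_share[i]:.1%}, Dr.GRPO {dr_grpo_share[i]:.1%}")
```

DAPO and Dr.GRPO give identical shares here because σ is a single constant for the group; the two only differ in share once several prompts with different σ sit in the same batch.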
Step 4 — Dynamic sampling check (DAPO and Olmo 3). Rewards are $[1.10, 0.10, 0.00, 1.10]$. They are not all equal, so $\sigma > 0$, so the group is KEPT. We take the gradient step. Compare with a degenerate case: if all four rewards were identical, say $[1.10, 1.10, 1.10, 1.10]$, DAPO and Olmo 3 would discard this group and resample. Vanilla GRPO would compute zero advantages and pay the backward-pass cost for nothing.
Step 5 — Overlong shaping (DAPO). The rambling rollouts have $L_2 = 950$ and $L_3 = 1100$. With a soft budget $L_{\text{soft}}$ between those two lengths, rollout 3 (length 1100) is over budget and has its reward scaled down (already zero, so no change). Rollout 2 (length 950) is under the soft budget, so no penalty. If the policy had output 1500 tokens with reward 1.1 (correct but rambling), the shaped reward would drop to $1.1 \cdot (L_{\text{hard}} - 1500)/(L_{\text{hard}} - L_{\text{soft}})$, a meaningful penalty that the gradient picks up.
The whole compute budget. Four rollouts on a 7B policy with chain-of-thought averaging ~600 tokens: roughly 24 seconds of H100 time at 50 tok/s decoded generation. Four verifier calls: under a millisecond on CPU. One backward pass: roughly 1.5 seconds. Variant switch costs zero — flipping from GRPO to Dr.GRPO is two lines of code and produces measurably different training dynamics.
Interactive: Watch the Three Variants Diverge
The widget below shows four rollouts of adjustable reward and length. The three coloured bars under each rollout show the per-token gradient contribution each variant assigns to that rollout. The table at the bottom shows the total gradient share across the group. Try the "Length-bias preset" — long bad rollouts barely show up on the GRPO bar while dominating the DAPO and Dr.GRPO bars. Try the "All-correct" preset and watch every variant collapse to zero (DAPO + Olmo 3 would also discard the group via dynamic sampling). Try the "Uniform length" preset and watch GRPO match DAPO exactly — length bias only manifests when rollouts have different lengths.
Plain Python: Three Advantage Functions, One File
The full variant stack — three advantage formulas, four DAPO patches, the Dr.GRPO removal of σ — in 120 lines of pure Python. No torch. Reads like a math derivation, runs on a laptop.
PyTorch: A Variant-Switchable Training Step
In production code you do not want three parallel training loops. You want one loop that flips between variants on a config flag, so you can A/B test variants on the same data and the same hardware. The function below is exactly that: one signature, three behaviours, every variant-specific line gated on that flag.
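As a torch-free stand-in for such a function, the sketch below computes the scalar policy loss for one group under any variant, with every variant-specific choice read from a config object. It is an illustration only: `VariantConfig`, `CONFIGS`, and `policy_loss` are hypothetical names, ratios are plain floats rather than tensors, the KL term is left out, and the clip bounds are the ones listed in the comparison table.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class VariantConfig:
    normalize_by_std: bool  # divide advantages by the group std (GRPO, DAPO)
    token_level: bool       # divide by total tokens instead of mean-of-means
    eps_low: float
    eps_high: float


CONFIGS = {
    "grpo": VariantConfig(True, False, 0.2, 0.2),
    "dapo": VariantConfig(True, True, 0.2, 0.28),
    "dr_grpo": VariantConfig(False, True, 0.2, 0.2),
    "olmo3": VariantConfig(False, True, 0.2, 0.3),
}


def policy_loss(ratios, rewards, cfg, eps=1e-8):
    """Clipped policy loss for one group.

    ratios:  one list of per-token importance ratios per rollout
    rewards: one scalar reward per rollout
    """
    mu = mean(rewards)
    scale = pstdev(rewards) + eps if cfg.normalize_by_std else 1.0
    advantages = [(r - mu) / scale for r in rewards]

    per_rollout = []
    for rollout, adv in zip(ratios, advantages):
        terms = []
        for r in rollout:
            clipped = min(max(r, 1.0 - cfg.eps_low), 1.0 + cfg.eps_high)
            terms.append(min(r * adv, clipped * adv))
        per_rollout.append(terms)

    if cfg.token_level:
        total_tokens = sum(len(t) for t in per_rollout)
        return -sum(sum(t) for t in per_rollout) / total_tokens
    return -mean([sum(t) / len(t) for t in per_rollout])


# One short winning rollout and one long losing rollout, ratios all 1 (first step).
ratios = [[1.0] * 2, [1.0] * 6]
rewards = [1.0, 0.0]
for name, cfg in CONFIGS.items():
    print(name, round(policy_loss(ratios, rewards, cfg), 4))
```

Keeping the variant in a frozen dataclass rather than in scattered `if name == ...` checks means a new variant is one more dictionary entry, and an A/B run differs only in which entry is selected.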
At Massive Scale: Why These Fixes Matter
The four bias patches look like small arithmetic changes. At scale they have outsized impact, for three reasons that compound:
Reason 1: Chain-of-thought is long
Reasoning-tuned models at the 32B–70B scale routinely produce chains of thought in the 2k–8k token range. The length bias of vanilla GRPO becomes severe: a 4000-token rollout pulls each of its tokens with $1/4000$ weight, against $1/400$ for a 400-token rollout, a 10x per-token disparity. The model is taught to prefer brevity over depth, exactly opposite to what reasoning training is supposed to produce. DAPO's token-level reduction is essential for any model whose chains of thought routinely exceed 1k tokens.
Reason 2: GPU hours are the budget
A frontier RL run costs millions of GPU-hours. Dynamic sampling's 20–40% reduction in wasted backward-passes translates directly into months of saved compute. On Olmo 3's reported budget the savings were roughly 600k H100-hours — comparable to the entire budget of a smaller 7B run. At this scale, an arithmetic patch is an infrastructure decision.
Reason 3: Distribution shift compounds
The difficulty bias that Dr.GRPO removes is small per step but consistent. Over a million-step run, the policy effectively learns the easy half of the curriculum well and underweights the hard half. The out-of-distribution evaluation gap that Dr.GRPO closes is small (typically 3–6 percentage points on AIME) but exactly in the regime that matters for benchmark-defining reasoning capability. At the post-training stage, where the cheap-to-improve gains have already been captured, fixing difficulty bias is one of the few remaining ways to move in-the-wild capability.
Engineering Reality: Which Variant Should You Use?
After all this — DAPO, Dr.GRPO, Olmo 3, four patches and a curriculum — the right question for a practitioner is what to actually deploy. Practical guidance:
If you are starting fresh
Use the Olmo 3 recipe: Dr.GRPO advantage (no σ), DAPO token-level reduction, asymmetric clip $[0.8, 1.3]$, per-prompt dynamic sampling, no KL on verifiable tasks. This is the current production sweet spot. The four patches together will give you roughly 3x faster convergence than a vanilla GRPO baseline on the same data.
If you are migrating an existing GRPO codebase
Apply the patches in this order, validating after each:
- Dynamic sampling first. Cheapest fix. Zero interaction with other patches. Wins on its own. Implementation: one if-statement before the gradient step. If you cannot deploy anything else, deploy this.
- Token-level reduction second. Implementation: change a single denominator in the loss. Required prerequisite for any model with long chains of thought. Pairs naturally with:
- Overlong reward shaping (concurrent with token-level). If you adopt token-level reduction without overlong shaping you will eventually see rambling. Apply them together.
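- Clip-higher. Cheapest hyperparameter sweep: try a few upper clip bounds between the symmetric 1.2 and Olmo 3's 1.3 (DAPO uses 1.28), keep what wins on validation. Do not bother pushing the upper bound much beyond that range: it tends to introduce instability that washes out the gain.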
- Dr.GRPO advantage (no σ) last. Most disruptive change because it changes the gradient SCALE, so you have to retune the learning rate. Roughly 5x larger LR is the typical adjustment. Run a small LR sweep before declaring it works.
If your task is taste-based, not verifiable
Keep the KL term. Reward hacking against a neural reward model is the central failure mode of taste-based RLHF and the KL leash prevents it. The variant patches still help — token-level reduction in particular is independent of reward-model vs verifier — but the "no KL" recipe of Olmo 3 is specific to rule-verified rewards.
The deeper lesson the post-GRPO generation teaches is that RL algorithms are reducible to the way per-token contributions are weighted into a scalar loss. The textbook objective almost never tells you how to do the reduction; the choice is buried in implementation details that nobody publishes, until somebody notices the choice was load-bearing all along. When you train your own model, scrutinise the reductions. That is where the next 2x improvement is hiding.