The Real Problem: You Can Only See the Outcome
The previous section gave us GRPO's hyperparameters. The section before that derived the algorithm. What no derivation tells you is what to put in the slot labelled $R$, the reward. For chat models the answer is easy: wire in a neural reward model trained on preferences. For a reasoning model, the answer is much harder, and it is where the entire personality of the trained model is decided.
Here is the asymmetry that defines this section. A reasoning rollout looks like this:
<think> …several hundred tokens of intermediate work, false starts, corrections, possibly the model talking to itself… </think><answer>\boxed{42}</answer>
You can verify the answer trivially — extract the boxed value, compare to the truth, done. You cannot verify the thinking without re-running it inside a stronger reasoner, which defeats the purpose. So the entire reasoning trajectory is process you want to shape using only outcome you can observe.
The reasoning-reward paradox: you want to teach the model to think carefully, but you can only grade what it says at the end. Every reward signal you design is an indirect bet that optimising the observable surface (answer correctness + structure + length + language) will pull the unobservable process (genuine deliberation) along with it.
This is not a problem that goes away with more data. It is a structural property of reasoning RL: the reward is sparse, the signal arrives only at the end of a long generation, and every reward design choice tilts the model toward a different compromise between "careful thinking", "fast guessing", "rambling stalling", and "confident wrongness".
Intuition: Reasoning Emerges from Pressure, Not Imitation
There are two philosophies for getting reasoning out of an LLM. The first is imitation: collect a dataset of carefully-annotated reasoning chains and SFT the model on them. The second is pressure: pick a reward function that scores final answers, let the model sample many attempts per prompt, push up the log-probability of the attempts that worked, and let reasoning emerge as the strategy the policy discovers to maximise reward.
Pressure is what R1-Zero proved possible. The base model had never been shown a single example of explicit chain-of-thought formatted in tags. Within 1500 training steps, the policy was producing well-structured think blocks of growing length, and within 8000 steps those think blocks contained genuine self-correction ("wait, let me check that") and AIME-level mathematical reasoning. Nobody taught the model to think. The reward function rewarded outcomes, and thinking emerged as the cheapest way to hit those outcomes.
The metaphor: imagine training a high-jumper. You can show them video of the Fosbury Flop and ask them to copy it (imitation). Or you can raise the bar and pay them when they clear it (pressure). Pressure produces unexpected techniques — the actual Fosbury Flop was invented this way. For reasoning, R1's "aha moment" pattern — where the model spontaneously interrupts itself to reconsider — is the equivalent: a strategy nobody designed, that the reward function accidentally selected for.
For the rest of this section we will design rewards on the pressure side of the dichotomy. We will not assume any chain-of-thought data exists. The only inputs to our reward are the prompt, the rollout, and the ground-truth answer.
The Math: A Composite Reward for a Process You Can't Directly Score
Let $x$ be a prompt with known ground truth $a^\star$. Let $y$ be a rollout drawn from the policy: a string containing some reasoning trace and a final answer. We are constructing a function $R(x, y)$ from observable pieces of $y$.
We will build it as a weighted sum of independent component rewards:
$$R(x, y) = w_{acc}\, r_{acc} + w_{fmt}\, r_{fmt} + w_{len}\, r_{len} - w_{lang}\, p_{lang}$$
Each piece has a precise definition.
Accuracy reward
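$$r_{acc}(y, a^\star) = \mathbb{1}\big[\operatorname{boxed}(y) = a^\star\big]$$
The indicator is 1 if the regex-extracted boxed value $\operatorname{boxed}(y)$ matches the ground truth $a^\star$, else 0. This is the only signal that ties the model to objective correctness; with $w_{acc} = 0$ the model has no incentive to be right.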
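The accuracy check can be sketched in a few lines of Python. This is a minimal illustration, not the verifier of any published run: the names `BOXED_RE` and `accuracy_reward` are hypothetical, the pattern assumes the boxed value contains no nested braces, and the comparison is a plain string match after trimming whitespace.

```python
import re

# Assumed pattern: \boxed{...} whose contents contain no nested braces.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def accuracy_reward(rollout: str, truth: str) -> float:
    """Return 1.0 if the first boxed value equals the ground truth, else 0.0."""
    match = BOXED_RE.search(rollout)
    if match is None:
        return 0.0  # a rollout with no boxed answer counts as wrong
    return 1.0 if match.group(1).strip() == truth.strip() else 0.0


print(accuracy_reward(r"<think>...</think><answer>\boxed{42}</answer>", "42"))  # 1.0
print(accuracy_reward(r"<answer>\boxed{41}</answer>", "42"))  # 0.0
```

Taking the first match keeps the sketch short; the hack list later in this section argues for matching the last boxed value instead.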
Format reward
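$$r_{fmt}(y) = \mathbb{1}\big[\,y \text{ matches } \texttt{<think>}\dots\texttt{</think><answer>}\dots\texttt{</answer>}\,\big]$$
A regex check. With $w_{fmt} = 0$ the model has no reason to keep the structure, and R1's emergent thinking would not happen: the model would just write the answer.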
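One way to write that regex check, as a sketch. The exact pattern used by any real run is not given here, so the layout below (one think block, then one answer block, nothing else but whitespace) is an assumption, and `format_reward` is a hypothetical name.

```python
import re

# Assumed layout: <think>...</think> followed by <answer>...</answer>, nothing else.
TAG_RE = re.compile(
    r"\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,  # let '.' span newlines inside the blocks
)


def format_reward(rollout: str) -> float:
    """Return 1.0 when the whole rollout follows the think-then-answer layout."""
    return 1.0 if TAG_RE.fullmatch(rollout) else 0.0


print(format_reward("<think>factor it</think><answer>\\boxed{2}</answer>"))  # 1.0
print(format_reward("\\boxed{2}"))  # 0.0
```

Using `fullmatch` rather than `search` means text before the think block or after the answer block also forfeits the format reward.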
Length reward (saturated, optional)
$$r_{len}(y) = \frac{\min(L, L_{cap})}{L_{cap}}$$
where $L$ is the think-block length in tokens and $L_{cap}$ is the saturation cap (e.g. 256). This is a piecewise-linear ramp from 0 to 1, capped so that extra-long rollouts get no extra reward. R1-Zero used $w_{len} = 0$ and let length emerge naturally from accuracy pressure, but several downstream models add a small positive weight to accelerate the emergence.
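The ramp is a one-liner. The sketch below uses the example cap of 256 and evaluates it at the three think-block lengths that appear in the walkthrough later in this section; `length_reward` is a hypothetical name.

```python
def length_reward(think_tokens: int, cap: int = 256) -> float:
    """Saturated ramp: linear from 0 to 1 up to `cap` tokens, flat afterwards."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return min(max(think_tokens, 0), cap) / cap


for n_tokens in (18, 142, 612):
    print(n_tokens, round(length_reward(n_tokens), 3))
```

Everything past the cap scores exactly 1.0, so a 612-token ramble earns no more than a 256-token derivation.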
Language consistency penalty
$$p_{lang}(y) = \mathbb{1}\big[\text{the think block mixes languages beyond a small threshold}\big]$$
A subtractive penalty: it enters the total reward as $-w_{lang}\, p_{lang}$. R1-Zero did not have this. R1 added it because R1-Zero's thinking spontaneously became bilingual (English+Chinese in the same trace), a perfectly valid reward-maximising strategy that humans found unreadable. The penalty fixes the readability without touching the underlying math ability.
Putting it together with GRPO
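Inside a GRPO group of $G$ rollouts of the same prompt, the advantage of rollout $i$ is:
$$A_i = \frac{R_i - \mu_G}{\sigma_G}, \qquad \mu_G = \frac{1}{G}\sum_{j=1}^{G} R_j,$$
where $R_i$ is the composite reward of rollout $i$ and $\sigma_G$ is the standard deviation of the $G$ rewards in the group,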
and the surrogate loss for the policy update is the standard PPO clip from section 15.2.
The mathematical content of the section is not in the surrogate. It is in the choice of which components live inside $R$. The rest of the section is about why each component is there and what happens to the trained model when you change its weight.
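The group-relative advantage is short enough to sketch directly. Two details below are assumptions rather than something this section specifies: the standard deviation is the population form (dividing by the group size), and a small `eps` guards the division when every rollout in the group receives the same reward.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalise each reward against the mean and spread of its own group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed_group = group_advantages([1.0, 0.0, 1.0, 1.0])
uniform_group = group_advantages([1.0, 1.0, 1.0, 1.0])
print([round(a, 3) for a in mixed_group])
print(uniform_group)  # all zeros: no spread, no signal
```

The advantages always sum to zero within a group, which is why no separate value network is needed as a baseline.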
Outcome Rewards vs Process Rewards
Every reasoning-reward design has to take a side on one big question: do you reward only the final outcome (the boxed answer), or do you also score intermediate steps of the reasoning?
| Approach | What you score | Cost | Risk |
|---|---|---|---|
| Outcome Reward Model (ORM) — what R1 used | Only the final answer (regex + exact match) | Nearly zero (microseconds per rollout) | Sparse signal; long rollouts have credit-assignment problems |
| Process Reward Model (PRM) — Let's Verify Step-by-Step | Each reasoning step scored by a separate model | Big — needs a step-annotated dataset and a 7B+ verifier | PRM hallucinates step quality; reward-hackable; expensive |
| Hybrid: ORM + a small PRM term | Outcome reward + bonus for steps PRM rates as 'plausible' | Medium — PRM is loaded but only consulted sparsely | Best of both, hardest to debug; rarely used in production |
For a long time the field assumed PRMs were the future. The OpenAI "Let's Verify Step-by-Step" paper showed PRMs beat ORMs on math eval at fixed inference budget. The conventional wisdom was: spend the labelling money on per-step annotations, train a PRM, use it as the reward signal during RL.
R1 broke that assumption. R1-Zero used a pure ORM (boxed exact-match) and produced a stronger reasoner than any PRM-based method to date. Three reasons:
- PRMs are reward-hackable. A neural model that scores reasoning steps is itself a neural model with cracks. The policy finds and exploits them in the same way it would a regular RM. At scale this is a worse failure mode than the credit-assignment problem of ORMs, because PRM hacks look like "reasoning" — they fool you, then the model fails on eval.
- GRPO solves the credit-assignment problem implicitly. The standard objection to ORMs is "the model can't tell WHICH token in a 2000-token trace mattered". GRPO's answer: it doesn't need to. With G=16 rollouts of the same prompt, the gradient updates ALL tokens of all positively advantaged rollouts equally. The signal averages over enough rollouts that good reasoning patterns end up reinforced even without per-step credit.
- ORMs are deterministic. The same rollout today and a month from now gets the same reward. PRMs drift with retraining. Determinism is what lets you debug an 8000-step training run by replaying rewards.
The empirical rule from R1, DAPO, Olmo, Qwen-2.5-Math, and the rest of the post-2024 reasoning models is: start with an ORM. Add PRM-style components only if outcomes alone plateau, and even then keep their weight small (sub-0.1) so their hacks cannot dominate.
Length Dynamics: The Aha Moment and Its Dark Twin
Plot mean rollout length against training steps for any successful reasoning-RL run and you see the same shape: a flat line for the first few hundred steps, then a sharp upward bend. R1's curve goes from ~150 tokens at step 500 to ~1500 tokens at step 8000, with eval accuracy climbing in lockstep. This is the "aha moment" — the point where the policy discovers that thinking longer wins more rewards.
It is also the point where reward hacking begins. Length growth is good only when it is purposeful. The same gradient that encourages long, correct reasoning will, if you let it, encourage long noise. Three observed failure modes:
- Self-talk padding: the model fills the think block with "Let me think… still thinking… hmm…" for hundreds of tokens before producing the same answer it would have produced in 50. Reward gains: none. Length: grows linearly.
- Repeated-self-correction: the model finds one correct path, then writes 20 variations of "let me double check" that re-derive the same fact. Genuinely helpful on hard problems; pure waste on easy ones.
- Off-topic excursion: the model digresses into an unrelated lemma, proves it, and circles back. Sometimes useful for the actual problem, sometimes pure entropy.
R1-Zero used $w_{len} = 0$ (no explicit length bonus) and still produced the aha-moment length growth, because longer correct answers exist and the accuracy reward already loves them. The equilibrium length emerged from accuracy pressure alone.
Several later papers (notably DAPO) added a small explicit length bonus to speed up the aha moment — train in 4000 steps instead of 8000. The catch is that any positive length bonus needs a paired anti-rambling cap, or the policy learns to pad before it learns to think. The cap has two common forms:
| Anti-rambling mechanism | How it works | Used by |
|---|---|---|
| Saturated length bonus | r_len = min(L, L_cap)/L_cap — flat above L_cap, so extra tokens are free but unrewarded | DAPO, Qwen-2.5-Math |
| Soft repetition penalty | Subtract a small reward when n-gram repetition density exceeds threshold | Olmo 3, DeepSeek-V3 chat RL |
| Hard max_new_tokens cap | Tokenizer-level truncation at e.g. 4096; rollouts hitting the cap auto-score 0 | Universal — the safety net underneath everything else |
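The soft repetition penalty in the table can be sketched as follows. The definition of "density" (share of n-grams that repeat an earlier one), the trigram size, the threshold, and the scale are all illustrative assumptions, not values taken from any of the cited models.

```python
from collections import Counter


def repetition_density(tokens, n=3):
    """Share of n-grams that duplicate an n-gram seen elsewhere in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeated / len(ngrams)


def repetition_penalty(tokens, n=3, threshold=0.3, scale=0.2):
    """Amount to subtract from the reward; zero while density stays under threshold."""
    return scale * max(0.0, repetition_density(tokens, n) - threshold)


ramble = "let me think".split() * 20
clean = "factor the quadratic into two linear terms".split()
print(round(repetition_density(ramble), 3), round(repetition_penalty(ramble), 3))
print(repetition_density(clean), repetition_penalty(clean))
```

Because the penalty depends on repetition rather than raw length, a long derivation with fresh content is left alone while padded self-talk is charged.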
The interactive widget below lets you turn on "naïve length bonus" with and without the repetition penalty — the difference in which rollout gets the top advantage is exactly the difference between a model that thinks longer and a model that just talks more.
Language Consistency: R1's Hidden Reward
R1-Zero's released paper has a famous footnote: midway through training, the team noticed the model's think blocks had become spontaneously bilingual. A typical rollout would start in English ("Let me factor…"), switch to Chinese for the actual reasoning ("因式分解 (x-2)(x-3)=0…"), and end in English ("…so the answer is 2"). Accuracy was unaffected. Human readers were horrified.
Why did this happen? The base model was multilingual; the reward function only graded the final boxed answer, not the language of the thinking. The policy correctly inferred that Chinese tokens have lower entropy for mathematical reasoning (because of the training corpus distribution), and started using whichever language produced the highest reward in expectation. Bilingual reasoning was a reward-maximising behaviour. The reward function had no veto.
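R1 (not R1-Zero) added a language consistency reward component:
$$p_{lang}(y) = \begin{cases} 1 & \text{if the minority-language share of the think block exceeds } 10\% \\ 0 & \text{otherwise} \end{cases}$$
subtracted from the total with a weight $w_{lang} = 0.5$. The penalty is binary: either the think block is consistent or it isn't. A smooth penalty would reward being "9% mixed" over "11% mixed", which is the wrong gradient direction; you want a cliff.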
R1's ablation: removing the language penalty restored bilingual rollouts within 200 steps. Adding it back cleared them in another 300 steps. The cost on AIME accuracy: zero. This is a perfect example of a costless quality-of-life reward — readability for free, paid in microseconds of regex-counting.
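A script counter of this kind can be sketched with the standard library alone. The approach below (classifying letters by their Unicode character name) is an assumption chosen for illustration, not a description of R1's implementation; `script_of` and `language_penalty` are hypothetical names, and the 10% threshold is the one quoted in the hack table later in this section.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Coarse script label for one character, or None for non-letters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "cjk"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def language_penalty(think, threshold=0.10):
    """Binary cliff: 1.0 once the minority-script share exceeds the threshold."""
    counts = Counter(s for s in map(script_of, think) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    minority_share = (total - max(counts.values())) / total
    return 1.0 if minority_share > threshold else 0.0


print(language_penalty("Let me factor the quadratic"))  # 0.0
print(language_penalty("Let me factor 因式分解得到两个根所以答案是二"))  # 1.0
```

Digits, operators, and punctuation are ignored, so a think block full of equations is not penalised for containing few letters.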
Difficulty-Aware Shaping: Reward Curriculum
A subtle property of GRPO with rule rewards: the gradient signal on any prompt depends on the spread of rewards across the group, because every advantage is measured against the group mean. Two pathological cases:
- All G rollouts succeed: $\sigma_G = 0$, advantages all zero, gradient zero. The prompt is too easy, so there is no learning.
- All G rollouts fail: same $\sigma_G = 0$, same zero gradient. The prompt is too hard, so there is no learning.
Useful learning only happens on prompts where $\sigma_G > 0$, the policy's frontier. This gives us a free curriculum signal: track the per-prompt mean reward across training, drop prompts that saturate (always 0 or always 1), upsample prompts in the 30–70% success-rate band.
Three difficulty-aware shaping techniques used in production:
| Technique | Mechanism | Effect |
|---|---|---|
| Online filter | Skip prompts where last-N-step mean reward > 0.95 or < 0.05 | Removes 'no learning' prompts mid-training; ~2x sample efficiency |
| Difficulty-stratified batching | Construct each training batch with a target reward histogram | Stabilises σ_G across batches; smoother gradient norms |
| Curriculum interpolation | Start with easy prompts, shift to harder as accuracy rises | R1, Qwen-2.5-Math both use a soft version; difficulty band slides up over training |
None of these change the per-rollout reward function — they change which prompts see the reward function. But they belong in this section because in practice the "reasoning reward" you ship is the reward function plus the prompt-selection policy, and the two are almost as load-bearing as each other.
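The online filter from the table can be sketched as a small bookkeeping class. The class name, the window size, and the choice to track success rate (rather than the full composite reward) are assumptions for illustration; the 0.05 and 0.95 cut-offs are the ones in the table.

```python
from collections import defaultdict, deque


class OnlineDifficultyFilter:
    """Tracks recent per-prompt success rates and flags saturated prompts."""

    def __init__(self, window=8, low=0.05, high=0.95):
        self.low, self.high = low, high
        # One bounded history per prompt; old steps fall off automatically.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, prompt_id, success_rate):
        """Store the fraction of the group that answered this prompt correctly."""
        self.history[prompt_id].append(success_rate)

    def keep(self, prompt_id):
        """Return False for prompts that are currently too easy or too hard."""
        seen = self.history.get(prompt_id)
        if not seen:
            return True  # never sampled yet: give it a chance
        mean = sum(seen) / len(seen)
        return self.low <= mean <= self.high


flt = OnlineDifficultyFilter()
flt.record("easy", 1.0)
flt.record("hard", 0.0)
flt.record("frontier", 0.5)
print([p for p in ("easy", "hard", "frontier", "unseen") if flt.keep(p)])
```

A bounded window matters here: a prompt that was too hard early in training can re-enter the pool once older failures age out of its history.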
Manual Numerical Walkthrough
We compute one GRPO step on a single math prompt with G=6 rollouts, scoring each under two different reward designs to see how the choice changes the gradient direction.
Same six rollouts, two reward designs, two completely different gradient signals.
Setup. Prompt = "Solve $x^2 - 5x + 6 = 0$. Smaller root in \boxed{}." Ground truth $a^\star = 2$. Six rollouts spanning the qualitative behaviour space:
| i | Description | acc | fmt | L_think | mixed-lang |
|---|---|---|---|---|---|
| 1 | Short, correct, well-formatted | 1 | 1 | 18 | 0 |
| 2 | Long thinking, correct (aha) | 1 | 1 | 142 | 0 |
| 3 | Rambling, correct | 1 | 1 | 612 | 0 |
| 4 | Correct, no tags | 1 | 0 | 0 | 0 |
| 5 | Wrong, perfect format + long | 0 | 1 | 84 | 0 |
| 6 | Correct, bilingual | 1 | 1 | 22 | 1 |
Design A: R1-Zero recipe. $w_{acc} = 1.0$, $w_{fmt} = 0.1$, $w_{len} = 0$, $w_{lang} = 0$.
| i | acc | fmt | len | lang | R |
|---|---|---|---|---|---|
| 1 | 1.00 | 0.10 | 0.00 | 0.00 | 1.10 |
| 2 | 1.00 | 0.10 | 0.00 | 0.00 | 1.10 |
| 3 | 1.00 | 0.10 | 0.00 | 0.00 | 1.10 |
| 4 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 5 | 0.00 | 0.10 | 0.00 | 0.00 | 0.10 |
| 6 | 1.00 | 0.10 | 0.00 | 0.00 | 1.10 |
$\mu_G = 0.917$, $\sigma_G = 0.367$ (population standard deviation, dividing by $G = 6$).
Advantages $A_i = (R_i - \mu_G)/\sigma_G$:
| i | R | R-μ | A |
|---|---|---|---|
| 1 | 1.10 | +0.183 | +0.50 |
| 2 | 1.10 | +0.183 | +0.50 |
| 3 | 1.10 | +0.183 | +0.50 |
| 4 | 1.00 | +0.083 | +0.23 |
| 5 | 0.10 | -0.817 | -2.22 |
| 6 | 1.10 | +0.183 | +0.50 |
Reading Design A: rollout 5 (wrong answer) gets the only strongly negative advantage. Rollouts 1, 2, 3, 6 all tie at +0.50. Rollout 3 (rambling, 612 tokens) and rollout 6 (bilingual) get the same positive advantage as rollout 1 (the ideal). Under this reward design, the policy gradient does nothing to discourage rambling or language mixing. R1-Zero's actual behavioural collapses came from exactly this.
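Design B: R1 with anti-hack components. $w_{acc} = 1.0$, $w_{fmt} = 0.1$, $w_{len} = 0$, $w_{lang} = 0.5$, repetition penalty on. With $w_{len} = 0$, the len column below carries only the repetition penalty.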
| i | acc | fmt | len | lang | R |
|---|---|---|---|---|---|
| 1 | 1.00 | 0.10 | 0.00 | 0.00 | 1.10 |
| 2 | 1.00 | 0.10 | 0.00 | 0.00 | 1.10 |
| 3 | 1.00 | 0.10 | -0.16 | 0.00 | 0.94 |
| 4 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 5 | 0.00 | 0.10 | 0.00 | 0.00 | 0.10 |
| 6 | 1.00 | 0.10 | 0.00 | -0.50 | 0.60 |
$\mu_G = 0.807$, $\sigma_G = 0.358$ (population standard deviation, dividing by $G = 6$).
| i | R | R-μ | A |
|---|---|---|---|
| 1 | 1.10 | +0.293 | +0.82 |
| 2 | 1.10 | +0.293 | +0.82 |
| 3 | 0.94 | +0.133 | +0.37 |
| 4 | 1.00 | +0.193 | +0.54 |
| 5 | 0.10 | -0.707 | -1.97 |
| 6 | 0.60 | -0.207 | -0.58 |
Reading Design B: the picture is now ordered. Rollouts 1 and 2 (ideal: well-formatted, correct, non-rambling) get the strongest positive advantage. Rollout 4 (correct but no format) gets a smaller positive advantage — enough to reward correctness, not enough to beat a properly-formatted rollout. Rollouts 3 (rambling) and 6 (bilingual) drop to weakly-positive or weakly-negative — the gradient now actively discourages them. Rollout 5 (wrong) keeps its strong negative advantage. The gradient is monotonic in quality, which is exactly the property you want.
Same six rollouts. Same prompt. Same algorithm. Different reward design. The policy gradient points in two completely different directions. This is the entire reason reward design is the highest-leverage knob in reasoning RL.
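The walkthrough can be replayed in a few lines, which is a cheap way to check any hand-computed table. This sketch makes two assumptions: advantages are normalised with the population standard deviation, and the 0.16 repetition penalty on rollout 3 is taken as given from the Design B table rather than recomputed from text.

```python
from statistics import fmean, pstdev

# (acc, fmt, mixed_lang, repetition penalty applied when the penalty is on)
ROLLOUTS = [
    (1, 1, 0, 0.00),  # 1: short, correct, well-formatted
    (1, 1, 0, 0.00),  # 2: long thinking, correct
    (1, 1, 0, 0.16),  # 3: rambling, correct
    (1, 0, 0, 0.00),  # 4: correct, no tags
    (0, 1, 0, 0.00),  # 5: wrong, well-formatted
    (1, 1, 1, 0.00),  # 6: correct, bilingual
]


def score(rollout, w_acc, w_fmt, w_lang, use_repetition_penalty):
    acc, fmt, mixed, rep = rollout
    penalty = rep if use_repetition_penalty else 0.0
    return w_acc * acc + w_fmt * fmt - w_lang * mixed - penalty


def advantages(rewards):
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / sigma for r in rewards]


design_a = [score(r, 1.0, 0.1, 0.0, False) for r in ROLLOUTS]
design_b = [score(r, 1.0, 0.1, 0.5, True) for r in ROLLOUTS]
for name, rewards in (("A", design_a), ("B", design_b)):
    print(name, [round(r, 2) for r in rewards],
          [round(a, 2) for a in advantages(rewards)])
```

Changing a weight and rerunning takes seconds, which makes this kind of script a reasonable stand-in for the spreadsheet check before a real run.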
Interactive: Compose a Reasoning Reward
Drag the four weight sliders and toggle the repetition penalty. Each change re-scores the six rollouts and recomputes the GRPO group advantages live. Try four exercises:
- Click the R1-Zero preset. Notice rollouts 1, 2, 3, 6 all tie. The format hack, the length hack, and the bilingual rollout are invisible to the reward.
- Click the R1 full preset. Notice the language penalty drops rollout 6's advantage below zero and the repetition penalty drops rollout 3's. The gradient is now ordered.
- Set $w_{acc} = 0$ and $w_{fmt} > 0$ (the "Format only" preset). Notice rollout 5 (well-formatted wrong) becomes the highest-advantage rollout. This is the collapsed-to-format failure mode every reasoning RL run is one bad weight away from.
- Set $w_{len} > 0$ with the repetition penalty OFF. Notice rollout 3 (the rambler) climbs to the top. Turn the penalty back ON; rollout 2 (purposeful long think) wins instead. The penalty is the difference between teaching the model to think and teaching it to talk.
The bottom "Behavioural diagnosis" panel summarises which failure mode each weight configuration is one training run away from. In production this is exactly the kind of analysis you do before burning $100k of cluster time — at the cost of three minutes with a spreadsheet.
Plain Python: The Reasoning Reward Bundle
The whole reward function fits in a dataclass and 30 lines of body. No model, no GPU, no torch — every piece is pure regex-and-counting Python. Read it as the complete training-time reward signal of an R1-Zero-style run.
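A minimal sketch of such a bundle, assuming the component definitions from earlier in this section. The names `RewardWeights` and `reasoning_reward` are hypothetical, the default weights follow the R1-Zero recipe quoted in this section, a whitespace split stands in for a real tokenizer, and the script counter only distinguishes Latin from CJK letters.

```python
import re
import unicodedata
from dataclasses import dataclass

TAG_RE = re.compile(r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


@dataclass
class RewardWeights:
    w_acc: float = 1.0
    w_fmt: float = 0.1
    w_len: float = 0.0   # R1-Zero-style default: no length bonus
    w_lang: float = 0.0  # R1-Zero-style default: no language penalty
    len_cap: int = 256
    lang_threshold: float = 0.10


def _mixed_script(text, threshold):
    """True when the minority of {Latin, CJK} letters exceeds the threshold."""
    latin = cjk = 0
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            latin += name.startswith("LATIN")
            cjk += name.startswith("CJK")
    total = latin + cjk
    return total > 0 and min(latin, cjk) / total > threshold


def reasoning_reward(rollout, truth, weights=None):
    w = weights or RewardWeights()
    m = TAG_RE.fullmatch(rollout)
    fmt = 1.0 if m else 0.0
    think = m.group(1) if m else ""
    answer_region = m.group(2) if m else rollout  # untagged rollouts can still be right
    boxed = BOXED_RE.findall(answer_region)
    acc = 1.0 if boxed and boxed[-1].strip() == truth.strip() else 0.0
    n_tokens = len(think.split())  # assumption: whitespace tokens, not model tokens
    length = min(n_tokens, w.len_cap) / w.len_cap
    lang = 1.0 if _mixed_script(think, w.lang_threshold) else 0.0
    return w.w_acc * acc + w.w_fmt * fmt + w.w_len * length - w.w_lang * lang


print(reasoning_reward(r"<think>factor into (x-2)(x-3)</think><answer>\boxed{2}</answer>", "2"))
```

Passing `RewardWeights(w_lang=0.5)` switches the same function to the R1-style configuration without touching the scoring code.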
Notice what is not in this file. There is no neural network, no learned parameter, no preference dataset, no PRM, no value function. The entire signal driving the policy toward chain-of-thought is the regex stack and the script counter. That is the empirical core of the "reasoning emerges from pressure" thesis: when the verifier is this thin, the policy has to do the work.
PyTorch: GRPO Step With a Reasoning Reward
Wiring the reward bundle into a real GRPO step is a small change to the standard PPO loop. Two models live on the GPU (policy + frozen reference); the reward bundle runs on the CPU; the only number that crosses back is the scalar reward $R_i$ for each rollout.
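A dependency-free sketch of the arithmetic such a step performs for one group, written with plain floats so it runs without torch; in a real loop the log-probabilities would be tensors produced by the policy and the frozen reference. Several simplifications are assumptions: log-probabilities are per-sequence rather than per-token, the clip range of 0.2 is illustrative, the KL coefficient is the 0.001 quoted later in this section, and `grpo_loss` is a hypothetical name.

```python
import math
from statistics import fmean, pstdev


def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              clip_eps=0.2, kl_coef=0.001, eps=1e-6):
    """Clipped surrogate with a KL penalty to the reference, averaged over a group."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    total = 0.0
    for lp_new, lp_old, lp_ref, reward in zip(logp_new, logp_old, logp_ref, rewards):
        advantage = (reward - mu) / (sigma + eps)
        ratio = math.exp(lp_new - lp_old)  # importance ratio vs the sampling policy
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        surrogate = min(ratio * advantage, clipped * advantage)
        # Non-negative KL estimate to the reference: exp(d) - d - 1 with d = log(ref/new).
        d = lp_ref - lp_new
        kl = math.exp(d) - d - 1.0
        total += -(surrogate - kl_coef * kl)
    return total / len(rewards)


# The policy has moved toward the rewarded rollout: the loss goes negative.
print(grpo_loss(logp_new=[-9.0, -10.0], logp_old=[-10.0, -10.0],
                logp_ref=[-10.0, -10.0], rewards=[1.0, 0.0]))
```

The rewards enter only through the advantages, which is why the reward bundle can stay on the CPU and hand back one scalar per rollout.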
Read this loop alongside the standard PPO-RLHF loop and count what you removed. No value network (saved: ~7B parameters of FSDP-sharded memory). No reward model (saved: ~7B params + the whole reward-model training pipeline). No preference dataset (saved: ~$1M of labelling). What is left is the policy, the reference, the regex, and the optimiser. The reasoning emerges in the gap between the policy and the reference under the regex's gradient.
At Massive Scale: R1, Kimi, and Olmo's Choices
Real production reasoning-RL runs use this same template with small but consequential variations. Three published configs worth comparing.
DeepSeek R1-Zero (671B, 8000 steps)
| Component | Weight |
|---|---|
| Accuracy (boxed exact-match for math; pass-rate for code) | 1.0 |
| Format (<think>/<answer> regex) | 0.1 |
| Length bonus | 0 (let it emerge) |
| Language consistency penalty | 0 (caused the bilingual collapse) |
| Process / step rewards | 0 (pure ORM) |
Group size $G = 16$, KL coefficient 0.001, temperature 0.6, max_new_tokens 4096. The whole reward is two regexes. R1-Zero hit 86.7% on AIME 2024 with this when scored by majority voting over 64 samples (71.0% pass@1), a result that destroyed the "you need PRMs" orthodoxy.
DeepSeek R1 (the SFT-then-RL version)
| Component | Weight |
|---|---|
| Accuracy | 1.0 |
| Format | 0.1 |
| Length bonus | 0 |
| Language consistency penalty | 0.5 (added to fix R1-Zero collapse) |
| Repetition penalty (n-gram density) | implicit via sampling |
| Helpful/harmless RM (mixed phases only) | 0.2 in alignment phase |
The language penalty is the entire functional difference in the reward stack between R1-Zero and R1. The helpful/harmless RM appears only in the final alignment phase, after reasoning has stabilised.
Kimi K1.5
| Component | Weight |
|---|---|
| Accuracy | 1.0 |
| Format | 0.05 (lower than R1; their base model was already SFT'd) |
| Length bonus (linear, no saturation) | small positive — explicit length shaping |
| Length penalty (above cap) | negative, capped |
| Process Reward Model (rare, ~5% of prompts) | 0.05 (hybrid) |
Kimi is the most interesting outlier: they kept a small PRM term, used it sparsely (only on the 5% hardest prompts), and added explicit length shaping. Result: marginally better hard-math performance than R1, marginally worse aha-moment emergence. The trade-off is real and the design choice is task-driven.
Olmo 3 (open source)
| Component | Weight |
|---|---|
| Accuracy | 1.0 |
| Format | 0.1 |
| Length bonus, saturated at 1024 | 0.2 |
| Repetition penalty (n-gram) | 0.3 |
| Language consistency | 0.5 |
| Group size | G = 64 (large for stable σ) |
Olmo 3 is the "every-component-on" design. The G=64 group size is unusually large; it stabilises $\sigma_G$ across batches at the cost of 4x more rollout compute per gradient step. They chose this for reproducibility, not throughput: it is easier to debug when the advantage estimator has lower variance.
Engineering Reality: Reasoning-Specific Reward Hacks
Every successful reasoning model published the hacks they caught and patched. Here is the working list.
| Hack | Symptom | Patch |
|---|---|---|
| Empty <think> block | Model writes <think></think>\boxed{...} — gets full format reward, zero thinking | Subtract format reward when len(<think>) < 8 tokens; or require non-whitespace inside |
| Copy the answer into <think> | Model writes <think>The answer is 42</think><answer>\boxed{42}</answer> | Check think body for absence of boxed pattern; or require derivation length |
| Bilingual reasoning (R1-Zero's famous case) | Think block is 40% Chinese in the middle of English-trained corpus | Add language consistency penalty (w_lang = 0.5, threshold 10%) |
| Rambling self-talk to inflate length bonus | 612 tokens of 'let me think... still thinking... hmm' | Saturated length bonus + soft repetition penalty above ramble_threshold |
| Multiple \boxed{} with different values | Model hedges: \boxed{2}, or maybe \boxed{3} — first-match accepts the guess | Match the LAST \boxed{} in the answer block, not the first |
| EOS-token suppression | Model never emits </think>, gets cut off by max_new_tokens, scores 0 | Make truncated rollouts auto-score 0 regardless of partial format match |
| Non-ASCII answer digits | Model writes \boxed{２} (full-width 2); exact-match fails | Unicode NFKC normalisation on both sides of the comparison |
| Late chain-of-thought injection | Model writes the answer FIRST then <think> after; first-match logic accepts it | Require <think> to come BEFORE <answer> in the regex (TAG_RE handles this if you write it right) |
Notice the symmetry with section 14.4's rule-based reward hacks. The fixes are still single-line regex patches. The difference is that reasoning rollouts are 10x longer, so each hack costs more compute to discover and reproduce. In practice you find them by sampling 100 high-reward rollouts every 500 training steps and reading them by hand. There is no automated substitute.
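Several of the patches in the table really are a line or two each. The sketch below covers four of them (last-match boxed extraction, NFKC normalisation, the empty or answer-copying think block, and tag ordering); the function names are hypothetical and the 8-token minimum is the figure from the table, counted here with a whitespace split rather than a real tokenizer.

```python
import re
import unicodedata

BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def last_boxed(answer_block):
    """Take the LAST boxed value, after NFKC-normalising full-width characters."""
    values = BOXED_RE.findall(unicodedata.normalize("NFKC", answer_block))
    return values[-1].strip() if values else None


def think_is_substantive(rollout, min_tokens=8):
    """Reject empty think blocks and think blocks that just restate a boxed answer."""
    m = THINK_RE.search(rollout)
    if m is None:
        return False
    body = m.group(1)
    if BOXED_RE.search(body):
        return False
    return len(body.split()) >= min_tokens


def think_precedes_answer(rollout):
    """Require the think block to open before the answer block does."""
    think_at, answer_at = rollout.find("<think>"), rollout.find("<answer>")
    return 0 <= think_at < answer_at


print(last_boxed(r"\boxed{2}, or maybe \boxed{3}"))  # 3
print(last_boxed("\\boxed{\uff12}"))  # 2
```

Each check returns a plain value or boolean, so it can gate the format reward without changing how the accuracy reward is computed.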
The eval-vs-train reward gap
One reasoning-specific failure that does not appear in section 14.4: training reward goes up while eval accuracy stalls. The policy is exploiting some structural feature of the training prompts that doesn't appear on eval prompts. Common culprits:
- Numeric distribution mismatch. Training truths are mostly small integers; the model learns to guess 0–10 with high prior and the accuracy reward rewards it. Eval has fractions and decimals. Fix: stratify training corpus by answer type.
- Prompt-template leakage. Training prompts all end with "Put your answer in \boxed{}"; the model learns to grep for that specific token. Fix: vary the prompt template at training time.
- Verifier-specific format. The model learns to emit answers in a format only your verifier accepts. Cross-eval with a different verifier to detect.
The take-away of this section is the same as the lesson of every other section in this chapter: GRPO does not give you a free pass on reward design. Dropping the critic saves 7B parameters of memory; it does not save you from having to think about which behaviours your reward rewards. A reasoning model is the most legible artefact a reward function can produce, because the policy literally writes its strategy down in the think block. Read those think blocks. They will tell you what you actually rewarded — not what you thought you were rewarding.
The next section surveys GRPO variants — DAPO, Dr.GRPO, and Olmo 3's extensions — which mostly change how advantages are computed and clipped, not what the reward measures. With the reasoning reward in your toolbox you now have the harder half of the GRPO design surface; the variants tune the easier half.