Boo-AI — Master Artificial Intelligence by Building from Scratch

The Real Problem: You Can Only See the Outcome

The previous section gave us GRPO's hyperparameters. The section before that derived the algorithm. What no derivation tells you is what to put in the slot labelled $R(s, a)$ . For chat models the answer is easy — wire in a neural reward model trained on preferences. For a reasoning model, the answer is much harder, and it is where the entire personality of the trained model is decided.

Here is the asymmetry that defines this section. A reasoning rollout looks like this:

<think> …several hundred tokens of intermediate work, false starts, corrections, possibly the model talking to itself… </think><answer>\boxed{42}</answer>

You can verify the answer trivially — extract the boxed value, compare to the truth, done. You cannot verify the thinking without re-running it inside a stronger reasoner, which defeats the purpose. So the entire reasoning trajectory is process you want to shape using only outcome you can observe.

The reasoning-reward paradox: you want to teach the model to think carefully, but you can only grade what it says at the end. Every reward signal you design is an indirect bet that optimising the observable surface (answer correctness + structure + length + language) will pull the unobservable process (genuine deliberation) along with it.

This is not a problem that goes away with more data. It is a structural property of reasoning RL: the reward is sparse, the signal arrives only at the end of a long generation, and every reward design choice tilts the model toward a different compromise between "careful thinking", "fast guessing", "rambling stalling", and "confident wrongness".

What this section answers: given the sparse, outcome-only nature of the signal, which reward components actually elicit reasoning, which ones encourage hacks, and which ones are pure overhead. The R1-Zero recipe is the empirical ground truth: two reward components, no preference data, and a 671B model whose AIME 2024 pass@1 climbed from 15.6% to 71.0% (86.7% with majority voting). We will rebuild that reward from first principles.

Intuition: Reasoning Emerges from Pressure, Not Imitation

There are two philosophies for getting reasoning out of an LLM. The first is imitation: collect a dataset of carefully-annotated reasoning chains and SFT the model on them. The second is pressure: pick a reward function that scores final answers, let the model sample many attempts per prompt, push up the log-probability of the attempts that worked, and let reasoning emerge as the strategy the policy discovers to maximise reward.

Pressure is what R1-Zero proved possible. The base model had never been shown a single example of explicit chain-of-thought formatted in $\langle\mathrm{think}\rangle$ tags. Within 1500 training steps, the policy was producing well-structured think blocks of growing length, and within 8000 steps those think blocks contained genuine self-correction ("wait, let me check that") and AIME-level mathematical reasoning. Nobody taught the model to think. The reward function rewarded outcomes, and thinking emerged as the cheapest way to hit those outcomes.

The metaphor: imagine training a high-jumper. You can show them video of the Fosbury Flop and ask them to copy it (imitation). Or you can raise the bar and pay them when they clear it (pressure). Pressure produces unexpected techniques — the actual Fosbury Flop was invented this way. For reasoning, R1's "aha moment" pattern — where the model spontaneously interrupts itself to reconsider — is the equivalent: a strategy nobody designed, that the reward function accidentally selected for.

When pressure beats imitation: when the policy can spend more inference compute on a problem than the human labeller did, imitation data stops being the ceiling. The policy can find better reasoning patterns than the ones humans show it, provided the reward function only cares about outcomes. This is the same reason AlphaGo's self-play surpassed training on human games.

For the rest of this section we will design rewards on the pressure side of the dichotomy. We will not assume any chain-of-thought data exists. The only inputs to our reward are the prompt, the rollout, and the ground-truth answer.

The Math: A Composite Reward for a Process You Can't Directly Score

Let $s$ be a prompt with known ground truth $y^\star$ . Let $a$ be a rollout drawn from the policy: a string containing some reasoning trace and a final answer. We are constructing a function $R(s, a, y^\star) \in \mathbb{R}$ from observable pieces of $a$ .

We will build it as a weighted sum of independent component rewards:

$R(s, a, y^\star) = w_{\mathrm{acc}}\, r_{\mathrm{acc}}(a, y^\star) + w_{\mathrm{fmt}}\, r_{\mathrm{fmt}}(a) + w_{\mathrm{len}}\, r_{\mathrm{len}}(a) - w_{\mathrm{lang}}\, p_{\mathrm{lang}}(a)$

Each piece has a precise definition.

Accuracy reward

$r_{\mathrm{acc}}(a, y^\star) = \mathbb{1}\big[\mathrm{extract\_box}(a) = y^\star\big]$

The indicator is 1 if the regex-extracted boxed value matches the ground truth, else 0. This is the only signal that ties the model to objective correctness; with $w_{\mathrm{acc}} = 0$ the model has no incentive to be right.

Format reward

$r_{\mathrm{fmt}}(a) = \mathbb{1}\big[a \text{ matches the } \langle\mathrm{think}\rangle\ldots\langle/\mathrm{think}\rangle\langle\mathrm{answer}\rangle\ldots\langle/\mathrm{answer}\rangle \text{ schema}\big]$

A regex check. With $w_{\mathrm{fmt}} = 0$ the model has no reason to keep the structure, and R1's emergent thinking would not happen — the model would just write the answer.

Length reward (saturated, optional)

$r_{\mathrm{len}}(a) = \frac{\min(L_{\mathrm{think}}(a),\, L_{\mathrm{cap}})}{L_{\mathrm{cap}}}$

where $L_{\mathrm{think}}$ is the think-block length in tokens and $L_{\mathrm{cap}}$ is the saturation cap (e.g. 256). This is a piecewise-linear ramp from 0 to 1, capped so that extra-long rollouts get no extra reward. R1-Zero used $w_{\mathrm{len}} = 0$ — they let length emerge naturally from accuracy pressure — but several downstream models add a small positive weight to accelerate the emergence.

Language consistency penalty

$p_{\mathrm{lang}}(a) = \mathbb{1}\big[\text{think body mixes scripts above a threshold}\big]$

A subtractive penalty. R1-Zero did not have this. R1 added it because R1-Zero's thinking spontaneously became bilingual (English+Chinese in the same trace) — a perfectly valid reward-maximising strategy that humans found unreadable. The penalty fixes the readability without touching the underlying math ability.

Putting it together with GRPO

Inside a GRPO group of $G$ rollouts of the same prompt, the advantage of rollout $i$ is:

$A_i = \frac{R(s, a_i, y^\star) - \mu_G}{\sigma_G + \varepsilon}, \quad \mu_G = \tfrac{1}{G}\sum_j R_j,\quad \sigma_G^2 = \tfrac{1}{G}\sum_j (R_j - \mu_G)^2$

and the surrogate loss for the policy update is the standard PPO clip from section 15.2.

The mathematical content of the section is not in the surrogate. It is in the choice of which components live inside $R$ . The rest of the section is about why each component is there and what happens to the trained model when you change its weight.
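To see how the group advantage feeds the policy update, here is a minimal pure-Python sketch, not the implementation from section 15.2. It assumes per-token log-probabilities are already available as nested lists. The names `grpo_advantages` and `clipped_surrogate`, the clip value 0.2, and the token-level averaging are illustrative assumptions, and the KL term is left out.

```python
import math
import statistics
from typing import Sequence

def grpo_advantages(rewards: Sequence[float], eps: float = 1e-6) -> list[float]:
    """A_i = (R_i - mu_G) / (sigma_G + eps), using the population std (divide by G)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate(
    logp_new: Sequence[Sequence[float]],   # per-rollout, per-token log-probs (current policy)
    logp_old: Sequence[Sequence[float]],   # same shape, from the sampling policy
    advantages: Sequence[float],           # one scalar per rollout
    clip: float = 0.2,                     # assumed clip range
) -> float:
    """PPO-style clipped objective, averaged over tokens and negated into a loss.

    Every token of rollout i shares the rollout-level advantage A_i.
    """
    total, n_tokens = 0.0, 0
    for new_i, old_i, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_i, old_i):
            ratio = math.exp(lp_new - lp_old)
            clipped = max(1.0 - clip, min(1.0 + clip, ratio))
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    return -total / max(n_tokens, 1)
```

Because each token is weighted by the same $A_i$, the only place reward design enters this loss is through the rewards passed to `grpo_advantages`.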

Outcome Rewards vs Process Rewards

Every reasoning-reward design has to take a side on one big question: do you reward only the final outcome (the boxed answer), or do you also score intermediate steps of the reasoning?

Approach	What you score	Cost	Risk
Outcome Reward Model (ORM) — what R1 used	Only the final answer (regex + exact match)	Nearly zero (microseconds per rollout)	Sparse signal; long rollouts have credit-assignment problems
Process Reward Model (PRM) — Let's Verify Step-by-Step	Each reasoning step scored by a separate model	Big — needs a step-annotated dataset and a 7B+ verifier	PRM hallucinates step quality; reward-hackable; expensive
Hybrid: ORM + a small PRM term	Outcome reward + bonus for steps PRM rates as 'plausible'	Medium — PRM is loaded but only consulted sparsely	Best of both, hardest to debug; rarely used in production

For a long time the field assumed PRMs were the future. The OpenAI "Let's Verify Step-by-Step" paper showed PRMs beat ORMs on math eval at fixed inference budget. The conventional wisdom was: spend the labelling money on per-step annotations, train a PRM, use it as the reward signal during RL.

R1 broke that assumption. R1-Zero used a pure ORM (boxed exact-match) and produced a stronger reasoner than any PRM-based method to date. Three reasons:

PRMs are reward-hackable. A neural model that scores reasoning steps is itself a neural model with cracks. The policy finds and exploits them in the same way it would a regular RM. At scale this is a worse failure mode than the credit-assignment problem of ORMs, because PRM hacks look like "reasoning" — they fool you, then the model fails on eval.
GRPO solves the credit-assignment problem implicitly. The standard objection to ORMs is "the model can't tell WHICH token in a 2000-token trace mattered". GRPO's answer: it doesn't need to. With G=16 rollouts of the same prompt, the gradient updates ALL tokens of all positively advantaged rollouts equally. The signal averages over enough rollouts that good reasoning patterns end up reinforced even without per-step credit.
ORMs are deterministic. The same rollout today and a month from now gets the same reward. PRMs drift with retraining. Determinism is what lets you debug an 8000-step training run by replaying rewards.

The PRM trap: teams that fall in love with PRM rewards build them, train against them, ship the model, and find at eval time that the policy aced PRM-graded math but flunked unseen math. The PRM was a leaky abstraction. Outcome rewards force the policy to actually solve the problem.

The empirical rule from R1, DAPO, Olmo, Qwen-2.5-Math, and the rest of the post-2024 reasoning models is: start with an ORM. Add PRM-style components only if outcomes alone plateau, and even then keep their weight small (sub-0.1) so their hacks cannot dominate.

Length Dynamics: The Aha Moment and Its Dark Twin

Plot mean rollout length against training steps for any successful reasoning-RL run and you see the same shape: a flat line for the first few hundred steps, then a sharp upward bend. R1's curve goes from ~150 tokens at step 500 to ~1500 tokens at step 8000, with eval accuracy climbing in lockstep. This is the "aha moment" — the point where the policy discovers that thinking longer wins more rewards.

It is also the point where reward hacking begins. Length growth is good only when it is purposeful. The same gradient that encourages long, correct reasoning will, if you let it, encourage long noise. Three observed failure modes:

Self-talk padding: the model fills the think block with "Let me think… still thinking… hmm…" for hundreds of tokens before producing the same answer it would have produced in 50. Reward gains: none. Length: grows linearly.
Repeated-self-correction: the model finds one correct path, then writes 20 variations of "let me double check" that re-derive the same fact. Genuinely helpful on hard problems; pure waste on easy ones.
Off-topic excursion: the model digresses into an unrelated lemma, proves it, and circles back. Sometimes useful for the actual problem, sometimes pure entropy.

R1-Zero used $w_{\mathrm{len}} = 0$ — no explicit length bonus — and still produced the aha-moment length growth, because longer correct answers exist and the accuracy reward already loves them. The equilibrium length emerged from accuracy pressure alone.

Several later papers (notably DAPO) added a small explicit length bonus to speed up the aha moment — train in 4000 steps instead of 8000. The catch is that any positive length bonus needs a paired anti-rambling cap, or the policy learns to pad before it learns to think. The cap has two common forms:

Anti-rambling mechanism	How it works	Used by
Saturated length bonus	r_len = min(L, L_cap)/L_cap — flat above L_cap, so extra tokens are free but unrewarded	DAPO, Qwen-2.5-Math
Soft repetition penalty	Subtract a small reward when n-gram repetition density exceeds threshold	Olmo 3, DeepSeek-V3 chat RL
Hard max_new_tokens cap	Tokenizer-level truncation at e.g. 4096; rollouts hitting the cap auto-score 0	Universal — the safety net underneath everything else
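The soft repetition penalty in the table can be sketched in a few lines. This is an illustrative version only: the trigram window, the 0.3 density threshold, and the 0.2 weight are assumed values, and `repetition_density` and `repetition_penalty` are hypothetical names rather than functions from any of the cited models.

```python
from collections import Counter

def repetition_density(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    repeats = sum(c - 1 for c in counts.values())  # every occurrence after the first
    return repeats / len(grams)

def repetition_penalty(text: str, threshold: float = 0.3, weight: float = 0.2) -> float:
    """Non-positive reward term: -weight above the threshold, 0.0 otherwise."""
    return -weight if repetition_density(text) > threshold else 0.0

padded = "Let me think... " * 60
clean = "Factor the quadratic as (x-2)(x-3)=0 so the roots are 2 and 3"
print(repetition_density(padded), repetition_penalty(padded))
print(repetition_density(clean), repetition_penalty(clean))
```

A padded think block is dominated by repeated trigrams, while a short derivation has none, so the two land on opposite sides of any reasonable threshold.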

The interactive widget below lets you turn on "naïve length bonus" with and without the repetition penalty — the difference in which rollout gets the top advantage is exactly the difference between a model that thinks longer and a model that just talks more.

Language Consistency: R1's Hidden Reward

R1-Zero's released paper has a famous footnote: midway through training, the team noticed the model's think blocks had become spontaneously bilingual. A typical rollout would start in English ("Let me factor…"), switch to Chinese for the actual reasoning ("因式分解 (x-2)(x-3)=0…"), and end in English ("…so the answer is 2"). Accuracy was unaffected. Human readers were horrified.

Why did this happen? The base model was multilingual; the reward function only graded the final boxed answer, not the language of the thinking. The policy correctly inferred that Chinese tokens have lower entropy for mathematical reasoning (because of the training corpus distribution), and started using whichever language produced the highest $P(\text{correct answer})$ in expectation. Bilingual reasoning was a reward-maximising behaviour. The reward function had no veto.

R1 (not R1-Zero) added a language consistency reward component:

$p_{\mathrm{lang}}(a) = \mathbb{1}\big[\,\text{minority-script tokens} / \text{total tokens} > \tau\,\big], \quad \tau \approx 0.10$

with a weight $w_{\mathrm{lang}} \approx 0.5$ . The penalty is binary — either the think block is consistent or it isn't. A smooth penalty would reward being "9% mixed" over "11% mixed", which is the wrong gradient direction; you want a cliff.

The general lesson: every "quality of life" feature of a reasoning model that is not captured by the accuracy reward needs its own term. Format, language, repetition, refusal, safety: each one a small $r_i$ , each one with a weight in the composite. The policy will fight every single one of them by finding a reward-maximising loophole. Reward design at scale is whack-a-mole, and the mole is the policy.

R1's ablation: removing the language penalty restored bilingual rollouts within 200 steps. Adding it back cleared them in another 300 steps. The cost in accuracy was small: the R1 paper reports only a slight degradation in exchange for readable traces. This is a good example of a cheap quality-of-life reward, paid for in microseconds of regex-counting.

Difficulty-Aware Shaping: Reward Curriculum

A subtle property of GRPO with rule rewards: the gradient signal on a prompt depends on the spread of rewards across the group, and it vanishes entirely when that spread is zero. Two pathological cases:

All G rollouts succeed: $\sigma_G = 0$ , advantages all zero, gradient zero. The prompt is too easy — no learning.
All G rollouts fail: same $\sigma_G = 0$ , same zero gradient. The prompt is too hard — no learning.

Useful learning only happens on prompts where $1 \leq \text{success count} \leq G - 1$ — the policy's frontier. This gives us a free curriculum signal: track the per-prompt mean reward across training, drop prompts that saturate (always 0 or always 1), upsample prompts in the 30–70% success-rate band.
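The two degenerate cases are easy to check numerically. The sketch below assumes binary success flags per rollout; `group_signal` and `advantages` are hypothetical helper names used only for illustration.

```python
import statistics

def group_signal(successes: list[int]) -> str:
    """Classify a GRPO group by whether it can yield a non-zero advantage."""
    g, k = len(successes), sum(successes)
    if k == 0:
        return "too_hard"    # every rollout failed: sigma_G = 0
    if k == g:
        return "too_easy"    # every rollout succeeded: sigma_G = 0
    return "frontier"        # 1 <= k <= G - 1: useful gradient

def advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages with the population std."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

for flags in ([1] * 8, [0] * 8, [1, 0, 0, 0, 1, 0, 0, 0]):
    adv = advantages([float(f) for f in flags])
    print(group_signal(flags), [round(a, 2) for a in adv])
```

Only the mixed group produces non-zero advantages; the saturated groups contribute nothing to the update no matter how many rollouts they contain.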

Three difficulty-aware shaping techniques used in production:

Technique	Mechanism	Effect
Online filter	Skip prompts where last-N-step mean reward > 0.95 or < 0.05	Removes 'no learning' prompts mid-training; ~2x sample efficiency
Difficulty-stratified batching	Construct each training batch with a target reward histogram	Stabilises σ_G across batches; smoother gradient norms
Curriculum interpolation	Start with easy prompts, shift to harder as accuracy rises	R1, Qwen-2.5-Math both use a soft version; difficulty band slides up over training

None of these change the per-rollout reward function — they change which prompts see the reward function. But they belong in this section because in practice the "reasoning reward" you ship is the reward function plus the prompt-selection policy, and the two are almost as load-bearing as each other.
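As one possible shape for the online filter in the table, the sketch below keeps a sliding window of per-prompt success rates and skips prompts that have saturated. The class name `OnlinePromptFilter`, the window size, and the minimum-observation count are assumptions; only the 0.05 and 0.95 thresholds come from the table.

```python
from collections import defaultdict, deque

class OnlinePromptFilter:
    """Skip prompts whose recent mean success rate is saturated near 0 or 1."""

    def __init__(self, window: int = 8, low: float = 0.05,
                 high: float = 0.95, min_obs: int = 4) -> None:
        self.low, self.high, self.min_obs = low, high, min_obs
        self.history: dict[str, deque] = defaultdict(lambda: deque(maxlen=window))

    def update(self, prompt_id: str, success_rate: float) -> None:
        """Record the fraction of correct rollouts in the latest group for this prompt."""
        self.history[prompt_id].append(success_rate)

    def keep(self, prompt_id: str) -> bool:
        """True if the prompt should stay in the training pool."""
        h = self.history.get(prompt_id, ())
        if len(h) < self.min_obs:
            return True                      # not enough evidence to drop it yet
        mean = sum(h) / len(h)
        return self.low <= mean <= self.high

flt = OnlinePromptFilter()
for _ in range(4):
    flt.update("easy", 1.0)
    flt.update("hard", 0.0)
    flt.update("frontier", 0.5)
print({p: flt.keep(p) for p in ("easy", "hard", "frontier", "unseen")})
```

Feeding the filter the accuracy component rather than the full reward keeps the thresholds meaningful when format or length terms shift the reward scale.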

Manual Numerical Walkthrough

We compute one GRPO step on a single math prompt with G=6 rollouts, scoring each under two different reward designs to see how the choice changes the gradient direction.

Same six rollouts, two reward designs, two completely different gradient signals

Setup. Prompt $s$ = "Solve $x^2 - 5x + 6 = 0$ . Smaller root in $\boxed{}$ ." $y^\star = 2$ . Six rollouts spanning the qualitative behaviour space:

i	Description	acc	fmt	L_think	mixed-lang
1	Short, correct, well-formatted	1	1	18	0
2	Long thinking, correct (aha)	1	1	142	0
3	Rambling, correct	1	1	612	0
4	Correct, no tags	1	0	0	0
5	Wrong, perfect format + long	0	1	84	0
6	Correct, bilingual	1	1	22	1

Design A — R1-Zero recipe. $w_{\mathrm{acc}} = 1.0,\, w_{\mathrm{fmt}} = 0.1,\, w_{\mathrm{len}} = 0,\, w_{\mathrm{lang}} = 0$ .

i	acc	fmt	R
1	1.00	0.10	1.10
2	1.00	0.10	1.10
3	1.00	0.10	1.10
4	1.00	0.00	1.00
5	0.00	0.10	0.10
6	1.00	0.10	1.10

$\mu_G = (1.10 \times 4 + 1.00 + 0.10)/6 \approx 0.917$ , $\sigma_G \approx 0.367$ (population standard deviation, dividing by $G$ ).

Advantages $A_i = (R_i - \mu_G)/\sigma_G$ :

i	R	R-μ	A
1	1.10	+0.183	+0.50
2	1.10	+0.183	+0.50
3	1.10	+0.183	+0.50
4	1.00	+0.083	+0.23
5	0.10	-0.817	-2.22
6	1.10	+0.183	+0.50

Reading Design A: rollout 5 (wrong answer) gets the only strongly negative advantage. Rollouts 1, 2, 3, 6 all tie at +0.50. Rollout 3 (rambling, 612 tokens) and rollout 6 (bilingual) get the same positive advantage as rollout 1 (the ideal). Under this reward design, the policy gradient does nothing to discourage rambling or language mixing. R1-Zero's actual behavioural collapses came from exactly this.

Design B — R1 with anti-hack components. $w_{\mathrm{acc}} = 1.0,\, w_{\mathrm{fmt}} = 0.1$ , repetition penalty on, $w_{\mathrm{lang}} = 0.5$ .

i	acc	fmt	len	lang	R
1	1.00	0.10	0.00	0.00	1.10
2	1.00	0.10	0.00	0.00	1.10
3	1.00	0.10	-0.16	0.00	0.94
4	1.00	0.00	0.00	0.00	1.00
5	0.00	0.10	0.00	0.00	0.10
6	1.00	0.10	0.00	-0.50	0.60

$\mu_G \approx 0.807,\, \sigma_G \approx 0.358$ .

i	R	R-μ	A
1	1.10	+0.293	+0.82
2	1.10	+0.293	+0.82
3	0.94	+0.133	+0.37
4	1.00	+0.193	+0.54
5	0.10	-0.707	-1.97
6	0.60	-0.207	-0.58

Reading Design B: the picture is now ordered. Rollouts 1 and 2 (ideal: well-formatted, correct, non-rambling) get the strongest positive advantage. Rollout 4 (correct but no format) gets a smaller positive advantage — enough to reward correctness, not enough to beat a properly-formatted rollout. Rollouts 3 (rambling) and 6 (bilingual) drop to weakly-positive or weakly-negative — the gradient now actively discourages them. Rollout 5 (wrong) keeps its strong negative advantage. The gradient is monotonic in quality, which is exactly the property you want.

Same six rollouts. Same prompt. Same algorithm. Different reward design. The policy gradient points in two completely different directions. This is the entire reason reward design is the highest-leverage knob in reasoning RL.
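The hand arithmetic in this walkthrough is easy to check mechanically. The short script below recomputes the group statistics directly from the two R columns, using the population standard deviation from the advantage formula; `group_advantages` here is a local helper written for this check.

```python
import statistics

def group_advantages(rewards: list[float]) -> tuple[float, float, list[float]]:
    """Return (mu_G, sigma_G, advantages) for one GRPO group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)      # population std: divide by G, not G - 1
    return mu, sigma, [(r - mu) / sigma for r in rewards]

design_a = [1.10, 1.10, 1.10, 1.00, 0.10, 1.10]   # R column, Design A
design_b = [1.10, 1.10, 0.94, 1.00, 0.10, 0.60]   # R column, Design B

for name, rewards in (("A", design_a), ("B", design_b)):
    mu, sigma, adv = group_advantages(rewards)
    print(f"Design {name}: mu={mu:.3f} sigma={sigma:.3f}")
    for i, a in enumerate(adv, start=1):
        print(f"  rollout {i}: A = {a:+.2f}")
```

Swapping `pstdev` for `stdev` (the sample standard deviation) rescales every advantage by the same factor and leaves the ranking unchanged, which is why the choice rarely matters in practice.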

Interactive: Compose a Reasoning Reward

Drag the four weight sliders and toggle the repetition penalty. Each change re-scores the six rollouts and recomputes the GRPO group advantages live. Try four exercises:

Click the R1-Zero preset. Notice rollouts 1, 2, 3, 6 all tie. The format hack, the length hack, and the bilingual rollout are invisible to the reward.
Click the R1 full preset. Notice the language penalty drops rollout 6's advantage below zero and the repetition penalty drops rollout 3's. The gradient is now ordered.
Set $w_{\mathrm{acc}} = 0$ and $w_{\mathrm{fmt}} = 1$ (the "Format only" preset). Notice rollout 5 (well-formatted wrong) now ties for the highest advantage with the correct, well-formatted rollouts: the reward can no longer tell right from wrong. This is the collapsed-to-format failure mode every reasoning RL run is one bad weight away from.
Set $w_{\mathrm{len}} = 0.3$ with the repetition penalty OFF. Notice rollout 3 (the rambler) climbs to the top. Turn the penalty back ON; rollout 2 (purposeful long think) wins instead. The penalty is the difference between teaching the model to think and teaching it to talk.


The bottom "Behavioural diagnosis" panel summarises which failure mode each weight configuration is one training run away from. In production this is exactly the kind of analysis you do before burning $100k of cluster time — at the cost of three minutes with a spreadsheet.
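The spreadsheet-style check can also be done in a few lines of Python. The sketch below re-scores the six walkthrough rollouts under several weight presets. The rollout features come from the setup table; the preset dictionary, the function names, and the modelling of the repetition penalty as the soft rambling rule (up to -0.3 above 400 think tokens) are assumptions made for illustration, not a description of the widget's internals.

```python
import statistics
from dataclasses import dataclass

@dataclass(frozen=True)
class Rollout:
    name: str
    acc: int
    fmt: int
    think_len: int
    mixed: int

ROLLOUTS = [
    Rollout("1 short correct", 1, 1, 18, 0),
    Rollout("2 long correct", 1, 1, 142, 0),
    Rollout("3 rambling correct", 1, 1, 612, 0),
    Rollout("4 correct, no tags", 1, 0, 0, 0),
    Rollout("5 wrong, formatted", 0, 1, 84, 0),
    Rollout("6 correct, bilingual", 1, 1, 22, 1),
]

def score(r: Rollout, w_acc: float, w_fmt: float, w_len: float, w_lang: float,
          rep_penalty: bool, len_cap: int = 256, ramble: int = 400) -> float:
    """Composite reward for one rollout under the given weights."""
    total = w_acc * r.acc + w_fmt * r.fmt
    total += w_len * min(r.think_len, len_cap) / len_cap      # saturated length bonus
    total -= w_lang * r.mixed                                  # binary language penalty
    if rep_penalty and r.think_len > ramble:                   # assumed rambling rule
        total -= 0.3 * min(1.0, (r.think_len - ramble) / ramble)
    return total

def ranked_advantages(**weights) -> list[tuple[Rollout, float]]:
    """Rollouts paired with their group advantage, best first."""
    rewards = [score(r, **weights) for r in ROLLOUTS]
    mu, sigma = statistics.fmean(rewards), statistics.pstdev(rewards)
    adv = [(x - mu) / (sigma + 1e-6) for x in rewards]
    return sorted(zip(ROLLOUTS, adv), key=lambda p: p[1], reverse=True)

PRESETS = {
    "R1-Zero": dict(w_acc=1.0, w_fmt=0.1, w_len=0.0, w_lang=0.0, rep_penalty=False),
    "R1 full": dict(w_acc=1.0, w_fmt=0.1, w_len=0.0, w_lang=0.5, rep_penalty=True),
    "Format only": dict(w_acc=0.0, w_fmt=1.0, w_len=0.0, w_lang=0.0, rep_penalty=False),
    "Length bonus, no penalty": dict(w_acc=1.0, w_fmt=0.1, w_len=0.3, w_lang=0.0, rep_penalty=False),
    "Length bonus + penalty": dict(w_acc=1.0, w_fmt=0.1, w_len=0.3, w_lang=0.0, rep_penalty=True),
}

for preset, weights in PRESETS.items():
    print(preset)
    for rollout, a in ranked_advantages(**weights):
        print(f"  {rollout.name:<22} A = {a:+.2f}")
```

Reading the ranking for each preset before launching a run is a cheap way to catch a weight configuration that rewards the wrong rollout.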

Plain Python: The Reasoning Reward Bundle

The whole reward function fits in a dataclass and 30 lines of body. No model, no GPU, no torch — every piece is pure regex-and-counting Python. Read it as the complete training-time reward signal of an R1-Zero-style run.

Reasoning reward bundle: accuracy + format + length + language penalty + GRPO advantages

reasoning_reward.py

Two regexes carry the entire format prior

TAG_RE checks both the structure and the order (<think> before <answer>); BOXED_RE pulls the boxed value out of the answer block. Together these two patterns are the only structural constraint the model ever sees. Everything we call 'reasoning' below is a consequence of optimising against these regexes plus an exact-match check, nothing more.

EXECUTION STATE

TAG_RE = captures (think_body, answer_body) in order

BOXED_RE = captures the boxed value

_is_mixed_script: the language-consistency detector

A short heuristic does the job: count CJK codepoints, count Latin letters, and call it 'mixed' if the minority script is more than 10% of the union. This is exactly the kind of cheap, deterministic check that makes a rule-based reward viable at GRPO scale: microseconds per rollout, no model loaded. The threshold (10%) is empirical: R1-Zero's bilingual rollouts had ~25% Chinese mixed into otherwise-English think blocks, so 10% catches them cleanly without false-flagging single names or LaTeX symbols.

EXECUTION STATE

cjk = count of CJK Unified Ideographs (U+4E00 to U+9FFF); kana and Hangul are not counted

latin = count of ASCII letters

return = True if minority script > 10% of union

ReasoningReward: a dataclass is the entire reward 'model'

Six hyperparameters wrap the whole reasoning-reward design space. Notice what is NOT in there: no neural net, no learned values, no preference data. Every knob is human-set. This is why R1 ablations are so clean to read in the paper: you can fix all but one knob and sweep the remaining one in a few hours. With a 7B reward model the same sweep would take weeks and far more compute.

EXECUTION STATE

w_acc = outcome weight; 1.0 in R1-Zero

w_fmt = format weight; 0.1 in R1-Zero, small but load-bearing

w_len = length-bonus weight; 0.0 in R1-Zero

w_lang = language-mix penalty; 0.5 in R1, added AFTER R1-Zero's bilingual collapse

len_cap = 256 tokens; the length bonus saturates here

ramble_threshold = 400 tokens; above this the soft rambling penalty starts to apply

Early return when the format regex fails

If TAG_RE does not match, we return R=0 with all components zero, so a correct answer without tags earns nothing here. That is stricter than the walkthrough table above, where the untagged rollout 4 still collected its accuracy reward. Notice we do not push format violations BELOW zero: a format failure scores exactly 0 (only the language and rambling penalties can take a reward negative). This matters for GRPO: when the group includes a no-format rollout it pulls μ down, which makes well-formatted rollouts get LARGER advantages, which makes the format gradient sharper. The penalty is implicit in the group mean, not in the per-rollout score.

EXECUTION STATE

return = 0 reward when format fails; implicit penalty via μ

n_think: tokenising by whitespace is fine

We use len(think.split()) instead of a real tokenizer because the length bonus only needs ordinal information: longer think = more reward, up to the cap. A precise count would be subtoken-accurate but 30x slower per rollout, and the saturation function is flat above 256 anyway. The cheap approximation is good enough because the saturation curve is monotonic in 'roughly how long was it'.

EXECUTION STATE

n_think = whitespace-token count of the <think> body

Accuracy: a strict string compare after .strip()

Plain == is the right choice 90% of the time. For competition math you wire in a math-aware normaliser (the MATH benchmark ships with one; it canonicalises 1/2 vs 0.5 vs \frac{1}{2}). Choosing the normaliser is one of the highest-leverage decisions in the whole pipeline: a too-strict normaliser undertrains the model; a too-loose one rewards near-misses and the policy gets sloppy.

EXECUTION STATE

acc = 0 or 1 — exact match after strip()
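To illustrate what a math-aware normaliser does, here is a deliberately small sketch built on the standard library's `fractions.Fraction`. It is not the normaliser that ships with the MATH benchmark: `normalise_answer` and `answers_match` are hypothetical names, and the only LaTeX it understands is a simple `\frac`, `\dfrac` or `\tfrac` with integer arguments.

```python
import re
from fractions import Fraction

# Matches an optional sign followed by \frac{a}{b}, \dfrac{a}{b} or \tfrac{a}{b}.
_FRAC_RE = re.compile(r"(-?)\\[dt]?frac\{(\d+)\}\{(\d+)\}")

def normalise_answer(ans: str):
    """Map numeric answers to an exact Fraction; leave everything else as a cleaned string."""
    s = ans.strip().replace(" ", "").strip("$")
    m = _FRAC_RE.fullmatch(s)
    if m:
        sign, num, den = m.groups()
        s = f"{sign}{num}/{den}"
    try:
        return Fraction(s)                    # handles "2", "0.5", "1/2", "-3/4"
    except (ValueError, ZeroDivisionError):
        return s                              # symbolic or malformed: compare as text

def answers_match(extracted: str, truth: str) -> bool:
    """Accuracy check that treats 1/2, 0.5 and \\frac{1}{2} as the same answer."""
    return normalise_answer(extracted) == normalise_answer(truth)

print(answers_match("1/2", "0.5"))
print(answers_match(r"\frac{1}{2}", "0.5"))
print(answers_match("2", "3"))
```

Anything the normaliser cannot parse falls back to a plain string comparison, which keeps the check on the strict side: a too-loose fallback is the direction that rewards near-misses.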

Length bonus saturates, then can go negative

min(n_think, len_cap) / len_cap is a piecewise-linear ramp from 0 (no thinking) to 1 (256+ tokens of thinking), scaled by w_len. Above the ramble_threshold (400 tokens) we subtract a soft penalty that grows linearly with the overshoot and saturates at -0.3 once the think block reaches twice the threshold (800 tokens). The penalty is not scaled by w_len, so it applies even when the length bonus is switched off. This is the cheapest way to defeat the 'pad the think block with noise' reward hack. The shape is intentionally asymmetric: rewarding length too aggressively is much more dangerous than rewarding it slightly less than optimal.

EXECUTION STATE

len_pos = ramp in [0, w_len], minus up to 0.3 once n_think > 400

overshoot = fraction by which n_think exceeds the threshold

lang_pen: a binary penalty, not a smooth one

If the script-mix heuristic fires, we deduct w_lang from the reward (so the reward can go negative for mixed-language rollouts). The penalty is binary on purpose: a smooth penalty rewards being '9% mixed' more than '11% mixed', which is exactly the wrong gradient signal. Binary penalty + a single threshold gives the policy a clean cliff to climb away from.

EXECUTION STATE

lang_pen = -w_lang or 0

R = sum of components, never multiplied

Linear addition keeps every component active even when others are zero. A multiplicative reward (acc * fmt * ...) would zero out the gradient on any rollout missing one component. That is bad, because most early rollouts miss most components, and we need a non-zero learning signal exactly there.

EXECUTION STATE

R = scalar in [w_fmt + min(0, w_len - 0.3) - w_lang, w_acc + w_fmt + w_len] when the tags match; exactly 0 on a format failure
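A tiny numerical comparison makes the additive-versus-multiplicative point concrete. The sketch assumes an early-training group in which every rollout is wrong but half of them already use the tag format; the function names and the group itself are made up for illustration.

```python
import statistics

def additive(acc: float, fmt: float, w_acc: float = 1.0, w_fmt: float = 0.1) -> float:
    """Weighted sum: each component contributes on its own."""
    return w_acc * acc + w_fmt * fmt

def multiplicative(acc: float, fmt: float) -> float:
    """Product: any zero component wipes out the whole reward."""
    return acc * fmt

# (acc, fmt) pairs: all answers wrong, two of four rollouts well-formatted.
early_group = [(0, 1), (0, 0), (0, 1), (0, 0)]

def reward_spread(reward_fn) -> float:
    """Population std of the group's rewards; zero means no GRPO gradient."""
    return statistics.pstdev([reward_fn(a, f) for a, f in early_group])

print("additive spread:      ", reward_spread(additive))
print("multiplicative spread:", reward_spread(multiplicative))
```

With the sum, the formatted rollouts stand out and the group still carries a gradient toward the format; with the product, every reward is zero and the group is wasted.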

group_advantages: GRPO's killer trick, two lines of stats

The full GRPO advantage is (R_i - mean) / (std + eps); the code uses the population std and substitutes 1e-6 only when the std is exactly zero. This produces zero-mean, unit-variance advantages inside any group whose rewards are not all identical. The policy gradient sees only RELATIVE rankings (which rollouts of this prompt were better than average), not absolute reward magnitudes. This is what makes the reward design above robust: as long as the COMPONENT WEIGHTS produce the right ranking, the absolute values don't matter.

EXECUTION STATE

mu = group mean of R

sd = population std of R; replaced by 1e-6 when it is exactly 0 to avoid divide-by-zero

return = G floats, mean 0, std 1 whenever the rewards are not all equal

Five rollouts span the rewardable behaviour space

These five strings are not random examples — they are the failure modes the design has to discriminate: (1) ideal, (2) verbose-but-correct, (3) correct-but-no-format, (4) format-but-wrong, (5) bilingual-but-correct. A reward design that gives them advantages in the right order is one you can ship; one that does not, isn't. This is the gut-check unit test you run BEFORE plugging the reward into a 671B training run.

The final two lines are the per-step GRPO core

totals is the reward vector; adv is the advantage vector. In the real training loop these feed directly into the PPO clipped surrogate (you saw the PyTorch version in section 15.2). Everything else above — the regexes, the script detector, the dataclass — is just the work needed to turn a free-text rollout into one float. The full reward pipeline at training time is exactly this code, replicated across G * batch_size rollouts in parallel.

EXECUTION STATE

totals = (G,) floats — raw rewards

adv = (G,) floats — group-relative advantages, fed to PPO clip


"""
Reasoning reward bundle. The whole job of these ~80 lines of Python is to
turn a free-text rollout into a single scalar that:

  1. rewards a correct boxed answer  (outcome signal)
  2. rewards <think>...</think><answer>...</answer> structure (format prior)
  3. mildly rewards substantive thinking up to a saturation cap (length)
  4. penalises mid-stream language switching       (language consistency)
  5. caps and penalises pathological rambling       (anti-reward-hack)

No process supervision. We never score 'did the model use the quadratic
formula' or 'is step 3 valid'. We score only the surface: the answer, the
shape, the length, the language. The reasoning IS the side-effect that
emerges from those four pressures.
"""
from __future__ import annotations
import re
import statistics
from dataclasses import dataclass
from typing import Sequence

# <think> must come before <answer>; [\s\S] lets the bodies span newlines.
TAG_RE   = re.compile(r"<think>([\s\S]*?)</think>\s*<answer>([\s\S]*?)</answer>")
# Captures the contents of \boxed{...} (no nested braces).
BOXED_RE = re.compile(r"\\boxed\{([^}]+)\}")

# Crude script detector: counts CJK + Latin codepoints in the think body.
def _is_mixed_script(text: str) -> bool:
    cjk   = sum(1 for c in text if "\u4e00" <= c <= "\u9fff")
    latin = sum(1 for c in text if c.isalpha() and ord(c) < 128)
    if cjk == 0 or latin == 0:
        return False
    minority = min(cjk, latin)
    return minority / (cjk + latin) > 0.10        # minority script > 10% of the union

@dataclass
class ReasoningReward:
    w_acc:  float = 1.0
    w_fmt:  float = 0.1
    w_len:  float = 0.0           # 0 by default, as in R1-Zero; many variants use > 0
    w_lang: float = 0.5           # mixed-script penalty
    len_cap: int = 256            # tokens of <think> at which length bonus saturates
    ramble_threshold: int = 400   # tokens above which the soft rambling penalty starts

    def __call__(self, rollout: str, truth: str) -> dict:
        m = TAG_RE.search(rollout)
        if not m:
            # Format failure: every component is zero, including accuracy.
            return {"R": 0.0, "acc": 0, "fmt": 0, "len": 0.0,
                    "lang": 0.0, "think_tokens": 0, "extracted": None}
        think, answer = m.group(1), m.group(2)
        # crude tokenisation: whitespace split is the cheapest proxy
        n_think = len(think.split())
        # accuracy: exact string match on the boxed value
        b = BOXED_RE.search(answer)
        extracted = b.group(1).strip() if b else None
        acc = 1.0 if extracted == truth.strip() else 0.0
        # format (the regex matched, so we have tags: give 1)
        fmt = 1.0
        # length bonus, saturating at len_cap
        len_pos = self.w_len * min(n_think, self.len_cap) / self.len_cap
        if n_think > self.ramble_threshold:
            # soft penalty, linear in the overshoot, at most 0.3; not scaled by w_len
            overshoot = (n_think - self.ramble_threshold) / self.ramble_threshold
            len_pos -= 0.3 * min(1.0, overshoot)
        # language consistency: binary penalty
        lang_pen = -self.w_lang if _is_mixed_script(think) else 0.0
        R = self.w_acc * acc + self.w_fmt * fmt + len_pos + lang_pen
        return {"R": R, "acc": acc, "fmt": fmt, "len": len_pos,
                "lang": lang_pen, "think_tokens": n_think, "extracted": extracted}

def group_advantages(rewards: Sequence[float]) -> list[float]:
    """GRPO group-relative advantage. Mean=0, std=1 unless all rewards are equal."""
    if len(rewards) < 2:
        return [0.0] * len(rewards)
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1e-6      # guard against a zero-variance group
    return [(r - mu) / sd for r in rewards]

# --- End-to-end demo on one prompt, G = 5 rollouts ---
truth = "2"     # smaller root of x^2 - 5x + 6 = 0
rollouts = [
    "<think>Factor: (x-2)(x-3)=0. Smaller is 2.</think><answer>\\boxed{2}</answer>",
    "<think>" + ("Let me think... " * 60) + "answer is 2.</think><answer>\\boxed{2}</answer>",
    "The smaller root is 2.",
    "<think>Tried things, guessing 3.</think><answer>\\boxed{3}</answer>",
    "<think>因式分解 (x-2)(x-3)=0, smaller是2</think><answer>\\boxed{2}</answer>",
]
reward_fn = ReasoningReward(w_acc=1.0, w_fmt=0.1, w_len=0.0, w_lang=0.5)
scored = [reward_fn(r, truth) for r in rollouts]
totals = [s["R"] for s in scored]
adv    = group_advantages(totals)

Notice what is not in this file. There is no neural network, no learned parameter, no preference dataset, no PRM, no value function. The entire signal driving the policy toward chain-of-thought is the regex stack and the script counter. That is the empirical core of the "reasoning emerges from pressure" thesis: when the verifier is this thin, the policy has to do the work.

PyTorch: GRPO Step With a Reasoning Reward

Wiring the reward bundle into a real GRPO step is a small change to the standard PPO loop. Two models live on the GPU (policy + frozen reference); the reward bundle runs on the CPU; the only number that crosses back is the scalar $R$.

reasoning_grpo_step: the R1-Zero-style training step

🐍reasoning_grpo_step.py


Five hyperparameters fully specify the R1-Zero recipe

G=16, clip=0.2, kl_coef=0.001, temperature=0.6, max_new_tokens=4096. These are chosen not for performance but for STABILITY of the reasoning-reward signal. The high max_new_tokens lets the model 'use as much think as it wants' so the length bonus has room to bind; the small KL coefficient lets the policy drift further from the reference distribution than typical RLHF allows (R1-Zero needed this because its base model had never seen <think> tags before).

EXECUTION STATE

G = group size = 16 in R1-Zero, 32 in R1, 64 in some Olmo runs

kl_coef = 0.001 in R1-Zero. PPO-RLHF typically uses 0.02–0.05.

temperature = 0.6 — high enough for diverse rollouts in the group, low enough to keep accuracy

max_new_tokens = 4096 — generous, so length-bonus saturation is achievable

max_new_tokens=4096 is the reasoning-specific change

In a standard RLHF setup max_new_tokens is 512 or 1024 because helpful answers are short. For a reasoning model you raise this by 4–8x because the whole point is to let the model write its scratchpad. R1's training-time rollouts had a median of 800 tokens by step 8000; some hit 3000. If your max_new_tokens is too low, the length saturation cap NEVER binds and the length-bonus signal is permanently censored.

EXECUTION STATE

gens = (G, T_max) integer tensor — much wider than RLHF defaults

logp_old = (G, T_max) log-probs from sampling, no grad

scored[i]['R'] is the only number we keep on the GPU

The reward bundle returns a dict for every rollout (acc, fmt, len, lang, think_tokens, …), but only the scalar 'R' enters the policy-gradient calculation. The rest goes to logging. This is the discipline that lets you debug a reasoning-RL run: you can correlate REWARD components against EVAL metrics over training, see e.g. 'format reward saturates at step 200 but length keeps growing till step 1500' — and that pattern is the empirical signature of a healthy R1-Zero training curve.

EXECUTION STATE

rewards = (G,) float32 — the GRPO input

scored = list of dicts — everything else, for logging only

Standardise inside the group, NOT across prompts

(rewards - mean) / std uses the per-prompt group statistics. This is the structural reason GRPO works for reasoning. When every rollout in a group earns the same reward (16/16 correct, or 0/16), σ = 0, every advantage is zero, and the prompt contributes no gradient. When outcomes are mixed, standardisation hands the rare outcome the largest advantage: with a binary reward and 1/16 correct, the lone success gets an advantage near +3.8 and each failure near −0.25; with 14/16 correct, the two failures get near −2.6 and each success near +0.37. The training signal therefore comes only from prompts on the policy's frontier, the ones it solves sometimes but not always.

EXECUTION STATE

adv = (G,); mean=0, std≈1 whenever rewards differ, all zeros when they are identical
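To see what group standardisation does to a binary accuracy reward, the sketch below reuses the group_advantages logic from the reward file on three hypothetical groups of 16 rollouts. The reward vectors are made up for illustration, and the population standard deviation is used, as in that function.

```python
import statistics


def group_advantages(rewards):
    """Standardise rewards within one prompt's group (population std)."""
    if len(rewards) < 2:
        return [0.0] * len(rewards)
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1e-6
    return [(r - mu) / sd for r in rewards]


# Hypothetical groups: 1.0 = correct rollout, 0.0 = incorrect rollout.
cases = {
    "16/16 correct": [1.0] * 16,
    "14/16 correct": [1.0] * 14 + [0.0] * 2,
    "1/16 correct": [1.0] * 1 + [0.0] * 15,
}
for name, rewards in cases.items():
    adv = group_advantages(rewards)
    print(f"{name}: max={max(adv):+.2f} min={min(adv):+.2f}")
```

The all-correct group produces zero advantages, so it contributes nothing to the gradient. In the 1/16 group the lone success gets about +3.87 and each failure about −0.26; in the 14/16 group the two failures get about −2.65 and each success about +0.38.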

PPO clip stops length-related runaway

When a reasoning model finds a high-reward 'long think' pattern, a sequence-level importance ratio would blow up, because exp(sum of per-token log-prob gaps) compounds over thousands of tokens. The ratio here is computed per token, and the clip limits how far any single token's ratio can raise the surrogate, so small per-token gaps never compound into one exploding ratio. This is one of the reasons R1-Zero could tolerate long rollouts at 4096 max tokens: every token saw the same ratio clip, so the update from a 3000-token rollout stayed bounded per token, exactly as it did for a 500-token one.

EXECUTION STATE

ratio = (G, T) per-token importance ratio

pg = scalar; sign convention: minimise the negative surrogate
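A small numeric sketch may make the clip concrete. It evaluates the clipped surrogate for one token in plain Python and then contrasts a per-token ratio with a sequence-level one; the 0.001-nat drift per token and the 3000-token length are illustrative assumptions, not measurements from any run.

```python
import math


def clipped_token_objective(logp_new: float, logp_old: float,
                            adv: float, clip: float = 0.2) -> float:
    """Per-token PPO clipped surrogate: min(r * A, clamp(r, 1-clip, 1+clip) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clamped = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * adv, clamped * adv)


# A token with positive advantage whose probability doubled:
# the objective stops at (1 + clip) * A, so there is no reward for going further.
print(clipped_token_objective(math.log(2.0), 0.0, adv=1.0))

# Compounding: 3000 tokens that each drift by 0.001 nats (assumed numbers).
per_token_gap = 0.001
n_tokens = 3000
token_ratio = math.exp(per_token_gap)
sequence_ratio = math.exp(per_token_gap * n_tokens)
print(f"per-token ratio {token_ratio:.4f}, sequence-level ratio {sequence_ratio:.2f}")
```

Each per-token ratio sits well inside the clip range, while the same drift expressed as one sequence-level ratio is roughly exp(3), far outside it.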

kl_coef=0.001 is the unusually-small R1-Zero choice

Standard PPO-RLHF uses kl_coef ≈ 0.02–0.05 to keep the policy close to a well-aligned reference. R1-Zero's reference is a BASE model (no SFT), so anchoring tightly to it would block the very thing they want: emergent <think> behaviour. The price of a small KL is a larger risk of degenerate rollouts; they pay it because the alternative is no reasoning learnt. The lesson: the KL coefficient is not a universal constant. It must match what the reference distribution represents.

EXECUTION STATE

kl = scalar mean log-prob gap policy vs ref
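The mean log-prob gap is the simplest sample estimate of the KL term, and it can come out negative on a small batch. As a point of comparison, the sketch below puts it next to the non-negative estimator exp(d) − d − 1 with d = logp_ref − logp_new, which is the form used in the original GRPO formulation. The log-prob values are made up, and whether a given run used one estimator or the other is not something this section establishes.

```python
import math


def kl_naive(logp_new, logp_ref):
    """Mean log-prob gap over sampled tokens; can be negative on a small sample."""
    return sum(n - r for n, r in zip(logp_new, logp_ref)) / len(logp_new)


def kl_k3(logp_new, logp_ref):
    """Mean of exp(d) - d - 1 with d = logp_ref - logp_new; never negative."""
    total = 0.0
    for n, r in zip(logp_new, logp_ref):
        d = r - n
        total += math.exp(d) - d - 1.0
    return total / len(logp_new)


# Hypothetical per-token log-probs under the policy and the reference.
logp_new = [-1.0, -2.0, -0.5]
logp_ref = [-0.8, -2.5, -0.9]
print(f"naive: {kl_naive(logp_new, logp_ref):.4f}")
print(f"k3:    {kl_k3(logp_new, logp_ref):.4f}")
```

Both estimators are zero when the two models agree on every token; they differ in how they behave on noisy samples, which matters more when kl_coef is as small as 0.001.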


"""
GRPO step with a reasoning reward. Two GPU-resident models:
  - policy:  the model whose parameters we update.
  - ref:     a frozen copy from before RL; KL anchor.

The reward bundle runs on the CPU; only the scalar R for each rollout is
moved to the GPU. This is the canonical R1-Zero training step: no reward
model, no value network, no preference data.

sample_with_logp and policy_logprobs are helpers assumed to be defined
elsewhere; both are taken to return per-token log-probs of shape (G, T).
"""
import torch
from typing import Callable

def reasoning_grpo_step(
    policy:    torch.nn.Module,
    ref:       torch.nn.Module,
    prompts:   list[str],
    truths:    list[str],
    reward_fn: Callable[[str, str], dict],          # the ReasoningReward bundle
    tokenize:  Callable, detokenize: Callable,
    G:         int   = 16,                          # R1-Zero used G=16
    clip:      float = 0.2,
    kl_coef:   float = 0.001,                       # R1-Zero used 0.001, very small
    temperature: float = 0.6,                       # R1-Zero used 0.6
    max_new_tokens: int = 4096,                     # room for a long <think> block
):
    """Return the mean GRPO loss over a batch of reasoning prompts.

    The caller runs loss.backward() and the optimiser step.
    """
    device = next(policy.parameters()).device       # nn.Module has no .device attribute
    total_loss = 0.0

    for prompt, truth in zip(prompts, truths):
        # 1. Sample G rollouts from the current policy
        ids = tokenize(prompt).to(device)
        gens, logp_old = sample_with_logp(
            policy, ids, G, temperature=temperature, max_new_tokens=max_new_tokens,
        )
        completions = [detokenize(g) for g in gens]

        # 2. Score every rollout with the REASONING reward bundle
        scored  = [reward_fn(c, truth) for c in completions]
        rewards = torch.tensor(
            [s["R"] for s in scored],
            device=device, dtype=torch.float32,
        )                                            # shape (G,)

        # 3. Group-relative advantage: no value net, no reward model.
        #    torch's std applies Bessel's correction by default, so values
        #    differ slightly from statistics.pstdev in group_advantages.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

        # 4. Re-score under current policy (gradient flows here)
        logp_new = policy_logprobs(policy, ids, gens)
        with torch.no_grad():                        # the reference is frozen
            logp_ref = policy_logprobs(ref, ids, gens)

        # 5. PPO clipped surrogate, averaged over generated tokens
        #    (padding positions are not masked in this sketch)
        ratio = torch.exp(logp_new - logp_old)       # (G, T)
        surr1 = ratio                * adv.unsqueeze(-1)
        surr2 = torch.clamp(ratio, 1 - clip, 1 + clip) * adv.unsqueeze(-1)
        pg    = -torch.min(surr1, surr2).mean()

        # 6. KL-to-reference: very small for R1-Zero
        kl = (logp_new - logp_ref).mean()

        total_loss = total_loss + pg + kl_coef * kl

    return total_loss / len(prompts)

Read this loop alongside the standard PPO-RLHF loop and count what you removed. No value network (saved: ~7B parameters of FSDP-sharded memory). No reward model (saved: ~7B params + the whole reward-model training pipeline). No preference dataset (saved: ~$1M of labelling). What is left is the policy, the reference, the regex, and the optimiser. The reasoning emerges in the gap between the policy and the reference under the regex's gradient.

At Massive Scale: R1, Kimi, and Olmo's Choices

Real production reasoning-RL runs use this same template with small but consequential variations. Four published configs are worth comparing.

DeepSeek R1-Zero (671B, 8000 steps)

Component	Weight
Accuracy (boxed exact-match for math; pass-rate for code)	1.0
Format (<think>/<answer> regex)	0.1
Length bonus	0 (let it emerge)
Language consistency penalty	0 (its absence allowed the bilingual collapse)
Process / step rewards	0 (pure ORM)

Group size $G = 16$, KL coefficient 0.001, temperature 0.6, max_new_tokens 4096. The whole reward is two regexes. R1-Zero hit 86.7% on AIME 2024 under majority voting (cons@64; 71.0% pass@1) with this, a result that destroyed the "you need PRMs" orthodoxy.

DeepSeek R1 (the SFT-then-RL version)

Component	Weight
Accuracy	1.0
Format	0.1
Length bonus	0
Language consistency penalty	0.5 (added to fix R1-Zero collapse)
Repetition penalty (n-gram density)	implicit via sampling
Helpful/harmless RM (mixed phases only)	0.2 in alignment phase

The language penalty is the main difference in the reasoning-phase reward stack between R1-Zero and R1. The helpful/harmless RM appears only in the final alignment phase, after reasoning has stabilised.

Kimi K1.5

Component	Weight
Accuracy	1.0
Format	0.05 (lower than R1; their base model was already SFT'd)
Length bonus (linear, no saturation)	small positive — explicit length shaping
Length penalty (above cap)	negative, capped
Process Reward Model (rare, ~5% of prompts)	0.05 (hybrid)

Kimi is the most interesting outlier: they kept a small PRM term, used it sparsely (only on the 5% hardest prompts), and added explicit length shaping. Result: marginally better hard-math performance than R1, marginally worse aha-moment emergence. The trade-off is real and the design choice is task-driven.

Olmo 3 (open source)

Component	Weight
Accuracy	1.0
Format	0.1
Length bonus, saturated at 1024	0.2
Repetition penalty (n-gram)	0.3
Language consistency	0.5
Group size	G = 64 (large for stable σ)

Olmo 3 is the "every-component-on" design. The G=64 group size is unusually large; it stabilises $\sigma_G$ across batches at the cost of 4x more rollout compute per gradient step than G=16. They chose this for reproducibility, not throughput: a lower-variance advantage estimator makes runs easier to debug.

The pattern across all four: the accuracy weight is always 1.0 and dominates. Every other component is at most 0.5, often 0.1 or less. The accuracy reward is the gravity well; everything else is a small force shaping the trajectory around it. If you flip this ratio (accuracy small, other rewards large) the model finds the reward-hack path instead of the reasoning path. Every time.
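The dominance pattern can be written down as a config check. The sketch below copies the numeric weights from the tables above into a hypothetical RewardWeights dataclass and asserts that accuracy dominates; the class, the field names, and the 0.5 ceiling are illustrative assumptions. Kimi K1.5 is left out because its length terms are given only qualitatively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical container for the reward weights listed in the tables."""
    accuracy: float
    fmt: float
    length: float = 0.0
    language: float = 0.0
    repetition: float = 0.0


CONFIGS = {
    "R1-Zero": RewardWeights(accuracy=1.0, fmt=0.1),
    "R1": RewardWeights(accuracy=1.0, fmt=0.1, language=0.5),
    "Olmo 3": RewardWeights(accuracy=1.0, fmt=0.1, length=0.2,
                            language=0.5, repetition=0.3),
}


def accuracy_dominates(w: RewardWeights, max_side: float = 0.5) -> bool:
    """True when accuracy is at least 1.0 and no other weight exceeds max_side."""
    side = (w.fmt, w.length, w.language, w.repetition)
    return w.accuracy >= 1.0 and all(s <= max_side for s in side)


for name, weights in CONFIGS.items():
    print(name, accuracy_dominates(weights))

# A flipped ratio, the failure mode described above, fails the check.
print("flipped", accuracy_dominates(RewardWeights(accuracy=0.2, fmt=1.0)))
```

A check like this is cheap to run at config-load time, before any rollout compute is spent on a reward stack with the ratio inverted.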

Engineering Reality: Reasoning-Specific Reward Hacks

Reasoning-RL runs keep hitting the same reward hacks, and each one has a known patch. Here is the working list.

Hack	Symptom	Patch
Empty <think> block	Model writes <think></think>\boxed{...} — gets full format reward, zero thinking	Subtract format reward when len(<think>) < 8 tokens; or require non-whitespace inside
Copy the answer into <think>	Model writes <think>The answer is 42</think><answer>\boxed{42}</answer>	Check think body for absence of boxed pattern; or require derivation length
Bilingual reasoning (R1-Zero's famous case)	Think block is 40% Chinese despite an English training corpus	Add language consistency penalty (w_lang = 0.5, threshold 10%)
Rambling self-talk to inflate length bonus	612 tokens of 'let me think... still thinking... hmm'	Saturated length bonus + soft repetition penalty above ramble_threshold
Multiple \boxed{} with different values	Model hedges: \boxed{2}, or maybe \boxed{3} — first-match accepts the guess	Match the LAST \boxed{} in the answer block, not the first
EOS-token suppression	Model never emits </think>, gets cut off by max_new_tokens, yet still collects partial format credit	Make truncated rollouts auto-score 0 regardless of partial format match
Non-ASCII answer digits	Model writes \boxed{２} (full-width 2); exact-match fails	Unicode NFKC normalisation on both sides of the comparison
Late chain-of-thought injection	Model writes the answer FIRST then <think> after; first-match logic accepts it	Require <think> to come BEFORE <answer> in the regex (TAG_RE handles this if you write it right)
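Several of the patches in this table fit into one small extractor. The sketch below is a minimal illustration, not the TAG_RE from the reward file: ROLLOUT_RE, BOXED_RE, extract_answer and is_correct are hypothetical names, and the empty-think check counts characters as a stand-in for the 8-token threshold because no tokenizer is assumed.

```python
import re
import unicodedata
from typing import Optional

# Tags must appear in this order and both must be closed.
ROLLOUT_RE = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def normalise(text: str) -> str:
    """NFKC folds full-width digits into their ASCII forms."""
    return unicodedata.normalize("NFKC", text).strip()


def extract_answer(completion: str, min_think_chars: int = 8) -> Optional[str]:
    """Return the final boxed answer, or None when the rollout should score 0."""
    m = ROLLOUT_RE.match(completion)
    if m is None:
        return None  # missing tags, answer before think, or truncated rollout
    think, answer = m.group(1), m.group(2)
    if len(think.strip()) < min_think_chars:
        return None  # empty or near-empty think block
    if BOXED_RE.search(think):
        return None  # boxed answer copied into the think block
    boxed = BOXED_RE.findall(answer)
    if not boxed:
        return None
    return normalise(boxed[-1])  # last match wins over earlier hedges


def is_correct(completion: str, truth: str) -> bool:
    got = extract_answer(completion)
    return got is not None and got == normalise(truth)
```

For example, a rollout ending in `<answer>\boxed{2} or \boxed{3}</answer>` resolves to "3", and a full-width `\boxed{２}` compares equal to the truth "2" after normalisation.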

Notice the symmetry with section 14.4's rule-based reward hacks. The fixes are still single-line regex patches. The difference is that reasoning rollouts are 10x longer, so each hack costs more compute to discover and reproduce. In practice you find them by sampling 100 high-reward rollouts every 500 training steps and reading them by hand. There is no automated substitute.

The eval-vs-train reward gap

One reasoning-specific failure that does not appear in section 14.4: training reward goes up while eval accuracy stalls. The policy is exploiting some structural feature of the training prompts that doesn't appear on eval prompts. Common culprits:

Numeric distribution mismatch. Training truths are mostly small integers; the model learns to guess 0–10 with high prior and the accuracy reward rewards it. Eval has fractions and decimals. Fix: stratify training corpus by answer type.
Prompt-template leakage. Training prompts all end with "Put your answer in \boxed{}"; the model learns to grep for that specific token. Fix: vary the prompt template at training time.
Verifier-specific format. The model learns to emit answers in a format only your verifier accepts. Cross-eval with a different verifier to detect.
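The first culprit is the easiest to check mechanically. A minimal sketch, assuming ground-truth answers are plain strings: bucket each answer by type and compare the shares between the training and eval sets. The answer_type rules and both sample lists are illustrative assumptions, not data from any real corpus.

```python
import re
from collections import Counter


def answer_type(truth: str) -> str:
    """Coarse answer-type bucket used for stratification."""
    t = truth.strip()
    if re.fullmatch(r"-?\d+", t):
        return "integer"
    if re.fullmatch(r"-?\d+/\d+", t):
        return "fraction"
    if re.fullmatch(r"-?\d*\.\d+", t):
        return "decimal"
    return "other"


def type_shares(truths):
    """Fraction of answers falling into each bucket."""
    counts = Counter(answer_type(t) for t in truths)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


# Made-up samples: the training set is integer-heavy, the eval set is not.
train_truths = ["3", "7", "12", "0", "5", "1/2"]
eval_truths = ["3/4", "0.25", "7", "2.5", "1/3", "12"]
print("train:", type_shares(train_truths))
print("eval: ", type_shares(eval_truths))
```

A large gap between the two share dictionaries is a cue to resample the training corpus per bucket before the policy learns the integer prior.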

The eval-time discovery cost is brutal. By the time you notice a reward hack on eval, you have spent ~$500k of cluster time training a policy that exploits it. The defence is to keep a held-out "canary" eval set, score it every 500 training steps with the SAME verifier the training reward uses, and trip the alarm the moment training reward and eval accuracy decouple. R1's authors describe this as their most load-bearing monitoring signal.
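The alarm itself can be very small. The sketch below flags a run when mean training reward has risen over the last few checkpoints while canary accuracy has not; the function name, the window, and both thresholds are assumptions to tune per run, and the two curves are made-up numbers.

```python
def decoupled(train_reward, eval_acc, window=3,
              min_reward_gain=0.05, max_eval_gain=0.0):
    """True when reward rose over the last `window` checkpoints but canary accuracy did not."""
    if len(train_reward) <= window or len(eval_acc) <= window:
        return False  # not enough history to compare
    reward_gain = train_reward[-1] - train_reward[-1 - window]
    eval_gain = eval_acc[-1] - eval_acc[-1 - window]
    return reward_gain >= min_reward_gain and eval_gain <= max_eval_gain


# Hypothetical curves, one point per 500 training steps.
train_reward = [0.30, 0.42, 0.55, 0.63, 0.70, 0.76]
healthy_eval = [0.28, 0.39, 0.47, 0.52, 0.58, 0.63]
hacked_eval = [0.28, 0.39, 0.47, 0.47, 0.46, 0.47]

print("healthy run flagged:", decoupled(train_reward, healthy_eval))
print("hacked run flagged: ", decoupled(train_reward, hacked_eval))
```

When the flag trips, that is the moment to pull the high-reward rollouts and read them by hand, before more cluster time goes into the exploit.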

The take-away of this section is the same as the lesson of every other section in this chapter: GRPO does not give you a free pass on reward design. Dropping the critic saves 7B parameters of memory; it does not save you from having to think about which behaviours your reward rewards. A reasoning model is the most legible artefact a reward function can produce, because the policy literally writes its strategy down in the think block. Read those think blocks. They will tell you what you actually rewarded — not what you thought you were rewarding.

The next section surveys GRPO variants — DAPO, Dr.GRPO, and Olmo 3's extensions — which mostly change how advantages are computed and clipped, not what the reward measures. With the reasoning reward in your toolbox you now have the harder half of the GRPO design surface; the variants tune the easier half.