Boo-AI — Master Artificial Intelligence by Building from Scratch

The Real Problem: Where Do Reasoning Traces Come From?

For most of the post-ChatGPT era, “teaching” a large language model to reason has meant essentially one thing: collect a corpus of high-quality, step-by-step solutions written by humans (or by another, already-reasoning model), and supervise-fine-tune the base model on that corpus. The publicly described recipes of this period follow some version of the same pipeline: base model → SFT on chain-of-thought → reward model trained on preferences → PPO. Reasoning was treated as a behaviour that had to be taught, and the supervised step was the place where the teaching happened.

That pipeline has one expensive, brittle bottleneck: the SFT corpus. A real reasoning SFT dataset is hundreds of thousands to millions of examples, each containing a problem and a careful, end-to-end worked-out solution. Where do those solutions come from? Three unsatisfying answers: (1) you pay human annotators — expensive, slow, and inconsistent at the scale of millions of examples; (2) you distill from a stronger existing reasoning model — but that only works if someone has already built one, and the resulting model cannot exceed its teacher; (3) you mix the two with rejection sampling — better, but now you need the stronger model AND the human annotators AND a quality filter, and the loop is still bounded by whichever model you started with.

On top of the SFT problem, the conventional pipeline carries two more learned networks. The first is the reward model, which approximates the human preference signal and has its own training cost, scaling problems, and well-documented pathologies (length bias, format bias, reward hacking; see §14.3). The second is the PPO critic, another transformer the size of the policy that estimates state value and roughly doubles the GPU memory budget of the RL phase. By the time you have all three networks (policy, reward model, critic) plus a frozen reference for the KL term, the RL phase is holding four networks in memory at once, two of them under active training, and the SFT step starts to look like the cheap part of the pipeline.

The unspoken assumption. The whole pipeline rests on a quiet belief: that reasoning is a new behaviour the base model does not already have, and that SFT is the step that puts it there. If that belief is wrong — if reasoning is already latent in the base model and only needs to be elicited — then the entire SFT and reward-model machinery exists to solve a problem that does not really exist. That is the bet R1-Zero made.

Intuition: Latent Skill vs Taught Skill

A modern base model has been trained on a staggering amount of text that already contains reasoning: math textbooks, Stack Overflow threads, GitHub repositories, scientific papers, competition solutions, Wikipedia derivations. By the time the base model has finished pretraining on tens of trillions of tokens, it has seen countless examples of someone working through a problem step by step, catching a mistake, backtracking, and trying again. The patterns are in the weights. The question is whether the policy — the conditional distribution over the next token given a problem statement — will actually use those patterns when prompted.

Here is the analogy that crystallises the R1-Zero hypothesis. Imagine a chess grandmaster who has spent years studying opening theory and endgame manuals but who, in actual games, plays the first move that feels right and then panics when their opponent deviates. The knowledge is in there. They do not need a tutor to teach them how to calculate variations; they need a context in which calculating variations is consistently rewarded. Give them a series of games where playing carefully wins and playing carelessly loses, and over time their behaviour will shift — not because they learned new chess theory, but because the existing theory was now the path of least resistance.

That is what R1-Zero proposes for language models. The base model contains, somewhere in its conditional distribution, a thin slice of probability mass corresponding to long, careful, mistake-correcting chains of thought. The SFT step does not create that slice; it just amplifies it by training the model on examples drawn from it. But amplification by SFT is expensive (you need millions of annotated examples) and limited (you can only amplify what your teachers can produce). Amplification by reinforcement learning against a verifiable reward is, in principle, neither: a regex that checks whether the final answer is correct can be applied to billions of self-generated samples for free.

The intuition in one line. SFT teaches the model what reasoning looks like. RL teaches the model what reasoning is for. If the base model already knows what it looks like, RL alone may be enough — and R1-Zero is the experiment that asks whether that “may” is actually “is.”

The R1-Zero Hypothesis, Formally

The R1-Zero paper's central claim can be stated in one sentence with three pieces:

Hypothesis (R1-Zero). Starting from a sufficiently strong base model $\pi_{\text{base}}$ , optimising the policy directly with reinforcement learning against a rule-based, verifiable reward — with no supervised fine-tuning, no learned reward model, and no value network — is sufficient to elicit long-form chain-of-thought reasoning, self-verification, and backtracking behaviours that match or exceed pipelines built on large supervised reasoning corpora.

Three load-bearing words in that sentence deserve a closer look. Sufficiently strong means — in practice, in 2024 — a 671-billion-parameter Mixture-of-Experts model (DeepSeek-V3-Base) with ~37B active parameters, trained on 14.8T tokens with a heavy share of math and code. On a 7B model the same recipe substantially underperforms a 7B SFT baseline; on a 671B base model it produces a reasoning model competitive with the strongest closed systems of its era. The hypothesis is not “RL alone works at every scale”; it is “RL alone works at scale.”

Verifiable means the reward function can read the model's output and decide, deterministically and without learning, whether it is correct. Math problems with closed-form answers, coding problems with unit tests, and logic puzzles with a unique solution all admit verifiable rewards. Creative writing, open-ended dialogue, and most chat tasks do not — which is precisely why R1-Zero is a reasoning hypothesis, not a general-purpose RLHF hypothesis. (Chapter 17 covers how R1 stitches together the reasoning-RL phase with conventional alignment to get a model that is good at both.)

No SFT means the policy at step 0 of RL is $\pi_\theta^{(0)} = \pi_{\text{base}}$ — the same model that produces “blue sky 1+1=2 the wave function collapses” when you ask it “What is 17+26?” with no chat template. The reference policy used for the KL penalty is also the base model: $\pi_{\text{ref}} = \pi_{\text{base}}$ . There is no SFT checkpoint anywhere in the pipeline. That is the part of the hypothesis that surprised the field most.

Visualizing the Two Pipelines

Compare the two paths. The conventional path has five stages and four trained networks (SFT model, reward model, critic, policy). R1-Zero has three stages and two networks (policy, frozen reference). The arrows do not just represent passes of compute; they represent assumptions about where reasoning comes from.

Two pipelines to a reasoning model

The conventional path assumes that reasoning must be *taught* by example. The SFT step needs a large corpus of high-quality chain-of-thought traces — most of which currently come from another, already-reasoning model or from expensive human annotation.

The Mathematical Idea: GRPO With $\pi_{\text{ref}} = \pi_{\text{base}}$

R1-Zero is, in equation form, the GRPO objective applied with the base model as both the starting policy and the reference. Let $\pi_\theta$ be the policy being trained, with parameters $\theta$ initialised to the base-model weights. For a prompt $x$ , the policy samples a group of $K$ completions $\{y^{(1)}, \dots, y^{(K)}\}$ . Each completion is scored by the rule-based reward $r(x, y)$ , and the group-relative advantage of completion $i$ is

$A^{(i)} = \frac{r(x, y^{(i)}) - \mu}{\sigma + \varepsilon}, \quad \mu = \frac{1}{K} \sum_{j} r(x, y^{(j)}), \quad \sigma = \sqrt{\frac{1}{K} \sum_{j} \bigl(r(x, y^{(j)}) - \mu\bigr)^2}.$

The advantage is a per-completion scalar, positive when the completion was above the group's average and negative when it was below. Every token in completion $i$ inherits the same advantage; GRPO does not allocate credit at the token level. The objective for one prompt is then the standard clipped-PG objective with a KL penalty:

$\mathcal{L}(\theta) = -\, \mathbb{E}_{i, t} \bigl[ \min\bigl( \rho^{(i)}_t \, A^{(i)}, \; \mathrm{clip}(\rho^{(i)}_t, 1-\epsilon, 1+\epsilon) \, A^{(i)} \bigr) \bigr] + \beta \cdot \mathbb{E}_{i, t} \bigl[ \mathrm{KL}\bigl( \pi_\theta \,\Vert\, \pi_{\text{ref}} \bigr) \bigr],$

where $\rho^{(i)}_t = \pi_\theta(y^{(i)}_t \mid x, y^{(i)}_{<t}) / \pi_{\theta_{\text{old}}}(y^{(i)}_t \mid x, y^{(i)}_{<t})$ is the importance ratio between the current policy and the policy snapshot at the start of the GRPO step, and $\epsilon$ is the PPO-style clipping range. The KL term penalises drift away from the reference policy $\pi_{\text{ref}}$ , and the coefficient $\beta$ controls how willing the policy is to trade reward for distributional drift.
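To make the clipping concrete, here is a minimal sketch of the per-token term inside the expectation, $\min(\rho A, \mathrm{clip}(\rho, 1-\epsilon, 1+\epsilon) A)$, in plain Python. The function name `clipped_term` and the value `eps=0.2` are assumptions chosen for illustration; the text above does not state the clipping range that was used.

```python
def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token clipped surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A).

    eps=0.2 is an illustrative assumption, not a value taken from the text.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: once rho exceeds 1 + eps the term stops growing,
# so there is no incentive to push the token's probability further.
print(clipped_term(1.5, +1.0))

# Negative advantage with a large ratio: the min keeps the unclipped,
# more pessimistic value, so the penalty is not capped.
print(clipped_term(1.5, -1.0))

# Negative advantage with a small ratio: the clipped value is the smaller one,
# so pushing the probability down even further earns nothing extra.
print(clipped_term(0.5, -1.0))
```

The loss in the equation above is the negative of this term averaged over completions and tokens, plus the KL penalty.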

Every symbol in those two equations is identical to a standard PPO + KL-to-SFT setup — except for two pieces. First, $A^{(i)}$ is the group-relative advantage rather than $r^{(i)} - V_\phi(s^{(i)})$ , eliminating the critic network $V_\phi$ . Second, $\pi_{\text{ref}}$ is the base model, not the SFT model. The first change cuts memory in half; the second change is the whole hypothesis.

Why the KL is computed against the base model and not the previous step. The KL term is a regulariser against distribution collapse: without it, RL will happily find a degenerate policy that emits one high-reward token sequence on every prompt and loses all of its general capability. The reference is the distributional anchor. In R1-Zero the anchor is the pretrained base model, so the KL term is asking the policy to remain recognisably a language model — not to mode-collapse onto whatever sequence the rule-based reward happened to favour first.
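As a rough illustration of the anchor effect, the sketch below computes the exact KL divergence between a toy next-token distribution and a reference, for a mildly drifted policy and for a nearly collapsed one. The three-token vocabulary, the probabilities, and `beta = 0.04` are all made-up values for illustration.

```python
import math


def kl(p: list[float], q: list[float]) -> float:
    """Exact KL(p || q) for two categorical distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token distributions over a 3-token vocabulary.
pi_ref = [0.50, 0.30, 0.20]        # frozen base model
pi_drifted = [0.60, 0.25, 0.15]    # mild shift toward the rewarded token
pi_collapsed = [0.98, 0.01, 0.01]  # near-deterministic policy

beta = 0.04  # illustrative KL coefficient
for name, p in [("drifted", pi_drifted), ("collapsed", pi_collapsed)]:
    divergence = kl(p, pi_ref)
    print(f"{name:9s} KL = {divergence:.4f}   beta * KL = {beta * divergence:.4f}")
```

The collapsed policy pays a much larger penalty per token than the drifted one, which is the brake the paragraph above describes.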

Rule-Based Reward: No Critic, No Reward Model

The reward function in R1-Zero has two scalar components and is defined entirely in code — not learned, not human-rated, not approximated. For a math problem with gold answer $y^\star$ , the reward of a completion $y$ is

$r(x, y) \;=\; \underbrace{\mathbb{1}\bigl[\, \mathrm{extract}(y) = y^\star \,\bigr]}_{\text{accuracy}} \;+\; \underbrace{0.1 \cdot \mathbb{1}\bigl[\, \mathrm{tagged}(y) \,\bigr]}_{\text{format}},$

where $\mathrm{extract}(y)$ pulls the contents of the <answer> tag out of the completion and $\mathrm{tagged}(y)$ checks whether the “think then answer” template is present. The accuracy reward is binary and dominant (weight 1); the format reward is also binary but small (weight 0.1). Both are deterministic functions of the text and the gold answer.

Three observations follow immediately. First, this reward function runs in microseconds per completion — comparable to the tokenizer, orders of magnitude cheaper than a learned reward model. Second, there is no gradient signal flowing into the reward function, which means there is no reward hacking in the usual sense (you cannot “adversarially exploit” a regex by writing nicer prose). Third, the reward is sparse but verifiable: most completions in early training get 0.0 or 0.1, and only the ones that actually solve the problem get 1.1. The model has to find a correct solution by sampling before it can learn anything — which is why the experiment only works at scale, where the base model already gets a non-negligible pass@K on the training set.

Component	Value when present	What it teaches
Accuracy	1.0	The final extracted answer matches the gold answer. Dominant signal.
Format	0.1	The completion follows <think>...</think><answer>...</answer>. Small but non-zero — gives a learning signal even on completions with wrong answers.
Length / readability	(none)	Deliberately omitted in R1-Zero. The model is allowed to write completions in any language, any style, any length. Failure mode: language mixing and unreadable internal monologues. R1 fixes this by adding a readability reward in stage 2.
Helpfulness / harmlessness	(none)	Also deliberately omitted. R1-Zero is a reasoning model, not a chat model. R1 adds these in stage 4.
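The sparsity point (the model must sample a correct solution before it can learn anything) can be made quantitative with a small sketch. It assumes each sample succeeds independently with the same probability `p`, which is a simplification; the values of `p` are hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct.

    Assumes every sample succeeds independently with probability p.
    """
    return 1.0 - (1.0 - p) ** k


# Hypothetical per-sample success rates for a weak, medium, and stronger base model.
for p in (0.001, 0.01, 0.1):
    row = "  ".join(f"pass@{k}={pass_at_k(p, k):.3f}" for k in (1, 16, 64))
    print(f"p={p}: {row}")
```

When `p` is very small, even a large group rarely contains a rewarded completion, so most prompts contribute no accuracy signal at all.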

Manual Numerical Walkthrough: One GRPO Step on a Math Problem

Follow the arithmetic with a pencil. The numbers are tiny on purpose: once you can compute one GRPO step by hand, the PyTorch implementation is just shape-bookkeeping on top of this.


The setup. One prompt, “What is 17 + 26?”, with gold answer $y^\star = \text{“43”}$ . We sample $K = 4$ completions from the base policy:

i	completion (truncated)	accuracy?	format?	r
0	<think>17+26=43</think><answer>43</answer>	✓	✓	1.1
1	The answer is 43.	✓ (text)	✗	0.0
2	<think>17+26=33</think><answer>33</answer>	✗	✓	0.1
3	</think> blue sky 1+1=2 …	✗	✗	0.0

Step 1 — Group mean and std. $\mu = (1.1 + 0.0 + 0.1 + 0.0)/4 = 0.30$ . $\sigma = \sqrt{\bigl((0.8)^2 + (-0.3)^2 + (-0.2)^2 + (-0.3)^2\bigr) / 4} = \sqrt{0.215} \approx 0.464$ .

Step 2 — Advantages.

i	r − μ	(r − μ) / σ	A
0	+0.80	+0.80 / 0.4637	+1.725
1	−0.30	−0.30 / 0.4637	−0.647
2	−0.20	−0.20 / 0.4637	−0.431
3	−0.30	−0.30 / 0.4637	−0.647

Step 3: What each completion's tokens are about to learn. Every token in completion 0 will have its log-probability under $\pi_\theta$ nudged up by something proportional to $+1.725$ . Every token in completions 1, 2, 3 will have its log-prob nudged down. Notice that completions 1 and 3 get the same magnitude of downward pressure (both at $-0.647$ ): from a single problem, GRPO cannot tell “correct content, wrong format” apart from “garbled output.” The fine-grained discrimination comes from aggregating across thousands of different problems: across a batch, the “correct content” tokens of completion 1 will eventually appear in other contexts where they ARE rewarded, while the “blue sky” tokens of completion 3 will not.

Step 4 — KL contribution. Suppose the current policy and the base reference still agree on most tokens, with $\mathrm{KL}(\pi_\theta \,\Vert\, \pi_{\text{ref}}) \approx 0.02$ per token averaged over completion 0's 26 generated tokens. With $\beta = 0.04$ the KL term adds $0.04 \cdot 26 \cdot 0.02 \approx 0.021$ to the loss for that completion — small in absolute terms but persistent across every step, acting as a brake on distribution drift.

Step 5: The composite loss for this group. The policy-gradient term is the negative advantage-weighted log-prob, summed across all four completions. If the per-completion summed log-probs were (−26.0, −8.0, −22.0, −45.0), the PG term is approximately $-(1.725 \cdot (-26) + (-0.647) \cdot (-8) + (-0.431) \cdot (-22) + (-0.647) \cdot (-45)) / \text{tokens}$ . The exact number depends on the token count; what matters is the sign pattern: positive-advantage completions reduce the loss when their tokens become more likely, and negative-advantage completions reduce the loss when their tokens become less likely.
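The sign pattern can be checked numerically with a short sketch. It recomputes the advantages from the four walkthrough rewards, evaluates the unnormalised policy-gradient term on the toy summed log-probs, and then perturbs one log-prob at a time. The token-count normalisation, the clipping, and the KL term are left out on purpose, and the perturbation size of 0.1 is arbitrary.

```python
from statistics import mean, pstdev

rewards = [1.1, 0.0, 0.1, 0.0]            # walkthrough rewards
summed_lp = [-26.0, -8.0, -22.0, -45.0]   # toy summed log-probs from Step 5

mu, sd = mean(rewards), pstdev(rewards)
advantages = [(r - mu) / (sd + 1e-8) for r in rewards]


def pg_term(lps: list[float]) -> float:
    """Unnormalised policy-gradient term: -(sum_i A_i * logprob_i)."""
    return -sum(a * lp for a, lp in zip(advantages, lps))


base = pg_term(summed_lp)
# Make completion 0 (positive advantage) slightly MORE likely.
up0 = pg_term([summed_lp[0] + 0.1] + summed_lp[1:])
# Make completion 3 (negative advantage) slightly LESS likely.
down3 = pg_term(summed_lp[:3] + [summed_lp[3] - 0.1])

print(f"base = {base:.4f}")
print(f"raise lp[0] by 0.1 -> {up0:.4f}")
print(f"lower lp[3] by 0.1 -> {down3:.4f}")
```

Both perturbations lower the term, which is exactly the direction gradient descent will move the policy.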

What changes after one update. The model has taken one tiny step toward emitting more completion-0-like text (long, formatted, correct) and away from completion-3-like text (garbled). Compound this over hundreds of thousands of math problems and the policy drifts from the base model's chaotic distribution into a recognisable “think-then-answer” reasoning mode.

Interactive: Group-Relative Advantages

Drag the reward sliders to feel how the advantage signal responds. Two preset edge cases are worth visiting. (1) The easy problem: all four rewards at 1.1. The group std collapses to zero, every advantage becomes 0/0 (rescued by the $\varepsilon$ ), and the model learns nothing. That is correct behaviour — if every sample succeeds, there is nothing to amplify. (2) The hard problem: all four rewards at 0. Same outcome, zero advantages, no learning signal. For R1-Zero to make progress, the base model must already produce a mixed group: some right, some wrong, on the same prompt. This is why R1-Zero only works on a base model strong enough to get non-trivial pass@K on the training set. That is the single biggest scale-dependent condition for the hypothesis to hold.

Group-relative advantages: what each completion is about to learn

completion	r	A
✓ correct + formatted	1.10	+1.725
✓ correct, no format	0.00	-0.647
✗ wrong, formatted	0.10	-0.431
✗ garbled output	0.00	-0.647

group mean: 0.300, group std: 0.464, positive advantages: 1 of 4, negative advantages: 3 of 4

Try the “Easy problem” preset: every reward is 1.1, the group std collapses to ~0, and every advantage goes to 0. The model learns NOTHING from a problem it already solves perfectly — exactly the right behaviour. The “Hard problem” preset has the opposite issue: all rewards are 0, no completion beats the mean, no learning signal. R1-Zero needs problems where the group is mixed.
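The two degenerate presets are easy to reproduce in code. The sketch below computes group-relative advantages for an all-correct and an all-wrong group, then estimates how often a group of K samples is mixed at all. The helper names are hypothetical, and `mixed_group_prob` assumes independent samples with a shared success probability `p`.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantage: (r - mean) / (population std + eps)."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def mixed_group_prob(p: float, k: int) -> float:
    """Chance that k independent samples contain at least one success AND one failure."""
    return 1.0 - p ** k - (1.0 - p) ** k


# "Easy problem": every completion earns 1.1, so there is nothing to amplify.
print(group_advantages([1.1, 1.1, 1.1, 1.1]))
# "Hard problem": every completion earns 0.0, same outcome.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))

# How often does a K=16 group carry any signal? (p values are hypothetical.)
for p in (0.01, 0.25, 0.99):
    print(f"p={p:.2f}  P(mixed group, K=16) = {mixed_group_prob(p, 16):.3f}")
```

Problems the policy almost always solves, or almost never solves, rarely produce a mixed group, so they contribute little to training.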

Plain Python: Rule-Based Reward + Group Advantage

Here is the entire R1-Zero learning signal (reward function, group-relative advantage, and a sketch of what the policy update does to log-probabilities) in about 100 lines of plain Python, comments included. The transformer is replaced by a list of strings and the gradient step by a manual log-probability update, but the arithmetic of the reward and the advantage is the same as what runs on 671B parameters.

The R1-Zero learning signal in pure Python

🐍r1zero_signal.py


completions[]: what the base model actually emits before any RL

Sampling from a base model with no SFT looks like this: a smattering of approximately-coherent attempts, one or two that happen to be right, one that is plausibly formatted but wrong, and at least one piece of garbled text where the model wandered off the task. This is the raw material R1-Zero has to work with. SFT would have hidden the garbled completion behind a chat template; R1-Zero must learn to suppress it through reward alone.

EXAMPLE

On a real DeepSeek-V3-Base, problem 'What is 17+26?' with temperature=1.0 and K=16 produces roughly 4 correct + format-OK, 6 correct but un-formatted, 3 wrong answers in the format, 3 incoherent. Base-model pass@1 ≈ 25%; pass@64 ≈ 70%.

PATTERN: the entire 'reward model' is one regex

DeepSeek-R1-Zero's accuracy reward is a regex match against a gold answer, and the format reward is a check that '<think>...</think><answer>...</answer>' exists in the output. That is the complete reward function — no neural scorer, no human grader, no learned model. For problems with verifiable ground truth (math, code, logic), you do not need to learn a reward signal at all. The reward exists in the world, and a regex can read it.

EXAMPLE

PATTERN.search('<think>...</think><answer>43</answer>').group(2) == '43'.  Compared char-for-char with gold='43'.  No tokenization. No model. ~10 microseconds per completion.

reward(): two scalar components, summed into one number

Two reasons the reward is split into format vs accuracy. (1) Format reward gives a non-zero gradient signal even when the model is wrong — without it, every wrong completion has reward 0 and the model has no way to learn the right output shape. (2) The two are weighted asymmetrically (0.1 vs 1.0) so format is necessary but accuracy dominates. If you weight format too high, the model learns to emit beautiful <think> tags around nonsense; too low and it never learns the format at all.

EXAMPLE

reward('<think>17+26=43</think><answer>43</answer>', '43') = (1.1, 1.0, 0.1)
reward('The answer is 43.', '43')                          = (0.0, 0.0, 0.0)
reward('<think>17+26=33</think><answer>33</answer>', '43') = (0.1, 0.0, 0.1)
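The effect of the format weight can be seen directly on the four example completions. In this sketch the `(accuracy, format)` flags follow the scorer described above (an untagged answer earns neither component), and the alternative weight of 1.0 is a hypothetical setting used only to show what goes wrong.

```python
from statistics import mean, pstdev

# (accuracy, format) flags for the four example completions:
# correct + formatted, untagged, wrong + formatted, garbled.
flags = [(1, 1), (0, 0), (0, 1), (0, 0)]


def advantages_for(format_weight: float, eps: float = 1e-8) -> list[float]:
    """Group-relative advantages when r = accuracy + format_weight * format."""
    totals = [acc + format_weight * fmt for acc, fmt in flags]
    mu, sd = mean(totals), pstdev(totals)
    return [(t - mu) / (sd + eps) for t in totals]


for w in (0.1, 1.0):  # 0.1 is the weight used in the text; 1.0 is a hypothetical alternative
    adv = advantages_for(w)
    print(f"format_weight={w}: A[wrong but formatted] = {adv[2]:+.3f}")
```

With the small weight the wrong-but-formatted completion is still below the group mean and gets suppressed; with the large weight it moves above the mean and gets reinforced, which is the "beautiful tags around nonsense" failure mode.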

totals[]: the four scalar rewards for the four completions

These four numbers ARE the supervision signal for one GRPO step. Notice the asymmetry between completion 1 (correct content, no format), which scores 0.0 because no answer can be extracted, and completion 2 (correct format, wrong content), which still earns the 0.1 format reward. Only a completion that gets BOTH format and accuracy right collects the full 1.1, so the model cannot trade one off against the other for free.

EXAMPLE

totals = [1.1, 0.0, 0.1, 0.0]

mu = mean(totals): the GRPO baseline, replacing PPO's V(s)

This is the single change that lets R1-Zero scale. PPO needs a critic network V(s) to predict the expected reward at every state — a second transformer the size of the policy, doubling the GPU memory budget for RL. GRPO throws the critic away and uses the sample-mean reward across the K completions for THIS prompt as the baseline. It works because all K completions share the same prompt s, so subtracting the group mean is an unbiased estimator of the advantage. The cost: variance. The benefit: half the memory, half the training cost, and no critic-bootstrapping pathology.

EXAMPLE

mu = (1.1 + 0.0 + 0.1 + 0.0) / 4 = 0.30. This is the 'expected reward at this prompt' under the current policy.

sd = pstdev(totals) + eps: z-normalisation makes the update scale-invariant

Without dividing by std, a problem where rewards are (1.1, 0, 0, 0) would generate a 4x larger gradient than a problem where rewards are (0.275, 0, 0, 0) — even though both contain the same information ('one of the four was right'). Dividing by std puts every problem on an equal footing in advantage space, which is what lets you mix easy and hard problems in the same batch without easy problems dominating the gradient.

EXAMPLE

pstdev([1.1, 0.0, 0.1, 0.0]) ≈ 0.464.  sd ≈ 0.464 + 1e-8 ≈ 0.464.
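The scale-invariance claim is easy to verify. The sketch below compares mean-centred rewards with z-normalised advantages for the two reward vectors mentioned in the explanation; the helper names are hypothetical.

```python
from statistics import mean, pstdev


def centered(rewards: list[float]) -> list[float]:
    """Baseline subtraction only, no division by std."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def z_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Baseline subtraction followed by division by the population std."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


big = [1.1, 0.0, 0.0, 0.0]
small = [0.275, 0.0, 0.0, 0.0]

# Without normalisation the larger rewards give a proportionally larger signal.
print("centred ratio:", centered(big)[0] / centered(small)[0])
# With normalisation both problems produce the same advantages.
print("z(big)  :", [round(a, 3) for a in z_advantages(big)])
print("z(small):", [round(a, 3) for a in z_advantages(small)])
```

Both groups say "one of four was right", and after z-normalisation they push the policy equally hard.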

advantages[]: what each completion's tokens are about to learn

Positive advantage means 'this completion was above-average; make every token in it more likely under the next policy.' Negative advantage means 'this completion was below-average; make every token in it less likely.' Notice that BOTH unformatted completions (1 and 3) get the same negative advantage (-0.647): from a single problem, GRPO cannot distinguish 'right answer, missing tags' from 'garbled text'. It only learns that both are below-average. Aggregating across thousands of problems is what carves out the fine-grained behaviour.

EXAMPLE

advantages = [+1.725, -0.647, -0.431, -0.647]. The correct + formatted completion is the only one with a positive advantage.

lp[]: the base-model log-probability of each completion

Each completion has a per-token log-probability under the current (initially: base) model. The sum across tokens gives one scalar per completion. R1-Zero's gradient update at the token level is: ∇ log π(token | context) × A_completion. Tokens belonging to a positive-advantage completion get their log-prob pulled up; tokens belonging to a negative-advantage completion get theirs pushed down.

EXAMPLE

Completion 0 (correct + formatted) has lp = -3.2 — already moderately likely under the base model. Completion 3 (garbled) has lp = -7.5 — already very unlikely.

new_lp: one step of R1-Zero's effect on log-probabilities

After one GRPO step, completion 0's log-prob has gone up by η·A = 0.1·1.725 ≈ 0.173, and completion 3's has gone down by 0.1·0.647 ≈ 0.065. Multiply by tens of thousands of problems and hundreds of training steps and you get a policy that has shifted decisively toward 'long, well-formatted reasoning that ends with a correct answer.' This entire shift happened WITHOUT a single human-written reasoning trace.

EXAMPLE

new_lp[0] = -3.2 + 0.173 = -3.027 (completion 0 is now MORE likely)
new_lp[2] = -2.8 - 0.043 = -2.843 (completion 2 is now LESS likely)


"""
The two computations at the heart of R1-Zero, in about 100 lines of pure Python.

R1-Zero replaces the *three* hardest pieces of conventional RLHF:
  - the SFT-on-CoT step       -> nothing (we start from the base model)
  - the learned reward model  -> a rule-based scorer
  - the PPO critic            -> a group-mean baseline (GRPO)

Everything below runs on plain strings and floats. No transformer, no
autograd, no GPU. Once you can read this, the production code is just
shape-bookkeeping on top of it.
"""

import re
from statistics import mean, pstdev

# ---------------------------------------------------------------------------
# 1. A single math problem and four candidate completions sampled from
#    the *base* model (no SFT). One is right + formatted, one is right
#    but unformatted, one is formatted but wrong, one is garbled. Real
#    R1-Zero samples K=16-64 per problem; we use K=4 so the arithmetic
#    fits on the page.
# ---------------------------------------------------------------------------

problem = "What is 17 + 26?"
gold    = "43"

completions = [
    # Correct content, correct format.
    "<think>17 + 26 = 17 + 20 + 6 = 37 + 6 = 43.</think><answer>43</answer>",
    # Correct content, but no <think>/<answer> tags, so nothing can be extracted.
    "The answer is 43.",
    # Wrong arithmetic, correct format.
    "<think>17 + 26 = 33.</think><answer>33</answer>",
    # Total nonsense: the base model wandered off (no SFT to anchor it).
    "</think> blue sky 1+1=2 the wave function collapses 12345",
]

# ---------------------------------------------------------------------------
# 2. Rule-based reward. TWO components only:
#       r_format   in {0, 0.1}    -> 0.1 iff <think>...</think><answer>...</answer> shape is present
#       r_accuracy in {0, 1}      -> 1   iff the extracted answer equals the gold answer
#    Total reward r = r_accuracy + r_format. No reward model, no human grader.
# ---------------------------------------------------------------------------

PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def reward(completion: str, gold: str) -> tuple[float, float, float]:
    """Return (total, accuracy, format) for one completion."""
    m = PATTERN.search(completion)
    if m is None:
        # No template match: nothing to extract, so both components are zero.
        return 0.0, 0.0, 0.0
    r_format   = 0.1
    extracted  = m.group(2).strip()
    r_accuracy = 1.0 if extracted == gold else 0.0
    return r_accuracy + r_format, r_accuracy, r_format

rewards = [reward(c, gold) for c in completions]

print("idx |  total  acc  fmt  | text (truncated)")
for i, ((tot, acc, fmt), c) in enumerate(zip(rewards, completions)):
    short = (c[:50] + "...") if len(c) > 50 else c
    print(f" {i}  | {tot:5.2f} {acc:4.1f} {fmt:4.2f}  | {short}")

# ---------------------------------------------------------------------------
# 3. Group-relative advantage. GRPO replaces the PPO critic V(s) with the
#    sample-mean reward of the K completions for THIS prompt, and then
#    z-normalises so the policy update is scale-invariant across problems.
#       A_i = (r_i - mean(r)) / (std(r) + eps)
# ---------------------------------------------------------------------------

totals = [t for t, _, _ in rewards]
mu     = mean(totals)
sd     = pstdev(totals) + 1e-8   # population std, matching the 1/K formula
advantages = [(r - mu) / sd for r in totals]

print(f"\nmean reward across group: {mu:.3f}")
print(f"std  reward across group: {sd:.3f}")
print("idx |  reward   advantage  | sign")
for i, (r, a) in enumerate(zip(totals, advantages)):
    sign = "+ (reinforce)" if a > 0 else "- (suppress) "
    print(f" {i}  | {r:5.2f}   {a:+6.3f}    | {sign}")

# ---------------------------------------------------------------------------
# 4. The R1-Zero update direction. For each token in completion i, the
#    policy gradient is roughly:
#       grad_theta log pi(token | context) * A_i
#    Positive A_i pulls the token's log-prob UP, negative A_i pushes it
#    DOWN. KL to a reference model keeps us close to a sensible prior;
#    in R1-Zero pi_ref = pi_base, so the KL term penalises drift away
#    from the *starting* model. No SFT checkpoint involved anywhere.
# ---------------------------------------------------------------------------

# Toy: pretend the summed log-prob of completion i is sum(log pi) = lp_i.
# An update with learning rate eta moves lp_i by eta * A_i (sign-only here).
lp = [-3.2, -1.1, -2.8, -7.5]   # base-model log-prob of each completion
eta = 0.1
new_lp = [lp_i + eta * a for lp_i, a in zip(lp, advantages)]

print("\nstep 0:    lp =", [round(x, 3) for x in lp])
print("after 1: new_lp =", [round(x, 3) for x in new_lp])
print("\nCompletion 0 (correct + formatted) moved UP; completions 1, 2 and 3 (below the group mean) moved DOWN.")
print("That is the R1-Zero learning signal in its entirety.")

PyTorch: The R1-Zero GRPO Step

Stepping up from plain Python to PyTorch changes very little conceptually. The reward function stays a regex. The advantage stays a one-liner. The policy is now a HuggingFace causal LM, the reference is a frozen copy of the base model, and the log-prob math is done in tensors instead of Python floats. The hardest part of production GRPO is not in this file — it is in the parallel sampling infrastructure (vLLM-based rollout servers, async reward scoring, KL distillation across stages). The inner loop itself is simpler than PPO + RLHF, not more complex.

One R1-Zero GRPO step on a HuggingFace causal LM

🐍r1zero_step.py


Imports: notice what is missing

There is no `AutoModelForSequenceClassification`, no `reward_model.from_pretrained`, no `peft.LoraConfig`. R1-Zero needs the policy, a frozen copy of the base model as a reference for KL, a tokenizer, and a regex. That is the entire dependency list. By comparison, the RL phase of conventional RLHF needs the policy, a reward model, a PPO critic, and a KL reference (the frozen SFT checkpoint): four models in memory at once.

EXAMPLE

Memory budget for 7B R1-Zero training: 7B policy (training) + 7B ref (frozen, no optimizer) ≈ 14B params. For 7B PPO-RLHF: 7B policy + 7B critic + 7B reward + 7B ref = 28B. Half the GPU memory before you even count activations.
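The arithmetic behind that comparison, as a small sketch. It counts weight memory only, at 2 bytes per parameter for bf16, and deliberately ignores gradients, optimizer state, activations, and the KV cache, all of which add substantially to a real training budget. The helper name `weights_gb` is hypothetical.

```python
BYTES_PER_PARAM_BF16 = 2  # bf16 stores each parameter in 2 bytes


def weights_gb(n_params: float, n_models: int) -> float:
    """Weight memory in GB (1e9 bytes) for n_models bf16 copies of an n_params model.

    Weights only: gradients, optimizer state, and activations are not counted.
    """
    return n_params * BYTES_PER_PARAM_BF16 * n_models / 1e9


r1_zero = weights_gb(7e9, 2)    # policy + frozen reference
ppo_rlhf = weights_gb(7e9, 4)   # policy + critic + reward model + reference

print(f"R1-Zero weights : {r1_zero:.0f} GB")
print(f"PPO-RLHF weights: {ppo_rlhf:.0f} GB")
```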

rule_based_reward(): the entire scorer is 5 lines of Python

The function takes a generated text and the gold answer, runs a regex, returns a scalar in {0.0, 0.1, 1.0, 1.1}. No model, no learned parameters, no human grader in the loop. The function is *purely deterministic*: given the same text and gold, it always returns the same number. That makes the reward signal noiseless — every drop of variance in the training signal comes from the policy's stochastic sampling, not from the reward.

EXAMPLE

rule_based_reward('<think>17+26=43</think><answer>43</answer>', '43') = 1.1
rule_based_reward('The answer is 43.', '43')                          = 0.0

sample_group: K=16 completions, one prompt, one generate() call

The `enc["input_ids"].expand(K, -1)` trick replicates the prompt K times along the batch dimension, so a single call to `generate()` produces all K rollouts as one batch. Note that this batches the work rather than sharing it: each of the K rows still runs its own prefill over the identical prompt, so the saving over K separate calls comes from GPU batching, not from computing the prompt's KV-cache once. Sharing the prompt prefix across rollouts requires an inference engine with prefix caching. Production-scale GRPO runs typically offload sampling to a separate inference server (vLLM is a common choice) to keep the training GPUs busy with the forward/backward pass.

EXAMPLE

On A100s, generate() with K=16 and max_new=1024 takes ~6 seconds per problem. With K=64 it takes ~24 seconds — sampling dominates the wall-clock cost of GRPO, not the gradient step.

rewards: one scalar per completion

This tensor is the only place reward enters the training loop. There is no gradient flowing into `rule_based_reward`; the function is a black box from PyTorch's perspective. The reward tensor's job is to *weight* the policy gradient on each completion, not to be differentiated through. This is the same `r_i` from the math section, just stacked into a tensor.

EXAMPLE

rewards.shape = (16,)  with values like tensor([1.1, 0.0, 1.1, 0.1, 0.0, 1.1, 0.0, 0.0, ...]) — typically a sparse but positive signal once training has been running for a few hundred steps.

A: group-relative advantage, in one line

This line replaces the PPO critic. The advantage `A_i = (r_i - mean(r)) / (std(r) + eps)` uses the *sample mean* of the K rewards as the baseline. Compare to PPO, which would compute `A_i = r_i - V_phi(s_i)` where `V_phi` is a learned value network of the same size as the policy. GRPO does in one tensor op what PPO does with a second neural network. The trade-off is variance — the sample mean is a noisier baseline than a learned critic — but K=16-64 is enough for the variance to be manageable in practice.

EXAMPLE

rewards = [1.1, 0.0, 0.1, 0.0]  =>  mean=0.3, std≈0.535 (sample standard deviation, the default of `rewards.std()`)  =>  A = [+1.49, -0.56, -0.37, -0.56].
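
The same arithmetic can be reproduced without PyTorch. This is a minimal sketch using the standard library; `group_advantages` is a name chosen here for illustration. It uses the sample (n-1) standard deviation, which is what `rewards.std()` computes by default; the population standard deviation of the same group would be ≈0.464 instead.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    # statistics.stdev is the sample (n-1) standard deviation, which matches
    # the default behaviour of torch.Tensor.std() in the training listing.
    std = statistics.stdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rewards = [1.1, 0.0, 0.1, 0.0]
adv = group_advantages(rewards)
print([round(a, 2) for a in adv])

# A group with identical rewards carries no signal: every advantage is zero.
flat = group_advantages([0.0, 0.0, 0.0, 0.0])
print(flat)
```

The first group prints advantages of roughly +1.49, -0.56, -0.37 and -0.56; the second prints all zeros, which is the "no signal" case discussed later in the chapter.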

out_ref: a forward pass through a frozen copy of the BASE model

In conventional RLHF the reference model is the SFT checkpoint; the KL penalty keeps the policy from drifting away from the supervised model. In R1-Zero there IS no SFT checkpoint, so the reference is the BASE model. The KL penalty measures drift away from the original pretrained distribution. This is a subtle but consequential change: a low-β R1-Zero run is essentially saying 'optimize accuracy reward, but stay recognisably close to the base model's output distribution.' The base model's distribution includes all of its weird pretraining habits — language mixing, abrupt topic shifts, code interspersed with prose — and the KL term tolerates those at the cost of letting the policy retain them as well.

EXAMPLE

ref is loaded once at startup with .eval() + .requires_grad_(False). Its memory cost is just the parameters (no optimizer state), so it adds ~14GB for a 7B-bf16 model. No backward pass touches it.

logp_pi and logp_ref: per-token log-probabilities

We take the model logits at positions 0..T-2, log-softmax over the vocab, and gather at the actual next token (position 1..T-1). This gives one scalar log-probability per (sample, time) pair: the model's likelihood of the token that was actually sampled. Same operation on both the policy and the reference. The difference `logp_pi - logp_ref`, evaluated at the sampled token, is a single-sample estimate of the per-token KL contribution.

EXAMPLE

logp_pi.shape = logp_ref.shape = (K, T-1). For K=16, T=1024 that is 16,368 scalars per group — the entire signal that drives the gradient step.
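
The shift-and-gather step is easy to get off by one, so here is a pure-Python sketch of it for a single row. The helper names and the toy logits are assumptions made for illustration; the training listing does the same thing with `F.log_softmax` and `gather` on tensors.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def token_logprobs(logits, tokens):
    """logits: T rows of vocab scores; tokens: T token ids.

    Row t predicts token t+1, so rows 0..T-2 are paired with tokens 1..T-1.
    Returns T-1 log-probabilities, one per predicted position.
    """
    return [log_softmax(logits[t])[tokens[t + 1]] for t in range(len(tokens) - 1)]


# Toy sequence: T = 4 tokens over a vocabulary of 3 (values are made up).
tokens = [0, 2, 1, 1]
logits = [
    [0.1, 0.2, 2.0],   # predicts tokens[1] = 2
    [0.0, 1.5, 0.3],   # predicts tokens[2] = 1
    [0.5, 0.5, 0.5],   # predicts tokens[3] = 1 (uniform, so log(1/3))
    [9.9, 0.0, 0.0],   # last row has no next token and is dropped
]

logp = token_logprobs(logits, tokens)
print(len(logp), [round(v, 3) for v in logp])
```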

comp_mask: only completion tokens get a gradient

The prompt tokens are not the model's output; we did not sample them. Their log-prob is determined by the prompt itself and applying a gradient there would be wrong — it would push the model to assign higher probability to a fixed conditioning string. The mask zeroes out the prompt positions, leaving only the completion positions to contribute. We also multiply by the attention mask to zero out positions past the EOS or pad tokens. Forgetting this mask is the #1 silent bug in from-scratch GRPO implementations: it produces a loss that looks reasonable but is contaminated by gradients on positions the policy did not generate.

EXAMPLE

If prompt is 50 tokens and completion is 800 tokens, comp_mask sums to 800 per row. For K=16 that is 12,800 contributing positions per training step.
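
The index arithmetic behind the mask can be sketched for one row in plain Python. `completion_mask` is a hypothetical helper written for this illustration; it mirrors the `prompt_len - 1` offset and the `attention_mask[:, 1:]` shift used in the training listing.

```python
def completion_mask(prompt_len, seq_len, attention_mask):
    """Mask over the seq_len - 1 predicted positions of one row.

    Position j of the shifted log-prob tensor scores token j + 1, so the first
    completion token (index prompt_len) lives at shifted index prompt_len - 1.
    attention_mask has seq_len entries (1 = real token, 0 = padding).
    """
    mask = [0.0] * (seq_len - 1)
    for j in range(prompt_len - 1, seq_len - 1):
        mask[j] = 1.0 * attention_mask[j + 1]   # drop padded targets
    return mask


# 50 prompt tokens followed by 800 completion tokens, no padding.
full = completion_mask(50, 850, [1] * 850)
print(sum(full))

# Same prompt, but the completion stopped after 300 tokens and was padded to 800.
padded = completion_mask(50, 850, [1] * 350 + [0] * 500)
print(sum(padded))
```

The first row contributes 800 positions and the padded row only 300, which is why the loss is divided by `comp_mask.sum()` rather than by a fixed sequence length.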

loss: policy gradient + KL, masked and averaged over completion tokens

The full GRPO loss for R1-Zero is `L = -E_i[A_i · log π(y_i|x)] + β · KL(π ‖ π_ref)`. The first term is the policy gradient: completions with positive advantage have their log-probability pushed up, completions with negative advantage have theirs pushed down. The second term is the KL penalty: it costs `β` per unit of drift away from the base model. `β=0.04` means a 1-nat drift costs the same as a 0.04 reward gain — the model is willing to trade some accuracy for less drift. Setting β=0 (no KL) is the 'pure' R1-Zero variant; the published DeepSeek run used β in [0.001, 0.04].

EXAMPLE

Typical numbers at step 1000: PG component ≈ -0.62 (negative because positive advantage means lower loss), KL ≈ 0.04 × 26 ≈ 1.04, so loss ≈ -0.62 + 1.04 ≈ 0.42. KL term dominates early; PG term dominates late.
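
The listing's KL term is the plain log-ratio at the sampled token, which can be negative for an individual token. The GRPO formulation in the DeepSeekMath paper instead uses a per-token estimator of the form r − log r − 1 with r = π_ref/π, which is non-negative for every sample. The sketch below compares the two on made-up log-probabilities; the function names are chosen here for illustration and this is not a drop-in replacement for the loss above.

```python
import math


def kl_log_ratio(logp_pi, logp_ref):
    """Plain log-ratio estimate used in the listing: log pi - log pi_ref."""
    return logp_pi - logp_ref


def kl_nonnegative(logp_pi, logp_ref):
    """r - log(r) - 1 with r = pi_ref / pi; zero only when the two agree."""
    log_r = logp_ref - logp_pi
    return math.exp(log_r) - log_r - 1.0


# (logp_pi, logp_ref) pairs for three sampled tokens; values are illustrative.
pairs = [(-0.5, -0.5), (-0.2, -1.0), (-1.5, -0.7)]
for lp, lr in pairs:
    print(f"log-ratio={kl_log_ratio(lp, lr):+.3f}  "
          f"non-negative={kl_nonnegative(lp, lr):.3f}")
```

On the third pair the log-ratio estimate is negative while the second estimator stays positive, which is one reason the choice of estimator affects how the β penalty behaves token by token.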

optimizer.step(): only the policy moves; the reference stays frozen forever

After backward, only the policy's parameters have non-zero gradients. The reference model's parameters have `requires_grad=False`, so they do not appear in the optimizer. This is what makes R1-Zero's memory budget feasible: the reference model is *just parameters*, not parameters + gradients + optimizer state. For AdamW with bf16 weights and two fp32 moment tensors, the optimizer state is ~4x the parameter size, so freezing the reference saves roughly 5x its parameter cost in GPU memory (1x for gradients plus 4x for optimizer state).

EXAMPLE

For a 7B policy: 14GB params + 14GB grads + 56GB AdamW state ≈ 84GB. For a 7B reference: 14GB params only. Ref is essentially free.
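
These figures follow from a simple per-parameter byte count. The sketch below is a back-of-the-envelope calculator under stated assumptions: bf16 weights and gradients (2 bytes each), two fp32 AdamW moment tensors (8 bytes per parameter), no fp32 master weights, and no activations. `training_footprint_gb` is a name invented for this example.

```python
GB = 1e9


def training_footprint_gb(n_params, trained, param_bytes=2, adam_bytes=8):
    """Weights (+ gradients + AdamW moments if trained), excluding activations.

    Assumes bf16 weights and gradients (2 bytes each) and two fp32 AdamW
    moment tensors (8 bytes per parameter), matching the figures above.
    """
    total = n_params * param_bytes                      # weights
    if trained:
        total += n_params * param_bytes                 # gradients
        total += n_params * adam_bytes                  # AdamW m and v
    return total / GB


n = 7e9
policy = training_footprint_gb(n, trained=True)
frozen = training_footprint_gb(n, trained=False)

r1_zero = policy + frozen                               # policy + reference
ppo_rlhf = 2 * policy + 2 * frozen                      # policy + critic + RM + reference
print(policy, frozen, r1_zero, ppo_rlhf)
```

Under these assumptions the R1-Zero setup needs about 98 GB of weight, gradient and optimizer memory against about 196 GB for PPO-RLHF, the same factor of two as the raw parameter count.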

Return dict: what gets logged every step

Four numbers are enough to read an R1-Zero run. (1) `loss` should trend down. (2) `reward` should trend up — from ~0.1 (random base-model garbage) to ~0.9 (almost always correct, almost always formatted) over thousands of steps. (3) `pass@K` answers the question 'does ANY completion get the right answer?' — if this stays at 0 your model never explores a correct response and RL has no signal to amplify. (4) `format` tracks the share of completions that got the format reward without the accuracy reward — early in training this is high (model learns format faster than reasoning), late in training it drops as accuracy catches up.

EXAMPLE

Step 0: loss=0.69, reward=0.15, pass@K=0.4, format=0.6 (base model gets format right sometimes, accuracy rarely).
Step 5000: loss=0.12, reward=0.88, pass@K=0.97, format=0.05 (model is now usually right; few completions are formatted-but-wrong).
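
The three reward-derived metrics can be computed from a plain list of rewards. This is a small sketch that mirrors the thresholds in the training step's return dict; the two reward lists are invented to resemble an early and a late group.

```python
def group_metrics(rewards):
    """Per-group logging stats, mirroring the return dict of the training step."""
    k = len(rewards)
    return {
        "reward": sum(rewards) / k,
        "pass@K": any(r > 0.5 for r in rewards),                 # any correct answer?
        "format": sum(1 for r in rewards if 0.0 < r < 0.5) / k,  # formatted but wrong
    }


# Illustrative groups of K=8 rewards (not from a real run).
early = group_metrics([0.1, 0.0, 0.1, 0.1, 0.0, 1.1, 0.1, 0.0])
late = group_metrics([1.1, 1.1, 1.1, 0.1, 1.1, 1.1, 1.1, 1.1])
print(early)
print(late)
```

Per step, `pass@K` is a boolean for a single prompt; a fractional value such as 0.4 only appears once it is averaged over many prompts.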


"""
A minimal R1-Zero training step on top of a HuggingFace causal LM.

This is what runs inside the inner loop of the DeepSeek-R1-Zero codebase
(and TRL.GRPOTrainer, and OpenRLHF.GRPO) once the parallelism, sampling
servers, and checkpointing are stripped away. The point of this file is
to show that the *learning signal* is exactly the plain-Python version
above, just operating on tensors of shape (K, T) instead of lists of K
strings.

Two contracts to remember:

  1. pi_ref IS pi_base. We never load an SFT checkpoint. The reference
     policy for the KL penalty is a frozen copy of the same base model
     we are training. Drift is measured *against the base model itself*.

  2. The reward is a Python function, not a neural network. The reward
     model in conventional RLHF gets replaced by a regex + a string
     comparison. No backward pass through the scorer.
"""

import re
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer  # used by the caller to load policy/ref

PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(text: str, gold: str) -> float:
    """The entire 'reward model' for R1-Zero on a math problem."""
    m = PATTERN.search(text)
    if m is None:
        return 0.0                                 # no format match: no reward at all
    extracted = m.group(2).strip()
    return (1.0 if extracted == gold else 0.0) + 0.1   # accuracy + format bonus


@torch.no_grad()
def sample_group(policy, tokenizer, prompt: str, K: int = 16, max_new: int = 1024):
    """Sample K completions from the policy at temperature 1.0."""
    enc = tokenizer(prompt, return_tensors="pt").to(policy.device)
    out = policy.generate(
        enc["input_ids"].expand(K, -1),            # same prompt, K rows
        max_new_tokens=max_new,
        do_sample=True, temperature=1.0, top_p=1.0,
        return_dict_in_generate=True,
    )
    texts = tokenizer.batch_decode(out.sequences[:, enc["input_ids"].size(1):],
                                   skip_special_tokens=True)
    return out.sequences, texts  # (K, T_total)  +  list[str]


def grpo_step(
    policy: nn.Module,         # the model we are training (init: base)
    ref:    nn.Module,         # frozen copy of the *base* model
    tokenizer,
    prompt: str,
    gold: str,
    optimizer,
    K: int = 16,
    beta: float = 0.04,        # KL coefficient (DeepSeek-R1-Zero used 0.001-0.04)
    clip_eps: float = 0.2,     # not used in this single-update sketch (no ratio clipping)
):
    # 1. Sample K completions from the current policy.
    seqs, texts = sample_group(policy, tokenizer, prompt, K=K)

    # 2. Score every completion with the rule-based reward.
    rewards = torch.tensor(
        [rule_based_reward(t, gold) for t in texts],
        device=policy.device, dtype=torch.float32,
    )                                              # (K,)

    # 3. Group-relative advantage. NO critic network involved.
    #    rewards.std() is the sample (n-1) standard deviation by default.
    A = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (K,)

    # 4. Forward pass through the *policy* and the *reference* model
    #    on the FULL sequence (prompt + completion). We need per-token
    #    log-probs to compute both the policy-gradient term and the KL.
    #    Fall back to EOS when the tokenizer defines no pad token.
    pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
    attention_mask = (seqs != pad_id).long()
    out_pi = policy(input_ids=seqs, attention_mask=attention_mask)
    with torch.no_grad():                          # the reference never needs a graph
        out_ref = ref(input_ids=seqs, attention_mask=attention_mask)

    # Per-token log p_theta(y_t | y_<t) and p_ref(y_t | y_<t).
    logits_pi  = out_pi.logits[:,  :-1, :]
    logits_ref = out_ref.logits[:, :-1, :]
    targets    = seqs[:, 1:]
    logp_pi    = F.log_softmax(logits_pi,  dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    logp_ref   = F.log_softmax(logits_ref, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # 5. Only the COMPLETION tokens contribute to the loss; mask out the prompt.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
    comp_mask  = torch.zeros_like(logp_pi)
    comp_mask[:, prompt_len - 1 :] = 1.0           # shifted index of the first completion token
    comp_mask  = comp_mask * attention_mask[:, 1:] # drop padded targets

    # 6. The R1-Zero / GRPO loss.
    #    - Policy-gradient term: -A_i * sum_t log_pi(token)        (per completion)
    #    - KL penalty:           +beta * sum_t (log_pi - log_ref)  (per token)
    #      This is the plain log-ratio estimate of the KL at the sampled token.
    pg_per_token = -A.unsqueeze(-1) * logp_pi                       # (K, T-1)
    kl_per_token = logp_pi - logp_ref                               # (K, T-1)
    loss = ((pg_per_token + beta * kl_per_token) * comp_mask).sum() / comp_mask.sum()

    # 7. Backprop and step. The reference model is frozen: no autograd
    #    flows into it, so no double-memory cost beyond the forward pass.
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
    optimizer.step()

    return {
        "loss":     loss.item(),
        "reward":   rewards.mean().item(),
        "pass@K":   (rewards > 0.5).any().item(),  # any completion scored a correct answer?
        "format":   ((rewards > 0.0) & (rewards < 0.5)).float().mean().item(),
    }

At Massive Scale: Why R1-Zero Worked on 671B and Not on 7B

The R1-Zero hypothesis is sensitive to scale in a way that almost no other RL technique is. Three concrete reasons explain why DeepSeek published a 671B success story and not a 7B one, and why small-model replications of R1-Zero have been so much weaker than their large-model siblings.

Pass@K floor of the base model

R1-Zero can only learn from a group of $K$ completions if their rewards differ. If every completion fails, every reward equals the group mean, every advantage is 0, and the policy-gradient term vanishes: the model has gathered no information from the prompt. For a weak base model the binding constraint is producing at least one correct completion, so the relevant statistic is the base model's $\text{pass@}K$ on the training set:

$P(\text{group has signal}) = 1 - \bigl(1 - p_1\bigr)^K,$

where $p_1$ is the base model's single-shot accuracy on a problem. For DeepSeek-V3-Base on AIME-level math, $p_1 \approx 0.15$; with $K = 16$ the probability that the group contains at least one correct completion is $1 - 0.85^{16} \approx 0.93$, so almost every prompt produces useful signal. For a 7B base model with $p_1 \approx 0.02$, the same calculation gives $1 - 0.98^{16} \approx 0.28$: less than a third of prompts produce signal, and training is starved.
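
The formula is simple enough to tabulate. A minimal sketch, assuming the K samples are independent draws with the same per-sample success probability $p_1$; `p_group_has_signal` is a name chosen for this example.

```python
def p_group_has_signal(p1: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p1) ** k


for p1 in (0.15, 0.02):
    for k in (16, 64):
        print(f"p1={p1:.2f}  K={k:2d}  P(signal)={p_group_has_signal(p1, k):.2f}")
```

Raising K from 16 to 64 lifts the weak model from about 0.28 to about 0.73, which shows the lever a small-model replication has: a larger group buys signal at a proportional increase in sampling cost.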

Memory budget and the death of the critic

At 671B parameters in bf16, the policy alone is ~1.34 TB of weights. Add gradients (1.34 TB) and AdamW optimizer state (5.4 TB) and you are at ~8 TB for the policy without activations — already requiring hundreds of H800s. PPO would demand a critic of the same size, doubling the budget. GRPO's replacement of the critic with a one-line group-mean baseline is what makes R1-Zero feasible at this scale, not just elegant. At 7B the saving is real but not decisive; at 671B it is the difference between a possible experiment and an impossible one.

Sampling throughput and the rollout cost

Each GRPO step samples K completions per prompt at up to 32K tokens each. With K=64 and a 64-prompt batch, one step can generate up to 64 × 64 × 32,768 ≈ 134M completion tokens, and that rollout volume has to be regenerated at every step. R1-Zero was only feasible because DeepSeek had built a fully separated rollout architecture: a fleet of vLLM inference servers running the policy snapshot, decoupled from the training cluster running the gradient step, with the policy weights synced periodically across them. Without that infrastructure, the wall-clock cost of a single GRPO step on a 671B model would be measured in hours rather than minutes.
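
The upper bound on rollout volume is a three-factor product. The sketch below assumes that "32K" means 32 × 1024 tokens and that every completion runs to the maximum length, so the result is a ceiling rather than a typical value.

```python
def max_rollout_tokens(prompts_per_step: int, k: int, max_new_tokens: int) -> int:
    """Upper bound on completion tokens sampled in one GRPO step."""
    return prompts_per_step * k * max_new_tokens


# Assumption: "32K" is taken as 32 * 1024 tokens.
upper = max_rollout_tokens(prompts_per_step=64, k=64, max_new_tokens=32 * 1024)
print(f"{upper:,} tokens")
```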

Component	PPO + RLHF	R1-Zero
SFT model	yes (~10⁶ labeled examples)	no
Reward model	yes (~10⁵ preference pairs + training)	no (regex)
Policy	yes	yes
Critic	yes (~same size as policy)	no (group-mean baseline)
KL reference	the SFT model	the BASE model
Networks in memory during RL	4 (policy, critic, ref, RM)	2 (policy, ref)
Networks trained during RL	2 (policy, critic)	1 (policy)
Reward latency per sample	~50ms (RM forward pass)	~10μs (regex)

Engineering Reality: What Pure RL Breaks

R1-Zero is a strong reasoning model and a weak chat model. The same minimalism that makes the recipe work also produces a set of failure modes that you do not see in SFT-trained systems. The chapter's remaining sections (§16.3–§16.6) go into each in detail; here is the short catalogue so the reader knows what is coming.

Language mixing. With no readability reward and a base model that has seen Chinese, English, and code in similar proportions, the policy can drift toward thinking in one language and answering in another. Inside a single <think> block, R1-Zero will sometimes switch from English prose to Chinese characters to Python syntax and back, all in the service of solving the problem. The answer is still correct; the trace is unreadable to a human reviewer.
The “aha moment.” Around steps 1k–3k of training, R1-Zero begins spontaneously emitting self-correction phrases — “Wait, that's wrong, let me try a different approach” — without ever having been shown such phrases in an SFT corpus. The next section traces this phenomenon and its implications. It is the single most-cited result of the R1-Zero paper.
Reward hacking is constrained but not eliminated. Because the reward is a deterministic regex, you cannot “fool the reward model” the way you can with a learned scorer. But the regex can be gamed: the policy can learn to emit <answer>43</answer> inside the <think> block to maximise the chance the regex extractor picks the right string. The DeepSeek paper documents several such micro-exploits and the regex hardening required to close them.
No chat capability. R1-Zero is trained exclusively on verifiable-reward problems. On open-ended user requests it continues to output <think>…</think><answer>…</answer> even when the user wanted a casual answer. R1 (the production model) fixes this with two further training stages on top of R1-Zero — covered in chapter 17.
Sensitivity to reward-function bugs. Because there is no learned reward model to smooth over edge cases, a bug in the regex propagates directly into the policy. An early version of R1-Zero's extractor accepted any number inside the <answer> tag, including trailing whitespace and punctuation. The policy learned to emit “43.” (with a period) on problems where the gold was “43” — harmless until the extractor was tightened, at which point pass-rate collapsed overnight. The reward function in pure RL is, in effect, part of the model architecture.
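
The "43." failure in the last item can be reproduced in a few lines. This is a hypothetical sketch, not DeepSeek's extractor: `normalize_answer`, `extract_strict` and `extract_lenient` are names invented here to show how two plausible scorers disagree on the same completion.

```python
import re

ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize_answer(raw: str) -> str:
    """Hypothetical normaliser: trim whitespace and trailing punctuation."""
    return raw.strip().rstrip(".,;:!").strip()


def extract_strict(text: str):
    """Exact contents of the first <answer> tag, whitespace-trimmed only."""
    m = ANSWER.search(text)
    return m.group(1).strip() if m else None


def extract_lenient(text: str):
    """Same extraction, but passed through normalize_answer."""
    m = ANSWER.search(text)
    return normalize_answer(m.group(1)) if m else None


completion = "<think>17+26=43</think><answer> 43. </answer>"
strict_ok = extract_strict(completion) == "43"     # strict scorer rejects "43."
lenient_ok = extract_lenient(completion) == "43"   # lenient scorer accepts it
print(strict_ok, lenient_ok)
```

A policy trained against the lenient scorer has no incentive to drop the period, so swapping in the strict scorer mid-run changes the reward for completions the policy already produces.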

The single sentence to remember. R1-Zero's bet was that reasoning is latent in the base model and that a regex, applied to enough samples, is a sharper teacher than a million human annotators. The bet won at 671B; whether it generalises beyond verifiable-reward domains is the open question of the next several years of post-training research.
