The Real Problem: Where Do Reasoning Traces Come From?
For most of the post-ChatGPT era, “teaching” a large language model to reason has meant exactly one thing: collect a corpus of high-quality, step-by-step solutions written by humans (or by another, already-reasoning model), and supervise-fine-tune the base model on that corpus. Every reasoning model from GPT-4's chain-of-thought capability through OpenAI's o1 went through some version of this pipeline: base model → SFT on chain-of-thought → reward model trained on preferences → PPO. Reasoning was treated as a behaviour that had to be taught, and the supervised step was the place where the teaching happened.
That pipeline has one expensive, brittle bottleneck: the SFT corpus. A real reasoning SFT dataset is hundreds of thousands to millions of examples, each containing a problem and a careful, end-to-end worked-out solution. Where do those solutions come from? Three unsatisfying answers: (1) you pay human annotators — expensive, slow, and inconsistent at the scale of millions of examples; (2) you distill from a stronger existing reasoning model — but that only works if someone has already built one, and the resulting model cannot exceed its teacher; (3) you mix the two with rejection sampling — better, but now you need the stronger model AND the human annotators AND a quality filter, and the loop is still bounded by whichever model you started with.
On top of the SFT problem, the conventional pipeline carries two more learned networks. The first is the reward model, which approximates the human preference signal and has its own training cost, scaling problems, and well-documented pathologies (length bias, format bias, reward hacking; see §14.3). The second is a PPO critic, a second transformer the size of the policy that estimates state value and doubles the GPU memory budget of the RL phase. By the time you have all three (policy, reward model, critic) plus a frozen reference for the KL term, you are holding four networks in memory at once, two of them still being trained, and the SFT step is the cheapest stage of the whole pipeline.
Intuition: Latent Skill vs Taught Skill
A modern base model has been trained on a staggering amount of text that already contains reasoning: math textbooks, Stack Overflow threads, GitHub repositories, scientific papers, competition solutions, Wikipedia derivations. By the time the base model has finished pretraining on tens of trillions of tokens, it has seen countless examples of someone working through a problem step by step, catching a mistake, backtracking, and trying again. The patterns are in the weights. The question is whether the policy — the conditional distribution over the next token given a problem statement — will actually use those patterns when prompted.
Here is the analogy that crystallises the R1-Zero hypothesis. Imagine a chess grandmaster who has spent years studying opening theory and endgame manuals but who, in actual games, plays the first move that feels right and then panics when their opponent deviates. The knowledge is in there. They do not need a tutor to teach them how to calculate variations; they need a context in which calculating variations is consistently rewarded. Give them a series of games where playing carefully wins and playing carelessly loses, and over time their behaviour will shift — not because they learned new chess theory, but because the existing theory was now the path of least resistance.
That is what R1-Zero proposes for language models. The base model contains, somewhere in its conditional distribution, a thin slice of probability mass corresponding to long, careful, mistake-correcting chains of thought. The SFT step does not create that slice; it just amplifies it by training the model on examples drawn from it. But amplification by SFT is expensive (you need millions of annotated examples) and limited (you can only amplify what your teachers can produce). Amplification by reinforcement learning against a verifiable reward is, in principle, neither: a regex that checks whether the final answer is correct can be applied to billions of self-generated samples for free.
The R1-Zero Hypothesis, Formally
The R1-Zero paper's central claim can be stated in one sentence with three pieces:
Hypothesis (R1-Zero). Starting from a sufficiently strong base model $\pi_{\text{base}}$, optimising the policy directly with reinforcement learning against a rule-based, verifiable reward (with no supervised fine-tuning, no learned reward model, and no value network) is sufficient to elicit long-form chain-of-thought reasoning, self-verification, and backtracking behaviours that match or exceed pipelines built on large supervised reasoning corpora.
Three load-bearing phrases in that sentence deserve a closer look. Sufficiently strong means, in practice and as of 2024, a 671-billion-parameter Mixture-of-Experts model (DeepSeek-V3-Base) with ~37B active parameters, trained on 14.8T tokens with a heavy share of math and code. On a 7B model the same recipe substantially underperforms a 7B SFT baseline; on a 671B base model it produces a reasoning model competitive with the strongest closed systems of its era. The hypothesis is not “RL alone works at every scale”; it is “RL alone works at scale.”
Verifiable means the reward function can read the model's output and decide, deterministically and without learning, whether it is correct. Math problems with closed-form answers, coding problems with unit tests, and logic puzzles with a unique solution all admit verifiable rewards. Creative writing, open-ended dialogue, and most chat tasks do not — which is precisely why R1-Zero is a reasoning hypothesis, not a general-purpose RLHF hypothesis. (Chapter 17 covers how R1 stitches together the reasoning-RL phase with conventional alignment to get a model that is good at both.)
No SFT means the policy at step 0 of RL is $\pi_{\text{base}}$ itself, the same model that produces “blue sky 1+1=2 the wave function collapses” when you ask it “What is 17+26?” with no chat template. The reference policy used for the KL penalty is also the base model: $\pi_{\text{ref}} = \pi_{\text{base}}$. There is no SFT checkpoint anywhere in the pipeline. That is the part of the hypothesis that surprised the field most.
Visualizing the Two Pipelines
Compare the two paths. The conventional path has five stages and four trained networks (SFT model, reward model, critic, policy). R1-Zero has three stages and two networks (policy, frozen reference). The arrows in each pipeline do not just represent passes of compute; they represent assumptions about where reasoning comes from.
The Mathematical Idea: GRPO With $\pi_{\text{ref}} = \pi_{\text{base}}$
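R1-Zero is, in equation form, the GRPO objective applied with the base model as both the starting policy and the reference. Let $\pi_\theta$ be the policy being trained, with parameters $\theta$ initialised to the base-model weights. For a prompt $q$ with gold answer $y^\star$, the policy samples a group of $K$ completions $o_1, \dots, o_K$. Each completion is scored by the rule-based reward $r_i = R(o_i, y^\star)$, and the group-relative advantage of completion $i$ is
$$A_i = \frac{r_i - \mu}{\sigma + \delta}, \qquad \mu = \frac{1}{K}\sum_{j=1}^{K} r_j, \qquad \sigma = \sqrt{\frac{1}{K}\sum_{j=1}^{K} (r_j - \mu)^2},$$
where $\delta$ is a small constant that guards against division by zero. The advantage is a per-completion scalar, positive when the completion was above the group's average and negative when it was below. Every token in completion $i$ inherits the same advantage; GRPO does not allocate credit at the token level. The objective for one prompt is then the standard clipped-PG objective with a KL penalty:
$$\mathcal{J}(\theta) = \frac{1}{K}\sum_{i=1}^{K} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big[ \min\big(\rho_{i,t} A_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\, A_i\big) - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big] \Big],$$
where $\rho_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \,/\, \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the importance ratio between the current policy and the policy snapshot at the start of the GRPO step, and $\varepsilon$ is the PPO-style clipping range. The KL term penalises drift away from the reference policy $\pi_{\text{ref}}$, and the coefficient $\beta$ controls how willing the policy is to trade reward for distributional drift.
Every symbol in those two equations is identical to a standard PPO + KL-to-SFT setup, except for two pieces. First, $A_i$ is the group-relative advantage rather than an estimate built from a learned value function, eliminating the critic network $V_\phi$. Second, $\pi_{\text{ref}}$ is the base model, not the SFT model. The first change cuts memory in half; the second change is the whole hypothesis.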
Rule-Based Reward: No Critic, No Reward Model
The reward function in R1-Zero has two scalar components and is defined entirely in code: not learned, not human-rated, not approximated. For a math problem with gold answer $y^\star$, the reward of a completion $o$ is
$$R(o, y^\star) = 1.0 \cdot \mathbb{1}\big[\operatorname{extract}(o) = y^\star\big] + 0.1 \cdot \mathbb{1}\big[\operatorname{format\_ok}(o)\big],$$
where $\operatorname{extract}(o)$ pulls the contents of the <answer> tag out of the completion and $\operatorname{format\_ok}(o)$ checks whether the “think then answer” template is present. The accuracy reward is binary and dominant (weight 1); the format reward is also binary but small (weight 0.1). Both are deterministic functions of the text and the gold answer.
Three observations follow immediately. First, this reward function runs in microseconds per completion — comparable to the tokenizer, orders of magnitude cheaper than a learned reward model. Second, there is no gradient signal flowing into the reward function, which means there is no reward hacking in the usual sense (you cannot “adversarially exploit” a regex by writing nicer prose). Third, the reward is sparse but verifiable: most completions in early training get 0.0 or 0.1, and only the ones that actually solve the problem get 1.1. The model has to find a correct solution by sampling before it can learn anything — which is why the experiment only works at scale, where the base model already gets a non-negligible pass@K on the training set.
| Component | Value when present | What it teaches |
|---|---|---|
| Accuracy | 1.0 | The final extracted answer matches the gold answer. Dominant signal. |
| Format | 0.1 | The completion follows <think>...</think><answer>...</answer>. Small but non-zero — gives a learning signal even on completions with wrong answers. |
| Length / readability | (none) | Deliberately omitted in R1-Zero. The model is allowed to write completions in any language, any style, any length. Failure mode: language mixing and unreadable internal monologues. R1 fixes this by adding a readability reward in stage 2. |
| Helpfulness / harmlessness | (none) | Also deliberately omitted. R1-Zero is a reasoning model, not a chat model. R1 adds these in stage 4. |
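
The two reward components in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `extract_answer`, `format_ok`, and `rule_based_reward`, the exact regular expressions, and the choice of exact string matching after whitespace stripping are all hypothetical and are not taken from the R1-Zero codebase.

```python
import re

# Hypothetical patterns for illustration only; the real extractor may differ.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
TEMPLATE_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)


def extract_answer(completion):
    """Return the stripped contents of the first <answer> tag, or None."""
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match else None


def format_ok(completion):
    """True when the whole completion is <think>...</think><answer>...</answer>."""
    return TEMPLATE_RE.fullmatch(completion) is not None


def rule_based_reward(completion, gold):
    """Accuracy (weight 1.0) plus format (weight 0.1), both binary."""
    accuracy = 1.0 if extract_answer(completion) == gold else 0.0
    fmt = 0.1 if format_ok(completion) else 0.0
    return accuracy + fmt


examples = [
    "<think>17+26=43</think><answer>43</answer>",  # correct and formatted
    "<think>17+26=33</think><answer>33</answer>",  # formatted, wrong answer
    "The answer is 43.",                           # no tags at all
]
for text in examples:
    print(round(rule_based_reward(text, "43"), 2), "|", text)
```

Note the design consequence of exact matching: a completion that states the right number outside the `<answer>` tag earns nothing, so any looseness or strictness in the extractor is inherited directly by the policy.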
Manual Numerical Walkthrough: One GRPO Step on a Math Problem
Follow the arithmetic below with a pencil. The numbers are tiny on purpose: once you can compute one GRPO step by hand, the PyTorch implementation is just shape-bookkeeping on top of this.
The setup. One prompt: What is 17 + 26? with gold answer $y^\star = \text{"43"}$. We sample $K = 4$ completions from the base policy:
| i | completion (truncated) | accuracy? | format? | r |
|---|---|---|---|---|
| 0 | <think>17+26=43</think><answer>43</answer> | ✓ | ✓ | 1.1 |
| 1 | The answer is 43. | ✓ (text) | ✗ | 0.0 |
| 2 | <think>17+26=33</think><answer>33</answer> | ✗ | ✓ | 0.1 |
| 3 | </think> blue sky 1+1=2 … | ✗ | ✗ | 0.0 |
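Step 1 — Group mean and std. $\mu = (1.1 + 0.0 + 0.1 + 0.0)/4 = 0.30$. $\sigma = \sqrt{(0.80^2 + 0.30^2 + 0.20^2 + 0.30^2)/4} = \sqrt{0.215} \approx 0.464$ (population standard deviation).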
Step 2 — Advantages.
| i | r − μ | (r − μ) / σ | A |
|---|---|---|---|
| 0 | +0.80 | +0.80 / 0.464 | +1.724 |
| 1 | −0.30 | −0.30 / 0.464 | −0.647 |
| 2 | −0.20 | −0.20 / 0.464 | −0.431 |
| 3 | −0.30 | −0.30 / 0.464 | −0.647 |
Step 3 — What each completion's tokens are about to learn. Every token in completion 0 will have its log-probability under $\pi_\theta$ nudged up by something proportional to $A_0 = +1.724$. Every token in completions 1, 2, 3 will have its log-prob nudged down. Notice that completions 1 and 3 get the same magnitude of downward pressure (both at $A = -0.647$): from a single problem, GRPO cannot tell “correct content, wrong format” apart from “garbled output.” The fine-grained discrimination comes from aggregating across thousands of different problems: across a batch, the “correct content” tokens of completion 1 will eventually appear in other contexts where they ARE rewarded, while the “blue sky” tokens of completion 3 will not.
Step 4 — KL contribution. Suppose the current policy and the base reference still agree on most tokens, so that the per-token KL estimate, averaged over completion 0's 26 generated tokens, is small. The KL term then adds the KL coefficient $\beta$ times that average to the loss for that completion: small in absolute terms but persistent across every step, acting as a brake on distribution drift.
Step 5 — The composite loss for this group. The policy-gradient term is the negative advantage-weighted log-prob, summed across all four completions. If the per-completion summed log-probs were (−26.0, −8.0, −22.0, −45.0), the PG term is $-\big[(1.724)(-26.0) + (-0.647)(-8.0) + (-0.431)(-22.0) + (-0.647)(-45.0)\big] \approx +1.05$. The exact number depends on the token count and, because the four terms nearly cancel, on how the advantages are rounded; what matters is the sign pattern: positive-advantage completions reduce the loss when their tokens become more likely, and negative-advantage completions reduce the loss when their tokens become less likely.
What changes after one update. The model has taken one tiny step toward emitting more completion-0-like text (long, formatted, correct) and away from completion-3-like text (garbled). Compound this over hundreds of thousands of math problems and the policy drifts from the base model's chaotic distribution into a recognisable “think-then-answer” reasoning mode.
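The hand arithmetic in Steps 1, 2, and 5 can be checked with a few lines of plain Python. This is a sketch, assuming a population standard deviation and a small stabilising constant `eps` in the denominator; the helper name `group_stats` is hypothetical. Because it keeps full precision, the third decimal of each advantage can differ slightly from the hand-rounded table, and the policy-gradient term comes out near 1.08 rather than the roughly 1.05 you get from three-decimal advantages (the four products nearly cancel, so rounding matters).

```python
import math

# Numbers copied from the walkthrough tables above.
rewards = [1.1, 0.0, 0.1, 0.0]
summed_logps = [-26.0, -8.0, -22.0, -45.0]


def group_stats(rewards, eps=1e-8):
    """Population mean, population std, and normalised advantages of a group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    advantages = [(r - mean) / (std + eps) for r in rewards]
    return mean, std, advantages


mean, std, advantages = group_stats(rewards)

# Step 5: negative advantage-weighted log-prob, summed over the four completions.
pg_term = -sum(a * lp for a, lp in zip(advantages, summed_logps))

print(f"mean={mean:.3f}  std={std:.3f}")
print("advantages:", [round(a, 3) for a in advantages])
print(f"pg_term={pg_term:.3f}")
```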
Interactive: Group-Relative Advantages
Vary the four rewards to see how the advantage signal responds. Two edge cases are worth working through. (1) The easy problem: all four rewards at 1.1. The group std collapses to zero, every advantage becomes 0/0 (rescued by the small stabilising constant in the denominator, which turns it into 0), and the model learns nothing. That is correct behaviour: if every sample succeeds, there is nothing to amplify. (2) The hard problem: all four rewards at 0. Same outcome, zero advantages, no learning signal. For R1-Zero to make progress, the base model must already produce a mixed group: some right, some wrong, on the same prompt. This is why R1-Zero only works on a base model strong enough to get non-trivial pass@K on the training set. That is the single biggest scale-dependent condition for the hypothesis to hold.
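The two degenerate cases can be reproduced without any widget. The sketch below assumes the same population-std normalisation with a small constant `eps` in the denominator; `has_learning_signal` and its tolerance are hypothetical helpers introduced only for this illustration.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages; eps keeps the 0/0 case finite."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def has_learning_signal(rewards, tol=1e-6):
    """A group teaches the policy something only if some advantage is non-zero."""
    return any(abs(a) > tol for a in group_advantages(rewards))


presets = {
    "easy (all correct)": [1.1, 1.1, 1.1, 1.1],
    "hard (all wrong)": [0.0, 0.0, 0.0, 0.0],
    "mixed (one correct)": [1.1, 0.0, 0.0, 0.0],
}
for name, rewards in presets.items():
    rounded = [round(a, 3) for a in group_advantages(rewards)]
    print(f"{name}: signal={has_learning_signal(rewards)} advantages={rounded}")
```

Only the mixed group produces non-zero advantages, which is the condition the paragraph above describes.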
Plain Python: Rule-Based Reward + Group Advantage
Here is the entire R1-Zero learning signal — reward function plus group-relative advantage plus a sketch of what the policy update does to log-probabilities — in 60 lines of plain Python. The transformer is replaced by a list of strings; the gradient step is replaced by a manual log-probability update; but the math is identical to what runs on 671B parameters.
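A minimal sketch of that learning signal follows, under loud assumptions. The "policy" is a softmax over four fixed candidate strings rather than a transformer, the update is plain gradient ascent on the advantage-weighted log-probability (no clipping and no KL term), and every name, regex, and hyperparameter (`train`, `group_size=8`, `lr=0.1`, `seed=0`) is hypothetical and chosen only for illustration.

```python
import math
import random
import re

# Toy stand-in for the policy: a softmax over four fixed completions.
COMPLETIONS = [
    "<think>17+26=43</think><answer>43</answer>",
    "The answer is 43.",
    "<think>17+26=33</think><answer>33</answer>",
    "</think> blue sky 1+1=2",
]
GOLD = "43"
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
TEMPLATE_RE = re.compile(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL)


def reward(text, gold):
    """Binary accuracy (1.0) plus binary format (0.1); illustrative regexes."""
    match = ANSWER_RE.search(text)
    accuracy = 1.0 if match and match.group(1).strip() == gold else 0.0
    fmt = 0.1 if TEMPLATE_RE.fullmatch(text) else 0.0
    return accuracy + fmt


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def group_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def train(steps=300, group_size=8, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(COMPLETIONS)  # uniform start plays the role of the base model
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a group of completions (as indices) from the current policy.
        group = rng.choices(range(len(COMPLETIONS)), weights=probs, k=group_size)
        adv = group_advantages([reward(COMPLETIONS[i], GOLD) for i in group])
        # d/d logits[k] of sum_i A_i * log softmax(logits)[o_i]
        #   = sum_i A_i * (1[o_i == k] - probs[k])
        grad = [0.0] * len(logits)
        for idx, a in zip(group, adv):
            for k in range(len(logits)):
                grad[k] += a * ((1.0 if k == idx else 0.0) - probs[k])
        logits = [l + lr * g / group_size for l, g in zip(logits, grad)]
    return softmax(logits)


final_probs = train()
for text, p in zip(COMPLETIONS, final_probs):
    print(f"{p:.3f} | {text}")
```

Once almost every sampled group contains only the correct completion, the advantages vanish and the toy policy stops moving, which mirrors the "easy problem" case discussed earlier.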
PyTorch: The R1-Zero GRPO Step
Stepping up from plain Python to PyTorch changes very little conceptually. The reward function stays a regex. The advantage stays a one-liner. The policy is now a HuggingFace causal LM, the reference is a frozen copy of the base model, and the log-prob math is done in tensors instead of Python floats. The hardest part of production GRPO is not in this file — it is in the parallel sampling infrastructure (vLLM-based rollout servers, async reward scoring, KL distillation across stages). The inner loop itself is simpler than PPO + RLHF, not more complex.
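As a hedged illustration of that inner loop, the sketch below computes a GRPO-style loss from per-token log-probabilities that are assumed to have been gathered already. It is not the DeepSeek implementation: the function name `grpo_loss`, the tensor layout, the defaults `clip_eps=0.2` and `beta=0.04`, and the non-negative KL estimator `exp(x) - x - 1` with `x = logp_ref - logp_new` are all assumptions made for this example, and no HuggingFace model or sampling server is involved.

```python
import torch


def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask,
              clip_eps=0.2, beta=0.04):
    """Clipped surrogate with a per-token KL penalty, averaged per completion.

    logp_new, logp_old, logp_ref: (K, T) log-probs of the sampled tokens under
        the current policy, the rollout snapshot, and the frozen reference.
    advantages: (K,) one group-relative advantage per completion.
    mask: (K, T) with 1.0 on real completion tokens and 0.0 on padding.
    clip_eps and beta are placeholder values, not taken from the text.
    """
    ratio = torch.exp(logp_new - logp_old)
    adv = advantages.unsqueeze(1)  # every token inherits its completion's advantage
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    surrogate = torch.minimum(unclipped, clipped)

    # Assumed KL estimator: exp(x) - x - 1 >= 0, zero when the policies agree.
    x = logp_ref - logp_new
    kl = torch.exp(x) - x - 1.0

    per_token = surrogate - beta * kl
    per_completion = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return -per_completion.mean()  # minimise the negative objective


# Tiny usage example with made-up log-probs for K=4 completions of T=6 tokens.
torch.manual_seed(0)
K, T = 4, 6
logp_old = -torch.rand(K, T)
logp_ref = logp_old.clone()
logp_new = logp_old.clone().requires_grad_(True)
rewards = torch.tensor([1.1, 0.0, 0.1, 0.0])
centered = rewards - rewards.mean()
advantages = centered / (centered.pow(2).mean().sqrt() + 1e-8)
mask = torch.ones(K, T)

loss = grpo_loss(logp_new, logp_old, logp_ref, advantages, mask)
loss.backward()
print(loss.item(), logp_new.grad[:, 0])
```

At the first step the three policies coincide, so the ratio is 1 and the KL estimate is 0; the gradient on each token is then just the negated, length-normalised advantage of its completion.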
At Massive Scale: Why R1-Zero Worked on 671B and Not on 7B
The R1-Zero hypothesis is sensitive to scale in a way that almost no other RL technique is. Three concrete reasons explain why DeepSeek published a 671B success story and not a 7B one, and why small-model replications of R1-Zero have been so much weaker than their large-model siblings.
Pass@K floor of the base model
R1-Zero can only learn from a group of completions if at least one of them gets a non-zero reward. If every completion fails, the group mean is 0, every advantage is 0, and the gradient is 0: the model has gathered no information from the prompt. The relevant statistic is therefore the base model's pass@$K$ on the training set:
$$\text{pass@}K = 1 - (1 - p)^K,$$
where $p$ is the base model's single-shot accuracy on a problem. For DeepSeek-V3-Base on AIME-level math, $p$ is large enough that the probability that a group of $K$ samples contains at least one correct completion is close to 1, so almost every prompt produces useful signal. For a 7B base model with a much smaller $p$, the same calculation gives a probability below one third: fewer than a third of prompts produce signal, and training is starved.
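The shape of that curve is easy to tabulate. The sketch below assumes the $K$ samples are independent draws with the same success probability $p$; the function names are hypothetical, the accuracies in the loop are illustrative values rather than figures from the text, and $K = 64$ is borrowed from the rollout-cost discussion later in this section. The second function adds the stricter condition from the earlier discussion of edge cases: a group is only useful if it is mixed.

```python
def p_at_least_one_correct(p, k):
    """pass@K under independence: 1 - (1 - p)^K."""
    return 1.0 - (1.0 - p) ** k


def p_mixed_group(p, k):
    """Probability the group holds both a correct and an incorrect sample."""
    return 1.0 - (1.0 - p) ** k - p ** k


K = 64
for p in (0.001, 0.01, 0.1, 0.5):  # illustrative accuracies, not from the text
    print(f"p={p:<6} pass@{K}={p_at_least_one_correct(p, K):.3f} "
          f"mixed={p_mixed_group(p, K):.3f}")
```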
Memory budget and the death of the critic
At 671B parameters in bf16, the policy alone is ~1.34 TB of weights. Add gradients (1.34 TB) and AdamW optimizer state (5.4 TB) and you are at ~8 TB for the policy without activations — already requiring hundreds of H800s. PPO would demand a critic of the same size, doubling the budget. GRPO's replacement of the critic with a one-line group-mean baseline is what makes R1-Zero feasible at this scale, not just elegant. At 7B the saving is real but not decisive; at 671B it is the difference between a possible experiment and an impossible one.
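The memory figures above follow from simple per-parameter byte counts. The sketch below assumes bf16 weights and gradients (2 bytes each) and an AdamW state of two fp32 moments (8 bytes per parameter), with no separate fp32 master weights and no activations; those are the assumptions that reproduce the numbers in the paragraph, and `training_state_tb` is a hypothetical helper.

```python
def training_state_tb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Return (weights, grads, optimizer, total) in decimal terabytes."""
    tb = 1e12
    weights = n_params * weight_bytes / tb
    grads = n_params * grad_bytes / tb
    optim = n_params * optimizer_bytes / tb
    return weights, grads, optim, weights + grads + optim


weights, grads, optim, total = training_state_tb(671e9)
print(f"weights={weights:.2f} TB  grads={grads:.2f} TB  "
      f"optimizer={optim:.2f} TB  total={total:.2f} TB")

# A same-size PPO critic would need its own weights, gradients, and optimizer state.
print(f"policy + critic = {2 * total:.2f} TB")
```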
Sampling throughput and the rollout cost
Each GRPO step samples K completions per prompt at up to 32K tokens each. With K=64 and a 64-prompt batch, one step can generate up to ~130M tokens of completion (64 × 64 × 32K), and that generation bill is paid again on every step. R1-Zero was only feasible because DeepSeek had built a fully separated rollout architecture: a fleet of vLLM inference servers running the policy snapshot, decoupled from the training cluster running the gradient step, with the policy weights synced periodically across them. Without that infrastructure, the wall-clock cost of a single GRPO step on a 671B model would be measured in hours rather than minutes.
| Component | PPO + RLHF | R1-Zero |
|---|---|---|
| SFT model | yes (~10⁶ labeled examples) | no |
| Reward model | yes (~10⁵ preference pairs + training) | no (regex) |
| Policy | yes | yes |
| Critic | yes (~same size as policy) | no (group-mean baseline) |
| KL reference | the SFT model | the BASE model |
| Networks in memory during RL | 4 (policy, critic, ref, RM) | 2 (policy, ref) |
| Networks trained during RL | 2 (policy, critic) | 1 (policy) |
| Reward latency per sample | ~50ms (RM forward pass) | ~10μs (regex) |
Engineering Reality: What Pure RL Breaks
R1-Zero is a strong reasoning model and a weak chat model. The same minimalism that makes the recipe work also produces a set of failure modes that you do not see in SFT-trained systems. The chapter's remaining sections (§16.3–§16.6) go into each in detail; here is the short catalogue so the reader knows what is coming.
- Language mixing. With no readability reward and a base model that has seen Chinese, English, and code in similar proportions, the policy can drift toward thinking in one language and answering in another. Inside a single <think> block, R1-Zero will sometimes switch from English prose to Chinese characters to Python syntax and back, all in the service of solving the problem. The answer is still correct; the trace is unreadable to a human reviewer.
- The “aha moment.” Around steps 1k–3k of training, R1-Zero begins spontaneously emitting self-correction phrases — “Wait, that's wrong, let me try a different approach” — without ever having been shown such phrases in an SFT corpus. The next section traces this phenomenon and its implications. It is the single most-cited result of the R1-Zero paper.
- Reward hacking is constrained but not eliminated. Because the reward is a deterministic regex, you cannot “fool the reward model” the way you can with a learned scorer. But the regex can be gamed: the policy can learn to emit <answer>43</answer> inside the <think> block to maximise the chance the regex extractor picks the right string. The DeepSeek paper documents several such micro-exploits and the regex hardening required to close them.
- No chat capability. R1-Zero is trained exclusively on verifiable-reward problems. On open-ended user requests it continues to output <think>…</think><answer>…</answer> even when the user wanted a casual answer. R1 (the production model) fixes this with two further training stages on top of R1-Zero, covered in chapter 17.
- Sensitivity to reward-function bugs. Because there is no learned reward model to smooth over edge cases, a bug in the regex propagates directly into the policy. An early version of R1-Zero's extractor accepted any number inside the <answer> tag, including trailing whitespace and punctuation. The policy learned to emit “43.” (with a period) on problems where the gold was “43”, which was harmless until the extractor was tightened, at which point pass-rate collapsed overnight. The reward function in pure RL is, in effect, part of the model architecture.
The single sentence to remember. R1-Zero's bet was that reasoning is latent in the base model and that a regex, applied to enough samples, is a sharper teacher than a million human annotators. The bet won at 671B; whether it generalises beyond verifiable-reward domains is the open question of the next several years of post-training research.