The Question R1-Zero Was Actually Asking
Strip away the architecture diagrams and the benchmark tables and R1-Zero is, at heart, a controlled experiment that asks a single falsifiable question: does the supervised fine-tuning stage actually teach a base model anything it could not be made to do by sampling alone? For five years, the entire post-training stack — SFT corpora, preference data, learned reward models, critic networks — rested on the assumption that the answer was “yes.” R1-Zero ran the experiment, and the answer came back closer to “not really.”
The previous five sections of this chapter showed you what R1-Zero is: a base model, a regex, GRPO, a frozen reference. What this final section does is take a step back and ask what the experiment showed — what we know now about how massive models learn that we did not know a year ago. Five lessons land directly. Two open questions remain.
Lesson 1: Reasoning Is Elicited, Not Taught
The most consequential R1-Zero result is what it implies about the base model. If a regex and a few thousand GRPO steps can turn DeepSeek-V3-Base into a model that emits long, self-correcting, backtracking chains of thought, then those chains of thought were already in the base model's conditional distribution somewhere. SFT did not need to put them there. The base model was a compressed library of reasoning behaviour; the RL phase was a librarian who learned which book to fetch.
This reframes everything we thought we knew about why SFT on chain-of-thought worked. It worked because it amplified a thin slice of probability mass that was already present, not because it taught a new skill. The implication for budgets is brutal: most of the cost of an SFT-on-CoT corpus — human annotation, quality filtering, rejection sampling against a stronger model — was paying for amplification that a verifiable reward and 10,000 GPU-hours can do for free. Not on every task. Not on every model. But on every task where the base model has a non-trivial pass@K and the reward is verifiable, the SFT step is, in the strict mathematical sense, redundant.
The deeper claim is about what pretraining is for. The conventional view treated pretraining as “giving the model facts and grammar” and post-training as “giving it skills.” R1-Zero suggests the cleaner decomposition is encoding and selection: the skills are already encoded by the cross-entropy loss on tens of trillions of tokens, and what post-training does is select which encoded skill the policy emits at each step. The base model is the capability frontier; the post-training pipeline is a controller that decides which capability to surface.
Lesson 2: RL Collapses Pass@K Into Pass@1
The cleanest empirical signature of R1-Zero is what it does to the pass@K curve. Before RL, a strong base model has a wide gap between pass@1 (the single-attempt success rate) and pass@K for $K \gg 1$: the model can solve the problem if you let it try several times, but on any single try it usually fails. After RL, that gap is mostly gone: pass@1 has risen sharply, pass@K is roughly flat, and the model now solves on the first try what it used to solve on the sixteenth.
Mechanistically, this happens because the GRPO update pushes probability mass away from the K-1 low-reward completions and onto the one high-reward completion. Sampling at temperature 1.0 after training therefore lands on the consolidated mode almost every time. The diversity that caused the original pass@K gap has been consumed in the process of closing it. Best-of-N at inference, which used to deliver 30-50% relative gains on a base model, delivers single-digit relative gains on an R1-Zero model. The free lunch has been eaten.
This is good news and bad news. The good news: at inference, you no longer need to sample 64 times and majority-vote. The greedy decode is now nearly as good as best-of-64, which translates directly to cheaper serving. The bad news: an R1-Zero model is a worse starting point for further RL than the base model it came from. The gap is gone. The next RL run has nothing to amplify. If you want another step-change in reasoning capability you have to go back to pretraining, not run more GRPO.
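The shrinking payoff from best-of-N can be checked with a few lines of arithmetic. The sketch below assumes independent samples and a perfect verifier that always picks a correct completion when one exists; the two per-sample success rates are illustrative assumptions, not measured values.

```python
def pass_at_k(p1: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p1) ** k


def best_of_n_relative_gain(p1: float, n: int) -> float:
    """Relative improvement of best-of-n (perfect verifier) over a single sample."""
    return pass_at_k(p1, n) / p1 - 1.0


# Illustrative per-sample success rates (assumptions, not measurements).
base_p1 = 0.70     # diffuse base policy: often wrong on a single try
post_rl_p1 = 0.95  # consolidated post-RL policy: usually right on the first try

base_gain = best_of_n_relative_gain(base_p1, n=4)
post_rl_gain = best_of_n_relative_gain(post_rl_p1, n=4)

print(f"base model:    best-of-4 adds {base_gain:.0%} relative")
print(f"post-RL model: best-of-4 adds {post_rl_gain:.0%} relative")
```

Once the per-sample success rate is high, extra samples have almost nothing left to recover, which is the same effect the text describes for an R1-Zero model.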
The ceiling theorem (informal). R1-Zero-style RL on a verifiable reward asymptotically converts pass@K of the base model into pass@1 of the trained model. It cannot push past pass@K(base). The pretraining run determined the ceiling; the RL run determined how close to it you get.
The Elicitation Gap, Formally
Define the elicitation gap at sampling budget $K$ as
$$\Delta_K \;=\; \mathbb{E}_{x \sim \mathcal{D}}\big[\text{pass@}K(x)\big] \;-\; \mathbb{E}_{x \sim \mathcal{D}}\big[\text{pass@}1(x)\big],$$
where $\mathcal{D}$ is the evaluation distribution and $\text{pass@}K(x) = 1 - (1 - p_1(x))^K$ is the standard pass@K formula, with $p_1(x)$ the base model's per-sample success probability on prompt $x$. The elicitation gap is non-negative, bounded above by $1 - \text{pass@}1$, and goes to zero in two regimes: (a) when $p_1(x) = 1$ on every prompt (the model is already saturated, nothing to amplify), and (b) when $p_1(x) = 0$ on every prompt (the model never succeeds even with K tries, nothing to learn from).
Empirically (this is the second R1-Zero observation worth carrying around) an R1-Zero run with sampling budget $K$ closes a fraction $\gamma$ of the gap:
$$\text{pass@}1_{\text{post-RL}} \;\approx\; \text{pass@}1_{\text{base}} + \gamma\,\Delta_K, \qquad \gamma \in [0.6,\ 0.9].$$
The coefficient $\gamma$ depends on the KL coefficient $\beta$, the number of training steps, the diversity of the training prompts, and the exploration noise, but in every published replication I am aware of it lands somewhere in that 0.6–0.9 band. That single empirical constant lets you compute, from a few hundred GPU-hours of pass@K screening on the base model, a prediction of the post-RL pass@1 before you spend the millions of dollars on the RL run itself. In the table below the prediction is within a handful of percentage points on three of the four rows; the DeepSeek-V3-Base AIME-2024 row is the clear exception.
| Base model + eval | pass@1 base | pass@64 base | gap | predicted pass@1 (γ=0.7) | measured post-RL pass@1 |
|---|---|---|---|---|---|
| DeepSeek-V3-Base + AIME-2024 | 0.155 | 0.700 | +0.545 | 0.536 | ~0.71 |
| DeepSeek-V3-Base + MATH-500 | 0.620 | 0.910 | +0.290 | 0.823 | ~0.86 |
| Qwen-7B + AIME-2024 | 0.022 | 0.241 | +0.219 | 0.175 | ~0.19 |
| Llama-3-70B + GSM8k | 0.870 | 0.972 | +0.102 | 0.941 | ~0.94 |
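The predicted column can be reproduced by adding a fraction γ of the gap to the base pass@1. The sketch below copies the pass@1 and pass@64 figures from the table; the function and variable names are illustrative, not taken from any published script.

```python
GAMMA = 0.7  # gap-closure fraction used in the table's prediction column

# (label, base pass@1, base pass@64) copied from the table rows.
ROWS = [
    ("DeepSeek-V3-Base + AIME-2024", 0.155, 0.700),
    ("DeepSeek-V3-Base + MATH-500", 0.620, 0.910),
    ("Qwen-7B + AIME-2024", 0.022, 0.241),
    ("Llama-3-70B + GSM8k", 0.870, 0.972),
]


def predict_post_rl_pass1(pass1: float, pass_k: float, gamma: float = GAMMA) -> float:
    """Post-RL pass@1 if RL closes a fraction gamma of the gap pass@K - pass@1."""
    gap = pass_k - pass1
    return pass1 + gamma * gap


predictions = {label: predict_post_rl_pass1(p1, pk) for label, p1, pk in ROWS}

for label, value in predictions.items():
    print(f"{label:30s} predicted pass@1 = {value:.3f}")
```

Because the prediction is linear in γ, sweeping γ across the 0.6–0.9 band gives a prediction interval rather than a point estimate.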
Lesson 3: Scale Determines What RL Can Reach
The gap formulation makes the scale-dependence of R1-Zero almost embarrassingly explicit. RL can only do as much as the base model's pass@K allows, and pass@K is determined by $p_1(x)$, which is a property of the pretraining run. A 671B model has a fat per-prompt $p_1$ distribution (many prompts in the 0.05–0.50 range) and a correspondingly large gap. A 7B model on the same eval has a $p_1$ distribution crushed against 0, a narrow pass@K curve, and almost no gap to close.
The arithmetic is elementary but worth doing once. For a problem with $p_1 = 0.1$, the chance that any of $K = 64$ samples succeeds is $1 - 0.9^{64} \approx 0.999$: essentially certain, so the prompt is a strong RL training signal. For $p_1 = 0.005$, the same calculation gives $1 - 0.995^{64} \approx 0.27$: only about one prompt in four produces a positive completion, and the other three are wasted GPU-hours. At $p_1 = 0.0001$ you are down to $1 - 0.9999^{64} \approx 0.006$, which is basically zero training signal: GRPO sits on its hands.
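The same arithmetic is easy to script for any per-sample success rate. This is a minimal sketch assuming independent samples; the listed values of p1 are illustrative choices that span the regimes discussed above.

```python
def prob_any_success(p1: float, k: int = 64) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p1) ** k


# Illustrative per-sample success rates, from comfortable to hopeless.
ILLUSTRATIVE_P1 = (0.1, 0.005, 0.0001)

signal = {p1: prob_any_success(p1) for p1 in ILLUSTRATIVE_P1}

for p1, prob in signal.items():
    print(f"p1 = {p1:<7} -> P(at least one correct in 64) = {prob:.3f}")
```

A prompt whose 64 completions are all wrong gives every completion the same reward, so a group-relative update has nothing to prefer; the probability printed here is the chance of avoiding that outcome.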
This is why “just rerun R1-Zero on a smaller base” has produced consistently disappointing replications. The recipe is public, the GRPO code is open, the reward function is six lines of Python. What is not transferable is the base model's $p_1$ distribution, and below some scale threshold that distribution simply does not contain enough above-noise mass on hard reasoning problems for RL to lock onto. The shortfall is structural (the base model), not algorithmic (the recipe).
R1-Zero is a 671B story, not an algorithm story. The algorithmic contribution — GRPO with a frozen base as reference — is elegant and important, but the result is downstream of a pretraining run that nobody outside two or three labs can currently afford. The lesson is not “everyone can do this now”; it is “pretraining quality is the bottleneck for everyone, and the bottleneck just got sharper.”
Manual Numerical Walkthrough: Computing the Elicitation Gap
Compute the gap and the RL prediction with a pencil on a four-prompt toy. Once the arithmetic clicks, the production decision rule below is just the same calculation scaled to 200 prompts and a real model.
Setup. Four prompts. For each, we run the base model 128 times and count how many of the 128 completions are correct. The per-prompt $p_1$ estimate is just the empirical success rate.
| prompt | n_correct / 128 | p₁ estimate |
|---|---|---|
| 1 (easy) | 92 / 128 | 0.719 |
| 2 (medium) | 18 / 128 | 0.141 |
| 3 (hard) | 3 / 128 | 0.023 |
| 4 (very hard) | 0 / 128 | 0.000 |
Step 1: pass@1. $(0.719 + 0.141 + 0.023 + 0.000)/4 \approx 0.221$.
Step 2: per-prompt pass@64 with the plug-in formula. For each prompt, compute $1 - (1 - p_1)^{64}$:
| prompt | p₁ | 1 − (1 − p₁)⁶⁴ |
|---|---|---|
| 1 | 0.719 | 1.000 |
| 2 | 0.141 | 1 − 0.859⁶⁴ ≈ 0.9999 |
| 3 | 0.023 | 1 − 0.977⁶⁴ ≈ 0.776 |
| 4 | 0.000 | 1 − 1.000⁶⁴ = 0.000 |
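Step 3: pass@64. $(1.000 + 0.9999 + 0.776 + 0.000)/4 \approx 0.694$.
Step 4: gap. $0.694 - 0.221 = 0.473$.
Step 5: RL prediction with $\gamma = 0.7$. $0.221 + 0.7 \times 0.473 \approx 0.552$. R1-Zero RL is predicted to take this model from 22.1% to ~55% pass@1.
Step 6: per-prompt fate. $p_1$ shifts roughly as $p_1 + \gamma\,(\text{pass@}64 - p_1)$ per prompt: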
| prompt | p₁ before | p₁ after RL | what happened |
|---|---|---|---|
| 1 | 0.719 | 0.916 | already-easy problem gets nearly saturated |
| 2 | 0.141 | 0.742 | medium problem becomes the main winner — RL ate the gap |
| 3 | 0.023 | 0.550 | hard-but-solvable problem moves the most in relative terms (roughly 24×) |
| 4 | 0.000 | 0.000 | no signal ever existed; RL learns nothing here |
The takeaway. RL did almost all of its redistribution work on prompts 2 and 3, the ones with $p_1$ in the “within reach” band. Prompt 1 was already easy and had little room to grow. Prompt 4 was a wall. The shape of the post-RL improvement is entirely determined by where each prompt sits on the $p_1$ axis at the start.
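The whole walkthrough fits in a short script, which is a convenient way to check the pencil arithmetic. The sketch below starts from the rounded per-prompt estimates in the first table and assumes γ = 0.7; third-decimal differences from the tables come from rounding.

```python
GAMMA = 0.7  # assumed gap-closure fraction
K = 64       # sampling budget

# Rounded per-prompt success estimates from the walkthrough table.
P1 = {1: 0.719, 2: 0.141, 3: 0.023, 4: 0.000}


def pass_at_k(p1: float, k: int) -> float:
    """Plug-in pass@k for a prompt with per-sample success rate p1."""
    return 1.0 - (1.0 - p1) ** k


# Steps 1-3: average pass@1 and pass@64 over the four prompts.
pass1 = sum(P1.values()) / len(P1)
pass_k = sum(pass_at_k(p, K) for p in P1.values()) / len(P1)

# Steps 4-5: elicitation gap and the predicted post-RL pass@1.
gap = pass_k - pass1
predicted = pass1 + GAMMA * gap

# Step 6: the same rule applied prompt by prompt.
per_prompt_after = {i: p + GAMMA * (pass_at_k(p, K) - p) for i, p in P1.items()}

print(f"pass@1 = {pass1:.3f}, pass@{K} = {pass_k:.3f}, gap = {gap:.3f}")
print(f"predicted post-RL pass@1 = {predicted:.3f}")
for i, after in per_prompt_after.items():
    print(f"prompt {i}: {P1[i]:.3f} -> {after:.3f}")
```

Prompt 4 stays at exactly zero in this model: with no successful sample there is no gap to close, whatever value γ takes.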
Interactive: How RL Reshapes the Pass@K Curve
Drag the sliders and feel the geometry of the lesson. Three observations are worth chasing. (1) Move base $p_1$ down toward 0.01: the cyan base curve flattens against the x-axis, the gap collapses, and the amber post-RL curve barely moves. RL needs raw material to work with. (2) Move $p_1$ up past 0.7: the gap also collapses, this time because the base is already nearly saturated. RL is at its most useful in the middle, the “within reach” band of $p_1$. (3) Move $\gamma$ to 0 and to 1: you see the two limiting cases that bracket the empirical band. Every published R1-Zero replication lands somewhere between these two amber curves.
Plain Python: Measuring the Elicitation Gap
Here is the entire R1-Zero ceiling theorem — pass@K, the elicitation gap, the post-RL prediction, the diversity collapse — in 130 lines of plain Python with no model, no GPU, and a Beta-distributed synthetic eval set. Once you can read this, the production decision rule below is just the same code with a real sampling server in place of the simulator.
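A condensed sketch of the core of that simulation follows: draw per-prompt success rates from a Beta distribution, simulate sampling, and compute pass@1, pass@K, the gap, and the post-RL prediction. The Beta parameters, prompt count, seed, and γ = 0.7 are all assumptions chosen for illustration, and the sketch leaves out the diversity-collapse part.

```python
import random


def simulate_gap(num_prompts=200, n_samples=128, k=64, gamma=0.7,
                 alpha=0.5, beta=2.0, seed=0):
    """Estimate pass@1, pass@k and the elicitation gap on a synthetic eval set.

    Each prompt's true success rate is drawn from Beta(alpha, beta); the
    hypothetical base model is then sampled n_samples times per prompt.
    """
    rng = random.Random(seed)
    pass1_total = 0.0
    passk_total = 0.0
    for _ in range(num_prompts):
        true_p1 = rng.betavariate(alpha, beta)
        n_correct = sum(rng.random() < true_p1 for _ in range(n_samples))
        p1_hat = n_correct / n_samples                   # empirical success rate
        pass1_total += p1_hat
        passk_total += 1.0 - (1.0 - p1_hat) ** k         # plug-in pass@k
    pass1 = pass1_total / num_prompts
    pass_k = passk_total / num_prompts
    gap = pass_k - pass1
    return {
        "pass@1": pass1,
        "pass@K": pass_k,
        "gap": gap,
        "predicted_post_rl_pass@1": pass1 + gamma * gap,
    }


result = simulate_gap()
for name, value in result.items():
    print(f"{name:26s} {value:.3f}")
```

Changing `alpha` and `beta` moves mass toward or away from zero, which is a cheap way to see how the shape of the per-prompt distribution drives the size of the gap.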
PyTorch: Estimating Pass@K From a Frozen Policy
Production GPU code that turns the previous script into a go/no-go decision. The sampling is batched with .expand(K, -1) so the prompt's KV-cache is computed once per prompt; the pass@K estimator is the numerically stable unbiased form, so a single sweep at $n$ samples gives you every pass@K for $K \le n$ without re-sampling; the decision function bakes in the empirical $\gamma$ from published R1-Zero replications and converts the elicitation gap into a dollar-cost estimate. This is the script that every frontier lab now runs before greenlighting an RL phase.
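The estimator piece can be sketched without a GPU or a deep-learning framework. The code below uses the combinatorial unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), in the product form commonly used for code benchmarks, so that one sweep of n samples with c correct yields the whole curve for k ≤ n. It is a framework-free illustration rather than the sampling script described above, and the function names are assumptions.

```python
import math


def unbiased_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k).

    Computed as a running product to avoid huge binomial coefficients.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k wrong samples: every k-subset has a correct one
    prob_all_wrong = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_wrong *= 1.0 - k / i
    return 1.0 - prob_all_wrong


def closed_form_pass_at_k(n: int, c: int, k: int) -> float:
    """Reference implementation with exact binomials, for cross-checking only."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def pass_at_k_curve(n: int, c: int, ks) -> dict:
    """pass@k for every requested k <= n from a single sweep of n samples."""
    return {k: unbiased_pass_at_k(n, c, k) for k in ks if k <= n}


# Example: 3 correct completions out of 128, as for prompt 3 in the walkthrough.
curve = pass_at_k_curve(n=128, c=3, ks=(1, 2, 4, 8, 16, 32, 64, 128))
for k, value in curve.items():
    print(f"pass@{k:<3d} = {value:.4f}")
```

Unlike the plug-in form 1 − (1 − c/n)^k, this estimator stays unbiased when k is close to n, which matters when the screening sweep and the target K are of similar size.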
Lesson 4: Pretraining Is the Real Bottleneck
Combine the previous three lessons and a sobering corollary falls out. The post-RL pass@1 is bounded by pass@K(base), the gap-closure factor $\gamma$ is roughly constant across replications, and the per-prompt $p_1$ distribution is set by the pretraining run. Therefore the only long-term lever for better reasoning performance is the pretraining mixture. RL is a consolidation step; it cannot manufacture capability the base model does not have, no matter how many GPU-hours you throw at it.
This reorders the priorities of the entire post-ChatGPT stack. For three years, the marginal dollar in the labs went to better SFT data, better preference data, better reward models, and bigger critics. R1-Zero showed that all of those line items optimise the same metric, the gap-closure factor $\gamma$, and that $\gamma$ has a hard ceiling near 1. The only way to move past that ceiling is to enlarge the base model's pass@K, which means more pretraining compute, better pretraining data, better pretraining objectives. The lesson, put bluntly: after R1-Zero, the optimal allocation of an incremental dollar is roughly 90% pretraining, 10% post-training, on the relevant verifiable-reward domains.
The labs noticed immediately. Within six months of the R1-Zero paper, three of the four largest frontier labs publicly announced a rebalancing toward larger pretraining runs and a reduction in the size of their reasoning-SFT teams. The R1-Zero hypothesis turned out to be a hypothesis about where the marginal capability actually lives, and the answer was “in pretraining, upstream of post-training, where it always was.”
Lesson 5: Verifiable Reward Is a Different Universe
Every lesson so far has carried a quiet asterisk: on verifiable-reward domains. Math problems, coding challenges, logic puzzles, formal proofs — tasks where a deterministic function can read the output and return a correct scalar. The R1-Zero recipe lives inside that universe and is not directly transferable outside it. The fifth lesson is the delineation: what is the boundary between domains where R1-Zero works and domains where it cannot?
Three properties define the verifiable-reward universe. (1) Determinism: the same output and the same gold give the same reward, always. (2) Sparsity tolerance: the reward can be 0 on most completions early in training without breaking learning (GRPO's group-mean baseline is always defined, and it yields a non-zero learning signal as long as some variance exists in the group). (3) Cheap evaluation: the reward function runs in microseconds per completion, so the rollout server can score K=64 generations as fast as it can produce them. Math, code, and formal logic satisfy all three. Creative writing, dialogue, open-ended advice, multi-turn agentic tasks: none of them satisfy the first.
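The first two properties can be made concrete with a toy rule-based reward and a group-relative advantage. This is a minimal sketch under stated assumptions: the `\boxed{...}` answer format and the function names are hypothetical, and the advantage is normalised by the group standard deviation, as GRPO is commonly described, rather than being copied from any particular training codebase.

```python
import re
import statistics

# Hypothetical answer format: the final answer appears as \boxed{...}.
ANSWER_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def rule_based_reward(completion: str, gold: str) -> float:
    """Deterministic 0/1 reward: 1.0 if the last boxed answer equals the gold answer."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == gold.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Subtract the group mean and divide by the group std (plus eps for stability)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


completions = [
    "So the total is \\boxed{42}",
    "After simplifying, \\boxed{41}",
    "I am not sure about this one.",
    "First guess \\boxed{40}, on reflection \\boxed{ 42 }",
]
rewards = [rule_based_reward(c, gold="42") for c in completions]
advantages = group_relative_advantages(rewards)
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("rewards:   ", rewards)
print("advantages:", [round(a, 3) for a in advantages])
print("all-wrong group:", all_wrong)
```

A group with mixed rewards gets positive and negative advantages; a group where every completion fails gets all zeros, so it is well defined but contributes no gradient.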
That is why the next year of post-training research has divided cleanly into two camps. One camp (verifiable-reward) inherits the R1-Zero recipe wholesale and extends the regex/unit-test family of rewards into theorem proving, SQL generation, tool use with verifiable outputs, and agentic workflows with end-state checks. The other camp (preference-reward) is back to the RLHF stack, but now with the explicit acknowledgement that on non-verifiable tasks, the reward model is the bottleneck and there is no R1-Zero shortcut around it. The two camps are not in conflict; they are operating in different universes that the R1-Zero result drew a sharp line between.
| Property | Verifiable (R1-Zero applies) | Non-verifiable (RLHF still needed) |
|---|---|---|
| Reward function | regex / unit test / proof checker | learned model on human preferences |
| Reward latency | ~10 μs | ~50–500 ms (forward pass) |
| Reward hacking | constrained by regex syntax | open-ended, requires constant patching |
| Example domains | math, code, logic, formal proofs, SQL | chat, writing, advice, summarisation |
| Need for SFT? | no (R1-Zero) | yes (still load-bearing) |
| Need for critic? | no (GRPO) | no (DPO/IPO eliminated it) |
| Need for reward model? | no | yes (the central object) |
Engineering Reality: What Every Lab Did After R1
The R1 paper landed in January 2025. The downstream engineering response across the field was unusually uniform — uniform enough to be worth cataloguing as a single set of post-R1 norms.
- Pre-RL pass@K screening became standard. The PyTorch screening script above is now part of the standard pre-training-decision pipeline at every major lab. Before committing to an RL phase, the lead runs pass@128 on the candidate base model against the target eval and computes the elicitation gap. A run whose gap is too small is usually not greenlit.
- Critics were quietly retired. Within nine months of R1, GRPO-style group-mean baselines had replaced PPO critics in the production training stack at Meta, Mistral, Alibaba, xAI, and the open-source TRL library. The critic network — for years a non-negotiable component of RL for LLMs — disappeared from a generation of code without controversy.
- The reward-model team got smaller. On verifiable-reward domains the reward model has no role to play. Several labs publicly disbanded their reward-modelling teams on reasoning tasks and reassigned the headcount to pretraining-data curation. (Reward-modelling teams on chat-alignment tasks remained intact, in line with Lesson 5.)
- vLLM-decoupled rollout became table stakes. R1-Zero's feasibility at 671B depended on running the rollout sampling on a separate fleet of inference servers from the training cluster, with periodic weight sync. This architectural choice — not GRPO itself — is what most replications copied first. OpenRLHF, TRL, and verl all shipped first-class support for vLLM-backed rollout within a quarter.
- The training set shifted toward math and code. If your post-training pipeline is now “regex on the output,” your training set has to be tasks where a regex can score the output. The fraction of the post-training corpus allocated to math, competitive programming, formal logic, and tool use jumped sharply across the frontier labs in 2025. The chat data did not go away, but it stopped growing.
- The pretraining mixture got more math-heavy. Lesson 4 implies that the way to improve a future R1-Zero run is to raise the base model's $p_1$ on the relevant domains, and the way to raise $p_1$ on math is to put more math in pretraining. Frontier pretraining mixtures publicly jumped from ~5% math/code to ~20% math/code in the year following R1, with measurable downstream effect on base-model pass@K.
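The screening step in the first item above reduces to a small decision rule once pass@1 and pass@K have been measured. The sketch below is illustrative only: the `min_gap` threshold of 0.10 is a hypothetical placeholder, not a value taken from any lab, and the γ band is the 0.6–0.9 range quoted earlier in this section.

```python
from dataclasses import dataclass


@dataclass
class ScreeningResult:
    gap: float
    predicted_low: float   # predicted post-RL pass@1 at the low end of the gamma band
    predicted_high: float  # predicted post-RL pass@1 at the high end of the gamma band
    greenlight: bool


def screen_base_model(pass1: float, pass_k: float,
                      min_gap: float = 0.10,          # hypothetical threshold
                      gamma_band=(0.6, 0.9)) -> ScreeningResult:
    """Turn measured base-model pass@1 and pass@K into a go/no-go signal."""
    if not 0.0 <= pass1 <= pass_k <= 1.0:
        raise ValueError("expected 0 <= pass@1 <= pass@K <= 1")
    gap = pass_k - pass1
    low, high = (pass1 + g * gap for g in gamma_band)
    return ScreeningResult(gap, low, high, greenlight=gap >= min_gap)


# Figures from the DeepSeek-V3-Base + AIME-2024 row, plus a made-up saturated model.
wide_gap = screen_base_model(pass1=0.155, pass_k=0.700)
saturated = screen_base_model(pass1=0.970, pass_k=0.990)

print(wide_gap)
print(saturated)
```

Reporting the prediction as a band rather than a point keeps the uncertainty in γ visible to whoever signs off on the compute.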
What R1-Zero Did Not Settle
Two questions are still actively contested as of writing, and the next several years of post-training research will hinge on the answers.
Does the ceiling theorem hold for novel reasoning?
The strong reading of Lesson 2 says: RL cannot push past pass@K(base) because there is nothing past it to push to. The weak reading admits that during RL, the policy can compose chain-of-thought patterns it had never sampled coherently before, which means the same K sampled from the post-RL model will solve problems neither the base model nor any single base-K sampling sweep ever solved. The empirical evidence is mixed: most large-scale runs do show some post-RL pass@K above the pre-RL pass@K on the hardest problems, but the effect is small (a few percentage points) and confounded by decoding-strategy changes. If the strong reading is true, reasoning capability is permanently bounded by pretraining. If the weak reading is true, RL can compose ladder rungs the pretraining run never explicitly placed. Resolving this is the most important open empirical question in post-training.
Does the recipe transfer to non-verifiable domains?
Lesson 5 marks the boundary between R1-Zero's universe and everything else. The aspirational result — not yet demonstrated at scale — would be a “process reward model” trustworthy enough to substitute for the regex on domains where there is no regex. Several research lines (PRM800k, OpenAI's o1-system-card hints at PRM scoring, Anthropic's constitutional-AI loops) are pushing in this direction. None has yet produced the “R1-Zero for chat” result — a recipe where rule-based-style RL on a learned process reward, starting from a base model with no SFT, produces a chat model that is competitive with one trained the conventional way. If somebody publishes that result, the SFT step goes the way of the critic. Until then, RLHF's full stack is still load-bearing on non-verifiable domains, and R1-Zero remains a powerful but domain-bounded result.
The single sentence to remember. R1-Zero showed the field that the base model was already most of the reasoning model, that SFT-on-CoT was a load-bearing assumption rather than a load-bearing component, and that the optimal allocation of an incremental training dollar is — for verifiable-reward domains, at least — further upstream than anyone had been spending it. The next chapter, on R1 proper, shows what happens when you wrap this insight in just enough conventional alignment to ship it.