Chapter 16

What R1-Zero Teaches Us

DeepSeek R1-Zero: Pure RL Reasoning

The Question R1-Zero Was Actually Asking

Strip away the architecture diagrams and the benchmark tables and R1-Zero is, at heart, a controlled experiment that asks a single falsifiable question: does the supervised fine-tuning stage actually teach a base model anything it could not be made to do by sampling alone? For five years, the entire post-training stack — SFT corpora, preference data, learned reward models, critic networks — rested on the assumption that the answer was “yes.” R1-Zero ran the experiment, and the answer came back closer to “not really.”

The previous five sections of this chapter showed you what R1-Zero is: a base model, a regex, GRPO, a frozen reference. What this final section does is take a step back and ask what the experiment showed — what we know now about how massive models learn that we did not know a year ago. Five lessons land directly. Two open questions remain.

A note on epistemic humility. R1-Zero is one experiment on one model family in one regime (verifiable rewards on math and code). The lessons below have all been replicated by at least two independent groups, but the field is moving fast and any single number is likely to shift. The shape of each lesson is what is durable; the precise constants are not.

Lesson 1: Reasoning Is Elicited, Not Taught

The most consequential R1-Zero result is what it implies about the base model. If a regex and a few thousand GRPO steps can turn DeepSeek-V3-Base into a model that emits long, self-correcting, backtracking chains of thought, then those chains of thought were already in the base model's conditional distribution somewhere. SFT did not need to put them there. The base model was a compressed library of reasoning behaviour; the RL phase was a librarian who learned which book to fetch.

This reframes everything we thought we knew about why SFT on chain-of-thought worked. It worked because it amplified a thin slice of probability mass that was already present, not because it taught a new skill. The implication for budgets is brutal: most of the cost of an SFT-on-CoT corpus — human annotation, quality filtering, rejection sampling against a stronger model — was paying for amplification that a verifiable reward and 10,000 GPU-hours can do for free. Not on every task. Not on every model. But on every task where the base model has a non-trivial pass@K and the reward is verifiable, the SFT step is, in the strict mathematical sense, redundant.

The deeper claim is about what pretraining is for. The conventional view treated pretraining as “giving the model facts and grammar” and post-training as “giving it skills.” R1-Zero suggests the cleaner decomposition is $\text{pretraining} = \text{compression}$ and $\text{post-training} = \text{elicitation}$: the skills are already encoded by the cross-entropy loss on tens of trillions of tokens, and what post-training does is select which encoded skill the policy emits at each step. The base model is the capability frontier; the post-training pipeline is a controller that decides which capability to surface.

The shorthand worth memorising. Pretraining creates capability. Post-training routes capability. SFT routes by example, RL routes by reward, but neither routes anywhere the base model could not, in principle, sample.

Lesson 2: RL Collapses Pass@K Into Pass@1

The cleanest empirical signature of R1-Zero is what it does to the pass@K curve. Before RL, a strong base model has a wide gap between pass@1 (the success rate of a single sampled completion) and pass@K for $K \gtrsim 16$: the model can solve the problem if you let it try several times, but on any single try it usually fails. After RL, that gap is mostly gone: pass@1 has risen sharply, pass@K is roughly flat, and the model now solves on the first try what it used to solve on the sixteenth.

Mechanistically, this happens because the GRPO update pushes probability mass away from the K-1 low-reward completions and onto the one high-reward completion. Sampling at temperature 1.0 after training therefore lands on the consolidated mode almost every time. The diversity that caused the original pass@K gap has been consumed in the process of closing it. Best-of-N at inference, which used to deliver 30-50% relative gains on a base model, delivers single-digit relative gains on an R1-Zero model. The free lunch has been eaten.
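One way to see why the update concentrates mass is to look at the group-relative advantage that GRPO assigns to each sampled completion. The sketch below is a minimal illustration, not the R1-Zero training code: it assumes binary rewards, uses the population standard deviation, and adds a small `eps` for stability. The function name `group_advantages` is hypothetical, and the clipped policy-gradient step and the KL penalty are left out.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardise each reward against the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# One correct completion out of K = 8 samples.
mixed = group_advantages([1, 0, 0, 0, 0, 0, 0, 0])

# No correct completion at all.
all_fail = group_advantages([0] * 8)

print("one success :", [round(a, 3) for a in mixed])
print("all failures:", [round(a, 3) for a in all_fail])
```

With one success in eight, the lone correct completion receives a large positive advantage and each failure a small negative one, so the update shifts mass toward the correct trajectory. When every completion fails, all advantages are zero and the prompt contributes no gradient.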

This is good news and bad news. The good news: at inference, you no longer need to sample 64 times and majority-vote. The greedy decode is now nearly as good as best-of-64, which translates directly to cheaper serving. The bad news: an R1-Zero model is a worse starting point for further RL than the base model it came from. The gap is gone. The next RL run has nothing to amplify. If you want another step-change in reasoning capability you have to go back to pretraining, not run more GRPO.

The ceiling theorem (informal). R1-Zero-style RL on a verifiable reward asymptotically converts pass@K of the base model into pass@1 of the trained model. It cannot push past pass@K(base). The pretraining run determined the ceiling; the RL run determined how close to it you get.

The Elicitation Gap, Formally

Define the elicitation gap at sampling budget $K$ as

$$\Delta_K(\pi_{\text{base}}, \mathcal{D}) \;=\; \mathrm{pass@}K(\pi_{\text{base}}, \mathcal{D}) \;-\; \mathrm{pass@}1(\pi_{\text{base}}, \mathcal{D}),$$

where $\mathcal{D}$ is the evaluation distribution and $\mathrm{pass@}K = \mathbb{E}_{x \sim \mathcal{D}}\bigl[\,1 - (1 - p_1(x))^K\,\bigr]$ is the probability that at least one of $K$ independent samples succeeds, with $p_1(x)$ the base model's per-sample success probability on prompt $x$. The elicitation gap is non-negative, bounded above by $1 - \mathrm{pass@}1$, and goes to zero in two regimes: (a) when $p_1(x) \approx 1$ on every prompt (the model is already saturated, nothing to amplify), and (b) when $p_1(x) \approx 0$ on every prompt (the model never succeeds even with $K$ tries, nothing to learn from).
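The two vanishing regimes are easy to check numerically. The following is a small sketch under the definition above; the three per-prompt $p_1$ lists are made-up illustrative values, and the helper names `pass_at_k` and `elicitation_gap` are chosen here for convenience.

```python
def pass_at_k(p1s: list[float], k: int) -> float:
    """Mean probability that at least one of k independent samples succeeds."""
    return sum(1.0 - (1.0 - p) ** k for p in p1s) / len(p1s)


def elicitation_gap(p1s: list[float], k: int) -> float:
    """Delta_K = pass@K - pass@1 for a list of per-prompt success probabilities."""
    return pass_at_k(p1s, k) - pass_at_k(p1s, 1)


# Hypothetical per-prompt p_1 values for three regimes.
saturated = [0.99, 0.98, 0.995]      # regime (a): already solved
hopeless = [0.0001, 0.0002, 0.0]     # regime (b): essentially never solved
within_reach = [0.05, 0.15, 0.40]    # the middle band

for name, p1s in [("saturated", saturated),
                  ("hopeless", hopeless),
                  ("within reach", within_reach)]:
    gap = elicitation_gap(p1s, 64)
    bound = 1.0 - pass_at_k(p1s, 1)
    print(f"{name:13s} gap@64 = {gap:.4f}   upper bound 1 - pass@1 = {bound:.4f}")
```

Only the middle list produces a large gap; the other two stay close to zero for opposite reasons.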

Empirically (this is the second R1-Zero observation worth carrying around), an R1-Zero run with sampling budget $K$ closes a fraction $\gamma \in [0.6, 0.9]$ of the gap:

$$\mathrm{pass@}1(\pi_{\text{RL}}) \;\approx\; \mathrm{pass@}1(\pi_{\text{base}}) \;+\; \gamma \cdot \Delta_K(\pi_{\text{base}}, \mathcal{D}).$$

The coefficient $\gamma$ depends on the KL coefficient $\beta$, the number of training steps, the diversity of the training prompts, and the exploration noise, but in every published replication I am aware of it lands somewhere in that 0.6–0.9 band. That single empirical constant lets you compute, from a few hundred GPU-hours of pass@K screening on the base model, a prediction of the post-RL pass@1 accurate to a handful of percentage points, before you spend the millions of dollars on the RL run itself.

| Base model + eval | pass@1 (base) | pass@64 (base) | gap | predicted pass@1 (γ = 0.7) | measured post-RL pass@1 |
|---|---|---|---|---|---|
| DeepSeek-V3-Base + AIME-2024 | 0.155 | 0.700 | +0.545 | 0.536 | ~0.71 |
| DeepSeek-V3-Base + MATH-500 | 0.620 | 0.910 | +0.290 | 0.823 | ~0.86 |
| Qwen-7B + AIME-2024 | 0.022 | 0.241 | +0.219 | 0.175 | ~0.19 |
| Llama-3-70B + GSM8k | 0.870 | 0.972 | +0.102 | 0.941 | ~0.94 |
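The prediction column can be recomputed directly from the gap formula, and the same formula can be inverted to see which $\gamma$ each measured result would imply. The sketch below copies the four rows from the table; `predict_post_rl` and `implied_gamma` are hypothetical helper names, not functions from any library.

```python
GAMMA = 0.7  # planning value used in the table

# (label, pass@1 base, pass@64 base, measured post-RL pass@1), copied from the table
ROWS = [
    ("DeepSeek-V3-Base + AIME-2024", 0.155, 0.700, 0.71),
    ("DeepSeek-V3-Base + MATH-500", 0.620, 0.910, 0.86),
    ("Qwen-7B + AIME-2024", 0.022, 0.241, 0.19),
    ("Llama-3-70B + GSM8k", 0.870, 0.972, 0.94),
]


def predict_post_rl(p1_base: float, pk_base: float, gamma: float = GAMMA) -> float:
    """pass@1(RL) ~= pass@1(base) + gamma * (pass@K(base) - pass@1(base))."""
    return p1_base + gamma * (pk_base - p1_base)


def implied_gamma(p1_base: float, pk_base: float, p1_measured: float) -> float:
    """Fraction of the gap that a measured post-RL pass@1 corresponds to."""
    return (p1_measured - p1_base) / (pk_base - p1_base)


for label, p1, pk, measured in ROWS:
    pred = predict_post_rl(p1, pk)
    g = implied_gamma(p1, pk, measured)
    print(f"{label:30s} predicted = {pred:.3f}   implied gamma = {g:.2f}")
```

Because the measured column is rounded, the implied $\gamma$ is only indicative: on these figures it comes out between roughly 0.69 and 1.02.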

Lesson 3: Scale Determines What RL Can Reach

The gap formulation makes the scale-dependence of R1-Zero almost embarrassingly explicit. RL can only do as much as the base model's pass@K allows, and pass@K is determined by $p_1$, which is a property of the pretraining run. A 671B model has a fat per-prompt $p_1$ distribution, with many prompts in the 0.05–0.50 range, and a correspondingly large gap. A 7B model on the same eval has a $p_1$ distribution crushed against 0, a narrow pass@K curve, and almost no gap to close.

The arithmetic is elementary but worth doing once. For a problem with $p_1 = 0.15$, the chance that any of $K = 64$ samples succeeds is $1 - 0.85^{64} \approx 0.99997$, essentially certain, so the prompt is a strong RL training signal. For $p_1 = 0.005$, the same calculation gives $1 - 0.995^{64} \approx 0.274$: only about one prompt in four produces a positive completion, and the other three are wasted GPU-hours. At $p_1 = 0.0005$ you are down to $\approx 0.032$, which is basically zero training signal: GRPO sits on its hands.
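The same arithmetic in code, as a minimal sketch that assumes independent samples. The helper names `prob_any_success` and `samples_for_target` are hypothetical; the second one inverts the formula to ask how many samples a prompt needs before it has a given chance of producing at least one correct completion.

```python
import math


def prob_any_success(p1: float, k: int = 64) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p1) ** k


def samples_for_target(p1: float, target: float = 0.5) -> int:
    """Smallest k for which prob_any_success(p1, k) >= target (needs 0 < p1 < 1)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p1))


for p1 in (0.15, 0.005, 0.0005):
    print(f"p1 = {p1:<7} P(any success, K=64) = {prob_any_success(p1):.5f}   "
          f"K for a 50% chance = {samples_for_target(p1)}")
```

The second column makes the cost of hard prompts concrete: the sampling budget needed for even a coin-flip chance of a positive completion grows roughly as $1/p_1$.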

This is why “just rerun R1-Zero on a smaller base” has produced consistently disappointing replications. The recipe is public, the GRPO code is open, the reward function is six lines of Python. What is not transferable is the base model's $p_1$ distribution, and below some scale threshold that distribution simply does not contain enough above-noise mass on hard reasoning problems for RL to lock onto. The limitation is structural (the base model), not algorithmic (the recipe).

R1-Zero is a 671B story, not an algorithm story. The algorithmic contribution — GRPO with a frozen base as reference — is elegant and important, but the result is downstream of a pretraining run that nobody outside two or three labs can currently afford. The lesson is not “everyone can do this now”; it is “pretraining quality is the bottleneck for everyone, and the bottleneck just got sharper.”

Manual Numerical Walkthrough: Computing the Elicitation Gap

Compute the gap and the RL prediction with a pencil on a four-prompt toy. Once the arithmetic clicks, the production decision rule below is just the same calculation scaled to 200 prompts and a real model.


Setup. Four prompts. For each, we run the base model 128 times and count how many of the 128 completions are correct. The per-prompt $p_1$ estimate is just the empirical success rate.

| prompt | n_correct / 128 | p₁ estimate |
|---|---|---|
| 1 (easy) | 92 / 128 | 0.719 |
| 2 (medium) | 18 / 128 | 0.141 |
| 3 (hard) | 3 / 128 | 0.023 |
| 4 (very hard) | 0 / 128 | 0.000 |

Step 1: pass@1. $\mathrm{pass@}1 = \tfrac{1}{4}(0.719 + 0.141 + 0.023 + 0.000) = 0.221$.

Step 2: pass@64 from the per-prompt $p_1$. For each prompt, compute $1 - (1 - p_1)^{64}$:

| prompt | p₁ | 1 − (1 − p₁)⁶⁴ |
|---|---|---|
| 1 | 0.719 | 1.000 |
| 2 | 0.141 | 1 − 0.859⁶⁴ ≈ 0.9999 |
| 3 | 0.023 | 1 − 0.977⁶⁴ ≈ 0.774 |
| 4 | 0.000 | 1 − 1.000⁶⁴ = 0.000 |

Step 3: pass@64. $\mathrm{pass@}64 = \tfrac{1}{4}(1.000 + 1.000 + 0.774 + 0.000) \approx 0.694$.

Step 4: gap. $\Delta_{64} = 0.694 - 0.221 = +0.473$.

Step 5: RL prediction with $\gamma = 0.7$. $\mathrm{pass@}1_{\text{RL}} \approx 0.221 + 0.7 \cdot 0.473 = 0.552$. R1-Zero RL is predicted to take this model from 22.1% to ~55% pass@1.

Step 6: per-prompt fate. $p_1$ shifts roughly as $p_1 + \gamma \cdot (\mathrm{pass@}64 - p_1)$ per prompt:

| prompt | p₁ before | p₁ after RL | what happened |
|---|---|---|---|
| 1 | 0.719 | 0.916 | already-easy problem gets nearly saturated |
| 2 | 0.141 | 0.742 | medium problem is the main winner in absolute terms: RL ate the gap |
| 3 | 0.023 | 0.549 | hard-but-solvable problem makes the largest relative jump |
| 4 | 0.000 | 0.000 | no signal ever existed; RL learns nothing here |

The takeaway. RL did almost all of its redistribution work on prompts 2 and 3, the ones with $p_1$ in the “within reach” band. Prompt 1 was already easy and had little room to grow. Prompt 4 was a wall. The shape of the post-RL improvement is entirely determined by where each prompt sits on the $p_1$ axis at the start.
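The pencil walkthrough can be reproduced in a few lines, which is a convenient way to check the arithmetic. This sketch rounds each $p_1$ to three decimals to match the tables, so its results agree with the hand calculation to within rounding; the variable names are chosen here for illustration.

```python
N_SAMPLES = 128
K = 64
GAMMA = 0.7

counts = [92, 18, 3, 0]  # correct completions per prompt, out of 128

# Per-prompt success rates, rounded to three decimals as in the walkthrough.
p1 = [round(c / N_SAMPLES, 3) for c in counts]

# Per-prompt chance that at least one of K samples succeeds.
p64 = [1.0 - (1.0 - p) ** K for p in p1]

pass1 = sum(p1) / len(p1)
pass64 = sum(p64) / len(p64)
gap = pass64 - pass1
predicted = pass1 + GAMMA * gap

# Per-prompt shift: p1 + gamma * (pass@64 - p1).
after = [p + GAMMA * (pk - p) for p, pk in zip(p1, p64)]

print(f"pass@1 = {pass1:.3f}  pass@64 = {pass64:.3f}  gap = {gap:+.3f}")
print(f"predicted post-RL pass@1 = {predicted:.3f}")
for i, (before, aft) in enumerate(zip(p1, after), start=1):
    print(f"prompt {i}: {before:.3f} -> {aft:.3f}  (gain {aft - before:+.3f})")
```

Averaging the per-prompt `after` values gives the same number as the aggregate prediction, because the update rule is linear in each prompt's gap.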

Interactive: How RL Reshapes the Pass@K Curve

Drag the sliders and feel the geometry of the lesson. Three observations are worth chasing. (1) Move base $p_1$ down toward 0.01: the cyan base curve flattens against the x-axis, the gap collapses, and the amber post-RL curve barely moves. RL needs raw material to work with. (2) Move $p_1$ up past 0.7: the gap also collapses, this time because the base is already nearly saturated. RL is at its most useful in the middle, the “within reach” band of $p_1 \in [0.05, 0.40]$. (3) Move $\gamma$ to 0 and to 1: you see the two endpoints of the empirical band. Every published R1-Zero replication lands somewhere between these two amber curves.

[Interactive figure: Pass@K before vs after R1-Zero-style RL. x-axis: K (samples per prompt), 1 to 256 in powers of two; y-axis: pass@K, 0 to 1. Sliders: base pass@1 (p₁), RL sampling K, gap-closure γ. Default readout at p₁ = 0.150, K = 64, γ = 0.80: elicitation gap 0.850, expected lift +0.680, pass@1 after RL 0.830.]
Cyan = base-model pass@K. Amber = post-RL pass@K. The shaded cyan area is the elicitation gap — the space R1-Zero has to work in. Try the “Weak 7B base” preset: the base curve is so flat that the gap is small and post-RL pass@1 barely improves. Try the “Hopeless eval” preset: even at K=256 the base hits ~22%, and 80% of that ceiling is the most RL can ever deliver. The curve's shape predicts the run's outcome.

Plain Python: Measuring the Elicitation Gap

Here is the entire R1-Zero ceiling theorem (pass@K, the elicitation gap, the post-RL prediction, the diversity collapse) in roughly 120 lines of plain Python with no model, no GPU, and a Beta-distributed synthetic eval set. Once you can read this, the production decision rule below is just the same code with a real sampling server in place of the simulator.

The elicitation gap, end-to-end, in pure Python
elicitation_gap.py
problem_p1 -- the only ingredient that matters

Each number in this list is one problem's per-sample success probability under the BASE model. This is the single quantity that determines whether R1-Zero will work on your model + dataset combination. You do not need to know which problems are 'easy' or 'hard'; you need the distribution of p_1 across the eval set. Real DeepSeek-V3-Base on AIME-2024 has a long tail of near-zero p_1 problems (RL can do nothing with these) and a smaller body of p_1 in the 0.05-0.40 range (this is where RL earns its keep). Beta(0.6, 2.2) reproduces that shape closely enough to reason with.

EXAMPLE
First 5 entries with seed=0: [0.034, 0.011, 0.182, 0.090, 0.527]. Problem 4 is already solved 53% of the time -- RL will lock in a near-100% pass@1 on it. Problem 1 is solved 1% of the time -- even K=64 only yields pass@K of 1-(0.99)^64 ~= 0.47, and RL closing 80% of that gap still only gets to pass@1 ~= 0.38. The hard tail is the wall.
pass_at_k() -- the entire performance metric, in one line

This is the quantity that the unbiased pass@K estimator of Chen et al. (2021) targets, evaluated here directly from known probabilities. For each problem, the probability that AT LEAST ONE of K independent samples succeeds is 1 - (1 - p)^K. Average across problems and you have pass@K. The function does not need to know what the problems ARE -- it only needs the per-problem p_1. That separation is what lets us reason cleanly: pass@K is a deterministic function of the p_1 distribution and the parameter K.

EXAMPLE
pass_at_k([0.5, 0.5, 0.5], 1) = 0.5
pass_at_k([0.5, 0.5, 0.5], 4) = 1 - 0.5^4 = 0.9375
pass_at_k([0.1, 0.9],     16) = mean(1 - 0.9^16, 1 - 0.1^16) ~= mean(0.815, 1.000) ~= 0.907
the pass@K curve -- the picture R1-Zero gave the field

This loop is the empirical fingerprint of a base model. A model where pass@K rises sharply between K=1 and K=64 has lots of latent ability and is a strong RL candidate. A model whose pass@K is nearly flat (already saturated at pass@1, or always 0) has nothing for RL to amplify. The shape of THIS curve, computed on YOUR eval set with YOUR base model, predicts in advance whether a multi-million-dollar R1-Zero-style run will succeed.

EXAMPLE
With seed=0 you get roughly: pass@1=0.21, pass@4=0.36, pass@16=0.50, pass@64=0.64, pass@256=0.74. A 3.0x lift from K=1 to K=64 is in the sweet spot. If the lift had been 1.05x, you should not be running R1-Zero.
gap(K) -- the headroom RL has to work with

The elicitation gap is the cleanest single-number prediction of R1-Zero's potential. It is the pass@K of the base model minus the pass@1 of the base model -- literally, the difference between 'sometimes the model can do it' and 'reliably the model does it on the first try.' R1-Zero's job is to convert the first into the second. A run that closes 80% of this gap is a strong success; closing 100% is the asymptote.

EXAMPLE
p1_base = 0.21, pass@64 = 0.64  =>  gap@64 = +0.43. A successful RL run will lift pass@1 by ~0.34 (= 0.80 * 0.43), landing at pass@1 ~= 0.55. These are properties of the synthetic eval set; the real AIME-2024 figures for DeepSeek-V3-Base are in the table above.
simulate_post_rl_p1() -- the model of what RL actually does

This 3-line function captures the most important empirical regularity from the R1-Zero paper and every replication since: RL with a verifiable reward MOVES probability mass toward the best of K, but it does not invent new capability. The gamma factor (how much of the gap actually closes) depends on the KL coefficient, the number of training steps, the diversity of the training set, and the exploration noise -- but in published runs it lands consistently in the 0.6-0.9 range. Knowing this, you can budget GPU-hours against expected pass@1 gain before you start.

EXAMPLE
p1=0.10, K_train=64, gamma=0.80  =>  p_K=1-(0.9)^64~=0.999, post-RL p1 ~= 0.10 + 0.80*(0.999-0.10) = 0.82. RL turned a 10% problem into an 82% problem.
p1=0.001, K_train=64, gamma=0.80  =>  p_K~=0.062, post-RL p1 ~= 0.05. Hard problems stay hard; RL only reshapes the WITHIN-REACH problems.
pass@K AFTER RL -- the diversity collapse

This is the lesson R1-Zero forced everyone to internalise. After RL, pass@K is nearly flat as a function of K. The model has learned to put almost all of its probability mass on one trajectory per prompt; sampling again at temperature 1.0 just gives you the same answer in slightly different words. The diversity that produced pass@K(base) >> pass@1(base) is gone. This is good news for deployment (you get the right answer on the first try) and bad news for any future RL run on TOP of an R1-Zero model: there is no gap left to close, so RL has nothing to amplify.

EXAMPLE
Typical post-RL row: K=1 pass=0.55, K=4 pass=0.57, K=16 pass=0.58, K=64 pass=0.59. The 3.0x base-model lift from K=1 to K=64 has shrunk to 1.07x. Best-of-64 sampling is no longer worth its cost on this model. The 'next R1-Zero' would have to come from a STRONGER base, not from more RL on this one.
```python
"""
The R1-Zero ceiling theorem in plain Python.

After R1-Zero, the field had a clean conceptual lens for understanding what
RL on a verifiable reward can and cannot do:

    pass@1(after RL)  approaches  pass@K(before RL)

In words: RL cannot invent reasoning ability the base model never had a
chance of producing. It can ONLY consolidate the best of K samples into a
single attempt. The headroom for that is the 'elicitation gap', and once you
can compute it in a few lines of Python you can predict whether an RL run on
a new model is going to be worth the GPU-hours BEFORE you spend them.
"""

import random
from statistics import mean

# ---------------------------------------------------------------------------
# 1. Synthetic eval set. Each problem has a hidden per-sample success
#    probability p_i drawn from a Beta distribution -- a few easy problems
#    (p close to 1), many medium ones, a long tail of hard ones (p close
#    to 0). This shape mirrors what real benchmarks like AIME or MATH
#    actually look like for a strong base model.
# ---------------------------------------------------------------------------

random.seed(0)

def beta_sample(alpha: float, beta: float) -> float:
    """Tiny Beta sampler via the gamma/(gamma+gamma) identity."""
    x = random.gammavariate(alpha, 1.0)
    y = random.gammavariate(beta,  1.0)
    return x / (x + y)

# 200 problems; alpha<1, beta>1 -> long tail of hard problems with a few
# easy ones. This is roughly the shape of DeepSeek-V3-Base on AIME-2024.
N_PROBLEMS = 200
problem_p1 = [beta_sample(0.6, 2.2) for _ in range(N_PROBLEMS)]

# ---------------------------------------------------------------------------
# 2. pass@K from known per-problem probabilities. For a problem with
#    per-sample success probability p, the chance that AT LEAST ONE of K
#    independent samples succeeds is 1 - (1 - p)^K. Average over problems.
# ---------------------------------------------------------------------------

def pass_at_k(per_problem_p1: list[float], K: int) -> float:
    return mean(1.0 - (1.0 - p) ** K for p in per_problem_p1)

# Compute pass@K for K in {1, 2, 4, 8, 16, 32, 64, 128, 256}.
Ks = [1, 2, 4, 8, 16, 32, 64, 128, 256]
curve = [(K, pass_at_k(problem_p1, K)) for K in Ks]

print(f"Base model pass@K curve over {N_PROBLEMS} problems:")
print("    K |  pass@K")
print("  ----+--------")
for K, p in curve:
    bar = "█" * int(p * 40)
    print(f"  {K:3d} | {p:6.3f}  {bar}")

# ---------------------------------------------------------------------------
# 3. The elicitation gap. The R1-Zero observation, in one definition:
#
#       gap(K) = pass@K(base)  -  pass@1(base)
#
#    This is the headroom RL has. If gap(K) is small, RL cannot help much
#    -- there is no consolidatable signal. If gap(K) is large, RL has
#    something to amplify and a successful run will close most of the gap.
# ---------------------------------------------------------------------------

p1_base = pass_at_k(problem_p1, 1)
print(f"\npass@1(base)   = {p1_base:.3f}")
for K in [4, 16, 64]:
    gap = pass_at_k(problem_p1, K) - p1_base
    print(f"  gap @ K={K:3d}  = {gap:+.3f}   (= pass@{K} - pass@1)")

# ---------------------------------------------------------------------------
# 4. Simulate a successful R1-Zero run: pass@1 is consolidated UP toward
#    pass@K_train, where K_train is the sampling budget used during RL.
#    A 'perfect' run closes 100% of the gap; in practice DeepSeek closed
#    roughly 80% on AIME. We model that as p1_after = p1 + gamma * (p_K - p1)
#    with gamma in [0, 1].
# ---------------------------------------------------------------------------

K_TRAIN  = 64    # RL sampled K=64 completions per prompt
GAMMA    = 0.80  # fraction of the gap RL actually closes

def simulate_post_rl_p1(p1: float, K_train: int, gamma: float) -> float:
    p_Ktrain = 1.0 - (1.0 - p1) ** K_train
    return p1 + gamma * (p_Ktrain - p1)

problem_p1_after = [simulate_post_rl_p1(p, K_TRAIN, GAMMA) for p in problem_p1]

p1_after = pass_at_k(problem_p1_after, 1)
print(f"\npass@1 BEFORE RL = {p1_base:.3f}")
print(f"pass@1 AFTER  RL = {p1_after:.3f}    (RL closed {100*GAMMA:.0f}% of the K={K_TRAIN} gap)")

# ---------------------------------------------------------------------------
# 5. The crucial observation: pass@K AFTER RL collapses toward pass@1.
#    The probability mass piles onto one mode per prompt, and the
#    base-model 'broad spread' across many possible attempts is gone.
#    This is why ensembling-style techniques (best-of-N, MBR) lose almost
#    all their juice on a post-R1-Zero model.
#
#    Modelling assumption (a simplification, not a measured result):
#    post-RL samples are far from independent, so each problem's post-RL
#    pass@K is capped at the base model's pass@K_TRAIN. Treating post-RL
#    samples as independent draws at the new p1 would instead push
#    pass@K(after) toward 1 and hide the collapse.
# ---------------------------------------------------------------------------

def pass_at_k_after_rl(p1_before: list[float], p1_after_rl: list[float],
                       K: int, K_train: int) -> float:
    ceilings = [1.0 - (1.0 - p) ** K_train for p in p1_before]
    return mean(min(1.0 - (1.0 - pa) ** K, c)
                for pa, c in zip(p1_after_rl, ceilings))

print("\npass@K curve AFTER RL:")
print("    K |  pass@K(before)  pass@K(after)   delta")
print("  ----+--------------------------------------")
for K in Ks:
    pb = pass_at_k(problem_p1, K)
    pa = pass_at_k_after_rl(problem_p1, problem_p1_after, K, K_TRAIN)
    print(f"  {K:3d} |     {pb:6.3f}        {pa:6.3f}      {pa-pb:+.3f}")

print("\nReading the table: pass@1 jumps; pass@K(after) is pinned at the base")
print("pass@K_TRAIN ceiling, so it is nearly flat for K >> 1 and falls below the")
print("base curve once K exceeds K_TRAIN. RL ate the diversity that produced the")
print("gap in the first place. This is the R1-Zero ceiling theorem in plain Python.")
```

PyTorch: Estimating Pass@K From a Frozen Policy

Production GPU code that turns the previous script into a go/no-go decision. The sampling is batched with .expand(K, -1) so all K rollouts for a prompt run in a single generate() call; the pass@K estimator is the numerically stable unbiased form, so a single sweep at $n = 128$ samples gives you every $\mathrm{pass@}K$ for $K \le 128$ without re-sampling; the decision function bakes in the empirical $\gamma = 0.7$ from published R1-Zero replications and converts the elicitation gap into a dollar-cost estimate. This is the kind of screening script worth running before greenlighting an RL phase.

Pre-RL screening: pass@K estimator and the go/no-go function
screen_for_r1zero.py
sample_completions() -- the .expand(K, -1) trick is the cost model

Replicating the prompt K times along the batch dimension lets a single generate() call produce all K rollouts. Note that Hugging Face generate() still runs the prefill on each of the K identical rows; computing the prompt's KV-cache once and sharing it across rollouts needs a serving engine with prefix sharing, such as the vLLM-backed server mentioned in the script's docstring. Either way, for K=128 and a 512-token prompt the total wall-clock is dominated by the decoding of the K different continuations, not by the prompt processing, and that is what makes pass@K cheap enough to be a screening tool. On A100s a single 200-problem pass@128 sweep against a 70B base model runs in roughly 3-5 GPU-hours.

EXAMPLE
enc['input_ids'].shape = (1, 512). After .expand(128, -1): (128, 512). One generate() call returns out.shape = (128, 512 + max_new). Decoding 128 sequences in parallel uses ~128x more activation memory than 1 -- the practical K limit per GPU is determined by KV-cache size, not by model parameters.
pass_at_k_unbiased() -- the formula that prevents the obvious bug

The naive estimator ('we got c out of n correct, so the success rate is c/n') is only right for K=1; plugging it into 1 - (1 - c/n)^K underestimates pass@K for K > 1. The unbiased estimator pass@K = 1 - C(n-c, k) / C(n, k) is the probability that a random subset of K from the n samples contains at least one correct answer. It is what lets you sample n=128 ONCE and then report pass@1, pass@4, pass@16, pass@64 all from the same data without re-sampling. The product form avoids overflow when n is large and the binomial coefficient would blow up.

EXAMPLE
n=128 samples, c=42 correct.  pass@1 = 42/128 = 0.328.
pass@4  = 1 - (86*85*84*83) / (128*127*126*125) = 1 - 0.199 ~= 0.801.
pass@64 = 1 - C(86, 64)/C(128, 64) ~= 1.000 (almost certain a random 64 contains at least one of the 42 correct).
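To see that the product form really is the binomial-coefficient formula, and how it differs from the naive plug-in, the three can be compared side by side. This is a minimal sketch using only the standard library; the function names are chosen here for illustration, and the inputs are the n=128, c=42 example above.

```python
from math import comb


def pass_at_k_product(c: int, n: int, k: int) -> float:
    """Unbiased estimator in running-product form: 1 - prod (n-c-i)/(n-i)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    keep = 1.0
    for i in range(k):
        keep *= (n - c - i) / (n - i)
    return 1.0 - keep


def pass_at_k_comb(c: int, n: int, k: int) -> float:
    """Same estimator written with exact binomial coefficients."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_plugin(c: int, n: int, k: int) -> float:
    """Naive plug-in: treat c/n as the true p and apply 1 - (1 - p)^k."""
    return 1.0 - (1.0 - c / n) ** k


n, c = 128, 42
for k in (1, 4, 16, 64):
    print(f"k={k:3d}  product={pass_at_k_product(c, n, k):.4f}  "
          f"comb={pass_at_k_comb(c, n, k):.4f}  "
          f"plug-in={pass_at_k_plugin(c, n, k):.4f}")
```

The first two columns agree to floating-point precision. The plug-in matches at k=1 and sits at or below the unbiased value for larger k, since each factor (n-c-i)/(n-i) is at most (n-c)/n.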
measure_gap() -- the only function an ML lead actually runs

This function reduces 'should we spend a million dollars on R1-Zero?' to 'what is the shape of the pass@K curve on our eval?' For each problem we sample n_samples=128 completions ONCE, count how many are correct, then derive every pass@K up to K=128 from that single count via the unbiased estimator. The whole sweep is embarrassingly parallel across problems (one prompt per GPU node is fine), and the answer is sufficient to predict, to within a few percentage points, what the post-RL pass@1 will be.

EXAMPLE
Typical output on DeepSeek-V3-Base + AIME-2024:
  {1: 0.155, 4: 0.342, 16: 0.541, 64: 0.681} -- gap@64 = +0.526, GO.
Typical output on a 7B base on AIME-2024:
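  {1: 0.022, 4: 0.071, 16: 0.158, 64: 0.241} -- gap@64 = +0.219, expected lift = 0.7 * 0.219 = +0.153 (15.3 pp) on a 2.2% base. That clears the 5 pp MIN_LIFT threshold, so the script returns GO, but the predicted pass@1 of ~0.175 is still low in absolute terms.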
GAMMA_BUDGET = 0.70 -- the empirical constant that closes the loop

This number is the field's hard-won estimate of how much of the elicitation gap a well-executed R1-Zero run actually closes. The R1-Zero paper achieved ~0.80 on AIME with K_train=64. Subsequent open replications (Tulu-RL, Qwen-R1, OpenAI-internal o1) land between 0.55 and 0.85 depending on the KL coefficient, number of steps, exploration noise, and the reward-shape. 0.70 is the conservative production planning number -- if your math goes 'green' under gamma=0.70 it will likely go green under the real run; if it goes 'red' under 0.70, the real run is a coin flip.

EXAMPLE
If pass@1(base) = 0.15 and pass@64(base) = 0.70, expected_lift = 0.70 * (0.70 - 0.15) = +0.385, i.e. +38.5 percentage points. Predicted post-RL pass@1 = 0.535. R1-Zero actually achieved 0.71 on AIME-2024, so the conservative gamma underestimated the real outcome by roughly 17 points.
should_we_run_r1zero() -- the entire kill/go decision in 10 lines

This is the function R1-Zero's existence justified writing. Before R1-Zero, deciding whether to run a large RL phase was an act of faith dressed up as engineering: pick a base model, pick an SFT corpus, pick a reward model, hope. After R1-Zero, you can compute a pass@K curve on the base model alone, run it through this script, and get a +/- 5pp prediction of the post-RL pass@1 for under 1% of the cost of the actual run. Every major lab now has some version of this in their pre-training-decision pipeline.

EXAMPLE
For DeepSeek-V3-Base + AIME-2024 the decision would have come back as: pass@1(base)=0.155, pass@K(base)=0.700, gap=0.545, expected_lift=0.382, decision=GO, estimated_cost=$9.5M. The actual R1-Zero run on this setup landed at pass@1 ~= 0.71 -- a +55.5pp lift, well above the conservative +38.2pp prediction.
1"""
2Estimate pass@K (and therefore the elicitation gap) on a frozen base model
3BEFORE committing GPU-months to an R1-Zero run.
4
5The procedure is the same as the one DeepSeek used internally to decide
6which checkpoints were worth investing RL into. The trick: with a vLLM-
7backed sampling server you can run pass@128 on a few hundred problems for
8a few hundred GPU-hours -- well under 1% of the cost of the RL run itself.
9The decision that comes out of this script is whether that RL run is
10likely to be worth it at all.
11"""
12
13import re
14import torch
15from transformers import AutoModelForCausalLM, AutoTokenizer
16
17PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
18
19
20def rule_based_correct(text: str, gold: str) -> bool:
21    """Same scorer as R1-Zero, minus the format bonus."""
22    m = PATTERN.search(text)
23    return bool(m) and m.group(1).strip() == gold
24
25
26@torch.no_grad()
27def sample_completions(policy, tokenizer, prompt: str, K: int = 128,
28                       max_new: int = 512, temperature: float = 1.0):
29    """Sample K independent completions in a single batched forward pass."""
30    enc = tokenizer(prompt, return_tensors="pt").to(policy.device)
    out = policy.generate(
        enc["input_ids"].expand(K, -1),
        max_new_tokens=max_new,
        do_sample=True, temperature=temperature, top_p=1.0,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens so only the newly generated text is decoded.
    new_tokens = out[:, enc["input_ids"].size(1):]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)


def pass_at_k_unbiased(n_correct: int, n: int, k: int) -> float:
    """Numerically stable unbiased pass@K estimator (Chen et al., 2021).

       pass@k = 1 - C(n - c, k) / C(n, k)
    Computed as a running product of ratios, so the binomial coefficients
    are never materialised and cannot overflow when n is large.
    """
    if n - n_correct < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    # 1 - prod_{i=0..k-1} (n - c - i) / (n - i)
    keep = 1.0
    for i in range(k):
        keep *= (n - n_correct - i) / (n - i)
    return 1.0 - keep


def measure_gap(policy, tokenizer, problems: list[dict],
                n_samples: int = 128, k_grid: tuple[int, ...] = (1, 4, 16, 64)):
    """Return pass@K for each K in k_grid, averaged over problems.

    Each problem is a dict {'prompt': str, 'gold': str}.
    """
    # The estimator is only defined for K <= n_samples; without this check
    # a larger K would silently report pass@K = 1.0.
    if max(k_grid) > n_samples:
        raise ValueError("every K in k_grid must be <= n_samples")

    per_problem_correct = []
    for ex in problems:
        completions = sample_completions(policy, tokenizer, ex["prompt"], K=n_samples)
        n_correct = sum(rule_based_correct(c, ex["gold"]) for c in completions)
        per_problem_correct.append(n_correct)

    curve = {}
    for K in k_grid:
        # Per-problem unbiased pass@K, then average.
        passes = [pass_at_k_unbiased(c, n_samples, K) for c in per_problem_correct]
        curve[K] = sum(passes) / len(passes)
    return curve


# ---------------------------------------------------------------------------
# The decision rule. Whether an R1-Zero run is worth the GPU-hours.
# ---------------------------------------------------------------------------

GAMMA_BUDGET    = 0.70   # conservative: assume RL closes only 70% of the gap
MIN_LIFT        = 0.05   # we want at least +5 pp on pass@1 to bother
COST_PER_PP_USD = 250_000  # rough internal-finance number for big runs

def should_we_run_r1zero(curve: dict[int, float]) -> dict:
    p1     = curve[1]
    # pass@K is non-decreasing in K, so the max is the value at the largest K.
    p_Kmax = max(curve.values())
    gap    = p_Kmax - p1
    expected_lift = GAMMA_BUDGET * gap
    expected_pp   = 100.0 * expected_lift
    return {
        "pass@1(base)":      p1,
        "pass@Kmax(base)":   p_Kmax,
        "elicitation_gap":   gap,
        "expected_pass@1":   p1 + expected_lift,
        "expected_lift_pp":  expected_pp,
        "estimated_cost_USD": expected_pp * COST_PER_PP_USD,
        "decision": "GO" if expected_lift >= MIN_LIFT else "SKIP",
    }


# ---------------------------------------------------------------------------
# Example usage on a real base model. This block is what an ML lead would
# run on a Friday to decide whether to greenlight a Monday training run.
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    tok    = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3-Base")
    policy = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/DeepSeek-V3-Base",
        torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()

    # About 200 problems gives a usable gap estimate, but the standard error
    # of a mean pass rate over 200 problems can still be a few pp, so treat
    # gaps close to the MIN_LIFT threshold with caution.
    # Note: AIME 2024 alone has 30 problems, so a 200-problem set has to
    # pool other sources; the loader below is your own.
    problems = load_aime_2024(n=200)   # your own loader

    curve = measure_gap(policy, tok, problems, n_samples=128)
    print("Base-model pass@K curve:", curve)

    verdict = should_we_run_r1zero(curve)
    for k, v in verdict.items():
        print(f"  {k:22s}: {v}")
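
To see how the decision rule behaves without loading a model, the sketch below repeats the same arithmetic on two made-up pass@K curves. The curve values and the helper name `expected_lift_pp` are illustrative assumptions, not measurements; the 0.70 closure factor and the 5 pp threshold are the constants from the script above.

```python
# Hypothetical pass@K curves; the numbers are made up for illustration only.
GAMMA_BUDGET = 0.70  # assumed fraction of the gap that RL closes
MIN_LIFT_PP = 5.0    # minimum pass@1 lift (percentage points) worth a run


def expected_lift_pp(curve: dict[int, float]) -> float:
    """Expected pass@1 lift in pp: gamma * (best pass@K - pass@1) * 100."""
    gap = max(curve.values()) - curve[1]
    return 100.0 * GAMMA_BUDGET * gap


wide_gap = {1: 0.15, 4: 0.30, 16: 0.48, 64: 0.65}    # large elicitation gap
narrow_gap = {1: 0.60, 4: 0.63, 16: 0.65, 64: 0.66}  # base already near ceiling

for name, curve in [("wide_gap", wide_gap), ("narrow_gap", narrow_gap)]:
    lift = expected_lift_pp(curve)
    decision = "GO" if lift >= MIN_LIFT_PP else "SKIP"
    print(f"{name}: expected lift {lift:.1f} pp -> {decision}")
```

The second curve has the higher pass@1 but fails the screen: with only 6 pp between pass@1 and pass@64, even a fully successful run has little left to consolidate.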

Lesson 4: Pretraining Is the Real Bottleneck

Combine the previous three lessons and a sobering corollary falls out. The post-RL pass@1 is bounded by $\mathrm{pass@}K(\pi_{\text{base}})$, the gap-closure factor $\gamma$ is roughly constant across replications, and the per-prompt $p_1$ distribution is set by the pretraining run. Therefore the only long-term lever for better reasoning performance is the pretraining mixture. RL is a consolidation step; it cannot manufacture capability the base model does not have, no matter how many GPU-hours you throw at it.

This reorders the priorities of the entire post-ChatGPT stack. For three years, the marginal dollar in the labs went to better SFT data, better preference data, better reward models, and bigger critics. R1-Zero showed that all of those line items optimise the same metric, the gap-closure factor $\gamma$, and that $\gamma$ has a hard ceiling of 1. The only way to move past that ceiling is to enlarge the base model's pass@K, which means more pretraining compute, better pretraining data, better pretraining objectives. The lesson, put bluntly: after R1-Zero, the optimal allocation of an incremental dollar is roughly 90% pretraining, 10% post-training, on the relevant verifiable-reward domains.
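
The ceiling argument can be made concrete with the same linear gap-closure model the screening script uses (expected pass@1 equals base pass@1 plus the closure factor times the gap). The sketch below compares two levers on a hypothetical base model; all pass rates are invented for illustration, and `expected_pass_at_1` is a helper name introduced here.

```python
def expected_pass_at_1(p1: float, p_k: float, gamma: float) -> float:
    """Post-RL pass@1 under the gap-closure model: p1 + gamma * (pass@K - p1)."""
    return p1 + gamma * (p_k - p1)


# Hypothetical base model: pass@1 = 0.20, pass@64 = 0.60, gamma = 0.70.
baseline = expected_pass_at_1(0.20, 0.60, 0.70)

# Lever A: perfect post-training (gamma = 1.0), same base model.
better_rl = expected_pass_at_1(0.20, 0.60, 1.00)

# Lever B: a stronger base model (assumed curve), same gamma as the baseline.
better_base = expected_pass_at_1(0.30, 0.80, 0.70)

print(f"baseline      : {baseline:.2f}")
print(f"gamma = 1.0   : {better_rl:.2f}  (capped at the base pass@K)")
print(f"stronger base : {better_base:.2f}")
```

Under these assumed numbers, even a perfect closure factor stops at the base model's pass@K, while a stronger base model with an unchanged closure factor lands above that cap. How much pretraining spend it takes to shift the curve by that amount is a separate empirical question the sketch does not address.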

The labs noticed immediately. Within six months of the R1-Zero paper, three of the four largest frontier labs publicly announced a rebalancing toward larger pretraining runs and a reduction in the size of their reasoning-SFT teams. The R1-Zero hypothesis turned out to be a hypothesis about where the marginal capability actually lives, and the answer was “in pretraining, upstream of post-training, where it always was.”

Lesson 5: Verifiable Reward Is a Different Universe

Every lesson so far has carried a quiet asterisk: on verifiable-reward domains. Math problems, coding challenges, logic puzzles, formal proofs — tasks where a deterministic function can read the output and return a correct scalar. The R1-Zero recipe lives inside that universe and is not directly transferable outside it. The fifth lesson is the delineation: what is the boundary between domains where R1-Zero works and domains where it cannot?

Three properties define the verifiable-reward universe. (1) Determinism: the same output and the same gold give the same reward, always. (2) Sparsity tolerance: the reward can be 0 on most completions early in training without breaking learning, because GRPO's group-relative advantage carries signal whenever the rewards within a group are not all identical. (3) Cheap evaluation: the reward function runs in microseconds per completion, so the rollout server can score K=64 generations as fast as it can produce them. Math, code, and formal logic satisfy all three. Creative writing, dialogue, open-ended advice, and multi-turn agentic tasks fail the first.
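
A minimal sketch of the first two properties, assuming answers are reported in a `\boxed{...}` span and that advantages are normalised by the group's population standard deviation (implementations differ on this detail). The names `rule_reward` and `group_advantages` are illustrative and not taken from any particular library.

```python
import re
from statistics import mean, pstdev

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")  # assumed answer format


def rule_reward(completion: str, gold: str) -> float:
    """Deterministic 0/1 reward: the last \\boxed{...} must equal the gold string."""
    matches = BOXED.findall(completion)
    return 1.0 if matches and matches[-1].strip() == gold.strip() else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


completions = [r"... so \boxed{41}", r"... hence \boxed{42}",
               "no final answer", r"\boxed{40}"]
rewards = [rule_reward(c, "42") for c in completions]

print(rewards)                                  # one success among four samples
print(group_advantages(rewards))                # the success gets a positive advantage
print(group_advantages([0.0, 0.0, 0.0, 0.0]))   # identical rewards: no signal
```

A single correct completion in a group is enough to produce a non-zero update, which is why sparse rewards are tolerable. A group whose rewards are all identical contributes nothing, so prompts the base model never solves are invisible to this objective.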

That is why post-training research in the year that followed divided cleanly into two camps. One camp (verifiable-reward) inherits the R1-Zero recipe wholesale and extends the regex/unit-test family of rewards into theorem proving, SQL generation, tool use with verifiable outputs, and agentic workflows with end-state checks. The other camp (preference-reward) is back to the RLHF stack, but now with the explicit acknowledgement that on non-verifiable tasks, the reward model is the bottleneck and there is no R1-Zero shortcut around it. The two camps are not in conflict; they are operating in different universes that the R1-Zero result drew a sharp line between.

| Property | Verifiable (R1-Zero applies) | Non-verifiable (RLHF still needed) |
| --- | --- | --- |
| Reward function | regex / unit test / proof checker | learned model on human preferences |
| Reward latency | ~10 μs | ~50–500 ms (forward pass) |
| Reward hacking | constrained by regex syntax | open-ended, requires constant patching |
| Example domains | math, code, logic, formal proofs, SQL | chat, writing, advice, summarisation |
| Need for SFT? | no (R1-Zero) | yes (still load-bearing) |
| Need for critic? | no (GRPO) | no (DPO/IPO eliminated it) |
| Need for reward model? | no | yes (the central object) |

Engineering Reality: What Every Lab Did After R1

The R1 paper landed in January 2025. The downstream engineering response across the field was unusually uniform — uniform enough to be worth cataloguing as a single set of post-R1 norms.

  • Pre-RL pass@K screening became standard. The PyTorch screening script above is now part of the standard pre-training-decision pipeline at every major lab. Before committing to an RL phase, the lead runs pass@128 on the candidate base model against the target eval and computes $\Delta_K$. A run with $\Delta_K < 0.10$ is usually not greenlit.
  • Critics were quietly retired. Within nine months of R1, GRPO-style group-mean baselines had replaced PPO critics in the production training stack at Meta, Mistral, Alibaba, xAI, and the open-source TRL library. The critic network — for years a non-negotiable component of RL for LLMs — disappeared from a generation of code without controversy.
  • The reward-model team got smaller. On verifiable-reward domains the reward model has no role to play. Several labs publicly disbanded their reward-modelling teams on reasoning tasks and reassigned the headcount to pretraining-data curation. (Reward-modelling teams on chat-alignment tasks remained intact, in line with Lesson 5.)
  • vLLM-decoupled rollout became table stakes. R1-Zero's feasibility at 671B depended on running the rollout sampling on a separate fleet of inference servers from the training cluster, with periodic weight sync. This architectural choice — not GRPO itself — is what most replications copied first. OpenRLHF, TRL, and verl all shipped first-class support for vLLM-backed rollout within a quarter.
  • The training set shifted toward math and code. If your post-training pipeline is now “regex on the output,” your training set has to be tasks where a regex can score the output. The fraction of the post-training corpus allocated to math, competitive programming, formal logic, and tool use jumped sharply across the frontier labs in 2025. The chat data did not go away, but it stopped growing.
  • The pretraining mixture got more math-heavy. Lesson 4 implies that the way to improve a future R1-Zero run is to raise the base model's $p_1$ on the relevant domains, and the way to raise $p_1$ on math is to put more math in pretraining. Frontier pretraining mixtures publicly jumped from ~5% math/code to ~20% math/code in the year following R1, with measurable downstream effect on $\Delta_K$.
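
The decoupled-rollout item in the list above has one consequence worth seeing in miniature: between weight syncs, the inference fleet samples from a policy that lags the trainer. The toy below is a pure-Python simulation under stated assumptions (one optimiser step per loop iteration, a sync every `sync_every` steps); `ToyRolloutServer` and `run` are hypothetical names and do not correspond to the API of vLLM, TRL, OpenRLHF, or verl.

```python
from dataclasses import dataclass


@dataclass
class ToyRolloutServer:
    """Stand-in for an inference fleet; tracks which weights it is serving."""
    weights_version: int = 0

    def sync(self, version: int) -> None:
        self.weights_version = version

    def generate(self, prompt: str, k: int) -> list[tuple[str, int]]:
        # Tag each fake completion with the weights version that produced it.
        return [(f"{prompt} -> sample {i}", self.weights_version) for i in range(k)]


def run(n_steps: int, sync_every: int) -> list[int]:
    """Return, per training step, how many updates the rollout weights lag by."""
    server, trainer_version, lag = ToyRolloutServer(), 0, []
    for _ in range(n_steps):
        rollouts = server.generate("prompt", k=4)
        lag.append(trainer_version - rollouts[0][1])
        trainer_version += 1                  # one optimiser step on the trainer
        if trainer_version % sync_every == 0:
            server.sync(trainer_version)      # periodic weight push
    return lag


print(run(n_steps=6, sync_every=3))  # → [0, 1, 2, 0, 1, 2]
```

In this toy the lag never exceeds `sync_every - 1` updates. Choosing the sync interval is therefore a trade between weight-transfer overhead and how off-policy the sampled completions are allowed to become.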

What R1-Zero Did Not Settle

Two questions are still actively contested as of writing, and the next several years of post-training research will hinge on the answers.

Does the ceiling theorem hold for novel reasoning?

The strong reading of Lesson 2 says: RL cannot push past $\mathrm{pass@}K(\pi_{\text{base}})$ because there is nothing past it to push to. The weak reading admits that during RL, the policy can compose chain-of-thought patterns it had never sampled coherently before, which means the same K sampled from the post-RL model will solve problems neither the base model nor any single base-K sampling sweep ever solved. The empirical evidence is mixed: most large-scale runs do show some post-RL pass@K above the pre-RL pass@K on the hardest problems, but the effect is small (a few percentage points) and confounded by decoding-strategy changes. If the strong reading is true, reasoning capability is permanently bounded by pretraining. If the weak reading is true, RL can compose ladder rungs the pretraining run never explicitly placed. Resolving this is the most important open empirical question in post-training.
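
One way to probe the two readings is to sample the base and post-RL models with the same budget and decoding settings, then compare pass@K and list the problems solved only after RL. The sketch below assumes per-problem correct counts have already been collected; the counts and problem ids are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k = 1 - C(n - c, k) / C(n, k) (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical correct counts out of n = 128 samples per problem.
n, k = 128, 64
base_correct = {"p1": 40, "p2": 3, "p3": 0, "p4": 0}
post_correct = {"p1": 120, "p2": 90, "p3": 2, "p4": 0}

base_pk = sum(pass_at_k(n, c, k) for c in base_correct.values()) / len(base_correct)
post_pk = sum(pass_at_k(n, c, k) for c in post_correct.values()) / len(post_correct)

# Problems the base model never solved in n samples but the post-RL model did.
newly_solved = [p for p in base_correct
                if base_correct[p] == 0 and post_correct[p] > 0]

print(f"base pass@{k}: {base_pk:.3f}   post-RL pass@{k}: {post_pk:.3f}")
print("solved only after RL:", newly_solved)
```

A non-empty `newly_solved` list is suggestive rather than conclusive: zero successes in 128 base samples only bounds the base model's per-sample success rate, it does not show the rate is zero. Distinguishing the two readings needs a much larger base-model sampling budget on exactly those problems.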

Does the recipe transfer to non-verifiable domains?

Lesson 5 marks the boundary between R1-Zero's universe and everything else. The aspirational result — not yet demonstrated at scale — would be a “process reward model” trustworthy enough to substitute for the regex on domains where there is no regex. Several research lines (PRM800k, OpenAI's o1-system-card hints at PRM scoring, Anthropic's constitutional-AI loops) are pushing in this direction. None has yet produced the “R1-Zero for chat” result — a recipe where rule-based-style RL on a learned process reward, starting from a base model with no SFT, produces a chat model that is competitive with one trained the conventional way. If somebody publishes that result, the SFT step goes the way of the critic. Until then, RLHF's full stack is still load-bearing on non-verifiable domains, and R1-Zero remains a powerful but domain-bounded result.

The single sentence to remember. R1-Zero showed the field that the base model was already most of the reasoning model, that SFT-on-CoT was a load-bearing assumption rather than a load-bearing component, and that the optimal allocation of an incremental training dollar is — for verifiable-reward domains, at least — further upstream than anyone had been spending it. The next chapter, on R1 proper, shows what happens when you wrap this insight in just enough conventional alignment to ship it.