The Real Problem: How Do You Score a Reasoning Model?
The previous three sections argued that pure RL on a base model should work, set up the training machinery, and watched the "aha moment" emerge. None of that is convincing without numbers. A reasoning model that feels like it is reasoning, on a few cherry-picked traces, is the same evidence we had for every chain-of-thought paper since 2022. The R1-Zero result is interesting precisely because it produced numbers — reproducible, benchmark-aligned, comparable to the strongest closed model at the time.
But scoring a stochastic reasoning model is not the same as scoring an image classifier. A classifier returns one label and you compute top-1 accuracy. A reasoning model returns text — and that text was sampled, which means a different seed produces a different answer. The first methodological problem of the R1-Zero paper is which scalar do we put on the y-axis? The wrong choice can hide a 20-point training improvement or invent a 20-point one that does not exist.
What "the result" means: "DeepSeek-R1-Zero reached 71.0% pass@1 on AIME 2024, and 86.7% with majority voting over 64 samples, starting from a base model at 15.6%." Every word in that sentence is doing work, and you cannot trust the result until you can write out exactly what each one means.
Intuition: pass@1, pass@k, and Majority Voting
Imagine the model takes one AIME problem and writes a 10,000-token chain of reasoning ending in an <answer> tag. Sample it twice with the same prompt and you can get two different answers, because decoding is stochastic. So a single "is it correct?" check is itself a random variable, not a constant. The three metrics below differ in how they reduce that randomness to a single number.
- pass@1: sample one completion, check if it is correct. Reported as the fraction of correct samples across many problems and many seeds. This is the user-experience number: if you click "regenerate" once on a new problem, how often do you get the right answer?
- pass@k: sample $k$ completions, count it as a success if at least one is correct. This is the ceiling number: it measures whether the model can solve the problem at all, given enough tries. pass@k never decreases as $k$ grows, and it saturates at 1.0 on any problem the model solves with nonzero probability.
- cons@k: sample $k$ completions, extract their answers, take the mode, count it correct if the mode equals gold. This is the test-time-compute number: it answers "if I am willing to spend more inference, how much accuracy do I get back?" cons@k rises with $k$ only when the model's wrong answers are spread across many distinct wrong values, while correct answers concentrate on the one true value.
The Headline Numbers
This is the table the rest of the chapter exists to explain. DeepSeek-R1-Zero is the pure-RL run on the DeepSeek-V3-Base 671B-parameter model. The reference column is OpenAI's o1-0912, the strongest reasoning model with published numbers at the time of the paper.
| Benchmark | Base model (no RL) | R1-Zero pass@1 | R1-Zero cons@64 | o1-0912 pass@1 |
|---|---|---|---|---|
| AIME 2024 | 15.6% | 71.0% | 86.7% | 74.4% |
| MATH-500 | 44.0% | 95.9% | — | 94.8% |
| GPQA Diamond | 23.1% | 73.3% | — | 77.3% |
| LiveCodeBench v4 | 12.6% | 50.0% | — | 63.4% |
| CodeForces (rating) | — | 1444 | — | 1843 |
Three things to notice. First, the AIME jump from 15.6% to 71.0% is the entire R1-Zero result. That is +55.4 points of accuracy with no supervised data, just RL on a regex-based reward. Second, R1-Zero edges past o1-0912 on MATH-500 (95.9% vs 94.8%) and comes within 3.4 points on AIME pass@1 (71.0% vs 74.4%), but lags on GPQA Diamond and on the code-heavy benchmarks (LiveCodeBench, CodeForces). RL with verifiable rewards is strong where the reward is tight (math), weaker where the reward is harder to define (code style, runtime correctness). Third, cons@64 on AIME (86.7%) exceeds o1-0912's pass@1 (74.4%), although that compares 64 samples against one; the 15.7-point gap between 71.0 and 86.7 is what extra test-time compute buys.
Interactive: The 8K-Step Training Curve
Drag the slider in the chart below. The two curves are the AIME-2024 pass@1 and cons@64 reported every ~200 RL steps over an ~8000-step run on the 671B base model. The dashed line is o1-0912's 74.4% pass@1 for reference. No supervised fine-tuning; no learned reward model; no curriculum.
Two patterns in the curve are worth pausing on. The growth is smooth and monotone — there is no phase transition, no "capability emerged at step 5000". RL converts a small per-step gradient into a continuous accuracy gain across thousands of steps. And cons@64 is always above pass@1, with the gap widening as training progresses. Early in training, the model gets the right answer rarely; majority voting cannot help if the right answer is not in the bag. Late in training, the model gets the right answer often but inconsistently; majority voting now reliably picks it.
Interactive: Why Response Length Grew ~30×
The most-shared figure from the R1-Zero paper is not the accuracy curve — it is the length curve. Nobody added a length term to the reward. The reward only inspects the final <answer> tag. And yet the model's average completion length grew from ~320 tokens at step 0 to over 10,000 tokens by step 8000.
Longer reasoning chains are more likely to land on the correct answer on hard problems, so RL slowly amplifies longer completions. By step 8000 the average completion is ~31× longer than at step 0.
The mechanism is straightforward in hindsight: RL with a noisy positive-when-correct reward amplifies any behaviour that correlates with success. On hard problems, longer chains-of-thought correlate with success because they give the model more room to backtrack, re-check, and try alternative paths. So the gradient update consistently pushes probability toward longer-completion behaviour patterns: verifying intermediate steps, restating the question, listing multiple candidate solutions. None of this was prompted or supervised; all of it was selected.
Test-Time Compute Scaling: cons@k Beats pass@1
If you take the trained R1-Zero policy and ask it the same problem more times, how much accuracy do you get per extra sample? This is the test-time-compute question. Slide the per-sample accuracy below to see how pass@k and cons@k respond to $k$.
Two regimes are visible. When per-sample accuracy is high (right end), both curves climb to nearly 100% within a few samples, so extra sampling is an easy win. When per-sample accuracy is low (left end), pass@k still climbs (you might get lucky once) but cons@k stagnates, because some wrong answer collects more votes than the right one and the mode stays wrong. R1-Zero's 71% pass@1 sits in the sweet spot where both curves are still climbing, which is why cons@64 can buy another 15.7 points on AIME.
The Math: pass@k, cons@k, and Why They Differ
Let $n$ be the number of completions sampled per problem, and $c$ the number of those that are correct. The naïve estimator for pass@1 is the per-completion correctness rate $c/n$. The naïve estimator for pass@k would be to subsample $k$ completions, check if any are correct, and repeat, but this is wasteful. The Codex paper (Chen et al. 2021) gives an unbiased closed form that uses all $n$ samples:
$$\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$
The fraction $\binom{n-c}{k} / \binom{n}{k}$ is the probability that a draw of $k$ samples (without replacement) from the $n$ available misses all $c$ correct ones, i.e., picks all $k$ from the $n-c$ wrong ones. Subtract from 1 to get the probability of getting at least one correct sample. When $n - c < k$ (fewer wrong samples than the draw size, so any draw of size $k$ hits a correct one), pass@k saturates at 1.0.
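The estimator described above, 1 - C(n-c, k) / C(n, k), can be evaluated without ever forming a large binomial coefficient. The sketch below uses the running-product form given in the Codex paper's reference snippet. The function name `pass_at_k`, the input validation, and the pure-Python loop (instead of NumPy) are choices made here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Evaluates 1 - C(n-c, k) / C(n, k) as a running product, using
    C(n-c, k) / C(n, k) = prod_{i=n-c+1}^{n} (1 - k / i).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples: every draw of size k contains a correct one.
        return 1.0
    prob_all_wrong = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_wrong *= 1.0 - k / i
    return 1.0 - prob_all_wrong


if __name__ == "__main__":
    # Cross-check the product form against the direct binomial form.
    n, c = 8, 4
    for k in (1, 2, 4, 8):
        direct = 1.0 - comb(n - c, k) / comb(n, k)
        print(k, round(pass_at_k(n, c, k), 4), round(direct, 4))
```

The product form matters mostly for large $n$, where the binomials are huge even though their ratio is an ordinary probability.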
cons@k is harder to write as a closed form because it depends on the distribution over distinct wrong answers, not just the correct-vs-wrong split. The simplest model is to assume $A$ distinct plausible answers (one correct with probability $p$, and $A-1$ wrong answers sharing the remaining mass $1-p$ uniformly). Then $\text{cons@}k \to 1$ as $k \to \infty$ if and only if $p > \frac{1-p}{A-1}$, which rearranges to $p > 1/A$. Below that threshold, majority voting cannot rescue you no matter how much you sample; above it, voting eventually saturates at 100%.
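The threshold at 1/A can be checked numerically. The sketch below is a Monte Carlo simulation of the simplified model only (A answers, the correct one with probability p, the wrong ones sharing the rest uniformly). It is not a model of R1-Zero's actual answer distribution. The function name, the default trial count, and the uniform random tie-break are assumptions made here.

```python
import random
from collections import Counter


def simulate_cons_at_k(p, num_answers, k, trials=2000, seed=0):
    """Estimate cons@k under the uniform-wrong-answer model by simulation."""
    rng = random.Random(seed)
    wrong_mass = (1.0 - p) / (num_answers - 1)
    answers = list(range(num_answers))          # answer 0 is the correct one
    weights = [p] + [wrong_mass] * (num_answers - 1)
    wins = 0
    for _ in range(trials):
        votes = Counter(rng.choices(answers, weights=weights, k=k))
        top = max(votes.values())
        leaders = [a for a, v in votes.items() if v == top]
        # Assumption: ties for the mode are broken uniformly at random.
        wins += rng.choice(leaders) == 0
    return wins / trials


if __name__ == "__main__":
    # With A = 5 the threshold is p = 0.2: one p above it, one below.
    for p in (0.4, 0.1):
        curve = [simulate_cons_at_k(p, 5, k) for k in (1, 4, 16, 64)]
        print(f"p={p}: cons@1,4,16,64 =", curve)
```

Above the threshold the simulated curve should rise toward 1 as $k$ grows. Below it, the curve should fall toward 0, because extra samples only make the wrong mode more certain.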
Manual Numerical Walkthrough: Computing cons@8 by Hand
One problem, eight samples, every step shown.
Problem. "If $2x + 3 = 17$, what is $x$?" Gold answer: $7$.
Eight sampled completions. We extract the contents of <answer>…</answer>:
| # | Completion (truncated) | Extracted answer | Correct? |
|---|---|---|---|
| 1 | <think>2x=14, x=7.</think><answer>7</answer> | 7 | ✓ |
| 2 | <think>2x=14, x=7.</think><answer>7</answer> | 7 | ✓ |
| 3 | <think>2x=14, x=7.</think><answer>7</answer> | 7 | ✓ |
| 4 | <think>x=(17-3)/2=7.</think><answer>7</answer> | 7 | ✓ |
| 5 | <think>2x=20, x=10.</think><answer>10</answer> | 10 | ✗ |
| 6 | <think>2x=14, x=6.</think><answer>6</answer> | 6 | ✗ |
| 7 | <think>x=17-3=14.</think><answer>14</answer> | 14 | ✗ |
| 8 | blah blah </think>oops | (no tag) | ✗ |
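Step 1: pass@1. With $n = 8$ samples and $c = 4$ correct, the naïve pass@1 is $c/n = 4/8 = 0.5$ (50%).
Step 2: pass@k from the same 8 samples using the Codex formula $1 - \binom{n-c}{k} / \binom{n}{k}$:
- $\text{pass@}2 = 1 - \binom{4}{2} / \binom{8}{2} = 1 - 6/28 \approx 0.786$ (78.6%)
- $\text{pass@}4 = 1 - \binom{4}{4} / \binom{8}{4} = 1 - 1/70 \approx 0.986$ (98.6%)
- $\text{pass@}8 = 1 - \binom{4}{8} / \binom{8}{8} = 1 - 0/1 = 1$ (100%, since picking 8 from 8 always hits all 4 correct)
Step 3: cons@8. Collect the valid extracted answers: ["7", "7", "7", "7", "10", "6", "14"] (sample #8 had no tag, so it is dropped, not counted as a vote). Tally the votes: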
| Answer | Votes |
|---|---|
| 7 (gold) | 4 |
| 10 | 1 |
| 6 | 1 |
| 14 | 1 |
Mode is 7, which equals gold, so cons@8 = 1 (correct). Notice that cons@8 won despite only 50% of samples being correct: the four wrong samples produced three distinct wrong answers (10, 6, 14) and one unparsable completion, so no wrong answer came close to the correct one's vote count.
Counterfactual. Suppose three of the wrong samples had all said 14 instead. Then the tally would be 7: 4 votes, 14: 3 votes, and cons@8 is still correct because the mode is 7. But if five of the seven valid samples had said 14, the mode would be 14 and cons@8 would flip to wrong, even though the correct answer is still present in the bag. This is the failure mode behind the 1/A threshold: a single wrong answer that outvotes the correct one.
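The arithmetic in this walkthrough is small enough to re-run in a few lines. The sketch below recomputes the pass@k values and the vote tallies, including both counterfactuals. In the "five say 14" case it assumes the other two valid samples still answer 7, which the text leaves open.

```python
from collections import Counter
from math import comb

gold = "7"
# Extracted answers for the eight samples; None marks the completion with no <answer> tag.
extracted = ["7", "7", "7", "7", "10", "6", "14", None]

n = len(extracted)
c = sum(a == gold for a in extracted)


def pass_at_k(n, c, k):
    # math.comb returns 0 when k > n - c, so the estimate becomes exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority(answers):
    # Untagged samples (None) cast no vote.
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


print("pass@1 =", c / n)                         # 0.5
for k in (2, 4, 8):
    print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")  # 0.786, 0.986, 1.000
print("cons@8 mode:", majority(extracted))       # 7

three_say_14 = ["7"] * 4 + ["14"] * 3 + [None]
five_say_14 = ["7"] * 2 + ["14"] * 5 + [None]    # assumption: the remaining two say 7
print(majority(three_say_14), majority(five_say_14))  # 7 14
```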
Plain Python: pass@k and cons@k from Rollouts
Here is the full eval pipeline in plain Python: extraction, pass@k, cons@k, and a benchmark loop. No PyTorch, no GPU. The code reads completions from a JSONL file (or from any sampler function) and computes the same metrics reported in the R1-Zero table.
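A minimal sketch of such a pipeline follows. The JSONL schema (one object per problem with `gold` and `completions` fields), the exact-string answer match, the choice to vote over the first k samples, and all function names are assumptions made for illustration. A production grader would normalise mathematically equivalent answers before comparing.

```python
import json
import re
from collections import Counter
from math import comb

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion):
    """Return the text of the last <answer>...</answer> pair, or None if absent."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct (Chen et al. 2021)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def cons_at_k(answers, gold, k):
    """Majority vote over the first k samples; untagged samples cast no vote."""
    votes = Counter(a for a in answers[:k] if a is not None)
    if not votes:
        return 0.0
    mode, _ = votes.most_common(1)[0]  # ties resolve to the answer seen first
    return float(mode == gold)


def evaluate(records, k):
    """Average pass@1, pass@k and cons@k over an iterable of problem records."""
    totals = {"pass@1": 0.0, "pass@k": 0.0, "cons@k": 0.0}
    num_problems = 0
    for rec in records:
        answers = [extract_answer(text) for text in rec["completions"]]
        n = len(answers)
        c = sum(a == rec["gold"] for a in answers)
        totals["pass@1"] += c / n
        totals["pass@k"] += pass_at_k(n, c, k)
        totals["cons@k"] += cons_at_k(answers, rec["gold"], k)
        num_problems += 1
    if num_problems == 0:
        raise ValueError("no problems to evaluate")
    return {name: total / num_problems for name, total in totals.items()}


def load_jsonl(path):
    """Yield one problem record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Under these assumptions, `evaluate(load_jsonl("rollouts.jsonl"), k=64)` returns the three averages for a file of saved rollouts (the file name is a placeholder). Because `evaluate` accepts any iterable of records, a sampler that yields the same dictionaries can be swapped in for the file.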
PyTorch: Logging These Metrics During GRPO
In a real training run the metrics live in a callback that fires every few hundred steps, runs the sampler on a held-out problem set, and logs three numbers. The callback shares no state with the GRPO step; it just needs the policy, the tokenizer, and the eval set.
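A framework-agnostic sketch of such a callback is shown below. The names `EvalCallback`, `sample_fn`, and `log_fn` are hypothetical. In a PyTorch run, `sample_fn` would wrap the policy and tokenizer (with generation under `torch.no_grad()`); that wrapping is assumed here rather than shown, so the block runs without a GPU. Logging pass@1, pass@k, and a majority vote over all n samples is one reading of "three numbers", not something the text specifies.

```python
from collections import Counter
from math import comb


class EvalCallback:
    """Runs a held-out evaluation every `every` steps and logs three metrics."""

    def __init__(self, sample_fn, eval_set, log_fn, every=200, n=64, k=8):
        # Hypothetical interfaces, assumed for this sketch:
        #   sample_fn(prompt, n) -> list of n extracted answers (None if untagged)
        #   log_fn(step, metrics) -> None
        self.sample_fn = sample_fn
        self.eval_set = eval_set  # list of (prompt, gold_answer) pairs
        self.log_fn = log_fn
        self.every, self.n, self.k = every, n, k

    def on_step_end(self, step):
        if step % self.every != 0:
            return None
        pass1 = passk = cons = 0.0
        for prompt, gold in self.eval_set:
            answers = self.sample_fn(prompt, self.n)
            n, c = len(answers), sum(a == gold for a in answers)
            pass1 += c / n
            passk += 1.0 - comb(n - c, self.k) / comb(n, self.k)
            votes = Counter(a for a in answers if a is not None)
            # Majority vote over all n samples; no votes counts as wrong.
            cons += float(bool(votes) and votes.most_common(1)[0][0] == gold)
        size = len(self.eval_set)
        metrics = {"pass@1": pass1 / size, "pass@k": passk / size, "cons@n": cons / size}
        self.log_fn(step, metrics)
        return metrics


if __name__ == "__main__":
    def fake_sampler(prompt, n):
        # Stand-in for the real policy: 5 of 8 samples answer "7".
        return (["7"] * 5 + ["9"] * 3)[:n]

    history = []
    callback = EvalCallback(fake_sampler, [("If 2x + 3 = 17, what is x?", "7")],
                            lambda step, m: history.append((step, m)),
                            every=200, n=8, k=2)
    for step in range(0, 401, 100):
        callback.on_step_end(step)
    print([step for step, _ in history])
```

Because step 0 satisfies the modulus test, the callback also records the untrained baseline, which is the point the training curve starts from.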
At Massive Scale: What 71% on AIME Cost
Numbers don't fall out of a model for free. Estimating the R1-Zero compute bill from the published run-length, sampling cadence, and model size:
| Resource | Estimate | Note |
|---|---|---|
| RL steps | ~8,000 | Roughly the x-axis of the training curve |
| Rollouts per step | K × batch ≈ 16 × 64 = 1024 | GRPO group size × prompts per step |
| Avg generated tokens / rollout | ~10,000 (late training) | From the length curve |
| Generated tokens / RL step | ~10M | 1024 × 10K |
| Total generated tokens (training) | ~80B | 8000 × 10M |
| Training-side forward+backward / step | ~6× rollout tokens | Policy + ref forward + backward |
| Eval runs | ~40 (every 200 steps) | K=64, 30 AIME problems each |
| Generated tokens / eval | ~19M | 64 × 30 × 10K |
| Total generated tokens (eval) | ~760M | <1% of training rollout tokens |
| Aggregate compute order | ~10^25 FLOPs | Roughly 1/100th of pretraining |
Two observations. First, rollouts dominate the bill — about 80% of GPU-hours during R1-Zero go into generate() calls, not into the gradient update. This is why DeepSeek (and others) push the sampler onto separate vLLM inference workers, freeing the training cluster from prefill latency. Second, eval cost is negligible relative to training — under 1% of the rollout budget — so there is no excuse for under-evaluating. Most failed R1-Zero reproductions trip on this: they evaluate twice during a 2000-step run and miss the smooth curve, then conclude the run did not work.
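The token counts in the table follow from a handful of multiplications, re-derived in the sketch below. Every input is one of the table's rough estimates rather than a measured value. Using the late-training length of ~10,000 tokens for every step makes the training total an upper-end figure. The variable names are chosen here for illustration.

```python
# Inputs: the table's rough estimates, not measured values.
group_size = 16               # GRPO samples per prompt
prompts_per_step = 64
rl_steps = 8_000
tokens_per_rollout = 10_000   # late-training average, so totals are upper-end estimates

rollouts_per_step = group_size * prompts_per_step
train_tokens_per_step = rollouts_per_step * tokens_per_rollout
train_tokens_total = rl_steps * train_tokens_per_step

eval_every = 200              # RL steps between evaluations
eval_samples = 64             # samples per problem, as in cons@64
eval_problems = 30            # AIME problems per evaluation
eval_runs = rl_steps // eval_every
eval_tokens_per_run = eval_samples * eval_problems * tokens_per_rollout
eval_tokens_total = eval_runs * eval_tokens_per_run

eval_share = eval_tokens_total / train_tokens_total

print(f"rollouts per step:      {rollouts_per_step:,}")       # 1,024
print(f"tokens per RL step:     {train_tokens_per_step:,}")   # 10,240,000
print(f"training tokens total:  {train_tokens_total:,}")      # 81,920,000,000
print(f"tokens per eval run:    {eval_tokens_per_run:,}")     # 19,200,000
print(f"eval tokens total:      {eval_tokens_total:,}")       # 768,000,000
print(f"eval share of rollouts: {eval_share:.2%}")            # 0.94%
```

The unrounded products (about 82B training tokens and 768M eval tokens) agree with the table's rounded ~80B and ~760M, and the eval share stays under 1%.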
Engineering Reality: What the Numbers Hide
The benchmark table is true and reproducible, but it hides four engineering realities that the next sections on R1-Zero's failure modes will pick up on.
- Long completions are not free at inference time. Serving a 10K-token reasoning response costs ~10× the decode FLOPs of a 1K-token answer, and the KV-cache grows linearly. Users see this as slower time-to-first-answer. R1-Zero traded a lot of latency for accuracy.
- cons@64 is not how anyone uses the model. No production deployment runs 64 rollouts per user query and votes. The 86.7% number is a research metric — it tells you the model can get there, not that it will when called once.
- Readability suffers. R1-Zero's long completions are tangled — mid-stream language switching, repeated "wait let me reconsider" segments, no structure. This is what motivates the R1 (non-zero) pipeline in the next chapter: a cold-start SFT pass to teach the model to produce its reasoning cleanly before continuing with RL.
- Coding lags math. R1-Zero edges out o1-0912 on MATH-500 and comes close on AIME pass@1, but trails it by 13+ points on LiveCodeBench and ~400 Elo on CodeForces. The reward signal for code (compile? pass tests?) is noisier, slower, and admits more partial-credit ambiguity than math's "does the integer match?" test. RL is only as good as the reward.
The headline result is real and matters: a base model, with no supervised reasoning data, can be trained with pure RL on a verifiable reward to match the strongest closed reasoning model of its era on math benchmarks. The four caveats above are exactly what the rest of the chapter — and chapter 17 on the full R1 pipeline — exists to address.