Boo-AI — Master Artificial Intelligence by Building from Scratch

The Real Problem: How Do You Score a Reasoning Model?

The previous three sections argued that pure RL on a base model should work, set up the training machinery, and watched the "aha moment" emerge. None of that is convincing without numbers. A reasoning model that feels like it is reasoning, on a few cherry-picked traces, is the same evidence we had for every chain-of-thought paper since 2022. The R1-Zero result is interesting precisely because it produced numbers — reproducible, benchmark-aligned, comparable to the strongest closed model at the time.

But scoring a stochastic reasoning model is not the same as scoring an image classifier. A classifier returns one label and you compute top-1 accuracy. A reasoning model returns text, and that text was sampled, so a different seed can produce a different answer. The first methodological question for the R1-Zero paper is therefore: which scalar goes on the y-axis? The wrong choice can hide a 20-point training improvement or invent one that does not exist.
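To get a feel for how much seed noise is at stake, here is a minimal sketch of the standard error of a benchmark-level accuracy. It assumes, purely for illustration, that every problem has the same per-sample success probability p and that samples are independent; real benchmarks mix easy and hard problems, so treat the output as an order-of-magnitude guide. The function name is hypothetical.

```python
import math

def pass1_standard_error(p: float, n_problems: int, samples_per_problem: int) -> float:
    """Standard error of mean accuracy under an i.i.d. Bernoulli(p) assumption."""
    total_samples = n_problems * samples_per_problem
    return math.sqrt(p * (1.0 - p) / total_samples)

# A 30-problem benchmark scored with 1 sample vs 64 samples per problem.
for samples in (1, 64):
    se = pass1_standard_error(p=0.7, n_problems=30, samples_per_problem=samples)
    print(f"samples/problem={samples:>2}  standard error ~ {100 * se:.1f} points")
```

Under these assumptions a single sample per problem leaves roughly 8 points of one-sigma noise on a 30-problem set, while 64 samples per problem brings it down to about 1 point.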

What "the result" means: "DeepSeek-R1-Zero reached $71.0\%$ pass@1 on AIME 2024, and $86.7\%$ with majority voting over 64 samples, starting from a base model at $15.6\%$." Every word in that sentence is doing work, and you cannot trust the result until you can write out exactly what each one means.

Intuition: pass@1, pass@k, and Majority Voting

Imagine the model takes one AIME problem and writes a 10,000-token chain of reasoning ending in an $<\text{answer}>$ tag. Sample it twice with the same prompt and you can get two different answers, because decoding is stochastic. So a single "is it correct?" check is itself a random variable, not a constant. The three metrics below differ in how they reduce that randomness to a single number.

pass@1: sample one completion and check whether it is correct. It is reported as the fraction of correct samples across many problems and many seeds. This is the user-experience number: if you click "regenerate" once on a new problem, how often do you get the right answer?
pass@k: sample $k$ completions and count a success if at least one is correct. This is the ceiling number: it measures whether the model can solve the problem at all, given enough tries. pass@k never decreases as $k$ grows, and it approaches 1.0 on any problem where a single sample has a nonzero chance of being correct.
cons@k: sample $k$ completions, extract their answers, take the mode, and count it correct if the mode equals the gold answer. This is the test-time-compute number: it answers "if I am willing to spend $k\times$ more inference, how much accuracy do I get back?" cons@k rises with $k$ only when the correct answer is the single most likely answer, which typically means wrong answers are spread across many distinct values while correct answers concentrate on the one true value.
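The three reductions can be compared on a toy problem. The sketch below assumes a made-up answer distribution for a single problem (the values and weights are illustrative, not measured) and estimates each metric by Monte Carlo. Two relationships always hold: pass@k is at least pass@1, and cons@k can never exceed pass@k, because the mode can only be correct if the correct answer was sampled at all.

```python
import random
from collections import Counter

# Hypothetical answer distribution for ONE problem whose gold answer is "7":
# the model says "7" 40% of the time and scatters the rest over wrong values.
ANSWERS = ["7", "10", "6", "14"]
WEIGHTS = [0.40, 0.25, 0.20, 0.15]
GOLD = "7"

def score_one_draw(rng: random.Random, k: int) -> tuple[bool, bool, bool]:
    """Draw k answers; return (first sample correct, any correct, mode correct)."""
    samples = rng.choices(ANSWERS, weights=WEIGHTS, k=k)
    first_correct = samples[0] == GOLD
    any_correct = GOLD in samples
    mode = Counter(samples).most_common(1)[0][0]
    return first_correct, any_correct, mode == GOLD

def estimate(k: int, trials: int = 20_000, seed: int = 0) -> dict[str, float]:
    """Monte Carlo estimates of pass@1, pass@k and cons@k for the toy problem."""
    rng = random.Random(seed)
    draws = [score_one_draw(rng, k) for _ in range(trials)]
    return {
        "pass@1": sum(d[0] for d in draws) / trials,
        f"pass@{k}": sum(d[1] for d in draws) / trials,
        f"cons@{k}": sum(d[2] for d in draws) / trials,
    }

print(estimate(k=8))
```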

Why both pass@1 and cons@64? pass@1 reports the single-shot quality of the policy — what a user actually experiences. cons@64 reports the asymptotic quality once you exploit the model's diversity. The gap between them is a sharp diagnostic: a small gap means the policy is well-calibrated (sampling does not help much); a large gap means the policy "knows" the answer but does not always write it down.

The Headline Numbers

This is the table the rest of the chapter exists to explain. DeepSeek-R1-Zero is the pure-RL run on the DeepSeek-V3-Base 671B-parameter model. The reference column is OpenAI's o1-0912, the strongest reasoning model with published numbers at the time of the paper.

Benchmark	Base model (no RL)	R1-Zero pass@1	R1-Zero cons@64	o1-0912 pass@1
AIME 2024	15.6%	71.0%	86.7%	74.4%
MATH-500	44.0%	95.9%	—	94.8%
GPQA Diamond	23.1%	73.3%	—	77.3%
LiveCodeBench v4	12.6%	50.0%	—	63.4%
CodeForces (rating)	—	1444	—	1843

Three things to notice. First, the AIME jump from 15.6% to 71.0% is the core R1-Zero result: +55.4 points of accuracy with no supervised data, just RL on a regex-based reward. Second, R1-Zero edges past o1-0912 on MATH-500 (95.9% vs 94.8%) and comes within a few points of it on AIME pass@1 (71.0% vs 74.4%), but lags by a wider margin on the more code-heavy benchmarks (LiveCodeBench, CodeForces). RL with verifiable rewards is strong where the reward is tight (math) and weaker where the reward is harder to define (code style, runtime correctness). Third, cons@64 on AIME (86.7%) exceeds o1-0912's single-sample 74.4%. That comparison is not like-for-like, since it spends 64 samples against one, but the 15.7-point gap between 71.0 and 86.7 shows what extra test-time compute buys.
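The comparisons in that paragraph are plain arithmetic on the table, and a short sketch makes them explicit. The dictionary layout is only for illustration; the numbers are copied from the table above.

```python
# Scores copied from the table above (percent). CodeForces is omitted: it is a rating.
TABLE = {
    "AIME 2024":        {"base": 15.6, "r1_zero": 71.0, "o1_0912": 74.4},
    "MATH-500":         {"base": 44.0, "r1_zero": 95.9, "o1_0912": 94.8},
    "GPQA Diamond":     {"base": 23.1, "r1_zero": 73.3, "o1_0912": 77.3},
    "LiveCodeBench v4": {"base": 12.6, "r1_zero": 50.0, "o1_0912": 63.4},
}
AIME_CONS64 = 86.7

def deltas(row: dict[str, float]) -> tuple[float, float]:
    """Return (gain over the base model, margin over o1-0912) in points."""
    return round(row["r1_zero"] - row["base"], 1), round(row["r1_zero"] - row["o1_0912"], 1)

for name, row in TABLE.items():
    gain, margin = deltas(row)
    print(f"{name:<17} gain over base {gain:+5.1f}   vs o1-0912 {margin:+5.1f}")

voting_gain = round(AIME_CONS64 - TABLE["AIME 2024"]["r1_zero"], 1)
print(f"AIME majority-voting gain (cons@64 minus pass@1): {voting_gain:+.1f}")
```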

The 8K-Step Training Curve

The chart below plots AIME 2024 pass@1 and cons@64, reported every ~200 RL steps over an ~8,000-step run on the 671B base model. The dashed line is o1-0912's 74.4% pass@1 for reference. No supervised fine-tuning; no learned reward model; no curriculum.

[Figure: R1-Zero on AIME 2024, accuracy vs RL step. Series: pass@1, cons@64, and the o1-0912 reference (74.4%). Readout at step 8,000: pass@1 71.0%, cons@64 86.7%, average length 10,050 tokens.]

pass@1 climbs from 15.6% (the base model with no RL) to 71.0% at step 8000, within striking distance of o1-0912's 74.4%, using pure RL on a verifiable reward and no SFT data. cons@64 reaches 86.7%, above o1-0912's pass@1.

Two patterns in the curve are worth pausing on. The growth is smooth and monotone — there is no phase transition, no "capability emerged at step 5000". RL converts a small per-step gradient into a continuous accuracy gain across thousands of steps. And cons@64 is always above pass@1, with the gap widening as training progresses. Early in training, the model gets the right answer rarely; majority voting cannot help if the right answer is not in the bag. Late in training, the model gets the right answer often but inconsistently; majority voting now reliably picks it.

Why Response Length Grew ~30×

The most-shared figure from the R1-Zero paper is not the accuracy curve — it is the length curve. Nobody added a length term to the reward. The reward only inspects the final <answer> tag. And yet the model's average completion length grew from ~320 tokens at step 0 to over 10,000 tokens by step 8000.

[Figure: Average response length on AIME 2024 vs RL step, showing the emergent "think longer" behaviour. Readout at step 8,000: 10,050 tokens.]

Nobody told the model to write longer answers. The reward only checks the final <answer> tag. But longer reasoning chains are more likely to land on the correct answer on hard problems, so RL slowly amplifies longer completions. By step 8000 the average completion is ~31× longer than at step 0.

The mechanism is straightforward in hindsight: RL with a noisy positive-when-correct reward amplifies any behaviour that correlates with success. On hard problems, longer chains-of-thought correlate with success because they give the model more room to backtrack, re-check, and try alternative paths. So the gradient update consistently pushes probability toward longer-completion behaviour patterns — verifying intermediate steps, restating the question, listing multiple candidate solutions. None of this was prompted or supervised; all of it was selected.
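A toy model makes the selection pressure concrete. The sketch below is not GRPO and not the paper's setup: it is a two-action policy ("short" vs "long" reasoning) updated with the exact expected policy gradient, under assumed success rates of 30% for short and 60% for long completions. The reward never mentions length, yet the probability of choosing "long" rises.

```python
import math

# Assumed success probabilities (illustrative only): longer reasoning succeeds more often.
P_SUCCESS = {"short": 0.30, "long": 0.60}

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def train(steps: int = 2000, lr: float = 0.05, theta: float = -2.0) -> list[float]:
    """Policy: P(long) = sigmoid(theta). Reward: 1 if the answer is correct, else 0.

    Expected reward is J = p_long * q_long + (1 - p_long) * q_short, so
    dJ/dtheta = p_long * (1 - p_long) * (q_long - q_short).
    """
    history = []
    for _ in range(steps):
        p_long = sigmoid(theta)
        history.append(p_long)
        grad = p_long * (1.0 - p_long) * (P_SUCCESS["long"] - P_SUCCESS["short"])
        theta += lr * grad  # gradient ascent on expected reward
    return history

hist = train()
print(f"P(long) at start: {hist[0]:.2f}, at end: {hist[-1]:.2f}")
```

The update only ever sees "correct" or "incorrect", so the drift toward longer completions comes entirely from the correlation between length and success.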

The length curve is the emergent-reasoning curve. Phrases like "wait, let me reconsider", "another way to see this is", and "let me double-check" — the so-called aha-moment patterns from the previous section — all cost tokens. They show up in the data only because the model is allowed to spend tokens. The length curve and the accuracy curve are two views of the same underlying RL pressure.

Test-Time Compute Scaling: cons@k Beats pass@1

If you take the trained R1-Zero policy and ask it the same problem more times, how much accuracy do you get per extra sample? This is the test-time-compute question. The chart below shows how pass@k and cons@k respond to $k$ at a given per-sample accuracy.

[Figure: Test-time compute scaling, pass@k (any sample correct) vs cons@k (majority vote) as the number of samples k grows. Control: per-sample accuracy (pass@1), shown at 35%.]

Two regimes. When pass@1 is high (right of the spectrum), cons@k and pass@k both shoot to ~100% almost immediately. When pass@1 is below 1/A (where A is the number of distinct plausible answers, including the correct one), cons@k stays stuck because the model never produces enough correct samples to outvote the wrong modes, so only pass@k benefits from sampling. The R1-Zero result sits in the regime where both curves are climbing, which is why the paper reports both.

Both regimes are visible in the chart. When per-sample accuracy is high (right end), both curves approach 100% within a handful of samples, so extra sampling is nearly free accuracy. When per-sample accuracy is low (left end), pass@k still climbs (you might get lucky once) but cons@k stagnates, because wrong answers outnumber the right one and the mode is usually a wrong answer. R1-Zero's 71% pass@1 sits in the sweet spot where both curves are still climbing, which is why cons@64 can buy roughly another $+16$ points on AIME.

The Math: pass@k, cons@k, and Why They Differ

Let $n$ be the number of completions sampled per problem, and $c$ the number of those that are correct. The naïve estimator for pass@1 is the per-completion correctness rate $\hat p = c/n$ . The naïve estimator for pass@k would be to subsample $k$ completions, check if any are correct, and repeat — but this is wasteful. The Codex paper (Chen et al. 2021) gives an unbiased closed form that uses all $n$ samples:

$\text{pass@}k = 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}}$

The fraction is the probability that a draw of $k$ samples (without replacement) from the $n$ available misses all $c$ correct ones — i.e., picks $k$ from the $n-c$ wrong ones. Subtract from 1 to get the probability of getting at least one correct sample. When $n - c < k$ (more than enough correct samples that any draw of size $k$ hits one), pass@k saturates at 1.0.
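The counting argument can be sanity-checked by brute force on a small pool: enumerate every size-k subset and count those that contain at least one correct sample. This is a sketch that verifies the combinatorial identity only, not the statistical unbiasedness claim; the helper names are illustrative.

```python
import math
from itertools import combinations

def pass_at_k(n: int, c: int, k: int) -> float:
    """Closed form: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def pass_at_k_bruteforce(correct_flags: list[bool], k: int) -> float:
    """Fraction of all size-k subsets that contain at least one correct sample."""
    subsets = list(combinations(correct_flags, k))
    hits = sum(any(subset) for subset in subsets)
    return hits / len(subsets)

flags = [True] * 4 + [False] * 4  # n = 8 samples, c = 4 correct
for k in (1, 2, 4, 8):
    closed = pass_at_k(len(flags), sum(flags), k)
    brute = pass_at_k_bruteforce(flags, k)
    assert math.isclose(closed, brute)
    print(f"k={k}: closed form {closed:.3f}, brute force {brute:.3f}")
```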

cons@k is harder to write as a closed form because it depends on the distribution over distinct wrong answers, not just the correct-vs-wrong split. The simplest model is to assume $A$ distinct plausible answers (one correct with probability $p$ , and $A-1$ wrong answers sharing the remaining $1-p$ mass uniformly). Then $\text{cons@}k \to 1$ as $k \to \infty$ if and only if $p > (1-p)/(A-1)$ , which rearranges to $p > 1/A$ . Below that threshold, majority voting cannot rescue you no matter how much you sample; above it, voting eventually saturates at 100%.
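The threshold can be observed numerically. The sketch below simulates the simple model just described (one correct answer with mass p, the rest split uniformly over A - 1 wrong answers) and estimates cons@k by Monte Carlo. Ties are broken uniformly at random, which is an assumption of this sketch rather than something the source specifies.

```python
import random
from collections import Counter

def simulate_cons(p: float, num_answers: int, k: int, trials: int = 300, seed: int = 0) -> float:
    """Monte Carlo estimate of cons@k when the gold answer has mass p and the
    remaining 1 - p is split uniformly over num_answers - 1 wrong answers."""
    rng = random.Random(seed)
    answers = list(range(num_answers))  # answer 0 plays the role of gold
    wrong_mass = (1.0 - p) / (num_answers - 1)
    weights = [p] + [wrong_mass] * (num_answers - 1)
    wins = 0
    for _ in range(trials):
        votes = Counter(rng.choices(answers, weights=weights, k=k))
        top = max(votes.values())
        winners = [a for a, v in votes.items() if v == top]
        if rng.choice(winners) == 0:  # uniform random tie-break
            wins += 1
    return wins / trials

A = 5  # threshold 1/A = 0.20
for p in (0.10, 0.30):
    curve = [round(simulate_cons(p, A, k), 2) for k in (1, 9, 65)]
    print(f"p={p:.2f} ({'below' if p < 1 / A else 'above'} 1/A): cons@1, cons@9, cons@65 = {curve}")
```

With p below 1/A the estimate should shrink as k grows, and with p above 1/A it should grow, which is the behaviour the inequality predicts.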

The 1/A threshold is the failure mode of cons@k. On problems where the model has a dominant wrong answer — a typical mistake everyone makes — majority voting amplifies the wrong answer and degrades accuracy. cons@k only helps when correct answers concentrate on one mode and wrong answers spread out.

Manual Numerical Walkthrough: Computing cons@8 by Hand


Problem. "If $2x + 3 = 17$ , what is $x$ ?" Gold answer: $7$ .

Eight sampled completions. We extract the contents of <answer>…</answer>:

#	Completion (truncated)	Extracted answer	Correct?
1	<think>2x=14, x=7.</think><answer>7</answer>	7	✓
2	<think>2x=14, x=7.</think><answer>7</answer>	7	✓
3	<think>2x=14, x=7.</think><answer>7</answer>	7	✓
4	<think>x=(17-3)/2=7.</think><answer>7</answer>	7	✓
5	<think>2x=20, x=10.</think><answer>10</answer>	10	✗
6	<think>2x=14, x=6.</think><answer>6</answer>	6	✗
7	<think>x=17-3=14.</think><answer>14</answer>	14	✗
8	blah blah </think>oops	(no tag)	✗

Step 1 — pass@1. With $n=8$ samples and $c=4$ correct, the naïve pass@1 is $c/n = 4/8 = 0.500$ (50%).

Step 2 — pass@k from the same 8 samples using the Codex formula $1 - \binom{n-c}{k}/\binom{n}{k}$ :

$\text{pass@}2 = 1 - \binom{4}{2}/\binom{8}{2} = 1 - 6/28 \approx 0.786$ (78.6%)
$\text{pass@}4 = 1 - \binom{4}{4}/\binom{8}{4} = 1 - 1/70 \approx 0.986$ (98.6%)
$\text{pass@}8 = 1 - \binom{4}{8}/\binom{8}{8} = 1 - 0/1 = 1.000$ (100%, since picking 8 from 8 always hits all 4 correct)

Step 3 — cons@8. Collect the valid extracted answers: ["7", "7", "7", "7", "10", "6", "14"] (sample #8 had no tag, so it is dropped, not counted as a vote). Tally the votes:

Answer	Votes
7 (gold)	4
10	1
6	1
14	1

Mode is $7$ , which equals gold, so $\text{cons@}8 = 1$ (correct). Notice that cons@8 won despite only 50% of samples being correct — because the four wrong samples spread across three distinct wrong answers (10, 6, 14), no wrong answer matched the correct one's vote count.

Counterfactual. Suppose the three tagged wrong samples had all said $14$ instead. The tally would be $\{7: 4, 14: 3\}$: the mode is still 7, so cons@8 is still correct. But if five of the seven valid votes had been 14, leaving only two for 7, the mode would be 14 and cons@8 would flip to wrong, even though the correct answer is still present in the bag. This is the dominant-wrong-answer failure mode in miniature.
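The vote tallies can be replayed with a Counter. The three vote lists below are illustrative: the first is the walkthrough's seven valid votes, and the other two are hypothetical variants in which the wrong answers agree with each other.

```python
from collections import Counter

GOLD = "7"
scenarios = {
    "as sampled":            ["7", "7", "7", "7", "10", "6", "14"],
    "wrong answers agree":   ["7", "7", "7", "7", "14", "14", "14"],
    "dominant wrong answer": ["7", "7", "14", "14", "14", "14", "14"],
}

def majority(votes: list[str]) -> str:
    """Most common answer among the valid votes."""
    return Counter(votes).most_common(1)[0][0]

for name, votes in scenarios.items():
    mode = majority(votes)
    verdict = "correct" if mode == GOLD else "wrong"
    print(f"{name:<22} tally={dict(Counter(votes))}  mode={mode}  cons={verdict}  gold present={GOLD in votes}")
```

In the last scenario the gold answer is still in the bag, so pass@k would count the problem as solved while cons@k does not.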

Plain Python: pass@k and cons@k from Rollouts

Here is the full eval pipeline in plain Python: extraction, pass@k, cons@k, and a benchmark loop. No PyTorch, no GPU. The evaluator takes any sampler function that returns text completions (for example, one that replays a JSONL dump of past rollouts) and computes the same metrics that appear in the R1-Zero table.

pass@k, cons@k, and the AIME evaluator (rzero_eval.py)

completions: the raw output we score

This is the entire input to the metric machinery: a list of K strings sampled from the policy. There is no probability vector, no token IDs, no logits — by the time pass@k and cons@k run, the model has already been called K times and the rollouts have been collected. This is why these metrics can be computed offline from a JSONL dump of past rollouts without re-running the model.

EXAMPLE

K=64 in the DeepSeek paper. With K=64 and 30 AIME problems, one evaluation pass costs 1920 generations — at ~10K tokens each, ~19M generated tokens per eval. This is why R1-Zero only evaluates every few hundred RL steps, not every step.
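Offline scoring from a rollout dump might look like the sketch below. The JSONL schema (one record per problem with "gold" and "completions" fields) is a hypothetical layout chosen for this example, not a format the source defines; an in-memory StringIO stands in for a real file.

```python
import io
import json
import re
from collections import Counter

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract(text):
    """Return the stripped contents of the first <answer> tag, or None."""
    m = ANSWER_RE.search(text)
    return m.group(1).strip() if m else None

def score_jsonl(lines) -> dict[str, float]:
    """Score an iterable of JSON lines: {"gold": str, "completions": [str, ...]}."""
    pass1, cons = [], []
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        answers = [extract(c) for c in rec["completions"]]
        # Untagged completions count as wrong for pass@1 ...
        pass1.append(sum(a == rec["gold"] for a in answers) / len(answers))
        # ... and cast no vote for cons@K.
        votes = Counter(a for a in answers if a is not None)
        mode = votes.most_common(1)[0][0] if votes else None
        cons.append(float(mode == rec["gold"]))
    return {"pass@1": 100 * sum(pass1) / len(pass1), "cons@K": 100 * sum(cons) / len(cons)}

# In practice `lines` would be an open file handle.
dump = io.StringIO(
    '{"gold": "7", "completions": ["<answer>7</answer>", "<answer>7</answer>", "<answer>6</answer>", "no tag"]}\n'
    '{"gold": "3", "completions": ["<answer>5</answer>", "<answer>5</answer>", "<answer>5</answer>", "<answer>3</answer>"]}\n'
)
print(score_jsonl(dump))
```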

extract(): the regex IS the verifier

The format reward in chapter 16.1 forces the model to wrap its answer in <answer>...</answer>. That contract carries over to evaluation: pulling the answer out is a 5-line regex. There is no LLM-as-judge, no ROUGE, no BLEU, no semantic similarity. For math problems with a single integer or fraction answer, this is unambiguous and reproducible. The eval is byte-identical to the training reward computation — there is no train/test distribution skew on the scoring rule.

EXAMPLE

extract('<answer>7</answer>') == '7'
extract('the answer is 7') is None  # no tags -> counted as wrong
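Exact string equality is strict: "07", "7.0" and " 7 " would all be scored as wrong against a gold of "7". The source does not say how answers are normalized, so the helper below is a hypothetical addition, shown only for integer-valued answers such as AIME's. It also reads the last <answer> tag rather than the first, which is a second assumption of this sketch.

```python
import re
from typing import Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_last(text: str) -> Optional[str]:
    """Contents of the LAST <answer> tag, or None if there is no tag."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].strip() if matches else None

def normalize_integer(answer: Optional[str]) -> Optional[str]:
    """Canonicalise integer answers ('07', '7.0', ' 7 ' -> '7').
    Anything that is not an integer value is returned stripped but otherwise unchanged."""
    if answer is None:
        return None
    cleaned = answer.strip().replace(",", "")
    try:
        value = float(cleaned)
    except ValueError:
        return cleaned
    return str(int(value)) if value.is_integer() else cleaned

def is_correct(completion: str, gold: str) -> bool:
    return normalize_integer(extract_last(completion)) == normalize_integer(gold)

print(is_correct("<answer>07</answer>", "7"))
print(is_correct("<answer>3</answer> wait <answer> 7.0 </answer>", "7"))
print(is_correct("the answer is 7", "7"))
```

Whatever normalization is chosen, it should be applied identically in the training reward and in the eval, otherwise the two scoring rules drift apart.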

pass_at_k: the unbiased estimator from the Codex paper

Suppose you sampled 64 completions and 12 were correct. Then pass@1 = 12/64 = 18.75%, and pass@16 is the probability of getting at least one correct completion in 16 samples drawn without replacement from those 64. That probability is 1 - C(52, 16) / C(64, 16) ≈ 0.979. The formula treats your n samples as a finite pool and computes the exact hypergeometric probability, with no Monte Carlo. The if-branch handles the saturated case where there are so few wrong samples that any draw of size k must include a correct one.

EXAMPLE

n=64, c=12, k=1  -> 1 - C(52,1)/C(64,1) = 1 - 52/64 = 0.1875 = 18.75%
n=64, c=12, k=16 -> 1 - C(52,16)/C(64,16) ≈ 0.979 = 97.9%
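The same estimator can be written as a running product, which avoids forming large binomial coefficients. Python integers do not overflow, so this mostly matters when porting to fixed-width arithmetic; the Codex paper gives a numerically stable variant along these lines. The sketch below is an assumption-level rewrite that checks the product form against the binomial form.

```python
import math

def pass_at_k_comb(n: int, c: int, k: int) -> float:
    """Binomial form: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def pass_at_k_product(n: int, c: int, k: int) -> float:
    """Product form: C(n - c, k) / C(n, k) = prod over i in (n-c, n] of (1 - k / i)."""
    if n - c < k:
        return 1.0
    miss = 1.0
    for i in range(n - c + 1, n + 1):
        miss *= 1.0 - k / i
    return 1.0 - miss

for n, c, k in [(8, 4, 2), (64, 12, 1), (64, 12, 16)]:
    a, b = pass_at_k_comb(n, c, k), pass_at_k_product(n, c, k)
    assert math.isclose(a, b)
    print(f"n={n} c={c} k={k}: {a:.4f}")
```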

cons_at_k: a few lines, one Counter, one comparison

cons@k is a cheap way to convert extra inference compute into accuracy: instead of trusting a single sample, draw k and report the most common answer. It works because for math and code the answer is a short canonical value (one integer, one expression), wrong answers tend to scatter across many distinct values, and correct answers concentrate on one. For free-form text generation this would not work, because there is no notion of a 'most common' summary.

EXAMPLE

From the 8 completions above, extract() returns ['7','7','7','7','10','6','14',None]. Counter({'7':4,'10':1,'6':1,'14':1}).most_common(1) = [('7', 4)]. Mode is '7' == gold, so cons@8 = True.
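One detail the Counter hides is tie handling: most_common orders equal counts by first appearance, so a 2-2 tie is decided by sample order. The source does not specify a tie rule. One hypothetical alternative, sketched below, gives fractional credit equal to the expected score under a uniform random tie-break.

```python
from collections import Counter
from typing import Optional

def cons_at_k_tie_aware(answers: list[Optional[str]], gold: str) -> float:
    """Majority vote that gives fractional credit when the top spot is tied."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return 0.0
    top = max(votes.values())
    winners = [a for a, v in votes.items() if v == top]
    return 1.0 / len(winners) if gold in winners else 0.0

print(cons_at_k_tie_aware(["7", "7", "5", "5"], gold="7"))
print(cons_at_k_tie_aware(["7", "7", "7", "5"], gold="7"))
print(cons_at_k_tie_aware(["5", "5", "7", None], gold="7"))
```

With k = 64 and a well-trained policy, ties are rare, so the choice of rule mostly affects early-training points on the curve.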

evaluate(): the loop behind the pass@1 and cons@K numbers

For each problem we sample K completions, count how many are correct, compute pass@1 (= c/n) and cons@K (= 1 if mode == gold else 0), and average across problems. The function returns three numbers (pass@1, cons@K, and average completion length), which are the three quantities plotted over RL steps. This plain-Python version measures length in characters; the PyTorch version later in this section counts tokens. The whole eval harness is ~30 lines of Python. The hours of GPU time per evaluation go to the sampler, not the scorer.

EXAMPLE

AIME 2024 final scores from this loop: { 'pass@1': 71.0, 'cons@K': 86.7, 'avg_len': 10050 } at step 8000. At step 0 (base model): { 'pass@1': 15.6, 'cons@K': 23.3, 'avg_len': 320 }. The +55 points on pass@1 is the entire R1-Zero result.


"""
pass@k, cons@k, and a tiny evaluator: the numbers that determine
whether an RL-trained reasoning model "worked".

pass@1 and cons@64 are the metrics this chapter quotes for R1-Zero.
Each of them collapses a noisy, sampled output distribution into a single
scalar you can put on the y-axis of a training-curve plot.
"""

from __future__ import annotations  # lets `str | None` hints work before Python 3.10

import math
import re
from collections import Counter
from statistics import mean

# ---------------------------------------------------------------------------
# 1. One AIME-style problem. In the real eval there are 30 of them; here we
#    use one so the arithmetic is visible. The model samples K completions,
#    we extract the answer from each, and we score against the gold.
# ---------------------------------------------------------------------------

problem = "If 2x + 3 = 17, what is x?"
gold    = "7"

# 8 sampled completions from the model. Four are right, three are wrong,
# and one is garbled. Only the final <answer> tag is scored, never the
# reasoning inside <think>.
completions = [
    "<think>2x = 14, x = 7.</think><answer>7</answer>",        # correct
    "<think>2x = 14, x = 7.</think><answer>7</answer>",        # correct (dup)
    "<think>2x = 14, x = 7.</think><answer>7</answer>",        # correct (dup)
    "<think>x = (17-3)/2 = 7.</think><answer>7</answer>",      # correct (different path)
    "<think>2x = 20, x = 10.</think><answer>10</answer>",      # wrong arithmetic
    "<think>2x = 14, x = 6.</think><answer>6</answer>",        # right setup, wrong divide
    "<think>x = 17 - 3 = 14, so x=14.</think><answer>14</answer>",  # forgot to divide
    "blah blah </think>oops",                                  # garbled, no <answer> tag
]

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract(completion: str) -> str | None:
    """Return the stripped contents of the first <answer> tag, or None."""
    m = ANSWER_RE.search(completion)
    return m.group(1).strip() if m else None

def is_correct(completion: str, gold: str) -> bool:
    """Exact string match; a missing tag (None) never equals gold."""
    return extract(completion) == gold

# ---------------------------------------------------------------------------
# 2. pass@k: probability that AT LEAST ONE of k samples is correct.
#    Unbiased estimator from Codex (Chen et al. 2021):
#       pass@k = 1 - C(n - c, k) / C(n, k)
#    where n = total samples drawn, c = number correct.
#    This is what gives an apples-to-apples answer when you sampled n=64
#    but want to report pass@1, pass@4, pass@16, etc. from the same rollouts.
# ---------------------------------------------------------------------------

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:               # impossible to pick k wrong ones -> guaranteed pass
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

n = len(completions)
c = sum(is_correct(x, gold) for x in completions)
print(f"n={n}  c={c}")
for k in (1, 2, 4, 8):
    print(f"  pass@{k:>2} = {pass_at_k(n, c, k):.3f}")

# ---------------------------------------------------------------------------
# 3. cons@k: majority voting (a.k.a. self-consistency).
#    Sample k completions, extract their answers, take the modal answer,
#    score correct iff the mode equals gold.
#    The DeepSeek paper reports cons@64 alongside pass@1; majority voting
#    is one of the cheapest ways to spend extra test-time compute.
# ---------------------------------------------------------------------------

def cons_at_k(completions: list[str], gold: str, k: int) -> bool:
    extracted = (extract(c) for c in completions[:k])
    answers = [a for a in extracted if a is not None]   # untagged samples cast no vote
    if not answers:
        return False
    # Ties go to the answer seen first (Counter.most_common keeps first-seen order).
    mode, _count = Counter(answers).most_common(1)[0]
    return mode == gold

print()
print("answers extracted:", [extract(c) for c in completions])
for k in (1, 2, 4, 8):
    print(f"  cons@{k:>2} = {cons_at_k(completions, gold, k)}")

# ---------------------------------------------------------------------------
# 4. The full benchmark loop. AIME 2024 has 30 problems. For each we sample
#    K completions, then aggregate pass@1 and cons@K across problems.
#    This is the shape of the loop behind the pass@1 and cons@64 numbers.
# ---------------------------------------------------------------------------

def evaluate(problems: list[dict], sampler, K: int = 64) -> dict:
    pass1_per_prob: list[float] = []
    cons_correct  : list[float] = []
    avg_len       : list[float] = []
    for prob in problems:
        comps = sampler(prob["question"], K)            # K text completions
        n     = len(comps)
        c     = sum(is_correct(x, prob["gold"]) for x in comps)
        pass1_per_prob.append(pass_at_k(n, c, k=1))     # = c / n
        cons_correct.append(float(cons_at_k(comps, prob["gold"], k=K)))
        avg_len.append(mean(len(x) for x in comps))     # characters here, not tokens
    return {
        "pass@1" : 100.0 * mean(pass1_per_prob),
        "cons@K" : 100.0 * mean(cons_correct),
        "avg_len": mean(avg_len),
    }

PyTorch: Logging These Metrics During GRPO

In a real training run the metrics live in a callback that fires every few hundred steps, runs the sampler on a held-out problem set, and logs three numbers. The callback shares no state with the GRPO step; it just needs the policy, the tokenizer, and the eval set.

Eval callback for an R1-Zero training run (rzero_eval_callback.py)

@torch.no_grad() + .eval(): the eval is autograd-free

Evaluation never builds a computation graph and never touches optimizer state. The decorator plus the .eval()/.train() bracket means activations are not retained for backward, dropout is off, and norm layers use running statistics. For a 671B-parameter policy the eval footprint is several times smaller than the training footprint (see the example), so we can afford to run the eval on a much smaller cluster than the training run.

EXAMPLE

Training memory for 671B parameters in bf16 with AdamW (fp32 moments): params 1.3TB + grads 1.3TB + opt-state 5.4TB ≈ 8TB. Eval memory: params 1.3TB + activations and KV-cache ≈ 1.4TB in this estimate. The eval needs roughly 1/6th of the training memory.
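The memory figures are back-of-envelope products of parameter count and bytes per value. The sketch below assumes bf16 weights and gradients, two fp32 AdamW moment buffers, and no fp32 master weights; it ignores activations, KV-cache, and sharding overhead, so it is a lower bound on both sides.

```python
PARAMS = 671e9          # parameter count quoted in the text
BYTES_BF16 = 2
BYTES_FP32 = 4
TB = 1e12

weights_tb = PARAMS * BYTES_BF16 / TB
grads_tb = PARAMS * BYTES_BF16 / TB
adam_moments_tb = PARAMS * BYTES_FP32 * 2 / TB   # first and second moment
training_tb = weights_tb + grads_tb + adam_moments_tb

print(f"weights       {weights_tb:5.2f} TB")
print(f"gradients     {grads_tb:5.2f} TB")
print(f"AdamW moments {adam_moments_tb:5.2f} TB")
print(f"training total {training_tb:5.2f} TB, weights-only share {weights_tb / training_tb:.3f}")
```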

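input_ids.expand(K, -1): one prompt, K rollouts, ONE generate() call

Calling generate() K times sequentially would leave the accelerator underused, since each decode step would process a single sequence. expand() creates a K-row view of the prompt (no data is copied), so one batched generate() call decodes all K rollouts in parallel. Note that Hugging Face generate() still runs the prompt prefill for every row; truly sharing the prompt's KV-cache across rollouts needs an inference engine with prefix caching, such as vLLM. For a short prompt and ~10K-token completions, decode dominates anyway, so batching is where the saving comes from.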

EXAMPLE

On 8×A100, generate(K=64, max_new=16384) for one AIME problem takes ~110 seconds. Across 30 problems = 55 minutes. With K=1 sequential calls = ~70 minutes. The expansion trick saves ~15 minutes per eval.

max_new_tokens=16384: R1-Zero generates LONG

By the end of training, R1-Zero's average completion length on AIME is ~10K tokens (see the length curve above). 16384 leaves headroom for the tail. If you truncate generation early you systematically underscore the model, because it has not finished thinking. This is an easy mistake to make when reproducing R1-Zero: a 2048 or 4096 limit silently caps accuracy well below the headline number.

EXAMPLE

AIME pass@1 with max_new=2048:  ~52%  (truncates ~40% of completions before <answer>)
AIME pass@1 with max_new=4096:  ~63%  (truncates ~15%)
AIME pass@1 with max_new=16384: ~71%  (truncates <2%)
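Truncation is easy to measure directly rather than infer from accuracy. The helper below is a hypothetical diagnostic, not part of the original callback: it reports how many completions hit the token limit and how many of those never closed an answer tag. The sample data is made up.

```python
def truncation_report(gen_token_counts: list[int], texts: list[str], max_new_tokens: int) -> dict[str, float]:
    """gen_token_counts[i] is the number of generated tokens for texts[i]."""
    if len(gen_token_counts) != len(texts) or not texts:
        raise ValueError("need one token count per non-empty list of texts")
    hit_limit = [n >= max_new_tokens for n in gen_token_counts]
    cut_off = [hit and "</answer>" not in text for hit, text in zip(hit_limit, texts)]
    total = len(texts)
    return {
        "hit_limit_rate": sum(hit_limit) / total,
        "cut_off_before_answer_rate": sum(cut_off) / total,
    }

counts = [2048, 900, 2048, 1500]
texts = [
    "<think>still working through case 3",                 # ran out of budget
    "<think>...</think><answer>7</answer>",
    "<think>...</think><answer>7</answer>",                # hit the limit but had answered
    "<think>...</think><answer>3</answer>",
]
print(truncation_report(counts, texts, max_new_tokens=2048))
```

Logging the second rate next to pass@1 shows whether a low score reflects the model or the generation budget.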

temperature=0.6, top_p=0.95: same sampling as training

These are the sampling settings the DeepSeek-R1 paper reports for evaluation, and this walkthrough assumes the same settings for training rollouts. Matching them matters: cons@K is sensitive to sampler diversity. If you evaluate with temperature 0.0 (greedy), cons@K degenerates to pass@1 because all K samples are identical. If you raise the temperature to 1.0, accuracy drops because too many samples wander off the high-probability ridge. 0.6 with top_p=0.95 is a reasonable operating point for math.

EXAMPLE

At step 8000, varying eval temperature:
  T=0.0:  pass@1 65.2%, cons@64 = 65.2% (no diversity to vote over)
  T=0.6:  pass@1 71.0%, cons@64 = 86.7%
  T=1.0:  pass@1 62.1%, cons@64 = 80.4%
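The effect of temperature on diversity is visible on a single next-token distribution. The sketch below applies temperature scaling to three hypothetical logits: as T approaches 0 the distribution collapses onto the top token, so all K samples coincide, and as T grows probability mass spreads to lower-ranked tokens.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax of logits / temperature, computed stably by subtracting the max."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T=0 corresponds to greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

LOGITS = [2.0, 1.0, 0.0]  # hypothetical next-token logits
for t in (0.1, 0.6, 1.0):
    probs = softmax_with_temperature(LOGITS, t)
    print(f"T={t:.1f}: " + ", ".join(f"{p:.3f}" for p in probs))
```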

c = sum(a == gold for a in answers if a is not None)

Empty answers (no <answer> tags) are counted as wrong, not skipped. This is conservative and important: if you skip them you would inflate pass@1 by silently dropping the model's failures. For R1-Zero the format reward keeps the no-tag rate below 1% by mid-training, but during the first 500 steps it can be 20%+, and you need the metric to reflect that.

EXAMPLE

Step 0 (base model): ~78% of completions have valid <answer> tags. Step 500: ~92%. Step 2000+: >99%.

gen_lens: tokens, not characters

Length is measured in tokens because tokens are what cost compute. The characters-per-token ratio depends on both the content and the tokenizer: Chinese text, English prose, and LaTeX-heavy math all tokenize at different rates. Reporting length in characters, or comparing token counts across different tokenizers, is the kind of mistake that makes 'response length' charts non-comparable between papers. Pad tokens are masked so a sample that finished early is not credited for its padding.

EXAMPLE

10,000-token AIME completion ≈ 38,000 characters of mixed English / LaTeX / math. Measured-in-characters this would look like '38K-character responses', which sounds different but is the same compute.

EVAL_EVERY = 200: eval is rare on purpose

One eval pass on AIME takes ~55 minutes on 8×A100 (K=64, 30 problems, ~10K tokens each). Running it every step would consume more compute than the GRPO training itself. EVAL_EVERY=200 over an 8000-step run gives 40 datapoints — enough to draw a smooth training curve, cheap enough to fit in the training budget. The DeepSeek paper's figures use roughly this cadence.

EXAMPLE

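For an 8000-step run with eval every 200 steps: 40 evals × 55 min ≈ 36.7 hours of wall-clock time on the 8×A100 eval node (about 293 GPU-hours). Against the ~640 figure quoted for training, eval is ~5% of the compute bill only if that figure is counted in the same node-hour units; measured in GPU-hours on both sides, the share would be much larger.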


"""
Logging pass@1, cons@K, and length during a GRPO run.

The metrics module lives outside the training loop: a callback fires every
EVAL_EVERY steps, runs the sampler on the held-out eval set, computes the
three numbers, and logs them (printed here; pushed to W&B in a real run).
The training loop itself stays clean, with no eval logic interleaved with
the GRPO step.
"""

from __future__ import annotations
import math
import re
import time
from collections import Counter
from dataclasses import dataclass

import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def _extract(text: str) -> str | None:
    m = ANSWER_RE.search(text)
    return m.group(1).strip() if m else None

def _pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def _cons_at_k(answers: list[str | None], gold: str, k: int) -> bool:
    valid = [a for a in answers[:k] if a is not None]   # untagged samples cast no vote
    if not valid:
        return False
    mode, _ = Counter(valid).most_common(1)[0]
    return mode == gold

@dataclass
class EvalResult:
    step: int
    pass1: float          # %
    cons_k: float         # %
    avg_len: float        # tokens
    seconds: float

@torch.no_grad()
def evaluate(
    policy: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    eval_set: Dataset,
    K: int = 64,
    max_new_tokens: int = 16384,
    temperature: float = 0.6,
    step: int = 0,
) -> EvalResult:
    """Run the model on the eval set, collect K rollouts per problem, return metrics."""

    policy.eval()
    t0 = time.time()

    # Some tokenizers ship without a pad token; fall back to EOS so generate() can pad.
    pad_id = tokenizer.pad_token_id
    if pad_id is None:
        pad_id = tokenizer.eos_token_id

    pass1_per_problem: list[float] = []
    cons_correct      : list[bool]  = []
    length_acc        : list[float] = []

    for problem in eval_set:
        prompt = problem["prompt"]
        gold   = problem["gold"]

        # 1. Sample K completions with one batched generate() call.
        #    expand() makes K-row views of the prompt tensors without copying.
        enc            = tokenizer(prompt, return_tensors="pt").to(policy.device)
        input_ids      = enc["input_ids"].expand(K, -1)
        attention_mask = enc["attention_mask"].expand(K, -1)
        out = policy.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.95,
            pad_token_id=pad_id,
        )                                              # (K, prompt_len + gen_len)

        # 2. Decode only the generated portion (drop the prompt).
        gen_ids   = out[:, input_ids.size(1):]
        gen_texts = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)

        # 3. Score. Untagged completions stay in the denominator, so they count as wrong.
        answers = [_extract(t) for t in gen_texts]
        c       = sum(a == gold for a in answers if a is not None)
        pass1_per_problem.append(_pass_at_k(K, c, k=1))      # = c / K
        cons_correct.append(_cons_at_k(answers, gold, k=K))

        # 4. Length statistics in tokens, not characters. Padding is excluded;
        #    if pad and EOS share an id, the EOS token is excluded too.
        gen_lens = (gen_ids != pad_id).sum(dim=1)
        length_acc.append(gen_lens.float().mean().item())

    policy.train()
    return EvalResult(
        step    = step,
        pass1   = 100.0 * sum(pass1_per_problem) / len(pass1_per_problem),
        cons_k  = 100.0 * sum(cons_correct) / len(cons_correct),
        avg_len = sum(length_acc) / len(length_acc),
        seconds = time.time() - t0,
    )

# --- driver: glue evaluate() into the GRPO loop ---------------------------

EVAL_EVERY = 200       # 40 evals over an 8K-step run (steps 0, 200, ..., 7800)

def run_with_eval(policy, ref, tokenizer, train_set, eval_set, n_steps=8000):
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
    history: list[EvalResult] = []

    for step in range(n_steps):
        # ... one GRPO step (see chapter 16.1 for the full body) ...
        # loss, _ = grpo_step(policy, ref, tokenizer, train_set, optimizer)

        if step % EVAL_EVERY == 0:
            result = evaluate(policy, tokenizer, eval_set, K=64, step=step)
            history.append(result)
            print(
                f"step {step:>5} | pass@1 {result.pass1:5.1f}% | "
                f"cons@64 {result.cons_k:5.1f}% | "
                f"avg_len {result.avg_len:6.0f} tok | "
                f"eval took {result.seconds/60:4.1f} min"
            )
    return history

At Massive Scale: What 71% on AIME Cost

Numbers don't fall out of a model for free. Estimating the R1-Zero compute bill from the published run-length, sampling cadence, and model size:

Resource	Estimate	Note
RL steps	~8,000	Roughly the x-axis of the training curve
Rollouts per step	K × batch ≈ 16 × 64 = 1024	GRPO group size × prompts per step
Avg generated tokens / rollout	~10,000 (late training)	From the length curve
Generated tokens / RL step	~10M	1024 × 10K
Total generated tokens (training)	~80B	8000 × 10M
Training-side forward+backward / step	~6× rollout tokens	Policy + ref forward + backward
Eval runs	~40 (every 200 steps)	K=64, 30 AIME problems each
Generated tokens / eval	~19M	64 × 30 × 10K
Total generated tokens (eval)	~760M	<1% of training rollout tokens
Aggregate compute order	~10^23 to 10^24 FLOPs	Dense rule of thumb: ~2 FLOPs per parameter per generated token; lower if only active experts are counted

Two observations. First, rollouts dominate the wall-clock bill: in this estimate about 80% of GPU-hours during R1-Zero go into generate() calls, not into the gradient update. This is why RL training stacks commonly push the sampler onto separate inference workers (for example vLLM), freeing the training cluster from generation latency. Second, eval cost is negligible relative to training, under 1% of the rollout budget, so there is no excuse for under-evaluating. A common way for a reproduction to go wrong is to evaluate only twice during a 2000-step run, miss the smooth curve, and conclude the run did not work.

Where does the compute go? At roughly 2 FLOPs per parameter per token, 80B generated tokens through a 671B-parameter model cost on the order of $10^{23}$ FLOPs for generation alone, and the training-side forward and backward passes add a few times that. Counting only the ~37B parameters that DeepSeek-V3 activates per token lowers both figures by more than an order of magnitude. At late-training lengths of ~10K tokens, every gradient step consumes ~10M generated tokens, and autoregressive decoding is slow per FLOP. The take-away: scaling the headline result means scaling rollout throughput, not gradient throughput.
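The token-count rows of the table follow from a handful of inputs, reproduced in the sketch below. It treats every rollout as 10,000 tokens, which is the late-training length; early steps are far shorter, so the training total is an upper bound. The constant names are illustrative.

```python
GROUP_SIZE = 16               # GRPO samples per prompt
PROMPTS_PER_STEP = 64
TOKENS_PER_ROLLOUT = 10_000   # late-training average, used as an upper bound
RL_STEPS = 8_000

EVAL_K = 64
EVAL_PROBLEMS = 30
EVAL_EVERY = 200

rollouts_per_step = GROUP_SIZE * PROMPTS_PER_STEP
tokens_per_step = rollouts_per_step * TOKENS_PER_ROLLOUT
train_tokens = tokens_per_step * RL_STEPS

num_evals = RL_STEPS // EVAL_EVERY
tokens_per_eval = EVAL_K * EVAL_PROBLEMS * TOKENS_PER_ROLLOUT
eval_tokens = num_evals * tokens_per_eval

print(f"rollouts per step      {rollouts_per_step:,}")
print(f"tokens per step        {tokens_per_step:,}")
print(f"training tokens        {train_tokens:,}")
print(f"tokens per eval        {tokens_per_eval:,}")
print(f"eval tokens            {eval_tokens:,}")
print(f"eval share of rollouts {100 * eval_tokens / train_tokens:.2f}%")
```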

Engineering Reality: What the Numbers Hide

The benchmark table is true and reproducible, but it hides four engineering realities that the next sections on R1-Zero's failure modes will pick up on.

Long completions are not free at inference time. Serving a 10K-token reasoning response costs ~10× the decode FLOPs of a 1K-token answer, and the KV-cache grows linearly. Users see this as slower time-to-first-answer. R1-Zero traded a lot of latency for accuracy.
cons@64 is not how anyone uses the model. No production deployment runs 64 rollouts per user query and votes. The 86.7% number is a research metric — it tells you the model can get there, not that it will when called once.
Readability suffers. R1-Zero's long completions are tangled — mid-stream language switching, repeated "wait let me reconsider" segments, no structure. This is what motivates the R1 (non-zero) pipeline in the next chapter: a cold-start SFT pass to teach the model to produce its reasoning cleanly before continuing with RL.
Coding lags math. R1-Zero is within a few points of o1-0912 on AIME pass@1 and slightly ahead on MATH-500, but trails it by 13+ points on LiveCodeBench and by ~400 rating points on CodeForces. The reward signal for code (compile? pass tests?) is noisier, slower, and admits more partial-credit ambiguity than math's "does the integer match?" test. RL is only as good as the reward.

The headline result is real and matters: a base model, with no supervised reasoning data, can be trained with pure RL on a verifiable reward to rival the strongest closed reasoning model of its era on math benchmarks. The four caveats above are exactly what the rest of the chapter, and chapter 17 on the full R1 pipeline, exists to address.