Boo-AI — Master Artificial Intelligence by Building from Scratch

The Real Problem: Was R1-Zero Actually Reasoning?

When DeepSeek released the R1 paper, the most-quoted figure in it was not the benchmark table, the GRPO loss, or the parameter count. It was a single annotated transcript: a math problem, a chain of arithmetic, and then, in the middle of the chain, the model writing “Wait, let me re-check.” followed by a corrected calculation. The DeepSeek authors labelled this the aha moment, and the rest of the field stared at the transcript for a week.

The reason it mattered is not that the answer was right. Any model can be right. The reason it mattered is that the model noticed it was wrong mid-generation, in a way that nobody had explicitly trained it to do. There was no SFT corpus of humans saying “wait, let me re-check”. There was no reward signal that mentioned self-verification. The reward function had two components — accuracy (1.0 if the final answer matches the gold) and format (0.1 if the <think></think> tags are present). That is the whole training signal. And yet the model started pausing, doubting, and correcting itself.

Until this transcript appeared, the field had treated “reasoning behaviours” — backtracking, self-verification, considering multiple methods — as things that must be taught by showing the model thousands of examples of humans (or other models) doing them. The aha moment is the public, falsifiable, line-in-the-text demonstration that this assumption is wrong. Or at least, that it is wrong at sufficient scale.

What this section answers. Section 16.1 stated the R1-Zero hypothesis; section 16.2 described the experimental setup. This section answers the only question that matters afterwards: did the experiment actually succeed, and how would we know? We will define the aha moment operationally (three measurable curves on the training log), derive why it has to happen given the GRPO objective from section 16.1, walk through a manual GRPO step that reinforces a self-verifying completion, and instrument a real training loop to catch the moment live.

Intuition: The Pause Before the Correction

Picture a person doing mental arithmetic out loud. A novice says the answer immediately and is sometimes wrong. A skilled mental arithmetic practitioner does something different: they say the answer, then often pause, then say “wait, hold on”, then re-do the calculation by a different route, then either confirm or correct the original answer. The pause is not a sign of confusion; it is the visible signature of an internal checking routine. Skilled arithmeticians have more doubt expressed per problem than novices, not less — because they have a separate cognitive process for catching errors, and that process leaves traces in the spoken output.

Pretrained language models contain the patterns for both behaviours. The training corpus — trillions of tokens of math homework, Stack Overflow threads, textbooks, and competition write-ups — is full of both novice-style direct answers and expert-style self-checking dialogues. Both shapes have non-zero probability mass in the base model's conditional distribution. The question R1-Zero was asking is: which shape does RL with a verifiable reward push the policy toward?

The answer turns out to depend on the difficulty of the problems in the training set. On problems the model can solve in one pass with reasonable probability, the optimal policy is short and direct — the “novice” shape minimises tokens generated per unit of reward earned. But on problems the model cannot reliably solve in one pass, the optimal policy switches to the “expert” shape: generate a candidate answer, verify it, and revise if necessary. Self-verification trades tokens for a higher probability of landing on a correct final answer. When the reward function pays 1.0 for correct and 0.0 for incorrect regardless of length, that trade is heavily in favour of self-verification — on hard problems.
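The trade can be made concrete with a toy model. The sketch below is illustrative only: it assumes that each solution attempt is correct with probability `p`, that a verification pass catches a wrong answer with probability `d` and never rejects a right one, and that a caught error triggers exactly one fresh attempt. None of these numbers come from the R1 paper.

```python
def expected_reward_single_pass(p: float) -> float:
    """Expected accuracy reward when the model answers once and stops."""
    return p


def expected_reward_verify(p: float, d: float) -> float:
    """Expected accuracy reward with one verify-and-retry round.

    Toy assumptions: the check catches a wrong answer with probability d,
    never rejects a right answer, and a caught error triggers one new
    attempt that is again correct with probability p.
    """
    return p + (1.0 - p) * d * p


# Hypothetical per-attempt success rates for an easy and a hard problem.
for label, p in [("easy", 0.95), ("hard", 0.40)]:
    single = expected_reward_single_pass(p)
    verify = expected_reward_verify(p, d=0.8)
    print(f"{label}: single={single:.3f} verify={verify:.3f} gain={verify - single:+.3f}")
```

Under these assumptions the gain from verifying is small when `p` is already high and much larger when `p` is low, which is the asymmetry the paragraph above describes. Because the reward ignores length, the extra tokens cost nothing in this model.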

The aha moment is the point at which the model finally figures out that the trade is worth making. Before the moment, the model has plateaued on easy problems and is failing on hard ones. After the moment, it has discovered — through its own sampling, never through a teacher — that doubling its response length to verify the answer is a strategy that earns more reward in expectation. From that point on, every gradient step pushes the policy further in the self-verifying direction.

The intuition in one line. RL with a verifiable reward does not teach the model to be careful; it teaches the model that being careful pays. The pause before the correction is what “careful” looks like when produced by a transformer.

The Anatomy of an Aha Moment

Before any maths, look at the thing itself. The comparator below shows five completions to the same prompt — What is 17 + 26? — produced by the same model architecture at five different training steps of an R1-Zero run. The prompt does not change, the sampling temperature does not change, the reward function does not change. Only the policy weights change, and they change only through GRPO updates against a regex-based reward.

Same prompt, five training steps: watch the shape of the answer change

prompt: What is 17 + 26?

step 0 sample (reward 0.00): 1+1=2 the wave function 17+26 is the answer 43 maybe blue sky 12345 </think> the result and so on the end

[Interactive comparator; fields shown per sample: step, reward, tokens, self-verifies?]

The amber-highlighted phrases are exactly what the aha-moment detector in the next code listing matches against. Notice that the self-check phrases only appear in the step 6.1k and step 9.5k samples — and notice that the step 6.1k sample actually makes a mistake and then corrects it. That is not a flaw of the example; that is the whole point. The behaviour is “catch your own arithmetic slip,” not “always be right on the first try.”

Three things are worth marking. First, at step 0 the model is incoherent. The base model does not even know what a math problem is when prompted without a chat template; it produces a free-association of tokens that loosely include the answer somewhere. Second, by step 1.5k the model has learned the format (it has the <think> tags) but its reasoning is barely more than the answer itself. Third — and this is the moment — somewhere between step 4.2k and step 6.1k the model starts to produce a fundamentally different shape of response: it makes a mistake on purpose (or at least, samples a wrong path), then notices, then corrects.

Notice what the step 6.1k completion actually does. It computes the units digit correctly (7+6=13, carry 1), then mis-computes the tens digit (1+2 = 3 instead of 1+2+1 = 4), arriving at “33”. Then it writes “Wait, let me re-check.” Then it re-does the tens calculation with the carry, arrives at 43, and explicitly flags “I confused myself.” This is not the kind of sequence the base model produces. It is also not the kind of sequence a typical SFT dataset contains — human annotators tend to write clean, error-free solutions, not solutions that contain a mid-stream correction. The R1-Zero model produced this shape by sampling, getting rewarded for landing on the correct answer, and having its self-correcting tokens reinforced by GRPO along with the correct ones.
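The arithmetic slip in that transcript is easy to reproduce. The helper below is a hypothetical illustration (the name `add_two_digit` and its `keep_carry` flag are ours, not the model's): dropping the carry from the units column gives the first, wrong answer, and keeping it gives the corrected one.

```python
def add_two_digit(a: int, b: int, keep_carry: bool = True) -> int:
    """Column-wise addition of two two-digit numbers.

    keep_carry=False reproduces the slip in the transcript: the carry from
    the units column is dropped when the tens column is added.
    """
    units = a % 10 + b % 10
    carry = units // 10
    tens = a // 10 + b // 10 + (carry if keep_carry else 0)
    return tens * 10 + units % 10


first_pass = add_two_digit(17, 26, keep_carry=False)  # the answer before "Wait"
re_check = add_two_digit(17, 26, keep_carry=True)     # the answer after the re-check
print(first_pass, re_check)
```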

The Mathematical Idea: Why Length and Self-Verification Emerge

The mechanism behind the aha moment is not new; it is just the clipped-PG + KL objective from section 16.1 applied for long enough, on hard enough problems, at a large enough scale. Let $r(x, y) \in \{0, 0.1, 1.0, 1.1\}$ be the rule-based reward and let $A^{(i)} = (r_i - \mu) / (\sigma + \varepsilon)$ be the group-relative advantage. The per-completion gradient signal is then

$g^{(i)} \;=\; A^{(i)} \sum_{t} \nabla_\theta \log \pi_\theta\bigl(y^{(i)}_t \mid x, y^{(i)}_{<t}\bigr).$

The first thing to notice is that $A^{(i)}$ is a scalar per completion, not per token. Every token in completion $i$ receives the same advantage. So if a self-verifying completion gets a positive advantage, EVERY token in it — including the “Wait, let me re-check” tokens — has its log-probability pushed up. The reward function never mentioned those tokens; the gradient does not care.

The second thing to notice is that longer completions get more total gradient mass. If completion $i$ has $T^{(i)}$ tokens, the sum on the right-hand side has $T^{(i)}$ terms. Two completions with the same advantage but different lengths contribute different total magnitudes of update. A self-verifying completion that triples the token count triples the gradient magnitude per unit of advantage.
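A minimal sketch of that counting argument follows. It assumes the unnormalised sum-over-tokens form of $g^{(i)}$ written above, and it tracks only the scalar weights in front of the log-probability gradients, not the gradients themselves. The `length_normalised` flag is our addition: some GRPO formulations divide each completion's sum by its length, and under that variant the length effect on this quantity disappears.

```python
def per_token_advantages(advantage: float, n_tokens: int) -> list[float]:
    """Broadcast one completion-level advantage to every token position."""
    return [advantage] * n_tokens


def gradient_mass(advantage: float, n_tokens: int, length_normalised: bool = False) -> float:
    """Sum of |per-token weight| over the completion.

    With length_normalised=True each token's weight is divided by the
    completion length, as some GRPO formulations do.
    """
    weights = per_token_advantages(advantage, n_tokens)
    if length_normalised:
        weights = [w / n_tokens for w in weights]
    return sum(abs(w) for w in weights)


short = gradient_mass(1.0, 14)
long_ = gradient_mass(1.0, 42)  # same advantage, three times the tokens
print(short, long_, long_ / short)
print(gradient_mass(1.0, 14, True), gradient_mass(1.0, 42, True))
```

Which normalisation a given training run uses is worth checking before relying on the length argument.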

The third thing, the one that creates the inflection point, is the composition of the positive-advantage set. Let $S^{+}(t)$ be the set of completions in batch $t$ with $A^{(i)} > 0$, and let $q(\text{verify} \mid S^{+}, t)$ be the fraction of those completions that contain self-verification phrases. The expected per-step change in policy log-probability for the self-verification pattern is approximately

$\Delta \log \pi_\theta(\text{verify}) \;\propto\; \eta \cdot |S^{+}(t)| \cdot \bar A^{+} \cdot q(\text{verify} \mid S^{+}, t),$

where $\eta$ is the learning rate and $\bar A^{+}$ is the mean positive advantage. Early in training, $q(\text{verify} \mid S^{+}, t)$ is near zero — the positive-advantage completions are mostly short, single-pass correct answers. As training continues, the easy problems get solved reliably and the residual positive-advantage signal comes increasingly from hard problems — problems where the only correct completions are the self-verifying ones. As soon as $q(\text{verify} \mid S^{+}, t)$ rises above the rate at which self-verification appears in the policy overall, the update equation above is positive, and the self-verification pattern starts to be amplified. Once amplified, the pattern appears in more sampled completions, which produces more positive-advantage self-verifying completions, which amplifies the pattern further. The feedback loop is closed, and the policy transitions to a regime where self-verification is the modal response shape.
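The feedback loop can be illustrated with a toy simulation. This is not the GRPO update itself: the rule `logit(v) += eta * (q - v)` is an assumption chosen to mirror the condition in the text (amplification whenever the share of self-verifying completions among the correct ones, `q`, exceeds their overall rate, `v`), and the success probabilities are invented.

```python
import math


def simulate_verify_rate(v0=0.02, p_verify=0.6, p_single=0.1, eta=0.5, steps=60):
    """Toy dynamics for the share of completions that self-verify.

    v        : probability that a sampled completion self-verifies
    p_verify : P(correct | self-verify) on hard problems (assumed)
    p_single : P(correct | single-pass) on hard problems (assumed)
    Each step moves logit(v) by eta * (q - v), where q is the share of
    self-verifying completions among the correct ones.
    """
    v = v0
    history = [v]
    for _ in range(steps):
        q = v * p_verify / (v * p_verify + (1.0 - v) * p_single)
        logit = math.log(v / (1.0 - v)) + eta * (q - v)
        v = 1.0 / (1.0 + math.exp(-logit))
        history.append(v)
    return history


history = simulate_verify_rate()
for step in range(0, len(history), 10):
    print(f"step {step:2d}: verify rate = {history[step]:.3f}")
```

In this toy the rate creeps upward for the first several steps and then accelerates, because `q - v` grows as `v` grows. The shape is qualitative; the step counts mean nothing for a real run.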

This is what an emergent behaviour means in the technical sense: it is not in the reward function, it is not in the loss function explicitly, and it is not in the gradient at step 0. But it is present in the fixed point of the training dynamics. The loss function names accuracy and format; the fixed point of optimising that loss against a verifiable reward at scale also includes self-verification, length, multi-method comparison, and explicit backtracking. Those behaviours are consequences of the loss, not ingredients of it.

The one-sentence theorem. Under GRPO with a verifiable, sparse, length-insensitive reward, any behaviour that increases $P(\text{correct} \mid x)$ on hard problems is reinforced in proportion to the rate at which it appears in the positive-advantage set. Self-verification increases that probability on hard problems. So it is reinforced.

The Three Quantitative Signatures

The aha moment is not a sentence in a transcript; that is just the most photogenic way to present it. Operationally, the moment is the first training step at which three measurable quantities cross critical thresholds together. The DeepSeek-R1 paper plots all three on the same axis (their Figure 2 and Figure 3), and the inflection is obvious to the eye.

Signal	What it measures	Pre-aha value	Post-aha value
Mean response length	Average tokens per completion across the batch	~200–800 tokens	~1500–3500 tokens
Self-verification rate	Fraction of completions matching the regex ("wait", "re-check", "actually", "I made a mistake", "Method 2", …)	<10%	>50%, then >80% by mid-training
pass@1 on hard problems	Accuracy on the hard slice of the eval set (AIME, MATH-500 hard subset, etc.)	~70–78% (plateau)	>85%, climbing to ~91%

Two things make this triple a real signature rather than three independent curves. First, the three curves inflect within a few hundred steps of each other. Response length and self-verification rate rise together because they are causally linked: self-verification phrases add tokens. Pass@1 on hard problems rises with them because it is the benefit the model is paying tokens to achieve. Second, the three curves are flat or slow-growing before the moment and steep after it. That phase-transition shape is what you would expect from the feedback-loop argument above, and not what you would expect from ordinary gradient descent on a smooth loss.

What the curves rule out. The phase-transition shape rules out the lazy explanation that the model is just “getting better with more compute.” If that were the story, all three curves would rise smoothly from step 0. They do not. They are flat and then they bend.
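The operational definition (three signals crossing thresholds together) can be sketched as a small function over logged curves. The threshold values and the function name `find_aha_step` are placeholders of ours. The log below is synthetic: only the step 4500 and step 6500 accuracy and self-verify figures echo numbers quoted in this section, and the rest is invented for illustration.

```python
def find_aha_step(steps, lengths, verify_rates, hard_pass1,
                  min_length=1000.0, min_verify=0.5, min_pass1=0.80):
    """Return the first logged step at which all three signals meet their thresholds.

    Returns None if the thresholds are never all met at the same logged step.
    """
    for step, length, rate, acc in zip(steps, lengths, verify_rates, hard_pass1):
        if length >= min_length and rate >= min_verify and acc >= min_pass1:
            return step
    return None


# Synthetic log, for illustration only.
steps        = [4500, 5000, 5500, 6000, 6500, 7000]
lengths      = [680, 760, 900, 1250, 1900, 2400]
verify_rates = [0.16, 0.20, 0.29, 0.45, 0.61, 0.70]
hard_pass1   = [0.75, 0.76, 0.78, 0.81, 0.84, 0.86]

print(find_aha_step(steps, lengths, verify_rates, hard_pass1))
```

Requiring all three conditions at once is what separates the aha signature from a run where only one curve drifts upward.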

Interactive: The Training Curves That Define the Aha Moment

The chart below is synthesized to match the qualitative shape of Figures 2 and 3 in the DeepSeek-R1 paper. Toggle the three lines on and off, and hover the chart to read the values at each training step. The pink band marks the canonical aha window (around step 6k); the three pink-outlined dots are the values of each curve at the inflection point.

[Interactive chart: The three curves that define the aha moment. x-axis: training step; series: response tokens, pass@1, self-verify rate.]

Hover the chart to read each step. The vertical pink band marks the aha window (step 5.5k–6.5k). Before it: response length is rising slowly, accuracy is plateauing near 78%, self-verification rate is below 30%. After it: response length explodes by 3×, accuracy breaks past the single-pass ceiling toward 90%, and the self-verification curve crosses 0.5 for the first time. The three events happen together — that synchrony is the aha moment's signature.

Two questions to ask the chart. First: at step 4500, how good is the model already? The answer is quite good — pass@1 is 75%, response length is around 680 tokens, the model is producing coherent reasoning. If you stopped training here you would have a respectable but unremarkable reasoning model. Second: at step 6500, what new capability has emerged that did not exist at step 4500? The answer is — almost entirely — self-verification on hard problems. Pass@1 jumped from 75% to 84% in 2000 steps, an enormous gain at this scale, and the self-verify rate jumped from 0.16 to 0.61 in the same window. The whole gain is explained by the new behaviour.

Which Self-Verification Phrases Rise First

Not all self-verification phrases rise at the same rate. The heatmap below tracks five distinct patterns over training. Each row is one phrase; each cell is the per-completion rate at a 500-step interval; darker amber means the phrase appears in more completions. The pink line is the canonical aha step.

Self-verification phrase frequencies over training (per-completion rate)

Phrases do not all rise together. “wait” rises first because it is the cheapest signal (one word, easy to sample by accident). “let me re-check” rises later because it requires the model to construct an entire verification phase after it. “Method 2” rises last because it requires the model to carry out two independent solution attempts and compare them — the most expensive form of self-verification. The pink line marks the aha step; notice that all five curves accelerate sharply just before it and saturate after it.

The asymmetry is informative. The single word “wait” rises first and fastest because it is the cheapest signal — a one-token interjection that the model can stumble into accidentally and then have reinforced. The longer phrase “let me re-check” rises later because the model has to actually construct a verification phase after it — emitting the words alone is not enough to earn reward. “I made a mistake” and the comparative “Method 2” pattern rise latest and saturate lowest because they require multi-step reasoning to be in place already. The ordering is itself a clue: the model learns to flag doubt before it learns to act on doubt.
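Per-phrase rates of this kind can be computed with one regex per phrase. The sketch below is an assumption-laden illustration: the text does not list the exact five patterns in the heatmap, so the four patterns here are stand-ins taken from the phrases named above, and the sample completions are made up.

```python
import re

# Illustrative patterns only; the exact heatmap rows are not listed in the text.
PHRASES = {
    "wait": re.compile(r"\bwait\b", re.IGNORECASE),
    "let me re-check": re.compile(r"\blet me re-?check\b", re.IGNORECASE),
    "I made a mistake": re.compile(r"\bI made a mistake\b", re.IGNORECASE),
    "Method 2": re.compile(r"\bmethod 2\b", re.IGNORECASE),
}


def phrase_rates(completions):
    """Fraction of completions that contain each phrase at least once."""
    total = len(completions)
    if total == 0:
        return {name: 0.0 for name in PHRASES}
    return {
        name: sum(1 for text in completions if pattern.search(text)) / total
        for name, pattern in PHRASES.items()
    }


sample = [
    "17+26 = 33. Wait, let me re-check: 7+6=13, carry 1, so 43.",
    "17+26 = 43.",
    "Wait, I made a mistake. Method 2: 20+26-3 = 43.",
    "17+26 = 43.",
]
print(phrase_rates(sample))
```

Calling `phrase_rates` on the completions sampled at each logging interval would give one column of such a heatmap. Counting each completion at most once per phrase matches the per-completion rate described above.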

Manual Numerical Walkthrough: One Gradient Step That Prefers a Self-Check

Open the panel below and follow the arithmetic. We will compute one GRPO update by hand on a group of four completions, two of which self-verify and two of which do not, and show that the self-verifying pair receives 3.5× the net gradient mass of the single-pass pair in this single step. Multiply by 10,000 steps and you have the aha moment.

Manual Numerical Walkthrough — open to see every number

Setup. One hard math problem with gold answer 43. Four completions sampled at temperature 1.0 from a policy that is partway through training (step ~5500, just before the moment):

i	shape	answer	tokens T	reward r
0	self-verify	43 (correct)	92	1.1
1	self-verify	44 (wrong)	78	0.1
2	single-pass	43 (correct)	14	1.1
3	single-pass	33 (wrong)	10	0.1

Step 1 — Group mean and std.

$\mu = (1.1 + 0.1 + 1.1 + 0.1) / 4 = 0.6.$

$\sigma^2 = \tfrac{1}{4}\bigl((1.1 - 0.6)^2 + (0.1 - 0.6)^2 + (1.1 - 0.6)^2 + (0.1 - 0.6)^2\bigr) = \tfrac{1}{4} \cdot (0.25 + 0.25 + 0.25 + 0.25) = 0.25.$

$\sigma = 0.5.$

Step 2 — Group-relative advantages.

$A^{(0)} = (1.1 - 0.6)/0.5 = +1.0, \quad A^{(1)} = (0.1 - 0.6)/0.5 = -1.0,$

$A^{(2)} = +1.0, \quad A^{(3)} = -1.0.$

So both correct completions get the same scalar advantage $+1.0$. From one step in isolation the gradient cannot distinguish the self-verifying correct completion (i=0) from the single-pass correct completion (i=2). Both will have their tokens pushed up.

Step 3 — Per-completion gradient mass. The per-completion contribution to the policy-gradient term is proportional to $|A^{(i)}| \cdot T^{(i)}$. Plugging in:

i	shape	|A|·T	direction
0	self-verify, correct	1.0 × 92 = 92	push UP every token
1	self-verify, wrong	1.0 × 78 = 78	push DOWN every token
2	single-pass, correct	1.0 × 14 = 14	push UP every token
3	single-pass, wrong	1.0 × 10 = 10	push DOWN every token

Step 4 — Net gradient mass on self-verify vs single-pass patterns.

On the self-verifying pattern (i=0 and i=1), the net signed gradient mass is $+92 - 78 = +14$: positive, but small relative to either completion's magnitude. On the single-pass pattern (i=2 and i=3), it is $+14 - 10 = +4$. Both are positive, so both shapes are reinforced this step, but the self-verifying pattern receives 3.5× more net positive gradient mass.

Step 5 — Why this compounds. The 3.5× ratio comes from the length difference alone: every advantage has magnitude 1.0, so the signed sums are 92 − 78 = 14 for the self-verifying pair and 14 − 10 = 4 for the single-pass pair. Over thousands of steps, every self-verifying completion in the batch contributes roughly 7× the gradient of every single-pass completion, simply because it is roughly 7× longer (92 and 78 tokens against 14 and 10). As soon as self-verifying completions appear in the positive-advantage set at non-negligible rates, they dominate the gradient update by sheer token count. That is the engine behind the feedback loop in the math section above.

Sanity check. If you set all four completions to the same length, the self-verify gradient mass collapses to the single-pass gradient mass. The aha moment IS the length effect. Length-penalised rewards (as used in some later R1 variants) damp it, which is exactly what you would predict from this walkthrough.
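The walkthrough can be checked in a few lines. This sketch takes the token counts and rewards from the table above and uses the unnormalised $A \cdot T$ mass from Step 3. The helper name `net_gradient_mass` and the equal length of 20 in the sanity check are our choices.

```python
from statistics import mean, pstdev

# (shape, tokens T, reward r) for the four completions in the walkthrough table.
group = [
    ("self-verify", 92, 1.1),
    ("self-verify", 78, 0.1),
    ("single-pass", 14, 1.1),
    ("single-pass", 10, 0.1),
]


def net_gradient_mass(group, eps=1e-8):
    """Signed sum of A * T per response shape, using group-relative advantages."""
    rewards = [r for _, _, r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    net = {}
    for shape, tokens, r in group:
        advantage = (r - mu) / (sigma + eps)
        net[shape] = net.get(shape, 0.0) + advantage * tokens
    return mu, sigma, net


mu, sigma, net = net_gradient_mass(group)
print(f"mu={mu:.2f} sigma={sigma:.2f}")
print({shape: round(value, 2) for shape, value in net.items()})
print(f"ratio = {net['self-verify'] / net['single-pass']:.2f}")

# Sanity check from the text: with equal lengths the two net masses coincide.
equal = [(shape, 20, r) for shape, _, r in group]
equal_net = net_gradient_mass(equal)[2]
print({shape: round(abs(value), 2) for shape, value in equal_net.items()})
```

With equal lengths both net masses are zero in this group, since each shape has one correct and one wrong completion with opposite advantages.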

Plain Python: Detecting the Aha Moment in a Stream of Completions

The detector that produced the self-verification curves on the chart above is fewer than 30 lines of Python. It runs over a stream of completions, flags the self-verification phrases, and computes the rates that get plotted. The same detector is also useful for filtering — in R1's stage 2 the team kept only completions that did self-verify, which dramatically tightens the distribution.

Detect, score, and explain the aha moment — plain Python

SELF_CHECK regex — the operational definition of self-verification

The R1-Zero paper's Figure 3 ('self-verification rate over training') is plotted by running a regex like this one over every completion in a held-out evaluation set and computing the fraction that matches. The regex is broad on purpose: 'wait', 'let me re-check', 'actually', 'I made a mistake' all count. This detector is NEVER inside the reward function — it is purely an observational tool used by humans reading the training run. The model has no idea this regex exists and is not optimising for it.

EXAMPLE

detect_aha('Wait, let me re-check: 7+6=13')['has_self_check'] == True
detect_aha('17+26 = 43.')['has_self_check'] == False

detect_aha() returns features, not a score

Three features are tracked per completion: whether ANY self-verification phrase appears, HOW MANY appear, and the token count. The token count matters because the aha moment correlates strongly with response length growth — the model is not just adding the word 'wait', it is adding the entire re-verification phase that follows. The R1-Zero paper plots all three signals on the same x-axis (training step) and the curves rise together.

EXAMPLE

detect_aha(completions[0]) returns {'has_self_check': True, 'n_self_check': 2, 'n_tokens': 14, 'phrases': ['Wait', 'let me re-check']}

completions[] — the critical four-way comparison

These four completions are the entire pedagogical content of this section in one variable. Two of them self-verify, two do not. Two get the correct answer, two do not. The reward function cannot distinguish self-verification from single-pass — it only sees the final extracted answer and the format. So why does self-verification get reinforced? Because of the COMPOSITION of the group at this stage of training: the self-verifying completions happen to be the ones landing on the correct answer more often, so they are the ones receiving positive advantage. That is the entire causal chain.

EXAMPLE

Among the 4 completions: only completion 0 gets reward 1.1 AND self-verifies.
Completions 1, 2, 3 either fail to land on '43' or get there without self-checking.
From the gradient's perspective, 'self-verify' and 'correct' are confounded — and that is exactly what makes the behaviour spread.

reward() — the exact same regex-based scorer from the previous section

Nothing has changed about the reward function. It still returns 1.1 for correct + formatted, 0.1 for formatted-but-wrong, and 0.0 otherwise. The aha moment is NOT engineered into the reward function. This is the most important sentence in this whole section: a model trained on a reward that never mentioned self-checking nevertheless learned to self-check. The reward function is just a force; self-verification is the equilibrium shape that force happens to push the policy toward when the underlying task is verifiable and hard.

EXAMPLE

reward(completions[0], '43') = 1.1  (correct + formatted)
reward(completions[1], '43') = 0.1  (wrong + formatted)
reward(completions[2], '43') = 1.1  (correct + formatted)
reward(completions[3], '43') = 0.1  (wrong + formatted)

rewards[] — the four numbers that drive everything downstream

Two of the four completions get reward 1.1, two get 0.1. The group mean is 0.6. Subtract the mean from each, and the two correct completions are above zero by +0.5 while the two wrong ones are below by -0.5. That is the entire signal — but notice it does not yet distinguish completion 0 (self-verify + correct) from completion 2 (single-pass + correct). Both look identical to GRPO at this single step. The aha moment is not visible in one step; it emerges across thousands of steps as the COMPOSITION of the 'correct' set changes.

EXAMPLE

rewards = [1.1, 0.1, 1.1, 0.1] -> mean=0.6, std≈0.5

adv[] — positive for both correct completions, negative for both wrong ones

Group-relative advantage in this toy: [+1.0, -1.0, +1.0, -1.0]. Completion 0 and completion 2 receive the same positive scalar. That scalar then multiplies EVERY token's log-prob gradient in those completions. Completion 0 has more tokens (the 'wait, let me re-check' tokens), so MORE distinct n-gram patterns get reinforced, including the self-verification phrase. Completion 2 reinforces fewer patterns because it is shorter. Over thousands of GRPO steps, the self-verification n-grams accumulate proportionally more gradient updates than the single-pass n-grams, even though the per-completion advantage is the same.

EXAMPLE

adv = [+1.0, -1.0, +1.0, -1.0]
Completion 0 has 14 whitespace tokens × adv +1.0 = 14 unit-token gradient updates.
Completion 2 has 3 whitespace tokens × adv +1.0 = 3 unit-token gradient updates.
Under this whitespace count, the 'wait, let me re-check' completion receives about 4.7× the gradient mass of the short single-pass completion.

The aha moment as a phase transition, not a single step

Practically, the moment shows up in plots as a sharp inflection in three curves around the same step: response length, self-verification phrase rate, and pass@1 on hard problems. Before the inflection, the model gets easy problems right with a single pass and gets hard problems wrong. After the inflection, the model gets BOTH easy and hard problems right because it has internalised that 'long, self-checking responses' are the reliably-rewarded shape. The crossover happened around step 6000 in the published 671B run; on smaller models the curves never cross.

EXAMPLE

Easy problem (5+3=?):    Step 0: gets it 80% of the time. Step 10k: gets it 99%, in one pass, no self-check.
Hard problem (factor 1729): Step 0: gets it 0%. Step 5k: gets it 5% via lucky single pass. Step 10k: gets it 70% via self-verification.


```python
"""
Detecting and explaining the Aha Moment from a stream of R1-Zero completions.

Two pieces here, both runnable on a laptop, both faithful to what the real
DeepSeek-R1-Zero training loop is doing:

  1. detect_aha(text) -> flags the textual signature of self-verification.
     This is the SAME function (regex + token-count) that the published
     R1-Zero paper uses to track its Figure 3 self-verification curve.

  2. An inline GRPO advantage computation (part 3 below) -> shows that, when
     a self-verifying completion gets the correct answer, GRPO's
     group-relative advantage assigns POSITIVE advantage to every token of
     the self-verifying response, including the tokens that say
     'Wait, let me re-check.' That is why self-verification is reinforced
     even though nothing in the reward function names it.
"""

import re
from statistics import mean, pstdev

# ---------------------------------------------------------------------------
# 1. The aha-moment detector. The phrases below are the empirical
#    fingerprint of the behaviour as reported in DeepSeek-R1 Fig 3.
#    Note that the reward function does NOT use this detector --
#    it is purely an *observational* probe over the training data.
# ---------------------------------------------------------------------------

SELF_CHECK = re.compile(
    r"\b(wait|hold on|actually|let me (re-?check|verify|reconsider|try again)|"
    r"i (made a mistake|confused myself|got that wrong)|"
    r"on second thought|let me double-check)\b",
    re.IGNORECASE,
)

def detect_aha(text: str) -> dict:
    """Return a dict of features that characterise a 'self-verifying' completion."""
    # The pattern has nested groups, so findall() returns one tuple per match;
    # element 0 of each tuple is the full matched phrase.
    hits = SELF_CHECK.findall(text)
    return {
        "has_self_check": len(hits) > 0,
        "n_self_check":   len(hits),
        "n_tokens":       len(text.split()),    # cheap stand-in for tokenizer
        "phrases":        [h if isinstance(h, str) else h[0] for h in hits],
    }

# ---------------------------------------------------------------------------
# 2. Four completions sampled from one prompt during step ~6000. One uses
#    self-verification AND lands on the correct answer; one uses
#    self-verification but lands on the wrong answer; one is single-pass and
#    correct; one is single-pass and wrong. The reward function (correct?
#    accuracy=1, format=0.1) cannot see the difference between styles --
#    it only sees the final answer. Yet GRPO will still reinforce the
#    self-verifying style. Watch why.
# ---------------------------------------------------------------------------

completions = [
    # self-verify + correct -- the future R1-Zero
    "<think>17+26 looks like 33. Wait, let me re-check: 7+6=13 carry 1, 1+2+1=4, so 43.</think><answer>43</answer>",
    # self-verify + wrong -- self-checking does NOT guarantee being right
    "<think>17+26 = 44. Hmm, actually let me try again. 7+6=13, 1+2+1=4, so 44.</think><answer>44</answer>",
    # single-pass + correct -- the previous champion at step ~4000
    "<think>17+26 = 43.</think><answer>43</answer>",
    # single-pass + wrong -- the base-model default
    "<think>17+26 = 33.</think><answer>33</answer>",
]

GOLD = "43"
PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def reward(text: str, gold: str) -> float:
    """Format reward (0.1) plus accuracy reward (1.0); 0.0 if the tags are missing."""
    m = PATTERN.search(text)
    if m is None:
        return 0.0
    return (1.0 if m.group(2).strip() == gold else 0.0) + 0.1

rewards = [reward(c, GOLD) for c in completions]
features = [detect_aha(c) for c in completions]

print("idx | reward | self-check? | tokens | snippet")
for i, (c, r, f) in enumerate(zip(completions, rewards, features)):
    flag = "yes" if f["has_self_check"] else "no "
    print(f" {i}  |  {r:4.2f}  |    {flag}     |   {f['n_tokens']:3d}  | {c[:50]}...")

# ---------------------------------------------------------------------------
# 3. GRPO advantage assignment. Notice: completions 0 and 2 (the two correct
#    ones) are above the group mean. Every token in completion 0
#    -- including the 'Wait, let me re-check' tokens -- receives a positive
#    advantage. The reward function never mentioned self-verification, but
#    the gradient step amplifies it anyway, because it co-occurs with a
#    correct answer in the group.
# ---------------------------------------------------------------------------

mu = mean(rewards)
sd = pstdev(rewards) + 1e-8   # epsilon guards against a zero-variance group
adv = [(r - mu) / sd for r in rewards]

print(f"\ngroup mean reward = {mu:.3f}, std = {sd:.3f}")
print("idx | reward | advantage | which tokens get the gradient?")
for i, (r, a, f) in enumerate(zip(rewards, adv, features)):
    style = "self-verify" if f["has_self_check"] else "single-pass"
    direction = "PUSHED UP" if a > 0 else "PUSHED DOWN"
    print(f" {i}  |  {r:4.2f}  |  {a:+5.2f}    | all {f['n_tokens']:3d} {style} tokens {direction}")

# ---------------------------------------------------------------------------
# 4. The aha moment is the inflection point where completion 0's STYLE
#    (self-verification) starts winning consistently across the dataset.
#    Before the inflection: completion 2 wins (single-pass + correct, short).
#    After the inflection: completion 0 wins (self-verify + correct, long).
#    The group composition tips because, on harder problems, the only way
#    to get accuracy=1 reliably is to self-verify -- and the model has
#    finally sampled enough self-verifying completions for them to dominate
#    the positive-advantage slot.
# ---------------------------------------------------------------------------

print("\nThe aha moment is not a single training step.")
print("It is the step at which P(self-verify | correct answer) first exceeds")
print("P(single-pass | correct answer) when averaged over hard problems.")
print("In the DeepSeek-R1-Zero run, that crossover happened around step 6000.")
```

PyTorch: Instrumenting an R1-Zero Run to Catch the Moment Live

The plain-Python detector above is what you would run offline over a dump of completions. In a real training run you want the same signals live, every N steps, so you can see the inflection happen and decide when to checkpoint. The PyTorch version below is a drop-in addition to the GRPO step from section 16.1 — one tracker object, one tracker.update() call per step. Everything inside the gradient graph is unchanged.

GRPO step + AhaMomentTracker — drop-in instrumentation


AhaMomentTracker: a passive observer, not a trainer

Everything inside this class is OUTSIDE the gradient graph. It holds three rolling deques of scalars and a single integer (the first step at which the aha criterion fires). It never calls .backward(), never modifies the policy. The whole reason it can be passive is that the aha moment is a CONSEQUENCE of the GRPO loss, not a cause. The class exists so a human looking at training logs can answer the question 'has the model started reasoning yet?' without re-reading thousands of completions by hand.

EXAMPLE

tracker = AhaMomentTracker(window=50, tokenizer=tok)
# Once the criterion has fired (illustrative values): tracker.aha_step == 6021, tracker.lengths[-1] ≈ 1380, tracker.sc_rates[-1] ≈ 0.51.
# All this information is stored on the CPU; the GPU never sees the tracker.

lengths / sc_rates / rewards: three rolling buffers, same length

Each deque holds at most window=50 entries, and each entry is the per-batch average of one signal. We use a rolling window rather than a single batch because per-batch noise is large: a single batch of 16 completions on a hard problem can have zero self-checks even when the policy's true rate is 30%. Averaging 50 batches × 16 completions = 800 samples per data point smooths out the noise enough for the inflection to be visible.

EXAMPLE

At step 6000:
  lengths   = [1290, 1310, 1320, ..., 1380]  (last 50 batch means)
  sc_rates  = [0.42, 0.47, 0.45, ..., 0.51]
  rewards   = [0.78, 0.79, 0.80, ..., 0.81]
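The case for the rolling window can be checked with a few lines of arithmetic. The sketch below assumes completions are independent draws with a fixed true self-check rate of 30% (the figure used above). Real completions within a group share a prompt and are correlated, so treat the numbers as an optimistic bound rather than a measurement.

```python
import math

def prob_zero_hits(p, n):
    """Probability that none of n independent completions contains a self-check."""
    return (1.0 - p) ** n

def std_error(p, n):
    """Standard error of an observed rate estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)

p_true = 0.30          # true self-check rate from the text
k = 16                 # completions per batch
window = 50            # batches per rolling window

print(f"P(zero self-checks in one batch of {k}) = {prob_zero_hits(p_true, k):.4f}")
print(f"std error, single batch (n={k})         = {std_error(p_true, k):.3f}")
print(f"std error, full window (n={k * window})  = {std_error(p_true, k * window):.3f}")
```

Under these assumptions an all-zero batch is possible but rare. The larger effect is the batch-to-batch jitter: the single-batch estimate has a standard error of about 0.11, while the 800-sample window brings it down to about 0.016.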

update(): the only method called from inside training

Called ONCE per GRPO step, before the gradient update. Three things happen: token-count each completion (using the actual tokenizer, not a regex split), regex-match each completion for self-verification, and store the batch-mean of all three signals. The tokenizer call is the only non-trivial cost — at K=16 completions × 4k tokens each, encoding takes a few milliseconds, negligible next to the forward/backward pass.

EXAMPLE

After step 6021:
  lens = [1382, 1394, 1370, 1410, ...]  # 16 ints, one per completion
  sc   = [1.0, 0.0, 1.0, 1.0, ...]      # 16 booleans cast to float
  -> rolling window now shows: lengths[-1]=1389, sc_rates[-1]=0.62

The aha-moment criterion: two conditions, both heuristic

We declare the aha moment at the first step where (a) the rolling-window self-check rate exceeds 0.5 AND (b) the latest mean response length is more than 5x the step-0 length. The exact thresholds do not matter — pick 0.4 and 4x or 0.6 and 6x and you will see the moment fire within a few hundred steps of step 6000. What matters is that BOTH conditions are required: response-length growth without self-checking can happen when the model is just verbose; self-checking without length growth would just be noise.

EXAMPLE

Step 5500: sc=0.32, len=1050. sc < 0.5 -> no aha.
Step 5950: sc=0.49, len=1320. sc < 0.5 -> no aha.
Step 6021: sc=0.51, len=1382. Both conditions true -> AHA fires for the first time.
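The claim that the exact thresholds barely matter can be tested offline by replaying a logged curve through the criterion with different settings. A minimal sketch follows. The function `first_aha_step`, the `(step, rolling_sc, mean_len)` log format, the baseline length of 200 tokens and the toy curve are all hypothetical and chosen only to make the example run.

```python
def first_aha_step(log, baseline_len, sc_thresh=0.5, len_mult=5.0):
    """Return the first step where both heuristic conditions hold, else None.

    log: list of (step, rolling_self_check_rate, mean_response_len) tuples.
    baseline_len: mean response length at the start of training.
    """
    for step, sc, length in log:
        if sc > sc_thresh and length > len_mult * baseline_len:
            return step
    return None

# Toy curve: the rows at steps 5500, 5950 and 6021 echo the example above,
# the other rows are invented for illustration.
toy_curve = [
    (5000, 0.20,  850),
    (5500, 0.32, 1050),
    (5800, 0.41, 1210),
    (5950, 0.49, 1320),
    (6021, 0.51, 1382),
]

for sc_thresh, len_mult in [(0.4, 4.0), (0.5, 5.0), (0.6, 6.0)]:
    step = first_aha_step(toy_curve, baseline_len=200,
                          sc_thresh=sc_thresh, len_mult=len_mult)
    print(f"sc > {sc_thresh}, len > {len_mult}x baseline -> first fire at step {step}")
```

On this toy curve the looser setting fires a couple of hundred steps earlier, and the stricter one has not yet fired by the end of the logged range, which is the kind of spread to expect while both curves are still rising.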

grpo_step_with_tracker: note what is UNCHANGED

Compare this function line-by-line with the GRPO step from section 16.1 and the ONLY differences are (a) the new `step` and `tracker` arguments and (b) the `tracker.update()` call in step 3 of the function body. The reward function, advantage computation, KL penalty, clipping, and optimizer step are identical. This is the most important takeaway about the aha moment: it is an EMERGENT PROPERTY of an unchanged training procedure. You do not 'enable' the aha moment by switching on a flag; you observe it by watching scalars while running the same GRPO recipe you would have run anyway.

EXAMPLE

If you remove tracker.update() entirely, the model still develops the aha moment. The tracker is for the human in the loop, not the optimisation loop.

rewards tensor: input to the tracker, NOT shaped by it

We compute rewards from the rule-based scorer, log them to the tracker, then USE them in the GRPO loss. Note the order: tracker.update() is called BEFORE the gradient step (it is step 3 of the function body, ahead of the loss computation). This means the tracker always describes the state of the policy that PRODUCED the current batch, not the state after the update. This matters for interpreting plots: the curves you see at step k describe the policy that was used to sample step k's data, not the policy after step k's optimizer step.

EXAMPLE

rewards = tensor([1.1, 1.1, 0.1, 1.1, 0.0, 1.1, ...])  # K=16 scalars
mean reward = 0.83  -> logged to tracker.rewards.append(0.83) on this step.

A = (rewards - mean) / (std + eps): same group baseline as section 16.1

This single line is where the aha moment is actually CAUSED, even though nothing about it mentions self-verification. The mechanism: at this stage of training, the correct completions in any given group are increasingly the LONG, self-checking ones. The single-pass attempts have plateaued at ~75% accuracy and the residual hard problems can only be solved by working through the calculation twice. So the positive-advantage slot is increasingly populated by self-checking completions, and their tokens — including the 'wait' tokens — get pushed up. The advantage line did not change; the population it operates on did.

EXAMPLE

Step 1500: positive-advantage completions are short single-pass.  Avg length 60.
Step 4500: positive-advantage completions are longer single-pass.   Avg length 680.
Step 6000: positive-advantage completions are now mostly self-check. Avg length 1380. The composition tipped.
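To see the mechanism numerically, the sketch below applies the same normalisation to two hand-made groups. It uses only the standard library; `statistics.stdev` is the sample standard deviation, which matches the default behaviour of `torch.Tensor.std()`. The reward values and the self-check flags are invented for illustration.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-normalised advantages: (r - mean) / (std + eps)."""
    mu = statistics.mean(rewards)
    sd = statistics.stdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]

def self_check_share_of_positive(advantages, has_self_check):
    """Fraction of positive-advantage completions that contain a self-check."""
    positive = [sc for a, sc in zip(advantages, has_self_check) if a > 0]
    return sum(positive) / len(positive) if positive else 0.0

# Early group: the correct completions (reward 1.1) are single-pass.
early_rewards = [1.1, 1.1, 0.1, 0.0]
early_checks  = [False, False, True, False]

# Later group: the correct completions are the self-checking ones.
late_rewards = [1.1, 0.1, 1.1, 0.1]
late_checks  = [True, False, True, False]

for name, r, c in [("early", early_rewards, early_checks),
                   ("late", late_rewards, late_checks)]:
    adv = group_advantages(r)
    share = self_check_share_of_positive(adv, c)
    print(name, [round(a, 2) for a in adv],
          "self-check share of pushed-up completions:", share)
```

The formula is identical in both groups. What differs is which completions land above the group mean, and therefore whose tokens are reinforced.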

log_prob_from_model: exact teacher-forced log-prob, three lines

Standard causal-LM log-prob: take model logits at positions 0..T-2, log-softmax over the vocab axis, gather at the true next-token indices 1..T-1. Returns a (K, T-1) tensor of per-token log-probs. We use this for both the policy (for the policy-gradient term and the numerator of the KL) and the reference model (for the denominator of the KL). The function is pure: same input, same output, no side effects. Putting it in a helper avoids accidental subtle differences between the policy and reference log-prob computation, which has been a silent bug in many from-scratch RL trainers.

EXAMPLE

log_prob_from_model(policy, seqs, attn).shape == (16, T - 1)
# 16 completions × (T - 1) per-token log-probabilities under the policy, where T = seqs.size(1) is the prompt length plus the generated length (at most prompt_len + 4096).


"""
Instrumenting an R1-Zero training loop to catch the aha moment live.

Three things to log every N steps. The DeepSeek-R1-Zero paper plots all
three on the same step axis and the inflection is visible to the eye:

  1. mean_response_tokens     -- average completion length in the batch
  2. self_verification_rate   -- fraction of completions matching the regex
  3. mean_reward / pass@1     -- whether the model is getting the answer right

The model below is the same GRPO-step skeleton from section 16.1; the
new code is the AhaMomentTracker and the tracker.update() call inside
grpo_step_with_tracker(). In a real production run you would push these
three scalars to Weights & Biases or TensorBoard and watch them in real
time over the first 10k-20k steps.
"""

import re
import torch
import torch.nn.functional as F
from collections import deque
from transformers import AutoTokenizer

# ---------------------------------------------------------------------------
# The same regex from the plain-Python version, compiled once.
# ---------------------------------------------------------------------------

SELF_CHECK = re.compile(
    r"\b(wait|hold on|actually|let me (re-?check|verify|reconsider|try again)|"
    r"i (made a mistake|confused myself|got that wrong)|"
    r"on second thought|let me double-check)\b",
    re.IGNORECASE,
)

# ---------------------------------------------------------------------------
# AhaMomentTracker -- a rolling-window monitor over the last W batches.
# Lives outside the model; never appears in the gradient graph. The whole
# point is that the model is NOT optimising for these numbers; we just
# want to SEE them rise.
# ---------------------------------------------------------------------------

class AhaMomentTracker:
    def __init__(self, window: int = 50, tokenizer=None):
        self.window       = window
        self.tokenizer    = tokenizer
        self.lengths      = deque(maxlen=window)   # mean tokens / completion
        self.sc_rates     = deque(maxlen=window)   # self-check rate
        self.rewards      = deque(maxlen=window)   # mean reward
        self.baseline_len = None                   # mean length of the first batch seen
        self.aha_step     = None                   # first step crossing threshold

    def update(self, step: int, texts: list[str], rewards: torch.Tensor):
        # Per-completion length in tokens (tokenizer pass to be exact).
        lens = [len(self.tokenizer.encode(t)) for t in texts]
        # Per-completion self-check flag.
        sc   = [1.0 if SELF_CHECK.search(t) else 0.0 for t in texts]

        self.lengths.append(sum(lens) / len(lens))
        self.sc_rates.append(sum(sc)  / len(sc))
        self.rewards.append(rewards.mean().item())

        # Remember the first batch's mean length. The deque drops old entries,
        # so lengths[0] is only the oldest value still in the window and
        # cannot serve as the step-0 baseline.
        if self.baseline_len is None:
            self.baseline_len = self.lengths[-1]

        # 'Aha' heuristic: self-check rate over the last W batches exceeds
        # 0.5 AND mean response length is more than 5x the step-0 length.
        # The exact thresholds are taste; the inflection is unambiguous in
        # plots even without a hard cutoff.
        if self.aha_step is None and len(self.sc_rates) == self.window:
            mean_sc = sum(self.sc_rates) / self.window
            if mean_sc > 0.5 and self.lengths[-1] > 5 * self.baseline_len:
                self.aha_step = step
                print(f"\n*** AHA MOMENT at step {step} ***")
                print(f"    self-check rate (last {self.window}): {mean_sc:.2f}")
                print(f"    mean response tokens:                 {self.lengths[-1]:.0f}")
                print(f"    mean reward:                          {self.rewards[-1]:.2f}")

    def snapshot(self) -> dict:
        return {
            "len":  self.lengths[-1]  if self.lengths  else 0.0,
            "sc":   self.sc_rates[-1] if self.sc_rates else 0.0,
            "r":    self.rewards[-1]  if self.rewards  else 0.0,
            "aha":  self.aha_step,
        }


# ---------------------------------------------------------------------------
# Modified GRPO step. The only differences from section 16.1 are the new
# `step` and `tracker` arguments and the single tracker.update() call.
# The math, the loss, the optimizer step -- all identical. The aha moment
# is something we OBSERVE in the training logs; it is not produced by
# changing the loss function.
# ---------------------------------------------------------------------------

def grpo_step_with_tracker(
    step,
    policy, ref, tokenizer,
    prompt, gold,
    optimizer,
    tracker,
    K=16, beta=0.04,
):
    # 1. Sample K completions (same as before).
    enc = tokenizer(prompt, return_tensors="pt").to(policy.device)
    with torch.no_grad():
        seqs = policy.generate(
            enc["input_ids"].expand(K, -1),
            max_new_tokens=4096,                # NOTE: 4k is enough for aha-era
            do_sample=True, temperature=1.0, top_p=1.0,
        )
    completions = tokenizer.batch_decode(
        seqs[:, enc["input_ids"].size(1):], skip_special_tokens=True,
    )

    # 2. Rule-based reward (same as before): 1.0 for a correct answer plus
    #    0.1 for well-formed tags; 0.0 if the format does not match at all.
    PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)
    def reward(t):
        m = PATTERN.search(t)
        if m is None: return 0.0
        return (1.0 if m.group(2).strip() == gold else 0.0) + 0.1

    rewards = torch.tensor([reward(t) for t in completions], device=policy.device)

    # 3. *** Log the aha-moment signatures BEFORE the gradient step. ***
    tracker.update(step, completions, rewards)

    # 4. GRPO advantage + loss (same as section 16.1).
    A = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Assumes tokenizer.pad_token_id is set (not None).
    attn = (seqs != tokenizer.pad_token_id).long()
    logp_pi  = log_prob_from_model(policy, seqs, attn)
    logp_ref = log_prob_from_model(ref,    seqs, attn)
    prompt_len = enc["input_ids"].size(1)
    # Score only completion tokens: position t of logp predicts token t+1,
    # so the first completion token sits at index prompt_len - 1.
    mask = torch.zeros_like(logp_pi); mask[:, prompt_len - 1:] = 1.0
    mask = mask * attn[:, 1:]
    pg = -A.unsqueeze(-1) * logp_pi
    kl = logp_pi - logp_ref
    loss = ((pg + beta * kl) * mask).sum() / mask.sum()

    # 5. Backprop and step.
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
    optimizer.step()

    return loss.item(), tracker.snapshot()


def log_prob_from_model(model, seqs, attn):
    # Logits at positions 0..T-2 predict the tokens at positions 1..T-1.
    out    = model(input_ids=seqs, attention_mask=attn).logits[:, :-1, :]
    targets= seqs[:, 1:]
    return F.log_softmax(out, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)

At Massive Scale: Why 671B Has the Moment and 7B Does Not

Every bit of code in this section also runs on a 7B base model. The GRPO loss, the rule-based reward, the tracker — all of it is scale-invariant in the implementation. And yet, on a 7B model trained with the identical recipe, the aha moment never arrives. The response-length curve climbs slowly to maybe 600 tokens and stalls. The self-verification rate stays under 0.15. Pass@1 on hard problems plateaus around 50% and never breaks through. The DeepSeek team verified this directly: they ran R1-Zero-Lite on DeepSeek-V2-Lite-Base (a smaller MoE) and the moment did not happen.

Three reasons, all interacting:

Pass@K floor. R1-Zero only learns about correctness from groups where AT LEAST ONE of the K sampled completions is correct. Otherwise the accuracy term is identical for every completion, the only thing the advantage $A^{(i)}$ can still distinguish is format compliance, and there is no gradient toward the right answer. A 7B base model gets ~5% pass@64 on the AIME-style problems R1-Zero trains on; a 671B base model gets ~25% pass@64. The 671B model has roughly 5× the “learnable” signal per training batch.
Latent self-verification mass. The patterns the aha moment amplifies are already in the base model's distribution — somewhere. In a 671B pretrained on 14.8T tokens, “Wait, let me re-check” has measurable probability mass conditioned on a math context. In a 7B base model the same pattern is buried under noise. RL amplifies what is there; it does not create what is absent.
Context window. A self-verification phase doubles response length. If the policy's context window is 4k tokens and a problem already eats 2k for the prompt and first solution attempt, there is no room for a second attempt. R1-Zero used a 32k context window; smaller-context experiments hit the wall before the moment can happen.
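The “5×” figure in the pass@K point can be sanity-checked with a back-of-the-envelope model. The sketch below assumes every sample on a problem succeeds independently with the same probability, which is a simplification (real pass rates vary widely across problems), and converts the quoted pass@64 figures into the chance that a group of K=16 contains at least one correct completion.

```python
def per_sample_rate(pass_at_n, n):
    """Per-sample success probability implied by pass@n under independence."""
    return 1.0 - (1.0 - pass_at_n) ** (1.0 / n)

def group_has_signal(p, k):
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

K = 16
signal = {}
for label, pass_at_64 in [("7B", 0.05), ("671B", 0.25)]:
    p = per_sample_rate(pass_at_64, 64)
    signal[label] = group_has_signal(p, K)
    print(f"{label}: per-sample p = {p:.5f}, "
          f"P(group of {K} has a correct answer) = {signal[label]:.4f}")

print(f"ratio = {signal['671B'] / signal['7B']:.2f}")
```

Under these assumptions the larger model's groups carry a usable accuracy signal a little over five times as often, which is in line with the figure quoted above.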

The aha moment is therefore not a property of the algorithm. It is a property of the algorithm applied at sufficient scale. This is the same shape of finding as the original scaling-laws results (Chapter 5): capabilities that look discontinuous at the capability-vs-scale frontier are smooth across the right axes, but the right axes include base-model quality, context length, and per-problem pass@K floor.

What this means in practice. If you are reproducing R1-Zero on a smaller open base model and the moment is not happening, the lever is not the RL recipe; the lever is the base model. Either switch to a stronger base, extend the context window, or accept that you will need a small SFT step (as in R1's stage 1 cold-start) to bootstrap the self-verification pattern up to where RL can take over. The R1 paper's “R1” (as opposed to R1-Zero) pipeline is exactly this concession to scale.

Engineering Reality: Real Aha, Fake Aha, and Language Mixing

The aha moment is not pure. In a real run, several pathological shapes also appear around the same window, and an experienced engineer learns to tell them apart.

Real Aha

A real aha moment looks like this: the model emits a candidate answer, follows it with a self-verification phrase, executes an independent verification (different decomposition, different method, unit check, sanity check), and then either confirms or corrects. The verification phase contains concrete arithmetic, not just confidence-flavoured words. The probability of correctness goes up on hard problems. The behaviour generalises to problems the model has never seen.

Fake Aha (“Ritual Verification”)

A fake aha looks like this: the model emits a candidate answer, writes “Wait, let me re-check.”, then writes a verification phase that re-states the same calculation in slightly different words and arrives at the same answer. The verification adds tokens but no information. The probability of correctness is not improved — you can see this by ablating the verification phase from the completion and re-scoring. Ritual verification is what you get when the model has learned the shape of the aha moment without the function. It is reward hacking by cosmetic mimicry.
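One crude way to approximate that ablation offline is to cut each completion at its first self-verification phrase and compare the answer the model had reached before the cut with the answer it finally submitted. The sketch below is an assumption-laden illustration, not the procedure used by DeepSeek: it treats the last number before the phrase as the pre-verification candidate, uses a shortened version of the self-check regex, and the helper names are hypothetical.

```python
import re

SELF_CHECK = re.compile(
    r"\b(wait|hold on|let me (re-?check|verify|double-check))\b", re.IGNORECASE
)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def verification_changed_answer(completion):
    """True/False if the final answer differs from the pre-verification
    candidate; None when there is no self-check phrase, no number before
    it, or no <answer> tag."""
    check = SELF_CHECK.search(completion)
    final = ANSWER.search(completion)
    if check is None or final is None:
        return None
    before = NUMBER.findall(completion[:check.start()])
    if not before:
        return None
    return before[-1] != final.group(1).strip()

def unchanged_rate(completions):
    """Share of self-checking completions whose verification left the answer unchanged."""
    flags = [verification_changed_answer(c) for c in completions]
    flags = [f for f in flags if f is not None]
    return sum(1 for f in flags if not f) / len(flags) if flags else 0.0

# Hand-written samples, for illustration only.
samples = [
    "<think>12 * 7 = 74. Wait, let me re-check: 12 * 7 = 84.</think><answer>84</answer>",
    "<think>3 + 4 = 7. Wait, let me re-check: three plus four is 7.</think><answer>7</answer>",
    "<think>2 + 2 = 4.</think><answer>4</answer>",
]
print([verification_changed_answer(s) for s in samples])
print("unchanged rate:", unchanged_rate(samples))
```

An unchanged answer is also what a genuine confirmation looks like, so this rate is only an upper bound on ritual verification. The stronger test remains whether pass@1 on hard problems moves when the verification phase is present.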

Language-Mixing Aha

The original R1-Zero sometimes wrote its verification phase in a different language from the question: a Chinese math problem got an English self-verification, and an English problem got a Chinese one. This happens because the base model has self-verification mass distributed across languages, and the rule-based reward is language-blind: it does not penalise switching. R1 (the production model) added a small language-match reward to suppress this, but the underlying mechanism is intact: the model uses whichever language has the strongest verification prior.
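A language-match signal of this kind can be sketched as the share of letters in the reasoning text that belong to the target script. The version below is a simplification made for illustration: it only distinguishes ASCII Latin letters from CJK Unified Ideographs (U+4E00 to U+9FFF), and the function names are assumptions rather than the exact reward used in R1.

```python
def script_of(ch):
    """Classify a character as 'latin', 'cjk' or None (digits, punctuation, other)."""
    if "\u4e00" <= ch <= "\u9fff":
        return "cjk"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None

def language_match_score(text, target):
    """Fraction of classified letters in `text` that belong to the `target` script."""
    scripts = [s for s in map(script_of, text) if s is not None]
    if not scripts:
        return 1.0  # nothing to penalise, e.g. pure arithmetic
    return sum(1 for s in scripts if s == target) / len(scripts)

english_only = "Wait, let me re-check: 12 * 7 = 84."
mixed        = "Wait, 让我再检查一下: 12 * 7 = 84."

print(language_match_score(english_only, "latin"))
print(language_match_score(mixed, "latin"))
```

A score like this would be scaled down and added to the rule-based reward, so that switching languages mid-verification costs a little without outweighing the accuracy term.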

Failure mode	What you see in logs	Fix
Ritual verification	Response length and self-verify rate rise, but pass@1 does not.	Add a content-overlap penalty on the verification phase, or filter the dataset for problems where ritual verification cannot solve them.
Language mixing	Verification phase in a different language; readability scores collapse.	Add a language-match reward (used in R1 stage 2); cost is a small accuracy reduction.
Repetition / loops	Self-verification phase repeats the same sentence indefinitely until max_new_tokens.	Repetition penalty in sampling; lower temperature in the verification phase only.
Format hacking	Self-verify phrase appears but inside the wrong tags or after the <answer> tag.	Tighten the format regex; reject completions where the verify phrase appears in the wrong location.
Premature plateau	Self-verify rate rises and then drops back to zero (the model unlearned it).	Lower the learning rate; increase the KL coefficient β; the policy is drifting away from a working solution.
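For the format-hacking row, the location check can be as simple as confirming that every self-verification phrase falls inside the <think> block. A minimal sketch, assuming the same tag layout as the reward regex and a shortened self-check pattern; `self_check_in_right_place` is a hypothetical helper name.

```python
import re

SELF_CHECK = re.compile(
    r"\b(wait|hold on|let me (re-?check|verify|double-check))\b", re.IGNORECASE
)
# Anchored at both ends, so trailing text after </answer> is rejected.
LAYOUT = re.compile(
    r"\A\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*\Z", re.DOTALL
)

def self_check_in_right_place(completion):
    """True only if the layout is exactly <think>...</think><answer>...</answer>
    and every self-check phrase sits inside the <think> block. A completion
    with no self-check phrase passes vacuously."""
    m = LAYOUT.match(completion)
    if m is None:
        return False
    think_start, think_end = m.span(1)
    return all(
        think_start <= hit.start() and hit.end() <= think_end
        for hit in SELF_CHECK.finditer(completion)
    )

good          = "<think>5*5=20. Wait, let me re-check: 25.</think><answer>25</answer>"
after_answer  = "<think>5*5=25.</think><answer>25</answer> Wait, let me re-check."
inside_answer = "<think>5*5=25.</think><answer>Wait, 25</answer>"

for text in (good, after_answer, inside_answer):
    print(self_check_in_right_place(text))
```

Completions that fail the check can be given zero format reward or dropped from the group before advantages are computed; which of the two is preferable is a design choice this sketch does not settle.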

The single most important practical takeaway is that the aha moment is necessary but not sufficient for a good reasoning model. A model that has had its aha moment but whose self-verification is ritualistic is worse than a model that has not had its moment at all — the long completions cost compute at inference time, the user experience is bad, and the accuracy gain that was supposed to justify the length is absent. Section 16.5 covers the failure modes of R1-Zero in depth, and Chapter 17 shows how R1 productionises the aha-moment behaviour without the pathologies. But the moment itself — the first time the model writes “Wait, let me re-check” without ever being shown those words by a teacher — is the single most consequential observation in post-2024 LLM training.

The bigger lesson. The aha moment is not about self-verification per se. It is the proof that any behaviour latent in a large base model can be amplified by a verifiable reward, with no supervised examples of that behaviour anywhere in the pipeline. Self-verification was the first behaviour we caught emerging this way. It will not be the last.