Boo-AI — Master Artificial Intelligence by Building from Scratch

The previous two sections gave us a precision instrument. Chinchilla scaling laws (§9.1) tell us, to within a few percent, what loss a model of size $N$ trained on $D$ tokens will reach. The MoE scaling laws (§9.2) refine the prediction for sparse models. Loss is predictable. Capabilities are not.

The thesis of this section. Frontier model training is haunted by a strange asymmetry: validation loss is a smooth power law, but the benchmarks the public actually cares about — multi-digit arithmetic, MMLU, code generation, chain-of-thought reasoning — improve in flat-then-sudden jumps. Researchers call this emergence. Whether emergence is a real phase transition or an artefact of the way we grade tasks is one of the most consequential debates in modern AI, because it decides whether a $50M training run is a gamble or an extrapolation.

The Real Problem: Loss Is Smooth, Capabilities Are Not

Chinchilla gave us this beautiful equation: $L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$ . Plug in any model size and any token count and the validation loss is nailed down. We can plan a training run the way an aerospace engineer plans a launch — pick the budget, pick the trajectory, predict the outcome. Then we evaluate the trained model on the real benchmarks and the prediction breaks.
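As a minimal sketch of what "plug in any model size and any token count" looks like, the function below evaluates the parametric form above. The default constants are the fitted values reported in the Chinchilla paper (Hoffmann et al., 2022) for their data and tokenizer; they are not derived in this section and should be treated as an assumption, since another corpus would need its own fit. The name `chinchilla_loss` is illustrative.

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta.

    Defaults are the published Chinchilla fit (an assumption here);
    refit them for any other dataset or tokenizer.
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta


# Three hypothetical runs at roughly 20 tokens per parameter.
for n_params, n_tokens in [(1e9, 2e10), (7e9, 1.4e11), (7e10, 1.4e12)]:
    loss = chinchilla_loss(n_params, n_tokens)
    print(f"N={n_params:.0e}  D={n_tokens:.0e}  predicted loss={loss:.3f}")
```

The predicted loss falls monotonically as both arguments grow and never drops below the irreducible term `E`, which is the smooth behaviour the rest of this section contrasts with benchmark accuracy.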

Wei et al. (2022) catalogued this directly. They ran every public model in the 10M–540B parameter range across 200+ tasks from BIG-Bench and plotted accuracy against training compute. Two patterns appeared.

Smooth tasks. Token-level perplexity, simple completion, surface-level pattern matching — accuracy rises as a power law in compute, perfectly predictable from a small model.
Emergent tasks. 3-digit addition, modular arithmetic, transliteration, MMLU, chain-of-thought reasoning — accuracy is pinned at ~random for many decades of compute, then jumps to near-ceiling over less than one decade. From a 6B model you have no signal at all; from a 70B model the task is solved.

This is uncomfortable. The whole point of a scaling law is that you can train a small model, fit the curve, and extrapolate. If the capability you care about is invisible until you cross a threshold, the extrapolation is a coin flip. You spend $50M on a 671B run and either MMLU jumps from 35% to 87% — or it sits at 41% and your release report has nothing to say.

Task	First model that solved it	Effective compute
3-digit addition (exact match)	GPT-3 13B (Brown et al., 2020)	~ 3 × 10²² FLOPs
MMLU (above 25% random)	Gopher 280B / PaLM 62B	~ 10²³ FLOPs
Chain-of-thought wins over standard prompting	LaMDA 137B / PaLM 540B	~ 10²³ – 10²⁴ FLOPs
Instruction following (zero-shot, no SFT)	GPT-3 175B	~ 3 × 10²³ FLOPs
HumanEval pass@1 above 20%	Codex 12B / GPT-4	~ 10²³ – 10²⁵ FLOPs

Why this matters at frontier scale. If you cannot predict from a 7B model whether a 670B model will solve MMLU, then the decision to train the 670B model is a research bet, not an engineering plan. Every frontier lab has lost training runs to capabilities that failed to emerge — and won runs by riding a capability that emerged a decade of compute earlier than expected. Understanding the shape of the curve is the difference between betting blind and reading a weather map.

Intuition: A Chain Is Only As Strong As Its Weakest Token

The cleanest mental model for emergence is the chain. Most useful tasks are not single-token decisions — they are sequences. To add two 4-digit numbers, the model must emit one digit at a time with carries propagated correctly. To answer an MMLU question correctly, it must emit the letter of the right option. To write a working Python function, it must emit dozens of tokens in the right order. To reason through a chain-of-thought, it must take ten or twenty correct steps before the final answer.

Now suppose the model gets each individual token right with probability $p$ . If the task requires $k$ tokens in a row, all correct, and we grade with exact-match, the task accuracy is approximately $p^k$ . The shape of $p^k$ as a function of $p$ is what creates emergence.

p (per-token)	p² (k=2)	p⁸ (k=8)	p²⁰ (k=20)
0.50	0.25	0.004	0.0000010
0.75	0.56	0.10	0.0032
0.90	0.81	0.43	0.12
0.95	0.90	0.66	0.36
0.99	0.98	0.92	0.82

Read down the $p^{20}$ column. As per-token accuracy crawls from 0.75 to 0.99 — a smooth, gradual improvement — the 20-step task accuracy explodes from 0.3% to 82%. The capability was always rising; the exact-match grader was hiding it behind a wall.
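The table above can be regenerated in a few lines. This is a sketch under the same simplifying assumption as the text (every step is independently correct with the same probability); `chain_accuracy` is an illustrative name.

```python
def chain_accuracy(p, k):
    """Exact-match accuracy of a k-step chain whose steps are independently
    correct with probability p."""
    return p ** k


chain_lengths = [2, 8, 20]
print("p     " + "".join(f"{'k=' + str(k):>12}" for k in chain_lengths))
for p in [0.50, 0.75, 0.90, 0.95, 0.99]:
    cells = "".join(f"{chain_accuracy(p, k):12.7f}" for k in chain_lengths)
    print(f"{p:<6.2f}{cells}")
```

Adding a longer chain length to `chain_lengths` (say 50) shows how quickly the flat region widens as the chain grows.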

This is the analogy that holds the whole concept together: a chain is only as strong as its weakest link, and a chain of probabilities is weaker still, because its accuracy is the product of every per-step probability and so sits below even the worst single step. Improve every link a little and the chain stays broken. Improve every link a lot and the chain suddenly holds the weight.

The water analogy. Heat a beaker of water from 20°C to 99°C and nothing visible happens — it is still a liquid. Heat another degree and it boils. The underlying physics was changing smoothly the whole time; the binary observable (liquid vs gas) crossed a threshold. Emergence in LLMs is a phase transition in the observable, not in the underlying quantity. The token-level log-probability was rising smoothly across the entire compute range.

The Mathematics of a Sudden Jump

Make the intuition formal. Let $p(C)$ be the probability that the model emits the correct next token at any individual answer position, where $C$ is the training compute. From the Chinchilla loss law we know $p(C)$ rises smoothly and monotonically with $C$ . A simple, useful surrogate is the sigmoid:

$p(C) = \sigma\big(\,s \cdot (\log_{10} C - C_{0})\,\big) = \frac{1}{1 + e^{-s\,(\log_{10} C - C_{0})}}$

Here $C_0$ is the log-compute (in $\log_{10}$ FLOPs) at which the per-token probability is exactly 0.5, and $s$ sets the steepness per decade of compute. Empirically, fitting this to real benchmarks gives $s \approx 1\text{–}2$ and $C_0$ somewhere between 21 and 23 (that is, $10^{21}$ to $10^{23}$ FLOPs) depending on the task.

Now grade the task with exact-match over a chain of $k$ steps:

$A_{\text{exact}}(C, k) = p(C)^{k}$

Take the derivative with respect to log-compute to see how steep the rise is at the midpoint:

$\frac{d A_{\text{exact}}}{d \log_{10} C} = k \cdot p(C)^{k-1} \cdot \frac{d p}{d \log_{10} C}$

Two things to notice. First, the slope of the exact-match curve is $k$ times the slope of the per-token curve at any point where $p$ is close to 1 — so longer chains have steeper emergence elbows. Second, when $p$ is small the slope is vanishingly small because $p^{k-1}$ kills it — which is why the flat-at-zero plateau before emergence is so wide.
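A short worked step, valid only under the sigmoid surrogate above, makes both observations concrete. Writing $x = \log_{10} C$, the sigmoid satisfies $\frac{dp}{dx} = s\,p\,(1-p)$, so the slope of the exact-match curve becomes:

$\frac{d A_{\text{exact}}}{d x} = k \cdot p^{k-1} \cdot s\,p\,(1-p) = k\,s\,p^{k}\,(1-p)$

The factor $p^{k}(1-p)$ peaks at $p^{*} = \frac{k}{k+1}$. Solving $\sigma\big(s\,(x - C_{0})\big) = \frac{k}{k+1}$ for $x$ gives the location of the steepest rise:

$x^{*} = C_{0} + \frac{\ln k}{s}$

In this toy model the steepest part of the exact-match curve therefore sits $\ln k / s$ decades to the right of the per-token midpoint: about 0.9 decades for $k = 4$ and about 2.0 decades for $k = 20$ when $s = 1.5$. These figures follow from the surrogate, not from any measured benchmark.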

Now grade the same model with token-level log-likelihood:

$L_{\text{token}}(C) = \log p(C)$

This is monotone and continuous in $p$ , hence smooth and continuous in $\log_{10} C$ . The same model, the same predictions — but the curve is a power law, not a phase transition. Schaeffer et al. (2023) argued from exactly this observation that emergence is partly an artefact of the metric: switch from exact-match to token-level scoring and the discontinuity dissolves into the smooth Chinchilla curve.

Real vs apparent emergence

Schaeffer's case is sharp but not the full story. Two things can both be true:

Some emergence is metric-driven. Multi-digit arithmetic, exact-match MMLU, code-pass@1 — for these, the model is monotonically improving its log-likelihood the whole time, and the jump on the headline metric is a thresholding artefact. Switch to per-token NLL and the curve smooths out.
Some emergence is real. In-context learning, chain-of-thought benefit, instruction following from raw pretraining, and tool use show genuine qualitative shifts that no smooth rescaling of the metric removes. Below the threshold the model ignores the prompt structure; above it, it follows. The phase transition is in the model's strategy, not in the grader.

The practical consequence is the same in both cases: you cannot extrapolate the leaderboard number from a small model. You can extrapolate the per-token NLL — and the NLL is what actually drives the leaderboard. Plot both.
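Under the chain model's independence assumption, the link between the two metrics is a single identity: since $A_{\text{exact}} = p^{k}$ and the per-token NLL is $-\log p$, exact-match accuracy equals $\exp(-k \cdot \text{NLL})$. The sketch below applies that identity. `exact_match_from_nll` is an illustrative name, and real tasks with correlated errors or unequal token difficulty will deviate from it.

```python
import math


def exact_match_from_nll(avg_token_nll, k):
    """Exact-match accuracy implied by a per-token NLL (in nats) over a
    k-token answer, assuming equally hard, independent tokens."""
    return math.exp(-k * avg_token_nll)


for nll in [3.0, 1.0, 0.3, 0.1, 0.03]:
    em4 = exact_match_from_nll(nll, 4)
    em20 = exact_match_from_nll(nll, 20)
    print(f"NLL={nll:5.2f}  exact@k=4: {em4:.4f}  exact@k=20: {em20:.4f}")
```

A smooth, steady fall in NLL maps to a long flat stretch followed by a rapid climb in the exact-match columns, which is the reason to plot both.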

Manual Numerical Walkthrough

Walk emergence through a concrete example: a model learning to add two 4-digit numbers. The task emits 4 answer tokens (the four digits of the sum), so $k = 4$ . We will assume the model has identical per-token accuracy at every position — not exactly true in practice (the carry digits are harder), but enough to expose the mechanism.

4-digit addition across eight model scales

Set-up. Eight checkpoints from a Chinchilla-sized run, spanning $10^{20}$ to $10^{25}$ FLOPs and spaced half a decade to a decade apart. We assume per-token accuracy on a digit follows $p(C) = \sigma(1.5 \,(\log_{10} C - 22))$. The task is 4-digit addition, exact-match over four digits.

Step 1 — per-token accuracy. Compute $p(C)$ at each scale:

10²⁰ FLOPs: σ(−3) = 0.047 ≈ 5%
10²¹ FLOPs: σ(−1.5) = 0.182 ≈ 18%
10²² FLOPs: σ(0) = 0.500 ≈ 50%
10^22.5 (≈ 3×10²²) FLOPs: σ(0.75) = 0.679
10²³ FLOPs: σ(1.5) = 0.818
10^23.5 (≈ 3×10²³) FLOPs: σ(2.25) = 0.905
10²⁴ FLOPs: σ(3.0) = 0.953
10²⁵ FLOPs: σ(4.5) = 0.989

Step 2 — exact-match accuracy. Raise to the 4th power (one factor per answer digit):

10²⁰: 0.047⁴ ≈ 0.0000049 (effectively 0)
10²¹: 0.182⁴ ≈ 0.0011 (still ~0 on a benchmark of hundreds)
10²²: 0.500⁴ = 0.0625 (first measurable signal)
3×10²²: 0.679⁴ ≈ 0.213
10²³: 0.818⁴ ≈ 0.448
3×10²³: 0.905⁴ ≈ 0.671
10²⁴: 0.953⁴ ≈ 0.825
10²⁵: 0.989⁴ ≈ 0.957

Step 3 — token log-likelihood. Take the natural log of the Step 1 values:

10²⁰: log 0.047 = −3.06
10²¹: log 0.182 = −1.70
10²²: log 0.500 = −0.69
3×10²²: log 0.679 = −0.39
10²³: log 0.818 = −0.20
3×10²³: log 0.905 = −0.10
10²⁴: log 0.953 = −0.048
10²⁵: log 0.989 = −0.011

Step 4 — compare the three columns. Per-token accuracy rises by a factor of 21× across the range (0.047 → 0.989). Token NLL falls smoothly by a factor of about 280× (3.06 → 0.011 nats), also continuous, with no jumps. Exact-match accuracy rises by a factor of 195,000× (5×10⁻⁶ → 0.957), and the bulk of that rise is concentrated in two decades of compute (10²² to 10²⁴). Plotted on a linear y-axis, the first two rows are visually indistinguishable from zero and the third barely lifts off the floor; the line then goes from near the floor to almost at ceiling over the next four checkpoints. That is the emergence elbow.

Step 5 — chain length sensitivity. Redo Step 2 with $k = 20$ (a chain-of-thought reasoning task with twenty inference steps). At p = 0.679 the task accuracy is 0.679²⁰ ≈ 0.00043, still invisible. At p = 0.953 it is 0.953²⁰ ≈ 0.382. At p = 0.989 it is 0.989²⁰ ≈ 0.802. The emergence elbow has shifted roughly one decade of compute to the right, and is steeper. Longer chains create later and sharper emergence. In this model, that is why chain-of-thought reasoning emerged at a much higher scale than basic arithmetic: it has a longer chain.
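The five steps above can be reproduced with a short script. This is a sketch that uses the same assumed sigmoid (midpoint 22, steepness 1.5) and the same eight checkpoints; because it keeps the unrounded per-token probability, its numbers can differ from the hand calculation in the third decimal place.

```python
import math


def per_token_accuracy(log10_compute, mid=22.0, sharp=1.5):
    """Assumed sigmoid surrogate for per-token accuracy."""
    return 1.0 / (1.0 + math.exp(-sharp * (log10_compute - mid)))


# log10 FLOPs of the eight checkpoints in the walkthrough.
checkpoints = [20.0, 21.0, 22.0, 22.5, 23.0, 23.5, 24.0, 25.0]

rows = []
for x in checkpoints:
    p = per_token_accuracy(x)
    # (log-compute, per-token acc, exact@k=4, exact@k=20, token log-lik.)
    rows.append((x, p, p ** 4, p ** 20, math.log(p)))

print(f"{'log10 C':>7} | {'p_token':>7} | {'exact@4':>9} | {'exact@20':>9} | {'log p':>7}")
for x, p, em4, em20, ll in rows:
    print(f"{x:>7.1f} | {p:>7.3f} | {em4:>9.6f} | {em20:>9.6f} | {ll:>7.3f}")
```

Reading the `exact@4` and `exact@20` columns side by side shows the longer chain leaving the floor about a decade later.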

Visualizing Emergence

The sandbox below lets you steer all three knobs: chain length $k$ , the log-compute mid-point $C_0$ , and the per-token steepness $s$ . The dashed green line is the underlying per-token accuracy $p(C)$ — a smooth sigmoid in log-compute. The solid red line is the task-level metric you chose: exact-match (which is $p^k$ , the curve that looks emergent) or token-level (which is just $p$ ). Drag the chain-length slider from 1 to 20 and watch the elbow move right and steepen — that is the geometry of emergence.

[Interactive figure: emergence sandbox with controls for $k$, $C_0$, $s$, and the metric]

Three things to internalise from the sandbox. First, at $k = 1$ the red and green curves are identical — there is no emergence on single-token tasks. Second, raising $k$ pushes the red elbow to the right and makes it steeper, all without touching the underlying per-token curve. The task got harder to grade, not harder to learn. Third, switch the metric to "Token log-likelihood" and the red curve becomes a smooth sigmoid for any $k$ — because $\log p$ is what loss-based scaling laws actually predict. The emergence disappeared because the metric stopped throwing away information.
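Without the interactive sandbox, the same geometry can be checked numerically. The sketch below inverts the assumed sigmoid to find the log-compute at which exact-match accuracy $p^k$ crosses a target level (50% by default). `elbow_log10_compute` is an illustrative helper, and the defaults are this section's toy values, not fitted ones.

```python
import math


def per_token_accuracy(log10_compute, mid=22.0, sharp=1.5):
    """Assumed sigmoid surrogate for per-token accuracy."""
    return 1.0 / (1.0 + math.exp(-sharp * (log10_compute - mid)))


def elbow_log10_compute(k, mid=22.0, sharp=1.5, target=0.5):
    """log10 compute at which exact-match p**k reaches `target`."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    p_needed = target ** (1.0 / k)             # per-token accuracy required
    logit = math.log(p_needed / (1.0 - p_needed))
    return mid + logit / sharp                 # invert the sigmoid


for k in [1, 2, 4, 8, 20]:
    x = elbow_log10_compute(k)
    print(f"k={k:>2}: exact-match crosses 50% near 10^{x:.2f} FLOPs")
```

At `k = 1` the crossing sits exactly at the per-token midpoint, and it moves right as `k` grows even though `per_token_accuracy` never changes.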

Plain Python: Simulating an Emergent Curve

The following script is the simplest possible emergence simulator — no neural network, no GPU, just a sigmoid and a Bernoulli draw. It proves the entire phenomenon needs nothing more than two ingredients: a smoothly-improving per-token probability, and an exact-match grader over a chain of length $k$ .

emergence_simulator.py

Notes on the script (the code itself follows):

Per-token capability is a smooth sigmoid of log-compute (`per_token_accuracy`)
Cross-entropy loss falls predictably with compute (Chinchilla-style power law). Translated through softmax, the probability that a single next-token prediction is correct is a smooth, monotone function of log-compute, with no kinks. We model it as a sigmoid centred at `mid` (in log10 FLOPs) with steepness `sharp`. This is the underlying quantity that scales smoothly.
Values: mid = 22.0 (log10 FLOPs); sharp = 1.5 (sigmoid steepness per decade of compute).

A task requires k correct tokens in a row (`evaluate`)
Multi-step arithmetic, multi-hop QA, code generation, chain-of-thought reasoning: all share this structure. The grader accepts the answer only if every emitted token is right. Real tasks are not strictly Bernoulli (errors are correlated), but this independence assumption is the cleanest way to expose the mechanism behind emergence.
Values: k = 8 (steps per task); n_eval = 1000 by default (the sweep at the bottom of the script passes n_eval=5000).

Simulate the per-token Bernoulli draws
For each of n_eval simulated tasks, draw k Bernoulli(p) trials. The task is graded correct only if all k draws came up 1. This mirrors the all-or-nothing grading of benchmarks such as GSM8K, HumanEval, and MMLU multiple-choice. It is the metric the public sees on leaderboards.

Exact-match accuracy estimates p^k (`exact_acc`)
After thousands of trials this Monte Carlo estimate converges to p^k, the product of per-token correctness over the k-step chain. p = 0.5 gives p^8 ≈ 0.004: a 50%-accurate model solves 0.4% of 8-step tasks. p = 0.95 gives p^8 ≈ 0.66. The jump from "unmeasurable" to "usable" happens over a small slice of compute.
Values: exact_acc (p=0.5, k=8) ≈ 0.004; exact_acc (p=0.95, k=8) ≈ 0.66.

Token log-likelihood is the smooth side of the same coin (`token_ll`)
log p(correct) is the per-token log-likelihood, exactly the quantity cross-entropy training minimises (with a sign flip). Aggregated over a corpus this is the validation loss. It rises smoothly and monotonically with compute, because log is a continuous function of p. Same model, same predictions, different metric, and the curve flips from sharp to smooth.
Values: token_ll (p=0.5) = log 0.5 ≈ -0.69; token_ll (p=0.95) = log 0.95 ≈ -0.05.

Sweep compute and tabulate both metrics (`scales`)
Eight points spanning seven decades of compute (10^18 to 10^25 FLOPs, from a small undergrad project to a frontier 2025 lab). At each point we record p_token, exact-match accuracy, and token log-likelihood. The printout shows the same model evaluated three ways, and tells two completely different stories about "capability vs scale".

What the printed table reveals
Read the columns. With sharp = 1.5, p_token rises smoothly: 0.002 → 0.50 → 0.99 across the range. token_ll rises smoothly: -6.0 → -0.69 → -0.01. But exact@k=8 stays below 0.01 for the first five rows, then climbs to about 0.20, 0.68, and 0.92 in the last three. That is emergence: a flat-then-sharp curve in the exact-match column generated by a smooth curve in the underlying probabilities.
Values: exact_acc @ 10^20 ≈ 0.000 (looks impossible); exact_acc @ 10^23 ≈ 0.20 (just visible); exact_acc @ 10^24 ≈ 0.68 (now solvable); exact_acc @ 10^25 ≈ 0.92 (approaching saturation).

```python
import math
import random

# Toy model of "emergence":
#  - per-token correctness rises smoothly with scale (sigmoid of log-compute)
#  - a TASK requires k correct tokens in a row to be graded "correct"
#  - we evaluate two metrics on the same model:
#      (a) exact-match accuracy:  all k tokens must be right
#      (b) token log-likelihood:  average log p(correct) per token

def per_token_accuracy(log_compute, mid=22.0, sharp=1.5):
    # Smooth sigmoid in log-compute. mid is where p_token = 0.5.
    return 1.0 / (1.0 + math.exp(-sharp * (log_compute - mid)))

def evaluate(log_compute, k=8, n_eval=1000, seed=0):
    rng = random.Random(seed)
    p = per_token_accuracy(log_compute)

    # (a) Exact-match: simulate k Bernoulli draws per task, all must be 1.
    exact_hits = 0
    for _ in range(n_eval):
        all_right = all(rng.random() < p for _ in range(k))
        if all_right:
            exact_hits += 1
    exact_acc = exact_hits / n_eval

    # (b) Token log-likelihood (per token, averaged).
    #     log p(correct) is what cross-entropy training actually measures.
    token_ll = math.log(max(p, 1e-12))
    return exact_acc, token_ll

# Sweep a wide compute range and see the two metrics diverge.
scales = [18, 19, 20, 21, 22, 23, 24, 25]   # log10 FLOPs
print(f"{'log10 FLOPs':>11} | {'p_token':>7} | {'exact@k=8':>9} | {'token_ll':>9}")
for s in scales:
    exact, ll = evaluate(s, k=8, n_eval=5000)
    p = per_token_accuracy(s)
    print(f"{s:>11} | {p:>7.3f} | {exact:>9.3f} | {ll:>9.3f}")
```

Run it. The output is a table that shows the same model evaluated three ways. Per-token probability and token-level log-likelihood both rise smoothly with log-compute. Exact-match accuracy stays below 1% for the first five rows, then climbs to roughly 0.20, 0.68, and 0.92 over the last three. Both stories are right, since they describe the same model, but only one of them shows up on a leaderboard.

Sanity check yourself. Set k = 1 and rerun. The exact-match column should now match the per-token column to within Monte-Carlo noise. If it does, you have confirmed that the emergence in the k = 8 case was entirely due to chain length and the exact-match grader — there is nothing else in the simulator.

PyTorch: Two Metrics, Same Model, Different Story

In a real evaluation harness you have logits over a 100k-token vocabulary, not Bernoulli probabilities. The right thing to do is to compute both the exact-match number and the token-level NLL from the SAME forward pass. The cost is dominated by the forward, not the metric, so this is a free win.

emergence_eval.py

Notes on the harness (the code itself follows):

One harness, two metrics, every checkpoint (`evaluate_checkpoint`)
This is the function frontier labs run thousands of times across a training run: once per saved checkpoint, on every eval set. The cost is dominated by the forward pass, not the metric computation, so collecting both metrics costs the same as collecting one. Skipping the NLL number is the single most common mistake in capability evals: without it you cannot tell whether a flat exact-match curve means "no learning" or "learning that has not yet crossed the threshold".

Inputs are masked so we score only the answer
Most eval prompts have a few hundred tokens of context (the question, the few-shot examples) followed by a short answer span. `answer_mask` is a boolean [B, T] tensor that is True only on the answer tokens. Scoring NLL or exact-match over the prompt tokens would dilute the signal: the model didn't write the prompt, so it doesn't get credit for predicting it.
Shapes: input_ids = [B, T]; answer_mask = [B, T], bool.

Forward pass produces logits over the vocabulary
Standard causal LM head: per-position logits of shape [B, T, V] where V is the vocabulary size (~128k for DeepSeek-V3 and Llama-3, ~32k for Llama-2). This is the raw output before softmax. Everything downstream is post-processing of this tensor.
Shapes: logits = [B, T, V].

Token-level NLL on the answer span
Compute log p(target | context) at every answer-token position, sum over the masked positions, and divide by the number of masked tokens at the end. This is the smooth signal: it changes by tiny amounts between adjacent checkpoints, which is exactly what makes it a good scaling-law target. Frontier labs fit power laws to this number across a Chinchilla sweep and extrapolate.
Shapes: log_probs = [B, T, V]; gold_lp = [B, T].

Argmax + all-right gives the leaderboard number
For exact-match we take the model's top-1 token at every answer position and AND together the per-position correctness across the answer span. `(pred == target) | (~answer_mask)` turns prompt tokens into "don't care" so they cannot break the per-task verdict. The resulting per-task boolean is averaged to give the headline accuracy.
Shapes: pred = [B, T]; all_right = [B], bool.

Return both numbers from the same forward pass
Returning a dict (not just the leaderboard number) is what lets the training-monitoring stack draw two curves: exact_match for the product team, avg_token_nll for the research team. The two together tell you (a) where the model is in its capability arc, and (b) how much more compute will move it. Either alone is misleading.
Values: exact_match_acc ranges over 0.0 → 1.0; avg_token_nll ranges over roughly 0.0 → 6.0 (smaller is better).

Run on every checkpoint, not just the final one
Every saved checkpoint goes through this harness. The result is a per-checkpoint trace of (compute, exact_match, nll) for every benchmark. The NLL trace lets you fit a scaling law per benchmark; the exact_match trace lets you spot which benchmarks emerge and at what compute. DeepSeek-V3 reports running this on ~80 benchmarks across ~50 intermediate checkpoints during pretraining.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Evaluation harness used by every frontier lab: run a model checkpoint over
# an eval suite and record BOTH metrics. The per-task exact-match number is
# what goes on the leaderboard; the per-token NLL is what predicts scaling.

@torch.no_grad()
def evaluate_checkpoint(model: nn.Module,
                        eval_loader,
                        device: str = "cuda"):
    model.eval()
    n_tokens = 0
    total_nll = 0.0          # token-level cross-entropy
    n_tasks = 0
    n_exact = 0              # tasks where EVERY answer token was top-1

    for batch in eval_loader:
        input_ids = batch["input_ids"].to(device)        # [B, T]
        answer_mask = batch["answer_mask"].to(device)    # [B, T] bool
        target = batch["target_ids"].to(device)          # [B, T]

        logits = model(input_ids).logits                 # [B, T, V]

        # ---- token-level NLL on the answer span ----
        log_probs = F.log_softmax(logits, dim=-1)        # [B, T, V]
        gold_lp = log_probs.gather(
            -1, target.unsqueeze(-1)
        ).squeeze(-1)                                    # [B, T]
        token_nll = -(gold_lp * answer_mask).sum()
        n_tokens += answer_mask.sum().item()
        total_nll += token_nll.item()

        # ---- exact-match: argmax must equal gold at every answer token ----
        pred = logits.argmax(dim=-1)                     # [B, T]
        per_token_hit = (pred == target) | (~answer_mask)
        all_right = per_token_hit.all(dim=-1)            # [B]
        n_tasks += input_ids.size(0)
        n_exact += all_right.sum().item()

    return {
        "exact_match_acc": n_exact / n_tasks,
        "avg_token_nll":   total_nll / max(n_tokens, 1),
    }

# Usage: same harness, called on every intermediate checkpoint.
# The NLL curve is the scaling-law signal; the exact-match curve is the
# user-visible capability signal. Plot both. They tell different stories.
for ckpt in checkpoints:                                  # noqa: F821
    model.load_state_dict(torch.load(ckpt))                # noqa: F821
    metrics = evaluate_checkpoint(model, eval_loader)      # noqa: F821
    log_to_wandb(ckpt, metrics)                            # noqa: F821
```

Two subtleties worth marking, both about how this harness feeds back into training decisions:

Score only the answer span. Without answer_mask, both metrics get diluted by hundreds of easy prompt tokens that the model did not have to produce. A model that emits the eval prompt back to you with 99.9% per-token accuracy looks "great" on naive metrics — and tells you nothing about its capabilities. The mask is the difference between a useful eval and a vanity number.
Return both numbers; let the consumer decide. The two metrics tell different stories at different stages of training. Early in pretraining, NLL is the only signal that moves — exact-match will sit at zero on every hard benchmark for the first half of the run. The training-monitoring stack needs NLL to know the model is learning at all. Late in pretraining, exact-match starts to twitch and finally jumps — that is the signal the product team waits for. Returning only one of the two will blind half the audience.
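The dilution described in the first point is easy to see with plain numbers. The sketch below is a toy, library-free illustration: the probabilities are made up (200 prompt tokens predicted at 99.9%, four answer tokens at 30%), and `mean_nll` is a hypothetical helper rather than part of any harness.

```python
import math


def mean_nll(gold_probs, mask=None):
    """Average negative log-likelihood over positions where mask is True
    (over every position when mask is None)."""
    if mask is None:
        mask = [True] * len(gold_probs)
    selected = [-math.log(p) for p, keep in zip(gold_probs, mask) if keep]
    return sum(selected) / max(len(selected), 1)


# Made-up example: probability the model assigns to the gold token.
prompt_probs = [0.999] * 200          # easy tokens the model did not write
answer_probs = [0.30] * 4             # the tokens we actually care about
gold_probs = prompt_probs + answer_probs
answer_mask = [False] * 200 + [True] * 4

print(f"unmasked NLL:    {mean_nll(gold_probs):.3f}")
print(f"answer-only NLL: {mean_nll(gold_probs, answer_mask):.3f}")
```

The unmasked average is dominated by the prompt tokens and looks excellent, while the answer-only number exposes how uncertain the model really is on the span being graded.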

Implementation note on partial-credit metrics. Some benchmarks (MATH, BIG-Bench Hard) use answer-equivalence checks (final numeric value matches, modulo formatting) rather than literal token-by-token equality. These behave halfway between exact-match and token-NLL: they tolerate the model getting the wrong intermediate tokens as long as the final answer is right. On these metrics emergence is milder but still visible — the curve elbow is real but less dramatic. If you have a choice of grader, prefer the answer-equivalence one. It throws away less of the signal.
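To make "answer-equivalence" concrete, here is a minimal sketch of a numeric grader that ignores formatting differences. It is an illustration only: `normalize_numeric` and `answers_equivalent` are hypothetical names, and the normalisation rules (drop commas, dollar signs, and a trailing period) are assumptions, far simpler than what benchmarks such as MATH actually use.

```python
from fractions import Fraction


def normalize_numeric(text):
    """Parse a numeric answer string into an exact Fraction,
    or return None if it is not numeric."""
    cleaned = text.strip().replace(",", "").replace("$", "").rstrip(".")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None


def answers_equivalent(predicted, gold):
    """True if both answers denote the same number; falls back to a
    whitespace-stripped string comparison for non-numeric answers."""
    p, g = normalize_numeric(predicted), normalize_numeric(gold)
    if p is None or g is None:
        return predicted.strip() == gold.strip()
    return p == g


print(answers_equivalent("1,000", "1000.0"))   # same value, different format
print(answers_equivalent("0.5", "1/2"))        # decimal vs fraction
print(answers_equivalent("41", "42"))          # genuinely different answers
```

A literal token-by-token grader would reject the first two pairs, which is the signal this kind of grader keeps.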

At Massive Scale: Planning for Capabilities You Cannot Predict

Emergence reframes the central question of frontier model planning. Chinchilla tells you the validation loss; emergence tells you why validation loss is not enough. The actual planning protocol used at DeepSeek, Anthropic, and Google DeepMind looks roughly like this:

Decision	Driven by	Why this metric
Model size N and tokens D	Validation loss + Chinchilla fit	Loss is smooth and extrapolable from small models; loss minimisation under a compute constraint is well-defined.
Per-benchmark training-time monitoring	Token-level NLL per benchmark	NLL moves on every checkpoint, even when exact-match is pinned at zero. It is the only early signal that a benchmark is starting to be learnt.
Final go/no-go on a benchmark	Exact-match (or answer-equivalence)	This is what the leaderboard reports. The release report has to use the public metric, and the public metric is exact-match.
Emergent-capability bets (CoT, tool use, ICL)	Empirical extrapolation across model sizes 1B → 7B → 70B	These genuinely jump and no smooth surrogate predicts them. The only honest planning protocol is multi-scale: train a ladder of sizes and watch when the jump fires.

Three quantitative facts shape the planning protocol. First, the per-benchmark NLL curve is a power law in compute; it can be fit from the first 10% of a training run and extrapolated to the end with < 5% relative error. Second, the exact-match curve cannot be extrapolated from below the elbow — every published attempt to do so has been worse than chance. Third, the location of the elbow is roughly predictable from chain length: tasks with 1–3 reasoning steps emerge near $10^{22}$ FLOPs, 5–10-step tasks near $10^{23}$ , and 20+-step tasks like multi-hop reasoning emerge above $10^{24}$ . DeepSeek-V3 sits at $3.4 \times 10^{24}$ FLOPs, which is precisely the compute where ~20-step CoT reasoning becomes reliable.
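The first of these facts, fitting the NLL curve early and extrapolating, can be sketched with an ordinary least-squares fit in log-log space. Everything here is an assumption for illustration: the data are synthetic and noise-free, generated from a pure power law with no irreducible floor, and `fit_power_law` and `predict_nll` are hypothetical helper names.

```python
import math


def fit_power_law(log10_compute, nll):
    """Least-squares fit of log10(nll) = intercept + slope * log10(compute)."""
    xs = list(log10_compute)
    ys = [math.log10(v) for v in nll]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_nll(log10_compute, intercept, slope):
    """Evaluate the fitted power law at a given log10 compute."""
    return 10 ** (intercept + slope * log10_compute)


# SYNTHETIC early-run data from an assumed law: nll = 10**6.6 * C**-0.3.
true_intercept, true_slope = 6.6, -0.3
early_x = [20.0, 20.5, 21.0, 21.5]                      # log10 FLOPs
early_nll = [predict_nll(x, true_intercept, true_slope) for x in early_x]

intercept, slope = fit_power_law(early_x, early_nll)
predicted_final = predict_nll(24.0, intercept, slope)   # extrapolate 2.5 decades
print(f"fitted slope: {slope:.3f}")
print(f"predicted NLL at 10^24 FLOPs: {predicted_final:.4f}")
```

With real checkpoints the points carry noise and the curve flattens toward an irreducible floor, so the extrapolation error has to be measured on held-out checkpoints rather than assumed.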

Inverse scaling and the failure modes

Emergence has a darker cousin: inverse scaling. A handful of tasks (NeQA, sycophancy, certain prompt-injection robustness measures) get worse as compute grows. The mechanism is symmetric: as the model gets better at the dominant statistical pattern in the training data, it gets worse at tasks where the right answer requires resisting that pattern. The Inverse Scaling Prize (2022) found about a dozen of these, and they are why every frontier release report now includes a section on capability regressions, not just capability gains.

Engineering Reality and Gotchas

Five production failure modes earn their flags:

Reporting only the headline metric blinds you to learning. A benchmark that is at 5% accuracy for the first six months of training isn't telling you the model is failing — it is telling you the per-token probability is below the elbow. Track NLL or token-level log-probability for every benchmark, every checkpoint, even when the exact-match number is flat. Anthropic and DeepMind both publish dual curves in their scaling reports for exactly this reason.
Extrapolating emergence from one ladder is a coin flip. A 1B → 7B → 13B ladder will tell you almost nothing about whether MMLU emerges by 70B — the per-token sigmoid is too far below its midpoint at 13B to estimate either $s$ or $C_0$ precisely. The minimum honest ladder is three sizes spanning two decades of compute, with the largest sitting where you can just barely see the exact-match needle move.
The few-shot scaffold can flip the elbow. Wei et al. found that chain-of-thought prompting moves the emergence elbow earlier by roughly one decade of compute on math benchmarks — same model, different prompt, the elbow shifts. This means "does this model show emergence on task X" depends on the prompt, not just the weights. Lock prompt templates before reporting eval numbers.
Contamination amplifies apparent emergence. A model that has seen a benchmark question during pretraining will memorise it sharply once it has enough capacity — and the memorisation looks exactly like emergence on the per-task curve. A sudden jump on one benchmark that no other benchmark of similar structure shows is the canonical signature of contamination (see §8.7). Always cross-check emergent claims against the contamination report.
Saturation hides further capability. When a benchmark hits 95%+ exact-match, the curve flattens at the top — but per-token NLL keeps falling. The model is getting better at the benchmark in ways the grader cannot resolve, and that latent headroom shows up on harder benchmarks downstream. A saturated benchmark is not a solved one; it is a benchmark that needs to be retired and replaced with a harder version.
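For the ladder point above, one simple way to estimate $s$ and $C_0$ from measured per-token accuracies is a linear fit on the logit scale, since $\text{logit}(p) = s\,(\log_{10} C - C_0)$ under the sigmoid surrogate. The sketch below assumes that surrogate holds and that every accuracy lies strictly between 0 and 1; `fit_sigmoid` is a hypothetical helper, and the three data points are taken from this section's toy walkthrough, not from a real model.

```python
import math


def logit(p):
    """Inverse of the sigmoid; requires 0 < p < 1."""
    return math.log(p / (1.0 - p))


def fit_sigmoid(log10_compute, per_token_acc):
    """Fit p = sigmoid(s * (x - c0)) by least squares on logit(p) = s*x - s*c0."""
    xs = list(log10_compute)
    ys = [logit(p) for p in per_token_acc]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s = sxy / sxx
    c0 = mean_x - mean_y / s       # where the fitted line crosses logit = 0
    return s, c0


# Toy ladder straddling the midpoint (values from the walkthrough above).
fitted_s, fitted_c0 = fit_sigmoid([21.0, 22.0, 23.0], [0.182, 0.5, 0.818])
print(f"s = {fitted_s:.2f}, C0 = {fitted_c0:.2f}")
```

When every rung of the ladder sits far below the midpoint, the measured accuracies are close to zero, their logits are dominated by evaluation noise, and the fitted line is poorly constrained, which is the numerical face of the coin-flip warning.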

How DeepSeek tracks emergence during pretraining. Four practices every frontier run shares: (a) evaluate ~80 benchmarks on every saved checkpoint, not just at the end; (b) log both exact-match and per-token NLL for every benchmark; (c) fit a Chinchilla-style power law to the NLL curve mid-run to predict the final NLL — emergence on the exact-match curve becomes the question of whether NLL will cross the per-benchmark threshold; (d) for genuinely emergent capabilities (instruction following, chain-of-thought, tool use), train a multi-size ladder and accept that the largest model is the only honest signal.

The one sentence to carry forward: loss is the contract a scaling law lets you sign; emergence is the surprise the leaderboard delivers when the contract closes — and the only way to keep the surprise inside the budget is to plot both metrics from the first checkpoint to the last.

Where we go from here. The next section turns the question around. We have spent three sections deciding what to train — Chinchilla size, MoE structure, capability targets. Section 9.4 asks how to train it: what learning rate, what batch size, what warmup, what weight decay, all as functions of model scale. The hyperparameter scaling laws are how a recipe that worked at 1B parameters survives the leap to 670B.