The previous two sections gave us a precision instrument. Chinchilla scaling laws (§9.1) tell us, to within a few percent, what loss a model of size $N$ trained on $D$ tokens will reach. The MoE scaling laws (§9.2) refine the prediction for sparse models. Loss is predictable. Capabilities are not.
The thesis of this section. Frontier model training is haunted by a strange asymmetry: validation loss is a smooth power law, but the benchmarks the public actually cares about — multi-digit arithmetic, MMLU, code generation, chain-of-thought reasoning — improve in flat-then-sudden jumps. Researchers call this emergence. Whether emergence is a real phase transition or an artefact of the way we grade tasks is one of the most consequential debates in modern AI, because it decides whether a $50M training run is a gamble or an extrapolation.
The Real Problem: Loss Is Smooth, Capabilities Are Not
Chinchilla gave us this beautiful equation: $L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$. Plug in any model size $N$ and any token count $D$ and the validation loss is nailed down. We can plan a training run the way an aerospace engineer plans a launch: pick the budget, pick the trajectory, predict the outcome. Then we evaluate the trained model on the real benchmarks and the prediction breaks.
Wei et al. (2022) catalogued this directly. They ran every public model in the 10M–540B parameter range across 200+ tasks from BIG-Bench and plotted accuracy against training compute. Two patterns appeared.
- Smooth tasks. Token-level perplexity, simple completion, surface-level pattern matching — accuracy rises as a power law in compute, perfectly predictable from a small model.
- Emergent tasks. 3-digit addition, modular arithmetic, transliteration, MMLU, chain-of-thought reasoning — accuracy is pinned at ~random for many decades of compute, then jumps to near-ceiling over less than one decade. From a 6B model you have no signal at all; from a 70B model the task is solved.
This is uncomfortable. The whole point of a scaling law is that you can train a small model, fit the curve, and extrapolate. If the capability you care about is invisible until you cross a threshold, the extrapolation is a coin flip. You spend $50M on a 671B run and either MMLU jumps from 35% to 87% — or it sits at 41% and your release report has nothing to say.
| Task | First model that solved it | Effective compute |
|---|---|---|
| 3-digit addition (exact match) | GPT-3 13B (Brown et al., 2020) | ~ 3 × 10²² FLOPs |
| MMLU (above 25% random) | Gopher 280B / PaLM 62B | ~ 10²³ FLOPs |
| Chain-of-thought wins over standard prompting | LaMDA 137B / PaLM 540B | ~ 10²³ – 10²⁴ FLOPs |
| Instruction following (zero-shot, no SFT) | GPT-3 175B | ~ 3 × 10²³ FLOPs |
| HumanEval pass@1 above 20% | Codex 12B / GPT-4 | ~ 10²³ – 10²⁵ FLOPs |
Intuition: A Chain Is Only As Strong As Its Weakest Token
The cleanest mental model for emergence is the chain. Most useful tasks are not single-token decisions — they are sequences. To add two 4-digit numbers, the model must emit one digit at a time with carries propagated correctly. To answer an MMLU question correctly, it must emit the letter of the right option. To write a working Python function, it must emit dozens of tokens in the right order. To reason through a chain-of-thought, it must take ten or twenty correct steps before the final answer.
Now suppose the model gets each individual token right with probability $p$. If the task requires $k$ tokens in a row, all correct, and we grade with exact-match, the task accuracy is approximately $p^k$. The shape of $p^k$ as a function of $p$ is what creates emergence.
| p (per-token) | p² (k=2) | p⁸ (k=8) | p²⁰ (k=20) |
|---|---|---|---|
| 0.50 | 0.25 | 0.004 | 0.0000010 |
| 0.75 | 0.56 | 0.10 | 0.0032 |
| 0.90 | 0.81 | 0.43 | 0.12 |
| 0.95 | 0.90 | 0.66 | 0.36 |
| 0.99 | 0.98 | 0.92 | 0.82 |
Read down the last column. As per-token accuracy crawls from 0.75 to 0.99, a smooth and gradual improvement, the 20-step task accuracy explodes from 0.3% to 82%. The capability was always rising; the exact-match grader was hiding it behind a wall.
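The table can be regenerated in a few lines. This is a minimal sketch, assuming per-token errors are independent so that exact-match accuracy is exactly $p^k$; the function name `chain_accuracy` is illustrative.

```python
def chain_accuracy(p: float, k: int) -> float:
    """Exact-match accuracy for a k-token answer when every token is
    independently correct with probability p (an idealising assumption)."""
    return p ** k


per_token = [0.50, 0.75, 0.90, 0.95, 0.99]
chain_lengths = [2, 8, 20]

# Header row, then one row per per-token accuracy.
print("p     " + "".join(f"k={k:<8d}" for k in chain_lengths))
for p in per_token:
    row = "".join(f"{chain_accuracy(p, k):<10.4g}" for k in chain_lengths)
    print(f"{p:<6.2f}{row}")
```

Adding larger values to `chain_lengths` shows how quickly the plateau widens as the chain grows.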
This is the analogy that holds the whole concept together: a chain is only as strong as its weakest link, and a chain of probabilities is only as accurate as the worst per-step probability. Improve every link a little and the chain stays broken. Improve every link a lot and the chain suddenly holds the weight.
The water analogy. Heat a beaker of water from 20°C to 99°C and nothing visible happens — it is still a liquid. Heat another degree and it boils. The underlying physics was changing smoothly the whole time; the binary observable (liquid vs gas) crossed a threshold. Emergence in LLMs is a phase transition in the observable, not in the underlying quantity. The token-level log-probability was rising smoothly across the entire compute range.
The Mathematics of a Sudden Jump
Make the intuition formal. Let $p(C)$ be the probability that the model emits the correct next token at any individual answer position, where $C$ is the training compute. From the Chinchilla loss law we know $p$ rises smoothly and monotonically with $C$. A simple, useful surrogate is the sigmoid:
$$p(C) = \sigma\big(s\,(\log_{10} C - c_0)\big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$
Here $c_0$ is the log-compute at which the per-token probability is exactly 0.5, and $s$ controls the steepness per decade of compute. Empirically, the fitted values of $c_0$ and $s$ depend on the task.
Now grade the task with exact-match over a chain of $k$ steps:
$$\mathrm{EM}(C) = p(C)^{k}$$
Take the derivative with respect to log-compute to see how steep the rise is at the midpoint:
$$\frac{d\,\mathrm{EM}}{d \log_{10} C} = k\,p^{\,k-1}\,\frac{dp}{d \log_{10} C}$$
Two things to notice. First, the slope of the exact-match curve is $k$ times the slope of the per-token curve at any point where $p$ is close to 1, so longer chains have steeper emergence elbows. Second, when $p$ is small the slope is vanishingly small because $p^{k-1}$ kills it, which is why the flat-at-zero plateau before emergence is so wide.
Now grade the same model with token-level log-likelihood:
$$\mathrm{LL}(C) = \log p(C)$$
This is monotone and continuous in $p$, hence smooth and continuous in $C$. The same model, the same predictions, but the curve is a power law, not a phase transition. Schaeffer et al. (2023) argued from exactly this observation that emergence is partly an artefact of the metric: switch from exact-match to token-level scoring and the discontinuity dissolves into the smooth Chinchilla curve.
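To see the two metrics side by side, the sketch below evaluates a sigmoid surrogate for per-token accuracy, the exact-match curve with its chain-rule slope, and the token log-likelihood on a grid of compute values. The parameter values (`midpoint=22.0`, `steepness=1.5`, `k=8`) and the function names are illustrative assumptions, not fitted numbers.

```python
import math


def per_token_p(log10_c: float, midpoint: float = 22.0, steepness: float = 1.5) -> float:
    """Sigmoid surrogate for per-token accuracy as a function of log10(compute)."""
    return 1.0 / (1.0 + math.exp(-steepness * (log10_c - midpoint)))


def exact_match(log10_c: float, k: int) -> float:
    """Exact-match accuracy over a chain of k tokens: p^k."""
    return per_token_p(log10_c) ** k


def exact_match_slope(log10_c: float, k: int, steepness: float = 1.5) -> float:
    """Chain rule: d(p^k)/d(log10 C) = k * p^(k-1) * dp/d(log10 C).
    For this sigmoid, dp/d(log10 C) = steepness * p * (1 - p)."""
    p = per_token_p(log10_c, steepness=steepness)
    return k * p ** (k - 1) * steepness * p * (1.0 - p)


def token_log_likelihood(log10_c: float) -> float:
    """Per-token log-likelihood: log p."""
    return math.log(per_token_p(log10_c))


k = 8  # assumed chain length
print(f"{'log10 C':>8} {'p':>7} {'p^k':>9} {'slope':>9} {'log p':>8}")
for x in [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]:
    print(f"{x:8.1f} {per_token_p(x):7.3f} {exact_match(x, k):9.5f} "
          f"{exact_match_slope(x, k):9.5f} {token_log_likelihood(x):8.3f}")
```

In the printed table the `slope` column stays near zero while `log p` is already moving, which is the wide plateau described above.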
Real vs apparent emergence
Schaeffer's case is sharp but not the full story. Two things can both be true:
- Some emergence is metric-driven. Multi-digit arithmetic, exact-match MMLU, code-pass@1 — for these, the model is monotonically improving its log-likelihood the whole time, and the jump on the headline metric is a thresholding artefact. Switch to per-token NLL and the curve smooths out.
- Some emergence is real. In-context learning, chain-of-thought benefit, instruction following from raw pretraining, and tool use show genuine qualitative shifts that no smooth rescaling of the metric removes. Below the threshold the model ignores the prompt structure; above it, it follows. The phase transition is in the model's strategy, not in the grader.
The practical consequence is the same in both cases: you cannot extrapolate the leaderboard number from a small model. You can extrapolate the per-token NLL — and the NLL is what actually drives the leaderboard. Plot both.
Manual Numerical Walkthrough
Walk emergence through a concrete example: a model learning to add two 4-digit numbers. The task emits 4 answer tokens (the four digits of the sum), so $k = 4$. We will assume the model has identical per-token accuracy at every position. That is not exactly true in practice (the carry digits are harder), but it is enough to expose the mechanism.
4-digit addition across eight model scales
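Set-up. Eight checkpoints from a Chinchilla-sized run, spanning five decades of compute (10²⁰ to 10²⁵ FLOPs), with half-decade spacing around the elbow. We assume per-token accuracy on a digit follows $p(C) = \sigma\big(1.5\,(\log_{10} C - 22)\big)$, where $\sigma(x) = 1/(1 + e^{-x})$. The task is 4-digit addition, exact-match over four digits.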
Step 1. Per-token accuracy. Compute $p$ at each scale:
- 10²⁰ FLOPs: σ(−3) = 0.047 ≈ 5%
- 10²¹ FLOPs: σ(−1.5) = 0.182 ≈ 18%
- 10²² FLOPs: σ(0) = 0.500 ≈ 50%
- 3×10²² FLOPs (10^22.5): σ(0.75) = 0.679
- 10²³ FLOPs: σ(1.5) = 0.818
- 3×10²³ FLOPs (10^23.5): σ(2.25) = 0.905
- 10²⁴ FLOPs: σ(3.0) = 0.953
- 10²⁵ FLOPs: σ(4.5) = 0.989
Step 2. Exact-match accuracy. Raise $p$ to the 4th power (one factor per answer digit):
- 10²⁰: 0.047⁴ ≈ 0.0000049 (effectively 0)
- 10²¹: 0.182⁴ ≈ 0.0011 (still ~0 on a benchmark of hundreds)
- 10²²: 0.500⁴ = 0.0625 (first measurable signal)
- 3×10²²: 0.679⁴ ≈ 0.213
- 10²³: 0.818⁴ ≈ 0.448
- 3×10²³: 0.905⁴ ≈ 0.671
- 10²⁴: 0.953⁴ ≈ 0.825
- 10²⁵: 0.989⁴ ≈ 0.957
Step 3. Token log-likelihood. Apply the natural logarithm to the per-token values from step 1:
- 10²⁰: log 0.047 = −3.06
- 10²¹: log 0.182 = −1.70
- 10²²: log 0.500 = −0.69
- 3×10²²: log 0.679 = −0.39
- 10²³: log 0.818 = −0.20
- 3×10²³: log 0.905 = −0.10
- 10²⁴: log 0.953 = −0.048
- 10²⁵: log 0.989 = −0.011
Step 4. Compare the three columns. Per-token accuracy rises by a factor of 21× across the range (0.047 → 0.989). Token NLL falls smoothly by a factor of about 280× (3.06 → 0.011), also continuous, with no jumps. Exact-match accuracy rises by a factor of 195,000× (5×10⁻⁶ → 0.957), and the bulk of that rise is concentrated in two decades of compute (10²² to 10²⁴). Plotted on a linear y-axis, the first two rows are visually indistinguishable from zero and the third barely clears the floor; the line then climbs to near the ceiling over the next two decades. That is the emergence elbow.
Step 5. Chain length sensitivity. Re-do step 2 with $k = 20$ (a chain-of-thought reasoning task with twenty inference steps). At p = 0.679 the task accuracy is 0.679²⁰ ≈ 0.00043, still invisible. At p = 0.953 it is 0.953²⁰ ≈ 0.382. At p = 0.989 it is 0.989²⁰ ≈ 0.802. The emergence elbow has shifted roughly one decade of compute to the right, and is steeper. Longer chains create later and sharper emergence. This is why chain-of-thought reasoning emerged at a much higher scale than basic arithmetic: it has a longer chain.
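The walkthrough can be reproduced programmatically. This is a sketch under the same assumptions (identical per-token accuracy at every position, with p = σ(1.5 · (log₁₀ C − 22))). The helper `elbow_log10_compute`, which solves $p^k = 0.5$ for the compute at which exact-match crosses 50%, is an illustrative addition rather than part of the original walkthrough.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def per_token_p(log10_c: float) -> float:
    """Walkthrough assumption: p = sigmoid(1.5 * (log10 C - 22))."""
    return sigmoid(1.5 * (log10_c - 22.0))


def elbow_log10_compute(k: int) -> float:
    """log10(compute) at which exact-match p^k crosses 0.5.
    Solve p = 0.5**(1/k), then invert the sigmoid."""
    p_half = 0.5 ** (1.0 / k)
    logit = math.log(p_half / (1.0 - p_half))
    return 22.0 + logit / 1.5


# The eight checkpoints of the walkthrough, as log10(FLOPs).
checkpoints = [20.0, 21.0, 22.0, 22.5, 23.0, 23.5, 24.0, 25.0]
print(f"{'log10 C':>8} {'p':>7} {'p^4':>10} {'p^20':>10} {'log p':>8}")
for x in checkpoints:
    p = per_token_p(x)
    print(f"{x:8.1f} {p:7.3f} {p**4:10.6f} {p**20:10.6f} {math.log(p):8.3f}")

for k in (4, 20):
    print(f"k={k:2d}: exact-match reaches 50% near 10^{elbow_log10_compute(k):.2f} FLOPs")
```

Under these assumptions the 50% crossing moves by a little over one decade of compute when the chain grows from 4 to 20 tokens, consistent with step 5.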
Visualizing Emergence
The sandbox below lets you steer all three knobs: chain length $k$, the log-compute mid-point $c_0$, and the per-token steepness $s$. The dashed green line is the underlying per-token accuracy, a smooth sigmoid in log-compute. The solid red line is the task-level metric you chose: exact-match (which is $p^k$, the curve that looks emergent) or token-level (which is just $\log p$). Drag the chain-length slider from 1 to 20 and watch the elbow move right and steepen. That is the geometry of emergence.
Three things to internalise from the sandbox. First, at $k = 1$ the red and green curves are identical: there is no emergence on single-token tasks. Second, raising $k$ pushes the red elbow to the right and makes it steeper, all without touching the underlying per-token curve. The task got harder to grade, not harder to learn. Third, switch the metric to "Token log-likelihood" and the red curve becomes a smooth curve for any $k$, because $\log p$ is what loss-based scaling laws actually predict. The emergence disappeared because the metric stopped throwing away information.
Plain Python: Simulating an Emergent Curve
The following script is the simplest possible emergence simulator: no neural network, no GPU, just a sigmoid and a Bernoulli draw. It shows that the entire phenomenon needs nothing more than two ingredients: a smoothly-improving per-token probability, and an exact-match grader over a chain of length $k$.
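A minimal sketch of such a simulator follows. The midpoint (10²² FLOPs), steepness (1.5 per decade), chain length `k = 8`, trial count, seed, and compute grid are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math
import random


def per_token_p(log10_c: float, midpoint: float = 22.0, steepness: float = 1.5) -> float:
    """Smoothly improving per-token accuracy: a sigmoid in log10(compute)."""
    return 1.0 / (1.0 + math.exp(-steepness * (log10_c - midpoint)))


def simulate_exact_match(p: float, k: int, trials: int, rng: random.Random) -> float:
    """Fraction of trials in which all k Bernoulli(p) token draws are correct."""
    hits = 0
    for _ in range(trials):
        if all(rng.random() < p for _ in range(k)):
            hits += 1
    return hits / trials


def run(k: int = 8, trials: int = 20_000, seed: int = 0):
    """Return one (log10 C, p, log p, exact-match) row per simulated checkpoint."""
    rng = random.Random(seed)
    rows = []
    for log10_c in range(18, 26):  # 10^18 .. 10^25 FLOPs, one decade apart
        p = per_token_p(float(log10_c))
        em = simulate_exact_match(p, k, trials, rng)
        rows.append((log10_c, p, math.log(p), em))
    return rows


if __name__ == "__main__":
    print(f"{'log10 C':>8} {'p':>8} {'log p':>8} {'exact-match':>12}")
    for log10_c, p, logp, em in run():
        print(f"{log10_c:8d} {p:8.4f} {logp:8.3f} {em:12.4f}")
```

Calling `run(k=1)` turns the exact-match grader into a single Bernoulli draw per trial, which is the quickest way to check that the chain is the only source of the elbow here.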
Run it. The output is a table that shows the same model evaluated three ways. Per-token probability and token-level log-likelihood rise smoothly with log-compute. Exact-match accuracy sits at zero for five rows, then jumps. Both stories are right, because they describe the same model, but only one of them shows up on a leaderboard.
Sanity check yourself. Set `k = 1` and rerun. The exact-match column should now match the per-token column to within Monte-Carlo noise. If it does, you have confirmed that the emergence in the `k = 8` case was entirely due to chain length and the exact-match grader; there is nothing else in the simulator.
PyTorch: Two Metrics, Same Model, Different Story
In a real evaluation harness you have logits over a 100k-token vocabulary, not Bernoulli probabilities. The right thing to do is to compute both the exact-match number and the token-level NLL from the SAME forward pass. The cost is dominated by the forward, not the metric, so this is a free win.
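A sketch of such a dual-metric scorer is shown below. It assumes teacher-forced logits of shape `(batch, seq, vocab)` already aligned with the target ids, and a boolean `answer_mask` marking the answer-span positions. The function name `dual_metrics`, its return keys, and the choice to judge exact match by greedy argmax under teacher forcing are illustrative assumptions, not a description of any particular harness.

```python
import torch
import torch.nn.functional as F


def dual_metrics(logits: torch.Tensor,
                 targets: torch.Tensor,
                 answer_mask: torch.Tensor) -> dict:
    """Exact-match accuracy and mean answer-token NLL from one set of logits.

    logits:      (batch, seq, vocab) float, aligned with targets
    targets:     (batch, seq) int64 gold token ids
    answer_mask: (batch, seq) bool, True only on answer-span positions
    """
    log_probs = F.log_softmax(logits.float(), dim=-1)
    # Log-probability assigned to each gold token: (batch, seq)
    gold_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    mask = answer_mask.bool()
    n_answer_tokens = mask.sum().clamp(min=1)
    # Token-level NLL, averaged over answer tokens only.
    nll = -(gold_logp * mask).sum() / n_answer_tokens

    # Exact match: every answer token must be the argmax prediction.
    # Non-answer positions count as correct so they cannot affect the result
    # (a sequence with an empty answer span therefore counts as a match).
    correct = (log_probs.argmax(dim=-1) == targets) | ~mask
    exact_match = correct.all(dim=-1).float().mean()

    return {"exact_match": exact_match.item(), "nll": nll.item()}


if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(2, 6, 100)            # stand-in for a model forward pass
    targets = torch.randint(0, 100, (2, 6))
    answer_mask = torch.zeros(2, 6, dtype=torch.bool)
    answer_mask[:, 4:] = True                  # last two positions are the answer
    print(dual_metrics(logits, targets, answer_mask))
```

Both numbers come from the same `log_probs` tensor, so the extra cost over a single metric is one `gather` and one `argmax`.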
Two subtleties worth marking, both about how this harness feeds back into training decisions:
- Score only the answer span. Without `answer_mask`, both metrics get diluted by hundreds of easy prompt tokens that the model did not have to produce. A model that emits the eval prompt back to you with 99.9% per-token accuracy looks "great" on naive metrics and tells you nothing about its capabilities. The mask is the difference between a useful eval and a vanity number.
- Return both numbers; let the consumer decide. The two metrics tell different stories at different stages of training. Early in pretraining, NLL is the only signal that moves: exact-match will sit at zero on every hard benchmark for the first half of the run. The training-monitoring stack needs NLL to know the model is learning at all. Late in pretraining, exact-match starts to twitch and finally jumps; that is the signal the product team waits for. Returning only one of the two will blind half the audience.
At Massive Scale: Planning for Capabilities You Cannot Predict
Emergence reframes the central question of frontier model planning. Chinchilla tells you the validation loss; emergence tells you why validation loss is not enough. The actual planning protocol used at DeepSeek, Anthropic, and Google DeepMind looks roughly like this:
| Decision | Driven by | Why this metric |
|---|---|---|
| Model size N and tokens D | Validation loss + Chinchilla fit | Loss is smooth and extrapolable from small models; loss minimisation under a compute constraint is well-defined. |
| Per-benchmark training-time monitoring | Token-level NLL per benchmark | NLL moves on every checkpoint, even when exact-match is pinned at zero. It is the only early signal that a benchmark is starting to be learnt. |
| Final go/no-go on a benchmark | Exact-match (or answer-equivalence) | This is what the leaderboard reports. The release report has to use the public metric, and the public metric is exact-match. |
| Emergent-capability bets (CoT, tool use, ICL) | Empirical extrapolation across model sizes 1B → 7B → 70B | These genuinely jump and no smooth surrogate predicts them. The only honest planning protocol is multi-scale: train a ladder of sizes and watch when the jump fires. |
Three quantitative facts shape the planning protocol. First, the per-benchmark NLL curve is a power law in compute; it can be fit from the first 10% of a training run and extrapolated to the end with < 5% relative error. Second, the exact-match curve cannot be extrapolated from below the elbow: every published attempt to do so has been worse than chance. Third, the location of the elbow is roughly predictable from chain length: tasks with 1–3 reasoning steps emerge at the lowest compute, 5–10-step tasks at higher compute, and 20+-step tasks like multi-hop reasoning higher still. DeepSeek-V3's training compute falls in the range where ~20-step CoT reasoning becomes reliable.
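The first of these points, fitting a power law to early NLL measurements and extrapolating, can be sketched with an ordinary least-squares fit in log-log space. The checkpoint values below are synthetic, generated from an exact power law purely for illustration. Real curves are noisy and may include an irreducible-loss floor that this two-parameter form ignores, and the function names are hypothetical.

```python
import math


def fit_power_law(compute, nll):
    """Least-squares fit of log(nll) = log(a) - b * log(compute).
    Returns (a, b) for the model nll ~ a * compute**(-b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in nll]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict(a: float, b: float, compute: float) -> float:
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** (-b)


# Synthetic early checkpoints (illustrative numbers, not measurements).
true_a, true_b = 50.0, 0.08
early_compute = [1e21, 2e21, 5e21, 1e22]
early_nll = [true_a * c ** (-true_b) for c in early_compute]

a, b = fit_power_law(early_compute, early_nll)
target = 1e23
print(f"fitted exponent b = {b:.4f}")
print(f"extrapolated NLL at 1e23 FLOPs = {predict(a, b, target):.4f}")
```

The same fit applied to exact-match accuracy below the elbow has almost nothing to work with, since the early values are all near zero.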
Inverse scaling and the failure modes
Emergence has a darker cousin: inverse scaling. A handful of tasks (NeQA, sycophancy, certain prompt-injection robustness measures) get worse as compute grows. The mechanism is symmetric: as the model gets better at the dominant statistical pattern in the training data, it gets worse at tasks where the right answer requires resisting that pattern. The Inverse Scaling Prize (2022) found about a dozen of these, and they are why every frontier release report now includes a section on capability regressions, not just capability gains.
Engineering Reality and Gotchas
Five production failure modes earn their flags:
- Reporting only the headline metric blinds you to learning. A benchmark that is at 5% accuracy for the first six months of training isn't telling you the model is failing — it is telling you the per-token probability is below the elbow. Track NLL or token-level log-probability for every benchmark, every checkpoint, even when the exact-match number is flat. Anthropic and DeepMind both publish dual curves in their scaling reports for exactly this reason.
- Extrapolating emergence from one ladder is a coin flip. A 1B → 7B → 13B ladder will tell you almost nothing about whether MMLU emerges by 70B: the per-token sigmoid is too far below its midpoint at 13B to estimate either the midpoint or the steepness precisely. The minimum honest ladder is three sizes spanning two decades of compute, with the largest sitting where you can just barely see the exact-match needle move.
- The few-shot scaffold can flip the elbow. Wei et al. found that chain-of-thought prompting moves the emergence elbow earlier by roughly one decade of compute on math benchmarks — same model, different prompt, the elbow shifts. This means "does this model show emergence on task X" depends on the prompt, not just the weights. Lock prompt templates before reporting eval numbers.
- Contamination amplifies apparent emergence. A model that has seen a benchmark question during pretraining will memorise it sharply once it has enough capacity — and the memorisation looks exactly like emergence on the per-task curve. A sudden jump on one benchmark that no other benchmark of similar structure shows is the canonical signature of contamination (see §8.7). Always cross-check emergent claims against the contamination report.
- Saturation hides further capability. When a benchmark hits 95%+ exact-match, the curve flattens at the top — but per-token NLL keeps falling. The model is getting better at the benchmark in ways the grader cannot resolve, and that latent headroom shows up on harder benchmarks downstream. A saturated benchmark is not a solved one; it is a benchmark that needs to be retired and replaced with a harder version.
The one sentence to carry forward: loss is the contract a scaling law lets you sign; emergence is the surprise the leaderboard delivers when the contract closes — and the only way to keep the surprise inside the budget is to plot both metrics from the first checkpoint to the last.