Suppose you have ten million dollars of GPU time, six months of calendar time, and a corpus of fifteen trillion training tokens. How big should the model be? A 7B parameter model trained on every token you have? A 700B model trained for a single epoch? You could split the budget a hundred different ways. Only one of those splits gives you the lowest possible validation loss at the end. The Chinchilla scaling laws tell you which one.
The thesis of this section. For a fixed training compute budget $C$, validation loss is a smooth, predictable function of model size $N$ and training tokens $D$. The function has a unique minimum on the constraint $C \approx 6ND$, and the optimum lies near $D \approx 20N$. Models that predate the law (GPT-3, Gopher) spent too much of their compute on parameters and too little on tokens. Models that respect it or deliberately train past it (Chinchilla, Llama-3, DeepSeek-V3) extract far more capability from the same compute.
The Real Problem: How Big Should the Model Be?
Before Chinchilla, the field believed the answer to "how big should the model be?" was simple: as big as you can afford. Kaplan et al. (2020) at OpenAI published an influential set of scaling laws fit to GPT-2-era runs and concluded that the compute-optimal allocation grew the model size much faster than the training token count, specifically that $N_{\text{opt}} \propto C^{0.73}$ and $D_{\text{opt}} \propto C^{0.27}$. The practical rule that followed: build the largest possible model and train it for relatively few tokens.
Three of the most important models of the era followed this advice to the letter. GPT-3 (175B) trained on 300B tokens — 1.7 tokens per parameter. Gopher (280B) trained on 300B tokens — 1.1 tokens per parameter. Megatron-Turing NLG (530B) trained on 270B tokens — 0.5 tokens per parameter. All three were giants. All three, it turns out, were dramatically undertrained.
In 2022 a small DeepMind team led by Jordan Hoffmann ran a brutally thorough sweep: 400+ training runs at log-spaced model sizes from 70M to 16B parameters, each trained for a wide range of token counts. They re-fit the scaling laws on this richer data and found Kaplan's exponents were wrong. The correct rule was $N_{\text{opt}} \propto C^{a}$ and $D_{\text{opt}} \propto C^{b}$ with $a \approx b \approx 0.5$: roughly equal scaling. They then trained a single 70B model on 1.4T tokens, at the same compute as the 280B Gopher, and named it Chinchilla. The smaller Chinchilla beat the bigger Gopher on essentially every benchmark.
| Model | Params | Tokens | D / N | Compute (FLOPs) | MMLU |
|---|---|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | 1.7 | 3.1 × 10²³ | 43.9% |
| Gopher (2021) | 280B | 300B | 1.1 | 5.8 × 10²³ | 60.0% |
| Chinchilla (2022) | 70B | 1.4T | 20.0 | 5.8 × 10²³ | 67.6% |
| Llama-3 70B (2024) | 70B | 15T | 214 | 6.3 × 10²⁴ | 82.0% |
| DeepSeek-V3 (2024) | 37B active / 671B total | 14.8T | 400 (active) | ≈ 3.3 × 10²⁴ active | ~88% |
The Chinchilla row is the headline. Same compute as Gopher, a quarter of the parameters, dramatically better benchmarks. The only thing that changed was where the FLOPs were spent. Every model below Chinchilla in the table follows the Chinchilla rule or deliberately overshoots it.
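The D / N and compute columns follow from the $C \approx 6ND$ approximation used throughout this section. The sketch below recomputes them from the parameter and token counts in the table. The dictionary layout and the helper names `train_flops` and `tokens_per_param` are illustrative choices, and the DeepSeek-V3 entry uses the active parameter count.

```python
# Parameter and token counts as listed in the table above.
# For the MoE row, N is the *active* parameter count.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama-3 70B": (70e9, 15e12),
    "DeepSeek-V3 (active)": (37e9, 14.8e12),
}


def train_flops(n_params, n_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def tokens_per_param(n_params, n_tokens):
    """The D / N ratio."""
    return n_tokens / n_params


for name, (n, d) in MODELS.items():
    print(f"{name:22s} D/N = {tokens_per_param(n, d):6.1f}   "
          f"C = {train_flops(n, d):.2e} FLOPs")
```

For Gopher the approximation gives about 5.0 × 10²³ FLOPs, somewhat below the 5.8 × 10²³ quoted in the table: $6ND$ is an estimate, not an accounting of every FLOP.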
Intuition: A Budget Split Between Size and Training
Picture training as buying loss reduction with FLOPs. You have a fixed wallet of FLOPs. There are two things you can buy with it: a bigger brain (more parameters), or more education (more training tokens). Both reduce validation loss, but they do so in different ways and at different prices.
- Adding parameters lets the model represent functions of higher complexity. Each new parameter is a new coefficient that can be tuned to fit some pattern in the data. The first parameters buy huge loss reduction; the millionth parameter buys very little. The marginal return shrinks as a power law: the model-size part of the loss falls like $N^{-\alpha}$.
- Adding training tokens lets the existing parameters get tuned more precisely against more examples. Each new token nudges weights toward better calibration; the marginal return again shrinks as a power law: the data part of the loss falls like $D^{-\beta}$.
- Spending FLOPs. Every training token through every parameter costs about 6 FLOPs (2 forward + 4 backward). Total cost: $C \approx 6ND$. So $N$ and $D$ are not free choices; they are tied by the budget.
Now picture the trade-off geometrically. The constraint $6ND = C$ is a hyperbola in the $(N, D)$ plane. The loss surface is a smooth bowl tilted toward large $N$ and large $D$. The Chinchilla insight is that the constraint hyperbola crosses the loss bowl along a curve, and this curve has a unique lowest point. That lowest point is where the marginal loss reduction per FLOP from adding parameters exactly equals the marginal loss reduction per FLOP from adding tokens.
The intuition extends one step further. If parameters are cheaper-per-loss-reduction at your current allocation, you should shift FLOPs toward more parameters. If tokens are cheaper, shift toward more tokens. The optimum is reached when neither side is cheaper than the other — Lagrange multipliers, dressed up as an engineering principle. GPT-3 was wildly off this balance: it had bought way too many parameters and way too few tokens. The marginal FLOP would have been better spent training the existing 175B parameters longer.
The Mathematics of L(N, D)
Hoffmann et al. fit a parametric form for the validation loss as a function of model size $N$ and training tokens $D$:
$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$
Each symbol carries a clear meaning. $E$ is the irreducible loss: the entropy of natural text under the chosen tokenizer. No model, no matter how large or how long trained, can do better than $E$. The other two terms are the reducible losses: $A / N^{\alpha}$ is the penalty for having a finite model and $B / D^{\beta}$ is the penalty for having a finite dataset. Both shrink to zero as their argument grows. For the Hoffmann constants, $E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, and $\beta = 0.28$.
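To make the formula concrete, here is a small sketch that evaluates it with the rounded constants reported for the Hoffmann et al. fit. The function name `chinchilla_loss` is illustrative and not taken from the paper's code.

```python
# Rounded constants of the Hoffmann et al. parametric fit.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Predicted loss in nats per token: L(N, D) = E + A / N^alpha + B / D^beta."""
    model_term = A / n_params ** ALPHA   # penalty for a finite model
    data_term = B / n_tokens ** BETA     # penalty for a finite dataset
    return E + model_term + data_term


for name, n, d in [("Gopher", 280e9, 300e9), ("Chinchilla", 70e9, 1.4e12)]:
    print(f"{name:10s} N = {n:.2e}  D = {d:.2e}  L = {chinchilla_loss(n, d):.3f}")
```

With these constants the 70B / 1.4T allocation is predicted to reach a lower loss than the 280B / 300B one (about 1.94 versus 1.99 nats per token), which is the direction the benchmark table shows.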
The training-compute approximation
A transformer forward pass through one token costs about $2N$ FLOPs (one multiply-add per parameter is two FLOPs), and the backward pass costs about twice that, so the total per-token training cost is $6N$ FLOPs. Multiplying by $D$ training tokens gives the standard approximation:
$$C \approx 6ND$$
This is loose at small scales (attention compute matters, embeddings matter) and tight at frontier scales. We treat it as exact for the derivations below.
The compute-optimal allocation
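We want to minimise $L(N, D)$ subject to the compute constraint $6ND = C$. Form the Lagrangian
$$\mathcal{L}(N, D, \lambda) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + \lambda\,(6ND - C)$$
Set the partial derivatives to zero. From $\partial\mathcal{L}/\partial N = 0$ and $\partial\mathcal{L}/\partial D = 0$ we get the elegant equilibrium condition:
$$\frac{\alpha A}{N^{\alpha}} = \frac{\beta B}{D^{\beta}}$$
In plain language: at the optimum, the model-size term and the data term (weighted by their exponents) contribute equally to the reducible loss. If either side dominates, you have wasted FLOPs. Solving this together with $6ND = C$ gives the closed-form compute-optimal allocation:
$$N^* = G \left(\frac{C}{6}\right)^{a}, \qquad D^* = G^{-1} \left(\frac{C}{6}\right)^{b}$$
where the exponents are $a = \beta / (\alpha + \beta)$ and $b = \alpha / (\alpha + \beta)$, and the level-set constant is $G = (\alpha A / \beta B)^{1/(\alpha + \beta)}$. Plugging in the fitted values gives $a \approx 0.45$ and $b \approx 0.55$: almost symmetric, which is the heart of the Chinchilla claim.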
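For readers who want the intermediate algebra, the step from the stationarity conditions to the closed form can be written out as follows. It assumes the Lagrangian is written as $\mathcal{L} = L(N, D) + \lambda\,(6ND - C)$ and uses only the substitution $D = C/(6N)$, with $G = (\alpha A / \beta B)^{1/(\alpha+\beta)}$.

```latex
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial N} = 0
  &\;\Rightarrow\; \alpha A\, N^{-\alpha} = 6 \lambda N D, \\
\frac{\partial \mathcal{L}}{\partial D} = 0
  &\;\Rightarrow\; \beta B\, D^{-\beta} = 6 \lambda N D, \\
\text{hence}\quad \alpha A\, N^{-\alpha}
  &= \beta B\, D^{-\beta}
   = \beta B \left(\tfrac{C}{6}\right)^{-\beta} N^{\beta}
   \qquad \text{(using } D = C/(6N)\text{)}, \\
N^{\alpha + \beta}
  &= \frac{\alpha A}{\beta B} \left(\tfrac{C}{6}\right)^{\beta}
  \;\Rightarrow\;
  N^* = G \left(\tfrac{C}{6}\right)^{\beta/(\alpha+\beta)}, \\
D^* &= \frac{C}{6 N^*}
   = G^{-1} \left(\tfrac{C}{6}\right)^{\alpha/(\alpha+\beta)} .
\end{aligned}
```

The two exponents sum to one, so $6 N^* D^* = C$ holds identically.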
The 20-tokens-per-parameter rule
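Chinchilla itself was trained at $D/N = 1.4\text{T} / 70\text{B} = 20$, and this is where the famous "20 tokens per parameter" rule of thumb comes from. The closed form above does not give a constant ratio: $D^*/N^* = G^{-2} (C/6)^{b-a}$ grows slowly with compute because $b > a$. With the rounded constants quoted above it is roughly 25 at $C = 10^{18}$ FLOPs and roughly 90 at the Chinchilla compute ($C \approx 5.76 \times 10^{23}$ FLOPs), noticeably more data-heavy than the model that was actually trained. Because the loss minimum is broad, 20× is still a good first-pass estimate in the regime where most production training happens, but a real projection should use constants fitted on your own runs.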
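The drift of the ratio with compute is easy to tabulate from the closed form $N^* = G\,(C/6)^a$, $D^* = G^{-1} (C/6)^b$. The sketch below assumes the rounded Hoffmann constants and the $C = 6ND$ approximation; `compute_optimal` is an illustrative helper name, and constants re-fitted on a different sweep would give different ratios.

```python
# Rounded constants of the Hoffmann et al. parametric fit.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def compute_optimal(c_flops):
    """Closed-form (N*, D*) minimising L(N, D) subject to 6 * N * D = C."""
    a = BETA / (ALPHA + BETA)            # exponent of N* in C
    b = ALPHA / (ALPHA + BETA)           # exponent of D* in C
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (c_flops / 6.0) ** a
    d_opt = (c_flops / 6.0) ** b / g
    return n_opt, d_opt


for c in (1e18, 1e20, 1e22, 5.76e23, 1e25):
    n_opt, d_opt = compute_optimal(c)
    print(f"C = {c:.2e}  N* = {n_opt:.3e}  D* = {d_opt:.3e}  "
          f"D*/N* = {d_opt / n_opt:6.1f}")
```

Because $b > a$ for these constants, the printed ratio rises monotonically with the budget.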
Manual Numerical Walkthrough
Let us compute the compute-optimal allocation for a single, concrete compute budget by hand, using the Hoffmann constants.
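Worked example: compute-optimal split at $C = 6 \times 10^{23}$ FLOPs
Step 1: Pick the constants. $E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$. Budget: $C = 6 \times 10^{23}$ FLOPs (roughly the Gopher/Chinchilla compute).
Step 2: Compute the exponents. $\alpha + \beta = 0.62$. $a = \beta / (\alpha + \beta) = 0.28 / 0.62 \approx 0.4516$. $b = \alpha / (\alpha + \beta) = 0.34 / 0.62 \approx 0.5484$.
Step 3: Compute $G$. $\alpha A = 0.34 \times 406.4 \approx 138.18$. $\beta B = 0.28 \times 410.7 \approx 115.00$. Ratio: $138.18 / 115.00 \approx 1.2016$. $G = 1.2016^{1/0.62} \approx 1.345$.
Step 4: Compute $(C/6)^a$ and $(C/6)^b$. $C/6 = 10^{23}$. $(C/6)^a = 10^{23 \times 0.4516} \approx 2.44 \times 10^{10}$. $(C/6)^b = 10^{23 \times 0.5484} \approx 4.10 \times 10^{12}$.
Step 5: Compute $N^*$ and $D^*$. $N^* = G \cdot (C/6)^a \approx 1.345 \times 2.44 \times 10^{10} \approx 3.28 \times 10^{10}$, about 33 billion parameters. $D^* = (C/6)^b / G \approx 4.10 \times 10^{12} / 1.345 \approx 3.05 \times 10^{12}$, about 3.0 trillion tokens.
Step 6: Sanity check the compute budget. $6 N^* D^* \approx 6 \times 3.28 \times 10^{10} \times 3.05 \times 10^{12} \approx 6.0 \times 10^{23}$. Matches the target, so the algebra checks out.
Step 7: Compute the predicted loss. $(N^*)^{\alpha} = (3.28 \times 10^{10})^{0.34} \approx 3760$. So $A / (N^*)^{\alpha} \approx 406.4 / 3760 \approx 0.108$.
Similarly $(D^*)^{\beta} = (3.05 \times 10^{12})^{0.28} \approx 3130$. So $B / (D^*)^{\beta} \approx 410.7 / 3130 \approx 0.131$.
Step 8: Verify the equilibrium condition. $\alpha A / (N^*)^{\alpha} \approx 0.34 \times 0.108 \approx 0.0367$. $\beta B / (D^*)^{\beta} \approx 0.28 \times 0.131 \approx 0.0367$. These two are equal to within rounding: the optimum balances the weighted marginal contributions, exactly as the Lagrangian predicts.
Total loss: $L \approx 1.69 + 0.108 + 0.131 \approx 1.93$ nats per token. Chinchilla itself reported a final training loss of ~1.96 at this compute, within 0.03 of the prediction.
Step 9: Compare to Gopher's choice. Gopher chose $N = 280\text{B}$ and $D = 300\text{B}$: a similar compute budget, a wildly different split. Plug into $L$: $A / N^{\alpha} \approx 406.4 / 7799 \approx 0.052$; $B / D^{\beta} \approx 410.7 / 1635 \approx 0.251$. $L \approx 1.69 + 0.052 + 0.251 = 1.993$. That is about 0.06 nats worse than the compute-optimal allocation, roughly the gap between an MMLU score of 60% and 67%. Same hardware. Different choice.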
Visualizing the Iso-FLOP Slice
The widget below evaluates the Hoffmann loss law $L(N, D)$ and traces the iso-FLOP curve $L(N, C/(6N))$ as a function of model size $N$ on a log axis. The dashed green line marks the compute-optimal $N^*$. Slide the compute budget $C$ to see how the optimum moves; click the preset chips to jump straight to a Gopher-, GPT-3-, or Llama-3-class compute. The reference dots show where each historical model sits on its own iso-FLOP curve.
Four things to internalise from the sandbox. First, the iso-FLOP curve is a U-shape, not monotone: there really is a sweet spot, and moving away from it in either direction raises the loss. Second, the minimum is broad: deviating by 1.5–2× in either direction costs only ~0.01 nats. This is why labs do not need to land exactly at $N^*$; they have slack. Third, GPT-3 sits dramatically to the right of the Gopher-class optimum, which is visually obvious in the chart. It is in the steep over-parameterised tail of its own iso-FLOP curve. Fourth, Llama-3 70B sits to the left of its Chinchilla-optimal point: it is intentionally under-parameterised, the over-trained mirror image of GPT-3. The reason is inference cost, covered below.
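The shape of the slice can also be reproduced without the interactive widget. The sketch below samples $L(N, C/(6N))$ on a log-spaced grid, assuming the rounded Hoffmann constants. The grid bounds, the number of points and the helper name `iso_flop_curve` are arbitrary illustrative choices.

```python
# Rounded constants of the Hoffmann et al. parametric fit.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28
BUDGET = 5.76e23  # FLOPs, the Gopher/Chinchilla-class budget


def loss(n, d):
    return E + A / n ** ALPHA + B / d ** BETA


def iso_flop_curve(c_flops, n_min=1e9, n_max=1e12, points=61):
    """Sample L(N, C / (6 N)) at log-spaced model sizes N for a fixed budget C."""
    curve = []
    for i in range(points):
        n = n_min * (n_max / n_min) ** (i / (points - 1))
        d = c_flops / (6.0 * n)          # tokens implied by the budget
        curve.append((n, d, loss(n, d)))
    return curve


curve = iso_flop_curve(BUDGET)
n_best, d_best, l_best = min(curve, key=lambda row: row[2])
print(f"grid minimum: N = {n_best:.2e}, D = {d_best:.2e}, L = {l_best:.4f}")

# How much loss does a 2x deviation in N cost at the same budget?
for factor in (0.5, 2.0):
    n = n_best * factor
    penalty = loss(n, BUDGET / (6.0 * n)) - l_best
    print(f"N = {factor}x grid optimum -> +{penalty:.4f} nats")
```

The sampled losses fall, bottom out once, and rise again, and the 2× penalties come out below 0.01 nats, which is the "broad minimum" point made above.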
Plain Python: Fit and Optimise the Loss
Below is the full Chinchilla math, end to end, in plain Python — no deep-learning framework, no GPU. We define the parametric loss, derive the closed-form compute optimum, cross-check with a Nelder-Mead numerical solver, and sweep the compute budget over six orders of magnitude.
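A minimal, standard-library-only sketch of such a script follows. It assumes the rounded Hoffmann constants and the $C = 6ND$ approximation. It also swaps the Nelder-Mead solver for a one-dimensional golden-section search over $\ln N$: the budget constraint removes one degree of freedom, so a 1-D search is enough for the cross-check. All function names are illustrative.

```python
import math

# Rounded constants of the Hoffmann et al. parametric fit.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss(n, d):
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n ** ALPHA + B / d ** BETA


def closed_form(c_flops):
    """Closed-form compute-optimal (N*, D*) on the constraint 6 * N * D = C."""
    a = BETA / (ALPHA + BETA)
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (c_flops / 6.0) ** a
    return n_opt, c_flops / (6.0 * n_opt)


def numerical(c_flops, lo=math.log(1e5), hi=math.log(1e14), iters=100):
    """Golden-section search over x = ln N for the minimum of L(N, C / (6 N))."""
    def f(x):
        n = math.exp(x)
        return loss(n, c_flops / (6.0 * n))

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    for _ in range(iters):
        if f1 < f2:                      # minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
        else:                            # minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
    n_opt = math.exp(0.5 * (lo + hi))
    return n_opt, c_flops / (6.0 * n_opt)


# Sweep the budget over six orders of magnitude.
for exponent in range(19, 26):
    c = 10.0 ** exponent
    n_cf, d_cf = closed_form(c)
    n_num, _ = numerical(c)
    print(f"C = 1e{exponent}  N* = {n_cf:.3e}  D* = {d_cf:.3e}  "
          f"D*/N* = {d_cf / n_cf:6.1f}  L = {loss(n_cf, d_cf):.4f}  "
          f"numeric/closed = {n_num / n_cf:.6f}")
```

The last column is the ratio of the numerically found $N^*$ to the closed-form one; it should sit at 1 to within the tolerance of the search.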
Two non-obvious details worth a second look. First, the closed-form and the numerical-search results agree to many decimal places at every compute scale; this is your acceptance test that the math (and the constants) are consistent. If they diverge, the published constants you copied are slightly different from the ones that match the closed form, and you should re-fit before using them in a production projection. Second, the increase of $D^*/N^*$ with compute is the most important finding hidden inside the law: bigger budgets do not just demand bigger models, they demand disproportionately more training data. This is why the post-Chinchilla generation of models (Llama-3, DeepSeek-V3) trains on 15+ trillion tokens: not because anyone "broke" the law, but because the law itself extrapolates in that direction.
Sanity-check yourself. Run the script with `C = 5.76e23` (Chinchilla's actual compute). With the rounded constants quoted in this section, the closed-form optimum lands near `N* ≈ 32B, D* ≈ 3.0T`. Hoffmann et al. trained `N = 70B, D = 1.4T`, a larger and less heavily trained model than this particular fit recommends. Because the minimum is broad, the predicted loss penalty for that choice is under 0.01 nats.
PyTorch: Fitting Scaling Laws From Real Runs
Knowing the formula is not enough. Every lab has its own $(E, A, B, \alpha, \beta)$ because every lab has its own optimiser, tokenizer, architecture quirks, and data mixture; Hoffmann's constants are a starting point, not gospel. Before committing to a frontier training run, the lab fits its own scaling law on a sweep of dozens of small runs, then extrapolates. The PyTorch snippet below shows the fitting half.
Three subtleties marked for production use:
- Log-space parameterisation matters. Fitting $A$, $B$ and $E$ directly puts the optimiser in a 400-vs-1.7 numerical regime, with exponents near 0.3 on top. Fitting $\log A$, $\log B$, $\log E$ puts them all in a similar single-digit range ($\log 406.4 \approx 6.0$, $\log 1.69 \approx 0.5$), and Adam converges in 1000× fewer steps.
- Huber on log-loss, not MSE on raw loss. Hoffmann's Approach 3 fits in log-loss space because the relative error is what matters: predicting 2.02 when the truth is 2.00 is the same severity as predicting 4.04 when the truth is 4.00. Huber bounds the influence of any one outlier run. MSE on raw loss silently over-weights the high-loss (small-model) runs, where the same relative miss is a larger absolute error, and under-weights the low-loss (big-model) runs that sit closest to the extrapolation target.
- Multi-start to escape shallow ridges. The loss landscape for $(E, A, B, \alpha, \beta)$ has shallow ridges where many fits look equally good but extrapolate differently. Production pipelines re-fit from 20–50 random initialisations of the parameters drawn uniformly from a bounded range, keep the best loss, and report the spread as an uncertainty band on the extrapolation. A 2× spread on $N^*$ at the target compute is normal; a 10× spread means the small runs do not generalise to the target scale and the sweep needs larger anchor points.
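The objective described in the three bullets above can be sketched in a few lines. This version uses only the Python standard library rather than PyTorch, and it shows the quantity being minimised, not the optimiser loop or the multi-start wrapper. The Huber threshold `delta=1e-3` and the synthetic runs are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def huber(residual, delta=1e-3):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    r = abs(residual)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def predicted_log_loss(params, n, d):
    """log L(N, D), with A, B and E stored as logs so all parameters are O(1-10)."""
    log_a, log_b, log_e, alpha, beta = params
    return log_sum_exp([
        log_a - alpha * math.log(n),     # log(A / N^alpha)
        log_b - beta * math.log(d),      # log(B / D^beta)
        log_e,                           # log(E)
    ])


def objective(params, runs, delta=1e-3):
    """Sum over runs of the Huber loss between predicted and observed log-loss."""
    return sum(
        huber(predicted_log_loss(params, n, d) - math.log(observed), delta)
        for n, d, observed in runs
    )


# Synthetic "runs" generated from the rounded Hoffmann constants (illustration only).
true_params = (math.log(406.4), math.log(410.7), math.log(1.69), 0.34, 0.28)
runs = [
    (n, d, math.exp(predicted_log_loss(true_params, n, d)))
    for n in (7e7, 4e8, 2e9, 1.6e10)
    for d in (5e9, 5e10, 5e11)
]

perturbed = true_params[:3] + (0.30, 0.28)   # same constants, wrong alpha
print("objective at generating parameters:", objective(true_params, runs))
print("objective with alpha = 0.30:       ", objective(perturbed, runs))
```

In a real fit, `objective` would be handed to an optimiser once per random initialisation, and the spread of the resulting extrapolations would be reported alongside the best fit.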
At Massive Scale: Why DeepSeek Trained Past Chinchilla
If Chinchilla's law says 20 tokens per parameter, why is Llama-3 70B trained on 15T tokens (214 tokens per parameter), and why does DeepSeek-V3 train on 14.8T tokens against 37B active parameters (≈ 400 per active param)? The answer is that the Chinchilla optimum accounts only for training compute, but a deployed frontier model spends most of its lifetime generating tokens, not training. Inference cost is about $2N$ FLOPs per generated token. A smaller $N$ means cheaper inference forever, even at the cost of more training compute today.
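The full accounting splits the lifetime FLOP budget into two buckets:
$$C_{\text{lifetime}} = \underbrace{6ND}_{\text{training}} + \underbrace{2N\,T_{\text{infer}}}_{\text{inference}}$$
where $T_{\text{infer}}$ is the total number of tokens the model will ever serve over its deployed life. For an internal research model, $T_{\text{infer}}$ is small and Chinchilla is right. For a model serving a billion users, $T_{\text{infer}}$ can dwarf $D$ by 100×, and the optimum shifts hard toward a smaller, more heavily trained model.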
| Use case | Inference tokens T_infer | Optimal D/N | Why |
|---|---|---|---|
| Research only (one-shot eval) | ~ 10⁹ | ≈ 20 | Chinchilla is correct; train once, evaluate, move on. |
| Internal product (millions of users) | ~ 10¹² | ≈ 50–100 | Mild over-training; inference is meaningful but training still dominates. |
| External API (Llama-3-style) | ~ 10¹⁴ | ≈ 200 | Heavy over-training; inference dominates lifetime FLOPs. |
| MoE serving model (DeepSeek-V3) | ~ 10¹⁴ on active params | ≈ 400 (per active) | MoE makes inference even cheaper per param, justifying even more training. |
DeepSeek-V3 demonstrates the second-order trick. By using a sparse Mixture-of-Experts, the inference cost scales with the active parameter count (37B), not the total (671B). The Chinchilla denominator is the active size, but the storage and gradient cost during training is the total size. The result is a model that behaves like 671B at quality and like 37B at inference cost — and aggressively over-trains the active subnetwork because inference is so cheap per active param.
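The two-bucket accounting is simple enough to check by hand or in a few lines of code. The sketch below assumes a dense model, about $6ND$ FLOPs for training and about $2N$ FLOPs per served token (a forward pass only). The two scenarios and the served-token count are hypothetical, and the comparison says nothing about whether the two models reach the same quality.

```python
def lifetime_flops(n_params, train_tokens, infer_tokens):
    """Return (training, inference) FLOPs: 6*N*D and 2*N*T_infer."""
    training = 6.0 * n_params * train_tokens
    inference = 2.0 * n_params * infer_tokens
    return training, inference


# Hypothetical: two ways to build a model that will serve 1e14 tokens.
T_INFER = 1e14
SCENARIOS = {
    "70B @ 1.4T tokens (20 tok/param)": (70e9, 1.4e12),
    "35B @ 7T tokens (200 tok/param)": (35e9, 7e12),
}

for name, (n, d) in SCENARIOS.items():
    train, infer = lifetime_flops(n, d, T_INFER)
    share = infer / (train + infer)
    print(f"{name:34s} train = {train:.2e}  infer = {infer:.2e}  "
          f"total = {train + infer:.2e}  inference share = {share:.0%}")
```

At this serving volume the inference bucket dominates both totals, so halving $N$ cuts lifetime FLOPs even though the smaller model uses more training compute.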
Engineering Reality and Gotchas
Scaling laws are seductive because they reduce a multi-million-dollar decision to one equation. They earn their warning flags because the equation is only as good as the sweep that fitted it. Five production failure modes:
- The learning-rate schedule must scale with D. This is the single mistake that put Kaplan's 2020 fit off by a factor of two. If your small runs decay LR over 100k steps and your big runs decay over 1M steps, then runs with more tokens actually use their tokens — your fit will recover the right exponents. If LR decays over the same step count regardless of D, runs with more tokens train at near-zero LR for most of their schedule, and your fit will systematically under-value tokens. Hoffmann fixed this; every modern lab inherits the fix.
- Sweep small or sweep wrong. A scaling-law fit built from small runs and extrapolated to a target 100× larger in parameters is a two-order-of-magnitude leap. The fitted exponents drift slowly with the regime they were fit in. Frontier labs anchor their sweep with at least one run within 10× of the target: the last 1% of cost that turns a projection into a verified prediction.
- The 6 N D approximation breaks for MoE. A mixture-of-experts model has $N_{\text{total}}$ total parameters but only routes through $N_{\text{active}}$ per token. Training FLOPs scale with $N_{\text{active}}$, not $N_{\text{total}}$. Plugging total params into a dense Chinchilla formula will tell you to train DeepSeek-V3 on 13T tokens; the right answer (active-params Chinchilla, then over-train for inference) is closer to 15T. The two answers happen to be close, but only because of a coincidence of constants, not because of theory.
- Data quality changes A, not α. A cleaner data mixture lowers $A$ and $B$ (the model reaches lower loss at every (N, D)) but typically leaves the exponents $\alpha$ and $\beta$ nearly unchanged. The practical consequence is large: a 10% improvement in data curation is equivalent to a 2–3× compute multiplier without changing where the compute-optimum lives. Section 8 of this chapter and Chapter 8 itself live on this margin.
- Loss is not capability. Chinchilla predicts validation loss, but the model you ship is judged on downstream benchmarks. The relationship between loss drop and benchmark gain is monotone but heavily non-linear (see section 9.3 on emergent abilities). A 0.05-nat loss drop can be worth 5 MMLU points in one regime and 15 in another. Chinchilla tells you how to minimise loss; it does not tell you whether that loss drop will unlock the capability you actually need. That is the next layer of the scaling-laws story.
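The first failure mode in the list above, a decay horizon that does not track $D$, can be made concrete with a toy schedule. The sketch below assumes a cosine decay to a fixed floor with no warmup. The peak learning rate, floor ratio, batch size and token counts are hypothetical values chosen only to show the effect.

```python
import math

PEAK_LR = 3e-4            # hypothetical peak learning rate
FLOOR_RATIO = 0.1         # hypothetical: decay to 10% of the peak
TOKENS_PER_STEP = 2**21   # hypothetical batch size in tokens


def cosine_lr(step, horizon):
    """Cosine decay from the peak to the floor over `horizon` steps, then flat."""
    progress = min(step / horizon, 1.0)
    floor = PEAK_LR * FLOOR_RATIO
    return floor + 0.5 * (PEAK_LR - floor) * (1.0 + math.cos(math.pi * progress))


def steps_for(tokens):
    """Optimizer steps needed to consume `tokens` at the fixed batch size."""
    return math.ceil(tokens / TOKENS_PER_STEP)


def share_at_floor(run_steps, horizon):
    """Fraction of a run that happens after the decay horizon has ended."""
    return max(run_steps - horizon, 0) / run_steps


short_run = steps_for(20e9)    # a small-D run in the sweep
long_run = steps_for(200e9)    # a run with 10x more tokens

# "matched": the long run decays over its own length.
# "mismatched": the long run reuses the short run's decay horizon.
for label, horizon in (("matched", long_run), ("mismatched", short_run)):
    mid_lr = cosine_lr(long_run // 2, horizon)
    print(f"{label:10s} horizon = {horizon:6d} steps  "
          f"LR at midpoint = {mid_lr:.2e}  "
          f"share of run at LR floor = {share_at_floor(long_run, horizon):.0%}")
```

With the mismatched horizon the long run spends most of its steps at the floor learning rate, so its extra tokens contribute little and a fit across such runs would under-value data.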
The one sentence to carry forward: Chinchilla is not a number, it is a procedure — and the only frontier models worth their training cost are the ones that actually ran it.