The Chinchilla recipe says train a 70 B model on 1.4 T tokens. Llama 3 trained an 8 B model on 15 T tokens — ten times smaller, ten times more data. Both teams read the same scaling-law paper. Neither is wrong. They are solving different problems: Chinchilla minimises training compute; Llama 3 minimises training plus a decade of inference. This section derives the inference-aware optimum and shows why every modern foundation model is dramatically over-trained relative to Hoffmann et al. 2022.
The Real Problem: Chinchilla Ignores Inference
Hoffmann et al. (2022) asked the right question for 2022: given a fixed training compute budget, what model size and token count minimise final loss? Their answer (roughly 20 tokens per parameter, scaling $N$ and $D$ together) was a revolution. Gopher (280 B params, 300 B tokens) was severely under-trained. Chinchilla (70 B params, 1.4 T tokens) beat it at one-quarter the parameter count and the same training compute. The community rebuilt every training plan around the new recipe.
Then a different question became urgent. A foundation model is not a research artefact; it is the engine of a product that will serve billions of tokens per day for years. ChatGPT alone serves on the order of a quadrillion tokens of inference per year. Llama 3 is downloaded and re-served on millions of devices. Suddenly the cost equation is no longer $C_{\text{train}}$ alone; it is $C_{\text{train}} + C_{\text{inf}}$, and the second term can dwarf the first.
| Quantity | Chinchilla framing | Production framing |
|---|---|---|
| Objective | Minimise training FLOPs at fixed loss | Minimise lifetime FLOPs at fixed loss |
| Cost per token (train) | 6 N | 6 N |
| Cost per token (infer) | Not counted | 2 N per query token |
| Lifetime workload | D training tokens | D train + D_inf served |
| Model size N | Grows with budget | Shrinks if inference is heavy |
| Tokens per parameter | ≈ 20 | Hundreds to ~2,000 for modern products |
Intuition: Training Is Paid Once, Inference Forever
Imagine you are running a restaurant. You can buy a giant 8-burner stove for $100 000, or a small 2-burner stove for $20 000. The big stove cooks faster — fewer prep hours per meal. The small stove is slower but cheaper. Which one wins?
It depends entirely on how many meals you plan to serve. If you cook a single banquet and close the restaurant forever, the big stove is foolish — you pay $100 000 to save twenty hours of prep. If you serve ten thousand meals a year for a decade, the big stove pays for itself in saved labour.
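Training investment is the stove. Over-training a small model is the big stove: it costs more up front, because a small $N$ needs far more training data to reach a given quality, but every meal afterwards is cheap, because each inference token costs only $2N$ FLOPs. The Chinchilla-optimal large model is the small stove: it is the cheapest way to reach that quality once, but every inference token costs $2N$ FLOPs at a much larger $N$, and you will serve trillions of them. If you plan to serve a billion users, you want the smallest possible $N$ that can still reach the target quality, and you compensate by training it on enormous amounts of data.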
The Inference-Aware Cost Equation
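We start from Chinchilla's loss law:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

with fitted constants $E$ (irreducible loss), $A$, $B$ and exponents $\alpha$, $\beta$. We do not change this: the loss surface is what it is. What changes is what we minimise on top of it.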
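As a concrete handle on the loss law, here is a minimal Python sketch. The numeric constants are an assumption: they are the parametric fit published by Hoffmann et al. (2022), and the constants behind the worked numbers later in this section may differ. `chinchilla_loss` is a hypothetical helper name.

```python
# Assumed constants: the published Hoffmann et al. (2022) parametric fit.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta, in nats."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


# Loss predicted for a 70 B model trained on 1.4 T tokens under these constants.
print(f"{chinchilla_loss(70e9, 1.4e12):.3f}")
```

The loss can never drop below `E`, and it falls monotonically as either `n_params` or `n_tokens` grows, which is what makes the iso-loss trade-off between the two possible.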
The training FLOPs for processing $D$ tokens follow the standard $6N$-per-token rule ($2N$ for the forward pass plus $4N$ for the backward pass; activation recompute, if used, adds to this):

$$C_{\text{train}} = 6ND$$

Inference does only one forward pass per token, no backward, no optimiser update:

$$C_{\text{inf}} = 2ND_{\text{inf}}$$

where $D_{\text{inf}}$ is the total number of tokens (input prompts + generated outputs) the model will serve over its lifetime, summed across every query, every user, every session, every redeployment. The lifetime cost we actually pay is:

$$C_{\text{total}} = C_{\text{train}} + C_{\text{inf}} = 6ND + 2ND_{\text{inf}}$$
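The three cost terms translate directly into code. This is a small sketch using only the 6N and 2N approximations stated above; the function names are illustrative, not from any library.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost: 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens


def inference_flops(n_params: float, served_tokens: float) -> float:
    """Approximate inference cost: 2 FLOPs per parameter per served token."""
    return 2.0 * n_params * served_tokens


def lifetime_flops(n_params: float, n_tokens: float, served_tokens: float) -> float:
    """Training plus lifetime inference cost, in FLOPs."""
    return train_flops(n_params, n_tokens) + inference_flops(n_params, served_tokens)


# An 8 B model trained on 15 T tokens and serving 2e15 tokens over its lifetime.
print(f"{lifetime_flops(8e9, 15e12, 2e15):.2e} FLOPs")
```

Setting `served_tokens` to zero recovers the training-only cost that Chinchilla optimises.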
Constrained minimisation
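We want a fixed quality $L(N, D) = L^*$ and we want to minimise $C_{\text{total}}$ over $(N, D)$. The constraint traces a curve in the $(N, D)$ plane, the iso-loss curve, and we slide along it looking for the lowest $C_{\text{total}}$.

Solve for $D$ from the iso-loss equation:

$$D(N) = \left(\frac{B}{L^* - E - A/N^{\alpha}}\right)^{1/\beta}$$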
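A short sketch of the iso-loss solve, again assuming the published Hoffmann et al. (2022) constants (the section's own constants may differ). `tokens_for_loss` is a hypothetical helper; it returns infinity when the model is too small to reach the target at any token count.

```python
import math

# Assumed constants: the published Hoffmann et al. (2022) parametric fit.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def tokens_for_loss(n_params: float, target_loss: float) -> float:
    """Training tokens D needed for a model of size N to reach target_loss."""
    # Loss budget left for the data term once E and the capacity term are paid.
    budget = target_loss - E - A / n_params**ALPHA
    if budget <= 0:
        return math.inf  # capacity-bound: no amount of data reaches the target
    return (B / budget) ** (1.0 / BETA)


for n in (1e9, 8e9, 70e9):
    print(f"N = {n:.0e}: D = {tokens_for_loss(n, 2.0):.3e}")
```

The infinite branch is the quality floor discussed later: below a certain size, the iso-loss curve simply does not exist.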
Substitute into the cost and take the derivative with respect to $N$. After some algebra (Sardana et al. 2024, eqn 5), the optimum satisfies:

$$\frac{dC_{\text{total}}}{dN} = 6D + 6N\frac{dD}{dN} + 2D_{\text{inf}} = 0$$

The first two terms are exactly Chinchilla's training-only gradient; they vanish at the Chinchilla optimum. The third term, $2D_{\text{inf}}$, is always positive. So at the Chinchilla point the total-cost gradient is positive, meaning $N$ is too large. We have to shrink $N$ until the negative term $6N\,dD/dN$ (more data is needed) balances the positive inference term.
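The algebra can be filled in by implicit differentiation of the iso-loss constraint. This is a sketch, assuming the parametric loss $L = E + A/N^{\alpha} + B/D^{\beta}$ and the lifetime cost $6ND + 2ND_{\text{inf}}$ used in this section:

$$
\begin{aligned}
\frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} = L^* - E
\quad&\Rightarrow\quad
\frac{dD}{dN} = -\frac{\alpha A}{\beta B}\,\frac{D^{\beta+1}}{N^{\alpha+1}} < 0, \\
\frac{dC_{\text{total}}}{dN}
= 6D + 6N\frac{dD}{dN} + 2D_{\text{inf}}
&= 6D\left(1 - \frac{\alpha A}{\beta B}\,\frac{D^{\beta}}{N^{\alpha}}\right) + 2D_{\text{inf}}.
\end{aligned}
$$

With $D_{\text{inf}} = 0$ the bracket must vanish, which gives the training-only balance $\alpha A / N^{\alpha} = \beta B / D^{\beta}$. With $D_{\text{inf}} > 0$ the bracket has to go negative, which only happens at smaller $N$ and larger $D$.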
The shifted optimum, in closed form
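Define the inference-to-training cost ratio at the optimum:

$$r = \frac{C_{\text{inf}}}{C_{\text{train}}} = \frac{2ND_{\text{inf}}}{6ND} = \frac{D_{\text{inf}}}{3D}$$

Sardana et al. show that the inference-aware optimum sits at:

$$\frac{D^{\beta}}{N^{\alpha}} = \frac{\beta B}{\alpha A}\,\bigl[\,1 + r\,\bigr]$$

Two readings. First, when $r = 0$ (no inference), the condition collapses to $D^{\beta}/N^{\alpha} = \beta B/(\alpha A)$, exactly Chinchilla's training-only balance. Second, as $r$ grows, the bracket grows without bound, so along the iso-loss curve $N$ shrinks, $D$ grows, and the tokens-per-parameter ratio grows without bound. A modern product with $r \gg 1$ sits at hundreds to thousands of tokens per parameter, the regime that Llama 3 and Phi-3 occupy.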
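The optimality condition can be checked numerically. The sketch below finds the cost minimum along the iso-loss curve by ternary search over $\log N$ and compares $D^{\beta}/N^{\alpha}$ against $\frac{\beta B}{\alpha A}(1 + r)$ with $r = D_{\text{inf}}/(3D)$, the form that follows from setting $6D + 6N\,dD/dN + 2D_{\text{inf}} = 0$. The constants are an assumption (the published Hoffmann et al. 2022 fit), and the helper names are hypothetical.

```python
import math

# Assumed constants: the published Hoffmann et al. (2022) parametric fit.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def tokens_for_loss(n: float, target: float) -> float:
    """D on the iso-loss curve; valid only above the capacity floor."""
    return (B / (target - E - A / n**ALPHA)) ** (1.0 / BETA)


def optimal_n(target: float, d_inf: float, iters: int = 200) -> float:
    """Ternary search over log N; relies on the cost being unimodal in log N."""
    n_floor = (A / (target - E)) ** (1.0 / ALPHA)  # below this, no D suffices
    lo, hi = math.log(n_floor * 1.001), math.log(1e13)

    def cost(x: float) -> float:
        n = math.exp(x)
        return n * (6.0 * tokens_for_loss(n, target) + 2.0 * d_inf)

    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    return math.exp((lo + hi) / 2)


target, d_inf = 2.0, 2e15
n = optimal_n(target, d_inf)
d = tokens_for_loss(n, target)
r = d_inf / (3.0 * d)
lhs = d**BETA / n**ALPHA
rhs = (BETA * B) / (ALPHA * A) * (1.0 + r)
print(f"N* = {n:.3e}, D* = {d:.3e}, r = {r:.2f}, lhs = {lhs:.4f}, rhs = {rhs:.4f}")
```

Because $r$ itself depends on $D$, the condition is implicit rather than a formula to evaluate once; a one-dimensional search like this is the simplest way to solve it.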
Manual Numerical Walkthrough
Two configurations, same target loss, one pure research and one Llama-3-scale product. All numbers derived by hand from the equations above.
Chinchilla vs Llama-3-style at L* = 2.00
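Setup. Fix the target loss at $L^* = 2.00$ nats. Constants: $E$, $A$, $B$, $\alpha$, $\beta$ as fitted for the loss law above. For the iso-loss curve, $A/N^{\alpha} + B/D^{\beta} = L^* - E$.
Configuration 1: Chinchilla research point. $D_{\text{inf}} = 0$ (the model is never served, just published). Chinchilla's closed form gives $N \approx 65$ B parameters at this loss with $D \approx 1.3$ T tokens, about 20 tokens per parameter.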
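Verify the loss: $L(N, D) = E + A/N^{\alpha} + B/D^{\beta} \approx 2.00$ ✓
Training cost: $6ND = 6 \times (65 \times 10^{9}) \times (1.3 \times 10^{12}) \approx 5.1 \times 10^{23}$ FLOPs. No inference. Lifetime cost ≈ training cost.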
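Configuration 2: Llama-3-scale product point. $D_{\text{inf}} = 2 \times 10^{15}$ tokens (two quadrillion served, roughly the order of a Llama-3 class deployment over five years). At $N = 8$ B, the capacity term is $A/N^{\alpha}$. Residual loss budget for data: $L^* - E - A/N^{\alpha}$. Required tokens: $D = \left(B / (L^* - E - A/N^{\alpha})\right)^{1/\beta} \approx 15$ T tokens, the size of Llama 3 8B's training set.
Tokens per parameter: $D/N = (15 \times 10^{12}) / (8 \times 10^{9}) = 1875$, about 94× Chinchilla's ratio.
Training cost: $6ND = 7.2 \times 10^{23}$ FLOPs (about 40% more than Chinchilla; over-training is not free). Inference cost: $2ND_{\text{inf}} = 3.2 \times 10^{25}$ FLOPs (44× larger than training). Lifetime total: $\approx 3.3 \times 10^{25}$ FLOPs.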
The counterfactual. Serve the 65 B Chinchilla model on the same 2 × 10¹⁵ inference workload. Inference cost: $2ND_{\text{inf}} = 2 \times (65 \times 10^{9}) \times (2 \times 10^{15}) = 2.6 \times 10^{26}$ FLOPs, 8× more than the 8 B over-trained model. Lifetime cost: $\approx 2.6 \times 10^{26}$ FLOPs.
Reading the numbers. Same loss target. Same inference workload. Chinchilla lifetime cost $\approx 2.6 \times 10^{26}$ FLOPs vs Llama-3-style $\approx 3.3 \times 10^{25}$ FLOPs: the over-trained small model is eight times cheaper to own over its lifetime. That single ratio is why no frontier lab ships a Chinchilla-optimal model anymore.
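The FLOP arithmetic of the walkthrough can be reproduced in a few lines. This sketch uses only the model sizes, token counts, and serving workload quoted above, and takes it as given that both configurations reach the same loss.

```python
D_INF = 2e15  # tokens served over the deployment lifetime (from the walkthrough)


def lifetime_flops(n_params: float, n_tokens: float, served: float) -> float:
    """6ND training cost plus 2N * D_inf inference cost."""
    return 6.0 * n_params * n_tokens + 2.0 * n_params * served


chinchilla = lifetime_flops(65e9, 1.3e12, D_INF)   # 65 B model, ~20 tokens/param
overtrained = lifetime_flops(8e9, 15e12, D_INF)    # 8 B model, 1875 tokens/param

ratio = chinchilla / overtrained
inference_share = (2.0 * 8e9 * D_INF) / (6.0 * 8e9 * 15e12)

print(f"Chinchilla-style lifetime cost: {chinchilla:.2e} FLOPs")
print(f"Over-trained lifetime cost:     {overtrained:.2e} FLOPs")
print(f"Lifetime cost ratio:            {ratio:.1f}x")
print(f"Inference / training (8 B):     {inference_share:.0f}x")
```

Nearly all of the eightfold gap comes from the inference term, which scales with the ratio of the two model sizes (65 / 8).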
Visualizing the Shifted Optimum
The widget below traces both cost curves as a function of $N$ at a fixed loss target. The amber curve is training cost only; its minimum is the Chinchilla point. The green curve adds inference. As you drag the inference-tokens slider, watch the green curve's minimum migrate left to smaller models, and the tokens-per-parameter ratio climb past 100×, 200×, then beyond.
Plain Python: Lifetime-Cost Optimisation
Here is the full search, written as a plain Python loop. No framework, no GPU: just the iso-loss algebra and a sweep over $N$. Some form of this script is a natural first step for an infra team before committing to a pre-training run.
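A minimal sketch of such a sweep follows. It is not the original listing: the constants are an assumption (the published Hoffmann et al. 2022 fit, which will not necessarily reproduce the walkthrough's figures), and `sweep`, the grid bounds, and the step count are illustrative choices.

```python
# Assumed constants: the published Hoffmann et al. (2022) parametric fit.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def sweep(target_loss, d_inf, n_min=1e9, n_max=1e12, steps=600):
    """Return (N, D, lifetime FLOPs) with the lowest cost on the iso-loss curve."""
    best = None
    for i in range(steps):
        # Log-spaced candidate model sizes between n_min and n_max.
        n = n_min * (n_max / n_min) ** (i / (steps - 1))
        budget = target_loss - E - A / n**ALPHA
        if budget <= 0:
            continue  # this N cannot reach the target at any D
        d = (B / budget) ** (1.0 / BETA)
        cost = 6.0 * n * d + 2.0 * n * d_inf
        if best is None or cost < best[2]:
            best = (n, d, cost)
    return best


for served in (0.0, 1e13, 1e14, 2e15):
    n, d, cost = sweep(2.0, served)
    print(f"D_inf = {served:.0e}: N = {n:.2e}, D = {d:.2e}, "
          f"tokens/param = {d / n:.0f}, cost = {cost:.2e}")
```

As the served-token count rises, the selected model shrinks toward the capacity floor and the tokens-per-parameter ratio climbs, which is the qualitative behaviour the section describes.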
PyTorch: Searching the (N, D, D_inf) Grid
The same logic, vectorised across hundreds of candidate model sizes at once. This pattern shows up inside frontier labs' planning pipelines, where the search runs over many candidate architectures (dense, MoE, different attention variants) at every plausible quality target.
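A vectorised sketch of the same search in PyTorch, evaluating every candidate size against several serving workloads in one broadcasted expression. It is an illustration under assumptions, not a lab's pipeline: the constants are the published Hoffmann et al. (2022) fit, and `search_grid` is a hypothetical name.

```python
import math
import torch

# Assumed constants: the published Hoffmann et al. (2022) parametric fit.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def search_grid(target_loss, d_inf_values, n_min=1e9, n_max=1e12, steps=512):
    """For each serving workload, return the cost-minimising (N, D, FLOPs)."""
    n = torch.logspace(math.log10(n_min), math.log10(n_max), steps,
                       dtype=torch.float64)                      # shape (S,)
    budget = target_loss - E - A / n**ALPHA                       # shape (S,)
    # Infeasible sizes (budget <= 0) get D = inf so they never win the argmin.
    d = torch.where(budget > 0,
                    (B / budget.clamp(min=1e-12)) ** (1.0 / BETA),
                    torch.full_like(n, float("inf")))
    d_inf = torch.as_tensor(d_inf_values, dtype=torch.float64).unsqueeze(1)  # (K, 1)
    cost = 6.0 * n * d + 2.0 * n * d_inf                          # broadcasts to (K, S)
    best = cost.argmin(dim=1)                                     # shape (K,)
    return n[best], d[best], cost.gather(1, best.unsqueeze(1)).squeeze(1)


n_opt, d_opt, c_opt = search_grid(2.0, [0.0, 1e14, 2e15])
for n_i, d_i, c_i in zip(n_opt.tolist(), d_opt.tolist(), c_opt.tolist()):
    print(f"N = {n_i:.2e}, D = {d_i:.2e}, tokens/param = {d_i / n_i:.0f}, cost = {c_i:.2e}")
```

Extending the grid to several architectures would mean adding another leading dimension for per-architecture cost coefficients; that part is left out here.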
At Massive Scale: Llama 3, Phi, and the Over-Training Era
Nearly every open-weight foundation model released since 2023 is dramatically over-trained relative to Chinchilla. The pattern is not accidental: it is what the inference-aware optimum predicts.
| Model | Params | Training tokens | Tokens / param | Chinchilla ratio |
|---|---|---|---|---|
| Chinchilla (DeepMind, 2022) | 70 B | 1.4 T | 20× | 1× (reference) |
| Llama 2 7B (Meta, 2023) | 7 B | 2 T | 286× | 14× |
| Mistral 7B (Mistral, 2023) | 7 B | ≈ 8 T (unofficial estimate; not disclosed) | ≈ 1140× | ≈ 57× |
| Llama 3 8B (Meta, 2024) | 8 B | 15 T | 1875× | 94× |
| Phi-3-mini (Microsoft, 2024) | 3.8 B | 3.3 T | 870× | 43× |
| Llama 3 70B | 70 B | 15 T | 214× | 11× |
| DeepSeek-V3 (37 B activated) | 37 B act / 671 B tot | 14.8 T | 400× (act) | 20× (act) |
Read down the table. The smaller the model, the more over-trained it tends to be, and the more dominant inference is in its lifetime cost. The trend is not strictly monotonic (Phi-3-mini sits below Llama 3 8B), but every sub-10 B model here is far past Chinchilla's ratio. Phi-3-mini at 3.8 B parameters is designed to run on a phone; its inference-to-training ratio is extreme, and Microsoft pushed its training to 3.3 T tokens to match. Llama 3 8B is designed to run on a single A100; same logic, 15 T tokens.
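The ratio columns are simple arithmetic on the parameter and token counts. The sketch below recomputes them for the rows whose figures are stated outright in the table, taking 20 tokens per parameter as the Chinchilla reference.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20.0

# (parameters, training tokens) as listed in the table above.
MODELS = {
    "Chinchilla": (70e9, 1.4e12),
    "Llama 2 7B": (7e9, 2e12),
    "Llama 3 8B": (8e9, 15e12),
    "Phi-3-mini": (3.8e9, 3.3e12),
    "Llama 3 70B": (70e9, 15e12),
    "DeepSeek-V3 (activated)": (37e9, 14.8e12),
}


def over_training(params: float, tokens: float) -> tuple[float, float]:
    """Return (tokens per parameter, multiple of the Chinchilla ratio)."""
    tokens_per_param = tokens / params
    return tokens_per_param, tokens_per_param / CHINCHILLA_TOKENS_PER_PARAM


for name, (params, tokens) in MODELS.items():
    tpp, multiple = over_training(params, tokens)
    print(f"{name:<26} {tpp:7.0f} tokens/param  {multiple:5.1f}x Chinchilla")
```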
What changes at scale
- Data quality becomes the bottleneck. Once you commit to 15 T tokens, you run out of pristine web text. Modern recipes shift heavily to synthetic data, curriculum filtering, and aggressive deduplication. The scaling law assumes IID tokens of fixed quality; at 1875 tokens per parameter, that assumption strains.
- Training cost grows faster than Chinchilla predicts. The 6N rule counts model FLOPs but ignores data-pipeline cost, checkpoint I/O, and failed runs. Over-training a small model on 15 T tokens is cheap in FLOPs, expensive in wall-clock and dataloader engineering.
- Inference workload is uncertain at training time. You commit to $(N, D)$ before you know $D_{\text{inf}}$. The decision is a bet on adoption. Llama 3 8B's 15 T tokens is the right answer if the model gets billions of inference calls; it is wasteful if the model never ships. This is why frontier labs train a family of sizes (8 B, 70 B, 405 B): to hedge across adoption outcomes.
- Distillation flips the equation. If a 405 B model produces synthetic data that trains an 8 B model, the large model's training is amortised across all downstream students. The inference workload that matters becomes the student's, not the teacher's. Modern LLM recipes are increasingly teacher-student pipelines optimised end-to-end on inference cost.
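The adoption bet in the list above can be made concrete as a break-even serving volume: the number of served tokens at which an over-trained small model's extra training cost is repaid by its cheaper inference. The sketch below uses the 6N and 2N rules and the two walkthrough configurations, and assumes both reach the same loss; `break_even_served_tokens` is a hypothetical helper.

```python
def break_even_served_tokens(n_small, d_small, n_large, d_large):
    """Served tokens at which the two lifetime costs are equal.

    Solves 6*Ns*Ds + 2*Ns*x = 6*Nl*Dl + 2*Nl*x for x. A result <= 0 means
    the small model is already cheaper before any token is served.
    """
    extra_training = 6.0 * (n_small * d_small - n_large * d_large)
    saving_per_served_token = 2.0 * (n_large - n_small)
    return extra_training / saving_per_served_token


# 8 B model on 15 T tokens vs 65 B model on 1.3 T tokens.
x = break_even_served_tokens(8e9, 15e12, 65e9, 1.3e12)
print(f"Break-even at about {x:.2e} served tokens")
```

Under these figures the break-even sits near two trillion served tokens, far below the quadrillion-scale workloads discussed earlier, so the bet only loses if the model is barely deployed.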
Engineering Reality: When Chinchilla Still Wins
Inference-aware over-training is not always the right answer. Four regimes where Chinchilla optimality still rules:
| Regime | Why Chinchilla wins | Typical model |
|---|---|---|
| Research checkpoint, never served | D_inf = 0, the inference term vanishes | Pre-publication ablations |
| Fine-tuning a base model | Base model already trained, the choice is downstream-only | Domain SFT on a pre-trained 70 B |
| Capacity-bound task (long-context, high recall) | Smaller N hits a quality floor no D can fix | Long-context 1M-token retrieval |
| Pre-training data is the bottleneck, not FLOPs | Cannot get more tokens at acceptable quality | Specialised domains: medical, legal |
And four engineering gotchas that the clean equation hides:
- The 6N and 2N rules are approximations. Real training is closer to 6.something N depending on activation recompute strategy; real inference is 2 N plus KV-cache read/write cost that grows with sequence length. At 128k context, attention FLOPs alone can double inference cost. The cleanest analysis treats the constants as functions of sequence length, not literals.
- Memory is not in the equation. A 70 B model needs ~140 GB of weights in BF16 alone. Inference cost in FLOPs is one thing; can you fit it on one GPU? Memory bandwidth, not FLOPs, is often the binding constraint at inference. The inference-aware law gives you the FLOP-optimal frontier; the memory-optimal frontier can sit elsewhere.
- Inference workload distribution matters. 2 × 10¹⁵ tokens spread across 10 ms-latency mobile chat is different from 2 × 10¹⁵ tokens in nightly batch jobs. The latter can use a larger model at higher utilisation; the former forces a smaller one regardless of training math. The scaling law is necessary but not sufficient.
- Quantisation changes the cost ratio. If you serve an 8 B model in INT4, the effective cost per inference token can fall by up to roughly 4× relative to BF16 (the nominal operation count is unchanged, but each operation and each weight read is cheaper), so the inference-aware optimum shifts back toward larger models and smaller over-training ratios. Llama 3's 15 T-token recipe was tuned around BF16 serving; an INT4-served deployment might prefer a slightly different (N, D).
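The memory point in the list above is easy to quantify for weights alone. This sketch counts only parameter storage (no KV cache, activations, or runtime overhead) and uses decimal gigabytes; `weight_memory_gb` is an illustrative name.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for n_params, label in ((8e9, "8 B"), (70e9, "70 B")):
    for bits, fmt in ((16, "BF16"), (8, "INT8"), (4, "INT4")):
        print(f"{label:>5} in {fmt}: {weight_memory_gb(n_params, bits):6.1f} GB")
```

A 70 B model in BF16 needs about 140 GB for weights and so cannot sit on a single 80 GB accelerator, while an 8 B model fits with room to spare at any of these precisions; the FLOP-optimal and memory-optimal choices therefore need to be checked separately.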
The deepest lesson. Chinchilla taught us that training compute is a constrained resource and the optimum is not where intuition ("just make N bigger") puts it. Sardana et al. taught us the same lesson at a higher level: lifetime compute is a constrained resource, and the optimum is again not where the previous generation's intuition ("just follow Chinchilla") puts it. Every time the cost frontier changes, the optimum moves. The next shift — likely driven by hardware accelerators with non-linear cost-per-FLOP, or by amortised distillation pipelines — will move it again.