Boo-AI — Master Artificial Intelligence by Building from Scratch

Suppose you have ten million dollars of GPU time, six months of calendar time, and a corpus of fifteen trillion training tokens. How big should the model be? A 7B-parameter model trained on every token you have? A 700B model trained for a single epoch? You could split the budget a hundred different ways, and only one of those splits gives you the lowest possible validation loss at the end. The Chinchilla scaling laws tell you which one.

The thesis of this section. For a fixed training compute budget $C$, validation loss is a smooth, predictable function of model size $N$ and training tokens $D$. The function has a unique minimum on the constraint $C = 6 N D$, and the optimum lies near $D \approx 20 N$. Models that ignore the law (GPT-3, Gopher) spent most of their FLOPs in the wrong place. Models that respect it or deliberately train past it (Chinchilla, Llama-3, DeepSeek-V3) extract far more capability from the same compute.

The Real Problem: How Big Should the Model Be?

Before Chinchilla, the field believed the answer to "how big should the model be?" was simple: as big as you can afford. Kaplan et al. (2020) at OpenAI published an influential set of scaling laws fit to GPT-2-era runs and concluded that the compute-optimal allocation grew the model size much faster than the training token count — specifically that $N_{\text{opt}} \propto C^{0.73}$ and $D_{\text{opt}} \propto C^{0.27}$ . The practical rule that followed: build the largest possible model and train it for relatively few tokens.

Three of the most important models of the era followed this advice to the letter. GPT-3 (175B) trained on 300B tokens — 1.7 tokens per parameter. Gopher (280B) trained on 300B tokens — 1.1 tokens per parameter. Megatron-Turing NLG (530B) trained on 270B tokens — 0.5 tokens per parameter. All three were giants. All three, it turns out, were dramatically undertrained.

In 2022 a small DeepMind team led by Jordan Hoffmann ran a brutally thorough sweep: 400+ training runs at log-spaced model sizes from 70M to 16B parameters, each trained for a wide range of token counts. They re-fit the scaling laws on this richer data and found Kaplan's exponents were wrong. The correct rule was $N_{\text{opt}} \propto C^{0.50}$ and $D_{\text{opt}} \propto C^{0.50}$ — roughly equal scaling. They then trained a single 70B model on 1.4T tokens — at the same compute as the 280B Gopher — and named it Chinchilla. The smaller Chinchilla beat the bigger Gopher on essentially every benchmark.

Model	Params	Tokens	D / N	Compute (FLOPs)	MMLU
GPT-3 (2020)	175B	300B	1.7	3.1 × 10²³	43.9%
Gopher (2021)	280B	300B	1.1	5.8 × 10²³	60.0%
Chinchilla (2022)	70B	1.4T	20.0	5.8 × 10²³	67.6%
Llama-3 70B (2024)	70B	15T	214	≈ 6.3 × 10²⁴	82.0%
DeepSeek-V3 (2024)	37B active / 671B total	14.8T	400 (active)	≈ 3.3 × 10²⁴ active	~88%

The Chinchilla row is the headline. Same compute as Gopher, a quarter of the parameters, dramatically better benchmarks. The only thing that changed was where the FLOPs were spent. Every model below Chinchilla in the table follows the Chinchilla rule or deliberately overshoots it.
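The D / N and compute columns can be re-derived from the parameter and token counts alone. The sketch below assumes the $C \approx 6 N D$ rule of thumb used throughout this section; the helper names are illustrative, and published compute figures use more detailed accounting, so they can differ somewhat from the $6 N D$ estimate (Gopher's reported 5.8 × 10²³ versus about 5.0 × 10²³ here).

```python
# (name, parameters N, training tokens D) as quoted in the table above.
# DeepSeek-V3 is entered with its 37B *active* parameters, matching the table.
MODELS = [
    ("GPT-3", 175e9, 300e9),
    ("Gopher", 280e9, 300e9),
    ("Chinchilla", 70e9, 1.4e12),
    ("Llama-3 70B", 70e9, 15e12),
    ("DeepSeek-V3 (active)", 37e9, 14.8e12),
]


def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter (the D / N column)."""
    return n_tokens / n_params


def approx_train_flops(n_params, n_tokens):
    """Rule-of-thumb training compute, C ~= 6 N D."""
    return 6.0 * n_params * n_tokens


for name, n, d in MODELS:
    print(f"{name:22s} D/N = {tokens_per_param(n, d):6.1f}"
          f"   6ND = {approx_train_flops(n, d):.2e} FLOPs")
```

Under this approximation a 70B model trained on 15T tokens comes out at about 6.3 × 10²⁴ FLOPs, roughly ten times the Chinchilla budget.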

Why the field got this wrong for two years. Kaplan's sweep used a fixed learning-rate schedule that decayed to zero at a fixed step count, which means runs with more tokens did not actually get to fully use those tokens — the LR was already near zero by the end. Once Hoffmann ran the sweep with the LR schedule properly scaled to each run's token count, the exponents shifted. Scaling-law fits are extraordinarily sensitive to optimiser hygiene; this is the single most common reason different labs publish different exponents.
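The schedule effect is easy to see with a toy example. The sketch below assumes a plain cosine decay; the function names, the peak learning rate, and the step counts are all illustrative choices, not the schedules used in either paper.

```python
import math


def cosine_lr(step, horizon_steps, peak_lr=3e-4, floor_lr=0.0):
    """Cosine decay from peak_lr to floor_lr over horizon_steps, flat afterwards."""
    progress = min(step / horizon_steps, 1.0)
    return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * progress))


def fraction_after_decay(total_steps, horizon_steps):
    """Share of a run's steps taken after the schedule has fully decayed."""
    return max(0.0, 1.0 - horizon_steps / total_steps)


# A long run (1M steps) under a horizon fixed at 100k steps:
# most of its tokens are consumed at the floor learning rate.
print(fraction_after_decay(1_000_000, 100_000))    # fixed horizon
# The same run with the horizon matched to its own length.
print(fraction_after_decay(1_000_000, 1_000_000))  # matched horizon
print(cosine_lr(500_000, 100_000), cosine_lr(500_000, 1_000_000))
```

With the fixed horizon, nine tenths of the long run happens at zero learning rate, so the extra tokens cannot lower the loss and a fit over such runs will under-value data.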

Intuition: A Budget Split Between Size and Training

Picture training as buying loss reduction with FLOPs. You have a fixed wallet of $C$ FLOPs. There are two things you can buy with it: a bigger brain (more parameters $N$ ), or more education (more training tokens $D$ ). Both reduce validation loss, but they do so in different ways and at different prices.

Adding parameters lets the model represent functions of higher complexity. Each new parameter is a new coefficient that can be tuned to fit some pattern in the data. The first parameters buy huge loss reduction; the millionth parameter buys very little. The reducible loss from finite model size falls as a power law, $\propto N^{-\alpha}$, so each extra parameter buys less than the last.
Adding training tokens lets the existing parameters get tuned more precisely against more examples. Each new token nudges weights toward better calibration; the reducible loss from finite data again falls as a power law, $\propto D^{-\beta}$.
Spending FLOPs. Every training token through every parameter costs about 6 FLOPs (2 forward + 4 backward). Total cost $C \approx 6 N D$ . So $N$ and $D$ are not free choices — they are tied by the budget.

Now picture the trade-off geometrically. The constraint $6 N D = C$ is a hyperbola in the $(N, D)$ plane. The loss surface $L(N, D)$ is smooth and slopes downward toward large $N$ and large $D$. The Chinchilla insight is that the loss, read off along the constraint hyperbola, traces a curve with a unique lowest point. That lowest point is where the marginal loss reduction per FLOP from adding parameters exactly equals the marginal loss reduction per FLOP from adding tokens.

The intuition extends one step further. If parameters are cheaper-per-loss-reduction at your current allocation, you should shift FLOPs toward more parameters. If tokens are cheaper, shift toward more tokens. The optimum is reached when neither side is cheaper than the other — Lagrange multipliers, dressed up as an engineering principle. GPT-3 was wildly off this balance: it had bought way too many parameters and way too few tokens. The marginal FLOP would have been better spent training the existing 175B parameters longer.
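The imbalance can be checked numerically. The sketch below assumes the fitted Hoffmann constants quoted in the mathematics part of this section and compares the two sides of the balance condition, $\alpha A / N^{\alpha}$ and $\beta B / D^{\beta}$, at GPT-3's allocation. Each side is the loss reduction from a small relative increase in $N$ or $D$, and under $C = 6 N D$ a relative increase in either costs the same compute. The function name is illustrative.

```python
# Fitted constants as quoted in this section.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.336, 0.283


def balance_terms(n_params, n_tokens):
    """Return (alpha*A/N^alpha, beta*B/D^beta); the two are equal at the optimum."""
    size_side = ALPHA * A / n_params ** ALPHA
    data_side = BETA * B / n_tokens ** BETA
    return size_side, data_side


size_side, data_side = balance_terms(175e9, 300e9)  # GPT-3's allocation
print(f"size side = {size_side:.4f}, data side = {data_side:.4f}")
if data_side > size_side:
    print("tokens are the cheaper purchase: shift FLOPs toward more data")
else:
    print("parameters are the cheaper purchase: shift FLOPs toward a bigger model")
```

For GPT-3 the data side comes out almost three times larger than the size side, which is the numerical version of "too many parameters, too few tokens".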

The Mathematics of L(N, D)

Hoffmann et al. fit a parametric form for the validation loss as a function of model size $N$ and training tokens $D$ :

$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$

Each symbol carries a clear meaning. $E$ is the irreducible loss: the entropy of natural text under the chosen tokenizer. No model, no matter how large or how long trained, can do better than $E$. The other two terms are the reducible losses: $A/N^{\alpha}$ is the penalty for having a finite model and $B/D^{\beta}$ is the penalty for having a finite dataset. Both shrink to zero as their argument grows. The Hoffmann fit gives $E \approx 1.69$, $A \approx 406$, $B \approx 411$, $\alpha \approx 0.336$, and $\beta \approx 0.283$.

The training-compute approximation

A transformer forward pass through one token costs about $2 N$ FLOPs (one multiply-add per parameter is two FLOPs), and the backward pass costs about twice that, so the total per-token training cost is $\approx 6 N$ FLOPs. Multiplying by $D$ training tokens gives the standard approximation:

$C \approx 6 \, N \, D$

This is loose at small scales (attention compute matters, embeddings matter) and tight at frontier scales. We treat it as exact for the derivations below.

The compute-optimal allocation

We want to minimise $L(N, D)$ subject to the compute constraint $6 N D = C$ . Form the Lagrangian

$\mathcal{L} = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + \lambda \, (6 N D - C)$

Set the partial derivatives to zero. From $\partial \mathcal{L} / \partial N = 0$ and $\partial \mathcal{L} / \partial D = 0$ we get the elegant equilibrium condition:

$\frac{\alpha A}{N^{\alpha}} = \frac{\beta B}{D^{\beta}}$

In plain language: at the optimum, the model-size term and the data term (weighted by their exponents) contribute equally to the reducible loss. If either side dominates, you have wasted FLOPs. Solving this together with $6 N D = C$ gives the closed-form compute-optimal allocation:

$N^{*}(C) = G \, \left(\frac{C}{6}\right)^{a}, \quad D^{*}(C) = G^{-1} \, \left(\frac{C}{6}\right)^{b}$

where the exponents are $a = \beta / (\alpha + \beta)$ and $b = \alpha / (\alpha + \beta)$ , and the level-set constant $G = (\alpha A / \beta B)^{1 / (\alpha + \beta)}$ . Plugging in the fitted values gives $a \approx 0.457$ and $b \approx 0.543$ — almost symmetric, which is the heart of the Chinchilla claim.
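For readers who want the intermediate algebra, the steps between the Lagrangian and the closed form can be written out as follows. This uses only the symbols already defined above.

```latex
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial N}
  &= -\frac{\alpha A}{N^{\alpha+1}} + 6 \lambda D = 0
  &&\Rightarrow\quad \frac{\alpha A}{N^{\alpha}} = 6 \lambda N D, \\
\frac{\partial \mathcal{L}}{\partial D}
  &= -\frac{\beta B}{D^{\beta+1}} + 6 \lambda N = 0
  &&\Rightarrow\quad \frac{\beta B}{D^{\beta}} = 6 \lambda N D, \\
\text{with } D = \frac{C}{6N}: \quad
\alpha A \, N^{-\alpha}
  &= \beta B \, N^{\beta} \left(\frac{C}{6}\right)^{-\beta}
  &&\Rightarrow\quad N^{\alpha+\beta} = \frac{\alpha A}{\beta B} \left(\frac{C}{6}\right)^{\beta}, \\
N^{*} &= \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}},
  &&\quad D^{*} = \frac{C}{6 N^{*}} = G^{-1} \left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}.
\end{aligned}
```

The first two lines share the same right-hand side, which is where the balance condition comes from; the last line uses $1 - a = b$.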

The 20-tokens-per-parameter rule

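Hoffmann et al. estimated the optimum three ways. The two approaches that read the optimum directly off training curves and iso-FLOP profiles gave $a \approx b \approx 0.5$ and, at the Chinchilla compute ($C \approx 5.76 \times 10^{23}$ FLOPs), an optimum near 67B parameters and roughly 1.4–1.5T tokens, so $D^{*} / N^{*} \approx 20$. This is where the famous "20 tokens per parameter" rule of thumb comes from. The parametric fit used in this section leans further toward data: with the constants above, the closed form gives $D^{*} / N^{*} \approx 55$ at the same compute, and because $b > a$ the ratio drifts slowly upward, from $\sim 30$ at $10^{21}$ FLOPs to $\sim 85$ at $10^{26}$. The minimum is broad, so the 20× allocation sits within a few thousandths of a nat of the parametric optimum, and 20× remains a good first-pass estimate in the regime where most production training happens.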

Manual Numerical Walkthrough

Let us compute the compute-optimal allocation for a single, concrete compute budget by hand, using the Hoffmann constants.

Compute-optimal split at C = 6 × 10²³ FLOPs

Step 1 — Pick the constants. $E = 1.69, A = 406.4, B = 410.7, \alpha = 0.336, \beta = 0.283$ . Budget: $C = 6 \times 10^{23}$ FLOPs (roughly the Gopher/Chinchilla compute).

Step 2 — Compute the exponents. $\alpha + \beta = 0.619$ . $a = \beta / (\alpha + \beta) = 0.283 / 0.619 \approx 0.457$ . $b = \alpha / (\alpha + \beta) = 0.336 / 0.619 \approx 0.543$ .

Step 3 — Compute G. $\alpha A = 0.336 \times 406.4 = 136.55$ . $\beta B = 0.283 \times 410.7 = 116.23$ . Ratio: $136.55 / 116.23 = 1.1748$ . $G = 1.1748^{1/0.619} = 1.1748^{1.615} \approx 1.293$ .

Step 4 — Compute (C/6)^a and (C/6)^b. $C/6 = 10^{23}$ . $(C/6)^{a} = 10^{23 \times 0.457} = 10^{10.51} \approx 3.24 \times 10^{10}$ . $(C/6)^{b} = 10^{23 \times 0.543} = 10^{12.49} \approx 3.09 \times 10^{12}$ .

Step 5 — Compute N* and D*. $N^{*} = G \times (C/6)^{a} = 1.293 \times 3.24 \times 10^{10} \approx 4.19 \times 10^{10}$ ≈ 42 billion parameters. $D^{*} = G^{-1} \times (C/6)^{b} = 3.09 \times 10^{12} / 1.293 \approx 2.39 \times 10^{12}$ ≈ 2.4 trillion tokens.

Step 6 — Sanity check the compute budget. $6 \times N^{*} \times D^{*} = 6 \times 4.19 \times 10^{10} \times 2.39 \times 10^{12} \approx 6.0 \times 10^{23}$ . Matches the target — the algebra checks out.

Step 7 — Compute the predicted loss. $A / N^{*\alpha} = 406.4 / (4.19 \times 10^{10})^{0.336}$ . $(4.19 \times 10^{10})^{0.336} = 10^{10.62 \times 0.336} = 10^{3.57} \approx 3704$ . So $A/N^{*\alpha} \approx 406.4 / 3704 \approx 0.110$ .

Similarly $B / D^{*\beta} = 410.7 / (2.39 \times 10^{12})^{0.283}$ . $(2.39 \times 10^{12})^{0.283} = 10^{12.38 \times 0.283} = 10^{3.50} \approx 3162$ . So $B/D^{*\beta} \approx 410.7 / 3162 \approx 0.130$ .

Step 8 — Verify the equilibrium condition. $\alpha \cdot (A/N^{*\alpha}) = 0.336 \times 0.110 = 0.0370$ . $\beta \cdot (B/D^{*\beta}) = 0.283 \times 0.130 = 0.0368$ . These two are approximately equal to within rounding — the optimum balances the weighted marginal contributions, exactly as the Lagrangian predicts.

Total loss: $L^{*} = 1.69 + 0.110 + 0.130 \approx 1.93$ nats per token. Chinchilla itself reported a final training loss of ~1.96 at this compute — within 0.03 of the prediction.

Step 9 — Compare to Gopher's choice. Gopher chose $N = 280 \times 10^{9}$ and $D = 300 \times 10^{9}$: roughly the same compute, wildly different split. Plug into $L(N, D)$: $A/N^{\alpha} = 406.4 / (280 \times 10^{9})^{0.336} \approx 406.4 / 7018 \approx 0.058$; $B/D^{\beta} = 410.7 / (300 \times 10^{9})^{0.283} \approx 410.7 / 1770 \approx 0.232$. $L(\text{Gopher}) \approx 1.69 + 0.058 + 0.232 = 1.98$. That is 0.05 nats worse than the compute-optimal allocation, a gap of the kind that separated Gopher's 60% MMLU from Chinchilla's 67%. Same hardware. Different choice.
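The hand calculation can be cross-checked in a few lines of Python. This is a minimal sketch using the same constants and the same closed form; the function names are illustrative. Because it carries full floating-point precision, its values can differ from the hand-rounded ones above by a percent or two.

```python
# Constants from Step 1 of the walkthrough.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.336, 0.283


def loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


def compute_optimal(flops):
    """Closed-form (N*, D*) for a budget C under the constraint C = 6 N D."""
    a = BETA / (ALPHA + BETA)
    b = ALPHA / (ALPHA + BETA)
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_star = g * (flops / 6.0) ** a
    d_star = (flops / 6.0) ** b / g
    return n_star, d_star


C = 6e23
n_star, d_star = compute_optimal(C)
print(f"N* = {n_star:.3e}, D* = {d_star:.3e}, 6 N* D* = {6 * n_star * d_star:.3e}")
print(f"L at optimum      = {loss(n_star, d_star):.4f}")
print(f"L at Gopher split = {loss(280e9, 300e9):.4f}")
```

The budget check in Step 6 and the loss gap in Step 9 both fall out of the printed values.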

Visualizing the Iso-FLOP Slice

The widget below evaluates the Hoffmann loss law and traces the iso-FLOP curve $L(N, C/(6N))$ as a function of model size $N$ on a log axis. The dashed green line marks the compute-optimal $N^{*}$ . Slide the compute budget to see how the optimum moves; click the preset chips to jump straight to a Gopher-, GPT-3-, or Llama-3-class compute. The reference dots show where each historical model sits on its own iso-FLOP curve.


Four things to internalise from the sandbox. First, the iso-FLOP curve is a U-shape, not a monotone curve: there really is a sweet spot, and moving away from it in either direction raises the loss. Second, the minimum is broad: deviating by 1.5–2× in either direction costs less than 0.01 nats. This is why labs do not need to land exactly at $N^{*}$; they have slack. Third, GPT-3 sits far to the right of the optimum on its own iso-FLOP curve, in the over-parameterised tail. Fourth, Llama-3 70B sits to the left of its Chinchilla-optimal point: it is intentionally under-parameterised, the over-trained mirror image of GPT-3. The reason is inference cost, covered below.
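The shape of the iso-FLOP slice can also be reproduced without the interactive chart. The sketch below assumes the same Hoffmann constants, scans model sizes on a log grid at a fixed budget, and measures how much loss a 2× miss in either direction costs. The grid bounds and function names are illustrative choices.

```python
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.336, 0.283


def loss(n_params, n_tokens):
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


def isoflop_loss(n_params, flops):
    """Loss along the constraint 6 N D = C, i.e. L(N, C / (6 N))."""
    return loss(n_params, flops / (6.0 * n_params))


def isoflop_minimum(flops, lo_exp=8.0, hi_exp=13.0, points=5001):
    """Brute-force scan over log-spaced N between 10^lo_exp and 10^hi_exp."""
    best_n, best_l = None, float("inf")
    for i in range(points):
        n = 10 ** (lo_exp + (hi_exp - lo_exp) * i / (points - 1))
        value = isoflop_loss(n, flops)
        if value < best_l:
            best_n, best_l = n, value
    return best_n, best_l


C = 6e23
n_best, l_best = isoflop_minimum(C)
penalty_half = isoflop_loss(n_best / 2, C) - l_best
penalty_double = isoflop_loss(n_best * 2, C) - l_best
print(f"N at minimum = {n_best:.3e}, loss = {l_best:.4f}")
print(f"penalty at N/2 = {penalty_half:.4f} nats, at 2N = {penalty_double:.4f} nats")
```

Both penalties come out well under 0.01 nats, which is the "broad minimum" in numbers.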

Plain Python: Fit and Optimise the Loss

Below is the full Chinchilla math, end to end, in plain Python: no deep-learning framework, no GPU. We define the parametric loss, derive the closed-form compute optimum, cross-check with a Nelder-Mead numerical solver, and sweep six compute budgets spanning five orders of magnitude.

chinchilla_closed_form.py

The parametric loss is the entire scaling law in one line

Hoffmann et al. (2022) fit hundreds of training runs of different sizes and durations and showed that the validation loss is well captured by three additive terms: an irreducible floor E (entropy of natural text; no model can do better), a model-size term A/N^α (shrinks as you add parameters), and a data term B/D^β (shrinks as you train on more tokens). The coefficients are fit, not derived from first principles, and they are only validated over the range of the sweep (roughly 70M to 16B parameters); anything beyond that is extrapolation.

EXECUTION STATE

E = 1.69 (entropy of English in nats/token)

α = 0.336 (model-size exponent)

β = 0.283 (data exponent)

L(N, D) is the function we will optimise

Two inputs: the parameter count N and the training token count D. One output: the expected validation loss in nats per token. Notice the symmetry — both correction terms have the same shape. The whole scaling-laws story is about how to balance them given a fixed compute budget.

Why C = 6 N D? The transformer FLOP budget

A forward pass through a transformer costs roughly 2 N FLOPs per token (one multiply-add per parameter), and the backward pass costs about 2× that, giving 6 N FLOPs per training token. Multiply by D tokens and you get C ≈ 6 N D. This is the Kaplan/Chinchilla approximation — it ignores attention compute (small at typical sequence lengths) and embeddings (small at scale).

EXECUTION STATE

FLOPs per training token = ≈ 6 N

Total training FLOPs = C = 6 N D

Closed-form optimum from a Lagrange multiplier

Constrained optimisation: minimise L(N, D) subject to 6 N D = C. Take the gradient of L, impose the constraint, and the equilibrium condition is α A / N^α = β B / D^β (the two exponent-weighted reducible terms must be equal at the optimum; otherwise you could shift compute toward the cheaper-to-reduce side). Solving gives the two power laws for N* and D* implemented in optimum(C).

G is the level-set ratio between N* and D*

G captures the fitted asymmetry between the model-size and data terms. For Hoffmann's constants, G ≈ 1.30, and because a/b ≈ 0.84 the optimum follows N* ∝ D*^0.84: the two grow together, but data grows slightly faster. At C ≈ 5.76e23 FLOPs (the Chinchilla compute) these constants give N* ≈ 42B and D* ≈ 2.3T, so D/N ≈ 55. The famous '20 tokens per parameter' rule of thumb corresponds to the 70B / 1.4T allocation Chinchilla was actually trained at, which this loss law scores within a few thousandths of a nat of the optimum.

EXECUTION STATE

a_exp (N exponent) ≈ 0.457

b_exp (D exponent) ≈ 0.543

G ≈ 1.30

optimum(C) returns (N*, D*, L*) directly

Plug in any compute budget and out comes the compute-optimal parameter count, training tokens, and predicted loss. This is the kind of function a lab consults before committing to a multi-million-dollar training run. Under these constants, missing N* by 2× in either direction costs roughly 0.005 nats of validation loss, and a 4× miss costs roughly 0.02 nats.

Numerical optimum as a sanity check

We re-solve the same problem by 1-D numerical optimisation over log10(N) (with D = C/(6N) substituted in). If the closed-form and numerical answers disagree, there is a mistake in the exponents, in G, or in the search itself. This is the standard 'two paths to the same number' debugging pattern, vital when papers report different fits and you need to confirm that your code matches the one you think you copied.

Sweep five orders of magnitude in compute

C from 1e21 to 1e26 FLOPs covers everything from a small university run (1e21, a single A100 for a few months) through Chinchilla-class (5e23) to frontier 2024–2025 budgets (1e26 ≈ 10^4 H100-years). With these constants the optimal D/N ratio creeps from ~32 at the low end to ~86 at the high end: bigger budgets prefer relatively more data, which is the second-order story most readers miss.

EXECUTION STATE

D/N at 1e21 FLOPs ≈ 32

D/N at 1e23 FLOPs ≈ 47

D/N at 1e26 FLOPs ≈ 86


import numpy as np
from scipy.optimize import minimize

# --- 1) Hoffmann et al. (2022) parametric loss ---------------------
# L(N, D) = E + A / N^alpha + B / D^beta
#   E             irreducible loss (entropy of natural text)
#   A / N^alpha   reducible loss from finite model size
#   B / D^beta    reducible loss from finite training data
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.336, 0.283

def L(N, D):
    return E + A / N**ALPHA + B / D**BETA

# --- 2) Closed-form compute-optimum given a FLOP budget C ----------
# Training FLOPs are well approximated by C = 6 N D for a transformer.
# Minimising L(N, D) subject to 6 N D = C gives:
#   N* = G * (C/6)^a,  D* = G^-1 * (C/6)^b
#   a  = beta  / (alpha + beta)
#   b  = alpha / (alpha + beta)
#   G  = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
a_exp = BETA  / (ALPHA + BETA)
b_exp = ALPHA / (ALPHA + BETA)
G     = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))

def optimum(C):
    Nstar = G * (C / 6) ** a_exp
    Dstar = (1 / G) * (C / 6) ** b_exp
    return Nstar, Dstar, L(Nstar, Dstar)

# --- 3) Numerical optimisation, used as a sanity check -------------
def optimum_numeric(C):
    def objective(logN):
        # Loss along the constraint: D is fixed by the budget once N is chosen.
        N = 10**logN[0]
        D = C / (6 * N)
        return L(N, D)
    res = minimize(objective, x0=[10.0], method="Nelder-Mead")
    N = 10 ** res.x[0]
    return N, C / (6 * N), res.fun

# --- 4) Sweep six compute budgets across 5 orders of magnitude -----
for logC in (21, 22, 23, 24, 25, 26):
    C = 10 ** logC
    N_cf, D_cf, L_cf = optimum(C)
    N_nm, D_nm, L_nm = optimum_numeric(C)
    print(f"C=1e{logC:>2}  N*={N_cf:9.2e}  D*={D_cf:9.2e}"
          f"  D/N={D_cf/N_cf:6.1f}  L={L_cf:.3f}"
          f"  (numeric L={L_nm:.3f})")

Two non-obvious details worth a second look. First, the closed-form and the numerical-search results agree to the printed precision at every compute scale. This is your acceptance test that the algebra and the code implementing it are consistent. The closed form holds for any positive constants, so if the two diverge, suspect a transcription error in the exponents or in $G$, or a numerical search that stopped early, before trusting either number in a production projection. Second, the increase of $D^{*}/N^{*}$ with compute is the most important finding hidden inside the law: bigger budgets do not just demand bigger models, they demand disproportionately more training data. This is part of why the post-Chinchilla generation of models (Llama-3, DeepSeek-V3) trains on around 15 trillion tokens; the rest of the explanation is inference cost, covered later in this section.

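Sanity-check yourself. Run the script with C = 5.76e23 (Chinchilla's actual compute). With these constants the closed-form optimum lands near N* ≈ 42B, D* ≈ 2.3T, a smaller and longer-trained model than the N = 70B, D = 1.4T that Hoffmann et al. actually trained. The paper's other two estimation approaches put the optimum near 67B, which is where the 70B choice comes from. Under this loss law the 70B / 1.4T allocation is only about 0.003 nats above the minimum, so the penalty for the round number is microscopic either way.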

PyTorch: Fitting Scaling Laws From Real Runs

Knowing the formula is not enough. Every lab has its own $A, B, \alpha, \beta$ because every lab has its own optimiser, tokenizer, architecture quirks, and data mixture — Hoffmann's constants are a starting point, not gospel. Before committing to a frontier training run, the lab fits its own scaling law on a sweep of dozens of small runs, then extrapolates. The PyTorch snippet below shows the fitting half.

fit_scaling_law.py

Scaling-law fitting starts from real training-run telemetry

Eleven (N, D, L) triples standing in for a real Chinchilla-style sweep. Production sweeps use 50–400 runs, each a fully-trained model with final validation loss measured. The N and D axes are deliberately log-spaced because the loss surface is smooth in log-log space but not in linear space.

EXECUTION STATE

len(runs) = 11 toy runs (sweeps use 50–400)

N range = 3 × 10⁷ to 3 × 10⁹

D range = 6 × 10⁸ to 6 × 10¹¹

Five learnable parameters define the entire law

Note the tiny model: just five scalar parameters (E, A, B, α, β) capture loss behaviour across four orders of magnitude in N and D. The whole story of scaling laws is that this five-parameter form is enough — a smooth, additive decomposition into irreducible floor, model-size correction, and data correction. log_E/log_A/log_B are parameterised in log-space because A and B are O(100) while E is O(1); without log-parameterisation the optimiser fights numerical asymmetry.

EXECUTION STATE

trainable params = 5 (E, A, B, α, β)

Forward pass: same formula as plain Python, but autograd-aware

Returns the predicted loss for each (N, D) row. Because the parameters are torch.nn.Parameter, autograd tracks every operation and we can backprop through L_pred. The exponentiations and power operations are all differentiable in PyTorch, so this fits cleanly into the optimisation loop that follows.

Huber loss is the secret sauce for robust fits

An MSE objective in linear-loss space gives undue weight to small-N runs (where L is large) and almost ignores large-N runs (where L is small). Hoffmann's Approach 3 instead fits in log-loss space with Huber — robust to the inevitable outliers from runs that diverged, used too-low warmup, or hit a numerical issue. Huber acts like MSE near zero and like L1 in the tails: smooth gradients in normal regime, bounded influence from outliers.

EXECUTION STATE

delta (Huber transition) = 1e-3 in log-loss space

20k Adam steps to convergence

The objective is convex-ish in (E, A, B) for fixed (α, β) but non-convex jointly. Adam with lr=5e-3 converges in a few thousand steps. In production, labs run a multi-start sweep (10–50 random inits of α and β in [0.2, 0.5]) and keep the best fit — the loss surface in (α, β) has shallow ridges where multiple solutions look nearly equal.

Read out the fitted constants

After training, the five learned scalars define the lab's local scaling law — slightly different from Hoffmann's published constants because every codebase has slightly different optimiser, tokenizer, and architecture choices. DeepSeek-V3, Llama-3, GPT-4 all fit their own version of these constants before committing to the big run.

Project to a target compute budget

Same closed-form Lagrangian solution as the plain-Python file, now applied with the FITTED constants. C_target = 3.8e25 FLOPs is roughly the training budget reported for Llama 3.1 405B. Plug it in and the fit suggests an N* and D*; if those disagree with what was actually trained, that disagreement is either (a) intentional (over-training for inference, see below) or (b) a sign the small-scale sweep does not extrapolate to the target.

EXECUTION STATE

C_target = 3.8 × 10²⁵ FLOPs (~ Llama 3.1 405B)

predicted D*/N* at this scale = ≈ 30–60 (varies with fit)

This is the single number that justifies a $50M training run

L_star is the predicted validation loss at the compute-optimal allocation for C_target. Frontier labs publish a curve of (predicted L, achieved L) over their sweep and demand they agree to within ~0.01 nats before committing to the full-scale run. A disagreement of 0.03 nats is grounds to halt and re-sweep — the multi-million-dollar question 'will this big model actually learn?' is answered by trust in this fit.

EXECUTION STATE

production tolerance = ≈ 0.01 nats between predicted and achieved


import torch
import torch.nn as nn
import torch.optim as optim

# --- 1) Collect (N, D, L) triples from real training runs ----------
# In a real lab this comes from a sweep: 30+ runs at log-spaced
# N in [3e7, 3e9] and log-spaced training token counts. Each point is
# the final validation loss of one fully-trained model.
runs = torch.tensor([
    # (N parameters, D tokens, observed validation loss)
    [3.0e7,  6.0e8, 4.21],
    [3.0e7,  6.0e9, 3.62],
    [1.0e8,  6.0e8, 3.74],
    [1.0e8,  6.0e9, 3.19],
    [1.0e8,  6.0e10, 2.83],
    [4.0e8,  6.0e9, 2.82],
    [4.0e8,  6.0e10, 2.51],
    [1.0e9,  6.0e9, 2.66],
    [1.0e9,  6.0e10, 2.36],
    [3.0e9,  6.0e10, 2.21],
    [3.0e9,  6.0e11, 2.04],
])

N = runs[:, 0]
D = runs[:, 1]
L_obs = runs[:, 2]

# --- 2) Differentiable parametric model with learnable scaling law -
# We learn E, A, B, alpha, beta jointly via gradient descent.
# Log-space parameters keep the optimisation well-conditioned.
class ScalingLaw(nn.Module):
    def __init__(self):
        super().__init__()
        self.log_E = nn.Parameter(torch.tensor(0.5))   # init E ~ 1.65
        self.log_A = nn.Parameter(torch.tensor(6.0))   # init A ~ 400
        self.log_B = nn.Parameter(torch.tensor(6.0))   # init B ~ 400
        self.alpha = nn.Parameter(torch.tensor(0.35))
        self.beta  = nn.Parameter(torch.tensor(0.30))

    def forward(self, N, D):
        E = torch.exp(self.log_E)
        A = torch.exp(self.log_A)
        B = torch.exp(self.log_B)
        return E + A / N**self.alpha + B / D**self.beta

model = ScalingLaw()
opt = optim.Adam(model.parameters(), lr=5e-3)

# --- 3) Huber loss in log-loss space for robustness to outliers ----
# Hoffmann et al. use a Huber objective on log-loss in Approach 3.
huber = nn.HuberLoss(delta=1e-3)

for step in range(20000):
    L_pred = model(N, D)
    loss = huber(torch.log(L_pred), torch.log(L_obs))
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- 4) Read out fitted constants and project to a compute budget --
E = torch.exp(model.log_E).item()
A = torch.exp(model.log_A).item()
B = torch.exp(model.log_B).item()
alpha = model.alpha.item()
beta  = model.beta.item()

C_target = 3.8e25                 # ~ Llama 3.1 405B compute
a = beta  / (alpha + beta)
b = alpha / (alpha + beta)
G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
N_star = G * (C_target / 6) ** a
D_star = (1 / G) * (C_target / 6) ** b
L_star = E + A / N_star**alpha + B / D_star**beta
print(f"Predicted compute-optimal at C={C_target:.1e}:"
      f" N*={N_star:.2e}, D*={D_star:.2e}, L*={L_star:.3f}")

Three subtleties marked for production use:

Log-space parameterisation matters. Fitting $A$ and $B$ directly puts the optimiser in a 400-vs-1.7 numerical regime with $E$. Fitting $\log A$, $\log B$, $\log E$ puts them all in $[-1, 7]$, where a single Adam learning rate suits every parameter and the fit converges in far fewer steps.
Huber on log-loss, not MSE on raw loss. Hoffmann's Approach 3 fits in log-loss space because the relative error is what matters: predicting $L = 2.00$ when the truth is $L = 2.05$ is the same severity as predicting $L = 4.00$ when the truth is $4.10$. Huber bounds the influence of any one outlier run. MSE-on-raw silently over-weights the high-loss (small-model) runs and under-weights the low-loss (big-model) runs that matter most for extrapolation.
Multi-start to escape shallow ridges. The loss-landscape for $(\alpha, \beta)$ has shallow ridges where many fits look equally good but extrapolate differently. Production pipelines re-fit from 20–50 random initialisations of $(\alpha, \beta)$ drawn uniformly from $[0.2, 0.5]$ , keep the best loss, and report the spread as an uncertainty band on the extrapolation. A 2× spread on $N^{*}$ at the target compute is normal; a 10× spread means the small runs do not generalise to the target scale and the sweep needs larger anchor points.
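The second point can be checked by hand. The sketch below is plain Python, with no PyTorch; it writes out the Huber penalty using the same piecewise definition as `torch.nn.HuberLoss` and compares it with squared error on the two example misses from the list above. The helper names are illustrative.

```python
import math


def huber(residual, delta=1e-3):
    """Huber penalty: quadratic for |r| <= delta, linear beyond it."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def log_residual(predicted, observed):
    """Residual in log-loss space, i.e. the log of the relative error."""
    return math.log(predicted) - math.log(observed)


# Same relative miss at two different loss levels.
for predicted, observed in [(2.00, 2.05), (4.00, 4.10)]:
    squared_raw = (predicted - observed) ** 2
    huber_log = huber(log_residual(predicted, observed))
    print(f"pred={predicted:.2f} obs={observed:.2f}"
          f"  squared raw error={squared_raw:.4f}  Huber(log)={huber_log:.3e}")

# A wild outlier: the squared penalty explodes, the Huber penalty grows only linearly.
print(f"outlier residual 0.5: squared={0.5 ** 2:.3f}  Huber={huber(0.5):.3e}")
```

The squared raw error is four times larger for the high-loss run even though the relative miss is identical, while the Huber penalty on log-loss treats the two runs the same.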

What "a sweep" actually costs. Chinchilla's 400+ runs consumed roughly 1.4 × 10²² FLOPs in total, about 2% of the cost of a single Chinchilla training run. DeepSeek-V3's reported scaling sweep cost ~1.5% of the final run. The investment is small and the payoff is large: under the Hoffmann constants a fit that is off by 2× in $N^{*}$ costs ~0.005 nats at the final run, and one that is off by 4× costs ~0.02 nats, which can translate to 2–5 percentage points on downstream benchmarks. The expected-value math is one-sided.

At Massive Scale: Why DeepSeek Trained Past Chinchilla

If Chinchilla's law says 20 tokens per parameter, why is Llama-3 70B trained on 15T tokens (214 tokens per parameter), and why does DeepSeek-V3 train on 14.8T tokens against 37B active parameters (≈ 400 per active param)? The answer is that the Chinchilla optimum minimises loss for a given training compute, but a deployed frontier model spends most of its lifetime generating tokens, not training. Inference costs about $2 N$ FLOPs per generated token. A smaller $N$ means cheaper inference forever, even at the cost of more training compute today.

The full accounting splits the lifetime FLOP budget into two buckets:

$C_{\text{total}} = \underbrace{6 N D}_{\text{training}} + \underbrace{2 N T_{\text{infer}}}_{\text{inference}}$

where $T_{\text{infer}}$ is the total number of tokens the model will ever serve over its deployed life. For an internal research model, $T_{\text{infer}}$ is small and Chinchilla is right. For a model serving a billion users, $T_{\text{infer}}$ can dwarf $D$ by 100× — and the optimum shifts hard toward a smaller, more heavily trained model.
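One way to make the two-bucket accounting concrete is to hold the lifetime budget $C_{\text{total}}$ fixed and ask which model size gives the lowest loss once inference has been paid for. The sketch below does this by brute-force scan. It assumes the Hoffmann constants from earlier in the section; the budget of 6 × 10²³ FLOPs, the inference volumes, and the function name are illustrative choices, not figures from any lab.

```python
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.336, 0.283


def loss(n_params, n_tokens):
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


def best_split(total_flops, inference_tokens, points=4001):
    """Scan model sizes; whatever budget inference leaves over buys training tokens."""
    best = None
    for i in range(points):
        n = 10 ** (8.0 + 5.0 * i / (points - 1))      # 1e8 .. 1e13 parameters
        train_flops = total_flops - 2.0 * n * inference_tokens
        if train_flops <= 0:
            break  # inference alone exhausts the budget for this and all larger n
        d = train_flops / (6.0 * n)
        value = loss(n, d)
        if best is None or value < best[2]:
            best = (n, d, value)
    return best


for t_infer in (0.0, 1e12, 1e13):
    n, d, value = best_split(6e23, t_infer)
    print(f"T_infer={t_infer:.0e}  N={n:.2e}  D={d:.2e}  D/N={d / n:6.1f}  L={value:.3f}")
```

As the expected inference volume grows, the best model shrinks and its tokens-per-parameter ratio climbs, which is the qualitative pattern the table below summarises.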

Use case	Inference tokens T_infer	Optimal D/N	Why
Research only (one-shot eval)	~ 10⁹	≈ 20	Chinchilla is correct; train once, evaluate, move on.
Internal product (millions of users)	~ 10¹²	≈ 50–100	Mild over-training; inference is meaningful but training still dominates.
External API (Llama-3-style)	~ 10¹⁴	≈ 200	Heavy over-training; inference dominates lifetime FLOPs.
MoE serving model (DeepSeek-V3)	~ 10¹⁴ on active params	≈ 400 (per active)	MoE makes inference even cheaper per param, justifying even more training.

DeepSeek-V3 demonstrates the second-order trick. With a sparse Mixture-of-Experts, the per-token compute, in training and at inference, scales with the active parameter count (37B), not the total (671B). The Chinchilla denominator is therefore the active size, while the memory needed to hold weights, gradients, and optimiser state during training scales with the total size. The result is a model with the capacity of 671B parameters and the inference cost of 37B, and one that aggressively over-trains the active subnetwork because inference is so cheap per active param.

Modern restatement of Chinchilla. The clean version of the rule that survives 2024–2025 frontier practice: train past Chinchilla as long as the marginal inference savings from a smaller model exceed the marginal training cost of more tokens. For research, use Chinchilla. For production, use 20–400×, depending on expected serving load. The law itself does not change; the objective being optimised does.

Engineering Reality and Gotchas

Scaling laws are seductive because they reduce a multi-million-dollar decision to one equation. They earn their warning flags because the equation is only as good as the sweep that fitted it. Five production failure modes:

The learning-rate schedule must scale with D. This is the single mistake that put Kaplan's 2020 fit off by a factor of two. If your small runs decay LR over 100k steps and your big runs decay over 1M steps, then runs with more tokens actually use their tokens — your fit will recover the right exponents. If LR decays over the same step count regardless of D, runs with more tokens train at near-zero LR for most of their schedule, and your fit will systematically under-value tokens. Hoffmann fixed this; every modern lab inherits the fix.
Sweep small or sweep wrong. A scaling-law fit built from runs in $[10^7, 10^9]$ parameters extrapolated to $10^{11}$ is a two-order-of-magnitude leap. The fitted exponents drift slowly with the regime they were fit in. Frontier labs anchor their sweep with at least one run within 10× of the target — the last 1% of cost that turns a projection into a verified prediction.
The 6 N D approximation breaks for MoE. A mixture-of-experts model has total parameters $N_{\text{total}}$ but only routes through $N_{\text{active}}$ per token. Training FLOPs scale with $N_{\text{active}}$ , not $N_{\text{total}}$ . Plugging total params into a dense Chinchilla formula will tell you to train DeepSeek-V3 on 13T tokens; the right answer (active-params Chinchilla, then over-train for inference) is closer to 15T. The two answers happen to be close — but only because of a coincidence of constants, not because of theory.
Data quality changes A, not α. A cleaner data mixture lowers $A$ and $B$ (the model reaches lower loss at every (N, D)) but typically leaves the exponents $\alpha, \beta$ nearly unchanged. The practical consequence is large: a 10% improvement in data curation is equivalent to a 2–3× compute multiplier without changing where the compute-optimum lives. Section 8 of this chapter and Chapter 8 itself live on this margin.
Loss is not capability. Chinchilla predicts validation loss, but the model you ship is judged on downstream benchmarks. The relationship between loss drop and benchmark gain is monotone but heavily non-linear (see section 9.3 on emergent abilities). A 0.05-nat loss drop can be worth 5 MMLU points in one regime and 15 in another. Chinchilla tells you how to minimise loss; it does not tell you whether that loss drop will unlock the capability you actually need. That is the next layer of the scaling-laws story.
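The MoE arithmetic from the third failure mode is short enough to spell out. The sketch below uses only the DeepSeek-V3 figures quoted in this section and the 20-tokens-per-parameter rule of thumb; the variable names are illustrative.

```python
TOKENS_PER_PARAM = 20            # Chinchilla rule of thumb from this section
N_TOTAL = 671e9                  # DeepSeek-V3 total parameters
N_ACTIVE = 37e9                  # parameters routed through per token
D_ACTUAL = 14.8e12               # training tokens actually used

# Plugging the wrong N into the dense rule versus the right one.
tokens_if_total = TOKENS_PER_PARAM * N_TOTAL
tokens_if_active = TOKENS_PER_PARAM * N_ACTIVE

# Training compute follows the active parameters: C ~= 6 * N_active * D.
train_flops = 6 * N_ACTIVE * D_ACTUAL

# How far past the active-parameter Chinchilla point the real run went.
overtrain_factor = D_ACTUAL / tokens_if_active

print(f"dense rule on total params : {tokens_if_total / 1e12:.1f}T tokens")
print(f"dense rule on active params: {tokens_if_active / 1e12:.2f}T tokens")
print(f"actual run                 : {D_ACTUAL / 1e12:.1f}T tokens "
      f"({overtrain_factor:.0f}x the active-param rule)")
print(f"approx training compute    : {train_flops:.2e} FLOPs")
```

The total-parameter number lands near the actual token count only by coincidence: the principled route is the active-parameter rule (under 1T tokens) followed by a deliberate 20× over-training factor.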

How a frontier lab actually uses this section. Three concrete steps before committing to a billion-dollar run: (a) run a 50-point sweep across two orders of magnitude in N and D with the production stack (tokenizer, optimiser, LR schedule scaled per run), and fit $E, A, B, \alpha, \beta$ with a multi-start Huber-log fit; (b) project to the target compute, report N* and D* with uncertainty bands; (c) decide where on the over-trained spectrum to sit by accounting for expected lifetime inference traffic. Skip (a) and you publish 2020-era Kaplan exponents and waste a quarter of your compute. Skip (c) and you ship a model that is too expensive to serve at scale.

The one sentence to carry forward: Chinchilla is not a number, it is a procedure — and the only frontier models worth their training cost are the ones that actually ran it.