Sections 9.1 through 9.4 took compute, parameters, and tokens as inputs and predicted final loss BEFORE training started. This section closes the loop. You have launched the run. A trillion-parameter MoE has been chewing through tokens for three days. You have five intermediate loss checkpoints. The question every frontier lab asks at this point is the same: given what we have seen so far, where will this run actually land?
The thesis of this section. Pretraining loss curves are dominated by a single mechanism (averaging gradients over a growing token count), and that mechanism gives the loss curve a nearly deterministic shape. Five well-spaced checkpoints fit a three-parameter power law whose extrapolation lands within a few hundredths of a nat of the final loss. That accuracy is enough to kill doomed runs early, repurpose underspent compute, and validate hyperparameters on dramatically smaller proxy runs, which makes the predictor the single highest-leverage piece of training infrastructure most labs own.
The Real Problem: A $5M Run You Cannot Restart
A frontier pretraining run on a 670B-parameter MoE costs roughly $5M of GPU time and 60 days of wall clock. The cluster is committed; the data pipeline has been prepared; the scaling-law analysis says the right answer is 14.8T tokens. You hit step 1, the loss falls from 11.0 to 4.0 in the first hour, and then it begins the long slow grind from 4.0 down toward whatever number it is going to end up at.
Three painful failure modes show up between day 1 and day 60:
| Failure mode | What it looks like at hour 24 | Cost if you do not catch it early |
|---|---|---|
| Bad hyperparameters | Loss curve has the right shape but sits 0.05-0.15 nats above where the scaling law said it should. | 60-day run finishes with a model that is mediocre at every benchmark. ~ $4M wasted. |
| Data pipeline regression | Loss is unusually noisy or has a small but persistent upward drift. | Subtle quality degradation invisible at the loss level but obvious at eval time. Re-run required. |
| Optimizer instability | Loss is fine until step 50k, then a single spike to 8.0 and never fully recovers. | If unnoticed for two days: ~ $200k of grad-clip-saturated training that learns nothing. |
The naive policy is "wait until the run ends and look at the evals." That is the policy that turns a $5M GPU bill into a $5M lesson. The grown-up policy is to predict where the run is going every few thousand steps and act on the prediction: kill a run when it is going to miss the target, scale down a run when the prediction says you bought too much compute, and (most importantly) use the predictor to validate hyperparameters on tiny proxy runs before paying the full $5M.
Intuition: Pretraining Loss Is Almost Boringly Predictable
The result that makes this section possible is counterintuitive: the outer envelope of a healthy pretraining loss curve is extraordinarily smooth. There is per-batch noise (a single batch of tricky text can spike the loss by 0.05); there is learning-rate-schedule structure (a cosine decay carves a visible signature into the last 10% of training); there are even occasional loss spikes. But underneath all that, the loss is approaching its asymptote along the same shape that every healthy run produces:
$$L(t) = L_\infty + A\,(t + t_0)^{-\alpha}$$

Here $t$ is the number of training tokens consumed, $L_\infty$ is the irreducible loss the model would approach with infinite training, $A$ sets the magnitude of the "excess loss over the asymptote", and $\alpha$ is the decay rate. The constant $t_0$ is a tiny offset that absorbs warmup and keeps the curve finite at $t = 0$.
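As a minimal sketch, the curve can be written as a plain Python function, assuming it has the form $L(t) = L_\infty + A\,(t + t_0)^{-\alpha}$ used in the walkthrough later in this section. The function name `power_law_loss` and the parameter values below are illustrative assumptions, not fitted results.

```python
import numpy as np

def power_law_loss(t, L_inf, A, alpha, t0=0.4):
    """Loss after t trillion tokens: L_inf + A * (t + t0) ** (-alpha)."""
    t = np.asarray(t, dtype=float)
    return L_inf + A * (t + t0) ** (-alpha)

# Illustrative parameters only (ASSUMED, not fitted to any real run).
tokens = np.array([0.0, 0.5, 2.0, 8.0, 1e6])
curve = power_law_loss(tokens, L_inf=1.95, A=1.0, alpha=0.32)

for t, loss in zip(tokens, curve):
    print(f"t = {t:>11.1f} T tokens -> L = {loss:.4f}")
```

The offset `t0` is what keeps the value at `t = 0` finite, and the last entry shows the curve flattening toward `L_inf` at very large token counts.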
Two intuitions justify the shape. First: the gradient the model receives at token $t$ is, on average, the average gradient of the loss surface re-evaluated on a freshly drawn sample. The variance of that estimate falls like $1/t$, so the residual loss above the asymptote falls polynomially in $t$. Second: the same shape shows up in the parameter axis (the parameter-count term of the Chinchilla law in Section 9.1). One mechanism, a finite-capacity model averaging gradients over finite data, produces the same power-law signature on both axes.
The picture you should have in your head is a ball rolling into a bowl whose floor is the natural-text entropy. The ball decelerates as it nears the floor; the deceleration is smooth and uneventful; the rate of deceleration is what $\alpha$ measures. Forecasting where the ball will be at $t = T_{\text{final}}$ is a matter of measuring its position at a few early times and reading the smooth shape forward.
The Mathematics of Power-Law Loss Curves
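We have three unknowns, $(L_\infty, A, \alpha)$, and a stream of observations $\{(t_i, L_i)\}_{i=1}^{n}$. The standard fit is non-linear least squares:

$$(\hat L_\infty, \hat A, \hat\alpha) = \arg\min_{L_\infty,\,A,\,\alpha} \sum_{i=1}^{n} \bigl[L_i - L_\infty - A\,(t_i + t_0)^{-\alpha}\bigr]^2$$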
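`scipy.optimize.curve_fit` converges in ten or so iterations when you seed it with a sensible initial guess and bound the parameters to physically plausible ranges. One detail worth knowing: Levenberg–Marquardt is the default in `curve_fit` only for unbounded problems. Once `bounds` are supplied it switches to the trust-region reflective method (`trf`), because its Levenberg–Marquardt implementation does not support bounds. Three rules on the bounds: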
- $L_\infty \ge L_{\text{floor}}$. The irreducible loss cannot be below the per-token entropy of natural text, so the lower bound is the entropy floor $L_{\text{floor}}$ of the corpus. Without this bound the fitter happily picks an implausibly low $L_\infty$ and a giant $A$: wonderful residual on the observed points, catastrophically wrong far out.
- $L_\infty < \min_i L_i$. The asymptote is below every loss you have observed, by definition.
- $\alpha_{\min} \le \alpha \le \alpha_{\max}$. Outside a narrow positive range you are fitting noise; fitted values of $\alpha$ are remarkably consistent across architectures and tokenisers.
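The three bounds above can be sketched with `scipy.optimize.curve_fit` as follows. The data are the five checkpoints from the walkthrough later in this section; the entropy floor (`ENTROPY_FLOOR`) and the allowed range for the decay rate (`ALPHA_RANGE`) are placeholder values assumed for illustration, not numbers taken from the text.

```python
import numpy as np
from scipy.optimize import curve_fit

T0 = 0.4  # fixed offset in trillions of tokens, as in the walkthrough

def model(t, L_inf, A, alpha):
    return L_inf + A * (t + T0) ** (-alpha)

t_obs = np.array([0.5, 1.2, 2.6, 4.0, 6.0])       # trillions of tokens
L_obs = np.array([3.04, 2.65, 2.41, 2.30, 2.22])  # nats

ENTROPY_FLOOR = 1.0        # ASSUMED placeholder for the corpus entropy floor
ALPHA_RANGE = (0.05, 1.0)  # ASSUMED placeholder range for the decay rate

lower = [ENTROPY_FLOOR, 0.0, ALPHA_RANGE[0]]
upper = [L_obs.min() - 1e-3, np.inf, ALPHA_RANGE[1]]  # L_inf below every observed loss
p0 = [L_obs.min() - 0.2, 1.0, 0.3]                    # seed strictly inside the bounds

# With bounds supplied, curve_fit uses the trust-region reflective method.
popt, pcov = curve_fit(model, t_obs, L_obs, p0=p0,
                       bounds=(lower, upper), max_nfev=5000)
L_inf, A, alpha = popt
rmse = float(np.sqrt(np.mean((L_obs - model(t_obs, *popt)) ** 2)))
print(f"L_inf = {L_inf:.3f}, A = {A:.3f}, alpha = {alpha:.3f}, RMSE = {rmse:.4f}")
```

If a fitted parameter lands exactly on one of its bounds, that is a signal to inspect the data or the bound rather than to trust the fit.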
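Once we have point estimates, we want a confidence interval on $\hat L(T_{\text{final}})$. Linearise the model at the fitted parameters and propagate the parameter covariance matrix $\Sigma$ through the Jacobian $J(t)$:

$$\operatorname{Var}\bigl[\hat L(t)\bigr] \approx J(t)^{\top}\,\Sigma\,J(t), \qquad J(t) = \left(\frac{\partial L}{\partial L_\infty},\ \frac{\partial L}{\partial A},\ \frac{\partial L}{\partial \alpha}\right)^{\top}$$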
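The three partial derivatives are easy:

$$\frac{\partial L}{\partial L_\infty} = 1, \qquad \frac{\partial L}{\partial A} = (t + t_0)^{-\alpha}, \qquad \frac{\partial L}{\partial \alpha} = -A\,(t + t_0)^{-\alpha}\,\ln(t + t_0)$$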
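A practical observation about this Jacobian: $\partial L / \partial L_\infty = 1$ regardless of where you predict, while the other two entries shrink toward zero as $t$ grows. Far out, the uncertainty in $\hat L(t)$ therefore converges to the uncertainty in $L_\infty$ itself, and $L_\infty$ is exactly the parameter that early data pins down most weakly. Inside the observed range, correlations between the parameters keep the band narrower than that. This is why the prediction band widens as you extrapolate forward: not because the curve is more uncertain there, but because the asymptote is the parameter the data is shyest about.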
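A minimal sketch of this propagation, assuming the model $L(t) = L_\infty + A\,(t + t_0)^{-\alpha}$ with $t_0 = 0.4$ and the five walkthrough checkpoints. The helper names `jacobian` and `prediction_std`, and the bound values passed to `curve_fit`, are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

T0 = 0.4  # fixed offset, trillions of tokens

def model(t, L_inf, A, alpha):
    return L_inf + A * (t + T0) ** (-alpha)

def jacobian(t, L_inf, A, alpha):
    """Partial derivatives of the model w.r.t. (L_inf, A, alpha) at token count t."""
    x = (t + T0) ** (-alpha)
    return np.array([1.0, x, -A * x * np.log(t + T0)])

def prediction_std(t, popt, pcov):
    """Delta-method standard deviation of the predicted loss at t."""
    j = jacobian(t, *popt)
    return float(np.sqrt(j @ pcov @ j))

t_obs = np.array([0.5, 1.2, 2.6, 4.0, 6.0])
L_obs = np.array([3.04, 2.65, 2.41, 2.30, 2.22])

# Bounds are ASSUMED placeholder values, not numbers from the text.
popt, pcov = curve_fit(model, t_obs, L_obs, p0=[2.0, 1.0, 0.3],
                       bounds=([1.0, 0.0, 0.05], [L_obs.min(), np.inf, 1.0]),
                       max_nfev=5000)

for t in (2.6, 6.0, 14.8, 1e30):
    sd = prediction_std(t, popt, pcov)
    print(f"t = {t:g}: L_hat = {model(t, *popt):.3f} +/- {sd:.3f}")
print(f"std of L_inf alone: {np.sqrt(pcov[0, 0]):.3f}")
```

Comparing the band at a mid-run checkpoint with the band at the planned horizon shows the widening directly, and the very large `t` row shows the band settling at the standard deviation of `L_inf`.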
Why this fit beats a quadratic in $\log t$
A common ad-hoc alternative is to fit a quadratic in $\log t$: $L(t) = a + b \log t + c\,(\log t)^2$. It looks smooth on a log-x plot and the fit is linear (just least squares). Two reasons it fails:
- No asymptote. The quadratic diverges to $\pm\infty$ (depending on the sign of its leading coefficient) as $t \to \infty$. For short extrapolation horizons it does not matter, but at the scale where you are deciding whether to spend $5M, an extrapolation method that has the wrong limit is a footgun.
- No physical meaning. The power-law parameters are interpretable — you can compare across architectures, debug runs by reading the parameter trajectory, and refuse fits that exit the physical range. The quadratic coefficients are just numbers.
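The missing asymptote is easy to see numerically. The sketch below fits a quadratic in $\log t$ to the five walkthrough checkpoints with `numpy.polyfit` and evaluates it at increasingly distant horizons; the horizons themselves are arbitrary illustrative choices.

```python
import numpy as np

t_obs = np.array([0.5, 1.2, 2.6, 4.0, 6.0])       # trillions of tokens
L_obs = np.array([3.04, 2.65, 2.41, 2.30, 2.22])  # nats

# Ordinary least squares on (log t, L); coefficients come highest power first.
coeffs = np.polyfit(np.log(t_obs), L_obs, deg=2)

def quadratic_in_log_t(t):
    return float(np.polyval(coeffs, np.log(t)))

for horizon in (6.0, 14.8, 1e3, 1e6, 1e30):
    print(f"t = {horizon:g}: quadratic predicts L = {quadratic_in_log_t(horizon):.3f}")
```

The fit is reasonable inside the observed range, but the squared-log term eventually dominates, so the prediction at the far horizons is unbounded rather than levelling off.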
Manual Numerical Walkthrough
Let us fit a power law to five checkpoints by hand and predict the final loss at the planned horizon. Numbers chosen to be realistic for a frontier MoE pretraining run.
Step 1: the five observations. Tokens seen (trillions) and per-token cross-entropy:

| $i$ | $t_i$ (T tokens) | $L_i$ (nats) |
|---|---|---|
| 1 | 0.5 | 3.04 |
| 2 | 1.2 | 2.65 |
| 3 | 2.6 | 2.41 |
| 4 | 4.0 | 2.30 |
| 5 | 6.0 | 2.22 |

We will use the locked offset $t_0 = 0.4$ (in trillions of tokens) throughout.
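Step 2: fix two knobs, solve the third in closed form. For any choice of $L_\infty$ and $\alpha$, define $x_i = (t_i + t_0)^{-\alpha}$. The residual after subtracting the asymptote is $r_i = L_i - L_\infty$ and we want $r_i \approx A\,x_i$. Least squares gives:

$$\hat A = \frac{\sum_i r_i\,x_i}{\sum_i x_i^2}$$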
This collapses a 3D optimisation into a 2D grid search, which we can do on paper.
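Step 3: try the textbook seed. Take $L_\infty = 1.95$ and $\alpha = 0.32$. Compute $x_i = (t_i + t_0)^{-0.32}$:

| $i$ | $t_i + t_0$ | $\ln(t_i + t_0)$ | $-0.32 \ln(t_i + t_0)$ | $x_i = \exp(\ldots)$ |
|---|---|---|---|---|
| 1 | 0.9 | -0.1054 | +0.0337 | 1.0343 |
| 2 | 1.6 | +0.4700 | -0.1504 | 0.8604 |
| 3 | 3.0 | +1.0986 | -0.3516 | 0.7035 |
| 4 | 4.4 | +1.4816 | -0.4741 | 0.6224 |
| 5 | 6.4 | +1.8563 | -0.5940 | 0.5521 |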
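Step 4: compute the residuals and $\hat A$.

| $i$ | $L_i$ | $r_i = L_i - 1.95$ | $r_i\,x_i$ | $x_i^2$ |
|---|---|---|---|---|
| 1 | 3.04 | 1.090 | 1.1274 | 1.0698 |
| 2 | 2.65 | 0.700 | 0.6023 | 0.7403 |
| 3 | 2.41 | 0.460 | 0.3236 | 0.4949 |
| 4 | 2.30 | 0.350 | 0.2178 | 0.3874 |
| 5 | 2.22 | 0.270 | 0.1491 | 0.3048 |
| sum | | | 2.4202 | 2.9972 |

$$\hat A = 2.4202 / 2.9972 = 0.8075$$

That is too small: it sits well below the seed value of $A$, so the seed is the issue. With this $\hat A$ the fit residual at the first checkpoint would be $3.04 - (1.95 + 0.8075 \times 1.0343) \approx 0.25$ nats, far too large. The pair $(L_\infty, \alpha) = (1.95, 0.32)$ is wrong for this dataset.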
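The hand arithmetic of steps 3 and 4 can be checked in a few lines of NumPy. This is only a sketch of the closed-form step for the seed pair $(L_\infty, \alpha) = (1.95, 0.32)$ with $t_0 = 0.4$; the variable names are illustrative.

```python
import numpy as np

t_obs = np.array([0.5, 1.2, 2.6, 4.0, 6.0])       # trillions of tokens
L_obs = np.array([3.04, 2.65, 2.41, 2.30, 2.22])  # nats
T0 = 0.4
L_inf_seed, alpha_seed = 1.95, 0.32

x = (t_obs + T0) ** (-alpha_seed)  # x_i = (t_i + t_0)^(-alpha)
r = L_obs - L_inf_seed             # r_i = L_i - L_inf

# Least-squares slope through the origin: A_hat = sum(r*x) / sum(x^2).
A_hat = float(np.sum(r * x) / np.sum(x * x))
residuals = L_obs - (L_inf_seed + A_hat * x)

print(f"A_hat = {A_hat:.4f}")  # about 0.8075, matching the hand calculation
print("residuals:", np.round(residuals, 3))
```

The first residual is roughly 0.25 nats and the later ones are negative, which is the signature of a seed pair with the wrong curvature for these data.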
Step 5: sweep the 2D grid. Repeat steps 3–4 over a grid of $L_\infty$ and $\alpha$ values and keep the cell with the smallest sum of squared residuals. A finer grid would polish those numbers further; we stop here because we have learned what the procedure does.
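The sweep itself is mechanical. Below is a sketch under assumed grid ranges (`L_inf_grid` and `alpha_grid` are illustrative choices, not the grid used in the text); each cell reuses the closed-form $\hat A$ from step 2.

```python
import numpy as np

t_obs = np.array([0.5, 1.2, 2.6, 4.0, 6.0])
L_obs = np.array([3.04, 2.65, 2.41, 2.30, 2.22])
T0 = 0.4

def profile_fit(L_inf, alpha):
    """Closed-form A and sum of squared residuals for a fixed (L_inf, alpha)."""
    x = (t_obs + T0) ** (-alpha)
    r = L_obs - L_inf
    A = np.sum(r * x) / np.sum(x * x)
    sse = np.sum((r - A * x) ** 2)
    return float(A), float(sse)

L_inf_grid = np.linspace(1.70, 2.20, 11)  # ASSUMED grid, kept below min(L_obs)
alpha_grid = np.linspace(0.20, 1.00, 9)   # ASSUMED grid

best_sse, best_L_inf, best_alpha, best_A = np.inf, None, None, None
for L_inf in L_inf_grid:
    for alpha in alpha_grid:
        A, sse = profile_fit(L_inf, alpha)
        if sse < best_sse:
            best_sse, best_L_inf, best_alpha, best_A = sse, L_inf, alpha, A

rmse = np.sqrt(best_sse / len(t_obs))
print(f"best cell: L_inf = {best_L_inf:.2f}, alpha = {best_alpha:.2f}, "
      f"A = {best_A:.3f}, RMSE = {rmse:.4f}")
```

Changing the grid spacing and watching how far the winning cell moves is a quick way to see how coarse a grid can be before the answer stops being trustworthy.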
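Step 6: extrapolate. Plug $t = T_{\text{final}}$ into the model with the grid-selected parameters. The result comes out above the last observed point of $2.22$, which a decreasing loss curve cannot do: the fit has degenerated. The paper-grade grid was too coarse to find the right basin; the real optimum sits at a different $(L_\infty, \alpha, A)$ and gives a sensible $L(T_{\text{final}})$.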
Step 7: what the manual walk teaches. Manual grids are useful for understanding the mechanism but unreliable for the actual numbers. The fit lives in a narrow basin where small errors in one parameter compound into wildly wrong values for the others. The right tool is a bounded non-linear least-squares solver (`scipy.optimize.curve_fit` with the bounds from the mathematics section above), which is what the Python predictor later in this section does in one line.
Step 8: decide. With the true fit and the run's target loss, the predicted $L(T_{\text{final}})$ sits just inside tolerance. STATUS: ON TRACK. Run continues. The same arithmetic applied at $1.2$T tokens (only the first two checkpoints in hand) would give a far less trustworthy prediction, because the uncertainty on $L_\infty$ is wide that early.
Visualizing the Extrapolation
The interactive below runs three different stylised pretraining runs. Slide the "Training observed" bar to control how far into the run the predictor has seen. The emerald solid line is what the predictor has seen; the emerald dashed line is the true future the predictor has not seen; the blue line is the predictor's extrapolation, with its confidence band. The numeric error at the top right is the gap between the predicted and actual final losses.
Three things to read out of the sandbox. First: on the healthy run, the extrapolation locks onto the true final loss early in the run (the production table later in this section puts the first solid verdict at roughly 15% of the token budget). Before that the confidence band is huge; after it the band collapses and the prediction is solid. Second: on the noisy run the SAME true curve produces a much wider confidence band early, because the predictor is honest about being less sure. It still converges, just later. Third: the phase-change run is the cautionary tale. A late LR-decay cliff knocks the final loss below the smooth-extrapolation prediction. The predictor confidently misses, because the data it has seen does not contain the cliff. This is exactly the case where the engineer must augment the predictor with knowledge of the LR schedule; see "Engineering Reality and Gotchas" below.
Plain Python: Fitting a Power Law to Five Checkpoints
Below is the canonical offline version of the predictor. Five checkpoints, three fitted parameters, a single decision at the end. This is the script frontier labs run at their morning standup over the previous night's checkpoints.
Two structural details deserve a second look. First, the bounds passed to `curve_fit` are doing more work than they look like they are: without them, `curve_fit` will happily pick a degenerate solution with an implausibly low $L_\infty$ and a giant $A$, fitting your data well and predicting a wildly wrong final loss. Bounds are the difference between a production predictor and a toy. Second, the decision block at the end is the entire user interface of the system: three numbers (`target`, `tol`, `T_final`) in, one verdict string out. Everything else is plumbing.
Sanity-check yourself. Run the script with the target set to `1.5` (impossibly low) and see the verdict flip to `KILL`. Then set target to `2.5` (already easily hit) and see the verdict flip to `UNDERSPENT`. If those two flips do not happen, your decision logic is broken, which is far more dangerous than a bad fit.
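A minimal sketch of an offline predictor of this kind, assuming the walkthrough's five checkpoints, a planned horizon of 14.8T tokens, and placeholder values for the bounds and the tolerance. The function names `predict_final_loss` and `verdict`, and the verdict thresholds, are illustrative assumptions rather than a reproduction of any lab's script.

```python
import numpy as np
from scipy.optimize import curve_fit

T0 = 0.4  # fixed offset, trillions of tokens

def model(t, L_inf, A, alpha):
    return L_inf + A * (t + T0) ** (-alpha)

def predict_final_loss(t, L, t_final):
    """Bounded power-law fit, then extrapolation to t_final."""
    lower = [1.0, 0.0, 0.05]        # ASSUMED entropy floor and alpha range
    upper = [L.min(), np.inf, 1.0]  # asymptote cannot exceed the best loss seen
    p0 = [L.min() - 0.2, 1.0, 0.3]
    popt, _ = curve_fit(model, t, L, p0=p0, bounds=(lower, upper), max_nfev=5000)
    return float(model(t_final, *popt))

def verdict(predicted, target, tol):
    """Three-way decision on the extrapolated final loss."""
    if predicted > target + tol:
        return "KILL"        # run will miss the target
    if predicted < target - tol:
        return "UNDERSPENT"  # run will beat the target with room to spare
    return "ON_TRACK"

t_obs = np.array([0.5, 1.2, 2.6, 4.0, 6.0])
L_obs = np.array([3.04, 2.65, 2.41, 2.30, 2.22])

T_FINAL = 14.8  # planned horizon, trillions of tokens
TOL = 0.03      # ASSUMED tolerance in nats

predicted = predict_final_loss(t_obs, L_obs, T_FINAL)
print(f"predicted final loss: {predicted:.3f}")
for target in (1.5, 2.1, 2.5):  # 2.1 is an ASSUMED example target
    print(f"target = {target}: {verdict(predicted, target, TOL)}")
```

The loop at the end runs the two sanity-check targets from the text alongside one plausible target, so a broken comparison in `verdict` shows up immediately.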
PyTorch: A Live Predictor in the Training Loop
The offline script is fine for morning standups. At frontier scale you want the predictor running inside the training loop, emitting a verdict every few thousand steps, and able to pull the kill switch by itself. Below is the production pattern: a thin CPU-only LossPredictor class, an update() call from every logging step, and a predict() + verdict() call from every refit step.
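A minimal sketch of that pattern follows. It is deliberately free of any PyTorch import so that it runs on its own: the training step is replaced by a synthetic loss drawn from an assumed power law plus noise, and in a real loop the value passed to `update()` would be something like `loss.item()`. The bounds, cadences, token rate, and target are all assumed toy values.

```python
import numpy as np
from scipy.optimize import curve_fit

class LossPredictor:
    """CPU-only power-law loss predictor (illustrative sketch)."""

    def __init__(self, t_final, target, tol, t0=0.4, min_points=5):
        self.t_final, self.target, self.tol = t_final, target, tol
        self.t0, self.min_points = t0, min_points
        self.tokens, self.losses = [], []

    def update(self, tokens_seen, loss):
        # Two list appends; cheap enough to call at every logging step.
        self.tokens.append(float(tokens_seen))
        self.losses.append(float(loss))

    def _model(self, t, L_inf, A, alpha):
        return L_inf + A * (t + self.t0) ** (-alpha)

    def predict(self):
        if len(self.tokens) < self.min_points:
            return None
        t, L = np.array(self.tokens), np.array(self.losses)
        lower = [1.0, 0.0, 0.05]        # ASSUMED floor and alpha range
        upper = [L.min(), np.inf, 1.0]
        p0 = [L.min() - 0.2, 1.0, 0.3]
        try:
            popt, _ = curve_fit(self._model, t, L, p0=p0,
                                bounds=(lower, upper), max_nfev=5000)
        except RuntimeError:            # fit did not converge; skip this refit
            return None
        rmse = float(np.sqrt(np.mean((L - self._model(t, *popt)) ** 2)))
        return {"L_inf": float(popt[0]), "A": float(popt[1]),
                "alpha": float(popt[2]), "rmse": rmse, "n_obs": len(t),
                "pred": float(self._model(self.t_final, *popt))}

    def verdict(self, info):
        if info["pred"] > self.target + self.tol:
            return "KILL"
        if info["pred"] < self.target - self.tol:
            return "UNDERSPENT"
        return "ON_TRACK"

rng = np.random.default_rng(0)
TOKENS_PER_STEP = 1e-3             # ASSUMED toy rate, trillions of tokens per step
LOG_EVERY, REFIT_EVERY = 50, 1000  # dense logging, sparse refits

predictor = LossPredictor(t_final=14.8, target=2.11, tol=0.05)
info = None
for step in range(1, 5001):
    tokens_seen = step * TOKENS_PER_STEP
    # Stand-in for a real training step: synthetic power law plus noise.
    loss = 2.0 + 0.95 * (tokens_seen + 0.4) ** (-0.8) + rng.normal(0.0, 0.01)
    if step % LOG_EVERY == 0:
        predictor.update(tokens_seen, loss)
    if step % REFIT_EVERY == 0:
        info = predictor.predict()
        if info is not None:
            status = predictor.verdict(info)
            print(f"step {step}: pred = {info['pred']:.3f}, "
                  f"rmse = {info['rmse']:.4f}, verdict = {status}")
            # A production loop might raise SystemExit here when status == "KILL".
```

Keeping the fit behind `predict()` and the policy behind `verdict()` means the thresholds can change without touching the fitting code.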
Three points about how this pattern interacts with the rest of the training stack:
- The predictor is CPU-only and never blocks the GPU. `update()` is two list appends. `predict()` runs `curve_fit` once per 1000 steps, about 50 ms on a single CPU core, against a training step that takes 2–10 seconds. If your refit ever shows up in the GPU profile, you have a bug, not a tradeoff.
- Logging cadence and refit cadence are decoupled. We log loss at a dense cadence (for diagnostics) but refit only every 1000 steps (sparse, for cost). Mixing the two is the most common mistake: labs that refit on every log point spend more on `curve_fit` than on training.
- The verdict drives an automatic action. KILL raises SystemExit, which the cluster orchestrator interprets as "tear down this job." UNDERSPENT writes a ticket for the experiment owner. ON_TRACK is silent. A predictor that emits verdicts and does not act on them is theatre; the point of the predictor is to close the loop.

Log the `info` dict (fitted parameters, RMSE, n_obs, verdict) to a separate CSV. After the run you can chart $L_\infty$ over training and see when the asymptote stabilised. If $L_\infty$ is still drifting late in training, you have evidence the run was not in its asymptotic regime and the verdict was less reliable than it looked. This single CSV is the cheapest postmortem tool in the training stack.

At Massive Scale: Killing Bad Runs Before They Bankrupt You
Drop production numbers into the predictor and the economics become clear:
| Quantity | Order of magnitude | Comment |
|---|---|---|
| Training run cost (DeepSeek-V3 scale) | ~ $5M GPU time, ~ 60 days wall clock | The base rate the predictor protects. |
| Checkpoints needed for a confident verdict | ~ 5-8 | Spaced roughly logarithmically - 0.5T, 1T, 2T, 4T, 7T. After ~ 15% of the budget the predictor is solid. |
| Prediction error at 15% of run | ~ 0.01-0.03 nats | Comparable to per-batch noise; well below the 0.05-nat gap between a strong run and a mediocre one. |
| Refit cost | ~ 50 ms / refit / 1000 steps | Effectively free against multi-second training steps. |
| Compute saved by an early KILL verdict at 15% of run | ~ $4M out of $5M | The headline number. One avoided bad run pays for the predictor for a decade. |
Two observations on what this means strategically. First, the predictor is the cheapest piece of training infrastructure measured in dollars-saved-per-engineering-hour — usually by two orders of magnitude. A solid predictor pays for the entire training infrastructure team in a single avoided bad run. Second, the same predictor enables cheap hyperparameter validation on proxy runs: train a 7B model for 100B tokens, fit the predictor, extrapolate to the 14.8T the 670B will eventually see. If the proxy run's extrapolated loss curve does not match the target shape, you found a bad hyperparameter setting at 1% of the cost of finding it on the full run. This is how DeepSeek and Meta cheaply sweep hyperparameter spaces that would be unaffordable on the full model.
Where the predictor sits in the experiment lifecycle
- Phase 1: proxy sweeps. Run dozens of 7B-scale models for 100B tokens each. Fit the predictor to each run. Extrapolate to the planned full-scale horizon. Pick the hyperparameter configuration whose extrapolated final loss is best.
- Phase 2: full launch with the predictor enabled. Launch the full 670B run with `LossPredictor` wired into the training loop. Refit every 1000 steps. The first solid verdict arrives at ~ 15% of the budget.
- Phase 3: checkpoint-driven decisions. If the verdict is KILL, the cluster tears down the job and the team debugs the proxy-to-full discrepancy. If ON_TRACK, the run continues to completion. If UNDERSPENT, the team trims the token budget or repurposes the spare compute on a parallel ablation. None of this requires waiting for the run to finish.
Engineering Reality and Gotchas
The predictor looks like a tidy fit-and-extrapolate problem. Five failure modes show up in production:
- The LR schedule has a cliff the predictor cannot see. A cosine decay or a late-run linear-to-zero schedule carves an additional drop out of the final loss that the smooth power-law model cannot anticipate. The fix is to give the predictor knowledge of the schedule: fit the power law to the loss curve scaled by the current LR ratio, or maintain two separate fits, one for the pre-decay phase and one for the decay phase. Frontier labs typically do the latter and combine via a known multiplier.
- Early checkpoints lie when warmup is long. The first 1–5% of training is warmup, where the loss curve is dominated by the schedule, not the asymptotic mechanism. Fitting the power law to those points pulls the fit toward the wrong shape. Start the fit AFTER warmup ends, or filter points below a minimum tokens-seen threshold.
- Loss spikes invalidate the smoothness assumption. A single optimizer instability at step 50k can dump a spike point that biases the fit. Robust regression (Huber loss instead of L2) is the standard fix. Alternatively, detect and remove outliers whose residuals exceed 3× the RMSE before fitting.
- Emergent phase transitions break the model. The power-law model assumes smooth approach to the asymptote. Section 9.3 showed that some benchmark metrics undergo a phase transition where the loss curve looks identical but the downstream eval suddenly improves. A pretraining-loss predictor cannot capture an emergent-ability transition; augment it with eval-loss checkpoints whose curves you also extrapolate separately.
- The predictor trusts data quality. If the training pipeline starts feeding lower-quality shards after step 100k (a deduplication regression, a botched mixing-ratio change), the loss curve will bend upward relative to the power law and the predictor will (correctly) emit KILL. Treat a sudden change in residual RMSE as a data-pipeline alert rather than a predictor bug — the predictor caught the data issue, that is its job.
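Two of the fixes above, skipping warmup points and swapping L2 for a Huber loss, can be sketched with `scipy.optimize.least_squares`. The data here are synthetic: the ground-truth parameters, noise level, injected spike, warmup threshold, and `f_scale` are all assumed values chosen for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

T0 = 0.4
TRUE = (2.0, 0.95, 0.8)  # ASSUMED synthetic ground truth (L_inf, A, alpha)
T_FINAL = 14.8

def model(t, L_inf, A, alpha):
    return L_inf + A * (t + T0) ** (-alpha)

rng = np.random.default_rng(1)
t_all = np.linspace(0.05, 6.0, 40)
L_all = model(t_all, *TRUE) + rng.normal(0.0, 0.003, t_all.size)
L_all[35] += 1.0         # one injected loss spike late in the observed window

WARMUP_TOKENS = 0.3      # ASSUMED threshold: ignore points before this
keep = t_all >= WARMUP_TOKENS
t, L = t_all[keep], L_all[keep]

def fit(loss):
    """Bounded fit; loss='linear' is plain L2, loss='huber' downweights outliers."""
    result = least_squares(
        lambda p: model(t, *p) - L,
        x0=[L.min() - 0.2, 1.0, 0.3],
        bounds=([1.0, 0.0, 0.05], [L.min(), np.inf, 1.0]),
        loss=loss,
        f_scale=0.02,    # residual size (nats) where Huber turns linear
    )
    return result.x

truth = model(T_FINAL, *TRUE)
pred_plain = model(T_FINAL, *fit("linear"))
pred_huber = model(T_FINAL, *fit("huber"))
print(f"true final loss  : {truth:.3f}")
print(f"plain L2 estimate: {pred_plain:.3f}")
print(f"Huber estimate   : {pred_huber:.3f}")
```

The `f_scale` value should sit a little above the normal per-checkpoint noise so that ordinary points are still treated quadratically and only genuine spikes are downweighted.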
The one sentence to carry forward: a pretraining run is a 60-day commitment with a 7-day decision point, and the loss predictor is what makes the seventh day actionable — every other piece of the training stack assumes you know whether to keep going.