Chinchilla told us how to spend our compute budget on parameters versus tokens. It did not answer a much harder operational question: what learning rate, batch size, warmup, and weight decay should we use at 671B parameters? Sweep them, you say. Run a few dozen short runs, pick the best, and ship. That answer works at 100M parameters. At frontier scale, the sweep costs far more than the model itself.
The thesis of this section. Hyperparameters do not survive scale-up. The optimal learning rate at width 256 is wrong by an order of magnitude at width 4096. The batch size that converged beautifully at 100M parameters wastes 80% of your gradient signal at 100B. The fix is not to sweep harder — it is to choose a parametrization under which the small-model optimum is also the big-model optimum. That parametrization is called μP (Maximal Update Parametrization), and it is the quietly load-bearing trick behind every frontier training run since 2023.
The Real Problem: You Cannot Sweep at Frontier Scale
Here is the accounting nobody puts on the slide. A single training run of DeepSeek-V3 (671B parameters, 14.8T tokens) costs roughly $5.6M of GPU time and burns about 2.788M H800-hours. The standard hyperparameter sweep at small scale (vary LR over six values, β₂ over three, weight decay over three, warmup over two) is 6 × 3 × 3 × 2 = 108 runs. If you sweep at full scale, you have just spent 108 × $5.6M ≈ $605M on hyperparameter selection alone. Nobody does this. Nobody can.
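The sweep arithmetic is easy to reproduce. A minimal sketch using only the figures quoted in this section; the variable names are illustrative:

```python
from math import prod

# Grid sizes quoted in the text: LR, beta2, weight decay, warmup.
grid = {"lr": 6, "beta2": 3, "weight_decay": 3, "warmup": 2}

# Cost of one full-size run in USD, as quoted in the text.
cost_per_run_usd = 5.6e6

n_runs = prod(grid.values())
sweep_cost_usd = n_runs * cost_per_run_usd

print(f"{n_runs} full-size runs cost about ${sweep_cost_usd / 1e6:.0f}M")
```

Adding one more three-valued hyperparameter to `grid` triples the total, which is why grid sweeps become infeasible long before frontier scale.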
So everyone falls back to one of three options, and the first two are traps.
| Strategy | What it looks like | Why it fails at frontier scale |
|---|---|---|
| Sweep at full scale | Run 100+ full-size training runs and pick the best. | Costs $100M–1B per generation. Even the largest labs do not have this budget. |
| Sweep at small scale, hope for transfer | Tune at 1B parameters under standard parametrization (SP), reuse the LR at 671B. | Under SP the optimal LR drifts with width. The 1B-tuned LR is 4–16× too high at 671B. The big run diverges in the first 500 steps or trains stably to a worse loss. |
| Tune small under μP, ship big | Tune at 1B parameters under Maximal Update Parametrization, reuse the SAME hyperparameters at 671B. | Works. Empirically validated by the DeepSeek-V3 paper, the Tensor Programs V paper, and the public Llama-3 technical report. |
Intuition: A Wind-Tunnel Model for Hyperparameters
Aircraft engineers do not test full-size jets in wind tunnels. They test scale models — 1/20th the size — and they use dimensionless numbers (Reynolds, Mach) to make sure the small-model behaviour is dynamically similar to the full-size behaviour. If the Reynolds number matches, the flow pattern matches, no matter the absolute scale.
μP does the same thing for neural networks. Width is the scale axis — the analogue of model size in the wind tunnel. The "dimensionless numbers" are the per-layer learning-rate scalings. The promise of μP is exactly the promise of dimensional analysis: if the rescaled hyperparameters are right at the small scale, they are right at every scale.
The intuition for why standard parametrization fails is gradient accumulation in a different guise. Consider one row of a weight matrix in a wide layer. When the model is $n$ units wide, that row participates in $n$ dot products on every forward pass. The gradient that flows back into that row is therefore a sum of $n$ error terms. A single SGD step of size $\eta$ changes the row by $\eta$ times that sum, and that sum grows with $n$. The natural fix is to shrink $\eta$ in proportion to $1/n$: that is exactly μP's prescription for hidden-layer learning rates.
The mental flip. In SP, you tune one global $\eta$ and the per-layer effective LR is whatever it happens to be at that width. In μP, you tune one global $\eta$ and the per-layer effective LR is explicitly rescaled by width so that the per-update change in each row stays the same across scales. The forward pass is unchanged. The optimizer is unchanged. Only the per-layer LR multiplier moves.
The Mathematics of μP and Scaling Rules
Let $n$ be the hidden width of a transformer and $n_0$ a fixed reference width (typically the proxy you can afford to sweep). Partition every parameter into one of three layer kinds:
- Input: token + positional embeddings, first projection from a width-independent input dimension. Fan-in is fixed (vocab size, embedding dim) and does not change with $n$.
- Hidden: Q, K, V, attention output projection, MLP up- and down-projection. Both fan-in and fan-out scale with $n$.
- Output: final projection to the vocabulary logits. Fan-in scales with $n$; fan-out is fixed (vocab size).
Under μP with the AdamW optimizer, the per-layer learning-rate multiplier follows the table below. The init std $1/\sqrt{\text{fan\_in}}$ appears in every row because that is the standard fan-in initialization: this recipe rescales the LR, not the init.
| Layer kind | LR multiplier | Forward multiplier | Init std |
|---|---|---|---|
| Input | 1 | 1 | 1 / sqrt(fan_in) |
| Hidden | n_0 / n | 1 | 1 / sqrt(fan_in) |
| Output | n_0 / n | n_0 / n | 1 / sqrt(fan_in) |
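The effective learning rate of layer $\ell$ at width $n$ is therefore $\eta_\ell(n) = \eta \cdot m_\ell(n)$, where $m_\ell(n)$ is that layer's LR multiplier from the table (1 for input layers, $n_0/n$ for hidden and output layers) and $\eta$ is a single global learning rate you tuned at the proxy. The key claim of μP, proven via the Tensor Programs framework in Yang & Hu (2022), is that the optimum of $\eta$ is invariant in $n$ as $n \to \infty$. In plain words: tune $\eta$ at $n_0$, use the same $\eta$ at every $n$.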
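The multiplier table translates directly into a lookup. The sketch below is one way to write it, assuming the three layer kinds defined in this section; the function names `mup_multipliers` and `effective_lr` are illustrative and do not come from any library.

```python
def mup_multipliers(kind, base_width, width):
    """Return (lr_multiplier, forward_multiplier) for one layer kind.

    Mirrors the table above: input layers are left alone, hidden layers get
    their LR shrunk by n_0 / n, and the output layer gets both its LR and its
    forward pass shrunk by n_0 / n.
    """
    ratio = base_width / width
    table = {
        "input": (1.0, 1.0),
        "hidden": (ratio, 1.0),
        "output": (ratio, ratio),
    }
    return table[kind]


def effective_lr(kind, base_lr, base_width, width):
    """Per-layer learning rate: the global LR times the layer's multiplier."""
    lr_mult, _ = mup_multipliers(kind, base_width, width)
    return base_lr * lr_mult


for kind in ("input", "hidden", "output"):
    print(kind, effective_lr(kind, base_lr=3e-3, base_width=256, width=4096))
```

At `width == base_width` every multiplier is 1, so the proxy itself trains exactly as it would under a plain single-LR setup.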
Two other hyperparameters need their own scaling rules. Batch size follows McCandlish et al. (2018): there is a critical batch size $B_{\text{crit}}$ below which doubling the batch nearly halves training time, and above which doubling the batch barely helps. Empirically $B_{\text{crit}}$ grows as the loss falls, so bigger models, which reach lower loss, can absorb bigger batches before saturating. Weight decay is held at its proxy value (no width rescaling under decoupled AdamW), and warmup steps are kept constant in absolute count, not as a fraction of total steps.
Putting these together gives the modern compute-optimal hyperparameter recipe: pick model width $n$ and training token count from Chinchilla; pick batch size near $B_{\text{crit}}$; pick learning rate $\eta$ by sweeping a proxy at width $n_0$ with μP; pick weight decay and β₂ at the same proxy; keep warmup at a fixed absolute step count (typically 2000–4000). Five hyperparameter decisions, of which only LR needs a real sweep, and that sweep runs at the proxy.
Manual Numerical Walkthrough
Walkthrough: scaling LR from width 256 to width 4096
Suppose you swept the LR at a proxy of width $n_0 = 256$ and found the optimum at $\eta = 3 \times 10^{-3}$. You now want to train the production model at width $n = 4096$. Compute the per-layer effective LRs under μP and SP and compare.
| Layer | Mode | μP effective LR at n=4096 | SP effective LR at n=4096 |
|---|---|---|---|
| tok_emb | input | 3e-3 × 1 = 3e-3 | 3e-3 |
| attn.q_proj | hidden | 3e-3 × (256/4096) = 1.88e-4 | 3e-3 (same as proxy — wrong) |
| mlp.up_proj | hidden | 1.88e-4 | 3e-3 |
| lm_head | output | 3e-3 × (256/4096) = 1.88e-4, forward multiplier 1/16 | 3e-3, no forward rescale |
Now compute the magnitude of a single AdamW update to attn.q_proj. Suppose the gradient has per-element RMS $g$ at width 256, and that the same per-element RMS holds at width 4096 under SP (fan-in init is preserved). Each row now has 16× more entries, so the row's gradient norm has grown by $\sqrt{16} = 4\times$. AdamW normalizes each element by its running RMS, so the per-element update magnitude is roughly $\eta$ whatever the gradient scale, and the SP update at width 4096 is still $3 \times 10^{-3}$ per element. Those per-element changes are correlated with the incoming activations, so they add up coherently across a dot product that is 16× longer: each output coordinate moves about sixteen times further per step than it did at width 256, which is more than is stable at this width. Result: training diverges in the first few hundred steps, exactly the failure mode the DeepSeek-V3 paper attributes to bad LR scaling on early checkpoints.
Under μP, the per-element AdamW update at the same layer is $3 \times 10^{-3} \times (256/4096) \approx 1.88 \times 10^{-4}$, sixteen times smaller. That is precisely the factor needed to keep the per-step change in each output coordinate comparable to width 256. The forward pass produces the same activation magnitudes (because init was already fan-in-correct), so the loss curve looks like a shifted copy of the width-256 curve: same shape, lower asymptote, same optimal hyperparameters.
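Now the proxy savings. At fixed depth, parameter count scales with the square of width, so width 256 has roughly $(256/4096)^2 = 1/256$ the parameters of width 4096, i.e. roughly 0.4% as many. If each proxy run sees as many tokens as the full run, a six-value LR sweep at the proxy costs about $6/256 \approx 2.3\%$ of one full-size training run, and proportionally less on a shorter token budget. Under μP, you do exactly one sweep at the proxy and use the winner verbatim at full scale. Under SP, you would need to sweep at every scale you train: a 6× multiplier on every Chinchilla rung.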
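The proxy-cost estimate can be written as a one-line formula. The sketch below rests on two assumptions that should be checked against a real model: depth is held fixed, so parameters scale with the square of width, and compute scales with parameters times tokens. The function name and the `token_fraction` knob are illustrative.

```python
def proxy_sweep_fraction(base_width, width, n_sweep_runs, token_fraction=1.0):
    """Cost of a proxy sweep as a fraction of one full-size training run.

    Assumes fixed depth (parameters ~ width**2) and compute ~ parameters * tokens.
    token_fraction is the share of the full token budget each proxy run sees.
    """
    param_ratio = (base_width / width) ** 2
    return n_sweep_runs * param_ratio * token_fraction


full_tokens = proxy_sweep_fraction(256, 4096, n_sweep_runs=6)
tenth_tokens = proxy_sweep_fraction(256, 4096, n_sweep_runs=6, token_fraction=0.1)
print(f"same tokens: {full_tokens:.2%}, one tenth of the tokens: {tenth_tokens:.2%}")
```

Embeddings scale linearly rather than quadratically with width, so for small proxies the true parameter ratio is somewhat larger than this estimate.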
Visualizing LR Transfer
The visualizer for this section plots loss against learning rate $\eta$ at three widths (256, 1024, 4096) under both parametrizations, SP and μP.
The qualitative pattern reproduces the result in Figure 1 of the Tensor Programs V paper. Under SP, the three U-shaped curves drift to the left as width grows: the optimal LR at width 4096 is about 16× smaller than at width 256. Under μP, the three curves stack: the optimum sits at the same $\eta$ regardless of width. That stacking is the entire promise of μP, and it is what every frontier lab now exploits to keep hyperparameter search costs bounded as model scale grows.
Plain Python: A μP Coordinate-Check Simulator
Before you trust μP in a 671B training run, you have to verify it empirically on a tiny model. The standard diagnostic is the coordinate check: train the same network at several widths for a few steps, log the per-layer activation magnitudes and gradient norms, and confirm that the per-step change in each coordinate stays roughly constant across widths. The script below does exactly that, in pure NumPy.
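A minimal sketch of such a coordinate check follows. It is not a production script and rests on several assumptions: a three-layer ReLU MLP stands in for the transformer, the data is synthetic with a constant toy target, Adam is hand-rolled without weight decay, and the function name `coord_check` is illustrative. It reports the mean per-step change in a hidden pre-activation coordinate rather than the loss.

```python
import numpy as np


def coord_check(width, base_width=256, base_lr=1e-2, mode="mup", steps=3, seed=0):
    """Run a few Adam steps on a toy MLP and return the mean per-step change
    of the hidden-layer pre-activations on a fixed batch."""
    rng = np.random.default_rng(seed)
    d_in, batch = 32, 32
    x = rng.standard_normal((batch, d_in))
    y = np.full((batch, 1), 3.0)  # toy constant target (assumption)

    # Fan-in init for every layer, in both parametrizations.
    params = {
        "input": rng.standard_normal((width, d_in)) / np.sqrt(d_in),
        "hidden": rng.standard_normal((width, width)) / np.sqrt(width),
        "output": rng.standard_normal((1, width)) / np.sqrt(width),
    }
    ratio = base_width / width if mode == "mup" else 1.0
    lr = {"input": base_lr, "hidden": base_lr * ratio, "output": base_lr * ratio}
    out_mult = ratio  # forward multiplier on the output layer

    mom = {k: np.zeros_like(p) for k, p in params.items()}
    var = {k: np.zeros_like(p) for k, p in params.items()}
    b1, b2, eps = 0.9, 0.999, 1e-8

    deltas = []
    for t in range(1, steps + 1):
        # Forward pass.
        h1 = x @ params["input"].T
        a1 = np.maximum(h1, 0.0)
        h2 = a1 @ params["hidden"].T
        a2 = np.maximum(h2, 0.0)
        pred = out_mult * (a2 @ params["output"].T)

        # Backward pass for the loss 0.5 * mean((pred - y) ** 2).
        d_pred = (pred - y) / batch
        grads = {"output": out_mult * (d_pred.T @ a2)}
        d_h2 = out_mult * (d_pred @ params["output"]) * (h2 > 0)
        grads["hidden"] = d_h2.T @ a1
        d_h1 = (d_h2 @ params["hidden"]) * (h1 > 0)
        grads["input"] = d_h1.T @ x

        # Adam update with per-layer learning rates.
        for k in params:
            mom[k] = b1 * mom[k] + (1 - b1) * grads[k]
            var[k] = b2 * var[k] + (1 - b2) * grads[k] ** 2
            m_hat = mom[k] / (1 - b1 ** t)
            v_hat = var[k] / (1 - b2 ** t)
            params[k] = params[k] - lr[k] * m_hat / (np.sqrt(v_hat) + eps)

        # How far did each hidden coordinate move in this step?
        new_h2 = np.maximum(x @ params["input"].T, 0.0) @ params["hidden"].T
        deltas.append(np.abs(new_h2 - h2).mean())
    return float(np.mean(deltas))
```

Calling `coord_check(n, mode="sp")` and `coord_check(n, mode="mup")` over a range of widths gives the two curves to compare. On this toy problem the SP figure should grow with width while the μP figure should stay roughly flat. Width 4096 allocates several 4096 × 4096 float64 arrays, so smaller widths are a more comfortable starting point.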
What the coord check tells you. Under SP, the printed loss at width 4096 with base_lr = 1e-2 should be visibly worse than at width 256 — the same global LR was too large for the wider model and the optimizer overshot. Under μP, the losses across widths land within a few percent of each other. If your μP wiring is wrong (a layer misclassified as "input" instead of "hidden", say), the coord check fails LOUDLY: one width diverges and the others converge. That is exactly the signal you want to see in CI before any expensive run launches.
PyTorch: μP Wiring on a Real Transformer
The production pattern keeps μP entirely inside the layer definitions and the optimizer setup. The rest of the training stack — FSDP, gradient accumulation, bf16, activation checkpointing — is unchanged.
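The optimizer half of that wiring can be sketched without importing any framework. The helper below builds parameter groups as plain dicts in the shape PyTorch optimizers accept (a `params` list plus per-group options). Everything specific here is an assumption for illustration: the `classify` naming convention, the default `weight_decay`, the extra `kind` and `width_mult` keys, and the choice to leave biases and norm gains at the unscaled LR. The output layer's forward multiplier belongs in the model's forward pass and is not shown.

```python
def classify(name):
    """Hypothetical naming convention mapping a parameter name to a layer kind."""
    if "emb" in name or name.endswith(".bias") or "norm" in name:
        return "input"  # width-independent LR (assumption for biases and norms)
    if name.startswith("lm_head"):
        return "output"
    return "hidden"


def mup_param_groups(named_params, base_lr, base_width, width, weight_decay=0.1):
    """Group (name, parameter) pairs by layer kind and attach a per-group LR."""
    ratio = base_width / width
    lr_mult = {"input": 1.0, "hidden": ratio, "output": ratio}
    groups = {}
    for name, param in named_params:
        kind = classify(name)
        if kind not in groups:
            groups[kind] = {
                "params": [],
                "lr": base_lr * lr_mult[kind],
                "weight_decay": weight_decay,
                "kind": kind,                  # extra key, kept for logging
                "width_mult": lr_mult[kind],   # extra key, kept for logging
            }
        groups[kind]["params"].append(param)
    return list(groups.values())
```

With a PyTorch model, the intended call would look like `torch.optim.AdamW(mup_param_groups(model.named_parameters(), base_lr, base_width, width))`. The rest of the training loop does not need to know that the groups exist.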
Two architectural details worth marking. First, μP composes with optimizer state sharding (ZeRO-3, FSDP). Each shard sees the same per-group LRs because the LR is a scalar on each param group, not a per-parameter tensor. Second, μP does not interact with mixed-precision scaling — the gradient scaler operates on the gradient, not on the learning rate, so bf16 / fp8 training runs see μP as a no-op at the precision boundary.
What the production version adds
Real μP implementations (the open-source mup library and the in-house variants at DeepMind, Anthropic, and Meta) extend this skeleton in three ways:
- Attention scaling. The dot-product score temperature becomes $1/d$ under μP, in place of the usual $1/\sqrt{d}$ (the so-called "μP attention scaling"). This keeps the pre-softmax logits O(1) at every head dimension $d$.
- Embedding LR. Token embeddings get their own LR multiplier; some teams add a further width-dependent scale to keep the embedding dynamics aligned with the rest of the network.
- Reparam vs init μP. Two equivalent recipes exist: rescale the LR per layer (what we showed) or rescale the init per layer and use a single global LR. Reparametrization is easier to integrate with existing optimizer code and is the variant DeepSeek-V3 and Llama-3 use.
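The attention-scaling item in the list above is the easiest one to see numerically. At initialization the query and key are nearly independent and their dot product grows like the square root of the head dimension, but once training aligns them it grows like the head dimension itself. The sketch below uses a deliberately extreme case, perfectly aligned all-ones vectors, to show the difference; `attention_weights` and `aligned_logit` are illustrative helpers, not library functions.

```python
import numpy as np


def attention_weights(q, k, mup=True):
    """Softmax attention weights with either 1/d (muP) or 1/sqrt(d) scaling."""
    d = q.shape[-1]
    scale = 1.0 / d if mup else 1.0 / np.sqrt(d)
    logits = (q @ k.T) * scale
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def aligned_logit(d, mup=True):
    """Pre-softmax logit when q and k are perfectly aligned with O(1) entries."""
    q = np.ones(d)
    k = np.ones(d)
    scale = 1.0 / d if mup else 1.0 / np.sqrt(d)
    return float(q @ k) * scale


for d in (64, 256, 1024):
    print(d, aligned_logit(d, mup=False), aligned_logit(d, mup=True))
```

Under the $1/\sqrt{d}$ rule the aligned logit grows with the head dimension, which sharpens the softmax as the model widens. Under the $1/d$ rule it stays fixed.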
At Massive Scale: How DeepSeek, GPT-4, and Llama Tune
The reason μP matters is not pedagogy — it is the actual procedure every frontier lab now follows. Reconstructed from public technical reports and post-mortems:
| Lab / Model | Proxy size | Target size | Sweep cost | What they tune at the proxy |
|---|---|---|---|---|
| DeepSeek-V3 (2024) | ~1.4B active / ~10B total | 37B active / 671B total | ~0.5% of full run | LR, β₂, weight decay, expert router temperature, MTP head loss weight |
| Llama-3 (2024) | ~8B | 405B | ~1% of full run | LR, batch size schedule, warmup, weight decay |
| GPT-4 (2023, inferred) | Reported as a 'small model with predictable scaling' (system card) | ~1.8T MoE (rumoured) | Not disclosed; described as 'reliable extrapolation from ≤10000× smaller' | Loss curve, LR, optimization stability flags |
The DeepSeek-V3 paper is unusually forthcoming about the recipe. Section 5.2 reports that they ran a μP-style proxy sweep at a 1.4B active / 10B total MoE, picked the LR (4.2e-4) and β₂ (0.95) there, then trained the 671B run with those same values and no further tuning. They report a single LR-related restart in the entire 2.788M H800-hour run, attributable to a numerical stability issue in fp8, not to LR mis-scaling.
The batch-size half of the story
μP fixes LR transfer. It does not fix batch size. The McCandlish paper's critical batch size $B_{\text{crit}}$ depends on the curvature of the loss and the gradient noise, both of which evolve during training and across scales. In practice every lab measures $B_{\text{crit}}$ empirically by sweeping a few batch sizes early in training (the gradient-noise-scale estimator gives a cheap online estimate) and ramping the batch from a small starting value (4M tokens) to a large terminal value (60M+ tokens for DeepSeek-V3, ~16M for Llama-3). This batch-ramp schedule is independent of μP; the two recipes compose.
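The gradient-noise-scale estimator mentioned above can be sketched from two measurements of the squared gradient norm at different batch sizes. It relies on the identity that the expected squared norm of a batch-$B$ gradient equals the true squared gradient norm plus the trace of the per-example covariance divided by $B$. This follows the "simple noise scale" from McCandlish et al. as recalled here; the function name and argument names are illustrative, and the result is a proxy for $B_{\text{crit}}$, not a replacement for measuring it.

```python
def simple_noise_scale(g_small_sq, g_big_sq, b_small, b_big):
    """Estimate B_simple = tr(Sigma) / |G|^2 from squared gradient norms
    measured at two batch sizes b_small < b_big.

    Uses E[|G_B|^2] = |G|^2 + tr(Sigma) / B and solves the two equations.
    """
    true_grad_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
    trace_cov = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)
    return trace_cov / true_grad_sq


# Synthetic check: |G|^2 = 4 and tr(Sigma) = 100 give these expected norms.
g8 = 4 + 100 / 8      # expected squared norm at batch size 8
g64 = 4 + 100 / 64    # expected squared norm at batch size 64
print(simple_noise_scale(g8, g64, b_small=8, b_big=64))
```

Single-step measurements are noisy, so the two intermediate quantities are usually averaged over many steps before taking the ratio.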
Engineering Reality and Failure Modes
Three failure modes account for nearly every μP-related incident that surfaces in public post-mortems.
- Layer misclassification. An embedding layer accidentally tagged as "hidden" gets a width-shrunk LR and learns to a worse representation. A hidden layer accidentally tagged as "input" gets an un-shrunk LR and the training run diverges in the first 1000 steps. Defence: print every parameter group's width_mult before the first step and gate the launch on a manual diff against the reference config. This is a five-line check that catches a $5M bug.
- Coord-check skipped. Teams sometimes assume μP "just works" because the library exports a MuPLinear. Then the attention scaling is missed (still using $1/\sqrt{d}$ instead of $1/d$) and the transfer breaks silently: loss curves at small scale look fine, but the big run lands at a worse loss than expected. Defence: every new model architecture re-runs the coord check before its first scale-up.
- Proxy too small. μP is a large-width limit. Below width ~128 the asymptotic behaviour has not kicked in, and the LR optimum at the proxy will mis-extrapolate by 2–3×. Defence: pick the proxy width as the largest model that fits in your sweep budget, not the smallest. Most teams use proxies in the 128M–2B parameter range; below 100M the transfer gets noisy.
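The pre-launch check for layer misclassification described in the list above can be a few lines. The sketch below assumes each parameter group is a dict carrying a `kind` label and a `width_mult` value, and that a reference config maps each kind to its expected multiplier; both conventions and the function name are hypothetical.

```python
def check_width_mults(param_groups, reference, tol=1e-12):
    """Return a list of human-readable mismatches; an empty list means launch."""
    problems = []
    for group in param_groups:
        kind = group["kind"]
        actual = group["width_mult"]
        expected = reference.get(kind)
        if expected is None:
            problems.append(f"{kind}: not present in reference config")
        elif abs(actual - expected) > tol:
            problems.append(f"{kind}: width_mult {actual} != expected {expected}")
    return problems


reference = {"input": 1.0, "hidden": 256 / 4096, "output": 256 / 4096}
good = [
    {"kind": "input", "width_mult": 1.0},
    {"kind": "hidden", "width_mult": 0.0625},
    {"kind": "output", "width_mult": 0.0625},
]
# A hidden group that was wired with the un-shrunk input multiplier.
bad = [dict(g, width_mult=1.0) if g["kind"] == "hidden" else g for g in good]

print(check_width_mults(good, reference))
print(check_width_mults(bad, reference))
```

Gating the launch on an empty result turns a silent misconfiguration into a hard failure before any GPU time is spent.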
The good news: when μP works (which is most of the time), it is invisible. The training loop looks like a normal AdamW loop. The config file has an extra width_mult column. The optimizer prints a slightly longer param-group list. The reward is that you trained a 671B-parameter model with hyperparameters you tuned in an afternoon on a small proxy — and the post-mortem column titled "LR-related divergences" reads zero.
The big picture. Chinchilla told us how much compute to spend on parameters and tokens. μP told us how to spend it without re-sweeping hyperparameters at every scale. Together, they reduce the cost of training a new frontier model from "sample-efficiency-bound and sweep-bound" to just "sample-efficiency-bound" — which is the regime every post-2023 release operates in. The next section returns to the loss-curve side of the story: scaling laws for inference, where the compute-optimal question flips on its head.