Section 9.1 derived the Chinchilla scaling laws for dense transformers: the famous result that, for a given compute budget, the loss-minimising allocation is roughly twenty tokens per parameter. Those laws are a bedrock prediction tool, and every frontier lab uses them to size dense models. But Chinchilla assumes that every parameter touches every token. Mixture-of-Experts models violate that assumption deliberately. In an MoE, most of the parameters sit idle on any given token; they only wake up when the router sends a token their way. The cost-per-token is decoupled from the capacity-per-token, and the entire Chinchilla framework needs an extra dimension to describe it.
The thesis of this section. Dense scaling laws have two knobs: parameters $N$ and tokens $D$. MoE scaling laws have three: activated parameters $N_{\text{act}}$, tokens $D$, and granularity $G$. The third knob is what lets DeepSeek-V3 ship 671 B parameters for the per-token cost of a 37 B dense model, and the scaling law tells us how much that knob is worth.
The Real Problem: Chinchilla Breaks at G > 1
Run a Chinchilla optimiser on a 671 B-parameter target and it will prescribe roughly 13 T tokens (twenty per parameter), costing on the order of $6ND \approx 5 \times 10^{25}$ FLOPs. That is more compute than any single lab spent in 2024. Yet DeepSeek-V3 trained a 671 B model on 14.8 T tokens at roughly $6 N_{\text{act}} D \approx 3.3 \times 10^{24}$ FLOPs, about sixteen times cheaper than Chinchilla predicts. The dense law does not just mispredict the cost; it mispredicts which side of the budget constraint the model can live on. Something is missing from the formula.
What is missing is the distinction between two things Chinchilla treats as one:
| Quantity | Dense interpretation | MoE reality |
|---|---|---|
| Parameters in the model | N | N_total (huge) |
| Parameters that fire per token | N (same) | N_act (much smaller) |
| Training FLOPs / token | 6 N | 6 N_act |
| Memory / activation footprint | Grows with N | Grows with N_total (full model resident) |
| Capacity for new patterns | Grows with N | Grows sub-linearly with N_total (γ < 1) |
Chinchilla's scaling law uses one $N$ for both the cost column and the capacity column. MoE has to split them. Empirical work by Clark et al. (2022), Krajewski et al. (2024), and the DeepSeek team has converged on a clean three-parameter extension: keep Chinchilla's functional form, replace the capacity term with effective capacity, and let the data decide how much of the extra MoE parameters actually contribute.
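The cost/capacity split can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the usual approximation of 6 training FLOPs per touched parameter per token and the twenty-tokens-per-parameter rule quoted from Section 9.1; the helper name `train_flops` is illustrative.

```python
# Back-of-envelope training budgets using the 6 * params * tokens approximation.
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted from Section 9.1


def train_flops(params_per_token: float, tokens: float) -> float:
    """Approximate training FLOPs: 6 * (parameters touched per token) * tokens."""
    return 6.0 * params_per_token * tokens


n_total = 671e9   # DeepSeek-V3 total parameters
n_act = 37e9      # DeepSeek-V3 activated parameters per token
d_moe = 14.8e12   # DeepSeek-V3 training tokens

# Dense reading: all 671 B parameters fire on every token.
d_dense = TOKENS_PER_PARAM * n_total
c_dense = train_flops(n_total, d_dense)

# MoE reading: only the activated parameters are paid for.
c_moe = train_flops(n_act, d_moe)

print(f"dense-Chinchilla tokens: {d_dense:.3g}")
print(f"dense-Chinchilla FLOPs : {c_dense:.3g}")
print(f"MoE FLOPs              : {c_moe:.3g}")
print(f"dense / MoE cost ratio : {c_dense / c_moe:.1f}x")
```

The only difference between the two budgets is which parameter count is multiplied by 6: the cost row of the table uses the activated count, while memory and capacity still depend on the total.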
Intuition: Two Knobs Instead of One
Think of a dense transformer as a single big specialist. Every token consults the same expert. Its cost-per-token is the specialist's size, and so is its capacity. You cannot grow capacity without growing cost.
An MoE is a panel of small specialists with a triage doctor. Each token sees the triage doctor (the router), gets routed to $k$ specialists out of the whole panel, and only those specialists do work. Cost-per-token is the size of the firing specialists. Capacity is the size of the whole panel. The two are now different numbers, and that is the entire design freedom.
Granularity $G = N_{\text{total}} / N_{\text{act}}$ is the panel-to-firing-doctor ratio. $G = 1$ is a single doctor, a dense model. $G \approx 18$ is the DeepSeek design: 256 experts, top-8 routing, so on every token $8/256 = 1/32$ of the routed-expert parameters fire, modulated by the shared-expert design that brings that ratio up to roughly $1/18$ (37 B of 671 B) in effective FLOPs.
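As a quick illustration of the ratio, the sketch below computes granularity as total parameters over activated parameters, which is the reading implied by the 671 B / 37 B figures in this section. The function name `granularity` is an assumption for illustration.

```python
def granularity(n_total: float, n_act: float) -> float:
    """Panel-to-firing ratio G = N_total / N_act; G = 1 is a dense model."""
    if n_act <= 0 or n_total < n_act:
        raise ValueError("need 0 < n_act <= n_total")
    return n_total / n_act


g_dense = granularity(70e9, 70e9)       # every parameter fires
g_deepseek = granularity(671e9, 37e9)   # DeepSeek-V3 figures from this section
routed_fraction = 8 / 256               # top-8 routing over 256 routed experts

print(f"dense G       : {g_dense:.2f}")
print(f"DeepSeek G    : {g_deepseek:.2f}")
print(f"routed share  : {routed_fraction:.4f}")
print(f"overall share : {1 / g_deepseek:.4f}")
```

The overall activated share is larger than the routed share because shared experts, attention, and embeddings fire on every token.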
The Mathematical Form of an MoE Scaling Law
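Start with Chinchilla. Hoffmann et al. (2022) fit:
$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$
with $E$ the irreducible text entropy, $\alpha$ the parameter exponent, and $\beta$ the data exponent. The first term is unreachable language entropy; the second falls as the model gets bigger; the third falls as the training set grows.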
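Clark et al. (2022), and more precisely Krajewski et al. (2024), showed that the MoE form is the same equation with the capacity term replaced by an effective parameter count:
$$L(N_{\text{act}}, D, G) = E + \frac{A}{\left(N_{\text{act}}\, G^{\gamma}\right)^{\alpha}} + \frac{B}{D^{\beta}}$$
with $G = N_{\text{total}} / N_{\text{act}}$ the granularity and $\gamma < 1$ the MoE conversion exponent. Set $G = 1$ and you recover dense Chinchilla exactly. Set $G > 1$ and you trade total parameters for activated cost at a discount governed by $\gamma$.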
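The functional form can be written down directly. The sketch below is illustrative only: it assumes the published Hoffmann et al. (2022) constants for the dense terms and a conversion exponent of 0.36, the value implied by the 105 B effective / 37 B activated figures quoted in this section. A real fit would replace all of them, so absolute losses from this sketch should not be expected to match the section's worked numbers.

```python
# Assumed constants: published Hoffmann et al. (2022) fit for the dense terms.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28
GAMMA = 0.36  # assumption: implied by 105 B effective / 37 B activated at G = 18


def dense_loss(n_params: float, tokens: float) -> float:
    """Chinchilla form: irreducible term + capacity term + data term."""
    return E + A / n_params ** ALPHA + B / tokens ** BETA


def moe_loss(n_act: float, tokens: float, g: float, gamma: float = GAMMA) -> float:
    """Same form with N replaced by the effective count N_act * G**gamma."""
    n_eff = n_act * g ** gamma
    return E + A / n_eff ** ALPHA + B / tokens ** BETA


# G = 1 collapses to the dense law; G > 1 lowers only the capacity term.
print(dense_loss(70e9, 8.1e12), moe_loss(70e9, 8.1e12, g=1.0))
print(moe_loss(37e9, 15.3e12, g=1.0), moe_loss(37e9, 15.3e12, g=18.0))
```

Keeping `gamma` as a keyword argument makes it easy to see how sensitive a prediction is to the conversion exponent.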
The effective parameter count
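Define $N_{\text{eff}} = N_{\text{act}} \cdot G^{\gamma}$. For DeepSeek-V3 ($N_{\text{act}} = 37$ B, $G \approx 18$):
$$N_{\text{eff}} = 37\,\text{B} \times 18^{\gamma} \approx 105\,\text{B}$$
Read: a 37 B-activated MoE with granularity 18 behaves, in the scaling-law fit, like a dense 105 B-effective-parameter model (a multiplier of $105/37 \approx 2.8$, which corresponds to $\gamma \approx 0.36$). It is not as good as a true dense 671 B (which would be $N_{\text{eff}} = 671$ B, the $\gamma = 1$ case), because $\gamma < 1$. But it costs the FLOPs of a 37 B model, and it punches at the loss of a 105 B model. That is the MoE bargain in one equation.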
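The 105 B figure pins down the conversion exponent. A minimal sketch, assuming the effective count is activated parameters times granularity raised to the conversion exponent; `effective_params` and `implied_gamma` are illustrative names.

```python
import math


def effective_params(n_act: float, g: float, gamma: float) -> float:
    """Effective dense-equivalent parameter count N_act * G**gamma."""
    return n_act * g ** gamma


def implied_gamma(n_eff: float, n_act: float, g: float) -> float:
    """Invert N_eff = N_act * G**gamma for gamma (requires g > 1)."""
    return math.log(n_eff / n_act) / math.log(g)


gamma = implied_gamma(105e9, 37e9, 18)
print(f"implied gamma      : {gamma:.3f}")
print(f"N_eff at gamma=0.36: {effective_params(37e9, 18, 0.36):.3g}")
print(f"N_eff at gamma=1   : {effective_params(37e9, 18, 1.0):.3g}")
```

At a conversion exponent of 1 the effective count equals activated parameters times granularity, i.e. roughly the total parameter count; anything below 1 is the discount the text describes.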
The compute-optimal point with G held fixed
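Training compute is paid only on activated experts:
$$C = 6\, N_{\text{act}}\, D$$
Substitute $D = C / (6 N_{\text{act}})$ into the loss formula and minimise over $N_{\text{act}}$ at fixed $C$ and $G$. The Chinchilla algebra goes through unchanged. The optimum sits where the capacity and data partial derivatives balance:
$$\frac{\alpha A}{\left(N_{\text{act}} G^{\gamma}\right)^{\alpha}} = \frac{\beta B}{D^{\beta}} \quad\Longrightarrow\quad N_{\text{act}}^{*} = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}} G^{-\frac{\alpha\gamma}{\alpha+\beta}}$$
Two facts to read off. First, at fixed compute, increasing $G$ shifts the optimum toward smaller $N_{\text{act}}$ and more tokens. That is exactly the DeepSeek design: a relatively modest 37 B activated, trained on 14.8 T tokens. Second, the scaling exponent of $G$ in the optimal $N_{\text{act}}$ is $-\alpha\gamma/(\alpha+\beta)$, roughly $-0.2$ when $\gamma \approx 0.36$ and $\alpha \approx \beta$: small, but enough to push from a dense 70 B compute-optimum down to a 37 B-activated MoE optimum at the same FLOPs.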
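Setting the derivative of the loss with respect to activated parameters to zero, with tokens eliminated through the compute constraint, gives a closed form for the optimum. The sketch below evaluates it under assumed constants: the published Hoffmann et al. (2022) values and a conversion exponent of 0.36. The function name `optimal_n_act` is illustrative.

```python
import math

# Assumed constants (placeholders for a real fit).
A, B = 406.4, 410.7
ALPHA, BETA, GAMMA = 0.34, 0.28, 0.36


def optimal_n_act(compute: float, g: float) -> float:
    """Closed-form minimiser of the MoE loss over N_act at fixed compute and G.

    From alpha*A / (N*G**gamma)**alpha == beta*B / D**beta with D = C / (6*N).
    """
    coef = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    budget = (compute / 6.0) ** (BETA / (ALPHA + BETA))
    return coef * budget * g ** (-ALPHA * GAMMA / (ALPHA + BETA))


C = 3.4e24
g_exponent = -ALPHA * GAMMA / (ALPHA + BETA)
n_dense = optimal_n_act(C, 1.0)
n_moe = optimal_n_act(C, 18.0)

print(f"exponent of G in N_act*: {g_exponent:.3f}")
print(f"dense optimum          : {n_dense:.3g} params")
print(f"G = 18 optimum         : {n_moe:.3g} activated params")
```

Under these placeholder constants the dense optimum lands in the neighbourhood of 70 B and the granularity-18 optimum near 40 B activated, the same direction and rough size of shift the text describes.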
Manual Numerical Walkthrough
Three configurations, same compute budget, all derived by hand.
Comparing dense 70B vs DeepSeek 671B/37B vs Switch 112B/7B
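Setup. Fix the training compute budget at $C = 3.4 \times 10^{24}$ FLOPs, roughly DeepSeek-V3's reported training compute. The constants $E$, $A$, $B$, $\alpha$, $\beta$, $\gamma$ are those of the fitted law above.
Configuration 1: dense 70 B (Chinchilla-style). $N_{\text{act}} = 70$ B, $G = 1$. Affordable tokens: $D = C / (6 N_{\text{act}})$ = 8.10 T. Loss: $L \approx 2.013$.
Configuration 2: DeepSeek 671 B / 37 B. $N_{\text{act}} = 37$ B, $G = 18$. Effective capacity: $N_{\text{eff}} = 37\,\text{B} \times 18^{\gamma} \approx 105$ B. Affordable tokens at the same budget: $D = C / (6 N_{\text{act}})$ = 15.3 T. Loss: $L \approx 1.957$.
Configuration 3: Switch-style 112 B / 7 B. $N_{\text{act}} = 7$ B, $G = 16$. Effective capacity: $N_{\text{eff}} = 7\,\text{B} \times 16^{\gamma}$. Affordable tokens: $D = C / (6 N_{\text{act}})$ = 81 T (token-rich but capacity-starved). Loss: $L \approx 2.062$.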
Reading the three numbers. Same compute. Dense 70 B → 2.013. DeepSeek 37 B × G=18 → 1.957. Switch 7 B × G=16 → 2.062. The DeepSeek configuration wins by $2.013 - 1.957 = 0.056$ nats over dense, a substantial margin at this scale. Switch loses because the activated count is so small that the capacity term blows up even with G = 16. Granularity helps, but only on top of a meaningful activated base.
What this proves. Picking $N_{\text{act}}$ too small to chase $G$ is a real failure mode. The scaling-law search has to be done in both dimensions simultaneously, which is exactly what the PyTorch grid search below does.
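The token counts in the walkthrough follow from dividing the budget by six times the activated parameter count. The sketch below reproduces them, assuming a budget of 3.4e24 FLOPs, the value that matches the quoted 8.10 T, 15.3 T and 81 T; the helper name `affordable_tokens` is illustrative.

```python
C = 3.4e24  # assumed budget in FLOPs, consistent with the quoted token counts

# name -> activated parameters per token
configs = {
    "dense 70B": 70e9,
    "DeepSeek 671B/37B": 37e9,
    "Switch 112B/7B": 7e9,
}


def affordable_tokens(compute: float, n_act: float) -> float:
    """Tokens a budget buys when each token costs 6 * n_act FLOPs."""
    return compute / (6.0 * n_act)


tokens = {name: affordable_tokens(C, n_act) for name, n_act in configs.items()}
for name, d in tokens.items():
    print(f"{name:>18}: {d / 1e12:.1f} T tokens")
```

Total parameter count never appears: at a fixed budget, the token count depends only on how many parameters fire per token.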
Visualizing the MoE Frontier
The widget below lets you set activated parameters $N_{\text{act}}$, granularity $G$, and a compute budget $C$. It computes the predicted MoE loss at the compute-optimal $D$, and overlays the dense Chinchilla curve (i.e. $G = 1$ with the same activated count) for comparison. The presets reproduce the three numerical configurations above.
Three movements to lock in. First, pull $G$ down to 1: the two curves merge, because $G = 1$ recovers dense Chinchilla exactly. Second, push $G$ up to 18 with the DeepSeek preset: the MoE curve drops uniformly below the dense curve across every compute budget you slide through. Third, drop $N_{\text{act}}$ to 7 B with G = 16 (Switch preset): the MoE curve rises above the dense curve at small compute, because the capacity term dominates when $N_{\text{act}}$ is too small to back up the granularity gain.
Plain Python: Predicting Loss for a Candidate MoE
The plain-Python implementation evaluates the MoE scaling law for three real configurations and picks the compute-optimal token count for each. No PyTorch — just math.log and a for-loop over candidate values. This is exactly the code a scaling-law team runs before they decide what model to spend a $100M training run on.
Two structural details. First, the FLOPs formula $C = 6 N_{\text{act}} D$ is the entire reason MoE is economically viable. If you accidentally use $N_{\text{total}}$ in your budget accounting, every MoE looks ruinous and you stop building them. Second, the constants are not universal: they need to be re-fit on small-scale runs of your architecture, your data, and your tokenizer. The numbers above are illustrative; in production you would run a sweep of ~20 small proxy models, fit the three exponents by non-linear least squares, then transfer to the full run.
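The first structural detail is easy to demonstrate. The sketch below scores the same 671 B / 37 B model twice at one budget, once charging FLOPs on activated parameters and once (mistakenly) on total parameters. It is a minimal illustration under assumed constants: the published Hoffmann et al. (2022) values, a conversion exponent of 0.36, and a 3.4e24 FLOP budget. Absolute losses will change with a real fit.

```python
# Assumed constants (placeholders for a real fit).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA, GAMMA = 0.34, 0.28, 0.36
C = 3.4e24  # assumed compute budget in FLOPs


def moe_loss(n_act: float, tokens: float, g: float) -> float:
    """MoE scaling law with effective capacity N_act * G**GAMMA."""
    n_eff = n_act * g ** GAMMA
    return E + A / n_eff ** ALPHA + B / tokens ** BETA


n_total, n_act = 671e9, 37e9
g = n_total / n_act

d_right = C / (6.0 * n_act)    # correct: pay only for activated parameters
d_wrong = C / (6.0 * n_total)  # mistaken: pay for every parameter

loss_right = moe_loss(n_act, d_right, g)
loss_wrong = moe_loss(n_act, d_wrong, g)

print(f"tokens (N_act accounting)  : {d_right / 1e12:.1f} T -> loss {loss_right:.3f}")
print(f"tokens (N_total accounting): {d_wrong / 1e12:.2f} T -> loss {loss_wrong:.3f}")
```

The mistaken accounting shrinks the affordable token count by a factor of G, which inflates the data term and makes the same model look far worse than it is.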
PyTorch: Searching the (N_act, G) Frontier
For a real frontier search you do not evaluate three configurations — you evaluate a grid of hundreds, then a finer grid in the neighbourhood of the optimum. PyTorch's broadcasting makes the whole grid a single vectorised call, and torch.argmin pulls the optimum out in microseconds.
Three observations worth marking, all about how a real lab uses this loop:
- The grid is the search space, not the search algorithm. Brute-force evaluation of a 50 × 50 grid is 2,500 loss evaluations, which takes milliseconds on a CPU. There is no need for gradient-based optimisation here; the loss surface is smooth and low-dimensional, and a dense grid finds the optimum more reliably than any clever optimiser.
- Always include the dense column. A frontier search that does not include G = 1 in the grid cannot tell you whether the MoE is actually winning. The dense column is the falsifier; every cell of the MoE grid is good news only relative to that anchor.
- Re-fit the exponents at scale before launch. The grid's optimum depends on three numbers (α, β, γ) fit on small proxies. Those numbers drift with scale; γ in particular tends to drop slightly as activated count grows, because expert overlap increases. DeepSeek's reported procedure is to fit on 280 M – 1 B proxies, then re-fit a single correction term from one or two 16 B-activated runs before committing to the 37 B-activated production model.
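The search loop described above can be sketched without any dependencies. This is a standard-library stand-in for a vectorised PyTorch version, not the section's own code: it assumes the published Hoffmann et al. (2022) constants, a conversion exponent of 0.36, and a 3.4e24 FLOP budget, and the grid bounds are arbitrary choices for illustration.

```python
# Assumed constants and budget (placeholders for a real fit).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA, GAMMA = 0.34, 0.28, 0.36
C = 3.4e24


def loss_at_budget(n_act: float, g: float, compute: float = C) -> float:
    """Loss when the whole budget is spent: tokens = compute / (6 * n_act)."""
    tokens = compute / (6.0 * n_act)
    n_eff = n_act * g ** GAMMA
    return E + A / n_eff ** ALPHA + B / tokens ** BETA


# Log-spaced activated sizes from 1 B to 1 T; the G = 1 column is the dense anchor.
n_grid = [10 ** (9.0 + 0.025 * i) for i in range(121)]
g_grid = [1, 2, 4, 8, 16, 32]

best_per_g = {}
for g in g_grid:
    best_n = min(n_grid, key=lambda n: loss_at_budget(n, g))
    best_per_g[g] = (best_n, loss_at_budget(best_n, g))

best_g = min(best_per_g, key=lambda g: best_per_g[g][1])

for g, (n, loss) in best_per_g.items():
    print(f"G = {g:>2}: best N_act = {n:.3g}, loss = {loss:.4f}")
print(f"best column: G = {best_g}")
```

One property of this functional form shows up immediately: because granularity is free in the FLOPs accounting, the grid always prefers the largest G it is offered. In practice the memory and communication costs tabulated later in this section are what bound G, so the upper edge of the grid should be set from those constraints.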
At Massive Scale: How DeepSeek Picked 671B / 37B
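Put the scaling law against the real DeepSeek-V3 numbers. The public reports give us: $N_{\text{total}} = 671$ B, $N_{\text{act}} = 37$ B, $D = 14.8$ T tokens, training compute on the order of $6 N_{\text{act}} D \approx 3.3 \times 10^{24}$ FLOPs. Plug those into the law and you get:
$$L = E + \frac{A}{\left(37\,\text{B} \times 18^{\gamma}\right)^{\alpha}} + \frac{B}{\left(14.8\,\text{T}\right)^{\beta}}$$
Now ask: what would a dense model at the same compute budget have delivered? Solving Chinchilla at the same $C$ gives a compute-optimal dense model of roughly 70 B parameters trained on roughly 8 T tokens. The MoE wins by a meaningful margin, one that compounds over downstream evaluations into the kind of capability gap that wins benchmarks.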
What changes at the frontier
| Bottleneck | Dense behaviour | MoE behaviour | Engineering fix |
|---|---|---|---|
| Training FLOPs | Scales as 6 N D | Scales as 6 N_act D (much smaller) | Stays the same |
| GPU memory (parameters) | Holds N params per GPU | Must hold N_total params across the cluster | Expert parallelism (chapter 5.5) |
| GPU memory (optimizer) | 8 bytes/param if Adam | 8 bytes/param × N_total — dominant cost | ZeRO / sharded optimizer |
| Communication | All-reduce gradients | All-to-all routes tokens to experts every layer | Token packing + overlap with compute |
| Inference latency | Constant — same path every token | Constant per token (only N_act fires) | Same — that is the MoE win |
Read the table sideways: every MoE bottleneck that exists is on the memory and communication rows. The compute-and-latency rows look just like dense — except smaller. That is why MoE is an engineering win at frontier scale and a nuisance at small scale: small models do not have a memory bottleneck to relieve.
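The optimizer-memory row is simple arithmetic. The sketch below uses the 8 bytes per parameter from the table (two fp32 Adam moments) and compares a dense 37 B model with a 671 B-total MoE. The 80 GB device size is a hypothetical figure chosen for illustration, and weights, gradients, and activations are deliberately left out.

```python
import math

ADAM_BYTES_PER_PARAM = 8   # from the table: two fp32 moments per parameter
DEVICE_BYTES = 80e9        # hypothetical 80 GB accelerator (assumption)


def optimizer_state_bytes(n_params: float) -> float:
    """Bytes of Adam moment state for a model with n_params parameters."""
    return n_params * ADAM_BYTES_PER_PARAM


def min_devices(n_bytes: float, device_bytes: float = DEVICE_BYTES) -> int:
    """Lower bound on devices needed to hold n_bytes when fully sharded."""
    return math.ceil(n_bytes / device_bytes)


dense_bytes = optimizer_state_bytes(37e9)   # dense model with the same per-token cost
moe_bytes = optimizer_state_bytes(671e9)    # MoE pays for every resident parameter

print(f"dense 37 B : {dense_bytes / 1e12:.2f} TB, >= {min_devices(dense_bytes)} devices")
print(f"MoE 671 B  : {moe_bytes / 1e12:.2f} TB, >= {min_devices(moe_bytes)} devices")
```

Both models cost the same FLOPs per token, but the optimizer state differs by the granularity factor, which is why sharding and expert parallelism appear in the fix column.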
Engineering Reality and Gotchas
MoE scaling-law fits are clean equations, but three production failure modes earn their flags:
- γ is not constant across architectures. The conversion exponent depends on how independent the experts actually are. Switch-style one-expert routing and DeepSeek-style fine-grained experts with top-8 routing fit different values of γ. Use the wrong γ and your prediction of the 671 B model's loss can be off by 0.1 nats, enough to abort a $50 M run that would in fact have succeeded.
- Capacity factor and load balancing affect the effective N_act. The clean formula assumes every token gets exactly its k routed experts. In practice, capacity overflow drops some tokens and load imbalance burns experts. Both push the effective N_act below its nominal value. The auxiliary-loss-free balancing of chapter 6 was designed in part so that scaling-law predictions would actually hold at frontier scale.
- The data exponent β depends on quality, not just quantity. The fit above uses a single β, but the real β shifts when the mixture changes (chapter 8.4). Increasing the math weight raises the β fitted on math at the cost of the β fitted on the other domains. Production teams carry a separate β per domain and aggregate; treating β as a single number is a 0.02-nat mispredict, small but enough to flip a Pareto comparison.
The one sentence to carry forward: MoE scaling laws turn Chinchilla's two knobs into three, $N_{\text{act}}$, $D$, and $G$, and the third one is what lets a 671 B model train for the cost of a 37 B model and ship for the latency of a 37 B model, which is why so many frontier releases since 2024 have been MoEs.