Chapter 9

MoE Scaling Laws

Scaling Laws and Compute-Optimal Training

Section 9.1 derived the Chinchilla scaling laws for dense transformers, including the famous result that, for a given compute budget, the loss-minimising allocation is roughly twenty tokens per parameter. Those laws are a bedrock prediction tool, and frontier labs use them to size dense models. But Chinchilla assumes that every parameter touches every token. Mixture-of-Experts models violate that assumption deliberately. In an MoE, most of the parameters sit idle on any given token; they only wake up when the router sends a token their way. The cost-per-token is decoupled from the capacity-per-token, and the Chinchilla framework needs an extra dimension to describe it.

The thesis of this section. Dense scaling laws have two knobs: parameters $N$ and tokens $D$. MoE scaling laws have three: activated parameters $N_{\text{act}}$, tokens $D$, and granularity $G = N_{\text{total}} / N_{\text{act}}$. The third knob is what lets DeepSeek-V3 ship 671 B parameters for the per-token cost of a 37 B dense model, and the scaling law tells us how much that knob is worth.

The Real Problem: Chinchilla Breaks at G > 1

Run a Chinchilla optimiser on a 671 B-parameter target and it will prescribe roughly $20 \cdot 671\,\text{B} \approx 13.4$ T tokens, costing on the order of $6 \cdot 671 \cdot 10^{9} \cdot 13.4 \cdot 10^{12} \approx 5.4 \cdot 10^{25}$ FLOPs. That is on par with the largest publicly estimated training runs of 2024. Yet DeepSeek-V3 trained a 671 B model on 14.8 T tokens at roughly $3 \cdot 10^{24}$ FLOPs, almost twenty times cheaper than Chinchilla predicts. The dense law does not just mispredict the cost; it mispredicts which side of the budget constraint the model can live on. Something is missing from the formula.
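The arithmetic in this paragraph can be reproduced in a few lines. This is a minimal sketch using the 20-tokens-per-parameter heuristic and the $6ND$ FLOPs rule of thumb quoted in the text; the function names are illustrative, not taken from any library.

```python
# Back-of-envelope check of the dense-vs-MoE budget numbers.
TOKENS_PER_PARAM = 20  # Chinchilla heuristic quoted in the text


def dense_chinchilla_budget(n_params):
    """Return (tokens, training FLOPs) for a dense model sized by the 20:1 rule."""
    tokens = TOKENS_PER_PARAM * n_params
    return tokens, 6 * n_params * tokens


def moe_training_flops(n_act, tokens):
    """Training FLOPs when only the activated parameters do work on each token."""
    return 6 * n_act * tokens


dense_tokens, dense_flops = dense_chinchilla_budget(671e9)
moe_flops = moe_training_flops(37e9, 14.8e12)

print(f"dense 671B: {dense_tokens:.3e} tokens, {dense_flops:.2e} FLOPs")
print(f"MoE 671B/37B on 14.8T tokens: {moe_flops:.2e} FLOPs")
print(f"ratio: {dense_flops / moe_flops:.1f}x")
```

The ratio printed here uses $6 \cdot 37\,\text{B} \cdot 14.8\,\text{T}$ for the MoE cost rather than the rounded $3 \cdot 10^{24}$ figure, so it comes out slightly below the ratio quoted in the prose.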

What is missing is the distinction between two things Chinchilla treats as one:

| Quantity | Dense interpretation | MoE reality |
| --- | --- | --- |
| Parameters in the model | N | N_total (huge) |
| Parameters that fire per token | N (same) | N_act (much smaller) |
| Training FLOPs / token | 6 N | 6 N_act |
| Memory / activation footprint | Grows with N | Grows with N_total (full model resident) |
| Capacity for new patterns | Grows with N | Grows sub-linearly with N_total (γ < 1) |

Chinchilla's scaling law uses one $N$ for both the cost column and the capacity column. MoE has to split them. Empirical work by Clark et al. (2022), Krajewski et al. (2024), and the DeepSeek team has converged on a clean three-parameter extension: keep Chinchilla's functional form, replace the capacity term with effective capacity, and let the data decide how much of the extra MoE parameters actually contribute.

The empirical signal that motivates the formula. Train two MoEs with identical activated-parameter counts (e.g. both 37 B activated) but different total-parameter counts (e.g. 110 B vs 670 B). At the same training compute, the 670 B model lands at lower loss, but not by as much as a hypothetical dense 670 B would. The improvement follows a diminishing-returns power law in $G = N_{\text{total}} / N_{\text{act}}$, with an exponent $\gamma \approx 0.35$. That single exponent is the heart of MoE scaling-law theory.

Intuition: Two Knobs Instead of One

Think of a dense transformer as a single big specialist. Every token consults the same expert. Its cost-per-token is the specialist's size, and so is its capacity. You cannot grow capacity without growing cost.

An MoE is a panel of small specialists with a triage doctor. Each token sees the triage doctor (the router), gets routed to $k$ specialists out of $E$, and only those $k$ specialists do work. Cost-per-token is the size of the $k$ firing specialists. Capacity is the size of the whole panel. The two are now different numbers, and that is the entire design freedom.

Granularity $G$ is the panel-to-firing-doctor ratio. $G = 1$ is a single doctor: a dense model. $G = 18$ is the DeepSeek design: 256 experts, top-8 routing, so on every token $8 / 256 = 1/32$ of the routed-expert parameters fire, modulated by the shared-expert design that brings that ratio up to roughly $1/18$ in effective FLOPs.

Why granularity helps but not linearly. Each additional expert adds a new specialism, but specialisms overlap. The 257th expert covers patterns already partly covered by the first 256, so its marginal capacity contribution is less than that of the 256 before it. That diminishing-returns shape is exactly a power law in $G$, and the fitted exponent $\gamma \approx 0.35$ says the marginal gain falls like $G^{-0.65}$.
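The diminishing-returns shape is easy to tabulate. The sketch below evaluates the capacity multiplier $G^{\gamma}$ and its derivative $\gamma G^{\gamma - 1}$ for a few granularities, assuming the illustrative $\gamma = 0.35$ from the text; the helper names are hypothetical.

```python
GAMMA = 0.35  # illustrative exponent quoted in the text


def capacity_multiplier(G, gamma=GAMMA):
    """Effective-capacity multiplier G**gamma relative to a dense model (G = 1)."""
    return G ** gamma


def marginal_gain(G, gamma=GAMMA):
    """d/dG of G**gamma: extra multiplier bought by one more unit of granularity."""
    return gamma * G ** (gamma - 1.0)


for G in (1, 2, 4, 8, 16, 32):
    print(f"G = {G:2d}  multiplier = {capacity_multiplier(G):.2f}  "
          f"marginal gain = {marginal_gain(G):.4f}")
```

The multiplier keeps growing with $G$, but each doubling of $G$ buys the same factor $2^{0.35}$ while costing twice as many total parameters.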

The Mathematical Form of an MoE Scaling Law

Start with Chinchilla. Hoffmann et al. (2022) fit:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

with $E \approx 1.69$ (the irreducible text entropy), $\alpha \approx 0.34$, $\beta \approx 0.28$. The first term is unreachable language entropy; the second falls as the model gets bigger; the third falls as the training set grows.

Clark et al. (2022) — and more precisely Krajewski et al. (2024) — showed that the MoE form is the same equation with the capacity term replaced by an effective parameter count:

$$L(N_{\text{act}}, D, G) = E + \frac{A}{(N_{\text{act}} \cdot G^{\gamma})^{\alpha}} + \frac{B}{D^{\beta}}$$

with $G = N_{\text{total}} / N_{\text{act}}$ the granularity and $\gamma \in (0, 1)$ the MoE conversion exponent. Set $G = 1$ and you recover dense Chinchilla exactly. Set $G > 1$ and you trade total parameters for activated cost at a discount governed by $\gamma$.

The effective parameter count

Define $N_{\text{eff}} = N_{\text{act}} \cdot G^{\gamma}$. For DeepSeek-V3 ($N_{\text{act}} = 37$ B, $G \approx 18.1$):

$$N_{\text{eff}} = 37 \cdot 10^{9} \cdot 18.1^{0.35} \approx 37 \cdot 10^{9} \cdot 2.76 \approx 1.02 \cdot 10^{11}$$

Read: a 37 B-activated MoE with granularity 18 behaves, in the scaling-law fit, like a dense model of roughly 102 B effective parameters. It is not as good as a true dense 671 B (which would be $N_{\text{eff}} = 671$ B), because $\gamma < 1$. But it costs the FLOPs of a 37 B model, and it punches at the loss of a roughly 102 B model. That is the MoE bargain in one equation.
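To make the "not as good as a true dense 671 B" remark concrete, the sketch below inverts the definition of $N_{\text{eff}}$ and asks what granularity 37 B activated parameters would need in order to match a given dense size. The function names are illustrative, and the calculation assumes the quoted $\gamma = 0.35$ stays constant far outside the range where such exponents are fit.

```python
GAMMA = 0.35  # illustrative exponent from the text


def effective_params(n_act, G, gamma=GAMMA):
    """N_eff = N_act * G**gamma."""
    return n_act * G ** gamma


def granularity_needed(n_act, n_eff_target, gamma=GAMMA):
    """Invert N_eff = N_act * G**gamma for G."""
    return (n_eff_target / n_act) ** (1.0 / gamma)


n_eff = effective_params(37e9, 671 / 37)
g_for_dense_671 = granularity_needed(37e9, 671e9)

print(f"N_eff at G = 671/37:             {n_eff:.3e}")
print(f"G needed to match a dense 671 B: {g_for_dense_671:,.0f}")
```

Under these assumptions the required granularity is in the thousands rather than 18, which is the quantitative content of $\gamma < 1$.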

The compute-optimal point with G held fixed

Training compute is paid only on activated experts:

$$C = 6 \cdot N_{\text{act}} \cdot D$$

Substitute $D = C / (6 N_{\text{act}})$ into the loss formula and minimise over $N_{\text{act}}$ at fixed $G$ and $C$. The Chinchilla algebra goes through with $A$ replaced by $A \cdot G^{-\gamma\alpha}$. Setting the derivative to zero gives $N_{\text{act}}^{\alpha + \beta} = \frac{\alpha A}{\beta B} \cdot G^{-\gamma\alpha} \cdot (C/6)^{\beta}$, so the optimum sits where the capacity and data partial derivatives balance:

$$N_{\text{act}}^{\star} \propto C^{\beta / (\alpha + \beta)} \cdot G^{-\gamma \alpha / (\alpha + \beta)}$$

$$D^{\star} \propto C^{\alpha / (\alpha + \beta)} \cdot G^{\gamma \alpha / (\alpha + \beta)}$$

Two facts to read off. First, at fixed compute, increasing $G$ shifts the optimum toward smaller $N_{\text{act}}$ and more tokens. That is exactly the DeepSeek design: a relatively modest 37 B activated, trained on 14.8 T tokens. Second, with $\gamma = 0.35$ and $\alpha = 0.34, \beta = 0.28$, the exponent of $G$ in the optimal $N_{\text{act}}$ is $-\gamma\alpha/(\alpha+\beta) \approx -0.19$. That looks small, but at $G \approx 18$ it is a factor of $18.1^{-0.19} \approx 0.57$, enough to push a dense compute-optimum of about 70 B down to roughly 40 B activated at the same FLOPs, close to DeepSeek's 37 B.
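A quick numerical sanity check on the stationary point: the sketch below scans $N_{\text{act}}$ on a log grid at a fixed budget and compares the grid minimum with the closed-form expression obtained by setting the derivative of the loss to zero. It assumes the illustrative constants used in this section and the $C = 6 N_{\text{act}} D$ cost model; the function names are hypothetical.

```python
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA, GAMMA = 0.34, 0.28, 0.35


def loss_at_budget(n_act, G, C):
    """Loss when the whole budget C is spent, i.e. D = C / (6 * N_act)."""
    D = C / (6.0 * n_act)
    return E + A * (n_act * G ** GAMMA) ** (-ALPHA) + B * D ** (-BETA)


def closed_form_n_act(G, C):
    """Stationary point of loss_at_budget in N_act:
    N^(alpha+beta) = (alpha*A / (beta*B)) * G^(-gamma*alpha) * (C/6)^beta."""
    ratio = (ALPHA * A) / (BETA * B)
    rhs = ratio * G ** (-GAMMA * ALPHA) * (C / 6.0) ** BETA
    return rhs ** (1.0 / (ALPHA + BETA))


def grid_scan_n_act(G, C, lo=1e9, hi=1e12, steps=4000):
    """Log-spaced scan over N_act; returns the grid point with the lowest loss."""
    grid = [lo * (hi / lo) ** (i / steps) for i in range(steps + 1)]
    return min(grid, key=lambda n: loss_at_budget(n, G, C))


C = 3.4e24
for G in (1.0, 18.1):
    print(f"G = {G:5.1f}  closed form = {closed_form_n_act(G, C) / 1e9:6.1f} B  "
          f"grid scan = {grid_scan_n_act(G, C) / 1e9:6.1f} B")
```

Agreement between the two columns confirms the algebra; the drop from the $G = 1$ row to the $G = 18.1$ row is the shift toward smaller activated models described above.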

Manual Numerical Walkthrough

Three configurations, same compute budget, all derived by hand.

Worked comparison: dense 70B vs DeepSeek 671B/37B vs Switch 112B/7B

Setup. Fix the training compute budget at $C = 3.4 \cdot 10^{24}$ FLOPs, roughly DeepSeek-V3's reported training compute. The constants are $E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$, $\gamma = 0.35$.

Configuration 1: dense 70 B (Chinchilla-style). $N_{\text{act}} = 70$ B, $G = 1$. Affordable tokens: $D = C / (6 N) = 3.4 \cdot 10^{24} / (6 \cdot 7 \cdot 10^{10}) \approx 8.10 \cdot 10^{12}$, i.e. 8.10 T. Loss:

$$L = 1.69 + 406.4 \cdot (7 \cdot 10^{10})^{-0.34} + 410.7 \cdot (8.10 \cdot 10^{12})^{-0.28}$$

$$= 1.69 + 0.0835 + 0.0998 \approx 1.873$$

Configuration 2: DeepSeek 671 B / 37 B. $N_{\text{act}} = 37$ B, $G = 18.1$. Effective capacity: $N_{\text{eff}} = 37 \cdot 10^{9} \cdot 18.1^{0.35} \approx 1.02 \cdot 10^{11}$. Affordable tokens at the same budget: $D = 3.4 \cdot 10^{24} / (6 \cdot 3.7 \cdot 10^{10}) \approx 1.53 \cdot 10^{13}$, i.e. 15.3 T. Loss:

$$L = 1.69 + 406.4 \cdot (1.02 \cdot 10^{11})^{-0.34} + 410.7 \cdot (1.53 \cdot 10^{13})^{-0.28}$$

$$= 1.69 + 0.0735 + 0.0835 \approx 1.847$$

Configuration 3: Switch-style 112 B / 7 B. $N_{\text{act}} = 7$ B, $G = 16$. Effective capacity: $N_{\text{eff}} = 7 \cdot 10^{9} \cdot 16^{0.35} \approx 1.85 \cdot 10^{10}$. Affordable tokens: $D = 3.4 \cdot 10^{24} / (6 \cdot 7 \cdot 10^{9}) \approx 8.10 \cdot 10^{13}$, i.e. 81 T (token-rich but capacity-starved). Loss:

$$L = 1.69 + 406.4 \cdot (1.85 \cdot 10^{10})^{-0.34} + 410.7 \cdot (8.10 \cdot 10^{13})^{-0.28}$$

$$= 1.69 + 0.1313 + 0.0524 \approx 1.874$$

Reading the three numbers. Same compute. Dense 70 B → 1.873. DeepSeek 37 B × G=18 → 1.847. Switch 7 B × G=16 → 1.874. The DeepSeek configuration wins by about $0.026$ nats over dense. The Switch-style configuration only ties the dense baseline despite training on ten times as many tokens: its data term is the smallest of the three (0.052), but its capacity term is the largest (0.131), because a 7 B activated base is too small for G = 16 to rescue. Granularity helps, but only on top of a meaningful activated base.

What this shows. Picking $N_{\text{act}}$ too small to chase $G$ is a real failure mode. The scaling-law search has to be done in both dimensions simultaneously, which is what the PyTorch grid search later in this section does.

Visualizing the MoE Frontier

The widget below lets you set activated parameters $N_{\text{act}}$, granularity $G$, and a compute budget $C$. It computes the predicted MoE loss at the compute-optimal $D$, and overlays the dense Chinchilla curve (i.e. $G = 1$ with the same activated count) for comparison. The presets reproduce the three numerical configurations above.


Three movements to lock in. First, pull $G$ down to 1: the two curves merge, because $G = 1$ recovers dense Chinchilla exactly. Second, push $G$ up to 18 with the DeepSeek preset: the MoE curve drops below the dense curve across every compute budget you slide through. Third, drop $N_{\text{act}}$ to 7 B with G = 16 (Switch preset): the MoE curve still sits below its $G = 1$ counterpart, but it flattens early as compute grows, because the capacity term sets a floor that a small $N_{\text{act}}$ cannot lower even with the granularity gain.

Plain Python: Predicting Loss for a Candidate MoE

The plain-Python implementation evaluates the MoE scaling law for three real configurations and picks the compute-optimal token count for each. No PyTorch: just the standard library and a for-loop over candidate $D$ values. This is the kind of code a scaling-law team runs before deciding what model to spend a $100M training run on.

moe_scaling.py

Fitted constants of the MoE scaling law

E is the irreducible loss: the entropy of the language that no model, however large, can drive lower. A and B are scale constants. ALPHA and BETA are the well-known Chinchilla exponents for capacity and data. GAMMA is the new exponent introduced by MoE work: it controls how much of the total-parameter count actually counts as effective capacity. GAMMA < 1 means doubling total params at fixed activated params is worth less than doubling a dense model, but it is not worthless either.

Values:
ALPHA = 0.34 (capacity exponent)
BETA = 0.28 (data exponent)
GAMMA = 0.35 (MoE granularity exponent)

predicted_loss: predicted loss for a candidate (N_act, D, G)

The loss formula has three terms: irreducible (E), capacity (A · N_eff^-α), and data (B · D^-β). The capacity term is where MoE differs from dense. Instead of N alone, we use N_eff = N_act · G^γ. With G = 1 the formula collapses to Chinchilla. With G = 18.1 and γ = 0.35 the same 37B activated params get an effective-capacity multiplier of 18.1^0.35 ≈ 2.76, meaning DeepSeek's 37B activated parameters behave like a dense model of roughly 102B effective parameters in this scaling-law fit.

Values:
N_eff (DeepSeek) = 37e9 · 18.1^0.35 ≈ 1.02e11
Equivalent dense ≈ 102 B params (effective)

compute_flops: compute is paid only on activated experts

The whole reason MoE works is that each token only flows through k ≪ E experts. So the total training FLOPs are 6 · N_act · D, not 6 · N_total · D. This is the single accounting fact that lets DeepSeek host 671 B of capacity for the price of a 37 B dense model, at training and inference both.

Values:
FLOPs (DeepSeek, D = 14.8T) ≈ 3.3e24
FLOPs (dense 671B, D = 14.8T) ≈ 6.0e25 (18× more)

compute_optimal: sweep D in log space to find the compute-optimal point

Given a compute budget C, you can spend it on more tokens (more D) or, for a dense model, on more parameters. For MoE at fixed (N_act, G), the only knob is D. We sweep D from 1B to 100T in log steps, stop at the first D that overshoots the budget, and keep the point with the lowest predicted loss. Because the predicted loss falls monotonically with D, that point is always the largest affordable D on the sweep. This is the same minimisation Chinchilla did for dense; here it is done per G slice.

Three configurations at the same compute budget

All three runs are capped at 3.4e24 FLOPs, roughly DeepSeek-V3's training compute. The comparison reveals the central thesis of MoE scaling: at fixed compute, increasing G drops loss, but with diminishing returns governed by γ. Dense 70B is the strong baseline; DeepSeek 671B/37B beats it; Switch 112B/7B is a different regime (small N_act, large G) and only ties the dense baseline, because the capacity term rewards large N_act independently of G.

Values (D* is the largest sweep point under each budget ceiling):
Dense 70B: D* ≈ 7.9T tok, L* ≈ 1.874
DeepSeek 37B × G=18: D* ≈ 15.1T tok, L* ≈ 1.847
Switch 7B × G=16: D* ≈ 79.4T tok, L* ≈ 1.874

Reading the output

At identical training compute, the MoE with moderate activated params and G ≈ 18 lands the lowest loss. Two structural reasons: (i) the capacity exponent α rewards effective-param count, and N_eff = 37e9 · 18.1^0.35 ≈ 1.02e11 exceeds 70e9; (ii) the data exponent β rewards more tokens, and the MoE's activated cost is roughly half of the dense 70B's, freeing budget for about 1.9× more tokens. Combined, you get a meaningful loss gap.
```python
import math

# Fitted MoE scaling-law constants (Krajewski et al. 2024, illustrative).
E = 1.69                 # irreducible loss
A, B = 406.4, 410.7      # scale constants for the capacity and data terms
ALPHA = 0.34             # capacity exponent
BETA = 0.28              # data exponent
GAMMA = 0.35             # how much MoE 'total params' convert to effective capacity


def predicted_loss(n_act_B, d_B, G):
    """
    n_act_B : activated params per token, in billions
    d_B     : training tokens, in billions
    G       : granularity factor = N_total / N_act
    """
    N_eff = (n_act_B * 1e9) * (G ** GAMMA)   # effective capacity
    D = d_B * 1e9
    return E + A * N_eff ** (-ALPHA) + B * D ** (-BETA)


def compute_flops(n_act_B, d_B):
    # Training compute is dominated by the activated experts only.
    return 6 * (n_act_B * 1e9) * (d_B * 1e9)


def compute_optimal(n_act_B, G, C_budget):
    """Sweep D in log space; return (D*, L*) under the given budget."""
    best = (None, math.inf)
    for k in range(0, 251):
        d_B = 10 ** (k / 50.0)                 # 1B .. 1e5 B = 100T
        if compute_flops(n_act_B, d_B) > C_budget:
            break                              # every later D is over budget too
        L = predicted_loss(n_act_B, d_B, G)
        if L < best[1]:                        # loss falls with D, so this keeps the last affordable D
            best = (d_B, L)
    return best


# Three configurations at the same training compute (3.4e24 FLOPs)
C = 3.4e24

dense_70B = compute_optimal(70, G=1, C_budget=C)
deepseek_37B_x_G18 = compute_optimal(37, G=18.1, C_budget=C)
switch_7B_x_G16 = compute_optimal(7, G=16, C_budget=C)

for name, (d, L) in [
    ("Dense 70B           ", dense_70B),
    ("DeepSeek 671B / 37B ", deepseek_37B_x_G18),
    ("Switch 112B / 7B    ", switch_7B_x_G16),
]:
    print(f"{name}  D* = {d:6.0f} B tokens   L* = {L:.3f}")
```

Two structural details. First, the FLOPs formula $C = 6 N_{\text{act}} D$ is the entire reason MoE is economically viable. If you accidentally use $6 N_{\text{total}}$ in your budget accounting, every MoE looks ruinous and you stop building them. Second, the constants $\alpha, \beta, \gamma$ are not universal: they need to be re-fit on small-scale runs of your architecture, your data, and your tokenizer. The numbers above are illustrative; in production you would run a sweep of ~20 small proxy models, fit the three exponents by non-linear least squares, then transfer to the full run.
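A full fit estimates all constants jointly by non-linear least squares. As a much smaller illustration of the idea, the sketch below recovers $\gamma$ alone from a granularity sweep at fixed $N_{\text{act}}$ and $D$, under the simplifying assumption that $E$, $B$, $\alpha$, and $\beta$ are already known from a dense fit. In that case the capacity term is log-linear in $G$ with slope $-\alpha\gamma$, so ordinary least squares is enough. The proxy "measurements" are synthetic, generated from the law itself, and all names are hypothetical.

```python
import math

# Constants treated as already known (illustrative values from the text).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28
TRUE_GAMMA = 0.35  # used only to generate the synthetic measurements


def law(n_act, d, G, gamma):
    """MoE scaling law with an explicit gamma."""
    return E + A * (n_act * G ** gamma) ** (-ALPHA) + B * d ** (-BETA)


def fit_gamma(runs):
    """Least-squares slope of log(capacity term) against log(G).

    runs: list of (n_act, d, G, measured_loss) sharing the same n_act.
    The capacity term is A * n_act**-ALPHA * G**(-ALPHA * gamma), so the
    slope of its log against log(G) equals -ALPHA * gamma.
    """
    xs = [math.log(G) for _, _, G, _ in runs]
    ys = [math.log(loss - E - B * d ** (-BETA)) for _, d, _, loss in runs]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return -slope / ALPHA


# Hypothetical proxy sweep: 1 B activated, 100 B tokens, six granularities.
proxy_runs = [(1e9, 1e11, G, law(1e9, 1e11, G, TRUE_GAMMA))
              for G in (1, 2, 4, 8, 16, 32)]
gamma_hat = fit_gamma(proxy_runs)
print(f"recovered gamma = {gamma_hat:.3f}")
```

With real, noisy proxy losses the same regression would return an estimate with error bars rather than the exact generating value, and the other constants would need to be fit alongside it.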

PyTorch: Searching the (N_act, G) Frontier

For a real frontier search you do not evaluate three configurations — you evaluate a grid of hundreds, then a finer grid in the neighbourhood of the optimum. PyTorch's broadcasting makes the whole grid a single vectorised call, and torch.argmin pulls the optimum out in microseconds.

moe_frontier_search.py

loss_grid: the loss function lifted onto tensors

Exact same algebra as the plain-Python version, but every input is now a tensor. This unlocks the trick we are after: evaluate every cell of the (N_act, G) grid in one broadcasted op instead of a nested Python for-loop. For real frontier searches this matters: you sweep thousands of candidate configurations against a handful of compute budgets and you want it to take seconds, not hours.

Shapes:
n_act_B.shape = (|n_act|, |g|)
G.shape = (|n_act|, |g|)
return.shape = (|n_act|, |g|)

n_act, g: the two axes of the search

n_act is the activated-parameter axis: how many parameters each token actually touches. g is the granularity axis: N_total / N_act. The Cartesian product (n_act, g) is the MoE design space. Note: g = 1 is the dense column. Every MoE search at fixed compute should include this column so you can see the dense baseline next to its MoE competitors.

Values:
n_act = [7, 13, 22, 37, 70, 120]
g = [1, 2, 4, 8, 16, 24, 32]

D_max: the compute budget converts to a per-row token ceiling

FLOPs = 6 · N_act · D. Holding FLOPs fixed at C, you can afford D_max = C / (6·N_act) tokens. Smaller activated models can afford more tokens. This is the same compute-data tradeoff Chinchilla discovered for dense; what changes in MoE is that varying G changes effective capacity without changing the activated cost, so you can move along the capacity axis for free in FLOPs terms.

Values:
D_max @ 7B activated ≈ 81.0T tokens
D_max @ 70B activated ≈ 8.10T tokens

D_max.expand: broadcast D_max across the G axis to form the full grid

D_max only depends on N_act (FLOPs = 6 · N_act · D), so each row of D_max is constant in G. We broadcast it across the G axis to align shapes for the loss evaluation. Subtle point: at fixed N_act, all granularities can afford the same D_max; they just differ in effective capacity.

Shapes:
D_max.shape = (6, 7)

L_grid: evaluate the loss at every (N_act, G) cell

One vectorised call. L_grid[i, j] is the predicted loss for an MoE with activated params n_act[i], granularity g[j], trained to D_max[i, j] tokens, under the fixed compute budget. This grid is the MoE compute-loss frontier under the law as written. Because the formula charges nothing for G, every row improves monotonically with G, so on a heatmap the lowest values sit along the high-G edge, away from the dense column, and not in an interior basin.

Shapes:
L_grid.shape = (6, 7)

torch.argmin: recover the compute-optimal design

torch.argmin over the flattened grid finds the single (N_act, G) cell with the lowest predicted loss at this compute budget. With the constants and grid above, that cell is N_act = 37 B at the largest granularity on the grid (G = 32), trained on about 15.3 T tokens. The activated size and token count are within shouting distance of the published 671B/37B configuration; the granularity runs to the edge of the grid because the formula has no memory or communication cost for G, so a real search has to bound G separately. For a real frontier-lab search the grid is denser, the constants are re-fit on small proxy runs, and the answer is then validated by one or two larger runs before launch.
```python
import torch

# Same scaling-law constants as before, now on tensors so we can sweep
# the entire (N_act, G) frontier in a single batched evaluation.
E      = 1.69
A, B   = 406.4, 410.7
ALPHA, BETA, GAMMA = 0.34, 0.28, 0.35

C_budget = 3.4e24  # training FLOPs ~ DeepSeek-V3

def loss_grid(n_act_B, G, D_B):
    """All three inputs are tensors of compatible shape."""
    N_eff = (n_act_B * 1e9) * G.pow(GAMMA)
    D     = D_B * 1e9
    return E + A * N_eff.pow(-ALPHA) + B * D.pow(-BETA)

# 1) Grid over candidate MoE configurations.
n_act = torch.tensor([7., 13., 22., 37., 70., 120.])      # activated B
g     = torch.tensor([1., 2., 4., 8., 16., 24., 32.])     # granularities

# 2) For each (N_act, G) cell, derive the maximum D affordable.
#    FLOPs = 6 * N_act * D  =>  D_max = C / (6 * N_act).
D_max = C_budget / (6 * n_act[:, None] * 1e9) / 1e9       # in B tokens
# Broadcast over the G axis so D_max has shape (|n_act|, |g|).
D_max = D_max.expand(-1, g.numel())

# 3) Loss at the compute-optimal D (= D_max under this constraint).
N_act_B  = n_act[:, None].expand(-1, g.numel())
G_grid   = g[None, :].expand(n_act.numel(), -1)
L_grid   = loss_grid(N_act_B, G_grid, D_max)

# 4) Find the global minimum over the (N_act, G) frontier.
flat = L_grid.flatten()
k    = torch.argmin(flat)
i, j = divmod(int(k), g.numel())
print(f"Compute-optimal MoE @ {C_budget:.1e} FLOPs:")
print(f"  N_act = {n_act[i].item():6.1f} B")
print(f"  G     = {g[j].item():6.1f}x  -> N_total = {(n_act[i]*g[j]).item():6.1f} B")
print(f"  D     = {D_max[i, j].item():6.0f} B tokens")
print(f"  L*    = {L_grid[i, j].item():.3f}")
```

Three observations worth marking, all about how a real lab uses this loop:

  1. The grid is the search space, not the search algorithm. Brute-force evaluation of a 50 × 50 grid is 2,500 loss evaluations, which takes milliseconds on a CPU. There is no need for gradient-based optimisation here; the loss surface is smooth and low-dimensional, and a dense grid finds the optimum more reliably than any clever optimiser.
  2. Always include the dense column. A frontier search that does not include $G = 1$ in the grid cannot tell you whether the MoE is actually winning. The dense column is the falsifier; every cell of the MoE grid is good news only relative to that anchor.
  3. Re-fit the exponents at scale before launch. The grid's optimum depends on three numbers ($\alpha, \beta, \gamma$) fit on small proxies. Those numbers drift with scale; γ in particular tends to drop slightly as activated count grows, because expert overlap increases. DeepSeek's reported procedure is to fit on 280 M – 1 B proxies, then re-fit a single correction term from one or two 16 B-activated runs before committing to the 37 B-activated production model.

Implementation note. The vectorised grid above broadcasts a (6, 1) tensor against a (1, 7) tensor to produce a (6, 7) loss grid in one op. For the production search this scales to (50, 50, 5), that is fifty $N_{\text{act}}$, fifty $G$, and five compute budgets, and still runs in under a second on a single CPU. The scaling-law surface is one of the only places in this whole book where the search is genuinely cheap; spend the compute on the proxies that fit the exponents instead.
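One property of the formula is worth seeing in a grid search: at fixed $N_{\text{act}}$ and $D$ the predicted loss only ever decreases with $G$, so an unconstrained search runs to the largest granularity on the grid. The pure-Python sketch below repeats the search with an optional cap on $N_{\text{total}} = N_{\text{act}} \cdot G$. The cap is a hypothetical stand-in for a memory or cluster-size limit, not something taken from any published procedure, and `best_cell` is an illustrative name.

```python
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA, GAMMA = 0.34, 0.28, 0.35


def loss(n_act_B, G, C):
    """Loss at the token ceiling D = C / (6 * N_act); n_act_B is in billions."""
    n_act = n_act_B * 1e9
    D = C / (6.0 * n_act)
    return E + A * (n_act * G ** GAMMA) ** (-ALPHA) + B * D ** (-BETA)


def best_cell(n_act_grid, g_grid, C, max_total_B=None):
    """Lowest-loss (N_act, G) cell, optionally restricted to N_act * G <= max_total_B."""
    cells = [(n, g) for n in n_act_grid for g in g_grid
             if max_total_B is None or n * g <= max_total_B]
    return min(cells, key=lambda cell: loss(cell[0], cell[1], C))


N_ACT = [7, 13, 22, 37, 70, 120]      # activated params, billions
G_GRID = [1, 2, 4, 8, 16, 24, 32]     # granularities
C = 3.4e24

print("unconstrained:        ", best_cell(N_ACT, G_GRID, C))
print("N_total <= 700 B cap: ", best_cell(N_ACT, G_GRID, C, max_total_B=700))
```

The two calls return different cells, which is the point: once a constraint on total parameters is added, the trade-off between activated size and granularity becomes a real design decision instead of "take the largest $G$ available".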

At Massive Scale: How DeepSeek Picked 671B / 37B

Put the scaling law against the real DeepSeek-V3 numbers. The public reports give us: $N_{\text{total}} = 671$ B, $N_{\text{act}} \approx 37$ B, $D = 14.8$ T tokens, training compute on the order of $3.0 \cdot 10^{24}$ FLOPs. Plug those into the law and you get:

$$G = 671 / 37 \approx 18.1$$

$$N_{\text{eff}} = 37 \cdot 10^{9} \cdot 18.1^{0.35} \approx 1.02 \cdot 10^{11}$$

$$L \approx 1.69 + 406.4 \cdot (1.02 \cdot 10^{11})^{-0.34} + 410.7 \cdot (1.48 \cdot 10^{13})^{-0.28} \approx 1.69 + 0.073 + 0.084 \approx 1.85$$

Now ask: what would a dense model at the same compute budget have delivered? Solving Chinchilla with the same constants for $C = 3.0 \cdot 10^{24}$ gives a compute-optimal dense $N \approx 65$ B, $D \approx 7.7$ T tokens, $L \approx 1.88$. The MoE wins by $\sim 0.03$ nats, a margin that matters at this scale because differences of this size carry through to downstream evaluations.

What changes at the frontier

| Bottleneck | Dense behaviour | MoE behaviour | Engineering fix |
| --- | --- | --- | --- |
| Training FLOPs | Scales as 6 N D | Scales as 6 N_act D (much smaller) | Stays the same |
| GPU memory (parameters) | Holds N params per GPU | Must hold N_total params across the cluster | Expert parallelism (chapter 5.5) |
| GPU memory (optimizer) | 8 bytes/param if Adam | 8 bytes/param × N_total (dominant cost) | ZeRO / sharded optimizer |
| Communication | All-reduce gradients | All-to-all routes tokens to experts every layer | Token packing + overlap with compute |
| Inference latency | Constant: same path every token | Constant per token (only N_act fires) | Same; that is the MoE win |

Read the table sideways: every MoE bottleneck that exists is on the memory and communication rows. The compute-and-latency rows look just like dense, except smaller. That is why MoE is an engineering win at frontier scale and a nuisance at small scale: small models do not have a compute bottleneck large enough to repay the extra memory and communication cost.
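The split between the compute rows and the memory rows can be put in numbers. The sketch below compares DeepSeek-V3's shape with a dense 37 B model on the same token count. It uses the 8 bytes/param Adam figure from the table and assumes 2-byte (bf16) weights; gradients, activations, and any fp32 master copy are left out, so the byte counts are a lower bound, and the helper name is hypothetical.

```python
ADAM_BYTES_PER_PARAM = 8     # two fp32 moment tensors, as in the table
WEIGHT_BYTES_PER_PARAM = 2   # assumption: bf16 weights


def training_footprint(n_total, n_act, tokens):
    """FLOPs follow the activated count; resident bytes follow the total count."""
    return {
        "train_flops": 6 * n_act * tokens,
        "weight_bytes": WEIGHT_BYTES_PER_PARAM * n_total,
        "adam_bytes": ADAM_BYTES_PER_PARAM * n_total,
    }


moe = training_footprint(n_total=671e9, n_act=37e9, tokens=14.8e12)
dense = training_footprint(n_total=37e9, n_act=37e9, tokens=14.8e12)

for key in ("train_flops", "weight_bytes", "adam_bytes"):
    print(f"{key:12s}  MoE = {moe[key]:.2e}  dense 37B = {dense[key]:.2e}  "
          f"ratio = {moe[key] / dense[key]:.1f}x")
```

The FLOPs ratio is exactly 1 while both memory ratios equal $G \approx 18$, which is why the engineering fixes in the table are all about sharding state and moving tokens.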

Why frontier labs went all-in on MoE in 2024–2025. The scaling law says it directly: for a fixed compute budget, the compute-optimal MoE has lower predicted loss than the compute-optimal dense model. The formula alone predicts that advantage at every budget; the crossover comes from the overheads it leaves out. In current fits MoE starts winning at roughly $C \approx 5 \cdot 10^{22}$ FLOPs: below that, the memory and routing overhead is not worth it; above that, every reasonable G > 4 beats dense. DeepSeek, Mixtral, Qwen-MoE, GLaM, and Gemini-1.5 all sit decisively above the crossover.

Engineering Reality and Gotchas

MoE scaling-law fits are clean equations, but three production failure modes earn their flags:

  1. γ is not constant across architectures. The conversion exponent $\gamma$ depends on how independent the experts actually are. Switch (one-expert routing) sits around $\gamma \approx 0.25$; DeepSeek-style fine-grained experts with top-8 routing push it toward $0.38$. Under the constants used in this section, moving $\gamma$ across that range shifts the predicted loss of the 671 B model by about 0.01 nats, the same order as the margins that separate candidate designs, so the wrong $\gamma$ can flip a close comparison.
  2. Capacity factor and load balancing affect the effective N_act. The clean formula assumes every token gets exactly its k routed experts. In practice, capacity overflow drops some tokens and load imbalance leaves some experts under-used. Both push the effective $N_{\text{act}}$ below its nominal value. The auxiliary-loss-free balancing of chapter 6 was designed in part so that scaling-law predictions would actually hold at frontier scale.
  3. The data exponent β depends on quality, not just quantity. The fit above uses a single $\beta$, but the real $\beta$ shifts when the mixture changes (chapter 8.4). Increasing the math weight raises $\beta_{\text{math}}$ at the cost of $\beta_{\text{web}}$. Production teams carry a separate $\beta_i$ per domain and aggregate; treating β as a single number is a 0.02-nat mispredict, small but enough to flip a Pareto comparison.

How a frontier team validates a scaling-law prediction before launch. Three checks: (a) fit $(\alpha, \beta, \gamma)$ on at least twenty proxy runs spanning two orders of magnitude in activated count and G; (b) hold out one larger proxy (e.g. 16 B-activated) and verify the law predicts its loss within ±0.02 nats; (c) run a short pilot of the full design (e.g. 100 B tokens of the 37 B production model) and check the early-loss curve agrees with the law's prediction at that token count. Skipping any of the three is how labs end up announcing an MoE that under-performs its claimed compute-equivalent dense.
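The sensitivity to $\gamma$ and the hold-out check (b) can both be sketched in a few lines. The code below evaluates the law at DeepSeek-V3's shape for several values of $\gamma$, using the illustrative constants from this section, and adds a small tolerance helper. `within_tolerance` and the "measured" loss passed to it are hypothetical placeholders, not values from any report.

```python
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def predicted_loss(n_act, d, G, gamma):
    """MoE scaling law with gamma passed explicitly."""
    return E + A * (n_act * G ** gamma) ** (-ALPHA) + B * d ** (-BETA)


def within_tolerance(predicted, measured, tol=0.02):
    """Hold-out check: does the law predict the measured loss within +/- tol nats?"""
    return abs(predicted - measured) <= tol


N_ACT, D, G = 37e9, 14.8e12, 671 / 37
for gamma in (0.25, 0.35, 0.38):
    print(f"gamma = {gamma:.2f}  predicted L = {predicted_loss(N_ACT, D, G, gamma):.4f}")

# Spread in predicted loss between the two ends of the gamma range.
spread = predicted_loss(N_ACT, D, G, 0.25) - predicted_loss(N_ACT, D, G, 0.38)
print(f"spread across gamma in [0.25, 0.38]: {spread:.4f} nats")

# Placeholder hold-out: a made-up measured loss compared with the gamma = 0.35 prediction.
print(within_tolerance(predicted_loss(N_ACT, D, G, 0.35), measured=1.86))
```

The spread is a property of these particular constants and this particular shape; with re-fit constants or a smaller activated count the same range of $\gamma$ can move the prediction by more.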

The one sentence to carry forward: MoE scaling laws turn Chinchilla's two knobs into three ($N_{\text{act}}$, $D$, and $G$), and the third one is what lets a 671 B model train for the cost of a 37 B model and ship for the latency of a 37 B model, which is a large part of why so many frontier releases since 2024 have been MoEs.

Where we go from here. Section 9.3 turns to emergent abilities — capabilities that appear suddenly at a critical compute threshold rather than improving smoothly with scale. We will see why scaling laws (dense or MoE) predict the loss curve beautifully but say almost nothing about which evaluations the model will pass, and how that gap shapes the way labs choose between two designs with indistinguishable predicted loss but very different predicted capabilities.