Chapter 6

The Routing Collapse Problem

Auxiliary-Loss-Free Load Balancing

In the previous chapter we built a Mixture-of-Experts block, sliced the experts fine, and added a small always-on shared pool. The promise was intoxicating: hundreds of experts, only a handful active per token, ten times the parameters at the same active compute. There is one problem nobody warned you about, and it is the single most common reason MoE training runs are quietly abandoned. The router collapses. A handful of experts win every token; the rest go dark, never receive gradient, and become dead weight that consumes memory and bandwidth while contributing nothing. By the time you notice, you have wasted a week of H800 time.

The collapse, in one sentence: a top-$k$ router with no balancing pressure is a positive-feedback system. Whatever expert wins a token early gets more gradient, becomes more attractive, wins more tokens, gets more gradient. The dynamic is exponential, not gradual.

The Real Problem

Picture the first 1000 steps of an MoE training run with 64 experts and top-2 routing. At step 0 the router weights are random, so every expert has roughly an equal chance of being picked. By step 50, tiny accidents have nudged the router: expert 17 happened to win a few extra tokens, nothing dramatic. By step 500, expert 17 is handling 8% of all tokens, more than double its fair share of about 3% (2 of 64). By step 1000, expert 17 and three friends are handling 70% of the corpus and the other 60 experts have effectively stopped learning. The loss curve still goes down, slowly, because the four winners are fitting harder. Most of the model is dead.

This is not a numerical glitch. It is the natural behaviour of a top-$k$ router with no balancing term. The mechanism is the same one that produces winner-take-all markets, viral cascades, and the rich-get-richer dynamic of preferential attachment. MoE happens to wire it directly into the compute graph of your trillion-parameter model.

What the failure looks like in production

| Symptom | What it means | How fast it shows up |
| --- | --- | --- |
| Loss curve plateaus prematurely | Active capacity ≪ total capacity | 1–10k steps |
| Load CoV climbs past 0.5 | Some experts now receive 5–10× others | 100s–1000s of steps |
| Dead expert count > 10% | Pretraining wastes that fraction of params | 10k+ steps |
| All-to-all bandwidth uneven | A few GPUs saturate, others idle | Visible in the profiler as soon as load skews |
| Eval scores stall on diverse tasks | Specialist diversity has collapsed | 1k+ eval steps |
| Restart from checkpoint fails to recover | Dead experts cannot be revived | Permanent |

Why this matters more than it sounds. A dead expert is not a passive parameter on disk. It costs FP8 weight memory on every GPU that shards it, it costs optimizer state (typically 8 bytes per parameter for AdamW's two moment estimates when they are kept in FP32), it costs gradient buffer space, and in expert-parallel layouts it costs all-to-all bandwidth even on the rare batch where someone does route to it. A 256-expert pool with 30% dead means you are training a roughly 180-expert model at 256-expert cost. That is not a small bug. At frontier scale it can mean hundreds of thousands of GPU-hours wasted per training run.
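The arithmetic behind that claim can be sketched in a few lines. The parameter count per expert (40M) and the byte costs are hypothetical placeholders chosen for illustration, not figures from any particular model; substitute your own.

```python
def dead_expert_cost(n_experts, dead_fraction, params_per_expert,
                     weight_bytes=1, optimizer_bytes=8):
    """Return (effective live experts, bytes spent on dead experts).

    weight_bytes=1 assumes FP8 weights; optimizer_bytes=8 assumes two FP32
    AdamW moments. Both are illustrative assumptions.
    """
    live = n_experts * (1.0 - dead_fraction)
    dead = n_experts - live
    wasted_bytes = dead * params_per_expert * (weight_bytes + optimizer_bytes)
    return live, wasted_bytes

# Hypothetical expert size of 40M parameters, one MoE layer.
live, wasted = dead_expert_cost(256, 0.30, params_per_expert=40_000_000)
print(f"effective experts: {live:.1f} of 256")
print(f"memory held by dead experts in one MoE layer: {wasted / 2**30:.1f} GiB")
```

The waste is per MoE layer, so it multiplies by the number of MoE layers in the model.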

The Feedback Loop That Eats the Router

The cleanest way to see why collapse is inevitable is to trace the feedback loop one cycle at a time. Pretend there are two experts, A and B, and at step 0 the router has a microscopic preference for A — say A wins 51% of tokens, B wins 49%.

  1. A gets more tokens. More tokens means more gradient signal reaching A's weights — A learns faster than B this step.
  2. A becomes more useful. Because A has learned more, A contributes more usefully to the loss when it is picked. The router's gradient now flows in the direction that picks A more strongly: the router weight for A grows.
  3. A becomes more attractive. A larger router weight for A means a larger softmax score for A on the next batch — A wins even more tokens than 51%.
  4. Repeat. Each cycle of the loop amplifies the asymmetry. What started as 51/49 becomes 60/40, then 80/20, then 99/1 — and B is functionally dead.

There is no opposing force. Nothing in the standard top-$k$ router penalises an expert for winning too much. Every feedback path points the same way. Without an intervention, this runs to completion.
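The four-step loop above can be traced with a deliberately tiny model. This is a sketch under stated assumptions, not a real router: the score gap between A and B is a single number, A's token share is the sigmoid of that gap, and the gap grows in proportion to A's surplus share. The function name and the constants (`lr=0.5`, `init_gap=0.04`) are illustrative choices; the starting gap is picked so A begins at roughly 51%.

```python
import math

def two_expert_loop(steps=40, lr=0.5, init_gap=0.04):
    """Deterministic two-expert positive-feedback loop.

    gap = score(A) - score(B). Each step, A's token share is sigmoid(gap),
    and the gap grows in proportion to A's surplus over B.
    """
    gap = init_gap
    shares = []
    for _ in range(steps):
        share_a = 1.0 / (1.0 + math.exp(-gap))     # fraction of tokens won by A
        shares.append(share_a)
        gap += lr * (share_a - (1.0 - share_a))     # the winner's score grows with its surplus
    return shares

shares = two_expert_loop()
for step in (0, 10, 20, 30, 39):
    print(f"step {step:2d}: A wins {shares[step]:.1%} of tokens")
```

Nothing in the update ever pulls the gap back toward zero, which is the point: the share of A only moves one way.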

The hospital analogy, broken

In Chapter 5 we modelled MoE as a hospital with a triage desk. Routing collapse is what happens when the triage nurse keeps sending every patient to the cardiologist because the cardiologist saw the most patients yesterday and is therefore "most experienced." The dermatologist sees no patients, learns nothing new, and the gap to cardiology widens. By the end of the year the hospital has one extraordinarily skilled cardiologist and a building full of forgotten specialists. The hospital's budget bought nine expert salaries and got the output of one.

The right frame: routing collapse is not a router design flaw — the router is doing exactly what the gradient told it to. Collapse is a specification flaw: we never told the system that load balance is part of the objective. Sections 6.2 and 6.3 are about adding that specification, the wrong way and the right way.

The Mathematics of Collapse

Let $E$ denote the number of experts, $k$ the top-$k$ value, and $N$ the number of tokens in a batch. Define the load of expert $i$ as the number of tokens dispatched to it:

$$\ell_i = \sum_{t=1}^{N} \mathbb{1}\big[ i \in \mathrm{TopK}(W_r h_t, k) \big]$$

where $h_t$ is the hidden state of token $t$ and $W_r \in \mathbb{R}^{E \times d}$ is the router. The fair share is $\bar{\ell} = N k / E$. The standard imbalance metric is the coefficient of variation:

$$\mathrm{CoV}(\ell) = \frac{\sigma(\ell)}{\mu(\ell)} = \frac{\sqrt{\tfrac{1}{E}\sum_i (\ell_i - \bar{\ell})^2}}{\bar{\ell}}$$

Perfect balance gives $\mathrm{CoV} = 0$. A single-expert monopoly gives $\mathrm{CoV} = \sqrt{E - 1}$; for $E = 256$, that is about 16. The practical alarm threshold in production MoE training is $\mathrm{CoV} > 0.5$.
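As a quick check of these definitions, the sketch below computes the load vector, the fair share, and the CoV with NumPy. The router matrix and hidden states are random stand-ins, and the helper names `expert_load` and `load_cov` are introduced here for illustration only.

```python
import numpy as np

def expert_load(router_w, hidden, k):
    """Count tokens dispatched to each expert under plain top-k routing."""
    scores = hidden @ router_w.T                               # (N, E)
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]      # k largest per row
    return np.bincount(topk.ravel(), minlength=router_w.shape[0])

def load_cov(load):
    """Coefficient of variation with the population standard deviation (1/E)."""
    load = np.asarray(load, dtype=float)
    return load.std() / load.mean()

rng = np.random.default_rng(0)
E, d, N, k = 8, 16, 512, 2
load = expert_load(rng.normal(size=(E, d)), rng.normal(size=(N, d)), k)
print("load:", load.tolist(), "fair share:", N * k / E, "CoV:", round(load_cov(load), 3))

# The two extreme cases from the text, for E = 256 under top-1 routing.
balanced = np.full(256, 10)
monopoly = np.zeros(256)
monopoly[0] = 2560
print(load_cov(balanced), load_cov(monopoly), np.sqrt(255))
```

Under top-$k$ routing with $k > 1$ no single expert can take everything, so the worst case is $k$ experts splitting all tokens, which gives $\sqrt{E/k - 1}$ instead.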

Why the dynamic is exponential, not linear

Let $p_i$ be the probability that a random token picks expert $i$ in its top-$k$. Summarise the router's token-independent preference for expert $i$ as a scalar bias $b_i$. At step $s$, the gradient update to the router weight for expert $i$ is approximately proportional to $p_i \cdot g_i$, where $g_i$ is the per-token usefulness gradient. Replacing $g_i$ by its average $\bar{g}$ and writing $\eta$ for the learning rate, to a first approximation the bias evolves as:

$$b_i^{(s+1)} = b_i^{(s)} + \eta \cdot p_i^{(s)} \cdot \bar{g}$$

Because top-$k$ is monotone in the bias and softmax is monotone in the score, $p_i^{(s+1)}$ is an increasing function of $b_i^{(s+1)}$. Substitute: experts with larger $p_i$ grow their bias faster, which grows their $p_i$ on the next step. This is a discrete-time analogue of $\dot{x} = \alpha x$, pure exponential amplification. Small initial asymmetries blow up; they do not average out.
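To make the exponential claim concrete, here is a short linearised sketch. It assumes, purely for illustration, that near balance $p_i$ responds linearly to the bias gap with some slope $c > 0$ and that $\bar{g} > 0$. The symbols $c$, $\delta_i$, and $\bar{b}$ are introduced here and are not part of the model above.

```latex
\begin{aligned}
\delta_i^{(s)} &:= b_i^{(s)} - \bar{b}^{(s)}, \qquad \bar{b}^{(s)} = \tfrac{1}{E}\textstyle\sum_j b_j^{(s)} \\
p_i^{(s)} &\approx \tfrac{k}{E} + c\,\delta_i^{(s)} \\
b_i^{(s+1)} &= b_i^{(s)} + \eta\,\bar{g}\,\big(\tfrac{k}{E} + c\,\delta_i^{(s)}\big) \\
\bar{b}^{(s+1)} &= \bar{b}^{(s)} + \eta\,\bar{g}\,\tfrac{k}{E}
  \qquad \big(\text{since } \textstyle\sum_j \delta_j^{(s)} = 0\big) \\
\delta_i^{(s+1)} &= (1 + \eta\,c\,\bar{g})\,\delta_i^{(s)}
  \;\Longrightarrow\;
  \delta_i^{(s)} = (1 + \eta\,c\,\bar{g})^{s}\,\delta_i^{(0)}
\end{aligned}
```

Every deviation from the mean bias is multiplied by the same factor $1 + \eta c \bar{g} > 1$ at each step, which is geometric growth. The linearisation stops being valid once $p_i$ saturates near 0 or 1, and that saturated state is the collapse.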

The naive softmax temperature fix does not work. It is tempting to sharpen or flatten the softmax to control concentration. In practice, a sharper softmax (low temperature) accelerates collapse: winners win harder. A flatter softmax (high temperature) makes routing nearly uniform but kills specialisation, which was the whole point of MoE. There is no temperature that balances load without destroying the architecture.
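One way to see why temperature is the wrong knob: for a fixed set of logits, dividing by a positive temperature never reorders them, so the set of experts each token selects is unchanged. Only the gate weights become sharper or flatter. The sketch below checks this on random logits; the head start given to expert 0 and the helper names are illustrative assumptions.

```python
import numpy as np

def softmax(x, temperature):
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def topk_sets(probs, k):
    """Sorted indices of the k most probable experts for each token."""
    idx = np.argsort(-probs, axis=-1)[:, :k]
    return np.sort(idx, axis=-1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 8))
logits[:, 0] += 2.0                                # hypothetical head start for expert 0

k = 2
results = {}
for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature)
    entropy = -(probs * np.log(probs)).sum(axis=-1).mean()
    results[temperature] = (topk_sets(probs, k), entropy)
    print(f"T={temperature}: mean per-token entropy = {entropy:.3f}")

same_selection = all(
    np.array_equal(results[0.5][0], results[t][0]) for t in (1.0, 2.0)
)
print("top-k selection identical across temperatures:", same_selection)
```

During training the temperature does change gradient magnitudes, which is where the acceleration described above comes from, but it offers no direct handle on which experts are selected.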

Manual Numerical Walkthrough

Let us follow the feedback loop with explicit numbers for a four-expert, top-1 router over six steps. The per-step loads are illustrative draws; every bias update is computed by hand from them.

Six steps of a four-expert collapse, by hand

Setup. $E = 4$ experts, top-1 routing, 20 tokens per step. Fair share is $\bar{\ell} = 5$. Router bias starts at $b = [0.10, 0.05, 0.00, -0.05]$, a tiny initial preference for expert 1, the kind of accident any random init produces. Each token $t$ has affinity noise $\epsilon_{t,i}$ drawn from $\mathcal{U}(-0.3, 0.3)$. Picking simplification: token $t$ picks expert $\arg\max_i (b_i + \epsilon_{t,i})$. Update rule: $b_i \leftarrow b_i + 0.05 \cdot (\ell_i - \bar{\ell})$.

Step 1. With this tiny bias and a random sample of 20 tokens, the noise dominates but not symmetrically. Suppose the draw gives $\ell = [7, 5, 4, 4]$. Expert 1 picked up two extra tokens by virtue of its 0.10 bias and a favourable noise draw. Update: $\Delta b = 0.05 \cdot [2, 0, -1, -1] = [0.10, 0, -0.05, -0.05]$. New bias $b = [0.20, 0.05, -0.05, -0.10]$.

Step 2. Bias gap between expert 1 and expert 4 is now 0.30, comparable to the noise amplitude. Expert 1 wins ties that go against the noise. $\ell = [9, 5, 3, 3]$. Update: $\Delta b = [0.20, 0, -0.10, -0.10]$. New bias $b = [0.40, 0.05, -0.15, -0.20]$.

Step 3. The gap of 0.60 between the extremes now equals the largest difference two noise draws can have. Only tokens with the worst possible noise for expert 1 escape it. $\ell = [13, 4, 2, 1]$. Update: $\Delta b = [0.40, -0.05, -0.15, -0.20]$. New bias $b = [0.80, 0.00, -0.30, -0.40]$.

Step 4. Expert 1 is now uncatchable for ordinary tokens. $\ell = [17, 2, 1, 0]$. (With noise strictly bounded at $\pm 0.3$, a lead above 0.60 would already shut the trailing experts out completely, so these loads err on the generous side.) Expert 4 received zero tokens this step: its gradients are zero, its parameters do not update, and its bias moves only through the load-driven update, which pushes it further down. Update: $\Delta b = [0.60, -0.15, -0.20, -0.25]$. New bias $b = [1.40, -0.15, -0.50, -0.65]$.

Step 5. $\ell = [18, 2, 0, 0]$. Two dead experts. The update pushes the bias to $b = [2.05, -0.30, -0.75, -0.90]$: every other expert now trails expert 1 by more than 2.0, and no noise draw can flip the decision. The dead experts will stay dead unless something external intervenes.

Step 6 and beyond. The system has reached a degenerate equilibrium. Expert 1 takes everything, three experts are dead. The router has learned, correctly given its specification, that expert 1 is always the right answer. Compute consumed: 4 experts' worth of parameters, optimizer states, and shard memory. Compute used: 1 expert. The other 3 are pure waste.

The signal hidden in this walkthrough. The collapse did not begin with a flaw. It began with a totally normal initial asymmetry (0.05 bias gap) and a totally normal optimization update. The collapse was waiting in the architecture, encoded in the absence of any term that penalises imbalance. Sections 6.2 and 6.3 will introduce that term.
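The bias arithmetic in the walkthrough is easy to replay. The sketch below takes the per-step loads exactly as listed (they are inputs, not simulated) and applies the stated update rule $b_i \leftarrow b_i + 0.05 \cdot (\ell_i - \bar{\ell})$; the variable names are illustrative.

```python
import numpy as np

LR, FAIR = 0.05, 5.0
bias = np.array([0.10, 0.05, 0.00, -0.05])
loads = [                # per-step loads as listed in the walkthrough
    [7, 5, 4, 4],
    [9, 5, 3, 3],
    [13, 4, 2, 1],
    [17, 2, 1, 0],
    [18, 2, 0, 0],
]

history = []
for step, load in enumerate(loads, start=1):
    load = np.array(load, dtype=float)
    assert load.sum() == 20                      # top-1 routing, 20 tokens per step
    bias = bias + LR * (load - FAIR)
    history.append(bias.copy())
    lead = bias[0] - bias[1]                     # expert 1's lead over the runner-up
    print(f"step {step}: bias={np.round(bias, 2).tolist()}  lead over expert 2 = {lead:.2f}")
```

Watching the lead of expert 1 over the runner-up is the quickest way to see the point of no return: once it exceeds the full noise range of 0.60, no draw can reverse the ordering.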

Visualizing a Routing Collapse

Below is the same dynamic as in the walkthrough, generalized to 8 experts and top-2 routing, with 64 tokens per step. Press Play with Balancing OFF and watch the load distribution evolve. Within 20 to 40 steps the bars on the right go red; those are dead experts. The CoV trace at the bottom climbs past the alarm threshold of 0.5 within a few seconds of real time. Then toggle Balancing ON, reset, and play again: same initial conditions, completely different outcome.

[Interactive routing-collapse simulator: per-expert load bars, CoV trace, Top-2 Share readout, Balancing ON/OFF toggle]

Three behaviours to watch for. First, the collapse is non-monotonic in the short term — individual steps can bounce around — but the trend is unmistakable within a few dozen steps. Real training runs see the same shape over thousands of steps. Second, once an expert goes red (dead), it stays dead. There is no recovery path without external intervention. Third, the Top-2 Share metric is the earliest reliable warning: it climbs measurably before any individual expert dies, giving you time to act if your training infra is watching.

Try this: reset with seed 0 (perfectly fair init) and play with Balancing OFF. Even with zero initial preference, the simulator will collapse within ~60 steps. Random noise alone is sufficient to break the symmetry, and the feedback loop does the rest. The collapse is not caused by bad init; bad init only makes it faster.

Plain Python: Simulating the Collapse

Here is the same dynamic in NumPy, stripped to the bone. No router weight matrix, no token embeddings, no transformer — just the feedback loop in isolation. If you can read this 30-line program you understand the entire problem the rest of this chapter is built to solve.

routing_collapse.py, annotated (the full listing follows the notes)

rng = np.random.default_rng(0): seeded RNG for reproducibility

We want every reader to see the same collapse pattern. A fixed seed makes the simulator deterministic so the conclusion does not depend on luck.

Execution state: rng = Generator(PCG64) seeded with 0

N_EXPERTS, K, BATCH: MoE topology, small but representative

8 experts, top-2 routing, 64 tokens per step. The qualitative collapse dynamic is the same at 256 experts and 4096 tokens; only the numbers scale.

Execution state: N_EXPERTS = 8, K = 2, BATCH = 64

bias: one scalar per expert, the only learnable state in this toy

In a real router this is the (E, d) weight matrix; here we collapse it to one scalar per expert because we are not learning a token-conditioned function. We are isolating the collapse dynamic itself. Initial values are small random offsets in [-0.25, 0.25).

Execution state: bias.shape = (8,)

cumulative_load: the diagnostic axis

We track total tokens received per expert. In a healthy router this stays roughly uniform. In a collapsed router it goes wildly skewed.

noise: per-token affinity noise

Each token has its own per-expert noise term, uniform in [-0.3, 0.3). This stands in for the token-dependent part of the router output and is what gives different tokens different preferences. Critically, it is small relative to the bias once collapse starts.

Execution state: noise.shape = (64, 8)

scores: logits = shared bias + token-specific noise

Broadcasting: bias[None, :] adds the same per-expert bias to every token's noise vector. As bias grows for a winning expert, every token's score for that expert grows. That is the feedback loop made concrete.

Execution state: scores.shape = (64, 8)

topk: top-k routing per token

np.argpartition picks the indices of the K largest scores per row in time linear in the number of experts, without fully sorting the row. Every token sends its activation to its K = 2 highest-scoring experts.

Execution state: topk.shape = (64, 2)

load: count load per expert

Loop over the (token, slot) pairs and increment that expert's counter. In production this is done with a single scatter-add; the loop is here for clarity.

Example: after step 0, load may look like [9, 11, 16, 7, 14, 21, 30, 20] (illustrative values). Already unequal, and the bias has not even updated yet.

fair: what each expert should receive on average

With BATCH = 64 tokens and K = 2 picks each, 128 dispatches are made. Spread over 8 experts that is 16 per expert. Anything far from 16 signals imbalance.

Execution state: fair = 16.0

bias += LR * (load - fair) / fair: the positive-feedback update, the heart of collapse

Experts whose load > fair get a positive bias bump (becoming MORE attractive next step). Experts whose load < fair get a negative bump. This single line encodes the runaway dynamic that destroys vanilla MoE training.

Example: if expert 6 was visited 30 times (fair = 16), its bias grows by 0.10 · (30 - 16)/16 ≈ 0.088. Next step it scores higher for every token by 0.088. One such bump is still small next to noise of ±0.30, but the bumps compound, and within a few steps the noise can no longer dislodge it.

cumulative_load += load: accumulate the diagnostic counter

We log every step into cumulative_load so we can compute end-of-run statistics. Without this we could not see how skewed the long-run distribution becomes.

cov: coefficient of variation, the standard imbalance metric

CoV = σ/μ, computed here on the cumulative load. CoV = 0 means perfectly uniform load. CoV > 0.5 is the standard MoE alarm threshold. CoV > 1.0 means a few experts are doing nearly all the work: full collapse.

top2_share: concentration of compute

What fraction of all tokens went to the top-2 experts? In a healthy 8-expert router this should be 25%. After collapse it is typically over 60%, meaning 6 of 8 experts are nearly silent.

dead: dead expert count

An expert receiving less than 20% of its fair share is functionally dead; its gradients are too sparse to learn anything. Expect several dead experts at the end of this simulation. The exact count depends on the seed.

Full listing of routing_collapse.py:

    import numpy as np

    rng = np.random.default_rng(0)
    N_EXPERTS, K, BATCH = 8, 2, 64
    STEPS, LR = 60, 0.10

    # Router bias per expert. Small random init in [-0.25, 0.25).
    bias = (rng.random(N_EXPERTS) - 0.5) * 0.5
    cumulative_load = np.zeros(N_EXPERTS)

    for step in range(STEPS):
        load = np.zeros(N_EXPERTS, dtype=int)
        # Each token gets noisy logits = bias + per-token affinity noise.
        noise = (rng.random((BATCH, N_EXPERTS)) - 0.5) * 0.6
        scores = bias[None, :] + noise                       # (BATCH, N_EXPERTS)
        # Indices of the K largest scores per row (unordered within the K).
        topk = np.argpartition(-scores, K, axis=1)[:, :K]    # (BATCH, K)
        for row in topk:
            for e in row:
                load[e] += 1

        # Positive feedback: experts that won this step grow more attractive
        # because they received more gradient and their router weight updates
        # in the direction that scores their winning tokens higher. We model
        # that compactly as: bias_e += LR * surplus / fair_share.
        fair = (BATCH * K) / N_EXPERTS
        bias += LR * (load - fair) / fair
        cumulative_load += load

    # End-of-run diagnostics, all computed on the cumulative load.
    cov = cumulative_load.std() / cumulative_load.mean()
    top2_share = np.sort(cumulative_load)[-2:].sum() / cumulative_load.sum()
    dead = int((cumulative_load < 0.2 * fair * STEPS).sum())
    print(f"CoV={cov:.2f}  top-2 share={top2_share:.0%}  dead={dead}/{N_EXPERTS}")

Run this snippet at home and you should see CoV land well above 1.0 after 60 steps with $E = 8$ (the ceiling for this configuration is $\sqrt{E/k - 1} = \sqrt{3} \approx 1.73$, reached when two experts split every token), top-2 share well over 50%, and several dead experts. The exact numbers depend on the seed; the qualitative outcome does not. Change N_EXPERTS to 64 and BATCH to 256 and you get a richer version of the same collapse: more visibly skewed, more dead experts in absolute terms.

The minimal counterexample. Replace the bias update with $b_i \leftarrow b_i - 0.06 \cdot (\ell_i - \bar{\ell})/\bar{\ell}$, flipping the sign so winners get pushed down (0.06 is simply a slightly smaller step size). Re-run. The collapse does not develop: the biases settle and the cumulative CoV stays low instead of climbing. That sign flip is the core idea behind DeepSeek's auxiliary-loss-free balancing in section 6.3, a single line of code that prevents days of wasted training.
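To compare the two update rules side by side, the sketch below wraps the simulator in a function with a `feedback_sign` argument. The function name, the argument, and the vectorised counting via `np.bincount` are choices made here for brevity; the dynamics are the same toy model as the listing above, and this is not DeepSeek's actual implementation.

```python
import numpy as np

def simulate(feedback_sign, lr, steps=60, n_experts=8, k=2, batch=64, seed=0):
    """Run the toy router and return the CoV of the cumulative load.

    feedback_sign=+1 reproduces the collapse; feedback_sign=-1 pushes
    overloaded experts down instead of up.
    """
    rng = np.random.default_rng(seed)
    bias = (rng.random(n_experts) - 0.5) * 0.5
    cumulative = np.zeros(n_experts)
    fair = batch * k / n_experts
    for _ in range(steps):
        noise = (rng.random((batch, n_experts)) - 0.5) * 0.6
        scores = bias[None, :] + noise
        topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
        load = np.bincount(topk.ravel(), minlength=n_experts)
        bias += feedback_sign * lr * (load - fair) / fair
        cumulative += load
    return cumulative.std() / cumulative.mean()

cov_collapse = simulate(feedback_sign=+1, lr=0.10)
cov_balanced = simulate(feedback_sign=-1, lr=0.06)
print(f"positive feedback: CoV={cov_collapse:.2f}   negative feedback: CoV={cov_balanced:.2f}")
```

With negative feedback the update behaves like an integral controller: an expert's bias keeps moving until its load matches the fair share, so the initial asymmetry is absorbed rather than amplified.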

PyTorch: Detecting Collapse in Real Training

Before we fix the collapse in sections 6.2 to 6.4, we need a way to detect it inside a real training loop. The simulator above made the dynamic visible because we instrumented it from the start; production MoE training runs are silent unless you wire diagnostics yourself. The following function is the minimum viable instrumentation — call it once per step, log every value, alert when any one crosses a threshold.

routing_diagnostics.py, annotated (the full listing follows the notes)

@torch.no_grad(): diagnostics, not training

Routing diagnostics are read-only sensors on the live router output. We do not want gradients flowing back through them. The decorator guarantees the autograd graph stays clean.

Signature: minimal contract

Takes the raw router logits the training step already computed, plus k and n_experts. Returning a dict of plain numbers is the right shape for wandb / tensorboard / prometheus.

Execution state: router_logits.shape = (N, E)

router_logits.topk(k, dim=-1): top-k per token

Same call as the live router uses to pick experts. We reproduce what the router actually dispatched. Taking top-k of the softmax probabilities would select the same experts, because softmax preserves order; the two can differ only on floating-point ties.

Execution state: topk_idx.shape = (N, k)

flat: flatten (token, slot) into a 1D index stream

bincount needs a flat tensor. Flattening over the k slots is correct because every (token, slot) pair is one dispatch event.

Execution state: flat.shape = (N*k,)

load: bincount = per-expert token count

torch.bincount with minlength=n_experts is the canonical way to count occurrences. Output index i = how many tokens were dispatched to expert i this batch. This is exactly the load vector.

Execution state: load.shape = (E,)

fair: fair share

If everything were uniform, each expert would receive load.sum() / n_experts tokens. This is the reference line on every imbalance plot.

cov: coefficient of variation, the alarm metric

CoV = std / mean. The clamp(min=1.0) on the mean is a defensive guard against an empty batch, where the mean is 0 and the division would fail. Be aware that it also understates CoV whenever the mean load is below one token per expert, which can happen with a tiny batch. In production, log this every step.

Example: healthy run: 0.10 – 0.35. Drifting: 0.35 – 0.60. Collapse: > 1.0.

dead: dead expert count, the kill metric

An expert receiving less than 20% of its fair share is starving for gradient. If this stays > 0 for thousands of steps, those experts are dead weight: parameters, optimizer states, and shard memory you are paying for and getting nothing from.

top2_share: concentration, the failure signature

Top-2 share answers: are two experts doing most of the work? In a 64-expert pool, perfectly uniform load gives a top-2 share of about 3% (2/64). > 30% means routing has crystallised into a small winner set.

p and entropy: router entropy, early warning

Average softmax(logits) over the batch, then compute its Shannon entropy. Maximum entropy = log(E) means the batch-averaged routing distribution is uniform (good: every expert has a chance), even if each individual token's distribution is sharp. Low entropy means the router has decided the answer is always 'expert 6' regardless of token.

Execution state: p.shape = (E,)

norm_entropy: normalize entropy to [0, 1]

Dividing by log(E) makes the metric comparable across MoE configurations. norm_entropy = 1.0 is uniform routing, norm_entropy → 0 is total collapse. A healthy run should hold norm_entropy above roughly 0.85.

Where this hooks in

Call once per training step after the router forward pass. The cost is small (one bincount, a few reductions, and a handful of .item() host syncs), so it should add only a small fraction to the step time, but it is the difference between catching collapse on hour 3 and discovering it on day 9.

Wire the alarm

Any of these signals crossing a threshold should page the on-call. A single bad batch is noise; a CoV trending up over 100 batches is a real collapse. Treat this metric with the same alerting priority as loss explosions.

Full listing of routing_diagnostics.py:

    import math

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def routing_diagnostics(router_logits: torch.Tensor, k: int,
                            n_experts: int) -> dict[str, float]:
        """
        router_logits: (N, E), where N = batch*seq and E = number of experts.
        Returns diagnostics every MoE training loop should log.
        """
        # 1. Pick top-k experts per token (the gate values are not needed here).
        _, topk_idx = router_logits.topk(k, dim=-1)              # (N, k)

        # 2. Count how many tokens each expert received this batch.
        flat = topk_idx.reshape(-1)                              # (N*k,)
        load = torch.bincount(flat, minlength=n_experts).float() # (E,)

        # 3. Imbalance metrics.
        fair = load.sum() / n_experts
        cov = load.std(unbiased=False) / load.mean().clamp(min=1.0)
        dead = (load < 0.2 * fair).sum()
        top2_share = load.topk(2).values.sum() / load.sum()

        # 4. Router entropy: low entropy means routing has crystallised.
        p = F.softmax(router_logits, dim=-1).mean(dim=0)         # (E,)
        entropy = -(p * (p + 1e-12).log()).sum()
        max_entropy = math.log(n_experts)

        return {
            "load_cov": cov.item(),
            "dead_experts": int(dead.item()),
            "top2_share": top2_share.item(),
            "norm_entropy": (entropy / max_entropy).item(),
        }

    # Inside the training loop:
    #   logits = router(hidden_states)      # (B*T, E)
    #   stats = routing_diagnostics(logits, k=2, n_experts=64)
    #   if stats["load_cov"] > 0.5:
    #       wandb.alert(title="MoE routing collapse imminent",
    #                   text=str(stats))

Three of the four metrics are derived from the load vector and one is derived from the router probabilities directly. Together they cover the failure space:

| Metric | Healthy | Warning | Collapsed |
| --- | --- | --- | --- |
| load_cov | < 0.35 | 0.35 – 0.60 | > 1.0 |
| dead_experts (as a share of E) | 0 | 1 – 5% | > 10% |
| top2_share (E=64) | ~ 3 – 6% | 10 – 25% | > 40% |
| norm_entropy | > 0.85 | 0.70 – 0.85 | < 0.50 |

What this does not do. Detecting collapse does not prevent it. The detector is necessary infrastructure but it is a smoke alarm, not a sprinkler. The actual prevention mechanisms are: auxiliary balance losses (section 6.2, the textbook approach that hurts model quality), bias-term balancing (section 6.3, DeepSeek's solution that does not), and the sequence-level complementary loss (section 6.4). All three are responses to the dynamic this section just exposed.
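Since a single bad batch is noise and a sustained trend is the real signal, it helps to alert on a windowed average instead of the raw per-step value. The class below is a minimal sketch of that idea; `CovTrendAlarm`, its threshold, and its window length are hypothetical names and defaults, not part of any logging library.

```python
from collections import deque

class CovTrendAlarm:
    """Fires when the mean load_cov over a full window exceeds a threshold."""

    def __init__(self, threshold=0.5, window=100):
        self.threshold = threshold
        self.values = deque(maxlen=window)

    def update(self, load_cov: float) -> bool:
        self.values.append(load_cov)
        window_full = len(self.values) == self.values.maxlen
        mean = sum(self.values) / len(self.values)
        return window_full and mean > self.threshold

# One noisy spike in an otherwise healthy run should not fire.
alarm = CovTrendAlarm(threshold=0.5, window=100)
spike = [alarm.update(2.0 if step == 50 else 0.2) for step in range(200)]

# A steady upward drift should fire once the window mean crosses the threshold.
alarm = CovTrendAlarm(threshold=0.5, window=100)
drift = [alarm.update(0.2 + 0.004 * step) for step in range(300)]

print("spike fired:", any(spike))
print("drift first fired at step:", drift.index(True))
```

In a training loop you would feed it `stats["load_cov"]` from the diagnostics function each step and page only when `update` returns True.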

What Changes at Massive Scale

Everything in this section was demonstrated with 8 experts and 64 tokens per step. The real DeepSeek-V3 run used 256 routed experts, top-8 routing, and batches of millions of tokens, sharded across 2048 H800s. The collapse dynamic does not get gentler at scale. It gets sharper and more expensive.

Why scale makes it worse

  1. More experts means more dead-expert capacity at risk. With 8 experts, losing 3 to collapse is 37.5% of the model. With 256 experts, the same fractional collapse loses 96 experts: 96 experts' worth of FFN parameters, wasted.
  2. Sparser routing amplifies the feedback. Top-8 of 256 is a 3% activation ratio. Each expert sees, on average, 3% of the batch. Relative variance in that small share is high, so random fluctuations in early training are large enough to seed asymmetries that the feedback loop then amplifies.
  3. Cross-device traffic patterns become catastrophic. Experts are sharded across nodes. A collapsed router sends all tokens to a few nodes, so the all-to-all becomes an all-to-few, saturating a tiny slice of the network while the rest of the cluster idles. GPU utilization can fall to a fraction of its healthy level as collapse advances. You see this in NCCL profiles before you see it in the loss curve.
  4. Recovery is logistically impossible. With 64 experts on a single node you might try resetting the router and warm-starting from a checkpoint. At 256 experts across 2000 GPUs, restarting means hours of checkpoint loading and rewinding the dataloader; doing it more than once per training run is unacceptable.
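The all-to-few effect in item 3 can be quantified from the load vector alone. The sketch below assumes a simple layout in which experts are sharded contiguously, with the same number of experts on every device; the helper names and the 32-device layout are illustrative, and real expert-parallel placements may differ.

```python
import numpy as np

def device_load(expert_load, n_devices):
    """Sum per-expert load over contiguous expert shards, one shard per device."""
    expert_load = np.asarray(expert_load, dtype=float)
    assert expert_load.size % n_devices == 0
    return expert_load.reshape(n_devices, -1).sum(axis=1)

def hotspot_ratio(expert_load, n_devices):
    """Busiest device's traffic divided by the average device's traffic."""
    per_device = device_load(expert_load, n_devices)
    return per_device.max() / per_device.mean()

E, n_devices, dispatches = 256, 32, 8 * 4096       # top-8 routing over 4096 tokens
uniform = np.full(E, dispatches / E)
collapsed = np.zeros(E)
collapsed[:8] = dispatches / 8                     # eight winners, all on device 0

print("uniform hotspot ratio:  ", hotspot_ratio(uniform, n_devices))
print("collapsed hotspot ratio:", hotspot_ratio(collapsed, n_devices))
```

A ratio of 1 means every device receives the same traffic; a ratio equal to the device count means one device receives everything while the others idle.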

The DeepSeek-V3 evidence

The DeepSeek-V3 technical report describes pretraining on 14.8T tokens with the auxiliary-loss-free balancer and states that the model kept a good load balance throughout training, with no token dropping and no rollbacks. Earlier MoE work in the open literature (GShard, Switch Transformer, ST-MoE) relied on a non-trivial auxiliary balance loss in the training objective. The progression of approaches in sections 6.2 through 6.4 mirrors the historical progression of the field: we knew about collapse for years before we knew the right way to prevent it.

Engineering Reality and Gotchas

Several practical lessons recur across MoE training postmortems:

  1. Collapse can hide behind a falling loss curve. The surviving experts continue to learn and the loss continues to drop, slowly. Teams without routing diagnostics can ship models with 20% dead experts and never know. Always log the four metrics above; they cost far less than the compute they protect.
  2. The first 1000 steps are critical. Once an asymmetry establishes itself in the early steps it is very hard to undo. Some training infrastructures apply a router warmup (uniform routing for the first $N$ steps, then a gradual fade-in of the learned router) specifically to let every expert collect a baseline gradient signal before the feedback loop activates.
  3. Random init does not save you. A common misconception is that initialising the router weights uniformly gives every expert a fair start. It does, technically, but as the simulator showed, even a fair start collapses within tens of steps. The asymmetry is generated by the stochasticity of the data stream, not by the init.
  4. Recoverability is a myth. Once an expert has missed thousands of gradient steps, its parameters are not just stale; they are actively wrong relative to where the rest of the model has moved. Reviving it requires far more compute than preventing its death would have cost.
  5. Fine-grained MoE (chapter 5) makes the stakes higher. When each expert is small, you can afford more of them, which is why DeepSeek scaled to 256 experts. But more experts means more surface area for collapse, and a dead small expert is still a dead expert. The fine-grained gains of chapter 5 are conditional on solving the collapse problem this chapter addresses.
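The router warmup mentioned in item 2 can take several forms. One possible reading, sketched below, blends the learned logits with random scores: the weight on the learned router is 0 during the uniform phase and then ramps linearly to 1. The function names, the linear ramp, and the step counts are all assumptions made for illustration, not a description of any specific training stack.

```python
import numpy as np

def warmup_alpha(step, uniform_steps=1000, fade_steps=1000):
    """Weight on the learned router: 0 in the uniform phase, then a linear ramp to 1."""
    if step < uniform_steps:
        return 0.0
    return min(1.0, (step - uniform_steps) / fade_steps)

def warmup_scores(router_logits, step, rng):
    """Blend learned logits with random scores; alpha = 0 gives uniformly random top-k."""
    alpha = warmup_alpha(step)
    random_scores = rng.standard_normal(router_logits.shape)
    return alpha * router_logits + (1.0 - alpha) * random_scores

def load_at(step, k=2, seed=0):
    """Per-expert load for a router that already strongly favours experts 0 and 1."""
    rng = np.random.default_rng(seed)
    logits = np.zeros((4096, 8))
    logits[:, :2] = 5.0
    scores = warmup_scores(logits, step, rng)
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=8)

for step in (0, 1500, 2000):
    print(f"step {step}: alpha={warmup_alpha(step):.1f}  load={load_at(step).tolist()}")
```

Note what the example shows: warmup delays the feedback loop, it does not remove it. Once the learned router is fully faded in, a skewed router routes exactly as skewed as before.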

What does NOT prevent collapse: dropout on router logits (delays it by a few hundred steps), higher learning rate (accelerates it), smaller batch size (accelerates it via noisier load estimates), gradient clipping (orthogonal: it affects magnitude, not direction), expert capacity limits with token dropping (hides collapse in a metric while making training quality worse). None of these are real solutions. They all treat symptoms.

Where we go next. Section 6.2 examines the classical solution: adding an auxiliary balance loss to the training objective. It works, in the sense that it prevents collapse, but it costs model quality in a way that is invisible until you compare to a model trained without it. Section 6.3 introduces DeepSeek's bias-term balancing, which prevents collapse without any auxiliary loss at all, the technical insight that made DeepSeek-V3's 256-expert pool possible. Section 6.4 adds the sequence-level safety net for edge cases. By the end of this chapter, the collapse you just watched in the simulator will be one diagnostic away from impossible.

Carry one sentence forward into the next section: top-$k$ routing without a balancing pressure is a positive-feedback loop, and positive-feedback loops always collapse to a degenerate equilibrium. Everything else in this chapter is engineering around that single dynamical fact.
