In the previous chapter we built a Mixture-of-Experts block, sliced the experts fine, and added a small always-on shared pool. The promise was intoxicating: hundreds of experts, only a handful active per token, ten times the parameters at the same active compute. There is one problem nobody warned you about, and it is the single most common reason MoE training runs are quietly abandoned. The router collapses. A handful of experts win every token; the rest go dark, never receive gradient, and become dead weight that consumes memory and bandwidth while contributing nothing. By the time you notice, you have wasted a week of H800 time.
The collapse, in one sentence: a top-$k$ router with no balancing pressure is a positive-feedback system. Whatever expert wins a token early gets more gradient, becomes more attractive, wins more tokens, gets more gradient. The dynamic is exponential, not gradual.
The Real Problem
Picture the first 1000 steps of an MoE training run with 64 experts and top-2 routing. At step 0 the router weights are random, so every expert has roughly an equal chance of being picked. By step 50, tiny accidents have nudged the router: expert 17 happened to win a few extra tokens, nothing dramatic. By step 500, expert 17 is handling 8% of all tokens. By step 1000, expert 17 and three friends are handling 70% of the corpus and the other 60 experts have effectively stopped learning. The loss curve still goes down, slowly, because the four winners are fitting harder. Most of the model is dead.
This is not a numerical glitch. It is the natural behaviour of an unbiased top-$k$ router. The mechanism is the same one that produces winner-take-all markets, viral cascades, and the rich-get-richer dynamic of preferential attachment. MoE happens to wire it directly into the compute graph of your trillion-parameter model.
What the failure looks like in production
| Symptom | What it means | How fast it shows up |
|---|---|---|
| Loss curve plateaus prematurely | Active capacity ≪ total capacity | 1–10k steps |
| Load CoV climbs past 0.5 | Some experts now receive 5–10× others | 100s–1000s of steps |
| Dead expert count > 10% | Pretraining wastes that fraction of params | 10k+ steps |
| All-to-all bandwidth uneven | A few GPUs saturate, others idle | Visible in profiler from step 1 |
| Eval scores stall on diverse tasks | Specialist diversity has collapsed | 1k+ eval steps |
| Restart from checkpoint fails to recover | Dead experts cannot be revived | Permanent |
The Feedback Loop That Eats the Router
The cleanest way to see why collapse is inevitable is to trace the feedback loop one cycle at a time. Pretend there are two experts, A and B, and at step 0 the router has a microscopic preference for A — say A wins 51% of tokens, B wins 49%.
- A gets more tokens. More tokens means more gradient signal reaching A's weights — A learns faster than B this step.
- A becomes more useful. Because A has learned more, A contributes more usefully to the loss when it is picked. The router's gradient now flows in the direction that picks A more strongly: the router weight for A grows.
- A becomes more attractive. A larger router weight for A means a larger softmax score for A on the next batch — A wins even more tokens than 51%.
- Repeat. Each cycle of the loop amplifies the asymmetry. What started as 51/49 becomes 60/40, then 80/20, then 99/1 — and B is functionally dead.
There is no opposing force. Nothing in the standard top-$k$ router penalises an expert for winning too much. Every feedback path points the same way. Without an intervention, this runs to completion.
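The loop above can be traced numerically. The following is a toy sketch, not a model of a real router: it assumes two experts whose routing share is a sigmoid of their bias gap, and a hypothetical update in which the gap grows in proportion to the current share difference (`lr` is an illustrative constant, not a value from the text).

```python
import math

def two_expert_loop(share_a=0.51, lr=1.0, steps=20):
    """Trace expert A's token share under an unopposed positive-feedback update."""
    # Bias gap (A minus B) that reproduces the starting share under a sigmoid router.
    gap = math.log(share_a / (1.0 - share_a))
    shares = []
    for _ in range(steps):
        p_a = 1.0 / (1.0 + math.exp(-gap))   # fraction of tokens routed to A
        shares.append(p_a)
        # Assumed update: the expert with more tokens gets a proportionally
        # larger push, so the gap grows by lr * (p_a - p_b).
        gap += lr * (p_a - (1.0 - p_a))
    return shares

shares = two_expert_loop()
for step in (0, 5, 10, 15, 19):
    print(f"step {step:2d}: A wins {shares[step]:.1%} of tokens")
```

The share moves slowly at first and then runs away. Nothing in the update ever pushes A back down, which is the point of the four bullets above.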
The hospital analogy, broken
In Chapter 5 we modelled MoE as a hospital with a triage desk. Routing collapse is what happens when the triage nurse keeps sending every patient to the cardiologist because the cardiologist saw the most patients yesterday and is therefore "most experienced." The dermatologist sees no patients, learns nothing new, and the gap to cardiology widens. By the end of the year the hospital has one extraordinarily skilled cardiologist and a building full of forgotten specialists. The hospital's budget bought nine expert salaries and got the output of one.
The Mathematics of Collapse
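Let $E$ denote the number of experts, $k$ the top-k value, and $T$ the number of tokens in a batch. Define the load of expert $i$ as the number of tokens dispatched to it:
$$L_i = \sum_{t=1}^{T} \mathbb{1}\left[\, i \in \operatorname{TopK}\big(r(h_t),\, k\big) \right]$$
where $h_t$ is the hidden state of token $t$ and $r$ is the router. The fair share is $\bar{L} = kT/E$. The standard imbalance metric is the coefficient of variation:
$$\mathrm{CoV} = \frac{1}{\bar{L}} \sqrt{\frac{1}{E} \sum_{i=1}^{E} \left(L_i - \bar{L}\right)^2}$$
Perfect balance gives $\mathrm{CoV} = 0$. A single-expert monopoly under top-1 routing gives $\mathrm{CoV} = \sqrt{E - 1}$; for $E = 256$, that is about 16. The practical alarm threshold in production MoE training is $\mathrm{CoV} > 0.5$.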
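As a quick check on the metric, the sketch below computes the coefficient of variation of a load vector (population standard deviation divided by the mean) and evaluates it for a balanced case and for a top-1 monopoly over 256 experts. The function name `load_cov` and the example loads are illustrative choices, not taken from the text.

```python
import math
from statistics import fmean, pstdev

def load_cov(loads):
    """Coefficient of variation of per-expert token counts."""
    mean = fmean(loads)
    if mean == 0:
        raise ValueError("no tokens were routed")
    return pstdev(loads) / mean

balanced = [5, 5, 5, 5]            # every expert at its fair share
monopoly = [1000] + [0] * 255      # top-1 routing, one expert takes every token

print(load_cov(balanced))                    # 0.0
print(load_cov(monopoly), math.sqrt(255))    # both close to 15.97
```

The monopoly value does not depend on the batch size: scaling every load by the same factor leaves the ratio unchanged, so only the number of experts matters.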
Why the dynamic is exponential, not linear
Let $p_i$ be the probability that a random token picks expert $i$ in its top-$k$. At step $s$, the gradient update to the router weight for expert $i$ is approximately proportional to $p_i^{(s)} u_i$, where $u_i$ is the per-token usefulness gradient. To a first approximation the bias $b_i$ evolves, with learning rate $\eta$, as:
$$b_i^{(s+1)} = b_i^{(s)} + \eta \, p_i^{(s)} \, u_i$$
Because top-k is monotone in the bias and softmax is monotone in the score, $p_i$ is an increasing function of $b_i$. Substitute: experts with larger $b_i$ grow their bias faster, which grows their $p_i$ on the next step. This is a discrete-time analogue of $\dot{x} = \alpha x$: pure exponential amplification. Small initial asymmetries blow up; they do not average out.
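To make the exponential claim concrete, here is a hedged linearisation near the balanced point. The symbols are introduced for illustration only: $\delta^{(s)}$ is one expert's bias excess over the average bias at step $s$, $k/E$ is the balanced selection probability for top-$k$ routing over $E$ experts, $c > 0$ is the assumed slope of the selection probability with respect to a small bias excess, and the step size $\eta$ and usefulness gradient $u > 0$ are assumed identical for all experts.

```latex
\begin{align*}
p^{(s)} &\approx \frac{k}{E} + c\,\delta^{(s)} \\
\delta^{(s+1)} &= \delta^{(s)} + \eta\,u\,\Bigl(p^{(s)} - \frac{k}{E}\Bigr)
               \approx (1 + \eta u c)\,\delta^{(s)} \\
\delta^{(s)} &\approx (1 + \eta u c)^{s}\,\delta^{(0)}
\end{align*}
```

Since $1 + \eta u c > 1$, any nonzero initial excess grows geometrically until the linear approximation breaks down, which happens when the router saturates and one expert wins nearly every token.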
Manual Numerical Walkthrough
Let us follow the feedback loop with explicit numbers for a four-expert, top-1 router over six steps. Every value is computed by hand.
Six steps of a four-expert collapse, by hand
Setup. $E = 4$ experts, top-1 routing, 20 tokens per step. Fair share is $20 / 4 = 5$ tokens per expert. The router bias starts with a tiny initial preference for expert 1, the kind of accident any random init produces. Each token draws a small random affinity noise for every expert. Picking simplification: each token picks the expert with the largest bias plus noise. Update rule: after every step, each expert's bias moves in proportion to the gap between its load and the fair share.
Step 1. With this tiny bias and a random sample of 20 tokens, the noise dominates, but not symmetrically. Expert 1 picked up two extra tokens by virtue of its 0.10 bias and a favourable noise draw, and the update raises its bias further.
Step 2. The bias gap between expert 1 and expert 4 is now 0.30, comparable to the noise amplitude. Expert 1 wins ties that go against the noise, and the update widens the gap again.
Step 3. The gap of 0.60 between the extremes now exceeds the full noise range. Only tokens with the worst possible noise for expert 1 escape it.
Step 4. Expert 1 is now uncatchable for ordinary tokens. Expert 4 received zero tokens this step: its gradients are zero, its parameters do not update, and its bias does not move except by the rebalance term.
Step 5. Two experts are now dead. The bias grows further: the gap to expert 1 is now over 2.0, so no noise draw can flip the decision. The dead experts will stay dead unless something external intervenes.
Step 6 and beyond. The system has reached a degenerate equilibrium. Expert 1 takes everything, three experts are dead. The router has learned, correctly given its specification, that expert 1 is always the right answer. Compute consumed: 4 experts' worth of parameters, optimizer states, and shard memory. Compute used: 1 expert. The other 3 are pure waste.
The signal hidden in this walkthrough. The collapse did not begin with a flaw. It began with a totally normal initial asymmetry (0.05 bias gap) and a totally normal optimization update. The collapse was waiting in the architecture, encoded in the absence of any term that penalises imbalance. Sections 6.2 and 6.3 will introduce that term.
Visualizing a Routing Collapse
Below is the four-step dynamic from the walkthrough, except generalized to 8 experts and top-2 routing, with 64 tokens per step. Press Play with Balancing OFF and watch the load distribution evolve. Within 20 to 40 steps the bars on the right go red — those are dead experts. The CoV trace at the bottom climbs past the alarm threshold of 0.5 within a few seconds of real time. Then toggle Balancing ON, reset, and play again — same initial conditions, completely different outcome.
Three behaviours to watch for. First, the collapse is non-monotonic in the short term — individual steps can bounce around — but the trend is unmistakable within a few dozen steps. Real training runs see the same shape over thousands of steps. Second, once an expert goes red (dead), it stays dead. There is no recovery path without external intervention. Third, the Top-2 Share metric is the earliest reliable warning: it climbs measurably before any individual expert dies, giving you time to act if your training infra is watching.
Plain Python: Simulating the Collapse
Here is the same dynamic in NumPy, stripped to the bone. No router weight matrix, no token embeddings, no transformer — just the feedback loop in isolation. If you can read this 30-line program you understand the entire problem the rest of this chapter is built to solve.
Run this snippet at home and you will see CoV land somewhere in the 1.0 to 1.5 range after 60 steps with $E = 8$, top-2 share well over 50%, and two or three dead experts. The exact numbers depend on the seed; the qualitative outcome does not. Change $E$ to 64 and the tokens per step to 256 and you get a richer version of the same collapse: more visibly skewed, more dead experts in absolute terms.
The minimal counterexample. Replace the bias update with its negation, flipping the sign so winners get pushed down. Re-run. The collapse evaporates instantly. CoV stays under 0.2 forever. That sign flip is the entire idea behind DeepSeek's auxiliary-loss-free balancing in section 6.3: a single line of code that prevents days of wasted training.
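The feedback loop and its sign-flipped counterexample can be sketched in NumPy as follows. This is a reconstruction under stated assumptions, not the author's program: the token model (expert bias plus unit Gaussian noise), the learning rate `lr`, the seed, and the function name `simulate` are all illustrative, so the exact CoV and dead-expert counts it prints will differ from any figures quoted for the original snippet.

```python
import numpy as np

def simulate(sign=+1.0, n_experts=8, top_k=2, tokens=64, steps=60, lr=0.05, seed=0):
    """Run the bias feedback loop; return (final load, mean CoV over the last 10 steps).

    sign=+1 rewards experts that are over their fair share (collapse);
    sign=-1 pushes them down (the sign-flipped counterexample).
    """
    rng = np.random.default_rng(seed)
    bias = np.zeros(n_experts)
    fair = tokens * top_k / n_experts
    covs = []
    for _ in range(steps):
        # Assumed token model: affinity = expert bias + unit Gaussian noise.
        scores = bias + rng.normal(size=(tokens, n_experts))
        chosen = np.argsort(-scores, axis=1)[:, :top_k]      # top-k experts per token
        load = np.bincount(chosen.ravel(), minlength=n_experts)
        covs.append(load.std() / load.mean())
        bias += sign * lr * (load - fair)                     # the one line that matters
    return load, float(np.mean(covs[-10:]))

for label, sign in (("positive feedback", +1.0), ("sign flipped", -1.0)):
    load, cov = simulate(sign=sign)
    dead = int((load == 0).sum())
    print(f"{label:17s} CoV={cov:.2f} dead={dead} load={load.tolist()}")
```

The two runs share the same seed and differ only in `sign`. With positive feedback the load concentrates on a few experts and the rest receive nothing; with the sign flipped the loads stay near the fair share of 16 token slots per expert.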
PyTorch: Detecting Collapse in Real Training
Before we fix the collapse in sections 6.2 to 6.4, we need a way to detect it inside a real training loop. The simulator above made the dynamic visible because we instrumented it from the start; production MoE training runs are silent unless you wire diagnostics yourself. The following function is the minimum viable instrumentation — call it once per step, log every value, alert when any one crosses a threshold.
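A minimal sketch of such a function follows. It is not the author's original code: the name `routing_health`, its signature, and the exact metric definitions are assumptions. It is written against NumPy arrays so it runs without a training stack; the operations it relies on (softmax, top-k selection, bincount) also exist in PyTorch. It assumes `router_logits` has shape `(tokens, experts)`.

```python
import numpy as np

def routing_health(router_logits, top_k):
    """Compute four routing diagnostics for one batch of router logits."""
    logits = np.asarray(router_logits, dtype=np.float64)
    n_tokens, n_experts = logits.shape

    # Numerically stable softmax over experts.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    # Load vector: how many tokens selected each expert in their top-k.
    chosen = np.argsort(-logits, axis=1)[:, :top_k]
    load = np.bincount(chosen.ravel(), minlength=n_experts)

    # Entropy of the batch-averaged router distribution, scaled to [0, 1].
    mean_p = probs.mean(axis=0)
    entropy = -np.sum(mean_p * np.log(mean_p + 1e-12))

    return {
        "load_cov": float(load.std() / load.mean()),
        "dead_experts": float((load == 0).mean()),   # fraction with zero tokens
        "top2_share": float(np.sort(load)[-2:].sum() / load.sum()),
        "norm_entropy": float(entropy / np.log(n_experts)),
    }

rng = np.random.default_rng(0)
healthy = rng.normal(size=(4096, 8))
collapsed = healthy.copy()
collapsed[:, :2] += 10.0            # two experts dominate every token
print(routing_health(healthy, top_k=2))
print(routing_health(collapsed, top_k=2))
```

In a training loop the logits would come from the router's forward pass for the current batch, and the returned dictionary would be sent to whatever logger the run already uses.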
Three of the four metrics are derived from the load vector and one is derived from the router probabilities directly. Together they cover the failure space:
| Metric | Healthy | Warning | Collapsed |
|---|---|---|---|
| load_cov | < 0.35 | 0.35 – 0.60 | > 1.0 |
| dead_experts | 0 | 1 – 5% | > 10% |
| top2_share (E=64) | ~ 3 – 6% | 10 – 25% | > 40% |
| norm_entropy | > 0.85 | 0.70 – 0.85 | < 0.50 |
What Changes at Massive Scale
Everything in this section was demonstrated with 8 experts and 64 tokens per step. Real DeepSeek-V3 runs with 256 routed experts, top-8 routing, 4M tokens per batch, sharded across 2000+ H800s. The collapse dynamic does not get gentler at scale. It gets sharper and more expensive.
Why scale makes it worse
- More experts means more dead-expert capacity at risk. With 8 experts, losing 3 to collapse is 37.5% of the model. With 256 experts, the same fractional collapse loses 96 experts — 96 × the FFN parameters of the dense baseline, wasted.
- Sparser routing amplifies the feedback. Top-8 of 256 is a 3% activation ratio. Each expert sees, on average, 3% of the batch. Variance in that 3% is high — random fluctuations in early training are large enough to seed asymmetries that the feedback loop then amplifies.
- Cross-device traffic patterns become catastrophic. Experts are sharded across nodes. A collapsed router sends all tokens to a few nodes — the all-to-all becomes an all-to-few, saturating a tiny slice of the network while the rest of the cluster idles. GPU utilization drops from 60% to 15% as collapse advances. You see this in NCCL profiles before you see it in the loss curve.
- Recovery is logistically impossible. At 64 experts locally you might try resetting the router and warm-starting from a checkpoint. At 256 experts across 2000 GPUs, restarting means hours of checkpoint loading and rewinding the dataloader; doing it more than once per training run is unacceptable.
The DeepSeek-V3 evidence
The DeepSeek-V3 technical report shows the auxiliary-loss-free balancer holding norm_entropy above 0.95 and load CoV below 0.15 for the full 14.8T-token pretraining run. Earlier MoE attempts in the open literature — GShard, Switch Transformer, ST-MoE — all required a non-trivial auxiliary balance loss in their training objective. The progression of approaches in sections 6.2 through 6.4 mirrors the historical progression of the field: we knew about collapse for years before we knew the right way to prevent it.
Engineering Reality and Gotchas
Several practical lessons recur across MoE training postmortems:
- Collapse can hide behind a falling loss curve. The surviving experts continue to learn and the loss continues to drop, slowly. Teams without routing diagnostics frequently ship models with 20% dead experts and never know. Always log the four metrics above; doing so is orders of magnitude cheaper than the compute a collapsed run wastes.
- The first 1000 steps are critical. Once an asymmetry establishes itself in the early steps it is very hard to undo. Some training infrastructures apply a router warmup — uniform routing for the first steps, then gradually fade in the learned router — specifically to let every expert collect a baseline gradient signal before the feedback loop activates.
- Random init does not save you. A common misconception is that initialising the router weights uniformly gives every expert a fair start. It does, technically — but as the simulator showed, even a fair start collapses within tens of steps. The asymmetry is generated by the stochasticity of the data stream, not by the init.
- Recoverability is a myth. Once an expert has missed thousands of gradient steps, its parameters are not just stale; they are actively wrong relative to where the rest of the model has moved. Reviving it requires far more compute than preventing its death would have cost.
- Fine-grained MoE (chapter 5) makes the stakes higher. When each expert is small, you can afford more of them — which is why DeepSeek scaled to 256 experts. But more experts means more surface area for collapse, and a dead small expert is still a dead expert. The fine-grained gains of chapter 5 are conditional on solving the collapse problem this chapter addresses.
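The router warmup mentioned in the list above can be sketched as a blend between uniform random scores and the learned router scores. This is one hypothetical schedule, not a description of any particular framework: the names `learned_weight` and `warmup_scores`, the parameters `warmup_steps` and `fade_steps`, and the linear ramp are all assumptions made for illustration.

```python
import random

def learned_weight(step, warmup_steps=1000, fade_steps=1000):
    """Weight on the learned router: 0 during warmup, then a linear ramp to 1."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / fade_steps)

def warmup_scores(learned, step, rng, **schedule):
    """Blend learned per-expert scores with uniform random scores for one token."""
    w = learned_weight(step, **schedule)
    return [w * s + (1.0 - w) * rng.random() for s in learned]

rng = random.Random(0)
learned = [3.0, 0.1, 0.1, 0.1]           # a router that already favours expert 0
for step in (0, 1500, 2000):
    scores = warmup_scores(learned, step, rng)
    winner = max(range(len(scores)), key=scores.__getitem__)
    print(step, round(learned_weight(step), 2), winner)
```

During the warmup window the learned scores are ignored entirely, so every expert receives tokens and gradient regardless of what the router currently prefers. The ramp then hands control back gradually instead of switching the feedback loop on in a single step.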
Carry one sentence forward into the next section: top-$k$ routing without a balancing pressure is a positive-feedback loop, and positive-feedback loops always collapse to a degenerate equilibrium. Everything else in this chapter is engineering around that single dynamical fact.