In the last section we treated the router as a single line of math: score every expert, take the top $k$, softmax the survivors. That equation is true, and dangerously incomplete. The moment you try to actually train an MoE model with it, the router becomes the most fragile component in the entire system. Tokens collide on the same expert; gradients refuse to flow through a discrete argmax; logits drift to infinity; the wrong expert wins for the wrong reason and the model never recovers.
This section opens the router and looks at the engineering that makes sparse routing trainable in practice. By the end you should be able to read the routing code of GShard, Switch Transformer, or DeepSeek-V3 and recognise every line as the answer to a specific failure mode.
The thesis of this section. A router is not a softmax. A router is four hacks stacked on top of a softmax, each one fixing a problem that nearly killed MoE training before someone solved it.
Four Hard Problems Inside the Router
Look at the basic top-$k$ rule from section 1: $g(x) = \mathrm{softmax}(s)$ with router logits $s = W_r x$, restricted to the top $k$ entries. Read it carefully and four problems jump out, none of them obvious from the math:
| Problem | What goes wrong | The fix this section covers |
|---|---|---|
| Non-differentiable top-k | argmax has zero gradient. The router can't learn from which experts it picked. | Route gradients through the gate values, not the indices. |
| Capacity overflow | Some experts get hammered, GPU tensors must be rectangular, excess tokens have no slot. | Capacity factor + token dropping + residual fallback. |
| Premature confidence | An untrained router locks onto a few experts and the others never train (routing collapse). | Learnable Gaussian noise on the logits at training time. |
| Logit drift | Over long training runs router logits grow to ±∞, softmax saturates, gradients vanish. | Router z-loss — a tiny penalty on the softmax log-denominator. |
Each row above is a section below. None of them are theoretical niceties: every production MoE codebase you will read implements every single one, and removing any of them in an ablation will visibly hurt loss in the first 1B training tokens.
How Gradients Cross the Top-k Wall
Backprop needs derivatives. Top-$k$ is a discrete selector: it returns a set of indices, not a smooth function of the router logits $s$. Move any router input by ε and the chosen set either does not change (gradient = 0) or jumps to a new set (gradient = undefined). On paper, the router cannot learn.
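The trick is to notice that top-$k$ returns two things: indices and values. The indices are not differentiable, but the values are. So we route the gradient through the values. Concretely, the MoE output is $y = \sum_{i \in \mathcal{T}} g_i \, E_i(x)$, where $\mathcal{T}$ is the set of chosen experts, and the gate $g_i$ is a smooth function of the router logits $s = W_r x$. Differentiating with respect to a router weight $w$:
$$\frac{\partial y}{\partial w} = \sum_{i \in \mathcal{T}} \frac{\partial g_i}{\partial w} \, E_i(x)$$
The sum runs over the currently chosen experts. If expert $j$ was not picked, its gate $g_j$ is zero and $\partial g_j / \partial w = 0$ too, so the router gets no signal about that expert this step. The router only learns about experts it actually tried.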
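A quick way to see this is a finite-difference check on a toy layer. The sketch below is only an illustration: it assumes scalar expert outputs, and the helper names `topk_gates`, `moe_output` and `finite_diff_grad` are hypothetical, not taken from any library. It nudges each router logit in turn and measures how the output moves.

```python
import math

def topk_gates(logits, k):
    """Top-k selection followed by a softmax over the k survivors."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - m) for i in chosen}
    total = sum(exps.values())
    gates = [0.0] * len(logits)  # unselected experts keep a gate of exactly 0
    for i in chosen:
        gates[i] = exps[i] / total
    return gates

def moe_output(logits, expert_outputs, k):
    """Toy scalar MoE output: y = sum_i g_i * E_i(x)."""
    gates = topk_gates(logits, k)
    return sum(g * e for g, e in zip(gates, expert_outputs))

def finite_diff_grad(logits, expert_outputs, k, eps=1e-5):
    """Central-difference estimate of dy/ds_i for every router logit s_i."""
    grads = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += eps
        down[i] -= eps
        diff = moe_output(up, expert_outputs, k) - moe_output(down, expert_outputs, k)
        grads.append(diff / (2 * eps))
    return grads

logits = [2.0, 1.0, -0.5, 0.3]          # experts 0 and 1 win the top-2
expert_outputs = [1.0, 3.0, 10.0, -4.0]  # made-up scalar expert outputs
grads = finite_diff_grad(logits, expert_outputs, k=2)
print([round(g, 4) for g in grads])
```

The two unselected experts report a gradient of exactly zero, however large their outputs would have been. The two selected experts get equal and opposite gradients, because their gates share a single softmax.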
Why not just use Gumbel-softmax or REINFORCE?
Both have been tried. The Gumbel-softmax relaxation replaces top-$k$ with a soft sample that has a smooth gradient everywhere, but it adds compute (you have to evaluate all experts during training to use the soft weights) and breaks the sparsity gain that motivated MoE in the first place. REINFORCE-style policy gradients work but have huge variance at the scale of billion-token training runs. Top-$k$ + softmax-on-survivors won in practice because it is cheap, sparse, and good enough; the price you pay is needing the other three hacks in this section to make it stable.
Expert Capacity and Dropped Tokens
GPUs do not like ragged tensors. To dispatch tokens to experts in parallel, every expert must receive the same number of tokens — or at least a number known in advance so the dispatch buffer is rectangular. But routing is data-dependent: there is no guarantee that the 1024 tokens in a batch will split evenly across 8 experts. Some experts get 300 tokens, others get 50, and the matmul shapes refuse to cooperate.
The solution is brutal and pragmatic: declare an expert capacity up front and drop anything that overflows. The capacity is $\text{capacity} = C \cdot \frac{N \cdot k}{E}$, where $N$ is the number of tokens in the batch, $k$ is top-k, $E$ is the number of experts, and $C$ is the capacity factor, typically 1.0 to 2.0.
Read the formula slowly. $N \cdot k / E$ is the perfectly balanced share: if routing were perfectly uniform, every expert would receive exactly this many tokens. $C$ is the headroom we grant to absorb imbalance. $C = 1.0$ means no headroom: any imbalance causes drops. $C = 2.0$ means every expert can absorb twice the fair share: almost no drops, but you have allocated 2× the dispatch memory.
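The formula is easy to turn into a helper. This is a minimal sketch: the function names are hypothetical, and rounding the capacity up to a whole number of slots is an assumption, since the text does not say how fractional capacities are handled.

```python
import math

def expert_capacity(num_tokens, top_k, num_experts, capacity_factor):
    """Slots per expert: C * N * k / E, rounded up (the rounding is an assumption)."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def min_empty_fraction(num_tokens, top_k, num_experts, capacity_factor):
    """Share of dispatch slots that stay empty even if no token is dropped."""
    slots = num_experts * expert_capacity(num_tokens, top_k, num_experts, capacity_factor)
    return max(0.0, 1.0 - num_tokens * top_k / slots)

# 1024 tokens, 8 experts, top-1 routing: the batch from the paragraph above.
for c in (1.0, 1.25, 2.0):
    cap = expert_capacity(1024, 1, 8, c)
    empty = min_empty_fraction(1024, 1, 8, c)
    print(f"C={c}: capacity={cap} tokens per expert, at least {empty:.0%} of slots empty")
```

The second helper makes the cost of headroom explicit: whatever the router does, a capacity factor of 2.0 leaves at least half of the dispatch buffer unused.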
The capacity dial in real systems
| System | Capacity factor C | Notes |
|---|---|---|
| Switch Transformer (Fedus 2021) | 1.0 | Aggressive: 'drop or die'. Required strong balance loss. |
| GShard (Lepikhin 2020) | 1.25 | Standard default. Small headroom, modest balance loss. |
| DeepSeek-V2 / V3 | no fixed cap; uses device-level cap | Bias-term load balancing makes drops rare; see ch.6. |
| Mixtral 8×7B (inference) | implicit, k=2 only | Inference does not need fixed shapes — drops disappear. |
Notice the trend: newer systems push the capacity factor down or remove it entirely, relying instead on better load-balancing mechanisms to keep routing roughly uniform. The capacity factor is a crutch; the next chapter is about throwing the crutch away.
Manual Numerical Walkthrough
Let us trace a tiny example end-to-end. 6 tokens, 3 experts, top-1 routing, capacity factor $C = 1.0$. We will compute logits, pick experts, hit overflow, and watch a token get dropped.
6 tokens, 3 experts, k = 1, by hand
Setup. Capacity $= C \cdot N \cdot k / E = 1.0 \cdot 6 \cdot 1 / 3 = 2$. Each expert may hold at most 2 tokens. We have:
Router logits per token (rows = tokens, cols = experts):
| Token | E1 | E2 | E3 | argmax |
|---|---|---|---|---|
| t0 | 2.1 | 0.4 | 0.7 | E1 |
| t1 | 1.8 | 0.6 | 0.2 | E1 |
| t2 | 2.4 | 0.9 | 0.5 | E1 ← problem |
| t3 | 0.1 | 1.9 | 0.5 | E2 |
| t4 | 0.3 | 0.4 | 2.2 | E3 |
| t5 | 0.6 | 2.0 | 0.9 | E2 |
Greedy assignment (batch order).
- t0 → E1 (E1 count 1/2) ✓
- t1 → E1 (E1 count 2/2) ✓ — E1 is now full
- t2 → E1 wanted, but E1 is full → DROPPED
- t3 → E2 (E2 count 1/2) ✓
- t4 → E3 (E3 count 1/2) ✓
- t5 → E2 (E2 count 2/2) ✓
Final counts: E1 = 2, E2 = 2, E3 = 1. Drop rate = 1/6 ≈ 17%. Notice that E3 has a free slot — token t2 could have been routed there with negligible loss, but the deterministic top-1 rule never considered E3 for t2.
What happens to t2 in the forward pass? Its gate is multiplied by the assignment mask (0), so its MoE output is zero. The residual connection still passes the token through, so the next transformer block sees the unmodified hidden state. The model is not broken — it just lost one MoE block of representation work for this token.
What happens during backprop? Because t2 has $g = 0$, the gradient through that gate is also zero. The router gets no signal from this token. This is one reason drops hurt: they consume training data but produce no gradient for the router.
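The hand trace can be replayed in a few lines. The sketch below uses the six logit rows from the walkthrough and a capacity of 2; the function name `greedy_top1_dispatch` is hypothetical, and processing tokens in batch order is the same assumption the walkthrough makes.

```python
def greedy_top1_dispatch(logits, capacity):
    """Send each token to its argmax expert in batch order; drop it on overflow."""
    num_experts = len(logits[0])
    counts = [0] * num_experts
    assignment = []  # expert index per token, or None if the token was dropped
    for row in logits:
        expert = max(range(num_experts), key=lambda e: row[e])
        if counts[expert] < capacity:
            counts[expert] += 1
            assignment.append(expert)
        else:
            assignment.append(None)
    return assignment, counts

logits = [
    [2.1, 0.4, 0.7],  # t0
    [1.8, 0.6, 0.2],  # t1
    [2.4, 0.9, 0.5],  # t2
    [0.1, 1.9, 0.5],  # t3
    [0.3, 0.4, 2.2],  # t4
    [0.6, 2.0, 0.9],  # t5
]
assignment, counts = greedy_top1_dispatch(logits, capacity=2)
dropped = [t for t, e in enumerate(assignment) if e is None]
print("assignment:", assignment)
print("counts:", counts)
print("dropped tokens:", dropped)
```

Experts are zero-indexed here, so E1 in the walkthrough is expert 0. The only dropped token is t2, and the third expert is left with one free slot.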
Visualising Capacity Overflow
The widget below makes the trade-off tangible. Slide the capacity factor down and watch tokens get dropped (rose-red, struck through); slide it up and watch GPU utilisation collapse as unpopular experts sit half-empty. Hit New batch to reroll the router scores and see how the dynamics shift when the popularity skew changes.
Two patterns are worth pausing on. First, at C = 1.0 you almost always drop something — the perfectly balanced assignment is a knife-edge that random batches never hit. Second, dropping is bursty: it concentrates on whichever expert the batch happens to over-favour, which is exactly the signature of an unbalanced router. The whole point of the next chapter (auxiliary-loss-free balancing) is to make the popularity bars all the same height so this widget never has to drop anything.
Noisy Top-k: Why Routers Need Randomness
Imagine you start training with a freshly initialised router. The logits are tiny, near-random. A token $x$ sees, say, logits that differ only in the second decimal place. Expert 1 wins by a hair. Now backprop: expert 1 produced something useful, so its gate goes up, its logit gets a small positive push. Next forward pass with the same token, the gap widens slightly. Within a few thousand steps every token is routed to expert 1. The other experts never see traffic, never receive gradient, and the model is dead.
This is routing collapse in its purest form, and it is the reason naive top-$k$ does not work. The fix from Shazeer et al. (2017) is to add learnable Gaussian noise to the logits before the top-$k$ selection:
$$\tilde{s}_i = (W_r x)_i + \epsilon_i \cdot \mathrm{softplus}\big((W_{\text{noise}} x)_i\big), \qquad \epsilon_i \sim \mathcal{N}(0, 1)$$
Read each piece. $(W_r x)_i$ is the usual router logit. $W_{\text{noise}}$ is a second linear layer that produces a per-(token, expert) noise scale; softplus keeps it positive. $\epsilon_i$ is fresh Gaussian noise on every forward pass. The two terms are added: the deterministic signal and the noisy perturbation.
Two experts whose deterministic logits differ by 0.01 might routinely swap rank under noise of std 0.3. Two experts whose deterministic logits differ by 3.0 will almost never swap. The router learns to turn the noise scale down once it is confident, and to leave it high in regions where the right expert is genuinely ambiguous. The result: every expert sees traffic for long enough to train, but the router still ends up deterministic once it knows what it is doing.
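The swap behaviour is easy to measure by simulation. This is a minimal sketch with two experts and a fixed noise scale; `noisy_logits` follows the noisy top-k rule (clean logit plus Gaussian noise scaled by a softplus), while `swap_rate`, the trial count and the seed are assumptions made for the demo.

```python
import math
import random

def softplus(v):
    """Numerically stable log(1 + exp(v)); always positive."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def noisy_logits(clean, noise_pre, rng):
    """clean[i] + eps_i * softplus(noise_pre[i]) with eps_i ~ N(0, 1)."""
    return [s + rng.gauss(0.0, 1.0) * softplus(n) for s, n in zip(clean, noise_pre)]

def swap_rate(gap, noise_std, trials=10_000, seed=0):
    """Fraction of trials in which the weaker of two experts wins under noise."""
    rng = random.Random(seed)
    # Invert softplus so that the effective noise scale equals noise_std.
    pre = math.log(math.expm1(noise_std))
    swaps = 0
    for _ in range(trials):
        strong, weak = noisy_logits([gap, 0.0], [pre, pre], rng)
        swaps += weak > strong
    return swaps / trials

print("gap 0.01:", swap_rate(0.01, 0.3))
print("gap 3.00:", swap_rate(3.0, 0.3))
```

With a gap of 0.01 the two experts trade places in roughly half of the trials, so both keep receiving tokens. With a gap of 3.0 the noise is far too small to change the outcome.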
Switch Transformer's alternative: jitter
Switch Transformer (Fedus 2021) dropped the noise network and replaced it with input-multiplicative jitter: $x \leftarrow x \odot u$ with $u \sim \mathrm{Uniform}(1 - \epsilon, 1 + \epsilon)$, applied to the router input during training. Same effect at lower cost. DeepSeek-V3 dropped both: it relies on its bias-term load balancer (Chapter 6) to do the exploration work, showing that with the right balance signal you can route deterministically without collapsing.
Sigmoid vs Softmax Gates (DeepSeek-V3)
Every routing recipe up to GShard used softmax over experts: $g_i = e^{s_i} / \sum_j e^{s_j}$. The softmax is the natural choice when you want a probability distribution over experts. But it has a subtle problem at large $E$: the gradient through softmax is a normalised quantity. Boosting one expert's logit necessarily suppresses the others. The router has to fight against the normalisation to learn fine-grained preferences.
DeepSeek-V3 broke from tradition and used a sigmoid gate per expert: $g_i = \sigma(s_i)$. Each expert is gated independently. The gates do not sum to 1; they are then top-$k$-selected and re-normalised (divided by their sum) just before being applied. The math at output time looks similar; the math at gradient time is very different: every expert's logit can move independently, which empirically makes the router faster to converge with 64+ experts.
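The coupling difference shows up as soon as one logit moves. The sketch below is illustrative only: the four logits are made up, and `topk_renormalised` is a hypothetical helper for the select-then-divide step described above. It raises expert 0's logit and watches what happens to expert 1's score under each gate.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(s - m) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def topk_renormalised(scores, k):
    """Keep the k largest scores and divide them by their sum."""
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

logits = [1.2, 0.8, -0.3, 0.1]
bumped = [3.2, 0.8, -0.3, 0.1]  # only expert 0's logit changes

soft_before, soft_after = softmax(logits), softmax(bumped)
sig_before = [sigmoid(s) for s in logits]
sig_after = [sigmoid(s) for s in bumped]

print("softmax score of expert 1:", round(soft_before[1], 3), "->", round(soft_after[1], 3))
print("sigmoid score of expert 1:", round(sig_before[1], 3), "->", round(sig_after[1], 3))
print("final gates (sigmoid, k=2):", topk_renormalised(sig_after, k=2))
```

Under softmax, expert 1's score falls even though its own logit never changed; under sigmoid it stays put. The independence applies to the scores before selection: once the top-$k$ gates are divided by their sum, the applied weights add up to 1 again.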
| Property | Softmax gate | Sigmoid gate |
|---|---|---|
| Sum over experts before top-k | = 1 | free |
| Per-expert gradient coupling | high (zero-sum) | low (independent) |
| Behaviour at large E | diffuse, slow to sharpen | sharper signal per expert |
| Use in DeepSeek-V3 (E = 256) | — | default |
| Use in Mixtral (E = 8) | default | — |
Router Z-Loss: Keeping Logits Sane
Train an MoE model for long enough and the router logits do something unpleasant: they drift upward in absolute magnitude. Logit values that started at 0.3 are now 30. The softmax saturates. $e^{s}$ overflows in fp16. Even in fp32 the gradient through softmax becomes vanishingly small in the unselected entries, and the router stops learning.
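ST-MoE (Zoph 2022) introduced the z-loss to suppress this drift. The idea: the log-denominator of the softmax, $\log \sum_j e^{s_j}$, equals zero only when $\sum_j e^{s_j} = 1$. Penalise its square and you are penalising the router for using unnecessarily large logit magnitudes:
$$\mathcal{L}_z = \lambda_z \cdot \frac{1}{N} \sum_{n=1}^{N} \Big( \log \sum_{j=1}^{E} e^{s_{n,j}} \Big)^2$$
$\lambda_z$ is tiny, typically $10^{-3}$. The loss is added to the main cross-entropy. Two beautiful properties: (1) softmax ignores a constant added to all logits, so nothing in the routing decision pins down their overall level; the z-loss does depend on that level, since $\log \sum_j e^{s_j + c} = c + \log \sum_j e^{s_j}$, and so supplies the missing anchor without dictating which expert wins. (2) Its gradient is small when the log-denominator is near zero, so it does not perturb a healthy router.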
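A few lines of plain Python show what the penalty reacts to. This is a minimal sketch: the coefficient default of 1e-3 and the tiny logit matrices are assumptions for the demo, and `router_z_loss` is a hypothetical name.

```python
import math

def logsumexp(row):
    """Stable log(sum_j exp(s_j)), the softmax log-denominator."""
    m = max(row)
    return m + math.log(sum(math.exp(s - m) for s in row))

def softmax(row):
    lse = logsumexp(row)
    return [math.exp(s - lse) for s in row]

def router_z_loss(logits, coeff=1e-3):
    """coeff * mean over tokens of (log-sum-exp of the token's logits) squared."""
    return coeff * sum(logsumexp(row) ** 2 for row in logits) / len(logits)

small = [[0.3, -0.2, 0.1], [0.0, 0.4, -0.1]]          # healthy router
large = [[s * 100 for s in row] for row in small]    # drifted logits
shifted = [[s + 5.0 for s in row] for row in small]  # same softmax, higher level

print("z-loss, small logits  :", router_z_loss(small))
print("z-loss, large logits  :", router_z_loss(large))
print("z-loss, shifted logits:", router_z_loss(shifted))
```

The shifted logits produce exactly the same softmax as the small ones, so the cross-entropy cannot tell them apart, yet the z-loss is much larger for them. That is the drift the penalty is there to catch.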
In code, the z-loss is simply added to the task loss before calling .backward(). The PyTorch snippet below does exactly this.
Plain Python: Noisy Top-k With Capacity
Here is the entire routing mechanism — noisy top-k, gate softmax, capacity-respecting assignment — in pure NumPy, small enough to read in one sitting:
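As a minimal sketch of that mechanism, the version below uses only the Python standard library rather than NumPy, so that it runs without dependencies. The batch of 16 tokens, 4 experts, top-1 routing and capacity factor 1.25 are assumptions, as are the function name `route`, the random logits and the seed. The exact counts it prints depend on that seed.

```python
import math
import random

def softplus(v):
    """Numerically stable log(1 + exp(v)); always positive."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def route(logits, noise_pre, k, capacity_factor, rng=None):
    """Noisy top-k routing with a hard per-expert capacity.

    logits, noise_pre: N x E nested lists. rng=None disables the noise.
    Returns (capacity, counts, gates, dropped); gates[t] maps expert index
    to gate value for every (token, expert) pair that obtained a slot.
    """
    n_tokens, n_experts = len(logits), len(logits[0])
    capacity = math.ceil(capacity_factor * n_tokens * k / n_experts)
    counts = [0] * n_experts
    gates, dropped = [], 0
    for s_row, n_row in zip(logits, noise_pre):
        if rng is None:
            noisy = list(s_row)
        else:
            noisy = [s + rng.gauss(0.0, 1.0) * softplus(n) for s, n in zip(s_row, n_row)]
        # Top-k on the noisy logits, then a softmax over the k survivors.
        chosen = sorted(range(n_experts), key=lambda e: noisy[e], reverse=True)[:k]
        m = max(noisy[e] for e in chosen)
        exps = {e: math.exp(noisy[e] - m) for e in chosen}
        z = sum(exps.values())
        token_gates = {}
        for e in chosen:
            if counts[e] < capacity:
                counts[e] += 1
                token_gates[e] = exps[e] / z
            else:
                dropped += 1  # overflow: this (token, expert) pair gets gate 0
        gates.append(token_gates)
    return capacity, counts, gates, dropped

rng = random.Random(0)
N, E, K = 16, 4, 1
logits = [[rng.gauss(0.0, 1.0) for _ in range(E)] for _ in range(N)]
noise_pre = [[0.0] * E for _ in range(N)]  # softplus(0) = ln 2, about 0.69
capacity, counts, gates, dropped = route(logits, noise_pre, K, 1.25, rng)
print(f"capacity={capacity}, counts={counts}, dropped={dropped}")
```

Tokens are served in batch order, so which token loses its slot depends on position in the batch, not on how strongly it preferred the expert.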
Running this script prints something like capacity=5, counts=[5, 4, 2, 4], dropped=1. One token wanted expert 1 but the bucket was full; with $C = 1.25$ it gets dropped. Bump capacity_factor to 2.0 and the drop disappears, at the cost of 50% of the dispatch slots being wasted on air.
PyTorch: A Production-Style Router
The PyTorch version below adds the two pieces a serious training run needs: a z-loss returned as a side output, and a self.training-gated noise so that inference is fully deterministic. The interface is intentionally narrow: forward(x_flat) returns everything the dispatch layer and the loss layer need.
Slot this Router into the MoEFFN from section 1 by replacing the inline router lines with topk_idx, gates, counts, z_loss = self.router(x_flat), and you have the skeleton of a real production MoE layer. The training loop adds z_loss to the cross-entropy and stores counts for the load-balance auxiliary loss in the next chapter.
The one missing piece. The greedy assignment loop is O(N · k) Python: fine for a teaching toy, ruinous on a GPU. Real implementations replace it with a parallel exclusive-cumulative-sum trick that runs in a single CUDA kernel. The math is identical, the cost is invisible. See Megatron-LM's moe directory for a reference implementation.
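The cumulative-sum idea can be sketched in plain Python to show why it matches the greedy loop. This is an emulation under assumptions, not Megatron-LM's code: the function names are hypothetical, routing is top-1, and each per-expert cumulative sum stands in for what would be a parallel scan on a GPU.

```python
from itertools import accumulate

def positions_via_cumsum(expert_idx, num_experts):
    """Slot index of each token inside its expert's buffer, with no running counters.

    Per expert: one-hot column -> exclusive cumulative sum -> read the value
    at the rows of the tokens that chose this expert.
    """
    positions = [0] * len(expert_idx)
    for e in range(num_experts):
        one_hot = [1 if idx == e else 0 for idx in expert_idx]
        inclusive = list(accumulate(one_hot))
        exclusive = [inc - hot for inc, hot in zip(inclusive, one_hot)]
        for t, hot in enumerate(one_hot):
            if hot:
                positions[t] = exclusive[t]
    return positions

def keep_mask(expert_idx, num_experts, capacity):
    """A token keeps its slot if and only if its position is below the capacity."""
    return [p < capacity for p in positions_via_cumsum(expert_idx, num_experts)]

expert_idx = [0, 0, 0, 1, 2, 1]  # top-1 choices from the 6-token walkthrough
print(positions_via_cumsum(expert_idx, 3))
print(keep_mask(expert_idx, 3, capacity=2))
```

Each token's position is just the number of earlier tokens that picked the same expert, which is exactly what the greedy loop's counter held when it reached that token. Comparing the position with the capacity reproduces the drop of the third token without any sequential state.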
What Changes at Massive Scale
At toy scale the router is the cheap part. At trillion-parameter scale the router's outputs determine the cost of the most expensive operation in distributed training: the all-to-all token shuffle. Every token chosen for an expert that lives on a different GPU has to be sent over NVLink or InfiniBand. The router's decisions, made one batch at a time, directly set the network volume.
| Routing decision | Direct effect on cost | Engineering knob |
|---|---|---|
| How balanced are the expert counts? | Sets the size of the slowest expert's bucket → straggler latency. | Auxiliary loss or bias-term balancing (ch. 6). |
| What capacity factor C? | Sets all-to-all buffer size: cost = C · N · k · D per device. | Tune per cluster; smaller for fast networks, larger for slow. |
| Where do the chosen experts live? | Local experts = free; remote experts = a network hop. | Topology-aware expert placement (ch. 5, section 5). |
| Are routing decisions stable across steps? | Stable routing improves cache locality and reduces kernel re-tuning. | Lower noise late in training; z-loss to suppress drift. |
DeepSeek-V3 is the cleanest illustration of how much of the router's design is dictated by scale. With 256 experts distributed across hundreds of GPUs, the cost difference between a well-balanced router and a poorly balanced one is enormous: not in FLOPs (those are fixed by the top-$k$ budget, since every token activates the same number of experts) but in network seconds, which dominate the step time. That is why DeepSeek invested in bias-term load balancing instead of relying on auxiliary losses: the balance has to be near-perfect, and the gradient interference of a balance loss was hurting quality.
The Engineering Reality of Routing
Three patterns recur in every MoE codebase that actually runs at scale, and they are worth knowing before you try to implement one:
- The router is the first thing to monitor. Track per-expert token counts every step; track router logit magnitude every 100 steps; track drop rate as a single scalar in your training dashboard. The first sign of trouble in an MoE run is almost always visible in routing statistics 1000+ steps before the loss reflects it.
- Routing is fp32 even in mixed-precision training. The router's softmax is one of the few places where fp16 overflow has historically been a real bug; everyone — DeepSeek, GShard, Switch — promotes the router computation to fp32 even when the experts run in bf16 or fp8. The extra cost is rounding error in the total step time.
- Inference uses a different router than training. At serve time you disable noise, disable z-loss (irrelevant), disable capacity dropping (no fixed shape needed), and may switch from top-k to a tighter top-1 to reduce latency. The router class should take a training: bool flag and behave differently in each mode.
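The monitoring advice in the first bullet amounts to a handful of scalars per step. A minimal sketch follows, assuming top-1 routing and a hypothetical `routing_stats` helper; the metric names are illustrative and not taken from any particular codebase. The inputs are the six tokens from the manual walkthrough.

```python
def routing_stats(logits, expert_idx, kept):
    """One step of routing statistics for a training dashboard.

    logits: N x E router logits; expert_idx: top-1 expert requested per token;
    kept: one bool per token, False if the capacity limit dropped it.
    """
    num_experts = len(logits[0])
    requested = [0] * num_experts
    for e in expert_idx:
        requested[e] += 1
    fair_share = len(expert_idx) / num_experts
    return {
        "requested_per_expert": requested,                  # per-expert token counts
        "max_over_fair_share": max(requested) / fair_share,  # 1.0 means perfectly balanced
        "drop_rate": 1.0 - sum(kept) / len(kept),
        "max_abs_logit": max(abs(s) for row in logits for s in row),
    }

logits = [
    [2.1, 0.4, 0.7],
    [1.8, 0.6, 0.2],
    [2.4, 0.9, 0.5],
    [0.1, 1.9, 0.5],
    [0.3, 0.4, 2.2],
    [0.6, 2.0, 0.9],
]
expert_idx = [0, 0, 0, 1, 2, 1]
kept = [True, True, False, True, True, True]
stats = routing_stats(logits, expert_idx, kept)
for name, value in stats.items():
    print(name, value)
```

Counting requested tokens rather than kept tokens is a deliberate choice in this sketch: the capacity limit clips the kept counts, which hides exactly the imbalance you want to see.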
The one sentence to carry forward: a router is a softmax wrapped in four corrections — gradient routing through values, capacity with drop, noise for exploration, z-loss for stability — and removing any one of them breaks MoE training. Every line of a production router you ever read is one of those four corrections in code.