The previous section left us with a system on the edge of collapse: every MoE layer wants to route tokens to a handful of experts, and the unrouted experts learn nothing, contribute nothing, and waste their share of the cluster. For four years between GShard (2020) and DeepSeek V3 (2024), every MoE in production reached for the same fix — an extra term in the loss function whose job was to spread tokens across experts. This section is the careful, quantitative story of that fix: why it works, why it is universal, and why every team that scaled it eventually noticed the same uncomfortable side effect — the model trained on a joint objective is measurably worse at the task it was actually built to do.
The bargain. Auxiliary balancing loss buys MoE feasibility — without it, expert parallelism degenerates into one busy GPU and seven idle ones. The bill comes back as a quality tax that compounds across 58 MoE layers and 14.8 trillion training tokens. That tax, and the search for a way out of it, is what makes Chapter 6 exist.
The Problem Auxiliary Loss Was Invented to Solve
Section 6.1 showed routing collapse as a dynamical instability: a slightly favored expert receives more tokens, its parameters improve faster, it gets more favored, and a positive feedback loop locks the router into a degenerate distribution. The pure language-modeling loss has no reason to prevent this. Cross-entropy only cares whether the next token is predicted; if expert 17 can predict the next token alone, the loss is happy and the other 255 experts can sit idle for the rest of training.
From the data-flow side, the consequences are worse than just wasted parameters. Recall from §5.5 that every MoE layer dispatches tokens to their chosen experts via an all-to-all collective with a fixed capacity per expert (the receive buffer size). Tokens that overflow a popular expert's buffer get dropped — they skip the MoE block entirely. So a router collapse causes two simultaneous failures: a quality failure (most experts contribute nothing) and a throughput failure (popular experts overflow and drop their excess). The system is not just suboptimal; it actively destroys gradient information on the dropped tokens.
Intuition: A Penalty That Punishes Popularity
Picture a school with eight cafeteria lines, and a thousand students with strong opinions about which line they prefer. Left alone, they all pile into the same two lines: lines 1 and 3 overflow while the other six stand empty. The cafeteria manager has a choice. She could force students into specific lines by assignment, but that ignores their preferences and produces grumpy diners. Or she could put a small penalty on the popular lines: a one-second wait per extra student in front of you. Now student preference still drives most of the decision, but at the margin, when two lines are nearly equal, the lightly used line wins.
That penalty is what an auxiliary loss does. The router still picks based on token content. But the loss adds a small, differentiable nudge: the more concentrated your routing distribution, the higher the penalty. At equilibrium, the router learns to break content ties in favor of underutilized experts. The popular lines stay popular; the empty lines start to fill up.
The price the cafeteria manager pays is that a student who really wanted line 1 sometimes ends up in line 4. The price the MoE layer pays is that a token whose content genuinely best matches expert 17 sometimes lands at expert 42. We will quantify that price below — but the intuition to anchor is: auxiliary loss does not steer the router; it perturbs it. And every perturbation is, by definition, a deviation from what the task gradient was asking for.
The Math: GShard and Switch Auxiliary Loss
Let $T$ be the number of tokens in the current batch and $N$ the number of experts. The router produces a probability matrix $p \in \mathbb{R}^{T \times N}$, where $p_{t,i}$ is the softmax probability that token $t$ is routed to expert $i$. From this matrix we extract two summary statistics per expert.
First, the average routing probability assigned to expert $i$ across the batch: $P_i = \frac{1}{T} \sum_{t=1}^{T} p_{t,i}$. This is differentiable with respect to the router weights, since it is just an average of softmax outputs.
Second, the actual fraction of tokens whose top-1 pick was expert $i$: $f_i = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left[\arg\max_j p_{t,j} = i\right]$. This is not differentiable (argmax has zero gradient almost everywhere), but it serves as a faithful magnitude: if expert $i$ grabbed half the batch, then $f_i = 0.5$ regardless of how the softmax was shaped.
The GShard auxiliary loss combines them: $\mathcal{L}_{\text{aux}} = N \sum_{i} f_i P_i$. The factor $N$ is the rescaling that pins the balanced value at 1.0: at perfectly uniform routing, $f_i = P_i = 1/N$, so $\mathcal{L}_{\text{aux}} = N \cdot N \cdot \frac{1}{N^2} = 1$. Concentrating tokens on fewer experts pushes the value above 1, and the quadratic-like coupling between $f_i$ and $P_i$ means concentrated routing is penalized super-linearly.
The full training objective is $\mathcal{L} = \mathcal{L}_{\text{LM}} + \alpha \, \mathcal{L}_{\text{aux}}$, where $\alpha$ is a tuned hyperparameter; Switch Transformer used $\alpha = 10^{-2}$, and GShard reported similar values. The gradient that reaches the router weights is the sum of the gradient from the language modeling loss and a small gradient from $\alpha \, \mathcal{L}_{\text{aux}}$ that pushes $P_i$ downward for popular experts and upward for unpopular ones.
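To see where that push comes from, it may help to differentiate the auxiliary term directly. The sketch below assumes the form $\mathcal{L}_{\text{aux}} = N \sum_i f_i P_i$ with $f_i$ treated as a constant (it comes from an argmax), and writes $z_{t,j}$ for the router logit of token $t$ and expert $j$, a symbol introduced here purely for illustration.

```latex
\begin{align*}
\frac{\partial \mathcal{L}_{\text{aux}}}{\partial P_i} &= N f_i,
\qquad
\frac{\partial \mathcal{L}_{\text{aux}}}{\partial p_{t,i}} = \frac{N}{T} f_i, \\
\frac{\partial \mathcal{L}_{\text{aux}}}{\partial z_{t,j}}
  &= \sum_i \frac{N}{T} f_i \, p_{t,i} \, (\delta_{ij} - p_{t,j})
   = \frac{N}{T} \, p_{t,j} \Big( f_j - \sum_i f_i \, p_{t,i} \Big).
\end{align*}
```

Under gradient descent, the logit $z_{t,j}$ is lowered when expert $j$'s load $f_j$ exceeds the token's probability-weighted average load $\sum_i f_i \, p_{t,i}$, and raised otherwise. The hard counts $f_i$ act only as weights; all of the gradient flows through the softmax probabilities.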
The Switch Transformer variation
Switch Transformer published essentially the same formula but with a slightly different scaling argument and an emphasis on top-1 routing only. The mathematical content is identical: $\mathcal{L}_{\text{aux}} = N \sum_{i} f_i P_i$, weighted by $\alpha$ in the total objective. Where Switch and GShard genuinely differ is in their treatment of the capacity factor (Switch uses 1.0 with token-dropping, GShard allowed higher capacity with re-routing), but as a balancing mechanism the loss is one thing, not two.
Manual Numerical Walkthrough
Let's compute $\mathcal{L}_{\text{aux}}$ for one tiny batch by hand. The numbers are small enough to follow on paper.
Worked example: GShard aux loss on 8 tokens / 4 experts
Setup. Eight tokens, four experts, top-1 routing. The router has produced the following probability matrix (rows sum to 1, columns are experts E0..E3):
| token | E0 | E1 | E2 | E3 |
|---|---|---|---|---|
| t0 | 0.60 | 0.10 | 0.20 | 0.10 |
| t1 | 0.55 | 0.05 | 0.30 | 0.10 |
| t2 | 0.15 | 0.10 | 0.65 | 0.10 |
| t3 | 0.50 | 0.10 | 0.30 | 0.10 |
| t4 | 0.20 | 0.10 | 0.60 | 0.10 |
| t5 | 0.45 | 0.10 | 0.35 | 0.10 |
| t6 | 0.10 | 0.10 | 0.70 | 0.10 |
| t7 | 0.10 | 0.60 | 0.20 | 0.10 |
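Step 1: column averages $P_i$. Add each column and divide by 8.
- $P_0 = 2.65 / 8 = 0.33125$
- $P_1 = 1.25 / 8 = 0.15625$
- $P_2 = 3.30 / 8 = 0.41250$
- $P_3 = 0.80 / 8 = 0.10000$
Sanity: $0.33125 + 0.15625 + 0.41250 + 0.10000 = 1.0$. Good. Expert 2 is the most popular by soft mass; expert 3 the least.
Step 2: top-1 fractions $f_i$. Take the argmax of each row.
- Top-1 picks: t0→E0, t1→E0, t2→E2, t3→E0, t4→E2, t5→E0, t6→E2, t7→E1.
- $f_0 = 4/8 = 0.500$ (four tokens picked E0)
- $f_1 = 1/8 = 0.125$
- $f_2 = 3/8 = 0.375$
- $f_3 = 0/8 = 0$ (E3 was never anyone's favorite)
Step 3: assemble the loss. $\mathcal{L}_{\text{aux}} = N \sum_i f_i P_i$ with $N = 4$.
- $f_0 P_0 = 0.500 \times 0.33125 = 0.165625$
- $f_1 P_1 = 0.125 \times 0.15625 \approx 0.019531$
- $f_2 P_2 = 0.375 \times 0.41250 \approx 0.154688$
- $f_3 P_3 = 0$
- Sum: $\sum_i f_i P_i \approx 0.339844$, so $\mathcal{L}_{\text{aux}} = 4 \times 0.339844 \approx 1.359$
The result, $\mathcal{L}_{\text{aux}} \approx 1.359$, is above the balanced value of 1.0, confirming this batch is imbalanced. With $\alpha = 0.01$, this aux loss contributes $\approx 0.0136$ to the total loss. Tiny in absolute terms, but its gradient is specifically targeting the router weights: a small absolute number applied to a tiny subset of parameters can be a relatively large update on those parameters.
The takeaway. The value 1.0 is reached at perfect balance, and the loss grows in two ways at once: through $f_i$ (whose imbalance is binary at the top-1 level) and through $P_i$ (whose imbalance is continuous). When both align, popular by soft mass and by argmax, the loss penalizes the router heavily. That double-pressure is the whole point.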
Visualizing the Tension
Slide the control below. At $\alpha = 0$ the router freely follows content and the routing distribution is wildly imbalanced; that is the collapse from §6.1. As $\alpha$ grows, the bars equalize toward the uniform marker, but watch the red curve on the right: the quality penalty rises with every step of balance you buy.
Two things to notice. First, the relationship is not linear: small $\alpha$ buys a lot of balance for almost no quality cost (the blue curve drops fast while the red curve stays near zero), but past a certain $\alpha$ the quality penalty accelerates while the marginal balance improvement flattens out. This concavity is what makes auxiliary loss useful in practice: there is a sweet spot. Second, the two curves never both reach the floor. No setting of $\alpha$ gives perfect balance and zero quality penalty simultaneously. The whole reason §6.3's bias-term approach exists is to find a mechanism that can deliver both.
Capacity Factor: Why Balance Becomes Mandatory
The visualizer above measures balance and quality as if the cluster had infinite capacity. Real systems do not. Each expert's receive buffer is sized at compile time as $C = c \cdot T / N$, where $c$ is the capacity factor (typically between 1.0 and 1.5). Tokens whose expert is already full are dropped: they bypass the MoE block entirely. The 16-token capacity simulator below shows the consequence directly.
Slide capacity down. Notice how, even with the same router decisions, the number of dropped tokens climbs immediately as soon as one expert's slot count is exceeded. Slide capacity up. Drops vanish, but most of the slot grid is wasted — GPU utilization tanks. The balancing loss is what keeps you off both horns of this trade-off: at balanced routing, every expert's bucket is roughly the same size, so a capacity factor of 1.0 is enough to absorb the load without dropping. Without balancing, no realistic capacity factor (short of replicating every expert) can keep up.
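The same trade-off can be reproduced offline with a few lines of plain Python. This is a minimal sketch, not any framework's dispatch logic: the function name `dispatch_with_capacity`, the floor-rounding of the buffer size, and the first-come-first-served drop order are all assumptions made for illustration, and the two 16-token routing patterns are invented examples.

```python
from collections import Counter

def dispatch_with_capacity(expert_choices, num_experts, capacity_factor):
    """Simulate top-1 dispatch into fixed-size expert buffers.

    Returns (dropped_tokens, slot_utilization).
    Assumption: capacity = floor(capacity_factor * tokens / experts), at least 1.
    """
    num_tokens = len(expert_choices)
    capacity = max(1, int(capacity_factor * num_tokens / num_experts))
    kept = Counter()
    dropped = 0
    for expert in expert_choices:       # tokens are served in batch order (assumed)
        if kept[expert] < capacity:
            kept[expert] += 1           # token fits in the expert's buffer
        else:
            dropped += 1                # buffer full: token skips the MoE block
    utilization = sum(kept.values()) / (capacity * num_experts)
    return dropped, utilization

# Two hypothetical routing patterns over 16 tokens and 4 experts.
imbalanced = [0] * 9 + [2] * 5 + [1] * 2   # E0 gets 9, E1 gets 2, E2 gets 5, E3 gets 0
balanced = [0, 1, 2, 3] * 4                # every expert gets exactly 4

for cf in (1.0, 1.5, 2.5):
    d_imb, u_imb = dispatch_with_capacity(imbalanced, 4, cf)
    d_bal, u_bal = dispatch_with_capacity(balanced, 4, cf)
    print(f"cf={cf}: imbalanced dropped={d_imb} util={u_imb:.2f} | "
          f"balanced dropped={d_bal} util={u_bal:.2f}")
```

With the imbalanced pattern, drops only disappear once the capacity factor is large enough that most slots sit empty; with the balanced pattern, a capacity factor of 1.0 already drops nothing and fills every slot.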
Plain Python: The GShard Loss From Scratch
Before wiring this into a training loop, let's implement the loss with nothing but NumPy. Every step you read should match the hand-walkthrough above, with the same numbers landing in the same places.
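What follows is a minimal NumPy sketch of that computation, assuming the loss has the form $N \sum_i f_i P_i$ and using the 8-token, 4-expert probability table from the walkthrough. The function names `softmax` and `gshard_aux_loss` are illustrative choices, and the logits are simply taken as the log of the tabulated probabilities so that the softmax step recovers the table.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def gshard_aux_loss(logits):
    """Return (loss, P, f) for router logits of shape (tokens, experts)."""
    probs = softmax(logits)
    num_experts = probs.shape[1]
    P = probs.mean(axis=0)                    # soft mass per expert (carries gradient in a framework)
    top1 = probs.argmax(axis=1)               # hard top-1 pick per token
    f = np.eye(num_experts)[top1].mean(axis=0)  # one-hot, then fraction of tokens per expert
    loss = num_experts * float(np.dot(f, P))
    return loss, P, f

# Probability table from the hand walkthrough (rows t0..t7, columns E0..E3).
table = np.array([
    [0.60, 0.10, 0.20, 0.10],
    [0.55, 0.05, 0.30, 0.10],
    [0.15, 0.10, 0.65, 0.10],
    [0.50, 0.10, 0.30, 0.10],
    [0.20, 0.10, 0.60, 0.10],
    [0.45, 0.10, 0.35, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.60, 0.20, 0.10],
])
loss, P, f = gshard_aux_loss(np.log(table))
print("P    =", np.round(P, 5))    # approx. 0.33125, 0.15625, 0.4125, 0.1
print("f    =", f)                 # 0.5, 0.125, 0.375, 0.0
print("loss =", round(loss, 4))    # approx. 1.3594
```

Starting from `np.log(table)` is only a convenience so the example can include the softmax step; a real router would produce logits from a learned projection.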
The whole loss is six lines: softmax, mean across the batch dim, argmax + one-hot for $f$, then the dot product. Nothing about it is sophisticated. The sophistication is in appreciating which lines have a gradient and which do not; that is the ingredient PyTorch turns into a training signal.
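The minimum-is-one trick. Why does the formula multiply by $N$? Because $\sum_i f_i = \sum_i P_i = 1$ always. When $f$ and $P$ coincide, Cauchy-Schwarz gives $\sum_i f_i P_i = \sum_i P_i^2 \ge 1/N$, with equality when both are uniform, and the sum is bounded above by 1 (when both concentrate on the same expert). Multiplying by $N$ gives a clean, scale-invariant loss that ranges over $[1, N]$, with the balanced value at exactly 1 regardless of $N$. Strictly speaking, a batch where $f$ and $P$ disagree can dip slightly below 1; in practice the two track each other, so 1 acts as the floor.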
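The two endpoints are easy to check numerically. The short sketch below (with an illustrative helper name, `aux_value`) evaluates $N \sum_i f_i P_i$ at perfectly uniform routing and at full collapse onto one expert, for a few expert counts.

```python
import numpy as np

def aux_value(f, P):
    """N * sum_i f_i * P_i for load fractions f and mean probabilities P."""
    return len(f) * float(np.dot(f, P))

for n in (4, 8, 256):
    uniform = np.full(n, 1.0 / n)     # every expert gets an equal share
    collapsed = np.zeros(n)
    collapsed[0] = 1.0                # all mass and all tokens on expert 0
    print(n, aux_value(uniform, uniform), aux_value(collapsed, collapsed))
```

The uniform case evaluates to 1 (up to floating-point rounding) for every `n`, while the collapsed case evaluates to `n`, which is why the same coefficient can be reused across expert counts.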
PyTorch: Wiring the Loss Into a Training Step
Production MoE modules return their aux loss alongside the activation output. The trainer is responsible for adding it to the task loss with the right coefficient and calling backward on the sum. Here is the cleanest possible end-to-end version.
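As a rough illustration of that contract, here is a toy sketch in PyTorch. Everything in it is an assumption for demonstration purposes: the `TinyMoE` class, the single-linear-layer experts, the `head` projection, the random data, and the choice of plain SGD are hypothetical and far simpler than a production MoE layer. The coefficient `alpha = 0.01` follows the value quoted earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-1 MoE layer returning (output, aux_loss). Illustrative only."""

    def __init__(self, d_model, num_experts):
        super().__init__()
        self.num_experts = num_experts
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )

    def forward(self, x):                                   # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)           # (tokens, experts)
        top1 = probs.argmax(dim=-1)                         # hard routing decision
        P = probs.mean(dim=0)                               # differentiable soft mass
        with torch.no_grad():                               # f carries no gradient
            f = F.one_hot(top1, self.num_experts).float().mean(dim=0)
        aux_loss = self.num_experts * torch.sum(f * P)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                out[mask] = expert(x[mask])                 # run only the routed tokens
        gate = probs.gather(1, top1.unsqueeze(1))           # chosen expert's probability
        return out * gate, aux_loss

torch.manual_seed(0)
d_model, num_experts, vocab, tokens = 16, 4, 10, 32
moe = TinyMoE(d_model, num_experts)
head = nn.Linear(d_model, vocab)                            # stand-in for the LM head
params = list(moe.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
alpha = 0.01

x = torch.randn(tokens, d_model)                            # random stand-in activations
targets = torch.randint(0, vocab, (tokens,))

hidden, aux_loss = moe(x)
task_loss = F.cross_entropy(head(hidden), targets)
total_loss = task_loss + alpha * aux_loss                   # joint objective
optimizer.zero_grad()
total_loss.backward()                                       # one backward pass on the sum
optimizer.step()
print(f"task={task_loss.item():.4f} aux={aux_loss.item():.4f}")
```

The router receives gradient from two places in this sketch: through `gate` (the task loss) and through `P` (the auxiliary loss). With several MoE layers, the trainer would collect one `aux_loss` per layer and combine them before scaling by `alpha`.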
The pattern generalizes to deep models with many MoE layers: each layer returns its own aux loss, and the trainer either sums them or averages them before adding $\alpha \, \mathcal{L}_{\text{aux}}$ to the task loss. DeepSeek-V2's public code, for instance, computes per-layer aux losses and a separate device-level balance loss (we will see that in §6.4) and stacks them all into the final scalar.
- One $\alpha$ per balancing mechanism, not per layer. All MoE layers share the same $\alpha$ in standard implementations. Per-layer tuning was tried by several teams; it did not generalize across model sizes and was abandoned.
- The detach on f matters for performance, not correctness. Argmax has no gradient regardless, but wrapping it in `torch.no_grad` prevents autograd from building an unnecessary graph. On a 671B model with 58 MoE layers this saves a measurable amount of host memory.
- Aux loss is sometimes annealed. Some implementations start $\alpha$ high (to bootstrap balance quickly) and decay it as training progresses, on the theory that experts diversify and balance naturally once the router has learned something. DeepSeek-V3 ablations (§6.3, Table 3) found this brittle: the balance regresses as soon as $\alpha$ shrinks, and you cannot tell from the loss curve whether you are regressing.
The Hidden Cost: Gradient Interference
The argument against auxiliary loss is not philosophical; it is a precise statement about gradient geometry. The router weights receive an update every step that is the sum of two contributions:
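$$\nabla_{W_r} \mathcal{L} = \nabla_{W_r} \mathcal{L}_{\text{LM}} + \alpha \, \nabla_{W_r} \mathcal{L}_{\text{aux}},$$ where $W_r$ denotes the router weights.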
The first term is the gradient that actually improves the model — it pushes routing decisions toward whatever the data says is the right expert for each token. The second term pushes routing decisions toward uniformity, regardless of data. For a single token, these two vectors live in the same parameter space and can be added — but they generally point in different directions. Decompose the aux gradient into the component parallel to the task gradient and the component orthogonal to it:
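$$\nabla_{W_r} \mathcal{L}_{\text{aux}} = g_{\parallel} + g_{\perp}, \qquad g_{\parallel} = \frac{\langle \nabla_{W_r} \mathcal{L}_{\text{aux}}, \, \nabla_{W_r} \mathcal{L}_{\text{LM}} \rangle}{\lVert \nabla_{W_r} \mathcal{L}_{\text{LM}} \rVert^2} \, \nabla_{W_r} \mathcal{L}_{\text{LM}}, \qquad g_{\perp} = \nabla_{W_r} \mathcal{L}_{\text{aux}} - g_{\parallel}.$$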
The parallel component is fine — it just resizes the task gradient up or down. The orthogonal component is the toxic part: it deflects the update vector away from the direction the task would have taken on its own. On batches where the task and balance objectives happen to agree, the deflection is small. On batches where they disagree — when the task gradient pulls hard toward a popular expert because that expert genuinely knows the content best — the deflection is large.
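The split into parallel and orthogonal parts is an ordinary vector projection, and can be sketched in a few lines of NumPy. The two gradient vectors below are made-up three-dimensional stand-ins, not measurements from any model, and `decompose` is an illustrative helper name.

```python
import numpy as np

def decompose(g_aux, g_task):
    """Split g_aux into parts parallel and orthogonal to g_task."""
    coef = np.dot(g_aux, g_task) / np.dot(g_task, g_task)
    g_par = coef * g_task          # rescales the task direction
    g_perp = g_aux - g_par         # deflects away from the task direction
    return g_par, g_perp

# Hypothetical gradients, chosen only to make the geometry visible.
g_task = np.array([1.0, 0.0, 0.0])
g_aux = np.array([0.5, 0.5, 0.0])
alpha = 0.01

g_par, g_perp = decompose(g_aux, g_task)
update = g_task + alpha * g_aux
cosine = np.dot(update, g_task) / (np.linalg.norm(update) * np.linalg.norm(g_task))

print("parallel   :", g_par)       # [0.5 0.  0. ]
print("orthogonal :", g_perp)      # [0.  0.5 0. ]
print("cosine(update, task) =", cosine)   # slightly below 1
```

The cosine between the combined update and the pure task gradient falls below 1 whenever `g_perp` is nonzero; scaling `alpha` down shrinks the deflection but never removes it.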
The orthogonal component compounds in two ways. First, across MoE layers: a 58-layer DeepSeek-V3 has 58 separate routers, each receiving its own deflected gradient. Second, across training steps: 14.8 trillion tokens means hundreds of thousands of optimizer steps, each adding a small amount of unwanted noise to the router. The effect is statistical and gradual — there is no single step where the model breaks. There is only a steady drift, measurable only by ablation, of the trained model away from where the pure task gradient would have taken it.
What Goes Wrong at Massive Scale
DeepSeek-V2 and V3 papers publish ablations that quantify this drift. Holding model size, data, and compute fixed, training the same MoE with vs. without an auxiliary balancing loss produces measurable differences on downstream benchmarks. The numbers move with model scale, but the pattern is consistent.
| Model & setup | Validation PPL | MMLU | Reasoning bench |
|---|---|---|---|
| MoE 16B, no balancing (collapses) | fails | fails | fails |
| MoE 16B, aux loss α=0.01 | baseline | baseline | baseline |
| MoE 16B, aux loss α=0.1 (over-balanced) | +1.2% | −0.8 pts | −1.4 pts |
| MoE 16B, bias-term (§6.3) no aux | −0.4% | +0.6 pts | +1.1 pts |
Numbers above are illustrative of the directional pattern reported in DeepSeek-V2 and V3 ablations; exact values vary by setup. What matters is not the magnitudes but the signs. The bias-term variant (the subject of §6.3) consistently beats the well-tuned aux-loss variant. It is not a tie, and it is not run-to-run noise: in the V3 ablations the gap is reported as statistically robust across multiple training seeds.
Where the regression lives
Two regions of model behavior absorb most of the quality penalty. First, rare-domain tokens: code, math, multilingual text — anything where one or two experts have plausibly specialized. The aux loss pulls those tokens toward generalist experts they would otherwise have skipped, diluting the specialization that made MoE worth doing in the first place. Second, the long tail of the loss curve: training-loss ablations show that early in training the aux loss is almost free (because the router barely knows anything anyway), but its cost grows over the course of training as specialization that could have emerged is dampened by the balancing pressure.
Engineering Reality: Why "Just Tune Alpha" Fails
The natural reply is: pick a better $\alpha$. If the current $\alpha$ is too aggressive, try a smaller one. If that lets the router collapse again, try a value in between. Several teams spent serious resources on exactly this hyperparameter sweep. Four reasons it fails to close the gap:
- The optimum $\alpha$ depends on the data distribution, and the data distribution changes during training. Early data tends to be general; later curriculum stages include more specialized domains. The $\alpha$ that prevents collapse on generalist data may suppress specialization on math data. There is no single $\alpha$ that is optimal throughout. (DeepSeek tried annealing schedules; quality variance across seeds dwarfed the schedule's benefit.)
- The optimum $\alpha$ depends on model size. A 7B MoE tolerates a different $\alpha$ than a 671B MoE because the router's relative parameter share differs. You cannot ablate on a 7B model and extrapolate; you have to retune at every target scale, which is impossibly expensive on the budgets that justify MoE in the first place.
- Diagnosing the regression is hard. The aux loss does its damage subtly. Training cross-entropy looks fine. Eval benchmarks move by 0.5–2 points — well within run-to-run noise on a single seed. You only see the effect by running multiple seeds and comparing distributions. Most teams ship without doing that carefully and never know what they left on the table.
- The orthogonality argument has no sweet spot. As shown above, the orthogonal component $g_{\perp}$ of the aux gradient does not vanish at any positive $\alpha$. The best you can do by tuning is to scale it down, and below a certain magnitude the balance signal becomes too weak to prevent collapse. The window where both objectives are mostly satisfied does exist, but it is narrow and shifts with everything that changes about the run.
For now, the picture to carry forward is this: auxiliary balancing loss is the standard, the obvious choice, and quietly the single largest avoidable cost in modern MoE training. Every line of math we wrote is correct; every implementation works; every model trained this way ships. They ship slightly worse than they could. Closing that gap, at the level of a few benchmark points and a few hundred basis points of pretraining loss, is what justifies DeepSeek-V3's bias-term approach in §6.3, and what makes the 2020-era GShard formula not the last word on MoE balancing.