The Tension We Inherit
Two sections ago we watched a mixture-of-experts layer collapse. A handful of experts captured almost all the routing decisions, every other expert went idle, and the model effectively shrank from N experts down to four or five. One section ago we tried the standard fix: an auxiliary load-balancing loss. It worked — the load curves flattened — but the gradient of that loss flowed straight into the same parameters that produce the token logits, and the model paid for the balance with a measurable hit on language-modeling quality. Bigger model, lower quality. That is the worst possible trade in large-scale training.
So we arrive at this section with a very specific question: can we get the discipline of the auxiliary loss without paying its gradient bill? DeepSeek-V3's answer, published in late 2024 and refined in the V3 / R1 technical reports, is a strikingly small modification to the router. It introduces a single per-expert scalar, updates it with no gradient at all, and is reported to produce flatter load curves than the auxiliary loss without the accompanying quality regression. This section is about that scalar.
The one-sentence version. Add a per-expert bias $b_i$ to the score used to select the top-K experts, but not to the score used to weight them. Then nudge $b_i$ down whenever expert $i$ is over-served and up whenever it's under-served. The model gradient never sees $b_i$ at all.
The Trick: Decouple Selection from Weighting
Standard top-K routing fuses two distinct decisions into one function. First, the router decides which experts a token activates. Second, it decides how much each chosen expert contributes to the output. In conventional MoE layers, both of those decisions are read off the same per-(token, expert) affinity score $s_{i,t}$.
That fusion is what made the auxiliary-loss approach painful. To change which experts get picked, the auxiliary loss had to change how much each gets weighted, because they came from the same logits. The two effects rode the same gradient.
The bias-term trick unfuses them. Let $s_{i,t}$ be the raw affinity (the model's own opinion of how relevant expert $i$ is to token $t$). Now build the routing decision from a different score:
$$\tilde{s}_{i,t} = s_{i,t} + b_i$$
where $b_i$ is a per-expert scalar that depends only on recent load, not on the token, not on the model weights.
The router selects top-K from the bias-adjusted score $s_{i,t} + b_i$. The gating weights it emits, which are what mathematically combines the experts' outputs, use the original $s_{i,t}$ with no bias. Selection and weighting are now driven by two different scores. They can be tuned independently. That's the entire structural move.
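To make the decoupling concrete, here is a small sketch for a single token. The affinities, the biases, and the helper names `fused_route` and `decoupled_route` are made up for illustration; `fused_route` is a hypothetical contrast case in which the bias leaks into the weights, not a method described in the source.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores; ties go to the lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def fused_route(s, b, k):
    """Contrast case: the biased score drives BOTH selection and weighting."""
    biased = [si + bi for si, bi in zip(s, b)]
    chosen = top_k_indices(biased, k)
    total = sum(biased[i] for i in chosen)
    return {i: biased[i] / total for i in chosen}


def decoupled_route(s, b, k):
    """Bias-term routing: select on s + b, weight on the raw s only."""
    biased = [si + bi for si, bi in zip(s, b)]
    chosen = top_k_indices(biased, k)
    total = sum(s[i] for i in chosen)
    return {i: s[i] / total for i in chosen}


s = [0.80, 0.60, 0.50, 0.10]     # raw affinities for one token (illustrative)
b = [-0.20, -0.20, 0.00, 0.00]   # per-expert biases (illustrative)

fused = fused_route(s, b, k=2)
decoupled = decoupled_route(s, b, k=2)
print("fused:    ", fused)
print("decoupled:", decoupled)
```

In this example the bias changes which experts are chosen (expert 2 replaces expert 1), but in the decoupled version the split between the chosen experts is still computed from the raw affinities alone.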
The Mathematical Formulation
Let the MoE layer have $N$ experts, route top-$K$ per token, and process $T$ tokens per batch. The router holds a learned centroid $e_i$ per expert and, separately, a non-trainable per-expert bias $b_i$.
Affinity (the model's opinion)
For each token hidden state $u_t$, the per-expert affinity is $s_{i,t} = \sigma(u_t^\top e_i)$, where $\sigma$ is the sigmoid. Each $s_{i,t}$ lives in $(0, 1)$ independently of the others; the scores are not normalized into a softmax. This choice matters: it means adding a bias to expert $i$'s score does not steal probability mass from any other expert. The bookkeeping stays local.
Routing decision (selection only)
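Define the bias-adjusted score $\tilde{s}_{i,t} = s_{i,t} + b_i$. The router picks the top-$K$ indices:
$$\mathcal{T}_t = \operatorname{TopK}\big(\{\tilde{s}_{i,t}\}_{i=1}^{N},\, K\big)$$
This is the set of expert indices that token $t$ activates.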
Gating weights (weighting only)
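For each $i \in \mathcal{T}_t$, the gating weight is the renormalized raw affinity:
$$g_{i,t} = \frac{s_{i,t}}{\sum_{j \in \mathcal{T}_t} s_{j,t}}$$
Notice $b_i$ does not appear here. The model's opinion alone decides how much weight each chosen expert contributes. The token's output is then $y_t = \sum_{i \in \mathcal{T}_t} g_{i,t}\, \mathrm{Expert}_i(u_t)$.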
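The three definitions above fit in a few lines. The following is a minimal pure-Python sketch for one token, assuming toy "experts" that simply scale their input and hand-picked centroids; the helper names (`affinity`, `route`, `moe_output`) are hypothetical and exist only for this illustration.

```python
import math


def affinity(u, centroids):
    """s_i = sigmoid(u . e_i) for every expert i (independent, not a softmax)."""
    return [1.0 / (1.0 + math.exp(-sum(a * c for a, c in zip(u, e))))
            for e in centroids]


def route(s, bias, k):
    """Select top-k on s + b, then renormalize the RAW s over the chosen set."""
    adjusted = [si + bi for si, bi in zip(s, bias)]
    chosen = sorted(range(len(s)), key=lambda i: (-adjusted[i], i))[:k]
    denom = sum(s[i] for i in chosen)
    return {i: s[i] / denom for i in chosen}


def moe_output(u, gates, experts):
    """y = sum_i g_i * Expert_i(u), summed over the chosen experts only."""
    y = [0.0] * len(u)
    for i, g in gates.items():
        out = experts[i](u)
        y = [yj + g * oj for yj, oj in zip(y, out)]
    return y


# Toy setup (illustrative values): 3 experts, 2-dim hidden state, top-2.
centroids = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
experts = [lambda u, c=c: [c * x for x in u] for c in (1.0, 10.0, 100.0)]
u = [1.0, 0.5]
bias = [0.0, 0.0, 0.0]

s = affinity(u, centroids)
gates = route(s, bias, k=2)
y = moe_output(u, gates, experts)
print("affinities:", [round(x, 3) for x in s])
print("gates:     ", {i: round(g, 3) for i, g in gates.items()})
print("output:    ", [round(v, 3) for v in y])
```

Passing a different `bias` to `route` can change which experts appear in `gates`, but never the formula that weights them.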
The Bias Update as a Control Loop
With selection and weighting decoupled, we're free to update $b_i$ however we want. DeepSeek picks the simplest possible rule: a sign-based, fixed-step controller.
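After each step, let $\ell_i = \sum_{t=1}^{T} \mathbb{1}[i \in \mathcal{T}_t]$ be the number of token-slots routed to expert $i$ this batch. The uniform setpoint is $\ell^* = TK/N$. Update:
$$b_i \leftarrow b_i - \gamma \, \operatorname{sign}(\ell_i - \ell^*)$$
The hyperparameter $\gamma$ (which the DeepSeek-V3 report calls the bias update speed) controls the loop's speed. Read the rule sign by sign: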
- Expert $i$ got more tokens than expected: $\ell_i$ exceeds the setpoint $\ell^* = TK/N$, so $b_i$ drops by $\gamma$. Its bias-adjusted scores fall, and on the next batch it tends to lose top-K races it would otherwise have won by small margins.
- Expert $i$ got fewer tokens: $\ell_i < \ell^*$, so $b_i$ rises by $\gamma$. Its bias-adjusted scores climb; on the next batch it wins more close races.
- Expert $i$ exactly at the setpoint: $\operatorname{sign}(0) = 0$, so $b_i$ is unchanged. (In practice this is rare, and it cannot happen at all when $TK/N$ is not an integer.)
The fixed point is the place where the per-expert biases have settled into whatever offsets the data distribution actually needs — bigger negative biases for the intrinsically "popular" experts, bigger positive biases for the intrinsically "niche" ones — and load is uniform. This is a standard bang-bang controller. It is the same idea as a thermostat clicking on and off.
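As a sketch of the update rule above in plain Python (the function name `update_bias` and the example numbers are illustrative assumptions, not taken from the source):

```python
def sign(x):
    """Return +1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def update_bias(bias, load, num_tokens, top_k, gamma):
    """One controller step: b_i <- b_i - gamma * sign(load_i - T*K/N)."""
    setpoint = num_tokens * top_k / len(bias)
    return [b - gamma * sign(l - setpoint) for b, l in zip(bias, load)]


# 8 tokens, top-2, 4 experts -> 16 slots, setpoint 4 per expert.
bias = [0.0, 0.0, 0.0, 0.0]
load = [7, 3, 4, 2]   # over-served, under-served, exactly on target, under-served
bias = update_bias(bias, load, num_tokens=8, top_k=2, gamma=0.01)
print(bias)
```

Note that the step size is the same for an expert that is over by one token and one that is over by ten; only the direction of the error is used.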
Why It Has Zero Gradient Interference
The cleanest way to see the argument is to trace the computation graph. Start at the loss and walk backward to find which parameters it depends on.
| Quantity | Depends on… | Has gradient? |
|---|---|---|
| Per-token output y_t | g_{i,t} and Expert_i(u_t) | yes (through experts and gates) |
| Gate weight g_{i,t} | raw affinity s_{i,t} (NOT b_i) | yes (through centroids e_i) |
| Affinity s_{i,t} | centroid e_i and hidden state u_t | yes (e_i is a Parameter) |
| Bias b_i | history of loads ℓ_i (no parameters at all) | no (no path from L) |
| Top-K membership T_t | top-K over (s + b), non-differentiable | none (discrete choice, treated as a constant) |
The bias $b_i$ influences $y_t$ only through the discrete top-K choice, which has no usable gradient. The differentiable path, through $g_{i,t}$, bypasses $b_i$ entirely. So when the optimizer asks autograd "what should I do with $b_i$?", the honest answer is nothing. We update it ourselves, with the control loop, and the gradient signal that improves token-level prediction flows undisturbed through the centroids and the expert networks.
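One way to check this argument numerically, without an autograd library, is a finite-difference probe: nudge each bias by a tiny amount and confirm the gate weights do not move, then nudge a raw affinity and confirm that they do. This is a sketch with made-up numbers; `gates_for` is a hypothetical helper, and the numbers are chosen so that a nudge of `eps` cannot flip the top-K membership.

```python
def gates_for(s, bias, k):
    """Top-k on s + b; gate weights renormalized from the raw s."""
    adjusted = [si + bi for si, bi in zip(s, bias)]
    chosen = sorted(range(len(s)), key=lambda i: (-adjusted[i], i))[:k]
    denom = sum(s[i] for i in chosen)
    return [s[i] / denom if i in chosen else 0.0 for i in range(len(s))]


s = [0.80, 0.60, 0.50, 0.10]
bias = [-0.20, -0.20, 0.00, 0.00]
eps = 1e-6

base = gates_for(s, bias, k=2)

# Perturb each bias entry: membership is unchanged, so the gates are identical.
d_gate_d_bias = []
for i in range(len(bias)):
    bumped = list(bias)
    bumped[i] += eps
    new = gates_for(s, bumped, k=2)
    d_gate_d_bias.append(max(abs(n - g) for n, g in zip(new, base)) / eps)

# Perturb the raw affinity of a chosen expert: the gates respond.
s_bumped = list(s)
s_bumped[0] += eps
new = gates_for(s_bumped, bias, k=2)
d_gate_d_s0 = max(abs(n - g) for n, g in zip(new, base)) / eps

print("max |d gate / d bias_i|:", d_gate_d_bias)
print("max |d gate / d s_0|:   ", d_gate_d_s0)
```

The only way a bias can change the output is by flipping an expert in or out of the chosen set, which is a jump rather than a slope, so there is nothing for a gradient-based optimizer to follow.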
Interactive: Watch the Bias Loop Stabilize
Before any code, run the loop yourself. The simulator below uses 8 experts, top-K = 2, and a deliberately skewed affinity distribution — experts 0 and 1 have systematically higher mean affinity than the rest. With bias correction off, that skew is enough to send the routing into permanent collapse. With it on, watch the biases drift into negative territory for the popular experts and positive for the niche ones until load is essentially uniform.
Things worth doing in the simulator, in order:
- Press Skip 30 steps with the bias correction off. The per-step load on experts 0 and 1 sits around 20-25 tokens (out of 32 slot-allocations), and the bottom four experts get almost nothing. The imbalance curve refuses to fall.
- Reset. Turn the bias correction on. Step a few times: you can see the biases for the popular experts already turning negative after step 1. Press Skip 30 steps and watch the imbalance fall.
- Crank $\gamma$ to its maximum and step a few times. The biases jitter visibly: you're seeing the cost of an over-aggressive controller. With a smaller $\gamma$, the convergence is smoother but slower. This trade-off is exactly the one DeepSeek tunes with a schedule in real training.
- Drop $\gamma$ to zero. The control loop freezes. Routing decisions still get the current biases applied, but they stop adapting. This is what DeepSeek does in the last ~5% of training to lock in the routing pattern.
Manual Numerical Walkthrough
Let's execute the full loop by hand for one step, with $N = 4$ experts, $K = 2$, and $T = 6$ tokens. Small enough to carry every intermediate.
Step-by-step: one batch through a bias-balanced router
Setup
Affinities $s_{i,t}$ (rows are tokens, columns are experts). These would normally come from $\sigma(u_t^\top e_i)$; we hand-pick them so expert 0 looks objectively most relevant on most tokens, which is the natural skew we want the bias to correct.
| Token | E0 | E1 | E2 | E3 |
|---|---|---|---|---|
| t=0 | 0.90 | 0.40 | 0.20 | 0.10 |
| t=1 | 0.85 | 0.55 | 0.25 | 0.15 |
| t=2 | 0.80 | 0.30 | 0.60 | 0.20 |
| t=3 | 0.70 | 0.50 | 0.30 | 0.40 |
| t=4 | 0.95 | 0.45 | 0.15 | 0.25 |
| t=5 | 0.75 | 0.65 | 0.10 | 0.05 |
Current biases after a few prior steps of the loop: $b = (-0.30, -0.05, +0.10, +0.25)$. Expert 0 already has a meaningful negative bias because it has been over-served. Expert 3 has a positive bias to lift it.
Step 1: Form the bias-adjusted score
$\tilde{s}_{i,t} = s_{i,t} + b_i$: add the per-expert bias to every column.
| Token | E0 (b = −0.30) | E1 (b = −0.05) | E2 (b = +0.10) | E3 (b = +0.25) |
|---|---|---|---|---|
| t=0 | 0.60 | 0.35 | 0.30 | 0.35 |
| t=1 | 0.55 | 0.50 | 0.35 | 0.40 |
| t=2 | 0.50 | 0.25 | 0.70 | 0.45 |
| t=3 | 0.40 | 0.45 | 0.40 | 0.65 |
| t=4 | 0.65 | 0.40 | 0.25 | 0.50 |
| t=5 | 0.45 | 0.60 | 0.20 | 0.30 |
Step 2: Top-K = 2 selection (per row)
Pick the two highest entries in each row of the adjusted table:
| Token | Chosen experts | Membership set T_t |
|---|---|---|
| t=0 | E0 (0.60), E1 (0.35, tied with E3; tie broken by lower index) | {E0, E1} |
| t=1 | E0 (0.55), E1 (0.50) | {E0, E1} |
| t=2 | E2 (0.70), E0 (0.50) | {E0, E2} |
| t=3 | E3 (0.65), E1 (0.45) | {E1, E3} |
| t=4 | E0 (0.65), E3 (0.50) | {E0, E3} |
| t=5 | E1 (0.60), E0 (0.45) | {E0, E1} |
Step 3: Gate weights — use the RAW affinity
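Now we pull the gate weights from the original table, with no bias. Token $t = 0$ picked $\{E0, E1\}$:
$$g_{0,0} = \frac{0.90}{0.90 + 0.40} \approx 0.692, \qquad g_{1,0} = \frac{0.40}{0.90 + 0.40} \approx 0.308$$
The model's opinion of expert 0's relevance to token 0 is fully preserved: it gets 69% of the gate. The bias decided only whether E0 made the chosen set (it did, despite its $-0.30$ penalty); it did not water down E0's contribution.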
Step 4: Tally load and update biases
Count each expert's occurrences across all membership sets $\mathcal{T}_t$:
| Expert | Load this step | Setpoint ℓ* = TK/N = 12/4 = 3 | overload sign | Δb |
|---|---|---|---|---|
| E0 | 5 | 3 | +1 | −γ |
| E1 | 4 | 3 | +1 | −γ |
| E2 | 1 | 3 | −1 | +γ |
| E3 | 2 | 3 | −1 | +γ |
With $\gamma = 0.05$, the new biases are:
$$b = (-0.35, -0.10, +0.15, +0.30)$$
Compare to where we started: $b = (-0.30, -0.05, +0.10, +0.25)$. Every over-served expert dropped 0.05; every under-served one rose 0.05. On the next batch, that 0.05 shift will tip a handful of close decisions away from E0 and E1 toward E2 and E3. Push the same process through 100 batches and the load distribution flattens while the centroids evolve under their own gradient, with no gradient contribution from the bias loop.
The model never knew
Gradients computed in the backward pass at this step depend on $g_{i,t}$ and the experts, both of which used the raw $s_{i,t}$ table. The biases appeared only in Step 2 (a non-differentiable top-K selection) and as a state update in Step 4 that lives outside autograd. There is no chain rule that connects the loss to $b_i$.
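The walkthrough can be replayed in a few lines of plain Python. The affinities and starting biases are the ones from the tables above, and the step size 0.05 is the one implied by the 0.05 shifts in the text. Rounding the adjusted scores before comparing them is an implementation choice made here so that the 0.35 tie on token 0 is broken by expert index rather than by floating-point noise.

```python
S = [  # raw affinities: rows are tokens t=0..5, columns are experts E0..E3
    [0.90, 0.40, 0.20, 0.10],
    [0.85, 0.55, 0.25, 0.15],
    [0.80, 0.30, 0.60, 0.20],
    [0.70, 0.50, 0.30, 0.40],
    [0.95, 0.45, 0.15, 0.25],
    [0.75, 0.65, 0.10, 0.05],
]
bias = [-0.30, -0.05, 0.10, 0.25]
K, GAMMA = 2, 0.05
N, T = len(bias), len(S)

chosen_sets, gates = [], []
for row in S:
    # Step 1: bias-adjusted scores (rounded so exact ties compare equal).
    adjusted = [round(s + b, 6) for s, b in zip(row, bias)]
    # Step 2: top-K selection, ties broken by the lower expert index.
    chosen = sorted(sorted(range(N), key=lambda i: (-adjusted[i], i))[:K])
    # Step 3: gate weights from the RAW affinities only.
    denom = sum(row[i] for i in chosen)
    chosen_sets.append(chosen)
    gates.append({i: row[i] / denom for i in chosen})

# Step 4: tally the load and apply the sign update.
load = [sum(i in c for c in chosen_sets) for i in range(N)]
setpoint = T * K / N
new_bias = [round(b - GAMMA * ((l > setpoint) - (l < setpoint)), 2)
            for b, l in zip(bias, load)]

print("membership:", chosen_sets)
print("token 0 gates:", {i: round(g, 3) for i, g in gates[0].items()})
print("load:", load, "setpoint:", setpoint)
print("new bias:", new_bias)
```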
Plain Python Implementation
Here is the entire mechanism (affinity, selection, gating, control loop) in pure NumPy. Nothing is hidden behind an MoE library; every operation is a single named line you can trace. Read it once top-to-bottom.
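What follows is a minimal sketch of such a file, assuming the names the surrounding text refers to (`GAMMA`, `popular`, `decision`, `overload`, `expected_load`). The sizes, the seed, the synthetic affinity model and the `run` helper are illustrative choices made here, so the exact load numbers it prints need not match the figures quoted in the text.

```python
import numpy as np

N_EXPERTS, TOP_K, TOKENS = 8, 2, 64
GAMMA = 0.01        # controller step size; 0 gives the no-balancing baseline
STEPS = 500
popular = (0, 1)    # experts given a systematically higher mean affinity


def run(gamma, steps=STEPS, seed=0):
    rng = np.random.default_rng(seed)
    bias = np.zeros(N_EXPERTS)                   # plain array, mutated by hand
    expected_load = TOKENS * TOP_K / N_EXPERTS   # uniform setpoint T*K/N
    history = []
    for _ in range(steps):
        # Synthetic affinities: sigmoid of noisy logits, skewed toward `popular`.
        logits = rng.normal(size=(TOKENS, N_EXPERTS))
        logits[:, list(popular)] += 2.0
        s = 1.0 / (1.0 + np.exp(-logits))

        # Selection uses the bias-adjusted score ...
        decision = s + bias[None, :]
        idx = np.argsort(-decision, axis=1)[:, :TOP_K]

        # ... while the gate weights use the raw affinity only
        # (computed for completeness; this toy has no experts to apply them to).
        chosen = np.take_along_axis(s, idx, axis=1)
        gates = chosen / chosen.sum(axis=1, keepdims=True)

        # Count load and run one controller step (no gradients anywhere).
        load = np.bincount(idx.ravel(), minlength=N_EXPERTS)
        overload = load - expected_load
        bias -= gamma * np.sign(overload)
        history.append(load)
    return bias, np.array(history)


bias, history = run(GAMMA)
_, baseline = run(0.0)
print("final bias:", np.round(bias, 2))
print("mean load, last 100 steps (balanced):", history[-100:].mean(axis=0).round(1))
print("mean load, last 100 steps (baseline):", baseline[-100:].mean(axis=0).round(1))
```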
Two structural things to take away from this file. First, the bias vector is just a NumPy array we mutate by hand — there is no gradient anywhere in the file. Second, the line decision = s + bias[None, :] is the entire algorithm. Everything before it sets up the data; everything after it counts load and updates the controller.
What to try with this code
- Run it as-is and print the load every 50 steps. You'll see the popular experts start at ~25 tokens/step and walk down to ~16; the unpopular ones start near zero and climb up to ~16.
- Set GAMMA = 0. The biases never leave zero, the decision is just the raw affinity s, and the load stays permanently lopsided: the no-balancing baseline.
- Replace np.sign(overload) with overload / expected_load. This is the magnitude-aware variant; convergence is faster, but a single unusual batch can over-correct.
- Add a slow drift to popular — e.g. rotate the high-affinity slots from (0, 1) to (2, 3) at step 200. Watch the biases re-balance on the fly. This is the actual situation during long pretraining: data distribution drifts, and the loop tracks it.
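For the third experiment in the list, the two update rules can be compared on a single batch. This is a pure-Python sketch; the function names and load vectors are illustrative assumptions, and `expected_load` mirrors the name used in the text.

```python
def sign_update(bias, load, expected_load, gamma):
    """Fixed-step rule: every off-target expert moves by exactly gamma."""
    return [b - gamma * ((l > expected_load) - (l < expected_load))
            for b, l in zip(bias, load)]


def magnitude_update(bias, load, expected_load, gamma):
    """Magnitude-aware variant: step proportional to the relative overload."""
    return [b - gamma * (l - expected_load) / expected_load
            for b, l in zip(bias, load)]


# 16 tokens, top-2, 8 experts -> 32 slots, expected load 4 per expert.
gamma, expected_load = 0.01, 4.0
zeros = [0.0] * 8

typical = [5, 4, 4, 3, 5, 4, 3, 4]     # mild imbalance
outlier = [16, 16, 0, 0, 0, 0, 0, 0]   # one unusual batch: two experts take everything

for name, load in (("typical", typical), ("outlier", outlier)):
    print(name, "sign:     ", sign_update(zeros, load, expected_load, gamma))
    print(name, "magnitude:", magnitude_update(zeros, load, expected_load, gamma))
```

In the outlier batch the magnitude-aware rule moves expert 0 three times as far as the sign rule does, which is the over-correction risk mentioned above; near balance it takes smaller steps than the sign rule.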
PyTorch Implementation
The PyTorch version preserves the same control structure but uses the framework's tools properly — centroids as Parameters (so the optimizer learns them) and bias as a registered buffer (so it's saved with the model but invisible to the optimizer and to autograd).
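A minimal sketch of such a module is shown below. It assumes the interface the following paragraphs refer to (`router(h)` returning `idx` and `gates`, plus an `update_bias(load, total_tokens)` method); the class name `BiasBalancedRouter`, the constructor arguments, the initialization scale and the choice to return `load` from `forward` are assumptions made for illustration, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn


class BiasBalancedRouter(nn.Module):
    def __init__(self, d_model, n_experts, top_k, gamma=1e-3):
        super().__init__()
        self.top_k = top_k
        self.gamma = gamma
        # Learned centroids: a Parameter, so the optimizer updates them.
        self.centroids = nn.Parameter(torch.randn(n_experts, d_model) * 0.02)
        # Bias: a buffer, saved in state_dict but invisible to the optimizer.
        self.register_buffer("bias", torch.zeros(n_experts))

    def forward(self, h):
        # h: (tokens, d_model) -> raw affinities s: (tokens, n_experts)
        s = torch.sigmoid(h @ self.centroids.t())
        # Selection on the bias-adjusted score; only the indices are kept.
        _, idx = torch.topk(s.detach() + self.bias, self.top_k, dim=-1)
        # Weighting on the raw affinities of the chosen experts.
        chosen = s.gather(-1, idx)
        gates = chosen / chosen.sum(dim=-1, keepdim=True)
        load = torch.bincount(idx.flatten(), minlength=self.bias.numel())
        return idx, gates, load

    @torch.no_grad()
    def update_bias(self, load, total_tokens):
        # Sign-based control step, outside autograd.
        expected = total_tokens * self.top_k / self.bias.numel()
        self.bias -= self.gamma * torch.sign(load.float() - expected)


torch.manual_seed(0)
router = BiasBalancedRouter(d_model=16, n_experts=8, top_k=2, gamma=0.01)
h = torch.randn(32, 16)               # 32 token hidden states
idx, gates, load = router(h)
loss = gates.pow(2).sum()             # stand-in for a real training loss
loss.backward()                       # reaches the centroids, never the bias
router.update_bias(load, total_tokens=h.shape[0])  # after optimizer.step() in real training
print(router.centroids.grad is not None, router.bias.requires_grad)
```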
How this module fits into a training loop
The usage pattern is intentionally boring:
- Forward pass: call `router(h)`, dispatch tokens to the chosen experts using `idx` and `gates`, combine outputs. Standard MoE.
- Backward pass: `loss.backward()` flows through `gates`, which depend on `self.centroids` but not on `self.bias`. The optimizer updates centroids and expert weights as usual.
- After `optimizer.step()`: call `router.update_bias(load, total_tokens)`. This is the control-loop update, fully outside autograd.
Because `self.bias` is a registered buffer rather than a plain tensor attribute, `model.state_dict()` captures it even though it is not a parameter. Resuming a training run starts the controller from its warmed-up state, so you don't lose thousands of steps of load history every time you checkpoint.
The `load` tensor must be all-reduced across data-parallel ranks before `update_bias` is called. Otherwise each rank only sees its local micro-batch and the controllers across ranks drift apart. One call to `torch.distributed.all_reduce(load, op=torch.distributed.ReduceOp.SUM)` fixes it. DeepSeek also balances across expert-parallel groups, which we cover in chapter 12.
The γ Schedule in DeepSeek-V3
The simulator above hints at the trade-off: a large $\gamma$ converges quickly but oscillates; a small $\gamma$ is stable but slow to recover from distribution shift. DeepSeek-V3 resolves this the same way every other large-scale training trick does: schedule it.
| Phase | γ value | Why |
|---|---|---|
| Warmup (first ~2K steps) | γ ramps up to 0.001 | Centroids are still random. The model's opinion is noise; we don't want the bias to chase noise too fast. |
| Main training | γ = 0.001 (constant) | The dominant regime. Slow, smooth corrections; small enough that the controller doesn't fight transient batch-to-batch noise. |
| Late training (~last 5%) | γ decays linearly to 0 | Locks in the routing pattern. The model can specialize harder when it knows its expert assignments won't keep drifting. |
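Two details about the published numbers. First, the reported $\gamma = 0.001$ was tuned for very large batches: many tokens per device with hundreds of devices in parallel, that is, an all-reduced load of tens of thousands of tokens per step. The sign update moves each bias by exactly $\gamma$ per step whatever the batch size, but how reliable the sign is depends on the per-step load count, and small batches give noisy counts, so absolute numbers don't transfer to smaller setups. If you reproduce this with batch size 64, you almost certainly want to retune $\gamma$ rather than copy the published value.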
Second, the cool-down is not optional. If you train with $\gamma > 0$ right up to the last step, the biases that the deployed model uses carry whatever jitter the final noisy batches left in them. The decay phase averages the biases toward their long-run mean. Models with the decay disabled show measurably worse inference-time load balance.
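A schedule of this shape is easy to express as a function of the step index. The sketch below encodes only the constant phase and the linear cool-down from the table; the function name `gamma_at`, the `cooldown_frac` parameter and the step counts are illustrative assumptions, not values taken from DeepSeek's code.

```python
def gamma_at(step, total_steps, base=1e-3, cooldown_frac=0.05):
    """Constant `base`, then a linear decay to 0 over the final fraction of training."""
    cooldown_steps = round(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps
    if cooldown_steps == 0 or step < cooldown_start:
        return base
    remaining = max(total_steps - step, 0)
    return base * remaining / cooldown_steps


total = 1000
for step in (0, 500, 949, 950, 975, 1000):
    print(step, gamma_at(step, total))
```

The training loop would call `gamma_at(step, total_steps)` once per step and pass the result to the bias update, so the controller slows down and finally freezes without any change to the router itself.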
What Changes at Massive Scale
Everything above runs in 50 lines of NumPy on a laptop. Putting it on a 671B-parameter DeepSeek-V3 cluster surfaces three concerns that don't exist in the toy.
1. Load must be aggregated across ranks before the update
A real MoE layer is sharded with expert parallelism: different GPUs hold different experts. A token routed to expert 42 physically travels (via an all-to-all collective) to the GPU that owns expert 42. The load counter for expert 42 is naturally local to that GPU. But the data-parallel dimension is replicated: each data-parallel rank sees its own micro-batch of tokens, and they each accumulate their own partial $\ell_i$. Before applying the update we all_reduce the load vector across data-parallel ranks so every replica of the layer sees the same load and applies the same update to its (replicated) bias. Skip this and the biases drift apart, the routing decisions diverge across ranks, and training corrupts within a few hundred steps.
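The divergence is easy to reproduce without a cluster. The sketch below simulates two data-parallel ranks in plain Python and replaces the all-reduce with an ordinary sum over per-rank lists; the load vectors and helper names are made up for illustration, and in a real job the sum would come from a collective such as `torch.distributed.all_reduce`.

```python
def sign(x):
    return (x > 0) - (x < 0)


def bias_step(bias, load, setpoint, gamma):
    return [b - gamma * sign(l - setpoint) for b, l in zip(bias, load)]


N, K, gamma = 4, 2, 0.05
tokens_per_rank = 6
rank_loads = [
    [5, 4, 1, 2],   # rank 0's local counts (sum = 6 * 2)
    [2, 4, 3, 3],   # rank 1's local counts (sum = 6 * 2)
]
bias = [0.0] * N

# Without aggregation: each rank compares its LOCAL load with its LOCAL setpoint.
local_setpoint = tokens_per_rank * K / N
unsynced = [bias_step(bias, load, local_setpoint, gamma) for load in rank_loads]

# With aggregation: sum the loads across ranks first (what an all-reduce SUM
# would leave on every rank), then compare with the GLOBAL setpoint.
global_load = [sum(col) for col in zip(*rank_loads)]
global_setpoint = len(rank_loads) * tokens_per_rank * K / N
synced = [bias_step(bias, global_load, global_setpoint, gamma) for _ in rank_loads]

print("unsynced replicas:", unsynced)
print("synced replicas:  ", synced)
```

After a single step the unsynchronized replicas already disagree about the sign of expert 0's update, while the synchronized ones hold identical biases.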
2. Bias bookkeeping is microscopic; expert dispatch is not
The bias update itself is N additions and N sign checks: for N = 256 experts, that is a few hundred elementary operations. The MoE layer's compute is dominated by the expert MLPs (billions of FLOPs) and by the all-to-all token dispatch (tens of GB of communication). The balancing logic is essentially free. This is part of why the bias-term method is preferred over more elaborate controllers: whatever you spend on the controller, you spend per layer per step, and DeepSeek-V3 has 58 MoE layers (61 transformer layers, the first three of which use dense FFNs), each updated on every training step.
3. Dropped tokens are an emergent property of the loop
Real MoE deployments cap each expert at a per-step capacity, and tokens beyond the cap are dropped or rerouted. With the auxiliary loss, you tune the loss coefficient and hope the load stays inside the cap. With the bias-term method, if the cap is set near the uniform load $\ell^* = TK/N$, the controller actively steers toward that target, and the drop rate falls naturally as training progresses. The DeepSeek-V3 technical report states that, thanks to this balancing, the model drops no tokens during training.
| Property | Auxiliary loss | Bias term (DeepSeek) |
|---|---|---|
| Gradient interference | Yes (same path as token logits) | None — by construction |
| Tuning surface | Loss coefficient λ (sensitive) | Update rate γ (forgiving) |
| Communication cost | 0 extra collectives | 1 all_reduce per layer per step |
| Sensitivity to data drift | Fixed strength — corrects slowly | Continuous control — tracks drift |
| Effect on model quality | Mild regression (paper-reported) | Slight improvement or neutral |
| Drop rate at production cap | 1–5% | <0.1% by mid-training |
Production Pitfalls and What to Monitor
What to log every step
- Per-expert load (a length-N vector). Plot as a heatmap over training steps. Should go from skewed to flat over the first few thousand steps.
- Max-over-min load ratio $\max_i \ell_i / \min_i \ell_i$. Single scalar; should fall from >10× to near 1.5× and stay there.
- Bias magnitude (for example $\max_i |b_i|$). Should grow during warmup then plateau. Unbounded growth means γ is too large or the all-reduce is missing.
- Drop rate (tokens exceeding per-expert capacity). Should fall monotonically. A persistent floor means your capacity cap is below the uniform setpoint and you should raise it.
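The scalar metrics in this list are cheap to compute from quantities the router already has. A minimal sketch in plain Python, assuming a per-expert `capacity` cap and a helper name `balance_metrics` that are both illustrative; using the largest absolute bias as the "bias magnitude" is one possible choice among several.

```python
def balance_metrics(load, bias, capacity):
    """Per-step health metrics for one MoE layer."""
    total = sum(load)
    lo, hi = min(load), max(load)
    # Tokens beyond the per-expert cap count as dropped.
    dropped = sum(max(l - capacity, 0) for l in load)
    return {
        "max_over_min": hi / lo if lo > 0 else float("inf"),
        "max_abs_bias": max(abs(b) for b in bias),
        "drop_rate": dropped / total if total else 0.0,
    }


metrics = balance_metrics(
    load=[5, 4, 1, 2],
    bias=[-0.35, -0.10, 0.15, 0.30],
    capacity=4,
)
print(metrics)
```

Logging these three numbers per layer per step, alongside the raw load vector for the heatmap, is enough to spot a missing all-reduce or a badly tuned step size early.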
Summary
DeepSeek-V3's auxiliary-loss-free load balancing is one of the most elegant pieces of mechanical design in modern MoE training. Three structural decisions carry the entire result:
- Decouple selection from weighting. Use $s_{i,t} + b_i$ for top-K selection and the raw $s_{i,t}$ for gate weights.
- Update the bias by a control loop, not by gradient descent. After each step, nudge $b_i$ down for over-served experts and up for under-served ones.
- Store the bias as a buffer, not a parameter. Autograd can't reach it, the optimizer can't see it, and the loss has no path through it.
The combined effect is the discipline of an auxiliary loss with the gradient quiet of no loss at all. Compared to MoE models that use an auxiliary loss, DeepSeek-V3 trains with flatter load curves, a lower drop rate, and a small but consistent improvement in language-modeling quality: not in spite of the load balancer, but because the load balancer stopped fighting the model.
The next section closes the chapter by adding one more layer of defense: a per-sequence balance term that catches the rare extreme imbalances even the bias loop can't prevent — and we'll see why it's applied at the sequence level, not the batch level, before moving on to multi-token prediction in chapter 7.