Section 6.3 introduced DeepSeek's clean fix for load imbalance: a learned bias on each expert's router score, updated by a non-differentiable feedback rule. That mechanism is enough to balance the routing on average across the batch — and on average, it works beautifully. But averages can hide bad behavior at finer time scales. A batch can look balanced while individual sequences inside it are quietly collapsing. The sequence-level balance loss is the tiny safety net DeepSeek adds on top, designed to catch exactly this failure.
The promise of sequence-level balance. A nearly-zero auxiliary loss applied per sequence, not per batch, to penalize within-sequence routing collapse. The coefficient is so small ($\alpha = 10^{-4}$ in DeepSeek-V3) that it does not corrupt the routing decision, yet it is just large enough that, when one sequence quietly funnels 80% of its tokens to a single expert, the gradient has somewhere to push back.
The Hidden Failure of Batch-Level Balance
Imagine a training batch with 32 sequences. The bias-term feedback rule from the previous section nudges biases up for under-used experts and down for over-used ones, measured over the whole batch. After a few hundred steps it converges to a clean state: across those 32 sequences combined, every expert handles roughly its fair share of tokens. The batch-level histogram looks like a flat line.
Now inspect the routing inside a single sequence. You may discover that sequence #7 sent 80% of its tokens to expert 12, while sequence #19 sent 80% of its tokens to expert 3. Both sequences are using only a sliver of the expert pool locally, but they happen to use different slivers — so when you average across the batch, the imbalance cancels out. The bias-term mechanism is satisfied. The model is not.
| What you measure | Bias-term sees | Reality inside sequences |
|---|---|---|
| Tokens per expert (batch) | Perfectly balanced (the goal) | Looks balanced — averages cancel |
| Tokens per expert (per sequence) | Not monitored | Can be wildly imbalanced |
| Expert utilization (forward) | Full | Locally degenerate, globally fine |
| Gradient signal per expert | Healthy on average | Spiky and high-variance within sequences |
| Decode-time stability | Looks fine in training | Single-stream inference collapses to 1–2 experts |
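The cancellation is easy to reproduce on a toy batch. The following is a minimal sketch in plain Python, assuming a hypothetical batch of four 8-token sequences, 4 experts, and top-1 routing; the routing lists are made up for illustration and do not come from a real model.

```python
from collections import Counter

NUM_EXPERTS = 4

def expert_share(routes, num_experts):
    """Fraction of tokens sent to each expert (top-1 routing)."""
    counts = Counter(routes)
    return [counts.get(e, 0) / len(routes) for e in range(num_experts)]

# Hypothetical routing decisions: sequence i sends 5 of its 8 tokens to expert i.
batch = [
    [0, 0, 0, 0, 0, 1, 2, 3],
    [1, 1, 1, 1, 1, 0, 2, 3],
    [2, 2, 2, 2, 2, 0, 1, 3],
    [3, 3, 3, 3, 3, 0, 1, 2],
]

# Batch-level view: pool every token, then count.
all_tokens = [e for seq in batch for e in seq]
batch_share = expert_share(all_tokens, NUM_EXPERTS)

# Sequence-level view: count inside each sequence separately.
seq_shares = [expert_share(seq, NUM_EXPERTS) for seq in batch]
worst_seq_share = max(max(share) for share in seq_shares)

print("batch share per expert:     ", batch_share)
print("largest share in a sequence:", worst_seq_share)
```

The pooled histogram is flat at 0.25 per expert, while every individual sequence sends 62.5% of its tokens to one expert. A controller that only sees the pooled counts has nothing to correct.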
Two Time Scales, Two Different Failures
Picture a restaurant kitchen. The head chef (bias-term feedback) watches the whole shift and shouts "more salads, fewer steaks" when the totals drift out of balance. By the end of the night the count is even — across the whole shift, the kitchen handled equal numbers of every dish. But within any given hour the kitchen might have been doing nothing but desserts, then nothing but soups. The shift-level average is balanced; the hour-by-hour rhythm is broken.
The sequence-level loss is the line cook — a much quieter signal that operates inside each hour. It does not try to fix the absolute counts (the head chef already handles that). It just whispers "hey, you've sent the last four orders to the same station; spread the next one out." Two different control loops at two different time scales, doing two different jobs.
This split is fundamental to how DeepSeek thinks about load balancing. The bias-term feedback handles slow, global imbalance and does so without contributing any gradient. The sequence-level loss handles fast, local imbalance and contributes a microscopic gradient: small enough that it cannot reshape the routing decision, but present enough that the optimizer sees a directional hint when a sequence starts to collapse.
The Math: A Per-Sequence Smoothing Loss
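Fix one sequence with $T$ tokens and $E$ experts. Top-$k$ routing produces, for each token $t$, a set $S_t$ of $k$ chosen experts. The sequence-level balance loss DeepSeek-V3 uses is:
$$L_{\text{seq}} = \alpha \sum_{i=1}^{E} f_i \, P_i$$
with two ingredients. The hard fraction $f_i$ counts how often expert $i$ was actually routed to, normalized so that perfectly uniform routing gives $f_i = 1$:
$$f_i = \frac{E}{kT} \sum_{t=1}^{T} \mathbb{1}[\, i \in S_t \,]$$
The soft probability $P_i$ is the average router probability that expert $i$ received across the sequence, where $s_t \in \mathbb{R}^E$ are the un-biased router logits for token $t$:
$$P_i = \frac{1}{T} \sum_{t=1}^{T} \operatorname{softmax}(s_t)_i$$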
Why two terms — one hard, one soft
The hard term $f_i$ is non-differentiable: an argmax / top-$k$ cannot pass a gradient. The soft term $P_i$ is fully differentiable through the softmax. Their product is the engineering trick: gradient flows only through $P_i$, but it is weighted by the actual frequency $f_i$. So the loss says: "if expert $i$ is being routed to a lot (large $f_i$), then push its average probability $P_i$ down for next time."
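To see where that gradient lands, the chain rule can be carried one step further, down to the logits. This is a derivation sketch, assuming the loss has the form $L_{\text{seq}} = \alpha \sum_i f_i P_i$ with $P_i = \frac{1}{T}\sum_t p_{t,i}$, writing $p_{t,i} = \operatorname{softmax}(s_t)_i$ (a shorthand introduced here) and treating $f_i$ as a constant because top-$k$ passes no gradient:

```latex
\frac{\partial L_{\text{seq}}}{\partial P_i} = \alpha f_i,
\qquad
\frac{\partial P_i}{\partial s_{t,j}} = \frac{1}{T}\, p_{t,i}\,\bigl(\delta_{ij} - p_{t,j}\bigr)

\frac{\partial L_{\text{seq}}}{\partial s_{t,j}}
  = \sum_{i=1}^{E} \alpha f_i \cdot \frac{1}{T}\, p_{t,i}\,\bigl(\delta_{ij} - p_{t,j}\bigr)
  = \frac{\alpha}{T}\, p_{t,j} \Bigl( f_j - \sum_{i=1}^{E} f_i\, p_{t,i} \Bigr)
```

The sign carries the intuition: under gradient descent, the logit for expert $j$ is pushed down when $f_j$ exceeds the token's probability-weighted average of $f$, and pushed up otherwise. The whole expression is scaled by $\alpha / T$, which is why the push stays tiny.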
Why the bias is excluded from the logits
The router logits used in $P_i$ are pre-bias: the raw output of the router's linear layer, before the section-6.3 bias is added. If you let the bias appear inside the softmax here, the balance loss would push gradient into the bias, and the bias-term mechanism would lose its "non-differentiable controller" property. Keeping the two mechanisms cleanly separated is critical.
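The separation can be illustrated without any autograd machinery. The following is a minimal sketch in plain Python, assuming a hypothetical helper `route_and_score` and made-up bias values: the top-k selection reads logits plus bias, while the probabilities that feed the balance loss read the logits only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def route_and_score(logits, bias, k):
    """Hypothetical helper: selection uses logits + bias, loss probabilities use logits only."""
    biased = [s + b for s, b in zip(logits, bias)]
    chosen = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:k]
    probs = softmax(logits)  # pre-bias path: the bias never enters here
    return sorted(chosen), probs

logits = [3.2, 1.6, 0.4, 0.5]        # one token's pre-bias router logits
no_bias = [0.0, 0.0, 0.0, 0.0]
with_bias = [-3.0, 0.0, 1.5, 0.0]    # made-up bias that demotes expert 0, promotes expert 2

chosen_a, probs_a = route_and_score(logits, no_bias, k=2)
chosen_b, probs_b = route_and_score(logits, with_bias, k=2)

print("chosen without bias:", chosen_a)
print("chosen with bias:   ", chosen_b)
print("probs identical:    ", probs_a == probs_b)
```

The bias changes which experts are selected (and therefore the hard counts), but the probabilities are untouched by it. Since the loss is differentiated only through those probabilities, no gradient can reach the bias along this path.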
What is conspicuously absent
Notice this loss is averaged over $T$ (tokens within one sequence) and computed independently for each sequence in the batch. The batch dimension only appears in the very last reduction: the per-sequence losses are averaged together to give one scalar. There is no cross-sequence mixing inside the formula. That is the entire point: the loss sees only what is happening within one sequence at a time.
Manual Numerical Walkthrough
Let us run the loss on one toy sequence of $T = 6$ tokens, $E = 4$ experts, and top-$k$ routing with $k = 2$. Every number is calculated by hand so the mechanism is fully exposed.
Setup — a quietly collapsing sequence. Six tokens, four experts. The router strongly prefers expert 0 throughout the sequence, but the second choice rotates around, so on its own this sequence will not scream "collapse" to a batch-level monitor. Per-token pre-bias logits $s_t$:
- t=0: [3.2, 1.6, 0.4, 0.5]
- t=1: [3.1, 0.5, 1.4, 0.6]
- t=2: [2.9, 0.4, 0.5, 1.3]
- t=3: [3.0, 1.5, 0.5, 0.4]
- t=4: [3.3, 0.4, 1.2, 0.5]
- t=5: [3.1, 1.4, 0.5, 0.4]
Step 1 — hard top-k per token. Pick the top-2 indices per row:
- t=0 → {0, 1}, t=1 → {0, 2}, t=2 → {0, 3}
- t=3 → {0, 1}, t=4 → {0, 2}, t=5 → {0, 1}
Counts per expert: expert 0 was routed to 6 times, expert 1 → 3, expert 2 → 2, expert 3 → 1. Total routings = 6 + 3 + 2 + 1 = 12 = k·T = 2·6 (sanity check: ✓).
Step 2 — normalize to f. Multiply by E/(k·T) = 4/(2·6) = 1/3:
f = (1/3)·[6, 3, 2, 1] = [2.00, 1.00, 0.67, 0.33]
Read: expert 0 is taking 2× its uniform share; expert 3 is taking only 1/3 of its share. A flat / uniform sequence would give f = [1, 1, 1, 1].
Step 3 — softmax each token's scores. Doing token t=0 in detail. Subtract max: [0, -1.6, -2.8, -2.7]. Exponentiate: [1.000, 0.202, 0.061, 0.067], sum ≈ 1.330. So softmax(s_0) ≈ [0.752, 0.152, 0.046, 0.051].
Doing this for all six tokens gives (rounded to 3 decimals, so a row may sum to 0.999 or 1.001):
- t=0: [0.752, 0.152, 0.046, 0.051]
- t=1: [0.747, 0.055, 0.136, 0.061]
- t=2: [0.727, 0.060, 0.066, 0.147]
- t=3: [0.725, 0.162, 0.060, 0.054]
- t=4: [0.808, 0.044, 0.099, 0.049]
- t=5: [0.755, 0.138, 0.056, 0.051]
Step 4 — average columns to get P. Column means:
P ≈ [0.752, 0.102, 0.077, 0.069]
Read: the router's smooth view says expert 0 receives about 75% of each token's probability mass, on average. Experts 1–3 split the rest.
Step 5 — inner product and scale.
f · P = 2.00·0.752 + 1.00·0.102 + 0.67·0.077 + 0.33·0.069
= 1.505 + 0.102 + 0.051 + 0.023 = 1.681 (terms computed from unrounded values)
L_seq = α · 1.681 = 1e-4 · 1.681 ≈ 1.68e-4
What this number does to gradients. Take the partial derivative of $L_{\text{seq}}$ with respect to $P_i$. That is $\alpha f_i$; for expert 0 it equals $10^{-4} \cdot 2.00 = 2 \times 10^{-4}$. It is a small but non-zero push, multiplied by the f-weight, so the more over-used the expert is, the bigger the push. This is the directional hint the optimizer needs to send the next few tokens elsewhere.
Compare to a balanced sequence. If the same six tokens had been routed uniformly so $f = [1, 1, 1, 1]$, and the softmax was correspondingly flat so $P = [0.25, 0.25, 0.25, 0.25]$, then $f \cdot P = 1$ and $L_{\text{seq}} = 10^{-4}$. So the collapsing sequence pays about 1.7× the loss of a balanced one. Tiny in absolute terms, but precisely targeted.
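Sanity floor. In the idealized case where the hard fractions track the soft probabilities, $f_i = E \, P_i$, the product becomes $\sum_i f_i P_i = E \sum_i P_i^2$. Its minimum subject to $\sum_i P_i = 1$ is achieved at $P_i = 1/E$ (and hence $f_i = 1$) for all $i$, giving the value $1$. So the practical baseline of $L_{\text{seq}}$ is $\alpha$, and values above that reflect local imbalance being penalized.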
Visualizing Batch vs. Sequence Scope
The diagram below holds two sequences side by side. Sequence A is routed evenly; Sequence B suffers a within-sequence collapse onto expert 0. Toggle between Batch-level and Sequence-level to see how the same routing pattern produces a single averaged loss versus per-sequence losses. The α slider scales the loss without changing the routing.
Three things to lock in. First, in batch mode the f-bars and P-bars for sequence B's collapse get washed out by sequence A's spread, so the merged signal is misleadingly clean. Second, in sequence mode each sequence carries its own loss: sequence B pays more, and that extra penalty is exactly the gradient signal that will pull its next-step routing back toward uniform. Third, the absolute scale of the loss is microscopic even when you push α to the top of the slider's range; DeepSeek-V3's default of $10^{-4}$ sits well below the noise floor of the main LM loss.
Plain Python: One Sequence, By Hand
Below is the entire per-sequence calculation in NumPy. Six tokens, four experts. Every line maps one-to-one to a line in the math above — and the printed L_seq matches the walkthrough we just did by hand.
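A minimal NumPy sketch of that calculation follows. The function name `seq_balance_loss` is illustrative, and the routing bias is assumed to be zero in this toy, so top-k is taken directly on the raw logits.

```python
import numpy as np

def seq_balance_loss(scores, k, alpha):
    """Sequence-level balance loss for ONE sequence.

    scores: (T, E) array of pre-bias router logits.
    Returns (f, P, loss).
    """
    T, E = scores.shape

    # Step 1: hard top-k, the indices of the k largest logits per token.
    topk = np.argsort(-scores, axis=1)[:, :k]             # (T, k)

    # Step 2: f_i = E / (k T) * (number of routings to expert i).
    counts = np.bincount(topk.ravel(), minlength=E)       # (E,)
    f = counts * E / (k * T)

    # Step 3: softmax per token (max-subtracted for stability).
    z = scores - scores.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    probs = exp_z / exp_z.sum(axis=1, keepdims=True)      # (T, E)

    # Step 4: P_i = mean over tokens of the softmax probability.
    P = probs.mean(axis=0)                                # (E,)

    # Step 5: inner product, scaled by alpha.
    loss = alpha * float(f @ P)
    return f, P, loss

# The six toy tokens from the walkthrough.
scores = np.array([
    [3.2, 1.6, 0.4, 0.5],
    [3.1, 0.5, 1.4, 0.6],
    [2.9, 0.4, 0.5, 1.3],
    [3.0, 1.5, 0.5, 0.4],
    [3.3, 0.4, 1.2, 0.5],
    [3.1, 1.4, 0.5, 0.4],
])

f, P, loss = seq_balance_loss(scores, k=2, alpha=1e-4)
print("f     =", np.round(f, 2))
print("P     =", np.round(P, 3))
print(f"L_seq = {loss:.3e}")
```

Returning `f` and `P` alongside the loss is a convenience for inspection; a training loop would only need the scalar.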
Two structural details deserve a second pass. First, the normalization factor $E/(kT)$ on $f$ is what makes the loss insensitive to sequence length. Without it, a longer sequence would mechanically produce a larger loss, and the per-sequence reduction would just become a roundabout per-token average. With it, $f$ is dimensionless and uniform-target = 1, so a 4096-token sequence and a 256-token sequence are compared on equal terms.
Second, this implementation is intentionally non-batched. There is no batch dimension $B$; you call this function once per sequence. The PyTorch version below adds the batch dim and a final mean, but otherwise the math is identical.
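Sanity check. Use top-1 routing, set every routing decision to expert 0, and make every score vector effectively one-hot on expert 0. Then $f = [E, 0, \dots, 0]$, $P \approx [1, 0, \dots, 0]$, and $f \cdot P \approx E$, the worst possible value. With top-$k$ routing and $k > 1$, each token must pick $k$ distinct experts, so no $f_i$ can exceed $E/k$ and the worst case is $E/k$. At the other extreme, perfectly uniform routing and a flat softmax give $f \cdot P = 1$. The loss therefore spans a factor of $E$ (or $E/k$ under top-$k$) between full collapse and perfect balance.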
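These extremes can be checked directly. The following is a small sketch in plain Python, assuming a hypothetical helper `f_dot_p` that takes the routing decisions and probability vectors as given, so the extreme cases can be constructed exactly rather than coaxed out of a softmax.

```python
def f_dot_p(chosen, probs, num_experts, k):
    """Compute sum_i f_i * P_i for one sequence.

    chosen: per-token lists of selected expert indices (k per token).
    probs:  per-token probability vectors of length num_experts.
    """
    T = len(chosen)
    counts = [0] * num_experts
    for experts in chosen:
        for e in experts:
            counts[e] += 1
    f = [c * num_experts / (k * T) for c in counts]
    P = [sum(p[i] for p in probs) / T for i in range(num_experts)]
    return sum(fi * pi for fi, pi in zip(f, P))

E, T = 4, 8

# Full collapse under top-1: every token picks expert 0, probabilities one-hot.
collapsed = f_dot_p([[0]] * T, [[1.0, 0.0, 0.0, 0.0]] * T, E, k=1)

# Perfect balance: tokens cycle through the experts, probabilities flat.
uniform = f_dot_p([[t % E] for t in range(T)], [[0.25] * E] * T, E, k=1)

# Collapse under top-2: each token must pick two distinct experts, so f caps at E/k.
collapsed_top2 = f_dot_p([[0, 1]] * T, [[0.5, 0.5, 0.0, 0.0]] * T, E, k=2)

print("top-1 collapse:", collapsed)       # equals E
print("uniform:       ", uniform)         # equals 1
print("top-2 collapse:", collapsed_top2)  # equals E / k
```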
PyTorch: Vectorized Across the Batch
The production version lives inside the MoE forward pass and runs once per layer. The only new ideas are the batch dimension and the careful reduction order — average within each sequence first, then average across the batch.
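The reduction order can be sketched as follows. This is written in NumPy rather than PyTorch so that it runs without a deep learning framework, which means it shows the forward computation only and no autograd; the function name `batched_seq_balance_loss` and the second "spread" sequence are assumptions made for illustration. In PyTorch the same reductions would use `dim` where NumPy uses `axis`.

```python
import numpy as np

def batched_seq_balance_loss(scores, bias, k, alpha):
    """scores: (B, T, E) pre-bias router logits; bias: (E,) routing bias.

    Returns (batch_mean_loss, per_sequence_losses).
    """
    B, T, E = scores.shape

    # Routing decision: top-k on scores + bias. The bias affects selection only.
    topk = np.argsort(-(scores + bias), axis=-1)[..., :k]        # (B, T, k)

    # Hard fractions per sequence: one-hot the choices, count over tokens and slots.
    one_hot = np.eye(E)[topk]                                    # (B, T, k, E)
    f = one_hot.sum(axis=(1, 2)) * E / (k * T)                   # (B, E)

    # Measurement softmax: scores only, no bias.
    z = scores - scores.max(axis=-1, keepdims=True)
    exp_z = np.exp(z)
    probs = exp_z / exp_z.sum(axis=-1, keepdims=True)            # (B, T, E)

    # Reduce WITHIN each sequence first (over tokens) ...
    P = probs.mean(axis=1)                                       # (B, E)
    per_seq = alpha * (f * P).sum(axis=-1)                       # (B,)

    # ... and only then across the batch.
    return per_seq.mean(), per_seq

collapsing = [[3.2, 1.6, 0.4, 0.5], [3.1, 0.5, 1.4, 0.6], [2.9, 0.4, 0.5, 1.3],
              [3.0, 1.5, 0.5, 0.4], [3.3, 0.4, 1.2, 0.5], [3.1, 1.4, 0.5, 0.4]]
# Made-up sequence whose top-2 choices cover every expert exactly 3 times.
spread = [[2.0, 1.0, 0.1, 0.0], [0.0, 2.0, 1.0, 0.1], [0.1, 0.0, 2.0, 1.0],
          [1.0, 0.1, 0.0, 2.0], [2.0, 0.0, 1.0, 0.1], [0.1, 2.0, 0.0, 1.0]]

scores = np.array([collapsing, spread])                          # (2, 6, 4)
loss, per_seq = batched_seq_balance_loss(scores, np.zeros(4), k=2, alpha=1e-4)
print("per-sequence losses:", per_seq)
print("batch loss:         ", loss)
```

The collapsing sequence pays more than the spread one, and the batch loss is simply their mean. Neither sequence's counts ever enter the other's `f` or `P`.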
Three subtleties worth marking, all about how this loss interacts with the rest of the MoE block:
- The probs softmax and the routing softmax are different. The softmax inside this loss is purely for measuring imbalance — it is not the one whose top-k decides actual routing. In DeepSeek-V3 the routing decision uses scores-plus-bias, while this loss uses scores-only. Two softmaxes, two purposes.
- Gradient flows into the router weights, not the bias. Because the bias was not used in `probs`, the gradient of `L_seq` with respect to the bias parameters is exactly zero. The bias is updated only by the non-differentiable feedback rule from section 6.3. This is a clean separation of concerns: the bias controls long-run balance, the loss controls short-run balance.
- One loss per MoE layer. Modern stacks have dozens of MoE layers, and each has its own router and its own potential for collapse. The balance loss is computed independently per layer and summed; with $\alpha = 10^{-4}$ and 58 MoE layers (DeepSeek-V3) the total contribution to the loss is still only about $58 \times 10^{-4} \approx 6 \times 10^{-3}$ when each layer sits near the balanced value $f \cdot P \approx 1$, well under 1% of the LM loss.
On `scatter_add`: the version above is written for clarity. The gradient graph is identical either way; only the kernel layout changes.
What Changes at Massive Scale
At toy scale the sequence-level loss looks like a curiosity — a tiny add-on with a near-zero coefficient. At production scale the situation is dramatically different because the within-sequence collapse failure mode becomes much more likely as the routed pool grows. The DeepSeek-V3 numbers tell the story:
| Setting | Routed experts | Top-k | Tokens / sequence | Loss coefficient α |
|---|---|---|---|---|
| DeepSeek-V2 | 160 | 6 | 4096 | 1e-3 |
| DeepSeek-V3 (paper) | 256 | 8 | 4096 | 1e-4 |
| Mixtral 8×7B (no seq-loss) | 8 | 2 | 32k | — (uses aux loss) |
Two patterns to read here. First, as the routed pool grew from 160 to 256, DeepSeek reduced α from $10^{-3}$ to $10^{-4}$. The bias-term mechanism from section 6.3 got better at handling the bulk of the balancing job, so the safety net could be pulled tighter and quieter. The loss is now there mainly to handle the tail: the small fraction of sequences with unusual content distributions.
Second, Mixtral does not use a sequence-level loss at all. It uses the standard auxiliary load-balancing loss with a much larger coefficient. That works adequately for an 8-expert pool — there are not enough experts for within-sequence collapse to become a serious problem. But it would not scale to 256 experts: the auxiliary loss would have to be loud, and routing quality would suffer.
The interaction with sequence packing
Real training rarely uses one logical document per sequence. Multiple documents are concatenated up to the sequence length $T$ with an attention mask that prevents cross-document attention. From the sequence-level loss's point of view, this is a feature: a 4k-token packed sequence contains a varied document mix, so balanced routing is genuinely the right target. If you packed a single homogeneous document into the full 4k window (say, a long block of Chinese poetry), the loss would correctly fire and push the router away from the natural domain-driven collapse. Whether that is what you want is a separate design call; for general pre-training, it is.
The interaction with expert parallelism
Each MoE layer's router runs on every GPU that holds a copy of the model (data-parallel rank). The per-sequence loss is computed locally on the rank that owns the sequence — no all-reduce needed for the balance loss itself, because there is no cross-sequence mixing. Only the gradient accumulation does any sync, and that happens once for the whole step regardless. So the sequence-level loss is essentially free in communication terms, which is part of why DeepSeek can afford to add it at every MoE layer.
Engineering Reality and Gotchas
The sequence-level balance loss looks innocuous and is one of the easiest pieces of DeepSeek's MoE stack to get subtly wrong. Three failure modes are worth flagging:
- Letting the bias into the softmax. If you reuse the same `scores_with_bias` tensor for both the routing decision and the balance loss, the gradient of `L_seq` will leak into the bias parameters. The bias-term feedback rule was explicitly designed to be non-differentiable; letting gradient enter it muddies the control loop and reintroduces all the routing-quality problems section 6.2 documented. Keep the two paths separate.
- Using α too large. Raise α far above its default and the loss starts dominating the gradient direction for the router. Routing collapses into a forced uniform policy, and the model loses the ability to specialize. This is the same failure mode as the section-6.2 auxiliary loss, just slower and meaner because it strikes at the sequence scale.
- Averaging in the wrong order. Reducing first across the batch and then across the sequence is mathematically different from the intended "sequence first, batch second" order, and the former is exactly the auxiliary loss from section 6.2. Get the reduction order wrong and you are quietly running the old method while believing you are running the new one. Always check the `dim` arguments in your `.mean()` calls.
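The reduction-order gotcha is easy to demonstrate with numbers. The following is a minimal NumPy sketch, assuming hypothetical per-sequence statistics for a batch of two sequences with E = 2 and top-1 routing; the values of `f` and `P` are made up so that the two sequences collapse onto different experts.

```python
import numpy as np

# Hypothetical per-sequence statistics, shape (B, E) with B = 2, E = 2, k = 1.
f = np.array([[2.0, 0.0],    # sequence A: every token routed to expert 0
              [0.0, 2.0]])   # sequence B: every token routed to expert 1
P = np.array([[0.9, 0.1],    # sequence A: probability mass mostly on expert 0
              [0.1, 0.9]])   # sequence B: probability mass mostly on expert 1

# Intended order: dot product inside each sequence, THEN mean over the batch.
seq_first = (f * P).sum(axis=1).mean()

# Wrong order: average f and P over the batch first, THEN take the dot product.
batch_first = (f.mean(axis=0) * P.mean(axis=0)).sum()

print("sequence-first:", seq_first)    # well above the balanced value of 1
print("batch-first:   ", batch_first)  # exactly the balanced value
```

With the wrong order both collapses cancel and the loss reports perfect balance, so it produces no corrective signal at all. With the intended order each sequence is penalized on its own terms.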
The one sentence to carry forward: the sequence-level balance loss is the smallest piece of the DeepSeek MoE stack and the easiest to dismiss, yet it is what turns the bias-term mechanism from a batch-average controller into a true per-sequence balancer — and that is what makes the architecture behave at single-stream inference, not just at training time.