Chapter 6

Sequence-Level Balance Loss

Auxiliary-Loss-Free Load Balancing

Section 6.3 introduced DeepSeek's clean fix for load imbalance: a learned bias on each expert's router score, updated by a non-differentiable feedback rule. That mechanism is enough to balance the routing on average across the batch — and on average, it works beautifully. But averages can hide bad behavior at finer time scales. A batch can look balanced while individual sequences inside it are quietly collapsing. The sequence-level balance loss is the tiny safety net DeepSeek adds on top, designed to catch exactly this failure.

The promise of sequence-level balance. A nearly-zero auxiliary loss applied per sequence, not per batch, to penalize within-sequence routing collapse. The coefficient is so small ($\alpha = 10^{-4}$) that it does not corrupt the routing decision, yet it is just large enough that, when one sequence quietly funnels 80% of its tokens to a single expert, the gradient has somewhere to push back.

The Hidden Failure of Batch-Level Balance

Imagine a training batch with 32 sequences. The bias-term feedback rule from the previous section nudges biases up for under-used experts and down for over-used ones, measured over the whole batch. After a few hundred steps it converges to a clean state: across those 32 sequences combined, every expert handles roughly its fair share of tokens. The batch-level histogram looks like a flat line.

Now inspect the routing inside a single sequence. You may discover that sequence #7 sent 80% of its tokens to expert 12, while sequence #19 sent 80% of its tokens to expert 3. Both sequences are using only a sliver of the expert pool locally, but they happen to use different slivers — so when you average across the batch, the imbalance cancels out. The bias-term mechanism is satisfied. The model is not.
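To make the cancellation concrete, here is a minimal sketch with made-up routing decisions. Everything in it is an illustrative assumption (top-1 routing, four experts, four sequences of ten tokens), not DeepSeek data. Each sequence sends 70% of its tokens to its own favorite expert, yet the pooled histogram comes out flat.

```python
# Toy setup (assumed for illustration): 4 experts, 4 sequences, 10 tokens each, top-1 routing.
E = 4

# Sequence i sends 7 tokens to expert i and 1 token to each of the other experts.
sequences = [[fav] * 7 + [e for e in range(E) if e != fav] for fav in range(E)]


def load_fractions(routes, num_experts):
    """Share of the routed tokens that each expert received."""
    return [routes.count(e) / len(routes) for e in range(num_experts)]


per_sequence = [load_fractions(routes, E) for routes in sequences]
batch = load_fractions([e for routes in sequences for e in routes], E)

for i, fractions in enumerate(per_sequence):
    print(f"sequence {i}: {fractions}")
print(f"whole batch: {batch}")
```

A batch-level monitor only ever sees the last line, which is why it cannot tell this batch apart from a genuinely balanced one.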

| What you measure | Bias-term sees | Reality inside sequences |
| --- | --- | --- |
| Tokens per expert (batch) | Perfectly balanced (the goal) | Looks balanced; averages cancel |
| Tokens per expert (per sequence) | Not monitored | Can be wildly imbalanced |
| Expert utilization (forward) | Full | Locally degenerate, globally fine |
| Gradient signal per expert | Healthy on average | Spiky and high-variance within sequences |
| Decode-time stability | Looks fine in training | Single-stream inference collapses to 1–2 experts |

Why this matters at inference. Production inference often runs one sequence at a time on a node — there is no batch to average over. If the model has learned routing patterns that only balance across many sequences, each individual decoded sequence behaves like one of those degenerate sub-cases. You will see strong activity on 1–2 experts and dead silence on the rest, exactly the routing-collapse symptom from section 6.1, just disguised until deployment. The sequence-level loss makes the training signal match the inference distribution.

Two Time Scales, Two Different Failures

Picture a restaurant kitchen. The head chef (bias-term feedback) watches the whole shift and shouts "more salads, fewer steaks" when the totals drift out of balance. By the end of the night the count is even — across the whole shift, the kitchen handled equal numbers of every dish. But within any given hour the kitchen might have been doing nothing but desserts, then nothing but soups. The shift-level average is balanced; the hour-by-hour rhythm is broken.

The sequence-level loss is the line cook — a much quieter signal that operates inside each hour. It does not try to fix the absolute counts (the head chef already handles that). It just whispers "hey, you've sent the last four orders to the same station; spread the next one out." Two different control loops at two different time scales, doing two different jobs.

This split is fundamental to how DeepSeek thinks about load balancing. The bias-term feedback handles slow, global imbalance and does so without contributing any gradient. The sequence-level loss handles fast, local imbalance and contributes a microscopic gradient: small enough that it cannot reshape the routing decision, but present enough that the optimizer sees a directional hint when a sequence starts to collapse.

The right mental picture. Bias-term feedback is a thermostat — no-gradient, set-point controller, slow. The sequence-level loss is a gentle spring — gradient-bearing, restoring force, fast. The product of the two is a routing distribution that is balanced everywhere, not just on average.

The Math: A Per-Sequence Smoothing Loss

Fix one sequence with $T$ tokens and $E$ experts. Top-$k$ routing produces, for each token $t$, a set of chosen experts $\mathcal{T}_t \subset \{1, \dots, E\}$. The sequence-level balance loss DeepSeek-V3 uses is:

$$\mathcal{L}_{\text{seq}} = \alpha \sum_{i=1}^{E} f_i \cdot P_i$$

with two ingredients. The hard fraction $f_i$ counts how often expert $i$ was actually routed to, normalized so that perfectly uniform routing gives $f_i = 1$:

$$f_i = \frac{E}{k T} \sum_{t=1}^{T} \mathbb{1}\{i \in \mathcal{T}_t\}$$

The soft probability $P_i$ is the average router probability that expert $i$ received across the sequence, where $s_{t,i}$ are the pre-bias router logits for token $t$:

$$P_i = \frac{1}{T} \sum_{t=1}^{T} \frac{e^{s_{t,i}}}{\sum_{j} e^{s_{t,j}}}$$

Why two terms — one hard, one soft

The hard term $f_i$ is non-differentiable: an argmax / top-$k$ cannot pass a gradient. The soft term $P_i$ is fully differentiable through the softmax. Their product $f_i P_i$ is the engineering trick: gradient flows only through $P_i$, but it is weighted by the actual frequency $f_i$. So the loss says: "if expert $i$ is being routed to a lot ($f_i$ large), then push its average probability down for next time." Concretely, $\partial \mathcal{L}_{\text{seq}} / \partial P_i = \alpha f_i \ge 0$, so a gradient-descent step lowers $P_i$ in proportion to how over-used the expert is.

Why the bias is excluded from the logits

The router logits used in $P_i$ are pre-bias: the raw output of the router's linear layer, before the section-6.3 bias is added. If you let the bias appear inside the softmax here, the balance loss would push gradient into the bias, and the bias-term mechanism would lose its "non-differentiable controller" property. Keeping the two mechanisms cleanly separated is critical.
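The separation can be sketched in a few lines of plain Python. The bias values below are hypothetical, chosen only so that the bias visibly changes which experts win the top-k; the point is that the probabilities feeding the loss are computed from the raw scores and never see the bias.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(row, k):
    """Indices of the k largest entries, highest first."""
    return sorted(range(len(row)), key=lambda i: -row[i])[:k]


raw_scores = [3.2, 1.6, 0.4, 0.5]  # pre-bias router logits for one token
bias = [-2.0, 0.0, 1.5, 0.0]       # hypothetical balancing-bias values

# Routing path: the bias takes part in the top-k decision.
routed = top_k([s + b for s, b in zip(raw_scores, bias)], k=2)

# Loss path: probabilities come from the raw scores only, so the bias gets no gradient.
probs_for_loss = softmax(raw_scores)

print("routed with bias:   ", routed)
print("routed without bias:", top_k(raw_scores, k=2))
print("probs used by loss: ", [round(p, 3) for p in probs_for_loss])
```

The bias moves the routed set, while `probs_for_loss` is identical to what it would be with no bias at all.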

Why alpha is so small. DeepSeek-V3 sets $\alpha = 10^{-4}$. The main language-modeling loss is on the order of $\sim 1$. So for routing that is anywhere near balanced, the balance loss contributes on the order of $10^{-4}$ in magnitude, a thousand times smaller than what the auxiliary losses of section 6.2 used. That is the whole design philosophy: the bias term does the heavy lifting; this loss only needs to be present, not loud.

What is conspicuously absent

Notice this loss is averaged over $T$ (tokens within one sequence) and computed independently for each sequence in the batch. The batch dimension $B$ only appears in the very last reduction: the per-sequence losses are averaged together to give one scalar. There is no cross-sequence mixing inside the formula. That is the entire point: the loss sees only what is happening within one sequence at a time.

Manual Numerical Walkthrough

Let us run the loss on one toy sequence of $T = 6$ tokens, $E = 4$ experts, and top-$k$ routing with $k = 2$. Every number is calculated by hand so the mechanism is fully exposed.

One sequence, T=6, E=4, k=2, by hand

Setup: a quietly collapsing sequence. Six tokens, four experts. The router strongly prefers expert 0 throughout the sequence, but the second choice rotates around, so on its own this sequence will not scream "collapse" to a batch-level monitor. Per-token pre-bias logits $s_{t,i}$:

  • t=0: [3.2, 1.6, 0.4, 0.5]
  • t=1: [3.1, 0.5, 1.4, 0.6]
  • t=2: [2.9, 0.4, 0.5, 1.3]
  • t=3: [3.0, 1.5, 0.5, 0.4]
  • t=4: [3.3, 0.4, 1.2, 0.5]
  • t=5: [3.1, 1.4, 0.5, 0.4]

Step 1 — hard top-k per token. Pick the top-2 indices per row:

  • t=0 → {0, 1}, t=1 → {0, 2}, t=2 → {0, 3}
  • t=3 → {0, 1}, t=4 → {0, 2}, t=5 → {0, 1}

Counts per expert: expert 0 was routed to 6 times, expert 1 → 3, expert 2 → 2, expert 3 → 1. Total routings = $kT = 12$ (sanity check: $6 + 3 + 2 + 1 = 12$ ✓).

Step 2: normalize to f. Multiply by $E/(kT) = 4/12 = 1/3$:

f = (1/3)·[6, 3, 2, 1] = [2.00, 1.00, 0.67, 0.33]

Read: expert 0 is taking 2× its uniform share; expert 3 is taking only 1/3 of its share. A flat / uniform sequence would give $f = [1, 1, 1, 1]$.

Step 3: softmax each token's scores. Doing token t=0 in detail. Subtract max: $z = [0, -1.6, -2.8, -2.7]$. Exponentiate: $e^z \approx [1.000, 0.202, 0.061, 0.067]$, sum ≈ 1.330. So $p_0 \approx [0.752, 0.152, 0.046, 0.051]$.

Doing this for all six tokens gives (rounded to 3 decimals, so a row can miss 1.000 by a unit in the last place):

  • t=0: [0.752, 0.152, 0.046, 0.051]
  • t=1: [0.747, 0.055, 0.136, 0.061]
  • t=2: [0.727, 0.060, 0.066, 0.147]
  • t=3: [0.725, 0.162, 0.060, 0.054]
  • t=4: [0.808, 0.044, 0.099, 0.049]
  • t=5: [0.755, 0.138, 0.056, 0.051]

Step 4: average columns to get P. Column means:

P ≈ [0.752, 0.102, 0.077, 0.069]

Read: the router's smooth view says expert 0 wins about 75% of every token's probability mass, on average. Experts 1–3 split the rest.

Step 5: inner product and scale (carrying one extra decimal to limit rounding drift).

f · P = 2.000·0.7523 + 1.000·0.1019 + 0.667·0.0771 + 0.333·0.0687

= 1.5046 + 0.1019 + 0.0514 + 0.0229 = 1.6808 ≈ 1.681

L_seq = α · 1.681 = 1e-4 · 1.681 ≈ 1.68e-4
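The five steps can be replayed in plain Python as a check on the hand arithmetic. This is a minimal sketch that uses only the standard library and the toy logits from the setup; the helper names are ours.

```python
import math

ALPHA = 1e-4
K = 2
LOGITS = [
    [3.2, 1.6, 0.4, 0.5],
    [3.1, 0.5, 1.4, 0.6],
    [2.9, 0.4, 0.5, 1.3],
    [3.0, 1.5, 0.5, 0.4],
    [3.3, 0.4, 1.2, 0.5],
    [3.1, 1.4, 0.5, 0.4],
]
T, E = len(LOGITS), len(LOGITS[0])


def softmax(row):
    """Numerically stable softmax over one token's scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Steps 1-2: hard top-k counts, normalized so uniform routing gives f_i = 1.
counts = [0] * E
for row in LOGITS:
    for i in sorted(range(E), key=lambda j: -row[j])[:K]:
        counts[i] += 1
f = [c * E / (K * T) for c in counts]

# Steps 3-4: per-token softmax, then the column means.
probs = [softmax(row) for row in LOGITS]
P = [sum(p[i] for p in probs) / T for i in range(E)]

# Step 5: inner product, scaled by alpha.
f_dot_p = sum(f_i * p_i for f_i, p_i in zip(f, P))
loss = ALPHA * f_dot_p

print("counts:", counts)
print("f:", [round(x, 3) for x in f])
print("P:", [round(x, 3) for x in P])
print("f.P:", round(f_dot_p, 3), " L_seq:", f"{loss:.2e}")
```

Running it is a quick way to catch rounding slips when adapting the walkthrough to other logits.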

What this number does to gradients. Take the partial derivative of $\mathcal{L}_{\text{seq}}$ with respect to $P_0$. That is $\alpha \cdot f_0 = 10^{-4} \cdot 2.00 = 2 \times 10^{-4}$: a small but non-zero push, multiplied by the f-weight, so the more over-used the expert is, the bigger the push. This is the directional hint the optimizer needs to send the next few tokens elsewhere.
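To see where that push lands on the router's logits, apply the softmax chain rule with $f$ held constant: $\partial \mathcal{L}_{\text{seq}} / \partial s_{t,j} = \frac{\alpha}{T}\, p_{t,j} \big(f_j - \sum_i f_i\, p_{t,i}\big)$. The sketch below checks this derived expression against a central finite difference on the walkthrough logits. The function names are ours, and freezing $f$ reflects the assumption that top-k passes no gradient.

```python
import math

ALPHA = 1e-4
# Hard fractions from the walkthrough, treated as constants (top-k passes no gradient).
F = [2.0, 1.0, 2.0 / 3.0, 1.0 / 3.0]
LOGITS = [
    [3.2, 1.6, 0.4, 0.5],
    [3.1, 0.5, 1.4, 0.6],
    [2.9, 0.4, 0.5, 1.3],
    [3.0, 1.5, 0.5, 0.4],
    [3.3, 0.4, 1.2, 0.5],
    [3.1, 1.4, 0.5, 0.4],
]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def loss(logits):
    """alpha * sum_i f_i * P_i, with f frozen at the walkthrough values."""
    probs = [softmax(row) for row in logits]
    P = [sum(p[i] for p in probs) / len(logits) for i in range(len(F))]
    return ALPHA * sum(f_i * p_i for f_i, p_i in zip(F, P))


def analytic_grad(logits, t, j):
    """(alpha / T) * p_tj * (f_j - sum_i f_i * p_ti)."""
    p = softmax(logits[t])
    weighted = sum(f_i * p_i for f_i, p_i in zip(F, p))
    return ALPHA / len(logits) * p[j] * (F[j] - weighted)


def numeric_grad(logits, t, j, eps=1e-5):
    """Central finite difference of the loss with respect to one logit."""
    up = [row[:] for row in logits]
    down = [row[:] for row in logits]
    up[t][j] += eps
    down[t][j] -= eps
    return (loss(up) - loss(down)) / (2 * eps)


for j in range(4):
    print(f"expert {j}: analytic {analytic_grad(LOGITS, 0, j):+.3e}, "
          f"numeric {numeric_grad(LOGITS, 0, j):+.3e}")
```

For token 0 the gradient is positive on the over-used expert 0 and negative on the other three, so a descent step lowers expert 0's logit and raises the rest.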

Compare to a balanced sequence. If the same six tokens had been routed uniformly so $f = [1, 1, 1, 1]$, and the softmax was correspondingly flat so $P \approx [0.25, 0.25, 0.25, 0.25]$, then $f \cdot P = 4 \cdot (1 \cdot 0.25) = 1.00$ and $\mathcal{L}_{\text{seq}} = 10^{-4}$. So the collapsing sequence pays $\sim 1.7\times$ the loss of a balanced one. Tiny in absolute terms, but precisely targeted.

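Sanity floor. The hard fractions satisfy $\sum_i f_i = E$ and the soft probabilities satisfy $\sum_i P_i = 1$. Because top-$k$ picks the highest-scoring experts, $f$ and $P$ are normally ordered the same way (the experts holding the most probability mass are also the most-routed ones), and for similarly ordered vectors Chebyshev's sum inequality gives $\sum_i f_i P_i \ge \frac{1}{E} \big(\sum_i f_i\big)\big(\sum_i P_i\big) = 1$. The bound is met at $f_i = 1$, $P_i = 1/E$ for all $i$. So in that regime the floor of $\mathcal{L}_{\text{seq}}$ is $\alpha$, and anything above that is local imbalance being penalized. The two sum constraints alone do not force this: if the routed experts were the low-probability ones, the sum could drop below 1, which is why the ordering matters.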
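A quick numeric probe of the floor, under the explicit assumption that $f$ and $P$ are ordered alike (which top-k makes typical but does not guarantee). The sketch draws random vectors with the right sums, sorts both the same way, and records the smallest inner product seen; it also confirms that a uniform $f$ gives exactly 1 for any $P$.

```python
import random

random.seed(0)
E = 4


def random_with_sum(n, total):
    """n non-negative numbers that sum to `total`."""
    raw = [random.random() for _ in range(n)]
    s = sum(raw)
    return [total * r / s for r in raw]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


lowest = float("inf")
for _ in range(10_000):
    f = sorted(random_with_sum(E, E), reverse=True)    # sums to E
    P = sorted(random_with_sum(E, 1.0), reverse=True)  # sums to 1, same ordering as f
    lowest = min(lowest, dot(f, P))

# With perfectly uniform hard fractions the value is sum(P) = 1, whatever P looks like.
uniform_case = dot([1.0] * E, random_with_sum(E, 1.0))

print("lowest f.P over similarly ordered samples:", lowest)
print("f.P with uniform f:", uniform_case)
```

The random draws ignore the extra cap $f_i \le E/k$ that real top-k routing imposes; that only makes the probe more permissive than the real setting.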

Visualizing Batch vs. Sequence Scope

The diagram below holds two sequences side by side. Sequence A is routed evenly; Sequence B suffers a within-sequence collapse onto expert 0. Toggle between Batch-level and Sequence-level to see how the same routing pattern produces a single averaged loss versus per-sequence losses. The α slider scales the loss without changing the routing.

[Interactive figure: sequence-balance visualizer]

Three things to lock in. First, in batch mode the f-bars and P-bars for sequence B's collapse get washed out by sequence A's spread, so the merged signal is misleadingly clean. Second, in sequence mode each sequence carries its own loss: sequence B pays more, and that extra penalty is exactly the gradient signal that will pull its next-step routing back toward uniform. Third, the absolute scale of the loss is microscopic even when you push α up to $2 \times 10^{-3}$; DeepSeek-V3's default of $10^{-4}$ sits well below the noise floor of the main LM loss.

Plain Python: One Sequence, By Hand

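Below is the entire per-sequence calculation in NumPy. Six tokens, four experts. Every line maps one-to-one to a line in the math above. Note that the scores here are not the collapsing logits from the walkthrough: they describe a balanced sequence in which every expert is picked three times, so f = [1, 1, 1, 1] and the printed L_seq sits at the floor, α = 1e-4, which is the balanced comparison case from the walkthrough.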

sequence_balance_numpy.py

```python
import numpy as np

# Tiny toy: T = 6 tokens, E = 4 experts, top-k = 2.
T, E, k = 6, 4, 2
alpha = 1e-4

# (a) Per-token, per-expert router scores (T, E): raw logits, pre-softmax and pre-bias.
scores = np.array([
    [2.0, 0.4, 1.6, 0.5],
    [0.3, 2.1, 0.6, 1.5],
    [1.8, 1.5, 0.4, 0.5],
    [0.4, 0.5, 1.9, 1.4],
    [0.4, 1.7, 1.6, 0.5],
    [1.7, 0.4, 0.5, 1.5],
])

# (b) Softmax over the experts for each token.
def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(scores)                                   # (T, E)

# (c) Top-k indices per token define the actual routing.
topk = np.argsort(-scores, axis=-1)[:, :k]                # (T, k)

# (d) f_i = fraction of tokens routed to expert i, normalized so uniform = 1.
counts = np.zeros(E)
for r in topk:
    counts[r] += 1
f = (E / (k * T)) * counts                                # (E,)

# (e) P_i = average router probability for expert i over the sequence.
P = probs.mean(axis=0)                                    # (E,)

# (f) Per-sequence balance loss.
L_seq = alpha * (f * P).sum()
print(f"f = {f}")
print(f"P = {P}")
print(f"L_seq = {L_seq:.3e}")
```

Notes on the code, in order of appearance:

  • T, E, k = 6, 4, 2 (toy sequence). A single sequence of 6 tokens, 4 experts, top-2 routing. Real systems use 256+ experts and 4k+ tokens; the shape of the math is identical.
  • alpha = 1e-4 (coefficient stays tiny). DeepSeek-V3 uses α = 1e-4, about three orders of magnitude smaller than the auxiliary losses of section 6.2. Just enough to nudge a collapsing sequence, not enough to bend the routing decision away from quality.
  • scores, shape (6, 4) (router scores tensor). These are the raw outputs of the routing linear layer, BEFORE softmax and BEFORE the bias term from section 6.3 is added. The balance loss attaches to the pre-bias scores so the bias does not receive gradient through it.
  • softmax (per-token softmax). We need a smooth probability vector over experts for each token; softmax converts raw scores into a distribution that sums to 1. This smooth view is what gives the balance loss a usable gradient, because the hard top-k pick alone is non-differentiable.
  • probs, shape (6, 4). Each row is one token's belief about every expert. probs[t, i] is how strongly the router wanted to send token t to expert i.
  • topk, shape (6, 2) (hard top-k for the f term). argsort on the negated scores returns indices in descending score order; we keep the first k = 2 columns per row. This is the actual routing decision, and f counts these picks.
  • counts (count routings per expert). Loop over the 6 tokens and, for each one, increment counts at the 2 chosen expert indices. The total number of routings is k * T = 12. For the scores above, counts = [3, 3, 3, 3].
  • f, shape (4,) (normalize counts to a uniform-target ratio). The factor E / (k * T) rescales so that perfectly uniform routing gives f_i = 1 for every expert. If expert i got nothing, f_i = 0; if it took double its share, f_i = 2. Normalizing makes the loss numerically comparable across sequence lengths.
  • P, shape (4,) (average probability per expert). P_i is just the mean of probs[:, i] over the sequence. It captures what the router "wanted" to do on average, even for experts that did not survive the top-k cut. This is the term that carries the gradient into the score logits.
  • L_seq (inner product gives the loss). Sum f_i * P_i over the experts and scale by α. The loss is large when a heavily-routed expert (large f_i) is also a high-probability expert (large P_i), i.e. when the router is doubling down on a popular expert. Because top-k makes f track P, the practical way to lower the loss is to flatten P, which pushes the router toward uniformity. For the balanced scores above, L_seq ≈ α = 1e-4.

Two structural details deserve a second pass. First, the normalization factor $E/(kT)$ on $f$ is what makes the loss insensitive to sequence length. Without it, a longer sequence would mechanically produce a larger loss, and the per-sequence reduction would just become a roundabout per-token average. With it, $f_i$ is dimensionless and uniform-target = 1, so a 4096-token sequence and a 256-token sequence are compared on equal terms.
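The length-insensitivity claim is easy to check with a small sketch: repeat the walkthrough's routing pattern 100 times and compare the raw counts with the normalized fractions. The helper name is ours.

```python
def hard_fractions(routes, num_experts, k):
    """Return (f, counts) where f_i = counts_i * E / (k * T)."""
    T = len(routes)
    counts = [0] * num_experts
    for chosen in routes:
        for i in chosen:
            counts[i] += 1
    f = [c * num_experts / (k * T) for c in counts]
    return f, counts


# Top-2 picks from the walkthrough (T = 6) and the same pattern repeated (T = 600).
pattern = [(0, 1), (0, 2), (0, 3), (0, 1), (0, 2), (0, 1)]
f_short, counts_short = hard_fractions(pattern, num_experts=4, k=2)
f_long, counts_long = hard_fractions(pattern * 100, num_experts=4, k=2)

print("raw counts:", counts_short, "vs", counts_long)
print("normalized:", [round(x, 3) for x in f_short], "vs", [round(x, 3) for x in f_long])
```

The raw counts grow by a factor of 100, while the normalized fractions stay the same, which is what lets sequences of different lengths share one scale.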

Second, this implementation is intentionally non-batched. There is no BB dimension; you call this function once per sequence. The PyTorch version below adds the batch dim and a final mean, but otherwise the math is identical.

Sanity check. Take top-1 routing ($k = 1$), set every routing decision to $i = 0$ and every score vector to $[100, 0, 0, 0]$. Then $f = [4, 0, 0, 0]$, $P \approx [1, 0, 0, 0]$, and $f \cdot P = 4$, the worst possible value ($= E$). For general $k$ an expert can be chosen at most once per token, so $f_i \le E/k$ and the worst case is $E/k$ (2 for the top-2 toy above). At the other extreme, perfectly uniform routing and a flat softmax give $f \cdot P = 1$. The loss spans a factor of $E/k$ between full collapse and perfect balance.
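The extremes can be evaluated directly. The sketch below plugs hand-picked counts and probabilities into $\sum_i f_i P_i$ (no α), making explicit the constraint that an expert is chosen at most once per token; the inputs are idealized limits, not outputs of a real router.

```python
def f_dot_p(counts, P, k, T):
    """sum_i f_i * P_i with f_i = counts_i * E / (k * T)."""
    E = len(counts)
    return sum(c * E / (k * T) * p for c, p in zip(counts, P))


T = 6

# Top-1 routing, every token sent to expert 0, all probability mass on expert 0.
worst_top1 = f_dot_p([6, 0, 0, 0], [1.0, 0.0, 0.0, 0.0], k=1, T=T)

# Top-2 routing: each token must pick two distinct experts, so f_0 is capped at E / k.
worst_top2 = f_dot_p([6, 6, 0, 0], [1.0, 0.0, 0.0, 0.0], k=2, T=T)

# Perfectly uniform top-2 routing with a flat softmax.
balanced = f_dot_p([3, 3, 3, 3], [0.25, 0.25, 0.25, 0.25], k=2, T=T)

print(worst_top1, worst_top2, balanced)
```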

PyTorch: Vectorized Across the Batch

The production version lives inside the MoE forward pass and runs once per layer. The only new ideas are the batch dimension and the careful reduction order — average within each sequence first, then average across the batch.

sequence_balance_pytorch.py

```python
import torch
import torch.nn.functional as F

def sequence_balance_loss(
    scores: torch.Tensor,  # (B, T, E): raw router logits, NO bias added
    topk_idx: torch.Tensor,  # (B, T, k): actual routed experts per token (int64)
    alpha: float = 1e-4,
) -> torch.Tensor:
    """
    Per-sequence balance loss from DeepSeek-V3. Reduced as the mean across B.

    Note: 'scores' here are pre-bias router logits. The bias from section 6.3
    is only added to decide topk_idx; it must NOT leak gradients here.
    """
    B, T, E = scores.shape
    k = topk_idx.shape[-1]

    # (a) Smooth term: per-expert mean probability inside each sequence.
    probs = F.softmax(scores, dim=-1)              # (B, T, E)
    P = probs.mean(dim=1)                           # (B, E)

    # (b) Hard term: per-expert routed-fraction inside each sequence.
    #     Built from integer counts, so it carries no gradient.
    onehot = F.one_hot(topk_idx, num_classes=E)     # (B, T, k, E)
    counts = onehot.sum(dim=(1, 2)).float()         # (B, E)
    f = counts * (E / (k * T))                       # (B, E)

    # (c) Per-sequence inner product, then mean across the batch.
    L_per_seq = (f * P).sum(dim=-1)                 # (B,)
    return alpha * L_per_seq.mean()
```

Notes on the code, in order of appearance:

  • scores, topk_idx (inputs: pre-bias scores and the routed picks). scores is (B, T, E): B sequences in the batch, T tokens per sequence, E experts. topk_idx holds the chosen expert indices per token. The contract is that scores carries NO bias term, so gradients only flow into the router weight matrix.
  • B, T, E = scores.shape and k = topk_idx.shape[-1] (read off the shapes). No hardcoded constants. The same function works for a 6-token toy and a 4096-token production sequence.
  • probs = F.softmax(scores, dim=-1) (softmax along the expert axis). Normalizes each token's score vector to a probability distribution over the E experts. Shape stays (B, T, E).
  • P = probs.mean(dim=1) (average probabilities WITHIN each sequence). Critical line. dim=1 averages over tokens, per sequence and not per batch; the result has shape (B, E). This is exactly what makes the loss "sequence-level". Compare with dim=(0, 1), which would average over batch and tokens together and reproduce the auxiliary loss from section 6.2.
  • onehot = F.one_hot(topk_idx, num_classes=E) (one-hot the routed indices). Expands (B, T, k) into (B, T, k, E) with a 1 in the chosen-expert slot. Summing these gives the count of times each expert was routed to.
  • counts = onehot.sum(dim=(1, 2)).float() (collapse to per-sequence counts). Sum across the T and k dimensions but NOT B, so we keep one count vector per sequence, shape (B, E). .float() promotes from long for the scaling that follows.
  • f = counts * (E / (k * T)) (normalize counts so uniform = 1). Same trick as the NumPy version, vectorized. The factor is a scalar, so it broadcasts cleanly; f has shape (B, E).
  • L_per_seq = (f * P).sum(dim=-1) (per-sequence inner product). (f * P) is element-wise with shape (B, E); summing over dim=-1 gives one scalar loss per sequence, shape (B,). This is the "per-sequence" part: sequences are never mixed here.
  • return alpha * L_per_seq.mean() (batch mean and alpha scaling). Mean across the batch is the standard reduction. α = 1e-4 keeps the term tiny, about 1e-4 times the scale of the main LM loss, so the balance term smooths routing without overpowering language modeling.

Three subtleties worth marking, all about how this loss interacts with the rest of the MoE block:

  1. The probs softmax and the routing softmax are different. The softmax inside this loss is purely for measuring imbalance — it is not the one whose top-k decides actual routing. In DeepSeek-V3 the routing decision uses scores-plus-bias, while this loss uses scores-only. Two softmaxes, two purposes.
  2. Gradient flows into the router weights, not the bias. Because the bias was not used in probs, the gradient of L_seq with respect to the bias parameters is exactly zero. The bias is updated only by the non-differentiable feedback rule from section 6.3. This is a clean separation of concerns: the bias controls long-run balance, the loss controls short-run balance.
  3. One loss per MoE layer. Modern stacks have dozens of MoE layers, and each has its own router and its own potential for collapse. The balance loss is computed independently per layer and summed; with $\alpha = 10^{-4}$ and 58 MoE layers (DeepSeek-V3) the total contribution to the loss is still only $\sim 0.006$ when every layer sits near its floor, well under 1% of the LM loss.
Implementation note. Real codepaths fuse this with the existing router softmax and avoid the explicit one-hot via a scatter_add. The version above is written for clarity. The gradient graph is identical either way — only the kernel layout changes.

What Changes at Massive Scale

At toy scale the sequence-level loss looks like a curiosity — a tiny add-on with a near-zero coefficient. At production scale the situation is dramatically different because the within-sequence collapse failure mode becomes much more likely as the routed pool grows. The DeepSeek-V3 numbers tell the story:

| Setting | Routed experts | Top-k | Tokens / sequence | Loss coefficient α |
| --- | --- | --- | --- | --- |
| DeepSeek-V2 | 160 | 6 | 4096 | 1e-3 |
| DeepSeek-V3 (paper) | 256 | 8 | 4096 | 1e-4 |
| Mixtral 8×7B (no seq-loss) | 8 | 2 | 32k | n/a (uses aux loss) |

Two patterns to read here. First, as the routed pool grew from 160 to 256, DeepSeek reduced α from $10^{-3}$ to $10^{-4}$. The bias-term mechanism from section 6.3 got better at handling the bulk of the balancing job, so the safety net could be pulled tighter and quieter. The loss is now there mainly to handle the tail: the small fraction of sequences with unusual content distributions.

Second, Mixtral does not use a sequence-level loss at all. It uses the standard auxiliary load-balancing loss with a much larger coefficient. That works adequately for an 8-expert pool — there are not enough experts for within-sequence collapse to become a serious problem. But it would not scale to 256 experts: the auxiliary loss would have to be loud, and routing quality would suffer.

The interaction with sequence packing

Real training rarely uses one logical document per sequence. Multiple documents are concatenated up to $T = 4096$ with an attention mask that prevents cross-document attention. From the sequence-level loss's point of view, this is a feature: a 4k-token packed sequence contains a varied document mix, so balanced routing is genuinely the right target. If you packed a single homogeneous document into the full 4k window (say, a long block of Chinese poetry), the loss would correctly fire and push the router away from the natural domain-driven collapse. Whether that is what you want is a separate design call; for general pre-training, it is.

The interaction with expert parallelism

Each MoE layer's router runs on every GPU that holds a copy of the model (data-parallel rank). The per-sequence loss is computed locally on the rank that owns the sequence — no all-reduce needed for the balance loss itself, because there is no cross-sequence mixing. Only the gradient accumulation does any sync, and that happens once for the whole step regardless. So the sequence-level loss is essentially free in communication terms, which is part of why DeepSeek can afford to add it at every MoE layer.

Engineering Reality and Gotchas

The sequence-level balance loss looks innocuous and is one of the easiest pieces of DeepSeek's MoE stack to get subtly wrong. Three failure modes are worth flagging:

  1. Letting the bias into the softmax. If you reuse the same scores_with_bias tensor for both the routing decision and the balance loss, the gradient of L_seq will leak into the bias parameters. The bias-term feedback rule was explicitly designed to be non-differentiable; letting gradient enter it muddies the control loop and reintroduces all the routing-quality problems section 6.2 documented. Keep the two paths separate.
  2. Using α too large. Try $\alpha = 10^{-2}$ and the loss starts dominating the gradient direction for the router. Routing collapses into a forced uniform policy, and the model loses the ability to specialize. This is the same failure mode as the section-6.2 auxiliary loss, just slower and meaner because it strikes at the sequence scale.
  3. Averaging in the wrong order. Reducing first across the batch and then across the sequence is mathematically different from the intended "sequence first, batch second" order — and the former is exactly the auxiliary loss from section 6.2. Get the reduction order wrong and you are quietly running the old method while believing you are running the new one. Always check the dim arguments in your .mean() calls.
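The third failure mode can be shown in a few lines. The sketch below uses two hypothetical sequences (top-1 routing, two experts, four tokens each, hand-picked probabilities) that collapse onto different experts, then compares the intended "sequence first" reduction with a pooled batch-level one. The helper name and all numbers are illustrative assumptions.

```python
def balance_term(routes, probs, num_experts, k):
    """sum_i f_i * P_i for one group of tokens (alpha left out)."""
    T = len(routes)
    total = 0.0
    for i in range(num_experts):
        count = sum(1 for chosen in routes if i in chosen)
        f_i = count * num_experts / (k * T)
        P_i = sum(p[i] for p in probs) / T
        total += f_i * P_i
    return total


E, K, T = 2, 1, 4
seq_a = {"routes": [(0,)] * T, "probs": [[0.9, 0.1]] * T}  # collapsed onto expert 0
seq_b = {"routes": [(1,)] * T, "probs": [[0.1, 0.9]] * T}  # collapsed onto expert 1

# Intended order: one value per sequence, then the mean across the batch.
per_sequence = [balance_term(s["routes"], s["probs"], E, K) for s in (seq_a, seq_b)]
sequence_first = sum(per_sequence) / len(per_sequence)

# Wrong order: pool all tokens of the batch first, then compute a single value.
pooled = balance_term(
    seq_a["routes"] + seq_b["routes"], seq_a["probs"] + seq_b["probs"], E, K
)

print("sequence-first:", round(sequence_first, 3))
print("pooled batch:  ", round(pooled, 3))
```

The sequence-first value sits well above the floor of 1, while the pooled value lands on the floor: the two collapses cancel and the penalty disappears.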
How DeepSeek reports the balance state. During training, DeepSeek logs three quantities per MoE layer: the max-over-experts of $f_i$, the per-sequence loss, and the running update rate of the bias term. If the first goes above ~1.3, the second spikes, or the third saturates at the bias-update cap, an operator knows the routing is drifting. The sequence-level loss is half engineering signal, half observability instrument.
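A monitor for the first of those quantities might look like the sketch below. It is a hypothetical helper, not DeepSeek's logging code; the alert level simply reuses the ~1.3 figure quoted above and should be treated as a tunable assumption.

```python
ALERT_LEVEL = 1.3  # threshold quoted in the text; treat as a tunable assumption


def max_load_ratio(routes, num_experts, k):
    """max_i f_i for one sequence; routes holds the k chosen experts per token."""
    T = len(routes)
    counts = [0] * num_experts
    for chosen in routes:
        for i in chosen:
            counts[i] += 1
    return max(counts) * num_experts / (k * T)


# Top-2 picks: the collapsing walkthrough sequence and the balanced NumPy toy.
collapsing = [(0, 1), (0, 2), (0, 3), (0, 1), (0, 2), (0, 1)]
balanced = [(0, 2), (1, 3), (0, 1), (2, 3), (1, 2), (0, 3)]

for name, routes in (("collapsing", collapsing), ("balanced", balanced)):
    ratio = max_load_ratio(routes, num_experts=4, k=2)
    flag = "ALERT" if ratio > ALERT_LEVEL else "ok"
    print(f"{name}: max f_i = {ratio:.2f} -> {flag}")
```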

The one sentence to carry forward: the sequence-level balance loss is the smallest piece of the DeepSeek MoE stack and the easiest to dismiss, yet it is what turns the bias-term mechanism from a batch-average controller into a true per-sequence balancer — and that is what makes the architecture behave at single-stream inference, not just at training time.

Where we go from here. Chapter 6 closes here. Sections 6.1–6.4 together describe how DeepSeek replaces the loud auxiliary loss with a quiet two-control stack: a non-differentiable bias-term feedback rule for slow global balance, and a microscopic sequence-level loss for fast local balance. Chapter 7 leaves load balancing and turns to the second pillar of DeepSeek-V3's training innovation — Multi-Token Prediction.