Boo-AI — Master Artificial Intelligence by Building from Scratch

The Tension We Inherit

Two sections ago we watched a mixture-of-experts layer collapse. A handful of experts captured almost all the routing decisions, every other expert went idle, and the model effectively shrank from N experts down to four or five. One section ago we tried the standard fix: an auxiliary load-balancing loss. It worked — the load curves flattened — but the gradient of that loss flowed straight into the same parameters that produce the token logits, and the model paid for the balance with a measurable hit on language-modeling quality. Bigger model, lower quality. That is the worst possible trade in large-scale training.

So we arrive at this section with a very specific question: can we get the discipline of the auxiliary loss without paying its gradient bill? DeepSeek's answer, introduced in 2024 and adopted in the DeepSeek-V3 technical report, is a strikingly small modification to the router. It introduces a single per-expert scalar, updates it with no gradient at all, and is reported to produce flatter load curves than the auxiliary loss without the quality penalty. This section is about that scalar.

The one-sentence version. Add a per-expert bias $b_i$ to the score $s_{i,t}$ used to select the top-K experts — but not to the score used to weight them. Then nudge $b_i$ down whenever expert $i$ is over-served and up whenever it's under-served. The model gradient never sees $b_i$ at all.

The Trick: Decouple Selection from Weighting

Standard top-K routing fuses two distinct decisions into one function. First, the router decides which experts a token activates. Second, it decides how much each chosen expert contributes to the output. In conventional top-K routing, both of those decisions are read off the same per-(token, expert) affinity score $s_{i,t}$.

That fusion is what made the auxiliary-loss approach painful. To change which experts get picked, the auxiliary loss had to change how much each gets weighted, because they came from the same logits. The two effects rode the same gradient.

The bias-term trick unfuses them — physically. Let $s_{i,t}$ be the raw affinity (the model's own opinion of how relevant expert $i$ is to token $t$ ). Now build the routing decision from a different score:

$\tilde{s}_{i,t} = s_{i,t} + b_i$ where $b_i$ is a per-expert scalar that depends only on recent load — not on the token, not on the model weights.

The router selects top-K from $\tilde{s}_{i,t}$ . The gating weights it emits — what mathematically combines the experts' outputs — use the original $s_{i,t}$ with no bias. Selection and weighting are now driven by two different scores. They can be tuned independently. That's the entire structural move.

Think of $b_i$ as a per-expert queue-shedding signal. The model still has its own opinion ($s_{i,t}$) about which expert is relevant. The router uses that opinion plus a small nudge from the dispatcher ($b_i$) to break ties in favor of less-loaded experts. Once an expert is in the top-K, its contribution is decided by the model's opinion alone.
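A minimal sketch of the split for a single token, in plain Python. The helper name `select_and_weight` and the example numbers are illustrative assumptions, not taken from DeepSeek's code; ties are broken toward the lower expert index, which is also an assumption.

```python
def select_and_weight(raw_scores, bias, k):
    """Pick top-k experts on (raw + bias); weight them by raw scores only."""
    n = len(raw_scores)
    adjusted = [raw_scores[i] + bias[i] for i in range(n)]
    # Rank experts by adjusted score, highest first; ties go to the lower index.
    order = sorted(range(n), key=lambda i: (-adjusted[i], i))
    chosen = sorted(order[:k])
    # Gate weights renormalize the RAW scores of the chosen experts.
    total = sum(raw_scores[i] for i in chosen)
    gates = {i: raw_scores[i] / total for i in chosen}
    return chosen, gates


raw = [0.9, 0.6, 0.5, 0.1]

# No bias: experts 0 and 1 win on raw affinity alone.
chosen_plain, gates_plain = select_and_weight(raw, [0.0, 0.0, 0.0, 0.0], k=2)

# Expert 1 has been over-served, so the dispatcher penalizes it by 0.2.
chosen_biased, gates_biased = select_and_weight(raw, [0.0, -0.2, 0.0, 0.0], k=2)

print(chosen_plain, gates_plain)
print(chosen_biased, gates_biased)
```

With the penalty, expert 2 replaces expert 1 in the chosen set, yet the gate weights are still ratios of raw affinities (0.9 and 0.5), so the bias changes who is picked without changing how the picked experts are weighted.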

The Mathematical Formulation

Let the MoE layer have $N$ experts, route top- $K$ per token, and process $T$ tokens per batch. The router holds a learned centroid $e_i \in \mathbb{R}^d$ per expert and, separately, a non-trainable per-expert bias $b_i \in \mathbb{R}$ .

Affinity (the model's opinion)

For each token hidden state $u_t \in \mathbb{R}^d$ , the per-expert affinity is $s_{i,t} = \sigma(u_t^\top e_i)$ , where $\sigma$ is the elementwise sigmoid. Each $s_{i,t}$ lives in $(0, 1)$ independently of the others — independent, not normalized into a softmax. This choice matters: it means adding a bias to expert $i$ 's score does not steal probability mass from any other expert. The bookkeeping stays local.

Routing decision (selection only)

Define the bias-adjusted score $\tilde{s}_{i,t} = s_{i,t} + b_i$ . The router picks the top- $K$ indices:

$\mathcal{T}_t = \operatorname{Top\text{-}K}_i\,\tilde{s}_{i,t}$ — the set of $K$ expert indices a token $t$ activates.

Gating weights (weighting only)

For each $i \in \mathcal{T}_t$ , the gating weight is the renormalized raw affinity:

$g_{i,t} = \frac{s_{i,t}}{\sum_{j \in \mathcal{T}_t} s_{j,t}}$

Notice $b_i$ does not appear here. The model's opinion alone decides how much weight each chosen expert contributes. The token's output is then $y_t = \sum_{i \in \mathcal{T}_t} g_{i,t} \, \mathrm{Expert}_i(u_t)$ .
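The three formulas above can be sketched end to end for one token. Everything concrete here is an assumption made for illustration: the two-dimensional hidden state, the four hand-picked centroids, and the toy "experts" that return a scalar instead of a vector.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def moe_token_output(u, centroids, bias, experts, k):
    """Compute y_t for one token: sigmoid affinity, biased top-k, raw-score gates."""
    # s_i = sigmoid(u . e_i): one independent score per expert.
    s = [sigmoid(sum(a * b for a, b in zip(u, e))) for e in centroids]
    # Selection uses s_i + b_i; ties go to the lower expert index.
    order = sorted(range(len(s)), key=lambda i: (-(s[i] + bias[i]), i))
    chosen = sorted(order[:k])
    # Gates renormalize the RAW affinities of the chosen experts.
    denom = sum(s[i] for i in chosen)
    gates = {i: s[i] / denom for i in chosen}
    # y = sum_i g_i * Expert_i(u); the toy experts return a scalar.
    y = sum(gates[i] * experts[i](u) for i in chosen)
    return y, chosen, gates


u = [1.0, 0.5]
centroids = [[2.0, 0.0], [0.0, -1.0], [-1.0, 0.0], [0.0, 1.0]]
# Toy expert i scales the sum of the hidden state by (i + 1).
experts = [lambda h, scale=i + 1: scale * sum(h) for i in range(4)]

y_plain, chosen_plain, _ = moe_token_output(u, centroids, [0.0] * 4, experts, k=2)
y_biased, chosen_biased, gates_biased = moe_token_output(
    u, centroids, [-0.6, 0.0, 0.0, 0.0], experts, k=2
)
print(chosen_plain, y_plain)
print(chosen_biased, gates_biased, y_biased)
```

Without a bias the token activates experts 0 and 3. Penalizing expert 0 by 0.6 swaps it for expert 1, and the gates for the new pair are computed from their raw sigmoid scores only.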

This split is what makes the next claim true. Because $g_{i,t}$ doesn't depend on $b_i$, the loss $L$ has no differentiable path to $b_i$ either: for a fixed selection, $\partial L / \partial b_i = 0$ for every $i$. We can freely update $b_i$ outside of autograd without disturbing the gradients the model is learning from.
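One way to convince yourself of the zero-gradient claim is a finite-difference check. The sketch below assumes a toy squared-error loss and scalar expert outputs, both invented for illustration. It perturbs each bias by a small amount and measures the change in loss.

```python
def loss_for_bias(bias, raw_scores, expert_outputs, target, k=2):
    """Toy loss: squared error of the gated output for one token."""
    adjusted = [s + b for s, b in zip(raw_scores, bias)]
    order = sorted(range(len(adjusted)), key=lambda i: (-adjusted[i], i))
    chosen = order[:k]                       # selection sees the bias
    denom = sum(raw_scores[i] for i in chosen)
    y = sum(raw_scores[i] / denom * expert_outputs[i] for i in chosen)  # gates do not
    return (y - target) ** 2


raw = [0.9, 0.6, 0.5, 0.1]
outs = [1.0, 2.0, 3.0, 4.0]
bias = [0.0, 0.0, 0.0, 0.0]
eps = 1e-4

base = loss_for_bias(bias, raw, outs, target=1.0)
grads = []
for i in range(len(bias)):
    bumped = list(bias)
    bumped[i] += eps
    grads.append((loss_for_bias(bumped, raw, outs, target=1.0) - base) / eps)

# A large enough bias change flips the selection, and the loss jumps.
jumped = loss_for_bias([0.0, -0.2, 0.0, 0.0], raw, outs, target=1.0)

print("finite-difference gradients:", grads)
print("loss before / after a selection flip:", base, jumped)
```

Small perturbations leave the loss exactly unchanged, so every finite-difference gradient is zero. The bias affects the loss only through discrete jumps when the chosen set changes, which is the one path autograd cannot follow.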

The Bias Update as a Control Loop

With selection and weighting decoupled, we're free to update $b_i$ however we want. DeepSeek picks the simplest possible rule: a sign-based controller that moves each bias by a fixed step.

After each step, let $\ell_i = |\{ t : i \in \mathcal{T}_t \}|$ be the number of tokens routed to expert $i$ this batch. Each token fills $K$ slots, so the loads sum to $TK$ and the uniform setpoint is $\ell^* = T K / N$. Update:

$b_i \leftarrow b_i - \gamma \cdot \operatorname{sign}(\ell_i - \ell^*)$

The hyperparameter $\gamma > 0$ (the DeepSeek paper writes it as $u$ ) controls the loop's speed. Read the rule sign by sign:

Expert $i$ got more tokens than expected: $\ell_i > \ell^*$, so $b_i$ drops by $\gamma$. Its bias-adjusted scores fall, and on the next batch it tends to lose top-K races it would otherwise have won by small margins.
Expert $i$ got fewer tokens: $\ell_i < \ell^*$, so $b_i$ rises by $\gamma$. Its bias-adjusted scores climb; on the next batch it wins more close races.
Expert exactly at the setpoint: $b_i$ is unchanged, because $\operatorname{sign}(0) = 0$. (With large batches an exact hit is rare, and it is impossible when $TK/N$ is not an integer.)
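The three cases can be sketched as a single function. The name `update_bias` and the example loads are assumptions chosen so that each case appears once.

```python
def sign(x):
    """Return +1, 0 or -1."""
    return (x > 0) - (x < 0)


def update_bias(bias, load, tokens, k, gamma):
    """One control-loop step: b_i <- b_i - gamma * sign(load_i - T*K/N)."""
    setpoint = tokens * k / len(bias)
    return [b - gamma * sign(l - setpoint) for b, l in zip(bias, load)]


# T = 8 tokens, K = 2, N = 4 experts, so the setpoint is 8 * 2 / 4 = 4.
bias = [0.0, 0.0, 0.0, 0.0]
load = [7, 4, 3, 2]          # sums to T * K = 16

new_bias = update_bias(bias, load, tokens=8, k=2, gamma=0.05)
print(new_bias)              # [-0.05, 0.0, 0.05, 0.05]
```

Expert 0 is over-served and is pushed down, expert 1 sits exactly on the setpoint and is left alone, and experts 2 and 3 are lifted.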

The fixed point is the place where the per-expert biases have settled into whatever offsets the data distribution actually needs — bigger negative biases for the intrinsically "popular" experts, bigger positive biases for the intrinsically "niche" ones — and load is uniform. This is a standard bang-bang controller. It is the same idea as a thermostat clicking on and off.

Two specific design choices keep this from being a vague analogy. First, the rule uses sign, not the raw error $\ell_i - \ell^*$. This makes every step a bounded nudge of size $\gamma$, so a single outlier batch cannot blow up $b_i$. Second, the rule runs every step, not on a schedule. The biases ride along with the model, adjusting batch by batch as the true affinity distribution drifts during training.
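To see why the sign matters, the sketch below feeds the same load history for one expert to both rules. The load sequence, including the single outlier batch of 64, is made up for illustration, and the raw-error rule is the hypothetical alternative rather than anything DeepSeek uses.

```python
def sign(x):
    return (x > 0) - (x < 0)


gamma = 0.05
setpoint = 16.0
loads = [17, 15, 64, 16, 14]     # third batch is an outlier

b_sign, b_raw = 0.0, 0.0
max_step_sign, max_step_raw = 0.0, 0.0
for load in loads:
    err = load - setpoint
    step_sign = gamma * sign(err)    # bounded: always -gamma, 0 or +gamma
    step_raw = gamma * err           # unbounded: scales with the error
    b_sign -= step_sign
    b_raw -= step_raw
    max_step_sign = max(max_step_sign, abs(step_sign))
    max_step_raw = max(max_step_raw, abs(step_raw))

print("largest single step, sign rule:", max_step_sign)
print("largest single step, raw-error rule:", max_step_raw)
print("final bias, sign rule:", b_sign, " raw-error rule:", b_raw)
```

The outlier moves the sign-based bias by 0.05 like any other batch, while the raw-error rule takes a step of 2.4 in one go, far larger than the spread of the affinities it is supposed to nudge.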

Why It Has Zero Gradient Interference

The cleanest way to see the argument is to trace the computation graph. Start at the loss $L$ and walk backward to find which parameters it depends on.

Quantity	Depends on…	Has gradient?
Per-token output y_t	g_{i,t} and Expert_i(u_t)	yes (through experts and gates)
Gate weight g_{i,t}	raw affinity s_{i,t} (NOT b_i)	yes (through centroids e_i)
Affinity s_{i,t}	centroid e_i and hidden state u_t	yes (e_i is a Parameter)
Bias b_i	history of loads ℓ_i (no parameters at all)	no (no path from L)
Top-K membership T_t	top-K over (s + b), non-differentiable	no (indices are treated as constants)

The bias $b_i$ influences $L$ only through the discrete top-K choice, which has no usable gradient. The differentiable path — through $g_{i,t}$ — bypasses $b_i$ entirely. So when the optimizer asks autograd "what should I do with $b_i$ ?", the honest answer is nothing. We update it ourselves, with the control loop, and the gradient signal that improves token-level prediction flows undisturbed through the centroids $e_i$ and the expert networks.

This is the property the auxiliary-loss method could not give us. An auxiliary loss term adds a non-zero $\partial L_{\rm aux} / \partial e_i$ to the same router parameters that the language-modeling loss is training. The two gradients then mix at the optimizer step. The bias-term method has $\partial L / \partial b_i = 0$ by construction, not by tuning. There is no coefficient you can set wrong.
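For readers who prefer the chain rule written out, here is a sketch for a fixed selection set $\mathcal{T}_t$. The shorthand $S_t$ for the sum of the chosen raw affinities is introduced here for convenience and is not part of the notation above; $\partial L / \partial g_{j,t}$ stands for whatever the rest of the network sends back to the gate.

```latex
S_t = \sum_{m \in \mathcal{T}_t} s_{m,t}, \qquad
g_{j,t} = \frac{s_{j,t}}{S_t}, \qquad
\frac{\partial g_{j,t}}{\partial s_{i,t}}
  = \frac{\delta_{ij}\, S_t - s_{j,t}}{S_t^{2}}
  \quad (i, j \in \mathcal{T}_t)

\frac{\partial s_{i,t}}{\partial e_i} = s_{i,t}\,(1 - s_{i,t})\, u_t

\frac{\partial L}{\partial e_i}
  = \sum_{t \,:\, i \in \mathcal{T}_t} \; \sum_{j \in \mathcal{T}_t}
    \frac{\partial L}{\partial g_{j,t}}\,
    \frac{\partial g_{j,t}}{\partial s_{i,t}}\,
    \frac{\partial s_{i,t}}{\partial e_i}

\frac{\partial L}{\partial b_i} = 0
  \quad \text{(no factor above contains } b_i \text{)}
```

Every factor on the path to $e_i$ is built from raw affinities, so the centroid gradient is the same expression whether or not a bias was used to choose $\mathcal{T}_t$.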

Interactive: Watch the Bias Loop Stabilize

Before any code, run the loop yourself. The simulator below uses 8 experts, top-K = 2, and a deliberately skewed affinity distribution — experts 0 and 1 have systematically higher mean affinity than the rest. With bias correction off, that skew is enough to send the routing into permanent collapse. With it on, watch the biases drift into negative territory for the popular experts and positive for the niche ones until load is essentially uniform.

[Interactive bias-balancing simulator]

Things worth doing in the simulator, in order:

Press Skip 30 steps with the bias correction off. The per-step load on experts 0 and 1 sits around 20-25 tokens (out of 32 slot-allocations), and the bottom four experts get almost nothing. The imbalance curve refuses to fall.
Reset. Turn the bias correction on with $\gamma = 0.05$ . Step a few times — you can see the biases for the popular experts already turning negative after step 1. Press Skip 30 steps. Imbalance falls to near $1.0\times$ .
Crank $\gamma$ to its max ( $0.2$ ) and step a few times. The biases jitter visibly — you're seeing the cost of an over-aggressive controller. With $\gamma = 0.005$ , the convergence is smoother but slower. This trade-off is exactly the one DeepSeek tunes with a schedule in real training.
Drop $\gamma$ to zero. The control loop freezes. Routing decisions still get the current biases applied, but they stop adapting. The DeepSeek-V3 report does this for the final stretch of training (the last 500B of 14.8T tokens) to lock in the routing pattern.

Manual Numerical Walkthrough

Let's execute the full loop by hand for one step, with $N = 4$ experts, $K = 2$ , and $T = 6$ tokens. Small enough to carry every intermediate.

Step-by-step: one batch through a bias-balanced router

Setup

Affinities $s_{i,t}$ (rows are tokens, columns are experts). These would normally come from $\sigma(u_t^\top e_i)$ ; we hand-pick them so expert 0 looks objectively most relevant on most tokens — the natural skew we want the bias to correct.

	E0	E1	E2	E3
t=0	0.90	0.40	0.20	0.10
t=1	0.85	0.55	0.25	0.15
t=2	0.80	0.30	0.60	0.20
t=3	0.70	0.50	0.30	0.40
t=4	0.95	0.45	0.15	0.25
t=5	0.75	0.65	0.10	0.05

Current biases after a few prior steps of the loop: $b = (-0.30,\, -0.05,\, +0.10,\, +0.25)$ . Expert 0 already has a meaningful negative bias because it has been over-served. Expert 3 has a positive bias to lift it.

Step 1: Form the bias-adjusted score

$\tilde{s}_{i,t} = s_{i,t} + b_i$ — add the per-expert bias to every column:

	E0 (+(-0.30))	E1 (+(-0.05))	E2 (+0.10)	E3 (+0.25)
t=0	0.60	0.35	0.30	0.35
t=1	0.55	0.50	0.35	0.40
t=2	0.50	0.25	0.70	0.45
t=3	0.40	0.45	0.40	0.65
t=4	0.65	0.40	0.25	0.50
t=5	0.45	0.60	0.20	0.30

Step 2: Top-K = 2 selection (per row)

Pick the two highest entries in each row of the adjusted table:

	Chosen experts	Membership set T_t
t=0	E0 (0.60), E1 (0.35) — E3 tie broken by index	{E0, E1}
t=1	E0 (0.55), E1 (0.50)	{E0, E1}
t=2	E2 (0.70), E0 (0.50)	{E0, E2}
t=3	E3 (0.65), E1 (0.45)	{E1, E3}
t=4	E0 (0.65), E3 (0.50)	{E0, E3}
t=5	E1 (0.60), E0 (0.45)	{E0, E1}

Step 3: Gate weights — use the RAW affinity

Now we pull the gate weights from the original $s_{i,t}$ table — no bias here. Token $t = 0$ picked $\{E_0, E_1\}$ :

$g_{0,0} = \frac{0.90}{0.90 + 0.40} = 0.692, \quad g_{1,0} = \frac{0.40}{0.90 + 0.40} = 0.308$

The model's opinion of expert 0's relevance to token 0 is fully preserved: it gets 69% of the gate. The negative bias made it harder for E0 to be selected, but once E0 is in the chosen set the bias does not water down its contribution.

Step 4: Tally load and update biases

Count occurrences across all $\mathcal{T}_t$ :

Expert	Load this step	Setpoint ℓ* = TK/N = 12/4 = 3	overload sign	Δb
E0	5	3	+1	−γ
E1	4	3	+1	−γ
E2	1	3	−1	+γ
E3	2	3	−1	+γ

With $\gamma = 0.05$ , the new biases are:

$b' = (-0.35,\, -0.10,\, +0.15,\, +0.30)$

Compare to where we started: $(-0.30, -0.05, +0.10, +0.25)$ . Every over-served expert dropped 0.05; every under-served one rose 0.05. On the next batch, that 0.05 shift will tip a handful of close decisions away from E0 and E1 toward E2 and E3. Push the same process through 100 batches and the load distribution flattens while the centroids $e_i$ evolve under their own gradient — completely independent of the bias loop.

The model never knew

Gradients computed in the backward pass at this step depend on $g_{i,t}$ and the experts, both of which used the raw $s_{i,t}$ table. The biases appeared only in Step 2 — a non-differentiable top-K selection — and as a state update in Step 4 that lives outside autograd. There is no chain rule that connects loss to $b_i$ .
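The whole walkthrough can be replayed in a few lines. This sketch stores scores and biases as integer hundredths, an implementation choice made here so that the tie at $t = 0$ is exact rather than decided by floating-point rounding; ties go to the lower expert index, as in the table.

```python
# Affinities and biases from the walkthrough, in hundredths.
S = [
    [90, 40, 20, 10],
    [85, 55, 25, 15],
    [80, 30, 60, 20],
    [70, 50, 30, 40],
    [95, 45, 15, 25],
    [75, 65, 10, 5],
]
B = [-30, -5, 10, 25]
K = 2
GAMMA = 5                      # 0.05 in hundredths


def top_k(row, bias, k):
    """Indices of the k highest bias-adjusted scores; ties go to the lower index."""
    order = sorted(range(len(row)), key=lambda i: (-(row[i] + bias[i]), i))
    return sorted(order[:k])


def sign(x):
    return (x > 0) - (x < 0)


chosen = [top_k(row, B, K) for row in S]
load = [sum(i in c for c in chosen) for i in range(len(B))]
setpoint = len(S) * K / len(B)                      # 6 * 2 / 4 = 3
new_B = [b - GAMMA * sign(l - setpoint) for b, l in zip(B, load)]

# Gate weights for token 0 come from the RAW scores of its chosen experts.
denom = sum(S[0][j] for j in chosen[0])
gates_t0 = {i: S[0][i] / denom for i in chosen[0]}

print("chosen :", chosen)
print("load   :", load)
print("new b  :", [b / 100 for b in new_B])
print("gates 0:", gates_t0)
```

The printed loads are 5, 4, 1, 2 and the new biases are -0.35, -0.10, 0.15, 0.30, matching Step 4.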

Plain Python Implementation

Here is the mechanism (affinity, selection, load counting, control loop) in pure NumPy. The toy has no expert networks, so it never computes gate weights; it tracks only who gets selected. Nothing is hidden behind an MoE library; every operation is a single named line you can trace. Read it once top-to-bottom, then read the explanation that follows.

Bias-term load balancing in pure NumPy

bias_balancing_numpy.py

import numpy as np

# ----- Toy MoE layer ---------------------------------------------------------
N_EXPERTS       = 8      # number of experts in the layer
TOP_K           = 2      # each token activates 2 experts
TOKENS_PER_STEP = 64     # mini-batch size for routing
GAMMA           = 0.05   # bias update rate (DeepSeek calls this u in the paper)
N_STEPS         = 400

# Per-expert popularity offset. A real router learns centroids e_i; we
# hard-code an offset so experts 0 and 1 are systematically more attractive
# than the rest. This is the imbalance source.
popular = np.array([1.3, 1.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# ----- State that the control loop updates ----------------------------------
bias = np.zeros(N_EXPERTS)            # b_i: what we update by hand
load_history = []                     # per-step loads, kept for plotting

def affinity(num_tokens, rng):
    """Per-token affinity s_{i,t}: popular experts get a higher mean."""
    noise = rng.standard_normal((num_tokens, N_EXPERTS)) * 0.7
    return popular[None, :] + noise

def topk_route(scores, k):
    """Return the indices of the top-k experts per token, shape (T, k)."""
    # Partition -scores so the k smallest entries (the k largest scores)
    # land in the first k columns, in no particular order.
    return np.argpartition(-scores, k - 1, axis=1)[:, :k]

rng = np.random.default_rng(0)
expected_load = TOKENS_PER_STEP * TOP_K / N_EXPERTS    # uniform target

for step in range(N_STEPS):
    s = affinity(TOKENS_PER_STEP, rng)            # raw affinity, (T, N)
    decision = s + bias[None, :]                  # ONLY for selection
    picks    = topk_route(decision, TOP_K)        # (T, k)

    # Count how many tokens each expert received this step
    load = np.bincount(picks.ravel(), minlength=N_EXPERTS).astype(float)
    load_history.append(load.copy())

    # ----- The control loop update --------------------------------------
    # overloaded  -> push bias DOWN  -> fewer tokens next time
    # underloaded -> push bias UP    -> more  tokens next time
    overload = load - expected_load
    bias = bias - GAMMA * np.sign(overload)

# ----- Final state ----------------------------------------------------------
print("final bias  :", np.round(bias, 3))
print("final load  :", load.astype(int))
print("uniform aim :", expected_load)

Explanation

Why this toy setup is honest
N_EXPERTS=8, TOP_K=2, 64 tokens per step gives an expected uniform load of 64·2/8 = 16 tokens/expert. Small enough to read by eye, large enough that single tokens won't swing the loop.
Example: expected_load = 64 * 2 / 8 = 16 tokens / expert / step

GAMMA, the only knob
γ controls how aggressively the control loop reacts. Too small: slow to recover after a regime shift in the data. Too large: biases oscillate and pull individual tokens away from genuinely useful experts. The DeepSeek-V3 report uses 0.001 for most of training and 0 for the final stretch.

Where the imbalance comes from
popular has +1.3 on experts 0 and 1 and 0 everywhere else. Without bias correction, those two experts absorb most of the top-K slots on every step. This hard-coded skew stands in for the much subtler skews a real router develops on real data.

bias starts at zero
We initialize b_i = 0 for every expert so the first step routes purely on affinity. Routing is allowed to be unbalanced at the start; the loop is responsible for fixing it, not the initialization.

Affinity model
In a real MoE layer s_{i,t} = sigmoid(u_t · e_i). Here we just add Gaussian noise to the popularity vector, so the toy scores are not confined to (0, 1). The shape (T, N) is what every real router emits: one score per (token, expert) pair.
Example: s.shape == (64, 8)

Top-k by partial sort
np.argpartition runs in linear time per row and returns the top-k indices unsorted, fast enough that we don't have to write a heap. Order inside the top-k doesn't matter for routing; only membership does.

decision = s + bias, the whole trick on one line
We add the bias only to the SELECTION score. The raw affinity s is what a full layer would keep around to compute gate weights (the contribution of each expert to the token's output). This separation is the entire reason the method works without polluting gradients.
Example: s[i] = 1.4   b[i] = -0.6   decision[i] = 0.8   (the gate would still use 1.4)

Counting load
picks has shape (T, k). Flattening and bincount gives us a length-N_EXPERTS vector: exactly the per-expert token count for this step. This single vector drives the control loop.

expected_load is the setpoint
Like a thermostat's target temperature. We compute it once: total tokens entering the layer times top-k, divided uniformly. Any expert above this number is over-served; any below is under-served.

The whole update: two lines, no gradients
overload tells us the error signal per expert. sign(overload) turns it into a direction (+1 / 0 / -1). Multiply by γ and subtract from bias. This is a sign-based controller with a fixed step size: no calculus, no autograd, no shared parameters with the model.
Example: overload = [+30, +28, -8, -9, -9, -10, -11, -11]  ->  bias -= 0.05 * [+1, +1, -1, -1, -1, -1, -1, -1]

Reading the final state
After ~400 steps, expect bias values for the popular experts to settle into negative territory (≈ -0.6 to -1.0) and the unpopular ones into positive territory. The load vector will be flat near 16 ± a few. The model never knew this was happening.

Two structural things to take away from this file. First, the bias vector is just a NumPy array we mutate by hand — there is no gradient anywhere in the file. Second, the line decision = s + bias[None, :] is the entire algorithm. Everything before it sets up the data; everything after it counts load and updates the controller.

What to try with this code

Run it as-is and print the load at every step for the first 30 or so steps. You'll see the popular experts start far above the 16-token setpoint and walk down toward it; the unpopular ones start well below it and climb up toward ~16.
Set GAMMA = 0. The biases never change, the decision is just $s$ , and the load stays permanently lopsided — the no-balancing baseline.
Replace np.sign(overload) with overload / expected_load. This is the magnitude-aware variant; convergence is faster, but a single unusual batch can over-correct.
Add a slow drift to popular — e.g. rotate the high-affinity slots from (0, 1) to (2, 3) at step 200. Watch the biases re-balance on the fly. This is the actual situation during long pretraining: data distribution drifts, and the loop tracks it.
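The last experiment can be sketched without NumPy. The version below is a standard-library rewrite of the toy, not the file above: it uses `random.Random` instead of NumPy's generator, switches the popular pair abruptly at the halfway point (a simplifying assumption in place of a slow drift), and compares a run with the controller on against a run with γ = 0.

```python
import random

N, K, T, GAMMA, STEPS = 8, 2, 64, 0.05, 400
SETPOINT = T * K / N


def run(gamma, seed=0):
    """Return final biases and the mean max-load over the last 50 steps."""
    rng = random.Random(seed)
    bias = [0.0] * N
    max_loads = []
    for step in range(STEPS):
        # Hypothetical drift: the popular pair moves from (0, 1) to (2, 3).
        hot = (0, 1) if step < STEPS // 2 else (2, 3)
        load = [0] * N
        for _ in range(T):
            s = [(1.3 if i in hot else 0.0) + rng.gauss(0.0, 0.7) for i in range(N)]
            order = sorted(range(N), key=lambda i: -(s[i] + bias[i]))
            for i in order[:K]:
                load[i] += 1
        for i in range(N):
            err = load[i] - SETPOINT
            bias[i] -= gamma * ((err > 0) - (err < 0))
        max_loads.append(max(load))
    tail = max_loads[-50:]
    return bias, sum(tail) / len(tail)


bias_on, max_on = run(GAMMA)
bias_off, max_off = run(0.0)

print("controller on : mean max load", round(max_on, 1))
print("controller off: mean max load", round(max_off, 1))
print("final biases  :", [round(b, 2) for b in bias_on])
```

With the controller on, the busiest expert ends up much closer to the setpoint of 16 than in the frozen run, and the biases of experts 2 and 3 finish below those of experts 0 and 1 because the loop has re-learned which pair is popular.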

PyTorch Implementation

The PyTorch version preserves the same control structure but uses the framework's tools properly — centroids as Parameters (so the optimizer learns them) and bias as a registered buffer (so it's saved with the model but invisible to the optimizer and to autograd).

BiasBalancedRouter — production-shaped PyTorch module

bias_balanced_router.py

import torch
import torch.nn as nn


class BiasBalancedRouter(nn.Module):
    """
    Auxiliary-loss-free router from DeepSeek-V3.

      score      s_{i,t} = sigmoid(u_t . e_i)
      selection  top_k( s + b )                    <-- bias INSIDE top-k
      gate       g_{i,t} = s_{i,t} / sum_{j in topk} s_{j,t}    <-- bias NOT here
      update     b_i  <-  b_i - gamma * sign(load_i - expected)
    """

    def __init__(
        self,
        d_model: int,
        n_experts: int,
        top_k: int,
        bias_update_rate: float = 1e-3,
    ):
        super().__init__()
        self.n_experts = n_experts
        self.top_k     = top_k
        self.gamma     = bias_update_rate

        # Expert centroids: learned by gradient descent (these DO get grads)
        self.centroids = nn.Linear(d_model, n_experts, bias=False)

        # Per-expert balancing bias: NOT a parameter; updated by the
        # control loop. Buffer = saved with the model but excluded from
        # the optimizer.
        self.register_buffer("bias", torch.zeros(n_experts))

    def forward(self, h: torch.Tensor):
        """
        h : (B, T, d_model)
        returns:
          gates    (B, T, k)    weighting for each chosen expert
          indices  (B, T, k)    which expert each slot picked
          load     (n_experts,) tokens per expert this batch (for the update)
        """
        # ---- 1. Per-token affinity --------------------------------------
        s = torch.sigmoid(self.centroids(h))          # (B, T, n_experts)

        # ---- 2. Select top-k using BIAS-CORRECTED scores ---------------
        decision = s + self.bias                      # broadcast (n_experts,)
        _, idx   = decision.topk(self.top_k, dim=-1)  # (B, T, k)

        # ---- 3. Gate weights use RAW affinity (no bias!) ---------------
        chosen_s = s.gather(-1, idx)                  # (B, T, k)
        gates    = chosen_s / chosen_s.sum(-1, keepdim=True).clamp_min(1e-9)

        # ---- 4. Count load for the control-loop update ------------------
        # Count in float32 regardless of h.dtype: half-precision types
        # cannot represent large token counts exactly.
        load = torch.zeros(self.n_experts, device=h.device, dtype=torch.float32)
        load.scatter_add_(
            0,
            idx.reshape(-1),
            torch.ones(idx.numel(), device=h.device, dtype=torch.float32),
        )
        return gates, idx, load

    @torch.no_grad()
    def update_bias(self, load: torch.Tensor, total_tokens: int):
        """
        Call AFTER the optimizer step. No autograd anywhere in here.
        """
        expected = total_tokens * self.top_k / self.n_experts
        overload = load - expected
        self.bias.sub_(self.gamma * overload.sign())

Explanation

Why this is an nn.Module but not 'learnable' in the usual sense
Subclassing nn.Module gives us autograd-aware forward(), GPU placement, state-dict serialization, and DDP-friendly buffers, without making the bias a parameter. The bias gets saved with the checkpoint (so resumed training continues with the warmed-up biases) but is invisible to the optimizer.

centroids = nn.Linear: these ARE learned
The expert centroids e_i are real parameters trained by gradient descent. Calling self.centroids(h) is mathematically the same as h @ e^T, producing one pre-sigmoid logit per (token, expert) pair. bias=False because the per-expert balancing bias is conceptually separate from any linear-layer bias term.
Example: h: (B, T, d_model)  ->  s: (B, T, n_experts)

register_buffer: the load-bearing detail
buffer != Parameter. Parameters are autograd leaves that the optimizer touches. Buffers are tensors that move with the model (.cuda(), state_dict()) but the optimizer doesn't see them. This is exactly the storage class we need for a tensor that is updated by a control loop, not by SGD.

Affinity via sigmoid, not softmax
DeepSeek-V3 uses per-expert sigmoid affinity, not a softmax over experts. Sigmoid makes the scores independent per expert: adding a bias to expert i doesn't redistribute probability mass to the others, so the control loop stays mechanically simple. Nudging b_i changes only expert i's selection score.
Example: s = sigmoid(h @ e^T)   range (0, 1) per cell

The single most important line in the file
decision = s + self.bias is broadcast over (B, T, n_experts). Note we KEEP s around; we don't overwrite it with decision. That allows the next two steps to use the bias for SELECTION and the raw affinity for WEIGHTING.

topk on the corrected score
Returns (values, indices). We throw away values because we'll re-look-up the raw affinity at the chosen indices. idx has shape (B, T, k): for each token, which experts are 'on'.

gather: route raw scores by index
gather pulls s[..., idx[..., j]] for each j in 0..k-1. The result chosen_s has shape (B, T, k) and contains the RAW affinity of each chosen expert, with no bias adjustment.

Normalize so gates sum to 1
Dividing by sum keeps the per-token output a convex combination of expert outputs. The clamp_min(1e-9) protects against the (very rare) case where all chosen affinities are near zero: a single line of defensive programming, not a load-bearing trick.

Counting load with scatter_add_
We flatten idx to (B*T*k,) and add 1 for each slot at the chosen expert position. After this line, load[i] = number of token-slots that landed on expert i this batch. The counter is float32 regardless of h.dtype, because half-precision types cannot count large loads exactly. This is the only summary statistic the control loop needs.
Example: load tensor shape: (n_experts,)

@torch.no_grad(): explicit anti-autograd seal
The decorator turns off gradient tracking for everything inside. Even if some user accidentally calls update_bias() inside a forward pass that's being traced, no graph will be built through the bias update. This guarantees the 'no gradient interference' claim is structurally true, not just culturally true.

In-place update with sub_
sub_ mutates self.bias in place, which is fine because we're inside @no_grad. We use .sign() so the update is direction-only (DeepSeek's rule). A magnitude-aware variant (sub_(γ · overload / expected)) also works and converges faster but can over-react to a noisy single batch.
Example: self.bias[i] -= gamma  if expert i is overloaded

How this module fits into a training loop

The usage pattern is intentionally boring:

Forward pass: call router(h), dispatch tokens to the chosen experts using idx and gates, combine outputs. Standard MoE.
Backward pass: loss.backward() flows through gates, which depend on self.centroids but not on self.bias. The optimizer updates centroids and expert weights as usual.
After optimizer.step(): call router.update_bias(load, total_tokens). This is the control-loop update, fully outside autograd.

Because self.bias is a registered buffer, model.state_dict() captures it even though it is not a parameter. Resuming a training run starts the controller from its warmed-up state, so you don't lose thousands of steps of load history every time you checkpoint.

In real distributed training, the load tensor must be all-reduced across data-parallel ranks before update_bias is called, and total_tokens must be the global token count to match. Otherwise each rank only sees its local micro-batch and the controllers across ranks drift apart. One line of torch.distributed.all_reduce(load, op=torch.distributed.ReduceOp.SUM) fixes it. DeepSeek also balances across expert-parallel groups, which we cover in chapter 12.
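The effect of the missing all-reduce can be shown without a cluster. The sketch below simulates two data-parallel ranks as plain lists; summing the per-rank loads stands in for what an all-reduce with a SUM op would produce. The loads are invented so that the two ranks see mirror-image skews.

```python
def sign(x):
    return (x > 0) - (x < 0)


def update(bias, load, total_tokens, k, gamma):
    expected = total_tokens * k / len(bias)
    return [b - gamma * sign(l - expected) for b, l in zip(bias, load)]


K, GAMMA = 2, 0.05
# Two simulated ranks; each routed 8 tokens with K = 2 (16 slots per rank).
local_loads = [[7, 5, 3, 1], [1, 3, 5, 7]]

# Wrong: every rank updates its own bias copy from its local load.
local_bias = [
    update([0.0] * 4, load, total_tokens=8, k=K, gamma=GAMMA)
    for load in local_loads
]

# Right: sum the loads across ranks first and use the GLOBAL token count.
global_load = [sum(col) for col in zip(*local_loads)]
synced_bias = update([0.0] * 4, global_load, total_tokens=16, k=K, gamma=GAMMA)

print("rank 0 (local):", local_bias[0])
print("rank 1 (local):", local_bias[1])
print("all ranks (synced):", synced_bias)
```

Locally each rank sees a skew and pushes its biases in opposite directions, so the replicas disagree after one step. Globally the load is perfectly balanced and the correct update is no change at all.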

The γ Schedule in DeepSeek-V3

The simulator above hints at the trade-off: a large $\gamma$ converges quickly but oscillates; a small $\gamma$ is stable but slow to recover from distribution shift. DeepSeek-V3 resolves this the same way every other large-scale training trick does: schedule it.

Phase	γ value	Why
Most of training (first 14.3T of 14.8T tokens)	γ = 0.001 (constant)	The dominant regime. Slow, smooth corrections; small enough that the controller doesn't fight transient batch-to-batch noise, including early on, when centroids are still random and the model's opinion is mostly noise.
Final stretch (last 500B tokens)	γ = 0	Locks in the routing pattern. The model can specialize harder when it knows its expert assignments won't keep drifting.

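Two details about the published numbers. First, the reported $\gamma = 0.001$ comes from a very different regime than the toy: sigmoid affinities confined to $(0, 1)$, and loads counted over a very large number of tokens per step. Because the rule uses only the sign of the error, every update moves $b_i$ by exactly $\gamma$ no matter how large the batch is. What the batch size changes is how noisy that sign is, and what the score scale changes is how much a shift of $\gamma$ matters. Absolute numbers therefore don't transfer to smaller setups. If you reproduce this with batch size 64 and scores spread as widely as the toy's, a larger value such as $\gamma \approx 0.01$ to $0.05$ is a reasonable starting point.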

Second, the cool-down is not optional. If you train with $\gamma > 0$ right up to the last step, the biases the deployed model uses still carry the jitter of the last few batches. Setting $\gamma$ to 0 for the final stretch fixes the biases, so the routing pattern the model finishes training with is the one it is served with.
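A schedule of this shape is a few lines of code. The sketch below uses a step-function freeze as a simple stand-in; the function name and the `frozen_steps` parameter are assumptions, and a gradual decay would replace the final branch.

```python
def bias_update_rate(step, total_steps, frozen_steps, base_gamma=1e-3):
    """Constant gamma, then 0 for the final `frozen_steps` steps of training."""
    freeze_start = total_steps - frozen_steps
    return base_gamma if step < freeze_start else 0.0


# Hypothetical run: 1000 steps, controller frozen for the last 50.
rates = [bias_update_rate(s, total_steps=1000, frozen_steps=50) for s in range(1000)]
print(rates[0], rates[949], rates[950], rates[999])   # 0.001 0.001 0.0 0.0
```

In a training loop, the returned value would be assigned to the router's update rate before each bias update, so the biases stop moving for the last stretch while the rest of the model keeps training.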

What Changes at Massive Scale

Everything above runs in 50 lines of NumPy on a laptop. Putting it on a 671B-parameter DeepSeek-V3 cluster surfaces three concerns that don't exist in the toy.

1. Load must be aggregated across ranks before the update

A real MoE layer is sharded with expert parallelism: different GPUs hold different experts. A token routed to expert 42 physically travels (via an all-to-all collective) to the GPU that owns expert 42. The load counter $\ell_i$ for expert 42 is naturally local to that GPU. But the data-parallel dimension is replicated: each data-parallel rank sees its own micro-batch of tokens, and they each accumulate their own partial $\ell_i$. Before applying the update we all_reduce the load vector across data-parallel ranks so every replica of the layer sees the same load and applies the same update to its (replicated) bias. Skip this and the biases drift apart, the routing decisions diverge across ranks, and the replicas stop computing the same model.

2. Bias bookkeeping is microscopic; expert dispatch is not

The bias update itself is N additions and N sign checks; for N = 256 experts, that is a few hundred floating-point ops. The MoE layer's compute is dominated by the expert MLPs (billions of FLOPs) and by the all-to-all token dispatch (tens of GB of communication). The balancing logic is essentially free. This is part of why the bias-term method is preferred over more elaborate controllers: whatever you spend on the controller, you spend per layer per step, and DeepSeek-V3 has 58 MoE layers (61 transformer layers, the first three of them dense).

3. Dropped tokens are an emergent property of the loop

Real MoE deployments often cap each expert at a per-step capacity, and tokens beyond the cap are dropped or rerouted. With the auxiliary loss, you tune the loss coefficient and hope the load stays inside the cap. With the bias-term method, if the cap is set near the uniform load $\ell^* = TK/N$, the controller actively steers toward that target, and the drop rate falls naturally as training progresses. The DeepSeek-V3 report states that load balance was good enough to train without dropping any tokens at all.

Property	Auxiliary loss	Bias term (DeepSeek)
Gradient interference	Yes (same path as token logits)	None — by construction
Tuning surface	Loss coefficient λ (sensitive)	Update rate γ (forgiving)
Communication cost	0 extra collectives	1 all_reduce per layer per step
Sensitivity to data drift	Fixed strength — corrects slowly	Continuous control — tracks drift
Effect on model quality	Mild regression (paper-reported)	Slight improvement or neutral
Token dropping	Depends on tuning λ against the capacity cap	None during training (DeepSeek-V3 report)

Production Pitfalls and What to Monitor

Pitfall 1: γ too large for the per-step token count. Symptom: bias values oscillate visibly between consecutive steps; the loss curve develops a periodic wobble. Fix: cut γ by 4× and try again. The simulator above (with γ = 0.2 on 64 tokens/step) shows this failure mode clearly.

Pitfall 2: forgetting the all-reduce. Each data-parallel rank updates its own copy of the bias from its own local load. After a few hundred steps the biases on different ranks diverge by several γ. Routing decisions stop being deterministic across replicas, the optimizer state becomes inconsistent, and training silently degrades. There is no error message — just a slow quality drop. Always all-reduce the load before the bias update.

Pitfall 3: leaking the bias into gating. The most common implementation bug is computing both selection and gating from $\tilde{s}_{i,t}$. If you do this, you have reinvented the auxiliary-loss problem: the bias now enters the loss path and the "no gradient interference" guarantee evaporates. The whole reason this method works is the decoupling. Code-review the gating line specifically.

Pitfall 4: forgetting the late-training decay. Models trained without a γ → 0 cool-down ship with biases that reflect the noisiest batches near the end of training. Eval load looks subtly worse than mid-training load. Schedule the decay.
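Pitfall 3 is easy to guard against with a regression test: for a fixed chosen set, the gates must come out identical no matter what the biases are. The sketch below shows that check against a correct and a deliberately buggy gate function. All three function names are hypothetical.

```python
def gates_correct(raw, bias, chosen):
    """Gates from RAW affinities; the bias argument is deliberately unused."""
    denom = sum(raw[i] for i in chosen)
    return [raw[i] / denom for i in chosen]


def gates_leaky(raw, bias, chosen):
    """BUG: gates computed from the bias-adjusted scores."""
    adjusted = [r + b for r, b in zip(raw, bias)]
    denom = sum(adjusted[i] for i in chosen)
    return [adjusted[i] / denom for i in chosen]


def gates_ignore_bias(gate_fn, raw, chosen):
    """True if the gate function returns identical gates under two different biases."""
    zero = [0.0] * len(raw)
    shifted = [0.05 * (i + 1) for i in range(len(raw))]
    return gate_fn(raw, zero, chosen) == gate_fn(raw, shifted, chosen)


raw = [0.9, 0.6, 0.5, 0.1]
chosen = [0, 1]
print("correct gating passes:", gates_ignore_bias(gates_correct, raw, chosen))
print("leaky gating passes  :", gates_ignore_bias(gates_leaky, raw, chosen))
```

A check like this belongs next to the router's unit tests: it fails the moment someone replaces the raw score with the adjusted one in the gating line.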

What to log every step

Per-expert load $\ell_i$ (a length-N vector). Plot as a heatmap over training steps. Should go from skewed to flat over the first few thousand steps.
Max-over-min load ratio $\max_i \ell_i / \max(1, \min_i \ell_i)$ . Single scalar; should fall from >10× to near 1.5× and stay there.
Bias magnitude $\|b\|_\infty$ . Should grow during warmup then plateau. Unbounded growth means γ is too large or the all-reduce is missing.
Drop rate (tokens exceeding per-expert capacity). Should fall monotonically. A persistent floor means your capacity cap is below the uniform setpoint and you should raise it.
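The scalar metrics in this list are cheap to compute from the load and bias vectors. A minimal sketch follows, using the loads and updated biases from the manual walkthrough and a hypothetical capacity of 4 token-slots per expert.

```python
def balance_metrics(load, bias, capacity):
    """Per-step monitoring scalars for one MoE layer."""
    total = sum(load)
    return {
        # max_i load_i / max(1, min_i load_i)
        "max_over_min": max(load) / max(1, min(load)),
        # ||b||_inf
        "bias_inf_norm": max(abs(b) for b in bias),
        # fraction of token-slots beyond the per-expert capacity
        "drop_rate": sum(max(0, l - capacity) for l in load) / total,
    }


metrics = balance_metrics(
    load=[5, 4, 1, 2],
    bias=[-0.35, -0.10, 0.15, 0.30],
    capacity=4,
)
print(metrics)
```

For this batch the ratio is 5.0, the largest bias magnitude is 0.35, and one slot out of twelve exceeds the cap. The per-expert load vector itself is logged as-is for the heatmap.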

Summary

DeepSeek-V3's auxiliary-loss-free load balancing is one of the most elegant pieces of mechanical design in modern MoE training. Three structural decisions carry the entire result:

Decouple selection from weighting. Use $\tilde{s}_{i,t} = s_{i,t} + b_i$ for top-K selection and the raw $s_{i,t}$ for gate weights.
Update the bias by a control loop, not by gradient descent. After each step, nudge $b_i$ down for over-served experts and up for under-served ones.
Store the bias as a buffer, not a parameter. Autograd can't reach it, the optimizer can't see it, and the loss has no path through it.

The combined effect is the discipline of an auxiliary loss with the gradient quiet of no loss at all. Compared to MoE models that use an auxiliary loss, DeepSeek-V3 trains with flatter load curves, fewer dropped tokens, and a small but consistent improvement in language-modeling quality. That improvement comes not in spite of the load balancer, but because the load balancer stopped fighting the model.

The next section closes the chapter by adding one more layer of defense: a per-sequence balance term that catches the rare extreme imbalances even the bias loop can't prevent — and we'll see why it's applied at the sequence level, not the batch level, before moving on to multi-token prediction in chapter 7.