Chapter 5

Fine-Grained Expert Decomposition

Mixture-of-Experts: DeepSeekMoE

The MoE layer we built in section 1 had eight experts and routed two of them per token. That works — but if you train it long enough you find a strange disease setting in. The router keeps sending the same combinations of experts together. The diversity it was supposed to buy you collapses into a handful of habitual pairings. Every expert ends up a half-cardiologist, half-dermatologist, hedging against the possibility that it might be the only one called.

Fine-grained expert decomposition is DeepSeek's answer to that pathology. The recipe is almost embarrassingly simple: slice each fat expert into $m$ slimmer ones, and route $m$ times as many of them per token. The FLOPs are unchanged. The number of expert combinations explodes. Specialists actually get to specialize.

The bet of fine-grained MoE. Keep the active compute per token fixed. Multiply the router's set of possible expert teams by orders of magnitude. Let combinatorics, not parameter counts, do the work of capacity.

The Redundancy Problem

In the classic $E = 8$, $k = 2$ setup, the router can build at most $\binom{8}{2} = 28$ different expert teams. That is the entire library of two-expert "committees" the model can ever assemble for a token. Twenty-eight is not much when the dataset contains code, English prose, mathematical notation, German, Bengali, Python tracebacks, JSON, and Shakespeare, and each of those domains has dozens of sub-modes.

What happens in practice is that every expert is forced to be a mostly-generalist. Consider expert 3. Across training, it gets paired with experts {1, 5} for code, {2, 6} for math, {4, 7} for poetry, and so on. Because expert 3 cannot guarantee which partner it will run with, it cannot afford to be too specialized; it has to carry some baseline competence in every domain so that, when it runs alongside the "wrong" partner, the pair still produces a reasonable output. Multiply that hedging behavior over all eight experts and you have eight slightly-watered-down generalists, not eight crisp specialists.

The problem is combinatorial. Specialization needs elbow room in the space of possible teams. With only 28 of them, the router cannot give each expert a narrow niche to live in: there are not enough niches to go around.

Why this matters for massive training. At trillion-parameter scale, the entire point of MoE is that different experts learn different things. If they cannot, you are paying the memory cost of $E$ experts and getting roughly the representational power of one fat one. The capacity gain vanishes.

The Intuition: Specialists, Not Generalists

Go back to the hospital analogy from section 1. Imagine your hospital has 8 specialists, and on every patient you must pick 2 of them. You cannot afford a hyper-narrow expert — say, a left-knee surgeon — because with only 28 possible pairings, the chance that a left-knee patient ever arrives and gets routed to the left-knee surgeon is too small to justify keeping her on staff. So everyone trains toward broad general practice.

Now multiply the staff. Same building, same total payroll, same total examination-hours per day — but instead of 8 broad specialists you hire 32 narrow ones, and every patient gets seen by 8 of them. Each visit still costs the same number of person-hours. But now the left-knee-orthopod is on every left-knee case, the pediatric-cardio specialist runs on every newborn-with-murmur, and so on. The system can afford narrowness because narrowness shows up often enough to train.

That is exactly the recipe. Cut every expert in 4 along the FFN hidden dimension. You now have 32 thinner experts, each with one-quarter the capacity. Route 8 per token instead of 2. Active parameters per token are identical: $8 \cdot \tfrac{d_{ff}}{4} = 2 \cdot d_{ff}$. Total parameters are identical too: $32 \cdot \tfrac{d_{ff}}{4} = 8 \cdot d_{ff}$. Nothing changed in the FLOP budget. Everything changed in the combinatorics.

The right mental picture: coarse MoE = small staff of broad specialists, few possible teams. Fine MoE = large staff of narrow specialists, astronomically many possible teams. Same compute per visit either way.

The Mathematical Idea

Let the coarse MoE have $E$ experts, hidden size $d_{ff}$, and top-$k$ routing. Pick a segmentation factor $m \in \{2, 4, 8, \dots\}$. The fine-grained MoE has:

$E' = m \cdot E$ experts, each of hidden size $d_{ff}' = d_{ff} / m$, with top-$k' = m \cdot k$ routing.

Each new slim expert is still a full two-layer FFN, just narrower: $E_i'(x) = W_2^{(i)} \, \sigma\big( W_1^{(i)} x \big)$ with $W_1^{(i)} \in \mathbb{R}^{d_{ff}' \times d}$ and $W_2^{(i)} \in \mathbb{R}^{d \times d_{ff}'}$. The MoE output is the same convex combination as before, just over a bigger pool: $\mathrm{MoE}(x) = \sum_{i \in \mathcal{T}'} g_i(x) \cdot E_i'(x)$, where $\mathcal{T}' = \mathrm{TopK}(W_r' x,\, k')$ has size $k' = m k$.
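To make "cut along the hidden dimension" concrete, here is a minimal NumPy sketch. Because the activation acts elementwise, slicing $W_1$ by rows and $W_2$ by the matching columns yields $m$ slim FFNs whose unweighted sum reproduces the fat expert. The ReLU activation, the toy sizes, and the helper name `ffn` are assumptions chosen for illustration, not details taken from DeepSeek's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, m = 4, 8, 4                   # toy sizes (illustrative assumptions)

W1 = rng.standard_normal((d_ff, d))    # fat expert, first layer
W2 = rng.standard_normal((d, d_ff))    # fat expert, second layer
x = rng.standard_normal(d)

def ffn(W1, W2, x):
    """Two-layer FFN with an elementwise ReLU (hypothetical helper)."""
    return W2 @ np.maximum(0.0, W1 @ x)

# Slice along the hidden dimension: rows of W1, matching columns of W2.
W1_slices = np.split(W1, m, axis=0)    # m blocks of shape (d_ff // m, d)
W2_slices = np.split(W2, m, axis=1)    # m blocks of shape (d, d_ff // m)

fat_out = ffn(W1, W2, x)
slim_sum = sum(ffn(a, b, x) for a, b in zip(W1_slices, W2_slices))

print(np.allclose(fat_out, slim_sum))
```

The identity holds for the unweighted sum of slices. In the actual layer each slice carries its own gate and is trained and selected independently, which is where the extra expressivity comes from.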

The two conservation laws

Fine-grained decomposition is engineered around two invariants. Both must hold or the recipe is no longer FLOP-matched.

Conservation of active FLOPs. Per token, the active expert work is $k' \cdot 2 \cdot d \cdot d_{ff}' = (m k) \cdot 2 \cdot d \cdot (d_{ff}/m) = k \cdot 2 \cdot d \cdot d_{ff}$. The $m$s cancel exactly. The fine model and the coarse model run the same number of multiply-accumulates per token.

Conservation of total expert parameters. Across the whole layer, $E' \cdot 2 \cdot d \cdot d_{ff}' = (m E) \cdot 2 \cdot d \cdot (d_{ff}/m) = E \cdot 2 \cdot d \cdot d_{ff}$. The same blob of parameters has been chopped finer, not enlarged.

The only thing that grows is the router. It now produces $E' = m E$ logits instead of $E$, costing $O(m \cdot E \cdot d)$ work, a rounding error next to even one expert evaluation.
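The two invariants and the router cost can be audited with a few lines of Python. This is a sketch under the same simplifications as the formulas above (two weight matrices per expert, biases ignored); `moe_budget` and its dictionary keys are hypothetical names introduced here.

```python
def moe_budget(E: int, d_ff: int, k: int, d: int, m: int = 1) -> dict:
    """Multiply-accumulate and parameter counts for an m-way segmented MoE layer."""
    if d_ff % m != 0:
        raise ValueError("d_ff must be divisible by m")
    E_fine, d_ff_fine, k_fine = m * E, d_ff // m, m * k
    per_expert = 2 * d * d_ff_fine            # W1 and W2, biases ignored
    return {
        "experts": E_fine,
        "top_k": k_fine,
        "active_macs": k_fine * per_expert,   # conserved across m
        "expert_params": E_fine * per_expert, # conserved across m
        "router_macs": E_fine * d,            # grows linearly with m
    }

coarse = moe_budget(E=8, d_ff=32, k=2, d=4)
fine = moe_budget(E=8, d_ff=32, k=2, d=4, m=4)
for name in ("active_macs", "expert_params", "router_macs"):
    print(f"{name:14s} coarse={coarse[name]:5d}  fine={fine[name]:5d}")
```

At these toy widths the router share is clearly visible; it shrinks relative to the expert work as $d_{ff}$ grows toward realistic sizes.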

Why Combinations Matter

The interesting quantity is the size of the router's search space. For coarse routing this is $\binom{E}{k}$; for fine routing it is $\binom{m E}{m k}$. The ratio is what we care about:

$$R(m) = \frac{\binom{m E}{m k}}{\binom{E}{k}}.$$

With $E = 8$, $k = 2$, plug in a few values of $m$ and the explosion is visible:

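| $m$ | Total experts | Active $k$ | Combinations | × vs coarse |
|---|---|---|---|---|
| 1 (coarse) | 8 | 2 | 28 | 1× |
| 2 | 16 | 4 | 1,820 | 65× |
| 4 (DeepSeek-V2) | 32 | 8 | 10,518,300 | ≈ 376,000× |
| 8 | 64 | 16 | ≈ 4.89 × 10¹⁴ | ≈ 1.7 × 10¹³× |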

At $m = 4$ the router has roughly 376,000 times more expert teams to choose from than it did before, at the exact same compute budget. By $m = 8$, the number is large enough that the router can effectively pick a unique team for every token in a multi-billion-token training set.
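The table's counts can be reproduced with the standard library's `math.comb`. A short sketch, with the variable names chosen here for illustration:

```python
from math import comb

E, k = 8, 2
baseline = comb(E, k)                  # coarse search space: C(8, 2)

rows = []
for m in (1, 2, 4, 8):
    teams = comb(m * E, m * k)         # fine search space: C(mE, mk)
    rows.append((m, m * E, m * k, teams, teams / baseline))
    print(f"m={m}  experts={m * E:3d}  top-k={m * k:2d}  "
          f"teams={teams:.3e}  ratio={teams / baseline:.3e}")
```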

What buys what. Capacity decoupling (section 1) bought us the ability to store more parameters than we run. Fine-grained decomposition (this section) buys us the ability to address those parameters with much finer granularity. Both gains compound: DeepSeek-V3 has 671B total params, 37B active per token, and addresses them through a 256-way router picking top-8.

Shared Experts: Isolating Common Knowledge

Even with combinatorial elbow room, some redundancy remains. There is knowledge that every token needs — basic English syntax, common subword statistics, residual-stream housekeeping. If we make every routed expert learn that baseline, we are wasting capacity. So the second half of the DeepSeekMoE design peels off one or two experts and marks them shared: they fire on every token, unconditionally, outside the router's softmax.

With $n_s$ shared experts and $n_r$ routed experts, the layer computes:

$$\mathrm{MoE}(x) = \underbrace{\sum_{j=1}^{n_s} S_j(x)}_{\text{always on}} + \underbrace{\sum_{i \in \mathcal{T}} g_i(x) \cdot E_i(x)}_{\text{routed}}$$

The shared term has no gate, or rather an implicit gate of 1. The routed term works as before, except the router only scores and selects among the $n_r$ routed experts. To keep total active FLOPs matched against the coarse baseline, you reduce the routed top-$k$ by $n_s$: with $m = 4$, $n_s = 1$, the standard configuration is 1 shared + 31 routed, picking top-7 of the routed pool (1 + 7 = 8 active fine experts = 2 coarse-expert FLOPs).

Why isolation, not addition. If you simply added always-on experts on top of your routed budget, you would be paying more FLOPs. DeepSeekMoE isolates them inside the existing budget — the shared expert displaces one routed slot. The bookkeeping is FLOP-neutral; only the labor division changes.

The training signal becomes much cleaner. Routed experts no longer compete for the right to learn boring universal patterns; the shared expert absorbs that load. The router's gradient now talks almost exclusively about specialization, because the things every token needs are handled before the router even runs.
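The shared/routed bookkeeping described above is simple enough to sketch as a helper. The function name `shared_routed_split` and its return keys are hypothetical; the arithmetic is just the "displace one routed slot per shared expert" rule.

```python
from math import comb

def shared_routed_split(E: int, k: int, m: int, n_shared: int) -> dict:
    """Split an m-way segmented pool into shared and routed experts."""
    total, active = m * E, m * k
    if not 0 <= n_shared < active:
        raise ValueError("n_shared must leave at least one routed slot")
    n_routed, k_routed = total - n_shared, active - n_shared
    return {
        "n_routed": n_routed,
        "k_routed": k_routed,
        "active_experts": n_shared + k_routed,  # unchanged by the split
        "teams": comb(n_routed, k_routed),      # routed combinations left
    }

cfg = shared_routed_split(E=8, k=2, m=4, n_shared=1)
print(cfg)
```

One consequence worth noticing: pinning an expert as shared removes it from the routing pool, so the number of routed teams drops below $\binom{32}{8}$. That is the price paid for the dedicated always-on pathway.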

Manual Numerical Walkthrough

Let's route one toy token through both a coarse MoE and a fine-grained MoE, and verify by hand that the FLOPs match while the expressed function does not.


Setup. Token $x = [1, 0]$. Coarse: 4 experts of $d_{ff} = 4$, top-$k = 1$. Fine: 8 experts of $d_{ff} = 2$, top-$k = 2$. Segmentation factor $m = 2$.

Coarse FLOPs. One expert runs, doing $2 \cdot d \cdot d_{ff} = 2 \cdot 2 \cdot 4 = 16$ multiply-adds.

Fine FLOPs. Two experts run, each doing $2 \cdot 2 \cdot 2 = 8$ multiply-adds. Total $= 2 \cdot 8 = 16$. Identical to coarse. ✓

Coarse routing. Suppose coarse logits are $s = [2.0, 0.5, -0.2, 0.1]$. Top-1 picks expert 1. Gate $g_1 = 1.0$. The output is $E_1(x)$ with no blending: a single unanimous expert.

Fine routing. The fine model splits expert 1 into two halves $E_{1a}, E_{1b}$, expert 2 into $E_{2a}, E_{2b}$, etc. Suppose the fine router gives logits $s' = [2.1, 1.9, 0.6, 0.4, -0.1, -0.3, 0.2, 0.0]$. Top-2 picks $E_{1a}$ and $E_{1b}$ (i.e., both halves of the same fat expert), so the fine model has chosen to behave almost exactly like the coarse one.

But it does not have to. For a different token, the fine router might pick $E_{1a}$ (the "code-syntax" half of expert 1) together with $E_{3b}$ (the "Python-stdlib" half of expert 3), a hybrid the coarse model literally could not express, because it had to choose expert 1 OR expert 3, never half of each.

Softmax math is identical. Fine gates over the two picks: subtract the max, exponentiate, normalize. $z = [0, -0.2]$, $e^z = [1, 0.819]$, sum $= 1.819$, $g_{1a} = 0.55,\, g_{1b} = 0.45$. The output is $0.55 \cdot E_{1a}(x) + 0.45 \cdot E_{1b}(x)$.

The combinations count. Coarse: 4 possible choices. Fine: $\binom{8}{2} = 28$, a 7× jump from segmenting just once with $m = 2$. The router has 7× more ways to describe what this token is.
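The hand calculation above can be replayed in NumPy. This is a sketch of the routing arithmetic only; the helper `top_k_gates` is a name introduced here, and the expert weights themselves are left out because the walkthrough never specifies them.

```python
import numpy as np

def top_k_gates(logits: np.ndarray, k: int):
    """Return the top-k indices and the softmax gates over just those k logits."""
    idx = np.argsort(-logits, kind="stable")[:k]
    z = logits[idx] - logits[idx].max()        # subtract the max for stability
    g = np.exp(z) / np.exp(z).sum()
    return idx, g

coarse_logits = np.array([2.0, 0.5, -0.2, 0.1])
fine_logits = np.array([2.1, 1.9, 0.6, 0.4, -0.1, -0.3, 0.2, 0.0])

c_idx, c_g = top_k_gates(coarse_logits, k=1)
f_idx, f_g = top_k_gates(fine_logits, k=2)
print(c_idx, c_g)                  # index 0 is "expert 1" in the text's 1-based naming
print(f_idx, np.round(f_g, 2))     # indices 0 and 1 are E_1a and E_1b

# FLOP check from the walkthrough: d = 2, one fat expert vs two slim ones.
d = 2
coarse_macs = 1 * (2 * d * 4)
fine_macs = 2 * (2 * d * 2)
print(coarse_macs, fine_macs)
```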

Visualizing the Decomposition

Toggle between the three modes below. Watch the "Active params" number stay glued at 2 units, while the "Expert combinations" number blows up. Pick a different token to see different specialists light up: the coarse model can only choose among 28 pairs, while the fine model picks 8 of 32 experts, one team out of roughly ten million.

[Interactive figure: fine-grained expert visualizer]

Three observations are worth pausing on. First, the green "active" expert count is identical in every mode — fine-grained MoE is not buying capacity by spending more compute, it is buying expressivity by spending the same compute differently. Second, in fine mode the lit cells are scattered, not clustered: the router can mix knowledge from far corners of the expert grid in a way the coarse mode literally cannot reach. Third, with the shared expert on, the always-on row carries the universal load — so the routed picks are free to be obvious specialists rather than hedged generalists.

Plain Python: From Coarse to Fine

Before the PyTorch version, here is the whole idea in NumPy. We build two MoE configurations that differ only in their $(E, d_{ff}, k)$ triple and verify the FLOP audit by hand.

fine_grained_moe_numpy.py: line-by-line notes (line numbers refer to the listing that follows)

Line 7 (d_model = 4): Hidden width is what we are splitting

d_model is the residual stream width (unchanged). d_ff is what we segment: the coarse model has d_ff=32 inside each expert; the fine model uses d_ff=8, four times narrower. This is the single knob fine-grained decomposition turns.

Execution state: d_model = 4

Line 10 (def build_moe): Build one MoE configuration

We will call this twice: once with the coarse (E=8, d_ff=32) shape and once with the fine (E=32, d_ff=8) shape. The router shape grows with E because there is one logit per expert.

Line 11 (W1): Expert layer 1, stacked across experts

W1 has shape (E, d_ff, d_model). For the fine model that is (32, 8, 4); for the coarse it is (8, 32, 4). The total parameter count of W1, E · d_ff · d_model, is the same in both: 32·8·4 = 8·32·4 = 1024.

Execution state: W1.shape (fine) = (32, 8, 4); W1.shape (coarse) = (8, 32, 4)

Line 12 (W2): Expert layer 2

Same pattern, transposed. The crucial invariant is E · d_ff = constant. Total expert parameters are conserved across both configurations; we are just chopping the same blob of parameters into more, smaller pieces.

Line 13 (W_r): Router widens with E

W_r grows from (8, 4) to (32, 4). The router cost is still trivial compared to the experts themselves: O(E · d_model) is far cheaper than even one expert FFN at realistic widths.

Line 17 (logits = W_r @ x): Score every expert

Same routing step as section 1: a single matvec produces one logit per expert. The fine model produces 32 scores; the coarse, 8.

Execution state: logits.shape (fine) = (32,); logits.shape (coarse) = (8,)

Line 18 (idx = np.argsort(...)[:k]): Top-k scales with the segmentation

Critical line. Coarse uses k=2; fine uses k=8 = 2·m where m=4 is the segmentation factor. We pick MORE experts in the fine model, and that is exactly how we keep FLOPs matched while gaining combinatorial freedom.

Lines 19-20 (z, g): Softmax over the survivors

Same numerical-stability trick (subtract the max). The gate vector for the fine model has 8 nonzero entries summing to 1; the coarse has 2. The router can now spread weight across more specialists.

Execution state: g.shape (fine) = (8,); g.shape (coarse) = (2,)

Line 23 (for gi, i in zip(g, idx)): Loop over the chosen experts

We run k FFNs, scaled by their gates. In the coarse model each FFN is fat (d_ff=32); in the fine model each is slim (d_ff=8). 2 fat FFNs and 8 slim FFNs do the same total work.

Line 24 (h = np.maximum(0, W1[i] @ x)): First half of the expert FFN

W1[i] @ x produces an h of shape (d_ff,). For coarse: (32,). For fine: (8,). ReLU keeps the positive coordinates. Same math, narrower projection.

Line 25 (y += gi * (W2[i] @ h)): Second half + gating

Project back to d_model and scale by the gate. Sum over the k chosen experts. This is where the convex combination is built up.

Line 31 (x = np.array(...)): Same input, two layouts

We hand the same x to both models. They will produce different outputs (different weights), but they spend identical compute. The lesson is what the router can now express, not what FLOPs it costs.

Lines 35-37 (FLOP audit comments): FLOP audit, line by line

Each expert FFN does roughly 2 · d_model · d_ff work per token. Coarse: 2 · (2 · 4 · 32) = 512. Fine: 8 · (2 · 4 · 8) = 512. Conservation of compute is not a coincidence; it is the design constraint that the fine-grained recipe is built around.

import numpy as np

# Coarse:  E = 8 experts of hidden dim d_ff = 32, route top-k = 2
# Fine:    E = 32 experts of hidden dim d_ff = 8,  route top-k = 8
# Both spend the same active FLOPs per token.

d_model = 4
rng = np.random.default_rng(0)

def build_moe(E: int, d_ff: int):
    W1 = rng.standard_normal((E, d_ff, d_model)) * 0.1
    W2 = rng.standard_normal((E, d_model, d_ff)) * 0.1
    W_r = rng.standard_normal((E, d_model)) * 0.1
    return W1, W2, W_r

def moe_forward(x, W1, W2, W_r, k):
    logits = W_r @ x                                # (E,)
    idx = np.argsort(-logits)[:k]                   # top-k experts
    z = logits[idx] - logits[idx].max()
    g = np.exp(z) / np.exp(z).sum()                 # (k,) gates

    y = np.zeros(d_model)
    for gi, i in zip(g, idx):
        h = np.maximum(0, W1[i] @ x)                # expert FFN, half 1
        y += gi * (W2[i] @ h)                       # half 2, scaled by gate
    return y

coarse = build_moe(E=8,  d_ff=32)
fine   = build_moe(E=32, d_ff=8)

x = np.array([0.5, -0.2, 0.9, 0.1])
print("coarse:", moe_forward(x, *coarse, k=2))
print("fine:  ", moe_forward(x, *fine,   k=8))

# FLOP audit: active expert params per token
# coarse: 2 experts × (32·4 + 4·32) = 2 × 256 = 512
# fine:   8 experts × (8·4  + 4·8 ) = 8 × 64  = 512    # identical!

Notice what did not change. The router is still one matvec. Top-k is still one argsort. Softmax still runs over the survivors. The expert FFN is still $W_2 \sigma(W_1 x)$. Every line of the MoE machinery from section 1 transferred unmodified. The only edits were the shapes, and shape edits, it turns out, are where most of the design space lives.

Sanity check. Set $m = 1$ in the formulas above. The fine model collapses back to the coarse one: $E' = E$, $k' = k$, $d_{ff}' = d_{ff}$. Coarse MoE is the special case of fine-grained MoE with no segmentation.

PyTorch: The DeepSeekMoE Layer

Below is a faithful sketch of a DeepSeekMoE block — fine-grained routed experts plus an always-on shared expert. It is a drop-in replacement for the dense FFN inside a transformer block.

deepseek_moe.py: line-by-line notes (line numbers refer to the listing that follows)

Line 5 (class DeepSeekMoE): One module, two kinds of experts

DeepSeekMoE bundles two populations: shared experts (always on, like a residual specialist) and routed experts (top-k, fine-grained). One nn.Module owns both.

Line 11 (d_ff_base): What does d_ff_base mean?

The d_ff of the equivalent COARSE model we are decomposing. If you would have built E=8 experts at d_ff=4·d_model, that is your d_ff_base. The fine experts will each have d_ff = d_ff_base / seg.

Line 12 (seg): The segmentation factor

seg = m, the number of slim experts we cut one fat expert into. m=4 is DeepSeek-V2's choice; m=2 or m=8 are also seen. Higher m means more, slimmer specialists.

Execution state: seg = 4 (typical)

Lines 18-20 (d_ff, self.num_routed, self.k): Compute the fine geometry

d_ff shrinks by seg. num_routed = num_coarse · seg − num_shared, so the TOTAL number of fine experts equals num_coarse · seg, with a few of them peeled off as shared. k grows accordingly: k = k_coarse · seg − num_shared, conserving active FLOPs.

Execution state: d_ff = d_ff_base / seg; num_routed = num_coarse·seg − num_shared; self.k = k_coarse·seg − num_shared

Line 23 (self.shared): Shared expert population

Built like any FFN, but registered separately. In DeepSeek-V2 there are typically one or two shared experts. Their gates are not learned; they fire unconditionally.

Line 28 (self.routed): Routed expert population

The fine-grained specialists. Each is a slim FFN with hidden d_ff = d_ff_base / seg. In a real V3 layer you would see 256 of these per MoE block.

Line 33 (self.router): Router only scores the routed experts

Important detail: shared experts are NOT in the router. They are unconditional, so giving them logits would only let the router silence them, which defeats the purpose. The router output dim is num_routed, not num_routed + num_shared.

Execution state: self.router = Linear(D, E_r)

Lines 36-37 (B, T, D and x_flat): Flatten batch and time

Same pattern as section 1's MoE: every token is a row, and routing decisions are per-token.

Execution state: x_flat.shape = (B·T, D)

Line 40 (y = sum(...)): Shared pass, run unconditionally

We compute the shared experts on EVERY token and start the output as their sum. There is no gating because shared experts are not in the router's softmax; their effective gate is fixed at 1. This is the residual-knowledge pathway.

Line 43 (logits = self.router(x_flat)): Score the routed experts

Same router step as section 1, but the output dim is now num_routed = (num_coarse · seg − num_shared), much wider than the coarse case.

Execution state: logits.shape = (N, E_r)

Line 44 (logits.topk): Top-k with fine-grained k

self.k is now seg-multiplied. With seg=4, num_coarse=8, k_coarse=2, num_shared=1: self.k = 2·4 − 1 = 7. We pick 7 routed experts per token. Combined with the always-on shared expert, that is 8 active specialists, the same total FLOPs as 2 coarse experts.

Line 45 (F.softmax): Softmax over the chosen routed experts only

Gates sum to 1 across the 7 routed picks. The shared expert sits outside this softmax; it contributes alongside the routed gates, not in place of them.

Line 47 (for i in range(self.num_routed)): Loop over experts, gather tokens

Same dispatch pattern as section 1: for each routed expert, gather the tokens that picked it, run one matmul on the gathered batch, scatter back. This pattern is what gives MoE GPU throughput.

Line 53 (y[rows] = y[rows] + ...): Accumulate into the shared baseline

y already holds the shared experts' contribution. We add the gated routed-expert outputs on top. The shared term is like a bias that does not depend on routing; the routed terms refine it per token.

Line 55 (return y.reshape(B, T, D)): Restore (B, T, D)

The fine-grained MoE layer is still a drop-in replacement for a dense FFN. Outside the block, the transformer does not know, or care, how its FFN was built.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSeekMoE(nn.Module):
    """Fine-grained MoE with shared experts (DeepSeekMoE pattern)."""

    def __init__(
        self,
        d_model: int,
        d_ff_base: int,    # the "coarse" d_ff we are decomposing
        seg: int = 4,      # segmentation factor m
        num_coarse: int = 8,
        k_coarse: int = 2,
        num_shared: int = 1,
    ):
        super().__init__()
        d_ff = d_ff_base // seg
        self.num_routed = num_coarse * seg - num_shared
        self.k = k_coarse * seg - num_shared
        self.num_shared = num_shared

        self.shared = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_shared)
        ])
        self.routed = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(self.num_routed)
        ])
        self.router = nn.Linear(d_model, self.num_routed, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        x_flat = x.reshape(B * T, D)

        # 1) Shared experts run on every token (zeros start keeps num_shared=0 valid).
        y = sum((s(x_flat) for s in self.shared), torch.zeros_like(x_flat))

        # 2) Routed experts: top-k of the (num_routed) specialists.
        logits = self.router(x_flat)                      # (N, E_r)
        topk_w, topk_idx = logits.topk(self.k, dim=-1)
        topk_w = F.softmax(topk_w, dim=-1)

        for i in range(self.num_routed):
            mask = topk_idx == i
            if not mask.any():
                continue
            rows, slot = mask.nonzero(as_tuple=True)
            gate = topk_w[rows, slot].unsqueeze(-1)
            y[rows] = y[rows] + gate * self.routed[i](x_flat[rows])

        return y.reshape(B, T, D)

Three things to internalize from the PyTorch version:

  1. The shared expert is not in the router. Its gate is implicitly 1. If you accidentally put it in the router's output dimension, the softmax can suppress it, and then it stops doing its job (absorbing common knowledge). Keep the populations strictly separate.
  2. k grows by segmentation factor minus shared count. The arithmetic $k = k_{\text{coarse}} \cdot m - n_s$ is the FLOP-conservation contract. Get it wrong and your "cheap" fine MoE is either slower than the coarse baseline or under-utilizes its experts.
  3. Routing is still per-token, still discrete. Everything from section 1 about top-k being non-differentiable in indices but differentiable through gate values still applies. The fine-grained recipe does not change the gradient story; it just changes the geometry of what the router has to learn.

Implementation note. In production frameworks (the actual DeepSeek-V3 code, Megatron-LM, vLLM's MoE backend) the per-expert Python loop is replaced with a single fused grouped-GEMM against a stacked expert tensor of shape $(E_r, d_{ff}', d)$. Functionally identical, dramatically faster on a GPU. The number of experts is now large enough (256 in V3) that the per-expert loop overhead would otherwise dominate.
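To give a feel for what "a stacked expert tensor" buys, the NumPy sketch below compares the per-expert loop with a loop-free version that indexes stacked weights of shape $(E_r, d_{ff}', d)$. It is deliberately not a grouped GEMM: it materializes a copy of the selected weights for every token, which real kernels avoid by sorting tokens by expert. All function names, the ReLU activation, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def routed_forward_loop(x, W1, W2, idx, gates):
    """Reference: per-expert Python loop with gather and scatter."""
    y = np.zeros_like(x)
    for e in range(W1.shape[0]):
        rows, slot = np.nonzero(idx == e)          # tokens that picked expert e
        if rows.size == 0:
            continue
        h = np.maximum(0.0, x[rows] @ W1[e].T)     # (n_e, d_ff)
        y[rows] += gates[rows, slot][:, None] * (h @ W2[e].T)
    return y

def routed_forward_stacked(x, W1, W2, idx, gates):
    """Loop-free stand-in: gather each token's k expert weights, then einsum."""
    W1_sel = W1[idx]                               # (N, k, d_ff, d)
    W2_sel = W2[idx]                               # (N, k, d, d_ff)
    h = np.maximum(0.0, np.einsum("nkfd,nd->nkf", W1_sel, x))
    out = np.einsum("nkdf,nkf->nkd", W2_sel, h)    # (N, k, d)
    return (gates[..., None] * out).sum(axis=1)

rng = np.random.default_rng(0)
N, d, d_ff, E_r, k = 16, 4, 8, 32, 8               # toy sizes
x = rng.standard_normal((N, d))
W1 = rng.standard_normal((E_r, d_ff, d)) * 0.1     # stacked first layers
W2 = rng.standard_normal((E_r, d, d_ff)) * 0.1     # stacked second layers

logits = rng.standard_normal((N, E_r))             # stand-in for router output
idx = np.argsort(-logits, axis=1)[:, :k]           # (N, k) top-k expert ids
top = np.take_along_axis(logits, idx, axis=1)
z = top - top.max(axis=1, keepdims=True)
gates = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

y_loop = routed_forward_loop(x, W1, W2, idx, gates)
y_stacked = routed_forward_stacked(x, W1, W2, idx, gates)
print(np.allclose(y_loop, y_stacked))
```

The two functions agree numerically; the difference is that the second expresses the whole layer as a couple of batched contractions, which is the property fused kernels exploit.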

What Changes at Massive Scale

The toy in this section has 32 experts. DeepSeek-V3 has 256 routed experts per MoE block plus a shared one, picks top-8 routed per token, and stacks 58 of these blocks. The table below puts it next to three other production models:

| Model | Total experts / layer | Routed top-k | Shared | Active / total |
|---|---|---|---|---|
| Mixtral 8×7B (coarse) | 8 | 2 | 0 | 13B / 47B |
| DeepSeek-V2 (fine + shared) | 160 routed + 2 shared | 6 | 2 | 21B / 236B |
| DeepSeek-V3 (fine + shared) | 256 routed + 1 shared | 8 | 1 | 37B / 671B |
| Qwen3-235B-A22B | 128 routed | 8 | 0 | 22B / 235B |

The trend is clear, if not strictly monotone. Newer MoE models tend to have more experts, narrower experts, and a higher top-k. The active fraction is not constant: it falls from about 28% for Mixtral to roughly 5 to 9% for the fine-grained models, because total parameters grow much faster than active ones. The combinatorial richness of routing differs far more. DeepSeek-V3's router can address $\binom{256}{8} \approx 4.1 \times 10^{14}$ distinct expert teams; Mixtral's router can address 28.
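The team counts and active fractions implied by the table can be recomputed directly. In the sketch below the figures are copied from the table above; the tuple layout and the `stats` dictionary are conveniences introduced here.

```python
from math import comb

# (name, routed experts, routed top-k, active params in B, total params in B)
models = [
    ("Mixtral 8x7B", 8, 2, 13, 47),
    ("DeepSeek-V2", 160, 6, 21, 236),
    ("DeepSeek-V3", 256, 8, 37, 671),
    ("Qwen3 MoE", 128, 8, 22, 235),
]

stats = {}
for name, n_routed, k, active_b, total_b in models:
    teams = comb(n_routed, k)               # distinct routed expert teams
    fraction = active_b / total_b           # share of parameters active per token
    stats[name] = (teams, fraction)
    print(f"{name:13s} teams={teams:.2e}  active fraction={fraction:.1%}")
```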

The bottleneck shifts again. With coarse MoE, the bottleneck was load balance (chapter 6). With fine-grained MoE, two new costs appear. The first is router cost: a Linear from $d_{model}$ to 256 is still small, but it runs once per token in every layer, so it is no longer negligible. The second is kernel launch overhead, because dispatching 8 small matmuls per token across 256 expert weight banks is harder than dispatching 2. Both have well-engineered solutions (grouped GEMMs, token shuffling, all-to-all overlap), and we will get to them in the expert-parallelism section.

The scaling-law angle

The DeepSeekMoE paper (Dai et al., 2024) ran the experiment carefully: at a fixed active-parameter budget, finer segmentation reliably lowers loss. The improvement is not free — communication and routing cost grow — but on the scaling-law curve, fine-grained MoE sits below the coarse-MoE curve, which sits below the dense curve. Each move buys compute-efficient capacity at the cost of systems complexity.

The Engineering Reality

Three things go wrong if you implement fine-grained MoE naively, and each has shaped how production systems actually look.

  1. Tiny kernels. A 256-way expert pool with per-expert hidden dim $d_{ff}/m \approx 384$ means each expert FFN is a small matmul. Without fusion, the GPU spends more time launching kernels than computing. Solution: grouped GEMM kernels (CUTLASS, Triton) that pack the per-expert work into one big matmul with per-row expert ids.
  2. Routing skew. Finer experts are easier to under-train. If one of 256 experts only sees 0.1% of the tokens, its weights drift toward zero and the router learns to ignore it. Bias-term load balancing (chapter 6, section 3) is the DeepSeek-specific fix: add a small per-expert bias to the router logits, increase it for under-used experts, decrease it for over-used ones. Because the bias only affects which experts are selected, not the gate values, this pushes the distribution back toward uniform without distorting gradients.
  3. Memory-bandwidth pressure. With 256 experts sharing one residual stream, every token forward pass has to read the weights of 8 small experts and the shared one. For models too large to fit on one GPU, those 9 expert reads come from 9 potentially-different GPUs, which means an all-to-all collective. We will spend the next sections on the parallelism primitives that make that affordable.
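The bias-term fix from item 2 can be illustrated with a toy simulation. This is a sketch, not DeepSeek's training code: random scores with a fixed skew stand in for a real router, the step size `gamma` is a hypothetical value, and the sign-based update is one simple way to implement "raise the bias of under-used experts, lower it for over-used ones".

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts, k = 4096, 16, 4
gamma = 0.01                                   # hypothetical bias step size

# Skewed toy router: a fixed per-expert offset makes some experts favorites.
skew = np.linspace(1.0, -1.0, n_experts)
bias = np.zeros(n_experts)

def expert_load(bias):
    """Tokens received by each expert for one fresh batch of router scores."""
    scores = rng.standard_normal((n_tokens, n_experts)) + skew
    # The bias only influences which experts are selected, not the gate values.
    idx = np.argsort(-(scores + bias), axis=1)[:, :k]
    return np.bincount(idx.ravel(), minlength=n_experts)

load_before = expert_load(bias)
for _ in range(300):
    load = expert_load(bias)
    # Raise the bias of under-used experts, lower it for over-used ones.
    bias += gamma * np.sign(load.mean() - load)
load_after = expert_load(bias)

spread_before = load_before.max() - load_before.min()
spread_after = load_after.max() - load_after.min()
print("load spread before:", spread_before)
print("load spread after: ", spread_after)
```

In this toy the gap between the busiest and idlest expert shrinks substantially once the biases have adapted, without any auxiliary loss term entering the gradient.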

Despite all this, the trade keeps paying off. DeepSeek-V3 reached GPT-4-class quality on a fraction of the active compute precisely because fine-grained decomposition let it carry 671B parameters of knowledge while running only 37B per token. The recipe is now standard across the frontier of open-weight models, and the version you see in 2025-era papers is almost always this one: many slim experts, high top-k, one or two shared experts, grouped-GEMM kernels, bias-term load balancing.

Where we go from here. Section 4 of this chapter opens up the auxiliary losses that keep a 256-way router from collapsing onto a handful of favorites. Then we turn to expert parallelism — how the experts are actually sharded across a cluster, and how the all-to-all communication is overlapped with compute so a trillion-parameter MoE can train at all.

The one sentence to carry forward: fine-grained decomposition buys you combinatorially more expert teams at essentially no extra FLOP cost, and shared experts buy you specialization by offloading the boring universal knowledge to a separate, always-on pathway. Together, they are why a 671B model can think like a 671B model while computing like a 37B one.
