The MoE layer we built in section 1 had eight experts and routed two of them per token. That works — but if you train it long enough you find a strange disease setting in. The router keeps sending the same combinations of experts together. The diversity it was supposed to buy you collapses into a handful of habitual pairings. Every expert ends up a half-cardiologist, half-dermatologist, hedging against the possibility that it might be the only one called.
Fine-grained expert decomposition is DeepSeek's answer to that pathology. The recipe is almost embarrassingly simple: slice each fat expert into $m$ slimmer ones, and route $m$ times as many of them per token. The FLOPs are unchanged. The number of expert combinations explodes. Specialists actually get to specialize.
The bet of fine-grained MoE. Keep the active compute per token fixed. Multiply the router's set of possible expert teams by orders of magnitude. Let combinatorics, not parameter counts, do the work of capacity.
The Redundancy Problem
In the classic setup, the router can build at most $\binom{8}{2} = 28$ different expert teams. That is the entire library of two-expert "committees" the model can ever assemble for a token. Twenty-eight is not much when the dataset contains code, English prose, mathematical notation, German, Bengali, Python tracebacks, JSON, and Shakespeare, and each of those domains has dozens of sub-modes.
What happens in practice is that every expert is forced to be a mostly-generalist. Consider expert 3. Across training, it gets paired with experts {1, 5} for code, {2, 6} for math, {4, 7} for poetry, and so on. Because expert 3 cannot guarantee which partner it will run with, it cannot afford to be too specialized; it has to carry some baseline competence in every domain so that, when it runs alongside the "wrong" partner, the pair still produces a reasonable output. Multiply that hedging behavior over all eight experts and you have eight slightly-watered-down generalists, not eight crisp specialists.
The problem is geometric. Specialization needs combinatorial elbow room. With only 28 possible teams, the router cannot give each expert a narrow niche to live in — there are not enough niches to go around.
The Intuition: Specialists, Not Generalists
Go back to the hospital analogy from section 1. Imagine your hospital has 8 specialists, and on every patient you must pick 2 of them. You cannot afford a hyper-narrow expert — say, a left-knee surgeon — because with only 28 possible pairings, the chance that a left-knee patient ever arrives and gets routed to the left-knee surgeon is too small to justify keeping her on staff. So everyone trains toward broad general practice.
Now multiply the staff. Same building, same total payroll, same total examination-hours per day, but instead of 8 broad specialists you hire 32 narrow ones, and every patient gets seen by 8 of them, each for a quarter as long. Each visit still costs the same number of person-hours. But now the left-knee orthopod is on every left-knee case, the pediatric-cardio specialist runs on every newborn-with-murmur, and so on. The system can afford narrowness because narrowness shows up often enough to train.
That is exactly the recipe. Cut every expert in 4 along the FFN hidden dimension. You now have 32 thinner experts, each with one-quarter the capacity. Route 8 per token instead of 2. Active parameters per token are identical: $8 \times \tfrac{1}{4} = 2$ coarse experts' worth. Total parameters are identical too: $32 \times \tfrac{1}{4} = 8$ coarse experts' worth. Nothing changed in the FLOP budget. Everything changed in the combinatorics.
The Mathematical Idea
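Let the coarse MoE have $N$ experts, hidden size $d_{ff}$, top-$k$ routing. Pick a segmentation factor $m$. The fine-grained MoE has:
$mN$ experts, each of hidden size $d_{ff}/m$, with top-$mk$ routing.
Each new slim expert is still a full two-layer FFN, just narrower: $E_i(x) = W_{2,i}\,\sigma(W_{1,i}\,x)$ with $W_{1,i} \in \mathbb{R}^{(d_{ff}/m) \times d}$ and $W_{2,i} \in \mathbb{R}^{d \times (d_{ff}/m)}$, where $d$ is the model width and $\sigma$ the activation. The MoE output is the same convex combination as before, just over a bigger pool: $y = \sum_{i \in \mathcal{T}} g_i\,E_i(x)$, where $\mathcal{T}$ has size $mk$.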
The two conservation laws
Fine-grained decomposition is engineered around two invariants. Both must hold or the recipe is no longer FLOP-matched.
Conservation of active FLOPs. Per token, the active expert work is $mk \cdot 2d\,\tfrac{d_{ff}}{m} = k \cdot 2d\,d_{ff}$ multiply-accumulates (each expert holds two matrices of $d \cdot d_{ff}/m$ weights). The $m$s cancel exactly. The fine model and the coarse model run the same number of multiply-accumulates per token.
Conservation of total expert parameters. Across the whole layer, $mN \cdot 2d\,\tfrac{d_{ff}}{m} = N \cdot 2d\,d_{ff}$. The same blob of parameters has been chopped finer, not enlarged.
The only thing that grows is the router. It now produces $mN$ logits instead of $N$, costing $O(d \cdot mN)$ work, a rounding error next to even one expert evaluation.
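The two conservation laws and the router cost can be checked with plain arithmetic. The sketch below is illustrative only: the `MoEConfig` class, the helper names, and the sizes (model width 1024, coarse hidden size 4096) are assumptions chosen for the example, not values from any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    n_experts: int  # experts in the layer
    d_hidden: int   # per-expert FFN hidden size
    top_k: int      # experts run per token


def segment(cfg: MoEConfig, m: int) -> MoEConfig:
    """Split every expert into m slimmer ones and route m times as many."""
    if cfg.d_hidden % m != 0:
        raise ValueError("hidden size must be divisible by m")
    return MoEConfig(cfg.n_experts * m, cfg.d_hidden // m, cfg.top_k * m)


def active_macs(cfg: MoEConfig, d_model: int) -> int:
    # Each active expert applies two matrices of d_model * d_hidden weights.
    return cfg.top_k * 2 * d_model * cfg.d_hidden


def expert_params(cfg: MoEConfig, d_model: int) -> int:
    return cfg.n_experts * 2 * d_model * cfg.d_hidden


def router_macs(cfg: MoEConfig, d_model: int) -> int:
    # One matvec from d_model to one logit per expert.
    return d_model * cfg.n_experts


d_model = 1024                                  # hypothetical model width
coarse = MoEConfig(n_experts=8, d_hidden=4096, top_k=2)
fine = segment(coarse, m=4)

for name, cfg in (("coarse", coarse), ("fine", fine)):
    print(name, cfg,
          "active MACs:", active_macs(cfg, d_model),
          "expert params:", expert_params(cfg, d_model),
          "router MACs:", router_macs(cfg, d_model))
```

Only the router line differs between the two printed rows, and it stays orders of magnitude below the expert work.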
Why Combinations Matter
The interesting quantity is the size of the router's search space. For coarse routing this is $\binom{N}{k}$; for fine routing it is $\binom{mN}{mk}$. The ratio is what we care about:
$$\frac{\binom{mN}{mk}}{\binom{N}{k}}.$$
With $N = 8$, $k = 2$, plug in a few values of $m$ and the explosion is visible:
| m | Total experts | Active k | Combinations | × vs coarse |
|---|---|---|---|---|
| 1 (coarse) | 8 | 2 | 28 | 1× |
| 2 | 16 | 4 | 1,820 | 65× |
| 4 (DeepSeek-V2) | 32 | 8 | 10,518,300 | ≈ 376,000× |
| 8 | 64 | 16 | ≈ 4.89 × 10¹⁴ | ≈ 1.7 × 10¹³× |
At $m = 4$ the router has roughly 376,000 times more expert teams to choose from than it did before, at the exact same compute budget. By $m = 8$, the number is large enough that the router could in principle pick a unique team for every token in a multi-billion-token training set.
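The counts in the table can be reproduced with the standard library's `math.comb`. This is a small sketch; the function name `routing_combinations` is made up for the example.

```python
from math import comb


def routing_combinations(n_experts: int, top_k: int, m: int) -> int:
    """Number of distinct expert teams after segmenting by a factor m."""
    return comb(m * n_experts, m * top_k)


N, K = 8, 2                      # the coarse baseline used in this section
coarse_teams = routing_combinations(N, K, 1)

rows = []
for m in (1, 2, 4, 8):
    teams = routing_combinations(N, K, m)
    rows.append((m, m * N, m * K, teams, teams / coarse_teams))
    print(f"m={m:<2} experts={m * N:<3} active={m * K:<3} "
          f"teams={teams:,} ratio={teams / coarse_teams:,.0f}x")
```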
Shared Experts: Isolating Common Knowledge
Even with combinatorial elbow room, some redundancy remains. There is knowledge that every token needs — basic English syntax, common subword statistics, residual-stream housekeeping. If we make every routed expert learn that baseline, we are wasting capacity. So the second half of the DeepSeekMoE design peels off one or two experts and marks them shared: they fire on every token, unconditionally, outside the router's softmax.
With $K_s$ shared experts and $mN - K_s$ routed experts, the layer computes:
$$y = \sum_{j=1}^{K_s} E^{(s)}_j(x) + \sum_{i \in \mathcal{T}} g_i\,E_i(x), \qquad |\mathcal{T}| = mk - K_s.$$
The shared term has no gate, or rather an implicit gate of 1. The routed term works as before, except the router only scores and selects among the routed experts. To keep total active FLOPs matched against the coarse baseline, you reduce the routed top-$mk$ by $K_s$: with $m = 4$ and $K_s = 1$ on the running $N = 8$, $k = 2$ example, the standard configuration is 1 shared + 31 routed, picking top-7 of the routed pool (1 + 7 = 8 active fine experts = 2 coarse-expert FLOPs).
The training signal becomes much cleaner. Routed experts no longer compete for the right to learn boring universal patterns; the shared expert absorbs that load. The router's gradient now talks almost exclusively about specialization, because the things every token needs are handled before the router even runs.
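The bookkeeping for shared experts is easy to get wrong by one, so here is a minimal sketch of it. The function name and the dictionary keys are hypothetical; the arithmetic follows the rule stated above (the routed top-k shrinks by the number of shared experts).

```python
from math import comb


def deepseek_moe_counts(n_coarse: int, k_coarse: int, m: int, n_shared: int) -> dict:
    """Expert counts for a fine-grained layer with always-on shared experts."""
    total_fine = m * n_coarse        # all slim experts, shared included
    active_fine = m * k_coarse       # slim experts run per token, shared included
    if not 0 <= n_shared < active_fine:
        raise ValueError("shared experts must leave at least one routed slot")
    routed = total_fine - n_shared
    routed_top_k = active_fine - n_shared
    return {
        "shared": n_shared,
        "routed": routed,
        "routed_top_k": routed_top_k,
        "active_fine": n_shared + routed_top_k,
        "routed_teams": comb(routed, routed_top_k),
    }


counts = deepseek_moe_counts(n_coarse=8, k_coarse=2, m=4, n_shared=1)
print(counts)
```

Note that reserving a slot for the shared expert shrinks the routed search space somewhat (the router now picks 7 of 31 rather than 8 of 32); the design trades a little combinatorial room for a cleaner training signal.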
Manual Numerical Walkthrough
Let's route one toy token through both a coarse MoE and a fine-grained MoE, and verify by hand that the FLOPs match while the expressed function does not.
Setup. Token $x \in \mathbb{R}^{d}$. Coarse: 4 experts of hidden size $d_{ff}$, top-1. Fine: 8 experts of hidden size $d_{ff}/2$, top-2. Segmentation factor $m = 2$.
Coarse FLOPs. One expert runs, doing $2d\,d_{ff}$ multiply-adds.
Fine FLOPs. Two experts run, each doing $2d \cdot d_{ff}/2 = d\,d_{ff}$ multiply-adds. Total = $2d\,d_{ff}$. Identical to coarse. ✓
Coarse routing. Suppose the coarse logits are largest for expert 1. Top-1 picks expert 1. Gate $g_1 = 1$. The output is $y = E_1(x)$ with no blending: a single unanimous expert.
Fine routing. The fine model splits expert 1 into two halves $1a, 1b$, expert 2 into $2a, 2b$, etc. Suppose the fine router's two largest logits belong to $1a$ and $1b$. Top-2 picks $1a$ and $1b$ (i.e., both halves of the same fat expert): the fine model has chosen to behave almost exactly like the coarse one.
But it does not have to. For a different token, the fine router might pick $1a$ (the "code-syntax" half of expert 1) together with $3b$ (the "Python-stdlib" half of expert 3), a hybrid the coarse model literally could not express, because it had to choose expert 1 OR expert 3, never half of each.
Softmax math is identical. Fine gates over the two picks: subtract the max, exponentiate, normalize. With the two selected logits 0.2 apart, $e^{0} = 1$, $e^{-0.2} \approx 0.819$, sum = 1.819, $g \approx (0.550, 0.450)$. The output is $y \approx 0.550\,E_{1a}(x) + 0.450\,E_{1b}(x)$, taking $1a$ as the higher-scoring half.
The combinations count. Coarse: 4 possible choices. Fine: $\binom{8}{2} = 28$, a 7× jump from segmenting just once with $m = 2$. The router has 7× more ways to describe what this token is.
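The gate arithmetic in the walkthrough can be checked in a few lines. The logit values below are hypothetical, chosen only so that the two winners are the halves of expert 1 and sit 0.2 apart; the helper name `top_k_gates` is also made up for the example.

```python
import math
from math import comb


def top_k_gates(names, logits, k):
    """Pick the k largest logits and softmax over the survivors only."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = logits[order[0]]                       # subtract the max for stability
    exps = [math.exp(logits[i] - peak) for i in order]
    total = sum(exps)
    return [(names[i], e / total) for i, e in zip(order, exps)]


names = ["1a", "1b", "2a", "2b", "3a", "3b", "4a", "4b"]
fine_logits = [2.0, 1.8, 0.3, 0.1, 0.5, -0.2, 0.0, -1.0]   # hypothetical values

picks = top_k_gates(names, fine_logits, k=2)
for name, gate in picks:
    print(f"expert {name}: gate {gate:.3f}")

# Team counts for the toy setup: top-1 of 4 coarse vs top-2 of 8 fine.
print("coarse teams:", comb(4, 1), "fine teams:", comb(8, 2))
```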
Visualizing the Decomposition
[Interactive figure: a toggle between three modes. The "Active params" readout stays at 2 units in every mode, while the "Expert combinations" readout blows up. Selecting a different token lights up different specialists: the coarse model can only choose among 28 pairs, while the fine model picks 8 of 32 experts, one of roughly ten million possible teams.]
Three observations are worth pausing on. First, the green "active" expert count is identical in every mode — fine-grained MoE is not buying capacity by spending more compute, it is buying expressivity by spending the same compute differently. Second, in fine mode the lit cells are scattered, not clustered: the router can mix knowledge from far corners of the expert grid in a way the coarse mode literally cannot reach. Third, with the shared expert on, the always-on row carries the universal load — so the routed picks are free to be obvious specialists rather than hedged generalists.
Plain Python: From Coarse to Fine
Before the PyTorch version, here is the whole idea in NumPy. We build two MoE configurations that differ only in their $(N, d_{ff}, k)$ triple and verify the FLOP audit by hand.
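A minimal NumPy sketch of that comparison follows. It is a reconstruction under stated assumptions rather than a reference implementation: the function names, the ReLU activation, the 0.02 initialization scale, and the toy sizes (model width 16, coarse hidden size 64) are all choices made for the example.

```python
import numpy as np


def init_moe(rng, d_model, n_experts, d_hidden):
    """Random router and expert weights; the 0.02 scale is an arbitrary choice."""
    return {
        "router": rng.normal(0.0, 0.02, (n_experts, d_model)),
        "w1": rng.normal(0.0, 0.02, (n_experts, d_hidden, d_model)),
        "w2": rng.normal(0.0, 0.02, (n_experts, d_model, d_hidden)),
    }


def moe_forward(params, x, top_k):
    logits = params["router"] @ x                  # router: one matvec
    picked = np.argsort(logits)[-top_k:][::-1]     # top-k: one argsort
    z = logits[picked] - logits[picked].max()
    gates = np.exp(z) / np.exp(z).sum()            # softmax over the survivors
    y = np.zeros_like(x)
    macs = 0
    for g, i in zip(gates, picked):
        h = np.maximum(params["w1"][i] @ x, 0.0)   # ReLU is an assumption
        y += g * (params["w2"][i] @ h)
        macs += params["w1"][i].size + params["w2"][i].size
    return y, picked, macs


rng = np.random.default_rng(0)
d_model = 16
x = rng.normal(size=d_model)

# Same parameter budget, two different (N, d_ff, k) triples, m = 4.
coarse = init_moe(rng, d_model, n_experts=8, d_hidden=64)
fine = init_moe(rng, d_model, n_experts=32, d_hidden=16)

y_c, picked_c, macs_c = moe_forward(coarse, x, top_k=2)
y_f, picked_f, macs_f = moe_forward(fine, x, top_k=8)

print("coarse picks", picked_c, "MACs", macs_c)
print("fine picks  ", picked_f, "MACs", macs_f)
```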
Notice what did not change. The router is still one matvec. Top-k is still one argsort. Softmax still runs over the survivors. The expert FFN is still $W_2\,\sigma(W_1 x)$. Every line of the MoE machinery from section 1 transferred unmodified. The only edits were the shapes, and shape edits, it turns out, are where most of the design space lives.
Sanity check. Set $m = 1$ in the formulas above. The fine model collapses back to the coarse one: $mN = N$, $d_{ff}/m = d_{ff}$, $mk = k$. Coarse MoE is the special case of fine-grained MoE with no segmentation.
PyTorch: The DeepSeekMoE Layer
Below is a faithful sketch of a DeepSeekMoE block — fine-grained routed experts plus an always-on shared expert. It is a drop-in replacement for the dense FFN inside a transformer block.
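The sketch below shows one way such a block could look. Treat it as an illustration under assumptions: the class names, the SiLU activation, the bias-free linears, and the per-expert Python loop (readable, but slow compared with a grouped-GEMM dispatch) are choices made here, and the gate is a softmax over the top-k survivors to match the convention used in this section rather than any particular released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A slim two-layer FFN; the SiLU activation is an assumption."""

    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))


class DeepSeekMoESketch(nn.Module):
    def __init__(self, d_model, d_ff, n_coarse=8, k_coarse=2, m=4, n_shared=1):
        super().__init__()
        d_hidden = d_ff // m                        # each slim expert is 1/m as wide
        n_routed = m * n_coarse - n_shared
        self.top_k = m * k_coarse - n_shared        # FLOP-conservation contract
        self.shared = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_shared)])
        self.routed = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed, bias=False)  # routed experts only

    def forward(self, x):
        shape = x.shape
        x = x.reshape(-1, shape[-1])                # flatten to (tokens, d_model)
        out = torch.zeros_like(x)
        for expert in self.shared:                  # always on, implicit gate of 1
            out = out + expert(x)
        top_vals, top_idx = self.router(x).topk(self.top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)         # softmax over the survivors
        for e, expert in enumerate(self.routed):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                            # no token picked this expert
            weighted = gates[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
            out = out.index_add(0, token_ids, weighted)
        return out.reshape(shape)


torch.manual_seed(0)
layer = DeepSeekMoESketch(d_model=32, d_ff=128)
tokens = torch.randn(2, 5, 32)                      # (batch, sequence, d_model)
result = layer(tokens)
print(result.shape, "routed:", len(layer.routed), "top-k:", layer.top_k)
```

The input and output shapes match, which is what lets a module like this stand in for a dense FFN inside a transformer block.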
Three things to internalize from the PyTorch version:
- The shared expert is not in the router. Its gate is implicitly 1. If you accidentally put it in the router's output dimension, the softmax can suppress it — and then it stops doing its job (absorbing common knowledge). Keep the populations strictly separate.
- The routed top-k is $mk - K_s$: multiply $k$ by the segmentation factor, then subtract the shared count. The arithmetic is the FLOP-conservation contract. Get it wrong and your "cheap" fine MoE is either slower than the coarse baseline or under-utilizes its experts.
- Routing is still per-token, still discrete. Everything from section 1 about top-k being non-differentiable in indices but differentiable through gate values still applies. The fine-grained recipe does not change the gradient story — it just changes the geometry of what the router has to learn.
What Changes at Massive Scale
The toy in this section has 32 experts. DeepSeek-V3 has 256 routed experts per MoE block plus a shared one, picks top-8 routed per token, and stacks 58 of these blocks. Three numbers tell the production story:
| Model | Total experts / layer | Routed top-k | Shared | Active / total |
|---|---|---|---|---|
| Mixtral 8×7B (coarse) | 8 | 2 | 0 | 13B / 47B |
| DeepSeek-V2 (fine + shared) | 160 routed + 2 shared | 6 | 2 | 21B / 236B |
| DeepSeek-V3 (fine + shared) | 256 routed + 1 shared | 8 | 1 | 37B / 671B |
| Qwen3-235B-A22B | 128 routed | 8 | 0 | 22B / 235B |
The trend is clear, if not strictly monotone: compared with Mixtral, the newer models use many more, much narrower experts and a higher top-k. Active parameters stay in the tens of billions while totals grow several-fold (the active fraction drops from roughly 28% for Mixtral to between 5% and 10% for the others), and the combinatorial richness of routing is wildly different. DeepSeek-V3's router can address $\binom{256}{8} \approx 4.1 \times 10^{14}$ distinct expert teams; Mixtral's router can address 28.
The bottleneck shifts again. With coarse MoE, the bottleneck was load balance (chapter 6). With fine-grained MoE, two new costs appear. The first is router cost, which is no longer negligible: a Linear from $d$ to 256 is still small, but it runs once per token in every layer. The second is kernel launch overhead, because dispatching 8 small matmuls per token across 256 expert weight banks is harder than dispatching 2. Both have well-engineered solutions (grouped GEMMs, token shuffling, all-to-all overlap), and we will get to them in the expert-parallelism section.
The scaling-law angle
The DeepSeekMoE paper (Dai et al., 2024) ran the experiment carefully: at a fixed active-parameter budget, finer segmentation reliably lowers loss. The improvement is not free — communication and routing cost grow — but on the scaling-law curve, fine-grained MoE sits below the coarse-MoE curve, which sits below the dense curve. Each move buys compute-efficient capacity at the cost of systems complexity.
The Engineering Reality
Three things go wrong if you implement fine-grained MoE naively, and each has shaped how production systems actually look.
- Tiny kernels. A 256-way expert pool with a per-expert hidden dim of only $d_{ff}/m$ means each expert FFN is a small matmul. Without fusion, the GPU spends more time launching kernels than computing. Solution: grouped GEMM kernels (CUTLASS, Triton) that pack the per-expert work into one big matmul with per-row expert ids.
- Routing skew. Finer experts are easier to under-train. If one of 256 experts only sees 0.1% of the tokens, its weights drift toward zero and the router learns to ignore it. Bias-term load balancing (chapter 6, section 3) is the DeepSeek-specific fix: add a small per-expert bias to the router logits, increase it for under-used experts, and decrease it for over-used ones. This pushes the distribution back toward uniform without distorting gradients.
- Memory-bandwidth pressure. With 256 experts sharing one residual stream, every token forward pass has to read the weights of 8 small experts and the shared one. For models too large to fit on one GPU, those 9 expert reads come from 9 potentially-different GPUs — an all-to-all collective. We will spend the next sections on the parallelism primitives that make that affordable.
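The bias-term idea from the routing-skew item above can be sketched as a toy simulation in plain Python. Everything here is an assumption made for illustration: the function names, the fixed step size of 0.02, the Gaussian noise standing in for per-token variation, and the deliberately skewed base scores. The bias is added only when choosing which experts run; in a real layer the gate values would still come from the unbiased scores.

```python
import random


def select_top_k(scores, bias, k):
    """Choose k experts by biased score; the bias affects selection only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return ranked[:k]


def update_bias(bias, counts, step):
    """Nudge the bias up for under-used experts and down for over-used ones."""
    mean = sum(counts) / len(counts)
    new_bias = []
    for b, c in zip(bias, counts):
        if c < mean:
            new_bias.append(b + step)
        elif c > mean:
            new_bias.append(b - step)
        else:
            new_bias.append(b)
    return new_bias


def run_batch(rng, base, bias, k, n_tokens):
    """Route one batch and return how many tokens each expert received."""
    counts = [0] * len(base)
    for _ in range(n_tokens):
        scores = [b + rng.gauss(0.0, 0.5) for b in base]   # per-token router scores
        for i in select_top_k(scores, bias, k):
            counts[i] += 1
    return counts


rng = random.Random(0)
n_experts, k = 16, 4
base = [0.1 * (n_experts - i) for i in range(n_experts)]  # skewed on purpose
bias = [0.0] * n_experts

first = run_batch(rng, base, bias, k, n_tokens=256)
counts = first
for _ in range(300):
    bias = update_bias(bias, counts, step=0.02)
    counts = run_batch(rng, base, bias, k, n_tokens=256)

spread_before = max(first) - min(first)
spread_after = max(counts) - min(counts)
print("load spread before:", spread_before, "after:", spread_after)
```

The update uses only the sign of each expert's load error, which keeps the balancing signal out of the loss function entirely.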
Despite all this, the trade keeps paying off. DeepSeek-V3 reached GPT-4-class quality on a fraction of the active compute precisely because fine-grained decomposition let it carry 671B parameters of knowledge while running only 37B per token. The recipe is now standard across the frontier of open-weight models, and the version you see in 2025-era papers is almost always this one: many slim experts, high top-k, one or two shared experts, grouped-GEMM kernels, bias-term load balancing.
The one sentence to carry forward: fine-grained decomposition buys you exponentially more expert combinations at zero FLOP cost, and shared experts buy you specialization by offloading the boring universal knowledge to a separate, always-on pathway. Together, they are why a 671B model can think like a 671B model while computing like a 37B one.