In the previous section we sliced each big expert into many tiny ones and increased the number of activations per token. That gave the router more combinations to choose from and let each expert specialize harder. But it also exposed an embarrassing redundancy hiding inside vanilla MoE: every expert was independently relearning the same basic facts. Shared experts are DeepSeek's fix: a small set of always-on FFNs that absorb the common-knowledge load so the routed pool is free to specialize.
The bet of shared experts. Hand the boring common knowledge — whitespace, punctuation, basic syntax, the word "the" — to a small always-on block. Let the routed experts do nothing but specialize. You spend a little more compute per token; you stop wasting model capacity on duplicated baselines.
The Redundancy Problem
Look inside a trained vanilla MoE and inspect the weights of each routed expert. A predictable, expensive pattern appears: every expert has independently learned roughly the same basic features. The first layers of every expert look almost alike. The router keeps picking different experts for different tokens, but each expert had to re-derive — from scratch — the parts of language that are universal.
Why does this happen? Top-$k$ routing is a hard partition. A token like the word "the" can only land on $k$ experts at a time. Across the training corpus, every expert eventually sees enough "the"s, "a"s, and commas that it has no choice but to learn how to handle them. The result is that a sizable fraction of every expert's parameters encode the same baseline competence.
| Knowledge type | Where it lives in vanilla MoE | Cost of redundancy |
|---|---|---|
| Common syntax / punctuation | Duplicated across all experts | ~30% of expert capacity wasted |
| Common vocabulary (function words) | Duplicated across all experts | Recall is fine, but capacity is squandered |
| Domain specialty (e.g. SQL keywords) | Concentrated in a few experts | This is what experts are FOR |
| Rare facts / long-tail tokens | Concentrated in a few experts | Working as intended |
Specialists Need a Generalist
Return to the hospital from the previous chapter. We argued that a triage desk sending each patient to two specialists beats every doctor seeing every patient. But notice what real hospitals also have: a general practitioner who sees everyone. The GP reads vitals, takes history, notices the obvious red flags — the universal medical baseline. The specialists then layer their expertise on top of that baseline.
Without the GP, every specialist would have to re-take the patient's vitals themselves. The cardiologist would relearn how to take blood pressure. The dermatologist would relearn how to read a temperature. Each specialist would spend a chunk of training on shared basics instead of going deep on what they are uniquely good at.
That is exactly the role of shared experts in DeepSeekMoE. A small number (typically 1 or 2) of always-on FFNs play the role of GPs. They run for every token regardless of routing. The routed pool then sits behind a top-$k$ gate and learns only what is left over: the specialty.
The Math: Two Pools, One Output
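Let $N_s$ denote the number of shared experts, indexed $i = 1, \dots, N_s$, and $N_r$ the number of routed experts, indexed $j = 1, \dots, N_r$. A DeepSeekMoE block computes:
$$y = \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i(x) \;+\; \sum_{j \in \mathcal{T}} g_j \, \mathrm{FFN}^{(r)}_j(x)$$
Read this carefully. $x$ is the token's hidden state. The first sum runs over the shared experts with no gate: each one's output is added directly. The second sum runs over $\mathcal{T}$, the set of $k$ routed experts the router selected, each scaled by its gate value $g_j$. The gates come from a softmax restricted to the survivors: $g_j = e^{s_j} / \sum_{j' \in \mathcal{T}} e^{s_{j'}}$ for $j \in \mathcal{T}$, otherwise zero, where $s_j$ is the router score of routed expert $j$.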
What is conspicuously absent
There is no global softmax across shared and routed experts. Their outputs are not competing for probability mass. The shared block contributes unconditionally; the routed block contributes a convex combination of its winners; the two are simply added. This decoupling is intentional: shared experts are part of the architecture's baseline, not part of the routing decision.
The compute equation
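Assume a two-matrix FFN and count each multiply-add as 2 FLOPs. A dense FFN of width $d_{ff}$ on a hidden size $d$ then costs roughly $4\,d\,d_{ff}$ FLOPs per token. A DeepSeekMoE block with $N_s$ shared and $k$ active routed experts (all of width $d_e$) costs roughly $4\,d\,d_e\,(N_s + k)$. The total parameter count is about $2\,d\,d_e\,(N_s + N_r)$, where $N_r$ is the size of the routed pool.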
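The cost accounting above can be sketched in a few lines of Python. This is a rough estimate, not a profiler: it assumes a two-matrix FFN (up and down projection), counts each multiply-add as 2 FLOPs, and ignores biases, activations, and the router itself. The function names and the sizes in the example are hypothetical, chosen only for illustration.

```python
def ffn_flops(d_model: int, width: int) -> int:
    """Approximate FLOPs per token for a two-matrix FFN (assumption: 2 FLOPs per multiply-add)."""
    return 4 * d_model * width


def ffn_params(d_model: int, width: int) -> int:
    """Weight count of a two-matrix FFN, biases ignored."""
    return 2 * d_model * width


def moe_block_cost(d_model: int, expert_width: int, n_shared: int, n_routed: int, top_k: int):
    """Return (active FLOPs per token, total expert parameters) for a shared + routed block."""
    # Only the shared experts and the top-k routed experts do work for a given token.
    active_flops = (n_shared + top_k) * ffn_flops(d_model, expert_width)
    # Every expert, active or not, has to be stored.
    total_params = (n_shared + n_routed) * ffn_params(d_model, expert_width)
    return active_flops, total_params


# Hypothetical sizes: 1 shared + 64 routed experts of width 512, top-6, hidden size 1024.
active_flops, total_params = moe_block_cost(1024, 512, n_shared=1, n_routed=64, top_k=6)
dense_flops = ffn_flops(1024, 4096)  # a dense FFN of width 4096 for comparison

print(f"active FLOPs/token: {active_flops:,}")
print(f"total expert params: {total_params:,}")
print(f"dense FFN FLOPs/token: {dense_flops:,}")
```

In this made-up configuration the block stores 65 experts' worth of parameters but spends only 7 experts' worth of compute per token, which is the gap between total and active parameters that the rest of the section relies on.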
Manual Numerical Walkthrough
Let us push one toy token through a block with one shared expert and four routed experts, top-2. Every number is calculated by hand.
Setup. Token . One shared expert plus four routed experts, each a 1-layer FFN with , weights chosen so the arithmetic is trivial:
- Shared (the GP — averages and shifts every token)
- (specialist on the first coordinate)
- (specialist on the second coordinate)
- (a near-generalist that vanilla MoE would over-recruit)
- (a negator)
(A) Shared path. The shared expert runs for every token, no gate: . So far .
(B) Router. Routed logits with . With we get . Top-2 keeps routed experts 1 and 3 (scores 2.0 and 0.5).
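Softmax over the two survivors. Subtract the max: $(2.0, 0.5) - 2.0 = (0, -1.5)$. $e^{0} = 1$, $e^{-1.5} \approx 0.223$, sum 1.223. So $g_1 \approx 0.818$; $g_3 \approx 0.182$.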
Routed expert outputs (only the survivors). Only routed experts 1 and 3 are evaluated. Experts 2 and 4 never run: their weights sat in memory at zero FLOP cost.
Combine. Routed contribution is . Adding the shared contribution from step (A): .
Compare to vanilla MoE on the same token. A vanilla MoE with the same 4 routed experts and no shared block would have output . The shared block added a uniform baseline of — the "every token needs this" signal. Crucially, the routed experts in DeepSeekMoE are free to not encode that baseline themselves, so their capacity goes into the parts that actually differ between tokens.
The FLOP audit. Vanilla MoE here did 2 FFNs. DeepSeekMoE did 3 FFNs (1 shared + 2 routed). That is a 50% increase in active compute for this block, but in exchange every routed expert is now free to be a true specialist. The ablations reported in the DeepSeekMoE paper point the same way: isolating a shared expert improved results in their comparisons.
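The walkthrough can be replayed numerically. In the sketch below the token, the expert weights, and the router matrix are illustrative assumptions, picked so that each expert plays the role described in the setup and so that routed experts 1 and 3 win with scores 2.0 and 0.5. Only those two scores and the resulting gates are taken from the text; every other number is made up for the example.

```python
import numpy as np

# Assumed toy token (hidden size 2).
x = np.array([2.0, -1.0])

# Assumed shared expert: averages the coordinates, then shifts by a bias.
W_shared = np.full((2, 2), 0.5)
b_shared = np.array([0.5, 0.5])

# Assumed routed experts, each a single linear map matching its described role.
routed_W = [
    np.array([[1.0, 0.0], [0.0, 0.0]]),  # expert 1: keeps the first coordinate
    np.array([[0.0, 0.0], [0.0, 1.0]]),  # expert 2: keeps the second coordinate
    0.5 * np.eye(2),                     # expert 3: near-generalist
    -np.eye(2),                          # expert 4: negator
]

# Assumed router: one row per routed expert, no row for the shared expert.
W_router = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]])

# (A) Shared path: always runs, no gate.
y_shared = W_shared @ x + b_shared

# (B) Router: score the routed experts, keep the top 2 (0-based indices).
scores = W_router @ x
top = np.argsort(scores)[::-1][:2]

# Softmax over the survivors only, with the max subtracted for stability.
shifted = scores[top] - scores[top].max()
gates = np.exp(shifted) / np.exp(shifted).sum()

# Routed path: only the selected experts are evaluated.
y_routed = sum(g * (routed_W[j] @ x) for g, j in zip(gates, top))

y = y_shared + y_routed
print("scores:", scores)
print("selected experts (1-based):", top + 1)
print("gates:", gates.round(3))
print("output:", y.round(3))
```

Dropping `y_shared` from the last sum gives the vanilla-MoE output for the same token, so the difference between the two is exactly the shared expert's contribution.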
Visualizing the Two-Pool Architecture
Figure: the two-pool architecture, Vanilla MoE versus Shared + Routed. The violet shared experts are active for every token, while the active subset of the green routed pool changes with the token; a counter shows the number of active blocks.
Three observations to lock in. First, the shared experts receive no router wire: they are connected directly from the token. There is no decision to make about whether to run them. Second, when you switch tokens, only the green wires move; the violet wires stay lit. That is the architectural commitment: shared experts are token-agnostic. Third, the active-block counter shows that DeepSeekMoE pays a small fixed overhead (the shared block) for an outsized quality gain: every routed expert now stops paying the duplicated-baseline tax.
Plain Python: Shared + Routed From Scratch
Before the PyTorch version, here is the entire mechanism in NumPy. The two loops — one unconditional over shared experts, one gated over routed experts — are deliberately separated so you can see exactly where the always-on path lives.
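A minimal NumPy sketch of the mechanism, for a single token. It is an illustration under stated assumptions rather than DeepSeek's code: each expert is assumed to be a two-matrix FFN with ReLU, the gates are a softmax over the selected scores only (as in the math section), and the names `ffn`, `deepseek_moe_forward`, and `make_expert` are hypothetical.

```python
import numpy as np


def ffn(x, W1, W2):
    """Two-matrix FFN with ReLU (assumed expert shape); x has shape (d,)."""
    return W2 @ np.maximum(W1 @ x, 0.0)


def deepseek_moe_forward(x, shared, routed, W_router, top_k):
    """Shared + routed forward pass for one token.

    shared, routed: lists of (W1, W2) weight pairs.
    W_router: shape (n_routed, d); it scores routed experts only.
    """
    y = np.zeros_like(x)

    # Always-on path: every shared expert runs, with no gate.
    for W1, W2 in shared:
        y += ffn(x, W1, W2)

    if top_k == 0 or len(routed) == 0:
        return y

    # Router line: maps the hidden state to n_routed scores, never n_shared + n_routed.
    scores = W_router @ x
    top = np.argsort(scores)[-top_k:]

    # Softmax restricted to the selected experts.
    z = scores[top] - scores[top].max()
    gates = np.exp(z) / np.exp(z).sum()

    # Gated path: only the selected routed experts are evaluated.
    for g, j in zip(gates, top):
        W1, W2 = routed[j]
        y += g * ffn(x, W1, W2)
    return y


rng = np.random.default_rng(0)
d, d_e, n_shared, n_routed, k = 8, 16, 1, 4, 2


def make_expert():
    """Random weights for one toy expert."""
    return rng.normal(size=(d_e, d)), rng.normal(size=(d, d_e))


shared = [make_expert() for _ in range(n_shared)]
routed = [make_expert() for _ in range(n_routed)]
W_router = rng.normal(size=(n_routed, d))
x = rng.normal(size=d)

y = deepseek_moe_forward(x, shared, routed, W_router, k)
print(y.shape)
```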
The interesting structural detail is on the router line: it maps the hidden state to $N_r$ scores, one per routed expert, never to $N_s + N_r$. Shared experts are outside the routing scope by construction. If you ever see a "shared expert" implementation that scores all experts and zeros out the gate on some of them, that is not the DeepSeek pattern; that is a soft mask. It costs the same to compute the score and it conflates two ideas (gating + always-on) that should stay separate.
Sanity check. Set $N_s = 0$ (no shared experts) in the snippet above. The DeepSeekMoE block collapses exactly into the vanilla MoE from section 5.1. Conversely, set $N_r = 0$ and $k = 0$ (no routed path): the block becomes a dense FFN with $N_s$ stacked sub-experts. DeepSeekMoE is the principled interpolation between these two extremes.
PyTorch: A Drop-In DeepSeekMoE Block
Moving from one-token NumPy to a batched PyTorch module, the only new ideas are (a) keeping shared and routed weights in separate ModuleLists and (b) running shared experts on the entire flattened batch as one matmul each. The routed half is identical to the vanilla MoE you already saw, with the gather / scatter trick to keep GPU matmuls large.
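One way the batched module could look in PyTorch is sketched below. It is a readable reference under assumptions, not an optimized or official implementation: the class names `Expert` and `SharedRoutedMoE` are hypothetical, each expert is assumed to be a two-layer GELU FFN, the gates are a softmax over the selected scores, and the routed half loops over experts in Python instead of using a fused kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """Assumed expert shape: a small two-layer FFN."""

    def __init__(self, d_model: int, d_expert: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_expert)
        self.down = nn.Linear(d_expert, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))


class SharedRoutedMoE(nn.Module):
    def __init__(self, d_model, d_expert, n_shared, n_routed, top_k):
        super().__init__()
        self.top_k = top_k
        # Shared and routed weights live in separate ModuleLists.
        self.shared = nn.ModuleList([Expert(d_model, d_expert) for _ in range(n_shared)])
        self.routed = nn.ModuleList([Expert(d_model, d_expert) for _ in range(n_routed)])
        # The router scores routed experts only.
        self.router = nn.Linear(d_model, n_routed, bias=False)

    def forward(self, x):
        batch, seq, d_model = x.shape
        flat = x.reshape(-1, d_model)  # (tokens, d_model)

        # Always-on path: each shared expert runs on the whole flattened batch.
        out = torch.zeros_like(flat)
        for expert in self.shared:
            out = out + expert(flat)

        # Routing: top-k scores per token, softmax over the selected scores.
        scores = self.router(flat)  # (tokens, n_routed)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(top_scores, dim=-1)  # (tokens, top_k)

        # Gather the tokens assigned to each expert, run it once, scatter back.
        for e, expert in enumerate(self.routed):
            token_idx, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # this expert received no tokens in this batch
            weighted = expert(flat[token_idx]) * gates[token_idx, slot].unsqueeze(-1)
            out = out.index_add(0, token_idx, weighted)

        return out.reshape(batch, seq, d_model)


torch.manual_seed(0)
block = SharedRoutedMoE(d_model=16, d_expert=32, n_shared=1, n_routed=4, top_k=2)
tokens = torch.randn(2, 5, 16)
result = block(tokens)
print(result.shape)
```

The block returns only the expert mixture; a residual connection and normalization, if the surrounding transformer layer uses them, would be applied outside it.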
Two subtleties worth marking, both consequences of the always-on path:
- Shared experts get more gradient signal per step. Every shared expert sees every token in the batch, so its parameters update on every step with full gradient density. Routed experts only see a fraction of the batch on average. In practice, DeepSeek tunes the learning rate the same way for both pools — the extra signal flowing into the shared block does not need a special schedule — but you should expect shared experts to converge faster and saturate earlier.
- The shared block is a stable floor. Even if the router collapses (the failure mode covered in chapter 6) and routes every token to the same one or two routed experts, the shared block still functions. The model never falls below the baseline competence the shared experts encoded. This is why DeepSeek treats shared experts as a robustness mechanism on top of their auxiliary-loss-free balancing — belt and suspenders.
What Changes at Massive Scale
DeepSeekMoE keeps a small shared block and a very large fine-grained routed pool. The actual numbers from the released papers tell the design story:
| Model | Shared experts | Routed experts | Top-k routed | Active / total params |
|---|---|---|---|---|
| DeepSeekMoE 16B | 2 | 64 | 6 | 2.8B / 16B |
| DeepSeek-V2 | 2 | 160 | 6 | 21B / 236B |
| DeepSeek-V3 | 1 | 256 | 8 | 37B / 671B |
| Mixtral 8×7B (no shared) | 0 | 8 | 2 | 13B / 47B |
Two patterns jump out. First, the shared count is tiny: $N_s \in \{1, 2\}$. You do not need a lot of always-on capacity to absorb common knowledge; you need just enough that the routed pool stops duplicating it. Second, as DeepSeek went from V2 to V3, they actually reduced shared from 2 to 1 while pushing routed from 160 to 256 and top-$k$ from 6 to 8. The interpretation: with a finer-grained, larger routed pool, even less needs to be carved off as shared, because the routed pool itself has the granularity to cover near-baseline patterns specifically.
The gradient-flow consequence
Shared experts pull gradient on every step from every token. Routed experts pull gradient on roughly $k / N_r$ of the tokens, since each token's $k$ selections are spread over $N_r$ routed experts. For DeepSeek-V3 that is $8 / 256 \approx 3\%$. Without a shared block, every routed expert is on a starvation gradient diet for the baseline patterns: it has to learn whitespace handling from 3% of the tokens, painstakingly. The shared block gets the same baseline signal from 100% of tokens. Convergence on common patterns is dramatically faster.
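The same fraction can be computed for every row of the table above. The sketch assumes perfectly balanced routing, so each routed expert receives an equal share of the top-k selections; real routers only approximate this. The helper name `routed_token_fraction` is hypothetical.

```python
# (shared experts, routed experts, top-k) as listed in the table above.
configs = {
    "DeepSeekMoE 16B": (2, 64, 6),
    "DeepSeek-V2": (2, 160, 6),
    "DeepSeek-V3": (1, 256, 8),
    "Mixtral 8x7B": (0, 8, 2),
}


def routed_token_fraction(n_routed: int, top_k: int) -> float:
    """Fraction of tokens one routed expert sees, assuming balanced routing."""
    return top_k / n_routed


for name, (n_shared, n_routed, top_k) in configs.items():
    frac = routed_token_fraction(n_routed, top_k)
    shared_note = "100% of tokens" if n_shared else "no shared block"
    print(f"{name:16s} routed expert: {frac:6.2%} of tokens | shared expert: {shared_note}")
```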
The memory-bandwidth angle
On a real cluster, the routed experts are sharded across many GPUs (expert parallelism — covered in the next section). Every token has to travel via all-to-all to whichever GPUs own its chosen experts. The shared block, by contrast, lives once per device replica and runs locally. No all-to-all, no cross-node traffic. For tokens that would otherwise have to make a long trip for marginal specialization gain, the shared block delivers competence for free.
Engineering Reality and Gotchas
The shared-experts pattern is one of the simplest pieces of DeepSeekMoE to implement and one of the easiest to misuse. Three failure modes are worth flagging:
- Too many shared experts. If $N_s$, the number of shared experts, grows large enough that the shared block dominates the active-FLOP budget, you have just built a dense FFN with a small MoE tail. The routed pool stops mattering. DeepSeek caps $N_s$ at 1 or 2 for a reason.
- Soft-mask "shared" experts. Some implementations add "shared" experts to the routed pool and force their gates to be large. This costs the same as DeepSeek's pattern at inference but is architecturally cloudy — the gradient still flows through the router into the always-on experts, which can destabilize the routing signal. Keep the two paths cleanly separated.
- Imbalanced learning rates. Because shared experts see every token and routed experts see a few, a tempting hack is to lower the shared experts' learning rate to "balance" the gradient density. In practice, this hurts: the shared block should converge faster, that is its job. Use one global learning rate and let the architecture do the work.
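A quick check for the first failure mode is to compute how much of the active expert compute the shared block takes. The sketch assumes all experts have the same width, so the share reduces to a ratio of expert counts; the function name is hypothetical.

```python
def shared_share_of_active_compute(n_shared: int, top_k: int) -> float:
    """Share of per-token expert FLOPs spent in the shared block (equal-width experts assumed)."""
    active = n_shared + top_k
    return n_shared / active if active else 0.0


# Configurations from the table earlier in this section, plus one deliberately bad choice.
print(f"DeepSeek-V2 (2 shared, top-6): {shared_share_of_active_compute(2, 6):.0%}")
print(f"DeepSeek-V3 (1 shared, top-8): {shared_share_of_active_compute(1, 8):.0%}")
print(f"Hypothetical (16 shared, top-2): {shared_share_of_active_compute(16, 2):.0%}")
```

When the share approaches 100%, as in the last line, the block behaves like a dense FFN with a small routed tail.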
The one sentence to carry forward is this: shared experts are how DeepSeek separates the universal from the specialist, and that separation is what unlocks the truly fine-grained routed pools (hundreds of experts, only a handful active per token) that define the V2 and V3 generations.