Boo-AI — Master Artificial Intelligence by Building from Scratch

In the previous section we sliced each big expert into many tiny ones and increased the number of activations per token. That gave the router more combinations to choose from and let each expert specialize harder. But it also exposed an embarrassing redundancy hiding inside vanilla MoE: every expert was independently relearning the same basic facts. Shared experts are DeepSeek's fix: a small set of always-on FFNs that absorb the common-knowledge load so the routed pool is free to specialize.

The bet of shared experts. Hand the boring common knowledge — whitespace, punctuation, basic syntax, the word "the" — to a small always-on block. Let the routed experts do nothing but specialize. You spend a little more compute per token; you stop wasting model capacity on duplicated baselines.

The Redundancy Problem

Look inside a trained vanilla MoE and inspect the weights of each routed expert. A predictable, expensive pattern appears: every expert has independently learned roughly the same basic features. The first layers of every expert look almost alike. The router keeps picking different experts for different tokens, but each expert had to re-derive — from scratch — the parts of language that are universal.

Why does this happen? Top-$k$ routing is a hard partition. A token like the word "the" can only land on $k$ experts at a time. Across the training corpus, every expert eventually sees enough "the"s, "a"s, and commas that it has no choice but to learn how to handle them. The result is that a sizable fraction of every expert's parameters encode the same baseline competence.

Knowledge type	Where it lives in vanilla MoE	Cost of redundancy
Common syntax / punctuation	Duplicated across all experts	A sizable share of expert capacity wasted (the ~30% figure used below is illustrative)
Common vocabulary (function words)	Duplicated across all experts	Recall is fine, but capacity is squandered
Domain specialty (e.g. SQL keywords)	Concentrated in a few experts	This is what experts are FOR
Rare facts / long-tail tokens	Concentrated in a few experts	Working as intended

Why this matters for massive training. Suppose every routed expert spends 30% of its parameters on duplicated common knowledge. With 160 experts, you are storing the same baseline 160 times. That is parameters bought, gradient steps applied, optimizer states allocated — all to memorize the same thing over and over. Shared experts let you store the baseline once.
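To put rough numbers on that supposition, here is a back-of-the-envelope sketch. The 30% duplicated fraction and the 160-expert count come from the paragraph above; the expert dimensions d and d_ff are hypothetical placeholders chosen only for illustration, not taken from any released model, and the function name duplicated_params is made up for this example.

```python
def duplicated_params(n_experts: int, d: int, d_ff: int, dup_fraction: float) -> dict:
    """Estimate parameters spent on duplicated baseline knowledge.

    Assumes each expert is a two-matrix FFN (d -> d_ff -> d), so it holds
    2 * d * d_ff weights, and that `dup_fraction` of them encode the same
    baseline in every expert (an illustrative assumption, not a measurement).
    """
    per_expert = 2 * d * d_ff
    baseline = dup_fraction * per_expert      # one copy of the baseline
    stored = n_experts * baseline             # what vanilla MoE ends up storing
    return {
        "per_expert": per_expert,
        "baseline_once": baseline,
        "baseline_stored": stored,
        "redundant": stored - baseline,       # everything beyond a single copy
    }

# Hypothetical toy dimensions: d = 1024, d_ff = 512.
report = duplicated_params(n_experts=160, d=1024, d_ff=512, dup_fraction=0.30)
for name, value in report.items():
    print(f"{name:>16}: {value:,.0f}")
```

Whatever dimensions you plug in, 159 of the 160 stored copies are redundant under this assumption; a shared block aims to keep just the one.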

Specialists Need a Generalist

Return to the hospital from the previous chapter. We argued that a triage desk sending each patient to two specialists beats every doctor seeing every patient. But notice what real hospitals also have: a general practitioner who sees everyone. The GP reads vitals, takes history, notices the obvious red flags — the universal medical baseline. The specialists then layer their expertise on top of that baseline.

Without the GP, every specialist would have to re-take the patient's vitals themselves. The cardiologist would relearn how to take blood pressure. The dermatologist would relearn how to read a temperature. Each specialist would spend a chunk of training on shared basics instead of going deep on what they are uniquely good at.

That is exactly the role of shared experts in DeepSeekMoE. A small number of always-on FFNs (typically 1 or 2) play the role of GPs. They run for every token regardless of routing. The routed pool then sits behind a top-$k$ gate and learns only what is left over: the specialty.

The right mental picture: shared experts are the baseline feature extractor; routed experts are the specialty layer. The model output is baseline + specialty, not an average of two things competing for the same role.

The Math: Two Pools, One Output

Let $E_s$ denote the shared experts indexed $1, \dots, N_s$, and $E_r$ the routed experts indexed $1, \dots, N_r$. A DeepSeekMoE block computes:

$y = \sum_{i=1}^{N_s} E_s^{(i)}(x) + \sum_{i \in \mathcal{T}} g_i(x) \cdot E_r^{(i)}(x)$

Read this carefully. $x$ is the token's hidden state. The first sum runs over $N_s$ shared experts with no gate: each one's output is added directly. The second sum runs over $\mathcal{T} = \mathrm{TopK}(W_r x, k)$, the set of $k$ routed experts the router selected, each scaled by its gate value $g_i(x)$. Writing $s = W_r x$ for the router logits, the gates come from a softmax restricted to the survivors: $g_i(x) = e^{s_i} / \sum_{j \in \mathcal{T}} e^{s_j}$ for $i \in \mathcal{T}$, otherwise zero.

What is conspicuously absent

There is no global softmax across shared and routed experts. Their outputs are not competing for probability mass. The shared block contributes unconditionally; the routed block contributes a convex combination of its winners; the two are simply added. This decoupling is intentional: shared experts are part of the architecture's baseline, not part of the routing decision.

The compute equation

A dense FFN of width $d_{ff}$ costs roughly $2 d \cdot d_{ff}$ multiply-accumulates per token (about $4 d \cdot d_{ff}$ FLOPs if each multiply and add is counted separately). In the same units, a DeepSeekMoE block with $N_s$ shared and $k$ active routed experts (all of width $d_{ff}$) costs $(N_s + k) \cdot 2 d \cdot d_{ff}$. The total parameter count is $(N_s + N_r) \cdot 2 d \cdot d_{ff}$.

The decoupling sharpens. Vanilla MoE buys you a $k / E$ compute ratio. DeepSeekMoE pays a small fixed $N_s$ on top of that, but in exchange every routed expert is free of the common-knowledge tax, so a fine-grained routed pool with $N_r = 160, k = 6, N_s = 2$ can match or beat a dense baseline at a fraction of the FLOPs. The shared block is a cheap insurance policy against routing collapse on common tokens.
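The cost formulas above are easy to check numerically. The sketch below counts cost in units of one expert, as the text does; the helper name moe_block_cost and the values of d and d_ff are hypothetical and cancel out of the ratio anyway.

```python
def moe_block_cost(n_shared: int, n_routed: int, k: int, d: int, d_ff: int) -> dict:
    """Per-token active cost and total parameter count for a two-pool MoE block.

    Uses the same unit as the text: one expert of width d_ff costs 2 * d * d_ff.
    """
    unit = 2 * d * d_ff
    active = (n_shared + k) * unit          # shared experts + top-k routed experts
    total = (n_shared + n_routed) * unit    # every expert that must be stored
    return {"active": active, "total": total, "ratio": active / total}

# Configuration quoted in the text; d and d_ff are placeholder values.
cost = moe_block_cost(n_shared=2, n_routed=160, k=6, d=1024, d_ff=512)
print(f"shared + routed: active / total = {cost['ratio']:.4f}")    # (2 + 6) / (2 + 160)

# Same routed pool with no shared block: the plain k / E ratio.
vanilla = moe_block_cost(n_shared=0, n_routed=160, k=6, d=1024, d_ff=512)
print(f"routed only:     active / total = {vanilla['ratio']:.4f}")  # 6 / 160
```

The shared block raises the active fraction only slightly (8/162 versus 6/160), which is the "small fixed $N_s$" the paragraph refers to.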

Manual Numerical Walkthrough

Let us push one toy token through a block with $N_s = 1$ shared expert and $N_r = 4$ routed experts, using top-$k$ routing with $k = 2$. Every number is calculated by hand.

One shared + four routed, by hand

Setup. Token $x = [1, 0]$ . One shared expert plus four routed experts, each a 1-layer FFN with $d = 2, d_{ff} = 2$ , weights chosen so the arithmetic is trivial:

Shared $E_s(x) = [\tfrac{1}{2}(x_1 + x_2),\, \tfrac{1}{2}(x_1 + x_2)]$ (the GP: writes the average of the two coordinates into both outputs, for every token)
$E_r^{(1)}(x) = [2 x_1, 0]$ (specialist on the first coordinate)
$E_r^{(2)}(x) = [0, 2 x_2]$ (specialist on the second coordinate)
$E_r^{(3)}(x) = [x_1 + x_2, x_1 + x_2]$ (a near-generalist that vanilla MoE would over-recruit)
$E_r^{(4)}(x) = [-x_1, -x_2]$ (a negator)

(A) Shared path. The shared expert runs for every token, no gate: $E_s([1, 0]) = [0.5,\, 0.5]$ . So far $y = [0.5,\, 0.5]$ .

(B) Router. Routed logits $s = W_r x$ with $W_r = \begin{bmatrix} 2.0 & 0.1 \\ 0.2 & 1.5 \\ 0.5 & 0.5 \\ -1.0 & -1.0 \end{bmatrix}$ . With $x = [1, 0]$ we get $s = [2.0,\, 0.2,\, 0.5,\, -1.0]$ . Top-2 keeps routed experts 1 and 3 (scores 2.0 and 0.5).

Softmax over the two survivors. Subtract the max: $z = [0,\, -1.5]$ . $e^z = [1.0,\, 0.223]$ , sum 1.223. So $g_1 = 0.818, g_3 = 0.182$ ; $g_2 = g_4 = 0$ .

Routed expert outputs (only the survivors). $E_r^{(1)}([1, 0]) = [2, 0]$ ; $E_r^{(3)}([1, 0]) = [1, 1]$ . $E_r^{(2)}$ and $E_r^{(4)}$ never run — their weights sat in memory at zero FLOP cost.

Combine. Routed contribution is $0.818 \cdot [2, 0] + 0.182 \cdot [1, 1] = [1.818,\, 0.182]$ . Adding the shared contribution from step (A): $y = [0.5, 0.5] + [1.818, 0.182] = [2.318,\, 0.682]$ .

Compare to vanilla MoE on the same token. A vanilla MoE with the same 4 routed experts and no shared block would have output $[1.818, 0.182]$ . The shared block added a uniform baseline of $[0.5, 0.5]$ — the "every token needs this" signal. Crucially, the routed experts in DeepSeekMoE are free to not encode that baseline themselves, so their capacity goes into the parts that actually differ between tokens.

The FLOP audit. Vanilla MoE here did 2 FFNs. DeepSeekMoE did 3 FFNs (1 shared + 2 routed). That is a 50% increase in active compute for this block, but in exchange every routed expert is now free to be a true specialist. The shared-expert ablations reported for DeepSeekMoE suggest this trade is positive: isolating shared experts buys more quality than the FLOPs they cost.
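The hand calculation can be replayed in a few lines of standard-library Python. This is only a sketch of the toy setup from the walkthrough: the expert functions and the router matrix are copied from steps (A) and (B), while the names shared_expert, ROUTED, W_R and forward are invented for this example.

```python
import math

def shared_expert(x):
    # The "GP": average of the two coordinates, written to both outputs.
    m = 0.5 * (x[0] + x[1])
    return [m, m]

# The four routed experts from the setup, in order 1..4.
ROUTED = [
    lambda x: [2 * x[0], 0.0],
    lambda x: [0.0, 2 * x[1]],
    lambda x: [x[0] + x[1], x[0] + x[1]],
    lambda x: [-x[0], -x[1]],
]

# Router weights W_r from step (B), one row per routed expert.
W_R = [[2.0, 0.1], [0.2, 1.5], [0.5, 0.5], [-1.0, -1.0]]

def forward(x, k=2):
    y = shared_expert(x)                                   # (A) shared path, no gate
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W_R]
    top = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    m = max(logits[i] for i in top)                        # stable softmax over survivors
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    gates = {i: e / total for i, e in exps.items()}
    for i, g in gates.items():                             # only the survivors run
        y = [yi + g * oi for yi, oi in zip(y, ROUTED[i](x))]
    return y, gates

y, gates = forward([1.0, 0.0])
print([round(v, 3) for v in y])                        # [2.318, 0.682]
print({i + 1: round(g, 3) for i, g in gates.items()})  # {1: 0.818, 3: 0.182}
```

Experts 2 and 4 are never called for this token, which matches the observation in the walkthrough that their weights sit in memory at zero FLOP cost.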

Visualizing the Two-Pool Architecture

Toggle between Vanilla MoE and Shared + Routed in the diagram below. Watch how the violet shared experts light up for every token while the green routed pool still flips depending on which token you pick. Drag the $k$ slider to see the active-blocks counter change.

[Interactive diagram: shared-experts visualizer]

Three observations to lock in. First, the shared experts receive no router wire; they are connected directly from the token. There is no decision to make about whether to run them. Second, when you switch tokens, only the green wires move; the violet wires stay lit. That is the architectural commitment: shared experts are token-agnostic. Third, the active-block counter shows that DeepSeekMoE pays a small fixed overhead (the shared block) for an outsized quality gain: every routed expert now stops paying the duplicated-baseline tax.

Plain Python: Shared + Routed From Scratch

Before the PyTorch version, here is the entire mechanism in NumPy. The two loops — one unconditional over shared experts, one gated over routed experts — are deliberately separated so you can see exactly where the always-on path lives.

deepseekmoe_numpy.py

Notes on the listing, in the order the code runs:

Two pools, one knob each (the n_shared, n_routed, d, d_ff, k assignment). n_shared shared experts always run; n_routed routed experts are gated top-k. Splitting the pool is the whole architectural move: the shared block carries common knowledge, the routed block specializes.
Execution state: n_shared = 1, n_routed = 4, k = 2

Shared expert weights (W1_s, W2_s). Standard FFN tensors, but stored separately from the routed pool. They are ordinary parameters trained by every token; their gradients are not gated.
Execution state: W1_s.shape = (1, 8, 4)

Routed expert weights (W1_r, W2_r). Same per-expert shape as the shared weights, but each routed expert only sees the tokens the router sends to it. With n_routed = 4, we store 4× the parameters but only run k = 2 per token.
Execution state: W1_r.shape = (4, 8, 4)

Router is scoped to the routed pool only (W_router). Critical detail: the router has shape (n_routed, d), NOT (n_shared + n_routed, d). Shared experts are not scored; they are unconditional. The router has no idea they exist.
Execution state: W_router.shape = (4, 4)

One expert forward pass (ffn). Plain ReLU FFN: W2 @ relu(W1 x). Used identically by both pools. The only difference is when it runs, not what it computes.

Initialize the output accumulator (y = np.zeros(d)). We will add contributions from the shared block first, then from each selected routed expert. No softmax across the union: shared and routed contributions are summed directly.

Run every shared expert, unconditionally (the loop over range(n_shared)). There is no gate here. Every shared expert FFN executes for every token. This is the always-on baseline that absorbs common-knowledge load.

Score the routed pool (logits = W_router @ x). One linear matvec into the n_routed scores. Cheap: O(n_routed · d). The router is the only learned part of the gating decision.
Execution state: logits.shape = (4,)

Top-k on the routed scores (topk_idx). Pick the k highest-scoring routed experts. Shared experts are not in this competition; they are guaranteed runtime regardless of any logit.
Example: if logits = [0.7, -0.2, 1.3, 0.4], topk_idx = [2, 0] (k = 2 winners).

Numerically stable softmax over k survivors (z and gates). Subtract the max before exp to prevent overflow. The resulting k gates sum to 1, so the routed block contributes a convex combination of the chosen experts. Shared and routed sums are then added (NOT averaged).
Execution state: gates.shape = (2,), sum(gates) = 1.0

Accumulate the gated routed contributions (the loop over zip(gates, topk_idx)). Each routed expert that survived the top-k contributes g · FFN(x) into the same y the shared block already filled. The output is y = Σ shared + Σ gated routed.

One toy token (x = np.array([...])). Real DeepSeek-V2 uses d = 5120 with 2 shared experts and 160 routed experts (k = 6). The toy is identical in mechanism; only the numbers scale.

```python
import numpy as np

# 1 shared expert + 4 routed experts, top-k = 2 over the routed pool.
n_shared, n_routed, d, d_ff, k = 1, 4, 4, 8, 2

rng = np.random.default_rng(0)
# Shared experts: always-on FFNs.
W1_s = rng.standard_normal((n_shared, d_ff, d)) * 0.1
W2_s = rng.standard_normal((n_shared, d, d_ff)) * 0.1
# Routed experts: gated FFNs.
W1_r = rng.standard_normal((n_routed, d_ff, d)) * 0.1
W2_r = rng.standard_normal((n_routed, d, d_ff)) * 0.1
# Router only scores the routed pool; shared experts have no router.
W_router = rng.standard_normal((n_routed, d)) * 0.1

def ffn(W1, W2, x):
    # One expert: W2 @ relu(W1 @ x).
    return W2 @ np.maximum(0, W1 @ x)

def moe_forward(x):
    # (A) Shared block: runs unconditionally.
    y = np.zeros(d)
    for i in range(n_shared):
        y += ffn(W1_s[i], W2_s[i], x)

    # (B) Routed block: top-k gating.
    logits = W_router @ x                        # (n_routed,)
    topk_idx = np.argsort(-logits)[:k]
    z = logits[topk_idx] - logits[topk_idx].max()
    gates = np.exp(z) / np.exp(z).sum()          # (k,) sums to 1

    for g, i in zip(gates, topk_idx):
        y += g * ffn(W1_r[i], W2_r[i], x)
    return y

x = np.array([0.5, -0.2, 0.9, 0.1])
print(moe_forward(x))
```

The interesting structural detail is on the router line: it maps $d \to n_{routed}$ , never $d \to n_{shared} + n_{routed}$ . Shared experts are outside the routing scope by construction. If you ever see a "shared expert" implementation that scores all experts and zeros out the gate on some of them, that is not the DeepSeek pattern — that is a soft mask. It costs the same to compute the score and it conflates two ideas (gating + always-on) that should stay separate.

Sanity check. Set $n_{shared} = 0$ in the snippet above. The DeepSeekMoE block collapses exactly into the vanilla MoE from section 5.1. Conversely, set $n_{shared} = N_s + N_r$ and $k = 0$: the block becomes a dense FFN with $N_s + N_r$ stacked sub-experts. (The snippet as written would need a guard for $k = 0$, because taking the max of an empty top-k set raises an error in NumPy.) DeepSeekMoE is the principled interpolation between these two extremes.

PyTorch: A Drop-In DeepSeekMoE Block

Moving from one-token NumPy to a batched PyTorch module, the only new ideas are (a) keeping shared and routed weights in separate ModuleLists and (b) running shared experts on the entire flattened batch as one matmul each. The routed half is identical to the vanilla MoE you already saw, with the gather / scatter trick to keep GPU matmuls large.

deepseekmoe_pytorch.py

Notes on the listing, in the order the code runs:

One nn.Module, two pools (class DeepSeekMoEBlock). DeepSeekMoEBlock keeps shared and routed weights as separate ModuleLists so autograd routes the right gradients into each: shared experts get gradients from every token, routed experts only from the tokens that picked them.

Shared experts as a ModuleList (self.shared). Each shared expert is an ordinary 2-layer FFN with GELU. They are stored separately because they are queried differently: no router touches them.

Routed experts as a ModuleList (self.routed). Same shape, different runtime contract. ModuleList tracks parameters so .parameters() yields both pools cleanly. In production, both pools are typically fused into a stacked (E, d_ff, d) tensor for batched matmul.

Router emits n_routed logits (self.router). Linear(D, n_routed); note the n_routed, not n_shared + n_routed. The router never decides whether shared experts run; that decision was made at architecture time.
Execution state: router.weight.shape = (n_routed, D)

Standard transformer batch shape (B, T, D = x.shape). x is (batch, seq_len, d_model). Routing decisions are still per-token, but the shared experts also act per-token, just without a gate.
Execution state: x.shape = (B, T, D)

Flatten batch and time into tokens (x_flat). Collapse batch and time so every token is one row. N = B · T. The MoE block treats every token independently; the same token in two sequences is allowed to pick different routed experts.
Execution state: x_flat.shape = (N, D)

Initialize the output (y = torch.zeros_like(x_flat)). Allocate zeros of the same shape as the flat input. We will add shared and routed contributions into this single tensor. Order does not matter, because addition is commutative.

Shared loop: every token, every shared expert (the loop over self.shared). For each shared expert, run a single dense (N, D) → (N, D) matmul on the entire batch. No gathering, no masking, no gating. This is why shared experts are cheap to schedule: they are just FFNs.

Router scores the routed pool only (logits = self.router(x_flat)). Output is (N, n_routed). Total cost is O(N · D · n_routed), much cheaper than running any FFN, even though n_routed can be 160 in DeepSeek-V2.
Execution state: logits.shape = (N, n_routed)

Top-k along the routed axis (logits.topk). torch.topk returns the k largest logits and their indices per token. topk_idx[n] are the routed experts token n must visit.
Execution state: topk_idx.shape = (N, k)

Softmax over the survivors (gates). We softmax across the k kept logits per token. The k gates sum to 1, giving a convex combination of routed experts. Shared and routed sums are NOT renormalized together; they are simply added.

Loop over routed experts, not tokens (for i in range(self.n_routed)). For each routed expert i, gather the tokens that picked it, run that expert once on the packed mini-batch, scatter the gated result back. One large matmul per expert beats thousands of tiny ones.

Build the per-expert mask (mask = topk_idx == i). topk_idx == i broadcasts to (N, k); True where token n chose expert i in any of its k slots. mask.any() lets us skip experts with no tokens this batch.

Extract gates for selected (token, expert) pairs (tok_rows, slot, g). nonzero gives the (row, slot) coordinates of True entries. We index the gates tensor with these to get the right scalar weight for each token's contribution.
Execution state: g.shape = (M, 1)

Run routed expert and add gated output (y[tok_rows] += ...). self.routed[i](x_flat[tok_rows]) runs the routed FFN on the M tokens that chose it, as one packed matmul. Multiply by the gate, add into y at the right rows. y already contains the shared contribution, so this is pure addition.

Restore (B, T, D) (the final reshape). Reshape back so the DeepSeekMoEBlock is a drop-in replacement for a dense FFN inside the transformer block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSeekMoEBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int,
                 n_shared: int, n_routed: int, k: int):
        super().__init__()
        self.n_shared, self.n_routed, self.k = n_shared, n_routed, k

        self.shared = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_shared)
        ])
        self.routed = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_routed)
        ])
        # Router scores only the routed pool.
        self.router = nn.Linear(d_model, n_routed, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        x_flat = x.reshape(B * T, D)                          # (N, D)

        # ---- shared path: unconditional, no router ----
        y = torch.zeros_like(x_flat)
        for shared_expert in self.shared:
            y = y + shared_expert(x_flat)

        # ---- routed path: top-k gating ----
        logits = self.router(x_flat)                          # (N, E_r)
        topk_w, topk_idx = logits.topk(self.k, dim=-1)        # (N, k), (N, k)
        gates = F.softmax(topk_w, dim=-1)                     # (N, k)

        for i in range(self.n_routed):
            mask = topk_idx == i                              # (N, k)
            if not mask.any():
                continue
            tok_rows, slot = mask.nonzero(as_tuple=True)
            g = gates[tok_rows, slot].unsqueeze(-1)           # (M, 1)
            y[tok_rows] += g * self.routed[i](x_flat[tok_rows])
        return y.reshape(B, T, D)
```

Two subtleties worth marking, both consequences of the always-on path:

Shared experts get more gradient signal per step. Every shared expert sees every token in the batch, so its parameters update on every step with full gradient density. Routed experts only see a $k / N_r$ fraction of the batch on average. In practice, DeepSeek tunes the learning rate the same way for both pools — the extra signal flowing into the shared block does not need a special schedule — but you should expect shared experts to converge faster and saturate earlier.
The shared block is a stable floor. Even if the router collapses (the failure mode covered in chapter 6) and routes every token to the same one or two routed experts, the shared block still functions. The model never falls below the baseline competence the shared experts encoded. This is why DeepSeek treats shared experts as a robustness mechanism on top of their auxiliary-loss-free balancing — belt and suspenders.

Implementation note. In real DeepSeek code the per-routed-expert loop is replaced by a single grouped GEMM against a packed $(N_r, d_{ff}, d)$ tensor, and the shared block uses an ordinary stacked FFN of the same form. Functionally identical to the snippet above, dramatically faster on H800s.

What Changes at Massive Scale

DeepSeekMoE keeps a small shared block and a very large fine-grained routed pool. The actual numbers from the released papers tell the design story:

Model	Shared experts	Routed experts	Top-k routed	Active / total params
DeepSeekMoE 16B	2	64	6	2.8B / 16B
DeepSeek-V2	2	160	6	21B / 236B
DeepSeek-V3	1	256	8	37B / 671B
Mixtral 8×7B (no shared)	0	8	2	13B / 47B

Two patterns jump out. First, the shared count is tiny — $N_s \in \{1, 2\}$ . You do not need a lot of always-on capacity to absorb common knowledge; you need just enough that the routed pool stops duplicating it. Second, as DeepSeek went from V2 to V3, they actually reduced shared from 2 to 1 while pushing routed from 160 to 256 and $k$ from 6 to 8. The interpretation: with a finer-grained, larger routed pool, even less needs to be carved off as shared, because the routed pool itself has the granularity to cover near-baseline patterns specifically.
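The sparsity trend in the table is easier to see as ratios. The sketch below only re-derives numbers from the rows above; the tuple layout and the function name summarize are choices made for this example. The parameter ratio includes attention and embedding weights, so it is not expected to equal the expert ratio.

```python
# (model, shared, routed, top_k, active_params_B, total_params_B), copied from the table.
MODELS = [
    ("DeepSeekMoE 16B", 2, 64, 6, 2.8, 16.0),
    ("DeepSeek-V2", 2, 160, 6, 21.0, 236.0),
    ("DeepSeek-V3", 1, 256, 8, 37.0, 671.0),
    ("Mixtral 8x7B", 0, 8, 2, 13.0, 47.0),
]

def summarize(models):
    rows = []
    for name, n_s, n_r, k, active, total in models:
        rows.append({
            "model": name,
            # Fraction of experts in one MoE block that run for a token.
            "experts_active": (n_s + k) / (n_s + n_r),
            # Fraction of all model parameters touched per token.
            "params_active": active / total,
        })
    return rows

for row in summarize(MODELS):
    print(f"{row['model']:<16} experts active {row['experts_active']:6.1%}   "
          f"params active {row['params_active']:6.1%}")
```

Both ratios shrink from DeepSeekMoE 16B to V2 to V3, while Mixtral, with its 8 coarse experts, runs a quarter of its expert pool on every token.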

The gradient-flow consequence

Shared experts pull gradient on every step from every token. Routed experts pull gradient on roughly $k / N_r$ of the tokens. For DeepSeek-V3 that is $8 / 256 = 3.1\%$ . Without a shared block, every routed expert is on a starvation gradient diet for the baseline patterns — it has to learn whitespace handling from 3% of the tokens, painstakingly. The shared block gets the same baseline signal from 100% of tokens. Convergence on common patterns is dramatically faster.
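A quick simulation makes the exposure gap concrete. It is a sketch under a strong assumption: routing is replaced by uniform random top-k selection, an idealized stand-in for a perfectly balanced router, so a trained router will be less even than this. The function name routed_exposure and the token count are hypothetical.

```python
import random

def routed_exposure(n_tokens: int, n_routed: int, k: int, seed: int = 0) -> list:
    """Fraction of tokens each routed expert sees under uniform random top-k routing."""
    rng = random.Random(seed)
    counts = [0] * n_routed
    for _ in range(n_tokens):
        # Each token visits k distinct routed experts.
        for expert in rng.sample(range(n_routed), k):
            counts[expert] += 1
    return [c / n_tokens for c in counts]

# DeepSeek-V3-style pool sizes from the text: 256 routed experts, k = 8.
fractions = routed_exposure(n_tokens=20_000, n_routed=256, k=8)
mean_fraction = sum(fractions) / len(fractions)

print(f"mean routed exposure: {mean_fraction:.4f}  (k / N_r, up to float rounding)")
print(f"min / max exposure:   {min(fractions):.4f} / {max(fractions):.4f}")
print("shared exposure:      1.0000  (every token, by construction)")
```

The mean comes out at $k / N_r$ regardless of the seed, because every token contributes exactly $k$ expert visits; only the spread between the least and most visited expert depends on the routing.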

The memory-bandwidth angle

On a real cluster, the routed experts are sharded across many GPUs (expert parallelism — covered in the next section). Every token has to travel via all-to-all to whichever GPUs own its chosen experts. The shared block, by contrast, lives once per device replica and runs locally. No all-to-all, no cross-node traffic. For tokens that would otherwise have to make a long trip for marginal specialization gain, the shared block delivers competence for free.

Engineering Reality and Gotchas

The shared-experts pattern is one of the simplest pieces of DeepSeekMoE to implement and one of the easiest to misuse. Three failure modes are worth flagging:

Too many shared experts. If $N_s$ grows large enough that the shared block dominates the active-FLOP budget, you have just built a dense FFN with a small MoE tail. The routed pool stops mattering. DeepSeek caps $N_s$ at 1 or 2 for a reason.
Soft-mask "shared" experts. Some implementations add "shared" experts to the routed pool and force their gates to be large. This costs the same as DeepSeek's pattern at inference but is architecturally cloudy — the gradient still flows through the router into the always-on experts, which can destabilize the routing signal. Keep the two paths cleanly separated.
Imbalanced learning rates. Because shared experts see every token and routed experts see a few, a tempting hack is to lower the shared experts' learning rate to "balance" the gradient density. In practice, this hurts: the shared block should converge faster, that is its job. Use one global learning rate and let the architecture do the work.
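The first failure mode can be quantified with one ratio. The sketch below assumes shared and routed experts have the same width, as in the compute equation earlier; the function name shared_share and the 50% "dominates" threshold are illustrative choices, not a rule from DeepSeek.

```python
def shared_share(n_shared: int, k: int) -> float:
    """Fraction of active expert compute spent in the shared block,
    assuming shared and routed experts all have the same width."""
    return n_shared / (n_shared + k)

K = 6  # routed experts active per token, as in the DeepSeek-V2 row of the table
for n_shared in (1, 2, 4, 8, 16):
    share = shared_share(n_shared, K)
    # The 0.5 cutoff is an arbitrary marker for "mostly dense".
    note = "  <- shared block dominates" if share > 0.5 else ""
    print(f"N_s = {n_shared:>2}: shared share of active compute = {share:.0%}{note}")
```

At $N_s = 2$ the shared block takes a quarter of the active expert compute; by $N_s = 8$ it takes more than half, and the block behaves mostly like a dense FFN with a small routed tail.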

Where we go from here. The next section opens the cluster — how to actually shard a 256-expert routed pool across hundreds of GPUs without drowning in all-to-all traffic. Shared experts will reappear there as a performance accelerator: because they live on every GPU, they hide a chunk of the all-to-all latency behind their own compute.

The one sentence to carry forward is this: shared experts are how DeepSeek separates the universal from the specialist, and that separation is what unlocks the truly fine-grained routed pools (hundreds of experts, only a handful active per token) that define the V2 and V3 generations.