Chapter 5
25 min read
Section 31 of 117

Implementing DeepSeekMoE

Mixture-of-Experts: DeepSeekMoE

The previous five sections built up the DeepSeekMoE idea one piece at a time: why sparse conditional compute, how a router decides, why fine-grained experts outperform fat ones, why a small always-on shared block stabilizes the whole thing, and how the routed pool gets sharded across hundreds of GPUs. Now we wire it all together. This section is the assembly — one complete nn.Module that takes a batched transformer hidden state in, runs every part of DeepSeekMoE, and hands the right tensor back to the next transformer layer.

What you should walk away with. A mental model of the full DeepSeekMoE forward pass, the exact tensor shapes at every stage, a from-scratch NumPy reference you can step through with a debugger, and a production-style PyTorch block that scales — with the same code structure — from a toy on your laptop to the 671B-parameter DeepSeek-V3 on a real cluster.

From Five Ideas to One Block

Before code, a quick stocktake. A DeepSeekMoE block is built out of exactly five ingredients, and every one of them was earned in a previous section of this chapter:

IngredientEarned inWhat it contributes
Sparse conditional compute (top-k)5.1Activate only a fraction of parameters per token
Soft router with softmax gates5.2Learn which expert sees which token, end-to-end
Fine-grained expert pool (large N_r)5.3Many small specialists, more routing combinations
Shared experts (small N_s)5.4Absorb common-knowledge load, free the routed pool
Expert parallelism (all-to-all)5.5Shard the routed pool across many GPUs

The implementation is just these five things glued in the right order. Most of the engineering complexity sits in the glue: the right tensor shapes, the right flatten/reshape sandwich, the right place to put the softmax, and — for real clusters — the right all-to-all hook. The forward pass equation, by contrast, is astonishingly simple.

The Real Problem: Plumbing the Whole Forward Pass

Each previous section showed its piece in isolation. The router lived in a tiny NumPy snippet. Shared experts had their own toy. Fine-grained decomposition was a parameter-count argument. None of those snippets was a thing you could drop into a transformer and train. The implementation problem is exactly that: turn five mental models into one batched, autograd-traceable, shard- ready module.

Three things make this hard in practice. First, tensor shapes change rapidly: (B,T,D)(B, T, D) in, (N,D)(N, D) after flattening, (N,Er)(N, E_r) after the router, (N,k)(N, k) after top-k, then per-expert mini-batches of shape (M,D)(M, D) where MM changes every step. Getting the indexing right is most of the bug surface. Second, the shared and routed paths combine by addition, not by a global softmax — a subtle but load-bearing detail that has been gotten wrong in more than one published reimplementation. Third, the same forward code has to work for a 1-GPU toy and a 256-GPU cluster — the only difference being where the all-to-all wrappers appear.

Why this section exists. You can read every previous section and still not be able to write a working DeepSeekMoE block from scratch. The forward equation hides four pages of tensor bookkeeping. We walk that bookkeeping line by line, in NumPy first and PyTorch second, so the block stops being magical and becomes the kind of thing you would code in an afternoon.

Intuition: A Tiny Scheduler Around Two Pools

Strip the autograd machinery away and a DeepSeekMoE block is a scheduler. For each token, it makes one decision: which routed experts get to see this token. The shared experts do not need a decision — they always run. So the forward loop is:

  1. Always-on path. Run every shared expert on every token. Sum their outputs.
  2. Scoring. The router computes one logit per routed expert. Cheap matmul, no FFNs touched yet.
  3. Selection. Top-kk picks the winners. The other NrkN_r - k routed experts go to sleep for this token.
  4. Gating. Softmax over the survivors gives convex gates that sum to 1.
  5. Specialist path. Each surviving routed expert runs on the tokens that picked it. Multiply by the gate, add into the running output.
  6. Return. The accumulated yy is the block's output. Pass to the next transformer layer as if nothing interesting happened.
The mental picture to keep. Two pools, two paths, one sum. The shared pool is plumbed directly from the token. The routed pool sits behind a router that picks kk winners per token. Their outputs get added — never averaged, never softmaxed-together. That additive combination is the whole DeepSeekMoE architectural commitment.

The Complete DeepSeekMoE Forward Equation

Pull every piece together and the block computes a single tensor from a single tensor:

y(x)=i=1NsEs(i)(x)always-on shared+iT(x)gi(x)Er(i)(x)gated routedy(x) = \underbrace{\sum_{i=1}^{N_s} E_s^{(i)}(x)}_{\text{always-on shared}} + \underbrace{\sum_{i \in \mathcal{T}(x)} g_i(x) \cdot E_r^{(i)}(x)}_{\text{gated routed}}

The notation is the same we have used throughout the chapter, repeated here for completeness. xRdx \in \mathbb{R}^{d} is the input token hidden state. Es(i):RdRdE_s^{(i)}: \mathbb{R}^{d} \to \mathbb{R}^{d} and Er(i):RdRdE_r^{(i)}: \mathbb{R}^{d} \to \mathbb{R}^{d} are individual expert FFNs — typically E(x)=W2σ(W1x)E(x) = W_2 \, \sigma(W_1 x) with σ\sigma a GELU. The top-k set T(x)=TopK(Wrx,k)\mathcal{T}(x) = \mathrm{TopK}(W_r x,\, k) contains the kk indices with the largest router scores. The gates gi(x)=exp(si)/jT(x)exp(sj)g_i(x) = \exp(s_i) / \sum_{j \in \mathcal{T}(x)} \exp(s_j) come from a softmax restricted to the survivors — gi=0g_i = 0 for iT(x)i \notin \mathcal{T}(x).

What the equation does not say

Three things are conspicuously absent from the formula, and each absence encodes an architectural decision:

  1. No global softmax across shared and routed. The two sums are added independently. Shared experts are not in the gating competition; they are part of the architectural baseline.
  2. No bias term on the router. DeepSeek's router uses a plain linear layer with no bias. The auxiliary-loss-free balancing in chapter 6 will add a per-expert bias to logits only for selection, not for the gates — a subtle distinction that matters at scale.
  3. No capacity drop. The equation assumes every token gets its k winners. Real training uses a capacity factor that occasionally drops tokens when a single expert overflows; we covered that in 5.5.

FLOP accounting

A dense FFN of width dffd_{ff} costs roughly 2ddff2 \cdot d \cdot d_{ff} FLOPs per token. A DeepSeekMoE block with NsN_s shared experts, NrN_r routed experts, and top-kk routing costs:

FLOPs(x)(Ns+k)2ddff+Nrd\mathrm{FLOPs}(x) \approx (N_s + k) \cdot 2 d \cdot d_{ff} + N_r \cdot d

The first term is the FFNs that actually run. The second is the router's matvec — negligible because ddffd \ll d_{ff}. Compare against the (Ns+Nr)2ddff(N_s + N_r) \cdot 2 d \cdot d_{ff} parameters held in memory and you see the whole DeepSeekMoE bargain in one ratio:active/total=(Ns+k)/(Ns+Nr)\mathrm{active}/\mathrm{total} = (N_s + k) / (N_s + N_r). For DeepSeek-V3 that is (1+8)/(1+256)3.5%(1 + 8) / (1 + 256) \approx 3.5\% — and that is why a 671B-parameter model trains and serves at the cost of a ~37B-parameter dense model.

Manual Numerical Walkthrough

We push one toy token through a complete block with d=4d = 4, Ns=2N_s = 2, Nr=8N_r = 8, k=2k = 2. Every number below is computed by hand so you can verify any step against your own reimplementation.

Click to expand: one token through the complete block, by hand

Setup. Token x=[0.8,0.3,0.5,0.2]x = [0.8,\, -0.3,\, 0.5,\, 0.2] (an embedding for the toy code-domain token "def fibonacci(n):"). Two shared experts and eight routed experts. To keep arithmetic readable, we use precomputed per-expert outputs rather than write out every weight matrix.

(A) Shared path — always on. The two shared experts produce outputs:

  • Es(1)(x)=[0.20,0.15,0.10,0.18]E_s^{(1)}(x) = [0.20,\, 0.15,\, 0.10,\, 0.18]
  • Es(2)(x)=[0.12,0.22,0.14,0.09]E_s^{(2)}(x) = [0.12,\, 0.22,\, 0.14,\, 0.09]

Summing: Σshared=[0.32,0.37,0.24,0.27]\Sigma\,\text{shared} = [0.32,\, 0.37,\, 0.24,\, 0.27]. This is the unconditional baseline — every token in the corpus, regardless of domain, would have a non-trivial shared contribution.

(B) Router scores the routed pool. Router logits across the 8 routed experts (precomputed from WrxW_r x):

s=[3.2,0.4,0.7,2.6,0.1,1.4,0.8,0.9]s = [3.2,\, 0.4,\, 0.7,\, 2.6,\, 0.1,\, 1.4,\, 0.8,\, 0.9]

(C) Top-k = 2 selection. Argsort says experts 1 (score 3.2) and 4 (score 2.6) survive. The other six experts contribute zero — their FFNs never run.

Softmax over the 2 survivors. Subtract max: z=[0,0.6]z = [0,\, -0.6]. ez=[1.0,0.549]e^z = [1.0,\, 0.549], sum 1.549. So g1=1.0/1.549=0.646g_1 = 1.0 / 1.549 = 0.646 and g4=0.549/1.549=0.354g_4 = 0.549 / 1.549 = 0.354. Gates sum to 1.

(D) Selected routed experts run. Each surviving expert produces its own output:

  • Er(1)(x)=[1.40,0.20,0.10,0.30]E_r^{(1)}(x) = [1.40,\, 0.20,\, 0.10,\, 0.30] (a code specialist)
  • Er(4)(x)=[1.10,0.30,0.20,0.40]E_r^{(4)}(x) = [1.10,\, 0.30,\, 0.20,\, 0.40] (another code-leaning expert)

Gated contributions: 0.646[1.40,0.20,0.10,0.30]=[0.904,0.129,0.065,0.194]0.646 \cdot [1.40, 0.20, 0.10, 0.30] = [0.904, 0.129, 0.065, 0.194] and 0.354[1.10,0.30,0.20,0.40]=[0.389,0.106,0.071,0.142]0.354 \cdot [1.10, 0.30, 0.20, 0.40] = [0.389, 0.106, 0.071, 0.142]. Sum: Σrouted=[1.293,0.235,0.136,0.336]\Sigma\,\text{routed} = [1.293,\, 0.235,\, 0.136,\, 0.336].

(E) Combine. Final output y=Σshared+Σroutedy = \Sigma\,\text{shared} + \Sigma\,\text{routed}:

y=[0.32+1.293,0.37+0.235,0.24+0.136,0.27+0.336]=[1.613,0.605,0.376,0.606]y = [0.32 + 1.293,\, 0.37 + 0.235,\, 0.24 + 0.136,\, 0.27 + 0.336] = [1.613,\, 0.605,\, 0.376,\, 0.606]

FLOP audit on this token. Active FFNs: Ns+k=2+2=4N_s + k = 2 + 2 = 4. Inactive FFNs: Nrk=6N_r - k = 6 — their parameters lived in memory but no compute touched them. Compared to a dense FFN of the same width, the block did 4× the work of one FFN but stored Ns+Nr=10N_s + N_r = 10× the parameters. That 4-vs-10 ratio is the entire DeepSeekMoE bargain in miniature.

Now swap the token. If we replace x with a math-domain token, the shared contribution barely changes (it is the baseline), but the router logits redirect — experts 2 (math specialist) and 5 (math-adjacent) survive instead of 1 and 4. The same 4 FFNs run, but two of them are different. That swap is the specialization that vanilla dense models cannot do at the same parameter budget.

Visualizing the Forward Pass

The tracer below walks the full block one stage at a time. Pick a token at the top right; the shared (violet) experts always fire, while the routed (green) pool only lights up the two experts the router selects for that domain. Use Play to auto-advance through the seven stages or step manually with the arrows.

Loading forward-pass tracer…

Three things to watch as you switch tokens. First, the violet column is invariant — the same two shared experts fire for every token, with nearly the same output magnitude. That is the baseline doing its job. Second, the green grid changes its highlighted cells: the router picks different routed experts for code vs. math vs. history. The dimmed cells stayed in memory but consumed zero FLOPs. Third, the final output yy is just the sum of the two boxed accumulators — never a softmax, never a weighted average across the two pools.

Plain Python: A Complete DeepSeekMoE Block

Here is the entire block in NumPy. No autograd, no batching, no GPU — just one class with one forward method that you can step through with pdb. The shape of every tensor matches the equation we just wrote.

🐍deepseek_moe_numpy.py
3One class, two pools, one router

The whole DeepSeekMoE architecture lives inside a single class so you can trace one forward call top to bottom. Two ModuleList-like pools (shared + routed) plus one router — nothing else.

6Five integers define the block

d is the hidden size, d_ff the FFN width per expert. n_shared = always-on count (1–2 in real DeepSeek), n_routed = total routed experts, k = how many routed experts each token visits.

EXECUTION STATE
n_shared = 2
n_routed = 8
k = 2
11Shared expert weights

Stacked tensors so all shared experts can run in a single batched matmul when we move to PyTorch. Shape is (E_s, d_ff, d) — expert axis first.

EXECUTION STATE
W1_s.shape = (2, 8, 4)
W2_s.shape = (2, 4, 8)
13Routed expert weights

Identical tensor shape, just a different runtime contract: only k of these n_routed experts will execute per token. Storage cost is full; compute cost is k / n_routed.

EXECUTION STATE
W1_r.shape = (8, 8, 4)
W2_r.shape = (8, 4, 8)
16Router is scoped to the routed pool

The router has shape (n_routed, d), not (n_shared + n_routed, d). Shared experts are architecturally always on — no logit, no gate. Mixing them into the router is the most common DeepSeekMoE bug.

EXECUTION STATE
W_router.shape =
(8, 4)
18Per-expert FFN: the building block

A vanilla two-layer FFN: W1 projects up to d_ff, ReLU non-linearity, W2 projects back to d. Both shared and routed experts use the exact same architecture — the only thing that differs is when they fire.

23Allocate the output accumulator

y starts at zero and grows by addition. There is no global softmax across shared and routed outputs — both are summed directly into y. Order of accumulation does not matter; addition is commutative.

EXECUTION STATE
y.shape = (4,)
25Run every shared expert

No router, no gate, no mask — every shared expert just runs. This is the always-on path. After this loop, y holds Σᵢ Eₛ⁽ⁱ⁾(x).

29Router scores the routed pool

One linear matvec: W_router (n_routed, d) times x (d,) gives n_routed scores. Cheap — O(n_routed · d) — much smaller than any FFN. The router is the only learned part of the gating decision.

EXECUTION STATE
logits.shape = (8,)
30Top-k selection on raw logits

np.argsort(-logits)[:k] returns the indices of the k highest logits. Real implementations use np.argpartition or torch.topk for O(n) instead of O(n log n), but the semantics are identical.

EXAMPLE
For logits = [3.2, 0.4, 0.7, 2.6, 0.1, 1.4, 0.8, 0.9] with k=2, topk = [0, 3].
33Numerically-stable softmax on the survivors

Subtract the max before exp to prevent overflow when logits are large. Softmax is restricted to the k chosen logits — the dropped experts contribute zero, not 1/n_routed. After this, gates is a (k,) vector that sums to 1.

EXECUTION STATE
gates.shape = (2,)
sum(gates) = 1.0
37Run only the selected routed experts

We loop over the k surviving (expert, gate) pairs — not over all n_routed experts. The dropped experts' weights sit in memory at zero FLOP cost. y accumulates g · FFN(x) for each survivor.

40Return the output and a routing trace

Returning the (chosen, gates, logits) trace alongside y is what real training code does: the trace feeds the load-balancing loss in chapter 6 and the routing logger for evaluation. Without it you have no observability into routing behavior.

42Construct and call

Toy dimensions: d=4, 2 shared + 8 routed experts, top-k=2. Real DeepSeek-V3 uses d=7168, 1 shared + 256 routed, top-k=8. The mechanism is identical — only the integers grow.

34 lines without explanation
1import numpy as np
2
3class DeepSeekMoENumPy:
4    """One full DeepSeekMoE block in plain NumPy. No autograd, no shortcuts."""
5
6    def __init__(self, d, d_ff, n_shared, n_routed, k, seed=0):
7        self.d, self.d_ff = d, d_ff
8        self.n_shared, self.n_routed, self.k = n_shared, n_routed, k
9        rng = np.random.default_rng(seed)
10
11        # Two pools of FFNs, stored separately on purpose.
12        self.W1_s = rng.standard_normal((n_shared, d_ff, d)) * 0.05
13        self.W2_s = rng.standard_normal((n_shared, d, d_ff)) * 0.05
14        self.W1_r = rng.standard_normal((n_routed, d_ff, d)) * 0.05
15        self.W2_r = rng.standard_normal((n_routed, d, d_ff)) * 0.05
16
17        # Router maps d -> n_routed (NEVER n_shared + n_routed).
18        self.W_router = rng.standard_normal((n_routed, d)) * 0.05
19
20    def _ffn(self, W1, W2, x):
21        # Single-expert forward: GELU-ish (use ReLU here for clarity).
22        return W2 @ np.maximum(0, W1 @ x)
23
24    def forward(self, x):
25        # ---- (A) shared block: unconditional, no router ----
26        y = np.zeros(self.d)
27        for i in range(self.n_shared):
28            y += self._ffn(self.W1_s[i], self.W2_s[i], x)
29
30        # ---- (B) router scores ONLY the routed pool ----
31        logits = self.W_router @ x                  # (n_routed,)
32        topk = np.argsort(-logits)[: self.k]        # indices of k winners
33
34        # ---- (C) softmax restricted to the survivors ----
35        z = logits[topk] - logits[topk].max()
36        gates = np.exp(z) / np.exp(z).sum()         # (k,), sums to 1
37
38        # ---- (D) run routed experts and accumulate gated outputs ----
39        for g, i in zip(gates, topk):
40            y += g * self._ffn(self.W1_r[i], self.W2_r[i], x)
41
42        return y, {"chosen": topk, "gates": gates, "logits": logits}
43
44block = DeepSeekMoENumPy(d=4, d_ff=8, n_shared=2, n_routed=8, k=2)
45x = np.array([0.8, -0.3, 0.5, 0.2])
46y, info = block.forward(x)
47print("chosen:", info["chosen"], "gates:", info["gates"].round(3))
48print("y:", y.round(3))

Two structural observations before we move on. First, the router's output dim is NrN_r on line 16 — not Ns+NrN_s + N_r. If you find yourself wanting to score the shared experts too, you are off the architecture. Second, the only difference between the shared loop (line 25) and the routed loop (line 37) is that the routed loop iterates over chosen indices with gates, while the shared loop iterates over every shared expert with no gate. That single line of asymmetry is the whole point of the shared-vs-routed split.

Sanity check via degenerate cases. Set Ns=0N_s = 0: the block collapses to vanilla MoE (section 5.1). Set k=0k = 0 and Ns>0N_s > 0: the block becomes a stacked dense FFN. Set Nr=0N_r = 0: the block is a dense FFN. DeepSeekMoE is the principled interpolation across all three corners.

PyTorch: Production-Style Implementation

Now the production-style PyTorch version. Two things change from the NumPy reference: weights are stored as packed (E,dff,d)(E, d_{ff}, d) tensors so we can issue grouped GEMMs, and the routed loop uses the gather / run / scatter trick that keeps GPU matmuls large. Everything else is one-to-one with the math.

🐍deepseek_moe_pytorch.py
5One nn.Module, every part of DeepSeekMoE inside

DeepSeekMoEBlock is the production-style assembly — a drop-in replacement for a dense FFN inside a transformer block. autograd handles all gradients; the engineer only writes the forward.

12Stacked weights instead of ModuleList

Real DeepSeek code stores both pools as packed (E, d_ff, d) tensors so we can issue grouped GEMMs instead of looping. ModuleList is fine for clarity; this form is faster and simpler to shard.

EXECUTION STATE
W1_s.shape = (n_shared, d_ff, d_model)
W1_r.shape = (n_routed, d_ff, d_model)
17Kaiming init for ReLU/GELU networks

Each expert is independently initialized with Kaiming uniform — the same init you would use for a single FFN. There is no special init for the always-on pool: the architecture (not the init) gives them their always-on role.

20Router emits exactly n_routed logits

Linear(d_model, n_routed). The shared experts are NOT in this output dim. bias=False matches DeepSeek's convention — bias on the router would interact with the auxiliary-loss-free load balancing we cover in chapter 6.

EXECUTION STATE
router.weight.shape = (n_routed, d_model)
22Pool runner: one batched matmul per layer

_run_pool runs every expert in a stacked pool on every row of x. Output shape (E, N, d) gives the caller a clean tensor to sum (shared) or gather rows from (routed). Used here only for the shared pool — the routed pool needs per-expert gather/scatter.

27Two einsums = two batched GEMMs

einsum('efd,nd->enf', W1, x) is a batched matmul: each of the E experts multiplies the (N, d) input. cuBLAS dispatches this as a single grouped GEMM — far faster than a Python loop over experts.

EXECUTION STATE
h.shape = (E, N, d_ff)
32Standard (B, T, D) input

The MoE block plugs in wherever a dense FFN would. Routing decisions are per-token; (B, T) gets flattened to N tokens for the MoE math, then reshaped back at the end.

EXECUTION STATE
x.shape = (B, T, D)
33Flatten to (N, D)

N = B · T. Each row is one token; routing is independent across rows. The same word appearing in two sequences is allowed to land on different routed experts — context drives routing, not lexical identity.

EXECUTION STATE
x_flat.shape = (N, D)
36Run all shared experts in one shot

_run_pool returns (E_s, N, D). We sum across the expert axis to get y of shape (N, D) — the unconditional shared baseline. No router was consulted; every token got every shared expert.

EXECUTION STATE
shared_out.shape = (E_s, N, D)
y.shape = (N, D)
40Score the routed pool

Linear(D, n_routed) on (N, D) → (N, n_routed). Output dim is n_routed — confirming once more that the shared experts are outside the routing scope.

EXECUTION STATE
logits.shape = (N, n_routed)
41torch.topk: per-token top-k along the expert axis

Returns the k largest logits and their indices per token. topk_idx[n] are the routed experts token n must visit. torch.topk is O(N · E_r) — cheaper than a full sort because it stops at k.

EXECUTION STATE
topk_idx.shape = (N, k)
42Softmax over the k survivors only

F.softmax on topk_w gives k gates per token that sum to 1. The dropped experts contribute zero — softmax is restricted, not masked-after-softmax. Shared and routed paths are NOT renormalized together.

EXECUTION STATE
gates.shape = (N, k)
45Loop over routed experts, not tokens

For each expert i, gather the tokens that picked it, run that expert ONCE on the packed mini-batch, scatter back. One large matmul per expert beats thousands of tiny ones — this is the whole reason real MoE training is efficient.

46Build a per-expert mask

topk_idx == i broadcasts to (N, k). True where token n chose expert i in any of its k slots. mask.any() lets us skip experts that got zero tokens this batch — a free 90%+ speedup when routing is sparse.

50Gather gates for selected (token, slot) pairs

mask.nonzero gives (row, slot) coordinates of True entries. We index gates with these to get the right scalar weight for each token's contribution. unsqueeze(-1) makes g broadcastable against the (M, D) output.

EXECUTION STATE
tok_rows.shape = (M,)
g.shape = (M, 1)
52Routed expert forward on packed mini-batch

x_flat[tok_rows] is the (M, D) packed mini-batch of tokens that picked expert i. Two matmuls and a GELU later, we have (M, D) out — one expert's contribution for all the tokens it owns.

EXECUTION STATE
h.shape = (M, d_ff)
out.shape = (M, D)
54Scatter gated output into y

y[tok_rows] += g * out: in-place scatter-add. y already holds the shared contribution from line 37; we layer the gated routed contribution on top. Multiple routed experts can write to the same row (the k > 1 case) — addition handles that cleanly.

56Reshape and return the trace

Reshape back to (B, T, D) for the transformer. The returned dict (logits, topk_idx, gates) feeds the load-balancing loss and the routing logger. Hide it behind a debug flag in production — the dict allocations are not free at scale.

38 lines without explanation
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5class DeepSeekMoEBlock(nn.Module):
6    def __init__(self, d_model: int, d_ff: int,
7                 n_shared: int, n_routed: int, k: int):
8        super().__init__()
9        self.d_model, self.k = d_model, k
10        self.n_shared, self.n_routed = n_shared, n_routed
11
12        # Stacked weights for batched matmul. (E, d_ff, d) and (E, d, d_ff).
13        self.W1_s = nn.Parameter(torch.empty(n_shared, d_ff, d_model))
14        self.W2_s = nn.Parameter(torch.empty(n_shared, d_model, d_ff))
15        self.W1_r = nn.Parameter(torch.empty(n_routed, d_ff, d_model))
16        self.W2_r = nn.Parameter(torch.empty(n_routed, d_model, d_ff))
17        for w in (self.W1_s, self.W2_s, self.W1_r, self.W2_r):
18            nn.init.kaiming_uniform_(w, a=5 ** 0.5)
19
20        self.router = nn.Linear(d_model, n_routed, bias=False)
21
22    def _run_pool(self, W1, W2, x):
23        # x: (N, d) — run every expert in the pool on every row.
24        # Returns (E, N, d) so the caller decides how to combine.
25        h = torch.einsum("efd,nd->enf", W1, x)            # (E, N, d_ff)
26        h = F.gelu(h)
27        return torch.einsum("edf,enf->end", W2, h)        # (E, N, d)
28
29    def forward(self, x: torch.Tensor):
30        B, T, D = x.shape
31        x_flat = x.reshape(B * T, D)                      # (N, D)
32
33        # (A) Shared path: stacked GEMM, then sum across the shared axis.
34        shared_out = self._run_pool(self.W1_s, self.W2_s, x_flat)  # (E_s, N, D)
35        y = shared_out.sum(dim=0)                                  # (N, D)
36
37        # (B) Router scores the routed pool.
38        logits = self.router(x_flat)                      # (N, E_r)
39        topk_w, topk_idx = logits.topk(self.k, dim=-1)    # (N, k), (N, k)
40        gates = F.softmax(topk_w, dim=-1)                 # (N, k)
41
42        # (C) Loop over routed experts, gather tokens, run, scatter.
43        for i in range(self.n_routed):
44            mask = (topk_idx == i)                        # (N, k)
45            if not mask.any():
46                continue
47            tok_rows, slot = mask.nonzero(as_tuple=True)
48            g = gates[tok_rows, slot].unsqueeze(-1)       # (M, 1)
49            # Single packed FFN call for all M tokens of expert i.
50            h = F.gelu(x_flat[tok_rows] @ self.W1_r[i].T) # (M, d_ff)
51            out = h @ self.W2_r[i].T                      # (M, D)
52            y[tok_rows] += g * out
53
54        return y.reshape(B, T, D), {
55            "logits": logits, "topk_idx": topk_idx, "gates": gates,
56        }

Three implementation details that real DeepSeek code does differently and you should know about:

  1. Grouped GEMM instead of a Python loop. The for i in range(self.n_routed) loop is fine at 8 experts but a throughput killer at 256. Real implementations use torch._scaled_grouped_mm (or DeepGEMM, the kernel DeepSeek released) to issue all the routed FFNs as one batched call. The Python becomes ~10 lines shorter and ~3× faster.
  2. Permute, then call, then unpermute. A common pattern in high-performance MoE code: sort the token batch so all tokens for expert 0 come first, then all for expert 1, then call a single grouped GEMM with a length-per-expert array, then unpermute back. This avoids any per-expert indexing and lets the kernel use contiguous memory.
  3. All-to-all wraps the routed loop. On a multi-GPU cluster, the permute step also dispatches tokens to the GPU that owns each routed expert. The forward pass becomes: shared FFN locallypermute + all-to-allrouted grouped GEMM on remote GPU all-to-all back + unpermuteadd to shared output. The shared path runs entirely on-device with no comms — section 5.5 showed why.
One implementation, two scales. The block above runs unchanged on a single GPU for a unit test and inside an FSDP+EP wrapper for a 256-GPU pretraining run. Only the optimizer and the parallelism config change — the nn.Module is the same code. That is the entire reason this section exists.

What Changes at Massive Scale

Take the same DeepSeekMoEBlock above, change three integers, and you have a layer of DeepSeek-V3:

ConfigurationToy (this section)DeepSeek-V2DeepSeek-V3
d_model451207168
d_ff (per expert)815362048
n_shared221
n_routed8160256
top-k routed268
Active params / block~1 KB~75 M~140 M
Total params / block~5 KB~2 B~4 B
Active / total ratio40%3.7%3.5%

Four operational concerns emerge once the integers grow.

Memory: optimizer states dominate

At a 4B-parameter block, the weights themselves are 8 GB in bf16. AdamW adds moment buffers — first and second moments, both in fp32 for stability — and you are at 24 GB per block before activations. FSDP / ZeRO shards the optimizer states across data-parallel replicas; expert parallelism additionally shards the routed weights themselves across EpE_p GPUs. The shared block stays replicated everywhere — it is small enough that the cross-GPU traffic to shard it would cost more than the memory it saves.

Communication: all-to-all is the new attention

Every routed-expert pool larger than one node lives behind an all-to-all collective. Each token has to travel to whichever GPU owns its kk chosen experts, run, and return. With 256 experts sharded across 32 GPUs, each forward pass moves on the order of NkDN \cdot k \cdot D bytes per direction. Bandwidth is the bottleneck — not FLOPs. DeepSeek uses InfiniBand-tuned NCCL kernels and the DeepEP library to overlap the all-to-all with the shared-block compute, hiding most of the latency behind work the shared experts were going to do anyway.

Numerical precision: fp8 wherever you can

DeepSeek-V3 runs the routed expert GEMMs in fp8 — the router stays in bf16, the softmax stays in fp32, the FFN matmuls drop to fp8. Per-tile scaling and an fp32 accumulate keep the fp8 numerics stable. The mixed-precision sandwich saves ~40% of memory traffic on the dominant cost (routed FFNs) without measurable loss damage. Shared experts can stay in bf16 — they handle every token, so any fp8 noise compounds faster there.

Load balance: the elephant in the next chapter

The forward pass above assumes the router picks roughly equal numbers of tokens per expert. It does not. Without intervention, training collapses into routing one or two experts hard while the rest starve. The classical fix — an auxiliary load-balance loss — actively hurts model quality at scale. DeepSeek invented the auxiliary-loss-free bias-term trick that fixes the imbalance without contaminating the gradient signal, and that is the entire subject of chapter 6. The block we just built is the right shape for that fix to slot into; you will see exactly where the bias term lives when we get there.

Engineering Reality and Gotchas

Three failure modes you will absolutely hit if you reimplement DeepSeekMoE without reading the next chapter first:

  1. Routing collapse on warmup. Cold-started routers produce near-uniform logits, so top-k picks effectively random experts for the first few hundred steps. The lucky experts get more gradient, become slightly better targets, get even more tokens — a feedback loop that ends with two experts owning the entire batch and the other 254 idle. Chapter 6 covers the bias-term fix; in the meantime, kaiming-uniform on the router (not zero init) and a short LR warmup buy you stability.
  2. Activation checkpointing recomputes routing. If you activation-checkpoint the MoE block (which you almost certainly will, to fit training), the backward pass re-runs the router and top-k. That is fine mathematically — the same logits produce the same top-k — but you must seed any stochastic routing identically on the recompute, and you must avoid any non-determinism (e.g., atomic gather/scatter) that would change which experts claim which tokens.
  3. Mixed precision around softmax. Run softmax in fp32 even when everything around it is bf16. The exp() is highly sensitive to small logit shifts — fp16 softmax loses gates entirely when one logit dominates. The cost is negligible; the bug it prevents is silent and devastating.
The structure you should keep. The DeepSeekMoEBlock we wrote is the right shape for the next four chapters of the book. Chapter 6 adds the bias-term to its logits line. Chapter 7 (MTP) keeps it unchanged but adds prediction heads downstream. The MLA chapters change the attention block that feeds this MoE block. By committing to this exact interface — one nn.Module, two pools, one router, additive output — you decouple the architectural innovations from each other and make them composable.

The one sentence to carry forward: DeepSeekMoE is a sum of two pools, gated only on one side. Every other piece of complexity — fine-grained decomposition, expert parallelism, fp8 GEMMs, bias-term balancing — is an optimization layered on top of that single additive equation. Once that equation is in your fingers, the rest of the chapter (and the rest of the book) is engineering, not architecture.

Loading comments...