The previous five sections built up the DeepSeekMoE idea one piece at a time: why sparse conditional compute, how a router decides, why fine-grained experts outperform fat ones, why a small always-on shared block stabilizes the whole thing, and how the routed pool gets sharded across hundreds of GPUs. Now we wire it all together. This section is the assembly — one complete nn.Module that takes a batched transformer hidden state in, runs every part of DeepSeekMoE, and hands the right tensor back to the next transformer layer.
What you should walk away with. A mental model of the full DeepSeekMoE forward pass, the exact tensor shapes at every stage, a from-scratch NumPy reference you can step through with a debugger, and a production-style PyTorch block that scales — with the same code structure — from a toy on your laptop to the 671B-parameter DeepSeek-V3 on a real cluster.
From Five Ideas to One Block
Before code, a quick stocktake. A DeepSeekMoE block is built out of exactly five ingredients, and every one of them was earned in a previous section of this chapter:
| Ingredient | Earned in | What it contributes |
|---|---|---|
| Sparse conditional compute (top-k) | 5.1 | Activate only a fraction of parameters per token |
| Soft router with softmax gates | 5.2 | Learn which expert sees which token, end-to-end |
| Fine-grained expert pool (large N_r) | 5.3 | Many small specialists, more routing combinations |
| Shared experts (small N_s) | 5.4 | Absorb common-knowledge load, free the routed pool |
| Expert parallelism (all-to-all) | 5.5 | Shard the routed pool across many GPUs |
The implementation is just these five things glued in the right order. Most of the engineering complexity sits in the glue: the right tensor shapes, the right flatten/reshape sandwich, the right place to put the softmax, and — for real clusters — the right all-to-all hook. The forward pass equation, by contrast, is astonishingly simple.
The Real Problem: Plumbing the Whole Forward Pass
Each previous section showed its piece in isolation. The router lived in a tiny NumPy snippet. Shared experts had their own toy. Fine-grained decomposition was a parameter-count argument. None of those snippets was a thing you could drop into a transformer and train. The implementation problem is exactly that: turn five mental models into one batched, autograd-traceable, shard-ready module.
Three things make this hard in practice. First, tensor shapes change rapidly: $(B, T, d)$ in, $(B \cdot T, d)$ after flattening, $(B \cdot T, N_r)$ after the router, $(B \cdot T, k)$ after top-k, then per-expert mini-batches of shape $(n_i, d)$, where $n_i$ changes every step. Getting the indexing right is most of the bug surface. Second, the shared and routed paths combine by addition, not by a global softmax. This is a subtle but load-bearing detail that has been gotten wrong in more than one published reimplementation. Third, the same forward code has to work for a 1-GPU toy and a 256-GPU cluster, the only difference being where the all-to-all wrappers appear.
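The shape sequence in the first point can be traced with a few lines of NumPy. This is a minimal sketch with made-up sizes: `B`, `T`, `d`, `n_routed`, and `k` are illustrative, and the random router weights stand in for trained ones. It only tracks shapes and does not run any expert.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, d, n_routed, k = 2, 3, 4, 8, 2           # illustrative sizes only

x = rng.standard_normal((B, T, d))             # (B, T, d) block input
flat = x.reshape(B * T, d)                     # (B*T, d) one row per token
W_router = rng.standard_normal((d, n_routed))  # stand-in router weights
logits = flat @ W_router                       # (B*T, n_routed) router scores

# Top-k expert indices per token, best first.
topk_idx = np.argsort(-logits, axis=1)[:, :k]  # (B*T, k)

# Per-expert mini-batches: expert i sees n_i tokens, and n_i varies per step.
per_expert = {i: flat[(topk_idx == i).any(axis=1)] for i in range(n_routed)}

shapes = {
    "input": x.shape,
    "flattened": flat.shape,
    "router logits": logits.shape,
    "top-k indices": topk_idx.shape,
}
tokens_per_expert = [per_expert[i].shape[0] for i in range(n_routed)]
print(shapes)
print("tokens per expert:", tokens_per_expert)
```

The per-expert counts always sum to `B * T * k`, because every token picks `k` distinct experts; how that total is split across experts is what changes from step to step.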
Intuition: A Tiny Scheduler Around Two Pools
Strip the autograd machinery away and a DeepSeekMoE block is a scheduler. For each token, it makes one decision: which routed experts get to see this token. The shared experts do not need a decision — they always run. So the forward loop is:
- Always-on path. Run every shared expert on every token. Sum their outputs.
- Scoring. The router computes one logit per routed expert. Cheap matmul, no FFNs touched yet.
- Selection. Top-k picks the winners. The other routed experts go to sleep for this token.
- Gating. Softmax over the survivors gives convex gates that sum to 1.
- Specialist path. Each surviving routed expert runs on the tokens that picked it. Multiply by the gate, add into the running output.
- Return. The accumulated sum $y$ is the block's output. Pass it to the next transformer layer as if nothing interesting happened.
The Complete DeepSeekMoE Forward Equation
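Pull every piece together and the block computes a single output tensor from a single input tensor:
$$y = \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i(x) \;+\; \sum_{j \in \mathcal{T}(x)} g_j(x)\, \mathrm{FFN}^{(r)}_j(x)$$
The notation is the same we have used throughout the chapter, repeated here for completeness. $x \in \mathbb{R}^{d}$ is the input token hidden state. $\mathrm{FFN}^{(s)}_i$ and $\mathrm{FFN}^{(r)}_j$ are individual shared and routed expert FFNs, typically a two-layer MLP with a GELU. The top-k set $\mathcal{T}(x)$ contains the indices of the $k$ largest router scores $s(x) = W_r x \in \mathbb{R}^{N_r}$. The gates come from a softmax restricted to the survivors: $g_j(x) = \exp(s_j) / \sum_{l \in \mathcal{T}(x)} \exp(s_l)$ for $j \in \mathcal{T}(x)$.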
What the equation does not say
Three things are conspicuously absent from the formula, and each absence encodes an architectural decision:
- No global softmax across shared and routed. The two sums are added independently. Shared experts are not in the gating competition; they are part of the architectural baseline.
- No bias term on the router. DeepSeek's router uses a plain linear layer with no bias. The auxiliary-loss-free balancing in chapter 6 will add a per-expert bias to logits only for selection, not for the gates — a subtle distinction that matters at scale.
- No capacity drop. The equation assumes every token gets its k winners. Real training uses a capacity factor that occasionally drops tokens when a single expert overflows; we covered that in 5.5.
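The second bullet is easy to misread, so here is a small sketch of the distinction it describes: a per-expert bias influences which experts are selected, while the gates are still computed from the unbiased logits. The function name `route`, the bias values, and the logits are hypothetical, and the sketch says nothing about how the bias is maintained during training.

```python
import math

def route(logits, k, select_bias=None):
    """Pick top-k experts (optionally with a selection-only bias) and
    return softmax gates computed from the unbiased logits."""
    bias = select_bias or [0.0] * len(logits)
    biased = [s + b for s, b in zip(logits, bias)]
    # Selection uses the biased scores.
    chosen = sorted(range(len(logits)), key=lambda j: -biased[j])[:k]
    # Gating uses the raw logits of the chosen experts only.
    m = max(logits[j] for j in chosen)
    exps = {j: math.exp(logits[j] - m) for j in chosen}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

logits = [0.1, 3.2, 0.4, 0.2, 2.6, 0.3, 0.0, 2.5]   # hypothetical scores
plain = route(logits, k=2)                           # selects experts 1 and 4
# A positive bias on expert 7 lets it displace expert 4 in the selection,
# but its gate still reflects its raw logit (2.5), not the biased one.
nudged = route(logits, k=2, select_bias=[0, 0, 0, 0, 0, 0, 0, 0.5])
print(plain)
print(nudged)
```

Keeping the bias out of the gate computation means the bias can steer load without changing how much weight a selected expert's output receives.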
FLOP accounting
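A dense FFN of hidden width $d_{ff}$ on a model of width $d$ costs roughly $4\,d\,d_{ff}$ FLOPs per token (two matmuls, counting 2 FLOPs per multiply-add). A DeepSeekMoE block with $N_s$ shared experts, $N_r$ routed experts, and top-$k$ routing costs:
$$\mathrm{FLOPs}_{\text{token}} \approx (N_s + k) \cdot 4\,d\,d_{ff} \;+\; 2\,d\,N_r$$
The first term is the FFNs that actually run. The second is the router's matvec, negligible because $N_r \ll (N_s + k)\,d_{ff}$. Compare against the parameters held in memory and you see the whole DeepSeekMoE bargain in one ratio: $(N_s + k)/(N_s + N_r)$. For DeepSeek-V3 that is $(1 + 8)/(1 + 256) \approx 3.5\%$ per MoE block, and that is why a 671B-parameter model trains and serves at roughly the cost of a ~37B-parameter dense model.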
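The accounting can be checked with a few lines of plain Python. This is a sketch under the counting convention of two matmuls per FFN and 2 FLOPs per multiply-add; the helper names are hypothetical, and the DeepSeek-V3 numbers are the ones listed in this chapter's configuration table.

```python
def dense_ffn_flops(d_model, d_ff):
    """Two matmuls (d_model x d_ff and d_ff x d_model), 2 FLOPs per multiply-add."""
    return 4 * d_model * d_ff

def moe_block_flops(d_model, d_ff, n_shared, n_routed, top_k):
    """Per-token FLOPs: the FFNs that run, and the router matvec."""
    experts = (n_shared + top_k) * dense_ffn_flops(d_model, d_ff)
    router = 2 * d_model * n_routed
    return experts, router

def active_fraction(n_shared, n_routed, top_k):
    """Share of expert FFNs that run for any one token."""
    return (n_shared + top_k) / (n_shared + n_routed)

experts, router = moe_block_flops(7168, 2048, n_shared=1, n_routed=256, top_k=8)
ratio = active_fraction(n_shared=1, n_routed=256, top_k=8)
print(f"expert FLOPs/token: {experts:,}  router FLOPs/token: {router:,}")
print(f"router share: {router / experts:.2%}  active experts: {ratio:.1%}")
```

With these inputs the router accounts for well under 1% of the per-token FLOPs, which is why it is usually dropped from back-of-envelope estimates.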
Manual Numerical Walkthrough
We push one toy token through a complete block with $d = 4$, $N_s = 2$, $N_r = 8$, $k = 2$. Every number below is computed by hand so you can verify any step against your own reimplementation.
Setup. Token $x \in \mathbb{R}^4$ (an embedding for the toy code-domain token "def fibonacci(n):"). Two shared experts and eight routed experts. To keep arithmetic readable, we use precomputed per-expert outputs rather than write out every weight matrix.
(A) Shared path — always on. The two shared experts produce outputs:
Summing: . This is the unconditional baseline — every token in the corpus, regardless of domain, would have a non-trivial shared contribution.
(B) Router scores the routed pool. Router logits across the 8 routed experts (precomputed from ):
(C) Top-k = 2 selection. Argsort says experts 1 (score 3.2) and 4 (score 2.6) survive. The other six experts contribute zero — their FFNs never run.
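Softmax over the 2 survivors. Subtract max: $[3.2, 2.6] - 3.2 = [0, -0.6]$. Exponentiate: $e^{0} = 1$, $e^{-0.6} \approx 0.549$, sum 1.549. So $g_1 = 1/1.549 \approx 0.646$ and $g_4 = 0.549/1.549 \approx 0.354$. Gates sum to 1.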
(D) Selected routed experts run. Each surviving expert produces its own output:
- (a code specialist)
- (another code-leaning expert)
Gated contributions: and . Sum: .
(E) Combine. Final output :
FLOP audit on this token. Active FFNs: $2 + 2 = 4$ (two shared, two routed). Inactive FFNs: $8 - 2 = 6$; their parameters lived in memory but no compute touched them. Compared to a dense FFN of the same width, the block did 4× the work of one FFN but stored 10× the parameters. That 4-vs-10 ratio is the entire DeepSeekMoE bargain in miniature.
Now swap the token. If we replace x with a math-domain token, the shared contribution barely changes (it is the baseline), but the router logits redirect — experts 2 (math specialist) and 5 (math-adjacent) survive instead of 1 and 4. The same 4 FFNs run, but two of them are different. That swap is the specialization that vanilla dense models cannot do at the same parameter budget.
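To check the walkthrough numerically, the sketch below replays steps (A) to (E) in plain Python. Only the two winning logits (3.2 and 2.6) come from the text; the remaining logits and all expert output vectors are hypothetical stand-ins chosen for readability.

```python
import math

# Hypothetical per-expert outputs for one token (d = 4). Stand-in values.
shared_out = [[0.2, -0.1, 0.4, 0.0],
              [0.1, 0.3, -0.2, 0.1]]
routed_out = {1: [1.0, 0.0, -1.0, 2.0],     # expert 1
              4: [0.0, 2.0, 1.0, -1.0]}     # expert 4

# Router logits for the 8 routed experts; only 3.2 and 2.6 are from the text.
logits = [0.5, 3.2, 1.1, -0.3, 2.6, 0.9, -1.2, 0.0]
k = 2

# (A) Shared path: plain sum, no gate.
y_shared = [sum(col) for col in zip(*shared_out)]

# (C) Top-k selection, then softmax over the survivors only.
winners = sorted(range(len(logits)), key=lambda j: -logits[j])[:k]
m = max(logits[j] for j in winners)
exps = [math.exp(logits[j] - m) for j in winners]
gates = {j: e / sum(exps) for j, e in zip(winners, exps)}

# (D) Gated sum over the selected routed experts.
y_routed = [sum(gates[j] * routed_out[j][c] for j in winners) for c in range(4)]

# (E) Combine by addition.
y = [s + r for s, r in zip(y_shared, y_routed)]

active = len(shared_out) + k            # FFNs that ran
stored = len(shared_out) + len(logits)  # FFNs held in memory
print(winners, {j: round(g, 3) for j, g in gates.items()}, active, stored)
# prints: [1, 4] {1: 0.646, 4: 0.354} 4 10
```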
Visualizing the Forward Pass
[Interactive figure: a seven-stage tracer of the full forward pass. The shared (violet) experts always fire, while the routed (green) pool lights up only the two experts the router selects for the chosen token's domain.]
Three things to watch as you switch tokens. First, the violet column is invariant — the same two shared experts fire for every token, with nearly the same output magnitude. That is the baseline doing its job. Second, the green grid changes its highlighted cells: the router picks different routed experts for code vs. math vs. history. The dimmed cells stayed in memory but consumed zero FLOPs. Third, the final output is just the sum of the two boxed accumulators — never a softmax, never a weighted average across the two pools.
Plain Python: A Complete DeepSeekMoE Block
Here is the entire block in NumPy. No autograd, no batching, no GPU — just one class with one forward method that you can step through with pdb. The shape of every tensor matches the equation we just wrote.
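A minimal sketch of such a per-token reference is shown below. It is an illustration written for this walkthrough rather than the original listing, so the class name `DeepSeekMoENumpy`, the helper `Expert`, the tanh-approximate GELU, and the random initialization are all assumptions.

```python
import numpy as np

def gelu(z):
    """tanh approximation of GELU."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

class Expert:
    """One small FFN: d_model -> d_ff -> d_model."""
    def __init__(self, d_model, d_ff, rng):
        self.W1 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
        self.W2 = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)

    def __call__(self, x):
        return gelu(x @ self.W1) @ self.W2

class DeepSeekMoENumpy:
    def __init__(self, d_model, d_ff, n_shared, n_routed, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        self.shared = [Expert(d_model, d_ff, rng) for _ in range(n_shared)]
        self.routed = [Expert(d_model, d_ff, rng) for _ in range(n_routed)]
        # Router scores the routed pool only: output dim is n_routed, no bias.
        self.W_router = rng.standard_normal((d_model, n_routed)) / np.sqrt(d_model)

    def forward(self, x):
        """x: (d_model,) hidden state of one token. Returns (d_model,)."""
        y = np.zeros_like(x)
        # Shared path: every shared expert runs, no gate.
        for expert in self.shared:
            y += expert(x)
        if not self.routed:
            return y
        # Router scores, top-k selection, softmax over the survivors.
        logits = x @ self.W_router
        chosen = np.argsort(-logits)[: self.top_k]
        z = logits[chosen] - logits[chosen].max()
        gates = np.exp(z) / np.exp(z).sum()
        # Routed path: only the chosen experts run, each scaled by its gate.
        for gate, idx in zip(gates, chosen):
            y += gate * self.routed[idx](x)
        return y

block = DeepSeekMoENumpy(d_model=4, d_ff=8, n_shared=2, n_routed=8, top_k=2)
x = np.array([0.5, -1.0, 0.25, 2.0])
y = block.forward(x)
print(y.shape)
```

Because the block is a plain sum of two paths, degenerate configurations are easy to probe: constructing it with `n_routed=0` leaves only the shared sum.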
Two structural observations before we move on. First, the router's output dimension is $N_r$, not $N_s + N_r$. If you find yourself wanting to score the shared experts too, you are off the architecture. Second, the only difference between the shared loop and the routed loop is that the routed loop iterates over chosen indices with gates, while the shared loop iterates over every shared expert with no gate. That single line of asymmetry is the whole point of the shared-vs-routed split.
Sanity check via degenerate cases. Set $N_s = 0$: the block collapses to vanilla MoE (section 5.1). Set $N_r = 0$ and $N_s > 1$: the block becomes a stacked dense FFN, since the summed shared experts are equivalent to one wider FFN. Set $N_r = 0$ and $N_s = 1$: the block is a dense FFN. DeepSeekMoE is the principled interpolation across all three corners.
PyTorch: Production-Style Implementation
Now the production-style PyTorch version. Two things change from the NumPy reference: weights are stored as packed tensors so we can issue grouped GEMMs, and the routed loop uses the gather / run / scatter trick that keeps GPU matmuls large. Everything else is one-to-one with the math.
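The two changes described above, packed weights and the gather / run / scatter loop, can be sketched without a GPU stack. The block below is a forward-only NumPy illustration, not the PyTorch module itself: the attribute names and the initialization are assumptions made for this sketch. In PyTorch the packed arrays would typically be `nn.Parameter` tensors, and the scatter step would map onto `Tensor.index_add_`.

```python
import numpy as np

def gelu(z):
    """tanh approximation of GELU."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

class DeepSeekMoEBlock:
    """Forward-only sketch; packed weights carry a leading expert axis."""
    def __init__(self, d_model, d_ff, n_shared, n_routed, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.n_shared, self.n_routed, self.top_k = n_shared, n_routed, top_k
        def packed(n, a, b):
            return rng.standard_normal((n, a, b)) / np.sqrt(a)
        self.shared_w1 = packed(n_shared, d_model, d_ff)
        self.shared_w2 = packed(n_shared, d_ff, d_model)
        self.routed_w1 = packed(n_routed, d_model, d_ff)
        self.routed_w2 = packed(n_routed, d_ff, d_model)
        self.router = rng.standard_normal((d_model, n_routed)) / np.sqrt(d_model)

    def forward(self, x):
        B, T, d = x.shape
        flat = x.reshape(B * T, d)                                # flatten
        # Shared path: one batched matmul over the expert axis, then sum.
        h = gelu(np.einsum("nd,sdf->snf", flat, self.shared_w1))
        out = np.einsum("snf,sfd->nd", h, self.shared_w2)
        # Router: scores, top-k, softmax over the survivors.
        logits = flat @ self.router                               # (N, n_routed)
        topk_idx = np.argsort(-logits, axis=1)[:, : self.top_k]   # (N, k)
        topk_logits = np.take_along_axis(logits, topk_idx, axis=1)
        z = topk_logits - topk_logits.max(axis=1, keepdims=True)
        gates = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # (N, k)
        # Routed path: gather / run / scatter, one expert at a time.
        for i in range(self.n_routed):
            token_ids, slot = np.nonzero(topk_idx == i)
            if token_ids.size == 0:
                continue                                          # idle expert
            y_i = gelu(flat[token_ids] @ self.routed_w1[i]) @ self.routed_w2[i]
            np.add.at(out, token_ids, gates[token_ids, slot][:, None] * y_i)
        return out.reshape(B, T, d)                               # unflatten

block = DeepSeekMoEBlock(d_model=4, d_ff=8, n_shared=2, n_routed=8, top_k=2)
x = np.random.default_rng(1).standard_normal((2, 3, 4))
y = block.forward(x)
print(y.shape)
```

Each expert sees one matmul over all of its tokens instead of one matmul per token, which is the property that keeps the GEMMs large on real hardware.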
Three implementation details that real DeepSeek code does differently and you should know about:
- Grouped GEMM instead of a Python loop. The `for i in range(self.n_routed)` loop is fine at 8 experts but a throughput killer at 256. Real implementations use a grouped-GEMM kernel, for example `torch._scaled_grouped_mm` (an underscore-prefixed, private PyTorch op whose availability depends on the version) or DeepGEMM, the kernel library DeepSeek released, to issue all the routed FFNs as one batched call. The Python becomes ~10 lines shorter and ~3× faster.
- Permute, then call, then unpermute. A common pattern in high-performance MoE code: sort the token batch so all tokens for expert 0 come first, then all for expert 1, then call a single grouped GEMM with a length-per-expert array, then unpermute back. This avoids any per-expert indexing and lets the kernel use contiguous memory.
- All-to-all wraps the routed loop. On a multi-GPU cluster, the permute step also dispatches tokens to the GPU that owns each routed expert. The forward pass becomes: shared FFN locally → permute + all-to-all → routed grouped GEMM on remote GPU → all-to-all back + unpermute → add to shared output. The shared path runs entirely on-device with no comms — section 5.5 showed why.
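The permute / call / unpermute pattern from the second bullet can be sketched in plain Python. To keep the bookkeeping visible, hidden states and experts are scalars here, and `grouped_apply` is a hypothetical stand-in for a grouped GEMM that receives a length-per-expert array; none of these helper names come from a real library.

```python
def permute_by_expert(topk_idx, n_experts):
    """Sort (token, slot) pairs so that each expert's tokens are contiguous.

    Returns the sorted (expert, token, slot) triples and the number of rows
    each expert receives (the length-per-expert array).
    """
    pairs = [(e, t, s) for t, row in enumerate(topk_idx) for s, e in enumerate(row)]
    pairs.sort(key=lambda p: p[0])      # stable: token order is kept per expert
    counts = [0] * n_experts
    for e, _, _ in pairs:
        counts[e] += 1
    return pairs, counts

def grouped_apply(tokens, pairs, counts, experts):
    """Stand-in for a grouped GEMM: run each expert on its contiguous segment."""
    permuted = [tokens[t] for _, t, _ in pairs]
    out, start = [], 0
    for e, n in enumerate(counts):
        out.extend(experts[e](v) for v in permuted[start:start + n])
        start += n
    return out

def unpermute(expert_out, pairs, gates, n_tokens):
    """Scatter gated expert outputs back to their original token positions."""
    y = [0.0] * n_tokens
    for value, (_, t, s) in zip(expert_out, pairs):
        y[t] += gates[t][s] * value
    return y

# Toy data: scalar hidden states; expert e multiplies its input by (e + 1).
tokens = [1.0, 2.0, 3.0]
topk_idx = [[2, 0], [0, 1], [2, 1]]
gates = [[0.7, 0.3], [0.6, 0.4], [0.5, 0.5]]
experts = [lambda v, m=m: v * (m + 1) for m in range(3)]

pairs, counts = permute_by_expert(topk_idx, n_experts=3)
y = unpermute(grouped_apply(tokens, pairs, counts, experts), pairs, gates, len(tokens))
print(counts, y)
```

In a multi-GPU setting the same sorted order doubles as the send order for the all-to-all, since contiguous segments map to the devices that own each expert.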
What Changes at Massive Scale
Take the same `DeepSeekMoEBlock` above, change five integers (the first five rows below), and you have the MoE layer of DeepSeek-V3:
| Configuration | Toy (this section) | DeepSeek-V2 | DeepSeek-V3 |
|---|---|---|---|
| d_model | 4 | 5120 | 7168 |
| d_ff (per expert) | 8 | 1536 | 2048 |
| n_shared | 2 | 2 | 1 |
| n_routed | 8 | 160 | 256 |
| top-k routed | 2 | 6 | 8 |
| Active params / block | ~1 KB | ~75 M | ~140 M |
| Total params / block | ~5 KB | ~2 B | ~4 B |
| Active / total ratio | 40% | 4.9% | 3.5% |
Four operational concerns emerge once the integers grow.
Memory: optimizer states dominate
At a 4B-parameter block, the weights themselves are 8 GB in bf16. AdamW adds two moment buffers (first and second moments, both in fp32 for stability), which is another 32 GB, so you are at 40 GB per block before activations. FSDP / ZeRO shards the optimizer states across data-parallel replicas; expert parallelism additionally shards the routed weights themselves across GPUs. The shared block stays replicated everywhere: it is small enough that the cross-GPU traffic to shard it would cost more than the memory it saves.
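The memory arithmetic is simple enough to script. The sketch below assumes bf16 weights (2 bytes) and fp32 AdamW moments (4 bytes each), as stated in the text, and an idealized even split across shards; the helper names are hypothetical, and fp32 master weights, gradients, and activations are deliberately left out.

```python
GB = 1e9  # decimal gigabytes, matching the round numbers in the text

def training_state_bytes(n_params, weight_bytes=2, moment_bytes=4):
    """Bytes for the weights and for AdamW's first and second moments."""
    weights = n_params * weight_bytes
    moments = 2 * n_params * moment_bytes
    return weights, moments

def per_gpu_bytes(n_params, n_shards, **kw):
    """Even split of weights and moments across n_shards devices (idealized)."""
    weights, moments = training_state_bytes(n_params, **kw)
    return (weights + moments) / n_shards

weights, moments = training_state_bytes(4e9)          # 4B-parameter block
print(f"weights {weights / GB:.0f} GB, moments {moments / GB:.0f} GB, "
      f"total {(weights + moments) / GB:.0f} GB")
print(f"per GPU with 32-way sharding: {per_gpu_bytes(4e9, 32) / GB:.2f} GB")
```

Passing `moment_bytes=2` shows how much of the footprint depends on the precision chosen for the optimizer moments.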
Communication: all-to-all is the new attention
Every routed-expert pool larger than one node lives behind an all-to-all collective. Each token has to travel to whichever GPU owns its chosen experts, run, and return. With 256 experts sharded across 32 GPUs, each forward pass moves on the order of $B \cdot T \cdot k \cdot d$ activation elements per direction (two bytes each in bf16). Bandwidth is the bottleneck, not FLOPs. DeepSeek uses InfiniBand-tuned NCCL kernels and the DeepEP library to overlap the all-to-all with the shared-block compute, hiding most of the latency behind work the shared experts were going to do anyway.
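The traffic volume follows from counting token copies. The helper below is a rough sketch under stated assumptions: `all_to_all_bytes` and its `local_fraction` parameter are hypothetical, the 4096-token micro-batch is made up, and the estimate ignores any deduplication or topology-aware routing, so treat the result as an upper bound.

```python
def all_to_all_bytes(n_tokens, top_k, d_model, bytes_per_elem=2, local_fraction=0.0):
    """Upper-bound estimate of activation bytes dispatched in one direction.

    Each token is copied once per chosen expert; copies whose expert lives on
    the sending GPU (local_fraction of them) never leave the device.
    """
    copies = n_tokens * top_k
    return copies * (1.0 - local_fraction) * d_model * bytes_per_elem

# Hypothetical micro-batch of 4096 tokens per GPU, bf16 activations,
# with the DeepSeek-V3 width and top-k from the configuration table.
dispatch = all_to_all_bytes(n_tokens=4096, top_k=8, d_model=7168)
# With experts spread uniformly over 32 GPUs, about 1/32 of copies stay local.
dispatch_32 = all_to_all_bytes(4096, 8, 7168, local_fraction=1 / 32)
print(f"{dispatch / 1e6:.0f} MB upper bound, {dispatch_32 / 1e6:.0f} MB with 1/32 local")
```

The return trip after the expert FFNs carries the same number of elements, so the round-trip volume is twice this figure.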
Numerical precision: fp8 wherever you can
DeepSeek-V3 runs the routed expert GEMMs in fp8 — the router stays in bf16, the softmax stays in fp32, the FFN matmuls drop to fp8. Per-tile scaling and an fp32 accumulate keep the fp8 numerics stable. The mixed-precision sandwich saves ~40% of memory traffic on the dominant cost (routed FFNs) without measurable loss damage. Shared experts can stay in bf16 — they handle every token, so any fp8 noise compounds faster there.
Load balance: the elephant in the next chapter
The forward pass above assumes the router picks roughly equal numbers of tokens per expert. It does not. Without intervention, training collapses into routing one or two experts hard while the rest starve. The classical fix — an auxiliary load-balance loss — actively hurts model quality at scale. DeepSeek invented the auxiliary-loss-free bias-term trick that fixes the imbalance without contaminating the gradient signal, and that is the entire subject of chapter 6. The block we just built is the right shape for that fix to slot into; you will see exactly where the bias term lives when we get there.
Engineering Reality and Gotchas
Three failure modes you will absolutely hit if you reimplement DeepSeekMoE without reading the next chapter first:
- Routing collapse on warmup. Cold-started routers produce near-uniform logits, so top-k picks effectively random experts for the first few hundred steps. The lucky experts get more gradient, become slightly better targets, get even more tokens — a feedback loop that ends with two experts owning the entire batch and the other 254 idle. Chapter 6 covers the bias-term fix; in the meantime, kaiming-uniform on the router (not zero init) and a short LR warmup buy you stability.
- Activation checkpointing recomputes routing. If you activation-checkpoint the MoE block (which you almost certainly will, to fit training), the backward pass re-runs the router and top-k. That is fine mathematically — the same logits produce the same top-k — but you must seed any stochastic routing identically on the recompute, and you must avoid any non-determinism (e.g., atomic gather/scatter) that would change which experts claim which tokens.
- Mixed precision around softmax. Run softmax in fp32 even when everything around it is bf16. The exp() is highly sensitive to small logit shifts — fp16 softmax loses gates entirely when one logit dominates. The cost is negligible; the bug it prevents is silent and devastating.
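The last gotcha can be reproduced on a laptop. This is a small NumPy sketch with hypothetical logits in which one expert dominates by 20; NumPy has no bf16 type, so it only contrasts fp16 with fp32.

```python
import numpy as np

def softmax(logits, dtype):
    """Softmax computed entirely in the given floating-point dtype."""
    z = np.asarray(logits, dtype=dtype)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = [12.0, -8.0]            # hypothetical: one logit dominates by 20
g16 = softmax(logits, np.float16)
g32 = softmax(logits, np.float32)
print("fp16 gates:", g16)        # the small gate underflows to exactly 0
print("fp32 gates:", g32)        # the small gate is tiny but nonzero
```

A gate of exactly zero also zeroes the gradient flowing through that expert's output for the token, which is why the failure is silent.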
The one sentence to carry forward: DeepSeekMoE is a sum of two pools, gated only on one side. Every other piece of complexity — fine-grained decomposition, expert parallelism, fp8 GEMMs, bias-term balancing — is an optimization layered on top of that single additive equation. Once that equation is in your fingers, the rest of the chapter (and the rest of the book) is engineering, not architecture.