Chapter 5
15 min read
Section 26 of 117

Why Mixture-of-Experts?

Mixture-of-Experts: DeepSeekMoE

Every dense transformer pays the same bill on every token. Whether the model is completing a Python function, translating Bengali, or finishing a Shakespeare line, the same parameters fire. As models grew from millions to billions to nearly a trillion parameters, that bill became unpayable. Mixture-of-Experts is the architectural answer to a brutally simple question: if most of the knowledge inside a giant model is irrelevant to any one token, why are we computing with all of it?

The bet of MoE. Grow the model's knowledge by 8× while keeping the compute per token roughly the same. You pay for capacity in memory, not in FLOPs.

The Dense Wall

Open up any decoder-only transformer block and roughly two-thirds of its parameters live inside one place: the feed-forward network (FFN). A standard FFN looks like y=W2GELU(W1x)y = W_2 \, \mathrm{GELU}(W_1 x), where W1Rdff×dW_1 \in \mathbb{R}^{d_{ff} \times d} and W2Rd×dffW_2 \in \mathbb{R}^{d \times d_{ff}}, with dff4dd_{ff} \approx 4d. The FFN holds most of the model's memorized knowledge, and it is also where most of the FLOPs go.

The trouble is what scaling actually buys you. To make a dense model 2× smarter, you roughly double the parameters, which doubles the FFN width, which doubles the FLOPs per token. Every doubling of capacity is a doubling of inference cost. By the time you reach a 70-billion-parameter dense model, every single token in every single forward pass touches every weight. Nothing is conditional. The model has no way to say this is a coding token, skip the Shakespeare weights.

Block% of params% of FLOPs / token
Attention (Q, K, V, O)~30%~30%
FFN (W₁, W₂)~65%~65%
Embeddings / norms~5%~5%

The FFN is the elephant. It is also the layer where MoE intervenes — leaving attention untouched and replacing the single fat FFN with a population of smaller ones, only a few of which run per token.

Why this matters for massive training. Training a dense 1T model is bounded by FLOPs and by GPU memory bandwidth. If you can decouple parameters from active parameters, you can keep raising model capacity (which scaling laws say lifts loss) without raising the compute budget proportionally. That decoupling is the entire promise of MoE.

The Specialist Intuition

Imagine a hospital where every patient is seen by every doctor — the cardiologist, the dermatologist, the orthopedic surgeon, the pediatrician — and each writes a small note that is then averaged. That is a dense FFN. It works, but it is absurdly wasteful: the dermatologist does not need to weigh in on a broken ankle.

A real hospital uses a triage desk. A nurse glances at the patient, picks the two most relevant specialists, and sends them in. The dermatologist still exists, still has years of training stored in her head, but she stays in her office unless you arrive with a rash. The hospital's total knowledge is huge; the active knowledge per patient is tiny.

Mixture-of-Experts is exactly that hospital. A small router examines each token and, based on what the token looks like in embedding space, picks the kk most relevant experts out of EE. Only those kk experts run. The others sit idle, weights loaded but never multiplied. Common choices in real systems are E{8,16,64,256}E \in \{8, 16, 64, 256\} with k{1,2}k \in \{1, 2\}.

The right mental picture: dense FFN = wide, every weight touched. MoE FFN = library of FFNs, router checks out 2 per token. Memory grows linearly with E; compute grows linearly with k.

Conditional Compute: The Math

A dense FFN computes a deterministic function of xx: FFN(x)=W2σ(W1x)\mathrm{FFN}(x) = W_2 \, \sigma(W_1 x).

An MoE FFN replaces that one function with a population of EE experts {E1,E2,,EE}\{E_1, E_2, \dots, E_E\}, each one a small FFN with its own private W1(i),W2(i)W_1^{(i)}, W_2^{(i)}. A gate g(x)REg(x) \in \mathbb{R}^E produces a vector of weights — most of them zero — and the output is a weighted sum: MoE(x)=i=1Egi(x)Ei(x)\mathrm{MoE}(x) = \sum_{i=1}^{E} g_i(x) \cdot E_i(x).

Here gi(x)g_i(x) is the gate value for expert ii on input xx, and Ei(x)E_i(x) is what expert ii would return on this input. If gi(x)=0g_i(x) = 0, we never have to compute Ei(x)E_i(x) — the term vanishes. The trick is to design gg so that most entries are exactly zero.

The router

The standard recipe is a single linear layer followed by top-kk selection and a softmax restricted to the survivors. Let WrRE×dW_r \in \mathbb{R}^{E \times d} be the router weights. We compute logits, take the top kk, and softmax only those:

s=WrxREs = W_r x \in \mathbb{R}^{E}, T=TopK(s,k)\mathcal{T} = \mathrm{TopK}(s, k), gi(x)=esijTesjg_i(x) = \frac{e^{s_i}}{\sum_{j \in \mathcal{T}} e^{s_j}} if iTi \in \mathcal{T}, otherwise gi(x)=0g_i(x) = 0.

Read the symbols carefully. xx is one token's hidden state. WrxW_r x is a cheap E-dimensional score vector — one score per expert. TopK\mathrm{TopK} picks the kk highest-scoring experts and discards the rest. The softmax over only those kk values ensures the gates form a proper convex combination ( igi=1\sum_i g_i = 1 ) without spending probability mass on experts we will not run.

The compute equation

Suppose a dense FFN has roughly 2ddff2 d \cdot d_{ff} parameters and the same order of FLOPs per token. An MoE layer with EE experts of the same size stores E2ddffE \cdot 2 d \cdot d_{ff} parameters but only runs kk of them per token, so FLOPs per token are k2ddffk \cdot 2 d \cdot d_{ff}. The ratio is FLOPsMoE/FLOPsdense=k/E\text{FLOPs}_{\text{MoE}} / \text{FLOPs}_{\text{dense}} = k / E. With E=8,k=2E = 8, k = 2 you keep 25% of the cost; with E=64,k=2E = 64, k = 2 you keep just 3%.

Capacity vs cost, decoupled. The router lets the model store E×E \times more parameters at k/Ek / E of the compute. This is the central inequality that makes a trillion-parameter model trainable on a finite GPU budget.

Manual Numerical Walkthrough

Let us route one toy token through a 4-expert MoE with k=2k = 2. We will compute every number by hand so the mechanism is fully visible.

Click to expand: one token, four experts, by hand

Setup. Token x=[1,0]x = [1, 0]. Four experts, each a 1-layer FFN with d_model=2, d_ff=2, weights chosen so the arithmetic is trivial:

  • E1(x)=[2x1,0]E_1(x) = [2 x_1, 0] (reacts to the first coordinate)
  • E2(x)=[0,2x2]E_2(x) = [0, 2 x_2] (reacts to the second coordinate)
  • E3(x)=[x1+x2,x1+x2]E_3(x) = [x_1 + x_2, x_1 + x_2] (a generalist)
  • E4(x)=[x1,x2]E_4(x) = [-x_1, -x_2] (a negator)

Router. Router logits are s=Wrxs = W_r x with Wr=[2.00.10.21.50.50.51.01.0]W_r = \begin{bmatrix} 2.0 & 0.1 \\ 0.2 & 1.5 \\ 0.5 & 0.5 \\ -1.0 & -1.0 \end{bmatrix}. Multiplying with x=[1,0]x = [1, 0] picks out the first column: s=[2.0,0.2,0.5,1.0]s = [2.0,\, 0.2,\, 0.5,\, -1.0].

Top-k. With k=2k = 2 we keep experts 1 and 3 (scores 2.0 and 0.5). Experts 2 and 4 are masked out — we will not run them.

Softmax over survivors. Stable softmax: subtract the max, exponentiate, normalize. z=[2.02.0,0.52.0]=[0,1.5]z = [2.0 - 2.0,\, 0.5 - 2.0] = [0,\, -1.5]. ez=[1.0,0.223]e^z = [1.0,\, 0.223]. Sum = 1.223. So g1=0.818g_1 = 0.818, g3=0.182g_3 = 0.182, and g2=g4=0g_2 = g_4 = 0.

Expert outputs (only the ones we run). E1([1,0])=[2,0]E_1([1, 0]) = [2, 0]. E3([1,0])=[1,1]E_3([1, 0]) = [1, 1]. Note: we never compute E2E_2 or E4E_4. Their weights sat in memory but burned zero FLOPs.

Combine. y=0.818[2,0]+0.182[1,1]=[1.818,0.182]y = 0.818 \cdot [2, 0] + 0.182 \cdot [1, 1] = [1.818,\, 0.182]. This is the MoE block's output for this token.

The FLOP audit. A dense FFN at this size would have done 4 expert evaluations. We did 2. That is a 2× compute saving for this token — and the saving is k/E=0.5k/E = 0.5 regardless of which experts the router picked, because the math is identical for every token in the batch.

Visualizing Sparse Routing

The interactive diagram below is the picture you should hold in your head every time you see the letters "MoE". Toggle between dense and MoE; slide kk up and down; switch which token is being routed. Watch which experts light up, which sit idle, and what happens to the compute bar.

Loading routing visualizer…

Three observations are worth pausing on. First, when you switch tokens, the lit experts move. The router has learned (in our toy example by construction; in real models, through training) that coding tokens prefer some experts and poetry tokens prefer others. Second, the green compute bar tracks k/Ek / E exactly — sliding kk from 8 to 1 drops cost from 100% to 12.5%. Third, the parameter count (purple) is independent of kk. You always pay full memory; you only sometimes pay full compute.

Plain Python: An MoE Layer From Scratch

Before the PyTorch version, let us write the whole thing in NumPy so there is nowhere for the mechanism to hide. The code below is small enough to fit in your head and complete enough to actually run.

🐍moe_numpy.py
3Hyperparameters

Four experts, model dim 4, expert hidden dim 8, top-k routing with k=2. We will store 4× the FFN weights but only ever run 2 of them per token.

EXECUTION STATE
num_experts = 4
d_model = 4
d_ff = 8
k = 2
6Stack all experts into one tensor

W1 has shape (E, d_ff, d_model). It is just E small FFN first-layer weights stacked along a new leading axis. Indexing W1[i] picks out one expert.

EXECUTION STATE
W1.shape = (4, 8, 4)
7Second expert layer

Same pattern for the second linear in each expert — shape (E, d_model, d_ff).

8The router

One tiny linear layer mapping the input vector to E logits — one score per expert. It is the only part of the network that decides who runs.

EXECUTION STATE
W_router.shape =
(4, 4)
11Score every expert

We multiply the router weights by the input. The result is one logit per expert. This is cheap: O(E · d_model) ≈ a single Linear, not E FFNs.

EXECUTION STATE
logits.shape = (4,)
x.shape = (4,)
12Pick the top-k indices

argsort gives ascending order, so negating logits and slicing the first k gives the k highest scores. These are the only experts we will execute.

EXAMPLE
If logits = [0.7, -0.2, 1.3, 0.4], topk_idx = [2, 0] (the 2 largest).
15Numerical-stable softmax over the survivors

We subtract the max before exp to avoid overflow. Then exp + normalize gives a probability over the k chosen experts. Crucially, the softmax runs over k values, not E.

EXECUTION STATE
gates.shape = (2,)
sum(gates) = 1.0
18Accumulator for the weighted sum

We will add each chosen expert's contribution into this vector.

19Loop over the chosen experts only

We iterate over k items, never E. This loop is the heart of the compute savings: dense FFN would run every expert, MoE runs only k.

20Expert forward — first half

W1[i] @ x is a (d_ff, d_model) · (d_model,) matvec → (d_ff,). ReLU keeps the positive coordinates. Same math as a regular FFN, but only for one expert.

EXECUTION STATE
h.shape = (8,)
21Expert forward — second half + gating

W2[i] @ h projects back to d_model. We then scale by the gate g (the router's confidence in this expert) and add to the accumulator. The output is a convex combination of expert outputs.

24One sample token

A toy 4-dim input. Real models have d_model ∈ {2048, 4096, 7168, …} and tens to hundreds of experts.

13 lines without explanation
1import numpy as np
2
3num_experts, d_model, d_ff, k = 4, 4, 8, 2
4
5rng = np.random.default_rng(0)
6W1 = rng.standard_normal((num_experts, d_ff, d_model)) * 0.1
7W2 = rng.standard_normal((num_experts, d_model, d_ff)) * 0.1
8W_router = rng.standard_normal((num_experts, d_model)) * 0.1
9
10def moe_forward(x):
11    logits = W_router @ x                     # (E,)
12    topk_idx = np.argsort(-logits)[:k]        # indices of best k experts
13    topk_logits = logits[topk_idx]
14
15    z = topk_logits - topk_logits.max()
16    gates = np.exp(z) / np.exp(z).sum()       # (k,) sums to 1
17
18    y = np.zeros(d_model)
19    for g, i in zip(gates, topk_idx):
20        h = np.maximum(0, W1[i] @ x)          # expert i: ReLU(W1 x)
21        y += g * (W2[i] @ h)                  # weighted by gate
22    return y
23
24x = np.array([0.5, -0.2, 0.9, 0.1])
25print(moe_forward(x))

Every interesting thing about MoE is on the screen above. The router WrxW_r x is a single matvec. The top-k selection is one argsort. The softmax runs over kk numbers. The expensive part — actually evaluating an expert FFN — happens only inside the loop over kk survivors, not inside a loop over all EE.

Sanity check. Set k=Ek = E in the snippet above. The MoE layer becomes equivalent to a softmax-weighted average of all experts — exactly what you would get from a fully-dense ensemble. The whole compute saving comes from making kEk \ll E.

PyTorch: From Toy to Production

The PyTorch version moves from one-token-at-a-time to a real batch. The pattern we want is: route every token, then for each expert gather the tokens that picked it, run that expert once on a packed mini-batch, and scatter the result back. This keeps GPU matmuls large and idle FLOPs near zero.

🐍moe_pytorch.py
5Subclass nn.Module

Standard PyTorch pattern. The router and the expert list become tracked parameters; autograd will route gradients into both.

10Experts as a ModuleList

Each expert is an ordinary 2-layer FFN with GELU. ModuleList registers them so .parameters() yields all of them. In production, large MoE models replace this with a single fused tensor for memory locality.

18Router: a single Linear, no bias

Maps d_model → num_experts. This is the tiny network that decides routing. It is the only layer trained jointly with the gating decision — the gradient passing through the gate teaches the router which tokens belong with which expert.

21Read tensor shape

x is (batch, seq_len, d_model). Routing is per-token — the same token in two different sequences must still be free to pick its own experts.

EXECUTION STATE
x.shape = (B, T, D)
22Flatten batch and time

We collapse the batch and time axes so every token is just a row. N = B·T. This lets us treat the whole batch as one stream of tokens going into the router.

EXECUTION STATE
x_flat.shape = (N, D)
N = B · T
24Router logits per token

self.router is a Linear(D, E). Output is (N, E): every token gets a score for every expert. Total cost is O(N · D · E) — much cheaper than running any FFN.

EXECUTION STATE
logits.shape = (N, E)
25Top-k selection

torch.topk returns the k largest values and their indices along the last dim. topk_idx[n] is the list of experts token n routed to. This is the discrete decision — it is non-differentiable in the index, but differentiable through the value weights.

EXECUTION STATE
topk_w.shape = (N, k)
topk_idx.shape = (N, k)
26Softmax over the survivors only

We softmax across the k kept logits per token, not across all E. The gates for each token sum to 1, giving a proper convex combination of expert outputs.

28Initialize output

Same shape as the flat input. We will scatter expert contributions into the right rows.

29Loop over experts, not over tokens

Critical pattern: for each expert i we gather all tokens that picked it, run the expert once on the whole batch of selected tokens, and scatter the result back. This is how MoE achieves throughput on a GPU — large matmuls instead of one-token-at-a-time work.

30Which tokens chose expert i?

topk_idx == i broadcasts to (N, k). True at position (n, slot) means token n picked expert i in its slot-th choice.

31Skip empty experts

If no token in the batch picked this expert, do nothing. In real training, this is where load imbalance hurts: some experts get hammered, others sit idle.

33Index unpacking

nonzero returns the (row, slot) coordinates of the True entries. tok_rows tells us which tokens we need; slot tells us which of their k gates is for this expert.

EXECUTION STATE
tok_rows.shape = (M,)
M = tokens picking expert i
34Gate values for those tokens

topk_w[tok_rows, slot] picks the right gate for each (token, expert) pair. unsqueeze adds a trailing 1 so it broadcasts against the expert output of shape (M, D).

EXECUTION STATE
gate.shape = (M, 1)
35Run the expert and scatter back

self.experts[i](x_flat[tok_rows]) runs ONE expert FFN on the M tokens that chose it — a single dense (M, D) → (M, D) matmul. We scale by the gate and add into y at the correct token rows.

36Restore (B, T, D) shape

Reshape back so the MoE block is a drop-in replacement for a dense FFN inside the transformer block.

20 lines without explanation
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5class MoEFFN(nn.Module):
6    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int):
7        super().__init__()
8        self.num_experts = num_experts
9        self.k = k
10        self.experts = nn.ModuleList([
11            nn.Sequential(
12                nn.Linear(d_model, d_ff),
13                nn.GELU(),
14                nn.Linear(d_ff, d_model),
15            )
16            for _ in range(num_experts)
17        ])
18        self.router = nn.Linear(d_model, num_experts, bias=False)
19
20    def forward(self, x: torch.Tensor) -> torch.Tensor:
21        B, T, D = x.shape
22        x_flat = x.reshape(B * T, D)                          # (N, D)
23
24        logits = self.router(x_flat)                          # (N, E)
25        topk_w, topk_idx = logits.topk(self.k, dim=-1)        # (N, k), (N, k)
26        topk_w = F.softmax(topk_w, dim=-1)                    # (N, k)
27
28        y = torch.zeros_like(x_flat)
29        for i in range(self.num_experts):
30            mask = topk_idx == i                              # (N, k)
31            if not mask.any():
32                continue
33            tok_rows, slot = mask.nonzero(as_tuple=True)
34            gate = topk_w[tok_rows, slot].unsqueeze(-1)       # (M, 1)
35            y[tok_rows] += gate * self.experts[i](x_flat[tok_rows])
36        return y.reshape(B, T, D)

Two things are subtly different from the NumPy version, and both matter for real training:

  1. We loop over experts, not over tokens. If we looped per token we would issue thousands of tiny matmuls — terrible for GPU utilization. By gathering the tokens that picked each expert and running one large (M,D)(M,D)(M, D) \to (M, D) matmul, we keep the GPU saturated.
  2. topk is non-differentiable in its indices but differentiable in its values. The gradient flows back through the gate weights gi(x)g_i(x), which teaches the router to put more probability mass on experts that produced good outputs and less on those that did not. The discrete "which experts" decision is updated indirectly, as a side effect of the values flowing back.
Implementation note. In real frameworks (Megatron, DeepSpeed, DeepSeek's released code) the per-expert loop is replaced with a single permuted-batch matmul against a stacked expert tensor of shape (E,dff,d)(E, d_{ff}, d). Functionally identical, dramatically faster on a GPU.

What Changes at Massive Scale

The toy example has 4 experts and runs on a laptop. A frontier MoE language model has hundreds. Three numbers tell the story:

ModelTotal paramsActive params / tokenRatio
Mixtral 8×7B~47B~13B~3.6×
DeepSeek-V2236B21B~11×
DeepSeek-V3671B37B~18×
GPT-4 (rumored)~1.8T~280B~6×

DeepSeek-V3 is the cleanest illustration. The model carries 671 billion parameters of knowledge but spends only 37 billion parameters of compute per token — a roughly 18× decoupling. That same decoupling applies during training: gradients only flow into the experts that were activated for each token, so the gradient compute is also k/Ek / E of a same-size dense model.

The bottleneck shifts. With dense models, the limit was compute. With MoE, the limit becomes memory bandwidth and cross-GPU communication. Every parameter still has to live somewhere on the cluster; routing means tokens have to travel to whichever GPU owns their chosen experts. We will spend several sections of this chapter on the machinery that makes that travel cheap.

The scaling-law angle

Empirically, language-model loss scales as a power law in compute and parameters (Kaplan et al., 2020; Hoffmann et al., 2022). Dense models tie those two axes together. MoE lets you push the parameter axis while leaving compute roughly fixed — and the experimental evidence is that this still buys you a scaling-law gain in loss, just along a different curve. You do not get something for nothing (the memory and communication costs are real), but you do get to choose where to spend.

The Engineering Reality of MoE

The math above is clean. The systems are not. If you write the naive MoE in production, three things will go catastrophically wrong, and each of them is the subject of an entire later section in this chapter:

  1. Load imbalance. The router's natural tendency, untreated, is to send most tokens to a small subset of popular experts. The unpopular experts never train; the popular ones bottleneck the GPU. This is routing collapse, and it is the central failure mode of MoE training (covered in Chapter 6).
  2. Expert parallelism and all-to-all communication. If the model has 256 experts and your cluster has 256 GPUs, every token might need to travel to a different GPU. That cross-device shuffle is an all-to-all collective — by far the most expensive primitive in distributed training (covered in section 5 of this chapter).
  3. Inference latency. Conditional compute is great for training throughput but adds an extra hop at inference. KV-cache savings still apply, but you now pay a routing decision per token, plus the all-to-all if experts are sharded. Serving stacks like vLLM and SGLang have specialized MoE schedulers; we will look at them in the inference chapter.

Despite these costs, the trade is overwhelmingly worth it. Once the routing, balance, and parallelism problems are solved well, MoE models reliably beat dense models of the same compute budget — and that is precisely why DeepSeek-V3, Mixtral, Qwen-MoE, and most of the frontier open-weight models of 2024-2025 are all sparse mixtures, not dense stacks.

Where we go from here. The next section opens the router and looks closely at the gating mechanism — gate noise, expert capacity, dropped tokens, and the differentiable approximations that make backpropagation through a discrete top-k selection actually work. After that, we look at DeepSeek's fine-grained expert decomposition and shared experts, then expert parallelism at scale, and finally a full PyTorch implementation of DeepSeekMoE.

For now, the one sentence to carry forward is this: MoE is the architectural move that broke the link between parameter count and compute cost, and every model with hundreds of billions of parameters running on anything less than a planet-scale datacenter is built on top of it.

Loading comments...