Every dense transformer pays the same bill on every token. Whether the model is completing a Python function, translating Bengali, or finishing a Shakespeare line, the same parameters fire. As models grew from millions to billions to nearly a trillion parameters, that bill became unpayable. Mixture-of-Experts is the architectural answer to a brutally simple question: if most of the knowledge inside a giant model is irrelevant to any one token, why are we computing with all of it?
The bet of MoE. Grow the model's knowledge by 8× while keeping the compute per token roughly the same. You pay for capacity in memory, not in FLOPs.
The Dense Wall
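Open up any decoder-only transformer block and roughly two-thirds of its parameters live inside one place: the feed-forward network (FFN). A standard FFN looks like $\mathrm{FFN}(x) = W_2\,\sigma(W_1 x)$, where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$, with $d_{\text{ff}} \approx 4\,d_{\text{model}}$. The FFN holds most of the model's memorized knowledge, and it is also where most of the FLOPs go.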
The trouble is what scaling actually buys you. To make a dense model 2× smarter, you roughly double the parameters, which doubles the FFN width, which doubles the FLOPs per token. Every doubling of capacity is a doubling of inference cost. By the time you reach a 70-billion-parameter dense model, every single token in every single forward pass touches every weight. Nothing is conditional. The model has no way to say "this is a coding token, skip the Shakespeare weights."
| Block | % of params | % of FLOPs / token |
|---|---|---|
| Attention (Q, K, V, O) | ~30% | ~30% |
| FFN (W₁, W₂) | ~65% | ~65% |
| Embeddings / norms | ~5% | ~5% |
The FFN is the elephant. It is also the layer where MoE intervenes — leaving attention untouched and replacing the single fat FFN with a population of smaller ones, only a few of which run per token.
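The two-thirds figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a plain block with four $d \times d$ attention projections, a two-matrix FFN with $d_{\text{ff}} = 4d$, and no biases. The function name `block_param_share` and the width 4096 are illustrative, and embeddings and norms are left out, which is why the FFN share lands slightly above the table's ~65%.

```python
def block_param_share(d_model: int, ff_mult: int = 4) -> dict:
    """Parameter shares for one transformer block, ignoring biases and norms."""
    attn = 4 * d_model * d_model              # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ff_mult * d_model)   # W_1 and W_2
    total = attn + ffn
    return {"attention": attn / total, "ffn": ffn / total}


shares = block_param_share(4096)
print(f"attention: {shares['attention']:.1%}, ffn: {shares['ffn']:.1%}")
# attention: 33.3%, ffn: 66.7%
```

Because both counts scale with the square of the width, the split does not depend on the particular `d_model` chosen.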
The Specialist Intuition
Imagine a hospital where every patient is seen by every doctor — the cardiologist, the dermatologist, the orthopedic surgeon, the pediatrician — and each writes a small note that is then averaged. That is a dense FFN. It works, but it is absurdly wasteful: the dermatologist does not need to weigh in on a broken ankle.
A real hospital uses a triage desk. A nurse glances at the patient, picks the two most relevant specialists, and sends them in. The dermatologist still exists, still has years of training stored in her head, but she stays in her office unless you arrive with a rash. The hospital's total knowledge is huge; the active knowledge per patient is tiny.
Mixture-of-Experts is exactly that hospital. A small router examines each token and, based on what the token looks like in embedding space, picks the $k$ most relevant experts out of $E$. Only those $k$ experts run. The others sit idle, weights loaded but never multiplied. Real systems pair a small $k$ with a much larger $E$; Mixtral 8×7B, for example, uses $E = 8$ with $k = 2$.
Conditional Compute: The Math
A dense FFN computes a deterministic function of $x$: $y = \mathrm{FFN}(x) = W_2\,\sigma(W_1 x)$.
An MoE FFN replaces that one function with a population of $E$ experts $f_1, \dots, f_E$, each one a small FFN with its own private $W_1^{(i)}, W_2^{(i)}$. A gate $g(x) \in \mathbb{R}^E$ produces a vector of weights, most of them zero, and the output is a weighted sum: $y = \sum_{i=1}^{E} g_i(x)\, f_i(x)$.
Here $g_i(x)$ is the gate value for expert $i$ on input $x$, and $f_i(x)$ is what expert $i$ would return on this input. If $g_i(x) = 0$, we never have to compute $f_i(x)$: the term vanishes. The trick is to design $g$ so that most entries are exactly zero.
The router
The standard recipe is a single linear layer followed by top-$k$ selection and a softmax restricted to the survivors. Let $W_r \in \mathbb{R}^{E \times d_{\text{model}}}$ be the router weights. We compute $E$ logits, take the top $k$, and softmax only those:
$s = W_r x$, $\mathcal{T} = \mathrm{TopK}(s, k)$, $g_i(x) = \dfrac{\exp(s_i)}{\sum_{j \in \mathcal{T}} \exp(s_j)}$ if $i \in \mathcal{T}$, otherwise $g_i(x) = 0$.
Read the symbols carefully. $x$ is one token's hidden state. $s$ is a cheap $E$-dimensional score vector, one score per expert. $\mathrm{TopK}$ picks the $k$ highest-scoring experts and discards the rest. The softmax over only those $k$ values ensures the gates form a proper convex combination ($g_i(x) \ge 0$, $\sum_i g_i(x) = 1$) without spending probability mass on experts we will not run.
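As a rough illustration of the recipe above, here is the gate for a single token in NumPy. The function name `top_k_gate` and the example sizes are assumptions made for this sketch, not part of any library.

```python
import numpy as np


def top_k_gate(x: np.ndarray, W_r: np.ndarray, k: int) -> np.ndarray:
    """Return the E-dimensional gate vector g(x) with k nonzero entries."""
    s = W_r @ x                           # router logits, shape (E,)
    top = np.argsort(s)[-k:]              # indices of the k largest logits
    z = np.exp(s[top] - s[top].max())     # stable softmax over survivors only
    g = np.zeros_like(s, dtype=float)
    g[top] = z / z.sum()
    return g


rng = np.random.default_rng(0)
W_r = rng.normal(size=(8, 16))            # E = 8 experts, d_model = 16
x = rng.normal(size=16)
g = top_k_gate(x, W_r, k=2)
print(np.count_nonzero(g), g.sum())       # two nonzero gates that sum to 1 up to rounding
```

Subtracting the maximum logit before exponentiating does not change the result; it only keeps `np.exp` from overflowing when logits are large.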
The compute equation
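Suppose a dense FFN has roughly $P$ parameters and the same order of FLOPs per token. An MoE layer with $E$ experts of the same size stores $E \cdot P$ parameters but only runs $k$ of them per token, so FLOPs per token are $\approx k \cdot P$. Compared with a dense layer that computes with all $E \cdot P$ parameters, the ratio is $k / E$. With $k/E = 1/4$ (for example $E = 8$, $k = 2$) you keep 25% of the cost; with $k/E = 1/32$ (for example $E = 256$, $k = 8$) you keep just 3%.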
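The bookkeeping above fits in a few lines. This sketch counts only expert FFN weights (router and attention are ignored), and the per-expert size of 1.0 is an arbitrary unit chosen for illustration; `moe_budget` is a hypothetical helper name.

```python
def moe_budget(P: float, E: int, k: int) -> tuple:
    """Stored parameters, active parameters per token, and their ratio k/E."""
    stored = E * P    # every expert lives in memory
    active = k * P    # only k experts run for a given token
    return stored, active, active / stored


for E, k in [(8, 2), (256, 8)]:
    stored, active, ratio = moe_budget(1.0, E, k)
    print(f"E={E} k={k}: stored={stored:g}P active={active:g}P ratio={ratio:.1%}")
```

Note that `P` cancels out of the ratio: the saving depends only on how many experts run relative to how many exist.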
Manual Numerical Walkthrough
Let us route one toy token through a 4-expert MoE with $k = 2$. We will compute every number by hand so the mechanism is fully visible.
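Setup. Token $x = (1, 0)^\top$. Four experts $f_1, \dots, f_4$, each a 1-layer FFN with d_model=2, d_ff=2, weights chosen so the arithmetic is trivial:
- $f_1$ (reacts to the first coordinate)
- $f_2$ (reacts to the second coordinate)
- $f_3$ (a generalist)
- $f_4$ (a negator)
Router. Router logits are $s = W_r x$ with $W_r \in \mathbb{R}^{4 \times 2}$. Multiplying with $x = (1, 0)^\top$ picks out the first column of $W_r$, which gives expert 1 a score of $s_1 = 2.0$ and expert 3 a score of $s_3 = 0.5$; experts 2 and 4 score lower.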
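Top-k. With $k = 2$ we keep experts 1 and 3 (scores 2.0 and 0.5). Experts 2 and 4 are masked out: we will not run them.
Softmax over survivors. Stable softmax: subtract the max, exponentiate, normalize. $e^{2.0 - 2.0} = e^{0} = 1$. $e^{0.5 - 2.0} = e^{-1.5} \approx 0.223$. Sum = 1.223. So $g_1 \approx 1 / 1.223 \approx 0.818$, $g_3 \approx 0.223 / 1.223 \approx 0.182$, and $g_2 = g_4 = 0$.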
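Expert outputs (only the ones we run). We evaluate $f_1(x)$ and $f_3(x)$. Note: we never compute $f_2(x)$ or $f_4(x)$. Their weights sat in memory but burned zero FLOPs.
Combine. $y = g_1 f_1(x) + g_3 f_3(x) \approx 0.818\, f_1(x) + 0.182\, f_3(x)$. This is the MoE block's output for this token.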
The FLOP audit. Running all four experts, as a dense layer of the same total size would, costs 4 expert evaluations. We did 2. That is a 2× compute saving for this token, and the saving does not depend on which experts the router picked: every token in the batch runs exactly $k = 2$ experts.
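To make the walkthrough concrete in code, the sketch below reruns it in NumPy. The router scores for experts 1 and 3 (2.0 and 0.5) come from the text; everything else is an assumption chosen for illustration: the scores for experts 2 and 4, the second router column, and all four expert matrices, each treated here as a single linear map.

```python
import numpy as np

x = np.array([1.0, 0.0])

# Router: the first column holds the scores for this token. 2.0 and 0.5 are
# from the walkthrough; the remaining entries are illustrative assumptions.
W_r = np.array([[2.0, 0.0],
                [0.0, 1.0],
                [0.5, 0.5],
                [-1.0, 0.0]])

# Hypothetical expert weights that match the verbal descriptions in the setup.
experts = [
    np.array([[1.0, 0.0], [0.0, 0.0]]),    # expert 1: first coordinate
    np.array([[0.0, 0.0], [0.0, 1.0]]),    # expert 2: second coordinate
    np.array([[0.5, 0.5], [0.5, 0.5]]),    # expert 3: generalist
    np.array([[-1.0, 0.0], [0.0, -1.0]]),  # expert 4: negator
]

k = 2
s = W_r @ x                                  # [2.0, 0.0, 0.5, -1.0]
top = np.sort(np.argsort(s)[-k:])            # zero-based [0, 2], i.e. experts 1 and 3
z = np.exp(s[top] - s[top].max())
g = z / z.sum()                              # approximately [0.818, 0.182]

# Only the selected experts are evaluated.
y = sum(g_i * (experts[i] @ x) for g_i, i in zip(g, top))
print(top + 1, np.round(g, 3), np.round(y, 3))
# [1 3] [0.818 0.182] [0.909 0.091]
```

The gate values match the hand calculation; the final output vector depends on the assumed expert matrices and is shown only to complete the pipeline.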
Visualizing Sparse Routing
The interactive diagram below is the picture you should hold in your head every time you see the letters "MoE". Toggle between dense and MoE; slide $k$ up and down; switch which token is being routed. Watch which experts light up, which sit idle, and what happens to the compute bar.
Three observations are worth pausing on. First, when you switch tokens, the lit experts move. The router has learned (in our toy example by construction; in real models, through training) that coding tokens prefer some experts and poetry tokens prefer others. Second, the green compute bar tracks $k/E$ exactly: with $E = 8$, sliding $k$ from 8 to 1 drops cost from 100% to 12.5%. Third, the parameter count (purple) is independent of $k$. You always pay full memory; you only sometimes pay full compute.
Plain Python: An MoE Layer From Scratch
Before the PyTorch version, let us write the whole thing in NumPy so there is nowhere for the mechanism to hide. The code below is small enough to fit in your head and complete enough to actually run.
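A minimal sketch of such a layer, processing one token at a time. It assumes ReLU experts without biases; the names `init_moe` and `moe_forward`, the random initialization, and the sizes are illustrative choices rather than a reference implementation.

```python
import numpy as np


def init_moe(d_model, d_ff, E, seed=0):
    """Random router and E expert FFNs (illustrative initialization)."""
    rng = np.random.default_rng(seed)
    return {
        "W_r": rng.normal(0, 0.1, (E, d_model)),       # router weights
        "W1": rng.normal(0, 0.1, (E, d_ff, d_model)),  # per-expert up-projection
        "W2": rng.normal(0, 0.1, (E, d_model, d_ff)),  # per-expert down-projection
    }


def moe_forward(x, params, k):
    """MoE FFN output for a single token x of shape (d_model,)."""
    s = params["W_r"] @ x                  # router: one matvec, shape (E,)
    top = np.argsort(s)[-k:]               # top-k selection: one argsort
    z = np.exp(s[top] - s[top].max())      # softmax over k numbers only
    gates = z / z.sum()

    y = np.zeros_like(x)
    for g, i in zip(gates, top):           # loop over survivors, not all E
        h = np.maximum(params["W1"][i] @ x, 0.0)   # expert i hidden layer (ReLU)
        y += g * (params["W2"][i] @ h)             # gate-weighted contribution
    return y


params = init_moe(d_model=8, d_ff=32, E=4)
x = np.random.default_rng(1).normal(size=8)
print(moe_forward(x, params, k=2))
```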
Every interesting thing about MoE is on the screen above. The router is a single matvec. The top-k selection is one argsort. The softmax runs over $k$ numbers. The expensive part, actually evaluating an expert FFN, happens only inside the loop over the $k$ survivors, not inside a loop over all $E$.
Sanity check. Set $k = E$ in the snippet above. The MoE layer becomes equivalent to a softmax-weighted average of all experts, exactly what you would get from a fully-dense ensemble. The whole compute saving comes from making $k \ll E$.
PyTorch: From Toy to Production
The PyTorch version moves from one-token-at-a-time to a real batch. The pattern we want is: route every token, then for each expert gather the tokens that picked it, run that expert once on a packed mini-batch, and scatter the result back. This keeps GPU matmuls large and idle FLOPs near zero.
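A sketch of that gather, run, scatter pattern in PyTorch. The class name `MoELayer`, the GELU experts, and the sizes in the usage lines are assumptions for illustration; production implementations add capacity limits, load-balancing losses, and fused kernels that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shape = x.shape
        tokens = x.reshape(-1, shape[-1])                  # (T, d_model)
        logits = self.router(tokens)                       # (T, E)
        top_vals, top_idx = torch.topk(logits, self.k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)                # (T, k), survivors only

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):          # loop over experts, not tokens
            token_ids, slot = torch.where(top_idx == e)    # tokens that picked expert e
            if token_ids.numel() == 0:
                continue                                   # idle expert: no FLOPs spent
            y = expert(tokens[token_ids])                  # one packed batch per expert
            out.index_add_(0, token_ids, gates[token_ids, slot].unsqueeze(-1) * y)
        return out.reshape(shape)


layer = MoELayer(d_model=16, d_ff=64, num_experts=4, k=2)
x = torch.randn(2, 5, 16)        # (batch, seq, d_model)
print(layer(x).shape)            # torch.Size([2, 5, 16])
```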
Two things are subtly different from the NumPy version, and both matter for real training:
- We loop over experts, not over tokens. If we looped per token we would issue thousands of tiny matmuls — terrible for GPU utilization. By gathering the tokens that picked each expert and running one large matmul, we keep the GPU saturated.
- topk is non-differentiable in its indices but differentiable in its values. The gradient flows back through the gate weights $g_i(x)$, which teaches the router to put more probability mass on experts that produced good outputs and less on those that did not. The discrete "which experts" decision is updated indirectly, as a side effect of the values flowing back.
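The gradient claim is easy to check on a toy router. In this sketch the tensor sizes and the stand-in loss (the largest gate value) are arbitrary assumptions; the point is only that rows of the router weight matrix belonging to unselected experts receive exactly zero gradient, while the selected rows do not.

```python
import torch

torch.manual_seed(0)
E, d_model, k = 4, 3, 2
W_r = torch.randn(E, d_model, requires_grad=True)   # router weights
x = torch.randn(d_model)                            # one token

logits = W_r @ x
top_vals, top_idx = torch.topk(logits, k)           # indices carry no gradient
gates = torch.softmax(top_vals, dim=-1)             # values do

loss = gates[0]                                     # stand-in loss for illustration
loss.backward()

selected = set(top_idx.tolist())
for e in range(E):
    # Prints: expert index, whether it was selected, total |gradient| on its router row.
    print(e, e in selected, W_r.grad[e].abs().sum().item())
```

An expert that is never selected therefore gets no direct signal through its own router row from this path, which is one reason the routing decision can be slow to change once it has settled.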
What Changes at Massive Scale
The toy example has 4 experts and runs on a laptop. A frontier MoE language model has hundreds. Three numbers tell the story:
| Model | Total params | Active params / token | Ratio |
|---|---|---|---|
| Mixtral 8×7B | ~47B | ~13B | ~3.6× |
| DeepSeek-V2 | 236B | 21B | ~11× |
| DeepSeek-V3 | 671B | 37B | ~18× |
| GPT-4 (rumored) | ~1.8T | ~280B | ~6× |
DeepSeek-V3 is the cleanest illustration. The model carries 671 billion parameters of knowledge but spends only 37 billion parameters of compute per token, a roughly 18× decoupling. That same decoupling applies during training: gradients only flow into the experts that were activated for each token, so the gradient compute for the expert layers is also roughly $k/E$ of that of a same-size dense model.
The bottleneck shifts. With dense models, the limit was compute. With MoE, the limit becomes memory bandwidth and cross-GPU communication. Every parameter still has to live somewhere on the cluster; routing means tokens have to travel to whichever GPU owns their chosen experts. We will spend several sections of this chapter on the machinery that makes that travel cheap.
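A rough sense of the memory side of the trade, using the DeepSeek-V3 figures from the table. The 2 bytes per parameter (bf16 weights) and the 80 GB per accelerator are assumptions for illustration; optimizer state, activations, and KV cache are ignored, so real deployments need more.

```python
import math

total_params = 671e9      # from the table above
active_params = 37e9      # from the table above
bytes_per_param = 2       # assumption: bf16 weights
gpu_memory_gb = 80        # assumption: one 80 GB accelerator

weights_gb = total_params * bytes_per_param / 1e9
active_gb = active_params * bytes_per_param / 1e9
min_gpus = math.ceil(weights_gb / gpu_memory_gb)

print(f"weights to store: {weights_gb:.0f} GB")
print(f"weights touched per token: {active_gb:.0f} GB")
print(f"GPUs needed just to hold the weights: {min_gpus}")
```

Under these assumptions the weights alone span more than a dozen devices, which is why tokens have to be shipped to wherever their chosen experts live.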
The scaling-law angle
Empirically, language-model loss scales as a power law in compute and parameters (Kaplan et al., 2020; Hoffmann et al., 2022). Dense models tie those two axes together. MoE lets you push the parameter axis while leaving compute roughly fixed — and the experimental evidence is that this still buys you a scaling-law gain in loss, just along a different curve. You do not get something for nothing (the memory and communication costs are real), but you do get to choose where to spend.
The Engineering Reality of MoE
The math above is clean. The systems are not. If you write the naive MoE in production, three things will go catastrophically wrong, and each of them is the subject of an entire later section in this chapter:
- Load imbalance. The router's natural tendency, untreated, is to send most tokens to a small subset of popular experts. The unpopular experts never train; the popular ones bottleneck the GPU. This is routing collapse, and it is the central failure mode of MoE training (covered in Chapter 6).
- Expert parallelism and all-to-all communication. If the model has 256 experts and your cluster has 256 GPUs, every token might need to travel to a different GPU. That cross-device shuffle is an all-to-all collective — by far the most expensive primitive in distributed training (covered in section 5 of this chapter).
- Inference latency. Conditional compute is great for training throughput but adds an extra hop at inference. KV-cache savings still apply, but you now pay a routing decision per token, plus the all-to-all if experts are sharded. Serving stacks like vLLM and SGLang have specialized MoE schedulers; we will look at them in the inference chapter.
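Load imbalance is easy to observe even in a toy setting. The sketch below routes random tokens through a random router, then repeats the experiment after giving expert 0 an artificial head start. The bias value, the sizes, and the helper name `expert_load` are all assumptions for illustration, not a model of how real routers collapse during training.

```python
import numpy as np


def expert_load(logits: np.ndarray, k: int) -> np.ndarray:
    """Fraction of routing slots each expert receives under top-k routing."""
    E = logits.shape[1]
    top = np.argsort(logits, axis=1)[:, -k:]          # (T, k) chosen experts
    counts = np.bincount(top.ravel(), minlength=E)
    return counts / counts.sum()


rng = np.random.default_rng(0)
T, d_model, E, k = 4096, 32, 8, 2
tokens = rng.normal(size=(T, d_model))
W_r = rng.normal(size=(E, d_model)) / np.sqrt(d_model)

balanced = expert_load(tokens @ W_r.T, k)
bias = np.zeros(E)
bias[0] = 3.0                                          # artificial preference for expert 0
skewed = expert_load(tokens @ W_r.T + bias, k)

# A perfectly balanced router gives max/mean = 1; the upper bound is E / k.
print("balanced max/mean load:", balanced.max() * E)
print("skewed   max/mean load:", skewed.max() * E)
```

The max-to-mean ratio is a simple proxy for the slowdown: with one expert per device, the busiest expert sets the step time for everyone else.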
Despite these costs, the trade is overwhelmingly worth it. Once the routing, balance, and parallelism problems are solved well, MoE models reliably beat dense models of the same compute budget — and that is precisely why DeepSeek-V3, Mixtral, Qwen-MoE, and most of the frontier open-weight models of 2024-2025 are all sparse mixtures, not dense stacks.
For now, the one sentence to carry forward is this: MoE is the architectural move that broke the link between parameter count and compute cost, and every model with hundreds of billions of parameters running on anything less than a planet-scale datacenter is built on top of it.