We have now seen four parallelism strategies — data, tensor, pipeline, and DualPipe's overlap of compute and communication — and built each one on top of the same primitive: a collective that synchronises tensors across ranks. Expert parallelism is the fifth, and it is the one that cracks open the path to 671B parameters. It is also the one that turns the cluster's network — not its compute — into the bottleneck. This section is the story of that bottleneck and DeepSeek's deceptively small fix.
The thesis. Expert parallelism makes each MoE layer shoot two all-to-all collectives over the network per forward pass. In a multi-node cluster the link that limits training throughput is no longer NVLink — it is the InfiniBand fabric between nodes, which is roughly an order of magnitude slower. Node-limited routing, a one-line change to the router, caps the InfiniBand traffic to a constant multiple of the NVLink traffic so the two finish at the same time. That single equation is what makes DeepSeek-V3's pre-training feasible on commodity 8-GPU nodes.
Where Expert Parallelism Sits in the Stack
Before zooming into the network, let us place expert parallelism among the strategies you already know. The four core parallelism axes partition four different things across the cluster:
| Strategy | What it shards | Collective per step | Network it uses |
|---|---|---|---|
| Data parallel (DP) | Batch | all-reduce of gradients | All — typically intra + inter-node |
| Tensor parallel (TP) | Each matmul along a hidden dim | all-reduce inside one layer | NVLink only (kept on-node) |
| Pipeline parallel (PP) | Layers along depth | point-to-point activations | Mostly NVLink + some IB |
| Expert parallel (EP) | Routed experts across ranks | two all-to-all per MoE layer | Crosses NVLink + InfiniBand |
Tensor parallelism is deliberately kept on-node because its all-reduce sits in the critical path of every matmul — losing a factor of ten on bandwidth would be catastrophic. Pipeline parallelism crosses nodes cheaply because the messages are small (activations, not weights). Data parallelism crosses nodes for the gradient all-reduce, but modern sharded variants (ZeRO-3, FSDP) hide most of that behind compute. Expert parallelism is the new entry, and it is the only collective in the stack that must traverse the cross-node fabric — there is no on-node shortcut, because the chosen expert can live anywhere.
The Real Problem: Two Networks, One Pinch Point
Inside an H800 node, eight GPUs share an NVLink/NVSwitch fabric that delivers around 600 GB/s of effective bisection bandwidth per GPU. Between nodes the cluster uses InfiniBand — usually one or two 200–400 Gbps HCAs per node — which works out to about 50 GB/s per GPU once you amortise NIC sharing. The two numbers are not close. NVLink is roughly faster than IB. Any algorithm that pushes the same number of bytes across both will spend twelve times longer on the IB hop than the NVLink one. That is the whole story.
Now consider a naive EP layout. We have nodes, each with 8 GPUs, holding routed experts spread evenly. A token activates top- experts. Assuming uniform routing, the probability that any one chosen expert sits on the token's home node is . So the probability it sits off the home node is . For , that is 87.5%. Almost every chosen expert is across the network.
The damage compounds because top-k routing fires of these per token. Each cross-node hop drags a row of the activation tensor — typically values at 2 bytes each, so 14 KB — across IB. With a per-device micro-batch of tokens at and , that is of IB traffic per device per dispatch — and there is a matching combine on the way back, doubling it. At 50 GB/s that is 16 ms of pure network time per MoE layer. Multiply by 58 MoE layers and the forward pass alone wants ≈ 0.9 s of IB time. The GPUs are idle the whole while.
Intuition: A City With Both Subways and Airplanes
Forget GPUs for a moment. Imagine a courier service operating in eight cities. Within a city there is a subway: cheap, fast, near-infinite capacity. Between cities the courier must fly: expensive, slow, bandwidth-limited. A package's destination is randomly chosen among the eight cities. If we let every package choose freely, 7 out of 8 packages get on a plane. The airport is the bottleneck — the planes always have a queue, the subways are mostly empty.
Now suppose, before the package leaves the warehouse, we attach a rule: a single package can visit at most four cities on its journey. If you need to drop off four parcels, you must choose four cities, but never a fifth. Inside those four cities you can take as many subway rides as you like. What changes? Each package now boards at most one plane (to the chosen second city) plus the subway hops within each visited city. Aggregate plane traffic falls by a factor of two — no, by a factor of where . The subways stay just as fast. The airport stops being the bottleneck.
That cap is node-limited routing. The cities are nodes; the airport is the InfiniBand NIC; the subway is NVLink; the packages are activation rows. Capping the number of nodes a token may touch is the only knob in the system that operates on the bandwidth-expensive resource without sacrificing expressive top-k routing.
The Math: All-to-All Cost and the Cross-Node Ratio
Let us model the cost of one MoE dispatch on a cluster of nodes, each with GPUs. Let be the number of tokens held by one GPU before dispatch, the activation hidden dimension, the top-k count, and the activation element size in bytes (typically for BF16).
For unconstrained routing each of the chosen experts is equally likely to live on any node, so the expected number of distinct nodes a token must reach is
For and this evaluates to about nodes per token, of which exactly one is the home node. So a typical token travels to remote nodes — and a copy of its activation row must cross IB once per remote node.
Multiplying through, the number of activation rows leaving one GPU over InfiniBand in a single dispatch is
and the IB byte budget per dispatch is . The combine on the way back is symmetric, so the round-trip cost is.
On the NVLink side, the analogous count of intra-node per-token transfers depends on how many of the picks landed inside the home node (or, with the hierarchical algorithm we will build, inside any node the token already visits). The cost model implemented in the cost-model simulator below is a tractable closed form; the key proportion is the ratio
We want to be at most the link speed ratio — i.e. we want IB to carry at most as much traffic per second as NVLink, so that wall time on the two links matches and the dispatch is bandwidth-balanced.
DeepSeek's Trick: Node-Limited Routing
DeepSeek-V3 introduces a hard structural constraint on the router: a token may activate experts on at most distinct nodes. In V3 they fix with . The implementation is a single masking step that runs before the top-k selection:
- For each token, score every expert with the router as usual: .
- For each node , compute the top-2 expert score on that node and call its sum — the node's "affinity" for this token.
- Keep the top nodes by ; zero out all expert scores on the remaining nodes.
- Run top- selection over the surviving scores. The token now visits at most nodes by construction.
The router is still differentiable — masking is just multiplying by a gradient-free indicator — and the bias-term load balancer from §6 keeps the per-node affinities from collapsing onto a few favourite nodes. Quality loss in V3 ablations is below the measurement noise on downstream evals; the bandwidth savings are not.
With the cap in place, the expected distinct-node count for one token is at most , so
Crucially, this bound is independent of : doubling the cluster from 8 to 16 nodes does not increase per-device IB traffic so long as stays fixed. The collective scales with the cap, not with the cluster.
This is what "near-perfect computation-communication overlap" means. Pick the smallest such that IB wall time matches the 10 ms or so of expert compute per layer. For V3-scale workloads, that turns out to be 4. The cost-model simulator in the next section makes this concrete.
Manual Numerical Walkthrough
Set , , , , (BF16). We compare two routings.
Case A: unconstrained top-8 across N = 8 nodes
Expected distinct nodes per token:.
Cross-node hops per token: .
Per-device rows over IB per dispatch: .
Per-device bytes over IB per dispatch: . Round trip: .
Wall time at 50 GB/s: . Per MoE layer alone. Across 58 layers, the forward pass alone needs ~580 ms of pure IB time. A whole training step would be IB-bound.
Case B: node-limited with M = 4
Distinct nodes per token capped at . Cross-node hops per token: .
Per-device rows over IB per dispatch: .
Per-device bytes over IB per dispatch: . Round trip: .
Wall time at 50 GB/s: per MoE layer.
That alone is a 30% reduction in IB time without any change to expert count or model dimension. Push further down to 2 and IB drops to ~3.5 ms — but quality starts to degrade because each token has only two candidate node-pools to pick experts from. DeepSeek's ablations land on as the sweet spot.
Visualizing Dispatch and the Bandwidth Budget
First, the mechanism. The animation below replays the five stages of an EP dispatch — local, route, dispatch all-to-all, expert compute, combine all-to-all — on a four-GPU cluster. Watch the colored tokens leave their home GPU during Dispatch, hit the expert that owns them during Compute, then travel back on Combine. The same logic scales to a multi-node cluster: just imagine each GPU column belonging to one of two nodes, with the all-to-all crossing the node boundary.
Now the cost. The simulator below puts the math from The Math behind a set of sliders. Drag down from and watch the IB bar shrink while the NVLink bar barely moves — that is exactly the asymmetry that node-limited routing exploits. The "Find smallest M that fits IB" button picks the smallest cap that keeps IB under the 10 ms compute budget, the same logic DeepSeek's team used to land on .
Two patterns to take away. First, IB saturation grows almost linearly with : small clusters with no cap and large clusters with a cap of 4 sit at similar IB load. Second, NVLink barely moves as you change because the intra-node bundle has to be moved either way — node-limited routing is a pure win on the slow link.
Plain Python: A Two-Phase All-to-All
Below is the algorithm as plain NumPy — no torch.distributed yet. Eight simulated ranks across two nodes, top-2 routing over sixteen experts. The two phases are visible as separate loops: first we bundle by destination node (the IB hop), then we re-sort bundles to the exact destination GPU within each node (the NVLink hop).
Two observations. First, the IB phase does not need to know about the GPUs per node — it only sees nodes. That collapse from destinations to destinations is what lets NCCL pipeline IB more efficiently. Second, the NVLink phase is a small local all-to-all of bundles that are already on the right node. The node-limited router determines which bundles get sent, but the structure of the two-phase collective is independent of that.
PyTorch: Hierarchical all_to_all_single
In production this becomes two calls to torch.distributed.all_to_all_single, one on a cross-node process group and one on an intra-node group. The annotated code below is the skeleton; production implementations add capacity padding (so every collective transfers fixed-size buffers) and fuse the size-exchange into the dispatch itself.
NCCL note. In NCCL ≥ 2.18,all_to_all_singleon a process group spanning a single node uses NVLink/NVSwitch directly. On a cross-node group it uses the IB NIC. By splitting the collective into two process groups, we let NCCL pick the right transport for each phase — and we expose the two phases to the overlap scheduler in DualPipe, which is the topic of the next section.
What Changes at 14.8T Tokens and 671B Parameters
Every quantity in the cost model scales with model and cluster size. At DeepSeek-V3 scale the relevant numbers are roughly:
| Quantity | Value at V3 scale | Why it matters |
|---|---|---|
| Routed experts E | 256 | Owners cover dozens of GPUs |
| Top-k | 8 | Eight cross-node hops per token without a cap |
| Hidden dim d | 7168 | 14 KB per activation row in BF16 |
| Nodes N | 8 H800 nodes (training group) | Sets the unconstrained 1/N home-node probability |
| GPUs / node | 8 | NVLink fanout target after IB hop |
| Node cap M | 4 | The single hyperparameter that balances IB ≈ NVLink |
| MoE layers | 58 | Per-step IB time multiplies by this |
| Tokens / device / step | ~4096 | Sets per-dispatch bytes |
Three things break or matter at this scale that did not in the 4-rank toy.
1. Capacity padding turns dynamic traffic into static
Real all_to_all_single calls are far cheaper when every rank knows the buffer sizes ahead of time — NCCL can skip the size-exchange and pre-allocate. Production EP rounds every per-expert token count up to a fixed capacity, drops the overflow, and pads the underflow with zeros. The system trades a small fraction of routed tokens for a deterministic, schedulable collective. The capacity factor (typically 1.25× the expected load) is one of the very few hyperparameters the operator has to tune after launch.
2. The number of expert-parallel ranks is bounded by the network
Above nodes or so, even node-limited routing's constant hops begin to saturate IB on smaller clusters. DeepSeek-V3 keeps EP groups at one training sub-cluster (8 nodes = 64 GPUs), replicates that group via ZeRO data parallelism, and never grows EP horizontally. Scale comes from more DP replicas, not more EP nodes.
3. Failure modes are network-shaped
A flaky IB cable or a hot-spot port on a single node makes the EP all-to-all tail-latency-dominated. Because all ranks block on the same collective, one slow link freezes the whole step. Production clusters monitor per-link NIC counters and rebalance EP groups before the slow link becomes the global bottleneck.
Engineering Reality: Overlap, Topology, and Stragglers
Node-limited routing is necessary but not sufficient. Even with , IB time per MoE layer is still on the order of compute time per MoE layer. Doing nothing else means the cluster spends half its life on the network. The way DeepSeek-V3 closes the loop is by making the dispatch and combine overlap with the compute of another micro-batch.
- Compute-comm overlap. While micro-batch is running its MoE expert compute, the dispatch all-to-all for micro-batch is already in flight on the NIC. The two queues run in parallel; the scheduler must be careful not to let IB drain before the next compute is ready.
- Topology-aware process groups. The cross-node group should map each node to one IB-attached rank that aggregates and forwards. Otherwise eight ranks per node fight for the same NIC and you lose the asymmetry advantage immediately.
- Straggler shielding. The slowest IB link in the cluster sets the wall time of the collective. Production clusters run weekly NIC linkspeed checks and quarantine nodes that drift.
- Dropless vs lossy capacity. "Dropless" EP, where overflow tokens are re-routed to spare experts, removes the quality penalty of capacity dropping but pushes more bytes on the wire. Pick lossy capacity in pre-training (sharper budgets, slight quality cost) and dropless in fine-tuning (where a single dropped token can pollute a small SFT dataset).