The previous four sections built DeepSeekMoE on a single device: 256 routed experts, 1 shared expert, top-8 gating. The arithmetic is clean on paper. It is impossible on hardware. A single H800 holds 80 GB; a DeepSeek-V3 MoE layer alone weighs hundreds of gigabytes. The expert pool simply does not fit on one GPU. Expert parallelism — sharding the experts across many GPUs and shuttling tokens to wherever their chosen expert lives — is the engineering trick that turns the beautiful single-device math into a 671B-parameter system that actually trains.
The bet of expert parallelism. Stop trying to fit every expert on every GPU. Put each expert on exactly one GPU, and instead mail the tokens to the expert. You pay for two all-to-all collectives per layer; in exchange you get a model 30× bigger than any single accelerator could hold.
The Memory Wall of 671B Parameters
Start with a concrete accounting. DeepSeek-V3 has 256 routed experts, each a 2-layer FFN with hidden dimension and model dimension . That is roughly parameters per expert, or about 58 MB at BF16. One layer's 256 experts therefore weigh . Multiply across 58 MoE layers and you reach roughly 880 GB of expert parameters alone — before counting attention, embeddings, the optimizer states (which are usually 4× the parameter footprint), gradients, or activations.
An 80 GB H800 cannot hold a single MoE layer's experts in BF16, let alone the whole model with optimizer state. Pure data parallelism is dead on arrival: replicating an 880 GB expert pool on every GPU is not an arithmetic problem, it is a physics problem. Tensor parallelism (the Megatron-style trick of splitting each matmul across devices) helps for a single dense FFN, but it does not exploit the structural sparsity of MoE — it shards each expert across GPUs even though most experts do nothing for most tokens.
Sharding Experts Across the Cluster
Picture a hospital again, but now spread across four cities. Each city has two specialists; eight specialists in total. When a patient arrives at a local clinic, the triage desk decides which specialist they need, and the patient flies to the city where that specialist works. After the appointment they fly home. The clinics are tiny (one router each). The specialists are stationary. The patients travel.
The four cities are GPUs. The eight specialists are experts (two per GPU). The patients are tokens. The flight is the all-to-all collective: in a single coordinated step, every GPU sends its tokens to wherever they need to go and receives the tokens it must process — all at the same time, all over the network. After expert compute, a reverse all-to-all flies the outputs home.
Two pieces of the system stay replicated, not sharded: the router and the shared experts. The router is small (a single linear) and must run on every device — every device decides where its own tokens go without asking anyone. The shared experts, similarly small relative to the routed pool, live on every device and run locally for every token. The expensive, sharded thing is the routed pool.
The Math: Permutations and All-to-All
Let be the world size of the expert-parallel group, the number of routed experts, and assume is divisible by . Each device owns the contiguous expert block . Define .
On step , device holds a local token slice of shape . After routing, it has top- expert ids per token, which we flatten into a list of dispatch decisions. Group these by destination device, producing per-destination send counts . The dispatch all-to-all is the permutation:
Each device then runs its local experts on the received rows — no communication during compute. The combine all-to-all is the inverse permutation, mailing each output back to its source device. After applying the gates and summing across the slots, the per-token output is restored.
Communication cost
Each device sends about elements during dispatch and the same again during combine. The total volume per layer per device is therefore roughly bytes (where is bytes per element — 2 for BF16). For DeepSeek-V3 with , that's about 0.94 GB per device per layer per step, and we have 58 such layers. The all-to-all traffic is the single largest bandwidth consumer in MoE training.
The load-balance precondition
The math above assumes is roughly uniform. If one device's experts are popular and the rest are dead, that device's inbox overflows while the others sit idle — compute and communication both stall. Expert parallelism therefore requires the load-balancing mechanisms we will derive in chapter 6 (the auxiliary-loss-free bias terms). Without balance, the whole sharding strategy collapses.
Manual Numerical Walkthrough
Let us trace a tiny dispatch by hand. Four devices, eight experts (two per device), top-1 routing, three tokens per device.
Click to expand: one all-to-all by hand
Setup. devices, experts, two experts per device, so, , , . Each device has 3 local tokens and routes top-1 — call the routed expert ids:
- Device 0's tokens pick experts .
- Device 1's tokens pick experts .
- Device 2's tokens pick experts .
- Device 3's tokens pick experts .
Step 1: expert id → destination device. Apply :
- Device 0 dests: .
- Device 1 dests: .
- Device 2 dests: .
- Device 3 dests: .
Step 2: build the send-count matrix rows sends to :
| src ↓ / dst → | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| device 0 | 1 | 1 | 1 | 0 |
| device 1 | 1 | 1 | 0 | 1 |
| device 2 | 0 | 1 | 1 | 1 |
| device 3 | 1 | 0 | 1 | 1 |
Row sums are all 3 (every device sends out all 3 of its tokens). Column sums are also all 3 (every device receives exactly 3 tokens). That column-sum equality is exactly the balance precondition — if column 0 had been 5 and column 2 had been 1, device 0 would have to do 5× the work of device 2 this step.
Step 3: the all-to-all itself. Every device simultaneously sends row chunks out and receives column chunks in. So after the collective, device 0's inbox is:
- 1 token from device 0 (its own local token that picked expert 0)
- 1 token from device 1 (the token that picked expert 1)
- 0 tokens from device 2
- 1 token from device 3 (the token that picked expert 0)
Total: 3 tokens, two destined for expert 0 and one for expert 1 — the two experts device 0 happens to own.
Step 4: local expert compute on device 0. Run expert 0 once on its packed (2, ) mini-batch and expert 1 once on its (1, ) mini-batch. No cross-device anything.
Step 5: combine all-to-all. The transpose of tells device 0 to send 1 row back to device 0, 1 to device 1, 0 to device 2, 1 to device 3 — exactly the reverse permutation. Each output lands back in the originating device's row index, gates are applied locally, and one MoE layer is done.
The takeaway. Two collectives, no cross-device compute, all data motion is a pure permutation of rows. Scale this from 4 devices to 256 EP ranks, 12 tokens to 4096 × 8 dispatch rows, and the picture is identical — just bigger arrays.
Visualizing the Dispatch
Step through the five stages below. Watch how the colored tokens (color = chosen expert id) leave their home GPU during Dispatch, land next to the matching expert on the destination GPU during Compute, and travel back during Combine. The side panel tracks how many tokens actually had to cross the network — the rest picked a local expert by luck of the router.
Three observations to anchor. First, every GPU sends and receives in the same collective — that is what "all-to-all" means and why it is bandwidth-efficient. Second, the traffic counter exposes the load-balance pressure: in a uniformly balanced batch, every device sends and receives the same number of rows. Third, the compute step happens entirely locally — once tokens land, the experts run normally, identical to the single-device MoE from earlier sections.
Plain Python: A Four-Device Simulator
Before touching torch.distributed, here is the entire mechanism in NumPy. We simulate four devices as four Python lists, route every device's tokens locally, build the send buffers, and assemble each device's inbox by hand. The collective is the dictionary rearrangement on lines 36–43 — that is, literally, what an all-to-all does.
Notice the only line that could not exist in real code: the block that builds device 0's expert dict by filtering. In real expert parallelism, that filter happens at process creation — each rank only ever allocates its own experts. There is no global dict, no other device's weights waiting to be filtered out.
Sanity check. Set . Every token's destination is device 0; the send dict has one entry; the all-to-all is a no-op; we recover the single-device MoE from section 5.1. Set (one expert per device): every all-to-all is fully cross-device and there are no local picks — the bandwidth tax is at its maximum.
PyTorch: all_to_all_single in Anger
Production code uses torch.distributed.all_to_all_single with split sizes. The module below is a complete, runnable expert-parallel MoE layer — short enough to read end-to-end, faithful enough to the real DeepSeek implementation that the structure transfers directly.
Three subtleties that bite first-time EP implementers:
- The counts exchange is mandatory. all_to_all_single requires the receiver buffer to be pre-sized. You cannot know your inbox size without first exchanging the per-source counts — that is why two collectives appear before the activations move. Skipping this step is the most common cause of segfaults in hand-written EP code.
- Sorting by destination is the cheap optimization. all_to_all_single moves contiguous blocks. If you do not sort, you fall back to all_to_all (with explicit per-rank tensors) which is dramatically slower because it triggers many small NCCL sends instead of one large one. The argsort + argsort-of-argsort idiom is worth memorizing.
- Autograd flows through the collective. NCCL's PyTorch bindings have proper backward implementations for all_to_all_single: the gradient of a dispatch is the corresponding combine, and vice versa. You write the forward; autograd produces a correctly-distributed backward at no additional code cost. This is why hand-rolling collectives in raw
dist.send/dist.recvis almost never worth it.
for e in range(self.local_E) loop is the readable version. At DeepSeek scale, this is replaced by a single grouped-GEMM call (NVIDIA's CUTLASS or Triton kernels) that runs all local experts in one launch, packing rows by expert offset. Same semantics, an order of magnitude less kernel-launch overhead.What Changes at Massive Scale
At cluster scale, expert parallelism does not stand alone — it is one axis of a 3D or 4D parallelism mesh. DeepSeek-V3's reported training configuration is illustrative:
| Parallelism axis | Group size | What it shards |
|---|---|---|
| Data parallelism (DP) | 64 replicas | the batch — each replica sees a different microbatch slice |
| Pipeline parallelism (PP) | 16 stages | model layers — each stage owns a contiguous block |
| Expert parallelism (EP) | 32 ranks | the routed experts within one MoE layer |
| Tensor parallelism (TP) | 1 (none in V3) | would shard within a single matmul; skipped to keep all-to-all local |
The cluster mesh is therefore ranks. The MoE layer's 256 routed experts are sharded across the 32-rank EP group: 8 experts per GPU. Every step, every EP group does its own pair of all-to-alls; the 64 data-parallel replicas do them independently in parallel.
Node-limited routing
Cross-node bandwidth (NVLink + InfiniBand combined) is typically 4–8× slower than intra-node NVLink. DeepSeek-V3 introduces node-limited routing: each token is restricted to experts living on at most nodes out of the 8 nodes in its EP group. The router's top- is masked so the chosen experts span no more than nodes. This halves cross-node traffic with almost no quality loss — a beautifully practical compromise.
Capacity factor and token dropping
Even with auxiliary load balancing, the routed traffic is never perfectly uniform. To bound the receiver buffer at compile time, MoE implementations set a capacity factor : each expert reserves space for at most rows per step. Tokens beyond the cap are dropped for that expert — they pass through the shared experts and the residual stream unmodified by the MoE block. DeepSeek typically uses in training and bumps it to for inference where latency matters more than throughput.
Communication-compute overlap
The bandwidth budget can be partly hidden. While device is running its local experts on the inbox, it can begin sending output rows back to their source devices for the combine — once any single expert finishes, its rows can start traveling. Real implementations pipeline this overlap explicitly. The shared experts play a second role here: they are running locally during the dispatch all-to-all, hiding 10–30% of the network latency behind useful compute.
| Quantity | DeepSeek-V3 value | Why it matters |
|---|---|---|
| Routed experts per MoE layer | 256 | deep specialization pool |
| EP group size | 32 GPUs | experts/GPU = 8 |
| Top-k | 8 | every token visits 8 experts per layer |
| Node-limited M | 4 nodes | halves cross-node all-to-all |
| Capacity factor (train) | ≈ 1.0 | bounds receive buffer |
| All-to-all bytes / GPU / layer / step | ≈ 0.94 GB | the bandwidth bill |
| MoE layers in V3 | 58 | total bill ≈ 55 GB/GPU/step |
Engineering Reality: Capacity, Overlap, and Topology
Expert parallelism is the easiest piece of MoE to implement at toy scale and the easiest to get wrong at production scale. Four failure modes are worth carrying in your head:
- Imbalanced routing → straggler GPU. If one device owns the most popular experts, the entire EP group waits for it on every step. Symptoms: low SM utilization on most ranks, high on one; all-to-all latency climbing over training. Fix: aggressive load-balancing (chapter 6) and capacity factor > 1.
- EP group placed across the wrong topology. If your 32-rank EP group spans 8 nodes instead of 4, every all-to-all is cross-node. Modern clusters give you NVLink within a node and InfiniBand across nodes — those are 10× apart in bandwidth. Place EP groups along NVLink-dense axes, not pipeline-dense ones.
- Token-drop cascade in mixed precision. With and BF16 accumulation, a tiny drift in router logits can flip a token from "just under the cap" to "dropped". Dropped tokens get zero gradient through the MoE for that step. Stack hundreds of such events per batch and you have a silent quality regression that no loss curve will catch — only eval metrics will. Periodically log per-expert drop rate.
- All-to-all on the wrong stream. If the dispatch collective shares a CUDA stream with the local expert compute, they serialize even though they could overlap. Production EP code puts collectives on dedicated streams and uses CUDA events to synchronize only where strictly required. A 30% throughput swing is on the table.
The one sentence to carry forward: expert parallelism turns MoE's structural sparsity into a memory savings — and pays for it with two all-to-all collectives per layer per step. Everything else in massive MoE engineering — load balancing, capacity factors, node-limited routing, communication-compute overlap — exists to keep that all-to-all cheap.