Chapter 11

Expert Parallelism and Cross-Node All-to-All

Distributed Training: DualPipe and the Parallelism Stack

We have now seen four parallelism strategies — data, tensor, pipeline, and DualPipe's overlap of compute and communication — and built each one on top of the same primitive: a collective that synchronises tensors across ranks. Expert parallelism is the fifth, and it is the one that cracks open the path to 671B parameters. It is also the one that turns the cluster's network — not its compute — into the bottleneck. This section is the story of that bottleneck and DeepSeek's deceptively small fix.

The thesis. Expert parallelism makes each MoE layer shoot two all-to-all collectives over the network per forward pass. In a multi-node cluster the link that limits training throughput is no longer NVLink — it is the InfiniBand fabric between nodes, which is roughly an order of magnitude slower. Node-limited routing, a one-line change to the router, caps the InfiniBand traffic to a constant multiple of the NVLink traffic so the two finish at the same time. That single equation is what makes DeepSeek-V3's pre-training feasible on commodity 8-GPU nodes.

Where Expert Parallelism Sits in the Stack

Before zooming into the network, let us place expert parallelism among the strategies you already know. The four core parallelism axes partition four different things across the cluster:

| Strategy | What it shards | Collective per step | Network it uses |
| --- | --- | --- | --- |
| Data parallel (DP) | Batch | all-reduce of gradients | All (typically intra- and inter-node) |
| Tensor parallel (TP) | Each matmul along a hidden dim | all-reduce inside one layer | NVLink only (kept on-node) |
| Pipeline parallel (PP) | Layers along depth | point-to-point activations | Mostly NVLink + some IB |
| Expert parallel (EP) | Routed experts across ranks | two all-to-all per MoE layer | Crosses NVLink + InfiniBand |

Tensor parallelism is deliberately kept on-node because its all-reduce sits in the critical path of every matmul; losing a factor of ten on bandwidth would be catastrophic. Pipeline parallelism crosses nodes cheaply because the messages are small (activations, not weights). Data parallelism crosses nodes for the gradient all-reduce, but modern sharded variants (ZeRO-3, FSDP) hide most of that behind compute. Expert parallelism is the new entry, and it is the only strategy whose per-layer, critical-path collective must traverse the cross-node fabric: there is no on-node shortcut, because the chosen expert can live anywhere.

Why we cannot avoid cross-node EP. At BF16 (2 bytes per parameter), a 671B-parameter model occupies about 1.3 TB, and in an MoE of this kind nearly all of that is routed expert weights (see §5.5). That is roughly twice a single node's 8 × 80 GB = 640 GB, before counting gradients, optimizer state, and activations. The routed pool must span nodes, so every dispatch all-to-all hits InfiniBand. The question is not whether to cross nodes but how to do it cheaply.

The Real Problem: Two Networks, One Pinch Point

Inside an H800 node, eight GPUs share an NVLink/NVSwitch fabric that delivers around 600 GB/s of effective bisection bandwidth per GPU. Between nodes the cluster uses InfiniBand (usually one or two 200–400 Gbps HCAs per node), which works out to about 50 GB/s per GPU once you amortise NIC sharing. The two numbers are not close. NVLink is roughly $12\times$ faster than IB. Any algorithm that pushes the same number of bytes across both will spend twelve times longer on the IB hop than the NVLink one. That is the whole story.

Now consider a naive EP layout. We have $N$ nodes, each with 8 GPUs, holding $E$ routed experts spread evenly. A token activates top-$k$ experts. Assuming uniform routing, the probability that any one chosen expert sits on the token's home node is $1/N$. So the probability it sits off the home node is $1 - 1/N$. For $N = 8$, that is 87.5%. Almost every chosen expert is across the network.

The damage compounds because top-$k$ routing fires $k$ of these per token. Each cross-node hop drags a row of the activation tensor (typically $d = 7168$ values at 2 bytes each, so 14 KB) across IB. With a per-device micro-batch of $T = 4096$ tokens at $k = 8$ and $N = 8$, and one copy sent per chosen expert, that is $4096 \cdot 8 \cdot 0.875 \cdot 14\,\text{KB} \approx 400\,\text{MB}$ of IB traffic per device per dispatch, and there is a matching combine on the way back, doubling it. At 50 GB/s that is 16 ms of pure network time per MoE layer. Multiply by 58 MoE layers and the forward pass alone wants ≈ 0.9 s of IB time. The GPUs are idle the whole while.
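The arithmetic in the last two paragraphs can be reproduced in a few lines. This is a back-of-envelope sketch only: it assumes uniform routing, one activation row sent per chosen remote expert, and the 50 GB/s per-GPU IB figure quoted above; the variable names are illustrative.

```python
# Back-of-envelope IB traffic for naive (uncapped, per-expert) dispatch.
N = 8            # nodes
k = 8            # experts chosen per token
T = 4096         # tokens per device per micro-batch
d = 7168         # hidden dimension
b = 2            # bytes per element (BF16)
ib_bw = 50e9     # assumed IB bandwidth per GPU, bytes/s
moe_layers = 58

p_remote = 1 - 1 / N                       # chance a chosen expert is off-node
row_bytes = d * b                          # bytes per activation row
dispatch_bytes = T * k * p_remote * row_bytes
round_trip_s = 2 * dispatch_bytes / ib_bw  # dispatch plus combine
forward_s = round_trip_s * moe_layers

print(f"off-node probability : {p_remote:.3f}")
print(f"bytes per dispatch   : {dispatch_bytes / 1e6:.0f} MB")
print(f"IB time per MoE layer: {round_trip_s * 1e3:.1f} ms")
print(f"IB time per forward  : {forward_s:.2f} s")
```

Changing `N`, `k`, or `ib_bw` shows how quickly the estimate moves; the bandwidth in particular is an assumption rather than a measured value.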

The asymmetry is the bug, and the fix. If we could somehow route most of the k cross-node hops within the 12× faster on-node fabric, the cluster would spend the same wall time on IB and NVLink — i.e. the two would balance. That is precisely what node-limited routing achieves, and we will derive it from the cost equation in a moment.

Intuition: A City With Both Subways and Airplanes

Forget GPUs for a moment. Imagine a courier service operating in eight cities. Within a city there is a subway: cheap, fast, near-infinite capacity. Between cities the courier must fly: expensive, slow, bandwidth-limited. A package's destination is randomly chosen among the eight cities. If we let every package choose freely, 7 out of 8 packages get on a plane. The airport is the bottleneck — the planes always have a queue, the subways are mostly empty.

Now suppose, before the package leaves the warehouse, we attach a rule: a single package can visit at most four cities on its journey. If you need to drop off eight parcels, you may spread them over four cities, but never a fifth. Inside those four cities you can take as many subway rides as you like. What changes? Each package now boards at most three planes (one to each chosen city other than its own) and then rides the subway within each visited city. Plane traffic per package is bounded by $M - 1$ flights with $M = 4$, no matter how many cities $N$ the service covers. The subways stay just as fast. The airport stops being the bottleneck.

That cap is node-limited routing. The cities are nodes; the airport is the InfiniBand NIC; the subway is NVLink; the packages are activation rows. Capping the number of nodes a token may touch is the only knob in the system that operates on the bandwidth-expensive resource without sacrificing expressive top-k routing.

The Math: All-to-All Cost and the Cross-Node Ratio

Let us model the cost of one MoE dispatch on a cluster of $N$ nodes, each with $G$ GPUs. Let $T$ be the number of tokens held by one GPU before dispatch, $d$ the activation hidden dimension, $k$ the top-k count, and $b$ the activation element size in bytes (typically $b = 2$ for BF16).

For unconstrained routing each of the $k$ chosen experts is equally likely to live on any node, so the expected number of distinct nodes a token must reach is

$$\mathbb{E}[\text{distinct nodes} \mid \text{unconstrained}] = N \cdot \left(1 - \left(1 - \tfrac{1}{N}\right)^{k}\right).$$

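For $N = 8$ and $k = 8$ this evaluates to about $5.25$ nodes per token. Treating the home node as one of them, a typical token travels to roughly $5.25 - 1 = 4.25$ remote nodes, and a copy of its activation row must cross IB once per remote node. (The home node is not always among the chosen ones, so this shortcut slightly undercounts: the exact expectation is $(N - 1)\left(1 - (1 - 1/N)^{k}\right) \approx 4.59$. We keep the simpler figure below.)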

Multiplying through, the number of activation rows leaving one GPU over InfiniBand in a single dispatch is

$$R_{\text{IB}} = T \cdot \left[N \cdot \left(1 - \left(1 - \tfrac{1}{N}\right)^{k}\right) - 1\right],$$

and the IB byte budget per dispatch is $B_{\text{IB}} = R_{\text{IB}} \cdot d \cdot b$. The combine on the way back is symmetric, so the round-trip cost is $2 B_{\text{IB}}$.
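A quick Monte Carlo run is one way to sanity-check the distinct-node formula. The sketch below assumes each of the $k$ picks lands on a uniformly random node, independently of the others, which is the same simplification the closed form makes; `simulate` and its arguments are illustrative names. It also counts remote nodes directly, which shows how far the "subtract one for the home node" shortcut is from the exact expectation.

```python
import random

def expected_distinct_nodes(N: int, k: int) -> float:
    """Closed form: expected number of distinct nodes hit by k uniform picks."""
    return N * (1 - (1 - 1 / N) ** k)

def simulate(N: int, k: int, trials: int = 100_000, seed: int = 0):
    """Monte Carlo estimate of (distinct nodes, distinct remote nodes) per token."""
    rng = random.Random(seed)
    home = 0  # by symmetry, any node can play the home node
    distinct_total = 0
    remote_total = 0
    for _ in range(trials):
        nodes = {rng.randrange(N) for _ in range(k)}
        distinct_total += len(nodes)
        remote_total += len(nodes - {home})
    return distinct_total / trials, remote_total / trials

closed = expected_distinct_nodes(8, 8)
sim_distinct, sim_remote = simulate(8, 8)
exact_remote = (8 - 1) * (1 - (1 - 1 / 8) ** 8)

print(f"closed form distinct nodes : {closed:.3f}")
print(f"simulated distinct nodes   : {sim_distinct:.3f}")
print(f"shortcut remote (minus one): {closed - 1:.3f}")
print(f"simulated remote nodes     : {sim_remote:.3f}")
print(f"exact remote expectation   : {exact_remote:.3f}")
```

The gap between the shortcut and the exact remote count is small enough that the rest of the cost model is not sensitive to it.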

On the NVLink side, the analogous count of intra-node per-token transfers depends on how many of the $k$ picks landed inside the home node (or, with the hierarchical algorithm we will build, inside any node the token already visits). The cost model implemented in the cost-model simulator below is a tractable closed form; the key proportion is the ratio

$$\rho \equiv \frac{B_{\text{IB}}}{B_{\text{NVLink}}}.$$

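Wall time on each link is bytes divided by bandwidth, so the two links finish together when $B_{\text{IB}} / \beta_{\text{IB}} = B_{\text{NVLink}} / \beta_{\text{NVLink}}$, that is when $\rho = \beta_{\text{IB}} / \beta_{\text{NVLink}} \approx 1/12$. Put differently, NVLink can move about 12 bytes in the time IB moves one. Whenever $\rho$ exceeds $1/12$, IB is the slower link and sets the dispatch time, so the goal is to push $\rho$ down toward that balance point.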

DeepSeek's Trick: Node-Limited Routing

DeepSeek-V3 introduces a hard structural constraint on the router: a token may activate experts on at most $M$ distinct nodes. In V3 they fix $M = 4$ with $N = 8$. The implementation is a single masking step that runs before the top-k selection:

  • For each token, score every expert with the router as usual: $s_e = \operatorname{softmax}(x^\top W_r)_e$. (DeepSeek-V3 itself uses a sigmoid gate here; the masking steps are the same either way.)
  • For each node $n$, take the two highest expert scores on that node and call their sum $\mathrm{aff}_n$, the node's "affinity" for this token.
  • Keep the top $M$ nodes by $\mathrm{aff}_n$; zero out all expert scores on the remaining $N - M$ nodes.
  • Run top-$k$ selection over the surviving scores. The token now visits at most $M$ nodes by construction.
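The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not DeepSeek's implementation: the function name `node_limited_topk`, the contiguous expert-to-node layout, and the use of `-inf` instead of zero for masked scores (so the sketch also works with scores that are not strictly positive) are all assumptions made here.

```python
import numpy as np

def node_limited_topk(scores, n_nodes, M, k, group_top=2):
    """Pick k experts per token while touching at most M nodes.

    scores: (T, E) router scores; experts are assumed contiguous by node.
    """
    T, E = scores.shape
    per_node = E // n_nodes
    grouped = scores.reshape(T, n_nodes, per_node)             # (T, N, E/N)
    # Node affinity: sum of the group_top highest expert scores on each node.
    affinity = np.sort(grouped, axis=2)[:, :, -group_top:].sum(axis=2)
    # Keep the M nodes with the highest affinity for each token.
    keep_nodes = np.argsort(-affinity, axis=1)[:, :M]          # (T, M)
    node_mask = np.zeros((T, n_nodes), dtype=bool)
    np.put_along_axis(node_mask, keep_nodes, True, axis=1)
    # Mask experts on dropped nodes, then run ordinary top-k.
    expert_mask = np.repeat(node_mask, per_node, axis=1)       # (T, E)
    masked = np.where(expert_mask, scores, -np.inf)
    return np.argsort(-masked, axis=1)[:, :k]                  # (T, k)

rng = np.random.default_rng(0)
T, E, N, M, k = 16, 256, 8, 4, 8
logits = rng.standard_normal((T, E))
scores = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
picks = node_limited_topk(scores, N, M, k)
nodes_touched = [len(set((row // (E // N)).tolist())) for row in picks]
print("max nodes touched by any token:", max(nodes_touched))
```

The sketch relies on $M \cdot E/N \ge k$, so that enough experts survive the mask for the final top-$k$; with the V3-like sizes used here that is 128 candidates for 8 picks.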

The router is still differentiable — masking is just multiplying by a gradient-free indicator — and the bias-term load balancer from §6 keeps the per-node affinities from collapsing onto a few favourite nodes. Quality loss in V3 ablations is below the measurement noise on downstream evals; the bandwidth savings are not.

With the cap in place, the distinct-node count for one token is at most $M$, so (counting the home node among them, as before)

$$R_{\text{IB}}^{\text{node-limited}} \le T \cdot (M - 1).$$

Crucially, this bound is independent of $N$: doubling the cluster from 8 to 16 nodes does not increase per-device IB traffic so long as $M$ stays fixed. The collective scales with the cap, not with the cluster.
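To see the independence from $N$ numerically, the sketch below evaluates the unconstrained remote-node count and the capped bound for growing clusters. It uses the section's shortcut of subtracting one for the home node; the function names are illustrative.

```python
def remote_nodes_unconstrained(N: int, k: int) -> float:
    """Section's approximation: expected distinct nodes minus the home node."""
    return N * (1 - (1 - 1 / N) ** k) - 1

def remote_nodes_capped(N: int, k: int, M: int) -> float:
    """With node-limited routing a token reaches at most M - 1 remote nodes."""
    return min(remote_nodes_unconstrained(N, k), M - 1)

k, M = 8, 4
table = []
for N in (8, 16, 32, 64):
    uncapped = remote_nodes_unconstrained(N, k)
    capped = remote_nodes_capped(N, k, M)
    table.append((N, uncapped, capped))
    print(f"N={N:3d}  uncapped={uncapped:5.2f}  capped={capped:5.2f}")
```

The uncapped column keeps climbing toward $k - 1$ as nodes are added, while the capped column stays at $M - 1 = 3$.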

This is what "near-perfect computation-communication overlap" means. Pick the largest $M$ whose IB wall time still fits under the 10 ms or so of expert compute per layer; a smaller cap would hide just as well but gives the router fewer experts to choose from. For V3-scale workloads, that $M$ turns out to be 4. The cost-model simulator in the next section makes this concrete.

Manual Numerical Walkthrough

Set $N = 8$, $k = 8$, $T = 4096$, $d = 7168$, $b = 2$ (BF16). We compare two routings.

Case A: unconstrained top-8 across N = 8 nodes

Expected distinct nodes per token: $8 \cdot (1 - (7/8)^8) = 8 \cdot 0.656 = 5.25$.

Cross-node hops per token: $5.25 - 1 = 4.25$.

Per-device rows over IB per dispatch: $4096 \cdot 4.25 = 17{,}408$.

Per-device bytes over IB per dispatch: $17{,}408 \cdot 7168 \cdot 2 \approx 250\,\text{MB}$. Round trip: $500\,\text{MB}$.

Wall time at 50 GB/s: $500\,\text{MB} / 50\,\text{GB/s} \approx 10\,\text{ms}$, and that is for a single MoE layer. Across 58 layers, the forward pass alone needs ~580 ms of pure IB time. A whole training step would be IB-bound.

Case B: node-limited with M = 4

Distinct nodes per token capped at $M = 4$. Cross-node hops per token: $4 - 1 = 3$.

Per-device rows over IB per dispatch: $4096 \cdot 3 = 12{,}288$.

Per-device bytes over IB per dispatch: $12{,}288 \cdot 7168 \cdot 2 \approx 176\,\text{MB}$. Round trip: $352\,\text{MB}$.

Wall time at 50 GB/s: $352\,\text{MB} / 50\,\text{GB/s} \approx 7\,\text{ms}$ per MoE layer.

That alone is a 30% reduction in IB time without any change to expert count or model dimension. Push $M$ further down to 2 and IB drops to about 2.3 ms (one remote node per token, so $4096 \cdot 14\,\text{KB} \approx 59\,\text{MB}$ each way), but quality starts to degrade because each token has only two candidate node-pools to pick experts from. DeepSeek's ablations land on $M = 4$ as the sweet spot.
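The two cases can be recomputed with a small helper. This is only a sketch of the walkthrough's own arithmetic: it takes the rounded 4.25 remote nodes for Case A, assumes 50 GB/s of IB bandwidth per GPU, and the helper name `ib_round_trip` is made up for illustration.

```python
def ib_round_trip(T, d, b, remote_nodes, ib_bw=50e9):
    """Return (rows, one-way bytes, round-trip seconds) for one MoE layer."""
    rows = T * remote_nodes          # one activation row per remote node
    one_way = rows * d * b           # bytes crossing IB on dispatch
    return rows, one_way, 2 * one_way / ib_bw

T, d, b = 4096, 7168, 2

rows_a, bytes_a, time_a = ib_round_trip(T, d, b, remote_nodes=4.25)   # Case A
rows_b, bytes_b, time_b = ib_round_trip(T, d, b, remote_nodes=4 - 1)  # Case B

for name, rows, nbytes, secs in (("A", rows_a, bytes_a, time_a),
                                 ("B", rows_b, bytes_b, time_b)):
    print(f"Case {name}: {rows:8.0f} rows  {nbytes / 1e6:6.1f} MB one way  "
          f"{secs * 1e3:5.2f} ms round trip")

print(f"IB time saved by the cap: {100 * (1 - time_b / time_a):.0f}%")
```

Passing a different `remote_nodes` value gives the estimate for any other cap under the same assumptions.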

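Sanity check on the 12× ratio. At $M = 4$, each token sends at most 3 rows over IB (one per remote node), while NVLink delivers up to $k = 8$ (token, expert) rows to the GPUs that own the chosen experts. NVLink therefore moves about $8/3 \approx 2.7\times$ as many rows as IB. NVLink delivers 12× the bandwidth, so its wall time is roughly $2.7 / 12 \approx 0.22$ of IB's. IB is still the bottleneck under these numbers, and the NVLink hop has ample headroom.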

Visualizing Dispatch and the Bandwidth Budget

First, the mechanism. The animation below replays the five stages of an EP dispatch — local, route, dispatch all-to-all, expert compute, combine all-to-all — on a four-GPU cluster. Watch the colored tokens leave their home GPU during Dispatch, hit the expert that owns them during Compute, then travel back on Combine. The same logic scales to a multi-node cluster: just imagine each GPU column belonging to one of two nodes, with the all-to-all crossing the node boundary.

[Interactive figure: EP dispatch animation]

Now the cost. The simulator below puts the math from the section "The Math" behind a set of sliders. Drag $M$ down from $N$ and watch the IB bar shrink while the NVLink bar barely moves; that is exactly the asymmetry that node-limited routing exploits. The "Find smallest M that fits IB" button searches for a cap that keeps IB under the 10 ms compute budget, the same logic DeepSeek's team used to land on $M = 4$.

[Interactive figure: cost-model simulator]

Two patterns to take away. First, IB saturation grows almost linearly with $(N - 1)/M$: small clusters with no cap and large clusters with a cap of 4 sit at similar IB load. Second, NVLink barely moves as you change $M$ because the intra-node bundle has to be moved either way; node-limited routing is a pure win on the slow link.

Plain Python: A Two-Phase All-to-All

Below is the algorithm as plain NumPy — no torch.distributed yet. Eight simulated ranks across two nodes, top-2 routing over sixteen experts. The two phases are visible as separate loops: first we bundle by destination node (the IB hop), then we re-sort bundles to the exact destination GPU within each node (the NVLink hop).

hierarchical_a2a_numpy.py

Lines 3–5: Two nodes, four GPUs each

We pretend to be an 8-rank cluster split across two nodes. That is the smallest configuration where cross-node traffic is even possible; with one node the IB network never runs.

Execution state: NODES = 2, GPUS_PER_NODE = 4, WORLD = 8

Lines 6–7: Experts and routing

16 experts spread evenly: 2 experts per GPU, 8 experts per node. Each token activates k = 2 experts. That is small enough to trace by hand but rich enough that some token-expert pairs will land off-node.

Execution state: E = 16, k = 2, experts_per_gpu = 2

Lines 8–9: Where does expert e live?

owner(e) returns the GPU rank that holds expert e; node_of(r) returns the node that holds rank r. The two lambdas together answer the only question that matters for routing cost: is the chosen expert on the same node as the token?

Lines 11–14: Local batch slice

Each rank holds its slice of the global batch. Real DeepSeek-V3 uses ~4 K tokens per device per micro-batch; we use 4 so we can read the buffers by eye.

Execution state: T_local = 4, d = 4, local_x[0].shape = (4, 4)

Lines 16–17: Router is replicated

W_router lives on every rank as a small (E × d) matrix. Tokens never travel to decide where to go; the routing decision is local. This is why the router is not sharded along the expert axis.

Execution state: W_router.shape = (16, 4)

Lines 19–21: Top-k routing as argsort

We compute logits = x · Wᵀ then take the indices of the k largest. In production code this is fused with a softmax for the gate weights, but for traffic accounting we only need the chosen expert ids.

Execution state: picks.shape = (T_local, k) = (4, 2)

Line 23: Phase 1: bundle by destination NODE, not destination GPU

This is the non-obvious step. Instead of building 8 send buffers (one per GPU), we build NODES = 2 send buffers (one per node). Each bundle holds every token from this rank that needs to visit at least one expert on that node. We will let the NVLink hop sort tokens to the right GPU later.

Lines 27–33: A token may end up in multiple bundles

If a token's k experts span both nodes, it goes into both bundles. The duplication is exactly the cost of top-k > 1, and the node cap M will bound how big this duplication can get.

Lines 35–38: IB all-to-all: one collective, one bundle per destination node

Every rank sends one bundle to every remote node. From rank r's view this is exactly NODES − 1 messages on the IB NIC. NCCL fuses them into one all-to-all so the NIC pipelines sends and receives. In the simulation, each bundle is received by the rank in the same local slot on the destination node, so no row is delivered twice.

Line 40: Phase 2: NVLink hop sorts inside the node

Once a bundle lands on the destination node, an intra-node all-to-all (over NVLink) fans tokens out to the exact GPU that owns each chosen expert. NVLink bandwidth is ~12× IB, so this hop is cheap.

Line 54: Inbox = local expert work queue

Every (token, expert) pair that survived both hops lands in inbox[owner(e)]. The expert then runs on its inbox the same way as in the single-device MoE; expert parallelism has been entirely absorbed by the two collectives.

```python
import numpy as np

# Cluster: 2 nodes, 4 GPUs/node = 8 ranks. Experts: 16 (2 per GPU).
NODES, GPUS_PER_NODE = 2, 4
WORLD = NODES * GPUS_PER_NODE          # 8
E, k = 16, 2                            # 16 routed experts, top-k = 2
experts_per_gpu = E // WORLD
node_of = lambda r: r // GPUS_PER_NODE
owner = lambda e: e // experts_per_gpu  # gpu id (0..7) that owns expert e

# Per-GPU local batch: 4 tokens of dim d=4.
d, T_local = 4, 4
rng = np.random.default_rng(0)
local_x = [rng.standard_normal((T_local, d)) for _ in range(WORLD)]

# Replicated router (each rank runs it on its own tokens).
W_router = rng.standard_normal((E, d)) * 0.1

def route_topk(x, k):
    logits = x @ W_router.T            # (T, E)
    return np.argsort(-logits, axis=1)[:, :k]   # (T, k)

# ---- PHASE 1: build per-destination-NODE bundles (IB hop) ----
ib_send_buffers = [[None] * NODES for _ in range(WORLD)]
for r in range(WORLD):
    picks = route_topk(local_x[r], k)        # (T, k)
    for dest_node in range(NODES):
        # Tokens with at least one expert living on dest_node.
        mask = np.array([
            any(node_of(owner(e)) == dest_node for e in picks[t])
            for t in range(T_local)
        ])
        ib_send_buffers[r][dest_node] = local_x[r][mask]

# IB all-to-all: rank r receives from the same-slot rank on every node,
# so each bundle is delivered exactly once (the home-node bundle stays local).
ib_recv = [[ib_send_buffers[s][node_of(r)] for s in range(WORLD)
            if s % GPUS_PER_NODE == r % GPUS_PER_NODE] for r in range(WORLD)]

# ---- PHASE 2: NVLink fan-out to the exact owning GPU ----
inbox = [[] for _ in range(WORLD)]
for r in range(WORLD):
    # Each rank now has bundles destined for *its node*. NVLink hop sorts
    # them to the right GPU within the node.
    for bundle in ib_recv[r]:
        if len(bundle) == 0:
            continue
        # Re-route inside the node by recomputing topk on this small slice
        # (the replicated router makes this deterministic).
        picks = route_topk(bundle, k)
        for t in range(len(bundle)):
            for e in picks[t]:
                if node_of(owner(e)) == node_of(r):
                    inbox[owner(e)].append((bundle[t], e))

print("rank 0 inbox size:", len(inbox[0]))
print("rank 6 inbox size:", len(inbox[6]))
# Every (token, expert) pair is delivered once: WORLD * T_local * k = 64.
assert sum(len(q) for q in inbox) == WORLD * T_local * k
```

Two observations. First, the IB phase does not need to know about the $G = 4$ GPUs per node; it only sees nodes. That collapse from $N \cdot G$ destinations to $N$ destinations is what lets NCCL pipeline IB more efficiently. Second, the NVLink phase is a small local all-to-all of bundles that are already on the right node. The node-limited router determines which bundles get sent, but the structure of the two-phase collective is independent of that.
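The saving from bundling by node rather than by expert can be estimated with a short simulation. The sketch below reuses the toy sizes of the listing above (two nodes, sixteen experts, top-2) and assumes uniformly random picks; it counts how many rows one rank on node 0 would push over IB under each scheme. The constant names are illustrative.

```python
import random

NODES, E, k = 2, 16, 2          # same toy sizes as the listing above
EXPERTS_PER_NODE = E // NODES
HOME_NODE = 0                   # node holding the sending rank
T = 10_000                      # tokens to sample (illustrative)

rng = random.Random(0)
rows_per_expert = 0   # one IB row for every remote (token, expert) pair
rows_per_node = 0     # one IB row for every remote (token, node) pair
for _ in range(T):
    picks = rng.sample(range(E), k)            # k distinct experts, uniform
    remote_nodes = [e // EXPERTS_PER_NODE for e in picks
                    if e // EXPERTS_PER_NODE != HOME_NODE]
    rows_per_expert += len(remote_nodes)
    rows_per_node += len(set(remote_nodes))

print("IB rows, per-expert sends:", rows_per_expert)
print("IB rows, per-node bundles:", rows_per_node)
```

The two counts differ exactly by the tokens whose picks both land on the same remote node; with larger $k$ and more experts per node, that overlap and the saving both grow.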

PyTorch: Hierarchical all_to_all_single

In production this becomes two calls to torch.distributed.all_to_all_single, one on a cross-node process group and one on an intra-node group. The annotated code below is the skeleton; production implementations add capacity padding (so every collective transfers fixed-size buffers) and fuse the size-exchange into the dispatch itself.

hierarchical_a2a_torch.py

Lines 1–2: torch.distributed is the only API you need

all_to_all_single is the single primitive that powers expert parallelism. Wrap it in a process group for cross-node, another for intra-node, and you have a two-tier collective fabric.

Lines 27–28: Two process groups, two networks

nodes_pg contains one rank per node, all sitting in the same local slot; collectively this group runs over InfiniBand. intra_pg is the set of ranks inside one node; this group runs over NVLink. Splitting one logical all-to-all into two physical ones is how we exploit the bandwidth asymmetry.

Execution state: nodes_pg = PG over IB, intra_pg = PG over NVLink

Lines 33–37: Map expert id to node

node_id is just integer division: experts are laid out contiguously across ranks and therefore across nodes. With 16 experts and 2 nodes, experts 0–7 live on node 0 and 8–15 on node 1. This map turns the (T, k) tensor of expert ids into a (T, k) tensor of destination nodes.

Lines 43–47: Bundle by destination node

For each node, find the local row indices that need to visit at least one expert there. .any(dim=1) collapses the k dimension: a token is in this bundle if any of its k experts lives on node n. index_select copies those rows into a contiguous send buffer, which is what IB DMA wants. The expert ids travel with the rows so the receiving node can finish the routing.

Lines 42 and 47: ib_send_sizes, the split tensor

all_to_all_single needs to know how many rows it is sending to each peer. ib_send_sizes[n] = how many rows go to node n. The receiving side learns its own sizes from the first collective.

Execution state: ib_send_sizes.shape = (2,)

Lines 7–10: First collective: exchange sizes

Before we can allocate the receive buffer we need to know how many rows are inbound. A tiny long-tensor all-to-all does that. It is an extra collective on every dispatch, and it is why DeepSeek pre-pads to a fixed capacity in production: one extra all-to-all per layer × 58 layers × hundreds of micro-batches per step adds up.

Lines 11–14: Allocate the receive buffers

sum(out_splits) is the total number of rows we will receive from all peers. We allocate one flat (sum, d) buffer for the rows and one (sum, k) buffer for their expert ids; all_to_all_single will write each peer's contribution into the slice given by output_split_sizes.

Lines 15–17: The all-to-all itself

One call per payload. In phase 1, NCCL schedules NODES − 1 concurrent send/recv pairs on the IB NIC. Bandwidth is the single greatest predictor of how long this takes, and the whole reason for node-limited routing is to shrink ib_send_sizes.sum() per token.

Lines 50–62: Phase 2 mirrors phase 1 over NVLink

After IB, every node holds every token that wants to visit it. A second exchange, this time on intra_pg over NVLink, sorts tokens to the exact GPU. NVLink delivers ~600 GB/s per GPU vs ~50 GB/s for IB, so this hop usually finishes faster than phase 1 even when it moves as many rows or more.

Lines 64–65: Return = inputs to the experts

What returns is exactly what an isolated expert wants to see: rows of x and the expert ids to apply. From the expert's point of view, all distribution has vanished; it sees a normal forward pass.

```python
import torch
import torch.distributed as dist

def _exchange_rows(x_chunks, id_chunks, send_sizes, group):
    """Variable-size all-to-all of token rows plus their expert ids."""
    send_x, send_ids = torch.cat(x_chunks, dim=0), torch.cat(id_chunks, dim=0)
    # First collective: tell every peer how many rows to expect.
    recv_sizes = torch.empty_like(send_sizes)
    dist.all_to_all_single(recv_sizes, send_sizes, group=group)
    in_splits, out_splits = send_sizes.tolist(), recv_sizes.tolist()
    # Allocate flat receive buffers, then move the payload.
    total = sum(out_splits)
    recv_x = send_x.new_empty((total, send_x.shape[1]))
    recv_ids = send_ids.new_empty((total, send_ids.shape[1]))
    for out, inp in ((recv_x, send_x), (recv_ids, send_ids)):
        dist.all_to_all_single(out, inp, output_split_sizes=out_splits,
                               input_split_sizes=in_splits, group=group)
    return recv_x, recv_ids

def moe_dispatch_hierarchical(
    x: torch.Tensor,          # (T, d) local tokens
    expert_ids: torch.Tensor, # (T, k) chosen expert ids per token
    *,
    world_size: int,          # total ranks
    gpus_per_node: int,
    num_experts: int,         # total routed experts, laid out contiguously by rank
    nodes_pg: dist.ProcessGroup,   # cross-node group: 1 rank per node-slot
    intra_pg: dist.ProcessGroup,   # intra-node group: ranks within one node
) -> tuple[torch.Tensor, torch.Tensor]:
    """Two-phase MoE dispatch. Returns (recv_x, recv_expert_ids)."""
    nodes = world_size // gpus_per_node
    device = x.device
    # Assumes node-major ranks: expert e lives on rank e // experts_per_gpu.
    experts_per_gpu = num_experts // world_size
    owner_rank = lambda eid: torch.div(eid, experts_per_gpu, rounding_mode="floor")
    node_id = lambda eid: torch.div(owner_rank(eid), gpus_per_node, rounding_mode="floor")
    target_nodes = node_id(expert_ids)            # (T, k)

    # ---------- Phase 1: cross-node IB all-to-all ----------
    # For each node, gather the rows of x (and their ids) that need to visit it.
    x_chunks, id_chunks = [], []
    ib_send_sizes = torch.zeros(nodes, dtype=torch.long, device=device)
    for n in range(nodes):
        rows = (target_nodes == n).any(dim=1).nonzero(as_tuple=True)[0]
        x_chunks.append(x.index_select(0, rows))
        id_chunks.append(expert_ids.index_select(0, rows))
        ib_send_sizes[n] = rows.numel()
    ib_recv, ib_recv_ids = _exchange_rows(x_chunks, id_chunks, ib_send_sizes, nodes_pg)

    # ---------- Phase 2: intra-node NVLink all-to-all ----------
    # ib_recv now holds every token that needs an expert *on this node*.
    # Bucket by the exact owning GPU within the node.
    first_rank = (dist.get_rank() // gpus_per_node) * gpus_per_node
    target_ranks = owner_rank(ib_recv_ids)        # (R, k) global owner ranks
    x_chunks, id_chunks = [], []
    nv_send_sizes = torch.zeros(gpus_per_node, dtype=torch.long, device=device)
    for g in range(gpus_per_node):
        rows = (target_ranks == first_rank + g).any(dim=1).nonzero(as_tuple=True)[0]
        x_chunks.append(ib_recv.index_select(0, rows))
        id_chunks.append(ib_recv_ids.index_select(0, rows))
        nv_send_sizes[g] = rows.numel()
    recv_x, recv_ids = _exchange_rows(x_chunks, id_chunks, nv_send_sizes, intra_pg)

    # recv_ids still lists all k picks per row; each expert keeps only its own id.
    return recv_x, recv_ids
```
NCCL note. In NCCL ≥ 2.18, all_to_all_single on a process group spanning a single node uses NVLink/NVSwitch directly. On a cross-node group it uses the IB NIC. By splitting the collective into two process groups, we let NCCL pick the right transport for each phase — and we expose the two phases to the overlap scheduler in DualPipe, which is the topic of the next section.
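One possible way to construct the two process groups is sketched below. It assumes node-major rank numbering (rank = node × gpus_per_node + local slot), which is how launchers such as torchrun usually number ranks but should be verified on a given cluster. The helper names `group_rank_lists` and `build_groups` are hypothetical; only `dist.get_rank` and `dist.new_group` are real torch.distributed calls.

```python
def group_rank_lists(world_size: int, gpus_per_node: int):
    """Rank lists for the intra-node groups and the cross-node (same-slot) groups."""
    nodes = world_size // gpus_per_node
    intra = [list(range(n * gpus_per_node, (n + 1) * gpus_per_node))
             for n in range(nodes)]
    cross = [[n * gpus_per_node + slot for n in range(nodes)]
             for slot in range(gpus_per_node)]
    return intra, cross

def build_groups(world_size: int, gpus_per_node: int):
    """Create (nodes_pg, intra_pg) for the calling rank.

    Needs an initialised default process group, so it is not executed here.
    """
    import torch.distributed as dist

    rank = dist.get_rank()
    intra_lists, cross_lists = group_rank_lists(world_size, gpus_per_node)
    nodes_pg = intra_pg = None
    # Every rank must call new_group for every group, in the same order,
    # even for groups it does not belong to.
    for ranks in intra_lists:
        pg = dist.new_group(ranks=ranks)
        if rank in ranks:
            intra_pg = pg
    for ranks in cross_lists:
        pg = dist.new_group(ranks=ranks)
        if rank in ranks:
            nodes_pg = pg
    return nodes_pg, intra_pg

intra, cross = group_rank_lists(world_size=8, gpus_per_node=4)
print("intra-node groups:", intra)
print("cross-node groups:", cross)
```

Because the rank lists are ascending, the position of a rank inside its cross-node group equals its node index, and its position inside the intra-node group equals its local slot. A split-size tensor indexed by node or by slot relies on exactly that ordering.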

What Changes at 14.8T Tokens and 671B Parameters

Every quantity in the cost model scales with model and cluster size. At DeepSeek-V3 scale the relevant numbers are roughly:

| Quantity | Value at V3 scale | Why it matters |
| --- | --- | --- |
| Routed experts E | 256 | Owners cover dozens of GPUs |
| Top-k | 8 | Eight cross-node hops per token without a cap |
| Hidden dim d | 7168 | 14 KB per activation row in BF16 |
| Nodes N | 8 H800 nodes (training group) | Sets the unconstrained 1/N home-node probability |
| GPUs / node | 8 | NVLink fanout target after IB hop |
| Node cap M | 4 | The single hyperparameter that balances IB ≈ NVLink |
| MoE layers | 58 | Per-step IB time multiplies by this |
| Tokens / device / step | ~4096 | Sets per-dispatch bytes |

Three things break or matter at this scale that did not in the 4-rank toy.

1. Capacity padding turns dynamic traffic into static

Real all_to_all_single calls are far cheaper when every rank knows the buffer sizes ahead of time — NCCL can skip the size-exchange and pre-allocate. Production EP rounds every per-expert token count up to a fixed capacity, drops the overflow, and pads the underflow with zeros. The system trades a small fraction of routed tokens for a deterministic, schedulable collective. The capacity factor (typically 1.25× the expected load) is one of the very few hyperparameters the operator has to tune after launch.
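A minimal NumPy sketch of capacity padding, under stated assumptions: each incoming row targets a single expert, overflow rows are dropped in arrival order, and the 1.25× factor is the one quoted above. The function name `pad_to_capacity` and the toy sizes are illustrative, and real systems do this on the GPU rather than in a Python loop.

```python
import math
import numpy as np

def pad_to_capacity(x, expert_ids, num_experts, capacity):
    """Pack rows into a fixed (num_experts, capacity, d) buffer.

    Rows beyond an expert's capacity are dropped; unused slots stay zero.
    Returns the buffer and a boolean mask of the rows that were kept.
    """
    rows, d = x.shape
    buf = np.zeros((num_experts, capacity, d), dtype=x.dtype)
    fill = np.zeros(num_experts, dtype=np.int64)   # slots used per expert
    kept = np.zeros(rows, dtype=bool)
    for i, e in enumerate(expert_ids):
        if fill[e] < capacity:
            buf[e, fill[e]] = x[i]
            fill[e] += 1
            kept[i] = True
    return buf, kept

rng = np.random.default_rng(0)
num_experts, rows, d = 4, 32, 8
x = rng.standard_normal((rows, d))
expert_ids = rng.integers(0, num_experts, size=rows)
capacity = math.ceil(1.25 * rows / num_experts)    # 1.25x the mean load

buf, kept = pad_to_capacity(x, expert_ids, num_experts, capacity)
print("buffer shape:", buf.shape)
print("rows kept   :", int(kept.sum()), "of", rows)
```

Since the buffer shape depends only on `num_experts`, `capacity`, and `d`, every rank can allocate it before routing is known, which is what removes the size exchange from the collective.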

2. The number of expert-parallel ranks is bounded by the network

Above $N = 16$ nodes or so, even node-limited routing's constant $M$ hops begin to saturate IB on smaller clusters. DeepSeek-V3 keeps EP groups at one training sub-cluster (8 nodes = 64 GPUs), replicates that group via ZeRO data parallelism, and never grows EP horizontally. Scale comes from more DP replicas, not more EP nodes.

3. Failure modes are network-shaped

A flaky IB cable or a hot-spot port on a single node makes the EP all-to-all tail-latency-dominated. Because all ranks block on the same collective, one slow link freezes the whole step. Production clusters monitor per-link NIC counters and rebalance EP groups before the slow link becomes the global bottleneck.

Engineering Reality: Overlap, Topology, and Stragglers

Node-limited routing is necessary but not sufficient. Even with $M = 4$, IB time per MoE layer is still on the order of compute time per MoE layer. Doing nothing else means the cluster spends half its life on the network. The way DeepSeek-V3 closes the loop is by making the dispatch and combine overlap with the compute of another micro-batch.

  • Compute-comm overlap. While micro-batch $m$ is running its MoE expert compute, the dispatch all-to-all for micro-batch $m + 1$ is already in flight on the NIC. The two queues run in parallel; the scheduler must be careful not to let IB drain before the next compute is ready.
  • Topology-aware process groups. The cross-node group should map each node to one IB-attached rank that aggregates and forwards. Otherwise eight ranks per node fight for the same NIC and you lose the asymmetry advantage immediately.
  • Straggler shielding. The slowest IB link in the cluster sets the wall time of the collective. Production clusters run weekly NIC linkspeed checks and quarantine nodes that drift.
  • Dropless vs lossy capacity. "Dropless" EP, where overflow tokens are re-routed to spare experts, removes the quality penalty of capacity dropping but pushes more bytes on the wire. Pick lossy capacity in pre-training (sharper budgets, slight quality cost) and dropless in fine-tuning (where a single dropped token can pollute a small SFT dataset).
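The benefit of overlap can be illustrated with a toy timeline model. It treats each micro-batch as one communication phase followed by one compute phase and lets the transfer for micro-batch $m + 1$ run while micro-batch $m$ computes. The 7 ms and 10 ms figures are the per-layer estimates used earlier in this section; the function names and the two-stage simplification are assumptions of this sketch, not a description of DualPipe's actual schedule.

```python
def serial_time(n_micro: int, comm_ms: float, compute_ms: float) -> float:
    """No overlap: every micro-batch waits for its own transfer, then computes."""
    return n_micro * (comm_ms + compute_ms)

def overlapped_time(n_micro: int, comm_ms: float, compute_ms: float) -> float:
    """Two-stage pipeline: after the first transfer, each step costs the
    slower of the two stages, and the last compute finishes the run."""
    return comm_ms + (n_micro - 1) * max(comm_ms, compute_ms) + compute_ms

n_micro, comm_ms, compute_ms = 8, 7.0, 10.0
t_serial = serial_time(n_micro, comm_ms, compute_ms)
t_overlap = overlapped_time(n_micro, comm_ms, compute_ms)

print(f"serial    : {t_serial:.0f} ms")
print(f"overlapped: {t_overlap:.0f} ms")
print(f"hidden    : {100 * (1 - t_overlap / t_serial):.0f}% of wall time")
```

When communication takes longer than compute, the `max` term switches to the network time and the overlap stops hiding it, which is why the node cap and the overlap scheduler have to be tuned together.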
The takeaway. Expert parallelism is what lets a single 671B MoE exist on a fleet of 80 GB GPUs. The cost of that existence is two cross-node all-to-all collectives per MoE layer per micro-batch — and on commodity InfiniBand fabrics, that cost is roughly an order of magnitude more expensive than the equivalent intra-node collective. Node-limited routing collapses the gap by bounding the number of nodes a token may visit, restoring rough parity between IB and NVLink wall time. DualPipe (next section) then hides the remaining cost behind compute. Without both tricks, V3-class training is not bandwidth-feasible.