The Real Problem: Bubbles Burn Money
Pipeline parallelism is a deal with the devil. You split the model vertically across $R$ GPUs: rank 0 holds the embedding and the first few transformer blocks, rank 1 holds the next slab, and so on until rank $R-1$ holds the final blocks and the loss head. Each microbatch then flows forward through the ranks one at a time, like products on a factory belt. The good news is that you can train a model that does not fit on a single GPU: DeepSeek-V3 has 671 billion parameters, and no single H800 can hold even a tenth of that. The bad news is that, at any given moment, a large share of your GPUs are doing nothing.
Look at what happens during the very first microbatch. At time 0, rank 0 starts computing the forward pass for microbatch 0. Ranks 1 through $R-1$ are idle because they have no inputs yet. At time 1, rank 0 finishes microbatch 0 and starts microbatch 1, while rank 1 receives microbatch 0's activations and finally starts working. But ranks 2 through $R-1$ are still idle. It takes $R-1$ full time steps before every rank is busy. Then the backward pass produces the symmetric problem on the way out: as microbatches finish, ranks start dropping off the back. Those idle slots, the pipeline bubble, are pure cost.
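The fill-up described above is easy to reproduce. The sketch below is a minimal illustration, assuming unit-time forward chunks and that microbatch `m` reaches rank `r` in slot `m + r`; the function name `busy_ranks` and the values `R = 4`, `M = 6` are hypothetical choices for the example, not a DeepSeek configuration.

```python
def busy_ranks(t: int, R: int, M: int) -> list[int]:
    """Ranks running a forward chunk in slot t.

    Assumes microbatch m occupies rank r in slot m + r (one slot per hop).
    """
    return [r for r in range(R) if 0 <= t - r < M]


R, M = 4, 6  # illustrative values only
for t in range(M + R - 1):
    active = busy_ranks(t, R, M)
    row = "".join("#" if r in active else "." for r in range(R))
    print(f"t={t}  {row}  ({len(active)}/{R} ranks busy)")
```

With these values every rank is busy only from slot `R - 1 = 3` through slot 5; before and after that window some ranks sit idle.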
For the simplest GPipe-style schedule the bubble fraction is $\frac{R-1}{M+R-1}$, where $M$ is the number of microbatches per step. When $R$ and $M$ put that fraction at about 30%, that share of every GPU-hour is wasted. On a 2,048-GPU run that is the equivalent of running ~614 H800s for the whole training period purely to push electrons through cooling fans. The 1F1B improvement of Megatron-LM brings this down but not below ~15% in any realistic configuration.
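To make the cost concrete, here is a small sketch of the arithmetic. It assumes the usual closed form for a GPipe-style schedule, $(R-1)/(M+R-1)$; the helper names are hypothetical.

```python
def gpipe_bubble_fraction(R: int, M: int) -> float:
    """Idle share of a GPipe-style schedule: (R - 1) / (M + R - 1)."""
    if R < 1 or M < 1:
        raise ValueError("R and M must be positive")
    return (R - 1) / (M + R - 1)


def idle_gpu_equivalents(n_gpus: int, bubble: float) -> float:
    """How many GPUs' worth of capacity the bubble wastes."""
    return n_gpus * bubble


print(f"bubble at R=8, M=32: {gpipe_bubble_fraction(8, 32):.1%}")
print(f"GPUs idled by a 30% bubble on 2,048: {idle_gpu_equivalents(2048, 0.30):.0f}")
```

The second line reproduces the ~614 H800 figure quoted above from a 30% bubble.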
On top of the bubble, modern MoE models add a second cost: every forward and backward through an MoE block contains a cross-node all-to-all that shuffles tokens between experts. DeepSeek-V3 has 256 routed experts spread across 8 nodes, and a single token may be dispatched to experts on up to four different nodes over InfiniBand. If that all-to-all sits on the critical path (that is, if compute has to wait for it), your GPU utilisation drops a second time. Even with zero pipeline bubbles, a naïve MoE step on DeepSeek-V3's topology can spend ~35% of wallclock in communication.
DualPipe, introduced in the DeepSeek-V3 technical report (Section 3.2.1), attacks both costs at once. It rearranges the pipeline so that two microbatch streams flow in opposite directions simultaneously, collapsing the bubble to a sliver at the edges of each diamond. And it co-designs an SM-resident all-to-all kernel that runs on the same chip as the GEMM kernels, so the cross-node comms is hidden behind compute rather than gating it. The net effect: at the same $R$ and $M$, DualPipe's bubble fraction is roughly halved (from about 18% to about 9% at $R = 8$, $M = 32$) and the comms time is largely hidden. That is one of the main reasons a 2,048-GPU training of a 671 B-parameter MoE finishes in roughly two months on a budget of about 5.6 million USD.
Intuition: Two Pipelines Crossing in the Middle
Imagine a single-track railway with eight stations. Trains start from station 0, stop at every station for one minute of loading, and exit at station 7. If you launch one train per minute from station 0, then after eight minutes you have eight trains in motion, one at each station — full utilisation. But the first seven minutes, and the last seven, you have stations sitting empty. That is the 1F1B bubble.
DualPipe's trick is to also run trains in the opposite direction, starting from station 7 and arriving at station 0. Now during minute 1, station 0 is loading a train bound for station 7, and station 7 is already loading a train bound for station 0. By minute 4 every station is busy with one train, and by minute 8 every station is loading two trains at once, one heading each way. The two streams cross in the middle, like a double helix. As long as each station has the physical capacity to handle two trains simultaneously, the whole network runs at twice the throughput in the same wallclock.
For pipeline parallelism this analogy translates almost directly. An H800 GPU has 132 streaming multiprocessors (SMs). Loading two trains simultaneously means running two pieces of model work in parallel: a forward chunk of one microbatch and a backward chunk of another. CUDA streams give you the primitive (two streams can execute concurrently); SM partitioning gives you the resource (you can split the 132 SMs between the two streams). The scheduling job, figuring out which two microbatches each rank handles in each time slot, is what DualPipe formalises.
The second piece of intuition is harder. Even with zero bubbles, a rank in the middle of the pipeline still has to send activations forward and receive gradients backward, and it also has to dispatch tokens to MoE experts on other nodes and then combine the results. If those four comms operations sit on the critical path, the GPU stalls. DualPipe's second move is to write the all-to-all as an SM-resident kernel: instead of launching an NCCL operation that blocks on the network, you spin up a small kernel that occupies ~20 of your 132 SMs and asynchronously pulls bytes over NVLink and InfiniBand while the other 112 SMs grind through GEMMs. The comms cost becomes a fixed tax on SM occupancy, not a variable wait on the network.
The Math of the Bubble
Let us write down the bubble fraction precisely, because the improvement from 1F1B to DualPipe is small to state and large in wallclock impact. For a pipeline of depth $R$ running $M$ microbatches, the simplest GPipe schedule has a forward sweep of length $M + R - 1$ time slots, followed by a backward sweep of the same length. Total wallclock:
$$T_{\text{GPipe}} = 2\,(M + R - 1)$$
and the bubble fraction (idle cells divided by total cells) works out to:
$$\beta_{\text{GPipe}} = \frac{R - 1}{M + R - 1}$$
The 1F1B optimisation of Megatron-LM interleaves forwards and backwards in the steady state but does not change the asymptotic bubble; it still spends $R - 1$ slots in warm-up and $R - 1$ in cool-down, so:
$$\beta_{\text{1F1B}} = \frac{R - 1}{M + R - 1}$$
DualPipe runs two streams in opposite directions on the same set of ranks. The schedule length collapses to:
$$T_{\text{DualPipe}} = M + R - 1$$
which is half the GPipe length, because the forward and backward sweeps now happen concurrently rather than sequentially. Of the $R\,(M + R - 1)$ cells in this grid, a clean count (checked by hand for $R = 4$, $M = 4$ in the walkthrough below) gives:
$$\beta_{\text{DualPipe}} = \frac{R - 1}{2\,(M + R - 1)}$$
Half the 1F1B bubble at the same $R$ and $M$.
With a single ring of $R = 8$ and $M = 32$ (production DeepSeek-V3 numbers), the 1F1B bubble is $7/39 \approx 17.9\%$ and DualPipe's bubble drops to $7/78 \approx 9.0\%$. The takeaway is not the exact fraction, which depends on how cleanly you can saturate both streams in your particular ring, but the factor of two. Halving the bubble at fixed $R$ and $M$ is one of the largest constant-factor wallclock wins in distributed-training literature since 1F1B itself.
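The two closed forms are easy to tabulate side by side. The sketch below assumes exactly what this section states, namely $(R-1)/(M+R-1)$ for 1F1B and half of that for DualPipe; the function names are hypothetical, and exact fractions are used so that rounding cannot hide a mistake.

```python
from fractions import Fraction


def bubble_1f1b(R: int, M: int) -> Fraction:
    """Bubble fraction (R - 1) / (M + R - 1)."""
    return Fraction(R - 1, M + R - 1)


def bubble_dualpipe(R: int, M: int) -> Fraction:
    # Assumption taken from the text: half the 1F1B bubble at the same R, M.
    return bubble_1f1b(R, M) / 2


for M in (8, 16, 32, 64):
    a, b = bubble_1f1b(8, M), bubble_dualpipe(8, M)
    print(f"R=8 M={M:>2}  1F1B={float(a):6.1%}  DualPipe={float(b):6.1%}")
```

Sweeping $M$ this way shows both bubbles shrinking as $M$ grows while the ratio between them stays fixed at two.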
What the overlap actually buys
Bubble fraction is necessary but not sufficient. The second number that matters is the overlap fraction: how many of the busy cells contain both a forward and a backward simultaneously. Call this $\omega$. The wallclock per cell is the longer of forward and backward; call it $\tau$. The total wallclock is:
$$T = (M + R - 1)\,\tau$$
independent of whether the cells are overlapped. The overlap shows up as doubled work per cell, not as a shorter cell, because the rank does a forward and a backward for two different microbatches in the time it would otherwise take to do just one. Combined with the halved schedule length from running two streams concurrently, DualPipe ends up doing $2M$ microbatches' worth of work in the wallclock of $M$ microbatches of 1F1B. In the ideal case that is a clean 2× efficiency win, at the cost of one more CUDA stream per rank and a careful SM-budget allocation.
Manual Numerical Walkthrough
Two worked examples by hand: counting the bubbles in a 4×4 grid, and sizing the wallclock saving for one DeepSeek-V3 training step.
Counting bubbles by hand at R = 4, M = 4
Setup. 4 ranks, 4 microbatches per stream. Schedule length $M + R - 1 = 7$ slots. Grid has $4 \times 7 = 28$ cells.
Step 1: place stream A (left → right). Microbatch $m$ is at rank $r$ in slot $t = m + r$. Filling the grid:
| | t=0 | t=1 | t=2 | t=3 | t=4 | t=5 | t=6 |
|---|---|---|---|---|---|---|---|
| r0 | F0 | F1 | F2 | F3 | . | . | . |
| r1 | . | F0 | F1 | F2 | F3 | . | . |
| r2 | . | . | F0 | F1 | F2 | F3 | . |
| r3 | . | . | . | F0 | F1 | F2 | F3 |
Step 2: place stream B (right → left). Stream B enters at rank 3 and travels toward rank 0. After overlaying:
| | t=0 | t=1 | t=2 | t=3 | t=4 | t=5 | t=6 |
|---|---|---|---|---|---|---|---|
| r0 | F0 | F1 | F2 | F3+B3 | B2 | B1 | B0 |
| r1 | . | F0+B3 | F1+B2 | F2+B1 | F3+B0 | . | . |
| r2 | B3 | F0+B2 | F1+B1 | F2+B0 | F3 | . | . |
| r3 | B3 | B2 | B1+F0 | F1 | F2 | F3 | . |
Step 3: count. Idle cells (the "." slots): 6. Busy: $28 - 6 = 22$. Bubble fraction: $6/28 \approx 21.4\%$. Cells containing both an F and a B (the overlap cells), 9 in total, make the overlap fraction $9/22 \approx 40.9\%$.
Compare with 1F1B at the same (R, M). The 1F1B grid is twice as long: $2\,(M + R - 1) = 14$ slots, 56 cells. Bubble cells: 6 in each of the four triangular corners, so 24 idle cells out of 56, a $42.9\%$ bubble. At this small $M$ the absolute saving is about 21.4 percentage points of GPU-time, or roughly 33% wallclock reduction.
Sanity check the asymptote. As $M \to \infty$, both schedules approach 0% bubble: pipeline bubbles are an O(R/M) tax. The benefit of DualPipe is not that it beats 1F1B asymptotically; it is that it beats 1F1B at the small $M$ values forced by activation memory.
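The hand count can be double-checked mechanically. The sketch below transcribes the overlaid grid from Step 2 as text and counts idle, busy, and overlap cells; it also builds a sequential forward-then-backward baseline grid for comparison. Function names are hypothetical, and the baseline assumes the backward sweep simply mirrors the forward one.

```python
GRID = """
r0 | F0 | F1 | F2 | F3+B3 | B2 | B1 | B0
r1 | . | F0+B3 | F1+B2 | F2+B1 | F3+B0 | . | .
r2 | B3 | F0+B2 | F1+B1 | F2+B0 | F3 | . | .
r3 | B3 | B2 | B1+F0 | F1 | F2 | F3 | .
"""


def count_cells(grid_text: str) -> tuple[int, int, int]:
    """Return (idle, busy, overlap) for a pipe-separated schedule grid."""
    idle = busy = overlap = 0
    for line in grid_text.strip().splitlines():
        cells = [c.strip() for c in line.split("|")[1:]]  # drop the rank label
        for cell in cells:
            if cell == ".":
                idle += 1
            else:
                busy += 1
                if "F" in cell and "B" in cell:
                    overlap += 1
    return idle, busy, overlap


def sequential_cells(R: int, M: int) -> tuple[int, int]:
    """Return (idle, total) for a forward sweep followed by a backward sweep."""
    sweep = M + R - 1
    busy = set()
    for m in range(M):
        for r in range(R):
            busy.add((r, m + r))                    # forward sweep
            busy.add((r, sweep + m + (R - 1 - r)))  # backward sweep
    total = R * 2 * sweep
    return total - len(busy), total


idle, busy, overlap = count_cells(GRID)
print(f"overlaid grid: idle={idle} busy={busy} overlap={overlap}")
print(f"bubble={idle / (idle + busy):.1%}  overlap fraction={overlap / busy:.1%}")
print("sequential baseline (idle, total):", sequential_cells(4, 4))
```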
Sizing the dollar saving for DeepSeek-V3
The run. 2,048 H800 GPUs, ~2 months wallclock, quoted compute cost of about 5.6M USD. Pipeline-parallel depth $R = 8$ within each node; data-parallel replicas across the 256 nodes; expert-parallel across nodes.
Microbatches per step. DeepSeek-V3 uses $M = 32$ microbatches per pipeline cycle, sequence length 4,096, microbatch size 4. Summed across the 256 data-parallel replicas, the global batch is reported in the paper as 15,360 sequences, which at sequence length 4,096 is roughly 63M tokens per step.
Bubble at (R=8, M=32). 1F1B: $7/39 \approx 17.9\%$. DualPipe: $7/78 \approx 9.0\%$. Saving: about 9 percentage points of GPU-time.
Translate to wallclock. At the realised MFU (DeepSeek-V3 reports ~40% on H800 in FP8), the bubble saving of roughly 9% reduces wallclock by roughly the same fraction: about 5 days shaved off a 60-day run. At the quoted rental price of 2 USD/GPU-hour, that is roughly 0.5M USD saved on a single training run, a half-million-dollar scheduling improvement.
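The dollar figure follows from three multiplications. The sketch below assumes a rental price of 2 USD per GPU-hour (the rate the DeepSeek-V3 report uses for its cost estimate) and takes the bubble saving as $7/39 - 7/78$ for $R = 8$, $M = 32$; the function name is hypothetical.

```python
def bubble_saving(n_gpus: int, run_days: float, saving_fraction: float,
                  usd_per_gpu_hour: float) -> tuple[float, float]:
    """Return (days saved, USD saved) for a given fractional wallclock saving."""
    days_saved = run_days * saving_fraction
    gpu_hours_saved = n_gpus * days_saved * 24
    return days_saved, gpu_hours_saved * usd_per_gpu_hour


saving_fraction = 7 / 39 - 7 / 78  # 1F1B bubble minus DualPipe bubble
days, usd = bubble_saving(2048, 60, saving_fraction, 2.0)
print(f"{days:.1f} days, about {usd / 1e6:.2f}M USD")
```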
Add the comms overlap. The bubble reduction is half the story; the other half is the SM-resident all-to-all, which hides another ~25% of wallclock that would have been spent in dispatch and combine. Combined, the paper attributes about a 1.6× end-to-end speed-up to DualPipe over a standard 1F1B + separate-NCCL-stream baseline. That is most of the difference between "could be trained" and "was trained for under six million dollars".
Visualizing DualPipe Side-by-Side With 1F1B
Three widgets, each isolating one layer of the DualPipe story. The first is the headline comparison: drag the $R$ and $M$ sliders and watch the bubble fraction in both schedules change in real time. Press Play to animate microbatches sweeping through. The 1F1B grey corners are the bubbles; the DualPipe green cells are the overlapped forward+backward slots that eat them.
The second widget zooms into a single DualPipe cycle and shows the four sub-phases of each microbatch — forward attention, forward MoE MLP, backward input gradient, backward weight gradient — and the compute/comms split inside each cell. Toggle the overlap checkbox to see how the schedule collapses to half the wallclock when comms can run concurrently with compute.
The third widget makes the SM-budget trade-off explicit. An H800 has 132 SMs; DualPipe reserves a slice (typically ~20) for the SM-resident all-to-all kernel. Move the slider and watch compute throughput (sky bar) trade off against comms bandwidth (emerald bar). The U-curve at the bottom is the iteration wallclock — note the valley around 20 SMs, which is the operating point DeepSeek-V3 picks in their published recipes.
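The U-curve in the third widget can be imitated with a toy model. This is only a sketch: it assumes compute time scales inversely with the SMs left for compute, comms time scales inversely with the SMs given to comms, and the two run fully overlapped. The constants `COMPUTE_WORK` and `COMMS_WORK` are hypothetical and were chosen so that the valley lands at 20 SMs; they are not measurements.

```python
TOTAL_SMS = 132        # SMs on an H800
COMPUTE_WORK = 112.0   # hypothetical units of compute work
COMMS_WORK = 20.0      # hypothetical units of comms work


def iteration_time(comm_sms: int) -> float:
    """Toy wallclock for one iteration with comm_sms SMs reserved for comms."""
    if not 0 < comm_sms < TOTAL_SMS:
        raise ValueError("comm_sms must leave SMs for both sides")
    compute_t = COMPUTE_WORK / (TOTAL_SMS - comm_sms)
    comms_t = COMMS_WORK / comm_sms
    return max(compute_t, comms_t)  # overlapped, so the slower side binds


best_split = min(range(1, TOTAL_SMS), key=iteration_time)
print(f"best split: {best_split} SMs for comms, time {iteration_time(best_split):.3f}")
for s in (5, 10, 20, 40, 80):
    print(f"comm_sms={s:>3}  time={iteration_time(s):.3f}")
```

To the left of the valley the comms side is starved and binds; to the right the compute side loses SMs and binds instead.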
The Second Win: Hiding All-to-All Behind Compute
Reducing the bubble is the headline DualPipe number, but it is not the only thing the algorithm does — and at MoE scale it may not even be the biggest win. The second move is making all-to-all comms hide behind compute. To understand why that matters, think about what happens inside one MoE block on one rank during one microbatch.
- Dispatch. The block's input tokens are routed to experts via a top-k gate. With 256 experts spread across 8 nodes, every node has to send roughly 7/8 of its tokens off-node, over InfiniBand. Volume per token: about 7 KB in FP8 (one byte per element at DeepSeek-V3's hidden size of 7,168). For an 8K-token microbatch that is roughly 50 MB of network traffic per microbatch per layer.
- Expert compute. Each expert runs a small MLP on its assigned tokens. This is GEMM-bound and lives on the tensor cores.
- Combine. Tokens flow back from experts to their originating ranks, weighted by gate scores. Same volume as dispatch.
In a naïve implementation, you serialise: dispatch → compute → combine. At DeepSeek-V3's scale the dispatch and combine together take ~25-40% of an MoE block's wallclock, even though they perform almost none of the block's arithmetic. Wallclock is bound by bandwidth, not compute. If your tensor cores idle during dispatch and combine, you have effectively cut H800 throughput by a third.
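A back-of-envelope estimate of the dispatch volume looks like this. The sketch assumes experts are spread uniformly over the nodes (so a fraction $(n-1)/n$ of tokens leaves the node), one byte per element in FP8, and a hidden size of 7,168, which is DeepSeek-V3's published model dimension rather than a number from this section. It also ignores the duplication that top-k routing introduces. The function name is hypothetical.

```python
def offnode_bytes_per_dispatch(tokens: int, hidden: int,
                               bytes_per_elem: int, n_nodes: int) -> float:
    """Bytes sent off-node for one dispatch under a uniform-routing assumption."""
    off_fraction = (n_nodes - 1) / n_nodes
    return tokens * hidden * bytes_per_elem * off_fraction


volume = offnode_bytes_per_dispatch(tokens=8192, hidden=7168,
                                    bytes_per_elem=1, n_nodes=8)
print(f"{volume / 1e6:.1f} MB per dispatch per layer")
```

Combine moves the same volume in the opposite direction, so the per-layer total under these assumptions is twice this figure.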
DualPipe's SM-resident all-to-all writes the dispatch/combine as a CUDA kernel that runs on SMs rather than a NCCL operation launched on a separate stream. Concretely, the kernel spins up on a fixed slice of SMs (typically 20 of 132), allocates send/receive buffers in pinned host memory, and uses GPUDirect to DMA bytes between the NIC and HBM without ever stalling on a kernel boundary. The compute kernels (GEMMs, softmax, layernorm) get the other 112 SMs and run concurrently.
Because the comms kernel and the compute kernels are launched on separate CUDA streams and consume disjoint SM partitions, the hardware can intermix them at sub-microsecond granularity. There is no per-transfer launch latency on the comms side, and the GPU's warp scheduler keeps both pools of SMs busy. At the right operating point (the valley of the third widget above), the binding side of the iteration is neither comms nor compute alone: both run at peak, and the total wallclock is $\max(T_{\text{compute}}, T_{\text{comms}})$ rather than $T_{\text{compute}} + T_{\text{comms}}$.
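The max-versus-sum distinction is the whole point, so it is worth spelling out. The sketch below assumes perfect overlap and uses hypothetical stage times; the function name is also hypothetical.

```python
def block_wallclock(compute_ms: float, comms_ms: float, overlapped: bool) -> float:
    """Serial execution adds the two costs; perfect overlap keeps the larger."""
    if overlapped:
        return max(compute_ms, comms_ms)
    return compute_ms + comms_ms


compute_ms, comms_ms = 3.0, 2.0  # hypothetical stage times
serial = block_wallclock(compute_ms, comms_ms, overlapped=False)
hidden = block_wallclock(compute_ms, comms_ms, overlapped=True)
print(f"serial={serial} ms  overlapped={hidden} ms  saving={1 - hidden / serial:.0%}")
```

Note that if comms were the larger term, overlap would leave comms as the binding side, which is why the SM split matters.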
| Block stage | Without overlap | With SM-resident all-to-all | Saving |
|---|---|---|---|
| Attention (compute) | 1.0 ms | 1.0 ms | 0% |
| Dispatch (comms) | 1.2 ms | hidden | 100% |
| MoE expert MLP (compute) | 1.5 ms | 1.5 ms | 0% |
| Combine (comms) | 1.2 ms | hidden | 100% |
| Total per layer | 4.9 ms | 2.5 ms | 49% |
The numbers above are illustrative (DeepSeek does not publish a per-layer breakdown), but the structure is exact: the 49% per-layer wallclock saving compounds across the 61 transformer blocks of DeepSeek-V3, on top of the bubble reduction. Multiply across a 2-month, 2,048-GPU run and you save tens of days of compute, which is most of the difference between "technically feasible" and "economically sane".
Plain Python: Scheduling a DualPipe Cycle
Before we touch CUDA, let us build the scheduler in plain Python. The function below produces the exact grid you saw in the visualisation above: for every (rank, time) cell, which microbatch is doing its forward and which is doing its backward. The bubble and overlap fractions are computed directly from the grid. This is the kernel of the algorithm; the actual CUDA code is just careful bookkeeping around send/recv and CUDA streams.
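A minimal sketch of such a scheduler follows. It assumes an idealised mirrored placement: stream A's microbatch `m` runs on rank `r` in slot `m + r`, and stream B's microbatch `m` runs on rank `r` in slot `m + (R - 1 - r)`. That placement is an assumption made for illustration, not the production DeepSeek-V3 schedule, so the fractions it reports need not match the closed form or the hand-drawn grid. All function names are hypothetical.

```python
from collections import defaultdict


def dualpipe_grid(R: int, M: int) -> dict:
    """Map (rank, slot) -> list of labels under the mirrored placement."""
    grid = defaultdict(list)
    for m in range(M):
        for r in range(R):
            grid[(r, m + r)].append(f"F{m}")            # stream A, left to right
            grid[(r, m + (R - 1 - r))].append(f"B{m}")  # stream B, right to left
    return dict(grid)


def bubble_and_overlap(R: int, M: int) -> tuple[float, float]:
    """Return (idle cells / all cells, overlapped cells / busy cells)."""
    grid = dualpipe_grid(R, M)
    total = R * (M + R - 1)
    busy = len(grid)
    overlap = sum(1 for labels in grid.values() if len(labels) == 2)
    return (total - busy) / total, overlap / busy


def render(R: int, M: int) -> str:
    """Text rendering of the grid, one row per rank, '.' for idle cells."""
    grid = dualpipe_grid(R, M)
    rows = []
    for r in range(R):
        cells = []
        for t in range(M + R - 1):
            labels = sorted(grid.get((r, t), ["."]), reverse=True)  # F before B
            cells.append("+".join(labels))
        rows.append(f"r{r} | " + " | ".join(cells))
    return "\n".join(rows)


print(render(4, 4))
bubble, overlap = bubble_and_overlap(4, 4)
print(f"bubble={bubble:.1%}  overlap={overlap:.1%}")
```

Keeping the grid as a plain dictionary keyed by `(rank, slot)` makes it easy to swap in a different placement rule and re-count.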
PyTorch: A Working DualPipe Rank
Now the same idea in real PyTorch, with NCCL send/recv and two CUDA streams. The production DeepSeek-V3 implementation uses their custom DeepEP library, with SM-resident all-to-all kernels instead of NCCL, but the structural backbone (one compute stream, one comms stream, per-slot synchronisation, bidirectional sweep) is faithful to what runs on the cluster. Read this loop as the rank-local view of the schedule grid: for every time slot, what does THIS GPU launch onto which stream.
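What follows is a structural sketch only, under several stated assumptions: the mirrored placement (stream A at slot `m + rank`, stream B at slot `m + (R - 1 - rank)`), shape-preserving stage callables `stage_a` and `stage_b` (hypothetical names), stand-in random inputs at the pipeline edges, and no real backward pass or fine-grained overlap. `run_rank` needs PyTorch with CUDA and an already initialised NCCL process group; `rank_plan` is plain Python.

```python
def rank_plan(rank: int, R: int, M: int) -> list[tuple]:
    """Per-slot (stream_A_microbatch, stream_B_microbatch); None means idle."""
    plan = []
    for t in range(M + R - 1):
        a = t - rank            # stream A index under the assumed placement
        b = t - (R - 1 - rank)  # stream B index under the assumed placement
        plan.append((a if 0 <= a < M else None, b if 0 <= b < M else None))
    return plan


def run_rank(stage_a, stage_b, rank, R, M, act_shape, device):
    """Rank-local loop: P2P on a comms stream, chunks on a compute stream."""
    import torch
    import torch.distributed as dist

    compute = torch.cuda.Stream(device=device)
    comms = torch.cuda.Stream(device=device)
    in_a = torch.randn(act_shape, device=device)  # stand-in data at the edges
    in_b = torch.randn(act_shape, device=device)
    out_a = out_b = None
    compute.wait_stream(torch.cuda.current_stream(device))
    plan = rank_plan(rank, R, M)
    for t, (mb_a, mb_b) in enumerate(plan):
        prev_a, prev_b = plan[t - 1] if t > 0 else (None, None)
        ops = []
        if prev_a is not None and rank < R - 1:  # pass stream A downstream
            ops.append(dist.P2POp(dist.isend, out_a, rank + 1))
        if prev_b is not None and rank > 0:      # pass stream B upstream
            ops.append(dist.P2POp(dist.isend, out_b, rank - 1))
        if mb_a is not None and rank > 0:
            ops.append(dist.P2POp(dist.irecv, in_a, rank - 1))
        if mb_b is not None and rank < R - 1:
            ops.append(dist.P2POp(dist.irecv, in_b, rank + 1))
        with torch.cuda.stream(comms):
            comms.wait_stream(compute)           # last slot's outputs are ready
            for req in (dist.batch_isend_irecv(ops) if ops else []):
                req.wait()
        with torch.cuda.stream(compute):
            compute.wait_stream(comms)           # this slot's inputs have arrived
            if mb_a is not None:
                out_a = stage_a(in_a)
            if mb_b is not None:
                out_b = stage_b(in_b)
    torch.cuda.synchronize(device)


print(rank_plan(0, 4, 4))
```

Sends for the previous slot and receives for the current slot are grouped into a single `batch_isend_irecv` call so that neighbouring ranks post matching operations together rather than each blocking on its own send.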
At Massive Scale: Why DeepSeek-V3 Needed DualPipe
DualPipe is not an optimisation you apply to a 7 B-parameter model for fun. It is a load-bearing piece of the DeepSeek-V3 training recipe — without it, the 2-month, 2,048-GPU budget does not close. Four hard constraints pushed the team to invent it.
- Activation memory ceiling. With microbatch size $B$, sequence length $S$, and hidden size $d$, one BF16 microbatch activation on one rank is $2BSd$ bytes. Across 61 transformer blocks and $M$ microbatches in flight, the total is more than fits in 80 GB HBM. DualPipe's halved schedule length means you only need to retain activations for roughly half as many microbatches at a time, which (combined with selective recomputation) lets the activation footprint fit.
- MoE all-to-all bandwidth. Each MoE block moves ~50 MB per microbatch per direction off-node. With 32 microbatches and 61 blocks per step, that is $50\,\text{MB} \times 32 \times 61 \approx 98$ GB of off-node traffic per direction per step per rank. At an InfiniBand throughput of 50 GB/s, the dispatch+combine alone would take about 4 s per step on the critical path, longer than the compute. SM-resident all-to-all is the only thing that hides this cost.
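- The 14.8 T-token budget. Training 671 B parameters on 14.8 T tokens (about 22 tokens per parameter, close to the Chinchilla ratio), costed as if every parameter were active on every token, means $6ND \approx 6 \times 10^{25}$ training FLOPs. At 40% MFU on 2,048 H800s in FP8 (1979 TFLOPS peak per GPU), that is $\approx 3.7 \times 10^{7}$ seconds, or ~430 days. Most of the gap between that number and the reported ~60-day wallclock comes from MoE sparsity (only about 37 B parameters are activated per token); sustaining the assumed MFU on what remains depends on FP8 (section 10.1), DualPipe (this section), and aggressive gradient accumulation. None of those three is optional.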
- No TP, by design. Tensor parallelism (section 11.3) would normally be the go-to lever for 671 B parameters, but it shreds activation tensors across NVLink with all-reduce operations on every block — adding yet another cross-rank dependency on the critical path. DeepSeek-V3 famously trains without TP: just data, pipeline, and expert parallelism. That decision is only viable because DualPipe makes pipeline parallelism much more efficient than the textbook bubble analysis would suggest.
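The arithmetic in the token-budget bullet above can be reproduced in a few lines. The sketch uses the standard $6ND$ estimate for training FLOPs. The activated-parameter count of about 37 B per token comes from the DeepSeek-V3 report rather than from this section, and the function name is hypothetical.

```python
def training_days(params: float, tokens: float, n_gpus: int,
                  peak_flops_per_gpu: float, mfu: float) -> tuple[float, float]:
    """Return (total FLOPs, wallclock days) using the 6 * N * D estimate."""
    total_flops = 6 * params * tokens
    sustained = n_gpus * peak_flops_per_gpu * mfu  # FLOP/s actually delivered
    return total_flops, total_flops / sustained / 86_400


dense_flops, dense_days = training_days(671e9, 14.8e12, 2048, 1979e12, 0.40)
_, sparse_days = training_days(37e9, 14.8e12, 2048, 1979e12, 0.40)
print(f"all 671B parameters counted: {dense_flops:.2e} FLOPs, {dense_days:.0f} days")
print(f"only ~37B activated per token: {sparse_days:.0f} days")
```

The two results bracket the reported wallclock: counting every parameter badly overestimates the cost, while the activated-parameter figure assumes the full 40% of FP8 peak is sustained throughout.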
The shape of the training step at scale
For one optimiser step on the DeepSeek-V3 cluster, the wallclock breakdown looks roughly like:
| Phase | Wallclock share | Hidden by DualPipe? |
|---|---|---|
| Attention GEMMs (FP8) | 32% | — |
| MoE MLP GEMMs (FP8) | 38% | — |
| Dispatch + combine (all-to-all) | 0% | Yes — hidden by SM-resident kernel |
| P2P send/recv (pipeline) | 0% | Yes — hidden by 2nd CUDA stream |
| Layernorm + softmax (BF16) | 8% | — |
| Optimiser step + DP all-reduce | 5% | Partially — overlapped across DP groups |
| Pipeline bubbles | 9% | Reduced from 18% by bidirectional schedule |
| Other (memcpy, kernel launch) | 8% | — |
The two "0%" rows are the whole story: dispatch/combine and pipeline P2P are off the critical path entirely thanks to DualPipe. Compute is left as the binding side, which is exactly where you want to be, because compute is what you paid for. MFU (model FLOPs utilisation) climbs from a baseline of ~25% on a naïve schedule to ~40% on DualPipe. Measured against a BF16 peak of roughly 990 TFLOPS per GPU, that is the difference between an H800 cluster delivering 500 PFLOPS and one delivering 800 PFLOPS on the same hardware.
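The PFLOPS figures can be checked with one multiplication. The sketch assumes a dense BF16 tensor-core peak of about 989.5 TFLOPS per H800, which is the basis that makes the 500 and 800 PFLOPS figures come out; against the FP8 peak the same MFU values would give twice as much. The function and constant names are hypothetical.

```python
BF16_PEAK_TFLOPS = 989.5  # assumed dense BF16 peak per H800


def delivered_pflops(n_gpus: int, peak_tflops_per_gpu: float, mfu: float) -> float:
    """Cluster throughput actually delivered, in PFLOPS."""
    return n_gpus * peak_tflops_per_gpu * mfu / 1_000


for mfu in (0.25, 0.40):
    pflops = delivered_pflops(2048, BF16_PEAK_TFLOPS, mfu)
    print(f"MFU {mfu:.0%}: {pflops:.0f} PFLOPS")
```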
Engineering Reality: What DualPipe Costs You
DualPipe is not a free lunch. Four engineering costs you sign up for the moment you turn it on:
- Custom CUDA. SM-resident all-to-all is not part of stock NCCL or PyTorch. DeepSeek has open-sourced its DeepEP communication library, but adapting kernels of this kind to your own stack and topology still requires a small team of CUDA engineers comfortable with NVSHMEM, GPUDirect RDMA, and warp-level shuffles. Expect 1–2 engineer-quarters to ship a production-quality version.
- Topology lock-in. DualPipe assumes a symmetric ring of ranks with equal compute capacity and equal link bandwidth between adjacent ranks. On a heterogeneous cluster (e.g. mixed A100 and H100) the bidirectional schedule degrades because the slower link becomes the binding side of every overlapped cell. The algorithm is most efficient on a dedicated, uniform pod — which is exactly what frontier labs build.
- Memory pressure on the diamond edges. At the first and last few slots of every cycle, the two streams have not fully crossed yet, so the rank is only running one stream at a time but has buffers allocated for both. Peak memory is ~25% higher than 1F1B at the same $M$. The fix is to size $M$ conservatively and rely on the schedule's bubble reduction to recover throughput.
- Debugging is harder. "The loss spiked at step 47k" with a 1F1B schedule usually points to a data shard or a numerical issue. With DualPipe it can also be: a stale send buffer from a previous diamond, an SM-budget mistune (the comms kernel evicted by a compute kernel), or a race in the dispatch/combine pair when an expert overflows its token capacity. New failure modes, new tooling.
Against those four costs, the benefits compound:
| Benefit | Magnitude vs 1F1B + NCCL | Where it shows up |
|---|---|---|
| Pipeline bubble | ~½× | GPU-hours per training run |
| MoE all-to-all latency | ≈ 0 | MFU and wallclock |
| Activation memory | ~½× | Max sequence length, max # microbatches |
| MFU | +15–25 percentage points | End-to-end training cost |
| Engineering complexity | +1 quarter (one-time) | Up-front team cost |
The deepest lesson. Every order of magnitude of scale has historically required a co-design step: gradient checkpointing, mixed-precision training, 1F1B pipeline schedules, FP8. DeepSeek-V3 needed DualPipe. The pattern is always the same: a constraint that was invisible at small scale becomes binding at frontier scale, and the answer is never a marginal improvement of the previous tool but a structural rethink of how compute and communication interleave. The next 10× of scale will require its own DualPipe, quite likely a co-design with optical interconnects, FP4 precision, and asynchronous optimisation. The skill the chapter is trying to teach is not "use DualPipe"; it is to learn to look for the critical-path latency that nobody else is looking at, and rewrite the schedule around it.