Boo-AI — Master Artificial Intelligence by Building from Scratch

The Real Problem: Bubbles Burn Money

Pipeline parallelism is a deal with the devil. You split the model vertically across $R$ GPUs — rank 0 holds the embedding and the first few transformer blocks, rank 1 holds the next slab, and so on until rank $R-1$ holds the final blocks and the loss head. Each microbatch then flows forward through the ranks one at a time, like products on a factory belt. The good news is that you can train a model that does not fit on a single GPU — DeepSeek-V3 is 671 billion parameters, and no single H800 can hold even a tenth of that. The bad news is that, at any given moment, most of your GPUs are doing nothing.

Look at what happens during the very first microbatch. At time 0, rank 0 starts computing the forward pass for microbatch 0. Ranks 1 through $R-1$ are idle — they have no inputs yet. At time 1, rank 0 finishes microbatch 0 and starts microbatch 1, while rank 1 receives microbatch 0's activations and finally starts working. But ranks 2 through $R-1$ are still idle. It takes $R - 1$ full time steps before every rank is busy. Then the backward pass produces the symmetric problem on the way out: as microbatches finish, ranks start dropping off the back. Those idle slots — the pipeline bubble — are pure cost.

For the simplest GPipe-style schedule the bubble fraction is $b = (R - 1) / (M + R - 1)$, where $M$ is the number of microbatches per step. With $R = 8$ and $M = 16$, that is $7 / 23 \approx 30\%$ of every GPU-hour wasted. On a 2,048-GPU run that is the equivalent of running roughly 620 H800s for the whole training period purely to push electrons through cooling fans. The 1F1B schedule used by Megatron-LM caps how many activations are held in flight, but at the same $(R, M)$ it spends just as many slots in warm-up and cool-down, so the bubble fraction does not improve.
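As a quick sanity check on these figures, the formula can be evaluated directly. This is a minimal sketch: the helper names are made up for illustration, and it assumes every slot costs the same wallclock.

```python
def gpipe_bubble_fraction(R: int, M: int) -> float:
    """Idle share of a GPipe-style schedule: (R - 1) / (M + R - 1)."""
    if R < 1 or M < 1:
        raise ValueError("R and M must be positive")
    return (R - 1) / (M + R - 1)


def idle_gpu_equivalent(total_gpus: int, R: int, M: int) -> float:
    """Number of GPUs' worth of capacity lost to the bubble."""
    return total_gpus * gpipe_bubble_fraction(R, M)


# Sweep the microbatch count at a fixed pipeline depth of 8.
for M in (4, 16, 32, 64):
    b = gpipe_bubble_fraction(8, M)
    lost = idle_gpu_equivalent(2048, 8, M)
    print(f"R=8 M={M:>2}: bubble = {b:.1%}, idle GPU-equivalents = {lost:.0f}")
```

With the unrounded fraction $7/23$ the idle capacity comes to about 623 GPU-equivalents; rounding the fraction to 30% first gives 614.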

Why is this not solved by "just use more microbatches"? Because $M$ is capped by activation memory. Each microbatch in flight needs its forward activations resident until the matching backward arrives: that is $M \cdot B \cdot S \cdot H \cdot 2$ bytes (BF16 activations) per rank in DeepSeek-V3, which already sits at ~250 GB per H800 once you add gradients and optimiser state. You cannot just crank $M$ to make the bubble vanish; you have to attack it structurally.

On top of the bubble, modern MoE models add a second cost: every forward and backward through an MoE block contains a cross-node all-to-all that shuffles tokens between experts. DeepSeek-V3 has 256 routed experts spread across 8 nodes; a single token might need to be sent to a node four hops away over InfiniBand. If that all-to-all sits on the critical path — i.e. if compute has to wait for it — your GPU utilisation drops a second time. Even with zero pipeline bubbles, a naïve MoE step on DeepSeek-V3's topology can spend ~35% of wallclock in communication.

DualPipe, introduced in the DeepSeek-V3 technical report (Section 3.2.1), attacks both costs at once. It rearranges the pipeline so that two microbatch streams flow in opposite directions simultaneously, collapsing the bubble to a sliver at the edges of each diamond. And it co-designs an SM-resident all-to-all kernel that runs on the same chip as the GEMM kernels, so the cross-node comms is hidden behind compute rather than gating it. The net effect: at the same $R = 8$, $M = 16$, DualPipe's bubble fraction drops from ~30% to roughly 13-15%, and to under 10% at $M = 32$, while the comms time effectively vanishes. That is one of the main reasons a 2,048-GPU training of a 671 B-parameter MoE finishes in roughly two months on a budget of $5.576 \cdot 10^{6}$ USD.

Intuition: Two Pipelines Crossing in the Middle

Imagine a single-track railway with eight stations. Trains start from station 0, stop at every station for one minute of loading, and exit at station 7. If you launch one train per minute from station 0, then after eight minutes you have eight trains in motion, one at each station — full utilisation. But the first seven minutes, and the last seven, you have stations sitting empty. That is the 1F1B bubble.

DualPipe's trick is to also run trains in the opposite direction, starting from station 7 and arriving at station 0. Now during minute 1, station 0 is loading the westbound train, but station 7 is already loading the eastbound train. By minute 4 the two streams meet in the middle, and by minute 8 every station is loading two trains at once: one heading east, one heading west. The two streams cross in the middle, like a double helix. As long as each station has the physical capacity to handle two trains simultaneously, the whole network runs at twice the throughput in the same wallclock.

For pipeline parallelism this analogy translates almost directly. A GPU has 132 streaming multiprocessors on an H800. Loading two trains simultaneously means running two pieces of model work in parallel — a forward chunk of one microbatch and a backward chunk of another. CUDA streams give you the primitive (two streams can execute concurrently); SM partitioning gives you the resource (you can split the 132 SMs between the two streams). The scheduling job — figuring out which two microbatches each rank handles in each time slot — is what DualPipe formalises.

The second piece of intuition is harder. Even with zero bubbles, a rank in the middle of the pipeline still has to send activations forward and receive gradients backward, AND it has to dispatch tokens to MoE experts on other nodes and then combine the results. If those four comms operations sit on the critical path, the GPU stalls. DualPipe's second move is to write the all-to-all as an SM-resident kernel: instead of launching a NCCL operation that blocks on the network, you spin up a small kernel that occupies ~20 of your 132 SMs and asynchronously pulls bytes over NVLink and InfiniBand while the other 112 SMs grind through GEMMs. The comms cost becomes a fixed tax on SM occupancy, not a variable wait on the network.

The deeper analogy. A 1F1B pipeline is like a single-lane highway with traffic flowing one way at a time. DualPipe is a double-decker highway with two lanes flowing in opposite directions, AND the on-ramps are pre-loaded with cars so the merge never stalls. Almost every modern frontier training stack is now copying this design.

The Math of the Bubble

Let us write down the bubble fraction precisely, because the improvement from 1F1B to DualPipe is small to state and large in wallclock impact. For a pipeline of depth $R$ running $M$ microbatches, the simplest GPipe schedule has a forward sweep of length $M + R - 1$ time slots, followed by a backward sweep of the same length. Total wallclock:

$T_{\text{GPipe}} = 2(M + R - 1)$

and the bubble fraction (idle cells divided by total cells) works out to:

$b_{\text{GPipe}} = \frac{R - 1}{M + R - 1}$

The 1F1B optimisation of Megatron-LM interleaves forwards and backwards in the steady state but does not change the asymptotic bubble; it still spends $R - 1$ slots in warm-up and $R - 1$ in cool-down. Plug $R = 8$ , $M = 16$ :

$b_{\text{1F1B}} = \frac{7}{23} \approx 30\%$

DualPipe runs two streams in opposite directions on the same set of ranks. The schedule length collapses to:

$T_{\text{DualPipe}} = M + R - 1$

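That is half the GPipe length, because the forward and backward sweeps now happen concurrently rather than sequentially. Of the $R \cdot (M + R - 1)$ cells in this grid, roughly half as large a share sits idle as in the single-direction schedule:

$b_{\text{DualPipe}} \approx \frac{R - 1}{2(M + R - 1)}$

Half the 1F1B bubble. Plug the same $R = 8$, $M = 16$:

$b_{\text{DualPipe}} \approx \frac{7}{46} \approx 15\%$

This is a rule of thumb. An exact count on the idealised grid produced by the Python schedule generator later in this section gives $R(R - 2)/2$ idle cells for even $R$ and $M \ge R - 1$, which is $24 / 184 \approx 13\%$ at these parameters.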

Keep the ring at $R = 8$ but raise $M$ to 32 (the values used for the DeepSeek-V3 sizing exercise below) and DualPipe's bubble drops to $7 / 78 \approx 9\%$. The takeaway is not the exact fraction, which depends on how cleanly you can saturate both streams in your particular ring, but the factor of two. Halving the bubble at fixed $M$ is one of the largest constant-factor wallclock wins in distributed-training literature since 1F1B itself.
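To see how close the rule of thumb is to an actual cell count, the idealised grid can be recounted in a few lines. This is a sketch under the placement rule used in this section's walkthrough (stream A on rank $r$ in slots $r$ to $r + M - 1$, stream B in slots $R - 1 - r$ to $R - 1 - r + M - 1$); the function names are hypothetical and the model is a toy, not the production schedule.

```python
def toy_idle_cells(R: int, M: int) -> int:
    """Idle cells in the idealised bidirectional grid with T = M + R - 1 slots."""
    T = M + R - 1
    idle = 0
    for r in range(R):
        slots_a = set(range(r, r + M))                    # stream A, left to right
        slots_b = set(range(R - 1 - r, R - 1 - r + M))    # stream B, right to left
        idle += T - len(slots_a | slots_b)
    return idle


def toy_bubble(R: int, M: int) -> float:
    """Exact idle share of the toy grid."""
    return toy_idle_cells(R, M) / (R * (M + R - 1))


def rule_of_thumb(R: int, M: int) -> float:
    """The approximation (R - 1) / (2 (M + R - 1))."""
    return (R - 1) / (2 * (M + R - 1))


for R, M in [(8, 16), (8, 32)]:
    print(f"R={R} M={M}: exact toy count = {toy_bubble(R, M):.1%}, "
          f"rule of thumb = {rule_of_thumb(R, M):.1%}")
```

The exact count sits a little below the rule of thumb because the first and last ranks never idle in this idealised grid once $M \ge R - 1$.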

What the overlap actually buys

Bubble fraction is necessary but not sufficient. The second number that matters is the overlap fraction: how many of the busy cells contain both a forward and a backward simultaneously. Call this $f$ . The wallclock per cell is the longer of forward and backward — call it $\tau$ . The total wallclock is:

$T_{\text{wall}} = (M + R - 1) \cdot \tau$

independent of whether the cells are overlapped. The overlap shows up as doubled work per cell, not halved wallclock per cell, because the rank does the forward of one microbatch and the backward of another in the time it would otherwise take to do just one of them. Combined with the halved schedule length from running two streams concurrently, this idealised model has DualPipe doing $2M$ microbatches' worth of work in the wallclock of $M$ microbatches of 1F1B. That 2× figure is an upper bound: it assumes an overlapped cell costs no more than a single one, which holds only if the two chunks do not contend for SMs, and it comes at the cost of one more CUDA stream per rank and a careful SM-budget allocation.

Manual Numerical Walkthrough

Two worked examples by hand: counting the bubbles in a 4×4 grid, and sizing the wallclock saving for one DeepSeek-V3 training step.

Counting bubbles by hand at R = 4, M = 4

Setup. 4 ranks, 4 microbatches per stream. Schedule length $T = M + R - 1 = 7$ slots. Grid has $R \cdot T = 28$ cells.

Step 1 — place stream A (left → right). Microbatch $m$ is at rank $r$ in slot $m + r$ . Filling the grid:

	t=0	t=1	t=2	t=3	t=4	t=5	t=6
r0	F0	F1	F2	F3	.	.	.
r1	.	F0	F1	F2	F3	.	.
r2	.	.	F0	F1	F2	F3	.
r3	.	.	.	F0	F1	F2	F3

Step 2 — place stream B (right → left). Microbatch $M - 1 - m'$ is at rank $r$ in slot $m' + (R - 1 - r)$ . After overlaying:

	t=0	t=1	t=2	t=3	t=4	t=5	t=6
r0	F0	F1	F2	F3+B3	B2	B1	B0
r1	.	F0	F1+B3	F2+B2	F3+B1	B0	.
r2	.	B3	F0+B2	F1+B1	F2+B0	F3	.
r3	B3	B2	B1	F0+B0	F1	F2	F3

Step 3: count. Idle cells (the "." slots), rank by rank: $0 + 2 + 2 + 0 = 4$. Busy: $28 - 4 = 24$. Bubble fraction: $4 / 28 \approx 14.3\%$. Cells containing both an F and a B (the overlap cells) number $1 + 3 + 3 + 1 = 8$, which makes the overlap fraction $8 / 28 \approx 28.6\%$.

Compare with 1F1B at the same (R, M). The 1F1B grid is twice as long: $T = 14$ slots, 56 cells. Each sweep has two triangular corners of $\binom{R}{2} = 6$ idle cells, so $2 \cdot \binom{R}{2} = 12$ idle cells per sweep and 24 idle cells out of 56, a $24 / 56 \approx 43\%$ bubble. At this small $M$ the absolute saving is $43\% - 14\% = 29$ percentage points of GPU-time. Utilisation rises from about 57% to about 86%, so if throughput scales with the busy fraction the same work fits in roughly two thirds of the time, a wallclock reduction of about 33%.

Sanity check the asymptote. As $M \to \infty$ , both schedules approach 0% bubble — pipeline bubbles are an O(R/M) tax. The benefit of DualPipe is not that it beats 1F1B asymptotically; it is that it beats 1F1B at the small $M$ values forced by activation memory.
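The single-direction count used in the comparison above can be reproduced by building the grid explicitly. The sketch below lays out a GPipe-style forward sweep followed by a backward sweep, which has the same idle-cell count as 1F1B; it assumes a backward cell costs the same as a forward cell, and the order of microbatches within the backward sweep is an arbitrary choice that does not affect the count.

```python
from math import comb
from typing import List


def gpipe_grid(R: int, M: int) -> List[List[str]]:
    """Forward sweep then backward sweep, one unit-cost cell per (rank, microbatch)."""
    sweep = M + R - 1
    grid = [["."] * (2 * sweep) for _ in range(R)]
    for m in range(M):
        for r in range(R):
            grid[r][m + r] = f"F{m}"                      # forward: rank 0 first
            grid[r][sweep + m + (R - 1 - r)] = f"B{m}"    # backward: last rank first
    return grid


def idle_cells(grid: List[List[str]]) -> int:
    """Count the '.' cells, i.e. the bubble."""
    return sum(cell == "." for row in grid for cell in row)


R, M = 4, 4
grid = gpipe_grid(R, M)
total = R * len(grid[0])
print("\n".join(" ".join(f"{c:>2}" for c in row) for row in grid))
print(f"idle = {idle_cells(grid)} of {total} cells, "
      f"bubble = {idle_cells(grid) / total:.1%}")
```

In general the idle count is $4\binom{R}{2} = 2R(R-1)$ regardless of $M$, which is why the fraction decays as $M$ grows.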

Sizing the dollar saving for DeepSeek-V3

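The run. 2,048 H800 GPUs, ~2 months wallclock, quoted compute cost $5.576 \cdot 10^{6}$ USD. Pipeline-parallel depth $R = 8$ within each node; data-parallel replicas across the 256 nodes; expert-parallel across nodes. (The technical report itself describes 16-way pipeline parallelism; $R = 8$ keeps the arithmetic in line with the earlier sections.)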

Microbatches per step. DeepSeek-V3 uses $M = 32$ microbatches per pipeline cycle, sequence length 4,096, microbatch size 4. Total tokens per step (across the 256 data-parallel replicas): order of $1.3 \cdot 10^{8}$ under these assumptions. For comparison, the paper reports a final batch size of 15,360 sequences, about $6.3 \cdot 10^{7}$ tokens at this sequence length.

Bubble at (R=8, M=32). 1F1B: $b = 7 / 39 \approx 17.9\%$ . DualPipe: $b \approx 7 / 78 \approx 9.0\%$ . Saving: $17.9 - 9.0 = 8.9$ percentage points of GPU-time.

Translate to wallclock. At the realised MFU (DeepSeek-V3 reports ~40% on H800 in FP8), the bubble saving of 8.9% reduces wallclock by roughly the same fraction: $60 \cdot 0.089 \approx 5.3$ days shaved off a 60-day run. At a quoted USD/GPU-hour of $5.576 \cdot 10^{6} / (2048 \cdot 24 \cdot 60) \approx 1.89$ USD/GPU-hour, that is $2048 \cdot 24 \cdot 5.3 \cdot 1.89 \approx 4.9 \cdot 10^{5}$ USD saved on a single training run — a half-million-dollar scheduling improvement.

Add the comms overlap. The bubble reduction is half the story; the other half is the SM-resident all-to-all, which hides another ~25% of wallclock that would have been spent in dispatch and combine. Combined, the paper attributes about a 1.6× end-to-end speed-up to DualPipe over a standard 1F1B + separate-NCCL-stream baseline. That is most of the difference between "could be trained" and "was trained for under six million dollars".
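The bubble-to-dollars arithmetic above can be redone without intermediate rounding. This is only a sketch: the function name and dictionary keys are made up for illustration, and it inherits the walkthrough's assumption that wallclock shrinks in proportion to the bubble saving.

```python
def dualpipe_saving(total_cost_usd: float, gpus: int, run_days: float,
                    R: int, M: int) -> dict:
    """Estimate days and dollars saved by halving the pipeline bubble."""
    gpu_hours = gpus * 24 * run_days
    usd_per_gpu_hour = total_cost_usd / gpu_hours
    b_1f1b = (R - 1) / (M + R - 1)
    b_dual = (R - 1) / (2 * (M + R - 1))    # rule-of-thumb DualPipe bubble
    saved_fraction = b_1f1b - b_dual
    days_saved = run_days * saved_fraction
    usd_saved = gpus * 24 * days_saved * usd_per_gpu_hour
    return {
        "usd_per_gpu_hour": usd_per_gpu_hour,
        "saved_fraction": saved_fraction,
        "days_saved": days_saved,
        "usd_saved": usd_saved,
    }


result = dualpipe_saving(5.576e6, gpus=2048, run_days=60, R=8, M=32)
for key, value in result.items():
    print(f"{key:>17}: {value:,.4f}")
```

Carrying full precision lands within a few percent of the rounded figures in the text, still roughly half a million dollars.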

Visualizing DualPipe Side-by-Side With 1F1B

Three widgets, each isolating one layer of the DualPipe story. The first is the headline comparison: drag the $R$ and $M$ sliders and watch the bubble fraction in both schedules change in real time. Press Play to animate microbatches sweeping through. The 1F1B grey corners are the bubbles; the DualPipe green cells are the overlapped forward+backward slots that eat them.


Three settings worth trying. (1) At $R = 8$, $M = 4$ (small $M$, the activation-memory-bound regime) the 1F1B bubble is $7 / 11 \approx 64\%$ and DualPipe cuts it to roughly half of that, which is the regime where DualPipe matters most. (2) At $R = 16$, $M = 32$ the schedule fills the screen and you can see the diamond structure clearly. (3) Keep increasing $M$ at fixed $R$ and both schedules head towards single-digit bubble percentages, confirming the O(R/M) scaling.

The second widget zooms into a single DualPipe cycle and shows the four sub-phases of each microbatch — forward attention, forward MoE MLP, backward input gradient, backward weight gradient — and the compute/comms split inside each cell. Toggle the overlap checkbox to see how the schedule collapses to half the wallclock when comms can run concurrently with compute.


The third widget makes the SM-budget trade-off explicit. An H800 has 132 SMs; DualPipe reserves a slice (typically ~20) for the SM-resident all-to-all kernel. Move the slider and watch compute throughput (sky bar) trade off against comms bandwidth (emerald bar). The U-curve at the bottom is the iteration wallclock — note the valley around 20 SMs, which is the operating point DeepSeek-V3 picks in their published recipes.


Why the curve has a valley. Set comms SMs to 0 and the network blocks the GPU: the right-going stream has nothing to chew on while it waits for activations to arrive. Set comms SMs to 60 and compute starves: there are not enough SMs left to keep the GEMMs saturated. The valley is where the binding side switches from comms (left of the valley) to compute (right of it) — and at that crossover both are running at full throttle.
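The valley can be reproduced with a deliberately crude model. In the sketch below each side finishes in work divided by the SMs it owns, and the slower side sets the iteration time. The two work constants are hypothetical and are chosen so that the sides balance at 20 comms SMs; real comms throughput saturates at link bandwidth rather than scaling linearly with SM count, so treat the curve's shape, not its numbers, as the point.

```python
TOTAL_SMS = 132  # SM count of an H800, as quoted in the text


def iteration_time(comm_sms: int, compute_work: float = 112.0,
                   comm_work: float = 20.0) -> float:
    """Toy model: the slower of compute and comms sets the pace.

    compute_work and comm_work are hypothetical, in SM-milliseconds.
    """
    if not 0 < comm_sms < TOTAL_SMS:
        raise ValueError("comm_sms must leave at least one SM on each side")
    compute_time = compute_work / (TOTAL_SMS - comm_sms)
    comm_time = comm_work / comm_sms
    return max(compute_time, comm_time)


def best_split() -> int:
    """Comms SM count that minimises the toy iteration time."""
    return min(range(1, TOTAL_SMS), key=iteration_time)


for s in (4, 10, 20, 40, 60):
    print(f"{s:>2} comms SMs -> {iteration_time(s):.2f} ms")
print("valley at", best_split(), "comms SMs")
```

Left of the minimum the comms term dominates the `max`; right of it the compute term does, which is the switch in the binding side described above.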

The Second Win: Hiding All-to-All Behind Compute

Reducing the bubble is the headline DualPipe number, but it is not the only thing the algorithm does — and at MoE scale it may not even be the biggest win. The second move is making all-to-all comms hide behind compute. To understand why that matters, think about what happens inside one MoE block on one rank during one microbatch.

Dispatch. The block's input tokens are routed to experts via a top-k gate. With 256 experts spread across 8 nodes, every node has to send roughly $7/8$ of its tokens off-node, over InfiniBand. Volume per token: $H \cdot \text{bytes/element} = 7168 \cdot 1 = 7\,\text{KB}$ in FP8. For an 8K-token microbatch that is $\sim 50\,\text{MB}$ of network traffic per microbatch per layer.
Expert compute. Each expert runs a small MLP on its assigned tokens. This is GEMM-bound and lives on the tensor cores.
Combine. Tokens flow back from experts to their originating ranks, weighted by gate scores. Same volume as dispatch.
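The dispatch volume quoted in the first step is easy to recompute. The sketch below counts each token once, as the text does, and so ignores any replication from a token being routed to several experts; the function name is hypothetical and "8K tokens" is taken to mean 8,192.

```python
def off_node_dispatch_bytes(tokens: int, hidden: int, bytes_per_elem: int,
                            off_node_fraction: float) -> float:
    """Bytes leaving the node in one dispatch, counting each token once."""
    return tokens * hidden * bytes_per_elem * off_node_fraction


volume = off_node_dispatch_bytes(tokens=8192, hidden=7168, bytes_per_elem=1,
                                 off_node_fraction=7 / 8)
print(f"{volume / 1e6:.1f} MB per microbatch per layer")
```

Combine moves the same volume in the opposite direction, so the per-layer total is twice this figure.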

In a naïve implementation, you serialise: dispatch → compute → combine. At DeepSeek-V3's scale the dispatch and combine together take ~25-40% of an MoE block's wallclock, even though the arithmetic they involve is only $\sim 1\%$ of the GEMM's FLOPs. Wallclock is bound by bandwidth, not compute. If your tensor cores idle during dispatch and combine, you have effectively cut H800 throughput by a third.

DualPipe's SM-resident all-to-all writes the dispatch/combine as a CUDA kernel that runs on SMs rather than a NCCL operation launched on a separate stream. Concretely, the kernel spins up on a fixed slice of SMs (typically 20 of 132), allocates send/receive buffers in pinned host memory, and uses GPUDirect to DMA bytes between the NIC and HBM without ever stalling on a kernel boundary. The compute kernels (GEMMs, softmax, layernorm) get the other 112 SMs and run concurrently.

Because the comms kernel and the compute kernels run concurrently on the same GPU, on separate CUDA streams and consuming disjoint SM partitions, the hardware can interleave them at fine granularity (kernels queued on a single stream would execute one after another). There is no "launch latency" on the comms side, and the GPU's warp scheduler keeps both pools of SMs busy. At the right operating point (the valley of the third widget above), the binding side of the iteration is neither comms nor compute alone: both run at peak, and the total wallclock is $\max(\tau_{\text{compute}}, \tau_{\text{comms}})$ rather than $\tau_{\text{compute}} + \tau_{\text{comms}}$.
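The difference between the sum and the max is worth seeing with numbers. The sketch below uses hypothetical stage timings and assumes perfect overlap, which in practice requires independent work from another microbatch to fill the gaps; the function names are illustrative.

```python
from typing import Sequence


def serial_time(compute_ms: Sequence[float], comms_ms: Sequence[float]) -> float:
    """Every stage waits for the previous one: times simply add."""
    return sum(compute_ms) + sum(comms_ms)


def overlapped_time(compute_ms: Sequence[float], comms_ms: Sequence[float]) -> float:
    """Ideal overlap: the slower of the two pools sets the wallclock."""
    return max(sum(compute_ms), sum(comms_ms))


compute = [2.0, 3.0]   # hypothetical attention and expert-MLP times, in ms
comms = [1.5, 2.5]     # hypothetical dispatch and combine times, in ms

t_serial = serial_time(compute, comms)
t_overlap = overlapped_time(compute, comms)
print(f"serial = {t_serial} ms, overlapped = {t_overlap} ms, "
      f"saving = {1 - t_overlap / t_serial:.0%}")
```

The saving is largest when the two pools are balanced, and it shrinks as either side comes to dominate.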

Block stage	Without overlap	With SM-resident all-to-all	Saving
Attention (compute)	1.0 ms	1.0 ms	0%
Dispatch (comms)	1.2 ms	hidden	100%
MoE expert MLP (compute)	1.5 ms	1.5 ms	0%
Combine (comms)	1.2 ms	hidden	100%
Total per layer	4.9 ms	2.5 ms	49%

The numbers above are illustrative (DeepSeek does not publish a per-layer breakdown), but the structure is exact: with both comms stages hidden, only the $1.0 + 1.5 = 2.5$ ms of compute remains, and that roughly 49% per-layer wallclock saving compounds across the 61 transformer blocks of DeepSeek-V3, on top of the bubble reduction. Multiply across a 2-month, 2,048-GPU run and you save tens of days of compute, which is most of the difference between "technically feasible" and "economically sane".

Why this is not just "use a separate NCCL stream". A separate NCCL stream gives you overlap only when the GEMMs can run independently of the comms — but in an MoE block the GEMMs depend on the comms results! Dispatch gates the expert MLP, which gates the combine. The SM-resident kernel lets you split within the dependent chain: while microbatch m's combine is in flight, microbatch m+1's dispatch can start on the same SMs, even though they belong to the same logical block. Independent CUDA streams cannot do this because they cannot share SMs at sub-block granularity.

Plain Python: Scheduling a DualPipe Cycle

Before we touch CUDA, let us build the scheduler in plain Python. The function below produces the exact grid you saw in the visualisation above: for every (rank, time) cell, which microbatch is doing its forward and which is doing its backward. The bubble and overlap fractions are computed directly from the grid. This is the kernel of the algorithm; the actual CUDA code is just careful bookkeeping around send/recv and CUDA streams.

dualpipe_schedule.py


Line 22: One time slot on one GPU

Every slot on every rank can hold up to one forward chunk and one backward chunk simultaneously. That second slot — bwd_mb being filled while fwd_mb is already filled — is the whole DualPipe trick. In a 1F1B schedule a slot can only hold one of them; in DualPipe the rank does both at once, on different SM partitions of the same H800.

EXECUTION STATE

fwd_mb = Optional[int] — microbatch id of the forward chunk running in this slot

bwd_mb = Optional[int] — microbatch id of the backward chunk running in this slot

Line 27: Is this slot doing useful work?

A slot is busy when at least one of the two chunks is scheduled. The bubble fraction = 1 − (busy / total). For DualPipe at R=8, M=16 this idealised grid measures ~13% bubbles; for 1F1B at the same parameters, ~30%. That ~17-percentage-point delta is real wallclock and real money on a 2,048-GPU run.

EXECUTION STATE

busy = True if the slot does any work

Line 31: Is this slot doing TWO useful things at once?

Overlapped slots are the green cells in the visualisation above — one rank running a forward AND a backward in the same wallclock window. In the steady state of a DualPipe cycle, the majority of cells are overlapped. That is why DualPipe's wallclock matches a single-direction pipeline of half the length.

EXECUTION STATE

overlapped = True iff both fwd_mb and bwd_mb are set

Line 36: The schedule constructor, DualPipe in a dozen lines

This function is the heart of DualPipe. Given R ranks and M microbatches per stream, it returns a 2-D grid of Slots that describes which microbatch lives at which (rank, time). The two for-loops below place the left-to-right stream and the right-to-left stream onto the grid. Wherever they cross, the slot inherits both microbatch ids and becomes 'overlapped'.

EXECUTION STATE

R = pipeline depth — # of GPUs in the pipeline-parallel axis; for DeepSeek-V3, R = 8

M = microbatches per stream; with two streams, total in-flight microbatches = 2M

T = M + R − 1 — exactly half the length of the matching 1F1B schedule

grid = List[List[Slot]] of shape R × T

Line 48: Place the left-to-right stream

Microbatch m of stream A enters rank 0 at slot t = m, then moves one rank to the right each subsequent slot, so it reaches rank r at slot m + r. With R = 8 and M = 16, microbatch 5 hits rank 3 at slot 8 and rank 7 at slot 12. By the time microbatch 15 enters rank 0 at slot 15, microbatch 0 finished on rank 7 eight slots earlier (slot 7) and microbatch 8 is the one on rank 7.

LOOP TRACE · 4 iterations

m=0, r=0

t = m + r = 0

grid[0][0].fwd_mb = 0 — microbatch 0 enters rank 0

m=0, r=7

t = m + r = 7

grid[7][7].fwd_mb = 0 — microbatch 0 finishes its forward on rank 7

m=5, r=3

t = m + r = 8

grid[3][8].fwd_mb = 5

m=15, r=0

t = m + r = 15

grid[0][15].fwd_mb = 15 — last microbatch of stream A enters rank 0

Line 55: Place the right-to-left stream (the bidirectional half of DualPipe)

This loop is the magic. Stream B's microbatch enters rank R−1 at slot m and moves one rank to the LEFT each subsequent slot — reaching rank r at slot m + (R−1 − r). In real DualPipe this stream carries the backward pass of microbatches that have already finished their forward in the previous diamond, so we tag the slot as bwd_mb. When stream A and stream B occupy the same slot on the same rank, the slot is 'overlapped' and the rank handles both — that is where the bubbles go.

LOOP TRACE · 3 iterations

m=0, r=7

t = m + (R-1 - r) = 0

grid[7][0].bwd_mb = M-1-0 = 15 — microbatch 15 of stream B enters rank 7

m=0, r=0

t = m + (R-1 - r) = 7

grid[0][7].bwd_mb = 15 — microbatch 15 of stream B finishes on rank 0

m=7, r=3 (crosses stream-A's microbatch 8 here)

t = m + (R-1 - r) = 11

grid[3][11].fwd_mb = 8 — set by the previous loop

grid[3][11].bwd_mb = 8 — set by THIS loop — overlap!

Line 62: Bubble fraction, the number that matters most

Idle cells = R·T − busy. For DualPipe at (R=8, M=16) the grid has R·T = 8·23 = 184 cells, of which 160 are busy and 24 idle, so the bubble is 24/184 ≈ 13%. For 1F1B at the same (R, M) the grid is twice as long and the bubble climbs to (R−1)/(M+R−1) ≈ 30%. DualPipe roughly halves the bubble fraction on every shape we care about.

EXECUTION STATE

busy = count of cells with at least one assigned microbatch

R * T = total cells in the grid

Line 69: Overlap fraction, how often a rank is doing two things at once

This is the fraction of cells where fwd and bwd both live. In the DualPipe steady state, overlap typically reaches 50–80% of cells, which is why the cycle takes only half as long as the equivalent 1F1B schedule — every overlapped slot is doing the work of two non-overlapped slots, on a single H800 with its 132 SMs split between compute and the SM-resident all-to-all kernel.

Line 76: ASCII pretty-printer

Helper that dumps the grid so you can eyeball it in a notebook. 'F0+B7' means this slot has microbatch 0 doing forward and microbatch 7 doing backward simultaneously. '.' is a bubble. Run the script and you will see the characteristic diamond: bubbles sit at the start and end of the middle ranks' rows, while overlap concentrates in the centre of the grid where the two streams cross.

Line 94: Try it

With R = 8 ranks and M = 16 microbatches per stream the script prints a bubble fraction of 13.04% and an overlap fraction of 52.17%, and the 1F1B baseline at the same parameters comes out at 30.43%. That is the headline DualPipe number on a single page. The next snippet shows what one rank actually has to do during one of those overlapped slots.

EXECUTION STATE

R = 8

M = 16

T = M + R - 1 = 23


```python
"""
A from-scratch DualPipe-style schedule generator.

We model a synchronous pipeline-parallel run over R ranks and M
microbatches. Each microbatch must visit every rank in order during the
forward pass, then every rank in reverse order during the backward pass.

DualPipe's idea: launch TWO microbatch streams in opposite directions at
the same time. The 'left' stream enters at rank 0 and exits at rank R-1;
the 'right' stream enters at rank R-1 and exits at rank 0. In the steady
state every rank handles a forward chunk of one stream AND a backward
chunk of the other in the same time slot; the bubble collapses to two
epsilons at the diamond's tips instead of (R-1)/M of the whole training
step.
"""

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """One time slot on one rank."""
    fwd_mb: Optional[int] = None      # microbatch id doing forward here
    bwd_mb: Optional[int] = None      # microbatch id doing backward here

    @property
    def busy(self) -> bool:
        return self.fwd_mb is not None or self.bwd_mb is not None

    @property
    def overlapped(self) -> bool:
        return self.fwd_mb is not None and self.bwd_mb is not None


def dualpipe_schedule(R: int, M: int) -> List[List[Slot]]:
    """Return a [R][T] grid of Slots for a DualPipe diamond.

    R = number of pipeline ranks (GPUs in the pipeline dimension).
    M = microbatches per stream (so 2*M total microbatches in flight).
    """
    T = M + R - 1                     # length of one diamond, in slots
    grid: List[List[Slot]] = [[Slot() for _ in range(T)] for _ in range(R)]

    # Stream A: left -> right.  Microbatch m enters rank 0 at slot m,
    # reaches rank r at slot m + r.
    for m in range(M):
        for r in range(R):
            grid[r][m + r].fwd_mb = m

    # Stream B: right -> left.  Microbatch m' enters rank R-1 at slot m',
    # reaches rank r at slot m' + (R-1 - r).  In real DualPipe, this stream
    # carries the backward pass of microbatches already forwarded.
    for m in range(M):
        for r in range(R):
            grid[r][m + (R - 1 - r)].bwd_mb = M - 1 - m

    return grid


def bubble_fraction(grid: List[List[Slot]]) -> float:
    R = len(grid)
    T = len(grid[0])
    busy = sum(1 for row in grid for s in row if s.busy)
    return 1 - busy / (R * T)


def overlap_fraction(grid: List[List[Slot]]) -> float:
    R = len(grid)
    T = len(grid[0])
    overlapped = sum(1 for row in grid for s in row if s.overlapped)
    return overlapped / (R * T)


def render(grid: List[List[Slot]]) -> str:
    """ASCII print of the schedule for quick eyeballing."""
    lines = []
    for r, row in enumerate(grid):
        cells = []
        for s in row:
            if s.overlapped:
                cells.append(f"F{s.fwd_mb}+B{s.bwd_mb}")
            elif s.fwd_mb is not None:
                cells.append(f"F{s.fwd_mb}   ")
            elif s.bwd_mb is not None:
                cells.append(f"  B{s.bwd_mb} ")
            else:
                cells.append("  .   ")
        lines.append(f"r{r}: " + "|".join(cells))
    return "\n".join(lines)


# Example: 8 ranks, 16 microbatches per stream.
R, M = 8, 16
grid = dualpipe_schedule(R, M)
print(f"R = {R}, M = {M}, T = {len(grid[0])}")
print(f"bubble fraction  = {bubble_fraction(grid):.2%}")
print(f"overlap fraction = {overlap_fraction(grid):.2%}")
print()
print(render(grid))


# Compare against the 1F1B baseline: bubble = (R-1) / (M + R - 1)
oneF1B_bubble = (R - 1) / (M + R - 1)
print(f"\n1F1B bubble at the same (R, M): {oneF1B_bubble:.2%}")
```

PyTorch: A Working DualPipe Rank

Now the same idea in real PyTorch, with NCCL send/recv and two CUDA streams. The production DeepSeek-V3 implementation lives inside their custom DeepEP library and uses SM-resident all-to-all kernels instead of NCCL, but the structural backbone (one compute stream, one comms stream, per-slot synchronisation, bidirectional sweep) is faithful to what runs on the cluster. Read this loop as the rank-local view of the schedule grid: for every time slot, what does THIS GPU launch onto which stream.

dualpipe_rank.py


Line 20: Pipeline ring topology

Eight ranks form a ring. Forward traffic goes rank → next_rank (rank+1 mod R); backward traffic goes rank → prev_rank (rank−1 mod R). That ring is the physical topology DeepSeek-V3 wires up: 8 H800s per node connected by NVLink in an NVSwitch fabric, and the inter-node hop uses 8 InfiniBand NICs at 50 GB/s each.

EXECUTION STATE

rank = this GPU's pipeline-rank index (0..7)

world = pipeline depth R; for DeepSeek-V3 each node hosts one ring of 8

prev_rank = (rank - 1) mod world

next_rank = (rank + 1) mod world
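The neighbour arithmetic is small enough to check by hand. A minimal sketch, with a hypothetical helper name, relying on Python's `%` operator returning a non-negative result for a positive modulus:

```python
from typing import Tuple


def ring_neighbours(rank: int, world: int) -> Tuple[int, int]:
    """Return (prev_rank, next_rank) on a ring of `world` pipeline ranks."""
    if not 0 <= rank < world:
        raise ValueError("rank must be in [0, world)")
    return (rank - 1) % world, (rank + 1) % world


for rank in range(8):
    prev_rank, next_rank = ring_neighbours(rank, 8)
    print(f"rank {rank}: prev = {prev_rank}, next = {next_rank}")
```

The wrap-around at both ends is what lets the second stream enter at the last rank without a separate topology.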

Line 25: The local pipeline stage

Every rank owns a slab of the model: one transformer block (or a small group of blocks) of hidden dimension 7168 in DeepSeek-V3. The block sees activations of shape [microbatch, seqlen, 7168] in BF16 (the activations stay BF16 even though the GEMMs run in FP8; see section 10.1 for why). The optimizer carries an FP32 master copy and its Adam moments, sharded across data-parallel replicas via ZeRO-1, the data-parallel scheme named in the technical report.

EXECUTION STATE

block = nn.Module — one transformer block on this rank

optim = AdamW with FP32 master weights, BF16/FP8 forward weights

Line 32: Microbatch buffer shape

Each in-flight microbatch is a contiguous BF16 tensor on the GPU. For DeepSeek-V3 a typical shape is [B=4, S=4096, H=7168], which is about 117 million elements or ≈ 235 MB per buffer at 2 bytes each. With 2M microbatches in flight across the diamond, you need ~470 MB × M resident on every rank for activation buffers alone, which is why the activation-memory savings of FP8 from section 10.1 are so important when paired with DualPipe.

EXECUTION STATE

B = 4 — microbatch size (rows)

S = 4096 — sequence length (tokens)

H = 7168 — hidden dim for DeepSeek-V3
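The buffer sizes follow directly from the shape and the element width. The sketch below uses hypothetical helper names and counts only the single [B, S, H] boundary tensor per microbatch, not the intermediate activations a real backward pass would also keep.

```python
def buffer_bytes(B: int, S: int, H: int, bytes_per_elem: int = 2) -> int:
    """Size of one [B, S, H] activation buffer."""
    return B * S * H * bytes_per_elem


def in_flight_bytes(M: int, B: int, S: int, H: int, bytes_per_elem: int = 2) -> int:
    """Buffers for 2*M microbatches, M per stream."""
    return 2 * M * buffer_bytes(B, S, H, bytes_per_elem)


B, S, H = 4, 4096, 7168
for name, width in (("BF16", 2), ("FP8", 1)):
    one = buffer_bytes(B, S, H, width)
    total = in_flight_bytes(16, B, S, H, width)
    print(f"{name}: {one / 1e6:.0f} MB per buffer, "
          f"{total / 1e9:.2f} GB in flight at M = 16")
```

Halving the element width halves both figures, which is the activation-memory argument for FP8 in one line.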

Line 37: The DualPipe step, one diamond of length T = M + R − 1

Outer loop is over slots t. At each slot, the rank performs four sub-steps: (1) post non-blocking receives for whichever stream(s) this slot is on; (2) wait for inputs the compute side needs; (3) launch compute concurrently with the just-issued comms; (4) finally, after the last slot, step the optimiser. This is the rank-local view; the cross-rank choreography is implicit in the send/recv pairs.

EXECUTION STATE

R = world (pipeline depth)

M = microbatches per stream

T = M + R − 1 = length of one diamond

Line 46: Predicate: is this slot on the forward stream?

From the schedule analysis: stream A's microbatch m is at rank r in slot m + r, so this rank is on stream A in slot t iff 0 ≤ t − rank < M. We post the receive on the comm_stream so the receive operation does not block the SM. By the time we hit the .wait() on line 53, NCCL has hopefully already filled the buffer in the background — that is the latency that DualPipe hides.

EXECUTION STATE

t - rank = the index of stream-A's microbatch currently arriving at this rank

fwd_recv[t] = torch.distributed.Work handle for the irecv at this slot

Line 49: Predicate: is this slot on the backward stream?

Symmetrically, stream B's microbatch m' is at rank r in slot m' + (R−1 − r). So this rank is on stream B in slot t iff 0 ≤ t − (R−1 − rank) < M. The receive is posted to the same comm_stream — but in a real implementation you would put dispatch on the NVLink channel and combine on the InfiniBand channel so they do not fight for the same NIC. The SM-resident all-to-all kernel lets us pick which physical link each receive uses.

EXECUTION STATE

t - (R-1 - rank) = stream-B's microbatch index arriving here

bwd_recv[t] = Work handle for the right-to-left receive
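The two predicates can be exercised without any GPU. The sketch below is a plain-Python restatement under the idealised schedule of this section; the function name `slot_work` is hypothetical and no communication is performed.

```python
from typing import Optional, Tuple


def slot_work(rank: int, t: int, R: int, M: int) -> Tuple[Optional[int], Optional[int]]:
    """Return (fwd_mb, bwd_mb) for this rank at slot t, or None where idle."""
    fwd_idx = t - rank                  # stream A index arriving at this rank
    bwd_idx = t - (R - 1 - rank)        # stream B index arriving at this rank
    fwd_mb = fwd_idx if 0 <= fwd_idx < M else None
    bwd_mb = (M - 1 - bwd_idx) if 0 <= bwd_idx < M else None
    return fwd_mb, bwd_mb


# Timeline of rank 3 in an R = 8, M = 16 diamond of T = 23 slots.
R, M = 8, 16
for t in range(M + R - 1):
    fwd_mb, bwd_mb = slot_work(3, t, R, M)
    print(f"t={t:>2}: fwd={fwd_mb}, bwd={bwd_mb}")
```

Slots where both entries are set are the overlapped ones; slots where both are `None` are the bubble.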

Line 54: Synchronisation point, but only for this slot's inputs

Critical detail: we do not wait for the NEXT slot's receives here — those were issued on the comm_stream above and they continue to run while we burn compute on the current slot. This is what 'overlapping comms with compute' actually means in CUDA: the comm_stream and the compute_stream are two distinct CUDA streams, the runtime can schedule both at once, and the only synchronisation is the per-slot .wait() that says 'I now need this tensor for the GEMM'.
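To see how much a per-slot wait can hide, here is a toy timing model in pure Python. It is a sketch under simplifying assumptions: receives run back to back on one comm stream, compute runs in order on one compute stream, and the per-slot durations are made-up numbers rather than measurements.

```python
from __future__ import annotations


def serial_time(comm: list[float], compute: list[float]) -> float:
    """No overlap: every slot first receives its input, then computes."""
    return sum(comm) + sum(compute)


def overlapped_time(comm: list[float], compute: list[float]) -> float:
    """Two streams: a slot's compute waits only for its own input.

    recv_done tracks when the current slot's input has fully arrived;
    comp_done tracks when the current slot's compute has finished.
    """
    recv_done = 0.0
    comp_done = 0.0
    for c, k in zip(comm, compute):
        recv_done += c                             # comm stream keeps streaming
        comp_done = max(comp_done, recv_done) + k  # per-slot wait, then compute
    return comp_done


if __name__ == "__main__":
    comm = [1.0, 1.0, 1.0, 1.0]     # hypothetical transfer time per slot
    compute = [3.0, 3.0, 3.0, 3.0]  # hypothetical compute time per slot
    print("serial    :", serial_time(comm, compute))
    print("overlapped:", overlapped_time(comm, compute))
```

With these numbers the serial schedule takes 16 time units and the overlapped one 13: only the first receive is exposed, and every later transfer hides behind the previous slot's compute.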

Forward chunk: GEMM on the compute stream

The block(...) call on the received activation runs the local pipeline stage's forward: a transformer block with attention, MLP, and (for the MoE blocks) a token dispatch via SM-resident all-to-all. The output is sent to the next rank with a non-blocking isend so we do not stall again. Crucially, this is happening on compute_stream, while comm_stream is already pulling in the next microbatch's bytes. That is the overlap that converts pipeline parallelism from 'mostly idle GPUs' into 'mostly busy GPUs'.

EXECUTION STATE

y = shape [B, S, H], dtype BF16 — forward output of this rank's block

Backward chunk: SAME slot, different microbatch

This is the bidirectional half. In the same slot that handles the forward chunk of stream-A microbatch (t − rank), the rank also handles the backward chunk of stream-B microbatch (t − (R−1 − rank)). In a unidirectional schedule the rank would have only one of these to do, and during ramp-up and ramp-down it would have neither. In this sketch both chunks are enqueued on compute_stream, so their kernels execute back to back within the slot while the network bytes for both directions flow concurrently; a production schedule interleaves the two chunks at a finer grain so that one chunk's communication hides behind the other's compute. This is the line that closes the bubble.

EXECUTION STATE

g_in = shape [B, S, H], BF16 — gradient w.r.t. this rank's INPUT activation, to be sent back to prev_rank

Optimiser step at the end of the diamond

Only one optimiser step per full DualPipe cycle, even though we processed 2M microbatches: this is gradient accumulation at the pipeline level. The Adam moments are sharded across data-parallel ranks (ZeRO-1 in the DeepSeek-V3 setup), so each rank only touches its slice of the master weights, and the gradient reduction across data-parallel replicas happens between cycles. In the DeepSeek-V3 paper the batch size ramps up to 15,360 sequences per step, which at a 4K context is roughly 63M tokens.
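Taking a single optimiser step after many microbatches is ordinary gradient accumulation. The sketch below checks the underlying identity on a toy scalar model y = w·x with a mean-squared-error loss; the model, data and function names are hypothetical and unrelated to DeepSeek's optimiser code.

```python
from __future__ import annotations


def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """d/dw of mean((w * x - y) ** 2) over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list[float], ys: list[float],
                     num_microbatches: int) -> float:
    """Sum per-microbatch gradients, each scaled by 1 / num_microbatches."""
    assert len(xs) % num_microbatches == 0, "microbatches must be equal-sized"
    size = len(xs) // num_microbatches
    total = 0.0
    for i in range(num_microbatches):
        lo, hi = i * size, (i + 1) * size
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / num_microbatches
    return total


if __name__ == "__main__":
    xs = [float(i) for i in range(1, 9)]
    ys = [2.0 * x + 1.0 for x in xs]
    full = grad_mse(0.5, xs, ys)
    accum = accumulated_grad(0.5, xs, ys, num_microbatches=4)
    print(f"full-batch grad = {full:.6f}, accumulated grad = {accum:.6f}")
```

With equal-sized microbatches the two gradients agree up to floating-point rounding, which is why the optimiser can wait until the end of the cycle without changing the update.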

Two CUDA streams: the operational embodiment of DualPipe

Without two distinct CUDA streams, none of the overlap is possible — CUDA serialises within a stream. The compute_stream gets priority −1 (higher priority) so a GEMM that suddenly becomes ready does not get pre-empted by a comms kernel that has plenty of slack. In production, the SM-resident all-to-all replaces the comm_stream entirely: comms becomes a kernel that runs ON SMs alongside the GEMM, and the only fight is for SM occupancy — which is exactly what the third interactive widget above lets you explore.

EXECUTION STATE

compute_stream = CUDA stream for GEMM / attention / norm kernels

comm_stream = CUDA stream for NCCL send/recv (replaced by SM-resident kernels in production)


"""
A minimal DualPipe rank, written in plain PyTorch.

We are simulating ONE rank of an 8-rank DualPipe. In a real run there are
two NCCL streams (one for the left-to-right forward stream, one for the
right-to-left backward stream) and an SM-resident all-to-all kernel feeding
the MoE experts. Here we use vanilla isend/irecv on a single comm stream so
the structure is readable; the production comms path lives inside DeepSeek's
DeepEP library and dispatches the comms onto a tunable slice of SMs that
runs concurrently with the GEMMs.

build_local_block() and block_backward() are placeholders for model code
that is defined elsewhere.
"""

import torch
import torch.distributed as dist
from torch import nn, Tensor

# Assumes torchrun launched the job and init_process_group() has already run.
rank = dist.get_rank()
world = dist.get_world_size()
prev_rank = rank - 1  # not used on rank 0, where stream A enters
next_rank = rank + 1  # not used on the last rank, where stream B enters

# The local pipeline stage, say one transformer block, 7168 hidden.
block: nn.Module = build_local_block()
# Adam states sharded across DP ranks (ZeRO-1 style); irrelevant to the
# pipeline mechanics so we just keep the optimizer.
optim = torch.optim.AdamW(block.parameters(), lr=1.0e-4)

# A microbatch is just an activation tensor of shape [B, S, H].
B, S, H = 4, 4096, 7168
def make_buffer() -> Tensor:
    return torch.empty(B, S, H, dtype=torch.bfloat16, device="cuda")


def dualpipe_step(microbatches_A: list, microbatches_B: list):
    """One DualPipe cycle. microbatches_A flow left->right, B flow right->left.

    Rank 0 reads its inputs from microbatches_A and the last rank reads from
    microbatches_B; every other rank only uses the lists for their length M.
    """
    R, M = world, len(microbatches_A)
    T = M + R - 1
    fwd_recv: dict = {}  # slot -> Work handle of the stream-A irecv
    bwd_recv: dict = {}  # slot -> Work handle of the stream-B irecv
    fwd_buf: dict = {}   # slot -> stream-A input tensor
    bwd_buf: dict = {}   # slot -> stream-B input tensor
    sends: list = []     # isend handles, kept until the end of the cycle

    for t in range(T):
        m_a = t - rank            # stream-A microbatch index at this rank
        m_b = t - (R - 1 - rank)  # stream-B microbatch index at this rank

        # 1. Issue non-blocking comms for whichever streams this slot owns.
        #    Edge ranks take their stream's input from the local list instead.
        with torch.cuda.stream(comm_stream):
            if 0 <= m_a < M:
                if rank == 0:
                    fwd_buf[t] = microbatches_A[m_a]
                else:
                    fwd_buf[t] = make_buffer()
                    fwd_recv[t] = dist.irecv(fwd_buf[t], src=prev_rank)
            if 0 <= m_b < M:
                if rank == R - 1:
                    bwd_buf[t] = microbatches_B[m_b]
                else:
                    bwd_buf[t] = make_buffer()
                    bwd_recv[t] = dist.irecv(bwd_buf[t], src=next_rank)

        with torch.cuda.stream(compute_stream):
            # 2. Wait for THIS slot's inputs (overlap with previous slot's
            #    compute). With NCCL, wait() blocks the current CUDA stream,
            #    so it is issued on the stream that will consume the tensor.
            if t in fwd_recv:
                fwd_recv[t].wait()
            if t in bwd_recv:
                bwd_recv[t].wait()

            # 3. Run the forward and backward chunks for this slot.
            if t in fwd_buf:
                y = block(fwd_buf[t])
                if rank < R - 1:
                    sends.append(dist.isend(y, dst=next_rank))
            if t in bwd_buf:
                # In the real impl, this calls block.backward with the
                # received grad-output. Here block_backward stands in for it.
                g_in = block_backward(block, bwd_buf[t])
                if rank > 0:
                    sends.append(dist.isend(g_in, dst=prev_rank))

        # 4. End of cycle: optimiser step on the FP32 master copy.
        if t == T - 1:
            for work in sends:
                work.wait()  # make sure every send buffer has been consumed
            optim.step()
            optim.zero_grad(set_to_none=True)


# Two CUDA streams = the heart of overlap. In production these are paired
# with NVLink and IB channels respectively, and the MoE all-to-all takes
# its own SM-resident kernel rather than NCCL.
compute_stream = torch.cuda.Stream(priority=-1)
comm_stream    = torch.cuda.Stream(priority=0)

At Massive Scale: Why DeepSeek-V3 Needed DualPipe

DualPipe is not an optimisation you apply to a 7 B-parameter model for fun. It is a load-bearing piece of the DeepSeek-V3 training recipe — without it, the 2-month, 2,048-GPU budget does not close. Four hard constraints pushed the team to invent it.

Activation memory ceiling. With $H = 7168$ , $S = 4096$ , $B = 4$ , one BF16 microbatch activation on one rank is $4 \cdot 4096 \cdot 7168 \cdot 2 \approx 235\,\text{MB}$ . Across 61 transformer blocks and $2M = 64$ microbatches in flight, that is $\sim 900\,\text{GB}$ per rank — more than fits in 80 GB HBM. DualPipe's halved schedule length means you only need to retain activations for $M$ microbatches at a time, not $2M$ , which (combined with selective recomputation) lets the activation footprint fit.
MoE all-to-all bandwidth. Each MoE block moves ~50 MB per microbatch per direction off-node. With 32 microbatches and 61 blocks per step, that is $\sim 200\,\text{GB}$ of off-node traffic per step per rank. At an InfiniBand throughput of 50 GB/s, the dispatch+combine alone would take $\sim 4\,\text{s}$ per step on the critical path — longer than the compute. SM-resident all-to-all is the only thing that hides this cost.
The 14.8 T-token budget. Treating all 671 B parameters as dense, training on 14.8 T tokens (a roughly Chinchilla-optimal ratio) costs $6 N D \approx 6 \cdot 10^{25}$ training FLOPs. At 40% MFU on 2,048 H800s in FP8 (1979 TFLOPS peak per GPU), that is $6 \cdot 10^{25} / (2048 \cdot 1979 \cdot 10^{12} \cdot 0.4) \approx 3.7 \cdot 10^{7}$ seconds, or ~430 days. Most of the gap between that number and the reported ~60-day wallclock comes from MoE sparsity: only about 37 B parameters are active per token, which cuts the FLOP count by roughly 18×. Hitting the budget on what remains still depends on sustaining high utilisation, which is the job of FP8 (section 10.1), DualPipe (this section), and aggressive gradient accumulation. None of those three is optional.
No TP, by design. Tensor parallelism (section 11.3) would normally be the go-to lever for 671 B parameters, but it shreds activation tensors across NVLink with all-reduce operations on every block — adding yet another cross-rank dependency on the critical path. DeepSeek-V3 famously trains without TP: just data, pipeline, and expert parallelism. That decision is only viable because DualPipe makes pipeline parallelism much more efficient than the textbook bubble analysis would suggest.
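The back-of-envelope numbers in the list above can be reproduced in a few lines. This is a sketch under the text's own assumptions (6ND rule, 1979 TFLOPS FP8 peak, 40% MFU, 50 MB per direction, 50 GB/s InfiniBand). The 37 B activated-parameter figure is taken from the DeepSeek-V3 technical report rather than from this chapter, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def training_flops(params: float, tokens: float) -> float:
    """The 6 * N * D rule of thumb for forward plus backward FLOPs."""
    return 6.0 * params * tokens


def wallclock_days(flops: float, gpus: int, peak_flops: float, mfu: float) -> float:
    """Days needed at a sustained rate of gpus * peak_flops * mfu."""
    return flops / (gpus * peak_flops * mfu) / SECONDS_PER_DAY


def activation_bytes(b: int, s: int, h: int, blocks: int, in_flight: int,
                     bytes_per_element: int = 2) -> int:
    """One [b, s, h] activation per block per in-flight microbatch."""
    return b * s * h * bytes_per_element * blocks * in_flight


def all_to_all_seconds(mb_per_direction: float, microbatches: int,
                       blocks: int, link_gb_per_s: float) -> float:
    """Dispatch plus combine traffic per step, divided by link throughput."""
    total_gb = 2 * mb_per_direction * microbatches * blocks / 1000.0
    return total_gb / link_gb_per_s


if __name__ == "__main__":
    tokens = 14.8e12
    dense = training_flops(671e9, tokens)
    active = training_flops(37e9, tokens)  # assumption: 37 B active per token
    for name, flops in (("dense 671 B", dense), ("active 37 B", active)):
        days = wallclock_days(flops, 2048, 1979e12, 0.40)
        print(f"{name}: {flops:.2e} FLOPs, {days:.0f} days")
    print(f"activations: {activation_bytes(4, 4096, 7168, 61, 64) / 1e9:.0f} GB")
    print(f"all-to-all : {all_to_all_seconds(50, 32, 61, 50):.1f} s per step")
```

The dense estimate lands at roughly 425 days, while the activated-parameter estimate is on the order of three weeks at the same assumed MFU, which shows how sensitive the budget is to the choice of N.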

The shape of the training step at scale

For one optimiser step on the DeepSeek-V3 cluster, the wallclock breakdown looks roughly like:

Phase	Wallclock share	Hidden by DualPipe?
Attention GEMMs (FP8)	32%	—
MoE MLP GEMMs (FP8)	38%	—
Dispatch + combine (all-to-all)	0%	Yes — hidden by SM-resident kernel
P2P send/recv (pipeline)	0%	Yes — hidden by 2nd CUDA stream
Layernorm + softmax (BF16)	8%	—
Optimiser step + DP all-reduce	5%	Partially — overlapped across DP groups
Pipeline bubbles	9%	Reduced from 18% by bidirectional schedule
Other (memcpy, kernel launch)	8%	—

The two "0%" rows are the whole story: dispatch/combine and pipeline P2P are off the critical path entirely thanks to DualPipe. Compute is left as the binding side, which is exactly where you want to be, because compute is what you paid for. MFU (model FLOPs utilisation) climbs from a baseline of ~25% on a naïve schedule to ~40% on DualPipe. Measured against the 1979 TFLOPS FP8 peak across 2,048 H800s, that is the difference between a cluster delivering roughly 1,000 PFLOPS and one delivering roughly 1,600 PFLOPS on the same hardware.
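Delivered throughput is just GPU count × per-GPU peak × MFU, so the figure depends on which peak the MFU is measured against. The sketch below assumes 2,048 GPUs and two per-GPU dense peaks: the 1979 TFLOPS FP8 figure used earlier in this section, and an assumed 989.5 TFLOPS for BF16 (half the FP8 figure, not stated in this text).

```python
def delivered_pflops(gpus: int, peak_tflops: float, mfu: float) -> float:
    """Sustained cluster throughput in PFLOPS for a given MFU."""
    return gpus * peak_tflops * mfu / 1000.0


if __name__ == "__main__":
    peaks = {"FP8 dense": 1979.0, "BF16 dense (assumed)": 989.5}
    for name, peak in peaks.items():
        low = delivered_pflops(2048, peak, 0.25)
        high = delivered_pflops(2048, peak, 0.40)
        print(f"{name}: {low:.0f} PFLOPS at 25% MFU, {high:.0f} PFLOPS at 40% MFU")
```

Against the FP8 peak the two MFU levels correspond to about 1,000 and 1,600 PFLOPS; against the assumed BF16 peak they correspond to about 500 and 800 PFLOPS. The ratio between the two schedules is the same either way.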

Engineering Reality: What DualPipe Costs You

DualPipe is not a free lunch. Four engineering costs you sign up for the moment you turn it on:

Custom CUDA. SM-resident all-to-all is not part of stock NCCL or PyTorch as of mid-2026. DeepSeek has open-sourced its communication kernels as the DeepEP library, but adapting them to your own stack and interconnect still requires a small team of CUDA engineers comfortable with NVSHMEM, GPUDirect RDMA, and warp-level shuffles. Expect 1–2 engineer-quarters to ship a production-quality version.
Topology lock-in. DualPipe assumes a symmetric ring of ranks with equal compute capacity and equal link bandwidth between adjacent ranks. On a heterogeneous cluster (e.g. mixed A100 and H100) the bidirectional schedule degrades because the slower link becomes the binding side of every overlapped cell. The algorithm is most efficient on a dedicated, uniform pod — which is exactly what frontier labs build.
Memory pressure on the diamond edges. At the first and last few slots of every cycle, the two streams have not fully crossed yet, so the rank is only running one stream at a time but has buffers allocated for both. Peak memory is ~25% higher than 1F1B at the same $M$ . The fix is to size $M$ conservatively and rely on the schedule's bubble reduction to recover throughput.
Debugging is harder. "The loss spiked at step 47k" with a 1F1B schedule usually points to a data shard or a numerical issue. With DualPipe it can also be: a stale send buffer from a previous diamond, an SM-budget mistune (the comms kernel evicted by a compute kernel), or a race in the dispatch/combine pair when an expert overflows its token capacity. New failure modes, new tooling.

Against those four costs, the benefits compound:

Benefit	Magnitude vs 1F1B + NCCL	Where it shows up
Pipeline bubble	~½×	GPU-hours per training run
MoE all-to-all latency	≈ 0	MFU and wallclock
Activation memory	~½×	Max sequence length, max # microbatches
MFU	+15–25 percentage points	End-to-end training cost
Engineering complexity	+1 quarter (one-time)	Up-front team cost

The deepest lesson. Every large jump in scale has historically required a co-design step: activation checkpointing to fit deeper models in memory, mixed-precision training to exploit tensor cores, the 1F1B schedule in Megatron-LM to shrink pipeline bubbles, FP8 to raise arithmetic throughput again. DeepSeek-V3 needed DualPipe. The pattern is always the same: a constraint that was invisible at small scale becomes binding at frontier scale, and the answer is never a marginal improvement of the previous tool but a structural rethink of how compute and communication interleave. The next 10× of scale will require its own DualPipe, quite likely a co-design with optical interconnects, FP4 precision, and asynchronous optimisation. The skill the chapter is trying to teach is not "use DualPipe"; it is to learn to look for the critical-path latency that nobody else is looking at, and to rewrite the schedule around it.