Boo-AI — Master Artificial Intelligence by Building from Scratch

A 671 B-parameter model does not fit on a single GPU. The first obvious response — chop the layers across GPUs and call them in order — works, but throws away most of the GPU time you just paid for. Pipeline parallelism is the family of scheduling tricks that get that wasted time back. This section is about why the bubble exists, where the math says it has to be, and the two schedules — GPipe and 1F1B — that bound it. Both are the floor that DualPipe will beat in section 11.5.

The Real Problem: When a Model Will Not Fit on One GPU

Sections 11.2 and 11.3 dealt with two ways to split work across GPUs. Data parallelism keeps a full copy of the model on every GPU and shards the batch; tensor parallelism keeps the batch whole and shards each individual matmul. Both have a hard ceiling. Data parallelism needs the entire model to fit on one device — at 671 B parameters that is impossible on any GPU made today. Tensor parallelism can split a matmul across a handful of GPUs inside a node, but the all-reduces it triggers on every layer pin it to NVLink-class bandwidth: try to extend TP across a node boundary and you immediately spend more time talking than computing.

Pipeline parallelism takes a different cut. Instead of slicing a single layer across GPUs, it slices the whole stack of layers: the first $L/P$ layers live on GPU 0, the next $L/P$ layers on GPU 1, and so on for $P$ stages. The forward pass is a linear chain: activations from stage 0 feed into stage 1, then 2, then 3. The backward pass is the same chain in reverse.

Written like that, the picture looks fine. A 671 B model that does not fit on one GPU fits comfortably on 16 stages of 42 B each. The communication is cheap — only the boundary activation between adjacent stages crosses the wire, which is dozens of MB per microbatch, not GB. And because each GPU only stores 1/P of the parameters, the optimiser and gradient memory drop by the same factor. Compared to TP, the comms bill is one to two orders of magnitude lower per layer.

The problem appears the moment you try to time it. With one batch in flight, stage 0 does its forward, then sits idle while stages 1, 2, 3 finish theirs. Then stages 1, 2, 3 sit idle while stage 0 does its backward. Out of P stages, only one is doing useful work at any given moment. For a four-stage pipeline, the GPU you bought for $30,000 is busy 25% of the time. For sixteen stages, 6%. This is the first incarnation of the bubble.

The bubble is the dimensionless number that drives everything downstream in this chapter. Every pipeline-parallel paper since GPipe — including PipeDream, Megatron-LM's interleaved 1F1B, and DeepSeek V3's DualPipe — exists to push this number down. The rest of the section derives where it comes from and why the two textbook schedules cannot eliminate it on their own.

Intuition: An Assembly Line With One Worker Awake at a Time

Imagine a car factory with four stations, one worker each. A car visits station 0 (chassis), then 1 (engine), then 2 (electronics), then 3 (paint). Each step takes the same amount of time. If the factory only works on one car at a time, every other station sits idle: while station 1 installs the engine, the chassis worker has nothing to do, and the electronics and paint workers have not yet seen this car. Four workers, one car-station of useful work per hour. That is the naive pipeline.

The fix every real factory uses is obvious: keep multiple cars in the line at once. As soon as car 0 leaves station 0 for station 1, car 1 enters station 0. By the time car 0 is at station 3, the line holds four cars at once and every station is busy. The line is full. The cost of having paid for four stations is finally being paid back.

Microbatching is exactly this trick, with a twist. A car is a microbatch — a small slice of the global batch. The factory line is the pipeline. The forward pass through the four stages is the four stations. So far so good. But neural-network training has a backward pass too: once a car reaches station 3, it has to come back through stations 2, 1, 0 in reverse to deliver gradients. So the factory is now bidirectional: cars stream forward, and gradients stream back. The forward queue and backward queue have to share the same four workers. The schedule you choose decides how cleanly forward and backward dovetail — and the leftover gaps are the bubble.

Two schedules, one floor. GPipe is the "forward all, then backward all" schedule. 1F1B (one forward, one backward) interleaves them as soon as it can. Both achieve the same arithmetic bubble fraction. They differ in memory: GPipe must keep every forward's activations alive until backward starts; 1F1B drains them as soon as possible. For a 671 B model with M = 64 microbatches, that is the difference between OOM and a clean training run.

The Mathematics of the Bubble

Let the pipeline have $P$ stages and the global batch be chopped into $M$ microbatches. Let one forward on one stage take $t_F$ seconds and one backward on one stage take $t_B$. In what follows we use the standard assumption $t_B = 2 \, t_F$, because backward computes both the activation gradient and the weight gradient.

Naive (no microbatching)

A single microbatch goes through all $P$ stages forward and back. Total wall time: $T_{\text{naive}} = P \, t_F + P \, t_B$. Active GPU-time is also $P \, t_F + P \, t_B$ (one stage works at a time). Total GPU-time available is $P \cdot T_{\text{naive}}$. So the bubble fraction is $f_{\text{naive}} = 1 - \frac{P(t_F + t_B)}{P \cdot P(t_F + t_B)} = 1 - \frac{1}{P}$.

For $P = 4$ that is 75%. For $P = 16$ it is 94%. Almost all GPU time is wasted. No amount of TP, FSDP, or fast NVLink can fix a schedule this bad — the only thing that can is filling the pipeline with more microbatches.

GPipe

Forward of every microbatch streams down the pipeline in lockstep. The first microbatch leaves stage $P-1$ at time $P \, t_F$; the last one leaves at $(M + P - 1) \, t_F$. Then the backward sweep starts, taking a symmetric $(M + P - 1) \, t_B$ seconds. Total wall time: $T_{\text{gpipe}} = (M + P - 1)(t_F + t_B)$.

Active GPU-time is $M \, P (t_F + t_B)$, since each of the $M$ microbatches touches all $P$ stages, both forward and backward. Total GPU-time available is $P \cdot T_{\text{gpipe}}$. Bubble fraction: $f_{\text{gpipe}} = 1 - \frac{M \, P (t_F + t_B)}{P \cdot (M + P - 1)(t_F + t_B)} = \frac{P - 1}{M + P - 1}$.

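Two ways to read this formula. (1) For fixed $P$, doubling $M$ roughly halves the bubble as long as $M \gg P$. (2) For fixed $M$, growing the pipeline deeper makes the bubble grow roughly linearly in $P$ while $P \ll M$; it saturates towards 1 once $P$ overtakes $M$. To keep the bubble at or below 10% you need $M \geq 9 (P - 1)$ microbatches, which follows from solving $(P-1)/(M+P-1) \leq 0.1$ for $M$.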
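The readings above can be checked with a few lines of exact arithmetic. This is a minimal sketch: the helper names (`naive_bubble`, `gpipe_bubble`, `min_microbatches`) are made up for illustration and only encode the closed forms derived in this section.

```python
from fractions import Fraction
import math


def naive_bubble(P):
    """Bubble fraction with one microbatch in flight: 1 - 1/P."""
    return Fraction(P - 1, P)


def gpipe_bubble(P, M):
    """Bubble fraction (P - 1) / (M + P - 1), shared by GPipe and 1F1B."""
    return Fraction(P - 1, M + P - 1)


def min_microbatches(P, target):
    """Smallest M with gpipe_bubble(P, M) <= target (target given as a Fraction)."""
    # (P-1)/(M+P-1) <= t  is equivalent to  M >= (P-1) * (1 - t) / t
    return math.ceil((P - 1) * (1 - target) / target)


for P in (4, 16):
    m_needed = min_microbatches(P, Fraction(1, 10))
    print(f"P={P:2d}  naive={float(naive_bubble(P)):.3f}  "
          f"M for <=10% bubble: {m_needed}  "
          f"check={float(gpipe_bubble(P, m_needed)):.3f}")
```

Using `Fraction` keeps the threshold comparison exact, so the boundary case where the bubble equals the target is not lost to floating-point rounding.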

1F1B

1F1B does not change the wall time relative to GPipe: once the pipeline is full the cadence is the same, and once it drains the tail is the same. So $f_{\text{1f1b}} = f_{\text{gpipe}} = \frac{P - 1}{M + P - 1}$.

What does change is the peak number of microbatches whose forward activations have to be held in memory. In GPipe, stage 0 holds all $M$ forwards until the backward sweep begins. In 1F1B, stage $s$ holds at most $P - s$: stage 0 holds $P$, stage $P - 1$ holds 1. For a 671 B model with $M = 64$ microbatches and $P = 16$ stages, this is the difference between 16 microbatches of activations at stage 0 (1F1B) and 64 (GPipe), a 4× memory saving at the worst-case stage that, in practice, is the only reason the model fits at all.
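The peak counts quoted above can be reproduced by replaying each stage's op order and counting forwards that have not yet been matched by a backward. This is a small sketch, assuming the 1F1B queue is a warm-up of P − 1 − s forwards followed by strict F/B alternation and a drain; the function names are illustrative.

```python
def one_f_one_b_queue(P, M, s):
    """Op order on stage s: warm-up forwards, F/B alternation, then drain."""
    warmup = min(P - 1 - s, M)
    queue = [("F", i) for i in range(warmup)]
    f_idx, b_idx = warmup, 0
    while f_idx < M:
        queue.append(("F", f_idx)); f_idx += 1
        queue.append(("B", b_idx)); b_idx += 1
    queue.extend(("B", i) for i in range(b_idx, M))
    return queue


def gpipe_queue(M):
    """Op order on any stage: all forwards, then all backwards in reverse."""
    return [("F", i) for i in range(M)] + [("B", i) for i in reversed(range(M))]


def peak_in_flight(queue):
    """Largest number of forwards whose backward has not run yet."""
    live = peak = 0
    for op, _ in queue:
        live += 1 if op == "F" else -1
        peak = max(peak, live)
    return peak


P, M = 16, 64
peaks_1f1b = [peak_in_flight(one_f_one_b_queue(P, M, s)) for s in range(P)]
peak_gpipe = peak_in_flight(gpipe_queue(M))
print("1F1B peak per stage:", peaks_1f1b)
print("GPipe peak per stage:", peak_gpipe)
```

Counting along the queue is enough here because a stage runs its ops strictly in queue order, so the running count of unmatched forwards is the number of activation sets it must keep alive.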

Why GPipe and 1F1B share the same bubble. The bubble is a function of two things only: how long it takes to fill the pipeline at the start, and how long it takes to drain it at the end. Both schedules pay the same fill and drain cost — they just disagree about what to do during the middle. The middle is fully busy in both cases. The disagreement shows up in memory, not in time.

Manual Numerical Walkthrough

Set $P = 4$ stages, $M = 8$ microbatches, $t_F = 1$ unit, $t_B = 2$ units. We will compute the three bubble fractions by hand and confirm them against the simulator.

Bubble fraction for P = 4, M = 8: three schedules

Naive. Single microbatch through 4 stages forward (4 units) and back (8 units), for a total of $T = 12$. Active per GPU is $1 + 2 = 3$; across 4 GPUs that is $12$. Total GPU-time = $4 \cdot 12 = 48$. Bubble = $48 - 12 = 36$, fraction = $36/48 = 0.75$. With $M = 8$ separate microbatches done one at a time, every number scales by 8, so the fraction stays $1 - 1/P = 0.75$.

GPipe. Forward sweep: $(M + P - 1) \, t_F = 11$ units. Backward sweep: $(M + P - 1) \, t_B = 22$ units. Total wall time $T = 33$. Active work = forward + backward across all stages and microbatches = $M \cdot P \cdot (t_F + t_B) = 8 \cdot 4 \cdot 3 = 96$. Total GPU-time = $P \cdot T = 4 \cdot 33 = 132$. Bubble = $132 - 96 = 36$, fraction = $36/132 = 3/11 \approx 0.273$. Cross-check with the formula: $(P-1)/(M+P-1) = 3/11$, which matches.

1F1B. Same wall time, same active work, so the fraction is also $3/11$. What differs is memory: GPipe holds 8 microbatches of activations at stage 0; 1F1B holds at most 4. A back-of-the-envelope check: stage 0 warms up with $P - 1 = 3$ forwards and then alternates, so its 1F1B queue is F0, F1, F2, F3, B0, F4, B1, F5, B2, F6, B3, F7, B4, B5, B6, B7. At the moment F3 finishes, B0 has not yet run, because the gradient for microbatch 0 is still travelling back through stages 3, 2, 1. From then on every new forward is preceded by a backward that frees one microbatch. Peak in-flight forwards at stage 0 = 4.

What the numbers tell us. Going from naive to microbatched dropped the bubble from 75% to 27.3%: a pure scheduling win, no hardware change. Doubling M to 16 would drop it further to $3/19 \approx 16\%$. Doubling again to M = 32 would drop it to $3/35 \approx 8.6\%$. At some point the microbatch becomes small enough that per-microbatch overhead (kernel launches, comms latency) starts dominating, typically around M / P ≈ 4. That is why real 671 B-class training uses M between 32 and 64, not 1024.
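The diminishing return can be made concrete with a toy cost model. The sketch below assumes a fixed amount of forward work per stage per step (`work`), split evenly across M microbatches, plus a fixed overhead per op (`overhead`). Both numbers are invented for illustration and are not measurements; only the shape of the curve matters.

```python
def step_time(P, M, work, overhead):
    """Wall time of one step under the (M + P - 1)(t_F + t_B) formula,
    with per-op overhead added to each forward and each backward."""
    t_f = work / M + overhead          # forward: share of the work + fixed cost
    t_b = 2 * work / M + overhead      # backward: twice the work + fixed cost
    return (M + P - 1) * (t_f + t_b)


P = 16
work, overhead = 100.0, 0.5            # hypothetical units, chosen for the example
candidates = [2 ** k for k in range(11)]   # M = 1, 2, 4, ..., 1024

for M in candidates:
    print(f"M={M:5d}  step_time={step_time(P, M, work, overhead):9.2f}")

best_m = min(candidates, key=lambda M: step_time(P, M, work, overhead))
print("best M among candidates:", best_m)
```

Under these made-up constants the step time falls as M grows, bottoms out, and then rises again once the fixed per-op cost outweighs the shrinking bubble. Where the minimum lands depends entirely on the ratio of `work` to `overhead`, which has to be measured on real hardware.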

Visualising the Schedule

The widget below renders the per-stage schedule as a Gantt chart. Each row is a GPU; each blue block is a forward microbatch (1 time unit), each amber block is a backward (2 time units), each striped red area is a bubble. The legend reports the bubble fraction in real time so you can match it against the formulas above.


Three experiments to run inside the widget:

With the schedule set to Naive and P = 4, watch how three out of four lanes are striped red at any given moment. Bump P to 6 and five out of six stages are idle at any moment. This is why you cannot ship a real LLM with a naive pipeline.
Switch to GPipe at P = 4, M = 4. Bubble ≈ 43% — the microbatching helps, but not enough. Now crank M to 16. Bubble drops to ≈ 16%. The fill and drain bookends are the same; the middle is fully busy. This is the GPipe formula made visible.
Switch to 1F1B with the same numbers. The arithmetic bubble fraction is identical to GPipe, but look at the stage 0 row: forwards no longer pile up before backwards start. Activation memory at stage 0 is bounded by P − 0 = P, not M. This is the unlock that lets you raise M to shrink the bubble without OOM, and it is the baseline DualPipe builds on.

Plain Python: Simulate the Pipeline by Hand

Before we wire up PyTorch, let us write the scheduler the same way the visualiser does it. The point of this code is to make the schedule itself a first-class object — the queue per stage and the dependencies between them — so you can reason about the bubble without worrying about CUDA, gradients, or DDP.

pipeline_schedule.py


Why a simulator instead of timing real GPUs

Real pipeline kernels live inside Megatron-LM, DeepSpeed, and torchtitan. To see the bubble structure cleanly we need something more honest: an explicit per-stage queue with explicit dependencies, where the only thing time depends on is the schedule itself. The entire model is integer arithmetic on event start/end times — no NumPy, no CUDA.

Time units: t_F = 1, t_B = 2

Backward takes roughly twice forward because it computes two gradients: dL/dactivation (so the next stage can keep going) and dL/dweight (for the optimiser). The exact ratio depends on the layer mix; 2× is the convention from the GPipe paper and good enough for visualising the bubble.

EXECUTION STATE

T_F = 1

T_B = 2

Naive queues: one mb at a time, full sweep each

Forward sweeps stages 0 → P − 1, then backward sweeps P − 1 → 0. Microbatches do not overlap. The next microbatch only enters stage 0 after the previous one has come all the way back. This is what happens if you naively split a model with .to('cuda:i') and write a normal training loop.

LOOP TRACE · 2 iterations

mb=0

q[0] = [('F',0), ('B',0)]

mb=1

q[0] = [('F',0), ('B',0), ('F',1), ('B',1)]

GPipe queues: forward fan-out, then backward fan-out

Each stage first receives every microbatch's forward in order 0, 1, …, M − 1, then every microbatch's backward in reverse order M − 1, …, 0. The reverse on backward matters: the last microbatch to finish forward at stage P − 1 is the one that can immediately start backward there. Gradients are accumulated into each parameter's .grad buffer in the same reverse order; the optimiser only sees the sum.

LOOP TRACE · 2 iterations

all forwards

q[s] = F0, F1, ..., F_{M-1}

all backwards

q[s] = ..., B_{M-1}, ..., B0

1F1B queues: the trick that bounds memory

Stage s warms up by doing (P − 1 − s) forwards before any backward. That number is exactly the number of microbatches that have to be in flight before this stage can start its first backward. After warmup the stage alternates F and B until all M forwards are done; then it drains the remaining backwards. Result: stage 0 holds activations for P microbatches at peak; stage P − 1 holds activations for 1.

EXECUTION STATE

warmup(s) = P − 1 − s

peak in-flight at s = P − s (vs M in GPipe)

simulate: the event loop

We walk every per-stage queue in lockstep. At each pass we ask each stage: 'can the next op in your queue run now?' It can if every op it depends on has already been scheduled, so that its end time is known. The op then starts at the later of (a) the moment the stage's clock is free and (b) the dependency's end time. We advance the stage's clock, record the op's end time so other stages can see it, and repeat.

EXECUTION STATE

ptr[s] = index of next op in queue s

busy[s] = stage s's clock — when it is free again

f_end = map (stage, mb) → forward end time

b_end = map (stage, mb) → backward end time

LOOP TRACE · 3 iterations

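first pass

scheduled = F(0, 0), F(1, 0), …, F(P − 1, 0) in one sweep: each stage finds its dependency already recorded, so microbatch 0 gets start times 0, 1, …, P − 1

second pass

scheduled = the next op in each queue whose dependency is recorded, e.g. F(0, 1) on stage 0 and F(1, 1) on stage 1 under GPipe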

...

scheduled = pipeline fills up

Dependency check: three cases, one line each

The three guard clauses encode the entire pipeline-parallel data flow: F on stage s > 0 needs the previous stage's forward; B on the last stage needs that stage's own forward; B on any earlier stage needs the next stage's backward. If any one of these is missing we skip this stage this pass and let another stage make progress first.

bubble_stats: what the schedule cost

Wall-clock T is the max end time. Active time is the sum of every op's duration — that is the total useful work the P GPUs did. Bubble = P · T − active is the total GPU-time that was wasted on idle. Fraction = bubble / (P · T) is the dimensionless number that goes into your scaling spreadsheet.

EXECUTION STATE

P · T = total GPU-time available

active = F's + B's actually run

bubble = P · T − active

Running it: the headline result for P = 4, M = 8

Output: naive prints fraction = 0.750 (the 8 microbatches done serially, matching 1 − 1/P), GPipe ≈ 0.273, 1F1B ≈ 0.273. The bubble shrank from three-quarters to just over a quarter by scheduling alone: no extra GPUs, no extra compute, just a smarter per-stage queue. The two well-known schedules differ in memory, not in wall time, which is exactly what the math predicts.

EXECUTION STATE

naive fraction = 0.750

gpipe fraction = ≈ 0.273

1f1b fraction = ≈ 0.273

"""
Pipeline-parallel schedule simulator.

We model a network with P stages (one per GPU) and a batch chopped into
M microbatches. Forward on a stage takes t_F time units, backward takes
t_B (typically ~2 * t_F because backward computes both the activation
gradient and the weight gradient).

A 'schedule' is a per-stage queue of ops, where each op is ('F'|'B', mb_idx).
We simulate the queues by stepping events one at a time, respecting:

  - a stage can only run one op at a time,
  - F(s, mb) requires F(s-1, mb) to be finished (forward flows 0 -> P-1),
  - B(s, mb) requires:
        F(P-1, mb) if s == P-1   (start backward at the last stage)
        B(s+1, mb) otherwise     (backward flows P-1 -> 0).
"""

T_F, T_B = 1, 2          # forward / backward duration per stage per microbatch


def build_naive_queues(P, M):
    """One microbatch sweeps the whole pipeline, then the next."""
    queues = [[] for _ in range(P)]
    for mb in range(M):
        for s in range(P):           queues[s].append(("F", mb))
        for s in range(P-1, -1, -1): queues[s].append(("B", mb))
    return queues


def build_gpipe_queues(P, M):
    """Forward all M microbatches across stages, then backward all of them."""
    queues = [[] for _ in range(P)]
    for mb in range(M):
        for s in range(P): queues[s].append(("F", mb))
    for mb in reversed(range(M)):
        for s in reversed(range(P)): queues[s].append(("B", mb))
    return queues


def build_1f1b_queues(P, M):
    """Each stage warms up (P-1-s) forwards, then alternates F and B,
       then drains the remaining backwards. Memory ceiling: P in-flight
       activations per stage instead of M."""
    queues = [[] for _ in range(P)]
    for s in range(P):
        warmup = max(0, min(P - 1 - s, M))
        f_idx, b_idx = 0, 0
        for _ in range(warmup):
            queues[s].append(("F", f_idx)); f_idx += 1
        while f_idx < M:
            queues[s].append(("F", f_idx)); f_idx += 1
            queues[s].append(("B", b_idx)); b_idx += 1
        while b_idx < M:
            queues[s].append(("B", b_idx)); b_idx += 1
    return queues


def simulate(queues, P, t_F=T_F, t_B=T_B):
    """Walk every queue in lockstep, scheduling the next op whenever its
       dependencies and the stage's busy clock both allow it."""
    ptr      = [0] * P
    busy     = [0] * P
    f_end    = {}              # (stage, mb) -> end time
    b_end    = {}
    events   = []
    total    = sum(len(q) for q in queues)
    done     = 0
    while done < total:
        progressed = False
        for s in range(P):
            if ptr[s] >= len(queues[s]): continue
            op, mb = queues[s][ptr[s]]
            if op == "F" and s > 0 and (s-1, mb) not in f_end: continue
            if op == "B" and s == P-1 and (s, mb) not in f_end: continue
            if op == "B" and s <  P-1 and (s+1, mb) not in b_end: continue
            dep = (f_end.get((s-1, mb), 0) if op == "F" and s > 0
                   else f_end.get((s, mb),   0) if op == "B" and s == P-1
                   else b_end.get((s+1, mb), 0) if op == "B"
                   else 0)
            start = max(busy[s], dep)
            dur   = t_F if op == "F" else t_B
            end   = start + dur
            busy[s] = end
            (f_end if op == "F" else b_end)[(s, mb)] = end
            events.append((s, op, mb, start, end))
            ptr[s] += 1
            done   += 1
            progressed = True
        if not progressed:
            raise RuntimeError("schedule deadlock")
    return events


def bubble_stats(events, P):
    total_time = max(e[4] for e in events)
    active     = sum(e[4] - e[3] for e in events)
    bubble     = P * total_time - active
    return total_time, bubble, bubble / (P * total_time)


if __name__ == "__main__":
    P, M = 4, 8
    for name, builder in [("naive", build_naive_queues),
                          ("gpipe", build_gpipe_queues),
                          ("1f1b ", build_1f1b_queues)]:
        ev = simulate(builder(P, M), P)
        T, B, frac = bubble_stats(ev, P)
        print(f"{name}  wall={T:3d}  bubble={B:3d}  fraction={frac:.3f}")

Running this for $P = 4$, $M = 8$ prints:

Schedule	Wall-clock units	Bubble units	Bubble fraction
naive	96	288	0.750
gpipe	33	36	0.273
1f1b	33	36	0.273

The naive fraction accounts for M separate microbatches done serially; the GPipe and 1F1B fractions match the closed form $(P-1)/(M+P-1) = 3/11 = 0.273$ exactly. The schedule object you just built in Python is the same schedule that Megatron-LM and DeepSpeed implement on actual GPUs.
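As an independent cross-check of the closed form, the GPipe wall time can also be computed with a plain recurrence over end times instead of an event loop. The sketch below is a re-derivation under the same assumptions as the simulator; the function `gpipe_wall_time` is not part of pipeline_schedule.py and covers only the GPipe order (all forwards, then backwards in reverse microbatch order).

```python
def gpipe_wall_time(P, M, t_f=1, t_b=2):
    """End time of the last backward on stage 0 under the GPipe order."""
    # Forward: F(s, m) waits for F(s-1, m) and for F(s, m-1) on the same stage.
    f_end = [[0] * M for _ in range(P)]
    for s in range(P):
        for m in range(M):
            from_prev_stage = f_end[s - 1][m] if s > 0 else 0
            from_prev_mb = f_end[s][m - 1] if m > 0 else 0
            f_end[s][m] = max(from_prev_stage, from_prev_mb) + t_f

    # Backward: each stage first finishes all of its forwards, then runs
    # B(s, M-1), ..., B(s, 0); B(s, m) waits for B(s+1, m), or for
    # F(P-1, m) on the last stage.
    b_end = [[0] * M for _ in range(P)]
    for s in reversed(range(P)):
        stage_free = f_end[s][M - 1]
        for m in reversed(range(M)):
            dep = f_end[s][m] if s == P - 1 else b_end[s + 1][m]
            b_end[s][m] = max(stage_free, dep) + t_b
            stage_free = b_end[s][m]
    return max(b_end[0])


for P, M in [(4, 8), (4, 16), (16, 64)]:
    wall = gpipe_wall_time(P, M)
    closed_form = (M + P - 1) * (1 + 2)
    print(f"P={P:2d} M={M:2d}  recurrence={wall}  closed form={closed_form}")
```

For P = 4, M = 8 this gives 33 time units, the same wall time the simulator reports for GPipe.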

PyTorch: A Real Two-Stage Pipeline End-to-End

Now let us wire the schedule into a real autograd graph. We keep both stages on CPU so the code runs anywhere, but the cross-stage handshake — detach on forward, manually re-inject the gradient on backward — is exactly what a multi-GPU pipeline does over NCCL.

pipeline_two_stage_gpipe.py


Two stages, one model: the pipeline-parallel idea

A real model would be split layer-by-layer into more stages, but the boundary mechanism is the same: stage k's output is stage k+1's input. We use a tiny MLP on each stage so the example fits in one screen. In a real distributed launch you would do stage0.to('cuda:0'), stage1.to('cuda:1'), and the .detach() crossing below would also do a torch.distributed.send to the next rank.

Stage definition

Each stage is a self-contained nn.Module with its own parameters and its own optimiser. Because the two stages live in (logically) different processes, they cannot share an autograd graph — every cross-stage tensor has to be detached and re-attached, which is exactly what the boundary handshake below does.

EXECUTION STATE

stage0.parameters() = two Linears + biases on rank 0

stage1.parameters() = two Linears + biases on rank 1

M = 4 microbatches of size B = 4

The whole batch (M · B = 16) is chopped into M slices. Each slice is what flows through one F or B in the schedule we built in the Python simulator. For a fixed number of stages, M is the knob that shrinks the bubble; see the GPipe formula above.

EXECUTION STATE

M = 4

B = 4

x_full = shape (16, 8)

Forward microbatch i through stage 0

Standard PyTorch forward: autograd records a graph from stage0's weights to h_i (x_i itself does not require grad). h_i is not a leaf; it is the tensor handed to the cross-stage boundary, and the gradient signal that comes back later enters stage0's graph through it.

EXECUTION STATE

x_i = (B, dim) microbatch input

h_i = (B, dim) stage-0 output, autograd-tracked

The boundary handshake: detach + requires_grad_(True)

This single line is the entire reason pipeline parallelism works. h_i.detach() cuts the autograd graph so that stage1's forward will build a brand-new graph rooted at h_i_detached. .requires_grad_(True) tells autograd to track grads at this new leaf, so that when we eventually .backward() on stage1, h_i_detached.grad will contain the upstream gradient we need to ship back to stage0. In a multi-GPU setup, this is the moment you call torch.distributed.send(h_i, dst=rank+1).

EXECUTION STATE

h_i_detached = (B, dim), no graph upstream, grads ON

Stage 1 forward + loss

Same as any normal MLP forward, but the input is the detached boundary tensor. The loss for this microbatch is stashed in a list so we can .backward() it later — that is the deferred-backward part of the GPipe schedule.

EXECUTION STATE

o_i = (B, dim) stage-1 output

loss_i = scalar

Backward phase: reverse microbatch order

GPipe issues backwards in reverse forward order. The reason is that on the last stage, the last microbatch's forward is the most recently finished, so its backward can start immediately. Iterating in reverse minimises the warm-up bubble of the backward sweep.

First backward: fills h_i_detached.grad

loss_i.backward() walks the stage-1 graph backward, deposits parameter grads into stage1's .grad fields, and (crucially) writes into h_i_detached.grad — the upstream gradient at the boundary. That gradient is the message we now have to send back across the pipe.

EXECUTION STATE

h_i_detached.grad = (B, dim) upstream gradient

stage1.*.grad = filled with this microbatch's contribution

Second backward: ship the boundary gradient into stage 0

h_i.backward(h_i_detached.grad) treats h_i_detached.grad as the gradient flowing into h_i from outside. PyTorch then unrolls stage 0's graph and accumulates grads into stage0's parameters. In a multi-GPU setup, the right-hand side is what you torch.distributed.recv into from rank k+1 before calling this line.

EXECUTION STATE

stage0.*.grad = accumulated upstream grads

Optimiser step: once per training step, across all microbatches

Both stages' .grad fields now hold the sum over all M microbatches' contributions. One optimiser step on each stage finishes the pipeline-parallel training step. In real code the step would be guarded by an all-reduce inside data-parallel groups too — see Chapter 11 section 2.

import torch
import torch.nn as nn

# ---- Two halves of the model, each on its own (logical) device ----------
# In a real distributed launch, stage0 lives on rank 0 and stage1 on rank 1.
# Here we keep both on CPU so the code runs anywhere; the schedule and the
# .detach() / .requires_grad_() handshake at the boundary are identical to
# the real cross-GPU version.

class Stage(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
dim, M, B = 8, 4, 4                  # M microbatches of size B each
stage0 = Stage(dim)
stage1 = Stage(dim)
opt0   = torch.optim.SGD(stage0.parameters(), lr=0.01)
opt1   = torch.optim.SGD(stage1.parameters(), lr=0.01)

x_full = torch.randn(M * B, dim)
y_full = torch.randn(M * B, dim)
loss_fn = nn.MSELoss(reduction="sum")

# ---- GPipe schedule: forward all M, then backward all M -----------------
acts_at_boundary = []    # we must keep stage-0 outputs alive across the F pass
losses           = []

# Phase 1: forward every microbatch through both stages
for i in range(M):
    x_i = x_full[i*B:(i+1)*B]
    y_i = y_full[i*B:(i+1)*B]
    h_i = stage0(x_i)                                  # forward on stage 0
    h_i_detached = h_i.detach().requires_grad_(True)   # boundary handshake
    o_i = stage1(h_i_detached)                         # forward on stage 1
    losses.append(loss_fn(o_i, y_i))
    acts_at_boundary.append((h_i, h_i_detached))

# Phase 2: backward every microbatch (reverse order: last to finish F is
# first to start B; that is what minimises the warm-up bubble)
for i in reversed(range(M)):
    loss_i = losses[i]
    loss_i.backward()                  # this fills h_i_detached.grad
    h_i, h_i_detached = acts_at_boundary[i]
    h_i.backward(h_i_detached.grad)    # ship grad back into stage 0

opt0.step(); opt0.zero_grad()
opt1.step(); opt1.zero_grad()

print("microbatches processed:", M)
print("avg loss:", float(sum(losses)) / M)

The mechanism that makes pipeline parallelism work at all sits on one line: h.detach().requires_grad_(True). This is the moment the autograd graph is cut between stages. Without the cut, every backward would propagate all the way to the inputs in one go, which is impossible across devices and processes. With the cut, the boundary tensor on the receiving side becomes a fresh leaf, its .grad field will be filled by stage 1's backward, and we can ship that gradient back across the wire ourselves with h.backward(boundary.grad).
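A quick way to convince yourself that the cut loses nothing is to compare the hand-stitched gradients against ordinary end-to-end autograd. The sketch below does this on two tiny layers; the layer sizes and variable names are arbitrary choices for the example, and everything stays on CPU in a single process.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
stage0 = nn.Linear(4, 4)
stage1 = nn.Linear(4, 1)
params = list(stage0.parameters()) + list(stage1.parameters())
x = torch.randn(3, 4)

# Reference: one connected autograd graph across both stages.
# torch.autograd.grad returns the gradients without touching .grad.
ref_loss = stage1(torch.relu(stage0(x))).sum()
ref_grads = torch.autograd.grad(ref_loss, params)

# Pipeline style: cut the graph at the boundary, then stitch by hand.
h = torch.relu(stage0(x))                    # stage-0 output, graph attached
boundary = h.detach().requires_grad_(True)   # fresh leaf seen by stage 1
loss = stage1(boundary).sum()
loss.backward()                              # fills stage1 grads and boundary.grad
h.backward(boundary.grad)                    # pushes that gradient into stage 0

pipe_grads = [p.grad for p in params]
grads_match = all(torch.allclose(r, p) for r, p in zip(ref_grads, pipe_grads))
print("handshake reproduces end-to-end gradients:", grads_match)
```

The two paths apply the same chain rule, just split at the boundary tensor, so the parameter gradients agree.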

Where libraries hide this from you. torchtitan and Megatron-LM wrap this handshake inside their pipeline-schedule runners. In Megatron, the p2p_communication module is the file to read — the send/recv primitives there are exactly what replaces the line above when stages live on different GPUs. If you ever debug a pipeline-parallel run that produces NaN losses on every microbatch except the first, the bug is almost always in the boundary handshake: either someone forgot the requires_grad_(True), or someone left the upstream graph attached and ran out of memory.

At Massive Scale: GPipe, PipeDream, and the Road to DualPipe

The two schedules above are textbook. Real frontier training builds on them, then pushes much further. The progression is worth mapping out because it sets up section 11.5.

GPipe (2019, Huang et al.)

Google's GPipe paper introduced microbatching and the (P − 1) / (M + P − 1) bubble formula. It also introduced re-materialisation: rather than store every microbatch's activations, drop them after forward and recompute them in backward. That cuts activation memory by ~80% at the cost of ~33% extra compute. Almost every pipeline implementation since GPipe ships with re-materialisation on by default for large models.
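Re-materialisation is available in PyTorch as torch.utils.checkpoint. The sketch below is not GPipe's implementation; it only illustrates, on a tiny block, that recomputing activations during backward leaves the gradients unchanged, and it reproduces the ~33% figure from the $t_B = 2 \, t_F$ assumption used in this section. The block size and the choice of use_reentrant=False are assumptions made for this example.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
params = list(block.parameters())
x = torch.randn(4, 8)

# Baseline: activations inside the block are stored for backward.
grads_stored = torch.autograd.grad(block(x).sum(), params)

# Re-materialised: activations are dropped after forward and the block
# is run again during backward to rebuild them.
out = checkpoint(block, x, use_reentrant=False)
grads_recomputed = torch.autograd.grad(out.sum(), params)

same_grads = all(torch.allclose(a, b)
                 for a, b in zip(grads_stored, grads_recomputed))
print("gradients identical with recompute:", same_grads)

# Cost model from this section: one extra forward per microbatch per stage.
t_f, t_b = 1, 2
extra_compute = t_f / (t_f + t_b)
print(f"extra compute from recomputation: {extra_compute:.0%}")
```

The memory saving is not visible at this size; the point is only that the recompute path is numerically a drop-in replacement for the stored-activation path.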

PipeDream / 1F1B (2019, Narayanan et al.)

Microsoft's PipeDream introduced the 1F1B schedule. By interleaving forward and backward as soon as the pipeline is full, it bounds the peak activation memory per stage by O(P) instead of O(M). The arithmetic bubble fraction is unchanged, but the savings let you push $M$ higher without OOMing — which, in turn, shrinks the bubble. 1F1B is the schedule Megatron-LM ships by default. DeepSeek V3 starts here.

Interleaved 1F1B (Megatron-LM, 2021)

Each GPU now owns multiple non-contiguous stages instead of one contiguous block. With $v$ virtual stages per device, the effective fill/drain time becomes $(P - 1)/v$ microbatches instead of $P - 1$ , dropping the bubble by a factor of $v$ at the cost of $v \times$ more cross-stage communication. Megatron typically uses $v = 2$ or $v = 4$ .
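Taking the sentence above at face value, a fill/drain cost of $(P - 1)/v$ microbatch slots gives a bubble fraction of $\frac{(P-1)/v}{M + (P-1)/v} = \frac{P - 1}{vM + P - 1}$. The helper below is a sketch of that arithmetic only (the name `interleaved_bubble` is illustrative) and ignores the extra communication cost.

```python
from fractions import Fraction


def interleaved_bubble(P, M, v=1):
    """Bubble fraction when fill/drain costs (P - 1) / v microbatch slots."""
    fill = Fraction(P - 1, v)
    return fill / (M + fill)


P, M = 16, 64
for v in (1, 2, 4):
    f = interleaved_bubble(P, M, v)
    print(f"v={v}  bubble={f}  ({float(f):.1%})")
```

With $v = 1$ this reduces to the plain 1F1B value of $15/79$; for $M \gg P$ the bubble shrinks by roughly a factor of $v$.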

DualPipe (DeepSeek V3, 2024)

DeepSeek's DualPipe goes a step further: it runs two pipelines in opposite directions on the same GPUs at the same time, overlapping the bubble of one pipeline with the compute of the other. On a 16-stage DeepSeek V3 training run with M = 64 microbatches, DualPipe reduces the bubble from $15/79 \approx 19\%$ (1F1B) to nearly zero in the compute dimension. It pays for this with a doubled boundary-comm bill that DeepSeek hides behind expert-parallel all-to-all, using cross-node InfiniBand bandwidth that would otherwise be idle. Section 11.5 derives the full DualPipe schedule and explains why DeepSeek could ship the run on H800 GPUs (which have reduced NVLink interconnect bandwidth compared to H100s) without losing throughput.

Schedule	Bubble	Peak in-flight forwards / stage	Comms / step
Naive	1 − 1/P	1	P − 1 boundaries
GPipe	(P − 1)/(M + P − 1)	M (at stage 0)	P − 1 boundaries × 2
1F1B	(P − 1)/(M + P − 1)	P − s (at stage s)	P − 1 boundaries × 2
Interleaved 1F1B (v stages/dev)	(P − 1)/(vM + P − 1)	v · (P − s)	v × baseline
DualPipe	≈ 0 (compute), comms-bound	P − s (per direction)	2× baseline, hidden behind all-to-all

Engineering Reality: What Trips People Up

The math says the bubble is $(P-1)/(M+P-1)$ . The training run says you got 70% of expected throughput. Five things to look at, in order, when that happens.

Microbatch too small. When the per-microbatch forward is shorter than the kernel-launch overhead on a single stage, the "1 time unit" in the model breaks down and the bubble formula stops applying. Rule of thumb: keep per-microbatch per-stage forward above 2 ms. For DeepSeek V3 this means microbatch token count above ~4096.
Stage imbalance. If stage 3 is 20% slower than stage 2 (deeper layers, attention with longer KV, or one extra LayerNorm), the entire pipeline runs at stage 3's clock and every other stage acquires a hidden bubble. Balance the measured time per stage, not just the layer or parameter count.
Boundary comms not overlapped with compute. A naive pipeline implementation does send/recv synchronously between stages. The right thing is to issue the send on stage s before stage s's next op so that the next stage can receive while stage s is busy with the following microbatch. Megatron-LM does this through ncclSend + ncclRecv on a non-default stream.
Activation memory blows up at stage 0. If you accidentally ran GPipe instead of 1F1B, stage 0 holds M microbatches of activations. For a 671 B model with M = 64 and full attention activations, that is ~1.6 TB at stage 0 — guaranteed OOM. The fix is the schedule, not the model.
Loss / optimiser step in the wrong place. Computing the loss outside the pipeline schedule (instead of at the last stage's end-of-forward) re-introduces a serialisation point. Optimiser steps must run only after all microbatch backwards are accumulated — running per-microbatch steps wrecks gradient accumulation and the implicit learning rate.
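The last point can be checked directly: accumulating gradients over M microbatches and stepping once should give the same gradient as one backward over the whole batch. The sketch below assumes a sum-reduced loss, as in the two-stage example earlier, and uses a single small linear layer as a stand-in for a pipeline stage.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 8)
x, y = torch.randn(16, 8), torch.randn(16, 8)
loss_fn = nn.MSELoss(reduction="sum")   # sum, so microbatch losses add up
M = 4

# Reference: one backward over the full batch.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grads = [p.grad.clone() for p in model.parameters()]

# Pipeline style: one backward per microbatch, no optimiser step in between,
# so .grad accumulates the sum over all microbatches.
model.zero_grad()
for x_mb, y_mb in zip(x.chunk(M), y.chunk(M)):
    loss_fn(model(x_mb), y_mb).backward()
accum_grads = [p.grad.clone() for p in model.parameters()]

grads_match = all(torch.allclose(a, b, atol=1e-4)
                  for a, b in zip(full_grads, accum_grads))
print("accumulated gradient equals full-batch gradient:", grads_match)
```

With a mean-reduced loss the same equivalence holds only if each microbatch loss is divided by M before backward; stepping the optimiser inside the loop breaks it in either case.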

What you should walk away with. Pipeline parallelism is the lever that lets a 671 B model exist on hardware at all. The bubble fraction $(P-1)/(M+P-1)$ is the price you pay for slicing the layer stack, and the entire research line from GPipe through DualPipe is about cutting that fraction down without inflating memory or comms. The two textbook schedules in this section are the floor. DualPipe, in the next section, is how DeepSeek V3 walks under it.