A 671 B-parameter model does not fit on a single GPU. The first obvious response — chop the layers across GPUs and call them in order — works, but throws away most of the GPU time you just paid for. Pipeline parallelism is the family of scheduling tricks that get that wasted time back. This section is about why the bubble exists, where the math says it has to be, and the two schedules — GPipe and 1F1B — that bound it. Both are the floor that DualPipe will beat in section 11.5.
The Real Problem: When a Model Will Not Fit on One GPU
Sections 11.2 and 11.3 dealt with two ways to split work across GPUs. Data parallelism keeps a full copy of the model on every GPU and shards the batch; tensor parallelism keeps the batch whole and shards each individual matmul. Both have a hard ceiling. Data parallelism needs the entire model to fit on one device — at 671 B parameters that is impossible on any GPU made today. Tensor parallelism can split a matmul across a handful of GPUs inside a node, but the all-reduces it triggers on every layer pin it to NVLink-class bandwidth: try to extend TP across a node boundary and you immediately spend more time talking than computing.
Pipeline parallelism takes a different cut. Instead of slicing a single layer across GPUs, it slices the whole stack of layers: the first block of layers lives on GPU 0, the next block on GPU 1, and so on for $P$ stages. The forward pass is a linear chain: in a four-stage pipeline, activations from stage 0 feed into stage 1, then 2, then 3. The backward pass is the same chain in reverse.
Written like that, the picture looks fine. A 671 B model that does not fit on one GPU fits comfortably on 16 stages of 42 B each. The communication is cheap — only the boundary activation between adjacent stages crosses the wire, which is dozens of MB per microbatch, not GB. And because each GPU only stores 1/P of the parameters, the optimiser and gradient memory drop by the same factor. Compared to TP, the comms bill is one to two orders of magnitude lower per layer.
The problem appears the moment you try to time it. With one batch in flight, stage 0 does its forward, then sits idle while stages 1, 2, 3 finish theirs. Then stages 1, 2, 3 sit idle while stage 0 does its backward. Out of P stages, only one is doing useful work at any given moment. For a four-stage pipeline, the GPU you bought for $30,000 is busy 25% of the time. For sixteen stages, 6%. This is the first incarnation of the bubble.
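The idle pattern described above can be drawn as a text timeline. This is a minimal sketch, not a real scheduler: the function names are hypothetical, and it assumes a forward takes 1 tick and a backward takes 2 ticks per stage (the ratio does not change the utilisation).

```python
def naive_timeline(num_stages: int, t_f: int = 1, t_b: int = 2) -> list[str]:
    """One text row per stage: 'F'/'B' mark busy ticks, '.' marks idle ticks."""
    total = num_stages * (t_f + t_b)
    rows = [["."] * total for _ in range(num_stages)]
    clock = 0
    # Forward pass: stage 0 -> stage P-1, strictly one stage at a time.
    for s in range(num_stages):
        for t in range(clock, clock + t_f):
            rows[s][t] = "F"
        clock += t_f
    # Backward pass: stage P-1 -> stage 0, again one stage at a time.
    for s in reversed(range(num_stages)):
        for t in range(clock, clock + t_b):
            rows[s][t] = "B"
        clock += t_b
    return ["".join(row) for row in rows]


def utilisation(rows: list[str]) -> float:
    """Fraction of all (stage, tick) slots that hold useful work."""
    busy = sum(row.count("F") + row.count("B") for row in rows)
    return busy / (len(rows) * len(rows[0]))


if __name__ == "__main__":
    for row in naive_timeline(4):
        print(row)
    print(utilisation(naive_timeline(4)), utilisation(naive_timeline(16)))
```

Each row has exactly one forward and one backward block surrounded by dots, which is the 1/P utilisation quoted above.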
Intuition: An Assembly Line With One Worker Awake at a Time
Imagine a car factory with four stations, one worker each. A car visits station 0 (chassis), then 1 (engine), then 2 (electronics), then 3 (paint). Each step takes the same amount of time. If the factory only works on one car at a time, every other station sits idle: while station 1 installs the engine, the chassis worker has nothing to do, and the electronics and paint workers have not yet seen this car. Four workers, one car-station of useful work per hour. That is the naive pipeline.
The fix every real factory uses is obvious: keep multiple cars in the line at once. As soon as car 0 leaves station 0 for station 1, car 1 enters station 0. By the time car 0 is at station 3, the line holds four cars at once and every station is busy. The line is full. The cost of having paid for four stations is finally being paid back.
Microbatching is exactly this trick, with a twist. A car is a microbatch — a small slice of the global batch. The factory line is the pipeline. The forward pass through the four stages is the four stations. So far so good. But neural-network training has a backward pass too: once a car reaches station 3, it has to come back through stations 2, 1, 0 in reverse to deliver gradients. So the factory is now bidirectional: cars stream forward, and gradients stream back. The forward queue and backward queue have to share the same four workers. The schedule you choose decides how cleanly forward and backward dovetail — and the leftover gaps are the bubble.
The Mathematics of the Bubble
Let the pipeline have $P$ stages and the global batch be chopped into $M$ microbatches. Let one forward on one stage take $t_f$ seconds and one backward on one stage take $t_b$. In what follows we use the standard assumption $t_b = 2t_f$, because backward computes both the activation gradient and the weight gradient.
Naive (no microbatching)
A single microbatch goes through all $P$ stages forward and back. Total wall time: $T_{\text{naive}} = P(t_f + t_b)$. Active GPU-time is also $P(t_f + t_b)$ (one stage works at a time). Total GPU-time available is $P \cdot T_{\text{naive}} = P^2(t_f + t_b)$. So the bubble fraction is $1 - \frac{P(t_f + t_b)}{P^2(t_f + t_b)} = 1 - \frac{1}{P}$.
For $P = 4$ that is 75%. For $P = 16$ it is 94%. Almost all GPU time is wasted. No amount of TP, FSDP, or fast NVLink can fix a schedule this bad; the only thing that can is filling the pipeline with more microbatches.
GPipe
Forward of every microbatch streams down the pipeline in lockstep. The first microbatch leaves stage $P - 1$ at time $P\,t_f$; the last one leaves at $(M + P - 1)\,t_f$. Then the backward sweep starts, taking a symmetric $(M + P - 1)\,t_b$ seconds. Total wall time: $T_{\text{GPipe}} = (M + P - 1)(t_f + t_b)$.
Active GPU-time is $MP(t_f + t_b)$: each of the $M$ microbatches touches all $P$ stages, both forward and backward. Total GPU-time available is $P(M + P - 1)(t_f + t_b)$. Bubble fraction: $1 - \frac{M}{M + P - 1} = \frac{P - 1}{M + P - 1}$.
Two ways to read this formula. (1) For fixed $P$, doubling $M$ roughly halves the bubble once $M \gg P$. (2) For fixed $M$, growing the pipeline deeper makes the bubble grow roughly linearly in $P$ while $P \ll M$. To keep the bubble at or below 10% you need $M \ge 9(P - 1)$ microbatches.
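The closed form is easy to explore numerically. The sketch below assumes the bubble fraction (P − 1)/(M + P − 1) for P stages and M microbatches; the function names are illustrative, not from any library.

```python
def gpipe_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle share of total GPU-time for a GPipe-style fill/drain schedule."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def min_microbatches(num_stages: int, max_bubble: float) -> int:
    """Smallest M whose bubble fraction does not exceed max_bubble."""
    m = 1
    while gpipe_bubble_fraction(num_stages, m) > max_bubble:
        m += 1
    return m


if __name__ == "__main__":
    # Doubling M at fixed P shrinks the bubble; deepening P at fixed M grows it.
    for p, m in [(4, 8), (4, 16), (16, 16), (16, 64)]:
        print(f"P={p:2d} M={m:2d} bubble={gpipe_bubble_fraction(p, m):.3f}")
    print("M needed for a 10% bubble at P=16:", min_microbatches(16, 0.10))
```

The linear search is deliberate: it avoids rearranging the inequality in floating point, and M is small enough that the loop costs nothing.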
1F1B
1F1B does not change the wall time relative to GPipe: once the pipeline is full the cadence is the same, and once it drains the tail is the same. So $T_{\text{1F1B}} = T_{\text{GPipe}} = (M + P - 1)(t_f + t_b)$, and the bubble fraction is again $\frac{P - 1}{M + P - 1}$.
What does change is the peak number of microbatches whose forward activations have to be held in memory. In GPipe, stage 0 holds all $M$ forwards until the backward sweep begins. In 1F1B, stage $s$ holds at most $P - s$: stage 0 holds $P$, stage $P - 1$ holds 1. For a 671 B model with $M = 64$ microbatches and $P = 16$ stages, this is the difference between at most 16 microbatches of activations per stage (1F1B) and 64 (GPipe), a 4× memory saving that, in practice, is the only reason the model fits at all.
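The memory comparison can be tabulated per stage. This is a small sketch under the counting rule stated above (GPipe holds all M forwards, 1F1B holds at most P − s at stage s); the function name is hypothetical and the unit is "microbatches of activations", not bytes.

```python
def peak_in_flight(schedule: str, num_stages: int, num_microbatches: int, stage: int) -> int:
    """Peak number of microbatches whose forward activations a stage must keep."""
    if schedule == "gpipe":
        # Every forward completes before the first backward frees anything.
        return num_microbatches
    if schedule == "1f1b":
        # Stage s starts its first backward after at most P - s forwards.
        return min(num_microbatches, num_stages - stage)
    raise ValueError(f"unknown schedule: {schedule}")


if __name__ == "__main__":
    P, M = 16, 64
    for s in (0, 8, 15):
        g = peak_in_flight("gpipe", P, M, s)
        o = peak_in_flight("1f1b", P, M, s)
        print(f"stage {s:2d}: gpipe={g:2d}  1f1b={o:2d}")
```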
Manual Numerical Walkthrough
Set $P = 4$ stages, $M = 8$ microbatches, $t_f = 1$ unit, $t_b = 2$ units. We will compute the three bubble fractions by hand and confirm them against the simulator.
Bubble fraction for P = 4, M = 8: three schedules
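Naive. Single microbatch through 4 stages forward (4 units) and back (8 units): total $4 + 8 = 12$. Active per GPU is $1 + 2 = 3$; across 4 GPUs that is $12$. Total GPU-time = $4 \times 12 = 48$. Bubble = $48 - 12 = 36$, fraction = $36/48 = 0.75$. With $M = 8$ separate microbatches done one at a time, every number scales by 8 and the fraction stays $0.75$.
GPipe. Forward sweep: $(8 + 4 - 1) \times 1 = 11$ units. Backward sweep: $(8 + 4 - 1) \times 2 = 22$ units. Total wall time $11 + 22 = 33$. Active work = forward + backward across all stages and microbatches = $4 \times 8 \times (1 + 2) = 96$. Total GPU-time = $4 \times 33 = 132$. Bubble = $132 - 96 = 36$, fraction = $36/132 \approx 0.273$. Cross-check with the formula: $(P - 1)/(M + P - 1) = 3/11 \approx 0.273$, which matches.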
1F1B. Same wall time, same active work, so the fraction is also $0.273$. What differs is memory: GPipe holds 8 microbatches of activations at stage 0; 1F1B holds at most 4. A back-of-the-envelope check: stage 0 runs $P - 1 = 3$ warm-up forwards and then alternates one forward with one backward, so its 1F1B queue is F0, F1, F2, F3, B0, F4, B1, F5, B2, F6, B3, F7, B4, B5, B6, B7. At the moment F3 finishes, B0 has not yet run; stages 1, 2, 3 are still finishing their warmups for the same set. Peak in-flight forwards at stage 0 = 4.
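The per-stage 1F1B queue can be generated rather than written out by hand. The sketch below assumes the common convention that stage s runs P − s − 1 warm-up forwards, then alternates one forward with one backward, then drains the remaining backwards; the helper names are hypothetical.

```python
def one_f_one_b_queue(num_stages: int, num_microbatches: int, stage: int) -> list[str]:
    """Ordered op list for one stage, e.g. ['F0', 'F1', 'B0', ...]."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    queue = [f"F{i}" for i in range(warmup)]          # warm-up: forwards only
    for i in range(steady):                           # steady state: 1 forward, 1 backward
        queue.append(f"F{warmup + i}")
        queue.append(f"B{i}")
    queue.extend(f"B{i}" for i in range(steady, num_microbatches))  # drain
    return queue


def peak_live_forwards(queue: list[str]) -> int:
    """Largest number of forwards whose backward has not run yet."""
    live = peak = 0
    for op in queue:
        live += 1 if op.startswith("F") else -1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    for s in range(4):
        q = one_f_one_b_queue(4, 8, s)
        print(f"stage {s}: peak={peak_live_forwards(q)}  {' '.join(q)}")
```

For P = 4 and M = 8 the peak comes out as 4, 3, 2, 1 for stages 0 to 3, which is the P − s bound.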
What the numbers tell us. Going from naive to microbatched dropped the bubble from 75% to 27.3%: a pure scheduling win, no hardware change. Doubling M to 16 would drop it further to $3/19 \approx 15.8\%$. Doubling again to M = 32 would drop it to $3/35 \approx 8.6\%$. At some point the microbatch becomes small enough that per-microbatch overhead (kernel launches, comms latency) starts dominating, typically around M / P ≈ 4. That is why real 671 B-class training uses M between 32 and 64, not 1024.
Visualising the Schedule
The widget below renders the per-stage schedule as a Gantt chart. Each row is a GPU; each blue block is a forward microbatch (1 time unit), each amber block is a backward (2 time units), each striped red area is a bubble. The legend reports the bubble fraction in real time so you can match it against the formulas above.
Three experiments to run inside the widget:
- With the schedule set to Naive and P = 4, watch how three out of four lanes are striped red at any given moment. Bump P to 6 and five out of six stages are idle at any given moment. This is why you cannot ship a real LLM with a naive pipeline.
- Switch to GPipe at P = 4, M = 4. Bubble ≈ 43% — the microbatching helps, but not enough. Now crank M to 16. Bubble drops to ≈ 16%. The fill and drain bookends are the same; the middle is fully busy. This is the GPipe formula made visible.
- Switch to 1F1B with the same numbers. The arithmetic bubble fraction is identical to GPipe, but look at the stage 0 row: forwards no longer pile up before backwards start. Activation memory at stage 0 is bounded by P − 0 = P, not M. This is the unlock that lets you raise M to shrink the bubble without OOM, and it is the property DualPipe builds on.
Plain Python: Simulate the Pipeline by Hand
Before we wire up PyTorch, let us write the scheduler the same way the visualiser does it. The point of this code is to make the schedule itself a first-class object — the queue per stage and the dependencies between them — so you can reason about the bubble without worrying about CUDA, gradients, or DDP.
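A minimal sketch of such a scheduler follows. It is an illustration under stated assumptions, not a reference implementation: every name is hypothetical, forwards cost `t_f` and backwards `t_b`, communication is free, and the 1F1B queue uses the common convention of P − s − 1 warm-up forwards at stage s.

```python
from collections import namedtuple

Result = namedtuple("Result", "wall bubble fraction")


def build_queues(schedule, P, M):
    """Return one ordered op queue per stage; each op is (kind, microbatch)."""
    queues = []
    for s in range(P):
        if schedule == "naive":
            # One microbatch at a time: forward, backward, then the next one.
            q = [(kind, m) for m in range(M) for kind in ("F", "B")]
        elif schedule == "gpipe":
            # All forwards first, then all backwards.
            q = [("F", m) for m in range(M)] + [("B", m) for m in range(M)]
        elif schedule == "1f1b":
            # Assumed convention: warm-up forwards, alternate F/B, then drain.
            warm = min(P - s - 1, M)
            q = [("F", m) for m in range(warm)]
            for i in range(M - warm):
                q += [("F", warm + i), ("B", i)]
            q += [("B", m) for m in range(M - warm, M)]
        else:
            raise ValueError(f"unknown schedule: {schedule}")
        queues.append(q)
    return queues


def simulate(schedule, P, M, t_f=1, t_b=2):
    """Run every stage's queue in order, honouring cross-stage dependencies."""
    queues = build_queues(schedule, P, M)
    finish = {}              # (kind, stage, microbatch) -> finish time
    free_at = [0] * P        # time at which each stage becomes idle
    cursor = [0] * P         # index of the next op in each stage's queue
    remaining = sum(len(q) for q in queues)
    while remaining:
        progressed = False
        for s in range(P):
            if cursor[s] == len(queues[s]):
                continue
            kind, m = queues[s][cursor[s]]
            if kind == "F":
                # A forward needs the previous stage's forward output.
                dep = ("F", s - 1, m) if s > 0 else None
            else:
                # A backward needs the next stage's gradient, or, on the
                # last stage, its own forward for the same microbatch.
                dep = ("B", s + 1, m) if s < P - 1 else ("F", s, m)
            if dep is not None and dep not in finish:
                continue     # input not produced yet; retry on the next pass
            ready = finish[dep] if dep is not None else 0
            end = max(free_at[s], ready) + (t_f if kind == "F" else t_b)
            finish[(kind, s, m)] = end
            free_at[s] = end
            cursor[s] += 1
            remaining -= 1
            progressed = True
        if not progressed:
            raise RuntimeError("schedule deadlocked")
    wall = max(free_at)
    bubble = P * wall - P * M * (t_f + t_b)   # idle GPU-time units
    return Result(wall, bubble, bubble / (P * wall))


if __name__ == "__main__":
    for name in ("naive", "gpipe", "1f1b"):
        r = simulate(name, P=4, M=8)
        print(f"{name:6s} wall={r.wall:3d} bubble={r.bubble:3d} fraction={r.fraction:.3f}")
```

The schedule lives entirely in `build_queues`; `simulate` only enforces ordering and dependencies, so a new schedule can be tried by adding one branch. For P = 4, M = 8 the fractions agree with the hand calculation: 0.75 for naive and 3/11 for GPipe and 1F1B.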
Running this for $P = 4$, $M = 8$, prints:
| Schedule | Wall-clock units | Bubble units | Bubble fraction |
|---|---|---|---|
| naive | 96 | 288 | 0.750 |
| gpipe | 33 | 36 | 0.273 |
| 1f1b | 33 | 36 | 0.273 |
The naive row runs the $M = 8$ microbatches serially, so its wall-clock and bubble units are 8 times the single-microbatch values (12 and 36) while the fraction stays at 0.75; the GPipe and 1F1B fractions match the closed form exactly. The schedule object you just built in Python is the same schedule that Megatron-LM and DeepSpeed implement on actual GPUs.
PyTorch: A Real Two-Stage Pipeline End-to-End
Now let us wire the schedule into a real autograd graph. We keep both stages on CPU so the code runs anywhere, but the cross-stage handshake — detach on forward, manually re-inject the gradient on backward — is exactly what a multi-GPU pipeline does over NCCL.
The mechanism that makes pipeline parallelism work at all sits on one line: `boundary = h.detach().requires_grad_(True)`. This is the moment the autograd graph is cut between stages. Without the cut, every backward would propagate all the way to the inputs in one go, which is impossible across devices and processes. With the cut, the boundary tensor on the receiving side becomes a fresh leaf, its `.grad` field will be filled by stage 1's backward, and we can ship that gradient back across the wire ourselves with `h.backward(boundary.grad)`.
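The handshake can be sketched end to end on CPU. This is a minimal illustration, not a distributed implementation: the layer sizes, the MSE loss, and every name other than `h` and `boundary` are assumptions, the stages run sequentially in one process, and the "wire" is just a tensor hand-off. The final check compares the accumulated gradients against an unsplit copy of the same model.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stages that would live on different devices in a real pipeline.
stage0 = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
stage1 = nn.Linear(16, 4)
loss_fn = nn.MSELoss()

# Unsplit reference copies, used only to verify the gradients below.
ref0, ref1 = copy.deepcopy(stage0), copy.deepcopy(stage1)

x, y = torch.randn(6, 8), torch.randn(6, 4)
num_microbatches = 3

for x_mb, y_mb in zip(x.chunk(num_microbatches), y.chunk(num_microbatches)):
    # Stage 0 forward.
    h = stage0(x_mb)
    # Cut the autograd graph: this is the tensor that would be sent to stage 1.
    boundary = h.detach().requires_grad_(True)
    # Stage 1 forward; divide so the accumulated gradient equals the full-batch mean.
    loss = loss_fn(stage1(boundary), y_mb) / num_microbatches
    # Stage 1 backward: fills stage1's parameter grads and boundary.grad.
    loss.backward()
    # Stage 0 backward: re-inject the gradient that would be sent back.
    h.backward(boundary.grad)

# Reference: one ordinary backward through the unsplit model.
loss_fn(ref1(ref0(x)), y).backward()

pipe_params = list(stage0.parameters()) + list(stage1.parameters())
ref_params = list(ref0.parameters()) + list(ref1.parameters())
max_diff = max((p.grad - q.grad).abs().max().item() for p, q in zip(pipe_params, ref_params))
print(f"max gradient difference vs. unsplit model: {max_diff:.2e}")
```

The optimiser step is left out on purpose: it belongs after the loop, once every microbatch has contributed its gradient.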
At Massive Scale: GPipe, PipeDream, and the Road to DualPipe
The two schedules above are textbook. Real frontier training builds on them, then pushes much further. The progression is worth mapping out because it sets up section 11.5.
GPipe (2019, Huang et al.)
Google's GPipe paper introduced microbatching and the (P − 1) / (M + P − 1) bubble formula. It also introduced re-materialisation: rather than store every microbatch's activations, drop them after forward and recompute them in backward. That cuts activation memory by ~80% at the cost of ~33% extra compute. Almost every pipeline implementation since GPipe ships with re-materialisation on by default for large models.
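The ~33% figure is consistent with the cost model used in this section. As a rough check, write $t_f$ for one forward and $t_b$ for one backward on a stage, and assume $t_b = 2t_f$ and that re-materialisation re-runs exactly one extra forward per microbatch per stage (both are simplifying assumptions):

$$
\frac{t_f + t_b + t_f}{t_f + t_b} = \frac{t_f + 2t_f + t_f}{t_f + 2t_f} = \frac{4}{3} \approx 1.33
$$

So the recompute adds about one third to the per-microbatch compute, before any overlap or selective recomputation is taken into account.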
PipeDream / 1F1B (2019, Narayanan et al.)
Microsoft's PipeDream introduced the 1F1B schedule. By interleaving forward and backward as soon as the pipeline is full, it bounds the peak activation memory per stage by O(P) instead of O(M). The arithmetic bubble fraction is unchanged, but the savings let you push $M$ higher without OOMing, which in turn shrinks the bubble. 1F1B is the schedule Megatron-LM ships by default. DeepSeek V3 starts here.
Interleaved 1F1B (Megatron-LM, 2021)
Each GPU now owns multiple non-contiguous stages instead of one contiguous block. With $v$ virtual stages per device, the effective fill/drain time becomes $(P - 1)/v$ microbatch slots instead of $P - 1$, dropping the bubble by roughly a factor of $v$ at the cost of more cross-stage communication. Megatron typically uses small values of $v$.
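The effect of interleaving can be estimated with the same bubble accounting as before. This sketch assumes only what the paragraph above states, namely that fill/drain shrinks from P − 1 to (P − 1)/v slots, and it ignores the extra communication; the function name and the sample values of v are illustrative.

```python
def interleaved_bubble_fraction(num_stages: int, num_microbatches: int, virtual_stages: int = 1) -> float:
    """Bubble share of total time when fill/drain costs (P - 1) / v slots."""
    fill_drain = (num_stages - 1) / virtual_stages
    return fill_drain / (num_microbatches + fill_drain)


if __name__ == "__main__":
    P, M = 16, 64
    for v in (1, 2, 4):
        print(f"v={v}: bubble={interleaved_bubble_fraction(P, M, v):.3f}")
```

With `virtual_stages=1` this reduces to (P − 1)/(M + P − 1). The reduction for larger v is slightly less than a full factor of v, because the useful work M in the denominator does not shrink.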
DualPipe (DeepSeek V3, 2024)
DeepSeek's DualPipe goes a step further: it runs two pipelines in opposite directions on the same GPUs at the same time, overlapping the bubble of one pipeline with the compute of the other. On a 16-stage DeepSeek V3 training run with M = 64 microbatches, DualPipe reduces the bubble from $(P - 1)/(M + P - 1) = 15/79 \approx 19\%$ (1F1B) to nearly zero in the compute dimension. It pays for this with a doubled boundary-comm bill, which DeepSeek hides behind the expert-parallel all-to-all, using cross-node InfiniBand bandwidth that would otherwise be idle. Section 11.5 derives the full DualPipe schedule and explains why DeepSeek could ship the run on H800 GPUs (which have reduced NVLink interconnect bandwidth compared to H100s) without losing throughput.
| Schedule | Bubble | Peak in-flight forwards / stage | Comms / step |
|---|---|---|---|
| Naive | 1 − 1/P | 1 | P − 1 boundaries |
| GPipe | (P − 1)/(M + P − 1) | M (at stage 0) | P − 1 boundaries × 2 |
| 1F1B | (P − 1)/(M + P − 1) | P − s (at stage s) | P − 1 boundaries × 2 |
| Interleaved 1F1B (v stages/dev) | (P − 1)/(vM + P − 1) | v · (P − s) | v × baseline |
| DualPipe | ≈ 0 (compute), comms-bound | P − s (per direction) | 2× baseline, hidden behind all-to-all |
Engineering Reality: What Trips People Up
The math says the bubble is $(P - 1)/(M + P - 1)$. The training run says you got 70% of expected throughput. Five things to look at, in order, when that happens.
- Microbatch too small. When the per-microbatch forward is shorter than the kernel-launch overhead on a single stage, the "1 time unit" in the model breaks down and the bubble formula stops applying. Rule of thumb: keep per-microbatch per-stage forward above 2 ms. For DeepSeek V3 this means microbatch token count above ~4096.
- Stage imbalance. If stage 3 is 20% slower than stage 2 (deeper layers, attention with longer KV, or one extra LayerNorm), the entire pipeline runs at stage 3's clock and every other stage acquires a hidden bubble. Balance the layer count, not just the parameter count.
- Boundary comms not overlapped with compute. A naive pipeline implementation does send/recv synchronously between stages. The right thing is to issue the send on stage s before stage s's next op so that the next stage can receive while stage s is busy with the following microbatch. Megatron-LM does this through ncclSend + ncclRecv on a non-default stream.
- Activation memory blows up at stage 0. If you accidentally ran GPipe instead of 1F1B, stage 0 holds M microbatches of activations. For a 671 B model with M = 64 and full attention activations, that is ~1.6 TB at stage 0 — guaranteed OOM. The fix is the schedule, not the model.
- Loss / optimiser step in the wrong place. Computing the loss outside the pipeline schedule (instead of at the last stage's end-of-forward) re-introduces a serialisation point. Optimiser steps must run only after all microbatch backwards are accumulated — running per-microbatch steps wrecks gradient accumulation and the implicit learning rate.
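The stage-imbalance item in the list above can be quantified with a deliberately crude model: in steady state every stage advances at the cadence of the slowest one, so a faster stage is busy only for its own time out of each slot. The function names and the sample timings are hypothetical, and fill/drain is ignored.

```python
def stage_utilisation(stage_times: list[float]) -> list[float]:
    """Steady-state busy fraction per stage when the slowest stage sets the clock."""
    slowest = max(stage_times)
    return [t / slowest for t in stage_times]


def hidden_bubble(stage_times: list[float]) -> float:
    """Average idle share across stages caused purely by imbalance."""
    return 1 - sum(stage_times) / (len(stage_times) * max(stage_times))


if __name__ == "__main__":
    # Four stages, the last one 20% slower than the others.
    times = [1.0, 1.0, 1.0, 1.2]
    print([round(u, 3) for u in stage_utilisation(times)])
    print(round(hidden_bubble(times), 3))
```

This idle time comes on top of the fill/drain bubble from the closed form, which is why measured throughput can fall well short of the formula even when M is large.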
What you should walk away with. Pipeline parallelism is the lever that lets a 671 B model exist on hardware at all. The bubble fraction is the price you pay for slicing the layer stack, and the entire research line from GPipe through DualPipe is about cutting that fraction down without inflating memory or comms. The two textbook schedules in this section are the floor. DualPipe, in the next section, is how DeepSeek V3 walks under it.