The Real Problem: One Batch, Too Slow
Section 11.1 made the case that a single GPU cannot hold a frontier model. But before any of the more exotic parallelism schemes — tensor, pipeline, expert, sequence — there is a simpler problem that data parallelism solves, and that you will meet on the very first day you own more than one GPU.
SGD's gradient is an average. With a batch of examples and loss on each example, one step of gradient descent is:
The sum is embarrassingly parallel — you can compute any partition of those terms anywhere and the result is the same as long as you add them at the end. So the moment you have a second GPU sitting next to the first, the obvious speed-up writes itself: give half the batch to each GPU, run forward and backward on the local half, then add the two gradient contributions and apply one shared update.
That idea — same model, sharded data, summed gradients — is data parallelism. It is the single most-deployed parallelism strategy in deep learning: the one you reach for first, the one every other parallelism scheme composes on top of, the one that makes 8-GPU nodes and 8 000-GPU clusters look architecturally similar from Python's point of view. And it is the reason a frontier run can process tens of millions of tokens per second instead of tens of thousands.
| Scale | Goal | Bottleneck | Why DP is the right hammer |
|---|---|---|---|
| 1 GPU → 8 GPUs | Faster iteration on prototypes | Wall-clock per step | Linear speed-up on tiny code change |
| 8 → 64 GPUs | Larger global batch, stable LR scaling | Gradient AllReduce volume | Still dominant — comm < compute on NVLink + IB |
| 64 → 1024 GPUs | Pre-training a 7–70 B model | Bandwidth × N grows; weights barely fit | DP + ZeRO-1/2 stays competitive |
| 1024 → 8192 GPUs | Pre-training a 200+ B MoE | AllReduce ≫ compute, weights don't fit | DP alone breaks — needs FSDP/ZeRO-3 + TP/PP/EP |
Intuition: Many Workers, One Identical Model
Picture a workshop. You are training a model — call it — and the training set is a giant bin of examples. With one worker, training is sequential: pick a handful of examples, compute the average gradient, nudge , repeat.
Hire eight workers. Give each of them the same copy of the current model and a different handful of examples. Each worker computes their local mean gradient. They are now staring at eight different vectors pointing in eight slightly different directions — same model, different data, different gradients.
Here is the only subtle part of DP: the eight workers cannot just each apply their own gradient and walk away. If they did, they would end the step with eight different models and the workshop would split into eight uncoordinated trainings. So they pause. They announce their gradient vectors to each other. They sum them and divide by eight. Now every worker holds the same averaged direction. Every worker takes the same step. The eight copies of the model are once again identical, ready for the next iteration.
Three intuitive consequences flow from this picture:
- DP is mathematically equivalent to single-GPU SGD on a bigger batch. Eight workers each at batch 8 is — at the gradient level — identical to one worker at batch 64. The reason this matters: you can predict what DP will do, in terms of loss curve and learning-rate sensitivity, by treating it as a batch scale-up.
- The sync step is the bill. Forward and backward are perfectly parallel — N workers, N× the throughput, no coordination needed. The only thing that costs you is the gradient-averaging round, and that cost grows with the number of parameters in the model, not the number of examples in the batch. That is why a 70 B model on 1024 GPUs is communication-bound while a 70 M model on 1024 GPUs is compute-bound.
- You cannot use DP to fit a bigger model. Every worker holds the entire model. Eight workers means eight identical copies of . DP buys you more throughput, not more capacity. The moment your model stops fitting on one GPU, you need ZeRO/FSDP, tensor parallelism, or pipeline parallelism — DP alone will not save you. Sections 11.3–11.7 are the answer to that.
The Math: Mini-Batch SGD Across N Workers
Let the global mini-batch be and partition it into disjoint shards of size each, one per rank. The single-GPU gradient at parameter is:
Rewrite the sum as a sum-over-ranks:
where is the local mean gradient computed on rank . This identity is the entire mathematical content of data parallelism. The global gradient is the average of the local gradients provided each rank divides by its own shard size (so that is already a mean, not a sum).
The SGD update on every rank is then:
Three engineering consequences fall out of these two lines:
- Synchrony is required. Every is evaluated at the same . If rank 3 is one step behind, its is — a gradient for the wrong model. Async DP accepts this bias for throughput; sync DP refuses.
- The communication primitive is sum, not max or min. We need . Then we divide locally by . NCCL and other collective libraries expose this as AllReduce(SUM) — every rank ends up holding the same sum, no central server, no leader election.
- The learning rate must scale. Replacing batch with batch reduces gradient variance by . The empirical rule (Goyal et al. 2017) is linear LR scaling: multiply by , then warm it up linearly for the first ~5 % of steps to keep early training stable. Frontier runs (LLaMA, DeepSeek-V3) tune this empirically — the linear rule is the prior, not the answer.
drop_last=True on the DistributedSampler so every shard is identical, or weight each gradient by its shard size before averaging. The former is what every production codebase does — uniform shards keep all-reduce sizes uniform too, which keeps NCCL happy.AllReduce: The Sync Primitive That Makes DP Work
The math says we need , identical on every rank, in as few bytes per rank as possible. There are three natural ways to implement it; only one of them scales.
Naive parameter server: bytes on the server
Pick one rank as the server. Every other rank sends its gradient to the server, the server adds them up, then broadcasts the sum back. Total bytes through the server: about . The server is a hot spot — every byte of every gradient flows through it, both in and out. Network and CPU at the server saturate the moment is more than a handful. This is what DistBelief used in 2012; nobody uses it for synchronous frontier training today.
Tree all-reduce: latency
Arrange the ranks in a binary tree. Sum up the tree, broadcast down. Total bytes per rank: for small messages, but each link is sequential — the cost is dominated by latency, not bandwidth. Good for small payloads (latencies of microseconds), bad for whole-model gradient buffers.
Ring all-reduce: bytes per rank
This is the algorithm every modern DP framework uses, because it has a remarkable property: the bytes each rank must send and receive approach a constant (about ) as grows. Doubling the cluster does not double the traffic per GPU; it barely moves it. The algorithm has two phases, each running for rounds:
- Scatter-reduce. Split the gradient buffer on every rank into equal chunks. In round , rank sends chunk index to rank , which adds it onto its own copy of that chunk. After rounds, rank owns the full sum of exactly one chunk. Bytes per rank: .
- All-gather. Same ring, no addition — each fully-summed chunk walks more hops, overwriting each rank's stale copy. After another rounds, every rank holds every fully-summed chunk. Bytes per rank: another .
Total bytes per rank: Compare with the naive PS at on a single link. For ranks and a 70 B BF16 model (), the ring sends about per rank; the parameter server's central node would have to absorb . At 400 GB/s of network bandwidth (typical IB on a frontier cluster), that single link would need an hour per step.
Manual Numerical Walkthrough
Two by-hand calculations: one round of ring all-reduce on 4 GPUs with toy 4-element gradient vectors, and a wall-clock estimate of DP scaling on a 70 B model across 256 H100s.
Click to expand: Ring all-reduce on 4 GPUs, by hand
Setup. 4 GPUs, each with a 4-element gradient buffer. We give the buffers easy-to-track integer values: GPU chunk starts as .
| GPU | a | b | c | d |
|---|---|---|---|---|
| GPU 0 | 10 | 11 | 12 | 13 |
| GPU 1 | 20 | 21 | 22 | 23 |
| GPU 2 | 30 | 31 | 32 | 33 |
| GPU 3 | 40 | 41 | 42 | 43 |
Targets. The per-chunk sums every rank should end up with: , , , .
Phase 1, round 1 (scatter-reduce). On round 1 rank sends chunk to rank , which adds it onto its own copy of chunk . So rank 0 sends chunk a to rank 1; rank 1 sends chunk b to rank 2; rank 2 sends chunk c to rank 3; rank 3 sends chunk d to rank 0. After round 1:
| GPU | a | b | c | d |
|---|---|---|---|---|
| GPU 0 | 10 | 11 | 12 | 13 + 43 = 56 |
| GPU 1 | 20 + 10 = 30 | 21 | 22 | 23 |
| GPU 2 | 30 | 31 + 21 = 52 | 32 | 33 |
| GPU 3 | 40 | 41 | 42 + 32 = 74 | 43 |
Phase 1, rounds 2–3. Same pattern, shifted one chunk back each round. The interactive viewer above unrolls all three rounds step by step (the bookkeeping is fiddly enough by hand that it is worth watching it animate). After round 3, each GPU owns the full sum of exactly one chunk:
| GPU | Fully-summed chunk | Value |
|---|---|---|
| GPU 0 | d | 10+20+30+40+... wait, no — d sums to 13+23+33+43 = 112 |
| GPU 1 | a | 10+20+30+40 = 100 |
| GPU 2 | b | 11+21+31+41 = 104 |
| GPU 3 | c | 12+22+32+42 = 108 |
Phase 2, all-gather (rounds 4–6). Each fully-summed chunk now walks the ring without further adding. After 3 more rounds every GPU holds the same vector: . Total rounds: . Total bytes per GPU: . For a vector of 4 floats that is 24 bytes per GPU — but the same formula applies to a 140 GB gradient: 210 GB sent per rank, independent of beyond a few.
Click to expand: Sizing DP for a 70 B model on 256 H100s
Model. , BF16 gradients, so .
Cluster. 256 H100s, 32 nodes × 8 GPUs. Per-GPU peak: 989 BF16 TFLOPS. Inter-node: 400 GB/s NDR Infiniband; intra-node: 900 GB/s NVLink. NCCL uses a hierarchical ring that effectively operates at the inter-node 400 GB/s bound for whole-cluster collectives.
AllReduce time per step. Bytes per GPU: . Time: .
Compute time per step. Take micro-batch 1 × sequence 4096 per GPU. FLOPs per step across the cluster: . Aggregate cluster compute at 50 % MFU: FLOP/s. Time: .
Overlap headroom. Compute > all-reduce, so DDP can hide the entire all-reduce inside the backward pass with bucket-level overlap. Effective step time stays near 3.5 s, throughput is .
What happens at 2048 GPUs. Compute per step drops to , but all-reduce per GPU is essentially unchanged at . Communication now dominates — the cluster spends 60 % of every step waiting for the network. This is the wall pure DP hits, and why every > 1024-GPU frontier run uses FSDP or sharded-optimiser DP (ZeRO-1/2) layered on top.
Visualizing the Ring and the Scaling Wall
Two widgets. The first animates ring all-reduce step by step — watch the cyan chunks accumulate into the green fully-reduced chunks, then propagate around the ring in the all-gather phase. The second is a scaling calculator: pick a model, GPU count, network bandwidth, and precision, and see when DP's communication cost overtakes compute.
Plain Python: DP and Ring-AllReduce From Scratch
Before reaching for PyTorch DDP, let us build the whole thing in plain NumPy — including the ring all-reduce, by hand, no MPI. The toy model is a single linear layer; the toy "cluster" is a list of arrays in one process. The mechanics are byte-identical to what runs across 2048 H800s.
The output of that script ends with two lines: DP averaged grad ≈ big-batch grad — the mathematical identity at the heart of DP, verified to floating point. If those two numbers ever disagree, DP is broken; finding why they disagree is one of the most common distributed-training debugging exercises (mis-sized shards, missing set_epoch, non-deterministic kernels, dropout-RNG drift).
PyTorch: Real DDP on a GPU Cluster
The plain-Python version was the mechanism. The PyTorch version is the same mechanism, except: the ring all-reduce lives in NCCL kernel code, the bucketing is automatic, the overlap with backward is hooked into autograd. Two extra lines around your existing training loop and you go from one GPU to one thousand.
Three numbers worth memorising for any DDP run:
| Knob | Default | Why it matters at scale |
|---|---|---|
| bucket_cap_mb | 25 MB | Smaller → more buckets → finer overlap, more latency overhead. Tune up for huge models. |
| find_unused_parameters | False | True adds a graph traversal per step. Leave False for transformer pre-training. |
| static_graph | False | True lets DDP reuse bucket assignment across steps — strict win when graph is fixed. |
| gradient_as_bucket_view | False | True makes .grad a view into the bucket — saves 1× model size, no downside for normal training. |
At Massive Scale: Where Pure DP Falls Off
DP's simplicity makes it the universal first choice; the same simplicity dictates its ceiling. Three forces push frontier runs past pure DP and into the hybrid parallelism stack that the rest of Chapter 11 builds.
- Memory: weights, gradients, master, Adam. For a 70 B model in BF16, every rank holds of weights, the same again in gradients, in FP32 master weights, and in Adam moments — a total of per rank, far more than any single GPU's 80–192 GB of HBM. Pure DP cannot run this model at all. ZeRO-1 shards the optimiser state, ZeRO-2 also shards the gradient, ZeRO-3 (and PyTorch's FSDP) shards the parameters themselves — at the cost of extra all-gather operations every forward and backward. Section 11.4 builds this in full.
- Communication: per step, every step. Even when memory fits, the all-reduce volume is fixed by model size. Doubling GPUs doubles compute throughput, but per-GPU all-reduce time barely moves — so the gap between compute and comm narrows until comm dominates. At that point you must either compress the gradients (PowerSGD, 1-bit Adam, FP8 collectives) or stop having every rank hold every parameter. Tensor parallelism (Section 11.5) attacks the latter — each layer's matrix is split across a few ranks, turning a big AllReduce into smaller in-shard reductions.
- Pipeline bubbles for very deep models. Once a model is too big for one node, you can also slice it by layer (pipeline parallelism, Section 11.6). DP composes with that: across a 256-stage pipeline, each pipeline replica is its own data-parallel group running synchronous DP. Frontier runs are almost always "DP × TP × PP × EP" — a Cartesian product of parallelism dimensions with DP as the outermost loop.
What DeepSeek-V3 actually does
DeepSeek-V3 uses DP at the outermost level across roughly 64-rank data-parallel groups, with each group running a complex inner configuration of expert parallelism, sequence parallelism, and pipeline parallelism that the rest of the chapter unpacks. The DP all-reduce on the optimiser-state side is sharded ZeRO-style so that no rank holds the whole optimiser. The point is not that DP is obsolete at frontier scale — it is that it is one axis of a 4-D parallelism layout, and its specific contribution is what we characterised in this section: synchronous, replicated-model, ring-all-reduce gradient averaging.
Engineering Reality: Bucketing, Overlap, and the Tricks That Matter
Four engineering wins separate a textbook DP implementation from what production frameworks actually do. Each is worth roughly a 2× throughput improvement on a real cluster; combined they are the difference between "multi-GPU works" and "multi-GPU is the platform."
- Gradient bucketing. Issuing one all-reduce per parameter is a disaster: a 7 B model has hundreds of thousands of parameters, each all-reduce has microseconds of kernel-launch overhead, and the total per-step overhead dwarfs the actual transfer. DDP bundles consecutive parameters into ~25 MB buckets and issues one all-reduce per bucket. The bucket size is tuned so the per-bucket overhead is small but the bucket itself is large enough to be bandwidth-bound. 25 MB on a 400 GB/s link is 60 µs of actual transfer; the launch overhead is roughly 10 µs — a healthy 6:1 ratio.
- Backward-pass overlap. Buckets fire all-reduce as soon as they fill, while the backward pass is still computing earlier layers' gradients. On a transformer, this means the all-reduce on layer 80's gradients is in flight while CUDA is still computing the gradient for layer 60. By the time backward returns, almost all communication is done — DDP's step time is , not their sum. This is the single biggest reason DP scales further than its naive analysis suggests.
- Mixed-precision collectives. Sending gradients in BF16 instead of FP32 halves the wire volume; sending them in FP8 (after appropriate scaling, see Section 10.3) halves it again. Modern NCCL exposes typed all-reduce, and frameworks (DeepSpeed, Megatron-LM, the DeepSeek stack) cast on the wire and accumulate in higher precision on receive. For a 70 B model, BF16 → FP8 collectives save 70 GB per step per rank. On 1024 ranks that is 70 TB of network traffic per step removed.
- Sharded optimiser state (ZeRO-1). Even before touching the full ZeRO-3 / FSDP redesign, ZeRO-1 splits Adam's moments across DP ranks: rank owns the master copy of parameter shard . The forward pass still uses replicated BF16 weights, but optimiser memory drops by . For a 70 B model on 64 DP ranks, that frees of HBM per rank — enough to enable longer sequences or larger batches. Section 11.4 explains the full sharding hierarchy.
The failure modes you will actually see
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss diverges from single-GPU baseline | Forgot to scale LR by N_DP | Linear LR scaling + warmup |
| NCCL hang on step 2 with no error | Mis-sized gradient buffer on one rank | drop_last=True + uniform shard size |
| Loss curve identical to single-GPU but 2× slower than expected | Backward not overlapping with comm | Enable static_graph, set gradient_as_bucket_view |
| Step time fluctuates ±50% step to step | Stragglers / contended IB links | Topology-aware pinning; investigate one slow node |
| Different log losses on rank 0 vs rank 5 | Dropout seeded the same on every rank | Seed = base_seed + rank for dropout RNG only |
| MoE training: severe load imbalance + comm spikes | Expert all-to-all collides with DP all-reduce | Stagger collectives or use separate process groups |
The deepest lesson of this section. Data parallelism is the parallelism strategy with the strongest mathematical guarantees — it is provably equivalent to single-GPU SGD on a bigger batch, end of story. Every other parallelism technique we will meet — tensor, pipeline, expert, sequence — has a numerical caveat: a slightly different forward pass, a slightly different gradient, a slightly different floating-point accumulation order. DP has none. The price you pay for that purity is the per-step gradient all-reduce — bandwidth proportional to model size, paid every single step of training. The rest of Chapter 11 is the engineering of how to pay that price as cheaply as possible (FSDP, ZeRO) or how to avoid paying it for the entire model at once (TP, PP, EP). But DP is the baseline every other technique is measured against, and the one every other technique composes on top of.