Chapter 11

Why One GPU Is Not Enough

Distributed Training: DualPipe and the Parallelism Stack

The Real Problem: A 671B Model Will Not Fit

An NVIDIA H100 has 80 GB of HBM. That is a remarkable amount of memory, twenty times what the original GPT-2 needed end-to-end. Yet DeepSeek-V3, the model this book is built around, has 671 billion parameters. In BF16, the working weights alone occupy $671 \times 10^{9} \times 2 \approx 1.34$ TB. That is more than sixteen full H100s before we have allocated a single gradient, a single optimizer slot, or a single activation tensor.

And the working weights are the cheapest line item. Add gradients, add AdamW's two FP32 moments per parameter, add a master FP32 copy of the weights for numerical safety (Chapter 10), and the per-parameter memory cost is not 2 bytes but roughly $2 + 2 + 12 = 16$ bytes. The same model now needs $671 \times 10^{9} \times 16 \approx 10.7$ TB of HBM, about 134 H100s just to hold the static state of one training step. Then we layer activations on top: at batch 1 and a 4 K sequence, DeepSeek-V3's 61 layers contribute another few tens of GB. Push the sequence to 32 K and that term grows eightfold, to several hundred GB.

The old single-GPU tricks do not bend far enough. There is no clever compression that crushes 10 TB into 80 GB without throwing away information the model needs. Gradient checkpointing helps with activations but does nothing for parameters. Mixed precision (Chapter 10) cuts the constant but cannot change the linear scaling in $N$. Quantization-aware training shrinks deployment cost, not training cost. The honest conclusion is that modern foundation models fundamentally do not fit on any single accelerator, not now and not in any plausible HBM roadmap. You need many GPUs. The rest of this chapter is about how to spread one training step across many GPUs without wasting either compute or bandwidth.

What this section establishes. Before we can argue about which parallelism strategy to use (data, tensor, pipeline, expert), we need to internalize the memory wall that forces us off a single GPU in the first place. If you can compute the four memory terms — parameters, gradients, optimizer state, activations — in your head for any model, you can predict in advance whether a training run will fit, and how many GPUs you will need at minimum.

Intuition: Four Tenants Sharing an 80 GB Apartment

Think of each H100 as a small apartment with 80 GB of floor space. Four tenants want to move in for the duration of one training step:

  1. Parameters — the model's memorized knowledge. Always present. The long-term resident.
  2. Gradients — a temporary roommate the same size as the parameters. Arrives during backward, leaves after the optimizer step.
  3. Optimizer state — AdamW's bookkeeping. For a BF16 model, this tenant brings six times as much luggage as the parameters themselves: an FP32 master copy and two FP32 moment estimates.
  4. Activations — the only tenant whose size depends on the batch and sequence length. Arrives during forward, stays until backward consumes it.

For a small model the four tenants share happily. For a 7B-parameter model, AdamW alone wants 84 GB and the apartment is already over capacity. For a 671B-parameter model the parameters alone want sixteen apartments, and the optimizer state wants another hundred. There is no rearranging that fits four tenants this large into one 80 GB unit. You must rent more apartments — and the rest of this chapter is about how to split the tenants between buildings without forcing them to ship boxes back and forth constantly.

A useful number to memorize: for a BF16 model trained with AdamW, the per-parameter footprint is roughly $2 + 2 + 12 = 16$ bytes. So a model with $P$ billion parameters needs about $16 \times P$ GB of static memory. Llama-7B → 112 GB. Llama-70B → 1.12 TB. DeepSeek-V3 → 10.7 TB. Activations come on top of that.
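The rule of thumb is easy to check in a few lines. This is a minimal sketch, assuming the BF16 + AdamW byte counts quoted above and decimal GB (1 GB = 1e9 bytes); the function name `static_footprint_gb` is a hypothetical helper, not part of any library.

```python
# Rule-of-thumb static footprint: BF16 weights + BF16 grads + FP32 AdamW state.
BYTES_PER_PARAM = 2 + 2 + 12  # working weights + gradients + (master w, m, v)

def static_footprint_gb(billions_of_params: float) -> float:
    """Static training memory in decimal GB, activations excluded.

    One billion parameters at 16 bytes each is 16e9 bytes, i.e. 16 GB,
    so the result is simply 16 times the size in billions.
    """
    return billions_of_params * BYTES_PER_PARAM

for name, size_b in [("Llama-7B", 7), ("Llama-70B", 70), ("DeepSeek-V3", 671)]:
    print(f"{name:12s} {static_footprint_gb(size_b):>8.0f} GB")
```

The three results (112, 1120, and 10736 GB) match the figures quoted in the paragraph.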

The Mathematics of the Memory Wall

Let $N$ be the parameter count, $L$ the number of layers, $d$ the hidden dimension, $B$ the per-GPU batch, and $S$ the sequence length in tokens. Let $b_p, b_g, b_o, b_a$ be the bytes used per parameter for the working weights, gradients, and optimizer state, and per activation element, respectively. The per-step memory in bytes is

$$M(N, L, d, B, S) = N \cdot b_p + N \cdot b_g + N \cdot b_o + k \cdot L \cdot B \cdot S \cdot d \cdot b_a$$

Three of the four terms are independent of input shape; they grow only with model size. The fourth, the activation term, is the only term you can shrink by changing the batch or sequence length. The factor $k \approx 12$ is the empirical per-block activation multiplier for a transformer block under selective recomputation (Q, K, V, scores, post-softmax, FFN intermediates).

For a BF16 model trained with AdamW, the static (model + grads + optimizer) cost reduces to $M_{\text{static}} = (b_p + b_g + b_o) \cdot N = 16N$ bytes. The fit-on-one-GPU inequality becomes

$$16N + 12 \cdot L \cdot B \cdot S \cdot d \cdot b_a \;\le\; \text{HBM}_{\text{GPU}}$$

For an H100 the right-hand side is $80 \cdot 2^{30} \approx 8.59 \times 10^{10}$ bytes. Ignoring activations entirely, the static-only budget is $N \le 8.59 \times 10^{10} / 16 \approx 5.4 \times 10^{9}$. Plugging in Llama-7B's shape ($B = 1$, $S = 4096$, $d = 4096$, $L = 32$, $b_a = 2$) subtracts about $1.29 \times 10^{10}$ bytes of activations and tightens the bound to $N \le 4.6 \times 10^{9}$. Either way the limit sits below 7 billion: even at batch 1, a 7B model under this recipe does not fit on one H100.

| Symbol | Meaning | Typical value |
| --- | --- | --- |
| N | Total parameters | 1e8 → 7e11 |
| L | Transformer layers (depth) | 12 → 120 |
| d | Hidden dimension d_model | 768 → 16384 |
| B | Per-GPU micro-batch size | 1 → 32 |
| S | Sequence length (tokens) | 1024 → 131072 |
| b_p | Bytes per parameter (working) | 1 (FP8), 2 (BF16), 4 (FP32) |
| b_g | Bytes per gradient | 2 (BF16) or 4 (FP32) |
| b_o | Optimizer bytes per param (AdamW) | 12 (FP32 w + m + v) |
| b_a | Bytes per activation element | 1 (FP8), 2 (BF16), 4 (FP32) |
| k | Per-block activation multiplier | ≈ 12 with selective recompute |
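The fit-on-one-GPU inequality can be rearranged to solve for $N$ directly. The sketch below does that for the Llama-7B shape, assuming the BF16 + AdamW recipe (16 static bytes per parameter), $k = 12$, and an 80 GiB budget; `max_params_on_one_gpu` is an illustrative name, not an existing API.

```python
def max_params_on_one_gpu(layers, hidden, batch, seq,
                          hbm_bytes=80 * 2**30, static_bytes_per_param=16,
                          act_bytes=2, k=12):
    """Largest N with 16*N + k*L*B*S*d*b_a <= HBM (integer floor, never negative)."""
    acts = k * layers * batch * seq * hidden * act_bytes
    return max(0, (hbm_bytes - acts) // static_bytes_per_param)

# Static-only bound: pretend activations cost nothing.
static_only = (80 * 2**30) // 16
# Bound once the activation term for this shape is subtracted.
with_acts = max_params_on_one_gpu(layers=32, hidden=4096, batch=1, seq=4096)

print(f"static only : N <= {static_only / 1e9:.2f} B")
print(f"with acts   : N <= {with_acts / 1e9:.2f} B")
```

Both bounds land below 7 billion (about 5.37 B and 4.56 B respectively), which is the point of the inequality: the 7B model is over budget before any batch size or framework overhead is considered.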

Manual Numerical Walkthrough


Case A — Llama-7B, BF16 + AdamW, batch 1, seq 4096.

N      = 7  e9     params
L      = 32        layers
d      = 4096      hidden
B      = 1         batch
S      = 4096      seq
b_p    = 2  (BF16)
b_g    = 2  (BF16)
b_o    = 12 (FP32 master w + m + v)
b_a    = 2  (BF16)
k      = 12

params   = 7e9 * 2  = 14.0 GB
grads    = 7e9 * 2  = 14.0 GB
optim    = 7e9 * 12 = 84.0 GB
acts     = 12 * 32 * 1 * 4096 * 4096 * 2 = 12.9 GB

total    = 14 + 14 + 84 + 12.9 = 124.9 GB
H100     = 80   GB
needed   = ceil(124.9 / 80) = 2 GPUs

Already two H100s for the smallest production-class open model, at batch 1, with no margin for the framework, communication buffers, or fragmentation. A real training run wants per-GPU batches above 1 and never operates at 100% HBM — in practice Llama-7B SFT runs on 8 GPUs minimum.

Case B — Same model, but batch 4 and seq 8192. Only the activation term changes:

acts     = 12 * 32 * 4 * 8192 * 4096 * 2
         = 103.1 GB        <- 8x the previous activation term

total    = 14 + 14 + 84 + 103 = 215 GB
needed   = ceil(215 / 80) = 3 GPUs

One change to the data loader — 4x larger batch, 2x longer context — has bumped the GPU floor from 2 to 3. This is why throughput engineers obsess over the activation term.
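The jump between Case A and Case B comes entirely from the activation term, which can be isolated in a small helper. This is a sketch under the same assumptions as the walkthrough ($k = 12$, BF16 activations, decimal GB); `activation_bytes` is a hypothetical name.

```python
def activation_bytes(layers, batch, seq, hidden, act_bytes=2, k=12):
    """Activation term k * L * B * S * d * b_a of the memory formula, in bytes."""
    return k * layers * batch * seq * hidden * act_bytes

base = activation_bytes(layers=32, batch=1, seq=4096, hidden=4096)   # Case A
big = activation_bytes(layers=32, batch=4, seq=8192, hidden=4096)    # Case B

print(f"batch 1, seq 4096: {base / 1e9:6.1f} GB")
print(f"batch 4, seq 8192: {big / 1e9:6.1f} GB ({big // base}x)")
```

Because the term is a plain product, a 4x batch and a 2x sequence length multiply to exactly 8x, regardless of model size.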

Case C — DeepSeek-V3 671B, same recipe, batch 1, seq 4096.

N      = 671 e9
L      = 61
d      = 7168
B      = 1
S      = 4096

params   = 671e9 * 2  = 1.34 TB
grads    = 671e9 * 2  = 1.34 TB
optim    = 671e9 * 12 = 8.05 TB
acts     = 12 * 61 * 1 * 4096 * 7168 * 2 = 43.0 GB

total    = 1.34 + 1.34 + 8.05 + 0.043 TB
         = 10.77 TB
needed   = ceil(10.77 TB / 80 GB) = 135 GPUs

The optimizer state alone is 8 TB. No single GPU comes anywhere near 8 TB of memory. DeepSeek-V3 cannot be trained without sharding that state across machines, which is exactly what ZeRO and FSDP do (Section 11.7), and is why sharded optimizer state is non-negotiable at frontier scale.
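To see why sharding helps, it can be useful to compute the per-GPU static footprint under different sharding levels. The sketch below assumes the commonly described ZeRO stages (stage 1 shards optimizer state, stage 2 also shards gradients, stage 3 also shards parameters) and ignores activations, buffers, and communication. The function name and the 8-GPU, 7B example are hypothetical, chosen only for illustration.

```python
def static_bytes_per_gpu(n_params, n_gpus, stage,
                         param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Per-GPU static bytes under ZeRO-style sharding across n_gpus replicas.

    stage 0: everything replicated    stage 1: optimizer state sharded
    stage 2: + gradients sharded      stage 3: + parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    optim = n_params * optim_bytes / (n_gpus if stage >= 1 else 1)
    grads = n_params * grad_bytes / (n_gpus if stage >= 2 else 1)
    params = n_params * param_bytes / (n_gpus if stage >= 3 else 1)
    return params + grads + optim

for stage in range(4):
    gb = static_bytes_per_gpu(7e9, n_gpus=8, stage=stage) / 1e9
    print(f"ZeRO stage {stage}: {gb:6.2f} GB per GPU")
```

For this example the static footprint drops from 112 GB replicated to 38.5 GB with only the optimizer state sharded, which is why the optimizer is the first thing to shard.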

Case D: DeepSeek-V3 with their FP8 recipe (Chapter 10), using FP8 working weights (1 byte), BF16 grads (2 bytes), and compressed AdamW state (~6 bytes). Then $b_p + b_g + b_o = 9$ instead of 16:

static   = 671e9 * 9 = 6.04 TB
acts     ~ 31 GB (FP8 activations on the FP8 paths)
total    ~ 6.07 TB
needed   = 76 H100s minimum   <- vs 135 with BF16 recipe

FP8 cuts the GPU floor almost in half — which compounds into roughly half the cluster cost for the same training run. That single multiplier is the financial reason DeepSeek invested so heavily in the FP8 stack you read about in Chapter 10.
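The comparison between the two recipes can be reproduced with a short script. This sketch takes the byte counts exactly as quoted in Cases C and D, counts static state only, and uses decimal GB for the 80 GB budget; the `RECIPES` table and `static_gpu_floor` are illustrative names.

```python
import math

RECIPES = {
    # name: (param_bytes, grad_bytes, optim_bytes), as quoted in Cases C and D
    "BF16 + AdamW": (2, 2, 12),
    "FP8 recipe": (1, 2, 6),
}

def static_gpu_floor(n_params, recipe, hbm_bytes=80e9):
    """Minimum GPU count needed to hold static state only (no activations)."""
    bytes_per_param = sum(RECIPES[recipe])
    return math.ceil(n_params * bytes_per_param / hbm_bytes)

for name in RECIPES:
    print(f"{name:14s} -> {static_gpu_floor(671e9, name)} H100s")
```

The static-only floors come out at 135 and 76, the same counts as the hand calculation, because the activation term is small next to the static state at batch 1 and a 4 K sequence.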

Visualizing the Budget — and When It Breaks

The visualization below stacks the four memory terms as a single bar and draws the 80 GB H100 budget as a dashed line. Try the Llama-7B preset and notice that under BF16 + AdamW the bar already overshoots the budget at batch 1, because optimizer state alone is 84 GB. Switch to Llama-70B and the bar leaves the budget line behind immediately. Push the sequence slider to 32 K and watch the activation segment (rose) balloon past everything else.

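[Interactive memory-budget visualizer: stacked bar of parameters, gradients, optimizer state, and activations against the 80 GB H100 budget line]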

Three readings to take from this. First, optimizer state (amber) is the silent dominator at the BF16 + AdamW recipe almost every shop ships: it is bigger than parameters, gradients, and activations combined for most realistic batches. That is what makes ZeRO/FSDP the highest-leverage memory intervention. Second, FP8 (the DeepSeek-V3 preset) is the only configuration that meaningfully shrinks the static cost; mixed BF16 has been the standard for five years and will not save you from the 671B wall. Third, sequence length is a one-way street: the activation term scales linearly in $S$, and long-context capability cannot be added without either pipeline parallelism or activation recomputation (or both, as DeepSeek-V3 does).

What the visualization does not show

The bar accounts for the four big tenants. In practice each GPU also reserves ~2 GB for the CUDA context, ~1 GB for NCCL communication buffers, and 5-15% for memory fragmentation. Always budget for 75% of HBM, not 100%. The bar is the theoretical floor; the practical ceiling is lower.
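A practical floor can be estimated by subtracting those reservations before dividing. The sketch below takes the overheads quoted above (about 2 GB of CUDA context, about 1 GB of NCCL buffers) and assumes the pessimistic end of the fragmentation range (15%). The helper names are hypothetical, and the real figures vary by driver, framework version, and workload.

```python
import math

def usable_hbm_gb(hbm_gb=80.0, cuda_context_gb=2.0, nccl_gb=1.0, fragmentation=0.15):
    """HBM left for the four tenants after fixed reservations and fragmentation."""
    return (hbm_gb - cuda_context_gb - nccl_gb) * (1.0 - fragmentation)

def practical_gpu_floor(total_gb, usable_gb):
    """GPU count once each device only offers usable_gb to the model."""
    return math.ceil(total_gb / usable_gb)

usable = usable_hbm_gb()
print(f"usable HBM per GPU      : {usable:.2f} GB")
print(f"Case A (124.9 GB total) : {practical_gpu_floor(124.9, usable)} GPUs")
print(f"Case B (215 GB total)   : {practical_gpu_floor(215.0, usable)} GPUs")
```

With these assumptions about 65 GB per GPU is actually available, and Case B moves from a theoretical floor of 3 GPUs to 4.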

Plain Python: Counting Bytes Before Touching a GPU

Below is the calculation as a function — the kind of script every team should have in their bootstrapping repo. It takes no dependencies, runs in microseconds, and saves the kind of debug cycle that otherwise involves provisioning a multi-GPU node just to discover an OOM after 40 minutes of model loading.

memory_budget.py

Line 3: Why bytes, not bits

All accelerator memory budgets are quoted in bytes (GB on H100, MB on edge devices). Every shape, every dtype, ultimately reduces to a byte count. The whole job of this script is to do that arithmetic before any tensor is allocated, because once you call .cuda() and OOM, the only feedback is a stack trace, never a number.

EXECUTION STATE
GB = 2^30 = 1073741824

Line 5: The five inputs that decide everything

Five knobs control the entire budget: parameter count, depth, hidden dim, per-GPU batch, and sequence length. Notice what is NOT here: the dataset size and number of training steps do not appear, because the cost of one step is independent of them. The hard wall is per step, not per epoch.

Line 11: param_bytes, the precision of the working weights

Each parameter occupies param_bytes of HBM. FP32 weights cost 4, BF16 cost 2, FP8 cost 1. Mixed-precision training (Chapter 10) keeps the WORKING weights in BF16 or FP8 but a separate FP32 MASTER copy lives in the optimizer state; that is accounted for separately in optim_bytes.

EXECUTION STATE
param_bytes (BF16) = 2

Line 13: optim_bytes = 12, the surprising hog

AdamW stores three FP32 tensors per parameter: the master weight w, the first moment m, and the second moment v. Three × 4 bytes = 12 bytes per parameter. For a 7B model that is 84 GB of optimizer state alone, more than fits on one H100 and six times the BF16 weight size. This single line is the reason ZeRO and FSDP exist (Section 11.7).

EXECUTION STATE
master w (FP32) = 4 B/param
m (FP32) = 4 B/param
v (FP32) = 4 B/param

Line 16: params, linear in N

Parameter memory scales linearly with parameter count. Easy to forget how brutal that is: every billion parameters in BF16 costs 2 GB. DeepSeek-V3 at 671B parameters needs 1.34 TB just for the BF16 weights, more than sixteen full H100s before grads, optimizer, or activations enter the conversation.

Line 22: Activations, the only term that depends on input shape

Parameters, gradients, and optimizer state all scale only with N. Activations are different: they scale with batch × seq × hidden × layers. The 12 factor bundles the per-block intermediates (Q, K, V, attention scores, post-softmax, FFN up/down) under selective recomputation. Double the sequence length and activation memory doubles; this is why long-context training is its own engineering problem (Chapter 12).

EXECUTION STATE
k (per-block factor) = 12

Line 28: Three cases, three regimes

GPT-2 fits on a phone. Llama-7B already needs two H100s at batch 1 under this recipe. DeepSeek-V3 requires well over a hundred H100s just to hold a single forward+backward step. The same formula spans all three; the difference is purely in the constants you plug in.

Line 47: Ceiling-divide for GPU count

The trick -(-total // H100) is Python's idiomatic integer ceiling divide. It tells you the minimum number of GPUs needed if you could perfectly slice memory across them. Real systems waste 10-30% to replication and communication overhead, so multiply by 1.3 for a realistic floor.

# memory_budget.py: figure out if a training step will fit on one GPU.

GB = 1024 ** 3   # one gibibyte in bytes

def training_step_bytes(
    N_params: int,         # total parameters in the model
    layers: int,           # transformer layers (depth)
    hidden: int,           # d_model
    batch: int,            # micro-batch size per GPU
    seq: int,              # sequence length in tokens
    param_bytes: int = 2,  # BF16 working copy of weights
    grad_bytes: int = 2,   # BF16 gradients
    optim_bytes: int = 12, # AdamW: master FP32 (4) + m FP32 (4) + v FP32 (4)
    act_bytes: int = 2,    # BF16 activations
):
    params = N_params * param_bytes
    grads  = N_params * grad_bytes
    optim  = N_params * optim_bytes

    # Activation memory for one forward+backward with selective recompute,
    # the standard ~12-per-token estimate for a transformer block.
    acts = 12 * layers * batch * seq * hidden * act_bytes

    total = params + grads + optim + acts
    return dict(params=params, grads=grads, optim=optim,
                acts=acts, total=total)

# ----- Three case studies, same formula -----
cases = {
    "GPT-2 124M, ctx 1k": dict(N_params=124_000_000,
                               layers=12, hidden=768,
                               batch=1, seq=1024),
    "Llama-7B, ctx 4k": dict(N_params=7_000_000_000,
                             layers=32, hidden=4096,
                             batch=1, seq=4096),
    "DeepSeek-V3 671B, ctx 4k": dict(N_params=671_000_000_000,
                                     layers=61, hidden=7168,
                                     batch=1, seq=4096),
}

H100 = 80 * GB
for name, kw in cases.items():
    r = training_step_bytes(**kw)
    print(f"{name}")
    for k in ("params", "grads", "optim", "acts", "total"):
        print(f"  {k:6s} = {r[k] / GB:8.2f} GB")
    print(f"  -> needs at least {-(-r['total'] // H100)} x 80GB H100 to fit one step\n")

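Running this script prints the following. Note that the script's GB is $2^{30}$ bytes, so its figures run about 7% below the decimal-GB hand calculation above, and the DeepSeek-V3 floor comes out as 126 rather than 135: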

GPT-2 124M, ctx 1k
  params =     0.23 GB
  grads  =     0.23 GB
  optim  =     1.39 GB
  acts   =     0.21 GB
  total  =     2.06 GB
  -> needs at least 1 x 80GB H100 to fit one step

Llama-7B, ctx 4k
  params =    13.04 GB
  grads  =    13.04 GB
  optim  =    78.23 GB
  acts   =    12.00 GB
  total  =   116.31 GB
  -> needs at least 2 x 80GB H100 to fit one step

DeepSeek-V3 671B, ctx 4k
  params =  1249.83 GB
  grads  =  1249.83 GB
  optim  =  7499.01 GB
  acts   =    40.03 GB
  total  = 10038.71 GB
  -> needs at least 126 x 80GB H100 to fit one step
Calibrate before you cluster. Run this script for every new model size you design. If the answer is > 4 GPUs, you are no longer doing data-parallel-only training and you owe your team an architecture decision (Sections 11.2 through 11.6) before anyone provisions hardware.

PyTorch: Reading the Same Numbers from a Live Model

The formula gets you within 5-10% of the truth for the static terms; activation memory needs a real measurement because the constant 12 depends on the exact block architecture (attention variant, FFN width, recomputation policy). Below is the end-to-end pattern: count parameter/grad/optimizer bytes from the model object, then measure peak HBM during a real forward + backward to calibrate the activation constant.

verify_with_pytorch.py

Line 7: p.numel() * p.element_size(), the canonical byte count

Every PyTorch tensor exposes both numel() (number of elements) and element_size() (bytes per element: 2 for BF16, 4 for FP32, 1 for FP8). Their product is the exact byte cost. Get into the habit of summing this across model.parameters() before any training run; it is the cheapest debug you will ever do.

EXECUTION STATE
BF16 element_size() = 2
FP32 element_size() = 4

Line 9: Why grads cost the same as params (usually)

Every parameter that requires_grad gets a .grad tensor with the same shape and dtype during backward. So grad bytes ≈ param bytes. The exception is when you keep grads in a different precision (some FP8 recipes do BF16 grads for FP8 weights); then grad_bytes is 2× param_bytes.

Line 15: AdamW is the silent dominator

Three FP32 tensors per parameter, independent of the model's compute dtype. For a BF16 model the optimizer state is 6× the working weights. For an FP8 model it can be 12× the working weights. This is why memory-saving optimizer variants (8-bit Adam, ZeRO sharding) target the optimizer state first: it is the biggest single line item.

Line 25: One Llama-7B-style block, not the whole model

self.qkv is the fused QKV projection, self.out is the attention output projection, and self.up and self.down stand in for the MLP. The toy block omits the SwiGLU gate projection and the norms, so it holds about 157M parameters; a full Llama-7B block with the gate has roughly 200M, and 32 of those plus the embeddings reproduce the 7B headline number.

EXECUTION STATE
QKV params = 4096 × 12288 ≈ 50 M
out proj = 4096 × 4096 ≈ 17 M
FFN up = 4096 × 11008 ≈ 45 M
FFN down = 11008 × 4096 ≈ 45 M

Line 42: BF16 + cuda, the production default

to(dtype=torch.bfloat16, device='cuda') is the one-line move that takes a model from 'theoretical' to 'trains on H100'. Notice we do this BEFORE measuring memory; otherwise the params live on CPU and the numbers are meaningless.

Line 50: Direct measurement of peak memory

torch.cuda.reset_peak_memory_stats() + max_memory_allocated() bracket the peak HBM use during forward + backward. The difference between peak and 'before' is dominated by activation memory, the quantity our 12·L·B·S·d formula is trying to estimate, but it can also include the input tensor and any .grad buffers alive at the peak. Comparing the measured value to the formula, after accounting for those extras, calibrates the constant 12 for your specific block.

Line 60: Multiply per-block measurements by depth

Activations are stored layer-by-layer for the backward pass, so Llama-7B's 32 layers each contribute their share of activation memory. This is the exact line that motivates pipeline parallelism (Section 11.4): if we could compute layers on different GPUs we would not need to hold all their activations on one device.

# verify_with_pytorch.py: read the same numbers from a real model.
import torch
import torch.nn as nn
import torch.nn.functional as F

def param_bytes(model: nn.Module) -> int:
    return sum(p.numel() * p.element_size() for p in model.parameters())

def grad_bytes(model: nn.Module) -> int:
    # PyTorch lazily allocates .grad after the first backward,
    # so we mirror the parameter dtype/shape ahead of time.
    return sum(p.numel() * p.element_size()
               for p in model.parameters() if p.requires_grad)

def adamw_state_bytes(model: nn.Module) -> int:
    # AdamW keeps: master w (FP32), m (FP32), v (FP32) per param.
    # We compute against FP32 even if the model is BF16; that is
    # the entire point of the mixed-precision recipe (Chapter 10).
    return sum(p.numel() * 4 * 3 for p in model.parameters())

# A toy transformer block roughly the size of one Llama-7B layer (one of 32).
# It omits the SwiGLU gate projection and the norms to stay short.
hidden, ffn, heads = 4096, 11008, 32

class ToyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.qkv  = nn.Linear(hidden, 3 * hidden, bias=False)  # fused QKV
        self.out  = nn.Linear(hidden, hidden,     bias=False)  # out proj
        self.up   = nn.Linear(hidden, ffn,        bias=False)  # FFN up
        self.down = nn.Linear(ffn,    hidden,     bias=False)  # FFN down

    def forward(self, x):
        b, s, _ = x.shape
        # Split the fused projection into Q, K, V of shape (b, heads, s, head_dim).
        q, k, v = (t.reshape(b, s, heads, -1).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.out(attn.transpose(1, 2).reshape(b, s, hidden))
        return x + self.down(F.silu(self.up(x)))

block = ToyBlock().to(dtype=torch.bfloat16, device="cuda")

GB = 1024 ** 3
print("One Llama-7B-style block:")
print(f"  params (BF16) : {param_bytes(block)        / GB:6.3f} GB")
print(f"  grads  (BF16) : {grad_bytes(block)         / GB:6.3f} GB")
print(f"  AdamW  (FP32) : {adamw_state_bytes(block)  / GB:6.3f} GB")

# Peak memory across forward + backward, measured directly. The delta
# also counts x and any .grad buffers alive at the peak.
torch.cuda.reset_peak_memory_stats()
before = torch.cuda.memory_allocated()
x = torch.randn(1, 4096, hidden, dtype=torch.bfloat16,
                device="cuda", requires_grad=True)
y = block(x).pow(2).mean()
y.backward()
peak = torch.cuda.max_memory_allocated()
print(f"  acts peak     : {(peak - before) / GB:6.3f} GB")
print(f"  -> multiply by 32 layers for the full model.")

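On an H100 this prints something close to the following. The first three rows follow from the parameter count alone; the activation peak depends on hardware, PyTorch version, and attention kernel.

One Llama-7B-style block:
  params (BF16) :  0.293 GB
  grads  (BF16) :  0.293 GB
  AdamW  (FP32) :  1.758 GB
  acts peak     :  0.413 GB
  -> multiply by 32 layers for the full model.

Two things to notice. First, the AdamW row is six times the parameter row even though the model is in BF16, exactly the 12-bytes-per-parameter penalty we predicted. The four projections in this toy block hold about 157M parameters; a full Llama-7B block also carries a SwiGLU gate projection (about 200M parameters per block), so scaling the toy block by 32 layers undershoots the back-of-envelope static numbers from the previous section. Second, the formula predicts $12 \cdot 1 \cdot 4096 \cdot 4096 \cdot 2 / 2^{30} \approx 0.375$ GB of activations per block. The reported peak of 0.413 GB is in the same range, but the difference between peak and before can also include the input tensor and the .grad buffers allocated during backward, so subtract those before using the measurement to calibrate $k$ for your block. Always measure on the actual hardware you will train on; the formula is a starting point, not the final answer.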
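Backing the constant $k$ out of a measurement is a one-line inversion of the activation term. The sketch below assumes you have already isolated the activation bytes (input tensor and gradient buffers subtracted). The measured value used here is hypothetical, picked to show the arithmetic rather than taken from a real run.

```python
def calibrate_k(measured_act_bytes, layers, batch, seq, hidden, act_bytes=2):
    """Invert acts = k * L * B * S * d * b_a to recover the per-block multiplier k."""
    return measured_act_bytes / (layers * batch * seq * hidden * act_bytes)

# Hypothetical measurement: 0.375 GiB of activations for one block at B=1, S=4096.
measured = 0.375 * 2**30
k = calibrate_k(measured, layers=1, batch=1, seq=4096, hidden=4096)
print(f"calibrated k = {k:.2f}")
```

Feeding the calibrated value back into the budget script in place of the hard-coded 12 keeps the estimate tied to your block architecture and recomputation policy.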

At Massive Scale: From 80 GB to 8 Terabytes

Once we accept that one GPU cannot hold a frontier model, four bottlenecks change character. None of them is fundamentally about compute. All of them are about how memory and data move between GPUs.

| Bottleneck | Where it bites | Which strategy attacks it |
| --- | --- | --- |
| Replicated optimizer state | AdamW state grows linearly in N and dwarfs everything else. | ZeRO / FSDP sharding (Section 11.7) |
| Parameter memory > one GPU | Even BF16 weights of a 70B model exceed 80 GB. | Tensor parallelism (Section 11.3), FSDP |
| Activation memory at long context | Acts scale linearly in S; at 128 K context they dominate. | Pipeline parallelism (Section 11.4), gradient checkpointing |
| Cross-GPU communication cost | Each parallelism strategy adds a different network pattern. | DualPipe + NVLink + InfiniBand topology (Section 11.5) |

Each row of that table is one section of this chapter. The sequencing is deliberate: we start with the cheapest strategy (data parallelism, Section 11.2) because it works for any model that already fits per-GPU; we then peel off optimizer state with ZeRO; we then peel off the parameters themselves with tensor parallelism; and finally we peel off activations with pipeline parallelism, which is where DeepSeek's DualPipe algorithm (Section 11.5) does its real work, by shrinking the bubble that naive pipeline schedules waste.

Communication is the recurring tax. Every byte you move between GPUs has to traverse NVLink (within a node, ~900 GB/s) or InfiniBand (between nodes, ~50 GB/s). At 671B parameters, gradient all-reduce per step is over a terabyte of traffic; the difference between a 50% MFU and a 35% MFU is almost always how cleverly that traffic is overlapped with compute. Memory forces you off one GPU; bandwidth decides whether the alternative is fast.
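A rough sense of that tax comes from dividing the traffic by the link bandwidth. The sketch below uses the standard bandwidth-only cost of a ring all-reduce, in which each GPU sends and receives $2(n-1)/n$ times the payload. It ignores latency, overlap with compute, and the fact that a real MoE run does not all-reduce every gradient densely. The 8-GPU ring and the function name are assumptions made for illustration.

```python
def ring_allreduce_seconds(payload_bytes, n_gpus, link_bytes_per_s):
    """Bandwidth-only lower bound on ring all-reduce time for one payload."""
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    return traffic_per_gpu / link_bytes_per_s

grad_payload = 671e9 * 2  # BF16 gradients for every parameter, in bytes

for name, bandwidth in [("NVLink ~900 GB/s", 900e9), ("InfiniBand ~50 GB/s", 50e9)]:
    t = ring_allreduce_seconds(grad_payload, n_gpus=8, link_bytes_per_s=bandwidth)
    print(f"{name:20s}: {t:6.2f} s per step")
```

Under these assumptions the same payload takes a few seconds over NVLink and most of a minute over InfiniBand, which is why overlapping communication with compute matters so much.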

Engineering Reality: Four Levers, One Chapter

At frontier scale every shop ends up combining several parallelism strategies — pure data parallelism died with models past 1B parameters, and pure pipeline parallelism wastes too much compute. DeepSeek-V3's production recipe layers them like this:

  1. Data parallelism across the outermost dimension — every replica trains on a different mini-batch shard, gradient all-reduce at the end of each step (Section 11.2).
  2. ZeRO-style sharding partitions optimizer state (and, in ZeRO-3 / FSDP, gradients and parameters as well) across the data-parallel replicas, so no replica holds a full copy of the sharded state; the DeepSeek-V3 technical report describes ZeRO-1 data parallelism (Section 11.7).
  3. Pipeline parallelism splits the 61 layers across GPUs in the depth dimension. DeepSeek's DualPipe algorithm (Section 11.5) overlaps forward and backward microbatches to shrink the time a GPU spends idle waiting for a peer.
  4. Expert parallelism — DeepSeek-V3 is a mixture-of-experts model, so the experts themselves are sharded across GPUs and routed to per token via cross-node all-to-all (Section 11.6).

Notably absent: tensor parallelism. DeepSeek made a deliberate decision to skip TP, because the combination of ZeRO sharding + DualPipe + expert parallelism already squeezed the model down to a per-GPU footprint that fits, and TP's all-reduce traffic was deemed too expensive given their InfiniBand budget (Section 11.7 explains the trade). This is the level of engineering judgment you are working toward: knowing which parallelism levers to pull, and which to leave alone, for a specific model and a specific cluster.

Everything in this chapter exists because of the inequality you worked through in this section. Memorize it, run the budget script before every training run, and the rest of the chapter will land on prepared ground.
