The Real Problem: A 671B Model Will Not Fit
An NVIDIA H100 has 80 GB of HBM. That is a remarkable amount of memory, twenty times what the original GPT-2 needed end-to-end. Yet DeepSeek-V3, the model this book is built around, has 671 billion parameters. In BF16, the working weights alone occupy 1.34 TB. That is more than sixteen full H100s before we have allocated a single gradient, a single optimizer slot, or a single activation tensor.
And the working weights are the cheapest line item. Add gradients, add AdamW's two FP32 moments per parameter, add a master FP32 copy of the weights for numerical safety (Chapter 10), and the per-parameter memory cost is not 2 bytes but roughly 16 bytes. The same model now needs about 10.7 TB of HBM, or about 134 H100s just to hold the static state of one training step. Then we layer activations on top: at batch 1 and a 4 K sequence, DeepSeek-V3's 61 layers contribute another 40 GB or so. Push the sequence to 32 K and the activation term grows eightfold, to well over 300 GB, a sizeable fraction of the working weights.
The old single-GPU tricks do not bend far enough. There is no clever compression that crushes 10 TB into 80 GB without throwing away information the model needs. Gradient checkpointing helps with activations but does nothing for parameters. Mixed precision (Chapter 10) cuts the constant but cannot change the linear scaling in $N$. Quantization-aware training shrinks deployment cost, not training cost. The honest conclusion is that modern foundation models do not fit on any single accelerator, not now and not on any plausible HBM roadmap. You need many GPUs. The rest of this chapter is about how to spread one training step across many GPUs without wasting either compute or bandwidth.
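The per-parameter accounting described above (BF16 weights and gradients, plus an FP32 master copy and two FP32 AdamW moments) can be tallied in a few lines. This is a minimal sketch: the dictionary keys and the `static_bytes` helper are illustrative names, and GB and TB are taken as decimal units.

```python
# Bytes per parameter for each static line item in a BF16 + AdamW recipe.
BYTES_PER_PARAM = {
    "working weights (BF16)": 2,
    "gradients (BF16)": 2,
    "master weights (FP32)": 4,
    "Adam first moment (FP32)": 4,
    "Adam second moment (FP32)": 4,
}

H100_BYTES = 80e9  # 80 GB of HBM, decimal gigabytes


def static_bytes(num_params: float) -> float:
    """Static training state in bytes: everything except activations."""
    return num_params * sum(BYTES_PER_PARAM.values())


total = static_bytes(671e9)
print(f"bytes per parameter: {sum(BYTES_PER_PARAM.values())}")
print(f"static state: {total / 1e12:.2f} TB")
print(f"H100 equivalents: {total / H100_BYTES:.1f}")
```

For 671 billion parameters this gives 16 bytes per parameter, 10.74 TB of static state, and about 134 H100s' worth of HBM.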
Intuition: Four Tenants Sharing an 80 GB Apartment
Think of each H100 as a small apartment with 80 GB of floor space. Four tenants want to move in for the duration of one training step:
- Parameters — the model's memorized knowledge. Always present. The long-term resident.
- Gradients — a temporary roommate the same size as the parameters. Arrives during backward, leaves after the optimizer step.
- Optimizer state — AdamW's bookkeeping. For a BF16 model, this tenant brings six times as much luggage as the parameters themselves: an FP32 master copy and two FP32 moment estimates.
- Activations — the only tenant whose size depends on the batch and sequence length. Arrives during forward, stays until backward consumes it.
For a small model the four tenants share happily. For a 7B-parameter model, AdamW alone wants 84 GB and the apartment is already over capacity. For a 671B-parameter model the parameters alone want sixteen apartments, and the optimizer state wants another hundred. There is no rearranging that fits four tenants this large into one 80 GB unit. You must rent more apartments — and the rest of this chapter is about how to split the tenants between buildings without forcing them to ship boxes back and forth constantly.
The Mathematics of the Memory Wall
Let $N$ be the parameter count, $L$ the number of layers, $d$ the hidden dimension, $B$ the per-GPU batch, and $S$ the sequence length in tokens. Let $b_p$, $b_g$, $b_o$, and $b_a$ be the bytes used for parameters, gradients, optimizer state per parameter, and activations respectively. The per-step memory in bytes is

$$M = N\,b_p + N\,b_g + N\,b_o + k \cdot L \cdot B \cdot S \cdot d \cdot b_a$$

Three of the four terms are independent of input shape: they grow only with model size. The fourth, the activation term, is the only term you can shrink by changing the batch or sequence length. The factor $k \approx 12$ is the empirical per-block activation multiplier for a transformer block under selective recomputation (Q, K, V, scores, post-softmax, FFN intermediates).

For a BF16 model trained with AdamW, the static (model + grads + optimizer) cost reduces to $(2 + 2 + 12)N = 16N$ bytes. The fit-on-one-GPU inequality becomes

$$16N + k \cdot L \cdot B \cdot S \cdot d \cdot b_a \le M_{\text{HBM}}$$
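For an H100 the right-hand side is $80 \times 10^9$ bytes. Dropping the activation term gives the static-only budget $N \le 80 \times 10^9 / 16 = 5 \times 10^9$, well short of 7 billion. Solving for $N$ with $L = 32$, $d = 4096$, $B = 1$, $S = 4096$ (Llama-7B's shape) tightens that to about 4.2 billion. Even at the tightest configuration, a 7B model is already past the edge.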
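As a worked instance of the inequality, here is the rearrangement for $N$ with Llama-7B's shape plugged in. It assumes $k = 12$, $b_a = 2$, 16 bytes of static state per parameter, and that the full 80 GB (written $M_{\text{HBM}}$ here) is available, with no allowance for framework overhead.

$$
\begin{aligned}
N &\le \frac{M_{\text{HBM}} - k \cdot L \cdot B \cdot S \cdot d \cdot b_a}{b_p + b_g + b_o} \\
  &= \frac{80 \times 10^9 - 12 \cdot 32 \cdot 1 \cdot 4096 \cdot 4096 \cdot 2}{16} \\
  &\approx \frac{80 \times 10^9 - 12.9 \times 10^9}{16} \approx 4.2 \times 10^9
\end{aligned}
$$

Under these assumptions the ceiling is roughly 4.2 billion parameters, so a 7-billion-parameter model overshoots a single H100 before any practical headroom is reserved.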
| Symbol | Meaning | Typical value |
|---|---|---|
| N | Total parameters | 1e8 → 7e11 |
| L | Transformer layers (depth) | 12 → 120 |
| d | Hidden dimension d_model | 768 → 16384 |
| B | Per-GPU micro-batch size | 1 → 32 |
| S | Sequence length (tokens) | 1024 → 131072 |
| b_p | Bytes per parameter (working) | 1 (FP8), 2 (BF16), 4 (FP32) |
| b_g | Bytes per gradient | 2 (BF16) or 4 (FP32) |
| b_o | Optimizer bytes per param (AdamW) | 12 (FP32 w + m + v) |
| b_a | Bytes per activation element | 1 (FP8), 2 (BF16), 4 (FP32) |
| k | Per-block activation multiplier | ≈ 12 with selective recompute |
Manual Numerical Walkthrough
Case A — Llama-7B, BF16 + AdamW, batch 1, seq 4096.
N = 7 e9 params
L = 32 layers
d = 4096 hidden
B = 1 batch
S = 4096 seq
b_p = 2 (BF16)
b_g = 2 (BF16)
b_o = 12 (FP32 master w + m + v)
b_a = 2 (BF16)
k = 12

params = 7e9 * 2 = 14.0 GB
grads = 7e9 * 2 = 14.0 GB
optim = 7e9 * 12 = 84.0 GB
acts = 12 * 32 * 1 * 4096 * 4096 * 2 = 12.9 GB
total = 14 + 14 + 84 + 12.9 = 124.9 GB
H100 = 80 GB
needed = ceil(124.9 / 80) = 2 GPUs
Already two H100s for the smallest production-class open model, at batch 1, with no margin for the framework, communication buffers, or fragmentation. A real training run wants per-GPU batches above 1 and never operates at 100% HBM — in practice Llama-7B SFT runs on 8 GPUs minimum.
Case B — Same model, but batch 4 and seq 8192. Only the activation term changes:
acts = 12 * 32 * 4 * 8192 * 4096 * 2
= 103.1 GB <- 8x the previous activation term
total = 14 + 14 + 84 + 103 = 215 GB
needed = ceil(215 / 80) = 3 GPUs

One change to the data loader (a 4x larger batch and a 2x longer context) has bumped the GPU floor from 2 to 3. This is why throughput engineers obsess over the activation term.
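The jump between Case A and Case B can be checked with a short sketch of the activation term alone. The helper name `activation_bytes` and the crossover calculation are illustrative additions; the sketch assumes $k = 12$, BF16 activations, and 16 bytes of static state per parameter.

```python
def activation_bytes(layers, batch, seq_len, hidden, bytes_per_elem=2, k=12):
    """Activation term of the budget: k * L * B * S * d * b_a, in bytes."""
    return k * layers * batch * seq_len * hidden * bytes_per_elem


LLAMA_7B = dict(layers=32, hidden=4096)
STATIC_BYTES = 16 * 7e9  # params + grads + AdamW state at 16 bytes/param

case_a = activation_bytes(batch=1, seq_len=4096, **LLAMA_7B)
case_b = activation_bytes(batch=4, seq_len=8192, **LLAMA_7B)
print(f"Case A activations: {case_a / 1e9:.1f} GB")
print(f"Case B activations: {case_b / 1e9:.1f} GB")
print(f"ratio: {case_b / case_a:.0f}x")

# Tokens per GPU (B * S) at which activations overtake the static state.
per_token = activation_bytes(batch=1, seq_len=1, **LLAMA_7B)
crossover_tokens = STATIC_BYTES / per_token
print(f"activations exceed static state beyond ~{crossover_tokens:,.0f} tokens per GPU")
```

With these assumptions the activation term for Llama-7B overtakes the entire static budget at roughly 35,600 tokens per GPU, which is why batch and context choices cannot be made independently of the memory plan.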
Case C — DeepSeek-V3 671B, same recipe, batch 1, seq 4096.
N = 671 e9
L = 61
d = 7168
B = 1
S = 4096
params = 671e9 * 2 = 1.34 TB
grads = 671e9 * 2 = 1.34 TB
optim = 671e9 * 12 = 8.05 TB
acts = 12 * 61 * 1 * 4096 * 7168 * 2 = 43.0 GB
total = 1.34 + 1.34 + 8.05 + 0.043 TB
      = 10.77 TB
needed = ceil(10.77 TB / 80 GB) = 135 GPUs

The optimizer state alone is 8 TB. An 8-GPU H100 node holds 640 GB of HBM in total, so that state by itself spans more than twelve nodes. DeepSeek-V3 cannot exist without sharding that state across machines, which is exactly what ZeRO and FSDP do (Section 11.7), and is why FSDP is non-negotiable at frontier scale.
Case D: DeepSeek-V3 with their FP8 recipe (Chapter 10), meaning working weights in FP8 (1 byte), gradients in BF16 (2 bytes), and compressed AdamW state (~6 bytes). The static cost is then $1 + 2 + 6 = 9$ bytes per parameter instead of 16:
static = 671e9 * 9 = 6.04 TB
acts ~ 31 GB (FP8 activations on the FP8 paths)
total ~ 6.07 TB
needed = 76 H100s minimum <- vs 135 with BF16 recipe
FP8 cuts the GPU floor almost in half — which compounds into roughly half the cluster cost for the same training run. That single multiplier is the financial reason DeepSeek invested so heavily in the FP8 stack you read about in Chapter 10.
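The two recipes can be compared side by side with a small sketch. The byte counts per parameter come from Cases C and D; the FP8 activation figure of 31 GB is taken from the text as given rather than derived, and the `RECIPES` table and `gpu_floor` helper are illustrative names.

```python
import math

H100_BYTES = 80e9   # decimal GB, as in the worked cases
N_PARAMS = 671e9

# BF16 activations from the formula k * L * B * S * d * b_a (batch 1, seq 4096).
BF16_ACTS = 12 * 61 * 1 * 4096 * 7168 * 2

# Bytes per parameter for each recipe, plus total activation bytes.
RECIPES = {
    "BF16 + AdamW": {"weights": 2, "grads": 2, "optimizer": 12, "acts": BF16_ACTS},
    "FP8 recipe": {"weights": 1, "grads": 2, "optimizer": 6, "acts": 31e9},
}


def gpu_floor(recipe, num_params=N_PARAMS, gpu_bytes=H100_BYTES):
    """Minimum GPU count whose combined HBM holds one training step."""
    per_param = recipe["weights"] + recipe["grads"] + recipe["optimizer"]
    return math.ceil((num_params * per_param + recipe["acts"]) / gpu_bytes)


floors = {name: gpu_floor(recipe) for name, recipe in RECIPES.items()}
for name, count in floors.items():
    print(f"{name}: {count} H100s")
print(f"ratio: {floors['FP8 recipe'] / floors['BF16 + AdamW']:.2f}")
```

The floors come out at 135 and 76 GPUs, a ratio of about 0.56. These are theoretical minimums that assume the state can be split perfectly evenly across GPUs.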
Visualizing the Budget — and When It Breaks
The visualization below stacks the four memory terms as a single bar and draws the 80 GB H100 budget as a dashed line. Try the Llama-7B preset and notice that with mixed BF16 it just barely fits at batch 1. Switch to Llama-70B and the bar leaves the budget line behind immediately. Push the sequence slider to 32 K and watch the activation segment (rose) balloon past everything else.
Three readings to take from this. First, optimizer state (amber) is the silent dominator at the BF16 + AdamW recipe almost every shop ships: it is bigger than parameters, gradients, and activations combined for most realistic batches. That is what makes ZeRO/FSDP the highest-leverage memory intervention. Second, FP8 (the DeepSeek-V3 preset) is the only configuration that meaningfully shrinks the static cost; mixed BF16 has been the standard for five years and will not save you from the 671B wall. Third, sequence length is a one-way street: the activation term scales linearly in $S$, and long-context capability cannot be added without either pipeline parallelism or activation recomputation (or both, as DeepSeek-V3 does).
What the visualization does not show
The bar accounts for the four big tenants. In practice each GPU also reserves ~2 GB for the CUDA context, ~1 GB for NCCL communication buffers, and 5-15% for memory fragmentation. Always budget for 75% of HBM, not 100%. The bar is the theoretical floor; the practical ceiling is lower.
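To see what the headroom rule does to the GPU counts from the walkthrough, the sketch below applies both the 75% rule of thumb and an itemized estimate built from the overheads listed above. The function names are illustrative, and the itemized version assumes the pessimistic 15% fragmentation figure.

```python
import math

HBM_BYTES = 80e9  # one H100, decimal GB


def usable_bytes_rule_of_thumb(hbm_bytes=HBM_BYTES, fraction=0.75):
    """Usable HBM under the 'budget for 75%' rule."""
    return hbm_bytes * fraction


def usable_bytes_itemized(hbm_bytes=HBM_BYTES, cuda_context=2e9,
                          nccl_buffers=1e9, fragmentation=0.15):
    """Usable HBM after CUDA context, NCCL buffers, and fragmentation."""
    return hbm_bytes * (1 - fragmentation) - cuda_context - nccl_buffers


def practical_gpu_count(total_bytes, usable_bytes):
    """GPUs needed when only usable_bytes per GPU can hold training state."""
    return math.ceil(total_bytes / usable_bytes)


for label, total in [("Llama-7B (Case A)", 124.9e9), ("DeepSeek-V3 (Case C)", 10.77e12)]:
    ideal = math.ceil(total / HBM_BYTES)
    practical = practical_gpu_count(total, usable_bytes_rule_of_thumb())
    print(f"{label}: {ideal} GPUs at 100% HBM, {practical} GPUs at 75%")
```

Under the 75% rule, Case A moves from 2 GPUs to 3 and Case C from 135 to 180. The itemized estimate leaves about 65 GB usable per GPU, so the rule of thumb is slightly more conservative than the listed overheads alone.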
Plain Python: Counting Bytes Before Touching a GPU
Below is the calculation as a function — the kind of script every team should have in their bootstrapping repo. It takes no dependencies, runs in microseconds, and saves the kind of debug cycle that otherwise involves provisioning a multi-GPU node just to discover an OOM after 40 minutes of model loading.
Running this script prints:
GPT-2 124M, ctx 1k
params = 0.23 GB
grads = 0.23 GB
optim = 1.39 GB
acts = 0.21 GB
total = 2.06 GB
-> needs at least 1 x 80GB H100 to fit one step

Llama-7B, ctx 4k
params = 13.04 GB
grads = 13.04 GB
optim = 78.23 GB
acts = 12.00 GB
total = 116.31 GB
-> needs at least 2 x 80GB H100 to fit one step

DeepSeek-V3 671B, ctx 4k
params = 1249.55 GB
grads = 1249.55 GB
optim = 7497.32 GB
acts = 39.13 GB
total = 10035.55 GB
-> needs at least 126 x 80GB H100 to fit one step
Calibrate before you cluster. Run this script for every new model size you design. If the answer is > 4 GPUs, you are no longer doing data-parallel-only training and you owe your team an architecture decision (Sections 11.2 through 11.6) before anyone provisions hardware.
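A minimal sketch of a budget function in this spirit is shown below. It is an illustration, not the chapter's own listing: the names `memory_budget`, `gpus_needed`, and `report` are hypothetical, and it assumes binary gigabytes (2^30 bytes), which is what the Llama-7B figures in the printout imply (7e9 × 2 / 2^30 ≈ 13.04). Exact figures for DeepSeek-V3 depend on the parameter count and hidden size fed in, so they may differ in the last digits.

```python
import math

GIB = 2**30  # assumption: the report uses binary gigabytes


def memory_budget(n_params, layers, hidden, batch, seq_len,
                  b_p=2, b_g=2, b_o=12, b_a=2, k=12):
    """Return the four memory terms of one training step, in bytes."""
    terms = {
        "params": n_params * b_p,
        "grads": n_params * b_g,
        "optim": n_params * b_o,
        "acts": k * layers * batch * seq_len * hidden * b_a,
    }
    terms["total"] = sum(terms.values())
    return terms


def gpus_needed(total_bytes, gpu_bytes=80 * GIB):
    """Smallest GPU count whose combined memory covers total_bytes."""
    return math.ceil(total_bytes / gpu_bytes)


def report(name, **shape):
    """Print one model's budget and return it for further checks."""
    budget = memory_budget(**shape)
    print(name)
    for key, value in budget.items():
        print(f"  {key:<6} = {value / GIB:8.2f} GB")
    print(f"  -> needs at least {gpus_needed(budget['total'])} x 80GB H100 to fit one step")
    return budget


gpt2 = report("GPT-2 124M, ctx 1k", n_params=124e6, layers=12, hidden=768, batch=1, seq_len=1024)
llama = report("Llama-7B, ctx 4k", n_params=7e9, layers=32, hidden=4096, batch=1, seq_len=4096)
```

Keeping the byte widths and `k` as keyword arguments makes it easy to rerun the same budget for an FP8 recipe or a calibrated activation constant.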
PyTorch: Reading the Same Numbers from a Live Model
The formula gets you within 5-10% of the truth for the static terms; activation memory needs a real measurement because the constant 12 depends on the exact block architecture (attention variant, FFN width, recomputation policy). Below is the end-to-end pattern: count parameter/grad/optimizer bytes from the model object, then measure peak HBM during a real forward + backward to calibrate the activation constant.
On an H100 this prints something close to:
One Llama-7B-style block:
params (BF16) : 0.404 GB
grads (BF16) : 0.404 GB
AdamW (FP32) : 2.421 GB
acts peak : 0.413 GB
-> multiply by 32 layers for the full model.
Scaling by 32 layers reproduces the back-of-envelope numbers from the previous section. Two things to notice. First, the measured activation peak (0.413 GB per block) lands within about 10% of the formula prediction (12·1·4096·4096·2 / 2^30 = 0.375 GB per block), which supports $k \approx 12$ for this block shape. Second, the AdamW row is six times the parameter row even though the model is in BF16, exactly the 12-bytes-per-parameter penalty we predicted. Always measure on the actual hardware you will train on; the formula is a starting point, not the final answer.
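Once a peak activation figure has been measured (for example, the rise in `torch.cuda.max_memory_allocated()` across one forward and backward pass), the activation constant can be backed out by inverting the activation term. The sketch below is pure Python; `calibrate_k` is a hypothetical helper, and because the report does not say whether "GB" means 10^9 or 2^30 bytes, both readings are shown.

```python
def calibrate_k(peak_activation_bytes, batch, seq_len, hidden,
                bytes_per_elem=2, layers=1):
    """Invert acts = k * L * B * S * d * b_a to recover k from a measurement."""
    return peak_activation_bytes / (layers * batch * seq_len * hidden * bytes_per_elem)


MEASURED_GB = 0.413  # per-block activation peak from the printout above

# Reading the figure as decimal gigabytes (10^9 bytes).
k_decimal = calibrate_k(MEASURED_GB * 1e9, batch=1, seq_len=4096, hidden=4096)
# Reading the figure as binary gigabytes (2^30 bytes).
k_binary = calibrate_k(MEASURED_GB * 2**30, batch=1, seq_len=4096, hidden=4096)

print(f"k (decimal GB): {k_decimal:.2f}")
print(f"k (binary GB):  {k_binary:.2f}")
```

Either reading puts the constant between about 12.3 and 13.2 for this block shape, close enough to 12 for planning purposes. The calibrated value can then replace the default in any budget calculation.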
At Massive Scale: From 80 GB to 8 Terabytes
Once we accept that one GPU cannot hold a frontier model, four bottlenecks change character. None of them is fundamentally about compute. All of them are about how memory and data move between GPUs.
| Bottleneck | Where it bites | Which strategy attacks it |
|---|---|---|
| Replicated optimizer state | AdamW state grows linearly in N and dwarfs everything else. | ZeRO / FSDP sharding (Section 11.7) |
| Parameter memory > one GPU | Even BF16 weights of a 70B model exceed 80 GB. | Tensor parallelism (Section 11.3), FSDP |
| Activation memory at long context | Acts scale linearly in S; at 128 K context they dominate. | Pipeline parallelism (Section 11.4), gradient checkpointing |
| Cross-GPU communication cost | Each parallelism strategy adds a different network pattern. | DualPipe + NVLink + InfiniBand topology (Section 11.5) |
Each row of that table is one section of this chapter. The sequencing is deliberate: we start with the cheapest strategy (data parallelism, Section 11.2) because it works for any model that already fits per-GPU; we then peel off optimizer state with ZeRO; we then peel off the parameters themselves with tensor parallelism; and finally we peel off activations with pipeline parallelism, which is where DeepSeek's DualPipe algorithm (Section 11.5) does its real work by removing the bubble that naive pipeline schedules waste.
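As a preview of how sharding changes the static term, the sketch below follows the standard ZeRO accounting: stage 1 shards optimizer state, stage 2 adds gradients, and stage 3 adds the parameters themselves. The function name and the 8-GPU example are illustrative; activations, communication buffers, and the temporary gathers that stage 3 performs are ignored.

```python
def zero_static_bytes_per_gpu(num_params, dp_degree, stage, b_p=2, b_g=2, b_o=12):
    """Per-GPU static bytes under ZeRO-style sharding across dp_degree replicas.

    stage 0: nothing sharded (plain data parallelism)
    stage 1: optimizer state sharded
    stage 2: optimizer state and gradients sharded
    stage 3: optimizer state, gradients, and parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = num_params * b_p / (dp_degree if stage >= 3 else 1)
    grads = num_params * b_g / (dp_degree if stage >= 2 else 1)
    optim = num_params * b_o / (dp_degree if stage >= 1 else 1)
    return params + grads + optim


for stage in range(4):
    gb = zero_static_bytes_per_gpu(7e9, dp_degree=8, stage=stage) / 1e9
    print(f"Llama-7B, 8 GPUs, stage {stage}: {gb:.2f} GB per GPU")
```

For a 7B model on 8 GPUs the per-GPU static state falls from 112 GB unsharded to 38.5, 26.25, and 14 GB at stages 1 through 3, which shows why optimizer state is the first thing worth sharding.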
Communication is the recurring tax. Every byte you move between GPUs has to traverse NVLink (within a node, ~900 GB/s) or InfiniBand (between nodes, ~50 GB/s). At 671B parameters, gradient all-reduce per step is over a terabyte of traffic; the difference between a 50% MFU and a 35% MFU is almost always how cleverly that traffic is overlapped with compute. Memory forces you off one GPU; bandwidth decides whether the alternative is fast.
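A rough sense of the bandwidth tax comes from a bandwidth-only estimate of a ring all-reduce, in which each GPU sends and receives about 2(p − 1)/p times the payload. The sketch below uses the link speeds quoted above and treats the full BF16 gradient of a dense 671B model as the payload. That is a deliberately naive assumption: real runs shard, bucket, and overlap this traffic, and the world sizes are chosen only for illustration.

```python
def ring_allreduce_seconds(payload_bytes, world_size, link_bytes_per_s):
    """Bandwidth-only lower bound for a ring all-reduce.

    Each GPU moves 2 * (p - 1) / p times the payload; latency and any
    overlap with compute are ignored.
    """
    if world_size < 2:
        return 0.0
    traffic = 2 * (world_size - 1) / world_size * payload_bytes
    return traffic / link_bytes_per_s


GRAD_BYTES = 671e9 * 2   # BF16 gradients for 671B parameters, about 1.34 TB
NVLINK = 900e9           # bytes/s within a node, figure quoted above
INFINIBAND = 50e9        # bytes/s between nodes, figure quoted above

intra_node = ring_allreduce_seconds(GRAD_BYTES, world_size=8, link_bytes_per_s=NVLINK)
inter_node = ring_allreduce_seconds(GRAD_BYTES, world_size=128, link_bytes_per_s=INFINIBAND)
print(f"8 GPUs over NVLink:       {intra_node:.1f} s")
print(f"128 GPUs over InfiniBand: {inter_node:.1f} s")
```

Under these assumptions the same payload costs a few seconds over NVLink and close to a minute over InfiniBand, which is the gap that overlap and topology-aware scheduling are meant to hide.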
Engineering Reality: Four Levers, One Chapter
At frontier scale every shop ends up combining several parallelism strategies — pure data parallelism died with models past 1B parameters, and pure pipeline parallelism wastes too much compute. DeepSeek-V3's production recipe layers them like this:
- Data parallelism across the outermost dimension — every replica trains on a different mini-batch shard, gradient all-reduce at the end of each step (Section 11.2).
- FSDP / ZeRO-3 shards the model parameters and optimizer state across the data-parallel replicas. No replica ever holds the full model at once — they gather it on demand (Section 11.7).
- Pipeline parallelism splits the 61 layers across GPUs in the depth dimension. DeepSeek's DualPipe algorithm (Section 11.5) overlaps forward and backward microbatches so no GPU is ever idle waiting for a peer.
- Expert parallelism — DeepSeek-V3 is a mixture-of-experts model, so the experts themselves are sharded across GPUs and routed to per token via cross-node all-to-all (Section 11.6).
Notably absent: tensor parallelism. DeepSeek made a deliberate decision to skip TP, because the combination of FSDP + DualPipe + expert parallelism already squeezed the model down to a per-GPU footprint that fits — and TP's all-reduce traffic was deemed too expensive given their InfiniBand budget (Section 11.7 explains the trade). This is the level of engineering judgment you are working toward: knowing which parallelism levers to pull, and which to leave alone, for a specific model and a specific cluster.
Everything in this chapter exists because of the inequality you worked through in this section. Memorize it, run the budget script before every training run, and the rest of the chapter will land on prepared ground.