The Problem: Tensor Parallelism Was Supposed to Be Mandatory
Nearly every other large-model recipe in the open record (Megatron-LM, GPT-NeoX, Llama-3 405B) reaches the same conclusion at the same point in the design: the model is too fat to fit on one GPU, so you slice each weight matrix across the GPUs of a single node and let an all_reduce glue the partial results back together. That technique is tensor parallelism (TP), and it is what makes a single transformer layer survive an 80 GB GPU's memory budget.
TP is also expensive in a way that nothing else in the stack is. Each layer of forward and backward fires two all_reduce collectives in the critical path. They are blocking — no kernel inside the layer can start until every rank has its slice of the previous result. They eat NVLink bandwidth that DualPipe was supposed to give back to compute. And because they happen inside the layer, they break the symmetry that lets pipeline schedulers overlap forward and backward cleanly.
DeepSeek-V3 refused to pay that bill. The official report says it plainly: "In order to reduce the memory footprint during training, we employ the following techniques. […] Notably, DeepSeek-V3 does not use tensor parallelism during training." That decision is the load-bearing one for the entire parallelism stack — it is what lets DualPipe schedule cleanly, what lets the all-to-all collectives from expert parallelism share the network with nothing else, and what makes the cluster topology drawable on one page.
Intuition: Trade Three Cheap Tricks for One Expensive One
Tensor parallelism solves a memory problem with a communication tool. That is a category error when communication is already your scarce resource. The right question is: which parts of the per-GPU budget are easiest to shrink without talking to other GPUs?
Three answers fall out of the budget once you write it down:
- Activations are full of cheap-to-recompute waste. The fattest activations in an MLA block — the RMSNorm output and the up-projection — are also the cheapest to redo. Don't save them; rebuild them on the backward pass.
- The EMA isn't hot. The exponential moving average copy of the weights is only touched once per step, and the update is embarrassingly parallel with the next forward. It does not need to live in HBM; it can live in pinned host RAM and be updated asynchronously.
- The embedding and the LM head are the same tensor. Tie their weights and put them on the same pipeline stage, and their matrix — about 1.8 GB in BF16 for V3 — costs you once instead of twice.
The Memory Bookkeeping: What Lives on a GPU
Before we apply any tricks, let us write down where the bytes actually go. For a model with N total parameters running under P pipeline stages and ZeRO-1 across a data-parallel degree D, the per-GPU memory at one training step breaks into six buckets:
| Bucket | Size (bytes) | Why it is there |
|---|---|---|
| Weights (BF16) | 2N / P | One pipeline stage holds N/P parameters in working precision. |
| Gradients (BF16) | 2N / P | Same shape as weights; produced by the backward pass. |
| Optimizer state (FP32) | 12N / (P·D) | AdamW master copy + two FP32 moments. Sharded by ZeRO-1 across the DP group. |
| EMA (FP32) | 4N / (P·D) | Second FP32 copy of every weight, kept for evaluation and final-checkpoint smoothing. |
| Activations (BF16) | ≈ 12 · L · B · S · d | Saved between forward and backward. Scales with sequence length and batch. |
| Embed + LM head (BF16) | 2 · V · d (or 4 · V · d untied) | Vocabulary embedding and output projection — share or duplicate the same V × d matrix. |
Four of these six are dictated by physics: you cannot avoid weights, gradients, the optimizer state, or activations. But the EMA and the embed/head are discretionary — the framework chose to put them on the GPU because that was the easiest place to put them. The tricks in this section take each discretionary line and ask: does this really need to be here?
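The six formulas in the table can be evaluated directly. The sketch below is illustrative only: the function name and its flags are hypothetical, L is read as layers per pipeline stage, V = 128,000 is an assumed reading of a "128K" vocabulary, and the formulas are applied literally with no adjustment for MoE sparsity, so the output is what the table implies and not a measured figure.

```python
def per_gpu_budget_gb(N, P, D, L, B, S, d, V, ema_on_gpu=True, tied=False):
    """Evaluate the table's six formulas; results are in GB (1e9 bytes).

    Hypothetical helper: names and flags are illustrative, not from any library.
    """
    GB = 1e9
    return {
        "weights": 2 * N / P / GB,                    # BF16 working copy
        "gradients": 2 * N / P / GB,                  # same shape as weights
        "optimizer": 12 * N / (P * D) / GB,           # FP32 master + 2 moments, ZeRO-1
        "ema": (4 * N / (P * D) / GB) if ema_on_gpu else 0.0,
        "activations": 12 * L * B * S * d / GB,       # table's approximation
        "embed_head": (2 if tied else 4) * V * d / GB,
    }


# Assumed inputs: 671B params, 16 PP stages, 64-way DP, 4 layers per stage,
# micro-batch 1, sequence 4096, width 7168, vocabulary 128,000.
cfg = dict(N=671e9, P=16, D=64, L=4, B=1, S=4096, d=7168, V=128_000)
baseline = per_gpu_budget_gb(**cfg)
with_tricks = per_gpu_budget_gb(**cfg, ema_on_gpu=False, tied=True)

for name in baseline:
    print(f"{name:12s} {baseline[name]:8.2f} GB -> {with_tricks[name]:8.2f} GB")
```

Only the two discretionary rows change between the two calls, which is the point of the paragraph above: the EMA and the embed/head are the lines a framework is free to move.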
Trick 1: Recompute RMSNorm and the MLA Up-Projection
Standard PyTorch autograd saves every intermediate it thinks the backward pass might need. For a transformer block that means at least four wide tensors per layer: the input x, the RMSNorm output, the up-projection result, and the block output a. Three of them have the model width d; the up-projection result is wider. In a 7168-wide model that single tensor is 4× the size of all the others combined.
Selective recomputation throws away the RMSNorm output and the up-projection result. On the backward pass, the framework redoes the RMSNorm and the up-projection, both of which are fast: RMSNorm is one fused kernel, the up-projection is a single GEMM. The boundary tensors x and a stay on the tape because the next block needs them and there is no shortcut to recompute attention cheaply.
The math behind the cost is the punchline. If the layer's forward pass costs F FLOPs, the up-projection and the RMSNorm together account for only a small share of F. Running those two operations twice adds that share once more to the recomputed block, while the backward of that same block costs roughly 2F, so the extra is a small fraction of the block's end-to-end FLOPs. Amortised across the whole forward+backward step of the network, the overhead stays well below that of full activation checkpointing, which reruns the entire forward pass.
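The overhead argument can be written symbolically. This is a sketch under two assumptions that are not taken from the DeepSeek-V3 report: the backward pass costs about twice the forward pass, and $\rho$ (a symbol introduced here for illustration) is the fraction of a block's forward FLOPs $F$ spent in the operations that get recomputed.

$$
\begin{aligned}
C_{\text{baseline}} &= F + 2F = 3F,\\
C_{\text{recompute}} &= F + 2F + \rho F = (3 + \rho)\,F,\\
\text{overhead} &= \frac{C_{\text{recompute}} - C_{\text{baseline}}}{C_{\text{baseline}}} = \frac{\rho}{3}.
\end{aligned}
$$

Setting $\rho = 1$ recovers full activation checkpointing, where the whole forward is rerun and the step costs one third more. Selective recomputation only pays for the cheap operations, so $\rho$ stays well below 1 and the overhead shrinks in proportion.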
Why these two operations specifically? Because they have the largest memory-to-FLOP ratio in the block. RMSNorm produces a width-d tensor with O(d) FLOPs per token, which is almost pure memory traffic. The up-projection produces the widest tensor in the block from a single GEMM. Compare to attention: its FLOPs grow quadratically with sequence length, and its saved activations are small relative to that compute. So you save attention's output and you recompute everything that is fat-but-cheap.
The animation makes the bookkeeping concrete. Watch the naive column accumulate four glowing tensors during forward, and the checkpoint column accumulate only two. When the backward arrives, the checkpoint column lights up amber where it has to redo work: first the up-projection (because the next backward op wants the up-projection result), and then the RMSNorm (because the up-projection itself wants the RMSNorm output). Both columns finish at the same final answer.
Trick 2: Stream the EMA to CPU Memory
Every large training run keeps an exponential moving average of the parameters: a second FP32 copy that updates with the rule θ_EMA ← β · θ_EMA + (1 − β) · θ after every optimizer step, where β is the decay. EMA weights are smoother than raw weights and produce better evaluation losses; you also typically ship the EMA as the final checkpoint.
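As a minimal illustration of the update, the sketch below applies the standard EMA blend to a toy weight vector. The function name, the decay value 0.9, and the fake "optimizer step" are all placeholders chosen for the example; the decay used for DeepSeek-V3 is not stated in this section.

```python
def ema_update(shadow, weights, beta):
    """In-place EMA step: shadow <- beta * shadow + (1 - beta) * weights."""
    for i, w in enumerate(weights):
        shadow[i] = beta * shadow[i] + (1.0 - beta) * w
    return shadow


weights = [1.0, -2.0, 0.5]
shadow = list(weights)                      # start the shadow copy at the weights

for step in range(3):
    weights = [w + 0.1 for w in weights]    # stand-in for an optimizer step
    ema_update(shadow, weights, beta=0.9)   # one blend per optimizer step

print("weights:", weights)
print("shadow: ", shadow)
```

The shadow copy trails the raw weights and moves only a fraction of the way toward them on each step, which is where the smoothing comes from.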
At 671 B parameters, an FP32 EMA is about 2.68 TB of state. Even sharded across a DP-degree of 64 it still costs ~42 GB per GPU, over half the H800 budget by itself. That is far too much for a buffer that gets touched once per step.
DeepSeek pushes the EMA off the GPU entirely. The shadow copy lives in pinned host RAM: page-locked memory that the GPU can DMA to and from directly, without an intermediate staging copy. After each optimizer step, a side CUDA stream copies the new BF16 weights to the host, upcasts to FP32, and blends them into the shadow copy with the EMA update rule. The whole transfer is async and overlaps with the next forward pass, so the model never waits for it.
.to('cpu', non_blocking=True) only behaves asynchronously when the destination is pinned. With pageable host memory the runtime falls back to a synchronous staging copy, which would block the GPU until the EMA finished, killing the overlap and burning the whole optimization.
Trick 3: Co-locate and Share the Embedding and Output Head
The input embedding maps a token id in {0, …, V − 1} to a width-d vector. The output head (the "LM head") maps a width-d vector back to a distribution over the same V tokens. Both are V × d matrices. For V3 with V ≈ 128K and d = 7168, that is about 1.83 GB of BF16 weights per matrix.
Under pipeline parallelism, the embedding lives on the first pipeline stage and the LM head lives on the last. If you treat them as separate parameters, you pay twice. DeepSeek-V3 makes two coordinated choices:
- Tie the weights. The LM head reuses the embedding matrix transposed. One parameter, two roles — the standard weight-sharing trick that goes back to Press & Wolf (2016).
- Co-locate them on the same PP stage. The DualPipe schedule places the embedding and the head on the same rank, so the tied weight lives on exactly one GPU and no extra all-reduce is needed to keep two copies in sync.
The first choice saves 1.83 GB on the LM-head rank. The second choice saves the gradient sync that an untied head would otherwise require. Together they remove almost 4 GB from the critical-path rank's budget, enough on its own to bring the rank that holds the deepest activations back under the 80 GB line.
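The sizes quoted above follow from the matrix shape. The sketch below is a back-of-the-envelope check under assumptions that are mine and not the article's: V = 128,000 as the reading of "128K", BF16 storage at 2 bytes per element, and the "almost 4 GB" figure interpreted as one duplicate weight matrix plus its BF16 gradient buffer.

```python
def matrix_gb(vocab, hidden, bytes_per_elem=2):
    """Size of one vocab x hidden matrix in GB (1e9 bytes)."""
    return vocab * hidden * bytes_per_elem / 1e9


V, d = 128_000, 7168                 # assumed vocabulary size and hidden width

one_copy = matrix_gb(V, d)           # embedding or LM head alone, BF16
untied_weights = 2 * one_copy        # separate embedding and head
tied_weights = one_copy              # one parameter playing both roles

# Assumed interpretation: the duplicate also carries a BF16 gradient buffer.
saved_with_grads = 2 * one_copy

print(f"one matrix:        {one_copy:.3f} GB")
print(f"untied weights:    {untied_weights:.3f} GB")
print(f"tied weights:      {tied_weights:.3f} GB")
print(f"saved incl. grads: {saved_with_grads:.3f} GB")
```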
Manual Numerical Walkthrough: 671B Without TP
Plug the DeepSeek-V3 numbers in — one bucket at a time
Setup. 671 B total parameters, 16 PP stages, 64-way data parallelism, sequence length 4096, micro-batch 1, hidden width 7168, 61 transformer layers, 128K vocabulary.
Weights (BF16). 2N / P bytes spread across the stage, but the stage holds only about 4 layers, and only the active parameters of an MoE layer count for working memory because the routed experts that did not fire for this token do not need to be in fast cache. The effective working set per GPU is therefore much closer to the footprint of the active parameters, plus a thin slice of routed experts.
Gradients (BF16). Same shape, same number: the per-GPU figure matches the one for the active weights.
Optimizer state (ZeRO-1). 12N / (P·D) bytes per GPU. ZeRO-1 shards it across the 64-way DP group, so each rank keeps only 1/64th of the full 12-bytes-per-param state.
EMA (Trick 2). Without Trick 2: the FP32 shadow copy sits in HBM on every GPU. With Trick 2: 0 GB (it lives on the host).
Activations. Approximately 12 · L · B · S · d bytes for one micro-batch, where L = 4 is the average layer count per stage. Without recomputation (Trick 1) the stage keeps all of it. With recomputation it keeps about 30% less (the saving from dropping the wide intermediates).
Embed + head (Trick 3). Without sharing: about 1.83 GB on the embed rank and another 1.83 GB on the head rank. With sharing: about 1.83 GB, only on the co-located rank.
Sum on the heaviest rank (the embed+head one). Without the tricks, the buckets add up to more than the 80 GB budget allows; with all three tricks, the sum fits under it.
That is the answer. The total per H800 leaves headroom for NCCL buffers, micro-batch staging, attention KV during warm-up, and the inevitable allocator fragmentation, with zero tensor parallelism in the picture.
Visualizing the Budget
The walkthrough above is one configuration. The simulator below lets you sweep the knobs — PP stages, sequence length, micro-batch — and switch each trick on or off independently. The dashed line is the 80 GB H800 limit; the goal of the whole exercise is to walk the second bar comfortably under it.
Two things to notice as you experiment. First, the embed/head saving (Trick 3) is the biggest individual jump on the most heavily-loaded rank, but it does not change the average budget much — it is a localized, rank-specific win. Second, recomputation (Trick 1) is the saving that scales with sequence length: pull the sequence slider to 32K and the activation bar grows linearly; toggle recomputation on and the bar drops by ~30% no matter where the slider is.
Plain Python: Manual Save-and-Redo for One Block
Stripped of autograd, what does selective recomputation actually look like? The code below is the smallest honest implementation: a forward function that explicitly chooses what to put on the tape, and a backward function that explicitly recomputes whatever is missing. Two modes, identical math, different memory peaks — and a final assertion that the gradients match.
The crucial line is the dict-build inside block_forward: that single conditional is the entire policy. Naive autograd is the "always include everything" case; selective recompute is the case where the framework picks which boundary tensors to keep and trusts the backward to rebuild the rest. PyTorch's torch.utils.checkpoint wraps exactly this logic into a single function call: same idea, with the recomputation driven by autograd automatically.
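A sketch of what such a two-mode block can look like in plain Python follows. It is a reconstruction for illustration and not the article's original listing: the toy block (RMSNorm, an up-projection, a down-projection standing in for the rest of the layer, and a residual), the helper names, and the tiny matrices are all assumptions.

```python
import math


def rms(x, eps=1e-6):
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)


def rmsnorm(x):
    r = rms(x)
    return [v / r for v in x]


def matvec(M, v):
    return [sum(m * t for m, t in zip(row, v)) for row in M]


def block_forward(x, W_up, W_down, recompute):
    """Toy block: a = x + W_down @ (W_up @ rmsnorm(x))."""
    n = rmsnorm(x)
    u = matvec(W_up, n)
    a = [xi + yi for xi, yi in zip(x, matvec(W_down, u))]
    # The whole policy: which tensors go on the tape.
    tape = {"x": x, "a": a} if recompute else {"x": x, "n": n, "u": u, "a": a}
    return a, tape


def block_backward(tape, grad_a, W_up, W_down):
    x = tape["x"]
    n = tape["n"] if "n" in tape else rmsnorm(x)        # redo RMSNorm if dropped
    u = tape["u"] if "u" in tape else matvec(W_up, n)   # redo up-projection if dropped
    # a = x + W_down @ u
    grad_u = [sum(W_down[i][j] * grad_a[i] for i in range(len(grad_a)))
              for j in range(len(u))]
    grad_W_down = [[g * uj for uj in u] for g in grad_a]
    # u = W_up @ n
    grad_n = [sum(W_up[j][k] * grad_u[j] for j in range(len(u)))
              for k in range(len(n))]
    grad_W_up = [[g * nk for nk in n] for g in grad_u]
    # n = x / rms(x), plus the residual path straight from a to x
    r = rms(x)
    dot = sum(g * nk for g, nk in zip(grad_n, n)) / len(x)
    grad_x = [ga + (gn - nk * dot) / r for ga, gn, nk in zip(grad_a, grad_n, n)]
    return grad_x, grad_W_up, grad_W_down


x = [0.5, -1.0, 2.0]
W_up = [[0.1, 0.2, -0.3], [0.0, 0.4, 0.1], [-0.2, 0.1, 0.3], [0.5, -0.1, 0.2]]
W_down = [[0.2, -0.1, 0.0, 0.3], [0.1, 0.1, -0.2, 0.0], [-0.3, 0.2, 0.1, 0.1]]
grad_a = [1.0, -0.5, 0.25]

results = {}
for recompute in (False, True):
    a, tape = block_forward(x, W_up, W_down, recompute)
    grads = block_backward(tape, grad_a, W_up, W_down)
    results[recompute] = (sum(len(t) for t in tape.values()), grads)
    print(f"recompute={recompute}: {results[recompute][0]} floats on the tape")

# Two modes, identical math: the gradients must agree exactly.
assert results[False][1] == results[True][1]
```

The recompute path rebuilds the dropped tensors with the same functions and the same inputs as the forward pass, which is why the final comparison can be exact and needs no tolerance.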
PyTorch: torch.utils.checkpoint and a Pinned-Host EMA
In production the two tricks become a one-line wrap and a ~20-line helper class. The wrap is the "dropping intermediates" story; the helper class is the "EMA on CPU" story. Trick 3 (shared embed/head) lives in the pipeline builder, which we touched on in the DualPipe section and will not duplicate here.
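A hedged sketch of both pieces follows. Block and HostEMA are hypothetical names, the toy block is not DeepSeek's MLA layer, and the EMA helper uses a deliberately simple design: update() starts the device-to-host copy on a side stream, and flush() waits for it and blends on the CPU. The code falls back to plain synchronous copies when no GPU is present.

```python
import contextlib

import torch
from torch.utils.checkpoint import checkpoint


class Block(torch.nn.Module):
    """Toy block: only RMSNorm + up-projection sit inside the checkpoint."""

    def __init__(self, d, m):
        super().__init__()
        self.norm_weight = torch.nn.Parameter(torch.ones(d))
        self.up = torch.nn.Linear(d, m, bias=False)
        self.down = torch.nn.Linear(m, d, bias=False)

    def _cheap(self, x):
        n = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6) * self.norm_weight
        return self.up(n)

    def forward(self, x):
        u = checkpoint(self._cheap, x, use_reentrant=False)  # the one-line wrap
        return x + self.down(u)


class HostEMA:
    """FP32 shadow weights in host RAM (illustrative helper, not a library API)."""

    def __init__(self, model, beta=0.999):
        self.beta = beta
        self.params = list(model.parameters())
        on_gpu = self.params[0].is_cuda
        self.stream = torch.cuda.Stream() if on_gpu else None
        # Pinned buffers are what allow the copy to run asynchronously.
        self.staging = [torch.empty(p.shape, dtype=p.dtype, pin_memory=on_gpu)
                        for p in self.params]
        self.shadow = [torch.empty(p.shape, dtype=torch.float32, pin_memory=on_gpu)
                       for p in self.params]
        for s, p in zip(self.shadow, self.params):
            s.copy_(p.detach())
        self.pending = False

    @torch.no_grad()
    def update(self):
        """Call right after optimizer.step(): start copying weights to the host."""
        if self.stream is not None:
            self.stream.wait_stream(torch.cuda.current_stream())
            ctx = torch.cuda.stream(self.stream)
        else:
            ctx = contextlib.nullcontext()
        with ctx:
            for buf, p in zip(self.staging, self.params):
                buf.copy_(p, non_blocking=True)
        self.pending = True

    @torch.no_grad()
    def flush(self):
        """Call before the next optimizer.step() and before reading the shadow."""
        if not self.pending:
            return
        if self.stream is not None:
            self.stream.synchronize()        # the copy must land before we blend
        for s, buf in zip(self.shadow, self.staging):
            s.mul_(self.beta).add_(buf.float(), alpha=1.0 - self.beta)
        self.pending = False


device = "cuda" if torch.cuda.is_available() else "cpu"
model = Block(d=16, m=64).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
ema = HostEMA(model, beta=0.9)

for _ in range(3):
    loss = model(torch.randn(4, 16, device=device)).pow(2).mean()
    loss.backward()
    ema.flush()          # previous copy is done before the weights change again
    opt.step()
    opt.zero_grad()
    ema.update()         # overlaps with the next forward and backward
ema.flush()
```

Splitting the work into update() and flush() keeps the ordering explicit: the weights are never overwritten while a copy is in flight, and the shadow is never read while a blend is outstanding. A production version would likely move the CPU blend off the main Python thread so that kernel launches are not delayed.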
Production note. use_reentrant=False is non-negotiable in 2025: the legacy reentrant mode has documented limitations (it does not support torch.autograd.grad, and it needs at least one input and one output that require gradients), and the PyTorch documentation recommends the non-reentrant variant and asks for the flag to be passed explicitly. If you inherit a code base that still uses the reentrant API, switching is a one-flag change and it's worth doing before you debug anything else.
What Changes at 671B Parameters and 14.8T Tokens
Everything in this section scales, but the way each trick scales is different and worth tracking separately.
| Trick | Scaling axis | How it scales | Limit |
|---|---|---|---|
| Recompute RMSNorm + MLA-up | Sequence length × batch × layers per stage | Linear: doubling sequence length doubles the saving in absolute GB. | Bound below by the size of the boundary tensors (x and a); you cannot drop them. |
| EMA on CPU | Total parameters | Linear in N: every extra billion params adds 4 GB to the savings. | Bound by host RAM and PCIe/NVLink-C2C bandwidth; at extreme N you run out of CPU memory before HBM. |
| Shared embed/head | Vocabulary × hidden | Constant in N (once you have one of each) but scales with V·d. Bigger tokenizers, bigger savings. | Bound by the architectural choice — some recipes deliberately untie the head for quality reasons. |
The numbers compound. On a 64-node, 512-GPU cluster, the recomputation saving is reclaimed on every GPU, so the per-GPU figure multiplies by 512 cluster-wide. The EMA-to-CPU saving moves about 2.68 TB of FP32 state off the accelerators completely. The shared head reclaims one more V × d matrix on the two stages that used to duplicate it.
The most important number is the one that doesn't appear in the table: zero new collectives. None of these tricks introduces an all_reduce, an all_gather, or a barrier. The activation recomputation is a per-GPU forward pass. The EMA copy is a per-GPU host transfer. The head sharing is a static rebinding. The communication graph that DualPipe scheduled in the previous section stays untouched.
Engineering Reality: When the Tricks Bite Back
Three places where the trio is not free:
Recomputation breaks the "just look at the activation" debug pattern
Once an intermediate is recomputed instead of saved, any tooling that wanted to peek at it (gradient inspectors, activation loggers, debugger hooks) sees something subtler than it expects. Hooks fire twice for tensors inside the checkpoint region (once on the discarded forward, once on the recomputed forward). The fix is either to register hooks outside the checkpoint, or to make the checkpoint boundaries explicit (for example with checkpoint_sequential and a known segment list) so you know which tensors are actually kept.
The async EMA can hide bugs until eval time
Because the EMA update is fire-and-forget on a side stream, nothing checks the value of the shadow tensor until you load it for evaluation — potentially hundreds of steps later. A common failure mode is forgetting to torch.cuda.synchronize() before reading the EMA back for a mid-training eval, which can race with the last in-flight update. Always synchronize the side stream explicitly before serializing the EMA.
Tied weights mean tied learning rates
If the embedding and the output head share parameters, they share gradients, and they share whatever optimizer state goes with those gradients. Recipes that want a different learning rate on the LM head than on the embedding cannot tie them — the saving disappears, and you pay the 1.8 GB. DeepSeek-V3 deliberately uses one optimizer group for the shared matrix, which is why the saving holds.