A single H800 has 80 GB of HBM. A 70 B-parameter dense model in BF16 needs 140 GB just for the weights. Data parallelism cannot help — every replica still has to hold the whole model. The fix is to cut the model itself in half, then in quarters, then in eighths, distributing the matmul across GPUs and stitching the answer back together with a single collective per layer. That is tensor parallelism. The previous section showed how to scale by replicating; this section shows how to scale by splitting.
The Real Problem: One Layer No Longer Fits
Data parallelism (Section 11.2) gets you more throughput by running the same model on many GPUs with different mini-batches, then averaging gradients. It works beautifully — until the model itself stops fitting on a single GPU. For LLaMA-70B, the full forward+backward state in BF16 with FP32 master weights and Adam moments costs roughly — 14× larger than an 80 GB H800. No amount of data parallelism rescues you. The model has to be split.
There are three ways to split it:
- Pipeline parallelism (PP) — give different layers to different GPUs. Easy memory math, but introduces pipeline bubbles: when GPU 0 is on micro-batch 1, GPUs 1-7 are idle. We cover this in Section 11.4.
- Tensor parallelism (TP) — give a slice of each layer to every GPU. Every GPU is busy on every layer, but every layer pays a collective communication tax. The topic of this section.
- Expert parallelism (EP) — give different MoE experts to different GPUs. Specific to mixture-of-experts; Section 11.6.
Tensor parallelism is the most surgical of the three. It does not change what the model computes — it changes where each piece of the computation lives. Forward and backward of a TP-sharded layer are mathematically identical to the unsharded layer, up to floating-point rounding. The cost is a collective communication operation per layer.
Intuition: Split the Matmul, Not the Data
Imagine you are computing where is a small matrix of inputs and is enormous — too big to hold on one machine. You have two coworkers. Two ways to share the work:
- Cut W into vertical strips (columns). You take the left half of W's columns, your coworker takes the right half. Each of you multiplies the full X by your strip. You get the left half of Y, your coworker gets the right half. No talking required — you each independently computed a piece of the answer. This is column-parallel.
- Cut W into horizontal strips (rows). Now X also has to be split: you take the left columns of X (matching W's top rows), your coworker takes the right columns. You each compute something the FULL shape of Y, but each result is a partial — it only contains the contribution from your rows. To get the real Y, you have to add the partials together. This is row-parallel, and the addition step is the all-reduce.
Neither approach alone is enough. Column-parallel leaves you with a sharded output; row-parallel demands a sharded input. The Megatron-LM insight is to compose them: a column-parallel layer feeds its sharded output directly into a row-parallel layer, which all-reduces at the end. The composition produces a replicated output from a replicated input — drop-in replacement for an unsharded MLP — with only ONE collective per pair.
The Math: Column-Parallel and Row-Parallel
Let be the weight matrix. Partition the inner dimension into equal chunks across TP ranks. We write where each block lives on rank .
Column-parallel
The forward pass on rank is:
with replicated. The output is sharded along the inner dimension, no communication required. The full unsharded output exists implicitly across the ranks, but is never materialised.
Row-parallel
Now suppose and we partition the row dimension : . The input must already be sharded along — which is exactly what column-parallel produced — so on rank we have . The local matmul is:
Each has the FULL output shape but is a partial — only the contribution from rank 's slice of . The actual output is the sum:
Communication volume
A ring all-reduce of a tensor of size elements moves roughly elements through each rank's links. The total per-layer communication volume (attention output + MLP down-proj, both row-parallel) is:
The factor of approaches 1 as TP degree grows — so doubling TP from 4 to 8 only increases per-layer comm volume by about 14%. But it halves the local compute, so the ratio roughly doubles. That ratio is the crucial number: when it climbs above ~0.3 (comm time is 30% of compute time), tensor cores start starving and wall-clock throughput collapses.
Manual Numerical Walkthrough
We walk through a 2-rank TP forward pass on a 2×4-wide MLP, by hand. Tiny enough to verify every number; structurally identical to a 70 B-model layer.
Click to expand: TP=2 MLP forward pass by hand
Setup. One token, hidden dimension , FFN inner dimension , TP degree . Ignore the activation for clarity.
Tensors. Input (shape ). The full unsharded weights are:
(shape ), (shape ).
Reference output. First , then . So .
TP=2: rank 0. Owns the first two columns of and the first two rows of :
Column-parallel: . Row-parallel: .
TP=2: rank 1. Owns the last two columns of and the last two rows of :
Column-parallel: . Row-parallel: .
All-reduce. — exactly the reference. Mathematical equivalence verified.
Memory bookkeeping. Rank 0 stores 4 floats of and 4 of — 8 total, vs 16 unsharded. Two-rank TP halves the weight footprint, as expected. The all-reduce moves floats between the two ranks — the constant comm tax for using TP.
What scales. Replace with 8 192, with 28 672, with 8, and with 8 192 (a typical micro-batch × sequence). The algorithm is line-for-line the same. The weight shard shrinks from to elements per rank, the all-reduce moves floats per layer per rank, twice per layer (attention + MLP).
Visualizing Sharding and Communication Cost
Slide TP degree, hidden dim, FFN inner dim, batch×seq, and NVLink bandwidth. The diagram redraws the matmul split; the stat cards retally per-GPU memory, FLOPs, and all-reduce volume. Watch the bottom callout — comm/compute ratio is the number that decides whether TP is profitable.
Plain Python: TP From Scratch with NumPy
Before we bring in NCCL and CUDA, prove that the algebra works on a single CPU process. Below, we simulate TP=4 by literally slicing the weight matrices into 4 shards, computing each rank's partial forward pass in a loop, and reconstructing the answer with a sum. The output has to match the unsharded reference exactly — if it doesn't, your sharding is wrong long before you spend a minute on distributed debugging.
PyTorch: Megatron-Style TP with all_reduce
Now on real silicon. Same algorithm, two custom classes — and — wired together with one per layer. This is a stripped-down copy of what Megatron-LM, NVIDIA TransformerEngine, and DeepSeek's in-house kernels do at much greater scale.
At Massive Scale: Why TP Almost Always Stops at 8
The recurring number in frontier-training papers — Megatron-Turing 530B, LLaMA-70B, GPT-3, DeepSeek-V3 — is TP = 8. Not 16, not 4. Three physical constraints conspire to make 8 the sweet spot.
- NVLink lives inside one node. A DGX-class node has 8 GPUs connected by NVLink/NVSwitch at 600+ GB/s. Cross-node links (InfiniBand or RoCE) are 50-200 GB/s — 3-12× slower. TP's all-reduce runs every layer, twice. At BF16, a 70B model at TP=8 moves ~270 MB per layer per all-reduce; on NVLink that is ~0.45 ms, on InfiniBand it would be ~5 ms — and you have 80 layers. TP that crosses node boundaries is dead on arrival.
- Comm volume per layer is fixed in the batch dimension; compute is not. The per-layer all-reduce is ; the per-layer FLOPs are . The ratio is . As you increase TP from 1 to 8, each rank's compute drops by 8× but comm stays the same; ratio rises 8×. Past TP=8 on a single node, the ratio enters the danger zone where tensor cores idle on comm.
- TP=8 happens to match the head count. Attention heads (e.g. 64 for LLaMA-70B, 128 for DeepSeek-V3) are evenly divisible by 8, so column-parallel QKV gives each rank an integer number of heads. Going to TP=16 would require splitting individual heads — possible, but messier and rarely worth it.
The 3D stack: DP × TP × PP
TP=8 alone gets you a 70B model running on 8 GPUs; the real run uses hundreds or thousands. The standard composition is:
For a 2048-GPU DeepSeek-V3 run on H800s, a typical configuration is (DeepSeek famously avoids TP — see Section 11.7), , , . For a dense LLaMA-style 70B, , , on 2048 GPUs is canonical. TP carves up the layer, PP carves up the layer stack, DP scales throughput. Each axis has a different cost structure; the engineering is in finding the configuration that minimises wall-clock for the target FLOP budget.
| Parallelism | What it splits | Comm pattern | Bandwidth need |
|---|---|---|---|
| Data (DP) | mini-batch | all-reduce on gradients, once per step | low (per step, not per layer) |
| Tensor (TP) | weights along inner dim | all-reduce on activations, twice per layer | very high (NVLink only) |
| Pipeline (PP) | layer stack | point-to-point activation send/recv at stage boundaries | moderate |
| Expert (EP) | MoE experts | all-to-all on token routing, twice per MoE layer | high (cross-node ok) |
Engineering Reality: Costs, Failure Modes, and the 3D Stack
TP is mature. Megatron-LM has shipped it since 2019, every frontier lab uses some flavour of it, and the abstractions (ColumnParallel / RowParallel) are stable. That does not make it free.
- Hardware lock-in to fast intra-node fabric. TP assumes NVLink-class bandwidth between every pair of TP ranks. On consumer GPUs (no NVLink) or PCIe-only nodes, TP throughput falls off a cliff — you are better off with PP or with model offloading (ZeRO). DeepSeek-V3's decision to avoid TP entirely (Section 11.7) is partly because their H800s have NVLink throttled to ~400 GB/s vs H100's 900 GB/s — the comm/compute ratio tipped just far enough to prefer ZeRO-style memory sharding.
- Subtle bugs in the gradient direction. The forward pass of a TP layer is intuitive; the backward pass requires careful insertion of all-reduce in places you would not expect. The gradient of a column-parallel layer needs an all-reduce on the input gradient before being passed to the previous layer; forget this and your gradients become biased and training slowly diverges over thousands of steps. Megatron-LM's trick (identity in forward, all-reduce in backward; all-reduce in forward, identity in backward) automates this, but only inside the framework — custom TP kernels routinely ship with this bug.
- RNG and dropout coordination. Activations are replicated outside TP-sharded ops, sharded inside. Dropout needs the SAME mask on every rank in the replicated regions, but a DIFFERENT mask per shard in the sharded regions. Get this wrong and you double-dropout some neurons and never drop others — another silent failure mode that only shows up as a slowly worse loss curve.
- Activation memory is sharded inside the block but replicated at the boundary. The sharded intermediate between column-parallel and row-parallel is — TP=8 gives you 8× less activation memory in the FFN inner region. But the block's INPUT and OUTPUT are replicated, costing on every rank. For activation checkpointing math, the right number to count is the maximum, not the sum.
- Composing with FP8 needs care. The amax scaling used in FP8 (Section 10.3) is computed per-tensor — but with TP the "tensor" is sharded across ranks. A naive per-shard amax produces different scales on different ranks for what is logically one tensor; the matmul then silently loses precision. The fix is an all-reduce on amax before the FP8 cast — one extra (tiny) collective per TP-FP8 layer.
The deepest lesson. Data parallelism trades GPUs for throughput. Tensor parallelism trades intra-node bandwidth for the ability to run models that no single GPU can hold. Pipeline parallelism trades wall-clock time (bubbles) for the ability to run models that no single NODE can hold. None of them are free; all of them are necessary. The art of scaling training is composing the three so that no single bottleneck — HBM, NVLink, or InfiniBand — becomes the binding constraint. The next section unpacks pipeline parallelism; section 11.5 shows how DeepSeek's DualPipe algorithm bends the pipeline-bubble curve, and 11.7 explains why their final answer was to abandon TP entirely.