A 671 B-parameter model trained in FP8 should explode. It does not — but only because every dot product inside it is secretly doing arithmetic in two precisions at once. The data lives in 8 bits. The multiplies happen in tensor cores at 8 bits in, ~14 bits out. But the running sum across the K dimension is quietly promoted to a full 32-bit register on a different piece of silicon every 128 steps — and that single trick is the difference between a stable training curve and a diverging one.
The Real Problem: A Matmul Is a Very Long Sum
Section 10.3 fixed the storage precision problem with fine-grained quantisation: tile activations and weights, give every tile its own scale factor, and the FP8 grid suddenly covers the value range any tensor needs. So the inputs to a matmul are now reasonably well represented in FP8. What about the matmul itself?
A single output entry of a GEMM is $C_{ij} = \sum_{k=1}^{K} A_{ik} B_{kj}$. For a 7 B-class model, $K$ is somewhere between 4096 and 16384 inside the attention and MLP layers. Each individual product is a fine FP8 number. The problem is what happens when you add four thousand of them together.
FP8 e4m3 has 3 mantissa bits plus a hidden bit: about 4 effective bits of precision, or one part in 16. Adding two FP8 values produces a result that, when rounded back to FP8, carries up to roughly 6% rounding noise per add. Compound that across $K$ operations and the partial sum is dominated by accumulated rounding error long before the true sum is computed. Worse: when the running sum grows much larger than any individual incoming product (which it will, within a few dozen adds of the same sign), the small product rounds away to zero on contact with the accumulator. This is swamping, the classical numerical-analysis failure mode of finite-precision arithmetic, and FP8 swamps almost immediately.
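Both effects can be checked in a few lines of Python. This is a minimal sketch rather than real FP8 hardware: `round_to_bits` is a hypothetical helper that rounds to 4 significant bits with round-to-nearest-even and ignores e4m3's exponent range, subnormals, and saturation.

```python
import math

def round_to_bits(x: float, p: int) -> float:
    """Round x to p significant binary digits, round-to-nearest-even.

    Illustrative model only: the exponent range, subnormals and saturation
    of a real format such as e4m3 are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                       # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 2**p), e - p)  # built-in round() ties to even

P_FP8 = 4  # 3 stored mantissa bits + 1 hidden bit

# Effect 1: worst-case relative rounding error of a single result,
# scanned over one binade [1, 2).
grid = [1.0 + i / 1024 for i in range(1024)]
worst_rel_err = max(abs(round_to_bits(x, P_FP8) - x) / x for x in grid)

# Effect 2: swamping. Add 1.0 over and over into a 4-bit accumulator.
acc = 0.0
for _ in range(1000):
    acc = round_to_bits(acc + 1.0, P_FP8)

print(f"worst relative rounding error: {worst_rel_err:.4f}")
print(f"sum of 1000 ones in a 4-bit accumulator: {acc}")
```

The scan reports a worst case of 1/17, about 5.9%, and the running sum stops at 16.0: from there on, 16 + 1 lands exactly between the grid points 16 and 18 and rounds back down, so the remaining 984 addends never register.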
Hopper's WGMMA tensor-core instruction was designed knowing this. It does the multiply at FP8 × FP8 and routes the result into an FP32 register. Problem solved? Not quite — and the rest of this section is about why.
Intuition: A Bucket That Keeps Spilling Sand
Picture an accumulator as a bucket and each product as a handful of sand. The bucket has a finite number of binary digits of capacity: say 4 digits if it is FP8, 8 digits if it is BF16, 24 digits if it is FP32. Every time you pour a handful in, the bucket adjusts its reading, but only to the nearest digit it can show.
Early on, both the bucket and the handfuls are small numbers of similar magnitude. Each pour shows up cleanly. Some time later the bucket reads 100 and you pour in a handful of 0.3. A 4-digit bucket sees $100 + 0.3 \to 100$: the small handful vanished. A 24-digit bucket records every grain. The total weight you poured is the same in both cases. The amount the bucket remembers is not.
The DeepSeek V3 trick is to give yourself two buckets. A small, quick-to-write bucket (the tensor-core register, ~14 effective bits) catches sand as fast as the tensor cores can pour it. Every 128 pours, you tip that small bucket into a giant 24-digit bucket and reset the small one. Now no individual handful is ever lost: while it is being accumulated within a 128-pour window the magnitudes of bucket and handful stay comparable, so nothing is rounded to zero. The giant bucket only ever receives full 14-digit-precision tips, and it has the capacity to hold all of them faithfully.
The Mathematics of Rounding-Error Walk
For a finite-precision float with $p$ mantissa bits (counting the hidden bit), the unit roundoff is $u = 2^{-p}$. Every arithmetic operation in this format produces a result that is the true result times $(1 + \delta)$ with $|\delta| \le u$. For FP8 e4m3, $p = 4$ and $u = 2^{-4} \approx 0.06$. For BF16, $u = 2^{-8}$. For FP32, $u = 2^{-24}$. For the H800 WGMMA register, the effective $u$ from DeepSeek's measurement is about $2^{-14}$: between BF16 and FP32, but closer to BF16.
Now sum $K$ products $a_k b_k$ with a precision-$u$ accumulator. After each add the accumulated value picks up a relative rounding kick $\delta_k$ with $|\delta_k| \le u$. Wilkinson's forward-error bound for a naive sum is the standard textbook result:
$$|\hat{s} - s| \le (K - 1)\,u \sum_{k=1}^{K} |a_k b_k| + O(u^2)$$
The error grows linearly in $K$ in the worst case. For randomly-signed errors the typical bound is more forgiving (a random walk of $K$ kicks of size $u$ drifts by $\sqrt{K}\,u$), but for $K = 4096$ we still have a typical drift of $64u$. With FP8 acc that is $64 \cdot 2^{-4} = 4$: total noise. With the WGMMA register's 14-bit floor that is $64 \cdot 2^{-14} \approx 4 \times 10^{-3}$: small but not invisible. With FP32 acc it is $64 \cdot 2^{-24} \approx 4 \times 10^{-6}$: gone.
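The arithmetic behind those drift figures is easy to tabulate. The sketch below assumes a unit roundoff of $u = 2^{-p}$, with $p$ the effective bit counts quoted in this section (4, 8, about 14, and 24), and $K = 4096$; the dictionary labels and function names are illustrative only.

```python
import math

K = 4096  # reduction length used as the running example

# Effective precision in bits (hidden bit included), as quoted in this section.
formats = {
    "FP8 e4m3 accumulator": 4,
    "BF16 accumulator": 8,
    "WGMMA register (~14 bits)": 14,
    "FP32 accumulator": 24,
}

def typical_drift(k: int, bits: int) -> float:
    """Random-walk estimate sqrt(k) * u, with unit roundoff u = 2**-bits."""
    return math.sqrt(k) * 2.0 ** -bits

def worst_case_factor(k: int, bits: int) -> float:
    """Leading factor (k - 1) * u of the worst-case summation bound."""
    return (k - 1) * 2.0 ** -bits

drift = {name: typical_drift(K, bits) for name, bits in formats.items()}
for name, bits in formats.items():
    print(f"{name:27s} typical {drift[name]:.2e}   "
          f"worst-case factor {worst_case_factor(K, bits):.2e}")
```

A typical drift above 1 means the rounding noise is larger than the sum itself, which is the FP8 row; for FP8 the worst-case factor is in the hundreds, so the bound carries no information there.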
Why promotion every Nc steps works
Suppose we accumulate in the limited-precision register only for $N_c$ steps, then flush to FP32. The within-window error walks for $N_c$ steps (magnitude $\sqrt{N_c}\,u$) and the across-window FP32 add contributes essentially zero error. After $K$ total steps the $K / N_c$ windows form an independent sum, so the overall error scales as:
$$\sqrt{K / N_c} \cdot \sqrt{N_c}\,u = \sqrt{K}\,u$$
Wait: the $N_c$ dropped out. That looks like we gained nothing. The catch: this bound treats the within-window rounding as relative to the running sum. The actual benefit of small $N_c$ is that the running sum never grows large relative to the next addend $a_k b_k$, so swamping never starts. The Wilkinson bound assumes you have not lost any addends entirely, but with FP8 acc and $K = 4096$, half of the addends are completely absorbed by the bucket before they register, and the bound becomes worthless. Promotion every $N_c$ steps resets the bucket before swamping kicks in.
How small does $N_c$ need to be? Empirically: the bucket-to-handful magnitude ratio inside one window needs to stay well below $1/u$, the point at which a handful falls under the register's ULP. For 14-bit precision and centered Gaussian-ish inputs, that ratio only reaches dangerous territory at window lengths well beyond 128 products. DeepSeek V3 picks $N_c = 128$ with a generous safety margin, which is cheap to do because the FP8 WGMMA instruction already consumes K in chunks of 32, and 128 is just 4 of those.
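The effect of the window length can be made visible by counting how many addends leave the narrow register unchanged. The sketch below is an illustration under stated assumptions: it uses a 4-bit register as a stand-in (so the effect shows up at tiny sizes; a 14-bit register behaves the same way at much larger magnitude ratios), a Python float as the wide accumulator, and hypothetical helper names `round_to_bits` and `count_absorbed`.

```python
import math

def round_to_bits(x: float, p: int) -> float:
    """Round x to p significant bits (ties to even); exponent limits ignored."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(round(m * 2**p), e - p)

def count_absorbed(products, register_bits, window):
    """Return (absorbed, total): how many adds left the narrow register
    unchanged, and the final sum after flushing every `window` adds."""
    absorbed = 0
    wide_total = 0.0   # stands in for the FP32 accumulator
    narrow = 0.0       # limited-precision register
    for i, prod in enumerate(products, start=1):
        updated = round_to_bits(narrow + prod, register_bits)
        if updated == narrow:
            absorbed += 1              # this addend vanished on contact
        narrow = updated
        if i % window == 0:            # promotion: flush and reset
            wide_total += narrow
            narrow = 0.0
    wide_total += narrow               # flush the final partial window
    return absorbed, wide_total

products = [1.0] * 4096                # same-sign, equal-size addends
results = {w: count_absorbed(products, register_bits=4, window=w)
           for w in (8, 16, 32, 4096)}
for w, (absorbed, total) in results.items():
    print(f"window={w:5d}  absorbed={absorbed:5d}  sum={total}")
```

With windows of 8 or 16 nothing is lost and the sum is 4096. At 32 exactly half the addends are absorbed, and with no promotion at all (one window of 4096) only the first 16 ever register.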
Manual Numerical Walkthrough
Five-element dot product, all numbers small enough to do by hand. We will see swamping happen in real time.
FP8 vs FP32 acc on a 5-element dot product
Setup. Two FP8 vectors $a$ and $b$ whose elementwise products are $(16, 16, 16, 16, 0.0625)$, for example $a = b = (4, 4, 4, 4, 0.25)$. Every entry is exactly representable in FP8 e4m3 (these are powers of two, so they sit on the float grid). True dot product: $4 \times 16 + 0.0625 = 64.0625$.
FP8 accumulator, step by step. The FP8 e4m3 grid from 64 upward has spacing 8: values jump 64, 72, 80, etc. So:
• After step 1: $\hat{s}_1 = 16$. Exact (16 is on the grid).
• After step 2: true 32, rounded to nearest FP8 = 32. Exact.
• After step 3: true 48, rounded to nearest FP8 = 48. Exact (grid spacing at this magnitude is 4).
• After step 4: true 64, rounded to nearest FP8 = 64. Exact.
• After step 5: true $64 + 0.0625 = 64.0625$. But grid spacing at 64 is 8. The nearest grid point is 64. The 0.0625 contribution vanishes entirely. $\hat{s}_5 = 64$.
The damage. Absolute error 0.0625, relative error $0.0625 / 64.0625 \approx 9.8 \times 10^{-4}$. One contribution out of five was wiped out by the bucket. In a real GEMM, this happens to thousands of contributions per output entry.
FP32 accumulator, same vectors. FP32 grid spacing at 64 is $2^{-17} \approx 7.6 \times 10^{-6}$. The 0.0625 contribution is enormous compared to this, so there is no rounding at all. Final sum 64.0625, error 0 relative.
Promotion accumulator, Nc = 2. Accumulate two products in FP8 register: $16 + 16 = 32$ (exact). Promote to FP32 register, reset. Accumulate next two: $16 + 16 = 32$. Promote, reset. Final product: $0.0625$, accumulated into the FP8 register (still small). Promote: FP32 register holds $32 + 32 + 0.0625 = 64.0625$. Exact.
The lesson the toy example reveals. The damage is not gradual. It is a step function — the moment the accumulator passes a binade boundary in the limited-precision format, small contributions stop registering at all. Promotion keeps the accumulator small enough that you never cross that boundary inside a window.
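The hand calculation can be replayed in code. This is a sketch under assumptions: the vectors are taken to be $a = b = (4, 4, 4, 4, 0.25)$, one choice consistent with the partial sums 16, 32, 48, 64 and the final 0.0625 contribution; FP8 and FP32 are modelled as 4-bit and 24-bit rounding with no exponent limits; and each product is formed exactly before it reaches the accumulator. All function names are illustrative.

```python
import math

def round_to_bits(x: float, p: int) -> float:
    """Round x to p significant bits (ties to even); exponent limits ignored."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(round(m * 2**p), e - p)

FP8_BITS, FP32_BITS = 4, 24

def dot_single_register(a, b, bits):
    """Accumulate every product in one register of the given precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to_bits(acc + x * y, bits)
    return acc

def dot_promoted(a, b, narrow_bits, wide_bits, n_c):
    """Accumulate in a narrow register, flushing to a wide one every n_c adds."""
    wide, narrow = 0.0, 0.0
    for i, (x, y) in enumerate(zip(a, b), start=1):
        narrow = round_to_bits(narrow + x * y, narrow_bits)
        if i % n_c == 0:
            wide = round_to_bits(wide + narrow, wide_bits)
            narrow = 0.0
    return round_to_bits(wide + narrow, wide_bits)   # flush the leftover window

a = [4.0, 4.0, 4.0, 4.0, 0.25]   # assumed example values
b = [4.0, 4.0, 4.0, 4.0, 0.25]

fp8_result = dot_single_register(a, b, FP8_BITS)
fp32_result = dot_single_register(a, b, FP32_BITS)
promoted_result = dot_promoted(a, b, FP8_BITS, FP32_BITS, n_c=2)
print(fp8_result, fp32_result, promoted_result)   # 64.0 64.0625 64.0625
```

The promoted variant uses the same 4-bit register as the failing one; the only change is that the register is emptied before it grows large enough to swallow the last product.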
Visualizing the Three Accumulators
[Interactive widget: the same FP8 inputs run through four different accumulators: pure FP8, tensor-core-only (~14 bits), DeepSeek-style promotion every $N_c$ steps, and an FP32 truth line drawn as a dashed reference. As K is raised toward 8192, the red and amber lines (the two unpromoted accumulators) peel away from the truth while the green promoted line tracks it.]
Plain Python: Three Accumulators, One Dot Product
Here is the entire concept in 40 lines of NumPy. No GPUs, no kernels — just a precision sandbox that lets us watch the rounding error unfold one add at a time.
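A dependency-free sketch of such a sandbox follows. It uses only the standard library rather than NumPy so that every rounding step is explicit, and it rests on assumptions: precision is modelled by rounding to 4, 14, or 24 significant bits with no exponent limits, the inputs are positive (the worst case for swamping), and names such as `accumulate` are hypothetical.

```python
import math
import random

def round_to_bits(x: float, p: int) -> float:
    """Round x to p significant bits (ties to even); exponent limits ignored."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(round(m * 2**p), e - p)

def accumulate(products, bits, n_c=None):
    """Sum `products` in a `bits`-bit register.

    If n_c is given, the register is flushed into a 24-bit (FP32-like)
    accumulator every n_c adds and reset; otherwise it is never flushed.
    """
    wide, narrow = 0.0, 0.0
    for i, prod in enumerate(products, start=1):
        narrow = round_to_bits(narrow + prod, bits)
        if n_c is not None and i % n_c == 0:
            wide = round_to_bits(wide + narrow, 24)
            narrow = 0.0
    if n_c is None:
        return narrow
    return round_to_bits(wide + narrow, 24)

rng = random.Random(0)
K = 4096
# Positive inputs already snapped to the 4-bit grid.
a = [round_to_bits(rng.uniform(0.5, 1.0), 4) for _ in range(K)]
b = [round_to_bits(rng.uniform(0.5, 1.0), 4) for _ in range(K)]
products = [x * y for x, y in zip(a, b)]   # exact in double precision

truth = math.fsum(products)                # reference sum
fp8_acc = accumulate(products, bits=4)                # pure FP8 accumulator
tc_only = accumulate(products, bits=14)               # tensor-core register only
promoted = accumulate(products, bits=14, n_c=128)     # promote every 128 adds

for name, value in [("FP8 acc", fp8_acc), ("TC only", tc_only), ("promoted", promoted)]:
    print(f"{name:9s} sum={value:10.4f}  rel_err={abs(value - truth) / truth:.2e}")
```

Because every product here is at most 1, the 4-bit accumulator can never climb past 16, while the true sum is in the thousands. The other two rows show how much of the remaining error the periodic flush removes for this particular input.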
PyTorch: Block GEMM with Promoted Accumulator
Once we move from a single dot product to a full GEMM, the same promotion logic lives inside the outer K loop of a block-tiled kernel. The PyTorch version below is the structural template — exactly the layout that CUTLASS's FP8 GEMMs use on H100/H800, written in pure Python so we can read every line.
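The loop structure can be sketched without PyTorch. The version below is an assumption-laden illustration, not the CUTLASS or DeepSeek kernel: matrices are plain lists of lists, the tensor-core register is modelled as 14-bit rounding and the FP32 accumulator as 24-bit rounding, and `gemm_promoted` is a hypothetical name.

```python
import math
import random

def round_to_bits(x: float, p: int) -> float:
    """Round x to p significant bits (ties to even); exponent limits ignored."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(round(m * 2**p), e - p)

def gemm_promoted(A, B, n_c=128, tc_bits=14, wide_bits=24):
    """C = A @ B with one promotion per K window.

    A is M x K and B is K x N. The outer loop walks K in windows of n_c;
    each window's partial sum lives in a tc_bits register and is then
    added into a wide_bits accumulator.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]             # FP32-like accumulators
    for k0 in range(0, K, n_c):                   # outer K loop
        k1 = min(k0 + n_c, K)
        for i in range(M):
            for j in range(N):
                partial = 0.0                     # narrow register, reset per window
                for k in range(k0, k1):
                    partial = round_to_bits(partial + A[i][k] * B[k][j], tc_bits)
                # Promotion add: the window's result moves to the wide accumulator.
                C[i][j] = round_to_bits(C[i][j] + partial, wide_bits)
    return C

rng = random.Random(0)
M, K, N = 2, 512, 3
A = [[round_to_bits(rng.uniform(0.5, 1.0), 4) for _ in range(K)] for _ in range(M)]
B = [[round_to_bits(rng.uniform(0.5, 1.0), 4) for _ in range(N)] for _ in range(K)]

C = gemm_promoted(A, B)
reference = [[math.fsum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
             for i in range(M)]
max_rel_err = max(abs(C[i][j] - reference[i][j]) / reference[i][j]
                  for i in range(M) for j in range(N))
print(f"max relative error vs exact reference: {max_rel_err:.2e}")
```

Placing the K window outermost means each output entry is touched once per window, which is the point where a real kernel would hand the partial result from the tensor core to the CUDA cores.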
At Massive Scale: WGMMA, Nc=128, and the H800 Trick
DeepSeek V3 trained 671 B parameters in FP8 across 14.8 T tokens on a fleet of H800 GPUs. Every linear layer — every QKV projection, every MLP, every attention output — runs on FP8 tensor cores. Without high-precision accumulation, that run would have diverged inside the first thousand steps. With it, the team reported BF16-equivalent loss curves at a fraction of the memory bandwidth.
The actual production kernel does five things on top of the textbook two-bucket pattern:
- Tile shape locked to the WGMMA instruction. H800's FP8 WGMMA consumes 32 elements along K per instruction, so Nc = 128 is 4 WGMMAs, which the DeepSeek V3 report describes as the smallest interval that significantly improves precision without substantial overhead. Larger Nc loses precision faster than it saves cycles; smaller Nc wastes CUDA-core add throughput.
- Asynchronous promotion via warp specialisation. Tensor cores and CUDA cores are different SM units. The kernel alternates them: while one warp pours sand into the tensor-core bucket, another warp empties yesterday's bucket into the FP32 accumulator. Done right, the promotion is free — it hides entirely behind WGMMA latency.
- Block-wise input scales fold into the promotion add. From §10.3, every tile has its own scale factor (in DeepSeek V3, 1 × 128 tiles for activations and 128 × 128 blocks for weights). That scale enters the FP32 add, not the FP8 multiply, which means the per-tile dynamic-range correction lives at full precision, and the FP8 multiplies see only well-normalised values.
- Backward pass uses the same trick, twice. dW = X^T · dY and dX = dY · W^T are also FP8 matmuls. Each gets its own promoted accumulator. The master weights and weight gradients stay in FP32, so the promoted accumulator pours directly into the master weight update. The FP8 storage layer is invisible to the optimiser.
- Periodic FP32 reset across batches. Over a full step, the FP32 accumulator holds the entire GEMM result. Between steps it is consumed by the next layer and reset. Over a training run the accumulator is recreated once per step, but it never grows unboundedly, because each step zeroes it.
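The scale-folding item in the list above can be sketched for a single dot product. Everything here is an illustrative assumption: tiles of 128 elements along K with one scale each, FP8 modelled as 4-bit rounding after scaling the tile maximum to 448 (the largest finite e4m3 value), subnormals ignored, and hypothetical names `quantize_tile` and `scaled_promoted_dot`.

```python
import math
import random

def round_to_bits(x: float, p: int) -> float:
    """Round x to p significant bits (ties to even); exponent limits ignored."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(round(m * 2**p), e - p)

FP8_MAX = 448.0   # largest finite e4m3 value

def quantize_tile(tile):
    """Return (fp8_like_values, scale) such that value ~= q * scale."""
    scale = (max(abs(v) for v in tile) / FP8_MAX) or 1.0   # guard all-zero tiles
    return [round_to_bits(v / scale, 4) for v in tile], scale

def scaled_promoted_dot(x, w, tile=128, tc_bits=14, wide_bits=24):
    acc = 0.0                                   # FP32-like accumulator
    for k0 in range(0, len(x), tile):
        qx, sx = quantize_tile(x[k0:k0 + tile])
        qw, sw = quantize_tile(w[k0:k0 + tile])
        partial = 0.0                           # tensor-core-style register
        for a, b in zip(qx, qw):
            partial = round_to_bits(partial + a * b, tc_bits)
        # Dequantisation happens in the wide add, not in the FP8 multiply.
        acc = round_to_bits(acc + partial * (sx * sw), wide_bits)
    return acc

rng = random.Random(0)
# Two tiles with very different magnitudes, to exercise the per-tile scales.
x = ([rng.uniform(0.5, 1.5) for _ in range(128)]
     + [rng.uniform(500.0, 1500.0) for _ in range(128)])
w = [rng.uniform(0.5, 1.5) for _ in range(256)]

exact = math.fsum(a * b for a, b in zip(x, w))
approx = scaled_promoted_dot(x, w)
rel_err = abs(approx - exact) / exact
print(f"exact={exact:.3f}  approx={approx:.3f}  rel_err={rel_err:.2e}")
```

The two tiles differ by three orders of magnitude, yet both are quantised onto the same 4-bit grid; the scales only reappear when each window's partial sum is added to the wide accumulator.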
| Strategy | Where it accumulates | Effective bits | Stability at K=4096 | Used in production? |
|---|---|---|---|---|
| FP8 acc | FP8 register, same place as inputs | ~4 | Catastrophic; diverges | Never |
| BF16 acc (mixed-precision classic) | BF16 register | ~8 | Marginal — needs loss scaling | Pre-Hopper FP16 training |
| WGMMA tensor-core register only | TC register, ~14 bits | ~14 | Drifts over a full training run | What you get if you do nothing |
| Promote every Nc to FP32 (DeepSeek V3) | CUDA-core FP32 register | ~24 | Matches FP32 to ~1e-6 | DeepSeek V3, FP8 frontier models |
| Full FP32 acc per add | CUDA-core FP32 register, no TC bucket | ~24 | Matches FP32 exactly | Slower; loses tensor-core throughput |
Engineering Reality: When FP32 Acc Is Not Enough
Promoted FP32 accumulation handles the FP8 GEMM case. There are operations inside a transformer where even FP32 is not sufficient and the kernel has to do something else:
- Softmax denominator. Inside attention, the row sum $\sum_j e^{s_{ij}}$ of exponentiated logits can underflow or overflow even in FP32 once the logits span a wide range. Production kernels subtract the row max before exponentiation (the standard numerically-stable softmax trick) and frequently keep the running sum in FP32 even when the rest of attention is FP16/FP8.
- LayerNorm / RMSNorm variance. $\frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2$ is another long sum. With a hidden dimension $d$ in the thousands it can drift in FP16 acc; modern norms accumulate the variance in FP32, regardless of the weight precision.
- Optimizer state. AdamW's first and second moments accumulate over millions of steps. Even FP32 is not generous here — many production setups use mixed FP32/BF16 with stochastic rounding, or pure FP32 master weights, regardless of how aggressively the forward/backward pass is quantised.
- Loss reduction. If you reduce the loss across the batch in BF16, you can lose contributions from low-magnitude examples. Production training typically reduces in FP32 and only casts back down for the next forward.
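The softmax item in the list above is the easiest of the four to demonstrate. The sketch below uses Python floats, which are double precision, so the overflow threshold is higher than in FP32 (an exponent near 710 rather than near 89); the mechanism is the same. The function names are illustrative.

```python
import math

def naive_softmax(logits):
    exps = [math.exp(z) for z in logits]       # overflows for large logits
    total = math.fsum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    row_max = max(logits)
    # After subtracting the row max, every exponent is <= 0, so exp() <= 1.
    exps = [math.exp(z - row_max) for z in logits]
    total = math.fsum(exps)                    # denominator is at least 1
    return [e / total for e in exps]

logits = [1000.0, 1001.0, 1002.0]

try:
    naive_softmax(logits)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs = stable_softmax(logits)
print("naive overflowed:", naive_overflowed)
print("stable softmax:", [round(p, 4) for p in probs])
```

Subtracting the row max does not change the result mathematically, because the common factor cancels between numerator and denominator; it only moves the computation into a range where the denominator is a well-behaved sum of values in (0, 1].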
The pattern across all four cases is the same as the FP8 case: the long-tail accumulator is the precision-sensitive part. FP8 storage and FP8 multiply are fine. FP8 add across thousands of steps is not. Engineering is the discipline of finding every long-accumulator and making sure it lives in a register that can hold the answer.
The deepest lesson. FP8 training is not really "FP8 training." It is FP8 storage, FP8 multiplication, and FP32 accumulation — held together by tile-level scales, two-bucket promotion every 128 steps, and a careful audit of every reduction in the model. Take any of those three legs away and the training run collapses. The reason DeepSeek V3 could publish an FP8 671 B model in 2024 is not that tensor cores got better — it is that the team mapped every sum in the transformer to the right precision tier. The next quantisation regime (FP4? Microscaling FP6?) will be the same exercise, fought one accumulator at a time.