Boo-AI — Master Artificial Intelligence by Building from Scratch

A 671 B-parameter model trained in FP8 should explode. It does not — but only because every dot product inside it is secretly doing arithmetic in two precisions at once. The data lives in 8 bits. The multiplies happen in tensor cores at 8 bits in, ~14 bits out. But the running sum across the K dimension is quietly promoted to a full 32-bit register on a different piece of silicon every 128 steps — and that single trick is the difference between a stable training curve and a diverging one.

The Real Problem: A Matmul Is a Very Long Sum

Section 10.3 fixed the storage precision problem with fine-grained quantisation: tile activations and weights, give every tile its own scale factor, and the FP8 grid suddenly covers the value range any tensor needs. So the inputs to a matmul are now reasonably well represented in FP8. What about the matmul itself?

A single output entry of a GEMM is $c_{ij} = \sum_{k=1}^{K} a_{ik} \, b_{kj}$ . For a 7 B-class model, $K$ is somewhere between 4096 and 16384 inside the attention and MLP layers. Each individual product $a_{ik} \, b_{kj}$ is a fine FP8 number. The problem is what happens when you add four thousand of them together.

FP8 e4m3 has 3 mantissa bits plus a hidden bit — about 4 effective bits of precision, or one part in 16. Adding two FP8 values produces a result that, when rounded back to FP8, has roughly 6% rounding noise per add. Compound that across $K = 4096$ operations and the partial sum is dominated by accumulated rounding error long before the true sum is computed. Worse: when the running sum grows much larger than any individual incoming product (which it will, after a few hundred adds with the same sign), the small product rounds away to zero on contact with the accumulator. This is swamping, the classical numerical-analysis failure mode of finite-precision arithmetic, and FP8 swamps almost immediately.
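A minimal sketch of swamping, assuming a hypothetical helper `round_sig` that rounds to a given number of significant bits. Four bits stand in for e4m3 here; the format's exponent range and saturation are ignored.

```python
import math

def round_sig(x, sig_bits):
    """Round x to sig_bits significant binary digits (hypothetical helper)."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (e - sig_bits)      # grid spacing in x's binade
    return round(x / scale) * scale    # Python's round() is round-half-even

# A 4-bit accumulator that already holds 16 receives one hundred addends of 0.5.
acc = 16.0
for _ in range(100):
    acc = round_sig(acc + 0.5, sig_bits=4)

exact = 16.0 + 100 * 0.5
print(f"4-bit accumulator: {acc}, exact sum: {exact}")
```

The grid spacing at 16 is 2, so each 0.5 is rounded away on contact and the accumulator never moves, even though the addends together are worth 50.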

Hopper's WGMMA tensor-core instruction was designed knowing this. It does the multiply at FP8 × FP8 and routes the result into an FP32 register. Problem solved? Not quite — and the rest of this section is about why.

The hidden detail in WGMMA. The DeepSeek V3 paper (§3.3.2) reports that the FP8 GEMM pipeline on H800 maintains only about 14 effective bits of precision in its partial-sum register before that register is flushed to an FP32 destination. NVIDIA does not document this number on a spec sheet; DeepSeek measured it by comparing tensor-core results against a CUDA-core FP32 reference. 14 bits is enough for a chunk of 128 products. It is not enough for $K = 4096$ products. Without explicit intervention, FP8 training drifts and eventually diverges.

Intuition: A Bucket That Keeps Spilling Sand

Picture an accumulator as a bucket and each product as a handful of sand. The bucket has a finite number of binary digits of capacity: say 4 digits if it is FP8, 8 digits if it is BF16, 24 digits if it is FP32. Every time you pour a handful in, the bucket adjusts its reading, but only to the nearest digit it can show.

Early on, both the bucket and the handfuls are small numbers of similar magnitude. Each pour shows up cleanly. Some time later the bucket reads 100 and you pour in a handful of 0.3. A 4-digit bucket sees $100 + 0.3 \to 100$ — the small handful vanished. A 24-digit bucket records every grain. The total weight you poured is the same in both cases. The amount the bucket remembers is not.

The DeepSeek V3 trick is to give yourself two buckets. A small, quick-to-write bucket (the tensor-core register, ~14 effective bits) catches sand as fast as the tensor cores can pour it. Every 128 pours, you tip that small bucket into a giant 24-digit bucket and reset the small one. Now no individual handful is ever lost: while it is being accumulated within a 128-pour window the magnitudes of bucket and handful stay comparable, so nothing is rounded to zero. The giant bucket only ever receives full 14-digit-precision tips, and it has the capacity to hold all of them faithfully.

Why specifically two buckets, not one big bucket. Because the tensor cores can only pour into the small bucket — that is the hardware. WGMMA writes its partial sum to its own register; you cannot instruct it to write directly into a CUDA-core FP32 register. The two-bucket pattern is exactly the software workaround for that hardware constraint: tensor cores fill the fast register, then a separate CUDA-core add transfers the value into a slower-but-larger accumulator.

The Mathematics of Rounding-Error Walk

For a finite-precision float with $p$ mantissa bits, the unit roundoff is $u = 2^{-p-1}$ . Every arithmetic operation in this format produces a result that is the true result times $(1 + \delta)$ with $|\delta| \leq u$ . For FP8 e4m3, $p = 3$ and $u \approx 0.06$ . For BF16, $u \approx 4 \cdot 10^{-3}$ . For FP32, $u \approx 6 \cdot 10^{-8}$ . For the H800 WGMMA register, the effective $u$ from DeepSeek's measurement is about $6 \cdot 10^{-5}$ — between BF16 and FP32, but much closer to BF16.
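The $(1 + \delta)$ model can be checked empirically. The sketch below is an illustration, not part of the original derivation: it assumes a hypothetical helper `round_sig`, rounds random values in $[1, 2)$ to $p$ stored bits plus the hidden bit, and compares the worst observed $|\delta|$ with $u = 2^{-p-1}$.

```python
import math
import random

def round_sig(x, sig_bits):
    """Round x to sig_bits significant binary digits (hypothetical helper)."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (e - sig_bits)      # grid spacing in x's binade
    return round(x / scale) * scale

def worst_delta(p, samples=10_000, seed=0):
    """Largest observed relative rounding error for p stored mantissa bits."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(samples):
        x = rng.uniform(1.0, 2.0)
        worst = max(worst, abs(round_sig(x, p + 1) - x) / x)
    return worst

for name, p in [("FP8 e4m3", 3), ("BF16", 7), ("FP32", 23)]:
    u = 2.0 ** -(p + 1)
    print(f"{name:>8s}: u = {u:.2e}   worst observed |delta| = {worst_delta(p):.2e}")
```

The observed maximum should sit just under $u$ for every format, which is what the bound $|\delta| \leq u$ predicts.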

Now sum $K$ products $x_k = a_k b_k$ with a precision- $p$ accumulator. After each add the accumulated value $S_k$ picks up a relative rounding kick $\delta_k$ . Wilkinson's forward-error bound for a naive sum is the standard textbook result:

$\bigl| \hat{S}_K - S_K \bigr| \leq K \cdot u \cdot \sum_{k=1}^{K} |x_k|$

The error grows linearly in $K$ in the worst case. For randomly-signed errors the typical bound is more forgiving — a random walk of $K$ kicks of size $u$ drifts by $\sqrt{K} \, u$ — but for $K = 4096$ we still have a typical drift of $64 u$ . With FP8 acc that is $\approx 4$ — total noise. With the WGMMA register's 14-bit floor that is $\approx 4 \cdot 10^{-3}$ — small but not invisible. With FP32 acc it is $\approx 4 \cdot 10^{-6}$ — gone.
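The arithmetic behind those three figures is short enough to script. This sketch only evaluates the two growth factors from the text, $\sqrt{K}\,u$ and $K\,u$; the dictionary labels are illustrative names, and the 14-bit value is the effective precision attributed to DeepSeek's measurement above.

```python
import math

K = 4096
unit_roundoff = {
    "FP8 acc": 2.0 ** -4,            # p = 3 stored bits
    "WGMMA register": 2.0 ** -14,    # ~14 effective bits (reported, not a spec value)
    "FP32 acc": 2.0 ** -24,          # p = 23 stored bits
}

for name, u in unit_roundoff.items():
    typical = math.sqrt(K) * u       # random-walk estimate
    worst = K * u                    # linear factor of the worst-case bound
    print(f"{name:>15s}: sqrt(K)*u = {typical:.1e}   K*u = {worst:.1e}")
```

Both factors still multiply the magnitude of the addends, so they are relative growth rates rather than absolute errors.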

Why promotion every Nc steps works

Suppose we accumulate in the limited-precision register only for $N_c$ steps, then flush to FP32. The within-window error walks for $N_c$ steps — magnitude $\sqrt{N_c} \, u_{\text{tc}}$ — and the across-window FP32 add contributes essentially zero error. After $K$ total steps the windows form an independent sum, so the overall error scales as:

$\bigl| \hat{S}_K - S_K \bigr| \sim \sqrt{K / N_c} \cdot \sqrt{N_c} \, u_{\text{tc}} \cdot \bar{x} = \sqrt{K} \, u_{\text{tc}} \, \bar{x}$

Wait: the $N_c$ dropped out. That looks like we gained nothing. The catch: this estimate charges every rounding kick against the typical addend $\bar{x}$, whereas each kick is really relative to the running sum $S_k$, which keeps growing inside a window. The actual benefit of small $N_c$ is that the running sum $S_k$ never grows large relative to the next addend $x_{k+1}$ , so swamping never starts. The random-walk estimate also assumes that no addend is lost entirely, but with FP8 acc and $K = 4096$ a large share of the addends is completely absorbed by the bucket before it registers, and the estimate becomes worthless. Promotion every $N_c$ resets the bucket before swamping kicks in.
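A heuristic way to see where $N_c$ re-enters, sketched under assumptions that are not part of the derivation above: the addends are zero-mean and independent with typical size $\bar{x}$, so that $|S_k| \approx \bar{x} \sqrt{k}$, and each add contributes an independent kick of size $u_{\text{tc}} \, |S_k|$.

$\text{no promotion:} \quad \bigl| \hat{S}_K - S_K \bigr| \sim u_{\text{tc}} \, \bar{x} \, \sqrt{\textstyle\sum_{k=1}^{K} k} \approx \frac{K}{\sqrt{2}} \, u_{\text{tc}} \, \bar{x}$

$\text{promotion every } N_c \text{:} \quad \bigl| \hat{S}_K - S_K \bigr| \sim \sqrt{K / N_c} \cdot \frac{N_c}{\sqrt{2}} \, u_{\text{tc}} \, \bar{x} = \sqrt{\frac{K N_c}{2}} \, u_{\text{tc}} \, \bar{x}$

Under these assumptions the two differ by a factor of $\sqrt{K / N_c}$, about 5.7 for $K = 4096$ and $N_c = 128$. With same-signed addends the running sum grows like $k$ rather than $\sqrt{k}$, and the gap widens.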

How small does $N_c$ need to be? Empirically: the bucket-vs-handful magnitude ratio inside one window needs to stay small enough that a handful does not fall below the register's ULP. For 14-bit precision and centered Gaussian-ish inputs, that ratio crosses dangerous territory somewhere around $N_c = 256$ products. DeepSeek V3 picks $N_c = 128$ with a generous safety margin. This is cheap to do because one FP8 WGMMA instruction already covers 32 elements along K, so 128 is just 4 of those instructions.

Manual Numerical Walkthrough

Five-element dot product, all numbers small enough to do by hand. We will see swamping happen in real time.

FP8 vs FP32 acc on a 5-element dot product

Setup. Two FP8 vectors, $a = (4, 4, 4, 4, 0.25)$ and $b = (4, 4, 4, 4, 0.25)$ . Every entry is exactly representable in FP8 e4m3 (these are powers of two — they sit on the float grid). True dot product: $4 \cdot 16 + 0.0625 = 64.0625$ .

FP8 accumulator, step by step. The FP8 e4m3 grid at and above 64 has spacing $2^{6-3} = 8$: values jump 64, 72, 80, etc. (just below 64 the spacing is 4). So:

• After step 1: $S_1 = 16$ . Exact (16 is on the grid).
• After step 2: true 32, rounded to nearest FP8 = 32. Exact.
• After step 3: true 48, rounded to nearest FP8 = 48. Exact (grid spacing at this magnitude is 4).
• After step 4: true 64, rounded to nearest FP8 = 64. Exact.
• After step 5: true $64 + 0.0625 = 64.0625$ . But grid spacing at 64 is $2^{6-3} = 8$ . The nearest grid point is 64. The 0.0625 contribution vanishes entirely. $\hat{S}_5 = 64$ .

The damage. Absolute error 0.0625, relative error $0.0625 / 64.0625 \approx 10^{-3}$ . One contribution out of five was wiped out by the bucket. In a real GEMM, this happens to thousands of contributions per output entry.

FP32 accumulator, same vectors. FP32 grid spacing at 64 is $2^{6-23} \approx 7.6 \cdot 10^{-6}$ . The 0.0625 contribution is enormous compared to this — there is no rounding at all. Final sum 64.0625, error $< 10^{-7}$ relative.

Promotion accumulator, Nc = 2. Accumulate two products in FP8 register: $16 + 16 = 32$ (exact). Promote to FP32 register, reset. Accumulate next two: $16 + 16 = 32$ . Promote, reset. Final product: $0.0625$ , accumulated into the FP8 register (still small). Promote: FP32 register holds $32 + 32 + 0.0625 = 64.0625$ . Exact.

The lesson the toy example reveals. The damage is not gradual. It is a step function — the moment the accumulator passes a binade boundary in the limited-precision format, small contributions stop registering at all. Promotion keeps the accumulator small enough that you never cross that boundary inside a window.
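The walkthrough above can be replayed in a few lines. This is a sketch under a simplifying assumption: each format is modelled only by its number of significant bits (4 for e4m3, 24 for FP32) through a hypothetical helper `round_sig`, with no exponent limits.

```python
import math

def round_sig(x, sig_bits):
    """Round x to sig_bits significant binary digits (hypothetical helper)."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (e - sig_bits)      # grid spacing in x's binade
    return round(x / scale) * scale

def accumulate(products, sig_bits):
    """Round the running sum to sig_bits significant bits after every add."""
    s = 0.0
    for p in products:
        s = round_sig(s + p, sig_bits)
    return s

def accumulate_promoted(products, n_c, small_bits, big_bits):
    """Sum in a small register; every n_c products, flush it into a wide one."""
    small, big = 0.0, 0.0
    for i, p in enumerate(products, start=1):
        small = round_sig(small + p, small_bits)
        if i % n_c == 0:
            big = round_sig(big + small, big_bits)
            small = 0.0
    return round_sig(big + small, big_bits)    # flush the tail

a = [4.0, 4.0, 4.0, 4.0, 0.25]
b = [4.0, 4.0, 4.0, 4.0, 0.25]
products = [x * y for x, y in zip(a, b)]

fp8_result = accumulate(products, sig_bits=4)
fp32_result = accumulate(products, sig_bits=24)
promoted_result = accumulate_promoted(products, n_c=2, small_bits=4, big_bits=24)

print(fp8_result, fp32_result, promoted_result)   # 64.0 64.0625 64.0625
```

The small register in the promoted version is as coarse as the pure FP8 accumulator. It only gives the right answer because it is emptied before it grows past the point where 0.0625 stops registering.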

Visualizing the Three Accumulators

The widget below runs the same FP8 inputs through four different accumulators: pure FP8, tensor-core-only (~14 bits), DeepSeek-style promotion every $N_c$ steps, and an FP32 truth line drawn as a dashed reference. Drag K up to 8192 and watch the red and amber lines peel away from the truth while the green promoted line tracks it exactly.


Reading the widget. The dashed grey line is the FP32 truth: what the running sum should be after each step. The red FP8-acc line drifts catastrophically; the amber tensor-core-only line drifts gently but persistently; the green promotion line is visually indistinguishable from truth at the default settings. The vertical green dashed bars mark the $N_c$ promotion ticks. Every time you cross one, the tensor-core register has been flushed and the small bucket is fresh again.

Plain Python: Three Accumulators, One Dot Product

Here is the entire concept in under 60 lines of NumPy. No GPUs, no kernels, just a precision sandbox that lets us watch the rounding error unfold one add at a time.

fp8_accumulation.py


Why simulate floats in NumPy?

We do not want to depend on a CUDA box to learn this concept. NumPy gives us FP64, far more precision than we will ever need, and we use it as a precision sandbox: round to N mantissa bits, saturate at max-abs, simulate any float format we want. CUTLASS's host-side reference kernels serve a similar purpose when validating tensor-core math.

A generic round-to-N-bits routine

Three steps: find the binade (the power-of-two interval that contains x), compute the spacing between adjacent representable values inside that binade (2^(e - mant_bits)), and snap x to the nearest grid point. Overflow is handled by the saturate check at the top; subnormals and the minimum exponent are not modelled, so very small values keep their full relative precision. This single routine becomes FP8, BF16, and the 14-bit tensor-core register just by changing mant_bits.

EXECUTION STATE

e = floor(log2(|x|)) — the binade index

scale = 2^(e - mant_bits), the ULP at this magnitude

Three formats, one line each

FP8 e4m3: 3 mantissa bits + hidden bit = 4 effective bits ≈ relative precision 2^-4 ≈ 6%. BF16: 7 mantissa bits + hidden = 8 effective bits ≈ 0.4%. The H100 WGMMA tensor-core register: roughly 14 effective bits ≈ 0.006%. That last number is the one almost nobody knew until DeepSeek V3 measured it and published the value.

EXECUTION STATE

q_fp8 = unit roundoff ≈ 6% of magnitude (e4m3)

q_bf16 = unit roundoff ≈ 0.4% of magnitude

q_tc = unit roundoff ≈ 6 × 10⁻⁵ of magnitude

Accumulator #1: naive FP8

Run a Python sum, but quantise the running sum back to FP8 after every add. This is what would happen if the GEMM kernel stored the partial sum in an FP8 register. After K=4096 steps the error is large: the FP8 grid is too coarse, and small contributions get swallowed (the 'swamping' phenomenon: when x ≫ y, x + y rounds back to x). Production kernels NEVER do this. We include it as the worst-case baseline.

EXECUTION STATE

s = running sum, rounded to FP8 each step

Accumulator #2: tensor-core register only

Better, but still wrong. The H100 / H800 WGMMA instruction does FP8 × FP8 → FP32, but the internal partial-sum register inside the WGMMA pipeline holds only ~14 effective bits before it is flushed to the named FP32 register. If you let WGMMA accumulate across a long K dimension without promoting to a CUDA-core register, you are stuck at that 14-bit precision floor and lose the rest. DeepSeek V3 §3.3.2 reports that, in a preliminary test on two random matrices with K = 4096, this limited accumulation precision produced a maximum relative error of nearly 2%.

EXECUTION STATE

s = running sum, rounded to 14 bits each step

Accumulator #3: promote every Nc steps

The DeepSeek V3 trick. Let WGMMA accumulate in its limited register for exactly Nc=128 products, then take that 14-bit-precision partial sum and ADD it into a separate full-precision FP32 register held in CUDA cores (not tensor cores). Reset the tensor-core register and repeat. The 14-bit cap now applies only to a window of 128 products at a time — much smaller error budget per window — and the long-K accumulator stays at full FP32 precision.

EXECUTION STATE

tc = tensor-core 14-bit running sum (resets every Nc)

cuda_fp32 = CUDA-core FP32 running sum (never resets)

Nc = promotion interval, 128 in DeepSeek V3

LOOP TRACE · 3 iterations

i=1..127: tc grows inside the 14-bit register

i=128 (flush): cuda_fp32 += tc; tc = 0

i=129..256: tc starts a fresh window with a fresh error budget

The stress test

K=4096 is a realistic reduction length for a projection in a 7B-class model. We quantise both vectors to FP8 first (this is the data format), then run all three accumulators against the FP64 ground truth. Expect three clearly separated error levels: the FP8 accumulator is far off because swamping stalls the running sum, the TC-only accumulator shows a small but systematic error, and the promoted accumulator typically lands closest to the truth. The exact figures depend on the seed and on how close the true sum happens to be to zero, so treat any single run as indicative only. Same data, same kernel skeleton, and the accumulator alone separates production from disaster.

EXECUTION STATE

K = 4096

FP8 acc = largest relative error of the three

TC only = intermediate relative error

Promote = typically the smallest relative error


import numpy as np

# ---- FP8 (e4m3) and BF16 quantisation, written from scratch ----------
def round_to(x, mant_bits, max_abs):
    """Round x to a float format with 'mant_bits' mantissa bits + hidden bit."""
    if x == 0.0:
        return 0.0
    sign = np.sign(x)
    ax = abs(x)
    if ax >= max_abs:
        return sign * max_abs                  # saturate
    e = np.floor(np.log2(ax))                  # binade index
    scale = 2.0 ** (e - mant_bits)             # grid spacing (ULP) in this binade
    return sign * np.round(ax / scale) * scale

q_fp8  = lambda x: round_to(x, mant_bits=3, max_abs=448.0)      # e4m3
q_bf16 = lambda x: round_to(x, mant_bits=7, max_abs=3.39e38)    # bf16
# ~14 effective bits = 13 stored mantissa bits + hidden bit (modelling assumption).
q_tc   = lambda x: round_to(x, mant_bits=13, max_abs=1e30)      # H100 WGMMA register

# ---- The matmul inner loop, three ways -------------------------------
def dot_fp8_acc(a_fp8, b_fp8):
    """Accumulate FP8 partial products in an FP8 register. PRODUCTION-ILLEGAL."""
    s = 0.0
    for ai, bi in zip(a_fp8, b_fp8):
        s = q_fp8(s + ai * bi)                 # re-round the running sum every add
    return s

def dot_tc_only(a_fp8, b_fp8):
    """Accumulate in the tensor-core ~14-bit partial-sum register only."""
    s = 0.0
    for ai, bi in zip(a_fp8, b_fp8):
        s = q_tc(s + ai * bi)
    return s

def dot_promote(a_fp8, b_fp8, Nc=128):
    """DeepSeek V3: every Nc products, flush the TC register into a CUDA-core
    FP32 accumulator that has the full 24-bit mantissa."""
    tc, cuda_fp32 = 0.0, 0.0                   # Python floats (FP64) stand in for FP32
    for i, (ai, bi) in enumerate(zip(a_fp8, b_fp8), start=1):
        tc = q_tc(tc + ai * bi)
        if i % Nc == 0:
            cuda_fp32 += tc        # full-precision add on the CUDA core
            tc = 0.0
    return cuda_fp32 + tc          # flush the tail

# ---- Stress test on a K=4096 dot product -----------------------------
rng  = np.random.default_rng(0)
a    = np.array([q_fp8(x) for x in rng.standard_normal(4096) * 0.5])
b    = np.array([q_fp8(x) for x in rng.standard_normal(4096) * 0.5])
truth = float(np.dot(a.astype(np.float64), b.astype(np.float64)))

for name, fn in [("FP8 acc", dot_fp8_acc),
                 ("TC only", dot_tc_only),
                 ("Promote Nc=128", lambda x, y: dot_promote(x, y, 128))]:
    v = fn(a, b)
    err = abs(v - truth) / max(abs(truth), 1e-12)
    print(f"{name:>15s}: {v:+10.4f}   rel.err = {err:.2e}")
print(f"{'FP64 truth':>15s}: {truth:+10.4f}")
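To see how the promotion interval itself matters, the same experiment can be swept over several values of Nc. The sketch below is self-contained and uses only the standard library. The helper `round_sig`, the trial count, and the choice of RMS absolute error over 20 random dot products are all assumptions made for illustration; Nc = 4096 is the same as never promoting.

```python
import math
import random

def round_sig(x, sig_bits):
    """Round x to sig_bits significant binary digits (hypothetical helper)."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (e - sig_bits)      # grid spacing in x's binade
    return round(x / scale) * scale

def promoted_dot(products, n_c, tc_bits=14):
    """Sum products in a tc_bits register, flushing to a wide sum every n_c."""
    small, big = 0.0, 0.0
    for i, p in enumerate(products, start=1):
        small = round_sig(small + p, tc_bits)   # limited-precision register
        if i % n_c == 0:
            big += small                        # FP64 stands in for the FP32 accumulator
            small = 0.0
    return big + small

def rms_error(n_c, K=4096, trials=20, seed=0):
    """RMS absolute error over several random FP8-like dot products."""
    rng = random.Random(seed)                   # same data for every n_c
    total = 0.0
    for _ in range(trials):
        products = [round_sig(rng.gauss(0.0, 0.5), 4) * round_sig(rng.gauss(0.0, 0.5), 4)
                    for _ in range(K)]
        truth = math.fsum(products)             # exactly rounded reference sum
        total += (promoted_dot(products, n_c) - truth) ** 2
    return math.sqrt(total / trials)

errors = {n_c: rms_error(n_c) for n_c in (32, 128, 512, 4096)}
for n_c, err in errors.items():
    print(f"Nc = {n_c:4d}: RMS absolute error = {err:.2e}")
```

Averaging over trials matters here: on a single dot product the rounding errors can cancel by luck, so one run is not a reliable ranking.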

PyTorch: Block GEMM with Promoted Accumulator

Once we move from a single dot product to a full GEMM, the same promotion logic lives inside the outer K loop of a block-tiled kernel. The PyTorch version below is the structural template. It mirrors the chunked-K layout of FP8 GEMMs on H100/H800, written in pure Python so we can read every line.

block_gemm_promoted.py


Where this code maps to real CUTLASS

CUTLASS's FP8 GEMM is a tiled inner-product where the K dimension is split into Nc-sized chunks. Inside each chunk, the WGMMA tensor-core instruction accumulates partials in its limited-precision register. Between chunks, the partials are copied into a CUDA-core FP32 register and added. Our Python loop is the CPU shadow of that pattern: one Python iteration = one WGMMA chunk; the inner matmul is the tensor-core operation; the outer 'C = C + partial' add is the CUDA-core promotion.

Vectorised quantize

Same algorithm as the NumPy version, written so it broadcasts over an entire (M, K) or (N, K) matrix. Important detail: clamp(min=1e-30) before taking log2 — without it, torch.log2(0) returns -inf and the rest of the pipeline produces NaNs. This is the kind of edge case that bites you when you try to write your own FP8 emulation: zeros and subnormals must be handled before the log.

EXECUTION STATE

e = binade index per element, shape matches x

scale = ULP per element

The block-GEMM skeleton

Two loops in real CUTLASS: an outer K loop over Nc-sized chunks, and (implicit) an inner WGMMA over the chunk. Here we collapse the inner WGMMA into a single (M, Nc) @ (N, Nc).T matmul in FP32 followed by a quantise to 14 bits, which is what the tensor-core register actually delivers per chunk. The 'C = C + partial' line is the WGMMA → FP32-register copy plus add, executed by CUDA cores at full 24-bit mantissa precision.

EXECUTION STATE

M, K, N = 32, 4096, 32 in the example

C = FP32 accumulator, shape (M, N)

Nc = promotion interval, default 128

Per-chunk WGMMA simulation

partial is the result of one WGMMA chunk. In our model it is the exact (M, N) matmul of two (M, Nc) and (N, Nc) FP8 tiles, then quantised to tc_bits=14 to model the precision floor of the tensor-core partial register. On a real H800, a 128-wide chunk corresponds to a short sequence of FP8 WGMMA instructions (four, at 32 elements of K each) rather than a Python-level matmul.

EXECUTION STATE

A_blk = (M, Nc) FP8 tile

B_blk = (N, Nc) FP8 tile

partial = (M, N) chunk result, ~14-bit precision

Promotion in one line

The 'C = C + partial' is the entire promotion step. Because C is dtype=float32, the add happens at full FP32 precision on the CUDA core. The 14-bit precision floor that hit the partial does NOT compound across chunks: each chunk contributes its small error budget independently, and the chunk results are summed cleanly. This mirrors the promotion structure the DeepSeek V3 paper describes for its FP8 kernels.

EXECUTION STATE

C = running FP32 sum across chunks

partial = this chunk's contribution, already in FP32

LOOP TRACE · 3 iterations

k0=0: C += partial(chunk 0)

k0=128: C += partial(chunk 1)

k0=K-Nc: C += partial(last chunk), giving the final FP32 result

The headline numbers

Expect three clearly separated error levels. FP8 acc is far off, which would be catastrophic for any training loss curve. TC only shows a small but systematic error, the kind that can accumulate into a measurable loss divergence over a long training run. Promote Nc=128 typically lands closest to the FP64 truth computed from the same FP8-quantised inputs. Exact figures depend on the seed and the PyTorch build, so none are quoted here. The structural change is one line of Python; the precision change is large.

EXECUTION STATE

FP8 acc = largest error of the three

TC only = intermediate error

Promote = typically the smallest error

import torch

# FP8 arithmetic is not generally available in stock CPU PyTorch, so we
# model FP8 as a quantised float held in FP32 storage and study the
# accumulator dimension in isolation. The structural lesson (what the
# kernel does, what the accumulator looks like, how often it is flushed)
# carries over to a CUTLASS-style FP8 GEMM on Hopper.

def quantize(x, mant_bits, max_abs):
    """Round x elementwise to 'mant_bits' stored mantissa bits + hidden bit."""
    sign = torch.sign(x)
    ax = x.abs().clamp(min=1e-30)          # keep log2 finite; sign is 0 for zeros
    e = torch.floor(torch.log2(ax))        # binade index per element
    scale = 2.0 ** (e - mant_bits)         # grid spacing (ULP) per element
    return torch.where(ax >= max_abs,
                       sign * max_abs,     # saturate
                       sign * torch.round(ax / scale) * scale)

def block_gemm_promoted(A, B, Nc=128, tc_bits=14):
    """Compute C = A @ B^T with WGMMA-style promoted accumulation.

    A: (M, K) fp8-quantised tensor (we keep it in fp32 storage for simplicity)
    B: (N, K) fp8-quantised tensor
    tc_bits: effective bits of the tensor-core register, hidden bit included
    Returns C of shape (M, N), accumulated in FP32 with Nc-window flushes.
    """
    M, K = A.shape
    N, _ = B.shape
    C = torch.zeros(M, N, dtype=torch.float32)              # CUDA-core FP32
    for k0 in range(0, K, Nc):
        k1 = min(k0 + Nc, K)
        A_blk = A[:, k0:k1]                                  # (M, Nc)
        B_blk = B[:, k0:k1]                                  # (N, Nc)
        # Inside this block, the tensor-core register accumulates at ~14 bits.
        # Model that by computing the matmul in fp32 and then quantising once.
        partial = A_blk @ B_blk.T                            # (M, N), fp32
        partial = quantize(partial, mant_bits=tc_bits - 1, max_abs=1e30)
        C = C + partial                                      # full FP32 add
    return C

def gemm_stepwise_acc(A, B, mant_bits, max_abs):
    """Baseline: re-round the running sum after every step along K."""
    C = torch.zeros(A.shape[0], B.shape[0], dtype=torch.float32)
    for k in range(A.shape[1]):
        C = quantize(C + torch.outer(A[:, k], B[:, k]), mant_bits, max_abs)
    return C

# ----- Spot check ------------------------------------------------------
torch.manual_seed(0)
M, K, N = 32, 4096, 32
A_f32 = torch.randn(M, K) * 0.5
B_f32 = torch.randn(N, K) * 0.5
A = quantize(A_f32, 3, 448.0)                                # FP8 e4m3
B = quantize(B_f32, 3, 448.0)

C_truth = A.double() @ B.double().T                          # FP64 truth
C_naive = gemm_stepwise_acc(A, B, 3, 448.0)                  # FP8 acc
C_tc    = gemm_stepwise_acc(A, B, 13, 1e30)                  # TC-only, ~14 effective bits
C_promo = block_gemm_promoted(A, B, Nc=128, tc_bits=14)      # Promote Nc=128

def rel_err(C, T):
    """Mean absolute error, normalised by the mean magnitude of the truth."""
    return (C - T).abs().mean() / T.abs().mean().clamp(min=1e-12)

print(f"FP8 acc        rel.err = {rel_err(C_naive, C_truth):.2e}")
print(f"TC only        rel.err = {rel_err(C_tc,    C_truth):.2e}")
print(f"Promote Nc=128 rel.err = {rel_err(C_promo, C_truth):.2e}")

At Massive Scale: WGMMA, Nc=128, and the H800 Trick

DeepSeek V3 trained 671 B parameters in FP8 across 14.8 T tokens on a fleet of H800 GPUs. Most of the compute-heavy linear layers (QKV projections, MLPs, attention output projections) run on FP8 tensor cores. Without high-precision accumulation, a run like that would be expected to drift and eventually diverge. With it, the team reported a relative loss error below 0.25% against a BF16 baseline, at a fraction of the memory bandwidth.

The actual production kernel does five things on top of the textbook two-bucket pattern:

• Tile shape locked to the WGMMA instruction. One FP8 WGMMA instruction on H800 covers 32 elements along K. Nc = 128 is 4 such instructions, which the DeepSeek V3 paper describes as the minimal interval that significantly improves precision without substantial overhead. Larger Nc loses precision faster than it saves cycles; smaller Nc wastes CUDA-core add throughput.
• Asynchronous promotion via warp specialisation. Tensor cores and CUDA cores are different SM units. The kernel overlaps them: while one warpgroup pours sand into the tensor-core bucket, another empties the previous window's bucket into the FP32 accumulator. Done right, the promotion is close to free, because it hides behind WGMMA latency.
• Block-wise input scales fold into the promotion add. From §10.3, every tile has its own scale factor (in DeepSeek V3, 1 × 128 tiles for activations and 128 × 128 blocks for weights). That scale enters the FP32 add, not the FP8 multiply, which means the per-tile dynamic-range correction lives at full precision, and the FP8 multiplies see only well-normalised values.
• Backward pass uses the same trick, twice. dW = X^T · dY and dX = dY · W^T are also FP8 matmuls. Each gets its own promoted accumulator. Master weights and gradients stay in FP32 (DeepSeek V3 keeps the AdamW moments in BF16), so the promoted accumulator pours directly into the master weight update. The FP8 storage layer is invisible to the optimiser.
• Periodic FP32 reset across batches. Over a full step, the FP32 accumulator holds the entire GEMM result. Between steps it is consumed by the next layer and reset. Over a training run of $10^7$ steps the accumulator is recreated $10^7$ times, but it never grows unboundedly, because each step zeroes it.
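The scale-folding step in the list above can be sketched in NumPy. Everything here is illustrative: the sizes are small hypothetical ones, `q4` is a stand-in that rounds to 4 significant bits, FP64 stands in for the FP32 accumulator, and the tiling (1 × 128 for activations, 128 × 128 for weights) follows the DeepSeek V3 paper. The point is only that applying the scales at the promotion add gives the same result as dequantising first.

```python
import numpy as np

def q4(x):
    """Round to 4 significant bits, a stand-in for the FP8 e4m3 mantissa."""
    m, e = np.frexp(x)                       # x = m * 2**e with 0.5 <= |m| < 1
    return np.ldexp(np.round(m * 16) / 16, e)

rng = np.random.default_rng(0)
M, K, N, BLK = 4, 256, 256, 128              # hypothetical small problem
n_k, n_n = K // BLK, N // BLK

X = rng.standard_normal((M, K)) * 3.0        # activations
W = rng.standard_normal((N, K)) * 0.02       # weights

# One scale per 1 x 128 activation tile and per 128 x 128 weight block,
# chosen so each tile's largest magnitude maps to 448 (the e4m3 maximum).
X_t = X.reshape(M, n_k, BLK)
W_t = W.reshape(n_n, BLK, n_k, BLK)
sx = np.abs(X_t).max(axis=2) / 448.0                     # (M, n_k)
sw = np.abs(W_t).max(axis=(1, 3)) / 448.0                # (n_n, n_k)
Xq = q4(X_t / sx[:, :, None]).reshape(M, K)              # scaled FP8-like payload
Wq = q4(W_t / sw[:, None, :, None]).reshape(N, K)

C = np.zeros((M, N))
for c in range(n_k):
    k0, k1 = c * BLK, (c + 1) * BLK
    partial = Xq[:, k0:k1] @ Wq[:, k0:k1].T              # stands in for the tensor-core chunk
    scale = sx[:, c][:, None] * np.repeat(sw[:, c], BLK)[None, :]   # (M, N)
    C += partial * scale                                 # scales applied in the promotion add

# Reference: dequantise first, then multiply.
X_deq = (Xq.reshape(M, n_k, BLK) * sx[:, :, None]).reshape(M, K)
W_deq = (Wq.reshape(n_n, BLK, n_k, BLK) * sw[:, None, :, None]).reshape(N, K)
matches = np.allclose(C, X_deq @ W_deq.T)
print(matches)
```

Because each scale is constant within a K-chunk, it factors out of that chunk's partial sum, which is why the multiply can wait until the promotion step.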

| Strategy | Where it accumulates | Effective bits | Stability at K=4096 | Used in production? |
| --- | --- | --- | --- | --- |
| FP8 acc | FP8 register, same place as inputs | ~4 | Catastrophic: diverges in steps | Never |
| BF16 acc (mixed-precision classic) | BF16 register | ~8 | Marginal: needs loss scaling | Pre-Hopper FP16 training |
| WGMMA tensor-core register only | TC register, ~14 bits | ~14 | Drifts over a full training run | What you get if you do nothing |
| Promote every Nc to FP32 (DeepSeek V3) | CUDA-core FP32 register | ~24 | Matches FP32 to ~1e-6 | DeepSeek V3, FP8 frontier models |
| Full FP32 acc per add | CUDA-core FP32 register, no TC bucket | ~24 | Matches FP32 exactly | Slower; loses tensor-core throughput |

Engineering Reality: When FP32 Acc Is Not Enough

Promoted FP32 accumulation handles the FP8 GEMM case. There are operations inside a transformer where even FP32 is not sufficient and the kernel has to do something else:

• Softmax denominator. Inside attention, the row sum of $\exp(s_{ij})$ can underflow or overflow even in FP32 once the logits span a wide range. Production kernels subtract the row max before exponentiation (the standard numerically-stable softmax trick) and frequently keep the running sum in FP32 even when the rest of attention is FP16/FP8.
• LayerNorm / RMSNorm variance. $\sigma^2 = \frac{1}{d} \sum (x_i - \mu)^2$ is another long sum. With $d = 8192$ it can drift in FP16 acc; modern norms typically accumulate the variance in FP32, regardless of the weight precision.
• Optimizer state. AdamW's first and second moments accumulate over millions of steps. Even FP32 is not generous here: many production setups use mixed FP32/BF16 with stochastic rounding, or pure FP32 master weights, regardless of how aggressively the forward/backward pass is quantised.
• Loss reduction. If you reduce the loss across the batch in BF16, you can lose contributions from low-magnitude examples. Production training typically reduces in FP32 and only casts back down for the next forward.

The pattern across all four cases is the same as the FP8 case: the long-tail accumulator is the precision-sensitive part. FP8 storage and FP8 multiply are fine. FP8 add across thousands of steps is not. Engineering is the discipline of finding every long-accumulator and making sure it lives in a register that can hold the answer.
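The same failure is easy to reproduce outside a GEMM. As an illustration (not taken from any production kernel), the sketch below accumulates the sum of squares that an RMSNorm over $d = 8192$ identical elements would need, once in an FP16 accumulator and once in FP32, using the standard library's `struct` module to round to each format.

```python
import struct

def to_f16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

d = 8192
x = to_f16(0.1)          # one FP16 activation value, repeated d times
sq = x * x               # exact in FP64

sum_f16 = 0.0
sum_f32 = 0.0
for _ in range(d):
    sum_f16 = to_f16(sum_f16 + to_f16(sq))   # FP16 accumulator
    sum_f32 = to_f32(sum_f32 + to_f32(sq))   # FP32 accumulator

exact = d * sq
print(f"exact {exact:.4f}   FP16 acc {sum_f16:.4f}   FP32 acc {sum_f32:.4f}")
```

The FP16 accumulator stalls at 32: from there the grid spacing is 0.03125, each addend of about 0.01 is less than half of that, and every further contribution rounds away. The FP32 accumulator stays within a small fraction of a percent of the exact value.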

The deepest lesson. FP8 training is not really "FP8 training." It is FP8 storage, FP8 multiplication, and FP32 accumulation — held together by tile-level scales, two-bucket promotion every 128 steps, and a careful audit of every reduction in the model. Take any of those three legs away and the training run collapses. The reason DeepSeek V3 could publish an FP8 671 B model in 2024 is not that tensor cores got better — it is that the team mapped every sum in the transformer to the right precision tier. The next quantisation regime (FP4? Microscaling FP6?) will be the same exercise, fought one accumulator at a time.