Boo-AI — Master Artificial Intelligence by Building from Scratch

One activation channel out of four thousand can hold a value a hundred times larger than every other channel. If you give the whole tensor a single FP8 scale, that one channel decides what every other number is allowed to look like, and the smallest of the rest get crushed to zero. DeepSeek-V3's answer is brutally simple: stop using one scale. Give every 1×128 strip of activations and every 128×128 patch of weights its own FP32 multiplier. That is fine-grained quantisation, and it is one of the main reasons FP8 training works at 671 B parameters.

The Real Problem: One Outlier Burns Every Other Number

FP8 E4M3, the activation format used by DeepSeek-V3 and supported by NVIDIA's Transformer Engine on H100, has $2^{8} = 256$ bit patterns, so slightly fewer than 256 distinct values (two patterns encode NaN and zero appears twice). Its maximum magnitude is $448$. To store a tensor of FP32 values $x$ in FP8, the standard recipe is symmetric scaling:

$x_{\text{fp8}} = \text{round}\!\left(\frac{x}{s}\right), \quad s = \frac{\max(|x|)}{448}$

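The scale $s$ is a single FP32 number that pushes the largest absolute value in the tensor exactly to the FP8 ceiling. With the rounding written above (an integer grid, which this section uses as a simplified model of FP8), everything with magnitude below $s/2$ rounds to zero. So far, so good, for well-behaved tensors.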

Transformer activations are not well-behaved. Dettmers et al. (2022, LLM.int8()) measured them carefully and found something that the architecture papers never talk about: a small number of specific channels — sometimes just 0.1% of all channels — carry activation magnitudes 10× to 100× larger than the median. These outlier channels are not transient. They appear at the same indices across thousands of inputs. They grow more prominent as models get larger. By 6.7 B parameters, they dominate the per-tensor amax.

The arithmetic of disaster. Suppose 99.9% of an activation tensor lives in $[-3, 3]$ and one cell holds $220$. Per-tensor amax is $220$, so $s = 220/448 \approx 0.491$. On the integer grid, the smallest nonzero representable value at this scale is $\pm s$ itself, and every original value with $|x| < s/2 \approx 0.246$ rounds to zero. If the bulk is roughly standard normal, that is about a fifth of the tensor silently zeroed out by one outlier, and every surviving value is snapped to a multiple of 0.491.

Crushing activations like this does not produce a NaN. It does not crash. It produces a model that does not learn: gradients are computed from these coarsened tensors, the optimiser pretends everything is normal, and after several thousand steps the loss curve simply refuses to come down. The failure is silent, which makes it expensive to diagnose, and any per-tensor FP8 recipe is exposed to it whenever outlier channels are present.
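How much of the tensor is actually zeroed depends on how the bulk of the values is distributed. The sketch below measures it under two assumptions that are not from the source: the bulk is standard normal, and FP8 is modelled as the integer grid used in this section. The helper name `fraction_zeroed` is illustrative.

```python
import random

FP8_MAX = 448.0  # E4M3 maximum magnitude

def fraction_zeroed(values):
    """Return (scale, share of inputs that round to 0) under one per-tensor scale."""
    scale = max(abs(v) for v in values) / FP8_MAX
    zeroed = sum(1 for v in values if v != 0 and round(v / scale) == 0)
    return scale, zeroed / len(values)

random.seed(0)
bulk = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # assumed bulk: N(0, 1)
scale, frac = fraction_zeroed(bulk + [220.0])             # plus one outlier cell

print(f"scale = {scale:.3f}, zeroed fraction = {frac:.1%}")
```

Swapping the `bulk` line for a narrower or heavier-tailed distribution shows how sensitive the zeroed fraction is to the shape of the data, while the scale stays pinned at 220/448 by the single outlier.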

And outliers are not the only failure mode. Even without a catastrophic cell, the tail of a Gaussian-shaped activation distribution wastes most of FP8's tiny dynamic range whenever one tile happens to contain a 4-σ value and the next contains none. Per-tensor scaling forces both tiles to share the same ruler, and the ruler has only 8 mantissa levels per binade. Two tiles, one ruler, eight levels — most values land between the same two FP8 codes.
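To make "8 mantissa levels per binade" concrete, here is a minimal sketch of rounding a number onto the E4M3 grid. It assumes the common E4M3 layout (exponent bias 7, smallest normal exponent -6, no infinities, maximum 448) and saturates on overflow; the function names are illustrative and this is not any library's API.

```python
import math

E4M3_MAX = 448.0       # largest finite E4M3 magnitude
MIN_NORMAL_EXP = -6    # smallest normal exponent (bias 7)
MANTISSA_BITS = 3      # 2**3 = 8 levels per binade

def round_to_e4m3(v):
    """Round v to the nearest value on the E4M3 grid, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    mag = min(abs(v), E4M3_MAX)
    # Subnormals share the exponent of the smallest normal binade.
    exp = max(math.floor(math.log2(mag)), MIN_NORMAL_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)          # grid spacing inside this binade
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), v)

def levels_in_binade(lo):
    """Count distinct grid values in [lo, 2*lo) by sampling the interval."""
    seen = {round_to_e4m3(lo + k * lo / 1024) for k in range(1024)}
    return len([x for x in seen if lo <= x < 2 * lo])

for v in (0.3, 3.0, 30.0, 300.0):
    print(f"{v:7.1f} -> {round_to_e4m3(v):8.4f}")
print("levels in [1, 2):", levels_in_binade(1.0))
```

The spacing doubles with every binade, so the absolute error grows with magnitude while the relative error stays roughly constant. The integer-rounding model used in the rest of this section flattens that into one uniform step per scale.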

Strategy	Scale count (4096-channel tensor)	Outlier damage	Used by
Per-tensor	1 FP32 scalar	Whole tensor crushed	Naive FP8 (§10.2 — fails)
Per-row (per-token)	1 scale per row	Per-token contamination	Some early FP8 systems
1×128 tile (activations)	32 scales per row	Contained to one tile	DeepSeek-V3
128×128 block (weights)	1 scale per 16k weights	Contained to one patch	DeepSeek-V3
Per-element	1 scale per element	Zero damage, but no compression	Not viable

Intuition: Many Local Rulers Beat One Global Ruler

Imagine you are asked to measure the heights of every object in a warehouse — pencils, boxes, forklifts, an industrial crane — and write each measurement on a card with only three digits of precision. You have a single ruler and you have to choose its length up front.

A short ruler — say 1 metre — gives you millimetre precision on pencils and boxes, but the crane does not fit. Your only option is to write "1000+" for the crane and lose the information.

A long ruler — say 30 metres — fits the crane (30 m written as 29.7) but now everything else has 100-mm precision. The pencil and the box are indistinguishable. Per-tensor FP8 is the long-ruler strategy: one global scale chosen to fit the biggest value, and every smaller value loses its resolution.

The fine-grained answer is obvious once you see it: do not use one ruler. Issue a separate ruler to every workstation. The pencil bench gets a 30 cm ruler with millimetre marks. The pallet area gets a 3 m ruler with centimetre marks. The crane gets its own giant ruler. Each ruler still has only three digits, but each ruler is calibrated to the local range. Globally, you measure everything with maximum local resolution. Locally, you give up nothing.

The cost of many rulers. Every workstation needs to remember which ruler it is using: that is the scale stored per block. For a (1024, 4096) tensor with 1×128 tiles, that is 32 768 extra FP32 scalars, about 128 KB. The tensor itself in FP8 is 4 MB. So roughly 3% extra memory buys a large reduction in quantisation error on outlier-heavy tensors, which the worked example and the demos later in this section quantify.

The picture extends to weights. A weight matrix in a transformer FFN has its own outliers — usually tied to specific input channels. DeepSeek-V3 quantises weights in 128×128 patches: each patch acts as a small "neighbourhood" with its own ruler. The patch is larger than the activation tile because weights are accessed in a different pattern by the GEMM kernel (this is the topic of §10.4), but the philosophy is identical.

The Block-Wise Quantisation Equations

Let $X \in \mathbb{R}^{R \times C}$ be the tensor we want to quantise. Partition it into rectangular blocks of shape $B_r \times B_c$ . Call the block at row-block $i$ , column-block $j$ the matrix $X^{(i,j)} \in \mathbb{R}^{B_r \times B_c}$ . For each block we compute one FP32 scale:

$s^{(i,j)} = \frac{\max_{p,q} |X^{(i,j)}_{p,q}|}{F_{\max}}$

where $F_{\max} = 448$ for E4M3. The block is then quantised:

$Q^{(i,j)}_{p,q} = \text{round}\!\left(\frac{X^{(i,j)}_{p,q}}{s^{(i,j)}}\right) \in [-F_{\max}, +F_{\max}]$

The reconstruction (dequantised) value is $\hat{X}^{(i,j)}_{p,q} = s^{(i,j)} \cdot Q^{(i,j)}_{p,q}$ . The block-local quantisation error $\epsilon^{(i,j)} = \hat{X}^{(i,j)} - X^{(i,j)}$ is bounded by half a quantisation step:

$|\epsilon^{(i,j)}_{p,q}| \leq \frac{s^{(i,j)}}{2}$

Read that bound carefully. The error in a block is proportional to that block's scale, which is proportional to that block's amax. A block with no outliers has a small amax → small scale → tiny error. A block with one outlier has a large amax → large scale → large error everywhere in that block. The damage is confined to the block that contains the outlier. That is the entire mathematical content of fine-grained quantisation.
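The bound can be checked numerically. The sketch below quantises one clean block and one block with an outlier on the integer-grid model and reports the worst error as a fraction of $s/2$. The data is synthetic and the helper names are illustrative.

```python
import random

FP8_MAX = 448.0

def quantise_block(values):
    """Return (scale, reconstruction) for one block with a single shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_MAX if amax > 0 else 1.0
    recon = [scale * max(-FP8_MAX, min(FP8_MAX, round(v / scale))) for v in values]
    return scale, recon

def worst_error_ratio(values):
    """max |x_hat - x| divided by the bound s/2; should never exceed 1."""
    scale, recon = quantise_block(values)
    return max(abs(r - v) for r, v in zip(recon, values)) / (scale / 2)

random.seed(1)
clean = [random.gauss(0, 1) for _ in range(128)]
dirty = clean[:-1] + [220.0]          # same block with one outlier swapped in

for name, block in (("clean", clean), ("with outlier", dirty)):
    s, _ = quantise_block(block)
    print(f"{name:13s} scale={s:.4f}  bound={s / 2:.4f}  "
          f"worst/bound={worst_error_ratio(block):.3f}")
```

Both blocks respect the bound; what changes is the bound itself, which grows in proportion to the block's amax.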

Memory accounting

For a tensor of shape $(R, C)$ with blocks of shape $(B_r, B_c)$ , the storage is:

$\text{bytes} = \underbrace{R \cdot C}_{\text{FP8 payload}} + \underbrace{4 \cdot \lceil R / B_r \rceil \cdot \lceil C / B_c \rceil}_{\text{FP32 scales}}$

The overhead ratio is $4 / (B_r \cdot B_c)$ when the block shape divides the tensor evenly. At $B_r = B_c = 128$ (DeepSeek weight policy) this is $4 / 16384 \approx 0.024\%$, which is negligible. At $(B_r, B_c) = (1, 128)$ (DeepSeek activation policy) this is $4 / 128 \approx 3.1\%$, which is small. At per-element $(1, 1)$ it would be 400%, worse than just keeping FP32. The block size is where the trade-off lives.
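The storage formula translates directly into a few lines of code. This is a small sketch with an illustrative function name, assuming one byte per FP8 element and four bytes per FP32 scale as in the formula above.

```python
import math

def fp8_storage_bytes(rows, cols, block_rows, block_cols):
    """Return (FP8 payload bytes, FP32 scale bytes) for a block-quantised tensor."""
    payload = rows * cols                                   # 1 byte per element
    n_scales = math.ceil(rows / block_rows) * math.ceil(cols / block_cols)
    return payload, 4 * n_scales                            # 4 bytes per scale

policies = (("1x128 (activations)", (1, 128)),
            ("128x128 (weights)", (128, 128)),
            ("per-element", (1, 1)))
for name, block in policies:
    payload, scales = fp8_storage_bytes(1024, 4096, *block)
    print(f"{name:20s} payload={payload:>9} B  scales={scales:>9} B  "
          f"overhead={scales / payload:.3%}")
```

The `math.ceil` calls matter only when a dimension is not a multiple of the block size, in which case the ragged edge still costs one scale per partial block.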

Why per-token and per-channel are NOT enough

Some early FP8 systems used per-row scaling for activations: one scale per token. That works when the outliers are confined to a few rows. But outlier channels (fixed column indices that misbehave in every row) defeat per-row scaling, because every row contains that one big column.

A different early strategy used per-channel scaling: one scale per column. That contains outlier channels, but it is defeated in turn by outlier tokens, and it creates a problem for the matmul kernel. The contraction axis of an activation × weight GEMM is the channel axis, and if the scale changes with every element along the contraction axis, the kernel cannot accumulate raw FP8 products and factor the scale out afterwards.

DeepSeek-V3's 1×128 tile splits the difference: one scale per (token, 128-channel group). An outlier channel only affects its own 128-group, an outlier token only affects its own row, and the contraction-axis tiling matches the GEMM's K-block, so the kernel can hold the scale constant for a whole 128-element stretch of the K-loop.
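The three granularities can be compared on a toy tensor that has both an outlier channel and an outlier token. This is a sketch on synthetic data with the integer-grid model; the tensor size, the 1×8 tile (a miniature stand-in for 1×128), and the magnitudes are all made up for illustration.

```python
import math
import random

FP8_MAX = 448.0

def roundtrip(tensor, br, bc):
    """Quantise then dequantise with one scale per (br x bc) block."""
    R, C = len(tensor), len(tensor[0])
    out = [[0.0] * C for _ in range(R)]
    for r0 in range(0, R, br):
        for c0 in range(0, C, bc):
            cells = [(r, c) for r in range(r0, min(r0 + br, R))
                     for c in range(c0, min(c0 + bc, C))]
            amax = max(abs(tensor[r][c]) for r, c in cells)
            scale = amax / FP8_MAX if amax > 0 else 1.0
            for r, c in cells:
                q = max(-FP8_MAX, min(FP8_MAX, round(tensor[r][c] / scale)))
                out[r][c] = scale * q
    return out

def rmse(a, b):
    n = len(a) * len(a[0])
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return math.sqrt(total / n)

random.seed(0)
R, C = 64, 64
x = [[random.gauss(0, 1) for _ in range(C)] for _ in range(R)]
x[7] = [100.0 * v for v in x[7]]   # outlier token: one row is large everywhere
for row in x:
    row[3] = 100.0                 # outlier channel: one column is large in every row

results = {
    "per-row (1 x C)":     rmse(x, roundtrip(x, 1, C)),
    "per-channel (R x 1)": rmse(x, roundtrip(x, R, 1)),
    "1 x 8 tile":          rmse(x, roundtrip(x, 1, 8)),
}
for name, err in results.items():
    print(f"{name:20s} RMSE = {err:.4f}")
```

Per-row scaling is hurt by the outlier column in every row, per-channel scaling is hurt by the outlier row in every column, and the tile confines each kind of outlier to the tiles that actually contain it.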

Manual Numerical Walkthrough

Eight numbers. Two strategies. One outlier. All arithmetic by hand so you can verify every step.

Worked example: per-tensor vs 1×4 tile, one outlier

Setup. The row of activations is $x = [0.5, -0.7, 0.3, 0.9, \;\; 224, 0.1, -0.4, 0.2]$. Seven ordinary values with magnitude below 1, one outlier at 224. We model FP8 as integer rounding clamped to ±448 (this is the simplest grid that shows the effect; real E4M3 has 8 mantissa levels per binade, so its spacing grows with magnitude, but the way an outlier inflates a shared scale is the same).

Strategy A — per-tensor (one block of 8). $\text{amax} = 224$ , so $s = 224/448 = 0.5$ . Quantise:

$0.5 / 0.5 = 1.0 \to 1, \; \hat{x} = 0.5$ (exact)
$-0.7 / 0.5 = -1.4 \to -1, \; \hat{x} = -0.5$ (err = 0.2)
$0.3 / 0.5 = 0.6 \to 1, \; \hat{x} = 0.5$ (err = 0.2)
$0.9 / 0.5 = 1.8 \to 2, \; \hat{x} = 1.0$ (err = 0.1)
$224 / 0.5 = 448 \to 448, \; \hat{x} = 224$ (exact — sits on the ceiling)
$0.1 / 0.5 = 0.2 \to 0, \; \hat{x} = 0$ (err = 0.1)
$-0.4 / 0.5 = -0.8 \to -1, \; \hat{x} = -0.5$ (err = 0.1)
$0.2 / 0.5 = 0.4 \to 0, \; \hat{x} = 0$ (err = 0.2)

Sum of squared errors: $0 + 0.04 + 0.04 + 0.01 + 0 + 0.01 + 0.01 + 0.04 = 0.15$ . RMSE $= \sqrt{0.15/8} \approx 0.137$ . Notice that the 0.1 value rounded to zero — a 100% relative error — and the 0.2 value also rounded to zero. Two of eight values destroyed.

Strategy B — 1×4 tile (two blocks of 4). Block 1: $[0.5, -0.7, 0.3, 0.9]$ , amax = 0.9, $s_1 = 0.9 / 448 \approx 0.00201$ . Block 2: $[224, 0.1, -0.4, 0.2]$ , amax = 224, $s_2 = 224/448 = 0.5$ .

Quantise block 1 with $s_1$ :

$0.5 / s_1 = 248.9 \to 249, \; \hat{x} = 0.5002$
$-0.7 / s_1 = -348.4 \to -348, \; \hat{x} = -0.6991$
$0.3 / s_1 = 149.3 \to 149, \; \hat{x} = 0.2993$
$0.9 / s_1 = 448.0 \to 448, \; \hat{x} = 0.9000$

Errors in block 1 are all ≤ 0.001 — essentially perfect reconstruction. Block 2's errors are unchanged from strategy A on those same four entries (s = 0.5).

New sum of squared errors: $\approx 0 + 0 + 0 + 0 + 0 + 0.01 + 0.01 + 0.04 = 0.06$ . RMSE $\approx \sqrt{0.06/8} \approx 0.087$ . Lower than strategy A — and crucially, the four well-behaved values regained their full precision. The outlier's damage is contained to its block.
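The hand arithmetic above can be replayed in a few lines. The sketch uses the same eight numbers and the same integer-grid model; the helper names are illustrative.

```python
import math

FP8_MAX = 448.0
x = [0.5, -0.7, 0.3, 0.9, 224.0, 0.1, -0.4, 0.2]

def roundtrip(block):
    """Quantise and dequantise one block with a single shared scale."""
    scale = max(abs(v) for v in block) / FP8_MAX
    return [scale * max(-FP8_MAX, min(FP8_MAX, round(v / scale))) for v in block]

def rmse(orig, recon):
    return math.sqrt(sum((r - v) ** 2 for r, v in zip(recon, orig)) / len(orig))

strategy_a = roundtrip(x)                          # one block of 8
strategy_b = roundtrip(x[:4]) + roundtrip(x[4:])   # two 1x4 tiles

print(f"per-tensor RMSE = {rmse(x, strategy_a):.3f}")   # 0.137
print(f"1x4 tile   RMSE = {rmse(x, strategy_b):.3f}")   # 0.087
```

Moving the outlier to a different index, or shrinking the tile to 1×2, is a quick way to see how the error follows the block that contains the outlier.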

Reading the numbers. Strategy A wastes FP8 precision on values it could have represented almost perfectly, because they had to share a scale with the outlier. Strategy B does not. On a real (16 384 × 4096) activation tensor with real outlier channels the effect compounds: per-tensor achieves ~12 dB SNR and 1×128 achieves ~33 dB, and that 21 dB gap is the difference between a model that trains and a flat loss curve.

Visualizing Where the Damage Goes

The widget below shows a 16×32 activation tensor with three outlier channels and one catastrophic single-cell spike. Switch between the four scaling strategies and watch the dashed block boundaries shrink. The error heatmap reveals where the damage accumulates: with per-tensor scaling almost every cell is contaminated; with 1×128 tile scaling only the tile containing the catastrophic cell is hurt; the rest of the tensor regains near-FP8 precision.

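[Interactive widget: fine-grained quantisation visualizer, showing the reconstructed tensor and the quantisation-error heatmap under each scaling strategy]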

Three things to try in the widget. (1) Toggle between Reconstructed tensor and Quantisation error — under per-tensor scaling the error view lights up almost uniformly, because the same coarse scale damages everywhere. (2) Switch to 1×128 tile and notice that only a single tile in row 3 (the one containing the catastrophic cell) stays bright in the error view. (3) Compare the Signal / noise number across strategies — every +6 dB is roughly one extra bit of effective precision; the gap between per-tensor and tile-wise is typically 15–25 dB, equivalent to 2–4 extra bits.

Plain Python: Per-Tensor vs Block-Wise

Pure Python, no NumPy, no PyTorch. The whole machinery of FP8 block-wise quantisation in under 40 lines. Read this once and the DeepSeek paper's quantisation diagrams stop looking like magic — they are this loop, vectorised on a GPU.

fp8_blockwise_python.py: notes on each part of the listing, followed by the full code.

Why 448? The E4M3 ceiling (FP8_MAX)

FP8 E4M3 (4 exponent bits, 3 mantissa bits) can represent magnitudes from roughly 2⁻⁹ up to 448. Anything larger saturates or becomes NaN, because E4M3 has no infinity encoding; anything much smaller than 2⁻⁹ underflows to zero. This single number, 448, drives everything in this section. The scale exists to map each block's largest magnitude onto this ceiling, so that the block uses the full FP8 range without overflowing.

FP8_MAX = 448.0 (E4M3 ceiling)

One block, one scale: the atomic operation (quantise_block)

The block is the unit of FP8 packaging. We compute a single FP32 scale that pushes the block's biggest absolute value to 448 exactly. Everything in the block is divided by this scale before rounding, then multiplied back when read. The whole game of fine-grained quantisation is choosing the block boundary.

amax = max(|v|) inside the block, FP32
scale = amax / 448, FP32

Round-then-clamp: the FP8 quantisation grid

After dividing by scale, every value sits in [-448, +448]. We round to the FP8 grid (here modelled as integer rounding for clarity) and clamp at the ends. The reconstruction recon = scale × q is the closest value on that grid to the original. The error err = recon − v is what we are trying to make small, across the whole tensor, by choosing block boundaries wisely.

q = FP8-grid integer, list
recon = scale × q, FP32 reconstruction
err = recon − original, per-element error

Sweep blocks of arbitrary shape (quantise)

Two nested loops walk the tensor in (br × bc) tiles. The block argument controls everything: (16, 32) is the whole tensor (one scale). (1, 32) is per-row, i.e. per-token. (1, 8) is a tile, the DeepSeek-V3 activation strategy at miniature scale. (128, 128) on a real weight matrix is the DeepSeek weight strategy. Same function, every regime.

br, bc = block rows × cols
scales = list of ((r0,c0), s) pairs, one per block

Synthetic activations with realistic outliers

Real transformer activations are not iid Gaussian: Dettmers et al. (2022) showed that a few specific channels carry magnitudes 10–100× larger than the rest. We replicate that: columns 5 and 18 host outlier channels, and cell (3, 12) is a single catastrophic spike at 220. This is the workload that breaks per-tensor FP8.

x = 16×32 list of lists, mostly N(0,1)
x[3][12] = 220.0 (the bomb)

Three quantisation regimes, same tensor

(16, 32): one scale, dominated by the 220. amax = 220 forces scale = 220/448 ≈ 0.491, so every value is snapped to a multiple of 0.491 and anything with magnitude below about 0.246 rounds to zero; for N(0,1) data that is roughly a fifth of the tensor. (1, 32): sixteen scales. Row 3 still has the bomb; the other fifteen rows escape it, although each still contains the two outlier channels, so RMSE drops by roughly half. (1, 8): sixty-four scales. Now the bomb contaminates a single 1×8 tile and each outlier channel contaminates only the tile it sits in. The 31 tiles with no outlier enjoy nearly full precision, and RMSE drops again. Finer granularity buys lower error in exchange for a few more FP32 scalars.

Approximate results on the integer-grid model (run the script for exact digits; the per-tensor RMSE cannot exceed s/2 ≈ 0.25):

per-tensor (1 scale): RMSE ≈ 0.14
per-row (16 scales): RMSE ≈ 0.07
1×8 tile (64 scales): RMSE ≈ 0.04

```python
import math
import random

FP8_MAX = 448.0   # E4M3 maximum representable magnitude

def quantise_block(values):
    """Symmetric block quantisation: one FP32 scale per block."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_MAX if amax > 0 else 1e-30
    # Round to the FP8 integer grid (8 mantissa levels per binade in E4M3;
    # we model that here as integer rounding after scaling; exact behaviour
    # depends on the format, but the headline is identical).
    q = [max(-FP8_MAX, min(FP8_MAX, round(v / scale))) for v in values]
    recon = [scale * x for x in q]
    err = [r - v for r, v in zip(recon, values)]
    return scale, recon, err

def quantise(tensor, block):
    """Quantise a 2-D tensor (list of lists) with rectangular blocks."""
    R, C = len(tensor), len(tensor[0])
    br, bc = block
    recon = [[0.0] * C for _ in range(R)]
    scales = []
    for r0 in range(0, R, br):
        for c0 in range(0, C, bc):
            # Flatten one (br x bc) block, quantise it, then scatter it back.
            vals = [tensor[r0 + i][c0 + j]
                    for i in range(br) for j in range(bc)]
            s, rec, _ = quantise_block(vals)
            scales.append(((r0, c0), s))
            k = 0
            for i in range(br):
                for j in range(bc):
                    recon[r0 + i][c0 + j] = rec[k]
                    k += 1
    return recon, scales

def rmse(a, b):
    n = len(a) * len(a[0])
    s = sum((a[r][c] - b[r][c]) ** 2 for r in range(len(a)) for c in range(len(a[0])))
    return math.sqrt(s / n)

# Build a 16 x 32 activation tensor: mostly N(0,1), with two outlier
# channels and a single catastrophic cell, mirroring real LLM activations.
random.seed(0)
x = [[random.gauss(0, 1) for _ in range(32)] for _ in range(16)]
for r in range(16):
    x[r][5]  = 80 + random.random() * 40    # outlier channel
    x[r][18] = 60 + random.random() * 30    # outlier channel
x[3][12] = 220.0                            # catastrophic single cell

per_tensor, s_pt = quantise(x, (16, 32))    # 1 scale total
per_row,    s_pr = quantise(x, (1, 32))     # 16 scales (one per token)
tile_1x8,   s_t  = quantise(x, (1, 8))      # 64 scales (DeepSeek-style)

print(f"per-tensor   RMSE = {rmse(x, per_tensor):.3f}   ({len(s_pt)} scales)")
print(f"per-row      RMSE = {rmse(x, per_row):.3f}   ({len(s_pr)} scales)")
print(f"1x8 tile     RMSE = {rmse(x, tile_1x8):.3f}   ({len(s_t)} scales)")
```
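It can help to count which 1×8 tiles of the demo tensor actually contain an outlier. The sketch below rebuilds the same 16×32 tensor and flags every tile whose amax exceeds a cut-off. The cut-off of 10 and the helper name `tile_amax` are assumptions chosen for illustration; the N(0,1) entries essentially never reach that value.

```python
import random

# Rebuild the demo tensor: N(0,1) bulk, two outlier channels, one spike.
random.seed(0)
x = [[random.gauss(0, 1) for _ in range(32)] for _ in range(16)]
for r in range(16):
    x[r][5] = 80 + random.random() * 40
    x[r][18] = 60 + random.random() * 30
x[3][12] = 220.0

def tile_amax(tensor, bc):
    """Return {(row, first_col): amax} for every 1 x bc tile."""
    return {(r, c0): max(abs(v) for v in row[c0:c0 + bc])
            for r, row in enumerate(tensor)
            for c0 in range(0, len(row), bc)}

THRESHOLD = 10.0   # hypothetical cut-off separating outliers from the bulk
amax = tile_amax(x, 8)
dirty = sorted(k for k, a in amax.items() if a > THRESHOLD)
clean = len(amax) - len(dirty)

print(f"{len(amax)} tiles: {len(dirty)} carry an outlier, {clean} are clean")
print("tile holding the 220 spike:", [k for k in dirty if amax[k] == 220.0])
```

Half of the tiles still share a scale with an outlier channel, which is why the tile-wise RMSE in this toy is dominated by those tiles and not by the clean ones.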

PyTorch: DeepSeek-Style Tile and Block Scaling

Same logic, vectorised across a real-sized activation tensor with the exact (1, 128) and (128, 128) granularities DeepSeek-V3 uses. The reshape-and-permute trick collapses the per-block loop into a handful of vectorised PyTorch ops: no Python iteration, no per-tile kernel launch. On a GPU such as an H100 the whole round-trip on a 1024×4096 tensor should take well under a millisecond.

fp8_blockwise_torch.py: notes on each part of the listing, followed by the full code.

Why PyTorch for what is essentially indexing?

PyTorch buys us two things on top of the Python version: vectorised reshape/permute over millions of elements (the GPU does in microseconds what a Python loop would take minutes for), and direct interoperability with real models, since you can drop this function into a forward hook to study activations from any HuggingFace model. DeepSeek's actual implementation is a custom CUDA kernel, but this pure-PyTorch version captures the block-scaling logic.

Function signature: tensor in, tensor + scales out

Two outputs because real FP8 pipelines store both the quantised tensor and its scales; the scales are read back during the GEMM accumulation step (next section). For an (R, C) activation with (1, 128) tiles, we get (R, C/128) FP32 scales, which is about 3.1% of the FP8 payload (0.8% of the FP32 original). The scale memory is the price of fine granularity.

x = input tensor, shape (R, C), FP32 or BF16
block = (br, bc) tile shape

Reshape into tile-major layout

The key move. We view (R, C) as (nR, br, nC, bc), a 4-D tensor where the second and fourth dims index positions inside one tile. Then permute to (nR, nC, br, bc) so each tile is contiguous in memory. After this, .amax(dim=(-1,-2)) collapses each tile down to one scalar in a single op. No Python loop, no kernel launches per tile.

tiles = shape (R/br, br, C/bc, bc) after view
tiles (permuted) = shape (R/br, C/bc, br, bc)

amax per tile → FP32 scale per tile

amax is the absolute maximum inside each tile. Dividing by 448 gives the FP32 multiplier that maps that tile's biggest value exactly to the FP8 ceiling. The .clamp(min=1e-12) guards against a tile of all zeros: exact zero everywhere would otherwise produce scale = 0, and 0/0 yields NaN (not a crash), which then poisons the reconstruction. All-zero tiles are uncommon but do occur in practice (padding tokens, ReLU-style non-linearities), so the clamp is cheap insurance.

amax = shape (nR, nC, 1, 1)
scale = amax / 448, shape (nR, nC, 1, 1)

Quantise + clamp + dequantise

Two lines, one round-trip, four steps. (1) Divide every tile by its scale, so each tile now sits roughly in [-448, +448]. (2) Clamp at the ends to handle the rare case where a value lands slightly past 448 due to floating-point rounding. (3) Round to integers, which is our stand-in for the FP8 quantisation grid. Real FP8 has 8 mantissa levels per binade, not a uniform integer step, but integer rounding shows the same outlier behaviour without dragging in floating-point format details. (4) Multiply by scale to dequantise: recon is now the closest value on the simulated grid to x, in FP32 storage.

q = FP8 grid integers, shape (nR, nC, br, bc)
recon = scale × q, FP32 reconstruction

Restore the original layout

Undo the permute and view. The final tensor has the same shape as the input (R, C), and elementwise it is the simulated FP8 round-trip of x. The scales we return are (nR, nC), one FP32 per tile. Downstream GEMM code reads these scales next to the FP8 matrix during the inner-product accumulation; we cover that flow in §10.4.

recon = shape (R, C), FP32
scale (returned) = shape (nR, nC)

Build a realistic activation tensor

1024 × 4096 is a plausible activation shape for a small transformer block (1024 tokens, hidden size 4096). Two outlier channels (137 and 901) and one catastrophic cell mimic the published 'outlier channel' phenomenon. Without these injections the demo would understate the problem, because a clean N(0,1) tensor would barely tell the strategies apart.

A = shape (1024, 4096), FP32

Three granularities, one tensor

Run the same quantiser at three granularities and compare. The activation policy (1, 128) makes 32 768 scales, about 3.1% extra memory on top of the FP8 payload. The weight-style (128, 128) policy makes 256 scales for the same tensor: fewer scales, larger blocks, more outlier contamination. Per-tensor (1024, 4096) makes one scale and is the failure mode we are running away from.

scales_act = (1024, 32), the DeepSeek activation choice
scales_w = (8, 32), the DeepSeek weight choice
scales_pt = (1, 1), the per-tensor baseline

SNR in decibels: the headline metric

Signal-to-quantisation-noise ratio is the standard quality measure for any quantiser. Every +6 dB ≈ one extra bit of effective precision. On this synthetic workload with the integer-grid model, per-tensor lands at roughly 20 dB, the 128×128 block in the mid-30s, and the 1×128 tile in the mid-40s; treat these as estimates and run the script for exact values. The gap of roughly 25 dB between per-tensor and 1×128 is about four bits of effective precision, and on real activations a gap of that size separates training that converges from training that stalls.

Approximate results:

per-tensor: SNR ≈ 20 dB
1×128 (DeepSeek act): SNR ≈ 46 dB
128×128 (DeepSeek weight): SNR ≈ 36 dB

```python
import torch

FP8_MAX = 448.0  # E4M3

def quantise_fp8_blockwise(x: torch.Tensor, block: tuple[int, int]) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric block-wise FP8 simulation: returns (dequantised tensor, FP32 scales)."""
    R, C = x.shape
    br, bc = block
    assert R % br == 0 and C % bc == 0, "tensor must be divisible by block size"

    # 1. Reshape into (R/br, br, C/bc, bc) so each (br, bc) tile is contiguous.
    tiles = x.view(R // br, br, C // bc, bc)            # (nR, br, nC, bc)
    tiles = tiles.permute(0, 2, 1, 3).contiguous()      # (nR, nC, br, bc)

    # 2. amax per tile -> FP32 scale per tile.
    amax = tiles.abs().amax(dim=(-1, -2), keepdim=True) # (nR, nC, 1, 1)
    scale = (amax / FP8_MAX).clamp(min=1e-12)           # avoid div-by-zero

    # 3. Quantise + clamp to FP8 range, then dequantise.
    q = (tiles / scale).clamp(-FP8_MAX, FP8_MAX).round()
    recon = q * scale                                   # (nR, nC, br, bc)

    # 4. Restore the original (R, C) layout.
    recon = recon.permute(0, 2, 1, 3).reshape(R, C)
    return recon, scale.squeeze(-1).squeeze(-1)         # (nR, nC) scales

# Build a realistic 1024 x 4096 activation tensor with outlier channels.
torch.manual_seed(0)
A = torch.randn(1024, 4096)
A[:, 137] *= 30.0     # outlier channel 1
A[:, 901] *= 50.0     # outlier channel 2
A[42, 2719] = 220.0   # single catastrophic cell

# DeepSeek-V3 activation policy: 1 x 128 tile.
A_recon_act, scales_act = quantise_fp8_blockwise(A, block=(1, 128))

# DeepSeek-V3 weight policy: 128 x 128 block (simulated on the same matrix
# to illustrate the granularity; in practice this is applied to W, not A).
A_recon_w, scales_w = quantise_fp8_blockwise(A, block=(128, 128))

# Per-tensor baseline (the failure mode).
A_recon_pt, scales_pt = quantise_fp8_blockwise(A, block=(1024, 4096))

def snr_db(x, x_hat):
    return 10 * torch.log10((x ** 2).sum() / ((x - x_hat) ** 2).sum().clamp(min=1e-30))

print(f"per-tensor       SNR = {snr_db(A, A_recon_pt):6.1f} dB  scales: {scales_pt.numel():>6}")
print(f"1x128 tile (act) SNR = {snr_db(A, A_recon_act):6.1f} dB  scales: {scales_act.numel():>6}")
print(f"128x128 block    SNR = {snr_db(A, A_recon_w):6.1f} dB  scales: {scales_w.numel():>6}")
```

At Massive Scale: DeepSeek-V3's FP8 GEMM

DeepSeek-V3 trains a 671 B-parameter Mixture-of-Experts model on 14.8 T tokens, with most GEMMs executed in FP8. That works only because the team co-designed the FP8 format, the block layout, and a custom CUDA kernel that handles the scales correctly during the inner-product accumulation. The key facts:

Tensor	Format	Block shape	Scale layout	Rationale
Activations (forward)	E4M3	1 × 128	(R, C/128) FP32	Outlier channels are absorbed per-128-group
Activation gradients (backward)	E4M3	1 × 128	(R, C/128) FP32	DeepSeek-V3 reports E4M3 for all tensors; earlier recipes used E5M2 here
Weights	E4M3	128 × 128	(R/128, C/128) FP32	Patches match GEMM block tile
Weight gradients	FP32	—	—	Master copy stays in FP32 (§10.5)
Optimizer state (m, v)	FP32 / BF16	—	—	Adam moments need precision

Two design points are worth highlighting. First, the tile shape $(1, 128)$ for activations is chosen so that each 128-element stretch of the GEMM's K dimension shares one tile scale: the kernel multiplies and accumulates 128 FP8 products, multiplies that partial sum by the FP32 scales, and adds the result to an FP32 accumulator. No re-scaling mid-block, no scalar dependency chain inside the hot path. Second, the weight block $(128, 128)$ matches the canonical CUTLASS tile size on H100 for FP8 matmul, so each warp consumes exactly one weight scale per output tile.
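The scale-per-K-block idea can be sketched for a single dot product. This is a pure-Python illustration on the integer-grid model, not DeepSeek's kernel: the vectors, the outlier position, and the function names are invented, and one column of a 128×128 weight block is modelled as one scale per 128 weights.

```python
import math
import random

FP8_MAX = 448.0
K_BLOCK = 128   # quantisation tile length along the contraction (K) axis

def quantise_tiles(vec, tile=K_BLOCK):
    """Return (integer codes, one scale per tile) for a 1-D vector."""
    codes, scales = [], []
    for k0 in range(0, len(vec), tile):
        chunk = vec[k0:k0 + tile]
        amax = max(abs(v) for v in chunk)
        s = amax / FP8_MAX if amax > 0 else 1.0
        scales.append(s)
        codes.extend(round(v / s) for v in chunk)
    return codes, scales

def block_scaled_dot(qa, sa, qw, sw, tile=K_BLOCK):
    """Accumulate one K-block at a time and apply the scales once per block."""
    acc = 0.0
    for b, k0 in enumerate(range(0, len(qa), tile)):
        # Inner loop works on raw codes only; no scale appears inside it.
        partial = sum(a * w for a, w in zip(qa[k0:k0 + tile], qw[k0:k0 + tile]))
        acc += partial * sa[b] * sw[b]
    return acc

random.seed(0)
K = 512
a = [random.gauss(0, 1) for _ in range(K)]
a[137] = 60.0                                   # outlier lands in the second tile only
w = [random.gauss(0, 0.02) for _ in range(K)]

qa, sa = quantise_tiles(a)
qw, sw = quantise_tiles(w)
approx = block_scaled_dot(qa, sa, qw, sw)
exact = sum(x * y for x, y in zip(a, w))
print(f"exact = {exact:.6f}   block-scaled = {approx:.6f}")
print("activation scales per K-block:", [f"{s:.4f}" for s in sa])
```

Because both scales are constant inside a K-block, they factor out of the inner sum, which is what makes this blocking compatible with accumulating raw low-precision products first and rescaling afterwards.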

What changes from the toy demo to a 671 B model

Per-tile scales become a tensor of their own. On a (16384, 4096) activation, the scale tensor is (16384, 32) FP32, which is 2 MB. Across every activation cached for the backward pass in a deep model, scale tensors add up to a non-trivial amount of memory. They are not free, and they have to be moved between memory hierarchies just like the FP8 payload.
Scale computation is on the critical path. amax must be computed before the FP8 write-back. On H100, fused-amax-and-quantise kernels (the "TE" primitive in NVIDIA Transformer Engine) write the FP8 tensor and the scale in one pass over the activation. A naive two-pass implementation doubles activation memory bandwidth and is unusable.
Backward stays in E4M3. Activation gradients have a wider dynamic range than activations, which is why earlier mixed-format recipes used E5M2 (more exponent bits, fewer mantissa bits) for the backward pass. DeepSeek-V3 instead reports using E4M3 for all tensors, relying on the per-tile scales to supply the extra dynamic range. Each tile still gets its own scale; the choice is invisible to the forward-pass code but matters for stability.
The amax history matters. Some FP8 systems (NVIDIA Transformer Engine) use a delayed scaling policy: compute amax during forward, store it, then use the previous step's amax for the current step's quantisation. This trades a small amount of accuracy (the amax can change by ±20% across steps) for the ability to fuse the FP8 write with the GEMM. DeepSeek-V3 uses just-in-time scaling — compute amax this step, use it this step — paying the bandwidth cost to avoid the accuracy hit.
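The difference between delayed and just-in-time scaling can be shown with a toy trace of per-step amax values. This is a simplified model, not Transformer Engine's API: the history window, the bootstrap rule on the first step, and the trace itself are assumptions made up for illustration.

```python
def saturating_steps(amax_per_step, history_len=None):
    """Return the steps at which the true amax exceeds the amax used for scaling.

    history_len=None models just-in-time scaling (the current amax is used).
    Otherwise the scale is derived from the max of the previous
    `history_len` amax values, so it always lags one step behind.
    """
    clipped = []
    for step, current in enumerate(amax_per_step):
        if history_len is None or step == 0:
            used = current                  # JIT, or bootstrap on the first step
        else:
            window = amax_per_step[max(0, step - history_len):step]
            used = max(window)
        # If the real amax is larger than the one the scale was built from,
        # the largest values land beyond the FP8 ceiling and get clipped.
        if current > used:
            clipped.append(step)
    return clipped

amax_trace = [1.0, 1.2, 0.9, 3.0, 3.1]      # made-up per-step amax values
print("delayed scaling clips at steps:", saturating_steps(amax_trace, history_len=2))
print("JIT scaling clips at steps:    ", saturating_steps(amax_trace))
```

Production systems that use delayed scaling typically add headroom to the stale amax to make clipping rarer; the sketch leaves that out to keep the lag itself visible.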

Engineering Reality: Why 128 and Not 16 or 1024

The block size is a hyperparameter. Smaller blocks give better quantisation; larger blocks give better hardware throughput. Why does DeepSeek-V3 settle on exactly 128?

Block size	Outlier containment	Scale memory	GEMM alignment	Verdict
1 (per-element)	Perfect	+400% — disastrous	No useful kernel	Pointless
8	Excellent	+50%	Misaligned with K-loop	Too small
32	Very good	+12.5%	A quarter of the K-block	OK but suboptimal
128	Good	+3.1% (1×128) / 0.024% (128×128)	Matches H100 GEMM tile	DeepSeek-V3 choice
512	Mediocre	+0.78%	Larger than one GEMM tile	Outliers leak
4096 (per-row)	Per-token only	+0.1%	Whole-K scaling	Loses channel granularity

The choice is not arbitrary. On H100 the FP8 tensor-core (WGMMA) instructions consume the K dimension in small steps, and the kernel-level "K-block" over which partial sums are accumulated before being promoted to FP32 is 128 elements; the DeepSeek-V3 paper describes this interval as equivalent to four WGMMA instructions. Aligning the quantisation block with the K-block means the scale is constant across the entire inner loop for that block. Any smaller block forces a mid-loop scale change, which costs registers and breaks the dataflow. Any larger block lets one outlier's damage spread further than necessary.

Four engineering gotchas that the clean theory hides:

Tail padding. If $C$ is not a multiple of 128, the last tile is padded. The padding zeros do not affect amax, but the partial tile still needs its own scale, and kernels have to handle the ragged edge. Most training stacks ensure all dimensions are multiples of 128 by construction, which is one reason hidden sizes such as 4096, 8192, and 16384 are so common in modern LLM configs.
Mixed block shapes within one tensor. The same tensor can need different block shapes in different GEMMs, because the scale has to run along whichever axis is being contracted. Each such consumer requires re-quantisation with its own block shape. DeepSeek-V3 is not exempt: the paper describes activations stored as 1×128 tiles in the forward pass being dequantised, transposed, and re-quantised into 128×1 tiles for the backward pass.
The activation-gradient bug. If you quantise activation gradients with a block shape that doesn't align with the gradient's upstream tensor's block shape, the backward GEMM produces subtly wrong values that only show up as slow training divergence. The fix is to keep block shapes consistent across forward and backward — the DeepSeek paper explicitly documents this constraint.
Communication-bound quantisation. For tensor-parallel and expert-parallel comms, the FP8 payload plus its scales travels the network. Sending the scales as BF16 instead of FP32 halves the metadata bandwidth; some systems do this for AllReduce while keeping FP32 scales for GEMM. Whether the savings are worth the extra accuracy loss depends on the network — H100 NVLink can ship FP32 scales for free, but slower interconnects (PCIe, Ethernet) benefit from BF16 metadata.
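For the last point, the accuracy cost of shipping scales as BF16 can be estimated directly. The sketch below rounds an FP32 scale to the nearest bfloat16 value using bit manipulation from the standard library; the helper names and the example scales are illustrative, and round-to-nearest-even is an assumption about how the conversion is done.

```python
import struct

def to_bf16(x):
    """Round a Python float to the nearest bfloat16 value (returned as a float)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                    # round to nearest, ties to even
    bits &= 0xFFFF0000                                     # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def relative_error(scale):
    """Relative change in a scale (and so in every dequantised value) after BF16 rounding."""
    return abs(to_bf16(scale) - scale) / scale

scales = [0.9 / 448, 220 / 448, 2.7 / 448]   # example FP32 scales
for s in scales:
    print(f"fp32={s:.6e}  bf16={to_bf16(s):.6e}  rel.err={relative_error(s):.2e}")
print(f"worst case for BF16 rounding: {2 ** -8:.2e}")
```

Every value dequantised with a BF16 scale inherits that relative error, which is at most 2⁻⁸ (about 0.4%). E4M3's own rounding error can reach 2⁻⁴ (about 6%) of a value, so the scale rounding is a second-order effect per element, although it is a systematic error shared by the whole block and not independent noise.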

The deepest lesson. FP8 didn't fail in 2022 because the format was wrong — E4M3 is fine, the silicon is fast, the bandwidth savings are real. It failed because one global scale is the wrong abstraction for tensors whose magnitudes vary by orders of magnitude across positions. The DeepSeek-V3 contribution is not a new format or a new kernel; it is the observation that quantisation granularity is a design knob, and that the right setting — 1×128 for activations, 128×128 for weights — happens to align with the GEMM's natural tile structure on H100. The lesson generalises: every time a numerical method fails at scale, ask whether the failure is in the method or in the granularity at which the method is applied.