One activation channel out of four thousand can hold a value a hundred times larger than every other channel. If you give the whole tensor a single FP8 scale, that one channel decides what every other number is allowed to look like — and the rest get crushed to zero. DeepSeek-V3's answer is brutally simple: stop using one scale. Give every 1×128 strip of activations and every 128×128 patch of weights its own FP32 multiplier. That is fine-grained quantisation, and it is the only reason FP8 training works at 671 B parameters.
The Real Problem: One Outlier Burns Every Other Number
FP8 E4M3, the activation format used by DeepSeek-V3, NVIDIA H100 Transformer Engine, and most modern FP8 training stacks, has at most $2^8 = 256$ representable values. Its maximum magnitude is $448$. To store a tensor of FP32 values in FP8, the standard recipe is symmetric scaling:

$$s = \frac{\max_i |x_i|}{448}, \qquad q_i = \mathrm{round}\!\left(\frac{x_i}{s}\right), \qquad \hat{x}_i = s\, q_i$$

The scale $s$ is a single FP32 number that pushes the largest absolute value in the tensor exactly to the FP8 ceiling. Treating the FP8 grid as uniform with step $s$, everything smaller than $s/2$ in magnitude rounds to zero. So far, so good for well-behaved tensors.
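A minimal sketch of the symmetric recipe, assuming FP8 is modelled as integer codes clamped to ±448 (a simplification; real E4M3 is a floating-point grid). The helper name `quantise_per_tensor` and the sample values are illustrative, not taken from any library.

```python
FP8_MAX = 448.0  # largest finite E4M3 magnitude

def quantise_per_tensor(values):
    """Symmetric scaling with one scale for the whole list.

    Assumption: FP8 is modelled as integer codes in [-448, 448].
    """
    amax = max(abs(v) for v in values)
    scale = amax / FP8_MAX if amax > 0 else 1.0
    codes = [int(max(-FP8_MAX, min(FP8_MAX, round(v / scale)))) for v in values]
    return codes, scale

ordinary = [0.8, -0.3, 0.05, 0.6, 0.1, -0.2]

codes_clean, scale_clean = quantise_per_tensor(ordinary)
codes_dirty, scale_dirty = quantise_per_tensor(ordinary + [200.0])

# Count ordinary values that collapse to code 0 under each scale.
zeros_clean = sum(1 for c in codes_clean if c == 0)
zeros_dirty = sum(1 for c in codes_dirty[:len(ordinary)] if c == 0)

print(f"scale without outlier: {scale_clean:.5f}, zeros: {zeros_clean}")
print(f"scale with outlier:    {scale_dirty:.5f}, zeros: {zeros_dirty}")
```

Adding a single value of 200 raises the scale from about 0.0018 to about 0.45, and half of the ordinary values then land on code 0.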
Transformer activations are not well-behaved. Dettmers et al. (2022, LLM.int8()) measured them carefully and found something that the architecture papers never talk about: a small number of specific channels — sometimes just 0.1% of all channels — carry activation magnitudes 10× to 100× larger than the median. These outlier channels are not transient. They appear at the same indices across thousands of inputs. They grow more prominent as models get larger. By 6.7 B parameters, they dominate the per-tensor amax.
Zeroing out 80% of activations does not produce a NaN. It does not crash. It produces a model that does not learn — gradients are computed from these mostly-zero tensors, the optimiser pretends everything is normal, and after several thousand steps the loss curve simply refuses to come down. Engineering teams have wasted whole training runs chasing this exact bug. The first attempts at FP8 training in 2022–2023 all hit it.
And outliers are not the only failure mode. Even without a catastrophic cell, the tail of a Gaussian-shaped activation distribution wastes most of FP8's tiny dynamic range whenever one tile happens to contain a 4-σ value and the next contains none. Per-tensor scaling forces both tiles to share the same ruler, and the ruler has only 8 mantissa levels per binade. Two tiles, one ruler, eight levels — most values land between the same two FP8 codes.
| Strategy | Scale count (4096-channel tensor) | Outlier damage | Used by |
|---|---|---|---|
| Per-tensor | 1 FP32 scalar | Whole tensor crushed | Naive FP8 (§10.2 — fails) |
| Per-row (per-token) | 1 scale per row | Per-token contamination | Some early FP8 systems |
| 1×128 tile (activations) | 32 scales per row | Contained to one tile | DeepSeek-V3 |
| 128×128 block (weights) | 1 scale per 16k weights | Contained to one patch | DeepSeek-V3 |
| Per-element | 1 scale per element | Zero damage, but no compression | Not viable |
Intuition: Many Local Rulers Beat One Global Ruler
Imagine you are asked to measure the heights of every object in a warehouse — pencils, boxes, forklifts, an industrial crane — and write each measurement on a card with only three digits of precision. You have a single ruler and you have to choose its length up front.
A short ruler — say 1 metre — gives you millimetre precision on pencils and boxes, but the crane does not fit. Your only option is to write "1000+" for the crane and lose the information.
A long ruler, say 30 metres, fits the crane (recorded as 29.7 m), but now every reading has 100 mm precision. The pencil and the box are indistinguishable. Per-tensor FP8 is the long-ruler strategy: one global scale chosen to fit the biggest value, and every smaller value loses its resolution.
The fine-grained answer is obvious once you see it: do not use one ruler. Issue a separate ruler to every workstation. The pencil bench gets a 30 cm ruler with millimetre marks. The pallet area gets a 3 m ruler with centimetre marks. The crane gets its own giant ruler. Each ruler still has only three digits, but each ruler is calibrated to the local range. Globally, you measure everything with maximum local resolution. Locally, you give up nothing.
The picture extends to weights. A weight matrix in a transformer FFN has its own outliers — usually tied to specific input channels. DeepSeek-V3 quantises weights in 128×128 patches: each patch acts as a small "neighbourhood" with its own ruler. The patch is larger than the activation tile because weights are accessed in a different pattern by the GEMM kernel (this is the topic of §10.4), but the philosophy is identical.
The Block-Wise Quantisation Equations
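Let $X \in \mathbb{R}^{R \times C}$ be the tensor we want to quantise. Partition it into rectangular blocks of shape $B_r \times B_c$. Call the block at row-block $i$, column-block $j$ the matrix $X^{(i,j)}$. For each block we compute one FP32 scale:

$$s_{ij} = \frac{\max_{a,b} \left| X^{(i,j)}_{ab} \right|}{F_{\max}}$$

where $F_{\max} = 448$ for E4M3. The block is then quantised:

$$Q^{(i,j)} = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{X^{(i,j)}}{s_{ij}}\right),\; -F_{\max},\; F_{\max}\right)$$

The reconstruction (dequantised) value is $\hat{X}^{(i,j)} = s_{ij}\, Q^{(i,j)}$. Treating the FP8 grid as uniform, the block-local quantisation error is bounded by half a quantisation step:

$$\left| X^{(i,j)}_{ab} - \hat{X}^{(i,j)}_{ab} \right| \le \frac{s_{ij}}{2} = \frac{\max_{a,b} \left| X^{(i,j)}_{ab} \right|}{2\, F_{\max}}$$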
Read that bound carefully. The error in a block is proportional to that block's scale, which is proportional to that block's amax. A block with no outliers has a small amax → small scale → tiny error. A block with one outlier has a large amax → large scale → large error everywhere in that block. The damage is confined to the block that contains the outlier. That is the entire mathematical content of fine-grained quantisation.
Memory accounting
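For a tensor of shape $R \times C$ with blocks of shape $B_r \times B_c$, the storage is:

$$\text{bytes} = \underbrace{R\,C}_{\text{FP8 payload}} + \underbrace{4 \cdot \frac{R}{B_r} \cdot \frac{C}{B_c}}_{\text{FP32 scales}}$$

The overhead ratio is $4 / (B_r B_c)$. At $128 \times 128$ (DeepSeek weight policy) this is $4 / 16384 \approx 0.024\%$, which is negligible. At $1 \times 128$ (DeepSeek activation policy) this is $4 / 128 \approx 3.1\%$, which is small. At per-element granularity it would be 400%, worse than just keeping FP32 (5 bytes per value instead of 4). The block size is where the trade-off lives.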
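The accounting is easy to check with a few lines of Python. This is a sketch under the section's own assumptions (1 byte per FP8 element, 4 bytes per FP32 scale); the function names are illustrative.

```python
def ceil_div(a, b):
    """Integer ceiling division, so partial blocks still get a scale."""
    return -(-a // b)

def fp8_storage_bytes(rows, cols, block_rows, block_cols):
    """Return (payload_bytes, scale_bytes) for one block-quantised tensor."""
    payload = rows * cols  # 1 byte per FP8 element
    n_scales = ceil_div(rows, block_rows) * ceil_div(cols, block_cols)
    return payload, 4 * n_scales  # 4 bytes per FP32 scale

def overhead_ratio(block_rows, block_cols):
    """Scale bytes divided by payload bytes when blocks tile the tensor exactly."""
    return 4.0 / (block_rows * block_cols)

policies = {
    "128x128 (weights)": (128, 128),
    "1x128 (activations)": (1, 128),
    "per-element": (1, 1),
}
for name, (br, bc) in policies.items():
    print(f"{name:>20}: {overhead_ratio(br, bc):.3%}")

# A (16384, 4096) activation with 1x128 tiles.
payload, scales = fp8_storage_bytes(16384, 4096, 1, 128)
print(f"payload = {payload} bytes, scales = {scales} bytes")
```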
Why per-token and per-channel are NOT enough
Some early FP8 systems used per-row scaling for activations: one scale per token. That works when the outlier pattern is per-row. But outlier channels (fixed column indices that misbehave in every row) defeat per-row scaling, because every row contains that one big column.

A different early strategy used per-channel scaling: one scale per column. That handles outlier channels, but it creates a problem for the matmul kernel. The contraction axis of an activation × weight GEMM is the channel axis, and if the scale changes at every element along that axis, the kernel can no longer accumulate raw FP8 products and apply the scale afterwards.

DeepSeek-V3's 1×128 tile splits the difference: one scale per (token, 128-channel group). An outlier channel is confined to its own 128-group, an outlier token is confined to its own row, and the tiling along the contraction axis matches the GEMM's K-block, so the FP8 kernel can hold one scale in a register for the whole K-block.
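To make the outlier-channel case concrete, here is a small sketch that compares per-row scaling with tile scaling on a toy tensor. It assumes the integer-grid FP8 model, uses a tile width of 4 instead of 128 to keep the example readable, and places a hypothetical outlier channel at column 6. It does not model the GEMM constraint on per-channel scaling.

```python
FP8_MAX = 448.0

def roundtrip_group(values):
    """Quantise then dequantise a group of values that share one scale."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_MAX if amax > 0 else 1.0
    return [round(v / scale) * scale for v in values]

def roundtrip_row(row, tile):
    """Split a row into tiles of width `tile`, each with its own scale."""
    out = []
    for start in range(0, len(row), tile):
        out.extend(roundtrip_group(row[start:start + tile]))
    return out

def worst_clean_error(rows, tile, clean_cols=range(4)):
    """Largest absolute error over the columns that sit far from the outlier."""
    worst = 0.0
    for row in rows:
        rec = roundtrip_row(row, tile)
        for c in clean_cols:
            worst = max(worst, abs(row[c] - rec[c]))
    return worst

# Column 6 is a hypothetical outlier channel: large in every row.
rows = [
    [0.4, -0.7, 0.2, 0.9, 0.3, -0.5, 310.0, 0.6],
    [-0.1, 0.8, -0.6, 0.3, 0.7, 0.2, 290.0, -0.4],
]

per_row = worst_clean_error(rows, tile=8)   # one scale per row
per_tile = worst_clean_error(rows, tile=4)  # one scale per 1x4 tile
print(f"per-row worst error on clean columns:  {per_row:.4f}")
print(f"1x4 tile worst error on clean columns: {per_tile:.4f}")
```

Per-row scaling cannot help here because the outlier column appears in every row; tiling along the channel axis is what isolates it.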
Manual Numerical Walkthrough
Eight numbers. Two strategies. One outlier. All arithmetic by hand so you can verify every step.
Per-tensor vs 1×4 tile, one outlier
Setup. The row holds eight activations $x_1, \dots, x_8$: seven normal values around magnitude 1 and one outlier, $x_5 = 224$. We model FP8 as integer rounding clamped to ±448 (this is the cleanest grid that shows the effect; real E4M3 has 8 mantissa levels per binade, with the same headline behaviour).
Strategy A: per-tensor (one block of 8). $\text{amax} = 224$, so $s = 224 / 448 = 0.5$. Quantising each entry to the nearest multiple of $0.5$ gives these absolute errors:
- $x_1$: exact
- $x_2$: err = 0.2
- $x_3$: err = 0.2
- $x_4$: err = 0.1
- $x_5 = 224 \to 448$: exact, sits on the ceiling
- $x_6$: err = 0.1
- $x_7$: err = 0.1
- $x_8$: err = 0.2
Sum of squared errors: $3 \times 0.2^2 + 3 \times 0.1^2 = 0.15$. RMSE $= \sqrt{0.15 / 8} \approx 0.137$. Notice that the 0.1 value rounded to zero (a 100% relative error) and the 0.2 value also rounded to zero. Two of eight values destroyed.
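Strategy B: 1×4 tile (two blocks of 4). Block 1: the first four entries, amax = 0.9, $s_1 = 0.9 / 448 \approx 0.0020$. Block 2: the last four entries, amax = 224, $s_2 = 224 / 448 = 0.5$.
Quantise block 1 with $s_1 \approx 0.0020$: every entry now lands within half a step, $s_1 / 2 \approx 0.001$, of a representable value.
Errors in block 1 are all ≤ 0.001, which is essentially perfect reconstruction. Block 2's errors are unchanged from strategy A on those same four entries (s = 0.5).
New sum of squared errors: $\approx 0 + (0.1^2 + 0.1^2 + 0.2^2) = 0.06$. RMSE $= \sqrt{0.06 / 8} \approx 0.087$. Lower than strategy A, and crucially, the four well-behaved values regained their full precision. The outlier's damage is contained to its block.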
Reading the numbers. Strategy A wastes FP8 precision on values it could have represented perfectly, because they had to share a scale with the outlier. Strategy B does not. On a real (16 384 × 4096) activation tensor with real outlier channels, the gap compounds: per-tensor achieves ~12 dB SNR and 1×128 achieves ~33 dB. That 21 dB gap is the difference between a loss curve that converges and one that stays flat.
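The walkthrough's arithmetic can be checked mechanically. The sketch below uses a hypothetical row, chosen only so that it reproduces the error pattern listed for strategy A (it is an assumption, not necessarily the row used in the walkthrough), and the same integer-grid FP8 model.

```python
import math

FP8_MAX = 448.0

def block_errors(row, block):
    """Absolute round-trip error per entry, one scale per `block` entries."""
    errors = []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / FP8_MAX if amax > 0 else 1.0
        for v in chunk:
            code = max(-FP8_MAX, min(FP8_MAX, round(v / scale)))
            errors.append(abs(v - code * scale))
    return errors

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical row: consistent with the listed errors, outlier in position 5.
row = [0.5, -0.8, 0.7, 0.9, 224.0, 0.1, -0.6, 0.2]

err_a = block_errors(row, block=8)  # strategy A: per-tensor
err_b = block_errors(row, block=4)  # strategy B: 1x4 tiles

print("A:", [round(e, 3) for e in err_a], "RMSE", round(rmse(err_a), 3))
print("B:", [round(e, 3) for e in err_b], "RMSE", round(rmse(err_b), 3))
```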
Visualizing Where the Damage Goes
The widget below shows a 16×32 activation tensor with three outlier channels and one catastrophic single-cell spike. Switch between the four scaling strategies and watch the dashed block boundaries shrink. The error heatmap reveals where the damage accumulates: with per-tensor scaling almost every cell is contaminated; with 1×128 tile scaling only the tile containing the catastrophic cell is hurt; the rest of the tensor regains near-FP8 precision.
Plain Python: Per-Tensor vs Block-Wise
Pure Python, no NumPy, no PyTorch. The whole machinery of FP8 block-wise quantisation in under 40 lines. Read this once and the DeepSeek paper's quantisation diagrams stop looking like magic — they are this loop, vectorised on a GPU.
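A minimal sketch of that loop, assuming the integer-grid FP8 model from the walkthrough. The function names, the 4×16 toy tensor, and the outlier channel at column 13 are all illustrative choices.

```python
import math
import random

FP8_MAX = 448  # integer-grid stand-in for the E4M3 range (assumption)

def roundtrip_blockwise(x, block_rows, block_cols):
    """Quantise and dequantise a 2-D list with one scale per block."""
    rows, cols = len(x), len(x[0])
    out = [[0.0] * cols for _ in range(rows)]
    scales = {}
    for r0 in range(0, rows, block_rows):
        for c0 in range(0, cols, block_cols):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block_rows, rows))
                     for c in range(c0, min(c0 + block_cols, cols))]
            amax = max(abs(x[r][c]) for r, c in cells)
            scale = amax / FP8_MAX if amax > 0 else 1.0
            scales[(r0 // block_rows, c0 // block_cols)] = scale
            for r, c in cells:
                code = max(-FP8_MAX, min(FP8_MAX, round(x[r][c] / scale)))
                out[r][c] = code * scale
    return out, scales

def rmse(x, y):
    diffs = [(a - b) ** 2 for xr, yr in zip(x, y) for a, b in zip(xr, yr)]
    return math.sqrt(sum(diffs) / len(diffs))

random.seed(0)
x = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
for row in x:
    row[13] = 150.0  # hypothetical outlier channel

per_tensor, _ = roundtrip_blockwise(x, 4, 16)         # one block covers everything
per_tile, tile_scales = roundtrip_blockwise(x, 1, 4)  # 1x4 tiles
print(f"per-tensor RMSE: {rmse(x, per_tensor):.4f}")
print(f"1x4 tile RMSE:   {rmse(x, per_tile):.4f}")
```

Per-tensor scaling is just the special case where the block shape equals the tensor shape, which is why one function covers both strategies.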
PyTorch: DeepSeek-Style Tile and Block Scaling
Same logic, vectorised across a real-sized activation tensor with the exact (1, 128) and (128, 128) granularities DeepSeek-V3 uses. The reshape-and-permute trick collapses the per-block loop into two fused PyTorch ops — no Python iteration, no per-tile kernel launch. On an H100 this entire round-trip on a 1024×4096 tensor takes well under a millisecond.
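A hedged sketch of the vectorised version. It assumes the same integer-grid model rather than a true FP8 dtype cast, requires dimensions that divide evenly into blocks, and uses only a reshape (no permute is needed when the reduction runs over the two inner block axes). The function names, the outlier channel at column 7, and the 512×512 weight matrix are illustrative.

```python
import torch

FP8_MAX = 448.0  # E4M3 ceiling; codes are simulated on an integer grid (assumption)

def quantise_blocks(x, block_rows, block_cols):
    """Return (codes, scales) with one FP32 scale per block_rows x block_cols block."""
    rows, cols = x.shape
    assert rows % block_rows == 0 and cols % block_cols == 0
    # View as (row_blocks, block_rows, col_blocks, block_cols).
    blocks = x.reshape(rows // block_rows, block_rows, cols // block_cols, block_cols)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True)
    scales = (amax / FP8_MAX).clamp_min(1e-12)  # guard against all-zero blocks
    codes = torch.round(blocks / scales).clamp(-FP8_MAX, FP8_MAX)
    return codes.reshape(rows, cols), scales[:, 0, :, 0]

def dequantise_blocks(codes, scales, block_rows, block_cols):
    rows, cols = codes.shape
    blocks = codes.reshape(rows // block_rows, block_rows, cols // block_cols, block_cols)
    return (blocks * scales[:, None, :, None]).reshape(rows, cols)

def rmse(a, b):
    return torch.sqrt(torch.mean((a - b) ** 2)).item()

torch.manual_seed(0)
acts = torch.randn(1024, 4096)
acts[:, 7] *= 100.0  # hypothetical outlier channel
weights = torch.randn(512, 512)

act_codes, act_scales = quantise_blocks(acts, 1, 128)           # 1x128 tiles
w_codes, w_scales = quantise_blocks(weights, 128, 128)          # 128x128 blocks
tensor_codes, tensor_scale = quantise_blocks(acts, 1024, 4096)  # per-tensor baseline

tile_rmse = rmse(acts, dequantise_blocks(act_codes, act_scales, 1, 128))
tensor_rmse = rmse(acts, dequantise_blocks(tensor_codes, tensor_scale, 1024, 4096))
print(f"scale shapes: {tuple(act_scales.shape)}, {tuple(w_scales.shape)}")
print(f"per-tensor RMSE {tensor_rmse:.4f} vs 1x128 RMSE {tile_rmse:.4f}")
```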
At Massive Scale: DeepSeek-V3's FP8 GEMM
DeepSeek-V3 trains a 671 B-parameter Mixture-of-Experts model on 14.8 T tokens, with most GEMMs executed in FP8. That works only because the team co-designed the FP8 format, the block layout, and a custom CUDA kernel that handles the scales correctly during the inner-product accumulation. The key facts:
| Tensor | Format | Block shape | Scale layout | Rationale |
|---|---|---|---|---|
| Activations (forward) | E4M3 | 1 × 128 | (R, C/128) FP32 | Outlier channels are absorbed per-128-group |
| Activation gradients (backward) | E4M3 | 1 × 128 | (R, C/128) FP32 | Same format as forward; per-tile scales cover the wider range |
| Weights | E4M3 | 128 × 128 | (R/128, C/128) FP32 | Patches match GEMM block tile |
| Weight gradients | FP32 | — | — | Master copy stays in FP32 (§10.5) |
| Optimizer state (m, v) | FP32 / BF16 | — | — | Adam moments need precision |
Two design points are worth highlighting. First, the tile shape for activations is chosen so the 128-element K-dimension of the GEMM's inner accumulate loop aligns with one tile-scale: the kernel loads 128 FP8 values, multiplies by one FP32 scale broadcast in a register, and accumulates in FP32. No re-scaling mid-loop, no scalar dependency chain inside the hot path. Second, the weight block matches the canonical CUTLASS tile size on H100 for FP8 matmul, so each warp consumes exactly one weight scale per output tile.
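The first design point can be sketched in plain Python for a single output element: one activation row against one weight column, both quantised in 128-element groups along K. This is an illustration of the scale-factoring idea under the integer-grid model, not the actual CUDA kernel; all names and test vectors are hypothetical.

```python
import math

FP8_MAX = 448
K_BLOCK = 128

def quantise_k_blocks(values, block=K_BLOCK):
    """One scale per `block` consecutive elements along the K axis."""
    codes, scales = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / FP8_MAX if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-FP8_MAX, min(FP8_MAX, round(v / scale))) for v in chunk)
    return codes, scales

def block_scaled_dot(a_codes, a_scales, w_codes, w_scales, block=K_BLOCK):
    """Accumulate raw code products per K-block, then apply the scales once."""
    acc = 0.0  # stands in for the FP32 accumulator
    for b, (sa, sw) in enumerate(zip(a_scales, w_scales)):
        lo, hi = b * block, (b + 1) * block
        partial = sum(a * w for a, w in zip(a_codes[lo:hi], w_codes[lo:hi]))
        acc += partial * (sa * sw)  # one scale pair per K-block
    return acc

K = 256
activation_row = [math.sin(0.37 * i) for i in range(K)]
activation_row[5] = 50.0  # outlier lands in the first K-block only
weight_col = [math.cos(0.11 * i) for i in range(K)]

a_codes, a_scales = quantise_k_blocks(activation_row)
w_codes, w_scales = quantise_k_blocks(weight_col)
approx = block_scaled_dot(a_codes, a_scales, w_codes, w_scales)
exact = sum(a * w for a, w in zip(activation_row, weight_col))
print(f"exact {exact:.4f}  block-scaled {approx:.4f}")
```

Because the scale is constant within a K-block, it factors out of the inner sum, which is why no rescaling is needed inside the hot loop.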
What changes from the toy demo to a 671 B model
- Per-tile scales become a tensor of their own. On a (16384, 4096) activation, the scale tensor is (16384, 32) FP32 — 2 MB. Across all activations in a transformer block, scale tensors sum to gigabytes. They are not free, and they have to be moved between memory hierarchies just like the FP8 payload.
- Scale computation is on the critical path. amax must be computed before the FP8 write-back. On H100, fused-amax-and-quantise kernels (the "TE" primitive in NVIDIA Transformer Engine) write the FP8 tensor and the scale in one pass over the activation. A naive two-pass implementation doubles activation memory bandwidth and is unusable.
- Backward also uses E4M3. Hybrid FP8 recipes typically switch to E5M2 for gradients, because gradients span a wider dynamic range than activations and E5M2 trades mantissa for exponent. The DeepSeek-V3 report instead adopts E4M3 on all tensors for higher precision, and attributes the feasibility of that choice to fine-grained scaling: each tile still gets its own scale, so the per-tile scale absorbs the range that the format itself lacks.
- The amax history matters. Some FP8 systems (NVIDIA Transformer Engine) use a delayed scaling policy: compute amax during forward, store it, then use the previous step's amax for the current step's quantisation. This trades a small amount of accuracy (the amax can change by ±20% across steps) for the ability to fuse the FP8 write with the GEMM. DeepSeek-V3 uses just-in-time scaling — compute amax this step, use it this step — paying the bandwidth cost to avoid the accuracy hit.
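The difference between the two scaling policies in the last bullet can be sketched as follows. `DelayedScaler` is a hypothetical helper that reuses only the previous step's amax (real delayed-scaling implementations typically keep a longer amax history), and FP8 is again modelled as an integer grid.

```python
FP8_MAX = 448.0

def scale_from_amax(amax):
    return amax / FP8_MAX if amax > 0 else 1.0

def quantise_dequantise(values, scale):
    """Round-trip through clamped integer codes with a given scale."""
    codes = [max(-FP8_MAX, min(FP8_MAX, round(v / scale))) for v in values]
    return [c * scale for c in codes]

def just_in_time(values):
    """Compute amax on the current values and use it immediately."""
    return quantise_dequantise(values, scale_from_amax(max(abs(v) for v in values)))

class DelayedScaler:
    """Hypothetical helper: quantise with the amax seen on the previous step."""

    def __init__(self, initial_amax):
        self.prev_amax = initial_amax

    def roundtrip(self, values):
        out = quantise_dequantise(values, scale_from_amax(self.prev_amax))
        self.prev_amax = max(abs(v) for v in values)  # stored for the next step
        return out

step1 = [0.5, -1.0, 0.25]
step2 = [0.6, -1.2, 0.3]  # amax grows by 20% between steps

delayed = DelayedScaler(initial_amax=1.0)
delayed.roundtrip(step1)
delayed_out = delayed.roundtrip(step2)  # uses step1's amax, so -1.2 saturates
jit_out = just_in_time(step2)

delayed_err = max(abs(a - b) for a, b in zip(step2, delayed_out))
jit_err = max(abs(a - b) for a, b in zip(step2, jit_out))
print(f"delayed worst error: {delayed_err:.4f}, just-in-time worst error: {jit_err:.6f}")
```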
Engineering Reality: Why 128 and Not 16 or 1024
The block size is a hyperparameter. Smaller blocks give better quantisation; larger blocks give better hardware throughput. Why does DeepSeek-V3 settle on exactly 128?
| Block size | Outlier containment | Scale memory | GEMM alignment | Verdict |
|---|---|---|---|---|
| 1 (per-element) | Perfect | +400% — disastrous | No useful kernel | Pointless |
| 8 | Excellent | +50% | Misaligned with K-loop | Too small |
| 32 | Very good | +12% | Half a wavefront | OK but suboptimal |
| 128 | Good | +3.1% (1×128) / 0.024% (128×128) | Matches H100 GEMM tile | DeepSeek-V3 choice |
| 512 | Mediocre | +0.78% | Larger than one GEMM tile | Outliers leak |
| 4096 (per-row) | Per-token only | +0.1% | Whole-K scaling | Loses channel granularity |
The choice is not arbitrary. 128 is not the K-dimension of a single tensor-core instruction: an H100 FP8 WGMMA instruction consumes 32 elements along K. It is the kernel-level "K-block" that the GEMM loops over, four such instructions deep. Aligning the quantisation block with the K-block means the scale is constant across the entire inner loop. Any smaller block forces a mid-loop scale change, which costs registers and breaks the dataflow. Any larger block lets one outlier's damage spread further than necessary.
Four engineering gotchas that the clean theory hides:
- Tail padding. If $C$ is not a multiple of 128, the last tile is padded. The padding zeros do not change amax, but the partial tile still needs its own scale and a full tile's worth of storage. Most training stacks ensure all dimensions are multiples of 128 by construction, which is why "hidden size 4096 / 8192 / 16384" is everywhere in modern LLM configs.
- Mixed block shapes within one tensor. Some experimental schemes (e.g., NVIDIA's "current scaling") use 1×128 for activations and 1×K for the same tensor when it appears in a different GEMM. Each consumer of the tensor requires re-quantisation with its own preferred block shape. DeepSeek-V3 avoided this by standardising on (1, 128) for all activation appearances.
- The activation-gradient bug. If you quantise activation gradients with a block shape that doesn't align with the gradient's upstream tensor's block shape, the backward GEMM produces subtly wrong values that only show up as slow training divergence. The fix is to keep block shapes consistent across forward and backward — the DeepSeek paper explicitly documents this constraint.
- Communication-bound quantisation. For tensor-parallel and expert-parallel comms, the FP8 payload plus its scales travels the network. Sending the scales as BF16 instead of FP32 halves the metadata bandwidth; some systems do this for AllReduce while keeping FP32 scales for GEMM. Whether the savings are worth the extra accuracy loss depends on the network — H100 NVLink can ship FP32 scales for free, but slower interconnects (PCIe, Ethernet) benefit from BF16 metadata.
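The tail-padding gotcha is the easiest of the four to demonstrate. The sketch below pads a 300-element row to a multiple of 128 and confirms that the zeros leave the last tile's amax untouched while still costing a full tile of storage. The helper names and the test row are illustrative.

```python
def pad_to_tiles(row, tile=128):
    """Right-pad with zeros so the length is a multiple of `tile`."""
    padding = (tile - len(row) % tile) % tile
    return list(row) + [0.0] * padding

def tile_amaxes(row, tile=128):
    """amax of each tile after padding; zeros never raise an amax."""
    padded = pad_to_tiles(row, tile)
    return [max(abs(v) for v in padded[i:i + tile])
            for i in range(0, len(padded), tile)]

# 300 values: two full tiles plus a 44-element tail.
row = [((-1) ** i) * (1 + i % 7) * 0.1 for i in range(300)]

padded = pad_to_tiles(row)
amaxes = tile_amaxes(row)
wasted = len(padded) - len(row)
print(f"padded length {len(padded)}, tiles {len(amaxes)}, wasted slots {wasted}")
```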
The deepest lesson. FP8 didn't fail in 2022 because the format was wrong — E4M3 is fine, the silicon is fast, the bandwidth savings are real. It failed because one global scale is the wrong abstraction for tensors whose magnitudes vary by orders of magnitude across positions. The DeepSeek-V3 contribution is not a new format or a new kernel; it is the observation that quantisation granularity is a design knob, and that the right setting — 1×128 for activations, 128×128 for weights — happens to align with the GEMM's natural tile structure on H100. The lesson generalises: every time a numerical method fails at scale, ask whether the failure is in the method or in the granularity at which the method is applied.