Boo-AI — Master Artificial Intelligence by Building from Scratch

The DeepSeek-V3 paper reports that training a 671 B-parameter model on 14.8 T tokens cost about 2.788 million H800 GPU-hours in total, of which about 2.664 million went to pre-training. Without FP8, the same run would have cost roughly twice that and, more importantly, it would not have fit on the 2048 H800s the team had access to. FP8 is not a micro-optimisation. It is the difference between training a frontier model and watching one train somewhere else. This section makes the case for FP8 before the next two sections explain why naïve FP8 fails and how DeepSeek made it work.

The Real Problem: 671 B Parameters Don't Fit

A frontier-scale dense or MoE model carries an enormous amount of state through every training step. For each parameter you store the weight itself, the gradient, two Adam optimiser moments, and a slice of the forward-pass activations needed for backprop. With $N = 671 \cdot 10^{9}$ parameters in pure FP32 you are looking at roughly $16 N \approx 10.7 \, \text{TB}$ of weight, gradient and optimiser state alone (4 bytes each for the weight, the gradient and the two moments), before activations and the KV cache that long-context training demands.

An H800 GPU has 80 GB of HBM. To hold the FP32 state of a 671 B model you would need $10.7 \, \text{TB} / 80 \, \text{GB} \approx 134$ GPUs just to park the weights, before you spend a single FLOP on training. Add activations, gradients in transit, and all-to-all buffers for expert parallelism and the floor climbs to many hundreds of GPUs — at which point the network becomes the bottleneck and tensor cores starve.
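The arithmetic above can be checked in a few lines. This is a back-of-envelope sketch, not a memory planner: the byte counts are the ones quoted in the text, and the helper names `state_bytes` and `min_gpus` are made up for illustration.

```python
import math

N_PARAMS = 671e9      # total parameters, as quoted in the text
HBM_BYTES = 80e9      # one H800, in decimal bytes

# Pure FP32 training state per parameter: weight, gradient, Adam m, Adam v.
BYTES_PER_PARAM = {"weight": 4, "gradient": 4, "adam_m": 4, "adam_v": 4}

def state_bytes(n_params, bytes_per_param):
    """Total bytes of per-parameter training state (hypothetical helper)."""
    return n_params * sum(bytes_per_param.values())

def min_gpus(total_bytes, hbm_bytes=HBM_BYTES):
    """Smallest whole number of GPUs whose combined HBM holds total_bytes."""
    return math.ceil(total_bytes / hbm_bytes)

total = state_bytes(N_PARAMS, BYTES_PER_PARAM)
print(f"state: {total / 1e12:.1f} TB, GPUs just to hold it: {min_gpus(total)}")
```

Rounding up to whole GPUs turns the text's ≈ 134 into 135; either way this is only the floor before activations and communication buffers are counted.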

The old fix was 16-bit weights with FP32 master copies (FP16 in Micikevicius et al. 2018, BF16 in most later work). That cuts forward-pass memory in half. The H100's FP8 tensor cores cut it in half again and double matmul throughput at the same time. For DeepSeek-V3's 2048-GPU cluster, FP8 is the only path that closes both the memory budget and the compute budget at once.

Format	Bits	Range (≈)	Mantissa	H100 GEMM TFLOPS
FP32	32	10⁻³⁸ to 10³⁸	23 bits	67
TF32	19	10⁻³⁸ to 10³⁸	10 bits	495
BF16	16	10⁻³⁸ to 10³⁸	7 bits	989
FP16	16	6·10⁻⁵ to 6·10⁴	10 bits	989
FP8 E4M3	8	2·10⁻³ to 448	3 bits	1979
FP8 E5M2	8	1.5·10⁻⁵ to 57 344	2 bits	1979

The two FP8 throughput rows tell the whole story. An H100 SXM does ~989 TFLOPS in BF16 and ~1979 TFLOPS in FP8: a clean 2× on the GEMM rate, plus a weight-and-activation footprint that is 2× smaller than BF16 and 4× smaller than FP32. Over a 2048-GPU run that takes about two months in FP8, avoiding the BF16 doubling saves single-digit millions of dollars on a single model. That is why frontier labs in 2024–2025 either ship FP8 training or are trying to.

Intuition: Pay for Bits That Earn Their Keep

A floating-point number is a budget. Each bit you spend either buys you range (the orders of magnitude you can represent — the exponent) or precision (how finely you can split each order of magnitude — the mantissa). FP32 buys both lavishly. FP8 forces a brutal trade-off: with only eight bits total, you cannot afford both wide range and fine resolution.

Think of the bits as ruler markings. FP32 is a long ruler with $2^{23} \approx 8 \cdot 10^{6}$ tick marks between consecutive powers of two, so fine you cannot see the ticks. BF16 keeps the same length but uses only $2^7 = 128$ ticks per octave. FP8 E4M3 has only $2^3 = 8$ ticks per octave and a much shorter ruler, a little over five decades wide from its smallest subnormal ($2^{-9} \approx 0.002$) to its maximum (448). Any value outside that ruler becomes zero or saturates; any value inside the ruler gets snapped to the nearest of the eight ticks in its octave.

The bet of FP8 training is that, with the right scaling, almost all the numbers a transformer block needs to multiply land inside FP8's range, and that the few-percent rounding error per multiply is recoverable by a higher-precision accumulator. The rest of this chapter is the engineering required to make that bet true.

Why two FP8 formats? E4M3 (4 exp / 3 mantissa) is like a precise short ruler, good for activations and weights, which sit in a narrow band after layer-norm. E5M2 (5 exp / 2 mantissa) is like a coarse long ruler, good for gradients, which can span many orders of magnitude during a training step. The conventional hybrid recipe (used, for example, by NVIDIA Transformer Engine) routes weights and activations through E4M3 and gradients through E5M2: pay the precision bit where it counts, pay the range bit where it counts. DeepSeek-V3 departs from this and uses E4M3 for all tensors, relying on fine-grained scaling to supply the range.

The Anatomy of a Float: FP32, BF16, FP16, FP8

Every IEEE-style float encodes a value as:

$x = (-1)^{s} \cdot 2^{e - \text{bias}} \cdot \left( 1 + \frac{m}{2^{M}} \right)$

with $s \in \{0, 1\}$ the sign, $e$ a non-negative integer stored in $E$ exponent bits, and $m$ the mantissa stored in $M$ bits. The bias $\text{bias} = 2^{E-1} - 1$ centres the unbiased exponent $e - \text{bias}$ around zero, i.e. around the value 1, so that both very small and very large numbers are representable.

The dynamic range of the format, the ratio of the largest to the smallest normal value, is roughly:

$\text{range} \approx 2^{2^{E} - 2}$

because the unbiased exponent of an IEEE-style normal runs from $1 - \text{bias}$ to $\text{bias}$ and the significand adds almost one more factor of two at the top. The relative resolution within a single octave, the smallest fractional change you can resolve, is:

$\epsilon = 2^{-M}$

Substitute the five formats and watch the trade-off snap into focus:

Format	E	M	Largest finite value	ε = 2^−M
FP32	8	23	≈ 2¹²⁸ (3.4·10³⁸)	≈ 1.2 · 10⁻⁷
BF16	8	7	≈ 2¹²⁸ (3.4·10³⁸)	≈ 7.8 · 10⁻³
FP16	5	10	≈ 2¹⁶ (65 504)	≈ 9.8 · 10⁻⁴
FP8 E4M3	4	3	1.75 · 2⁸ (448)	≈ 1.25 · 10⁻¹
FP8 E5M2	5	2	1.75 · 2¹⁵ (57 344)	≈ 2.5 · 10⁻¹

Read the rows. BF16 has FP32's range with $2^{16} \approx 65\,000$ × coarser precision, which is fine for the forward pass because the rounding error per multiply is absorbed by an FP32 accumulator. FP16 keeps three more mantissa bits than BF16 but has a much narrower range, which is why FP16 training requires loss scaling to keep gradients from underflowing.

FP8 E4M3 has an even narrower range AND much coarser precision: its largest value is 448 against BF16's $\approx 3.4 \cdot 10^{38}$, and its resolution is $16$ × coarser. FP8 E5M2 trades a precision bit back for a range bit, recovering FP16's exponent range but with only 4 distinct values per octave. Neither format can survive on its own without scaling; the engineering question is what kind of scaling.
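The table entries follow mechanically from the bit widths. The sketch below derives them, assuming IEEE-style conventions (top exponent code reserved for inf/NaN) for every format except E4M3, where the top code is treated as finite apart from one NaN pattern. The function name `format_stats` and the `top_code_is_finite` flag are illustrative, not taken from any library.

```python
def format_stats(e_bits, m_bits, top_code_is_finite=False):
    """Derive range and resolution from exponent and mantissa widths."""
    bias = 2 ** (e_bits - 1) - 1
    if top_code_is_finite:
        # E4M3 style: exponent code all-ones is usable, mantissa all-ones is NaN.
        max_exp = (2 ** e_bits - 1) - bias
        max_significand = 2 - 2.0 ** (1 - m_bits)
    else:
        # IEEE style: exponent code all-ones is reserved for inf/NaN.
        max_exp = (2 ** e_bits - 2) - bias
        max_significand = 2 - 2.0 ** -m_bits
    return {
        "max": max_significand * 2.0 ** max_exp,
        "min_normal": 2.0 ** (1 - bias),
        "min_subnormal": 2.0 ** (1 - bias - m_bits),
        "eps": 2.0 ** -m_bits,
    }

FORMATS = {
    "FP32": (8, 23, False),
    "BF16": (8, 7, False),
    "FP16": (5, 10, False),
    "FP8 E4M3": (4, 3, True),
    "FP8 E5M2": (5, 2, False),
}

for name, spec in FORMATS.items():
    s = format_stats(*spec)
    print(f"{name:9s} max={s['max']:.4g}  min_normal={s['min_normal']:.3g}  eps={s['eps']:.3g}")
```

The single flag is the whole difference between E4M3's maximum of 448 and the 240 an IEEE-style 4-bit exponent would give.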

The two scaling regimes

The simplest fix is per-tensor scaling: multiply the whole tensor by a constant before casting to FP8 so that its amax lands at FP8's max value. One scale per tensor, one extra FP32 scalar to store. Works fine for weights, where the distribution is stable across steps.

It fails for activations. After a non-linearity, a single outlier can be orders of magnitude larger than the bulk of the distribution. Pin that amax to 448 and the bulk slides down toward E4M3's subnormal floor ($2^{-6} \approx 0.0156$), where the relative error climbs well past the ~6% worst case of normal values and the smallest entries flush to zero. The DeepSeek-V3 paper's answer is fine-grained scaling (a separate scale per 1×128 tile of activations and per 128×128 block of weights) plus higher-precision accumulation, the topic of section 10.3. For now, the case for FP8 stops at: the hardware is there, the memory savings are there, the throughput is there, and the scaling problem is solvable.
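A toy experiment makes the difference between the two regimes concrete. The sketch below is pure Python and makes several assumptions: the data is synthetic Gaussian noise, the outlier magnitude (1e5) is a deliberately extreme hypothetical value, and `fake_quant` only emulates scale, cast, unscale on a flat list rather than DeepSeek's actual tiling.

```python
import math
import random

E4M3_MAX = 448.0

def quantize_e4m3(x):
    """Saturating round-to-nearest-even cast onto the E4M3 value grid."""
    if x == 0.0:
        return 0.0
    sign, ax = math.copysign(1.0, x), abs(x)
    if ax >= E4M3_MAX:
        return sign * E4M3_MAX
    exp = max(math.floor(math.log2(ax)), -6)   # -6 is the subnormal floor
    step = 2.0 ** (exp - 3)                    # grid spacing in this octave
    return sign * round(ax / step) * step      # tiny values round to 0

def fake_quant(values, block):
    """Scale each block so its amax maps to 448, cast, then undo the scale."""
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = E4M3_MAX / amax if amax > 0 else 1.0
        out.extend(quantize_e4m3(v * scale) / scale for v in chunk)
    return out

def mean_rel_err(ref, approx):
    return sum(abs(a - r) / abs(r) for r, a in zip(ref, approx)) / len(ref)

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1024)]
x[0] = 1e5                                     # hypothetical extreme outlier

per_tensor = fake_quant(x, block=len(x))       # one scale for everything
per_block = fake_quant(x, block=128)           # one scale per 128 elements

bulk = slice(1, None)                          # judge the error on the non-outliers
err_tensor = mean_rel_err(x[bulk], per_tensor[bulk])
err_block = mean_rel_err(x[bulk], per_block[bulk])
print(f"per-tensor: {err_tensor:.1%}   per-block: {err_block:.1%}")
```

With per-block scaling only the block that contains the outlier pays for it; the other seven blocks keep their values in E4M3's normal range.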

Manual Numerical Walkthrough

Two examples, calculated by hand: casting a handful of values to FP8 E4M3, and sizing the memory savings on a 671 B-parameter MoE like DeepSeek-V3.

Casting to FP8 E4M3 by hand

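Setup. FP8 E4M3 has 1 sign bit, 4 exponent bits, 3 mantissa bits. Bias is $2^{4-1} - 1 = 7$ . Exponent code 0 is reserved for zero and subnormals. Unlike IEEE formats, E4M3 does not give up exponent code 15 to inf/NaN: only the pattern with exponent 15 and mantissa 111 encodes NaN, so the unbiased exponent $e - 7$ of normal values runs from $-6$ to $+8$ and the largest finite value is $1.75 \cdot 2^{8} = 448$ .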

Example 1 — cast x = 1.3. First find the octave: $\lfloor \log_2 1.3 \rfloor = 0$ , so the unbiased exponent is $e - 7 = 0$ , i.e. stored $e = 7$ . The fractional part is $1.3 / 2^{0} = 1.3$ ; subtract the implicit leading 1 to get the stored mantissa fraction $0.3$ . Scale to 3-bit mantissa: $0.3 \cdot 2^{3} = 2.4$ ; round to $2$ . Reconstructed value: $2^{0} \cdot (1 + 2/8) = 1.25$ . Relative error: $(1.3 - 1.25)/1.3 \approx 3.8\%$ .

Example 2 — cast x = 500. Above MAX_NORMAL ≈ 448. DeepSeek's saturated cast clips to 448. Relative error: $(500 - 448)/500 = 10.4\%$ . This is why outlier handling matters — a few percent of activations past the saturation threshold can move a whole layer's loss by a measurable amount.

Example 3 — cast x = 1.5 · 10⁻⁵. Below MIN_SUBNORM / 2 ≈ 9.7 · 10⁻⁴. Flushes to zero. Relative error 100%. This is the underflow failure mode FP16 famously suffers from during gradient computation; FP8 sees it at activation magnitudes too, which is why E5M2 exists for gradients (range ≈ 1.5 · 10⁻⁵, just enough to keep this particular value alive).

Example 4 — cast x = 0.0. Stays zero. Useful because masked positions in attention, dropped tokens in MoE routing, and padding tokens all produce exact zeros — and FP8 represents zero exactly with no rounding error.
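The hand calculations can be cross-checked by brute force, since E4M3 has only 256 bit patterns. The sketch below decodes every byte and casts by picking the nearest finite value. It assumes the layout described in the setup (bias 7, a single NaN pattern per sign); `decode_e4m3` and `cast_nearest` are illustrative names, and the nearest-value search does not implement round-half-to-even on exact ties.

```python
import math

def decode_e4m3(byte):
    """Decode one E4M3 bit pattern (0..255) into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp_code = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp_code == 0xF and mant == 0x7:
        return math.nan                          # the only NaN pattern per sign
    if exp_code == 0:
        return sign * (mant / 8) * 2.0 ** -6     # subnormal: no implicit leading 1
    return sign * (1 + mant / 8) * 2.0 ** (exp_code - 7)

# Every distinct finite value the format can hold (+0 and -0 collapse to one).
decoded = (decode_e4m3(b) for b in range(256))
FINITE = sorted({v for v in decoded if not math.isnan(v)})

def cast_nearest(x):
    """Cast by exhaustive search; out-of-range inputs land on the extremes."""
    return min(FINITE, key=lambda v: abs(v - x))

for x in (1.3, 500.0, 1.5e-5, 0.0):
    print(f"{x:>10.5g} -> {cast_nearest(x):g}")
print(f"{len(FINITE)} distinct finite values, max {max(FINITE):g}")
```

Because the table is exhaustive, saturation and flush-to-zero fall out of the search without any special cases.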

Memory sizing for DeepSeek-V3. Total parameters $N = 6.71 \cdot 10^{11}$ . In FP32: $4 N = 2.68 \, \text{TB}$ for the forward weight copy alone. In BF16: $2 N = 1.34 \, \text{TB}$ . In FP8: $1 N = 0.67 \, \text{TB}$ . Across 2048 H800 GPUs, FP8 frees $(2 - 1) N / 2048 \approx 327 \, \text{MB}$ per GPU compared with BF16. That headroom directly funds longer sequences, larger micro-batches, and more in-flight expert routing buffers.

Throughput sizing. An H100/H800 does $\approx 989$ BF16 TFLOPS and $\approx 1979$ FP8 TFLOPS. Counting all parameters as if the model were dense, DeepSeek-V3 would need $6 N D \approx 6 \cdot 6.71 \cdot 10^{11} \cdot 1.48 \cdot 10^{13} \approx 5.96 \cdot 10^{25}$ training FLOPs. In BF16 at 50% MFU on 2048 H800s this would take $5.96 \cdot 10^{25} / (2048 \cdot 989 \cdot 10^{12} \cdot 0.5) \approx 5.9 \cdot 10^{7}$ seconds ≈ 16 400 wall-clock hours ≈ 33.5 million GPU-hours total. In FP8 at the same MFU, $\approx 16.7$ million GPU-hours. The paper reports 2.788 million. Most of the gap comes from the dense assumption: an MoE only touches its activated parameters (37 B per token for DeepSeek-V3), so the real FLOP count is closer to $6 \cdot 3.7 \cdot 10^{10} \cdot 1.48 \cdot 10^{13} \approx 3.3 \cdot 10^{24}$. The 2× direction between BF16 and FP8 is what carries over to the budget.
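The same estimate as a small calculator. It uses the standard $6 N D$ approximation and the peak rates quoted above; the 50% MFU is the text's assumption, and the activated-parameter variant (37 B per token) is included to show how much of the gap to the reported figure is due to treating the MoE as dense. The function name `gpu_hours` is illustrative.

```python
TOKENS = 14.8e12
N_TOTAL = 671e9          # all parameters
N_ACTIVE = 37e9          # parameters activated per token in the MoE
PEAK = {"BF16": 989e12, "FP8": 1979e12}   # FLOP/s per GPU, dense

def gpu_hours(n_params, tokens, peak_flops, mfu=0.5):
    """GPU-hours for 6*N*D FLOPs at a given peak rate and utilisation."""
    flops = 6 * n_params * tokens
    gpu_seconds = flops / (peak_flops * mfu)   # independent of the GPU count
    return gpu_seconds / 3600

for label, n in (("dense count ", N_TOTAL), ("active count", N_ACTIVE)):
    for fmt, peak in PEAK.items():
        print(f"{label} {fmt:4s}: {gpu_hours(n, TOKENS, peak) / 1e6:6.2f} M GPU-hours")
```

The GPU count cancels out of the total, which is why the estimate depends only on peak rate and utilisation.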

Visualizing Range, Precision, and Memory

Three widgets, each isolating one of FP8's trade-offs. The first compares the representable range of FP32, BF16, FP16, and FP8 against the magnitudes that actually appear in a transformer block. The second is a bit-level explorer — flip bits, watch the value change, see how the sign-exponent-mantissa split builds the number. The third sizes the whole-model memory bill for every realistic precision recipe.

[Interactive widget: precision range visualizer]

Read the chart this way. The grey bars are the formats; the coloured bands are operational zones (activations, gradients, pre-softmax scores). Anywhere a band sticks out past the right edge of an FP8 bar, the value saturates to the FP8 max. Anywhere a band falls off the left edge, the value flushes to zero. Most activations sit inside FP8 E4M3; most gradients spill out of it, which is exactly why E5M2 (wider range) exists.

[Interactive widget: float bit visualizer]

Flip a single mantissa bit on FP8 E4M3 and watch the value jump by ~12.5% — that is $\epsilon = 2^{-3}$ in action. Flip the same bit on FP32 and the value moves by $2^{-23} \approx 10^{-7}$ . This is the precision cost of the format, in one click.
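Without the widget, the FP32 half of that experiment can be reproduced with the standard `struct` module. This is a minimal sketch: `flip_fp32_bit` is a made-up helper, and the E4M3 figure at the end is computed arithmetically from $\epsilon = 2^{-3}$ rather than from a real 8-bit encoding.

```python
import struct

def flip_fp32_bit(x, bit):
    """Flip one bit (0 = lowest mantissa bit, 31 = sign) of x stored as FP32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped

x = 1.0
y = flip_fp32_bit(x, 0)                       # lowest of the 23 mantissa bits
print(f"FP32: 1.0 -> {y!r}, relative change {abs(y - x) / x:.3g}")

# The same move in E4M3 at 1.0 is one tick of 2**-3.
print(f"E4M3: 1.0 -> {1.0 + 2 ** -3}, relative change {2 ** -3:.3g}")

# Exponent and sign bits behave very differently from mantissa bits.
print(flip_fp32_bit(1.0, 23), flip_fp32_bit(1.0, 31))
```

Flipping bit 23 (the lowest exponent bit) halves the value and flipping bit 31 negates it, which is the sign, exponent, mantissa split the bit explorer shows.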

[Interactive widget: memory footprint comparator]

Two settings to try. (1) Set the model to 671 B parameters (DeepSeek-V3 scale) and compare "Pure FP32" vs "FP8 mixed" — the per-GPU memory bar shrinks by a factor of roughly 3.5×, freeing the headroom that lets a 671 B MoE run on 2048 H800s. (2) Sweep batch size; FP8's activation savings scale linearly with batch × sequence length, so the percentage win grows for long-context training.

Plain Python: Simulating FP8 Quantisation

Before we call into a Hopper tensor core, let us build the cast by hand. The function below is a faithful emulation of what the hardware does to every number: round-to-nearest-even, saturate at ±448, flush sub-subnormals to zero. Then we use it to simulate an FP8 matmul against an FP32 reference and measure the rounding error a transformer block has to live with.

fp8_e4m3_quantizer.py

```python
import math
import numpy as np

# FP8 E4M3: 1 sign bit, 4 exponent bits, 3 mantissa bits.
# Bias = 7, so the unbiased exponent of normal values runs from -6 to +8.
E_BITS, M_BITS, BIAS = 4, 3, 7
# E4M3 keeps exponent code 15 for finite values (only mantissa 111 is NaN),
# so the largest finite value is 1.75 * 2**8.
MAX_NORMAL  = (2 - 2**(1 - M_BITS)) * 2**(2**E_BITS - 1 - BIAS)   # = 448
MIN_NORMAL  = 2**(1 - BIAS)                                       # ≈ 0.0156
MIN_SUBNORM = 2**(1 - BIAS - M_BITS)                              # ≈ 0.00195

def quantize_e4m3(x):
    """Round-to-nearest-even cast from FP32 to FP8 E4M3."""
    if x == 0:
        return 0.0
    sign = math.copysign(1.0, x)
    ax = abs(x)

    # Above max: saturate (DeepSeek uses saturated cast, not inf).
    if ax >= MAX_NORMAL:
        return sign * MAX_NORMAL

    # Below half the smallest subnormal: flush to zero.
    if ax < MIN_SUBNORM / 2:
        return 0.0

    # Decompose into exponent and significand.
    exp = math.floor(math.log2(ax))
    exp = max(exp, 1 - BIAS)                  # subnormal floor
    frac = ax / 2**exp                        # in [1, 2) for normals
    scaled = frac * 2**M_BITS                 # in [8, 16) for normals
    rounded = round(scaled)                   # round() sends ties to even
    if rounded == 2**(M_BITS + 1):            # carry into next exponent
        rounded //= 2
        exp += 1
    return sign * rounded * 2**(exp - M_BITS)

def fp8_matmul(A, B):
    """Simulate an FP8 GEMM: quantise inputs, multiply in FP32, return FP32."""
    quantize = np.vectorize(quantize_e4m3, otypes=[np.float32])
    qA = quantize(A)
    qB = quantize(B)
    return qA @ qB                            # FP32 inputs, FP32 accumulation

# Quick sanity check: cast a few canonical values.
for x in [1.0, 0.5, 0.1, 1e-3, 1e-5, 100.0, 500.0, -3.14]:
    q = quantize_e4m3(x)
    rel_err = abs(q - x) / max(abs(x), 1e-30)
    print(f"x = {x:>10.5g}   q(x) = {q:>10.5g}   rel.err = {rel_err:6.2%}")

# Matmul fidelity: random Gaussian inputs, compare to FP32 reference.
np.random.seed(0)
A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)
ref     = A @ B
fp8_out = fp8_matmul(A, B)
err     = ref - fp8_out
print(f"\nMatmul max abs error          : {np.abs(err).max():.3f}")
# Norm-wise error: an element-wise mean is dominated by near-zero reference entries.
print(f"Matmul rel error (norm-wise)  : {np.linalg.norm(err) / np.linalg.norm(ref):.2%}")
```

Notes on the code, in order:

FP8 E4M3: the format we are about to build by hand

E4M3 means 1 sign bit, 4 exponent bits, 3 mantissa bits, 8 bits per number in total. That is 4× less memory than FP32 and 2× less than BF16/FP16. The bias of 7 (i.e. 2^(E_BITS−1) − 1) centres the unbiased exponent around zero, giving normal values from 2^−6 up to 1.75 · 2^8. The whole format has only 256 bit patterns (including ±0 and NaN), so quantisation is no longer optional: every cast snaps to one of those buckets.

State: E_BITS = 4 (exponent width), M_BITS = 3 (mantissa width), BIAS = 7.

The boundary constants of FP8 E4M3

MAX_NORMAL is the largest finite FP8 value: 448. MIN_NORMAL is the smallest normal positive: about 1.56 × 10⁻². MIN_SUBNORM is the smallest non-zero we can represent at all: about 1.95 × 10⁻³. Compare with FP32, which spans ±10³⁸; FP8 E4M3 covers a little over five orders of magnitude. Any training value outside that window must be scaled into it or it becomes zero or saturated.

State: MAX_NORMAL = 448, MIN_NORMAL ≈ 0.0156, MIN_SUBNORM ≈ 0.00195.

Round-to-nearest-even cast

Every FP32 input is snapped to the nearest FP8 representable value, emulating a saturating hardware cast. Three special cases: zero stays zero; values larger than MAX_NORMAL saturate (DeepSeek's choice: clipping to ±448 prevents inf/NaN poisoning the matmul); values smaller than half a subnormal flush to zero.

State: sign = +1.0 or −1.0, ax = |x|.

Decompose into exponent and significand

Every normal float is x = sign · 2^exp · frac with frac ∈ [1, 2). We extract exp via log2, clamp it to the subnormal floor (1 − BIAS), then scale the fraction into a 3-bit integer by multiplying by 2^M_BITS. The rounded integer is the mantissa pattern that gets stored on chip. The carry branch (rounded == 2**(M_BITS + 1)) handles the case where rounding pushes the mantissa up into the next exponent bucket, the same trick IEEE 754 hardware uses.

State: exp = floor(log2(ax)), clamped to at least −6; frac = ax / 2^exp ∈ [1, 2) for normals; scaled = frac · 2^3 ∈ [8, 16) for normals; rounded = nearest integer, in {8..16} for normals before the carry and in {0..8} for subnormals.

Reconstruct the rounded FP8 value

Multiplying the rounded mantissa by 2^(exp − M_BITS) and re-applying the sign gives us the actual FP8 value as a Python float. In real hardware the encoded byte is what gets stored to HBM and what comes back into the tensor core; here we return the float so we can inspect the rounding error.

Simulating an H100 FP8 GEMM

This is the operation that makes FP8 worth the effort. Inputs A and B are quantised to FP8 before the multiplication, but the accumulation happens in FP32. The 4× memory reduction (relative to FP32) lives on the inputs; the precision lives in the accumulator. Section 10.2 will explain why even this is not enough on its own and section 10.3 will introduce per-block scaling, but the skeleton here has the same structure as the H100 operation.

State: qA = A quantised to FP8 E4M3; qB = B quantised to FP8 E4M3; qA @ qB = matmul, accumulated in FP32.

Casting canonical values

Watch how the relative error changes as |x| moves between FP8 buckets. Values like 1.0 land exactly on a representable point (error 0). Values like 0.1 sit between buckets and pick up a percent or two of relative error. Tiny values like 1e-5 are below MIN_SUBNORM/2 and flush to zero; that is the underflow failure mode section 10.2 analyses and section 10.3 attacks with per-block scaling.

Trace of selected values: x = 1.0 gives q(x) = 1.0 (exact); x = 0.1 gives q(x) = 0.1015625 (rel. err ≈ 1.6%); x = 1e-5 gives q(x) = 0 (underflow); x = 500 gives q(x) = 448 (saturation).

Matmul fidelity vs FP32 reference

Random Gaussian 128×128 matmul, FP32 reference vs FP8-quantised inputs. With a well-scaled distribution (variance ≈ 1), expect the error measured against the norm of the output to sit in the low single-digit percent. That headline number is what justifies the whole programme: GEMMs at 2× the BF16 rate and half the HBM traffic, with a precision penalty small enough to be absorbed by a higher-precision accumulator and per-block scaling.

PyTorch: Real FP8 Matmul on Hopper

Now the same idea on real silicon. PyTorch 2.1+ exposes Hopper's FP8 tensor cores through $\texttt{torch.\_scaled\_mm}$: give it FP8 inputs and a pair of scales, and you get back the matmul at roughly twice BF16's throughput. The leading underscore marks it as a private API, and its signature and return type have changed between PyTorch releases, so treat the call below as a sketch to adapt to your version. This is the kind of kernel that sits inside an FP8 forward and backward pass; later sections will wrap it in per-block scaling, but the raw call is here.

hopper_fp8_gemm.py

```python
import torch

# Hopper (H100/H200) and Blackwell GPUs expose two FP8 dtypes natively.
# E4M3: wider mantissa, lower range (used for every tensor in DeepSeek-V3).
# E5M2: wider range, lower mantissa (used for gradients in hybrid recipes).
e4m3 = torch.float8_e4m3fn
e5m2 = torch.float8_e5m2
FP8_MAX = 448.0                                        # largest finite E4M3 value

device = "cuda"
M, K, N = 4096, 4096, 4096

# 1) Allocate FP32 tensors with realistic transformer-block scale.
A_fp32 = torch.randn(M, K, device=device) * 0.02      # weights ~ N(0, 0.02)
B_fp32 = torch.randn(K, N, device=device) * 1.0       # activations ~ N(0, 1)

# 2) Per-tensor amax scale: map the largest abs value to FP8 max (448 for E4M3).
amax_A = A_fp32.abs().amax()
amax_B = B_fp32.abs().amax()
scale_A = FP8_MAX / amax_A.clamp(min=1e-12)
scale_B = FP8_MAX / amax_B.clamp(min=1e-12)

# 3) Quantise: multiply by scale, clamp, cast to FP8.
#    PyTorch's float8 cast does not saturate, so clamp explicitly.
A_fp8 = (A_fp32 * scale_A).clamp(-FP8_MAX, FP8_MAX).to(e4m3)
B_fp8 = (B_fp32 * scale_B).clamp(-FP8_MAX, FP8_MAX).to(e4m3)
# _scaled_mm expects the second operand in column-major layout.
B_fp8 = B_fp8.t().contiguous().t()

# 4) FP8 matmul on Hopper tensor cores, accumulated in higher precision.
#    Private API: check the signature for your PyTorch version.
out_bf16 = torch._scaled_mm(
    A_fp8, B_fp8,
    scale_a=1.0 / scale_A,                # inverse scales applied to output
    scale_b=1.0 / scale_B,
    out_dtype=torch.bfloat16,
)
if isinstance(out_bf16, tuple):           # older releases returned (out, amax)
    out_bf16 = out_bf16[0]

# 5) Reference and diff against an FP32 matmul.
ref = A_fp32 @ B_fp32
err = out_bf16.float() - ref
print(f"FP8 GEMM  : {M}x{K} @ {K}x{N}")
print(f"max abs err: {err.abs().max().item():.4f}")
print(f"rel err (norm-wise): {(err.norm() / ref.norm()).item():.2%}")

# 6) Memory footprint comparison.
def mb(t): return t.element_size() * t.numel() / 1024**2
print(f"\nFP32 weights : {mb(A_fp32):8.2f} MB")
print(f"FP8  weights : {mb(A_fp8):8.2f} MB  ({mb(A_fp32) / mb(A_fp8):.1f}x smaller)")
```

Notes on the code, in order:

Why two FP8 dtypes?

Hopper exposes E4M3 and E5M2. E4M3 has a wider mantissa (better precision) but narrower range, a good fit for activations and weights, which are roughly Gaussian with a known scale. E5M2 has a wider range but coarser precision, a good fit for gradients, which can span many orders of magnitude during a training step. Hybrid recipes use E4M3 for forward GEMMs and E5M2 for gradients; DeepSeek-V3 instead uses E4M3 for all tensors and relies on fine-grained scaling for range.

State: torch.float8_e4m3fn = FP8 E4M3, range ≈ ±448; torch.float8_e5m2 = FP8 E5M2, range ≈ ±57344.

Realistic transformer-block scale

We deliberately use a 4096×4096×4096 matmul because that is the canonical shape of a frontier-model MLP layer (hidden dim ≈ 4–14k, batch×seq tokens ≈ 4k+). Weights are initialised with std 0.02 (the Megatron/GPT-NeoX default) and activations with std 1.0 (typical post-LayerNorm). FP8 must accommodate both, which is exactly why per-tensor scaling exists.

State: A_fp32 = weights, shape (M, K), std ≈ 0.02; B_fp32 = activations, shape (K, N), std ≈ 1.0.

Per-tensor amax scaling

Find the largest absolute value in the tensor. Choose a scale that maps that amax to FP8's max representable value (448 for E4M3). Now every element of the tensor lies inside the FP8 range, with the largest one pinned to the format ceiling, using the full dynamic range of FP8 rather than wasting bits on headroom. This is the standard per-tensor scaling recipe popularised by NVIDIA Transformer Engine.

State: amax_A = max|A|, scalar; scale_A = 448 / amax_A, scalar.

The quantise-and-cast

Multiplying by the scale brings the tensor into FP8's range, then .to(e4m3) performs the round-to-nearest-even cast our Python quantize_e4m3 emulated. Unlike that emulation, PyTorch's cast does not saturate, so the code clamps to ±448 first. After this step A_fp8 is a real FP8 tensor in HBM, 4× smaller than A_fp32. The second operand is re-laid-out as column-major because the kernel requires it.

State: A_fp8 = FP8 E4M3, shape (M, K), row-major; B_fp8 = FP8 E4M3, shape (K, N), column-major.

torch._scaled_mm: the Hopper FP8 GEMM

This is the kernel that runs on H100/H200 tensor cores at roughly 2× the throughput of BF16. It takes FP8 inputs, accumulates their products in higher precision (precision lives here), un-scales the output by the inverse scales, and casts down to BF16 (or FP32, or FP16, your choice) for the next layer. Without a high-precision accumulator, accumulation rounding error swamps the matmul. Section 10.3 will explain why even the accumulation Hopper provides is not enough at frontier scale.

State: scale_a = 1 / scale_A (un-scale output); scale_b = 1 / scale_B (un-scale output); out_bf16 = result, shape (M, N), BF16 dtype.

Reference and diff

Compute the same matmul in FP32 as the reference. The output has a standard deviation of about 1.3 (0.02 · √4096), and with this scaling the norm-wise relative error should sit in the low single-digit percent. For a single layer this looks small; the question that drives the rest of the chapter is whether the error compounds catastrophically across 60+ transformer layers and millions of training steps, and what to do about it.

Memory footprint, the headline win

An FP32 4096×4096 weight matrix is 64 MB. The same matrix in FP8 is 16 MB, 4× smaller. Multiply across all the GEMMs of a 671 B-parameter model and the savings unlock training runs that were previously memory-bound: more parameters per GPU, larger micro-batches, longer sequences.

Trace: FP32 weights ≈ 64 MB; FP8 weights ≈ 16 MB (4× smaller).

At Massive Scale: Why DeepSeek-V3 Bet the Run on FP8

DeepSeek-V3 is the largest open-weight model trained primarily in FP8 to date: 671 B total parameters, 37 B activated per token, 14.8 T training tokens. The paper reports the run on 2048 H800 GPUs for about two months, at a quoted cost of 5.576 million USD in compute. Without FP8, neither the memory budget nor the time budget closes. Four constraints pushed the team to FP8.

Memory ceiling. 2048 H800s give about 164 TB of total HBM. Holding 671 B parameters in BF16 weights + BF16 gradients + FP32 master weights + FP32 Adam moments costs $(2 + 2 + 4 + 8) N = 16 N \approx 10.7 \, \text{TB}$, about 6.5% of total HBM, before activations, KV cache, or expert routing buffers. With FP8 weights and FP8 gradients that drops to $14 N \approx 9.4 \, \text{TB}$ , and DeepSeek further shards optimiser state across DP ranks so the per-GPU footprint fits with comfortable margin.
Throughput ceiling. H800s deliver 989 BF16 TFLOPS and 1979 FP8 TFLOPS per GPU. The paper's 2.788 million GPU-hours on 2048 GPUs work out to roughly 57 days of wall-clock time. If the GEMM-dominated share of that time ran at BF16 rates instead, the run would stretch toward twice as long, well past the team's available compute window.
Communication ceiling. Expert parallelism requires all-to-all every layer. The volume of those collectives is proportional to the activation tensor size, which is proportional to the precision width. FP8 halves the wire bytes relative to BF16, taking the network from co-bottleneck to comfortably non-binding — a key reason DeepSeek can fit fine-grained MoE inside their fabric budget.
Iteration speed. Most of the run cost is not the headline pre-training pass — it is the dozens of ablations, recipe sweeps, and recovery restarts that precede and surround it. A 2× compute speed-up at the same loss is a 2× research velocity gain across the entire programme. Frontier labs notice.
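The memory-ceiling arithmetic from the first item, written out so the recipes can be compared side by side. This is a sketch using the byte counts quoted in the list; the recipe names and the `footprint` helper are illustrative, and real frameworks shard and offload state in ways this ignores.

```python
N_PARAMS = 671e9
TOTAL_HBM = 2048 * 80e9      # bytes across the whole cluster

# Bytes per parameter for each piece of training state.
RECIPES = {
    "BF16 mixed": {"weights": 2, "grads": 2, "fp32_master": 4, "adam_moments": 8},
    "FP8 mixed":  {"weights": 1, "grads": 1, "fp32_master": 4, "adam_moments": 8},
}

def footprint(recipe):
    """Return (bytes per parameter, total bytes, share of cluster HBM)."""
    per_param = sum(recipe.values())
    total = per_param * N_PARAMS
    return per_param, total, total / TOTAL_HBM

for name, recipe in RECIPES.items():
    per_param, total, share = footprint(recipe)
    print(f"{name:10s} {per_param:2d} B/param  {total / 1e12:5.1f} TB  {share:.1%} of HBM")
```

The FP32 master weights and Adam moments dominate both rows, which is why the static-state saving is modest and the larger wins come from activations, communication volume and GEMM throughput.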

The shape of the FP8 forward pass at scale

Inside a single DeepSeek-V3 transformer block, here is what is actually FP8 and what is not:

Operation	Input dtype	Output dtype	Why
MLA QKV projection	FP8 E4M3	BF16	GEMM dominates; quality preserved by per-block scale
Attention softmax + scores	BF16	BF16	Softmax exp/normalise needs range, not throughput
MoE expert MLP up-proj	FP8 E4M3	BF16	Biggest GEMM in the model; biggest FP8 win
MoE expert MLP down-proj	FP8 E4M3	BF16	Same
Backward weight gradient	FP8 E4M3	FP32	DeepSeek-V3 keeps E4M3 here as well; fine-grained scaling supplies the range E5M2 would otherwise provide
Optimiser update (Adam)	FP32	FP32	Loss of precision here corrupts the master state
LayerNorm / RMSNorm	BF16	BF16	Reduction-heavy; FP8 saves nothing meaningful

Roughly 60–70% of the FLOPs in a transformer block are GEMMs that can run in FP8. Softmax, normalisation, and optimiser updates stay in higher precision — a hybrid recipe that captures most of the speed-up while leaving the numerically sensitive stages alone.
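How much end-to-end speed-up does a hybrid recipe buy if only the GEMM share runs faster? An Amdahl's-law estimate gives a rough answer. The sketch assumes, simplistically, that time is proportional to FLOPs and that FP8 GEMMs run exactly 2× faster than BF16; real non-GEMM stages are often memory-bound, so treat the numbers as an upper-level intuition only. `block_speedup` is an illustrative name.

```python
def block_speedup(gemm_fraction, gemm_speedup=2.0):
    """Amdahl's law: overall speed-up when only a fraction of the work accelerates."""
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)

for f in (0.6, 0.7, 0.9):
    print(f"GEMM share {f:.0%}: block speed-up {block_speedup(f):.2f}x")
```

At a 60–70% GEMM share the block speeds up by roughly 1.4–1.5× rather than 2×, which is why the stages left in higher precision matter for the overall budget.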

Engineering Reality: What FP8 Buys, What It Costs

FP8 is not free. The four engineering costs you sign up for the moment you turn it on:

Hardware lock-in. FP8 is only fast on Hopper (H100/H200) and Blackwell (B100/B200/GB200). Ampere (A100) and earlier generations have no FP8 tensor cores — an FP8 cast on A100 saves memory but not compute. Cross-cluster training (mixed hardware) becomes painful.
Numerical fragility. A single un-scaled activation outlier can saturate a whole block's GEMM, and gradient distributions shift across training — what was a safe scale at step 1k can be wrong at step 100k. Sections 10.2 and 10.3 are the cost of this fragility: per-block scaling, dynamic scale tracking, and tile-level outlier handling.
Debugging is worse. "The loss spiked at step 47k" in BF16 usually means "a data shard was bad". In FP8 it can also mean "the scale for layer 38 drifted" or "an outlier in expert 47's input overflowed". New failure mode taxonomy, new monitoring infra.
Framework support is uneven. PyTorch supports $\texttt{torch.\_scaled\_mm}$ but not (yet) FP8 autograd everywhere. Most production FP8 training stacks (TransformerEngine, DeepSeek's in-house kernels, Megatron-LM-FP8) reach into custom CUDA. Off-the-shelf is not yet enough.

Against those four costs, the benefits compound:

Benefit	Magnitude	Where it shows up
GEMM throughput	≈ 2× vs BF16	Wall-clock time of pre-training
Activation memory	≈ 0.5× vs BF16	Max sequence length, max batch
Weight memory	≈ 0.5× vs BF16	Number of parameters per GPU
Network bytes (collective ops)	≈ 0.5× vs BF16	All-to-all latency for MoE
Power per training token	≈ 0.5× vs BF16	Total energy and dollar cost

The deepest lesson. Every order of magnitude of scaling has historically required a precision step-down. Single GPUs used FP32. The first multi-GPU runs (Megatron, GPT-3) moved to FP16/BF16 with master weights. Frontier scale (DeepSeek-V3, Grok-2, internal labs at OpenAI and Anthropic) is moving to FP8. The next scale-up almost certainly involves FP4 or block-floating-point formats: the same engineering programme, one step narrower. The lesson of this chapter is not "use FP8"; it is to learn to engineer training around precision, because the next format is already on the roadmap.