The DeepSeek-V3 paper reports that training a 671 B-parameter model on 14.8 T tokens cost about 2.788 million H800 GPU-hours in total, of which about 2.664 million went to pre-training. Without FP8, the same run would have cost roughly twice that and, more importantly, would not have fit on the 2048 H800s the team had access to. FP8 is not a micro-optimisation. It is the difference between training a frontier model and watching one train somewhere else. This section makes the case for FP8 before the next two sections explain why naïve FP8 fails and how DeepSeek made it work.
The Real Problem: 671 B Parameters Don't Fit
A frontier-scale dense or MoE model carries an enormous amount of state through every training step. For each parameter you store the weight itself, the gradient, two Adam optimiser moments, and a slice of the forward-pass activations needed for backprop. With everything in pure FP32, the weight and the two Adam moments cost 12 bytes per parameter, so you are looking at roughly 8 TB of weight-plus-optimiser state alone, before activations and the KV cache that long-context training demands.
An H800 GPU has 80 GB of HBM. To hold that FP32 state for a 671 B model you would need about 100 GPUs just to park the weights and optimiser moments, before you spend a single FLOP on training. Add activations, gradients in transit, and all-to-all buffers for expert parallelism and the floor climbs to many hundreds of GPUs, at which point the network becomes the bottleneck and tensor cores starve.
The old fix was 16-bit weights with FP32 master copies (Micikevicius et al. 2018 used FP16; BF16 is the common choice today). That cuts forward-pass memory in half. Hopper's FP8 tensor cores cut it in half again and double matmul throughput at the same time. For DeepSeek-V3's 2048-GPU cluster, FP8 is the only path that closes both the memory budget and the compute budget at once.
| Format | Bits | Range (≈) | Mantissa | H100 GEMM TFLOPS |
|---|---|---|---|---|
| FP32 | 32 | 10⁻³⁸ to 10³⁸ | 23 bits | 67 |
| TF32 | 19 | 10⁻³⁸ to 10³⁸ | 10 bits | 495 |
| BF16 | 16 | 10⁻³⁸ to 10³⁸ | 7 bits | 989 |
| FP16 | 16 | 6·10⁻⁵ to 6·10⁴ | 10 bits | 989 |
| FP8 E4M3 | 8 | 2·10⁻³ to 448 | 3 bits | 1979 |
| FP8 E5M2 | 8 | 1.5·10⁻⁵ to 57 344 | 2 bits | 1979 |
Intuition: Pay for Bits That Earn Their Keep
A floating-point number is a budget. Each bit you spend either buys you range (the orders of magnitude you can represent — the exponent) or precision (how finely you can split each order of magnitude — the mantissa). FP32 buys both lavishly. FP8 forces a brutal trade-off: with only eight bits total, you cannot afford both wide range and fine resolution.
Think of the bits as ruler markings. FP32 is a long ruler with $2^{23} \approx 8.4$ million tick marks between consecutive powers of two, so fine you cannot see the ticks. BF16 keeps the same length but uses only $2^{7} = 128$ ticks per octave. FP8 E4M3 has only $2^{3} = 8$ ticks per octave and a much shorter ruler, between five and six decades wide. Any value outside that ruler becomes zero or saturates; any value inside the ruler gets snapped to the nearest of the eight ticks in its octave.
The bet of FP8 training is that, with the right scaling, almost all the numbers a transformer block needs to multiply land inside FP8's range, and that the few-percent rounding error per multiply is recoverable by a higher-precision accumulator. The rest of this chapter is the engineering required to make that bet true.
The Anatomy of a Float: FP32, BF16, FP16, FP8
Every IEEE-style float encodes a value as:
$$x = (-1)^{s} \times 2^{\,e - \text{bias}} \times \left(1 + \frac{m}{2^{M}}\right)$$
with $s$ the sign, $e$ a non-negative integer stored in $E$ exponent bits, and $m$ the mantissa stored in $M$ bits. The bias centres the exponent range around 1 so that both very small and very large numbers are representable.
The dynamic range of the format (the ratio of max to min representable normal) is roughly:
$$\text{range} \approx 2^{\,2^{E} - 1}$$
and the relative resolution within a single octave (the smallest fractional change you can resolve) is:
$$\varepsilon = 2^{-M}$$
Substitute the four formats and watch the trade-off snap into focus:
| Format | E | M | Top of range ≈ 2^e_max (max value) | ε = 2^−M |
|---|---|---|---|---|
| FP32 | 8 | 23 | ≈ 2¹²⁷ (10³⁸) | ≈ 1.2 · 10⁻⁷ |
| BF16 | 8 | 7 | ≈ 2¹²⁷ (10³⁸) | ≈ 7.8 · 10⁻³ |
| FP16 | 5 | 10 | ≈ 2¹⁵ (3·10⁴) | ≈ 9.8 · 10⁻⁴ |
| FP8 E4M3 | 4 | 3 | ≈ 2⁸ (448) | ≈ 1.25 · 10⁻¹ |
| FP8 E5M2 | 5 | 2 | ≈ 2¹⁵ (57 344) | ≈ 2.5 · 10⁻¹ |
Read the rows. BF16 has FP32's range with $2^{16}$× (about 65 000×) coarser precision, which is fine for the forward pass because the rounding error per multiply is absorbed by an FP32 accumulator. FP16 is 8× finer than BF16 but has a much narrower range, which is why FP16 training requires loss scaling to keep gradients from underflowing.
FP8 E4M3 has even narrower range AND much coarser precision: it gives up about 36 orders of magnitude at the top of the range versus BF16 (448 against ≈ 3.4 · 10³⁸) and 16× of precision. FP8 E5M2 trades a precision bit back for a range bit, recovering FP16's range but with only 4 distinct values per octave. Neither format can survive on its own without scaling; the engineering question is what kind of scaling.
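The table's columns follow mechanically from the bit counts. The sketch below assumes IEEE-style conventions for every format except E4M3, which is modelled on the variant with a 448 maximum used throughout this section. The `FloatFormat` class and its field names are illustrative, not taken from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    """Illustrative description of a binary float layout (hypothetical helper)."""
    name: str
    exp_bits: int
    man_bits: int
    ieee_like: bool = True  # False: the top exponent code still holds normal values

    @property
    def bias(self) -> int:
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_value(self) -> float:
        top_code = 2 ** self.exp_bits - 1
        if self.ieee_like:
            # The top exponent code is reserved for inf/NaN.
            e_max = (top_code - 1) - self.bias
            significand = 2.0 - 2.0 ** -self.man_bits
        else:
            # Only the all-ones mantissa in the top code is NaN, so the
            # largest usable significand is one step below all-ones.
            e_max = top_code - self.bias
            significand = 2.0 - 2.0 ** (1 - self.man_bits)
        return significand * 2.0 ** e_max

    @property
    def min_normal(self) -> float:
        return 2.0 ** (1 - self.bias)

    @property
    def min_subnormal(self) -> float:
        return 2.0 ** (1 - self.bias - self.man_bits)

    @property
    def epsilon(self) -> float:
        return 2.0 ** -self.man_bits


FORMATS = [
    FloatFormat("FP32", 8, 23),
    FloatFormat("BF16", 8, 7),
    FloatFormat("FP16", 5, 10),
    FloatFormat("FP8 E4M3", 4, 3, ieee_like=False),
    FloatFormat("FP8 E5M2", 5, 2),
]

for f in FORMATS:
    print(f"{f.name:9s} max={f.max_value:.4g}  min_normal={f.min_normal:.3g}  "
          f"min_subnormal={f.min_subnormal:.3g}  eps={f.epsilon:.3g}")
```

Changing `exp_bits` and `man_bits` while holding their sum fixed makes the range-versus-precision trade visible one bit at a time.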
The two scaling regimes
The simplest fix is per-tensor scaling: multiply the whole tensor by a constant before casting to FP8 so that its amax lands at FP8's max value. One scale per tensor, one extra FP32 scalar to store. Works fine for weights, where the distribution is stable across steps.
It fails for activations. After a non-linearity, a single outlier can sit thousands of times above the bulk of the distribution. E4M3's normal range spans only about 2¹⁵ (from 2⁻⁶ to 448), so pinning that amax to 448 pushes the bulk into the subnormal range, where values keep one or two significant bits or flush to zero, and relative error can exceed 50%. The DeepSeek-V3 paper's answer is per-block scaling (a separate scale per 128-element block of the tensor) plus higher-precision accumulation, the topic of section 10.3. For now, the case for FP8 stops at: the hardware is there, the memory savings are there, the throughput is there, and the scaling problem is solvable.
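The two regimes can be compared directly in simulation. The sketch below is a plain-Python emulation, not DeepSeek's kernel: `quantize_e4m3` is a hypothetical helper that rounds to the nearest E4M3 value and saturates at ±448, the 128-element block size comes from the text, and the single outlier (set here to 10⁴ times the typical magnitude, an arbitrary choice) is planted by hand.

```python
import math
import random

E4M3_MAX = 448.0      # largest finite E4M3 value
E4M3_MIN_EXP = -6     # exponent of the smallest normal value
E4M3_MAN_BITS = 3


def quantize_e4m3(x: float) -> float:
    """Round a finite x to the nearest E4M3 value (ties to even), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)                    # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)          # below 2**-6 we sit in the subnormal octave
    step = 2.0 ** (exp - E4M3_MAN_BITS)     # spacing of representable values here
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def per_tensor(values):
    """One scale for the whole list: map amax to 448, cast, then undo the scale."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0.0 else 1.0
    return [quantize_e4m3(v * scale) / scale for v in values]


def per_block(values, block=128):
    """A separate scale for each contiguous block of `block` elements."""
    out = []
    for start in range(0, len(values), block):
        out.extend(per_tensor(values[start:start + block]))
    return out


def mean_rel_error(ref, approx):
    errs = [abs(a - r) / abs(r) for r, a in zip(ref, approx) if r != 0.0]
    return sum(errs) / len(errs)


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1024)]
x[7] = 1.0e4  # hypothetical activation outlier

err_tensor = mean_rel_error(x, per_tensor(x))
err_block = mean_rel_error(x, per_block(x))
print(f"per-tensor mean relative error: {err_tensor:.3%}")
print(f"per-block  mean relative error: {err_block:.3%}")
```

Because E4M3 is itself a floating-point format, the per-tensor penalty depends on how far the outlier pushes the bulk toward the subnormal range. Varying the planted value shows how the gap between the two regimes opens up.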
Manual Numerical Walkthrough
Two examples, calculated by hand: casting a handful of values to FP8 E4M3, and sizing the memory savings on a 671 B-parameter MoE like DeepSeek-V3.
Casting to FP8 E4M3 by hand
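Setup. FP8 E4M3 has 1 sign bit, 4 exponent bits, 3 mantissa bits. Bias is $2^{4-1} - 1 = 7$. Exponent code 0 is reserved for zero and subnormals. Unlike IEEE formats, code 15 still holds normal values; only the all-ones pattern (exponent 1111, mantissa 111) encodes NaN, and there is no infinity. The unbiased exponent therefore runs from $-6$ to $8$, and the largest value is $1.75 \times 2^{8} = 448$.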
Example 1: cast x = 1.3. First find the octave: $2^{0} \le 1.3 < 2^{1}$, so the unbiased exponent is $0$, i.e. stored $0 + 7 = 7$. The significand is $1.3 / 2^{0} = 1.3$; subtract the implicit leading 1 to get the stored mantissa fraction $0.3$. Scale to the 3-bit mantissa: $0.3 \times 2^{3} = 2.4$; round to $2$. Reconstructed value: $2^{0} \times (1 + 2/8) = 1.25$. Relative error: $|1.25 - 1.3| / 1.3 \approx 3.8\%$.
Example 2: cast x = 500. Above MAX_NORMAL = 448. DeepSeek's saturated cast clips to 448. Relative error: $|448 - 500| / 500 = 10.4\%$. This is why outlier handling matters: a few percent of activations past the saturation threshold can move a whole layer's loss by a measurable amount.
Example 3: cast x = 1.5 · 10⁻⁵. Below MIN_SUBNORM / 2 ≈ 9.7 · 10⁻⁴, so it flushes to zero. Relative error 100%. This is the underflow failure mode FP16 famously suffers from during gradient computation; FP8 sees it at activation magnitudes too, which is why E5M2 exists for gradients (its smallest subnormal is 2⁻¹⁶ ≈ 1.5 · 10⁻⁵, just enough to keep this particular value alive).
Example 4: cast x = 0.0. Stays zero. Useful because masked positions in attention, dropped tokens in MoE routing, and padding tokens all produce exact zeros, and FP8 represents zero exactly with no rounding error.
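Memory sizing for DeepSeek-V3. Total parameters $N = 671 \times 10^{9}$. In FP32: $N \times 4\ \text{B} \approx 2.68$ TB for the forward weight copy alone. In BF16: $N \times 2\ \text{B} \approx 1.34$ TB. In FP8: $N \times 1\ \text{B} \approx 0.67$ TB. Across 2048 H800 GPUs, FP8 frees $0.67\ \text{TB} / 2048 \approx 0.33$ GB per GPU compared with BF16. That headroom directly funds longer sequences, larger micro-batches, and more in-flight expert routing buffers.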
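Throughput sizing. An H100/H800 does $989$ BF16 TFLOPS and $1979$ FP8 TFLOPS. Taking the dense-model rule of thumb $6ND$ with all 671 B parameters, DeepSeek-V3 spent about $6 \times 671 \times 10^{9} \times 14.8 \times 10^{12} \approx 6.0 \times 10^{25}$ training FLOPs; this overcounts, because the MoE activates only 37 B parameters per token. In BF16 at 50% MFU on 2048 H800s this would take $\approx 5.9 \times 10^{7}$ seconds ≈ 16 400 hours per GPU ≈ 33.5 million GPU-hours total. In FP8 at the same MFU, $\approx 16.7$ million GPU-hours. The paper reports 2.788 million; the gap to our crude estimate is real, but the 2× direction is exactly what shows up in the budget.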
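The sizing arithmetic is easy to replay. The sketch below assumes the common 6·N·D rule of thumb for training FLOPs (an assumption for illustration, not a figure from the paper) and the dense peak-TFLOPS numbers used in this section. Function and variable names are made up for the example.

```python
PARAMS = 671e9           # total parameters (from the text)
ACTIVE_PARAMS = 37e9     # parameters activated per token (from the text)
TOKENS = 14.8e12
GPUS = 2048
PEAK_TFLOPS = {"bf16": 989.0, "fp8": 1979.0}  # dense per-GPU peaks used in this section


def weight_terabytes(bytes_per_param: float) -> float:
    """Size of one copy of the weights, in TB (10**12 bytes)."""
    return PARAMS * bytes_per_param / 1e12


def gpu_hours(total_flops: float, dtype: str, mfu: float) -> float:
    """GPU-hours needed at a given fraction of peak throughput."""
    sustained_flops_per_s = PEAK_TFLOPS[dtype] * 1e12 * mfu
    return total_flops / sustained_flops_per_s / 3600.0


for name, width in [("FP32", 4), ("BF16", 2), ("FP8", 1)]:
    print(f"{name:5s} weights: {weight_terabytes(width):.2f} TB")

saved_gb_per_gpu = (weight_terabytes(2) - weight_terabytes(1)) * 1e3 / GPUS
print(f"FP8 vs BF16 weight saving: {saved_gb_per_gpu:.3f} GB per GPU")

# Crude upper bound: treat every parameter as active on every token.
flops_all_params = 6 * PARAMS * TOKENS
# Tighter MoE estimate: only the activated parameters do work per token.
flops_active = 6 * ACTIVE_PARAMS * TOKENS

bf16_hours = gpu_hours(flops_all_params, "bf16", mfu=0.5)
fp8_hours = gpu_hours(flops_all_params, "fp8", mfu=0.5)
print(f"all-params estimate: BF16 {bf16_hours / 1e6:.1f} M, FP8 {fp8_hours / 1e6:.1f} M GPU-hours")
print(f"active-params estimate, FP8: {gpu_hours(flops_active, 'fp8', 0.5) / 1e6:.2f} M GPU-hours")
```

Whichever FLOP count is used, the BF16-to-FP8 ratio stays at roughly 2×, because it depends only on the two peak-throughput figures.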
Visualizing Range, Precision, and Memory
Three widgets, each isolating one of FP8's trade-offs. The first compares the representable range of FP32, BF16, FP16, and FP8 against the magnitudes that actually appear in a transformer block. The second is a bit-level explorer — flip bits, watch the value change, see how the sign-exponent-mantissa split builds the number. The third sizes the whole-model memory bill for every realistic precision recipe.
Flip the lowest mantissa bit on FP8 E4M3 and watch the value jump by ~12.5%: that is $\varepsilon = 2^{-3}$ in action. Flip the same bit on FP32 and the value moves by $2^{-23} \approx 1.2 \times 10^{-7}$. This is the precision cost of the format, in one click.
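The bit-level view can be approximated in a few lines. This sketch assumes the E4M3 layout described in this section (bias 7, maximum 448, one NaN pattern per sign, no infinities); `decode_e4m3` is a hypothetical helper name.

```python
import math


def decode_e4m3(code: int) -> float:
    """Decode one 8-bit E4M3 pattern (0..255) into a Python float."""
    if not 0 <= code <= 0xFF:
        raise ValueError("code must fit in 8 bits")
    sign = -1.0 if code & 0x80 else 1.0
    exp_field = (code >> 3) & 0xF
    man_field = code & 0x7
    if exp_field == 0xF and man_field == 0x7:
        return math.nan                                   # the single NaN pattern
    if exp_field == 0:
        return sign * (man_field / 8.0) * 2.0 ** -6       # zero and subnormals
    return sign * (1.0 + man_field / 8.0) * 2.0 ** (exp_field - 7)


one = 0b0_0111_000                 # sign 0, exponent 7, mantissa 000 -> 1.0
flipped = one ^ 0b1                # flip the lowest mantissa bit
print(decode_e4m3(one), "->", decode_e4m3(flipped))

# Every representable value in the octave [1, 2): the "ticks on the ruler".
octave = [decode_e4m3(c) for c in range(256) if 1.0 <= decode_e4m3(c) < 2.0]
print(octave)

finite = {decode_e4m3(c) for c in range(256) if not math.isnan(decode_e4m3(c))}
print(f"{len(finite)} distinct finite values, largest {max(finite)}")
```

Flipping higher bits in `one` (exponent or sign) moves the value by octaves rather than ticks, which is the range-versus-precision split made concrete.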
Plain Python: Simulating FP8 Quantisation
Before we call into a Hopper tensor core, let us build the cast by hand. The function below is a faithful emulation of what the hardware does to every number: round-to-nearest-even, saturate at ±448, flush sub-subnormals to zero. Then we use it to simulate an FP8 matmul against an FP32 reference and measure the rounding error a transformer block has to live with.
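A minimal pure-Python sketch of such a cast and matmul experiment follows. It is an emulation of the three behaviours named above (round-to-nearest-even, saturation at ±448, flush to zero below half the smallest subnormal), not the hardware's exact algorithm. It assumes per-tensor scaling, and Python floats (double precision) stand in for both the FP32 reference and the accumulator. All function names are hypothetical.

```python
import math
import random

E4M3_MAX = 448.0
E4M3_MIN_NORMAL_EXP = -6
E4M3_MANTISSA_BITS = 3


def quantize_e4m3(x: float) -> float:
    """Emulated cast of a finite float to E4M3, returned as a Python float."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)                          # saturate instead of overflowing
    _, e = math.frexp(a)                               # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_NORMAL_EXP)              # clamp into the subnormal octave
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)           # tick spacing in this octave
    q = min(round(a / step) * step, E4M3_MAX)          # round() rounds halves to even
    return math.copysign(q, x)


def cast_matrix(mat):
    """Per-tensor scaled cast: returns (quantised values, scale)."""
    amax = max(abs(v) for row in mat for v in row)
    scale = E4M3_MAX / amax if amax > 0.0 else 1.0
    return [[quantize_e4m3(v * scale) for v in row] for row in mat], scale


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def fp8_matmul(a, b):
    """Quantise both operands, multiply with a high-precision accumulator, rescale."""
    qa, sa = cast_matrix(a)
    qb, sb = cast_matrix(b)
    return [[v / (sa * sb) for v in row] for row in matmul(qa, qb)]


def rel_frobenius_error(ref, approx):
    num = sum((r - p) ** 2 for rr, pr in zip(ref, approx) for r, p in zip(rr, pr))
    den = sum(r ** 2 for rr in ref for r in rr)
    return math.sqrt(num / den)


for x in (1.3, 500.0, 1.5e-5, 0.0):
    print(f"{x!r:>8} -> {quantize_e4m3(x)!r}")

random.seed(0)
M, K, N = 16, 64, 16
a = [[random.gauss(0.0, 1.0) for _ in range(K)] for _ in range(M)]
b = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(K)]
err = rel_frobenius_error(matmul(a, b), fp8_matmul(a, b))
print(f"relative Frobenius error of the emulated FP8 matmul: {err:.3%}")
```

The first loop replays the four hand-cast examples from the walkthrough; the second part measures how much of the per-element rounding error survives a full dot product when accumulation is done in higher precision.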
PyTorch: Real FP8 Matmul on Hopper
Now the same idea on real silicon. PyTorch 2.1+ exposes Hopper's FP8 tensor cores through `torch._scaled_mm`, a private API whose signature has changed between releases: give it FP8 inputs, an FP32 accumulator, and a pair of scales, and you get back the matmul at roughly twice BF16's throughput. This is the kind of kernel that sits inside DeepSeek-V3's forward and backward passes; later sections will wrap it in per-block scaling, but the raw call is here.
At Massive Scale: Why DeepSeek-V3 Bet the Run on FP8
DeepSeek-V3 is the largest open-weight model trained primarily in FP8 to date: 671 B total parameters, 37 B activated per token, 14.8 T training tokens. The paper reports the run on 2048 H800 GPUs for about two months, at a quoted cost of 5.576 million USD in compute. Without FP8, neither the memory budget nor the time budget closes. Four constraints pushed the team to FP8.
- Memory ceiling. 2048 H800s give about 160 TB of total HBM. Holding 671 B parameters in BF16 weights + BF16 gradients + FP32 master weights + FP32 Adam moments costs $671 \times 10^{9} \times 16\ \text{B} \approx 10.7$ TB, or 6.7% of total HBM, before activations, KV cache, or expert routing buffers. With FP8 weights and FP8 gradients that drops to $671 \times 10^{9} \times 14\ \text{B} \approx 9.4$ TB, and DeepSeek further shards optimiser state across DP ranks so the per-GPU footprint fits with comfortable margin.
- Throughput ceiling. H800s deliver 989 BF16 TFLOPS and 1979 FP8 TFLOPS per GPU. The paper's 2.664 million pre-training GPU-hours, spread over 2048 GPUs, come to about 54 days of wall-clock time. Run the same GEMMs in BF16 at the same utilisation and you roughly double that, well past the team's available compute window.
- Communication ceiling. Expert parallelism requires all-to-all every layer. The volume of those collectives is proportional to the activation tensor size, which is proportional to the precision width. FP8 halves the wire bytes relative to BF16, taking the network from co-bottleneck to comfortably non-binding — a key reason DeepSeek can fit fine-grained MoE inside their fabric budget.
- Iteration speed. Most of the run cost is not the headline pre-training pass — it is the dozens of ablations, recipe sweeps, and recovery restarts that precede and surround it. A 2× compute speed-up at the same loss is a 2× research velocity gain across the entire programme. Frontier labs notice.
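The memory and time ceilings above reduce to a few multiplications. The sketch below uses the byte widths exactly as listed in the memory bullet (which may differ from DeepSeek's actual optimiser layout) and the exact 2048 × 80 GB = 163.84 TB of HBM rather than the rounded 160 TB, so the percentages come out slightly lower. The recipe names are illustrative.

```python
PARAMS = 671e9
GPUS = 2048
HBM_PER_GPU_BYTES = 80e9
TOTAL_GPU_HOURS = 2.788e6   # full-run figure quoted from the paper

# Bytes stored per parameter for each piece of training state.
RECIPES = {
    "BF16 weights/grads": {"weight": 2, "grad": 2, "master": 4, "adam_m": 4, "adam_v": 4},
    "FP8 weights/grads": {"weight": 1, "grad": 1, "master": 4, "adam_m": 4, "adam_v": 4},
}


def state_bytes(recipe: dict) -> float:
    """Total bytes of weights, gradients, master copy, and Adam moments."""
    return PARAMS * sum(recipe.values())


total_hbm = GPUS * HBM_PER_GPU_BYTES
for name, recipe in RECIPES.items():
    size = state_bytes(recipe)
    print(f"{name:20s} {size / 1e12:6.2f} TB  ({size / total_hbm:.1%} of cluster HBM)")

wall_clock_days = TOTAL_GPU_HOURS / GPUS / 24
print(f"wall-clock time at {GPUS} GPUs: {wall_clock_days:.1f} days")
```

Swapping the FP32 entries for narrower types shows how much of the remaining footprint is optimiser state rather than weights.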
The shape of the FP8 forward pass at scale
Inside a single DeepSeek-V3 transformer block, here is what is actually FP8 and what is not:
| Operation | Input dtype | Output dtype | Why |
|---|---|---|---|
| MLA QKV projection | FP8 E4M3 | BF16 | GEMM dominates; quality preserved by per-block scale |
| Attention softmax + scores | BF16 | BF16 | Softmax exp/normalise needs range, not throughput |
| MoE expert MLP up-proj | FP8 E4M3 | BF16 | Biggest GEMM in the model; biggest FP8 win |
| MoE expert MLP down-proj | FP8 E4M3 | BF16 | Same |
| Backward weight gradient | FP8 E4M3 | FP32 | Per-block scaling supplies the range, so the paper keeps E4M3 here instead of the E5M2 of earlier recipes |
| Optimiser update (Adam) | FP32 | FP32 | Loss of precision here corrupts the master state |
| LayerNorm / RMSNorm | BF16 | BF16 | Reduction-heavy; FP8 saves nothing meaningful |
Roughly 60–70% of the FLOPs in a transformer block are GEMMs that can run in FP8. Softmax, normalisation, and optimiser updates stay in higher precision — a hybrid recipe that captures most of the speed-up while leaving the numerically sensitive stages alone.
Engineering Reality: What FP8 Buys, What It Costs
FP8 is not free. The four engineering costs you sign up for the moment you turn it on:
- Hardware lock-in. FP8 is only fast on Hopper (H100/H200) and Blackwell (B100/B200/GB200). Ampere (A100) and earlier generations have no FP8 tensor cores — an FP8 cast on A100 saves memory but not compute. Cross-cluster training (mixed hardware) becomes painful.
- Numerical fragility. A single un-scaled activation outlier can saturate a whole block's GEMM, and gradient distributions shift across training — what was a safe scale at step 1k can be wrong at step 100k. Sections 10.2 and 10.3 are the cost of this fragility: per-block scaling, dynamic scale tracking, and tile-level outlier handling.
- Debugging is worse. "The loss spiked at step 47k" in BF16 usually means "a data shard was bad". In FP8 it can also mean "the scale for layer 38 drifted" or "an outlier in expert 47's input overflowed". New failure mode taxonomy, new monitoring infra.
- Framework support is uneven. PyTorch supports `torch._scaled_mm` but not (yet) FP8 autograd everywhere. Most production FP8 training stacks (TransformerEngine, DeepSeek's in-house kernels, Megatron-LM-FP8) reach into custom CUDA. Off-the-shelf is not yet enough.
Against those four costs, the benefits compound:
| Benefit | Magnitude | Where it shows up |
|---|---|---|
| GEMM throughput | ≈ 2× vs BF16 | Wall-clock time of pre-training |
| Activation memory | ≈ 0.5× vs BF16 | Max sequence length, max batch |
| Weight memory | ≈ 0.5× vs BF16 | Number of parameters per GPU |
| Network bytes (collective ops) | ≈ 0.5× vs BF16 | All-to-all latency for MoE |
| Power per training token | ≈ 0.5× vs BF16 | Total energy and dollar cost |
The deepest lesson. Every order of magnitude of scaling has historically required a precision step-down. Single GPUs used FP32. The first multi-GPU runs (Megatron, GPT-3) moved to FP16/BF16 with master weights. Frontier scale (DeepSeek-V3, Grok-2, internal labs at OpenAI and Anthropic) is moving to FP8. The next scale-up almost certainly involves FP4 or block-floating-point formats: the same engineering programme, one step narrower. The lesson of this chapter is not "use FP8"; it is to learn to engineer training around precision, because the next format is already on the roadmap.