The Real Problem: FP8 Cannot Be Everywhere
Sections 10.3 and 10.4 sold you on FP8: fine-grained block scaling recovers most of the dynamic range, high-precision accumulators stop the running sum from collapsing, and an H100 tensor core in FP8 runs nearly twice as fast as in BF16. So why not push the whole training stack to FP8? Why does DeepSeek-V3's actual precision map still keep entire layers in BF16 and an irreducible core in FP32?
The honest answer: FP8 in its E4M3 form has 3 bits of mantissa and a hard ceiling at 448. That is enough for the big matrix multiplies, where per-block scaling stretches the range and where the result is immediately dequantized into a wider format. It is not enough for three families of operations that appear in every training step:
- Long-horizon accumulators. Master weights drift by roughly $10^{-4}$ per step against a weight of order 1. Over millions of steps these updates have to actually land. In E4M3 the spacing between representable values just above 1.0 is $2^{-3} = 0.125$, so every update rounds away to nothing and training stalls.
- Wide-range reductions. RMSNorm computes a mean of squares; softmax exponentiates logits that span tens of units. Both routinely produce intermediates that overflow or collapse small probabilities to zero.
- Second-moment statistics. Adam's $v$ tracks a running average of $g^2$. For a typical late-training gradient $|g| \sim 10^{-3}$, the second moment is $v \sim 10^{-6}$, three orders of magnitude below FP8's smallest distinguishable nonzero value.
Putting any of these in FP8 silently corrupts training: no NaN, no explosion, just a loss curve that mysteriously plateaus 0.05 nats above where it should. The job of this section is to make every one of those failures visible, then map out exactly which tensors in a transformer block stay in BF16 versus FP32 versus FP8 — and why.
Intuition: Cheap Ruler, Calipers, Master Blueprint
Imagine a workshop that builds a precise instrument. Most of the work — sawing boards, drilling rough holes, sanding panels — uses a millimetre ruler. It is fast and cheap and the precision is good enough because the next step always re-references the parts against the blueprint. That is FP8.
A few moments demand calipers: matching the bearing diameter to the shaft, reading a balance after many cycles. Errors here cascade — the bearing either fits or seizes the machine. That is BF16: keep the wide range, take the tighter precision, accept the 2× cost for the operations that compound across thousands of layers.
And on the shelf, in a locked drawer, sits the master blueprint — the single canonical drawing every part is referenced against. You do not redraw the blueprint with the millimetre ruler. You make all your sawing and drilling with cheap tools, then you walk back to the drawer, update the blueprint with a fine pen, and the next batch of cuts is taken from that updated reference. That is FP32: master weights, optimizer moments, gamma scales — the source of truth, never the working copy.
The Mathematics of Why Some Tensors Refuse FP8
Quantization grain near a value
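FP8 E4M3 has 4 exponent bits and 3 mantissa bits. Around a value of magnitude $x$ the spacing between representable numbers is approximately $x \cdot 2^{-3}$ (exactly $2^{e-3}$ for $x$ in the binade $[2^e, 2^{e+1})$, so between $x \cdot 2^{-4}$ and $x \cdot 2^{-3}$). Per-block scaling does not shrink this relative grain; it re-centres the representable range on each block's largest magnitude so that fewer values overflow or underflow. Concretely: just above a weight of 1.0 the unscaled E4M3 grain is $2^{-3} = 0.125$; near 0.73 it is $2^{-4} = 0.0625$.
Compare against BF16, which has 7 stored mantissa bits (8 bits of precision counting the implicit leading one) and a grain between $x \cdot 2^{-8}$ and $x \cdot 2^{-7}$ (roughly 4 to 8 $\times 10^{-3}\,x$), and FP32 with at most $x \cdot 2^{-23} \approx 1.2 \times 10^{-7}\,x$. These three numbers are the entire story.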
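The grain figures can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of the chapter's tooling: the helper name `spacing` and the table `MANTISSA_BITS` are illustrative, and the function only models the normal range of each format (no subnormals, no overflow).

```python
import math

# Stored mantissa bits per format (E4M3 is the FP8 variant discussed here).
MANTISSA_BITS = {"FP8 E4M3": 3, "BF16": 7, "FP32": 23}

def spacing(x, mantissa_bits):
    """Gap between adjacent representable values in the binade containing x.

    Assumption: x is in the format's normal range.
    """
    _, e = math.frexp(abs(x))            # abs(x) = m * 2**e with 0.5 <= m < 1
    return math.ldexp(1.0, e - 1 - mantissa_bits)

for name, bits in MANTISSA_BITS.items():
    print(f"{name:9s} near 1.0: {spacing(1.0, bits):.3e}   near 0.73: {spacing(0.73, bits):.3e}")
```

The spacing halves when a value drops into the next lower binade, which is why the grain near 0.73 is half the grain just above 1.0 in every format.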
Update-survival inequality
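An Adam update of size $|\Delta w|$ against a master weight $w$ survives the round-trip through a quantized format with grain $\delta(w)$ only if it carries $w$ at least to the midpoint before the next representable value:
$$|\Delta w| \ge \tfrac{1}{2}\,\delta(w)$$
For $|\Delta w| \approx 10^{-4}$ and $w \approx 1$, this fails by nearly three orders of magnitude in FP8 ($\delta/2 \approx 6 \times 10^{-2}$), by more than one order of magnitude in BF16 ($\delta/2 \approx 4 \times 10^{-3}$), and is satisfied with about three orders of headroom in FP32 ($\delta/2 \approx 6 \times 10^{-8}$). This is the formal reason master weights live in FP32.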
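The update-survival condition can be tested directly by rounding a weight, applying an update, and rounding again. The sketch below assumes round-to-nearest-even and ignores exponent-range limits; `quantize` and `update_survives` are hypothetical helper names, and the values of `w` and `delta` are illustrative.

```python
import math

def quantize(x, mantissa_bits):
    """Round x to the nearest value with the given number of stored mantissa bits.

    Assumption: exponent range is unlimited, so only mantissa rounding is modelled.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                       # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)           # stored bits plus the implicit leading one
    return math.ldexp(round(m * scale) / scale, e)

def update_survives(w, delta, mantissa_bits):
    """True if subtracting delta changes the stored weight after re-quantization."""
    w_q = quantize(w, mantissa_bits)
    return quantize(w_q - delta, mantissa_bits) != w_q

w, delta = 1.0, 1e-4                           # illustrative weight and update size
for name, bits in [("FP8 E4M3", 3), ("BF16", 7), ("FP32", 23)]:
    print(f"{name:9s} update survives: {update_survives(w, delta, bits)}")
```

With these values only the FP32 line reports a surviving update; the two narrower formats round straight back to the starting weight.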
Softmax saturation bound
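Softmax over logits $z_1, \dots, z_n$ computes $p_i = e^{z_i - z_{\max}} / \sum_j e^{z_j - z_{\max}}$. The shifted maximum is $0$, so the largest term is $e^0 = 1$; the smallest tail probability is roughly $e^{-(z_{\max} - z_{\min})}$. For attention logits with a realistic gap of 15 to 30, the tail probability is between about $e^{-15} \approx 3 \times 10^{-7}$ and $e^{-30} \approx 10^{-13}$, well below any FP8 grain, so the tail collapses to zero under FP8 storage. BF16 shares FP32's exponent range and can hold the tail, but with only 8 significant bits it distorts it, and small terms are lost when the sum is accumulated; FP32 carries it cleanly. That is why every modern stack does the softmax reduction in FP32 even when both inputs and outputs are in BF16.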
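To see the tail collapse concretely, the sketch below computes a softmax in double precision and then stores the probabilities on an unscaled E4M3 grid. It is an illustration under stated assumptions: `to_e4m3` is a simplified hand-rolled rounding function (3 mantissa bits, subnormal step of $2^{-9}$, saturation at 448, no NaN handling), and the logits are made up for the example.

```python
import math

def to_e4m3(x):
    """Round a non-negative float to the nearest unscaled E4M3 value (simplified model)."""
    if x <= 0.0:
        return 0.0
    _, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= m < 1
    e = max(e, -5)                        # subnormals share the step of the lowest normal binade
    step = math.ldexp(1.0, e - 4)         # spacing = 2**(e - 1 - 3)
    return min(round(x / step) * step, 448.0)

def softmax(logits):
    """Numerically stable softmax: shift by the maximum before exponentiating."""
    z_max = max(logits)
    exps = [math.exp(z - z_max) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

logits = [0.0, -2.0, -8.0, -20.0]         # illustrative logits with a gap of 20
p_ref = softmax(logits)                   # double-precision reference
p_fp8 = [to_e4m3(p) for p in p_ref]       # same probabilities stored as E4M3

for z, ref, low in zip(logits, p_ref, p_fp8):
    print(f"logit {z:6.1f}   reference {ref:.3e}   E4M3 {low:.3e}")
```

Both tail entries are nonzero in the reference and exactly zero after E4M3 storage, while the two dominant probabilities survive with visible rounding.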
Second-moment underflow
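Adam's second moment is $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$. For late-training gradients $|g| \sim 10^{-3}$, $v \sim 10^{-6}$. The smallest positive E4M3 value is $2^{-9} \approx 2 \times 10^{-3}$ before scaling, and even with per-block scaling a block resolves only about five orders of magnitude below its largest entry, so a small $v$ that shares a block with much larger entries rounds to $0$. The Adam step then divides by $\sqrt{v} + \epsilon \approx \epsilon$ and blows up. FP32 storage of $m$ and $v$ is non-negotiable.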
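The underflow mechanism can be sketched in a few lines. The code below models only the flush-to-zero behaviour of per-block FP8 storage, not mantissa rounding; `store_in_fp8_block` and `adam_step_size` are hypothetical helpers, and the block maxima are assumptions chosen to show both outcomes.

```python
import math

E4M3_MAX = 448.0                   # largest finite E4M3 value
E4M3_MIN_SUBNORMAL = 2.0 ** -9     # smallest positive E4M3 value, about 1.95e-3

def store_in_fp8_block(value, block_amax):
    """Model only the underflow of per-block FP8 storage.

    Assumption: the block is scaled so block_amax maps to 448, and a scaled
    entry at or below half the smallest subnormal rounds to zero.
    """
    scaled = value * E4M3_MAX / block_amax
    if scaled <= E4M3_MIN_SUBNORMAL / 2:
        return 0.0
    return value

def adam_step_size(lr, m, v, eps=1e-8):
    """Magnitude of one Adam step (bias correction omitted)."""
    return lr * m / (math.sqrt(v) + eps)

lr, m, v = 3e-4, 1e-3, 1e-6        # illustrative late-training values, |g| ~ 1e-3

v_mixed = store_in_fp8_block(v, block_amax=1.0)      # block also holds an entry near 1
v_uniform = store_in_fp8_block(v, block_amax=1e-3)   # block of similarly small entries

print("healthy step:       ", adam_step_size(lr, m, v))
print("v in mixed block:   ", v_mixed, "-> step", adam_step_size(lr, m, v_mixed))
print("v in uniform block: ", v_uniform, "-> step", adam_step_size(lr, m, v_uniform))
```

Whether a given entry underflows depends on the spread of magnitudes inside its block, which is exactly the kind of data-dependent failure that makes FP8 optimizer state hard to trust.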
Manual Numerical Walkthrough
Let us trace one Adam step end-to-end in three precisions and see each failure mode appear.
One Adam step in FP32, BF16, and FP8, side by side
Step 1: the inputs. A single parameter, mid-training.
    w      = 0.731 421   (current master weight)
    g      = 3.2e-4      (gradient from this minibatch)
    m_prev = 1.1e-4      (Adam first moment)
    v_prev = 8.5e-8      (Adam second moment)
    beta1  = 0.9
    beta2  = 0.95
    lr     = 3.0e-4
    eps    = 1.0e-8
Step 2: Adam update math (FP32 reference, bias correction omitted).
    m = 0.9 * 1.1e-4 + 0.1 * 3.2e-4 = 1.31e-4
    v = 0.95 * 8.5e-8 + 0.05 * (3.2e-4)^2
      = 8.075e-8 + 5.12e-9 = 8.587e-8
    step = lr * m / (sqrt(v) + eps)
         = 3e-4 * 1.31e-4 / (sqrt(8.587e-8) + 1e-8)
         = 3e-4 * 1.31e-4 / (2.93e-4 + 1e-8)
         = 1.341e-4
    w_new (FP32) = 0.731 421 - 1.341e-4 = 0.731 287
The actual weight motion is $\Delta w \approx -1.341 \times 10^{-4}$. That is the quantity each precision must preserve.
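The reference arithmetic of Step 2 can be reproduced in a few lines of Python. This is a sketch that mirrors the hand calculation, including its omission of Adam's bias correction; the function name `adam_step` is illustrative, and Python floats (double precision) stand in for the FP32 reference.

```python
import math

def adam_step(w, g, m_prev, v_prev, lr=3.0e-4, beta1=0.9, beta2=0.95, eps=1.0e-8):
    """One Adam update without bias correction, matching the hand calculation."""
    m = beta1 * m_prev + (1 - beta1) * g
    v = beta2 * v_prev + (1 - beta2) * g * g
    step = lr * m / (math.sqrt(v) + eps)
    return w - step, m, v, step

w_new, m, v, step = adam_step(w=0.731421, g=3.2e-4, m_prev=1.1e-4, v_prev=8.5e-8)
print(f"m     = {m:.3e}")
print(f"v     = {v:.4e}")
print(f"step  = {step:.4e}")
print(f"w_new = {w_new:.6f}")
```

Keeping the reference in one function makes it easy to swap a quantized value in for `v_prev` or `w` and watch the failure modes of the following steps appear.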
Step 3: same computation if v is stored in FP8. The block grain for v near 1e-7 is roughly 2e-3 (FP8 cannot distinguish values below this from zero), so v rounds to 0.
    v_fp8    = 0   (underflow)
    step_fp8 = 3e-4 * 1.31e-4 / (sqrt(0) + 1e-8)
             = 3e-4 * 1.31e-4 / 1e-8
             = 3.93   <- more than four orders of magnitude too large
One bad step. The parameter would jump from 0.731 to about -3.20. The next forward pass would either NaN, or produce a loss spike of dozens of nats, or, worst case, silently poison every downstream activation. This is why v must be FP32.
Step 4: same computation if w is stored in FP8. The walkthrough takes the FP8 grain near 0.73 to be roughly 5e-3 (per-block scaling, amax ≈ 1); with a single global scale it is coarser. This is an optimistic figure: an unscaled E4M3 grid has spacing $2^{-4} = 0.0625$ between 0.5 and 1.
    w_quantized    = round(0.731421 / 5e-3) * 5e-3 = 0.730
    step (correct) = 1.341e-4
    w - step       = 0.730 - 1.341e-4 = 0.729 866
    re-quantize    = round(0.729 866 / 5e-3) * 5e-3 = 0.730   <- no move
Every step rounds back to the same quantized weight. The weight cannot move. Multiply this over 10,000 steps and the FP8 master copy is frozen in place while the FP32 truth has drifted by about 1.34 (10,000 steps of 1.341e-4). That is the second non-negotiable: master weights must be FP32.
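The frozen-weight effect is easy to reproduce with the same numbers. The sketch below uses a uniform grid with the walkthrough's assumed grain of 5e-3 rather than a true E4M3 grid (which is coarser), and applies the same step 10,000 times; `snap` is an illustrative helper.

```python
def snap(x, grain):
    """Round x to the nearest multiple of grain (uniform grid, as in the walkthrough)."""
    return round(x / grain) * grain

grain = 5e-3            # assumed low-precision grain near 0.73
step = 1.341e-4         # per-step weight motion from the reference calculation
num_steps = 10_000

w_ref = 0.731421                  # high-precision master weight
w_low = snap(w_ref, grain)        # low-precision master weight

for _ in range(num_steps):
    w_ref -= step                          # update lands every time
    w_low = snap(w_low - step, grain)      # update is rounded away every time

print(f"high-precision master after {num_steps} steps: {w_ref:.6f}")
print(f"low-precision master after {num_steps} steps:  {w_low:.6f}")
print(f"drift between them: {abs(w_ref - w_low):.6f}")
```

Raising `step` above half of `grain` is the quickest way to watch the low-precision copy start moving again.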
Step 5: what BF16 does. BF16 grain near 0.73 is $2^{-8} \approx 3.9 \times 10^{-3}$, still too coarse to resolve a $1.3 \times 10^{-4}$ update reliably: under round-to-nearest only updates larger than about half the grain move the stored weight, and under stochastic rounding roughly one such update in 30 would land. BF16 master weights almost work, which is why a few research papers tried them, but accuracy degrades by 0.05–0.15 nats over a full pretraining run. Nobody at frontier scale ships BF16 master weights anymore.
Step 6: what BF16 stores happily. Activations, gradients, and attention scores all fit cleanly in BF16's range, and its mantissa resolution is fine for tensors that are read once and replaced. This is why BF16 is the universal default for activations, gradients, and the residual stream.
Step 7: what the walkthrough teaches. The precision boundary is not arbitrary. It is set by three hard mechanical bounds: (a) the update-survival inequality for master weights, (b) the second-moment underflow bound for Adam's v, and (c) the softmax-tail bound for attention probabilities. Everything else can, and should, be in BF16 for the highway (the tensors that flow between kernels) and FP8 inside the GEMM kernels.
Visualizing the Precision Map of a Block
[Figure: precision map of one transformer block as DeepSeek-V3 actually ships it, with each tensor assigned to FP8, BF16, or FP32.]
Three things to read out of the map. First, FP8 is a kernel decision, not a tensor-type decision: only the big GEMMs (QKV projection, output projection, FFN) execute in FP8, and the outputs are dequantized to BF16 immediately. Every tensor your Python code sees lives in BF16 or FP32; FP8 is a transient inside a fused kernel. Second, the optimizer side is uniformly FP32: master weights and Adam state stay there even though the gradients feeding them arrive in BF16. There is no point arguing about it, nobody at scale runs FP8 master weights or FP8 Adam state. Third, softmax is the only operation inside attention that escapes to FP32 (the RMSNorm reduction is the other FP32 island in the forward path); even with BF16 inputs and BF16 outputs, the exp+sum reduction is FP32 in the middle. That single cast is the difference between a stable long-context model and one whose attention silently truncates.
Plain Python: Simulating FP8 to See It Fail
Below is the diagnostic script every mixed-precision engineer should be able to write in their sleep. It quantizes a few canonical tensors to FP8 with per-block scaling, then measures the damage against an FP32 reference. Three tests, three failure modes, three reasons that FP32 paths exist.
The output of running this script on any modern laptop:
    FP32 master after 10k steps: 0.000000
    FP8 master after 10k steps: 1.000000
    FP8 drift vs FP32 truth: 1.000000
    FP32 softmax: [0.00010 0.07585 0.92388 0.00000 0.00018]
    FP8 softmax: [0.00000 0.07585 0.92415 0.00000 0.00000]
    L1 error: 0.00056
    v (FP32): 9.000e-08
    v (FP8) : 0.000e+00 <- collapses to 0; Adam step explodes via 1/sqrt(v+eps)
Read each line as a verdict: the FP8 master never moved, the FP8 softmax dropped the low-probability tail (the 0.00010 and 0.00018 entries) entirely, and the FP8 second moment vaporized. Three small failures that compound into an unusable training run.
Sanity-check yourself. Re-run the master-weight loop with delta = 1e-2 (a hundred times larger). The FP8 master finally tracks the FP32 reference. The threshold for FP8 master weights to work is an update comparable to the FP8 grain itself, orders of magnitude larger than any realistic Adam update at scale. Confirm the breakdown bound for yourself; it is the cleanest way to internalize why FP32 master weights are mandatory.
PyTorch: The Mixed-Precision Recipe DeepSeek Ships
Below is the production transformer block with the precision map baked in. Three patterns to study: (a) RMSNorm with a forced FP32 reduction, (b) FP8-eligible matmuls scoped by autocast, (c) softmax with an explicit FP32 cast in the middle. The optimizer sees only the FP32 master copy and updates it in FP32.
Two structural details are worth a second look. First, autocast is not a magic wand: it tags a region as eligible for low precision, but the framework still picks the actual kernel based on what is available and safe. On H100 with FP8 support enabled (for example through an FP8-capable library such as NVIDIA Transformer Engine), the qkv and proj GEMMs run in FP8; on A100 the same code runs in BF16. The precision map is portable. Second, the cost of the FP32 cast inside softmax is negligible: softmax is memory-bound, not compute-bound, and a single extra cast costs microseconds against the milliseconds of the surrounding matmuls. There is no engineering reason to economize on it.
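Pattern (a), the forced high-precision reduction inside RMSNorm, can be illustrated without any framework. The sketch below is a deliberately naive model, not the block described above: it accumulates the sum of squares one element at a time and rounds the running total to BF16 after every addition. Production kernels already accumulate in FP32, and this shows why. The helpers `to_bf16` and `rms_norm` are hypothetical.

```python
import math

def to_bf16(x):
    """Round x to the nearest BF16 value (7 stored mantissa bits, exponent range ignored)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                     # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 256) / 256, e)

def rms_norm(x, accumulate):
    """RMSNorm without a learned scale; `accumulate` models the accumulator's precision."""
    total = 0.0
    for xi in x:
        total = accumulate(total + xi * xi)  # running sum of squares
    rms = math.sqrt(total / len(x))
    return [xi / rms for xi in x]

hidden = [1.0] * 4096                        # illustrative activation vector

out_high = rms_norm(hidden, accumulate=lambda t: t)   # high-precision accumulator
out_bf16 = rms_norm(hidden, accumulate=to_bf16)       # BF16 accumulator

print("high-precision accumulator -> first output:", out_high[0])
print("BF16 accumulator           -> first output:", out_bf16[0])
```

Once the BF16 running sum reaches 256, adding 1 no longer changes it, so the computed mean of squares is too small and every output comes out four times too large.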
Add assert not torch.isnan(loss).any() immediately after the loss computation and pin the failing batch to a file. NaN at training time almost always traces back to one of three things in this section's precision map: a missing FP32 cast in softmax, an FP8 input that breached the E4M3 ceiling of 448, or an Adam step against a quantized v. The assertion fires before the gradient propagates and the postmortem is trivial. Frontier labs run this assertion on every step.
At Massive Scale: Why FP32 Master Weights Cost Terabytes (and Are Worth It)
Snap the precision map onto DeepSeek-V3's 671B parameter budget and the costs become concrete:
| Tensor class | Precision | Bytes / param | Total at 671B params |
|---|---|---|---|
| Forward weights (FP8 GEMM input) | FP8 | 1 | 671 GB |
| BF16 shadow weights (for autocast) | BF16 | 2 | 1.34 TB |
| FP32 master weights | FP32 | 4 | 2.68 TB |
| Adam m + v (FP32 each) | FP32 | 8 | 5.37 TB |
| Gradients (BF16) | BF16 | 2 | 1.34 TB |
| Total training-state cost | mixed | 17 | 11.4 TB |
That 11.4 TB of training state (8.05 TB of it FP32 master weights and Adam moments) is per replica if you do not shard it. The only reason this is tractable is ZeRO-1 / FSDP, which splits the FP32 master and Adam state across data-parallel ranks (Chapter 11.7). With $N$ H100s sharing the load, the per-GPU share of FP32 optimizer state is around $8.05\ \text{TB} / N$, which for a cluster of a few thousand GPUs is a few gigabytes: it fits comfortably in the 80 GB HBM with room left for activations and the FP8 forward weights.
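The byte budget is simple enough to recompute. The sketch below reproduces the totals from the bytes-per-parameter column and divides the FP32 optimizer state evenly across a cluster; the GPU count of 2048 is an illustrative assumption, and even sharding across all GPUs is a simplification of how ZeRO-1 interacts with other parallelism.

```python
PARAMS = 671e9   # parameter count

# Bytes per parameter for each tensor class in the budget.
BYTES_PER_PARAM = {
    "FP8 forward weights": 1,
    "BF16 shadow weights": 2,
    "FP32 master weights": 4,
    "Adam m + v (FP32)": 8,
    "BF16 gradients": 2,
}

def to_tb(num_bytes):
    """Convert bytes to decimal terabytes."""
    return num_bytes / 1e12

total_tb = to_tb(PARAMS * sum(BYTES_PER_PARAM.values()))
fp32_state_tb = to_tb(
    PARAMS * (BYTES_PER_PARAM["FP32 master weights"] + BYTES_PER_PARAM["Adam m + v (FP32)"])
)

def fp32_state_per_gpu_gb(num_gpus):
    """Per-GPU share of FP32 master weights plus Adam state, assuming even sharding."""
    return fp32_state_tb * 1e3 / num_gpus

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:22s} {to_tb(PARAMS * nbytes):6.2f} TB")
print(f"{'total':22s} {total_tb:6.2f} TB")
print(f"FP32 state per GPU at 2048 GPUs: {fp32_state_per_gpu_gb(2048):.2f} GB")
```

Changing the GPU count shows how quickly the per-GPU share shrinks, and why the unsharded figure is out of reach for any single device.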
Two strategic consequences fall out of this arithmetic. First, FP8 is the leverage that makes everything else possible: dropping the forward weight bytes from 2 to 1 frees the budget for the FP32 master and optimizer state we are not willing to give up. Without FP8 GEMMs, DeepSeek-V3 could not fit on the cluster. Second, the FP32 reservations are not waste — they are the survival kit. The two failure modes from Section 1 (master-weight underflow and Adam v underflow) are exactly what FP32 prevents. Cutting them to BF16 to save bytes is the most common rookie optimization, and the most expensive one when the run silently underperforms.
Where the bytes go in practice
- FP8 forward weights live on every GPU that runs a copy of the tensor-parallel shard. They are read once per forward step and discarded.
- BF16 shadow weights are the working copy that PyTorch's autocast actually reads. They are refreshed from the FP32 master after each optimizer step.
- FP32 master weights and Adam m, v are sharded across data-parallel ranks (ZeRO-1). On a step boundary, each rank runs the optimizer step on its own shard, and an all-gather then brings the updated weights back to every rank that needs them. This is the most communication-heavy step in the training loop and the one DeepSeek-V3 spent the most engineering on (see Chapter 11.5 DualPipe).
Engineering Reality: The DeepSeek-V3 Precision Map
Here is the precision map as DeepSeek-V3 actually ships it, layer by layer. Every entry is a deliberate engineering choice backed by a failure mode like the ones we traced above.
| Component | Precision | Why |
|---|---|---|
| Token embedding & output head | BF16 | Embeddings are sparse and gradient-sensitive (a token seen once must still produce a meaningful gradient). FP8 quantization on the embedding matrix loses tail tokens. The output head ties into cross-entropy which itself runs in FP32. |
| RMSNorm reduction | FP32 (in), BF16 (out) | Mean-of-squares overflows or underflows in FP8 and is borderline in BF16. The reduction is forced to FP32 and the rescaled activation is cast back. |
| RMSNorm gamma (learned scale) | FP32 master, BF16 shadow | Same logic as every other learned parameter: master in FP32 so the optimizer step actually lands, shadow in BF16 for the forward pass. |
| QKV / output / FFN GEMMs | FP8 input, FP8 weight, BF16 output | The whole point of FP8. Per-block (1×128 act, 128×128 weight) scaling preserves dynamic range; the FP32 accumulator (Section 10.4) prevents catastrophic accumulation error. |
| Attention scores (QKᵀ) | BF16 | Logits routinely span ±30 in long-context models; the softmax that follows needs the full range. FP8 saturates and the post-softmax tail collapses. |
| Softmax | FP32 reduction, BF16 output | exp(z) where z spans 15-30 produces a probability tail at 1e-7 to 1e-13. Only FP32 preserves it; BF16 distorts it; FP8 destroys it. |
| Residual stream | BF16 | Compounded over 60+ layers, an FP8 residual would accumulate quantization noise on every block. BF16 grain (≈ 4e-3 × magnitude) is small enough relative to typical residual magnitudes (≈ 1) that depth-wise drift is negligible. |
| Gradients | BF16 (per-rank), FP8 (cross-node compression) | Per-rank gradient is BF16 — the standard. For cross-node all-reduce, DeepSeek-V3 compresses to FP8 with per-block scaling, recovers the BF16 average on the receive side. This is an aggressive choice and Section 11.6 covers the analysis. |
| Master weights | FP32 | Update-survival inequality — every line of math points here. Sharded across DP ranks via ZeRO-1. |
| Adam m, v | FP32 | Second-moment underflow bound — every line of math points here too. Sharded across DP ranks via ZeRO-1. |
| Cross-entropy loss | FP32 | Loss values are summed across the batch and then divided; numerical drift in the reduction shows up directly in the gradient norm. The whole loss path is FP32 — it is one scalar per step, the cost is zero. |
| Gradient clipping | FP32 | Computes the global gradient L2 norm by summing squared values across every parameter shard. Mixed-precision implementations of this step are a known source of subtle bugs; FP32 is the safe default and the cost is negligible. |
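The table above can also be carried around as data, which makes it easy to query when auditing a training script. The sketch below simply transcribes the Precision column; the dictionary keys and helper names are illustrative and not taken from any DeepSeek code.

```python
# Precisions per component, transcribed from the table; key names are illustrative.
PRECISION_MAP = {
    "embedding_and_output_head": ("BF16",),
    "rmsnorm_reduction": ("FP32", "BF16"),
    "rmsnorm_gamma": ("FP32", "BF16"),
    "gemms": ("FP8", "BF16"),
    "attention_scores": ("BF16",),
    "softmax": ("FP32", "BF16"),
    "residual_stream": ("BF16",),
    "gradients": ("BF16", "FP8"),
    "master_weights": ("FP32",),
    "adam_moments": ("FP32",),
    "cross_entropy_loss": ("FP32",),
    "gradient_clipping": ("FP32",),
}

def components_using(precision):
    """Components whose row mentions the given precision anywhere."""
    return [name for name, used in PRECISION_MAP.items() if precision in used]

def fp32_only():
    """Components that never leave FP32."""
    return [name for name, used in PRECISION_MAP.items() if used == ("FP32",)]

print("touches FP8:", components_using("FP8"))
print("FP32 only:  ", fp32_only())
```

Only two rows mention FP8 at all, and the FP32-only rows are exactly the optimizer state and the global reductions.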
DeepSeek-V3's precision map looks intricate because every cell records a separate engineering battle: did this tensor survive FP8 in the ablations, or did the run diverge 200B tokens in? The answer is rarely "FP8 was fine" and rarely "FP32 was needed." The answer almost always is "this exact tensor at this exact place needed BF16." The whole chapter builds toward this one observation: mixed precision is not a format choice. It is a tensor-by-tensor audit of every data flow in the model. Section 10.6 puts that audit into runnable code: the full training loop with the precision map wired in.