Every transformer trained today does an enormous amount of work and is rewarded with a tiny amount of feedback. A 671-billion-parameter model spends roughly four trillion FLOPs to process a single token through its full forward and backward pass, and at the end of that journey it is graded on one question: what is the next token? One categorical target. A handful of bits of information per step. Most of the network's machinery is, in a precise information-theoretic sense, going to waste. This chapter is about why the wastage exists, what it costs, and how DeepSeek-V3 — along with a generation of speculative-decoding work — claws some of the lost signal back. This first section is just the diagnosis.
The bottleneck in one sentence. Next-token prediction makes the forward pass do the work of every layer in the stack and extracts the signal of one categorical guess. The compute scales with the model; the supervisory signal per token is capped by the logarithm of the vocabulary size. The gap between the two is the largest underused resource in large-model training.
The Asymmetry: Huge Forward Pass, Tiny Signal
Walk through what one training step actually does. A batch of sequences enters the model. For each of the $T$ positions in each sequence, every one of the $L$ transformer layers runs its full forward computation: self-attention, MLP, normalization, residual adds. Then the gradients flow back through the same $L$ layers. The compute is proportional to $N \cdot T$, where $N$ is the parameter count, and it grows with the model.
At the very top of that stack, the final hidden state at position $t$ is projected to a vocabulary-sized logit vector and compared, by cross-entropy, against exactly one ground-truth token: the token at position $t+1$. The supervisory signal that this step delivers is bounded by $\log_2 V$ bits per position, where $V$ is the vocabulary size, and in practice it is much smaller, because natural text is nowhere near maximum entropy. The signal scales with the data, not the model.
| Quantity | Scales with | 7B example | 671B example |
|---|---|---|---|
| FLOPs per step | model size × tokens | ~1.7 × 10¹⁴ | ~1.6 × 10¹⁶ |
| Supervisory bits (ceiling) | tokens × log₂(vocab) | ~64,000 bits | ~70,000 bits |
| Effective bits (real text) | tokens × ~3 bits | ~12,000 bits | ~13,000 bits |
| Bits / GFLOP (effective) | 1 / model size | ~7 × 10⁻² | ~8 × 10⁻⁴ |
Intuition: The 100-Step Exam Graded on One Number
Imagine a student sitting an exam that takes 100 steps of arithmetic to complete. The student does every step carefully. The teacher grades the work by reading only the final number — they never look at any intermediate result — and writes either a check or a cross at the bottom of the page. Over the course of a semester, the student improves on the final answer, but slowly, because each round of feedback is one bit of signal compressed from a hundred steps of effort. If the teacher had instead written a check or cross at every intermediate step, the student would have learned in ten exams what previously took a hundred.
A transformer is exactly this student. The forward pass computes a long chain of intermediate representations: early-layer hidden states contain low-level features, mid-layer hidden states contain syntactic structure, late-layer hidden states contain semantic content. By the time the final layer produces its logits, the model has done something analogous to a hundred steps of reasoning. The standard objective reads only the final logit vector and grades it on one cross-entropy target. Everything else receives no direct supervision; it learns only from whatever gradient flows back from that single target.
Why has the field tolerated this for so long? Two reasons. First, the single-target objective is mathematically clean — cross-entropy on the last hidden state has a tidy gradient and works out of the box. Second, for a long time the data side was the bottleneck: there was always more text to train on, so the question of "more signal per FLOP" was less urgent than the question of "more data, please." That changed once frontier labs started running out of high-quality tokens and once total training FLOPs began to dwarf data scaling. With both data and capital expensive, the bits-per-FLOP ratio became the new lever.
The Math: Supervisory Density per FLOP
Fix the model at $N$ total parameters and consider one training step that processes $T$ tokens. The standard estimate for FLOPs per step is:
$$\text{FLOPs/step} \approx 6\,N\,T$$
The factor of 6 comes from $2$ FLOPs per parameter in the forward pass (one multiply plus one add, counted as two operations) and roughly $4$ FLOPs per parameter in the backward pass (one multiply-add for the activation gradient, one for the weight gradient). Some analyses use a slightly different constant, for example 8 when activations are recomputed during the backward pass, but 6 is the standard in scaling-law literature.
Now the supervisory side. Next-token prediction emits one cross-entropy target per position. The maximum information content of a categorical distribution over $V$ classes is $\log_2 V$ bits, achieved when the true distribution is uniform. Real text falls well below this ceiling, but as an upper bound:
$$\text{bits/step} \le T \log_2 V$$
Define the supervisory density $\rho$ as bits per FLOP:
$$\rho = \frac{\text{bits/step}}{\text{FLOPs/step}} \le \frac{T \log_2 V}{6\,N\,T} = \frac{\log_2 V}{6\,N}$$
The $T$'s cancel, so a longer context does not change the density. Vocabulary buys only a logarithmic improvement. The only term that survives is $N$ in the denominator, which is exactly the observation we want: supervisory density falls in inverse proportion to model size. Big models are the worst possible thing for bits per FLOP.
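To make the gap between the ceiling and real text concrete, the sketch below compares $\log_2 V$ with the Shannon entropy of a skewed categorical distribution. The Zipf-like distribution is an illustrative assumption standing in for token frequencies, not measured data, and the helper names `entropy_bits` and `zipf_distribution` are hypothetical.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy, in bits, of a categorical distribution p."""
    p = np.asarray(p, dtype=np.float64)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def zipf_distribution(vocab_size, exponent=1.0):
    """Illustrative Zipf-like distribution: p(rank r) proportional to 1 / r**exponent."""
    ranks = np.arange(1, vocab_size + 1, dtype=np.float64)
    weights = ranks ** (-exponent)
    return weights / weights.sum()

V = 50_000
ceiling = np.log2(V)                              # attained only by the uniform distribution
uniform_bits = entropy_bits(np.full(V, 1.0 / V))
zipf_bits = entropy_bits(zipf_distribution(V))

print(f"ceiling log2(V): {ceiling:.2f} bits")
print(f"uniform entropy: {uniform_bits:.2f} bits")
print(f"Zipf entropy:    {zipf_bits:.2f} bits")
```

The skewed distribution already sits below the ceiling without any context. Conditioning on the preceding tokens can only lower the per-token entropy further, which is the direction in which real text departs from the bound.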
The MTP lever, in one line
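Suppose we attach $k$ output heads instead of one, so position $t$ predicts the tokens at $t+1, \dots, t+k$. The extra heads cost on the order of $6\,(k-1)\,d\,V\,T$ FLOPs per step, where $d$ is the hidden width, which is small compared with the trunk's $6\,N\,T$, so to a first approximation the FLOPs are unchanged. The bits scale linearly:
$$\text{bits/step} \le k\,T \log_2 V \quad\Longrightarrow\quad \rho_k \approx \frac{k \log_2 V}{6\,N} = k\,\rho_1$$
We multiplied the density by $k$ almost for free. Whether the model can use the extra targets without confusing itself is the engineering question that the remaining sections of this chapter answer. Here we are only establishing the prize.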
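The trade described above can be checked with a few lines of arithmetic. This is a minimal sketch under stated assumptions: each extra head is charged as a single hidden-width × vocabulary projection, which is this section's simplification and not a description of any particular model's MTP module. The function name, the `d_model` parameter, and the 7B-class values are illustrative.

```python
import math

def density_bits_per_flop(n_params, vocab_size, k=1, d_model=None):
    """Ceiling on supervisory density with k prediction targets per position.

    Trunk cost follows the 6 * N FLOPs-per-token rule. If d_model is given,
    each of the k - 1 extra heads is charged as one d_model x vocab_size
    projection (an assumption of this sketch); otherwise heads are free.
    """
    head_params = 0 if d_model is None else (k - 1) * d_model * vocab_size
    flops_per_token = 6 * (n_params + head_params)
    bits_per_token = k * math.log2(vocab_size)
    return bits_per_token / flops_per_token

N, V, D_MODEL = 7e9, 50_000, 4096   # illustrative 7B-class values; D_MODEL is assumed

base = density_bits_per_flop(N, V, k=1)
for k in (1, 2, 4):
    free = density_bits_per_flop(N, V, k=k)
    charged = density_bits_per_flop(N, V, k=k, d_model=D_MODEL)
    print(f"k={k}: x{free / base:.2f} (heads free), x{charged / base:.2f} (heads charged)")
```

Charging for the heads pulls the multiplier slightly below $k$; the larger the trunk relative to $d \cdot V$, the smaller that correction becomes.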
Manual Numerical Walkthrough
Numbers make the asymmetry vivid. Let us compute the supervisory density for three concrete configurations, by hand.
Density calculation for 7B, 70B, and 671B models
Common inputs. Context length $T = 4096$, vocabulary $V = 50{,}000$ unless a step says otherwise, batch size $B = 1$. We compute everything per processed sequence so the numbers stay on a human scale.
Step 1. Llama-7B class, k = 1. $N = 7 \times 10^9$ parameters.
- FLOPs/step = 6 · 7e9 · 4096 ≈ 1.72 × 10¹⁴ (172 TFLOPs)
- log₂(50000) ≈ 15.61 bits per target
- Bits/step (ceiling) = 4096 · 15.61 ≈ 63,937 bits
- ρ = 63937 / 1.72e14 ≈ 3.72 × 10⁻¹⁰ bits/FLOP
- ρ ≈ 0.37 bits per GFLOP (multiply bits/FLOP by 10⁹)
Read: every gigaflop of training compute is rewarded with less than half a bit of supervisory information. That is the optimistic ceiling: real text only fills 2–4 bits of the 15.6 available, so the actual figure is closer to 0.05–0.1 bits/GFLOP.
Step 2. A 70B model, k = 1. $N = 7 \times 10^{10}$.
- FLOPs/step = 6 · 7e10 · 4096 ≈ 1.72 × 10¹⁵
- Bits/step (ceiling) unchanged ≈ 63,937 bits
- ρ ≈ 63937 / 1.72e15 ≈ 3.72 × 10⁻¹¹ bits/FLOP ≈ 0.037 bits/GFLOP
We multiplied parameters by 10× and density dropped by 10×. The formula predicts exactly this, and any dense model trained with plain next-token prediction lands at this ratio.
Step 3. DeepSeek-V3, k = 1. $N = 6.71 \times 10^{11}$ total parameters (MoE; ~37B activated per token, but the gradient touches all parameters via expert routing across the dataset).
- FLOPs/step (effective, activated) ≈ 6 · 3.7e10 · 4096 ≈ 9.1 × 10¹⁴
- FLOPs/step (full, naïve) ≈ 6 · 6.71e11 · 4096 ≈ 1.65 × 10¹⁶
- Bits/step ceiling ≈ 4096 · log₂(128000) ≈ 4096 · 17.0 ≈ 69,632 bits
- ρ (effective) ≈ 69632 / 9.1e14 ≈ 7.7 × 10⁻² bits/GFLOP
- ρ (full) ≈ 69632 / 1.65e16 ≈ 4.2 × 10⁻³ bits/GFLOP
Either way you count it, by activated FLOPs or by the full sparse footprint, the density is well under a tenth of a bit per gigaflop. The compute spend per unit of supervisory information has collapsed.
Step 4. Same DeepSeek-V3, but k = 2. The trunk cost is essentially unchanged. The extra LM head costs roughly $d \cdot V$ parameters per head, a fraction of a percent of the 671B trunk. So FLOPs/step grow by perhaps 0.2%, and bits/step double.
- Bits/step ≈ 2 · 69,632 = 139,264 bits
- ρ (effective) ≈ 139264 / 9.1e14 ≈ 0.15 bits/GFLOP, a 2× improvement
- FLOP overhead vs k = 1: ~0.2%
The headline. Moving from k = 1 to k = 2 gives a 2× boost in supervisory density for a 0.2% increase in compute. This is the economic case for MTP at frontier scale — and the next four sections describe how to actually realize that gain without breaking training stability.
Sanity check. If you set N = 100M and run the formula, $\rho \approx 26$ bits/GFLOP, nearly two orders of magnitude better than the 7B case. That is why early small transformers never felt this pressure: the bottleneck only becomes crippling once N gets large.
Visualizing the Bottleneck
The diagram below makes the compute-vs-signal asymmetry visceral. Each token column is a full forward + backward pass through $L$ layers, with every amber cell lit at every step. The thin green lines at the bottom are the only supervisory signal that pass extracts. Hover any column to see what one position is asked to learn from one round of compute.
Three behaviors to try out. First, push the model-size slider up to 671B and watch the "bits per GFLOP" meter shrink — even though you have not changed the supervisory side at all, the compute side ballooned and the ratio collapsed. Second, push the predictions-per-position slider from 1 to 4. Compute barely budges; the supervisory bar grows in lockstep with k. The density meter climbs. Third, try the preset buttons: "DeepSeek-V3 / NTP" gives the tiny ratio that motivated MTP, and "DeepSeek-V3 / MTP-2" shows the resulting density without redrawing a single layer of the trunk.
Plain Python: Counting the Signal
Before we touch a real loss function, it is worth writing the bookkeeping directly. The code below does no training — it simply counts, in NumPy, what the standard NTP step costs and what it produces. The same formulas scale, unchanged, from the toy case to DeepSeek-V3.
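A minimal sketch of that bookkeeping is given here, assuming the $6NT$ cost rule and the $\log_2 V$ ceiling from the math section. The function names are illustrative; the constants `N_PARAMS`, `V`, and `T` follow the naming used in the surrounding discussion, and the 7B-class values are just a starting point.

```python
import numpy as np

T = 4096            # tokens processed per step (one sequence)
V = 50_000          # vocabulary size
N_PARAMS = 7e9      # parameter count; try 7e10 or 6.71e11 as well

def flops_per_step(n_params, n_tokens):
    """Forward + backward cost under the 6 * N * T rule."""
    return 6.0 * n_params * n_tokens

def bits_per_step(n_tokens, vocab_size, k=1):
    """Ceiling on supervisory bits: k targets per position, log2(V) bits each."""
    return k * n_tokens * np.log2(vocab_size)

def density_per_gflop(n_params, n_tokens, vocab_size, k=1):
    """Supervisory density in bits per GFLOP (1 GFLOP = 1e9 FLOPs)."""
    bits = bits_per_step(n_tokens, vocab_size, k)
    return bits / flops_per_step(n_params, n_tokens) * 1e9

print(f"FLOPs/step: {flops_per_step(N_PARAMS, T):.3e}")
print(f"bits/step:  {bits_per_step(T, V):,.0f}")

# Preview of MTP: the network is untouched, only the bits term is scaled by k.
for k in (1, 2, 4):
    print(f"k={k}: {density_per_gflop(N_PARAMS, T, V, k):.4f} bits/GFLOP")
```

The heads are treated as free here, so the loop shows the idealized multiplier; charging for them would lower each figure slightly.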
Two structural points are worth marking. First, the loop over k is the cheapest possible preview of MTP: nothing about the network changes; we simply multiply the bits term by k and re-divide. The fact that this trivial edit produces a multiplicatively better ratio is the entire reason MTP is a serious research direction. Second, the FLOPs estimate uses the canonical $6NT$ rule because that is what scaling-law papers use; if you have read Hoffmann et al.'s Chinchilla paper, this is the same accounting, written there as $C \approx 6ND$ with $D$ training tokens.
Sanity check. If you set N_PARAMS = 1 and V = 2, the bits/step becomes $T$ and the FLOPs/step becomes $6T$. Density = 1/6 bits/FLOP, or one bit per six operations. That is the ceiling this formula gives for a one-parameter model with a binary vocabulary. Any real model lands many orders of magnitude below it, and the gap is the cost of using the transformer as a function approximator.
PyTorch: The Standard NTP Loss in One Line
Now the same idea, instantiated in PyTorch. This is the loss every modern LLM is trained with, give or take a label-smoothing parameter. Notice how short it is — that brevity is exactly what we are about to enrich.
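A minimal sketch of that loss follows, assuming logits of shape `(batch, seq_len, vocab)` and integer token ids of shape `(batch, seq_len)`. The function name `ntp_loss` and the toy shapes are illustrative, and random logits stand in for a real model's output.

```python
import torch
import torch.nn.functional as F

def ntp_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Standard next-token cross-entropy.

    logits: (batch, seq_len, vocab) model outputs
    tokens: (batch, seq_len) input token ids
    """
    # Shift: logits at position t are scored against the token at t + 1.
    pred = logits[:, :-1, :]          # last position has no target
    target = tokens[:, 1:]            # first token is never predicted
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),   # (batch * (seq_len - 1), vocab)
        target.reshape(-1),                # (batch * (seq_len - 1),)
    )                                      # default reduction is "mean"

# Toy usage with random logits standing in for a model's output.
torch.manual_seed(0)
B, T, V = 2, 8, 100
tokens = torch.randint(0, V, (B, T))
logits = torch.randn(B, T, V, requires_grad=True)
loss = ntp_loss(logits, tokens)
loss.backward()
print(loss.item())
```

The whole objective is the single `F.cross_entropy` call; everything else is the shift and the reshape that the call needs.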
Two subtleties worth marking, both about how this loss interacts with the rest of the training step:
- The reduction is a mean, not a sum. Each of the $T$ positions contributes $1/T$ of the final scalar. This means the loss magnitude does not grow with context length, which is convenient, but it also means the gradient each token contributes is scaled down by $1/T$. The asymmetry we counted at the FLOP level is faithfully preserved at the gradient level: the more positions share one scalar loss, the smaller the weight on each position's backward signal.
- The shift is non-negotiable. Logits at position t predict the token at t+1. If you forget to shift, you train the model to predict itself — trivial accuracy, zero useful gradient. This is a common bug in hand-rolled training loops and the reason most frameworks ship a wrapped loss with the shift baked in.
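The effect of the mean reduction in the first point can be seen directly from the closed-form gradient of cross-entropy with respect to the logits. The sketch below is a NumPy illustration, not part of any training loop; the function name and the toy sizes are assumptions, and the same position is repeated $T$ times so that only the sequence length varies.

```python
import numpy as np

def mean_ce_grad(logits, targets):
    """Gradient of mean cross-entropy with respect to the logits.

    logits: (T, V) array, targets: (T,) integer class ids.
    With the mean reduction, the gradient at each position is
    (softmax(logits) - one_hot(target)) / T.
    """
    T = logits.shape[0]
    z = logits - logits.max(axis=1, keepdims=True)   # stabilise the softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    probs[np.arange(T), targets] -= 1.0
    return probs / T

rng = np.random.default_rng(0)
V = 50
row_logits = rng.normal(size=(1, V))
row_target = np.array([3])

for T in (1, 16, 256):
    grad = mean_ce_grad(np.repeat(row_logits, T, axis=0), np.repeat(row_target, T))
    print(T, np.linalg.norm(grad[0]))
```

The printed norm for a single position shrinks in proportion to $1/T$, which is the dilution the first bullet describes.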
What Changes at Massive Scale
At toy scale the bottleneck is a curiosity. At frontier scale it is a defining constraint of the training economy. Three things change as N grows from millions to hundreds of billions of parameters.
Compute becomes the dominant cost
At 100M parameters, a training run might cost a few thousand GPU-hours and the data pipeline is the bottleneck: you tune tokenization, dedup, and domain mixing because that is where the gains are. At 671B parameters, a single training run costs millions of GPU-hours (DeepSeek-V3 reports roughly 2.8 million H800 GPU-hours), and the data pipeline is solved enough that every additional bit of supervisory signal per FLOP translates directly into millions of dollars of compute equivalent. The density ratio stops being academic.
The model can absorb more signal than it gets
Empirically, 100B+ models trained on NTP show clear evidence of under-supervision: probes of intermediate layers reveal predictive information about tokens many positions ahead, but the model is never asked to use it. Multi-token prediction does not teach the model new tricks — it unlocks tricks the model has already learned in service of a one-step objective. The capacity is already there; only the supervisory signal is missing.
Inference cost lines up with training-time choices
A model trained to predict $k$ tokens ahead at every position is, almost by accident, also a model whose internal representations are well-suited for speculative decoding at inference: draft k tokens in parallel and accept the prefix that matches a verifier. Section 7.5 makes this connection explicit. What looks like a training-time density argument in this section turns into a 2–4× inference speedup later, with no extra parameters and no separate draft model required.
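The accept-the-matching-prefix step can be sketched in a few lines. This is a simplified exact-match (greedy) version, assuming the verifier's own choice at each drafted position is already available; practical speculative decoding schemes often use a probabilistic acceptance rule instead. The function name is hypothetical.

```python
from typing import List, Sequence

def accept_prefix(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Keep the longest prefix of `draft` that agrees with the verifier.

    draft:    k token ids proposed in one shot (for example by MTP heads)
    verified: the verifier's own greedy choice at each of those k positions
    """
    accepted: List[int] = []
    for proposed, checked in zip(draft, verified):
        if proposed != checked:
            break                 # first disagreement ends the accepted run
        accepted.append(proposed)
    return accepted

print(accept_prefix([11, 42, 7, 99], [11, 42, 8, 99]))   # [11, 42]
```

Every accepted token beyond the first is a decoding step the full model did not have to run sequentially, which is where the inference speedup comes from.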
| Era | Typical N | Bottleneck | Why NTP was fine |
|---|---|---|---|
| GPT-2 (2019) | 1.5B | Data quality | Density ceiling was ~1.7 bits/GFLOP, which was plenty |
| GPT-3 (2020) | 175B | Data scale | Density dropped, but data outpaced compute |
| Chinchilla / Llama-2 (2022-23) | 7B–70B | Compute–data balance | Density tight but tolerable |
| DeepSeek-V3 / GPT-4 class (2024-25) | 300B–1T+ | Bits per FLOP | Density collapsed; MTP becomes necessary |
Engineering Reality and the Door MTP Opens
Three practical observations sit on top of the math, and each one will recur through the rest of this chapter.
- The bottleneck is not a bug; it is a default. F.cross_entropy on shifted logits is so easy and so robust that it became the universal recipe. There is nothing "wrong" with it — every model in production uses it. The point is that the default leaves information on the table, and the cost of leaving that information on the table grows with model scale. MTP is not fixing a bug; it is collecting on a debt.
- Naïve parallel k-target prediction does not work. The obvious move is to attach k independent output heads, each predicting from the same hidden state. This is what Section 7.2 examines, and it fails: heads compete for the same representation, gradients clash, and the model often does worse than NTP. DeepSeek-V3's sequential causal MTP from Section 7.3 is the construction that actually works, and the rest of the chapter is about why.
- The right loss coefficient is small. Even when the architecture is right, the MTP heads are auxiliary; the main objective is still next-token prediction. DeepSeek-V3 weights MTP losses at roughly $\lambda = 0.3$ early in training and lowers the weight to $\lambda = 0.1$ by the end. Section 7.4 derives why this schedule matters, and how the wrong choice silently destroys the gains from the architecture in Section 7.3.
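The last point, a small auxiliary weight on top of the main objective, can be sketched as a scalar combination. This is an assumed form for illustration only: the main loss plus a weighted mean of the auxiliary losses, with a piecewise-constant weight. The function names, the default weights, and especially the `switch_at` point are placeholders, not sourced values.

```python
from typing import Sequence

def mtp_weight(progress: float, early: float = 0.3, late: float = 0.1,
               switch_at: float = 0.5) -> float:
    """Piecewise-constant auxiliary weight.

    progress:  fraction of training completed, in [0, 1]
    switch_at: hypothetical switch point; a placeholder, not a sourced value
    """
    return early if progress < switch_at else late

def total_loss(ntp_loss: float, mtp_losses: Sequence[float], weight: float) -> float:
    """Main next-token loss plus the weighted mean of the auxiliary MTP losses."""
    if not mtp_losses:
        return ntp_loss
    return ntp_loss + weight * sum(mtp_losses) / len(mtp_losses)

print(total_loss(2.0, [3.0], mtp_weight(0.1)))   # 2.0 + 0.3 * 3.0
print(total_loss(2.0, [3.0], mtp_weight(0.9)))   # 2.0 + 0.1 * 3.0
```

Averaging over the auxiliary heads keeps the total auxiliary contribution governed by the weight alone, independent of how many heads are attached.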
One sentence to carry forward into the rest of the chapter: the single-token objective is a compute-efficient learner of a signal-inefficient problem; multi-token prediction is the cheapest known way to fix the supervisory side without touching the trunk. Everything that follows in Chapter 7 is the careful engineering of that fix.