Sections 7.1 through 7.4 made the case for Multi-Token Prediction as a training objective: it pulls more supervision out of every forward pass, sharpens the model's long-range planning, and costs almost nothing in extra parameters. But there is a second life for the same machinery that DeepSeek-V3 quietly cashes in at deployment time. The K small MTP heads that were trained to predict tokens ahead already know how to draft K speculative tokens for any prefix. Hand those drafts to the main model for one verification forward pass and you get a wall-clock speedup that costs nothing to deploy. That is MTP-driven speculative decoding — and it is why a 671B-parameter mixture-of-experts model can serve tokens at a price that makes commercial sense.
The promise of MTP at inference. Reuse the training-time draft heads as a free, in-distribution draft model. One main-model forward pass per round, K draft tokens proposed in parallel, between 1 and K+1 tokens committed. The arithmetic is striking: with K=3 and a 75% acceptance rate, the same hardware emits roughly 2.7× as many tokens per second as the standard autoregressive loop, with an identical sampling distribution.
The Inference Bottleneck
Standard autoregressive decoding does one forward pass per generated token. For a 67B-parameter dense model that is hundreds of gigabytes of parameter weights and KV-cache state pulled through GPU memory per token. The bottleneck on a modern H100 is not compute — the tensor cores are idling. The bottleneck is the memory bandwidth needed to stream the model weights and the cache into the SMs for each forward.
Concretely, a single H100 has roughly 3 TB/s of HBM bandwidth, and a 67B FP16 model is about 134 GB. If every token requires a full read of the weights, the ceiling is 3 TB/s ÷ 134 GB ≈ 22 tokens per second per GPU, and that ceiling does not budge no matter how fast your tensor cores are. Compute utilisation at inference often sits below 5%. The expensive thing on the chip is doing nothing for 95% of every second.
| Resource | Used in training | Used in autoregressive inference |
|---|---|---|
| Tensor-core FLOPs | Near-saturated (large batch, long seq) | ~3–5% utilised |
| HBM bandwidth | Mostly hidden behind compute | The hard ceiling |
| KV-cache reads | Once per layer per step | Once per layer per token |
| Effective throughput | Hundreds of TFLOPS | Tens of tokens / sec / GPU |
| What is wasted | Almost nothing | The entire compute budget |
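The bandwidth ceiling is a single division, and it can be checked in a few lines. This is a minimal sketch using the round figures from the text (3 TB/s of HBM bandwidth, 2 bytes per FP16 parameter); it ignores KV-cache traffic, and the function name is purely illustrative.

```python
def bandwidth_bound_tokens_per_s(params_billion: float,
                                 bytes_per_param: float,
                                 hbm_tb_per_s: float) -> float:
    """Upper bound on decode speed if every token streams all weights once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = hbm_tb_per_s * 1e12
    return bandwidth_bytes_per_s / weight_bytes


# 67B parameters in FP16 on roughly 3 TB/s of HBM bandwidth (figures from the text).
fp16_ceiling = bandwidth_bound_tokens_per_s(67, 2, 3.0)
print(f"{fp16_ceiling:.1f} tokens/s")
```

Faster tensor cores do not appear anywhere in this expression, which is the point: only fewer bytes per token or more tokens per weight read can raise the bound.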
Intuition: Two Models, One Forward Pass
The mental image is a typist and a proof-reader. The typist (the MTP heads) is quick and not always right — they bash out the next few words at speed. The proof-reader (the main model) is slow and authoritative — they read the whole proposal and stop at the first word that they themselves would not have written. Everything before that line is kept. The proof-reader then writes the corrected word and hands the draft back.
Crucially, the proof-reader does not re-read the page once for each word. They read the entire proposed continuation in a single sweep, and because the proof-reader is a causal transformer, that sweep already gives them their own prediction at every position simultaneously. One forward pass delivers K+1 verdicts. If the typist guesses right even half the time, each pass commits more than one token on average, and the throughput of the system as a whole rises by that average.
Two properties make this work without compromising output quality. First, the accept-or-reject rule is greedy and exact: a draft token is accepted if and only if it equals what the main model would have produced at that position. Second, the rejected position is repaired using the main model's own argmax at that position — which is sitting right there in the verification forward pass. So the committed sequence is byte-for-byte identical to what pure autoregressive decoding would have produced. Speculative decoding is a speedup, not an approximation.
The Math of Acceptance and Speedup
Let $p$ be the per-token probability that a draft from the MTP heads agrees with what the main model would have emitted at the same position, and let $K$ be the number of draft tokens proposed each round. Acceptance is sequential: the moment one draft is rejected, all subsequent drafts are discarded because they were conditioned on a prefix the main model would not have written. So the number of accepted drafts $J$ in a round follows a truncated geometric distribution:
$$P(J = j) = \begin{cases} p^{j}(1-p), & 0 \le j < K \\ p^{K}, & j = K \end{cases}$$
Every round commits $J+1$ tokens (the $+1$ being the main model's free prediction at the rejection point; even when $J = K$ we get one extra position out of the forward pass). The expected number of tokens committed per round is therefore:
$$E[J+1] = \sum_{j=0}^{K} p^{j} = \frac{1 - p^{K+1}}{1 - p}$$
That expectation is, to a first approximation, the wall-clock speedup over autoregressive decoding, because every round costs essentially one main forward pass and the standard loop would need $E[J+1]$ such passes to produce the same number of tokens. Three regimes are worth burning into intuition:
| K | p | E[J+1] | Wall-clock speedup |
|---|---|---|---|
| 1 | 0.80 | 1.80 | 1.80× |
| 2 | 0.80 | 2.44 | 2.44× |
| 3 | 0.80 | 2.95 | 2.95× |
| 4 | 0.80 | 3.36 | 3.36× |
| 3 | 0.50 | 1.88 | 1.88× |
| 3 | 0.90 | 3.44 | 3.44× |
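The table entries can be reproduced by summing the geometric series 1 + p + ... + p^K directly. The sketch below assumes, as the text does, that each draft is accepted independently with the same probability p; the function name is illustrative.

```python
def expected_tokens_per_round(p: float, k: int) -> float:
    """E[J+1] = sum of p^j for j = 0..K, with per-draft acceptance probability p."""
    return sum(p ** j for j in range(k + 1))


# Same (K, p) pairs as the table rows.
for k, p in [(1, 0.8), (2, 0.8), (3, 0.8), (4, 0.8), (3, 0.5), (3, 0.9)]:
    print(f"K={k}  p={p:.2f}  E[J+1]={expected_tokens_per_round(p, k):.2f}")
```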
Why the speedup saturates with K
The sum $\sum_{j=0}^{K} p^{j}$ is a geometric series: its limit as $K \to \infty$ is $1/(1-p)$. With $p = 0.8$ that ceiling is exactly 5. Pushing K past 3 or 4 moves the needle less and less, because draft $K$ only counts if every earlier draft was accepted, so it adds just $p^{K}$ expected tokens: each new draft is worth another factor of $p$ less than the one before. This is why DeepSeek-V3 trains only one MTP head and still gets most of the available speedup at inference: the marginal benefit of head 4 is much smaller than the marginal cost of training it well.
Why the cost of a draft is not zero — but close
The above formula treated each round as one main-model forward pass. The MTP heads themselves do cost something, but each head is a few linear layers riding on the same trunk output, so the K-head draft costs roughly $K\epsilon$ main-model forward passes, where $\epsilon \ll 1$ is the cost of one head relative to one main forward. Empirically the overhead is on the order of 1–3% of one main-model forward. So a speedup of 2.95× in the table above lands as about 2.9× after honest accounting, which is still decisive.
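To see how little the head overhead matters, the expected tokens per round can be divided by the cost of a round measured in main-model forward passes. This is a rough sketch: `head_cost` is a hypothetical parameter (one head's cost as a fraction of one main forward), and the 1% value is an assumption used for illustration only.

```python
def effective_speedup(p: float, k: int, head_cost: float) -> float:
    """Expected tokens per round divided by the round cost in main-forward units."""
    expected_tokens = sum(p ** j for j in range(k + 1))
    round_cost = 1.0 + k * head_cost  # one main forward plus K small heads
    return expected_tokens / round_cost


naive = effective_speedup(0.8, 3, 0.0)    # heads treated as free
honest = effective_speedup(0.8, 3, 0.01)  # assumed 1% of a main forward per head
print(f"naive {naive:.3f}x, with head overhead {honest:.3f}x")
```

Under these assumptions the correction is a few percent, small next to the difference between acceptance rates of 0.5 and 0.8 in the table.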
Manual Numerical Walkthrough
Let us step through the rounds by hand with K = 3 and p = 0.8, writing the toy sentence "The small boat drifted into the harbour." Every number is worked out by hand so the mechanism is fully exposed.
Setup. The main model would (if asked one token at a time) emit the sentence "The small boat drifted into the harbour." We will simulate the MTP heads drawing K=3 drafts per round with each draft agreeing with the main model with probability p = 0.8.
Round 1: prefix is empty. The MTP heads draft 3 tokens. Suppose they emit ["The", "small", "ship"]. The main model is then run on the concatenation ["The", "small", "ship"] and emits its own predictions at every position: ["The", "small", "boat", "drifted"]. That is K+1 = 4 next-token predictions for the price of one forward pass.
Compare draft vs. main, left-to-right: position 0 agrees (The), position 1 agrees (small), position 2 disagrees (ship vs. boat). We accept J = 2 drafts and commit the main model's third prediction. The fourth prediction was conditioned on the rejected draft, so it is discarded. Tokens committed this round: ["The", "small", "boat"], three tokens for one main-model forward pass.
Round 2: prefix is "The small boat". The MTP heads draft 3 more: ["drifted", "into", "the"]. All three agree with the main model's argmax at positions 3, 4, 5, so J = 3 = K. We accept all three drafts and commit the main model's free prediction at position 6: "harbour". Tokens committed: ["drifted", "into", "the", "harbour"], four tokens for one forward pass. This is the K+1 best-case round.
Round 3: sentence complete. The sentence ends here, so the loop terminates after 2 rounds. Standard autoregressive decoding would have needed 7 forward passes to emit the same 7 tokens. Wall-clock speedup on this micro-example: 7 / 2 = 3.5×.
Check against the formula. With K = 3, p = 0.8: E[J+1] = 1 + 0.8 + 0.64 + 0.512 = 2.952. Our two rounds averaged (3 + 4) / 2 = 3.5 tokens/round, a bit lucky compared to the long-run mean of 2.952, but within one round's noise.
The MTP-head cost. Each draft round we ran K=3 small heads on a single trunk output. If one head costs 1% of one main forward, the three drafts add 3% per round. So our effective long-run speedup is 2.952 / 1.03 ≈ 2.87×: not the naive 2.952, but still a transformative win on the inference cost sheet.
Visualizing One Verification Round
The visualizer below builds the sentence one round at a time. Each round shows the K draft tokens (green = main model would have written this, red = rejected) and the corrected main-model token tacked on at the rejection point. Toggle between Autoregressive (forced K=1) and Speculative to feel the round-by-round difference. Slide K between 1 and 4, and the acceptance probability between 0.30 and 0.95, to explore the regime around where DeepSeek-V3 actually operates.
Three things to lock in. First, the "tokens / round (observed)" stat is the empirical version of E[J+1]. With p=0.8 and K=3 it hovers around 3, exactly as the geometric series predicts. With K=1 it can never exceed 2; that is the speedup ceiling for any 1-token draft model. Second, every round commits at least one token, even when the very first draft was wrong: the main model's "corrected" token always lands. There is no scenario where a speculative round commits fewer tokens than an autoregressive step would have. Third, the speedup is bounded above by $1/(1-p)$: with p=0.8 the ceiling is 5×, and each draft added beyond K=4 closes less of the remaining gap than the one before.
Plain Python: A Speculative Decoding Loop
Below is the entire mechanism in plain Python — no PyTorch, no model. We replace the main model with a deterministic lookup table (it "wants to write" a fixed sentence) and the MTP heads with a coin flip per token. The loop counts rounds versus tokens so the speedup is visible.
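A minimal sketch of such a loop follows. It assumes a lookup-table "main model" and coin-flip drafts as described; apart from `main_model_verify`, which the discussion refers to by name, every identifier (`mtp_draft`, `speculative_decode`, `WRONG`) is a hypothetical name chosen for illustration.

```python
import random

SENTENCE = ["The", "small", "boat", "drifted", "into", "the", "harbour"]
WRONG = "<wrong>"  # placeholder for a draft the main model would reject


def main_model_verify(prefix_len, k, target):
    """Stand-in for one main-model forward pass.

    Returns the K+1 tokens the main model would emit at positions
    prefix_len .. prefix_len + K (fewer if the target runs out).
    """
    return target[prefix_len:prefix_len + k + 1]


def mtp_draft(prefix_len, k, p, target):
    """Stand-in for the MTP heads: each draft is right with probability p."""
    truth = target[prefix_len:prefix_len + k]
    return [tok if random.random() < p else WRONG for tok in truth]


def speculative_decode(target, k=3, p=0.8, seed=0):
    """Decode `target` with draft-and-verify rounds; returns (tokens, rounds)."""
    random.seed(seed)
    out, rounds = [], 0
    while len(out) < len(target):
        drafts = mtp_draft(len(out), k, p, target)
        targets = main_model_verify(len(out), k, target)
        n_accept = 0
        for draft, wanted in zip(drafts, targets):
            if draft != wanted:
                break  # first mismatch: later drafts are discarded
            n_accept += 1
        # Accepted drafts equal targets[:n_accept]; targets[n_accept] is the
        # main model's correction, or the free bonus token if all K matched.
        out.extend(targets[:n_accept + 1])
        rounds += 1
    return out, rounds


tokens, rounds = speculative_decode(SENTENCE, k=3, p=0.8, seed=0)
print(" ".join(tokens))
print(f"{len(tokens)} tokens in {rounds} rounds (autoregressive: {len(tokens)} passes)")
```

Whatever the seed, the committed tokens equal the target sentence; only the number of rounds changes. On a seven-token sentence the final round is often cut short, so a much longer target is needed before the average tokens per round settles near the formula's value.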
Two structural details deserve a second pass. First, the verification call in main_model_verify intentionally returns K+1 targets, not K. That extra slot — the main model's prediction at position L+K, conditioned on all the drafts — is the "free" token that gets committed even when every single draft was correct. Forget the +1 and your best-case round only commits K tokens instead of K+1.
Second, the accept rule is greedy-equality: drafts are accepted iff they exactly match the main model's argmax. This works for greedy decoding (temperature 0). For temperature sampling or top-p sampling, the rule is replaced by the speculative-sampling probability-ratio test from Chen et al., 2023: accept draft token $x$ with probability $\min\left(1,\, p_{\text{main}}(x) / p_{\text{draft}}(x)\right)$, where $p_{\text{main}}$ and $p_{\text{draft}}$ are the per-position output distributions of the main model and the draft heads; on rejection, resample from the normalised residual $\max(0,\, p_{\text{main}} - p_{\text{draft}})$. That rule preserves the main model's sampling distribution exactly: sampled outputs are still drawn from the main model, just much faster.
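The probability-ratio test for sampled decoding is short enough to check empirically. The sketch below is a toy under stated assumptions: a three-token vocabulary, made-up distributions `p_draft` and `p_main`, and a hypothetical helper `speculative_accept`. A rejected draft is replaced by a sample from the normalised residual max(0, p_main - p_draft), which is the step that makes the output distribution match the main model.

```python
import random


def speculative_accept(draft_token, p_draft, p_main, rng):
    """One accept/reject step of speculative sampling.

    p_draft and p_main are probability lists over the same vocabulary.
    Returns (token, accepted).
    """
    ratio = p_main[draft_token] / p_draft[draft_token]
    if rng.random() < min(1.0, ratio):
        return draft_token, True
    # Rejected: resample from the residual max(0, p_main - p_draft).
    # random.choices normalises the weights for us.
    residual = [max(0.0, m - d) for m, d in zip(p_main, p_draft)]
    token = rng.choices(range(len(residual)), weights=residual)[0]
    return token, False


rng = random.Random(0)
p_draft = [0.6, 0.3, 0.1]  # toy draft distribution (assumed)
p_main = [0.3, 0.3, 0.4]   # toy main-model distribution (assumed)
n = 100_000
counts = [0, 0, 0]
for _ in range(n):
    draft = rng.choices(range(3), weights=p_draft)[0]
    token, accepted = speculative_accept(draft, p_draft, p_main, rng)
    counts[token] += 1
print([round(c / n, 3) for c in counts])  # empirical frequencies, close to p_main
```

Even though the drafts come from a distribution that favours token 0, the committed tokens follow `p_main` up to sampling noise.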
Sanity check the formula. Run the script ten times with `random.seed(0..9)` and compute the empirical average tokens per round. With K=3, p=0.8 you should see ≈2.95 tokens per round: the geometric series talking back to you in code.
PyTorch: Verifying K Drafts in One Forward Pass
The production version replaces the deterministic table with real tensors: argmax over the vocabulary at K+1 positions of a single forward pass. The accept-or-reject decision lives entirely on the GPU, with no Python loop inside the hot path.
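A sketch of what that verification step might look like for a single sequence is shown below. It assumes the main model's logits for the K+1 relevant positions are already available as a `(K+1, vocab)` tensor, with row i predicting the token at draft position i; `verify_drafts` is a hypothetical helper, not an API from any serving library.

```python
import torch


def verify_drafts(logits: torch.Tensor, draft_ids: torch.Tensor) -> torch.Tensor:
    """Greedy verification of K drafts against K+1 rows of main-model logits.

    logits:    (K+1, vocab) main-model logits, row i predicting draft position i
    draft_ids: (K,) token ids proposed by the draft heads
    Returns the committed token ids (between 1 and K+1 of them).
    """
    targets = logits.argmax(dim=-1)                # (K+1,) main-model choices
    k = draft_ids.shape[0]
    matches = (draft_ids == targets[:k]).long()    # 1 where the draft agrees
    # cumprod zeroes everything after the first mismatch, so the sum is the
    # length of the accepted prefix. .item() is the only host sync here.
    n_accept = int(matches.cumprod(dim=0).sum().item())
    # Accepted drafts equal targets[:n_accept]; the next target is the
    # correction, or the bonus token when all K drafts matched.
    return targets[: n_accept + 1]


# Toy logits whose argmax is [5, 7, 2, 9]; the third draft disagrees.
target_ids = torch.tensor([5, 7, 2, 9])
logits = torch.nn.functional.one_hot(target_ids, num_classes=10).float()
print(verify_drafts(logits, torch.tensor([5, 7, 3])).tolist())
```

The cumulative product replaces the Python `for ... break` loop of the plain-Python version: the comparison, prefix count, and slice are all tensor operations, which is what lets the decision stay on the device.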
Three subtleties worth marking, all about how this loop interacts with the rest of an LLM serving stack:
- The KV cache is reused across rounds. The verification forward pass populates the cache for the draft positions L through L+K-1, where L is the length of the committed prefix. Only the cache entries past the rejection point need to be invalidated; the entries for the accepted positions are still valid because they were computed conditioned on the actual committed prefix. A well-implemented serving stack therefore avoids recomputing the trunk for those tokens.
- Sampling preserves the main-model distribution. Speculative decoding is not an approximation. With greedy decoding the committed sequence is byte-for-byte identical to standard decoding; with temperature sampling the accept rule is the probability-ratio test and the marginal distribution of every committed token is still exactly the main model's. You cannot tell from the output whether a server is using speculative decoding under the hood.
- Batching changes the arithmetic. If the server batches N requests, the main-model forward pass already amortises bandwidth across N sequences — the memory-bandwidth bottleneck loosens and the tensor cores warm up. Speculative decoding still helps, but the speedup shrinks because the autoregressive baseline was already faster per token. The win is largest in single-stream / low-batch interactive serving, which is also where latency matters most.
At Massive Scale: Memory-Bandwidth Wins
Speculative decoding looks like a clever programming trick until you do the arithmetic for a real 671B-parameter MoE model. The DeepSeek-V3 numbers turn it into a deployment necessity.
| Model | Active params / token | AR tokens/s (1×H100) | Spec tokens/s (K=2) | Speedup |
|---|---|---|---|---|
| Llama-3 70B (dense) | 70B | ≈18 | ≈32 | 1.8× |
| DeepSeek-V3 (MoE) | 37B | ≈28 | ≈54 | 1.9× |
| DeepSeek-V3 + MTP K=2 (paper) | 37B | ≈28 | ≈58 | ≈2.1× |
| Mistral-Large (dense) | 123B | ≈11 | ≈22 | 2.0× |
Two patterns to read. First, every serving stack listed here uses speculative decoding in production — the question is where the drafts come from. Llama-3 70B uses a separately trained small Llama-3 8B draft model. DeepSeek-V3 uses the MTP heads that already exist for free because they were part of the training objective. The MTP route saves gigabytes of draft-model memory and avoids the distribution-shift problem that hurts external draft models (a separately trained 8B does not always agree with what the 70B would have written).
Second, the speedup is meaningfully larger for MoE models than the bare formula predicts. The reason is that an MoE forward pass only activates a sliver of the parameters (37B out of 671B for DeepSeek-V3) but, in the worst case, still has to stream all 671B of expert weights from HBM. Speculative decoding amortises that loading cost over the K+1 positions of a verification pass instead of one. The effective HBM-bandwidth ceiling, measured in tokens per second, moves up by the expected number of tokens committed per pass, E[J+1].
The interaction with expert parallelism
When experts are spread across GPUs (Chapter 5), each MoE forward pass triggers an all-to-all shuffle of tokens to the right experts and back. Speculative decoding multiplies the "tokens" in that shuffle by K+1 — which sounds bad until you remember that the all-to-all cost is almost entirely fixed-latency setup, not per-token data. Sending 1 token and sending 4 tokens through the same all-to-all costs almost the same wall-clock time. So speculative decoding turns the all-to-all latency into a smaller fraction of the per-token cost. The serving cluster gets cheaper, not more expensive.
The interaction with FP8 inference
DeepSeek-V3's FP8 training (Chapter 10) carries over to inference, cutting the model weight size roughly in half versus FP16. That shifts the memory-bandwidth ceiling upward by about 2× on its own. Speculative decoding then stacks on top, so a serving node that produced 18 tokens/s/H100 in FP16 autoregressive mode can produce roughly 18 × 2 × 2 ≈ 72 tokens/s/H100 in FP8 + spec-decode mode. That 4× compound speedup is what actually makes a 671B-parameter MoE model affordable to serve.
Engineering Reality and Gotchas
MTP-driven speculative decoding looks like a clean win on paper. Three production failure modes earn their flags:
- Draft / main distribution drift. The MTP heads were trained alongside the main model on the same data — but the main model continued to evolve through any fine-tuning, RLHF, or DPO stage that followed pre-training. If the alignment phase did not include the MTP heads in its optimisation, the heads' acceptance rate can degrade sharply on the new distribution (chat, instruction-following, refusal behaviour). The fix is to either re-tune the heads after alignment or back-propagate the alignment loss through them. DeepSeek-V3 reports a ≈4-point acceptance drop on chat traffic that they recover via head re-tuning.
- K too large hurts more than it helps. Pushing K past 4 rarely improves throughput on real workloads, and it can hurt: every extra draft adds head-compute, and the geometric saturation means the last drafts are mostly wasted. Production schedulers monitor the running acceptance rate per request and cap K dynamically — typically K=2 for chat, K=3 for code, K=4 for highly templated outputs.
- The verification forward pass needs careful masking. The K draft tokens added to the input must still be attended to causally — draft i sees drafts 0..i-1 but not i+1..K-1. If the attention mask accidentally lets later drafts leak into earlier positions, the verification logits become contaminated and the accept rate craters silently. This is the single most common bug in custom speculative-decoding kernels. Always unit-test by comparing a spec-decode forward pass against the autoregressive baseline at greedy temperature: they must produce byte-identical outputs.
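The masking failure in the last bullet can also be caught structurally, before any logits are compared. The sketch below builds the boolean mask for K draft rows attending over a prefix followed by the drafts, and checks that no row sees a later draft. Both function names are hypothetical, and the check complements the byte-identical output test rather than replacing it.

```python
def verification_mask(prefix_len: int, k: int) -> list:
    """mask[i][j] is True when draft row i may attend to key position j.

    Rows are the K draft tokens; columns are the prefix followed by the drafts.
    Draft i sits at absolute position prefix_len + i.
    """
    total = prefix_len + k
    return [[j <= prefix_len + i for j in range(total)] for i in range(k)]


def leaks_future(mask: list, prefix_len: int) -> bool:
    """True if any draft row is allowed to attend to a later position."""
    return any(
        allowed and j > prefix_len + i
        for i, row in enumerate(mask)
        for j, allowed in enumerate(row)
    )


mask = verification_mask(prefix_len=4, k=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print("leaks future:", leaks_future(mask, prefix_len=4))
```

A mask that passes this check can still be wired into the kernel incorrectly, so the greedy comparison against the autoregressive baseline remains the decisive test.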
The one sentence to carry forward: the MTP heads paid for themselves twice — once during training by improving sample efficiency, once at inference by turning every otherwise-idle tensor core into committed tokens — and that double-dip is the engineering reason a 671B-parameter MoE model can match dense-70B serving cost.