Section 7.3 gave us the architecture of sequential causal MTP — a stack of small modules sitting on top of the trunk, each one predicting one token further into the future, each one passing its hidden state to the next. That is the machinery. What we have not yet answered is the question every trainer asks next: what loss do we minimise? A sequential MTP stack is just an elaborate way to produce logits. Without an objective those logits are noise. The choice of loss decides what those extra heads learn, how aggressively the trunk reshapes itself to please them, and whether the whole construction earns its keep at training time and at inference time.
The promise of the MTP objective. A clean, scale-free addition to the standard next-token cross-entropy: one extra cross-entropy per depth, normalised by depth, scaled by a small coefficient that anneals across training. Less than 1 % extra FLOPs at $D = 1$; measurable downstream gains; and the very same heads turn into a free speculative-decoding draft model at inference. The ablation table in the DeepSeek-V3 report is the receipt.
What Objective Trains a Sequential MTP Stack?
The trunk already has a perfectly good loss: next-token cross-entropy. Every position $i$ predicts the token at position $i+1$, the prediction is scored by a softmax over the vocabulary, and the gradient flows backward into the embeddings and attention layers. This is the loss the trunk has been trained on since the first transformer paper. We are not allowed to harm it, because it is what makes the model a useful language model. Whatever we add for MTP must compose with the main loss without crowding it out.
Several questions immediately appear once you commit to additional supervision on the depth-$k$ heads:
- What does each MTP head predict? The natural answer is the token $k$ steps further into the future than the main head. So depth 1 predicts $t_{i+2}$, depth 2 predicts $t_{i+3}$, etc. The target is just the existing token sequence shifted by an extra $k$.
- How is each head scored? Same as the main head: cross-entropy against the true token $k$ positions beyond the usual next token. One softmax per head, one scalar per head.
- How much weight do these heads get? Too much and the trunk warps itself to please the MTP heads at the expense of plain next-token loss; too little and the heads do not learn anything useful. We need a coefficient $\lambda$.
- How do the heads interact across depths? Should depth 4 count four times as much as depth 1 (because there are more of them) or the same (because we care about the average)? This is what the $1/D$ normalisation answers.
- Does the weight change over training? Early on, future-token prediction may push the trunk toward better long-range representations. Late in training, the trunk has converged and any extra signal can only steal capacity. A schedule on $\lambda$ handles this.
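The shifting rule in the first bullet can be made concrete in a few lines of Python. This is a minimal sketch: the helper name `shifted_targets` and the example sentence are illustrative assumptions, and depth 0 stands for the main next-token head.

```python
def shifted_targets(tokens, depth):
    """Return (position, target) pairs for the head at the given depth.

    depth 0 is the main next-token head; depth k is graded on the token
    k steps further ahead. Positions are 1-indexed to match the text.
    """
    offset = depth + 1  # head k at position i is graded on t_{i+1+k}
    return [(i + 1, tokens[i + offset]) for i in range(len(tokens) - offset)]


# Hypothetical example sentence, one word per token.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
for k in range(3):
    print(f"depth {k}:", shifted_targets(tokens, k))
```

Each extra depth loses one supervised position at the end of the sequence, because the target index would run past the last token.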
The DeepSeek-V3 paper picks the simplest answer to all five questions simultaneously, and then runs ablations to prove the choices were the right ones. The rest of this section is that loss, that schedule, and that ablation table — and what they imply for any team training a frontier model from scratch.
Intuition: A Stack of Teachers, One Optimizer Budget
Think of the trunk as a single student. The main loss is the headline exam: "given this sentence so far, what comes next?" The MTP heads are extra tutors standing behind the student, each asking a harder version of the same question. Tutor 1 asks "what comes two tokens from now?" Tutor 2 asks "what comes three tokens from now?" The tutors do not give the student new material — they only re-grade the same content from further away.
Two design decisions follow from this picture. First, the tutors share the student's budget. If you bring in twelve tutors, each one's feedback should count less individually, or the student spends all their effort pleasing tutors and forgets the headline exam. That is exactly what the $1/D$ term does. Second, you want the tutors loud at the start of training (when the student is still building intuition for the subject) and quiet at the end (when the student is polishing the headline exam answers). That is exactly what the $\lambda$ schedule does.
The Math: Per-Depth Cross-Entropy with a Shared Budget
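Fix a sequence of length $T$ with token ids $t_1, \dots, t_T$ and vocabulary size $V$. The trunk produces hidden states $h_1, \dots, h_T$; the main head turns each $h_i$ into a softmax distribution $p^{(0)}_i$ over the vocabulary. The main loss is the textbook next-token cross-entropy averaged over the sequence:

$$\mathcal{L}_{\text{main}} = -\frac{1}{T-1} \sum_{i=1}^{T-1} \log p^{(0)}_i(t_{i+1})$$

For depth $k \in \{1, \dots, D\}$ the sequential MTP architecture produces another set of distributions $p^{(k)}_i$ (see section 7.3 for how $p^{(k)}_i$ is computed). The depth-k cross-entropy has the same shape, but its target is shifted $k$ tokens further:

$$\mathcal{L}_k = -\frac{1}{T-k-1} \sum_{i=1}^{T-k-1} \log p^{(k)}_i(t_{i+1+k})$$

Notice the upper limit $T-k-1$: the depth-k head has no ground truth for the final $k$ positions that the main head can still score, so we drop them. At $T = 4096$ and $k = 4$ this costs us four positions out of four thousand, which is negligible.

The full training objective is the main loss plus a single global term that averages the MTP losses across depths and scales the whole thing by $\lambda$:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{main}} + \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}_k$$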
Three properties of this formula are worth slowing down on.
- Depth-normalised. Whatever the value of $D$, the auxiliary term equals $\lambda$ times the mean per-head loss, so its magnitude is bounded by $\lambda$ times the largest per-head loss. You can compare ablations across different depths without re-tuning $\lambda$.
- Decoupled from the architecture. The loss does not care how $p^{(k)}_i$ is computed: parallel heads, sequential heads, shared trunk, separate trunk. It is purely a statement about which token each head is graded against. This is why the same shifted-target cross-entropy appears both in the DeepSeek-V3 paper and in earlier MTP work such as Gloeckle et al. 2024.
- Recovers the baseline. Setting $\lambda = 0$ reduces the formula to plain next-token training. So "MTP off" is literally one knob away, with no architectural surgery needed for an ablation.
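The first and third properties are easy to check numerically. The sketch below assumes the per-head losses are already available as plain floats; the function name `mtp_total_loss` and the loss values are made up for illustration.

```python
def mtp_total_loss(main_loss, depth_losses, lam):
    """Main loss plus (lam / D) * sum of per-depth losses, with D = len(depth_losses)."""
    depth = len(depth_losses)
    if depth == 0 or lam == 0.0:
        return main_loss  # "MTP off": plain next-token training
    aux = lam * sum(depth_losses) / depth
    return main_loss + aux


# Illustrative numbers only: every head reports the same raw loss.
one_head = mtp_total_loss(2.0, [3.0], lam=0.3)
four_heads = mtp_total_loss(2.0, [3.0] * 4, lam=0.3)
mtp_off = mtp_total_loss(2.0, [3.0, 3.5], lam=0.0)
print(one_head, four_heads, mtp_off)
```

With equal per-head losses the total is the same for one head and for four, which is the depth normalisation at work, and `lam=0.0` returns the main loss untouched.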
The schedule for $\lambda$ in DeepSeek-V3 is a single step function of the number of training tokens seen, $n$:

$$\lambda(n) = \begin{cases} 0.3 & n \le 10\text{T tokens} \\ 0.1 & 10\text{T} < n \le 14.8\text{T tokens} \end{cases}$$

That is, for the first ~67 % of pretraining tokens the MTP heads carry their full weight; after that the coefficient drops by 3×. Earlier MTP work (Gloeckle et al.) reported similar findings with a simpler constant weighting. DeepSeek's schedule is a small refinement that buys a little extra final quality on the main loss without sacrificing the speculative-decoding gains we will see in section 7.5.
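As a sketch, the step schedule is a one-line function of tokens seen. The function and parameter names here are assumptions for illustration; the default cutover of 10T tokens out of 14.8T follows the DeepSeek-V3 report and corresponds to the ~67 % mark mentioned above.

```python
def mtp_lambda(tokens_seen, cutover_tokens=10e12, lam_early=0.3, lam_late=0.1):
    """Step schedule for the MTP weight: lam_early before the cutover, lam_late after."""
    return lam_early if tokens_seen < cutover_tokens else lam_late


total_tokens = 14.8e12
for frac in (0.0, 0.5, 0.7, 1.0):
    print(f"{frac:.0%} of training: lambda = {mtp_lambda(frac * total_tokens)}")
```

Expressing the cutover in tokens rather than optimizer steps keeps the schedule meaningful if the batch size changes during training.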
Manual Numerical Walkthrough
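Step-by-step: computing $\mathcal{L}_{\text{total}}$ for a 4-token toy with $D = 2$ and $\lambda = 0.3$
Take a vocabulary of size $V = 3$ (tokens A, B, C with ids 0, 1, 2) and a 4-token sequence whose targets are $t_2 = \text{B}$, $t_3 = \text{C}$, $t_4 = \text{B}$. The first token $t_1$ is only ever an input, never a target, so its identity does not enter the loss. We will compute one round of MTP loss with two extra heads ($D = 2$) and $\lambda = 0.3$.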
Step 1 — main head logits and loss. Suppose the trunk assigns these probabilities at positions 1, 2, 3:
| position i | predict t_{i+1} | p(A) | p(B) | p(C) | -log p_correct |
|---|---|---|---|---|---|
| 1 | B (id=1) | 0.20 | 0.70 | 0.10 | 0.357 |
| 2 | C (id=2) | 0.30 | 0.20 | 0.50 | 0.693 |
| 3 | B (id=1) | 0.10 | 0.60 | 0.30 | 0.511 |
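$\mathcal{L}_{\text{main}} = (0.357 + 0.693 + 0.511) / 3 \approx 0.520$ nats.
Step 2 — depth-1 MTP head loss. The depth-1 head predicts $t_{i+2}$. Only positions $i \in \{1, 2\}$ have ground truth (position 3 would predict $t_5$, which does not exist).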
| position i | predict t_{i+2} | p(A) | p(B) | p(C) | -log p_correct |
|---|---|---|---|---|---|
| 1 | C (id=2) | 0.30 | 0.30 | 0.40 | 0.916 |
| 2 | B (id=1) | 0.20 | 0.55 | 0.25 | 0.598 |
$\mathcal{L}_1 = (0.916 + 0.598) / 2 \approx 0.757$ nats. Higher than the main loss, as expected: predicting two tokens ahead is genuinely harder.
Step 3 — depth-2 MTP head loss. The depth-2 head predicts $t_{i+3}$. Only position 1 has a ground truth.
| position i | predict t_{i+3} | p(A) | p(B) | p(C) | -log p_correct |
|---|---|---|---|---|---|
| 1 | B (id=1) | 0.30 | 0.40 | 0.30 | 0.916 |
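$\mathcal{L}_2 = 0.916$ nats.
Step 4 — combine. With $D = 2$ and $\lambda = 0.3$:
$\mathcal{L}_{\text{total}} = 0.520 + \frac{0.3}{2}(0.757 + 0.916) = 0.520 + 0.251 \approx 0.771$ nats.
The MTP contribution is about 32.5 % of the total, a sizeable share early in training, when the trunk needs the future-token signal most. After the schedule cutover ($\lambda = 0.1$), the same raw losses give $0.520 + 0.084 \approx 0.604$ nats, with MTP at about 13.9 % of the total: visibly quieter.
Try this with $D = 4$: the depth-3 and depth-4 raw losses would be higher still, but each head's weight would be $\lambda / D = 0.075$, not 0.15, so the auxiliary contribution would actually shrink per head. The depth normalisation is doing its job: adding heads does not inflate the gradient budget.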
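The arithmetic of the walkthrough can be checked with the standard library alone. This is a minimal sketch that copies the correct-token probabilities from the three tables above; the variable and function names are illustrative.

```python
import math

# Probability each head assigned to the correct token, read off the tables.
p_correct = {
    0: [0.70, 0.50, 0.60],  # main head, positions 1, 2, 3
    1: [0.40, 0.55],        # depth-1 head, positions 1, 2
    2: [0.40],              # depth-2 head, position 1
}


def mean_nll(probs):
    """Mean negative log-likelihood in nats."""
    return sum(-math.log(p) for p in probs) / len(probs)


def walkthrough(lam):
    main = mean_nll(p_correct[0])
    depth_losses = [mean_nll(p_correct[k]) for k in (1, 2)]
    aux = lam * sum(depth_losses) / len(depth_losses)  # (lambda / D) * sum
    return main, aux, main + aux


for lam in (0.3, 0.1):
    main, aux, total = walkthrough(lam)
    print(f"lambda={lam}: main={main:.3f} aux={aux:.3f} "
          f"total={total:.3f} MTP share={aux / total:.1%}")
```

Changing a probability in `p_correct` and rerunning is a quick way to see how much a single badly predicted position moves the total at each value of the weight.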
Visualizing the Loss Budget and λ-Schedule
Slide through the controls below. Increase $D$ from 1 to 4 and watch the per-head raw losses climb (deeper heads see harder problems) while the per-head weighted contributions stay bounded; that is the $1/D$ term at work. Then toggle between the constant-0.3, constant-0.1, and DeepSeek-V3 anneal schedules and slide the training-progress bar; the orange curve on the right shows $\lambda$ over training, and the blue marker locks the loss budget on the left to the current $\lambda$.
Two patterns to confirm with the slider. First: switching the schedule from constant 0.3 to the 0.3 → 0.1 anneal changes nothing for the first 67 % of training; the loss budget is identical. The anneal's effect lives entirely in the last third. Second: setting λ = 0 (MTP off) with any $D$ recovers a single sky-blue bar, the standard next-token loss. The architecture stays in the forward pass, but the gradient lanes to the MTP heads are silent. That is exactly the configuration of an "MTP-off" ablation: same compute graph, zero auxiliary supervision.
Plain Python: One Sequence by Hand
Before the PyTorch version, here is the same loss written as a tight NumPy loop: no autograd, no batching, just the math on a single toy sequence. The point of showing this first is to see that the "MTP loss" is nothing exotic. It is the standard cross-entropy applied to shifted targets, summed across heads, averaged with $1/D$, and weighted by $\lambda$. Every loop iteration corresponds directly to one term of the formula above.
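A minimal sketch of such a loop follows. It is a reconstruction under stated assumptions rather than a canonical listing: random logits from a seeded generator stand in for model outputs, the toy sizes (6 tokens, vocabulary of 5, two MTP heads) are arbitrary, and the name `mtp_loss_numpy` is hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mtp_loss_numpy(tokens, head_logits, lam):
    """tokens: (T,) int ids. head_logits[k]: (T, V) logits of the depth-k head,
    with k = 0 the main head. Returns (main, per_depth, total)."""
    T = len(tokens)
    losses = []
    for k, logits in enumerate(head_logits):
        offset = k + 1                      # head k at position i is graded on t_{i+1+k}
        logp = log_softmax(logits)
        nll = [-logp[i, tokens[i + offset]] for i in range(T - offset)]
        losses.append(float(np.mean(nll)))  # mean over supervised positions only
    main, per_depth = losses[0], losses[1:]
    aux = lam * sum(per_depth) / len(per_depth) if per_depth else 0.0
    return main, per_depth, main + aux


rng = np.random.default_rng(0)              # assumed seed
T, V, D = 6, 5, 2                           # assumed toy sizes
tokens = rng.integers(0, V, size=T)
head_logits = [rng.normal(size=(T, V)) for _ in range(D + 1)]
main, per_depth, total = mtp_loss_numpy(tokens, head_logits, lam=0.3)
print("main:", main, "per-depth:", per_depth, "total:", total)
```

Because the logits are random, the absolute values carry no meaning; the sketch only demonstrates the indexing and the weighting.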
Output (with the default rng seed of 0): the script prints $\mathcal{L}_{\text{main}}$, the per-depth MTP losses, and $\mathcal{L}_{\text{total}}$. Random logits give very high cross-entropy values, so only the structure is meaningful here: the auxiliary term contributes exactly $\lambda$ times the mean of the per-depth losses.
PyTorch: The Production MTP Loss
The production version is the same formula wrapped in an `nn.Module` so it composes cleanly with the trainer, batches across sequences, masks padding via `ignore_index`, and owns the $\lambda$ schedule. The line count is short; what makes it production-worthy is the careful target shifting per depth and the per-depth loss returned alongside the total, both of which the ablation table requires.
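A hedged sketch of such a module is shown below. It is not DeepSeek's implementation: the class name `MTPLoss`, the dictionary keys, and the `progress` argument (fraction of training completed) are assumptions made for illustration, and the 0.67 cutover simply mirrors the schedule described earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTPLoss(nn.Module):
    """Main cross-entropy plus (lambda / D) * sum of per-depth cross-entropies."""

    def __init__(self, lam_early=0.3, lam_late=0.1, cutover_frac=0.67, ignore_index=-100):
        super().__init__()
        self.lam_early = lam_early
        self.lam_late = lam_late
        self.cutover_frac = cutover_frac
        self.ignore_index = ignore_index

    def lam(self, progress):
        # Step schedule: full weight early, reduced weight after the cutover.
        return self.lam_early if progress < self.cutover_frac else self.lam_late

    def forward(self, logits_per_depth, tokens, progress):
        # logits_per_depth[k]: (B, T, V) logits of the depth-k head; k = 0 is the main head.
        # tokens: (B, T) token ids, with padding already set to ignore_index.
        per_depth = []
        for k, logits in enumerate(logits_per_depth):
            offset = k + 1
            pred = logits[:, :-offset, :].float()  # drop positions with no target; reduce in FP32
            target = tokens[:, offset:]            # head k at position i is graded on t_{i+1+k}
            per_depth.append(F.cross_entropy(
                pred.reshape(-1, pred.size(-1)),
                target.reshape(-1),
                ignore_index=self.ignore_index,
            ))
        main, mtp = per_depth[0], per_depth[1:]
        lam = self.lam(progress)
        aux = lam * torch.stack(mtp).mean() if mtp else main.new_zeros(())
        return {"loss": main + aux, "loss_main": main, "loss_mtp": mtp, "lambda": lam}


# Usage with random stand-in logits (one main head plus D = 2 MTP heads).
torch.manual_seed(0)
B, T, V = 2, 8, 11
tokens = torch.randint(0, V, (B, T))
logits = [torch.randn(B, T, V, requires_grad=True) for _ in range(3)]
out = MTPLoss()(logits, tokens, progress=0.1)
out["loss"].backward()
print(out["loss"].item(), out["loss_main"].item(), [x.item() for x in out["loss_mtp"]])
```

Returning the main loss and the per-depth losses separately keeps validation reporting and per-depth logging independent of whatever weight the schedule currently applies.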
Ablations: What DeepSeek Actually Measured
Three families of ablation appear in the DeepSeek-V3 technical report and the related MTP literature. Each one isolates a single design choice we just argued for and shows what happens when it is removed or perturbed. The numbers below track the published values; absolute magnitudes vary by model size and dataset, but the direction of every comparison is consistent across reported runs.
Ablation 1 — MTP on vs MTP off
The headline comparison: train two identical models for the same token budget, same data, same schedule. One uses $\lambda = 0$ (vanilla next-token); the other uses the DeepSeek-V3 schedule with $D = 1$.
| Metric | MTP off (λ = 0) | MTP on (D = 1, anneal) | Effect |
|---|---|---|---|
| Main next-token loss (final) | baseline | slightly lower | + small but consistent |
| HumanEval pass@1 | baseline | +1.4–2 pts | + |
| MMLU 5-shot | baseline | +0.4–0.7 pts | + small |
| GSM8K 8-shot CoT | baseline | +1.5–2 pts | + |
| Training compute overhead | 0% | ≈ 0.9% extra FLOPs | negligible |
| Inference acceptance rate | n/a (single head) | 85–90% second-token accept | → 1.8× decode speedup |
The cost is under one percent extra compute; the downstream-quality gains are small in absolute terms but consistent across benchmarks. The real payoff is the last row: those same MTP heads, trained free as a side effect, become a built-in draft model for speculative decoding at inference. That part of the story is section 7.5.
Ablation 2 — Depth
How many MTP heads is the right number? More heads predict further into the future, which is intrinsically harder, so per-head losses rise. But each head adds compute and parameters in the forward pass and complicates the inference draft chain. The comparison below covers $D \in \{1, 2, 4\}$:
| Depth | Extra params | Extra FLOPs (fwd+bwd) | Δ downstream | Inference draft speedup |
|---|---|---|---|---|
| D = 1 | 1 small module | ≈ 0.9% | best per-param | 1.8× (one extra token) |
| D = 2 | 2 modules | ≈ 1.8% | ~equal to D=1, slightly noisier | 2.0–2.2× (two extra) |
| D = 4 | 4 modules | ≈ 3.6% | marginal further gain on long-form gen | 2.4–2.6× peak |
DeepSeek-V3 ships with $D = 1$. The reasoning is the per-FLOP curve: the second MTP head buys some extra speculative-decoding speed but offers diminishing downstream gains, and each additional head makes the inference verifier chain longer (which we will see in section 7.5 hurts acceptance in the worst case). Earlier MTP work (Gloeckle et al. 2024) reports similar diminishing-returns behaviour with their parallel-heads variant.
Ablation 3 — λ schedule
Holding $D$ fixed, what does the auxiliary weight $\lambda$ actually do?
| λ schedule | Final L_main | Downstream avg | Notes |
|---|---|---|---|
| λ = 0 (off) | baseline | baseline | no speculative decoding |
| λ = 0.1 constant | ≈ baseline | + small | safe; tutors are quiet throughout |
| λ = 0.3 constant | slightly higher than baseline | + small | tutors stay loud; main exam suffers a bit |
| λ = 0.3 → 0.1 (DeepSeek-V3) | lowest of all four | best of all four | shipped configuration |
| λ = 1.0 constant | noticeably higher | +/- | tutors drown out the headline exam |
Two clean signals come out of this table. First, "a little λ is much better than none": even constant $\lambda = 0.1$ beats the baseline. Second, "turn it down later": the anneal squeezes out an extra hair of main-loss quality by letting the trunk focus on the headline exam once it has matured. A third signal, less clean but very real: $\lambda = 1.0$ is harmful. The tutors should never be louder than the headline exam.
One-line takeaway from the ablations. The right configuration is boringly small: $D = 1$, $\lambda$ no larger than 0.3, annealed to 0.1. Larger choices give diminishing returns and eventually hurt. MTP is a quiet, well-behaved auxiliary, not a second loss function fighting the first.
What Changes at Massive Scale
At toy size — the 6-token NumPy example above — the MTP objective is a couple of extra cross-entropies. At 671 B parameters, 14.8 T tokens, and thousands of GPUs, a few things have to be handled with care.
Compute and memory cost is exactly D times one head's cost
Each MTP module is a transformer block plus a small linear layer and an embedding projection, with the cost dominated by far by the attention/FFN compute. At $D = 1$ it adds one extra block-worth of FLOPs per forward pass: roughly $1/L$ of the trunk cost, where $L$ is the number of trunk layers. That is the source of the ~0.9 % overhead quoted in the ablations table. The backward pass adds proportionally. Activation memory for the extra block is reclaimed by the same checkpointing pass used for trunk layers; no new accounting is required.
The shared embedding and output head
DeepSeek-V3's MTP modules share the trunk's token embedding and output projection; only the transformer block and the small projection layer between them are unique per depth. This decision shapes the loss in a subtle way: the embedding table and output head receive both the main gradient and the MTP gradients (scaled by $\lambda / D$). In FP32, this is a simple add; in mixed-precision training (next chapter) it requires that the accumulator for these shared parameters stays in FP32 even though the heads themselves can be FP8. We will return to this in chapter 10.
Gradient noise and convergence
Adding the MTP loss makes the overall gradient slightly higher-variance, because each MTP head is supervising fewer positions (the depth-k head supervises $T-k-1$ positions, not $T-1$). At $T = 4096$ this is a 0.025 % effect at $D = 1$ and entirely lost in the noise. The optimizer hyperparameters (Adam β, weight decay, LR schedule) do not need to be re-tuned when MTP is added; DeepSeek-V3 reports using exactly the same trainer config as an internal MTP-off baseline.
Pipeline parallelism placement
Where in the pipeline do the MTP modules live? They are downstream of the final trunk layer, which means they belong on the same pipeline rank as the final output head. In chapter 11 we will see that DualPipe puts the output head and MTP modules on the same micro-batch boundary so that the auxiliary backward pass overlaps with the main backward pass and the all-reduce of the embedding gradient — no new bubble is introduced. This is one of the quiet reasons the 0.9 % FLOPs overhead does not translate into a 0.9 % wall-clock overhead: most of it hides inside a bubble that was already there.
Engineering Reality and Gotchas
- Target shifting is the most common bug. Off-by-one errors in the depth-k target shift produce a head that learns the identity (copying the input) and reports a suspiciously low loss. Always assert that the shifted logits and targets line up, and log per-depth losses from step 0; if depth-1 loss is lower than depth-0 loss, you have shifted the wrong direction.
- Padding interacts with the shift. `ignore_index` works on the flattened target, but if your sequence packing puts a padding token inside the look-ahead window of a depth-k head, that head can still see it. The safest pattern is to apply the shift to both the targets and a padding mask, then set the target to `ignore_index` at any position the mask marks as padded.
- Loss accumulation precision. Even when the forward pass is BF16 or FP8, the cross-entropy reduction (the mean over T·B positions) should accumulate in FP32. PyTorch's `F.cross_entropy` does this internally for non-FP8 dtypes; with explicit FP8 you must cast logits to BF16 before the softmax. The few extra bits matter when $\lambda / D$ is already shrinking the auxiliary term.
- The λ schedule and the LR schedule are independent. Do not tie the $\lambda$ cutover to the LR decay. The optimizer schedule reacts to validation curves and warmup behaviour; the $\lambda$ cutover is a curriculum decision. Coupling them creates an opaque hyperparameter that no ablation can isolate.
- Disable MTP for eval. When you compute validation loss, you want to report the main next-token loss only, since that is the comparable metric across runs and against the literature. Passing a separate evaluation flag to the loss module (or just reading the main-loss entry out of the return dict) keeps comparisons honest.
- The MTP heads are kept for inference. A naive read of "auxiliary loss" suggests the heads are thrown away after training. They are not — section 7.5 turns them into a speculative-decoding draft model and recovers a 1.8–2.0× decode speedup. The training objective above is what makes them good draft models. The fact that the same module serves both the training objective and the inference acceleration is the deepest reason MTP is worth its modest compute overhead.
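The first two gotchas can be guarded against with a small helper that shifts targets and padding mask together. This is a sketch in plain Python under stated assumptions: the function name is hypothetical, the sentinel -100 matches the default `ignore_index` of PyTorch's cross-entropy, and a position is treated as supervised only when both its input token and its target token are real.

```python
IGNORE = -100  # default ignore_index of PyTorch's cross_entropy


def shifted_targets_with_mask(tokens, is_real, depth, ignore=IGNORE):
    """Targets for the head at `depth`, with padded positions replaced by `ignore`.

    tokens, is_real: equal-length lists; is_real[j] is False where tokens[j] is padding.
    depth 0 is the main head; depth k is graded on the token k steps further ahead.
    """
    assert len(tokens) == len(is_real), "mask must cover every token"
    offset = depth + 1
    targets = []
    for i in range(len(tokens) - offset):
        j = i + offset  # index of the token this position is graded on
        targets.append(tokens[j] if is_real[i] and is_real[j] else ignore)
    # Guard against off-by-one shifts: one target per supervised position.
    assert len(targets) == max(len(tokens) - offset, 0)
    return targets


# Hypothetical packed sequence: four real tokens followed by two pads (id 0).
tokens = [5, 6, 7, 8, 0, 0]
is_real = [True, True, True, True, False, False]
for k in range(3):
    print(f"depth {k}:", shifted_targets_with_mask(tokens, is_real, k))
```

The same logic carries over to batched tensors by slicing the mask with the same offsets as the targets.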
End of the MTP training story. Architecture (7.3) gave us a sequential causal stack. This section gave us the loss that trains it: a depth-normalised, anneal-weighted sum of per-depth cross-entropies. Ablations confirmed the choices were not arbitrary — they sit on the corner of a diminishing-returns frontier. The next section, 7.5, cashes in the second dividend: those same heads, at inference time, become a tightly-coupled draft model and unlock speculative decoding without a second model to maintain.