Section 8.4 picked the proportions of the training mix and section 8.5 added synthetic data to fix the corpora that were too small or too sparse. Both treated the 14.8T-token budget as a single pool — every token had the same probability of being seen at every moment in the run. That implicit assumption — that ordering does not matter — is wrong. The order in which the model sees its tokens changes the loss curve, the final benchmark scores, and the capabilities the model acquires, by margins as large as a generation of architectural improvement. A training curriculum is the explicit schedule that tells you: which mixture is active when, which sequence length is in force when, and which corpora get to dominate the final, most-influential gradient updates.
The thesis of this section. Pretraining is not a single uniform process — it is a four-act play. Act 1 (warmup) settles the optimizer on broad web text. Act 2 (main) burns the bulk of the budget on the canonical mixture. Act 3 (reasoning ramp) doubles context length and upweights math and code. Act 4 (annealing) collapses the learning rate, jumps to 32K context, and feeds the model only the highest-quality corpora. The same 14.8T tokens, served in this order, produce a measurably better model than the same tokens shuffled IID. That delta is the curriculum.
The Real Problem: Order Is a Hyperparameter
The default assumption in machine learning is that training data should be sampled IID — independent and identically distributed across the whole run. SGD's convergence proofs assume it. PyTorch's default DataLoader implements it. And for a 100M-parameter model trained on 10B tokens, this assumption is fine — the model is far from saturation, every gradient is roughly equally informative, and a global shuffle gets you 99% of the way to the best achievable loss.
Four things break this assumption once models reach billions of parameters and trillions of training tokens:
| Problem | What goes wrong with a flat, shuffled, single-mixture schedule |
|---|---|
| Optimizer fragility at start | An untrained transformer with random weights is a chaotic function. Feeding it dense math/code from step 0 produces enormous initial losses, gradient spikes, and a high probability of divergence in the first few thousand steps. |
| Capability emergence requires concentration | Math, code, and reasoning skills are step-function-acquired. They need to be present at high density during the late, low-LR portion of training where the model is consolidating, not diluted across the whole run. |
| Context length is expensive everywhere | Attention is O(seq_len²) per sequence. Training the whole run at 32K context costs 64× more attention compute per sequence (8× per token) than 4K context. Most of the run does not need the long context; only the final phase does. |
| Annealing locks in behaviour | The last 10% of training, where the LR collapses to near zero, is where the model's final 'voice' settles. Tokens fed during this window have outsized influence — they MUST be the highest-quality slice you have. |
The flat-schedule approach respects none of these. A curriculum respects all four, and the gain shows up as a 0.05–0.15 nat improvement in pretraining loss and a 5–15-point lift on reasoning benchmarks — at the cost of a few hundred lines of sampler code and a more careful LR schedule.
Intuition: A Curriculum, Not a Shuffle
Think of a student learning calculus. You do not hand them a randomly shuffled deck containing limits, derivatives, integrals, vector analysis, and tensor calculus on day one. You start with the easiest idea — slopes — let the brain settle, then introduce the next idea on top of the established foundation. The order matters because each concept assumes the previous one. Knowledge stacks, and the bottom of the stack must be solid before the top is placed.
Transformer pretraining is structurally similar. The bottom of the stack is "how language works" — tokenization patterns, basic syntax, common word co-occurrences. The top of the stack is "multi-step formal reasoning across a 30-page proof." If you try to teach the top while the bottom is still wobbly, the optimizer flounders and the high-value tokens are wasted. The curriculum builds the bottom in the warmup + main phases, then puts the top on once the foundation can hold it.
A second mental model: recency bias of the optimizer. AdamW with a cosine LR decay weights its updates more heavily at the start (high LR, big steps) and barely at all near the end (LR ≈ 0, near-zero steps). But there is a dual effect — the information a low-LR update encodes is the one most likely to survive into the final weights, because nothing later will overwrite it. The early phase sets the model's prior; the annealing phase sets the model's "voice." The curriculum chooses what voice you want by choosing what data is present when the LR is near zero.
The Mathematics of a Schedule
Let $t \in [0, 1]$ be normalised training progress (fraction of total tokens consumed). A curriculum is a pair of functions: a mixture schedule over the $K$ domains,

$$w : [0, 1] \to \Delta^{K-1}, \qquad \sum_{d=1}^{K} w_d(t) = 1$$

and a sequence-length schedule:

$$L : [0, 1] \to \mathbb{N}$$

The simplest non-trivial form is a step function: divide $[0, 1]$ into $P$ phases, each with a constant mixture $w^{(p)}$ and a constant length $L^{(p)}$. This is what DeepSeek-V3, Llama-3, and MiniCPM all use in practice; four phases is the sweet spot of expressive-enough but simple-enough.
Expected tokens per domain under a curriculum
Under a static mixture, the expected number of tokens drawn from domain $d$ is $w_d B$, where $B$ is the total token budget. Under a curriculum, the integral generalises:

$$\mathbb{E}[T_d] = B \int_0^1 w_d(t)\,dt = B \sum_{p=1}^{P} s_p\, w_d^{(p)}$$

where $s_p$ is the share of the budget allocated to phase $p$ and $w_d^{(p)}$ is the weight of domain $d$ in that phase. The integral collapses to a weighted sum because the step function is piecewise constant.
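The weighted sum is easy to check numerically. The sketch below is a minimal illustration, assuming a made-up two-phase, two-domain schedule (these numbers are not from this section); `expected_tokens` is an illustrative helper name.

```python
def expected_tokens(budget, shares, phase_weights):
    """Expected tokens per domain: budget * sum over phases of share * weight."""
    assert abs(sum(shares) - 1.0) < 1e-9, "phase shares must sum to 1"
    totals = [0.0] * len(phase_weights[0])
    for share, weights in zip(shares, phase_weights):
        assert abs(sum(weights) - 1.0) < 1e-9, "each mixture must sum to 1"
        for d, w in enumerate(weights):
            totals[d] += budget * share * w
    return totals


# Toy schedule: 80% of the budget at mixture (0.9, 0.1), 20% at (0.5, 0.5).
totals = expected_tokens(1000.0, [0.8, 0.2], [[0.9, 0.1], [0.5, 0.5]])
print(totals)       # per-domain totals
print(sum(totals))  # equals the budget, up to float round-off
```

Because every per-phase mixture sums to 1, the per-domain totals always add back up to the budget, which is the same balance property the hand calculation relies on.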
The annealing phase loss decomposition
Empirically, the loss the model carries out of training is not the time-averaged loss — it is dominated by the last phase. A clean decomposition (consistent with the MiniCPM ablation):
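$$\Delta\mathcal{L}_{\text{total}} = \Delta\mathcal{L}_{\text{pre-anneal}} + \Delta\mathcal{L}_{\text{anneal}}, \qquad \Delta\mathcal{L}_{\text{anneal}} \approx \alpha\, \Delta\mathcal{L}_{\text{total}}$$

with $\alpha \approx 0.7$ empirically: even though the anneal phase is only ~10% of the schedule, it accounts for roughly 70% of the final loss reduction. This is the formal statement of "the last 10% matters most" and is the reason the anneal mixture is so heavily skewed toward quality.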
Length warmup and the attention cost
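Self-attention is $O(L^2)$ per sequence in sequence length $L$, and therefore $O(L)$ per token. A run that trains end-to-end at $L = 32\text{K}$ costs $8\times$ the per-token attention compute of a run at $L = 4\text{K}$ ($64\times$ per sequence). A length-warmup curriculum exploits this:

$$C_{\text{attn}} \;\propto\; B \sum_{p=1}^{P} s_p\, L^{(p)}$$

where $s_p$ is the token share of phase $p$ and $L^{(p)}$ is its sequence length.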
For the canonical 4K / 4K / 8K / 32K schedule, this works out to a roughly 4× total attention-compute saving compared to running everything at 32K — for the same final model quality, because the long-context capability is acquired during the 10% anneal phase when it actually matters.
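The arithmetic behind that figure can be sketched in a few lines. This assumes per-token attention cost proportional to sequence length and a 70% / 20% / 10% split of tokens across 4K / 8K / 32K, which matches the description of the schedule in this section; the helper name is illustrative.

```python
def relative_attention_cost(shares, seq_lens):
    """Total attention compute up to a constant: sum of token share * seq_len."""
    assert abs(sum(shares) - 1.0) < 1e-9
    return sum(s * length for s, length in zip(shares, seq_lens))


# Assumed token shares: 70% at 4K (warmup + main), 20% at 8K, 10% at 32K.
curriculum = relative_attention_cost([0.70, 0.20, 0.10], [4096, 8192, 32768])
flat_32k = relative_attention_cost([1.0], [32768])

saving = flat_32k / curriculum
print(f"mean sequence length seen per token: {curriculum:.1f}")
print(f"attention-compute saving vs. all-32K: {saving:.2f}x")
```

Under these assumptions the ratio comes out a little above 4, in line with the "roughly 4×" quoted above. Only attention compute is counted; the MLP and projection FLOPs do not depend on sequence length per token.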
Manual Numerical Walkthrough
Let us pour the 14.8T-token budget through the four-phase schedule by hand. Same five domains as section 8.4 (web 12 000 B, code 600 B, math 150 B, books 300 B, wiki 50 B), but the mixture is now time-varying.
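Setup. Total budget $B = 14\,800$ B. Four phases with shares $s = (0.05,\ 0.65,\ 0.20,\ 0.10)$. So the per-phase budget is:
- Warmup: $0.05 \times 14\,800 = 740$ B tokens
- Main: $0.65 \times 14\,800 = 9\,620$ B tokens
- Reasoning ramp: $0.20 \times 14\,800 = 2\,960$ B tokens
- Anneal: $0.10 \times 14\,800 = 1\,480$ B tokens
Step 2. Accumulate per-domain tokens with the phase-weighted sum. For math ($w_{\text{math}} = 0.02,\ 0.06,\ 0.18,\ 0.25$ across the four phases):
- Warmup contribution: $0.02 \times 740 = 14.8$ B
- Main contribution: $0.06 \times 9\,620 = 577.2$ B
- Reasoning contribution: $0.18 \times 2\,960 = 532.8$ B
- Anneal contribution: $0.25 \times 1\,480 = 370.0$ B
- Total math tokens: $14.8 + 577.2 + 532.8 + 370.0 = 1\,494.8$ B
Step 3. Convert to effective epochs. Math raw corpus is 150 B, so $1\,494.8 / 150 \approx 9.97$ epochs. Compare to the static DeepSeek mixture in section 8.4, which gave 7.89×. The curriculum's math coverage is 26% higher, but it is delivered concentrated in the reasoning ramp and anneal phases, exactly where it propagates into the final weights.
Step 4. Repeat for web. $T_{\text{web}} = 8\,036.4$ B. Effective epochs: $8\,036.4 / 12\,000 \approx 0.67$. Less than one pass: the model sees only 67% of the available web tokens, which is the "safe" zone where no web document has to be repeated.
Step 5. Sanity check: the per-domain totals must sum to $B$. Working out the rest: code = 2 800.4 B, books = 1 700 B, wiki = 768.4 B. $8\,036.4 + 2\,800.4 + 1\,494.8 + 1\,700 + 768.4 = 14\,800$ B, which equals the total budget. The curriculum balances exactly because every per-phase weight vector sums to 1.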
Step 6 — the anneal-phase dominance check. The anneal phase serves 1 480 B tokens. Under MiniCPM's rule, those 1 480 B tokens contribute roughly 70% of the final loss reduction — even though they are 10% of the run. The anneal mixture is uniform-ish (entropy ≈ 2.26 bits out of a max of 2.32), which is by design: late-stage training balances all capabilities rather than doubling down on any single one.
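The hand calculation for math and the balance check can be replayed in code. This is a sketch under stated assumptions: phase shares of 5% / 65% / 20% / 10% and math weights of 2% / 6% / 18% / 25% are chosen to be consistent with the figures quoted in this walkthrough (a 1 480 B anneal budget, about 1.5T math tokens, 26% more math than the static mix); the code, books, and wiki totals are copied from Step 5.

```python
BUDGET_B = 14_800  # total budget, in billions of tokens

# Assumed phase shares and math weights (see the lead-in).
shares = {"warmup": 0.05, "main": 0.65, "reasoning": 0.20, "anneal": 0.10}
math_weight = {"warmup": 0.02, "main": 0.06, "reasoning": 0.18, "anneal": 0.25}

phase_budget = {name: BUDGET_B * share for name, share in shares.items()}
math_tokens = sum(phase_budget[name] * math_weight[name] for name in shares)
math_epochs = math_tokens / 150  # math raw corpus is 150 B

# Balance check: whatever the other four domains do not use is web.
quoted_totals = {"code": 2800.4, "books": 1700.0, "wiki": 768.4}
web_tokens = BUDGET_B - math_tokens - sum(quoted_totals.values())
web_epochs = web_tokens / 12_000  # web raw corpus is 12 000 B

print(f"math: {math_tokens:.1f} B tokens, {math_epochs:.2f} epochs")
print(f"math vs. static 7.89x: {math_epochs / 7.89 - 1:+.0%}")
print(f"web:  {web_tokens:.1f} B tokens, {web_epochs:.2f} epochs")
```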
Visualizing the Curriculum
[Interactive figure: a scrubber over the 14.8T-token schedule. For each of the four phases it shows the active mixture, the sequence length (4K → 4K → 8K → 32K), and the cumulative tokens per domain, which fill at different rates depending on which phase is active.]
The scrubber forces three observations on you that flat-mixture thinking hides. First, math fills slowly through warmup and main (where its weight is 2–6%), then accelerates sharply in the reasoning and anneal phases (where its weight is 18–25%). The final math coverage is 1.5T tokens, but about 60% of that arrives in the last 30% of training. Second, the sequence-length escalation is concentrated in the back end: most of the run is at 4K. That is the 4× attention-compute saving from length warmup, visible directly in the schedule. Third, the wiki bar, barely visible in the early phases, closes most of its distance during anneal. The smallest corpora get their disproportionate boost exactly when the model can absorb it without diverging.
Plain Python: A Phase-Aware Sampler
Below is the complete curriculum mechanism in plain Python. The sampler is the same inverse-CDF walk you saw in section 8.4, but the weights vector is now a function of training step. The full schedule is four phases declared as a list of dicts; the phase lookup is a four-comparison walk; the per-step sampler costs one O(K) call.
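A minimal sketch of a sampler matching this description follows. The phase shares and mixture weights are illustrative assumptions (domain order web, code, math, books, wiki), the function names are hypothetical, and progress is measured in steps rather than tokens to keep the lookup simple.

```python
import random

DOMAINS = ["web", "code", "math", "books", "wiki"]

# The schedule is data: one row per phase (share, seq_len, weights).
# All numeric values here are illustrative, not a published recipe.
SCHEDULE = [
    {"name": "warmup", "share": 0.05, "seq_len": 4096,
     "weights": [0.90, 0.04, 0.02, 0.03, 0.01]},
    {"name": "main", "share": 0.65, "seq_len": 4096,
     "weights": [0.70, 0.15, 0.06, 0.07, 0.02]},
    {"name": "reasoning", "share": 0.20, "seq_len": 8192,
     "weights": [0.40, 0.27, 0.18, 0.10, 0.05]},
    {"name": "anneal", "share": 0.10, "seq_len": 32768,
     "weights": [0.20, 0.20, 0.25, 0.20, 0.15]},
]


def phase_at(step, total_steps, schedule=SCHEDULE):
    """Return the phase active at `step`. Depends on the step number only."""
    cumulative = 0.0
    for phase in schedule:
        cumulative += phase["share"]
        # Integer boundaries avoid float round-off at phase edges.
        if step < round(cumulative * total_steps):
            return phase
    return schedule[-1]


def sample_domain(weights, rng):
    """Inverse-CDF walk over the mixture: O(K) in the number of domains."""
    u = rng.random()
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if u < cumulative:
            return index
    return len(weights) - 1  # guard against round-off when u is close to 1


def sample(step, total_steps, seed=0):
    """Pick (domain, seq_len) for one step with a step-derived RNG."""
    phase = phase_at(step, total_steps)
    rng = random.Random(seed * 1_000_003 + step)  # no shared state
    return DOMAINS[sample_domain(phase["weights"], rng)], phase["seq_len"]


if __name__ == "__main__":
    total_steps = 1000
    for step in (0, 400, 800, 950):
        print(step, phase_at(step, total_steps)["name"], sample(step, total_steps))
```

Seeding the generator from the step number is one simple way to make a resumed worker reproduce the same draw; a production sampler would more likely carry RNG state in the checkpoint.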
Two ideas worth pinning. First, the schedule is data, not code. The four-phase table is a single 4×3 grid of primitives: share, weights, seq_len. Changing the curriculum is a YAML edit, not a code change; this is what lets frontier labs run dozens of curriculum-search experiments on a fixed model and a fixed data pool. Second, the phase boundary lookup is deterministic in step only: no shared state, no race conditions across workers. If worker 3 crashes at step 412 781 and resumes, it reconstructs "I am in phase main" from the step number alone and serves the correct mixture on its very first sample.
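The accounting check. After simulating the full run, sum the per-domain tokens. The total must equal $N_{\text{steps}} \cdot \bar{L}$, where $\bar{L}$ is the share-weighted mean sequence length (taking one sequence per step and measuring phase shares in steps). For our schedule: $\bar{L} = 0.70 \times 4\,096 + 0.20 \times 8\,192 + 0.10 \times 32\,768 = 7\,782.4$ tokens per step. If your counter says otherwise, the bug is almost always a missed seq_len update at a phase boundary.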
PyTorch: Time-Varying MixtureSampler
The production version inherits the entire interface of section 8.4's MixtureSampler — same IterableDataset, same DataLoader, same per-worker independence — and adds one new responsibility: re-select the mixture from the current phase before each multinomial draw. The shards on disk are packed at the longest seq_len (32K) so a single pre-tokenisation serves all four phases.
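One way such a sampler might look is sketched below. Section 8.4's MixtureSampler is not reproduced here, so this class stands alone: `CurriculumMixtureSampler` is a hypothetical name, the schedule values are illustrative, and instead of reading packed shards it simply yields (domain index, seq_len) pairs.

```python
import torch
from torch.utils.data import IterableDataset, get_worker_info

# Illustrative schedule (domain order: web, code, math, books, wiki).
SCHEDULE = [
    {"share": 0.05, "seq_len": 4096, "weights": [0.90, 0.04, 0.02, 0.03, 0.01]},
    {"share": 0.65, "seq_len": 4096, "weights": [0.70, 0.15, 0.06, 0.07, 0.02]},
    {"share": 0.20, "seq_len": 8192, "weights": [0.40, 0.27, 0.18, 0.10, 0.05]},
    {"share": 0.10, "seq_len": 32768, "weights": [0.20, 0.20, 0.25, 0.20, 0.15]},
]


class CurriculumMixtureSampler(IterableDataset):
    """Yields (domain_index, seq_len); a real pipeline would read a shard here."""

    def __init__(self, schedule, total_steps, start_step=0, seed=0):
        super().__init__()
        self.schedule = schedule
        self.total_steps = total_steps
        self.start_step = start_step  # set from the checkpoint on resume
        self.seed = seed
        # Integer step index at which each phase ends.
        self.phase_ends = []
        cumulative = 0.0
        for phase in schedule:
            cumulative += phase["share"]
            self.phase_ends.append(round(cumulative * total_steps))

    def phase_at(self, step):
        for end, phase in zip(self.phase_ends, self.schedule):
            if step < end:
                return phase
        return self.schedule[-1]

    def __iter__(self):
        info = get_worker_info()  # None when iterated in the main process
        worker_id = 0 if info is None else info.id
        num_workers = 1 if info is None else info.num_workers
        generator = torch.Generator()
        # Each worker takes every num_workers-th step, so workers never overlap.
        for step in range(self.start_step + worker_id, self.total_steps, num_workers):
            phase = self.phase_at(step)  # re-select the mixture before each draw
            generator.manual_seed(self.seed * 1_000_003 + step)
            weights = torch.tensor(phase["weights"], dtype=torch.float64)
            domain = int(torch.multinomial(weights, 1, generator=generator))
            yield domain, phase["seq_len"]


if __name__ == "__main__":
    dataset = CurriculumMixtureSampler(SCHEDULE, total_steps=200)
    print(list(dataset)[:5])
```

Reseeding on every step keeps resumption trivially reproducible but is not free; it is a choice made for clarity in this sketch. With shards packed at 32K, the shorter phases would slice each packed sequence down to the active `seq_len`.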
Three places this code touches the rest of the training stack:
- The LR schedule must agree with the phase schedule. A cosine LR that peaks at step 0 and decays smoothly to zero will undershoot the anneal phase — the LR is already small by then, and the model is not learning much from the highest-quality data. The standard fix is a two-segment LR: a long, gentle cosine across phases 1–3, then a sharp linear decay through phase 4. This way the anneal phase still gets non-trivial updates.
- The model wrapper must accept variable seq_len. The batch shape changes from (8, 4096) in main to (8, 32768) in anneal. RoPE accepts the longer input mechanically, but positions beyond the trained range need a rescaling strategy (YaRN, PI, NTK-aware scaling); learned absolute positional embeddings additionally need their table extended, while ALiBi is designed to extrapolate. DeepSeek-V3 uses YaRN at the phase-3→4 boundary.
- Checkpoints at phase boundaries are mandatory. A clean checkpoint right before each phase transition lets you (a) restart the run if the phase change causes a loss spike, (b) reuse phases 1–3 as the starting point for multiple phase-4 variants (anneal mixture search), (c) audit the per-phase contribution to capabilities afterwards. Skipping these checkpoints is the single most expensive mistake we have seen labs make.
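The two-segment LR schedule from the first bullet can be sketched as follows. The peak LR, the 30% floor reached at the start of anneal, and the function name are illustrative assumptions, and LR warmup is omitted for brevity.

```python
import math


def two_segment_lr(step, total_steps, peak_lr=3e-4, anneal_share=0.10, floor_frac=0.3):
    """Gentle cosine across phases 1-3, then linear decay to zero in phase 4.

    floor_frac is the fraction of peak_lr still in force when anneal begins,
    so the anneal phase starts with a non-trivial learning rate.
    """
    anneal_start = int(total_steps * (1.0 - anneal_share))
    floor_lr = peak_lr * floor_frac
    if step < anneal_start:
        # Cosine from peak_lr (step 0) down to floor_lr (start of anneal).
        frac = step / anneal_start
        return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * frac))
    # Linear from floor_lr down to exactly zero at the last step.
    frac = (step - anneal_start) / max(1, total_steps - anneal_start)
    return floor_lr * (1.0 - frac)


if __name__ == "__main__":
    for step in (0, 450, 899, 900, 950, 1000):
        print(step, f"{two_segment_lr(step, 1000):.2e}")
```

The two segments meet at `floor_lr`, so the schedule is continuous at the phase boundary. To use it with a PyTorch optimizer, one option is to wrap it as a multiplier for `torch.optim.lr_scheduler.LambdaLR` by dividing the returned value by `peak_lr`.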
At Massive Scale: Length Warmup and Annealing
At 671B parameters and 14.8T tokens, the curriculum is not an optimisation luxury; it is a structural prerequisite for the run to finish at all. Four constraints force its hand:
| Constraint | How the curriculum addresses it |
|---|---|
| GPU memory at long context | Materialised attention score matrices grow as L² per head. Training the whole run at 32K context would need ~64× the attention activation memory of 4K per sequence, far exceeding what FSDP+activation checkpointing can hide. The curriculum keeps L at 4K for 70% of the run, only paying the long-context cost in the final 10%. |
| Distributed I/O bandwidth | Different phases serve different effective sequence sizes, so per-GPU token throughput changes by 8×. The data pipeline must size shard reads, prefetch queues, and inter-node bandwidth for the WORST case (32K anneal) — overprovisioned during the main phase, fully utilised during anneal. |
| Long-context capability acquisition | DeepSeek uses YaRN positional scaling at the seq_len jumps. A naive jump from 4K to 32K without RoPE rescaling causes the loss to spike by 1–2 nats and never recover. The curriculum schedules the rescaling event explicitly between phase 3 and phase 4. |
| Eval-set contamination amplification | The anneal phase serves each high-quality doc 2–4× over its short duration. Any contamination of the anneal corpus is amplified exactly when the model is most receptive. Contamination scans (section 8.7) MUST re-run before anneal launches; the run-wide scan from week-1 is not sufficient. |
Real frontier runs add a fifth ingredient: data ordering within a phase. Even inside phase 2 (main), the schedule can order documents by a difficulty proxy — average per-token perplexity from a small reference model — and serve easier documents first. This is the classical "Bengio curriculum" idea, but applied at document granularity rather than corpus granularity. Reported gains are modest (1–2% benchmark lift) but free if you already have the proxy scores from your quality-filtering stage.
The annealing playbook
Among practitioners, the annealing phase has a fairly tight recipe as of 2025. The shape repeats across DeepSeek-V3, MiniCPM, Qwen-2.5, and the public Llama-3 technical report:
- Length jump first, then mixture change. Anneal starts with one or two checkpoints that just extend the context length while keeping the main-phase mixture. This isolates the length-extrapolation event so any loss spike is attributable.
- Linear LR decay to zero across the anneal duration. Not cosine — linear. Cosine's long tail near zero wastes anneal tokens; linear lets the model take real steps right up to the last batch.
- Mixture flattens toward uniform across domains. Anneal weights look like [0.20, 0.20, 0.25, 0.20, 0.15] — entropy near max. The model has already learned web; the marginal anneal token is most valuable when it is from a previously-underrepresented slice.
- Quality filter tightens by ~2 standard deviations. The anneal corpus is built by re-applying the section-8.3 quality score with a stricter cutoff. Only the cleanest 5–10% of each domain's tokens survives into anneal.
- Per-domain validation loss is checked every 50B tokens. If any domain's loss rises during anneal, halt and inspect — it means either the LR is too low (model is forgetting) or the mixture is poisoned (a contamination scan slipped through).
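The last check in the playbook is simple to automate. A minimal sketch, assuming validation losses are recorded per domain at each 50B-token checkpoint; the function name, the `tolerance` parameter, and the loss values are made up for illustration.

```python
def regressing_domains(history, tolerance=0.0):
    """Return domains whose latest validation loss rose versus the previous check.

    history maps a domain name to its list of validation losses, oldest first.
    """
    flagged = []
    for domain, losses in history.items():
        if len(losses) >= 2 and losses[-1] > losses[-2] + tolerance:
            flagged.append(domain)
    return sorted(flagged)


# Made-up losses from two consecutive checks during anneal.
history = {
    "web": [1.92, 1.91],
    "code": [1.10, 1.08],
    "math": [1.35, 1.37],  # rising: should trigger a halt
}

flagged = regressing_domains(history)
if flagged:
    print("halt and inspect:", flagged)
```

A small positive `tolerance` would avoid halting on evaluation noise; what counts as noise depends on the size of the validation set.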
Engineering Reality and Gotchas
Three real failure modes that have shipped to production runs and cost real money:
- Phase-boundary loss spikes. Every time the mixture changes, the loss jumps: sometimes 0.05 nats (recoverable), sometimes 0.5 nats (run-ending). The standard mitigation is a linear mixture interpolation over 1% of the schedule around each boundary: instead of switching abruptly from $w^{(p)}$ to $w^{(p+1)}$, blend $(1-\lambda)\,w^{(p)} + \lambda\,w^{(p+1)}$ with $\lambda$ ramping linearly 0 → 1. This adds ~10 lines to the sampler and prevents 90% of phase-boundary spikes.
- Length-jump RoPE divergence. A 4K→32K jump without YaRN scaling typically causes loss to rise from 1.8 to 3.5 nats inside 200 steps and stay there. The fix is to apply the YaRN factor at the EXACT batch where the seq_len changes; the model weights are otherwise untouched. Missing this is the most common way to lose a 70B run.
- Optimizer state interactions with seq_len changes. Adam's $v_t$ (second moment) is per-parameter, but the scale of typical gradients changes with seq_len because attention activations grow. The result: after a seq_len jump, the effective step size on attention weights changes by a length-dependent factor. DeepSeek mitigates by adding a one-time gradient renormalisation at each length boundary; some labs reset only the second-moment estimates for attention parameters. Either works; doing nothing does not.
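The linear mixture interpolation from the first failure mode can be sketched as below. The ramp is centred on the boundary and spans 1% of the schedule; the function name and the two example mixtures are illustrative assumptions.

```python
def blended_weights(progress, boundary, w_prev, w_next, ramp=0.01):
    """Blend two mixtures linearly over `ramp` of the schedule around `boundary`.

    progress and boundary are fractions of the run in [0, 1]. Before the ramp
    the result is w_prev, after it w_next, and in between a convex combination.
    """
    start = boundary - ramp / 2
    lam = min(1.0, max(0.0, (progress - start) / ramp))
    return [(1.0 - lam) * a + lam * b for a, b in zip(w_prev, w_next)]


# Illustrative main -> reasoning transition at 70% of the run.
w_main = [0.70, 0.15, 0.06, 0.07, 0.02]
w_reasoning = [0.40, 0.27, 0.18, 0.10, 0.05]

for progress in (0.690, 0.697, 0.700, 0.703, 0.710):
    mix = blended_weights(progress, 0.70, w_main, w_reasoning)
    print(progress, [round(w, 3) for w in mix])
```

Because the blend is a convex combination of two vectors that each sum to 1, the result is always a valid mixture, so it can be passed straight to the same inverse-CDF or multinomial draw.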
The one sentence to carry forward: a curriculum is the schedule that turns a fixed pool of training tokens into the maximum capability per dollar, and because the final 10% of that schedule (the anneal phase) is where most of the capability actually lands, every frontier lab now spends as much engineering effort tuning the four-phase schedule as it does on the underlying mixture itself.