Chapter 8

Training Data Curriculum

Data: The Invisible Foundation

Section 8.4 picked the proportions of the training mix and section 8.5 added synthetic data to fix the corpora that were too small or too sparse. Both treated the 14.8T-token budget as a single pool — every token had the same probability of being seen at every moment in the run. That implicit assumption — that ordering does not matter — is wrong. The order in which the model sees its tokens changes the loss curve, the final benchmark scores, and the capabilities the model acquires, by margins as large as a generation of architectural improvement. A training curriculum is the explicit schedule that tells you: which mixture is active when, which sequence length is in force when, and which corpora get to dominate the final, most-influential gradient updates.

The thesis of this section. Pretraining is not a single uniform process — it is a four-act play. Act 1 (warmup) settles the optimizer on broad web text. Act 2 (main) burns the bulk of the budget on the canonical mixture. Act 3 (reasoning ramp) doubles context length and upweights math and code. Act 4 (annealing) collapses the learning rate, jumps to 32K context, and feeds the model only the highest-quality corpora. The same 14.8T tokens, served in this order, produce a measurably better model than the same tokens shuffled IID. That delta is the curriculum.

The Real Problem: Order Is a Hyperparameter

The default assumption in machine learning is that training data should be sampled IID — independent and identically distributed across the whole run. SGD's convergence proofs assume it. PyTorch's default DataLoader implements it. And for a 100M-parameter model trained on 10B tokens, this assumption is fine — the model is far from saturation, every gradient is roughly equally informative, and a global shuffle gets you 99% of the way to the best achievable loss.

Four things break this assumption once models reach billions of parameters and trillions of training tokens:

| Problem | What goes wrong with a flat, shuffled, single-mixture schedule |
| --- | --- |
| Optimizer fragility at start | An untrained transformer with random weights is a chaotic function. Feeding it dense math/code from step 0 produces enormous initial losses, gradient spikes, and a high probability of divergence in the first few thousand steps. |
| Capability emergence requires concentration | Math, code, and reasoning skills tend to be acquired in sudden steps. They need to be present at high density during the late, low-LR portion of training where the model is consolidating, not diluted across the whole run. |
| Context length is expensive everywhere | Attention is O(seq_len²) per sequence. Training the whole run at 32K context costs 64× more attention compute per sequence (8× per token) than 4K context. Most of the run does not need the long context; only the final phase does. |
| Annealing locks in behaviour | The last 10% of training, where the LR collapses to near zero, is where the model's final 'voice' settles. Tokens fed during this window have outsized influence, so they must be the highest-quality slice you have. |

The flat-schedule approach respects none of these. A curriculum respects all four, and the gain shows up as a 0.05–0.15 nat improvement in pretraining loss and a 5–15-point lift on reasoning benchmarks — at the cost of a few hundred lines of sampler code and a more careful LR schedule.

What the literature shows. Curriculum learning has a three-decade arc (Elman 1993, Bengio et al. 2009) but only became unambiguously useful at frontier scale around 2022. The MiniCPM technical report (Hu et al., 2024) ablated the annealing phase and showed that the final 10% of training, when fed high-quality data, contributes more capability per token than the preceding 50%. DeepSeek-V3 adopted the same four-phase shape; Llama-3 and Qwen-2.5 ship comparable schedules. The pattern is now considered table stakes for any frontier-scale training run.

Intuition: A Curriculum, Not a Shuffle

Think of a student learning calculus. You do not hand them a randomly shuffled deck containing limits, derivatives, integrals, vector analysis, and tensor calculus on day one. You start with the easiest idea — slopes — let the brain settle, then introduce the next idea on top of the established foundation. The order matters because each concept assumes the previous one. Knowledge stacks, and the bottom of the stack must be solid before the top is placed.

Transformer pretraining is structurally similar. The bottom of the stack is "how language works" — tokenization patterns, basic syntax, common word co-occurrences. The top of the stack is "multi-step formal reasoning across a 30-page proof." If you try to teach the top while the bottom is still wobbly, the optimizer flounders and the high-value tokens are wasted. The curriculum builds the bottom in the warmup + main phases, then puts the top on once the foundation can hold it.

A second mental model: recency bias of the optimizer. AdamW with a cosine LR decay weights its updates more heavily at the start (high LR, big steps) and barely at all near the end (LR ≈ 0, near-zero steps). But there is a dual effect — the information a low-LR update encodes is the one most likely to survive into the final weights, because nothing later will overwrite it. The early phase sets the model's prior; the annealing phase sets the model's "voice." The curriculum chooses what voice you want by choosing what data is present when the LR is near zero.

The one-sentence mental model. A static mixture asks "what fraction of the budget should each domain receive?" A curriculum asks the strictly more powerful question "what fraction of the budget should each domain receive, at each moment of the run?" The static answer is a 5-vector. The curriculum answer is a function $p_i(t)$.

The Mathematics of a Schedule

Let $t \in [0, 1]$ be normalised training progress (fraction of total tokens consumed). A curriculum is a pair of functions:

$$p_i: [0,1] \to [0,1] \quad \text{with} \quad \sum_{i=1}^{K} p_i(t) = 1 \;\; \forall t$$

and a sequence-length schedule:

$$L: [0,1] \to \mathbb{Z}_+, \quad L(t) \text{ is the context length at progress } t.$$

The simplest non-trivial form is a step function: divide $[0,1]$ into $P$ phases, each with a constant mixture $p^{(j)}$ and a constant length $L^{(j)}$. This is what DeepSeek-V3, Llama-3, and MiniCPM all use in practice; four phases is the sweet spot between expressive enough and simple enough.

Expected tokens per domain under a curriculum

Under a static mixture, the expected number of tokens drawn from domain $i$ is $T_i = p_i \cdot T$. Under a curriculum, the integral generalises:

$$T_i = T \cdot \int_{0}^{1} p_i(t) \, dt = T \cdot \sum_{j=1}^{P} s^{(j)} \cdot p_i^{(j)}$$

where $s^{(j)}$ is the share of the budget allocated to phase $j$ and $p_i^{(j)}$ is the weight of domain $i$ in that phase. The integral collapses to a weighted sum because the step function is piecewise constant.
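As a minimal sketch of that weighted sum, assuming the five domains and the four-phase shares and weights used later in this section (the helper name `tokens_per_domain` is illustrative, not part of any library):

```python
# Expected tokens per domain under a piecewise-constant curriculum:
#   T_i = T * sum_j s_j * p_i^(j)
# Shares and weights are the four-phase values used in this section.

DOMAINS = ["web", "code", "math", "books", "wiki"]
TOTAL_B = 14_800  # total budget T, in billions of tokens

# (share of the budget s_j, per-domain weights p^(j)) for each phase
PHASES = [
    (0.05, [0.80, 0.05, 0.02, 0.10, 0.03]),  # warmup
    (0.65, [0.62, 0.17, 0.06, 0.10, 0.05]),  # main
    (0.20, [0.40, 0.22, 0.18, 0.12, 0.08]),  # reasoning ramp
    (0.10, [0.20, 0.20, 0.25, 0.20, 0.15]),  # anneal
]


def tokens_per_domain(total_b, phases):
    """Return T_i for every domain, in the same units as total_b."""
    n_domains = len(phases[0][1])
    totals = [0.0] * n_domains
    for share, weights in phases:
        for i, w in enumerate(weights):
            totals[i] += total_b * share * w
    return totals


expected_b = tokens_per_domain(TOTAL_B, PHASES)
for name, t in zip(DOMAINS, expected_b):
    print(f"{name:6s} {t:9.1f} B")
```

Because every weight vector and the share vector each sum to 1, the per-domain totals add back up to the full budget, which makes a convenient self-check.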

The annealing phase loss decomposition

Empirically, the loss the model carries out of training is not the time-averaged loss — it is dominated by the last phase. A clean decomposition (consistent with the MiniCPM ablation):

$$\mathcal{L}_{\text{final}}(\theta) \approx \alpha \cdot \mathcal{L}_{\text{anneal}}(\theta) + (1 - \alpha) \cdot \mathcal{L}_{\text{pre-anneal}}(\theta)$$

with $\alpha \approx 0.7$ empirically: even though the anneal phase is only ~10% of the schedule, it accounts for roughly 70% of the final loss reduction. This is the formal statement of "the last 10% matters most" and is the reason the anneal mixture is so heavily skewed toward quality.

Length warmup and the attention cost

Self-attention costs $O(L^2 \cdot d)$ per sequence, which is $O(L \cdot d)$ per token. A run that trains end-to-end at $L = 32{,}768$ therefore pays $(32768/4096)^2 = 64\times$ the per-sequence attention compute of a run at $L = 4096$, or $8\times$ per token. A length-warmup curriculum exploits this. Because the shares $s^{(j)}$ are fractions of the token budget, the saving relative to a flat run at $L_{\max}$ is counted per token:

$$\text{Compute saved} \propto \sum_{j} s^{(j)} \cdot \big( L_{\max} - L^{(j)} \big) \cdot d$$

For the canonical 4K / 4K / 8K / 32K schedule, the share-weighted mean context is about 7,782 tokens, so attention compute drops by roughly 4× ($32768 / 7782 \approx 4.2$) compared to running everything at 32K. Final model quality is expected to be the same, because the long-context capability is acquired during the 10% anneal phase when it actually matters.
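The "roughly 4×" figure can be reproduced in a few lines. This is a sketch under two assumptions that the text does not spell out: the phase shares are fractions of the token budget, and attention cost per token grows linearly with the context length (which is the same as quadratic growth per sequence).

```python
# Relative attention compute of a length-warmup schedule versus a flat 32K run.
# Assumptions: shares are token shares; per-token attention cost is
# proportional to the context length L.

SCHEDULE = [  # (token share, context length)
    (0.05, 4_096),   # warmup
    (0.65, 4_096),   # main
    (0.20, 8_192),   # reasoning ramp
    (0.10, 32_768),  # anneal
]

L_MAX = max(length for _, length in SCHEDULE)

# Share-weighted mean context length: the average per-token attention cost,
# up to a constant factor.
mean_context = sum(share * length for share, length in SCHEDULE)

# A flat run pays L_MAX on every token, so this ratio is the compute saving.
saving_factor = L_MAX / mean_context

print(f"mean context  = {mean_context:.1f} tokens")
print(f"saving factor = {saving_factor:.2f}x")
```

If the shares are instead read as fractions of training sequences, the quadratic per-sequence cost applies and the saving comes out larger, so it is worth being explicit about which unit the schedule is written in.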

Manual Numerical Walkthrough

Let us pour the 14.8T-token budget through the four-phase schedule by hand. Same five domains as section 8.4 (web 12 000 B, code 600 B, math 150 B, books 300 B, wiki 50 B), but the mixture is now time-varying.


Setup. Total budget $T = 14\,800$ B. Four phases with shares $s = [0.05, 0.65, 0.20, 0.10]$. So the per-phase budget is:

  • Warmup: $0.05 \cdot 14\,800 = 740$ B tokens
  • Main: $0.65 \cdot 14\,800 = 9\,620$ B tokens
  • Reasoning ramp: $0.20 \cdot 14\,800 = 2\,960$ B tokens
  • Anneal: $0.10 \cdot 14\,800 = 1\,480$ B tokens

Step 2: accumulate per-domain tokens with the integral. For math ($i = 2$), whose weights across the four phases are 0.02, 0.06, 0.18 and 0.25:

  • Warmup contribution: $0.02 \cdot 740 = 14.8$ B
  • Main contribution: $0.06 \cdot 9\,620 = 577.2$ B
  • Reasoning contribution: $0.18 \cdot 2\,960 = 532.8$ B
  • Anneal contribution: $0.25 \cdot 1\,480 = 370.0$ B
  • Total math tokens: $14.8 + 577.2 + 532.8 + 370.0 = 1\,494.8$ B

Step 3: convert to effective epochs. Math raw corpus is 150 B, so $E_{\text{math}} = 1\,494.8 / 150 \approx 9.97\times$ epochs. Compare to the static deepseek mixture in section 8.4, which gave 7.89×. The curriculum's math coverage is 26% higher, and it is delivered concentrated in the reasoning ramp and anneal phases, exactly where it propagates into the final weights.

Step 4: repeat for web. $0.80 \cdot 740 + 0.62 \cdot 9\,620 + 0.40 \cdot 2\,960 + 0.20 \cdot 1\,480 = 592 + 5\,964.4 + 1\,184 + 296 = 8\,036.4$ B. Effective epochs: $8\,036.4 / 12\,000 \approx 0.67\times$. That is less than one pass: the model sees only 67% of the available web tokens, so no web document needs to be repeated and memorisation through repetition is avoided.

Step 5: sanity check. $\sum_i T_i = 8\,036.4 + T_{\text{code}} + 1\,494.8 + T_{\text{books}} + T_{\text{wiki}}$. Working out the rest with the same weighted sum: code = 2 619.6 B, books = 1 687.2 B, wiki = 962.0 B. $8\,036.4 + 2\,619.6 + 1\,494.8 + 1\,687.2 + 962.0 = 14\,800$ B, which equals the total budget. The curriculum balances exactly because the phase shares sum to 1 and every per-phase weight vector sums to 1.

Step 6: the anneal-phase dominance check. The anneal phase serves 1 480 B tokens. Under MiniCPM's $\alpha \approx 0.7$ rule, those 1 480 B tokens contribute roughly 70% of the final loss reduction, even though they are 10% of the run. The anneal mixture is close to uniform (entropy ≈ 2.30 bits out of a maximum of $\log_2 5 \approx 2.32$), which is by design: late-stage training balances all capabilities rather than doubling down on any single one.
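The entropy figure in Step 6 is easy to check directly. The snippet below is a small sketch that computes the Shannon entropy, in bits, of the warmup and anneal mixtures from this section; `entropy_bits` is an illustrative helper name.

```python
import math


def entropy_bits(weights):
    """Shannon entropy of a mixture, in bits. Zero weights contribute nothing."""
    return -sum(w * math.log2(w) for w in weights if w > 0)


WARMUP = [0.80, 0.05, 0.02, 0.10, 0.03]
ANNEAL = [0.20, 0.20, 0.25, 0.20, 0.15]

warmup_h = entropy_bits(WARMUP)
anneal_h = entropy_bits(ANNEAL)
max_h = math.log2(len(ANNEAL))  # entropy of a perfectly uniform 5-way mix

print(f"warmup entropy = {warmup_h:.2f} bits")
print(f"anneal entropy = {anneal_h:.2f} bits")
print(f"maximum        = {max_h:.2f} bits")
```

Comparing the two phases makes the design visible: the warmup mixture is sharply peaked on web text, while the anneal mixture sits close to the uniform ceiling.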

Visualizing the Curriculum

The scrubber below sweeps a cursor across the 14.8T-token schedule. Drag the slider, or click any of the four phase buttons to jump directly to the middle of that phase. Watch three things change in sync: the active mixture bar inside the phase card (the mixture is different in every phase), the sequence length readout in the header (4K → 4K → 8K → 32K across phases), and the cumulative tokens-per-domain bars at the bottom (these fill at different rates depending on which phase is currently active).

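[Interactive figure: curriculum scrubber showing the phase timeline, active mixture, sequence length, and cumulative tokens per domain]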

Three observations the scrubber forces on you that flat-mixture thinking hides. First, math fills slowly through warmup and main (where its weight is 2–6%), then accelerates sharply in the reasoning and anneal phases (where its weight is 18–25%). The final math coverage is about 1.5T tokens, and roughly 60% of that arrives in the last 30% of training. Second, the sequence-length escalation is concentrated in the back end: most of the run is at 4K. That is the 4× attention-compute saving from length warmup, visible directly in the schedule. Third, the wiki bar, barely visible in the early phases, fills three times faster during anneal than during main (15% weight versus 5%). The smallest corpora get their disproportionate boost exactly when the model can absorb it without diverging.

Plain Python: A Phase-Aware Sampler

Below is the complete curriculum mechanism in plain Python. The sampler is the same inverse-CDF walk you saw in section 8.4, but the weights vector is now a function of training step. The full schedule is four phases declared as a list of dicts; the phase lookup is a four-comparison walk; the per-step sampler costs one O(K) call.

🐍curriculum_sampler.py
DOMAINS, RAW_B: Same five domains, but the run is now segmented in time

RAW_B is identical to section 8.4 — Web is two orders of magnitude bigger than Math. What changes here is that the mixture is no longer a single 5-vector for the whole run; it is FOUR 5-vectors, one per phase, and the schedule decides which one is active at any step.

EXECUTION STATE
DOMAINS = ['web', 'code', 'math', 'books', 'wiki']
RAW_B[0] (web) = 12000 B
RAW_B[2] (math) = 150 B
PHASES: Four phases declare share, weights, and sequence length

Each phase is an atomic recipe: how much of the budget it gets, the mixture it serves, the context length it trains on. The transition between phases is the curriculum. 'warmup' is mostly web at short context to get the optimizer settled; 'anneal' is dense, balanced, long-context to converge the final model. Note seq_len changes across phases — that is the length-warmup half of the curriculum.

EXECUTION STATE
PHASES[0].share = 0.05
PHASES[3].weights[2] (math anneal) = 0.25
PHASES[3].seq_len = 32768
PHASE_BOUNDARIES: Build phase boundaries, the schedule's clock

PHASE_BOUNDARIES = [0.05, 0.70, 0.90, 1.00]. These are the right edges of each phase along normalised training progress. Any step t whose fraction t/TOTAL_STEPS falls into bucket i activates PHASES[i]. This is the entire 'schedule' data structure — four floats.

EXECUTION STATE
PHASE_BOUNDARIES = [0.05, 0.70, 0.90, 1.00]
current_phase(): Phase lookup is O(K), fast and stateless

current_phase(step) computes frac = step / TOTAL_STEPS and walks the boundary list. With K=4 phases this is essentially free and runs once per step. Statelessness matters: if a worker crashes and resumes from step t, the phase is reconstructed deterministically from t alone — no shared mutable state to corrupt.

sample_domain(): Phase-aware categorical sampler

Same inverse-CDF walk as section 8.4, but the weights come from current_phase(step) rather than a global constant. The sampler is now time-varying — at step 500_000 (halfway, inside 'main') it serves web 62% of the time; at step 950_000 (inside 'anneal') it serves web only 20% and math 25%.

EXECUTION STATE
ph['weights'] at step 500000 = [0.62, 0.17, 0.06, 0.10, 0.05]
Simulation loop: Tally tokens-per-domain, converted to billions

ph['seq_len'] is the number of tokens in ONE training sequence. Crediting seen_B[d] in proportion to it each step is a rough accounting (real runs include batch_size and gradient_accumulation, but the proportions are identical). Crucially, phases with seq_len=32768 add 8× more tokens per step than seq_len=4096 phases, so annealing is short in step-count but heavy in token-count. This is also why the simulated per-domain totals differ from the hand walkthrough, which spends the phase shares in tokens rather than in steps.

EXECUTION STATE
seen_B[2] / sum(seen_B) (math share of all tokens, final) ≈ 0.16

Final print loop: Final epochs reflect the curriculum, not a single mixture

The printed 'epochs' row is the curriculum's contract: how many times each raw corpus was effectively replayed by end-of-run. Compare to section 8.4, where the static 'deepseek' mix gave math ≈ 7.89× epochs. The curriculum can hit similar or higher math coverage while spending MORE budget on long-context anneal, because the math weight is concentrated where it matters (reasoning ramp + anneal phases) rather than diluted across the whole run.
```python
import random

random.seed(0)

# The same five domains from section 8.4, with raw token counts in BILLIONS.
DOMAINS = ["web", "code", "math", "books", "wiki"]
RAW_B   = [12_000,    600,    150,     300,     50]

# A four-phase curriculum. Each phase declares:
#   share   - fraction of the run spent here (applied to steps in this simulation)
#   weights - mixture inside the phase (must sum to 1)
#   seq_len - context length the model trains on during this phase
PHASES = [
    {"name": "warmup",     "share": 0.05, "weights": [0.80, 0.05, 0.02, 0.10, 0.03], "seq_len":  4_096},
    {"name": "main",       "share": 0.65, "weights": [0.62, 0.17, 0.06, 0.10, 0.05], "seq_len":  4_096},
    {"name": "reasoning",  "share": 0.20, "weights": [0.40, 0.22, 0.18, 0.12, 0.08], "seq_len":  8_192},
    {"name": "anneal",     "share": 0.10, "weights": [0.20, 0.20, 0.25, 0.20, 0.15], "seq_len": 32_768},
]

TOTAL_B = 14_800         # total training tokens, in billions
TOTAL_STEPS = 1_000_000  # symbolic; the real run has ~1.8M steps

# Build a phase-lookup table: for any step t, which phase are we in?
_cum = 0.0
PHASE_BOUNDARIES = []
for ph in PHASES:
    _cum += ph["share"]
    PHASE_BOUNDARIES.append(_cum)  # right edge of each phase, in [0, 1]

# One simulated step stands for one sequence of seq_len tokens. Rescale so the
# whole symbolic run adds up to TOTAL_B (this stands in for the collapsed
# batch size; per-domain proportions are unaffected).
MEAN_SEQ_LEN = sum(ph["share"] * ph["seq_len"] for ph in PHASES)
B_PER_TOKEN = TOTAL_B / (TOTAL_STEPS * MEAN_SEQ_LEN)


def current_phase(step: int) -> dict:
    """Look up the active phase for a given training step."""
    frac = step / TOTAL_STEPS
    for i, edge in enumerate(PHASE_BOUNDARIES):
        if frac <= edge:
            return PHASES[i]
    return PHASES[-1]


def sample_domain(step: int) -> int:
    """One step of the curriculum-aware sampler (inverse-CDF walk)."""
    ph = current_phase(step)
    r, acc = random.random(), 0.0
    for i, w in enumerate(ph["weights"]):
        acc += w
        if r < acc:
            return i
    return len(ph["weights"]) - 1  # guard against float round-off in acc


# Simulate the FULL run and tally per-domain tokens consumed.
seen_B = [0.0] * len(DOMAINS)
for step in range(TOTAL_STEPS):
    ph = current_phase(step)
    d = sample_domain(step)
    # Tokens per step in this phase are proportional to seq_len.
    seen_B[d] += ph["seq_len"] * B_PER_TOKEN  # scaled tokens, in billions

for i, name in enumerate(DOMAINS):
    epochs = seen_B[i] / RAW_B[i]
    print(f"{name:6s}  seen={seen_B[i]:7.1f}B  epochs={epochs:5.2f}x")
```

Two ideas worth pinning. First, the schedule is data, not code. The four-phase table is a single 4×3 grid of primitives: share, weights, seq_len. Changing the curriculum is a YAML edit, not a code change; this is what lets frontier labs run dozens of curriculum-search experiments on a fixed model and a fixed data pool. Second, the phase boundary lookup is deterministic in step only, with no shared state and no race conditions across workers. If worker 3 crashes at step 412 781 and resumes, it reconstructs "I am in phase main" from the step number alone and serves the correct mixture on its very first sample.
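Both ideas can be sketched together. The example below keeps the schedule as a data document, validates it on load, and resolves the phase from the step number alone. It uses JSON rather than YAML only so that it runs on the standard library; the field names mirror the PHASES table, and `load_schedule` and `phase_at` are hypothetical helper names.

```python
import json

# The schedule as data. In practice this would live in its own file.
SCHEDULE_JSON = """
[
  {"name": "warmup",    "share": 0.05, "weights": [0.80, 0.05, 0.02, 0.10, 0.03], "seq_len": 4096},
  {"name": "main",      "share": 0.65, "weights": [0.62, 0.17, 0.06, 0.10, 0.05], "seq_len": 4096},
  {"name": "reasoning", "share": 0.20, "weights": [0.40, 0.22, 0.18, 0.12, 0.08], "seq_len": 8192},
  {"name": "anneal",    "share": 0.10, "weights": [0.20, 0.20, 0.25, 0.20, 0.15], "seq_len": 32768}
]
"""


def load_schedule(text, n_domains=5, tol=1e-9):
    """Parse and validate a schedule; raise ValueError on malformed input."""
    phases = json.loads(text)
    for ph in phases:
        if len(ph["weights"]) != n_domains:
            raise ValueError(f"{ph['name']}: expected {n_domains} weights")
        if abs(sum(ph["weights"]) - 1.0) > tol:
            raise ValueError(f"{ph['name']}: weights do not sum to 1")
        if ph["seq_len"] <= 0:
            raise ValueError(f"{ph['name']}: seq_len must be positive")
    if abs(sum(ph["share"] for ph in phases) - 1.0) > tol:
        raise ValueError("phase shares do not sum to 1")
    return phases


def phase_at(phases, step, total_steps):
    """Stateless lookup: the active phase depends on the step number only."""
    frac = step / total_steps
    edge = 0.0
    for ph in phases:
        edge += ph["share"]
        if frac < edge:
            return ph
    return phases[-1]


phases = load_schedule(SCHEDULE_JSON)
resumed = phase_at(phases, step=412_781, total_steps=1_000_000)
print(resumed["name"], resumed["seq_len"])
```

Validating on load means a typo in a weight vector fails before the run starts rather than silently skewing the mixture for an entire phase.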

The accounting check. After simulating the full run, sum the per-domain tokens. In raw token units the total must equal $N \cdot \bar{L}$, where $N$ is the number of steps (one sequence per step) and $\bar{L}$ is the share-weighted mean sequence length. For our schedule: $\bar{L} = 0.05 \cdot 4096 + 0.65 \cdot 4096 + 0.20 \cdot 8192 + 0.10 \cdot 32768 \approx 7782$ tokens per step. If your counter says otherwise, the bug is almost always a missed seq_len update at a phase boundary.

PyTorch: Time-Varying MixtureSampler

The production version inherits the entire interface of section 8.4's MixtureSampler — same IterableDataset, same DataLoader, same per-worker independence — and adds one new responsibility: re-select the mixture from the current phase before each multinomial draw. The shards on disk are packed at the longest seq_len (32K) so a single pre-tokenisation serves all four phases.

🐍curriculum_pytorch.py
Phase: Phase as a dataclass, so the schedule lives in code, not config

Four fields per phase. Keeping the schedule as a typed Python object (rather than a JSON blob) means the type checker enforces shape: weights is a List[float], seq_len is an int, share is a float. Frontier teams version-control this file and treat schedule changes as first-class code reviews.

PHASES: The full four-phase curriculum, declared once

Note the seq_len escalation: 4K → 4K → 8K → 32K. The optimizer never sees a sudden 8× context jump — there is an intermediate 8K phase to let attention patterns adapt. This is 'length warmup' and is the most under-appreciated part of the curriculum. Skip it and the loss spikes 0.3 nats on the day you switch to 32K.

EXECUTION STATE
PHASES[3].seq_len / PHASES[1].seq_len = 8.0
CurriculumSampler: Subclasses the same IterableDataset interface

The DataLoader downstream cannot tell that the mixture changed — it just keeps pulling samples. That is the key engineering win: the curriculum is implemented entirely INSIDE the sampler. No changes to the trainer, no changes to the model, no changes to the optimizer (other than its existing LR schedule, which is coordinated with phases).

self.edges: Precomputed phase edges, O(K) per step lookup

self.edges = [0.05, 0.70, 0.90, 1.00]. _phase_for_step is then a four-comparison walk. Even at 100k samples/second per worker, this is sub-microsecond — phase lookup is never the bottleneck. The bottleneck is always disk I/O on the underlying shards.

EXECUTION STATE
self.edges = [0.05, 0.70, 0.90, 1.00]
weights tensor: Re-build the weight tensor inside the loop on purpose

Reading phase.weights and wrapping in a fresh tensor per step is slightly wasteful but defensive: if you mutate PHASES (e.g. a live hyperparameter sweep changes a weight mid-run) the change takes effect on the very next sample. Static caching here is a foot-gun.

EXECUTION STATE
weights.shape = (5,)
torch.multinomial: Multinomial draw with the live, phase-specific weights

Identical to section 8.4's call, but the weights argument is now time-varying. At step 50_000 (frac=0.028, still warmup) weights are [0.80, 0.05, 0.02, 0.10, 0.03], almost all web. At step 1_750_000 (frac=0.972, deep in anneal) weights are [0.20, 0.20, 0.25, 0.20, 0.15], where math has the largest share. Same line of code, completely different gradient signal.

EXECUTION STATE
k at step 50_000 = most often 0 (web)
k at step 1_750_000 = near-uniform across 0..4, math (2) slightly favoured
yield: Trim to phase.seq_len at sample time

Critical reuse trick: the on-disk shards are packed at the LONGEST seq_len (32K). Each phase just slices what it needs. This means we do not re-tokenise, re-pack, or re-shuffle the corpora when seq_len escalates — a single set of shard files serves all four phases. Without this trick, length warmup would force three full data re-builds.

EXECUTION STATE
full.shape = (32768,)
yielded shape, phase 1 (warmup) = (4096,)
yielded shape, phase 4 (anneal) = (32768,)
Trainer loop: The trainer loop is unchanged

Notice what is NOT here: no phase check inside train_step, no special handling of the seq_len jump in the optimizer, no special checkpoint at phase boundaries. The curriculum is fully encapsulated in CurriculumSampler. This separation is what lets DeepSeek change the schedule between training runs without touching the model code at all.

EXECUTION STATE
batch.shape (phase 2, main) = (8, 4096)
batch.shape (phase 4, anneal) = (8, 32768)
```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info
from dataclasses import dataclass
from typing import List


@dataclass
class Phase:
    name: str
    share: float            # fraction of total training steps
    weights: List[float]    # mixture (sums to 1), aligned with shards
    seq_len: int            # context length for THIS phase


PHASES = [
    Phase("warmup",    0.05, [0.80, 0.05, 0.02, 0.10, 0.03],  4_096),
    Phase("main",      0.65, [0.62, 0.17, 0.06, 0.10, 0.05],  4_096),
    Phase("reasoning", 0.20, [0.40, 0.22, 0.18, 0.12, 0.08],  8_192),
    Phase("anneal",    0.10, [0.20, 0.20, 0.25, 0.20, 0.15], 32_768),
]


class CurriculumSampler(IterableDataset):
    """Time-varying MixtureSampler. Same interface as in section 8.4,
    but the active mixture is a function of the global training step."""

    def __init__(self, shards, phases, total_steps, batch_size, seed=0):
        assert all(abs(sum(p.weights) - 1.0) < 1e-9 for p in phases)
        self.shards = shards
        self.phases = phases
        self.total_steps = total_steps
        self.batch_size = batch_size  # needed to map samples to global steps
        self.seed = seed

        # Precompute right-edge boundaries (fractions of the run) for an
        # O(K) lookup.
        cum, self.edges = 0.0, []
        for p in phases:
            cum += p.share
            self.edges.append(cum)

    def _phase_for_step(self, step: int) -> Phase:
        for phase, edge in zip(self.phases, self.edges):
            # Compare in integer steps so float round-off cannot split one
            # batch across two phases with different seq_len.
            if step < round(edge * self.total_steps):
                return phase
        return self.phases[-1]

    def __iter__(self):
        # Each DataLoader worker runs its own copy of this iterator. Assuming
        # the default in-order, round-robin batch assignment, worker w builds
        # global batches w, w + W, w + 2W, ...
        info = get_worker_info()
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1
        # Seed per worker so the workers do not draw identical domain sequences.
        gen = torch.Generator().manual_seed(self.seed + worker_id)

        iters = [iter(s) for s in self.shards]
        n = 0  # samples yielded by THIS worker
        while True:
            step = (n // self.batch_size) * num_workers + worker_id
            phase = self._phase_for_step(step)

            # 1) Rebuild the weight tensor on every sample so that a change
            #    to the phase weights takes effect immediately.
            weights = torch.tensor(phase.weights, dtype=torch.float32)
            k = int(torch.multinomial(weights, 1, generator=gen))

            try:
                full = next(iters[k])                  # (max_seq,) int64
            except StopIteration:
                iters[k] = iter(self.shards[k])        # shard exhausted: restart it
                full = next(iters[k])

            # 2) Trim to current phase length. Real implementations precut
            #    the shard files at the longest seq_len and slice down here.
            yield full[: phase.seq_len]

            n += 1


BATCH_SIZE = 8
TOTAL_STEPS = 1_800_000

# `shards` is produced by section 8.4's DomainShard, but packed at
# max(phase.seq_len for phase in PHASES) = 32_768 so a single shard
# stream serves all four phases.
sampler = CurriculumSampler(shards, PHASES, total_steps=TOTAL_STEPS,
                            batch_size=BATCH_SIZE)
loader = DataLoader(sampler, batch_size=BATCH_SIZE, num_workers=4,
                    pin_memory=True)

for step, batch in enumerate(loader):
    if step == TOTAL_STEPS:
        break
    # batch.shape == (8, seq_len_for_current_phase)
    # The model wrapper must accept variable seq_len across phases.
    # `train_step` is the trainer's existing step function.
    loss = train_step(batch, step=step)
```

Three places this code touches the rest of the training stack:

  1. The LR schedule must agree with the phase schedule. A cosine LR that peaks at step 0 and decays smoothly to zero will undershoot the anneal phase — the LR is already small by then, and the model is not learning much from the highest-quality data. The standard fix is a two-segment LR: a long, gentle cosine across phases 1–3, then a sharp linear decay through phase 4. This way the anneal phase still gets non-trivial updates.
  2. The model wrapper must accept variable seq_len. The batch shape changes from (8, 4096) in main to (8, 32768) in anneal. With RoPE the forward pass runs at any length, but quality beyond the trained length depends on a RoPE rescaling strategy (YaRN, PI, NTK-aware scaling); ALiBi extrapolates by construction; learned absolute positional embeddings need their table extended. DeepSeek-V3 uses YaRN at the phase-3→4 boundary.
  3. Checkpoints at phase boundaries are mandatory. A clean checkpoint right before each phase transition lets you (a) restart the run if the phase change causes a loss spike, (b) reuse phases 1–3 as the starting point for multiple phase-4 variants (anneal mixture search), (c) audit the per-phase contribution to capabilities afterwards. Skipping these checkpoints is the single most expensive mistake we have seen labs make.
Implementation note: the anneal phase is where you do mixture search, not the main phase. Run the same checkpoint of phases 1–3 forward four times with four different anneal mixtures. The four resulting models differ by 5–10 benchmark points — and the cost is only 4 × 10% = 40% of one training run, instead of 4× the whole run. Frontier labs do this routinely; it is the single highest ROI experiment in the whole pretraining stack.
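The two-segment LR described in point 1 can be sketched as a pure function of normalised progress. Everything numeric here is an assumption for illustration: the peak LR, the 1% linear warmup, the 90% anneal start (matching the phase-4 boundary of this section's schedule), and the choice to let the cosine bottom out at 30% of peak.

```python
import math


def two_segment_lr(frac, peak_lr=3e-4, warmup_frac=0.01,
                   anneal_start=0.90, pre_anneal_floor=0.3):
    """Learning rate at normalised progress frac in [0, 1].

    1) linear warmup from 0 to peak_lr over warmup_frac,
    2) gentle cosine from peak_lr down to pre_anneal_floor * peak_lr,
       reaching that floor exactly at anneal_start,
    3) linear decay from the floor to 0 across the anneal phase.
    """
    if frac < warmup_frac:
        return peak_lr * frac / warmup_frac
    floor_lr = pre_anneal_floor * peak_lr
    if frac < anneal_start:
        u = (frac - warmup_frac) / (anneal_start - warmup_frac)
        return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * u))
    v = (frac - anneal_start) / (1.0 - anneal_start)
    return floor_lr * (1.0 - min(v, 1.0))


for frac in (0.0, 0.01, 0.5, 0.90, 0.95, 1.0):
    print(f"progress {frac:4.2f} -> lr {two_segment_lr(frac):.2e}")
```

The point of the floor is that the anneal phase starts from an LR that is still a meaningful fraction of peak, so the highest-quality tokens drive real updates rather than near-zero ones.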

At Massive Scale: Length Warmup and Annealing

At 671B parameters and 14.8T tokens, the curriculum is not an optimisation luxury; it is a structural prerequisite for the run to finish at all. Four constraints force its hand:

| Constraint | How the curriculum addresses it |
| --- | --- |
| GPU memory at long context | Attention activations grow as L²·d when the score matrix is materialised. Training the whole run at 32K context would need ~64× the activation memory of 4K, far exceeding what FSDP+activation checkpointing can hide. The curriculum keeps L at 4K for 70% of the run, only paying the long-context cost in the final 10%. |
| Distributed I/O bandwidth | Different phases serve different effective sequence sizes, so per-GPU token throughput changes by 8×. The data pipeline must size shard reads, prefetch queues, and inter-node bandwidth for the WORST case (32K anneal): overprovisioned during the main phase, fully utilised during anneal. |
| Long-context capability acquisition | DeepSeek uses YaRN positional scaling at the seq_len jumps. A naive jump from 4K to 32K without RoPE rescaling causes the loss to spike by 1–2 nats and never recover. The curriculum schedules the rescaling event explicitly between phase 3 and phase 4. |
| Eval-set contamination amplification | The anneal phase serves each high-quality doc 2–4× over its short duration. Any contamination of the anneal corpus is amplified exactly when the model is most receptive. Contamination scans (section 8.7) MUST re-run before anneal launches; the run-wide scan from week-1 is not sufficient. |

Real frontier runs add a fifth ingredient: data ordering within a phase. Even inside phase 2 (main), the schedule can order documents by a difficulty proxy — average per-token perplexity from a small reference model — and serve easier documents first. This is the classical "Bengio curriculum" idea, but applied at document granularity rather than corpus granularity. Reported gains are modest (1–2% benchmark lift) but free if you already have the proxy scores from your quality-filtering stage.
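One way to picture document-level ordering is the sketch below. It assumes each document already carries a difficulty proxy from the quality-filtering stage; the `Doc` record, the `ref_perplexity` field, and the bucketed shuffle are illustrative choices, not a description of any lab's pipeline.

```python
import random
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    ref_perplexity: float  # hypothetical proxy: mean per-token perplexity
                           # under a small reference model


def order_easy_first(docs, n_buckets=4, seed=0):
    """Sort by the difficulty proxy, cut into contiguous buckets, and shuffle
    inside each bucket. Easier buckets come first; the shuffle keeps some
    randomness so the order is not a strict ranking."""
    rng = random.Random(seed)
    ranked = sorted(docs, key=lambda d: d.ref_perplexity)
    bucket_size = max(1, -(-len(ranked) // n_buckets))  # ceiling division
    ordered = []
    for start in range(0, len(ranked), bucket_size):
        bucket = ranked[start:start + bucket_size]
        rng.shuffle(bucket)
        ordered.extend(bucket)
    return ordered


docs = [Doc(f"d{i}", p) for i, p in
        enumerate([30.0, 5.0, 12.0, 50.0, 8.0, 20.0, 3.0, 40.0])]
ordered = order_easy_first(docs, n_buckets=4)
print([d.doc_id for d in ordered])
```

Bucketing rather than strict sorting is a deliberate compromise here: it keeps the easy-to-hard trend while avoiding long runs of near-identical difficulty.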

The annealing playbook

Among practitioners, the annealing phase has a fairly tight recipe as of 2025. The shape repeats across DeepSeek-V3, MiniCPM, Qwen-2.5, and the public Llama-3 technical report:

  1. Length jump first, then mixture change. Anneal starts with one or two checkpoints that just extend the context length while keeping the main-phase mixture. This isolates the length-extrapolation event so any loss spike is attributable.
  2. Linear LR decay to zero across the anneal duration. Not cosine — linear. Cosine's long tail near zero wastes anneal tokens; linear lets the model take real steps right up to the last batch.
  3. Mixture flattens toward uniform across domains. Anneal weights look like [0.20, 0.20, 0.25, 0.20, 0.15] — entropy near max. The model has already learned web; the marginal anneal token is most valuable when it is from a previously-underrepresented slice.
  4. Quality filter tightens by ~2 standard deviations. The anneal corpus is built by re-applying the section-8.3 quality score with a stricter cutoff. Only the cleanest 5–10% of each domain's tokens survives into anneal.
  5. Per-domain validation loss is checked every 50B tokens. If any domain's loss rises during anneal, halt and inspect — it means either the LR is too low (model is forgetting) or the mixture is poisoned (a contamination scan slipped through).
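The check in step 5 amounts to comparing consecutive per-domain validation losses. A minimal sketch, assuming losses are logged once per checkpoint interval; the `history` values below are made-up numbers for illustration and `domains_regressing` is a hypothetical helper.

```python
def domains_regressing(history, tolerance=0.0):
    """Return the domains whose latest validation loss rose by more than
    `tolerance` relative to the previous checkpoint.

    history maps domain name -> list of validation losses, oldest first,
    one entry per evaluation (for example, every 50B tokens).
    """
    flagged = []
    for domain, losses in history.items():
        if len(losses) >= 2 and losses[-1] - losses[-2] > tolerance:
            flagged.append(domain)
    return flagged


# Illustrative numbers only.
history = {
    "web":  [2.10, 2.08],
    "math": [1.50, 1.47],
    "wiki": [1.90, 1.93],
}

flagged = domains_regressing(history, tolerance=0.01)
if flagged:
    print("halt and inspect:", flagged)
```

A small tolerance keeps evaluation noise from triggering a halt on every checkpoint; what counts as noise depends on the size of the validation set.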

Engineering Reality and Gotchas

Three real failure modes that have shipped to production runs and cost real money:

  1. Phase-boundary loss spikes. Every time the mixture changes, the loss jumps, sometimes by 0.05 nats (recoverable), sometimes by 0.5 nats (run-ending). The standard mitigation is a linear mixture interpolation over 1% of the schedule around each boundary: instead of switching abruptly from $p^{(j)}$ to $p^{(j+1)}$, blend $p(t) = (1 - \lambda) p^{(j)} + \lambda p^{(j+1)}$ with $\lambda$ ramping linearly 0 → 1. This adds ~10 lines to the sampler and prevents 90% of phase-boundary spikes.
  2. Length-jump RoPE divergence. A 4K→32K jump without YaRN scaling typically causes loss to rise from 1.8 to 3.5 nats inside 200 steps and stay there. The fix is to apply the YaRN factor at the EXACT batch where the seq_len changes; the model weights are otherwise untouched. Missing this is the most common way to lose a 70B run.
  3. Optimizer state interactions with seq_len changes. Adam's $v$ (second moment) is per-parameter, but the scale of typical gradients changes with seq_len because attention activations grow. The result: after a seq_len jump, the effective step size on attention weights changes by $(L_{\text{new}}/L_{\text{old}})^2$. DeepSeek mitigates by adding a one-time gradient renormalisation at each length boundary; some labs reset only the second-moment estimates for attention parameters. Either works; doing nothing does not.
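The linear mixture interpolation from the first failure mode can be sketched as a drop-in replacement for the phase-weight lookup. This version centres a blend window of width 1% of the schedule on each boundary, which is one reading of "around each boundary"; it assumes the window is narrower than every phase so that windows never overlap.

```python
PHASES = [  # (share, weights) for the four phases used in this section
    (0.05, [0.80, 0.05, 0.02, 0.10, 0.03]),  # warmup
    (0.65, [0.62, 0.17, 0.06, 0.10, 0.05]),  # main
    (0.20, [0.40, 0.22, 0.18, 0.12, 0.08]),  # reasoning ramp
    (0.10, [0.20, 0.20, 0.25, 0.20, 0.15]),  # anneal
]
BLEND_WIDTH = 0.01  # fraction of the schedule spent blending at each boundary


def blended_weights(frac, phases=PHASES, width=BLEND_WIDTH):
    """Mixture at normalised progress frac, with a linear blend
    p = (1 - lam) * p_j + lam * p_{j+1} inside each boundary window."""
    half = width / 2.0
    edge = 0.0
    for j in range(len(phases) - 1):
        edge += phases[j][0]
        if frac < edge - half:
            return list(phases[j][1])          # well inside phase j
        if frac <= edge + half:                # inside the blend window
            lam = (frac - (edge - half)) / width
            cur, nxt = phases[j][1], phases[j + 1][1]
            return [(1.0 - lam) * a + lam * b for a, b in zip(cur, nxt)]
    return list(phases[-1][1])                 # past the last boundary


for frac in (0.30, 0.695, 0.70, 0.705, 0.95):
    w = blended_weights(frac)
    print(f"progress {frac:5.3f} -> math weight {w[2]:.3f}")
```

Because each blend is a convex combination of two valid mixtures, the blended weights still sum to 1 and can be passed straight to the sampler.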
DeepSeek-V3's actual schedule, distilled. Per the public technical report: warmup ~2000 steps at LR ramp + base mixture; main pretraining at 4K context for the bulk of 14.8T tokens; a long-context extension stage that increases context to 32K using YaRN; a final supervised fine-tuning / RL stage on top. The exact share numbers are not published, but every published frontier model since 2023 follows the same broad shape, and the gap between the published recipes is small enough that the four-phase template in this section is a reasonable open-source approximation.

The one sentence to carry forward: a curriculum is the schedule that turns a fixed pool of training tokens into the maximum capability per dollar, and the final 10% of that schedule (the anneal phase) is where most of the capability actually lands. That is why every frontier lab now spends as much engineering effort tuning the four-phase schedule as it does on the underlying mixture itself.

Where we go from here. Section 8.7 is the natural capstone of this chapter: once your data pipeline, mixture, and curriculum are all set, the last remaining failure mode is contamination — eval-set documents leaking into training, especially into the high-replay anneal corpus. We will see how n-gram, embedding, and substring detectors are layered to catch contamination before it inflates benchmark scores and silently destroys a launch.