Sections 8.1 through 8.3 built the pipeline that takes raw internet text and turns it into clean tokens — deduplicated, quality-filtered, ready for the optimizer. We now hold five (or fifty) separate corpora on disk: filtered web, code, math, books, multilingual, scientific papers, dialogue. The naive next step is to dump them into one giant file and call it a day. That single decision is where most models go wrong. The proportions of the final mix matter more for capability than almost any architectural knob — and they are the knob the public almost never sees discussed.
The thesis of this section. A 14.8T-token training run is a budget. The budget gets spent in proportions $p_1, \dots, p_k$, one per domain. Those proportions are picked, not discovered. Picking them is called data mixing, and modern frontier labs treat it as a search problem on par with architecture search. The wrong mix gives you a fluent web-chatter that cannot multiply two two-digit numbers; the right mix gives you the model on the leaderboard.
The Real Problem: Why Natural Proportions Fail
If you stacked every cleaned corpus on top of one another by raw token count, web crawl would account for roughly 93% of the pile, code 5%, books 2%, math 0.1%, Wikipedia 0.03%. That is the "natural" proportion — what the internet actually contains by volume. Train on it directly and the optimizer hears the web crowd shouting and the math whisper drowning. The model learns to chat, but it cannot prove an identity, reason about a Python AST, or recall a date from a Wikipedia infobox.
Four pressures conspire to make natural proportions the wrong choice:
| Pressure | What it does to a model trained on natural proportions |
|---|---|
| Volume imbalance | Web crawl outweighs every other domain, by roughly 20× over code and over 900× over math in the figures above, so every other domain's gradient signal is buried in noise. |
| Quality variance | 1B tokens of math is denser-per-bit than 100B of forum posts; the model needs more weight on the dense signal to learn it. |
| Capability gating | Some abilities (code, math, multilingual) are step-function-acquired once a critical mass of in-domain tokens is seen. Below that threshold, the capability never emerges. |
| Long-tail amnesia | Rare-but-valuable corpora (legal, scientific, foreign-language) get visited once or twice and forgotten before training ends. |
The fix is to upweight the high-value rare corpora and downweight the overwhelming web crawl — at the cost of replaying the small corpora multiple times. The whole engineering question of this section is by how much, and how do we tell?
Intuition: A Training Diet, Not a Buffet
The right mental picture is a nutritionist, not a librarian. The librarian shelves books in proportion to how many copies exist — that is natural-proportion training, and it gives the model a balanced diet of empty calories. The nutritionist decides what the body actually needs and prescribes accordingly: less of the abundant, more of the dense, supplements for the deficient. That is data mixing.
A second analogy: instrumental ensembles. A 200-piece orchestra has 150 strings and 8 percussionists — but the conductor does not let the strings drown out the timpani simply because there are more of them. Gain knobs sit on every section. The mixture weight is the gain knob for each corpus. The conductor's job (the model designer's job) is to set those knobs so the whole sound — the model's capability surface — is balanced.
The Mathematics of a Mixture
Let there be $k$ domains, each with a raw clean corpus of size $N_i$ tokens. The mixture is a probability distribution over domains:

$$p = (p_1, \dots, p_k), \qquad p_i \ge 0, \qquad \sum_{i=1}^{k} p_i = 1$$

Here $p_i$ is the probability that the next training sample comes from domain $i$. Given a total training budget of $T$ tokens, the expected number of tokens drawn from domain $i$ is:

$$T_i = p_i \, T$$

And the number of effective epochs the optimizer runs over corpus $i$ is:

$$e_i = \frac{T_i}{N_i} = \frac{p_i \, T}{N_i}$$

This single quantity is the central engineering variable. $e_i < 1$ means the domain is sub-sampled: the model sees only a fraction of available documents. $e_i \approx 1$ is the sweet spot for a large, high-quality corpus. $e_i \gg 1$ means heavy replay; above roughly $e_i \approx 4$, empirical evidence (Muennighoff et al., 2023) shows the validation loss begins to diverge from the training loss, meaning the model is memorising.
The expected per-token loss
Training is empirical risk minimisation. With a mixture, the loss the optimizer actually descends is the weighted sum of per-domain losses:

$$L(\theta) = \sum_{i=1}^{k} p_i \, \mathbb{E}_{x \sim D_i}\big[\ell(\theta; x)\big]$$

Here $\theta$ are the model parameters, $\ell(\theta; x)$ is the per-token loss on sample $x$, and $D_i$ is the data distribution of domain $i$. The gradient of this loss is also a $p$-weighted sum of per-domain gradients, which means raising $p_i$ literally amplifies the gradient signal coming from domain $i$. Mixture weighting is gradient weighting.
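The gradient claim can be written out in two lines. This is a short derivation sketch, assuming $L_i(\theta)$ as shorthand for the expected loss on domain $i$ and treating the weights $p_i$ as constants that do not depend on $\theta$:

```latex
\begin{aligned}
L(\theta) &= \sum_{i=1}^{k} p_i \, L_i(\theta),
\qquad L_i(\theta) = \mathbb{E}_{x \sim D_i}\big[\ell(\theta; x)\big] \\
\nabla_\theta L(\theta) &= \sum_{i=1}^{k} p_i \, \nabla_\theta L_i(\theta)
\end{aligned}
```

Under that assumption, moving a domain's weight from 0.04 to 0.08 doubles the coefficient on that domain's gradient term, while the renormalisation shrinks every other coefficient slightly.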
The DoReMi objective
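DoReMi (Xie et al., 2023) reframes mixture choice as a minimax problem: find weights that minimise the worst-case per-domain excess loss, measured against a small reference model. Concretely, with $L_i(\theta)$ the per-token loss on domain $i$, $\theta_{\mathrm{ref}}$ the reference model, and $\Delta^{k}$ the probability simplex over $k$ domains:

$$\min_{\theta} \; \max_{p \in \Delta^{k}} \; \sum_{i=1}^{k} p_i \, \big[ L_i(\theta) - L_i(\theta_{\mathrm{ref}}) \big]$$

The inner maximisation pushes weight onto the domains with the largest excess loss; the outer minimisation trains the proxy against that worst case.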
The intuition is "train a 280M proxy under candidate mixtures; pick the mixture under which the proxy's worst-served domain improves the most over a baseline." Proxy-driven search of this kind, where small models choose the mixture for a large one, is the direction frontier labs have reported taking in 2024–2025 (the Llama 3 report, for example, describes small-model scaling-law experiments to pick its data mix): the mixture is chosen empirically, not from first principles.
Manual Numerical Walkthrough
Let us run the arithmetic on a concrete five-domain mixture against DeepSeek-V3's 14.8T-token budget. Every number is computed by hand so the mechanism is fully exposed.
Setup. Five clean corpora with raw token counts (in billions): web = 12 000, code = 600, math = 150, books = 300, wiki = 50. Total raw tokens: $\sum_i N_i = 13\,100$ B. Total budget: $T = 14\,800$ B.
Step 1: natural proportions. If we set $p_i = N_i / \sum_j N_j$, the weights become web ≈ 0.916, code ≈ 0.046, math ≈ 0.011, books ≈ 0.023, wiki ≈ 0.004. The math corpus then draws $0.01145 \times 14\,800 \approx 169$ B tokens: roughly one epoch over 150B raw, but only about 1.1% of the budget. That is well below the ~1T-token math-competence threshold assumed here; on this account the model will not reliably learn arithmetic.
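Step 2: DeepSeek-style upweight. Now set $p = (0.60, 0.17, 0.08, 0.10, 0.05)$ for (web, code, math, books, wiki), an illustrative mix. The per-domain draws $T_i = p_i T$ and effective epochs $e_i = T_i / N_i$ are:
- web: $0.60 \times 14\,800 = 8\,880$ B, $8\,880 / 12\,000 = 0.74$× epochs (under one pass; the model sees ~74% of web)
- code: $0.17 \times 14\,800 = 2\,516$ B, $2\,516 / 600 \approx 4.19$× epochs (borderline; code is dense, four passes is the upper edge)
- math: $0.08 \times 14\,800 = 1\,184$ B, $1\,184 / 150 \approx 7.89$× epochs (heavy replay, justified because math notation is highly regular and the model can absorb repetition without memorising)
- books: $0.10 \times 14\,800 = 1\,480$ B, $1\,480 / 300 \approx 4.93$× epochs (also borderline; books vary in style enough that the model still generalises)
- wiki: $0.05 \times 14\,800 = 740$ B, $740 / 50 = 14.8$× epochs (very high replay; Wikipedia is so factually compressed and so non-redundant that production teams accept the memorisation risk for the factual recall benefit)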
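Step 3: sanity check. $\sum_i p_i = 0.60 + 0.17 + 0.08 + 0.10 + 0.05 = 1.00$. $\sum_i T_i = 8\,880 + 2\,516 + 1\,184 + 1\,480 + 740 = 14\,800$ B, which equals the budget. If either of these did not balance, the mixture is broken.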
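The arithmetic of Steps 2 and 3 is easy to script. The sketch below assumes the walkthrough's five corpora and the illustrative weights (0.60, 0.17, 0.08, 0.10, 0.05); the helper name `plan` is hypothetical and chosen only for this example.

```python
# Walkthrough numbers; all token counts are in billions (B).
RAW = {"web": 12_000, "code": 600, "math": 150, "books": 300, "wiki": 50}
WEIGHTS = {"web": 0.60, "code": 0.17, "math": 0.08, "books": 0.10, "wiki": 0.05}
BUDGET = 14_800  # 14.8T tokens


def plan(weights, raw, budget):
    """Return {domain: (tokens_drawn, effective_epochs)} for a mixture."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return {d: (weights[d] * budget, weights[d] * budget / raw[d]) for d in weights}


table = plan(WEIGHTS, RAW, BUDGET)
for domain, (tokens, epochs) in table.items():
    print(f"{domain:<6} {tokens:>8.0f} B  {epochs:>5.2f} epochs")

# Step 3 sanity check: the per-domain draws must add back up to the budget.
assert abs(sum(tokens for tokens, _ in table.values()) - BUDGET) < 1e-6
```

Running a check like this before launch costs nothing, and it catches a weight vector that does not sum to one before any tokens are served.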
Step 4: the mixture entropy. A useful one-number summary of mixture "evenness" is the Shannon entropy $H(p) = -\sum_i p_i \log_2 p_i$. For the mixture $(0.60, 0.17, 0.08, 0.10, 0.05)$: $H \approx 1.72$ bits. Maximum entropy (five domains uniform) is $\log_2 5 \approx 2.32$ bits. Natural proportions give $H \approx 0.55$ bits, very concentrated on web. The DeepSeek-style mix sits closer to uniform than to natural, deliberately.
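The entropy figures can be reproduced in a few lines. This is a minimal sketch using only the standard library; the function name `entropy_bits` is an assumption made for the example, and the inputs are the walkthrough's raw counts and illustrative weights.

```python
import math


def entropy_bits(weights):
    """Shannon entropy in bits of a discrete distribution; zero weights contribute 0."""
    return -sum(p * math.log2(p) for p in weights if p > 0)


raw = [12_000, 600, 150, 300, 50]            # billions of tokens per domain
natural = [n / sum(raw) for n in raw]        # natural proportions
upweighted = [0.60, 0.17, 0.08, 0.10, 0.05]  # illustrative upweighted mix
uniform = [0.2] * 5

# Prints roughly 0.55, 1.72 and 2.32 bits respectively.
for name, w in [("natural", natural), ("upweighted", upweighted), ("uniform", uniform)]:
    print(f"{name:<11} {entropy_bits(w):.2f} bits")
```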
Step 5: what the gradient sees. Per training batch of 8 sequences, the expected composition is roughly 4.8 web samples, 1.4 code, 0.6 math, 0.8 books, 0.4 wiki. With 100 batches you see ~64 math sequences: small but non-zero, and over the full run that adds up to the 1.18T math tokens above.
Visualizing the Mixture
The sandbox below lets you set the five mixture weights with sliders and immediately see (a) how the 14.8T-token budget gets sliced, (b) how many effective epochs each raw corpus suffers, and (c) the Shannon entropy of the resulting distribution. The three preset buttons — natural, balanced, deepseek — load the three reference mixes discussed above. Watch the math row: natural gives 1.1× epochs but invisible budget share; deepseek gives ~8× epochs and a budget share large enough for capability to emerge.
Three things to lock in from playing with the sliders. First, the stacked dominance bar at the top is what the optimizer literally sees — its proportions are the gradient proportions. Second, the per-domain epoch chip turns amber and then red as you push weight into a small corpus; the colour is your replay-risk dashboard. Third, the entropy readout next to the bar is a one-number summary of how concentrated vs. uniform your diet is — and you can land within ~0.1 bits of the published mixes of frontier models by ear.
Plain Python: A Weighted Domain Sampler
Below is the entire mixture-sampling mechanism in plain Python — no PyTorch, no GPU. Five domains, five raw token counts, a 14.8T-token budget, and a 100k-step empirical check that the sampler reproduces the declared weights. This is the kernel that every production data loader wraps.
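A minimal sketch of such a sampler follows. It assumes the walkthrough's five corpora and illustrative weights; the names `sample_domain` and `empirical_weights` are hypothetical, and the CDF walk is done with the standard-library `bisect` module rather than a hand-written loop.

```python
import bisect
import itertools
import random
from collections import Counter

DOMAINS = ["web", "code", "math", "books", "wiki"]
RAW_B = [12_000, 600, 150, 300, 50]        # raw corpus sizes, billions of tokens
WEIGHTS = [0.60, 0.17, 0.08, 0.10, 0.05]   # declared mixture weights p
BUDGET_B = 14_800                          # 14.8T-token budget, in billions

# Static dry run: effective epochs per domain, known before training starts.
EPOCHS = [p * BUDGET_B / n for p, n in zip(WEIGHTS, RAW_B)]

# Cumulative distribution; pin the last entry to 1.0 to absorb float error.
CDF = list(itertools.accumulate(WEIGHTS))
CDF[-1] = 1.0


def sample_domain(rng):
    """Draw one domain index using exactly one uniform random number."""
    u = rng.random()                      # u is in [0, 1)
    return bisect.bisect_right(CDF, u)    # first index i with CDF[i] > u


def empirical_weights(n_draws, seed=0):
    """Sample n_draws domains and return the observed frequency of each."""
    rng = random.Random(seed)
    counts = Counter(sample_domain(rng) for _ in range(n_draws))
    return [counts[i] / n_draws for i in range(len(WEIGHTS))]


observed = empirical_weights(100_000)
for name, p, e, q in zip(DOMAINS, WEIGHTS, EPOCHS, observed):
    print(f"{name:<6} declared={p:.3f}  epochs={e:5.2f}  empirical={q:.3f}")
```

The sketch treats the weights as probabilities, not logits, and draws one random number per sample; changing either of those is exactly the kind of bug the declared-versus-empirical comparison is meant to expose.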
Two structural details worth a second look. First, the epochs row is computed before training launches: it is a static dry-run that tells you whether your mixture is sane. If any effective-epoch count exceeds about 4 on a corpus whose documents are highly repetitive (boilerplate dialogue, scraped FAQs), you are committing to memorisation. Fix the weights before you spend $5M of GPU time finding out.
Second, the empirical-frequency check is the single most under-rated unit test in this whole pipeline. The sampler can be broken in a dozen subtle ways (off-by-one in the CDF, accidentally calling random.random() twice per draw, treating weights as logits instead of probabilities) and every one of them shows up as a divergence between declared and empirical p. Run this check on every new sampler implementation before you trust it with a training run.
Sanity check yourself. With weights = [0.6, 0.17, 0.08, 0.10, 0.05] and 100k draws, the empirical frequencies should land within about ±0.005 of the declared weights (roughly three standard deviations for the largest weight). If yours come back at [0.50, 0.50, 0, 0, 0], your CDF walk is broken, almost always an off-by-one in the cumulative sum.
PyTorch: Interleaved DataLoaders at Scale
The production version replaces the deterministic loop with an IterableDataset per domain and a single MixtureSampler that interleaves them at sample-time. The weights remain a 5-vector in a config file; the on-disk layout never changes when you adjust them.
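The sketch below shows the interleaving and reset logic in isolation. It is deliberately written without a torch import so that it runs anywhere; since `torch.utils.data.IterableDataset` only requires an `__iter__` method, the same class body could subclass it in a PyTorch pipeline. The class is a stand-in for the MixtureSampler described here, not the original listing, and in-memory lists stand in for the on-disk shards.

```python
import itertools
import random


class MixtureSampler:
    """Interleave per-domain shards at sample time (illustrative sketch).

    shards:  {domain: list of documents}
    weights: {domain: mixture probability}
    """

    def __init__(self, shards, weights, seed=0):
        if any(len(docs) == 0 for docs in shards.values()):
            raise ValueError("every domain needs at least one document")
        self.names = list(shards)
        self.shards = {d: list(docs) for d, docs in shards.items()}
        self.probs = [weights[d] for d in self.names]
        self.rng = random.Random(seed)             # use a distinct seed per worker
        self.resets = {d: 0 for d in self.names}   # empirical epoch counter
        self.iters = {d: self._fresh_iter(d) for d in self.names}

    def _fresh_iter(self, domain):
        # Reshuffle on every pass so document order differs between epochs.
        self.rng.shuffle(self.shards[domain])
        return iter(self.shards[domain])

    def __iter__(self):
        while True:
            domain = self.rng.choices(self.names, weights=self.probs, k=1)[0]
            try:
                doc = next(self.iters[domain])
            except StopIteration:
                # Shard exhausted: count the reset, reshuffle, and keep going.
                self.resets[domain] += 1
                self.iters[domain] = self._fresh_iter(domain)
                doc = next(self.iters[domain])
            yield domain, doc


# Toy usage: a large "web" shard and a tiny "math" shard.
shards = {
    "web": [f"web-{i}" for i in range(1000)],
    "math": [f"math-{i}" for i in range(10)],
}
stream = MixtureSampler(shards, {"web": 0.9, "math": 0.1}, seed=1)
batch = list(itertools.islice(stream, 2000))
print(stream.resets)  # the tiny shard is reset far more often than the large one
```

The weights are passed in at construction time, so changing the mixture never touches the shards themselves.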
Three subtleties worth marking, all about how this loop interacts with the rest of a frontier-scale training stack:
- Per-worker independence is the throughput win. With num_workers=4, four parallel processes each run their own MixtureSampler. Each draws its own categorical per step (from its own, differently seeded random stream), so the four streams interleave into a global batch that itself is already a mini-mixture. There is no central coordinator; the law of large numbers does the coordination for you.
- Resetting an exhausted shard is where the epoch count goes up. The sampler resets a domain's iterator whenever that domain runs out. On a tiny corpus this happens repeatedly over a training run (about 15 times for the wiki corpus in the walkthrough). Two implementation rules: (a) reshuffle the shard order on reset, otherwise the model sees the same document order every epoch and memorises that order; (b) log every reset, because the reset count is the empirical effective-epochs counter, and you want it on a dashboard.
- The weights live in the config, not in the data. The five shards on disk are never re-mixed. Changing $p$ means editing one line of YAML and relaunching: no re-tokenisation, no re-shuffling, no re-packing. This is the single most important engineering decision in the chapter. Pre-mixed datasets force a full re-build for every mixture-search experiment; sample-time mixing makes experimentation essentially free.
At Massive Scale: DoReMi and Mixture Search
Up to this point the weights have been a human choice — informed by the epoch-and-budget arithmetic above, plus folk knowledge from prior training runs. The 2023 DoReMi paper (Xie et al.) showed that you can do meaningfully better by treating mixture choice as a small-scale search problem, then transferring the winning weights to the full run.
The recipe:
- Train a small 280M reference model on a default (e.g. uniform) mixture for a fixed number of steps. Record its per-domain losses $L_i(\theta_{\mathrm{ref}})$.
- Initialise candidate weights $p$ (e.g. uniform). Train a second 280M proxy under those weights. After each chunk of steps, compute the per-domain excess loss $\lambda_i = L_i(\theta) - L_i(\theta_{\mathrm{ref}})$.
- Update weights with a multiplicative-weights step: $p_i \leftarrow p_i \exp(\eta \lambda_i)$ with step size $\eta$, then renormalise. Domains where the proxy is doing the worst (largest excess loss) get more weight; domains where the proxy is beating the reference get less.
- After the proxy stabilises, freeze its weights $p$ (the paper uses the average of $p$ over the proxy's training steps rather than the last iterate). Transfer those weights to the real 70B / 671B training run.
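The weight update in the recipe can be sketched without any model at all. The toy below holds the per-domain losses fixed and made up, whereas in a real run they change as the proxy trains. The clipping of excess loss at zero, the small uniform smoothing term, and the defaults `eta=1.0` and `smoothing=1e-3` follow my reading of the DoReMi paper and should be treated as assumptions; the function name `doremi_step` is hypothetical.

```python
import math


def doremi_step(weights, proxy_loss, ref_loss, eta=1.0, smoothing=1e-3):
    """One multiplicative-weights update on the domain weights.

    weights, proxy_loss and ref_loss are equal-length lists indexed by domain.
    """
    # Excess loss per domain, clipped at zero (assumed to match the paper).
    excess = [max(lp - lr, 0.0) for lp, lr in zip(proxy_loss, ref_loss)]
    # Exponentiated step, then renormalise back onto the simplex.
    scaled = [w * math.exp(eta * e) for w, e in zip(weights, excess)]
    total = sum(scaled)
    updated = [s / total for s in scaled]
    # Mix with uniform so that no domain's weight collapses to zero.
    k = len(weights)
    return [(1 - smoothing) * u + smoothing / k for u in updated]


# Toy run with fixed, made-up losses for (web, code, math, books, wiki).
ref = [2.9, 1.4, 1.8, 3.1, 2.5]
proxy = [2.8, 1.6, 2.3, 3.1, 2.4]
weights = [0.2] * 5
history = []
for _ in range(10):
    weights = doremi_step(weights, proxy, ref)
    history.append(weights)

# Average the weights over steps instead of taking the last iterate.
avg = [sum(col) / len(history) for col in zip(*history)]
print([round(a, 3) for a in avg])
```

In this toy, math has the largest excess loss and so ends with the largest averaged weight, followed by code; the three domains with no excess loss share what remains equally.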
DoReMi's reported result on The Pile is that the main model reaches the baseline's downstream accuracy in 2.6× fewer training steps and improves average few-shot accuracy by 6.5 percentage points, with no model-architecture change. A 2.6× step saving is a gain on the scale usually associated with architectural improvements, and the cost was two 280M runs (reference and proxy), a small fraction of one full training launch.
| Model | Web | Code | Math + Sci | Books + Long-form | Wiki + Reference | Mixture entropy (bits) |
|---|---|---|---|---|---|---|
| Natural proportions | 0.92 | 0.05 | 0.01 | 0.02 | 0.004 | ≈ 0.55 |
| GPT-3 (illustrative) | 0.60 | 0.16 | 0.03 | 0.16 | 0.05 | ≈ 1.66 |
| Llama-2 (illustrative) | 0.67 | 0.08 | 0.03 | 0.18 | 0.04 | ≈ 1.46 |
| DeepSeek-V3 (illustrative) | 0.60 | 0.17 | 0.08 | 0.10 | 0.05 | ≈ 1.72 |
| DoReMi-optimal (Pile, illustrative) | 0.34 | 0.20 | 0.18 | 0.20 | 0.08 | ≈ 2.19 |
Two patterns to read off the table. First, every modern frontier mix shown is much closer to uniform (high entropy) than to natural proportions: the field has converged on "upweight the rare, downweight the abundant" as a non-negotiable. Second, in this illustrative table the DoReMi-optimal row is even closer to uniform than any human-picked mix, which is consistent with the view that humans systematically underweight rare-but-valuable corpora when picking by eye.
The interaction with scaling laws
Chinchilla-style scaling laws tell you the compute-optimal model-size-to-token-count ratio, but they assume an iid training distribution. Once you mix domains with different intrinsic difficulty, the scaling exponent per domain differs: math and code follow steeper power laws than web. The practical consequence is that the compute-optimal $p$ at a 70B model is not the same as at a 7B model: larger models can extract more value from heavier weight on the dense, hard domains because they have the capacity to fit them. Mixture choice is therefore scale-dependent, and proxy-based search at 280M can systematically under-recommend hard slices for a 70B target. This is an active research area as of 2025.
Engineering Reality and Gotchas
Mixture weighting looks like a clean math problem. Three production failure modes earn their flags:
- Effective epochs that look fine in expectation can be catastrophic per-document. A domain with $e_i \approx 8$ means each document is replayed roughly 8 times, but if your shuffle is weak, a popular subset (the most common code repositories, the most common math textbooks) gets replayed 30× while the long tail gets replayed 2×. The mean is fine; the variance is fatal. Always verify shuffle quality before launching, and reshuffle on every reset.
- Tokens-seen drift between declared and actual. Across hundreds of training steps, the empirical mixture you served to the GPUs can drift by 1–2 percentage points from the declared weights, usually due to non-uniform shard sizes interacting with num_workers. The fix is to log a histogram of domain-of-origin tags on every batch and add an alert if any empirical $\hat{p}_i$ drifts more than 0.5pp from the declared $p_i$. We have seen a 70B run waste 200B tokens of effective compute on this bug alone.
- Evaluation set contamination from upweighted corpora. The small, high-quality corpora you most want to upweight (math textbooks, code from popular repositories) are also the corpora most likely to overlap with public benchmarks (GSM8K, HumanEval, MMLU). At 8× replay, even a single contaminating document is seen eight times: easy to memorise, impossible to detect post-hoc. Contamination filtering (section 8.7) must run per-domain and must be re-run every time you bump a mixture weight upward.
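The drift alert from the second failure mode can be sketched as a small pure function. It assumes each served sample carries a domain-of-origin tag; the name `mixture_drift` and its signature are hypothetical, and the 0.5pp default is the threshold quoted in the list above.

```python
from collections import Counter


def mixture_drift(domain_tags, declared, tolerance_pp=0.5):
    """Compare served domain frequencies against the declared weights.

    domain_tags: iterable of domain names, one per served sample.
    declared:    {domain: weight}.
    Returns {domain: drift in percentage points} for domains over tolerance.
    """
    counts = Counter(domain_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    alerts = {}
    for domain, p in declared.items():
        drift_pp = 100.0 * (counts[domain] / total - p)
        if abs(drift_pp) > tolerance_pp:
            alerts[domain] = round(drift_pp, 2)
    return alerts


declared = {"web": 0.60, "code": 0.17, "math": 0.08, "books": 0.10, "wiki": 0.05}
served = ["web"] * 620 + ["code"] * 170 + ["math"] * 60 + ["books"] * 100 + ["wiki"] * 50
print(mixture_drift(served, declared))  # → {'web': 2.0, 'math': -2.0}
```

In practice the tags would be accumulated over a window large enough that sampling noise alone stays well under the tolerance; with only a few hundred samples, a 0.5pp threshold would fire on noise.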
The one sentence to carry forward: a frontier model's capabilities are decided more by the five numbers in its mixture vector than by any other single hyperparameter — and those five numbers are picked, not discovered — which is why every serious lab now treats mixture search as a first-class optimisation, not an afterthought.