Chapter 8

Data Mixing and Domain Weighting

Data: The Invisible Foundation

Sections 8.1 through 8.3 built the pipeline that takes raw internet text and turns it into clean tokens — deduplicated, quality-filtered, ready for the optimizer. We now hold five (or fifty) separate corpora on disk: filtered web, code, math, books, multilingual, scientific papers, dialogue. The naive next step is to dump them into one giant file and call it a day. That single decision is where most models go wrong. The proportions of the final mix matter more for capability than almost any architectural knob — and they are the knob the public almost never sees discussed.

The thesis of this section. A 14.8T-token training run is a budget. The budget gets spent in proportions $p_1, p_2, \dots, p_K$, one per domain. Those proportions are picked, not discovered. Picking them is called data mixing, and modern frontier labs treat it as a search problem on par with architecture search. The wrong mix gives you a fluent web-chatter that cannot multiply two two-digit numbers; the right mix gives you the model on the leaderboard.

The Real Problem: Why Natural Proportions Fail

If you stacked every cleaned corpus on top of one another by raw token count, web crawl would account for roughly 92% of the pile, code 5%, books 2%, math 1%, and Wikipedia 0.4%. That is the "natural" proportion: what the internet actually contains by volume. Train on it directly and the optimizer hears the web crowd shouting while the math whisper drowns. The model learns to chat, but it cannot prove an identity, reason about a Python AST, or recall a date from a Wikipedia infobox.

Four forces conspire to make natural proportions the wrong choice:

| Pressure | What it does to a model trained on natural proportions |
| --- | --- |
| Volume imbalance | Web crawl dominates by roughly 100×, so every other domain's gradient signal is buried in noise. |
| Quality variance | 1B tokens of math carries more learnable signal than 100B of forum posts; the model needs more weight on the dense signal to learn it. |
| Capability gating | Some abilities (code, math, multilingual) are acquired as a step function once a critical mass of in-domain tokens is seen. Below that threshold, the capability never emerges. |
| Long-tail amnesia | Rare-but-valuable corpora (legal, scientific, foreign-language) get visited once or twice and forgotten before training ends. |

The fix is to upweight the high-value rare corpora and downweight the overwhelming web crawl — at the cost of replaying the small corpora multiple times. The whole engineering question of this section is by how much, and how do we tell?

The capability emergence threshold. As a working rule of thumb, this section assumes a model needs on the order of $10^{12}$ in-domain tokens (about 1T, the threshold used in the walkthrough below) before a competence like "solve grade-school arithmetic word problems" emerges. If the math corpus is 150B tokens and you give it natural weight (~1.1%), the model sees only about 163B math tokens across the whole run: below threshold, so the competence never appears. Bump the weight to 8% and it sees 1.18T math tokens, past the threshold. The model goes from innumerate to competent, on the same hardware, by changing one number.
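The same arithmetic can be turned around: given a token threshold, what is the smallest weight that reaches it? The following is a minimal sketch, assuming the ~1T-token threshold that the walkthrough in this section uses (a rule of thumb, not a measured constant). The helper names `tokens_seen` and `min_weight_for_threshold` are hypothetical.

```python
BUDGET_B = 14_800       # total training tokens, in billions
THRESHOLD_B = 1_000     # ASSUMED competence threshold (~1T tokens), in billions


def tokens_seen(weight, budget_b=BUDGET_B):
    """Expected in-domain tokens (billions) for a given mixture weight."""
    return weight * budget_b


def min_weight_for_threshold(threshold_b=THRESHOLD_B, budget_b=BUDGET_B):
    """Smallest mixture weight whose expected draw reaches the threshold."""
    return threshold_b / budget_b


for w in (0.011, 0.08):
    seen = tokens_seen(w)
    status = "above" if seen >= THRESHOLD_B else "below"
    print(f"weight={w:.3f}  tokens={seen:7.1f}B  {status} threshold")

print(f"minimum weight to reach threshold: {min_weight_for_threshold():.4f}")
```

Under these assumptions any math weight below roughly 6.8% leaves the domain short of the threshold, whatever the corpus size.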

Intuition: A Training Diet, Not a Buffet

The right mental picture is a nutritionist, not a librarian. The librarian shelves books in proportion to how many copies exist — that is natural-proportion training, and it gives the model a balanced diet of empty calories. The nutritionist decides what the body actually needs and prescribes accordingly: less of the abundant, more of the dense, supplements for the deficient. That is data mixing.

A second analogy: instrumental ensembles. A 200-piece orchestra has 150 strings and 8 percussionists — but the conductor does not let the strings drown out the timpani simply because there are more of them. Gain knobs sit on every section. The mixture weight is the gain knob for each corpus. The conductor's job (the model designer's job) is to set those knobs so the whole sound — the model's capability surface — is balanced.

The right mental check. Before launching a training run, ask: "If a domain contributes 5% of my training tokens, can the model become competent at that domain's tasks?" If yes, good. If no — either bump the weight or accept the capability gap. There is no third option; you cannot trick the optimizer into caring about tokens it never sees.

The Mathematics of a Mixture

Let there be $K$ domains, each with a raw clean corpus of size $N_i$ tokens. The mixture is a probability distribution over domains:

$$p_i \geq 0, \quad \sum_{i=1}^{K} p_i = 1$$

Here $p_i$ is the probability that the next training sample comes from domain $i$. Given a total training budget of $T$ tokens, the expected number of tokens drawn from domain $i$ is:

$$T_i = p_i \cdot T$$

And the number of effective epochs the optimizer runs over corpus $i$ is:

$$E_i = T_i / N_i = (p_i \cdot T) / N_i$$

This single quantity $E_i$ is the central engineering variable. $E_i < 1$ means the domain is sub-sampled: the model sees only a fraction of the available documents. $E_i \approx 1$ is the sweet spot for a large, high-quality corpus. $E_i \gg 1$ means heavy replay; above $E_i \approx 4$, empirical evidence (Muennighoff et al., 2023) shows that repeated tokens deliver rapidly diminishing returns, and the validation loss begins to diverge from the training loss as the model starts memorising.
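The definition of $E_i$ can also be inverted to give a ceiling on a domain's weight for a chosen replay cap. A small sketch follows, assuming the 4-epoch rule of thumb quoted above and, for the example call, a 150B-token corpus against a 14.8T-token budget (the math figures used elsewhere in this section). The function names are hypothetical.

```python
def effective_epochs(weight, budget_tokens, corpus_tokens):
    """E_i = p_i * T / N_i."""
    return weight * budget_tokens / corpus_tokens


def max_weight_for_epochs(max_epochs, budget_tokens, corpus_tokens):
    """Largest p_i that keeps E_i <= max_epochs (never more than 1.0)."""
    return min(1.0, max_epochs * corpus_tokens / budget_tokens)


def replay_regime(epochs, cap=4.0):
    """Label a domain by how hard its corpus is replayed."""
    if epochs < 1.0:
        return "sub-sampled"
    if epochs <= cap:
        return "within cap"
    return "heavy replay"


T = 14_800e9        # total budget in tokens
N_MATH = 150e9      # raw math corpus in tokens

e = effective_epochs(0.08, T, N_MATH)
print(f"math at p=0.08: E = {e:.2f} ({replay_regime(e)})")
print(f"max math weight for 4 epochs: {max_weight_for_epochs(4.0, T, N_MATH):.4f}")
```

The design point is that the replay cap bounds the weight from above, while any capability threshold bounds it from below. When the two bounds cross, the only ways out are more raw data or accepting heavier replay.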

The expected per-token loss

Training is empirical risk minimisation. With a mixture, the loss the optimizer actually descends is the weighted sum of per-domain losses:

$$\mathcal{L}_{\text{mix}}(\theta) = \sum_{i=1}^{K} p_i \cdot \mathbb{E}_{x \sim D_i} \left[\mathcal{L}(\theta; x)\right]$$

Here $\theta$ are the model parameters and $D_i$ is the data distribution of domain $i$. The gradient of this loss is also a $p$-weighted sum of per-domain gradients, which means raising $p_i$ literally amplifies the gradient signal coming from domain $i$. Mixture weighting is gradient weighting.
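A toy numerical check makes the "mixture weighting is gradient weighting" claim concrete. This sketch assumes a one-parameter model and a quadratic loss per domain with made-up optima `targets`; none of it comes from a real training run. It compares a finite-difference gradient of the mixture loss with the weighted sum of per-domain gradients.

```python
weights = [0.60, 0.17, 0.08, 0.10, 0.05]
targets = [1.0, -2.0, 3.0, 0.5, -1.0]   # HYPOTHETICAL per-domain optima c_i


def domain_loss(theta, c):
    """Toy per-domain loss L_i(theta) = 0.5 * (theta - c_i)^2."""
    return 0.5 * (theta - c) ** 2


def domain_grad(theta, c):
    """Analytic gradient of the toy per-domain loss."""
    return theta - c


def mix_loss(theta, p):
    return sum(pi * domain_loss(theta, c) for pi, c in zip(p, targets))


def mix_grad(theta, p):
    """p-weighted sum of per-domain gradients."""
    return sum(pi * domain_grad(theta, c) for pi, c in zip(p, targets))


theta, eps = 0.3, 1e-6
numeric = (mix_loss(theta + eps, weights) - mix_loss(theta - eps, weights)) / (2 * eps)
print(f"finite-difference gradient: {numeric:.6f}")
print(f"weighted-sum gradient:      {mix_grad(theta, weights):.6f}")

# Each domain's share of the total gradient is scaled by its weight p_i.
for p, c in zip(weights, targets):
    print(f"p={p:.2f}  contribution={p * domain_grad(theta, c):+.3f}")
```

In this toy the minimiser of the mixture loss is the $p$-weighted average of the per-domain optima, so raising one weight pulls the solution toward that domain.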

The DoReMi objective

DoReMi (Xie et al., 2023) reframes mixture choice as a minimax problem: find weights that minimise the worst-case per-domain excess loss, measured against a small reference model. Concretely:

$$p^{\star} = \arg\min_{p} \; \max_{i} \; \big( \mathcal{L}_i(\theta_p) - \mathcal{L}_i(\theta_{\text{ref}}) \big)$$

The intuition is "train a 280M proxy under candidate mixtures; pick the mixture under which the proxy's worst-served domain improves the most over a baseline." This proxy-driven search is how Llama-3, DeepSeek-V3, and other frontier models actually choose their mixture in 2024–2025 — empirically, not from first principles.

Manual Numerical Walkthrough

Let us run the arithmetic on a concrete five-domain mixture against DeepSeek-V3's 14.8T-token budget. Every number is computed by hand so the mechanism is fully exposed.


Setup. Five clean corpora with raw token counts (in billions): web = 12 000, code = 600, math = 150, books = 300, wiki = 50. Total raw tokens: $\sum_i N_i = 13\,100$ B. Total budget: $T = 14\,800$ B.

Step 1: natural proportions. If we set $p_i = N_i / \sum_j N_j$, the weights become web ≈ 0.916, code ≈ 0.046, math ≈ 0.011, books ≈ 0.023, wiki ≈ 0.004. The math corpus then draws $0.011 \cdot 14\,800 \approx 163$ B tokens: roughly one epoch over 150B raw, but only $163 / 14\,800 \approx 1.1\%$ of the budget. That is below the math-competence threshold of ~1T tokens; the model will not learn arithmetic.

Step 2: DeepSeek-style upweight. Now set $p = [0.60, 0.17, 0.08, 0.10, 0.05]$. The per-domain draws are:

  • web: $0.60 \cdot 14\,800 = 8\,880$ B, $E = 8\,880 / 12\,000 = 0.74$ epochs (under one pass; the model sees ~74% of web)
  • code: $0.17 \cdot 14\,800 = 2\,516$ B, $E = 2\,516 / 600 \approx 4.19$ epochs (borderline; code is dense, and four passes is the upper edge)
  • math: $0.08 \cdot 14\,800 = 1\,184$ B, $E = 1\,184 / 150 \approx 7.89$ epochs (heavy replay, justified on the argument that math notation is highly regular and the model can absorb repetition without memorising)
  • books: $0.10 \cdot 14\,800 = 1\,480$ B, $E = 1\,480 / 300 \approx 4.93$ epochs (also borderline; books vary in style enough that the model still generalises)
  • wiki: $0.05 \cdot 14\,800 = 740$ B, $E = 740 / 50 = 14.8$ epochs (very high replay; Wikipedia is so factually compressed and so non-redundant that production teams accept the memorisation risk for the factual recall benefit)

Step 3: sanity check. $\sum_i p_i = 0.60 + 0.17 + 0.08 + 0.10 + 0.05 = 1.00$. $\sum_i T_i = 8\,880 + 2\,516 + 1\,184 + 1\,480 + 740 = 14\,800$ B, which equals the budget. If either of these fails to balance, the mixture is broken.

Step 4: the mixture entropy. A useful one-number summary of mixture "evenness" is the Shannon entropy $H(p) = -\sum_i p_i \log_2 p_i$. For this mixture: $H \approx 1.72$ bits. Maximum entropy (five domains uniform) is $\log_2 5 \approx 2.32$ bits. Natural proportions give $H \approx 0.55$ bits, very concentrated on web. The DeepSeek-style mix sits closer to uniform than to natural, deliberately.

Step 5: what the gradient sees. Per training batch of 8 sequences, the expected composition is roughly 4.8 web samples, 1.4 code, 0.6 math, 0.8 books, 0.4 wiki. Over 100 batches you see about 64 math sequences: small but non-zero, and over the full run that adds up to the 1.18T math tokens above.
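Steps 4 and 5 are easy to re-derive programmatically. The following is a minimal sketch, not part of the original pipeline: `natural_proportions`, `entropy_bits`, and `batch_stats` are hypothetical helper names, and the corpus sizes and weights are those from the setup above. Besides the expected per-batch counts, it also reports the probability that a batch of 8 contains no sequence at all from a given domain, assuming each sequence is drawn independently.

```python
import math

RAW_B = {"web": 12_000, "code": 600, "math": 150, "books": 300, "wiki": 50}
UPWEIGHTED = {"web": 0.60, "code": 0.17, "math": 0.08, "books": 0.10, "wiki": 0.05}


def natural_proportions(raw_b):
    """p_i = N_i / sum_j N_j."""
    total = sum(raw_b.values())
    return {name: n / total for name, n in raw_b.items()}


def entropy_bits(p):
    """Shannon entropy H(p) = -sum_i p_i log2 p_i, in bits."""
    return -sum(x * math.log2(x) for x in p.values() if x > 0)


def batch_stats(p, batch_size):
    """Per domain: (expected sequences per batch, P(no sequence in the batch))."""
    return {name: (w * batch_size, (1.0 - w) ** batch_size) for name, w in p.items()}


natural = natural_proportions(RAW_B)
uniform = {name: 1.0 / len(RAW_B) for name in RAW_B}

for label, mix in (("natural", natural), ("upweighted", UPWEIGHTED), ("uniform", uniform)):
    print(f"H({label}) = {entropy_bits(mix):.2f} bits")

for name, (expected, p_none) in batch_stats(UPWEIGHTED, 8).items():
    print(f"{name:6s} expected/batch={expected:.2f}  P(absent from batch)={p_none:.2f}")
```

Under the independence assumption, about half of all 8-sequence batches contain no math sequence at all at an 8% weight, which is one reason small domains contribute a noisy per-step gradient.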

Visualizing the Mixture

The sandbox below lets you set the five mixture weights with sliders and immediately see (a) how the 14.8T-token budget gets sliced, (b) how many effective epochs each raw corpus suffers, and (c) the Shannon entropy of the resulting distribution. The three preset buttons — natural, balanced, deepseek — load the three reference mixes discussed above. Watch the math row: natural gives 1.1× epochs but invisible budget share; deepseek gives ~8× epochs and a budget share large enough for capability to emerge.

[Interactive figure: data-mixing visualizer with one weight slider per domain and the presets natural, balanced, and deepseek]

Three things to lock in from playing with the sliders. First, the stacked dominance bar at the top is what the optimizer literally sees — its proportions are the gradient proportions. Second, the per-domain epoch chip turns amber and then red as you push weight into a small corpus; the colour is your replay-risk dashboard. Third, the entropy readout next to the bar is a one-number summary of how concentrated vs. uniform your diet is — and you can land within ~0.1 bits of the published mixes of frontier models by ear.

Plain Python: A Weighted Domain Sampler

Below is the entire mixture-sampling mechanism in plain Python, with no PyTorch and no GPU. Five domains, five raw token counts, a 14.8T-token budget, and a 100k-step empirical check that the sampler reproduces the declared weights. This is the kernel that every production data loader wraps.

mixture_sampler.py, annotated

Five domains, each with a fixed raw budget

The training pipeline produces FIVE separately-curated corpora. raw_B is the number of clean tokens AVAILABLE, and it is a hard ceiling: you cannot draw more from a domain than exists without replaying documents. Web is two orders of magnitude larger than math, which is the imbalance that motivates everything in this chapter.

EXECUTION STATE
DOMAINS[0].raw_B = 12000 (B tokens)
DOMAINS[2].raw_B = 150 (B tokens)

The mixture weights: the most important hyperparameter you have

Five numbers that sum to 1. They decide what fraction of the 14.8T-token training schedule is spent on each domain. Architecture choices give 5–15% loss-curve movement; mixture choices give 30–50%. Pick badly and your model becomes a great web-chat agent that cannot do arithmetic.

EXECUTION STATE
weights = [0.60, 0.17, 0.08, 0.10, 0.05]

Training budget in billions of tokens

DeepSeek-V3 pretrains on 14.8T tokens. This is the total number of tokens the optimizer will consume, summed over ALL domains. The mixture decides how that budget is sliced.

EXECUTION STATE
TRAINING_TOKENS_B = 14800 (B)

Tokens drawn from each domain

Elementwise multiplication. drawn[i] is the number of tokens the optimizer will see from domain i over the entire run. This is a CONCRETE number: 14.8T × 0.08 = 1.184T tokens of math, not 'some math'.

EXECUTION STATE
drawn[0] (web) = 8880.0 B
drawn[2] (math) = 1184.0 B

Effective epochs per corpus: the safety check

How many times does each raw corpus get replayed? Web: 8880/12000 ≈ 0.74× (less than one pass, so safe). Math: 1184/150 ≈ 7.9× (the SAME math docs are seen ~8 times, which is risky because the model can start memorising rare proof patterns). The epochs row is the single most important sanity check on any mixture.

EXECUTION STATE
epochs[0] (web) ≈ 0.74
epochs[2] (math) ≈ 7.89

Categorical sampling from a weighted distribution

Standard inverse-CDF sampler. Draw r ~ Uniform[0,1), walk the cumulative weights, and return the first bucket whose CDF exceeds r. In production the same categorical draw is made with a vectorised call such as torch.multinomial, as in the PyTorch version later in this section: same math, batched.

Empirical frequencies must match weights

Running 100k draws and counting which domain came out should reproduce the weight vector to within a few thousandths. This is the FIRST unit test of any data pipeline: if your empirical p doesn't match your declared weights, the sampler is wrong and every downstream loss-curve story is fiction.

EXECUTION STATE
counts (typical) = [60012, 17034, 7991, 9985, 4978]
```python
import random
random.seed(0)

# Five training domains. "raw_B" is how many BILLIONS of tokens exist
# in each cleaned corpus after dedup + quality filtering.
DOMAINS = [
    {"name": "web",   "raw_B": 12_000},   # filtered Common Crawl
    {"name": "code",  "raw_B":    600},   # GitHub permissive
    {"name": "math",  "raw_B":    150},   # arXiv + proof-pile
    {"name": "books", "raw_B":    300},   # curated long-form
    {"name": "wiki",  "raw_B":     50},   # multilingual reference
]

# Mixture weights (must sum to 1). This is the choice that
# determines the model's "personality" more than any architecture knob.
weights = [0.60, 0.17, 0.08, 0.10, 0.05]
assert abs(sum(weights) - 1.0) < 1e-9

# Training budget in BILLIONS of tokens.
TRAINING_TOKENS_B = 14_800

# 1) How many tokens does each domain DRAW from the schedule?
drawn = [w * TRAINING_TOKENS_B for w in weights]

# 2) How many EFFECTIVE EPOCHS does that imply over each raw corpus?
#    epochs > 1 means the model sees the same document more than once.
epochs = [drawn[i] / DOMAINS[i]["raw_B"] for i in range(len(DOMAINS))]

# 3) Simulate the SAMPLER: at every step we draw one document
#    from one domain with probability proportional to weights[i].
def sample_domain_index(weights):
    r, acc = random.random(), 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1

# 4) Run the sampler for 100k synthetic steps; check empirical frequencies.
counts = [0] * len(DOMAINS)
for _ in range(100_000):
    counts[sample_domain_index(weights)] += 1

for i, d in enumerate(DOMAINS):
    p_emp = counts[i] / 100_000
    print(f"{d['name']:6s}  p={weights[i]:.3f}  empirical={p_emp:.3f}  "
          f"draw={drawn[i]:7.1f}B  epochs={epochs[i]:5.2f}x")
```

Two structural details are worth a second look. First, the epochs row is computed before training launches: it is a static dry-run that tells you whether your mixture is sane. If any $E_i > 4$ on a corpus whose documents are highly repetitive (boilerplate dialogue, scraped FAQs), you are committing to memorisation. Fix the weights before you spend $5M of GPU time finding out.

Second, the empirical-frequency check (step 4 of the listing) is the single most under-rated unit test in this whole pipeline. The sampler can be broken in a dozen subtle ways (an off-by-one in the CDF, accidentally calling random.random() twice per draw, treating weights as logits instead of probabilities), and every one of them shows up as a divergence between declared and empirical p. Run this check on every new sampler implementation before you trust it with a training run.

Sanity check yourself. With weights = [0.6, 0.17, 0.08, 0.10, 0.05] and 100k draws, the empirical frequencies should typically land within $\pm 0.003$ of the declared weights. If yours come back at [0.50, 0.50, 0, 0, 0], your CDF walk is broken, almost always through an off-by-one in the cumulative sum.
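To see the check catch a real bug, the sketch below runs it against a correct inverse-CDF sampler and a deliberately broken one. The helper `max_abs_deviation` and the buggy `broken_sample_index` are hypothetical illustrations, not code from any production loader. The tolerance of 0.01 is an assumption, chosen to be looser than the typical ±0.003 deviation so that the test does not flake on an unlucky seed.

```python
import random

WEIGHTS = [0.60, 0.17, 0.08, 0.10, 0.05]


def sample_index(weights, rng):
    """Correct inverse-CDF walk."""
    r, acc = rng.random(), 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1


def broken_sample_index(weights, rng):
    """Deliberately buggy: compares BEFORE adding w, shifting every bucket by one."""
    r, acc = rng.random(), 0.0
    for i, w in enumerate(weights):
        if r < acc:
            return i
        acc += w
    return len(weights) - 1


def max_abs_deviation(sampler, weights, n_draws=100_000, seed=0):
    """Largest |empirical frequency - declared weight| over all domains."""
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(n_draws):
        counts[sampler(weights, rng)] += 1
    return max(abs(c / n_draws - w) for c, w in zip(counts, weights))


TOL = 0.01  # assumed tolerance, deliberately loose
for name, sampler in (("correct", sample_index), ("broken", broken_sample_index)):
    dev = max_abs_deviation(sampler, WEIGHTS)
    verdict = "PASS" if dev < TOL else "FAIL"
    print(f"{name:8s} max deviation = {dev:.4f}  {verdict}")
```

The broken variant never returns index 0, so its deviation on the web domain is close to 0.6 and the check fails immediately.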

PyTorch: Interleaved DataLoaders at Scale

The production version replaces the toy loop with an IterableDataset per domain and a single MixtureSampler that interleaves them at sample-time. The weights remain a 5-vector in a config file; the on-disk layout never changes when you adjust them.

mixture_pytorch.py, annotated

Each domain is its own IterableDataset

Critical choice: do NOT concatenate corpora on disk. Keep each domain as a separate shard with its own shuffle order, its own packing, its own dedup history. The mixture lives at sample-time so you can change weights without re-running the whole pipeline.

Pre-tokenised, packed token streams

torch.load returns a 1-D integer tensor of token ids, typically billions of them per shard file. (Storing ids as uint16 only works for vocabularies with fewer than 65,536 entries.) We slide a seq_len=4096 window with no gap (document packing). Packing means a single training sequence can cross document boundaries. That is intentional: it amortises the BOS/EOS overhead and keeps GPU utilisation near 1.0.

EXECUTION STATE
ids.shape = (N,) where N ≈ 10⁹
yielded sample = (4096,)

MixtureSampler holds the weight vector

weights is a tensor of K probabilities. This is the live mixture: change it between training runs by editing a YAML, with no data re-processing required. The assert guards against weights that secretly don't sum to 1 (the most common silent bug in this code).

EXECUTION STATE
self.weights = tensor([0.60, 0.17, 0.08, 0.10, 0.05])

Categorical draw with torch.multinomial

Every sample-time step: pick which domain to serve from. multinomial draws an index with probability proportional to the weights. With weights = [0.6, 0.17, 0.08, 0.10, 0.05] and 1M draws you get within ±0.001 of the declared p, which is exactly the empirical check we ran in plain Python.

EXECUTION STATE
k = int in {0..4}

Yield one sample from the chosen domain

iters[k] is a long-running iterator over domain k's shards. We pull one (4096,) tensor and yield it. The DataLoader downstream batches 8 of these into (8, 4096) for the transformer.

Exhaustion = the start of a new epoch over that domain

When a domain is exhausted, we reset its iterator and keep going. Tiny domains (math, wiki) hit this branch many times during training, and every hit is one effective epoch. Production code reshuffles on reset; if you skip the reshuffle, the model sees documents in the SAME order in every epoch and memorises ordering noise.

The mixture is constructed at training time

Notice what we did NOT do: we did not pre-concatenate the corpora with a fixed mixture. The five shards stay independent on disk, the weights are a 5-vector in the training config, and the mixture exists only as live samples streaming through the DataLoader. This separation lets a team run dozens of small mixture-search experiments without re-processing anything on disk.

Standard DataLoader on top of the mixture

DataLoader does NOT care that the underlying dataset is a mixture. It just sees an IterableDataset that yields (4096,) tensors forever. num_workers=4 means four worker processes, each holding its own copy of the dataset. Each copy therefore needs its own random seed and its own slice of every shard (available through torch.utils.data.get_worker_info()); otherwise all four workers emit identical sequences. Each worker assembles whole batches from its own stream and the loader alternates between workers. Because every stream is itself a mixture, every batch of 8 sequences is a mini-mixture, which is exactly what you want for stable gradient estimates.

EXECUTION STATE
batch.shape = (8, 4096)
```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class DomainShard(IterableDataset):
    """One domain's tokenised, packed, shuffled token stream."""
    def __init__(self, name, files, seq_len=4096):
        self.name = name
        self.files = files
        self.seq_len = seq_len

    def __iter__(self):
        # With num_workers > 0 every worker holds a full copy of this dataset,
        # so each worker keeps only every n-th window to avoid duplicate samples.
        info = get_worker_info()
        wid, n_workers = (info.id, info.num_workers) if info is not None else (0, 1)
        window = 0
        for path in self.files:                       # pretend each file is preshuffled
            ids = torch.load(path)                    # (N,) integer token ids
            # "+ 1" keeps the final full window when N is a multiple of seq_len.
            for i in range(0, len(ids) - self.seq_len + 1, self.seq_len):
                if window % n_workers == wid:
                    yield ids[i : i + self.seq_len].to(torch.int64)   # (seq_len,)
                window += 1

class MixtureSampler(IterableDataset):
    """Interleaves K DomainShards with categorical weights."""
    def __init__(self, shards, weights, seed=0):
        assert abs(sum(weights) - 1.0) < 1e-9
        self.shards = shards
        self.weights = torch.tensor(weights, dtype=torch.float32)
        self.seed = seed

    def __iter__(self):
        # Seed per worker so the workers do not all draw the same domain sequence.
        info = get_worker_info()
        wid = info.id if info is not None else 0
        gen = torch.Generator().manual_seed(self.seed + wid)
        iters = [iter(s) for s in self.shards]
        while True:
            # Categorical draw: which domain serves the next sample?
            k = int(torch.multinomial(self.weights, 1, generator=gen))
            try:
                yield next(iters[k])
            except StopIteration:
                # Exhausted: cycle. This is where 'effective epochs > 1' happens.
                # Production code would also reshuffle the shard at this point.
                iters[k] = iter(self.shards[k])
                yield next(iters[k])

shards = [
    DomainShard("web",   ["/data/web/0.pt",   "/data/web/1.pt"]),
    DomainShard("code",  ["/data/code/0.pt"]),
    DomainShard("math",  ["/data/math/0.pt"]),
    DomainShard("books", ["/data/books/0.pt"]),
    DomainShard("wiki",  ["/data/wiki/0.pt"]),
]
mix = MixtureSampler(shards, weights=[0.60, 0.17, 0.08, 0.10, 0.05])

loader = DataLoader(mix, batch_size=8, num_workers=4, pin_memory=True)
for step, batch in enumerate(loader):
    if step == 1000:
        break
    # batch: (8, 4096) int64, fed to the transformer.
    # train_step is the training step, assumed to be defined elsewhere.
    loss = train_step(batch)
```

Three subtleties worth marking, all about how this loop interacts with the rest of a frontier-scale training stack:

  1. Per-worker independence is the throughput win. With num_workers=4, four worker processes each run their own copy of MixtureSampler. Provided each copy gets its own random seed and its own slice of every shard (for example via torch.utils.data.get_worker_info()), each draws its own categorical $k$ per step, so every worker's stream, and every batch built from it, is already a mini-mixture. Without that, every worker replays the identical stream. There is no central coordinator; the law of large numbers does the coordination for you.
  2. Resetting an exhausted shard is where the epoch count goes up. The except StopIteration branch resets the iterator when a domain runs out. On a tiny corpus this happens many times per training run. Two implementation rules: (a) reshuffle the shard order on reset, otherwise the model sees the same document order every epoch and memorises that order; (b) log every reset, because the reset count is the empirical effective-epochs counter and you want it on a dashboard.
  3. The weights live in the config, not in the data. The five shards on disk are never re-mixed. Changing $p$ means editing one line of YAML and relaunching, with no re-tokenisation, no re-shuffling, and no re-packing. This is the single most important engineering decision in the chapter. Pre-mixed datasets force a full re-build for every mixture-search experiment; sample-time mixing makes experimentation essentially free.

Implementation note. DeepSeek and Llama-3 both ship their mixture as two sets of weights: a pretraining mixture used for the first ~90% of the schedule, and an annealing mixture used for the final ~10%. The annealing mixture upweights the highest-quality slices (math, code, peer-reviewed scientific text) and downweights raw web. We cover the annealing schedule in detail in section 8.6 (Curriculum); for now, know that the sample-time interleaving you saw above happily handles a schedule-dependent $p(t)$. You only need to make weights a function of step, not a constant.
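Making the weights a function of step can be as small as the sketch below. The annealing vector `ANNEAL_MIX` is a made-up illustration rather than any lab's published mixture, and `weights_at` is a hypothetical helper name. The hard switch at the 90% mark is also an assumption; a real schedule might interpolate.

```python
PRETRAIN_MIX = [0.60, 0.17, 0.08, 0.10, 0.05]   # web, code, math, books, wiki
ANNEAL_MIX = [0.40, 0.25, 0.20, 0.10, 0.05]     # HYPOTHETICAL annealing weights


def weights_at(step, total_steps, anneal_frac=0.10):
    """Return the mixture in force at `step`.

    The pretraining mix is used until the final `anneal_frac` of the
    schedule, then the annealing mix takes over (hard switch, no ramp).
    """
    anneal_start = round(total_steps * (1.0 - anneal_frac))
    mix = PRETRAIN_MIX if step < anneal_start else ANNEAL_MIX
    assert abs(sum(mix) - 1.0) < 1e-9, "mixture must sum to 1"
    return list(mix)


total = 1_000
for step in (0, 899, 900, 999):
    print(step, weights_at(step, total))
```

A sampler would call `weights_at` with its current step before each categorical draw, or more cheaply once every few thousand steps, since the weights change only once in this schedule.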

At Massive Scale: DoReMi and Mixture Search

Up to this point the weights have been a human choice — informed by the epoch-and-budget arithmetic above, plus folk knowledge from prior training runs. The 2023 DoReMi paper (Xie et al.) showed that you can do meaningfully better by treating mixture choice as a small-scale search problem, then transferring the winning weights to the full run.

The recipe:

  1. Train a small 280M reference model on a default (e.g. uniform) mixture for a fixed number of steps. Record its per-domain losses $\mathcal{L}_i^{\text{ref}}$.
  2. Initialise candidate weights $p^{(0)}$ (e.g. uniform). Train a second 280M proxy under those weights. After each chunk of steps, compute the per-domain excess loss $\Delta_i = \mathcal{L}_i(\theta) - \mathcal{L}_i^{\text{ref}}$.
  3. Update weights with a multiplicative-weights step: $p_i \leftarrow p_i \cdot \exp(\eta \cdot \Delta_i)$, where $\eta$ is a step size, then renormalise so the weights sum to 1. Domains where the proxy is doing the worst (largest excess loss) get more weight; domains where the proxy is beating the reference get less.
  4. After the proxy stabilises, freeze its final $p^{\star}$. Transfer those weights to the real 70B / 671B training run.
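The update in step 3 can be sketched in a few lines. This shows only the multiplicative-weights rule as stated in the recipe above; the published DoReMi algorithm includes further details that this sketch leaves out. The excess-loss values are invented for illustration, and `mw_update` is a hypothetical name.

```python
import math


def mw_update(p, excess_loss, eta=1.0):
    """One multiplicative-weights step: p_i <- p_i * exp(eta * delta_i), renormalised."""
    unnormalised = [pi * math.exp(eta * d) for pi, d in zip(p, excess_loss)]
    z = sum(unnormalised)
    return [u / z for u in unnormalised]


domains = ["web", "code", "math", "books", "wiki"]
p = [0.2] * 5                                  # uniform initial weights
excess = [0.00, 0.05, 0.30, 0.10, -0.02]       # HYPOTHETICAL per-domain excess losses

# In a real run the excess losses are re-measured after each chunk of proxy
# training; they are held fixed here only to keep the sketch self-contained.
for _ in range(3):
    p = mw_update(p, excess)

for name, w in zip(domains, p):
    print(f"{name:6s} {w:.3f}")
```

With these made-up numbers the math weight grows fastest, because it has the largest excess loss, and the wiki weight shrinks, because the proxy already beats the reference there.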

DoReMi's reported gain on The Pile is a 6.5-point improvement in average few-shot downstream accuracy over the default domain weights, reaching the baseline's accuracy in 2.6× fewer training steps, with no model-architecture change. A speed-up of that size is comparable to what a generation of architectural improvement delivers, and the cost was a pair of 280M runs (reference and proxy), a tiny fraction of one full training launch.

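| Model | Web | Code | Math + Sci | Books + Long-form | Wiki + Reference | Mixture entropy (bits) |
| --- | --- | --- | --- | --- | --- | --- |
| Natural proportions | 0.92 | 0.05 | 0.01 | 0.02 | 0.004 | ≈ 0.55 |
| GPT-3 (illustrative) | 0.60 | 0.16 | 0.03 | 0.16 | 0.05 | ≈ 1.66 |
| Llama-2 (illustrative) | 0.67 | 0.08 | 0.03 | 0.18 | 0.04 | ≈ 1.46 |
| DeepSeek-V3 (illustrative) | 0.60 | 0.17 | 0.08 | 0.10 | 0.05 | ≈ 1.72 |
| DoReMi-optimal (Pile) | 0.34 | 0.20 | 0.18 | 0.20 | 0.08 | ≈ 2.19 |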

Two patterns to read off the table. First, every frontier mix listed is closer to uniform (high entropy) than to natural proportions: the field has converged on "upweight the rare, downweight the abundant" as a non-negotiable. Second, the DoReMi-optimal row is even closer to uniform than any human-picked mix, which suggests that humans systematically underweight rare-but-valuable corpora when picking by eye.

The interaction with scaling laws

Chinchilla-style scaling laws tell you the compute-optimal model-size-to-token-count ratio, but they assume an iid training distribution. Once you mix domains with different intrinsic difficulty, the scaling exponent per domain differs: math and code follow steeper power laws than web. The practical consequence is that the compute-optimal $p$ at a 70B model is not the same as at a 7B model: larger models can extract more value from heavier weight on the dense, hard domains because they have the capacity to fit them. Mixture choice is therefore scale-dependent, and proxy-based search at 280M can systematically under-recommend hard slices for a 70B target. This is an active research area as of 2025.

Engineering Reality and Gotchas

Mixture weighting looks like a clean math problem. Three production failure modes earn their flags:

  1. Effective epochs that look fine in expectation can be catastrophic per-document. A domain with $E = 8$ means each document is replayed roughly 8 times on average, but if your shuffle is weak, a popular subset (the most common code repositories, the most common math textbooks) gets replayed 30× while the long tail gets replayed 2×. The mean is fine; the variance is fatal. Always verify shuffle quality before launching, and reshuffle on every reset.
  2. Tokens-seen drift between declared and actual. Across hundreds of training steps, the empirical mixture you served to the GPUs can drift by 1–2 percentage points from the declared weights, usually due to non-uniform shard sizes interacting with num_workers. The fix is to log a histogram of domain-of-origin tags on every batch and add an alert if any empirical $\hat{p}_i$ drifts more than 0.5pp from the declared $p_i$. We have seen a 70B run waste 200B tokens of effective compute on this bug alone.
  3. Evaluation set contamination from upweighted corpora. The small, high-quality corpora you most want to upweight (math textbooks, code from popular repositories) are also the corpora most likely to overlap with public benchmarks (GSM8K, HumanEval, MMLU). At 8× replay, even a single contaminating document is seen eight times: easy to memorise, impossible to detect post-hoc. Contamination filtering (section 8.7) must run per-domain and must be re-run every time you bump a mixture weight upward.

How DeepSeek validates a mixture before launch. Three cheap pre-flight checks: (a) the empirical-frequency unit test on the sampler (frequencies must match weights to $\pm 0.003$); (b) a 1B-token proxy run that measures per-domain validation loss every 100M tokens, where flat or rising loss on any domain is a launch-blocker; (c) a contamination scan against the eval suite that grows with every weight increase. Skipping any of the three is how labs end up burning $1M of GPU time to discover a sign error in their YAML.
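The drift alert from the second failure mode can be sketched as a small counter. This assumes every served sequence carries a domain-of-origin tag, which the samplers shown earlier do not emit and would need to be extended to provide. `MixtureDriftMonitor` is a hypothetical class, not part of any library.

```python
from collections import Counter


class MixtureDriftMonitor:
    """Tracks served domain frequencies against the declared mixture."""

    def __init__(self, declared, tol_pp=0.5):
        self.declared = dict(declared)     # domain name -> declared weight
        self.tol = tol_pp / 100.0          # percentage points -> fraction
        self.counts = Counter()

    def update(self, batch_domains):
        """batch_domains: iterable of domain names, one per served sequence."""
        self.counts.update(batch_domains)

    def drift(self):
        """Return {domain: empirical - declared} for domains outside tolerance."""
        total = sum(self.counts.values())
        if total == 0:
            return {}
        flagged = {}
        for name, p in self.declared.items():
            delta = self.counts[name] / total - p
            if abs(delta) > self.tol:
                flagged[name] = delta
        return flagged


declared = {"web": 0.60, "code": 0.17, "math": 0.08, "books": 0.10, "wiki": 0.05}
monitor = MixtureDriftMonitor(declared)

# Serve exactly the declared proportions: nothing should be flagged.
monitor.update(["web"] * 600 + ["code"] * 170 + ["math"] * 80
               + ["books"] * 100 + ["wiki"] * 50)
print("after balanced traffic:", monitor.drift())

# Simulate a loader bug that over-serves web.
monitor.update(["web"] * 100)
print("after web over-serving:", monitor.drift())
```

In a training loop `update` would be called once per batch and `drift` checked on a logging interval, with any non-empty result raised as an alert.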

The one sentence to carry forward: a frontier model's capabilities are decided more by the five numbers in its mixture vector than by any other single hyperparameter, and those five numbers are picked, not discovered, which is why every serious lab now treats mixture search as a first-class optimisation rather than an afterthought.

Where we go from here. Section 8.5 takes the next natural step: when the available data is not enough — or not dense enough — frontier labs synthesise additional training data from existing models. We will see when synthetic data helps, when it causes mode collapse, and how DeepSeek uses LLM-generated math problems to push the math weight from 8% to an effective 12% without running the small raw corpus past memorisation thresholds.