Section 8.2 deduplicated the raw web crawl down to roughly 15 trillion tokens. What remained was still mostly junk. The web is unfiltered in a way that does not survive contact with a learning machine: navigation bars, machine-translated nonsense, autogenerated SEO farms, broken Unicode, copy-pasted boilerplate, link spam, and pages whose only content is a cookie banner. Train a model on the deduped crawl as-is and you will burn months of GPU time learning the statistics of garbage. Quality filtering is the engineering practice that decides which documents are worth a model's attention, and at the 14.8T-token scale of DeepSeek-V3 it is the single largest lever on final model quality short of architecture itself.
What this section delivers. The three-stage cascade every modern pretraining stack uses — heuristic rules, perplexity scoring under a reference language model, and a learned binary classifier — derived from the problem each stage solves, computed on a toy corpus by hand, implemented in plain Python and then PyTorch, and finally connected to the engineering reality of running it on a petabyte of raw crawl per week.
The Junk Problem at 15 Trillion Tokens
Common Crawl publishes a snapshot of the open web every few weeks. A single snapshot is about 3.5 PB compressed, roughly 90B documents, and somewhere on the order of 50T tokens after a permissive HTML→text extraction. After the upstream URL and language-ID filtering and the dedup pass from section 8.2, about 15T tokens survive, but by document-level eyeballing only 15–25% of them are actually useful for teaching a language model.
The other 75–85% is not random noise. It is structured noise: real strings that real humans (or scrapers) put on the web for reasons that have nothing to do with conveying knowledge. Three families dominate.
| Junk family | What it looks like | Why it kills training |
|---|---|---|
| Boilerplate | Navbars, cookie banners, footer link farms, "404 Page Not Found" | Same 200 tokens repeated billions of times — the model wastes capacity memorizing them |
| Machine-generated SEO | Auto-spun product descriptions, template-filled review pages, translated-then-back-translated articles | Plausible English with no information content — pollutes the model's factual posterior |
| Encoding / extraction failures | Mojibake (Ã© where é was meant), broken Unicode, HTML entities leaked through, JSON / CSS dumps | Teaches the tokenizer wrong byte patterns and corrupts the embedding table |
The cost of not filtering is not subtle. The original GPT-3 paper notes a rough compute equivalence between adding curated data and growing the model: every junk token consumed is a token's worth of model capacity wasted. Llama 2 reports that the Wikipedia-classifier filter contributed more to final downstream MMLU score than doubling the deduplication aggressiveness did. DeepSeek-V3 explicitly attributes a large share of its ability to punch above its active-parameter count to aggressive quality filtering, not architecture.
Intuition: A Cascade of Increasingly Expensive Sieves
The temptation is to write one big “is this doc good?” function. Resist it. Quality filtering is built as a cascade of independent stages because the cost-per-doc rises by orders of magnitude as the signal-per-doc rises. Each stage's job is to make the next stage cheaper.
| Stage | Cost / doc | What it catches | What it cannot catch |
|---|---|---|---|
| 1. Heuristic rules | ~5 µs (string scans) | Empty pages, link farms, mojibake, runaway docs | Plausible but vapid text — heuristics see only surface stats |
| 2. Reference-LM perplexity | ~500 µs (5-gram lookup) | Off-distribution prose (gibberish, machine translation artifacts) | Topic / register mismatch — a Wikipedia LM thinks all news is "weird" |
| 3. Learned binary classifier | ~5 ms (fastText / tiny transformer) | Subtle register: textbook vs. blog vs. forum vs. spam | Anything outside its training distribution; the classifier inherits its labels' biases |
The cascade is what makes the budget arithmetic close. Run the classifier on every raw doc and you pay its ~5 ms on all of them, roughly a thousand times the cost of a heuristic pass over the same documents. Run heuristics first to filter out 80% of the crawl, then perplexity to remove another 50% of what is left, and the classifier only sees about 10% of the docs, a budget that fits on a single CPU cluster overnight. Re-ordering the cascade is the difference between a tractable pipeline and an intractable one.
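The budget argument above can be checked with a few lines of arithmetic. The sketch below plugs in the per-document costs from the stage table and the 90B-document snapshot size quoted earlier; the function name and the 0.7 classifier survival rate are illustrative assumptions, and the result is single-core CPU time under those idealised figures, not a measured bill.

```python
# Per-document CPU cost in seconds, taken from the stage table above.
COST_S = {"heuristic": 5e-6, "perplexity": 500e-6, "classifier": 5e-3}


def cascade_cpu_hours(n_docs, stages):
    """Cost of running `stages` in order with hard rejection.

    stages: list of (stage_name, survival_rate) pairs.
    Returns (cpu_hours_per_stage, docs_surviving_all_stages).
    """
    hours = {}
    remaining = float(n_docs)
    for name, survival in stages:
        # Each stage only pays for the documents that survived the previous ones.
        hours[name] = remaining * COST_S[name] / 3600.0
        remaining *= survival
    return hours, remaining


N_DOCS = 90e9  # documents per snapshot, as quoted earlier in this section

# Cheap-first ordering. The 0.7 classifier survival rate is an assumption.
cheap_first, _ = cascade_cpu_hours(
    N_DOCS, [("heuristic", 0.2), ("perplexity", 0.5), ("classifier", 0.7)]
)
# Same stages and survival rates, most expensive stage first.
expensive_first, _ = cascade_cpu_hours(
    N_DOCS, [("classifier", 0.7), ("perplexity", 0.5), ("heuristic", 0.2)]
)

for label, hours in (("cheap first", cheap_first), ("expensive first", expensive_first)):
    print(f"{label:>15}: {sum(hours.values()):>10,.0f} CPU-hours")
```

Both orderings keep the same documents, since the stages are independent thresholds; only the bill changes. Under these assumptions the cheap-first ordering costs roughly a ninth of the expensive-first one.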
The Math of a Quality Filter
Each stage produces a real-valued score for every document. Let $d$ be a document and let $s_1(d)$, $s_2(d)$, $s_3(d)$ be the heuristic, perplexity, and classifier scores. The pipeline applies three thresholds $\tau_1, \tau_2, \tau_3$ and keeps the document iff all three pass:
$$\mathrm{keep}(d) = \mathbb{1}[s_1(d) \ge \tau_1] \cdot \mathbb{1}[s_2(d) \le \tau_2] \cdot \mathbb{1}[s_3(d) \ge \tau_3]$$
Two of the inequalities point the way you expect (more of this is better) and one (perplexity) points the other way (lower perplexity = more reference-like = better). $\mathbb{1}[\cdot]$ is the indicator function: 1 if the predicate is true, 0 otherwise.
Stage 1: heuristic score
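The simplest workable form is the alpha ratio, the fraction of a document's characters that are letters: $s_1(d) = \alpha(d) = \frac{\#\,\text{letters in } d}{\#\,\text{characters in } d}$. Real prose lands around 0.8 or higher; symbol-heavy junk lands well below it (both junk documents in the walkthrough below are dropped on this rule alone). A threshold of 0.6 is the standard starting point. Production pipelines stack a dozen such rules (mean word length, line repetition rate, ratio of bullet points, fraction of lines ending in terminal punctuation, etc.). Each is a single O(N) string pass and runs before anything else.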
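A stack of such rules might look like the sketch below. Only the 0.6 alpha-ratio threshold comes from the text; the other thresholds, the rule names, and the function names are illustrative assumptions that a real pipeline would tune against labelled samples.

```python
def heuristic_signals(text):
    """Compute cheap surface statistics in a few O(N) passes over the text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    words = text.split()
    n_lines = max(len(lines), 1)
    return {
        # Fraction of characters that are letters.
        "alpha_ratio": sum(ch.isalpha() for ch in text) / max(len(text), 1),
        "mean_word_len": sum(len(w) for w in words) / max(len(words), 1),
        # Share of non-empty lines that repeat an earlier line.
        "dup_line_frac": 1.0 - len(set(lines)) / n_lines if lines else 0.0,
        "bullet_frac": sum(ln.startswith(("-", "*", "•")) for ln in lines) / n_lines,
        "terminal_punct_frac": sum(ln.endswith((".", "!", "?", '"')) for ln in lines) / n_lines,
    }


# (signal name, predicate that must hold). Thresholds other than 0.60 are assumptions.
RULES = [
    ("alpha_ratio", lambda v: v >= 0.60),
    ("mean_word_len", lambda v: 3.0 <= v <= 10.0),
    ("dup_line_frac", lambda v: v <= 0.30),
    ("bullet_frac", lambda v: v <= 0.90),
    ("terminal_punct_frac", lambda v: v >= 0.10),
]


def first_failed_rule(text):
    """Return the name of the first rule the document fails, or None if it passes all."""
    signals = heuristic_signals(text)
    for name, ok in RULES:
        if not ok(signals[name]):
            return name
    return None


prose = ("Quality filtering decides which documents a model sees.\n"
         "It runs cheap rules first and expensive models last.")
navbar = "Home\nAbout\nContact\nHome\nAbout\nContact"
spam = "BUY NOW!!! Click here >>> http://spam.tld/win <<<"

for doc in (prose, navbar, spam):
    print(first_failed_rule(doc))
```

Returning the name of the failed rule instead of a bare boolean keeps the drop reason available for logging, which matters once rules start to interact.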
Stage 2: perplexity under a reference LM
Train a small n-gram language model on a curated reference corpus (Wikipedia + books works well). Score document $d$, with words $w_1, \dots, w_N$, by its perplexity, which is the exponentiated average negative log-probability per word:
$$s_2(d) = \mathrm{PP}(d) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{i-n+1}, \dots, w_{i-1})\right)$$
Interpretation. Perplexity is the “effective vocabulary size” the reference LM perceives at each step. A low $\mathrm{PP}$ means the LM is consistently unsurprised by the next word, i.e. the document is similar to what the LM was trained on. A high $\mathrm{PP}$ means the LM is constantly surprised: the document either uses words the reference never saw (gibberish, foreign language, code-switched text) or strings them together in ways the reference never saw (machine translation, autogen). The perplexity threshold is set per-domain because domain-specific text legitimately scores higher under a Wikipedia LM than Wikipedia does.
Stage 3: learned binary classifier
Train a fastText-style classifier with reference text (Wikipedia + a few curated sources) as the positive class and an unfiltered random crawl as the negative class. The model is a single embedding table with mean-pooling and a 2-class linear head, about 10M parameters. Score documents by the predicted probability of the “reference” class:
$$s_3(d) = \sigma\!\left(W \cdot \frac{1}{N}\sum_{i=1}^{N} E[w_i]\right)_{\text{ref}}$$
where $E$ is the embedding table, $W$ is the classifier head, $\sigma$ is the softmax, and the subscript selects the “reference” class. The classifier picks up register and topic signals that n-gram perplexity cannot: it distinguishes “textbook-quality explanation” from “competent blog post” from “auto-spun marketing copy.”
Why the indicator-product form is exact, not a heuristic
Treating the three stages as independent thresholds is a deliberate modelling choice. You could imagine a soft-AND instead, for example a single weighted combination of the three scores compared against one threshold. Production pipelines do not do this, for two reasons. First, the independence assumption maps cleanly to the cascade ordering: each stage's threshold can be tuned in isolation against a held-out labelled set. Second, hard rejection at each stage is what makes the cost arithmetic work, because a soft-AND would force every stage to run on every document.
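The cost side of that argument can be made concrete by counting how often each stage actually runs under hard rejection. This is a minimal sketch: the per-document scores are hypothetical precomputed values, and `run_cascade` and `STAGES` are illustrative names. The thresholds are the ones used in the walkthrough below (0.6, 60, 0.55).

```python
from collections import Counter

# (stage name, score lookup, pass predicate).
STAGES = [
    ("heuristic", lambda d: d["s1"], lambda s: s >= 0.60),
    ("perplexity", lambda d: d["s2"], lambda s: s <= 60.0),
    ("classifier", lambda d: d["s3"], lambda s: s >= 0.55),
]


def run_cascade(docs, stages=STAGES):
    """Hard AND: stop scoring a document at the first stage it fails."""
    calls = Counter()
    kept = []
    for doc in docs:
        for name, score_fn, passes in stages:
            calls[name] += 1  # this stage was actually executed for this doc
            if not passes(score_fn(doc)):
                break
        else:
            kept.append(doc["id"])  # no stage rejected the document
    return kept, calls


# Hypothetical scores for five documents; in a real pipeline the later scores
# would not even be computed for documents rejected early.
DOCS = [
    {"id": 1, "s1": 0.84, "s2": 110.0, "s3": 0.80},
    {"id": 2, "s1": 0.50, "s2": 900.0, "s3": 0.10},
    {"id": 3, "s1": 0.84, "s2": 54.0, "s3": 0.70},
    {"id": 4, "s1": 0.45, "s2": 2000.0, "s3": 0.05},
    {"id": 5, "s1": 0.85, "s2": 1300.0, "s3": 0.85},
]

kept, calls = run_cascade(DOCS)
print(kept, dict(calls))
```

With these scores the heuristic runs five times, perplexity three times, and the classifier once. A soft-AND over the same documents would need all fifteen stage evaluations before it could decide anything.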
Manual Numerical Walkthrough
Five documents. Three filter stages. By hand. The point is to see that the numbers for the cascade are not mysterious — they are mean-ratios, log-sums, and one matrix-vector product.
The corpus.
- Doc 1: "Deep learning is the study of neural networks trained on data" (61 chars, 11 words)
- Doc 2: "BUY NOW!!! Click here >>> http://spam.tld/win <<<" (49 chars, 7 tokens)
- Doc 3: "the quick brown fox jumps over the lazy dog repeatedly forever" (62 chars, 11 words)
- Doc 4: "asdf qwer zxcv 1234 //// :::: ____ random gibberish line" (56 chars, 10 tokens)
- Doc 5: "Transformers replaced recurrent models for language tasks in 2017" (65 chars, 9 words)
Reference unigram. Train on the 23-word reference corpus from the code below. Notable probabilities: $p(\text{the}) = 4/23 \approx 0.174$; every other in-vocabulary word, $1/23 \approx 0.0435$; OOV smoothing floor $10^{-4}$.
Stage 1 (alpha ratio, threshold 0.6).
| Doc | Letters | Total chars | α = s₁ | α ≥ 0.60? |
|---|---|---|---|---|
| 1 | 51 | 61 | 0.84 | ✓ |
| 2 | 29 | 49 | 0.59 | ✗ DROP |
| 3 | 52 | 62 | 0.84 | ✓ |
| 4 | 31 | 56 | 0.55 | ✗ DROP |
| 5 | 53 | 65 | 0.82 | ✓ |
Two of five docs are dropped at the cheapest stage. Docs 2 and 4 never see another instruction. The downstream cost is now 3/5 of what it would have been.
Stage 2 (perplexity, threshold 60). Compute log-prob word-by-word for the three survivors.
Doc 1: deep / learning / is / the / study / of / neural / networks / trained / on / data (11 tokens). Reference contains: deep, learning, is, the, study, of, neural, networks. Out of vocab: trained, on, data. Sum of log-probs: $7\ln(1/23) + \ln(4/23) + 3\ln(10^{-4}) \approx -21.95 - 1.75 - 27.63 = -51.33$. Average: $-51.33 / 11 \approx -4.67$. So $\mathrm{PP} = e^{4.67} \approx 106$. 106 > 60: DROP. The unigram reference is too small to reward this doc; in production a 5-gram LM trained on billions of words would happily accept it.
Doc 3: Nine of the 11 tokens are in the reference (the, quick, brown, fox, jumps, over, the, lazy, dog); “repeatedly” and “forever” are OOV. So 9 in-vocab + 2 OOV. Each in-vocab word has $p = 1/23$ except “the”, which has $p = 4/23$. Sum: $7\ln(1/23) + 2\ln(4/23) + 2\ln(10^{-4}) \approx -21.95 - 3.50 - 18.42 = -43.87$. Avg: $-43.87 / 11 \approx -3.99$. $\mathrm{PP} = e^{3.99} \approx 54$. 54 < 60: KEEP.
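Doc 5: transformers / replaced / recurrent / models / for / language / tasks / in / 2017 (9 tokens). Only “for”, “language”, and “tasks” are in the reference (3 of 9 in-vocab). Sum: $3\ln(1/23) + 6\ln(10^{-4}) \approx -9.41 - 55.26 = -64.67$. Avg: $-64.67 / 9 \approx -7.19$. $\mathrm{PP} = e^{7.19} \approx 1320$. 1320 > 60: DROP. Again, the tiny reference is unfair. In production this doc would pass.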
Stage 3 (classifier, threshold 0.55). Only doc 3 survives. Pretend the trained classifier assigns doc 3 a “reference” probability above the 0.55 threshold: KEEP.
Final retention. 1 of 5 docs kept = 20% document retention. Aggregate token retention is about 23% (doc 3's 11 tokens out of the corpus total). That feels low, and on this toy reference it is, because the reference corpus is 23 words. The real lesson is the arithmetic of the cascade: stage 1 removed 40% of docs at cost 5µs each; stage 2 removed 67% of survivors at cost ~500µs each; stage 3 ran on only 1 of 5 docs. That cost-shape is the cascade's entire point.
Calibration check. A real 5-gram reference LM trained on 5B Wikipedia words would assign docs 1 and 5 a low perplexity and doc 3 a high one (the fox sentence looks weird to Wikipedia). Under that LM, docs 1 and 5 would pass perplexity and doc 3 would drop, the opposite of what our toy reference produced. Filter quality is entirely a function of how representative your reference LM is. This is the most underappreciated rule in pretraining data engineering.
Visualizing the Three-Stage Pipeline
The visualizer below runs a synthetic 160-document corpus through the same three-stage cascade. Each bubble is one document; the bubble radius is the doc's token count. Drag any of the three threshold sliders and watch the cohort of survivors (green) shrink or expand, and the cohorts dropped at each stage (red / amber / violet) flow into the trash zone. The stats panel reports the same numbers a production pipeline logs after every shard.
Three behaviours to lock in by playing with the sliders. First, drop the heuristic threshold to 0.3: most boilerplate-looking docs leak through, and the average quality of survivors falls sharply. Second, drop the perplexity cutoff from 110 to 50 (much stricter): token retention craters and the kept corpus becomes uniformly Wikipedia-like, which is precisely the “mode collapse” failure mode described under Engineering Reality and Failure Modes below. Third, raise the classifier threshold above 0.7: the kept pile becomes tiny but its average quality reaches the ceiling. There is no setting that simultaneously maximizes retention and quality. The job is to find the right point on that frontier for the downstream model's objective.
Plain Python: Heuristic + Perplexity Filter
Both stages of the cheap half of the cascade in self-contained Python — no PyTorch, no KenLM. The reference LM is a unigram trained on three sentences, the heuristic is a handful of guard clauses, and the pipeline is a guard-clause function that returns both the keep / drop decision and the reason. Run it on five toy documents to see every branch fire.
Two implementation choices in this code generalise to every production pipeline. First, every stage returns both a decision and a reason. The reasons accumulate into a histogram per shard and per source domain, and that histogram is what the data team stares at every morning. Second, the perplexity computation works in log space and uses a smoothing floor — without those two details, real documents would either underflow to zero probability or return NaN on the first OOV token.
Sanity check. If you replace the unigram with a 5-gram KenLM trained on 5B Wikipedia words, the same five-doc corpus produces very different perplexity numbers, and the ranking of the documents can change with them. Absolute perplexities are not comparable across reference LMs, which is why every team re-tunes thresholds the moment they swap reference LMs.
PyTorch: Batched fastText-Style Classifier Scoring
The expensive third stage. A fastText-shaped classifier — embedding table + mean-pool + linear head — scored in batches on the GPU. The function returns one score per document; the threshold is applied by the caller. Variable-length docs are padded to the longest in each batch and a boolean mask keeps pad tokens from polluting the mean pool.
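A sketch of the scoring path just described, assuming PyTorch is available. The class and function names, the toy vocabulary and embedding sizes, and the choice of class index 1 as “reference” are all assumptions. The weights here are untrained, so the scores only demonstrate the mechanics.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumed padding id


class FastTextStyleClassifier(nn.Module):
    """Embedding table -> masked mean-pool -> 2-class linear head."""

    def __init__(self, vocab_size, dim=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim, padding_idx=PAD_ID)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, ids, mask):
        # ids: (B, T) int64 token ids; mask: (B, T) bool, True on real tokens.
        vecs = self.emb(ids) * mask.unsqueeze(-1)             # zero out pad positions
        lengths = mask.sum(dim=1, keepdim=True).clamp(min=1)  # avoid divide-by-zero
        pooled = vecs.sum(dim=1) / lengths                    # mean over real tokens only
        return self.head(pooled)                              # (B, n_classes) logits


@torch.no_grad()
def score_batch(model, docs, device="cpu"):
    """docs: non-empty list of token-id lists. Returns a (B,) tensor of P(reference)."""
    model.eval()
    lengths = torch.tensor([len(d) for d in docs])
    max_len = max(int(lengths.max()), 1)
    ids = torch.full((len(docs), max_len), PAD_ID, dtype=torch.long)
    for i, d in enumerate(docs):
        if d:
            ids[i, : len(d)] = torch.tensor(d, dtype=torch.long)
    # Build the mask from lengths so a real token that equals PAD_ID is still counted.
    mask = torch.arange(max_len).unsqueeze(0) < lengths.unsqueeze(1)
    logits = model(ids.to(device), mask.to(device))
    return torch.softmax(logits, dim=-1)[:, 1].cpu()  # index 1 = "reference" (assumed)


torch.manual_seed(0)
model = FastTextStyleClassifier(vocab_size=100)
alone = score_batch(model, [[5, 6, 7]])
padded = score_batch(model, [[5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8]])
print(alone[0].item(), padded[0].item())
```

The two printed scores should agree up to floating-point noise: the same document gets the same score whether it is scored alone or padded next to a longer one, which is the property the mask exists to guarantee.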
Three subtleties about how this function plugs into the rest of the pretraining stack are worth marking explicitly.
- Token IDs in, scores out — text is never reprocessed. The streaming tokenizer (chapter 3) is the only component that touches raw text. Every downstream filter and the model itself consumes int64 IDs. This decoupling means swapping classifiers re-runs in hours, not days.
- The classifier is tiny on purpose. ~10M parameters, one matmul per doc. A bigger classifier (a small transformer) would score better per doc but would dominate the pipeline cost — for minimal gain because by stage 3 the surviving documents are already mostly OK. The right place to spend capacity is the model being trained, not the gatekeeper at the door.
- The 0.55 threshold is empirical, not principled. It is the value that maximised downstream MMLU/HumanEval/GSM8K on a scaled-down 1B-parameter pilot. Every model size and every downstream eval suite has its own optimum. Production teams re-tune this threshold per snapshot, sometimes per domain.
At Massive Scale: 15T → 14.8T Tokens
Numbers from public reports. None of the major frontier models publish the exact filter settings, but the order-of-magnitude shape of the funnel is consistent across DeepSeek-V3, Llama 3, RedPajama-V2, and Falcon.
| Stage | Tokens in | Survival rate | Tokens out | Compute cost |
|---|---|---|---|---|
| Raw Common Crawl (1 snapshot) | ~50 T | — | ~50 T | 0 (free download) |
| URL / language ID filter | 50 T | ~60% | 30 T | ~10k CPU-hours |
| Dedup (section 8.2) | 30 T | ~50% | 15 T | ~100k CPU-hours |
| Heuristic (stage 1) | 15 T | ~70% | 10.5 T | ~5k CPU-hours |
| Perplexity (stage 2) | 10.5 T | ~70% | 7.4 T | ~50k CPU-hours |
| Classifier (stage 3) | 7.4 T | ~70% | 5.2 T per snapshot | ~200k CPU-hours + 1k GPU-hours |
| Aggregate (multiple snapshots) | — | — | ~14.8 T | — |
Two things to read from the table. First, every quality stage removes roughly a third of what it sees. Compounded over the three quality stages that is $0.7^3 \approx 34\%$ retention from deduped input to final corpus: almost two-thirds of the deduped tokens are rejected for quality reasons. Second, even behind two cheaper sieves the classifier is the largest line item in the table (~200k of roughly 365k CPU-hours), ahead of dedup and perplexity. The cascade ordering pays for itself by halving the classifier's input: it scores 7.4T tokens instead of the 15T it would see if it ran first.
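The compounding in the three quality rows can be reproduced in a few lines. The sketch uses the table's approximate 70% survival rates; the function name is illustrative, and the unrounded results differ slightly from the table's rounded 7.4T and 5.2T entries.

```python
# Survival rates for the three quality stages, as listed in the table (approximate).
QUALITY_STAGES = [("heuristic", 0.70), ("perplexity", 0.70), ("classifier", 0.70)]


def run_funnel(tokens_in_t, stages):
    """Apply each stage's survival rate in order. Token counts are in trillions."""
    rows = []
    tokens = tokens_in_t
    for name, survival in stages:
        tokens_out = tokens * survival
        rows.append((name, tokens, tokens_out))
        tokens = tokens_out
    return rows, tokens


DEDUPED_T = 15.0  # trillions of tokens entering stage 1
rows, final_t = run_funnel(DEDUPED_T, QUALITY_STAGES)
retention = final_t / DEDUPED_T

for name, t_in, t_out in rows:
    print(f"{name:>10}: {t_in:6.2f} T -> {t_out:6.2f} T")
print(f"retention from deduped input: {retention:.1%}")
```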
The 14.8T number comes from many snapshots, not one
A single Common Crawl snapshot yields ~5T tokens after quality filtering. DeepSeek-V3 reports 14.8T training tokens, which comes from accumulating across multiple snapshots plus curated sources (the next-section topic of data mixing). The filtered output of one snapshot is not enough; the filtered output of many snapshots is what feeds a frontier-scale run.
Why filter quality matters more for MoE than for dense
Mixture-of-experts models (chapter 5) have many experts, each specialising on a slice of the input distribution. A junk-heavy training set creates a junk-handling expert that the router learns to send certain inputs to — and that expert's capacity is now permanently wasted. Dense models have no such routing; junk is averaged into every weight. As a result MoE models are more sensitive to data quality at fixed token count: the DeepSeek paper explicitly notes that tightening the classifier threshold by 5% gave them a larger downstream-eval bump than a comparable dense baseline saw.
Filtering interacts with the tokenizer
Your reference LM and your classifier both score documents as token IDs, so their scores are tied to whichever tokenizer produced those IDs (chapter 3). When you change the main-model tokenizer mid-project (rare but it happens) the filter scores do not automatically change with it: you are filtering on the old distribution and training on the new one. Teams have caught this exact bug 6 months into a run when downstream scores plateau unexpectedly; the fix is to re-score the entire corpus with a classifier rebuilt under the new tokenization.
Engineering Reality and Failure Modes
The cascade looks tidy in writing and is anything but in practice. Five failure modes recur across every team that has shipped a pretraining pipeline.
- Mode collapse toward the reference distribution. Over-strict perplexity cutoffs (or a classifier with too narrow a reference) produce a corpus that looks uniformly like Wikipedia. The model trained on it sounds like an encyclopaedia and fails on conversational, code, and creative tasks. Fix: per-domain cutoffs, and a small explicit budget of “diverse” data that bypasses stage 3.
- Classifier label drift. The classifier is trained on a fixed pair of (reference, crawl) sets. The crawl distribution drifts every quarter (new platforms, new spam formats). After a year, the “crawl” class no longer represents the current crawl, and the classifier's decisions become unreliable in ways that are invisible to its accuracy on the original validation set. Fix: retrain the classifier per snapshot, monitor score distributions for distribution shift.
- Language-specific failures. A perplexity LM trained on English Wikipedia gives nonsensically high perplexity to perfectly good Mandarin or Hindi prose. If language ID is buggy upstream, you silently delete entire languages. Fix: a separate reference LM per language, gated by a language-ID classifier upstream — and per-language retention dashboards that scream when a language drops off a cliff.
- Eval contamination via the reference set. If your reference corpus contains MMLU questions (because Wikipedia does), the classifier learns to prefer documents that look like MMLU questions, and your training set ends up enriched for MMLU-leakage patterns. Section 8.7 covers contamination detection; the relevant rule here is to scrub the reference corpus of any eval data before using it as the classifier's positive class.
- The “long tail of legitimately weird” problem. Source code, math papers with heavy LaTeX, chemistry formulas, and tabular data all score badly under a prose-trained reference LM but are exactly the domains the downstream model needs to learn. The fix is the next section's topic: domain-aware data mixing, where each domain has its own filter and its own weighting in the final training mixture. Quality filtering is per-domain; data mixing is how the per-domain corpora are combined.
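Several of the fixes above come down to watching score distributions between snapshots. One common way to do that is the population stability index (PSI) over binned classifier scores, sketched below. PSI is swapped in here as one possible drift metric, not something the pipeline above is stated to use, and the bin count, the `1e-6` floor for empty bins, and the 0.2 alert level are illustrative assumptions.

```python
import math


def binned_fractions(scores, n_bins=10, eps=1e-6):
    """Histogram of scores in [0, 1] as fractions, floored at eps to avoid log(0)."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(max(int(s * n_bins), 0), n_bins - 1)  # clamp into a valid bin
        counts[idx] += 1
    total = max(len(scores), 1)
    return [max(c / total, eps) for c in counts]


def psi(baseline_scores, current_scores, n_bins=10):
    """Population stability index between two score samples (0 = identical histograms)."""
    p = binned_fractions(baseline_scores, n_bins)
    q = binned_fractions(current_scores, n_bins)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


ALERT_LEVEL = 0.2  # assumed alert threshold; tune per pipeline

# Toy data: last snapshot's scores are spread over [0, 1); this snapshot's
# scores have shifted into the upper half of the range.
last_snapshot = [i / 100 for i in range(100)]
this_snapshot = [0.5 + i / 200 for i in range(100)]

print(psi(last_snapshot, last_snapshot))
print(psi(last_snapshot, this_snapshot) > ALERT_LEVEL)
```

The same comparison can be run per language or per source domain, so that one slice drifting does not hide inside a stable aggregate.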
Past all these mechanics, the deeper lesson of quality filtering is that the pretraining model never sees raw reality. It sees the world through the filter cascade your team built — and every choice you make in this section is permanently encoded in what the model thinks “good text” means. The filter is the model's upbringing. Treat it accordingly.
What the next section adds. Section 8.4 Data Mixing and Domain Weighting takes the filtered corpora produced here (per-domain: web, code, math, books, news, multilingual) and answers the question of how much of each to feed the model in what order. Quality filtering decides what each domain looks like; data mixing decides how loudly each domain speaks.