Chapter 8

Quality Filtering

Data: The Invisible Foundation

Section 8.2 took a raw web crawl and deduplicated it down to roughly 15 trillion tokens. What remained was still mostly junk. The web is unfiltered in a way that does not survive contact with a learning machine: navigation bars, machine-translated nonsense, autogenerated SEO farms, broken Unicode, copy-pasted boilerplate, link spam, and pages whose only content is a cookie banner. Train a model on the deduped crawl as-is and you will burn months of GPU time learning the statistics of garbage. Quality filtering is the engineering practice that decides which documents are worth a model's attention, and at the 14.8T-token scale of DeepSeek-V3, it is the single largest lever on final model quality short of architecture itself.

What this section delivers. The three-stage cascade every modern pretraining stack uses — heuristic rules, perplexity scoring under a reference language model, and a learned binary classifier — derived from the problem each stage solves, computed on a toy corpus by hand, implemented in plain Python and then PyTorch, and finally connected to the engineering reality of running it on a petabyte of raw crawl per week.

The Junk Problem at 15 Trillion Tokens

Common Crawl serves a snapshot of the open web every few weeks. A single snapshot is about 3.5 PB compressed, roughly 90B documents, and somewhere on the order of 50T tokens after a permissive HTML→text extraction. After the dedup pass from section 8.2 the corpus shrinks by about $3\times$, but the surviving 15T tokens are still, by document-level eyeballing, only 15–25% actually useful for teaching a language model.

The other 75–85% is not random noise. It is structured noise: real strings that real humans (or scrapers) put on the web for reasons that have nothing to do with conveying knowledge. Three families dominate.

| Junk family | What it looks like | Why it kills training |
|---|---|---|
| Boilerplate | Navbars, cookie banners, footer link farms, "404 Page Not Found" | Same 200 tokens repeated billions of times; the model wastes capacity memorizing them |
| Machine-generated SEO | Auto-spun product descriptions, template-filled review pages, translated-then-back-translated articles | Plausible English with no information content; pollutes the model's factual posterior |
| Encoding / extraction failures | Mojibake (é rendered as Ã©), broken Unicode, HTML entities leaked through, JSON / CSS dumps | Teaches the tokenizer wrong byte patterns and corrupts the embedding table |

The cost of not filtering is not subtle. The original GPT-3 paper notes a roughly $2\times$ compute equivalence between adding curated data and growing the model: every junk token consumed was a model token of capacity wasted. Llama 2 reports that the Wikipedia-classifier filter contributed more to final downstream MMLU score than doubling the deduplication aggressiveness did. DeepSeek-V3 explicitly attributes a large share of its ability to punch above its active parameter count to aggressive quality filtering, not architecture.

The economic framing that makes this priority obvious. A pretraining run for a 671B-parameter MoE model is on the order of $6 \times 10^{24}$ FLOPs. At a realised $2 \times 10^{15}$ FLOPs/s per H100, that is $3 \times 10^{9}$ GPU-seconds, or roughly 0.8 million H100-hours. Every percentage point of junk in your training set is, very literally, about 8,000 H100-hours spent learning the statistics of cookie banners. There is no cheaper place to spend engineering effort than the filtering pipeline.

Intuition: A Cascade of Increasingly Expensive Sieves

The temptation is to write one big “is this doc good?” function. Resist it. Quality filtering is built as a cascade of independent stages because the cost-per-doc rises by orders of magnitude as the signal-per-doc rises. Each stage's job is to make the next stage cheaper.

| Stage | Cost / doc | What it catches | What it cannot catch |
|---|---|---|---|
| 1. Heuristic rules | ~5 µs (string scans) | Empty pages, link farms, mojibake, runaway docs | Plausible but vapid text: heuristics see only surface stats |
| 2. Reference-LM perplexity | ~500 µs (5-gram lookup) | Off-distribution prose (gibberish, machine translation artifacts) | Topic / register mismatch: a Wikipedia LM thinks all news is "weird" |
| 3. Learned binary classifier | ~5 ms (fastText / tiny transformer) | Subtle register: textbook vs. blog vs. forum vs. spam | Anything in its training distribution: the classifier inherits its labels' biases |

The cascade is what makes the budget arithmetic close. Run the classifier on every raw doc and you spend $5\text{ ms} \times 90\text{B} = 4.5 \times 10^{8}$ core-seconds, which is about 125,000 CPU-core-hours (roughly 14 core-years) per snapshot. Run heuristics first to filter out 80% of the crawl, then perplexity to remove another 50%, and the classifier only sees $90\text{B} \cdot 0.2 \cdot 0.5 = 9\text{B}$ docs, a tenth of the work and a budget that fits on a single CPU cluster overnight. Ordering the cascade from cheapest to most expensive is the difference between a tractable pipeline and an intractable one.
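The budget arithmetic can be checked in a few lines. This is a back-of-the-envelope sketch that takes the per-document costs from the table and the survival rates quoted in the paragraph at face value; the helper name `core_hours` and the dictionary layout are introduced here purely for illustration.

```python
# Back-of-the-envelope cascade budget. Per-doc costs come from the stage table,
# survival rates from the text (heuristics drop 80%, perplexity drops half of
# the remainder). All figures are illustrative, not measurements.

DOCS = 90e9                    # documents per snapshot (figure used in the text)
COST_S = {                     # seconds of CPU time per document, per stage
    "heuristic": 5e-6,
    "perplexity": 500e-6,
    "classifier": 5e-3,
}
SURVIVAL = {"heuristic": 0.2, "perplexity": 0.5}   # fraction passing each stage


def core_hours(n_docs: float, seconds_per_doc: float) -> float:
    """CPU-core-hours needed to run one stage over n_docs documents."""
    return n_docs * seconds_per_doc / 3600.0


# Option A: run the classifier on every raw document.
classifier_only = core_hours(DOCS, COST_S["classifier"])

# Option B: run the stages cheap-to-expensive, each on the previous survivors.
after_heuristic = DOCS * SURVIVAL["heuristic"]
after_perplexity = after_heuristic * SURVIVAL["perplexity"]
cascade = (
    core_hours(DOCS, COST_S["heuristic"])
    + core_hours(after_heuristic, COST_S["perplexity"])
    + core_hours(after_perplexity, COST_S["classifier"])
)

print(f"classifier on everything : {classifier_only:>10,.0f} core-hours")
print(f"three-stage cascade      : {cascade:>10,.0f} core-hours")
print(f"docs reaching classifier : {after_perplexity:.2e}")
```

Under these assumptions the cascade needs roughly an eighth of the classifier-only budget, and most of what remains is still the classifier stage, which is why shrinking its input matters more than speeding up the cheap stages.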

The mental picture. Imagine panning for gold by hand. You start with a wide-mesh sieve (heuristic): rocks, leaves, sticks gone instantly, almost zero effort per gram. Then a finer mesh (perplexity): coarse sand gone, real effort per pan but only on what survived the wide mesh. Finally a magnifying glass over what remains (classifier): expensive per item but the volume is now small enough that you can afford it.

The Math of a Quality Filter

Each stage produces a real-valued score for every document. Let $d$ be a document and let $s_1(d), s_2(d), s_3(d)$ be the heuristic, perplexity, and classifier scores. The pipeline applies three thresholds $\tau_1, \tau_2, \tau_3$ and keeps the document iff all three pass:

$$\text{keep}(d) = \mathbf{1}\!\left[s_1(d) \geq \tau_1\right] \cdot \mathbf{1}\!\left[s_2(d) \leq \tau_2\right] \cdot \mathbf{1}\!\left[s_3(d) \geq \tau_3\right]$$

Two of the inequalities point the way you expect (more of this is better) and one (perplexity) points the other way (lower perplexity = more reference-like = better). $\mathbf{1}[\cdot]$ is the indicator function: 1 if the predicate is true, 0 otherwise.
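Written as code, the indicator product is just a three-way AND with one inequality flipped. The sketch below assumes the three scores have already been computed; the `Thresholds` container and the example scores are hypothetical, while the threshold values 0.6, 60 and 0.55 are the ones this section uses elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    tau1: float   # minimum heuristic score
    tau2: float   # maximum perplexity
    tau3: float   # minimum classifier probability


def keep(s1: float, s2: float, s3: float, t: Thresholds) -> bool:
    """Product of three indicators: True only if every stage passes."""
    return (s1 >= t.tau1) and (s2 <= t.tau2) and (s3 >= t.tau3)


t = Thresholds(tau1=0.6, tau2=60.0, tau3=0.55)
print(keep(0.84, 58.9, 0.70, t))    # all three pass
print(keep(0.84, 112.5, 0.70, t))   # perplexity too high
print(keep(0.55, 40.0, 0.90, t))    # heuristic score too low
```

Because `and` short-circuits, a failing first comparison already avoids the later ones; a real pipeline gets the same effect by not computing the later scores at all.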

Stage 1: heuristic score

The simplest workable form is the alpha ratio, the fraction of characters that are letters (digits do not count): $s_1(d) = |\{c \in d : c\text{ is a letter}\}| / |d|$. Real prose lands at $s_1 \in [0.75, 0.85]$; nav-bar dumps land near $0.4$. A threshold of $\tau_1 = 0.6$ is the standard starting point. Production pipelines stack a dozen such rules (mean word length, line repetition rate, ratio of bullet points, fraction of lines ending in terminal punctuation, etc.). Each is a single O(N) string pass and runs before anything else.
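A minimal sketch of how a few such rules might be stacked follows. The rule names, the helper `first_failed_rule`, and the two cutoffs other than 0.6 are placeholders chosen for illustration, not values taken from any production pipeline.

```python
from collections import Counter


def alpha_ratio(text: str) -> float:
    """Fraction of characters that are letters."""
    return sum(c.isalpha() for c in text) / max(1, len(text))


def terminal_punct_fraction(text: str) -> float:
    """Fraction of non-empty lines that end in '.', '!' or '?'."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(ln[-1] in ".!?" for ln in lines) / len(lines)


def duplicate_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    return sum(c - 1 for c in counts.values()) / len(lines)


# (rule name, scoring function, predicate on the score). Only the 0.6 alpha
# threshold comes from the text; the other two cutoffs are placeholders.
RULES = [
    ("alpha", alpha_ratio, lambda s: s >= 0.6),
    ("terminal_punct", terminal_punct_fraction, lambda s: s >= 0.5),
    ("duplicate_lines", duplicate_line_fraction, lambda s: s <= 0.3),
]


def first_failed_rule(text: str):
    """Return the name of the first rule that fails, or None if all pass."""
    for name, score, ok in RULES:
        if not ok(score(text)):
            return name
    return None


prose = "Quality filtering decides what a model reads.\nIt runs before training starts."
navbar = "Home\nAbout\nContact\nHome\nAbout\nContact"
menu = "$$$ 50% OFF >>> 12/31 <<<"

for doc in (prose, navbar, menu):
    print(first_failed_rule(doc))
```

The navbar example is the interesting one: it is almost entirely letters, so the alpha ratio alone would keep it, and it is the punctuation rule that rejects it. That is the practical argument for stacking several cheap rules rather than tuning one.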

Stage 2: perplexity under a reference LM

Train a small n-gram language model $p_{\text{ref}}$ on a curated reference corpus (Wikipedia + books works well). Score document $d = w_1 w_2 \dots w_N$ with its perplexity, which is the exponentiated average negative log-probability per word:

$$s_2(d) = \text{PP}(d) = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p_{\text{ref}}(w_i \mid w_{i-1}, \dots)\right)$$

Interpretation. Perplexity is the “effective vocabulary size” the reference LM perceives at each step. A low $\text{PP}$ means the LM is consistently unsurprised by the next word, i.e. the document is similar to what the LM was trained on. A high $\text{PP}$ means the LM is constantly surprised: the document either uses words the reference never saw (gibberish, foreign language, code-switched text) or strings them together in ways the reference never saw (machine translation, autogen). The threshold $\tau_2$ is set per-domain because domain-specific text legitimately scores higher under a Wikipedia LM than Wikipedia does.
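The “effective vocabulary size” reading can be verified directly: if the model assigns every observed word probability $1/K$, the perplexity is exactly $K$. The sketch below uses a hypothetical helper `perplexity` that takes the per-word probabilities as a plain list.

```python
import math


def perplexity(probs):
    """exp of the average negative log-probability of the observed words."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


# If the LM assigns every word probability 1/K, perplexity comes out as K
# (up to floating-point rounding), regardless of document length.
for k in (2, 50, 10_000):
    print(k, round(perplexity([1.0 / k] * 20), 6))

# One very unlikely word moves the geometric mean a lot: nine easy words at
# p = 0.1 plus a single word at the 1e-4 floor.
mostly_easy = [0.1] * 9 + [1e-4]
print(round(perplexity(mostly_easy), 2))
```

The last line is the reason a handful of OOV tokens can push an otherwise ordinary document over a cutoff: perplexity is a geometric mean, so rare words are weighted by their log-probability rather than averaged away.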

Stage 3: learned binary classifier

Train a fastText-style classifier with reference text (Wikipedia + a few curated sources) as the positive class and an unfiltered random crawl as the negative class. The model is a single embedding table with mean-pooling and a 2-class linear head — about 10M parameters. Score documents by the predicted probability of the “reference” class:

$$s_3(d) = P\!\left(\text{class} = \text{ref} \mid d ; \theta\right) = \sigma\!\left(W \cdot \frac{1}{N} \sum_{i=1}^{N} E[w_i]\right)$$

where $E \in \mathbb{R}^{V \times d}$ is the embedding table, $W \in \mathbb{R}^{2 \times d}$ is the classifier head, and $\sigma$ is the softmax, of which the score is the reference-class component. (In the exponents, $d$ is the embedding dimension, not the document.) The classifier picks up register and topic signals that unigram perplexity cannot: it distinguishes “textbook-quality explanation” from “competent blog post” from “auto-spun marketing copy.”
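To make the formula concrete, here is a hand-sized forward pass in plain Python. The four-word vocabulary, the 2-dimensional embeddings, and the weights in `W` are made-up values chosen so the arithmetic is easy to follow; skipping unknown words and returning 0.5 for an empty document are assumptions of this sketch, not part of the formula.

```python
import math

# Hypothetical 4-word vocabulary with 2-dimensional embeddings (made-up values).
E = {
    "theorem": [1.0, 0.2],
    "proof":   [0.8, 0.1],
    "click":   [-0.9, 0.3],
    "buy":     [-1.0, 0.4],
}
# Classifier head W with one row per class: row 0 = crawl, row 1 = reference.
W = [[-1.0, 0.5],
     [1.0, -0.5]]


def softmax(z):
    m = max(z)                                  # subtract the max for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def s3(words):
    """P(class = reference | d): mean-pool embeddings, apply W, softmax."""
    vecs = [E[w] for w in words if w in E]      # words outside the table are skipped
    if not vecs:
        return 0.5                              # assumption: no evidence either way
    dim = len(vecs[0])
    mean = [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]
    logits = [sum(W[c][j] * mean[j] for j in range(dim)) for c in range(2)]
    return softmax(logits)[1]


print(round(s3(["theorem", "proof", "proof"]), 3))
print(round(s3(["buy", "click", "buy"]), 3))
```

Word order never enters the computation: shuffling a document leaves its score unchanged, which is the bag-of-words property that keeps this stage cheap.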

Why the indicator-product form is exact, not a heuristic

Treating the three stages as independent thresholds is a deliberate modelling choice. You could imagine a soft-AND like $s_1 + s_2' + s_3 > \tau$ instead, where $s_2'$ would be the perplexity rescaled so that higher is better. Production pipelines do not, for two reasons. First, the independence assumption maps cleanly to the cascade ordering: each stage's threshold can be tuned in isolation against a held-out labelled set. Second, hard rejection at each stage is what makes the cost arithmetic work, because a soft-AND would force every stage to run on every document.
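The cost argument can be made visible by counting how many documents each stage actually scores. The sketch below is a generic early-exit loop, not any particular pipeline's implementation; the function `run_cascade`, the precomputed scores, and the stage tuples are all illustrative.

```python
from collections import Counter


def run_cascade(docs, stages):
    """Apply (name, score_fn, passes) stages in order with early exit.

    Returns the surviving docs and a Counter of how many docs each stage scored.
    """
    calls = Counter()
    survivors = []
    for doc in docs:
        for name, score_fn, passes in stages:
            calls[name] += 1
            if not passes(score_fn(doc)):
                break                      # hard rejection: later stages never run
        else:
            survivors.append(doc)          # no stage rejected the doc
    return survivors, calls


# Stand-in documents: each carries precomputed scores so the example stays tiny.
docs = [
    {"s1": 0.84, "s2": 58.9, "s3": 0.70},
    {"s1": 0.59, "s2": 900.0, "s3": 0.10},
    {"s1": 0.84, "s2": 112.5, "s3": 0.80},
    {"s1": 0.55, "s2": 400.0, "s3": 0.20},
]
stages = [
    ("heuristic",  lambda d: d["s1"], lambda s: s >= 0.6),
    ("perplexity", lambda d: d["s2"], lambda s: s <= 60.0),
    ("classifier", lambda d: d["s3"], lambda s: s >= 0.55),
]

survivors, calls = run_cascade(docs, stages)
print(len(survivors), dict(calls))
```

With hard rejection the four documents trigger seven stage evaluations in total, and only one of them is the expensive classifier. A weighted-sum rule would need all twelve, including four classifier calls.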

Manual Numerical Walkthrough

Five documents. Three filter stages. By hand. The point is to see that the numbers for the cascade are not mysterious — they are mean-ratios, log-sums, and one matrix-vector product.


The corpus.

  1. "Deep learning is the study of neural networks trained on data" (length 61 chars, 11 words)
  2. "BUY NOW!!! Click here >>> http://spam.tld/win <<<" (length 49 chars, 7 tokens)
  3. "the quick brown fox jumps over the lazy dog repeatedly forever" (length 62 chars, 11 words)
  4. "asdf qwer zxcv 1234 //// :::: ____ random gibberish line" (length 56 chars, 10 tokens)
  5. "Transformers replaced recurrent models for language tasks in 2017" (length 65 chars, 9 tokens)

Reference unigram. Train $p_{\text{ref}}$ on the 24-word reference corpus from the code below. Notable probabilities: $p(\text{the}) = 3/24 = 0.125$, $p(\text{deep}) = 1/24 \approx 0.042$, $p(\text{neural}) = 1/24 \approx 0.042$, OOV smoothing floor $p_{\text{smooth}} = 10^{-4}$.

Stage 1 (alpha ratio, threshold 0.6).

| Doc | Letters | Total chars | α = s₁ | α ≥ 0.60? |
|---|---|---|---|---|
| 1 | 51 | 61 | 0.84 | ✓ |
| 2 | 29 | 49 | 0.59 | ✗ DROP |
| 3 | 52 | 62 | 0.84 | ✓ |
| 4 | 31 | 56 | 0.55 | ✗ DROP |
| 5 | 53 | 65 | 0.82 | ✓ |

Two of five docs are dropped at the cheapest stage. Docs 2 and 4 never see another instruction. The downstream cost is now 3/5 of what it would have been.

Stage 2 (perplexity, threshold 60). Compute log-prob word-by-word for the three survivors.

Doc 1: deep / learning / is / the / study / of / neural / networks / trained / on / data (11 tokens). Reference contains: deep, learning, is, the, study, of, neural, networks. Out of vocab: trained, on, data. Seven of the in-vocab words have $p = 1/24$ and “the” has $p = 3/24$. Sum of log-probs: $7 \log(1/24) + \log(3/24) + 3 \log(10^{-4}) = -22.25 - 2.08 - 27.63 = -51.96$. Average: $-51.96 / 11 = -4.723$. So $\text{PP}_1 = e^{4.723} \approx 112.5$. 112.5 > 60: DROP. The unigram reference is too small to reward this doc; in production a 5-gram LM trained on billions of words would happily accept it.

Doc 3: Nine of the 11 words are in the reference (the, quick, brown, fox, jumps, over, the, lazy, dog); “repeatedly” and “forever” are OOV. Each in-vocab word has $p = 1/24$ except “the”, which has $p = 3/24$ and appears twice. Sum: $2\log(3/24) + 7\log(1/24) + 2\log(10^{-4}) = -4.16 - 22.25 - 18.42 = -44.83$. Avg: $-4.075$. $\text{PP}_3 = e^{4.075} \approx 58.9$. 58.9 < 60: KEEP.

Doc 5: transformers / replaced / recurrent / models / for / language / tasks / in / 2017 (9 tokens). The first seven are in the reference; “in” and “2017” are OOV. $\text{PP}_5 \approx e^{(7 \cdot 3.18 + 2 \cdot 9.21)/9} \approx e^{4.52} \approx 92$. 92 > 60: DROP. Again, the tiny reference is unfair. In production this doc would pass. (The regex tokenizer in the code below discards “2017”, which lowers this score to about 51; there the doc is rejected earlier, by the 10-word minimum in the heuristic.)

Stage 3 (classifier, threshold 0.55). Only doc 3 survives. Pretend the trained classifier outputs probability distribution $[P(\text{crawl}), P(\text{ref})] = [0.30, 0.70]$ on doc 3. $s_3 = 0.70 \geq 0.55$: KEEP.

Final retention. 1 of 5 docs kept = 20% document retention. Aggregate token retention is $11 / 48 \approx 23\%$. That feels low, and on this toy reference it is, because the reference corpus is 24 words. The real lesson is the arithmetic of the cascade: stage 1 removed 40% of docs at cost 5µs each; stage 2 removed 67% of survivors at cost ~500µs each; stage 3 ran on only 1 of 5 docs. That cost-shape is the cascade's entire point.

Calibration check. A real 5-gram reference LM trained on 5B Wikipedia words would assign $\text{PP}_1 \approx 40$, $\text{PP}_3 \approx 90$ (the fox sentence looks weird to Wikipedia), $\text{PP}_5 \approx 45$. Under that LM, docs 1 and 5 would pass perplexity and doc 3 would drop, the opposite of what our toy reference produced. Filter quality is entirely a function of how representative your reference LM is. This is the most underappreciated rule in pretraining data engineering.

Visualizing the Three-Stage Pipeline

The visualizer below runs a synthetic 160-document corpus through the same three-stage cascade. Each bubble is one document; the bubble radius is the doc's token count. Drag any of the three threshold sliders and watch the cohort of survivors (green) shrink or expand, and the cohorts dropped at each stage (red / amber / violet) flow into the trash zone. The stats panel reports the same numbers a production pipeline logs after every shard.

[Interactive visualizer: three-stage quality-filter pipeline]

Three behaviours to lock in by playing with the sliders. First, drop the heuristic threshold to 0.3 — most boilerplate-looking docs leak through, and the average quality of survivors falls sharply. Second, drop the perplexity cutoff from 110 to 50 (much stricter) — token retention craters and the kept corpus becomes uniformly Wikipedia-like, which is precisely the “mode collapse” failure mode mentioned in the next-to-last section. Third, raise the classifier threshold above 0.7 — the kept pile becomes tiny but its average quality reaches the ceiling. There is no setting that simultaneously maximizes retention and quality. The job is to find the right point on that frontier for the downstream model's objective.

The Pareto frontier is the actual deliverable. A filtering team's real artifact is not the cutoffs — it is a scatter plot of (retention, downstream eval score) across many cutoff choices, run on a tiny scaled-down model. The thresholds you ship are the point on that curve that the scaling-laws chapter (chapter 9) says you can afford given your compute budget.
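A threshold sweep of this kind can be sketched on synthetic data. Everything below is a stand-in: the corpus is randomly generated, the hidden "quality" plays the role that a downstream eval score from a small pilot model would play in practice, and the function name `sweep` is introduced here for illustration.

```python
import random

random.seed(0)

# Synthetic stand-in corpus: each doc has a hidden "true quality" in [0, 1] and
# a noisy classifier score correlated with it. Real pipelines would replace
# mean quality with a downstream eval score from a small pilot model.
docs = []
for _ in range(2000):
    quality = random.random()
    score = min(1.0, max(0.0, quality + random.gauss(0.0, 0.2)))
    docs.append((score, quality))


def sweep(docs, thresholds):
    """Return (threshold, retention, mean quality of kept docs) per threshold."""
    rows = []
    for tau in thresholds:
        kept = [q for s, q in docs if s >= tau]
        retention = len(kept) / len(docs)
        mean_q = sum(kept) / len(kept) if kept else float("nan")
        rows.append((tau, retention, mean_q))
    return rows


frontier = sweep(docs, [0.1 * i for i in range(10)])
for tau, retention, mean_q in frontier:
    print(f"tau3={tau:.1f}  retention={retention:6.1%}  mean_quality={mean_q:.3f}")
```

Each printed row is one point on the retention-versus-quality curve. Retention can only fall as the threshold rises, so the table makes the trade-off explicit instead of leaving it to a single hand-picked cutoff.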

Plain Python: Heuristic + Perplexity Filter

Both stages of the cheap half of the cascade in self-contained Python — no PyTorch, no KenLM. The reference LM is a unigram trained on three sentences, the heuristic is a handful of guard clauses, and the pipeline is a guard-clause function that returns both the keep / drop decision and the reason. Run it on five toy documents to see every branch fire.

🐍quality_filter.py
A tiny stand-in for the reference language model

Production stacks (CCNet, DeepSeek's, Llama's) train a 5-gram KenLM on a clean reference corpus: Wikipedia, books, curated news. We collapse it to a unigram on three sentences so the math is auditable. The role is the same: a fast, fixed probability distribution that says how 'reference-like' a string looks.

EXECUTION STATE
len(REFERENCE) = 24 words
V = 24 (total tokens)
Reference unigram probabilities

ref_prob[w] is the maximum-likelihood probability of word w in our reference. ref_prob['the'] = 3/24 = 0.125, ref_prob['neural'] = 1/24 ≈ 0.042. Words not in the reference get a smoothing floor of 1e-4 inside doc_perplexity, so they hurt the score but don't crash it.

EXECUTION STATE
ref_prob['the'] = 0.125
ref_prob['neural'] = 0.042
smoothing floor = 1e-4
Cheap length guard

The crawl is full of 12-character pages — error stubs, redirect pages, empty templates. Throwing them out before any other work is the single highest-throughput line in the entire pipeline. On Common Crawl this rule alone removes roughly 15% of all records at essentially zero cost.

EXECUTION STATE
min length = 50 chars
Alpha ratio: the boilerplate detector

Real prose is ~75-85% letters. Navigation bars, ASCII tables, JSON payloads, and link farms usually fall well below that, often under 50%. alpha is the single most discriminating cheap heuristic. We drop doc if alpha < 0.6, which is one of the three sliders in the visualizer above. The threshold is intentionally permissive so we don't kill code blocks and tables we want to keep.

EXECUTION STATE
alpha (good prose) = 0.82
alpha (BUY NOW spam) = 0.59
alpha (gibberish line) = 0.53
Word-count window

Drop docs with fewer than 10 words (no signal) or more than 100k words (almost always concatenated dumps, log files, or scraped database exports). The upper bound is critical at scale: a single 50MB doc costs as much to score in stage 2 as 50,000 normal docs combined.

EXECUTION STATE
min words = 10
max words = 100000
Average word length sanity check

English words average ~5 characters. Below 3 you are looking at random short strings (URLs, IDs, emoji). Above 10 you are usually looking at concatenated-no-spaces dumps from a broken HTML extractor. Both fail this guard cleanly. This costs O(N) per doc but runs only on docs that passed length + alpha — i.e. on a tiny fraction of the crawl.

doc_perplexity: the formula made explicit

We compute log p(w) for each word under the reference, sum, divide by document length, and exponentiate the negation. That is the standard cross-entropy → perplexity transformation. A doc that looks exactly like the reference scores PP ≈ 1/avg_p (small). A doc full of OOV junk scores near 1/smooth = 10000.

EXECUTION STATE
smooth (OOV floor) = 1e-4
Tokenize as lowercase alphabetic words

Crucially, docs must be tokenized exactly the way the reference was (REFERENCE is already lowercase alphabetic words, so str.split and the regex agree on it); otherwise OOV rates explode and the cutoff becomes useless. In production, tokenization for the perplexity LM is frozen the moment the LM is trained; even adding a digit class later changes everyone's score.

Accumulate log-prob, geometric-mean form

Working in log space is non-negotiable. Multiplying 5,000 probabilities under 0.1 underflows float64 by word ~300. log + exp at the end gives us numerical stability for documents of any length. The geometric mean (the /len(words)) also removes a strong length bias — long docs would otherwise look 'worse' than short ones.

The perplexity cutoff is THE knob

Tighter (lower) cutoff = fewer docs kept, higher avg quality, more reference-mode-collapsed (Wikipedia-like) dataset. Looser (higher) cutoff = more diversity, more junk. DeepSeek-V3 uses domain-specific cutoffs: ~120 for code, ~80 for prose. There is no universal correct number — only the right one for your downstream eval suite.

EXECUTION STATE
PERPLEXITY_CUTOFF = 60.0
Pipeline as guard clauses: cheap stages run first

This ordering is not stylistic. It is the reason the pipeline fits in your budget. heuristic_ok costs O(N) string scans (microseconds). doc_perplexity costs O(N) probability lookups + a log (milliseconds in pure Python, microseconds in optimized KenLM). Stage 3 (the learned classifier in the next code block) is the expensive one — that is why it lives outside this function and runs only on the survivors of these two cheap stages.

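Toy corpus: five documents that exercise the main branches

Doc 0 (deep learning sentence) passes the heuristic but is dropped on perplexity (≈112.5), because three of its eleven words are OOV for the tiny reference. Doc 1 (BUY NOW spam) fails the heuristic at the length guard (49 chars) and would also fail the alpha check (0.59). Doc 2 (the quick brown fox) survives the heuristic and lands just under the cutoff at ≈58.9, because nine of its eleven words are in the reference. Doc 3 (gibberish) fails the heuristic on alpha (0.53) and would fail perplexity on OOV if it got that far. Doc 4 (transformers sentence) scores ≈51 on perplexity but has only nine words, so the word-count guard drops it first.

EXECUTION STATE
len(corpus) = 5 docs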
Logging WHY each doc was dropped

In production the (kept, reason) tuple is the single most valuable telemetry signal. You bucket the drop reasons over a 1B-doc batch and immediately see: is the heuristic too aggressive on code? Is the perplexity cutoff killing all your Mandarin docs? Without per-reason logging, tuning thresholds is guesswork. With it, every threshold change is a measurable A/B.

```python
import math, re
from collections import Counter

# ------------------------------------------------------------------
# A tiny "reference distribution" built from a clean corpus (Wiki-like).
# In production this is a 5-gram KenLM trained on Wikipedia or books.
# Here we hand-build a unigram for clarity.
# ------------------------------------------------------------------
REFERENCE = (
    "the quick brown fox jumps over the lazy dog "
    "deep learning is the study of neural networks "
    "transformers replaced recurrent models for language tasks"
).split()
ref_counts = Counter(REFERENCE)
V = sum(ref_counts.values())
ref_prob = {w: c / V for w, c in ref_counts.items()}

# ------------------------------------------------------------------
# Stage 1: HEURISTIC FILTER -- cheap, runs first, drops 60-80% of crawl.
# ------------------------------------------------------------------
def heuristic_ok(text: str) -> bool:
    if len(text) < 50:                    # too short to be a real doc
        return False
    alpha = sum(c.isalpha() for c in text) / max(1, len(text))
    if alpha < 0.6:                       # mostly punctuation / nav cruft
        return False
    words = text.split()
    if len(words) < 10 or len(words) > 100000:
        return False
    avg_wlen = sum(len(w) for w in words) / len(words)
    if not (3 <= avg_wlen <= 10):         # English words are ~5 chars on avg
        return False
    return True

# ------------------------------------------------------------------
# Stage 2: PERPLEXITY FILTER -- score doc under the reference LM.
# Lower perplexity = more "reference-like" = keep.
# ------------------------------------------------------------------
def doc_perplexity(text: str, smooth: float = 1e-4) -> float:
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return float("inf")
    logp = 0.0
    for w in words:
        p = ref_prob.get(w, smooth)       # OOV gets a tiny floor
        logp += math.log(p)
    return math.exp(-logp / len(words))   # geometric mean of 1/p

PERPLEXITY_CUTOFF = 60.0

# ------------------------------------------------------------------
# The pipeline. Each filter is a guard clause; cheap stages run first
# so we never spend the expensive perplexity computation on obvious junk.
# ------------------------------------------------------------------
def keep(text: str) -> tuple[bool, str]:
    if not heuristic_ok(text):
        return False, "heuristic"
    if doc_perplexity(text) > PERPLEXITY_CUTOFF:
        return False, "perplexity"
    return True, "kept"

corpus = [
    "Deep learning is the study of neural networks trained on data.",
    "BUY NOW!!! Click here >>> http://spam.tld/win <<<",
    "the quick brown fox jumps over the lazy dog repeatedly forever",
    "asdf qwer zxcv 1234 ////  ::::  ____  random gibberish line",
    "Transformers replaced recurrent models for language tasks in 2017.",
]

for doc in corpus:
    kept, reason = keep(doc)
    ppl = doc_perplexity(doc)
    print(f"{'KEEP' if kept else 'DROP':4s} | reason={reason:10s} "
          f"| ppl={ppl:8.2f} | {doc[:55]}")
```

Two implementation choices in this code generalise to every production pipeline. First, every stage returns both a decision and a reason. The reasons accumulate into a histogram per shard and per source domain, and that histogram is what the data team stares at every morning. Second, the perplexity computation works in log space and uses a smoothing floor. Without those two details, the probability of a real document would underflow to zero, and the first OOV token would break the score: a zero probability has no logarithm, so math.log raises an error.
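Both choices can be illustrated in a few lines. In the sketch below the list of `(kept, reason)` pairs is made up, standing in for what a `keep()` function like the one above would return over a shard, and the 400-word document of probability-0.1 words is an artificial example chosen to force underflow.

```python
import math
from collections import Counter

# 1) Drop-reason histogram. These (kept, reason) pairs are made-up stand-ins
#    for the output of a keep()-style function run over one shard.
decisions = [
    (False, "heuristic"), (True, "kept"), (False, "perplexity"),
    (False, "heuristic"), (False, "heuristic"), (True, "kept"),
]
histogram = Counter(reason for _, reason in decisions)
for reason, n in histogram.most_common():
    print(f"{reason:10s} {n:3d}  ({n / len(decisions):.0%})")

# 2) Why log space: a direct product of many small probabilities underflows
#    to exactly 0.0 in float64, while the sum of logs stays finite.
probs = [0.1] * 400
direct = math.prod(probs)                       # 0.1**400 is below float64 range
log_sum = sum(math.log(p) for p in probs)       # about -921.03
print(direct, round(log_sum, 2))
print(round(math.exp(-log_sum / len(probs)), 6))  # perplexity is still recoverable
```

The direct product is useless once it hits zero, whereas the log-space sum still yields the expected perplexity of 10 for a document whose words all have probability 0.1.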

Sanity check. If you replace the unigram with a 5-gram KenLM trained on 5B Wikipedia words, the same five-doc corpus produces very different perplexity numbers, and the calibration check in the walkthrough shows that even the ranking of the clean documents can flip. Absolute scores do not transfer between reference LMs, which is why every team re-tunes thresholds the moment they swap one in.

PyTorch: Batched fastText-Style Classifier Scoring

The expensive third stage. A fastText-shaped classifier — embedding table + mean-pool + linear head — scored in batches on the GPU. The function returns one score per document; the threshold is applied by the caller. Variable-length docs are padded to the longest in each batch and a boolean mask keeps pad tokens from polluting the mean pool.

🐍quality_classifier.py
@torch.no_grad() because we are scoring, not training

Scoring 200M docs with autograd on would record a computation graph and keep intermediate activations alive for a backward pass that never runs: gigabytes of waste. no_grad skips that bookkeeping, which cuts activation memory substantially, and no optimizer is involved at all. This is one of the highest-ROI decorators in any inference pipeline.

The classifier is intentionally small

A fastText-style model: embedding table of size (V, d) with d=64 to 128, plus a 2-class linear head. That is ~10M parameters for V=100k — small enough to fit alongside the tokenizer in RAM, fast enough to score thousands of docs per second per CPU core. The classifier MUST be cheap or it dominates the pipeline cost.

EXECUTION STATE
embedding dim d = 64–128
classifier params = ~10M
Input is variable-length token IDs

We deliberately do NOT receive raw text — that would mean re-tokenizing 200M docs on the GPU. The streaming tokenizer (Chapter 3) emits int64 tensors once; every downstream filter consumes the IDs. This decoupling lets the same docs be re-scored by a new classifier in hours instead of days.

EXECUTION STATE
token_id_batches[i] = (L_i,) int64
batch_size = 2048 docs per kernel launch

Too small and we are launch-bound (CPU dispatches dominate). Too large and we waste padding on the longest doc in the batch. 2k is the empirical sweet spot for fastText-sized classifiers on an A100/H100. The 'longest doc' problem is also why outlier-removal in stage 1 (max 100k words) matters — one 1M-token doc would balloon every batch it lands in.

EXECUTION STATE
batch_size = 2048
GPU memory per batch = ~250 MB
Pad to longest in batch, build a boolean mask

Padding is necessary for vectorization. The mask is non-negotiable: without it the mean-pooling step would average in the pad token's embedding, biasing every short doc's score toward whatever the pad token learned to represent.

EXECUTION STATE
padded.shape = (B, max_len)
mask.shape = (B, max_len) bool
Embedding lookup: the per-doc bag of tokens

classifier.embed is a single nn.Embedding lookup: (B, L) int -> (B, L, d) float. This step dominates wall-clock time of the function because it touches the V×d table once per token. It is also the step that benefits most from FP16/BF16 — halving the table size doubles the embedding-bandwidth throughput.

EXECUTION STATE
emb.shape = (B, L, d)
Zero out padded positions

Multiplying by the mask (broadcast over the d dimension) sets the embedding to zero exactly where mask is False. This is cheaper than gathering with the mask and lets the subsequent .sum() be a plain reduction.

Mean-pool: the fastText representation of a doc

Doc embedding = average of token embeddings. No position info, no order info — just bag-of-words. This is intentional: the classifier is meant to capture register and topic, not grammar, and it must run 1000× faster than a transformer. Dividing by mask.sum (the real doc length) gives a true mean rather than a sum-over-max-length.

EXECUTION STATE
doc_emb.shape = (B, d)
Two-class linear head

head is nn.Linear(d, 2), a single matmul. Output (B, 2): logits for [class 0 = crawl, class 1 = reference]. The whole forward pass for B=2048 docs is essentially one gather from the (V, d) table, one masked mean, and one small matmul. That is why this scales to trillions of docs.

EXECUTION STATE
logits.shape = (B, 2)
Softmax → P(reference) is the quality score

Take column 1. P(reference) ∈ [0,1] becomes the doc's 'quality score' — exactly the cls value the visualizer slider operates on. Note we do NOT use the raw logit: thresholding on a probability is calibrated across batches; thresholding on a logit is not.

EXECUTION STATE
p_ref.shape = (B,)
Move scores to CPU between batches

Otherwise every per-batch score tensor stays resident on the GPU for the whole run. Each one is only ~B*4 bytes, so the copy is cheap, and moving it off the device keeps GPU memory flat no matter how many batches are scored.

Threshold tuned on a labeled validation slice

0.55 is not a magic number. It is the threshold that, on a 50k-doc held-out set with human keep/drop labels, gives the precision/recall trade-off the team picked. Move it to 0.7 and you get a smaller, cleaner corpus; move it to 0.4 and you keep more diversity at the cost of more junk. The slider in the visualizer above is exactly this knob.

EXECUTION STATE
threshold = 0.55
Final retention is a single log line that matters

Production pipelines log this number per shard, per source domain, per language. Sudden drops (last week 78% kept, today 41% kept) almost always indicate either a crawl format change OR a regression in the tokenizer — never a real quality shift. This metric is the closest thing the pretraining team has to a heartbeat.

```python
import torch
import torch.nn.functional as F


class FastTextClassifier(torch.nn.Module):
    """Embedding table + linear head, exposing the attributes used below.

    Minimal stand-in so the example runs end to end; sizes are illustrative.
    """

    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, 2)   # logits for [crawl, reference]


@torch.no_grad()
def quality_score_batch(
    classifier: torch.nn.Module,            # exposes .embed (V, d) and .head (d -> 2)
    token_id_batches: list[torch.Tensor],   # list of (L_i,) int64 tensors
    device: str = "cuda" if torch.cuda.is_available() else "cpu",
    batch_size: int = 2048,
) -> torch.Tensor:
    """Returns one quality score in [0,1] per document.

    The 'classifier' is fastText-style: a single embedding table mean-pooled
    across the document, fed through one linear layer, softmaxed.
    Trained to discriminate 'reference' (Wikipedia / curated) from
    'random crawl'. P(reference) becomes the quality score.
    """
    classifier = classifier.to(device).eval()
    scores = []
    for start in range(0, len(token_id_batches), batch_size):
        batch = token_id_batches[start:start + batch_size]

        # Pad to the longest doc in this batch. Padding token = 0.
        max_len = max(t.size(0) for t in batch)
        padded = torch.zeros(len(batch), max_len, dtype=torch.long, device=device)
        mask   = torch.zeros(len(batch), max_len, dtype=torch.bool, device=device)
        for i, t in enumerate(batch):
            padded[i, :t.size(0)] = t.to(device)
            mask[i, :t.size(0)]   = True

        # Mean-pooled bag-of-tokens embedding (the fastText trick).
        emb = classifier.embed(padded)                  # (B, L, d)
        emb = emb * mask.unsqueeze(-1)                  # zero out pads
        doc_emb = emb.sum(dim=1) / mask.sum(dim=1, keepdim=True).clamp(min=1)

        logits = classifier.head(doc_emb)               # (B, 2)
        p_ref  = F.softmax(logits, dim=-1)[:, 1]        # P(reference)
        scores.append(p_ref.cpu())

    return torch.cat(scores, dim=0)


# ----- usage at scale: ~200M docs across one node ----------------
# In production, token_id_batches comes from the streaming tokenizer (Chapter 3)
# and the classifier is trained. The threshold is tuned on a held-out 50k-doc
# validation slice where humans labeled "kept this for pretraining? yes/no".
# Below: an untrained classifier and random token IDs, purely to exercise the code.
if __name__ == "__main__":
    torch.manual_seed(0)
    classifier = FastTextClassifier(vocab_size=1000)
    documents = [f"doc-{i}" for i in range(8)]
    token_id_batches = [torch.randint(1, 1000, (n,)) for n in (5, 12, 3, 40, 7, 9, 21, 2)]

    scores = quality_score_batch(classifier, token_id_batches)
    keep_mask = scores >= 0.55      # 0.55 = the third slider in the viz
    kept = [d for d, k in zip(documents, keep_mask.tolist()) if k]
    print(f"kept {len(kept):,} of {len(documents):,}  "
          f"({100*len(kept)/len(documents):.1f}%)")
```

Three subtleties about how this function plugs into the rest of the pretraining stack are worth marking explicitly.

  1. Token IDs in, scores out: text is never reprocessed. The streaming tokenizer (chapter 3) is the only component that touches raw text. Every downstream filter, and the model itself, consumes int64 IDs. Because of this decoupling, swapping in a new classifier means a re-run that takes hours, not days.
  2. The classifier is tiny on purpose. ~10M parameters, one matmul per doc. A bigger classifier (a small transformer) would score better per doc but would dominate the pipeline cost — for minimal gain because by stage 3 the surviving documents are already mostly OK. The right place to spend capacity is the model being trained, not the gatekeeper at the door.
  3. The 0.55 threshold is empirical, not principled. It is the value that maximised downstream MMLU/HumanEval/GSM8K on a scaled-down 1B-parameter pilot. Every model size and every downstream eval suite has its own optimum. Production teams re-tune this threshold per snapshot, sometimes per domain.
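
The threshold tuning in item 3 can be sketched as a sweep over candidate cutoffs. The version below is a minimal illustration that scores each cutoff by F1 against human keep/drop labels, in the spirit of the held-out validation slice mentioned in the usage comment. The criterion described in item 3 (downstream benchmark scores on a pilot model) is far more expensive and is not reproduced here. The names `f1_at_threshold` and `sweep_threshold`, the candidate grid, and the toy data are all assumptions made for illustration.

```python
from typing import Sequence


def f1_at_threshold(scores: Sequence[float], labels: Sequence[bool], thr: float) -> float:
    """F1 of the rule 'keep if score >= thr' against human keep/drop labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= thr and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < thr and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_threshold(scores, labels, candidates):
    """Return (best_threshold, best_f1); ties go to the lower threshold."""
    best_thr, best_f1 = None, -1.0
    for thr in candidates:
        f1 = f1_at_threshold(scores, labels, thr)
        if f1 > best_f1:   # strict '>' keeps the earliest (lowest) winner
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1


# Toy validation slice (hypothetical numbers, not real data).
val_scores = [0.10, 0.30, 0.45, 0.50, 0.60, 0.70, 0.80, 0.95]
val_labels = [False, False, False, True, True, False, True, True]
grid = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95

best_thr, best_f1 = sweep_threshold(val_scores, val_labels, grid)
print(f"best threshold = {best_thr:.2f}, F1 = {best_f1:.3f}")
```

On real data the objective would be swapped for whatever the team actually optimises; the sweep structure stays the same.
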
Why the classifier is fastText and not a transformer. A 100M-param BERT-style classifier might make roughly 10× fewer errors on the binary “ref vs crawl” task, but it runs about 100× slower per doc. At 90B docs per snapshot that is the difference between scoring the crawl in a day and scoring it in a quarter. The Pareto-optimal spot is “simple, fast, good enough,” not “deep, slow, marginally better.”
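
The day-versus-quarter claim is plain arithmetic, sketched below. It assumes the fast classifier clears one snapshot in exactly one day and that the 100× slowdown applies uniformly to every document. Both are simplifications taken from the paragraph above, not measured figures.

```python
DOCS_PER_SNAPSHOT = 90e9      # from the text: ~90B docs per snapshot
FAST_DAYS = 1.0               # assumption: the fastText-style scorer takes one day
SLOWDOWN = 100                # from the text: ~100x slower per doc

SECONDS_PER_DAY = 24 * 60 * 60

# Aggregate throughput needed to finish the snapshot in FAST_DAYS.
fast_docs_per_sec = DOCS_PER_SNAPSHOT / (FAST_DAYS * SECONDS_PER_DAY)
slow_docs_per_sec = fast_docs_per_sec / SLOWDOWN
slow_days = DOCS_PER_SNAPSHOT / slow_docs_per_sec / SECONDS_PER_DAY

print(f"fast scorer: {fast_docs_per_sec:,.0f} docs/s aggregate")
print(f"slow scorer: {slow_docs_per_sec:,.0f} docs/s -> {slow_days:.0f} days")
```

One hundred days is a little over a quarter, which is where the comparison in the callout comes from.
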

At Massive Scale: 15T → 14.8T Tokens

Numbers from public reports. None of the major frontier labs publishes its exact filter settings, but the order-of-magnitude shape of the funnel is consistent across DeepSeek-V3, Llama 3, RedPajama-V2, and Falcon.

| Stage | Tokens in | Survival rate | Tokens out | Compute cost |
| --- | --- | --- | --- | --- |
| Raw Common Crawl (1 snapshot) | ~50 T | | ~50 T | 0 (free download) |
| URL / language ID filter | 50 T | ~60% | 30 T | ~10k CPU-hours |
| Dedup (section 8.2) | 30 T | ~50% | 15 T | ~100k CPU-hours |
| Heuristic (stage 1) | 15 T | ~70% | 10.5 T | ~5k CPU-hours |
| Perplexity (stage 2) | 10.5 T | ~70% | 7.4 T | ~50k CPU-hours |
| Classifier (stage 3) | 7.4 T | ~70% | 5.2 T per snapshot | ~200k CPU-hours + 1k GPU-hours |
| Aggregate (multiple snapshots) | | | ~14.8 T | |

Two things to read from the table. First, each quality stage removes roughly 30% of what it receives. Compounded over the three quality stages, that is a $0.7^3 \approx 0.34$ retention from deduped input to final corpus: almost two-thirds of the deduped tokens are rejected for quality reasons. Second, the cascade ordering keeps the expensive stage small. The heuristic pass costs ~5k CPU-hours and runs first; by the time the classifier runs, its input has fallen from 15 T to 7.4 T tokens. Its bill (~200k CPU-hours plus 1k GPU-hours, the largest line in the table) is therefore about half of what it would be on the full deduped corpus, assuming cost scales with token count.
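
The funnel arithmetic can be checked in a few lines. This sketch hard-codes the approximate survival rates from the table (order-of-magnitude figures, not published settings) and compounds them. The variable names are illustrative.

```python
# Approximate survival rates copied from the table above.
stages = [
    ("URL / language ID filter", 0.60),
    ("Dedup (section 8.2)",      0.50),
    ("Heuristic (stage 1)",      0.70),
    ("Perplexity (stage 2)",     0.70),
    ("Classifier (stage 3)",     0.70),
]

tokens = 50.0  # trillions of tokens in one raw snapshot
funnel = []
for name, rate in stages:
    tokens_out = tokens * rate
    funnel.append((name, tokens, tokens_out))
    print(f"{name:<26} {tokens:5.2f} T -> {tokens_out:5.2f} T")
    tokens = tokens_out

# Retention across the three quality stages only (deduped input -> final).
quality_retention = 0.70 ** 3
print(f"quality-stage retention: {quality_retention:.3f}")
print(f"final tokens per snapshot: {tokens:.2f} T")
```

The exact products are 7.35 T after the perplexity stage and about 5.15 T after the classifier; the table's 7.4 T and 5.2 T are the same figures with each row rounded before the next rate is applied.
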

The 14.8T number comes from many snapshots, not one

A single Common Crawl snapshot yields ~5T tokens after quality filtering. DeepSeek-V3 reports 14.8T training tokens, a total that comes from accumulating across multiple snapshots plus curated sources (data mixing, the topic of the next section). The filtered output of one snapshot is not enough; the filtered output of many snapshots is what feeds a frontier-scale run.

Why filter quality matters more for MoE than for dense

Mixture-of-experts models (chapter 5) have many experts, each specialising on a slice of the input distribution. A junk-heavy training set creates a junk-handling expert that the router learns to send certain inputs to — and that expert's capacity is now permanently wasted. Dense models have no such routing; junk is averaged into every weight. As a result MoE models are more sensitive to data quality at fixed token count: the DeepSeek paper explicitly notes that tightening the classifier threshold by 5% gave them a larger downstream-eval bump than a comparable dense baseline saw.

Filtering interacts with the tokenizer

Your reference LM and your classifier both score documents using a tokenization that is decoupled from the main-model tokenizer (chapter 3). When you change the main-model tokenizer mid-project (rare but it happens) the filter scores do not automatically change with it — you are filtering on the old distribution and training on the new one. Teams have caught this exact bug 6 months into a run when downstream scores plateau unexpectedly; the fix is to re-score the entire corpus with a classifier rebuilt under the new tokenization.
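
One way to catch this mismatch early is to store a fingerprint of the tokenizer used for scoring next to every batch of scores, and to refuse to reuse scores whose fingerprint differs from the tokenizer currently in use. The sketch below is an assumption about how such a guard could look, not a description of any team's pipeline. `vocab_fingerprint`, `ScoreBatch`, and `check_scores_current` are hypothetical names, and the three-token vocabularies are toy data.

```python
import hashlib
from dataclasses import dataclass


def vocab_fingerprint(vocab: dict[str, int]) -> str:
    """Stable hash of a token -> id mapping (independent of dict order)."""
    h = hashlib.sha256()
    for token, idx in sorted(vocab.items()):
        h.update(f"{idx}\t{token}\n".encode("utf-8"))
    return h.hexdigest()


@dataclass(frozen=True)
class ScoreBatch:
    scores: tuple[float, ...]
    tokenizer_fp: str          # fingerprint of the tokenizer used for scoring


def check_scores_current(batch: ScoreBatch, current_vocab: dict[str, int]) -> None:
    """Raise if the scores were computed under a different tokenizer."""
    current_fp = vocab_fingerprint(current_vocab)
    if batch.tokenizer_fp != current_fp:
        raise ValueError(
            "quality scores are stale: tokenizer changed since scoring "
            f"({batch.tokenizer_fp[:8]} != {current_fp[:8]}); re-score the corpus"
        )


old_vocab = {"<pad>": 0, "the": 1, "cat": 2}
new_vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}

batch = ScoreBatch(scores=(0.61, 0.12), tokenizer_fp=vocab_fingerprint(old_vocab))
check_scores_current(batch, old_vocab)          # same tokenizer: passes silently
try:
    check_scores_current(batch, new_vocab)      # changed tokenizer: raises
except ValueError as err:
    print(err)
```

A check like this turns a silent six-month plateau into a loud failure on the first training job that loads stale scores.
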

Engineering Reality and Failure Modes

The cascade looks tidy in writing and is anything but in practice. Five failure modes recur across every team that has shipped a pretraining pipeline.

  1. Mode collapse toward the reference distribution. Over-strict perplexity cutoffs (or a classifier with too narrow a reference) produce a corpus that looks uniformly like Wikipedia. The model trained on it sounds like an encyclopaedia and fails on conversational, code, and creative tasks. Fix: per-domain cutoffs, and a small explicit budget of “diverse” data that bypasses stage 3.
  2. Classifier label drift. The classifier is trained on a fixed pair of (reference, crawl) sets. The crawl distribution drifts every quarter (new platforms, new spam formats). After a year, the “crawl” class no longer represents the current crawl, and the classifier's decisions become unreliable in ways that are invisible to its accuracy on the original validation set. Fix: retrain the classifier per snapshot, and monitor its score distribution for shift between snapshots.
  3. Language-specific failures. A perplexity LM trained on English Wikipedia gives nonsensically high perplexity to perfectly good Mandarin or Hindi prose. If language ID is buggy upstream, you silently delete entire languages. Fix: a separate reference LM per language, gated by a language-ID classifier upstream — and per-language retention dashboards that scream when a language drops off a cliff.
  4. Eval contamination via the reference set. If your reference corpus contains MMLU questions (because Wikipedia does), the classifier learns to prefer documents that look like MMLU questions, and your training set ends up enriched for MMLU-leakage patterns. Section 8.7 covers contamination detection; the relevant rule here is to scrub the reference corpus of any eval data before using it as the classifier's positive class.
  5. The “long tail of legitimately weird” problem. Source code, math papers with heavy LaTeX, chemistry formulas, and tabular data all score badly under a prose-trained reference LM but are exactly the domains the downstream model needs to learn. The fix is the next section's topic: domain-aware data mixing, where each domain has its own filter and its own weighting in the final training mixture. Quality filtering is per-domain; data mixing is how the per-domain corpora are combined.
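
The per-language retention dashboard from item 3 reduces to a small check: compare each language's kept fraction with a baseline and flag large drops. The sketch below assumes per-language document counts before and after filtering are already available. The function name `retention_alerts`, the 50% relative-drop trigger, and all counts are hypothetical.

```python
def retention_alerts(
    docs_in: dict[str, int],
    docs_kept: dict[str, int],
    baseline: dict[str, float],
    max_relative_drop: float = 0.5,
) -> list[str]:
    """Return languages whose retention fell by more than max_relative_drop
    relative to their baseline retention rate."""
    flagged = []
    for lang, n_in in sorted(docs_in.items()):
        if n_in == 0 or lang not in baseline:
            continue  # nothing to compare against
        retention = docs_kept.get(lang, 0) / n_in
        floor = baseline[lang] * (1.0 - max_relative_drop)
        if retention < floor:
            flagged.append(lang)
    return flagged


# Hypothetical counts for one snapshot.
docs_in   = {"en": 1_000_000, "zh": 200_000, "hi": 50_000}
docs_kept = {"en":   340_000, "zh":   4_000, "hi": 16_000}
baseline  = {"en": 0.35, "zh": 0.30, "hi": 0.30}   # previous snapshot's retention

alerts = retention_alerts(docs_in, docs_kept, baseline)
print(alerts)
```

In this toy run only `zh` is flagged: its retention is 2% against a 30% baseline, the signature of a reference LM or language-ID stage misfiring on that language.
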

Past all these mechanics, the deeper lesson of quality filtering is that the model being pretrained never sees raw reality. It sees the world through the filter cascade your team built, and every choice you make in this section is permanently encoded in what the model thinks “good text” means. The filter is the model's upbringing. Treat it accordingly.

What the next section adds. Section 8.4, Data Mixing and Domain Weighting, takes the filtered corpora produced here (per-domain: web, code, math, books, news, multilingual) and answers the question of how much of each to feed the model, and in what order. Quality filtering decides what each domain looks like; data mixing decides how loudly each domain speaks.