Chapter 8

Data Contamination Detection

Data: The Invisible Foundation

We have spent six sections building a pipeline that turns raw internet text into a clean, mixed, scheduled stream of 14.8T training tokens. One question remains, and it is the one that decides whether the model you ship is honest. How do you make sure your model has never seen the test?

The thesis of this section. Benchmark scores are the public-facing measure of a frontier model's quality. If a single copy of an MMLU question, a HumanEval prompt, or a GSM8K problem ends up in the pretraining corpus, the model can memorise the answer verbatim and post inflated scores that mean nothing. Contamination detection is the audit that keeps the leaderboard honest — and at 14.8T-token scale it is a non-trivial systems problem.

The Real Problem: A Test the Model Already Saw

Imagine you spend $5M of GPU time training a 671B-parameter model. You evaluate it on MMLU, GSM8K, HumanEval, MATH, BBH, and a dozen other public benchmarks. The scores look great — better than any previous open model. You ship. Within a week, an independent researcher discovers that one of the benchmark questions appeared verbatim in a Stack Exchange post that was crawled by Common Crawl in 2023. With an effective epoch count of ~8 on that document (because it was upweighted as a high-quality math source — see section 8.4), the model saw the question and its answer eight times. The benchmark score is meaningless. Your release is dead.

This is not a hypothetical. Three real cases:

| Incident | What leaked | Effect |
| --- | --- | --- |
| GPT-3 contamination (2020) | An estimated 13–80% of test items in several public benchmarks appeared in raw Common Crawl. | The GPT-3 paper introduced the 13-gram contamination filter and reported "clean" vs "dirty" scores for every benchmark. |
| Llama-2 / Llama-3 audit (2024) | Independent reviewers found verbatim MMLU items in The Pile and in early Llama-2 SFT data. | Llama-3 added per-benchmark Bloom filters and a public contamination report appendix. |
| Open-source LLM leaderboard (2023–2024) | Several top-of-leaderboard fine-tunes were found to have trained on the eval set itself. | HuggingFace introduced a contamination check as part of the leaderboard submission flow. |

The pattern is consistent. Models that win leaderboards by margins that look too good to be true usually are too good to be true — and the explanation, more often than architecture or scale, is that a few thousand benchmark items leaked through the data pipeline. The contamination scanner is the last line of defence between an honest evaluation and an embarrassing retraction.

Why this matters at frontier scale. A 70B-parameter model can memorise a verbatim string after roughly 2–4 effective epochs. A 671B model can memorise after just one. The bigger the model, the smaller a leak it needs to convert a benchmark question into a memorised lookup. Contamination tolerance shrinks with model size — exactly the opposite of what intuition suggests.

Intuition: Catching Plagiarism at Web Scale

The mental model is academic plagiarism detection — but inverted. Plagiarism software checks whether a student paper overlaps with a known reference corpus. Contamination detection checks whether a known eval set (the "reference") overlaps with a candidate training corpus (the "student"). Same math, opposite roles.

Three constraints make this harder than classroom plagiarism:

  1. Scale. A plagiarism checker handles a few thousand papers against a few million references. We handle trillions of training tokens against tens of thousands of eval questions. The naive all-pairs comparison is $O(N \cdot M)$, which is completely impossible at 14.8T × 50k.
  2. Paraphrase tolerance. The leak is rarely verbatim. A Stack Overflow post will rephrase the eval prompt, drop a word, add a comment. We need to catch "the same question with one word changed" — but we must not catch "a totally different question about the same topic." This is the fundamental tension n-gram length controls.
  3. Streaming. We cannot load the corpus into memory. We get one document at a time, must make a keep/drop decision in milliseconds, and move on. The eval index lives in RAM; the training corpus streams past.

The dominant technique in 2020–2025 (used by GPT-3, Llama-3, DeepSeek-V3) is n-gram overlap detection: chop both sides into fixed-length windows of $n$ tokens and count how many of the eval's windows appear in the training document. Simple, fast, and surprisingly hard to defeat.

The Mathematics of N-Gram Overlap

Let $d_e$ be an eval document and $d_t$ a candidate training document. After whitespace tokenisation we have two token sequences. Define the set of n-grams of a sequence $s = (s_1, \dots, s_L)$ as:

$$G_n(s) = \{ (s_i, s_{i+1}, \dots, s_{i+n-1}) : 1 \leq i \leq L - n + 1 \}$$

The contamination ratio of $d_t$ with respect to $d_e$ is the fraction of eval n-grams that also appear in the training document:

$$c_n(d_e, d_t) = \frac{|G_n(d_e) \cap G_n(d_t)|}{|G_n(d_e)|}$$

This is a number in $[0, 1]$. $c_n = 0$ means no overlap; $c_n = 1$ means every n-gram of the eval doc also appears in the training doc: a total verbatim leak. The decision rule is a pair of thresholds:

$$\text{verdict}(d_t) = \begin{cases} \text{DROP} & \text{if } \max_{e} c_n(d_e, d_t) \geq \tau_{\text{drop}} \\ \text{FLAG} & \text{if } \tau_{\text{flag}} \leq \max_e c_n(d_e, d_t) < \tau_{\text{drop}} \\ \text{KEEP} & \text{otherwise} \end{cases}$$

The standard choices in published frontier models are $n = 13$, $\tau_{\text{flag}} = 0.10$, $\tau_{\text{drop}} = 0.50$. The $\max_e$ over eval documents is critical: a document that leaks 80% of one benchmark and 0% of others must still drop.
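The definitions above translate almost line for line into code. The following is a minimal sketch, not a production scanner: the function names are illustrative, and the toy documents use $n = 3$ only so the overlap is easy to check by eye.

```python
def ngram_set(tokens, n):
    """G_n(s): the set of all length-n windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(eval_tokens, train_tokens, n):
    """c_n(d_e, d_t): fraction of eval n-grams that also occur in the training doc."""
    eval_grams = ngram_set(eval_tokens, n)
    if not eval_grams:
        # The eval doc is shorter than n tokens, so the ratio is undefined.
        # Reporting 0.0 mirrors what a scanner sees: nothing to match.
        return 0.0
    return len(eval_grams & ngram_set(train_tokens, n)) / len(eval_grams)


def verdict(train_tokens, eval_docs, n=13, tau_flag=0.10, tau_drop=0.50):
    """Three-way decision based on the WORST (maximum) per-eval ratio."""
    worst = max((contamination_ratio(e, train_tokens, n) for e in eval_docs),
                default=0.0)
    if worst >= tau_drop:
        return "DROP", worst
    if worst >= tau_flag:
        return "FLAG", worst
    return "KEEP", worst


# Toy data (made up for illustration), scanned with n = 3.
eval_docs = ["the quick brown fox jumps".split()]
leaky = "he said the quick brown fox sleeps".split()   # shares 2 of 3 eval trigrams
partial = "the quick brown cat".split()                # shares 1 of 3 eval trigrams
clean = "a fox jumps high".split()                     # shares none

for name, doc in [("leaky", leaky), ("partial", partial), ("clean", clean)]:
    print(name, verdict(doc, eval_docs, n=3))
```

Taking the maximum inside `verdict` is what makes a single badly leaked benchmark decisive, regardless of how many other benchmarks the document leaves untouched.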

Why the value of n matters

Two probabilities compete inside the choice of $n$. Let $p$ be the probability that a single token matches by chance: about $1/V$ for vocabulary size $V \approx 50{,}000$ under a crude uniform model. The probability that an entire n-gram collides by chance is roughly $p^n$, which is astronomically small for $n = 13$. So a 13-gram match is almost never coincidence; it is a real leak. At the same time, a single substituted word breaks only the $n$ windows that contain it, so windows elsewhere in the document still match. The filter therefore stays paraphrase-tolerant as long as substitutions are sparser than about one per $n$ tokens; once every window contains a substitution, no n-gram survives.

Drop $n$ to 8 and the chance of a coincidental collision rises (false positives), but the filter catches heavier paraphrases. Raise $n$ to 25 and you catch only verbatim copies (false negatives explode). The sweet spot at 13 was found empirically by the GPT-3 team and has stuck.
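To put numbers on the chance-collision side of this trade-off, here is a small sketch under the same crude assumption used above (tokens drawn independently and uniformly from a vocabulary of about 50,000). Real text is far from uniform, so treat these as order-of-magnitude intuition rather than measured collision rates.

```python
import math


def log10_collision_probability(vocab_size: int, n: int) -> float:
    """log10 of p**n, where p = 1 / vocab_size is the per-token match chance."""
    return -n * math.log10(vocab_size)


for n in (8, 13, 25):
    exponent = log10_collision_probability(50_000, n)
    print(f"n={n:2d}  chance collision ~ 10^{exponent:.1f}")
```

Under this model the exponent scales linearly with $n$, so moving from 8 to 13 tokens buys roughly 23 additional orders of magnitude of safety against coincidental matches.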

Why we use a Jaccard-style ratio, not edit distance

Edit distance and embedding similarity would give a more semantically-aware measure of leakage. They are also $O(L^2)$ or worse per pairwise comparison, which is completely intractable at 14.8T tokens. N-gram set intersection is $O(L)$ with hash-set lookup. (Strictly, $c_n$ is a containment ratio: it normalises by the eval side only, whereas true Jaccard similarity divides by the union of both gram sets.) Frontier labs run a fast n-gram pre-filter to settle the obvious 99% of documents, then an expensive embedding-based scan over the remaining 1% for the borderline cases. We focus on the n-gram pre-filter because it does the heavy lifting.

Manual Numerical Walkthrough

Let us compute the contamination ratio by hand for two real-looking documents at $n = 5$ (small enough to do on paper, large enough to be meaningful).

Eval doc. "write a python function that returns the sum of all even numbers"

After whitespace tokenisation: [write, a, python, function, that, returns, the, sum, of, all, even, numbers] — 12 tokens.

Step 1 — 5-grams of the eval doc. With 12 tokens and $n = 5$ we get $12 - 5 + 1 = 8$ grams:

  1. (write, a, python, function, that)
  2. (a, python, function, that, returns)
  3. (python, function, that, returns, the)
  4. (function, that, returns, the, sum)
  5. (that, returns, the, sum, of)
  6. (returns, the, sum, of, all)
  7. (the, sum, of, all, even)
  8. (sum, of, all, even, numbers)

Train doc. "solution: write a python function that returns the sum of all even numbers in a list"

Tokens (16): [solution, write, a, python, function, that, returns, the, sum, of, all, even, numbers, in, a, list]. Stripping the colon turns "solution:" into "solution".

Step 2 — 5-grams of the training doc. With 16 tokens we get $16 - 5 + 1 = 12$ grams. The ones that overlap with the eval doc's 5-grams are:

  • (write, a, python, function, that) ← matches eval gram 1
  • (a, python, function, that, returns) ← matches eval gram 2
  • (python, function, that, returns, the) ← matches eval gram 3
  • (function, that, returns, the, sum) ← matches eval gram 4
  • (that, returns, the, sum, of) ← matches eval gram 5
  • (returns, the, sum, of, all) ← matches eval gram 6
  • (the, sum, of, all, even) ← matches eval gram 7
  • (sum, of, all, even, numbers) ← matches eval gram 8

All 8 eval grams appear in the training doc.

Step 3 — contamination ratio. $c_5 = |G_5(d_e) \cap G_5(d_t)| / |G_5(d_e)| = 8 / 8 = 1.00$. Verdict: DROP. The training doc contains the eval prompt verbatim with only a few surrounding words; the model would memorise it on the first epoch.

Step 4 — what changes at n = 13? With 12 tokens in the eval doc and $n = 13$, there are $12 - 13 + 1 = 0$ grams: the eval doc is too short to produce any 13-grams. This is the most common footgun: short eval items (single-sentence MMLU questions) silently slip past a 13-gram filter because they cannot fit one full window. Production scanners use a dual filter: 13-gram for paragraph-length items, 8-gram for short items.

Step 5 — what changes for a paraphrase? Suppose the training doc says "...write a python routine that returns the sum..." (substituting "routine" for "function"). With $n = 5$, the four 5-grams that contain "function" no longer match. We get 4 / 8 = 0.50, still a DROP. With $n = 13$ (if the docs were longer), a single substitution kills at most 13 consecutive windows out of the $L - 12$ windows an $L$-token eval doc produces, so a 1-word swap lowers the contamination ratio from 1.0 to roughly $(L - 25) / (L - 12)$: about 0.85 for a 100-token doc. Each further well-separated swap costs another $13 / 88 \approx 0.15$. Three swaps leave about 0.56, still above the 0.5 drop threshold; four to six swaps land in FLAG territory (about 0.41 down to 0.11); seven or more is a KEEP. That is the paraphrase-tolerance budget the 13-gram filter actually gives you.
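The hand computation can be replayed in a few lines. This is a sketch with illustrative helper names; `swapped_ratio` builds a synthetic document of distinct tokens and assumes the swaps are spaced exactly $n$ tokens apart, which is the case where each swap destroys the maximum number of windows.

```python
def tokenize(text):
    """Lowercase, strip the punctuation used in the walkthrough, split on whitespace."""
    for ch in ",.?:":
        text = text.replace(ch, " ")
    return text.lower().split()


def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ratio(eval_tokens, train_tokens, n):
    eval_grams = ngram_set(eval_tokens, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngram_set(train_tokens, n)) / len(eval_grams)


eval_doc = tokenize("write a python function that returns the sum of all even numbers")
train_doc = tokenize("solution: write a python function that returns "
                     "the sum of all even numbers in a list")
paraphrase = tokenize("solution: write a python routine that returns "
                      "the sum of all even numbers in a list")

print(len(ngram_set(eval_doc, 5)))      # 8 eval 5-grams
print(ratio(eval_doc, train_doc, 5))    # 1.0 (verbatim leak)
print(ratio(eval_doc, paraphrase, 5))   # 0.5 (one word swapped)
print(len(ngram_set(eval_doc, 13)))     # 0 (eval doc too short for n = 13)


def swapped_ratio(length, n, k):
    """Ratio after k one-word swaps, spaced n apart, in a doc of distinct tokens."""
    original = [f"t{i}" for i in range(length)]
    swapped = list(original)
    for j in range(k):
        pos = (n - 1) + j * n           # each swap kills its own block of n windows
        if pos < length:
            swapped[pos] = f"x{j}"
    return ratio(original, swapped, n)


for k in range(8):
    print(k, round(swapped_ratio(100, 13, k), 3))
```

Swaps that sit closer together than $n$ tokens share some of the windows they destroy, so clustered edits cost less than this worst case suggests.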

Visualizing Contamination

The scanner below pairs three real-looking eval/training pairs against each other. Drag the n-gram slider to watch the contamination ratio change: short n catches paraphrases at the cost of false positives, long n catches only verbatim leaks. The badge at the top shows the production verdict — green KEEP, amber FLAG, red DROP — at the current thresholds.

Three things to internalise from the sandbox. First, on the clean example the contamination ratio is exactly zero for any $n \geq 4$: clean training data does not accidentally collide with the benchmark, no matter the n. Second, on the paraphrase example the ratio decays sharply with $n$: short n flags it loudly, long n misses it entirely. Third, on the verbatim leak the ratio sits near 1.0 for every $n$ the document is long enough to support, so the leak is unmistakable. The engineering question is where to draw the line so paraphrases are flagged and topical-similar docs are kept.
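For readers without the interactive widget, the same sweep can be approximated offline. The three documents below are made up for this sketch (they are not the sandbox's own examples), and the paraphrase differs from the eval item in exactly two words.

```python
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ratio(eval_text, train_text, n):
    eval_grams = ngram_set(eval_text.split(), n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngram_set(train_text.split(), n)) / len(eval_grams)


# Hypothetical eval item and three candidate training documents.
EVAL_ITEM = "write a python function that returns the sum of all even numbers in a list"
PAIRS = {
    "clean": "when teaching ratios ask students to draw a bar model for each month",
    "paraphrase": "write a python routine that returns the total of all even numbers in a list",
    "verbatim": "solution write a python function that returns the sum of all "
                "even numbers in a list using a generator",
}


def sweep(n_values=range(3, 14)):
    """Contamination ratio of each candidate document for every n in n_values."""
    return {name: [ratio(EVAL_ITEM, doc, n) for n in n_values]
            for name, doc in PAIRS.items()}


print("n         ", " ".join(f"{n:4d}" for n in range(3, 14)))
for name, ratios in sweep().items():
    print(f"{name:10s}", " ".join(f"{r:4.2f}" for r in ratios))
```

Reading the printed rows left to right reproduces the slider: the verbatim row stays flat, the clean row stays at zero, and the paraphrase row falls away as the window grows.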

Plain Python: An N-Gram Contamination Scanner

Below is the complete contamination scanner in plain Python: no external libraries, no GPUs. Three benchmark items, three candidate training docs, and a per-document verdict. It follows the 13-gram overlap approach described in the GPT-3 paper (Brown et al., 2020, section 4 and Appendix C).

contamination_scanner.py

EVAL: the eval set is the thing we are protecting

Three benchmark-style questions standing in for MMLU, HumanEval, GSM8K. The scanner's job is to make sure no training document leaks any of these. In production this list has 50–100k items pulled from a few dozen benchmarks; the scanner runs against trillions of training tokens.

Execution state: len(EVAL) = 3; EVAL[1] is the HumanEval-style prompt.

TRAIN: candidate training documents

Three short training documents. The middle one is the dangerous case: a Stack Overflow-style solution that quotes the eval prompt verbatim. The first is topically related but not a leak. The third is clean. Production scanners process these one at a time as the training corpus streams through.

N = 13 is the production default

GPT-3 used n=13 for its contamination filter; Llama-3 and DeepSeek-V3 adopted the same value. Why 13? It is long enough that random English prose almost never produces a match by chance, but short enough that a lightly paraphrased copy still leaves many windows intact (a one-word change breaks only the windows that contain it). Drop to n=8 for stricter scanning at the cost of false positives.

Execution state: N = 13.

tokenize: tokenization for n-gram matching

Lowercase, strip basic punctuation, split on whitespace. This is INTENTIONALLY simpler than the model's BPE tokenizer: we want surface-form overlap, not tokeniser-quirk overlap. A different model with a different tokeniser must still see the same contamination verdict.

eval_grams: build the eval n-gram index ONCE

This is the single most important engineering decision in the scanner. The eval set is tiny (MBs), the training corpus is huge (TBs). Indexing the small side and STREAMING the big side past it turns an O(eval × train) problem into O(eval) memory + O(train) time. The index is a dict from gram → set of eval-doc-ids, so we also know WHICH benchmark each hit belongs to.

Execution state: in production, roughly 50–500 grams per eval doc. In this toy run, EVAL[0] has only 12 tokens and contributes no 13-grams, EVAL[1] contributes 3, and EVAL[2] contributes 6.

scan: stream training docs past the index

For each training document, compute its n-grams and look each one up in eval_grams. The lookup is O(1) hash-set membership. The whole scan is therefore O(total training grams), linear in corpus size. At 14.8T tokens with n=13, this is about 14.8T gram lookups: heavy but achievable in a day on a few thousand CPU cores.

hits_per_eval: bucket hits per eval document

A single training document can be contaminated with multiple benchmarks at once (a Wikipedia page might leak both an MMLU question and a TriviaQA question). We track hits PER benchmark so the verdict reports which benchmark is at risk, not just an aggregate. This per-benchmark breakdown is what makes contamination logs actionable.

Execution state: hits_per_eval (train[1]) = {1: 3}.

ratios: per-eval contamination ratio is the headline number

Numerator: how many distinct n-grams from eval doc i appear in this training doc. Denominator: total distinct n-grams in eval doc i. So the ratio is 'what fraction of the benchmark question is contained in this training document?' 0.05 = the doc shares a short fact with the benchmark. 1.00 = the doc contains the benchmark verbatim.

Execution state: ratios (train[1]) = {1: 1.0}.

verdict: three-way decision (drop, flag, keep)

GPT-3 and Llama-3 use roughly these thresholds: drop documents with ≥50% overlap against any benchmark, flag everything ≥10% for human or LLM-based review, keep the rest. The drop threshold is generous: even 30% overlap is enough for the model to memorise the benchmark question structure with heavy training. Conservative labs (Anthropic, DeepMind) report using lower drop thresholds.

worst_eval is the decision lever

We choose the maximum over benchmarks because contamination is per-eval: a document that leaks 80% of GSM8K but 0% of MMLU is still a leak. The aggregate 'overall' ratio across all eval docs is misleading because a small leak on a big eval can drown out a huge leak on a small eval.

Execution state: verdict (train[1]) = 'DROP'.

```python
from collections import defaultdict

# The benchmark set we want to protect: the questions the model
# will be GRADED on later. We must NEVER let the model see these
# (or close paraphrases) during pretraining.
EVAL = [
    "What is the capital of Australia? The capital of Australia is Canberra.",
    "Write a Python function that returns the sum of all even numbers in a list.",
    "Natalia sold clips to 48 of her friends in April and then sold half as many in May.",
]

# A small batch of candidate training documents.
TRAIN = [
    "Canberra, the capital of Australia, was purpose-built between 1913 and 1927 ...",
    "Solution: write a Python function that returns the sum of all even numbers in a list. "
    "The cleanest version is sum(x for x in nums if x % 2 == 0).",
    "When teaching ratios, ask students to draw a bar model for each month.",
]

N = 13                # n-gram length used by GPT-3, Llama-3, DeepSeek
FLAG_THRESHOLD = 0.10 # >=10% overlap => human review
DROP_THRESHOLD = 0.50 # >=50% overlap => discard document

def tokenize(s):
    # Lowercase, strip basic punctuation, split on whitespace.
    return s.lower().replace(",", " ").replace(".", " ").replace("?", " ").split()

def ngrams(toks, n):
    # All length-n sliding windows; empty when there are fewer than n tokens.
    return [tuple(toks[i:i+n]) for i in range(len(toks) - n + 1)]

# 1) Build the EVAL n-gram index ONCE, up front.
#    Key insight: this is the small side of the problem (MB),
#    the training corpus is the big side (TB).
eval_grams = defaultdict(set)   # gram -> {eval_doc_ids that contain it}
eval_gram_counts = {}           # eval_doc_id -> |G_n(d_e)|, distinct grams
for doc_id, text in enumerate(EVAL):
    doc_grams = set(ngrams(tokenize(text), N))
    eval_gram_counts[doc_id] = len(doc_grams)   # 0 if the item is shorter than N
    for g in doc_grams:
        eval_grams[g].add(doc_id)

# 2) Stream over training docs. For each doc, count how many of
#    its n-grams collide with the index, then bucket per eval doc.
def scan(train_text):
    # Use the SET of grams, matching the definition of c_n: a gram that
    # repeats in the training doc must not be counted twice.
    grams = set(ngrams(tokenize(train_text), N))
    if not grams:
        return 0.0, {}
    hits_per_eval = defaultdict(int)
    matched = 0
    for g in grams:
        if g in eval_grams:
            matched += 1
            for eid in eval_grams[g]:
                hits_per_eval[eid] += 1
    # Per-eval contamination = grams shared with eval_i / |eval_i grams|
    ratios = {eid: hits / eval_gram_counts[eid]
              for eid, hits in hits_per_eval.items()}
    overall = matched / len(grams)
    return overall, ratios

# 3) Decide: drop, flag, or keep.
for tid, text in enumerate(TRAIN):
    overall, ratios = scan(text)
    worst_eval = max(ratios.values(), default=0.0)
    if worst_eval >= DROP_THRESHOLD:
        verdict = "DROP"
    elif worst_eval >= FLAG_THRESHOLD:
        verdict = "FLAG"
    else:
        verdict = "KEEP"
    print(f"train[{tid}] verdict={verdict:4s} worst_eval_overlap={worst_eval:.3f}")
```

Two structural details worth a second look. First, the index is built on the eval set, not on the training set. This is the difference between a 30-second job and a multi-day job. The eval set is the small side and never changes; index it once, query it billions of times. Second, the per-eval breakdown (hits_per_eval and ratios inside scan) lets downstream tooling answer the question "which benchmark did this document leak?" Without it, the contamination report is just a list of dropped documents with no actionable signal.

Sanity check yourself. Run this scanner with EVAL = TRAIN (set the eval and training corpora to the same list). Every document with at least N tokens should report a contamination ratio of exactly 1.0 against its matching eval doc. Documents shorter than N tokens produce no n-grams and report 0.0, which is the short-item blind spot rather than a bug. If a long document comes back at 0.0, the n-gram extraction is broken, almost always an off-by-one in the sliding window.
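A compact version of that self-check, written as a standalone sketch so it does not depend on the listing's globals. The function name `self_scan` is illustrative; it indexes a list of documents as the eval set and then scans the same documents as training data.

```python
def tokenize(s):
    return s.lower().replace(",", " ").replace(".", " ").replace("?", " ").split()


def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def self_scan(docs, n):
    """Return {doc_id: ratio against itself}, or None where a doc has no n-grams."""
    gram_sets = [ngram_set(tokenize(d), n) for d in docs]
    index = {}
    for doc_id, grams in enumerate(gram_sets):
        for g in grams:
            index.setdefault(g, set()).add(doc_id)
    results = {}
    for doc_id, grams in enumerate(gram_sets):
        if not grams:
            results[doc_id] = None          # shorter than n tokens: nothing to match
            continue
        hits = sum(1 for g in grams if doc_id in index.get(g, ()))
        results[doc_id] = hits / len(grams)
    return results


DOCS = [
    "Canberra, the capital of Australia, was purpose-built between 1913 and 1927 ...",
    "Solution: write a Python function that returns the sum of all even numbers in a list. "
    "The cleanest version is sum(x for x in nums if x % 2 == 0).",
    "When teaching ratios, ask students to draw a bar model for each month.",
]

results = self_scan(DOCS, n=13)
for doc_id, r in results.items():
    status = "too short for n-grams" if r is None else f"self-overlap {r:.2f}"
    print(f"doc[{doc_id}]: {status}")
```

Any value other than 1.0 or None here points at the windowing or tokenisation code, not at the data.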

PyTorch: Bloom Filters and Sharded Scans at Scale

At 14.8T tokens the dict-of-grams approach hits two walls: memory (the eval gram set grows when you index more benchmarks) and throughput (every gram lookup is a Python hash + string compare). The production version swaps the dict for a Bloom filter and wraps the scan in a sharded IterableDataset that runs across hundreds of CPU workers.

contamination_scaled.py

BloomFilter: a Bloom filter replaces the dict at scale

A Python dict of 13-gram → eval-doc-ids works for thousands of eval grams. At hundreds of thousands of eval grams it still fits in RAM, but the per-lookup cost (string hash + dict probe) becomes the bottleneck at 14.8T grams. A Bloom filter is a fixed-size bitset with k hash functions: sub-microsecond lookups, a few GB of memory, and a tunable false-positive rate.

Execution state: n_bits = 2³⁴ ≈ 1.7 × 10¹⁰; memory footprint ≈ 2 GB when the bits are packed eight to a byte.

_hashes: k hash positions from one SHA-256

Cheapest way to get k hash functions: take ONE strong hash (SHA-256 gives us 32 bytes) and derive all k positions from it. Chopping the digest into 4-byte words does not work for this table size, because a 32-bit word can only address the first 2³² of the 2³⁴ bits. The listing instead takes two 64-bit words from the digest and combines them by double hashing, position_i = (h1 + i · h2) mod n_bits. k=7 is a conventional choice; with a 2³⁴-bit table holding about a million items the filter is heavily over-provisioned, so the false-positive rate sits far below the 10⁻⁴ target.

add: set all k bits

Insertion is trivial. We never store the gram itself, only the k bit positions it lights up. This is why a Bloom filter cannot enumerate its contents: you can ASK 'is x in?' but you cannot LIST what is in. For contamination detection that is exactly what we want: the filter is a one-way oracle.

__contains__: check all k bits

Membership test: gram is 'in' iff all k of its bit positions are set. False negatives are impossible (an inserted gram lit those bits). False positives happen when other inserted grams have, between them, set all k of those bits; sizing n_bits keeps the false-positive rate below ~10⁻⁴. The downstream cost of a false positive is dropping a clean training document, which is acceptable.

Execution state: false-positive rate (target) ≤ 10⁻⁴.

Build the filter ONCE on a head node

The eval Bloom filter is constructed centrally, then broadcast (read-only) to every worker. This is critical: a worker that builds its own filter from a different eval snapshot will produce inconsistent contamination decisions across shards. ONE filter, immutable, broadcast.

ContaminationFilter: each worker streams its own shards

ContaminationFilter is an IterableDataset, the same abstraction as the mixture sampler in section 8.4. With num_workers=128, 128 processes each chew through their own assigned shards, share the read-only Bloom filter, and emit clean documents. The listing splits shard_paths by worker id; without that split every worker would scan every shard. There is no cross-worker coordination, so it scales linearly with workers up to the I/O ceiling.

Per-document gram count and hit ratio

For each candidate training document, we compute its 13-grams (typically a few hundred to a few thousand), count how many test positive against the Bloom filter, and divide. Bloom-filter lookups are O(k), about 7 bit reads per gram. A 1000-gram document is filtered in ~7000 bit reads, sub-millisecond. Note that this ratio is the fraction of the training document's grams that hit the filter, which is not the same quantity as $c_n$: a Bloom filter stores no eval-doc ids, so it cannot report per-eval coverage. A long document that quotes one short benchmark item scores low on this ratio, so one option is to route any document with hits to the exact dict-based scanner from the previous listing.

Execution state: typical grams per training doc ≈ 500–5000; share of contaminated docs ≈ 0.01–0.05%.

log_dropped: forensic trail on every dropped document

Never silently drop a document. log_dropped writes (doc-hash, hit-ratio, top-matching-eval) to a side log. Two reasons: (1) auditors will ask which docs were removed and why; (2) the drop log lets you re-apply a looser threshold later without re-scanning the whole corpus. DeepSeek's V3 paper reports keeping a 2 GB drop log per training run.

Contamination filtering happens at CORPUS-BUILD time

Critical architectural choice. We do NOT filter at training time: every training step would re-scan the same document, wasting GPUs on contamination checks. Instead, we run this DataLoader once over the raw corpus, write clean shards to disk, and the actual training loop reads only from the clean shards. Filter once, train many times.

Standard DataLoader pattern

Same DataLoader skeleton as everywhere else in the pretraining stack. batch_size=None disables automatic batching, so each document passes through unchanged; we are not batching for the GPU here. num_workers=128 turns this into a 128-way embarrassingly parallel job. This scans about 1 TB/hour of raw text per node against a 1M-eval-gram filter.

Execution state: throughput ≈ 1 TB / hour / node.

```python
import hashlib
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

# Pipeline helpers assumed to exist elsewhere (not shown in this section):
#   load_eval_corpus(), document_ngrams(doc, n) -> iterable of bytes grams,
#   stream_documents(path), log_dropped(doc, ratio), write_to_clean_shard(doc)

# At 14.8T tokens the per-lookup cost of a Python dict becomes the
# bottleneck, so we move to a Bloom filter, a compact probabilistic set.
class BloomFilter:
    def __init__(self, n_bits: int, n_hashes: int):
        # n_bits ~= 2 ** 34 for a few hundred thousand eval grams
        # with false-positive rate < 1e-4. Bits are packed 8 per byte:
        # 2 ** 34 bits = 2 GiB (a torch.bool tensor would need 16 GiB).
        self.bits = torch.zeros((n_bits + 7) // 8, dtype=torch.uint8)
        self.n_bits = n_bits
        self.n_hashes = n_hashes

    def _hashes(self, gram: bytes):
        # k positions from one SHA-256 via double hashing. Two 64-bit words
        # cover the whole table; 4-byte words would stop at bit 2 ** 32.
        d = hashlib.sha256(gram).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1   # odd step, never zero
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, gram: bytes):
        for h in self._hashes(gram):
            byte = int(self.bits[h >> 3])
            self.bits[h >> 3] = byte | (1 << (h & 7))

    def __contains__(self, gram: bytes) -> bool:
        return all((int(self.bits[h >> 3]) >> (h & 7)) & 1
                   for h in self._hashes(gram))


# Each worker streams its own slice of the training shards.
class ContaminationFilter(IterableDataset):
    def __init__(self, shard_paths, bloom, n=13, drop_thresh=0.5):
        self.shard_paths = shard_paths
        self.bloom = bloom
        self.n = n
        self.drop_thresh = drop_thresh

    def __iter__(self):
        # Every DataLoader worker holds a copy of this dataset, so split the
        # shard list by worker id; otherwise each shard is scanned by all workers.
        info = get_worker_info()
        paths = self.shard_paths
        if info is not None:
            paths = paths[info.id::info.num_workers]
        for path in paths:
            for doc in stream_documents(path):       # one doc at a time
                grams = list(document_ngrams(doc, self.n))
                if not grams:
                    continue
                hits = sum(1 for g in grams if g in self.bloom)
                # Fraction of THIS document's grams that hit the filter.
                # Not c_n: the filter stores no eval-doc ids.
                ratio = hits / len(grams)
                if ratio >= self.drop_thresh:
                    log_dropped(doc, ratio)          # forensic trail
                    continue
                yield doc                            # keep this doc


def main():
    # 1) Build the eval Bloom filter ONCE on a head node.
    eval_bloom = BloomFilter(n_bits=2**34, n_hashes=7)
    for doc in load_eval_corpus():            # ~ a few MB
        for gram in document_ngrams(doc, n=13):
            eval_bloom.add(gram)

    # 2) Wire into the standard DataLoader so contamination filtering
    #    happens at corpus-build time, NOT at training time.
    shards = ["/data/web/0.bin", "/data/web/1.bin", ...]   # remaining paths elided
    clean_stream = ContaminationFilter(shards, eval_bloom)
    # batch_size=None turns off auto-batching: one document per iteration.
    loader = DataLoader(clean_stream, batch_size=None, num_workers=128)
    for clean_doc in loader:
        write_to_clean_shard(clean_doc)


if __name__ == "__main__":
    main()
```
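How large does the bitset actually need to be? The textbook Bloom-filter approximations give a quick answer. The sketch below uses only the standard formulas (false-positive rate $(1 - e^{-kn/m})^k$, optimal $k = (m/n)\ln 2$, and minimum size $m = -n \ln p / (\ln 2)^2$); the function names are illustrative.

```python
import math


def bloom_false_positive_rate(n_bits: int, n_items: int, n_hashes: int) -> float:
    """Standard approximation: (1 - exp(-k * n / m)) ** k."""
    return (-math.expm1(-n_hashes * n_items / n_bits)) ** n_hashes


def optimal_hashes(n_bits: int, n_items: int) -> int:
    """k that minimises the false-positive rate for a given table size."""
    return max(1, round(n_bits / n_items * math.log(2)))


def bits_needed(n_items: int, target_fp: float) -> int:
    """Smallest table (at the optimal k) that meets target_fp."""
    return math.ceil(-n_items * math.log(target_fp) / math.log(2) ** 2)


n_items = 10**6
print(bloom_false_positive_rate(2**34, n_items, 7))    # rate for the listing's sizing
m = bits_needed(n_items, 1e-4)
print(m, optimal_hashes(m, n_items))                    # minimal table for a 1e-4 target
```

For a million grams and a 10⁻⁴ target, the minimum comes out at roughly 1.9 × 10⁷ bits, a few megabytes. A 2³⁴-bit table therefore leaves a great deal of headroom for indexing more benchmarks before the false-positive rate becomes a concern.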

Three subtleties worth marking, all about how this filter interacts with the rest of the data pipeline:

  1. The Bloom filter is read-only after construction. We build it on a head node, write it to a shared blob, and every worker memory-maps the same bitset. There is no synchronisation, no consistency problem — just a giant read-only array. This is the single design choice that turns the scanner from "parallelisable in principle" to "embarrassingly parallel."
  2. Filtering is corpus-build-time, not training-time. The DataLoader you see above runs ONCE, before training. Its output is a stream of clean documents written to a new set of shards. The training loop later reads only from those clean shards — no contamination logic anywhere near the gradient updates. Mixing the two is the most common architectural mistake.
  3. The drop log is the audit trail. log_dropped(doc, ratio) writes (doc-hash, contamination-ratio, top-matching-eval-id) to a side-channel log. DeepSeek-V3's release report includes this log as a public appendix, so anyone can verify which documents were filtered and why. The log is also how you apply a looser threshold later without re-processing the corpus: re-filter the log itself and restore whatever now passes. Tightening the threshold works the same way only if flagged documents and their ratios are logged too, since the drop log alone never saw them.

Implementation note on multi-benchmark scans. Llama-3 and DeepSeek-V3 maintain ONE Bloom filter per benchmark group (MMLU, HumanEval, GSM8K, MATH, BBH, ...) rather than one giant filter for all of them. The reason is observability: a per-benchmark filter lets you report "dropped 0.013% of corpus due to MMLU contamination, 0.0004% due to HumanEval, ...", which is exactly the table that ends up in the release-report appendix. The cost is one Bloom-filter lookup per benchmark for every gram instead of a single lookup, but at 7 hashes per Bloom filter and ~20 benchmarks, that is still < 200ns per gram in C.
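The per-benchmark idea can be sketched in pure Python with one small filter per benchmark group. Everything here is illustrative: `TinyBloom` is a stand-in backed by a bytearray, the benchmark names and items are made up, and $n = 5$ is used only because the toy items are short.

```python
import hashlib
from collections import Counter


class TinyBloom:
    """Small bytearray-backed Bloom filter using double hashing."""

    def __init__(self, n_bits=1 << 20, n_hashes=7):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: bytes):
        d = hashlib.sha256(item).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes) -> bool:
        return all((self.bits[p >> 3] >> (p & 7)) & 1 for p in self._positions(item))


def grams(text, n):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]).encode("utf-8") for i in range(len(toks) - n + 1)]


def build_filters(benchmarks, n):
    """One filter per benchmark group, so every hit can be attributed."""
    filters = {}
    for name, items in benchmarks.items():
        bloom = TinyBloom()
        for item in items:
            for g in grams(item, n):
                bloom.add(g)
        filters[name] = bloom
    return filters


def hits_by_benchmark(doc, filters, n):
    counts = Counter()
    for g in grams(doc, n):
        for name, bloom in filters.items():
            if g in bloom:
                counts[name] += 1
    return counts


# Hypothetical benchmark groups for the demo.
BENCHMARKS = {
    "toy_qa": ["what is the capital of australia the capital is canberra"],
    "toy_math": ["natalia sold clips to 48 of her friends in april"],
}
filters = build_filters(BENCHMARKS, n=5)
doc = "homework help natalia sold clips to 48 of her friends in april and may"
print(hits_by_benchmark(doc, filters, n=5))
```

Summing these counters over the whole corpus gives the per-benchmark drop table directly, at the price of one membership test per filter for every gram.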

At Massive Scale: 14.8T Tokens vs Dozens of Benchmarks

Plug the production numbers into the algorithm and the systems story becomes concrete. DeepSeek-V3 pretrains on 14.8T tokens. After BPE tokenisation that is roughly the same number of n-gram windows. The eval suite is ~50k items pulled from ~30 public benchmarks, which produces roughly 1–2 million distinct 13-grams. The full scan is:

| Quantity | Order of magnitude | What it costs |
| --- | --- | --- |
| Training corpus n-grams | ~ 1.5 × 10¹³ | Iterated once. Streamed off disk at ~ 1 TB / hr / node. |
| Eval n-grams in Bloom filter | ~ 10⁶ | Fits in ~ 2 GB of memory; broadcast once to every worker. |
| Bloom-filter lookups | ~ 10⁵ ns per gram (7 hashes × kernel call) | Saturates ~ 1 CPU core per shard worker. |
| Wall-clock at 1024 workers | ~ 18 hours | Roughly $5–10k of CPU time, < 0.5% of total pretraining cost. |
| Dropped documents | ~ 0.01 – 0.05% of the corpus | Tiny by share; huge by impact. These are the docs that would have inflated benchmark scores. |

Two systems-level observations. First, the contamination scan is sub-1% of the total training cost — there is no economic reason ever to skip it. Second, the ~0.03% drop rate sounds tiny until you remember the contamination budget asymmetry: dropping clean documents costs you a fraction of a bit of validation loss; keeping a leaked document inflates a benchmark score by 5–30 percentage points and burns your release credibility. The expected value math is overwhelming in favour of aggressive filtering.

Model-based contamination detection

N-gram filtering catches surface-level overlap. The 2024 generation of frontier models added a second, complementary technique: model-based contamination detection. Run the trained model over the eval question with the answer text masked, and compute its perplexity. Then compute its perplexity on the same question with the words shuffled into a meaningless order. If the perplexity gap is much larger than for fresh, never-seen text of the same length, the model is recognising the eval question — strong evidence it was contaminated even if no n-gram match was found.

Concretely, define the "canary score" $s = \text{ppl}(\text{shuffled}) - \text{ppl}(\text{original})$. For genuinely unseen text, $s$ is small and stable. For contaminated text, $s$ spikes: the model has learned the original ordering. This catches paraphrased leaks that the n-gram filter missed. It is expensive (two forward passes per eval item, one for each ordering) but feasible because the eval suite is small.
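The mechanics of the canary score can be shown end to end with a toy model. The sketch below substitutes an add-one-smoothed bigram model for the trained LLM, skips the answer-masking step, and uses a made-up corpus in which one sentence is deliberately repeated to simulate a leak. All of that is an assumption for illustration; only the score definition comes from the text above.

```python
import math
import random
from collections import Counter


class BigramLM:
    """Add-one-smoothed bigram model: a toy stand-in for the trained LLM."""

    def __init__(self, sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        vocab = {"<s>"}
        for sent in sentences:
            toks = ["<s>"] + sent.split()
            vocab.update(toks)
            for prev, cur in zip(toks, toks[1:]):
                self.unigrams[prev] += 1
                self.bigrams[(prev, cur)] += 1
        self.vocab_size = len(vocab) + 1        # one extra slot for unseen words

    def perplexity(self, tokens):
        toks = ["<s>"] + list(tokens)
        log_prob = 0.0
        for prev, cur in zip(toks, toks[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / len(tokens))


def canary_score(model, text, seed=0):
    """s = ppl(shuffled) - ppl(original), with a seeded, non-identity shuffle."""
    tokens = text.split()
    shuffled = list(tokens)
    rng = random.Random(seed)
    while shuffled == tokens and len(set(tokens)) > 1:
        rng.shuffle(shuffled)
    return model.perplexity(shuffled) - model.perplexity(tokens)


LEAKED = "natalia sold clips to her friends in april"
FRESH = "quantum otters juggle seven purple lanterns nightly"   # all unseen words
corpus = [LEAKED] * 8 + [
    "the weather report said rain tomorrow",
    "bar models help students compare ratios",
]
model = BigramLM(corpus)

print("leaked:", canary_score(model, LEAKED))
print("fresh: ", canary_score(model, FRESH))
```

With a real model the fresh-text score is small rather than exactly flat, so in practice the comparison is against a distribution of scores from known-unseen text of similar length rather than against zero.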

Engineering Reality and Gotchas

Contamination detection looks like a clean algorithmic problem. Five production failure modes earn their flags:

  1. Short eval items slip past long n-grams. A 12-token MMLU question cannot produce any 13-grams: the sliding window requires the document to be at least $n$ tokens long. The fix is a dual filter: 13-gram for long-form benchmarks, 8-gram for short-form. Skip this and roughly 20% of MMLU and TriviaQA items become undetectable.
  2. Tokenisation mismatch between scanner and model. The contamination scanner uses whitespace tokenisation; the model uses BPE. If the scanner ever switches to BPE, the n-gram windows shift, the index is invalidated, and the same training corpus produces different verdicts on different scanner versions. ALWAYS use plain whitespace tokenisation for contamination — it is the stable, model-independent surface form.
  3. Eval-set leakage into SFT and RLHF data. The pretraining contamination scan is necessary but not sufficient. SFT instruction datasets and RLHF preference datasets are frequently constructed by paraphrasing public benchmarks — a tutor-model rephrasing GSM8K, for instance. Run the same contamination scanner against every SFT and RLHF batch before shipping. (See section 13.2 for the SFT-specific version.)
  4. Benchmark drift. Public benchmarks add and remove items over time. The contamination filter must be REBUILT every time the eval suite changes — a stale filter passes new benchmark items through without checking them. Tie filter rebuild to a hash of the eval suite version; refuse to start training if the hash does not match the live eval suite.
  5. Synthetic data contamination by inheritance. Section 8.5 discussed synthetic data — training problems generated by another LLM. If the source LLM was itself contaminated, the synthetic data inherits that contamination, often in a form that surface n-gram filters cannot detect (the leaked content has been paraphrased by a strong model). Always run model-based detection on synthetic data, not just n-gram filtering.
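The dual filter from the first gotcha is straightforward to sketch. The version below picks the window length per eval item and keeps a list of items that are too short even for the 8-gram index; function names and the toy documents are illustrative, and it assumes the two window lengths are different.

```python
def tokenize(text):
    return text.lower().split()


def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pick_n(num_tokens, long_n=13, short_n=8):
    """13-gram for long items, 8-gram for short ones, None if no window fits."""
    if num_tokens >= long_n:
        return long_n
    if num_tokens >= short_n:
        return short_n
    return None


def build_dual_index(eval_docs, long_n=13, short_n=8):
    index = {long_n: {}, short_n: {}}       # n -> gram -> set of eval ids
    gram_counts, unindexed = {}, []
    for eval_id, text in enumerate(eval_docs):
        tokens = tokenize(text)
        n = pick_n(len(tokens), long_n, short_n)
        if n is None:
            unindexed.append(eval_id)       # still invisible to n-gram scanning
            continue
        grams = ngram_set(tokens, n)
        gram_counts[eval_id] = len(grams)
        for g in grams:
            index[n].setdefault(g, set()).add(eval_id)
    return index, gram_counts, unindexed


def scan(train_text, index, gram_counts):
    """Per-eval contamination ratios, checking both window lengths."""
    tokens = tokenize(train_text)
    hits = {}
    for n, table in index.items():
        for g in ngram_set(tokens, n):
            for eval_id in table.get(g, ()):
                hits[eval_id] = hits.get(eval_id, 0) + 1
    return {eval_id: h / gram_counts[eval_id] for eval_id, h in hits.items()}


eval_docs = [
    "write a python function that returns the sum of all even numbers",       # 12 tokens
    "natalia sold clips to 48 of her friends in april and then sold half as many in may",
    "what is the capital of australia",                                       # 6 tokens
]
index, gram_counts, unindexed = build_dual_index(eval_docs)
train = "solution write a python function that returns the sum of all even numbers in a list"
print(scan(train, index, gram_counts))   # the 12-token item is now caught via 8-grams
print(unindexed)                          # items too short for either window
```

Items that land in `unindexed` need a different mechanism entirely, such as exact-match lookup or the model-based check described earlier.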
How DeepSeek validates a corpus before launch. Four cheap pre-flight checks: (a) the n-gram scanner runs against the full eval suite, with per-benchmark drop logs; (b) a 1B-token sample of the cleaned corpus is re-scanned with a tighter threshold (n=8, drop_thresh=0.3) to estimate false-negative rate; (c) the eval suite hash is locked into the training config so the filter cannot drift from the evaluator; (d) after training, a model-based canary scan runs on every benchmark before the release scores are published. Skipping any of the four is how leaderboards lose their credibility.
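Locking the eval-suite hash into the training config, as in check (c) and the benchmark-drift gotcha, needs very little code. This is a minimal sketch under stated assumptions: the hash is a SHA-256 over the sorted eval items, and the function names and toy suite are hypothetical.

```python
import hashlib


def eval_suite_hash(items):
    """Order-independent SHA-256 fingerprint of the eval suite."""
    digest = hashlib.sha256()
    for item in sorted(items):
        digest.update(item.encode("utf-8"))
        digest.update(b"\x00")              # separator so item boundaries matter
    return digest.hexdigest()


def assert_filter_matches_suite(filter_hash, live_items):
    """Refuse to proceed when the filter was built from a different eval suite."""
    live_hash = eval_suite_hash(live_items)
    if filter_hash != live_hash:
        raise RuntimeError(
            f"stale contamination filter: built for {filter_hash[:12]}, "
            f"live eval suite is {live_hash[:12]}"
        )


suite_v1 = ["what is the capital of australia", "natalia sold clips to 48 friends"]
filter_hash = eval_suite_hash(suite_v1)     # recorded when the filter is built

assert_filter_matches_suite(filter_hash, list(reversed(suite_v1)))   # same items: passes

suite_v2 = suite_v1 + ["a newly added benchmark question"]
try:
    assert_filter_matches_suite(filter_hash, suite_v2)
except RuntimeError as err:
    print(err)                              # drift detected before training starts
```

Sorting before hashing makes the fingerprint insensitive to item order, so only a real change in content forces a filter rebuild.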

The one sentence to carry forward: a model's benchmark numbers are only as honest as its contamination filter, and an honest contamination filter is cheaper than half a percent of one training run — which is why every serious lab in 2025 ships a contamination appendix alongside its release report.

Where we go from here. Chapter 8 ends here. We have built every component of the data foundation — the pipeline, deduplication, quality filtering, mixing, synthetic generation, curriculum, and contamination detection. Chapter 9 picks up the thread on the OTHER side of the data wall: given a clean 14.8T-token corpus, what model size should we train, for how many tokens, with what hyperparameters? That is the territory of scaling laws.