We have spent six sections building a pipeline that turns raw internet text into a clean, mixed, scheduled stream of 14.8T training tokens. One question remains, and it is the one that decides whether the model you ship is honest. How do you make sure your model has never seen the test?
The thesis of this section. Benchmark scores are the public-facing measure of a frontier model's quality. If a single copy of an MMLU question, a HumanEval prompt, or a GSM8K problem ends up in the pretraining corpus, the model can memorise the answer verbatim and post inflated scores that mean nothing. Contamination detection is the audit that keeps the leaderboard honest — and at 14.8T-token scale it is a non-trivial systems problem.
The Real Problem: A Test the Model Already Saw
Imagine you spend $5M of GPU time training a 671B-parameter model. You evaluate it on MMLU, GSM8K, HumanEval, MATH, BBH, and a dozen other public benchmarks. The scores look great — better than any previous open model. You ship. Within a week, an independent researcher discovers that one of the benchmark questions appeared verbatim in a Stack Exchange post that was crawled by Common Crawl in 2023. With an effective epoch count of ~8 on that document (because it was upweighted as a high-quality math source — see section 8.4), the model saw the question and its answer eight times. The benchmark score is meaningless. Your release is dead.
This is not a hypothetical. Three real cases:
| Incident | What leaked | Effect |
|---|---|---|
| GPT-3 contamination (2020) | An estimated 13–80% of test items in several public benchmarks appeared in raw Common Crawl. | The GPT-3 paper introduced the 13-gram contamination filter and reported "clean" vs "dirty" scores for every benchmark. |
| Llama-2 / Llama-3 audit (2024) | Independent reviewers found verbatim MMLU items in The Pile and in early Llama-2 SFT data. | Llama-3 added per-benchmark Bloom filters and a public contamination report appendix. |
| Open-source LLM leaderboard (2023–2024) | Several top-of-leaderboard fine-tunes were found to have trained on the eval set itself. | HuggingFace introduced a contamination check as part of the leaderboard submission flow. |
The pattern is consistent. Models that win leaderboards by margins that look too good to be true usually are too good to be true — and the explanation, more often than architecture or scale, is that a few thousand benchmark items leaked through the data pipeline. The contamination scanner is the last line of defence between an honest evaluation and an embarrassing retraction.
Intuition: Catching Plagiarism at Web Scale
The mental model is academic plagiarism detection — but inverted. Plagiarism software checks whether a student paper overlaps with a known reference corpus. Contamination detection checks whether a known eval set (the "reference") overlaps with a candidate training corpus (the "student"). Same math, opposite roles.
Three constraints make this harder than classroom plagiarism:
- Scale. A plagiarism checker handles a few thousand papers against a few million references. We handle trillions of training tokens against tens of thousands of eval questions. The naive all-pairs comparison costs on the order of corpus size × eval-set size, which is completely impossible at 14.8T × 50k.
- Paraphrase tolerance. The leak is rarely verbatim. A Stack Overflow post will rephrase the eval prompt, drop a word, add a comment. We need to catch "the same question with one word changed" — but we must not catch "a totally different question about the same topic." This is the fundamental tension n-gram length controls.
- Streaming. We cannot load the corpus into memory. We get one document at a time, must make a keep/drop decision in milliseconds, and move on. The eval index lives in RAM; the training corpus streams past.
The dominant technique in 2020–2025 — used by GPT-3, Llama-3, DeepSeek-V3 — is n-gram overlap detection: chop both sides into fixed-length windows of tokens and count how many of the eval's windows appear in the training document. Simple, fast, and surprisingly hard to defeat.
The Mathematics of N-Gram Overlap
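Let $E$ be an eval document and $T$ a candidate training document. After whitespace tokenisation we have two token sequences. Define the set of n-grams of a sequence $x = (x_1, \dots, x_L)$ as:
$$G_n(x) = \{\, (x_i, x_{i+1}, \dots, x_{i+n-1}) : 1 \le i \le L - n + 1 \,\}$$
The contamination ratio of $T$ with respect to $E$ is the fraction of eval n-grams that also appear in the training document:
$$C_n(E, T) = \frac{|G_n(E) \cap G_n(T)|}{|G_n(E)|}$$
This is a number in $[0, 1]$. $C_n(E, T) = 0$ means no overlap; $C_n(E, T) = 1$ means every n-gram of the eval doc also appears in the training doc, a total verbatim leak. The decision rule is a pair of thresholds $\tau_{\text{flag}} < \tau_{\text{drop}}$:
$$\text{verdict}(T) = \begin{cases} \text{DROP} & \text{if } \max_E C_n(E, T) \ge \tau_{\text{drop}} \\ \text{FLAG} & \text{if } \tau_{\text{flag}} \le \max_E C_n(E, T) < \tau_{\text{drop}} \\ \text{KEEP} & \text{otherwise} \end{cases}$$
The standard choices in published frontier models are $n = 13$ with a drop threshold of $\tau_{\text{drop}} = 0.5$ and a lower flag threshold $\tau_{\text{flag}}$. The $\max$ over eval documents is critical: a document that leaks 80% of one benchmark and 0% of others must still drop.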
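The definitions above translate almost line for line into code. What follows is a minimal sketch, not a production scanner: the function names are illustrative, the flag threshold of 0.2 is a placeholder (only the 0.5 drop threshold comes from the text), and returning 0 for an eval doc that is too short to produce a window is an assumption.

```python
from typing import Iterable, List, Set, Tuple

Gram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Set[Gram]:
    """All distinct length-n windows; empty when the sequence is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(eval_tokens: List[str], train_tokens: List[str], n: int) -> float:
    """Fraction of the eval doc's n-grams that also occur in the training doc."""
    eval_grams = ngrams(eval_tokens, n)
    if not eval_grams:
        return 0.0  # assumption: an eval doc too short for one window scores 0
    return len(eval_grams & ngrams(train_tokens, n)) / len(eval_grams)


def verdict(eval_docs: Iterable[List[str]], train_tokens: List[str], n: int,
            tau_flag: float, tau_drop: float) -> str:
    """Apply the two thresholds to the worst (max) ratio over all eval docs."""
    worst = max((contamination_ratio(e, train_tokens, n) for e in eval_docs), default=0.0)
    if worst >= tau_drop:
        return "DROP"
    if worst >= tau_flag:
        return "FLAG"
    return "KEEP"


# Toy data with n = 3 so the short sequences produce windows.
evals = ["a b c d e f g h".split(), "p q r s t u v w".split()]
train = "x a b c d e f g h y".split()  # leaks the first eval doc only
print(verdict(evals, train, n=3, tau_flag=0.2, tau_drop=0.5))  # DROP
```

The training doc overlaps only one of the two eval docs, yet the verdict is still DROP because the rule takes the maximum over eval docs rather than an average.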
Why the value of n matters
Two probabilities compete inside the choice of $n$. Let $p$ be the probability that a single token matches by chance, about $1/V$ for vocabulary size $V$. The probability that an entire n-gram collides by chance is roughly $p^n = V^{-n}$, which is astronomically small for $n = 13$. So a 13-gram match is almost never coincidence; it is a real leak. A paraphrase, on the other hand, only breaks the windows that contain a changed word: one substitution invalidates at most $n$ windows, and every other window still matches. The filter therefore stays paraphrase-tolerant as long as substitutions are sparser than about one per $n$ tokens; at one swap in every 13 tokens, every 13-gram window contains a changed word and nothing matches.
Drop $n$ to 8 and the chance of a coincidental collision rises (false positives), but the filter catches heavier paraphrases. Raise $n$ to 25 and you catch only verbatim copies (false negatives explode). The sweet spot at 13 was found empirically by the GPT-3 team and has stuck.
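To put rough numbers on this trade-off, the sketch below counts how many windows of a document survive a given set of word substitutions, and evaluates the crude uniform-token collision estimate. The helper names and the vocabulary size of 50,000 are illustrative assumptions, not values from any published scanner.

```python
def surviving_fraction(length: int, n: int, swapped: set) -> float:
    """Fraction of n-gram windows in a `length`-token doc that contain no swapped position."""
    windows = length - n + 1
    if windows <= 0:
        return 0.0
    clean = sum(
        1 for start in range(windows)
        if not any(pos in swapped for pos in range(start, start + n))
    )
    return clean / windows


def chance_collision(vocab_size: int, n: int) -> float:
    """Crude uniform-token estimate of a coincidental n-gram match: (1/V)^n."""
    return (1.0 / vocab_size) ** n


every_13th = set(range(12, 100, 13))            # one swap in every 13 tokens
print(surviving_fraction(100, 13, {50}))        # one interior swap: 75 of 88 windows survive
print(surviving_fraction(100, 13, every_13th))  # every 13-gram window is broken
print(surviving_fraction(100, 8, every_13th))   # shorter windows still fit between the swaps
print(chance_collision(50_000, 13), chance_collision(50_000, 8))
```

With swaps spaced 13 tokens apart the 13-gram filter sees nothing, while the 8-gram filter still finds intact windows between the swaps; the price is a chance-collision estimate that is many orders of magnitude larger, though still tiny under this uniform model.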
Why we use Jaccard-style ratio, not edit distance
Edit distance and embedding similarity would give a more semantically-aware measure of leakage. For an eval doc of $|E|$ tokens and a training doc of $|T|$ tokens they are also $O(|E| \cdot |T|)$ or worse per pairwise comparison, which is completely intractable at 14.8T tokens. N-gram set intersection is $O(|E| + |T|)$ with hash-set lookup. Frontier labs run a fast n-gram pre-filter to drop the obvious 99%, then an expensive embedding-based scan over the remaining 1% for the borderline cases. We focus on the n-gram pre-filter because it does the heavy lifting.
Manual Numerical Walkthrough
Let us compute the contamination ratio by hand for two real-looking documents at $n = 5$ (small enough to do on paper, large enough to be meaningful).
Eval doc. "write a python function that returns the sum of all even numbers"
After whitespace tokenisation: [write, a, python, function, that, returns, the, sum, of, all, even, numbers] — 12 tokens.
Step 1: 5-grams of the eval doc. With 12 tokens and $n = 5$ we get $12 - 5 + 1 = 8$ grams:
- (write, a, python, function, that)
- (a, python, function, that, returns)
- (python, function, that, returns, the)
- (function, that, returns, the, sum)
- (that, returns, the, sum, of)
- (returns, the, sum, of, all)
- (the, sum, of, all, even)
- (sum, of, all, even, numbers)
Train doc. "solution: write a python function that returns the sum of all even numbers in a list"
Tokens (16): [solution, write, a, python, function, that, returns, the, sum, of, all, even, numbers, in, a, list]. Note that normalisation strips the colon, so "solution:" becomes "solution".
Step 2: 5-grams of the training doc. With 16 tokens we get $16 - 5 + 1 = 12$ grams. The ones that overlap with the eval doc's 5-grams are:
- (write, a, python, function, that) ← matches eval gram 1
- (a, python, function, that, returns) ← matches eval gram 2
- (python, function, that, returns, the) ← matches eval gram 3
- (function, that, returns, the, sum) ← matches eval gram 4
- (that, returns, the, sum, of) ← matches eval gram 5
- (returns, the, sum, of, all) ← matches eval gram 6
- (the, sum, of, all, even) ← matches eval gram 7
- (sum, of, all, even, numbers) ← matches eval gram 8
All 8 eval grams appear in the training doc.
Step 3: contamination ratio. $8 / 8 = 1.0$. Verdict: DROP. The training doc contains the eval prompt verbatim with only a few surrounding words; the model would memorise it on the first epoch.
Step 4: what changes at n = 13? With 12 tokens in the eval doc and $n = 13$, there are $\max(0, 12 - 13 + 1) = 0$ grams: the eval doc is too short to produce any 13-grams. This is the most common footgun: short eval items (single-sentence MMLU questions) silently slip past a 13-gram filter because they cannot fit one full window. Production scanners use a dual filter: 13-gram for paragraph-length items, 8-gram for short items.
Step 5: what changes for a paraphrase? Suppose the training doc says "...write a python routine that returns the sum..." (substituting "routine" for "function"). With $n = 5$, the four 5-grams that contain "function" no longer match. We get 4 / 8 = 0.50, still a DROP. With $n = 13$ (if the docs were longer), a single interior substitution kills 13 windows in a row, so for an eval doc of $L$ tokens the contamination ratio falls from 1.0 to roughly $1 - 13/(L - 12)$ for a 1-word swap: about 0.85 for a 100-token doc (75 of 88 windows survive), still well above the 0.5 drop threshold. Each further well-separated swap removes up to 13 more windows: two swaps leave about 0.70, three about 0.56 (borderline), and four about 0.41, which falls below the drop threshold. That is the paraphrase-tolerance budget the 13-gram filter actually gives you, and it shrinks quickly for shorter items.
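The hand computation can be checked mechanically. The sketch below reruns the walkthrough on the same two sentences; the normalisation (lowercase, strip punctuation, split on whitespace) and the choice to report 0.0 when the eval doc cannot fit a single window are assumptions made for this example.

```python
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    return text.lower().translate(_PUNCT).split()


def ngrams(tokens, n):
    """Set of all length-n windows; empty if the doc is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ratio(eval_text, train_text, n):
    """Fraction of eval n-grams found in the training doc (0.0 if there are none)."""
    eval_grams = ngrams(tokenize(eval_text), n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(tokenize(train_text), n)) / len(eval_grams)


EVAL = "write a python function that returns the sum of all even numbers"
LEAK = "solution: write a python function that returns the sum of all even numbers in a list"
PARA = "solution: write a python routine that returns the sum of all even numbers in a list"

print(ratio(EVAL, LEAK, 5))   # 1.0  (all 8 eval grams match)
print(ratio(EVAL, PARA, 5))   # 0.5  (the 4 grams containing "function" are lost)
print(ratio(EVAL, LEAK, 13))  # 0.0  (a 12-token eval doc has no 13-grams)
```

The last line is the short-item footgun from Step 4: a verbatim leak scores zero simply because the eval doc is shorter than the window.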
Visualizing Contamination
The scanner below pairs three real-looking eval/training pairs against each other. Drag the n-gram slider to watch the contamination ratio change: short n catches paraphrases at the cost of false positives, long n catches only verbatim leaks. The badge at the top shows the production verdict — green KEEP, amber FLAG, red DROP — at the current thresholds.
Three things to internalise from the sandbox. First, on the clean example the contamination ratio is exactly zero for any $n$: clean training data does not accidentally collide with the benchmark, no matter the n. Second, on the paraphrase example the ratio decays sharply with $n$: short n flags it loudly, long n misses it entirely. Third, on the verbatim leak the ratio sits near 1.0 for every $n$ the document is long enough to support; the leak is unmistakable. The engineering question is where to draw the line so paraphrases are flagged and topically similar docs are kept.
Plain Python: An N-Gram Contamination Scanner
Below is the complete contamination scanner in plain Python: no external libraries, no GPUs. Three benchmark items, three candidate training docs, and a per-document verdict. It follows the n-gram overlap approach described in the GPT-3 paper (Brown et al., 2020, Section 4 and Appendix C).
Two structural details worth a second look. First, the index is built on the eval set, not on the training set. This is the difference between a 30-second job and a multi-day job. The eval set is the small side and never changes; index it once, query it billions of times. Second, the per-eval breakdown (lines 53–58) lets downstream tooling answer the question "which benchmark did this document leak?" — without it, the contamination report is just a list of dropped documents with no actionable signal.
Sanity check yourself. Run this scanner with `EVAL = TRAIN` (set the eval and training corpora to the same list). Every document should report a contamination ratio of exactly 1.0 against its matching eval doc. If yours come back at 0.0, the n-gram extraction is broken, almost always an off-by-one in the sliding window.
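A minimal sketch of a scanner with the two structural properties discussed above: the index is built on the eval side only, and each verdict carries the id of the top-matching eval item. The toy data, the window size of 5, the 0.2 flag threshold, and all function names are illustrative assumptions, not the values or code of any published scanner.

```python
import string
from collections import defaultdict

N = 5                          # toy window size so the short examples produce grams
TAU_FLAG, TAU_DROP = 0.2, 0.5  # TAU_FLAG is a placeholder value
_PUNCT = str.maketrans("", "", string.punctuation)

EVAL = {
    "humaneval/toy": "write a python function that returns the sum of all even numbers",
    "gsm8k/toy": "a farmer has twelve sheep and buys seven more how many sheep does he have now",
}
TRAIN = [
    "solution: write a python function that returns the sum of all even numbers in a list",
    "the weather in lisbon is mild in winter and pleasantly warm in the summer months",
]


def tokenize(text):
    return text.lower().translate(_PUNCT).split()


def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_docs, n):
    """Map each eval n-gram to the ids of the eval docs that contain it."""
    index, sizes = defaultdict(set), {}
    for eval_id, text in eval_docs.items():
        grams = ngrams(tokenize(text), n)
        sizes[eval_id] = len(grams)
        for gram in grams:
            index[gram].add(eval_id)
    return index, sizes


def scan(text, index, sizes, n):
    """Return (verdict, max ratio, top-matching eval id) for one training doc."""
    hits = defaultdict(int)
    for gram in ngrams(tokenize(text), n):      # each distinct gram counted once
        for eval_id in index.get(gram, ()):
            hits[eval_id] += 1
    if not hits:
        return "KEEP", 0.0, None
    top = max(hits, key=lambda e: hits[e] / sizes[e])
    ratio = hits[top] / sizes[top]
    if ratio >= TAU_DROP:
        return "DROP", ratio, top
    return ("FLAG" if ratio >= TAU_FLAG else "KEEP"), ratio, top


index, sizes = build_eval_index(EVAL, N)        # built once, on the small side
for doc in TRAIN:
    print(scan(doc, index, sizes, N), "<-", doc[:40])
```

Scanning the eval texts themselves against this index is the sanity check described above: each should come back with a ratio of 1.0 and its own id as the top match.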
PyTorch: Bloom Filters and Sharded Scans at Scale
At 14.8T tokens the dict-of-grams approach hits two walls: memory (the eval gram set grows when you index more benchmarks) and throughput (every gram lookup is a Python hash + string compare). The production version swaps the dict for a Bloom filter and wraps the scan in a sharded IterableDataset that runs across hundreds of CPU workers.
Three subtleties worth marking, all about how this filter interacts with the rest of the data pipeline:
- The Bloom filter is read-only after construction. We build it on a head node, write it to a shared blob, and every worker memory-maps the same bitset. There is no synchronisation, no consistency problem — just a giant read-only array. This is the single design choice that turns the scanner from "parallelisable in principle" to "embarrassingly parallel."
- Filtering is corpus-build-time, not training-time. The DataLoader you see above runs ONCE, before training. Its output is a stream of clean documents written to a new set of shards. The training loop later reads only from those clean shards — no contamination logic anywhere near the gradient updates. Mixing the two is the most common architectural mistake.
- The drop log is the audit trail. `log_dropped(doc, ratio)` writes (doc-hash, contamination-ratio, top-matching-eval-id) to a side-channel log. DeepSeek-V3's release report includes this log as a public appendix, so anyone can verify which documents were filtered and why. The log also lets you re-apply a different threshold without re-processing the corpus, as long as it records every document whose ratio could cross the new threshold: just re-filter the log itself.
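The read-only Bloom filter and the per-worker sharding can be illustrated without PyTorch. The sketch below is a pure-Python stand-in under stated assumptions: the `BloomFilter` class, the double-hashing scheme over a BLAKE2b digest, the 2^20-bit size, and the modulo sharding rule are all illustrative choices; only the use of 7 hash functions echoes the figure quoted later in this section.

```python
import hashlib


class BloomFilter:
    """Fixed-size bitset; after construction it is only ever read."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all probe positions from one 128-bit digest.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def gram_keys(tokens, n):
    """N-gram windows joined into strings so they can be hashed."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_filter(eval_texts, n, num_bits=1 << 20, num_hashes=7):
    bloom = BloomFilter(num_bits, num_hashes)
    for text in eval_texts:
        for key in gram_keys(text.lower().split(), n):
            bloom.add(key)
    return bloom


def scan_shard(docs, bloom, n, worker_id=0, num_workers=1):
    """Yield (doc, possible_hits) for the documents assigned to this worker."""
    for i, doc in enumerate(docs):
        if i % num_workers != worker_id:
            continue  # another worker owns this document
        keys = gram_keys(doc.lower().split(), n)
        yield doc, sum(1 for key in keys if key in bloom)


bloom = build_filter(["write a python function that returns the sum of all even numbers"], n=5)
docs = [
    "solution write a python function that returns the sum of all even numbers in a list",
    "the weather in lisbon is mild in winter and pleasantly warm in the summer months",
]
for doc, hits in scan_shard(docs, bloom, n=5):
    print(hits, "<-", doc[:40])
```

A Bloom filter can report false positives but never false negatives, so a document with zero hits is safe to keep, while a document with hits is a candidate for the exact per-eval ratio check rather than an automatic drop.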
At Massive Scale: 14.8T Tokens vs Dozens of Benchmarks
Plug the production numbers into the algorithm and the systems story becomes concrete. DeepSeek-V3 pretrains on 14.8T tokens. After BPE tokenisation that is roughly the same number of n-gram windows. The eval suite is ~50k items pulled from ~30 public benchmarks, which produces roughly 1–2 million distinct 13-grams. The full scan is:
| Quantity | Order of magnitude | What it costs |
|---|---|---|
| Training corpus n-grams | ~ 1.5 × 10¹³ | Iterated once. Streamed off disk at ~ 1 TB / hr / node. |
| Eval n-grams in Bloom filter | ~ 10⁶ | Fits in ~ 2 GB of memory; broadcast once to every worker. |
| Bloom-filter lookups | ~ 10⁵ ns per gram (7 hashes × kernel call) | Saturates ~ 1 CPU core per shard worker. |
| Wall-clock at 1024 workers | ~ 18 hours | Roughly $5–10k of CPU time — < 0.5% of total pretraining cost. |
| Dropped documents | ~ 0.01 – 0.05% of the corpus | Tiny by share; huge by impact — these are the docs that would have inflated benchmark scores. |
Two systems-level observations. First, the contamination scan is sub-1% of the total training cost — there is no economic reason ever to skip it. Second, the ~0.03% drop rate sounds tiny until you remember the contamination budget asymmetry: dropping clean documents costs you a fraction of a bit of validation loss; keeping a leaked document inflates a benchmark score by 5–30 percentage points and burns your release credibility. The expected value math is overwhelming in favour of aggressive filtering.
Model-based contamination detection
N-gram filtering catches surface-level overlap. The 2024 generation of frontier models added a second, complementary technique: model-based contamination detection. Run the trained model over the eval question with the answer text masked, and compute its perplexity. Then compute its perplexity on the same question with the words shuffled into a meaningless order. If the perplexity gap is much larger than for fresh, never-seen text of the same length, the model is recognising the eval question — strong evidence it was contaminated even if no n-gram match was found.
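Concretely, define the "canary score" $s(q) = \mathrm{PPL}(\mathrm{shuffle}(q)) - \mathrm{PPL}(q)$, the perplexity gap between the shuffled and the original question $q$. For genuinely unseen text, $s$ stays small and stable around the fresh-text baseline. For contaminated text, $s$ spikes: the model has learned the original ordering. This catches paraphrased leaks that the n-gram filter missed. It is expensive (a forward pass over the original and the shuffled form of every eval item) but feasible because the eval suite is small.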
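The perplexity-gap idea can be sketched independently of any particular model. In the example below, `logprob_fn` is a hypothetical interface that returns one log-probability per predicted token, and `toy_memoriser` is a deliberately crude stand-in for a trained model (confident only on bigrams it has seen); neither is a real library API, and the plain difference of perplexities is one assumed form of the score.

```python
import math
import random


def perplexity(tokens, logprob_fn):
    """exp of the mean negative log-probability assigned to the tokens (needs >= 2 tokens here)."""
    logps = logprob_fn(tokens)
    return math.exp(-sum(logps) / len(logps))


def canary_score(tokens, logprob_fn, seed=0):
    """Perplexity of a shuffled copy minus perplexity of the original ordering."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return perplexity(shuffled, logprob_fn) - perplexity(tokens, logprob_fn)


def toy_memoriser(memorised):
    """Stand-in for a real model: confident only on bigrams it has 'seen'."""
    seen = set(zip(memorised, memorised[1:]))

    def logprob_fn(tokens):
        return [math.log(0.9 if pair in seen else 0.01)
                for pair in zip(tokens, tokens[1:])]

    return logprob_fn


leaked = "which planet in our solar system has the largest number of known moons".split()
fresh = "seven green kites drift slowly over quiet harbour walls tonight".split()
model = toy_memoriser(leaked)

print(canary_score(leaked, model))  # large gap: the original ordering is recognised
print(canary_score(fresh, model))   # no gap: both orderings are equally surprising
```

With a real model the fresh-text gap would not be zero, since shuffling hurts any natural sentence, which is why the score has to be compared against a baseline measured on never-seen text of similar length.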
Engineering Reality and Gotchas
Contamination detection looks like a clean algorithmic problem. Five production failure modes earn their flags:
- Short eval items slip past long n-grams. A 12-token MMLU question cannot produce any 13-grams: the sliding window requires the document to be at least $n$ tokens long. The fix is a dual filter: 13-gram for long-form benchmarks, 8-gram for short-form. Skip this and roughly 20% of MMLU and TriviaQA items become undetectable.
- Tokenisation mismatch between scanner and model. The contamination scanner uses whitespace tokenisation; the model uses BPE. If the scanner ever switches to BPE, the n-gram windows shift, the index is invalidated, and the same training corpus produces different verdicts on different scanner versions. ALWAYS use plain whitespace tokenisation for contamination — it is the stable, model-independent surface form.
- Eval-set leakage into SFT and RLHF data. The pretraining contamination scan is necessary but not sufficient. SFT instruction datasets and RLHF preference datasets are frequently constructed by paraphrasing public benchmarks — a tutor-model rephrasing GSM8K, for instance. Run the same contamination scanner against every SFT and RLHF batch before shipping. (See section 13.2 for the SFT-specific version.)
- Benchmark drift. Public benchmarks add and remove items over time. The contamination filter must be REBUILT every time the eval suite changes — a stale filter passes new benchmark items through without checking them. Tie filter rebuild to a hash of the eval suite version; refuse to start training if the hash does not match the live eval suite.
- Synthetic data contamination by inheritance. Section 8.5 discussed synthetic data — training problems generated by another LLM. If the source LLM was itself contaminated, the synthetic data inherits that contamination, often in a form that surface n-gram filters cannot detect (the leaked content has been paraphrased by a strong model). Always run model-based detection on synthetic data, not just n-gram filtering.
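Two of the fixes in this list are small enough to sketch directly: choosing the window size per eval item, and tying the filter to a hash of the eval suite. The cut-off rule (use 13 whenever the item can fit one 13-gram window, else 8), the JSON-plus-SHA-256 fingerprint, and every function name below are assumptions made for illustration.

```python
import hashlib
import json

LONG_N, SHORT_N = 13, 8


def pick_n(eval_tokens):
    """Window size for one eval item; None if it is too short even for the short filter."""
    if len(eval_tokens) >= LONG_N:
        return LONG_N
    if len(eval_tokens) >= SHORT_N:
        return SHORT_N
    return None  # needs a different check, e.g. model-based detection


def eval_suite_hash(eval_docs: dict) -> str:
    """Order-independent fingerprint of the eval suite (id -> text)."""
    payload = json.dumps(sorted(eval_docs.items()), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def assert_filter_is_fresh(filter_hash: str, live_eval_docs: dict) -> None:
    """Refuse to proceed when the filter was built from a different eval suite."""
    live_hash = eval_suite_hash(live_eval_docs)
    if filter_hash != live_hash:
        raise RuntimeError("contamination filter is stale: rebuild it for the current eval suite")


suite_v1 = {"mmlu/toy": "which gas makes up most of the atmosphere of the earth"}
built_for = eval_suite_hash(suite_v1)           # stored alongside the filter
assert_filter_is_fresh(built_for, suite_v1)     # passes silently

suite_v2 = dict(suite_v1, **{"mmlu/toy2": "what is the boiling point of water at sea level"})
try:
    assert_filter_is_fresh(built_for, suite_v2)
except RuntimeError as err:
    print(err)

print(pick_n(suite_v1["mmlu/toy"].split()))     # 8: the item has 12 tokens
```

Storing the hash next to the filter makes the staleness check a single string comparison at job start, which is cheap enough to run before every corpus build and every training launch.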
The one sentence to carry forward: a model's benchmark numbers are only as honest as its contamination filter, and an honest contamination filter is cheaper than half a percent of one training run — which is why every serious lab in 2025 ships a contamination appendix alongside its release report.