Section 8.1 sketched the full data pipeline — crawl, extract, filter, dedup, mix, tokenise. The dedup step looks innocent on the diagram, a single box between filtering and mixing. But for a 14.8T-token corpus that step is what stands between a state-of-the-art model and a slow, memorising, evaluation-leaking failure. The web is duplicate by nature: Reuters articles syndicated to a thousand outlets, AI-rephrased SEO farms, license boilerplate repeated on every legal page, cookie banners on every commercial site. Train naively and the model spends its capacity memorising the most repeated strings. Dedup is the lever that decides whether you train a language model or a giant autocomplete cache of the open web.
What this section delivers. A complete derivation of the MinHash + Locality-Sensitive Hashing (LSH) pipeline that turns the all-pairs problem into an distributed job, the math behind the signature length and band geometry that controls precision vs. recall, and the production gotchas that separate a clean 14.8T-token corpus from a corrupted one. By the end you will be able to reason about why DeepSeek-V3, RefinedWeb, and FineWeb all converge on the same tuning regime.
The Duplication Disaster
Raw web data is staggeringly duplicated. Independent measurements on Common Crawl converge on roughly the same shape: about 30% of crawled URLs are exact duplicates of another URL (mostly mirroring and re-indexing), and roughly another 40% are near-duplicates at the paragraph or document level (syndication, scraping, AI re-write, boilerplate footers). After aggressive dedup the surviving unique-content fraction is closer to 30% of the input bytes. The remainder is, from a learning standpoint, free gradient-stealing redundancy.
Three concrete failure modes appear when you skip dedup:
- Memorisation crowds out generalisation. Lee et al. (2022) showed that LMs trained on un-deduped corpora memorise training sequences at rates up to higher than the deduped baseline — for any sequence that appears more than a handful of times. The cost is paid in parameter capacity that could have learned a new fact instead.
- Eval contamination ruins benchmarks. If a single MMLU question appears verbatim in the training corpus, the model may score 90% on that question by recall — telling you nothing about its reasoning. Common Crawl contains scraped copies of nearly every public benchmark; without dedup against the eval set, your reported score is unreliable in ways you cannot detect post-hoc.
- Effective dataset size shrinks the scaling law. Chinchilla says — model parameters and unique tokens should grow together. If 70% of your 14.8T tokens are duplicates, your effective is closer to 4.5T, and a model sized for 14.8T is now massively over-parameterised relative to its training signal. The compute budget was spent on the wrong recipe.
| Failure mode | Detection difficulty | Cost at 671B scale |
|---|---|---|
| Memorisation of repeated strings | High — surfaces only in red-team probes | Capability ceiling, PII leakage risk |
| Eval contamination | Very high — invalidates all reported scores | Public credibility loss |
| Wasted parameter capacity | Medium — visible as flat validation loss | Tens of millions of $ in idle compute |
| Boilerplate over-representation | Low — visible in token-frequency stats | Refusal patterns, low diversity |
Intuition: From Exact Match to Near-Duplicate
Two documents are near-duplicates when most of their content overlaps, even if word order, punctuation, or a few interspersed words differ. The minimum useful representation is the k-shingle set: the set of all overlapping k-token windows of the document. Two documents with high shingle-set overlap share a high Jaccard similarity . This is the target quantity.
Computing Jaccard directly is fast for a pair and impossible for documents. At the number of pairs is . If a single Jaccard comparison takes one microsecond, the full all-pairs sweep finishes in roughly 14 million years on one machine and 14 years on a million-core cluster. The straightforward algorithm is unusable. We need a sub-quadratic alternative that is almost as accurate.
The MinHash + LSH idea unfolds in two layers:
- Compress. Replace every document's huge shingle set with a small fixed-size signature (typically 128 integers) such that two signatures agree on a coordinate with probability exactly equal to the Jaccard similarity of the original sets. We have lost size but kept a per-position similarity oracle.
- Bucket. Slice each signature into bands of rows. Hash each band independently into a giant hashtable. Two documents end up in the same bucket somewhere iff at least one of their bands matched exactly. By picking and we get a sharp S-curve: documents above some Jaccard threshold are almost-certainly collided, documents below are almost-certainly not.
The Math: Jaccard, MinHash, and LSH
Jaccard similarity
For shingle sets and :
means the shingle sets are identical (so the documents differ only in punctuation or whitespace); means no shingle is shared. Empirically the threshold matches what humans would call "same article, edited".
MinHash: the heart of the trick
Let be a random permutation of the universe of all possible shingles. Define — the smallest index any shingle of A maps to under . The central identity is:
Why? For any element , the probability that is the minimum over is uniform — . The event happens exactly when that argmin lies in . There are such elements, so the probability is . Clean and exact.
We cannot store a true random permutation of elements, so we approximate it with a strong hash function under a fresh salt — times. The signature is the length- vector . The fraction of agreement between two signatures is an unbiased estimator of with standard error . At and the standard error is about 0.035 — tight enough to threshold reliably.
LSH banding: turning the estimator into an index
A pairwise signature comparison still costs work. The locality-sensitive trick: slice the length- signature into bands of rows, with . Hash each band; two documents collide in band iff every one of the rows in that band matches. The probability that one band matches is where . The probability that any of the bands matches is therefore:
This is the LSH S-curve. It rises sharply from near 0 below some critical similarity to near 1 above it. The inflection point — the similarity at which collision probability equals one-half — is approximately:
Three intuitions to lock in:
| Lever | What it controls | Cost |
|---|---|---|
| Increase r | Sharper cutoff, fewer false positives | More missed true duplicates near threshold |
| Increase b | Lower threshold, catches looser duplicates | More candidate pairs to verify (slower) |
| Increase K = b·r | Lower variance of similarity estimate | Linear cost in storage and hashing |
Production pipelines target with thresholds at . With , the canonical choice is (FineWeb default) or (RefinedWeb default). The first is more permissive, the second more conservative; the difference shows up in how many surviving boilerplate pages your corpus keeps.
Manual Numerical Walkthrough
Let us run the full machinery on two micro-documents by hand, with shingles and hash functions sliced as . Every number is computed explicitly so the mechanism is laid bare.
Click to expand: two docs, K=4, by hand
Documents. A = "the cat sat on the mat" and B = "the cat sat on a mat" — a 1-word edit.
Step 1: 2-shingles. A = {"the cat", "cat sat", "sat on", "on the", "the mat"} (5 shingles). B = {"the cat", "cat sat", "sat on", "on a", "a mat"} (5 shingles).
Step 2: true Jaccard. (the cat / cat sat / sat on), (5 + 5 − 3). So — below the usual 0.8 threshold; these two would NOT be dedup'd at the production setting, but we will see the LSH machinery in action anyway.
Step 3: hash each shingle under 4 different salts. Imagine the hash table below (each row is one shingle, each column is one of the K=4 hash functions; numbers are arbitrary but reproducible).
| shingle | h₁ | h₂ | h₃ | h₄ |
|---|---|---|---|---|
| the cat | 17 | 42 | 19 | 88 |
| cat sat | 31 | 11 | 74 | 23 |
| sat on | 5 | 60 | 33 | 14 |
| on the | 52 | 29 | 8 | 71 |
| the mat | 44 | 3 | 57 | 66 |
| on a | 2 | 95 | 41 | 12 |
| a mat | 63 | 7 | 22 | 39 |
Step 4: MinHash signatures = min over each column, restricted to the doc's own shingles.
For A (rows the cat / cat sat / sat on / on the / the mat): .
For B (rows the cat / cat sat / sat on / on a / a mat): .
Step 5: estimated Jaccard from signatures. Compare position-by-position: vs. . Zero positions match → the MinHash estimate is . The true Jaccard was 0.43; with only K=4 hashes the noise dominates. This is why production uses K=128 — the standard error falls as .
Step 6: LSH banding with b=2, r=2. Band 1 of A is ; band 1 of B is . Not equal. Band 2 of A is ; band 2 of B is . Not equal. So this pair never shares a bucket — it would never even be considered for verification. Correct outcome (true J = 0.43 is below threshold) but via a low-K demo, not a real signal.
Check against the S-curve. . So roughly 1 in 3 such pairs would actually collide; we got unlucky on this throw, and that is exactly the kind of noise production drowns out by setting K=128, b=20, r=6 — at which point and .
Visualizing MinHash + LSH on a Toy Corpus
The visualizer below holds six documents: an original Reuters article, a syndicated copy with two extra words, an AI rephrase, two cookie banners that differ by one word, and one totally unrelated recipe. Each document is converted into its 3-shingle set, then into a 12-slot MinHash signature. Drag the rows per band slider and watch both the signature colour-banding and the S-curve on the right respond. Increase to sharpen the cutoff (you will watch borderline pairs drop out of the collide column); decrease it to flatten the curve and pick up more candidates.
Three patterns are worth burning into intuition. First, A & B (Reuters and its syndication) and D & E (the two cookie banners) almost always collide — their Jaccard is well above the inflection. Second, C (the AI rephrase) is the hard case: low shingle overlap because the surface words moved, so MinHash + LSH can miss it. This is exactly the motivation behind semantic dedup (SemDeDup, embedding-based dedup) at the end of this section — MinHash catches lexical near-duplicates, embeddings catch paraphrastic ones. Third, F (the recipe) never collides with anything, which is the boring case the algorithm is engineered to make cheap.
Plain Python: MinHash + LSH from Scratch
Before touching any library let's implement the entire pipeline in ~50 lines of plain Python — shingling, signature, banding, verification, and survivor selection. The point is to expose every decision the production system inherits.
Three structural details deserve a second pass. First, MinHash is approximate but unbiased: the expected number of signature-slot agreements is exactly . The standard error falls as , so doubling halves the noise on every Jaccard estimate — at linear cost in memory and hashing throughput.
Second, the LSH step is a recall filter, not a decision rule. It exists to make the exact-Jaccard verification tractable; without verification the false-positive rate at any would let through too much noise. Always pair the two stages; never trust LSH alone.
Third, the tie-break (keep lower-indexed doc) is a deterministic choice that matters in practice. Production pipelines often replace it with "keep the doc whose language model perplexity is lower" — the better-formed version of two near-duplicates wins. The selection rule does not change the dedup math, only which copy survives.
Sanity check. Rundedupon a 1000-doc toy corpus with 200 injected duplicates (random word swaps). At and threshold 0.8, you should recover roughly 195/200 duplicates with under 5 false positives. That ratio is the empirical S-curve in action.
PySpark: Dedup at 14.8T Tokens
The plain Python version dies at about documents — single-machine memory cannot hold the signature table, let alone the candidate-pair set. At documents the only realistic shape is a distributed map-shuffle-reduce job. Below is the production skeleton: map each doc to a signature independently, flatMap each signature to band-rows, shuffle on , then verify the surviving candidate pairs.
Three subtleties matter at scale that the plain-Python version hid:
- Shuffle dominates wall-clock. The map stage is embarrassingly parallel and runs at near-line-rate. The expensive step is the network shuffle on — every executor sends band-rows to every other executor. On a well-tuned Spark cluster the shuffle is 60–80% of total wall-clock; everything else is rounding error.
- Bucket skew is the killer. A few bands of common boilerplate (Wikipedia license footer, GDPR banner) accumulate millions of docs in one bucket. Without skew protection a single reducer OOMs and the job hangs. The standard fix: pre-filter obviously-skewed band hashes (count occurrences, drop the top 0.01% before the verification stage). Lost recall on extreme boilerplate is a feature, not a bug.
- Connected components is not optional. Surviving pairs form a graph; you must compute connected components and keep one representative per component. Otherwise three copies of the same article (A↔B, B↔C) leave A and C both standing because no direct edge linked them. GraphFrames on Spark or Pregel-style iteration on a separate cluster are the two production options.
datatrove, and DeepSeek's described pipeline all use the same MinHash + LSH backbone. The differences are in tuning, in whether connected-components clustering runs, in whether dedup is global or per-snapshot, and in how aggressively oversized buckets are pruned. The math is universal; the engineering is where corpora diverge.At Massive Scale: The Real Pipeline
DeepSeek-V3's pre-training corpus is described as 14.8T high-quality tokens, deduped from a raw input of roughly 50T tokens — a 70% reduction. Three layers of dedup are stacked, each catching a different class of redundancy.
| Stage | What it catches | Algorithm | Reduction |
|---|---|---|---|
| URL & exact dedup | Mirrored & re-crawled pages | Hash of normalised URL + document SHA-256 | ≈ 30% |
| MinHash + LSH (this section) | Near-duplicate documents | k=5 shingles, K=128, b=20, r=6, J≥0.8 | ≈ 35% |
| Substring dedup (suffix array) | Repeated long n-grams within docs | Lee et al. 2022 — drop any 50-token span appearing ≥2× | ≈ 5% |
| Embedding / SemDeDup (optional) | Paraphrastic duplicates | Sentence-encoder + ε-clustering in embedding space | ≈ 0–3% |
Two observations on the table. First, MinHash + LSH is the workhorse — it removes more than all other dedup stages combined. That is why this section spent the whole math budget on it. Second, the stages compose: URL dedup gets you out of the basement cheaply, MinHash does the heavy lifting on text, substring dedup mops up partial overlaps that document-level dedup misses (think two articles that share a single quoted paragraph), and embedding dedup chases the long tail of paraphrases. Skipping any one of them leaves a different class of duplicate on the floor.
Memory and compute footprint
For 30B documents at K=128 the signature table alone is — large but fits in a few hundred nodes of distributed memory. The band-row table is times that (614 TB on disk before shuffle), and the shuffle moves it once. A well-tuned Spark cluster of ~10 000 cores finishes the full dedup of one Common Crawl snapshot in roughly 12–24 hours. Comparable to two days of training on the same hardware — small in the budget, decisive in outcome.
Why MoE training amplifies the cost of bad dedup
For a dense model, a duplicated sentence wastes a small fraction of every parameter update — the gradient is averaged across the batch and the duplicate's contribution is diluted. In a Mixture-of-Experts model (Chapter 5) the duplicate routes to the same experts every time it appears. Those experts overfit to the boilerplate, their gates become biased, and the router's load-balancing (Chapter 6) starts to wobble. The DeepSeek team reports in the V3 paper that insufficient dedup on a MoE corpus measurably worsens expert specialisation — a failure mode that is invisible on the loss curve until the model is benchmarked. Dedup is, for MoE, a routing question as much as a data question.
Engineering Reality and Gotchas
MinHash + LSH on paper is two screens of code. In production, four failure modes earn their flags:
- Shingle size is genre-sensitive. k=5 works for English web text; k=8 is standard for code (where tokens carry more structure and short n-grams collide spuriously across functions); k=3 for non-Latin scripts where word segmentation is ambiguous. The wrong can move effective recall by 20 points. Always benchmark on a held-out labelled set before locking the pipeline.
- Hash quality leaks downstream. A weak hash (MD5 truncated, Murmur with bad seed mixing) creates spurious collisions that survive the LSH and contaminate the verification stage with false positives. Always use BLAKE2 / xxHash / SHA-256 family hashes with independent salts. The cost of stronger hashing is rounding error compared to a 12-hour Spark job.
- Per-snapshot dedup leaks duplicates across snapshots. Common Crawl re-publishes most of the web every month; an article first crawled in 2022-04 reappears in 2022-05, 2022-06, … . Dedup within each snapshot misses all of these. The fix is a single cross-snapshot dedup pass at the end, treating the union as one corpus. This is the single largest source of dedup-pipeline regret for first-time teams.
- Eval contamination dedup is a separate, stricter step. The threshold for dedup against your own training corpus is J ≥ 0.8 (loose, keep variety). The threshold for dedup against the MMLU / GSM8K / HumanEval public test sets is much stricter: every 13-gram substring overlap with a known test prompt removes the containing document. The two passes use different algorithms (LSH for the first, exact n-gram suffix-array lookup for the second). Conflating them is the most common way labs publish accidentally contaminated benchmarks.
The one sentence to carry forward: dedup is the cheapest, highest-leverage stage of the entire training pipeline — the algorithm fits on two screens, the math has been stable for twenty-five years, and skipping it has cost more mis-spent GPU-years than any other single decision in language model training.