Chapter 8
20 min read
Section 42 of 117

Deduplication at Scale

Data: The Invisible Foundation

Section 8.1 sketched the full data pipeline — crawl, extract, filter, dedup, mix, tokenise. The dedup step looks innocent on the diagram, a single box between filtering and mixing. But for a 14.8T-token corpus that step is what stands between a state-of-the-art model and a slow, memorising, evaluation-leaking failure. The web is duplicate by nature: Reuters articles syndicated to a thousand outlets, AI-rephrased SEO farms, license boilerplate repeated on every legal page, cookie banners on every commercial site. Train naively and the model spends its capacity memorising the most repeated strings. Dedup is the lever that decides whether you train a language model or a giant autocomplete cache of the open web.

What this section delivers. A complete derivation of the MinHash + Locality-Sensitive Hashing (LSH) pipeline that turns the O(N2)O(N^2) all-pairs problem into an O(N)O(N) distributed job, the math behind the signature length and band geometry that controls precision vs. recall, and the production gotchas that separate a clean 14.8T-token corpus from a corrupted one. By the end you will be able to reason about why DeepSeek-V3, RefinedWeb, and FineWeb all converge on the same (b,r)(b, r) tuning regime.

The Duplication Disaster

Raw web data is staggeringly duplicated. Independent measurements on Common Crawl converge on roughly the same shape: about 30% of crawled URLs are exact duplicates of another URL (mostly mirroring and re-indexing), and roughly another 40% are near-duplicates at the paragraph or document level (syndication, scraping, AI re-write, boilerplate footers). After aggressive dedup the surviving unique-content fraction is closer to 30% of the input bytes. The remainder is, from a learning standpoint, free gradient-stealing redundancy.

Three concrete failure modes appear when you skip dedup:

  1. Memorisation crowds out generalisation. Lee et al. (2022) showed that LMs trained on un-deduped corpora memorise training sequences at rates up to 10×10\times higher than the deduped baseline — for any sequence that appears more than a handful of times. The cost is paid in parameter capacity that could have learned a new fact instead.
  2. Eval contamination ruins benchmarks. If a single MMLU question appears verbatim in the training corpus, the model may score 90% on that question by recall — telling you nothing about its reasoning. Common Crawl contains scraped copies of nearly every public benchmark; without dedup against the eval set, your reported score is unreliable in ways you cannot detect post-hoc.
  3. Effective dataset size shrinks the scaling law. Chinchilla says NDN \propto D — model parameters and unique tokens should grow together. If 70% of your 14.8T tokens are duplicates, your effective DD is closer to 4.5T, and a model sized for 14.8T is now massively over-parameterised relative to its training signal. The compute budget was spent on the wrong recipe.
Failure modeDetection difficultyCost at 671B scale
Memorisation of repeated stringsHigh — surfaces only in red-team probesCapability ceiling, PII leakage risk
Eval contaminationVery high — invalidates all reported scoresPublic credibility loss
Wasted parameter capacityMedium — visible as flat validation lossTens of millions of $ in idle compute
Boilerplate over-representationLow — visible in token-frequency statsRefusal patterns, low diversity
Why exact-match dedup is not enough. A trivial pass that hashes documents and drops repeats catches about a third of the problem. The remaining two-thirds — paraphrased syndication, AI rewrites, cookie banners with one different word — slip through unchanged because their bytes differ. We need a similarity-aware filter that runs in time proportional to the corpus size, not its square.

Intuition: From Exact Match to Near-Duplicate

Two documents are near-duplicates when most of their content overlaps, even if word order, punctuation, or a few interspersed words differ. The minimum useful representation is the k-shingle set: the set of all overlapping k-token windows of the document. Two documents with high shingle-set overlap share a high Jaccard similarity J(A,B)=AB/ABJ(A, B) = |A \cap B| / |A \cup B|. This is the target quantity.

Computing Jaccard directly is fast for a pair and impossible for NN documents. At N=3×1010N = 3 \times 10^{10} the number of pairs is 4.5×1020\approx 4.5 \times 10^{20}. If a single Jaccard comparison takes one microsecond, the full all-pairs sweep finishes in roughly 14 million years on one machine and 14 years on a million-core cluster. The straightforward algorithm is unusable. We need a sub-quadratic alternative that is almost as accurate.

The MinHash + LSH idea unfolds in two layers:

  1. Compress. Replace every document's huge shingle set with a small fixed-size signature (typically 128 integers) such that two signatures agree on a coordinate with probability exactly equal to the Jaccard similarity of the original sets. We have lost size but kept a per-position similarity oracle.
  2. Bucket. Slice each signature into bb bands of rr rows. Hash each band independently into a giant hashtable. Two documents end up in the same bucket somewhere iff at least one of their bb bands matched exactly. By picking bb and rr we get a sharp S-curve: documents above some Jaccard threshold are almost-certainly collided, documents below are almost-certainly not.
The right mental picture. Imagine every document is a person and the Jaccard similarity is the fraction of taste they share on a 100-question survey. The signature is a 128-question summary of each person's answers. LSH banding is a speed-dating venue: tables ask different sub-questions, two people share a table iff they answer one sub-questionnaire identically. Pairs with high taste-overlap almost-always find a matching table; pairs with low overlap almost-never do. We only ever ask the table neighbours to compare full notes.

The Math: Jaccard, MinHash, and LSH

Jaccard similarity

For shingle sets AA and BB:

J(A,B)=ABAB[0,1]J(A, B) = \frac{|A \cap B|}{|A \cup B|} \in [0, 1]

J=1J = 1 means the shingle sets are identical (so the documents differ only in punctuation or whitespace); J=0J = 0 means no shingle is shared. Empirically the threshold J0.8J \ge 0.8 matches what humans would call "same article, edited".

MinHash: the heart of the trick

Let π\pi be a random permutation of the universe of all possible shingles. Define hπ(A)=minxAπ(x)h_{\pi}(A) = \min_{x \in A} \pi(x) — the smallest index any shingle of A maps to under π\pi. The central identity is:

Pr[hπ(A)=hπ(B)]=J(A,B)\Pr[h_{\pi}(A) = h_{\pi}(B)] = J(A, B)

Why? For any element xABx \in A \cup B, the probability that π(x)\pi(x) is the minimum over ABA \cup B is uniform — 1/AB1 / |A \cup B|. The event hπ(A)=hπ(B)h_{\pi}(A) = h_{\pi}(B) happens exactly when that argmin lies in ABA \cap B. There are AB|A \cap B| such elements, so the probability is AB/AB=J(A,B)|A \cap B| / |A \cup B| = J(A, B). Clean and exact.

We cannot store a true random permutation of 101110^{11} elements, so we approximate it with a strong hash function under a fresh salt — KK times. The signature is the length-KK vector sig(A)=(h1(A),,hK(A))\text{sig}(A) = (h_1(A), \dots, h_K(A)). The fraction of agreement between two signatures is an unbiased estimator of J(A,B)J(A, B) with standard error σJ(1J)/K\sigma \approx \sqrt{J(1 - J) / K}. At K=128K = 128 and J=0.8J = 0.8 the standard error is about 0.035 — tight enough to threshold reliably.

LSH banding: turning the estimator into an index

A pairwise signature comparison still costs O(N2)O(N^2) work. The locality-sensitive trick: slice the length-KK signature into bb bands of rr rows, with br=Kb \cdot r = K. Hash each band; two documents collide in band jj iff every one of the rr rows in that band matches. The probability that one band matches is srs^r where s=J(A,B)s = J(A, B). The probability that any of the bb bands matches is therefore:

Pcollide(s)=1(1sr)bP_{\text{collide}}(s) = 1 - (1 - s^{r})^{b}

This is the LSH S-curve. It rises sharply from near 0 below some critical similarity to near 1 above it. The inflection point — the similarity at which collision probability equals one-half — is approximately:

s(1/b)1/rs^{*} \approx (1/b)^{1/r}

Three intuitions to lock in:

LeverWhat it controlsCost
Increase rSharper cutoff, fewer false positivesMore missed true duplicates near threshold
Increase bLower threshold, catches looser duplicatesMore candidate pairs to verify (slower)
Increase K = b·rLower variance of similarity estimateLinear cost in storage and hashing

Production pipelines target s0.7s^{*} \approx 0.7 with thresholds at 0.80.8. With K=128K = 128, the canonical choice is b=20,r=6b = 20, r = 6 (FineWeb default) or b=14,r=9b = 14, r = 9 (RefinedWeb default). The first is more permissive, the second more conservative; the difference shows up in how many surviving boilerplate pages your corpus keeps.

The S-curve is the entire control surface. Every published dedup pipeline reports (b,r)(b, r) and a Jaccard threshold. Those three numbers, together with the shingle size kk, completely determine the precision/recall of the dedup. Tuning them is the difference between a 14.8T-token corpus and a 9T-token one.

Manual Numerical Walkthrough

Let us run the full machinery on two micro-documents by hand, with k=2k = 2 shingles and K=4K = 4 hash functions sliced as b=2,r=2b = 2, r = 2. Every number is computed explicitly so the mechanism is laid bare.

Click to expand: two docs, K=4, by hand

Documents. A = "the cat sat on the mat" and B = "the cat sat on a mat" — a 1-word edit.

Step 1: 2-shingles. A = {"the cat", "cat sat", "sat on", "on the", "the mat"} (5 shingles). B = {"the cat", "cat sat", "sat on", "on a", "a mat"} (5 shingles).

Step 2: true Jaccard. AB=3|A \cap B| = 3 (the cat / cat sat / sat on), AB=7|A \cup B| = 7 (5 + 5 − 3). So J(A,B)=3/70.43J(A, B) = 3/7 \approx 0.43 — below the usual 0.8 threshold; these two would NOT be dedup'd at the production setting, but we will see the LSH machinery in action anyway.

Step 3: hash each shingle under 4 different salts. Imagine the hash table below (each row is one shingle, each column is one of the K=4 hash functions; numbers are arbitrary but reproducible).

shingleh₁h₂h₃h₄
the cat17421988
cat sat31117423
sat on5603314
on the5229871
the mat4435766
on a2954112
a mat6372239

Step 4: MinHash signatures = min over each column, restricted to the doc's own shingles.

For A (rows the cat / cat sat / sat on / on the / the mat): sig(A)=(min{17,31,5,52,44}, min{42,11,60,29,3}, min{19,74,33,8,57}, min{88,23,14,71,66})=(5,3,8,14)\text{sig}(A) = (\min\{17,31,5,52,44\},\ \min\{42,11,60,29,3\},\ \min\{19,74,33,8,57\},\ \min\{88,23,14,71,66\}) = (5, 3, 8, 14).

For B (rows the cat / cat sat / sat on / on a / a mat): sig(B)=(min{17,31,5,2,63}, min{42,11,60,95,7}, min{19,74,33,41,22}, min{88,23,14,12,39})=(2,7,19,12)\text{sig}(B) = (\min\{17,31,5,2,63\},\ \min\{42,11,60,95,7\},\ \min\{19,74,33,41,22\},\ \min\{88,23,14,12,39\}) = (2, 7, 19, 12).

Step 5: estimated Jaccard from signatures. Compare position-by-position: (5,3,8,14)(5,3,8,14) vs. (2,7,19,12)(2,7,19,12). Zero positions match → the MinHash estimate is J^=0/4=0\hat{J} = 0/4 = 0. The true Jaccard was 0.43; with only K=4 hashes the noise dominates. This is why production uses K=128 — the standard error falls as 1/K1/\sqrt{K}.

Step 6: LSH banding with b=2, r=2. Band 1 of A is (5,3)(5, 3); band 1 of B is (2,7)(2, 7). Not equal. Band 2 of A is (8,14)(8, 14); band 2 of B is (19,12)(19, 12). Not equal. So this pair never shares a bucket — it would never even be considered for verification. Correct outcome (true J = 0.43 is below threshold) but via a low-K demo, not a real signal.

Check against the S-curve. Pcollide(s=0.43,b=2,r=2)=1(10.432)2=1(10.185)2=10.664=0.336P_{\text{collide}}(s = 0.43, b=2, r=2) = 1 - (1 - 0.43^{2})^{2} = 1 - (1 - 0.185)^{2} = 1 - 0.664 = 0.336. So roughly 1 in 3 such pairs would actually collide; we got unlucky on this throw, and that is exactly the kind of noise production drowns out by setting K=128, b=20, r=6 — at which point Pcollide(0.8,20,6)0.99P_{\text{collide}}(0.8, 20, 6) \approx 0.99 and Pcollide(0.4,20,6)0.08P_{\text{collide}}(0.4, 20, 6) \approx 0.08.

Visualizing MinHash + LSH on a Toy Corpus

The visualizer below holds six documents: an original Reuters article, a syndicated copy with two extra words, an AI rephrase, two cookie banners that differ by one word, and one totally unrelated recipe. Each document is converted into its 3-shingle set, then into a 12-slot MinHash signature. Drag the rows per band slider and watch both the signature colour-banding and the S-curve on the right respond. Increase rr to sharpen the cutoff (you will watch borderline pairs drop out of the collide column); decrease it to flatten the curve and pick up more candidates.

Loading MinHash + LSH visualizer…

Three patterns are worth burning into intuition. First, A & B (Reuters and its syndication) and D & E (the two cookie banners) almost always collide — their Jaccard is well above the inflection. Second, C (the AI rephrase) is the hard case: low shingle overlap because the surface words moved, so MinHash + LSH can miss it. This is exactly the motivation behind semantic dedup (SemDeDup, embedding-based dedup) at the end of this section — MinHash catches lexical near-duplicates, embeddings catch paraphrastic ones. Third, F (the recipe) never collides with anything, which is the boring case the algorithm is engineered to make cheap.

Plain Python: MinHash + LSH from Scratch

Before touching any library let's implement the entire pipeline in ~50 lines of plain Python — shingling, signature, banding, verification, and survivor selection. The point is to expose every decision the production system inherits.

🐍dedup_plain.py
4k-shingles = local word windows

A document is reduced to the set of its k-token sliding windows. k=5 is the workhorse for English web text — short enough to overlap heavily on paraphrases, long enough to make accidental collisions astronomically unlikely. Sets, not lists: two copies of a sentence still count as one shingle.

EXECUTION STATE
k (typical web text) = 5
shingles per 100-word doc = ≈ 96
9MinHash = min over hashed shingles, repeated N times

True random permutations of the shingle universe are impossible to store (the universe has ~10^11 distinct shingles in Common Crawl). The standard trick: use N independent hash functions, take min over the document's shingles under each one. Each min is the index of the 'first-surviving' shingle under that permutation — an unbiased estimator of Jaccard similarity per slot.

EXECUTION STATE
num_hashes (production) = 128 (FineWeb), 256 (RefinedWeb)
memory per doc = 128 × 8 B = 1 KB
14The minimum hash per slot is the signature byte

For each of the N hash functions we scan the document's shingles and keep the smallest hash value seen. The collection of N mins is the MinHash signature. Critically, the probability that two documents have the SAME min on a given slot equals their true Jaccard similarity — that is the entire mathematical foundation of the method.

EXECUTION STATE
sig.shape = (num_hashes,) of uint64
21LSH banding: slice the signature into b chunks of r rows

A length-(b·r) signature becomes b separate band-hashes. Two documents share a bucket in band j iff their j-th band is byte-identical. The bargain: instead of all O(N²) pairwise comparisons, we only verify pairs that landed in the same bucket at least once.

EXECUTION STATE
(b, r) typical = (20, 6) → K=120
candidate pairs = ≪ O(N²), usually O(N)
26Bucket = (band index, band content) → list of docs

Distinct bands live in distinct namespaces — a band-3 hash and a band-7 hash that happen to equal each other do NOT cause a collision. That separation is what makes the b independent attempts statistically independent.

30Candidate pairs from co-bucketed docs

Inside one bucket with m docs we emit m·(m-1)/2 candidate pairs. In practice almost every bucket has m=1; pathologically duplicated content (boilerplate, license text) creates the few enormous buckets that dominate the work.

37Exact Jaccard verification on candidates

MinHash is approximate — a pair might collide in LSH without being above threshold. We verify with exact set Jaccard, which is fast because (a) the candidate set is tiny and (b) shingle sets are already in memory. This two-stage design is the whole point: cheap recall, then accurate precision.

EXECUTION STATE
threshold (FineWeb default) = 0.8
41Keep the lower-indexed doc, drop the other

A deterministic tie-break that respects the corpus ordering (earlier = canonical). Production pipelines sometimes break ties by language-model perplexity instead, keeping the more fluent version.

51 lines without explanation
1import hashlib, random
2from collections import defaultdict
3
4def shingles(text, k=5):
5    """k-shingles = sliding windows of k whitespace tokens.
6       Captures local word order; insensitive to global structure."""
7    toks = text.lower().split()
8    return {" ".join(toks[i:i + k]) for i in range(len(toks) - k + 1)}
9
10def minhash(shingle_set, num_hashes=128, seed=42):
11    """Approximate one random permutation per hash slot by hashing each
12       shingle with a different salt and keeping the minimum.
13       Returns a length-num_hashes signature of 64-bit ints."""
14    rng = random.Random(seed)
15    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
16    sig = [float("inf")] * num_hashes
17    for sh in shingle_set:
18        for i, salt in enumerate(salts):
19            h = int(hashlib.blake2b(
20                f"{salt}|{sh}".encode(), digest_size=8
21            ).hexdigest(), 16)
22            if h < sig[i]:
23                sig[i] = h
24    return sig
25
26def lsh_buckets(sigs, b=20, r=6):
27    """Slice each length-(b*r) signature into b bands of r rows.
28       Two docs share a bucket iff at least one band matches exactly."""
29    assert all(len(s) == b * r for s in sigs)
30    buckets = defaultdict(list)          # (band_idx, band_hash) -> doc_ids
31    for doc_id, sig in enumerate(sigs):
32        for band in range(b):
33            chunk = tuple(sig[band * r : (band + 1) * r])
34            buckets[(band, chunk)].append(doc_id)
35    # Candidate pairs = any two docs that share at least one bucket.
36    pairs = set()
37    for docs in buckets.values():
38        for i in range(len(docs)):
39            for j in range(i + 1, len(docs)):
40                pairs.add((docs[i], docs[j]))
41    return pairs
42
43def jaccard(a, b):
44    if not a and not b:
45        return 1.0
46    return len(a & b) / len(a | b)
47
48def dedup(docs, threshold=0.8, b=20, r=6):
49    """End-to-end dedup. Returns the surviving doc IDs."""
50    shs = [shingles(d) for d in docs]
51    sigs = [minhash(s, num_hashes=b * r) for s in shs]
52    candidates = lsh_buckets(sigs, b=b, r=r)
53    # Verify with exact Jaccard on shingle sets — the cheap pre-filter
54    # we just ran throws away >99.9% of pairs, so this scan is tractable.
55    drop = set()
56    for i, j in candidates:
57        if jaccard(shs[i], shs[j]) >= threshold:
58            drop.add(max(i, j))          # keep the lower-index doc
59    return [d for k, d in enumerate(docs) if k not in drop]

Three structural details deserve a second pass. First, MinHash is approximate but unbiased: the expected number of signature-slot agreements is exactly KJK \cdot J. The standard error falls as 1/K1/\sqrt{K}, so doubling KK halves the noise on every Jaccard estimate — at linear cost in memory and hashing throughput.

Second, the LSH step is a recall filter, not a decision rule. It exists to make the exact-Jaccard verification tractable; without verification the false-positive rate at any (b,r)(b, r) would let through too much noise. Always pair the two stages; never trust LSH alone.

Third, the tie-break (keep lower-indexed doc) is a deterministic choice that matters in practice. Production pipelines often replace it with "keep the doc whose language model perplexity is lower" — the better-formed version of two near-duplicates wins. The selection rule does not change the dedup math, only which copy survives.

Sanity check. Run dedup on a 1000-doc toy corpus with 200 injected duplicates (random word swaps). At (b,r,K)=(20,6,128)(b, r, K) = (20, 6, 128) and threshold 0.8, you should recover roughly 195/200 duplicates with under 5 false positives. That ratio is the empirical S-curve in action.

PySpark: Dedup at 14.8T Tokens

The plain Python version dies at about 10710^{7} documents — single-machine memory cannot hold the signature table, let alone the candidate-pair set. At 101010^{10} documents the only realistic shape is a distributed map-shuffle-reduce job. Below is the production skeleton: map each doc to a signature independently, flatMap each signature to bb band-rows, shuffle on (band,hash)(\text{band}, \text{hash}), then verify the surviving candidate pairs.

🐍dedup_pyspark.py
6Read raw corpus — billions of rows, hundreds of TB

Common Crawl 2024 alone is ~400 TB of raw HTML, ~95 TB after text extraction. The Parquet layout matters: one row per document, partitioned by crawl-month, lets Spark split work across thousands of executors with no skew.

EXECUTION STATE
rows in scope = ≈ 10–50 B docs
executors typical = 1 000 – 10 000
11Signature length controls memory vs. precision

128 permutations gives a Jaccard estimator with standard error ≈ 1/√128 ≈ 8.8%. RefinedWeb pushes this to 256 (s.e. ≈ 6.3%) for higher-confidence dedup. The per-doc memory is num_hashes × 8 bytes — at 30 B docs, 128 hashes is 240 GB of signatures, easy to keep distributed.

14Map stage: each doc → one signature, independently

Embarrassingly parallel — no shuffle, no cross-doc communication. The bottleneck is text parsing and shingle hashing, both CPU-bound. On a 10 000-core cluster this stage runs at ~5 M docs/sec; the full 30 B-doc corpus takes a few hours.

EXECUTION STATE
throughput = ~5 M docs/sec/cluster
25flatMap explodes one doc into b band-hashes

This is the data-engineering version of the LSH idea: every doc now appears b=20 times in the band table, once per band. The table grows by 20× but each row is tiny, and the next groupBy is what makes the algorithm tractable.

EXECUTION STATE
band table size = 20 × 30 B = 600 B rows
row size = ~24 B → ~14 TB
33Shuffle on (band, h) — duplicates funnel to one reducer

The Spark shuffle is the only expensive network step. Any two docs that share even one band hash will land on the same reducer; everyone else is invisible to each other. This is where O(N²) becomes O(N) — without the shuffle, dedup at this scale would be impossible.

34collect_list per bucket → candidate clusters

Each surviving bucket holds the doc IDs that collided in that band. Most buckets have size 1 and are discarded. A few pathological buckets (license footers, cookie banners, AI-generated SEO spam) can hold tens of millions of docs — the pipeline must guard against memory blow-ups here by spilling oversized buckets to disk.

41Pair generation + exact-Jaccard verification

From clusters we emit O(m²) candidate pairs per bucket of size m. Verification re-fetches the shingle sets (broadcast or join), computes exact Jaccard, and keeps the pair iff ≥ THRESHOLD. The two-stage MinHash → exact-Jaccard split lets a 30 B-doc dedup finish in a day on a normal Spark cluster.

EXECUTION STATE
candidates per bucket = ≪ 0.1% of N
wall-clock total = ~12–24 h
43What is NOT shown: connected-components clustering

A real pipeline does one more step — build a graph from surviving pairs and take connected components, then keep one representative per component. Without this, 'A duplicates B' and 'B duplicates C' leave A and C as separate survivors. FineWeb and RefinedWeb both use GraphFrames on the Spark cluster for this.

39 lines without explanation
1from datasketch import MinHash, MinHashLSH
2from pyspark.sql import SparkSession, Row
3import hashlib
4
5spark = SparkSession.builder.appName("dedup").getOrCreate()
6docs = spark.read.parquet("s3://corpus/raw/cc-2024-*").select("doc_id", "text")
7
8NUM_HASHES = 128                       # ⇒ b=20, r=6 by default in datasketch
9THRESHOLD  = 0.80
10
11def doc_to_sig(row):
12    """Map: one document → (doc_id, MinHash signature)."""
13    text = row.text.lower().split()
14    shingles = {" ".join(text[i:i+5]) for i in range(len(text) - 4)}
15    m = MinHash(num_perm=NUM_HASHES, seed=42)
16    for sh in shingles:
17        m.update(sh.encode("utf8"))
18    return Row(doc_id=row.doc_id, sig=m.digest().tolist())
19
20sigs = docs.rdd.map(doc_to_sig).toDF().cache()
21
22# Emit (band_id, band_hash, doc_id) tuples — one row per band per doc.
23def explode_bands(row, b=20, r=6):
24    out = []
25    for band in range(b):
26        chunk = tuple(row.sig[band * r : (band + 1) * r])
27        h = hashlib.blake2b(repr((band, chunk)).encode(), digest_size=16).digest()
28        out.append((band, h, row.doc_id))
29    return out
30
31bands = sigs.rdd.flatMap(explode_bands).toDF(["band", "h", "doc_id"])
32
33# Shuffle: docs sharing a (band, h) land on the same reducer.
34candidate_pairs = (
35    bands.groupBy("band", "h")
36         .agg({"doc_id": "collect_list"})
37         .withColumnRenamed("collect_list(doc_id)", "docs")
38         .filter("size(docs) > 1")
39         .selectExpr("explode(docs) AS a", "docs")
40         .selectExpr("a", "explode(docs) AS b")
41         .filter("a < b")
42         .distinct()
43)
44
45# Verify with broadcast-joined signatures; exact Jaccard on shingle sets
46# is run only on this dramatically reduced candidate set.
47# (Verification UDF omitted for brevity — see section text.)

Three subtleties matter at scale that the plain-Python version hid:

  1. Shuffle dominates wall-clock. The map stage is embarrassingly parallel and runs at near-line-rate. The expensive step is the network shuffle on (band,h)(\text{band}, h)— every executor sends band-rows to every other executor. On a well-tuned Spark cluster the shuffle is 60–80% of total wall-clock; everything else is rounding error.
  2. Bucket skew is the killer. A few bands of common boilerplate (Wikipedia license footer, GDPR banner) accumulate millions of docs in one bucket. Without skew protection a single reducer OOMs and the job hangs. The standard fix: pre-filter obviously-skewed band hashes (count occurrences, drop the top 0.01% before the verification stage). Lost recall on extreme boilerplate is a feature, not a bug.
  3. Connected components is not optional. Surviving pairs form a graph; you must compute connected components and keep one representative per component. Otherwise three copies of the same article (A↔B, B↔C) leave A and C both standing because no direct edge linked them. GraphFrames on Spark or Pregel-style iteration on a separate cluster are the two production options.
Implementation note. RefinedWeb's public codebase, FineWeb's datatrove, and DeepSeek's described pipeline all use the same MinHash + LSH backbone. The differences are in (b,r)(b, r) tuning, in whether connected-components clustering runs, in whether dedup is global or per-snapshot, and in how aggressively oversized buckets are pruned. The math is universal; the engineering is where corpora diverge.

At Massive Scale: The Real Pipeline

DeepSeek-V3's pre-training corpus is described as 14.8T high-quality tokens, deduped from a raw input of roughly 50T tokens — a 70% reduction. Three layers of dedup are stacked, each catching a different class of redundancy.

StageWhat it catchesAlgorithmReduction
URL & exact dedupMirrored & re-crawled pagesHash of normalised URL + document SHA-256≈ 30%
MinHash + LSH (this section)Near-duplicate documentsk=5 shingles, K=128, b=20, r=6, J≥0.8≈ 35%
Substring dedup (suffix array)Repeated long n-grams within docsLee et al. 2022 — drop any 50-token span appearing ≥2×≈ 5%
Embedding / SemDeDup (optional)Paraphrastic duplicatesSentence-encoder + ε-clustering in embedding space≈ 0–3%

Two observations on the table. First, MinHash + LSH is the workhorse — it removes more than all other dedup stages combined. That is why this section spent the whole math budget on it. Second, the stages compose: URL dedup gets you out of the basement cheaply, MinHash does the heavy lifting on text, substring dedup mops up partial overlaps that document-level dedup misses (think two articles that share a single quoted paragraph), and embedding dedup chases the long tail of paraphrases. Skipping any one of them leaves a different class of duplicate on the floor.

Memory and compute footprint

For 30B documents at K=128 the signature table alone is 30×109×128×8=30.7 TB30 \times 10^{9} \times 128 \times 8 = 30.7 \text{ TB} — large but fits in a few hundred nodes of distributed memory. The band-row table is b=20b = 20 times that (614 TB on disk before shuffle), and the shuffle moves it once. A well-tuned Spark cluster of ~10 000 cores finishes the full dedup of one Common Crawl snapshot in roughly 12–24 hours. Comparable to two days of training on the same hardware — small in the budget, decisive in outcome.

Why MoE training amplifies the cost of bad dedup

For a dense model, a duplicated sentence wastes a small fraction of every parameter update — the gradient is averaged across the batch and the duplicate's contribution is diluted. In a Mixture-of-Experts model (Chapter 5) the duplicate routes to the same experts every time it appears. Those experts overfit to the boilerplate, their gates become biased, and the router's load-balancing (Chapter 6) starts to wobble. The DeepSeek team reports in the V3 paper that insufficient dedup on a MoE corpus measurably worsens expert specialisation — a failure mode that is invisible on the loss curve until the model is benchmarked. Dedup is, for MoE, a routing question as much as a data question.

The compound effect on Chinchilla scaling. Train a 67B-parameter dense model on 1.3T tokens of which 30% are near-duplicates. Effective D0.9TD \approx 0.9 \text{T} — well below the Chinchilla ratio of D=20ND = 20 N which would call for 1.34T tokens. The model is now under-trained, the loss is on the wrong scaling curve, and adding more raw web data without dedup only widens the gap. The single most cost-effective intervention in massive model training is not bigger GPUs or longer training — it is a stricter dedup pipeline.

Engineering Reality and Gotchas

MinHash + LSH on paper is two screens of code. In production, four failure modes earn their flags:

  1. Shingle size kk is genre-sensitive. k=5 works for English web text; k=8 is standard for code (where tokens carry more structure and short n-grams collide spuriously across functions); k=3 for non-Latin scripts where word segmentation is ambiguous. The wrong kk can move effective recall by 20 points. Always benchmark on a held-out labelled set before locking the pipeline.
  2. Hash quality leaks downstream. A weak hash (MD5 truncated, Murmur with bad seed mixing) creates spurious collisions that survive the LSH and contaminate the verification stage with false positives. Always use BLAKE2 / xxHash / SHA-256 family hashes with independent salts. The cost of stronger hashing is rounding error compared to a 12-hour Spark job.
  3. Per-snapshot dedup leaks duplicates across snapshots. Common Crawl re-publishes most of the web every month; an article first crawled in 2022-04 reappears in 2022-05, 2022-06, … . Dedup within each snapshot misses all of these. The fix is a single cross-snapshot dedup pass at the end, treating the union as one corpus. This is the single largest source of dedup-pipeline regret for first-time teams.
  4. Eval contamination dedup is a separate, stricter step. The threshold for dedup against your own training corpus is J ≥ 0.8 (loose, keep variety). The threshold for dedup against the MMLU / GSM8K / HumanEval public test sets is much stricter: every 13-gram substring overlap with a known test prompt removes the containing document. The two passes use different algorithms (LSH for the first, exact n-gram suffix-array lookup for the second). Conflating them is the most common way labs publish accidentally contaminated benchmarks.
What DeepSeek monitors throughout the dedup run. Three live metrics: dedup rate per snapshot (should plateau around 70%; a sudden swing means a crawl-quality regression), token-frequency Zipf slope on the survivor corpus (a flatter slope after dedup is the expected signature; a sharper slope means under-dedup), and per-domain survival rate (boilerplate-heavy domains like e-commerce should drop from millions of pages to thousands; a flat ratio means LSH banding is too lax). All three are dashboarded throughout the multi-hour job.

The one sentence to carry forward: dedup is the cheapest, highest-leverage stage of the entire training pipeline — the algorithm fits on two screens, the math has been stable for twenty-five years, and skipping it has cost more mis-spent GPU-years than any other single decision in language model training.

Where we go from here. The corpus survives dedup but is not yet ready to train on — it still contains spam, machine translation artifacts, malformed HTML extraction, and toxic content. Section 8.3 turns to quality filtering: how to score and discard the low-value pages that dedup left behind, using model-based classifiers, perplexity filters, and rule-based heuristics. The survivors of dedup + filtering are what 14.8T-token corpora are made of.
Loading comments...