Every public benchmark, every architecture diagram, every loss curve in a modern LLM paper sits on top of one piece of infrastructure that is almost never drawn: the data pipeline. DeepSeek-V3 was pretrained on 14.8 trillion tokens. Those tokens did not exist on disk anywhere in the world before the team built them. They were manufactured — assembled, filtered, deduplicated, scored, and repacked — from several petabytes of raw web crawl, code repositories, and scientific text. This section is the architectural blueprint of that manufacture. What goes in is web sludge; what comes out is the most information-dense corpus a 671B model has ever been fed. Everything that follows in this chapter is about doing each step well; this section is about why the steps exist, what shape they form, and how the shape governs every other engineering choice in pretraining.
The pipeline in one sentence. Take a petabyte of dirty text, run it through eight successive filters that each throw away the majority of what they see, tokenize what survives, shuffle it across thousands of GPUs, and feed it at a constant 3-million-tokens-per-second line rate for 54 days. The architecture is whatever lets that sentence finish without dropping a packet.
The Problem: From Web Sludge to 14.8T Clean Tokens
Start from the naïve question: where do the tokens come from? The public answer — "a snapshot of the internet" — hides the actual engineering problem. Raw CommonCrawl is roughly 400 TB of compressed WARC per snapshot. After HTML extraction it expands to a few petabytes of plain text. Of that, the overwhelming majority is unusable for pretraining: it is the wrong language, it is templated SEO spam, it is bit-identical copies of articles seen a hundred times in earlier shards, it is product listings, it is autogenerated forum boilerplate, it is parked-domain placeholder text, it is malware-injected mirrors. A pretraining corpus is what is left after you remove all of that. The job of the pipeline is to remove it without removing anything useful and without starving the training run.
Three pressures push back on each other and define every design choice:
- Quality. A 671B model is unforgiving. Junk tokens do not just produce no signal — they actively burn gradient capacity that could have gone to high-information text. The Chinchilla / DeepSeek scaling laws (Chapter 9) assume tokens are roughly equal in worth, so letting a quarter of your corpus be ad copy effectively shrinks the model.
- Volume. Even after aggressive filtering you still need tokens of survivors. That means you must start with tokens of raw input — petabyte-scale. Per-document Python code that is too slow by 10× turns a 3-day filter pass into a 30-day filter pass.
- Throughput. Training never pauses. The pipeline must deliver tokens to H800-class GPUs at line rate — about 3 million tokens per second aggregate — with zero stalls. A GPU starved for input is a GPU costing $10/hour to do nothing.
These three pressures form a triangle that pretraining engineers have learned to navigate by separating concerns: an offline stage that prioritizes quality and volume but can take days, and an online stage that prioritizes throughput and runs in lockstep with training. The whole pipeline is the choreography between the two.
Intuition: A Refinery, Not a Hard Drive
The instinct of someone seeing "14.8T tokens" for the first time is to picture a giant hard drive. That mental model is wrong in a useful way. The right mental model is an oil refinery.
Crude oil arrives at a refinery in vast, dirty volumes. It is fractioned in successive towers — each tower removes one class of impurity and passes the rest forward. Lighter, more valuable products come out the top; sludge comes out the bottom. The refinery's value is not the crude (which is cheap and abundant) and it is not the storage tanks (any company can buy storage). The value is the process: the sequence of towers, the temperatures, the catalysts, the throughput rate. A refinery that produces 1 million barrels a day of clean gasoline is worth more than a refinery that produces 1 million barrels a year of the same gasoline. Throughput is part of the product.
A pretraining data pipeline is the same shape. Raw crawl is the crude. Each stage — language ID, exact dedup, near-dedup, quality classifier — is a tower. The clean tokens at the bottom are the gasoline. And the rate at which the bottom of the funnel delivers tokens to the model is the refinery's output: a number that has to match the model's consumption rate every second of every training day.
Why not skip stages and just feed more crude? Two reasons. First, the downstream stages cost more per token. A quality classifier that runs at 1k docs/sec/CPU is fine when fed pre-deduplicated text; it is bankrupting when fed raw crawl because 90% of its work is on documents that exact dedup would have removed for free. The filters are ordered by cost per kept token: cheap, high-throw-away filters first; expensive, surgical filters last. Second, the model's loss responds more to the worst 5% of its training data than to the best 95%. Skipping a stage upgrades the worst 5%, not the average. The economics push you toward thorough filtering, not toward more crude.
The Math: Funnel Retention and Train-Loader Throughput
Two equations capture the entire architectural problem. The first is the funnel itself. Let the pipeline have stages, with per-stage retention rate — the fraction of input bytes (or documents, or tokens; we will use bytes for now and convert at the end) that survive stage . Let be the raw input in bytes. After stage the surviving bytes are:
And the final survivors in tokens are:
where is the average number of bytes per tokenized unit (about 4 for BPE on English text, lower for compressed languages, higher for code). The product is the overall retention rate. For DeepSeek-V3-class pipelines it lands at roughly to : you need 50–200× more raw bytes than you will keep tokens. The training token budget is the spec; the raw-input budget is the spec multiplied by .
The second equation is the throughput contract. If the model consumes tokens in days of training across GPUs, then the loader must deliver:
tokens per second of aggregate throughput, and per-GPU:
For the DeepSeek-V3 numbers (, , ), tokens/sec aggregate and tokens/sec per GPU. At 4 bytes/token that is roughly aggregate read bandwidth the loader has to sustain — every second, for two months — without ever leaving a GPU waiting.
Manual Numerical Walkthrough
Plug the DeepSeek-V3 numbers into both equations and follow them with pen and paper.
Click to expand: funnel + throughput arithmetic for the 14.8T budget
Step 1 — Choose retention rates per stage. Realistic numbers from the public literature (RefinedWeb, Dolma, DeepSeek tech report):
- Text extraction (HTML → text): r₁ = 0.50
- Language ID (English + Chinese + code): r₂ = 0.40
- URL / heuristic filter: r₃ = 0.50
- Exact dedup: r₄ = 0.50
- Near-dedup (MinHash LSH): r₅ = 0.50
- Quality classifier: r₆ = 0.40
- PII + safety: r₇ = 0.95
- Tokenize + pack: r₈ = 1.00 (no further filtering, just encoding)
Step 2 — Product of retention.
∏ rⱼ = 0.50 · 0.40 · 0.50 · 0.50 · 0.50 · 0.40 · 0.95 · 1.00
= 0.0095 ≈ 0.95%
Roughly 1 in 105 raw bytes survives. That sets the raw-input budget.
Step 3 — Required raw input. Target tokens ; bytes/token ; survivor bytes ≈ 59.2 TB.
B₀ = B_S / ∏ rⱼ = 5.92 × 10¹³ / 0.0095 ≈ 6.23 × 10¹⁵ bytes ≈ 6.23 PB
You need about 6 PB of raw crawl to manufacture 14.8 TB of training-grade tokens. That is roughly the volume of two full CommonCrawl snapshots plus aggressive code and book corpora. Real teams provision 2–3× that as a safety margin because retention estimates always slip downward in production.
Step 4 — Offline cluster sizing. Assume the slowest stage runs at documents/sec/CPU core (the quality classifier on a CPU-bound fastText model). With documents averaging ~10 KB, that is 10 MB/s/core. To process 6 PB in 14 calendar days:
cores_needed = 6 × 10¹⁵ B / (10 × 10⁶ B/s/core · 14 · 86400 s)
≈ 4,960 cores ≈ 78 high-core hosts of 64 cores each
Roughly 80 CPU hosts running for two weeks. That is the size of the offline pre-pass — and the bill that frontier labs hide behind phrases like "custom data pipeline".
Step 5 — Online throughput contract. Plug into the second equation:
τ = 14.8 × 10¹² / (54 · 86400) ≈ 3.17 × 10⁶ tokens/sec
τ_GPU = 3.17 × 10⁶ / 2048 ≈ 1,550 tokens/sec/GPU
aggregate read = τ · 4 B ≈ 12.7 GB/s
12.7 GB/s sustained over 54 days is roughly 60 PB of total reads — more than the raw corpus, because the loader cycles through the tokenized shards multiple times during pretraining. A single NVMe SSD sustains ~3 GB/s; a striped NVMe pool of 16 drives or a distributed object store (S3, Ceph, Lustre) is the typical landing spot.
Step 6 — Per-GPU loader budget. 1,550 tokens/sec = 6.2 KB/sec at 4 B/token. Per GPU the bandwidth is trivial. What is not trivial is the latency: every GPU step must have a batch in pinned host memory before the previous step finishes DMA-ing the last one. A 4096-context batch of micro-batch 2 is 65 KB; the DataLoader must produce it in less than the GPU's single training-step time, which on a 671B MoE is roughly ms. So each CPU worker has roughly 200 ms of slack — long compared to memory copies, very short compared to a Python dictionary lookup or a TCP round-trip.
Step 7 — Sanity check against a tighter pipeline. Tighten dedup to r₅ = 0.35 and quality to r₆ = 0.50. ∏ rⱼ becomes 0.50 · 0.40 · 0.50 · 0.50 · 0.35 · 0.50 · 0.95 ≈ 0.0083 — slightly worse. The required raw input grows to 7.1 PB, but the kept tokens are higher quality. The pipeline designer chooses where on this quality-vs-volume frontier the project lands; the math gives them the price of every choice.
Visualizing the Pipeline
Each slice of the funnel below is one stage. The slice's left edge is the bytes entering the stage; its right edge is the bytes surviving. Drag the retention sliders on the right and watch the funnel collapse in real time. The output meter at the top right turns green when the surviving tokens cross the 14.8T target and red when they fall short.
Three things to try. First, click the "Lazy pipeline" preset. Retention is much higher per stage, the funnel barely narrows, you exceed 14.8T tokens easily — but the "tokens" you keep are saturated with templated SEO and near-duplicates. This is the trap of skipping stages: the volume is fine, the quality is not. Second, switch to the DeepSeek-V3 default and drag the near-dedup slider to 0.20. Watch the output meter dip below target — you would now need to feed 9 PB of raw input to recover. Aggressive filtering is cheap in compute and expensive in raw-input requirements. Third, increase the GPU count from 2048 to 4096 and watch per-GPU throughput halve. The loader has more slack per GPU but the aggregate bandwidth requirement still has to be met by your storage layer.
Plain Python: A Composable Generator Pipeline
Before pulling in PyTorch, it is worth writing the offline pipeline as plain Python. The whole structure is generator composition: each stage is a function that consumes an iterator and yields an iterator. The same code runs on a single laptop with a 1 MB shard and — modulo distribution — on a 4-PB CommonCrawl dump.
Three architectural points are worth marking. First, nothing in this pipeline materializes a list. Every stage is lazy. That is what makes it scale: peak memory is a constant function of the per-document size, not the corpus size. Second, the order of stages matters economically. Language ID and exact dedup both throw away half their input but cost <1 µs per doc; a perplexity-based quality model costs ~1 ms per doc. Putting the cheap filters first amortizes the expensive ones, which is why every production pipeline from Dolma to NeMo Curator runs the cheap stages early. Third, the "DAG" is just function composition. Production frameworks add scheduling, retries, sharded shuffles, and provenance tracking, but every one of them ultimately compiles down to a chain of operators identical in shape to the chain above.
Where this code grows up. The same five-function skeleton — ingest, filter, dedup, score, tokenize — appears in every published pretraining pipeline. RedPajama-Data, Dolma, FineWeb, Nemotron Curator, and DeepSeek-V3's internal toolchain are all variations on this theme with industrial scheduling on top.
PyTorch: Sharded IterableDataset and the Train Loader
Once the offline pipeline has produced tokenized shards on disk, the online half takes over. Its only job is to stream those shards into the GPU at line rate, without ever replaying tokens and without ever stalling. PyTorch's IterableDataset + DataLoader is the canonical implementation. The complexity is not in the iterator body — it is in the sharding logic that ensures each token reaches exactly one GPU per epoch.
Two subtleties recur in every real loader and are worth surfacing:
- Token-level shuffling is wrong. If you naïvely shuffle across token boundaries, you destroy long-range structure within documents — the very signal a 4096-context transformer is supposed to model. The shuffle granularity is the shard, sometimes the document, never the token. The packed-sequence loader above preserves the original token order within each document and only shuffles the order in which documents are streamed.
- Resumability is part of the loader, not the trainer. When training crashes (and at 54 days of wall-clock, it will), you must resume from the last token boundary, not the last batch. Production loaders persist a (shard_index, byte_offset) pair per (rank, worker) into the checkpoint and reseed exactly there on restart. Get this wrong and you replay or skip millions of tokens silently, which surfaces as mysterious loss spikes weeks later.
What Changes at Massive Scale
At toy scale the pipeline is a Python script that runs in twenty minutes. At 14.8T-token scale it is its own distributed system, with operational constraints that look more like a database than a script. Four things change as the corpus grows.
The offline stage becomes a job scheduler
Past about 100 TB, no single host can store the working set. Stages become Spark / Ray / Beam jobs with explicit input and output shard manifests; intermediate results are written to object storage and consumed by the next stage. Provenance tracking — for each surviving token, which raw URL did it come from? — becomes a hard requirement for contamination detection (Section 8.7). The pipeline stops being a program and starts being a workflow.
Dedup becomes the bottleneck
Exact dedup over 6 PB needs roughly document hashes in a single set. That does not fit in memory on any host. The fix is sorted external merge or a sharded RocksDB / Bloom-filter approach. Near-dedup is worse: MinHash LSH over a billion documents requires a permutation-based signature for each doc and a multi-pass band-collision search. Section 8.2 spends its entire length on this stage because, at frontier scale, dedup typically costs more CPU than every other stage combined.
The online loader becomes I/O-bound, not CPU-bound
At 2048 GPUs the aggregate read is ~12 GB/s. A single NVMe SSD does ~3 GB/s; a single S3 bucket region throttles around 25 GB/s. The loader's architecture splits: hot shards live on per-host NVMe (read locally by the GPU's own workers), cold shards live in object storage and are prefetched ~10 minutes ahead of consumption. Some teams add a CDN-style cache tier in between. The DataLoader code looks unchanged; the storage layer underneath becomes a multi-tier system whose design is the difference between 95% and 99% MFU.
Failure isolation becomes a design goal
At 2048 GPUs, you will lose roughly one GPU per day to ECC errors, cable flaps, or NCCL deadlocks. The loader must not lose its place when its host dies. The fix is deterministic per-shard ordering plus a step-counter checkpoint: any restarted rank can reconstruct exactly which token it was about to read by replaying its sharding function with the same seed. The Python code in the previous section already bakes this in (seed + rank*1009 + wid); the engineering reality is that without it, every node failure costs hours of re-shuffle and risks replaying tokens.
| Scale | Pipeline character | Dominant cost |
|---|---|---|
| 1 GB toy | single Python script | developer time |
| 1 TB academic | single host with multiprocessing | tokenizer CPU |
| 100 TB lab | Spark / Ray cluster | exact dedup memory |
| 1 PB+ frontier | multi-stage DAG + object store + provenance DB | near-dedup + classifier |
| 6 PB DeepSeek-V3 | offline cluster + online tiered loader + checkpoint-aware sharding | throughput SLA |
Engineering Reality: Where Pipelines Actually Break
Three failure modes account for the majority of real-world pipeline incidents at frontier scale. Each of them is invisible in the math and only becomes obvious in the engineering.
- Silent token replay. A bug in the rank-/worker-mod sharding sends the same shard to two workers. The model still trains, the loss still drops, but it drops on a corpus that is effectively 1.x epochs of the same data — equivalent to training on 30–50% less unique text than the spec promised. The only diagnostic is a per-shard usage counter compared against an expectation; the fix is unit tests on the sharding function and reproducibility checks across restarts.
- Quality drift across the run. If the offline pipeline produces shards in the order they were processed (cheap first, expensive later), then shard has systematically different quality than shard . Without a global shuffle, the model sees easy text early and hard text late, which interacts badly with the learning-rate schedule (Chapter 11). The fix is a corpus-wide shuffle done once at the end of the offline pipeline, before shards are written to their final paths.
- Loader stalls disguised as training stalls. When the loader cannot keep up — usually because cold shards have not been prefetched in time — GPUs idle for a few hundred milliseconds per step. NCCL allreduces still complete, the train loop still ticks, but MFU drops from 40% to 30%. The diagnosis is to add per-step loader latency telemetry; the fix is to widen the prefetch window or warm the storage cache. This is the most common reason a model team mistakes a data problem for an optimizer problem.
One sentence to carry forward into the rest of the chapter: the pipeline is a contract between the offline retention math and the online throughput math, and every subsequent design choice — dedup algorithm, quality classifier, mixing weights, curriculum, contamination check — is a renegotiation of one side of that contract. The remaining sections of Chapter 8 walk through each renegotiation in order.