The Real Problem: A Fluent Model With No Manners
After trillions of tokens of pretraining, the base model has read more text than any human ever will. It can complete a sentence in any of forty languages, summarise a Wikipedia paragraph in the style of Hemingway, finish a Python function it has never seen, and recite the first hundred digits of pi. What it cannot do is answer a question.
Drop a base model into a chat box and type “What is the capital of France?”. A typical un-aligned Llama-2-7B base will produce one of three responses: it will continue the question (“What is the capital of France? What is the capital of Germany?...”), it will hallucinate a multiple-choice quiz around it, or it will dutifully complete the most common web-scraped continuation, which on a typical pretraining mix happens to be the start of an IELTS practice exercise. Paris appears nowhere. The model is not broken; it is doing exactly what we trained it to do — predict the next token under the empirical distribution of the web.
The web is not full of helpful answers. It is full of forum posts, listicles, half-finished tutorials, question stems without answers, and arguments. A base model trained on the web knows facts but not format. The mechanism that turns “a library of completion patterns” into “a useful assistant” is supervised fine-tuning, and the lever that turns SFT from a vague hope into a reliable engineering process is the data.
Why is this a problem worth a whole section? Because the instinct most engineers walk in with is wrong. People assume SFT scales like pretraining — more data, longer training, lower loss, better model. The empirical record from every frontier lab over the last three years is the exact opposite:
- LIMA (Meta, 2023): a 65B model fine-tuned on 1,000 carefully hand-curated examples reached the same human-preference win rate as the same base model fine-tuned on roughly 50× as many filtered Common Crawl examples. Quality > quantity by a 50× factor.
- Tülu 2 (Allen AI, 2023): systematic ablations showed that adding low-quality data to a clean SFT mix decreases downstream eval scores by 2–6 points. More data is worse than no data when the floor is low.
- DeepSeek-R1 (2025): the post-RL cold-start SFT step used roughly 600K rejection-sampled examples, but only after the team threw away ~95% of the candidate pool. The keep rate, not the raw volume, is what trained the model.
Three independent teams, three different scales, one conclusion: SFT is a curation problem disguised as a training problem. The mathematics of the next-token loss is genuinely trivial — we will derive it in two equations. Everything difficult about SFT lives upstream of the train loop, in the data pipeline.
Intuition: A Style Demo, Not a Knowledge Injection
Here is the analogy that makes SFT click. Imagine a brilliant but socially-awkward librarian. They have read every book in the building and can recite passages from any of them on request. But when you walk in and say “Could you help me write a polite email declining a wedding invitation?”, they freeze. Not because they don't know what a polite email looks like — they have read thousands of them. Because they don't know that your question is a request for one. They might quote the etiquette section of a 1950s manners book at you. They might describe the historical evolution of the declined-RSVP genre. They might, in their confusion, just complete your sentence and ask “why?”.
SFT is the moment you hand this librarian a stack of three thousand index cards. On the front of each card: a question someone might ask. On the back: a model answer in the voice and shape you want. You do not teach them new facts — they already know everything on the back of every card. You teach them which face goes with which face. After enough cards, the next time someone walks in and says “polite email declining a wedding invitation”, the librarian instinctively flips to the back of the card, not the front.
This is also why the SFT data pipeline looks nothing like a pretraining data pipeline. Pretraining wants volume and breadth: every domain, every register, every language, hundreds of billions of tokens. SFT wants precision and consistency: every example needs to be on the response sub-manifold you want, written in the voice you want, formatted the way you want. A single bad example — a hallucinated fact, a refusal where there shouldn't be one, a sycophantic preamble — tells the model that your sub-manifold includes that point too. Multiply by a few hundred such examples and the sub-manifold gets so blurry that the model defaults back to its base behaviour for any prompt that does not sit exactly on a demonstrated point.
The practical consequence is uncomfortable for anyone who has spent a career scaling data pipelines: SFT data quality is a manual problem, all the way down. Every successful frontier lab has, somewhere in its post-training pipeline, a group of expert annotators rewriting model outputs by hand. Synthetic data from a stronger model helps. Rejection sampling helps. But the floor is set by the worst example in the corpus, and raising that floor requires human eyes on every candidate.
The Math: Cross-Entropy on the Assistant Turn Only
Strip away the curation drama and the SFT loss is the same next-token cross-entropy that drove pretraining. The only difference is which positions contribute to the average. Let $x = (x_1, \dots, x_T)$ be the flat token sequence after applying the chat template: system prompt, then user turn, then assistant turn, optionally more turns. Let $m = (m_1, \dots, m_T)$ be the loss mask: $m_t = 1$ if token $x_t$ belongs to an assistant turn, and $m_t = 0$ otherwise.
The SFT loss for this single example is $\mathcal{L}(\theta) = -\frac{1}{\sum_{t=1}^{T} m_t} \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid x_{<t})$. The right-hand sum is the standard next-token cross-entropy. The factor $m_t$ turns off the loss everywhere the target token is not an assistant token, i.e. we do not penalise the model for being wrong about the user's next word, only about its own. The normaliser $\sum_t m_t$ divides by the count of gradient-bearing positions so that the loss is comparable across examples with different assistant-token densities.
Three quantities derived from this loss matter for the data side of SFT. First, the effective token count $N_{\text{eff}} = \sum_{i \in \mathcal{D}} \sum_t m_t^{(i)}$, the total number of supervised positions across the corpus. This, not the raw example count, is what your one-epoch compute budget should be sized against. A corpus of 100,000 single-turn chat examples with average 150-token assistant responses gives $N_{\text{eff}} = 10^5 \times 150 = 1.5 \times 10^7$: fifteen million gradient-bearing tokens, or about 0.0001% of a typical pretraining run.
Second, the per-category effective contribution $N_c = \sum_{i \in \mathcal{D}_c} \sum_t m_t^{(i)}$, where $\mathcal{D}_c$ is the set of examples in category $c$. Long-reasoning examples contribute proportionally more than single-line refusals. A naive 50/50 example-count split between “math” and “safety” might be a 90/10 split in $N_c$ terms, because a math chain-of-thought is 5–10× longer than a calibrated refusal. Always report mix proportions in effective tokens, not example counts.
Third, the quality-weighted effective count $Q = \sum_{i \in \mathcal{D}} q_i \sum_t m_t^{(i)}$, where $q_i$ is a per-example quality score (from a learned quality classifier, a human rating, or a rejection-sampling rank). The LIMA result, in these terms, says that average $q_i$ dominates eval performance once $N_{\text{eff}}$ exceeds a small threshold; pushing up $N_{\text{eff}}$ at the cost of average $q_i$ is a net loss.
| Quantity | Symbol | Typical SFT value | What it controls |
|---|---|---|---|
| Examples | \|𝒟\| | 1K – 1M | labour cost |
| Effective tokens | N_eff | 0.5M – 500M | compute / one-epoch step count |
| Per-category fraction | N_c / N_eff | any | downstream skill distribution |
| Quality-weighted count | Q | context-dependent | eval-score uplift (LIMA) |
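The three quantities in the table can be computed in a few lines. The sketch below is illustrative only: the dictionary keys `mask`, `category`, and `quality` are hypothetical field names, and Q is taken to be the sum of each example's quality score times its supervised-token count, which is one reasonable reading of "quality-weighted effective count".

```python
from collections import defaultdict

def corpus_stats(examples):
    """Return (N_eff, per-category token fractions, Q) for a list of examples.

    Each example is a dict with hypothetical keys:
      'mask'     - list of 0/1 loss-mask values, one per token
      'category' - a label such as 'math' or 'safety'
      'quality'  - a per-example score q_i in [0, 1]
    """
    n_eff = 0
    per_category = defaultdict(int)
    q_weighted = 0.0
    for ex in examples:
        n_i = sum(ex["mask"])              # supervised tokens in this example
        n_eff += n_i
        per_category[ex["category"]] += n_i
        q_weighted += ex["quality"] * n_i  # assumed form of Q
    fractions = {c: n / n_eff for c, n in per_category.items()} if n_eff else {}
    return n_eff, fractions, q_weighted

# One math example and one safety example: a 50/50 split by example count.
examples = [
    {"mask": [0] * 30 + [1] * 450, "category": "math", "quality": 0.9},
    {"mask": [0] * 30 + [1] * 50, "category": "safety", "quality": 0.8},
]
n_eff, fractions, q = corpus_stats(examples)
print(n_eff, fractions, q)
```

With these toy lengths the 50/50 example split is a 90/10 split in effective tokens, which is the reason for reporting mixes in tokens rather than example counts.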
Manual Numerical Walkthrough
Building the loss mask on a 12-token toy chat and computing one SFT step
Take a tiny chat-template example with one user turn and one assistant turn. Use a stylised tokeniser where each word maps to one token and we have role tags [USR] and [AST] plus an end-of-turn marker [EOT]. The conversation is:
User: “What is two plus three?”
Assistant: “Five.”
Step 1: tokenise + emit the mask.
| t | token | role | m_t |
|---|---|---|---|
| 1 | [USR] | tag | 0 |
| 2 | What | user | 0 |
| 3 | is | user | 0 |
| 4 | two | user | 0 |
| 5 | plus | user | 0 |
| 6 | three | user | 0 |
| 7 | ? | user | 0 |
| 8 | [EOT] | user-EOT | 0 |
| 9 | [AST] | tag | 0 |
| 10 | Five | assistant | 1 |
| 11 | . | assistant | 1 |
| 12 | [EOT] | assistant-EOT | 1 |
Notice exactly which positions get $m_t = 1$: the assistant content tokens (10, 11) and the assistant-EOT (12), but not the [AST] role tag (9). The model should learn to produce the answer and the stop signal, but it should never be supervised on the role tag, because the chat template inserts that automatically.
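The masking rule from the table can be written as a small state machine. This is a minimal sketch for the stylised one-word-per-token tokeniser used in this walkthrough; the function name `build_loss_mask` and the tag strings are illustrative, and a real chat template would work on token IDs rather than strings.

```python
def build_loss_mask(tokens, user_tag="[USR]", asst_tag="[AST]", eot="[EOT]"):
    """Return a 0/1 mask: 1 for assistant content and the assistant's [EOT]."""
    mask = []
    in_assistant = False
    for tok in tokens:
        if tok == asst_tag:
            in_assistant = True
            mask.append(0)                          # role tag is never supervised
        elif tok == user_tag:
            in_assistant = False
            mask.append(0)
        elif tok == eot:
            mask.append(1 if in_assistant else 0)   # supervise only the assistant's stop signal
            in_assistant = False
        else:
            mask.append(1 if in_assistant else 0)   # content token inherits the current role
    return mask

tokens = ["[USR]", "What", "is", "two", "plus", "three", "?", "[EOT]",
          "[AST]", "Five", ".", "[EOT]"]
mask = build_loss_mask(tokens)
print(mask)
```

The same function handles multi-turn conversations, because every role tag resets the state.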
Step 2: count the gradient-bearing positions. $\sum_t m_t = 3$. Out of 12 tokens, only 3 contribute to the loss, a 25% supervision density. This is squarely in the “healthy” band; a multi-turn reasoning example with 600-token chains-of-thought and 30-token user turns would hit 95% supervision density, and a single-line refusal example would hit 10%.
Step 3: compute the per-token NLL. Pretend the model assigns the following probabilities to the true next-token at each supervised position (after a single SFT step from the base model):
| predict at t | target token | p(target) | −log p |
|---|---|---|---|
| 9 → 10 | Five | 0.04 | 3.22 |
| 10 → 11 | . | 0.55 | 0.60 |
| 11 → 12 | [EOT] | 0.28 | 1.27 |
Step 4: average over supervised tokens. $\mathcal{L} = (3.22 + 0.60 + 1.27) / 3 = 5.09 / 3 \approx 1.70$. That is the SFT loss for this example.
Step 5: contrast with the buggy version. If you forgot the mask and averaged over all 12 tokens with a base-model-like probability of 0.1 for each non-assistant token (typical for “the next word in a random web-scraped sequence”), you would get an additional 9 contributions of ~2.3 each and the final “loss” would be roughly $(5.09 + 9 \times 2.3) / 12 \approx 2.15$: higher in absolute terms but dominated by noise from the unsupervised positions, with the gradient signal pointing in the wrong direction. The model would “learn” to predict the user's next word better, which is precisely the wrong behaviour.
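The arithmetic in Steps 3 to 5 can be checked directly. The sketch below uses the probabilities from the walkthrough tables plus the 0.1 assumed for every non-assistant token; `masked_nll` is an illustrative helper name, and natural logarithms are used throughout.

```python
import math

def masked_nll(probs, mask):
    """Mean negative log-likelihood over the positions where mask == 1."""
    supervised = [-math.log(p) for p, m in zip(probs, mask) if m == 1]
    return sum(supervised) / len(supervised)

# p(target) for each of the 12 tokens: 0.1 for the nine non-assistant
# tokens (the walkthrough's assumption), then the three table values.
probs = [0.1] * 9 + [0.04, 0.55, 0.28]
mask = [0] * 9 + [1, 1, 1]

masked = masked_nll(probs, mask)          # correct: assistant positions only
unmasked = masked_nll(probs, [1] * 12)    # buggy: every position counts
print(f"masked={masked:.2f} unmasked={unmasked:.2f}")
# prints: masked=1.70 unmasked=2.15
```

In the unmasked version, nine of the twelve terms come from tokens the model will never be asked to generate.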
Step 6: the corpus view. Multiply this by ten thousand examples. Even then, a healthy SFT corpus carries only a small gradient-token budget: 4 minutes of compute on an 8×H100 node at BF16. Compute-cheap. Curation-expensive.
Interactive: Designing the Mix
The math says it; the widget shows it. Slide the budget, slide the curation quality, slide the per-category mix. Three things change at once:
- The stacked share bar shows what proportion of your example budget goes to each category. The numbers on each segment are the absolute example counts.
- The response-length histogram stacks the assistant-turn-length distributions of each category. Math and code skew right; refusals huddle on the left. This is where most of your $N_{\text{eff}}$ actually lives.
- The metrics panel shows the gradient token count, a diversity score, and an estimated eval uplift. The uplift curve is the empirical S-shape from LIMA / Tülu / R1 — quality dominates past a few thousand examples.
Three exercises to try with the widget. (a) Set the budget to 1K and crank quality to 100%. Note the eval uplift. Now set the budget to 250K and drop quality to 30%. The gradient-token count is 250× bigger, but the predicted uplift is roughly the same — the curated 1K beats the sloppy 250K. This is the LIMA result rendered in pixels. (b) Push the math+code share to 80% and the chat share to 5%. The histogram leans hard to the right. Models trained on this mix are great at reasoning and terrible at small talk; this is exactly the failure mode that pushed OpenAI to publish the InstructGPT mix proportions in 2022. (c) Crank safety+refusals to 50%. The histogram piles up on the left (refusals are short) and the diversity score plummets. The resulting model refuses everything — the “over-refusal tax” that haunted Claude 2 and Llama-2-Chat at release.
Plain Python: Mask, Dedupe, Decontaminate
Before any GPU sees an example, the SFT pipeline does three things in CPU-land: it builds the assistant-only loss mask, it removes near-duplicate prompts, and it filters out any example that overlaps an eval benchmark. The code below implements all three from scratch, with no PyTorch and no framework magic — every line is the literal algorithm the frontier labs publish in their data-prep appendices.
The whole file is ~100 lines and runs on a laptop. For a million-example SFT corpus the dedup step is the only non-trivial cost: O(N · num_perms) hashing plus O(N²) pair comparisons. Production pipelines (datatrove, dolma, nemotron) replace the inner loop with banded LSH — split the signature into bands, hash each band into a bucket, and only compare examples that collide in at least one bucket. That turns the quadratic step into a near-linear pass at the cost of a bit of recall.
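As a rough illustration of the dedup step and its banded-LSH speed-up, here is a stdlib-only sketch. It is not the code of any of the libraries named above: the parameter values (`NUM_PERMS = 64`, 16 bands of 4 rows, 3-word shingles, a 0.8 threshold) are assumptions chosen for readability, and seeded BLAKE2b hashes stand in for true random permutations.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERMS = 64                 # signature length (assumed value)
BANDS = 16                     # 16 bands x 4 rows (assumed split)
ROWS = NUM_PERMS // BANDS

def shingles(text, n=3):
    """Set of word n-grams; very short texts fall back to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def _hash(seed, shingle):
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash(text):
    """One minimum per seeded hash; the share of equal slots estimates Jaccard."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_PERMS)]

def candidate_pairs(signatures):
    """Banded LSH: only items that share at least one band bucket get compared."""
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))   # (i, j) with i < j
    return pairs

def dedupe(prompts, threshold=0.8):
    """Keep the first prompt of each near-duplicate group."""
    sigs = [minhash(p) for p in prompts]
    dropped = set()
    for i, j in sorted(candidate_pairs(sigs)):
        if i in dropped or j in dropped:
            continue
        est = sum(a == b for a, b in zip(sigs[i], sigs[j])) / NUM_PERMS
        if est >= threshold:
            dropped.add(j)
    return [p for k, p in enumerate(prompts) if k not in dropped]

prompts = [
    "what is the capital of france",
    "what is the capital of france",
    "write a haiku about autumn rain",
]
unique = dedupe(prompts)
print(unique)
```

More bands with fewer rows each raise recall and produce more candidate pairs; fewer, wider bands do the opposite. That trade-off is the "bit of recall" the banded approach gives up.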
PyTorch: A Production-Style SFT Dataset
Once the corpus is filtered, deduped, and decontaminated, the PyTorch side is the easy half. We need three things: an IterableDataset that streams JSONL lines and emits packed windows, a masked cross-entropy loss that averages only over assistant tokens, and a tokeniser that knows how to apply the model's chat template. The 90 lines below cover all three.
Two engineering details that this code gets right and most first-attempts get wrong. Packing: concatenating multiple short examples into one fixed-length window. Without packing, every example becomes its own padded sequence, and the median chat example wastes 70–90% of every batch on pad tokens. With packing, GPU utilisation on an 8×H100 SFT run jumps from ~40% to ~85% on the same corpus. Tail emission: yielding the final partial window rather than discarding it. Discarding biases the corpus toward longer examples and is the second-most common “mysterious loss drift after epoch 1” bug.
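Packing with tail emission can be sketched in plain Python, independent of any framework. The generator below is an illustration under simplifying assumptions: `pack` is a hypothetical name, examples are `(token_ids, mask)` pairs, and an example that does not fit in the current window is split across windows. A production version would usually also reset position IDs or block attention across example boundaries, which this sketch omits.

```python
def pack(examples, window):
    """Yield (input_ids, mask) windows of at most `window` tokens.

    `examples` is an iterable of (token_ids, mask) pairs. Examples are
    concatenated back to back, and the loss mask travels with the tokens.
    """
    buf_ids, buf_mask = [], []
    for ids, mask in examples:
        buf_ids.extend(ids)
        buf_mask.extend(mask)
        while len(buf_ids) >= window:
            yield buf_ids[:window], buf_mask[:window]      # emit one full window
            buf_ids, buf_mask = buf_ids[window:], buf_mask[window:]
    if buf_ids:
        yield buf_ids, buf_mask    # tail emission: keep the final partial window

examples = [
    ([1, 2, 3], [0, 1, 1]),
    ([4, 5], [0, 1]),
    ([6, 7, 8], [0, 0, 1]),
]
windows = list(pack(examples, window=5))
print(windows)
```

The final window is shorter than the others, so it needs padding at batch time, but no supervised token is lost.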
At Massive Scale: How the Frontier Labs Actually Do It
The from-scratch pipeline above is the minimum viable SFT prep. At frontier scale — Llama, Claude, GPT-4, DeepSeek-V3, Gemini, Qwen-3 — the same skeleton holds, but several production-grade layers are added on top. We will walk through each one with concrete numbers from public papers and post-mortems.
The four data sources, ranked by quality bar
- Hand-written by domain experts. 1K–10K examples per skill. The most expensive ($5–25 per example) and highest-quality source. Used for safety-critical skills (refusal calibration, instruction following, system-prompt adherence) and for “seed” examples that anchor a synthetic pipeline. LIMA's original 1K was this. Anthropic's “constitution authors” produce this. OpenAI's contractor fleet produces this at scale.
- Distilled from a stronger model. A few thousand to a few million examples generated by a frontier model (often the previous generation of the same family) and filtered with a smaller reward or quality model. Tülu 3 uses this for ~40% of its mix. DeepSeek-R1's cold-start SFT is exclusively this, drawn from R1-Zero outputs. Per-example cost is dollars of compute, not labour — a 100× cost reduction over hand-writing.
- Rejection-sampled from the model itself. Generate $N$ candidate answers per prompt, score with a reward model or rule-based checker, keep only the top-1. DeepSeek-R1 used this for reasoning, keeping ~5% of candidates. Llama-3 used it on instruction-following prompts. The keep rate, not the candidate count, is the headline quality knob.
- Open instruction datasets. ShareGPT, UltraChat, OpenOrca, NoRobots, WildChat. Cheap (free) and large (1M+ examples) but quality is the lowest of the four tiers and the contamination risk is the highest. Frontier labs use these as a base layer or drop them entirely; small-team SFT runs lean on them heavily.
The canonical mix proportion (effective tokens, not examples)
| Category | Llama-3 SFT | Tülu 3 SFT | Phi-3 SFT | DeepSeek-R1 SFT |
|---|---|---|---|---|
| General chat / IF | ~50% | ~30% | ~25% | ~20% |
| Math / reasoning | ~15% | ~25% | ~30% | ~45% |
| Code | ~14% | ~20% | ~25% | ~20% |
| Multilingual | ~8% | ~5% | ~5% | ~5% |
| Tool use | ~5% | ~10% | ~5% | ~5% |
| Safety / refusal | ~8% | ~10% | ~10% | ~5% |
Two things to notice. First, the mix shifts dramatically with the model's intended workload: R1 is a reasoning model and its mix is reasoning-heavy; Llama-3 is a general-purpose assistant and its mix is chat-heavy. Second, the safety fraction is in single-digit percent across the board. The instinct “just add more safety data” is wrong — past ~10% safety share, the model starts refusing benign prompts and the over-refusal tax kicks in. Anthropic's Claude 3.5 sat closer to ; OpenAI's GPT-4 turbo post-2024 sits closer to .
The rejection-sampling pipeline (the modern data factory)
The most consequential shift in SFT data collection from 2023 to 2025 has been the rise of self-distillation via rejection sampling. The pipeline:
- Take a base model and a set of high-quality prompts (the prompt pool can be hand-written, scraped, or model-generated).
- Generate $N$ candidate completions per prompt with a high sampling temperature.
- Score each candidate with a stack of filters: a reward model (learned), a rule-based checker (regex, execution, equality), an LLM-as-judge pass, sometimes a human spot-check on a sub-sample.
- Keep the top-1 candidate per prompt — or drop the prompt entirely if no candidate clears the bar.
- The surviving (prompt, best-completion) pairs are the SFT corpus.
Why is this so effective? Because the best of many samples from the model is, in expectation, a higher-quality response than any single generation. Rejection sampling is just importance-weighted maximum-likelihood: we approximate the policy “sample, then take the argmax under the reward” with a finite-sample top-1, and SFT on the result. DeepSeek-R1 scaled this to reasoning prompts with many candidates each, keeping the top-1 after a multi-stage filter; the surviving 600K examples are the SFT corpus that gave R1 its non-RL reasoning baseline. The compute cost was the cost of single-shot inference multiplied by the number of candidates per prompt; the labour cost was zero.
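The pipeline steps above reduce to a short loop. The sketch below is a toy illustration rather than any lab's implementation: `generate` and `score` are hypothetical callables standing in for a sampled model and a filter stack, and the default `n_candidates=8` and `min_score=1.0` are arbitrary.

```python
import itertools

def rejection_sample(prompts, generate, score, n_candidates=8, min_score=1.0):
    """Return (prompt, best_completion) pairs; drop prompts with no passing candidate."""
    corpus = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        scored = [(score(prompt, c), c) for c in candidates]
        best_score, best = max(scored, key=lambda sc: sc[0])   # top-1 under the scorer
        if best_score >= min_score:
            corpus.append((prompt, best))
    return corpus

# Toy stand-ins: a "model" that cycles through fixed answers, and a
# rule-based checker that compares against a known gold answer.
answers = {"2+3": itertools.cycle(["4", "5", "6"]),
           "1+1": itertools.cycle(["3", "0"])}
gold = {"2+3": "5", "1+1": "2"}

def toy_generate(prompt):
    return next(answers[prompt])

def toy_score(prompt, completion):
    return 1.0 if completion == gold[prompt] else 0.0

corpus = rejection_sample(["2+3", "1+1"], toy_generate, toy_score, n_candidates=3)
print(corpus)
```

The second prompt never produces a correct candidate, so it is dropped entirely. The keep rate is the fraction of prompts, or of candidates, that survive this filter.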
Decontamination at scale
Every published frontier model since GPT-4 reports decontamination against a public eval suite. The standard protocol — 13-gram word overlap, per the LLaMA-2 paper — is now industry-canonical. Llama-3's data team reported that their SFT corpus, before filtering, had a 13-gram match against of the GSM8K test set, of MMLU, of HumanEval, and nontrivially against every benchmark they tested. After filtering, those numbers go to zero. The cost is dropping ~5–8% of the candidate corpus.
The frontier extension is cross-version decontamination: when you train v2 of your model, you decontaminate against not just the v2 eval set but also v1's eval set, so that version comparisons are not biased by “v2 saw the v1 test set during SFT”. This is what blew up several open-model leaderboard scores in late 2024 — teams trained on synthetic data from a model that had been evaluated on the same benchmark, and the leakage flowed through.
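A word-level 13-gram filter of the kind described above fits in a few lines. This sketch assumes whitespace tokenisation and lowercasing only; the function names and the `eval_sets` dictionary are illustrative. Passing both the v1 and v2 suites into the index gives the cross-version variant.

```python
def ngrams(text, n=13):
    """Set of word n-grams; texts shorter than n words yield an empty set."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_eval_index(eval_sets, n=13):
    """Union of n-grams over every benchmark, e.g. both the v1 and v2 suites."""
    index = set()
    for texts in eval_sets.values():
        for text in texts:
            index |= ngrams(text, n)
    return index

def decontaminate(examples, eval_index, n=13):
    """Drop any example that shares at least one word n-gram with the index."""
    return [ex for ex in examples if ngrams(ex, n).isdisjoint(eval_index)]

eval_q = ("a train leaves the station at nine and travels north "
          "at sixty miles per hour")
eval_sets = {"v1_suite": [eval_q], "v2_suite": []}   # hypothetical suite names
index = build_eval_index(eval_sets)

examples = [
    "Q: " + eval_q + " How far does it go?",                 # leaked benchmark item
    "Write a polite email declining a wedding invitation.",  # clean
]
clean = decontaminate(examples, index)
print(clean)
```

Exact n-gram matching misses paraphrased leaks, and benchmark items shorter than 13 words never enter the index, so this is a floor on contamination checking rather than a guarantee.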
Engineering Reality: The Tax Items Nobody Talks About
The textbook version of SFT data collection ends at the rejection-sampling pipeline. The honest version has another half-dozen production concerns that every team ships and almost no paper publishes. They are not glamorous, but every one of them has at some point cost a team a week of training or a quarter of eval scores.
1. Length filtering, both ends
Drop assistant turns shorter than $L_{\min}$ tokens: these are almost always either refusals (collected separately) or low-effort responses that teach the model to be terse. Drop assistant turns longer than $L_{\max}$, typically the 99th percentile of your corpus, but capped at the model's context window minus the prompt budget. Without this, a single 32K-token reasoning example dominates an entire packed window and the gradient signal from short-form chat data drowns.
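A two-sided length filter might look like the sketch below. The names `length_filter` and `percentile_cap`, the `mask` field, the nearest-rank percentile, and the toy numbers (a floor of 5 tokens, an 8192-token window with a 1024-token prompt budget) are all assumptions for illustration.

```python
import math

def percentile_cap(examples, pct=0.99, hard_cap=None):
    """Upper length bound: nearest-rank percentile of assistant-token counts,
    optionally capped (e.g. context window minus prompt budget).
    Assumes a non-empty corpus."""
    lengths = sorted(sum(ex["mask"]) for ex in examples)
    rank = max(1, math.ceil(pct * len(lengths)))
    cap = lengths[rank - 1]
    return min(cap, hard_cap) if hard_cap is not None else cap

def length_filter(examples, min_tokens, max_tokens):
    """Keep examples whose assistant-token count lies in [min_tokens, max_tokens]."""
    return [ex for ex in examples if min_tokens <= sum(ex["mask"]) <= max_tokens]

examples = [
    {"mask": [1] * 1},                 # one-token reply: too terse
    {"mask": [0] * 10 + [1] * 50},
    {"mask": [1] * 200},
    {"mask": [1] * 40000},             # runaway reasoning trace
]
cap = percentile_cap(examples, pct=0.99, hard_cap=8192 - 1024)
kept = length_filter(examples, min_tokens=5, max_tokens=cap)
print(cap, [sum(ex["mask"]) for ex in kept])
```

On a corpus this small the 99th percentile is simply the longest example, so the hard cap does the work. On a real corpus the percentile usually binds first.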
2. PII scrubbing
Run a named-entity recogniser pass for emails, phone numbers, social-security-like patterns, IP addresses, and credit-card-like patterns. Replace them with placeholders ([EMAIL], [PHONE], etc.). This is not optional: PII memorisation is a publishable bug, a regulatory risk (GDPR, CCPA), and a brand risk. The Microsoft Phi team published a notable case in 2024 where their pre-PII-scrubbing SFT corpus had memorisable email addresses that surfaced in chat-mode outputs.
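For the pattern-shaped PII types, a regex pass is a common first layer in front of a named-entity recogniser. The sketch below swaps the NER model for plain regular expressions. The patterns are deliberately simple, US-centric, and illustrative: they will miss many real formats and can over-match ordinary numbers.

```python
import re

# Order matters: longer, more specific patterns run before shorter ones.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
    ("[CARD]",  re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")),
    ("[SSN]",   re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
    ("[IP]",    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]

def scrub_pii(text):
    """Replace each pattern match with its placeholder, in the order listed."""
    for placeholder, pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

sample = "Contact jane.doe@example.com or 555-123-4567 from 192.168.0.1."
scrubbed = scrub_pii(sample)
print(scrubbed)
# prints: Contact [EMAIL] or [PHONE] from [IP].
```

Names, street addresses, and other free-form identifiers cannot be caught this way, which is where the recogniser pass comes in.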
3. Refusal calibration
The most counter-intuitive piece of the SFT data design. You need to teach the model when to refuse — but every refusal example you add also teaches it the shape of refusal, which leaks into prompts where the answer should be a thoughtful response, not a refusal. Modern pipelines therefore include a small fraction (1–3%) of borderline-but-helpful examples: prompts that look like they should be refused (medical, legal, sensitive-but-public-info) but where the demonstrated response is a calibrated, factual, contextually-appropriate answer. Without these, the over-refusal tax shows up within days of release.
4. Multi-turn balance
Single-turn SFT data is easy to collect; multi-turn data is hard. Most public datasets are single-turn or shallow multi-turn (one follow-up). The result, if you train only on single-turn data, is a model that resets context after every assistant response — it forgets the previous turn because it was never supervised on conditioning the next response on prior assistant turns. Tülu 3 and Llama-3 both report that pushing the multi-turn fraction above of effective tokens was necessary to make conversation feel coherent past turn 3.
5. Language-ID and code-language balance
A subtle but important step. Run a fastText langid model over every example and drop examples where the user turn and the assistant turn disagree on language by more than a small fraction (unless the example is explicitly a translation example). Without this, the corpus is sprinkled with cross-lingual artefacts (English user turn, French assistant turn) that confuse the model. The same trick applies to code: detect the programming language and balance the mix so Python does not eat 80% of the code budget.
6. The eval-feedback loop
Train on the corpus, evaluate against your held-out benchmark suite, identify the worst-performing categories, and route the next data-collection cycle toward those categories. The frontier labs run this loop weekly. The DeepSeek-V3 post-training report describes nine such cycles between the base model checkpoint and the final chat model. The Llama-3 post-training report describes six. SFT is not a one-shot pipeline; it is a closed-loop control system whose input is eval scores and whose output is data-collection priorities.
The takeaway. If you remember nothing else from this section, remember this: the SFT loss is trivial, and the corpus is everything. Mask the assistant turn only. Decontaminate against your eval set. Dedupe aggressively. Mix categories by effective tokens, not example counts. Filter by quality, not quantity. Rejection-sample whenever you can. And every time eval scores regress, look at the data before you look at the model. Two times out of three, it is the data.
With the corpus prepared, the next section turns to the chat template itself — the deterministic Jinja string that converts (role, content) turns into a token sequence and the role tags. It looks like a formatting detail. It is, in fact, the single most overlooked source of train/inference skew in modern post-training.