The Real Problem: A Genius Who Cannot Hold a Conversation
Pretraining is over. The model has eaten fifteen trillion tokens of the internet. It has seen every Wikipedia article in every major language, the full Linux kernel source, the complete arXiv archive, a healthy slice of GitHub, every parliamentary transcript Common Crawl could find, and several lifetimes of Reddit. It knows things. It can finish a Python function in your style after seeing five lines. It can complete a half-written legal brief in passable legal English. It can predict the next byte of a JPEG header.
And it cannot answer a question.
Ask a fresh base model "What is the capital of France?" and you will get back something like:
What is the capital of France? What is the capital of Germany? What is the capital of Spain? What is the capital of…
The model is not broken. It is doing exactly what it was trained to do — predict the next token of the document in front of it. During pretraining it saw a billion lists of country capitals, and your question looks like the first line of one of them. From the model's point of view, the most likely continuation of your sentence is another sentence in the same shape, not an answer.
This is the gap that supervised fine-tuning closes. The model already knows that Paris is the capital of France — that knowledge was baked in months ago at the cost of millions of dollars of GPU time. What it does not know is the shape of a conversation: that a question expects an answer, that an instruction expects compliance, that "Write me a poem about clouds" should produce a poem, not the next line of a creative writing prompt template scraped from a 2018 blog post.
Intuition: Etiquette Class, Not a Second Degree
Imagine a polyglot scholar who has spent a decade reading every book in a vast library. She knows medicine, law, physics, poetry, and the entire history of cinema. But she has spent the whole decade alone in the stacks. She has never been to a dinner party. She does not know that when someone asks "Could you pass the salt?" the correct response is to pass the salt, not to deliver a lecture on sodium chloride crystal structure. She does not know that when someone says "Tell me about Napoleon" they want a short answer, not the opening of a 600-page biography.
She does not need another decade of reading. She needs a few weekends of dinner parties — a small, curated set of question → polite-and-useful-answer pairs from which she can extract the social protocol. After a few hundred such examples she will start to recognize the shape of a request and respond appropriately. The facts she draws on are still the ones she learned in the library; only the delivery has changed.
That is SFT. The pretraining corpus is the library. The fine-tuning corpus is the dinner party: typically 1,000 to 1,000,000 curated conversations, four to six orders of magnitude fewer tokens than pretraining. The mechanism is identical to pretraining (predict the next token, minimize cross-entropy), but only the response half of each example contributes to the loss.
The Math: Cross-Entropy on a Masked Sub-Sequence
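A language model defines a probability distribution over the next token given everything that came before. For a sequence of tokens $x_1, \dots, x_T$, the model parameters $\theta$ give us $p_\theta(x_t \mid x_{<t})$ at every position $t$. The pretraining objective is to maximize the total log-likelihood of every token, or equivalently to minimize the average negative log-likelihood:

$$\mathcal{L}_{\text{pretrain}}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).$$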
Every token in the sequence contributes equally to the loss. The model is rewarded for being good at predicting each next token from the entire corpus — chat or otherwise, code or prose, prompts or completions.
SFT keeps the same loss function and changes exactly one thing: it restricts the sum to a subset of positions. Given a single training example $x_1, \dots, x_T$ split into a prompt segment with positions $P$ and a response segment with positions $R$, the SFT loss is:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\frac{1}{|R|} \sum_{t \in R} \log p_\theta(x_t \mid x_{<t}).$$

The denominator is $|R|$, the number of response tokens. Prompt tokens still appear in the conditioning context $x_{<t}$ (the model attends to them, uses them to build its hidden state, and would be hopelessly lost without them), but they never appear as labels, so predicting them never contributes to the gradient.
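Operationally the trick is implemented with a single sentinel label value, almost universally `-100`. PyTorch's cross-entropy loss skips every position whose target equals its `ignore_index` argument, which defaults to `-100`, and Hugging Face Transformers uses the same value for labels. Other frameworks may need the mask passed explicitly, but the effect is the same: "skip this position". So the practical recipe is just: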
- Tokenize the full conversation (prompt + response) into one sequence of ids.
- Clone the ids to make a `labels` tensor.
- Overwrite every prompt position in the labels with `-100`.
- Call cross-entropy. It silently drops the masked positions from both the sum and the count.
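The four steps above can be sketched in plain Python. This is a minimal illustration, not a framework implementation: the toy token ids, the 4-token vocabulary, the uniform logits, and the helper names `build_labels` and `cross_entropy` are all assumptions made for the example, and the logits are assumed to be already aligned with the labels.

```python
import math

IGNORE_INDEX = -100  # sentinel label meaning "skip this position"

def build_labels(input_ids, prompt_len):
    """Steps 2 and 3: clone the ids, then mask every prompt position."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Step 4: mean negative log-likelihood over the non-ignored positions.

    logits[t] holds unnormalized scores over the vocabulary and is assumed
    to be already aligned with labels[t].
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # dropped from both the sum and the count
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[label]  # equals -log softmax(row)[label]
        count += 1
    return total / count if count else float("nan")

# Step 1 stand-in: a toy "tokenized" conversation over a 4-token vocabulary.
input_ids = [3, 0, 2, 1]                 # the first two ids are the prompt
labels = build_labels(input_ids, prompt_len=2)
logits = [[0.0] * 4 for _ in input_ids]  # uniform toy predictions
loss = cross_entropy(logits, labels)
print(labels, round(loss, 4))            # [-100, -100, 2, 1] 1.3863
```

With uniform scores every unmasked position costs ln 4, so the mean over the two response positions is also ln 4, regardless of what the prompt rows contain.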
Every symbol explained: $x_t$ is the token at position $t$; $x_{<t}$ is the sequence of all prior tokens; $\theta$ is every learnable weight in the model (embedding table, attention projections, MLP weights, output head); $p_\theta(x_t \mid x_{<t})$ is the softmax-normalized next-token distribution the model produces; the negative sign turns log-probability into "loss to minimize"; the division by $|R|$, the number of response tokens, turns a sum into a mean so the loss scale does not depend on response length.
Manual Numerical Walkthrough
Let us compute the SFT loss for one small example by hand, so the mask stops being magic and starts being arithmetic.
Cross-entropy over a 7-token example, every number visible
Step 1 — the conversation. User says "2+2?", assistant says "4.". After tokenization (numbers are illustrative) we get:
| position t | token | role | input_id |
|---|---|---|---|
| 0 | User: | prompt | 101 |
| 1 | 2 | prompt | 234 |
| 2 | + | prompt | 567 |
| 3 | 2 | prompt | 8 |
| 4 | ? | prompt | 9 |
| 5 | 4 | response | 42 |
| 6 | . | response | 99 |
Sequence length $T = 7$, prompt length $|P| = 5$, response length $|R| = 2$.
Step 2 — labels with the mask. The labels tensor is the input ids with every prompt position overwritten:
| position t | input_id | label | in loss? |
|---|---|---|---|
| 0 | 101 | -100 | skip |
| 1 | 234 | -100 | skip |
| 2 | 567 | -100 | skip |
| 3 | 8 | -100 | skip |
| 4 | 9 | -100 | skip |
| 5 | 42 | 42 | yes |
| 6 | 99 | 99 | yes |
Step 3 — the model's predictions. At every position the model outputs a distribution over the vocabulary. Suppose at position 5 it gives the correct token (id 42) probability $0.018$, and at position 6 it gives the correct token (id 99) probability $0.41$. The negative log-likelihoods are:
| position | p(correct token) | NLL = -log p |
|---|---|---|
| 5 | 0.018 | 4.017 |
| 6 | 0.41 | 0.892 |
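Step 4 — combine. The cross-entropy loss, averaged over the $|R| = 2$ unmasked positions, is:

$$\mathcal{L}_{\text{SFT}} = \frac{4.017 + 0.892}{2} = \frac{4.909}{2} \approx 2.45.$$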
Step 5 — what would have happened without the mask. If we forgot the mask, all 7 positions would contribute. Suppose the prompt positions had NLLs $\ell_0, \dots, \ell_4$, all fairly large because the base model is quite uncertain about user phrasing. Then the unmasked loss would be $(\ell_0 + \dots + \ell_4 + 4.017 + 0.892)/7$, a different number on a different scale. The mask changes not only which gradients flow but also the scale of the loss, so loss values from runs trained with and without the mask are not directly comparable.
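To see the scale effect in numbers, here is a small sketch. The two response NLLs are the ones from Step 3; the five prompt NLLs are invented purely for illustration and do not come from any model.

```python
# Per-position NLLs for the 7-token example.
prompt_nll = [6.0, 5.0, 7.0, 4.0, 3.0]   # hypothetical values, NOT from the text
response_nll = [4.017, 0.892]            # from the table in Step 3

# Masked (SFT) loss: mean over the response positions only.
masked = sum(response_nll) / len(response_nll)

# Unmasked loss: mean over every position, prompt included.
all_nll = prompt_nll + response_nll
unmasked = sum(all_nll) / len(all_nll)

print(f"masked   ({len(response_nll)} tokens): {masked:.2f}")
print(f"unmasked ({len(all_nll)} tokens): {unmasked:.2f}")
```

With these made-up prompt values the unmasked loss is dominated by the five prompt positions, which is exactly why the two numbers should not be compared.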
Step 6 — gradient flow. Because $\mathcal{L}_{\text{SFT}}$ depends only on positions 5 and 6, every parameter update during SFT is driven by the response tokens alone, which are only part of the tokens in the batch. The mask does not make a training step cheaper: the forward and backward passes still run over the full sequence, so the per-example cost is the same as in pretraining. SFT is cheap because the dataset is small, which is why a 7B model can be fine-tuned in hours rather than months.
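The gradient claim can be checked at the level of the output logits. For softmax cross-entropy, the gradient of the masked mean loss with respect to the logits at a labeled position is the predicted distribution minus the one-hot target, divided by the number of labeled positions; at a masked position it is exactly zero. The sketch below assumes toy logits over a 3-token vocabulary, and the function name is illustrative. It says nothing about the inside of the network, where prompt tokens still influence the update through attention from response positions.

```python
import math

IGNORE_INDEX = -100

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def loss_grad_wrt_logits(logits, labels, ignore_index=IGNORE_INDEX):
    """Gradient of the masked mean cross-entropy with respect to each logit."""
    n = sum(1 for label in labels if label != ignore_index)
    grads = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            grads.append([0.0] * len(row))  # masked position: no loss term, zero gradient
            continue
        probs = softmax(row)
        grads.append([(p - (1.0 if i == label else 0.0)) / n
                      for i, p in enumerate(probs)])
    return grads

# Toy logits (assumed): two masked prompt positions, one labeled response position.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 1]
grads = loss_grad_wrt_logits(logits, labels)
for t, g in enumerate(grads):
    print(t, [round(v, 3) for v in g])
```

The first two rows print as all zeros; only the last row, the single labeled position, carries a nonzero gradient.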
Visualizing What the Loss Mask Actually Does
The mask is the entire mechanism, so it is worth seeing it in action. The widget below shows three small conversations, each tokenized into prompt tokens (blue), assistant tokens (green), and chat-template control tokens (grey). Switch between the three training modes to see how the loss-bearing tokens — and the total loss — change.
Three things to notice as you play with it.
- In SFT mode, every prompt token's label becomes `-100` and its per-token loss disappears from the bar chart. The model can still attend to those tokens during the forward pass; it just receives no gradient signal for predicting them.
- The refusal example has an almost flat loss profile in SFT mode. Every response token is high-frequency English ("I", "can't", "help", "with", "that") that the base model already predicts with high probability. The SFT signal is not about teaching the model these words; it is about teaching the model that this particular sequence is the right one to emit in response to that particular prompt shape.
- The prompt-only mode is included on purpose so you can see what a broken mask looks like. A model fine-tuned this way would get better at predicting the kind of questions users ask — and learn nothing about how to answer them.
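The difference between the modes comes down to which positions keep their label. Below is a hedged sketch using the token ids from the walkthrough; the mode names `"pretrain"`, `"sft"`, and `"prompt_only"` and the helper `make_labels` are assumptions for illustration, and chat-template control tokens are left out because their treatment is not specified here.

```python
IGNORE_INDEX = -100

def make_labels(input_ids, roles, mode):
    """Return labels for one training mode.

    roles[t] is "prompt" or "response". The mode decides which roles keep
    their label; every other position is replaced by IGNORE_INDEX.
    """
    keep = {
        "pretrain": {"prompt", "response"},  # every token carries loss
        "sft": {"response"},                 # the correct SFT mask
        "prompt_only": {"prompt"},           # the broken, inverted mask
    }[mode]
    return [tok if role in keep else IGNORE_INDEX
            for tok, role in zip(input_ids, roles)]

input_ids = [101, 234, 567, 8, 9, 42, 99]
roles = ["prompt"] * 5 + ["response"] * 2
for mode in ("pretrain", "sft", "prompt_only"):
    print(f"{mode:12s}", make_labels(input_ids, roles, mode))
```

Printing the three label vectors side by side makes the broken mask easy to spot: in prompt-only mode the two response ids are the ones that vanish.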
Plain Python: Loss Masking from Scratch
Before we look at the production version, let us write the loss computation in plain Python with no frameworks. The point is to convince yourself that there is no PyTorch magic involved — the SFT loss is a mean of a filtered list.
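A minimal sketch of that computation, reusing the numbers from the walkthrough. The probabilities at the five prompt positions are placeholders chosen for this example; they are masked out and never affect the result.

```python
import math

IGNORE_INDEX = -100

# The 7-token walkthrough example: labels, and the probability the model
# assigned to the correct token at each position. The prompt probabilities
# (0.5) are placeholders and are filtered out below.
labels = [-100, -100, -100, -100, -100, 42, 99]
p_correct = [0.5, 0.5, 0.5, 0.5, 0.5, 0.018, 0.41]

# Keep only the positions whose label is not the sentinel, then average.
nll = [-math.log(p) for p, y in zip(p_correct, labels) if y != IGNORE_INDEX]
loss = sum(nll) / len(nll)

print(f"{len(nll)} tokens in the loss, loss = {loss:.2f}")
```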
Run on the 7-token walkthrough example, this prints 2 tokens in the loss and a value of about 2.45. There are no tensors here, no autograd graph, no shifted labels, just one list comprehension. Every framework you will encounter (PyTorch, TensorFlow, JAX, MLX, the HuggingFace SFTTrainer, axolotl, torchtune, LLaMA-Factory) is doing exactly this under the hood, with more layers of wrapping.
PyTorch: The Real SFT Loss in Six Lines
Now the real thing. The code below runs against a real model (Qwen2.5-0.5B is small enough to load on a laptop) and computes the SFT loss for one conversation end-to-end. The load-and-tokenize lines are scaffolding; the SFT mechanism itself is exactly two lines: `labels[:, :prompt_len] = -100` followed by a standard cross-entropy call.
The first time most people read this code they expect a fancy SFT-specific loss function, some kind of dedicated SFT loss class. There is none. Supervised fine-tuning shares its loss function, its tokenizer, its model architecture, its optimizer choice, and its mixed-precision setup with pretraining. The only thing that differs is which positions carry a label other than `-100`.
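One detail a plain-Python version can skip but a real causal language model cannot is the label shift: the logits produced at position $t$ score the token at position $t + 1$, so the logits and labels must be offset by one before the masked mean is taken. Hugging Face causal-LM models typically apply this shift internally when `labels` is passed. The sketch below shows the idea in plain Python under that assumption; the function name, the toy ids, and the 3-token vocabulary are illustrative.

```python
import math

IGNORE_INDEX = -100

def shifted_masked_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Masked mean cross-entropy with the causal-LM shift.

    logits[t] scores the token at position t + 1, so logits[:-1] is paired
    with labels[1:]. The last row of logits has no target and is unused.
    """
    total, count = 0.0, 0
    for row, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # masked target: contributes to neither sum nor count
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else float("nan")

input_ids = [0, 1, 2, 1]                       # toy ids; the first two are the prompt
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 1]    # prompt positions masked
logits = [[0.0, 0.0, 0.0] for _ in input_ids]  # uniform toy scores
loss = shifted_masked_loss(logits, labels)
print(round(loss, 4))                          # ln(3) for uniform scores: 1.0986
```

Note that the logits at the last prompt position are the ones that predict the first response token, which is why the prompt must stay in the input even though it is absent from the labels.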
What Changes at 70B Parameters and 10K Examples
Nothing about the math changes at scale. The same six-line PyTorch recipe is what runs inside the SFT stages of every modern open-weight release — Llama 3, Qwen 2.5, DeepSeek V3, Mistral Large, Phi-4, Gemma 2. What changes is the constants around the recipe.
| Knob | Pretraining | SFT |
|---|---|---|
| Tokens seen | 10–15 trillion | 10 million – 1 billion |
| Examples | ≈ a snapshot of the web | 1K – 1M curated conversations |
| Epochs | 1 (sometimes < 1) | 1 – 5 (often 3) |
| Sequence length | 8K – 32K | 4K – 16K (capped to fit examples) |
| Learning rate | 1e-4 to 3e-4 peak | 5e-6 to 5e-5 peak (10–100× lower) |
| Optimizer | AdamW | AdamW (identical) |
| Precision | BF16 / FP8 | BF16 (FP8 is rare for SFT) |
| Loss masking | none | labels[:, :prompt_len] = -100 |
| Wall-clock cost | millions of GPU-hours | tens to thousands of GPU-hours |
| Approximate cost (USD) | $10M – $100M+ | $1K – $100K |
The two rows that matter for understanding SFT's character are the learning rate (one to two orders of magnitude lower) and the token budget (four to six orders of magnitude smaller). Together they say: do not move the weights far, do not show the model much. The base model has done the heavy lifting; SFT is a polish, not a rebuild.
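The size of the token-budget gap follows directly from the "Tokens seen" row of the table. A quick check, using only the ranges given there:

```python
import math

# Token ranges copied from the "Tokens seen" row of the table.
pretrain_tokens = (10e12, 15e12)   # 10 to 15 trillion
sft_tokens = (10e6, 1e9)           # 10 million to 1 billion

# Smallest gap: least pretraining data against the largest SFT set.
smallest_gap = pretrain_tokens[0] / sft_tokens[1]
# Largest gap: most pretraining data against the smallest SFT set.
largest_gap = pretrain_tokens[1] / sft_tokens[0]

print(f"gap: 10^{math.log10(smallest_gap):.1f} to 10^{math.log10(largest_gap):.1f}")
```

The result spans roughly four to six orders of magnitude, depending on which ends of the two ranges are compared.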
A few engineering consequences follow from this:
- Memory dominates compute. A 70B SFT job is dominated by the cost of holding optimizer states (AdamW needs two FP32 moments per parameter, about 560 GB at 70B parameters), not by the cost of the forward pass. This is why FSDP / ZeRO-3 sharding shows up in SFT recipes: the BF16 weights fit on one node, but the weights, gradients, and optimizer states together do not.
- Throughput is bottlenecked by short sequences. SFT examples are typically 200–2000 tokens long. Without packing (concatenating multiple short examples into one max-length sequence), GPU utilization on a 7B+ model drops to single-digit percent because the matrix multiplies are too small. Every serious SFT framework implements example packing.
- Communication cost matters much less. With so few gradient steps (maybe 1000–10000 in a full SFT run vs. 1M+ in pretraining), the total time spent in cross-node all-reduce is small in absolute terms, even though the per-step overhead is unchanged. A surprising amount of public SFT is done on single 8×H100 nodes.
- Loss values are not directly comparable across mixes. Because the denominator changes when you change your data mix (more code → more response tokens → different loss scale), comparing "loss = 1.4 on dataset A" vs. "loss = 1.6 on dataset B" tells you almost nothing. Eval benchmarks (MMLU, MT-Bench, AlpacaEval) carry the signal instead.
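Two of these consequences are easy to make concrete. The sketch below shows a simple first-fit packing of example lengths into fixed-size sequences, plus the arithmetic for AdamW's moment memory. The example lengths, the 4096-token limit, and both function names are hypothetical, and truncating over-long examples is a choice made only for this sketch; a real packer also has to keep attention from crossing example boundaries, which is not shown.

```python
def pack_first_fit(lengths, max_len):
    """First-fit-decreasing packing of example lengths into bins of max_len.

    Returns a list of bins, each a list of lengths whose sum is <= max_len.
    Examples longer than max_len are truncated (an assumption of this sketch).
    """
    bins = []
    for n in sorted(lengths, reverse=True):
        n = min(n, max_len)
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)
                break
        else:
            bins.append([n])  # no existing bin had room
    return bins

def adamw_moment_gb(n_params, bytes_per_moment=4):
    """Memory for AdamW's two FP32 moment tensors, in GB (1e9 bytes)."""
    return 2 * bytes_per_moment * n_params / 1e9

lengths = [200, 1800, 600, 1200, 400, 2000, 900, 300]   # hypothetical example lengths
max_len = 4096
bins = pack_first_fit(lengths, max_len)
padded_util = sum(lengths) / (len(lengths) * max_len)   # one example per padded row
packed_util = sum(lengths) / (len(bins) * max_len)
print(bins)
print(f"useful tokens: {padded_util:.0%} unpacked vs {packed_util:.0%} packed")
print(f"AdamW moments for 70B params: {adamw_moment_gb(70e9):.0f} GB")
```

For these made-up lengths, eight padded rows collapse into two packed ones, and the moment tensors alone come to 560 GB, close to the 640 GB of an 8×80 GB node before weights and gradients are counted.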
The Superficial Alignment Hypothesis
In May 2023, Meta published a paper called LIMA: Less Is More for Alignment. They fine-tuned a 65B base LLaMA on 1,000 hand-curated examples, roughly fifty times fewer than the ~50K of the standard ChatGPT-era recipe, and showed that the resulting model was competitive with (sometimes preferred over) models trained on hundreds of thousands of examples. The authors offered the following framing, which has become the standard mental model for the field:
A model's knowledge and capabilities are learned almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.
Call this the superficial alignment hypothesis. It is the strongest defense of the framing we opened the section with: SFT is a format adapter on top of a knowledge-rich base. If the hypothesis is correct, the implications are concrete:
- Data quality dominates data quantity. 1,000 great examples beat 100,000 mediocre ones. You will see this confirmed in every open-weight release writeup since 2023 — Olmo, Tulu, Hermes, Nous, Dolphin, all of them obsess over example quality.
- Knowledge does not arrive during SFT. If the base model does not know something, SFT will not teach it — at best, it will teach the model to confidently fabricate. This is the central source of post-SFT hallucinations: the base model knew when to be uncertain, and the SFT data taught it to suppress that uncertainty in favor of fluent answers.
- Domain transfer is style transfer. A "medical" SFT does not inject medical knowledge — it teaches the model to use the medical-conversation register when answering medical-shaped questions. The actual facts have to already be in the base.
- The same base model can be fine-tuned in many directions cheaply. One pretraining run, many SFT runs. This is what makes the open-weight ecosystem possible — you do not need to redo $50M of pretraining to ship a domain-specific variant.
Subsequent work has put nuance on this. Models can learn new factual associations during SFT, but only inefficiently and only for facts they were close to knowing already. And the boundary between "format" and "capability" is fuzzier than LIMA suggested — for instance, learning to structure step-by-step reasoning during SFT does appear to improve raw reasoning benchmark scores in a way that looks suspiciously like a capability boost. But as a working model the framing is excellent and we will lean on it for the rest of this chapter.
Engineering Reality: What SFT Cannot Fix
The framing also tells us, by negation, what SFT is bad at. The same mechanism that lets you cheaply adapt a base model to a new format makes SFT a poor tool for several jobs that practitioners keep trying to use it for.
| Goal | Will SFT work? | Why |
|---|---|---|
| Teach a model to follow chat instructions | Yes (this is its job) | Pure format adaptation on top of existing capabilities. |
| Adopt a specific brand voice or refusal style | Yes | Style is exactly what the SFT objective propagates. |
| Output strict JSON / function calls | Yes | Format. A few thousand examples is usually enough. |
| Add knowledge of last month's company data | No | SFT does not store facts reliably. Use RAG. |
| Make the model better at math | Partially | Helps with reasoning format; the underlying skill comes from pretraining or RL. |
| Remove a baked-in bias from the base | Weakly | SFT moves outputs, not internal representations. Bias often comes back under distribution shift. |
| Teach a model to reliably refuse harmful requests | Partially | Brittle; defenses break under jailbreaks. RLHF / Constitutional AI is more robust. |
| Make a small model behave like a large one | Up to a point (distillation) | Style transfers; capability ceiling is set by the base model's size. |
The most common failure mode by far is catastrophic forgetting: an aggressive SFT run on a narrow domain can damage capabilities the base model had. Train too long on medical Q&A and the model gets worse at code. Train too long on a single brand voice and the model stops being able to write in any other register. We will spend all of section 13.5 on this — it is the single most important constraint on how you choose SFT data, learning rate, and number of epochs.
The takeaway in one sentence. SFT is masked next-token prediction on prompt-response pairs, using the same loss function as pretraining and a learning rate one to two orders of magnitude lower, on a dataset four to six orders of magnitude smaller; its job is not to teach the model what to know, but to teach it when and how to speak.
In the rest of this chapter we follow that recipe top to bottom. Section 13.2 looks at where good SFT data comes from and what makes it "good". Section 13.3 goes deep on chat templates and the tokenization details that determine where the prompt ends and the response begins. Section 13.4 covers the training configuration — learning rate schedules, packing, batch size, gradient accumulation — at the scale you would actually run an SFT job. And section 13.5 confronts catastrophic forgetting head-on, with the mitigation techniques that real labs use to keep their post-SFT models from getting amnesia.