Chapter 13

What SFT Actually Does

Supervised Fine-Tuning (SFT)

The Real Problem: A Genius Who Cannot Hold a Conversation

Pretraining is over. The model has eaten fifteen trillion tokens of the internet. It has seen every Wikipedia article in every major language, the full Linux kernel source, the complete arXiv archive, a healthy slice of GitHub, every parliamentary transcript Common Crawl could find, and several lifetimes of Reddit. It knows things. It can finish a Python function in your style after seeing five lines. It can complete a half-written legal brief in passable legal English. It can predict the next byte of a JPEG header.

And it cannot answer a question.

Ask a fresh base model "What is the capital of France?" and you will get back something like:

What is the capital of France? What is the capital of Germany? What is the capital of Spain? What is the capital of…

The model is not broken. It is doing exactly what it was trained to do — predict the next token of the document in front of it. During pretraining it saw a billion lists of country capitals, and your question looks like the first line of one of them. From the model's point of view, the most likely continuation of your sentence is another sentence in the same shape, not an answer.

This is the gap that supervised fine-tuning closes. The model already knows that Paris is the capital of France — that knowledge was baked in months ago at the cost of millions of dollars of GPU time. What it does not know is the shape of a conversation: that a question expects an answer, that an instruction expects compliance, that "Write me a poem about clouds" should produce a poem, not the next line of a creative writing prompt template scraped from a 2018 blog post.

The framing for this entire chapter. SFT is not a knowledge transfer operation. It is a format transfer operation. The base model already knows the facts; SFT teaches it the protocol for emitting them on request. Almost every confusing result you will read about SFT — why so few examples are needed, why it can degrade capabilities, why instruction-tuning datasets feel suspiciously similar across labs — becomes obvious once you internalize this framing.

Intuition: Etiquette Class, Not a Second Degree

Imagine a polyglot scholar who has spent a decade reading every book in a vast library. She knows medicine, law, physics, poetry, and the entire history of cinema. But she has spent the whole decade alone in the stacks. She has never been to a dinner party. She does not know that when someone asks "Could you pass the salt?" the correct response is to pass the salt, not to deliver a lecture on sodium chloride crystal structure. She does not know that when someone says "Tell me about Napoleon" they want a short answer, not the opening of a 600-page biography.

She does not need another decade of reading. She needs a few weekends of dinner parties — a small, curated set of question → polite-and-useful-answer pairs from which she can extract the social protocol. After a few hundred such examples she will start to recognize the shape of a request and respond appropriately. The facts she draws on are still the ones she learned in the library; only the delivery has changed.

That is SFT. The pretraining corpus is the library. The fine-tuning corpus is the dinner party: typically 1,000 to 1,000,000 curated conversations, about four to six orders of magnitude fewer tokens than pretraining. The mechanism is identical to pretraining (predict the next token, minimize cross-entropy), but only the response part of each example contributes to the loss.

Mental check. Why does fine-tuning need so few examples relative to pretraining? Because it is not teaching new patterns — only how to route the patterns already there. Routing is cheap; learning is expensive.

The Math: Cross-Entropy on a Masked Sub-Sequence

A language model defines a probability distribution over the next token given everything that came before. For a sequence of tokens $x_1, x_2, \dots, x_T$, the model parameters $\theta$ give us $p_\theta(x_t \mid x_{<t})$ at every position. The pretraining objective is to maximize the log-likelihood of every token, or equivalently to minimize the average negative log-likelihood:

$$\mathcal{L}_{\text{pretrain}}(\theta) = - \frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).$$

Every token in the sequence contributes equally to the loss. The model is rewarded for being good at predicting each next token from the entire corpus — chat or otherwise, code or prose, prompts or completions.

SFT keeps the same loss function and changes exactly one thing: it restricts the sum to a subset of positions. Given a single training example split into a prompt segment $P = (x_1, \dots, x_m)$ and a response segment $R = (x_{m+1}, \dots, x_T)$, the SFT loss is:

$$\mathcal{L}_{\text{SFT}}(\theta) = - \frac{1}{|R|} \sum_{t = m+1}^{T} \log p_\theta(x_t \mid x_{<t}).$$

The denominator is $|R| = T - m$, the number of response tokens. Prompt tokens still appear in the conditioning context (the model attends to them, uses them to build its hidden state, and would be hopelessly lost without them), but they never appear as labels, so no loss term is ever computed for predicting them.

Operationally the trick is implemented with a single sentinel value, -100. PyTorch's cross-entropy loss uses -100 as its default ignore_index, meaning "skip this position", and the Hugging Face training stack follows the same convention. So the practical recipe is just:

  • Tokenize the full conversation (prompt + response) into one sequence of ids.
  • Clone the ids to make a `labels` tensor.
  • Overwrite every prompt position in the labels with -100.
  • Call cross-entropy. It silently drops the masked positions from both the sum and the count.
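The recipe above can be sketched without any framework. This is a minimal illustration, not any library's implementation: `log_softmax` and `masked_cross_entropy` are hypothetical helpers written for this example, the four-token vocabulary and the logits are made up, and raising on a fully masked sequence is a choice made here.

```python
import math

IGNORE_INDEX = -100  # the sentinel value described in the text


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def masked_cross_entropy(logits_per_position, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for logits, label in zip(logits_per_position, labels):
        if label == ignore_index:
            continue  # masked position: excluded from both the sum and the count
        total += -log_softmax(logits)[label]
        count += 1
    if count == 0:
        raise ValueError("every position is masked; the loss is undefined")
    return total / count


# Toy setup (assumption): vocabulary of 4 tokens, 2 prompt + 2 response positions.
input_ids = [2, 0, 3, 1]
prompt_len = 2
labels = list(input_ids)                              # clone the ids
labels[:prompt_len] = [IGNORE_INDEX] * prompt_len     # mask the prompt

# Made-up model scores: one row of 4 logits per position.
logits = [
    [0.1, 0.2, 2.0, 0.3],
    [1.5, 0.1, 0.1, 0.1],
    [0.0, 0.0, 0.0, 3.0],
    [0.5, 2.5, 0.5, 0.5],
]

loss = masked_cross_entropy(logits, labels)
print(f"labels      : {labels}")
print(f"masked loss : {loss:.4f}")
```

Changing the logits in the first two rows leaves the printed loss untouched, which is the whole point of the mask: those rows never reach the sum or the count.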

Every symbol explained: $x_t$ is the token at position $t$; $x_{<t}$ is the sequence of all prior tokens; $\theta$ is every learnable weight in the model (embedding table, attention projections, MLP weights, output head); $p_\theta$ is the softmax-normalized next-token distribution the model produces; the negative sign turns log-probability into "loss to minimize"; the division by $|R|$ turns a sum into a mean so the loss scale does not depend on response length.

Manual Numerical Walkthrough

Let us compute the SFT loss for one small example by hand, so the -100 mask stops being magic and starts being arithmetic.

Cross-entropy over a 7-token example, every number visible

Step 1: the conversation. User says "2+2?", assistant says "4.". After tokenization (numbers are illustrative) we get:

| position t | token | role | input_id |
|---|---|---|---|
| 0 | User: | prompt | 101 |
| 1 | (space)2 | prompt | 234 |
| 2 | + | prompt | 567 |
| 3 | 2 | prompt | 8 |
| 4 | ? | prompt | 9 |
| 5 | 4 | response | 42 |
| 6 | . | response | 99 |

Sequence length $T = 7$, prompt length $m = 5$, response length $|R| = 2$.

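Step 2: labels with the mask. The labels tensor is the input ids with every prompt position overwritten:

| position t | input_id | label | in loss? |
|---|---|---|---|
| 0 | 101 | -100 | skip |
| 1 | 234 | -100 | skip |
| 2 | 567 | -100 | skip |
| 3 | 8 | -100 | skip |
| 4 | 9 | -100 | skip |
| 5 | 42 | 42 | yes |
| 6 | 99 | 99 | yes |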

Step 3: the model's predictions. At every position the model outputs a distribution over the vocabulary. Suppose at position 5 it gives the correct token (id 42) probability $p_5 = 0.018$, and at position 6 it gives the correct token probability $p_6 = 0.41$. The negative log-likelihoods are:

| position | p(correct token) | NLL = -log p |
|---|---|---|
| 5 | 0.018 | 4.017 |
| 6 | 0.41 | 0.892 |

Step 4: combine. The cross-entropy loss with `reduction='mean'` and `ignore_index=-100` is:

$$\mathcal{L}_{\text{SFT}} = \frac{4.017 + 0.892}{2} = 2.4545.$$

Step 5: what would have happened without the mask. If we forgot the mask, all 7 positions would contribute. Suppose the prompt positions had NLLs $[6.1, 5.3, 9.7, 4.2, 3.8]$ (the base model is quite uncertain about user phrasing). Then the unmasked loss would be $\frac{6.1+5.3+9.7+4.2+3.8+4.017+0.892}{7} \approx 4.86$. The mask not only changes which gradients flow; it also changes the scale of the loss. Two networks trained with and without the mask are not directly comparable.

Step 6: gradient flow. Because the loss contains terms only for positions 5 and 6, $\partial \mathcal{L}/\partial \theta$ carries signal only about predicting the response tokens. The prompt still matters, since the backward pass flows through the attention over prompt positions, but no term rewards the model for predicting the prompt itself. Note that the mask does not make a training step cheaper: the forward and backward passes still run over the full sequence. SFT is cheap relative to pretraining because the dataset is several orders of magnitude smaller, not because of the mask.
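The hand arithmetic in Steps 3 to 5 can be checked in a few lines. This is only a verification sketch that uses the probabilities and prompt NLLs assumed in the walkthrough; the variable names are chosen for this example.

```python
import math

# Probabilities the model assigned to the correct response tokens (Step 3).
p_response = [0.018, 0.41]
# NLLs assumed for the five prompt positions (Step 5).
nll_prompt = [6.1, 5.3, 9.7, 4.2, 3.8]

nll_response = [-math.log(p) for p in p_response]

# Masked loss: mean over the two response positions only.
masked_loss = sum(nll_response) / len(nll_response)

# Unmasked loss: mean over all seven positions.
all_nll = nll_prompt + nll_response
unmasked_loss = sum(all_nll) / len(all_nll)

print("response NLLs :", [round(n, 3) for n in nll_response])
print("masked loss   :", round(masked_loss, 4))
print("unmasked loss :", round(unmasked_loss, 2))
```

The rounded values agree with the tables: 4.017 and 0.892 per token, 2.4545 with the mask, and about 4.86 without it.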

Visualizing What the Loss Mask Actually Does

The mask is the entire mechanism, so it is worth seeing it in action. The widget below shows three small conversations, each tokenized into prompt tokens (blue), assistant tokens (green), and chat-template control tokens (grey). Switch between the three training modes to see how the loss-bearing tokens — and the total loss — change.

[Interactive widget: SFT loss-masking visualizer]

Three things to notice as you play with it.

  • In SFT mode, every prompt token's label becomes -100 and its per-token loss disappears from the bar chart. The model can still attend to those tokens during the forward pass; it just receives no loss signal for predicting them.
  • The refusal example has an almost flat loss profile in SFT mode. Every response token is high-frequency English ("I", "can", "help", "with", "that") that the base model already predicts with high probability. The SFT signal is not about teaching the model these words — it is about teaching the model that this particular sequence is the right one to emit in response to that particular prompt shape.
  • The prompt-only mode is included on purpose so you can see what a broken mask looks like. A model fine-tuned this way would get better at predicting the kind of questions users ask — and learn nothing about how to answer them.
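The three training modes differ only in how the labels are built, which a small sketch can make concrete. The mode names `all_tokens`, `sft`, and `prompt_only` are labels assumed for this example (the widget's own names may differ), and `make_labels` is a hypothetical helper, not part of any library.

```python
IGNORE_INDEX = -100


def make_labels(input_ids, prompt_len, mode):
    """Build the labels list for one example under a given training mode."""
    n = len(input_ids)
    if not 0 <= prompt_len <= n:
        raise ValueError("prompt_len must lie within the sequence")
    if mode == "all_tokens":     # pretraining-style: every position is supervised
        return list(input_ids)
    if mode == "sft":            # response-only: prompt positions are masked
        return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])
    if mode == "prompt_only":    # the broken mask: response positions are masked
        return list(input_ids[:prompt_len]) + [IGNORE_INDEX] * (n - prompt_len)
    raise ValueError(f"unknown mode: {mode}")


def supervised_positions(labels):
    """Indices of positions that would contribute to the loss."""
    return [i for i, y in enumerate(labels) if y != IGNORE_INDEX]


input_ids = [101, 234, 567, 8, 9, 42, 99]   # ids from the walkthrough
for mode in ("all_tokens", "sft", "prompt_only"):
    labels = make_labels(input_ids, prompt_len=5, mode=mode)
    print(f"{mode:12s} supervised positions: {supervised_positions(labels)}")
```

Nothing else about the training step changes between the modes; the model, the loss function, and the input ids are identical.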

Plain Python: Loss Masking from Scratch

Before we look at the production version, let us write the loss computation in plain Python with no frameworks. The point is to convince yourself that there is no PyTorch magic involved — the SFT loss is a mean of a filtered list.

SFT loss in twenty lines of plain Python
sft_loss_plain.py

import math (unused here, kept for clarity)

We do not actually need anything from math for this snippet; the loss is a plain mean of negative log-likelihoods. It is here to remind you that everything we are about to do is just arithmetic.

prompt_ids: the prompt tokens

A toy list of five integer token ids that represent the user's question. In a real pipeline these come out of a BPE or SentencePiece tokenizer. Numbers like 234 are arbitrary; what matters is that the tokenizer maps the same text to the same ids on every machine.

Example: tokenizer("What is 2+2?")['input_ids']

response_ids: the response tokens

Two tokens for the assistant's answer: id 42 might be '4', id 99 might be '.'. The SFT objective is: maximize the probability of this exact two-token sequence given the prompt.

input_ids = prompt + response

The model is autoregressive. Internally it sees one long sequence of seven tokens and predicts the next token at each position. There is no architectural distinction between prompt and response; the distinction lives entirely in the labels we are about to construct.

IGNORE = -100

PyTorch's cross-entropy uses -100 as the default ignore_index. Any label equal to -100 is silently dropped from both the sum and the count. This is the entire SFT trick in one number.

Example: F.cross_entropy(logits, labels, ignore_index=-100)

Labels start as -100 for the prompt half

We create the label tensor by concatenating [-100]*len(prompt) with the response token ids. Position i's label means 'this is the token I want the model to predict at step i'. Prompt positions get -100 because we do not want gradient there: the user wrote those tokens, the model did not need to learn them.

Response labels are the real ids

On response positions, the label is the actual token id the model should have predicted. Cross-entropy will compute -log p(label_i | preceding tokens) at each of those positions.

nll: fake per-token negative log-likelihoods

In a real run, each entry would be -log of the probability the model assigned to the correct next token at that position. High values mean the model was surprised, low values mean it was confident. We use round numbers here so the arithmetic is checkable.

Filter: keep only positions where label != IGNORE

The list comprehension walks the seven positions in parallel and keeps the NLL only when its label is not -100. After this line, contributing has length 2, exactly the number of response tokens.

Example: contributing == [4.1, 0.9]

Mean: the SFT loss

Standard cross-entropy with reduction='mean' divides by the number of contributing positions, not by the full sequence length. So sft_loss = (4.1 + 0.9) / 2 = 2.5. The five prompt tokens are gone; they vanished from both numerator and denominator.

Print: sanity check

You should see '2 / 7' tokens in the loss and a final loss of 2.500. Those two numbers are the entire mechanism: only 2 of 7 positions contribute to the loss, and the gradient signal is concentrated on the response.

```python
import math

# A tiny "tokenized" conversation. In reality this comes from a tokenizer
# like tiktoken or SentencePiece; here we just use stand-in integer ids.
prompt_ids   = [101, 234, 567, 8, 9]      # "User: What is 2+2?"
response_ids = [42, 99]                    # "4."

# Stitch prompt + response into one sequence: the model sees this exact
# string, left to right, predicting each next token.
input_ids = prompt_ids + response_ids       # length 7

# The label at each position is the token the model should predict there.
# We use -100 for the prompt half so it does not contribute to the loss.
# (-100 is the value torch.nn.CrossEntropyLoss skips by default.)
IGNORE = -100
labels = (
    [IGNORE] * len(prompt_ids)               # prompt tokens: ignored
    + response_ids                           # response tokens: supervised
)

# Pretend the model returned these per-token negative log-likelihoods.
# In a real run they come from cross_entropy(logits, labels) per position.
nll = [3.1, 2.4, 9.6, 8.2, 7.4, 4.1, 0.9]

# SFT loss = mean NLL over positions where label != IGNORE.
contributing = [n for n, y in zip(nll, labels) if y != IGNORE]
sft_loss = sum(contributing) / len(contributing)

print(f"tokens in loss : {len(contributing)} / {len(labels)}")
print(f"sum of NLL     : {sum(contributing):.3f}")
print(f"sft loss (mean): {sft_loss:.3f}")
```

Running this prints 2 / 7 tokens in the loss and a value of 2.500. There are no tensors here, no autograd graph, no shifted labels, just one list comprehension. Every framework you will encounter (PyTorch, TensorFlow, JAX, MLX, the HuggingFace SFTTrainer, axolotl, torchtune, LLaMA-Factory) is doing exactly this under the hood, with more layers of wrapping.
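One thing the single-example snippet cannot show is what happens in a batch. Padding positions are usually masked with the same -100 value, and a mean taken over the flattened batch weights every supervised token equally, so long responses count for more than short ones. The sketch below contrasts that with a per-example mean. `collate`, `token_mean`, `example_mean`, the `PAD_ID` of 0, and the NLL values are all assumptions for illustration; frameworks differ in which averaging they use.

```python
IGNORE_INDEX = -100
PAD_ID = 0  # hypothetical padding id


def collate(examples, pad_id=PAD_ID):
    """Pad (prompt_ids, response_ids) pairs to one length and build masked labels."""
    max_len = max(len(p) + len(r) for p, r in examples)
    batch_ids, batch_labels = [], []
    for prompt, response in examples:
        pad = max_len - len(prompt) - len(response)
        batch_ids.append(prompt + response + [pad_id] * pad)
        # Prompt and padding positions are masked; only the response is supervised.
        batch_labels.append([IGNORE_INDEX] * len(prompt) + response + [IGNORE_INDEX] * pad)
    return batch_ids, batch_labels


def token_mean(batch_nll, batch_labels):
    """One mean over every supervised token in the flattened batch."""
    kept = [n for nlls, labels in zip(batch_nll, batch_labels)
            for n, y in zip(nlls, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


def example_mean(batch_nll, batch_labels):
    """Mean of per-example means: each conversation counts equally."""
    per_example = []
    for nlls, labels in zip(batch_nll, batch_labels):
        kept = [n for n, y in zip(nlls, labels) if y != IGNORE_INDEX]
        per_example.append(sum(kept) / len(kept))
    return sum(per_example) / len(per_example)


examples = [
    ([101, 234, 567, 8, 9], [42, 99]),   # short answer
    ([101, 77], [5, 6, 7, 8]),           # longer answer, needs one pad token
]
ids, labels = collate(examples)

# Made-up per-position NLLs with the same shape as the padded batch.
nll = [
    [3.1, 2.4, 9.6, 8.2, 7.4, 4.0, 1.0],
    [3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 0.0],
]
print("token-level mean  :", token_mean(nll, labels))
print("per-example mean  :", example_mean(nll, labels))
```

The two averages differ on the same batch, which is worth knowing before comparing loss curves produced by different training stacks.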

PyTorch: The Real SFT Loss in Six Lines

Now the real thing. The code below runs against a real model (Qwen2.5-0.5B is small enough to load on a laptop) and computes the SFT loss for one conversation end-to-end. The load-and-tokenize lines are scaffolding; the SFT mechanism itself is exactly two lines: `labels = ids.clone()` followed by `labels[:, :m] = -100`.

The real SFT loss on a real HuggingFace model
sft_loss_pytorch.py

import torch

Core PyTorch: gives us tensors and the autograd engine that will compute gradients of the loss with respect to every parameter in the model.

import torch.nn.functional as F

F.cross_entropy is the kind of cross-entropy computation a HuggingFace causal LM runs internally to fill out.loss when you pass labels. We import it for explicitness; the snippet never calls it directly.

HF tokenizer + causal LM classes

AutoTokenizer.from_pretrained downloads the BPE merges and special tokens. AutoModelForCausalLM downloads the architecture and weights of a small base model, small enough to run on a laptop.

Load tokenizer for Qwen2.5-0.5B

Qwen2.5-0.5B is a 500M-parameter base model. We pick it because it already ships with a chat template baked into the tokenizer config, which we will lean on in two lines.

Load the model weights

Returns a torch.nn.Module with ~500M parameters. from_pretrained hands the model back in evaluation mode (dropout disabled). The parameters still require gradients, so loss.backward() works, but call model.train() before a real training loop.

messages: a structured conversation

The OpenAI-style list-of-dicts format. SFT data is commonly curated in this shape: one user turn, one assistant turn, sometimes a system turn at the top.

apply_chat_template: turn structure into a string

The tokenizer applies a Jinja template defined by the model's authors. For Qwen2.5 this wraps each turn in <|im_start|>role and <|im_end|> markers. Crucially, both the base model and the SFT'd model will use the *same* template; SFT is what teaches the base model to respect the markers.

Example (user and assistant turns only): <|im_start|>user\nWhat is 17 * 4?<|im_end|>\n<|im_start|>assistant\n17 * 4 = 68.<|im_end|>

Tokenize the full conversation

Convert the templated string into a batch of token ids. Shape is (1, T): one example in the batch, T tokens long. T is a few dozen tokens for this short example; the exact count depends on the template, which may also prepend a default system turn.

Re-template the prompt alone

We need to know where the assistant turn starts so we can mask everything before it. add_generation_prompt=True appends the opening <|im_start|>assistant marker so the prompt length matches what the full conversation will contain up to (but not including) the assistant's first content token.

prompt_len = number of tokens in the prompt half

This integer is the boundary. Positions [0, prompt_len) are 'prompt', positions [prompt_len, T) are 'response'. Its exact value depends on the template; the assistant response takes the remaining positions, about a dozen here. The approach assumes the prompt tokenizes to a prefix of the full conversation.

labels = ids.clone()

Start with the labels equal to the input ids; this is what unmasked next-token prediction would use. We will mutate this tensor on the next line.

labels[:, :prompt_len] = -100 (the entire SFT trick)

Overwrite every prompt position with -100. From here on, cross-entropy will skip these positions in both the numerator and denominator. The model still attends to those tokens during the forward pass (it needs the context to predict the response); it just receives no loss signal for predicting them.

Example: labels  →  [-100, -100, ..., -100, 17, *, 4, =, 68, .]

Forward pass with the labels= shortcut

HuggingFace models, when called with both input_ids and labels, internally (a) compute logits over the vocabulary, (b) shift by one so that the logits at position i are scored against the label at position i+1, (c) apply cross-entropy with ignore_index=-100, and (d) attach the scalar to out.loss.

out.loss is a scalar tensor

One number, with the full autograd graph attached. Calling .backward() on it will populate .grad on every model parameter.

loss.backward(): only response positions contribute to the loss

Even though the model attended to the prompt tokens, the -100 mask removes those positions from the loss. So d(loss)/d(parameter) carries information only about how to better predict the assistant tokens. The backward pass itself still runs over the whole sequence, so masking does not make a step cheaper; SFT is cheap because the dataset is small.

Print: verify the mask worked

(labels != -100).sum() returns the number of supervised positions. It should equal the length of the assistant's tokenized response, about a dozen tokens for this example. If you ever see this equal the full sequence length, your loss mask is broken and you are accidentally training on the prompt.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tok   = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Build the chat-templated string the model expects at inference time.
messages = [
    {"role": "user",      "content": "What is 17 * 4?"},
    {"role": "assistant", "content": "17 * 4 = 68."},
]
text = tok.apply_chat_template(messages, tokenize=False)
ids  = tok(text, return_tensors="pt").input_ids        # shape (1, T)

# Re-tokenize the prompt half alone so we know its length.
# Assumes the prompt tokenizes to a prefix of the full conversation.
prompt_text = tok.apply_chat_template(
    messages[:1], tokenize=False, add_generation_prompt=True,
)
prompt_len = tok(prompt_text, return_tensors="pt").input_ids.shape[1]

# Standard SFT label construction: clone ids, then mask prompt positions.
labels = ids.clone()
labels[:, :prompt_len] = -100                          # IGNORE

# Forward pass with built-in shifted cross-entropy.
out  = model(input_ids=ids, labels=labels)
loss = out.loss                                        # scalar

# Only the response positions contribute to the loss being differentiated.
loss.backward()
print(f"sequence length      : {ids.shape[1]}")
print(f"supervised positions : {(labels != -100).sum().item()}")
print(f"loss                 : {loss.item():.4f}")
```

The first time most people read this code they expect a fancy SFT-specific loss function, some kind of `F.sft_cross_entropy(...)`. There is none. Supervised fine-tuning shares its loss function, its tokenizer, its model architecture, its optimizer choice, and its mixed-precision setup with pretraining. The only thing that differs is which positions carry a label other than -100.
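The one detail the `labels=` shortcut hides is the shift by one position: the logits produced at position i are scored against the label at position i+1. The sketch below lists which (position, target) pairs end up supervised for the toy ids used earlier. `shifted_pairs` is a hypothetical helper written for this illustration, not a library function.

```python
IGNORE_INDEX = -100


def shifted_pairs(input_ids, labels):
    """List (position, token seen at that position, target) for supervised steps.

    The logits at position i are scored against labels[i + 1]. The last
    position has nothing left to predict, and labels[0] is never used.
    """
    pairs = []
    for i in range(len(input_ids) - 1):
        target = labels[i + 1]
        if target != IGNORE_INDEX:
            pairs.append((i, input_ids[i], target))
    return pairs


input_ids = [101, 234, 567, 8, 9, 42, 99]
prompt_len = 5
labels = list(input_ids)
labels[:prompt_len] = [IGNORE_INDEX] * prompt_len

for position, seen_token, target in shifted_pairs(input_ids, labels):
    print(f"logits at position {position} (after token {seen_token}) are scored against {target}")
```

Note that the first supervised prediction is made at the last prompt position: the prompt's final token is never a label, but the model's output there is what has to produce the first response token.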

Implementation gotcha. If you ever build your own SFT loop and your validation loss looks suspiciously low, your first hypothesis should be that the prompt mask is broken: you are training on the user's text, which is much easier to predict than the assistant's text. Print `(labels != -100).sum()` on the first batch.
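That first-batch check can be turned into a small guard. The sketch below is one possible version under stated assumptions: `check_sft_labels` is a hypothetical helper, it works on plain Python lists rather than tensors, and the specific conditions it tests are a choice made here rather than a standard.

```python
IGNORE_INDEX = -100


def check_sft_labels(input_ids, labels, prompt_ids):
    """Return the number of supervised positions, raising if the mask looks broken."""
    if len(labels) != len(input_ids):
        raise ValueError("labels and input_ids must have the same length")
    if list(input_ids[:len(prompt_ids)]) != list(prompt_ids):
        # Tokenizing the prompt alone does not always give a prefix of the
        # full sequence; when it does not, prompt_len points at the wrong place.
        raise ValueError("prompt ids are not a prefix of the full sequence")
    if any(y != IGNORE_INDEX for y in labels[:len(prompt_ids)]):
        raise ValueError("a prompt position carries a real label")
    supervised = sum(1 for y in labels if y != IGNORE_INDEX)
    if supervised == 0:
        raise ValueError("no supervised positions: everything is masked")
    return supervised


input_ids = [101, 234, 567, 8, 9, 42, 99]
prompt_ids = [101, 234, 567, 8, 9]
good_labels = [IGNORE_INDEX] * 5 + [42, 99]

print("supervised positions:", check_sft_labels(input_ids, good_labels, prompt_ids))

try:
    # Forgetting the mask entirely: labels are just a copy of the ids.
    check_sft_labels(input_ids, list(input_ids), prompt_ids)
except ValueError as err:
    print("caught broken mask:", err)
```

Running a check like this once per dataset, before training starts, is much cheaper than diagnosing a suspicious loss curve afterwards.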

What Changes at 70B Parameters and 10K Examples

Nothing about the math changes at scale. The same six-line PyTorch recipe is what runs inside the SFT stages of every modern open-weight release — Llama 3, Qwen 2.5, DeepSeek V3, Mistral Large, Phi-4, Gemma 2. What changes is the constants around the recipe.

| Knob | Pretraining | SFT |
|---|---|---|
| Tokens seen | 10–15 trillion | 10 million – 1 billion |
| Examples | ≈ a snapshot of the web | 1K – 1M curated conversations |
| Epochs | 1 (sometimes < 1) | 1 – 5 (often 3) |
| Sequence length | 8K – 32K | 4K – 16K (capped to fit examples) |
| Learning rate | 1e-4 to 3e-4 peak | 5e-6 to 5e-5 peak (10–100× lower) |
| Optimizer | AdamW | AdamW (identical) |
| Precision | BF16 / FP8 | BF16 (FP8 is rare for SFT) |
| Loss masking | none | labels[:, :prompt_len] = -100 |
| Wall-clock cost | millions of GPU-hours | tens to thousands of GPU-hours |
| Total compute cost (USD) | $10M – $100M+ | $1K – $100K |

The two rows that matter for understanding SFT's character are the learning rate (one to two orders of magnitude lower) and the token budget (four to six orders of magnitude smaller). Together they say: do not move the weights far, do not show the model much. The base model has done the heavy lifting; SFT is a polish, not a rebuild.

A few engineering consequences follow from this:

  • Memory dominates compute. A 70B SFT job is dominated by the cost of holding optimizer states (AdamW needs two FP32 moments per parameter), not by the cost of the forward pass. This is why FSDP / ZeRO-3 sharding shows up in SFT recipes — the model fits on one node, but the optimizer states do not.
  • Throughput is bottlenecked by short sequences. SFT examples are typically 200–2000 tokens long. Without packing (concatenating multiple short examples into one max-length sequence), GPU utilization on a 7B+ model drops to single-digit percent because the matrix multiplies are too small. Every serious SFT framework implements example packing.
  • Communication cost almost disappears. Because there are so few gradient steps (maybe 1000–10000 in a full SFT run vs. 1M+ in pretraining), even cross-node all-reduce overhead becomes negligible. A surprising amount of public SFT is done on single 8×H100 nodes.
  • Loss values are not directly comparable across mixes. Because the denominator $|R|$ changes when you change your data mix (more code → more response tokens → different loss scale), comparing "loss = 1.4 on dataset A" vs. "loss = 1.6 on dataset B" tells you almost nothing. Eval benchmarks (MMLU, MT-Bench, AlpacaEval) carry the signal instead.
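The packing idea from the throughput bullet can be sketched as a greedy concatenation. This is a minimal illustration under simplifying assumptions: `pack_examples` and `padding_fraction` are hypothetical helpers, every row is assumed to be padded to `max_len`, and the sketch ignores the question of whether tokens may attend across example boundaries, which real implementations have to handle.

```python
IGNORE_INDEX = -100


def pack_examples(examples, max_len):
    """Greedily concatenate (input_ids, labels) pairs into rows of at most max_len."""
    packed, cur_ids, cur_labels = [], [], []
    for ids, labels in examples:
        if len(ids) != len(labels):
            raise ValueError("ids and labels must align")
        if len(ids) > max_len:
            raise ValueError("example longer than max_len; truncate or drop it first")
        if len(cur_ids) + len(ids) > max_len:
            # Current row is full enough: flush it and start a new one.
            packed.append((cur_ids, cur_labels))
            cur_ids, cur_labels = [], []
        cur_ids = cur_ids + ids
        cur_labels = cur_labels + labels
    if cur_ids:
        packed.append((cur_ids, cur_labels))
    return packed


def padding_fraction(rows, max_len):
    """Share of the batch that is padding when every row is padded to max_len."""
    used = sum(len(ids) for ids, _ in rows)
    return 1 - used / (len(rows) * max_len)


# Four toy examples with the prompt positions already masked.
examples = [
    ([1, 2, 3, 4], [IGNORE_INDEX, IGNORE_INDEX, 3, 4]),
    ([5, 6, 7], [IGNORE_INDEX, 6, 7]),
    ([8, 9, 10, 11, 12], [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 11, 12]),
    ([13, 14], [IGNORE_INDEX, 14]),
]
max_len = 8
packed = pack_examples(examples, max_len)

print("rows before / after packing :", len(examples), "/", len(packed))
print("padding before packing      :", padding_fraction(examples, max_len))
print("padding after packing       :", padding_fraction(packed, max_len))
```

Because each example starts with masked prompt positions, the labels carry over unchanged when examples are concatenated; only the amount of wasted padding changes.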

The Superficial Alignment Hypothesis

In May 2023, Meta published a paper called LIMA: Less Is More for Alignment. They fine-tuned a 65B base LLaMA on 1,000 hand-curated examples, roughly fifty times fewer than the ~50K of the standard ChatGPT-era recipe, and showed that the resulting model was competitive with (sometimes preferred over) models trained on hundreds of thousands of examples. The authors offered the following framing, which has become the standard mental model for the field:

A model's knowledge and capabilities are learned almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.

Call this the superficial alignment hypothesis. It is the strongest defense of the framing we opened the section with: SFT is a format adapter on top of a knowledge-rich base. If the hypothesis is correct, the implications are concrete:

  • Data quality dominates data quantity. 1,000 great examples beat 100,000 mediocre ones. You will see this confirmed in every open-weight release writeup since 2023 — Olmo, Tulu, Hermes, Nous, Dolphin, all of them obsess over example quality.
  • Knowledge does not arrive during SFT. If the base model does not know something, SFT will not teach it — at best, it will teach the model to confidently fabricate. This is the central source of post-SFT hallucinations: the base model knew when to be uncertain, and the SFT data taught it to suppress that uncertainty in favor of fluent answers.
  • Domain transfer is style transfer. A "medical" SFT does not inject medical knowledge — it teaches the model to use the medical-conversation register when answering medical-shaped questions. The actual facts have to already be in the base.
  • The same base model can be fine-tuned in many directions cheaply. One pretraining run, many SFT runs. This is what makes the open-weight ecosystem possible — you do not need to redo $50M of pretraining to ship a domain-specific variant.

Subsequent work has put nuance on this. Models can learn new factual associations during SFT, but only inefficiently and only for facts they were close to knowing already. And the boundary between "format" and "capability" is fuzzier than LIMA suggested — for instance, learning to structure step-by-step reasoning during SFT does appear to improve raw reasoning benchmark scores in a way that looks suspiciously like a capability boost. But as a working model the framing is excellent and we will lean on it for the rest of this chapter.

Engineering Reality: What SFT Cannot Fix

The framing also tells us, by negation, what SFT is bad at. The same mechanism that lets you cheaply adapt a base model to a new format makes SFT a poor tool for several jobs that practitioners keep trying to use it for.

| Goal | Will SFT work? | Why |
|---|---|---|
| Teach a model to follow chat instructions | Yes (this is its job) | Pure format adaptation on top of existing capabilities. |
| Adopt a specific brand voice or refusal style | Yes | Style is exactly what the SFT objective propagates. |
| Output strict JSON / function calls | Yes | Format. A few thousand examples is usually enough. |
| Add knowledge of last month's company data | No | SFT does not store facts reliably. Use RAG. |
| Make the model better at math | Partially | Helps with reasoning format; the underlying skill comes from pretraining or RL. |
| Remove a baked-in bias from the base | Weakly | SFT moves outputs, not internal representations. Bias often comes back under distribution shift. |
| Teach a model to reliably refuse harmful requests | Partially | Brittle; defenses break under jailbreaks. RLHF / Constitutional AI is more robust. |
| Make a small model behave like a large one | Up to a point (distillation) | Style transfers; capability ceiling is set by the base model's size. |

The most common failure mode by far is catastrophic forgetting: an aggressive SFT run on a narrow domain can damage capabilities the base model had. Train too long on medical Q&A and the model gets worse at code. Train too long on a single brand voice and the model stops being able to write in any other register. We will spend all of section 13.5 on this — it is the single most important constraint on how you choose SFT data, learning rate, and number of epochs.

The takeaway in one sentence. SFT is masked next-token prediction on prompt-response pairs, using the same loss function as pretraining and a learning rate one to two orders of magnitude lower, on a dataset four to six orders of magnitude smaller; its job is not to teach the model what to know, but to teach it when and how to speak.

In the rest of this chapter we follow that recipe top to bottom. Section 13.2 looks at where good SFT data comes from and what makes it "good". Section 13.3 goes deep on chat templates and the tokenization details that determine where the prompt ends and the response begins. Section 13.4 covers the training configuration — learning rate schedules, packing, batch size, gradient accumulation — at the scale you would actually run an SFT job. And section 13.5 confronts catastrophic forgetting head-on, with the mitigation techniques that real labs use to keep their post-SFT models from getting amnesia.
