The Real Problem: A Base Model Has No Idea Who Is Talking
A base model is a colossal next-token predictor trained on raw web text. Drop a question into one and you do not get an answer — you get whatever the internet usually puts after a question. Sometimes that is a reply. Sometimes it is more questions in a FAQ. Sometimes it is a snide YouTube comment. The model has no concept of turns, no concept of a system role, no concept of when its own reply should stop. Roles, turns, and stopping are not properties that emerge from pretraining; they are properties the model must be taught.
That is what supervised fine-tuning is for. But SFT is itself just more next-token prediction — the same loss, the same optimizer, the same forward pass. What changes is the shape of the training data. Instead of raw text, we feed the model conversations wrapped in a strict, machine-readable envelope: special tokens that say “a new user turn starts here,” “this is the assistant's reply,” “this turn is finished.” That envelope is the chat template, and getting it right is the difference between a model that follows instructions cleanly and a model that produces 90% nonsense.
Intuition: A Script for a Two-Voice Play
Imagine handing an improv troupe a screenplay with no character names and no scene breaks — just a wall of dialogue. They would have to guess, line by line, who speaks next. That is a base model trying to hold a conversation. It can string words together fluently, but whether the next sentence should come from the user or the assistant is a coin-flip every few tokens.
Now hand the same troupe a properly formatted script: USER: in one font, ASSISTANT: in another, “END OF TURN” written explicitly at the end of every block. Every actor knows when to speak, what to say, and when to stop. The text itself has not changed — only the structure around it has — but the performance is unrecognisable.
That is what a chat template does. It surrounds plain conversational text with a small alphabet of special tokens that the model has dedicated embedding rows for. Those tokens never appear in normal English. The moment the model sees <|start_header_id|>assistant<|end_header_id|>, it knows with near-certainty that the next several hundred tokens are its own reply — because in the entire SFT dataset that pattern appeared exactly there and only there.
The Mathematical Idea: A Bijection Between Conversations and Token Sequences
Let a conversation be a list of turns $C = [(r_1, c_1), \dots, (r_n, c_n)]$, where $r_i$ is a role and $c_i$ is the content string of turn $i$. A chat template is a deterministic function $T$ that emits a token sequence $x = T(C) = (x_1, \dots, x_N)$ of length $N$. For the function to be useful, it must be a bijection on the equivalence class of conversations the model is supposed to handle: given $x$, we can recover $C$ exactly, and vice versa.
Bijectivity is what guarantees the model can route attention by role. Inside the sequence, each role boundary is marked by a special token (e.g., an opening token <|start_header_id|> and a closing token <|eot_id|>). Because a special token has zero probability of appearing inside natural content $c_i$, the boundaries are unambiguous, and the role-name token immediately after the opener is the cleanest possible feature for any head in the transformer that wants to condition on speaker.
SFT then optimises the standard causal-LM loss on $x$, but with the trick from §13.1: the loss at position $t$ is masked out unless $t$ falls inside an assistant span. Formally, $\mathcal{L}(\theta) = -\frac{1}{|A|} \sum_{t \in A} \log p_\theta(x_t \mid x_{<t})$, where $A \subseteq \{1, \dots, N\}$ is the set of token indices produced by assistant turns (and the assistant's EOT). The gradient is zero outside $A$, so the model spends 100% of its capacity learning to generate assistant replies rather than imitate users.
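To make the masked objective concrete, here is a minimal sketch in plain Python. It assumes the probability the model assigned to each correct next token is already available (the numbers in `p_correct` are made up for illustration), and it averages the negative log-likelihood over assistant positions only. Averaging rather than summing is an assumption; codebases differ on the normalisation.

```python
import math

def masked_nll(p_correct, in_loss):
    """Mean negative log-likelihood over positions where in_loss is True.

    p_correct[t] is the probability the model gave to the true token at
    position t; in_loss[t] says whether t lies inside an assistant span.
    """
    terms = [-math.log(p) for p, keep in zip(p_correct, in_loss) if keep]
    if not terms:
        raise ValueError("no assistant tokens in this sample")
    return sum(terms) / len(terms)

# Hypothetical probabilities for a 6-token sequence; the last three
# positions stand for the assistant reply and its end-of-turn token.
p_correct = [0.9, 0.2, 0.1, 0.5, 0.25, 0.8]
in_loss = [False, False, False, True, True, True]

loss = masked_nll(p_correct, in_loss)

# Changing the probability at a masked position leaves the loss untouched,
# which is exactly why no gradient flows from those positions.
loss_perturbed = masked_nll([0.01, 0.2, 0.1, 0.5, 0.25, 0.8], in_loss)
print(loss == loss_perturbed)  # True
```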
Anatomy of a Modern Template
Every production template, regardless of vendor, is made of the same five ingredients:
| Ingredient | Purpose | Example (Llama-3) |
|---|---|---|
| Sequence-start token | One BOS at the start of every training sample. Marks the absolute beginning of the sequence so positional encodings are anchored. | <|begin_of_text|> |
| Role-header opener | Bracket that introduces a role name. Always followed by a fixed role string and a closer. | <|start_header_id|> |
| Role-header closer | Closes the role-name bracket. Usually followed by a fixed whitespace pattern (often two newlines). | <|end_header_id|>\n\n |
| Content payload | The actual text of the turn — system instructions, user prompt, assistant reply, tool output. The only ingredient that varies per row. | What is the boiling point of water? |
| End-of-turn token | Terminator. The inference loop stops here on assistant turns; the trainer uses it as the boundary between turns. | <|eot_id|> |
Different families combine these five differently. ChatML (OpenAI, Qwen) uses <|im_start|> / <|im_end|> as a single matched pair around each turn. Mistral wraps user instructions in [INST] … [/INST] and tucks system prompts inside the first user turn. Gemma has no system role at all and uses <start_of_turn> / <end_of_turn> bracketing. The choice is arbitrary — what matters is that exactly one choice is used everywhere, all the time.
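As a rough illustration of how two of these families wrap the same conversation, the sketch below renders a short chat in ChatML style and in Llama-3 style. The function names are hypothetical, and the whitespace handling is simplified (shipped templates may trim content or add a default system prompt), so the Jinja template in the model's own tokenizer_config.json remains the source of truth.

```python
def render_chatml(messages):
    # ChatML style: one matched <|im_start|> / <|im_end|> pair per turn,
    # role name on the first line, content after a single newline.
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

def render_llama3(messages):
    # Llama-3 style: one BOS, then header bracket, two newlines,
    # content, and an end-of-turn token for every turn.
    out = "<|begin_of_text|>"
    for m in messages:
        out += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    return out

chat = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]

print(repr(render_chatml(chat)))
print(repr(render_llama3(chat)))
```

The content strings are identical in both outputs; only the envelope differs, which is why a model trained on one envelope cannot be prompted with the other.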
The template is part of the model. Llama-3 and Llama-3.1 share the same template; Llama-3.1 and Llama-3.2 do not. Mixing them silently produces a Llama-3 model that thinks every prompt is malformed. Templates ship in tokenizer_config.json next to the model weights for exactly this reason: the template version-locks with the weights.
Manual Numerical Walkthrough: Hand-Rendering a Two-Turn Chat
Below is a hand-rendered Llama-3 template for a two-turn conversation. Walk through it with a pencil: every token has a purpose, and once you have counted them by hand, future template bugs become much easier to spot. For readability the walkthrough treats each whitespace-separated word as one token; a real tokenizer would split some of them further.
The conversation we are rendering:
| Turn | Role | Content |
|---|---|---|
| 1 | system | Answer in one sentence. |
| 2 | user | What is the boiling point of water? |
| 3 | assistant | Water boils at 100 degrees Celsius at sea level. |
Step 1 — Emit the sequence-start token. One token total. This token is not in the loss; we are not asking the model to predict the first thing in the sequence because there is no preceding context.
pos 0: <|begin_of_text|> [special, no-loss]
Step 2 — Emit the system turn. Header bracket around the role name, double newline, the content, the end-of-turn token. None of these are in the loss because the system message is part of the prompt, not something the model should learn to generate.
pos 1: <|start_header_id|> [special, no-loss]
pos 2: system [role name, no-loss]
pos 3: <|end_header_id|> [special, no-loss]
pos 4: \n\n [whitespace, no-loss]
pos 5: Answer [content, no-loss]
pos 6: in [content, no-loss]
pos 7: one [content, no-loss]
pos 8: sentence. [content, no-loss]
pos 9: <|eot_id|> [special, no-loss] ← system EOT is masked
Step 3 — Emit the user turn. Same shape as system. Again, nothing in the loss — the model is not being graded on producing user questions.
pos 10: <|start_header_id|> [special, no-loss]
pos 11: user [role name, no-loss]
pos 12: <|end_header_id|> [special, no-loss]
pos 13: \n\n [whitespace, no-loss]
pos 14: What [content, no-loss]
pos 15: is [content, no-loss]
pos 16: the [content, no-loss]
pos 17: boiling [content, no-loss]
pos 18: point [content, no-loss]
pos 19: of [content, no-loss]
pos 20: water? [content, no-loss]
pos 21: <|eot_id|> [special, no-loss] ← user EOT is masked
Step 4 — Emit the assistant turn. Header is still masked (the role header is part of the prompt for the assistant's next-token prediction, not the thing being generated). The content tokens are in the loss, and so is the final <|eot_id|> — that is what teaches the model to stop.
pos 22: <|start_header_id|> [special, no-loss]
pos 23: assistant [role name, no-loss]
pos 24: <|end_header_id|> [special, no-loss]
pos 25: \n\n [whitespace, no-loss]
pos 26: Water [content, ✓ IN LOSS]
pos 27: boils [content, ✓ IN LOSS]
pos 28: at [content, ✓ IN LOSS]
pos 29: 100 [content, ✓ IN LOSS]
pos 30: degrees [content, ✓ IN LOSS]
pos 31: Celsius [content, ✓ IN LOSS]
pos 32: at [content, ✓ IN LOSS]
pos 33: sea [content, ✓ IN LOSS]
pos 34: level. [content, ✓ IN LOSS]
pos 35: <|eot_id|> [special, ✓ IN LOSS] ← assistant EOT IS in loss
Bookkeeping:
| Quantity | Value | Note |
|---|---|---|
| Total tokens | 36 | BOS + headers + whitespace + content + EOTs (positions 0–35) |
| Special tokens | 10 | BOS + 3×start_header + 3×end_header + 3×EOT |
| Tokens in the loss | 10 | 9 assistant words + assistant EOT |
| Loss-token fraction | ≈ 28% | Typical for short replies; long replies hit 70-90% |
What the gradient does at masked versus unmasked positions: at positions 0–25 (BOS, system, user, assistant header), the model still runs a forward pass, and later positions still attend to these tokens, but their cross-entropy terms are excluded from the loss, so no gradient originates from predicting them. At positions 26–35 the cross-entropy is real and the gradient pushes the model to put higher probability on Water given the system+user context, on boils given that context plus Water, and so on.
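The bookkeeping table can be reproduced with a few lines of Python. This is a sketch that keeps the walkthrough's simplification of one token per whitespace-separated word; `render_tokens` is a hypothetical helper, not part of any library.

```python
SPECIALS = {
    "<|begin_of_text|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eot_id|>",
}

def render_tokens(messages):
    """Return (tokens, in_loss) with one token per word, as in the walkthrough."""
    tokens, in_loss = ["<|begin_of_text|>"], [False]
    for m in messages:
        header = ["<|start_header_id|>", m["role"], "<|end_header_id|>", "\n\n"]
        body = m["content"].split() + ["<|eot_id|>"]
        graded = m["role"] == "assistant"  # only assistant content + EOT is graded
        tokens += header + body
        in_loss += [False] * len(header) + [graded] * len(body)
    return tokens, in_loss

chat = [
    {"role": "system", "content": "Answer in one sentence."},
    {"role": "user", "content": "What is the boiling point of water?"},
    {"role": "assistant", "content": "Water boils at 100 degrees Celsius at sea level."},
]

tokens, in_loss = render_tokens(chat)
total = len(tokens)
n_special = sum(t in SPECIALS for t in tokens)
n_loss = sum(in_loss)
print(total, n_special, n_loss, round(100 * n_loss / total))  # 36 10 10 28
```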
Now repeat this for a 100-turn assistant-heavy conversation and you have the input to a real SFT step.
Visualizing the Template Zoo
The explorer below lets you edit a tiny three-turn chat and watch every major template family render it byte for byte. Click between Raw concatenation, ChatML, Llama-3, Gemma, and Mistral to see how wildly different the same conversation looks once it is wrapped for each model.
Two things to notice. First, the raw view collapses all three turns into a single space-separated string — exactly what a base model sees if you skip the template. There is no way for the model to know where one role ends and the next begins, which is why a base model on a chat task feels “drunk.” Second, every templated view contains 8–30 special tokens of overhead per turn. For a long multi-turn document, that overhead is a real fraction of the sequence — and it is exactly the fraction the loss mask removes from the gradient.
Loss Masking: Teaching Only the Assistant Tokens
The template gets the data into the right shape. The loss mask decides what the model is actually graded on. The two are joined at the hip: the template defines the spans, the mask selects the spans that contribute gradient. Toggle the modes below to see every choice the field has tried and why only one of them is the modern default.
The assistant content + EOT mode (the third button) is the recipe every modern SFT codebase converges on. The reasoning chain is short and surprisingly tight:
- Mask user / system tokens because the model is not the user. Training it on user tokens turns SFT into a slow, expensive form of pretraining on a tiny, weird corpus.
- Mask role headers because the model does not generate headers at inference — the runtime emits them between turns. Training on them teaches a behaviour the model will never use.
- Keep assistant content in the loss because that is the entire point of SFT.
- Keep assistant <|eot_id|> in the loss because the inference loop watches for that token and stops as soon as it appears. A model that has not been graded on producing EOT does not produce EOT, and the user gets the famous “model rambles for 4096 tokens” bug.
- Keep every assistant turn in multi-turn chats (not just the last one). Early HuggingFace defaults dropped all but the last assistant turn, wasting roughly half the carefully annotated SFT data on every multi-turn sample. The whole open-source instruct ecosystem in 2023 paid for this mistake before fixing it in 2024.
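The cost of last-turn-only masking is easy to see with a toy count. The sketch below assumes each turn is summarised as a (role, token count) pair, with the count covering content plus EOT and headers ignored for simplicity; the span lengths are invented for illustration.

```python
def assistant_mask(spans, last_only=False):
    """Per-token loss mask for a conversation given as (role, n_tokens) spans."""
    last_assistant = max(
        i for i, (role, _) in enumerate(spans) if role == "assistant"
    )
    mask = []
    for i, (role, n_tokens) in enumerate(spans):
        graded = role == "assistant" and (not last_only or i == last_assistant)
        mask += [graded] * n_tokens
    return mask

# Hypothetical two-exchange conversation.
spans = [("user", 12), ("assistant", 40), ("user", 8), ("assistant", 40)]

graded_all = sum(assistant_mask(spans))
graded_last = sum(assistant_mask(spans, last_only=True))
print(graded_all, graded_last)  # 80 40
```

With two equally long assistant replies, grading only the last one throws away half of the supervised tokens, and the loss grows with the number of turns.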
Plain Python: Rendering and Masking from Scratch
The full pipeline is fewer than 150 lines of Python with no dependencies. We render the chat into a string in lock-step with a per-character boolean mask, tokenize, inherit the mask onto the tokens, and emit -100 at every masked position. This is the entire mechanism — every production SFT codebase is this script plus distributed-training glue.
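A minimal sketch of that pipeline follows, under loud assumptions: the template is a simplified Llama-3 layout, and the tokenizer is a toy regex that treats special tokens as atomic and otherwise splits on whitespace runs. A real run would use the model's own tokenizer; the helper names here are hypothetical.

```python
import re

IGNORE_INDEX = -100
# Toy tokenizer (assumption): special tokens are atomic; everything else
# is split into runs of whitespace or runs of non-whitespace characters.
TOKEN_RE = re.compile(r"<\|[a-z_]+\|>|\s+|[^\s<]+|<")

def render_with_mask(messages):
    """Render the chat and a parallel per-character loss mask."""
    text, mask = "", []

    def emit(piece, graded):
        nonlocal text
        text += piece
        mask.extend([graded] * len(piece))

    emit("<|begin_of_text|>", False)
    for m in messages:
        graded = m["role"] == "assistant"
        emit(f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n", False)
        emit(m["content"] + "<|eot_id|>", graded)  # content and EOT share a fate
    return text, mask

def tokenize_with_labels(text, mask, vocab):
    """Tokenize and let each token inherit the mask of its characters."""
    ids, labels = [], []
    for match in TOKEN_RE.finditer(text):
        tok_id = vocab.setdefault(match.group(), len(vocab))
        ids.append(tok_id)
        # A token is graded only if every character in it is graded.
        graded = all(mask[match.start():match.end()])
        labels.append(tok_id if graded else IGNORE_INDEX)
    return ids, labels

chat = [
    {"role": "user", "content": "Hi there"},
    {"role": "assistant", "content": "Hello!"},
]
text, mask = render_with_mask(chat)
ids, labels = tokenize_with_labels(text, mask, vocab={})
print(labels)
```

The labels here are aligned with the inputs, not shifted; the usual convention is that the loss function or model wrapper shifts them by one position before computing cross-entropy.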
PyTorch + HuggingFace: Doing the Same Thing in Production
In production we do not hand-write the template — every modern instruct checkpoint ships with a Jinja2 chat template inside its tokenizer_config.json. Calling tokenizer.apply_chat_template() runs it. The only thing we have to add is the per-turn mask, which we build by re-rendering the prefix conversation one turn at a time and using the resulting token counts as span boundaries.
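The prefix re-rendering idea can be sketched as below. `build_sft_labels` is a hypothetical helper that only relies on `tokenizer.apply_chat_template` with its `tokenize`, `add_generation_prompt`, and `return_dict` arguments. It assumes the template is prefix-stable, meaning that rendering the first k turns yields a prefix of the full rendering; templates that rewrite earlier turns violate this, so the sketch checks for it and raises.

```python
IGNORE_INDEX = -100

def build_sft_labels(tokenizer, messages):
    """Return (input_ids, labels) with only assistant spans left unmasked."""

    def render(msgs, generation_prompt=False):
        return list(
            tokenizer.apply_chat_template(
                msgs,
                tokenize=True,
                add_generation_prompt=generation_prompt,
                return_dict=False,
            )
        )

    input_ids = render(messages)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        # Everything up to and including the assistant header is prompt.
        prompt = render(messages[:i], generation_prompt=True)
        # Everything through the end of this assistant turn (content + EOT).
        through_turn = render(messages[: i + 1])
        if input_ids[: len(through_turn)] != through_turn:
            raise ValueError("template is not prefix-stable for this chat")
        start, end = len(prompt), len(through_turn)
        labels[start:end] = input_ids[start:end]
    return input_ids, labels
```

With a real checkpoint, `tokenizer` would come from `transformers.AutoTokenizer.from_pretrained(...)`. Templates that emit trailing whitespace after the end-of-turn token will have that whitespace graded too under this scheme, so printing one decoded sample with its mask is still worthwhile.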
From toy script to real SFT
The script above is what runs inside trl.SFTTrainer, torchtune, and axolotl — minus the multi-GPU collator, the gradient checkpointing, and the FSDP wrap. Once the rendered + masked tensors are produced, the rest of SFT is identical to pretraining: forward pass, cross-entropy with ignore_index=-100, backward, optimizer step. Everything novel about SFT is here in the data layer.
At Massive Scale: Why Template Bugs Are Catastrophic
When the model is small, a template bug looks like a 1–2 point dip on MT-Bench and a slightly chattier assistant. When the model is large, the same bug eats hundreds of thousands of GPU-hours and delays a launch.
The compute multiplier on a bad mask
A 70B Llama-3 SFT run on a 500k-conversation dataset costs roughly $C \approx 6ND$ FLOPs, where $N = 7 \times 10^{10}$ is the parameter count and $D$ is the number of training tokens, or about three days on a 256-H100 cluster at fp8 peak. If the loss mask is wrong (say, it accidentally grades user tokens too), the run still converges to something, just to a model that has learned to imitate users 30% of the time. Discovering this requires a full eval cycle (another half-day of GPU time), an investigation, and a re-run. One template bug is a week.
The data-mix multiplier
Frontier SFT mixes data from a dozen different sources (Tülu, Ultra, in-house annotators, distilled outputs from a stronger model, math traces, code traces, tool-use traces). Each source ships in its own format. The first stage of every SFT pipeline is a normalisation pass that coerces all of them into the canonical messages list. A bug in that converter for any single source silently poisons that fraction of the training data — and because the converter is invisible to apply_chat_template, the poisoned data renders cleanly and trains without errors.
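One way to keep converter bugs from passing silently is to validate every conversation after normalisation. The sketch below converts a ShareGPT-style record (a `conversations` list of `from`/`value` pairs) into the canonical messages list and then checks it. The strict user/assistant alternation rule is an assumption that suits plain chat data; tool-use traces would need a looser check.

```python
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def from_sharegpt(record):
    """Convert a ShareGPT-style record into a canonical messages list."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]

def validate(messages):
    """Raise ValueError if the conversation has a shape SFT should not see."""
    if not messages:
        raise ValueError("empty conversation")
    # An optional system message may come first; the rest must alternate.
    body = messages[1:] if messages[0]["role"] == "system" else messages
    for i, m in enumerate(body):
        expected = "user" if i % 2 == 0 else "assistant"
        if m["role"] != expected:
            raise ValueError(f"turn {i}: expected {expected}, got {m['role']}")
        if not m["content"].strip():
            raise ValueError(f"turn {i}: empty content")
    if not body or body[-1]["role"] != "assistant":
        raise ValueError("conversation must end with an assistant turn")
    return messages

record = {
    "conversations": [
        {"from": "human", "value": "Hi"},
        {"from": "gpt", "value": "Hello!"},
    ]
}
messages = validate(from_sharegpt(record))
print(messages)
```

An unknown speaker label raises a `KeyError` in `from_sharegpt`, which is deliberate: failing loudly on one record is cheaper than training on a mislabelled source.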
The inference-skew multiplier
Many serving stacks (vLLM, TGI, SGLang) build the prompt from their own templating layer rather than re-using apply_chat_template. If the serving template differs from the training template by a single whitespace, the model sees a prompt-shape distribution at inference that it never saw during SFT, and tail-task quality regresses while in-distribution evals look fine. The fix is always re-using the tokenizer's own template at inference — even when that costs you a couple of engineering hours of integration work.
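A cheap guard against this kind of skew is to render the same conversation through both paths and compare the strings byte for byte. The sketch below assumes you can obtain both rendered prompts as plain strings; `first_divergence` is a hypothetical helper, and the two example prompts differ by exactly one newline.

```python
def first_divergence(train_prompt, serve_prompt):
    """Return None if identical, else (index, train_snippet, serve_snippet)."""
    if train_prompt == serve_prompt:
        return None
    shared = min(len(train_prompt), len(serve_prompt))
    index = next(
        (k for k in range(shared) if train_prompt[k] != serve_prompt[k]),
        shared,  # one prompt is a strict prefix of the other
    )
    # repr() makes whitespace differences such as "\n" vs "\n\n" visible.
    return (
        index,
        repr(train_prompt[index:index + 20]),
        repr(serve_prompt[index:index + 20]),
    )

train = "<|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
serve = "<|start_header_id|>user<|end_header_id|>\nHi<|eot_id|>"

diff = first_divergence(train, serve)
print(diff)
```

Running a check like this over a handful of representative conversations in CI catches whitespace drift before it reaches users.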
Tool-call templates: the next layer of overhead
Once the model has to emit structured tool calls, the template has to define a sub-grammar for arguments. Llama-3.1 uses a dedicated <|python_tag|> prefix; Qwen-2.5 wraps tool calls in <tool_call>…</tool_call>; Anthropic-style models use XML-like tags. The format does not much matter for capability; what matters is that the template renders the tool call the same way at training time and at serving time. Every team that has shipped tool-use has been burned at least once by a mismatch in this layer.
Engineering Reality: The Catalogue of Template Disasters
Two years of open-source SFT releases have produced a recurring cast of bugs. The pattern is always the same: the bug is invisible in train-loss curves, mostly invisible in standard evals, and only shows up when a user has a long conversation, asks a follow-up, or gets a tool call. By the time the regression is caught, a weight release has already been published. This is the list every SFT engineer eventually memorises.
- The never-ending response. Assistant <|eot_id|> was masked out of the loss. The model generates beautifully and then keeps going until it hits max_new_tokens. Fix: include the assistant EOT in the loss; print one sample before training.
- The double-BOS sequence. apply_chat_template already adds BOS; tokenizer(text) with default add_special_tokens=True adds another one. The model trains on <|begin_of_text|><|begin_of_text|>… and then sees only a single BOS at inference. Fix: always add_special_tokens=False when re-tokenizing template output.
- User tokens in the loss. The mask got inverted. The model becomes a good user-simulator and a bad assistant. Catches: print one sample; sanity-check the assistant-token fraction (typically 30–80%, never 100%, never < 10%).
- The padding-graded loss. The collator padded the labels tensor with pad_token_id instead of -100. Loss looks suspiciously low (predicting a single padding token is trivial); MMLU drops 5–10 points. Fix: pad labels with IGNORE_INDEX, always.
- Template-version drift. The training code uses v0 of the template (with two newlines); the inference code uses v1 (with one newline). In-distribution evals are fine because they go through the same training code; users get a slightly worse model than the eval reports.
- Multi-turn last-only masking. The trainer was configured to grade only the final assistant reply per conversation. Multi-turn capability degrades; single-turn looks unchanged. Fix: grade every assistant span in multi-turn data.
- Truncated mid-turn. The sequence was truncated to fit the context window, leaving a half-rendered assistant turn with no EOT. The model learns that some assistant turns simply stop in the middle of a word. Fix: drop the trailing partial turn rather than truncating it; never truncate inside a turn.
- Tool-call format skew. The SFT data wrapped tool calls one way; the serving prompt builds them another. Tool-use tail accuracy crashes while normal chat stays fine. Fix: render tool calls with the model's own template at every layer.
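Several of the bugs above can be caught by a pre-flight check on a single sample. The sketch below is one possible shape for such a check, with hypothetical helper names; the token ids in the example are invented, and real ids would come from the tokenizer.

```python
IGNORE_INDEX = -100

def check_sample(input_ids, labels, bos_id, eot_id):
    """Return a list of human-readable problems found in one training sample."""
    problems = []
    if len(input_ids) != len(labels):
        problems.append("input_ids and labels differ in length")
    if input_ids[:2] == [bos_id, bos_id]:
        problems.append("double BOS at start of sequence")
    graded = [label for label in labels if label != IGNORE_INDEX]
    if not graded:
        problems.append("no graded tokens")
    else:
        if eot_id not in graded:
            problems.append("assistant EOT is never graded")
        if len(graded) == len(labels):
            problems.append("every token is graded; the mask may be missing")
    return problems

def pad_labels(batch_labels, pad_to):
    """Right-pad label rows with IGNORE_INDEX, never with the pad token id."""
    return [row + [IGNORE_INDEX] * (pad_to - len(row)) for row in batch_labels]

# Invented ids: 0 = BOS, 3 = EOT.
good = check_sample([0, 5, 6, 3, 7, 8, 3], [-100, -100, -100, -100, 7, 8, 3], 0, 3)
bad = check_sample([0, 0, 5, 6, 3, 7, 8, 3], [-100] * 5 + [7, 8, -100], 0, 3)
print(good)  # []
print(bad)
```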
Fix these and what is left is the actual capability that supervised fine-tuning is supposed to teach. The next section (§13.4, SFT Training Configuration) builds the optimizer, schedule, and batch-size discipline that turns these correctly masked tokens into a model the user wants to talk to.
The mental model that unifies this section: the chat template is the API between conversations and tokens. SFT teaches the model to be fluent in the assistant half of that API. Everything else in this chapter — data collection, training config, forgetting mitigation — is in service of that single goal, and every one of them assumes the template is byte-perfect.