Boo-AI — Master Artificial Intelligence by Building from Scratch

The Real Problem: A Base Model Has No Idea Who Is Talking

A base model is a colossal next-token predictor trained on raw web text. Drop a question into one and you do not get an answer — you get whatever the internet usually puts after a question. Sometimes that is a reply. Sometimes it is more questions in a FAQ. Sometimes it is a snide YouTube comment. The model has no concept of turns, no concept of a system role, no concept of when its own reply should stop. Roles, turns, and stopping are not properties that emerge from pretraining; they are properties the model must be taught.

That is what supervised fine-tuning is for. But SFT is itself just more next-token prediction — the same loss, the same optimizer, the same forward pass. What changes is the shape of the training data. Instead of raw text, we feed the model conversations wrapped in a strict, machine-readable envelope: special tokens that say “a new user turn starts here,” “this is the assistant's reply,” “this turn is finished.” That envelope is the chat template, and getting it right is the difference between a model that follows instructions cleanly and a model that produces 90% nonsense.

The one-line takeaway of this section: the chat template is the only contract between your SFT data, your inference code, and your model. If those three disagree by even a single token, the model degrades silently — and the bug is almost invisible in training metrics.

Intuition: A Script for a Two-Voice Play

Imagine handing an improv troupe a screenplay with no character names and no scene breaks — just a wall of dialogue. They would have to guess, line by line, who speaks next. That is a base model trying to hold a conversation. It can string words together fluently, but whether the next sentence should come from the user or the assistant is a coin-flip every few tokens.

Now hand the same troupe a properly formatted script: USER: in one font, ASSISTANT: in another, “END OF TURN” written explicitly at the end of every block. Every actor knows when to speak, what to say, and when to stop. The text itself has not changed — only the structure around it has — but the performance is unrecognisable.

That is what a chat template does. It surrounds plain conversational text with a small alphabet of special tokens that the model has dedicated embedding rows for. Those tokens never appear in normal English. The moment the model sees <|start_header_id|>assistant<|end_header_id|>, it knows with near-certainty that the next several hundred tokens are its own reply — because in the entire SFT dataset that pattern appeared exactly there and only there.

The Mathematical Idea: A Bijection Between Conversations and Token Sequences

Let a conversation $C$ be a list of turns $C = [(r_1, c_1), (r_2, c_2), \dots, (r_T, c_T)]$, where $r_t \in \{ \text{system}, \text{user}, \text{assistant}, \text{tool} \}$ is a role and $c_t$ is the content string of turn $t$. A chat template is a deterministic function $\Phi : C \mapsto x_{1:N}$ that emits a token sequence of length $N$. For the function to be useful, it must be a bijection on the equivalence class of conversations the model is supposed to handle: given $x_{1:N}$, we can recover $C$ exactly, and vice versa.
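The round-trip property can be checked directly. The following is a minimal sketch, assuming the Llama-3-style envelope used later in this section and working on strings rather than token ids. The names `render` and `parse` are illustrative, not library functions, and the parser assumes that content never contains a special-token string.

```python
import re
from typing import List, Tuple

Conversation = List[Tuple[str, str]]  # [(role, content), ...]

BOS = "<|begin_of_text|>"
START_HDR = "<|start_header_id|>"
END_HDR = "<|end_header_id|>"
EOT = "<|eot_id|>"


def render(conv: Conversation) -> str:
    """Phi: conversation -> templated string (a stand-in for the token sequence)."""
    body = "".join(
        f"{START_HDR}{role}{END_HDR}\n\n{content}{EOT}" for role, content in conv
    )
    return BOS + body


# One turn: header opener, role name, header closer, two newlines, content, EOT.
_TURN = re.compile(
    re.escape(START_HDR) + r"(\w+)" + re.escape(END_HDR) + r"\n\n(.*?)" + re.escape(EOT),
    re.DOTALL,
)


def parse(text: str) -> Conversation:
    """Inverse of render: recover the turns, or raise if text is not a valid rendering."""
    if not text.startswith(BOS):
        raise ValueError("missing sequence-start token")
    turns = [(m.group(1), m.group(2)) for m in _TURN.finditer(text)]
    if render(turns) != text:
        raise ValueError("text is not a well-formed rendering")
    return turns


conv = [
    ("system", "Answer in one sentence."),
    ("user", "What is the boiling point of water?"),
]
assert parse(render(conv)) == conv
```

The final check inside `parse` (re-render and compare) is what rejects strings that merely contain valid-looking turns but are not themselves in the image of `render`.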

Bijectivity is what guarantees the model can route attention by role. Inside the sequence, each role boundary is marked by a special token $s_r$ (e.g., $s_{\text{user}}$) and each turn ends with a closing token $s_{\text{EOT}}$. Because $s_r$ never appears inside natural $c_t$ (provided the tokenizer is not allowed to parse special-token strings embedded in user content), the boundaries are unambiguous, and the role-name token immediately after $s_r$ is the cleanest possible feature for any head in the transformer that wants to condition on speaker.

SFT then optimises the standard causal-LM loss on $\Phi(C)$, but with the trick from §13.1: the loss at position $i$ is masked out unless $i$ falls inside an assistant span. Formally, $\mathcal{L} = -\sum_{i \in A(C)} \log p_\theta(x_i \mid x_{<i})$, where $A(C)$ is the set of token indices produced by assistant turns (and the assistant's EOT). Positions outside $A(C)$ contribute no loss term, so the entire gradient signal comes from learning to generate assistant replies rather than imitating users.
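The masked sum can be made concrete with a few numbers. This is a small sketch in plain Python; the probabilities are made-up values standing in for $p_\theta(x_i \mid x_{<i})$, and `masked_nll` is an illustrative name.

```python
import math
from typing import List, Set


def masked_nll(token_probs: List[float], assistant_idx: Set[int]) -> float:
    """Negative log-likelihood summed only over positions i in A(C)."""
    return -sum(math.log(p) for i, p in enumerate(token_probs) if i in assistant_idx)


# Hypothetical probabilities the model assigned to the correct token at each position.
probs = [0.9, 0.2, 0.5, 0.25, 0.5]
A = {3, 4}  # only the last two positions belong to an assistant span

loss = masked_nll(probs, A)
print(round(loss, 4))  # -(log 0.25 + log 0.5) = log 8 → 2.0794

# Changing a masked position's probability leaves the loss untouched.
assert masked_nll([0.01, 0.2, 0.5, 0.25, 0.5], A) == loss
```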

Why bijection matters at inference: at serving time the runtime constructs the prompt with the same $\Phi$ but truncates just before the assistant's content, then samples until $s_{\text{EOT}}$. If training and inference use different $\Phi$, the model sees a prompt-shape distribution at inference that it never saw in training, and quality collapses.
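The "same template, truncated before the assistant's content" requirement can be expressed as a prefix check. The sketch below assumes the Llama-3-style envelope from this section; `render_training` and `render_prompt` are illustrative names.

```python
from typing import List, Tuple

BOS = "<|begin_of_text|>"
START_HDR = "<|start_header_id|>"
END_HDR = "<|end_header_id|>"
EOT = "<|eot_id|>"


def header(role: str) -> str:
    return f"{START_HDR}{role}{END_HDR}\n\n"


def render_training(conv: List[Tuple[str, str]]) -> str:
    """Full rendering used for an SFT sample."""
    return BOS + "".join(header(role) + content + EOT for role, content in conv)


def render_prompt(conv: List[Tuple[str, str]]) -> str:
    """Same template, cut just before the assistant's content."""
    return render_training(conv) + header("assistant")


history = [("user", "What is the boiling point of water?")]
reply = "Water boils at 100 degrees Celsius at sea level."

train_text = render_training(history + [("assistant", reply)])
prompt = render_prompt(history)

# The inference prompt must be an exact prefix of the training sample,
# and what remains is exactly what the model is asked to generate.
assert train_text.startswith(prompt)
assert train_text[len(prompt):] == reply + EOT

# A one-character template drift (single newline after headers) breaks the contract.
drifted_prompt = prompt.replace("\n\n", "\n")
assert not train_text.startswith(drifted_prompt)
```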

Anatomy of a Modern Template

Every production template, regardless of vendor, is made of the same five ingredients:

Ingredient	Purpose	Example (Llama-3)
Sequence-start token	One BOS at the start of every training sample. Marks the absolute beginning of the sequence so positional encodings are anchored.	<|begin_of_text|>
Role-header opener	Bracket that introduces a role name. Always followed by a fixed role string and a closer.	<|start_header_id|>
Role-header closer	Closes the role-name bracket. Usually followed by a fixed whitespace pattern (often two newlines).	<|end_header_id|>\n\n
Content payload	The actual text of the turn: system instructions, user prompt, assistant reply, tool output. The only ingredient that varies per row.	What is the boiling point of water?
End-of-turn token	Terminator. The inference loop stops here on assistant turns; the trainer uses it as the boundary between turns.	<|eot_id|>

Different families combine these five differently. ChatML (OpenAI, Qwen) uses <|im_start|> / <|im_end|> as a single matched pair around each turn. Mistral wraps user instructions in [INST] … [/INST] and tucks system prompts inside the first user turn. Gemma has no system role at all and uses <start_of_turn> / <end_of_turn> bracketing. The choice is arbitrary — what matters is that exactly one choice is used everywhere, all the time.
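To see how two families wrap the same conversation, here is a minimal sketch of a ChatML-style renderer next to the Llama-3-style one. The exact whitespace follows the commonly documented formats, but this is an assumption: the template shipped with a given checkpoint is the authority, and both function names are illustrative.

```python
from typing import List, Tuple

Conversation = List[Tuple[str, str]]


def render_llama3(conv: Conversation) -> str:
    """Llama-3 style: header bracket, two newlines, content, end-of-turn token."""
    turns = "".join(
        f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"
        for role, content in conv
    )
    return "<|begin_of_text|>" + turns


def render_chatml(conv: Conversation) -> str:
    """ChatML style: one matched <|im_start|> / <|im_end|> pair around each turn."""
    return "".join(
        f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in conv
    )


conv = [("user", "hi"), ("assistant", "hello")]
print(render_llama3(conv))
print(render_chatml(conv))
```

The content strings are identical in both outputs; only the envelope differs, which is why a model trained on one envelope cannot be served with the other.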

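The template is part of the model. Even within one family the template can change between releases: later Llama-3.x checkpoints, for example, ship templates with additions such as tool-calling support that the original Llama-3 template lacks. Rendering data with one release's template and serving with another's silently hands the model prompts it treats as malformed. Templates ship in tokenizer_config.json next to the model weights for exactly this reason: the template is version-locked to the weights.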

Manual Numerical Walkthrough: Hand-Rendering a Two-Turn Chat

Below is a hand-rendered Llama-3 template for a short conversation: a system prompt followed by one user/assistant exchange. Walk through it with a pencil: every token has a purpose, and once you have counted them by hand, every future template bug becomes obvious in seconds.


The conversation we are rendering:

Turn	Role	Content
1	system	Answer in one sentence.
2	user	What is the boiling point of water?
3	assistant	Water boils at 100 degrees Celsius at sea level.

Step 1 — Emit the sequence-start token. One token total. This token is not in the loss; we are not asking the model to predict the first thing in the sequence because there is no preceding context.

pos  0:  <|begin_of_text|>          [special, no-loss]

Step 2 — Emit the system turn. Header bracket around the role name, double newline, the content, the end-of-turn token. None of these are in the loss because the system message is part of the prompt, not something the model should learn to generate.

pos  1:  <|start_header_id|>        [special, no-loss]
pos  2:  system                      [role name, no-loss]
pos  3:  <|end_header_id|>           [special, no-loss]
pos  4:  \n\n                        [whitespace, no-loss]
pos  5:  Answer                      [content, no-loss]
pos  6:  in                          [content, no-loss]
pos  7:  one                         [content, no-loss]
pos  8:  sentence.                   [content, no-loss]
pos  9:  <|eot_id|>                  [special, no-loss]   ← system EOT is masked

Step 3 — Emit the user turn. Same shape as system. Again, nothing in the loss — the model is not being graded on producing user questions.

pos 10:  <|start_header_id|>        [special, no-loss]
pos 11:  user                        [role name, no-loss]
pos 12:  <|end_header_id|>           [special, no-loss]
pos 13:  \n\n                        [whitespace, no-loss]
pos 14:  What                        [content, no-loss]
pos 15:  is                          [content, no-loss]
pos 16:  the                         [content, no-loss]
pos 17:  boiling                     [content, no-loss]
pos 18:  point                       [content, no-loss]
pos 19:  of                          [content, no-loss]
pos 20:  water?                      [content, no-loss]
pos 21:  <|eot_id|>                  [special, no-loss]   ← user EOT is masked

Step 4 — Emit the assistant turn. Header is still masked (the role header is part of the prompt for the assistant's next-token prediction, not the thing being generated). The content tokens are in the loss, and so is the final <|eot_id|> — that is what teaches the model to stop.

pos 22:  <|start_header_id|>        [special, no-loss]
pos 23:  assistant                   [role name, no-loss]
pos 24:  <|end_header_id|>           [special, no-loss]
pos 25:  \n\n                        [whitespace, no-loss]
pos 26:  Water                       [content, ✓ IN LOSS]
pos 27:  boils                       [content, ✓ IN LOSS]
pos 28:  at                          [content, ✓ IN LOSS]
pos 29:  100                         [content, ✓ IN LOSS]
pos 30:  degrees                     [content, ✓ IN LOSS]
pos 31:  Celsius                     [content, ✓ IN LOSS]
pos 32:  at                          [content, ✓ IN LOSS]
pos 33:  sea                         [content, ✓ IN LOSS]
pos 34:  level.                      [content, ✓ IN LOSS]
pos 35:  <|eot_id|>                  [special, ✓ IN LOSS] ← assistant EOT IS in loss

Bookkeeping:

Quantity	Value	Note
Total tokens	36	Headers + content + EOTs
Special tokens	10	BOS + 3×start_header + 3×end_header + 3×EOT
Tokens in the loss	10	9 assistant words + assistant EOT
Loss-token fraction	≈ 28%	Typical for short replies; long replies hit 70-90%
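The bookkeeping table can be reproduced mechanically. This sketch uses the same word-level tokenization as the hand-rendered walkthrough (one token per word, one per special string, one for the double newline); real subword tokenizers will give different totals.

```python
from typing import List, Tuple

BOS = "<|begin_of_text|>"
START_HDR = "<|start_header_id|>"
END_HDR = "<|end_header_id|>"
EOT = "<|eot_id|>"
SPECIALS = {BOS, START_HDR, END_HDR, EOT}


def walkthrough_tokens(conv: List[Tuple[str, str]]) -> List[Tuple[str, bool]]:
    """Return (token, in_loss) pairs in the order of the hand-rendered table."""
    seq = [(BOS, False)]
    for role, content in conv:
        graded = role == "assistant"
        # Header bracket, role name, and whitespace are never graded.
        seq += [(START_HDR, False), (role, False), (END_HDR, False), ("\n\n", False)]
        seq += [(word, graded) for word in content.split()]
        # The end-of-turn token is graded only on assistant turns.
        seq.append((EOT, graded))
    return seq


conv = [
    ("system", "Answer in one sentence."),
    ("user", "What is the boiling point of water?"),
    ("assistant", "Water boils at 100 degrees Celsius at sea level."),
]
seq = walkthrough_tokens(conv)
total = len(seq)
n_special = sum(1 for tok, _ in seq if tok in SPECIALS)
n_loss = sum(1 for _, graded in seq if graded)
print(f"{total} tokens, {n_special} special, {n_loss} in loss ({n_loss / total:.0%})")
# → 36 tokens, 10 special, 10 in loss (28%)
```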

What the gradient does at each masked vs unmasked position: at positions 0–25 (BOS, system, user, assistant header), the model still runs a forward pass — every attention head sees every token — but the cross-entropy at those positions is zero, so no gradient flows from them. At positions 26–35 the cross-entropy is real and the gradient pushes the model to put higher probability on Water given the system+user context, on boils given that context plus Water, and so on.

Now repeat this for a 100-turn assistant-heavy conversation and you have the input to a real SFT step.

Visualizing the Template Zoo

The explorer below lets you edit a tiny three-turn chat and watch every major template family render it byte for byte. Click between Raw concatenation, ChatML, Llama-3, Gemma, and Mistral to see how wildly different the same conversation looks once it is wrapped for each model.


Two things to notice. First, the raw view collapses all three turns into a single space-separated string, which is exactly what a base model sees if you skip the template. There is no way for the model to know where one role ends and the next begins, which is why a base model on a chat task feels “drunk.” Second, every templated view adds a fixed overhead of structural tokens per turn (five in the Llama-3 walkthrough above: two header brackets, the role name, the whitespace, and the end-of-turn token). For a long multi-turn document with short turns, that overhead is a real fraction of the sequence, and almost all of it falls in the masked region: only the assistant's end-of-turn tokens stay in the loss.

Loss Masking: Teaching Only the Assistant Tokens

The template gets the data into the right shape. The loss mask decides what the model is actually graded on. The two are joined at the hip: the template defines the spans, the mask selects the spans that contribute gradient. Toggle the modes below to see every choice the field has tried and why only one of them is the modern default.


The assistant content + EOT mode (the third button) is the recipe every modern SFT codebase converges on. The reasoning chain is short and surprisingly tight:

Mask user / system tokens because the model is not the user. Training it on user tokens turns SFT into a slow, expensive form of pretraining on a tiny, weird corpus.
Mask role headers because the model does not generate headers at inference — the runtime emits them between turns. Training on them teaches a behaviour the model will never use.
Keep assistant content in the loss because that is the entire point of SFT.
Keep assistant <|eot_id|> in the loss because the inference loop watches for that token and stops as soon as it appears. A model that has not been graded on producing EOT does not produce EOT, and the user gets the famous “model rambles for 4096 tokens” bug.
Keep every assistant turn in multi-turn chats (not just the last one). Early HuggingFace defaults dropped all but the last assistant turn, wasting roughly half the carefully annotated SFT data on every multi-turn sample. The whole open-source instruct ecosystem in 2023 paid for this mistake before fixing it in 2024.
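The cost of grading only the last assistant turn is easy to quantify. The sketch below counts supervised tokens under both policies, using word counts as a stand-in for token counts; `graded_tokens` is an illustrative helper, not part of any library.

```python
from typing import List, Tuple


def graded_tokens(conv: List[Tuple[str, str]], last_only: bool = False) -> int:
    """Count tokens in the loss: assistant words plus one end-of-turn token per graded turn."""
    assistant_turns = [i for i, (role, _) in enumerate(conv) if role == "assistant"]
    if last_only:
        assistant_turns = assistant_turns[-1:]
    # Word count approximates token count; +1 accounts for the assistant's EOT.
    return sum(len(conv[i][1].split()) + 1 for i in assistant_turns)


conv = [
    ("user", "Name a noble gas."),
    ("assistant", "Helium is a noble gas."),
    ("user", "Name another."),
    ("assistant", "Neon is another noble gas."),
]

print(graded_tokens(conv))                  # every assistant turn graded → 12
print(graded_tokens(conv, last_only=True))  # only the final turn graded → 6
```

In this two-exchange example the last-turn-only policy discards half of the supervised tokens, and the loss grows with the number of turns.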

Plain Python: Rendering and Masking from Scratch

The full pipeline is about 160 lines of Python, comments included, with no third-party dependencies. We render the chat into a string in lock-step with a per-character boolean mask, tokenize, inherit the mask onto the tokens, and emit -100 at every masked position. This is the entire mechanism; every production SFT codebase is this script plus distributed-training glue.

Render + mask a Llama-3 chat — pure Python

🐍render_mask_plain.py


Special tokens: the model's role-vocabulary

EXAMPLE

BOS → id 128000   EOT → id 128009

Header wrappers: boundaries the model can attend to

These two tokens bracket the role name. The pair acts like an unambiguous delimiter: nowhere in normal pretraining text would <|start_header_id|>user<|end_header_id|> ever appear, so the model can learn 'when I see this pattern, the next several tokens are a user turn' with very few SFT examples.

End-of-turn token: the per-turn stop signal

<|eot_id|> is what the inference loop watches for. In serving code, model.generate() stops the moment this token is sampled, provided <|eot_id|> is configured as a stop token (for example through eos_token_id). If SFT does not include <|eot_id|> in the assistant turns AND in the loss, the model keeps talking until it runs out of token budget (the classic 'chatty assistant' bug).
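The stop behaviour can be illustrated with a toy decoding loop. This is a sketch only: `make_model` is a hypothetical stand-in that replays a scripted reply rather than a real language model, and `generate` is not the HuggingFace method of the same name.

```python
from typing import Callable, List

EOT = "<|eot_id|>"


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str], max_new_tokens: int) -> List[str]:
    """Sample tokens until EOT appears or the token budget runs out."""
    out: List[str] = []
    for _ in range(max_new_tokens):
        tok = next_token(prompt + out)
        if tok == EOT:
            break
        out.append(tok)
    return out


def make_model(reply: List[str], emits_eot: bool) -> Callable[[List[str]], str]:
    """Hypothetical model: replays `reply`, then emits EOT or endless filler."""
    stream = iter(reply + ([EOT] if emits_eot else []))

    def next_token(context: List[str]) -> str:
        return next(stream, "filler")

    return next_token


prompt = ["What", "is", "the", "boiling", "point", "of", "water?"]
reply = ["Water", "boils."]

trained = generate(make_model(reply, emits_eot=True), prompt, max_new_tokens=8)
untrained = generate(make_model(reply, emits_eot=False), prompt, max_new_tokens=8)

print(len(trained))    # stops at EOT → 2
print(len(untrained))  # never sees EOT, runs to the budget → 8
```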

The ignore index: how we 'mask' positions out of the loss

PyTorch's nn.CrossEntropyLoss takes an ignore_index argument. Any label tensor element equal to that value (default -100) is silently skipped. This is the entire loss-masking machinery — no special op, no extra layer. Set the label to -100 and the gradient at that position is exactly zero.

EXAMPLE

loss = F.cross_entropy(logits, labels, ignore_index=-100)

Turn = (role, content)

Every modern chat dataset normalises to this list-of-turns shape. ShareGPT, OpenAssistant, UltraChat, Tülu: they all distill down to a list of {role, content}. The first job of any SFT pipeline is to coerce the wild input format into exactly this.

EXAMPLE

Turn(role='user', content='What is 2+2?')

render_turn: wrap one turn in the Llama-3 envelope

Three pieces glued together: the role header, the content, the per-turn terminator. The double newline after <|end_header_id|> is part of the official template — Meta's training data has it consistently, so we have to as well. Omit it and the model never quite knows the header is finished.

EXAMPLE

render_turn(Turn('user','hi')) → '<|start_header_id|>user<|end_header_id|>\n\nhi<|eot_id|>'

render_chat: full conversation + per-character mask

We build TWO parallel arrays in lock-step: the output string and a boolean array where mask[i] = True iff character i belongs to assistant content (the part we want the model to learn to produce). Doing this character-by-character makes the tokenizer step trivial later — every token inherits the mask of its first character.

EXAMPLE

text = '<|begin_of_text|><|start_header_id|>...'  ;  mask = [False,False,...,True,...]

Prepend BOS once, mask it out

BOS goes at the very start of every training sequence and is NOT in the loss. The model is not asked to predict BOS because BOS is the prompt's first token by construction — there is no 'previous token' from which to predict it.

EXAMPLE

out = '<|begin_of_text|>' ; mask = [False]

Loop: emit prefix → content → suffix per turn

The prefix is always masked (the header is not what the model is being graded on producing — it is what tells the model whose turn is next). The content is masked iff role != 'assistant'. The suffix (EOT) is masked iff role != 'assistant' — for assistant turns the model MUST learn to emit EOT.

The key invariant: is_assistant drives the mask

This one boolean is the entire pedagogical contract of SFT: train the model to predict assistant tokens, NOT to predict the user's words. If you flip this and accidentally mask the assistant content while keeping user content in the loss, the model learns to be a great user simulator and a terrible assistant. The bug exists in the wild — see the Vicuna v0 → v1 patchnotes.

EXAMPLE

is_assistant=True → mask=[True,True,...]  ;  is_assistant=False → mask=[False,False,...]

Assistant <|eot_id|> is IN the loss; user <|eot_id|> is NOT

This is the single most copy-pasted subtlety in modern SFT. Without the assistant EOT in the loss, the model never learns to stop and generates until it hits max_new_tokens (the 'never-ending response' bug). With user EOTs in the loss, the model learns to interrupt itself mid-reply with an EOT — equally bad.

A toy tokenizer to make the structure visible

Real Llama-3 uses tiktoken-style BPE: each word is split into 1-3 sub-word tokens, special strings get their dedicated single id. We use a whitespace approximation so the printed table is human-readable. The labels[] and IGNORE handling is identical regardless of tokenizer.

EXAMPLE

tokenize('hello world', [True,True,...,True,True]) → tokens=['hello','world'], labels=[id, id]

Greedy match of special tokens first

We have to check for special tokens BEFORE doing whitespace splits, because <|begin_of_text|> contains the character '<' which would otherwise start a regular word. Real tokenizers register the special tokens with the byte-pair-encoder so this is automatic; here we do it by hand to keep things explicit.

EXAMPLE

i=0 → text='<|begin_of_text|>...' → match BOS → tokens=['<|begin_of_text|>'], i=17

Inherit the loss flag from the first character of the token

Because we built char_mask in lock-step with the string, the first character of any token tells us whether the whole token is in the loss. Production code can get the same information per token instead of per character: HuggingFace's apply_chat_template can return an 'assistant_masks' field when called with return_dict=True and return_assistant_tokens_mask=True, for templates that mark assistant spans with {% generation %} tags.

EXAMPLE

token='Water' at char index 228 → char_mask[228]=True → label=real_id, in loss

Real label = the NEXT token id (causal LM target)

Causal LMs predict the next token, so at position i the label is token_id[i+1]. We elide that detail (hash for fake ids) because the IGNORE handling is what this section is about. In production: labels = input_ids.clone() ; labels[~mask] = -100 — done in one line per the PyTorch example below.
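The shift between positions and targets is worth seeing once with concrete numbers. In this sketch the ids are made up (2 plays the role of EOT) and `shifted_pairs` is an illustrative helper that lists which position's logits are scored against which target.

```python
from typing import List, Tuple

IGNORE = -100


def shifted_pairs(input_ids: List[int], labels: List[int]) -> List[Tuple[int, int]]:
    """Pairs (i, target): logits at position i are scored against labels[i + 1]."""
    return [
        (i, labels[i + 1])
        for i in range(len(input_ids) - 1)
        if labels[i + 1] != IGNORE
    ]


# Hypothetical ids: a 4-token prompt followed by a 3-token assistant reply.
input_ids = [1, 10, 11, 2, 20, 21, 2]
in_loss = [False, False, False, False, True, True, True]

# labels = copy of input_ids with masked positions overwritten by IGNORE.
labels = [tok if keep else IGNORE for tok, keep in zip(input_ids, in_loss)]

print(shifted_pairs(input_ids, labels))  # → [(3, 20), (4, 21), (5, 2)]
```

Note that the logits at position 3, the last prompt token, are trained to predict the first assistant token: a masked position still produces predictions, it simply is never a target.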

Demo: print the rendered string and the token/label table

Running this prints the full Llama-3-formatted string followed by a per-token table showing exactly which tokens are graded. For the 3-turn demo above, the assistant's reply ('Water boils at 100 degrees Celsius at sea level.<|eot_id|>') gives 10 tokens in the loss out of 33 total (the toy tokenizer drops whitespace, so the three double newlines counted in the hand walkthrough do not appear). That ~30% loss-token fraction is typical for short-reply SFT data; long-reply SFT (multi-paragraph assistant turns) hits 70-90%.

EXAMPLE

10 / 33 tokens contribute to the loss.


"""
Render a multi-turn chat into a Llama-3-style template and build the
per-token label mask used for supervised fine-tuning.

Every modern open-source SFT codebase boils down to this script with more
edge cases. We keep it deliberately tiny so the mechanism is obvious:

  1. Define the special tokens.
  2. For each turn, emit a fixed prefix, the content, and a fixed suffix.
  3. Record which positions belong to the assistant's reply.
  4. Set every non-assistant label to -100 so PyTorch's CrossEntropyLoss
     ignores it.

The output has the same shape as what you would feed into model.forward()
during SFT.
"""

from dataclasses import dataclass
from typing import List, Tuple

# ---------------------------------------------------------------------------
# 1. Llama-3 special tokens (the strings that the tokenizer maps to single ids)
# ---------------------------------------------------------------------------

BOS         = "<|begin_of_text|>"
START_HDR   = "<|start_header_id|>"
END_HDR     = "<|end_header_id|>"
EOT         = "<|eot_id|>"

# -100 is the default ignore_index of PyTorch's nn.CrossEntropyLoss: labels
# with this value contribute zero to the loss.
IGNORE = -100


# ---------------------------------------------------------------------------
# 2. A turn is just (role, content).
# ---------------------------------------------------------------------------

@dataclass
class Turn:
    role: str       # "system" | "user" | "assistant"
    content: str


# ---------------------------------------------------------------------------
# 3. Render one turn into a string.
# ---------------------------------------------------------------------------

def render_turn(turn: Turn) -> str:
    return (
        f"{START_HDR}{turn.role}{END_HDR}\n\n"
        f"{turn.content}"
        f"{EOT}"
    )


# ---------------------------------------------------------------------------
# 4. Render the whole conversation + return per-character label mask.
# ---------------------------------------------------------------------------

def render_chat(turns: List[Turn]) -> Tuple[str, List[bool]]:
    """Return (full_text, per_char_mask). mask[i] is True iff char i belongs
    to an assistant reply or its closing <|eot_id|> (never to a header)."""
    out = BOS
    mask = [False] * len(BOS)

    for t in turns:
        prefix = f"{START_HDR}{t.role}{END_HDR}\n\n"
        out += prefix
        mask += [False] * len(prefix)

        is_assistant = t.role == "assistant"
        out += t.content
        mask += [is_assistant] * len(t.content)

        # We INCLUDE the assistant's <|eot_id|> in the loss so the model learns
        # when to stop. For user / system EOTs we mask them out.
        out += EOT
        mask += [is_assistant] * len(EOT)

    return out, mask


# ---------------------------------------------------------------------------
# 5. A toy "tokenizer": splits on whitespace, treats special strings as one
#    token each. Real Llama-3 uses tiktoken BPE; the structure is identical.
# ---------------------------------------------------------------------------

SPECIALS = {BOS, START_HDR, END_HDR, EOT}

def tokenize(text: str, char_mask: List[bool]) -> Tuple[List[str], List[int]]:
    """Returns (tokens, labels). labels[i] is the token id we want the model
    to predict at position i, or IGNORE to mask the loss at that position."""
    tokens: List[str] = []
    labels: List[int] = []

    i = 0
    n = len(text)
    while i < n:
        # Try to match a special token at position i.
        matched = None
        for s in SPECIALS:
            if text.startswith(s, i):
                matched = s
                break

        if matched:
            tokens.append(matched)
            # A special token is in the loss iff the *first char* is in the mask.
            labels.append(_label_for(matched, char_mask[i]))
            i += len(matched)
            continue

        # Whitespace is skipped, so it never becomes a token in this toy version.
        if text[i].isspace():
            i += 1
            continue

        # Otherwise grab the next whitespace-delimited word.
        j = i
        while j < n and not text[j].isspace() and not _starts_special(text, j):
            j += 1

        word = text[i:j]
        tokens.append(word)
        labels.append(_label_for(word, char_mask[i]))
        i = j

    return tokens, labels


def _starts_special(s: str, i: int) -> bool:
    return any(s.startswith(tok, i) for tok in SPECIALS)


def _label_for(token: str, in_loss: bool) -> int:
    # In real SFT, the label is the *next* token id. Here we use a fake "id"
    # equal to hash(token) just to show the shape; the IGNORE handling is the
    # part that matters.
    return hash(token) % 32000 if in_loss else IGNORE


# ---------------------------------------------------------------------------
# 6. Demo
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    chat = [
        Turn("system",    "Answer in one sentence."),
        Turn("user",      "What is the boiling point of water?"),
        Turn("assistant", "Water boils at 100 degrees Celsius at sea level."),
    ]

    text, char_mask = render_chat(chat)
    tokens, labels  = tokenize(text, char_mask)

    print(text)
    print("\n--- token / label table ---")
    for tok, lbl in zip(tokens, labels):
        flag = "✓" if lbl != IGNORE else " "
        print(f"  {flag}  {tok!r:<30}  label={lbl}")

    n_loss = sum(1 for lbl in labels if lbl != IGNORE)
    print(f"\n{n_loss} / {len(labels)} tokens contribute to the loss.")

PyTorch + HuggingFace: Doing the Same Thing in Production

In production we do not hand-write the template — every modern instruct checkpoint ships with a Jinja2 chat template inside its tokenizer_config.json. Calling tokenizer.apply_chat_template() runs it. The only thing we have to add is the per-turn mask, which we build by re-rendering the prefix conversation one turn at a time and using the resulting token counts as span boundaries.

Render + mask with HuggingFace — production pattern

🐍render_mask_hf.py


AutoTokenizer brings the template with it

Every modern instruct checkpoint on the HuggingFace Hub ships with a chat_template stored in tokenizer_config.json. Calling tokenizer.apply_chat_template() runs that Jinja2 template — you do NOT hand-write the special-token sequence yourself. The template is what guarantees byte-for-byte agreement between your SFT data and the format the model will see at inference time.

EXAMPLE

tokenizer.chat_template[:80] → '{% set loop_messages = messages %}...'

Pad token fallback: keep the collator from crashing on pad_token=None

Several open-source models (Llama-2, Mistral) ship with pad_token=None. The collator below needs SOMETHING to pad with; reusing EOS is the standard trick and is safe as long as the attention_mask correctly zeros out the padded positions (which it does).

The ignore index for cross-entropy

Same constant we used in the plain-Python version. Cross-entropy with ignore_index=-100 contributes exactly 0 to the loss AND 0 to the gradient at masked positions. The masked positions still go through the forward pass, the attention still sees them, only the loss is silent.

messages: the canonical SFT format

HuggingFace's convention since late 2023. Most datasets on the Hub that target chat-tuned models use this shape: a list of {role, content} dicts. Datasets that arrive in other formats (alpaca, sharegpt) are usually run through a one-liner converter before they reach the trainer.

EXAMPLE

row['messages'] = [{'role': 'user', 'content': '...'}, {'role': 'assistant', 'content': '...'}]

Render the whole chat into a string

apply_chat_template(tokenize=False) returns the full rendered string. With tokenize=True it returns input_ids directly — but we want the string here so we can re-tokenize sub-spans for the mask. add_generation_prompt=False because our last message IS an assistant turn; we are not asking the model to generate a new one, we are teaching it on an existing one.

EXAMPLE

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAnswer in one sentence.<|eot_id|>...'

Tokenize the full text WITHOUT adding more specials

apply_chat_template already inserted BOS and the role headers; we don't want the tokenizer to add a SECOND BOS at the front. add_special_tokens=False is the standard idiom for any string that came out of a template renderer.

EXAMPLE

input_ids = [128000, 128006, 9125, ...]  shape: list[int] of length S

labels initially = input_ids; we will mask non-assistant positions later

Start from a copy of input_ids. Wherever the model should be graded, the label stays at the real token id; wherever we mask, we will overwrite with IGNORE_INDEX. This is the same one-line pattern used in trl, axolotl, and torchtune: labels = input_ids.clone(); labels[~assistant_mask] = -100, where assistant_mask is True at the positions that stay in the loss.

Cursor walks through the token sequence one turn at a time

We don't have a character-level offset map (real tokenizers do via return_offsets_mapping, but apply_chat_template doesn't expose them cleanly). The fastest portable trick: render the conversation up through turn i, count tokens, and use that count as the end index of turn i. Subtract the previous cursor and you have the per-turn span.

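Re-render the partial conversation up to (and including) turn i

The template is Jinja2 over the messages list, so it produces deterministic output. By rendering [m0], [m0,m1], [m0,m1,m2] etc. and tokenizing each, we get a strictly increasing list of end-of-turn indices. The cost is one extra apply_chat_template + tokenize per turn, which makes the work quadratic in the number of turns but is negligible for typical chat lengths; the gain is correctness without depending on offset mappings. The trick assumes each partial render is an exact token prefix of the full render, which is worth asserting for any template that rewrites earlier turns.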

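end is the token index where turn i finishes

input_ids[cursor:end] is exactly the span of tokens belonging to turn i, headers and EOT included (the range is half-open: end is one past the turn's last token). Using the word-level counts from the hand-rendered walkthrough for the demo's 3-turn chat: system span = 0..9 (BOS included), user span = 10..21, assistant span = 22..35. Real subword tokenizers give different numbers, and a template that injects default system text shifts them further.

EXAMPLE

after system turn: cursor=0, end=10, span = [BOS, hdr_start, 'system', hdr_end, '\n\n', 'Answer', 'in', 'one', 'sentence.', EOT]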

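The masking rule: non-assistant → IGNORE

For system and user turns, every token in the span (including the role headers and the EOT) gets label IGNORE_INDEX. For assistant turns we leave the labels alone, so they keep their real token ids and contribute to the loss. This includes the assistant's <|eot_id|>, which is what teaches the model to stop. Because the span starts at the role header, this rule also leaves the assistant's header tokens in the loss, unlike the hand-rendered walkthrough; to mask them as well, start the assistant span after the header.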
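The cursor walk can be tried without downloading a tokenizer. The sketch below is a variant that also excludes the assistant header from the loss, so it reproduces the hand-rendered walkthrough (positions 26 to 35 graded). Everything here is an assumption for illustration: `render` mimics the Llama-3 envelope, `toy_tokenize` is a word-level stand-in for a real tokenizer, and token strings stand in for ids.

```python
import re
from typing import Dict, List, Tuple, Union

IGNORE_INDEX = -100
BOS, EOT = "<|begin_of_text|>", "<|eot_id|>"


def header(role: str) -> str:
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n"


def render(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    text = BOS + "".join(header(m["role"]) + m["content"] + EOT for m in messages)
    return text + (header("assistant") if add_generation_prompt else "")


def toy_tokenize(text: str) -> List[str]:
    # Special strings, the double newline, and words each become one token.
    return re.findall(r"<\|[a-z_]+\|>|\n\n|[^\s<]+", text)


def build_labels(messages: List[Dict[str, str]]) -> Tuple[List[str], List[Union[str, int]]]:
    tokens = toy_tokenize(render(messages))
    labels: List[Union[str, int]] = [IGNORE_INDEX] * len(tokens)
    for i, m in enumerate(messages):
        if m["role"] != "assistant":
            continue
        # Span start: everything before turn i plus the assistant header.
        start = len(toy_tokenize(render(messages[:i], add_generation_prompt=True)))
        # Span end: the prefix conversation up to and including turn i.
        prefix = toy_tokenize(render(messages[: i + 1]))
        end = len(prefix)
        assert tokens[:end] == prefix, "partial render is not a prefix of the full render"
        labels[start:end] = tokens[start:end]
    return tokens, labels


messages = [
    {"role": "system", "content": "Answer in one sentence."},
    {"role": "user", "content": "What is the boiling point of water?"},
    {"role": "assistant", "content": "Water boils at 100 degrees Celsius at sea level."},
]
tokens, labels = build_labels(messages)
graded = [i for i, lbl in enumerate(labels) if lbl != IGNORE_INDEX]
print(len(tokens), graded[0], graded[-1])  # → 36 26 35
```

The prefix assertion is the important design choice: if a template ever renders a partial conversation differently from the same turns inside the full conversation, the span boundaries are wrong and the assertion fails loudly instead of corrupting the mask.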

.map applies format_and_mask to every row

datasets.Dataset.map runs the function over the whole dataset (in parallel with num_proc=N if needed) and caches the result to disk. remove_columns drops the original 'messages' so the saved Arrow file is only input_ids + labels — about a third the size and ~5x faster to load at training time.

Pad labels with IGNORE, NOT with pad_token_id

This is the single most common collator bug in SFT. If you pad labels with pad_token_id, the model is graded on predicting padding tokens at every step beyond the real sequence length. Loss looks suspiciously low, gradients are dominated by the easy pad-prediction signal, and quality drops by 5-10 MMLU points before anyone notices. Always pad labels with -100.

EXAMPLE

labels = torch.full((B, max_len), IGNORE_INDEX, dtype=torch.long)  # NOT pad_token_id

attention_mask = 1 for real tokens, 0 for padding

Tells the transformer's attention layers to ignore padded positions. The masked-out tokens cannot be attended to, so they contribute nothing to any other token's hidden state. Combined with labels=-100 they are fully neutralised — present in the tensor for shape reasons, invisible to the loss.
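The two padding rules above fit in a few lines. This is a dependency-free sketch of a right-padding collator over plain Python lists; `collate` is an illustrative name, the ids are made up, and a training loop would wrap each field in a tensor before calling the model.

```python
from typing import Dict, List

IGNORE_INDEX = -100


def collate(rows: List[Dict[str, List[int]]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad a batch: pad ids for input_ids, IGNORE_INDEX for labels, 0 for attention."""
    max_len = max(len(r["input_ids"]) for r in rows)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "labels": [], "attention_mask": []}
    for r in rows:
        n = len(r["input_ids"])
        pad = max_len - n
        batch["input_ids"].append(r["input_ids"] + [pad_token_id] * pad)
        # Labels are padded with the ignore index, never with pad_token_id.
        batch["labels"].append(r["labels"] + [IGNORE_INDEX] * pad)
        batch["attention_mask"].append([1] * n + [0] * pad)
    return batch


rows = [
    {"input_ids": [1, 10, 11, 2], "labels": [-100, -100, 11, 2]},
    {"input_ids": [1, 12, 2], "labels": [-100, 12, 2]},
]
batch = collate(rows, pad_token_id=0)
print(batch["labels"][1])          # → [-100, 12, 2, -100]
print(batch["attention_mask"][1])  # → [1, 1, 1, 0]
```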

model(**batch): HF computes the loss for you when labels are present

AutoModelForCausalLM's forward pass detects the labels= kwarg and computes cross-entropy internally with the correct shift (predict token i+1 from positions 0..i) and the correct ignore_index. The returned out.loss is a scalar tensor ready for .backward(). out.logits is the [B, S, V] tensor of predictions if you need to do anything custom.

EXAMPLE

out.loss.shape → torch.Size([])  ;  out.logits.shape → [B, S, 128256]
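The shift and the ignore rule can be written out by hand. The following is a pure-Python sketch of a shifted cross-entropy averaged over non-ignored targets; it illustrates the idea described above and is not the library's code. The function names are hypothetical, and it assumes at least one graded target.

```python
import math
from typing import List

IGNORE_INDEX = -100


def log_softmax(row: List[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    top = max(row)
    log_z = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_z for x in row]


def causal_lm_loss(logits: List[List[float]], labels: List[int]) -> float:
    """logits is [S][V], labels is [S]; position t is graded on labels[t + 1]."""
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # masked targets contribute nothing to the loss
        total -= log_softmax(logits[t])[target]
        count += 1
    return total / count


# Uniform logits over a 3-token vocabulary: every prediction costs ln(3).
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 1]

loss = causal_lm_loss(logits, labels)
print(loss)
```

Only two targets are graded here (labels[2] and labels[3]); the last row of logits is never used because there is no next token to predict.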

Print and EYEBALL the first sample before launching a real run

This is the single highest-value debugging habit in SFT. A 10-hour 8-GPU run on a corrupted mask wastes ~$200 and a day. A 30-second sanity print catches most template and mask bugs: wrong number of assistant tokens in the loss, missing EOT, double-BOS, off-by-one cursor. The major SFT frameworks (trl, torchtune, axolotl) each offer some way to inspect tokenized samples before training; check the documentation of the version you use for the exact command.


"""
Render and mask a chat dataset for Llama-3 SFT using the production stack:
HuggingFace tokenizers + datasets + a single line of label-masking.

This is roughly what trl.SFTTrainer does under the hood (minus the multi-GPU
choreography). Replace MODEL_ID with any chat-template-aware checkpoint
whose turns end in a special token and everything below works unchanged.
"""

from typing import Dict, List
import torch
from datasets import Dataset
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# The collator below needs a pad token; reuse EOS to keep things simple.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

IGNORE_INDEX = -100


# ---------------------------------------------------------------------------
# 1. Toy SFT dataset (in production you'd load Tülu / UltraChat / etc.)
# ---------------------------------------------------------------------------
raw = [
    {
        "messages": [
            {"role": "system",    "content": "Answer in one sentence."},
            {"role": "user",      "content": "What is the boiling point of water?"},
            {"role": "assistant", "content": "Water boils at 100 degrees Celsius at sea level."},
        ]
    },
    {
        "messages": [
            {"role": "user",      "content": "And at the top of Everest?"},
            {"role": "assistant", "content": "About 71 degrees Celsius."},
        ]
    },
]
ds = Dataset.from_list(raw)


# ---------------------------------------------------------------------------
# 2. The rendering + masking function: one call per dataset row.
# ---------------------------------------------------------------------------
def format_and_mask(row: Dict) -> Dict:
    msgs: List[Dict] = row["messages"]

    # 2a. Render the WHOLE chat with the model's official template.
    full_text = tokenizer.apply_chat_template(
        msgs,
        tokenize=False,
        add_generation_prompt=False,   # we already have the assistant turn
    )

    # 2b. Tokenize the full render once. The template has already added BOS,
    #     so add_special_tokens=False avoids a second one.
    input_ids  = tokenizer(full_text, add_special_tokens=False).input_ids
    labels     = list(input_ids)               # we will overwrite below
    cursor     = 0                              # current position in input_ids

    # 2c. Re-render each growing prefix of the chat to find, in token space,
    #     where every turn ends.
    for i, m in enumerate(msgs):
        prefix_msgs = msgs[: i + 1]

        # Where does the conversation END after including this turn?
        partial_text = tokenizer.apply_chat_template(
            prefix_msgs, tokenize=False, add_generation_prompt=False,
        )
        end = len(tokenizer(partial_text, add_special_tokens=False).input_ids)

        if m["role"] != "assistant":
            # Mask EVERYTHING in this span: user / system tokens contribute 0.
            labels[cursor:end] = [IGNORE_INDEX] * (end - cursor)
        # Assistant spans keep their labels, role header and EOT included.

        cursor = end

    return {"input_ids": input_ids, "labels": labels}


ds = ds.map(format_and_mask, remove_columns=ds.column_names)


# ---------------------------------------------------------------------------
# 3. Collator: pad both input_ids AND labels to the longest in batch.
#    The labels' pad value MUST be IGNORE_INDEX, not the tokenizer's pad id,
#    or the model gets graded on predicting padding (the classic 'why is my
#    loss so low' bug).
# ---------------------------------------------------------------------------
def collate(batch: List[Dict]) -> Dict[str, torch.Tensor]:
    max_len = max(len(b["input_ids"]) for b in batch)
    pad_id  = tokenizer.pad_token_id

    input_ids = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    labels    = torch.full((len(batch), max_len), IGNORE_INDEX, dtype=torch.long)
    attn      = torch.zeros((len(batch), max_len), dtype=torch.long)

    for i, b in enumerate(batch):
        L = len(b["input_ids"])
        input_ids[i, :L] = torch.tensor(b["input_ids"])
        labels[i, :L]    = torch.tensor(b["labels"])
        attn[i, :L]      = 1

    return {"input_ids": input_ids, "labels": labels, "attention_mask": attn}


# ---------------------------------------------------------------------------
# 4. One step of training, for clarity, not optimised.
#    (The optimizer step and zero_grad are left to the caller.)
# ---------------------------------------------------------------------------
def train_step(model, batch):
    out = model(**batch)
    out.loss.backward()
    return out.loss.item()


# ---------------------------------------------------------------------------
# 5. Sanity print: verify the mask before launching a 10-hour run.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    sample = ds[0]
    decoded = tokenizer.batch_decode([sample["input_ids"]])[0]
    print(decoded)

    print("\n--- token / label ---")
    for tok_id, lbl in zip(sample["input_ids"], sample["labels"]):
        tok = tokenizer.decode([tok_id])
        flag = "✓" if lbl != IGNORE_INDEX else " "
        print(f"  {flag}  {tok!r:<20}  label={lbl}")

    n = sum(1 for l in sample["labels"] if l != IGNORE_INDEX)
    print(f"\n{n} / {len(sample['labels'])} tokens contribute to the loss.")

From toy script to real SFT

The script above is, in essence, the data path inside trl.SFTTrainer, torchtune, and axolotl, minus the distributed collation, the gradient checkpointing, and the FSDP wrap. Once the rendered + masked tensors are produced, the rest of SFT is identical to pretraining: forward pass, cross-entropy with ignore_index=-100, backward, optimizer step. Everything novel about SFT is here in the data layer.

At Massive Scale: Why Template Bugs Are Catastrophic

When the model is small, a template bug looks like a 1–2 point dip on MT-Bench and a slightly chattier assistant. When the model is large, the same bug can burn a large block of cluster time across training, evaluation, and the re-run, and delay a launch.

The compute multiplier on a bad mask

A 70B Llama-3 SFT run on a 500k-conversation dataset costs roughly $6 \cdot N \cdot D$ FLOPs per epoch, where $N = 70 \cdot 10^9$ and $D \approx 5 \cdot 10^8$ training tokens: about $2 \cdot 10^{20}$ FLOPs. On a 256-H100 cluster that is well under an hour of ideal compute per epoch; a real run with several epochs, checkpointing, evaluation, and queueing occupies the cluster for considerably longer. If the loss mask is wrong (say, it accidentally grades user tokens too), the run still converges to something, just to a model that has learned to imitate users some of the time. Discovering this requires a full eval cycle (another half-day of GPU time), an investigation, and a re-run. The raw compute is the cheap part; one template bug can easily cost a week of calendar time.
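The back-of-envelope arithmetic can be reproduced in a few lines. This is a sketch for one pass over the data; the per-GPU peak of 1e15 FLOP/s (roughly the dense BF16 figure for an H100) and the 40% utilisation are assumptions to be replaced with measured values from your own cluster, and the function names are hypothetical.

```python
def sft_flops(n_params: float, n_tokens: float) -> float:
    """Standard 6 * N * D estimate of training FLOPs for one pass."""
    return 6.0 * n_params * n_tokens


def ideal_hours(
    flops: float,
    n_gpus: int,
    peak_flops_per_gpu: float,
    utilisation: float,
) -> float:
    """Wall-clock hours if the cluster sustains the given utilisation."""
    sustained = n_gpus * peak_flops_per_gpu * utilisation
    return flops / sustained / 3600.0


N = 70e9  # parameters
D = 5e8   # training tokens in one epoch

flops = sft_flops(N, D)
hours = ideal_hours(
    flops,
    n_gpus=256,
    peak_flops_per_gpu=1e15,  # ASSUMPTION: approximate dense BF16 peak per GPU
    utilisation=0.4,          # ASSUMPTION: fraction of peak actually sustained
)
print(f"{flops:.2e} FLOPs, about {hours:.2f} h per epoch under these assumptions")
```

The estimate is very sensitive to the assumed throughput and ignores epochs, evaluation, and restarts, so it is best read as a lower bound on cluster time rather than a schedule.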

The data-mix multiplier

Frontier SFT mixes data from a dozen different sources (Tülu, UltraChat, in-house annotators, distilled outputs from a stronger model, math traces, code traces, tool-use traces). Each source ships in its own format. The first stage of every SFT pipeline is a normalisation pass that coerces all of them into the canonical messages list. A bug in that converter for any single source silently poisons that fraction of the training data, and because apply_chat_template only ever sees the converter's output, the poisoned data renders cleanly and trains without errors.
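A normalisation pass of this kind might look like the sketch below. It assumes two input layouts as they commonly appear in public datasets, a ShareGPT-style `conversations` list with `from`/`value` keys and an Alpaca-style `instruction`/`input`/`output` record; the converter and validator names are hypothetical, and a real pipeline would need one converter per source.

```python
from typing import Dict, List

Message = Dict[str, str]

# ASSUMPTION: speaker names used by ShareGPT-style data.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def from_sharegpt(row: Dict) -> List[Message]:
    """Convert a ShareGPT-style row into the canonical messages list."""
    return [
        {"role": ROLE_MAP[t["from"]], "content": t["value"]}
        for t in row["conversations"]
    ]


def from_alpaca(row: Dict) -> List[Message]:
    """Convert an Alpaca-style row into a single user/assistant exchange."""
    prompt = row["instruction"]
    if row.get("input"):
        prompt += "\n\n" + row["input"]
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": row["output"]},
    ]


def validate(msgs: List[Message]) -> List[Message]:
    """Fail loudly on shapes that would otherwise render without complaint."""
    if not msgs:
        raise ValueError("empty conversation")
    body = msgs[1:] if msgs[0]["role"] == "system" else msgs
    if not body:
        raise ValueError("conversation has only a system message")
    for i, m in enumerate(body):
        expected = "user" if i % 2 == 0 else "assistant"
        if m["role"] != expected:
            raise ValueError(f"turn {i}: expected {expected}, got {m['role']}")
        if not m["content"].strip():
            raise ValueError(f"turn {i}: empty content")
    if body[-1]["role"] != "assistant":
        raise ValueError("conversation does not end on an assistant turn")
    return msgs


row = {"conversations": [{"from": "human", "value": "Hi"},
                         {"from": "gpt", "value": "Hello!"}]}
clean = validate(from_sharegpt(row))
```

Running every converted row through a validator like this turns a silent poisoning into a crash at preprocessing time, which is far cheaper to debug.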

The inference-skew multiplier

Serving stacks such as vLLM, TGI, and SGLang can end up building the prompt from a template other than the one used in training, for example when a custom template is supplied at launch or when client code assembles the prompt string by hand. If the serving template differs from the training template by a single whitespace character, the model sees a prompt-shape distribution at inference that it never saw during SFT, and tail-task quality regresses while in-distribution evals look fine. The fix is always to reuse the tokenizer's own template at inference, even when that costs a couple of hours of integration work.
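A cheap guard against this kind of skew is to diff the two prompts directly. The sketch below compares strings from two hypothetical builders, `train_prompt` and `serve_prompt`, whose `<hdr>` and `<eot>` markers are invented; in a real setup one would more likely compare the token ids from apply_chat_template with add_generation_prompt=True against the ids the server actually feeds the model.

```python
from typing import Optional


def first_mismatch(a: str, b: str) -> Optional[int]:
    """Index of the first differing character, or None if a == b."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def train_prompt(user: str) -> str:
    """Hypothetical training-side render: two newlines after each header."""
    return f"<hdr>user</hdr>\n\n{user}<eot><hdr>assistant</hdr>\n\n"


def serve_prompt(user: str) -> str:
    """Hypothetical serving-side render: one newline after each header."""
    return f"<hdr>user</hdr>\n{user}<eot><hdr>assistant</hdr>\n"


idx = first_mismatch(train_prompt("Hi"), serve_prompt("Hi"))
if idx is not None:
    print(f"templates diverge at character {idx}: "
          f"{train_prompt('Hi')[idx]!r} vs {serve_prompt('Hi')[idx]!r}")
```

Wired into CI with a handful of representative conversations, a check like this fails the build the moment either side of the template drifts.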

Tool-call templates: the next layer of overhead

Once the model has to emit structured tool calls, the template has to define a sub-grammar for arguments. Llama-3.1 uses a dedicated <|python_tag|> prefix for its built-in tool calls; Qwen-2.5 wraps tool calls in <tool_call>…</tool_call>; Anthropic-style models use XML-like tags. The choice of format matters little for capability; what matters is that the template renders the tool call the same way at training time and at serving time. Every team that has shipped tool-use has been burned at least once by a mismatch in this layer.
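One way to keep training and serving aligned is to route both through a single render function and to test that a parser recovers what was rendered. The sketch below borrows the <tool_call>…</tool_call> wrapper mentioned above; the JSON payload layout, the sorted keys, and both function names are assumptions made for illustration rather than any model's official format.

```python
import json
import re
from typing import Any, Dict, List


def render_tool_call(name: str, arguments: Dict[str, Any]) -> str:
    """Render one tool call; sort_keys makes the output byte-stable."""
    payload = json.dumps({"name": name, "arguments": arguments}, sort_keys=True)
    return f"<tool_call>\n{payload}\n</tool_call>"


_TOOL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Extract every rendered tool call from a model response."""
    return [json.loads(body) for body in _TOOL_RE.findall(text)]


call = render_tool_call("get_weather", {"city": "Paris", "unit": "C"})
response = "Let me check. " + call
parsed = parse_tool_calls(response)
```

Using the same `render_tool_call` when building SFT targets and when replaying tool calls into the serving prompt removes one whole class of format skew, and the round-trip test catches accidental changes to either side.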

Engineering Reality: The Catalogue of Template Disasters

Two years of open-source SFT releases have produced a recurring cast of bugs. The pattern is always the same: the bug is invisible in train-loss curves, mostly invisible in standard evals, and only shows up when a user has a long conversation, asks a follow-up, or triggers a tool call. By the time the regression is caught, the weights have already been published. This is the list every SFT engineer eventually memorises.

The never-ending response. Assistant <|eot_id|> was masked out of the loss. The model generates beautifully and then keeps going until it hits max_new_tokens. Fix: include the assistant EOT in the loss; print one sample before training.
The double-BOS sequence. apply_chat_template already adds BOS; tokenizer(text) with default add_special_tokens=True adds another one. The model trains on <|begin_of_text|><|begin_of_text|>… and then sees only a single BOS at inference. Fix: always add_special_tokens=False when re-tokenizing template output.
User tokens in the loss. The mask got inverted. The model becomes a good user-simulator and a bad assistant. Fix: print one sample; sanity-check the assistant-token fraction (typically 30–80%, never 100%, never < 10%).
The padding-graded loss. The collator padded the labels tensor with pad_token_id instead of -100. Loss looks suspiciously low (predicting a single padding token is trivial); MMLU drops 5–10 points. Fix: pad labels with IGNORE_INDEX, always.
Template-version drift. The training code uses v0 of the template (with two newlines); the inference code uses v1 (with one newline). In-distribution evals are fine because they go through the same training code; users get a slightly worse model than the eval reports. Fix: ship the template with the checkpoint and render through that one template in training, evaluation, and serving.
Multi-turn last-only masking. The trainer was configured to grade only the final assistant reply per conversation. Multi-turn capability degrades; single-turn looks unchanged. Fix: grade every assistant span in multi-turn data.
Truncated mid-turn. The sequence was truncated to fit the context window, leaving a half-rendered assistant turn with no EOT. The model learns that some assistant turns simply stop in the middle of a word. Fix: drop the trailing partial turn rather than truncating it; never truncate inside a turn.
Tool-call format skew. The SFT data wrapped tool calls one way; the serving prompt builds them another. Tool-use tail accuracy crashes while normal chat stays fine. Fix: render tool calls with the model's own template at every layer.
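Several items in this catalogue can be caught mechanically before a run starts. The sketch below is one possible pre-flight check over a single tokenized sample; the function name, the thresholds, and the toy token ids (1 for BOS, 9 for EOT) are assumptions, and the fraction bounds should be tuned to the dataset.

```python
from typing import List

IGNORE_INDEX = -100


def preflight(
    input_ids: List[int],
    labels: List[int],
    bos_id: int,
    eot_id: int,
    min_frac: float = 0.10,
    max_frac: float = 0.999,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    if len(input_ids) != len(labels):
        return ["input_ids and labels differ in length"]
    if len(input_ids) >= 2 and input_ids[0] == bos_id and input_ids[1] == bos_id:
        problems.append("double BOS at start of sequence")
    graded = [y for y in labels if y != IGNORE_INDEX]
    if not graded:
        return problems + ["no tokens contribute to the loss"]
    frac = len(graded) / len(labels)
    if frac < min_frac or frac > max_frac:
        problems.append(f"graded fraction {frac:.2f} outside expected range")
    if eot_id not in graded:
        problems.append("assistant EOT is never graded")
    if input_ids[-1] != eot_id:
        problems.append("sequence does not end on EOT (truncated mid-turn?)")
    return problems


# A healthy sample, then one with three of the bugs from the list above.
ok = preflight([1, 50, 51, 9, 70, 71, 9],
               [-100, -100, -100, -100, 70, 71, 9], bos_id=1, eot_id=9)
bad = preflight([1, 1, 50, 9, 70, 71],
                [-100, -100, -100, -100, 70, 71], bos_id=1, eot_id=9)
print(ok, bad)
```

Run over the first few hundred rows of every data source, a check like this complements the eyeball print: the print shows what the sample looks like, and the assertions keep it that way.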

Fix these and what is left is the actual capability that supervised fine-tuning is supposed to teach. The next section (§13.4, SFT Training Configuration) builds the optimizer, schedule, and batch-size discipline that turns these correctly masked tokens into a model the user wants to talk to.

The mental model that unifies this section: the chat template is the API between conversations and tokens. SFT teaches the model to be fluent in the assistant half of that API. Everything else in this chapter — data collection, training config, forgetting mitigation — is in service of that single goal, and every one of them assumes the template is byte-perfect.