Chapter 5

Special Tokens for Sequence-to-Sequence

Subword Tokenization for Translation

Introduction

Machine translation uses special tokens to mark sequence boundaries and enable batching. Understanding when and how each special token is used is crucial for correct implementation.

This section covers the special token scheme for our German-English translation project.


5.1 The Special Tokens

Token Definitions

Token  ID  Symbol  Purpose
PAD    0   <pad>   Padding for batching
UNK    1   <unk>   Unknown/OOV tokens
BOS    2   <bos>   Beginning of sequence
EOS    3   <eos>   End of sequence

Why These Specific IDs?

PAD = 0 by convention:

  • Zero-initialization is common
  • Easy to create with torch.zeros()
  • Works well with embedding layers

UNK = 1:

  • Second-lowest priority
  • Rarely used with good tokenizers
  • Fallback for truly unknown text

BOS/EOS = 2/3:

  • Sequential after special tokens
  • Different IDs allow distinguishing start from end
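These ID assignments are worth pinning down in one place so the tokenizer, data pipeline, and model always agree. A minimal sketch using a frozen config class (the `SpecialTokens` name is our own choice, not a library API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialTokens:
    """Fixed special-token IDs shared by tokenizer, data pipeline, and model."""
    pad_id: int = 0
    unk_id: int = 1
    bos_id: int = 2
    eos_id: int = 3


SPECIAL = SpecialTokens()
```

If you train the tokenizer with SentencePiece, the matching trainer options are `pad_id=0`, `unk_id=1`, `bos_id=2`, `eos_id=3`; SentencePiece's defaults differ (it disables PAD by default), so set them explicitly.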

5.2 When Each Token is Used

Sequence Construction Overview

๐Ÿ“text
1Source (German):  [German tokens]                    โ†’ Encoder
2Target Input:     <bos> [English tokens]             โ†’ Decoder input
3Target Label:     [English tokens] <eos>             โ†’ Loss computation
4
5Batching adds:    <pad> tokens for alignment

Detailed Flow

๐Ÿ“text
1German sentence:  "Der Hund lรคuft."
2English sentence: "The dog runs."
3
4Tokenized:
5  Source tokens: [Der, Hund, lรคuft, .]
6  Target tokens: [The, dog, runs, .]
7
8With special tokens:
9  Encoder input:  [Der, Hund, lรคuft, .]
10  Decoder input:  [<bos>, The, dog, runs, .]
11  Decoder label:  [The, dog, runs, ., <eos>]
12                   โ†“    โ†“    โ†“    โ†“    โ†“
13  Decoder output: [The, dog, runs, ., <eos>]  (predicted)
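The construction above is plain list concatenation; a minimal sketch on string tokens:

```python
bos, eos = "<bos>", "<eos>"
target_tokens = ["The", "dog", "runs", "."]

# Decoder input is the target shifted right behind BOS;
# the label is the target followed by EOS.
decoder_input = [bos] + target_tokens
decoder_label = target_tokens + [eos]

# At position i, the decoder reads decoder_input[i] and is
# trained to predict decoder_label[i] (the next token).
print(decoder_input)  # ['<bos>', 'The', 'dog', 'runs', '.']
print(decoder_label)  # ['The', 'dog', 'runs', '.', '<eos>']
```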

5.3 Source Sequence (Encoder Input)

No Special Tokens Needed

The encoder processes the source sentence as-is:

๐Ÿpython
1def prepare_source(text: str, tokenizer) -> List[int]:
2    """
3    Prepare source sequence for encoder.
4
5    Encoder doesn't need BOS/EOS because:
6    - We're not generating source text
7    - Encoder sees the whole sequence at once
8    - Position encoding provides positional information
9    """
10    return tokenizer.encode_source(text)
11
12
13# Example
14source = "Der Hund lรคuft im Park."
15source_ids = tokenizer.encode_source(source)
16# โ†’ [Der, Hund, lรคuft, im, Park, .]
17# No BOS or EOS!

Why No BOS/EOS in Source?

  1. Not generating: We're encoding, not generating source text
  2. Full visibility: Encoder sees entire sequence simultaneously
  3. Simpler: Fewer tokens to process
  4. Standard practice: Most translation models skip source special tokens

Exception: Some Models Use Source Markers

Some models (like mBART) prepend a language tag and use an explicit end-of-sequence marker in the source:

```text
# mBART style (different from our approach)
<lang_de> Der Hund läuft. </s>
```

For our project, we follow the standard: no special tokens in source.


5.4 Target Sequence (Decoder Input/Labels)

The Shift Pattern

The decoder uses teacher forcing during training:

๐Ÿ“text
1Target text: "The dog runs."
2
3Decoder INPUT (shifted right):
4  [<bos>, The, dog, runs, .]
5  Position: 0    1    2    3   4
6
7Decoder LABEL (original + EOS):
8  [The, dog, runs, ., <eos>]
9  Position: 0   1    2   3   4
10
11At each position, decoder predicts the next token:
12  Input <bos> โ†’ Predict "The"    โœ“
13  Input The   โ†’ Predict "dog"    โœ“
14  Input dog   โ†’ Predict "runs"   โœ“
15  Input runs  โ†’ Predict "."      โœ“
16  Input .     โ†’ Predict <eos>    โœ“ (signal to stop)

Implementation

๐Ÿpython
1def prepare_target(
2    text: str,
3    tokenizer,
4    max_length: int = 128
5) -> Tuple[List[int], List[int]]:
6    """
7    Prepare target sequence for decoder.
8
9    Returns:
10        decoder_input: [BOS, tok1, tok2, ...] - input to decoder
11        decoder_label: [tok1, tok2, ..., EOS] - expected output
12    """
13    # Tokenize without special tokens
14    token_ids = tokenizer.sp.encode(text, out_type=int)
15
16    # Create decoder input (with BOS)
17    decoder_input = [tokenizer.bos_id] + token_ids
18
19    # Create decoder label (with EOS)
20    decoder_label = token_ids + [tokenizer.eos_id]
21
22    # Truncate if needed
23    if max_length:
24        decoder_input = decoder_input[:max_length]
25        decoder_label = decoder_label[:max_length]
26
27    return decoder_input, decoder_label
28
29
30# Example
31text = "The dog runs."
32dec_input, dec_label = prepare_target(text, tokenizer)
33
34print(f"Text: '{text}'")
35print(f"Decoder input: {dec_input}")  # [2, 45, 123, 456, 7]
36print(f"Decoder label: {dec_label}")   # [45, 123, 456, 7, 3]

5.5 The PAD Token

Why Padding is Necessary

Batching requires sequences of equal length:

๐Ÿ“text
1Sentence 1: "Hi"        โ†’ 2 tokens
2Sentence 2: "Hello world" โ†’ 3 tokens
3Sentence 3: "Goodbye"   โ†’ 2 tokens
4
5Without padding: Can't batch (different lengths)
6With padding:
7  [Hi, <pad>, <pad>]     โ†’ 3 tokens
8  [Hello, world, <pad>]  โ†’ 3 tokens
9  [Good, bye, <pad>]     โ†’ 3 tokens
10
11Now all have same length โ†’ Batchable!

Padding Implementation

๐Ÿpython
1def pad_sequences(
2    sequences: List[List[int]],
3    pad_id: int = 0,
4    max_length: Optional[int] = None
5) -> Tuple[torch.Tensor, torch.Tensor]:
6    """
7    Pad sequences to equal length.
8
9    Args:
10        sequences: List of token ID lists
11        pad_id: Padding token ID
12        max_length: Maximum length (or longest sequence if None)
13
14    Returns:
15        padded: [batch, max_len] tensor of token IDs
16        mask: [batch, max_len] tensor (1=real, 0=padding)
17    """
18    # Determine max length
19    if max_length is None:
20        max_length = max(len(seq) for seq in sequences)
21
22    batch_size = len(sequences)
23    padded = torch.full((batch_size, max_length), pad_id, dtype=torch.long)
24    mask = torch.zeros(batch_size, max_length, dtype=torch.long)
25
26    for i, seq in enumerate(sequences):
27        length = min(len(seq), max_length)
28        padded[i, :length] = torch.tensor(seq[:length])
29        mask[i, :length] = 1
30
31    return padded, mask
32
33
34# Example
35sequences = [
36    [10, 20, 30],
37    [40, 50],
38    [60, 70, 80, 90],
39]
40
41padded, mask = pad_sequences(sequences, pad_id=0)
42print(f"Padded:\n{padded}")
43print(f"Mask:\n{mask}")

Output:

๐Ÿ“text
1Padded:
2tensor([[10, 20, 30,  0],
3        [40, 50,  0,  0],
4        [60, 70, 80, 90]])
5Mask:
6tensor([[1, 1, 1, 0],
7        [1, 1, 0, 0],
8        [1, 1, 1, 1]])

Padding and Attention Masks

Padding tokens should be ignored in attention:

๐Ÿpython
1def create_padding_mask(input_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
2    """
3    Create mask that ignores padding tokens.
4
5    Args:
6        input_ids: [batch, seq_len] token IDs
7        pad_id: Padding token ID
8
9    Returns:
10        mask: [batch, 1, 1, seq_len] attention mask
11              (1 = attend, 0 = ignore)
12    """
13    # Where tokens are NOT padding
14    mask = (input_ids != pad_id).unsqueeze(1).unsqueeze(2)
15    return mask.float()
16
17
18# Example
19input_ids = torch.tensor([
20    [10, 20, 30, 0, 0],
21    [40, 50, 60, 70, 0],
22])
23
24mask = create_padding_mask(input_ids, pad_id=0)
25print(f"Input IDs:\n{input_ids}")
26print(f"Padding mask:\n{mask.squeeze()}")

Output:

๐Ÿ“text
1Input IDs:
2tensor([[10, 20, 30,  0,  0],
3        [40, 50, 60, 70,  0]])
4Padding mask:
5tensor([[1., 1., 1., 0., 0.],
6        [1., 1., 1., 1., 0.]])

5.6 Special Tokens and Loss Computation

Which Tokens Contribute to Loss?

Include in loss:

  • Regular tokens (predictions we care about)
  • EOS token (model must learn when to stop)

Exclude from loss:

  • PAD tokens (not real predictions)
  • BOS token (always given as input, never predicted)
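Because PAD appears only at positions we want to skip, the loss mask can be read directly off the labels. A pure-Python sketch with illustrative IDs (EOS = 3, PAD = 0); BOS never appears in labels, so it needs no special case:

```python
PAD_ID = 0

# Labels: three regular tokens, then EOS (3), then padding
labels = [45, 67, 89, 3, 0]

# 1 = position counts toward the loss, 0 = ignored
loss_mask = [1 if tok != PAD_ID else 0 for tok in labels]
print(loss_mask)  # [1, 1, 1, 1, 0] - tokens and EOS count, PAD does not
```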

Ignoring Padding in Loss

๐Ÿpython
1import torch.nn as nn
2
3
4def compute_translation_loss(
5    logits: torch.Tensor,
6    labels: torch.Tensor,
7    pad_id: int = 0
8) -> torch.Tensor:
9    """
10    Compute cross-entropy loss ignoring padding.
11
12    Args:
13        logits: [batch, seq_len, vocab_size] model predictions
14        labels: [batch, seq_len] target token IDs
15        pad_id: Padding token ID to ignore
16
17    Returns:
18        Scalar loss
19    """
20    # Reshape for cross entropy
21    # logits: [batch * seq_len, vocab_size]
22    # labels: [batch * seq_len]
23    batch_size, seq_len, vocab_size = logits.shape
24
25    logits_flat = logits.view(-1, vocab_size)
26    labels_flat = labels.view(-1)
27
28    # Use ignore_index to skip padding
29    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
30    loss = criterion(logits_flat, labels_flat)
31
32    return loss
33
34
35# Example
36batch_size, seq_len, vocab_size = 2, 5, 100
37
38# Fake logits
39logits = torch.randn(batch_size, seq_len, vocab_size)
40
41# Labels with padding
42labels = torch.tensor([
43    [45, 67, 89, 3, 0],   # "The dog runs EOS PAD"
44    [12, 34, 56, 78, 3],  # "A cat sleeps well EOS"
45])
46
47loss = compute_translation_loss(logits, labels, pad_id=0)
48print(f"Loss: {loss.item():.4f}")

5.7 Complete Data Preparation Pipeline

Full Translation Example

๐Ÿpython
1class TranslationExample:
2    """
3    A single translation training example.
4
5    Contains all components needed for training:
6    - Source sequence (encoder input)
7    - Target decoder input (with BOS)
8    - Target labels (with EOS)
9    - Masks for attention
10    """
11
12    def __init__(
13        self,
14        source_ids: List[int],
15        target_ids: List[int],
16        pad_id: int = 0,
17        bos_id: int = 2,
18        eos_id: int = 3
19    ):
20        # Store raw IDs
21        self.source_ids = source_ids
22        self.target_ids = target_ids
23
24        # Prepare decoder input and labels
25        self.decoder_input = [bos_id] + target_ids
26        self.decoder_label = target_ids + [eos_id]
27
28        self.pad_id = pad_id
29
30    def __repr__(self):
31        return (
32            f"TranslationExample(\n"
33            f"  source={self.source_ids}\n"
34            f"  decoder_input={self.decoder_input}\n"
35            f"  decoder_label={self.decoder_label}\n"
36            f")"
37        )
38
39
40def prepare_batch(
41    examples: List[TranslationExample],
42    pad_id: int = 0
43) -> dict:
44    """
45    Prepare a batch of examples for training.
46
47    Returns dictionary with:
48    - encoder_input: [batch, src_len]
49    - encoder_mask: [batch, 1, 1, src_len]
50    - decoder_input: [batch, tgt_len]
51    - decoder_mask: [batch, 1, tgt_len, tgt_len] (causal + padding)
52    - labels: [batch, tgt_len]
53    """
54    # Collect sequences
55    source_seqs = [ex.source_ids for ex in examples]
56    decoder_input_seqs = [ex.decoder_input for ex in examples]
57    label_seqs = [ex.decoder_label for ex in examples]
58
59    # Pad each type
60    encoder_input, encoder_padding = pad_sequences(source_seqs, pad_id)
61    decoder_input, decoder_padding = pad_sequences(decoder_input_seqs, pad_id)
62    labels, _ = pad_sequences(label_seqs, pad_id)
63
64    # Create attention masks
65    batch_size = len(examples)
66    src_len = encoder_input.size(1)
67    tgt_len = decoder_input.size(1)
68
69    # Encoder mask: just padding mask
70    encoder_mask = (encoder_input != pad_id).unsqueeze(1).unsqueeze(2).float()
71
72    # Decoder mask: causal + padding
73    # Causal mask: lower triangular
74    causal_mask = torch.tril(torch.ones(tgt_len, tgt_len))
75    # Padding mask
76    padding_mask = (decoder_input != pad_id).unsqueeze(1).unsqueeze(2).float()
77    # Combine
78    decoder_mask = causal_mask.unsqueeze(0) * padding_mask
79
80    # Cross-attention mask (decoder attending to encoder)
81    cross_mask = encoder_mask  # Same as encoder mask
82
83    return {
84        "encoder_input": encoder_input,
85        "encoder_mask": encoder_mask,
86        "decoder_input": decoder_input,
87        "decoder_mask": decoder_mask,
88        "cross_mask": cross_mask,
89        "labels": labels,
90    }
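The causal-plus-padding combination is easy to sanity-check on a tiny example. This pure-Python sketch applies the same rule without tensors: position j is visible from position i iff j <= i and token j is not padding (the function name is ours, for illustration only):

```python
PAD_ID = 0


def combined_decoder_mask(token_ids):
    """Row i marks the positions row i may attend to:
    1 = visible (not in the future, not padding), 0 = masked."""
    n = len(token_ids)
    return [
        [1 if j <= i and token_ids[j] != PAD_ID else 0 for j in range(n)]
        for i in range(n)
    ]


# <bos>=2, two real tokens, one pad
mask = combined_decoder_mask([2, 45, 67, 0])
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]   <- last row still cannot see the PAD position
```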

5.8 Inference-Time Handling

Generation Without Teacher Forcing

During inference, we don't have the target text:

๐Ÿpython
1def generate_translation(
2    model,
3    source_ids: torch.Tensor,
4    tokenizer,
5    max_length: int = 50,
6    device: str = "cpu"
7) -> List[int]:
8    """
9    Generate translation using greedy decoding.
10
11    Args:
12        model: Trained transformer model
13        source_ids: [1, src_len] source token IDs
14        tokenizer: Tokenizer with special token IDs
15        max_length: Maximum output length
16        device: Device to run on
17
18    Returns:
19        List of generated token IDs
20    """
21    model.eval()
22    source_ids = source_ids.to(device)
23
24    # Start with BOS token
25    generated = [tokenizer.bos_id]
26
27    with torch.no_grad():
28        # Encode source once
29        memory = model.encode(source_ids)
30
31        for _ in range(max_length):
32            # Prepare decoder input
33            decoder_input = torch.tensor([generated], device=device)
34
35            # Get next token prediction
36            logits = model.decode(decoder_input, memory)
37            next_token_logits = logits[0, -1, :]  # Last position
38            next_token = next_token_logits.argmax().item()
39
40            # Append to generated sequence
41            generated.append(next_token)
42
43            # Stop if EOS
44            if next_token == tokenizer.eos_id:
45                break
46
47    return generated[1:]  # Remove BOS
48
49
50# Usage example (pseudocode)
51# source = "Der Hund lรคuft im Park."
52# source_ids = tokenizer.encode_source(source)
53# generated_ids = generate_translation(model, source_ids, tokenizer)
54# translation = tokenizer.decode(generated_ids)
55# print(translation)  # "The dog runs in the park."

Key Differences: Training vs Inference

Aspect            Training              Inference
Target available  Yes                   No
Decoder input     Full target with BOS  Generated token by token
Teacher forcing   Yes                   No
EOS handling      In labels             Stop condition
PAD handling      In batch              Not needed

Summary

Special Token Usage

Token  Encoder   Decoder Input  Decoder Label  Loss
PAD    Masked    Masked         Ignored        No
UNK    Encoded   Encoded        Included       Yes
BOS    Not used  First token    Not used       No
EOS    Not used  Not used       Last token     Yes

The Complete Picture

๐Ÿ“text
1Training:
2  Source:       [src_1, src_2, ..., src_n, PAD, PAD]
3  Dec Input:    [BOS, tgt_1, tgt_2, ..., tgt_m, PAD]
4  Dec Label:    [tgt_1, tgt_2, ..., tgt_m, EOS, PAD]
5  Loss Mask:    [  1,    1,    1,  ...,  1,   1,  0]
6
7Inference:
8  Source:       [src_1, src_2, ..., src_n]
9  Dec Input:    [BOS] โ†’ [BOS, pred_1] โ†’ [BOS, pred_1, pred_2] โ†’ ...
10  Stop when:    pred_k == EOS

Implementation Checklist

  • โ˜ PAD token ID = 0 for easy masking
  • โ˜ Source sequences: no special tokens
  • โ˜ Target decoder input: prepend BOS
  • โ˜ Target labels: append EOS
  • โ˜ Loss computation: ignore PAD tokens
  • โ˜ Inference: start with BOS, stop at EOS

Exercises

Conceptual Questions

  1. Why does the decoder input have BOS but not EOS, while labels have EOS but not BOS?
  2. If we accidentally included PAD tokens in the loss, what would happen to training?
  3. During inference, what happens if the model never produces EOS?

Implementation Exercises

  1. Modify the batch preparation to support different padding strategies (left vs right padding).
  2. Implement beam search decoding that properly handles EOS tokens.
  3. Add support for multiple special tokens (e.g., language tags like <de>, <en>).

Analysis Exercises

  1. Compare training with and without label smoothing. How does it affect translation quality?
  2. Analyze the attention patterns at positions with PAD tokens. Are they properly masked?

Chapter Summary

What We Learned

  1. Why Subword: Handles OOV, balances vocab size and sequence length
  2. BPE Algorithm: Iterative merging of frequent pairs
  3. Implementation: From scratch and with SentencePiece
  4. Special Tokens: PAD, UNK, BOS, EOS and their roles
  5. Data Pipeline: Preparing batches for translation

Ready for the Project

With tokenization complete, we have:

  • Trained tokenizer for German-English
  • Proper handling of special tokens
  • Batch preparation with masking
  • Integration with PyTorch datasets

Next Chapter Preview

In Chapter 6, we'll implement the remaining transformer components: the Feed-Forward Network and Layer Normalization. These are the final building blocks before assembling complete encoder and decoder layers.