Introduction
Machine translation models use special tokens to mark sequence boundaries and to enable batching. Understanding when and how each special token is used is crucial for a correct implementation.
This section covers the special token scheme for our German-English translation project.
5.1 The Special Tokens
Token Definitions
| Token | ID | Symbol | Purpose |
|---|---|---|---|
| PAD | 0 | <pad> | Padding for batching |
| UNK | 1 | <unk> | Unknown/OOV tokens |
| BOS | 2 | <bos> | Beginning of sequence |
| EOS | 3 | <eos> | End of sequence |
Why These Specific IDs?
PAD = 0 by convention:
- Zero-initialization is common
- Easy to create with torch.zeros()
- Works well with embedding layers
UNK = 1:
- Second-lowest priority
- Rarely used with good tokenizers
- Fallback for truly unknown text
BOS/EOS = 2/3:
- Sequential after special tokens
- Different IDs allow distinguishing start from end
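PAD = 0 also pairs naturally with PyTorch's `nn.Embedding(padding_idx=...)`, which initializes the padding row to zeros and excludes it from gradient updates. A minimal sketch (the vocabulary size and embedding dimension here are arbitrary):

```python
import torch
import torch.nn as nn

PAD, UNK, BOS, EOS = 0, 1, 2, 3

# padding_idx=PAD zero-initializes the <pad> row and
# keeps it frozen during training.
embedding = nn.Embedding(num_embeddings=100, embedding_dim=16, padding_idx=PAD)

pad_vector = embedding(torch.tensor([PAD]))
print(pad_vector.abs().sum().item())  # 0.0 -- the padding row is all zeros
```

This is why padded positions contribute nothing to the embedded input even before masking is applied.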
5.2 When Each Token is Used
Sequence Construction Overview
```text
Source (German): [German tokens] → Encoder
Target Input:    <bos> [English tokens] → Decoder input
Target Label:    [English tokens] <eos> → Loss computation

Batching adds: <pad> tokens for alignment
```
Detailed Flow
```text
German sentence:  "Der Hund läuft."
English sentence: "The dog runs."

Tokenized:
  Source tokens: [Der, Hund, läuft, .]
  Target tokens: [The, dog, runs, .]

With special tokens:
  Encoder input:  [Der, Hund, läuft, .]
  Decoder input:  [<bos>, The, dog, runs, .]
  Decoder label:  [The,  dog,  runs,  .,  <eos>]
                    ↑     ↑     ↑     ↑     ↑
  Decoder output: [The,  dog,  runs,  .,  <eos>] (predicted)
```
5.3 Source Sequence (Encoder Input)
No Special Tokens Needed
The encoder processes the source sentence as-is:
```python
from typing import List


def prepare_source(text: str, tokenizer) -> List[int]:
    """
    Prepare source sequence for encoder.

    Encoder doesn't need BOS/EOS because:
    - We're not generating source text
    - Encoder sees the whole sequence at once
    - Position encoding provides positional information
    """
    return tokenizer.encode_source(text)


# Example
source = "Der Hund läuft im Park."
source_ids = tokenizer.encode_source(source)
# → [Der, Hund, läuft, im, Park, .]
# No BOS or EOS!
```
Why No BOS/EOS in Source?
- Not generating: We're encoding, not generating source text
- Full visibility: Encoder sees entire sequence simultaneously
- Simpler: Fewer tokens to process
- Standard practice: Most translation models skip source special tokens
Exception: Some Models Use Source Markers
Some models (like BART) use special source formatting:
```text
# BART style (different from our approach)
<lang_de> Der Hund läuft. </s>
```
For our project, we follow the standard: no special tokens in source.
5.4 Target Sequence (Decoder Input/Labels)
The Shift Pattern
The decoder uses teacher forcing during training:
```text
Target text: "The dog runs."

Decoder INPUT (shifted right):
  [<bos>, The, dog, runs, .]
  Position: 0    1    2    3   4

Decoder LABEL (original + EOS):
  [The, dog, runs, ., <eos>]
  Position: 0   1    2   3    4

At each position, decoder predicts the next token:
  Input <bos> → Predict "The"  ✓
  Input The   → Predict "dog"  ✓
  Input dog   → Predict "runs" ✓
  Input runs  → Predict "."    ✓
  Input .     → Predict <eos>  ✓ (signal to stop)
```
Implementation
```python
from typing import List, Tuple


def prepare_target(
    text: str,
    tokenizer,
    max_length: int = 128
) -> Tuple[List[int], List[int]]:
    """
    Prepare target sequence for decoder.

    Returns:
        decoder_input: [BOS, tok1, tok2, ...] - input to decoder
        decoder_label: [tok1, tok2, ..., EOS] - expected output
    """
    # Tokenize without special tokens
    token_ids = tokenizer.sp.encode(text, out_type=int)

    # Create decoder input (with BOS)
    decoder_input = [tokenizer.bos_id] + token_ids

    # Create decoder label (with EOS)
    decoder_label = token_ids + [tokenizer.eos_id]

    # Truncate if needed
    if max_length:
        decoder_input = decoder_input[:max_length]
        decoder_label = decoder_label[:max_length]

    return decoder_input, decoder_label


# Example
text = "The dog runs."
dec_input, dec_label = prepare_target(text, tokenizer)

print(f"Text: '{text}'")
print(f"Decoder input: {dec_input}")  # [2, 45, 123, 456, 7]
print(f"Decoder label: {dec_label}")  # [45, 123, 456, 7, 3]
```
5.5 The PAD Token
Why Padding is Necessary
Batching requires sequences of equal length:
```text
Sentence 1: "Hi"          → 1 token
Sentence 2: "Hello world" → 2 tokens
Sentence 3: "Goodbye"     → 2 tokens (Good + bye)

Without padding: Can't batch (different lengths)
With padding (to the longest sequence):
  [Hi,    <pad>] → 2 tokens
  [Hello, world] → 2 tokens
  [Good,  bye  ] → 2 tokens

Now all have the same length → Batchable!
```
Padding Implementation
```python
from typing import List, Optional, Tuple

import torch


def pad_sequences(
    sequences: List[List[int]],
    pad_id: int = 0,
    max_length: Optional[int] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Pad sequences to equal length.

    Args:
        sequences: List of token ID lists
        pad_id: Padding token ID
        max_length: Maximum length (or longest sequence if None)

    Returns:
        padded: [batch, max_len] tensor of token IDs
        mask: [batch, max_len] tensor (1=real, 0=padding)
    """
    # Determine max length
    if max_length is None:
        max_length = max(len(seq) for seq in sequences)

    batch_size = len(sequences)
    padded = torch.full((batch_size, max_length), pad_id, dtype=torch.long)
    mask = torch.zeros(batch_size, max_length, dtype=torch.long)

    for i, seq in enumerate(sequences):
        length = min(len(seq), max_length)
        padded[i, :length] = torch.tensor(seq[:length])
        mask[i, :length] = 1

    return padded, mask


# Example
sequences = [
    [10, 20, 30],
    [40, 50],
    [60, 70, 80, 90],
]

padded, mask = pad_sequences(sequences, pad_id=0)
print(f"Padded:\n{padded}")
print(f"Mask:\n{mask}")
```
Output:
```text
Padded:
tensor([[10, 20, 30,  0],
        [40, 50,  0,  0],
        [60, 70, 80, 90]])
Mask:
tensor([[1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 1]])
```
Padding and Attention Masks
Padding tokens should be ignored in attention:
```python
import torch


def create_padding_mask(input_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """
    Create mask that ignores padding tokens.

    Args:
        input_ids: [batch, seq_len] token IDs
        pad_id: Padding token ID

    Returns:
        mask: [batch, 1, 1, seq_len] attention mask
              (1 = attend, 0 = ignore)
    """
    # Where tokens are NOT padding
    mask = (input_ids != pad_id).unsqueeze(1).unsqueeze(2)
    return mask.float()


# Example
input_ids = torch.tensor([
    [10, 20, 30, 0, 0],
    [40, 50, 60, 70, 0],
])

mask = create_padding_mask(input_ids, pad_id=0)
print(f"Input IDs:\n{input_ids}")
print(f"Padding mask:\n{mask.squeeze()}")
```
Output:
```text
Input IDs:
tensor([[10, 20, 30,  0,  0],
        [40, 50, 60, 70,  0]])
Padding mask:
tensor([[1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.]])
```
5.6 Special Tokens and Loss Computation
Which Tokens Contribute to Loss?
Include in loss:
- Regular tokens (predictions we care about)
- EOS token (model must learn when to stop)
Exclude from loss:
- PAD tokens (not real predictions)
- BOS token (always given as input, never predicted)
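This include/exclude rule is easy to check directly: a boolean mask over the labels shows exactly which positions count. A minimal sketch, assuming our PAD = 0, EOS = 3 scheme and made-up token IDs:

```python
import torch

PAD = 0

# One padded label row: three real tokens, EOS (3), then padding.
labels = torch.tensor([[45, 67, 89, 3, PAD, PAD]])

# 1 where the position contributes to the loss, 0 where it is padding.
loss_mask = (labels != PAD).long()
print(loss_mask)        # tensor([[1, 1, 1, 1, 0, 0]])
print(loss_mask.sum())  # tensor(4) -- three tokens + EOS count, PAD does not
```

Note that EOS is an ordinary position as far as the loss is concerned; only PAD is excluded.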
Ignoring Padding in Loss
```python
import torch
import torch.nn as nn


def compute_translation_loss(
    logits: torch.Tensor,
    labels: torch.Tensor,
    pad_id: int = 0
) -> torch.Tensor:
    """
    Compute cross-entropy loss ignoring padding.

    Args:
        logits: [batch, seq_len, vocab_size] model predictions
        labels: [batch, seq_len] target token IDs
        pad_id: Padding token ID to ignore

    Returns:
        Scalar loss
    """
    # Reshape for cross entropy
    # logits: [batch * seq_len, vocab_size]
    # labels: [batch * seq_len]
    batch_size, seq_len, vocab_size = logits.shape

    logits_flat = logits.view(-1, vocab_size)
    labels_flat = labels.view(-1)

    # Use ignore_index to skip padding
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    loss = criterion(logits_flat, labels_flat)

    return loss


# Example
batch_size, seq_len, vocab_size = 2, 5, 100

# Fake logits
logits = torch.randn(batch_size, seq_len, vocab_size)

# Labels with padding
labels = torch.tensor([
    [45, 67, 89, 3, 0],   # "The dog runs" + EOS + PAD
    [12, 34, 56, 78, 3],  # "A cat sleeps well" + EOS
])

loss = compute_translation_loss(logits, labels, pad_id=0)
print(f"Loss: {loss.item():.4f}")
```
5.7 Complete Data Preparation Pipeline
Full Translation Example
```python
from typing import List

import torch


class TranslationExample:
    """
    A single translation training example.

    Contains all components needed for training:
    - Source sequence (encoder input)
    - Target decoder input (with BOS)
    - Target labels (with EOS)
    - Masks for attention
    """

    def __init__(
        self,
        source_ids: List[int],
        target_ids: List[int],
        pad_id: int = 0,
        bos_id: int = 2,
        eos_id: int = 3
    ):
        # Store raw IDs
        self.source_ids = source_ids
        self.target_ids = target_ids

        # Prepare decoder input and labels
        self.decoder_input = [bos_id] + target_ids
        self.decoder_label = target_ids + [eos_id]

        self.pad_id = pad_id

    def __repr__(self):
        return (
            f"TranslationExample(\n"
            f"  source={self.source_ids}\n"
            f"  decoder_input={self.decoder_input}\n"
            f"  decoder_label={self.decoder_label}\n"
            f")"
        )


def prepare_batch(
    examples: List[TranslationExample],
    pad_id: int = 0
) -> dict:
    """
    Prepare a batch of examples for training.

    Returns dictionary with:
    - encoder_input: [batch, src_len]
    - encoder_mask: [batch, 1, 1, src_len]
    - decoder_input: [batch, tgt_len]
    - decoder_mask: [batch, 1, tgt_len, tgt_len] (causal + padding)
    - labels: [batch, tgt_len]
    """
    # Collect sequences
    source_seqs = [ex.source_ids for ex in examples]
    decoder_input_seqs = [ex.decoder_input for ex in examples]
    label_seqs = [ex.decoder_label for ex in examples]

    # Pad each type (pad_sequences is defined in Section 5.5)
    encoder_input, encoder_padding = pad_sequences(source_seqs, pad_id)
    decoder_input, decoder_padding = pad_sequences(decoder_input_seqs, pad_id)
    labels, _ = pad_sequences(label_seqs, pad_id)

    tgt_len = decoder_input.size(1)

    # Encoder mask: just padding mask
    encoder_mask = (encoder_input != pad_id).unsqueeze(1).unsqueeze(2).float()

    # Decoder mask: causal + padding
    # Causal mask: lower triangular [tgt_len, tgt_len]
    causal_mask = torch.tril(torch.ones(tgt_len, tgt_len))
    # Padding mask: [batch, 1, 1, tgt_len]
    padding_mask = (decoder_input != pad_id).unsqueeze(1).unsqueeze(2).float()
    # Combine (broadcasts to [batch, 1, tgt_len, tgt_len])
    decoder_mask = causal_mask.unsqueeze(0) * padding_mask

    # Cross-attention mask (decoder attending to encoder)
    cross_mask = encoder_mask  # Same as encoder mask

    return {
        "encoder_input": encoder_input,
        "encoder_mask": encoder_mask,
        "decoder_input": decoder_input,
        "decoder_mask": decoder_mask,
        "cross_mask": cross_mask,
        "labels": labels,
    }
```
5.8 Inference-Time Handling
Generation Without Teacher Forcing
During inference, we don't have the target text:
```python
from typing import List

import torch


def generate_translation(
    model,
    source_ids: torch.Tensor,
    tokenizer,
    max_length: int = 50,
    device: str = "cpu"
) -> List[int]:
    """
    Generate translation using greedy decoding.

    Args:
        model: Trained transformer model
        source_ids: [1, src_len] source token IDs
        tokenizer: Tokenizer with special token IDs
        max_length: Maximum output length
        device: Device to run on

    Returns:
        List of generated token IDs
    """
    model.eval()
    source_ids = source_ids.to(device)

    # Start with BOS token
    generated = [tokenizer.bos_id]

    with torch.no_grad():
        # Encode source once
        memory = model.encode(source_ids)

        for _ in range(max_length):
            # Prepare decoder input
            decoder_input = torch.tensor([generated], device=device)

            # Get next token prediction
            logits = model.decode(decoder_input, memory)
            next_token_logits = logits[0, -1, :]  # Last position
            next_token = next_token_logits.argmax().item()

            # Append to generated sequence
            generated.append(next_token)

            # Stop if EOS
            if next_token == tokenizer.eos_id:
                break

    return generated[1:]  # Remove BOS


# Usage example (pseudocode)
# source = "Der Hund läuft im Park."
# source_ids = tokenizer.encode_source(source)
# generated_ids = generate_translation(model, source_ids, tokenizer)
# translation = tokenizer.decode(generated_ids)
# print(translation)  # "The dog runs in the park."
```
Key Differences: Training vs Inference
| Aspect | Training | Inference |
|---|---|---|
| Target available | Yes | No |
| Decoder input | Full target with BOS | Generate token by token |
| Teacher forcing | Yes | No |
| EOS handling | In labels | Stop condition |
| PAD handling | In batch | Not needed |
Summary
Special Token Usage
| Token | Encoder | Decoder Input | Decoder Label | Loss |
|---|---|---|---|---|
| PAD | Masked | Masked | Ignored | No |
| UNK | Encoded | Encoded | Include | Yes |
| BOS | Not used | First token | Not used | No |
| EOS | Not used | Not used | Last token | Yes |
The Complete Picture
```text
Training:
  Source:    [src_1, src_2, ..., src_n, PAD, PAD]
  Dec Input: [BOS, tgt_1, tgt_2, ..., tgt_m, PAD]
  Dec Label: [tgt_1, tgt_2, ..., tgt_m, EOS, PAD]
  Loss Mask: [  1,     1,  ...,    1,    1,    0]

Inference:
  Source:    [src_1, src_2, ..., src_n]
  Dec Input: [BOS] → [BOS, pred_1] → [BOS, pred_1, pred_2] → ...
  Stop when: pred_k == EOS
```
Implementation Checklist
- ✓ PAD token ID = 0 for easy masking
- ✓ Source sequences: no special tokens
- ✓ Target decoder input: prepend BOS
- ✓ Target labels: append EOS
- ✓ Loss computation: ignore PAD tokens
- ✓ Inference: start with BOS, stop at EOS
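Several of these checklist items can be verified mechanically. A small sanity check of the shift pattern from Section 5.4, assuming our IDs (PAD = 0, BOS = 2, EOS = 3) and made-up token IDs:

```python
PAD, BOS, EOS = 0, 2, 3

token_ids = [45, 123, 456, 7]      # tokenized target, no special tokens

decoder_input = [BOS] + token_ids  # prepend BOS
decoder_label = token_ids + [EOS]  # append EOS

# The shift invariant: input at position i+1 equals label at position i.
assert decoder_input[1:] == decoder_label[:-1]
assert decoder_input[0] == BOS and decoder_label[-1] == EOS
assert PAD not in decoder_input    # padding only enters at batching time
print("shift pattern OK")
```

Checks like these make good unit tests for the data pipeline, since shift-pattern bugs are silent: training still runs, but the model learns to copy instead of predict.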
Exercises
Conceptual Questions
- Why does the decoder input have BOS but not EOS, while labels have EOS but not BOS?
- If we accidentally included PAD tokens in the loss, what would happen to training?
- During inference, what happens if the model never produces EOS?
Implementation Exercises
- Modify the batch preparation to support different padding strategies (left vs right padding).
- Implement beam search decoding that properly handles EOS tokens.
- Add support for multiple special tokens (e.g., language tags like <de>, <en>).
Analysis Exercises
- Compare training with and without label smoothing. How does it affect translation quality?
- Analyze the attention patterns at positions with PAD tokens. Are they properly masked?
Chapter Summary
What We Learned
- Why Subword: Handles OOV, balances vocab size and sequence length
- BPE Algorithm: Iterative merging of frequent pairs
- Implementation: From scratch and with SentencePiece
- Special Tokens: PAD, UNK, BOS, EOS and their roles
- Data Pipeline: Preparing batches for translation
Ready for the Project
With tokenization complete, we have:
- Trained tokenizer for German-English
- Proper handling of special tokens
- Batch preparation with masking
- Integration with PyTorch datasets
Next Chapter Preview
In Chapter 6, we'll implement the remaining transformer components: the Feed-Forward Network and Layer Normalization. These are the final building blocks before assembling complete encoder and decoder layers.