Chapter 5

Special Tokens for Sequence-to-Sequence

Subword Tokenization for Translation

Introduction

Machine translation uses special tokens to mark sequence boundaries and enable batching. Understanding when and how each special token is used is crucial for correct implementation.

This section covers the special token scheme for our German-English translation project.


5.1 The Special Tokens

Token Definitions

Token  ID  Symbol  Purpose
PAD    0   <pad>   Padding for batching
UNK    1   <unk>   Unknown/OOV tokens
BOS    2   <bos>   Beginning of sequence
EOS    3   <eos>   End of sequence

Why These Specific IDs?

PAD = 0 by convention:

  • Zero-initialization is common
  • Easy to create with torch.zeros()
  • Works well with embedding layers

UNK = 1:

  • Second-lowest priority
  • Rarely used with good tokenizers
  • Fallback for truly unknown text

BOS/EOS = 2/3:

  • Sequential after special tokens
  • Different IDs allow distinguishing start from end
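These ID assignments are worth pinning down in one place so the tokenizer, data pipeline, and model always agree. A minimal sketch using a frozen config class (the `SpecialTokens` name is our own choice, not a library API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialTokens:
    """Fixed special-token IDs shared by tokenizer, data pipeline, and model."""
    pad_id: int = 0
    unk_id: int = 1
    bos_id: int = 2
    eos_id: int = 3


SPECIAL = SpecialTokens()
```

If you train the tokenizer with SentencePiece, the matching trainer options are `pad_id=0`, `unk_id=1`, `bos_id=2`, `eos_id=3`; SentencePiece's defaults differ (it disables PAD by default), so set them explicitly.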

5.2 When Each Token is Used

Sequence Construction Overview

๐Ÿ“text
1Source (German):  [German tokens]                    โ†’ Encoder
2Target Input:     <bos> [English tokens]             โ†’ Decoder input
3Target Label:     [English tokens] <eos>             โ†’ Loss computation
4
5Batching adds:    <pad> tokens for alignment

Detailed Flow

๐Ÿ“text
1German sentence:  "Der Hund lรคuft."
2English sentence: "The dog runs."
3
4Tokenized:
5  Source tokens: [Der, Hund, lรคuft, .]
6  Target tokens: [The, dog, runs, .]
7
8With special tokens:
9  Encoder input:  [Der, Hund, lรคuft, .]
10  Decoder input:  [<bos>, The, dog, runs, .]
11  Decoder label:  [The, dog, runs, ., <eos>]
12                   โ†“    โ†“    โ†“    โ†“    โ†“
13  Decoder output: [The, dog, runs, ., <eos>]  (predicted)
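The construction above is plain list concatenation; a minimal sketch on string tokens:

```python
bos, eos = "<bos>", "<eos>"
target_tokens = ["The", "dog", "runs", "."]

# Decoder input is the target shifted right behind BOS;
# the label is the target followed by EOS.
decoder_input = [bos] + target_tokens
decoder_label = target_tokens + [eos]

# At position i, the decoder reads decoder_input[i] and is
# trained to predict decoder_label[i] (the next token).
print(decoder_input)  # ['<bos>', 'The', 'dog', 'runs', '.']
print(decoder_label)  # ['The', 'dog', 'runs', '.', '<eos>']
```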

5.3 Source Sequence (Encoder Input)

No Special Tokens Needed

The encoder processes the source sentence as-is:

๐Ÿpython
1def prepare_source(text: str, tokenizer) -> List[int]:
2    """
3    Prepare source sequence for encoder.
4
5    Encoder doesn't need BOS/EOS because:
6    - We're not generating source text
7    - Encoder sees the whole sequence at once
8    - Position encoding provides positional information
9    """
10    return tokenizer.encode_source(text)
11
12
13# Example
14source = "Der Hund lรคuft im Park."
15source_ids = tokenizer.encode_source(source)
16# โ†’ [Der, Hund, lรคuft, im, Park, .]
17# No BOS or EOS!

Why No BOS/EOS in Source?

  1. Not generating: We're encoding, not generating source text
  2. Full visibility: Encoder sees entire sequence simultaneously
  3. Simpler: Fewer tokens to process
  4. Standard practice: Most translation models skip source special tokens

Exception: Some Models Use Source Markers

Some models (like mBART) prepend a language tag and use an explicit end-of-sequence marker in the source:

```text
# mBART style (different from our approach)
<lang_de> Der Hund läuft. </s>
```

For our project, we follow the standard: no special tokens in source.


5.4 Target Sequence (Decoder Input/Labels)

The Shift Pattern

The decoder uses teacher forcing during training:

๐Ÿ“text
1Target text: "The dog runs."
2
3Decoder INPUT (shifted right):
4  [<bos>, The, dog, runs, .]
5  Position: 0    1    2    3   4
6
7Decoder LABEL (original + EOS):
8  [The, dog, runs, ., <eos>]
9  Position: 0   1    2   3   4
10
11At each position, decoder predicts the next token:
12  Input <bos> โ†’ Predict "The"    โœ“
13  Input The   โ†’ Predict "dog"    โœ“
14  Input dog   โ†’ Predict "runs"   โœ“
15  Input runs  โ†’ Predict "."      โœ“
16  Input .     โ†’ Predict <eos>    โœ“ (signal to stop)

Implementation

๐Ÿpython
1def prepare_target(
2    text: str,
3    tokenizer,
4    max_length: int = 128
5) -> Tuple[List[int], List[int]]:
6    """
7    Prepare target sequence for decoder.
8
9    Returns:
10        decoder_input: [BOS, tok1, tok2, ...] - input to decoder
11        decoder_label: [tok1, tok2, ..., EOS] - expected output
12    """
13    # Tokenize without special tokens
14    token_ids = tokenizer.sp.encode(text, out_type=int)
15
16    # Create decoder input (with BOS)
17    decoder_input = [tokenizer.bos_id] + token_ids
18
19    # Create decoder label (with EOS)
20    decoder_label = token_ids + [tokenizer.eos_id]
21
22    # Truncate if needed
23    if max_length:
24        decoder_input = decoder_input[:max_length]
25        decoder_label = decoder_label[:max_length]
26
27    return decoder_input, decoder_label
28
29
30# Example
31text = "The dog runs."
32dec_input, dec_label = prepare_target(text, tokenizer)
33
34print(f"Text: '{text}'")
35print(f"Decoder input: {dec_input}")  # [2, 45, 123, 456, 7]
36print(f"Decoder label: {dec_label}")   # [45, 123, 456, 7, 3]

5.5 The PAD Token

Why Padding is Necessary

Batching requires sequences of equal length:

๐Ÿ“text
1Sentence 1: "Hi"        โ†’ 2 tokens
2Sentence 2: "Hello world" โ†’ 3 tokens
3Sentence 3: "Goodbye"   โ†’ 2 tokens
4
5Without padding: Can't batch (different lengths)
6With padding:
7  [Hi, <pad>, <pad>]     โ†’ 3 tokens
8  [Hello, world, <pad>]  โ†’ 3 tokens
9  [Good, bye, <pad>]     โ†’ 3 tokens
10
11Now all have same length โ†’ Batchable!

Padding Implementation

๐Ÿpython
1def pad_sequences(
2    sequences: List[List[int]],
3    pad_id: int = 0,
4    max_length: Optional[int] = None
5) -> Tuple[torch.Tensor, torch.Tensor]:
6    """
7    Pad sequences to equal length.
8
9    Args:
10        sequences: List of token ID lists
11        pad_id: Padding token ID
12        max_length: Maximum length (or longest sequence if None)
13
14    Returns:
15        padded: [batch, max_len] tensor of token IDs
16        mask: [batch, max_len] tensor (1=real, 0=padding)
17    """
18    # Determine max length
19    if max_length is None:
20        max_length = max(len(seq) for seq in sequences)
21
22    batch_size = len(sequences)
23    padded = torch.full((batch_size, max_length), pad_id, dtype=torch.long)
24    mask = torch.zeros(batch_size, max_length, dtype=torch.long)
25
26    for i, seq in enumerate(sequences):
27        length = min(len(seq), max_length)
28        padded[i, :length] = torch.tensor(seq[:length])
29        mask[i, :length] = 1
30
31    return padded, mask
32
33
34# Example
35sequences = [
36    [10, 20, 30],
37    [40, 50],
38    [60, 70, 80, 90],
39]
40
41padded, mask = pad_sequences(sequences, pad_id=0)
42print(f"Padded:\n{padded}")
43print(f"Mask:\n{mask}")

Output:

๐Ÿ“text
1Padded:
2tensor([[10, 20, 30,  0],
3        [40, 50,  0,  0],
4        [60, 70, 80, 90]])
5Mask:
6tensor([[1, 1, 1, 0],
7        [1, 1, 0, 0],
8        [1, 1, 1, 1]])

Padding and Attention Masks

Padding tokens should be ignored in attention:

๐Ÿpython
1def create_padding_mask(input_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
2    """
3    Create mask that ignores padding tokens.
4
5    Args:
6        input_ids: [batch, seq_len] token IDs
7        pad_id: Padding token ID
8
9    Returns:
10        mask: [batch, 1, 1, seq_len] attention mask
11              (1 = attend, 0 = ignore)
12    """
13    # Where tokens are NOT padding
14    mask = (input_ids != pad_id).unsqueeze(1).unsqueeze(2)
15    return mask.float()
16
17
18# Example
19input_ids = torch.tensor([
20    [10, 20, 30, 0, 0],
21    [40, 50, 60, 70, 0],
22])
23
24mask = create_padding_mask(input_ids, pad_id=0)
25print(f"Input IDs:\n{input_ids}")
26print(f"Padding mask:\n{mask.squeeze()}")

Output:

๐Ÿ“text
1Input IDs:
2tensor([[10, 20, 30,  0,  0],
3        [40, 50, 60, 70,  0]])
4Padding mask:
5tensor([[1., 1., 1., 0., 0.],
6        [1., 1., 1., 1., 0.]])

5.6 Special Tokens and Loss Computation

Which Tokens Contribute to Loss?

Include in loss:

  • Regular tokens (predictions we care about)
  • EOS token (model must learn when to stop)

Exclude from loss:

  • PAD tokens (not real predictions)
  • BOS token (always given as input, never predicted)
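Because PAD appears only at positions we want to skip, the loss mask can be read directly off the labels. A pure-Python sketch with illustrative IDs (EOS = 3, PAD = 0); BOS never appears in labels, so it needs no special case:

```python
PAD_ID = 0

# Labels: three regular tokens, then EOS (3), then padding
labels = [45, 67, 89, 3, 0]

# 1 = position counts toward the loss, 0 = ignored
loss_mask = [1 if tok != PAD_ID else 0 for tok in labels]
print(loss_mask)  # [1, 1, 1, 1, 0] - tokens and EOS count, PAD does not
```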

Ignoring Padding in Loss

๐Ÿpython
1import torch.nn as nn
2
3
4def compute_translation_loss(
5    logits: torch.Tensor,
6    labels: torch.Tensor,
7    pad_id: int = 0
8) -> torch.Tensor:
9    """
10    Compute cross-entropy loss ignoring padding.
11
12    Args:
13        logits: [batch, seq_len, vocab_size] model predictions
14        labels: [batch, seq_len] target token IDs
15        pad_id: Padding token ID to ignore
16
17    Returns:
18        Scalar loss
19    """
20    # Reshape for cross entropy
21    # logits: [batch * seq_len, vocab_size]
22    # labels: [batch * seq_len]
23    batch_size, seq_len, vocab_size = logits.shape
24
25    logits_flat = logits.view(-1, vocab_size)
26    labels_flat = labels.view(-1)
27
28    # Use ignore_index to skip padding
29    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
30    loss = criterion(logits_flat, labels_flat)
31
32    return loss
33
34
35# Example
36batch_size, seq_len, vocab_size = 2, 5, 100
37
38# Fake logits
39logits = torch.randn(batch_size, seq_len, vocab_size)
40
41# Labels with padding
42labels = torch.tensor([
43    [45, 67, 89, 3, 0],   # "The dog runs EOS PAD"
44    [12, 34, 56, 78, 3],  # "A cat sleeps well EOS"
45])
46
47loss = compute_translation_loss(logits, labels, pad_id=0)
48print(f"Loss: {loss.item():.4f}")

5.7 Complete Data Preparation Pipeline

Full Translation Example

๐Ÿpython
1class TranslationExample:
2    """
3    A single translation training example.
4
5    Contains all components needed for training:
6    - Source sequence (encoder input)
7    - Target decoder input (with BOS)
8    - Target labels (with EOS)
9    - Masks for attention
10    """
11
12    def __init__(
13        self,
14        source_ids: List[int],
15        target_ids: List[int],
16        pad_id: int = 0,
17        bos_id: int = 2,
18        eos_id: int = 3
19    ):
20        # Store raw IDs
21        self.source_ids = source_ids
22        self.target_ids = target_ids
23
24        # Prepare decoder input and labels
25        self.decoder_input = [bos_id] + target_ids
26        self.decoder_label = target_ids + [eos_id]
27
28        self.pad_id = pad_id
29
30    def __repr__(self):
31        return (
32            f"TranslationExample(\n"
33            f"  source={self.source_ids}\n"
34            f"  decoder_input={self.decoder_input}\n"
35            f"  decoder_label={self.decoder_label}\n"
36            f")"
37        )
38
39
40def prepare_batch(
41    examples: List[TranslationExample],
42    pad_id: int = 0
43) -> dict:
44    """
45    Prepare a batch of examples for training.
46
47    Returns dictionary with:
48    - encoder_input: [batch, src_len]
49    - encoder_mask: [batch, 1, 1, src_len]
50    - decoder_input: [batch, tgt_len]
51    - decoder_mask: [batch, 1, tgt_len, tgt_len] (causal + padding)
52    - labels: [batch, tgt_len]
53    """
54    # Collect sequences
55    source_seqs = [ex.source_ids for ex in examples]
56    decoder_input_seqs = [ex.decoder_input for ex in examples]
57    label_seqs = [ex.decoder_label for ex in examples]
58
59    # Pad each type
60    encoder_input, encoder_padding = pad_sequences(source_seqs, pad_id)
61    decoder_input, decoder_padding = pad_sequences(decoder_input_seqs, pad_id)
62    labels, _ = pad_sequences(label_seqs, pad_id)
63
64    # Create attention masks
65    batch_size = len(examples)
66    src_len = encoder_input.size(1)
67    tgt_len = decoder_input.size(1)
68
69    # Encoder mask: just padding mask
70    encoder_mask = (encoder_input != pad_id).unsqueeze(1).unsqueeze(2).float()
71
72    # Decoder mask: causal + padding
73    # Causal mask: lower triangular
74    causal_mask = torch.tril(torch.ones(tgt_len, tgt_len))
75    # Padding mask
76    padding_mask = (decoder_input != pad_id).unsqueeze(1).unsqueeze(2).float()
77    # Combine
78    decoder_mask = causal_mask.unsqueeze(0) * padding_mask
79
80    # Cross-attention mask (decoder attending to encoder)
81    cross_mask = encoder_mask  # Same as encoder mask
82
83    return {
84        "encoder_input": encoder_input,
85        "encoder_mask": encoder_mask,
86        "decoder_input": decoder_input,
87        "decoder_mask": decoder_mask,
88        "cross_mask": cross_mask,
89        "labels": labels,
90    }
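The causal-plus-padding combination is easy to sanity-check on a tiny example. This pure-Python sketch applies the same rule without tensors: position j is visible from position i iff j <= i and token j is not padding (the function name is ours, for illustration only):

```python
PAD_ID = 0


def combined_decoder_mask(token_ids):
    """Row i marks the positions row i may attend to:
    1 = visible (not in the future, not padding), 0 = masked."""
    n = len(token_ids)
    return [
        [1 if j <= i and token_ids[j] != PAD_ID else 0 for j in range(n)]
        for i in range(n)
    ]


# <bos>=2, two real tokens, one pad
mask = combined_decoder_mask([2, 45, 67, 0])
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]   <- last row still cannot see the PAD position
```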

5.8 Inference-Time Handling

Generation Without Teacher Forcing

During inference, we don't have the target text:

๐Ÿpython
1def generate_translation(
2    model,
3    source_ids: torch.Tensor,
4    tokenizer,
5    max_length: int = 50,
6    device: str = "cpu"
7) -> List[int]:
8    """
9    Generate translation using greedy decoding.
10
11    Args:
12        model: Trained transformer model
13        source_ids: [1, src_len] source token IDs
14        tokenizer: Tokenizer with special token IDs
15        max_length: Maximum output length
16        device: Device to run on
17
18    Returns:
19        List of generated token IDs
20    """
21    model.eval()
22    source_ids = source_ids.to(device)
23
24    # Start with BOS token
25    generated = [tokenizer.bos_id]
26
27    with torch.no_grad():
28        # Encode source once
29        memory = model.encode(source_ids)
30
31        for _ in range(max_length):
32            # Prepare decoder input
33            decoder_input = torch.tensor([generated], device=device)
34
35            # Get next token prediction
36            logits = model.decode(decoder_input, memory)
37            next_token_logits = logits[0, -1, :]  # Last position
38            next_token = next_token_logits.argmax().item()
39
40            # Append to generated sequence
41            generated.append(next_token)
42
43            # Stop if EOS
44            if next_token == tokenizer.eos_id:
45                break
46
47    return generated[1:]  # Remove BOS
48
49
50# Usage example (pseudocode)
51# source = "Der Hund lรคuft im Park."
52# source_ids = tokenizer.encode_source(source)
53# generated_ids = generate_translation(model, source_ids, tokenizer)
54# translation = tokenizer.decode(generated_ids)
55# print(translation)  # "The dog runs in the park."

Key Differences: Training vs Inference

Aspect            Training              Inference
Target available  Yes                   No
Decoder input     Full target with BOS  Generated token by token
Teacher forcing   Yes                   No
EOS handling      In labels             Stop condition
PAD handling      In batch              Not needed

Summary

Special Token Usage

Token  Encoder   Decoder Input  Decoder Label  Loss
PAD    Masked    Masked         Ignored        No
UNK    Encoded   Encoded        Included       Yes
BOS    Not used  First token    Not used       No
EOS    Not used  Not used       Last token     Yes

The Complete Picture

๐Ÿ“text
1Training:
2  Source:       [src_1, src_2, ..., src_n, PAD, PAD]
3  Dec Input:    [BOS, tgt_1, tgt_2, ..., tgt_m, PAD]
4  Dec Label:    [tgt_1, tgt_2, ..., tgt_m, EOS, PAD]
5  Loss Mask:    [  1,    1,    1,  ...,  1,   1,  0]
6
7Inference:
8  Source:       [src_1, src_2, ..., src_n]
9  Dec Input:    [BOS] โ†’ [BOS, pred_1] โ†’ [BOS, pred_1, pred_2] โ†’ ...
10  Stop when:    pred_k == EOS

Implementation Checklist

  • โ˜ PAD token ID = 0 for easy masking
  • โ˜ Source sequences: no special tokens
  • โ˜ Target decoder input: prepend BOS
  • โ˜ Target labels: append EOS
  • โ˜ Loss computation: ignore PAD tokens
  • โ˜ Inference: start with BOS, stop at EOS

Exercises

Conceptual Questions

  1. Why does the decoder input have BOS but not EOS, while labels have EOS but not BOS?
  2. If we accidentally included PAD tokens in the loss, what would happen to training?
  3. During inference, what happens if the model never produces EOS?

Implementation Exercises

  1. Modify the batch preparation to support different padding strategies (left vs right padding).
  2. Implement beam search decoding that properly handles EOS tokens.
  3. Add support for multiple special tokens (e.g., language tags like <de>, <en>).

Analysis Exercises

  1. Compare training with and without label smoothing. How does it affect translation quality?
  2. Analyze the attention patterns at positions with PAD tokens. Are they properly masked?

Chapter Summary

What We Learned

  1. Why Subword: Handles OOV, balances vocab size and sequence length
  2. BPE Algorithm: Iterative merging of frequent pairs
  3. Implementation: From scratch and with SentencePiece
  4. Special Tokens: PAD, UNK, BOS, EOS and their roles
  5. Data Pipeline: Preparing batches for translation

Ready for the Project

With tokenization complete, we have:

  • Trained tokenizer for German-English
  • Proper handling of special tokens
  • Batch preparation with masking
  • Integration with PyTorch datasets

Next Chapter Preview

In Chapter 6, we'll implement the remaining transformer components: the Feed-Forward Network and Layer Normalization. These are the final building blocks before assembling complete encoder and decoder layers.