Chapter 12

Vocabulary and Tokenizer Setup

Multi30k Dataset Setup

For machine translation, we need a tokenizer that handles both the source and the target language. This section builds a joint BPE tokenizer for German-English translation on top of our implementation from Chapter 5.


3.1 Tokenizer Strategy for Translation

Joint vs Separate Vocabularies

```text
OPTION 1: JOINT VOCABULARY (Recommended)
────────────────────────────────────────

Train a single tokenizer on combined source + target data.

Advantages:
  + Shared subwords for cognates (similar words)
    German "Haus" and English "house" may share subwords
  + Single embedding matrix
  + Simpler architecture
  + Works well for related languages

Disadvantages:
  - May waste vocabulary on one language
  - Less optimal for very different languages


OPTION 2: SEPARATE VOCABULARIES
───────────────────────────────

Train a separate tokenizer for each language.

Advantages:
  + Optimal vocabulary per language
  + Better for distant language pairs (EN-ZH)

Disadvantages:
  - Larger total vocabulary
  - No shared representations
  - More complex architecture


RECOMMENDATION FOR DE-EN:
─────────────────────────

Use a JOINT vocabulary because:
- German and English share many cognates
- Both use Latin script
- Shared subwords help alignment learning
- Simpler implementation

Typical sizes:
  - 8,000 tokens: Fast training, good for small data
  - 16,000 tokens: Good balance
  - 32,000 tokens: Finer-grained vocabulary for larger corpora, slower training
```
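To make the "single embedding matrix" point concrete, here is a rough sketch of the parameter arithmetic. The model width `D_MODEL = 512` and the helper name `embedding_params` are assumptions chosen for illustration; they are not values taken from this chapter.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of weights in one (vocab_size x d_model) embedding matrix."""
    return vocab_size * d_model


D_MODEL = 512  # assumed model width, for illustration only

# Joint: one 8,000-token vocabulary, so one matrix can serve both languages.
joint = embedding_params(8_000, D_MODEL)

# Separate: one 8,000-token vocabulary (and one matrix) per language.
separate = embedding_params(8_000, D_MODEL) + embedding_params(8_000, D_MODEL)

print(f"joint:    {joint:,} embedding weights")
print(f"separate: {separate:,} embedding weights")
```

Under these assumed sizes the separate setup needs twice as many embedding weights, which is the cost behind "larger total vocabulary" in Option 2.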

3.2 Building Joint BPE Tokenizer

Training Pipeline

🐍python
1from typing import List, Dict, Optional, Tuple
2from collections import Counter
3import json
4from pathlib import Path
5
6
7class JointBPETokenizer:
8    """
9    Joint BPE tokenizer for translation.
10
11    Trains on combined source and target data to create
12    a shared vocabulary for both languages.
13
14    Special tokens:
15    - <pad>: Padding
16    - <unk>: Unknown
17    - <bos>: Beginning of sequence
18    - <eos>: End of sequence
19
20    Args:
21        vocab_size: Target vocabulary size
22        min_frequency: Minimum frequency for BPE merges
23
24    Example:
25        >>> tokenizer = JointBPETokenizer(vocab_size=8000)
26        >>> tokenizer.train(german_sentences + english_sentences)
27        >>> tokens = tokenizer.encode("Hello world")
28    """
29
30    # Special tokens
31    PAD_TOKEN = "<pad>"
32    UNK_TOKEN = "<unk>"
33    BOS_TOKEN = "<bos>"
34    EOS_TOKEN = "<eos>"
35
36    SPECIAL_TOKENS = [PAD_TOKEN, UNK_TOKEN, BOS_TOKEN, EOS_TOKEN]
37
38    def __init__(
39        self,
40        vocab_size: int = 8000,
41        min_frequency: int = 2
42    ):
43        self.vocab_size = vocab_size
44        self.min_frequency = min_frequency
45
46        # Initialize empty vocabulary
47        self.token_to_id: Dict[str, int] = {}
48        self.id_to_token: Dict[int, str] = {}
49        self.merges: List[Tuple[str, str]] = []
50
51        # Add special tokens
52        self._init_special_tokens()
53
54    def _init_special_tokens(self):
55        """Initialize special token mappings."""
56        for i, token in enumerate(self.SPECIAL_TOKENS):
57            self.token_to_id[token] = i
58            self.id_to_token[i] = token
59
60    @property
61    def pad_id(self) -> int:
62        return self.token_to_id[self.PAD_TOKEN]
63
64    @property
65    def unk_id(self) -> int:
66        return self.token_to_id[self.UNK_TOKEN]
67
68    @property
69    def bos_id(self) -> int:
70        return self.token_to_id[self.BOS_TOKEN]
71
72    @property
73    def eos_id(self) -> int:
74        return self.token_to_id[self.EOS_TOKEN]
75
76    def train(
77        self,
78        sentences: List[str],
79        verbose: bool = True
80    ):
81        """
82        Train BPE tokenizer on sentences.
83
84        Args:
85            sentences: List of training sentences
86            verbose: Print progress
87        """
88        if verbose:
89            print(f"Training BPE tokenizer...")
90            print(f"  Sentences: {len(sentences):,}")
91            print(f"  Target vocab size: {self.vocab_size}")
92
93        # Step 1: Build initial vocabulary (characters)
94        word_freqs = self._get_word_frequencies(sentences)
95        splits = self._initialize_splits(word_freqs)
96
97        if verbose:
98            print(f"  Unique words: {len(word_freqs):,}")
99
100        # Step 2: Learn BPE merges
101        num_merges = self.vocab_size - len(self.token_to_id) - len(self._get_unique_chars(word_freqs))
102
103        for i in range(num_merges):
104            # Find best pair
105            pair_freqs = self._compute_pair_frequencies(splits, word_freqs)
106
107            if not pair_freqs:
108                break
109
110            best_pair = max(pair_freqs, key=pair_freqs.get)
111
112            if pair_freqs[best_pair] < self.min_frequency:
113                break
114
115            # Apply merge
116            self.merges.append(best_pair)
117            splits = self._apply_merge(splits, best_pair)
118
119            if verbose and (i + 1) % 500 == 0:
120                print(f"    Merge {i+1}: {best_pair[0]} + {best_pair[1]}")
121
122        # Step 3: Build final vocabulary
123        self._build_vocabulary(splits)
124
125        if verbose:
126            print(f"  Final vocabulary size: {len(self.token_to_id)}")
127            print("Training complete!")
128
129    def encode(
130        self,
131        text: str,
132        add_special_tokens: bool = True
133    ) -> List[int]:
134        """
135        Encode text to token IDs.
136
137        Args:
138            text: Input text
139            add_special_tokens: Add BOS/EOS tokens
140
141        Returns:
142            List of token IDs
143        """
144        words = text.lower().split()
145        token_ids = []
146
147        if add_special_tokens:
148            token_ids.append(self.bos_id)
149
150        for word in words:
151            # Tokenize word
152            word_tokens = self._tokenize_word(word)
153
154            for token in word_tokens:
155                if token in self.token_to_id:
156                    token_ids.append(self.token_to_id[token])
157                else:
158                    token_ids.append(self.unk_id)
159
160        if add_special_tokens:
161            token_ids.append(self.eos_id)
162
163        return token_ids
164
165    def decode(
166        self,
167        token_ids: List[int],
168        skip_special_tokens: bool = True
169    ) -> str:
170        """
171        Decode token IDs to text.
172
173        Args:
174            token_ids: List of token IDs
175            skip_special_tokens: Skip special tokens
176
177        Returns:
178            Decoded text
179        """
180        tokens = []
181
182        for token_id in token_ids:
183            token = self.id_to_token.get(token_id, self.UNK_TOKEN)
184
185            if skip_special_tokens and token in self.SPECIAL_TOKENS:
186                continue
187
188            tokens.append(token)
189
190        # Join and clean up
191        text = ''.join(tokens)
192        text = text.replace('</w>', ' ')
193        text = text.strip()
194
195        return text
196
197    def save(self, path: str):
198        """Save tokenizer to file."""
199        data = {
200            'vocab_size': self.vocab_size,
201            'min_frequency': self.min_frequency,
202            'token_to_id': self.token_to_id,
203            'merges': self.merges,
204        }
205
206        with open(path, 'w', encoding='utf-8') as f:
207            json.dump(data, f, ensure_ascii=False, indent=2)
208
209        print(f"Saved tokenizer to {path}")
210
211    @classmethod
212    def load(cls, path: str) -> 'JointBPETokenizer':
213        """Load tokenizer from file."""
214        with open(path, 'r', encoding='utf-8') as f:
215            data = json.load(f)
216
217        tokenizer = cls(
218            vocab_size=data['vocab_size'],
219            min_frequency=data['min_frequency']
220        )
221
222        tokenizer.token_to_id = data['token_to_id']
223        tokenizer.merges = [tuple(m) for m in data['merges']]
224
225        # Rebuild id_to_token properly
226        tokenizer.id_to_token = {v: k for k, v in tokenizer.token_to_id.items()}
227
228        print(f"Loaded tokenizer from {path}")
229        return tokenizer
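The class above calls several private helpers that this section does not show. As a rough guide to what the merge loop in `train()` relies on, here is a minimal standalone sketch of four of them, written as plain functions. It is an assumption about their behavior, not the Chapter 5 code: in particular, appending `</w>` as a separate end-of-word symbol is a guess based on how `decode()` treats that marker.

```python
from collections import Counter
from typing import Dict, List, Tuple

END_OF_WORD = "</w>"  # assumed marker; decode() replaces it with a space


def get_word_frequencies(sentences: List[str]) -> Dict[str, int]:
    """Count lowercased, whitespace-split words across all sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return dict(counts)


def initialize_splits(word_freqs: Dict[str, int]) -> Dict[str, List[str]]:
    """Split every word into characters followed by the end-of-word marker."""
    return {word: list(word) + [END_OF_WORD] for word in word_freqs}


def compute_pair_frequencies(
    splits: Dict[str, List[str]], word_freqs: Dict[str, int]
) -> Dict[Tuple[str, str], int]:
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pair_freqs: Counter = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for left, right in zip(symbols, symbols[1:]):
            pair_freqs[(left, right)] += freq
    return dict(pair_freqs)


def apply_merge(
    splits: Dict[str, List[str]], pair: Tuple[str, str]
) -> Dict[str, List[str]]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged = pair[0] + pair[1]
    new_splits: Dict[str, List[str]] = {}
    for word, symbols in splits.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_splits[word] = out
    return new_splits


# One merge step on a toy corpus.
sentences = ["low low lower", "lowest low"]
word_freqs = get_word_frequencies(sentences)
splits = initialize_splits(word_freqs)
pair_freqs = compute_pair_frequencies(splits, word_freqs)
best_pair = max(pair_freqs, key=pair_freqs.get)
splits = apply_merge(splits, best_pair)
print(best_pair, splits["low"])
```

Repeating the last four lines until the merge budget is spent, or until the best pair falls below `min_frequency`, gives the loop in `train()`.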

3.3 Training Tokenizer on Multi30k

Full Training Script

🐍python
1# train_tokenizer.py
2
3from pathlib import Path
4
5def main():
6    # Configuration
7    VOCAB_SIZE = 8000
8    MIN_FREQUENCY = 2
9    DATA_DIR = Path("data/multi30k")
10    OUTPUT_DIR = Path("data/tokenizer")
11
12    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
13
14    # Load training data
15    print("Loading training data...")
16
17    with open(DATA_DIR / "train.de", 'r') as f:
18        german = [line.strip() for line in f]
19
20    with open(DATA_DIR / "train.en", 'r') as f:
21        english = [line.strip() for line in f]
22
23    # Combine for joint vocabulary
24    combined = german + english
25    print(f"  German sentences: {len(german):,}")
26    print(f"  English sentences: {len(english):,}")
27    print(f"  Combined: {len(combined):,}")
28
29    # Train tokenizer
30    tokenizer = JointBPETokenizer(
31        vocab_size=VOCAB_SIZE,
32        min_frequency=MIN_FREQUENCY
33    )
34
35    tokenizer.train(combined, verbose=True)
36
37    # Save tokenizer
38    tokenizer.save(OUTPUT_DIR / "tokenizer.json")
39
40    # Print statistics
41    print("\nTokenizer Statistics:")
42    print(f"  Vocabulary size: {len(tokenizer.token_to_id)}")
43    print(f"  Number of merges: {len(tokenizer.merges)}")
44
45    # Test on samples
46    print("\nSample tokenizations:")
47
48    samples = [
49        ("DE", "Ein Hund lΓ€uft im Park."),
50        ("EN", "A dog runs in the park."),
51        ("DE", "Das Wetter ist schΓΆn heute."),
52        ("EN", "The weather is nice today."),
53    ]
54
55    for lang, text in samples:
56        tokens = tokenizer.encode(text, add_special_tokens=False)
57        decoded = tokenizer.decode(tokens)
58        print(f"  [{lang}] {text}")
59        print(f"       Tokens: {tokens[:10]}...")
60        print(f"       Length: {len(tokens)}")
61
62    print("\nTokenizer training complete!")
63
64
65if __name__ == "__main__":
66    main()
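The script reads `train.de` and `train.en` independently, so a truncated or misaligned file would not be noticed at this stage. A small check like the one below could be run right after loading. The function name `check_parallel` and the sample sentences are illustrative assumptions, not part of the original script.

```python
from typing import List


def check_parallel(source: List[str], target: List[str]) -> List[int]:
    """Return indices of pairs with an empty side; raise if line counts differ."""
    if len(source) != len(target):
        raise ValueError(
            f"Line count mismatch: {len(source)} source vs {len(target)} target"
        )
    return [
        i for i, (src, tgt) in enumerate(zip(source, target))
        if not src.strip() or not tgt.strip()
    ]


german = ["Ein Hund läuft im Park.", "", "Das Wetter ist schön heute."]
english = ["A dog runs in the park.", "Orphaned line.", "The weather is nice today."]
print(check_parallel(german, english))
```

The joint tokenizer itself does not depend on alignment, since the two lists are simply concatenated, but the translation training that follows does.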

3.4 Vocabulary Analysis

Analyzing the Tokenizer

🐍python
1def analyze_tokenizer(tokenizer: JointBPETokenizer):
2    """
3    Analyze tokenizer properties.
4    """
5    print("Tokenizer Analysis")
6    print("=" * 60)
7
8    # Basic stats
9    print(f"\nBasic Statistics:")
10    print(f"  Total vocabulary: {len(tokenizer.token_to_id)}")
11    print(f"  Number of merges: {len(tokenizer.merges)}")
12    print(f"  Special tokens: {len(tokenizer.SPECIAL_TOKENS)}")
13
14    # Token length distribution
15    token_lengths = [len(t) for t in tokenizer.token_to_id.keys()
16                    if t not in tokenizer.SPECIAL_TOKENS and t != '</w>']
17
18    if token_lengths:
19        print(f"\nToken Length Distribution:")
20        print(f"  Average: {sum(token_lengths)/len(token_lengths):.2f} chars")
21        print(f"  Max: {max(token_lengths)} chars")
22        print(f"  Min: {min(token_lengths)} chars")
23
24    # Count single characters vs subwords
25    single_chars = sum(1 for t in tokenizer.token_to_id
26                      if len(t) == 1 and t not in tokenizer.SPECIAL_TOKENS)
27    subwords = len(tokenizer.token_to_id) - single_chars - len(tokenizer.SPECIAL_TOKENS)
28
29    print(f"\nToken Types:")
30    print(f"  Single characters: {single_chars}")
31    print(f"  Subwords: {subwords}")
32    print(f"  Special tokens: {len(tokenizer.SPECIAL_TOKENS)}")
33
34
35def analyze_tokenization_quality(
36    tokenizer: JointBPETokenizer,
37    sentences: List[str]
38) -> Dict:
39    """
40    Analyze tokenization quality on a corpus.
41
42    Returns statistics about token counts, unknown rates, etc.
43    """
44    total_tokens = 0
45    total_unk = 0
46    tokens_per_word = []
47
48    for sentence in sentences:
49        words = sentence.split()
50        tokens = tokenizer.encode(sentence, add_special_tokens=False)
51
52        total_tokens += len(tokens)
53        total_unk += sum(1 for t in tokens if t == tokenizer.unk_id)
54        tokens_per_word.append(len(tokens) / max(len(words), 1))
55
56    return {
57        'total_sentences': len(sentences),
58        'total_tokens': total_tokens,
59        'avg_tokens_per_sentence': total_tokens / len(sentences),
60        'avg_tokens_per_word': sum(tokens_per_word) / len(tokens_per_word),
61        'unknown_rate': total_unk / total_tokens if total_tokens > 0 else 0,
62    }
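To show what `unknown_rate` measures, here is a worked example on two sentences. `ToyTokenizer` is a hypothetical word-level stand-in, not real BPE. It only mimics the `encode()` and `unk_id` interface so the arithmetic can be followed by hand.

```python
from typing import Dict, List


class ToyTokenizer:
    """Stand-in tokenizer (hypothetical): one ID per known word, <unk> otherwise."""

    unk_id = 1

    def __init__(self, words: List[str]):
        # IDs 0-3 are reserved for the special tokens, as in JointBPETokenizer.
        self.vocab: Dict[str, int] = {w: i + 4 for i, w in enumerate(words)}

    def encode(self, text: str, add_special_tokens: bool = True) -> List[int]:
        # Special tokens are never added here; the flag only mirrors the real API.
        return [self.vocab.get(w, self.unk_id) for w in text.lower().split()]


toy = ToyTokenizer(["a", "dog", "runs", "cat"])
sentences = ["A dog runs", "A cat sleeps"]

encoded = [toy.encode(s, add_special_tokens=False) for s in sentences]
total_tokens = sum(len(ids) for ids in encoded)
total_unk = sum(ids.count(toy.unk_id) for ids in encoded)
unknown_rate = total_unk / total_tokens

print(f"tokens={total_tokens}, unk={total_unk}, unknown_rate={unknown_rate:.3f}")
```

Only "sleeps" is missing from the toy vocabulary, so one of six tokens is unknown and the rate is 1/6. A real BPE tokenizer would split an unseen word into known subwords, which is why its unknown rate is normally far lower.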

3.5 Special Token Handling

Proper Use of Special Tokens

```text
SPECIAL TOKENS FOR TRANSLATION:
───────────────────────────────

<pad> (ID: 0)
─────────────
Purpose: Padding sequences to equal length
Used in: Both source and target batches
Example:
  Batch: [[1,2,3,4], [1,2,0,0]]
                          ↑ ↑ padding

<unk> (ID: 1)
─────────────
Purpose: Unknown/out-of-vocabulary tokens
Used in: Rare words not in vocabulary
Note: High UNK rate indicates vocabulary issues

<bos> (ID: 2)
─────────────
Purpose: Beginning of sequence marker
Used in: Start of target sequences
Example:
  Target input:  [<bos>, word1, word2, ...]
  Target output: [word1, word2, ..., <eos>]

<eos> (ID: 3)
─────────────
Purpose: End of sequence marker
Used in: End of target sequences
Signals: Generation should stop


TRAINING DATA FORMAT:
─────────────────────

Source: [src_word1, src_word2, ..., src_wordN]
        (No BOS/EOS needed for source)

Target Input:  [<bos>, tgt_word1, tgt_word2, ..., tgt_wordN]
Target Output: [tgt_word1, tgt_word2, ..., tgt_wordN, <eos>]


INFERENCE:
──────────

1. Encode source: [src_tokens]
2. Start with: [<bos>]
3. Generate until <eos> or max_length
4. Remove <bos> and <eos> from final output
```
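Step 4 of the inference recipe can be sketched as a small post-processing function. The name `strip_special` and the sample IDs are assumptions for illustration; the special-token IDs follow the table above.

```python
from typing import List


def strip_special(ids: List[int], bos_id: int, eos_id: int) -> List[int]:
    """Drop a leading <bos> and everything from the first <eos> onward."""
    if ids and ids[0] == bos_id:
        ids = ids[1:]
    if eos_id in ids:
        ids = ids[:ids.index(eos_id)]
    return ids


generated = [2, 57, 104, 9, 3, 0, 0]  # <bos> w1 w2 w3 <eos> <pad> <pad>
print(strip_special(generated, bos_id=2, eos_id=3))
```

Cutting at the first `<eos>` also discards any padding or stray tokens produced after it, which matters when sequences in a batch finish at different steps.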

Code Example

🐍python
1def prepare_batch(source_ids, target_ids, pad_id, bos_id, eos_id):
2    """
3    Prepare source and target for training.
4    """
5    # Source: just pad to max length
6    src_padded = pad_sequence(source_ids, pad_id)
7
8    # Target: add BOS at start, EOS at end
9    tgt_with_special = []
10    for tgt in target_ids:
11        tgt_with_special.append([bos_id] + tgt + [eos_id])
12
13    tgt_padded = pad_sequence(tgt_with_special, pad_id)
14
15    # For training:
16    # - Input to decoder: tgt_padded[:, :-1]  (includes BOS, excludes last)
17    # - Labels: tgt_padded[:, 1:]  (excludes BOS, includes EOS)
18
19    return {
20        'source': src_padded,
21        'target_input': tgt_padded[:, :-1],
22        'target_output': tgt_padded[:, 1:],
23    }
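The one-position shift between decoder input and labels is easier to see on concrete numbers. The sketch below repeats the same steps on plain Python lists, so no tensor library is needed. The helper name `shift_targets` and the sample IDs are assumptions for illustration.

```python
from typing import List, Tuple

PAD, BOS, EOS = 0, 2, 3  # IDs from the special-token table


def shift_targets(targets: List[List[int]]) -> Tuple[List[List[int]], List[List[int]]]:
    """Add <bos>/<eos>, right-pad, and split into decoder input and labels."""
    with_special = [[BOS] + t + [EOS] for t in targets]
    max_len = max(len(t) for t in with_special)
    padded = [t + [PAD] * (max_len - len(t)) for t in with_special]
    decoder_input = [row[:-1] for row in padded]  # drop last position
    labels = [row[1:] for row in padded]          # drop <bos>
    return decoder_input, labels


decoder_input, labels = shift_targets([[10, 11, 12], [20]])
for inp, lab in zip(decoder_input, labels):
    print("input :", inp)
    print("labels:", lab)

# Positions whose label is PAD carry no training signal.
real_labels = sum(tok != PAD for row in labels for tok in row)
print("non-pad label positions:", real_labels)
```

For the shorter sequence the decoder input still contains `<eos>` followed by padding. This is harmless as long as the loss skips padded label positions, for example through the `ignore_index` argument of PyTorch's `CrossEntropyLoss`.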

3.6 Complete Tokenizer Setup

Full Setup Workflow

```text
FULL SETUP WORKFLOW:
────────────────────

1. TRAIN TOKENIZER:
   ─────────────────
   # Load all training data
   german = load_file('train.de')
   english = load_file('train.en')

   # Create and train
   tokenizer = JointBPETokenizer(vocab_size=8000)
   tokenizer.train(german + english)

   # Save
   tokenizer.save('data/tokenizer/tokenizer.json')


2. VERIFY TOKENIZER:
   ─────────────────
   # Check vocabulary size
   assert len(tokenizer.token_to_id) <= 8000

   # Check special tokens
   assert tokenizer.pad_id == 0
   assert tokenizer.unk_id == 1
   assert tokenizer.bos_id == 2
   assert tokenizer.eos_id == 3

   # Test round-trip
   text = "The dog runs in the park"
   ids = tokenizer.encode(text)
   decoded = tokenizer.decode(ids)
   # decoded should match text up to casing (encode() lowercases its input)


3. COMPUTE STATISTICS:
   ───────────────────
   # For each split
   for split in ['train', 'val', 'test']:
       german = load_file(f'{split}.de')
       english = load_file(f'{split}.en')

       # Average tokens per sentence
       de_tokens = [len(tokenizer.encode(s)) for s in german]
       en_tokens = [len(tokenizer.encode(s)) for s in english]

       print(f"{split}:")
       print(f"  DE avg tokens: {mean(de_tokens):.1f}")
       print(f"  EN avg tokens: {mean(en_tokens):.1f}")
```

```python
VOCAB_SIZE = 8000
# Reasoning:
# - Multi30k has ~30K sentence pairs
# - Limited vocabulary (image descriptions)
# - 8K captures most patterns
# - Faster training than larger vocab

MIN_FREQUENCY = 2
# Reasoning:
# - Avoid single-occurrence artifacts
# - Keep meaningful subwords
```

```text
EXPECTED STATISTICS:
────────────────────

With vocab_size=8000 on Multi30k:
  - Avg tokens/word: ~1.3
  - Avg tokens/sentence: ~15 (after BOS/EOS)
  - Unknown rate: < 1%
```
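The verification step of the workflow can be wrapped in one reusable function. The sketch below is an assumption about how that might look. `verify_tokenizer` is a hypothetical helper, and `WordLevelStub` is a word-level stand-in used only to make the example runnable without a trained tokenizer. The exact round-trip comparison assumes the sample contains only symbols the tokenizer knows.

```python
from typing import Dict, List


def verify_tokenizer(tokenizer, max_vocab_size: int, sample: str) -> None:
    """Run the workflow's verification checks; raise AssertionError on failure."""
    assert len(tokenizer.token_to_id) <= max_vocab_size, "vocabulary too large"
    special_ids = (tokenizer.pad_id, tokenizer.unk_id,
                   tokenizer.bos_id, tokenizer.eos_id)
    assert special_ids == (0, 1, 2, 3), "special token IDs moved"

    ids = tokenizer.encode(sample)
    assert ids[0] == tokenizer.bos_id and ids[-1] == tokenizer.eos_id
    # encode() lowercases its input, so compare against the lowercased sample.
    assert tokenizer.decode(ids) == sample.lower(), "round-trip mismatch"


class WordLevelStub:
    """Hypothetical stand-in exposing the same interface as JointBPETokenizer."""

    pad_id, unk_id, bos_id, eos_id = 0, 1, 2, 3

    def __init__(self, words: List[str]):
        specials = ["<pad>", "<unk>", "<bos>", "<eos>"]
        self.token_to_id: Dict[str, int] = {
            t: i for i, t in enumerate(specials + list(words))
        }
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text: str, add_special_tokens: bool = True) -> List[int]:
        ids = [self.token_to_id.get(w, self.unk_id) for w in text.lower().split()]
        return [self.bos_id] + ids + [self.eos_id] if add_special_tokens else ids

    def decode(self, ids: List[int], skip_special_tokens: bool = True) -> str:
        # The stub always drops special tokens (IDs 0-3).
        return " ".join(self.id_to_token[i] for i in ids if i > 3)


stub = WordLevelStub(["the", "dog", "runs", "in", "park"])
verify_tokenizer(stub, max_vocab_size=8000, sample="The dog runs in the park")
print("all checks passed")
```

With the real tokenizer the call would be `verify_tokenizer(tokenizer, 8000, "The dog runs in the park")`. If the round-trip check fails, printing the decoded string next to the sample usually shows whether casing, spacing, or an unknown symbol caused it.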

Summary

Tokenizer Setup Checklist

  • Train joint BPE tokenizer on combined DE+EN data
  • Verify vocabulary size (recommend 8000 for Multi30k)
  • Check special token IDs (pad=0, unk=1, bos=2, eos=3)
  • Test encode/decode round-trip
  • Analyze tokenization statistics
  • Save tokenizer for reproducibility

Key Configuration

Parameter        Recommended Value     Reasoning
───────────────  ────────────────────  ──────────────────────────────
vocab_size       8000                  Good balance for small dataset
min_frequency    2                     Avoid noise from rare patterns
special_tokens   pad, unk, bos, eos    Standard translation setup

Files to Save

```text
data/tokenizer/
├── tokenizer.json     # Vocabulary and merges
└── config.json        # Training configuration
```
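The `save()` method shown earlier writes only `tokenizer.json`, so `config.json` has to be written separately. One possible way is sketched below. The function name `save_config` and the choice of fields are assumptions, since this section does not specify what the configuration file should contain.

```python
import json
from pathlib import Path
from typing import List


def save_config(output_dir: Path, vocab_size: int, min_frequency: int) -> Path:
    """Write the training configuration next to tokenizer.json."""
    output_dir.mkdir(parents=True, exist_ok=True)
    config_path = output_dir / "config.json"
    special_tokens: List[str] = ["<pad>", "<unk>", "<bos>", "<eos>"]
    config = {
        "vocab_size": vocab_size,          # assumed field names
        "min_frequency": min_frequency,
        "special_tokens": special_tokens,
    }
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config_path
```

Calling `save_config(Path("data/tokenizer"), vocab_size=8000, min_frequency=2)` after training would produce the second file in the tree above.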

Exercises

Implementation

  1. Add byte-level BPE fallback for truly unknown characters.
  2. Implement SentencePiece-style tokenization as alternative.
  3. Create tokenizer comparison tool (BPE vs WordPiece).

Analysis

  1. Compare vocabulary overlap between German and English tokens.
  2. Find words that get heavily split vs words that stay whole.

Next Section Preview

In the next section, we'll cover the Data Loading Pipeline: creating PyTorch datasets and dataloaders for efficient training.
