For machine translation, we need a tokenizer that can handle both source and target languages effectively. This section covers building a joint BPE tokenizer for German-English translation using our implementation from Chapter 5.
3.1 Tokenizer Strategy for Translation
Joint vs Separate Vocabularies
```text
OPTION 1: JOINT VOCABULARY (Recommended)
════════════════════════════════════════

Train single tokenizer on combined source + target data.

Advantages:
  + Shared subwords for cognates (similar words)
    German "Haus" and English "house" may share subwords
  + Single embedding matrix
  + Simpler architecture
  + Works well for related languages

Disadvantages:
  - May waste vocabulary on one language
  - Less optimal for very different languages


OPTION 2: SEPARATE VOCABULARIES
═══════════════════════════════

Train separate tokenizer for each language.

Advantages:
  + Optimal vocabulary per language
  + Better for distant language pairs (EN-ZH)

Disadvantages:
  - Larger total vocabulary
  - No shared representations
  - More complex architecture


RECOMMENDATION FOR DE-EN:
═════════════════════════

Use JOINT vocabulary because:
- German and English share many cognates
- Both use Latin script
- Shared subwords help alignment learning
- Simpler implementation

Typical sizes:
  - 8,000 tokens: Fast training, good for small data
  - 16,000 tokens: Good balance
  - 32,000 tokens: Better quality, slower training
```
3.2 Building Joint BPE Tokenizer
Training Pipeline
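Before the full class, it may help to see what a few rounds of BPE merging do on a toy corpus. The sketch below is only an illustration under simple assumptions: words are lowercased, each word ends with a `</w>` marker, and ties between equally frequent pairs go to the pair seen first. The names `merge_pair` and `learn_merges` are made up for this example and are not the Chapter 5 API.

```python
from collections import Counter
from typing import Dict, List, Tuple


def merge_pair(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace each adjacent occurrence of `pair` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` BPE merges from a list of words."""
    word_freqs = Counter(w.lower() for w in words)
    # Every word starts as a list of characters plus an end-of-word marker.
    splits: Dict[str, List[str]] = {w: list(w) + ["</w>"] for w in word_freqs}
    merges: List[Tuple[str, str]] = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_freqs: Counter = Counter()
        for word, symbols in splits.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += word_freqs[word]
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)
        splits = {w: merge_pair(s, best) for w, s in splits.items()}

    return merges


# German and English words mixed together, as in a joint vocabulary.
toy_corpus = ["haus", "haus", "house", "maus", "mouse"]
toy_merges = learn_merges(toy_corpus, num_merges=3)
print(toy_merges)
```

On this corpus the first learned merge is `('u', 's')`, because that pair occurs in all five words regardless of language. This is the effect the joint vocabulary relies on: frequent character sequences shared by German and English end up as a single token.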
```python
from collections import Counter
import json
from pathlib import Path
from typing import Dict, List, Tuple, Union


class JointBPETokenizer:
    """
    Joint BPE tokenizer for translation.

    Trains on combined source and target data to create
    a shared vocabulary for both languages.

    Special tokens:
        - <pad>: Padding
        - <unk>: Unknown
        - <bos>: Beginning of sequence
        - <eos>: End of sequence

    Args:
        vocab_size: Target vocabulary size
        min_frequency: Minimum frequency for BPE merges

    Example:
        >>> tokenizer = JointBPETokenizer(vocab_size=8000)
        >>> tokenizer.train(german_sentences + english_sentences)
        >>> tokens = tokenizer.encode("Hello world")
    """

    # Special tokens
    PAD_TOKEN = "<pad>"
    UNK_TOKEN = "<unk>"
    BOS_TOKEN = "<bos>"
    EOS_TOKEN = "<eos>"

    SPECIAL_TOKENS = [PAD_TOKEN, UNK_TOKEN, BOS_TOKEN, EOS_TOKEN]

    # End-of-word marker appended to every word before merging
    END_OF_WORD = "</w>"

    def __init__(
        self,
        vocab_size: int = 8000,
        min_frequency: int = 2
    ):
        self.vocab_size = vocab_size
        self.min_frequency = min_frequency

        # Initialize empty vocabulary
        self.token_to_id: Dict[str, int] = {}
        self.id_to_token: Dict[int, str] = {}
        self.merges: List[Tuple[str, str]] = []

        # Add special tokens
        self._init_special_tokens()

    def _init_special_tokens(self):
        """Initialize special token mappings."""
        for i, token in enumerate(self.SPECIAL_TOKENS):
            self.token_to_id[token] = i
            self.id_to_token[i] = token

    @property
    def pad_id(self) -> int:
        return self.token_to_id[self.PAD_TOKEN]

    @property
    def unk_id(self) -> int:
        return self.token_to_id[self.UNK_TOKEN]

    @property
    def bos_id(self) -> int:
        return self.token_to_id[self.BOS_TOKEN]

    @property
    def eos_id(self) -> int:
        return self.token_to_id[self.EOS_TOKEN]

    # ------------------------------------------------------------------
    # BPE helpers.
    # ASSUMPTION: the chapter relies on the Chapter 5 implementation for
    # these methods, which is not reproduced in this section. The versions
    # below are minimal stand-ins so the class runs on its own; swap in
    # the Chapter 5 code if you have it.
    # ------------------------------------------------------------------
    def _get_word_frequencies(self, sentences: List[str]) -> Counter:
        """Count lowercased, whitespace-separated words (matches encode())."""
        return Counter(w for s in sentences for w in s.lower().split())

    def _get_unique_chars(self, words) -> List[str]:
        """Base symbols: every character seen, plus the end-of-word marker."""
        chars = {ch for word in words for ch in word}
        return sorted(chars) + [self.END_OF_WORD]

    def _initialize_splits(self, word_freqs: Dict[str, int]) -> Dict[str, List[str]]:
        """Split each word into characters followed by the end-of-word marker."""
        return {w: list(w) + [self.END_OF_WORD] for w in word_freqs}

    def _compute_pair_frequencies(self, splits, word_freqs) -> Counter:
        """Count adjacent symbol pairs, weighted by word frequency."""
        pair_freqs: Counter = Counter()
        for word, symbols in splits.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += word_freqs[word]
        return pair_freqs

    @staticmethod
    def _merge_symbols(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
        """Fuse every adjacent occurrence of `pair` in one symbol list."""
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        return merged

    def _apply_merge(self, splits, pair):
        """Apply one merge to every word."""
        return {w: self._merge_symbols(s, pair) for w, s in splits.items()}

    def _build_vocabulary(self, splits):
        """Assign IDs to base symbols and merged tokens, after the specials."""
        # `splits` is keyed by word, so its keys give us the base characters.
        tokens = self._get_unique_chars(splits) + [a + b for a, b in self.merges]
        for token in tokens:
            if token not in self.token_to_id:
                idx = len(self.token_to_id)
                self.token_to_id[token] = idx
                self.id_to_token[idx] = token

    def _tokenize_word(self, word: str) -> List[str]:
        """Split a word into characters, then replay the merges in order."""
        symbols = list(word) + [self.END_OF_WORD]
        for pair in self.merges:
            symbols = self._merge_symbols(symbols, pair)
        return symbols

    def train(
        self,
        sentences: List[str],
        verbose: bool = True
    ):
        """
        Train BPE tokenizer on sentences.

        Args:
            sentences: List of training sentences
            verbose: Print progress
        """
        if verbose:
            print("Training BPE tokenizer...")
            print(f"  Sentences: {len(sentences):,}")
            print(f"  Target vocab size: {self.vocab_size}")

        # Step 1: Build initial vocabulary (characters)
        word_freqs = self._get_word_frequencies(sentences)
        splits = self._initialize_splits(word_freqs)

        if verbose:
            print(f"  Unique words: {len(word_freqs):,}")

        # Step 2: Learn BPE merges
        # Budget = target size minus special tokens minus base symbols.
        num_merges = max(
            0,
            self.vocab_size
            - len(self.token_to_id)
            - len(self._get_unique_chars(word_freqs)),
        )

        for i in range(num_merges):
            # Find best pair
            pair_freqs = self._compute_pair_frequencies(splits, word_freqs)

            if not pair_freqs:
                break

            best_pair = max(pair_freqs, key=pair_freqs.get)

            if pair_freqs[best_pair] < self.min_frequency:
                break

            # Apply merge
            self.merges.append(best_pair)
            splits = self._apply_merge(splits, best_pair)

            if verbose and (i + 1) % 500 == 0:
                print(f"  Merge {i+1}: {best_pair[0]} + {best_pair[1]}")

        # Step 3: Build final vocabulary
        self._build_vocabulary(splits)

        if verbose:
            print(f"  Final vocabulary size: {len(self.token_to_id)}")
            print("Training complete!")

    def encode(
        self,
        text: str,
        add_special_tokens: bool = True
    ) -> List[int]:
        """
        Encode text to token IDs.

        Note: the text is lowercased and split on whitespace first.

        Args:
            text: Input text
            add_special_tokens: Add BOS/EOS tokens

        Returns:
            List of token IDs
        """
        words = text.lower().split()
        token_ids = []

        if add_special_tokens:
            token_ids.append(self.bos_id)

        for word in words:
            # Tokenize word
            word_tokens = self._tokenize_word(word)

            for token in word_tokens:
                # Anything outside the vocabulary maps to <unk>
                token_ids.append(self.token_to_id.get(token, self.unk_id))

        if add_special_tokens:
            token_ids.append(self.eos_id)

        return token_ids

    def decode(
        self,
        token_ids: List[int],
        skip_special_tokens: bool = True
    ) -> str:
        """
        Decode token IDs to text.

        Args:
            token_ids: List of token IDs
            skip_special_tokens: Skip special tokens

        Returns:
            Decoded text
        """
        tokens = []

        for token_id in token_ids:
            token = self.id_to_token.get(token_id, self.UNK_TOKEN)

            if skip_special_tokens and token in self.SPECIAL_TOKENS:
                continue

            tokens.append(token)

        # Join and clean up: each end-of-word marker becomes a space
        text = ''.join(tokens)
        text = text.replace(self.END_OF_WORD, ' ')
        text = text.strip()

        return text

    def save(self, path: Union[str, Path]):
        """Save tokenizer to file."""
        data = {
            'vocab_size': self.vocab_size,
            'min_frequency': self.min_frequency,
            'token_to_id': self.token_to_id,
            'merges': self.merges,
        }

        with open(path, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)

        print(f"Saved tokenizer to {path}")

    @classmethod
    def load(cls, path: Union[str, Path]) -> 'JointBPETokenizer':
        """Load tokenizer from file."""
        with open(path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        tokenizer = cls(
            vocab_size=data['vocab_size'],
            min_frequency=data['min_frequency']
        )

        tokenizer.token_to_id = data['token_to_id']
        # JSON stores tuples as lists, so convert the merges back
        tokenizer.merges = [tuple(m) for m in data['merges']]

        # Rebuild id_to_token properly
        tokenizer.id_to_token = {v: k for k, v in tokenizer.token_to_id.items()}

        print(f"Loaded tokenizer from {path}")
        return tokenizer
```
3.3 Training Tokenizer on Multi30k
Full Training Script
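The script below reads `train.de` and `train.en` one sentence per line. Two details are easy to get wrong at this step: the files contain umlauts, so the encoding should be stated explicitly instead of relying on the platform default, and the two files must stay line-aligned for translation. A minimal sketch of a loader that handles both, assuming UTF-8 files; `load_parallel` is a hypothetical helper, not part of the chapter's code.

```python
from pathlib import Path
from typing import List, Tuple


def load_parallel(src_path: Path, tgt_path: Path) -> Tuple[List[str], List[str]]:
    """Read two line-aligned files and check that they have the same length."""
    with open(src_path, "r", encoding="utf-8") as f:
        src = [line.strip() for line in f]
    with open(tgt_path, "r", encoding="utf-8") as f:
        tgt = [line.strip() for line in f]

    # A mismatch means sentence i in one file no longer translates
    # sentence i in the other, so fail early.
    if len(src) != len(tgt):
        raise ValueError(
            f"Line count mismatch: {src_path} has {len(src)}, "
            f"{tgt_path} has {len(tgt)}"
        )
    return src, tgt


# Usage (paths follow the layout used in this section):
# german, english = load_parallel(Path("data/multi30k/train.de"),
#                                 Path("data/multi30k/train.en"))
```

For training the tokenizer alone the alignment does not matter, since the two lists are simply concatenated, but the same files feed the translation model later, so the check costs little here.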
```python
# train_tokenizer.py

from pathlib import Path

# JointBPETokenizer is the class from section 3.2. The module name below is
# a placeholder (an assumption); import it from wherever you saved the class.
from joint_bpe_tokenizer import JointBPETokenizer


def main():
    # Configuration
    VOCAB_SIZE = 8000
    MIN_FREQUENCY = 2
    DATA_DIR = Path("data/multi30k")
    OUTPUT_DIR = Path("data/tokenizer")

    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    # Load training data (explicit UTF-8: the German file contains umlauts)
    print("Loading training data...")

    with open(DATA_DIR / "train.de", 'r', encoding='utf-8') as f:
        german = [line.strip() for line in f]

    with open(DATA_DIR / "train.en", 'r', encoding='utf-8') as f:
        english = [line.strip() for line in f]

    # Combine for joint vocabulary
    combined = german + english
    print(f"  German sentences: {len(german):,}")
    print(f"  English sentences: {len(english):,}")
    print(f"  Combined: {len(combined):,}")

    # Train tokenizer
    tokenizer = JointBPETokenizer(
        vocab_size=VOCAB_SIZE,
        min_frequency=MIN_FREQUENCY
    )

    tokenizer.train(combined, verbose=True)

    # Save tokenizer
    tokenizer.save(OUTPUT_DIR / "tokenizer.json")

    # Print statistics
    print("\nTokenizer Statistics:")
    print(f"  Vocabulary size: {len(tokenizer.token_to_id)}")
    print(f"  Number of merges: {len(tokenizer.merges)}")

    # Test on samples
    print("\nSample tokenizations:")

    samples = [
        ("DE", "Ein Hund läuft im Park."),
        ("EN", "A dog runs in the park."),
        ("DE", "Das Wetter ist schön heute."),
        ("EN", "The weather is nice today."),
    ]

    for lang, text in samples:
        tokens = tokenizer.encode(text, add_special_tokens=False)
        decoded = tokenizer.decode(tokens)
        print(f"  [{lang}] {text}")
        print(f"    Tokens: {tokens[:10]}...")
        print(f"    Length: {len(tokens)}")
        print(f"    Decoded: {decoded}")

    print("\nTokenizer training complete!")


if __name__ == "__main__":
    main()
```
3.4 Vocabulary Analysis
Analyzing the Tokenizer
```python
from typing import Dict, List


def analyze_tokenizer(tokenizer: JointBPETokenizer):
    """
    Analyze tokenizer properties.
    """
    print("Tokenizer Analysis")
    print("=" * 60)

    # Basic stats
    print("\nBasic Statistics:")
    print(f"  Total vocabulary: {len(tokenizer.token_to_id)}")
    print(f"  Number of merges: {len(tokenizer.merges)}")
    print(f"  Special tokens: {len(tokenizer.SPECIAL_TOKENS)}")

    # Token length distribution, measured in real characters:
    # the '</w>' end-of-word marker is not counted.
    token_lengths = [len(t.replace('</w>', '')) for t in tokenizer.token_to_id
                     if t not in tokenizer.SPECIAL_TOKENS and t != '</w>']

    if token_lengths:
        print("\nToken Length Distribution:")
        print(f"  Average: {sum(token_lengths)/len(token_lengths):.2f} chars")
        print(f"  Max: {max(token_lengths)} chars")
        print(f"  Min: {min(token_lengths)} chars")

    # Count single characters vs subwords
    single_chars = sum(1 for t in tokenizer.token_to_id
                       if len(t) == 1 and t not in tokenizer.SPECIAL_TOKENS)
    subwords = len(tokenizer.token_to_id) - single_chars - len(tokenizer.SPECIAL_TOKENS)

    print("\nToken Types:")
    print(f"  Single characters: {single_chars}")
    print(f"  Subwords: {subwords}")
    print(f"  Special tokens: {len(tokenizer.SPECIAL_TOKENS)}")


def analyze_tokenization_quality(
    tokenizer: JointBPETokenizer,
    sentences: List[str]
) -> Dict:
    """
    Analyze tokenization quality on a corpus.

    Returns statistics about token counts, unknown rates, etc.
    """
    total_tokens = 0
    total_unk = 0
    tokens_per_word = []

    for sentence in sentences:
        words = sentence.split()
        tokens = tokenizer.encode(sentence, add_special_tokens=False)

        total_tokens += len(tokens)
        total_unk += sum(1 for t in tokens if t == tokenizer.unk_id)
        tokens_per_word.append(len(tokens) / max(len(words), 1))

    # max(..., 1) keeps the averages defined for an empty corpus
    num_sentences = max(len(sentences), 1)

    return {
        'total_sentences': len(sentences),
        'total_tokens': total_tokens,
        'avg_tokens_per_sentence': total_tokens / num_sentences,
        'avg_tokens_per_word': sum(tokens_per_word) / max(len(tokens_per_word), 1),
        'unknown_rate': total_unk / total_tokens if total_tokens > 0 else 0,
    }
```
3.5 Special Token Handling
Proper Use of Special Tokens
```text
SPECIAL TOKENS FOR TRANSLATION:
═══════════════════════════════

<pad> (ID: 0)
─────────────
Purpose: Padding sequences to equal length
Used in: Both source and target batches
Example:
  Batch: [[1,2,3,4], [1,2,0,0]]
                          ↑ padding

<unk> (ID: 1)
─────────────
Purpose: Unknown/out-of-vocabulary tokens
Used in: Rare words not in vocabulary
Note: High UNK rate indicates vocabulary issues

<bos> (ID: 2)
─────────────
Purpose: Beginning of sequence marker
Used in: Start of target sequences
Example:
  Target input:  [<bos>, word1, word2, ...]
  Target output: [word1, word2, ..., <eos>]

<eos> (ID: 3)
─────────────
Purpose: End of sequence marker
Used in: End of target sequences
Signals: Generation should stop


TRAINING DATA FORMAT:
═════════════════════

Source:        [src_word1, src_word2, ..., src_wordN]
               (No BOS/EOS needed for source)

Target Input:  [<bos>, tgt_word1, tgt_word2, ..., tgt_wordN]
Target Output: [tgt_word1, tgt_word2, ..., tgt_wordN, <eos>]


INFERENCE:
══════════

1. Encode source: [src_tokens]
2. Start with: [<bos>]
3. Generate until <eos> or max_length
4. Remove <bos> and <eos> from final output
```
Code Example
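The four inference steps can be sketched as a greedy loop. This is an illustration only: `next_token_fn` stands in for a trained translation model, which this section does not define, and `copy_model` is a toy replacement that echoes the source so the loop can run. The IDs 2 and 3 follow the `<bos>`/`<eos>` assignment listed above.

```python
from typing import Callable, List


def greedy_decode(
    next_token_fn: Callable[[List[int], List[int]], int],
    source_ids: List[int],
    bos_id: int,
    eos_id: int,
    max_length: int = 50,
) -> List[int]:
    """Generate target IDs one at a time until <eos> or max_length."""
    generated = [bos_id]                      # step 2: start with <bos>
    while len(generated) < max_length:        # step 3: stop at max_length...
        next_id = next_token_fn(source_ids, generated)
        generated.append(next_id)
        if next_id == eos_id:                 # ...or as soon as <eos> appears
            break

    # Step 4: strip <bos> and, if present, the trailing <eos>.
    output = generated[1:]
    if output and output[-1] == eos_id:
        output = output[:-1]
    return output


def copy_model(source_ids: List[int], generated: List[int]) -> int:
    """Toy stand-in for a model: copy the source, then emit <eos> (ID 3)."""
    position = len(generated) - 1  # tokens produced so far, excluding <bos>
    return source_ids[position] if position < len(source_ids) else 3


result = greedy_decode(copy_model, [10, 11, 12], bos_id=2, eos_id=3)
print(result)  # [10, 11, 12]
```

The training-side counterpart, which builds the shifted decoder input and labels, follows.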
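```python
from typing import Dict, List

import torch


def pad_sequence(sequences: List[List[int]], pad_id: int) -> torch.Tensor:
    """Right-pad ID lists to a common length; returns a (batch, max_len) tensor.

    ASSUMPTION: the original does not define pad_sequence. This is a local
    helper for plain Python lists, not torch.nn.utils.rnn.pad_sequence
    (which expects a list of tensors).
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return torch.tensor(padded, dtype=torch.long)


def prepare_batch(source_ids, target_ids, pad_id, bos_id, eos_id) -> Dict[str, torch.Tensor]:
    """
    Prepare source and target for training.
    """
    # Source: just pad to max length
    src_padded = pad_sequence(source_ids, pad_id)

    # Target: add BOS at start, EOS at end
    tgt_with_special = [[bos_id] + tgt + [eos_id] for tgt in target_ids]
    tgt_padded = pad_sequence(tgt_with_special, pad_id)

    # For training:
    # - Input to decoder: tgt_padded[:, :-1] (includes BOS, excludes last column)
    # - Labels: tgt_padded[:, 1:] (excludes BOS, includes EOS)
    # Shorter sequences keep their EOS in the decoder input; the label at that
    # position is <pad>, so exclude padding from the loss
    # (e.g. ignore_index=pad_id in torch.nn.CrossEntropyLoss).

    return {
        'source': src_padded,
        'target_input': tgt_padded[:, :-1],
        'target_output': tgt_padded[:, 1:],
    }
```
3.6 Complete Tokenizer Setup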
Full Setup Workflow
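The round-trip test in step 2 of the workflow below cannot be an exact string comparison, because `encode()` lowercases its input and splits on whitespace. A small sketch of what the check could compare against, assuming no token maps to `<unk>`; `TokenizerLike`, `expected_round_trip`, `check_round_trip`, and the character-level `ToyTokenizer` are illustrative names, not part of the chapter's code.

```python
from typing import List, Protocol


class TokenizerLike(Protocol):
    """Anything with encode/decode, e.g. the JointBPETokenizer from 3.2."""

    def encode(self, text: str) -> List[int]: ...

    def decode(self, token_ids: List[int]) -> str: ...


def expected_round_trip(text: str) -> str:
    """Lowercase and collapse whitespace, mirroring encode()'s normalization."""
    return " ".join(text.lower().split())


def check_round_trip(tokenizer: TokenizerLike, text: str) -> bool:
    """True if decode(encode(text)) matches the normalized input."""
    return tokenizer.decode(tokenizer.encode(text)) == expected_round_trip(text)


class ToyTokenizer:
    """Character-level stand-in so the check can run without a trained model."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in " ".join(text.lower().split())]

    def decode(self, token_ids: List[int]) -> str:
        return "".join(chr(i) for i in token_ids)


print(expected_round_trip("The  dog runs in the park"))
print(check_round_trip(ToyTokenizer(), "The  dog runs in the park"))
```

With a trained tokenizer, a `False` result on in-domain text usually points at characters that fell outside the vocabulary and were replaced by `<unk>`.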
```text
FULL SETUP WORKFLOW:
════════════════════

# Placeholders used below: load_file() reads one sentence per line
# (not defined in this section); mean is statistics.mean.

1. TRAIN TOKENIZER:
   ────────────────
   # Load all training data
   german = load_file('train.de')
   english = load_file('train.en')

   # Create and train
   tokenizer = JointBPETokenizer(vocab_size=8000)
   tokenizer.train(german + english)

   # Save
   tokenizer.save('data/tokenizer/tokenizer.json')


2. VERIFY TOKENIZER:
   ─────────────────
   # Check vocabulary size
   assert len(tokenizer.token_to_id) <= 8000

   # Check special tokens
   assert tokenizer.pad_id == 0
   assert tokenizer.unk_id == 1
   assert tokenizer.bos_id == 2
   assert tokenizer.eos_id == 3

   # Test round-trip
   text = "The dog runs in the park"
   ids = tokenizer.encode(text)
   decoded = tokenizer.decode(ids)
   # decoded should equal text.lower() (encode lowercases its input),
   # provided no token maps to <unk>


3. COMPUTE STATISTICS:
   ───────────────────
   # For each split
   for split in ['train', 'val', 'test']:
       german = load_file(f'{split}.de')
       english = load_file(f'{split}.en')

       # Average tokens per sentence
       de_tokens = [len(tokenizer.encode(s)) for s in german]
       en_tokens = [len(tokenizer.encode(s)) for s in english]

       print(f"{split}:")
       print(f"  DE avg tokens: {mean(de_tokens):.1f}")
       print(f"  EN avg tokens: {mean(en_tokens):.1f}")
```
Recommended Settings
```python
VOCAB_SIZE = 8000
# Reasoning:
# - Multi30k has ~30K sentences
# - Limited vocabulary (image descriptions)
# - 8K captures most patterns
# - Faster training than larger vocab

MIN_FREQUENCY = 2
# Reasoning:
# - Avoid single-occurrence artifacts
# - Keep meaningful subwords


# EXPECTED STATISTICS:
# ════════════════════
#
# With vocab_size=8000 on Multi30k:
# - Avg tokens/word: ~1.3
# - Avg tokens/sentence: ~15 (after BOS/EOS)
# - Unknown rate: < 1%
```
Summary
Tokenizer Setup Checklist
- Train joint BPE tokenizer on combined DE+EN data
- Verify vocabulary size (recommend 8000 for Multi30k)
- Check special token IDs (pad=0, unk=1, bos=2, eos=3)
- Test encode/decode round-trip
- Analyze tokenization statistics
- Save tokenizer for reproducibility
Key Configuration
| Parameter | Recommended Value | Reasoning |
|---|---|---|
| vocab_size | 8000 | Good balance for small dataset |
| min_frequency | 2 | Avoid noise from rare patterns |
| special_tokens | pad,unk,bos,eos | Standard translation setup |
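The settings in the table can be stored next to the tokenizer so that a later run can confirm what was used. A minimal sketch, assuming a flat JSON layout: this section does not specify what goes into `config.json`, so the field names and the helpers `save_config` and `load_config` are illustrative.

```python
import json
from pathlib import Path

# Values from the Key Configuration table; the key names are an assumption.
config = {
    "vocab_size": 8000,
    "min_frequency": 2,
    "special_tokens": ["<pad>", "<unk>", "<bos>", "<eos>"],
}


def save_config(config: dict, path: Path) -> None:
    """Write the training configuration as UTF-8 JSON."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=2)


def load_config(path: Path) -> dict:
    """Read the training configuration back."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Usage:
# save_config(config, Path("data/tokenizer/config.json"))
# assert load_config(Path("data/tokenizer/config.json")) == config
```

Keeping the special tokens in list order matters here, since their positions determine the IDs 0 to 3.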
Files to Save
```text
data/tokenizer/
├── tokenizer.json    # Vocabulary and merges
└── config.json       # Training configuration
```
Exercises
Implementation
- Add a byte-level BPE fallback for truly unknown characters.
- Implement SentencePiece-style tokenization as an alternative.
- Create a tokenizer comparison tool (BPE vs WordPiece).
Analysis
- Compare vocabulary overlap between German and English tokens.
- Find words that get heavily split vs words that stay whole.
Next Section Preview
In the next section, we'll cover the Data Loading Pipeline: creating PyTorch datasets and dataloaders for efficient training.