Introduction
While our from-scratch BPE implementation helps understand the algorithm, production systems use optimized libraries. SentencePiece is the industry standard—it's fast, memory-efficient, and handles edge cases that our simple implementation doesn't.
This section covers training and using SentencePiece for our German-English translation project.
4.1 Why SentencePiece?
Advantages Over Custom Implementation
| Feature | Our Implementation | SentencePiece |
|---|---|---|
| Training speed | Slow (Python) | Fast (C++) |
| Memory usage | High (Python objects) | Low (optimized C++) |
| Unicode handling | Basic | Complete |
| Pre-tokenization | Whitespace only | Language-aware |
| Algorithms | BPE only | BPE, Unigram, Word, Char |
| Production-ready | No | Yes |
| Used by | Educational | T5, ALBERT, LLaMA |
Key Features
- Language-agnostic: No pre-tokenization required
- Robust Unicode handling: covers arbitrary text (with an optional byte fallback for unseen characters)
- Deterministic: Same input → same output
- Memory-mapped: Efficient for large vocabularies
- Multiple algorithms: BPE, Unigram, and more
4.2 Installation and Setup
Installing SentencePiece
```bash
# Install via pip
pip install sentencepiece

# Verify installation
python -c "import sentencepiece; print(sentencepiece.__version__)"
```

Basic Usage Pattern
```python
import sentencepiece as spm

# Training
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",
    vocab_size=32000
)

# Loading and using
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encoding
tokens = sp.encode("Hello world", out_type=str)
ids = sp.encode("Hello world", out_type=int)

# Decoding
text = sp.decode(ids)
```

4.3 Training a Tokenizer
Preparing Training Data
```python
import sentencepiece as spm
from pathlib import Path


def prepare_corpus(texts: list, output_path: str) -> None:
    """
    Write texts to file for SentencePiece training.

    Args:
        texts: List of text strings
        output_path: Path to output file
    """
    with open(output_path, "w", encoding="utf-8") as f:
        for text in texts:
            # One sentence per line
            f.write(text.strip() + "\n")


# Example: Prepare German and English data
german_texts = [
    "Der Hund läuft im Park.",
    "Die Katze schläft auf dem Sofa.",
    "Das Wetter ist heute schön.",
    "Ich gehe morgen einkaufen.",
    "Die Kinder spielen draußen.",
]

english_texts = [
    "The dog runs in the park.",
    "The cat sleeps on the sofa.",
    "The weather is nice today.",
    "I am going shopping tomorrow.",
    "The children are playing outside.",
]

# For translation, we typically combine source and target
combined_texts = german_texts + english_texts

prepare_corpus(combined_texts, "translation_corpus.txt")
print(f"Wrote {len(combined_texts)} sentences to corpus file")
```

Training Options
```python
def train_sentencepiece(
    input_file: str,
    model_prefix: str,
    vocab_size: int = 32000,
    model_type: str = "bpe",
    character_coverage: float = 0.9995,
    pad_id: int = 0,
    unk_id: int = 1,
    bos_id: int = 2,
    eos_id: int = 3
) -> None:
    """
    Train a SentencePiece model.

    Args:
        input_file: Path to training corpus
        model_prefix: Output model name prefix
        vocab_size: Target vocabulary size
        model_type: 'bpe', 'unigram', 'word', or 'char'
        character_coverage: Fraction of characters to cover (e.g., 0.9995)
        pad_id, unk_id, bos_id, eos_id: Special token IDs
    """
    spm.SentencePieceTrainer.train(
        input=input_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=character_coverage,
        pad_id=pad_id,
        unk_id=unk_id,
        bos_id=bos_id,
        eos_id=eos_id,
        # Additional useful options
        num_threads=4,
        train_extremely_large_corpus=False,  # Set True for huge corpora
        max_sentence_length=4192,
        shuffle_input_sentence=True,
        # Control special tokens
        pad_piece="<pad>",
        unk_piece="<unk>",
        bos_piece="<bos>",
        eos_piece="<eos>",
    )

    print(f"Trained model: {model_prefix}.model")
    print(f"Vocabulary file: {model_prefix}.vocab")


# Train on our corpus. SentencePiece raises an error if vocab_size
# exceeds what the corpus can support, so this toy corpus needs a
# much smaller vocabulary than a real project would use.
train_sentencepiece(
    input_file="translation_corpus.txt",
    model_prefix="de_en_bpe",
    vocab_size=100,  # Tiny, for this toy corpus; real projects use 8k-64k
    model_type="bpe"
)
```

Understanding Training Output
After training, SentencePiece creates two files:
- de_en_bpe.model: binary model file (used for encoding/decoding)
- de_en_bpe.vocab: human-readable vocabulary (useful for inspection)

Inspect the vocabulary:
```python
def inspect_vocabulary(vocab_file: str, num_tokens: int = 20) -> None:
    """Print sample tokens from vocabulary."""
    print(f"First {num_tokens} tokens in vocabulary:\n")

    with open(vocab_file, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= num_tokens:
                break
            token, score = line.strip().split("\t")
            print(f"  {i:4d}: '{token}' (score: {score})")


inspect_vocabulary("de_en_bpe.vocab")
```

Output:
```
First 20 tokens in vocabulary:

    0: '<pad>' (score: 0)
    1: '<unk>' (score: 0)
    2: '<bos>' (score: 0)
    3: '<eos>' (score: 0)
    4: '▁' (score: -0)
    5: '.' (score: -1)
    6: 'e' (score: -2)
    7: 'n' (score: -3)
    8: '▁the' (score: -4)
    9: 'er' (score: -5)
   10: '▁The' (score: -6)
  ...
```

Note: ▁ (U+2581) is SentencePiece's word-boundary marker. It replaces the space that preceded each word, so encoding is fully reversible. It plays a role similar to our `</w>` marker, except that it marks the beginning of a word rather than the end.
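Because ▁ simply stands in for the space that preceded each word, detokenization reduces to string concatenation. A minimal sketch of the idea (this is not SentencePiece's actual decoder, just the core mechanism):

```python
def detokenize(pieces):
    """Concatenate pieces and turn ▁ word-boundary markers back into spaces."""
    return "".join(pieces).replace("\u2581", " ").strip()


# Pieces as SentencePiece might produce them for a short sentence
pieces = ["\u2581The", "\u2581dog", "\u2581run", "s", "."]
print(detokenize(pieces))  # The dog runs.
```

Note how no special casing is needed to decide where spaces go: the markers carry that information, which is what makes the round trip lossless.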
4.4 Using the Trained Model
Basic Encoding and Decoding
```python
class SentencePieceTokenizer:
    """
    Wrapper around SentencePiece for convenient usage.

    Example:
        >>> tokenizer = SentencePieceTokenizer("model.model")
        >>> ids = tokenizer.encode("Hello world")
        >>> text = tokenizer.decode(ids)
    """

    def __init__(self, model_path: str):
        """Load a trained SentencePiece model."""
        self.sp = spm.SentencePieceProcessor()
        self.sp.load(model_path)

    def encode(
        self,
        text: str,
        add_bos: bool = False,
        add_eos: bool = False
    ) -> list:
        """
        Encode text to token IDs.

        Args:
            text: Input text
            add_bos: Prepend BOS token
            add_eos: Append EOS token

        Returns:
            List of token IDs
        """
        ids = self.sp.encode(text, out_type=int)

        if add_bos:
            ids = [self.bos_id] + ids
        if add_eos:
            ids = ids + [self.eos_id]

        return ids

    def decode(self, ids: list) -> str:
        """Decode token IDs to text."""
        return self.sp.decode(ids)

    def tokenize(self, text: str) -> list:
        """Get tokens as strings (for inspection)."""
        return self.sp.encode(text, out_type=str)

    def detokenize(self, tokens: list) -> str:
        """Convert token strings back to text."""
        return self.sp.decode(tokens)

    @property
    def vocab_size(self) -> int:
        return self.sp.get_piece_size()

    @property
    def pad_id(self) -> int:
        return self.sp.pad_id()

    @property
    def unk_id(self) -> int:
        return self.sp.unk_id()

    @property
    def bos_id(self) -> int:
        return self.sp.bos_id()

    @property
    def eos_id(self) -> int:
        return self.sp.eos_id()


# Test the tokenizer
tokenizer = SentencePieceTokenizer("de_en_bpe.model")

print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Special tokens: PAD={tokenizer.pad_id}, UNK={tokenizer.unk_id}, "
      f"BOS={tokenizer.bos_id}, EOS={tokenizer.eos_id}")

# Test sentences
test_sentences = [
    "The dog runs in the park.",
    "Der Hund läuft im Park.",
    "This is an unknown word: xyzzy.",
]

print("\nEncoding examples:")
for sentence in test_sentences:
    tokens = tokenizer.tokenize(sentence)
    ids = tokenizer.encode(sentence)
    ids_special = tokenizer.encode(sentence, add_bos=True, add_eos=True)
    decoded = tokenizer.decode(ids)

    print(f"\n  Input: '{sentence}'")
    print(f"  Tokens: {tokens}")
    print(f"  IDs: {ids}")
    print(f"  With BOS/EOS: {ids_special}")
    print(f"  Decoded: '{decoded}'")
```

4.5 Shared vs Separate Vocabularies
Shared Vocabulary (Recommended for Translation)
One vocabulary for both source and target languages:
```python
def train_shared_vocabulary(
    source_file: str,
    target_file: str,
    model_prefix: str,
    vocab_size: int = 32000
) -> None:
    """
    Train a shared vocabulary on combined source and target data.

    This is the recommended approach for translation because:
    1. Shared subwords (e.g., "information" in EN/DE)
    2. Enables weight tying in the model
    3. Smaller total parameters
    """
    # Combine source and target into one file
    combined_file = f"{model_prefix}_combined.txt"

    with open(combined_file, "w", encoding="utf-8") as out:
        for input_file in [source_file, target_file]:
            with open(input_file, "r", encoding="utf-8") as f:
                for line in f:
                    out.write(line)

    # Train on combined data
    spm.SentencePieceTrainer.train(
        input=combined_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
        character_coverage=0.9995,
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3,
    )

    print(f"Trained shared vocabulary: {model_prefix}.model")


# Example usage
# train_shared_vocabulary("german.txt", "english.txt", "de_en_shared", 32000)
```

Separate Vocabularies
Independent vocabularies for each language:
```python
def train_separate_vocabularies(
    source_file: str,
    target_file: str,
    source_prefix: str,
    target_prefix: str,
    vocab_size: int = 16000
) -> None:
    """
    Train separate vocabularies for source and target.

    Use this when:
    1. Languages have very different scripts (EN-ZH, EN-JA)
    2. Domain-specific requirements
    3. Memory constraints allow separate embeddings
    """
    # Train source tokenizer
    spm.SentencePieceTrainer.train(
        input=source_file,
        model_prefix=source_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
    )

    # Train target tokenizer
    spm.SentencePieceTrainer.train(
        input=target_file,
        model_prefix=target_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
    )

    print(f"Source tokenizer: {source_prefix}.model")
    print(f"Target tokenizer: {target_prefix}.model")
```

When to Use Each
| Scenario | Recommendation |
|---|---|
| Similar languages (DE-EN, FR-EN) | Shared |
| Different scripts (EN-ZH, EN-AR) | Separate |
| Code-switching data | Shared |
| Domain-specific | Depends |
| Memory-constrained | Separate (smaller each) |
| Research/benchmark | Shared (standard) |
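One quick way to inform this decision is to measure how much subword vocabulary the two languages would actually share. The helper below is an illustrative sketch, not part of SentencePiece; in practice you would fill the sets from the first column of each trained .vocab file.

```python
def vocab_overlap(vocab_a: set, vocab_b: set) -> float:
    """Fraction of the smaller vocabulary that also appears in the other."""
    shared = vocab_a & vocab_b
    return len(shared) / min(len(vocab_a), len(vocab_b))


# Toy example with hand-picked subwords; real vocabularies would be
# loaded from de.vocab / en.vocab files (hypothetical names).
de = {"▁der", "▁und", "▁in", "ation", "en", "s"}
en = {"▁the", "▁and", "▁in", "ation", "ing", "s"}

print(f"Overlap: {vocab_overlap(de, en):.0%}")  # shared pieces: ▁in, ation, s
```

High overlap (as for German-English) argues for a shared vocabulary; near-zero overlap (as for English-Chinese) argues for separate ones.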
4.6 Integration with PyTorch Dataset
Translation Dataset with SentencePiece
```python
import torch
from torch.utils.data import Dataset, DataLoader
from typing import Optional


class TranslationDataset(Dataset):
    """
    PyTorch Dataset for machine translation with SentencePiece tokenization.

    Example:
        >>> tokenizer = SentencePieceTokenizer("model.model")
        >>> dataset = TranslationDataset(src_texts, tgt_texts, tokenizer)
        >>> src_ids, tgt_ids = dataset[0]
    """

    def __init__(
        self,
        source_texts: list,
        target_texts: list,
        tokenizer: SentencePieceTokenizer,
        max_length: int = 128
    ):
        """
        Initialize dataset.

        Args:
            source_texts: List of source language texts
            target_texts: List of target language texts
            tokenizer: SentencePiece tokenizer
            max_length: Maximum sequence length
        """
        assert len(source_texts) == len(target_texts)

        self.source_texts = source_texts
        self.target_texts = target_texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self) -> int:
        return len(self.source_texts)

    def __getitem__(self, idx: int):
        """
        Get a single training example.

        Returns:
            source_ids: Tensor of source token IDs
            target_ids: Tensor of target token IDs (with BOS/EOS)
        """
        source_text = self.source_texts[idx]
        target_text = self.target_texts[idx]

        # Encode source (no special tokens needed)
        source_ids = self.tokenizer.encode(source_text)

        # Encode target with BOS and EOS
        target_ids = self.tokenizer.encode(
            target_text, add_bos=True, add_eos=True
        )

        # Truncate if needed
        source_ids = source_ids[:self.max_length]
        target_ids = target_ids[:self.max_length]

        return (
            torch.tensor(source_ids, dtype=torch.long),
            torch.tensor(target_ids, dtype=torch.long)
        )


def collate_fn(batch, pad_id: int = 0):
    """
    Collate function for DataLoader that pads sequences.

    Args:
        batch: List of (source, target) tensor tuples
        pad_id: Padding token ID

    Returns:
        source_batch: Padded source tensor [batch, max_src_len]
        target_batch: Padded target tensor [batch, max_tgt_len]
    """
    sources, targets = zip(*batch)

    # Get max lengths
    max_src_len = max(s.size(0) for s in sources)
    max_tgt_len = max(t.size(0) for t in targets)

    # Pad sequences
    padded_sources = []
    padded_targets = []

    for src, tgt in batch:
        # Pad source
        src_padding = torch.full(
            (max_src_len - src.size(0),),
            pad_id,
            dtype=torch.long
        )
        padded_sources.append(torch.cat([src, src_padding]))

        # Pad target
        tgt_padding = torch.full(
            (max_tgt_len - tgt.size(0),),
            pad_id,
            dtype=torch.long
        )
        padded_targets.append(torch.cat([tgt, tgt_padding]))

    source_batch = torch.stack(padded_sources)
    target_batch = torch.stack(padded_targets)

    return source_batch, target_batch
```

4.7 Complete Tokenizer for Translation Project
Full Implementation
```python
import sentencepiece as spm
import torch
from pathlib import Path
from typing import List, Optional
import json


class TranslationTokenizer:
    """
    Complete tokenizer for German-English translation.

    Handles:
    - Training on parallel corpus
    - Shared vocabulary for both languages
    - Proper special token handling
    - PyTorch integration

    Example:
        >>> tokenizer = TranslationTokenizer()
        >>> tokenizer.train(german_texts, english_texts,
        ...                 vocab_size=32000, model_dir="tokenizer_dir")
        >>>
        >>> # Later
        >>> tokenizer = TranslationTokenizer.load("tokenizer_dir")
        >>> src_ids = tokenizer.encode_source("Der Hund läuft.")
        >>> tgt_ids = tokenizer.encode_target("The dog runs.")
    """

    def __init__(self):
        self.sp: Optional[spm.SentencePieceProcessor] = None
        self.vocab_size: int = 0

    def train(
        self,
        source_texts: List[str],
        target_texts: List[str],
        vocab_size: int = 32000,
        model_dir: str = "tokenizer",
        model_type: str = "bpe"
    ) -> None:
        """
        Train tokenizer on parallel corpus.

        Args:
            source_texts: German sentences
            target_texts: English sentences
            vocab_size: Target vocabulary size
            model_dir: Directory for model files
            model_type: 'bpe' or 'unigram'
        """
        model_dir = Path(model_dir)
        model_dir.mkdir(parents=True, exist_ok=True)

        # Combine source and target for shared vocabulary
        corpus_file = model_dir / "corpus.txt"
        with open(corpus_file, "w", encoding="utf-8") as f:
            for text in source_texts:
                f.write(text.strip() + "\n")
            for text in target_texts:
                f.write(text.strip() + "\n")

        # Train SentencePiece
        model_prefix = str(model_dir / "sp")

        spm.SentencePieceTrainer.train(
            input=str(corpus_file),
            model_prefix=model_prefix,
            vocab_size=vocab_size,
            model_type=model_type,
            character_coverage=0.9995,
            # Special tokens
            pad_id=0,
            unk_id=1,
            bos_id=2,
            eos_id=3,
            pad_piece="<pad>",
            unk_piece="<unk>",
            bos_piece="<bos>",
            eos_piece="<eos>",
            # Training options
            num_threads=4,
            shuffle_input_sentence=True,
        )

        # Load the trained model
        self.sp = spm.SentencePieceProcessor()
        self.sp.load(model_prefix + ".model")
        self.vocab_size = self.sp.get_piece_size()

        # Save config
        config = {
            "vocab_size": vocab_size,
            "model_type": model_type,
        }
        with open(model_dir / "config.json", "w") as f:
            json.dump(config, f)

        print(f"Trained tokenizer with {self.vocab_size} tokens")
        print(f"Saved to: {model_dir}")

    def encode_source(
        self,
        text: str,
        max_length: Optional[int] = None
    ) -> List[int]:
        """
        Encode source (German) text.

        Source sequences don't need BOS/EOS in the encoder.

        Args:
            text: Source text
            max_length: Truncate to this length

        Returns:
            List of token IDs
        """
        ids = self.sp.encode(text, out_type=int)
        if max_length:
            ids = ids[:max_length]
        return ids

    def encode_target(
        self,
        text: str,
        max_length: Optional[int] = None,
        add_bos: bool = True,
        add_eos: bool = True
    ) -> List[int]:
        """
        Encode target (English) text.

        Target sequences need BOS for decoder input and EOS for labels.

        Args:
            text: Target text
            max_length: Truncate to this length (including special tokens)
            add_bos: Prepend BOS token
            add_eos: Append EOS token

        Returns:
            List of token IDs
        """
        ids = self.sp.encode(text, out_type=int)

        if add_bos:
            ids = [self.bos_id] + ids
        if add_eos:
            ids = ids + [self.eos_id]

        if max_length:
            ids = ids[:max_length]

        return ids

    def decode(self, ids: List[int], skip_special: bool = True) -> str:
        """
        Decode token IDs to text.

        Args:
            ids: Token IDs
            skip_special: Remove special tokens from output

        Returns:
            Decoded text
        """
        if skip_special:
            special_ids = {self.pad_id, self.bos_id, self.eos_id}
            ids = [id_ for id_ in ids if id_ not in special_ids]

        return self.sp.decode(ids)

    def tokenize(self, text: str) -> List[str]:
        """Get tokens as strings (for debugging)."""
        return self.sp.encode(text, out_type=str)

    @classmethod
    def load(cls, model_dir: str) -> "TranslationTokenizer":
        """
        Load tokenizer from directory.

        Args:
            model_dir: Directory containing sp.model

        Returns:
            Loaded tokenizer
        """
        tokenizer = cls()
        model_path = Path(model_dir) / "sp.model"
        tokenizer.sp = spm.SentencePieceProcessor()
        tokenizer.sp.load(str(model_path))
        tokenizer.vocab_size = tokenizer.sp.get_piece_size()
        return tokenizer

    @property
    def pad_id(self) -> int:
        return self.sp.pad_id()

    @property
    def unk_id(self) -> int:
        return self.sp.unk_id()

    @property
    def bos_id(self) -> int:
        return self.sp.bos_id()

    @property
    def eos_id(self) -> int:
        return self.sp.eos_id()

    def __len__(self) -> int:
        return self.vocab_size

    def __repr__(self) -> str:
        return f"TranslationTokenizer(vocab_size={self.vocab_size})"
```

Summary
SentencePiece Workflow
1. Prepare training data → corpus.txt
2. Train model → sp.model, sp.vocab
3. Load model → spm.SentencePieceProcessor()
4. Encode/Decode → sp.encode(), sp.decode()

Key Parameters
| Parameter | Description | Typical Value |
|---|---|---|
| vocab_size | Target vocabulary size | 32,000 |
| model_type | Algorithm (bpe/unigram) | bpe |
| character_coverage | Fraction of characters to cover | 0.9995 |
| pad_id | Padding token ID | 0 |
| unk_id | Unknown token ID | 1 |
| bos_id | Begin-of-sequence ID | 2 |
| eos_id | End-of-sequence ID | 3 |
Translation Project Settings
For our German-English project:
- Vocabulary size: 32,000 (shared)
- Model type: BPE
- Character coverage: 0.9995
- Special tokens: PAD(0), UNK(1), BOS(2), EOS(3)
Exercises
Implementation Exercises
- Train separate tokenizers for German and English. Compare vocabulary contents.
- Experiment with different vocab_sizes (8K, 16K, 32K, 64K). Measure average tokens per sentence.
- Implement a method to get the token frequency distribution from a trained model.
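For the vocab_size experiment above, average tokens per sentence is a simple corpus-level statistic. A starting-point sketch, using a whitespace split as a stand-in encoder until you have trained models to plug in:

```python
def avg_tokens_per_sentence(sentences, encode):
    """Mean token count over a corpus, for any callable `encode`."""
    return sum(len(encode(s)) for s in sentences) / len(sentences)


# Stand-in encoder for illustration; with a trained model, pass
# lambda s: sp.encode(s, out_type=int) instead and compare vocab sizes.
sentences = ["Der Hund läuft im Park.", "The dog runs in the park."]
print(avg_tokens_per_sentence(sentences, str.split))  # → 5.5
```

Smaller vocabularies generally produce more tokens per sentence (longer sequences), so this number quantifies the trade-off directly.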
Analysis Exercises
- Compare BPE vs Unigram models on the same data. How do the tokenizations differ?
- Analyze how German compound words are tokenized (e.g., "Krankenhaus", "Handschuh").
- Test the tokenizer on out-of-domain text (e.g., scientific terms). How well does it generalize?
Next Section Preview
In the next section, we'll cover special tokens for sequence-to-sequence tasks. We'll learn exactly when BOS, EOS, and PAD tokens are used during training and inference, and how they affect the loss computation.