Chapter 5

Using SentencePiece for Production

Subword Tokenization for Translation

Introduction

While our from-scratch BPE implementation helps understand the algorithm, production systems use optimized libraries. SentencePiece is the industry standard—it's fast, memory-efficient, and handles edge cases that our simple implementation doesn't.

This section covers training and using SentencePiece for our German-English translation project.


4.1 Why SentencePiece?

Advantages Over Custom Implementation

Feature            Our Implementation   SentencePiece
Training speed     Slow (Python)        Fast (C++)
Memory usage       High                 Optimized
Unicode handling   Basic                Complete
Pre-tokenization   Whitespace only      Language-aware
Algorithms         BPE only             BPE, Unigram, Word, Char
Production-ready   No                   Yes
Used by            Educational          T5, ALBERT, LLaMA

Key Features

  1. Language-agnostic: No pre-tokenization required
  2. Byte fallback: With the byte_fallback option, any Unicode text can be represented without <unk>
  3. Deterministic: Same input → same output
  4. Memory-mapped: Efficient for large vocabularies
  5. Multiple algorithms: BPE, Unigram, and more

4.2 Installation and Setup

Installing SentencePiece

```bash
# Install via pip
pip install sentencepiece

# Verify installation
python -c "import sentencepiece; print(sentencepiece.__version__)"
```

Basic Usage Pattern

```python
import sentencepiece as spm

# Training
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",
    vocab_size=32000
)

# Loading and using
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encoding
tokens = sp.encode("Hello world", out_type=str)
ids = sp.encode("Hello world", out_type=int)

# Decoding
text = sp.decode(ids)
```

4.3 Training a Tokenizer

Preparing Training Data

```python
import sentencepiece as spm


def prepare_corpus(texts: list, output_path: str) -> None:
    """
    Write texts to file for SentencePiece training.

    Args:
        texts: List of text strings
        output_path: Path to output file
    """
    with open(output_path, "w", encoding="utf-8") as f:
        for text in texts:
            # One sentence per line
            f.write(text.strip() + "\n")


# Example: Prepare German and English data
german_texts = [
    "Der Hund läuft im Park.",
    "Die Katze schläft auf dem Sofa.",
    "Das Wetter ist heute schön.",
    "Ich gehe morgen einkaufen.",
    "Die Kinder spielen draußen.",
]

english_texts = [
    "The dog runs in the park.",
    "The cat sleeps on the sofa.",
    "The weather is nice today.",
    "I am going shopping tomorrow.",
    "The children are playing outside.",
]

# For translation, we typically combine source and target
combined_texts = german_texts + english_texts

prepare_corpus(combined_texts, "translation_corpus.txt")
print(f"Wrote {len(combined_texts)} sentences to corpus file")
```

Training Options

```python
def train_sentencepiece(
    input_file: str,
    model_prefix: str,
    vocab_size: int = 32000,
    model_type: str = "bpe",
    character_coverage: float = 0.9995,
    pad_id: int = 0,
    unk_id: int = 1,
    bos_id: int = 2,
    eos_id: int = 3
) -> None:
    """
    Train a SentencePiece model.

    Args:
        input_file: Path to training corpus
        model_prefix: Output model name prefix
        vocab_size: Target vocabulary size
        model_type: 'bpe', 'unigram', 'word', or 'char'
        character_coverage: Fraction of characters to cover (e.g., 0.9995)
        pad_id, unk_id, bos_id, eos_id: Special token IDs
    """
    spm.SentencePieceTrainer.train(
        input=input_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=character_coverage,
        pad_id=pad_id,
        unk_id=unk_id,
        bos_id=bos_id,
        eos_id=eos_id,
        # Additional useful options
        num_threads=4,
        train_extremely_large_corpus=False,  # Set True for huge corpora
        max_sentence_length=4192,
        shuffle_input_sentence=True,
        # Control special tokens
        pad_piece="<pad>",
        unk_piece="<unk>",
        bos_piece="<bos>",
        eos_piece="<eos>",
    )

    print(f"Trained model: {model_prefix}.model")
    print(f"Vocabulary file: {model_prefix}.vocab")


# Train on our corpus
train_sentencepiece(
    input_file="translation_corpus.txt",
    model_prefix="de_en_bpe",
    # Small for this example; note that SentencePiece raises an error if
    # vocab_size exceeds what the corpus can support, so a tiny corpus
    # like our 10 sentences needs a much smaller value still.
    vocab_size=8000,
    model_type="bpe"
)
```

Understanding Training Output

After training, SentencePiece creates two files:

```text
de_en_bpe.model  - Binary model file (for encoding/decoding)
de_en_bpe.vocab  - Human-readable vocabulary (for inspection)
```

Inspect the vocabulary:

```python
def inspect_vocabulary(vocab_file: str, num_tokens: int = 20) -> None:
    """Print sample tokens from vocabulary."""
    print(f"First {num_tokens} tokens in vocabulary:\n")

    with open(vocab_file, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= num_tokens:
                break
            token, score = line.strip().split("\t")
            print(f"  {i:4d}: '{token}' (score: {score})")


inspect_vocabulary("de_en_bpe.vocab")
```

Output:

```text
First 20 tokens in vocabulary:

     0: '<pad>' (score: 0)
     1: '<unk>' (score: 0)
     2: '<bos>' (score: 0)
     3: '<eos>' (score: 0)
     4: '▁' (score: -0)
     5: '.' (score: -1)
     6: 'e' (score: -2)
     7: 'n' (score: -3)
     8: '▁the' (score: -4)
     9: 'er' (score: -5)
    10: '▁The' (score: -6)
    ...
```

Note: ▁ (U+2581) is SentencePiece's marker for the beginning of a word (similar to our </w>, but marking word starts rather than word ends).
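Because every original space is preserved as a ▁ on the following piece, detokenization is a pure string operation: concatenate the pieces and map ▁ back to spaces. A minimal pure-Python sketch of that rule (SentencePiece's own decode does this, and more, internally):

```python
def detokenize(pieces: list) -> str:
    """Join subword pieces, turning each ▁ marker back into a space."""
    text = "".join(pieces).replace("\u2581", " ")
    # The first piece usually carries a leading marker; drop that space
    return text.lstrip(" ")


pieces = ["\u2581The", "\u2581dog", "\u2581run", "s", "."]
print(detokenize(pieces))  # The dog runs.
```

This is why SentencePiece is lossless: no separate detokenization rules are needed to recover the original spacing.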


4.4 Using the Trained Model

Basic Encoding and Decoding

```python
class SentencePieceTokenizer:
    """
    Wrapper around SentencePiece for convenient usage.

    Example:
        >>> tokenizer = SentencePieceTokenizer("model.model")
        >>> ids = tokenizer.encode("Hello world")
        >>> text = tokenizer.decode(ids)
    """

    def __init__(self, model_path: str):
        """Load a trained SentencePiece model."""
        self.sp = spm.SentencePieceProcessor()
        self.sp.load(model_path)

    def encode(
        self,
        text: str,
        add_bos: bool = False,
        add_eos: bool = False
    ) -> list:
        """
        Encode text to token IDs.

        Args:
            text: Input text
            add_bos: Prepend BOS token
            add_eos: Append EOS token

        Returns:
            List of token IDs
        """
        ids = self.sp.encode(text, out_type=int)

        if add_bos:
            ids = [self.bos_id] + ids
        if add_eos:
            ids = ids + [self.eos_id]

        return ids

    def decode(self, ids: list) -> str:
        """Decode token IDs to text."""
        return self.sp.decode(ids)

    def tokenize(self, text: str) -> list:
        """Get tokens as strings (for inspection)."""
        return self.sp.encode(text, out_type=str)

    def detokenize(self, tokens: list) -> str:
        """Convert token strings back to text."""
        return self.sp.decode(tokens)

    @property
    def vocab_size(self) -> int:
        return self.sp.get_piece_size()

    @property
    def pad_id(self) -> int:
        return self.sp.pad_id()

    @property
    def unk_id(self) -> int:
        return self.sp.unk_id()

    @property
    def bos_id(self) -> int:
        return self.sp.bos_id()

    @property
    def eos_id(self) -> int:
        return self.sp.eos_id()


# Test the tokenizer
tokenizer = SentencePieceTokenizer("de_en_bpe.model")

print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Special tokens: PAD={tokenizer.pad_id}, UNK={tokenizer.unk_id}, "
      f"BOS={tokenizer.bos_id}, EOS={tokenizer.eos_id}")

# Test sentences
test_sentences = [
    "The dog runs in the park.",
    "Der Hund läuft im Park.",
    "This is an unknown word: xyzzy.",
]

print("\nEncoding examples:")
for sentence in test_sentences:
    tokens = tokenizer.tokenize(sentence)
    ids = tokenizer.encode(sentence)
    ids_special = tokenizer.encode(sentence, add_bos=True, add_eos=True)
    decoded = tokenizer.decode(ids)

    print(f"\n  Input:    '{sentence}'")
    print(f"  Tokens:   {tokens}")
    print(f"  IDs:      {ids}")
    print(f"  With BOS/EOS: {ids_special}")
    print(f"  Decoded:  '{decoded}'")
```

4.5 Shared vs Separate Vocabularies

Shared Vocabulary

One vocabulary for both source and target languages:

```python
def train_shared_vocabulary(
    source_file: str,
    target_file: str,
    model_prefix: str,
    vocab_size: int = 32000
) -> None:
    """
    Train a shared vocabulary on combined source and target data.

    This is the recommended approach for translation because:
    1. Shared subwords (e.g., "information" in EN/DE)
    2. Enables weight tying in the model
    3. Smaller total parameters
    """
    # Combine source and target into one file
    combined_file = f"{model_prefix}_combined.txt"

    with open(combined_file, "w", encoding="utf-8") as out:
        for input_file in [source_file, target_file]:
            with open(input_file, "r", encoding="utf-8") as f:
                for line in f:
                    out.write(line)

    # Train on combined data
    spm.SentencePieceTrainer.train(
        input=combined_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
        character_coverage=0.9995,
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3,
    )

    print(f"Trained shared vocabulary: {model_prefix}.model")


# Example usage
# train_shared_vocabulary("german.txt", "english.txt", "de_en_shared", 32000)
```

Separate Vocabularies

Independent vocabularies for each language:

```python
def train_separate_vocabularies(
    source_file: str,
    target_file: str,
    source_prefix: str,
    target_prefix: str,
    vocab_size: int = 16000
) -> None:
    """
    Train separate vocabularies for source and target.

    Use this when:
    1. Languages have very different scripts (EN-ZH, EN-JA)
    2. Domain-specific requirements
    3. Memory constraints allow separate embeddings
    """
    # Train source tokenizer
    spm.SentencePieceTrainer.train(
        input=source_file,
        model_prefix=source_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
    )

    # Train target tokenizer
    spm.SentencePieceTrainer.train(
        input=target_file,
        model_prefix=target_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
    )

    print(f"Source tokenizer: {source_prefix}.model")
    print(f"Target tokenizer: {target_prefix}.model")
```

When to Use Each

Scenario                          Recommendation
Similar languages (DE-EN, FR-EN)  Shared
Different scripts (EN-ZH, EN-AR)  Separate
Code-switching data               Shared
Domain-specific                   Depends
Memory-constrained                Separate (smaller each)
Research/benchmark                Shared (standard)
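A quick empirical check for this decision is to measure how much two separately trained vocabularies actually overlap. A sketch in plain Python (the sample pieces are made up for illustration; in practice you would read them from the two .vocab files):

```python
def vocab_overlap(vocab_a: set, vocab_b: set) -> float:
    """Fraction of the smaller vocabulary that also appears in the other."""
    return len(vocab_a & vocab_b) / min(len(vocab_a), len(vocab_b))


# Hypothetical subword samples from separate German and English vocabularies
de_vocab = {"\u2581der", "\u2581und", "\u2581in", "ation", "\u2581Park"}
en_vocab = {"\u2581the", "\u2581and", "\u2581in", "ation", "\u2581park"}

print(f"Overlap: {vocab_overlap(de_vocab, en_vocab):.0%}")  # 2 of 5 shared -> 40%
```

High overlap (common for related languages like German and English) suggests a shared vocabulary will reuse many pieces; near-zero overlap (e.g., English-Chinese) argues for separate vocabularies.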

4.6 Integration with PyTorch Dataset

Translation Dataset with SentencePiece

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TranslationDataset(Dataset):
    """
    PyTorch Dataset for machine translation with SentencePiece tokenization.

    Example:
        >>> tokenizer = SentencePieceTokenizer("model.model")
        >>> dataset = TranslationDataset(src_texts, tgt_texts, tokenizer)
        >>> src_ids, tgt_ids = dataset[0]
    """

    def __init__(
        self,
        source_texts: list,
        target_texts: list,
        tokenizer: SentencePieceTokenizer,
        max_length: int = 128
    ):
        """
        Initialize dataset.

        Args:
            source_texts: List of source language texts
            target_texts: List of target language texts
            tokenizer: SentencePiece tokenizer
            max_length: Maximum sequence length
        """
        assert len(source_texts) == len(target_texts)

        self.source_texts = source_texts
        self.target_texts = target_texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self) -> int:
        return len(self.source_texts)

    def __getitem__(self, idx: int):
        """
        Get a single training example.

        Returns:
            source_ids: Tensor of source token IDs
            target_ids: Tensor of target token IDs (with BOS/EOS)
        """
        source_text = self.source_texts[idx]
        target_text = self.target_texts[idx]

        # Encode source (no special tokens needed)
        source_ids = self.tokenizer.encode(source_text)

        # Encode target with BOS and EOS
        target_ids = self.tokenizer.encode(
            target_text, add_bos=True, add_eos=True
        )

        # Truncate if needed
        source_ids = source_ids[:self.max_length]
        target_ids = target_ids[:self.max_length]

        return (
            torch.tensor(source_ids, dtype=torch.long),
            torch.tensor(target_ids, dtype=torch.long)
        )


def collate_fn(batch, pad_id: int = 0):
    """
    Collate function for DataLoader that pads sequences.

    Args:
        batch: List of (source, target) tensor tuples
        pad_id: Padding token ID

    Returns:
        source_batch: Padded source tensor [batch, max_src_len]
        target_batch: Padded target tensor [batch, max_tgt_len]
    """
    sources, targets = zip(*batch)

    # Get max lengths
    max_src_len = max(s.size(0) for s in sources)
    max_tgt_len = max(t.size(0) for t in targets)

    # Pad sequences
    padded_sources = []
    padded_targets = []

    for src, tgt in batch:
        # Pad source
        src_padding = torch.full(
            (max_src_len - src.size(0),),
            pad_id,
            dtype=torch.long
        )
        padded_sources.append(torch.cat([src, src_padding]))

        # Pad target
        tgt_padding = torch.full(
            (max_tgt_len - tgt.size(0),),
            pad_id,
            dtype=torch.long
        )
        padded_targets.append(torch.cat([tgt, tgt_padding]))

    source_batch = torch.stack(padded_sources)
    target_batch = torch.stack(padded_targets)

    return source_batch, target_batch
```
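The padding rule inside collate_fn can be seen in miniature with plain Python lists: every sequence is extended with pad_id up to the longest length in the batch. A small list-based sketch (illustrative only; the collate_fn above does the same with tensors):

```python
def pad_batch(seqs: list, pad_id: int = 0) -> list:
    """Pad each ID list to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


batch = [[5, 8, 3], [5, 9, 12, 7, 3]]
print(pad_batch(batch))  # [[5, 8, 3, 0, 0], [5, 9, 12, 7, 3]]
```

Note that DataLoader calls its collate function with a single batch argument, so the extra pad_id parameter of collate_fn is typically bound beforehand, e.g. with functools.partial(collate_fn, pad_id=tokenizer.pad_id).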

4.7 Complete Tokenizer for Translation Project

Full Implementation

```python
import sentencepiece as spm
from pathlib import Path
from typing import List, Optional
import json


class TranslationTokenizer:
    """
    Complete tokenizer for German-English translation.

    Handles:
    - Training on parallel corpus
    - Shared vocabulary for both languages
    - Proper special token handling
    - PyTorch integration

    Example:
        >>> tokenizer = TranslationTokenizer()
        >>> tokenizer.train(german_texts, english_texts,
        ...                 vocab_size=32000, model_dir="tokenizer_dir")
        >>>
        >>> # Later
        >>> tokenizer = TranslationTokenizer.load("tokenizer_dir")
        >>> src_ids = tokenizer.encode_source("Der Hund läuft.")
        >>> tgt_ids = tokenizer.encode_target("The dog runs.")
    """

    def __init__(self):
        self.sp: Optional[spm.SentencePieceProcessor] = None
        self.vocab_size: int = 0

    def train(
        self,
        source_texts: List[str],
        target_texts: List[str],
        vocab_size: int = 32000,
        model_dir: str = "tokenizer",
        model_type: str = "bpe"
    ) -> None:
        """
        Train tokenizer on parallel corpus.

        Args:
            source_texts: German sentences
            target_texts: English sentences
            vocab_size: Target vocabulary size
            model_dir: Directory for model files
            model_type: 'bpe' or 'unigram'
        """
        model_dir = Path(model_dir)
        model_dir.mkdir(parents=True, exist_ok=True)

        # Combine source and target for shared vocabulary
        corpus_file = model_dir / "corpus.txt"
        with open(corpus_file, "w", encoding="utf-8") as f:
            for text in source_texts:
                f.write(text.strip() + "\n")
            for text in target_texts:
                f.write(text.strip() + "\n")

        # Train SentencePiece
        model_prefix = str(model_dir / "sp")

        spm.SentencePieceTrainer.train(
            input=str(corpus_file),
            model_prefix=model_prefix,
            vocab_size=vocab_size,
            model_type=model_type,
            character_coverage=0.9995,
            # Special tokens
            pad_id=0,
            unk_id=1,
            bos_id=2,
            eos_id=3,
            pad_piece="<pad>",
            unk_piece="<unk>",
            bos_piece="<bos>",
            eos_piece="<eos>",
            # Training options
            num_threads=4,
            shuffle_input_sentence=True,
        )

        # Load the trained model
        self.sp = spm.SentencePieceProcessor()
        self.sp.load(model_prefix + ".model")
        self.vocab_size = self.sp.get_piece_size()

        # Save config
        config = {
            "vocab_size": vocab_size,
            "model_type": model_type,
        }
        with open(model_dir / "config.json", "w") as f:
            json.dump(config, f)

        print(f"Trained tokenizer with {self.vocab_size} tokens")
        print(f"Saved to: {model_dir}")

    def encode_source(
        self,
        text: str,
        max_length: Optional[int] = None
    ) -> List[int]:
        """
        Encode source (German) text.

        Source sequences don't need BOS/EOS in the encoder.

        Args:
            text: Source text
            max_length: Truncate to this length

        Returns:
            List of token IDs
        """
        ids = self.sp.encode(text, out_type=int)
        if max_length:
            ids = ids[:max_length]
        return ids

    def encode_target(
        self,
        text: str,
        max_length: Optional[int] = None,
        add_bos: bool = True,
        add_eos: bool = True
    ) -> List[int]:
        """
        Encode target (English) text.

        Target sequences need BOS for decoder input and EOS for labels.

        Args:
            text: Target text
            max_length: Truncate to this length (including special tokens)
            add_bos: Prepend BOS token
            add_eos: Append EOS token

        Returns:
            List of token IDs
        """
        ids = self.sp.encode(text, out_type=int)

        if add_bos:
            ids = [self.bos_id] + ids
        if add_eos:
            ids = ids + [self.eos_id]

        if max_length:
            ids = ids[:max_length]

        return ids

    def decode(self, ids: List[int], skip_special: bool = True) -> str:
        """
        Decode token IDs to text.

        Args:
            ids: Token IDs
            skip_special: Remove special tokens from output

        Returns:
            Decoded text
        """
        if skip_special:
            special_ids = {self.pad_id, self.bos_id, self.eos_id}
            ids = [id_ for id_ in ids if id_ not in special_ids]

        return self.sp.decode(ids)

    def tokenize(self, text: str) -> List[str]:
        """Get tokens as strings (for debugging)."""
        return self.sp.encode(text, out_type=str)

    @classmethod
    def load(cls, model_dir: str) -> "TranslationTokenizer":
        """
        Load tokenizer from directory.

        Args:
            model_dir: Directory containing sp.model

        Returns:
            Loaded tokenizer
        """
        tokenizer = cls()
        model_path = Path(model_dir) / "sp.model"
        tokenizer.sp = spm.SentencePieceProcessor()
        tokenizer.sp.load(str(model_path))
        tokenizer.vocab_size = tokenizer.sp.get_piece_size()
        return tokenizer

    @property
    def pad_id(self) -> int:
        return self.sp.pad_id()

    @property
    def unk_id(self) -> int:
        return self.sp.unk_id()

    @property
    def bos_id(self) -> int:
        return self.sp.bos_id()

    @property
    def eos_id(self) -> int:
        return self.sp.eos_id()

    def __len__(self) -> int:
        return self.vocab_size

    def __repr__(self) -> str:
        return f"TranslationTokenizer(vocab_size={self.vocab_size})"
```
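Why encode_target adds both BOS and EOS becomes concrete at training time: the same ID list yields both the decoder input (shifted right, starting at BOS) and the labels (shifted left, ending at EOS). A sketch with made-up token IDs, assuming BOS=2 and EOS=3 as configured above:

```python
# Hypothetical encode_target output for some sentence: <bos> ... <eos>
target_ids = [2, 17, 42, 9, 3]  # BOS=2, EOS=3; 17/42/9 are made-up IDs

decoder_input = target_ids[:-1]  # [2, 17, 42, 9]  - starts with BOS
labels = target_ids[1:]          # [17, 42, 9, 3]  - ends with EOS

# At step t the decoder sees decoder_input[:t+1] and must predict labels[t]
print(decoder_input, labels)
```

The next section covers this shifting, and its effect on the loss, in detail.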

Summary

SentencePiece Workflow

```text
1. Prepare training data  →  corpus.txt
2. Train model            →  sp.model, sp.vocab
3. Load model             →  spm.SentencePieceProcessor()
4. Encode/Decode          →  sp.encode(), sp.decode()
```

Key Parameters

Parameter            Description                      Typical Value
vocab_size           Target vocabulary size           32,000
model_type           Algorithm (bpe/unigram)          bpe
character_coverage   Fraction of characters to cover  0.9995
pad_id               Padding token ID                 0
unk_id               Unknown token ID                 1
bos_id               Begin-of-sequence ID             2
eos_id               End-of-sequence ID               3

Translation Project Settings

For our German-English project:

  • Vocabulary size: 32,000 (shared)
  • Model type: BPE
  • Character coverage: 0.9995
  • Special tokens: PAD(0), UNK(1), BOS(2), EOS(3)

Exercises

Implementation Exercises

  1. Train separate tokenizers for German and English. Compare vocabulary contents.
  2. Experiment with different vocab_sizes (8K, 16K, 32K, 64K). Measure average tokens per sentence.
  3. Implement a method to get the token frequency distribution from a trained model.

Analysis Exercises

  1. Compare BPE vs Unigram models on the same data. How do the tokenizations differ?
  2. Analyze how German compound words are tokenized (e.g., "Krankenhaus", "Handschuh").
  3. Test the tokenizer on out-of-domain text (e.g., scientific terms). How well does it generalize?

Next Section Preview

In the next section, we'll cover special tokens for sequence-to-sequence tasks. We'll learn exactly when BOS, EOS, and PAD tokens are used during training and inference, and how they affect the loss computation.