Chapter 5

Using SentencePiece for Production

Subword Tokenization for Translation

Introduction

While our from-scratch BPE implementation helps understand the algorithm, production systems use optimized libraries. SentencePiece is the industry standard—it's fast, memory-efficient, and handles edge cases that our simple implementation doesn't.

This section covers training and using SentencePiece for our German-English translation project.


4.1 Why SentencePiece?

Advantages Over Custom Implementation

Feature            Our Implementation   SentencePiece
Training speed     Slow (Python)        Fast (C++)
Memory usage       High                 Optimized
Unicode handling   Basic                Complete
Pre-tokenization   Whitespace only      Language-aware
Algorithms         BPE only             BPE, Unigram, Word, Char
Production-ready   No                   Yes
Used by            Educational          T5, ALBERT, LLaMA

Key Features

  1. Language-agnostic: No pre-tokenization required
  2. Byte fallback: With the byte_fallback option, any Unicode text can be represented without <unk>
  3. Deterministic: Same input → same output
  4. Memory-mapped: Efficient for large vocabularies
  5. Multiple algorithms: BPE, Unigram, and more

4.2 Installation and Setup

Installing SentencePiece

```bash
# Install via pip
pip install sentencepiece

# Verify installation
python -c "import sentencepiece; print(sentencepiece.__version__)"
```

Basic Usage Pattern

```python
import sentencepiece as spm

# Training
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",
    vocab_size=32000
)

# Loading and using
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encoding
tokens = sp.encode("Hello world", out_type=str)
ids = sp.encode("Hello world", out_type=int)

# Decoding
text = sp.decode(ids)
```

4.3 Training a Tokenizer

Preparing Training Data

```python
import sentencepiece as spm


def prepare_corpus(texts: list, output_path: str) -> None:
    """
    Write texts to file for SentencePiece training.

    Args:
        texts: List of text strings
        output_path: Path to output file
    """
    with open(output_path, "w", encoding="utf-8") as f:
        for text in texts:
            # One sentence per line
            f.write(text.strip() + "\n")


# Example: Prepare German and English data
german_texts = [
    "Der Hund läuft im Park.",
    "Die Katze schläft auf dem Sofa.",
    "Das Wetter ist heute schön.",
    "Ich gehe morgen einkaufen.",
    "Die Kinder spielen draußen.",
]

english_texts = [
    "The dog runs in the park.",
    "The cat sleeps on the sofa.",
    "The weather is nice today.",
    "I am going shopping tomorrow.",
    "The children are playing outside.",
]

# For translation, we typically combine source and target
combined_texts = german_texts + english_texts

prepare_corpus(combined_texts, "translation_corpus.txt")
print(f"Wrote {len(combined_texts)} sentences to corpus file")
```

Training Options

```python
def train_sentencepiece(
    input_file: str,
    model_prefix: str,
    vocab_size: int = 32000,
    model_type: str = "bpe",
    character_coverage: float = 0.9995,
    pad_id: int = 0,
    unk_id: int = 1,
    bos_id: int = 2,
    eos_id: int = 3
) -> None:
    """
    Train a SentencePiece model.

    Args:
        input_file: Path to training corpus
        model_prefix: Output model name prefix
        vocab_size: Target vocabulary size
        model_type: 'bpe', 'unigram', 'word', or 'char'
        character_coverage: Fraction of characters to cover (e.g., 0.9995)
        pad_id, unk_id, bos_id, eos_id: Special token IDs
    """
    spm.SentencePieceTrainer.train(
        input=input_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=character_coverage,
        pad_id=pad_id,
        unk_id=unk_id,
        bos_id=bos_id,
        eos_id=eos_id,
        # Additional useful options
        num_threads=4,
        train_extremely_large_corpus=False,  # Set True for huge corpora
        max_sentence_length=4192,
        shuffle_input_sentence=True,
        # Control special tokens
        pad_piece="<pad>",
        unk_piece="<unk>",
        bos_piece="<bos>",
        eos_piece="<eos>",
    )

    print(f"Trained model: {model_prefix}.model")
    print(f"Vocabulary file: {model_prefix}.vocab")


# Train on our corpus
train_sentencepiece(
    input_file="translation_corpus.txt",
    model_prefix="de_en_bpe",
    # Small for this example; note that SentencePiece raises an error if
    # vocab_size exceeds what the corpus can support, so a tiny corpus
    # like our 10 sentences needs a much smaller value still.
    vocab_size=8000,
    model_type="bpe"
)
```

Understanding Training Output

After training, SentencePiece creates two files:

```text
de_en_bpe.model  - Binary model file (for encoding/decoding)
de_en_bpe.vocab  - Human-readable vocabulary (for inspection)
```

Inspect the vocabulary:

```python
def inspect_vocabulary(vocab_file: str, num_tokens: int = 20) -> None:
    """Print sample tokens from vocabulary."""
    print(f"First {num_tokens} tokens in vocabulary:\n")

    with open(vocab_file, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= num_tokens:
                break
            token, score = line.strip().split("\t")
            print(f"  {i:4d}: '{token}' (score: {score})")


inspect_vocabulary("de_en_bpe.vocab")
```

Output:

```text
First 20 tokens in vocabulary:

     0: '<pad>' (score: 0)
     1: '<unk>' (score: 0)
     2: '<bos>' (score: 0)
     3: '<eos>' (score: 0)
     4: '▁' (score: -0)
     5: '.' (score: -1)
     6: 'e' (score: -2)
     7: 'n' (score: -3)
     8: '▁the' (score: -4)
     9: 'er' (score: -5)
    10: '▁The' (score: -6)
    ...
```

Note: ▁ (U+2581) is SentencePiece's marker for the beginning of a word (similar to our </w>, but marking word starts rather than word ends).
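Because every original space is preserved as a ▁ on the following piece, detokenization is a pure string operation: concatenate the pieces and map ▁ back to spaces. A minimal pure-Python sketch of that rule (SentencePiece's own decode does this, and more, internally):

```python
def detokenize(pieces: list) -> str:
    """Join subword pieces, turning each ▁ marker back into a space."""
    text = "".join(pieces).replace("\u2581", " ")
    # The first piece usually carries a leading marker; drop that space
    return text.lstrip(" ")


pieces = ["\u2581The", "\u2581dog", "\u2581run", "s", "."]
print(detokenize(pieces))  # The dog runs.
```

This is why SentencePiece is lossless: no separate detokenization rules are needed to recover the original spacing.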


4.4 Using the Trained Model

Basic Encoding and Decoding

```python
class SentencePieceTokenizer:
    """
    Wrapper around SentencePiece for convenient usage.

    Example:
        >>> tokenizer = SentencePieceTokenizer("model.model")
        >>> ids = tokenizer.encode("Hello world")
        >>> text = tokenizer.decode(ids)
    """

    def __init__(self, model_path: str):
        """Load a trained SentencePiece model."""
        self.sp = spm.SentencePieceProcessor()
        self.sp.load(model_path)

    def encode(
        self,
        text: str,
        add_bos: bool = False,
        add_eos: bool = False
    ) -> list:
        """
        Encode text to token IDs.

        Args:
            text: Input text
            add_bos: Prepend BOS token
            add_eos: Append EOS token

        Returns:
            List of token IDs
        """
        ids = self.sp.encode(text, out_type=int)

        if add_bos:
            ids = [self.bos_id] + ids
        if add_eos:
            ids = ids + [self.eos_id]

        return ids

    def decode(self, ids: list) -> str:
        """Decode token IDs to text."""
        return self.sp.decode(ids)

    def tokenize(self, text: str) -> list:
        """Get tokens as strings (for inspection)."""
        return self.sp.encode(text, out_type=str)

    def detokenize(self, tokens: list) -> str:
        """Convert token strings back to text."""
        return self.sp.decode(tokens)

    @property
    def vocab_size(self) -> int:
        return self.sp.get_piece_size()

    @property
    def pad_id(self) -> int:
        return self.sp.pad_id()

    @property
    def unk_id(self) -> int:
        return self.sp.unk_id()

    @property
    def bos_id(self) -> int:
        return self.sp.bos_id()

    @property
    def eos_id(self) -> int:
        return self.sp.eos_id()


# Test the tokenizer
tokenizer = SentencePieceTokenizer("de_en_bpe.model")

print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Special tokens: PAD={tokenizer.pad_id}, UNK={tokenizer.unk_id}, "
      f"BOS={tokenizer.bos_id}, EOS={tokenizer.eos_id}")

# Test sentences
test_sentences = [
    "The dog runs in the park.",
    "Der Hund läuft im Park.",
    "This is an unknown word: xyzzy.",
]

print("\nEncoding examples:")
for sentence in test_sentences:
    tokens = tokenizer.tokenize(sentence)
    ids = tokenizer.encode(sentence)
    ids_special = tokenizer.encode(sentence, add_bos=True, add_eos=True)
    decoded = tokenizer.decode(ids)

    print(f"\n  Input:    '{sentence}'")
    print(f"  Tokens:   {tokens}")
    print(f"  IDs:      {ids}")
    print(f"  With BOS/EOS: {ids_special}")
    print(f"  Decoded:  '{decoded}'")
```

4.5 Shared vs Separate Vocabularies

Shared Vocabulary

One vocabulary for both source and target languages:

```python
def train_shared_vocabulary(
    source_file: str,
    target_file: str,
    model_prefix: str,
    vocab_size: int = 32000
) -> None:
    """
    Train a shared vocabulary on combined source and target data.

    This is the recommended approach for translation because:
    1. Shared subwords (e.g., "information" in EN/DE)
    2. Enables weight tying in the model
    3. Smaller total parameters
    """
    # Combine source and target into one file
    combined_file = f"{model_prefix}_combined.txt"

    with open(combined_file, "w", encoding="utf-8") as out:
        for input_file in [source_file, target_file]:
            with open(input_file, "r", encoding="utf-8") as f:
                for line in f:
                    out.write(line)

    # Train on combined data
    spm.SentencePieceTrainer.train(
        input=combined_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
        character_coverage=0.9995,
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3,
    )

    print(f"Trained shared vocabulary: {model_prefix}.model")


# Example usage
# train_shared_vocabulary("german.txt", "english.txt", "de_en_shared", 32000)
```

Separate Vocabularies

Independent vocabularies for each language:

```python
def train_separate_vocabularies(
    source_file: str,
    target_file: str,
    source_prefix: str,
    target_prefix: str,
    vocab_size: int = 16000
) -> None:
    """
    Train separate vocabularies for source and target.

    Use this when:
    1. Languages have very different scripts (EN-ZH, EN-JA)
    2. Domain-specific requirements
    3. Memory constraints allow separate embeddings
    """
    # Train source tokenizer
    spm.SentencePieceTrainer.train(
        input=source_file,
        model_prefix=source_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
    )

    # Train target tokenizer
    spm.SentencePieceTrainer.train(
        input=target_file,
        model_prefix=target_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
    )

    print(f"Source tokenizer: {source_prefix}.model")
    print(f"Target tokenizer: {target_prefix}.model")
```

When to Use Each

Scenario                          Recommendation
Similar languages (DE-EN, FR-EN)  Shared
Different scripts (EN-ZH, EN-AR)  Separate
Code-switching data               Shared
Domain-specific                   Depends
Memory-constrained                Separate (smaller each)
Research/benchmark                Shared (standard)
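A quick empirical check for this decision is to measure how much two separately trained vocabularies actually overlap. A sketch in plain Python (the sample pieces are made up for illustration; in practice you would read them from the two .vocab files):

```python
def vocab_overlap(vocab_a: set, vocab_b: set) -> float:
    """Fraction of the smaller vocabulary that also appears in the other."""
    return len(vocab_a & vocab_b) / min(len(vocab_a), len(vocab_b))


# Hypothetical subword samples from separate German and English vocabularies
de_vocab = {"\u2581der", "\u2581und", "\u2581in", "ation", "\u2581Park"}
en_vocab = {"\u2581the", "\u2581and", "\u2581in", "ation", "\u2581park"}

print(f"Overlap: {vocab_overlap(de_vocab, en_vocab):.0%}")  # 2 of 5 shared -> 40%
```

High overlap (common for related languages like German and English) suggests a shared vocabulary will reuse many pieces; near-zero overlap (e.g., English-Chinese) argues for separate vocabularies.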

4.6 Integration with PyTorch Dataset

Translation Dataset with SentencePiece

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TranslationDataset(Dataset):
    """
    PyTorch Dataset for machine translation with SentencePiece tokenization.

    Example:
        >>> tokenizer = SentencePieceTokenizer("model.model")
        >>> dataset = TranslationDataset(src_texts, tgt_texts, tokenizer)
        >>> src_ids, tgt_ids = dataset[0]
    """

    def __init__(
        self,
        source_texts: list,
        target_texts: list,
        tokenizer: SentencePieceTokenizer,
        max_length: int = 128
    ):
        """
        Initialize dataset.

        Args:
            source_texts: List of source language texts
            target_texts: List of target language texts
            tokenizer: SentencePiece tokenizer
            max_length: Maximum sequence length
        """
        assert len(source_texts) == len(target_texts)

        self.source_texts = source_texts
        self.target_texts = target_texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self) -> int:
        return len(self.source_texts)

    def __getitem__(self, idx: int):
        """
        Get a single training example.

        Returns:
            source_ids: Tensor of source token IDs
            target_ids: Tensor of target token IDs (with BOS/EOS)
        """
        source_text = self.source_texts[idx]
        target_text = self.target_texts[idx]

        # Encode source (no special tokens needed)
        source_ids = self.tokenizer.encode(source_text)

        # Encode target with BOS and EOS
        target_ids = self.tokenizer.encode(
            target_text, add_bos=True, add_eos=True
        )

        # Truncate if needed
        source_ids = source_ids[:self.max_length]
        target_ids = target_ids[:self.max_length]

        return (
            torch.tensor(source_ids, dtype=torch.long),
            torch.tensor(target_ids, dtype=torch.long)
        )


def collate_fn(batch, pad_id: int = 0):
    """
    Collate function for DataLoader that pads sequences.

    Args:
        batch: List of (source, target) tensor tuples
        pad_id: Padding token ID

    Returns:
        source_batch: Padded source tensor [batch, max_src_len]
        target_batch: Padded target tensor [batch, max_tgt_len]
    """
    sources, targets = zip(*batch)

    # Get max lengths
    max_src_len = max(s.size(0) for s in sources)
    max_tgt_len = max(t.size(0) for t in targets)

    # Pad sequences
    padded_sources = []
    padded_targets = []

    for src, tgt in batch:
        # Pad source
        src_padding = torch.full(
            (max_src_len - src.size(0),),
            pad_id,
            dtype=torch.long
        )
        padded_sources.append(torch.cat([src, src_padding]))

        # Pad target
        tgt_padding = torch.full(
            (max_tgt_len - tgt.size(0),),
            pad_id,
            dtype=torch.long
        )
        padded_targets.append(torch.cat([tgt, tgt_padding]))

    source_batch = torch.stack(padded_sources)
    target_batch = torch.stack(padded_targets)

    return source_batch, target_batch
```
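The padding rule inside collate_fn can be seen in miniature with plain Python lists: every sequence is extended with pad_id up to the longest length in the batch. A small list-based sketch (illustrative only; the collate_fn above does the same with tensors):

```python
def pad_batch(seqs: list, pad_id: int = 0) -> list:
    """Pad each ID list to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


batch = [[5, 8, 3], [5, 9, 12, 7, 3]]
print(pad_batch(batch))  # [[5, 8, 3, 0, 0], [5, 9, 12, 7, 3]]
```

Note that DataLoader calls its collate function with a single batch argument, so the extra pad_id parameter of collate_fn is typically bound beforehand, e.g. with functools.partial(collate_fn, pad_id=tokenizer.pad_id).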

4.7 Complete Tokenizer for Translation Project

Full Implementation

```python
import sentencepiece as spm
from pathlib import Path
from typing import List, Optional
import json


class TranslationTokenizer:
    """
    Complete tokenizer for German-English translation.

    Handles:
    - Training on parallel corpus
    - Shared vocabulary for both languages
    - Proper special token handling
    - PyTorch integration

    Example:
        >>> tokenizer = TranslationTokenizer()
        >>> tokenizer.train(german_texts, english_texts,
        ...                 vocab_size=32000, model_dir="tokenizer_dir")
        >>>
        >>> # Later
        >>> tokenizer = TranslationTokenizer.load("tokenizer_dir")
        >>> src_ids = tokenizer.encode_source("Der Hund läuft.")
        >>> tgt_ids = tokenizer.encode_target("The dog runs.")
    """

    def __init__(self):
        self.sp: Optional[spm.SentencePieceProcessor] = None
        self.vocab_size: int = 0

    def train(
        self,
        source_texts: List[str],
        target_texts: List[str],
        vocab_size: int = 32000,
        model_dir: str = "tokenizer",
        model_type: str = "bpe"
    ) -> None:
        """
        Train tokenizer on parallel corpus.

        Args:
            source_texts: German sentences
            target_texts: English sentences
            vocab_size: Target vocabulary size
            model_dir: Directory for model files
            model_type: 'bpe' or 'unigram'
        """
        model_dir = Path(model_dir)
        model_dir.mkdir(parents=True, exist_ok=True)

        # Combine source and target for shared vocabulary
        corpus_file = model_dir / "corpus.txt"
        with open(corpus_file, "w", encoding="utf-8") as f:
            for text in source_texts:
                f.write(text.strip() + "\n")
            for text in target_texts:
                f.write(text.strip() + "\n")

        # Train SentencePiece
        model_prefix = str(model_dir / "sp")

        spm.SentencePieceTrainer.train(
            input=str(corpus_file),
            model_prefix=model_prefix,
            vocab_size=vocab_size,
            model_type=model_type,
            character_coverage=0.9995,
            # Special tokens
            pad_id=0,
            unk_id=1,
            bos_id=2,
            eos_id=3,
            pad_piece="<pad>",
            unk_piece="<unk>",
            bos_piece="<bos>",
            eos_piece="<eos>",
            # Training options
            num_threads=4,
            shuffle_input_sentence=True,
        )

        # Load the trained model
        self.sp = spm.SentencePieceProcessor()
        self.sp.load(model_prefix + ".model")
        self.vocab_size = self.sp.get_piece_size()

        # Save config
        config = {
            "vocab_size": vocab_size,
            "model_type": model_type,
        }
        with open(model_dir / "config.json", "w") as f:
            json.dump(config, f)

        print(f"Trained tokenizer with {self.vocab_size} tokens")
        print(f"Saved to: {model_dir}")

    def encode_source(
        self,
        text: str,
        max_length: Optional[int] = None
    ) -> List[int]:
        """
        Encode source (German) text.

        Source sequences don't need BOS/EOS in the encoder.

        Args:
            text: Source text
            max_length: Truncate to this length

        Returns:
            List of token IDs
        """
        ids = self.sp.encode(text, out_type=int)
        if max_length:
            ids = ids[:max_length]
        return ids

    def encode_target(
        self,
        text: str,
        max_length: Optional[int] = None,
        add_bos: bool = True,
        add_eos: bool = True
    ) -> List[int]:
        """
        Encode target (English) text.

        Target sequences need BOS for decoder input and EOS for labels.

        Args:
            text: Target text
            max_length: Truncate to this length (including special tokens)
            add_bos: Prepend BOS token
            add_eos: Append EOS token

        Returns:
            List of token IDs
        """
        ids = self.sp.encode(text, out_type=int)

        if add_bos:
            ids = [self.bos_id] + ids
        if add_eos:
            ids = ids + [self.eos_id]

        if max_length:
            ids = ids[:max_length]

        return ids

    def decode(self, ids: List[int], skip_special: bool = True) -> str:
        """
        Decode token IDs to text.

        Args:
            ids: Token IDs
            skip_special: Remove special tokens from output

        Returns:
            Decoded text
        """
        if skip_special:
            special_ids = {self.pad_id, self.bos_id, self.eos_id}
            ids = [id_ for id_ in ids if id_ not in special_ids]

        return self.sp.decode(ids)

    def tokenize(self, text: str) -> List[str]:
        """Get tokens as strings (for debugging)."""
        return self.sp.encode(text, out_type=str)

    @classmethod
    def load(cls, model_dir: str) -> "TranslationTokenizer":
        """
        Load tokenizer from directory.

        Args:
            model_dir: Directory containing sp.model

        Returns:
            Loaded tokenizer
        """
        tokenizer = cls()
        model_path = Path(model_dir) / "sp.model"
        tokenizer.sp = spm.SentencePieceProcessor()
        tokenizer.sp.load(str(model_path))
        tokenizer.vocab_size = tokenizer.sp.get_piece_size()
        return tokenizer

    @property
    def pad_id(self) -> int:
        return self.sp.pad_id()

    @property
    def unk_id(self) -> int:
        return self.sp.unk_id()

    @property
    def bos_id(self) -> int:
        return self.sp.bos_id()

    @property
    def eos_id(self) -> int:
        return self.sp.eos_id()

    def __len__(self) -> int:
        return self.vocab_size

    def __repr__(self) -> str:
        return f"TranslationTokenizer(vocab_size={self.vocab_size})"
```
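Why encode_target adds both BOS and EOS becomes concrete at training time: the same ID list yields both the decoder input (shifted right, starting at BOS) and the labels (shifted left, ending at EOS). A sketch with made-up token IDs, assuming BOS=2 and EOS=3 as configured above:

```python
# Hypothetical encode_target output for some sentence: <bos> ... <eos>
target_ids = [2, 17, 42, 9, 3]  # BOS=2, EOS=3; 17/42/9 are made-up IDs

decoder_input = target_ids[:-1]  # [2, 17, 42, 9]  - starts with BOS
labels = target_ids[1:]          # [17, 42, 9, 3]  - ends with EOS

# At step t the decoder sees decoder_input[:t+1] and must predict labels[t]
print(decoder_input, labels)
```

The next section covers this shifting, and its effect on the loss, in detail.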

Summary

SentencePiece Workflow

```text
1. Prepare training data  →  corpus.txt
2. Train model            →  sp.model, sp.vocab
3. Load model             →  spm.SentencePieceProcessor()
4. Encode/Decode          →  sp.encode(), sp.decode()
```

Key Parameters

Parameter            Description                      Typical Value
vocab_size           Target vocabulary size           32,000
model_type           Algorithm (bpe/unigram)          bpe
character_coverage   Fraction of characters to cover  0.9995
pad_id               Padding token ID                 0
unk_id               Unknown token ID                 1
bos_id               Begin-of-sequence ID             2
eos_id               End-of-sequence ID               3

Translation Project Settings

For our German-English project:

  • Vocabulary size: 32,000 (shared)
  • Model type: BPE
  • Character coverage: 0.9995
  • Special tokens: PAD(0), UNK(1), BOS(2), EOS(3)

Exercises

Implementation Exercises

  1. Train separate tokenizers for German and English. Compare vocabulary contents.
  2. Experiment with different vocab_sizes (8K, 16K, 32K, 64K). Measure average tokens per sentence.
  3. Implement a method to get the token frequency distribution from a trained model.

Analysis Exercises

  1. Compare BPE vs Unigram models on the same data. How do the tokenizations differ?
  2. Analyze how German compound words are tokenized (e.g., "Krankenhaus", "Handschuh").
  3. Test the tokenizer on out-of-domain text (e.g., scientific terms). How well does it generalize?

Next Section Preview

In the next section, we'll cover special tokens for sequence-to-sequence tasks. We'll learn exactly when BOS, EOS, and PAD tokens are used during training and inference, and how they affect the loss computation.