Introduction
When preparing text for neural networks, we face a fundamental question: how do we break text into pieces? The choice of tokenization strategy dramatically affects model performance, vocabulary size, and the model's ability to handle unseen words.
This section explores the limitations of word-level and character-level approaches, and why subword tokenization became the standard for modern NLP.
1.1 The Tokenization Spectrum
Three Main Approaches
1. Character-level: "playing" → ['p', 'l', 'a', 'y', 'i', 'n', 'g']
2. Subword-level: "playing" → ['play', 'ing']
3. Word-level: "playing" → ['playing']

Each approach trades off between:
- Vocabulary size: How many unique tokens?
- Sequence length: How many tokens per sentence?
- Coverage: Can we handle any input?
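These trade-offs are easy to see in a few lines of Python; the subword split below is illustrative, since the actual pieces depend on a trained vocabulary:

```python
sentence = "playing"

# Character-level: every character is its own token
char_tokens = list(sentence)    # ['p', 'l', 'a', 'y', 'i', 'n', 'g']

# Word-level: whitespace split (real systems add punctuation handling)
word_tokens = sentence.split()  # ['playing']

# Subword-level: illustrative split; real pieces come from a trained vocabulary
subword_tokens = ["play", "ing"]

print(char_tokens, word_tokens, subword_tokens)
```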
Visual Comparison
```
Sentence: "The transformer architecture revolutionized NLP"

Word-level (5 tokens):
  ["The", "transformer", "architecture", "revolutionized", "NLP"]
  Vocabulary: millions of unique words
  Problem: What about "transformerized"? → <UNK>

Character-level (47 tokens):
  ['T', 'h', 'e', ' ', 't', 'r', 'a', 'n', 's', ...]
  Vocabulary: ~100 characters
  Problem: Very long sequences, hard to learn patterns

Subword-level (8 tokens):
  ["The", "transform", "er", "architec", "ture", "revolution", "ized", "NLP"]
  Vocabulary: 30K-50K subwords
  Best of both worlds!
```
1.2 The Out-of-Vocabulary (OOV) Problem
Word-Level Limitations
With word-level tokenization, we must decide on a fixed vocabulary:
```python
# Example: vocabulary of the 50,000 most common words
vocab = {"the", "a", "transformer", "model"}  # ...50K words in practice

# At inference time
text = "The COVID-19 pandemic affected transformerization"

# Any word outside the vocabulary collapses to <UNK>:
tokens = [w if w.lower() in vocab else "<UNK>" for w in text.split()]

# Even with a full 50K vocabulary, rare and novel words are lost:
# "COVID-19"           → <UNK>
# "transformerization" → <UNK>
```
OOV Statistics
For a typical English vocabulary:
- 50K words: ~5% OOV rate on news text
- 100K words: ~3% OOV rate
- 500K words: ~1% OOV rate (but massive memory!)
For morphologically rich languages (German, Finnish, Turkish):
- 50K words: ~15-25% OOV rate!
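Measuring an OOV rate is straightforward once a vocabulary is fixed. A minimal sketch with a toy vocabulary and corpus (the words and numbers here are illustrative, not real statistics):

```python
# Toy vocabulary and corpus; real measurements use 50K+ word vocabularies
vocab = {"the", "model", "was", "trained", "on", "news", "text"}
corpus = "the model was trained on covid-19 news text".split()

# Words the vocabulary cannot represent
oov = [w for w in corpus if w not in vocab]
oov_rate = len(oov) / len(corpus)

print(oov, f"{oov_rate:.1%}")  # ['covid-19'] 12.5%
```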
The Long Tail Problem
Word frequency follows Zipf's Law:
```
Frequency
     |
1000 |*
     |**
 100 |****
     |*******
  10 |**************
     |**************************
   1 |___________________________
      Rank: 1   100   1000   10000
```
Most words appear rarely, but they're often important (names, technical terms, new words).
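The long tail shows up in even a tiny sample; this sketch counts word frequencies with Python's `Counter` (any larger corpus shows the same Zipfian shape far more dramatically):

```python
from collections import Counter

text = ("the cat sat on the mat the dog sat on the rug "
        "a cat and a dog met a zebra")
counts = Counter(text.split())

# A few words dominate the counts...
for word, freq in counts.most_common(5):
    print(word, freq)

# ...while most distinct words appear only once
singletons = sum(1 for f in counts.values() if f == 1)
print(f"{singletons} of {len(counts)} distinct words appear only once")
```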
1.3 Why Character-Level Fails
Advantages of Characters
- Zero OOV: Any text can be represented
- Tiny vocabulary: ~100-300 characters (including special)
- Language agnostic: Works for any script
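The zero-OOV and language-agnostic properties follow directly from operating on characters: any Unicode string, in any script, decomposes into tokens the model has already seen.

```python
# Any string decomposes into character tokens -- nothing is ever <UNK>
for text in ["hello", "naïve", "東京", "Fußball"]:
    print(text, "→", list(text))
```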
Critical Disadvantages
1. Extremely Long Sequences
```
Word-level: "The quick brown fox" → 4 tokens
Character:  "The quick brown fox" → 19 tokens

For attention: O(n²) complexity
- 4 tokens:  16 attention computations
- 19 tokens: 361 attention computations
```
2. Difficult to Learn Word-Level Patterns
```python
# Model must learn:
#   ['c', 'a', 't'] is an animal
#   ['d', 'o', 'g'] is also an animal
# These share no characters, so no shared learning!

# With subwords:
#   "cat" and "dog" can both connect to "animal" patterns
```
3. Increased Error Rate
Each character adds noise:
```
"hello" → ['h', 'e', 'l', 'l', 'o']
"helo"  → ['h', 'e', 'l', 'o']       # Only 1 char difference
"jello" → ['j', 'e', 'l', 'l', 'o']  # Also 1 char difference

The model struggles to distinguish typos from different words
```
4. Training Inefficiency
```python
# Characters require many more steps to learn patterns
# Example: learning that "play" relates to "playing", "played", "plays"

# Character-level model sees:
#   p-l-a-y-i-n-g
#   p-l-a-y-e-d
#   p-l-a-y-s
# Must learn the "p-l-a-y" pattern from scratch each time

# Subword model sees:
#   play + ing
#   play + ed
#   play + s
# Immediately recognizes "play" as the common element
```
1.4 Morphological Awareness
What is Morphology?
Morphology studies how words are formed from smaller meaningful units (morphemes):
```
unhappiness = un- (negation) + happy (root) + -ness (noun-forming)
playing     = play (root) + -ing (present participle)
transformer = transform (root) + -er (agent noun)
```
Why It Matters for NLP
Words sharing morphemes have related meanings:
```
play, plays, played, playing, player, playful, replay
│                same root: "play"                   │

Subword tokenization captures this:
  playing → [play, ##ing]
  player  → [play, ##er]
  playful → [play, ##ful]

The model learns "play" once and applies it everywhere!
```
German: A Morphological Challenge
German has compound words that combine multiple words:
```
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
= Beef labeling supervision task transfer law

Word-level:      This is ONE word (OOV!)
Character-level: 63 characters (too long!)
Subword:         [Rind, fleisch, etikett, ierung, s, überwach, ung, s, aufgaben, ...]
                 (Captures meaningful components)
```
More German examples:
```
Krankenhaus (hospital) = Kranken (sick) + Haus (house)
Handschuh (glove)      = Hand (hand)    + Schuh (shoe)
Flugzeug (airplane)    = Flug (flight)  + Zeug (stuff/device)

Subword tokenization naturally handles these!
```
1.5 The Translation Challenge
Source and Target Mismatch
In German-to-English translation:
```
German:  "Ich spiele gern Fußball"
English: "I like playing soccer"

Word alignment is not 1:1:
  "spiele"  → "playing" (different conjugation)
  "gern"    → "like"    (different word entirely)
  "Fußball" → "soccer"  (or "football" - regional!)
```
Subwords Help Alignment
```
German subwords:  [Ich, spiel, e, gern, Fuß, ball]
English subwords: [I, like, play, ing, socc, er]

Now the model can learn:
- "spiel" relates to "play"
- "Fuß" + "ball" relates to football/soccer
```
Shared Vocabulary Benefits
Many words are similar across languages:
```
Computer (German)    → Computer (English)
Telefon (German)     → Telephone (English)
Information (German) → Information (English)

With a shared subword vocabulary,
both languages use the same tokens!
→ Better cross-lingual transfer
```
1.6 Trade-offs: Vocabulary Size vs Sequence Length
The Fundamental Trade-off
```
Vocabulary Size ←――――――――→ Sequence Length
     Large                     Short
       ↑                         ↑
  Word-level              Character-level

  Subword tokenization sits in between
```
Practical Numbers
| Approach | Vocab Size | Avg Tokens/Sentence | Memory | Training Speed |
|---|---|---|---|---|
| Character | 100 | 80 | Low | Very Slow |
| Subword 8K | 8,000 | 25 | Medium | Fast |
| Subword 32K | 32,000 | 20 | Medium | Fast |
| Subword 50K | 50,000 | 18 | Higher | Fast |
| Word 100K | 100,000 | 15 | High | Fast |
Choosing Vocabulary Size
Too Small (< 8K):
- Very long sequences
- Common words split unnecessarily
- "the" might become ["th", "e"]
Too Large (> 100K):
- Many rare tokens with poor embeddings
- Increased memory for embedding table
- Diminishing returns
Sweet Spot (16K-50K):
- Common words stay intact
- Rare words split meaningfully
- Manageable sequence lengths
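The memory cost driving the "too large" regime can be estimated directly as vocab_size × embedding_dim × bytes per parameter. A sketch assuming a 768-dimensional model stored in float32 (4 bytes per parameter); the dimensions are illustrative:

```python
def embedding_mb(vocab_size, dim=768, bytes_per_param=4):
    # Size of the embedding matrix alone, in megabytes
    return vocab_size * dim * bytes_per_param / 1e6

for v in [8_000, 32_000, 50_000, 100_000]:
    print(f"{v:>7} tokens → {embedding_mb(v):7.1f} MB")
```

Going from a 32K to a 100K vocabulary roughly triples this table (about 98 MB → 307 MB at these settings), and for tied input/output embeddings the output projection grows with it.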
1.7 Why Subword Wins
Summary of Advantages
| Property | Character | Word | Subword |
|---|---|---|---|
| OOV Handling | Perfect | Poor | Good |
| Sequence Length | Long | Short | Medium |
| Vocabulary Size | Tiny | Huge | Medium |
| Morphology | None | None | Good |
| Cross-lingual | OK | Poor | Good |
| Training Speed | Slow | Fast | Fast |
Real-World Evidence
Major language models use subword tokenization:
| Model | Tokenizer | Vocab Size |
|---|---|---|
| BERT | WordPiece | 30,522 |
| GPT-2/3 | BPE | 50,257 |
| T5 | SentencePiece | 32,000 |
| LLaMA | BPE | 32,000 |
| BART | BPE | 50,265 |
Virtually no major model uses word-level or pure character-level tokenization.
1.8 Subword Algorithms Overview
Three Main Algorithms
1. Byte-Pair Encoding (BPE)
- Bottom-up: Start with characters, merge frequent pairs
- Used by: GPT, RoBERTa, BART
2. WordPiece
- Similar to BPE, but uses likelihood instead of frequency
- Used by: BERT, DistilBERT
3. Unigram
- Top-down: Start with large vocab, prune unlikely pieces
- Used by: T5, ALBERT, XLNet
For This Course
We'll focus on BPE because:
- Most intuitive algorithm
- Widely used in production
- Excellent performance for translation
- SentencePiece supports it
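As a preview of the next section, here is a minimal sketch of a single BPE merge step: count adjacent symbol pairs over a toy word-frequency table (the classic low/lower/new/newest example) and merge the most frequent pair. This is a simplification of the full training loop:

```python
from collections import Counter

# Toy corpus: words as tuples of symbols, with their frequencies
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w"): 3, ("n", "e", "w", "e", "s", "t"): 6}

def most_frequent_pair(corpus):
    # Count every adjacent symbol pair, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Rewrite every word, fusing each occurrence of the chosen pair
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(corpus)
print("merging:", pair)
corpus = merge_pair(corpus, pair)
```

Repeating this loop until the vocabulary reaches a target size yields the full list of merge rules; the next section traces it step by step.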
Summary
The Tokenization Problem
| Approach | Pros | Cons |
|---|---|---|
| Word-level | Short sequences, natural units | OOV problem, huge vocabulary |
| Character-level | Zero OOV, tiny vocabulary | Very long sequences, slow learning |
| Subword | Balanced vocabulary, handles rare words | Slightly more complex |
Why Subword for Translation?
- Handles OOV: New words split into known subwords
- Cross-lingual: Shared subwords across languages
- Morphology: Captures meaningful word parts
- German compounds: Naturally decomposes long words
- Efficiency: Reasonable sequence lengths
Key Concepts
- OOV Problem: Unknown words in fixed vocabulary
- Morphemes: Smallest meaningful units in language
- Vocabulary-Sequence Trade-off: Larger vocab = shorter sequences
- BPE: Most common subword algorithm
Exercises
Conceptual Questions
- Why does German have a higher OOV rate than English with word-level tokenization?
- Explain why character-level models are slower to train for the same text.
- For machine translation, what are the advantages of a shared subword vocabulary across source and target languages?
- If you had unlimited compute, would character-level tokenization be better? Why or why not?
Analysis Exercises
- Take the sentence "The transformerized architecture outperformed everything" and show how it would be tokenized at word, character, and approximate subword levels.
- The German word "Donaudampfschifffahrtsgesellschaftskapitän" (Danube steamship company captain) is unlikely to appear in any fixed vocabulary. Why would word-level tokenization fail on it? How might subword tokenization handle it?
- Compare the sequence lengths for translating "I love you" to German "Ich liebe dich" at different tokenization levels.
Next Section Preview
In the next section, we'll dive deep into the Byte-Pair Encoding (BPE) algorithm. We'll trace through the algorithm step by step, understanding how it learns merge rules from data and builds a vocabulary of subword units.