Chapter 5

Why Subword Tokenization?

Subword Tokenization for Translation

Introduction

When preparing text for neural networks, we face a fundamental question: how do we break text into pieces? The choice of tokenization strategy dramatically affects model performance, vocabulary size, and the model's ability to handle new words.

This section explores the limitations of word-level and character-level approaches, and why subword tokenization became the standard for modern NLP.


1.1 The Tokenization Spectrum

Three Main Approaches

```text
Character-level:  "playing" → ['p', 'l', 'a', 'y', 'i', 'n', 'g']
Subword-level:    "playing" → ['play', 'ing']
Word-level:       "playing" → ['playing']
```

Each approach trades off between:

  • Vocabulary size: How many unique tokens?
  • Sequence length: How many tokens per sentence?
  • Coverage: Can we handle any input?

Visual Comparison

```text
Sentence: "The transformer architecture revolutionized NLP"

Word-level (5 tokens):
  ["The", "transformer", "architecture", "revolutionized", "NLP"]
  Vocabulary: potentially millions of unique words
  Problem: What about "transformerized"? → <UNK>

Character-level (47 tokens):
  ['T', 'h', 'e', ' ', 't', 'r', 'a', 'n', 's', ...]
  Vocabulary: ~100 characters
  Problem: Very long sequences, hard to learn patterns

Subword-level (8 tokens):
  ["The", "transform", "er", "architec", "ture", "revolution", "ized", "NLP"]
  Vocabulary: 30K-50K subwords
  Best of both worlds!
```
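The token counts above can be checked with a few lines of Python (the subword split is hand-picked for illustration, not produced by a trained tokenizer):

```python
# Compare the three granularities on the example sentence.
sentence = "The transformer architecture revolutionized NLP"

word_tokens = sentence.split()   # split on whitespace
char_tokens = list(sentence)     # every character, including spaces
subword_tokens = ["The", "transform", "er", "architec", "ture",
                  "revolution", "ized", "NLP"]  # illustrative split

print(len(word_tokens))     # 5
print(len(char_tokens))     # 47
print(len(subword_tokens))  # 8
```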

1.2 The Out-of-Vocabulary (OOV) Problem

Word-Level Limitations

With word-level tokenization, we must decide on a fixed vocabulary:

```python
# Example: Vocabulary of the 50,000 most common words
vocab = {"the", "a", "transformer", "model", ...}  # 50K words

# At inference time
text = "The COVID-19 pandemic affected transformerization"

# Problem: Words not in the vocabulary
# "COVID-19" → <UNK>
# "transformerization" → <UNK>
```

OOV Statistics

For a typical English vocabulary:

  • 50K words: ~5% OOV rate on news text
  • 100K words: ~3% OOV rate
  • 500K words: ~1% OOV rate (but massive memory!)

For morphologically rich languages (German, Finnish, Turkish):

  • 50K words: ~15-25% OOV rate!
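Measuring an OOV rate is a simple set-membership check; the sketch below uses a toy vocabulary and sentence, not real frequency data:

```python
# Toy illustration of the OOV problem (vocabulary and text are made up).
vocab = {"the", "a", "transformer", "model", "pandemic", "affected"}

words = "the covid-19 pandemic affected transformerization".split()
oov = [w for w in words if w not in vocab]
oov_rate = len(oov) / len(words)

print(oov)       # ['covid-19', 'transformerization']
print(oov_rate)  # 0.4
```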

The Long Tail Problem

Word frequency follows Zipf's Law:

```text
Frequency
     |
1000 |*
     |**
 100 |****
     |*******
  10 |**************
     |**************************
   1 |___________________________
     Rank: 1   100   1000   10000
```

Most words appear rarely, but they're often important (names, technical terms, new words).
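The long tail can be sketched with an idealized Zipf model, in which the rank-r word has frequency proportional to 1/r (the vocabulary size and rank cutoffs below are illustrative assumptions):

```python
# Idealized Zipf model: frequency of the rank-r word ∝ 1/r.
N = 50_000  # assumed vocabulary size
weights = [1 / r for r in range(1, N + 1)]
total = sum(weights)

top_100_share = sum(weights[:100]) / total  # mass of the 100 most common words
tail_share = sum(weights[10_000:]) / total  # mass beyond rank 10,000

# A handful of common words covers far more text than the huge rare tail.
print(round(top_100_share, 2))
print(round(tail_share, 2))
```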


1.3 Why Character-Level Fails

Advantages of Characters

  • Zero OOV: Any text can be represented
  • Tiny vocabulary: ~100-300 characters (including special)
  • Language agnostic: Works for any script

Critical Disadvantages

1. Extremely Long Sequences

```text
Word-level:      "The quick brown fox" → 4 tokens
Character-level: "The quick brown fox" → 19 tokens

For attention: O(n²) complexity
- 4 tokens:  16 attention computations
- 19 tokens: 361 attention computations
```
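The quadratic blow-up above is easy to verify directly:

```python
# Self-attention compares every token with every other token: cost grows as n².
sentence = "The quick brown fox"

n_word = len(sentence.split())  # 4 word-level tokens
n_char = len(sentence)          # 19 character-level tokens

print(n_word ** 2)                # 16
print(n_char ** 2)                # 361
print(n_char ** 2 / n_word ** 2)  # ~22.6x more attention computations
```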

2. Difficult to Learn Word-Level Patterns

```python
# The model must learn:
# ['c', 'a', 't'] is an animal
# ['d', 'o', 'g'] is also an animal
# These share no characters, so no shared learning!

# With subwords:
# "cat" and "dog" can both connect to "animal" patterns
```

3. Increased Error Rate

Each character adds noise:

```text
"hello" → ['h', 'e', 'l', 'l', 'o']
"helo"  → ['h', 'e', 'l', 'o']       # Only 1 char difference
"jello" → ['j', 'e', 'l', 'l', 'o']  # Also 1 char difference

The model struggles to distinguish typos from different words
```

4. Training Inefficiency

```python
# Characters require many more steps to learn patterns
# Example: Learning that "play" relates to "playing", "played", "plays"

# A character-level model sees:
# p-l-a-y-i-n-g
# p-l-a-y-e-d
# p-l-a-y-s
# It must learn the "p-l-a-y" pattern from scratch each time

# A subword model sees:
# play + ing
# play + ed
# play + s
# It immediately recognizes "play" as the common element
```

1.4 Morphological Awareness

What is Morphology?

Morphology studies how words are formed from smaller meaningful units (morphemes):

```text
unhappiness = un- (negation) + happy (root) + -ness (noun-forming)
playing     = play (root) + -ing (present participle)
transformer = transform (root) + -er (agent noun)
```

Why It Matters for NLP

Words sharing morphemes have related meanings:

```text
play, plays, played, playing, player, playful, replay
│              same root: "play"              │

Subword tokenization captures this:
  playing → [play, ##ing]
  player  → [play, ##er]
  playful → [play, ##ful]

The model learns "play" once and applies it everywhere!
```
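Splits like these can be reproduced with a toy greedy longest-match segmenter; the vocabulary and the WordPiece-style "##" continuation marker below are illustrative (real tokenizers learn their vocabulary from data):

```python
# Toy greedy longest-match segmentation with a hand-written subword vocabulary.
VOCAB = {"play", "##ing", "##er", "##ful", "##ed", "##s"}

def segment(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Non-initial pieces carry the "##" continuation marker.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:  # no piece in the vocabulary matched
            return ["<UNK>"]
    return pieces

print(segment("playing"))  # ['play', '##ing']
print(segment("playful"))  # ['play', '##ful']
```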

German: A Morphological Challenge

German has compound words that combine multiple words:

```text
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
= Beef labeling supervision task transfer law

Word-level: This is ONE word (OOV!)
Character-level: 63 characters (too long!)
Subword: [Rind, fleisch, etikett, ierung, s, überwach, ung, s, aufgaben, ...]
         (Captures meaningful components)
```

More German examples:

```text
Krankenhaus (hospital) = Kranken (sick) + Haus (house)
Handschuh (glove)      = Hand (hand) + Schuh (shoe)
Flugzeug (airplane)    = Flug (flight) + Zeug (stuff/device)

Subword tokenization naturally handles these!
```

1.5 The Translation Challenge

Source and Target Mismatch

In German-to-English translation:

```text
German:  "Ich spiele gern Fußball"
English: "I like playing soccer"

Word alignment is not 1:1:
  "spiele" → "playing" (different conjugation)
  "gern" → "like" (different word entirely)
  "Fußball" → "soccer" (or "football" - regional!)
```

Subwords Help Alignment

```text
German subwords:  [Ich, spiel, e, gern, Fuß, ball]
English subwords: [I, like, play, ing, socc, er]

Now the model can learn:
- "spiel" relates to "play"
- "Fuß" + "ball" relates to football/soccer
```

Shared Vocabulary Benefits

Many words are similar across languages:

```text
Computer (German)    → Computer (English)
Telefon (German)     → Telephone (English)
Information (German) → Information (English)

With a shared subword vocabulary,
both languages use the same tokens!
→ Better cross-lingual transfer
```
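The overlap can be sketched as a set intersection; with one shared vocabulary, identical surface forms in both languages map to the same embedding row (the token sets below are illustrative):

```python
# Toy illustration of shared-vocabulary overlap between two languages.
german_tokens = {"Computer", "Telefon", "Information", "Haus"}
english_tokens = {"Computer", "Telephone", "Information", "house"}

shared = german_tokens & english_tokens
print(sorted(shared))  # ['Computer', 'Information']
```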

1.6 Trade-offs: Vocabulary Size vs Sequence Length

The Fundamental Trade-off

```text
 Large vocabulary  ←――――――――→  Small vocabulary
 Short sequences               Long sequences
       ↑                             ↑
   Word-level                Character-level

     Subword tokenization sits in between
```

Practical Numbers

| Approach    | Vocab Size | Avg Tokens/Sentence | Memory | Training Speed |
|-------------|-----------|---------------------|--------|----------------|
| Character   | 100       | 80                  | Low    | Very Slow      |
| Subword 8K  | 8,000     | 25                  | Medium | Fast           |
| Subword 32K | 32,000    | 20                  | Medium | Fast           |
| Subword 50K | 50,000    | 18                  | Higher | Fast           |
| Word 100K   | 100,000   | 15                  | High   | Fast           |

Choosing Vocabulary Size

Too Small (< 8K):

  • Very long sequences
  • Common words split unnecessarily
  • "the" might become ["th", "e"]

Too Large (> 100K):

  • Many rare tokens with poor embeddings
  • Increased memory for embedding table
  • Diminishing returns

Sweet Spot (16K-50K):

  • Common words stay intact
  • Rare words split meaningfully
  • Manageable sequence lengths

1.7 Why Subword Wins

Summary of Advantages

| Property        | Character | Word  | Subword |
|-----------------|-----------|-------|---------|
| OOV Handling    | Perfect   | Poor  | Good    |
| Sequence Length | Long      | Short | Medium  |
| Vocabulary Size | Tiny      | Huge  | Medium  |
| Morphology      | None      | None  | Good    |
| Cross-lingual   | OK        | Poor  | Good    |
| Training Speed  | Slow      | Fast  | Fast    |

Real-World Evidence

Major language models use subword tokenization:

| Model   | Tokenizer     | Vocab Size |
|---------|---------------|------------|
| BERT    | WordPiece     | 30,522     |
| GPT-2/3 | BPE           | 50,257     |
| T5      | SentencePiece | 32,000     |
| LLaMA   | BPE           | 32,000     |
| BART    | BPE           | 50,265     |

No major modern model relies on pure word-level or character-level tokenization!


1.8 Subword Algorithms Overview

Three Main Algorithms

1. Byte-Pair Encoding (BPE)

  • Bottom-up: Start with characters, merge frequent pairs
  • Used by: GPT, RoBERTa, BART

2. WordPiece

  • Similar to BPE, but uses likelihood instead of frequency
  • Used by: BERT, DistilBERT

3. Unigram

  • Top-down: Start with large vocab, prune unlikely pieces
  • Used by: T5, ALBERT, XLNet
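As a preview, the core BPE step is just counting adjacent symbol pairs over a corpus and merging a most frequent pair (the word list is a toy example; the full algorithm is traced in the next section):

```python
from collections import Counter

# Count adjacent character pairs across a toy corpus.
words = ["low", "lower", "lowest"]
pair_counts = Counter((a, b) for w in words for a, b in zip(w, w[1:]))
print(pair_counts[("l", "o")])  # 3 -- among the most frequent pairs

# Merging 'l'+'o' into one symbol shortens every word's token sequence.
merged = [w.replace("lo", "@") for w in words]  # '@' stands for the new symbol
print([len(m) for m in merged])  # [2, 4, 5]
```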

For This Course

We'll focus on BPE because:

  1. Most intuitive algorithm
  2. Widely used in production
  3. Excellent performance for translation
  4. SentencePiece supports it

Summary

The Tokenization Problem

| Approach        | Pros                                    | Cons                               |
|-----------------|-----------------------------------------|------------------------------------|
| Word-level      | Short sequences, natural units          | OOV problem, huge vocabulary       |
| Character-level | Zero OOV, tiny vocabulary               | Very long sequences, slow learning |
| Subword         | Balanced vocabulary, handles rare words | Slightly more complex              |

Why Subword for Translation?

  1. Handles OOV: New words split into known subwords
  2. Cross-lingual: Shared subwords across languages
  3. Morphology: Captures meaningful word parts
  4. German compounds: Naturally decomposes long words
  5. Efficiency: Reasonable sequence lengths

Key Concepts

  • OOV Problem: Unknown words in fixed vocabulary
  • Morphemes: Smallest meaningful units in language
  • Vocabulary-Sequence Trade-off: Larger vocab = shorter sequences
  • BPE: Most common subword algorithm

Exercises

Conceptual Questions

  1. Why does German have a higher OOV rate than English with word-level tokenization?
  2. Explain why character-level models are slower to train for the same text.
  3. For machine translation, what are the advantages of a shared subword vocabulary across source and target languages?
  4. If you had unlimited compute, would character-level tokenization be better? Why or why not?

Analysis Exercises

  1. Take the sentence "The transformerized architecture outperformed everything" and show how it would be tokenized at word, character, and approximate subword levels.
  2. The German word "Donaudampfschifffahrtsgesellschaftskapitän" means "Danube steamship company captain." Why would word-level tokenization fail on it? How might subword tokenization handle it?
  3. Compare the sequence lengths for translating "I love you" to German "Ich liebe dich" at different tokenization levels.

Next Section Preview

In the next section, we'll dive deep into the Byte-Pair Encoding (BPE) algorithm. We'll trace through the algorithm step by step, understanding how it learns merge rules from data and builds a vocabulary of subword units.