Introduction
When preparing text for neural networks, we face a fundamental question: how do we break text into pieces? The choice of tokenization strategy dramatically affects model performance, vocabulary size, and the model's ability to handle unseen words.
This section explores the limitations of word-level and character-level approaches, and why subword tokenization became the standard for modern NLP.
1.1 The Tokenization Spectrum
Three Main Approaches
1. Character-level: "playing" → ['p', 'l', 'a', 'y', 'i', 'n', 'g']
2. Subword-level: "playing" → ['play', 'ing']
3. Word-level: "playing" → ['playing']

Each approach trades off between:
- Vocabulary size: How many unique tokens?
- Sequence length: How many tokens per sentence?
- Coverage: Can we handle any input?
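These trade-offs are easy to see in a few lines of Python; the subword split below is illustrative, since the actual pieces depend on a trained vocabulary:

```python
sentence = "playing"

# Character-level: every character is its own token
char_tokens = list(sentence)    # ['p', 'l', 'a', 'y', 'i', 'n', 'g']

# Word-level: whitespace split (real systems add punctuation handling)
word_tokens = sentence.split()  # ['playing']

# Subword-level: illustrative split; real pieces come from a trained vocabulary
subword_tokens = ["play", "ing"]

print(char_tokens, word_tokens, subword_tokens)
```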
Visual Comparison
```
Sentence: "The transformer architecture revolutionized NLP"

Word-level (5 tokens):
  ["The", "transformer", "architecture", "revolutionized", "NLP"]
  Vocabulary: millions of unique words
  Problem: What about "transformerized"? → <UNK>

Character-level (47 tokens):
  ['T', 'h', 'e', ' ', 't', 'r', 'a', 'n', 's', ...]
  Vocabulary: ~100 characters
  Problem: Very long sequences, hard to learn patterns

Subword-level (8 tokens):
  ["The", "transform", "er", "architec", "ture", "revolution", "ized", "NLP"]
  Vocabulary: 30K-50K subwords
  Best of both worlds!
```
1.2 The Out-of-Vocabulary (OOV) Problem
Word-Level Limitations
With word-level tokenization, we must decide on a fixed vocabulary:
```python
# Example: vocabulary of the 50,000 most common words
vocab = {"the", "a", "transformer", "model"}  # ...50K words in practice

# At inference time
text = "The COVID-19 pandemic affected transformerization"

# Any word outside the vocabulary collapses to <UNK>:
tokens = [w if w.lower() in vocab else "<UNK>" for w in text.split()]

# Even with a full 50K vocabulary, rare and novel words are lost:
# "COVID-19"           → <UNK>
# "transformerization" → <UNK>
```
OOV Statistics
For a typical English vocabulary:
- 50K words: ~5% OOV rate on news text
- 100K words: ~3% OOV rate
- 500K words: ~1% OOV rate (but massive memory!)
For morphologically rich languages (German, Finnish, Turkish):
- 50K words: ~15-25% OOV rate!
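Measuring an OOV rate is straightforward once a vocabulary is fixed. A minimal sketch with a toy vocabulary and corpus (the words and numbers here are illustrative, not real statistics):

```python
# Toy vocabulary and corpus; real measurements use 50K+ word vocabularies
vocab = {"the", "model", "was", "trained", "on", "news", "text"}
corpus = "the model was trained on covid-19 news text".split()

# Words the vocabulary cannot represent
oov = [w for w in corpus if w not in vocab]
oov_rate = len(oov) / len(corpus)

print(oov, f"{oov_rate:.1%}")  # ['covid-19'] 12.5%
```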
The Long Tail Problem
Word frequency follows Zipf's Law:
```
Frequency
     |
1000 |*
     |**
 100 |****
     |*******
  10 |**************
     |**************************
   1 |___________________________
      Rank: 1   100   1000   10000
```
Most words appear rarely, but they're often important (names, technical terms, new words).
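The long tail shows up in even a tiny sample; this sketch counts word frequencies with Python's `Counter` (any larger corpus shows the same Zipfian shape far more dramatically):

```python
from collections import Counter

text = ("the cat sat on the mat the dog sat on the rug "
        "a cat and a dog met a zebra")
counts = Counter(text.split())

# A few words dominate the counts...
for word, freq in counts.most_common(5):
    print(word, freq)

# ...while most distinct words appear only once
singletons = sum(1 for f in counts.values() if f == 1)
print(f"{singletons} of {len(counts)} distinct words appear only once")
```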
1.3 Why Character-Level Fails
Advantages of Characters
- Zero OOV: Any text can be represented
- Tiny vocabulary: ~100-300 characters (including special)
- Language agnostic: Works for any script
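The zero-OOV and language-agnostic properties follow directly from operating on characters: any Unicode string, in any script, decomposes into tokens the model has already seen.

```python
# Any string decomposes into character tokens -- nothing is ever <UNK>
for text in ["hello", "naïve", "東京", "Fußball"]:
    print(text, "→", list(text))
```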
Critical Disadvantages
1. Extremely Long Sequences
```
Word-level: "The quick brown fox" → 4 tokens
Character:  "The quick brown fox" → 19 tokens

For attention: O(n²) complexity
- 4 tokens:  16 attention computations
- 19 tokens: 361 attention computations
```
2. Difficult to Learn Word-Level Patterns
```python
# Model must learn:
#   ['c', 'a', 't'] is an animal
#   ['d', 'o', 'g'] is also an animal
# These share no characters, so no shared learning!

# With subwords:
#   "cat" and "dog" can both connect to "animal" patterns
```
3. Increased Error Rate
Each character adds noise:
```
"hello" → ['h', 'e', 'l', 'l', 'o']
"helo"  → ['h', 'e', 'l', 'o']       # Only 1 char difference
"jello" → ['j', 'e', 'l', 'l', 'o']  # Also 1 char difference

The model struggles to distinguish typos from different words
```
4. Training Inefficiency
```python
# Characters require many more steps to learn patterns
# Example: learning that "play" relates to "playing", "played", "plays"

# Character-level model sees:
#   p-l-a-y-i-n-g
#   p-l-a-y-e-d
#   p-l-a-y-s
# Must learn the "p-l-a-y" pattern from scratch each time

# Subword model sees:
#   play + ing
#   play + ed
#   play + s
# Immediately recognizes "play" as the common element
```
1.4 Morphological Awareness
What is Morphology?
Morphology studies how words are formed from smaller meaningful units (morphemes):
```
unhappiness = un- (negation) + happy (root) + -ness (noun-forming)
playing     = play (root) + -ing (present participle)
transformer = transform (root) + -er (agent noun)
```
Why It Matters for NLP
Words sharing morphemes have related meanings:
```
play, plays, played, playing, player, playful, replay
│                same root: "play"                   │

Subword tokenization captures this:
  playing → [play, ##ing]
  player  → [play, ##er]
  playful → [play, ##ful]

The model learns "play" once and applies it everywhere!
```
German: A Morphological Challenge
German has compound words that combine multiple words:
```
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
= Beef labeling supervision task transfer law

Word-level:      This is ONE word (OOV!)
Character-level: 63 characters (too long!)
Subword:         [Rind, fleisch, etikett, ierung, s, überwach, ung, s, aufgaben, ...]
                 (Captures meaningful components)
```
More German examples:
```
Krankenhaus (hospital) = Kranken (sick) + Haus (house)
Handschuh (glove)      = Hand (hand)    + Schuh (shoe)
Flugzeug (airplane)    = Flug (flight)  + Zeug (stuff/device)

Subword tokenization naturally handles these!
```
1.5 The Translation Challenge
Source and Target Mismatch
In German-to-English translation:
```
German:  "Ich spiele gern Fußball"
English: "I like playing soccer"

Word alignment is not 1:1:
  "spiele"  → "playing" (different conjugation)
  "gern"    → "like"    (different word entirely)
  "Fußball" → "soccer"  (or "football" - regional!)
```
Subwords Help Alignment
```
German subwords:  [Ich, spiel, e, gern, Fuß, ball]
English subwords: [I, like, play, ing, socc, er]

Now the model can learn:
- "spiel" relates to "play"
- "Fuß" + "ball" relates to football/soccer
```
Shared Vocabulary Benefits
Many words are similar across languages:
```
Computer (German)    → Computer (English)
Telefon (German)     → Telephone (English)
Information (German) → Information (English)

With a shared subword vocabulary,
both languages use the same tokens!
→ Better cross-lingual transfer
```
1.6 Trade-offs: Vocabulary Size vs Sequence Length
The Fundamental Trade-off
```
Vocabulary Size ←――――――――→ Sequence Length
     Large                     Short
       ↑                         ↑
  Word-level              Character-level

  Subword tokenization sits in between
```
Practical Numbers
| Approach | Vocab Size | Avg Tokens/Sentence | Memory | Training Speed |
|---|---|---|---|---|
| Character | 100 | 80 | Low | Very Slow |
| Subword 8K | 8,000 | 25 | Medium | Fast |
| Subword 32K | 32,000 | 20 | Medium | Fast |
| Subword 50K | 50,000 | 18 | Higher | Fast |
| Word 100K | 100,000 | 15 | High | Fast |
Choosing Vocabulary Size
Too Small (< 8K):
- Very long sequences
- Common words split unnecessarily
- "the" might become ["th", "e"]
Too Large (> 100K):
- Many rare tokens with poor embeddings
- Increased memory for embedding table
- Diminishing returns
Sweet Spot (16K-50K):
- Common words stay intact
- Rare words split meaningfully
- Manageable sequence lengths
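The memory cost driving the "too large" regime can be estimated directly as vocab_size × embedding_dim × bytes per parameter. A sketch assuming a 768-dimensional model stored in float32 (4 bytes per parameter); the dimensions are illustrative:

```python
def embedding_mb(vocab_size, dim=768, bytes_per_param=4):
    # Size of the embedding matrix alone, in megabytes
    return vocab_size * dim * bytes_per_param / 1e6

for v in [8_000, 32_000, 50_000, 100_000]:
    print(f"{v:>7} tokens → {embedding_mb(v):7.1f} MB")
```

Going from a 32K to a 100K vocabulary roughly triples this table (about 98 MB → 307 MB at these settings), and for tied input/output embeddings the output projection grows with it.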
1.7 Why Subword Wins
Summary of Advantages
| Property | Character | Word | Subword |
|---|---|---|---|
| OOV Handling | Perfect | Poor | Good |
| Sequence Length | Long | Short | Medium |
| Vocabulary Size | Tiny | Huge | Medium |
| Morphology | None | None | Good |
| Cross-lingual | OK | Poor | Good |
| Training Speed | Slow | Fast | Fast |
Real-World Evidence
Major language models use subword tokenization:
| Model | Tokenizer | Vocab Size |
|---|---|---|
| BERT | WordPiece | 30,522 |
| GPT-2/3 | BPE | 50,257 |
| T5 | SentencePiece | 32,000 |
| LLaMA | BPE | 32,000 |
| BART | BPE | 50,265 |
Virtually no major model uses word-level or pure character-level tokenization.
1.8 Subword Algorithms Overview
Three Main Algorithms
1. Byte-Pair Encoding (BPE)
- Bottom-up: Start with characters, merge frequent pairs
- Used by: GPT, RoBERTa, BART
2. WordPiece
- Similar to BPE, but uses likelihood instead of frequency
- Used by: BERT, DistilBERT
3. Unigram
- Top-down: Start with large vocab, prune unlikely pieces
- Used by: T5, ALBERT, XLNet
For This Course
We'll focus on BPE because:
- Most intuitive algorithm
- Widely used in production
- Excellent performance for translation
- SentencePiece supports it
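As a preview of the next section, here is a minimal sketch of a single BPE merge step: count adjacent symbol pairs over a toy word-frequency table (the classic low/lower/new/newest example) and merge the most frequent pair. This is a simplification of the full training loop:

```python
from collections import Counter

# Toy corpus: words as tuples of symbols, with their frequencies
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w"): 3, ("n", "e", "w", "e", "s", "t"): 6}

def most_frequent_pair(corpus):
    # Count every adjacent symbol pair, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Rewrite every word, fusing each occurrence of the chosen pair
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(corpus)
print("merging:", pair)
corpus = merge_pair(corpus, pair)
```

Repeating this loop until the vocabulary reaches a target size yields the full list of merge rules; the next section traces it step by step.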
Summary
The Tokenization Problem
| Approach | Pros | Cons |
|---|---|---|
| Word-level | Short sequences, natural units | OOV problem, huge vocabulary |
| Character-level | Zero OOV, tiny vocabulary | Very long sequences, slow learning |
| Subword | Balanced vocabulary, handles rare words | Slightly more complex |
Why Subword for Translation?
- Handles OOV: New words split into known subwords
- Cross-lingual: Shared subwords across languages
- Morphology: Captures meaningful word parts
- German compounds: Naturally decomposes long words
- Efficiency: Reasonable sequence lengths
Key Concepts
- OOV Problem: Unknown words in fixed vocabulary
- Morphemes: Smallest meaningful units in language
- Vocabulary-Sequence Trade-off: Larger vocab = shorter sequences
- BPE: Most common subword algorithm
Exercises
Conceptual Questions
- Why does German have a higher OOV rate than English with word-level tokenization?
- Explain why character-level models are slower to train for the same text.
- For machine translation, what are the advantages of a shared subword vocabulary across source and target languages?
- If you had unlimited compute, would character-level tokenization be better? Why or why not?
Analysis Exercises
- Take the sentence "The transformerized architecture outperformed everything" and show how it would be tokenized at word, character, and approximate subword levels.
- The German word "Donaudampfschifffahrtsgesellschaftskapitän" (Danube steamship company captain) is unlikely to appear in any fixed vocabulary. Why would word-level tokenization fail on it? How might subword tokenization handle it?
- Compare the sequence lengths for translating "I love you" to German "Ich liebe dich" at different tokenization levels.
Next Section Preview
In the next section, we'll dive deep into the Byte-Pair Encoding (BPE) algorithm. We'll trace through the algorithm step by step, understanding how it learns merge rules from data and builds a vocabulary of subword units.