Introduction
Byte-Pair Encoding (BPE) was originally a data compression algorithm, repurposed for NLP tokenization by Sennrich et al. (2016). The core idea is brilliantly simple: start with characters, then iteratively merge the most frequent pair.
This section explains the algorithm step-by-step with detailed examples.
2.1 The Core Idea
From Compression to Tokenization
Original BPE (compression):
Replace frequent byte pairs with new symbols to reduce file size.
BPE for NLP:
Replace frequent character pairs to build meaningful subword units.
The Process
1Start: ["l", "o", "w", "</w>"]
2Step 1: Merge most frequent pair
3 ("l", "o") β "lo"
4 ["lo", "w", "</w>"]
5Step 2: Merge next most frequent
6 ("lo", "w") β "low"
7 ["low", "</w>"]
8Step 3: Continue until vocab_size reachedKey Insight
Frequent character combinations often correspond to meaningful units:
- "ing", "tion", "ment" (suffixes)
- "un", "re", "pre" (prefixes)
- "the", "and", "for" (common words)
BPE discovers these automatically from data!
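The merge mechanics described above can be sketched in a few lines of Python (a minimal illustration of the idea only; `count_pairs` and `merge_pair` are placeholder names, not the implementation built later):

```python
from collections import Counter

def count_pairs(symbols):
    """Count adjacent symbol pairs in one word's symbol list."""
    return Counter(zip(symbols, symbols[1:]))

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

word = ["l", "o", "w", "</w>"]
word = merge_pair(word, ("l", "o"))   # ["lo", "w", "</w>"]
word = merge_pair(word, ("lo", "w"))  # ["low", "</w>"]
print(word)
```

Each call performs exactly one merge step; the full algorithm simply repeats this, always choosing the most frequent pair over the whole corpus.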
2.2 Algorithm Steps
Training Phase
Input: Corpus of text, desired vocabulary size
Output: List of merge rules, vocabulary
Algorithm: Train BPE

1. Initialize vocabulary with all characters in corpus
2. Add end-of-word marker (</w>) to each word
3. Count frequency of each word in corpus
4. Repeat until vocabulary reaches desired size:
   a. Count frequency of all adjacent symbol pairs
   b. Find most frequent pair
   c. Merge that pair into a new symbol
   d. Add merge rule to list
   e. Update word representations
5. Return merge rules and vocabulary

Encoding Phase
Input: Text to encode, learned merge rules
Output: List of subword tokens
Algorithm: Encode with BPE

1. Split text into words
2. For each word:
   a. Split into characters + </w>
   b. Apply merge rules in learned order
   c. Return resulting symbols
3. Return all tokens

2.3 Detailed Walkthrough
Training Example
Corpus:
1"low" appears 5 times
2"lower" appears 2 times
3"newest" appears 6 times
4"widest" appears 3 timesStep 0: Initialize
Split each word into characters with end-of-word marker:
Vocabulary: {l, o, w, e, r, n, s, t, i, d, </w>}

Word representations:
"low" (5x): l o w </w>
"lower" (2x): l o w e r </w>
"newest" (6x): n e w e s t </w>
"widest" (3x): w i d e s t </w>

Step 1: Count pairs, merge most frequent
Pair frequencies:
(e, s): 6 (newest) + 3 (widest) = 9 → MOST FREQUENT
(s, t): 6 + 3 = 9 (tied, pick first alphabetically)
(l, o): 5 + 2 = 7
(o, w): 5 + 2 = 7
(t, </w>): 6 + 3 = 9 (tied)
...

Merge (e, s) → "es"

New vocabulary: {l, o, w, e, r, n, s, t, i, d, </w>, es}

Updated representations:
"low" (5x): l o w </w>
"lower" (2x): l o w e r </w>
"newest" (6x): n e w es t </w>
"widest" (3x): w i d es t </w>

Merge rule #1: (e, s) → es
Step 2: Count pairs again
Pair frequencies:
(es, t): 6 + 3 = 9 → MOST FREQUENT
(l, o): 7
(o, w): 7
(t, </w>): 9 (tied)
...

Merge (es, t) → "est"

New vocabulary: {..., es, est}

Updated representations:
"low" (5x): l o w </w>
"lower" (2x): l o w e r </w>
"newest" (6x): n e w est </w>
"widest" (3x): w i d est </w>

Merge rule #2: (es, t) → est
Step 3: Continue
Pair frequencies:
(est, </w>): 9
(l, o): 7
(o, w): 7
...

Merge (est, </w>) → "est</w>"

Updated representations:
"low" (5x): l o w </w>
"lower" (2x): l o w e r </w>
"newest" (6x): n e w est</w>
"widest" (3x): w i d est</w>

Merge rule #3: (est, </w>) → est</w>
Final merge rules (first 6):
1. (e, s) → es
2. (es, t) → est
3. (est, </w>) → est</w>
4. (l, o) → lo
5. (lo, w) → low
6. (low, </w>) → low</w> (shown for illustration; strictly by count, (n, e) at frequency 6 would outrank (low, </w>) at 5)

2.4 Encoding with Learned Rules
Encoding Process
To encode a new word, apply merge rules in order:
Example: Encode "lowest"
Initial: l o w e s t </w>

Apply rule 1 (e, s) → es:
    l o w es t </w>

Apply rule 2 (es, t) → est:
    l o w est </w>

Apply rule 3 (est, </w>) → est</w>:
    l o w est</w>

Apply rule 4 (l, o) → lo:
    lo w est</w>

Apply rule 5 (lo, w) → low:
    low est</w>

Rule 6 (low, </w>) doesn't match, since "low" is no longer followed by a bare </w>, so it is skipped

Final tokens: ["low", "est</w>"]

Encoding Unknown Words
Example: Encode "newest" (seen in training)
Initial: n e w e s t </w>
After rules 1-3: n e w est</w>
Final: ["n", "e", "w", "est</w>"]

Example: Encode "newer" (NOT seen, but its parts are)
Initial: n e w e r </w>
None of the learned merge rules apply, so the characters stay separate
Final: ["n", "e", "w", "e", "r", "</w>"]

The word "newer" wasn't in training, but we can still encode it using known characters!
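The encoding behavior in both examples can be sketched as follows (a minimal version using the six merge rules from the walkthrough; `encode_word` is an illustrative name, not the implementation built later):

```python
def encode_word(word, merge_rules):
    """Encode one word by applying BPE merge rules in learned order."""
    symbols = list(word) + ["</w>"]
    for a, b in merge_rules:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# The first six merge rules learned in the walkthrough.
rules = [("e", "s"), ("es", "t"), ("est", "</w>"),
         ("l", "o"), ("lo", "w"), ("low", "</w>")]

print(encode_word("lowest", rules))  # ["low", "est</w>"]
print(encode_word("newer", rules))   # ["n", "e", "w", "e", "r", "</w>"]
```

Note that rules must be applied in the order they were learned: rule 2 (es, t) can only fire after rule 1 (e, s) has created the "es" symbol.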
2.5 The End-of-Word Marker
Why </w>?
The end-of-word marker distinguishes:
- "est" inside a word: "testing" β test + ing
- "est" at end of word: "newest" β new + est</w>
Without it:
1"newest" β ["new", "est"]
2"testing" β ["test", "ing"]
3
4Problem: How does decoder know where spaces go?
5"newest" vs "new est" both produce ["new", "est"]With </w>:
1"newest" β ["new", "est</w>"] β ends with word-final token
2"new est" β ["new</w>", "est</w>"] β both are word-final
3
4Now we know exactly where words end!Alternative Markers
Different implementations use different markers:
| System | Marker | Example |
|---|---|---|
| Original BPE | </w> (end) | low</w> |
| GPT-2 | Ġ (space prefix) | Ġlow |
| SentencePiece | ▁ (underscore prefix) | ▁low |
| BERT WordPiece | ## (continuation) | low, ##est |
GPT-2/SentencePiece approach (space as prefix):
1"the lowest price"
2β ["the", "Δ lowest", "Δ price"] or ["βthe", "βlow", "est", "βprice"]2.6 Vocabulary Size Selection
Trade-offs
Smaller Vocabulary (8K-16K):
- Longer sequences
- More aggressive splitting
- Better handling of rare words
- Smaller embedding table
- May split common words
Larger Vocabulary (32K-64K):
- Shorter sequences
- Common words stay intact
- Larger embedding table
- Some rare tokens poorly trained
- Better for languages with large character sets
Practical Guidelines
| Use Case | Recommended Size |
|---|---|
| Resource-limited | 8K-16K |
| General English NLP | 30K-50K |
| Machine Translation | 32K-37K |
| Multilingual models | 50K-100K |
| Code generation | 50K+ |
For German-English Translation
We'll use ~32,000 subwords because:
- Standard for translation tasks
- Handles German compounds well
- Reasonable sequence lengths
- Matches benchmark configurations
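One consequence of the training loop is that each merge adds exactly one new symbol, so the number of merges to learn is roughly the target size minus the base alphabet (a quick back-of-the-envelope sketch; the base-symbol count of 200 is an assumed ballpark, not a measured value):

```python
def merges_needed(target_vocab_size, num_base_symbols, num_special_tokens=0):
    """Each BPE merge adds exactly one new symbol to the vocabulary,
    so the merge count is whatever remains after base symbols and specials."""
    return target_vocab_size - num_base_symbols - num_special_tokens

# e.g. a ~32K vocabulary over an assumed ~200 base characters
print(merges_needed(32_000, 200))  # 31800
```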
2.7 BPE vs WordPiece vs Unigram
BPE (Byte-Pair Encoding)
Approach: Bottom-up, merge most frequent pairs
Selection criterion: Raw frequency count
Used by: GPT-2, GPT-3, RoBERTa, BART
Pros:
- Simple and intuitive
- Fast training
- Deterministic merges

Cons:
- Greedy, may not find optimal segmentation
- Order of merges matters

WordPiece
Approach: Similar to BPE, but different selection
Selection criterion: Likelihood improvement
Used by: BERT, DistilBERT, Electra
Score for merging (a, b) → ab:
    score = freq(ab) / (freq(a) × freq(b))

This prefers merges that:
- are frequent together (high freq(ab))
- BUT whose components are rare alone (low freq(a), freq(b))

Example:
1"un" + "like" vs "the" + "n"
2
3freq("unlike") = 100, freq("un") = 500, freq("like") = 800
4score = 100 / (500 Γ 800) = 0.00025
5
6freq("then") = 50, freq("the") = 10000, freq("n") = 5000
7score = 50 / (10000 Γ 5000) = 0.000001
8
9"unlike" wins! (higher score)Unigram Language Model
Approach: Top-down, start large and prune
Selection criterion: Keep tokens that minimize loss
Used by: T5, ALBERT, XLNet
Process:
1. Start with large vocab (all substrings up to length k)
2. Compute probability of corpus under unigram model
3. For each token, compute loss if removed
4. Remove tokens with lowest loss (least impact)
5. Repeat until desired vocab size

Comparison Table
| Aspect | BPE | WordPiece | Unigram |
|---|---|---|---|
| Direction | Bottom-up | Bottom-up | Top-down |
| Criterion | Frequency | Likelihood | Loss |
| Deterministic | Yes | Yes | No |
| Multiple segmentations | No | No | Yes |
| Training speed | Fast | Medium | Slower |
| Popular in | GPT, BART | BERT | T5, XLNet |
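The difference between the BPE and WordPiece selection criteria can be made concrete with a toy sketch (uses the example frequencies from the WordPiece section above; real trainers work with log-likelihoods, and these helper names are illustrative):

```python
def bpe_score(freq_ab, freq_a, freq_b):
    """BPE ranks candidate merges by raw pair frequency alone."""
    return freq_ab

def wordpiece_score(freq_ab, freq_a, freq_b):
    """WordPiece normalizes pair frequency by its parts' frequencies."""
    return freq_ab / (freq_a * freq_b)

# ("un" + "like") vs ("the" + "n"), with the counts from the example above
candidates = {
    ("un", "like"): (100, 500, 800),   # freq(ab), freq(a), freq(b)
    ("the", "n"): (50, 10_000, 5_000),
}

for pair, freqs in candidates.items():
    print(pair, bpe_score(*freqs), wordpiece_score(*freqs))

# With these counts both criteria pick ("un", "like"); but if freq("then")
# rose to 200, BPE would flip to ("the", "n") while WordPiece would not.
```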
2.8 Handling Special Cases
Numbers
1"2023" can be tokenized as:
2- ["2", "0", "2", "3"] (character-level)
3- ["20", "23"] (subword)
4- ["2023"] (if frequent enough)Punctuation
1"Hello, world!" β
2- With space: ["Hello", ",", " ", "world", "!"]
3- GPT-2 style: ["Hello", ",", "Δ world", "!"]Unicode and Multilingual
Byte-level BPE (as used by GPT-2) works on bytes, not characters:
1"CafΓ©" in bytes:
2- UTF-8: [67, 97, 102, 195, 169]
3- "Γ©" = bytes [195, 169]
4
5BPE might learn:
6- [195, 169] β "Γ©" (if frequent)
7- Or keep as separate bytes (rare characters)Summary
BPE Algorithm
- Initialize: Characters + end-of-word marker
- Iterate: Count pairs → Merge most frequent → Update
- Stop: When vocabulary reaches desired size
- Encode: Apply merges in learned order
Key Properties
| Property | Value |
|---|---|
| Direction | Bottom-up |
| Selection | Frequency-based |
| Deterministic | Yes (given corpus) |
| Complexity | O(N × V) for N words, V vocab |
Implementation Preview
# What we'll implement in the next section
def train_bpe(corpus, vocab_size):
    """Learn BPE merge rules from corpus."""
    ...
    return merge_rules, vocabulary

def encode_bpe(text, merge_rules):
    """Encode text using learned merge rules."""
    ...
    return tokens

def decode_bpe(tokens):
    """Decode tokens back to text."""
    ...
    return text

Exercises
Algorithm Tracing
- Given the corpus {"cat": 3, "car": 5, "card": 2}, trace the first 3 BPE merges.
- With merge rules [(a, t) → at, (c, at) → cat], encode the word "catch".
- Why does merge order matter? Give an example where different orders produce different results.
Conceptual Questions
- Why does BPE use an end-of-word marker? What would happen without it?
- A vocabulary of 30K has both "play" and "playing" as tokens. How might this happen in BPE?
- Compare BPE with WordPiece: when would WordPiece merge differently than BPE?
Analysis
- For a corpus with 1M words and target vocab of 32K, approximately how many merge operations are needed?
- How does vocabulary size affect translation quality? What's the tradeoff?
Next Section Preview
In the next section, we'll implement BPE from scratch in Python. We'll build:
- get_pair_frequencies(): Count all adjacent pairs
- merge_pair(): Merge a pair throughout the vocabulary
- train_bpe(): Full training loop
- encode_bpe(): Apply learned rules to new text