Introduction
Byte-Pair Encoding (BPE) was originally a data compression algorithm, repurposed for NLP tokenization by Sennrich et al. (2016). The core idea is brilliantly simple: start with characters, then iteratively merge the most frequent pair.
This section explains the algorithm step-by-step with detailed examples.
2.1 The Core Idea
From Compression to Tokenization
Original BPE (compression):
Replace frequent byte pairs with new symbols to reduce file size.
BPE for NLP:
Replace frequent character pairs to build meaningful subword units.
The Process
1Start: ["l", "o", "w", "</w>"]
2Step 1: Merge most frequent pair
3 ("l", "o") β "lo"
4 ["lo", "w", "</w>"]
5Step 2: Merge next most frequent
6 ("lo", "w") β "low"
7 ["low", "</w>"]
8Step 3: Continue until vocab_size reachedKey Insight
Frequent character combinations often correspond to meaningful units:
- "ing", "tion", "ment" (suffixes)
- "un", "re", "pre" (prefixes)
- "the", "and", "for" (common words)
BPE discovers these automatically from data!
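The merge mechanics described above can be sketched in a few lines of Python (a minimal illustration of the idea only; `count_pairs` and `merge_pair` are placeholder names, not the implementation built later):

```python
from collections import Counter

def count_pairs(symbols):
    """Count adjacent symbol pairs in one word's symbol list."""
    return Counter(zip(symbols, symbols[1:]))

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

word = ["l", "o", "w", "</w>"]
word = merge_pair(word, ("l", "o"))   # ["lo", "w", "</w>"]
word = merge_pair(word, ("lo", "w"))  # ["low", "</w>"]
print(word)
```

Each call performs exactly one merge step; the full algorithm simply repeats this, always choosing the most frequent pair over the whole corpus.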
2.2 Algorithm Steps
Training Phase
Input: Corpus of text, desired vocabulary size
Output: List of merge rules, vocabulary
Algorithm: Train BPE

1. Initialize vocabulary with all characters in corpus
2. Add end-of-word marker (</w>) to each word
3. Count frequency of each word in corpus
4. Repeat until vocabulary reaches desired size:
   a. Count frequency of all adjacent symbol pairs
   b. Find most frequent pair
   c. Merge that pair into a new symbol
   d. Add merge rule to list
   e. Update word representations
5. Return merge rules and vocabulary

Encoding Phase
Input: Text to encode, learned merge rules
Output: List of subword tokens
Algorithm: Encode with BPE

1. Split text into words
2. For each word:
   a. Split into characters + </w>
   b. Apply merge rules in learned order
   c. Return resulting symbols
3. Return all tokens

2.3 Detailed Walkthrough
Training Example
Corpus:
1"low" appears 5 times
2"lower" appears 2 times
3"newest" appears 6 times
4"widest" appears 3 timesStep 0: Initialize
Split each word into characters with end-of-word marker:
Vocabulary: {l, o, w, e, r, n, s, t, i, d, </w>}

Word representations:
"low" (5x): l o w </w>
"lower" (2x): l o w e r </w>
"newest" (6x): n e w e s t </w>
"widest" (3x): w i d e s t </w>

Step 1: Count pairs, merge most frequent
Pair frequencies:
(e, s): 6 (newest) + 3 (widest) = 9 → MOST FREQUENT
(s, t): 6 + 3 = 9 (tied, pick first alphabetically)
(l, o): 5 + 2 = 7
(o, w): 5 + 2 = 7
(t, </w>): 6 + 3 = 9 (tied)
...

Merge (e, s) → "es"

New vocabulary: {l, o, w, e, r, n, s, t, i, d, </w>, es}

Updated representations:
"low" (5x): l o w </w>
"lower" (2x): l o w e r </w>
"newest" (6x): n e w es t </w>
"widest" (3x): w i d es t </w>

Merge rule #1: (e, s) → es
Step 2: Count pairs again
Pair frequencies:
(es, t): 6 + 3 = 9 → MOST FREQUENT
(l, o): 7
(o, w): 7
(t, </w>): 9 (tied)
...

Merge (es, t) → "est"

New vocabulary: {..., es, est}

Updated representations:
"low" (5x): l o w </w>
"lower" (2x): l o w e r </w>
"newest" (6x): n e w est </w>
"widest" (3x): w i d est </w>

Merge rule #2: (es, t) → est
Step 3: Continue
Pair frequencies:
(est, </w>): 9
(l, o): 7
(o, w): 7
...

Merge (est, </w>) → "est</w>"

Updated representations:
"low" (5x): l o w </w>
"lower" (2x): l o w e r </w>
"newest" (6x): n e w est</w>
"widest" (3x): w i d est</w>

Merge rule #3: (est, </w>) → est</w>
Final merge rules (first 6):
1. (e, s) → es
2. (es, t) → est
3. (est, </w>) → est</w>
4. (l, o) → lo
5. (lo, w) → low
6. (low, </w>) → low</w> (shown for illustration; strictly by count, (n, e) at frequency 6 would outrank (low, </w>) at 5)

2.4 Encoding with Learned Rules
Encoding Process
To encode a new word, apply merge rules in order:
Example: Encode "lowest"
Initial: l o w e s t </w>

Apply rule 1 (e, s) → es:
    l o w es t </w>

Apply rule 2 (es, t) → est:
    l o w est </w>

Apply rule 3 (est, </w>) → est</w>:
    l o w est</w>

Apply rule 4 (l, o) → lo:
    lo w est</w>

Apply rule 5 (lo, w) → low:
    low est</w>

Rule 6 (low, </w>) doesn't match, since "low" is no longer followed by a bare </w>, so it is skipped

Final tokens: ["low", "est</w>"]

Encoding Unknown Words
Example: Encode "newest" (seen in training)
Initial: n e w e s t </w>
After rules 1-3: n e w est</w>
Final: ["n", "e", "w", "est</w>"]

Example: Encode "newer" (NOT seen, but its parts are)
Initial: n e w e r </w>
None of the learned merge rules apply, so the characters stay separate
Final: ["n", "e", "w", "e", "r", "</w>"]

The word "newer" wasn't in training, but we can still encode it using known characters!
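The encoding behavior in both examples can be sketched as follows (a minimal version using the six merge rules from the walkthrough; `encode_word` is an illustrative name, not the implementation built later):

```python
def encode_word(word, merge_rules):
    """Encode one word by applying BPE merge rules in learned order."""
    symbols = list(word) + ["</w>"]
    for a, b in merge_rules:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# The first six merge rules learned in the walkthrough.
rules = [("e", "s"), ("es", "t"), ("est", "</w>"),
         ("l", "o"), ("lo", "w"), ("low", "</w>")]

print(encode_word("lowest", rules))  # ["low", "est</w>"]
print(encode_word("newer", rules))   # ["n", "e", "w", "e", "r", "</w>"]
```

Note that rules must be applied in the order they were learned: rule 2 (es, t) can only fire after rule 1 (e, s) has created the "es" symbol.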
2.5 The End-of-Word Marker
Why </w>?
The end-of-word marker distinguishes:
- "est" inside a word: "testing" β test + ing
- "est" at end of word: "newest" β new + est</w>
Without it:
1"newest" β ["new", "est"]
2"testing" β ["test", "ing"]
3
4Problem: How does decoder know where spaces go?
5"newest" vs "new est" both produce ["new", "est"]With </w>:
1"newest" β ["new", "est</w>"] β ends with word-final token
2"new est" β ["new</w>", "est</w>"] β both are word-final
3
4Now we know exactly where words end!Alternative Markers
Different implementations use different markers:
| System | Marker | Example |
|---|---|---|
| Original BPE | </w> (end) | low</w> |
| GPT-2 | Ġ (space prefix) | Ġlow |
| SentencePiece | ▁ (underscore prefix) | ▁low |
| BERT WordPiece | ## (continuation) | low, ##est |
GPT-2/SentencePiece approach (space as prefix):
1"the lowest price"
2β ["the", "Δ lowest", "Δ price"] or ["βthe", "βlow", "est", "βprice"]2.6 Vocabulary Size Selection
Trade-offs
Smaller Vocabulary (8K-16K):
- Longer sequences
- More aggressive splitting
- Better handling of rare words
- Smaller embedding table
- May split common words
Larger Vocabulary (32K-64K):
- Shorter sequences
- Common words stay intact
- Larger embedding table
- Some rare tokens poorly trained
- Better for languages with large character sets
Practical Guidelines
| Use Case | Recommended Size |
|---|---|
| Resource-limited | 8K-16K |
| General English NLP | 30K-50K |
| Machine Translation | 32K-37K |
| Multilingual models | 50K-100K |
| Code generation | 50K+ |
For German-English Translation
We'll use ~32,000 subwords because:
- Standard for translation tasks
- Handles German compounds well
- Reasonable sequence lengths
- Matches benchmark configurations
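One consequence of the training loop is that each merge adds exactly one new symbol, so the number of merges to learn is roughly the target size minus the base alphabet (a quick back-of-the-envelope sketch; the base-symbol count of 200 is an assumed ballpark, not a measured value):

```python
def merges_needed(target_vocab_size, num_base_symbols, num_special_tokens=0):
    """Each BPE merge adds exactly one new symbol to the vocabulary,
    so the merge count is whatever remains after base symbols and specials."""
    return target_vocab_size - num_base_symbols - num_special_tokens

# e.g. a ~32K vocabulary over an assumed ~200 base characters
print(merges_needed(32_000, 200))  # 31800
```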
2.7 BPE vs WordPiece vs Unigram
BPE (Byte-Pair Encoding)
Approach: Bottom-up, merge most frequent pairs
Selection criterion: Raw frequency count
Used by: GPT-2, GPT-3, RoBERTa, BART
Pros:
- Simple and intuitive
- Fast training
- Deterministic merges

Cons:
- Greedy, may not find optimal segmentation
- Order of merges matters

WordPiece
Approach: Similar to BPE, but different selection
Selection criterion: Likelihood improvement
Used by: BERT, DistilBERT, Electra
Score for merging (a, b) → ab:
    score = freq(ab) / (freq(a) × freq(b))

This prefers merges that:
- are frequent together (high freq(ab))
- BUT whose components are rare alone (low freq(a), freq(b))

Example:
1"un" + "like" vs "the" + "n"
2
3freq("unlike") = 100, freq("un") = 500, freq("like") = 800
4score = 100 / (500 Γ 800) = 0.00025
5
6freq("then") = 50, freq("the") = 10000, freq("n") = 5000
7score = 50 / (10000 Γ 5000) = 0.000001
8
9"unlike" wins! (higher score)Unigram Language Model
Approach: Top-down, start large and prune
Selection criterion: Keep tokens that minimize loss
Used by: T5, ALBERT, XLNet
Process:
1. Start with large vocab (all substrings up to length k)
2. Compute probability of corpus under unigram model
3. For each token, compute loss if removed
4. Remove tokens with lowest loss (least impact)
5. Repeat until desired vocab size

Comparison Table
| Aspect | BPE | WordPiece | Unigram |
|---|---|---|---|
| Direction | Bottom-up | Bottom-up | Top-down |
| Criterion | Frequency | Likelihood | Loss |
| Deterministic | Yes | Yes | No |
| Multiple segmentations | No | No | Yes |
| Training speed | Fast | Medium | Slower |
| Popular in | GPT, BART | BERT | T5, XLNet |
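The difference between the BPE and WordPiece selection criteria can be made concrete with a toy sketch (uses the example frequencies from the WordPiece section above; real trainers work with log-likelihoods, and these helper names are illustrative):

```python
def bpe_score(freq_ab, freq_a, freq_b):
    """BPE ranks candidate merges by raw pair frequency alone."""
    return freq_ab

def wordpiece_score(freq_ab, freq_a, freq_b):
    """WordPiece normalizes pair frequency by its parts' frequencies."""
    return freq_ab / (freq_a * freq_b)

# ("un" + "like") vs ("the" + "n"), with the counts from the example above
candidates = {
    ("un", "like"): (100, 500, 800),   # freq(ab), freq(a), freq(b)
    ("the", "n"): (50, 10_000, 5_000),
}

for pair, freqs in candidates.items():
    print(pair, bpe_score(*freqs), wordpiece_score(*freqs))

# With these counts both criteria pick ("un", "like"); but if freq("then")
# rose to 200, BPE would flip to ("the", "n") while WordPiece would not.
```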
2.8 Handling Special Cases
Numbers
1"2023" can be tokenized as:
2- ["2", "0", "2", "3"] (character-level)
3- ["20", "23"] (subword)
4- ["2023"] (if frequent enough)Punctuation
1"Hello, world!" β
2- With space: ["Hello", ",", " ", "world", "!"]
3- GPT-2 style: ["Hello", ",", "Δ world", "!"]Unicode and Multilingual
Byte-level BPE (as used by GPT-2) works on bytes, not characters:
1"CafΓ©" in bytes:
2- UTF-8: [67, 97, 102, 195, 169]
3- "Γ©" = bytes [195, 169]
4
5BPE might learn:
6- [195, 169] β "Γ©" (if frequent)
7- Or keep as separate bytes (rare characters)Summary
BPE Algorithm
- Initialize: Characters + end-of-word marker
- Iterate: Count pairs → Merge most frequent → Update
- Stop: When vocabulary reaches desired size
- Encode: Apply merges in learned order
Key Properties
| Property | Value |
|---|---|
| Direction | Bottom-up |
| Selection | Frequency-based |
| Deterministic | Yes (given corpus) |
| Complexity | O(N × V) for N words, V vocab |
Implementation Preview
# What we'll implement in the next section
def train_bpe(corpus, vocab_size):
    """Learn BPE merge rules from corpus."""
    ...
    return merge_rules, vocabulary

def encode_bpe(text, merge_rules):
    """Encode text using learned merge rules."""
    ...
    return tokens

def decode_bpe(tokens):
    """Decode tokens back to text."""
    ...
    return text

Exercises
Algorithm Tracing
- Given the corpus {"cat": 3, "car": 5, "card": 2}, trace the first 3 BPE merges.
- With merge rules [(a, t) → at, (c, at) → cat], encode the word "catch".
- Why does merge order matter? Give an example where different orders produce different results.
Conceptual Questions
- Why does BPE use an end-of-word marker? What would happen without it?
- A vocabulary of 30K has both "play" and "playing" as tokens. How might this happen in BPE?
- Compare BPE with WordPiece: when would WordPiece merge differently than BPE?
Analysis
- For a corpus with 1M words and target vocab of 32K, approximately how many merge operations are needed?
- How does vocabulary size affect translation quality? What's the tradeoff?
Next Section Preview
In the next section, we'll implement BPE from scratch in Python. We'll build:
- get_pair_frequencies(): Count all adjacent pairs
- merge_pair(): Merge a pair throughout the vocabulary
- train_bpe(): Full training loop
- encode_bpe(): Apply learned rules to new text