Chapter 5

Byte-Pair Encoding (BPE) Algorithm

Subword Tokenization for Translation

Introduction

Byte-Pair Encoding (BPE) was originally a data compression algorithm, repurposed for NLP tokenization by Sennrich et al. (2016). The core idea is brilliantly simple: start with characters, then iteratively merge the most frequent pair.

This section explains the algorithm step-by-step with detailed examples.


2.1 The Core Idea

From Compression to Tokenization

Original BPE (compression):

Replace frequent byte pairs with new symbols to reduce file size.

BPE for NLP:

Replace frequent character pairs to build meaningful subword units.

The Process

πŸ“text
1Start:   ["l", "o", "w", "</w>"]
2Step 1:  Merge most frequent pair
3         ("l", "o") β†’ "lo"
4         ["lo", "w", "</w>"]
5Step 2:  Merge next most frequent
6         ("lo", "w") β†’ "low"
7         ["low", "</w>"]
8Step 3:  Continue until vocab_size reached

Key Insight

Frequent character combinations often correspond to meaningful units:

  • "ing", "tion", "ment" (suffixes)
  • "un", "re", "pre" (prefixes)
  • "the", "and", "for" (common words)

BPE discovers these automatically from data!


2.2 Algorithm Steps

Training Phase

Input: Corpus of text, desired vocabulary size

Output: List of merge rules, vocabulary

πŸ“text
1Algorithm: Train BPE
2
31. Initialize vocabulary with all characters in corpus
42. Add end-of-word marker (</w>) to each word
53. Count frequency of each word in corpus
64. Repeat until vocabulary reaches desired size:
7   a. Count frequency of all adjacent symbol pairs
8   b. Find most frequent pair
9   c. Merge that pair into a new symbol
10   d. Add merge rule to list
11   e. Update word representations
125. Return merge rules and vocabulary
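
As a preview, the training loop above fits in a few lines of Python. This is a minimal sketch, not a production tokenizer: ties between equally frequent pairs are broken by iteration order here (tie-breaking is implementation-defined in practice), and the helper names echo the implementation preview at the end of this section, though the signatures are simplified to a word-count dict and a number of merges.

```python
from collections import Counter

def get_pair_frequencies(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace each occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(word_counts, num_merges):
    """Learn `num_merges` merge rules from a {word: count} corpus."""
    word_freqs = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    rules = []
    for _ in range(num_merges):
        pairs = get_pair_frequencies(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        word_freqs = merge_pair(best, word_freqs)
        rules.append(best)
    return rules
```

On the four-word corpus used in the walkthrough in 2.3, `train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)` learns the rules (e, s), (es, t), (est, </w>).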

Encoding Phase

Input: Text to encode, learned merge rules

Output: List of subword tokens

πŸ“text
1Algorithm: Encode with BPE
2
31. Split text into words
42. For each word:
5   a. Split into characters + </w>
6   b. Apply merge rules in learned order
7   c. Return resulting symbols
83. Return all tokens

2.3 Detailed Walkthrough

Training Example

Corpus:

πŸ“text
1"low" appears 5 times
2"lower" appears 2 times
3"newest" appears 6 times
4"widest" appears 3 times

Step 0: Initialize

Split each word into characters with end-of-word marker:

πŸ“text
1Vocabulary: {l, o, w, e, r, n, s, t, i, d, </w>}
2
3Word representations:
4"low"     (5x):  l o w </w>
5"lower"   (2x):  l o w e r </w>
6"newest"  (6x):  n e w e s t </w>
7"widest"  (3x):  w i d e s t </w>

Step 1: Count pairs, merge most frequent

πŸ“text
1Pair frequencies:
2(e, s): 6 (newest) + 3 (widest) = 9  ← MOST FREQUENT
3(s, t): 6 + 3 = 9  (tied, pick first alphabetically)
4(l, o): 5 + 2 = 7
5(o, w): 5 + 2 = 7
6(t, </w>): 6 + 3 = 9 (tied)
7...
8
9Merge (e, s) β†’ "es"
10
11New vocabulary: {l, o, w, e, r, n, s, t, i, d, </w>, es}
12
13Updated representations:
14"low"     (5x):  l o w </w>
15"lower"   (2x):  l o w e r </w>
16"newest"  (6x):  n e w es t </w>
17"widest"  (3x):  w i d es t </w>

Merge rule #1: (e, s) → es

Step 2: Count pairs again

πŸ“text
1Pair frequencies:
2(es, t): 6 + 3 = 9  ← MOST FREQUENT
3(l, o): 7
4(o, w): 7
5(t, </w>): 9 (tied)
6...
7
8Merge (es, t) β†’ "est"
9
10New vocabulary: {..., es, est}
11
12Updated representations:
13"low"     (5x):  l o w </w>
14"lower"   (2x):  l o w e r </w>
15"newest"  (6x):  n e w est </w>
16"widest"  (3x):  w i d est </w>

Merge rule #2: (es, t) → est

Step 3: Continue

πŸ“text
1Pair frequencies:
2(est, </w>): 9
3(l, o): 7
4(o, w): 7
5...
6
7Merge (est, </w>) β†’ "est</w>"
8
9Updated representations:
10"low"     (5x):  l o w </w>
11"lower"   (2x):  l o w e r </w>
12"newest"  (6x):  n e w est</w>
13"widest"  (3x):  w i d est</w>

Merge rule #3: (est, </w>) → est</w>

Final merge rules (first 6):

πŸ“text
11. (e, s) β†’ es
22. (es, t) β†’ est
33. (est, </w>) β†’ est</w>
44. (l, o) β†’ lo
55. (lo, w) β†’ low
66. (low, </w>) β†’ low</w>

2.4 Encoding with Learned Rules

Encoding Process

To encode a new word, apply merge rules in order:

Example: Encode "lowest"

πŸ“text
1Initial: l o w e s t </w>
2
3Apply rule 1 (e, s) β†’ es:
4         l o w es t </w>
5
6Apply rule 2 (es, t) β†’ est:
7         l o w est </w>
8
9Apply rule 3 (est, </w>) β†’ est</w>:
10         l o w est</w>
11
12Apply rule 4 (l, o) β†’ lo:
13         lo w est</w>
14
15Apply rule 5 (lo, w) β†’ low:
16         low est</w>
17
18Rule 6 doesn't match (low + </w>), skip
19
20Final tokens: ["low", "est</w>"]
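
The rule applications above can be reproduced with a short helper. A minimal sketch, assuming rules are (a, b) tuples in learned order; `encode_word` is an illustrative name:

```python
def encode_word(word, merge_rules):
    """Tokenize one word by applying BPE merge rules in learned order."""
    symbols = list(word) + ["</w>"]
    for a, b in merge_rules:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the pair in place
            else:
                i += 1
    return symbols

# The six rules learned in the training example above
rules = [("e", "s"), ("es", "t"), ("est", "</w>"),
         ("l", "o"), ("lo", "w"), ("low", "</w>")]
print(encode_word("lowest", rules))  # ['low', 'est</w>']
```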

Encoding Unknown Words

Example: Encode "newest" (seen in training)

πŸ“text
1Initial: n e w e s t </w>
2After rules: n e w est</w>
3Final: ["n", "e", "w", "est</w>"]

Example: Encode "newer" (NOT seen, but subwords are)

πŸ“text
1Initial: n e w e r </w>
2Rules don't create "newer", but we have learned subwords
3Final: ["n", "e", "w", "e", "r", "</w>"]

The word "newer" wasn't in training, but we can still encode it using known characters!


2.5 The End-of-Word Marker

Why </w>?

The end-of-word marker distinguishes:

  • "est" inside a word: "testing" → test + ing
  • "est" at the end of a word: "newest" → new + est</w>

Without it:

πŸ“text
1"newest" β†’ ["new", "est"]
2"testing" β†’ ["test", "ing"]
3
4Problem: How does decoder know where spaces go?
5"newest" vs "new est" both produce ["new", "est"]

With </w>:

πŸ“text
1"newest" β†’ ["new", "est</w>"]   ← ends with word-final token
2"new est" β†’ ["new</w>", "est</w>"]   ← both are word-final
3
4Now we know exactly where words end!
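
With the marker in place, decoding is a one-liner: concatenate the tokens, then turn each `</w>` into a space. A minimal sketch (`decode` is an illustrative name):

```python
def decode(tokens):
    """Join subword tokens; '</w>' marks a word end, so it becomes a space."""
    return "".join(tokens).replace("</w>", " ").strip()

print(decode(["new", "est</w>"]))       # newest
print(decode(["new</w>", "est</w>"]))   # new est
```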

Alternative Markers

Different implementations use different markers:

System           Marker                  Example
Original BPE     </w> (end of word)      low</w>
GPT-2            Ġ (space prefix)        Ġlow
SentencePiece    ▁ (underscore prefix)   ▁low
BERT WordPiece   ## (continuation)       low, ##est

GPT-2/SentencePiece approach (space as prefix):

πŸ“text
1"the lowest price"
2β†’ ["the", "Δ lowest", "Δ price"]  or  ["▁the", "▁low", "est", "▁price"]

2.6 Vocabulary Size Selection

Trade-offs

Smaller Vocabulary (8K-16K):

  • Longer sequences
  • More aggressive splitting
  • Better handling of rare words
  • Smaller embedding table
  • May split common words

Larger Vocabulary (32K-64K):

  • Shorter sequences
  • Common words stay intact
  • Larger embedding table
  • Some rare tokens poorly trained
  • Better for languages with large character sets

Practical Guidelines

Use Case              Recommended Size
Resource-limited      8K-16K
General English NLP   30K-50K
Machine Translation   32K-37K
Multilingual models   50K-100K
Code generation       50K+

For German-English Translation

We'll use ~32,000 subwords because:

  1. Standard for translation tasks
  2. Handles German compounds well
  3. Reasonable sequence lengths
  4. Matches benchmark configurations

2.7 BPE vs WordPiece vs Unigram

BPE (Byte-Pair Encoding)

Approach: Bottom-up, merge most frequent pairs

Selection criterion: Raw frequency count

Used by: GPT-2, GPT-3, RoBERTa, BART

πŸ“text
1Pros:
2- Simple and intuitive
3- Fast training
4- Deterministic merges
5
6Cons:
7- Greedy, may not find optimal segmentation
8- Order of merges matters

WordPiece

Approach: Similar to BPE, but different selection

Selection criterion: Likelihood improvement

Used by: BERT, DistilBERT, Electra

πŸ“text
1Score for merging (a, b) β†’ ab:
2  score = freq(ab) / (freq(a) Γ— freq(b))
3
4This prefers merges that:
5- Are frequent together (high freq(ab))
6- BUT components are rare alone (low freq(a), freq(b))

Example:

πŸ“text
1"un" + "like" vs "the" + "n"
2
3freq("unlike") = 100, freq("un") = 500, freq("like") = 800
4score = 100 / (500 Γ— 800) = 0.00025
5
6freq("then") = 50, freq("the") = 10000, freq("n") = 5000
7score = 50 / (10000 Γ— 5000) = 0.000001
8
9"unlike" wins! (higher score)
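
The scores above are easy to check in code. A quick illustration using the made-up frequencies from the example (`wordpiece_score` is an illustrative name):

```python
def wordpiece_score(freq_ab, freq_a, freq_b):
    """WordPiece merge score: high when the pair is frequent together
    relative to how common its parts are on their own."""
    return freq_ab / (freq_a * freq_b)

unlike = wordpiece_score(100, 500, 800)    # 0.00025
then = wordpiece_score(50, 10_000, 5_000)  # 0.000001
assert unlike > then  # "unlike" is the better merge under WordPiece
```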

Unigram Language Model

Approach: Top-down, start large and prune

Selection criterion: Keep tokens that minimize loss

Used by: T5, ALBERT, XLNet

πŸ“text
1Process:
21. Start with large vocab (all substrings up to length k)
32. Compute probability of corpus under unigram model
43. For each token, compute loss if removed
54. Remove tokens with lowest loss (least impact)
65. Repeat until desired vocab size

Comparison Table

Aspect                   BPE         WordPiece    Unigram
Direction                Bottom-up   Bottom-up    Top-down
Criterion                Frequency   Likelihood   Loss
Deterministic            Yes         Yes          No
Multiple segmentations   No          No           Yes
Training speed           Fast        Medium       Slower
Popular in               GPT, BART   BERT         T5, XLNet

2.8 Handling Special Cases

Numbers

πŸ“text
1"2023" can be tokenized as:
2- ["2", "0", "2", "3"]  (character-level)
3- ["20", "23"]          (subword)
4- ["2023"]              (if frequent enough)

Punctuation

πŸ“text
1"Hello, world!" β†’
2- With space: ["Hello", ",", " ", "world", "!"]
3- GPT-2 style: ["Hello", ",", "Δ world", "!"]

Unicode and Multilingual

Byte-level BPE (the GPT-2 variant) works on bytes rather than Unicode characters:

πŸ“text
1"CafΓ©" in bytes:
2- UTF-8: [67, 97, 102, 195, 169]
3- "Γ©" = bytes [195, 169]
4
5BPE might learn:
6- [195, 169] β†’ "Γ©" (if frequent)
7- Or keep as separate bytes (rare characters)
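
These byte values are easy to inspect in Python (a quick illustration, not part of the book's code):

```python
text = "Café"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # [67, 97, 102, 195, 169]: "é" is the two bytes 195, 169
```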

Summary

BPE Algorithm

  1. Initialize: Characters + end-of-word marker
  2. Iterate: Count pairs β†’ Merge most frequent β†’ Update
  3. Stop: When vocabulary reaches desired size
  4. Encode: Apply merges in learned order

Key Properties

Property        Value
Direction       Bottom-up
Selection       Frequency-based
Deterministic   Yes (given corpus)
Complexity      O(N × V) for N words, V vocab

Implementation Preview

```python
# What we'll implement in the next section
def train_bpe(corpus, vocab_size):
    """Learn BPE merge rules from corpus."""
    ...
    return merge_rules, vocabulary

def encode_bpe(text, merge_rules):
    """Encode text using learned merge rules."""
    ...
    return tokens

def decode_bpe(tokens):
    """Decode tokens back to text."""
    ...
    return text
```

Exercises

Algorithm Tracing

  1. Given corpus {"cat": 3, "car": 5, "card": 2}, trace the first 3 BPE merges.
  2. With merge rules: [(a, t) → at, (c, at) → cat], encode the word "catch".
  3. Why does merge order matter? Give an example where different orders produce different results.

Conceptual Questions

  1. Why does BPE use an end-of-word marker? What would happen without it?
  2. A vocabulary of 30K has both "play" and "playing" as tokens. How might this happen in BPE?
  3. Compare BPE with WordPiece: when would WordPiece merge differently than BPE?

Analysis

  1. For a corpus with 1M words and target vocab of 32K, approximately how many merge operations are needed?
  2. How does vocabulary size affect translation quality? What's the tradeoff?

Next Section Preview

In the next section, we'll implement BPE from scratch in Python. We'll build:

  • get_pair_frequencies(): Count all adjacent pairs
  • merge_pair(): Merge a pair throughout the vocabulary
  • train_bpe(): Full training loop
  • encode_bpe(): Apply learned rules to new text