Introduction
This section provides a complete implementation of BLEU from scratch. Understanding the internals helps you debug evaluation issues and interpret scores correctly.
N-gram Extraction
Building the Foundation
špython
1from collections import Counter
2from typing import List, Tuple, Dict, Optional, Union
3import math
4
5
6def extract_ngrams(tokens: List[str], n: int) -> Counter:
7 """
8 Extract n-grams from a list of tokens.
9
10 An n-gram is a contiguous sequence of n items.
11
12 Args:
13 tokens: List of tokens
14 n: N-gram order (1=unigrams, 2=bigrams, etc.)
15
16 Returns:
17 Counter of n-gram tuples to their counts
18
19 Example:
20 >>> tokens = ['the', 'cat', 'sat']
21 >>> extract_ngrams(tokens, 1)
22 Counter({('the',): 1, ('cat',): 1, ('sat',): 1})
23 >>> extract_ngrams(tokens, 2)
24 Counter({('the', 'cat'): 1, ('cat', 'sat'): 1})
25 """
26 ngrams = []
27
28 for i in range(len(tokens) - n + 1):
29 ngram = tuple(tokens[i:i + n])
30 ngrams.append(ngram)
31
32 return Counter(ngrams)
33
34
35def demonstrate_ngrams():
36 """
37 Demonstrate n-gram extraction.
38 """
39 print("N-gram Extraction")
40 print("=" * 60)
41
42 sentence = "the cat sat on the mat"
43 tokens = sentence.split()
44
45 print(f"Sentence: {sentence}")
46 print(f"Tokens: {tokens}")
47 print()
48
49 for n in range(1, 5):
50 ngrams = extract_ngrams(tokens, n)
51 print(f"{n}-grams ({len(ngrams)} unique):")
52 for ngram, count in ngrams.most_common():
53 print(f" {' '.join(ngram)}: {count}")
54 print()
55
56
57demonstrate_ngrams()Modified Precision
The Core of BLEU
štext
1Standard precision:
2 precision = matches / total_hypothesis_ngrams
3
4Problem: Gaming!
5 Reference: "The cat sat on the mat"
6 Hypothesis: "the the the the the the the"
7
8 Unigram precision = 7/7 = 100% !!!
9
10Modified precision (clipping):
11 - Count matches, but clip by reference count
12 - "the" appears 2x in reference, so max match = 2
13
14 Modified precision = 2/7 = 28.5%špython
1def count_ngram_matches(
2 hypothesis_ngrams: Counter,
3 reference_ngrams: Counter
4) -> int:
5 """
6 Count matching n-grams with clipping.
7
8 For each n-gram in hypothesis:
9 - Count how many times it appears in hypothesis
10 - Count how many times it appears in reference
11 - Match count = min(hypothesis_count, reference_count)
12
13 Args:
14 hypothesis_ngrams: Counter of hypothesis n-grams
15 reference_ngrams: Counter of reference n-grams
16
17 Returns:
18 Total clipped match count
19
20 Example:
21 >>> hyp = Counter({('the',): 7})
22 >>> ref = Counter({('the',): 2, ('cat',): 1})
23 >>> count_ngram_matches(hyp, ref)
24 2 # Clipped from 7 to 2
25 """
26 matches = 0
27
28 for ngram, hyp_count in hypothesis_ngrams.items():
29 ref_count = reference_ngrams.get(ngram, 0)
30 # Clip: can't match more than reference has
31 matches += min(hyp_count, ref_count)
32
33 return matches
34
35
36def modified_precision(
37 hypothesis: List[str],
38 references: List[List[str]],
39 n: int
40) -> Tuple[int, int]:
41 """
42 Compute modified n-gram precision.
43
44 Args:
45 hypothesis: Hypothesis tokens
46 references: List of reference token lists
47 n: N-gram order
48
49 Returns:
50 (matched_count, total_count)
51 """
52 # Get hypothesis n-grams
53 hyp_ngrams = extract_ngrams(hypothesis, n)
54 total = sum(hyp_ngrams.values())
55
56 if total == 0:
57 return (0, 0)
58
59 # Get reference n-grams (combine all references)
60 # For each n-gram, take maximum count across references
61 ref_ngram_max = Counter()
62
63 for ref in references:
64 ref_ngrams = extract_ngrams(ref, n)
65 for ngram, count in ref_ngrams.items():
66 ref_ngram_max[ngram] = max(ref_ngram_max[ngram], count)
67
68 # Count clipped matches
69 matches = count_ngram_matches(hyp_ngrams, ref_ngram_max)
70
71 return (matches, total)
72
73
74def demonstrate_modified_precision():
75 """
76 Demonstrate modified precision with clipping.
77 """
78 print("Modified Precision")
79 print("=" * 60)
80
81 # Example 1: Gaming attempt
82 print("Example 1: Gaming attempt")
83 print("-" * 40)
84
85 reference = "the cat sat on the mat".split()
86 hypothesis = "the the the the the the the".split()
87
88 print(f"Reference: {' '.join(reference)}")
89 print(f"Hypothesis: {' '.join(hypothesis)}")
90 print()
91
92 # Without clipping
93 print(f"Without clipping: 7 'the' match 7 times ā 7/7 = 100%")
94
95 # With clipping
96 matches, total = modified_precision(hypothesis, [reference], 1)
97 print(f"With clipping: 7 'the' clipped to 2 ā {matches}/{total} = {matches/total:.1%}")
98
99 # Example 2: Normal case
100 print("\nExample 2: Normal translation")
101 print("-" * 40)
102
103 reference = "the cat sat on the mat".split()
104 hypothesis = "the cat on the mat".split()
105
106 print(f"Reference: {' '.join(reference)}")
107 print(f"Hypothesis: {' '.join(hypothesis)}")
108 print()
109
110 for n in range(1, 4):
111 matches, total = modified_precision(hypothesis, [reference], n)
112 if total > 0:
113 print(f"{n}-gram precision: {matches}/{total} = {matches/total:.3f}")
114 else:
115 print(f"{n}-gram precision: 0/0 = undefined")
116
117
118demonstrate_modified_precision()Brevity Penalty
Preventing Short Translations
štext
1Problem without brevity penalty:
2 Reference: "The quick brown fox jumps over the lazy dog"
3 Hypothesis: "The"
4
5 1-gram precision = 1/1 = 100% !
6
7Brevity penalty (BP):
8 - If hypothesis >= reference length: BP = 1 (no penalty)
9 - If hypothesis < reference length: BP = exp(1 - r/c)
10
11 Where:
12 c = hypothesis length (candidate)
13 r = reference length (closest to c)špython
1def brevity_penalty(
2 hypothesis_length: int,
3 reference_length: int
4) -> float:
5 """
6 Compute brevity penalty.
7
8 Args:
9 hypothesis_length: Length of hypothesis
10 reference_length: Effective reference length
11
12 Returns:
13 Brevity penalty (0 to 1)
14 """
15 if hypothesis_length >= reference_length:
16 return 1.0
17
18 return math.exp(1 - reference_length / hypothesis_length)
19
20
21def closest_reference_length(
22 hypothesis_length: int,
23 reference_lengths: List[int]
24) -> int:
25 """
26 Find reference length closest to hypothesis length.
27
28 If multiple references have same distance, prefer shorter.
29
30 Args:
31 hypothesis_length: Length of hypothesis
32 reference_lengths: Lengths of all references
33
34 Returns:
35 Closest reference length
36 """
37 closest = None
38 closest_diff = float('inf')
39
40 for ref_len in reference_lengths:
41 diff = abs(ref_len - hypothesis_length)
42 if diff < closest_diff:
43 closest_diff = diff
44 closest = ref_len
45 elif diff == closest_diff and ref_len < closest:
46 closest = ref_len
47
48 return closest
49
50
51def demonstrate_brevity_penalty():
52 """
53 Demonstrate brevity penalty behavior.
54 """
55 print("Brevity Penalty")
56 print("=" * 60)
57
58 reference_length = 10
59
60 print(f"Reference length: {reference_length}")
61 print()
62 print(f"{'Hypothesis Length':<20} {'Ratio':<10} {'BP':<10}")
63 print("-" * 40)
64
65 for hyp_len in [2, 5, 8, 10, 12, 15, 20]:
66 bp = brevity_penalty(hyp_len, reference_length)
67 ratio = hyp_len / reference_length
68 print(f"{hyp_len:<20} {ratio:<10.2f} {bp:<10.4f}")
69
70 # Visualization
71 print("\nBP vs Hypothesis/Reference ratio:")
72 print("-" * 50)
73
74 for ratio in [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]:
75 hyp_len = int(ratio * reference_length)
76 bp = brevity_penalty(hyp_len, reference_length)
77 bar = "ā" * int(bp * 40)
78 print(f"{ratio:.1f}: {bp:.3f} |{bar}")
79
80
81demonstrate_brevity_penalty()Complete BLEU Implementation
Sentence-Level BLEU
špython
1def sentence_bleu(
2 hypothesis: List[str],
3 references: List[List[str]],
4 weights: Tuple[float, ...] = (0.25, 0.25, 0.25, 0.25),
5 smoothing: bool = True
6) -> float:
7 """
8 Compute BLEU score for a single sentence.
9
10 Note: Sentence-level BLEU can be unreliable for short sentences.
11 Corpus-level BLEU is preferred for evaluation.
12
13 Args:
14 hypothesis: Hypothesis tokens
15 references: List of reference token lists
16 weights: Weights for each n-gram order
17 smoothing: Apply smoothing for zero counts
18
19 Returns:
20 BLEU score (0 to 1)
21 """
22 # Compute precision for each n-gram order
23 precisions = []
24
25 for n in range(1, len(weights) + 1):
26 matches, total = modified_precision(hypothesis, references, n)
27
28 if total == 0:
29 precision = 0.0
30 elif matches == 0:
31 # Smoothing: add small value to avoid log(0)
32 if smoothing:
33 precision = 1.0 / (total + 1) # Add-one smoothing variant
34 else:
35 precision = 0.0
36 else:
37 precision = matches / total
38
39 precisions.append(precision)
40
41 # Check for zero precisions
42 if not smoothing and any(p == 0 for p in precisions):
43 return 0.0
44
45 # Compute weighted geometric mean of precisions
46 log_precision_sum = sum(
47 w * math.log(p) if p > 0 else float('-inf')
48 for w, p in zip(weights, precisions)
49 )
50
51 if log_precision_sum == float('-inf'):
52 return 0.0
53
54 # Brevity penalty
55 ref_lengths = [len(ref) for ref in references]
56 closest_ref_len = closest_reference_length(len(hypothesis), ref_lengths)
57 bp = brevity_penalty(len(hypothesis), closest_ref_len)
58
59 # Final BLEU
60 bleu = bp * math.exp(log_precision_sum)
61
62 return bleu
63
64
65def test_sentence_bleu():
66 """
67 Test sentence-level BLEU.
68 """
69 print("Sentence BLEU Test")
70 print("=" * 60)
71
72 test_cases = [
73 # Perfect match
74 {
75 'hypothesis': "the cat sat on the mat".split(),
76 'references': [["the", "cat", "sat", "on", "the", "mat"]],
77 'expected': 'high (perfect match)'
78 },
79 # Partial match
80 {
81 'hypothesis': "the cat on the mat".split(),
82 'references': [["the", "cat", "sat", "on", "the", "mat"]],
83 'expected': 'medium (missing word)'
84 },
85 # Different phrasing
86 {
87 'hypothesis': "a cat is sitting on a mat".split(),
88 'references': [["the", "cat", "sat", "on", "the", "mat"]],
89 'expected': 'low (different words)'
90 },
91 # Multiple references
92 {
93 'hypothesis': "a cat is sitting on a mat".split(),
94 'references': [
95 ["the", "cat", "sat", "on", "the", "mat"],
96 ["a", "cat", "is", "sitting", "on", "a", "mat"],
97 ],
98 'expected': 'high (matches ref 2)'
99 },
100 ]
101
102 for i, case in enumerate(test_cases, 1):
103 hyp = case['hypothesis']
104 refs = case['references']
105
106 score = sentence_bleu(hyp, refs)
107
108 print(f"Test {i}:")
109 print(f" Hypothesis: {' '.join(hyp)}")
110 for j, ref in enumerate(refs):
111 print(f" Reference {j+1}: {' '.join(ref)}")
112 print(f" Expected: {case['expected']}")
113 print(f" BLEU: {score:.4f} ({score*100:.2f}%)")
114 print()
115
116
117test_sentence_bleu()Corpus-Level BLEU
The Standard Evaluation Method
špython
1class BLEUScore:
2 """
3 Corpus-level BLEU score calculator.
4
5 Computes BLEU by aggregating statistics across all sentences,
6 then computing precision and brevity penalty on totals.
7
8 This is the standard way to compute BLEU for evaluation.
9
10 Args:
11 max_n: Maximum n-gram order (default: 4)
12 weights: Weights for each n-gram (default: uniform)
13
14 Example:
15 >>> bleu = BLEUScore()
16 >>> bleu.add("the cat sat", ["the cat sat on the mat"])
17 >>> bleu.add("a dog runs", ["the dog ran"])
18 >>> score = bleu.compute()
19 """
20
21 def __init__(
22 self,
23 max_n: int = 4,
24 weights: Optional[Tuple[float, ...]] = None
25 ):
26 self.max_n = max_n
27 self.weights = weights or tuple(1/max_n for _ in range(max_n))
28
29 # Statistics accumulators
30 self.reset()
31
32 def reset(self):
33 """Reset all accumulated statistics."""
34 # Per-n-gram statistics
35 self.matches = [0] * self.max_n
36 self.totals = [0] * self.max_n
37
38 # Length statistics
39 self.hypothesis_length = 0
40 self.reference_length = 0
41
42 # Sentence count
43 self.sentence_count = 0
44
45 def add(
46 self,
47 hypothesis: Union[str, List[str]],
48 references: Union[str, List[str], List[List[str]]]
49 ):
50 """
51 Add a sentence pair to the accumulator.
52
53 Args:
54 hypothesis: Hypothesis (string or tokens)
55 references: Reference(s) (string(s) or token list(s))
56 """
57 # Tokenize if needed
58 if isinstance(hypothesis, str):
59 hypothesis = hypothesis.split()
60
61 # Handle single reference
62 if isinstance(references, str):
63 references = [references.split()]
64 elif isinstance(references[0], str):
65 references = [ref.split() if isinstance(ref, str) else ref
66 for ref in references]
67
68 # Update n-gram statistics
69 for n in range(1, self.max_n + 1):
70 matches, total = modified_precision(hypothesis, references, n)
71 self.matches[n-1] += matches
72 self.totals[n-1] += total
73
74 # Update length statistics
75 self.hypothesis_length += len(hypothesis)
76
77 ref_lengths = [len(ref) for ref in references]
78 closest = closest_reference_length(len(hypothesis), ref_lengths)
79 self.reference_length += closest
80
81 self.sentence_count += 1
82
83 def compute(self) -> Dict[str, float]:
84 """
85 Compute corpus-level BLEU score.
86
87 Returns:
88 Dictionary with:
89 - bleu: Final BLEU score
90 - precisions: Per-n-gram precisions
91 - bp: Brevity penalty
92 - ratio: Hypothesis/reference length ratio
93 """
94 # Compute precisions
95 precisions = []
96 for i in range(self.max_n):
97 if self.totals[i] == 0:
98 precisions.append(0.0)
99 else:
100 precisions.append(self.matches[i] / self.totals[i])
101
102 # Check for zero precisions
103 if any(p == 0 for p in precisions):
104 return {
105 'bleu': 0.0,
106 'precisions': precisions,
107 'bp': 0.0,
108 'ratio': self.hypothesis_length / max(self.reference_length, 1),
109 'hypothesis_length': self.hypothesis_length,
110 'reference_length': self.reference_length,
111 }
112
113 # Compute geometric mean
114 log_precision = sum(
115 w * math.log(p)
116 for w, p in zip(self.weights, precisions)
117 )
118
119 # Brevity penalty
120 bp = brevity_penalty(self.hypothesis_length, self.reference_length)
121
122 # Final BLEU
123 bleu = bp * math.exp(log_precision)
124
125 return {
126 'bleu': bleu,
127 'precisions': precisions,
128 'bp': bp,
129 'ratio': self.hypothesis_length / max(self.reference_length, 1),
130 'hypothesis_length': self.hypothesis_length,
131 'reference_length': self.reference_length,
132 }
133
134 def score(self) -> float:
135 """Get just the BLEU score."""
136 return self.compute()['bleu']
137
138
139def test_corpus_bleu():
140 """
141 Test corpus-level BLEU.
142 """
143 print("Corpus BLEU Test")
144 print("=" * 60)
145
146 # Create scorer
147 bleu = BLEUScore()
148
149 # Add sentence pairs
150 pairs = [
151 ("the cat sat on the mat", "the cat sat on the mat"),
152 ("the dog runs quickly", "the dog ran fast"),
153 ("she is happy", "she seems happy"),
154 ("it is cold today", "today is cold"),
155 ]
156
157 print("Adding sentences:")
158 for hyp, ref in pairs:
159 print(f" H: {hyp}")
160 print(f" R: {ref}")
161 print()
162 bleu.add(hyp, [ref])
163
164 # Compute score
165 result = bleu.compute()
166
167 print("Results:")
168 print("-" * 40)
169 print(f"BLEU Score: {result['bleu']:.4f} ({result['bleu']*100:.2f}%)")
170 print(f"Brevity Penalty: {result['bp']:.4f}")
171 print(f"Length Ratio: {result['ratio']:.4f}")
172 print()
173 print("Precisions:")
174 for i, p in enumerate(result['precisions'], 1):
175 print(f" {i}-gram: {p:.4f}")
176
177 print("\n Corpus BLEU test passed!")
178
179
180test_corpus_bleu()Smoothing Methods
Handling Zero Counts
For sentence-level BLEU, we need smoothing to handle cases where n-grams don't match. Several smoothing methods exist:
- method0: No smoothing (corpus BLEU default)
- method1: Add epsilon to zero counts (NIST)
- method2: Add 1 to all counts (add-one)
- method3: BLEU+1 (add 1 to matched counts)
- method4: Exponential decay smoothing
SacreBLEU-Style Implementation
Reproducible BLEU
SacreBLEU is a standardized BLEU implementation that ensures reproducibility. Key features include:
- Standardized tokenization
- Lowercase option
- Proper handling of punctuation
- Version tracking
Based on: https://github.com/mjpost/sacrebleu
BLEU Signature
Ensuring Reproducibility
When reporting BLEU scores, always include a signature that specifies:
štext
1BLEU SIGNATURE FORMAT:
2āāāāāāāāāāāāāāāāāāāāā
3
4BLEU+case.{lc|mixed}+numrefs.N+smooth.{none|exp|floor}+tok.{13a|intl|none}+version.X.Y.Z
5
6Examples:
7 BLEU+case.lc+numrefs.4+smooth.exp+tok.13a+v.1.4.14
8 BLEU+case.mixed+numrefs.1+smooth.none+tok.intl+v.1.0.0Proper BLEU Reporting
štext
1ALWAYS report:
21. BLEU score (with decimal precision)
32. Tokenization method
43. Number of references
54. Whether lowercased
65. Test set name
76. Implementation/library
8
9Good example:
10 "We achieve 35.2 BLEU on WMT14 EN-DE test set
11 (sacreBLEU, tokenized, 1 reference, cased)"
12
13 Signature: BLEU+case.mixed+numrefs.1+tok.13a+v.1.4.14
14
15Bad example:
16 "Our model achieves 35.2 BLEU"
17 (Not reproducible! Missing details)Summary
BLEU Components
| Component | Purpose | Formula |
|---|---|---|
| Modified Precision | Match n-grams with clipping | min(count_hyp, count_ref) |
| Brevity Penalty | Penalize short translations | exp(1 - r/c) if c < r |
| Geometric Mean | Combine precisions | exp(Ī£ wā log pā) |
Key Implementations
| Class | Use Case |
|---|---|
| BLEUScore | Corpus-level evaluation (standard) |
| SmoothedBLEU | Sentence-level with smoothing |
| SacreBLEU | Reproducible evaluation |
Best Practices
- Use corpus-level BLEU for evaluation (more stable)
- Report tokenization method for reproducibility
- Include BLEU signature in papers
- Use multiple references when available
Next Steps
In the next section, we'll cover Other Evaluation Metrics including ChrF, METEOR concepts, and when to use each metric.