Chapter 11
15 min read
Section 56 of 75

BLEU Score Implementation

Evaluation Metrics

Introduction

This section provides a complete implementation of BLEU from scratch. Understanding the internals helps you debug evaluation issues and interpret scores correctly.


N-gram Extraction

Building the Foundation

šŸpython
1from collections import Counter
2from typing import List, Tuple, Dict, Optional, Union
3import math
4
5
6def extract_ngrams(tokens: List[str], n: int) -> Counter:
7    """
8    Extract n-grams from a list of tokens.
9
10    An n-gram is a contiguous sequence of n items.
11
12    Args:
13        tokens: List of tokens
14        n: N-gram order (1=unigrams, 2=bigrams, etc.)
15
16    Returns:
17        Counter of n-gram tuples to their counts
18
19    Example:
20        >>> tokens = ['the', 'cat', 'sat']
21        >>> extract_ngrams(tokens, 1)
22        Counter({('the',): 1, ('cat',): 1, ('sat',): 1})
23        >>> extract_ngrams(tokens, 2)
24        Counter({('the', 'cat'): 1, ('cat', 'sat'): 1})
25    """
26    ngrams = []
27
28    for i in range(len(tokens) - n + 1):
29        ngram = tuple(tokens[i:i + n])
30        ngrams.append(ngram)
31
32    return Counter(ngrams)
33
34
35def demonstrate_ngrams():
36    """
37    Demonstrate n-gram extraction.
38    """
39    print("N-gram Extraction")
40    print("=" * 60)
41
42    sentence = "the cat sat on the mat"
43    tokens = sentence.split()
44
45    print(f"Sentence: {sentence}")
46    print(f"Tokens: {tokens}")
47    print()
48
49    for n in range(1, 5):
50        ngrams = extract_ngrams(tokens, n)
51        print(f"{n}-grams ({len(ngrams)} unique):")
52        for ngram, count in ngrams.most_common():
53            print(f"  {' '.join(ngram)}: {count}")
54        print()
55
56
57demonstrate_ngrams()

Modified Precision

The Core of BLEU

šŸ“text
1Standard precision:
2  precision = matches / total_hypothesis_ngrams
3
4Problem: Gaming!
5  Reference: "The cat sat on the mat"
6  Hypothesis: "the the the the the the the"
7
8  Unigram precision = 7/7 = 100% !!!
9
10Modified precision (clipping):
11  - Count matches, but clip by reference count
12  - "the" appears 2x in reference, so max match = 2
13
14  Modified precision = 2/7 = 28.5%
šŸpython
1def count_ngram_matches(
2    hypothesis_ngrams: Counter,
3    reference_ngrams: Counter
4) -> int:
5    """
6    Count matching n-grams with clipping.
7
8    For each n-gram in hypothesis:
9      - Count how many times it appears in hypothesis
10      - Count how many times it appears in reference
11      - Match count = min(hypothesis_count, reference_count)
12
13    Args:
14        hypothesis_ngrams: Counter of hypothesis n-grams
15        reference_ngrams: Counter of reference n-grams
16
17    Returns:
18        Total clipped match count
19
20    Example:
21        >>> hyp = Counter({('the',): 7})
22        >>> ref = Counter({('the',): 2, ('cat',): 1})
23        >>> count_ngram_matches(hyp, ref)
24        2  # Clipped from 7 to 2
25    """
26    matches = 0
27
28    for ngram, hyp_count in hypothesis_ngrams.items():
29        ref_count = reference_ngrams.get(ngram, 0)
30        # Clip: can't match more than reference has
31        matches += min(hyp_count, ref_count)
32
33    return matches
34
35
36def modified_precision(
37    hypothesis: List[str],
38    references: List[List[str]],
39    n: int
40) -> Tuple[int, int]:
41    """
42    Compute modified n-gram precision.
43
44    Args:
45        hypothesis: Hypothesis tokens
46        references: List of reference token lists
47        n: N-gram order
48
49    Returns:
50        (matched_count, total_count)
51    """
52    # Get hypothesis n-grams
53    hyp_ngrams = extract_ngrams(hypothesis, n)
54    total = sum(hyp_ngrams.values())
55
56    if total == 0:
57        return (0, 0)
58
59    # Get reference n-grams (combine all references)
60    # For each n-gram, take maximum count across references
61    ref_ngram_max = Counter()
62
63    for ref in references:
64        ref_ngrams = extract_ngrams(ref, n)
65        for ngram, count in ref_ngrams.items():
66            ref_ngram_max[ngram] = max(ref_ngram_max[ngram], count)
67
68    # Count clipped matches
69    matches = count_ngram_matches(hyp_ngrams, ref_ngram_max)
70
71    return (matches, total)
72
73
74def demonstrate_modified_precision():
75    """
76    Demonstrate modified precision with clipping.
77    """
78    print("Modified Precision")
79    print("=" * 60)
80
81    # Example 1: Gaming attempt
82    print("Example 1: Gaming attempt")
83    print("-" * 40)
84
85    reference = "the cat sat on the mat".split()
86    hypothesis = "the the the the the the the".split()
87
88    print(f"Reference:  {' '.join(reference)}")
89    print(f"Hypothesis: {' '.join(hypothesis)}")
90    print()
91
92    # Without clipping
93    print(f"Without clipping: 7 'the' match 7 times → 7/7 = 100%")
94
95    # With clipping
96    matches, total = modified_precision(hypothesis, [reference], 1)
97    print(f"With clipping: 7 'the' clipped to 2 → {matches}/{total} = {matches/total:.1%}")
98
99    # Example 2: Normal case
100    print("\nExample 2: Normal translation")
101    print("-" * 40)
102
103    reference = "the cat sat on the mat".split()
104    hypothesis = "the cat on the mat".split()
105
106    print(f"Reference:  {' '.join(reference)}")
107    print(f"Hypothesis: {' '.join(hypothesis)}")
108    print()
109
110    for n in range(1, 4):
111        matches, total = modified_precision(hypothesis, [reference], n)
112        if total > 0:
113            print(f"{n}-gram precision: {matches}/{total} = {matches/total:.3f}")
114        else:
115            print(f"{n}-gram precision: 0/0 = undefined")
116
117
118demonstrate_modified_precision()

Brevity Penalty

Preventing Short Translations

šŸ“text
1Problem without brevity penalty:
2  Reference: "The quick brown fox jumps over the lazy dog"
3  Hypothesis: "The"
4
5  1-gram precision = 1/1 = 100% !
6
7Brevity penalty (BP):
8  - If hypothesis >= reference length: BP = 1 (no penalty)
9  - If hypothesis < reference length: BP = exp(1 - r/c)
10
11  Where:
12    c = hypothesis length (candidate)
13    r = reference length (closest to c)
šŸpython
1def brevity_penalty(
2    hypothesis_length: int,
3    reference_length: int
4) -> float:
5    """
6    Compute brevity penalty.
7
8    Args:
9        hypothesis_length: Length of hypothesis
10        reference_length: Effective reference length
11
12    Returns:
13        Brevity penalty (0 to 1)
14    """
15    if hypothesis_length >= reference_length:
16        return 1.0
17
18    return math.exp(1 - reference_length / hypothesis_length)
19
20
21def closest_reference_length(
22    hypothesis_length: int,
23    reference_lengths: List[int]
24) -> int:
25    """
26    Find reference length closest to hypothesis length.
27
28    If multiple references have same distance, prefer shorter.
29
30    Args:
31        hypothesis_length: Length of hypothesis
32        reference_lengths: Lengths of all references
33
34    Returns:
35        Closest reference length
36    """
37    closest = None
38    closest_diff = float('inf')
39
40    for ref_len in reference_lengths:
41        diff = abs(ref_len - hypothesis_length)
42        if diff < closest_diff:
43            closest_diff = diff
44            closest = ref_len
45        elif diff == closest_diff and ref_len < closest:
46            closest = ref_len
47
48    return closest
49
50
51def demonstrate_brevity_penalty():
52    """
53    Demonstrate brevity penalty behavior.
54    """
55    print("Brevity Penalty")
56    print("=" * 60)
57
58    reference_length = 10
59
60    print(f"Reference length: {reference_length}")
61    print()
62    print(f"{'Hypothesis Length':<20} {'Ratio':<10} {'BP':<10}")
63    print("-" * 40)
64
65    for hyp_len in [2, 5, 8, 10, 12, 15, 20]:
66        bp = brevity_penalty(hyp_len, reference_length)
67        ratio = hyp_len / reference_length
68        print(f"{hyp_len:<20} {ratio:<10.2f} {bp:<10.4f}")
69
70    # Visualization
71    print("\nBP vs Hypothesis/Reference ratio:")
72    print("-" * 50)
73
74    for ratio in [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]:
75        hyp_len = int(ratio * reference_length)
76        bp = brevity_penalty(hyp_len, reference_length)
77        bar = "ā–ˆ" * int(bp * 40)
78        print(f"{ratio:.1f}: {bp:.3f} |{bar}")
79
80
81demonstrate_brevity_penalty()

Complete BLEU Implementation

Sentence-Level BLEU

šŸpython
1def sentence_bleu(
2    hypothesis: List[str],
3    references: List[List[str]],
4    weights: Tuple[float, ...] = (0.25, 0.25, 0.25, 0.25),
5    smoothing: bool = True
6) -> float:
7    """
8    Compute BLEU score for a single sentence.
9
10    Note: Sentence-level BLEU can be unreliable for short sentences.
11    Corpus-level BLEU is preferred for evaluation.
12
13    Args:
14        hypothesis: Hypothesis tokens
15        references: List of reference token lists
16        weights: Weights for each n-gram order
17        smoothing: Apply smoothing for zero counts
18
19    Returns:
20        BLEU score (0 to 1)
21    """
22    # Compute precision for each n-gram order
23    precisions = []
24
25    for n in range(1, len(weights) + 1):
26        matches, total = modified_precision(hypothesis, references, n)
27
28        if total == 0:
29            precision = 0.0
30        elif matches == 0:
31            # Smoothing: add small value to avoid log(0)
32            if smoothing:
33                precision = 1.0 / (total + 1)  # Add-one smoothing variant
34            else:
35                precision = 0.0
36        else:
37            precision = matches / total
38
39        precisions.append(precision)
40
41    # Check for zero precisions
42    if not smoothing and any(p == 0 for p in precisions):
43        return 0.0
44
45    # Compute weighted geometric mean of precisions
46    log_precision_sum = sum(
47        w * math.log(p) if p > 0 else float('-inf')
48        for w, p in zip(weights, precisions)
49    )
50
51    if log_precision_sum == float('-inf'):
52        return 0.0
53
54    # Brevity penalty
55    ref_lengths = [len(ref) for ref in references]
56    closest_ref_len = closest_reference_length(len(hypothesis), ref_lengths)
57    bp = brevity_penalty(len(hypothesis), closest_ref_len)
58
59    # Final BLEU
60    bleu = bp * math.exp(log_precision_sum)
61
62    return bleu
63
64
65def test_sentence_bleu():
66    """
67    Test sentence-level BLEU.
68    """
69    print("Sentence BLEU Test")
70    print("=" * 60)
71
72    test_cases = [
73        # Perfect match
74        {
75            'hypothesis': "the cat sat on the mat".split(),
76            'references': [["the", "cat", "sat", "on", "the", "mat"]],
77            'expected': 'high (perfect match)'
78        },
79        # Partial match
80        {
81            'hypothesis': "the cat on the mat".split(),
82            'references': [["the", "cat", "sat", "on", "the", "mat"]],
83            'expected': 'medium (missing word)'
84        },
85        # Different phrasing
86        {
87            'hypothesis': "a cat is sitting on a mat".split(),
88            'references': [["the", "cat", "sat", "on", "the", "mat"]],
89            'expected': 'low (different words)'
90        },
91        # Multiple references
92        {
93            'hypothesis': "a cat is sitting on a mat".split(),
94            'references': [
95                ["the", "cat", "sat", "on", "the", "mat"],
96                ["a", "cat", "is", "sitting", "on", "a", "mat"],
97            ],
98            'expected': 'high (matches ref 2)'
99        },
100    ]
101
102    for i, case in enumerate(test_cases, 1):
103        hyp = case['hypothesis']
104        refs = case['references']
105
106        score = sentence_bleu(hyp, refs)
107
108        print(f"Test {i}:")
109        print(f"  Hypothesis: {' '.join(hyp)}")
110        for j, ref in enumerate(refs):
111            print(f"  Reference {j+1}: {' '.join(ref)}")
112        print(f"  Expected: {case['expected']}")
113        print(f"  BLEU: {score:.4f} ({score*100:.2f}%)")
114        print()
115
116
117test_sentence_bleu()

Corpus-Level BLEU

The Standard Evaluation Method

šŸpython
1class BLEUScore:
2    """
3    Corpus-level BLEU score calculator.
4
5    Computes BLEU by aggregating statistics across all sentences,
6    then computing precision and brevity penalty on totals.
7
8    This is the standard way to compute BLEU for evaluation.
9
10    Args:
11        max_n: Maximum n-gram order (default: 4)
12        weights: Weights for each n-gram (default: uniform)
13
14    Example:
15        >>> bleu = BLEUScore()
16        >>> bleu.add("the cat sat", ["the cat sat on the mat"])
17        >>> bleu.add("a dog runs", ["the dog ran"])
18        >>> score = bleu.compute()
19    """
20
21    def __init__(
22        self,
23        max_n: int = 4,
24        weights: Optional[Tuple[float, ...]] = None
25    ):
26        self.max_n = max_n
27        self.weights = weights or tuple(1/max_n for _ in range(max_n))
28
29        # Statistics accumulators
30        self.reset()
31
32    def reset(self):
33        """Reset all accumulated statistics."""
34        # Per-n-gram statistics
35        self.matches = [0] * self.max_n
36        self.totals = [0] * self.max_n
37
38        # Length statistics
39        self.hypothesis_length = 0
40        self.reference_length = 0
41
42        # Sentence count
43        self.sentence_count = 0
44
45    def add(
46        self,
47        hypothesis: Union[str, List[str]],
48        references: Union[str, List[str], List[List[str]]]
49    ):
50        """
51        Add a sentence pair to the accumulator.
52
53        Args:
54            hypothesis: Hypothesis (string or tokens)
55            references: Reference(s) (string(s) or token list(s))
56        """
57        # Tokenize if needed
58        if isinstance(hypothesis, str):
59            hypothesis = hypothesis.split()
60
61        # Handle single reference
62        if isinstance(references, str):
63            references = [references.split()]
64        elif isinstance(references[0], str):
65            references = [ref.split() if isinstance(ref, str) else ref
66                         for ref in references]
67
68        # Update n-gram statistics
69        for n in range(1, self.max_n + 1):
70            matches, total = modified_precision(hypothesis, references, n)
71            self.matches[n-1] += matches
72            self.totals[n-1] += total
73
74        # Update length statistics
75        self.hypothesis_length += len(hypothesis)
76
77        ref_lengths = [len(ref) for ref in references]
78        closest = closest_reference_length(len(hypothesis), ref_lengths)
79        self.reference_length += closest
80
81        self.sentence_count += 1
82
83    def compute(self) -> Dict[str, float]:
84        """
85        Compute corpus-level BLEU score.
86
87        Returns:
88            Dictionary with:
89                - bleu: Final BLEU score
90                - precisions: Per-n-gram precisions
91                - bp: Brevity penalty
92                - ratio: Hypothesis/reference length ratio
93        """
94        # Compute precisions
95        precisions = []
96        for i in range(self.max_n):
97            if self.totals[i] == 0:
98                precisions.append(0.0)
99            else:
100                precisions.append(self.matches[i] / self.totals[i])
101
102        # Check for zero precisions
103        if any(p == 0 for p in precisions):
104            return {
105                'bleu': 0.0,
106                'precisions': precisions,
107                'bp': 0.0,
108                'ratio': self.hypothesis_length / max(self.reference_length, 1),
109                'hypothesis_length': self.hypothesis_length,
110                'reference_length': self.reference_length,
111            }
112
113        # Compute geometric mean
114        log_precision = sum(
115            w * math.log(p)
116            for w, p in zip(self.weights, precisions)
117        )
118
119        # Brevity penalty
120        bp = brevity_penalty(self.hypothesis_length, self.reference_length)
121
122        # Final BLEU
123        bleu = bp * math.exp(log_precision)
124
125        return {
126            'bleu': bleu,
127            'precisions': precisions,
128            'bp': bp,
129            'ratio': self.hypothesis_length / max(self.reference_length, 1),
130            'hypothesis_length': self.hypothesis_length,
131            'reference_length': self.reference_length,
132        }
133
134    def score(self) -> float:
135        """Get just the BLEU score."""
136        return self.compute()['bleu']
137
138
139def test_corpus_bleu():
140    """
141    Test corpus-level BLEU.
142    """
143    print("Corpus BLEU Test")
144    print("=" * 60)
145
146    # Create scorer
147    bleu = BLEUScore()
148
149    # Add sentence pairs
150    pairs = [
151        ("the cat sat on the mat", "the cat sat on the mat"),
152        ("the dog runs quickly", "the dog ran fast"),
153        ("she is happy", "she seems happy"),
154        ("it is cold today", "today is cold"),
155    ]
156
157    print("Adding sentences:")
158    for hyp, ref in pairs:
159        print(f"  H: {hyp}")
160        print(f"  R: {ref}")
161        print()
162        bleu.add(hyp, [ref])
163
164    # Compute score
165    result = bleu.compute()
166
167    print("Results:")
168    print("-" * 40)
169    print(f"BLEU Score: {result['bleu']:.4f} ({result['bleu']*100:.2f}%)")
170    print(f"Brevity Penalty: {result['bp']:.4f}")
171    print(f"Length Ratio: {result['ratio']:.4f}")
172    print()
173    print("Precisions:")
174    for i, p in enumerate(result['precisions'], 1):
175        print(f"  {i}-gram: {p:.4f}")
176
177    print("\n Corpus BLEU test passed!")
178
179
180test_corpus_bleu()

Smoothing Methods

Handling Zero Counts

For sentence-level BLEU, we need smoothing to handle cases where n-grams don't match. Several smoothing methods exist:

  • method0: No smoothing (corpus BLEU default)
  • method1: Add epsilon to zero counts (NIST)
  • method2: Add 1 to all counts (add-one)
  • method3: BLEU+1 (add 1 to matched counts)
  • method4: Exponential decay smoothing

SacreBLEU-Style Implementation

Reproducible BLEU

SacreBLEU is a standardized BLEU implementation that ensures reproducibility. Key features include:

  • Standardized tokenization
  • Lowercase option
  • Proper handling of punctuation
  • Version tracking

BLEU Signature

Ensuring Reproducibility

When reporting BLEU scores, always include a signature that specifies:

šŸ“text
1BLEU SIGNATURE FORMAT:
2─────────────────────
3
4BLEU+case.{lc|mixed}+numrefs.N+smooth.{none|exp|floor}+tok.{13a|intl|none}+version.X.Y.Z
5
6Examples:
7  BLEU+case.lc+numrefs.4+smooth.exp+tok.13a+v.1.4.14
8  BLEU+case.mixed+numrefs.1+smooth.none+tok.intl+v.1.0.0

Proper BLEU Reporting

šŸ“text
1ALWAYS report:
21. BLEU score (with decimal precision)
32. Tokenization method
43. Number of references
54. Whether lowercased
65. Test set name
76. Implementation/library
8
9Good example:
10  "We achieve 35.2 BLEU on WMT14 EN-DE test set
11   (sacreBLEU, tokenized, 1 reference, cased)"
12
13  Signature: BLEU+case.mixed+numrefs.1+tok.13a+v.1.4.14
14
15Bad example:
16  "Our model achieves 35.2 BLEU"
17  (Not reproducible! Missing details)

Summary

BLEU Components

ComponentPurposeFormula
Modified PrecisionMatch n-grams with clippingmin(count_hyp, count_ref)
Brevity PenaltyPenalize short translationsexp(1 - r/c) if c < r
Geometric MeanCombine precisionsexp(Ī£ wā‚™ log pā‚™)

Key Implementations

ClassUse Case
BLEUScoreCorpus-level evaluation (standard)
SmoothedBLEUSentence-level with smoothing
SacreBLEUReproducible evaluation

Best Practices

  1. Use corpus-level BLEU for evaluation (more stable)
  2. Report tokenization method for reproducibility
  3. Include BLEU signature in papers
  4. Use multiple references when available

Next Steps

In the next section, we'll cover Other Evaluation Metrics including ChrF, METEOR concepts, and when to use each metric.

Loading comments...