Introduction
Evaluating machine translation quality is challenging because language is inherently ambiguous: a single sentence usually has many valid translations. This chapter covers metrics for measuring translation quality, with a focus on BLEU, the most widely used metric.
The Challenge of MT Evaluation
Why Evaluation is Hard
```text
Original (German): "Der Hund läuft schnell."

Valid translations:
  "The dog runs quickly."        ← Reference 1
  "The dog runs fast."           ← Reference 2
  "The dog is running quickly."  ← Reference 3
  "A dog runs rapidly."          ← Also valid!

Invalid translations:
  "The cat runs quickly."        ← Wrong meaning
  "Dog the runs quickly."        ← Grammatically wrong
  "The dog."                     ← Incomplete

The challenge: How do we automatically score translations
when there are multiple valid answers?
```
Human vs Automatic Evaluation
```text
HUMAN EVALUATION:
─────────────────
Pros:
  - Gold standard for quality
  - Can judge fluency and adequacy
  - Understands context and nuance

Cons:
  - Expensive ($$$)
  - Slow (hours/days)
  - Not reproducible (annotator variance)
  - Can't use for development/training


AUTOMATIC EVALUATION:
─────────────────────
Pros:
  - Free and instant
  - Reproducible
  - Can evaluate millions of sentences
  - Enables rapid development

Cons:
  - Imperfect correlation with human judgment
  - Can be gamed
  - Misses some quality aspects
```
Categories of Metrics
Overview
```python
def metrics_overview():
    """
    Overview of translation evaluation metrics.
    """
    print("Translation Evaluation Metrics")
    print("=" * 60)

    print("""
    REFERENCE-BASED METRICS:
    ────────────────────────
    Compare hypothesis to reference translation(s).

    N-gram Overlap:
    • BLEU   - Precision of n-grams (1-4)
    • NIST   - BLEU variant with information weighting
    • ROUGE  - Recall-oriented (for summarization)
    • ChrF   - Character n-gram F-score

    Word Alignment:
    • METEOR - Alignment + paraphrase matching
    • TER    - Translation Edit Rate (min edits)

    Embedding-Based:
    • BERTScore - Contextual embedding similarity
    • COMET     - Neural learned metric


    REFERENCE-FREE METRICS:
    ───────────────────────
    Quality estimation without references.

    • COMET-QE - Quality estimation model
    • XMover   - Cross-lingual similarity

    Useful when references are unavailable.


    HUMAN EVALUATION:
    ─────────────────
    Gold standard, multiple dimensions:

    • Adequacy - Is the meaning preserved?
    • Fluency  - Is it grammatically correct?
    • MQM      - Multidimensional Quality Metrics
    • DA       - Direct Assessment (rating scale)
    """)


metrics_overview()
```
What is BLEU?
Bilingual Evaluation Understudy
```text
BLEU = Bilingual Evaluation Understudy (Papineni et al., 2002)

Core idea: Good translations share many n-grams with references.

Simple example:

Reference:  "The cat sat on the mat"
Hypothesis: "The cat sat on the mat"
→ Perfect match! BLEU = 1.0 (100%)

Reference:  "The cat sat on the mat"
Hypothesis: "A cat is sitting on a mat"
→ Only 3 of 7 unigrams match (cat, on, mat) and no bigrams match,
  so unsmoothed BLEU = 0; smoothed variants give a small positive score.

Reference:  "The cat sat on the mat"
Hypothesis: "Dog runs in park"
→ No n-grams match. BLEU = 0.0
```
BLEU Components
```text
BLEU consists of:

1. MODIFIED PRECISION (for n-grams 1,2,3,4):
   - Count matching n-grams
   - Clip by reference count (prevents gaming)

2. BREVITY PENALTY:
   - Penalize translations that are too short
   - Without this, "the" could score 100% on 1-grams

3. GEOMETRIC MEAN:
   - Combine precisions with geometric mean
   - One zero precision = zero BLEU


Formula:
  BLEU = BP × exp(Σ wₙ × log(pₙ))

Where:
  BP = brevity penalty
  wₙ = weight for n-gram (usually 1/4 each)
  pₙ = modified precision for n-grams of length n
```
Quick BLEU Example
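As a quick check of the first two components, here is a minimal sketch of why clipping and the brevity penalty are needed. The helper names `unigram_precision` and `brevity_penalty` are illustrative (not from any library), and the sketch assumes a single reference and a non-empty hypothesis.

```python
import math
from collections import Counter


def unigram_precision(hypothesis, reference, clip=True):
    """Unigram precision of a hypothesis against a single reference."""
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    if clip:
        # Modified precision: each word is credited at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    else:
        # Naive precision: every occurrence of a reference word counts.
        matches = sum(c for w, c in hyp_counts.items() if w in ref_counts)
    return matches / len(hypothesis)


def brevity_penalty(hyp_len, ref_len):
    """1.0 if the hypothesis is at least as long as the reference, else exp(1 - r/c)."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


reference = "the cat is on the mat".split()

# Gaming attempt 1: repeat a frequent word.
spam = "the the the the the the the".split()
print(unigram_precision(spam, reference, clip=False))  # naive: 7/7
print(unigram_precision(spam, reference, clip=True))   # clipped: 2/7

# Gaming attempt 2: output a single safe word.
short = ["the"]
print(unigram_precision(short, reference))          # precision alone is perfect
print(brevity_penalty(len(short), len(reference)))  # exp(1 - 6/1), a heavy penalty
```

Clipping caps "the" at its two occurrences in the reference, and the brevity penalty shrinks the one-word output's score even though its precision is perfect.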
Step-by-Step Calculation
```python
from collections import Counter
from typing import List
import math


def get_ngrams(tokens: List[str], n: int) -> Counter:
    """Extract n-grams from token list."""
    ngrams = []
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        ngrams.append(ngram)
    return Counter(ngrams)


def simple_bleu_example():
    """
    Walk through BLEU calculation step by step.
    """
    print("Simple BLEU Example")
    print("=" * 60)

    reference = "the cat sat on the mat".split()
    hypothesis = "the cat on the mat".split()

    print(f"Reference:  {' '.join(reference)}")
    print(f"Hypothesis: {' '.join(hypothesis)}")
    print()

    # Step 1: Count n-grams and compute clipped precisions
    print("STEP 1: Count n-grams")
    print("-" * 40)

    precisions = []
    for n in range(1, 5):
        # Get n-grams
        ref_ngrams = get_ngrams(reference, n)
        hyp_ngrams = get_ngrams(hypothesis, n)

        print(f"\n{n}-grams:")
        print(f"  Reference:  {dict(ref_ngrams)}")
        print(f"  Hypothesis: {dict(hyp_ngrams)}")

        # Count matches (clipped by the reference count)
        matches = 0
        total = sum(hyp_ngrams.values())

        for ngram, count in hyp_ngrams.items():
            matches += min(count, ref_ngrams.get(ngram, 0))

        precision = matches / total if total > 0 else 0
        precisions.append(precision)
        print(f"  Matches: {matches}/{total} = {precision:.3f}")

    # Step 2: Brevity penalty
    print("\n" + "-" * 40)
    print("STEP 2: Brevity Penalty")
    print("-" * 40)

    ref_len = len(reference)
    hyp_len = len(hypothesis)

    print(f"Reference length:  {ref_len}")
    print(f"Hypothesis length: {hyp_len}")

    if hyp_len >= ref_len:
        bp = 1.0
        print("Hypothesis >= reference, BP = 1.0")
    else:
        bp = math.exp(1 - ref_len / hyp_len)
        print(f"BP = exp(1 - {ref_len}/{hyp_len}) = {bp:.4f}")

    # Step 3: Final BLEU
    print("\n" + "-" * 40)
    print("STEP 3: Final BLEU Score")
    print("-" * 40)

    # Use the precisions computed in Step 1: [5/5, 3/4, 1/3, 0/2].
    # Zeros are replaced with a tiny value to avoid log(0); this is a crude
    # stand-in for real smoothing and pushes the score close to zero.
    precisions = [p if p > 0 else 1e-10 for p in precisions]

    log_precision = sum(0.25 * math.log(p) for p in precisions)
    bleu = bp * math.exp(log_precision)

    print(f"Precisions: {[f'{p:.3f}' for p in precisions]}")
    print(f"BLEU = {bp:.4f} × exp({log_precision:.4f})")
    print(f"BLEU = {bleu:.4f} ({bleu*100:.2f}%)")


simple_bleu_example()
```
Interpreting BLEU Scores
Score Guidelines
| Score Range | Quality Level |
|---|---|
| < 10 | Almost useless |
| 10-19 | Hard to understand, some words correct |
| 20-29 | Basic meaning, significant errors |
| 30-39 | Understandable, some fluency issues |
| 40-49 | Good quality, minor errors |
| 50-59 | Very good, near human quality |
| 60+ | Excellent (rare without data leakage!) |
Note: Scores here are on a 0-100 scale (the 0-1 BLEU value multiplied by 100). These are rough guidelines for news translation; different domains and language pairs vary significantly!
Context Matters
```text
WMT (News translation):
  - State-of-the-art DE→EN: ~40-45 BLEU
  - State-of-the-art EN→DE: ~35-40 BLEU

Multi30k (Simple descriptions):
  - Good model: ~35-40 BLEU
  - Target for our course: 30-35 BLEU

Conversational:
  - Much harder, lower BLEU expected
  - 20-25 can be reasonable
```
Important Caveats
- BLEU doesn't measure: Factual accuracy (can hallucinate facts), semantic preservation (paraphrases score low), or fluency directly
- Higher isn't always better: Can overfit to reference style, may sacrifice diversity, human preference can differ
- Compare carefully: Same tokenization (!), same number of references, same test set, same BLEU implementation
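To illustrate the "same tokenization" caveat, the sketch below scores one hypothesis/reference pair under two tokenizations. Both tokenizers are toy functions written for this example (`split_contractions` only imitates the way some tokenizers detach "n't"), and the score is clipped unigram precision only, not full BLEU.

```python
from collections import Counter


def clipped_unigram_precision(hyp_tokens, ref_tokens):
    """Clipped unigram precision against a single reference."""
    hyp, ref = Counter(hyp_tokens), Counter(ref_tokens)
    matches = sum(min(count, ref[tok]) for tok, count in hyp.items())
    return matches / len(hyp_tokens)


def whitespace_tokenize(text):
    """Toy tokenizer: split on whitespace only."""
    return text.split()


def split_contractions(text):
    """Toy tokenizer: detach "n't" before splitting on whitespace."""
    return text.replace("n't", " n't").split()


hypothesis = "I don't like cats"
reference = "I do not like cats"

p_ws = clipped_unigram_precision(
    whitespace_tokenize(hypothesis), whitespace_tokenize(reference)
)
p_split = clipped_unigram_precision(
    split_contractions(hypothesis), split_contractions(reference)
)

print(f"whitespace tokens:   {p_ws:.2f}")     # 3 of 4 tokens match
print(f"contractions split:  {p_split:.2f}")  # 4 of 5 tokens match
```

The strings are identical in both runs; only the tokenizer changed, yet the precision moves from 3/4 to 4/5. This is why scores computed with different tokenizers are not comparable.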
Metrics Comparison
When to Use Each Metric
| Metric | Correlation with human judgment | Speed | Multi-ref | Notes |
|---|---|---|---|---|
| BLEU | Medium | Fast | Yes | Industry standard |
| METEOR | High | Medium | Yes | Strongest for English; needs language resources (stemmer, synonym data) |
| ChrF | Medium-High | Fast | Yes | Good for morphologically rich languages |
| TER | Medium | Medium | Yes | Edit distance to the closest reference |
| BERTScore | High | Slow | Yes | Context-aware, GPU needed |
| COMET | Very High | Slow | Yes | Best correlation, GPU needed |
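To make the character n-gram idea behind ChrF concrete, here is a simplified sketch. It assumes n-gram orders 1 to 6 and β = 2, strips whitespace, and uses a single reference; `simple_chrf` is an illustrative name, and its numbers will not exactly match an official implementation.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Average character n-gram precision and recall, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# A near-miss inflection: zero word-level overlap, high character overlap.
score = simple_chrf("Hundehütten", "Hundehütte")
print(f"{score:.3f}")
```

A word-level metric sees "Hundehütten" and "Hundehütte" as a complete mismatch, while the character-level score stays high. This is the reason ChrF tends to be kinder to inflected and compounded forms.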
Choosing a Metric
```text
For development/research:
  → Use BLEU (fast, comparable to papers)
  → Add COMET/BERTScore for final evaluation

For morphologically rich languages (German, Finnish, Turkish):
  → Use ChrF (character-level handles compounds)
  → BLEU underestimates quality

For summarization:
  → Use ROUGE (recall-focused)
  → BLEU penalizes paraphrasing

For quality estimation (no reference):
  → Use COMET-QE or similar
  → Useful for filtering training data


WHY BLEU IS STILL POPULAR:
──────────────────────────

1. Fast (no GPU needed)
2. Reproducible (deterministic)
3. Comparable (everyone uses it)
4. Good enough for development
5. Well-understood limitations
```
Setting Up Evaluation
Evaluation Pipeline Overview
```text
EVALUATION WORKFLOW:
────────────────────

┌─────────────┐
│  Test Data  │  Source sentences + reference translations
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    Model    │  Generate translations
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Tokenize   │  Apply consistent tokenization
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    Score    │  BLEU, ChrF, etc.
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Report    │  Scores + analysis
└─────────────┘
```
Critical: Tokenization
BLEU scores depend heavily on tokenization!
```text
Same translation, different tokenization:
  "don't" vs "do n't" vs "don ' t"
  → Different n-gram counts!

Standard approaches:
  - Moses tokenizer (traditional MT)
  - SacreBLEU (applies its own standardized tokenization; recommended)
  - spaCy/NLTK tokenizer

We'll use SacreBLEU-style for reproducibility.
```
Summary
Key Concepts
| Concept | Description |
|---|---|
| BLEU | N-gram precision with brevity penalty |
| Reference-based | Requires human translations to compare |
| Modified precision | Clips counts to prevent gaming |
| Corpus-level | Aggregates across all sentences |
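The "Corpus-level" row is easy to misread as "average the sentence scores". The minimal sketch below assumes one reference per hypothesis and no smoothing; `corpus_bleu` is defined here for illustration and is not a library function. It pools clipped matches, n-gram totals, and lengths over the whole test set before computing precisions and the brevity penalty.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU on a 0-1 scale, single reference each."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
            matches[n - 1] += sum((h & r).values())  # clipped matches
            totals[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # any zero precision gives zero BLEU without smoothing
    log_p = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_p)


hyps = ["the cat sat on the mat".split(), "a dog".split()]
refs = ["the cat sat on the mat".split(), "the dog".split()]

sentence_scores = [corpus_bleu([h], [r]) for h, r in zip(hyps, refs)]
print("sentence-level:", sentence_scores)
print("their average: ", sum(sentence_scores) / len(sentence_scores))
print("corpus-level:  ", round(corpus_bleu(hyps, refs), 3))
```

The short second sentence scores zero on its own because it has no matching bigrams, which drags the sentence average down. At corpus level its counts are pooled with the first sentence, so the two numbers differ substantially.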
Score Guidelines
| BLEU Range | Quality |
|---|---|
| < 20 | Low quality |
| 20-30 | Basic understanding |
| 30-40 | Good quality |
| 40+ | Very good (approaching human) |
Next Steps
In the next section, we'll implement BLEU from scratch with all its components.