Chapter 11

Introduction to Translation Metrics

Evaluation Metrics

Introduction

Evaluating machine translation quality is challenging because language is inherently ambiguous: a sentence usually has many valid translations. This chapter covers metrics for measuring translation quality, with a focus on BLEU, the most widely used metric.


The Challenge of MT Evaluation

Why Evaluation is Hard

📝text
Original (German):  "Der Hund läuft schnell."

Valid translations:
  "The dog runs quickly."        ← Reference 1
  "The dog runs fast."           ← Reference 2
  "The dog is running quickly."  ← Reference 3
  "A dog runs rapidly."          ← Also valid!

Invalid translations:
  "The cat runs quickly."        ← Wrong meaning
  "Dog the runs quickly."        ← Grammatically wrong
  "The dog."                     ← Incomplete

The challenge: How do we automatically score translations
when there are multiple valid answers?
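To make the problem concrete, here is a minimal sketch of the most naive automatic scorer: exact string match against the references listed above. The helper name `exact_match` is made up for this illustration and is not part of any library.

```python
from typing import List

references = [
    "The dog runs quickly.",
    "The dog runs fast.",
    "The dog is running quickly.",
]


def exact_match(hypothesis: str, refs: List[str]) -> bool:
    """Return True only if the hypothesis is identical to one of the references."""
    return hypothesis in refs


# A listed reference is accepted.
print(exact_match("The dog runs fast.", references))
# A valid translation that no reference happens to contain is rejected.
print(exact_match("A dog runs rapidly.", references))
# A wrong translation is rejected too, with the same score as the valid one.
print(exact_match("The cat runs quickly.", references))
```

Exact match cannot tell the valid but unlisted translation from the wrong one, which is why practical metrics give partial credit for overlap instead.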

Human vs Automatic Evaluation

📝text
HUMAN EVALUATION:
─────────────────
Pros:
  - Gold standard for quality
  - Can judge fluency and adequacy
  - Understands context and nuance

Cons:
  - Expensive ($$$)
  - Slow (hours/days)
  - Hard to reproduce (annotator variance)
  - Too slow and costly to run inside the development/training loop


AUTOMATIC EVALUATION:
─────────────────────
Pros:
  - Cheap and fast
  - Reproducible
  - Can evaluate millions of sentences
  - Enables rapid development

Cons:
  - Imperfect correlation with human judgment
  - Can be gamed
  - Misses some quality aspects

Categories of Metrics

Overview

🐍python
def metrics_overview():
    """
    Print an overview of translation evaluation metrics.
    """
    print("Translation Evaluation Metrics")
    print("=" * 60)

    print("""
    REFERENCE-BASED METRICS:
    ────────────────────────
    Compare hypothesis to reference translation(s).

    N-gram Overlap:
      • BLEU    - Precision of n-grams (1-4)
      • NIST    - BLEU variant with information weighting
      • ROUGE   - Recall-oriented (for summarization)
      • ChrF    - Character n-gram F-score

    Word Alignment:
      • METEOR  - Alignment + paraphrase matching
      • TER     - Translation Edit Rate (min edits)

    Embedding-Based:
      • BERTScore  - Contextual embedding similarity
      • COMET      - Neural learned metric


    REFERENCE-FREE METRICS:
    ───────────────────────
    Quality estimation without references.

      • COMET-QE     - Quality estimation model
      • XMoverScore  - Cross-lingual similarity

    Useful when references are unavailable.


    HUMAN EVALUATION:
    ─────────────────
    Gold standard, multiple dimensions:

      • Adequacy   - Is the meaning preserved?
      • Fluency    - Is it grammatically correct?
      • MQM        - Multidimensional Quality Metrics
      • DA         - Direct Assessment (rating scale)
    """)


metrics_overview()

What is BLEU?

Bilingual Evaluation Understudy

📝text
BLEU = Bilingual Evaluation Understudy (Papineni et al., 2002)

Core idea: Good translations share many n-grams with references.

Simple example:

Reference: "The cat sat on the mat"
Hypothesis: "The cat sat on the mat"
→ Perfect match! BLEU = 1.0 (100%)

Reference: "The cat sat on the mat"
Hypothesis: "A cat is sitting on a mat"
→ Only unigrams match (cat, on, mat: 3 of 7); no bigram does.
  Unsmoothed BLEU-4 = 0.0; smoothed variants give a small nonzero score.

Reference: "The cat sat on the mat"
Hypothesis: "Dog runs in park"
→ No n-grams match. BLEU = 0.0
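The n-gram overlap behind these three cases can be checked directly. The following is a small sketch, assuming whitespace tokenization and case-sensitive matching; `ngram_counts` and `clipped_precision` are illustrative helper names rather than a library API.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp: List[str], ref: List[str], n: int) -> float:
    """Fraction of hypothesis n-grams found in the reference, with clipped counts."""
    hyp_counts = ngram_counts(hyp, n)
    ref_counts = ngram_counts(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matches / total


reference = "The cat sat on the mat".split()
hypotheses = [
    "The cat sat on the mat",
    "A cat is sitting on a mat",
    "Dog runs in park",
]

for text in hypotheses:
    hyp = text.split()
    precisions = [clipped_precision(hyp, reference, n) for n in range(1, 5)]
    print(f"{text!r}: {[round(p, 2) for p in precisions]}")
```

The second hypothesis matches only on unigrams (cat, on, mat), so its bigram, trigram, and 4-gram precisions are all zero.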

BLEU Components

📝text
BLEU consists of:

1. MODIFIED PRECISION (for n-grams 1,2,3,4):
   - Count matching n-grams
   - Clip by reference count (prevents gaming)

2. BREVITY PENALTY:
   - Penalize translations that are too short
   - Without this, "the" could score 100% on 1-grams

3. GEOMETRIC MEAN:
   - Combine precisions with geometric mean
   - One zero precision = zero BLEU


Formula:
  BLEU = BP × exp(Σₙ wₙ × log(pₙ)),  summed over n = 1..N (usually N = 4)

Where:
  BP = brevity penalty
  wₙ = weight for n-gram (usually 1/4 each)
  pₙ = modified precision for n-grams of length n
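Two of these components are easy to see in isolation. The sketch below, with illustrative function names that are not from any library, contrasts unclipped and clipped unigram precision on a degenerate hypothesis, then evaluates the brevity penalty BP = 1 when the hypothesis is at least as long as the reference and exp(1 - r/c) otherwise (c = hypothesis length, r = reference length).

```python
import math
from collections import Counter
from typing import List


def unclipped_unigram_precision(hyp: List[str], ref: List[str]) -> float:
    """Naive precision: every hypothesis token that occurs in the reference counts."""
    ref_vocab = set(ref)
    return sum(1 for token in hyp if token in ref_vocab) / len(hyp)


def modified_unigram_precision(hyp: List[str], ref: List[str]) -> float:
    """Clipped precision: a token is credited at most as often as it occurs in the reference."""
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    clipped = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    return clipped / len(hyp)


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """Return 1.0 for sufficiently long hypotheses, exp(1 - r/c) for short ones."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


reference = "the cat sat on the mat".split()
gamed = "the the the the the the".split()

print(unclipped_unigram_precision(gamed, reference))  # every token is in the reference
print(modified_unigram_precision(gamed, reference))   # "the" is credited only twice
print(brevity_penalty(1, len(reference)))             # a one-word output is penalized heavily
print(brevity_penalty(6, len(reference)))             # same length: no penalty
```

Clipping caps the credit for "the" at its reference count of two, and the brevity penalty keeps a one-word hypothesis from benefiting from its perfect unigram precision.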

Quick BLEU Example

Step-by-Step Calculation

🐍python
from collections import Counter
from typing import List
import math


def get_ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu_example():
    """
    Walk through BLEU calculation step by step.
    """
    print("Simple BLEU Example")
    print("=" * 60)

    reference = "the cat sat on the mat".split()
    hypothesis = "the cat on the mat".split()

    print(f"Reference:  {' '.join(reference)}")
    print(f"Hypothesis: {' '.join(hypothesis)}")
    print()

    # Step 1: Count n-grams and compute clipped precisions
    print("STEP 1: Count n-grams")
    print("-" * 40)

    precisions = []
    for n in range(1, 5):
        # Get n-grams
        ref_ngrams = get_ngrams(reference, n)
        hyp_ngrams = get_ngrams(hypothesis, n)

        print(f"\n{n}-grams:")
        print(f"  Reference:  {dict(ref_ngrams)}")
        print(f"  Hypothesis: {dict(hyp_ngrams)}")

        # Count matches, clipping each n-gram at its reference count
        matches = 0
        total = sum(hyp_ngrams.values())

        for ngram, count in hyp_ngrams.items():
            matches += min(count, ref_ngrams.get(ngram, 0))

        precision = matches / total if total > 0 else 0
        precisions.append(precision)
        print(f"  Matches: {matches}/{total} = {precision:.3f}")

    # Step 2: Brevity penalty
    print("\n" + "-" * 40)
    print("STEP 2: Brevity Penalty")
    print("-" * 40)

    ref_len = len(reference)
    hyp_len = len(hypothesis)

    print(f"Reference length: {ref_len}")
    print(f"Hypothesis length: {hyp_len}")

    if hyp_len >= ref_len:
        bp = 1.0
        print("Hypothesis >= reference, BP = 1.0")
    else:
        bp = math.exp(1 - ref_len / hyp_len)
        print(f"BP = exp(1 - {ref_len}/{hyp_len}) = {bp:.4f}")

    # Step 3: Final BLEU
    print("\n" + "-" * 40)
    print("STEP 3: Final BLEU Score")
    print("-" * 40)

    # Use the precisions computed in Step 1 (here 5/5, 3/4, 1/3, 0/2).
    # Zero precisions are floored to avoid log(0). This is a crude stand-in
    # for proper smoothing; unsmoothed BLEU would be exactly 0 here.
    precisions = [p if p > 0 else 1e-10 for p in precisions]

    log_precision = sum(0.25 * math.log(p) for p in precisions)
    bleu = bp * math.exp(log_precision)

    print(f"Precisions: {[f'{p:.3f}' for p in precisions]}")
    print(f"BLEU = {bp:.4f} × exp({log_precision:.4f})")
    print(f"BLEU = {bleu:.4f} ({bleu*100:.2f}%)")


simple_bleu_example()

Interpreting BLEU Scores

Score Guidelines

| Score Range | Quality Level |
|-------------|---------------|
| < 10 | Almost useless |
| 10-19 | Hard to understand, some words correct |
| 20-29 | Basic meaning, significant errors |
| 30-39 | Understandable, some fluency issues |
| 40-49 | Good quality, minor errors |
| 50-59 | Very good, near human quality |
| 60+ | Excellent (rare without data leakage!) |

Note: The scores above are on the 0-100 scale (BLEU × 100). They are rough guidelines for news translation; other domains and language pairs can vary significantly!

Context Matters

📝text
WMT (News translation):
  - State-of-the-art DE→EN: ~40-45 BLEU
  - State-of-the-art EN→DE: ~35-40 BLEU

Multi30k (Simple descriptions):
  - Good model: ~35-40 BLEU
  - Target for our course: 30-35 BLEU

Conversational:
  - Much harder, lower BLEU expected
  - 20-25 can be reasonable

Important Caveats

  • BLEU doesn't measure: Factual accuracy (can hallucinate facts), semantic preservation (paraphrases score low), or fluency directly
  • Higher isn't always better: Can overfit to reference style, may sacrifice diversity, human preference can differ
  • Compare carefully: Same tokenization (!), same number of references, same test set, same BLEU implementation
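The number of references matters because BLEU clips each n-gram at its maximum count in any single reference, so extra references can only add matches. Here is a minimal sketch of that rule, assuming pre-tokenized input; `multi_ref_precision` is an illustrative name, not a library function.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def multi_ref_precision(hyp: List[str], refs: List[List[str]], n: int) -> float:
    """Clipped n-gram precision against one or more references."""
    hyp_counts = ngram_counts(hyp, n)
    # The clip limit for each n-gram is its maximum count in any single reference.
    max_ref_counts: Counter = Counter()
    for ref in refs:
        for gram, count in ngram_counts(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matches = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return matches / total


hyp = "the dog runs fast".split()
one_ref = ["the dog runs quickly".split()]
two_refs = one_ref + ["the dog runs fast".split()]

print(multi_ref_precision(hyp, one_ref, 2))   # 2 of 3 bigrams match
print(multi_ref_precision(hyp, two_refs, 2))  # all 3 bigrams match
```

The same hypothesis scores higher with two references than with one, so BLEU numbers computed against different reference counts are not comparable.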

Metrics Comparison

When to Use Each Metric

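| Metric | Correlation with human judgment | Speed | Multi-ref | Notes |
|--------|---------------------------------|-------|-----------|-------|
| BLEU | Medium | Fast | Yes | Industry standard |
| METEOR | High | Medium | Yes | Strongest for English; needs language resources (stemmers, synonym lists) |
| ChrF | Medium-High | Fast | Yes | Good for morphologically rich languages |
| TER | Medium | Medium | Yes | Measures edit distance to the closest reference |
| BERTScore | High | Slow | Yes | Context-aware, GPU recommended |
| COMET | Very High | Slow | Yes | Best correlation, GPU recommended |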
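To give a feel for the edit-distance family, here is a simplified sketch: word-level Levenshtein distance divided by reference length. This is an assumption-laden stand-in, closer to word error rate than to real TER, which additionally allows block shifts; the function names are illustrative.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Minimum number of word insertions, deletions, and substitutions."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,         # delete hyp_word
                curr[j - 1] + 1,     # insert ref_word
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def simple_edit_rate(hyp: List[str], ref: List[str]) -> float:
    """Edits per reference word; lower is better (0.0 means identical)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)


ref = "the cat sat on the mat".split()
print(simple_edit_rate("the cat on the mat".split(), ref))      # one missing word
print(simple_edit_rate("the cat sat on the mat".split(), ref))  # identical
```

Unlike BLEU, lower is better here, and a single missing word costs one edit regardless of how many n-grams it breaks.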

Choosing a Metric

📝text
For development/research:
  → Use BLEU (fast, comparable to papers)
  → Add COMET/BERTScore for final evaluation

For morphologically rich languages (German, Finnish, Turkish):
  → Use ChrF (character-level handles compounds)
  → BLEU tends to underestimate quality

For summarization:
  → Use ROUGE (recall-focused)
  → BLEU penalizes paraphrasing

For quality estimation (no reference):
  → Use COMET-QE or similar
  → Useful for filtering training data


WHY BLEU IS STILL POPULAR:
──────────────────────────

1. Fast (no GPU needed)
2. Reproducible (deterministic for a fixed tokenization)
3. Comparable (everyone uses it)
4. Good enough for development
5. Well-understood limitations
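The advantage of character-level matching for inflected languages can be sketched in a few lines. The following is a simplified ChrF-style score under stated assumptions: spaces are dropped, precision and recall are averaged over character n-gram orders 1 to 6, and they are combined with an F-score that weights recall more (β = 2). `simple_chrf` is an illustrative name, and its values will not necessarily match a reference implementation.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = char_ngrams(hyp, n), char_ngrams(ref, n)
        hyp_total, ref_total = sum(hyp_counts.values()), sum(ref_counts.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # one side is shorter than n characters
        overlap = sum((hyp_counts & ref_counts).values())
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


ref = "die Katzen schlafen"
hyp = "die Katze schläft"

# Word level: only "die" matches exactly.
print(len(set(hyp.split()) & set(ref.split())))
# Character level: shared stems such as "Katze" and "schl" earn partial credit.
print(round(simple_chrf(hyp, ref), 3))
```

A word-level metric sees "Katze" and "Katzen" as unrelated tokens, while the character n-grams they share still count.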

Setting Up Evaluation

Evaluation Pipeline Overview

📝text
EVALUATION WORKFLOW:
────────────────────

┌─────────────┐
│ Test Data   │ Source sentences + reference translations
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Model     │ Generate translations
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Tokenize   │ Apply consistent tokenization
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Score     │ BLEU, ChrF, etc.
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Report    │ Scores + analysis
└─────────────┘
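The workflow above can be sketched as one function that wires the stages together. Everything here is a placeholder for illustration: `toy_translate` stands in for a real model, `simple_tokenize` for a real tokenizer, and `unigram_precision` for a real metric such as BLEU.

```python
from collections import Counter
from typing import Callable, Dict, List

Tokens = List[str]
Metric = Callable[[List[Tokens], List[Tokens]], float]


def evaluate(
    translate: Callable[[str], str],
    sources: List[str],
    references: List[str],
    tokenize: Callable[[str], Tokens],
    metrics: Dict[str, Metric],
) -> Dict[str, float]:
    """Translate, tokenize both sides the same way, then score with each metric."""
    hypotheses = [translate(src) for src in sources]
    hyp_tokens = [tokenize(h) for h in hypotheses]
    ref_tokens = [tokenize(r) for r in references]
    return {name: metric(hyp_tokens, ref_tokens) for name, metric in metrics.items()}


def toy_translate(src: str) -> str:
    """Hypothetical stand-in for a model: a one-entry lookup table."""
    return {"Der Hund läuft schnell.": "The dog runs fast."}.get(src, "")


def simple_tokenize(text: str) -> Tokens:
    """Lowercase and split sentence-final periods off as separate tokens."""
    return text.lower().replace(".", " .").split()


def unigram_precision(hyps: List[Tokens], refs: List[Tokens]) -> float:
    """Corpus-level clipped unigram precision."""
    matches = total = 0
    for hyp, ref in zip(hyps, refs):
        matches += sum((Counter(hyp) & Counter(ref)).values())
        total += len(hyp)
    return matches / total if total else 0.0


report = evaluate(
    toy_translate,
    sources=["Der Hund läuft schnell."],
    references=["The dog runs quickly."],
    tokenize=simple_tokenize,
    metrics={"unigram_precision": unigram_precision},
)
print(report)
```

Passing the tokenizer and metrics in as arguments keeps the tokenization identical for hypotheses and references and makes it easy to add further metrics to the same report.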

Critical: Tokenization

BLEU scores depend heavily on tokenization!
📝text
Same translation, different tokenization:
  "don't" vs "do n't" vs "don ' t"
  → Different n-gram counts!

Standard approaches:
  - Moses tokenizer (traditional MT)
  - SacreBLEU (applies its own standardized tokenization, recommended)
  - spaCy/NLTK tokenizer

We'll use SacreBLEU-style tokenization for reproducibility.
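A short sketch shows how much the tokenizer alone can move a score. Both tokenizers below are toy illustrations written for this example, not the Moses or SacreBLEU tokenizers; the second one splits off "n't" and detaches punctuation.

```python
import re
from collections import Counter
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    """Split on whitespace only, so punctuation stays glued to words."""
    return text.split()


def split_tokenize(text: str) -> List[str]:
    """Toy tokenizer: separate the clitic n't and detach punctuation."""
    text = re.sub(r"n't\b", " n't", text)     # don't -> do n't
    text = re.sub(r"([.,!?])", r" \1", text)  # know. -> know .
    return text.split()


def unigram_precision(hyp: List[str], ref: List[str]) -> float:
    """Clipped unigram precision of hyp against ref."""
    if not hyp:
        return 0.0
    return sum((Counter(hyp) & Counter(ref)).values()) / len(hyp)


reference = "I don't know."
hypothesis = "I do not know."

for tokenize in (whitespace_tokenize, split_tokenize):
    score = unigram_precision(tokenize(hypothesis), tokenize(reference))
    print(f"{tokenize.__name__}: {score:.2f}")
```

The sentence pair is identical in both runs, yet unigram precision moves from 0.50 to 0.80, which is why scores computed with different tokenizers should not be compared.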

Summary

Key Concepts

| Concept | Description |
|---------|-------------|
| BLEU | N-gram precision with brevity penalty |
| Reference-based | Requires human translations to compare |
| Modified precision | Clips counts to prevent gaming |
| Corpus-level | Aggregates across all sentences |
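"Corpus-level" means the n-gram matches, n-gram totals, and lengths are pooled over all sentences before the precisions are computed, rather than averaging per-sentence scores. Below is a minimal unsmoothed, single-reference sketch under those assumptions; `corpus_bleu` here is an illustrative function, not a library call.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps: List[List[str]], refs: List[List[str]], max_n: int = 4) -> float:
    """Unsmoothed BLEU with counts pooled over the whole corpus."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # any zero precision makes the geometric mean zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)


hyps = ["the cat sat on the mat".split(), "the dog runs fast".split()]
refs = ["the cat sat on the mat".split(), "the dog runs quickly".split()]

sentence_scores = [corpus_bleu([h], [r]) for h, r in zip(hyps, refs)]
print("Sentence-level:", sentence_scores)
print("Mean of sentence scores:", sum(sentence_scores) / len(sentence_scores))
print("Corpus-level:", round(corpus_bleu(hyps, refs), 4))
```

The second sentence has no matching 4-gram, so its unsmoothed sentence score is zero, but pooling counts lets its shorter matches still contribute to the corpus score. The two aggregation methods therefore give different numbers.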

Score Guidelines

| BLEU Range | Quality |
|------------|---------|
| < 20 | Low quality |
| 20-30 | Basic understanding |
| 30-40 | Good quality |
| 40+ | Very good (approaching human) |

Next Steps

In the next section, we'll implement BLEU from scratch with all its components.
