Boo-AI — Master Artificial Intelligence by Building from Scratch

Introduction

Evaluating machine translation quality is challenging because language is inherently ambiguous—there are many valid ways to translate a sentence. This chapter covers metrics for measuring translation quality, with a focus on BLEU, the most widely used metric.

The Challenge of MT Evaluation

Why Evaluation is Hard

📝text

Original (German):  "Der Hund läuft schnell."

Valid translations:
  "The dog runs quickly."        ← Reference 1
  "The dog runs fast."           ← Reference 2
  "The dog is running quickly."  ← Reference 3
  "A dog runs rapidly."          ← Also valid!

Invalid translations:
  "The cat runs quickly."        ← Wrong meaning
  "Dog the runs quickly."        ← Grammatically wrong
  "The dog."                     ← Incomplete

The challenge: How do we automatically score translations
when there are multiple valid answers?
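To make the problem concrete, here is a minimal sketch of the most naive automatic score: exact string match against any reference. The names `normalize` and `exact_match`, and the normalization itself (lowercasing, dropping the final period), are assumptions made for illustration and not part of any standard metric.

```python
from typing import List


def normalize(text: str) -> str:
    """Lowercase and drop a trailing period so trivial differences don't matter."""
    return text.lower().rstrip(".").strip()


def exact_match(hypothesis: str, references: List[str]) -> int:
    """Return 1 if the hypothesis equals any reference after normalization, else 0."""
    return int(normalize(hypothesis) in {normalize(r) for r in references})


references = [
    "The dog runs quickly.",
    "The dog runs fast.",
    "The dog is running quickly.",
]

for hyp in ["The dog runs fast.", "A dog runs rapidly.", "Dog the runs quickly."]:
    print(f"{hyp!r}: {exact_match(hyp, references)}")
```

Exact match gives the valid paraphrase ("A dog runs rapidly.") and the scrambled sentence the same score of 0, which is why practical metrics award partial credit for overlapping words and phrases.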

Human vs Automatic Evaluation

📝text

HUMAN EVALUATION:
─────────────────
Pros:
  - Gold standard for quality
  - Can judge fluency and adequacy
  - Understands context and nuance

Cons:
  - Expensive ($$$)
  - Slow (hours/days)
  - Not reproducible (annotator variance)
  - Too slow and costly to run inside a development/training loop


AUTOMATIC EVALUATION:
─────────────────────
Pros:
  - Free and instant
  - Reproducible
  - Can evaluate millions of sentences
  - Enables rapid development

Cons:
  - Imperfect correlation with human judgment
  - Can be gamed
  - Misses some quality aspects

Categories of Metrics

Overview

🐍python

def metrics_overview():
    """
    Overview of translation evaluation metrics.
    """
    print("Translation Evaluation Metrics")
    print("=" * 60)

    print("""
    REFERENCE-BASED METRICS:
    ────────────────────────
    Compare hypothesis to reference translation(s).

    N-gram Overlap:
      • BLEU    - Precision of n-grams (1-4)
      • NIST    - BLEU variant with information weighting
      • ROUGE   - Recall-oriented (for summarization)
      • ChrF    - Character n-gram F-score

    Word Alignment:
      • METEOR  - Alignment + paraphrase matching
      • TER     - Translation Edit Rate (min edits)

    Embedding-Based:
      • BERTScore  - Contextual embedding similarity
      • COMET      - Neural learned metric


    REFERENCE-FREE METRICS:
    ───────────────────────
    Quality estimation without references.

      • COMET-QE    - Quality estimation model
      • XMoverScore - Cross-lingual similarity

    Useful when references are unavailable.


    HUMAN EVALUATION:
    ─────────────────
    Gold standard, multiple dimensions:

      • Adequacy   - Is the meaning preserved?
      • Fluency    - Is it grammatically correct?
      • MQM        - Multidimensional Quality Metrics
      • DA         - Direct Assessment (rating scale)
    """)


metrics_overview()

What is BLEU?

Bilingual Evaluation Understudy

📝text

BLEU = Bilingual Evaluation Understudy (Papineni et al., 2002)

Core idea: Good translations share many n-grams with references.

Simple example:

Reference: "The cat sat on the mat"
Hypothesis: "The cat sat on the mat"
→ Perfect match! BLEU = 1.0 (100%)

Reference: "The cat sat on the mat"
Hypothesis: "A cat is sitting on a mat"
→ Only the unigrams "cat", "on", "mat" match (3 of 7); no bigram or
  longer n-gram matches, so unsmoothed BLEU is 0 and smoothed
  sentence-level BLEU stays low.

Reference: "The cat sat on the mat"
Hypothesis: "Dog runs in park"
→ No n-grams match. BLEU = 0.0

BLEU Components

📝text

BLEU consists of:

1. MODIFIED PRECISION (for n-grams 1,2,3,4):
   - Count matching n-grams
   - Clip by reference count (prevents gaming)

2. BREVITY PENALTY:
   - Penalize translations that are too short
   - Without this, "the" could score 100% on 1-grams

3. GEOMETRIC MEAN:
   - Combine precisions with geometric mean
   - One zero precision = zero BLEU


Formula:
  BLEU = BP × exp(Σ wₙ × log(pₙ))

Where:
  BP = brevity penalty
  wₙ = weight for n-gram (usually 1/4 each)
  pₙ = modified precision for n-grams of length n
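The first two components can be sketched in a few lines. This is a minimal illustration assuming a single reference and unigrams only; the function names `modified_unigram_precision` and `brevity_penalty` are chosen here for the example. The brevity penalty follows the usual definition: 1 when the hypothesis is at least as long as the reference, otherwise exp(1 - r/c) for reference length r and hypothesis length c.

```python
from collections import Counter
from typing import List
import math


def modified_unigram_precision(hyp: List[str], ref: List[str]) -> float:
    """Clipped precision: a hypothesis word counts at most as often as it occurs in the reference."""
    if not hyp:
        return 0.0
    ref_counts = Counter(ref)
    clipped = sum(min(count, ref_counts[word]) for word, count in Counter(hyp).items())
    return clipped / len(hyp)


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """1.0 for hypotheses at least as long as the reference, exp(1 - r/c) otherwise."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


ref = "the cat is on the mat".split()
spam = "the the the the the the the".split()

# Without clipping, every "the" is found in the reference: 7/7.
unclipped = sum(1 for word in spam if word in ref) / len(spam)
print(f"Unclipped precision: {unclipped:.3f}")

# With clipping, "the" is credited only twice (its reference count): 2/7.
print(f"Clipped precision:   {modified_unigram_precision(spam, ref):.3f}")

# The one-word hypothesis "the" has perfect unigram precision but a heavy penalty.
print(f"BP for length 1 vs 6: {brevity_penalty(1, len(ref)):.4f}")
```

Clipping removes the reward for repeating a common word, and the brevity penalty removes the reward for saying almost nothing.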

Quick BLEU Example

Step-by-Step Calculation

🐍python

from collections import Counter
from typing import List
import math


def get_ngrams(tokens: List[str], n: int) -> Counter:
    """Extract n-grams (as tuples) from a token list and count them."""
    ngrams = []
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i+n])
        ngrams.append(ngram)
    return Counter(ngrams)


def simple_bleu_example():
    """
    Walk through BLEU calculation step by step.
    """
    print("Simple BLEU Example")
    print("=" * 60)

    reference = "the cat sat on the mat".split()
    hypothesis = "the cat on the mat".split()

    print(f"Reference:  {' '.join(reference)}")
    print(f"Hypothesis: {' '.join(hypothesis)}")
    print()

    # Step 1: Count n-grams and compute the modified precision for each order
    print("STEP 1: Count n-grams")
    print("-" * 40)

    precisions = []  # filled in the loop and reused in Step 3

    for n in range(1, 5):
        # Get n-grams
        ref_ngrams = get_ngrams(reference, n)
        hyp_ngrams = get_ngrams(hypothesis, n)

        print(f"\n{n}-grams:")
        print(f"  Reference:  {dict(ref_ngrams)}")
        print(f"  Hypothesis: {dict(hyp_ngrams)}")

        # Count matches, clipping each n-gram by its count in the reference
        matches = 0
        total = sum(hyp_ngrams.values())

        for ngram, count in hyp_ngrams.items():
            matches += min(count, ref_ngrams.get(ngram, 0))

        precision = matches / total if total > 0 else 0
        precisions.append(precision)
        print(f"  Matches: {matches}/{total} = {precision:.3f}")

    # Step 2: Brevity penalty
    print("\n" + "-" * 40)
    print("STEP 2: Brevity Penalty")
    print("-" * 40)

    ref_len = len(reference)
    hyp_len = len(hypothesis)

    print(f"Reference length: {ref_len}")
    print(f"Hypothesis length: {hyp_len}")

    if hyp_len >= ref_len:
        bp = 1.0
        print("Hypothesis >= reference, BP = 1.0")
    else:
        bp = math.exp(1 - ref_len / hyp_len)
        print(f"BP = exp(1 - {ref_len}/{hyp_len}) = {bp:.4f}")

    # Step 3: Final BLEU
    print("\n" + "-" * 40)
    print("STEP 3: Final BLEU Score")
    print("-" * 40)

    # Use the precisions measured in Step 1 (here 5/5, 3/4, 1/3, 0/2).
    # The 4-gram precision is 0, so strict BLEU would be 0. A tiny floor is
    # substituted only to avoid log(0) and keep the walkthrough going.
    precisions = [p if p > 0 else 1e-10 for p in precisions]

    log_precision = sum(0.25 * math.log(p) for p in precisions)
    bleu = bp * math.exp(log_precision)

    print(f"Precisions: {[f'{p:.3f}' for p in precisions]}")
    print(f"BLEU = {bp:.4f} × exp({log_precision:.4f})")
    print(f"BLEU = {bleu:.4f} ({bleu*100:.2f}%)")


simple_bleu_example()

Interpreting BLEU Scores

Score Guidelines

Score Range	Quality Level
< 10	Almost useless
10-19	Hard to understand, some words correct
20-29	Basic meaning, significant errors
30-39	Understandable, some fluency issues
40-49	Good quality, minor errors
50-59	Very good, near human quality
60+	Excellent (rare without data leakage!)

Note: These are rough guidelines for news translation. Different domains and language pairs vary significantly!

Context Matters

📝text

WMT (News translation):
  - State-of-the-art DE→EN: ~40-45 BLEU
  - State-of-the-art EN→DE: ~35-40 BLEU

Multi30k (Simple descriptions):
  - Good model: ~35-40 BLEU
  - Target for our course: 30-35 BLEU

Conversational:
  - Much harder, lower BLEU expected
  - 20-25 can be reasonable

Important Caveats

BLEU doesn't measure: Factual accuracy (can hallucinate facts), semantic preservation (paraphrases score low), or fluency directly
Higher isn't always better: Can overfit to reference style, may sacrifice diversity, human preference can differ
Compare carefully: Same tokenization (!), same number of references, same test set, same BLEU implementation

Metrics Comparison

When to Use Each Metric

Metric	Correlation with human judgment	Speed	Multi-ref	Notes
BLEU	Medium	Fast	Yes	Industry standard
METEOR	High	Medium	Yes	Best supported for English; needs language resources (stemmers, synonym lists)
ChrF	Medium-High	Fast	Yes	Good for morphologically rich languages
TER	Medium	Medium	Yes	Measures edit distance to the closest reference
BERTScore	High	Slow	Yes	Context-aware, GPU recommended
COMET	Very High	Slow	Yes	Best correlation, GPU recommended

Choosing a Metric

📝text

For development/research:
  → Use BLEU (fast, comparable to papers)
  → Add COMET/BERTScore for final evaluation

For morphologically rich languages (German, Finnish, Turkish):
  → Use ChrF (character-level handles compounds and inflection)
  → BLEU tends to underestimate quality

For summarization:
  → Use ROUGE (recall-focused)
  → BLEU penalizes paraphrasing

For quality estimation (no reference):
  → Use COMET-QE or similar
  → Useful for filtering training data


WHY BLEU IS STILL POPULAR:
──────────────────────────

1. Fast (no GPU needed)
2. Reproducible (deterministic for a fixed tokenization)
3. Comparable (everyone uses it)
4. Good enough for development
5. Well-understood limitations
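To see why character-level scoring helps with compounds and inflection, here is a simplified chrF-style sketch. It is an approximation under stated assumptions (whitespace removed, character n-grams of order 1 to 6, precision and recall averaged over the orders present, F-score with beta = 2), and the name `simple_chrf` is chosen for this example. A standard implementation may differ in details and should be used for reported scores.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # Counter & keeps the minimum counts
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


reference = "hundehütte"    # German compound: "doghouse"
hypothesis = "hundehütten"  # plural form, one extra character

word_match = hypothesis.split() == reference.split()
print(f"Word-level match: {word_match}")
print(f"Simplified chrF:  {simple_chrf(hypothesis, reference):.3f}")
```

A word-level metric sees two different tokens and gives no credit, while the character-level score stays high because almost every character n-gram is shared.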

Setting Up Evaluation

Evaluation Pipeline Overview

📝text

EVALUATION WORKFLOW:
────────────────────

┌─────────────┐
│ Test Data   │ Source sentences + reference translations
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Model     │ Generate translations
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Tokenize   │ Apply consistent tokenization
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Score     │ BLEU, ChrF, etc.
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Report    │ Scores + analysis
└─────────────┘
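The workflow above could be wired together roughly as follows. This is a skeleton under loud assumptions: `toy_translate` is a hypothetical stand-in for a real model, `tokenize` is a placeholder for a standardized tokenizer, and unigram precision stands in for a real metric such as BLEU or ChrF.

```python
from collections import Counter
from typing import Callable, Dict, List


def tokenize(text: str) -> List[str]:
    """Placeholder tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def unigram_precision(hyp: List[str], ref: List[str]) -> float:
    """Stand-in metric: clipped unigram precision against one reference."""
    if not hyp:
        return 0.0
    ref_counts = Counter(ref)
    matched = sum(min(c, ref_counts[t]) for t, c in Counter(hyp).items())
    return matched / len(hyp)


def evaluate(translate: Callable[[str], str], sources: List[str], references: List[str]) -> Dict:
    """Test data -> model -> tokenize -> score -> report."""
    rows = []
    for src, ref in zip(sources, references):
        hyp = translate(src)
        score = unigram_precision(tokenize(hyp), tokenize(ref))
        rows.append({"source": src, "hypothesis": hyp, "reference": ref, "score": score})
    mean_score = sum(row["score"] for row in rows) / len(rows) if rows else 0.0
    return {"mean_score": mean_score, "rows": rows}


def toy_translate(source: str) -> str:
    """Hypothetical model: a lookup table instead of a trained translator."""
    table = {"Der Hund läuft schnell.": "The dog runs fast."}
    return table.get(source, "")


report = evaluate(toy_translate, ["Der Hund läuft schnell."], ["The dog runs quickly."])
print(f"Mean score: {report['mean_score']:.2f}")
for row in report["rows"]:
    print(row)
```

Keeping tokenization and scoring as separate, swappable steps makes it easier to guarantee that every system is evaluated under identical conditions.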

Critical: Tokenization

BLEU scores depend heavily on tokenization!

📝text

Same translation, different tokenization:
  "don't" vs "do n't" vs "don ' t"
  → Different n-gram counts!

Standard approaches:
  - Moses tokenizer (traditional MT)
  - SacreBLEU's built-in tokenization (standardized, recommended)
  - spaCy/NLTK tokenizer

We'll use SacreBLEU-style tokenization for reproducibility.
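A small sketch shows how much the choice matters. The three token sequences below are written out by hand to mimic the styles listed above (they are not produced by any real tokenizer), and unigram precision is used as a simple stand-in for BLEU.

```python
from collections import Counter
from typing import List


def unigram_precision(hyp: List[str], ref: List[str]) -> float:
    """Clipped unigram precision against a single reference."""
    if not hyp:
        return 0.0
    ref_counts = Counter(ref)
    matched = sum(min(c, ref_counts[t]) for t, c in Counter(hyp).items())
    return matched / len(hyp)


# The reference was tokenized in the "do n't" style.
reference = "i do n't know".split()

# The same hypothesis sentence under three hand-written tokenizations.
variants = {
    "whitespace only": "i don't know".split(),
    "clitic split": "i do n't know".split(),
    "split on punctuation": "i don ' t know".split(),
}

scores = {name: unigram_precision(tokens, reference) for name, tokens in variants.items()}
for name, score in scores.items():
    print(f"{name:22s} {score:.2f}")
```

The translation is identical in all three cases, yet the score ranges from 0.40 to 1.00 purely because hypothesis and reference were tokenized differently.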

Summary

Key Concepts

Concept	Description
BLEU	N-gram precision with brevity penalty
Reference-based	Requires human translations to compare
Modified precision	Clips counts to prevent gaming
Corpus-level	Aggregates across all sentences
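The corpus-level point is easy to miss: BLEU pools n-gram counts and lengths over the whole test set before combining them, which is not the same as averaging per-sentence scores. The sketch below assumes one reference per sentence, uniform weights, and no smoothing; `bleu_from_pairs` is a name chosen for this example.

```python
from collections import Counter
from typing import List, Tuple
import math


def ngram_counts(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_from_pairs(pairs: List[Tuple[List[str], List[str]]], max_n: int = 4) -> float:
    """Unsmoothed BLEU with counts pooled over all (hypothesis, reference) pairs."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in pairs:
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(totals) == 0 or min(matches) == 0:
        return 0.0  # any zero precision gives zero BLEU
    log_p = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_p)


pairs = [
    ("the cat sat on the mat".split(), "the cat sat on the mat".split()),
    ("a dog barks".split(), "a dog barked".split()),
]

sentence_scores = [bleu_from_pairs([pair]) for pair in pairs]
mean_sentence = sum(sentence_scores) / len(sentence_scores)
corpus = bleu_from_pairs(pairs)

print(f"Sentence scores:        {sentence_scores}")
print(f"Mean of sentence BLEU:  {mean_sentence:.3f}")
print(f"Corpus-level BLEU:      {corpus:.3f}")
```

The short second sentence has no 4-grams at all, so its sentence-level score is 0 and drags the average to 0.5, while the pooled corpus-level score is noticeably higher. This is one reason unsmoothed BLEU is usually reported at corpus level.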

Score Guidelines

BLEU Range	Quality
< 20	Low quality, hard to understand
20-29	Basic meaning, significant errors
30-39	Understandable, some fluency issues
40+	Good to very good quality

Next Steps

In the next section, we'll implement BLEU from scratch with all its components.