Boo-AI — Master Artificial Intelligence by Building from Scratch

Introduction

This section brings together all evaluation components into a practical pipeline for evaluating translation models. We'll cover batch translation, proper tokenization handling, statistical significance testing, and reporting.

Complete Evaluation Pipeline

Pipeline Overview

📝text

EVALUATION PIPELINE:
────────────────────

┌─────────────────┐
│   Test Data     │ source.txt, reference.txt
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Load Model    │ From checkpoint
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Translate     │ Batch inference with beam search
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Detokenize    │ Convert subwords back to text
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Score         │ BLEU, ChrF, TER
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Report        │ Scores + analysis
└─────────────────┘
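
How the Detokenize step works depends on the subword scheme used during training. As a minimal sketch, assuming either BPE tokens that use a trailing `@@` continuation marker or SentencePiece-style pieces that use `▁` as a word-boundary marker, the conversion back to text could look like this. The function names are illustrative and are not part of the pipeline in this section.

```python
from typing import List


def detokenize_bpe(tokens: List[str], marker: str = "@@") -> str:
    """Join BPE tokens where a trailing marker means 'continues in next token'."""
    text = " ".join(tokens)
    # "trans@@ former" -> "transformer"
    text = text.replace(marker + " ", "")
    # A marker left on the final token has nothing to attach to; drop it.
    if text.endswith(marker):
        text = text[: -len(marker)]
    return text


def detokenize_sentencepiece(pieces: List[str], boundary: str = "\u2581") -> str:
    """Join pieces where the boundary symbol marks the start of a new word."""
    return "".join(pieces).replace(boundary, " ").strip()


print(detokenize_bpe(["a", "trans@@", "form@@", "er", "model"]))
print(detokenize_sentencepiece(["\u2581a", "\u2581trans", "form", "er", "\u2581model"]))
```

Both calls print `a transformer model`. Whatever scheme is used, hypotheses and references should go through the same detokenization before scoring.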

Implementation

🐍python

import torch
import torch.nn as nn
from typing import List, Dict, Optional, Tuple, Any
from dataclasses import dataclass
from pathlib import Path
import json
import random
import time
from collections import Counter
import math


@dataclass
class EvaluationConfig:
    """Configuration for evaluation pipeline."""
    # Model
    checkpoint_path: str
    device: str = "cuda"

    # Generation
    beam_size: int = 5
    max_length: int = 128
    length_penalty: float = 1.0

    # Tokenization
    tokenizer_path: Optional[str] = None

    # Output
    output_dir: str = "evaluation_results"
    save_translations: bool = True
    save_detailed_scores: bool = True

    # Analysis
    num_examples_to_show: int = 10
    compute_bootstrap_ci: bool = False
    bootstrap_samples: int = 1000


class TranslationEvaluationPipeline:
    """
    Complete pipeline for evaluating translation models.

    Handles:
    - Loading model and tokenizer
    - Batch translation with beam search
    - Multi-metric evaluation
    - Result reporting and analysis

    Args:
        config: EvaluationConfig instance
        model: Transformer model (optional, will load from checkpoint)
        tokenizer: Tokenizer (optional)

    Example:
        >>> pipeline = TranslationEvaluationPipeline(config)
        >>> results = pipeline.evaluate(test_sources, test_references)
        >>> pipeline.save_results(results)
    """

    def __init__(
        self,
        config: EvaluationConfig,
        model: Optional[nn.Module] = None,
        tokenizer: Optional[Any] = None
    ):
        self.config = config
        self.device = torch.device(config.device)

        # Load model if not provided
        if model is None:
            self.model = self._load_model()
        else:
            self.model = model.to(self.device)

        # _load_model() is still a placeholder that returns None,
        # so only switch to eval mode when a model is actually present
        if self.model is not None:
            self.model.eval()

        # Use the tokenizer passed in (loading from config.tokenizer_path
        # is not implemented in this example)
        self.tokenizer = tokenizer

        # Initialize evaluators
        self._init_evaluators()

        # Create output directory
        Path(config.output_dir).mkdir(parents=True, exist_ok=True)

    def _load_model(self) -> Optional[nn.Module]:
        """Load model from checkpoint."""
        checkpoint = torch.load(
            self.config.checkpoint_path,
            map_location=self.device
        )

        # This would use actual model class in real code
        # model = Transformer(**checkpoint['model_config'])
        # model.load_state_dict(checkpoint['model_state_dict'])

        print(f"Loaded model from {self.config.checkpoint_path}")
        # Placeholder: no model object is constructed here
        return None

    def _init_evaluators(self):
        """Initialize metric evaluators."""
        # These use our implementations from previous sections
        self.bleu_scorer = BLEUScoreAccumulator()
        self.chrf_scorer = ChrFScore()
        self.ter_scorer = TERScore()

    def evaluate(
        self,
        sources: List[str],
        references: List[List[str]],
        batch_size: int = 32
    ) -> Dict[str, Any]:
        """
        Run complete evaluation pipeline.

        Args:
            sources: Source sentences
            references: Reference translations (list of lists for multi-ref)
            batch_size: Batch size for translation

        Returns:
            Dictionary with all results
        """
        print(f"Evaluating {len(sources)} sentences...")
        start_time = time.time()

        # Step 1: Translate
        print("Translating...")
        hypotheses = self._translate_all(sources, batch_size)

        translation_time = time.time() - start_time
        print(f"Translation completed in {translation_time:.2f}s")

        # Step 2: Compute metrics
        print("Computing metrics...")
        metrics = self._compute_metrics(hypotheses, references)

        # Step 3: Compute sentence-level scores
        sentence_scores = self._compute_sentence_scores(hypotheses, references)

        # Step 4: Bootstrap confidence intervals (optional)
        confidence_intervals = {}
        if self.config.compute_bootstrap_ci:
            print("Computing confidence intervals...")
            confidence_intervals = self._bootstrap_ci(
                hypotheses, references
            )

        # Step 5: Prepare results
        results = {
            'metrics': metrics,
            'sentence_scores': sentence_scores,
            'confidence_intervals': confidence_intervals,
            'hypotheses': hypotheses,
            'num_sentences': len(sources),
            'translation_time': translation_time,
            # Guard against a zero elapsed time (possible with the
            # placeholder translator on a small test set)
            'sentences_per_second': len(sources) / max(translation_time, 1e-9),
            'config': {
                'beam_size': self.config.beam_size,
                'max_length': self.config.max_length,
                'length_penalty': self.config.length_penalty,
            }
        }

        return results

    def _translate_all(
        self,
        sources: List[str],
        batch_size: int
    ) -> List[str]:
        """Translate all source sentences."""
        hypotheses = []

        # For demonstration, return placeholder translations
        # In real code, this would use the model (batch_size is unused here)
        for source in sources:
            # Placeholder: echo source (would use actual model)
            hypotheses.append(source.lower())

        return hypotheses

    def _compute_metrics(
        self,
        hypotheses: List[str],
        references: List[List[str]]
    ) -> Dict[str, float]:
        """Compute all corpus-level metrics."""
        # Reset scorers
        self.bleu_scorer.reset()
        self.chrf_scorer.reset()
        self.ter_scorer.reset()

        # Accumulate (only the first reference of each sentence is scored)
        for hyp, refs in zip(hypotheses, references):
            ref = refs[0] if isinstance(refs, list) else refs
            self.bleu_scorer.add(hyp, ref)
            self.chrf_scorer.add(hyp, ref)
            self.ter_scorer.add(hyp, ref)

        return {
            'bleu': self.bleu_scorer.compute() * 100,
            'chrf': self.chrf_scorer.corpus_score()['chrf'] * 100,
            'ter': self.ter_scorer.corpus_score() * 100,
        }

    def _compute_sentence_scores(
        self,
        hypotheses: List[str],
        references: List[List[str]]
    ) -> List[Dict[str, float]]:
        """Compute sentence-level scores for analysis."""
        scores = []

        for hyp, refs in zip(hypotheses, references):
            ref = refs[0] if isinstance(refs, list) else refs
            scores.append({
                'chrf': self.chrf_scorer.sentence_score(hyp, ref),
                'ter': self.ter_scorer.sentence_score(hyp, ref),
                'length_ratio': len(hyp.split()) / max(len(ref.split()), 1),
            })

        return scores

    def _bootstrap_ci(
        self,
        hypotheses: List[str],
        references: List[List[str]],
        confidence: float = 0.95
    ) -> Dict[str, Tuple[float, float]]:
        """
        Compute bootstrap confidence intervals.

        Args:
            hypotheses: Hypothesis translations
            references: Reference translations
            confidence: Confidence level (default: 95%)

        Returns:
            Dictionary mapping metric name to (lower, upper) bounds
        """
        n = len(hypotheses)
        if n == 0:
            return {}

        bleu_scores = []
        chrf_scores = []

        for _ in range(self.config.bootstrap_samples):
            # Sample sentence indices with replacement
            indices = [random.randint(0, n - 1) for _ in range(n)]
            sampled_hyp = [hypotheses[i] for i in indices]
            sampled_ref = [references[i] for i in indices]

            # Compute metrics on sample
            sample_metrics = self._compute_metrics(sampled_hyp, sampled_ref)
            bleu_scores.append(sample_metrics['bleu'])
            chrf_scores.append(sample_metrics['chrf'])

        # Compute percentile indices, clamped to the last valid position
        alpha = 1 - confidence
        last = len(bleu_scores) - 1
        lower_idx = min(int(alpha / 2 * len(bleu_scores)), last)
        upper_idx = min(int((1 - alpha / 2) * len(bleu_scores)), last)

        bleu_sorted = sorted(bleu_scores)
        chrf_sorted = sorted(chrf_scores)

        return {
            'bleu': (bleu_sorted[lower_idx], bleu_sorted[upper_idx]),
            'chrf': (chrf_sorted[lower_idx], chrf_sorted[upper_idx]),
        }

    def save_results(self, results: Dict[str, Any], prefix: str = "eval"):
        """Save evaluation results to files."""
        output_dir = Path(self.config.output_dir)

        # Save metrics summary
        metrics_path = output_dir / f"{prefix}_metrics.json"
        with open(metrics_path, 'w', encoding='utf-8') as f:
            json.dump({
                'metrics': results['metrics'],
                'confidence_intervals': results.get('confidence_intervals', {}),
                'config': results['config'],
                'num_sentences': results['num_sentences'],
                'translation_time': results['translation_time'],
            }, f, indent=2)

        print(f"Saved metrics to {metrics_path}")

        # Save translations (one hypothesis per line)
        if self.config.save_translations:
            trans_path = output_dir / f"{prefix}_translations.txt"
            with open(trans_path, 'w', encoding='utf-8') as f:
                for hyp in results['hypotheses']:
                    f.write(hyp + '\n')
            print(f"Saved translations to {trans_path}")

        # Save detailed scores
        if self.config.save_detailed_scores:
            scores_path = output_dir / f"{prefix}_sentence_scores.json"
            with open(scores_path, 'w', encoding='utf-8') as f:
                json.dump(results['sentence_scores'], f, indent=2)
            print(f"Saved sentence scores to {scores_path}")

    def print_report(
        self,
        results: Dict[str, Any],
        sources: List[str],
        references: List[List[str]]
    ):
        """Print formatted evaluation report."""
        print("\n" + "=" * 70)
        print("TRANSLATION EVALUATION REPORT")
        print("=" * 70)

        # Metrics
        print("\nCORPUS-LEVEL METRICS:")
        print("-" * 40)
        metrics = results['metrics']
        print(f"  BLEU:  {metrics['bleu']:.2f}")
        print(f"  ChrF:  {metrics['chrf']:.2f}")
        print(f"  TER:   {metrics['ter']:.2f} (lower is better)")

        # Confidence intervals
        if results.get('confidence_intervals'):
            print("\n95% CONFIDENCE INTERVALS:")
            print("-" * 40)
            for metric, (lower, upper) in results['confidence_intervals'].items():
                print(f"  {metric.upper()}: [{lower:.2f}, {upper:.2f}]")

        # Statistics
        print("\nSTATISTICS:")
        print("-" * 40)
        print(f"  Sentences evaluated: {results['num_sentences']}")
        print(f"  Translation time: {results['translation_time']:.2f}s")
        print(f"  Speed: {results['sentences_per_second']:.1f} sentences/s")

        # Example translations
        print(f"\nEXAMPLE TRANSLATIONS ({self.config.num_examples_to_show}):")
        print("-" * 40)

        hypotheses = results['hypotheses']
        sentence_scores = results['sentence_scores']

        for i in range(min(self.config.num_examples_to_show, len(sources))):
            # Accept both a list of references and a single reference string
            refs = references[i]
            ref = refs[0] if isinstance(refs, list) else refs
            print(f"\n[{i+1}]")
            print(f"  SRC: {sources[i]}")
            print(f"  REF: {ref}")
            print(f"  HYP: {hypotheses[i]}")
            print(f"  ChrF: {sentence_scores[i]['chrf']:.4f}")

        print("\n" + "=" * 70)
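
`_translate_all` above is only a placeholder. One common way to fill it in is to sort sentences by length so that each batch holds inputs of similar size, translate batch by batch, and then restore the original order. The sketch below assumes a hypothetical `translate_batch` callable (for example, a wrapper around beam search) that maps a list of source strings to a list of translations. It is not part of the pipeline code above.

```python
from typing import Callable, List


def translate_in_batches(
    sources: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[str]:
    """Translate sources in length-sorted batches, returning original order."""
    # Sort indices by whitespace token count so each batch needs little padding.
    order = sorted(range(len(sources)), key=lambda i: len(sources[i].split()))
    hypotheses: List[str] = [""] * len(sources)

    for start in range(0, len(order), batch_size):
        batch_indices = order[start:start + batch_size]
        batch = [sources[i] for i in batch_indices]
        outputs = translate_batch(batch)
        if len(outputs) != len(batch):
            raise ValueError("translate_batch must return one output per input")
        # Write each output back to the position of its source sentence.
        for i, hyp in zip(batch_indices, outputs):
            hypotheses[i] = hyp

    return hypotheses


# Stand-in "model" for demonstration only: upper-cases its input.
def fake_translate_batch(batch: List[str]) -> List[str]:
    return [s.upper() for s in batch]


demo = ["ein langer satz mit vielen wörtern", "hallo", "guten morgen"]
print(translate_in_batches(demo, fake_translate_batch, batch_size=2))
```

With a real model, the body of `translate_batch` would typically run under `torch.no_grad()` and handle tokenization and detokenization itself.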

Statistical Significance Testing

Paired Bootstrap Resampling

When comparing two systems, it's important to test whether the difference is statistically significant or just due to chance.

🐍python

import random
from typing import List


def paired_bootstrap_test(
    system_a_scores: List[float],
    system_b_scores: List[float],
    num_samples: int = 10000
) -> float:
    """
    Paired significance test for comparing two systems.

    Tests (one-sided) whether system A is significantly better than
    system B. Note that the procedure below randomly flips the sign of
    each per-sentence difference, which makes it a paired randomization
    (sign-flip) test rather than resampling with replacement. The
    function name is kept as-is.

    Args:
        system_a_scores: Sentence-level scores for system A
        system_b_scores: Sentence-level scores for system B
        num_samples: Number of random sign-flip samples

    Returns:
        p-value: estimated probability of seeing a difference at least as
        large as the observed one if the two systems were equally good
    """
    n = len(system_a_scores)
    if len(system_b_scores) != n:
        raise ValueError("Must have same number of scores")

    # Per-sentence differences (A - B) and the observed total difference
    diffs = [a - b for a, b in zip(system_a_scores, system_b_scores)]
    observed_diff = sum(diffs)

    # Count how often random sign flips give >= observed difference
    count_greater = 0

    for _ in range(num_samples):
        # Random sign flip: under the null hypothesis, the labels
        # "A" and "B" are interchangeable for every sentence
        sample_diff = 0.0
        for diff in diffs:
            if random.random() < 0.5:
                sample_diff += diff
            else:
                sample_diff -= diff

        if sample_diff >= observed_diff:
            count_greater += 1

    p_value = count_greater / num_samples
    return p_value


def demonstrate_significance_testing():
    """
    Demonstrate statistical significance testing.
    """
    print("Statistical Significance Testing")
    print("=" * 60)

    # Simulated sentence-level BLEU scores
    random.seed(42)

    # System A: slightly better
    system_a = [random.gauss(0.35, 0.1) for _ in range(100)]

    # System B: baseline
    system_b = [random.gauss(0.32, 0.1) for _ in range(100)]

    # Compute means
    mean_a = sum(system_a) / len(system_a)
    mean_b = sum(system_b) / len(system_b)

    print(f"System A mean: {mean_a:.4f}")
    print(f"System B mean: {mean_b:.4f}")
    print(f"Difference: {mean_a - mean_b:.4f}")
    print()

    # Run the significance test
    p_value = paired_bootstrap_test(system_a, system_b, num_samples=1000)

    print(f"p-value: {p_value:.4f}")
    print()

    if p_value < 0.05:
        print("Result: Statistically significant (p < 0.05)")
    else:
        print("Result: NOT statistically significant (p >= 0.05)")
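
The function above decides significance by randomly flipping the sign of each per-sentence difference. Resampling test sentences with replacement, as the section title suggests, can be sketched as follows. This is a minimal illustration that assumes per-sentence scores which can simply be summed; for corpus-level BLEU one would instead recompute the metric on each resampled set. The name `paired_bootstrap_resampling` and the `seed` parameter are illustrative choices, not part of the original code.

```python
import random
from typing import List, Optional


def paired_bootstrap_resampling(
    system_a_scores: List[float],
    system_b_scores: List[float],
    num_samples: int = 1000,
    seed: Optional[int] = None,
) -> float:
    """Return the fraction of resampled test sets on which A does NOT beat B."""
    n = len(system_a_scores)
    if len(system_b_scores) != n:
        raise ValueError("Both systems need scores for the same sentences")
    if n == 0:
        raise ValueError("Need at least one sentence")

    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(system_a_scores, system_b_scores)]

    not_better = 0
    for _ in range(num_samples):
        # Draw n sentence indices with replacement. Working on the
        # per-sentence differences means both systems are always scored
        # on the same resampled sentences, which makes the test paired.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1

    return not_better / num_samples


a = [0.40, 0.35, 0.50, 0.45, 0.38, 0.42]
b = [0.30, 0.33, 0.41, 0.40, 0.36, 0.35]
print(paired_bootstrap_resampling(a, b, num_samples=1000, seed=0))
```

In this toy data A scores higher on every sentence, so every resampled set favours A and the function returns 0.0. On real data, a value below 0.05 is commonly read as A being better at the 5% level.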

Interpreting Significance

📝text

INTERPRETING SIGNIFICANCE:
──────────────────────────

p < 0.05:  Significant at the 5% level
           "System A is likely better"

p < 0.01:  Highly significant
           "Strong evidence that System A is better"

p >= 0.05: Not significant
           "Cannot conclude A is better than B"

IMPORTANT:
- Statistical significance ≠ practical significance
- A 0.1 BLEU improvement may be significant but not meaningful
- Always consider effect size alongside p-value
- Multiple comparisons require correction (Bonferroni, etc.)
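
The last point can be made concrete. With the Bonferroni correction, each of m comparisons is tested at α/m instead of α. A small sketch, in which the function name and the example p-values are made up for illustration:

```python
from typing import Dict


def bonferroni_significant(
    p_values: Dict[str, float],
    alpha: float = 0.05,
) -> Dict[str, bool]:
    """Mark each comparison as significant at the Bonferroni-adjusted level."""
    if not p_values:
        return {}
    # Divide the overall error budget evenly across all comparisons.
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical p-values from comparing three systems against one baseline.
example = {"system_a": 0.004, "system_b": 0.03, "system_c": 0.20}
print(bonferroni_significant(example))
```

Here the adjusted threshold is 0.05 / 3 ≈ 0.0167, so `system_b` (p = 0.03) would pass an uncorrected 0.05 threshold but is not significant after correction.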

Error Analysis

Finding and Categorizing Errors

🐍python

from typing import Any, Dict, List


def analyze_translation_errors(
    sources: List[str],
    hypotheses: List[str],
    references: List[str],
    sentence_scores: List[Dict[str, float]]
) -> Dict[str, Any]:
    """
    Analyze translation errors for debugging.

    Categorizes sentences by:
    - Score ranges
    - Length ratio

    Args:
        sources: Source sentences
        hypotheses: Model translations
        references: Reference translations (one string per sentence)
        sentence_scores: Per-sentence metrics (ChrF on a 0-1 scale)

    Returns:
        Analysis dictionary
    """
    n = len(sources)

    # Categorize by score
    score_bins = {
        'excellent': [],  # ChrF > 0.8
        'good': [],       # ChrF 0.6-0.8
        'medium': [],     # ChrF 0.4-0.6
        'poor': [],       # ChrF 0.2-0.4
        'very_poor': [],  # ChrF <= 0.2
    }

    all_entries = []

    for i in range(n):
        chrf = sentence_scores[i]['chrf']
        entry = {
            'idx': i,
            'source': sources[i],
            'hypothesis': hypotheses[i],
            'reference': references[i],
            'chrf': chrf,
            'ter': sentence_scores[i]['ter'],
            'length_ratio': sentence_scores[i]['length_ratio'],
        }
        all_entries.append(entry)

        if chrf > 0.8:
            score_bins['excellent'].append(entry)
        elif chrf > 0.6:
            score_bins['good'].append(entry)
        elif chrf > 0.4:
            score_bins['medium'].append(entry)
        elif chrf > 0.2:
            score_bins['poor'].append(entry)
        else:
            score_bins['very_poor'].append(entry)

    # Length analysis: hypothesis length relative to reference length
    length_issues = {
        'too_short': [e for e in all_entries if e['length_ratio'] < 0.7],
        'too_long': [e for e in all_entries if e['length_ratio'] > 1.3],
    }

    # Sort once by ChrF; the lowest scores come first
    by_chrf = sorted(all_entries, key=lambda x: x['chrf'])

    return {
        'score_distribution': {k: len(v) for k, v in score_bins.items()},
        'score_bins': score_bins,
        'length_issues': length_issues,
        'worst_examples': by_chrf[:10],
        'best_examples': by_chrf[::-1][:10],
    }


def print_error_analysis(analysis: Dict[str, Any]):
    """Print formatted error analysis."""
    print("\n" + "=" * 70)
    print("ERROR ANALYSIS")
    print("=" * 70)

    # Score distribution
    print("\nSCORE DISTRIBUTION:")
    print("-" * 40)
    dist = analysis['score_distribution']
    total = sum(dist.values())
    for category, count in dist.items():
        pct = count / total * 100 if total > 0 else 0
        bar = "█" * int(pct / 2)
        print(f"  {category:<12} {count:>5} ({pct:>5.1f}%) {bar}")

    # Length issues
    print("\nLENGTH ISSUES:")
    print("-" * 40)
    print(f"  Too short (ratio < 0.7): {len(analysis['length_issues']['too_short'])}")
    print(f"  Too long (ratio > 1.3):  {len(analysis['length_issues']['too_long'])}")

    # Worst examples
    print("\nWORST TRANSLATIONS:")
    print("-" * 40)
    for i, ex in enumerate(analysis['worst_examples'][:5], 1):
        print(f"\n[{i}] ChrF: {ex['chrf']:.4f}")
        print(f"  SRC: {ex['source']}")
        print(f"  REF: {ex['reference']}")
        print(f"  HYP: {ex['hypothesis']}")

    # Best examples
    print("\nBEST TRANSLATIONS:")
    print("-" * 40)
    for i, ex in enumerate(analysis['best_examples'][:5], 1):
        print(f"\n[{i}] ChrF: {ex['chrf']:.4f}")
        print(f"  SRC: {ex['source']}")
        print(f"  REF: {ex['reference']}")
        print(f"  HYP: {ex['hypothesis']}")

Evaluation Reporting Template

Standardized Report Format

🐍python

import time
from typing import Any, Dict, Optional


def create_evaluation_report(
    results: Dict[str, Any],
    model_name: str,
    dataset_name: str,
    additional_info: Optional[Dict] = None
) -> str:
    """
    Create standardized evaluation report.

    Args:
        results: Evaluation results dictionary
        model_name: Name of the model
        dataset_name: Name of the test set
        additional_info: Any additional information (key/value pairs)

    Returns:
        Formatted report string
    """
    report = []

    report.append("=" * 70)
    report.append("MACHINE TRANSLATION EVALUATION REPORT")
    report.append("=" * 70)
    report.append("")

    # Metadata
    report.append("EVALUATION DETAILS")
    report.append("-" * 40)
    report.append(f"  Model:        {model_name}")
    report.append(f"  Test Set:     {dataset_name}")
    report.append(f"  Sentences:    {results['num_sentences']}")
    report.append(f"  Date:         {time.strftime('%Y-%m-%d %H:%M:%S')}")
    report.append("")

    # Generation settings
    config = results.get('config', {})
    report.append("GENERATION SETTINGS")
    report.append("-" * 40)
    report.append(f"  Beam Size:      {config.get('beam_size', 'N/A')}")
    report.append(f"  Max Length:     {config.get('max_length', 'N/A')}")
    report.append(f"  Length Penalty: {config.get('length_penalty', 'N/A')}")
    report.append("")

    # Main metrics
    metrics = results['metrics']
    report.append("CORPUS-LEVEL METRICS")
    report.append("-" * 40)
    report.append(f"  BLEU:   {metrics['bleu']:.2f}")
    report.append(f"  ChrF:   {metrics['chrf']:.2f}")
    report.append(f"  TER:    {metrics['ter']:.2f}")
    report.append("")

    # Confidence intervals (if available)
    if results.get('confidence_intervals'):
        report.append("95% CONFIDENCE INTERVALS")
        report.append("-" * 40)
        for metric, (lower, upper) in results['confidence_intervals'].items():
            report.append(f"  {metric.upper()}: [{lower:.2f}, {upper:.2f}]")
        report.append("")

    # BLEU signature
    # These values are hard-coded; update them if the scorer's casing,
    # smoothing, or tokenization settings differ
    report.append("REPRODUCIBILITY")
    report.append("-" * 40)
    report.append("  Tokenization: default (lowercased)")
    report.append("  BLEU Signature: BLEU+case.lc+smooth.none+tok.default")
    report.append("")

    # Performance
    report.append("PERFORMANCE")
    report.append("-" * 40)
    report.append(f"  Translation Time: {results['translation_time']:.2f}s")
    report.append(f"  Speed: {results['sentences_per_second']:.1f} sentences/sec")
    report.append("")

    # Additional information (if provided)
    if additional_info:
        report.append("ADDITIONAL INFORMATION")
        report.append("-" * 40)
        for key, value in additional_info.items():
            report.append(f"  {key}: {value}")
        report.append("")

    report.append("=" * 70)

    return "\n".join(report)

Complete Evaluation Example

End-to-End Usage

🐍python

def complete_evaluation_example():
    """
    Complete example of evaluation workflow.
    """
    print("Complete Evaluation Workflow")
    print("=" * 70)

    print("""
    STEP-BY-STEP WORKFLOW:
    ─────────────────────

    1. PREPARE DATA:
       # Load test set
       sources = load_file('test.de')
       references = load_file('test.en')

    2. LOAD MODEL:
       checkpoint = torch.load('best_model.pt')
       model = Transformer(**checkpoint['model_config'])
       model.load_state_dict(checkpoint['model_state_dict'])
       model.eval()

    3. TRANSLATE:
       hypotheses = []
       for batch in batched(sources, batch_size=32):
           with torch.no_grad():
               translations = beam_search(model, batch)
           hypotheses.extend(translations)

    4. DETOKENIZE:
       # Convert subwords back to text
       hypotheses = [detokenize(h) for h in hypotheses]

    5. EVALUATE:
       evaluator = TranslationEvaluator()
       results = evaluator.evaluate(hypotheses, references)

    6. ANALYZE:
       analysis = analyze_translation_errors(
           sources, hypotheses, references, results['sentence_scores']
       )

    7. REPORT:
       report = create_evaluation_report(results, 'My Model', 'Multi30k')
       print(report)

       # Save results
       with open('evaluation_results.json', 'w') as f:
           json.dump(results, f)


    COMMON PITFALLS:
    ────────────────

    1. Tokenization mismatch:
       - Always use same tokenization for hyp and ref
       - Prefer SacreBLEU for standardization

    2. Comparing different settings:
       - Document beam size, length penalty, etc.
       - These significantly affect scores

    3. Test set contamination:
       - Never tune hyperparameters on test set
       - Use separate validation set

    4. Cherry-picking:
       - Report corpus-level scores
       - Show confidence intervals
       - Include all experiments (even failed ones)
    """)
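
The workflow above calls `load_file` and `batched` without defining them. A minimal sketch of what they might look like, assuming one sentence per line in UTF-8 text files. Both helpers are assumptions for illustration, not definitions taken from earlier sections.

```python
from typing import Iterator, List, Sequence


def load_file(path: str) -> List[str]:
    """Read one sentence per line, dropping only the trailing newline."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def check_parallel(sources: List[str], references: List[str]) -> None:
    """Fail early if the source and reference files are not line-aligned."""
    if len(sources) != len(references):
        raise ValueError(
            f"Got {len(sources)} sources but {len(references)} references"
        )


print(list(batched(["a", "b", "c", "d", "e"], batch_size=2)))
```

The final call prints `[['a', 'b'], ['c', 'd'], ['e']]`. Checking alignment right after loading catches truncated or mismatched files before any time is spent on translation.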

Summary

Evaluation Pipeline Components

Component	Purpose
TranslationEvaluationPipeline	End-to-end evaluation
BLEUScoreAccumulator	Corpus-level BLEU
ChrFScore	Character-level metrics
TERScore	Edit distance metric

Best Practices

Use standardized tokenization (SacreBLEU style)
Report multiple metrics (BLEU + ChrF minimum)
Include confidence intervals for statistical rigor
Document all settings for reproducibility
Perform error analysis to understand failures

Key Outputs

Output	Purpose
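{prefix}_metrics.json	Numerical scores (written by save_results; default prefix is "eval")
{prefix}_translations.txt	Model outputs, one hypothesis per line
{prefix}_sentence_scores.json	Per-sentence analysis
report.txt	Human-readable summary (the string returned by create_evaluation_report, saved manually)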

Chapter Summary

In this chapter, we covered:

BLEU Score: N-gram precision with brevity penalty
ChrF: Character-level F-score for morphological languages
TER: Edit distance metric
METEOR concepts: Alignment-based evaluation
Evaluation Pipeline: Complete workflow for model evaluation

Our target for the German-English translation project is a BLEU score of 30-35 on Multi30k.

Next Chapter Preview

In the next chapter, we'll begin the Multi30k Translation Project, where we'll apply everything we've learned to build a complete German-to-English translation system.