Introduction
This section brings together all evaluation components into a practical pipeline for evaluating translation models. We'll cover batch translation, proper tokenization handling, statistical significance testing, and reporting.
Complete Evaluation Pipeline
Pipeline Overview
```text
EVALUATION PIPELINE:
────────────────────

┌─────────────────┐
│    Test Data    │  source.txt, reference.txt
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Load Model    │  From checkpoint
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Translate    │  Batch inference with beam search
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Detokenize    │  Convert subwords back to text
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│      Score      │  BLEU, ChrF, TER
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Report      │  Scores + analysis
└─────────────────┘
```
Implementation
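The Detokenize step in the diagram has no counterpart in the pipeline class that follows, so here is a minimal sketch of what it might look like. It assumes the model emits either SentencePiece-style pieces (word starts marked with `▁`) or subword-nmt-style BPE tokens (continuations marked with a trailing `@@`). The function names are illustrative and not part of this chapter's codebase.

```python
from typing import List

SPM_SPACE = "\u2581"  # "▁", the SentencePiece word-boundary marker


def detokenize_sentencepiece(pieces: List[str]) -> str:
    """Join SentencePiece-style pieces into plain text (assumed format)."""
    # Pieces concatenate directly; the marker becomes a space
    return "".join(pieces).replace(SPM_SPACE, " ").strip()


def detokenize_bpe(tokens: List[str], marker: str = "@@") -> str:
    """Join BPE tokens where a trailing marker means 'word continues'."""
    text = " ".join(tokens).replace(marker + " ", "")
    # A dangling marker on the final token has no following space
    if text.endswith(marker):
        text = text[: -len(marker)]
    return text.strip()


if __name__ == "__main__":
    print(detokenize_sentencepiece(["\u2581A", "\u2581man", "\u2581is", "\u2581walk", "ing", "."]))
    print(detokenize_bpe(["A", "man", "is", "walk@@", "ing", "."]))
```

Neither function reattaches punctuation that was split off by a word-level tokenizer; that would need a separate detokenizer applied afterwards, and the same treatment must be applied to hypotheses and references before scoring.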
```python
import torch
import torch.nn as nn
from typing import List, Dict, Optional, Tuple, Any
from dataclasses import dataclass
from pathlib import Path
import json
import random
import time
from collections import Counter
import math


@dataclass
class EvaluationConfig:
    """Configuration for evaluation pipeline."""
    # Model
    checkpoint_path: str
    device: str = "cuda"

    # Generation
    beam_size: int = 5
    max_length: int = 128
    length_penalty: float = 1.0

    # Tokenization
    tokenizer_path: Optional[str] = None

    # Output
    output_dir: str = "evaluation_results"
    save_translations: bool = True
    save_detailed_scores: bool = True

    # Analysis
    num_examples_to_show: int = 10
    compute_bootstrap_ci: bool = False
    bootstrap_samples: int = 1000


class TranslationEvaluationPipeline:
    """
    Complete pipeline for evaluating translation models.

    Handles:
    - Loading model and tokenizer
    - Batch translation with beam search
    - Multi-metric evaluation
    - Result reporting and analysis

    Args:
        config: EvaluationConfig instance
        model: Transformer model (optional, will load from checkpoint)
        tokenizer: Tokenizer (optional, will load from config path)

    Example:
        >>> pipeline = TranslationEvaluationPipeline(config)
        >>> results = pipeline.evaluate(test_sources, test_references)
        >>> pipeline.save_results(results)
    """

    def __init__(
        self,
        config: EvaluationConfig,
        model: Optional[nn.Module] = None,
        tokenizer: Optional[Any] = None
    ):
        self.config = config
        self.device = torch.device(config.device)

        # Load model if not provided
        if model is None:
            self.model = self._load_model()
        else:
            self.model = model.to(self.device)

        # _load_model() is still a placeholder that returns None, so guard
        # the call instead of failing with AttributeError
        if self.model is not None:
            self.model.eval()

        # Use the tokenizer passed in (loading from tokenizer_path is not
        # implemented in this placeholder pipeline)
        self.tokenizer = tokenizer

        # Initialize evaluators
        self._init_evaluators()

        # Create output directory
        Path(config.output_dir).mkdir(parents=True, exist_ok=True)

    def _load_model(self) -> Optional[nn.Module]:
        """Load model from checkpoint (placeholder: returns None)."""
        checkpoint = torch.load(
            self.config.checkpoint_path,
            map_location=self.device
        )

        # This would use actual model class in real code
        # model = Transformer(**checkpoint['model_config'])
        # model.load_state_dict(checkpoint['model_state_dict'])

        print(f"Loaded model from {self.config.checkpoint_path}")
        # Placeholder
        return None

    def _init_evaluators(self):
        """Initialize metric evaluators."""
        # These use our implementations from previous sections
        self.bleu_scorer = BLEUScoreAccumulator()
        self.chrf_scorer = ChrFScore()
        self.ter_scorer = TERScore()

    def evaluate(
        self,
        sources: List[str],
        references: List[List[str]],
        batch_size: int = 32
    ) -> Dict[str, Any]:
        """
        Run complete evaluation pipeline.

        Args:
            sources: Source sentences
            references: Reference translations (list of lists for multi-ref;
                only the first reference of each sentence is scored)
            batch_size: Batch size for translation

        Returns:
            Dictionary with all results
        """
        print(f"Evaluating {len(sources)} sentences...")
        start_time = time.time()

        # Step 1: Translate
        print("Translating...")
        hypotheses = self._translate_all(sources, batch_size)

        translation_time = time.time() - start_time
        print(f"Translation completed in {translation_time:.2f}s")

        # Step 2: Compute metrics
        print("Computing metrics...")
        metrics = self._compute_metrics(hypotheses, references)

        # Step 3: Compute sentence-level scores
        sentence_scores = self._compute_sentence_scores(hypotheses, references)

        # Step 4: Bootstrap confidence intervals (optional)
        confidence_intervals = {}
        if self.config.compute_bootstrap_ci:
            print("Computing confidence intervals...")
            confidence_intervals = self._bootstrap_ci(
                hypotheses, references
            )

        # Step 5: Prepare results
        results = {
            'metrics': metrics,
            'sentence_scores': sentence_scores,
            'confidence_intervals': confidence_intervals,
            'hypotheses': hypotheses,
            'num_sentences': len(sources),
            'translation_time': translation_time,
            # Guard against a zero elapsed time (possible with the
            # placeholder translator on a coarse clock)
            'sentences_per_second': len(sources) / max(translation_time, 1e-9),
            'config': {
                'beam_size': self.config.beam_size,
                'max_length': self.config.max_length,
                'length_penalty': self.config.length_penalty,
            }
        }

        return results

    def _translate_all(
        self,
        sources: List[str],
        batch_size: int
    ) -> List[str]:
        """Translate all source sentences."""
        hypotheses = []

        # For demonstration, return placeholder translations
        # In real code, this would use the model
        for source in sources:
            # Placeholder: echo source (would use actual model)
            hypotheses.append(source.lower())

        return hypotheses

    def _compute_metrics(
        self,
        hypotheses: List[str],
        references: List[List[str]]
    ) -> Dict[str, float]:
        """Compute all corpus-level metrics."""
        # Reset scorers
        self.bleu_scorer.reset()
        self.chrf_scorer.reset()
        self.ter_scorer.reset()

        # Accumulate (first reference only)
        for hyp, refs in zip(hypotheses, references):
            ref = refs[0] if isinstance(refs, list) else refs
            self.bleu_scorer.add(hyp, ref)
            self.chrf_scorer.add(hyp, ref)
            self.ter_scorer.add(hyp, ref)

        return {
            'bleu': self.bleu_scorer.compute() * 100,
            'chrf': self.chrf_scorer.corpus_score()['chrf'] * 100,
            'ter': self.ter_scorer.corpus_score() * 100,
        }

    def _compute_sentence_scores(
        self,
        hypotheses: List[str],
        references: List[List[str]]
    ) -> List[Dict[str, float]]:
        """Compute sentence-level scores for analysis."""
        scores = []

        for hyp, refs in zip(hypotheses, references):
            ref = refs[0] if isinstance(refs, list) else refs
            scores.append({
                'chrf': self.chrf_scorer.sentence_score(hyp, ref),
                'ter': self.ter_scorer.sentence_score(hyp, ref),
                'length_ratio': len(hyp.split()) / max(len(ref.split()), 1),
            })

        return scores

    def _bootstrap_ci(
        self,
        hypotheses: List[str],
        references: List[List[str]],
        confidence: float = 0.95
    ) -> Dict[str, Tuple[float, float]]:
        """
        Compute bootstrap confidence intervals.

        Args:
            hypotheses: Hypothesis translations
            references: Reference translations
            confidence: Confidence level (default: 95%)

        Returns:
            Dictionary mapping metric name to (lower, upper) bounds
        """
        n = len(hypotheses)
        bleu_scores = []
        chrf_scores = []

        for _ in range(self.config.bootstrap_samples):
            # Sample sentence indices with replacement
            indices = [random.randint(0, n - 1) for _ in range(n)]
            sampled_hyp = [hypotheses[i] for i in indices]
            sampled_ref = [references[i] for i in indices]

            # Recompute corpus-level metrics on the resampled test set
            sample_metrics = self._compute_metrics(sampled_hyp, sampled_ref)
            bleu_scores.append(sample_metrics['bleu'])
            chrf_scores.append(sample_metrics['chrf'])

        # Percentile indices, clamped so upper_idx never runs off the end
        alpha = 1 - confidence
        num = len(bleu_scores)
        lower_idx = int(alpha / 2 * num)
        upper_idx = min(int((1 - alpha / 2) * num), num - 1)

        bleu_sorted = sorted(bleu_scores)
        chrf_sorted = sorted(chrf_scores)

        return {
            'bleu': (bleu_sorted[lower_idx], bleu_sorted[upper_idx]),
            'chrf': (chrf_sorted[lower_idx], chrf_sorted[upper_idx]),
        }

    def save_results(self, results: Dict[str, Any], prefix: str = "eval"):
        """Save evaluation results to files."""
        output_dir = Path(self.config.output_dir)

        # Save metrics summary
        metrics_path = output_dir / f"{prefix}_metrics.json"
        with open(metrics_path, 'w', encoding='utf-8') as f:
            json.dump({
                'metrics': results['metrics'],
                'confidence_intervals': results.get('confidence_intervals', {}),
                'config': results['config'],
                'num_sentences': results['num_sentences'],
                'translation_time': results['translation_time'],
            }, f, indent=2)

        print(f"Saved metrics to {metrics_path}")

        # Save translations (UTF-8 so non-ASCII output survives on any OS)
        if self.config.save_translations:
            trans_path = output_dir / f"{prefix}_translations.txt"
            with open(trans_path, 'w', encoding='utf-8') as f:
                for hyp in results['hypotheses']:
                    f.write(hyp + '\n')
            print(f"Saved translations to {trans_path}")

        # Save detailed scores
        if self.config.save_detailed_scores:
            scores_path = output_dir / f"{prefix}_sentence_scores.json"
            with open(scores_path, 'w', encoding='utf-8') as f:
                json.dump(results['sentence_scores'], f, indent=2)
            print(f"Saved sentence scores to {scores_path}")

    def print_report(
        self,
        results: Dict[str, Any],
        sources: List[str],
        references: List[List[str]]
    ):
        """Print formatted evaluation report."""
        print("\n" + "=" * 70)
        print("TRANSLATION EVALUATION REPORT")
        print("=" * 70)

        # Metrics
        print("\nCORPUS-LEVEL METRICS:")
        print("-" * 40)
        metrics = results['metrics']
        print(f"  BLEU: {metrics['bleu']:.2f}")
        print(f"  ChrF: {metrics['chrf']:.2f}")
        print(f"  TER:  {metrics['ter']:.2f} (lower is better)")

        # Confidence intervals
        if results.get('confidence_intervals'):
            print("\n95% CONFIDENCE INTERVALS:")
            print("-" * 40)
            for metric, (lower, upper) in results['confidence_intervals'].items():
                print(f"  {metric.upper()}: [{lower:.2f}, {upper:.2f}]")

        # Statistics
        print("\nSTATISTICS:")
        print("-" * 40)
        print(f"  Sentences evaluated: {results['num_sentences']}")
        print(f"  Translation time: {results['translation_time']:.2f}s")
        print(f"  Speed: {results['sentences_per_second']:.1f} sentences/s")

        # Example translations
        print(f"\nEXAMPLE TRANSLATIONS ({self.config.num_examples_to_show}):")
        print("-" * 40)

        hypotheses = results['hypotheses']
        sentence_scores = results['sentence_scores']

        for i in range(min(self.config.num_examples_to_show, len(sources))):
            # Accept both single-reference strings and multi-reference lists,
            # matching _compute_metrics
            refs = references[i]
            ref = refs[0] if isinstance(refs, list) else refs
            print(f"\n[{i+1}]")
            print(f"  SRC: {sources[i]}")
            print(f"  REF: {ref}")
            print(f"  HYP: {hypotheses[i]}")
            print(f"  ChrF: {sentence_scores[i]['chrf']:.4f}")

        print("\n" + "=" * 70)
```
Statistical Significance Testing
Paired Bootstrap Resampling
When comparing two systems, it's important to test whether the difference is statistically significant or just due to chance.
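As a point of comparison, here is a minimal sketch of paired bootstrap resampling in the resampling-with-replacement sense: draw many artificial test sets from the real one, score both systems on the same drawn sentences, and count how often system A comes out ahead. It assumes the metric can be approximated by averaging sentence-level scores. That is reasonable for sentence-level ChrF, but for corpus BLEU the corpus statistic would have to be recomputed on every resample. The function name, parameters, and defaults are illustrative.

```python
import random
from typing import List, Optional


def paired_bootstrap_resampling(
    scores_a: List[float],
    scores_b: List[float],
    num_samples: int = 1000,
    seed: Optional[int] = None,
) -> float:
    """Return the fraction of resampled test sets on which A beats B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems need one score per test sentence")
    n = len(scores_a)
    if n == 0:
        raise ValueError("Need at least one sentence")

    rng = random.Random(seed)  # local RNG so the global state is untouched
    wins_a = 0
    for _ in range(num_samples):
        # Draw sentence indices with replacement; using the same indices
        # for both systems is what makes the test "paired"
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins_a += 1
    return wins_a / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    a = [rng.gauss(0.35, 0.1) for _ in range(100)]
    b = [rng.gauss(0.32, 0.1) for _ in range(100)]
    print(f"A wins on {paired_bootstrap_resampling(a, b, seed=1):.1%} of resamples")
```

A win rate of 0.95 or higher is commonly read as system A being better at the 95% confidence level; `1 - win_rate` plays the role of a one-sided p-value.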
```python
import random
from typing import List


def paired_bootstrap_test(
    system_a_scores: List[float],
    system_b_scores: List[float],
    num_samples: int = 10000
) -> float:
    """
    Paired significance test for comparing two systems.

    Tests whether system A is significantly better than system B.

    Note: despite the name, this implementation does not resample
    sentences with replacement. It randomly flips the sign of each
    per-sentence difference, which is a paired randomization
    (sign-flip permutation) test.

    Args:
        system_a_scores: Sentence-level scores for system A
        system_b_scores: Sentence-level scores for system B
        num_samples: Number of random sign-flip samples

    Returns:
        One-sided p-value: the fraction of samples whose total difference
        is at least as large as the observed one, under the null
        hypothesis that the two systems are interchangeable
    """
    n = len(system_a_scores)
    assert len(system_b_scores) == n, "Must have same number of scores"

    # Per-sentence differences (computed once, reused in every sample)
    diffs = [a - b for a, b in zip(system_a_scores, system_b_scores)]

    # Observed difference
    observed_diff = sum(diffs)

    # Count how often random sign flips give >= observed difference
    count_greater = 0

    for _ in range(num_samples):
        # Random sign flip for each sentence pair
        sample_diff = 0.0
        for diff in diffs:
            if random.random() < 0.5:
                sample_diff += diff
            else:
                sample_diff -= diff

        if sample_diff >= observed_diff:
            count_greater += 1

    p_value = count_greater / num_samples
    return p_value


def demonstrate_significance_testing():
    """
    Demonstrate statistical significance testing.
    """
    print("Statistical Significance Testing")
    print("=" * 60)

    # Simulated sentence-level BLEU scores
    random.seed(42)

    # System A: slightly better
    system_a = [random.gauss(0.35, 0.1) for _ in range(100)]

    # System B: baseline
    system_b = [random.gauss(0.32, 0.1) for _ in range(100)]

    # Compute means
    mean_a = sum(system_a) / len(system_a)
    mean_b = sum(system_b) / len(system_b)

    print(f"System A mean: {mean_a:.4f}")
    print(f"System B mean: {mean_b:.4f}")
    print(f"Difference: {mean_a - mean_b:.4f}")
    print()

    # Run the significance test
    p_value = paired_bootstrap_test(system_a, system_b, num_samples=1000)

    print(f"p-value: {p_value:.4f}")
    print()

    if p_value < 0.05:
        print("Result: Statistically significant (p < 0.05)")
    else:
        print("Result: NOT statistically significant (p >= 0.05)")
```
Interpreting Significance
```text
INTERPRETING SIGNIFICANCE:
──────────────────────────

p < 0.05:  Significant at the 5% level
           "System A is likely better"

p < 0.01:  Highly significant
           "Strong evidence that System A is better"

p >= 0.05: Not significant
           "Cannot conclude A is better than B"

IMPORTANT:
- Statistical significance ≠ practical significance
- A 0.1 BLEU improvement may be significant but not meaningful
- Always consider effect size alongside p-value
- Multiple comparisons require correction (Bonferroni, etc.)
```
Error Analysis
Finding and Categorizing Errors
```python
def analyze_translation_errors(
    sources: List[str],
    hypotheses: List[str],
    references: List[str],
    sentence_scores: List[Dict[str, float]]
) -> Dict[str, Any]:
    """
    Analyze translation errors for debugging.

    Categorizes sentences by:
    - Score ranges
    - Length ratio
    - Common error patterns

    Args:
        sources: Source sentences
        hypotheses: Model translations
        references: Reference translations (one per sentence)
        sentence_scores: Per-sentence metrics (ChrF on a 0-1 scale)

    Returns:
        Analysis dictionary
    """
    n = len(sources)

    # Categorize by score
    score_bins = {
        'excellent': [],  # ChrF > 0.8
        'good': [],       # ChrF 0.6-0.8
        'medium': [],     # ChrF 0.4-0.6
        'poor': [],       # ChrF 0.2-0.4
        'very_poor': [],  # ChrF <= 0.2
    }

    for i in range(n):
        chrf = sentence_scores[i]['chrf']
        entry = {
            'idx': i,
            'source': sources[i],
            'hypothesis': hypotheses[i],
            'reference': references[i],
            'chrf': chrf,
            'ter': sentence_scores[i]['ter'],
            'length_ratio': sentence_scores[i]['length_ratio'],
        }

        if chrf > 0.8:
            score_bins['excellent'].append(entry)
        elif chrf > 0.6:
            score_bins['good'].append(entry)
        elif chrf > 0.4:
            score_bins['medium'].append(entry)
        elif chrf > 0.2:
            score_bins['poor'].append(entry)
        else:
            score_bins['very_poor'].append(entry)

    # Flatten the bins once; reused for length analysis and ranking
    all_entries = [e for s in score_bins.values() for e in s]

    # Length analysis
    length_issues = {
        'too_short': [e for e in all_entries if e['length_ratio'] < 0.7],
        'too_long': [e for e in all_entries if e['length_ratio'] > 1.3],
    }

    return {
        'score_distribution': {k: len(v) for k, v in score_bins.items()},
        'score_bins': score_bins,
        'length_issues': length_issues,
        'worst_examples': sorted(all_entries, key=lambda x: x['chrf'])[:10],
        'best_examples': sorted(all_entries, key=lambda x: -x['chrf'])[:10],
    }


def print_error_analysis(analysis: Dict[str, Any]):
    """Print formatted error analysis."""
    print("\n" + "=" * 70)
    print("ERROR ANALYSIS")
    print("=" * 70)

    # Score distribution
    print("\nSCORE DISTRIBUTION:")
    print("-" * 40)
    dist = analysis['score_distribution']
    total = sum(dist.values())
    for category, count in dist.items():
        pct = count / total * 100 if total > 0 else 0
        bar = "█" * int(pct / 2)
        print(f"  {category:<12} {count:>5} ({pct:>5.1f}%) {bar}")

    # Length issues
    print("\nLENGTH ISSUES:")
    print("-" * 40)
    print(f"  Too short (ratio < 0.7): {len(analysis['length_issues']['too_short'])}")
    print(f"  Too long (ratio > 1.3): {len(analysis['length_issues']['too_long'])}")

    # Worst examples
    print("\nWORST TRANSLATIONS:")
    print("-" * 40)
    for i, ex in enumerate(analysis['worst_examples'][:5], 1):
        print(f"\n[{i}] ChrF: {ex['chrf']:.4f}")
        print(f"  SRC: {ex['source']}")
        print(f"  REF: {ex['reference']}")
        print(f"  HYP: {ex['hypothesis']}")

    # Best examples
    print("\nBEST TRANSLATIONS:")
    print("-" * 40)
    for i, ex in enumerate(analysis['best_examples'][:5], 1):
        print(f"\n[{i}] ChrF: {ex['chrf']:.4f}")
        print(f"  SRC: {ex['source']}")
        print(f"  REF: {ex['reference']}")
        print(f"  HYP: {ex['hypothesis']}")
```
Evaluation Reporting Template
Standardized Report Format
```python
def create_evaluation_report(
    results: Dict[str, Any],
    model_name: str,
    dataset_name: str,
    additional_info: Optional[Dict] = None
) -> str:
    """
    Create standardized evaluation report.

    Args:
        results: Evaluation results dictionary
        model_name: Name of the model
        dataset_name: Name of the test set
        additional_info: Any additional information (rendered as key: value)

    Returns:
        Formatted report string
    """
    report = []

    report.append("=" * 70)
    report.append("MACHINE TRANSLATION EVALUATION REPORT")
    report.append("=" * 70)
    report.append("")

    # Metadata
    report.append("EVALUATION DETAILS")
    report.append("-" * 40)
    report.append(f"  Model: {model_name}")
    report.append(f"  Test Set: {dataset_name}")
    report.append(f"  Sentences: {results['num_sentences']}")
    report.append(f"  Date: {time.strftime('%Y-%m-%d %H:%M:%S')}")
    report.append("")

    # Generation settings
    config = results.get('config', {})
    report.append("GENERATION SETTINGS")
    report.append("-" * 40)
    report.append(f"  Beam Size: {config.get('beam_size', 'N/A')}")
    report.append(f"  Max Length: {config.get('max_length', 'N/A')}")
    report.append(f"  Length Penalty: {config.get('length_penalty', 'N/A')}")
    report.append("")

    # Main metrics
    metrics = results['metrics']
    report.append("CORPUS-LEVEL METRICS")
    report.append("-" * 40)
    report.append(f"  BLEU: {metrics['bleu']:.2f}")
    report.append(f"  ChrF: {metrics['chrf']:.2f}")
    report.append(f"  TER: {metrics['ter']:.2f}")
    report.append("")

    # Confidence intervals (if available)
    if results.get('confidence_intervals'):
        report.append("95% CONFIDENCE INTERVALS")
        report.append("-" * 40)
        for metric, (lower, upper) in results['confidence_intervals'].items():
            report.append(f"  {metric.upper()}: [{lower:.2f}, {upper:.2f}]")
        report.append("")

    # BLEU signature (hard-coded: update it to match the tokenization,
    # casing, and smoothing actually used when scoring)
    report.append("REPRODUCIBILITY")
    report.append("-" * 40)
    report.append("  Tokenization: default (lowercased)")
    report.append("  BLEU Signature: BLEU+case.lc+smooth.none+tok.default")
    report.append("")

    # Performance
    report.append("PERFORMANCE")
    report.append("-" * 40)
    report.append(f"  Translation Time: {results['translation_time']:.2f}s")
    report.append(f"  Speed: {results['sentences_per_second']:.1f} sentences/sec")
    report.append("")

    # Additional information (previously accepted but never rendered)
    if additional_info:
        report.append("ADDITIONAL INFO")
        report.append("-" * 40)
        for key, value in additional_info.items():
            report.append(f"  {key}: {value}")
        report.append("")

    report.append("=" * 70)

    return "\n".join(report)
```
Complete Evaluation Example
End-to-End Usage
```python
def complete_evaluation_example():
    """
    Complete example of evaluation workflow.
    """
    print("Complete Evaluation Workflow")
    print("=" * 70)

    print("""
    STEP-BY-STEP WORKFLOW:
    ─────────────────────

    1. PREPARE DATA:
       # Load test set
       sources = load_file('test.de')
       references = load_file('test.en')

    2. LOAD MODEL:
       checkpoint = torch.load('best_model.pt')
       model = Transformer(**checkpoint['model_config'])
       model.load_state_dict(checkpoint['model_state_dict'])
       model.eval()

    3. TRANSLATE:
       hypotheses = []
       for batch in batched(sources, batch_size=32):
           with torch.no_grad():
               translations = beam_search(model, batch)
           hypotheses.extend(translations)

    4. DETOKENIZE:
       # Convert subwords back to text
       hypotheses = [detokenize(h) for h in hypotheses]

    5. EVALUATE:
       evaluator = TranslationEvaluator()
       results = evaluator.evaluate(hypotheses, references)

    6. ANALYZE:
       analysis = analyze_translation_errors(
           sources, hypotheses, references, results['sentence_scores']
       )

    7. REPORT:
       report = create_evaluation_report(results, 'My Model', 'Multi30k')
       print(report)

       # Save results
       with open('evaluation_results.json', 'w') as f:
           json.dump(results, f)


    COMMON PITFALLS:
    ────────────────

    1. Tokenization mismatch:
       - Always use same tokenization for hyp and ref
       - Prefer SacreBLEU for standardization

    2. Comparing different settings:
       - Document beam size, length penalty, etc.
       - These significantly affect scores

    3. Test set contamination:
       - Never tune hyperparameters on test set
       - Use separate validation set

    4. Cherry-picking:
       - Report corpus-level scores
       - Show confidence intervals
       - Include all experiments (even failed ones)
    """)
```
Summary
Evaluation Pipeline Components
| Component | Purpose |
|---|---|
| TranslationEvaluationPipeline | End-to-end evaluation |
| BLEUScoreAccumulator | Corpus-level BLEU |
| ChrFScore | Character-level metrics |
| TERScore | Edit distance metric |
Best Practices
- Use standardized tokenization (SacreBLEU style)
- Report multiple metrics (BLEU + ChrF minimum)
- Include confidence intervals for statistical rigor
- Document all settings for reproducibility
- Perform error analysis to understand failures
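Statistical rigor also covers the multiple-comparisons point raised under Interpreting Significance: when several systems are each tested against the same baseline, the per-test threshold should be tightened. The sketch below applies the Bonferroni correction, which divides the significance level by the number of tests. The function name and the example system names are hypothetical.

```python
from typing import Dict


def bonferroni_significant(
    p_values: Dict[str, float],
    alpha: float = 0.05,
) -> Dict[str, bool]:
    """Flag which comparisons stay significant after Bonferroni correction."""
    if not p_values:
        return {}
    # Each individual test must clear alpha / (number of tests)
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


if __name__ == "__main__":
    # Hypothetical p-values from three system-vs-baseline comparisons
    p_values = {"system_a": 0.01, "system_b": 0.03, "system_c": 0.20}
    for name, significant in bonferroni_significant(p_values).items():
        print(f"{name}: {'significant' if significant else 'not significant'}")
```

Bonferroni is conservative; it is used here because it is the correction the section names and because it is simple to audit.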
Key Outputs
| Output | Purpose |
|---|---|
| metrics.json | Numerical scores |
| translations.txt | Model outputs |
| sentence_scores.json | Per-sentence analysis |
| report.txt | Human-readable summary |
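`save_results` writes the first three files in this table, but nothing in this section writes `report.txt`. One possible way to close that gap is sketched below, assuming the report string comes from `create_evaluation_report`. The helper name `save_report` and the `{prefix}_report.txt` naming are assumptions chosen to mirror `save_results`.

```python
from pathlib import Path


def save_report(report: str, output_dir: str, prefix: str = "eval") -> Path:
    """Write a report string to <output_dir>/<prefix>_report.txt."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    report_path = out_dir / f"{prefix}_report.txt"
    # UTF-8 keeps box-drawing characters and non-ASCII text intact
    report_path.write_text(report + "\n", encoding="utf-8")
    return report_path


if __name__ == "__main__":
    path = save_report("BLEU: 31.20", "evaluation_results")
    print(f"Saved report to {path}")
```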
Chapter Summary
In this chapter, we covered:
- BLEU Score: N-gram precision with brevity penalty
- ChrF: Character-level F-score, well suited to morphologically rich languages
- TER: Translation Edit Rate, an edit-distance metric (lower is better)
- METEOR concepts: Alignment-based evaluation
- Evaluation Pipeline: Complete workflow for model evaluation
Our target for the German-English translation project is 30-35 BLEU on Multi30k.
Next Chapter Preview
In the next chapter, we'll begin the Multi30k Translation Project, where we'll apply everything we've learned to build a complete German-to-English translation system.