In this chapter, we begin our capstone project: building a German-to-English translation system. We'll use the Multi30k dataset, a widely used benchmark for multimodal machine translation research. By the end of this project, we aim to achieve 30-35 BLEU on the test set.
1.1 About Multi30k
Dataset Description
```text
MULTI30K DATASET:
─────────────────

Original Name: Multi30k - Multilingual Image Description Dataset
Purpose: Multimodal Machine Translation
Source: Extension of Flickr30k image descriptions

Key Statistics:
  Training:   29,000 sentence pairs
  Validation:  1,014 sentence pairs
  Test 2016:   1,000 sentence pairs
  Test 2017:   1,000 sentence pairs

Languages: German (DE) → English (EN)
  Also available: French, Czech

Domain: Image descriptions
  Short, simple sentences describing everyday scenes

Example Pairs:
  DE: "Ein Mann in einem blauen Hemd steht vor einer Garage."
  EN: "A man in a blue shirt is standing in front of a garage."

  DE: "Ein Kind spielt mit einem Ball im Park."
  EN: "A child is playing with a ball in the park."
```
Why Multi30k?
```python
def why_multi30k():
    """
    Reasons for choosing Multi30k for this course.
    """
    print("Why Multi30k for Learning?")
    print("=" * 60)

    print("""
    1. MANAGEABLE SIZE:
    ───────────────────
    - ~30K training examples
    - Can train on single GPU in hours
    - Can even train on CPU (slowly)

    Compare to WMT datasets:
      WMT14 EN-DE: 4.5M sentence pairs
      → Days of GPU time

    2. SIMPLE DOMAIN:
    ─────────────────
    - Short sentences (avg ~12 words)
    - Everyday vocabulary
    - Clear, descriptive language
    - Less ambiguity than news/literature

    3. WELL-ESTABLISHED:
    ────────────────────
    - Published baselines to compare
    - Standardized test sets
    - Active research community
    - Easy to find reference implementations

    4. GOOD FOR VALIDATION:
    ───────────────────────
    - Small test sets = fast evaluation
    - Multiple test years (2016, 2017)
    - Clear quality metrics


    EXPECTED PERFORMANCE:
    ─────────────────────

    Model Type               BLEU (DE→EN)
    ─────────────────────    ────────────
    Our target               30-35
    Transformer-base         35-40
    State-of-the-art         45+
    With image features      50+

    Note: Using text only (no images) in this course.
    """)


why_multi30k()
```
1.2 Dataset Structure
File Organization
```text
MULTI30K DATASET STRUCTURE:
───────────────────────────

multi30k/
├── data/
│   └── task1/
│       └── raw/
│           ├── train.de                # German training sentences
│           ├── train.en                # English training sentences
│           ├── val.de                  # German validation
│           ├── val.en                  # English validation
│           ├── test_2016_flickr.de     # Test 2016
│           ├── test_2016_flickr.en
│           ├── test_2017_flickr.de     # Test 2017
│           └── test_2017_flickr.en
│
└── images/                             # (Not used in this course)
    ├── train/
    ├── val/
    └── test/
```
Data Format
```python
def show_data_format():
    """
    Show the format of Multi30k data files.
    """
    print("Multi30k Data Format")
    print("=" * 60)

    print("""
    FILE FORMAT:
    ────────────
    - Plain text files
    - One sentence per line
    - UTF-8 encoding
    - Parallel: line N in train.de corresponds to line N in train.en

    Example train.de (first 5 lines):
    ─────────────────────────────────
    Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
    Mehrere Männer mit Schutzhelmen bedienen ein Antriebsaggregat.
    Ein kleines Mädchen klettert in ein Spielhaus aus Holz.
    Ein Mann in einem blauen Hemd steht auf einer Leiter...
    Zwei Männer stehen am Herd und bereiten Essen zu.

    Corresponding train.en:
    ───────────────────────
    Two young, White males are outside near many bushes.
    Several men in hard hats are operating a giant pulley system.
    A little girl climbing into a wooden playhouse.
    A man in a blue shirt is standing on a ladder...
    Two men are at the stove preparing food.


    STATISTICS BY SPLIT:
    ────────────────────

    Split        Sentences   Avg DE Len    Avg EN Len
    ──────────   ─────────   ──────────    ──────────
    Train        29,000      12.1 words    11.4 words
    Validation   1,014       11.5 words    10.8 words
    Test 2016    1,000       12.3 words    11.6 words
    Test 2017    1,000       12.2 words    11.5 words
    """)


show_data_format()
```
1.3 Downloading the Dataset
Download Methods
```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def download_multi30k(data_dir: str = "data/multi30k") -> Path:
    """
    Download and decompress the Multi30k text files.

    Args:
        data_dir: Directory to store the data

    Returns:
        Path to data directory
    """
    data_path = Path(data_dir)
    data_path.mkdir(parents=True, exist_ok=True)

    # URLs for the dataset
    # Note: In practice, you may need to find current mirrors
    base_url = "https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/"

    files = [
        "train.de.gz", "train.en.gz",
        "val.de.gz", "val.en.gz",
        "test_2016_flickr.de.gz", "test_2016_flickr.en.gz",
        "test_2017_flickr.de.gz", "test_2017_flickr.en.gz",
    ]

    print("Downloading Multi30k dataset...")
    failed = []

    for filename in files:
        url = base_url + filename
        gz_path = data_path / filename
        # Strip the ".gz" suffix: train.de.gz -> train.de
        text_path = gz_path.with_suffix("")

        # Check the decompressed file: the .gz is deleted after extraction,
        # so checking for it would trigger a re-download on every run.
        if text_path.exists():
            print(f"  {text_path.name} already exists")
            continue

        print(f"  Downloading {filename}...")
        try:
            urllib.request.urlretrieve(url, gz_path)

            # Decompress in chunks instead of reading the whole file into memory
            with gzip.open(gz_path, "rb") as f_in, open(text_path, "wb") as f_out:
                shutil.copyfileobj(f_in, f_out)
        except Exception as e:
            print(f"  Error downloading {filename}: {e}")
            failed.append(filename)
            # Remove a partially written file so the next run retries it
            text_path.unlink(missing_ok=True)
        finally:
            # Remove .gz file
            gz_path.unlink(missing_ok=True)

    if failed:
        print(f"Finished with {len(failed)} failed file(s): {', '.join(failed)}")
    else:
        print("Download complete!")
    return data_path
```
To download manually or use an alternative source:
```bash
# OPTION 1: GitHub Repository
git clone https://github.com/multi30k/dataset.git
cd dataset/data/task1/raw
gunzip *.gz

# OPTION 2: Hugging Face Datasets
pip install datasets

# Then in Python. The bare "multi30k" ID is the one given here;
# mirrors on the Hub may use a namespaced ID instead.
python - <<'EOF'
from datasets import load_dataset
dataset = load_dataset("multi30k")
print(dataset)
EOF
```
1.4 Exploring the Data
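Before computing any statistics, it can help to confirm that each split has the number of lines quoted earlier in this chapter (29,000 / 1,014 / 1,000 / 1,000) and that the German and English files agree. The sketch below is one way to do that; `count_lines`, `check_split_sizes`, and `EXPECTED_SIZES` are hypothetical helper names, and the file naming (`<split>.de`, `<split>.en`) assumes the decompressed layout described above.

```python
from pathlib import Path
from typing import Dict

# Expected line counts per split, taken from the statistics quoted in this chapter.
EXPECTED_SIZES: Dict[str, int] = {
    "train": 29_000,
    "val": 1_014,
    "test_2016_flickr": 1_000,
    "test_2017_flickr": 1_000,
}


def count_lines(path: Path) -> int:
    """Count lines in a UTF-8 text file without loading it all into memory."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_split_sizes(
    data_dir: str,
    expected: Dict[str, int] = EXPECTED_SIZES,
) -> Dict[str, bool]:
    """Return, per split, whether both language files exist with the expected line count."""
    results = {}
    for split, n in expected.items():
        ok = True
        for lang in ("de", "en"):
            path = Path(data_dir) / f"{split}.{lang}"
            # A missing file or a wrong count marks the whole split as failed
            if not path.exists() or count_lines(path) != n:
                ok = False
        results[split] = ok
    return results
```

Calling `check_split_sizes("data/multi30k")` after the download should return `True` for every split; a `False` entry usually points to an interrupted download or a truncated file.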
Data Exploration
```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def load_parallel_data(
    src_path: str,
    tgt_path: str,
    max_samples: Optional[int] = None
) -> Tuple[List[str], List[str]]:
    """
    Load parallel text files.

    Args:
        src_path: Path to source language file
        tgt_path: Path to target language file
        max_samples: Maximum number of samples to load (None loads all)

    Returns:
        Tuple of (source_sentences, target_sentences)
    """
    with open(src_path, 'r', encoding='utf-8') as f:
        source = [line.strip() for line in f]

    with open(tgt_path, 'r', encoding='utf-8') as f:
        target = [line.strip() for line in f]

    # Raise instead of assert so the check still runs under `python -O`
    if len(source) != len(target):
        raise ValueError(
            f"Mismatched parallel files: {len(source)} vs {len(target)} lines"
        )

    if max_samples is not None:
        source = source[:max_samples]
        target = target[:max_samples]

    return source, target


def analyze_dataset(
    source: List[str],
    target: List[str],
    src_lang: str = "DE",
    tgt_lang: str = "EN"
) -> Dict:
    """
    Analyze parallel corpus statistics.
    """
    if not source or not target:
        raise ValueError("Cannot analyze an empty corpus")

    stats = {
        'num_pairs': len(source),
        'src_lang': src_lang,
        'tgt_lang': tgt_lang,
    }

    # Length statistics (whitespace-separated words)
    src_lengths = [len(s.split()) for s in source]
    tgt_lengths = [len(t.split()) for t in target]

    stats['src_avg_len'] = sum(src_lengths) / len(src_lengths)
    stats['tgt_avg_len'] = sum(tgt_lengths) / len(tgt_lengths)
    stats['src_max_len'] = max(src_lengths)
    stats['tgt_max_len'] = max(tgt_lengths)
    stats['src_min_len'] = min(src_lengths)
    stats['tgt_min_len'] = min(tgt_lengths)

    # Vocabulary (simple word-level, lowercased)
    src_vocab = Counter()
    tgt_vocab = Counter()

    for s in source:
        src_vocab.update(s.lower().split())
    for t in target:
        tgt_vocab.update(t.lower().split())

    stats['src_vocab_size'] = len(src_vocab)
    stats['tgt_vocab_size'] = len(tgt_vocab)
    stats['src_total_tokens'] = sum(src_vocab.values())
    stats['tgt_total_tokens'] = sum(tgt_vocab.values())

    # Most common words
    stats['src_common'] = src_vocab.most_common(10)
    stats['tgt_common'] = tgt_vocab.most_common(10)

    return stats
```
1.5 Data Quality Checks
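One integrity check that is independent of the per-pair checks in this section is overlap between splits: if a test pair also appears in the training data, test scores will look better than they should. The sketch below is an additional check, not part of the original course code; `normalize` and `find_overlap` are hypothetical names, and treating pairs as equal after lowercasing and whitespace collapsing is an assumption you may want to tighten or loosen.

```python
from typing import Iterable, List, Set, Tuple


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())


def find_overlap(
    train_pairs: Iterable[Tuple[str, str]],
    eval_pairs: Iterable[Tuple[str, str]],
) -> List[int]:
    """Return indices of eval pairs whose normalized (source, target) also occurs in train."""
    seen: Set[Tuple[str, str]] = {
        (normalize(s), normalize(t)) for s, t in train_pairs
    }
    return [
        i for i, (s, t) in enumerate(eval_pairs)
        if (normalize(s), normalize(t)) in seen
    ]


# Toy data: the first test pair differs from a training pair only in case and spacing
train = [("Ein Hund rennt.", "A dog runs."), ("Eine Katze schläft.", "A cat sleeps.")]
test = [("ein hund  rennt.", "A dog runs."), ("Ein Vogel singt.", "A bird sings.")]
print(find_overlap(train, test))  # [0]
```

With real data, the inputs would be `zip(source, target)` for the training split and for each evaluation split.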
Verifying the Data
```python
from typing import Dict, List


def quality_checks(
    source: List[str],
    target: List[str]
) -> Dict[str, List[int]]:
    """
    Run quality checks on parallel data.

    Returns:
        Dictionary of issue type to list of problematic indices
    """
    issues = {
        'empty_source': [],
        'empty_target': [],
        'too_long': [],
        'too_short': [],
        'length_mismatch': [],
    }

    MAX_LENGTH = 100        # Maximum reasonable length (words)
    MIN_LENGTH = 2          # Minimum reasonable length (words)
    LENGTH_RATIO_MAX = 3.0  # Maximum longer/shorter length ratio

    for i, (src, tgt) in enumerate(zip(source, target)):
        # Empty checks
        if not src.strip():
            issues['empty_source'].append(i)
        if not tgt.strip():
            issues['empty_target'].append(i)

        # Length checks
        src_len = len(src.split())
        tgt_len = len(tgt.split())

        if src_len > MAX_LENGTH or tgt_len > MAX_LENGTH:
            issues['too_long'].append(i)

        if src_len < MIN_LENGTH or tgt_len < MIN_LENGTH:
            issues['too_short'].append(i)

        # Length ratio check (catches misalignments)
        if src_len > 0 and tgt_len > 0:
            ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
            if ratio > LENGTH_RATIO_MAX:
                issues['length_mismatch'].append(i)

    return issues


def print_quality_report(issues: Dict[str, List[int]], total: int):
    """Print quality check report."""
    print("\nData Quality Report")
    print("=" * 60)

    print(f"\nTotal pairs: {total:,}")
    print("\nIssues found:")

    for issue_type, indices in issues.items():
        count = len(indices)
        # Guard against division by zero on an empty corpus
        pct = count / total * 100 if total else 0.0
        status = "✓" if count == 0 else "!"
        print(f"  {status} {issue_type}: {count} ({pct:.2f}%)")

    # Overall assessment. A pair can appear under several issue types
    # (an empty sentence is also too short), so this total may overcount.
    total_issues = sum(len(v) for v in issues.values())
    if total_issues == 0:
        print("\n✓ Data appears clean!")
    elif total_issues < total * 0.01:
        print(f"\n⚠ Minor issues found ({total_issues} total)")
    else:
        print(f"\n✗ Significant issues found ({total_issues} total)")
        print("  Consider filtering problematic pairs")
```
1.6 Project Setup
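The recommended layout in this section can be created by hand or scaffolded with a short script. The sketch below is one way to do it with `pathlib`; the function name `scaffold_project` and the split into `PACKAGE_DIRS` and `PLAIN_DIRS` are illustrative choices, not part of the course code. It creates only directories and empty `__init__.py` files, never the source files themselves.

```python
from pathlib import Path
from typing import List

# Python packages under src/ (each gets an __init__.py)
PACKAGE_DIRS = ["src/model", "src/data", "src/training", "src/evaluation"]
# Plain directories for data, configs, and run artifacts
PLAIN_DIRS = ["data/multi30k/tokenizer", "configs", "checkpoints", "logs"]


def scaffold_project(root: str = "translation_project") -> List[Path]:
    """Create the directory skeleton and empty __init__.py files; return the directories."""
    root_path = Path(root)
    created = []
    for rel in PLAIN_DIRS + PACKAGE_DIRS:
        directory = root_path / rel
        # parents=True also creates intermediate folders such as data/multi30k
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    for rel in PACKAGE_DIRS:
        # touch() leaves the contents of an existing __init__.py unchanged
        (root_path / rel / "__init__.py").touch(exist_ok=True)
    return created
```

Because every call uses `exist_ok=True`, running the script a second time on an existing project is harmless.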
Directory Structure
```text
RECOMMENDED PROJECT STRUCTURE:
──────────────────────────────

translation_project/
│
├── data/
│   └── multi30k/
│       ├── train.de
│       ├── train.en
│       ├── val.de
│       ├── val.en
│       ├── test_2016_flickr.de
│       ├── test_2016_flickr.en
│       └── tokenizer/
│           ├── vocab.json
│           └── merges.txt
│
├── src/
│   ├── model/
│   │   ├── __init__.py
│   │   ├── transformer.py
│   │   ├── encoder.py
│   │   ├── decoder.py
│   │   └── attention.py
│   │
│   ├── data/
│   │   ├── __init__.py
│   │   ├── dataset.py
│   │   └── tokenizer.py
│   │
│   ├── training/
│   │   ├── __init__.py
│   │   ├── trainer.py
│   │   └── scheduler.py
│   │
│   └── evaluation/
│       ├── __init__.py
│       ├── bleu.py
│       └── inference.py
│
├── configs/
│   ├── model_small.yaml
│   ├── model_base.yaml
│   └── training.yaml
│
├── checkpoints/
│   └── (saved models)
│
├── logs/
│   └── (training logs)
│
├── train.py
├── evaluate.py
├── translate.py
└── requirements.txt
```
Configuration Files
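The training configuration in this section names a `transformer` scheduler with `warmup_steps: 4000`. The text does not define that schedule, so the sketch below assumes it means the warmup-then-inverse-square-root rule from the original Transformer paper: the rate grows linearly for `warmup_steps` steps, then decays in proportion to `1 / sqrt(step)`. The function name `transformer_lr` is hypothetical.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given step: linear warmup, then decay proportional to 1/sqrt(step)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # The two terms inside min() are equal exactly at step == warmup_steps
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Print the rate during warmup, at the peak, and during decay
for step in (1, 1000, 4000, 16000):
    print(f"step {step:>6}: lr = {transformer_lr(step):.6f}")
```

Under this assumption, the rate peaks at `step == warmup_steps` with a value of `(d_model * warmup_steps) ** -0.5`, roughly 7e-4 for the values used here. How this interacts with the optimizer's own `lr` setting depends on the trainer and is not specified in this section.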
```yaml
# MODEL CONFIGURATION (model_base.yaml):
model:
  d_model: 512
  num_heads: 8
  num_encoder_layers: 6
  num_decoder_layers: 6
  d_ff: 2048
  dropout: 0.1
  max_seq_len: 128

vocab:
  vocab_size: 8000
  pad_token: "<pad>"
  unk_token: "<unk>"
  bos_token: "<bos>"
  eos_token: "<eos>"

# TRAINING CONFIGURATION (training.yaml):
training:
  batch_size: 64
  max_tokens: 4096
  num_epochs: 30
  gradient_clip: 1.0
  label_smoothing: 0.1

optimizer:
  type: adam
  lr: 0.0001
  betas: [0.9, 0.98]
  eps: 1.0e-9

scheduler:
  warmup_steps: 4000
  type: transformer

checkpoint:
  save_dir: checkpoints
  save_every: 1000
  keep_best: 5

evaluation:
  eval_every: 500
  beam_size: 5
```
Summary
| Aspect | Details |
|---|---|
| Size | ~30K training pairs |
| Domain | Image descriptions |
| Languages | German → English |
| Avg Length | ~12 words |
| Target BLEU | 30-35 |
Project Checklist
- Download Multi30k dataset
- Verify data integrity
- Set up project structure
- Create configuration files
- Run initial data exploration
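For the "Verify data integrity" step, pairs flagged by the quality checks can be dropped before training. The sketch below assumes an `issues` dictionary shaped like the one returned by `quality_checks` in section 1.5 (issue type mapped to a list of pair indices); `filter_pairs` is a hypothetical helper, not part of the course code.

```python
from typing import Dict, List, Tuple


def filter_pairs(
    source: List[str],
    target: List[str],
    issues: Dict[str, List[int]],
) -> Tuple[List[str], List[str]]:
    """Drop every pair whose index appears under any issue type."""
    # A pair flagged under several issue types is still removed only once
    bad = {i for indices in issues.values() for i in indices}
    kept = [(s, t) for i, (s, t) in enumerate(zip(source, target)) if i not in bad]
    if not kept:
        return [], []
    src, tgt = zip(*kept)
    return list(src), list(tgt)


# Toy data: pair 1 has an empty source and is flagged twice
src = ["Ein Hund rennt.", "", "Eine Katze schläft."]
tgt = ["A dog runs.", "Nothing.", "A cat sleeps."]
flagged = {"empty_source": [1], "too_short": [1]}
clean_src, clean_tgt = filter_pairs(src, tgt, flagged)
print(len(clean_src), "pairs kept")
```

Filtering is usually applied to the training split only; removing pairs from a standardized test set would make scores harder to compare with published baselines.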
Next Section Preview
In the next section, we'll cover Data Preprocessing: cleaning, normalizing, and preparing the data for training.