Boo-AI — Master Artificial Intelligence by Building from Scratch

In this chapter, we begin our capstone project: building a German-to-English translation system. We'll use the text portion of the Multi30k dataset, a widely used benchmark for multimodal machine translation research. By the end of this project, we aim to reach 30-35 BLEU on the test set.

1.1 About Multi30k

Dataset Description

📝text

MULTI30K DATASET:
─────────────────

Original Name: Multi30k - Multilingual Image Description Dataset
Purpose: Multimodal Machine Translation
Source: Extension of Flickr30k image descriptions

Key Statistics:
  Training:   29,000 sentence pairs
  Validation:  1,014 sentence pairs
  Test 2016:   1,000 sentence pairs
  Test 2017:   1,000 sentence pairs

Languages: German (DE) → English (EN)
           Also available: French, Czech

Domain: Image descriptions
        Short, simple sentences describing everyday scenes

Example Pairs:
  DE: "Ein Mann in einem blauen Hemd steht vor einer Garage."
  EN: "A man in a blue shirt is standing in front of a garage."

  DE: "Ein Kind spielt mit einem Ball im Park."
  EN: "A child is playing with a ball in the park."

Why Multi30k?

🐍python

def why_multi30k():
    """
    Reasons for choosing Multi30k for this course.
    """
    print("Why Multi30k for Learning?")
    print("=" * 60)

    print("""
    1. MANAGEABLE SIZE:
       ─────────────────
       - ~30K training examples
       - Can train on single GPU in hours
       - Can even train on CPU (slowly)

       Compare to WMT datasets:
         WMT14 EN-DE: 4.5M sentence pairs
         → Days of GPU time

    2. SIMPLE DOMAIN:
       ───────────────
       - Short sentences (avg ~12 words)
       - Everyday vocabulary
       - Clear, descriptive language
       - Less ambiguity than news/literature

    3. WELL-ESTABLISHED:
       ─────────────────
       - Published baselines to compare
       - Standardized test sets
       - Active research community
       - Easy to find reference implementations

    4. GOOD FOR VALIDATION:
       ────────────────────
       - Small test sets = fast evaluation
       - Multiple test years (2016, 2017)
       - Clear quality metrics


    EXPECTED PERFORMANCE:
    ─────────────────────

    Model Type              BLEU (DE→EN)
    ─────────────────────   ────────────
    Our target              30-35
    Transformer-base        35-40
    State-of-the-art        45+
    With image features     50+

    Note: Using text only (no images) in this course.
    """)


why_multi30k()

1.2 Dataset Structure

File Organization

📝text

MULTI30K DATASET STRUCTURE:
───────────────────────────

multi30k/
├── data/
│   └── task1/
│       └── raw/
│           ├── train.de      # German training sentences
│           ├── train.en      # English training sentences
│           ├── val.de        # German validation
│           ├── val.en        # English validation
│           ├── test_2016_flickr.de  # Test 2016
│           ├── test_2016_flickr.en
│           ├── test_2017_flickr.de  # Test 2017
│           └── test_2017_flickr.en
│
└── images/                   # (Not used in this course)
    ├── train/
    ├── val/
    └── test/

Data Format

🐍python

def show_data_format():
    """
    Show the format of Multi30k data files.
    """
    print("Multi30k Data Format")
    print("=" * 60)

    print("""
    FILE FORMAT:
    ────────────
    - Plain text files
    - One sentence per line
    - UTF-8 encoding
    - Parallel: line N in train.de corresponds to line N in train.en

    Example train.de (first 5 lines):
    ─────────────────────────────────
    Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
    Mehrere Männer mit Schutzhelmen bedienen ein Antriebsaggregat.
    Ein kleines Mädchen klettert in ein Spielhaus aus Holz.
    Ein Mann in einem blauen Hemd steht auf einer Leiter...
    Zwei Männer stehen am Herd und bereiten Essen zu.

    Corresponding train.en:
    ───────────────────────
    Two young, White males are outside near many bushes.
    Several men in hard hats are operating a giant pulley system.
    A little girl climbing into a wooden playhouse.
    A man in a blue shirt is standing on a ladder...
    Two men are at the stove preparing food.


    STATISTICS BY SPLIT:
    ────────────────────

    Split        Sentences  Avg DE Len  Avg EN Len
    ──────────   ─────────  ──────────  ──────────
    Train        29,000     12.1 words  11.4 words
    Validation    1,014     11.5 words  10.8 words
    Test 2016     1,000     12.3 words  11.6 words
    Test 2017     1,000     12.2 words  11.5 words
    """)


show_data_format()
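
Because the two files are aligned line by line, they can be read in lockstep. The following is a minimal sketch of that idea, not part of the course code: the helper name `iter_parallel` and the demo file names are assumptions made for illustration. It raises an error if one file runs out of lines before the other, which a plain `zip` would silently hide.

```python
import tempfile
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple

_MISSING = object()  # sentinel marking "this file has no more lines"


def iter_parallel(src_path: Path, tgt_path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs: line N of one file with line N of the other."""
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt:
        lines = zip_longest(f_src, f_tgt, fillvalue=_MISSING)
        for line_no, (src, tgt) in enumerate(lines, start=1):
            if src is _MISSING or tgt is _MISSING:
                raise ValueError(f"Files differ in length at line {line_no}")
            yield src.rstrip("\n"), tgt.rstrip("\n")


# Demo on two tiny stand-in files (hypothetical names, sentences from above)
with tempfile.TemporaryDirectory() as tmp:
    de_file = Path(tmp) / "demo.de"
    en_file = Path(tmp) / "demo.en"
    de_file.write_text(
        "Ein kleines Mädchen klettert in ein Spielhaus aus Holz.\n"
        "Zwei Männer stehen am Herd und bereiten Essen zu.\n",
        encoding="utf-8",
    )
    en_file.write_text(
        "A little girl climbing into a wooden playhouse.\n"
        "Two men are at the stove preparing food.\n",
        encoding="utf-8",
    )
    pairs = list(iter_parallel(de_file, en_file))

for de, en in pairs:
    print(f"DE: {de}\nEN: {en}\n")
```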

1.3 Downloading the Dataset

Download Methods

🐍python

import gzip
import shutil
import urllib.request
from pathlib import Path


def download_multi30k(data_dir: str = "data/multi30k") -> Path:
    """
    Download Multi30k dataset.

    Args:
        data_dir: Directory to store the data

    Returns:
        Path to data directory
    """
    data_path = Path(data_dir)
    data_path.mkdir(parents=True, exist_ok=True)

    # URLs for the dataset
    # Note: In practice, you may need to find current mirrors
    base_url = "https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/"

    files = [
        "train.de.gz", "train.en.gz",
        "val.de.gz", "val.en.gz",
        "test_2016_flickr.de.gz", "test_2016_flickr.en.gz",
        "test_2017_flickr.de.gz", "test_2017_flickr.en.gz",
    ]

    print("Downloading Multi30k dataset...")

    for filename in files:
        url = base_url + filename
        gz_path = data_path / filename
        # Dropping the ".gz" suffix gives the final text file name
        text_path = gz_path.with_suffix('')

        # Test for the decompressed file: the .gz archive is deleted after
        # extraction, so checking for it would re-download on every run
        if text_path.exists():
            print(f"  {text_path.name} already exists")
            continue

        print(f"  Downloading {filename}...")
        try:
            urllib.request.urlretrieve(url, gz_path)

            # Decompress by streaming instead of reading the whole file at once
            with gzip.open(gz_path, 'rb') as f_in, open(text_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)

        except Exception as e:
            print(f"  Error downloading {filename}: {e}")
            # Drop any partial output so the next run retries this file
            text_path.unlink(missing_ok=True)
        finally:
            # Remove the .gz file whether or not extraction succeeded
            gz_path.unlink(missing_ok=True)

    print("Download complete!")
    return data_path
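
After downloading, it is worth confirming that every file is present and has the expected number of lines. The sketch below is one possible check, assuming the uncompressed file names and split sizes listed earlier in this section. The names `verify_download` and `EXPECTED_LINES` are hypothetical, and the demo runs on a tiny stand-in split rather than the real data.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Sequence

# Split sizes as stated in the dataset description above
EXPECTED_LINES: Dict[str, int] = {
    "train": 29_000,
    "val": 1_014,
    "test_2016_flickr": 1_000,
    "test_2017_flickr": 1_000,
}


def count_lines(path: Path) -> int:
    """Count lines without loading the whole file into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def verify_download(
    data_dir: Path,
    expected: Dict[str, int] = EXPECTED_LINES,
    langs: Sequence[str] = ("de", "en"),
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for split, n_expected in expected.items():
        for lang in langs:
            path = Path(data_dir) / f"{split}.{lang}"
            if not path.exists():
                problems.append(f"missing: {path.name}")
                continue
            n_found = count_lines(path)
            if n_found != n_expected:
                problems.append(
                    f"{path.name}: expected {n_expected} lines, found {n_found}"
                )
    return problems


# Demo with a stand-in split of 2 lines; the English file is one line short
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.de").write_text("Satz eins.\nSatz zwei.\n", encoding="utf-8")
    (Path(tmp) / "demo.en").write_text("Sentence one.\n", encoding="utf-8")
    demo_problems = verify_download(Path(tmp), expected={"demo": 2})

print(demo_problems)
```

On the real data, calling `verify_download(Path("data/multi30k"))` should return an empty list if the download succeeded and the published split sizes hold.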

The dataset can also be obtained manually, either by cloning the GitHub repository or through Hugging Face Datasets:

⚡bash

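# OPTION 1: GitHub Repository
git clone https://github.com/multi30k/dataset.git
cd dataset/data/task1/raw
gunzip *.gz

# OPTION 2: Hugging Face Datasets
pip install datasets

# Then in Python (not in the shell):
#   from datasets import load_dataset
#   dataset = load_dataset("multi30k")
#
# Note: verify the dataset ID on the Hugging Face Hub before relying on it.
# It may require a namespace prefix (for example, a community mirror such as
# "bentrevett/multi30k").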

1.4 Exploring the Data

Data Exploration

🐍python

from collections import Counter
from typing import Dict, List, Optional, Tuple


def load_parallel_data(
    src_path: str,
    tgt_path: str,
    max_samples: Optional[int] = None
) -> Tuple[List[str], List[str]]:
    """
    Load parallel text files.

    Args:
        src_path: Path to source language file
        tgt_path: Path to target language file
        max_samples: Maximum number of samples to load (None loads everything)

    Returns:
        Tuple of (source_sentences, target_sentences)
    """
    with open(src_path, 'r', encoding='utf-8') as f:
        source = [line.strip() for line in f]

    with open(tgt_path, 'r', encoding='utf-8') as f:
        target = [line.strip() for line in f]

    assert len(source) == len(target), "Mismatched parallel files!"

    if max_samples is not None:
        source = source[:max_samples]
        target = target[:max_samples]

    return source, target


def analyze_dataset(
    source: List[str],
    target: List[str],
    src_lang: str = "DE",
    tgt_lang: str = "EN"
) -> Dict:
    """
    Analyze parallel corpus statistics.
    """
    if not source or not target:
        raise ValueError("Cannot analyze an empty corpus")

    stats = {
        'num_pairs': len(source),
        'src_lang': src_lang,
        'tgt_lang': tgt_lang,
    }

    # Length statistics (whitespace-separated words)
    src_lengths = [len(s.split()) for s in source]
    tgt_lengths = [len(t.split()) for t in target]

    stats['src_avg_len'] = sum(src_lengths) / len(src_lengths)
    stats['tgt_avg_len'] = sum(tgt_lengths) / len(tgt_lengths)
    stats['src_max_len'] = max(src_lengths)
    stats['tgt_max_len'] = max(tgt_lengths)
    stats['src_min_len'] = min(src_lengths)
    stats['tgt_min_len'] = min(tgt_lengths)

    # Vocabulary (simple word-level, lowercased; punctuation stays attached)
    src_vocab = Counter()
    tgt_vocab = Counter()

    for s in source:
        src_vocab.update(s.lower().split())
    for t in target:
        tgt_vocab.update(t.lower().split())

    stats['src_vocab_size'] = len(src_vocab)
    stats['tgt_vocab_size'] = len(tgt_vocab)
    stats['src_total_tokens'] = sum(src_vocab.values())
    stats['tgt_total_tokens'] = sum(tgt_vocab.values())

    # Most common words
    stats['src_common'] = src_vocab.most_common(10)
    stats['tgt_common'] = tgt_vocab.most_common(10)

    return stats
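
Averages and extremes hide the shape of the length distribution, which matters later when choosing a maximum sequence length. As a complement to the statistics above, here is a small sketch that reports length percentiles using the nearest-rank method. The function name `length_percentiles` is an assumption for illustration, and lengths are counted in whitespace-separated words, as in `analyze_dataset`.

```python
import math
from typing import Dict, List, Sequence


def length_percentiles(
    sentences: List[str],
    percentiles: Sequence[int] = (50, 90, 99),
) -> Dict[int, int]:
    """Nearest-rank percentiles of sentence length in whitespace-separated words."""
    if not sentences:
        raise ValueError("Need at least one sentence")
    lengths = sorted(len(s.split()) for s in sentences)
    n = len(lengths)
    result = {}
    for p in percentiles:
        # Nearest-rank: the smallest value with at least p% of the data at or below it
        rank = max(1, math.ceil(p * n / 100))
        result[p] = lengths[rank - 1]
    return result


# Demo on a few English sentences quoted earlier in this section
demo_sentences = [
    "Two young, White males are outside near many bushes.",
    "Several men in hard hats are operating a giant pulley system.",
    "A little girl climbing into a wooden playhouse.",
    "Two men are at the stove preparing food.",
    "A child is playing with a ball in the park.",
]

demo_percentiles = length_percentiles(demo_sentences, percentiles=(50, 90))
print(demo_percentiles)
```

With only five sentences the numbers are not meaningful in themselves. On the full training set, a high percentile gives a rough sense of how many sentences a given length limit would truncate.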

1.5 Data Quality Checks

Verifying the Data

🐍python

from typing import Dict, List


def quality_checks(
    source: List[str],
    target: List[str]
) -> Dict[str, List[int]]:
    """
    Run quality checks on parallel data.

    Returns:
        Dictionary of issue type to list of problematic indices
    """
    issues = {
        'empty_source': [],
        'empty_target': [],
        'too_long': [],
        'too_short': [],
        'length_mismatch': [],
    }

    MAX_LENGTH = 100  # Maximum reasonable length (in words)
    MIN_LENGTH = 2    # Minimum reasonable length (in words)
    LENGTH_RATIO_MAX = 3.0  # Maximum ratio of longer side to shorter side

    for i, (src, tgt) in enumerate(zip(source, target)):
        # Empty checks
        if not src.strip():
            issues['empty_source'].append(i)
        if not tgt.strip():
            issues['empty_target'].append(i)

        # Length checks
        src_len = len(src.split())
        tgt_len = len(tgt.split())

        if src_len > MAX_LENGTH or tgt_len > MAX_LENGTH:
            issues['too_long'].append(i)

        # Note: an empty sentence is also reported here as too short
        if src_len < MIN_LENGTH or tgt_len < MIN_LENGTH:
            issues['too_short'].append(i)

        # Length ratio check (catches misalignments)
        if src_len > 0 and tgt_len > 0:
            ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
            if ratio > LENGTH_RATIO_MAX:
                issues['length_mismatch'].append(i)

    return issues


def print_quality_report(issues: Dict[str, List[int]], total: int):
    """Print quality check report."""
    print("\nData Quality Report")
    print("=" * 60)

    print(f"\nTotal pairs: {total:,}")
    print("\nIssues found:")

    for issue_type, indices in issues.items():
        count = len(indices)
        # Guard against division by zero on an empty corpus
        pct = count / total * 100 if total else 0.0
        status = "✓" if count == 0 else "!"
        print(f"  {status} {issue_type}: {count} ({pct:.2f}%)")

    # Overall assessment (a pair flagged under several issue types
    # is counted once per type)
    total_issues = sum(len(v) for v in issues.values())
    if total_issues == 0:
        print("\n✓ Data appears clean!")
    elif total_issues < total * 0.01:
        print(f"\n⚠ Minor issues found ({total_issues} total)")
    else:
        print(f"\n⚠ Significant issues found ({total_issues} total)")
        print("   Consider filtering problematic pairs")
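
The report suggests filtering problematic pairs when many issues are found. One way this might look is sketched below, reusing the same thresholds as the checks above (2 to 100 words, length ratio of at most 3.0). The function name `filter_pairs` is hypothetical, and the demo sentences are made up for illustration.

```python
from typing import List, Tuple


def filter_pairs(
    source: List[str],
    target: List[str],
    min_len: int = 2,
    max_len: int = 100,
    max_ratio: float = 3.0,
) -> Tuple[List[str], List[str]]:
    """Keep only pairs where both sides pass the length and ratio checks."""
    kept_src, kept_tgt = [], []
    for src, tgt in zip(source, target):
        src_len = len(src.split())
        tgt_len = len(tgt.split())

        # Covers empty sentences too, since their length is 0 < min_len
        if src_len < min_len or tgt_len < min_len:
            continue
        if src_len > max_len or tgt_len > max_len:
            continue
        # Both lengths are at least 1 here, so the division is safe
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue

        kept_src.append(src)
        kept_tgt.append(tgt)
    return kept_src, kept_tgt


# Demo: 5 pairs, 3 of which should be dropped
demo_src = [
    "Ein Kind spielt mit einem Ball im Park.",  # fine
    "",                                         # empty source
    "Hund",                                     # too short
    "Zwei Männer",                              # ratio 8/2 = 4.0 > 3.0
    "Ein Mann steht vor einer Garage.",         # fine
]
demo_tgt = [
    "A child is playing with a ball in the park.",
    "A dog.",
    "A dog runs across the wide green field today.",
    "Two men are at the stove preparing food.",
    "A man is standing in front of a garage.",
]

clean_src, clean_tgt = filter_pairs(demo_src, demo_tgt)
print(f"Kept {len(clean_src)} of {len(demo_src)} pairs")
```

Filtering should be applied to the training split only. Validation and test sets are normally left untouched so that scores stay comparable with published results.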

1.6 Project Setup

Directory Structure

📝text

RECOMMENDED PROJECT STRUCTURE:
──────────────────────────────

translation_project/
│
├── data/
│   └── multi30k/
│       ├── train.de
│       ├── train.en
│       ├── val.de
│       ├── val.en
│       ├── test_2016_flickr.de
│       ├── test_2016_flickr.en
│       └── tokenizer/
│           ├── vocab.json
│           └── merges.txt
│
├── src/
│   ├── model/
│   │   ├── __init__.py
│   │   ├── transformer.py
│   │   ├── encoder.py
│   │   ├── decoder.py
│   │   └── attention.py
│   │
│   ├── data/
│   │   ├── __init__.py
│   │   ├── dataset.py
│   │   └── tokenizer.py
│   │
│   ├── training/
│   │   ├── __init__.py
│   │   ├── trainer.py
│   │   └── scheduler.py
│   │
│   └── evaluation/
│       ├── __init__.py
│       ├── bleu.py
│       └── inference.py
│
├── configs/
│   ├── model_small.yaml
│   ├── model_base.yaml
│   └── training.yaml
│
├── checkpoints/
│   └── (saved models)
│
├── logs/
│   └── (training logs)
│
├── train.py
├── evaluate.py
├── translate.py
└── requirements.txt
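
The empty skeleton can be created in one step instead of by hand. The sketch below builds the directories from the tree above and adds an `__init__.py` to each package under `src/`. The function name `create_skeleton` is an assumption, and the demo writes into a temporary directory so that nothing is left on disk.

```python
import tempfile
from pathlib import Path
from typing import List

# Directories from the recommended structure above
PACKAGE_DIRS = ["src/model", "src/data", "src/training", "src/evaluation"]
PLAIN_DIRS = ["data/multi30k/tokenizer", "configs", "checkpoints", "logs"]


def create_skeleton(root: Path) -> Path:
    """Create the project directories and empty package markers under root."""
    root = Path(root)
    for rel in PLAIN_DIRS + PACKAGE_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    for rel in PACKAGE_DIRS:
        # An empty __init__.py makes each directory an importable package
        (root / rel / "__init__.py").touch()
    return root


# Demo in a temporary directory; use Path("translation_project") for real use
with tempfile.TemporaryDirectory() as tmp:
    project_root = create_skeleton(Path(tmp) / "translation_project")
    created: List[str] = sorted(
        p.relative_to(project_root).as_posix() for p in project_root.rglob("*")
    )

for entry in created:
    print(entry)
```

Running the function twice is harmless, since existing directories and files are left in place.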

Configuration Files

📄yaml

# MODEL CONFIGURATION (model_base.yaml):
model:
  d_model: 512
  num_heads: 8
  num_encoder_layers: 6
  num_decoder_layers: 6
  d_ff: 2048
  dropout: 0.1
  max_seq_len: 128

vocab:
  vocab_size: 8000
  pad_token: "<pad>"
  unk_token: "<unk>"
  bos_token: "<bos>"
  eos_token: "<eos>"

# TRAINING CONFIGURATION (training.yaml):
training:
  batch_size: 64
  max_tokens: 4096
  num_epochs: 30
  gradient_clip: 1.0
  label_smoothing: 0.1

optimizer:
  type: adam
  lr: 0.0001
  betas: [0.9, 0.98]
  eps: 1.0e-9

scheduler:
  warmup_steps: 4000
  type: transformer

checkpoint:
  save_dir: checkpoints
  save_every: 1000
  keep_best: 5

evaluation:
  eval_every: 500
  beam_size: 5
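
The configuration does not spell out what `type: transformer` means for the scheduler. A common reading, assumed here, is the warmup schedule from the original Transformer paper: the learning rate grows linearly for `warmup_steps` steps and then decays with the inverse square root of the step number. The sketch below computes that schedule with `d_model: 512` and `warmup_steps: 4000` from the config. The function name `transformer_lr` and the `scale` parameter are hypothetical, and how the configured `lr: 0.0001` interacts with the schedule depends on the trainer implementation.

```python
def transformer_lr(
    step: int,
    d_model: int = 512,
    warmup_steps: int = 4000,
    scale: float = 1.0,
) -> float:
    """Learning rate at a given step: linear warmup, then 1/sqrt(step) decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises until warmup_steps, peaks there (roughly 7e-4 for these
# settings), and falls afterwards
for step in (1, 1000, 4000, 16000, 64000):
    print(f"step {step:>6}: lr = {transformer_lr(step):.6f}")
```

The two arguments of `min` are equal exactly at `step == warmup_steps`, which is why the peak sits there. A larger `warmup_steps` lowers the peak and delays it.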

Summary

Aspect	Details
Size	~30K training pairs
Domain	Image descriptions
Languages	German → English
Avg Length	~12 words
Target BLEU	30-35

Project Checklist

Download Multi30k dataset
Verify data integrity
Set up project structure
Create configuration files
Run initial data exploration

Next Section Preview

In the next section, we'll cover Data Preprocessing: cleaning, normalizing, and preparing the data for training.