Introduction
This section provides an interactive demo of our translation system and summarizes what we've built throughout the course. We'll also discuss potential improvements and next steps.
Interactive Translation Demo
Complete Demo Script
```python
def interactive_demo():
    """
    Print a mock-up of the demo interface and a table of example translations.
    """
    print("=" * 70)
    print("  GERMAN-ENGLISH TRANSLATION DEMO")
    print("  Built from scratch using Transformer architecture")
    print("=" * 70)

    print("""
    DEMO INTERFACE:
    ───────────────

    ┌─────────────────────────────────────────────────────────────────┐
    │            German-English Neural Machine Translation            │
    ├─────────────────────────────────────────────────────────────────┤
    │                                                                 │
    │  Enter German text:                                             │
    │  ┌─────────────────────────────────────────────────────────┐    │
    │  │ Der Hund läuft im Park.                                 │    │
    │  └─────────────────────────────────────────────────────────┘    │
    │                                                                 │
    │  [Translate]     Beam Size: [5 ▼]     Max Length: [128]         │
    │                                                                 │
    │  ─────────────────────────────────────────────────────────────  │
    │                                                                 │
    │  English Translation:                                           │
    │  ┌─────────────────────────────────────────────────────────┐    │
    │  │ The dog runs in the park.                               │    │
    │  └─────────────────────────────────────────────────────────┘    │
    │                                                                 │
    │  Confidence: 0.92     Tokens: 7     Time: 0.15s                 │
    │                                                                 │
    └─────────────────────────────────────────────────────────────────┘


    EXAMPLE TRANSLATIONS:
    ─────────────────────
    """)

    # (German source, English translation) pairs shown in the table below.
    examples = [
        ("Der Hund läuft im Park.", "The dog runs in the park."),
        ("Eine Frau liest ein Buch.", "A woman reads a book."),
        ("Das Kind spielt mit einem Ball.", "The child plays with a ball."),
        ("Ein Mann steht vor einem Gebäude.", "A man stands in front of a building."),
        ("Die Katze schläft auf dem Sofa.", "The cat sleeps on the sofa."),
        ("Zwei Männer reden miteinander.", "Two men talk to each other."),
        ("Ein Mädchen fährt Fahrrad.", "A girl rides a bicycle."),
        ("Der Himmel ist heute blau.", "The sky is blue today."),
    ]

    # Left-align both columns so the table lines up.
    print(f"{'German':<45} {'English':<40}")
    print("-" * 85)

    for de, en in examples:
        print(f"{de:<45} {en:<40}")


interactive_demo()
```
Sample Translation Analysis
Analyzing Model Outputs
```python
def translation_analysis():
    """
    Analyze translation quality with examples.
    """
    print("Translation Analysis")
    print("=" * 60)

    print("""
    GOOD TRANSLATIONS:
    ──────────────────

    Example 1 (Simple sentence):
    DE: "Ein Hund läuft durch den Schnee."
    EN: "A dog runs through the snow."
    ✓ Correct meaning, fluent English

    Example 2 (Modified noun phrase):
    DE: "Die Frau mit dem roten Kleid tanzt."
    EN: "The woman with the red dress dances."
    ✓ Prepositional phrase preserved correctly

    Example 3 (Coordinated verbs):
    DE: "Ein Kind isst Eis und lacht."
    EN: "A child eats ice cream and laughs."
    ✓ Coordination handled well


    CHALLENGING CASES:
    ──────────────────

    Example 4 (Less common noun):
    DE: "Der Schmetterling fliegt über die Blumen."
    EN: "The butterfly flies over the flowers."
    ✓ "Schmetterling" translated correctly

    Example 5 (Word order):
    DE: "Im Park spielt ein Kind."
    EN: "A child plays in the park."
    ✓ German V2 word order converted to English SVO

    Example 6 (Negation):
    DE: "Der Mann kann nicht schwimmen."
    EN: "The man cannot swim."
    ✓ Negation position adjusted for English


    TYPICAL ERRORS:
    ───────────────

    Error Type 1 - Rare words:
    DE: "Der Architekt entwirft ein Gebäude."
    Model: "The man designs a building."
    Better: "The architect designs a building."
    Issue: "Architekt" may be rare in training data

    Error Type 2 - Ambiguity:
    DE: "Sie sieht sie."
    Model: "She sees you."
    Correct: "She sees them." or "She sees her."
    Issue: "sie" is ambiguous (she/they/them/her)

    Error Type 3 - Long sentences:
    DE: "Der Mann, der einen blauen Hut trägt und neben dem Baum steht, liest eine Zeitung."
    Model: "The man who wears a blue hat reads a newspaper."
    Issue: Dropped the clause "neben dem Baum steht" (standing next to the tree)


    BLEU SCORE INTERPRETATION:
    ──────────────────────────

    Score: 32.5 BLEU on Multi30k test set

    What this means:
    - Comparable to published baselines
    - Most translations are understandable
    - Some fluency issues remain
    - Good starting point for improvements
    """)


translation_analysis()
```
Project Summary
What We Built
COMPONENTS IMPLEMENTED:
───────────────────────

Chapters 1-3: Attention Mechanism
├── Scaled Dot-Product Attention
├── Multi-Head Attention
└── Self-Attention and Cross-Attention

Chapters 4-5: Embeddings and Tokenization
├── Token Embeddings
├── Positional Encoding (sinusoidal)
└── BPE Tokenizer (from scratch)

Chapter 6: Feed-Forward Networks
├── Position-wise FFN
├── Layer Normalization
└── Residual Connections

Chapters 7-8: Transformer Architecture
├── Encoder Stack
├── Decoder Stack (with masking)
└── Complete Encoder-Decoder Model

Chapter 9: Generation
├── Greedy Decoding
├── Beam Search
├── Sampling Strategies
└── KV Caching

Chapters 10-11: Training and Evaluation
├── Label Smoothing Loss
├── Learning Rate Scheduling
├── Training Loop
├── BLEU Score
└── ChrF Score

Chapters 12-14: Complete Project
├── Multi30k Dataset Pipeline
├── Full Training Script
└── Inference and Demo


LINES OF CODE (approximate):
────────────────────────────

Attention mechanisms:    ~300 lines
Embeddings/Encoding:     ~200 lines
FFN/Normalization:       ~150 lines
Encoder/Decoder:         ~400 lines
Generation:              ~350 lines
Training pipeline:       ~500 lines
Evaluation metrics:      ~400 lines
Data processing:         ~400 lines
─────────────────────────────
Total:                   ~2700 lines


MODEL STATISTICS:
─────────────────

Architecture:     Transformer-base
Parameters:       ~65 million
Training time:    ~2-3 hours (GPU)
Final BLEU:       ~30-35
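
The parameter figure can be sanity-checked with a quick count. The sketch below assumes the standard Transformer-base hyperparameters (d_model = 512, d_ff = 2048, 6 encoder and 6 decoder layers) and a hypothetical shared vocabulary of 37,000 tokens with tied embeddings. The course model's actual vocabulary size is not stated in this section, so the function names and the vocabulary size are illustrative only.

```python
def attention_params(d_model: int) -> int:
    # Q, K, V and output projections: four d_model x d_model matrices plus biases.
    return 4 * (d_model * d_model + d_model)


def ffn_params(d_model: int, d_ff: int) -> int:
    # Two linear layers (d_model -> d_ff -> d_model), each with a bias.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def layer_norm_params(d_model: int) -> int:
    # One gain vector and one bias vector.
    return 2 * d_model


def transformer_params(vocab_size: int, d_model: int = 512,
                       d_ff: int = 2048, n_layers: int = 6) -> int:
    # Encoder layer: self-attention + FFN, with two LayerNorms.
    enc_layer = (attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 2 * layer_norm_params(d_model))
    # Decoder layer: self-attention + cross-attention + FFN, with three LayerNorms.
    dec_layer = (2 * attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 3 * layer_norm_params(d_model))
    # Assumption: one embedding matrix shared by source, target, and output layer.
    embeddings = vocab_size * d_model
    return n_layers * (enc_layer + dec_layer) + embeddings


total = transformer_params(vocab_size=37_000)
print(f"{total:,} parameters")
```

Under these assumptions the count comes to roughly 63 million, in line with the ~65 million quoted above. Untied embeddings or a different vocabulary size shift the total, since the embedding matrix alone accounts for nearly 19 million parameters here.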


KEY LEARNINGS:
──────────────

1. Attention is all you need (for this task)
2. Proper initialization matters
3. Warmup is crucial for stability
4. Subword tokenization handles out-of-vocabulary (OOV) words
5. Beam search beats greedy decoding
Potential Improvements
Next Steps
ARCHITECTURE IMPROVEMENTS:
──────────────────────────

1. Pre-LayerNorm
   - Move LayerNorm before attention/FFN
   - More stable training
   - Potential: +0.5 BLEU

2. Relative Position Encoding
   - Better generalization to longer sequences
   - Used in modern transformers
   - Potential: +0.3 BLEU

3. Rotary Position Embeddings (RoPE)
   - State-of-the-art position encoding
   - Better extrapolation
   - Potential: +0.5 BLEU
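
To make the Pre-LayerNorm item concrete, here is a minimal pure-Python sketch of the two sublayer wirings. It operates on a single vector, omits LayerNorm's learnable gain and bias, and takes the sublayer (attention or FFN) as an arbitrary callable. The names `post_ln_sublayer` and `pre_ln_sublayer` are illustrative and not taken from the course code.

```python
import math


def layer_norm(x, eps=1e-5):
    # Normalize one vector to zero mean and unit variance (no learnable gain/bias).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    # Element-wise vector addition (the residual connection).
    return [u + v for u, v in zip(a, b)]


def post_ln_sublayer(x, sublayer):
    # Original Transformer wiring: normalize after the residual addition.
    return layer_norm(add(x, sublayer(x)))


def pre_ln_sublayer(x, sublayer):
    # Pre-LN wiring: normalize only the sublayer input, so the residual
    # path carries x through unchanged.
    return add(x, sublayer(layer_norm(x)))


# With a sublayer that outputs zeros, Pre-LN returns x untouched,
# while Post-LN still rescales it.
zero_sublayer = lambda v: [0.0] * len(v)
x = [1.0, 2.0, 3.0, 10.0]
print(pre_ln_sublayer(x, zero_sublayer))
print(post_ln_sublayer(x, zero_sublayer))
```

The unmodified residual path in the Pre-LN variant is the usual explanation for its more stable training: gradients can flow through the stack without passing through a normalization at every layer.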


TRAINING IMPROVEMENTS:
──────────────────────

1. Larger Batch Size
   - More stable gradients
   - Better GPU utilization
   - Use gradient accumulation if memory is limited
   - Potential: +0.5 BLEU

2. Back-translation
   - Generate synthetic training data
   - Translate monolingual EN text to DE with a reverse model, then train on the synthetic DE→EN pairs
   - Potential: +2-3 BLEU

3. Label Smoothing Tuning
   - Try values 0.05-0.2
   - May improve generalization
   - Potential: +0.3 BLEU
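
For the label smoothing item, the following pure-Python sketch shows what the smoothing value controls, for a single prediction. It gives the true class a weight of `1 - smoothing` and spreads the remainder evenly over the other classes. This is one common variant; the course's own loss may distribute the mass differently, and `label_smoothing_loss` is an illustrative name.

```python
import math


def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def label_smoothing_loss(logits, target, smoothing=0.1):
    """Cross-entropy against a smoothed target distribution (single example)."""
    n_classes = len(logits)
    log_probs = log_softmax(logits)
    # Assumption: non-target classes share the smoothing mass equally.
    off_weight = smoothing / (n_classes - 1)
    loss = 0.0
    for i, log_p in enumerate(log_probs):
        weight = (1.0 - smoothing) if i == target else off_weight
        loss -= weight * log_p
    return loss


logits = [5.0, 0.0, 0.0]  # a confident, correct prediction for class 0
for eps in (0.0, 0.05, 0.1, 0.2):
    print(eps, round(label_smoothing_loss(logits, target=0, smoothing=eps), 4))
```

With `smoothing=0.0` this reduces to ordinary cross-entropy. Larger values penalize very confident predictions more strongly, which is why the value is worth tuning.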


DATA IMPROVEMENTS:
──────────────────

1. More Training Data
   - Add WMT data (millions of pairs)
   - Significantly improves quality
   - Potential: +5-10 BLEU

2. Data Cleaning
   - Remove misaligned pairs
   - Filter by language detector
   - Potential: +0.5 BLEU

3. SentencePiece Tokenization
   - Better subword segmentation
   - More robust to typos
   - Potential: +0.3 BLEU
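
As a sketch of the data cleaning step, the function below drops empty, overlong, duplicated, and length-mismatched pairs, using the word-count ratio as a cheap proxy for misalignment. The thresholds and the name `clean_pairs` are hypothetical and would need tuning on real data. The language-detection filter is left out because it depends on an external library.

```python
def clean_pairs(pairs, max_ratio=2.5, max_len=100):
    """Return the (source, target) pairs that pass simple sanity checks."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # one side is empty
        if n_src > max_len or n_tgt > max_len:
            continue  # too long to be a single caption-style sentence
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths differ too much: likely misaligned
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


raw = [
    ("Ein Hund läuft.", "A dog runs."),
    ("", "Empty source."),
    ("Ja.", "Yes , this target is far too long for the source ."),
    ("Ein Hund läuft.", "A dog runs."),
]
print(clean_pairs(raw))
```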


INFERENCE IMPROVEMENTS:
───────────────────────

1. Ensemble Decoding
   - Average the predictions of multiple models
   - Potential: +1-2 BLEU

2. Checkpoint Averaging
   - Average the weights of the last K checkpoints
   - Potential: +0.5-1 BLEU

3. Reranking
   - Score with language model
   - Choose most fluent translation
   - Potential: +0.5 BLEU
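
Checkpoint averaging is simple enough to sketch without any framework. The version below treats each checkpoint as a dictionary mapping parameter names to flat lists of floats, a stand-in for real tensors, and takes the element-wise mean. With actual PyTorch state dicts the same idea applies per tensor. The function name and data layout are assumptions for illustration.

```python
def average_checkpoints(state_dicts):
    """Element-wise mean of parameter dictionaries that share the same keys."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    if any(sd.keys() != keys for sd in state_dicts):
        raise ValueError("checkpoints have different parameter names")
    k = len(state_dicts)
    averaged = {}
    for name in keys:
        # zip(*...) lines up the i-th value of this parameter across checkpoints.
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(values) / k for values in columns]
    return averaged


last_two = [
    {"w": [1.0, 3.0], "b": [0.0]},
    {"w": [3.0, 5.0], "b": [2.0]},
]
print(average_checkpoints(last_two))
```

Unlike ensembling, the averaged model costs no more at inference time than a single checkpoint.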


ADVANCED TECHNIQUES:
────────────────────

1. Knowledge Distillation
   - Train a smaller model to mimic a larger one
   - Faster inference
   - Maintain quality

2. Quantization
   - INT8 inference
   - 2-4x speedup
   - Minimal quality loss

3. Pre-training
   - Use mBART or mT5
   - Fine-tune on Multi30k
   - Potential: +5-10 BLEU
Complete Project Files
Final Project Structure
translation_project/
│
├── data/
│   ├── multi30k/
│   │   ├── train.de
│   │   ├── train.en
│   │   ├── val.de
│   │   ├── val.en
│   │   ├── test_2016_flickr.de
│   │   └── test_2016_flickr.en
│   │
│   └── tokenizer/
│       └── tokenizer.json
│
├── src/
│   ├── __init__.py
│   │
│   ├── model/
│   │   ├── __init__.py
│   │   ├── attention.py      # Multi-head attention
│   │   ├── embedding.py      # Token + positional embeddings
│   │   ├── encoder.py        # Transformer encoder
│   │   ├── decoder.py        # Transformer decoder
│   │   ├── transformer.py    # Complete model
│   │   └── generation.py     # Beam search, sampling
│   │
│   ├── data/
│   │   ├── __init__.py
│   │   ├── tokenizer.py      # BPE tokenizer
│   │   ├── dataset.py        # Translation dataset
│   │   └── collator.py       # Batch collation
│   │
│   ├── training/
│   │   ├── __init__.py
│   │   ├── trainer.py        # Training loop
│   │   ├── scheduler.py      # LR schedulers
│   │   └── loss.py           # Label smoothing
│   │
│   └── evaluation/
│       ├── __init__.py
│       ├── bleu.py           # BLEU score
│       ├── chrf.py           # ChrF score
│       └── evaluator.py      # Evaluation pipeline
│
├── configs/
│   ├── model_tiny.yaml
│   ├── model_small.yaml
│   ├── model_base.yaml
│   └── training.yaml
│
├── checkpoints/
│   ├── best_model.pt
│   └── checkpoint_epoch30.pt
│
├── logs/
│   └── training_metrics.json
│
├── scripts/
│   ├── train_tokenizer.py
│   ├── preprocess.py
│   └── download_data.py
│
├── train.py                  # Main training script
├── evaluate.py               # Evaluation script
├── translate.py              # Translation script
├── requirements.txt
└── README.md


KEY FILES:
──────────

requirements.txt:
─────────────────
torch>=2.0.0
numpy>=1.21.0
tqdm>=4.62.0
pyyaml>=6.0


README.md highlights:
─────────────────────
- Setup instructions
- Training commands
- Evaluation commands
- Model architecture details
- Results and benchmarks
Course Conclusion
Final Thoughts
WHAT WE ACCOMPLISHED:
─────────────────────

✓ Built a complete Transformer from scratch
✓ Implemented attention mechanisms (self, cross, multi-head)
✓ Created BPE tokenization
✓ Designed encoder-decoder architecture
✓ Implemented multiple decoding strategies
✓ Built a complete training pipeline
✓ Implemented evaluation metrics
✓ Trained a working translation system
✓ Achieved competitive BLEU scores


KEY TAKEAWAYS:
──────────────

1. Understanding > Using
   By building from scratch, you now understand
   WHY transformers work, not just HOW to use them.

2. Attention is Powerful
   The self-attention mechanism enables modeling
   of long-range dependencies efficiently.

3. Training Matters
   Good architecture + bad training = bad results.
   Warmup, learning rate, and regularization are crucial.

4. Engineering Details Count
   Numerical stability, efficient batching, and
   proper initialization significantly impact results.
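
To illustrate why warmup and learning rate belong together, here is the schedule from "Attention Is All You Need" as a small function: the rate rises linearly for `warmup_steps` updates, then decays with the inverse square root of the step. The defaults (d_model = 512, 4000 warmup steps) are the paper's values and are assumed here. The scheduler used in this course may be configured differently.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate with linear warmup followed by inverse-sqrt decay."""
    step = max(step, 1)  # guard against step 0, where step ** -0.5 is undefined
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 2000, 4000, 8000, 16000):
    print(f"step {step:>6}: lr = {noam_lr(step):.6f}")
```

The rate peaks exactly at `warmup_steps`, at about 7e-4 for these defaults. Starting from a tiny rate keeps the first updates small while the optimizer's moment estimates are still unreliable.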


WHERE TO GO FROM HERE:
──────────────────────

1. Reread the "Attention Is All You Need" paper
   - Now it will make much more sense!

2. Explore pre-trained models
   - BERT, GPT, T5, mBART
   - Understand how they extend this foundation

3. Apply to other tasks
   - Summarization
   - Question answering
   - Code generation

4. Study modern improvements
   - Flash Attention
   - Mixture of Experts
   - Retrieval-augmented generation


RESOURCES:
──────────

Papers:
- "Attention Is All You Need" (Vaswani et al., 2017)
- "BERT" (Devlin et al., 2019)
- "GPT-2" (Radford et al., 2019)

Libraries:
- Hugging Face Transformers
- fairseq
- OpenNMT

Courses:
- Stanford CS224N
- CMU Neural Nets for NLP


════════════════════════════════════════════════════════════════════

Congratulations on completing this course!

You now have the foundational knowledge to understand,
implement, and improve transformer-based models.

════════════════════════════════════════════════════════════════════
Summary
Project Deliverables
| Deliverable | Status |
|---|---|
| Complete Transformer implementation | ✓ |
| BPE tokenizer | ✓ |
| Training pipeline | ✓ |
| Evaluation metrics | ✓ |
| Working translation system | ✓ |
| Interactive demo | ✓ |
Performance Achieved
| Metric | Value |
|---|---|
| BLEU | ~30-35 |
| ChrF | ~55-60 |
| Training time | ~2-3 hours |
| Model parameters | ~65M |
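
For reference, the BLEU figure in the table is a corpus-level score built from clipped n-gram precisions and a brevity penalty. The sketch below is a minimal version assuming whitespace tokenization, one reference per hypothesis, and no smoothing. It is not the course's `bleu.py`, and because BLEU is sensitive to tokenization, its scores are not directly comparable to published numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    # Multiset of all n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale (single reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipping: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: only hypotheses shorter than the references are penalized.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


refs = ["the cat sat on the mat"]
print(corpus_bleu(["the cat sat on the mat"], refs))
print(corpus_bleu(["the cat sat on the mat today"], refs))
```

The n-gram counts are accumulated over the whole corpus before the precisions are computed, which is why corpus BLEU is not the average of sentence-level scores.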
Exercises
Final Project Exercises
- Train the model and report your BLEU score.
- Implement one improvement from the suggestions list.
- Compare your results with Hugging Face models.
- Create a simple web demo using Gradio or Streamlit.
- Write a brief report analyzing translation errors.
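
Before wiring up Gradio or Streamlit for the web demo exercise, a terminal loop is often enough to exercise the model end to end. The sketch below assumes a hypothetical `translate` callable that maps a German string to an English string, for example a thin wrapper around the beam search from `translate.py`. The `read_line` and `write` parameters exist only so the loop can be tested without a terminal.

```python
def run_demo(translate, read_line=input, write=print):
    """Translate lines until an empty line, 'quit', or end of input.

    Returns the number of sentences translated.
    """
    write("German-English translation demo (empty line or 'quit' to exit)")
    count = 0
    while True:
        try:
            text = read_line("DE> ").strip()
        except EOFError:
            break  # input stream closed, e.g. Ctrl-D
        if not text or text.lower() == "quit":
            break
        write(f"EN> {translate(text)}")
        count += 1
    return count


# Scripted run with a placeholder "model" that just upper-cases its input.
scripted = iter(["Der Hund läuft im Park.", "quit"])
outputs = []
n_translated = run_demo(
    translate=lambda s: s.upper(),
    read_line=lambda prompt: next(scripted),
    write=outputs.append,
)
print(n_translated, outputs[-1])
```

Replacing the placeholder lambda with the real model call and dropping the `read_line` and `write` arguments gives an interactive session in the terminal.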
Next Chapters Preview: The remaining chapters cover Advanced Topics:
- Chapter 15: Pre-trained Models and Fine-tuning
- Chapter 16: Advanced Architectures
- Chapter 17: Production Deployment