Chapter 14

Interactive Demo and Conclusion

Inference and Demo

Introduction

This section provides an interactive demo of our translation system and summarizes what we've built throughout the course. We'll also discuss potential improvements and next steps.


Interactive Translation Demo

Complete Demo Script

🐍python
def interactive_demo():
    """
    Print a mockup of the demo interface and a table of sample translations.
    """
    print("=" * 70)
    print("   GERMAN-ENGLISH TRANSLATION DEMO")
    print("   Built from scratch using Transformer architecture")
    print("=" * 70)

    print("""
    DEMO INTERFACE:
    ───────────────

    ┌─────────────────────────────────────────────────────────────────┐
    │  German-English Neural Machine Translation                      │
    ├─────────────────────────────────────────────────────────────────┤
    │                                                                 │
    │  Enter German text:                                             │
    │  ┌──────────────────────────────────────────────────────────┐   │
    │  │ Der Hund läuft im Park.                                  │   │
    │  └──────────────────────────────────────────────────────────┘   │
    │                                                                 │
    │  [Translate]  Beam Size: [5 ▼]  Max Length: [128]               │
    │                                                                 │
    │  ─────────────────────────────────────────────────────────────  │
    │                                                                 │
    │  English Translation:                                           │
    │  ┌──────────────────────────────────────────────────────────┐   │
    │  │ The dog runs in the park.                                │   │
    │  └──────────────────────────────────────────────────────────┘   │
    │                                                                 │
    │  Confidence: 0.92    Tokens: 7    Time: 0.15s                   │
    │                                                                 │
    └─────────────────────────────────────────────────────────────────┘


    EXAMPLE TRANSLATIONS:
    ─────────────────────
    """)

    # (German source, English translation) pairs shown in the table below
    examples = [
        ("Der Hund läuft im Park.", "The dog runs in the park."),
        ("Eine Frau liest ein Buch.", "A woman reads a book."),
        ("Das Kind spielt mit einem Ball.", "The child plays with a ball."),
        ("Ein Mann steht vor einem Gebäude.", "A man stands in front of a building."),
        ("Die Katze schläft auf dem Sofa.", "The cat sleeps on the sofa."),
        ("Zwei Männer reden miteinander.", "Two men talk to each other."),
        ("Ein Mädchen fährt Fahrrad.", "A girl rides a bicycle."),
        ("Der Himmel ist heute blau.", "The sky is blue today."),
    ]

    print(f"{'German':<45} {'English':<40}")
    print("-" * 85)

    for de, en in examples:
        print(f"{de:<45} {en:<40}")


interactive_demo()
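
The script above prints a fixed mockup rather than reading input. A loop that actually accepts sentences could be sketched roughly as follows. The names `run_demo_loop`, `translate_fn`, and `stub_translate` are hypothetical and do not come from the course code; `translate_fn` stands in for whatever function wraps the trained model, and the stub is only a lookup table so the loop can run without a checkpoint.

```python
import time
from typing import Callable, Iterable, List


def run_demo_loop(
    translate_fn: Callable[[str], str],
    lines: Iterable[str],
) -> List[str]:
    """Translate each non-empty input line until 'quit' or 'exit' is seen.

    translate_fn is a hypothetical stand-in for the trained model's
    translation call; it is not defined in this section.
    """
    outputs = []
    for raw in lines:
        text = raw.strip()
        if text.lower() in {"quit", "exit"}:
            break
        if not text:
            continue  # skip blank input
        start = time.perf_counter()
        translation = translate_fn(text)
        elapsed = time.perf_counter() - start
        print(f"DE: {text}")
        print(f"EN: {translation}  ({elapsed:.2f}s)")
        outputs.append(translation)
    return outputs


def stub_translate(text: str) -> str:
    """Placeholder lookup so the loop runs without a model (assumption)."""
    table = {"Der Hund läuft im Park.": "The dog runs in the park."}
    return table.get(text, "<no translation available>")


# A list stands in for user input here; sys.stdin could be passed instead.
run_demo_loop(stub_translate, ["Der Hund läuft im Park.", "quit"])
```

Taking the input as an iterable keeps the loop easy to exercise with a list, and the same function works unchanged when given `sys.stdin`.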

Sample Translation Analysis

Analyzing Model Outputs

🐍python
def translation_analysis():
    """
    Analyze translation quality with examples.
    """
    print("Translation Analysis")
    print("=" * 60)

    print("""
    GOOD TRANSLATIONS:
    ──────────────────

    Example 1 (Simple sentence):
    DE: "Ein Hund läuft durch den Schnee."
    EN: "A dog runs through the snow."
    ✓ Correct meaning, fluent English

    Example 2 (Complex structure):
    DE: "Die Frau mit dem roten Kleid tanzt."
    EN: "The woman with the red dress dances."
    ✓ Prepositional phrase preserved correctly

    Example 3 (Coordinated verbs):
    DE: "Ein Kind isst Eis und lacht."
    EN: "A child eats ice cream and laughs."
    ✓ Coordination handled well


    CHALLENGING CASES:
    ──────────────────

    Example 4 (Long noun):
    DE: "Der Schmetterling fliegt über die Blumen."
    EN: "The butterfly flies over the flowers."
    ✓ Long noun "Schmetterling" translated correctly

    Example 5 (Word order):
    DE: "Im Park spielt ein Kind."
    EN: "A child plays in the park."
    ✓ German V2 word order converted to English SVO

    Example 6 (Negation):
    DE: "Der Mann kann nicht schwimmen."
    EN: "The man cannot swim."
    ✓ Negation position adjusted for English


    TYPICAL ERRORS:
    ───────────────

    Error Type 1 - Rare words:
    DE: "Der Architekt entwirft ein Gebäude."
    Model: "The man designs a building."
    Better: "The architect designs a building."
    Issue: "Architekt" may be rare in training data

    Error Type 2 - Ambiguity:
    DE: "Sie sieht sie."
    Model: "She sees you."
    Correct: "She sees them." or "She sees her."
    Issue: "sie" is ambiguous (she/they/them/her)

    Error Type 3 - Long sentences:
    DE: "Der Mann, der einen blauen Hut trägt und neben dem Baum steht, liest eine Zeitung."
    Model: "The man who wears a blue hat reads a newspaper."
    Issue: Part of the source is dropped ("neben dem Baum steht")


    BLEU SCORE INTERPRETATION:
    ──────────────────────────

    Score: 32.5 BLEU on Multi30k test set

    What this means:
    - Comparable to published baselines
    - Most translations are understandable
    - Some fluency issues remain
    - Good starting point for improvements
    """)


translation_analysis()
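
Dropped content of the kind shown in Error Type 3 can sometimes be surfaced automatically by comparing output length to source length. The following is a minimal sketch of that idea, not part of the course code: `flag_short_outputs` is a hypothetical helper, and the 0.7 threshold is an illustrative assumption rather than a tuned value. A word-count ratio is a crude signal, so flagged sentences still need manual review.

```python
from typing import List, Tuple


def flag_short_outputs(
    pairs: List[Tuple[str, str]],
    min_ratio: float = 0.7,
) -> List[Tuple[str, str, float]]:
    """Return (source, output, ratio) for outputs that look too short.

    ratio = output word count / source word count. The default threshold
    is an illustrative assumption.
    """
    flagged = []
    for source, output in pairs:
        src_len = len(source.split())
        if src_len == 0:
            continue  # nothing to compare against
        ratio = len(output.split()) / src_len
        if ratio < min_ratio:
            flagged.append((source, output, ratio))
    return flagged


pairs = [
    ("Der Mann, der einen blauen Hut trägt und neben dem Baum steht, "
     "liest eine Zeitung.",
     "The man who wears a blue hat reads a newspaper."),
    ("Ein Hund läuft durch den Schnee.", "A dog runs through the snow."),
]
for source, output, ratio in flag_short_outputs(pairs):
    print(f"{ratio:.2f}  {output}")
```

On these two pairs only the long sentence is flagged (10 output words for 15 source words), which matches the missing tree detail.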

Project Summary

What We Built

📝text
COMPONENTS IMPLEMENTED:
───────────────────────

Chapters 1-3: Attention Mechanism
├── Scaled Dot-Product Attention
├── Multi-Head Attention
└── Self-Attention and Cross-Attention

Chapters 4-5: Embeddings and Tokenization
├── Token Embeddings
├── Positional Encoding (sinusoidal)
└── BPE Tokenizer (from scratch)

Chapter 6: Feed-Forward Networks
├── Position-wise FFN
├── Layer Normalization
└── Residual Connections

Chapters 7-8: Transformer Architecture
├── Encoder Stack
├── Decoder Stack (with masking)
└── Complete Encoder-Decoder Model

Chapter 9: Generation
├── Greedy Decoding
├── Beam Search
├── Sampling Strategies
└── KV Caching

Chapters 10-11: Training and Evaluation
├── Label Smoothing Loss
├── Learning Rate Scheduling
├── Training Loop
├── BLEU Score
└── ChrF Score

Chapters 12-14: Complete Project
├── Multi30k Dataset Pipeline
├── Full Training Script
└── Inference and Demo


LINES OF CODE (approximate):
────────────────────────────

Attention mechanisms:   ~300 lines
Embeddings/Encoding:    ~200 lines
FFN/Normalization:      ~150 lines
Encoder/Decoder:        ~400 lines
Generation:             ~350 lines
Training pipeline:      ~500 lines
Evaluation metrics:     ~400 lines
Data processing:        ~400 lines
─────────────────────────────
Total:                  ~2700 lines


MODEL STATISTICS:
─────────────────

Architecture: Transformer-base
Parameters: ~65 million
Training time: ~2-3 hours (GPU)
Final BLEU: ~30-35


KEY LEARNINGS:
──────────────

1. Attention is all you need (for this task)
2. Proper initialization matters
3. Warmup is crucial for stability
4. Subword tokenization handles out-of-vocabulary (OOV) words
5. Beam search beats greedy decoding
Potential Improvements

Next Steps

📝text
ARCHITECTURE IMPROVEMENTS:
──────────────────────────

1. Pre-LayerNorm
   - Move LayerNorm before attention/FFN
   - More stable training
   - Potential: +0.5 BLEU

2. Relative Position Encoding
   - Better generalization to longer sequences
   - Used in models such as Transformer-XL and T5
   - Potential: +0.3 BLEU

3. Rotary Position Embeddings (RoPE)
   - Widely used in recent large language models
   - Better extrapolation
   - Potential: +0.5 BLEU


TRAINING IMPROVEMENTS:
──────────────────────

1. Larger Batch Size
   - More stable gradients
   - Better hardware utilization
   - Use gradient accumulation if memory is limited
   - Potential: +0.5 BLEU

2. Back-translation
   - Translate monolingual English text into German with a reverse (EN→DE) model
   - Add the synthetic DE-EN pairs to the training data
   - Potential: +2-3 BLEU

3. Label Smoothing Tuning
   - Try values 0.05-0.2
   - May improve generalization
   - Potential: +0.3 BLEU


DATA IMPROVEMENTS:
──────────────────

1. More Training Data
   - Add WMT data (millions of pairs)
   - Significantly improves quality
   - Potential: +5-10 BLEU

2. Data Cleaning
   - Remove misaligned pairs
   - Filter with a language detector
   - Potential: +0.5 BLEU

3. SentencePiece Tokenization
   - Subword segmentation trained directly on raw text
   - May be more robust to typos
   - Potential: +0.3 BLEU


INFERENCE IMPROVEMENTS:
───────────────────────

1. Ensemble Decoding
   - Average the output distributions of multiple models
   - Potential: +1-2 BLEU

2. Checkpoint Averaging
   - Average the weights of the last K checkpoints
   - Potential: +0.5-1 BLEU

3. Reranking
   - Score beam candidates with a language model
   - Choose the most fluent translation
   - Potential: +0.5 BLEU


ADVANCED TECHNIQUES:
────────────────────

1. Knowledge Distillation
   - Train a smaller model from a larger one
   - Faster inference
   - Maintain quality

2. Quantization
   - INT8 inference
   - 2-4x speedup
   - Minimal quality loss

3. Pre-training
   - Use mBART or mT5
   - Fine-tune on Multi30k
   - Potential: +5-10 BLEU
Complete Project Files

Final Project Structure

📝text
translation_project/
│
├── data/
│   ├── multi30k/
│   │   ├── train.de
│   │   ├── train.en
│   │   ├── val.de
│   │   ├── val.en
│   │   ├── test_2016_flickr.de
│   │   └── test_2016_flickr.en
│   │
│   └── tokenizer/
│       └── tokenizer.json
│
├── src/
│   ├── __init__.py
│   │
│   ├── model/
│   │   ├── __init__.py
│   │   ├── attention.py          # Multi-head attention
│   │   ├── embedding.py          # Token + positional embeddings
│   │   ├── encoder.py            # Transformer encoder
│   │   ├── decoder.py            # Transformer decoder
│   │   ├── transformer.py        # Complete model
│   │   └── generation.py         # Beam search, sampling
│   │
│   ├── data/
│   │   ├── __init__.py
│   │   ├── tokenizer.py          # BPE tokenizer
│   │   ├── dataset.py            # Translation dataset
│   │   └── collator.py           # Batch collation
│   │
│   ├── training/
│   │   ├── __init__.py
│   │   ├── trainer.py            # Training loop
│   │   ├── scheduler.py          # LR schedulers
│   │   └── loss.py               # Label smoothing
│   │
│   └── evaluation/
│       ├── __init__.py
│       ├── bleu.py               # BLEU score
│       ├── chrf.py               # ChrF score
│       └── evaluator.py          # Evaluation pipeline
│
├── configs/
│   ├── model_tiny.yaml
│   ├── model_small.yaml
│   ├── model_base.yaml
│   └── training.yaml
│
├── checkpoints/
│   ├── best_model.pt
│   └── checkpoint_epoch30.pt
│
├── logs/
│   └── training_metrics.json
│
├── scripts/
│   ├── train_tokenizer.py
│   ├── preprocess.py
│   └── download_data.py
│
├── train.py                      # Main training script
├── evaluate.py                   # Evaluation script
├── translate.py                  # Translation script
├── requirements.txt
└── README.md


KEY FILES:
──────────

requirements.txt:
─────────────────
torch>=2.0.0
numpy>=1.21.0
tqdm>=4.62.0
pyyaml>=6.0


README.md highlights:
─────────────────────
- Setup instructions
- Training commands
- Evaluation commands
- Model architecture details
- Results and benchmarks
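
As an illustration of how `translate.py` might expose the options shown in the demo mockup, here is a sketch of its command-line parsing using the standard library's `argparse`. The flag names and `parse_args` wrapper are assumptions, not the course's actual interface. The default paths are taken from the project tree above, and the defaults of 5 and 128 mirror the beam size and maximum length in the mockup.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI for translate.py; flag names are assumptions."""
    parser = argparse.ArgumentParser(description="Translate German to English.")
    parser.add_argument("text", help="German sentence to translate")
    parser.add_argument("--checkpoint", default="checkpoints/best_model.pt")
    parser.add_argument("--tokenizer", default="data/tokenizer/tokenizer.json")
    parser.add_argument("--beam-size", type=int, default=5)
    parser.add_argument("--max-length", type=int, default=128)
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse and validate arguments; argv=None falls back to sys.argv."""
    args = build_parser().parse_args(argv)
    if args.beam_size < 1:
        raise SystemExit("--beam-size must be >= 1")
    return args


# Passing an explicit list makes the parser easy to try without a shell.
args = parse_args(["Der Hund läuft im Park.", "--beam-size", "3"])
print(args.beam_size, args.max_length, args.checkpoint)
```

Loading the checkpoint and running beam search would follow from these parsed values; that part depends on the model code from the earlier chapters and is left out here.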

Course Conclusion

Final Thoughts

📝text
WHAT WE ACCOMPLISHED:
─────────────────────

✓ Built a complete Transformer from scratch
✓ Implemented attention mechanisms (self, cross, multi-head)
✓ Created BPE tokenization
✓ Designed encoder-decoder architecture
✓ Implemented multiple decoding strategies
✓ Built a complete training pipeline
✓ Implemented evaluation metrics
✓ Trained a working translation system
✓ Achieved competitive BLEU scores


KEY TAKEAWAYS:
──────────────

1. Understanding > Using
   By building from scratch, you now understand
   WHY transformers work, not just HOW to use them.

2. Attention is Powerful
   Self-attention connects any two positions in a single
   step, which makes long-range dependencies easier to model.

3. Training Matters
   Good architecture + bad training = bad results.
   Warmup, learning rate, and regularization are crucial.

4. Engineering Details Count
   Numerical stability, efficient batching, and
   proper initialization significantly impact results.


WHERE TO GO FROM HERE:
──────────────────────

1. Reread the "Attention Is All You Need" paper
   - Now it will make much more sense!

2. Explore pre-trained models
   - BERT, GPT, T5, mBART
   - Understand how they extend this foundation

3. Apply to other tasks
   - Summarization
   - Question answering
   - Code generation

4. Study modern improvements
   - Flash Attention
   - Mixture of Experts
   - Retrieval-augmented generation


RESOURCES:
──────────

Papers:
- "Attention Is All You Need" (Vaswani et al., 2017)
- "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2019)
- "Language Models are Unsupervised Multitask Learners" (GPT-2; Radford et al., 2019)

Libraries:
- Hugging Face Transformers
- fairseq
- OpenNMT

Courses:
- Stanford CS224N
- CMU Neural Nets for NLP


════════════════════════════════════════════════════════════════════

Congratulations on completing this course!

You now have the foundational knowledge to understand,
implement, and improve transformer-based models.

════════════════════════════════════════════════════════════════════

Summary

Project Deliverables

Deliverable                          | Status
Complete Transformer implementation  | ✓
BPE tokenizer                        | ✓
Training pipeline                    | ✓
Evaluation metrics                   | ✓
Working translation system           | ✓
Interactive demo                     | ✓

Performance Achieved

Metric            | Value
BLEU              | ~30-35
ChrF              | ~55-60
Training time     | ~2-3 hours
Model parameters  | ~65M
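
The parameter figure can be sanity-checked with a little arithmetic. The sketch below counts weights and biases for a Transformer-base configuration under stated assumptions: `d_model=512`, `d_ff=2048`, 6 encoder and 6 decoder layers, sinusoidal positions with no learned parameters, and a single 37,000-entry embedding matrix shared by the source embedding, target embedding, and output projection (the vocabulary size used in the original paper). The course's own tokenizer and weight-sharing choices may differ, which would change the total.

```python
def transformer_param_count(
    vocab_size: int = 37000,
    d_model: int = 512,
    d_ff: int = 2048,
    n_layers: int = 6,
) -> int:
    """Count weights and biases of a post-LN encoder-decoder Transformer.

    Assumes one embedding matrix shared by the source embedding, target
    embedding, and output projection, and no learned positional parameters.
    """
    # Q, K, V, and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff -> d_model, with biases.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norm = 2 * d_model  # scale and bias
    encoder_layer = attention + ffn + 2 * layer_norm
    # Decoder layers add a cross-attention block and a third LayerNorm.
    decoder_layer = 2 * attention + ffn + 3 * layer_norm
    embeddings = vocab_size * d_model
    return n_layers * (encoder_layer + decoder_layer) + embeddings


print(f"{transformer_param_count():,}")  # 63,082,496
```

Under these assumptions the count comes to roughly 63 million, in line with the ~65M figure; about 19 million of that is the embedding matrix, so the total is quite sensitive to vocabulary size.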

Exercises

Final Project Exercises

  • Train the model and report your BLEU score.
  • Implement one improvement from the suggestions list.
  • Compare your results with a pre-trained German-English model from Hugging Face on the same test set.
  • Create a simple web demo using Gradio or Streamlit.
  • Write a brief report analyzing translation errors.

Next Chapters Preview: The remaining chapters cover advanced topics:

  • Chapter 15: Pre-trained Models and Fine-tuning
  • Chapter 16: Advanced Architectures
  • Chapter 17: Production Deployment