Introduction
Now that you understand why transformers matter and how they've revolutionized AI, it's time to chart our learning path. This section provides a detailed overview of the course structure and introduces the end-to-end machine translation project that will serve as our hands-on practice.
By the end of this course, you'll have implemented a complete German-to-English translation system from scratch: no high-level wrappers, no magic black boxes. Just you, PyTorch, and the fundamental principles of transformers.
Your Learning Journey
Here's your transformation from "transformers seem magical" to "I built one myself":
YOUR TRANSFORMER JOURNEY

| Stage | Milestone | Chapters | Time |
|---|---|---|---|
| Start | "What is attention?" | Ch 1-2 | ~1 week |
| | "I see how Q, K, V work" | Ch 3-4 | ~1 week |
| | "I built the encoder-decoder" | Ch 7-8 | ~1 week |
| | "My model learns! BLEU: 30+" | Ch 10-14 | ~2 weeks |
| Finish | "I understand deeply and can modify/extend" | Ch 15-17 | Optional |

Total: 4-8 weeks depending on your background.

Before vs. After This Course
| Aspect | Before | After |
|---|---|---|
| Understanding | "Transformers use attention somehow" | "I can derive the attention formula and explain why scaling matters" |
| Reading Papers | Skip math, hope for intuition | Understand equations, implement from description |
| Using Libraries | Copy-paste, pray it works | Know what's inside, debug confidently |
| Job Interviews | "I've used HuggingFace..." | "I implemented transformers from scratch, here's my BLEU score" |
| Building Projects | Limited to tutorials | Can architect custom transformer variants |
| Debugging | Random guessing | "Check mask shapes, verify attention patterns" |
The Goal: You won't just be able to use transformers; you'll be able to explain them, modify them, debug them, and build them for your specific needs.
4.1 Course Philosophy
Learning by Building
This course follows a constructionist approach:
The best way to understand something is to build it.
We won't just read about attention mechanisms; we'll implement them tensor by tensor. We won't just discuss positional encoding; we'll visualize why sinusoidal functions work.
Progressive Complexity
```
Chapter 1-4:   Foundations (Attention, Multi-Head, Position, Embeddings)
      ↓
Chapter 5-6:   Building Blocks (Tokenization, FFN, Normalization)
      ↓
Chapter 7-9:   Core Architecture (Encoder, Decoder, Generation)
      ↓
Chapter 10-11: Training & Evaluation (Pipeline, BLEU)
      ↓
Chapter 12-14: Complete Project (Data, Training, Analysis)
      ↓
Chapter 15-17: Advanced Topics (Variants, Modern Methods, Production)
```

Time Investment by Part
| Part | Time Commitment |
|---|---|
| Part 1: Foundations | ~2 weeks |
| Part 2: Building Blocks | ~1 week |
| Part 3: Core Architecture | ~1 week |
| Part 4: Training/Evaluation | ~1 week |
| Part 5: Full Project | ~2 weeks |
| Part 6: Advanced | Optional |

Budget roughly 5-20 hours per week, depending on your pace.
Each chapter builds on the previous, and by Chapter 14, all components come together in a working system.
From Scratch, Then Compare
We implement everything from basic PyTorch operations:
```python
import math

import torch
import torch.nn.functional as F

# What we DO:
def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V), attention_weights

# What we DON'T do (until comparison at the end):
# output = nn.MultiheadAttention(embed_dim, num_heads)(query, key, value)
```

Only after understanding how components work do we compare with PyTorch's built-in implementations.
4.2 Chapter-by-Chapter Roadmap
Part 1: Foundations (Chapters 1-4)
Chapter 1: Introduction to Transformers ✓
You are here!
- Evolution of sequence modeling
- The Transformer revolution
- Applications and variants
- Course roadmap
Chapter 2: Attention Mechanism From Scratch
Core Learning: Build attention from first principles.
What you'll implement:
- Dot-product similarity
- Softmax attention weights
- Weighted value aggregation
- Scaling factor derivation
- Masking mechanisms

Key Outputs:
- `attention.py` module with reusable attention functions
- Visualizations showing attention patterns
Quick Check: Can you explain why we divide by √d_k in attention? What would happen if we didn't?
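As a taste of the answer, here's a minimal numerical sketch (toy values, not course code) of what happens to softmax when scores aren't scaled:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512

# Dot products of random d_k-dim vectors have variance ~ d_k,
# so raw scores are large and softmax saturates toward one-hot.
q = torch.randn(1, d_k)
k = torch.randn(10, d_k)

raw = q @ k.T                # unscaled scores, std ~ sqrt(d_k)
scaled = raw / d_k ** 0.5    # scaled scores, std ~ 1

p_raw = F.softmax(raw, dim=-1)
p_scaled = F.softmax(scaled, dim=-1)

print(p_raw.max().item())     # typically close to 1: nearly one-hot, tiny gradients
print(p_scaled.max().item())  # noticeably flatter distribution
```

Dividing by √d_k keeps the scores at unit scale, so the softmax stays soft and gradients flow to every position.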
Chapter 3: Multi-Head Attention
Core Learning: Parallel attention heads for richer representations.
What you'll implement:

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # Projection matrices
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)
```

Key Insights:
- Why multiple heads help (different relationship types)
- Efficient implementation via reshaping
- Head-by-head attention visualization
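The reshaping trick above can be sketched in a few lines (the shapes here are illustrative assumptions, not the course's fixed settings):

```python
import torch

batch, seq_len, d_model, num_heads = 2, 6, 512, 8
d_k = d_model // num_heads  # each head sees a 64-dim slice

x = torch.randn(batch, seq_len, d_model)

# Split the model dimension into heads, then move heads before seq_len
# so every head attends independently: (batch, heads, seq_len, d_k)
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 6, 64])

# Merging back is the exact inverse: transpose, then collapse heads
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
assert torch.equal(merged, x)  # round-trip loses nothing
```

No extra computation is needed: one big matrix multiply plus two reshapes gives you all heads in parallel.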
Chapter 4: Positional Encoding and Embeddings
Core Learning: How transformers understand sequence order.
What you'll implement:
- Sinusoidal positional encoding
- Learned positional embeddings
- Token embedding layer
- Combined input representations

Visualizations:
- Sinusoidal patterns across dimensions
- Position similarity matrices
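For a preview, a minimal implementation of the sinusoidal encoding (following the standard formula from the original paper) looks like this:

```python
import math
import torch

def sinusoidal_encoding(max_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=64)
print(pe.shape)   # torch.Size([128, 64])
print(pe[0, :4])  # position 0: sin(0)=0 and cos(0)=1 alternating
```

Plotting `pe` as a heatmap reproduces the striped patterns you'll study in this chapter.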
Part 2: Building Blocks (Chapters 5-6)
Chapter 5: Subword Tokenization for Translation
Core Learning: BPE and vocabulary creation for translation.
What you'll implement:

```python
class BPETokenizer:
    def train(self, corpus, vocab_size):
        """Learn merge rules from data."""

    def encode(self, text):
        """Convert text to token IDs."""

    def decode(self, token_ids):
        """Convert token IDs back to text."""
```

Why It Matters:
- Handles unknown words gracefully
- Balances vocabulary size with representation
- Essential for multilingual models
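The core of BPE training, counting the most frequent adjacent symbol pair, fits in a few lines. The toy corpus below is a made-up example, not Multi30k data:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus (BPE's inner loop)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]

# Toy corpus: each word is a tuple of symbols with an end-of-word marker
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
}
pair, count = most_frequent_pair(corpus)
print(pair, count)  # ('w', 'e') 8
```

Full BPE repeats this: merge the winning pair everywhere, record the rule, and count again until the vocabulary reaches the target size.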
Chapter 6: Feed-Forward Networks and Normalization
Core Learning: The "processing" layers between attention.
What you'll implement:

```python
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

class LayerNorm(nn.Module):
    """From scratch, understanding why it helps training."""
```

Topics Covered:
- ReLU/GELU activation functions
- Why 4× expansion in FFN
- Pre-norm vs. post-norm
- Residual connections
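As a preview of how residuals and normalization combine, here is a sketch of a pre-norm sublayer wrapper (one of the two variants the chapter compares; the class name is illustrative):

```python
import torch
import torch.nn as nn

class PreNormSublayer(nn.Module):
    """Residual connection around any sublayer, normalizing the input first.
    Post-norm would instead compute norm(x + sublayer(x))."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # x keeps an untouched path to the output: the residual connection
        return x + self.dropout(sublayer(self.norm(x)))

# Usage with a stand-in sublayer (a simple linear map)
layer = PreNormSublayer(d_model=32)
x = torch.randn(4, 10, 32)
out = layer(x, nn.Linear(32, 32))
print(out.shape)  # torch.Size([4, 10, 32])
```

The same wrapper serves both the attention and feed-forward sublayers, which is why encoder layers end up so compact.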
Part 3: Core Architecture (Chapters 7-9)
Chapter 7: Transformer Encoder
Core Learning: Complete encoder implementation.
What you'll implement:

```python
class EncoderLayer(nn.Module):
    """Self-attention + FFN + residuals + normalization."""

class TransformerEncoder(nn.Module):
    """Stack of N encoder layers."""
```

Architecture:
```
Input embeddings + positions
          ↓
[ Self-Attention → Add & Norm → Feed-Forward → Add & Norm ]  × N layers
          ↓
    Encoder output
```

Chapter 8: Transformer Decoder
Core Learning: Decoder with masked self-attention and cross-attention.
What you'll implement:

```python
class DecoderLayer(nn.Module):
    def forward(self, x, encoder_output, src_mask, tgt_mask):
        # Masked self-attention
        x = self.self_attention(x, x, x, tgt_mask)
        # Cross-attention to encoder
        x = self.cross_attention(x, encoder_output, encoder_output, src_mask)
        # Feed-forward
        x = self.feed_forward(x)
        return x
```

Key Concepts:
- Causal masking (preventing future peeking)
- Cross-attention mechanism
- Teacher forcing during training
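Causal masking is surprisingly small in code. A minimal sketch of building and applying it:

```python
import torch

def make_causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

mask = make_causal_mask(4)
print(mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])

# Applying it: blocked positions get -inf so softmax assigns them zero weight
scores = torch.randn(4, 4)
scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights[0])  # first token can only attend to itself: [1, 0, 0, 0]
```

This is exactly the "preventing future peeking" idea: row i of the weights is zero everywhere past column i.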
Chapter 9: Autoregressive Generation and Inference
Core Learning: How decoders generate text token by token.
What you'll implement:

```python
def greedy_decode(model, encoder_output, max_len):
    """Generate one token at a time, feeding back predictions."""

def beam_search(model, encoder_output, beam_width, max_len):
    """Maintain multiple hypotheses for better results."""
```

Topics:
- Greedy vs. beam search
- Temperature sampling
- Top-k and nucleus (top-p) sampling
- KV caching for efficient inference
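As a preview, temperature and top-k sampling combine into one small function. This is a sketch with toy logits; the chapter builds it properly against a real model:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample a token id from logits with temperature and optional top-k."""
    logits = logits / temperature  # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        topk_vals, _ = torch.topk(logits, top_k)
        # Zero out (via -inf) everything below the k-th largest logit
        logits = logits.masked_fill(
            logits < topk_vals[..., -1, None], float("-inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -2.0]])
token = sample_next_token(logits, temperature=0.8, top_k=2)
print(token.item())  # with top_k=2, only token 0 or 1 can be drawn
```

Nucleus (top-p) sampling follows the same pattern but truncates by cumulative probability instead of a fixed count.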
Part 4: Training & Evaluation (Chapters 10-11)
Chapter 10: Training Pipeline for Translation
Core Learning: Complete training loop with best practices.
What you'll implement:

```python
class TransformerTrainer:
    def train_epoch(self, dataloader):
        for batch in dataloader:
            self.optimizer.zero_grad()
            loss = self.compute_loss(batch)
            loss.backward()
            self.optimizer.step()
            self.scheduler.step()
```

Topics:
- Cross-entropy loss with label smoothing
- Adam optimizer with warmup
- Learning rate scheduling
- Gradient clipping
- Mixed precision training
- Checkpointing
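The warmup-plus-decay schedule from "Attention Is All You Need" (often called the Noam schedule) is compact enough to preview here:

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from 'Attention Is All You Need':
    linear warmup for warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises during warmup, peaks at warmup_steps, then decays
lrs = [noam_lr(s) for s in (100, 4000, 40000)]
print(lrs)
assert lrs[1] > lrs[0] and lrs[1] > lrs[2]
```

In practice you'd plug this into a PyTorch `LambdaLR` scheduler; the course's trainer chapter shows the full wiring.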
Chapter 11: Evaluation Metrics for Translation
Core Learning: Measuring translation quality.
What you'll implement:

```python
def compute_bleu(references, hypotheses, max_n=4):
    """Compute BLEU score from scratch."""

def corpus_bleu(references, hypotheses):
    """Corpus-level BLEU with brevity penalty."""
```

Metrics Covered:
- BLEU (n-gram precision)
- METEOR
- chrF
- Human evaluation considerations
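One piece of BLEU, the brevity penalty, is simple enough to preview here:

```python
import math

def brevity_penalty(ref_len, hyp_len):
    """BLEU's brevity penalty: 1.0 if the hypothesis is at least as long
    as the reference, otherwise an exponential penalty for being short."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)

print(brevity_penalty(10, 10))  # 1.0: no penalty
print(brevity_penalty(10, 5))   # exp(1 - 2) = exp(-1), about 0.368
```

Without this term, a model could game n-gram precision by emitting only the few words it is most confident about.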
Part 5: Complete Project (Chapters 12-14)
Chapter 12: Project - Data Pipeline
The Dataset: Multi30k German-English parallel corpus
```
# ~30,000 sentence pairs
# Example:
german:  "Ein Mann in einem blauen Hemd steht auf einer Leiter."
english: "A man in a blue shirt is standing on a ladder."
```

What You'll Build:
- Data loading and preprocessing
- BPE vocabulary training
- Batching with dynamic padding
- Data augmentation techniques
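Batching with dynamic padding can be sketched with PyTorch's `pad_sequence`. The `PAD_ID` value and collate signature below are illustrative assumptions, not the course's exact code:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed padding token id

def collate(batch):
    """Pad variable-length token-id sequences to the longest in the batch."""
    padded = pad_sequence(batch, batch_first=True, padding_value=PAD_ID)
    mask = (padded != PAD_ID)  # True where there is a real token
    return padded, mask

batch = [torch.tensor([5, 6, 7]), torch.tensor([8, 9]), torch.tensor([3])]
padded, mask = collate(batch)
print(padded.shape)  # torch.Size([3, 3])
print(padded)        # shorter sequences are padded with PAD_ID on the right
```

Passing a function like this as `collate_fn` to a `DataLoader` lets each batch pad only to its own longest sentence instead of a global maximum.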
Chapter 13: Project - Model and Training
The Complete Model:
```python
class Transformer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.decoder = TransformerDecoder(config)
        self.src_embed = Embeddings(config)
        self.tgt_embed = Embeddings(config)
        self.generator = nn.Linear(config.d_model, config.tgt_vocab_size)
```

Training Goals:
- Target BLEU: 30-35 on test set
- Training time: 2-4 hours on single GPU
- Model size: ~30M parameters
Chapter 14: Project - Evaluation and Analysis
Analysis You'll Perform:
- Quantitative evaluation (BLEU, loss curves)
- Attention visualizations
- Error analysis and failure cases
- Comparison with baseline and PyTorch built-ins
Part 6: Advanced Topics (Chapters 15-17)
Chapter 15: Advanced Architectures and Techniques
- Relative position encodings (T5, ALiBi)
- Rotary position embeddings (RoPE)
- Sparse attention patterns
- Mixture of Experts
Chapter 16: Modern Transformer Variants
- Vision Transformers
- Decoder-only LLMs (GPT-style)
- BERT-style pre-training
- Flash Attention
Chapter 17: Production and Deployment
- Model quantization
- ONNX export
- KV caching optimization
- Serving architectures
4.3 The Translation Project
Why Machine Translation?
Translation is the ideal learning project for transformers because:
- Full Architecture: Uses complete encoder-decoder (not just encoder or decoder)
- Clear Evaluation: BLEU scores provide objective measurement
- Visible Results: You can immediately see if translations make sense
- Historical Significance: The original transformer was designed for translation
- Reasonable Scale: Can train on a laptop/single GPU
Dataset: Multi30k
Multi30k is a multilingual extension of the Flickr30k image captioning dataset:
| Split | Sentences | Avg. Length (DE) | Avg. Length (EN) |
|---|---|---|---|
| Train | 29,000 | 12.1 tokens | 13.0 tokens |
| Valid | 1,014 | 12.0 tokens | 12.8 tokens |
| Test | 1,000 | 11.6 tokens | 12.5 tokens |
Sample Pairs:
```
DE: Eine Gruppe von Menschen steht vor einem Iglu.
EN: A group of people stands in front of an igloo.

DE: Ein kleines Mädchen klettert in ein Spielhaus aus Holz.
EN: A little girl climbing into a wooden playhouse.

DE: Ein Mann in einem blauen Hemd steht auf einer Leiter.
EN: A man in a blue shirt is standing on a ladder.
```

Model Specifications
| Hyperparameter | Value |
|---|---|
| d_model | 256-512 |
| Encoder layers | 4-6 |
| Decoder layers | 4-6 |
| Attention heads | 8 |
| d_ff | 4 × d_model |
| Vocabulary (BPE) | ~10,000 |
| Max sequence length | 128 |
| Dropout | 0.1-0.3 |
| Batch size | 64-128 |
| Learning rate | 1e-4 to 3e-4 |
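One convenient way to carry these hyperparameters around is a config dataclass. The concrete values below are one pick from the ranges in the table, not the course's canonical setting:

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Illustrative hyperparameter bundle for the translation project."""
    d_model: int = 512
    num_encoder_layers: int = 6
    num_decoder_layers: int = 6
    num_heads: int = 8
    d_ff: int = 2048            # 4 * d_model
    vocab_size: int = 10000     # BPE vocabulary
    max_seq_len: int = 128
    dropout: float = 0.1
    batch_size: int = 64
    learning_rate: float = 3e-4

cfg = TransformerConfig()
assert cfg.d_ff == 4 * cfg.d_model
print(cfg.d_model, cfg.num_heads)
```

A dataclass like this maps cleanly onto the `configs/base_config.yaml` file in the project layout later in this chapter.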
Expected Results
BLEU Score Progression:
```
Epoch 1:   5-10  (learning basic patterns)
Epoch 5:   15-20 (capturing word relationships)
Epoch 10:  25-30 (good fluency, some errors)
Epoch 20:  30-35 (our target)
Epoch 30+: 35-40 (diminishing returns)
```

Watch Your Model Learn
This is what makes the project satisfying: you can literally watch your model go from nonsense to fluent translations:
```
YOUR MODEL'S LEARNING JOURNEY

Input: "Ein Mann liest ein Buch"

Epoch 1:  "A the the the the"        ❌ Gibberish
          └─ Model just repeating common tokens

Epoch 5:  "A man is a book"          ⚠️ Words present, wrong order
          └─ Learning word mappings, not structure

Epoch 10: "A man reading a book"     ⚠️ Close! Wrong verb form
          └─ Getting syntax, some errors

Epoch 15: "A man reads a book"       ✅ Correct!
          └─ Learned the pattern

Epoch 20: "A man in a blue shirt     ✅ Handles complexity!
           reads a book in the park"
          └─ Generalizes to longer sentences
```

The Magic Moment: There's a specific epoch where your model goes from producing garbage to producing readable translations. You'll feel like a proud parent.
4.4 Your First Win (Try Now!)
Don't wait until Chapter 2; let's get a small win right now. Run this code to see attention in action:
```python
# 🎯 YOUR FIRST WIN: See attention patterns in 60 seconds
# Copy this into a Python file or Jupyter notebook

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Simulate a mini attention computation
sentence = ["The", "cat", "sat", "on", "the", "mat"]
seq_len = len(sentence)

# Random Q, K, V matrices (normally these come from embeddings)
d_k = 8
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

# THE ATTENTION FORMULA (you'll implement this properly in Chapter 2)
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
attention_weights = F.softmax(scores, dim=-1)

# Visualize: which words attend to which?
plt.figure(figsize=(8, 6))
plt.imshow(attention_weights.detach().numpy(), cmap='Blues')
plt.xticks(range(seq_len), sentence, rotation=45)
plt.yticks(range(seq_len), sentence)
plt.xlabel('Keys (attending TO)')
plt.ylabel('Queries (attending FROM)')
plt.title('Your First Attention Map!')
plt.colorbar(label='Attention Weight')
plt.tight_layout()
plt.savefig('my_first_attention.png')
plt.show()

print("✅ You just computed attention from scratch!")
print("Saved: my_first_attention.png")
print("Next: Chapter 2 will explain WHY this formula works")
```
What You'll See:
- A heatmap showing which words "pay attention" to which
- Brighter colors = stronger attention
- The diagonal often lights up (words attend to themselves)
- This is the core of what makes transformers work!
4.5 Prerequisites and Setup
Knowledge Requirements
This course assumes you're comfortable with:
| Topic | Level | We'll Review? | Self-Check |
|---|---|---|---|
| Python | Intermediate | No | Can you write classes and use list comprehensions? |
| NumPy | Basic | Briefly | Can you reshape arrays and do broadcasting? |
| PyTorch | Basic | Yes | Can you create tensors and define nn.Module? |
| Linear algebra | Basics | Yes (as needed) | Do you know what matrix multiplication does? |
| Neural networks | Conceptual | Yes (quick recap) | Do you know what a loss function is? |
| Calculus/Gradients | Awareness | No (PyTorch handles it) | Do you know backprop exists? |
Quick Self-Assessment
Rate yourself 1-5 on each. If any are below 3, consider a quick refresher:
```
PREREQUISITE CHECKLIST
─────────────────────────────────────────────────────────
☐ I can write a Python class with __init__ and methods   [ /5]
☐ I can reshape a NumPy array from (10, 20) to (20, 10)  [ /5]
☐ I can define a simple nn.Module in PyTorch             [ /5]
☐ I know what A @ B means for matrices                   [ /5]
☐ I understand train/val/test splits                     [ /5]
─────────────────────────────────────────────────────────
                                              Total: [ /25]

Score 20+:   You're ready! Dive in.
Score 15-19: You'll be fine, expect some Googling.
Score <15:   Spend a day on PyTorch basics first.
```

Technical Setup
Required:
```
Python >= 3.8
PyTorch >= 2.0
torchtext (for data utilities)
matplotlib (for visualization)
tqdm (for progress bars)
```

Optional but Recommended:

```
CUDA-capable GPU (training is 10-20x faster)
Jupyter/Colab (for interactive exploration)
wandb (for experiment tracking)
```

Quick Environment Setup:
```bash
# Create a fresh environment
conda create -n transformers python=3.10
conda activate transformers

# Install PyTorch (check pytorch.org for your CUDA version)
pip install torch torchvision torchaudio

# Install course dependencies
pip install torchtext matplotlib tqdm numpy

# Verify installation
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

# Expected output: PyTorch 2.x.x, CUDA: True (or False if no GPU)
```

Directory Structure
By course end, your project will look like:
```
transformer_translation/
├── data/
│   ├── multi30k/
│   │   ├── train.de
│   │   ├── train.en
│   │   ├── val.de
│   │   ├── val.en
│   │   ├── test.de
│   │   └── test.en
│   └── vocab/
│       ├── bpe.de
│       └── bpe.en
├── models/
│   ├── attention.py
│   ├── encoder.py
│   ├── decoder.py
│   ├── transformer.py
│   └── embeddings.py
├── training/
│   ├── trainer.py
│   ├── scheduler.py
│   └── losses.py
├── evaluation/
│   ├── bleu.py
│   └── analysis.py
├── utils/
│   ├── tokenizer.py
│   ├── dataset.py
│   └── visualization.py
├── configs/
│   └── base_config.yaml
├── checkpoints/
├── notebooks/
│   └── exploration.ipynb
└── train.py
```

4.6 Learning Strategies
Active Learning Tips
- Code Along: Don't just read; type the code yourself
- Experiment: Change hyperparameters, break things, see what happens
- Visualize: Plot attention weights, embeddings, loss curves
- Debug Deliberately: When something doesn't work, trace tensor shapes
- Teach Others: Explaining concepts solidifies understanding
The Shape Debugging Habit
90% of transformer bugs are shape mismatches. Build this habit from day one:
```python
# Add this to EVERY function you write:
def my_attention(Q, K, V):
    print(f"Q: {Q.shape}, K: {K.shape}, V: {V.shape}")  # Always check inputs

    # ... your code ...

    print(f"output: {output.shape}")  # Always check outputs
    return output

# Expected shapes to memorize:
# Q, K, V: (batch, heads, seq_len, d_k)
# Attention weights: (batch, heads, seq_len, seq_len)
# Output: (batch, heads, seq_len, d_k)
```

Common Pitfalls to Avoid
```python
# Pitfall 1: Tensor shape confusion
# ALWAYS check shapes with print statements
print(f"Q shape: {Q.shape}")  # Expected: (batch, heads, seq_len, d_k)

# Pitfall 2: Forgetting to handle padding
# Masks are essential for variable-length sequences

# Pitfall 3: Not scaling attention
# The 1/√d_k factor prevents gradient vanishing

# Pitfall 4: Wrong mask dimensions
# src_mask: (batch, 1, 1, src_len)
# tgt_mask: (batch, 1, tgt_len, tgt_len)

# Pitfall 5: Not handling start/end tokens
# <sos> and <eos> tokens are critical for generation
```

Suggested Pace
Suggested Pace
| Background | Suggested Time | Chapters/Week | Focus Areas |
|---|---|---|---|
| ML beginner | 8-10 weeks | 1-2 | Take time on Ch 2-4, foundations are key |
| ML practitioner | 4-6 weeks | 2-3 | Can skim Ch 1, focus on implementation |
| DL experienced | 2-3 weeks | 4-6 | Focus on project chapters (12-14) |
4.7 Career & Portfolio Value
Why This Project Matters for Your Career
Completing this course gives you demonstrable skills that stand out:
| What You'll Have | How It Helps |
|---|---|
| Working translation model | Concrete proof you can build, not just use, transformers |
| GitHub repo with clean code | Portfolio piece for job applications |
| BLEU score to report | Quantifiable metric ("I achieved BLEU 32 on Multi30k") |
| Attention visualizations | Great for interview presentations |
| From-scratch implementations | Shows deep understanding vs API-only usage |
Interview-Ready Talking Points
AFTER THIS COURSE, YOU CAN CONFIDENTLY SAY:

- "I implemented a transformer from scratch in PyTorch"
- "I understand why we scale attention by √d_k: it's to prevent softmax saturation when dimensions are large"
- "I built a German-English translator that achieved BLEU 32, competitive with baseline models"
- "I can explain the difference between encoder self-attention, decoder self-attention, and cross-attention"
- "I've debugged attention masks: the batch and head dimensions need to broadcast correctly"
- "I know when to use encoder-only (BERT) vs decoder-only (GPT) vs encoder-decoder (T5) architectures"
Skills You'll Gain
These skills transfer to many roles:
| Role | How This Course Helps |
|---|---|
| ML Engineer | Implement and optimize production transformers |
| Research Engineer | Read papers, implement baselines, run experiments |
| Data Scientist | Fine-tune and evaluate language models |
| Software Engineer + ML | Integrate transformer models into products |
| Graduate Student | Foundation for NLP/ML research |
4.8 Resources & Community
Essential References
| Resource | What It's For | When to Use |
|---|---|---|
| "Attention Is All You Need" paper | The original transformer paper | Read after Ch 2-3 for deeper understanding |
| The Annotated Transformer (Harvard) | Line-by-line code walkthrough | Compare with your implementation |
| Jay Alammar's blog | Visual explanations | When you need intuition |
| HuggingFace docs | API reference | After building from scratch, to compare |
| PyTorch docs | Tensor operations | When debugging shape issues |
Foundational Papers (Optional Reading)
- Attention Is All You Need (Vaswani et al., 2017) - The original
- BERT (Devlin et al., 2018) - Encoder-only pretraining
- GPT-2 (Radford et al., 2019) - Decoder-only at scale
- T5 (Raffel et al., 2020) - Unified text-to-text framework
- Layer Normalization (Ba et al., 2016) - Why LayerNorm works
Getting Help
When you're stuck:
- Check tensor shapes - 90% of bugs are shape mismatches
- Print intermediate values - Is attention summing to 1? Is the mask applied?
- Simplify - Test with batch_size=1, seq_len=4 first
- Compare with reference - Check The Annotated Transformer
- Ask with context - Share shapes, error messages, and what you've tried
Communities
- r/MachineLearning - Paper discussions, research news
- HuggingFace Discord - Transformer-specific help
- PyTorch Forums - Implementation questions
- Twitter/X ML community - Latest research, quick tips
Summary
What You'll Achieve
By completing this course, you will:
```
YOUR ACHIEVEMENT CHECKLIST
────────────────────────────────────────────────────────────────────
☐ UNDERSTAND transformers deeply, not just use them as black boxes
☐ IMPLEMENT every component from scratch in PyTorch
☐ BUILD a working German-to-English translation system
☐ ACHIEVE a BLEU score of 30-35, competitive with baselines
☐ VISUALIZE and interpret attention mechanisms
☐ DEBUG transformer issues confidently (shapes, masks, gradients)
☐ EXPLAIN the architecture in job interviews
☐ APPLY these skills to other transformer-based projects
```

Course Materials
Each chapter includes:
- Theory sections: Conceptual explanations with diagrams
- Code sections: Step-by-step implementations
- Exercises: Practice problems (with solutions)
- Jupyter notebooks: Interactive experimentation
Let's Begin!
In the next chapter, we'll dive into the heart of transformers: the attention mechanism. You'll implement scaled dot-product attention from scratch and see exactly how transformers "pay attention" to different parts of their input.
The journey from "transformers seem magical" to "I built one myself" starts now. You've got the roadmap; let's build.
Exercises
Reflection Questions
- Based on the course roadmap, which chapter are you most excited about? Which seems most challenging?
- Why is machine translation a good project for learning transformers? What are the advantages over classification tasks?
- What aspects of your current PyTorch knowledge do you feel confident about? What areas might need review?
- Look at the "Before vs. After" table. Which transformation are you most looking forward to?
Setup Tasks
- Create the directory structure outlined in Section 4.5. Verify you have all required packages installed.
- Run the "First Win" code from Section 4.4 and save the attention visualization. What patterns do you notice?
- (Optional) Download the Multi30k dataset and examine a few examples. What patterns do you notice in German vs. English sentence structure?
- Write a simple PyTorch module (any module) to verify your environment is set up correctly:
```python
import torch
import torch.nn as nn

class TestModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)

    def forward(self, x):
        return self.linear(x)

# Test it
model = TestModule()
x = torch.randn(2, 10)
print(model(x).shape)  # Should print: torch.Size([2, 5])
print("✅ Environment ready! You're all set for Chapter 2.")
```

Bonus Challenge
Modify the "First Win" attention code to:
- Use a longer sentence (10+ words)
- Try different values of d_k (4, 8, 16, 64) and see how the attention patterns change
- Add a mask that prevents the first word from attending to the last word
Next Chapter Preview
Chapter 2: Attention Mechanism From Scratch
We'll implement the core innovation that makes transformers work:
```python
def attention(Q, K, V):
    """
    The heart of transformers.

    Q: What am I looking for? (Query)
    K: What do I contain? (Key)
    V: What information do I provide? (Value)

    Returns: Weighted combination of Values
    """
    scores = Q @ K.T / sqrt(d_k)
    weights = softmax(scores)
    return weights @ V
```

What You'll Learn:
- Why attention is called "attention" (the analogy)
- The mathematics behind Q, K, V
- Why we scale by √d_k (and what breaks if we don't)
- How to implement and visualize attention weights
- Masking for sequences of different lengths
See you in Chapter 2! Bring your code editorβwe're building from scratch.