Chapter 1

Introduction to Transformers

Course Roadmap and Project Preview

Introduction

Now that you understand why transformers matter and how they've revolutionized AI, it's time to chart our learning path. This section provides a detailed overview of the course structure and introduces the end-to-end machine translation project that will serve as our hands-on practice.

By the end of this course, you'll have implemented a complete German-to-English translation system from scratch: no high-level wrappers, no magic black boxes. Just you, PyTorch, and the fundamental principles of transformers.

How to use this roadmap

Treat each chapter as a milestone: build the code, run the checks, and log what confused you. The later sections assume you've written (not just read) the earlier pieces.

Your Learning Journey

Here's your transformation from "transformers seem magical" to "I built one myself":

πŸ“text
1YOUR TRANSFORMER JOURNEY
2════════════════════════════════════════════════════════════════════════
3
4  START                                                              FINISH
5    β”‚                                                                   β”‚
6    β–Ό                                                                   β–Ό
7β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
8β”‚  "What  β”‚    β”‚ "I see  β”‚    β”‚"I built β”‚    β”‚"My modelβ”‚    β”‚"I understandβ”‚
9β”‚   is    │───▢│  how    │───▢│  the    │───▢│ learns! │───▢│ deeply and  β”‚
10β”‚attentionβ”‚    β”‚ Q,K,V   β”‚    β”‚encoder- β”‚    β”‚BLEU: 30+β”‚    β”‚ can modify/ β”‚
11β”‚    ?"   β”‚    β”‚  work"  β”‚    β”‚decoder" β”‚    β”‚         β”‚    β”‚   extend"   β”‚
12β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
13     β”‚              β”‚              β”‚              β”‚               β”‚
14  Ch 1-2         Ch 3-4         Ch 7-8        Ch 10-14        Ch 15-17
15  ~1 week        ~1 week        ~1 week       ~2 weeks        Optional
16
17════════════════════════════════════════════════════════════════════════
18              Total: 4-8 weeks depending on your background

Before vs. After This Course

| Aspect | Before | After |
|---|---|---|
| Understanding | "Transformers use attention somehow" | "I can derive the attention formula and explain why scaling matters" |
| Reading papers | Skip math, hope for intuition | Understand equations, implement from description |
| Using libraries | Copy-paste, pray it works | Know what's inside, debug confidently |
| Job interviews | "I've used HuggingFace..." | "I implemented transformers from scratch, here's my BLEU score" |
| Building projects | Limited to tutorials | Can architect custom transformer variants |
| Debugging | Random guessing | "Check mask shapes, verify attention patterns" |
The Goal: You won't just be able to use transformers; you'll be able to explain them, modify them, debug them, and build them for your specific needs.

4.1 Course Philosophy

Learning by Building

This course follows a constructionist approach:

The best way to understand something is to build it.

We won't just read about attention mechanisms; we'll implement them tensor by tensor. We won't just discuss positional encoding; we'll visualize why sinusoidal functions work.

What you should do at each step

Write the code yourself, add a print or assert for shapes, and run a micro test before moving on. Copy-pasting skips the learning.

Progressive Complexity

πŸ“text
1Chapter 1-4:   Foundations     (Attention, Multi-Head, Position, Embeddings)
2                     ↓
3Chapter 5-6:   Building Blocks (Tokenization, FFN, Normalization)
4                     ↓
5Chapter 7-9:   Core Architecture (Encoder, Decoder, Generation)
6                     ↓
7Chapter 10-11: Training & Evaluation (Pipeline, BLEU)
8                     ↓
9Chapter 12-14: Complete Project (Data, Training, Analysis)
10                     ↓
11Chapter 15-17: Advanced Topics (Variants, Modern Methods, Production)

Time Investment by Part

πŸ“text
1TIME COMMITMENT OVERVIEW
2────────────────────────────────────────────────────────────────────
3Part 1: Foundations      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘  ~2 weeks
4Part 2: Building Blocks  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘  ~1 week
5Part 3: Core Arch        β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘  ~1 week
6Part 4: Training/Eval    β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘  ~1 week
7Part 5: Full Project     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘  ~2 weeks
8Part 6: Advanced         β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘  Optional
9────────────────────────────────────────────────────────────────────
10                         β”‚         β”‚         β”‚         β”‚
11                         5hrs      10hrs     15hrs     20hrs/week

Time planning

ML beginner: 8-10 weeks total. ML practitioner: 4-6 weeks. DL experienced: 2-3 weeks. Adjust based on how deeply you want to explore each topic.

Each chapter builds on the previous, and by Chapter 14, all components come together in a working system.

From Scratch, Then Compare

We implement everything from basic PyTorch operations:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# What we DO:
def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V), attention_weights

# What we DON'T do (until comparison at the end):
# output, _ = nn.MultiheadAttention(embed_dim, num_heads)(query, key, value)
```

Only after understanding how components work do we compare with PyTorch's built-in implementations.
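
As a preview of that comparison, the from-scratch function shown above can be checked directly against PyTorch's built-in. This is an illustrative sketch, assuming PyTorch >= 2.0 (which provides `F.scaled_dot_product_attention`):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # the from-scratch version we build in Chapter 2
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

torch.manual_seed(0)
Q = torch.randn(2, 4, 8)   # (batch, seq_len, d_k)
K = torch.randn(2, 4, 8)
V = torch.randn(2, 4, 8)

ours, weights = scaled_dot_product_attention(Q, K, V)
builtin = F.scaled_dot_product_attention(Q, K, V)   # PyTorch 2.x built-in

print(torch.allclose(ours, builtin, atol=1e-5))   # the two agree
print(weights.sum(dim=-1))                        # each attention row sums to 1
```

Agreement like this is the end-of-course sanity check: if your implementation matches the library numerically, you know you understood the formula.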

Minimum done criteria

Each chapter should leave you with runnable code, a tiny test (even a shape check), and one visualization or printed example. Keep these in a `notes.md` to track gotchas.

4.2 Chapter-by-Chapter Roadmap

Part 1: Foundations (Chapters 1-4)

Chapter 1: Introduction to Transformers βœ“

You are here!

  • Evolution of sequence modeling
  • The Transformer revolution
  • Applications and variants
  • Course roadmap

Checkpoint: confident if you can

Explain why self-attention beats recurrence, name the three transformer families, and outline the course project without peeking.

Chapter 2: Attention Mechanism From Scratch

Core Learning: Build attention from first principles.

What you'll implement:

  • Dot-product similarity
  • Softmax attention weights
  • Weighted value aggregation
  • Scaling factor derivation
  • Masking mechanisms

Key Outputs:

  • attention.py module with reusable attention functions
  • Visualizations showing attention patterns
Quick Check: Can you explain why we divide by √d_k in attention? What would happen if we didn't?
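
One way to preview the answer before Chapter 2 derives it: with unit-variance inputs, the variance of a dot product grows with d_k, which pushes softmax toward saturation unless we rescale. An illustrative experiment:

```python
import math
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(1000, d_k)   # 1000 random query vectors, unit variance
k = torch.randn(1000, d_k)   # 1000 random key vectors

raw = (q * k).sum(-1)          # unscaled dot products
scaled = raw / math.sqrt(d_k)  # what attention actually uses

print(raw.std())     # grows like sqrt(d_k), here around sqrt(512) ~ 22.6
print(scaled.std())  # stays near 1, keeping softmax out of saturation
```

Without the scaling, softmax over values spread across tens of units becomes nearly one-hot, and its gradients vanish.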

Chapter 3: Multi-Head Attention

Core Learning: Parallel attention heads for richer representations.

```python
import torch.nn as nn

# What you'll implement:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # Projection matrices
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)
```

Key Insights:

  • Why multiple heads help (different relationship types)
  • Efficient implementation via reshaping
  • Head-by-head attention visualization
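
The "efficient implementation via reshaping" insight is worth previewing: all heads are computed in one tensor by splitting d_model into (num_heads, d_k). A shape-only sketch:

```python
import torch

batch, seq_len, d_model, num_heads = 2, 5, 512, 8
d_k = d_model // num_heads   # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)

# split: (batch, seq, d_model) -> (batch, heads, seq, d_k)
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)   # torch.Size([2, 8, 5, 64])

# merge back after attention: the exact inverse of the split
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(torch.equal(merged, x))   # True: the round trip preserves every value
```

No loops over heads, no extra copies beyond the `contiguous()`; this is why 8 heads cost roughly the same as 1.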

Chapter 4: Positional Encoding and Embeddings

Core Learning: How transformers understand sequence order.

What you'll implement:

  • Sinusoidal positional encoding
  • Learned positional embeddings
  • Token embedding layer
  • Combined input representations

Visualizations:

  • Sinusoidal patterns across dimensions
  • Position similarity matrices
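
The sinusoidal encoding that produces those patterns is only a few lines; this is a minimal version of what Chapter 4 builds, following the original paper's formula:

```python
import math
import torch

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=64)
print(pe.shape)    # torch.Size([128, 64])
print(pe[0, :4])   # position 0: sin terms are 0, cos terms are 1
```

Plotting `pe` as a heatmap gives you the striped pattern you'll see in the chapter's visualizations.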

Part 2: Building Blocks (Chapters 5-6)

Chapter 5: Subword Tokenization for Translation

Core Learning: BPE and vocabulary creation for translation.

```python
# What you'll implement:
class BPETokenizer:
    def train(self, corpus, vocab_size):
        """Learn merge rules from data."""

    def encode(self, text):
        """Convert text to token IDs."""

    def decode(self, token_ids):
        """Convert token IDs back to text."""
```

Why It Matters:

  • Handles unknown words gracefully
  • Balances vocabulary size with representation
  • Essential for multilingual models

Chapter 6: Feed-Forward Networks and Normalization

Core Learning: The "processing" layers between attention.

```python
import torch.nn as nn

# What you'll implement:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

class LayerNorm(nn.Module):
    """From scratch, understanding why it helps training."""
```

Topics Covered:

  • ReLU/GELU activation functions
  • Why 4Γ— expansion in FFN
  • Pre-norm vs. post-norm
  • Residual connections
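
The pre-norm vs. post-norm distinction in that list comes down to where LayerNorm sits relative to the residual connection. A minimal sketch (an `nn.Linear` stands in for the real attention/FFN sublayer):

```python
import torch
import torch.nn as nn

d_model = 64
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention or FFN
x = torch.randn(2, 5, d_model)

post_norm = norm(x + sublayer(x))   # original paper: normalize after residual
pre_norm = x + sublayer(norm(x))    # modern default: normalize before sublayer

print(post_norm.shape, pre_norm.shape)   # both preserve (batch, seq, d_model)
```

Pre-norm tends to train more stably without warmup tricks, which is why most modern stacks use it; Chapter 6 compares both.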

Part 3: Core Architecture (Chapters 7-9)

Chapter 7: Transformer Encoder

Core Learning: Complete encoder implementation.

```python
# What you'll implement:
class EncoderLayer(nn.Module):
    """Self-attention + FFN + residuals + normalization."""

class TransformerEncoder(nn.Module):
    """Stack of N encoder layers."""
```

Architecture:

πŸ“text
1Input embeddings + positions
2         ↓
3β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
4β”‚    Self-Attention      │←─┐
5β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€  β”‚ Γ— N layers
6β”‚ Add & Norm             β”‚β”€β”€β”˜
7β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
8β”‚    Feed-Forward        │←─┐
9β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€  β”‚
10β”‚ Add & Norm             β”‚β”€β”€β”˜
11β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
12         ↓
13    Encoder output

Chapter 8: Transformer Decoder

Core Learning: Decoder with masked self-attention and cross-attention.

```python
# What you'll implement:
class DecoderLayer(nn.Module):
    def forward(self, x, encoder_output, src_mask, tgt_mask):
        # Masked self-attention
        x = self.self_attention(x, x, x, tgt_mask)
        # Cross-attention to encoder
        x = self.cross_attention(x, encoder_output, encoder_output, src_mask)
        # Feed-forward
        x = self.feed_forward(x)
        return x
```

Key Concepts:

  • Causal masking (preventing future peeking)
  • Cross-attention mechanism
  • Teacher forcing during training
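
The causal mask in that list can be built in one line with `torch.tril`; a minimal sketch of what Chapter 8 implements properly:

```python
import torch

tgt_len = 5
# lower-triangular: position i may attend only to positions 0..i
tgt_mask = torch.tril(torch.ones(tgt_len, tgt_len)).bool()

scores = torch.randn(tgt_len, tgt_len)
weights = torch.softmax(scores.masked_fill(~tgt_mask, float('-inf')), dim=-1)

print(tgt_mask.int())   # 1s on and below the diagonal
print(weights[0])       # the first position can only attend to itself
```

Masked positions get `-inf` before softmax, so they receive exactly zero attention; that is what "preventing future peeking" means numerically.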

Chapter 9: Autoregressive Generation and Inference

Core Learning: How decoders generate text token by token.

```python
# What you'll implement:
def greedy_decode(model, encoder_output, max_len):
    """Generate one token at a time, feeding back predictions."""

def beam_search(model, encoder_output, beam_width, max_len):
    """Maintain multiple hypotheses for better results."""
```

Topics:

  • Greedy vs. beam search
  • Temperature sampling
  • Top-k and nucleus (top-p) sampling
  • KV caching for efficient inference
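
Temperature and top-k from that list can be previewed on a toy logits vector; a hedged sketch (Chapter 9 builds the full decoding loop):

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # toy next-token scores

# temperature < 1 sharpens the distribution, > 1 flattens it
for T in (0.5, 1.0, 2.0):
    p = torch.softmax(logits / T, dim=-1)
    print(f"T={T}: max prob {p.max().item():.3f}")

# top-k: keep the k best logits, renormalize, sample from those only
k = 2
topk = torch.topk(logits, k)
probs = torch.softmax(topk.values, dim=-1)
next_token = topk.indices[torch.multinomial(probs, 1)]
print(next_token.item())   # always 0 or 1, since only the top 2 survive
```

Nucleus (top-p) sampling works the same way, except the cutoff is a cumulative probability mass rather than a fixed count.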

Part 4: Training & Evaluation (Chapters 10-11)

Chapter 10: Training Pipeline for Translation

Core Learning: Complete training loop with best practices.

```python
# What you'll implement:
class TransformerTrainer:
    def train_epoch(self, dataloader):
        for batch in dataloader:
            self.optimizer.zero_grad()
            loss = self.compute_loss(batch)
            loss.backward()
            self.optimizer.step()
            self.scheduler.step()
```

Topics:

  • Cross-entropy loss with label smoothing
  • Adam optimizer with warmup
  • Learning rate scheduling
  • Gradient clipping
  • Mixed precision training
  • Checkpointing
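
The "Adam with warmup" item follows the original paper's schedule: the learning rate rises linearly for `warmup` steps, then decays as 1/sqrt(step). A standalone sketch:

```python
import math

def noam_lr(step, d_model=512, warmup=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)   # avoid 0^-0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(noam_lr(1))        # tiny at the start of training
print(noam_lr(4000))     # the peak, right at the warmup boundary
print(noam_lr(40000))    # decayed well past warmup
```

Chapter 10 wires this into a PyTorch `LambdaLR` scheduler so it runs per optimizer step.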

Chapter 11: Evaluation Metrics for Translation

Core Learning: Measuring translation quality.

```python
# What you'll implement:
def compute_bleu(references, hypotheses, max_n=4):
    """Compute BLEU score from scratch."""

def corpus_bleu(references, hypotheses):
    """Corpus-level BLEU with brevity penalty."""
```

Metrics Covered:

  • BLEU (n-gram precision)
  • METEOR
  • chrF
  • Human evaluation considerations
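
At its core, BLEU is modified n-gram precision. A toy unigram version shows the idea (Chapter 11 adds higher-order n-grams and the brevity penalty):

```python
from collections import Counter

def unigram_precision(reference, hypothesis):
    """Fraction of hypothesis tokens that appear in the reference,
    clipped so a token can't be credited more times than it occurs."""
    ref_counts = Counter(reference)
    matches = sum(min(count, ref_counts[tok])
                  for tok, count in Counter(hypothesis).items())
    return matches / len(hypothesis)

ref = "a man is standing on a ladder".split()
hyp = "a man stands on a ladder".split()
print(unigram_precision(ref, hyp))   # 5 of 6 hypothesis tokens match
```

The clipping (the `min`) is what makes the precision "modified": without it, a hypothesis of all "the" would score perfectly.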

Part 5: Complete Project (Chapters 12-14)

Chapter 12: Project - Data Pipeline

The Dataset: Multi30k German-English parallel corpus

```text
~30,000 sentence pairs. Example:
german:  "Ein Mann in einem blauen Hemd steht auf einer Leiter."
english: "A man in a blue shirt is standing on a ladder."
```

What You'll Build:

  • Data loading and preprocessing
  • BPE vocabulary training
  • Batching with dynamic padding
  • Data augmentation techniques
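
One of those pieces, batching with dynamic padding, can be sketched as a collate function. `PAD_ID = 0` is an assumption here; Chapter 12 defines the real special tokens:

```python
import torch

PAD_ID = 0   # assumed padding token id for this sketch

def collate(batch):
    """Pad each batch only to its own longest sequence, not a global max."""
    max_len = max(len(seq) for seq in batch)
    padded = torch.full((len(batch), max_len), PAD_ID, dtype=torch.long)
    for i, seq in enumerate(batch):
        padded[i, :len(seq)] = torch.tensor(seq)
    return padded

batch = [[5, 7, 9], [3, 4], [8, 2, 6, 1]]
padded = collate(batch)
print(padded.shape)     # torch.Size([3, 4]): the batch max, not 128
print(padded[1])        # shorter sequences get trailing PAD_IDs
```

Padding per batch instead of to the global max length is the single biggest memory/speed win in the data pipeline.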

Chapter 13: Project - Model and Training

The Complete Model:

```python
class Transformer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.decoder = TransformerDecoder(config)
        self.src_embed = Embeddings(config)
        self.tgt_embed = Embeddings(config)
        self.generator = nn.Linear(config.d_model, config.tgt_vocab_size)
```

Training Goals:

  • Target BLEU: 30-35 on test set
  • Training time: 2-4 hours on single GPU
  • Model size: ~30M parameters

Chapter 14: Project - Evaluation and Analysis

Analysis You'll Perform:

  • Quantitative evaluation (BLEU, loss curves)
  • Attention visualizations
  • Error analysis and failure cases
  • Comparison with baseline and PyTorch built-ins

Part 6: Advanced Topics (Chapters 15-17)

Chapter 15: Advanced Architectures and Techniques

  • Relative position encodings (T5, ALiBi)
  • Rotary position embeddings (RoPE)
  • Sparse attention patterns
  • Mixture of Experts

Expect optional depth

Advanced chapters are bonus: skim first, then pick what aligns with your goals (e.g., long-context for docs/code, MoE for scaling).

Chapter 16: Modern Transformer Variants

  • Vision Transformers
  • Decoder-only LLMs (GPT-style)
  • BERT-style pre-training
  • Flash Attention

Production relevance

Flash Attention and quantization (next chapter) are the biggest levers for latency; prioritize if you plan to deploy.

Chapter 17: Production and Deployment

  • Model quantization
  • ONNX export
  • KV caching optimization
  • Serving architectures

4.3 The Translation Project

Why Machine Translation?

Translation is the ideal learning project for transformers because:

  1. Full Architecture: Uses complete encoder-decoder (not just encoder or decoder)
  2. Clear Evaluation: BLEU scores provide objective measurement
  3. Visible Results: You can immediately see if translations make sense
  4. Historical Significance: The original transformer was designed for translation
  5. Reasonable Scale: Can train on a laptop/single GPU

Why not just classify?

Translation forces you to handle varying lengths, masking, teacher forcing, and evaluation with BLEU: all core skills you need for any seq-to-seq problem.

Dataset: Multi30k

Multi30k is a multilingual extension of the Flickr30k image captioning dataset:

| Split | Sentences | Avg. Length (DE) | Avg. Length (EN) |
|---|---|---|---|
| Train | 29,000 | 12.1 tokens | 13.0 tokens |
| Valid | 1,014 | 12.0 tokens | 12.8 tokens |
| Test | 1,000 | 11.6 tokens | 12.5 tokens |

Sample Pairs:

πŸ“text
1DE: Eine Gruppe von Menschen steht vor einem Iglu.
2EN: A group of people stands in front of an igloo.
3
4DE: Ein kleines MΓ€dchen klettert in ein Spielhaus aus Holz.
5EN: A little girl climbing into a wooden playhouse.
6
7DE: Ein Mann in einem blauen Hemd steht auf einer Leiter.
8EN: A man in a blue shirt is standing on a ladder.

Model Specifications

| Hyperparameter | Value |
|---|---|
| d_model | 256-512 |
| Encoder layers | 4-6 |
| Decoder layers | 4-6 |
| Attention heads | 8 |
| d_ff | 4 × d_model |
| Vocabulary (BPE) | ~10,000 |
| Max sequence length | 128 |
| Dropout | 0.1-0.3 |
| Batch size | 64-128 |
| Learning rate | 1e-4 to 3e-4 |

Resource planning

If VRAM is tight: drop d_model to 256, use 4 layers, reduce batch size, and enable gradient accumulation; keep heads at 8 to preserve attention quality.
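
Gradient accumulation, mentioned in the tip above, is only a few extra lines in the training loop. An illustrative sketch with placeholder shapes (a tiny `nn.Linear` stands in for the transformer):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)   # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
accum = 4                 # effective batch = micro-batch size x 4

n_steps = 0
for step in range(8):
    x, y = torch.randn(16, 8), torch.randn(16, 2)        # one micro-batch
    loss = nn.functional.mse_loss(model(x), y) / accum   # average over accum
    loss.backward()   # gradients add up across backward() calls
    if (step + 1) % accum == 0:
        optimizer.step()
        optimizer.zero_grad()
        n_steps += 1

print(f"{n_steps} optimizer steps for 8 micro-batches")
```

Dividing the loss by `accum` keeps the gradient magnitude equal to what one big batch would produce, so the learning rate doesn't need retuning.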

Expected Results

BLEU Score Progression:

πŸ“text
1Epoch 1:   5-10  (learning basic patterns)
2Epoch 5:   15-20 (capturing word relationships)
3Epoch 10:  25-30 (good fluency, some errors)
4Epoch 20:  30-35 (our target)
5Epoch 30+: 35-40 (diminishing returns)

Watch Your Model Learn

This is what makes the project satisfying; you can literally watch your model go from nonsense to fluent translations:

πŸ“text
1β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
2β”‚  YOUR MODEL'S LEARNING JOURNEY                                      β”‚
3β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
4β”‚                                                                     β”‚
5β”‚  Input: "Ein Mann liest ein Buch"                                   β”‚
6β”‚                                                                     β”‚
7β”‚  Epoch 1:  "A the the the the"        ❌ Gibberish                  β”‚
8β”‚            └─ Model just repeating common tokens                    β”‚
9β”‚                                                                     β”‚
10β”‚  Epoch 5:  "A man is a book"          ⚠️ Words present, wrong order β”‚
11β”‚            └─ Learning word mappings, not structure                 β”‚
12β”‚                                                                     β”‚
13β”‚  Epoch 10: "A man reading a book"     ⚠️ Close! Wrong verb form     β”‚
14β”‚            └─ Getting syntax, some errors                           β”‚
15β”‚                                                                     β”‚
16β”‚  Epoch 15: "A man reads a book"       βœ… Correct!                   β”‚
17β”‚            └─ Learned the pattern                                   β”‚
18β”‚                                                                     β”‚
19β”‚  Epoch 20: "A man in a blue shirt     βœ… Handles complexity!        β”‚
20β”‚             reads a book in the park"                               β”‚
21β”‚            └─ Generalizes to longer sentences                       β”‚
22β”‚                                                                     β”‚
23β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
The Magic Moment: There's a specific epoch where your model goes from producing garbage to producing readable translations. You'll feel like a proud parent.

4.4 Your First Win (Try Now!)

Don't wait until Chapter 2; let's get a small win right now. Run this code to see attention in action:

```python
# 🎯 YOUR FIRST WIN: See attention patterns in 60 seconds
# Copy this into a Python file or Jupyter notebook

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Simulate a mini attention computation
sentence = ["The", "cat", "sat", "on", "the", "mat"]
seq_len = len(sentence)

# Random Q, K, V matrices (normally these come from embeddings)
d_k = 8
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

# THE ATTENTION FORMULA (you'll implement this properly in Chapter 2)
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
attention_weights = F.softmax(scores, dim=-1)

# Visualize: which words attend to which?
plt.figure(figsize=(8, 6))
plt.imshow(attention_weights.detach().numpy(), cmap='Blues')
plt.xticks(range(seq_len), sentence, rotation=45)
plt.yticks(range(seq_len), sentence)
plt.xlabel('Keys (attending TO)')
plt.ylabel('Queries (attending FROM)')
plt.title('Your First Attention Map!')
plt.colorbar(label='Attention Weight')
plt.tight_layout()
plt.savefig('my_first_attention.png')
plt.show()

print("✅ You just computed attention from scratch!")
print("📊 Saved: my_first_attention.png")
print("🎯 Next: Chapter 2 will explain WHY this formula works")
```

Do this now!

Seriously: copy this code and run it. You'll have a visual understanding of attention before you even start Chapter 2. This 60-second experiment will make everything click faster.

What You'll See:

  • A heatmap showing which words "pay attention" to which
  • Brighter colors = stronger attention
  • In a trained model, the diagonal often lights up (words attend to themselves); with random Q and K here, the pattern is arbitrary
  • This is the core of what makes transformers work!

Save this feeling

When you finish Chapter 2, come back to this code. You'll understand every line deeply.

4.5 Prerequisites and Setup

Knowledge Requirements

This course assumes you're comfortable with:

| Topic | Level | We'll Review? | Self-Check |
|---|---|---|---|
| Python | Intermediate | No | Can you write classes and use list comprehensions? |
| NumPy | Basic | Briefly | Can you reshape arrays and do broadcasting? |
| PyTorch | Basic | Yes | Can you create tensors and define nn.Module? |
| Linear algebra | Basics | Yes (as needed) | Do you know what matrix multiplication does? |
| Neural networks | Conceptual | Yes (quick recap) | Do you know what a loss function is? |
| Calculus/Gradients | Awareness | No (PyTorch handles it) | Do you know backprop exists? |

Quick Self-Assessment

Rate yourself 1-5 on each. If any are below 3, consider a quick refresher:

πŸ“text
1PREREQUISITE CHECKLIST
2─────────────────────────────────────────────────
3β–‘ I can write a Python class with __init__ and methods     [  /5]
4β–‘ I can reshape a NumPy array from (10, 20) to (20, 10)    [  /5]
5β–‘ I can define a simple nn.Module in PyTorch               [  /5]
6β–‘ I know what A @ B means for matrices                     [  /5]
7β–‘ I understand train/val/test splits                       [  /5]
8─────────────────────────────────────────────────
9                                          Total: [  /25]
10
11Score 20+: You're ready! Dive in.
12Score 15-19: You'll be fine, expect some Googling.
13Score <15: Spend a day on PyTorch basics first.

Technical Setup

Required:

  • Python >= 3.8
  • PyTorch >= 2.0
  • torchtext (for data utilities)
  • matplotlib (for visualization)
  • tqdm (for progress bars)

Optional but Recommended:

  • CUDA-capable GPU (training is 10-20x faster)
  • Jupyter/Colab (for interactive exploration)
  • wandb (for experiment tracking)

Quick Environment Setup:

```bash
# Create a fresh environment
conda create -n transformers python=3.10
conda activate transformers

# Install PyTorch (check pytorch.org for your CUDA version)
pip install torch torchvision torchaudio

# Install course dependencies
pip install torchtext matplotlib tqdm numpy

# Verify installation
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

# Expected output: PyTorch 2.x.x, CUDA: True (or False if no GPU)
```

Directory Structure

By course end, your project will look like:

πŸ“text
1transformer_translation/
2β”œβ”€β”€ data/
3β”‚   β”œβ”€β”€ multi30k/
4β”‚   β”‚   β”œβ”€β”€ train.de
5β”‚   β”‚   β”œβ”€β”€ train.en
6β”‚   β”‚   β”œβ”€β”€ val.de
7β”‚   β”‚   β”œβ”€β”€ val.en
8β”‚   β”‚   β”œβ”€β”€ test.de
9β”‚   β”‚   └── test.en
10β”‚   └── vocab/
11β”‚       β”œβ”€β”€ bpe.de
12β”‚       └── bpe.en
13β”œβ”€β”€ models/
14β”‚   β”œβ”€β”€ attention.py
15β”‚   β”œβ”€β”€ encoder.py
16β”‚   β”œβ”€β”€ decoder.py
17β”‚   β”œβ”€β”€ transformer.py
18β”‚   └── embeddings.py
19β”œβ”€β”€ training/
20β”‚   β”œβ”€β”€ trainer.py
21β”‚   β”œβ”€β”€ scheduler.py
22β”‚   └── losses.py
23β”œβ”€β”€ evaluation/
24β”‚   β”œβ”€β”€ bleu.py
25β”‚   └── analysis.py
26β”œβ”€β”€ utils/
27β”‚   β”œβ”€β”€ tokenizer.py
28β”‚   β”œβ”€β”€ dataset.py
29β”‚   └── visualization.py
30β”œβ”€β”€ configs/
31β”‚   └── base_config.yaml
32β”œβ”€β”€ checkpoints/
33β”œβ”€β”€ notebooks/
34β”‚   └── exploration.ipynb
35└── train.py

4.6 Learning Strategies

Active Learning Tips

  1. Code Along: Don't just readβ€”type the code yourself
  2. Experiment: Change hyperparameters, break things, see what happens
  3. Visualize: Plot attention weights, embeddings, loss curves
  4. Debug Deliberately: When something doesn't work, trace tensor shapes
  5. Teach Others: Explaining concepts solidifies understanding

Micro-habits that pay off

Keep a running `notes.md` with shape logs and bugs; run a tiny smoke test after each function; plot one thing per chapter (loss curve, attention map, positional encoding heatmap).

The Shape Debugging Habit

90% of transformer bugs are shape mismatches. Build this habit from day one:

```python
# Add this to EVERY function you write:
def my_attention(Q, K, V):
    print(f"Q: {Q.shape}, K: {K.shape}, V: {V.shape}")  # Always check inputs

    # ... your code ...

    print(f"output: {output.shape}")  # Always check outputs
    return output

# Expected shapes to memorize:
# Q, K, V: (batch, heads, seq_len, d_k)
# Attention weights: (batch, heads, seq_len, seq_len)
# Output: (batch, heads, seq_len, d_k)
```

Common Pitfalls to Avoid

```python
# Pitfall 1: Tensor shape confusion
# ALWAYS check shapes with print statements
print(f"Q shape: {Q.shape}")  # Expected: (batch, heads, seq_len, d_k)

# Pitfall 2: Forgetting to handle padding
# Masks are essential for variable-length sequences

# Pitfall 3: Not scaling attention
# The 1/√d_k factor prevents gradient vanishing

# Pitfall 4: Wrong mask dimensions
# src_mask: (batch, 1, 1, src_len)
# tgt_mask: (batch, 1, tgt_len, tgt_len)

# Pitfall 5: Not handling start/end tokens
# <sos> and <eos> tokens are critical for generation
```

Quick sanity checks

Attention outputs should preserve sequence length; masks must broadcast; BLEU should rise after a few epochs; loss should trend down. If not, check your masks and learning rate.
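
The "masks must broadcast" check can be made concrete. Using the mask shapes from the pitfalls above, a (batch, 1, 1, src_len) padding mask must broadcast against (batch, heads, tgt_len, src_len) scores:

```python
import torch

batch, heads, tgt_len, src_len = 2, 8, 5, 7
scores = torch.randn(batch, heads, tgt_len, src_len)

src_mask = torch.ones(batch, 1, 1, src_len, dtype=torch.bool)
src_mask[0, :, :, -2:] = False   # pretend the last 2 source tokens are padding

masked = scores.masked_fill(~src_mask, float('-inf'))   # broadcasts over heads and tgt_len
weights = torch.softmax(masked, dim=-1)

print(weights.shape)           # unchanged: broadcasting worked
print(weights[0, 0, 0, -2:])   # padded positions get exactly zero attention
```

If the broadcast is wrong, `masked_fill` raises a shape error immediately, which is far easier to debug than silently attending to padding.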

Suggested Pace

| Background | Suggested Time | Chapters/Week | Focus Areas |
|---|---|---|---|
| ML beginner | 8-10 weeks | 1-2 | Take time on Ch 2-4, foundations are key |
| ML practitioner | 4-6 weeks | 2-3 | Can skim Ch 1, focus on implementation |
| DL experienced | 2-3 weeks | 4-6 | Focus on project chapters (12-14) |

4.7 Career & Portfolio Value

Why This Project Matters for Your Career

Completing this course gives you demonstrable skills that stand out:

| What You'll Have | How It Helps |
|---|---|
| Working translation model | Concrete proof you can build, not just use, transformers |
| GitHub repo with clean code | Portfolio piece for job applications |
| BLEU score to report | Quantifiable metric ("I achieved BLEU 32 on Multi30k") |
| Attention visualizations | Great for interview presentations |
| From-scratch implementations | Shows deep understanding vs API-only usage |

Interview-Ready Talking Points

πŸ“text
1AFTER THIS COURSE, YOU CAN CONFIDENTLY SAY:
2─────────────────────────────────────────────────────────────────────
3βœ… "I implemented a transformer from scratch in PyTorch"
4
5βœ… "I understand why we scale attention by √d_kβ€”it's to prevent
6    softmax saturation when dimensions are large"
7
8βœ… "I built a German-English translator that achieved BLEU 32,
9    competitive with baseline models"
10
11βœ… "I can explain the difference between encoder self-attention,
12    decoder self-attention, and cross-attention"
13
14βœ… "I've debugged attention masksβ€”the batch and head dimensions
15    need to broadcast correctly"
16
17βœ… "I know when to use encoder-only (BERT) vs decoder-only (GPT)
18    vs encoder-decoder (T5) architectures"
19─────────────────────────────────────────────────────────────────────

Portfolio tip

Add a README to your GitHub repo with: (1) sample translations, (2) attention visualization, (3) BLEU progression chart, (4) what you learned. This makes your repo interview-ready.

Skills You'll Gain

These skills transfer to many roles:

| Role | How This Course Helps |
|---|---|
| ML Engineer | Implement and optimize production transformers |
| Research Engineer | Read papers, implement baselines, run experiments |
| Data Scientist | Fine-tune and evaluate language models |
| Software Engineer + ML | Integrate transformer models into products |
| Graduate Student | Foundation for NLP/ML research |

4.8 Resources & Community

Essential References

| Resource | What It's For | When to Use |
|---|---|---|
| "Attention Is All You Need" paper | The original transformer paper | Read after Ch 2-3 for deeper understanding |
| The Annotated Transformer (Harvard) | Line-by-line code walkthrough | Compare with your implementation |
| Jay Alammar's blog | Visual explanations | When you need intuition |
| HuggingFace docs | API reference | After building from scratch, to compare |
| PyTorch docs | Tensor operations | When debugging shape issues |

Foundational Papers (Optional Reading)

  • Attention Is All You Need (Vaswani et al., 2017) - The original
  • BERT (Devlin et al., 2018) - Encoder-only pretraining
  • GPT-2 (Radford et al., 2019) - Decoder-only at scale
  • T5 (Raffel et al., 2020) - Unified text-to-text framework
  • Layer Normalization (Ba et al., 2016) - Why LayerNorm works

Paper reading tip

Don't try to read papers before implementing. Build first, then read; the math will make much more sense.

Getting Help

When you're stuck:

  1. Check tensor shapes - 90% of bugs are shape mismatches
  2. Print intermediate values - Is attention summing to 1? Is the mask applied?
  3. Simplify - Test with batch_size=1, seq_len=4 first
  4. Compare with reference - Check The Annotated Transformer
  5. Ask with context - Share shapes, error messages, and what you've tried

Communities

  • r/MachineLearning - Paper discussions, research news
  • HuggingFace Discord - Transformer-specific help
  • PyTorch Forums - Implementation questions
  • Twitter/X ML community - Latest research, quick tips

Summary

What You'll Achieve

By completing this course, you will:

πŸ“text
1YOUR ACHIEVEMENT CHECKLIST
2════════════════════════════════════════════════════════════════════
3β–‘ UNDERSTAND transformers deeply, not just use them as black boxes
4β–‘ IMPLEMENT every component from scratch in PyTorch
5β–‘ BUILD a working German-to-English translation system
6β–‘ ACHIEVE a BLEU score of 30-35, competitive with baselines
7β–‘ VISUALIZE and interpret attention mechanisms
8β–‘ DEBUG transformer issues confidently (shapes, masks, gradients)
9β–‘ EXPLAIN the architecture in job interviews
10β–‘ APPLY these skills to other transformer-based projects
11════════════════════════════════════════════════════════════════════

Course Materials

Each chapter includes:

  • Theory sections: Conceptual explanations with diagrams
  • Code sections: Step-by-step implementations
  • Exercises: Practice problems (with solutions)
  • Jupyter notebooks: Interactive experimentation

Let's Begin!

In the next chapter, we'll dive into the heart of transformers: the attention mechanism. You'll implement scaled dot-product attention from scratch and see exactly how transformers "pay attention" to different parts of their input.

The journey from "transformers seem magical" to "I built one myself" starts now. You've got the roadmap; let's build.

Exercises

Reflection Questions

  1. Based on the course roadmap, which chapter are you most excited about? Which seems most challenging?
  2. Why is machine translation a good project for learning transformers? What are the advantages over classification tasks?
  3. What aspects of your current PyTorch knowledge do you feel confident about? What areas might need review?
  4. Look at the "Before vs. After" table. Which transformation are you most looking forward to?

Setup Tasks

  1. Create the directory structure outlined in Section 4.5. Verify you have all required packages installed.
  2. Run the "First Win" code from Section 4.4 and save the attention visualization. What patterns do you notice?
  3. (Optional) Download the Multi30k dataset and examine a few examples. What patterns do you notice in German vs. English sentence structure?
  4. Write a simple PyTorch module (any module) to verify your environment is set up correctly:
```python
import torch
import torch.nn as nn

class TestModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)

    def forward(self, x):
        return self.linear(x)

# Test it
model = TestModule()
x = torch.randn(2, 10)
print(model(x).shape)  # Should print: torch.Size([2, 5])
print("✅ Environment ready! You're all set for Chapter 2.")
```

Bonus Challenge

Modify the "First Win" attention code to:

  • Use a longer sentence (10+ words)
  • Try different values of d_k (4, 8, 16, 64) and see how the attention patterns change
  • Add a mask that prevents the first word from attending to the last word

Next Chapter Preview

Chapter 2: Attention Mechanism From Scratch

We'll implement the core innovation that makes transformers work:

```python
# Pseudocode preview (the real, runnable implementation comes next chapter):
def attention(Q, K, V):
    """
    The heart of transformers.

    Q: What am I looking for? (Query)
    K: What do I contain? (Key)
    V: What information do I provide? (Value)

    Returns: Weighted combination of Values
    """
    scores = Q @ K.T / sqrt(d_k)
    weights = softmax(scores)
    return weights @ V
```

What You'll Learn:

  • Why attention is called "attention" (the analogy)
  • The mathematics behind Q, K, V
  • Why we scale by √d_k (and what breaks if we don't)
  • How to implement and visualize attention weights
  • Masking for sequences of different lengths
See you in Chapter 2! Bring your code editor; we're building from scratch.