Introduction
Now that you understand why transformers matter and how they've revolutionized AI, it's time to chart our learning path. This section provides a detailed overview of the course structure and introduces the end-to-end machine translation project that will serve as our hands-on practice.
By the end of this course, you'll have implemented a complete German-to-English translation system from scratch: no high-level wrappers, no magic black boxes. Just you, PyTorch, and the fundamental principles of transformers.
Your Learning Journey
Here's your transformation from "transformers seem magical" to "I built one myself":
YOUR TRANSFORMER JOURNEY

| Stage | Milestone | Chapters | Time |
|---|---|---|---|
| Start | "What is attention?" | Ch 1-2 | ~1 week |
| | "I see how Q, K, V work" | Ch 3-4 | ~1 week |
| | "I built the encoder-decoder" | Ch 7-8 | ~1 week |
| | "My model learns! BLEU: 30+" | Ch 10-14 | ~2 weeks |
| Finish | "I understand deeply and can modify/extend" | Ch 15-17 | Optional |

Total: 4-8 weeks depending on your background.

Before vs. After This Course
| Aspect | Before | After |
|---|---|---|
| Understanding | "Transformers use attention somehow" | "I can derive the attention formula and explain why scaling matters" |
| Reading Papers | Skip math, hope for intuition | Understand equations, implement from description |
| Using Libraries | Copy-paste, pray it works | Know what's inside, debug confidently |
| Job Interviews | "I've used HuggingFace..." | "I implemented transformers from scratch, here's my BLEU score" |
| Building Projects | Limited to tutorials | Can architect custom transformer variants |
| Debugging | Random guessing | "Check mask shapes, verify attention patterns" |
The Goal: You won't just be able to use transformers; you'll be able to explain them, modify them, debug them, and build them for your specific needs.
4.1 Course Philosophy
Learning by Building
This course follows a constructionist approach:
The best way to understand something is to build it.
We won't just read about attention mechanisms; we'll implement them tensor by tensor. We won't just discuss positional encoding; we'll visualize why sinusoidal functions work.
Progressive Complexity
```
Chapter 1-4:   Foundations (Attention, Multi-Head, Position, Embeddings)
      ↓
Chapter 5-6:   Building Blocks (Tokenization, FFN, Normalization)
      ↓
Chapter 7-9:   Core Architecture (Encoder, Decoder, Generation)
      ↓
Chapter 10-11: Training & Evaluation (Pipeline, BLEU)
      ↓
Chapter 12-14: Complete Project (Data, Training, Analysis)
      ↓
Chapter 15-17: Advanced Topics (Variants, Modern Methods, Production)
```

Time Investment by Part
| Part | Time Commitment |
|---|---|
| Part 1: Foundations | ~2 weeks |
| Part 2: Building Blocks | ~1 week |
| Part 3: Core Architecture | ~1 week |
| Part 4: Training/Evaluation | ~1 week |
| Part 5: Full Project | ~2 weeks |
| Part 6: Advanced | Optional |

Budget roughly 5-20 hours per week, depending on your pace.
Each chapter builds on the previous, and by Chapter 14, all components come together in a working system.
From Scratch, Then Compare
We implement everything from basic PyTorch operations:
```python
import math

import torch
import torch.nn.functional as F

# What we DO:
def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V), attention_weights

# What we DON'T do (until comparison at the end):
# output = nn.MultiheadAttention(embed_dim, num_heads)(query, key, value)
```

Only after understanding how components work do we compare with PyTorch's built-in implementations.
4.2 Chapter-by-Chapter Roadmap
Part 1: Foundations (Chapters 1-4)
Chapter 1: Introduction to Transformers ✓
You are here!
- Evolution of sequence modeling
- The Transformer revolution
- Applications and variants
- Course roadmap
Chapter 2: Attention Mechanism From Scratch
Core Learning: Build attention from first principles.
What you'll implement:
- Dot-product similarity
- Softmax attention weights
- Weighted value aggregation
- Scaling factor derivation
- Masking mechanisms

Key Outputs:
- `attention.py` module with reusable attention functions
- Visualizations showing attention patterns
Quick Check: Can you explain why we divide by √d_k in attention? What would happen if we didn't?
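As a taste of the answer, here's a minimal numerical sketch (toy values, not course code) of what happens to softmax when scores aren't scaled:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512

# Dot products of random d_k-dim vectors have variance ~ d_k,
# so raw scores are large and softmax saturates toward one-hot.
q = torch.randn(1, d_k)
k = torch.randn(10, d_k)

raw = q @ k.T                # unscaled scores, std ~ sqrt(d_k)
scaled = raw / d_k ** 0.5    # scaled scores, std ~ 1

p_raw = F.softmax(raw, dim=-1)
p_scaled = F.softmax(scaled, dim=-1)

print(p_raw.max().item())     # typically close to 1: nearly one-hot, tiny gradients
print(p_scaled.max().item())  # noticeably flatter distribution
```

Dividing by √d_k keeps the scores at unit scale, so the softmax stays soft and gradients flow to every position.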
Chapter 3: Multi-Head Attention
Core Learning: Parallel attention heads for richer representations.
What you'll implement:

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # Projection matrices
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)
```

Key Insights:
- Why multiple heads help (different relationship types)
- Efficient implementation via reshaping
- Head-by-head attention visualization
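The reshaping trick above can be sketched in a few lines (the shapes here are illustrative assumptions, not the course's fixed settings):

```python
import torch

batch, seq_len, d_model, num_heads = 2, 6, 512, 8
d_k = d_model // num_heads  # each head sees a 64-dim slice

x = torch.randn(batch, seq_len, d_model)

# Split the model dimension into heads, then move heads before seq_len
# so every head attends independently: (batch, heads, seq_len, d_k)
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 6, 64])

# Merging back is the exact inverse: transpose, then collapse heads
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
assert torch.equal(merged, x)  # round-trip loses nothing
```

No extra computation is needed: one big matrix multiply plus two reshapes gives you all heads in parallel.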
Chapter 4: Positional Encoding and Embeddings
Core Learning: How transformers understand sequence order.
What you'll implement:
- Sinusoidal positional encoding
- Learned positional embeddings
- Token embedding layer
- Combined input representations

Visualizations:
- Sinusoidal patterns across dimensions
- Position similarity matrices
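For a preview, a minimal implementation of the sinusoidal encoding (following the standard formula from the original paper) looks like this:

```python
import math
import torch

def sinusoidal_encoding(max_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=64)
print(pe.shape)   # torch.Size([128, 64])
print(pe[0, :4])  # position 0: sin(0)=0 and cos(0)=1 alternating
```

Plotting `pe` as a heatmap reproduces the striped patterns you'll study in this chapter.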
Part 2: Building Blocks (Chapters 5-6)
Chapter 5: Subword Tokenization for Translation
Core Learning: BPE and vocabulary creation for translation.
What you'll implement:

```python
class BPETokenizer:
    def train(self, corpus, vocab_size):
        """Learn merge rules from data."""

    def encode(self, text):
        """Convert text to token IDs."""

    def decode(self, token_ids):
        """Convert token IDs back to text."""
```

Why It Matters:
- Handles unknown words gracefully
- Balances vocabulary size with representation
- Essential for multilingual models
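The core of BPE training, counting the most frequent adjacent symbol pair, fits in a few lines. The toy corpus below is a made-up example, not Multi30k data:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus (BPE's inner loop)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0]

# Toy corpus: each word is a tuple of symbols with an end-of-word marker
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
}
pair, count = most_frequent_pair(corpus)
print(pair, count)  # ('w', 'e') 8
```

Full BPE repeats this: merge the winning pair everywhere, record the rule, and count again until the vocabulary reaches the target size.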
Chapter 6: Feed-Forward Networks and Normalization
Core Learning: The "processing" layers between attention.
What you'll implement:

```python
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

class LayerNorm(nn.Module):
    """From scratch, understanding why it helps training."""
```

Topics Covered:
- ReLU/GELU activation functions
- Why 4× expansion in FFN
- Pre-norm vs. post-norm
- Residual connections
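As a preview of how residuals and normalization combine, here is a sketch of a pre-norm sublayer wrapper (one of the two variants the chapter compares; the class name is illustrative):

```python
import torch
import torch.nn as nn

class PreNormSublayer(nn.Module):
    """Residual connection around any sublayer, normalizing the input first.
    Post-norm would instead compute norm(x + sublayer(x))."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # x keeps an untouched path to the output: the residual connection
        return x + self.dropout(sublayer(self.norm(x)))

# Usage with a stand-in sublayer (a simple linear map)
layer = PreNormSublayer(d_model=32)
x = torch.randn(4, 10, 32)
out = layer(x, nn.Linear(32, 32))
print(out.shape)  # torch.Size([4, 10, 32])
```

The same wrapper serves both the attention and feed-forward sublayers, which is why encoder layers end up so compact.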
Part 3: Core Architecture (Chapters 7-9)
Chapter 7: Transformer Encoder
Core Learning: Complete encoder implementation.
What you'll implement:

```python
class EncoderLayer(nn.Module):
    """Self-attention + FFN + residuals + normalization."""

class TransformerEncoder(nn.Module):
    """Stack of N encoder layers."""
```

Architecture:
```
Input embeddings + positions
          ↓
[ Self-Attention → Add & Norm → Feed-Forward → Add & Norm ]  × N layers
          ↓
    Encoder output
```

Chapter 8: Transformer Decoder
Core Learning: Decoder with masked self-attention and cross-attention.
What you'll implement:

```python
class DecoderLayer(nn.Module):
    def forward(self, x, encoder_output, src_mask, tgt_mask):
        # Masked self-attention
        x = self.self_attention(x, x, x, tgt_mask)
        # Cross-attention to encoder
        x = self.cross_attention(x, encoder_output, encoder_output, src_mask)
        # Feed-forward
        x = self.feed_forward(x)
        return x
```

Key Concepts:
- Causal masking (preventing future peeking)
- Cross-attention mechanism
- Teacher forcing during training
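Causal masking is surprisingly small in code. A minimal sketch of building and applying it:

```python
import torch

def make_causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

mask = make_causal_mask(4)
print(mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])

# Applying it: blocked positions get -inf so softmax assigns them zero weight
scores = torch.randn(4, 4)
scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights[0])  # first token can only attend to itself: [1, 0, 0, 0]
```

This is exactly the "preventing future peeking" idea: row i of the weights is zero everywhere past column i.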
Chapter 9: Autoregressive Generation and Inference
Core Learning: How decoders generate text token by token.
What you'll implement:

```python
def greedy_decode(model, encoder_output, max_len):
    """Generate one token at a time, feeding back predictions."""

def beam_search(model, encoder_output, beam_width, max_len):
    """Maintain multiple hypotheses for better results."""
```

Topics:
- Greedy vs. beam search
- Temperature sampling
- Top-k and nucleus (top-p) sampling
- KV caching for efficient inference
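As a preview, temperature and top-k sampling combine into one small function. This is a sketch with toy logits; the chapter builds it properly against a real model:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample a token id from logits with temperature and optional top-k."""
    logits = logits / temperature  # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        topk_vals, _ = torch.topk(logits, top_k)
        # Zero out (via -inf) everything below the k-th largest logit
        logits = logits.masked_fill(
            logits < topk_vals[..., -1, None], float("-inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -2.0]])
token = sample_next_token(logits, temperature=0.8, top_k=2)
print(token.item())  # with top_k=2, only token 0 or 1 can be drawn
```

Nucleus (top-p) sampling follows the same pattern but truncates by cumulative probability instead of a fixed count.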
Part 4: Training & Evaluation (Chapters 10-11)
Chapter 10: Training Pipeline for Translation
Core Learning: Complete training loop with best practices.
What you'll implement:

```python
class TransformerTrainer:
    def train_epoch(self, dataloader):
        for batch in dataloader:
            self.optimizer.zero_grad()
            loss = self.compute_loss(batch)
            loss.backward()
            self.optimizer.step()
            self.scheduler.step()
```

Topics:
- Cross-entropy loss with label smoothing
- Adam optimizer with warmup
- Learning rate scheduling
- Gradient clipping
- Mixed precision training
- Checkpointing
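The warmup-plus-decay schedule from "Attention Is All You Need" (often called the Noam schedule) is compact enough to preview here:

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from 'Attention Is All You Need':
    linear warmup for warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises during warmup, peaks at warmup_steps, then decays
lrs = [noam_lr(s) for s in (100, 4000, 40000)]
print(lrs)
assert lrs[1] > lrs[0] and lrs[1] > lrs[2]
```

In practice you'd plug this into a PyTorch `LambdaLR` scheduler; the course's trainer chapter shows the full wiring.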
Chapter 11: Evaluation Metrics for Translation
Core Learning: Measuring translation quality.
What you'll implement:

```python
def compute_bleu(references, hypotheses, max_n=4):
    """Compute BLEU score from scratch."""

def corpus_bleu(references, hypotheses):
    """Corpus-level BLEU with brevity penalty."""
```

Metrics Covered:
- BLEU (n-gram precision)
- METEOR
- chrF
- Human evaluation considerations
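One piece of BLEU, the brevity penalty, is simple enough to preview here:

```python
import math

def brevity_penalty(ref_len, hyp_len):
    """BLEU's brevity penalty: 1.0 if the hypothesis is at least as long
    as the reference, otherwise an exponential penalty for being short."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)

print(brevity_penalty(10, 10))  # 1.0: no penalty
print(brevity_penalty(10, 5))   # exp(1 - 2) = exp(-1), about 0.368
```

Without this term, a model could game n-gram precision by emitting only the few words it is most confident about.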
Part 5: Complete Project (Chapters 12-14)
Chapter 12: Project - Data Pipeline
The Dataset: Multi30k German-English parallel corpus
```
# ~30,000 sentence pairs
# Example:
german:  "Ein Mann in einem blauen Hemd steht auf einer Leiter."
english: "A man in a blue shirt is standing on a ladder."
```

What You'll Build:
- Data loading and preprocessing
- BPE vocabulary training
- Batching with dynamic padding
- Data augmentation techniques
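Batching with dynamic padding can be sketched with PyTorch's `pad_sequence`. The `PAD_ID` value and collate signature below are illustrative assumptions, not the course's exact code:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed padding token id

def collate(batch):
    """Pad variable-length token-id sequences to the longest in the batch."""
    padded = pad_sequence(batch, batch_first=True, padding_value=PAD_ID)
    mask = (padded != PAD_ID)  # True where there is a real token
    return padded, mask

batch = [torch.tensor([5, 6, 7]), torch.tensor([8, 9]), torch.tensor([3])]
padded, mask = collate(batch)
print(padded.shape)  # torch.Size([3, 3])
print(padded)        # shorter sequences are padded with PAD_ID on the right
```

Passing a function like this as `collate_fn` to a `DataLoader` lets each batch pad only to its own longest sentence instead of a global maximum.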
Chapter 13: Project - Model and Training
The Complete Model:
```python
class Transformer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.decoder = TransformerDecoder(config)
        self.src_embed = Embeddings(config)
        self.tgt_embed = Embeddings(config)
        self.generator = nn.Linear(config.d_model, config.tgt_vocab_size)
```

Training Goals:
- Target BLEU: 30-35 on test set
- Training time: 2-4 hours on single GPU
- Model size: ~30M parameters
Chapter 14: Project - Evaluation and Analysis
Analysis You'll Perform:
- Quantitative evaluation (BLEU, loss curves)
- Attention visualizations
- Error analysis and failure cases
- Comparison with baseline and PyTorch built-ins
Part 6: Advanced Topics (Chapters 15-17)
Chapter 15: Advanced Architectures and Techniques
- Relative position encodings (T5, ALiBi)
- Rotary position embeddings (RoPE)
- Sparse attention patterns
- Mixture of Experts
Chapter 16: Modern Transformer Variants
- Vision Transformers
- Decoder-only LLMs (GPT-style)
- BERT-style pre-training
- Flash Attention
Chapter 17: Production and Deployment
- Model quantization
- ONNX export
- KV caching optimization
- Serving architectures
4.3 The Translation Project
Why Machine Translation?
Translation is the ideal learning project for transformers because:
- Full Architecture: Uses complete encoder-decoder (not just encoder or decoder)
- Clear Evaluation: BLEU scores provide objective measurement
- Visible Results: You can immediately see if translations make sense
- Historical Significance: The original transformer was designed for translation
- Reasonable Scale: Can train on a laptop/single GPU
Dataset: Multi30k
Multi30k is a multilingual extension of the Flickr30k image captioning dataset:
| Split | Sentences | Avg. Length (DE) | Avg. Length (EN) |
|---|---|---|---|
| Train | 29,000 | 12.1 tokens | 13.0 tokens |
| Valid | 1,014 | 12.0 tokens | 12.8 tokens |
| Test | 1,000 | 11.6 tokens | 12.5 tokens |
Sample Pairs:
```
DE: Eine Gruppe von Menschen steht vor einem Iglu.
EN: A group of people stands in front of an igloo.

DE: Ein kleines Mädchen klettert in ein Spielhaus aus Holz.
EN: A little girl climbing into a wooden playhouse.

DE: Ein Mann in einem blauen Hemd steht auf einer Leiter.
EN: A man in a blue shirt is standing on a ladder.
```

Model Specifications
| Hyperparameter | Value |
|---|---|
| d_model | 256-512 |
| Encoder layers | 4-6 |
| Decoder layers | 4-6 |
| Attention heads | 8 |
| d_ff | 4 × d_model |
| Vocabulary (BPE) | ~10,000 |
| Max sequence length | 128 |
| Dropout | 0.1-0.3 |
| Batch size | 64-128 |
| Learning rate | 1e-4 to 3e-4 |
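One convenient way to carry these hyperparameters around is a config dataclass. The concrete values below are one pick from the ranges in the table, not the course's canonical setting:

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Illustrative hyperparameter bundle for the translation project."""
    d_model: int = 512
    num_encoder_layers: int = 6
    num_decoder_layers: int = 6
    num_heads: int = 8
    d_ff: int = 2048            # 4 * d_model
    vocab_size: int = 10000     # BPE vocabulary
    max_seq_len: int = 128
    dropout: float = 0.1
    batch_size: int = 64
    learning_rate: float = 3e-4

cfg = TransformerConfig()
assert cfg.d_ff == 4 * cfg.d_model
print(cfg.d_model, cfg.num_heads)
```

A dataclass like this maps cleanly onto the `configs/base_config.yaml` file in the project layout later in this chapter.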
Expected Results
BLEU Score Progression:
```
Epoch 1:   5-10  (learning basic patterns)
Epoch 5:   15-20 (capturing word relationships)
Epoch 10:  25-30 (good fluency, some errors)
Epoch 20:  30-35 (our target)
Epoch 30+: 35-40 (diminishing returns)
```

Watch Your Model Learn
This is what makes the project satisfying: you can literally watch your model go from nonsense to fluent translations:
```
YOUR MODEL'S LEARNING JOURNEY

Input: "Ein Mann liest ein Buch"

Epoch 1:  "A the the the the"        ❌ Gibberish
          └─ Model just repeating common tokens

Epoch 5:  "A man is a book"          ⚠️ Words present, wrong order
          └─ Learning word mappings, not structure

Epoch 10: "A man reading a book"     ⚠️ Close! Wrong verb form
          └─ Getting syntax, some errors

Epoch 15: "A man reads a book"       ✅ Correct!
          └─ Learned the pattern

Epoch 20: "A man in a blue shirt     ✅ Handles complexity!
           reads a book in the park"
          └─ Generalizes to longer sentences
```

The Magic Moment: There's a specific epoch where your model goes from producing garbage to producing readable translations. You'll feel like a proud parent.
4.4 Your First Win (Try Now!)
Don't wait until Chapter 2; let's get a small win right now. Run this code to see attention in action:
```python
# 🎯 YOUR FIRST WIN: See attention patterns in 60 seconds
# Copy this into a Python file or Jupyter notebook

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Simulate a mini attention computation
sentence = ["The", "cat", "sat", "on", "the", "mat"]
seq_len = len(sentence)

# Random Q, K, V matrices (normally these come from embeddings)
d_k = 8
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

# THE ATTENTION FORMULA (you'll implement this properly in Chapter 2)
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
attention_weights = F.softmax(scores, dim=-1)

# Visualize: which words attend to which?
plt.figure(figsize=(8, 6))
plt.imshow(attention_weights.detach().numpy(), cmap='Blues')
plt.xticks(range(seq_len), sentence, rotation=45)
plt.yticks(range(seq_len), sentence)
plt.xlabel('Keys (attending TO)')
plt.ylabel('Queries (attending FROM)')
plt.title('Your First Attention Map!')
plt.colorbar(label='Attention Weight')
plt.tight_layout()
plt.savefig('my_first_attention.png')
plt.show()

print("✅ You just computed attention from scratch!")
print("Saved: my_first_attention.png")
print("Next: Chapter 2 will explain WHY this formula works")
```
What You'll See:
- A heatmap showing which words "pay attention" to which
- Brighter colors = stronger attention
- The diagonal often lights up (words attend to themselves)
- This is the core of what makes transformers work!
4.5 Prerequisites and Setup
Knowledge Requirements
This course assumes you're comfortable with:
| Topic | Level | We'll Review? | Self-Check |
|---|---|---|---|
| Python | Intermediate | No | Can you write classes and use list comprehensions? |
| NumPy | Basic | Briefly | Can you reshape arrays and do broadcasting? |
| PyTorch | Basic | Yes | Can you create tensors and define nn.Module? |
| Linear algebra | Basics | Yes (as needed) | Do you know what matrix multiplication does? |
| Neural networks | Conceptual | Yes (quick recap) | Do you know what a loss function is? |
| Calculus/Gradients | Awareness | No (PyTorch handles it) | Do you know backprop exists? |
Quick Self-Assessment
Rate yourself 1-5 on each. If any are below 3, consider a quick refresher:
```
PREREQUISITE CHECKLIST
─────────────────────────────────────────────────────────
☐ I can write a Python class with __init__ and methods   [ /5]
☐ I can reshape a NumPy array from (10, 20) to (20, 10)  [ /5]
☐ I can define a simple nn.Module in PyTorch             [ /5]
☐ I know what A @ B means for matrices                   [ /5]
☐ I understand train/val/test splits                     [ /5]
─────────────────────────────────────────────────────────
                                              Total: [ /25]

Score 20+:   You're ready! Dive in.
Score 15-19: You'll be fine, expect some Googling.
Score <15:   Spend a day on PyTorch basics first.
```

Technical Setup
Required:
```
Python >= 3.8
PyTorch >= 2.0
torchtext (for data utilities)
matplotlib (for visualization)
tqdm (for progress bars)
```

Optional but Recommended:

```
CUDA-capable GPU (training is 10-20x faster)
Jupyter/Colab (for interactive exploration)
wandb (for experiment tracking)
```

Quick Environment Setup:
```bash
# Create a fresh environment
conda create -n transformers python=3.10
conda activate transformers

# Install PyTorch (check pytorch.org for your CUDA version)
pip install torch torchvision torchaudio

# Install course dependencies
pip install torchtext matplotlib tqdm numpy

# Verify installation
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

# Expected output: PyTorch 2.x.x, CUDA: True (or False if no GPU)
```

Directory Structure
By course end, your project will look like:
```
transformer_translation/
├── data/
│   ├── multi30k/
│   │   ├── train.de
│   │   ├── train.en
│   │   ├── val.de
│   │   ├── val.en
│   │   ├── test.de
│   │   └── test.en
│   └── vocab/
│       ├── bpe.de
│       └── bpe.en
├── models/
│   ├── attention.py
│   ├── encoder.py
│   ├── decoder.py
│   ├── transformer.py
│   └── embeddings.py
├── training/
│   ├── trainer.py
│   ├── scheduler.py
│   └── losses.py
├── evaluation/
│   ├── bleu.py
│   └── analysis.py
├── utils/
│   ├── tokenizer.py
│   ├── dataset.py
│   └── visualization.py
├── configs/
│   └── base_config.yaml
├── checkpoints/
├── notebooks/
│   └── exploration.ipynb
└── train.py
```

4.6 Learning Strategies
Active Learning Tips
- Code Along: Don't just read; type the code yourself
- Experiment: Change hyperparameters, break things, see what happens
- Visualize: Plot attention weights, embeddings, loss curves
- Debug Deliberately: When something doesn't work, trace tensor shapes
- Teach Others: Explaining concepts solidifies understanding
The Shape Debugging Habit
90% of transformer bugs are shape mismatches. Build this habit from day one:
```python
# Add this to EVERY function you write:
def my_attention(Q, K, V):
    print(f"Q: {Q.shape}, K: {K.shape}, V: {V.shape}")  # Always check inputs

    # ... your code ...

    print(f"output: {output.shape}")  # Always check outputs
    return output

# Expected shapes to memorize:
# Q, K, V: (batch, heads, seq_len, d_k)
# Attention weights: (batch, heads, seq_len, seq_len)
# Output: (batch, heads, seq_len, d_k)
```

Common Pitfalls to Avoid
```python
# Pitfall 1: Tensor shape confusion
# ALWAYS check shapes with print statements
print(f"Q shape: {Q.shape}")  # Expected: (batch, heads, seq_len, d_k)

# Pitfall 2: Forgetting to handle padding
# Masks are essential for variable-length sequences

# Pitfall 3: Not scaling attention
# The 1/√d_k factor prevents gradient vanishing

# Pitfall 4: Wrong mask dimensions
# src_mask: (batch, 1, 1, src_len)
# tgt_mask: (batch, 1, tgt_len, tgt_len)

# Pitfall 5: Not handling start/end tokens
# <sos> and <eos> tokens are critical for generation
```

Suggested Pace
Suggested Pace
| Background | Suggested Time | Chapters/Week | Focus Areas |
|---|---|---|---|
| ML beginner | 8-10 weeks | 1-2 | Take time on Ch 2-4, foundations are key |
| ML practitioner | 4-6 weeks | 2-3 | Can skim Ch 1, focus on implementation |
| DL experienced | 2-3 weeks | 4-6 | Focus on project chapters (12-14) |
4.7 Career & Portfolio Value
Why This Project Matters for Your Career
Completing this course gives you demonstrable skills that stand out:
| What You'll Have | How It Helps |
|---|---|
| Working translation model | Concrete proof you can build, not just use, transformers |
| GitHub repo with clean code | Portfolio piece for job applications |
| BLEU score to report | Quantifiable metric ("I achieved BLEU 32 on Multi30k") |
| Attention visualizations | Great for interview presentations |
| From-scratch implementations | Shows deep understanding vs API-only usage |
Interview-Ready Talking Points
AFTER THIS COURSE, YOU CAN CONFIDENTLY SAY:

- "I implemented a transformer from scratch in PyTorch"
- "I understand why we scale attention by √d_k: it's to prevent softmax saturation when dimensions are large"
- "I built a German-English translator that achieved BLEU 32, competitive with baseline models"
- "I can explain the difference between encoder self-attention, decoder self-attention, and cross-attention"
- "I've debugged attention masks: the batch and head dimensions need to broadcast correctly"
- "I know when to use encoder-only (BERT) vs decoder-only (GPT) vs encoder-decoder (T5) architectures"
Skills You'll Gain
These skills transfer to many roles:
| Role | How This Course Helps |
|---|---|
| ML Engineer | Implement and optimize production transformers |
| Research Engineer | Read papers, implement baselines, run experiments |
| Data Scientist | Fine-tune and evaluate language models |
| Software Engineer + ML | Integrate transformer models into products |
| Graduate Student | Foundation for NLP/ML research |
4.8 Resources & Community
Essential References
| Resource | What It's For | When to Use |
|---|---|---|
| "Attention Is All You Need" paper | The original transformer paper | Read after Ch 2-3 for deeper understanding |
| The Annotated Transformer (Harvard) | Line-by-line code walkthrough | Compare with your implementation |
| Jay Alammar's blog | Visual explanations | When you need intuition |
| HuggingFace docs | API reference | After building from scratch, to compare |
| PyTorch docs | Tensor operations | When debugging shape issues |
Foundational Papers (Optional Reading)
- Attention Is All You Need (Vaswani et al., 2017) - The original
- BERT (Devlin et al., 2018) - Encoder-only pretraining
- GPT-2 (Radford et al., 2019) - Decoder-only at scale
- T5 (Raffel et al., 2020) - Unified text-to-text framework
- Layer Normalization (Ba et al., 2016) - Why LayerNorm works
Getting Help
When you're stuck:
- Check tensor shapes - 90% of bugs are shape mismatches
- Print intermediate values - Is attention summing to 1? Is the mask applied?
- Simplify - Test with batch_size=1, seq_len=4 first
- Compare with reference - Check The Annotated Transformer
- Ask with context - Share shapes, error messages, and what you've tried
Communities
- r/MachineLearning - Paper discussions, research news
- HuggingFace Discord - Transformer-specific help
- PyTorch Forums - Implementation questions
- Twitter/X ML community - Latest research, quick tips
Summary
What You'll Achieve
By completing this course, you will:
```
YOUR ACHIEVEMENT CHECKLIST
────────────────────────────────────────────────────────────────────
☐ UNDERSTAND transformers deeply, not just use them as black boxes
☐ IMPLEMENT every component from scratch in PyTorch
☐ BUILD a working German-to-English translation system
☐ ACHIEVE a BLEU score of 30-35, competitive with baselines
☐ VISUALIZE and interpret attention mechanisms
☐ DEBUG transformer issues confidently (shapes, masks, gradients)
☐ EXPLAIN the architecture in job interviews
☐ APPLY these skills to other transformer-based projects
```

Course Materials
Each chapter includes:
- Theory sections: Conceptual explanations with diagrams
- Code sections: Step-by-step implementations
- Exercises: Practice problems (with solutions)
- Jupyter notebooks: Interactive experimentation
Let's Begin!
In the next chapter, we'll dive into the heart of transformers: the attention mechanism. You'll implement scaled dot-product attention from scratch and see exactly how transformers "pay attention" to different parts of their input.
The journey from "transformers seem magical" to "I built one myself" starts now. You've got the roadmap; let's build.
Exercises
Reflection Questions
- Based on the course roadmap, which chapter are you most excited about? Which seems most challenging?
- Why is machine translation a good project for learning transformers? What are the advantages over classification tasks?
- What aspects of your current PyTorch knowledge do you feel confident about? What areas might need review?
- Look at the "Before vs. After" table. Which transformation are you most looking forward to?
Setup Tasks
- Create the directory structure outlined in Section 4.5. Verify you have all required packages installed.
- Run the "First Win" code from Section 4.4 and save the attention visualization. What patterns do you notice?
- (Optional) Download the Multi30k dataset and examine a few examples. What patterns do you notice in German vs. English sentence structure?
- Write a simple PyTorch module (any module) to verify your environment is set up correctly:
```python
import torch
import torch.nn as nn

class TestModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)

    def forward(self, x):
        return self.linear(x)

# Test it
model = TestModule()
x = torch.randn(2, 10)
print(model(x).shape)  # Should print: torch.Size([2, 5])
print("✅ Environment ready! You're all set for Chapter 2.")
```

Bonus Challenge
Modify the "First Win" attention code to:
- Use a longer sentence (10+ words)
- Try different values of d_k (4, 8, 16, 64) and see how the attention patterns change
- Add a mask that prevents the first word from attending to the last word
Next Chapter Preview
Chapter 2: Attention Mechanism From Scratch
We'll implement the core innovation that makes transformers work:
```python
def attention(Q, K, V):
    """
    The heart of transformers.

    Q: What am I looking for? (Query)
    K: What do I contain? (Key)
    V: What information do I provide? (Value)

    Returns: Weighted combination of Values
    """
    scores = Q @ K.T / sqrt(d_k)
    weights = softmax(scores)
    return weights @ V
```

What You'll Learn:
- Why attention is called "attention" (the analogy)
- The mathematics behind Q, K, V
- Why we scale by √d_k (and what breaks if we don't)
- How to implement and visualize attention weights
- Masking for sequences of different lengths
See you in Chapter 2! Bring your code editorβwe're building from scratch.