Introduction
In June 2017, a team at Google published "Attention Is All You Need" (Vaswani et al.), introducing the Transformer architecture. This paper fundamentally changed the field of deep learning and led directly to models like BERT, GPT, and virtually every modern language model.
In this section, we'll explore the key ideas of the Transformer at a high level, understanding the paradigm shift it represents before diving into implementation details in later chapters.
Why this matters right now: every model you will meet later in this course, from BERT to GPT, is built from the pieces introduced here.
2.1 The Central Thesis
"Attention Is All You Need"
The paper's title captures its radical proposal:
Remove recurrence entirely. Use only attention mechanisms.
This was counterintuitive at the time. How can you model sequences without any notion of sequential processing?
The answer: Self-attention - a mechanism where every element in a sequence can directly attend to every other element.
2.2 Self-Attention: The Core Innovation
The Key Idea
Instead of processing sequences step-by-step, self-attention computes relationships between ALL pairs of positions simultaneously.
Analogy: a newsroom fact-check. Each claim being checked (a query) is compared against the headlines of archived stories (keys), and the content of the best-matching stories (values) is pulled in, weighted by how well the headlines match.
For a sequence of tokens:
- Each token creates a Query (Q): "What am I looking for?"
- Each token creates a Key (K): "What do I contain?"
- Each token creates a Value (V): "What information do I provide?"
The Attention Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let's unpack this:
- $QK^T$: Compute similarity between every query and every key (an $n \times n$ matrix)
- $\sqrt{d_k}$: Scale to prevent vanishing gradients in the softmax
- softmax: Convert similarities to probabilities (attention weights)
- $\cdot\,V$: Weighted sum of values based on attention weights
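To make the formula concrete, here is a tiny worked example in NumPy with three tokens and $d_k = 2$. All numbers are made up purely for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # [n, n] similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                              # weighted sum of values

# Three tokens, d_k = 2 (illustrative values only)
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

out, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))  # each row sums to 1
print(out.shape)         # one output vector per token
```

Notice that each row of `weights` sums to 1, so each output row is a convex combination of the value vectors.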
Visual Intuition
Consider the sentence: "The cat sat on the mat"
```
Query from "sat"
        ↓
The   cat   sat   on   the   mat
 ↓     ↓     ↓     ↓     ↓     ↓
Keys:  k₁    k₂    k₃    k₄    k₅    k₆

Attention weights: [0.05, 0.35, 0.15, 0.10, 0.05, 0.30]
                     ↑      ↑                      ↑
                   "The"  "cat"                  "mat"
            (high attention to subject and object)
```

The word "sat" can directly attend to "cat" (the subject) and "mat" (where it sat), regardless of their positions.
2.3 Why Self-Attention Beats Recurrence
Path Length Comparison
To connect two positions separated by distance $n$:

| Architecture | Maximum Path Length |
|---|---|
| RNN | $O(n)$: must pass through $n$ steps |
| CNN | $O(\log_k n)$: depends on kernel size $k$ |
| Self-Attention | $O(1)$: direct connection! |
Parallelization
For a sequence of length $n$:

| Architecture | Parallel Operations | Sequential Operations |
|---|---|---|
| RNN | $O(1)$ | $O(n)$ |
| CNN | $O(n)$ | $O(1)$ |
| Self-Attention | $O(n^2)$ | $O(1)$ |
Self-attention requires $O(n^2)$ operations in total, but they can ALL be computed in parallel!
The Computational Trade-off
For modern GPUs optimized for parallel matrix operations:
- $O(n^2)$ parallel operations can be FASTER than $O(n)$ sequential operations
- This is why Transformers train much faster on GPUs than RNNs
Watch out for long sequences: the $O(n^2)$ cost in sequence length makes attention the memory and compute bottleneck for very long inputs, which is what later "efficient attention" variants try to address.
2.4 The Transformer Architecture Overview
High-Level Structure
The original Transformer uses an encoder-decoder architecture:

Encoder Stack
Each encoder layer contains:
- Multi-Head Self-Attention
- Queries, Keys, Values all come from the same input
- Each position attends to all positions in the source
- Feed-Forward Network (FFN)
- Two linear transformations with ReLU
- Applied independently to each position
- Residual Connections + Layer Normalization
- Around each sub-layer
- Helps with training deep networks
The encoder layer flow can be summarized as:

$$x' = \text{LayerNorm}\big(x + \text{SelfAttn}(x)\big), \qquad y = \text{LayerNorm}\big(x' + \text{FFN}(x')\big)$$
Decoder Stack
Each decoder layer contains:
- Masked Multi-Head Self-Attention
- Same as encoder, but with causal masking
- Position $i$ can only attend to positions $j \le i$
- Prevents "seeing the future" during generation
- Encoder-Decoder Attention (Cross-Attention)
- Queries from decoder, Keys/Values from encoder output
- This is how the decoder "looks at" the source sequence
- Feed-Forward Network
- Same as encoder
Encoder/Decoder Data Flow at a Glance
```
Input tokens -> Embeddings + Positional Encoding
  -> Encoder self-attention (Q=K=V=input) -> FFN
Decoder prefix -> Masked self-attention (Q=K=V=shifted target)
  -> Cross-attention (Q=decoder, K/V=encoder output)
  -> FFN -> Linear + Softmax -> Next-token logits

Masks:
- Padding mask: hide padding in encoder & decoder.
- Causal mask: hide future tokens in decoder self-attention.
```

Masking pitfalls: forgetting the causal mask lets the decoder "cheat" by peeking at future tokens during training; forgetting the padding mask lets attention assign weight to meaningless padding positions.
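Both masks can be sketched in a few lines of NumPy. The function names and the `pad_id = 0` convention here are assumptions for illustration, not from any specific library:

```python
import numpy as np

def causal_mask(n):
    """True where attention is allowed: position i may see positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def padding_mask(token_ids, pad_id=0):
    """True for real tokens, False for padding (pad_id is a convention)."""
    return token_ids != pad_id

mask = causal_mask(4)
print(mask.astype(int))  # lower-triangular matrix of ones

# In attention, disallowed positions get -inf before the softmax,
# so their weights become exactly zero:
scores = np.random.randn(4, 4)
masked_scores = np.where(mask, scores, -np.inf)
```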
Multi-Head Attention
Instead of a single attention function, use multiple "heads" in parallel:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$

where:

$$\text{head}_i = \text{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$
Benefits:
- Each head can learn different types of relationships
- Some heads might focus on syntax, others on semantics
- Richer representation of dependencies
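In practice the head split is implemented as a pair of reshapes rather than $h$ separate attention modules. A NumPy sketch of just the bookkeeping (dimensions chosen arbitrarily):

```python
import numpy as np

batch, seq, d_model, n_heads = 2, 5, 16, 4
d_k = d_model // n_heads   # the head count must divide d_model

x = np.random.randn(batch, seq, d_model)

# Split: [batch, seq, d_model] -> [batch, n_heads, seq, d_k]
heads = x.reshape(batch, seq, n_heads, d_k).transpose(0, 2, 1, 3)

# ... each head would run scaled dot-product attention
#     independently in its own d_k-dimensional subspace ...

# Merge back: [batch, n_heads, seq, d_k] -> [batch, seq, d_model]
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq, d_model)

assert np.allclose(merged, x)  # split + merge is lossless
```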
Choosing heads and dimensions: the head count $h$ must divide $d_{\text{model}}$, and each head works in a subspace of size $d_k = d_{\text{model}}/h$, so adding heads re-partitions compute rather than increasing it.
2.5 Positional Encoding
The Problem
Self-attention treats input as a set, not a sequence:
- Permuting the token order just permutes the outputs; no computation depends on position
- "cat sat mat" and "mat sat cat" would yield the same token representations
The Solution
Add positional encodings to the input embeddings:

$$x_i = \text{Embedding}(\text{token}_i) + PE(i)$$
Analogy: song positions. The embedding is the note being played, and the positional encoding is the timestamp telling you where in the song you are; the same note means something different in the intro than in the finale.
Sinusoidal Encoding (Original Paper)

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Where:
- $pos$ is the position in the sequence
- $i$ indexes the sin/cos dimension pair
- $d_{\text{model}}$ is the model dimension
Why Sinusoidal?
- Bounded values: Between -1 and 1
- Unique encoding: Each position has a distinct pattern
- Relative positions: $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$
- Extrapolation: in principle, works for sequences longer than those seen in training
Common implementation gotcha: the exponent in the denominator is $2i/d_{\text{model}}$, where $2i$ indexes the sin/cos *pair*; using $i/d_{\text{model}}$ instead, or interleaving the sin and cos columns incorrectly, produces an encoding that looks plausible but breaks the relative-position property.
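A minimal NumPy sketch of the sinusoidal encoding (the function name is mine):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); odd columns use cos."""
    pos = np.arange(max_len)[:, None]              # [max_len, 1]
    two_i = np.arange(0, d_model, 2)[None, :]      # the "2i" values
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=512)
print(pe.shape)  # one d_model-dim vector per position
```

Plotting a few columns of `pe` across positions shows a family of sinusoids at geometrically spaced frequencies, and all values stay within $[-1, 1]$.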
2.6 Key Architectural Choices
Residual Connections
Every sub-layer has a skip connection:

$$\text{output} = \text{LayerNorm}\big(x + \text{Sublayer}(x)\big)$$
Benefits:
- Enables training of very deep models
- Gradients can flow directly through skip connections
- Each layer only needs to learn the "residual" (difference from identity)
Layer Normalization
Normalizes across features (not across the batch):

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are computed over the feature dimension of each position.
Benefits:
- Works with variable-length sequences
- Stable training dynamics
- Works with batch size of 1
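Layer normalization itself is only a few lines. This NumPy sketch normalizes over the last (feature) axis, which is why it is indifferent to batch size and sequence length:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize over the last (feature) axis, independently per position."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(1, 3, 8)     # works even with batch size 1
y = layer_norm(x)
print(y.mean(axis=-1).round(6))  # ~0 at every position
```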
The Feed-Forward Network

$$\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

Dimensions:
- Input/output: $d_{\text{model}}$ (e.g., 512)
- Hidden: $d_{ff}$ (e.g., 2048), typically $4 \times d_{\text{model}}$
Role:
- Adds non-linearity
- Processes each position independently
- Acts as a "memory" or lookup table
Encoder Layer: PyTorch-style pseudo-code

```python
def encoder_layer(x, attn_mask):
    # x: [batch, seq, d_model]
    attn_out = multi_head_attention(
        q=x, k=x, v=x, mask=attn_mask
    )  # [batch, seq, d_model]
    x = layer_norm(x + attn_out)

    ff = relu(x @ W1 + b1)     # W1: [d_model, d_ff]
    ff = ff @ W2 + b2          # W2: [d_ff, d_model]
    return layer_norm(x + ff)  # [batch, seq, d_model]
```

Shape sanity check: every sub-layer maps [batch, seq, d_model] to [batch, seq, d_model], which is what makes stacking layers trivial.
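To verify that the shapes above really do line up, here is a runnable NumPy version with random, untrained weights; it mirrors the pseudo-code but exists only to check dimensions:

```python
import numpy as np

batch, seq, d_model, d_ff = 2, 6, 32, 128
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x):
    # Q = K = V = x here; a real layer applies learned projections first
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

def encoder_layer(x):
    x = layer_norm(x + self_attention(x))       # sub-layer 1
    ff = np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU feed-forward
    return layer_norm(x + ff)                   # sub-layer 2

x = rng.normal(size=(batch, seq, d_model))
out = encoder_layer(x)
assert out.shape == (batch, seq, d_model)
```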
2.7 Training the Transformer
Teacher Forcing
During training:
- Provide the correct previous tokens to the decoder
- Don't use the model's own predictions (which might be wrong)
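Teacher forcing amounts to a one-token shift between decoder inputs and prediction targets; the tokens below are purely illustrative:

```python
# Teacher forcing: the decoder input is the gold target shifted right by one.
target = ["<bos>", "The", "dog", "is", "black", ".", "<eos>"]

decoder_input = target[:-1]  # what the decoder sees at each step
labels = target[1:]          # what it must predict at each step

for inp, lab in zip(decoder_input, labels):
    print(f"{inp:>7} -> {lab}")
```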
Label Smoothing
Instead of hard targets (0 or 1):
- Smooth labels: $1 - \epsilon = 0.9$ for the correct class, $\epsilon/(V-1)$ for the others (where $V$ is the vocab size and $\epsilon = 0.1$)
- Prevents overconfidence
- Improves generalization
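Label smoothing only changes the target distribution, not the model; a sketch under the formulation above:

```python
import numpy as np

def smooth_labels(target_idx, vocab_size, eps=0.1):
    """One-hot targets softened: 1 - eps on the true class, eps spread elsewhere."""
    dist = np.full(vocab_size, eps / (vocab_size - 1))
    dist[target_idx] = 1.0 - eps
    return dist

dist = smooth_labels(target_idx=2, vocab_size=5, eps=0.1)
print(dist)  # 0.9 on the true class, 0.025 on each of the other four
```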
Learning Rate Schedule
The original paper uses a custom schedule:

$$lrate = d_{\text{model}}^{-0.5} \cdot \min\!\left(step^{-0.5},\; step \cdot warmup^{-1.5}\right)$$
This creates:
- Linear warmup: Gradually increase LR
- Inverse square root decay: Slowly decrease LR
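The schedule is easy to implement directly from the formula. A sketch, with `warmup=4000` matching the original paper:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Linear ramp up to the peak at step == warmup, then slow decay:
lrs = [transformer_lr(s) for s in (1, 1000, 4000, 8000, 40000)]
print([f"{lr:.6f}" for lr in lrs])
```

The peak learning rate occurs exactly at `step == warmup`, where the two terms inside the `min` are equal.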
Practical training knobs: the original paper uses Adam with $\beta_1 = 0.9$ and $\beta_2 = 0.98$, about 4,000 warmup steps, dropout of 0.1, and label smoothing $\epsilon = 0.1$; these remain sensible starting points.
2.8 Why Transformers Succeeded
Superior Performance
Within months of publication:
- Machine Translation: Beat previous SOTA by 2+ BLEU points
- Training Speed: 3.5 days on 8 GPUs (vs. weeks for RNN models)
- Scalability: Performance improved with model size
Hardware Alignment
Transformers are perfectly suited for modern hardware:
- Matrix multiplications ($QK^T$, the attention-weighted sums, the FFN layers) are GPU-optimized
- No sequential dependencies during training
- Can leverage tensor cores and specialized accelerators
Flexibility
The architecture generalizes to:
- Encoder-only (BERT): Classification, NER, embeddings
- Decoder-only (GPT): Language modeling, generation
- Encoder-decoder (T5, BART): Translation, summarization
Scalability
The same architecture works from:
- Millions of parameters (BERT-base: 110M)
- Billions of parameters (GPT-3: 175B)
- Trillions of parameters (future models)
Mini Timeline
| Year | Milestone | What changed |
|---|---|---|
| 2014-2016 | Seq2Seq + attention | RNNs learn to peek at encoder states |
| 2017 | Transformer (Vaswani et al.) | Drop recurrence; full self-attention |
| 2018 | BERT | Encoder-only pretraining for understanding |
| 2019 | GPT-2 | Decoder-only scaling for generation |
| 2020 | T5/BART | Text-to-text unification; denoising pretraining |
| 2023+ | GPT-4, PaLM 2, Gemini | Massive scale, multimodal, tool use |
2.9 The Architecture We'll Build
Our Course Project
We'll implement a full encoder-decoder Transformer for German-to-English translation:
```
Source (German): "Der Hund ist schwarz."
          ↓
     [Transformer]
          ↓
Target (English): "The dog is black."
```

Model Specifications
| Component | Value |
|---|---|
| $d_{\text{model}}$ | 256-512 |
| Encoder layers | 4-6 |
| Decoder layers | 4-6 |
| Attention heads | 8 |
| Vocabulary size | ~10,000 (BPE) |
| Max sequence length | 128 |
From Scratch
We'll implement:
- Scaled dot-product attention
- Multi-head attention
- Positional encoding
- Encoder layer and stack
- Decoder layer and stack
- Complete Transformer model
No `nn.Transformer` wrappers: everything from basic PyTorch operations.
Summary
Key Innovations of the Transformer
- Self-Attention: Direct connections between any two positions
- Parallelization: All computations can happen simultaneously
- Multi-Head Attention: Multiple perspectives on relationships
- Positional Encoding: Sequence order through learned/fixed patterns
Why It Changed Everything
| Aspect | Before Transformers | After Transformers |
|---|---|---|
| Architecture | RNNs, LSTMs | Self-attention |
| Training | Sequential, slow | Parallel, fast |
| Long-range | Difficult | Natural |
| Scale | Millions of params | Billions+ of params |
| GPU utilization | Poor | Excellent |
Glossary / Cheat Sheet
| Term | Meaning |
|---|---|
| Q, K, V | Queries ask; Keys describe; Values carry content |
| $d_{\text{model}}$ | Embedding width (e.g., 512) |
| $d_k$ | Per-head key/query width ($d_{\text{model}}$ / heads) |
| $d_{ff}$ | Hidden width in FFN (~4 × $d_{\text{model}}$) |
| Heads | Parallel attention subspaces |
| Warmup steps | Steps to ramp LR before decay |
| Label smoothing $\epsilon$ | Softens one-hot targets to reduce overconfidence |
Exercises
Conceptual Questions
- Why does self-attention have $O(1)$ path length between any two positions, while RNNs have $O(n)$?
- Explain why Transformers can be parallelized during training while RNNs cannot.
- What problem does positional encoding solve? What would happen without it?
- Why do we use multiple attention heads instead of one large attention layer?
Architecture Analysis
- Draw the flow of information through an encoder layer for processing the sentence "The cat sat".
- In encoder-decoder attention, where do Q, K, and V come from? Why is this design important?
- What is the purpose of the feed-forward network in each layer? Why not just use attention?
Hands-on Drills
- Implement sinusoidal positional encodings in PyTorch or NumPy; verify shapes and plot a few dimensions across positions.
- Manually compute the output of a 2-head attention layer on three 2D tokens (pick your own tiny numbers) to feel the math.
Stretch question: show that $PE_{pos+k}$ is a linear function of $PE_{pos}$ (hint: use the angle-addition identities for sine and cosine). What does this property buy the model?
Looking Ahead
In the next section, we'll survey the incredible variety of applications that Transformers now dominate:
- NLP: Translation, summarization, question answering
- Vision: Image classification, object detection
- Multimodal: CLIP, DALL-E, GPT-4V
- Beyond: Time series, code, robotics
You should now be able to...
- Explain self-attention and multi-head attention with a concrete numeric example.
- Sketch the encoder/decoder data flow and point to where masking is required.
- Recall key hyperparameters ($d_{\text{model}}$, heads, $d_{ff}$, warmup) and predict their effect on compute.
Then we'll set up our development environment and preview the German-to-English translation project that will serve as our hands-on practice throughout this course.