Chapter 1

The Transformer Revolution

Introduction to Transformers


In June 2017, a team at Google published "Attention Is All You Need" (Vaswani et al.), introducing the Transformer architecture. This paper fundamentally changed the field of deep learning and led directly to models like BERT, GPT, and virtually every modern language model.

In this section, we'll explore the key ideas of the Transformer at a high level, understanding the paradigm shift it represents before diving into implementation details in later chapters.

Why this matters right now

Self-attention removes the sequential bottleneck that slowed RNNs, so you can scale models and training speed just by throwing GPU-friendly matrix multiplies at the problem. Everything else in this section flows from that shift.

2.1 The Central Thesis

"Attention Is All You Need"

The paper's title captures its radical proposal:

Remove recurrence entirely. Use only attention mechanisms.

This was counterintuitive at the time. How can you model sequences without any notion of sequential processing?

The answer: Self-attention - a mechanism where every element in a sequence can directly attend to every other element.


2.2 Self-Attention: The Core Innovation

The Key Idea

Instead of processing sequences step-by-step, self-attention computes relationships between ALL pairs of positions simultaneously.

Analogy: newsroom fact-check

Imagine a newsroom where every editor can instantly read every sentence in every article and leave notes for each other at once. Self-attention gives every token that same superpower, so important context is never more than one hop away.

For a sequence of n tokens:

  • Each token creates a Query (Q): "What am I looking for?"
  • Each token creates a Key (K): "What do I contain?"
  • Each token creates a Value (V): "What information do I provide?"

The Attention Formula

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Let's unpack this:

  1. QK^T: Compute similarity between every query and every key (an n \times n matrix)
  2. \div \sqrt{d_k}: Scale so large dot products don't push softmax into its low-gradient saturation region
  3. softmax: Convert similarities to probabilities (attention weights)
  4. \times V: Weighted sum of values based on attention weights

Worked Micro-Example (3 tokens, d_k = 2)

What to notice

Highest weights land on tokens with similar Q/K; scaling keeps softmax from collapsing; every row is a distribution that blends V vectors.
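The micro-example above can be run directly. This is a minimal numpy sketch of the attention formula; the toy Q/K/V values are made up for illustration (token 0 and token 2 are given similar queries/keys on purpose).

🐍worked_attention_example.py
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Scores: similarity between every query and every key -> (n, n)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns similarities into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Three toy tokens with d_k = 2 (values chosen by hand for illustration)
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

out, w = scaled_dot_product_attention(Q, K, V)
# Each row of w is a probability distribution; row 0 weights keys 0 and 2
# most heavily because their keys resemble query 0.
```

Note how the output for each token is a blend of all three value vectors, with the mix determined entirely by query/key similarity.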

Visual Intuition

Consider the sentence: "The cat sat on the mat"

📝attention-visualization.txt
Query from "sat"

      The   cat   sat   on   the   mat
       ↓     ↓     ↓     ↓     ↓     ↓
Keys:  k₁    k₂    k₃    k₄    k₅    k₆

Attention weights: [0.05, 0.35, 0.15, 0.10, 0.05, 0.30]
                     ↑      ↑                        ↑
                   "The"  "cat"                    "mat"
                          (high attention to subject and object)

The word "sat" can directly attend to "cat" (the subject) and "mat" (where it sat), regardless of their positions.


2.3 Why Self-Attention Beats Recurrence

Path Length Comparison

To connect two positions separated by distance d:

Architecture     Maximum Path Length
RNN              O(d) - must pass through d steps
CNN              O(log_k(d)) - depends on kernel size
Self-Attention   O(1) - direct connection!

Parallelization

For a sequence of length n:

Architecture     Parallel Operations   Sequential Operations
RNN              O(1)                  O(n)
CNN              O(1)                  O(log_k(n))
Self-Attention   O(n^2)                O(1)

Self-attention requires O(n^2) operations total, but they can ALL be computed in parallel!

The Computational Trade-off

\text{RNN: } O(n) \text{ sequential steps} \times O(d) \text{ per step} = O(n \cdot d) \text{ total, } O(n) \text{ time}
\text{Attention: } O(n^2) \text{ parallel operations} \times O(d) \text{ each} = O(n^2 \cdot d) \text{ total, } O(1) \text{ time}

For modern GPUs optimized for parallel matrix operations:

  • O(n^2) parallel operations can be FASTER than O(n) sequential operations
  • This is why Transformers train much faster on GPUs than RNNs

Watch out for long sequences

The quadratic O(n^2) cost of self-attention becomes painful for long contexts. Practical fixes: chunked attention, sliding windows, sparse attention patterns (Longformer, BigBird), low-rank/linear approximations (Performer), or IO-aware exact kernels such as FlashAttention.

2.4 The Transformer Architecture Overview

High-Level Structure

The original Transformer uses an encoder-decoder architecture:

[Figure: The Transformer architecture - encoder-decoder model with multi-head self-attention, from "Attention Is All You Need" (Vaswani et al., 2017)]

Encoder Stack

Each encoder layer contains:

  1. Multi-Head Self-Attention
    • Queries, Keys, Values all come from the same input
    • Each position attends to all positions in the source
  2. Feed-Forward Network (FFN)
    • Two linear transformations with ReLU
    • Applied independently to each position
  3. Residual Connections + Layer Normalization
    • Around each sub-layer
    • Helps with training deep networks

The encoder layer flow can be summarized as:

\text{output}_1 = \text{LayerNorm}(x + \text{SelfAttention}(x))
\text{output}_2 = \text{LayerNorm}(\text{output}_1 + \text{FFN}(\text{output}_1))

Decoder Stack

Each decoder layer contains:

  1. Masked Multi-Head Self-Attention
    • Same as encoder, but with causal masking
    • Position i can only attend to positions \leq i
    • Prevents "seeing the future" during generation
  2. Encoder-Decoder Attention (Cross-Attention)
    • Queries from decoder, Keys/Values from encoder output
    • This is how the decoder "looks at" the source sequence
  3. Feed-Forward Network
    • Same as encoder

Encoder/Decoder Data Flow at a Glance

📝transformer-dataflow.txt
Input tokens   -> Embeddings + Positional Encoding
               -> Encoder self-attention (Q=K=V=input) -> FFN
Decoder prefix -> Masked self-attention (Q=K=V=shifted target)
               -> Cross-attention (Q=decoder, K/V=encoder output)
               -> FFN -> Linear + Softmax -> Next-token logits

Masks:
- Padding mask: hide padding in encoder & decoder.
- Causal mask: hide future tokens in decoder self-attention.

Masking pitfalls

Forgetting the causal mask leaks future tokens and produces a train/test mismatch; forgetting the padding mask makes the model attend to zeros and hurts stability.
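Both masks can be built in a few lines. This is a minimal numpy sketch (the function names and toy shapes are my own, not from a library): before softmax, disallowed score positions are set to a large negative number so their attention weight becomes ~0.

🐍masks_sketch.py
```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: position i may attend to j <= i
    return np.tril(np.ones((n, n), dtype=bool))

def padding_mask(lengths, n):
    # True for real tokens, False for padding positions
    positions = np.arange(n)
    return positions[None, :] < np.asarray(lengths)[:, None]

# Combine both: set disallowed scores to -1e9 before the softmax
scores = np.zeros((2, 4, 4))                         # [batch, query, key]
allowed = causal_mask(4)[None] & padding_mask([3, 4], 4)[:, None, :]
masked = np.where(allowed, scores, -1e9)
```

After the softmax, a score of -1e9 contributes essentially zero weight, which is exactly the "can't see the future / can't see padding" behavior described above.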

Multi-Head Attention

Instead of single attention, use multiple "heads" in parallel:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O

where:

\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)

Benefits:

  • Each head can learn different types of relationships
  • Some heads might focus on syntax, others on semantics
  • Richer representation of dependencies

Choosing heads and dimensions

Typical choice: d_k = d_v = d_{model} / h. More heads increase representational power but also VRAM and attention overhead; if d_k gets too small, individual heads underfit. Start with 8 heads for d_{model} = 512, and scale up only when batch size and memory allow.
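The split into heads is just a reshape. A minimal numpy sketch (helper names are my own) showing how d_model = 512 divides into 8 heads of d_k = 64, and that merging recovers the original tensor:

🐍head_split_sketch.py
```python
import numpy as np

def split_heads(x, h):
    # [batch, seq, d_model] -> [batch, h, seq, d_k], with d_k = d_model // h
    b, n, d_model = x.shape
    d_k = d_model // h
    return x.reshape(b, n, h, d_k).transpose(0, 2, 1, 3)

def merge_heads(x):
    # Inverse of split_heads: [batch, h, seq, d_k] -> [batch, seq, h * d_k]
    b, h, n, d_k = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, n, h * d_k)

x = np.random.randn(2, 10, 512)
heads = split_heads(x, 8)      # shape (2, 8, 10, 64): each head sees d_k = 64
```

Each head then runs scaled dot-product attention independently on its 64-dimensional slice, and W^O mixes the concatenated results.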

2.5 Positional Encoding

The Problem

Self-attention treats input as a set, not a sequence:

  • Changing token order doesn't change attention computations
  • "cat sat mat" and "mat sat cat" would be identical
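The set behavior can be checked directly: with no positional information, permuting the input tokens just permutes the output rows identically. A small numpy sketch (bare self-attention with Q = K = V = X, no learned projections):

🐍permutation_check.py
```python
import numpy as np

def self_attention(X):
    # Bare self-attention: Q = K = V = X, no projections, no positions
    scores = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

X = np.random.randn(5, 8)
perm = np.array([3, 0, 4, 1, 2])
# Shuffling the tokens shuffles the outputs the same way:
# token order carries no signal whatsoever.
same = np.allclose(self_attention(X[perm]), self_attention(X)[perm])
```

This permutation-equivariance is exactly why positional encodings are needed.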

The Solution

Add positional encodings to input embeddings:

\text{input} = \text{token\_embedding} + \text{positional\_encoding}

Analogy: song positions

Think of a song: the note itself is the token embedding, the timestamp in the track is the positional encoding. Add them and the model knows both what the sound is and when it happened.

Sinusoidal Encoding (Original Paper)

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)

Where:

  • pos is the position in the sequence
  • i is the dimension index
  • d_{model} is the model dimension

Why Sinusoidal?

  1. Bounded values: Between -1 and 1
  2. Unique encoding: Each position has a distinct pattern
  3. Relative positions: PE_{pos+k} can be represented as a linear function of PE_{pos}
  4. Extrapolation: Works for sequences longer than those seen in training
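The two formulas above take only a few lines to implement. A minimal numpy sketch (function name is my own), which also checks the "bounded values" property directly:

🐍sinusoidal_encoding.py
```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]             # positions (column)
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

pe = sinusoidal_encoding(128, 512)
# Position 0 encodes as sin(0) = 0 on even dims and cos(0) = 1 on odd dims,
# and every value stays within [-1, 1].
```

Low dimensions oscillate quickly (fine-grained position), high dimensions slowly (coarse position), so each position gets a unique multi-frequency fingerprint.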

Common implementation gotcha

Scale token embeddings by \sqrt{d_{model}} before adding positional encodings; otherwise the sinusoidal signal can be drowned out by the embedding magnitudes.

2.6 Key Architectural Choices

Residual Connections

Every sub-layer has a skip connection:

\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))

Benefits:

  • Enables training of very deep models
  • Gradients can flow directly through skip connections
  • Each layer only needs to learn the "residual" (difference from identity)

Layer Normalization

Normalizes across features (not across batch):

\text{LayerNorm}(x) = \gamma \times \frac{x - \mu}{\sigma} + \beta

Benefits:

  • Works with variable-length sequences
  • Stable training dynamics
  • Works with batch size of 1

The Feed-Forward Network

\text{FFN}(x) = \text{ReLU}(x W_1 + b_1) W_2 + b_2

Dimensions:

  • Input/output: d_{model} (e.g., 512)
  • Hidden: d_{ff} (e.g., 2048) - typically 4 \times d_{model}

Role:

  • Adds non-linearity
  • Processes each position independently
  • Acts as a "memory" or lookup table

Encoder Layer: PyTorch-style pseudo-code

🐍encoder_layer.py
def encoder_layer(x, attn_mask):
    # x: [batch, seq, d_model]
    attn_out = multi_head_attention(
        q=x, k=x, v=x, mask=attn_mask
    )                          # [batch, seq, d_model]
    x = layer_norm(x + attn_out)

    ff = relu(x @ W1 + b1)     # W1: [d_model, d_ff]
    ff = ff @ W2 + b2          # W2: [d_ff, d_model]
    return layer_norm(x + ff)  # [batch, seq, d_model]

Shape sanity checks

Multi-head attention preserves sequence length; the FFN expands to d_{ff} then projects back. Masks should include both padding and (for decoders) causal components.

2.7 Training the Transformer

Teacher Forcing

During training:

  • Provide the correct previous tokens to decoder
  • Don't use model's own predictions (which might be wrong)

Label Smoothing

Instead of hard targets (0 or 1):

  • Smooth labels: 0.9 for the correct class, 0.1/(V-1) for others (where V is the vocabulary size)
  • Prevents overconfidence
  • Improves generalization
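The smoothed target distribution is easy to construct. A minimal sketch (function name is my own) using the 0.9 / 0.1-spread values from above:

🐍label_smoothing.py
```python
import numpy as np

def smooth_labels(target, vocab_size, eps=0.1):
    # Correct class gets 1 - eps; remaining mass is spread uniformly
    dist = np.full(vocab_size, eps / (vocab_size - 1))
    dist[target] = 1.0 - eps
    return dist

dist = smooth_labels(target=2, vocab_size=5)
# dist is a valid probability distribution: [0.025, 0.025, 0.9, 0.025, 0.025]
```

Training then minimizes cross-entropy against this soft distribution instead of a one-hot vector.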

Learning Rate Schedule

The original paper uses a custom schedule:

lr = d_{model}^{-0.5} \times \min\left(step^{-0.5},\ step \times warmup\_steps^{-1.5}\right)

This creates:

  1. Linear warmup: Gradually increase LR
  2. Inverse square root decay: Slowly decrease LR
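The schedule is a one-line function of the step count. A minimal sketch with the paper's typical defaults (d_model = 512, 4000 warmup steps):

🐍lr_schedule.py
```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for warmup_steps, then inverse square-root decay
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises linearly until step 4000, peaks, then decays as 1/sqrt(step)
```

The peak occurs exactly at step = warmup_steps, where the two terms in the min are equal.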

Practical training knobs

Attention memory is dominated by the score matrices, which scale with batch \times heads \times sequence\_length^2. If you run out of VRAM: shorten sequences, reduce the batch size (using gradient accumulation), or reduce heads. Label smoothing and warmup often give bigger wins than adding one more layer.

2.8 Why Transformers Succeeded

Superior Performance

Within months of publication:

  • Machine Translation: Beat previous SOTA by 2+ BLEU points
  • Training Speed: 3.5 days on 8 GPUs (vs. weeks for RNN models)
  • Scalability: Performance improved with model size

Hardware Alignment

Transformers are perfectly suited for modern hardware:

  • Matrix multiplications (QK^T) are GPU-optimized
  • No sequential dependencies during training
  • Can leverage tensor cores and specialized accelerators

Flexibility

The architecture generalizes to:

  • Encoder-only (BERT): Classification, NER, embeddings
  • Decoder-only (GPT): Language modeling, generation
  • Encoder-decoder (T5, BART): Translation, summarization

Scalability

The same architecture works from:

  • Millions of parameters (BERT-base: 110M)
  • Billions of parameters (GPT-3: 175B)
  • Trillions of parameters (future models)

Mini Timeline

Year        Milestone                      What changed
2014-2016   Seq2Seq + attention            RNNs learn to peek at encoder states
2017        Transformer (Vaswani et al.)   Drop recurrence; full self-attention
2018        BERT                           Encoder-only pretraining for understanding
2019        GPT-2                          Decoder-only scaling for generation
2020        T5/BART                        Text-to-text unification; denoising pretraining
2023+       GPT-4, PaLM 2, Gemini          Massive scale, multimodal, tool use

2.9 The Architecture We'll Build

Our Course Project

We'll implement a full encoder-decoder Transformer for German-to-English translation:

📝translation-example.txt
Source (German): "Der Hund ist schwarz."

        [Transformer]

Target (English): "The dog is black."

Model Specifications

Component             Value
d_model               256-512
Encoder layers        4-6
Decoder layers        4-6
Attention heads       8
d_ff                  4 x d_model
Vocabulary size       ~10,000 (BPE)
Max sequence length   128

From Scratch

We'll implement:

  1. Scaled dot-product attention
  2. Multi-head attention
  3. Positional encoding
  4. Encoder layer and stack
  5. Decoder layer and stack
  6. Complete Transformer model

No nn.Transformer wrappers - everything from basic PyTorch operations.


Summary

Key Innovations of the Transformer

  1. Self-Attention: Direct connections between any two positions
  2. Parallelization: All computations can happen simultaneously
  3. Multi-Head Attention: Multiple perspectives on relationships
  4. Positional Encoding: Sequence order through learned/fixed patterns

Why It Changed Everything

Aspect            Before Transformers   After Transformers
Architecture      RNNs, LSTMs           Self-attention
Training          Sequential, slow      Parallel, fast
Long-range        Difficult             Natural
Scale             Millions of params    Billions+ of params
GPU utilization   Poor                  Excellent

Glossary / Cheat Sheet

Term                Meaning
Q, K, V             Queries ask; Keys describe; Values carry content
d_model             Embedding width (e.g., 512)
d_k                 Per-head key/query width (d_model / heads)
d_ff                Hidden width in FFN (~4 x d_model)
Heads               Parallel attention subspaces
Warmup steps        Steps to ramp LR before decay
Label smoothing ε   Softens one-hot targets to reduce overconfidence

Exercises

Conceptual Questions

  1. Why does self-attention have O(1) path length between any two positions, while RNNs have O(n)?
  2. Explain why Transformers can be parallelized during training while RNNs cannot.
  3. What problem does positional encoding solve? What would happen without it?
  4. Why do we use multiple attention heads instead of one large attention layer?

Architecture Analysis

  1. Draw the flow of information through an encoder layer for processing the sentence "The cat sat".
  2. In encoder-decoder attention, where do Q, K, and V come from? Why is this design important?
  3. What is the purpose of the feed-forward network in each layer? Why not just use attention?

Hands-on Drills

  1. Implement sinusoidal positional encodings in PyTorch or NumPy; verify shapes and plot a few dimensions across positions.
  2. Manually compute the output of a 2-head attention layer on three 2D tokens (pick your own tiny numbers) to feel the math.

Stretch question

Why do relative positional encodings (e.g., T5, RoPE) help for long contexts, and how do they change the attention score computation?

Looking Ahead

In the next section, we'll survey the incredible variety of applications that Transformers now dominate:

  • NLP: Translation, summarization, question answering
  • Vision: Image classification, object detection
  • Multimodal: CLIP, DALL-E, GPT-4V
  • Beyond: Time series, code, robotics

You should now be able to...

  1. Explain self-attention and multi-head attention with a concrete numeric example.
  2. Sketch the encoder/decoder data flow and point to where masking is required.
  3. Recall key hyperparameters (d_{model}, heads, d_{ff}, warmup) and predict their effect on compute.

Then we'll set up our development environment and preview the German-to-English translation project that will serve as our hands-on practice throughout this course.