Chapter 1

The Transformer Revolution

Introduction to Transformers


In June 2017, a team at Google published "Attention Is All You Need" (Vaswani et al.), introducing the Transformer architecture. This paper fundamentally changed the field of deep learning and led directly to models like BERT, GPT, and virtually every modern language model.

In this section, we'll explore the key ideas of the Transformer at a high level, understanding the paradigm shift it represents before diving into implementation details in later chapters.

Why this matters right now

Self-attention removes the sequential bottleneck that slowed RNNs, so you can scale models and training speed just by throwing GPU-friendly matrix multiplies at the problem. Everything else in this section flows from that shift.

2.1 The Central Thesis

"Attention Is All You Need"

The paper's title captures its radical proposal:

Remove recurrence entirely. Use only attention mechanisms.

This was counterintuitive at the time. How can you model sequences without any notion of sequential processing?

The answer: Self-attention - a mechanism where every element in a sequence can directly attend to every other element.


2.2 Self-Attention: The Core Innovation

The Key Idea

Instead of processing sequences step-by-step, self-attention computes relationships between ALL pairs of positions simultaneously.

Analogy: newsroom fact-check

Imagine a newsroom where every editor can instantly read every sentence in every article and leave notes for each other at once. Self-attention gives every token that same superpower, so important context is never more than one hop away.

For a sequence of n tokens:

  • Each token creates a Query (Q): "What am I looking for?"
  • Each token creates a Key (K): "What do I contain?"
  • Each token creates a Value (V): "What information do I provide?"

The Attention Formula

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Let's unpack this:

  1. QK^T: Compute similarity between every query and every key (an n \times n matrix)
  2. \div \sqrt{d_k}: Scale so large dot products don't push softmax into its low-gradient saturation region
  3. softmax: Convert similarities to probabilities (attention weights)
  4. \times V: Weighted sum of values based on attention weights

Worked Micro-Example (3 tokens, d_k = 2)

What to notice

Highest weights land on tokens with similar Q/K; scaling keeps softmax from collapsing; every row is a distribution that blends V vectors.
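The micro-example above can be run directly. This is a minimal numpy sketch of the attention formula; the toy Q/K/V values are made up for illustration (token 0 and token 2 are given similar queries/keys on purpose).

🐍worked_attention_example.py
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Scores: similarity between every query and every key -> (n, n)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns similarities into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Three toy tokens with d_k = 2 (values chosen by hand for illustration)
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

out, w = scaled_dot_product_attention(Q, K, V)
# Each row of w is a probability distribution; row 0 weights keys 0 and 2
# most heavily because their keys resemble query 0.
```

Note how the output for each token is a blend of all three value vectors, with the mix determined entirely by query/key similarity.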

Visual Intuition

Consider the sentence: "The cat sat on the mat"

📝attention-visualization.txt
Query from "sat"

      The   cat   sat   on   the   mat
       ↓     ↓     ↓     ↓     ↓     ↓
Keys:  k₁    k₂    k₃    k₄    k₅    k₆

Attention weights: [0.05, 0.35, 0.15, 0.10, 0.05, 0.30]
                     ↑      ↑                        ↑
                   "The"  "cat"                    "mat"
                          (high attention to subject and object)

The word "sat" can directly attend to "cat" (the subject) and "mat" (where it sat), regardless of their positions.


2.3 Why Self-Attention Beats Recurrence

Path Length Comparison

To connect two positions separated by distance d:

Architecture     Maximum Path Length
RNN              O(d) - must pass through d steps
CNN              O(log_k(d)) - depends on kernel size
Self-Attention   O(1) - direct connection!

Parallelization

For a sequence of length n:

Architecture     Parallel Operations   Sequential Operations
RNN              O(1)                  O(n)
CNN              O(1)                  O(log_k(n))
Self-Attention   O(n^2)                O(1)

Self-attention requires O(n^2) operations total, but they can ALL be computed in parallel!

The Computational Trade-off

\text{RNN: } O(n) \text{ sequential steps} \times O(d) \text{ per step} = O(n \cdot d) \text{ total, } O(n) \text{ time}
\text{Attention: } O(n^2) \text{ parallel operations} \times O(d) \text{ each} = O(n^2 \cdot d) \text{ total, } O(1) \text{ time}

For modern GPUs optimized for parallel matrix operations:

  • O(n^2) parallel operations can be FASTER than O(n) sequential operations
  • This is why Transformers train much faster on GPUs than RNNs

Watch out for long sequences

The quadratic O(n^2) cost of self-attention becomes painful for long contexts. Practical fixes: chunked attention, sliding windows, sparse attention patterns (Longformer, BigBird), low-rank/linear approximations (Performer), or IO-aware exact kernels such as FlashAttention.

2.4 The Transformer Architecture Overview

High-Level Structure

The original Transformer uses an encoder-decoder architecture:

[Figure: The Transformer architecture - encoder-decoder model with multi-head self-attention, from "Attention Is All You Need" (Vaswani et al., 2017)]

Encoder Stack

Each encoder layer contains:

  1. Multi-Head Self-Attention
    • Queries, Keys, Values all come from the same input
    • Each position attends to all positions in the source
  2. Feed-Forward Network (FFN)
    • Two linear transformations with ReLU
    • Applied independently to each position
  3. Residual Connections + Layer Normalization
    • Around each sub-layer
    • Helps with training deep networks

The encoder layer flow can be summarized as:

\text{output}_1 = \text{LayerNorm}(x + \text{SelfAttention}(x))
\text{output}_2 = \text{LayerNorm}(\text{output}_1 + \text{FFN}(\text{output}_1))

Decoder Stack

Each decoder layer contains:

  1. Masked Multi-Head Self-Attention
    • Same as encoder, but with causal masking
    • Position i can only attend to positions \leq i
    • Prevents "seeing the future" during generation
  2. Encoder-Decoder Attention (Cross-Attention)
    • Queries from decoder, Keys/Values from encoder output
    • This is how the decoder "looks at" the source sequence
  3. Feed-Forward Network
    • Same as encoder

Encoder/Decoder Data Flow at a Glance

📝transformer-dataflow.txt
Input tokens   -> Embeddings + Positional Encoding
               -> Encoder self-attention (Q=K=V=input) -> FFN
Decoder prefix -> Masked self-attention (Q=K=V=shifted target)
               -> Cross-attention (Q=decoder, K/V=encoder output)
               -> FFN -> Linear + Softmax -> Next-token logits

Masks:
- Padding mask: hide padding in encoder & decoder.
- Causal mask: hide future tokens in decoder self-attention.

Masking pitfalls

Forgetting the causal mask leaks future tokens and produces a train/test mismatch; forgetting the padding mask makes the model attend to zeros and hurts stability.
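Both masks can be built in a few lines. This is a minimal numpy sketch (the function names and toy shapes are my own, not from a library): before softmax, disallowed score positions are set to a large negative number so their attention weight becomes ~0.

🐍masks_sketch.py
```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: position i may attend to j <= i
    return np.tril(np.ones((n, n), dtype=bool))

def padding_mask(lengths, n):
    # True for real tokens, False for padding positions
    positions = np.arange(n)
    return positions[None, :] < np.asarray(lengths)[:, None]

# Combine both: set disallowed scores to -1e9 before the softmax
scores = np.zeros((2, 4, 4))                         # [batch, query, key]
allowed = causal_mask(4)[None] & padding_mask([3, 4], 4)[:, None, :]
masked = np.where(allowed, scores, -1e9)
```

After the softmax, a score of -1e9 contributes essentially zero weight, which is exactly the "can't see the future / can't see padding" behavior described above.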

Multi-Head Attention

Instead of single attention, use multiple "heads" in parallel:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O

where:

\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)

Benefits:

  • Each head can learn different types of relationships
  • Some heads might focus on syntax, others on semantics
  • Richer representation of dependencies

Choosing heads and dimensions

Typical choice: d_k = d_v = d_{model} / h. More heads increase representational power but also VRAM and attention overhead; if d_k gets too small, individual heads underfit. Start with 8 heads for d_{model} = 512, and scale up only when batch size and memory allow.
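The split into heads is just a reshape. A minimal numpy sketch (helper names are my own) showing how d_model = 512 divides into 8 heads of d_k = 64, and that merging recovers the original tensor:

🐍head_split_sketch.py
```python
import numpy as np

def split_heads(x, h):
    # [batch, seq, d_model] -> [batch, h, seq, d_k], with d_k = d_model // h
    b, n, d_model = x.shape
    d_k = d_model // h
    return x.reshape(b, n, h, d_k).transpose(0, 2, 1, 3)

def merge_heads(x):
    # Inverse of split_heads: [batch, h, seq, d_k] -> [batch, seq, h * d_k]
    b, h, n, d_k = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, n, h * d_k)

x = np.random.randn(2, 10, 512)
heads = split_heads(x, 8)      # shape (2, 8, 10, 64): each head sees d_k = 64
```

Each head then runs scaled dot-product attention independently on its 64-dimensional slice, and W^O mixes the concatenated results.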

2.5 Positional Encoding

The Problem

Self-attention treats input as a set, not a sequence:

  • Changing token order doesn't change attention computations
  • "cat sat mat" and "mat sat cat" would be identical
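The set behavior can be checked directly: with no positional information, permuting the input tokens just permutes the output rows identically. A small numpy sketch (bare self-attention with Q = K = V = X, no learned projections):

🐍permutation_check.py
```python
import numpy as np

def self_attention(X):
    # Bare self-attention: Q = K = V = X, no projections, no positions
    scores = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

X = np.random.randn(5, 8)
perm = np.array([3, 0, 4, 1, 2])
# Shuffling the tokens shuffles the outputs the same way:
# token order carries no signal whatsoever.
same = np.allclose(self_attention(X[perm]), self_attention(X)[perm])
```

This permutation-equivariance is exactly why positional encodings are needed.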

The Solution

Add positional encodings to input embeddings:

\text{input} = \text{token\_embedding} + \text{positional\_encoding}

Analogy: song positions

Think of a song: the note itself is the token embedding, the timestamp in the track is the positional encoding. Add them and the model knows both what the sound is and when it happened.

Sinusoidal Encoding (Original Paper)

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)

Where:

  • pos is the position in the sequence
  • i is the dimension index
  • d_{model} is the model dimension

Why Sinusoidal?

  1. Bounded values: Between -1 and 1
  2. Unique encoding: Each position has a distinct pattern
  3. Relative positions: PE_{pos+k} can be represented as a linear function of PE_{pos}
  4. Extrapolation: Works for sequences longer than those seen in training
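The two formulas above take only a few lines to implement. A minimal numpy sketch (function name is my own), which also checks the "bounded values" property directly:

🐍sinusoidal_encoding.py
```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]             # positions (column)
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

pe = sinusoidal_encoding(128, 512)
# Position 0 encodes as sin(0) = 0 on even dims and cos(0) = 1 on odd dims,
# and every value stays within [-1, 1].
```

Low dimensions oscillate quickly (fine-grained position), high dimensions slowly (coarse position), so each position gets a unique multi-frequency fingerprint.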

Common implementation gotcha

Scale token embeddings by \sqrt{d_{model}} before adding positional encodings; otherwise the sinusoidal signal can be drowned out by the embedding magnitudes.

2.6 Key Architectural Choices

Residual Connections

Every sub-layer has a skip connection:

\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))

Benefits:

  • Enables training of very deep models
  • Gradients can flow directly through skip connections
  • Each layer only needs to learn the "residual" (difference from identity)

Layer Normalization

Normalizes across features (not across batch):

\text{LayerNorm}(x) = \gamma \times \frac{x - \mu}{\sigma} + \beta

Benefits:

  • Works with variable-length sequences
  • Stable training dynamics
  • Works with batch size of 1

The Feed-Forward Network

\text{FFN}(x) = \text{ReLU}(x W_1 + b_1) W_2 + b_2

Dimensions:

  • Input/output: d_{model} (e.g., 512)
  • Hidden: d_{ff} (e.g., 2048) - typically 4 \times d_{model}

Role:

  • Adds non-linearity
  • Processes each position independently
  • Acts as a "memory" or lookup table

Encoder Layer: PyTorch-style pseudo-code

🐍encoder_layer.py
def encoder_layer(x, attn_mask):
    # x: [batch, seq, d_model]
    attn_out = multi_head_attention(
        q=x, k=x, v=x, mask=attn_mask
    )                          # [batch, seq, d_model]
    x = layer_norm(x + attn_out)

    ff = relu(x @ W1 + b1)     # W1: [d_model, d_ff]
    ff = ff @ W2 + b2          # W2: [d_ff, d_model]
    return layer_norm(x + ff)  # [batch, seq, d_model]

Shape sanity checks

Multi-head attention preserves sequence length; the FFN expands to d_{ff} then projects back. Masks should include both padding and (for decoders) causal components.

2.7 Training the Transformer

Teacher Forcing

During training:

  • Provide the correct previous tokens to decoder
  • Don't use model's own predictions (which might be wrong)

Label Smoothing

Instead of hard targets (0 or 1):

  • Smooth labels: 0.9 for the correct class, 0.1/(V-1) for others (where V is the vocabulary size)
  • Prevents overconfidence
  • Improves generalization
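The smoothed target distribution is easy to construct. A minimal sketch (function name is my own) using the 0.9 / 0.1-spread values from above:

🐍label_smoothing.py
```python
import numpy as np

def smooth_labels(target, vocab_size, eps=0.1):
    # Correct class gets 1 - eps; remaining mass is spread uniformly
    dist = np.full(vocab_size, eps / (vocab_size - 1))
    dist[target] = 1.0 - eps
    return dist

dist = smooth_labels(target=2, vocab_size=5)
# dist is a valid probability distribution: [0.025, 0.025, 0.9, 0.025, 0.025]
```

Training then minimizes cross-entropy against this soft distribution instead of a one-hot vector.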

Learning Rate Schedule

The original paper uses a custom schedule:

lr = d_{model}^{-0.5} \times \min\left(step^{-0.5},\ step \times warmup\_steps^{-1.5}\right)

This creates:

  1. Linear warmup: Gradually increase LR
  2. Inverse square root decay: Slowly decrease LR
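The schedule is a one-line function of the step count. A minimal sketch with the paper's typical defaults (d_model = 512, 4000 warmup steps):

🐍lr_schedule.py
```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for warmup_steps, then inverse square-root decay
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises linearly until step 4000, peaks, then decays as 1/sqrt(step)
```

The peak occurs exactly at step = warmup_steps, where the two terms in the min are equal.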

Practical training knobs

Attention memory is dominated by the score matrices, which scale with batch \times heads \times sequence\_length^2. If you run out of VRAM: shorten sequences, reduce the batch size (using gradient accumulation), or reduce heads. Label smoothing and warmup often give bigger wins than adding one more layer.

2.8 Why Transformers Succeeded

Superior Performance

Within months of publication:

  • Machine Translation: Beat previous SOTA by 2+ BLEU points
  • Training Speed: 3.5 days on 8 GPUs (vs. weeks for RNN models)
  • Scalability: Performance improved with model size

Hardware Alignment

Transformers are perfectly suited for modern hardware:

  • Matrix multiplications (QK^T) are GPU-optimized
  • No sequential dependencies during training
  • Can leverage tensor cores and specialized accelerators

Flexibility

The architecture generalizes to:

  • Encoder-only (BERT): Classification, NER, embeddings
  • Decoder-only (GPT): Language modeling, generation
  • Encoder-decoder (T5, BART): Translation, summarization

Scalability

The same architecture works from:

  • Millions of parameters (BERT-base: 110M)
  • Billions of parameters (GPT-3: 175B)
  • Trillions of parameters (future models)

Mini Timeline

Year        Milestone                      What changed
2014-2016   Seq2Seq + attention            RNNs learn to peek at encoder states
2017        Transformer (Vaswani et al.)   Drop recurrence; full self-attention
2018        BERT                           Encoder-only pretraining for understanding
2019        GPT-2                          Decoder-only scaling for generation
2020        T5/BART                        Text-to-text unification; denoising pretraining
2023+       GPT-4, PaLM 2, Gemini          Massive scale, multimodal, tool use

2.9 The Architecture We'll Build

Our Course Project

We'll implement a full encoder-decoder Transformer for German-to-English translation:

📝translation-example.txt
Source (German): "Der Hund ist schwarz."

        [Transformer]

Target (English): "The dog is black."

Model Specifications

Component             Value
d_model               256-512
Encoder layers        4-6
Decoder layers        4-6
Attention heads       8
d_ff                  4 x d_model
Vocabulary size       ~10,000 (BPE)
Max sequence length   128

From Scratch

We'll implement:

  1. Scaled dot-product attention
  2. Multi-head attention
  3. Positional encoding
  4. Encoder layer and stack
  5. Decoder layer and stack
  6. Complete Transformer model

No nn.Transformer wrappers - everything from basic PyTorch operations.


Summary

Key Innovations of the Transformer

  1. Self-Attention: Direct connections between any two positions
  2. Parallelization: All computations can happen simultaneously
  3. Multi-Head Attention: Multiple perspectives on relationships
  4. Positional Encoding: Sequence order through learned/fixed patterns

Why It Changed Everything

Aspect            Before Transformers   After Transformers
Architecture      RNNs, LSTMs           Self-attention
Training          Sequential, slow      Parallel, fast
Long-range        Difficult             Natural
Scale             Millions of params    Billions+ of params
GPU utilization   Poor                  Excellent

Glossary / Cheat Sheet

Term                Meaning
Q, K, V             Queries ask; Keys describe; Values carry content
d_model             Embedding width (e.g., 512)
d_k                 Per-head key/query width (d_model / heads)
d_ff                Hidden width in FFN (~4 x d_model)
Heads               Parallel attention subspaces
Warmup steps        Steps to ramp LR before decay
Label smoothing ε   Softens one-hot targets to reduce overconfidence

Exercises

Conceptual Questions

  1. Why does self-attention have O(1) path length between any two positions, while RNNs have O(n)?
  2. Explain why Transformers can be parallelized during training while RNNs cannot.
  3. What problem does positional encoding solve? What would happen without it?
  4. Why do we use multiple attention heads instead of one large attention layer?

Architecture Analysis

  1. Draw the flow of information through an encoder layer for processing the sentence "The cat sat".
  2. In encoder-decoder attention, where do Q, K, and V come from? Why is this design important?
  3. What is the purpose of the feed-forward network in each layer? Why not just use attention?

Hands-on Drills

  1. Implement sinusoidal positional encodings in PyTorch or NumPy; verify shapes and plot a few dimensions across positions.
  2. Manually compute the output of a 2-head attention layer on three 2D tokens (pick your own tiny numbers) to feel the math.

Stretch question

Why do relative positional encodings (e.g., T5, RoPE) help for long contexts, and how do they change the attention score computation?

Looking Ahead

In the next section, we'll survey the incredible variety of applications that Transformers now dominate:

  • NLP: Translation, summarization, question answering
  • Vision: Image classification, object detection
  • Multimodal: CLIP, DALL-E, GPT-4V
  • Beyond: Time series, code, robotics

You should now be able to...

  1. Explain self-attention and multi-head attention with a concrete numeric example.
  2. Sketch the encoder/decoder data flow and point to where masking is required.
  3. Recall key hyperparameters (d_{model}, heads, d_{ff}, warmup) and predict their effect on compute.

Then we'll set up our development environment and preview the German-to-English translation project that will serve as our hands-on practice throughout this course.