Learning Objectives
By the end of this chapter, you will be able to:
- Explain why RNNs and LSTMs hit a fundamental bottleneck for long sequences and why a parallel, all-pairs mechanism was needed.
- Derive the scaled dot-product attention formula from first principles, understanding every symbol and operation.
- Prove why the scaling factor is necessary and what happens without it.
- Compute attention weights and outputs by hand for our shared example sentence “The cat sat on mat”.
- Implement a complete, runnable Python class that you can use to simulate scaled dot-product attention on any input.
- Connect this foundational mechanism to its extensions: multi-head attention, Flash Attention, KV-cache, and positional encodings.
Where this appears: Scaled dot-product attention, or a close variant of it, is the computational core of most transformer-based models — including GPT-4, Claude, Gemini, LLaMA, Stable Diffusion, AlphaFold, Codex, and ViT. These systems compute this formula (or an optimized equivalent like Flash Attention) billions of times during inference. Understanding it deeply is the foundation for understanding modern AI architectures.
The Shared Example — Used in Every Chapter
Every mechanism in this book operates on the exact same sentence, the same matrices, and the same parameters. This allows you to directly compare what each mechanism does differently — the only thing that changes is the attention computation itself.
The Sentence and Tokens
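| Parameter | Value |
|---|---|
| Sentence | "The cat sat on mat" |
| Tokens | [The, cat, sat, on, mat] |
| Sequence length $N$ | 5 tokens |
| Model dimension $d_{model}$ | 4 (kept tiny so every number is readable) |
| Key/query dimension $d_k$ | 4 (in this chapter, single-head: $d_k = d_{model}$) |
| Number of heads $h$ | 2 heads (used from Chapter 2 onward; per-head dim $d_k = d_{model}/h = 2$) |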
Why “mat” and not “the mat”?
The Query, Key, and Value Matrices
$Q$, $K$, and $V$ are the three fundamental matrices in every attention mechanism. In practice they are produced by learned linear projections of the input embeddings: $Q = XW^Q$, $K = XW^K$, $V = XW^V$, where $X \in \mathbb{R}^{N \times d_{model}}$ is the input embedding matrix, $W^Q, W^K \in \mathbb{R}^{d_{model} \times d_k}$ are the learned query and key projections, and $W^V \in \mathbb{R}^{d_{model} \times d_v}$ is the learned value projection (in practice $d_v = d_k$, but the formalism allows them to differ). Here we fix them so all 15 mechanisms start from the same point.
$Q$ (Query Matrix), $5 \times 4$: each row encodes “what is this token looking for?”
| Token | dim-0 | dim-1 | dim-2 | dim-3 |
|---|---|---|---|---|
| The | 1.0 | 0.0 | 1.0 | 0.0 |
| cat | 0.0 | 2.0 | 0.0 | 1.0 |
| sat | 1.0 | 1.0 | 1.0 | 0.0 |
| on | 0.0 | 0.0 | 1.0 | 1.0 |
| mat | 1.0 | 0.0 | 0.0 | 1.0 |
$K$ (Key Matrix), $5 \times 4$: each row encodes “what information does this token advertise?”
| Token | dim-0 | dim-1 | dim-2 | dim-3 |
|---|---|---|---|---|
| The | 0.0 | 1.0 | 0.0 | 1.0 |
| cat | 1.0 | 0.0 | 1.0 | 0.0 |
| sat | 1.0 | 1.0 | 0.0 | 0.0 |
| on | 0.0 | 0.0 | 1.0 | 1.0 |
| mat | 1.0 | 0.0 | 0.5 | 0.5 |
$V$ (Value Matrix), $5 \times 4$: each row is the actual content that gets retrieved when a token is attended to
| Token | dim-0 | dim-1 | dim-2 | dim-3 |
|---|---|---|---|---|
| The | 1.0 | 0.0 | 0.0 | 0.0 |
| cat | 0.0 | 1.0 | 0.0 | 0.0 |
| sat | 0.0 | 0.0 | 1.0 | 0.0 |
| on | 0.0 | 0.0 | 0.0 | 1.0 |
| mat | 0.5 | 0.5 | 0.5 | 0.5 |
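For readers following along in code, the three tables above can be transcribed as NumPy arrays. This is only a transcription of the shared example; the variable names `tokens`, `Q`, `K`, and `V` are naming choices made for this sketch.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "mat"]

# One row per token, one column per dimension (dim-0 .. dim-3).
Q = np.array([
    [1.0, 0.0, 1.0, 0.0],  # The
    [0.0, 2.0, 0.0, 1.0],  # cat
    [1.0, 1.0, 1.0, 0.0],  # sat
    [0.0, 0.0, 1.0, 1.0],  # on
    [1.0, 0.0, 0.0, 1.0],  # mat
])
K = np.array([
    [0.0, 1.0, 0.0, 1.0],  # The
    [1.0, 0.0, 1.0, 0.0],  # cat
    [1.0, 1.0, 0.0, 0.0],  # sat
    [0.0, 0.0, 1.0, 1.0],  # on
    [1.0, 0.0, 0.5, 0.5],  # mat
])
V = np.array([
    [1.0, 0.0, 0.0, 0.0],  # The
    [0.0, 1.0, 0.0, 0.0],  # cat
    [0.0, 0.0, 1.0, 0.0],  # sat
    [0.0, 0.0, 0.0, 1.0],  # on
    [0.5, 0.5, 0.5, 0.5],  # mat
])

N, d_k = Q.shape  # sequence length and key/query dimension
print(N, d_k)
```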
What Every Chapter Shows
For each mechanism you will find four sections:
- Problem & Intuition — what limitation this mechanism was designed to fix
- The Math — the exact formula with every variable defined
- Step-by-Step Calculation — the formula applied to “The” (row 0) using real numbers
- Python Code — a clean, runnable class implementation using only NumPy
Every chapter ends with the full Attention Weight Matrix ($5 \times 5$) and Output Matrix ($5 \times 4$) so you can immediately see how the pattern changes from one mechanism to the next.
How to use this book
The Real Problem
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems (NeurIPS), 30.
To understand why scaled dot-product attention exists, you must first understand what it replaced and why that predecessor was breaking under the weight of real-world language.
The RNN Bottleneck
Before 2017, the dominant architecture for sequence modeling was the Recurrent Neural Network (RNN) and its variants (LSTM, GRU). These models processed tokens sequentially (one at a time, left to right), accumulating information into a fixed-size hidden state vector $h_t$.
This created three critical problems:
- The information bottleneck. The entire history of a 10,000-token document had to be compressed into a single vector of, say, 512 dimensions. By the time the model reached the end of a long document, the information from the beginning had been overwritten or diluted beyond recognition. In seq2seq translation (Sutskever et al., 2014), the encoder's final hidden state was the only bridge between the source and target language — a catastrophic bottleneck.
- No parallelism. Because $h_t$ depended on $h_{t-1}$, every timestep had to wait for the previous one to finish. You could not process tokens 1 through 10,000 in parallel. Training on long sequences was agonizingly slow, even on GPUs designed for massive parallelism.
- Vanishing and exploding gradients. During backpropagation through time (BPTT), gradients had to flow through $T$ multiplicative steps, one per timestep. Even with LSTM gating, gradients for dependencies spanning hundreds of tokens degraded to near zero, making it nearly impossible to learn long-range patterns like coreference (“The cat... it...” separated by 50 tokens).
The Bahdanau breakthrough (2014)
The Fundamental Question
Vaswani et al. asked a radical question: what if we removed the recurrence entirely? What if, instead of processing tokens sequentially, every token could directly attend to every other token in a single parallel operation?
The answer required a mechanism with three properties:
- All-pairs comparison — every token should be able to compute its relevance to every other token in the sequence, in parallel.
- Differentiable selection — the model should learn which tokens are relevant via soft weights (probabilities), not hard selection, so gradients can flow.
- Content-based addressing — relevance should be determined by what the tokens mean (their representations), not by their position alone.
Scaled dot-product attention satisfies all three. It is the mechanism that made the Transformer possible.
From Intuition to Mathematics
The Query-Key-Value Intuition
The central metaphor is a library lookup. Imagine you walk into a library with a question in mind (your query). Each book on the shelf has a label on its spine (its key) and content inside (its value). You scan all the spine labels, judge which ones are most relevant to your question, and then read a weighted blend of the most relevant books.
| Concept | In Attention | In Our Example |
|---|---|---|
| Query (Q) | “What am I looking for?” Each token broadcasts a question about what context it needs. | $q_{\text{The}} = [1, 0, 1, 0]$: “The” is looking for tokens with activity in dims 0 and 2 |
| Key (K) | “What do I have to offer?” Each token advertises the kind of information it carries. | $k_{\text{cat}} = [1, 0, 1, 0]$: “cat” advertises presence in dims 0 and 2 |
| Value (V) | “Here is my actual content.” The information that gets retrieved when a token is attended to. | $v_{\text{cat}} = [0, 1, 0, 0]$: the actual semantic content of “cat” |
The critical insight is that Q and K live in the same space so their dot product is meaningful as a similarity measure, while V can live in a different space — it carries the content that the query wants to retrieve.
Dot Product as Similarity
Why use the dot product to measure relevance? Consider two vectors $q = [1, 0, 1, 0]$ and $k = [1, 0, 1, 0]$:
$$q \cdot k = 1 \cdot 1 + 0 \cdot 0 + 1 \cdot 1 + 0 \cdot 0 = 2$$
The dot product is large because the vectors “agree”: they are active in the same dimensions. Now compare $q$ with $k' = [0, 1, 0, 1]$:
$$q \cdot k' = 1 \cdot 0 + 0 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 = 0$$
The dot product is zero because the vectors are orthogonal: they have nothing in common. This is exactly what we want: the dot product naturally measures the alignment between what the query is looking for and what the key advertises. Geometrically, $q \cdot k = \|q\| \, \|k\| \cos\theta$, so it captures both magnitude and direction.
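The comparison can be checked numerically. The sketch below uses the query of “The” and the keys of “cat” and “The” from the shared example; the helper name `cosine` is an illustrative choice, not something defined by the chapter.

```python
import numpy as np

q_the = np.array([1.0, 0.0, 1.0, 0.0])  # query of "The"
k_cat = np.array([1.0, 0.0, 1.0, 0.0])  # key of "cat" (same direction as q_the)
k_the = np.array([0.0, 1.0, 0.0, 1.0])  # key of "The" (orthogonal to q_the)

def cosine(a, b):
    """Cosine of the angle between a and b: (a . b) / (|a| |b|)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Aligned vectors: large dot product, cosine close to 1.
print(q_the @ k_cat, cosine(q_the, k_cat))
# Orthogonal vectors: dot product and cosine are both 0.
print(q_the @ k_the, cosine(q_the, k_the))
```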
The Mathematical Definition
The complete formula for scaled dot-product attention is:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
This single line is the most important equation in modern deep learning. Let us dissect every piece.
Symbol-by-Symbol Breakdown
| Symbol | Shape | Meaning |
|---|---|---|
| $Q$ | $N \times d_k$ | Query matrix. Row $i$ is the query vector for token $i$. In our example: $5 \times 4$. |
| $K$ | $N \times d_k$ | Key matrix. Row $j$ is the key vector for token $j$. Same shape as $Q$ so dot products are valid. |
| $V$ | $N \times d_v$ | Value matrix. Row $j$ is the content vector for token $j$. Can have a different dimension $d_v$ than $d_k$, though in practice $d_v = d_k$. |
| $K^\top$ | $d_k \times N$ | Transpose of $K$, so that $QK^\top$ produces an $N \times N$ matrix of pairwise similarities. |
| $QK^\top$ | $N \times N$ | The raw score matrix. Entry $(i, j)$ is $q_i \cdot k_j$: how much token $i$'s query aligns with token $j$'s key. |
| $\sqrt{d_k}$ | scalar | Scaling factor. In our example, $\sqrt{4} = 2$. In GPT-2, $\sqrt{64} = 8$. In GPT-3, $\sqrt{128} \approx 11.3$. |
| softmax | row-wise | Applied independently to each row. Converts raw scores into a probability distribution: $\text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. Each row sums to 1. |
| Output | $N \times d_v$ | The context-enriched representation. Each row is a weighted combination of all value vectors, where weights reflect learned relevance. |
What the Formula Says in Plain English
The formula performs four operations in sequence:
- Compare every query with every key via dot product ($QK^\top$), producing an $N \times N$ matrix of raw similarity scores.
- Scale the scores by $1/\sqrt{d_k}$ to prevent gradient saturation in the softmax.
- Normalize each row with softmax to convert scores into attention weights (probabilities that sum to 1).
- Aggregate by multiplying the attention weights with the value matrix, producing a context-aware output for each token.
In short: every token asks “who is relevant to me?” (via Q · K), computes how relevant (via softmax), and collects a weighted blend of answers (via V). The result is that each token's output representation is enriched with contextual information from all other tokens, with more weight given to the most relevant ones.
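The four operations map almost line for line onto NumPy. The function below is a minimal sketch for 2-D inputs, not an optimized implementation; the function name and the two-token toy inputs are assumptions made for illustration. Subtracting the row maximum before exponentiating is a standard numerical-stability step that leaves the softmax result unchanged.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for 2-D arrays Q (N x d_k), K (N x d_k), V (N x d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T                                        # 1. compare
    scaled = scores / np.sqrt(d_k)                          # 2. scale
    shifted = scaled - scaled.max(axis=-1, keepdims=True)   # stability shift only
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=-1, keepdims=True) # 3. normalize
    return weights @ V, weights                             # 4. aggregate

# Hypothetical two-token input: each query matches its own key best.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])

out, w = scaled_dot_product_attention(Q, K, V)
print(w)    # each row sums to 1, with the larger weight on the diagonal
print(out)  # each row is a blend of the two value vectors
```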
Why Scale? The Variance Argument
This is the most commonly asked question about the formula: why divide by $\sqrt{d_k}$ and not some other constant? The answer comes from a simple variance analysis.
The Mathematical Proof
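Assume each element of $q$ and $k$ is drawn independently from a distribution with mean 0 and variance 1. The dot product of two such vectors is:
$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$
Since each $q_i$ and $k_i$ are independent with mean 0 and variance 1:
$$\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \qquad \mathrm{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \left(\mathbb{E}[q_i k_i]\right)^2 = 1 \cdot 1 - 0 = 1$$
The dot product is a sum of $d_k$ independent terms, each with variance 1, so by the additivity of variance:
$$\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k$$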
When $d_k = 512$, for example, the standard deviation of the dot product is $\sqrt{512} \approx 22.6$. This means individual scores can easily reach values like $\pm 40$ or more.
Now consider what softmax does with such extreme values. If one score is 40 and the rest are near 0:
$$\text{softmax}([40, 0, 0, 0, 0]) \approx [1, 0, 0, 0, 0]$$
The output is essentially a one-hot vector. The gradient of softmax at these saturation points is near zero ($\partial p_i / \partial z_i = p_i (1 - p_i) \approx 0$ when $p_i \approx 1$ or $p_i \approx 0$), which means the model cannot learn to adjust these scores, and learning stalls.
The fix is elegant: dividing by $\sqrt{d_k}$ rescales the variance back to 1:
$$\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{\mathrm{Var}(q \cdot k)}{d_k} = \frac{d_k}{d_k} = 1$$
Now the scores have a standard deviation of ~1 regardless of the dimension. Softmax receives moderate inputs, produces smooth (non-saturated) probability distributions, and gradients flow properly for learning. This is why we scale by exactly $\sqrt{d_k}$: not 2, not $d_k$, not a learned parameter. Under the assumptions above, it is the normalization that keeps the softmax in its useful operating regime.
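The variance argument can be checked empirically. The sketch below draws random queries and keys from a standard normal distribution (one distribution that satisfies the mean-0, variance-1 assumption) and compares the variance of the raw and scaled dot products. The sample size, the seed, and the list of dimensions are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n_samples = 5_000

results = {}
for d_k in (4, 64, 512):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)            # one dot product per sample
    raw_var = dots.var()                    # expected to be close to d_k
    scaled_var = (dots / np.sqrt(d_k)).var()  # expected to be close to 1
    results[d_k] = (raw_var, scaled_var)
    print(f"d_k={d_k:4d}  raw variance={raw_var:8.2f}  scaled variance={scaled_var:5.2f}")

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Saturation: one large score takes essentially all the probability mass,
# and the diagonal softmax derivative p * (1 - p) collapses toward zero.
p = softmax(np.array([40.0, 0.0, 0.0, 0.0, 0.0]))
print(p[0], p[0] * (1.0 - p[0]))
```

The printed variances are estimates, so they will land near $d_k$ and near 1 rather than exactly on them.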
Additive vs. multiplicative attention
Step-by-Step Calculation for “The” (Row 0)
We now trace through every arithmetic step for the first token “The” to see exactly how scaled dot-product attention builds its output vector. Every number here can be verified with the Python class at the bottom of this chapter.
Step 1 — Raw Dot Products:
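The query vector for “The” is $q_{\text{The}} = [1, 0, 1, 0]$. We compute its dot product with every key vector:
| Pair | Calculation | Raw Score |
|---|---|---|
| $q_{\text{The}} \cdot k_{\text{The}}$ | $1 \cdot 0 + 0 \cdot 1 + 1 \cdot 0 + 0 \cdot 1$ | 0.0 |
| $q_{\text{The}} \cdot k_{\text{cat}}$ | $1 \cdot 1 + 0 \cdot 0 + 1 \cdot 1 + 0 \cdot 0$ | 2.0 |
| $q_{\text{The}} \cdot k_{\text{sat}}$ | $1 \cdot 1 + 0 \cdot 1 + 1 \cdot 0 + 0 \cdot 0$ | 1.0 |
| $q_{\text{The}} \cdot k_{\text{on}}$ | $1 \cdot 0 + 0 \cdot 0 + 1 \cdot 1 + 0 \cdot 1$ | 1.0 |
| $q_{\text{The}} \cdot k_{\text{mat}}$ | $1 \cdot 1 + 0 \cdot 0 + 1 \cdot 0.5 + 0 \cdot 0.5$ | 1.5 |
Interpretation: “The” has the highest raw similarity with “cat” (score = 2.0) because $q_{\text{The}}$ and $k_{\text{cat}}$ are identical, giving perfect alignment. The lowest similarity is with itself (score = 0.0) because $k_{\text{The}}$ is orthogonal to its query.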
Step 2 — Scaling:
We divide each raw score by $\sqrt{d_k} = \sqrt{4} = 2$:
| Token | Raw Score / $\sqrt{d_k}$ | Scaled Score |
|---|---|---|
| The | 0.0 / 2.0 | 0.000 |
| cat | 2.0 / 2.0 | 1.000 |
| sat | 1.0 / 2.0 | 0.500 |
| on | 1.0 / 2.0 | 0.500 |
| mat | 1.5 / 2.0 | 0.750 |
In our small example ($d_k = 4$), scaling reduces scores by half. In a production model with $d_k = 128$, scores would be divided by ~11.3, a much more dramatic reduction.
Step 3 — Softmax:
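The softmax function converts the scaled scores into a probability distribution. For a vector $z$, the softmax of element $i$ is $\text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$.
Computing each exponential:
| Token | Scaled Score | Exponential |
|---|---|---|
| The | 0.000 | $e^{0.000} = 1.0000$ |
| cat | 1.000 | $e^{1.000} = 2.7183$ |
| sat | 0.500 | $e^{0.500} = 1.6487$ |
| on | 0.500 | $e^{0.500} = 1.6487$ |
| mat | 0.750 | $e^{0.750} = 2.1170$ |
Sum of exponentials: $1.0000 + 2.7183 + 1.6487 + 1.6487 + 2.1170 = 9.1327$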
Dividing each exponential by the sum:
| Token | Attention Weight | Percentage |
|---|---|---|
| The | 0.1095 | 10.95% |
| cat | 0.2976 | 29.76% |
| sat | 0.1805 | 18.05% |
| on | 0.1805 | 18.05% |
| mat | 0.2318 | 23.18% |
Interpretation: “The” pays 29.76% of its attention to “cat” — the token most aligned with its query. But notice that attention is not concentrated on a single token. “mat” also receives significant weight (23.18%) because its key partially overlaps with the query. This soft, distributed attention is what allows the mechanism to capture nuanced relationships.
Step 4 — Weighted Sum of Values:
Each output dimension is the weighted average of that dimension across all value vectors:
| Component | Calculation |
|---|---|
| $\text{output}_{\text{The}}$ | $\sum_j \alpha_j v_j$ (weighted sum of the five value vectors, with $\alpha_j$ the attention weight on token $j$) |
Expanding each value vector:
| Token | Weight × Value vector | Contribution |
|---|---|---|
| The | 0.1095 × [1.0, 0.0, 0.0, 0.0] | [0.1095, 0.0000, 0.0000, 0.0000] |
| cat | 0.2976 × [0.0, 1.0, 0.0, 0.0] | [0.0000, 0.2976, 0.0000, 0.0000] |
| sat | 0.1805 × [0.0, 0.0, 1.0, 0.0] | [0.0000, 0.0000, 0.1805, 0.0000] |
| on | 0.1805 × [0.0, 0.0, 0.0, 1.0] | [0.0000, 0.0000, 0.0000, 0.1805] |
| mat | 0.2318 × [0.5, 0.5, 0.5, 0.5] | [0.1159, 0.1159, 0.1159, 0.1159] |
| Sum | | [0.2254, 0.4135, 0.2964, 0.2964] |
Final output for “The”: $[0.2254, 0.4135, 0.2964, 0.2964]$
Notice that dim-1 has the largest value (0.4135) because “cat”, the most-attended token, contributes all its weight to dim-1 ($v_{\text{cat}} = [0, 1, 0, 0]$). The output for “The” has been enriched with the semantic content of “cat”, reflecting the strong query-key alignment between these two tokens.
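The four hand-computed steps for “The” can be reproduced in a few lines. This is a sketch that reuses the shared $K$ and $V$ matrices and the query row for “The”; the intermediate variable names are illustrative choices.

```python
import numpy as np

K = np.array([[0.0, 1.0, 0.0, 1.0],   # The
              [1.0, 0.0, 1.0, 0.0],   # cat
              [1.0, 1.0, 0.0, 0.0],   # sat
              [0.0, 0.0, 1.0, 1.0],   # on
              [1.0, 0.0, 0.5, 0.5]])  # mat
V = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.5, 0.5, 0.5, 0.5]])
q_the = np.array([1.0, 0.0, 1.0, 0.0])  # row 0 of Q

raw = K @ q_the                   # Step 1: dot product with every key
scaled = raw / np.sqrt(len(q_the))  # Step 2: divide by sqrt(d_k) = 2
exps = np.exp(scaled)             # Step 3a: exponentiate
weights = exps / exps.sum()       # Step 3b: normalize so weights sum to 1
output = weights @ V              # Step 4: weighted sum of value vectors

for name, values in [("raw", raw), ("scaled", scaled),
                     ("weights", weights), ("output", output)]:
    print(f"{name:>8}:", np.round(values, 4))
```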
Full Attention Weights and Output
This section covers the complete attention weight matrix ($5 \times 5$) and its corresponding output ($5 \times 4$) for all five tokens. Row $i$ of the weight matrix shows how token $i$ distributes its attention across the five keys.
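The full matrices can be regenerated from the shared example with the sketch below. It applies the same four steps to every row at once and prints one labeled row per query token; the printing format is an arbitrary choice.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "mat"]
Q = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0], [1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0]])
K = np.array([[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.5, 0.5]])
V = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]])

scores = Q @ K.T / np.sqrt(Q.shape[-1])           # 5 x 5 scaled scores
exps = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exps / exps.sum(axis=-1, keepdims=True)  # 5 x 5 attention weights
output = weights @ V                               # 5 x 4 output matrix

print("Attention weights (rows = queries, columns = keys):")
for token, row in zip(tokens, weights):
    print(f"{token:>4}", np.round(row, 4))

print("Output:")
for token, row in zip(tokens, output):
    print(f"{token:>4}", np.round(row, 4))
```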
Interpreting the Heatmap
Several patterns emerge from the attention weight matrix:
- “cat” attends most to “The” (0.4026), the determiner that typically precedes it. $q_{\text{cat}} = [0, 2, 0, 1]$ has strong overlap with $k_{\text{The}} = [0, 1, 0, 1]$ (dot product = 3.0, the largest in the entire matrix).
- “on” attends most to itself (0.3137), because $q_{\text{on}} = k_{\text{on}} = [0, 0, 1, 1]$. When the query and key of the same token are aligned, self-attention is high. This is common for function words that carry positional rather than semantic weight.
- “mat” distributes attention nearly uniformly (all weights near 0.19-0.24) — its query partially matches many keys without strongly preferring one.
Reading attention patterns
Applications Across Domains
Scaled dot-product attention is not limited to language. The same formula powers breakthroughs across every domain that involves sequences, sets, or structured data.
Natural Language Processing
In GPT-4, Claude, and LLaMA, each layer applies scaled dot-product attention so that every token can gather context from every other token. When the model processes “The cat sat on the mat because it was tired,” the attention mechanism for “it” assigns high weight to “cat” — resolving the pronoun coreference. Without attention, an RNN would need to carry the “cat” information through every intermediate hidden state.
Computer Vision
In Vision Transformers (ViT; Dosovitskiy et al., 2021), an image is split into fixed-size patches of 16×16 pixels. Each patch becomes a token, and scaled dot-product attention lets every patch attend to every other patch. A patch containing a dog's eye can attend to the patch containing its tail, capturing long-range spatial relationships that CNNs can only reach through many stacked convolutional layers.
Code Generation
In Codex and GitHub Copilot, attention enables the model to connect a function's name (token 1) with its return type (token 50), its docstring (tokens 5-20), and the variable names used inside (tokens 30-100). When generating return total_price, the model's attention assigns high weight to the earlier definition total_price = base_price * quantity — even if that line is hundreds of tokens away.
Scientific Sequence Modeling
In AlphaFold2 (Jumper et al., 2021), attention operates over amino acid residues in a protein sequence. Each residue attends to every other residue, learning which pairs of amino acids are likely to be spatially close in the 3D folded structure, even when they are far apart in the linear sequence. The attention weights effectively learn a proxy for physical contact maps, which is remarkable because the attention weights themselves are never directly supervised.
Real-World Scale
To appreciate the scale at which this formula operates, the table below shows illustrative parameters from published model architectures. The “Attention Ops” column is a back-of-the-envelope estimate using $\text{layers} \times \text{heads} \times N^2$ (the dominant term from $QK^\top$ across all heads and layers).
| Model | Layers | Heads | $d_k$ | Max Seq Length ($N$) | Attention Ops (est.) |
|---|---|---|---|---|---|
| GPT-3 (175B) | 96 | 96 | 128 | 2,048 | ~39 billion |
| LLaMA-2 70B | 80 | 64 | 128 | 4,096 | ~86 billion |
| GPT-4 (speculative) | ~120 | ~96 | 128 | 8,192 | ~770 billion |
| ViT-Large | 24 | 16 | 64 | 197 patches | ~15 million |
| AlphaFold2 | 48 | 8 | 32 | ~500 residues | ~96 million |
| Our example | 1 | 1 | 4 | 5 | 25 |
About these numbers
The last row is our toy example — the same formula, just orders of magnitude smaller. This is why understanding the mechanism at small scale transfers directly to understanding production systems.
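The estimates in the table are consistent with multiplying layers, heads, and the squared sequence length. The sketch below recomputes them from the table's own figures; the function name `attention_score_count` is an illustrative choice, and the GPT-4 row uses the table's speculative values as given.

```python
# (layers, heads, max sequence length) copied from the table above.
models = {
    "GPT-3 (175B)": (96, 96, 2048),
    "LLaMA-2 70B": (80, 64, 4096),
    "GPT-4 (speculative)": (120, 96, 8192),
    "ViT-Large": (24, 16, 197),
    "AlphaFold2": (48, 8, 500),
    "Our example": (1, 1, 5),
}

def attention_score_count(layers, heads, seq_len):
    """Number of query-key scores: one N x N table per head per layer."""
    return layers * heads * seq_len ** 2

for name, (layers, heads, seq_len) in models.items():
    count = attention_score_count(layers, heads, seq_len)
    print(f"{name:<20} {count:>18,}")
```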
Connection to Modern Systems
Scaled dot-product attention is the atomic building block. Every subsequent chapter in this book modifies, extends, or optimizes it. Here is how the major variants relate to this foundation.
Multi-Head Attention (Chapter 2)
Instead of running one attention function with $d_{model}$-dimensional keys, multi-head attention runs $h$ independent scaled dot-product attention operations in parallel, each on a smaller $d_k = d_{model}/h$-dimensional subspace. This allows different heads to learn different types of relationships: one head might learn syntactic dependency, another might learn semantic similarity, another might learn positional patterns. The outputs are concatenated and linearly projected back to $d_{model}$.
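The split-and-concatenate structure can be sketched without any learned weights. The version below is a deliberate simplification: real multi-head attention uses learned per-head projections and a final output projection, whereas this sketch simply slices the columns into $h$ groups. The function names and the random input are assumptions made for illustration.

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention for 2-D arrays."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_by_slicing(Q, K, V, h):
    """Run attention independently on h column slices, then concatenate.

    Simplification: column slices stand in for learned per-head projections,
    and the final learned output projection is omitted.
    """
    d_model = Q.shape[-1]
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // h
    heads = []
    for i in range(h):
        cols = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention(Q[:, cols], K[:, cols], V[:, cols]))
    return np.concatenate(heads, axis=-1)  # back to N x d_model

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))             # hypothetical 5 tokens, d_model = 4
out = multi_head_by_slicing(X, X, X, h=2)   # two heads of dimension 2
print(out.shape)
```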
Flash Attention (Chapter 13)
Flash Attention (Dao et al., 2022) computes mathematically identical results to scaled dot-product attention, but reorganizes the computation to minimize GPU memory reads/writes. The key insight is that the $N \times N$ attention matrix is never fully materialized in GPU HBM (high-bandwidth memory). Instead, it is computed in tiles that fit in SRAM (on-chip memory), achieving 2-4x speedup and reducing memory from $O(N^2)$ to $O(N)$. The math is the same; only the memory access pattern changes.
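The idea of never holding the full score matrix can be illustrated on the CPU with a running ("online") softmax. The sketch below is not the FlashAttention kernel: it tiles only over keys, runs in NumPy, and ignores everything about GPU memory. It only shows that processing keys block by block, while rescaling running accumulators, gives the same result as the direct formula. The function names and the block size are assumptions.

```python
import numpy as np

def reference_attention(Q, K, V):
    """Direct formula: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=2):
    """Same result, but only an N x block tile of scores exists at a time."""
    N, d_k = Q.shape
    out = np.zeros((N, V.shape[-1]))     # running un-normalized output
    row_max = np.full((N, 1), -np.inf)   # running max score per query
    row_sum = np.zeros((N, 1))           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d_k)                      # scores for this tile
        new_max = np.maximum(row_max, S.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)           # rescale old sums
        P = np.exp(S - new_max)
        row_sum = row_sum * correction + P.sum(axis=-1, keepdims=True)
        out = out * correction + P @ Vb
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q = rng.standard_normal((7, 4))
K = rng.standard_normal((7, 4))
V = rng.standard_normal((7, 4))
print(np.allclose(tiled_attention(Q, K, V, block=3), reference_attention(Q, K, V)))
```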
KV-Cache Optimization
During autoregressive generation (producing one token at a time), the model computes attention for the new token's query against all previous keys and values. The KV-cache stores the $K$ and $V$ matrices from all previous tokens so they don't need to be recomputed from the input embeddings at each step. The per-token attention cost remains $O(N \cdot d_k)$ (the new query still scores against all cached keys), but without the cache the model would need to reproject and recompute attention for all previous tokens from scratch, an $O(N^2 \cdot d_k)$ cost per generated token. The cache trades memory for this saving. Multi-Query Attention (Chapter 5) and Grouped-Query Attention (Chapter 6) reduce the memory cost by sharing K and V across heads.
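A toy version of the cache makes the bookkeeping concrete. In the sketch below the per-token query, key, and value vectors are supplied directly (in a real model they would come from projecting the new token's embedding), and the class name `KVCache` and its `step` method are hypothetical. The check at the end compares the incremental outputs against full causal attention computed in one shot.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class KVCache:
    """Keeps every key and value seen so far; only the new query is scored."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        self.keys.append(k)      # cache grows by one key and one value per token
        self.values.append(v)
        K = np.stack(self.keys)
        V = np.stack(self.values)
        weights = softmax(K @ q / np.sqrt(q.shape[-1]))  # one row of scores
        return weights @ V

rng = np.random.default_rng(0)
N, d_k = 5, 4
Q = rng.standard_normal((N, d_k))
K = rng.standard_normal((N, d_k))
V = rng.standard_normal((N, d_k))

cache = KVCache()
incremental = np.stack([cache.step(Q[t], K[t], V[t]) for t in range(N)])

# Reference: full attention with a causal mask (token t sees tokens 0..t).
scores = Q @ K.T / np.sqrt(d_k)
scores[np.triu(np.ones((N, N), dtype=bool), k=1)] = -np.inf
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
full = (w / w.sum(axis=-1, keepdims=True)) @ V

print(np.allclose(incremental, full))
```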
Positional Encodings (Chapters 7-9)
Scaled dot-product attention is permutation-equivariant (often loosely described as permutation-invariant): it has no built-in notion of order and treats the input as a set, not a sequence. Swapping the order of tokens in the input would produce the same attention weights and outputs, just reordered in the same way. To inject positional information, the Transformer adds positional encodings to the input embeddings. Chapters 7-9 cover three approaches: learned relative position bias (T5), Rotary Position Embeddings (RoPE, used in LLaMA), and Attention with Linear Biases (ALiBi). Each modifies either the input representations or the attention scores to encode where tokens are, not just what they are.
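The lack of order information can be verified directly: shuffling the rows of $Q$, $K$, and $V$ with the same permutation shuffles the output rows in exactly the same way and changes nothing else. The sketch below uses the shared example; the particular permutation is an arbitrary choice.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

Q = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0], [1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0]])
K = np.array([[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.5, 0.5]])
V = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]])

perm = np.array([4, 2, 0, 3, 1])  # "mat sat The on cat"

out = attention(Q, K, V)
out_shuffled = attention(Q[perm], K[perm], V[perm])

# Each token gets the same output vector regardless of where it sits.
print(np.allclose(out_shuffled, out[perm]))
```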
Complexity Analysis
To understand why attention becomes expensive, do not start with formulas. Start with one simple question:
As the sequence gets longer, how much more work does the model have to do, and how much more memory does it need to hold things while doing that work?
That is all complexity analysis is trying to measure.
- Time complexity tells us how the amount of work grows.
- Space complexity tells us how the amount of memory grows.
In attention, both grow quickly because every token can interact with every other token.
A Simple Picture: Students in a Classroom
Imagine a classroom with $N$ students. Each student wants to decide: “Which other students should I listen to?”
If every student checks every student (including themselves, as attention does), then student 1 checks $N$ students, student 2 checks $N$ students, and so on. The total number of checks is:
$$N \times N = N^2$$
That is the main time cost.
Now imagine writing every one of those scores on a big board. The board needs $N$ rows and $N$ columns, so it stores $N^2$ numbers. That is the main space cost. This is exactly what happens in attention.
Where the Cost Comes From in Attention
In scaled dot-product attention, each token asks: “How relevant is every other token to me?” The model answers that by building a giant score table:
$$S = QK^\top$$
If there are $N$ tokens, this table has shape $N \times N$. That means computing it takes a lot of work, and storing it takes a lot of memory. That single all-to-all interaction pattern is the reason attention becomes expensive.
Step-by-Step Intuition
Step 1: Compute $QK^\top$, the score matrix. This is the “everyone compares with everyone” step. Each token compares itself with every other token. If there are $N$ tokens, there are about $N^2$ token pairs. But each comparison is not just one tiny action: it is a dot product over vectors of size $d_k$. So each pair needs about $d_k$ small multiply-and-add operations.
Time: $O(N^2 d_k)$. Space: $O(N^2)$.
Intuition: a giant spreadsheet where every row is a token and every column is another token, and each cell stores “how much should I pay attention?”
Step 2: Scaling. After the scores are computed, each one is divided by $\sqrt{d_k}$. This keeps the numbers from getting too large. There are $N^2$ scores, so touching all of them costs $O(N^2)$ time. Usually this is done directly on the same score table, so it does not need another giant table: the extra memory cost is $O(1)$.
Intuition: you already filled the board with numbers, and now you go through and slightly shrink each one. That is work, but not extra board space.
Step 3: Softmax. Each row of the score matrix is turned into normalized attention weights. Each row has $N$ numbers, and there are $N$ rows, so the total work is $O(N^2)$. The model still stores the full weight table, so memory is still $O(N^2)$.
Intuition: the raw scores become “attention percentages.” Same giant board, same size, just cleaner numbers.
Step 4: Multiply weights by $V$. The model uses those attention weights to mix the value vectors and produce a new output vector for each token. For each of the $N$ output tokens, it looks across $N$ tokens, and each value vector has size $d_v$. So the work is $O(N^2 d_v)$. The output stores one vector of length $d_v$ per token, so the memory needed is only $O(N d_v)$.
Intuition: each token builds a summary of what it learned from everyone else.
Worked Example with 3 Tokens
The steps above describe what happens in words. But to truly understand why each operation costs what it does, nothing beats watching the numbers unfold. The walkthrough uses 3 tokens with vectors of length 3 and follows every single multiply-and-add.
The Real Idea
Before we start, let's clear up a common confusion:
One token-to-token comparison is NOT a single check.
It is a small calculation made from many numbers.
Think of it like this: if one token is described by 3 features and another token is described by 3 features, then to compare them you must check feature 1 with feature 1, feature 2 with feature 2, feature 3 with feature 3, then add them together.
So one comparison is really 3 little comparisons plus addition, not one instant action. Let's see this with real numbers.
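A plain-Python version of the 3-token case makes the counting explicit. The three feature vectors below are made-up numbers chosen only for this illustration, and for simplicity the same vectors serve as both queries and keys. Every multiply-and-add is counted as it happens.

```python
# Three hypothetical tokens, each described by 3 features.
tokens = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [2.0, 1.0, 0.0],
]

multiply_adds = 0
scores = []
for q in tokens:              # every token asks...
    row = []
    for k in tokens:          # ...about every token (itself included)
        total = 0.0
        for q_feature, k_feature in zip(q, k):
            total += q_feature * k_feature  # one multiply-and-add
            multiply_adds += 1
        row.append(total)
    scores.append(row)

for row in scores:
    print(row)
print("multiply-adds:", multiply_adds)  # 3 tokens x 3 tokens x 3 features
```

Nine comparisons, each made of three small multiply-and-add steps, is where the $N \times N \times d_k$ count comes from.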
A Tiny Example
Suppose there are only 4 tokens. The score matrix has $4 \times 4 = 16$ entries. That is small. Easy to compute. Easy to store.
Now suppose there are 1,000 tokens. The score matrix has $1{,}000 \times 1{,}000 = 1{,}000{,}000$ entries.
Now suppose there are 8,192 tokens. The score matrix has $8{,}192 \times 8{,}192 = 67{,}108{,}864$ entries. That is about 67 million scores for just one head in one layer.
This is why attention becomes expensive so quickly: the sequence length grows linearly, but the pairwise interaction table grows quadratically.
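The growth is easy to tabulate. The sketch below counts score-matrix entries for the three sequence lengths above and converts them to memory, assuming 4 bytes per score (32-bit floats); that storage size is an assumption for illustration, and real systems may use other precisions.

```python
def score_entries(n_tokens):
    """Entries in one N x N score matrix (one head, one layer)."""
    return n_tokens * n_tokens

BYTES_PER_SCORE = 4  # assumption: float32 scores

for n in (4, 1_000, 8_192):
    entries = score_entries(n)
    mib = entries * BYTES_PER_SCORE / 2**20
    print(f"N={n:>6,}  entries={entries:>12,}  memory={mib:10.4f} MiB")
```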
What “Bottleneck” Really Means Here
The bottleneck is the part that hurts the most as the input grows. In attention, that bottleneck is the $N \times N$ interaction pattern. Both the time to compute all token-to-token interactions and the space to store all those interaction scores become large. So when people say standard attention has an $O(N^2)$ bottleneck, they mean:
The cost blows up because every token talks to every token.
Summary Table
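| Operation | What It Does | Time | Space |
|---|---|---|---|
| $QK^\top$ | Compares everyone with everyone | $O(N^2 d_k)$ | $O(N^2)$ |
| Scaling | Adjusts every score once | $O(N^2)$ | $O(1)$ extra |
| Softmax | Turns scores into attention weights | $O(N^2)$ | $O(N^2)$ |
| weights × $V$ | Uses weights to gather information | $O(N^2 d_v)$ | $O(N d_v)$ |
| Total | | $O(N^2 d_k)$ (taking $d_v = d_k$) | $O(N^2)$ |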
The reason the total is dominated by quadratic behavior is simple: the giant board keeps showing up.
At $N = 8{,}192$ tokens (a typical context window), the attention matrix has 67 million entries per head, per layer. At 32 heads and 32 layers, that is ~69 billion score computations per forward pass.
Attention becomes expensive because every token compares itself with every other token. If there are $N$ tokens, that creates an $N \times N$ table of interaction scores. Building that table takes computation, and storing it takes memory. Time complexity measures how the amount of computation grows; space complexity measures how the required memory grows. In standard attention, both are dominated by this all-to-all interaction pattern, which is why the cost scales quadratically with sequence length.
Why this matters for the rest of the book
Python Implementation
Below is a complete, runnable implementation of scaled dot-product attention as a Python class. You can copy the full code and run it with python scaled_dot_product_attention.py; the only dependency is NumPy.
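The chapter's original listing is not reproduced in this text, so what follows is a minimal sketch of what such a class could look like rather than the author's exact code. The class name `ScaledDotProductAttention`, the `forward` method, and the stored attributes are naming choices made for this sketch. It keeps the raw scores and weights so the intermediate numbers from the step-by-step section can be inspected.

```python
import numpy as np

class ScaledDotProductAttention:
    """softmax(Q K^T / sqrt(d_k)) V for 2-D inputs, keeping intermediates."""

    def __init__(self, d_k):
        self.d_k = d_k
        self.scale = np.sqrt(d_k)
        self.raw_scores = None  # filled in by forward()
        self.weights = None     # filled in by forward()

    @staticmethod
    def softmax(x):
        # Subtracting the row max avoids overflow and leaves the result unchanged.
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def forward(self, Q, K, V):
        self.raw_scores = Q @ K.T                                   # compare
        self.weights = self.softmax(self.raw_scores / self.scale)  # scale + normalize
        return self.weights @ V                                     # aggregate

# Shared example from this chapter.
Q = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0], [1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0]])
K = np.array([[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.5, 0.5]])
V = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]])

attn = ScaledDotProductAttention(d_k=4)
output = attn.forward(Q, K, V)

np.set_printoptions(precision=4, suppress=True)
print("Attention weights:\n", attn.weights)
print("Output:\n", output)
```

Storing the intermediates on the instance is a convenience for learning and debugging; a production implementation would normally avoid keeping the full weight matrix around.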
PyTorch Implementation
The NumPy version above is ideal for understanding every arithmetic step. But in practice, you will use PyTorch — it gives you GPU acceleration, automatic differentiation, and a built-in optimized implementation. Below is the same scaled dot-product attention as an nn.Module, using the identical shared example.
Three things change from NumPy to PyTorch:
- Tensors replace arrays. torch.tensor can track a computation graph for gradient computation (when requires_grad=True); np.array does not.
- Batching is built in. torch.matmul and K.transpose(-2, -1) work for any number of leading batch dimensions. The NumPy K.T acts as a plain matrix transpose only for 2D arrays (on higher-rank arrays it reverses every axis).
- Masking support. The PyTorch version accepts an optional mask tensor that sets masked positions to $-\infty$ before softmax, so $e^{-\infty} = 0$ and those tokens receive zero attention. This is how causal masking (Chapter 3) and padding masks are implemented.
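The masking idea does not depend on PyTorch and can be sketched in NumPy as well. In the example below, the function name `masked_attention`, the convention that True means "blocked", and the random inputs are assumptions made for illustration. The sketch also assumes every row keeps at least one unmasked position; a fully masked row would produce NaN.

```python
import numpy as np

def masked_attention(Q, K, V, mask=None):
    """Scaled dot-product attention; mask[i, j] == True blocks query i from key j."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)   # blocked scores become -inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                             # exp(-inf) is exactly 0
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

N, d_k = 5, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, d_k))
K = rng.standard_normal((N, d_k))
V = rng.standard_normal((N, d_k))

# Causal mask: token i may not look at tokens j > i (strict upper triangle).
causal_mask = np.triu(np.ones((N, N), dtype=bool), k=1)

out, w = masked_attention(Q, K, V, mask=causal_mask)
print(np.round(w, 3))  # lower-triangular weights; each row still sums to 1
```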
PyTorch Built-in: F.scaled_dot_product_attention
Since PyTorch 2.0, there is a single built-in function that computes exactly what our class does — but it automatically selects the fastest available backend:
| Backend | When Used | Speedup |
|---|---|---|
| FlashAttention-2 | CUDA GPU, no custom mask, fp16/bf16 | 2-4x |
| Memory-Efficient | CUDA GPU with masks, any dtype | 1.5-3x |
| Math fallback | CPU, or when above backends unavailable | 1x (baseline) |
The call is a single line, F.scaled_dot_product_attention(Q, K, V), and passing is_causal=True enables causal masking.
When to use which
Use F.scaled_dot_product_attention in production code: it is faster, memory-efficient, and battle-tested. Use the manual nn.Module class when you need to inspect intermediate values (attention weights), add custom modifications, or learn how the mechanism works.
NumPy vs PyTorch: Side-by-Side
| Aspect | NumPy | PyTorch |
|---|---|---|
| Data type | np.ndarray | torch.Tensor |
| Matrix multiply | Q @ K.T | torch.matmul(Q, K.transpose(-2, -1)) |
| Softmax | Manual (exp, sum, divide) | F.softmax(scores, dim=-1) |
| Gradients | Not supported | Automatic (autograd) |
| GPU | Not supported | .cuda() / .to("cuda") |
| Batching | 2D only (N, d_k) | Any dims (..., N, d_k) |
| Masking | Not built in | masked_fill(mask, -inf) |
| Best for | Learning, debugging, prototyping | Training, inference, production |
| Output | Identical — both produce the same attention weights and output matrices | |
Key Takeaways
- The core formula performs four operations: compare (dot product), scale, normalize (softmax), and aggregate (weighted sum).
- Scaling by $\sqrt{d_k}$ is not arbitrary: it is the exact factor needed to keep the variance of dot products at 1, preventing softmax saturation and gradient death.
- Every token talks to every other token in parallel, solving the RNN's sequential bottleneck. The computational cost is $O(N^2 d_k)$, which is the price of all-pairs comparison.
- Q, K, V serve distinct roles: Q asks the question, K advertises what's available, V provides the content. The learned projections give the model flexibility to learn task-specific notions of relevance.
- This mechanism is the atom from which all 15 variants in this book are built. Multi-head attention runs it in parallel subspaces. Flash Attention computes it with better memory access. Causal masking restricts which keys each query can see. RoPE rotates Q and K to encode position. Understanding this chapter deeply makes every subsequent chapter immediate.
Exercises
These exercises reinforce the concepts from this chapter. Work through them by hand first, then verify with the Python class above.
Exercise 1: Compute Attention for “sat” (Row 2)
Using $q_{\text{sat}} = [1, 1, 1, 0]$ and the same $K$ and $V$ matrices, compute the four steps by hand: raw dot products, scaled scores, softmax weights, and the output vector. Which token does “sat” attend to most, and why?
Hint: “sat”'s query is active in dims 0, 1, and 2. Look for keys with the highest overlap in those same dimensions.
Exercise 2: What If We Skip Scaling?
Recompute the softmax weights for “The” (row 0) without dividing by $\sqrt{d_k}$. Compare the resulting weight distribution to the scaled version. How much sharper is the unscaled distribution? Now imagine a much larger dimension, say $d_k = 512$: what would happen to the softmax output?
Exercise 3: Identical Q and K
Suppose we set $K = Q$ (every token's query equals its own key). What pattern would you expect in the attention weight matrix? Would every token attend most to itself? Test your prediction by modifying the Python class: set K = Q.copy() and run the code.
Exercise 4: Orthogonal Keys
Design a $K$ matrix where every key vector is orthogonal to every other key vector (hint: use the identity matrix for the first 4 tokens). What does the attention weight matrix look like? What does this tell you about how the model behaves when keys carry maximally distinct information?
Exercise 5: Scale Factor Derivation
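The variance proof assumes $\mathbb{E}[q_i] = \mathbb{E}[k_i] = 0$ and $\mathrm{Var}(q_i) = \mathrm{Var}(k_i) = 1$. What if the elements have $\mathrm{Var}(q_i) = \mathrm{Var}(k_i) = \sigma^2$ instead of 1? Derive the new variance of the dot product and determine what scaling factor would be needed to normalize it back to 1.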
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems, 30. The paper that introduced the Transformer and scaled dot-product attention.
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). “Neural Machine Translation by Jointly Learning to Align and Translate.” ICLR 2015. The first attention mechanism for sequence-to-sequence models (additive attention).
- Sutskever, I., Vinyals, O., & Le, Q.V. (2014). “Sequence to Sequence Learning with Neural Networks.” NeurIPS 2014. The seq2seq architecture that exposed the encoder bottleneck problem.
- Dao, T., Fu, D.Y., Ermon, S., Rudra, A., & Ré, C. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS 2022. IO-aware tiling of the attention computation.
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” ICLR 2021. Vision Transformer (ViT).
- Jumper, J., Evans, R., Pritzel, A., et al. (2021). “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature, 596, 583-589. Attention applied to protein folding.
- Jain, S. & Wallace, B.C. (2019). “Attention is not Explanation.” NAACL 2019. Cautionary analysis of over-interpreting attention weights.