The Attention Atlas: Mechanisms That Power Modern AI
A Visual, Mathematical, and Practical Guide
Master all 15 attention mechanisms, from Scaled Dot-Product to Flash Attention, MLA, and Differential Attention, using the same example sentence throughout. Each comes with step-by-step math, matrices, and Python code.
Foundation: Shared example and core attention.
The Origin of Attention
From biological neurons to the transformer — how evolution's solution to resource allocation became the most important mechanism in modern AI
Scaled Dot-Product Attention
Vaswani et al. 2017 — the foundational attention mechanism
Multi-Head Variants: MHA, Causal, Cross-Attention.
Multi-Head Attention
Vaswani et al. 2017 — H independent attention heads in parallel
Causal (Masked) Self-Attention
Radford et al. 2018 — autoregressive masking for language generation
Cross-Attention
Vaswani et al. 2017 — the bridge between encoder and decoder
KV-Cache Optimization: MQA and GQA for efficient inference.
Multi-Query Attention (MQA)
Shazeer 2019 — shared K/V across all heads for fast inference
Grouped-Query Attention (GQA)
Ainslie et al. 2023 — the sweet spot between MHA and MQA
Positional Encoding: Relative bias, RoPE, ALiBi.
Relative Position Bias Attention
Shaw et al. 2018 / T5 2020 — distance-aware attention scoring
RoPE — Rotary Position Embedding
Su et al. 2021 — position via rotation of Q and K vectors
ALiBi — Attention with Linear Biases
Press et al. 2022 — linear distance penalty, no positional embeddings
Efficient Attention: Linear, Sliding Window, Sparse.
Linear Attention
Katharopoulos et al. 2020 — O(Nd²) instead of O(N²d)
Sliding Window Attention
Beltagy et al. 2020 — local window for O(N) complexity
Sparse Attention — BigBird
Zaheer et al. 2020 — local + global + random for long documents
Modern Innovations: Flash, Differential, MLA.
Flash Attention
Dao et al. 2022 — IO-aware tiling, same math, 3-8x faster
Differential Attention
Microsoft Research 2024 — noise cancellation via dual softmax
Multi-Head Latent Attention (MLA)
DeepSeek-V2 2024 — compressed KV-cache via learned bottleneck
Comparison: All 15 mechanisms side-by-side.
Final Comparison
All 15 mechanisms side-by-side — choosing the right one
17 sections. Begin with one.
Chapter 0 — The Origin of Attention — is where every reader starts.