Boo-AI — Master Artificial Intelligence by Building from Scratch

Why This Section Exists

§17.1 through §17.3 taught the LSTM and GRU end to end, with PyTorch implementation. If you stopped reading here and treated the chapter as “how recurrent networks work,” you would be well-equipped to ship a 2016-era production RNN. But the field moved on — not because LSTMs were wrong, but because modern accelerators hate serial work. This forward-looking section traces the exact computational reason recurrence hit a wall and shows how attention, the KV-cache, Flash Attention, multi-head projections, and positional encodings are the direct engineering descendants of everything you just learned. The hidden state never died — it was rewritten.

Heads up. This section is a conceptual primer, not a tutorial. You will see shapes and intuitions — the dedicated Transformer chapter is where we build attention from first principles in PyTorch. Read this to understand the bridge, not to implement it.

Why Recurrence Hit a Wall

LSTMs solved the vanishing-gradient problem, but they could not solve a different problem: modern accelerators hate serial work. A modern GPU has tens of thousands of parallel cores. An LSTM, by construction, cannot compute $h_t$ until $h_{t-1}$ has finished. For a sequence of length 1024 that means 1024 sequential matrix multiplications, one after another, with most of the GPU sitting idle at any moment.

Worse, the path between two tokens far apart grows with their distance. Token 1 influences token 500 only through 499 intermediate cell-state updates. The gradient of the loss at token 500 with respect to token 1's input passes through 499 chained Jacobians. Even with the forget-gate highway, that path stretches the “effective receptive field” of the model. Compare with attention, where every pair of tokens is one matrix multiply away.

Property	LSTM / GRU	Self-attention
Path length between any two tokens	O(L)	O(1) — every pair is one hop
Forward-pass parallelism across time	None (serial by construction)	Full (all pairs computed in parallel)
Per-step compute	O(H²)	O(L · H)
Total compute for a length-L sequence	O(L · H²)	O(L² · H)
Memory	O(H) state	O(L²) attention matrix
Long-range dependencies	Indirect (gated)	Direct (attention weight)

Attention trades a fixed-size bottleneck (the hidden state) for a quadratic-size one (the attention matrix). The trade pays off wildly when you have the compute to spend — which is exactly what happened between 2017 and 2024 as GPUs got faster and cheaper faster than sequences got longer.

Interactive: Recurrence vs Attention

The comparison is best seen dynamically. Press Animate to step through sequential RNN computation on the left; the attention panel on the right shows every output cell reading every input cell in one parallel pass.

Loading comparison…

KV-Cache: The Hidden State, Reborn

When an autoregressive transformer generates one token at a time, a naive implementation re-runs attention over the entire growing prefix at every step — pure quadratic waste. The fix is a KV-cache: for each past token, store the computed key and value vectors once, and reuse them on every subsequent step.

The shape of this optimization is uncannily familiar. An RNN's hidden state is a compact, running summary of the past, updated each step. A KV-cache is a growing, loss-less record of the past, appended each step. Both let the next step avoid re-doing work. The difference: the RNN squeezes history into a fixed-size vector and pays for that compression every time; the KV-cache grows linearly but keeps every token in full detail.

Aspect	RNN hidden state	Transformer KV-cache
Size	Fixed: O(H)	Grows: O(L · n_layers · n_heads · d_head)
Information loss	Lossy (compressed every step)	Lossless (every past K,V retained)
Per-step compute	One cell forward pass	One new Q attending to L cached (K, V) pairs
What it preserves	A learned summary	The exact past activations

For a 7B-parameter LLaMA-class model (32 layers, 32 heads, $d_{head} = 128$ , fp16) the cache footprint is $32 \cdot 32 \cdot 128 \cdot 2 \cdot 2 = 524{,}288$ bytes per token — exactly 512 KiB. A 32k-token context therefore costs 16 GiB of cache, which is why context-length has become a memory-planning problem as much as a quality problem.

Interactive: KV Cache Growth

Watch the cache grow during a toy 32-token generation. Toggle the cache off to see the per-step attention cost blow up quadratically.

Loading KV-cache visualizer…

The numbers on the bottom panels are real: those are the actual cache sizes for a 7B-parameter model at context lengths 128, 1k, 32k, 128k. Engineering innovations like grouped-query attention (GQA), paged attention (vLLM), and sliding-window attention (Mistral) exist specifically to bend this curve.

Flash Attention: When Memory, Not Compute, is the Bottleneck

Standard self-attention for a sequence of length $L$ and hidden size $d$ computes $A = \text{softmax}(QK^\top / \sqrt{d})$ and then $AV$ . The intermediate $A$ is an $L \times L$ matrix. For $L = 8{,}192$ this is 67 million entries — too big to fit in GPU on-chip SRAM, which forces reads and writes to high-bandwidth memory (HBM) at every step. HBM bandwidth, not FLOPs, becomes the limiter.

Flash Attention (Dao et al., 2022) removes this limit by never materializing the full $L \times L$ matrix. It tiles Q, K, V into small blocks, streams them through SRAM, and uses a numerically stable online softmax that updates running max and denominator as new blocks arrive. The output is bit-identical to standard attention; the HBM traffic drops by a factor of the tile size, and wall-clock speed improves 2–4×. Flash Attention 2 and 3 add further tile-shape and warp-scheduling refinements.

Variant	Bottleneck	Key idea
Naive attention	HBM bandwidth	Materialize full A ∈ R^(L×L)
Flash Attention	SRAM capacity (much larger headroom)	Tiled streaming softmax, no full A
Paged KV-cache	KV memory fragmentation	Treat cache as pages, share across sequences (vLLM)
Sliding-window	Long-context quadratic memory	Bound attention to W recent tokens (Mistral, Longformer)
Grouped-query (GQA)	KV-cache size	Share K,V heads across multiple Q heads — ~4× smaller cache

Lesson. Once compute per FLOP became cheap, sequence model performance stopped being about more math and became about moving fewer bytes between memory tiers. This is precisely the same shift that once moved us from vanilla RNNs to LSTMs — not a new formula, a better flow of information.

Multi-Head Attention and Positional Encodings

Two more ideas close the loop between recurrence and transformers.

Multi-head attention

A single attention head computes one similarity pattern. A language model wants many simultaneous patterns: syntactic agreement, semantic similarity, coreference, topic, punctuation boundaries, and so on. Multi-head attention projects Q, K, V into $h$ parallel subspaces of dimension $d_{head} = d / h$ , runs attention independently in each, then concatenates and projects back. It is the attention analogue of having multiple convolutional filters per layer.

An LSTM's hidden state, by contrast, is one monolithic vector whose dimensions are free to specialize but entangled in every update. Multi-head attention is more like running several smaller LSTMs in parallel and fusing their outputs — at zero sequential cost.

Positional encodings

Attention is permutation-invariant: shuffle the input tokens and the attention output permutes the same way. That is disastrous for language. The fix is to add a position signal to every token embedding so the model can distinguish “dog bites man” from “man bites dog”.

The two dominant schemes today are sinusoidal encodings (fixed, from the original “Attention Is All You Need” paper, $\text{PE}_{pos, 2i} = \sin(pos / 10000^{2i/d})$ and cosine on odd dimensions) and RoPE (rotary positional embeddings, Su et al. 2021) which rotates Q and K in 2-D subspaces by an angle proportional to position. RoPE has the beautiful property that the dot product $Q_m \cdot K_n$ depends only on the relative offset $m - n$ , not on absolute positions.

An RNN does not need positional encodings because order is baked into the sequential update rule — token 5 is processed after token 4 by construction. The price of parallelism is that position becomes something you must inject explicitly.

When to Use Which Model

Despite transformer dominance, LSTMs and GRUs are not obsolete. They remain the correct tool for a specific class of problems.

Use case	Recommended	Why
Very long, strictly sequential time series (sensor, audio)	LSTM / GRU	O(H) memory beats O(L²) for L in the millions
Real-time streaming inference on edge devices	LSTM / GRU	Fixed-size state; no cache to manage
Language modelling, translation, any NLP with large datasets	Transformer + FlashAttn + KV-cache	Parallelism in training, O(1) path length
Small tabular sequence problem, data < 10k examples	GRU	Fewer parameters than LSTM, easier to regularize
Tasks needing bidirectional context offline	BiLSTM or BERT-style transformer	Both see future context; transformer scales better if data allows
Extremely low-latency single-step inference (e.g. <1ms)	LSTM cell	One matmul per step beats attention + KV lookup

A healthy mental model: LSTM/GRU for O(L) memory; attention for O(1) path length; pick whichever property matters more for your problem.

Summary

Recurrent models hit a wall because their time loop is serial, which wastes modern hardware. Attention broke the wall by making every pair of tokens one matmul apart.
The KV-cache is the transformer's version of an RNN hidden state: a running memory of the past that lets the next step avoid redundant work. The trade is linear growth for zero information loss.
Flash Attention removes the memory-bandwidth wall of naive attention by tiling computation through SRAM and never materializing the full L×L matrix.
Multi-head attention is many small attention computations in parallel; positional encodings (sinusoidal, RoPE) inject the order information that recurrence got for free.
LSTMs, GRUs, and transformers are not rivals — they are points on a Pareto frontier of memory versus path length. Which one you pick depends on which constraint is hurting you most.

The closing frame: the central character of this whole chapter is not the LSTM or the GRU. It is the additive identity path — the idea that you can let gradients flow through time by adding state rather than multiplying it. LSTMs put that path on $C$ ; GRUs put it on $h$ ; residual connections in transformers put it between layers; KV-caches put it across generation steps. Different implementations, one idea.

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, &Lstrok;. & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017 / arXiv:1706.03762. [Introduces the transformer, scaled dot-product attention, multi-head attention, and sinusoidal positional encodings.]
Dao, T., Fu, D. Y., Ermon, S., Rudra, A. & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022 / arXiv:2205.14135.
Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691.
Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B. & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F. & Sanghai, S. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023 / arXiv:2305.13245.
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H. & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023 / arXiv:2309.06180.
Beltagy, I., Peters, M. E. & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv:2004.05150. [Sliding-window attention.]