Why This Section Exists
§17.1 through §17.3 taught the LSTM and GRU end to end, with PyTorch implementation. If you stopped reading here and treated the chapter as “how recurrent networks work,” you would be well-equipped to ship a 2016-era production RNN. But the field moved on — not because LSTMs were wrong, but because modern accelerators hate serial work. This forward-looking section traces the exact computational reason recurrence hit a wall and shows how attention, the KV-cache, Flash Attention, multi-head projections, and positional encodings are the direct engineering descendants of everything you just learned. The hidden state never died — it was rewritten.
Why Recurrence Hit a Wall
LSTMs solved the vanishing-gradient problem, but they could not solve a different problem: modern accelerators hate serial work. A modern GPU has tens of thousands of parallel cores. An LSTM, by construction, cannot compute until has finished. For a sequence of length 1024 that means 1024 sequential matrix multiplications, one after another, with most of the GPU sitting idle at any moment.
Worse, the path between two tokens far apart grows with their distance. Token 1 influences token 500 only through 499 intermediate cell-state updates. The gradient of the loss at token 500 with respect to token 1's input passes through 499 chained Jacobians. Even with the forget-gate highway, that path stretches the “effective receptive field” of the model. Compare with attention, where every pair of tokens is one matrix multiply away.
| Property | LSTM / GRU | Self-attention |
|---|---|---|
| Path length between any two tokens | O(L) | O(1) — every pair is one hop |
| Forward-pass parallelism across time | None (serial by construction) | Full (all pairs computed in parallel) |
| Per-step compute | O(H²) | O(L · H) |
| Total compute for a length-L sequence | O(L · H²) | O(L² · H) |
| Memory | O(H) state | O(L²) attention matrix |
| Long-range dependencies | Indirect (gated) | Direct (attention weight) |
Attention trades a fixed-size bottleneck (the hidden state) for a quadratic-size one (the attention matrix). The trade pays off wildly when you have the compute to spend — which is exactly what happened between 2017 and 2024 as GPUs got faster and cheaper faster than sequences got longer.
Interactive: Recurrence vs Attention
The comparison is best seen dynamically. Press Animate to step through sequential RNN computation on the left; the attention panel on the right shows every output cell reading every input cell in one parallel pass.
KV-Cache: The Hidden State, Reborn
When an autoregressive transformer generates one token at a time, a naive implementation re-runs attention over the entire growing prefix at every step — pure quadratic waste. The fix is a KV-cache: for each past token, store the computed key and value vectors once, and reuse them on every subsequent step.
The shape of this optimization is uncannily familiar. An RNN's hidden state is a compact, running summary of the past, updated each step. A KV-cache is a growing, loss-less record of the past, appended each step. Both let the next step avoid re-doing work. The difference: the RNN squeezes history into a fixed-size vector and pays for that compression every time; the KV-cache grows linearly but keeps every token in full detail.
| Aspect | RNN hidden state | Transformer KV-cache |
|---|---|---|
| Size | Fixed: O(H) | Grows: O(L · n_layers · n_heads · d_head) |
| Information loss | Lossy (compressed every step) | Lossless (every past K,V retained) |
| Per-step compute | One cell forward pass | One new Q attending to L cached (K, V) pairs |
| What it preserves | A learned summary | The exact past activations |
For a 7B-parameter LLaMA-class model (32 layers, 32 heads, , fp16) the cache footprint is bytes per token — exactly 512 KiB. A 32k-token context therefore costs 16 GiB of cache, which is why context-length has become a memory-planning problem as much as a quality problem.
Interactive: KV Cache Growth
Watch the cache grow during a toy 32-token generation. Toggle the cache off to see the per-step attention cost blow up quadratically.
Flash Attention: When Memory, Not Compute, is the Bottleneck
Standard self-attention for a sequence of length and hidden size computes and then . The intermediate is an matrix. For this is 67 million entries — too big to fit in GPU on-chip SRAM, which forces reads and writes to high-bandwidth memory (HBM) at every step. HBM bandwidth, not FLOPs, becomes the limiter.
Flash Attention (Dao et al., 2022) removes this limit by never materializing the full matrix. It tiles Q, K, V into small blocks, streams them through SRAM, and uses a numerically stable online softmax that updates running max and denominator as new blocks arrive. The output is bit-identical to standard attention; the HBM traffic drops by a factor of the tile size, and wall-clock speed improves 2–4×. Flash Attention 2 and 3 add further tile-shape and warp-scheduling refinements.
| Variant | Bottleneck | Key idea |
|---|---|---|
| Naive attention | HBM bandwidth | Materialize full A ∈ R^(L×L) |
| Flash Attention | SRAM capacity (much larger headroom) | Tiled streaming softmax, no full A |
| Paged KV-cache | KV memory fragmentation | Treat cache as pages, share across sequences (vLLM) |
| Sliding-window | Long-context quadratic memory | Bound attention to W recent tokens (Mistral, Longformer) |
| Grouped-query (GQA) | KV-cache size | Share K,V heads across multiple Q heads — ~4× smaller cache |
Lesson. Once compute per FLOP became cheap, sequence model performance stopped being about more math and became about moving fewer bytes between memory tiers. This is precisely the same shift that once moved us from vanilla RNNs to LSTMs — not a new formula, a better flow of information.
Multi-Head Attention and Positional Encodings
Two more ideas close the loop between recurrence and transformers.
Multi-head attention
A single attention head computes one similarity pattern. A language model wants many simultaneous patterns: syntactic agreement, semantic similarity, coreference, topic, punctuation boundaries, and so on. Multi-head attention projects Q, K, V into parallel subspaces of dimension , runs attention independently in each, then concatenates and projects back. It is the attention analogue of having multiple convolutional filters per layer.
An LSTM's hidden state, by contrast, is one monolithic vector whose dimensions are free to specialize but entangled in every update. Multi-head attention is more like running several smaller LSTMs in parallel and fusing their outputs — at zero sequential cost.
Positional encodings
Attention is permutation-invariant: shuffle the input tokens and the attention output permutes the same way. That is disastrous for language. The fix is to add a position signal to every token embedding so the model can distinguish “dog bites man” from “man bites dog”.
The two dominant schemes today are sinusoidal encodings (fixed, from the original “Attention Is All You Need” paper, and cosine on odd dimensions) and RoPE (rotary positional embeddings, Su et al. 2021) which rotates Q and K in 2-D subspaces by an angle proportional to position. RoPE has the beautiful property that the dot product depends only on the relative offset , not on absolute positions.
An RNN does not need positional encodings because order is baked into the sequential update rule — token 5 is processed after token 4 by construction. The price of parallelism is that position becomes something you must inject explicitly.
When to Use Which Model
Despite transformer dominance, LSTMs and GRUs are not obsolete. They remain the correct tool for a specific class of problems.
| Use case | Recommended | Why |
|---|---|---|
| Very long, strictly sequential time series (sensor, audio) | LSTM / GRU | O(H) memory beats O(L²) for L in the millions |
| Real-time streaming inference on edge devices | LSTM / GRU | Fixed-size state; no cache to manage |
| Language modelling, translation, any NLP with large datasets | Transformer + FlashAttn + KV-cache | Parallelism in training, O(1) path length |
| Small tabular sequence problem, data < 10k examples | GRU | Fewer parameters than LSTM, easier to regularize |
| Tasks needing bidirectional context offline | BiLSTM or BERT-style transformer | Both see future context; transformer scales better if data allows |
| Extremely low-latency single-step inference (e.g. <1ms) | LSTM cell | One matmul per step beats attention + KV lookup |
Summary
- Recurrent models hit a wall because their time loop is serial, which wastes modern hardware. Attention broke the wall by making every pair of tokens one matmul apart.
- The KV-cache is the transformer's version of an RNN hidden state: a running memory of the past that lets the next step avoid redundant work. The trade is linear growth for zero information loss.
- Flash Attention removes the memory-bandwidth wall of naive attention by tiling computation through SRAM and never materializing the full L×L matrix.
- Multi-head attention is many small attention computations in parallel; positional encodings (sinusoidal, RoPE) inject the order information that recurrence got for free.
- LSTMs, GRUs, and transformers are not rivals — they are points on a Pareto frontier of memory versus path length. Which one you pick depends on which constraint is hurting you most.
The closing frame: the central character of this whole chapter is not the LSTM or the GRU. It is the additive identity path — the idea that you can let gradients flow through time by adding state rather than multiplying it. LSTMs put that path on ; GRUs put it on ; residual connections in transformers put it between layers; KV-caches put it across generation steps. Different implementations, one idea.
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017 / arXiv:1706.03762. [Introduces the transformer, scaled dot-product attention, multi-head attention, and sinusoidal positional encodings.]
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A. & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022 / arXiv:2205.14135.
- Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691.
- Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B. & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
- Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F. & Sanghai, S. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023 / arXiv:2305.13245.
- Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H. & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023 / arXiv:2309.06180.
- Beltagy, I., Peters, M. E. & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv:2004.05150. [Sliding-window attention.]