Learning Objectives
By the end of this chapter, you will:
- Understand why the quadratic bottleneck of standard attention is the fundamental scalability limit of transformers, and what real problems it causes at sequence lengths beyond 4K tokens
- Master the kernel trick that replaces softmax with a feature map , enabling the associativity reordering that eliminates the matrix
- Derive and compute the full linear attention formula step by step using our shared example "The cat sat on the mat", tracing every matrix multiplication and normalization
- Understand why linear attention produces flatter weight distributions than softmax attention, and what this means for model quality in practice
- Implement both the non-causal and causal (recurrent) forms in Python/NumPy and PyTorch, understanding how the recurrent form enables per-token inference
- Connect linear attention to modern systems including state-space models (Mamba), RetNet, RWKV, and the linear-attention renaissance in long-context architectures
The Real Problem
The Quadratic Wall
Every attention mechanism we have discussed so far — scaled dot-product, multi-head, causal, cross-attention, MQA, GQA, and all positional encoding variants — shares a fundamental bottleneck: they must compute and store the full attention score matrix. The cost is in time and in memory, where is the sequence length and is the head dimension.
To grasp why this matters, consider the concrete numbers:
| Sequence Length (N) | Attention Matrix Size | Memory (FP32) | FLOPs (d=64) |
|---|---|---|---|
| 512 | 262K entries | 1 MB | 16.8M |
| 2,048 | 4.2M entries | 16 MB | 268M |
| 4,096 | 16.8M entries | 64 MB | 1.1B |
| 32,768 | 1.07B entries | 4 GB | 68.7B |
| 100,000 | 10B entries | 40 GB | 640B |
| 1,000,000 | 1T entries | 4 TB | 64T |
At (a modest context for modern LLMs), the attention matrix alone consumes 64 MB per head per layer. A model with 32 heads and 40 layers needs 64 MB 32 40 = 80 GB just for attention scores — exceeding the memory of an A100 GPU. At (a single novel or legal document), the matrix has 10 billion entries. Processing a million-token document is simply impossible with quadratic attention.
The Quadratic Wall: Standard attention works beautifully for short sequences (N < 2K) but hits a hard memory and compute wall that prevents scaling to long documents, code repositories, genomic sequences, and multi-modal data. This wall motivated an entire research agenda to find sub-quadratic alternatives.
Who Solved It
In 2020, Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret published "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" at ICML 2020. Their key insight was deceptively simple: if you replace softmax with a kernel feature map , you can exploit the associativity of matrix multiplication to reorder the computation and eliminate the matrix entirely.
The paper title reveals their deeper insight: with this reordering, a transformer layer becomes mathematically equivalent to an RNN, processing one token at a time with a fixed-size hidden state. This means linear attention can perform autoregressive inference in time per token (constant, regardless of context length), compared to for standard attention with KV-cache.
The Kernel Trick Intuition
Matrix Multiplication is Associative
The entire idea behind linear attention rests on a property you learned in linear algebra: matrix multiplication is associative. For matrices , , with compatible dimensions:
In standard attention, the computation order is fixed by softmax:
The softmax operates on the score matrix, so you must compute first, then apply softmax, then multiply by . You cannot rearrange the parentheses because softmax is a nonlinear, row-wise operation that breaks associativity.
But what if we replace softmax with something that preserves the linear structure? Suppose we have a feature map such that the attention weight between query and key is simply:
Then the output for query (before normalization) is:
Since does not depend on , we can pull it out of the sum. The inner sum is a matrix that can be precomputed once for all queries. This is the associativity trick:
The Feature Map
The feature map must satisfy one critical constraint: it must produce non-negative values. This is because represents an unnormalized attention weight, and negative weights would mean "anti-attending" to certain tokens, which breaks the probabilistic interpretation.
Katharopoulos et al. chose :
This maps every real number to a strictly positive value: for , the output is ; for , the output is . It is smooth, differentiable, and cheap to compute.
The Mathematical Definition
Symbol-by-Symbol Breakdown
The complete linear attention formula is:
In matrix form:
Let us break this down symbol by symbol:
| Symbol | Shape | Meaning |
|---|---|---|
| Q, K | N × d_k | Query and key matrices (our familiar 5×4 matrices) |
| V | N × d_v | Value matrix (5×4 in our example) |
| φ(·) | element-wise | Feature map ELU(x)+1 applied to every element |
| φ(Q) | N × d_k | Transformed queries (5×4, all values > 0) |
| φ(K) | N × d_k | Transformed keys (5×4, all values > 0) |
| φ(K)ᵀ · V | d_k × d_v | THE KEY: a small d×d summary matrix (4×4!) |
| φ(Q) · (φ(K)ᵀ V) | N × d_v | Numerator: unnormalized output (5×4) |
| Σ φ(K_j) | d_k | Sum of all transformed key vectors (4-dim vector) |
| φ(Q) · Σφ(K) | N | Denominator: one normalization scalar per query (5 values) |
| O | N × d_v | Final output (5×4) = numerator / denominator |
Why Normalize?
Without normalization, the magnitude of the output would depend on the number of tokens and the magnitude of the feature-mapped keys. The denominator ensures that the implicit attention weights for each query sum to 1, just like softmax. This keeps the output values in a reasonable range regardless of sequence length.
Think of it this way: the numerator computes a weighted sum of values where the weights are . The denominator divides by the total weight, converting raw scores into a proper weighted average. This is analogous to how softmax normalizes by .
Step-by-Step Calculation
Let us compute linear attention for our shared example "The cat sat on the mat" using the same Q, K, V matrices as every other chapter.
Step 1: Apply Feature Map to Q and K
Since all entries in our Q and K are , the feature map simply adds 1 to every element:
| Token | Q[i] | φ(Q[i]) = Q[i] + 1 |
|---|---|---|
| The | [1.0, 0.0, 1.0, 0.0] | [2.0, 1.0, 2.0, 1.0] |
| cat | [0.0, 2.0, 0.0, 1.0] | [1.0, 3.0, 1.0, 2.0] |
| sat | [1.0, 1.0, 1.0, 0.0] | [2.0, 2.0, 2.0, 1.0] |
| on | [0.0, 0.0, 1.0, 1.0] | [1.0, 1.0, 2.0, 2.0] |
| mat | [1.0, 0.0, 0.0, 1.0] | [2.0, 1.0, 1.0, 2.0] |
| Token | K[i] | φ(K[i]) = K[i] + 1 |
|---|---|---|
| The | [0.0, 1.0, 0.0, 1.0] | [1.0, 2.0, 1.0, 2.0] |
| cat | [1.0, 0.0, 1.0, 0.0] | [2.0, 1.0, 2.0, 1.0] |
| sat | [1.0, 1.0, 0.0, 0.0] | [2.0, 2.0, 1.0, 1.0] |
| on | [0.0, 0.0, 1.0, 1.0] | [1.0, 1.0, 2.0, 2.0] |
| mat | [1.0, 0.0, 0.5, 0.5] | [2.0, 1.0, 1.5, 1.5] |
Step 2: Compute (shape , NOT !)
This is the critical computation. Instead of building a attention matrix, we compute a summary matrix:
For row 0 of KV:
| v₀ | v₁ | v₂ | v₃ | |
|---|---|---|---|---|
| KV[0,:] | 2.00 | 3.00 | 3.00 | 2.00 |
| KV[1,:] | 2.50 | 1.50 | 2.50 | 1.50 |
| KV[2,:] | 1.75 | 2.75 | 1.75 | 2.75 |
| KV[3,:] | 2.75 | 1.75 | 1.75 | 2.75 |
Step 3: Compute Numerator =
Each query row is multiplied by the KV matrix. For "The" (row 0):
Computing each dimension:
- dim 0:
- dim 1:
- dim 2:
- dim 3:
| Token | Numerator |
|---|---|
| The | [12.7500, 14.7500, 13.7500, 13.7500] |
| cat | [16.7500, 13.7500, 15.7500, 14.7500] |
| sat | [15.2500, 16.2500, 16.2500, 15.2500] |
| on | [13.5000, 13.5000, 12.5000, 14.5000] |
| mat | [13.7500, 13.7500, 13.7500, 13.7500] |
Step 4: Compute Denominator =
First, sum all rows:
Then dot each row with this sum:
| Token | φ(Q[i]) | Dot with [8, 7, 7.5, 7.5] | Denominator |
|---|---|---|---|
| The | [2, 1, 2, 1] | 2×8 + 1×7 + 2×7.5 + 1×7.5 | 45.5 |
| cat | [1, 3, 1, 2] | 1×8 + 3×7 + 1×7.5 + 2×7.5 | 51.5 |
| sat | [2, 2, 2, 1] | 2×8 + 2×7 + 2×7.5 + 1×7.5 | 52.5 |
| on | [1, 1, 2, 2] | 1×8 + 1×7 + 2×7.5 + 2×7.5 | 45.0 |
| mat | [2, 1, 1, 2] | 2×8 + 1×7 + 1×7.5 + 2×7.5 | 45.5 |
Step 5: Final Output = Numerator / Denominator
Divide each row of the numerator by its denominator:
| Token | Numerator / Denom | Output |
|---|---|---|
| The | [12.75, 14.75, 13.75, 13.75] / 45.5 | [0.2802, 0.3242, 0.3022, 0.3022] |
| cat | [16.75, 13.75, 15.75, 14.75] / 51.5 | [0.3252, 0.2670, 0.3058, 0.2864] |
| sat | [15.25, 16.25, 16.25, 15.25] / 52.5 | [0.2905, 0.3095, 0.3095, 0.2905] |
| on | [13.50, 13.50, 12.50, 14.50] / 45.0 | [0.3000, 0.3000, 0.2778, 0.3222] |
| mat | [13.75, 13.75, 13.75, 13.75] / 45.5 | [0.3022, 0.3022, 0.3022, 0.3022] |
Interactive: The Kernel Trick Step-by-Step
Click through each step to see exactly how linear attention avoids the matrix. Watch the shapes: the intermediate matrices are always (4×4), never (5×5).
Full Attention Weights and Output
Although linear attention never explicitly builds the weight matrix, we can compute implicit attention weights to understand what the mechanism is doing. The implicit weight from query to key is:
Interpreting the Weights
| Query \ Key | The | cat | sat | on | mat |
|---|---|---|---|---|---|
| The | 0.1758 | 0.2198 | 0.1978 | 0.1978 | 0.2088 |
| cat | 0.2330 | 0.1748 | 0.2136 | 0.1942 | 0.1845 |
| sat | 0.1905 | 0.2095 | 0.2095 | 0.1905 | 0.2000 |
| on | 0.2000 | 0.2000 | 0.1778 | 0.2222 | 0.2000 |
| mat | 0.1978 | 0.1978 | 0.1978 | 0.1978 | 0.2088 |
Compare this with the softmax attention weights from Chapter 1:
| Query \ Key | The | cat | sat | on | mat |
|---|---|---|---|---|---|
| The | 0.1095 | 0.2976 | 0.1805 | 0.1805 | 0.2318 |
| cat | 0.4026 | 0.0898 | 0.2442 | 0.1481 | 0.1153 |
| sat | 0.1519 | 0.2505 | 0.2505 | 0.1519 | 0.1951 |
| on | 0.1903 | 0.1903 | 0.1154 | 0.3137 | 0.1903 |
| mat | 0.1892 | 0.1892 | 0.1892 | 0.1892 | 0.2430 |
The difference is striking:
- Softmax attention is peaked: "cat" gives 40.3% of its attention to "The" and only 9.0% to itself. It has strong preferences.
- Linear attention is flat: "cat" gives 23.3% to "The" and 17.5% to itself. All weights are close to the uniform value of 20%.
- Standard deviation: Softmax row std ranges from 0.02 to 0.11. Linear row std ranges from 0.004 to 0.02. Linear attention is 3-5x more uniform.
The Quality-Speed Trade-off: Linear attention is faster because it avoids the exponential nonlinearity of softmax. But that same exponential is what creates the sharp, peaked distributions that allow standard attention to selectively focus on the most relevant tokens. Linear attention trades discrimination power for computational efficiency.
Output Matrix — Linear Attention
| Token | dim-0 | dim-1 | dim-2 | dim-3 |
|---|---|---|---|---|
| The | 0.2802 | 0.3242 | 0.3022 | 0.3022 |
| cat | 0.3252 | 0.2670 | 0.3058 | 0.2864 |
| sat | 0.2905 | 0.3095 | 0.3095 | 0.2905 |
| on | 0.3000 | 0.3000 | 0.2778 | 0.3222 |
| mat | 0.3022 | 0.3022 | 0.3022 | 0.3022 |
Note how the linear attention outputs are closer to the mean (0.30) compared to softmax outputs, which show more variance. This reflects the flatter attention distributions.
Interactive: Linear vs Softmax Heatmap
Toggle between linear and softmax attention weights to see the difference in selectivity. Click a row to see the side-by-side bar chart comparison.
The Recurrent Form (Causal Linear Attention)
The most remarkable property of linear attention is that it can be reformulated as an RNN. This is why Katharopoulos et al. titled their paper "Transformers are RNNs." The recurrent form processes tokens one at a time using a fixed-size hidden state, enabling per-token inference — constant time regardless of how many tokens have been processed.
The Recurrent Equations
Define two running states that are updated as each new token arrives:
The output for token is then:
is a matrix (4×4 in our example) that accumulates the outer products of all keys and values seen so far. is a -dimensional vector that accumulates the sum of all transformed keys. Both grow by a fixed amount per token — the memory is regardless of sequence length.
| Property | Standard Attention | Linear Attention (Recurrent) |
|---|---|---|
| Per-token compute | O(N · d) — recompute all attention | O(d²) — update S and z |
| Memory per layer | O(N · d) — full KV cache | O(d²) — just S and z |
| Total for N tokens | O(N² · d) | O(N · d²) |
| Context length limit | Bounded by KV cache memory | Unlimited (fixed state) |
Interactive: Recurrent Processing
Click each token to see how the running state and accumulate information. Notice how the state size stays fixed at 4×4 regardless of how many tokens have been processed.
Complexity Analysis
The complexity comparison between standard and linear attention depends on the relationship between (sequence length) and (head dimension):
| Metric | Standard Attention | Linear Attention | When Linear Wins |
|---|---|---|---|
| Time | O(N² · d) | O(N · d²) | N > d (almost always) |
| Memory | O(N²) | O(N · d + d²) | N > d |
| Per-token (causal) | O(N · d) | O(d²) | Always (N cancels) |
| Speedup factor | — | N / d | e.g., 4096/64 = 64x |
The crossover point is . When (extremely rare in practice), standard attention is actually more efficient. But for typical values ( or , ), linear attention wins by a factor of .
Interactive: Complexity Explorer
Drag the slider to change the head dimension and see how the speedup scales with sequence length. The bars use a logarithmic scale.
Applications Across Domains
Linear attention's scaling opens doors that quadratic attention cannot enter:
| Domain | Why Linear Attention Matters | Example Systems |
|---|---|---|
| Long-document NLP | Process entire books (100K+ tokens) in a single pass without chunking or hierarchical attention | LongT5, FNet |
| Genomics | DNA sequences can be millions of bases long. O(N²) is completely infeasible | Enformer variants, genomic foundation models |
| Time-series forecasting | Financial data, sensor streams, and weather models have extremely long sequences | Autoformer, linear-attention forecasters |
| Image generation | High-resolution images (1024×1024) have 1M+ pixel patches. Quadratic attention requires terabytes | Linear Transformer GAN, efficient ViT variants |
| Edge/mobile deployment | The recurrent form uses O(d²) memory per layer — feasible on devices with <1GB RAM | On-device language models |
| Streaming/real-time | Process infinite streams without growing memory. Each token is O(d²) = O(1) w.r.t. N | Real-time translation, live captioning |
Connection to Modern Systems
Linear attention was one of the earliest sub-quadratic attention mechanisms (2020), and it directly inspired a family of modern architectures:
- State-Space Models (Mamba, S4): Mamba (Gu & Dao, 2023) is essentially a selective linear attention with a data-dependent gating mechanism. The core idea — maintaining a fixed-size hidden state updated per token — comes directly from linear attention's recurrent form. Mamba adds input-dependent selection of which information to keep or forget.
- RetNet (Sun et al., 2023): Retention networks explicitly decompose attention into a recurrent form with exponential decay, directly building on the "Transformers are RNNs" insight. RetNet achieves the training parallelism of transformers with the inference efficiency of RNNs.
- RWKV (Peng et al., 2023): Another linear-attention-inspired architecture that achieves RNN-like inference with transformer-like training. Uses a different feature map (WKV mechanism) but the same core principle of avoiding the N×N matrix.
- Flash Attention (Dao et al., 2022): Takes the opposite approach — instead of changing the math (removing softmax), Flash Attention keeps exact softmax but optimizes the memory access pattern. Flash Attention is an IO-aware algorithm; linear attention is an algebraic reformulation. They solve the same problem from different angles.
- Flash Linear Attention (Yang et al., 2024): Combines both ideas — applies Flash Attention's IO-aware tiling to the linear attention computation. Shows that the two approaches are complementary, not competing.
- GLA (Gated Linear Attention): Adds a learnable gate to the recurrent state update, allowing the model to selectively forget old information. This addresses linear attention's tendency to accumulate noise in the state over very long sequences.
The linear attention renaissance: After being overshadowed by Flash Attention (which made quadratic attention fast enough for most uses), linear attention is experiencing a resurgence in 2024-2025 through models like Mamba-2, RWKV-6, and GLA. The key insight: for very long contexts (100K+), no amount of IO optimization can overcome the fundamental scaling.
Python Implementation
The full Python class implementation with both non-causal and causal (recurrent) forms. Click any line to see the exact values flowing through the computation.
PyTorch Implementation
The production-ready PyTorch implementation uses for clean batched operations and supports both causal and non-causal modes.
Key Takeaways
- The quadratic wall is real: At , the attention matrix has 10 billion entries. This is the fundamental scalability limit that linear attention solves.
- The trick is algebraic, not approximate: Linear attention does not approximate softmax attention. It computes a different attention mechanism that happens to be computable in time by exploiting matrix associativity.
- The feature map replaces softmax: maps all values to positive numbers, enabling the kernel trick that makes associative reordering valid.
- The key computation is : This matrix summarizes all key-value interactions without building the attention matrix.
- Flatter weights are the cost: Without the exponential amplification of softmax, linear attention produces more uniform weight distributions. This reduces selectivity but enables scalability.
- The recurrent form enables O(1) per-token inference: By maintaining a fixed-size state , linear attention processes each new token in constant time — a property that standard attention with KV-cache cannot match.
- Linear attention spawned an entire research field: Mamba, RetNet, RWKV, and GLA all trace their lineage to the "Transformers are RNNs" insight. The principle of maintaining a compact, updateable state is the foundation of modern sub-quadratic architectures.
Exercises
Exercise 1: Verify the Associativity
Compute (the matrix) and then multiply by V. Verify that you get the same numerator as the computation above. Why is the first approach and the second ? Count the multiply-add operations.
Exercise 2: Different Feature Maps
Replace with (where ). How do the attention weights change? What happens when many key values are negative? Why did the authors prefer ELU over ReLU?
Exercise 3: Causal vs Non-Causal
Run the causal (recurrent) form for all 5 tokens and compare the output for "mat" (the last token) with the non-causal output. They should match — why? Now compare the output for "The" (the first token) between causal and non-causal. Why are they different?
Exercise 4: Scaling to Long Sequences
Generate random Q, K, V matrices with and . Time both the quadratic computation () and the linear computation (). Does the measured speedup match the theoretical ? Why might it differ?
Exercise 5: Memory Analysis
For a transformer with 32 heads and 40 layers processing tokens with , calculate the total memory for: (a) Standard attention matrices, (b) Linear attention's KV matrices, (c) Linear attention's recurrent state. What is the memory reduction factor?
References
- Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." ICML 2020.
- Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." NeurIPS 2017.
- Gu, A., & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752.
- Sun, Y., Dong, L., Huang, S., et al. (2023). "Retentive Network: A Successor to Transformer for Large Language Models." arXiv:2307.08621.
- Peng, B., Alcaide, E., Anthony, Q., et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." EMNLP 2023.
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022.
- Yang, S., Wang, B., Shen, Y., Panda, R., & Kim, Y. (2024). "Gated Linear Attention Transformers with Hardware-Efficient Training." ICML 2024.