Shaw, Uszkoreit & Vaswani, "Self-Attention with Relative Position Representations", Google, 2018 — Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (T5), Google, 2020
Learning Objectives
After completing this chapter, you will be able to:
- Explain why absolute positional embeddings fail to capture the notion of "how far apart are these two tokens?" and why this matters for generalization
- Describe how relative position bias injects distance information directly into the attention score computation via an additive bias matrix
- Compute the bias matrix by hand and explain why logarithmic decay is preferred over linear decay
- Perform a full relative position bias forward pass on the shared “The cat sat on mat” example, comparing results with and without position bias
- Explain T5's logarithmic bucketing scheme and why it uses a fixed number of buckets instead of one parameter per distance
- Analyze how position bias redistributes attention weight from distant tokens to nearby tokens
- Implement relative position bias attention from scratch in both NumPy and PyTorch
- Connect relative position bias to its successors: RoPE (Chapter 8) and ALiBi (Chapter 9)
The Problem: Absolute Positions Are Blind to Distance
Why Absolute Positional Embeddings Fail
In Chapter 1, we saw that standard scaled dot-product attention computes $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$. This formula treats every token pair identically regardless of how far apart they are in the sequence. The token at position 1 and the token at position 100 get no distance signal at all: the attention mechanism operates purely on content similarity.
The original Transformer (Vaswani et al., 2017) addressed this by adding sinusoidal positional embeddings to the input. Each position gets a unique vector added to the token embedding before computing Q, K, V. This means Q and K implicitly contain position information because they were projected from position-enriched inputs.
But here is the critical problem: these are absolute positions. Position 3 always maps to the same embedding vector, regardless of sequence length or context. The model must learn, from training data alone, that position 3 and position 5 are "close" while position 3 and position 50 are "far." This relationship is never explicitly computed — it must be discovered indirectly from the dot products of position-enriched Q and K vectors.
The Missing Signal
Consider a concrete example. The phrase "the cat sat on the mat" might appear at the start of a document (positions 0-5) or in the middle (positions 500-505). With absolute positions, the model sees completely different positional embeddings in each case, even though the internal structure of the phrase is identical. The model must independently learn that Q[500]·K[502] should produce the same attention pattern as Q[0]·K[2].
The Core Limitation: Standard attention never explicitly computes "how far apart are tokens i and j?" It knows $Q_i \cdot K_j$ (content similarity), but not $|i - j|$ (relative distance). The position signal is buried inside the Q and K vectors, mixed with content information, and the model must disentangle them.
This leads to three practical failures:
- Poor length generalization. A model trained on sequences of length 512 struggles with sequences of length 1024, because it has never seen the positional embeddings for positions 512-1023.
- Wasted capacity. The model uses some of its attention heads just to discover relative position patterns, instead of having this information provided directly.
- Inconsistent local patterns. The same local structure ("adjective noun") at different absolute positions produces different Q·K products, forcing the model to learn position-invariant patterns redundantly.
The Intuition Behind Relative Position Bias
The Classroom Analogy
Imagine a student sitting in a classroom lecture. When the professor says something, the student naturally pays more attention to the words just spoken (nearby in time) and less to words from five minutes ago (distant in time). This isn't because the recent words are intrinsically more important — it's because temporal proximity is a strong prior for relevance. Nearby words are more likely to be part of the same thought, sentence, or argument.
The student still can recall something from five minutes ago if it's highly relevant to the current topic. The distance penalty doesn't block long-range connections; it just means that distant information needs to be more relevant (higher content similarity) to compete with local information. This is exactly what relative position bias does to attention.
The Key Insight
The key insight of Shaw et al. (2018) was remarkably simple: add a distance-dependent term directly to the attention scores before softmax. Instead of relying on the model to discover distance from absolute position embeddings, we tell it explicitly. For each pair of tokens at positions $i$ and $j$, we add a bias $B_{ij}$ that depends only on $i - j$, the relative offset.
This is a single addition to the standard attention formula:
Standard: $\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
With bias: $\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + B\right)V$
Everything else remains identical. The bias matrix is the only new component. It shifts the attention landscape so that nearby token pairs start with a natural advantage, while distant pairs must overcome a distance penalty through high content similarity alone.
Historical Development
The idea developed in two key stages:
- Shaw et al. (2018) introduced relative position representations in "Self-Attention with Relative Position Representations." They proposed learning separate key and value embeddings for relative positions and clipping distances beyond a maximum offset. This was the first work to demonstrate that relative positions outperform absolute positions on machine translation.
- Raffel et al. (2020) simplified the idea in the T5 paper. Instead of learning full embedding vectors per distance, T5 uses a single learned scalar bias per relative position bucket. They introduced logarithmic bucketing: short distances get exact representation, while long distances are grouped into shared buckets. This is more parameter-efficient and is the form used by T5 and its variants.
Mathematical Formulation
Standard Attention (Recap)
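From Chapter 1, standard scaled dot-product attention is:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where $Q \in \mathbb{R}^{N \times d_k}$ is the query matrix, $K \in \mathbb{R}^{N \times d_k}$ is the key matrix, $V \in \mathbb{R}^{N \times d_v}$ is the value matrix, and $N$ is the sequence length.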
Adding the Relative Position Bias
Relative position bias modifies this by adding a bias matrix $B \in \mathbb{R}^{N \times N}$ to the scaled scores before softmax:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + B\right)V$$
The key constraint is that $B_{ij}$ depends only on the relative offset $i - j$, not on the absolute positions $i$ and $j$ individually. This makes the bias translation-invariant: the same phrase at different positions in the sequence produces the same bias pattern.
How Is B[i,j] Computed?
There are two main approaches:
1. Learned bias (T5 style). Each relative position bucket has a learned scalar parameter. The model learns, through backpropagation, what bias to assign each distance. T5 uses 32 buckets with logarithmic spacing.
2. Fixed bias (our simulation). A mathematical function computes the bias from the distance. Our simulation uses logarithmic decay:
$$B_{ij} = -\lambda \log(1 + |i - j|)$$
where $\lambda > 0$ is the bias scale. This formula has several desirable properties:
- $B_{ii} = 0$: no penalty for attending to yourself (distance 0)
- $B_{ij} < 0$ for all $i \neq j$: all non-self attention gets penalized
- Logarithmic growth means the penalty increases slowly: distance 4 gets only about $2.32\times$ the penalty of distance 1, not $4\times$
- Symmetric: $B_{ij} = B_{ji}$, so direction doesn't matter
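The properties listed above can be checked numerically. The following is a minimal sketch, not the chapter's own implementation: the helper name `log_bias` is hypothetical, and the default $\lambda = 0.3$ is the value used in the worked example later in this chapter.

```python
import math

def log_bias(i, j, lam=0.3):
    """Fixed relative position bias: -lam * log(1 + |i - j|)."""
    return -lam * math.log(1 + abs(i - j))

# Zero penalty for self-attention (distance 0).
print(log_bias(3, 3) == 0.0)

# Every non-zero distance is penalized.
print(all(log_bias(0, d) < 0 for d in range(1, 5)))

# Symmetric: direction does not matter.
print(log_bias(1, 4) == log_bias(4, 1))

# Slow growth: distance 4 costs about 2.32x distance 1, not 4x.
ratio = log_bias(0, 4) / log_bias(0, 1)
print(round(ratio, 2))

# Translation invariance: only the offset enters the formula.
print(log_bias(0, 2) == log_bias(500, 502))
```

The last check mirrors the earlier phrase example: positions (0, 2) and (500, 502) share the same offset, so they receive the same bias.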
Symbol-by-Symbol Breakdown
| Symbol | Meaning | Our Example |
|---|---|---|
| Q | Query matrix — what each token is looking for | (5, 4) matrix |
| K | Key matrix — what each token advertises | (5, 4) matrix |
| V | Value matrix — the content to retrieve | (5, 4) matrix |
| N | Sequence length (number of tokens) | 5 |
| d_k | Key dimension (for scaling) | 4 |
| B | Relative position bias matrix | (5, 5) — symmetric, zero diagonal |
| λ | Bias scale — strength of distance penalty | 0.3 |
| \|i - j\| | Absolute distance between positions i and j | 0 to 4 |
| log(1 + d) | Natural logarithm of (1 + distance) | 0.0 to 1.609 |
Interactive: Bias Matrix Heatmap
Explore how the bias matrix changes with different bias scales and decay functions. Hover over any cell to see the exact calculation:
Step-by-Step Calculation
Setup: Shared Example
We use the same Q, K, V matrices from every chapter: 5 tokens ("The cat sat on mat"), each with $d_k = 4$ dimensions. The key parameters are $\lambda = 0.3$ (bias scale) and $\sqrt{d_k} = 2$ (scaling factor).
Step 1: Raw Scores
First, compute the raw dot products between all query-key pairs. These are identical to Chapter 1 — no position information yet:
| Q·Kᵀ | The | cat | sat | on | mat |
|---|---|---|---|---|---|
| The | 0.0 | 2.0 | 1.0 | 1.0 | 1.5 |
| cat | 3.0 | 0.0 | 2.0 | 1.0 | 0.5 |
| sat | 1.0 | 2.0 | 2.0 | 1.0 | 1.5 |
| on | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 |
| mat | 1.0 | 1.0 | 1.0 | 1.0 | 1.5 |
Notice: "cat" attending to "The" has the highest raw score (3.0), while three pairs tie for the lowest (0.0): "The" attending to "The", "cat" attending to "cat", and "on" attending to "sat". These scores reflect only content similarity.
Step 2: Scaling by $\sqrt{d_k}$
Divide every score by $\sqrt{d_k} = \sqrt{4} = 2$:
| Scaled | The | cat | sat | on | mat |
|---|---|---|---|---|---|
| The | 0.000 | 1.000 | 0.500 | 0.500 | 0.750 |
| cat | 1.500 | 0.000 | 1.000 | 0.500 | 0.250 |
| sat | 0.500 | 1.000 | 1.000 | 0.500 | 0.750 |
| on | 0.500 | 0.500 | 0.000 | 1.000 | 0.500 |
| mat | 0.500 | 0.500 | 0.500 | 0.500 | 0.750 |
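Steps 1 and 2 amount to one division per entry. The sketch below starts from the raw score table above rather than from Q and K, which are not reproduced in this section; the variable names are illustrative.

```python
import math

TOKENS = ["The", "cat", "sat", "on", "mat"]
D_K = 4  # key dimension from the shared example

# Raw Q·Kᵀ scores copied from the Step 1 table (rows are queries).
RAW = [
    [0.0, 2.0, 1.0, 1.0, 1.5],
    [3.0, 0.0, 2.0, 1.0, 0.5],
    [1.0, 2.0, 2.0, 1.0, 1.5],
    [1.0, 1.0, 0.0, 2.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.5],
]

scale = math.sqrt(D_K)  # sqrt(4) = 2
scaled = [[score / scale for score in row] for row in RAW]

# Print one row per query token, three decimals as in the Step 2 table.
for token, row in zip(TOKENS, scaled):
    print(f"{token:>3}", " ".join(f"{s:.3f}" for s in row))
```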
Step 3: Building the Bias Matrix B
For each distance $d = |i - j|$, compute $B = -\lambda \log(1 + d)$ with $\lambda = 0.3$:
| Distance d = \|i - j\| | log(1 + d) | B = -0.3 × log(1 + d) |
|---|---|---|
| 0 | 0.0000 | 0.0000 |
| 1 | 0.6931 | -0.2079 |
| 2 | 1.0986 | -0.3296 |
| 3 | 1.3863 | -0.4159 |
| 4 | 1.6094 | -0.4828 |
The complete 5×5 bias matrix (a Toeplitz matrix — each diagonal has the same value):
| B | The | cat | sat | on | mat |
|---|---|---|---|---|---|
| The | 0.0000 | -0.2079 | -0.3296 | -0.4159 | -0.4828 |
| cat | -0.2079 | 0.0000 | -0.2079 | -0.3296 | -0.4159 |
| sat | -0.3296 | -0.2079 | 0.0000 | -0.2079 | -0.3296 |
| on | -0.4159 | -0.3296 | -0.2079 | 0.0000 | -0.2079 |
| mat | -0.4828 | -0.4159 | -0.3296 | -0.2079 | 0.0000 |
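The bias matrix can be generated directly from the formula. This is a small sketch under the chapter's settings (5 tokens, bias scale 0.3); `build_bias` is a hypothetical helper name.

```python
import math

def build_bias(n, lam):
    """Return the n x n matrix B[i][j] = -lam * log(1 + |i - j|)."""
    # Writing it as 0.0 - x keeps the diagonal at +0.0 rather than -0.0.
    return [[0.0 - lam * math.log(1 + abs(i - j)) for j in range(n)]
            for i in range(n)]

B = build_bias(5, 0.3)

# Each printed row should match the corresponding row of the table above.
for row in B:
    print(" ".join(f"{b:7.4f}" for b in row))
```

Because every entry depends only on `abs(i - j)`, each diagonal holds a single value, which is the Toeplitz structure noted above.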
Step 4: Adding Bias to Scores
Element-wise addition of the scaled scores and the bias matrix, $S = \frac{QK^\top}{\sqrt{d_k}} + B$:
| Biased | The | cat | sat | on | mat |
|---|---|---|---|---|---|
| The | 0.0000 | 0.7921 | 0.1704 | 0.0841 | 0.2672 |
| cat | 1.2921 | 0.0000 | 0.7921 | 0.1704 | -0.1659 |
| sat | 0.1704 | 0.7921 | 1.0000 | 0.2921 | 0.4204 |
| on | 0.0841 | 0.1704 | -0.2079 | 1.0000 | 0.2921 |
| mat | 0.0172 | 0.0841 | 0.1704 | 0.2921 | 0.7500 |
Notice how "cat"'s score for "The" dropped from 1.500 to 1.2921 (distance 1, mild penalty), while "cat"'s score for "mat" dropped from 0.250 to -0.1659 (distance 3, strong enough to push it negative). Nearby tokens are penalized less.
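The two "cat" entries discussed above can be reproduced with a few lines. The sketch copies the scaled scores from the Step 2 table and applies the logarithmic bias with scale 0.3; the names `SCALED` and `biased` are illustrative.

```python
import math

LAM = 0.3

# Scaled scores copied from the Step 2 table.
SCALED = [
    [0.00, 1.00, 0.50, 0.50, 0.75],
    [1.50, 0.00, 1.00, 0.50, 0.25],
    [0.50, 1.00, 1.00, 0.50, 0.75],
    [0.50, 0.50, 0.00, 1.00, 0.50],
    [0.50, 0.50, 0.50, 0.50, 0.75],
]

# Add B[i][j] = -LAM * log(1 + |i - j|) to every entry.
biased = [
    [s - LAM * math.log(1 + abs(i - j)) for j, s in enumerate(row)]
    for i, row in enumerate(SCALED)
]

cat = 1  # row index of the query "cat"
print(f"cat -> The: {SCALED[cat][0]:.4f} -> {biased[cat][0]:.4f}")
print(f"cat -> mat: {SCALED[cat][4]:.4f} -> {biased[cat][4]:.4f}")
```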
Step 5: Softmax
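Let's trace the softmax computation for "The" (row 0) in detail (intermediate values are computed at full precision and rounded to four decimals):
- Biased scores: [0.0000, 0.7921, 0.1704, 0.0841, 0.2672]
- Subtract max (0.7921): [-0.7921, 0.0000, -0.6216, -0.7079, -0.5249]
- Exponentiate: [0.4529, 1.0000, 0.5371, 0.4927, 0.5916]
- Sum: 3.0743
- Normalize: [0.1473, 0.3253, 0.1747, 0.1603, 0.1924]
For "sat" (row 2), the middle token, whose neighbors sit at equal distances on both sides:
- Biased scores: [0.1704, 0.7921, 1.0000, 0.2921, 0.4204]
- Subtract max (1.0000): [-0.8296, -0.2079, 0.0000, -0.7079, -0.5796]
- Exponentiate: [0.4362, 0.8123, 1.0000, 0.4927, 0.5601]
- Sum: 3.3013
- Normalize: [0.1321, 0.2460, 0.3029, 0.1492, 0.1697]
"sat" now gives 30.3% weight to itself (distance 0, zero penalty), compared with 25.1% in standard attention. The bias has shifted weight toward the center of the row.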
Step 6: Output Computation
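The output for each token is the weighted sum of all value vectors, using that token's row $A_{i}$ of the attention weight matrix. For "The":
$$\text{output}_{\text{The}} = \sum_{j} A_{0j} V_j = 0.1473\,V_{\text{The}} + 0.3253\,V_{\text{cat}} + 0.1747\,V_{\text{sat}} + 0.1603\,V_{\text{on}} + 0.1924\,V_{\text{mat}} = [0.2435,\ 0.4215,\ 0.2709,\ 0.2565]$$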
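Steps 5 and 6 can be sketched together for the "The" row. One caveat: the V matrix is not printed in this section, so the values below are an assumption inferred from the output tables (one-hot rows for the first four tokens and 0.5 in every dimension for "mat"). The helper name `softmax` is illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

LAM = 0.3
scaled_the = [0.0, 1.0, 0.5, 0.5, 0.75]  # "The" row of the Step 2 table

# The query sits at position 0, so the distance to key j is simply j.
biased_the = [s - LAM * math.log(1 + j) for j, s in enumerate(scaled_the)]
weights_the = softmax(biased_the)

# ASSUMPTION: V inferred from the output tables, not given in the text.
V = [
    [1.0, 0.0, 0.0, 0.0],  # The
    [0.0, 1.0, 0.0, 0.0],  # cat
    [0.0, 0.0, 1.0, 0.0],  # sat
    [0.0, 0.0, 0.0, 1.0],  # on
    [0.5, 0.5, 0.5, 0.5],  # mat
]

# Weighted sum of value rows, one output dimension at a time.
output_the = [sum(w * v[d] for w, v in zip(weights_the, V)) for d in range(4)]

print([round(w, 4) for w in weights_the])
print([round(x, 4) for x in output_the])
```

Under that assumed V, the result agrees with the "The" row of the relative position bias output table to within rounding.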
Interactive: Standard vs Biased Attention
Compare attention weights with and without relative position bias. Select a query token and adjust the bias scale to see how distance penalties redistribute attention:
Attention Weight Analysis
Standard vs Biased: Weight Comparison
The difference matrix (biased weights minus standard weights) reveals the redistribution pattern:
| Δ Weight | The | cat | sat | on | mat |
|---|---|---|---|---|---|
| The | +0.0378 | +0.0276 | -0.0058 | -0.0203 | -0.0394 |
| cat | +0.0073 | +0.0228 | +0.0044 | -0.0146 | -0.0200 |
| sat | -0.0198 | -0.0045 | +0.0524 | -0.0027 | -0.0254 |
| on | -0.0380 | -0.0243 | -0.0017 | +0.0668 | -0.0028 |
| mat | -0.0385 | -0.0280 | -0.0135 | +0.0092 | +0.0708 |
What Changed and Why
The diagonal is always positive — every token gained self-attention weight because distance 0 has zero penalty. The off-diagonal pattern follows distance:
- "The" row: "cat" (d=1) gained 0.0276, while "mat" (d=4) lost 0.0394. Among the other tokens, the nearest neighbor benefits most.
- "sat" row (center): Self-attention gained the most (+0.0524). "The" and "mat" are both at d=2 and receive the same bias, yet they lost different amounts (-0.0198 and -0.0254). Softmax rescales each weight in proportion to its starting value, and "mat" started higher (0.1951 vs 0.1519).
- "mat" row (edge): Self-attention gained +0.0708, the largest diagonal gain. "mat" sits at the sequence edge, so every other token is penalized (up to d=4 for "The"), and its self-score was already the highest in its row, so the freed weight flows mainly to self.
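The difference matrix can be regenerated from the raw scores to check these observations. The sketch below assumes the chapter's settings (scaling by $\sqrt{4}$, bias scale 0.3) and uses illustrative helper names; setting the bias scale to 0 recovers standard attention.

```python
import math

D_K, LAM = 4, 0.3

# Raw Q·Kᵀ scores copied from the Step 1 table.
RAW = [
    [0.0, 2.0, 1.0, 1.0, 1.5],
    [3.0, 0.0, 2.0, 1.0, 0.5],
    [1.0, 2.0, 2.0, 1.0, 1.5],
    [1.0, 1.0, 0.0, 2.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.5],
]

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(raw, lam):
    """Row-wise softmax of scaled scores plus the log-distance bias."""
    return [
        softmax([s / math.sqrt(D_K) - lam * math.log(1 + abs(i - j))
                 for j, s in enumerate(row)])
        for i, row in enumerate(raw)
    ]

standard = attention_weights(RAW, 0.0)  # lam = 0 switches the bias off
biased = attention_weights(RAW, LAM)
delta = [[b - s for b, s in zip(b_row, s_row)]
         for b_row, s_row in zip(biased, standard)]

for row in delta:
    print(" ".join(f"{d:+.4f}" for d in row))

# Same bias, same relative change: both d=2 neighbors of "sat" shrink by one factor.
print(biased[2][0] / standard[2][0], biased[2][4] / standard[2][4])
```

Each row of the difference matrix sums to zero, since both weight rows sum to one: the bias only moves weight around within a row.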
Full Attention Weight Matrices
Standard Attention Weights (no bias)
| | The | cat | sat | on | mat |
|---|---|---|---|---|---|
| The | 0.1095 | 0.2976 | 0.1805 | 0.1805 | 0.2318 |
| cat | 0.4026 | 0.0898 | 0.2442 | 0.1481 | 0.1153 |
| sat | 0.1519 | 0.2505 | 0.2505 | 0.1519 | 0.1951 |
| on | 0.1903 | 0.1903 | 0.1154 | 0.3137 | 0.1903 |
| mat | 0.1892 | 0.1892 | 0.1892 | 0.1892 | 0.2430 |
Relative Position Bias Attention Weights
| | The | cat | sat | on | mat |
|---|---|---|---|---|---|
| The | 0.1473 | 0.3253 | 0.1747 | 0.1603 | 0.1924 |
| cat | 0.4099 | 0.1126 | 0.2486 | 0.1335 | 0.0954 |
| sat | 0.1321 | 0.2460 | 0.3029 | 0.1492 | 0.1697 |
| on | 0.1523 | 0.1660 | 0.1137 | 0.3805 | 0.1875 |
| mat | 0.1508 | 0.1612 | 0.1758 | 0.1985 | 0.3138 |
Full Output Matrices
Standard Output (no bias)
| | dim-0 | dim-1 | dim-2 | dim-3 |
|---|---|---|---|---|
| The | 0.2254 | 0.4135 | 0.2964 | 0.2964 |
| cat | 0.4602 | 0.1475 | 0.3018 | 0.2058 |
| sat | 0.2495 | 0.3481 | 0.3481 | 0.2495 |
| on | 0.2854 | 0.2854 | 0.2106 | 0.4089 |
| mat | 0.3108 | 0.3108 | 0.3108 | 0.3108 |
Relative Position Bias Output
| | dim-0 | dim-1 | dim-2 | dim-3 |
|---|---|---|---|---|
| The | 0.2435 | 0.4215 | 0.2709 | 0.2565 |
| cat | 0.4576 | 0.1603 | 0.2963 | 0.1812 |
| sat | 0.2170 | 0.3309 | 0.3877 | 0.2341 |
| on | 0.2460 | 0.2597 | 0.2074 | 0.4743 |
| mat | 0.3077 | 0.3181 | 0.3326 | 0.3554 |
Notice that "mat"'s output was [0.3108, 0.3108, 0.3108, 0.3108] (uniform to four decimals) in standard attention but became [0.3077, 0.3181, 0.3326, 0.3554] with bias. It now favors dim-3 because "mat"'s nearest neighbor "on" has V = [0, 0, 0, 1] (all of its mass in dim-3) and received more weight (+0.0092).
T5's Logarithmic Bucketing Scheme
Why Use Buckets?
A naive implementation would assign one learnable parameter per relative distance. For a sequence of length $N$, this requires $2N - 1$ parameters (offsets from $-(N-1)$ to $N-1$). For $N = 4096$, that's 8,191 parameters per attention head, which is not terrible but comes with diminishing returns. The difference between "97 tokens away" and "98 tokens away" is linguistically meaningless. We don't need separate parameters for these nearly identical distances.
T5 solves this with logarithmic bucketing: nearby distances get their own bucket (exact representation), while distant ranges share a bucket (coarse approximation). With 32 buckets, you get:
- Exact region (buckets 0-7): distances 0-7 each get their own bucket. Distance 3 is distinguishable from distance 4.
- Logarithmic region (buckets 8-15): distances from 8 up to the maximum distance of 128 are spread logarithmically across 8 buckets. Distances 8-11 share one bucket, 12-15 share another, and so on; anything beyond the maximum distance falls into the last bucket.
- Bidirectional: the first 16 buckets handle "token j is to the left of token i" and the second 16 handle "token j is to the right."
The Bucket Algorithm
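The T5 bucketing algorithm works as follows. Let $r = j - i$ be the signed relative position, and let $n = |r|$:
- If bidirectional, separate positive and negative offsets into two halves of the bucket space
- For $n < n_{\text{exact}}$ (where $n_{\text{exact}}$ is half of the per-direction buckets): assign bucket = $n$ directly
- For $n \ge n_{\text{exact}}$: assign bucket = $n_{\text{exact}} + \left\lfloor \frac{\log(n / n_{\text{exact}})}{\log(d_{\max} / n_{\text{exact}})} \cdot (n_{\text{buckets}} - n_{\text{exact}}) \right\rfloor$, capped at $n_{\text{buckets}} - 1$, where $n_{\text{buckets}}$ is the number of buckets per direction and $d_{\max}$ is the maximum distance
This is a logarithmic mapping: each doubling of the distance adds a fixed number of buckets (two with the settings above, since 8 buckets cover the four doublings from 8 to 128). The model learns one scalar bias per bucket, shared across all distances that map to that bucket.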
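The bucketing rule can be sketched for a single offset. This is a scalar illustration modeled on the description above, not T5's own code: the function name and the parameter names `num_buckets` and `max_distance` are assumptions, and production implementations operate on whole tensors of offsets at once.

```python
import math

def relative_position_bucket(rel_pos, bidirectional=True,
                             num_buckets=32, max_distance=128):
    """Map a signed offset rel_pos = j - i to a bucket index."""
    offset = 0
    if bidirectional:
        num_buckets //= 2          # half of the buckets per direction
        if rel_pos > 0:            # key j lies to the right of query i
            offset = num_buckets
        n = abs(rel_pos)
    else:
        n = max(-rel_pos, 0)       # causal case: only look backward

    max_exact = num_buckets // 2   # distances below this get their own bucket
    if n < max_exact:
        return offset + n

    # Logarithmic region: equal-width bins in log-distance up to max_distance.
    log_ratio = math.log(n / max_exact) / math.log(max_distance / max_exact)
    bucket = max_exact + int(log_ratio * (num_buckets - max_exact))
    return offset + min(bucket, num_buckets - 1)  # clip far distances

# A few offsets to the left (negative) and to the right (positive).
for r in (-3, 3, -11, -12, -15, -200, 200):
    print(r, relative_position_bucket(r))
```

Offsets of equal magnitude but opposite sign land in different halves of the bucket space, so the model can learn different biases for looking left and looking right.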
Interactive: T5 Bucket Explorer
Explore how T5's logarithmic bucketing maps distances to buckets. Adjust the number of buckets and maximum distance to see how the exact/logarithmic boundary shifts:
Applications Across Domains
| Domain | How Relative Position Bias Helps | Example |
|---|---|---|
| NLP / Language Modeling | Nearby words are more likely to be syntactically related. The bias encodes this prior, reducing the burden on content similarity alone. | In "The cat sat on the mat," the determiner "the" most strongly modifies its adjacent noun. |
| Machine Translation | Source-target alignments tend to be locally monotonic. Relative bias encourages attending to nearby source positions, improving alignment quality. | Shaw et al. (2018) reported BLEU gains over absolute position encodings on WMT 2014 English-German and English-French. |
| Document Summarization | Key sentences often cluster within paragraphs. Relative bias helps the model attend within paragraph boundaries without hard-coded structural constraints. | T5 summarization on CNN/DailyMail benefits from position-aware attention. |
| Code Generation | Variable references typically occur within a few lines of their declaration. The bias naturally encodes this locality pattern. | A function's local variables are usually defined within 5-10 lines of their use. |
| Protein Sequence Modeling | Amino acids that are close in sequence are more likely to interact physically (local secondary structure). Relative bias captures this spatial prior. | Alpha-helices and beta-sheets are formed by nearby amino acid interactions. |
| Vision Transformers | Nearby image patches are more likely to belong to the same object. Relative bias encodes spatial locality without explicit convolution. | Swin Transformer uses relative position bias within local attention windows. |
Connection to Modern Systems
| System | Relationship to Relative Position Bias | Key Difference |
|---|---|---|
| RoPE (Chapter 8) | Also encodes relative position, but multiplies Q and K by rotation matrices instead of adding a bias. The dot product Q·K naturally becomes position-dependent. | Multiplicative (rotation) vs additive (bias). RoPE doesn't need an explicit B matrix. |
| ALiBi (Chapter 9) | Direct descendant. Uses a fixed linear bias: B[i,j] = -m × \|i-j\| with head-specific slopes m. No learned parameters. | Linear decay vs logarithmic. ALiBi is simpler (no log, no buckets) and is designed to extrapolate to sequences longer than those seen in training. |
| Flash Attention (Chapter 13) | Flash Attention can incorporate relative position bias by adding B to the score tiles during the IO-aware tiling computation. The bias matrix is never fully materialized. | Flash Attention is an implementation optimization, not a different mechanism. It computes the same result. |
| Swin Transformer | Uses learned relative position bias within fixed-size local windows. Each attention head has a (2W-1) × (2W-1) bias table where W is the window size. | Applied within local windows, not globally. Combined with window shifting for cross-window information flow. |
| T5 / mT5 / Flan-T5 | The production implementation of relative position bias. Uses 32 logarithmic buckets with learned biases per head. | Our simulation uses a fixed formula; T5 learns the bias values through training. |
| KV-Cache (Chapters 5-6) | Relative position bias is compatible with KV-cache. The bias for new tokens can be computed incrementally without recomputing the full B matrix. | Only the new row of B (the new query against the cached keys) needs to be computed for each new token during autoregressive generation. |
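The KV-cache row of the table can be made concrete with a short sketch. It assumes causal decoding with the fixed logarithmic bias and scale 0.3 from this chapter; `bias_row` and `full_bias` are hypothetical helper names, and the full matrix is built only to confirm that the incremental row matches it.

```python
import math

def bias_row(t, lam=0.3):
    """Bias for the new query at position t against cached keys 0..t."""
    return [0.0 - lam * math.log(1 + (t - j)) for j in range(t + 1)]

def full_bias(n, lam=0.3):
    """Reference only: the full n x n bias matrix."""
    return [[0.0 - lam * math.log(1 + abs(i - j)) for j in range(n)]
            for i in range(n)]

# Decode five tokens one at a time; each step computes a single row.
for t in range(5):
    row = bias_row(t)
    assert row == full_bias(t + 1)[t]  # identical to the last row of the full matrix
    print(t, [round(b, 4) for b in row])
```

Each decoding step therefore costs O(t) bias evaluations instead of O(t²).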
Complexity Analysis
| Aspect | Standard Attention | Relative Position Bias |
|---|---|---|
| Time: Q·Kᵀ | O(N² d_k) | O(N² d_k) — unchanged |
| Time: Build B | N/A | O(N²) — one log per entry |
| Time: Add B | N/A | O(N²) — element-wise addition |
| Time: Softmax | O(N²) | O(N²) — unchanged |
| Time: Total | O(N² d_k) | O(N² d_k) — B is dominated by Q·Kᵀ |
| Space: Extra | O(1) | O(N²) for the B matrix, or O(number of buckets) for a bucket table |
| Parameters | None extra | T5: 32 scalars per head. Fixed: 0 extra. |
| Length Generalization | Poor (unseen positions) | Good (only relative distances matter) |
The key point is that relative position bias adds O(N²) work to build and add B, but this is dominated by the O(N² d_k) cost of the dot-product computation. In practice the overhead is usually small, although the exact share depends on the implementation and the hardware.
Python Implementation
The NumPy implementation follows the six steps of the walkthrough: raw scores, scaling, bias construction, bias addition, softmax, and the weighted sum over V.
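What follows is a minimal NumPy sketch of those steps, offered as an illustration rather than the chapter's original listing. The function name `relative_bias_attention` is hypothetical, and the Q, K, V arrays are random placeholders with the shared example's shape (5, 4), not the chapter's actual matrices.

```python
import numpy as np

def relative_bias_attention(Q, K, V, lam=0.3):
    """Scaled dot-product attention with a fixed log-distance bias.

    Q, K: (N, d_k) arrays; V: (N, d_v) array; lam: bias scale.
    Returns (output, weights).
    """
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                       # steps 1-2: raw scores, scaling
    pos = np.arange(n)
    dist = np.abs(pos[:, None] - pos[None, :])            # |i - j| for every pair
    bias = -lam * np.log1p(dist)                          # step 3: B = -lam * log(1 + d)
    biased = scores + bias                                # step 4: add the bias
    biased = biased - biased.max(axis=1, keepdims=True)   # stabilize the softmax
    weights = np.exp(biased)
    weights /= weights.sum(axis=1, keepdims=True)         # step 5: row-wise softmax
    return weights @ V, weights                           # step 6: weighted sum over V

# Placeholder inputs (NOT the chapter's shared matrices).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))

out, w = relative_bias_attention(Q, K, V)
print(out.shape)       # one output row per token
print(w.sum(axis=1))   # each row of weights sums to 1
```

Calling the function with `lam=0.0` reduces it to standard attention, which makes it easy to compare the two side by side.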
PyTorch Implementation
The PyTorch version adds GPU support, automatic differentiation for training, and uses vectorized operations instead of Python loops. It also includes a learnable bias table for the T5-style learned approach:
Key Takeaways
- One-line change, big impact. Relative position bias adds a single matrix to the attention scores before softmax. Everything else (the Q·Kᵀ computation, scaling, softmax, and weighted sum) remains identical to standard attention.
- Distance as a prior. The bias matrix encodes the assumption that nearby tokens are more likely to be relevant. This doesn't prevent long-range attention; it just requires distant tokens to have higher content similarity to overcome the distance penalty.
- Translation invariance. Because $B_{ij}$ depends only on $i - j$, the same phrase at any position in the sequence produces the same bias pattern. This enables better length generalization than absolute positional embeddings.
- Logarithmic decay. The function $\log(1 + d)$ grows slowly: moving from distance 1 to distance 4 multiplies the penalty by only about 2.3. This means the model distinguishes fine-grained nearby positions while treating distant positions more coarsely.
- T5's bucketing is efficient. Instead of learning one parameter per distance, T5 uses 32 logarithmic buckets. This captures fine-grained local position with few parameters and generalizes to sequences longer than those seen during training.
- Foundation for successors. This chapter lays the groundwork for RoPE (Chapter 8, which uses rotation instead of addition) and ALiBi (Chapter 9, which uses linear decay instead of logarithmic).
Exercises
- Bias Scale Sensitivity: Recompute the attention weights for "The" using three different values of the bias scale $\lambda$. How does increasing $\lambda$ affect the balance between local and global attention? At what value of $\lambda$ does self-attention dominate?
- Linear vs Logarithmic Decay: Replace the bias formula with $B_{ij} = -\lambda |i - j|$ (linear decay). Compute the attention weights and compare. Which decay function penalizes long-range attention more harshly? When might linear be preferred?
- Asymmetric Bias: In causal (autoregressive) models, we might want asymmetric bias where looking backward (j < i) has a different penalty than looking forward. Design a bias formula that penalizes backward attention less than forward attention. Why might this be useful for language generation?
- T5 Bucket Mapping: For a model with 32 buckets and max_distance=128, compute the bucket assignment for distances 1, 5, 10, 50, and 100. Verify your answers with the interactive explorer above.
- Memory Analysis: A model has 32 attention heads, sequence length 4096, and uses T5-style bucketing with 32 buckets. How many parameters does the bias table require? Compare this to storing one parameter per relative distance.
References
- Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-Attention with Relative Position Representations. NAACL 2018.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 1-67.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017.
- Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021.
- Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. ICLR 2022.
- Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.