Chapter 7
35 min read
Section 8 of 17

Relative Position Bias Attention

Relative Position Bias Attention

Shaw, Uszkoreit & Vaswani, "Self-Attention with Relative Position Representations", Google, 2018 — Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (T5), Google, 2020


Learning Objectives

After completing this chapter, you will be able to:

  1. Explain why absolute positional embeddings fail to capture the notion of "how far apart are these two tokens?" and why this matters for generalization
  2. Describe how relative position bias injects distance information directly into the attention score computation via an additive bias matrix B[i,j]B[i,j]
  3. Compute the bias matrix B[i,j]=λlog(1+ij)B[i,j] = -\lambda \cdot \log(1 + |i - j|) by hand and explain why logarithmic decay is preferred over linear decay
  4. Perform a full relative position bias forward pass on the shared “The cat sat on mat” example, comparing results with and without position bias
  5. Explain T5's logarithmic bucketing scheme and why it uses a fixed number of buckets instead of one parameter per distance
  6. Analyze how position bias redistributes attention weight from distant tokens to nearby tokens
  7. Implement relative position bias attention from scratch in both NumPy and PyTorch
  8. Connect relative position bias to its successors: RoPE (Chapter 8) and ALiBi (Chapter 9)

The Problem: Absolute Positions Are Blind to Distance

Why Absolute Positional Embeddings Fail

In Chapter 1, we saw that standard scaled dot-product attention computes softmax(QK/dk)V\text{softmax}(QK^\top / \sqrt{d_k}) \cdot V. This formula treats every token pair identically regardless of how far apart they are in the sequence. The token at position 1 and the token at position 100 get no distance signal at all — the attention mechanism operates purely on content similarity.

The original Transformer (Vaswani et al., 2017) addressed this by adding sinusoidal positional embeddings to the input. Each position pp gets a unique vector PE(p)PE(p) added to the token embedding before computing Q, K, V. This means Q and K implicitly contain position information because they were projected from position-enriched inputs.

But here is the critical problem: these are absolute positions. Position 3 always maps to the same embedding vector, regardless of sequence length or context. The model must learn, from training data alone, that position 3 and position 5 are "close" while position 3 and position 50 are "far." This relationship is never explicitly computed — it must be discovered indirectly from the dot products of position-enriched Q and K vectors.

The Missing Signal

Consider a concrete example. The phrase "the cat sat on the mat" might appear at the start of a document (positions 0-5) or in the middle (positions 500-505). With absolute positions, the model sees completely different positional embeddings in each case, even though the internal structure of the phrase is identical. The model must independently learn that Q[500]·K[502] should produce the same attention pattern as Q[0]·K[2].

The Core Limitation: Standard attention never explicitly computes "how far apart are tokens i and j?" It knows Q[i]K[j]Q[i] \cdot K[j] (content similarity), but not ij|i - j| (relative distance). The position signal is buried inside the Q and K vectors, mixed with content information, and the model must disentangle them.

This leads to three practical failures:

  1. Poor length generalization. A model trained on sequences of length 512 struggles with sequences of length 1024, because it has never seen the positional embeddings for positions 512-1023.
  2. Wasted capacity. The model uses some of its attention heads just to discover relative position patterns, instead of having this information provided directly.
  3. Inconsistent local patterns. The same local structure ("adjective noun") at different absolute positions produces different Q·K products, forcing the model to learn position-invariant patterns redundantly.

The Intuition Behind Relative Position Bias

The Classroom Analogy

Imagine a student sitting in a classroom lecture. When the professor says something, the student naturally pays more attention to the words just spoken (nearby in time) and less to words from five minutes ago (distant in time). This isn't because the recent words are intrinsically more important — it's because temporal proximity is a strong prior for relevance. Nearby words are more likely to be part of the same thought, sentence, or argument.

The student still can recall something from five minutes ago if it's highly relevant to the current topic. The distance penalty doesn't block long-range connections; it just means that distant information needs to be more relevant (higher content similarity) to compete with local information. This is exactly what relative position bias does to attention.

The Key Insight

The key insight of Shaw et al. (2018) was remarkably simple: add a distance-dependent term directly to the attention scores before softmax. Instead of relying on the model to discover distance from absolute position embeddings, we tell it explicitly. For each pair of tokens at positions ii and jj, we add a bias B[i,j]B[i,j] that depends only on (ij)(i - j), the relative offset.

This is a single addition to the standard attention formula:

Standard: softmax ⁣(QKdk)V\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) \cdot V

With bias: softmax ⁣(QKdk+B)V\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + B\right) \cdot V

Everything else remains identical. The bias matrix BB is the only new component. It shifts the attention landscape so that nearby token pairs start with a natural advantage, while distant pairs must overcome a distance penalty through high content similarity alone.

Historical Development

The idea developed in two key stages:

  1. Shaw et al. (2018) introduced relative position representations in "Self-Attention with Relative Position Representations." They proposed learning separate key and value embeddings for relative positions and clipping distances beyond a maximum offset. This was the first work to demonstrate that relative positions outperform absolute positions on machine translation.
  2. Raffel et al. (2020) simplified the idea in the T5 paper. Instead of learning full embedding vectors per distance, T5 uses a single learned scalar bias per relative position bucket. They introduced logarithmic bucketing: short distances get exact representation, while long distances are grouped into shared buckets. This is more parameter-efficient and has become the standard approach.

Mathematical Formulation

Standard Attention (Recap)

From Chapter 1, standard scaled dot-product attention is:

A=softmax ⁣(QKdk)VA = \text{softmax}\!\left(\frac{Q \cdot K^\top}{\sqrt{d_k}}\right) \cdot V

where QRN×dkQ \in \mathbb{R}^{N \times d_k} is the query matrix, KRN×dkK \in \mathbb{R}^{N \times d_k} is the key matrix, VRN×dvV \in \mathbb{R}^{N \times d_v} is the value matrix, and NN is the sequence length.

Adding the Relative Position Bias

Relative position bias modifies this by adding a bias matrix BRN×NB \in \mathbb{R}^{N \times N} to the scaled scores before softmax:

A=softmax ⁣(QKdk+B)VA = \text{softmax}\!\left(\frac{Q \cdot K^\top}{\sqrt{d_k}} + B\right) \cdot V

The key constraint is that B[i,j]B[i,j] depends only on the relative offset (ij)(i - j), not on the absolute positions ii and jj individually. This makes the bias translation-invariant: the same phrase at different positions in the sequence produces the same bias pattern.

How Is B[i,j] Computed?

There are two main approaches:

1. Learned bias (T5 style). Each relative position bucket has a learned scalar parameter. The model learns, through backpropagation, what bias to assign each distance. T5 uses 32 buckets with logarithmic spacing.

2. Fixed bias (our simulation). A mathematical function computes the bias from the distance. Our simulation uses logarithmic decay:

B[i,j]=λlog(1+ij)B[i,j] = -\lambda \cdot \log(1 + |i - j|)

where λ=0.3\lambda = 0.3 is the bias scale. This formula has several desirable properties:

  • B[i,i]=0B[i,i] = 0 — no penalty for attending to yourself (distance 0)
  • B[i,j]<0B[i,j] < 0 for all iji \neq j — all non-self attention gets penalized
  • Logarithmic growth means the penalty increases slowly — distance 4 gets only 2.3×2.3\times the penalty of distance 1, not 4×4\times
  • Symmetric: B[i,j]=B[j,i]B[i,j] = B[j,i] — direction doesn't matter

Symbol-by-Symbol Breakdown

SymbolMeaningOur Example
QQuery matrix — what each token is looking for(5, 4) matrix
KKey matrix — what each token advertises(5, 4) matrix
VValue matrix — the content to retrieve(5, 4) matrix
NSequence length (number of tokens)5
d_kKey dimension (for scaling)4
BRelative position bias matrix(5, 5) — symmetric, zero diagonal
λBias scale — strength of distance penalty0.3
|i - j|Absolute distance between positions i and j0 to 4
log(1 + d)Natural logarithm of (1 + distance)0.0 to 1.609

Interactive: Bias Matrix Heatmap

Explore how the bias matrix changes with different bias scales and decay functions. Hover over any cell to see the exact calculation:

Loading bias heatmap...

Step-by-Step Calculation

Setup: Shared Example

We use the same Q, K, V matrices from every chapter: 5 tokens ("The cat sat on mat"), each with dk=4d_k = 4 dimensions. The key parameters are: λ=0.3\lambda = 0.3 (bias scale), dk=2.0\sqrt{d_k} = 2.0 (scaling factor).

Step 1: Raw Scores QKQ \cdot K^\top

First, compute the raw dot products between all query-key pairs. These are identical to Chapter 1 — no position information yet:

Q·KᵀThecatsatonmat
The0.02.01.01.01.5
cat3.00.02.01.00.5
sat1.02.02.01.01.5
on1.01.00.02.01.0
mat1.01.01.01.01.5

Notice: "cat" attending to "The" has the highest raw score (3.0), while "The" attending to "The" has the lowest (0.0). These scores reflect only content similarity.

Step 2: Scaling by dk\sqrt{d_k}

Divide every score by dk=4=2.0\sqrt{d_k} = \sqrt{4} = 2.0:

ScaledThecatsatonmat
The0.0001.0000.5000.5000.750
cat1.5000.0001.0000.5000.250
sat0.5001.0001.0000.5000.750
on0.5000.5000.0001.0000.500
mat0.5000.5000.5000.5000.750

Step 3: Building the Bias Matrix B

For each distance d=ijd = |i - j|, compute B=0.3×log(1+d)B = -0.3 \times \log(1 + d):

Distance |i-j|log(1+d)B = -0.3 × log(1+d)
00.00000.0000
10.6931-0.2079
21.0986-0.3296
31.3863-0.4159
41.6094-0.4828

The complete 5×5 bias matrix (a Toeplitz matrix — each diagonal has the same value):

BThecatsatonmat
The0.0000-0.2079-0.3296-0.4159-0.4828
cat-0.20790.0000-0.2079-0.3296-0.4159
sat-0.3296-0.20790.0000-0.2079-0.3296
on-0.4159-0.3296-0.20790.0000-0.2079
mat-0.4828-0.4159-0.3296-0.20790.0000

Step 4: Adding Bias to Scores

Element-wise addition: biased_scores=scaled_scores+B\text{biased\_scores} = \text{scaled\_scores} + B

BiasedThecatsatonmat
The0.00000.79210.17040.08410.2672
cat1.29210.00000.79210.1704-0.1659
sat0.17040.79211.00000.29210.4204
on0.08410.1704-0.20791.00000.2921
mat0.01720.08410.17040.29210.7500

Notice how "cat"'s score for "The" dropped from 1.500 to 1.2921 (distance 1, mild penalty), while "cat"'s score for "mat" dropped from 0.250 to -0.1659 (distance 3, strong enough to push it negative). Nearby tokens are penalized less.

Step 5: Softmax

Let's trace the softmax computation for "The" (row 0) in detail:

  1. Biased scores: [0.0000,0.7921,0.1704,0.0841,0.2672][0.0000, 0.7921, 0.1704, 0.0841, 0.2672]
  2. Subtract max (0.7921): [0.7921,0.0000,0.6216,0.7079,0.5249][-0.7921, 0.0000, -0.6216, -0.7079, -0.5249]
  3. Exponentiate: [0.4529,1.0000,0.5371,0.4927,0.5916][0.4529, 1.0000, 0.5371, 0.4927, 0.5916]
  4. Sum: 3.07433.0743
  5. Normalize: [0.1473,0.3253,0.1747,0.1603,0.1924][0.1473, 0.3253, 0.1747, 0.1603, 0.1924]

For "sat" (row 2) — the middle token benefits most from relative bias because it has equal-distance neighbors on both sides:

  1. Biased scores: [0.1704,0.7921,1.0000,0.2921,0.4204][0.1704, 0.7921, 1.0000, 0.2921, 0.4204]
  2. Subtract max (1.0000): [0.8296,0.2079,0.0000,0.7079,0.5796][-0.8296, -0.2079, 0.0000, -0.7079, -0.5796]
  3. Exponentiate: [0.4362,0.8123,1.0000,0.4927,0.5601][0.4362, 0.8123, 1.0000, 0.4927, 0.5601]
  4. Sum: 3.30133.3013
  5. Normalize: [0.1321,0.2460,0.3029,0.1492,0.1697][0.1321, 0.2460, 0.3029, 0.1492, 0.1697]

"sat" now gives 30.3% weight to itself (same position, zero penalty), compared to 25.1% in standard attention. The bias has concentrated attention on the center.

Step 6: Output Computation

The output for each token is the weighted sum of all value vectors. For "The":

Output[The]=0.1473×V[The]+0.3253×V[cat]+0.1747×V[sat]+0.1603×V[on]+0.1924×V[mat]\text{Output}[\text{The}] = 0.1473 \times V[\text{The}] + 0.3253 \times V[\text{cat}] + 0.1747 \times V[\text{sat}] + 0.1603 \times V[\text{on}] + 0.1924 \times V[\text{mat}]

=[0.2435,0.4215,0.2709,0.2565]= [0.2435, 0.4215, 0.2709, 0.2565]


Interactive: Standard vs Biased Attention

Compare attention weights with and without relative position bias. Select a query token and adjust the bias scale to see how distance penalties redistribute attention:

Loading attention comparison...

Attention Weight Analysis

Standard vs Biased: Weight Comparison

The difference matrix (biased weights minus standard weights) reveals the redistribution pattern:

Δ WeightThecatsatonmat
The+0.0378+0.0276-0.0058-0.0203-0.0394
cat+0.0073+0.0228+0.0044-0.0146-0.0200
sat-0.0198-0.0045+0.0524-0.0027-0.0254
on-0.0380-0.0243-0.0017+0.0668-0.0028
mat-0.0385-0.0280-0.0135+0.0092+0.0708

What Changed and Why

The diagonal is always positive — every token gained self-attention weight because distance 0 has zero penalty. The off-diagonal pattern follows distance:

  • "The" row: "cat" (d=1) gained +0.0276, while "mat" (d=4) lost -0.0394. The nearest neighbor benefits most.
  • "sat" row (center): Self-attention gained the most (+0.0524) because "sat" has equal neighbors on both sides. Both "The" (d=2) and "mat" (d=2) lost equally (-0.0198 and -0.0254).
  • "mat" row (edge): Self-attention gained +0.0708 (largest diagonal gain) because "mat" is at the sequence edge — it has the most distant tokens (d=4 to "The"), so the bias strongly penalizes those pairs and redistributes weight to self.
Key Observation: Edge tokens ("The" and "mat") show the largest self-attention gains because they have the most distant tokens in the sequence. Center tokens ("sat") show more balanced redistribution. This is a natural consequence of the logarithmic distance penalty.

Full Attention Weight Matrices

Standard Attention Weights (no bias)

Thecatsatonmat
The0.10950.29760.18050.18050.2318
cat0.40260.08980.24420.14810.1153
sat0.15190.25050.25050.15190.1951
on0.19030.19030.11540.31370.1903
mat0.18920.18920.18920.18920.2430

Relative Position Bias Attention Weights

Thecatsatonmat
The0.14730.32530.17470.16030.1924
cat0.40990.11260.24860.13350.0954
sat0.13210.24600.30290.14920.1697
on0.15230.16600.11370.38050.1875
mat0.15080.16120.17580.19850.3138

Full Output Matrices

Standard Output (no bias)

dim-0dim-1dim-2dim-3
The0.22540.41350.29640.2964
cat0.46020.14750.30180.2058
sat0.24950.34810.34810.2495
on0.28540.28540.21060.4089
mat0.31080.31080.31080.3108

Relative Position Bias Output

dim-0dim-1dim-2dim-3
The0.24350.42150.27090.2565
cat0.45760.16030.29630.1812
sat0.21700.33090.38770.2341
on0.24600.25970.20740.4743
mat0.30770.31810.33260.3554

Notice that "mat"'s output was [0.3108,0.3108,0.3108,0.3108][0.3108, 0.3108, 0.3108, 0.3108] (nearly uniform) in standard attention but became [0.3077,0.3181,0.3326,0.3554][0.3077, 0.3181, 0.3326, 0.3554] with bias — now favoring dim-3 because "mat"'s nearest neighbor "on" has V = [0,0,0,1][0, 0, 0, 1] (strong in dim-3) and received more weight (+0.0092).


T5's Logarithmic Bucketing Scheme

Why Use Buckets?

A naive implementation would assign one learnable parameter per relative distance. For a sequence of length NN, this requires 2N12N - 1 parameters (distances from (N1)-(N-1) to +(N1)+(N-1)). For N=4096N = 4096, that's 8,191 parameters per attention head — not terrible, but with diminishing returns. The difference between "97 tokens away" and "98 tokens away" is linguistically meaningless. We don't need separate parameters for these nearly identical distances.

T5 solves this with logarithmic bucketing: nearby distances get their own bucket (exact representation), while distant ranges share a bucket (coarse approximation). With 32 buckets, you get:

  • Exact region (buckets 0-7): distances 0-7 each get their own bucket. Position 3 is distinguishable from position 4.
  • Logarithmic region (buckets 8-15): distances 8-128 are logarithmically distributed across 8 buckets. Distances 8-11 share one bucket, 12-16 share another, etc.
  • Bidirectional: the first 16 buckets handle "token j is to the left of token i" and the second 16 handle "token j is to the right."

The Bucket Algorithm

The T5 bucketing algorithm works as follows. Let r=ijr = i - j be the signed relative position:

  1. If bidirectional, separate positive and negative offsets into two halves of the bucket space
  2. For r<nexact|r| < n_{\text{exact}} (half of half-buckets): assign bucket = r|r| directly
  3. For rnexact|r| \geq n_{\text{exact}}: assign bucket = nexact+log(r/nexact)/log(dmax/nexact)×(nhalfnexact)n_{\text{exact}} + \lfloor \log(|r| / n_{\text{exact}}) / \log(d_{\max} / n_{\text{exact}}) \times (n_{\text{half}} - n_{\text{exact}}) \rfloor

This is a logarithmic mapping: doubling the distance adds roughly one bucket. The model learns one scalar bias per bucket, shared across all distances that map to that bucket.


Interactive: T5 Bucket Explorer

Explore how T5's logarithmic bucketing maps distances to buckets. Adjust the number of buckets and maximum distance to see how the exact/logarithmic boundary shifts:

Loading bucket explorer...

Applications Across Domains

DomainHow Relative Position Bias HelpsExample
NLP / Language ModelingNearby words are more likely to be syntactically related. The bias encodes this prior, reducing the burden on content similarity alone.In "The cat sat on the mat," the determiner "the" most strongly modifies its adjacent noun.
Machine TranslationSource-target alignments tend to be locally monotonic. Relative bias encourages attending to nearby source positions, improving alignment quality.T5 achieved state-of-the-art on WMT translation benchmarks using relative position bias.
Document SummarizationKey sentences often cluster within paragraphs. Relative bias helps the model attend within paragraph boundaries without hard-coded structural constraints.T5 summarization on CNN/DailyMail benefits from position-aware attention.
Code GenerationVariable references typically occur within a few lines of their declaration. The bias naturally encodes this locality pattern.A function's local variables are usually defined within 5-10 lines of their use.
Protein Sequence ModelingAmino acids that are close in sequence are more likely to interact physically (local secondary structure). Relative bias captures this spatial prior.Alpha-helices and beta-sheets are formed by nearby amino acid interactions.
Vision TransformersNearby image patches are more likely to belong to the same object. Relative bias encodes spatial locality without explicit convolution.Swin Transformer uses relative position bias within local attention windows.

Connection to Modern Systems

SystemRelationship to Relative Position BiasKey Difference
RoPE (Chapter 8)Also encodes relative position, but multiplies Q and K by rotation matrices instead of adding a bias. The dot product Q·K naturally becomes position-dependent.Multiplicative (rotation) vs additive (bias). RoPE doesn't need an explicit B matrix.
ALiBi (Chapter 9)Direct descendant. Uses a fixed linear bias: B[i,j] = -m × |i-j| with head-specific slopes m. No learned parameters.Linear decay vs logarithmic. ALiBi is simpler (no log, no buckets) and enables infinite length extrapolation.
Flash Attention (Chapter 13)Flash Attention can incorporate relative position bias by adding B to the score tiles during the IO-aware tiling computation. The bias matrix is never fully materialized.Flash Attention is an implementation optimization, not a different mechanism. It computes the same result.
Swin TransformerUses learned relative position bias within fixed-size local windows. Each attention head has a (2W-1) × (2W-1) bias table where W is the window size.Applied within local windows, not globally. Combined with window shifting for cross-window information flow.
T5 / mT5 / Flan-T5The production implementation of relative position bias. Uses 32 logarithmic buckets with learned biases per head.Our simulation uses a fixed formula; T5 learns the bias values through training.
KV-Cache (Chapter 5-6)Relative position bias is compatible with KV-cache. The bias for new tokens can be computed incrementally without recomputing the full B matrix.Only the new row/column of B needs to be computed for each new token during autoregressive generation.

Complexity Analysis

AspectStandard AttentionRelative Position Bias
Time: Q·KᵀO(N² d_k)O(N² d_k) — unchanged
Time: Build BN/AO(N²) — one log per entry
Time: Add BN/AO(N²) — element-wise addition
Time: SoftmaxO(N²)O(N²) — unchanged
Time: TotalO(N² d_k)O(N² d_k) — B is dominated by Q·Kᵀ
Space: ExtraO(1)O(N²) for B matrix, or O(B) for bucket table
ParametersNone extraT5: 32 scalars per head. Fixed: 0 extra.
Length GeneralizationPoor (unseen positions)Good (only relative distances matter)

The key insight is that relative position bias adds O(N2)O(N^2) work to build and add B, but this is dominated by the O(N2dk)O(N^2 d_k) cost of the dot-product computation. In practice, the overhead is negligible — less than 5% of total attention computation time.


Python Implementation

The complete NumPy implementation with full execution trace. Click any line to see the exact values flowing through the computation:

Relative Position Bias Attention — NumPy Implementation
🐍relative_position_bias_attention.py
1import numpy as np

NumPy provides vectorized matrix operations. Q @ K.T runs as optimized C code, not Python loops. Same library used in every chapter.

2import math

Python standard library. We use math.sqrt() for the scaling factor and math.log() for the logarithmic bias decay.

4class RelativePositionBiasAttention

Wraps the complete relative position bias mechanism. The key difference from Chapter 1 (standard attention): this class builds a bias matrix B and adds it to scores before softmax. This is the ONLY change — everything else (Q·K^T, scaling, softmax, weighted sum) is identical.

14def __init__(self, d_k, bias_scale)

Constructor takes d_k (key dimension for scaling) and bias_scale (controls how strongly distance penalizes attention). Higher bias_scale = stronger preference for local context.

EXECUTION STATE
⬇ input: d_k = 4 (dimension of key vectors)
⬇ input: bias_scale = 0.3 (moderate distance penalty)
⬆ returns = None (sets self.d_k, self.bias_scale, self.scale)
20self.d_k = d_k

Store key dimension. Used later for scaling: scores / sqrt(d_k).

EXECUTION STATE
self.d_k = 4
21self.bias_scale = bias_scale

Store bias strength. This is the λ in B[i,j] = -λ · log(1 + |i-j|). Typical values: 0.1 (weak), 0.3 (moderate), 1.0 (strong).

EXECUTION STATE
self.bias_scale = 0.3
22self.scale = math.sqrt(d_k)

Precompute √d_k once. Same scaling factor as Chapter 1 — prevents dot products from growing too large and saturating softmax.

EXECUTION STATE
math.sqrt(4) = 2.0
self.scale = 2.0
24def _build_bias_matrix(self, N) → np.ndarray

Builds the N×N bias matrix B where B[i,j] = -bias_scale × log(1 + |i - j|). This is a Toeplitz matrix: every diagonal has the same value because the bias depends only on the distance |i-j|, not on absolute positions.

EXECUTION STATE
⬇ input: N = 5 (number of tokens)
⬆ returns = np.ndarray (5, 5) — symmetric bias matrix
26B = np.zeros((N, N))

Initialize a 5×5 matrix of zeros. Will be filled with negative bias values (penalties for distance).

EXECUTION STATE
B =
      The    cat    sat     on    mat
The  0.000  0.000  0.000  0.000  0.000
cat  0.000  0.000  0.000  0.000  0.000
sat  0.000  0.000  0.000  0.000  0.000
on   0.000  0.000  0.000  0.000  0.000
mat  0.000  0.000  0.000  0.000  0.000
27for i in range(N):

Outer loop: iterate over each query position i (rows of the bias matrix).

LOOP TRACE · 5 iterations
i=0 (The)
row = B[0,:] — bias from The's perspective
i=1 (cat)
row = B[1,:] — bias from cat's perspective
i=2 (sat)
row = B[2,:] — bias from sat's perspective
i=3 (on)
row = B[3,:] — bias from on's perspective
i=4 (mat)
row = B[4,:] — bias from mat's perspective
28for j in range(N):

Inner loop: for each key position j, compute the bias B[i,j] based on the distance |i - j|.

LOOP TRACE · 5 iterations
i=0, j=0: |0-0|=0
B[0,0] = -0.3 × log(1+0) = -0.3 × 0.0000 = 0.0000
i=0, j=1: |0-1|=1
B[0,1] = -0.3 × log(1+1) = -0.3 × 0.6931 = -0.2079
i=0, j=2: |0-2|=2
B[0,2] = -0.3 × log(1+2) = -0.3 × 1.0986 = -0.3296
i=0, j=3: |0-3|=3
B[0,3] = -0.3 × log(1+3) = -0.3 × 1.3863 = -0.4159
i=0, j=4: |0-4|=4
B[0,4] = -0.3 × log(1+4) = -0.3 × 1.6094 = -0.4828
29B[i, j] = -self.bias_scale * math.log(1 + abs(i - j))

The core formula. For each pair (i,j), compute the negative logarithmic penalty. abs(i-j) = distance between positions. log(1+d) grows slowly — distance 4 gets only 2.3× the penalty of distance 1. The minus sign makes it a penalty (subtracts from attention score).

EXECUTION STATE
abs(i - j) = Absolute distance between token positions
math.log(1 + d) = Natural logarithm. log(1)=0, log(2)=0.693, log(3)=1.099, log(4)=1.386, log(5)=1.609
⬆ Complete B matrix =
       The      cat      sat       on      mat
The   0.0000  -0.2079  -0.3296  -0.4159  -0.4828
cat  -0.2079   0.0000  -0.2079  -0.3296  -0.4159
sat  -0.3296  -0.2079   0.0000  -0.2079  -0.3296
on   -0.4159  -0.3296  -0.2079   0.0000  -0.2079
mat  -0.4828  -0.4159  -0.3296  -0.2079   0.0000
32def _softmax(self, x) → np.ndarray

Numerically stable softmax. Subtracts row-wise max before exp() to prevent overflow. Identical to Chapter 1 implementation.

EXECUTION STATE
⬇ input: x = shape (5, 5) — biased score matrix
⬆ returns = np.ndarray (5, 5) — each row sums to 1.0
34x_shifted = x - np.max(x, axis=-1, keepdims=True)

Subtract the maximum of each row. This makes the largest element 0, so exp(0)=1 is the largest exponential — no overflow possible.

EXECUTION STATE
axis=-1 = Operate along the LAST axis (columns within each row). For 5×5, find the max of each row independently.
keepdims=True = Keep the reduced axis as size-1 dimension. Returns shape (5,1) not (5,), so broadcasting x(5×5) - max(5×1) works correctly.
── Row 0 (The) ── =
x[0] = [0.0000, 0.7921, 0.1704, 0.0841, 0.2672]
max(x[0]) = 0.7921
x_shifted[0] = [-0.7921, 0.0000, -0.6216, -0.7079, -0.5249]
── Row 1 (cat) ── =
x[1] = [1.2921, 0.0000, 0.7921, 0.1704, -0.1659]
max(x[1]) = 1.2921
x_shifted[1] = [0.0000, -1.2921, -0.5000, -1.1216, -1.4579]
── Row 2 (sat) ── =
x[2] = [0.1704, 0.7921, 1.0000, 0.2921, 0.4204]
max(x[2]) = 1.0000
x_shifted[2] = [-0.8296, -0.2079, 0.0000, -0.7079, -0.5796]
── Row 3 (on) ── =
x[3] = [0.0841, 0.1704, -0.2079, 1.0000, 0.2921]
max(x[3]) = 1.0000
x_shifted[3] = [-0.9159, -0.8296, -1.2079, 0.0000, -0.7079]
── Row 4 (mat) ── =
x[4] = [0.0172, 0.0841, 0.1704, 0.2921, 0.7500]
max(x[4]) = 0.7500
x_shifted[4] = [-0.7328, -0.6659, -0.5796, -0.4579, 0.0000]
35exp_x = np.exp(x_shifted)

Element-wise exponentiation. After shifting, all values are ≤ 0, so all exp values are ≤ 1.0 (no overflow). Larger values → closer to 1.0 → higher attention weight.

EXECUTION STATE
── Row 0 (The) ── =
exp(x_shifted[0]) = [0.4529, 1.0000, 0.5371, 0.4927, 0.5916]
── Row 1 (cat) ── =
exp(x_shifted[1]) = [1.0000, 0.2747, 0.6065, 0.3257, 0.2327]
── Row 2 (sat) ── =
exp(x_shifted[2]) = [0.4362, 0.8123, 1.0000, 0.4927, 0.5601]
── Row 3 (on) ── =
exp(x_shifted[3]) = [0.4002, 0.4362, 0.2988, 1.0000, 0.4927]
── Row 4 (mat) ── =
exp(x_shifted[4]) = [0.4805, 0.5138, 0.5601, 0.6326, 1.0000]
36return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

Divide each row by its sum to get probabilities. Each row sums to exactly 1.0.

EXECUTION STATE
── Row 0 (The) ── =
sum = 3.0743
weights[0] = [0.1473, 0.3253, 0.1747, 0.1603, 0.1924]
── Row 1 (cat) ── =
sum = 2.4397
weights[1] = [0.4099, 0.1126, 0.2486, 0.1335, 0.0954]
── Row 2 (sat) ── =
sum = 3.3013
weights[2] = [0.1321, 0.2460, 0.3029, 0.1492, 0.1697]
── Row 3 (on) ── =
sum = 2.6279
weights[3] = [0.1523, 0.1660, 0.1137, 0.3805, 0.1875]
── Row 4 (mat) ── =
sum = 3.1871
weights[4] = [0.1508, 0.1612, 0.1758, 0.1985, 0.3138]
38def forward(self, Q, K, V)

Main computation pipeline. Takes Q, K, V matrices and returns attention weights and output. The 6-step pipeline: (1) raw scores Q·K^T, (2) scale by √d_k, (3) build bias B, (4) add bias, (5) softmax, (6) weighted sum with V.

EXECUTION STATE
⬇ input: Q (5×4) =
      d0   d1   d2   d3
The  1.0  0.0  1.0  0.0
cat  0.0  2.0  0.0  1.0
sat  1.0  1.0  1.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.0  1.0
⬇ input: K (5×4) =
      d0   d1   d2   d3
The  0.0  1.0  0.0  1.0
cat  1.0  0.0  1.0  0.0
sat  1.0  1.0  0.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.5  0.5
⬇ input: V (5×4) =
      d0   d1   d2   d3
The  1.0  0.0  0.0  0.0
cat  0.0  1.0  0.0  0.0
sat  0.0  0.0  1.0  0.0
on   0.0  0.0  0.0  1.0
mat  0.5  0.5  0.5  0.5
⬆ returns = (weights, output) — (5,5) attention weights + (5,4) output
49N = Q.shape[0]

Number of tokens (sequence length). Used to build the bias matrix.

EXECUTION STATE
N = 5
51raw_scores = Q @ K.T

Matrix multiply Q (5×4) with K transposed (4×5). Each element raw_scores[i,j] = dot product of Q[i] and K[j] — how much token i's query matches token j's key, without any position information.

EXECUTION STATE
@ = Python matrix multiplication operator
K.T = NumPy transpose — K(5×4) becomes K.T(4×5)
⬆ raw_scores (5×5) =
      The   cat   sat    on   mat
The   0.0   2.0   1.0   1.0   1.5
cat   3.0   0.0   2.0   1.0   0.5
sat   1.0   2.0   2.0   1.0   1.5
on    1.0   1.0   0.0   2.0   1.0
mat   1.0   1.0   1.0   1.0   1.5
52scaled_scores = raw_scores / self.scale

Divide all scores by √d_k = 2.0. This halves every value, preventing softmax from becoming too peaked (all weight on one token) when d_k is large.

EXECUTION STATE
self.scale = 2.0 (precomputed √4 in __init__)
⬆ scaled_scores (5×5) =
       The     cat     sat      on     mat
The  0.000   1.000   0.500   0.500   0.750
cat  1.500   0.000   1.000   0.500   0.250
sat  0.500   1.000   1.000   0.500   0.750
on   0.500   0.500   0.000   1.000   0.500
mat  0.500   0.500   0.500   0.500   0.750
53B = self._build_bias_matrix(N)

Build the 5×5 relative position bias matrix. This is the ONE new ingredient compared to standard attention. B is symmetric (distance is the same in both directions) and has zeros on the diagonal (distance 0).

EXECUTION STATE
⬆ B (5×5) =
       The      cat      sat       on      mat
The   0.0000  -0.2079  -0.3296  -0.4159  -0.4828
cat  -0.2079   0.0000  -0.2079  -0.3296  -0.4159
sat  -0.3296  -0.2079   0.0000  -0.2079  -0.3296
on   -0.4159  -0.3296  -0.2079   0.0000  -0.2079
mat  -0.4828  -0.4159  -0.3296  -0.2079   0.0000
54biased_scores = scaled_scores + B

Element-wise addition: each attention score gets its distance penalty. Nearby token pairs lose little (small negative bias), distant pairs lose more (larger negative bias). This is where position information enters the attention computation.

EXECUTION STATE
⬆ biased_scores (5×5) =
       The      cat      sat       on      mat
The  0.0000   0.7921   0.1704   0.0841   0.2672
cat  1.2921   0.0000   0.7921   0.1704  -0.1659
sat  0.1704   0.7921   1.0000   0.2921   0.4204
on   0.0841   0.1704  -0.2079   1.0000   0.2921
mat  0.0172   0.0841   0.1704   0.2921   0.7500
55weights = self._softmax(biased_scores)

Apply softmax row-wise to convert biased scores into attention probabilities. Each row sums to 1.0. The bias has shifted probability mass toward nearby tokens.

EXECUTION STATE
⬆ weights (5×5) =
       The      cat      sat       on      mat
The  0.1473   0.3253   0.1747   0.1603   0.1924
cat  0.4099   0.1126   0.2486   0.1335   0.0954
sat  0.1321   0.2460   0.3029   0.1492   0.1697
on   0.1523   0.1660   0.1137   0.3805   0.1875
mat  0.1508   0.1612   0.1758   0.1985   0.3138
56output = weights @ V

Weighted sum of value vectors. Each output row is a blend of all 5 value vectors, weighted by attention. Now the blending is position-aware: nearby values contribute more.

EXECUTION STATE
⬆ output (5×4) =
       d0       d1       d2       d3
The  0.2435   0.4215   0.2709   0.2565
cat  0.4576   0.1603   0.2963   0.1812
sat  0.2170   0.3309   0.3877   0.2341
on   0.2460   0.2597   0.2074   0.4743
mat  0.3077   0.3181   0.3326   0.3554
58return weights, output

Return both the attention weight matrix (for visualization and analysis) and the output context vectors.

EXECUTION STATE
⬆ return: weights = (5, 5) — position-biased attention weights
⬆ return: output = (5, 4) — position-aware context vectors
76tokens = [...]

The 5 tokens used in every chapter. 5 tokens gives clean 5×5 attention matrices.

EXECUTION STATE
tokens = ['The', 'cat', 'sat', 'on', 'mat']
78Q = np.array([...])

Query matrix (5×4). Each row is what that token is looking for. Shared across all 15 chapters.

EXECUTION STATE
Q =
      d0   d1   d2   d3
The  1.0  0.0  1.0  0.0
cat  0.0  2.0  0.0  1.0
sat  1.0  1.0  1.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.0  1.0
86K = np.array([...])

Key matrix (5×4). Each row is what that token advertises. The dot product Q[i]·K[j] measures how well token i's query matches token j's key.

EXECUTION STATE
K =
      d0   d1   d2   d3
The  0.0  1.0  0.0  1.0
cat  1.0  0.0  1.0  0.0
sat  1.0  1.0  0.0  0.0
on   0.0  0.0  1.0  1.0
mat  1.0  0.0  0.5  0.5
94V = np.array([...])

Value matrix (5×4). The content retrieved when a token is attended to. The output is a weighted combination of these value vectors.

EXECUTION STATE
V =
      d0   d1   d2   d3
The  1.0  0.0  0.0  0.0
cat  0.0  1.0  0.0  0.0
sat  0.0  0.0  1.0  0.0
on   0.0  0.0  0.0  1.0
mat  0.5  0.5  0.5  0.5
103rpb = RelativePositionBiasAttention(d_k=4, bias_scale=0.3)

Instantiate the mechanism with d_k=4 and moderate bias_scale=0.3. scale = √4 = 2.0.

EXECUTION STATE
d_k = 4
bias_scale = 0.3 (moderate distance penalty)
scale = 2.0
104weights, output = rpb.forward(Q, K, V)

Run the full relative position bias pipeline. Returns position-aware attention weights and output.

EXECUTION STATE
⬆ weights (5×5) =
       The      cat      sat       on      mat
The  0.1473   0.3253   0.1747   0.1603   0.1924
cat  0.4099   0.1126   0.2486   0.1335   0.0954
sat  0.1321   0.2460   0.3029   0.1492   0.1697
on   0.1523   0.1660   0.1137   0.3805   0.1875
mat  0.1508   0.1612   0.1758   0.1985   0.3138
⬆ output (5×4) =
       d0       d1       d2       d3
The  0.2435   0.4215   0.2709   0.2565
cat  0.4576   0.1603   0.2963   0.1812
sat  0.2170   0.3309   0.3877   0.2341
on   0.2460   0.2597   0.2074   0.4743
mat  0.3077   0.3181   0.3326   0.3554
99 lines without explanation
1import numpy as np
2import math
3
4class RelativePositionBiasAttention:
5    """
6    Relative Position Bias Attention (Shaw et al. 2018 / T5 2020)
7
8    Adds a learned or computed bias B[i,j] to the attention scores
9    before softmax. B depends only on the relative distance (i - j),
10    encouraging the model to prefer local context by default.
11
12    Formula: A = softmax(Q·K^T / sqrt(d_k) + B) · V
13    """
14
15    def __init__(self, d_k: int, bias_scale: float = 0.3):
16        """
17        Args:
18            d_k: Dimension of key vectors (for scaling)
19            bias_scale: Controls strength of distance penalty
20        """
21        self.d_k = d_k
22        self.bias_scale = bias_scale
23        self.scale = math.sqrt(d_k)
24
25    def _build_bias_matrix(self, N: int) -> np.ndarray:
26        """Build the N x N relative position bias matrix."""
27        B = np.zeros((N, N))
28        for i in range(N):
29            for j in range(N):
30                B[i, j] = -self.bias_scale * math.log(1 + abs(i - j))
31        return B
32
33    def _softmax(self, x: np.ndarray) -> np.ndarray:
34        """Numerically stable softmax along last axis."""
35        x_shifted = x - np.max(x, axis=-1, keepdims=True)
36        exp_x = np.exp(x_shifted)
37        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
38
39    def forward(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray):
40        """
41        Args:
42            Q: Query matrix  (N, d_k)
43            K: Key matrix    (N, d_k)
44            V: Value matrix  (N, d_v)
45        Returns:
46            weights: Attention weight matrix (N, N)
47            output:  Context vectors (N, d_v)
48        """
49        N = Q.shape[0]
50
51        raw_scores = Q @ K.T
52        scaled_scores = raw_scores / self.scale
53        B = self._build_bias_matrix(N)
54        biased_scores = scaled_scores + B
55        weights = self._softmax(biased_scores)
56        output = weights @ V
57
58        return weights, output
59
60    def explain(self, Q, K, V, tokens, query_idx=0):
61        """Print step-by-step trace for one token."""
62        weights, output = self.forward(Q, K, V)
63        token = tokens[query_idx]
64        B = self._build_bias_matrix(len(tokens))
65        raw = Q @ K.T
66        scaled = raw / self.scale
67
68        print(f"\n=== Relative Position Bias trace for '{token}' ===")
69        print(f"bias_scale = {self.bias_scale}, d_k = {self.d_k}")
70        print(f"\nRaw scores:    {np.round(raw[query_idx], 4)}")
71        print(f"Scaled scores: {np.round(scaled[query_idx], 4)}")
72        print(f"Bias values:   {np.round(B[query_idx], 4)}")
73        print(f"Biased scores: {np.round((scaled + B)[query_idx], 4)}")
74        print(f"\nAttention weights:")
75        for j, t in enumerate(tokens):
76            w = weights[query_idx, j]
77            bar = "#" * int(w * 50)
78            dist = abs(query_idx - j)
79            print(f"  A[{token},{t}] = {w:.4f} (d={dist}) |{bar}|")
80
81        print(f"\nOutput[{token}] = {np.round(output[query_idx], 4)}")
82
83
84# ── Shared Example (used in every chapter) ──
85tokens = ["The", "cat", "sat", "on", "mat"]
86
87Q = np.array([
88    [1.0, 0.0, 1.0, 0.0],   # The
89    [0.0, 2.0, 0.0, 1.0],   # cat
90    [1.0, 1.0, 1.0, 0.0],   # sat
91    [0.0, 0.0, 1.0, 1.0],   # on
92    [1.0, 0.0, 0.0, 1.0],   # mat
93])
94
95K = np.array([
96    [0.0, 1.0, 0.0, 1.0],   # The
97    [1.0, 0.0, 1.0, 0.0],   # cat
98    [1.0, 1.0, 0.0, 0.0],   # sat
99    [0.0, 0.0, 1.0, 1.0],   # on
100    [1.0, 0.0, 0.5, 0.5],   # mat
101])
102
103V = np.array([
104    [1.0, 0.0, 0.0, 0.0],   # The
105    [0.0, 1.0, 0.0, 0.0],   # cat
106    [0.0, 0.0, 1.0, 0.0],   # sat
107    [0.0, 0.0, 0.0, 1.0],   # on
108    [0.5, 0.5, 0.5, 0.5],   # mat
109])
110
111# ── Run Relative Position Bias Attention ──
112rpb = RelativePositionBiasAttention(d_k=4, bias_scale=0.3)
113weights, output = rpb.forward(Q, K, V)
114
115print("=== Relative Position Bias Attention ===")
116print("Weights:")
117print(np.round(weights, 4))
118print("\nOutput:")
119print(np.round(output, 4))
120
121# ── Step-by-step trace for "The" ──
122rpb.explain(Q, K, V, tokens, query_idx=0)
123
124# ── Compare with standard (no bias) ──
125rpb_none = RelativePositionBiasAttention(d_k=4, bias_scale=0.0)
126std_weights, std_output = rpb_none.forward(Q, K, V)
127print("\n=== Standard vs Biased (The) ===")
128print("Standard:", np.round(std_weights[0], 4))
129print("Biased:  ", np.round(weights[0], 4))
130print("Diff:    ", np.round(weights[0] - std_weights[0], 4))

PyTorch Implementation

The PyTorch version adds GPU support, automatic differentiation for training, and uses vectorized operations instead of Python loops. It also includes an nn.Parameter\texttt{nn.Parameter} bias table for the T5-style learned approach:

Relative Position Bias Attention — PyTorch Implementation
🐍relative_position_bias_attention_torch.py
1import torch

PyTorch tensor library. Provides GPU acceleration, automatic differentiation, and CUDA support for production deployment.

2import torch.nn as nn

Neural network module. nn.Module is the base class for all PyTorch models. We use nn.Parameter to store learnable bias values.

3import torch.nn.functional as F

Functional API. F.softmax() is optimized for GPU and handles numerical stability internally.

4import math

Standard library math module. math.sqrt(d_k) is computed once at init time.

6class RelativePositionBiasAttention(nn.Module)

PyTorch module version of relative position bias attention. Inherits from nn.Module for compatibility with PyTorch training loops, .to(device), state_dict(), etc. Uses nn.Parameter for learnable bias values.

16def __init__(self, d_k, max_seq_len, num_buckets, bias_scale)

Constructor takes d_k (key dimension), max_seq_len (for allocating bias table), num_buckets (T5-style), and bias_scale (distance penalty strength).

EXECUTION STATE
⬇ input: d_k = 4
⬇ input: max_seq_len = 512 (default max positions)
⬇ input: num_buckets = 32 (T5 default)
⬇ input: bias_scale = 0.3
26self.relative_bias_table = nn.Parameter(torch.zeros(num_buckets))

Learnable bias table. In the T5 approach, each bucket has a learned scalar bias. nn.Parameter tells PyTorch to include this in gradient computation during training. Initialized to zeros (no bias initially — the model learns the right values).

EXECUTION STATE
nn.Parameter = Registers tensor as a model parameter — included in optimizer.step() updates
shape = (32,) — one learnable scalar per bucket
30def _compute_bias(self, N) → torch.Tensor

Build the N×N bias matrix. Uses vectorized PyTorch operations instead of Python loops — much faster on GPU.

EXECUTION STATE
⬇ input: N = 5 (number of tokens)
⬆ returns = torch.Tensor (5, 5) — symmetric bias matrix
32positions = torch.arange(N)

Create position indices [0, 1, 2, 3, 4] as a 1D tensor.

EXECUTION STATE
positions = [0, 1, 2, 3, 4]
34relative_pos = positions.unsqueeze(0) - positions.unsqueeze(1)

Vectorized distance computation. unsqueeze(0) makes (1,5) row vector, unsqueeze(1) makes (5,1) column vector. Broadcasting produces (5,5) matrix of all pairwise differences.

EXECUTION STATE
.unsqueeze(0) = Add dimension at axis 0: [0,1,2,3,4] → shape (1, 5) — row vector for broadcasting
.unsqueeze(1) = Add dimension at axis 1: [0,1,2,3,4] → shape (5, 1) — column vector for broadcasting
relative_pos (5×5) =
      The   cat   sat    on   mat
The    0    -1    -2    -3    -4
cat    1     0    -1    -2    -3
sat    2     1     0    -1    -2
on     3     2     1     0    -1
mat    4     3     2     1     0
35distances = relative_pos.abs().float()

Take absolute values of all distances. The bias is symmetric — distance 3 forward equals distance 3 backward. Convert to float for log computation.

EXECUTION STATE
.abs() = Element-wise absolute value — removes direction, keeps magnitude
distances (5×5) =
      The   cat   sat    on   mat
The    0     1     2     3     4
cat    1     0     1     2     3
sat    2     1     0     1     2
on     3     2     1     0     1
mat    4     3     2     1     0
37B = -self.bias_scale * torch.log(1 + distances)

Compute the full bias matrix in one vectorized operation. Equivalent to the Python loop version but runs on GPU and is differentiable for training.

EXECUTION STATE
torch.log(1 + distances) = Element-wise natural log. torch.log is differentiable — gradients flow through during training.
⬆ B (5×5) =
       The      cat      sat       on      mat
The   0.0000  -0.2079  -0.3296  -0.4159  -0.4828
cat  -0.2079   0.0000  -0.2079  -0.3296  -0.4159
sat  -0.3296  -0.2079   0.0000  -0.2079  -0.3296
on   -0.4159  -0.3296  -0.2079   0.0000  -0.2079
mat  -0.4828  -0.4159  -0.3296  -0.2079   0.0000
51scores = torch.matmul(Q, K.T) / self.scale

Standard scaled dot-product scores. torch.matmul replaces NumPy's @ operator. K.T transposes the key matrix. Dividing by √d_k = 2.0.

EXECUTION STATE
torch.matmul vs @ = Equivalent for 2D. torch.matmul supports batched inputs (..., N, d_k).
⬆ scores (5×5) =
       The     cat     sat      on     mat
The  0.000   1.000   0.500   0.500   0.750
cat  1.500   0.000   1.000   0.500   0.250
sat  0.500   1.000   1.000   0.500   0.750
on   0.500   0.500   0.000   1.000   0.500
mat  0.500   0.500   0.500   0.500   0.750
52B = self._compute_bias(N).to(Q.device)

Compute bias matrix and move to the same device as Q (CPU or GPU). .to(Q.device) ensures tensors are on the same device for addition.

EXECUTION STATE
.to(Q.device) = Move tensor to same device as Q. Required for GPU training — can't add CPU tensor to GPU tensor.
53scores = scores + B

Add position bias to scaled scores. This is the ONE change that makes this relative position bias attention. Element-wise addition — each score[i,j] gets its distance penalty B[i,j].

EXECUTION STATE
⬆ scores + B (5×5) =
       The      cat      sat       on      mat
The  0.0000   0.7921   0.1704   0.0841   0.2672
cat  1.2921   0.0000   0.7921   0.1704  -0.1659
sat  0.1704   0.7921   1.0000   0.2921   0.4204
on   0.0841   0.1704  -0.2079   1.0000   0.2921
mat  0.0172   0.0841   0.1704   0.2921   0.7500
58weights = F.softmax(scores, dim=-1)

PyTorch's built-in softmax. dim=-1 applies along the last axis (each row independently). Numerically stable internally.

EXECUTION STATE
dim=-1 = Softmax along last dimension — each query token's attention weights sum to 1.0
⬆ weights (5×5) =
       The      cat      sat       on      mat
The  0.1473   0.3253   0.1747   0.1603   0.1924
cat  0.4099   0.1126   0.2486   0.1335   0.0954
sat  0.1321   0.2460   0.3029   0.1492   0.1697
on   0.1523   0.1660   0.1137   0.3805   0.1875
mat  0.1508   0.1612   0.1758   0.1985   0.3138
59output = torch.matmul(weights, V)

Weighted sum of value vectors. Each output row = sum of V[j] × weight[i,j] for all j. Now position-aware.

EXECUTION STATE
⬆ output (5×4) =
       d0       d1       d2       d3
The  0.2435   0.4215   0.2709   0.2565
cat  0.4576   0.1603   0.2963   0.1812
sat  0.2170   0.3309   0.3877   0.2341
on   0.2460   0.2597   0.2074   0.4743
mat  0.3077   0.3181   0.3326   0.3554
61return weights, output

Return both attention weights and output. Weights can be used for visualization, analysis, or auxiliary losses.

EXECUTION STATE
⬆ return: weights = (5, 5) — position-biased probabilities
⬆ return: output = (5, 4) — position-aware context vectors
84 lines without explanation
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4import math
5
6class RelativePositionBiasAttention(nn.Module):
7    """
8    Relative Position Bias Attention - PyTorch
9
10    Adds a learned bias table to attention scores before softmax.
11    In this version, biases are stored as learnable parameters
12    indexed by relative position, matching the T5 approach.
13    """
14
15    def __init__(self, d_k: int, max_seq_len: int = 512,
16                 num_buckets: int = 32, bias_scale: float = 0.3):
17        super().__init__()
18        self.d_k = d_k
19        self.scale = math.sqrt(d_k)
20        self.bias_scale = bias_scale
21        self.num_buckets = num_buckets
22        self.max_seq_len = max_seq_len
23
24        # Learnable bias table: one scalar per bucket
25        self.relative_bias_table = nn.Parameter(
26            torch.zeros(num_buckets)
27        )
28
29    def _compute_bias(self, N: int) -> torch.Tensor:
30        """Build N x N bias matrix using log-decay formula."""
31        positions = torch.arange(N)
32        # Compute relative distances: (N, N) matrix
33        relative_pos = positions.unsqueeze(0) - positions.unsqueeze(1)
34        distances = relative_pos.abs().float()
35        # Log-decay bias (matches our NumPy simulation)
36        B = -self.bias_scale * torch.log(1 + distances)
37        return B
38
39    def forward(
40        self,
41        Q: torch.Tensor,
42        K: torch.Tensor,
43        V: torch.Tensor,
44        mask: torch.Tensor | None = None,
45    ) -> tuple[torch.Tensor, torch.Tensor]:
46        """
47        Args:
48            Q: (N, d_k), K: (N, d_k), V: (N, d_v)
49            mask: Optional (N, N) boolean, True = masked
50        Returns:
51            weights: (N, N) attention weights
52            output:  (N, d_v) context vectors
53        """
54        N = Q.shape[0]
55
56        scores = torch.matmul(Q, K.T) / self.scale
57        B = self._compute_bias(N).to(Q.device)
58        scores = scores + B
59
60        if mask is not None:
61            scores = scores.masked_fill(mask, float("-inf"))
62
63        weights = F.softmax(scores, dim=-1)
64        output = torch.matmul(weights, V)
65
66        return weights, output
67
68
69# ── Shared Example ──
70tokens = ["The", "cat", "sat", "on", "mat"]
71
72Q = torch.tensor([
73    [1.0, 0.0, 1.0, 0.0],
74    [0.0, 2.0, 0.0, 1.0],
75    [1.0, 1.0, 1.0, 0.0],
76    [0.0, 0.0, 1.0, 1.0],
77    [1.0, 0.0, 0.0, 1.0],
78])
79K = torch.tensor([
80    [0.0, 1.0, 0.0, 1.0],
81    [1.0, 0.0, 1.0, 0.0],
82    [1.0, 1.0, 0.0, 0.0],
83    [0.0, 0.0, 1.0, 1.0],
84    [1.0, 0.0, 0.5, 0.5],
85])
86V = torch.tensor([
87    [1.0, 0.0, 0.0, 0.0],
88    [0.0, 1.0, 0.0, 0.0],
89    [0.0, 0.0, 1.0, 0.0],
90    [0.0, 0.0, 0.0, 1.0],
91    [0.5, 0.5, 0.5, 0.5],
92])
93
94# ── Run with relative position bias ──
95rpb = RelativePositionBiasAttention(d_k=4, bias_scale=0.3)
96with torch.no_grad():
97    weights, output = rpb(Q, K, V)
98
99print("Relative Position Bias Attention Output:")
100print(output.numpy().round(4))
101print("\nAttention Weights:")
102print(weights.numpy().round(4))

Key Takeaways

  1. One-line change, big impact. Relative position bias adds a single matrix BB to the attention scores before softmax. Everything else — Q·K\u1D40 computation, scaling, softmax, weighted sum — remains identical to standard attention.
  2. Distance as a prior. The bias matrix encodes the assumption that nearby tokens are more likely to be relevant. This doesn't prevent long-range attention; it just requires distant tokens to have higher content similarity to overcome the distance penalty.
  3. Translation invariance. Because B[i,j]B[i,j] depends only on (ij)(i - j), the same phrase at any position in the sequence produces the same bias pattern. This enables better length generalization than absolute positional embeddings.
  4. Logarithmic decay. The function λlog(1+ij)-\lambda \cdot \log(1 + |i-j|) grows slowly — quadrupling the distance only doubles the log. This means the model distinguishes fine-grained nearby positions while treating distant positions more coarsely.
  5. T5's bucketing is efficient. Instead of learning one parameter per distance, T5 uses 32 logarithmic buckets. This captures fine-grained local position with few parameters and generalizes to sequences longer than those seen during training.
  6. Foundation for successors. This chapter lays the groundwork for RoPE (Chapter 8, which uses rotation instead of addition) and ALiBi (Chapter 9, which uses linear decay instead of logarithmic).

Exercises

  1. Bias Scale Sensitivity: Recompute the attention weights for "The" using λ=0.0\lambda = 0.0, λ=0.3\lambda = 0.3, and λ=1.0\lambda = 1.0. How does increasing λ\lambda affect the balance between local and global attention? At what value of λ\lambda does self-attention dominate?
  2. Linear vs Logarithmic Decay: Replace the bias formula with B[i,j]=λijB[i,j] = -\lambda \cdot |i-j| (linear decay). Compute the attention weights and compare. Which decay function penalizes long-range attention more harshly? When might linear be preferred?
  3. Asymmetric Bias: In causal (autoregressive) models, we might want asymmetric bias where looking backward (j < i) has a different penalty than looking forward. Design a bias formula that penalizes backward attention less than forward attention. Why might this be useful for language generation?
  4. T5 Bucket Mapping: For a model with 32 buckets and max_distance=128, compute the bucket assignment for distances 1, 5, 10, 50, and 100. Verify your answers with the interactive explorer above.
  5. Memory Analysis: A model has 32 attention heads, sequence length 4096, and uses T5-style bucketing with 32 buckets. How many parameters does the bias table require? Compare this to storing one parameter per relative distance.

References

  1. Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-Attention with Relative Position Representations. NAACL 2018.
  2. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 1-67.
  3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, \u0141., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017.
  4. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021.
  5. Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. ICLR 2022.
  6. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
Loading comments...