Chapter 10

Scaled Dot-Product Attention

Multi-Head Self-Attention

Why Attention After BiLSTM

The BiLSTM (Chapter 9) gave every cycle a 512-dimensional context vector that summarises “everything before AND after, processed sequentially”. That sequential processing is its strength, but it also means the BiLSTM has to compress all 30 cycles of relevant information into a single hidden state at every step. Attention does something different: it lets each cycle look at every other cycle DIRECTLY, without going through a recurrent bottleneck.

For RUL on a 30-cycle window the practical benefit is small but consistent: a single self-attention layer on top of the BiLSTM sharpens the model's ability to relate distant cycles (for example, the failure spike at cycle 27 to the early warning at cycle 4) at a cost of ~100k extra parameters.

The role. BiLSTM gives sequential context; attention gives direct cross-cycle look-ups. Both contribute to the final feature representation.
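As a rough sanity check on the parameter budget, the sketch below counts the weights of one self-attention head with the sizes used later in this section (d_model = 512, d_k = d_v = 64). It assumes bias-free projections and no output projection; those are assumptions of this sketch, not details confirmed by the text, and the helper name `single_head_param_count` is hypothetical.

```python
def single_head_param_count(d_model: int, d_k: int, d_v: int) -> int:
    """Weights in the Q, K and V projections of one head (no biases assumed)."""
    return d_model * d_k + d_model * d_k + d_model * d_v


n_params = single_head_param_count(d_model=512, d_k=64, d_v=64)
print(n_params)  # 98304
```

That is on the order of 100k. A different head count, added biases, or an output projection would change the total.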

The Formula (Recap)

Section 3.4 walked through the math. The single line:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right) \mathbf{V}.$$

For SELF-attention all three are derived from the same input $\mathbf{X}$ via three learnable projections: $\mathbf{Q} = \mathbf{X} W^Q,\, \mathbf{K} = \mathbf{X} W^K,\, \mathbf{V} = \mathbf{X} W^V$.

| Tensor | Shape | Role |
|---|---|---|
| X (BiLSTM out) | (B, T, 512) | Input: one vector per cycle |
| Q | (B, T, 64) | What each cycle is looking for |
| K | (B, T, 64) | What each cycle offers |
| V | (B, T, 64) | What each cycle contributes if matched |
| scores = QK^T/√d_k | (B, T, T) | Pairwise relevance |
| attn = softmax(scores) | (B, T, T) | Probability distribution per query |
| out = attn V | (B, T, 64) | Re-weighted blend |
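The shapes in the table can be checked with a batched NumPy sketch. The batch size B = 2 and the random weights are placeholders chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, d_model, d_k = 2, 30, 512, 64  # B = 2 is an arbitrary illustrative batch size

X = rng.standard_normal((B, T, d_model))          # stand-in for the BiLSTM output
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
              for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # each (B, T, d_k)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (B, T, T)
scores -= scores.max(axis=-1, keepdims=True)      # stabilise the softmax
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)          # (B, T, T), rows sum to 1
out = attn @ V                                    # (B, T, d_k)

shapes = {"X": X.shape, "Q": Q.shape, "K": K.shape, "V": V.shape,
          "scores": scores.shape, "attn": attn.shape, "out": out.shape}
for name, shape in shapes.items():
    print(f"{name:7s}{shape}")
```

Each row of `attn` is a probability distribution over the T cycles, which is why the (B, T, T) map can be read one query at a time.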

Interactive: Pick a Query (Recap)

The same heatmap as in §3.4, shown here so the formula has a concrete picture to land on.

[Interactive figure: attention heatmap]

Interactive: The Macro Flow

Zoom out: the four-stage pipeline.

Attention Flow: how information flows through the attention mechanism

1. Compare. Q compares with K (QKᵀ), producing similarity scores.
2. Weight. Softmax normalizes the scores (softmax(·)) so that each row of weights sums to 1.
3. Combine. The weights A multiply V (A × V), giving weighted values.
4. Output. The result is the context: an enriched, context-aware representation.
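The four stages can be traced on a toy input small enough to read by eye. The 3-cycle sequence and 2-dimensional vectors below are made-up numbers, and Q, K, V are set by hand here instead of being projected from X.

```python
import numpy as np

# Toy tensors (made-up values): 3 cycles, d_k = d_v = 2
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

# Stage 1 - Compare: scaled similarity of every query with every key
scores = Q @ K.T / np.sqrt(Q.shape[-1])

# Stage 2 - Weight: softmax over each row
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Stage 3 - Combine: blend the value vectors with those weights
context = weights @ V

# Stage 4 - Output: one context vector per cycle
print(np.round(weights, 3))
print(np.round(context, 3))
```

The third query matches the third key most strongly, so its row of `weights` peaks at the last column.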

Python: Attention Over BiLSTM Output

Twenty lines of NumPy applied to a 30-step, 512-D BiLSTM-shaped input. One head; $d_k = 64$ is the value we will use per head when we go multi-head in §10.2.

Self-attention on a (T, 512) sequence (attention_over_bilstm.py)

```python
import numpy as np


def softmax_rowwise(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax along the last axis."""
    # Subtract the row max so np.exp never overflows to inf.
    shifted = x - x.max(axis=-1, keepdims=True)
    expd = np.exp(shifted)
    # Normalise each row to sum to 1.
    return expd / expd.sum(axis=-1, keepdims=True)


def self_attention(X: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """ONE attention head applied to a single sequence X of shape (T, d_model).

    Self-attention: Q, K and V are all derived from the same input X
    (here the BiLSTM output at every cycle, d_model = 512).
    Wq, Wk, Wv: (d_model, d_k) projection matrices.
    Returns the (T, d_v) attention output.
    """
    Q = X @ Wq                       # (T, d_k)  project to query space
    K = X @ Wk                       # (T, d_k)  project to key space
    V = X @ Wv                       # (T, d_v)  project to value space

    d_k = Q.shape[-1]                # per-head dim, used in the scaling factor
    scores = Q @ K.T / np.sqrt(d_k)  # (T, T)  all pairwise dot products, scaled
    attn = softmax_rowwise(scores)   # (T, T)  row-stochastic attention map
    return attn @ V                  # (T, d_v)  weighted sum of value vectors


# ----- Run on a 30-step BiLSTM-shaped sequence -----
np.random.seed(0)                    # determinism
# 30 cycles, BiLSTM output dim 512, per-head dim 64 (= 512 / 8 heads, foreshadowing §10.2)
T, d_model, d_k = 30, 512, 64

# Pretend X is the BiLSTM encoder output for ONE sample
X = np.random.randn(T, d_model).astype(np.float32) * 0.1
# Xavier-style (fan-in scaled) init for the three projections
Wq = np.random.randn(d_model, d_k).astype(np.float32) * (1 / np.sqrt(d_model))
Wk = np.random.randn(d_model, d_k).astype(np.float32) * (1 / np.sqrt(d_model))
Wv = np.random.randn(d_model, d_k).astype(np.float32) * (1 / np.sqrt(d_model))

out = self_attention(X, Wq, Wk, Wv)

# Shapes are recomputed here instead of being printed as hard-coded strings.
print("X.shape   :", X.shape)                        # (30, 512)
print("Q.shape   :", (X @ Wq).shape)                 # (30, 64)
print("scores    :", ((X @ Wq) @ (X @ Wk).T).shape)  # (30, 30)
print("out.shape :", out.shape)                      # (30, 64)
```

The output dimension equals d_v = d_k = 64 because this is a single head.
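The division by the square root of d_k in the scores line keeps the softmax inputs in a range where it does not saturate. The sketch below illustrates this empirically, assuming unit-variance Gaussian queries and keys (an idealisation, not a measured property of real BiLSTM outputs).

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 64, 10_000

q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

raw = (q * k).sum(axis=-1)    # n unscaled dot products
scaled = raw / np.sqrt(d_k)   # the same scores after scaling

raw_std, scaled_std = raw.std(), scaled.std()
print(f"std of raw scores    : {raw_std:.2f}")     # close to sqrt(d_k) = 8
print(f"std of scaled scores : {scaled_std:.2f}")  # close to 1
```

Without the scaling, the spread of the scores grows with d_k, and the softmax output drifts toward a one-hot vector with very small gradients.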

PyTorch: F.scaled_dot_product_attention on (B, T, 512)

One fused call; optionally Flash-Attention on CUDA (attention_torch.py)

```python
import torch
import torch.nn.functional as F  # F.scaled_dot_product_attention lives here

torch.manual_seed(0)                  # determinism
B, T, d_model, d_k = 2, 30, 512, 64   # backbone shapes for one head

# ----- Pretend BiLSTM output; real code: X = bilstm(cnn(input)) -----
X = torch.randn(B, T, d_model)

# ----- Three projection matrices (learnable in real code) -----
# nn.Parameter makes PyTorch track gradients; Xavier-style (fan-in scaled) init.
Wq = torch.nn.Parameter(torch.randn(d_model, d_k) / d_model**0.5)
Wk = torch.nn.Parameter(torch.randn(d_model, d_k) / d_model**0.5)
Wv = torch.nn.Parameter(torch.randn(d_model, d_k) / d_model**0.5)

# Broadcast matmul: (B, T, d_model) @ (d_model, d_k) -> (B, T, d_k)
Q = X @ Wq
K = X @ Wk
V = X @ Wv

# ----- One fused call, optionally Flash-Attention on CUDA -----
out = F.scaled_dot_product_attention(Q, K, V)

print("X.shape   :", tuple(X.shape))    # (2, 30, 512)
print("Q.shape   :", tuple(Q.shape))    # (2, 30, 64)
print("out.shape :", tuple(out.shape))  # (2, 30, 64)
```

The time axis is preserved and the channel dimension is d_v = 64 for this single head.

F.scaled_dot_product_attention is a single fused call. On CUDA hardware that supports it (Ampere or newer), and when the inputs qualify, PyTorch can route to Flash-Attention; otherwise it falls back to a plain math implementation. It is rarely slower than writing softmax(Q @ K^T) yourself.

Flash-Attention replaces the (T, T) intermediate matrix with a tiled, on-the-fly computation. It is 3-10x faster on long sequences and gives the same result up to floating-point round-off.
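Performance note. On modern NVIDIA GPUs (A100, H100, RTX 30/40 series) PyTorch's F.scaled_dot_product_attention can route to Flash-Attention automatically when the inputs qualify. The 3-10× speed-up over writing softmax(QK^T) yourself applies to long sequences; on a 30-cycle window the gain is likely much smaller. The output matches the manual version up to floating-point round-off. Prefer the built-in call over rolling your own.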
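One way to build trust in the fused call is to compare it against the explicit formula. The sketch below assumes PyTorch 2.0 or newer (where `F.scaled_dot_product_attention` is available) and runs on CPU; the tolerance of 1e-5 is an arbitrary choice for float32.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, d_k = 2, 30, 64
Q, K, V = (torch.randn(B, T, d_k) for _ in range(3))

# Explicit formula: softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.transpose(-2, -1) / d_k**0.5
manual = torch.softmax(scores, dim=-1) @ V

# Fused call; its default scale is 1 / sqrt(last dim of Q)
fused = F.scaled_dot_product_attention(Q, K, V)

max_diff = (manual - fused).abs().max().item()
matches = torch.allclose(manual, fused, atol=1e-5)
print(f"max abs difference: {max_diff:.2e}, allclose: {matches}")
```

Any remaining difference comes from the order of floating-point operations, not from a different formula.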

Attention Beyond Transformers

| Domain | Where attention sits | Famous example |
|---|---|---|
| RUL (this book) | Single layer post-BiLSTM | CNN-BiLSTM-Attention paper |
| seq2seq translation (pre-Transformer) | Encoder-decoder bridge | Bahdanau attention 2015 |
| Transformer models | Stacked, replaces RNN entirely | BERT, GPT, T5 |
| Vision | Patch-to-patch attention | ViT, Swin |
| Protein folding | Residue-to-residue attention | AlphaFold 2 |
| Recommendation | Item-to-item self-attention | SASRec |
| Music generation | Note-level attention | Music Transformer |

Two Pitfalls Specific to Backbone Use

Pitfall 1: Forgetting d_k = 512 / num_heads. With multi-head (next section) the per-head dimension is d_model / H. Setting d_k arbitrarily breaks the concat-and-project reshape later.
Pitfall 2: Replacing the BiLSTM with attention. On 30-cycle windows the BiLSTM still helps - it captures sequential dynamics that pure attention misses. Replace with attention only if you have positional encoding AND much longer sequences.
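Pitfall 1 can be made concrete with a reshape. The sketch below assumes H = 8 heads (the 512 / 64 split mentioned earlier) and uses a hypothetical helper `split_heads`; it illustrates the shape constraint only and is not the multi-head implementation of §10.2.

```python
import numpy as np


def split_heads(x: np.ndarray, num_heads: int) -> np.ndarray:
    """(B, T, d_model) -> (B, num_heads, T, d_model // num_heads)."""
    B, T, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError(
            f"d_model={d_model} is not divisible by num_heads={num_heads}"
        )
    d_k = d_model // num_heads
    return x.reshape(B, T, num_heads, d_k).transpose(0, 2, 1, 3)


x = np.zeros((2, 30, 512))
heads = split_heads(x, num_heads=8)
print(heads.shape)   # (2, 8, 30, 64)

# Concatenating the heads restores the original layout.
merged = heads.transpose(0, 2, 1, 3).reshape(2, 30, 512)
print(merged.shape)  # (2, 30, 512)

try:
    split_heads(x, num_heads=7)  # 512 / 7 is not an integer
    rejected = False
except ValueError as err:
    rejected = True
    print("rejected:", err)
```

Checking divisibility up front turns a confusing reshape error later in the pipeline into an explicit message at the point where the head count is chosen.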
The point. Self-attention adds direct cross-cycle look-ups on top of the BiLSTM's sequential context. One layer; one fused PyTorch call; ~100k extra parameters in the multi-head version.

Takeaway

  • Attention sits AFTER the BiLSTM. Input is (B, 30, 512); output is (B, 30, 64) per head.
  • The math is one line. $\text{softmax}(QK^{\top}/\sqrt{d_k})\,V$.
  • Use F.scaled_dot_product_attention. Fused; Flash-Attention-aware; matches the manual version up to floating-point round-off.
  • BiLSTM and attention complement each other. Sequential dynamics + direct cross-cycle relevance.