The Library Analogy
Walk into a library with a question you cannot quite phrase — you know it has something to do with nineteenth-century French opera but you do not remember more. The librarian listens, walks the shelves, and pulls a stack of books that overlap with your fuzzy query. From those books she produces a single answer that is a weighted blend of the most relevant pages.
That is precisely what self-attention does to a sequence. Each timestep produces a query; the system compares it against every timestep's key; the matches determine how much each timestep's value should contribute to the answer. For RUL prediction this lets cycle 27 attend to cycle 5 if cycle 5 carried the early-warning signal — something neither convolution (only sees K cycles) nor a vanilla LSTM (forgets gradually) can reliably do.
Queries, Keys, Values
Given an input sequence (T cycles, d-dim features), three projection matrices produce three new sequences:
are learnable. After the projections, and .
| Symbol | Role | Analogy |
|---|---|---|
| What this position is looking for | The fuzzy query | |
| What this position offers | The book index entry | |
| What this position contributes | The book content |
Scaled Dot-Product Attention
Compute pairwise similarity between every query and every key, then normalise:
Three things in one expression. (1) gives every query-key dot product. (2) Divide by to keep variance roughly 1 regardless of ; without this, large pushes softmax into a near-one-hot regime where gradients vanish. (3) Softmax row-wise to produce a valid probability distribution, then multiply by to compute the weighted sum.
Interactive: Pick a Query, See Where It Looks
Below is a attention map computed on random Q, K, V matrices (matching the NumPy run further down so you can verify the numbers). Click any row to pick a query; the bar chart on the right shows where its probability mass goes. Toggle between raw scores and softmax-normalised attention.
Try sliding the scale factor down to 0.5: the softmax row collapses to one-hot — the model fixates on a single key, which kills gradient flow to the others. Push it up to 8: the row becomes near-uniform, and self-attention degenerates into uniform averaging. The choice is the sweet spot.
Interactive: The Macro Flow
Zooming out, the four-stage flow:
Attention Flow
How information flows through the attention mechanism
Q compares with K
Similarity scores
Softmax normalizes
Weights sum to 1
Multiply by V
Weighted values
Output
Enriched representation
Multi-Head Attention
One attention head learns one kind of relationship. Multi-head attention runs attention computations in parallel with separate for each head , then concatenates the outputs and projects:
with . Each head sees a lower-dim slice () and can specialise: one head might learn “late cycles attend to early cycles” while another learns “adjacent-cycle smoothing”. The paper and our backbone use 8 heads.
Python: Attention From Scratch
Twenty lines of NumPy implement scaled-dot-product attention end to end. The output values match what the interactive heatmap displays when you select query 0 with scale = sqrt(4) = 2.
PyTorch: scaled_dot_product_attention
PyTorch 2.0+ ships a single fused kernel for attention, optionally backed by Flash-Attention on CUDA. For multi-head usage the nn.MultiheadAttention module wraps the projections and head-splitting boilerplate.
F.scaled_dot_product_attention dispatches to the Flash-Attention kernel automatically when shapes are favourable, producing 3-10x speedups vs. the naive softmax(Q K^T) @ V. No code changes — PyTorch picks the fastest backend.Attention Beyond RUL
| Domain | What attends to what | Famous architecture |
|---|---|---|
| RUL prediction (this book) | Cycles within a 30-cycle window | CNN-BiLSTM-Attention |
| Language modelling | Tokens within a context window | GPT, BERT, T5 |
| Image recognition | Patches within an image | ViT, Swin Transformer |
| Speech recognition | Audio frames | Whisper, Conformer |
| Protein folding | Residues within a chain | AlphaFold 2 |
| Music generation | Notes within a phrase | Music Transformer |
| Recommendation | Items in a user history | SASRec |
| Medical imaging | Voxels in a 3-D scan | TransUNet |
The same five-line attention computation underpins all of them. The mathematics is exact; what changes is the modality and the size of the (T, d) tensor.
The Three Pitfalls
np.exp(1000) = inf. PyTorch's functional softmax handles it; if you implement softmax yourself, always shift first.The point. Self-attention lets every cycle see every other cycle in O(T^2) time, learn which to weight, and produce a context-rich output. Combined with Conv1D (local patterns) and BiLSTM (sequential dynamics), it gives our backbone its long-range modelling power.
Takeaway
- Self-attention is differentiable lookup. Each query selects a weighted blend of values via similarity to keys.
- The whole formula is one line. — three matrix multiplies plus a softmax.
- Scale by sqrt(d_k). Keeps softmax in its well-behaved regime regardless of the per-head dimension.
- Multi-head runs H attentions in parallel. Each head sees a lower-dim slice and learns its own type of relationship.
- PyTorch ships a fused kernel.
F.scaled_dot_product_attentiondispatches to Flash-Attention automatically on CUDA — never write the softmax(Q K^T)/sqrt(d) yourself in production code.