Why Attention After BiLSTM
The BiLSTM (Chapter 9) gave every cycle a 512-dimensional context vector that summarises “everything before AND after, processed sequentially”. That sequential processing is its strength, but it also means the BiLSTM has to compress whatever is relevant across all 30 cycles into one fixed-size hidden state at every step. Attention does something different: it lets each cycle look at every other cycle DIRECTLY, without going through a recurrent bottleneck.
For RUL on a 30-cycle window the practical benefit is small but consistent: a single self-attention layer on top of the BiLSTM sharpens the model's ability to relate distant cycles (for example, the failure spike at cycle 27 and the early-warning signal at cycle 4) at a cost of ~100k extra parameters.
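The ~100k figure can be sanity-checked with a few lines of arithmetic. This is a sketch under stated assumptions, not the book's exact layer: it assumes a single head that projects the 512-dimensional BiLSTM output down to 64 dimensions for each of Q, K and V (the shapes listed in the formula recap table), with no output projection. The variable names are illustrative.

```python
# Parameter count for ONE self-attention head on top of the BiLSTM.
# Assumed shapes: d_model = 512 (BiLSTM output), d_k = 64 (head width).
d_model = 512   # BiLSTM output width per cycle
d_k = 64        # projected width of Q, K and V (assumed per-head size)

weights_per_projection = d_model * d_k      # one (512, 64) matrix
weights_qkv = 3 * weights_per_projection    # W_Q, W_K, W_V together
biases_qkv = 3 * d_k                        # optional bias vectors

print(weights_qkv)               # 98304
print(weights_qkv + biases_qkv)  # 98496
```

Under these assumptions the three projections account for roughly 98k parameters, which is in line with the ~100k quoted above. A different head count or an added output projection would change the total.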
The Formula (Recap)
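Section 3.4 walked through the math. The single line:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
For SELF-attention all three are derived from the same input via three learnable projections: $Q = XW_Q$, $K = XW_K$, $V = XW_V$.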
| Tensor | Shape | Role |
|---|---|---|
| X (BiLSTM out) | (B, T, 512) | Input - one vector per cycle |
| Q | (B, T, 64) | What each cycle is looking for |
| K | (B, T, 64) | What each cycle offers |
| V | (B, T, 64) | What each cycle contributes if matched |
| scores = QK^T/√d_k | (B, T, T) | Pairwise relevance |
| attn = softmax(scores) | (B, T, T) | Probability distribution per query |
| out = attn V | (B, T, 64) | Re-weighted blend |
Interactive: Pick a Query (Recap)
Same heatmap from §3.4 - shown here so the formula has a concrete picture to land on.
Interactive: The Macro Flow
Zoom out: the four-stage pipeline.
Attention flow: how information moves through the attention mechanism.
- Q compares with K: similarity scores.
- Softmax normalizes: weights sum to 1.
- Multiply by V: weighted values.
- Output: enriched representation.
Python: Attention Over BiLSTM Output
Twenty lines of NumPy applied to a 30-step, 512-D BiLSTM-shaped input. One head; $d_k = 64$ is the value we will use per head when we go multi-head in §10.2.
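A minimal NumPy sketch of that single head follows. It is not the book's original listing: the input is random noise standing in for the BiLSTM output, the projection matrices `W_q`, `W_k`, `W_v` are randomly initialised rather than trained, and the batch size of 2 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

B, T, d_model, d_k = 2, 30, 512, 64   # batch, cycles, BiLSTM width, head width

# Stand-in for the BiLSTM output: one 512-D vector per cycle.
X = rng.standard_normal((B, T, d_model))

# Projection matrices (random here; learned in a real model).
scale = 1.0 / np.sqrt(d_model)
W_q = rng.standard_normal((d_model, d_k)) * scale
W_k = rng.standard_normal((d_model, d_k)) * scale
W_v = rng.standard_normal((d_model, d_k)) * scale

Q = X @ W_q                      # (B, T, 64) what each cycle is looking for
K = X @ W_k                      # (B, T, 64) what each cycle offers
V = X @ W_v                      # (B, T, 64) what each cycle contributes

# Pairwise relevance of every cycle to every other cycle.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (B, T, T)

# Row-wise softmax; subtracting the row max keeps exp() numerically stable.
scores = scores - scores.max(axis=-1, keepdims=True)
attn = np.exp(scores)
attn = attn / attn.sum(axis=-1, keepdims=True)     # (B, T, T), rows sum to 1

out = attn @ V                   # (B, T, 64) re-weighted blend

print(Q.shape, scores.shape, out.shape)
# (2, 30, 64) (2, 30, 30) (2, 30, 64)
```

Each row of `attn` is a probability distribution over the 30 cycles, so `out[b, t]` is a weighted average of the value vectors from every cycle in the window.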
PyTorch: F.scaled_dot_product_attention on (B, T, 512)
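The same single head can be sketched with PyTorch's fused call, alongside a manual reference for comparison. This assumes PyTorch 2.0 or later (where `F.scaled_dot_product_attention` is available); the projection layers `proj_q`, `proj_k`, `proj_v` are hypothetical names with bias disabled, and the input is random noise standing in for the BiLSTM output.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

B, T, d_model, d_k = 2, 30, 512, 64

x = torch.randn(B, T, d_model)          # stand-in for the BiLSTM output

# Hypothetical projection layers; names and bias=False are illustrative.
proj_q = nn.Linear(d_model, d_k, bias=False)
proj_k = nn.Linear(d_model, d_k, bias=False)
proj_v = nn.Linear(d_model, d_k, bias=False)

with torch.no_grad():
    q, k, v = proj_q(x), proj_k(x), proj_v(x)            # each (B, T, 64)

    # Fused call: softmax(q k^T / sqrt(d_k)) v; the default scale is 1/sqrt(d_k).
    out_fused = F.scaled_dot_product_attention(q, k, v)  # (B, T, 64)

    # Manual reference built from the same q, k, v.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (B, T, T)
    out_manual = torch.softmax(scores, dim=-1) @ v       # (B, T, 64)

print(out_fused.shape)
print(torch.allclose(out_fused, out_manual, atol=1e-5))
```

The comparison uses `torch.allclose` rather than exact equality because a fused kernel may order its floating-point operations differently from the step-by-step version.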
Attention Beyond Transformers
| Domain | Where attention sits | Famous example |
|---|---|---|
| RUL (this book) | Single layer post-BiLSTM | CNN-BiLSTM-Attention paper |
| seq2seq translation (pre-Transformer) | Encoder-decoder bridge | Bahdanau attention 2015 |
| Transformers | Stacked, replaces RNN entirely | BERT, GPT, T5 |
| Vision | Patch-to-patch attention | ViT, Swin |
| Protein folding | Residue-to-residue attention | AlphaFold 2 |
| Recommendation | Item-to-item self-attention | SASRec |
| Music generation | Note-level attention | Music Transformer |
Two Pitfalls Specific to Backbone Use
The point. Self-attention adds direct cross-cycle look-ups on top of the BiLSTM's sequential context. One layer; one fused PyTorch call; ~100k extra parameters in the multi-head version.
Takeaway
- Attention sits AFTER the BiLSTM. Input is (B, 30, 512); output is (B, 30, 64) per head.
- The math is one line. $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T / \sqrt{d_k})\,V$.
- Use F.scaled_dot_product_attention. Fused; Flash-Attention-aware; matches the manual version up to floating-point tolerance.
- BiLSTM and attention complement each other. Sequential dynamics + direct cross-cycle relevance.