Learning Objectives
By the end of this section, you will:
- Understand why attention follows BiLSTM in the AMNL architecture
- Grasp the intuition behind attention as dynamic weighting
- Define self-attention where queries, keys, and values come from the same sequence
- Explain why attention benefits RUL prediction
- Appreciate the query-key-value paradigm
Why This Matters: The BiLSTM encodes each timestep with full sequence context, but treats all timesteps equally. Attention allows the model to focus on the most informative timesteps—late-stage degradation signals may be more predictive than early stable readings—dramatically improving RUL estimation.
From BiLSTM to Attention
The BiLSTM produces a sequence of 256-dimensional hidden states, one per timestep. How do we convert this sequence into a single prediction?
BiLSTM Output Recap
BiLSTM Output: (batch, 30, 256)

Each timestep has a contextualized representation:
  h₁ = context from entire sequence, focused at t=1
  h₂ = context from entire sequence, focused at t=2
  ...
  h₃₀ = context from entire sequence, focused at t=30

Naive Approaches
| Approach | Formula | Limitation |
|---|---|---|
| Last timestep | y = f(h₃₀) | Ignores 29 other timesteps |
| Mean pooling | y = f(mean(h₁...h₃₀)) | Treats all timesteps equally |
| Max pooling | y = f(max(h₁...h₃₀)) | Loses temporal nuance |
None of these approaches learn which timesteps matter most. We need a mechanism that can dynamically weight timesteps based on their relevance.
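The three naive aggregations can be sketched in a few lines of NumPy. The tensor shapes follow the text, (batch, 30, 256); the random tensor is a stand-in for real BiLSTM output.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 30, 256))  # stand-in for BiLSTM output: (batch, 30, 256)

last = H[:, -1, :]     # last timestep only: ignores the other 29
mean = H.mean(axis=1)  # mean pooling: implicit uniform weight of 1/30 per timestep
mx = H.max(axis=1)     # max pooling: elementwise max, discards temporal order

print(last.shape, mean.shape, mx.shape)  # each (4, 256)
```

Each variant collapses the time axis with fixed, data-independent rules, which is exactly the limitation the table above points out.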
The Attention Solution
Attention computes a context vector as a weighted sum of the hidden states:

c = Σₜ αₜ hₜ

Where αₜ are learned attention weights satisfying Σₜ αₜ = 1 and αₜ ≥ 0. The model learns to assign higher weights to more informative timesteps.
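A minimal sketch of the weighted sum, using a random score vector in place of the learned relevance scores. A softmax turns the scores into valid weights (non-negative, summing to 1):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((30, 256))  # hidden states h₁..h₃₀ for one window

scores = rng.standard_normal(30)    # stand-in for learned relevance scores
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                # softmax: αₜ ≥ 0 and Σₜ αₜ = 1

c = alpha @ H                       # context vector: weighted sum of timesteps
print(c.shape)                      # (256,)
```

In the real model the scores come from learned parameters rather than a random generator, but the aggregation step is the same weighted sum.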
Attention Intuition
Attention can be understood through several complementary intuitions.
Intuition 1: Soft Database Lookup
Think of attention as a soft database query:
- Query: "What information do I need?"
- Keys: "What information does each timestep contain?"
- Values: The actual information at each timestep
Instead of returning one exact match, attention returns a weighted combination of all values, with weights based on query-key similarity.
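The soft-lookup intuition can be made concrete with a toy example. The one-hot keys and the scale factor of 5 are illustrative choices, not part of the model; they just make the query match key 27 strongly so the effect is visible:

```python
import numpy as np

def soft_lookup(query, keys, values):
    """Soft database lookup: blend all values, weighted by query-key similarity."""
    scores = keys @ query               # one similarity score per key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax: weights sum to 1
    return weights @ values, weights    # weighted combination of all values

rng = np.random.default_rng(2)
keys = np.eye(30)                       # toy keys: each position advertises a distinct pattern
values = rng.standard_normal((30, 16))  # the content stored at each position
query = 5.0 * keys[27]                  # a query that strongly matches key 27

out, weights = soft_lookup(query, keys, values)
# weights peak at position 27, yet every position still contributes a little
```

Unlike a hard database lookup, no value is ever fully excluded; the weights just concentrate on the best matches.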
Intuition 2: Learned Pooling
Mean pooling uses uniform weights: αₜ = 1/30 ≈ 0.033 for every timestep. Attention learns data-dependent weights:
Mean pooling:      [0.033, 0.033, 0.033, ..., 0.033]   (uniform)

Learned attention: [0.02, 0.01, 0.05, ..., 0.15, 0.20]   (adaptive)
                                             ↑     ↑
                                Late timesteps get more weight
                                    (degradation signals)

Intuition 3: Dynamic Feature Selection
Attention selects which timesteps contribute to the final representation:
- Early stable readings: Low attention (not informative for RUL)
- Transition to degradation: Medium attention (important marker)
- Late degradation signals: High attention (most predictive)
Self-Attention Concept
In self-attention, the same sequence serves as queries, keys, and values. Each timestep attends to all other timesteps (including itself).
Self vs Cross Attention
| Type | Query Source | Key/Value Source | Use Case |
|---|---|---|---|
| Self-attention | Sequence X | Same sequence X | Our model |
| Cross-attention | Sequence X | Different sequence Y | Translation |
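In self-attention, each of the 30 timesteps attends to all 30 (including itself), giving a 30×30 weight matrix. A minimal, unscaled sketch using raw dot products on a toy sequence (dimension 16 chosen only to keep the example small):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 16))            # toy stand-in for the BiLSTM output

scores = X @ X.T                             # every timestep scored against every other: (30, 30)
scores -= scores.max(axis=1, keepdims=True)  # numerical stability before exp
A = np.exp(scores)
A /= A.sum(axis=1, keepdims=True)            # row t: how timestep t distributes its attention

out = A @ X                                  # each row becomes a weighted mix of all timesteps
print(A.shape, out.shape)                    # (30, 30) (30, 16)
```

Every row of A sums to 1, so each output position is a convex combination of the whole sequence; the next section adds the query/key/value projections and scaling on top of this skeleton.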
Query, Key, Value Projections
The same input is linearly projected three times:

Q = XW_Q,  K = XW_K,  V = XW_V

Where:
- X ∈ ℝ^(30×256): Input sequence (BiLSTM output)
- W_Q, W_K ∈ ℝ^(256×d_k): Query and key projections
- W_V ∈ ℝ^(256×d_v): Value projection
- Q, K ∈ ℝ^(30×d_k): Queries and keys
- V ∈ ℝ^(30×d_v): Values
Why Three Different Projections?
The separate projections allow the model to learn:
- Query space: What each position is looking for
- Key space: What each position advertises
- Value space: What each position contributes
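The three projections are just three matrix multiplies against the same input. A sketch with assumed projection widths d_k = d_v = 64 (the text does not fix these sizes); in a real model W_Q, W_K, W_V would be trained parameters rather than random matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 256))           # BiLSTM output for one 30-step window
d_k, d_v = 64, 64                            # assumed projection widths

W_Q = 0.01 * rng.standard_normal((256, d_k))  # query projection: what each position seeks
W_K = 0.01 * rng.standard_normal((256, d_k))  # key projection: what each position advertises
W_V = 0.01 * rng.standard_normal((256, d_v))  # value projection: what each position contributes

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # three learned views of the same sequence
print(Q.shape, K.shape, V.shape)             # (30, 64) (30, 64) (30, 64)
```

Because all three come from the same X, the model can learn different notions of "seeking" and "advertising" while sharing the underlying representation.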
Why Attention for RUL?
Attention is particularly valuable for RUL prediction due to the nature of degradation data.
Non-Uniform Information Distribution
Not all timesteps are equally informative:
Typical degradation trajectory:

Timestep:  1    5    10   15   20   25   30
           |    |    |    |    |    |    |
State:    [Healthy]----[Transition]---[Degrading]
           ~~~~~~~~        ~~~~       !!!!!!!!!!!

Informativeness for RUL:
           Low  Low   Medium    High  Very High

Attention Learns This Pattern
Through training, attention weights naturally reflect informativeness:
| Phase | Timesteps | Typical Attention |
|---|---|---|
| Healthy (stable) | 1-10 | Low (~0.02 each) |
| Transition | 11-20 | Medium (~0.03 each) |
| Degrading | 21-30 | High (~0.05 each) |
(Across all 30 timesteps the weights sum to 1.)
Benefits for RUL
- Noise suppression: Low attention on irrelevant healthy readings
- Critical signal amplification: High attention on degradation indicators
- Adaptive focus: Different attention patterns for different failure modes
- Interpretability: Attention weights show what the model considers important
Key Insight: Attention allows the model to say "the reading at cycle 27 is critical for my prediction, while cycle 3 is essentially noise." This selective focus is what makes attention so powerful for time series with non-uniform informativeness.
Summary
In this section, we introduced self-attention for time series:
- Problem: BiLSTM outputs need to be aggregated into a single prediction
- Solution: Attention learns dynamic weights for each timestep
- Self-attention: Query, key, value all come from the same sequence
- Intuitions: Soft lookup, learned pooling, feature selection
- RUL relevance: Degradation signals are non-uniformly distributed in time
| Concept | Description |
|---|---|
| Query (Q) | What information is being sought |
| Key (K) | What information each position contains |
| Value (V) | The actual content to aggregate |
| Attention weights | Learned importance of each timestep |
| Context vector | Weighted sum of values |
Looking Ahead: With the intuition established, the next section derives the scaled dot-product attention formula, the core computation that turns queries, keys, and values into attention weights and context vectors.