Chapter 7
Section 32 of 104

Self-Attention for Time Series

Multi-Head Self-Attention

Learning Objectives

By the end of this section, you will:

  1. Understand why attention follows BiLSTM in the AMNL architecture
  2. Grasp the intuition behind attention as dynamic weighting
  3. Define self-attention where queries, keys, and values come from the same sequence
  4. Explain why attention benefits RUL prediction
  5. Appreciate the query-key-value paradigm
Why This Matters: The BiLSTM encodes each timestep with full sequence context, but treats all timesteps equally. Attention allows the model to focus on the most informative timesteps—late-stage degradation signals may be more predictive than early stable readings—dramatically improving RUL estimation.

From BiLSTM to Attention

The BiLSTM produces a sequence of 256-dimensional hidden states, one per timestep. How do we convert this sequence into a single prediction?

BiLSTM Output Recap

📝text
BiLSTM Output: (batch, 30, 256)

Each timestep has a contextualized representation:
  h₁ = context from entire sequence, focused at t=1
  h₂ = context from entire sequence, focused at t=2
  ...
  h₃₀ = context from entire sequence, focused at t=30

Naive Approaches

| Approach | Formula | Limitation |
| --- | --- | --- |
| Last timestep | y = f(h₃₀) | Ignores 29 other timesteps |
| Mean pooling | y = f(mean(h₁...h₃₀)) | Treats all timesteps equally |
| Max pooling | y = f(max(h₁...h₃₀)) | Loses temporal nuance |

None of these approaches learn which timesteps matter most. We need a mechanism that can dynamically weight timesteps based on their relevance.
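The three naive aggregations can be sketched in a few lines of NumPy. Random values stand in for real BiLSTM hidden states; the 30×256 shape follows the recap above.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(30, 256))   # stand-in for BiLSTM hidden states h_1..h_30

last_ts = H[-1]                  # last timestep only: ignores the other 29
mean_pool = H.mean(axis=0)       # uniform weight 1/30 on every timestep
max_pool = H.max(axis=0)         # elementwise max: discards temporal order

print(last_ts.shape, mean_pool.shape, max_pool.shape)   # each (256,)
```

All three collapse the sequence to a single 256-dimensional vector, but none of the weights depend on the data, which is exactly the limitation attention removes.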

The Attention Solution

\text{context} = \sum_{t=1}^{T} \alpha_t \cdot h_t

Where \alpha_t are learned attention weights satisfying \sum_t \alpha_t = 1. The model learns to assign higher weights to more informative timesteps.
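A minimal NumPy sketch of this weighted sum. The scores here are random stand-ins; in the model they would come from a learned scoring function, and the softmax guarantees the weights sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(30, 256))                  # hidden states h_1..h_30

scores = rng.normal(size=30)                    # stand-in relevance scores
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax: alpha sums to 1

context = (alpha[:, None] * H).sum(axis=0)      # context = sum_t alpha_t * h_t
print(alpha.sum(), context.shape)
```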


Attention Intuition

Attention can be understood through several complementary intuitions.

Intuition 1: Soft Database Lookup

Think of attention as a soft database query:

  • Query: "What information do I need?"
  • Keys: "What information does each timestep contain?"
  • Values: The actual information at each timestep

Instead of returning one exact match, attention returns a weighted combination of all values, with weights based on query-key similarity.
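This soft-lookup behaviour can be seen in a toy example: one query scored against three keys, each paired with a scalar value (all numbers here are made up for illustration).

```python
import numpy as np

q = np.array([1.0, 0.0])                  # query: "what am I looking for?"
K = np.array([[1.0, 0.0],                 # key closely matching the query
              [0.0, 1.0],                 # orthogonal key (poor match)
              [0.5, 0.5]])                # partial match
V = np.array([10.0, 20.0, 30.0])          # content stored at each key

sim = K @ q                               # query-key similarity scores
w = np.exp(sim) / np.exp(sim).sum()       # softmax turns scores into weights
out = w @ V                               # weighted blend of all values
```

Unlike a hard database lookup, which would return only the value 10.0 for the best-matching key, `out` blends all three values, with the closest match contributing most.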

Intuition 2: Learned Pooling

Mean pooling uses uniform weights: αt=1/T\alpha_t = 1/T. Attention learns data-dependent weights:

📝text
Mean pooling:    [0.033, 0.033, 0.033, ..., 0.033]  (uniform)

Learned attention: [0.02, 0.01, 0.05, ..., 0.15, 0.20]  (adaptive)
                                           ↑     ↑
                               Late timesteps get more weight
                               (degradation signals)

Intuition 3: Dynamic Feature Selection

Attention selects which timesteps contribute to the final representation:

  • Early stable readings: Low attention (not informative for RUL)
  • Transition to degradation: Medium attention (important marker)
  • Late degradation signals: High attention (most predictive)

Self-Attention Concept

In self-attention, the same sequence serves as queries, keys, and values. Each timestep attends to all other timesteps (including itself).

Self vs Cross Attention

| Type | Query Source | Key/Value Source | Use Case |
| --- | --- | --- | --- |
| Self-attention | Sequence X | Same sequence X | Our model |
| Cross-attention | Sequence X | Different sequence Y | Translation |

Query, Key, Value Projections

The same input is linearly projected three times:

Q = XW^Q, \quad K = XW^K, \quad V = XW^V

Where:

  • X \in \mathbb{R}^{T \times D}: Input sequence (BiLSTM output)
  • W^Q, W^K \in \mathbb{R}^{D \times D_k}: Query and key projections
  • W^V \in \mathbb{R}^{D \times D_v}: Value projection
  • Q, K \in \mathbb{R}^{T \times D_k}: Queries and keys
  • V \in \mathbb{R}^{T \times D_v}: Values
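The three projections are just three independent matrix multiplies of the same input. A NumPy sketch with illustrative sizes (D_k = D_v = 64 is an assumed design choice, not fixed by the formulas above):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, Dk, Dv = 30, 256, 64, 64       # T, D match the text; Dk, Dv assumed
X = rng.normal(size=(T, D))          # stand-in for the BiLSTM output

WQ = rng.normal(size=(D, Dk))        # query projection
WK = rng.normal(size=(D, Dk))        # key projection
WV = rng.normal(size=(D, Dv))        # value projection

Q, K, V = X @ WQ, X @ WK, X @ WV     # three learned views of one sequence
print(Q.shape, K.shape, V.shape)     # (30, 64) (30, 64) (30, 64)
```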

Why Three Different Projections?

The separate projections allow the model to learn:

  • Query space: What each position is looking for
  • Key space: What each position advertises
  • Value space: What each position contributes

Why Attention for RUL?

Attention is particularly valuable for RUL prediction due to the nature of degradation data.

Non-Uniform Information Distribution

Not all timesteps are equally informative:

📝text
Typical degradation trajectory:

Timestep:    1    5    10   15   20   25   30
             |    |    |    |    |    |    |
State:    [Healthy]----[Transition]---[Degrading]
           ~~~~~~~~           ~~~~    !!!!!!!!!!!

Informativeness for RUL:
  Low        Low      Medium     High    Very High

Attention Learns This Pattern

Through training, attention weights naturally reflect informativeness:

| Phase | Timesteps | Typical Attention |
| --- | --- | --- |
| Healthy (stable) | 1-10 | Low (~0.01 each) |
| Transition | 11-20 | Medium (~0.03 each) |
| Degrading | 21-30 | High (~0.06 each) |
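One illustrative weight vector with this low/medium/high phase pattern, with values chosen (as an assumption, not learned output) so that the 30 weights form a valid attention distribution:

```python
import numpy as np

# Assumed per-phase weights: 10 timesteps per phase, summing to 1 overall.
alpha = np.concatenate([
    np.full(10, 0.01),   # healthy phase: timesteps 1-10
    np.full(10, 0.03),   # transition: timesteps 11-20
    np.full(10, 0.06),   # degrading: timesteps 21-30
])
late_mass = alpha[20:].sum()   # share of attention on the last 10 timesteps
print(alpha.sum(), late_mass)
```

Even though the degrading phase covers only a third of the window, it receives the majority of the attention mass, which is the pattern training tends to produce for RUL.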

Benefits for RUL

  • Noise suppression: Low attention on irrelevant healthy readings
  • Critical signal amplification: High attention on degradation indicators
  • Adaptive focus: Different attention patterns for different failure modes
  • Interpretability: Attention weights show what the model considers important
Key Insight: Attention allows the model to say "the reading at cycle 27 is critical for my prediction, while cycle 3 is essentially noise." This selective focus is what makes attention so powerful for time series with non-uniform informativeness.

Summary

In this section, we introduced self-attention for time series:

  1. Problem: BiLSTM outputs need to be aggregated into a single prediction
  2. Solution: Attention learns dynamic weights for each timestep
  3. Self-attention: Query, key, value all come from the same sequence
  4. Intuitions: Soft lookup, learned pooling, feature selection
  5. RUL relevance: Degradation signals are non-uniformly distributed in time

| Concept | Description |
| --- | --- |
| Query (Q) | What information is being sought |
| Key (K) | What information each position contains |
| Value (V) | The actual content to aggregate |
| Attention weights | Learned importance of each timestep |
| Context vector | Weighted sum of values |
Looking Ahead: We have the intuition—now for the math. The next section presents the scaled dot-product attention formula, the core computation that turns queries, keys, and values into attention weights and context vectors.

With the intuition established, we now derive the scaled dot-product attention formula.