Chapter 7
Section 32 of 104

Self-Attention for Time Series

Multi-Head Self-Attention

Learning Objectives

By the end of this section, you will:

  1. Understand why attention follows BiLSTM in the AMNL architecture
  2. Grasp the intuition behind attention as dynamic weighting
  3. Define self-attention where queries, keys, and values come from the same sequence
  4. Explain why attention benefits RUL prediction
  5. Appreciate the query-key-value paradigm
Why This Matters: The BiLSTM encodes each timestep with full sequence context, but treats all timesteps equally. Attention allows the model to focus on the most informative timesteps—late-stage degradation signals may be more predictive than early stable readings—dramatically improving RUL estimation.

From BiLSTM to Attention

The BiLSTM produces a sequence of 256-dimensional hidden states, one per timestep. How do we convert this sequence into a single prediction?

BiLSTM Output Recap

📝text
BiLSTM Output: (batch, 30, 256)

Each timestep has a contextualized representation:
  h₁ = context from entire sequence, focused at t=1
  h₂ = context from entire sequence, focused at t=2
  ...
  h₃₀ = context from entire sequence, focused at t=30

Naive Approaches

| Approach | Formula | Limitation |
| --- | --- | --- |
| Last timestep | y = f(h₃₀) | Ignores 29 other timesteps |
| Mean pooling | y = f(mean(h₁...h₃₀)) | Treats all timesteps equally |
| Max pooling | y = f(max(h₁...h₃₀)) | Loses temporal nuance |

None of these approaches learn which timesteps matter most. We need a mechanism that can dynamically weight timesteps based on their relevance.
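The three naive aggregations can be sketched in a few lines of NumPy. Random values stand in for real BiLSTM hidden states; the 30×256 shape follows the recap above.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(30, 256))   # stand-in for BiLSTM hidden states h_1..h_30

last_ts = H[-1]                  # last timestep only: ignores the other 29
mean_pool = H.mean(axis=0)       # uniform weight 1/30 on every timestep
max_pool = H.max(axis=0)         # elementwise max: discards temporal order

print(last_ts.shape, mean_pool.shape, max_pool.shape)   # each (256,)
```

All three collapse the sequence to a single 256-dimensional vector, but none of the weights depend on the data, which is exactly the limitation attention removes.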

The Attention Solution

\text{context} = \sum_{t=1}^{T} \alpha_t \cdot h_t

Where \alpha_t are learned attention weights satisfying \sum_t \alpha_t = 1. The model learns to assign higher weights to more informative timesteps.
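A minimal NumPy sketch of this weighted sum. The scores here are random stand-ins; in the model they would come from a learned scoring function, and the softmax guarantees the weights sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(30, 256))                  # hidden states h_1..h_30

scores = rng.normal(size=30)                    # stand-in relevance scores
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax: alpha sums to 1

context = (alpha[:, None] * H).sum(axis=0)      # context = sum_t alpha_t * h_t
print(alpha.sum(), context.shape)
```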


Attention Intuition

Attention can be understood through several complementary intuitions.

Intuition 1: Soft Database Lookup

Think of attention as a soft database query:

  • Query: "What information do I need?"
  • Keys: "What information does each timestep contain?"
  • Values: The actual information at each timestep

Instead of returning one exact match, attention returns a weighted combination of all values, with weights based on query-key similarity.
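This soft-lookup behaviour can be seen in a toy example: one query scored against three keys, each paired with a scalar value (all numbers here are made up for illustration).

```python
import numpy as np

q = np.array([1.0, 0.0])                  # query: "what am I looking for?"
K = np.array([[1.0, 0.0],                 # key closely matching the query
              [0.0, 1.0],                 # orthogonal key (poor match)
              [0.5, 0.5]])                # partial match
V = np.array([10.0, 20.0, 30.0])          # content stored at each key

sim = K @ q                               # query-key similarity scores
w = np.exp(sim) / np.exp(sim).sum()       # softmax turns scores into weights
out = w @ V                               # weighted blend of all values
```

Unlike a hard database lookup, which would return only the value 10.0 for the best-matching key, `out` blends all three values, with the closest match contributing most.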

Intuition 2: Learned Pooling

Mean pooling uses uniform weights: αt=1/T\alpha_t = 1/T. Attention learns data-dependent weights:

📝text
Mean pooling:    [0.033, 0.033, 0.033, ..., 0.033]  (uniform)

Learned attention: [0.02, 0.01, 0.05, ..., 0.15, 0.20]  (adaptive)
                                           ↑     ↑
                               Late timesteps get more weight
                               (degradation signals)

Intuition 3: Dynamic Feature Selection

Attention selects which timesteps contribute to the final representation:

  • Early stable readings: Low attention (not informative for RUL)
  • Transition to degradation: Medium attention (important marker)
  • Late degradation signals: High attention (most predictive)

Self-Attention Concept

In self-attention, the same sequence serves as queries, keys, and values. Each timestep attends to all other timesteps (including itself).

Self vs Cross Attention

| Type | Query Source | Key/Value Source | Use Case |
| --- | --- | --- | --- |
| Self-attention | Sequence X | Same sequence X | Our model |
| Cross-attention | Sequence X | Different sequence Y | Translation |

Query, Key, Value Projections

The same input is linearly projected three times:

Q = XW^Q, \quad K = XW^K, \quad V = XW^V

Where:

  • X \in \mathbb{R}^{T \times D}: Input sequence (BiLSTM output)
  • W^Q, W^K \in \mathbb{R}^{D \times D_k}: Query and key projections
  • W^V \in \mathbb{R}^{D \times D_v}: Value projection
  • Q, K \in \mathbb{R}^{T \times D_k}: Queries and keys
  • V \in \mathbb{R}^{T \times D_v}: Values
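The three projections are just three independent matrix multiplies of the same input. A NumPy sketch with illustrative sizes (D_k = D_v = 64 is an assumed design choice, not fixed by the formulas above):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, Dk, Dv = 30, 256, 64, 64       # T, D match the text; Dk, Dv assumed
X = rng.normal(size=(T, D))          # stand-in for the BiLSTM output

WQ = rng.normal(size=(D, Dk))        # query projection
WK = rng.normal(size=(D, Dk))        # key projection
WV = rng.normal(size=(D, Dv))        # value projection

Q, K, V = X @ WQ, X @ WK, X @ WV     # three learned views of one sequence
print(Q.shape, K.shape, V.shape)     # (30, 64) (30, 64) (30, 64)
```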

Why Three Different Projections?

The separate projections allow the model to learn:

  • Query space: What each position is looking for
  • Key space: What each position advertises
  • Value space: What each position contributes

Why Attention for RUL?

Attention is particularly valuable for RUL prediction due to the nature of degradation data.

Non-Uniform Information Distribution

Not all timesteps are equally informative:

📝text
Typical degradation trajectory:

Timestep:    1    5    10   15   20   25   30
             |    |    |    |    |    |    |
State:    [Healthy]----[Transition]---[Degrading]
           ~~~~~~~~           ~~~~    !!!!!!!!!!!

Informativeness for RUL:
  Low        Low      Medium     High    Very High

Attention Learns This Pattern

Through training, attention weights naturally reflect informativeness:

| Phase | Timesteps | Typical Attention |
| --- | --- | --- |
| Healthy (stable) | 1-10 | Low (~0.01 each) |
| Transition | 11-20 | Medium (~0.03 each) |
| Degrading | 21-30 | High (~0.06 each) |
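One illustrative weight vector with this low/medium/high phase pattern, with values chosen (as an assumption, not learned output) so that the 30 weights form a valid attention distribution:

```python
import numpy as np

# Assumed per-phase weights: 10 timesteps per phase, summing to 1 overall.
alpha = np.concatenate([
    np.full(10, 0.01),   # healthy phase: timesteps 1-10
    np.full(10, 0.03),   # transition: timesteps 11-20
    np.full(10, 0.06),   # degrading: timesteps 21-30
])
late_mass = alpha[20:].sum()   # share of attention on the last 10 timesteps
print(alpha.sum(), late_mass)
```

Even though the degrading phase covers only a third of the window, it receives the majority of the attention mass, which is the pattern training tends to produce for RUL.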

Benefits for RUL

  • Noise suppression: Low attention on irrelevant healthy readings
  • Critical signal amplification: High attention on degradation indicators
  • Adaptive focus: Different attention patterns for different failure modes
  • Interpretability: Attention weights show what the model considers important
Key Insight: Attention allows the model to say "the reading at cycle 27 is critical for my prediction, while cycle 3 is essentially noise." This selective focus is what makes attention so powerful for time series with non-uniform informativeness.

Summary

In this section, we introduced self-attention for time series:

  1. Problem: BiLSTM outputs need to be aggregated into a single prediction
  2. Solution: Attention learns dynamic weights for each timestep
  3. Self-attention: Query, key, value all come from the same sequence
  4. Intuitions: Soft lookup, learned pooling, feature selection
  5. RUL relevance: Degradation signals are non-uniformly distributed in time

| Concept | Description |
| --- | --- |
| Query (Q) | What information is being sought |
| Key (K) | What information each position contains |
| Value (V) | The actual content to aggregate |
| Attention weights | Learned importance of each timestep |
| Context vector | Weighted sum of values |
Looking Ahead: We have the intuition—now for the math. The next section presents the scaled dot-product attention formula, the core computation that turns queries, keys, and values into attention weights and context vectors.

With the intuition established, we now derive the scaled dot-product attention formula.