Chapter 2

Attention Mechanism Theory

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Understand the attention bottleneck problem that motivates attention mechanisms
  2. Derive the general attention formula with queries, keys, and values
  3. Compare attention mechanisms: additive, dot-product, and scaled dot-product
  4. Apply temporal attention to sequence processing for RUL prediction
  5. Preview self-attention as the foundation of transformer architectures
  6. Understand attention weights as interpretable importance scores

Why This Matters: Our AMNL model uses temporal attention after the BiLSTM layer. Attention allows the model to focus on the most informative timesteps for RUL prediction—whether that's a sudden vibration spike or a gradual temperature trend. This section derives the mathematics that makes this focused reasoning possible.

Why Attention?

Despite solving the vanishing gradient problem, LSTMs still face a fundamental limitation we call the bottleneck problem.

The Bottleneck Problem

Consider a BiLSTM processing a 30-timestep sequence. The final hidden states are:

  • $\overrightarrow{\mathbf{h}}_{30}$: Forward LSTM's final state (summarizes entire sequence)
  • $\overleftarrow{\mathbf{h}}_1$: Backward LSTM's final state (also summarizes entire sequence)

If we only use these final states, all 30 timesteps must be compressed into a fixed-size vector of dimension $2H = 128$. This creates a bottleneck:

$$\text{30 timesteps} \times 17 \text{ features} = 510 \text{ values} \rightarrow 128 \text{ dimensions}$$

Information Loss

This compression forces the network to prioritize some information over others:

  • Recent timesteps may dominate due to recency bias
  • Subtle but important early signals can be lost
  • The network cannot easily retrieve specific past information when needed

The Attention Solution

Attention addresses this by maintaining direct access to all timesteps:

$$\mathbf{c} = \sum_{t=1}^{T} \alpha_t \mathbf{h}_t$$

Where $\alpha_t \in [0, 1]$ are attention weights that sum to 1. The context vector $\mathbf{c}$ is a weighted average of all hidden states, not just the final one.

| Approach | Information Access | Flexibility |
|---|---|---|
| Final hidden state | Compressed summary only | Fixed representation |
| Attention | All timesteps directly | Dynamic, query-dependent |
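As a concrete sketch of this weighted average, take $T = 4$ hidden states of dimension 3 with hand-picked weights (all values illustrative):

```python
import numpy as np

# T = 4 hidden states of dimension 3 (hand-picked illustrative values)
H = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2]])

# Attention weights: non-negative and summing to 1
alpha = np.array([0.1, 0.2, 0.3, 0.4])

# Context vector: a convex combination of all hidden states,
# leaning toward the more heavily weighted later timesteps
c = (alpha[:, None] * H).sum(axis=0)   # ≈ [0.7, 0.8, 0.9]
```

Because the weights sum to 1, every component of $\mathbf{c}$ stays within the range spanned by the corresponding components of the hidden states.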

Attention Fundamentals

At its core, attention is a mechanism for computing a weighted average where the weights depend on the content being aggregated.

The Query-Key-Value Framework

Modern attention mechanisms are formulated using three components:

| Component | Symbol | Role | Intuition |
|---|---|---|---|
| Query | $\mathbf{q}$ | What we are looking for | The question |
| Keys | $\mathbf{K}$ | What each item contains | Index/labels |
| Values | $\mathbf{V}$ | The actual content | The answers |

The attention mechanism computes how well each key matches the query, then uses these matches to weight the values.

General Attention Formula

Given:

  • Query: $\mathbf{q} \in \mathbb{R}^{d_k}$
  • Keys: $\mathbf{K} = [\mathbf{k}_1, \ldots, \mathbf{k}_T]^{\top} \in \mathbb{R}^{T \times d_k}$
  • Values: $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_T]^{\top} \in \mathbb{R}^{T \times d_v}$

Attention computes:

$$\text{Attention}(\mathbf{q}, \mathbf{K}, \mathbf{V}) = \sum_{t=1}^{T} \alpha_t \mathbf{v}_t$$

Where the attention weights are:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{j=1}^{T} \exp(e_j)}, \quad e_t = \text{score}(\mathbf{q}, \mathbf{k}_t)$$

The score function measures compatibility between query and key. The softmax ensures $\sum_t \alpha_t = 1$.
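A minimal sketch of this pipeline, using a dot-product score (other score functions slot in the same way); the query, keys, and values below are illustrative:

```python
import numpy as np

def attention(q, K, V):
    """General attention with a dot-product score: score -> softmax -> weighted sum."""
    e = K @ q                       # (T,) alignment scores e_t = score(q, k_t)
    alpha = np.exp(e - e.max())     # numerically stable softmax
    alpha /= alpha.sum()            # weights sum to 1
    return alpha @ V, alpha         # context (d_v,), weights (T,)

q = np.array([1.0, 0.0])                              # what we are looking for
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # T = 3 keys
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])    # the values to retrieve
c, alpha = attention(q, K, V)
# Keys 1 and 3 match the query equally well (score 1 vs 0 for key 2),
# so their values dominate the weighted average
```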


Types of Attention Mechanisms

Different score functions define different attention mechanisms. The three main types are additive, dot-product, and scaled dot-product.

1. Additive Attention (Bahdanau)

Introduced by Bahdanau et al. (2014) for neural machine translation:

$$e_t = \mathbf{v}^{\top} \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}_t)$$

Components:

  • $\mathbf{W}_q \in \mathbb{R}^{d_a \times d_q}$: Projects query to attention dimension
  • $\mathbf{W}_k \in \mathbb{R}^{d_a \times d_k}$: Projects keys to attention dimension
  • $\mathbf{v} \in \mathbb{R}^{d_a}$: Learned vector that produces scalar score
  • $\tanh$: Non-linearity that captures complex query-key interactions

Advantages: More expressive due to non-linearity; works well when query and key dimensions differ.

Disadvantages: Requires additional parameters; slower due to MLP computation.
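A NumPy sketch of the additive score with illustrative dimensions (note that, unlike dot-product attention, $d_q \neq d_k$ is allowed):

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_k, d_a, T = 4, 6, 8, 5   # illustrative dimensions; d_q != d_k is fine here
W_q = rng.standard_normal((d_a, d_q))
W_k = rng.standard_normal((d_a, d_k))
v = rng.standard_normal(d_a)

def additive_score(q, K):
    """Bahdanau score e_t = v^T tanh(W_q q + W_k k_t), vectorized over all T keys."""
    u = np.tanh(W_q @ q + K @ W_k.T)   # (T, d_a): projected query broadcasts over keys
    return u @ v                        # (T,): one scalar score per key

e = additive_score(rng.standard_normal(d_q), rng.standard_normal((T, d_k)))
```

The extra MLP pass (two projections plus the $\tanh$) is exactly the cost and parameter overhead the disadvantages above refer to.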

2. Dot-Product Attention (Luong)

Simpler alternative using direct dot products:

$$e_t = \mathbf{q}^{\top} \mathbf{k}_t$$

Advantages: No additional parameters; highly efficient using matrix multiplication.

Disadvantages: Requires $d_q = d_k$; can have numerical issues for large dimensions.

3. Scaled Dot-Product Attention

Used in Transformers (Vaswani et al., 2017) to address the scaling issue:

$$e_t = \frac{\mathbf{q}^{\top} \mathbf{k}_t}{\sqrt{d_k}}$$

Why Scaling?

If the components of $\mathbf{q}$ and $\mathbf{k}_t$ are independent with zero mean and unit variance, the dot product $\mathbf{q}^{\top} \mathbf{k}_t$ has variance $d_k$, so raw scores grow with the key dimension. Large scores push the softmax into a saturated regime where gradients become vanishingly small. Dividing by $\sqrt{d_k}$ restores unit variance regardless of dimension, keeping training stable.
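A quick numerical check: for unit-variance random queries and keys, the spread of the raw dot product grows like $\sqrt{d_k}$, while the scaled score stays near 1 (dimensions and sample count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_std(d_k, n=2000):
    """Empirical standard deviation of q . k for unit-variance random q and k."""
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    return (q * k).sum(axis=1).std()

# Raw dot-product scores spread out like sqrt(d_k) ...
raw = {d: dot_std(d) for d in (4, 64, 1024)}
# ... while dividing by sqrt(d_k) keeps the spread near 1 for any dimension,
# so the softmax stays out of its saturated, small-gradient regime
scaled = {d: raw[d] / np.sqrt(d) for d in raw}
```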

Comparison Summary

| Mechanism | Score Function | Parameters | Complexity |
|---|---|---|---|
| Additive | $\mathbf{v}^{\top} \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}_t)$ | $\mathbf{W}_q$, $\mathbf{W}_k$, $\mathbf{v}$ | $O(d_a(d_q + d_k))$ |
| Dot-product | $\mathbf{q}^{\top} \mathbf{k}_t$ | None | $O(d_k)$ |
| Scaled dot-product | $\mathbf{q}^{\top} \mathbf{k}_t / \sqrt{d_k}$ | None | $O(d_k)$ |

Temporal Attention for Sequences

For RUL prediction, we apply attention over the temporal dimension to identify which timesteps are most relevant for the prediction.

Setup

After the BiLSTM, we have hidden states for each timestep:

$$\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T]^{\top} \in \mathbb{R}^{T \times 2H}$$

Where $T = 30$ (window size) and $2H = 128$ (BiLSTM output dimension).

Temporal Attention Formulation

We use a learned attention mechanism to compute importance weights:

$$\begin{aligned} \mathbf{u}_t &= \tanh(\mathbf{W}_a \mathbf{h}_t + \mathbf{b}_a) & \text{(Attention hidden)} \\ e_t &= \mathbf{v}_a^{\top} \mathbf{u}_t & \text{(Alignment score)} \\ \alpha_t &= \frac{\exp(e_t)}{\sum_{j=1}^{T} \exp(e_j)} & \text{(Attention weight)} \\ \mathbf{c} &= \sum_{t=1}^{T} \alpha_t \mathbf{h}_t & \text{(Context vector)} \end{aligned}$$

Interpreting Each Component

| Symbol | Dimension | Role |
|---|---|---|
| $\mathbf{W}_a$ | $d_a \times 2H$ | Projects hidden state to attention space |
| $\mathbf{b}_a$ | $d_a$ | Bias for attention projection |
| $\mathbf{u}_t$ | $d_a$ | Attention representation of timestep $t$ |
| $\mathbf{v}_a$ | $d_a$ | Query vector (what to look for) |
| $e_t$ | scalar | Compatibility score for timestep $t$ |
| $\alpha_t$ | $[0, 1]$ | Importance weight for timestep $t$ |
| $\mathbf{c}$ | $2H$ | Weighted summary of all timesteps |

Attention Interpretation

The attention weights $\alpha_t$ have a powerful interpretation: they tell us which timesteps the model considers most important for the prediction.

Interpretability Bonus

Attention weights provide post-hoc explanations. When predicting low RUL, we can inspect which sensor readings at which timesteps triggered the prediction—valuable for maintenance engineers trying to understand why the model predicts imminent failure.


Self-Attention Preview

While our model uses the temporal attention mechanism described above, it's worth understanding self-attention—the foundation of modern Transformer architectures.

From Attention to Self-Attention

In the attention we described, the query typically comes from a different source (e.g., decoder hidden state querying encoder states). In self-attention, queries, keys, and values all come from the same sequence:

$$\mathbf{Q} = \mathbf{H} \mathbf{W}_Q, \quad \mathbf{K} = \mathbf{H} \mathbf{W}_K, \quad \mathbf{V} = \mathbf{H} \mathbf{W}_V$$

Where $\mathbf{H}$ is the sequence of hidden states. Each position can attend to every other position, including itself.

Self-Attention Formula

$$\text{SelfAttention}(\mathbf{H}) = \text{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{d_k}}\right) \mathbf{V}$$
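A NumPy sketch of this formula, with $T = 30$ and a 128-dimensional state as in this chapter; the projection dimension $d_k = 32$ and the random matrices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(H, W_Q, W_K, W_V):
    """Scaled dot-product self-attention: every position attends to every position."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (T, T) score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # each row is a softmax over positions
    return A @ V, A                               # (T, d_v) outputs, (T, T) weights

T, d_model, d_k = 30, 128, 32
H = rng.standard_normal((T, d_model))
W = lambda d_out: rng.standard_normal((d_model, d_out))
out, A = self_attention(H, W(d_k), W(d_k), W(d_k))
```

Note the output is a full sequence of context-aware vectors, one per position, rather than the single context vector produced by temporal attention.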

Key Difference: All-to-All Attention

| Aspect | Temporal Attention | Self-Attention |
|---|---|---|
| Query source | Learned global query | Each position queries |
| Output | Single context vector | Sequence of vectors |
| Computation | $O(T)$ | $O(T^2)$ |
| Parallelization | Limited by RNN | Fully parallel |

Self-attention computes $T \times T$ attention scores, allowing every timestep to directly attend to every other timestep. This enables:

  • Long-range dependencies: Timestep 1 can directly influence timestep 30 (no gradient path through intermediate states)
  • Parallel computation: All attention scores computed simultaneously
  • Rich representations: Each position gets a context-aware embedding

Why We Use Simpler Attention

For our 30-timestep windows, the $O(T^2) = O(900)$ cost of self-attention is manageable. However, our temporal attention approach produces a single context vector that feeds directly into the prediction heads, which is simpler and works well for our regression and classification tasks. Self-attention would be more beneficial for tasks requiring per-timestep outputs.


Attention in Our Architecture

Let's trace how attention integrates into the AMNL model pipeline.

Data Flow

  1. CNN Output: $\mathbf{X} \in \mathbb{R}^{30 \times 64}$ (30 timesteps, 64 features)
  2. BiLSTM Output: $\mathbf{H} \in \mathbb{R}^{30 \times 128}$ (forward + backward hidden states)
  3. Attention: $\mathbf{c} \in \mathbb{R}^{128}$ (weighted combination)
  4. Prediction Heads: RUL regression and health state classification
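The steps above can be traced with dummy tensors; the layer hyperparameters below (BiLSTM hidden size 64 per direction, attention dimension 64) are illustrative placeholders, not the exact AMNL configuration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, T = 8, 30
X = torch.randn(batch, T, 64)                    # CNN output: 30 timesteps, 64 features

# BiLSTM with hidden size 64 per direction -> 2H = 128 at the output
bilstm = nn.LSTM(input_size=64, hidden_size=64,
                 bidirectional=True, batch_first=True)
H, _ = bilstm(X)                                 # (batch, 30, 128)

# Temporal attention collapses the time dimension into one context vector per window
W_a, v_a = nn.Linear(128, 64), nn.Linear(64, 1)
alpha = torch.softmax(v_a(torch.tanh(W_a(H))).squeeze(-1), dim=1)  # (batch, 30)
context = (alpha.unsqueeze(-1) * H).sum(dim=1)   # (batch, 128), fed to prediction heads
```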

Attention Dimension Design

Common choices for the attention hidden dimension $d_a$:

| Choice | $d_a$ Value | Trade-off |
|---|---|---|
| Small | 16-32 | Fewer parameters, limited capacity |
| Match hidden | 64 | Balanced choice, common default |
| Match BiLSTM | 128 | Maximum capacity, more parameters |

Complete Attention Layer

The attention computation can be written as:

```python
# Attention layer (PyTorch-style pseudocode)
# Input:  H of shape (batch, T, 2*hidden_dim)
# Output: context of shape (batch, 2*hidden_dim)

# Step 1: Project hidden states to attention space
U = torch.tanh(self.W_a(H))          # (batch, T, attention_dim)

# Step 2: Compute alignment scores
e = self.v_a(U).squeeze(-1)          # (batch, T)

# Step 3: Softmax over the time dimension to get weights
alpha = torch.softmax(e, dim=1)      # (batch, T)

# Step 4: Weighted sum of hidden states
context = torch.sum(alpha.unsqueeze(-1) * H, dim=1)  # (batch, 2*hidden_dim)
```

Why Attention Helps RUL Prediction

For RUL prediction specifically, attention provides three key benefits:

  1. Selective focus: Weights identify which parts of the degradation history matter most
  2. Noise robustness: The model can downweight timesteps with anomalous sensor readings
  3. Interpretability: Attention weights explain which time periods influenced the prediction

Attention Weight Visualization

In practice, attention weights can be plotted as a heatmap over the 30-timestep window. This visualization helps maintenance engineers understand model decisions and builds trust in the predictions.
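A sketch of such a heatmap with matplotlib, using illustrative softmax weights in place of real model output:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; write the figure to a file
import matplotlib.pyplot as plt

# Illustrative attention weights for 5 windows of 30 timesteps
# (in practice these come from the trained attention layer)
rng = np.random.default_rng(0)
scores = rng.standard_normal((5, 30))
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

fig, ax = plt.subplots(figsize=(8, 2.5))
im = ax.imshow(alpha, aspect="auto", cmap="viridis")
ax.set_xlabel("Timestep within 30-step window")
ax.set_ylabel("Window")
fig.colorbar(im, label="Attention weight")
fig.savefig("attention_heatmap.png", bbox_inches="tight")
```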


Summary

In this section, we explored attention mechanisms as a solution to the sequence modeling bottleneck:

  1. Bottleneck problem: Fixed-size hidden states must compress entire sequences, losing information
  2. Attention solution: Compute weighted averages over all hidden states: $\mathbf{c} = \sum_t \alpha_t \mathbf{h}_t$
  3. Query-Key-Value: Framework where queries search keys to retrieve values
  4. Score functions: Additive (learned), dot-product (efficient), scaled dot-product (stable)
  5. Temporal attention: Applies attention over time dimension for sequence summarization
  6. Self-attention: Each position attends to all positions (foundation of Transformers)

| Component | Formula | Purpose |
|---|---|---|
| Attention hidden | $\mathbf{u}_t = \tanh(\mathbf{W}_a \mathbf{h}_t + \mathbf{b}_a)$ | Non-linear projection |
| Alignment score | $e_t = \mathbf{v}_a^{\top} \mathbf{u}_t$ | Query-key compatibility |
| Attention weight | $\alpha_t = \exp(e_t) / \sum_j \exp(e_j)$ | Normalized importance |
| Context vector | $\mathbf{c} = \sum_t \alpha_t \mathbf{h}_t$ | Weighted summary |

Looking Ahead: We have now covered the individual components of our model: CNN for local feature extraction, BiLSTM for temporal dependencies, and attention for focused summarization. But our AMNL model has one more innovation—it performs multi-task learning, predicting both continuous RUL and discrete health states simultaneously. In the next section, we'll derive the multi-task learning framework and understand how combining tasks improves overall performance.

With attention understood, we are ready to explore how multi-task learning enables our model to leverage complementary prediction targets.