Chapter 2

Attention Mechanism Theory

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Understand the attention bottleneck problem that motivates attention mechanisms
  2. Derive the general attention formula with queries, keys, and values
  3. Compare attention mechanisms: additive, dot-product, and scaled dot-product
  4. Apply temporal attention to sequence processing for RUL prediction
  5. Preview self-attention as the foundation of transformer architectures
  6. Understand attention weights as interpretable importance scores

Why This Matters: Our AMNL model uses temporal attention after the BiLSTM layer. Attention allows the model to focus on the most informative timesteps for RUL prediction—whether that's a sudden vibration spike or a gradual temperature trend. This section derives the mathematics that makes this focused reasoning possible.

Why Attention?

Despite solving the vanishing gradient problem, LSTMs still face a fundamental limitation we call the bottleneck problem.

The Bottleneck Problem

Consider a BiLSTM processing a 30-timestep sequence. The final hidden states are:

  • $\overrightarrow{\mathbf{h}}_{30}$: Forward LSTM's final state (summarizes entire sequence)
  • $\overleftarrow{\mathbf{h}}_1$: Backward LSTM's final state (also summarizes entire sequence)

If we only use these final states, all 30 timesteps must be compressed into a fixed-size vector of dimension $2H = 128$. This creates a bottleneck:

$$\text{30 timesteps} \times 17 \text{ features} = 510 \text{ values} \rightarrow 128 \text{ dimensions}$$

Information Loss

This compression forces the network to prioritize some information over others:

  • Recent timesteps may dominate due to recency bias
  • Subtle but important early signals can be lost
  • The network cannot easily retrieve specific past information when needed

The Attention Solution

Attention addresses this by maintaining direct access to all timesteps:

$$\mathbf{c} = \sum_{t=1}^{T} \alpha_t \mathbf{h}_t$$

Where $\alpha_t \in [0, 1]$ are attention weights that sum to 1. The context vector $\mathbf{c}$ is a weighted average of all hidden states, not just the final one.

| Approach | Information Access | Flexibility |
|---|---|---|
| Final hidden state | Compressed summary only | Fixed representation |
| Attention | All timesteps directly | Dynamic, query-dependent |
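As a concrete sketch of this weighted average, take $T = 4$ hidden states of dimension 3 with hand-picked weights (all values illustrative):

```python
import numpy as np

# T = 4 hidden states of dimension 3 (hand-picked illustrative values)
H = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2]])

# Attention weights: non-negative and summing to 1
alpha = np.array([0.1, 0.2, 0.3, 0.4])

# Context vector: a convex combination of all hidden states,
# leaning toward the more heavily weighted later timesteps
c = (alpha[:, None] * H).sum(axis=0)   # ≈ [0.7, 0.8, 0.9]
```

Because the weights sum to 1, every component of $\mathbf{c}$ stays within the range spanned by the corresponding components of the hidden states.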

Attention Fundamentals

At its core, attention is a mechanism for computing a weighted average where the weights depend on the content being aggregated.

The Query-Key-Value Framework

Modern attention mechanisms are formulated using three components:

| Component | Symbol | Role | Intuition |
|---|---|---|---|
| Query | $\mathbf{q}$ | What we are looking for | The question |
| Keys | $\mathbf{K}$ | What each item contains | Index/labels |
| Values | $\mathbf{V}$ | The actual content | The answers |

The attention mechanism computes how well each key matches the query, then uses these matches to weight the values.

General Attention Formula

Given:

  • Query: $\mathbf{q} \in \mathbb{R}^{d_k}$
  • Keys: $\mathbf{K} = [\mathbf{k}_1, \ldots, \mathbf{k}_T]^{\top} \in \mathbb{R}^{T \times d_k}$
  • Values: $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_T]^{\top} \in \mathbb{R}^{T \times d_v}$

Attention computes:

$$\text{Attention}(\mathbf{q}, \mathbf{K}, \mathbf{V}) = \sum_{t=1}^{T} \alpha_t \mathbf{v}_t$$

Where the attention weights are:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{j=1}^{T} \exp(e_j)}, \quad e_t = \text{score}(\mathbf{q}, \mathbf{k}_t)$$

The score function measures compatibility between query and key. The softmax ensures $\sum_t \alpha_t = 1$.
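A minimal sketch of this pipeline, using a dot-product score (other score functions slot in the same way); the query, keys, and values below are illustrative:

```python
import numpy as np

def attention(q, K, V):
    """General attention with a dot-product score: score -> softmax -> weighted sum."""
    e = K @ q                       # (T,) alignment scores e_t = score(q, k_t)
    alpha = np.exp(e - e.max())     # numerically stable softmax
    alpha /= alpha.sum()            # weights sum to 1
    return alpha @ V, alpha         # context (d_v,), weights (T,)

q = np.array([1.0, 0.0])                              # what we are looking for
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # T = 3 keys
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])    # the values to retrieve
c, alpha = attention(q, K, V)
# Keys 1 and 3 match the query equally well (score 1 vs 0 for key 2),
# so their values dominate the weighted average
```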


Types of Attention Mechanisms

Different score functions define different attention mechanisms. The three main types are additive, dot-product, and scaled dot-product.

1. Additive Attention (Bahdanau)

Introduced by Bahdanau et al. (2014) for neural machine translation:

$$e_t = \mathbf{v}^{\top} \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}_t)$$

Components:

  • $\mathbf{W}_q \in \mathbb{R}^{d_a \times d_q}$: Projects query to attention dimension
  • $\mathbf{W}_k \in \mathbb{R}^{d_a \times d_k}$: Projects keys to attention dimension
  • $\mathbf{v} \in \mathbb{R}^{d_a}$: Learned vector that produces scalar score
  • $\tanh$: Non-linearity that captures complex query-key interactions

Advantages: More expressive due to non-linearity; works well when query and key dimensions differ.

Disadvantages: Requires additional parameters; slower due to MLP computation.
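A NumPy sketch of the additive score with illustrative dimensions (note that, unlike dot-product attention, $d_q \neq d_k$ is allowed):

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_k, d_a, T = 4, 6, 8, 5   # illustrative dimensions; d_q != d_k is fine here
W_q = rng.standard_normal((d_a, d_q))
W_k = rng.standard_normal((d_a, d_k))
v = rng.standard_normal(d_a)

def additive_score(q, K):
    """Bahdanau score e_t = v^T tanh(W_q q + W_k k_t), vectorized over all T keys."""
    u = np.tanh(W_q @ q + K @ W_k.T)   # (T, d_a): projected query broadcasts over keys
    return u @ v                        # (T,): one scalar score per key

e = additive_score(rng.standard_normal(d_q), rng.standard_normal((T, d_k)))
```

The extra MLP pass (two projections plus the $\tanh$) is exactly the cost and parameter overhead the disadvantages above refer to.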

2. Dot-Product Attention (Luong)

Simpler alternative using direct dot products:

$$e_t = \mathbf{q}^{\top} \mathbf{k}_t$$

Advantages: No additional parameters; highly efficient using matrix multiplication.

Disadvantages: Requires $d_q = d_k$; can have numerical issues for large dimensions.

3. Scaled Dot-Product Attention

Used in Transformers (Vaswani et al., 2017) to address the scaling issue:

$$e_t = \frac{\mathbf{q}^{\top} \mathbf{k}_t}{\sqrt{d_k}}$$

Why Scaling?

If the components of $\mathbf{q}$ and $\mathbf{k}_t$ are independent with zero mean and unit variance, the dot product $\mathbf{q}^{\top} \mathbf{k}_t$ has variance $d_k$, so raw scores grow with the key dimension. Large scores push the softmax into a saturated regime where gradients become vanishingly small. Dividing by $\sqrt{d_k}$ restores unit variance regardless of dimension, keeping training stable.
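A quick numerical check: for unit-variance random queries and keys, the spread of the raw dot product grows like $\sqrt{d_k}$, while the scaled score stays near 1 (dimensions and sample count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_std(d_k, n=2000):
    """Empirical standard deviation of q . k for unit-variance random q and k."""
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    return (q * k).sum(axis=1).std()

# Raw dot-product scores spread out like sqrt(d_k) ...
raw = {d: dot_std(d) for d in (4, 64, 1024)}
# ... while dividing by sqrt(d_k) keeps the spread near 1 for any dimension,
# so the softmax stays out of its saturated, small-gradient regime
scaled = {d: raw[d] / np.sqrt(d) for d in raw}
```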

Comparison Summary

| Mechanism | Score Function | Parameters | Complexity |
|---|---|---|---|
| Additive | $\mathbf{v}^{\top} \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}_t)$ | $\mathbf{W}_q$, $\mathbf{W}_k$, $\mathbf{v}$ | $O(d_a(d_q + d_k))$ |
| Dot-product | $\mathbf{q}^{\top} \mathbf{k}_t$ | None | $O(d_k)$ |
| Scaled dot-product | $\mathbf{q}^{\top} \mathbf{k}_t / \sqrt{d_k}$ | None | $O(d_k)$ |

Temporal Attention for Sequences

For RUL prediction, we apply attention over the temporal dimension to identify which timesteps are most relevant for the prediction.

Setup

After the BiLSTM, we have hidden states for each timestep:

$$\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T]^{\top} \in \mathbb{R}^{T \times 2H}$$

Where $T = 30$ (window size) and $2H = 128$ (BiLSTM output dimension).

Temporal Attention Formulation

We use a learned attention mechanism to compute importance weights:

$$\begin{aligned} \mathbf{u}_t &= \tanh(\mathbf{W}_a \mathbf{h}_t + \mathbf{b}_a) & \text{(Attention hidden)} \\ e_t &= \mathbf{v}_a^{\top} \mathbf{u}_t & \text{(Alignment score)} \\ \alpha_t &= \frac{\exp(e_t)}{\sum_{j=1}^{T} \exp(e_j)} & \text{(Attention weight)} \\ \mathbf{c} &= \sum_{t=1}^{T} \alpha_t \mathbf{h}_t & \text{(Context vector)} \end{aligned}$$

Interpreting Each Component

| Symbol | Dimension | Role |
|---|---|---|
| $\mathbf{W}_a$ | $d_a \times 2H$ | Projects hidden state to attention space |
| $\mathbf{b}_a$ | $d_a$ | Bias for attention projection |
| $\mathbf{u}_t$ | $d_a$ | Attention representation of timestep $t$ |
| $\mathbf{v}_a$ | $d_a$ | Query vector (what to look for) |
| $e_t$ | scalar | Compatibility score for timestep $t$ |
| $\alpha_t$ | $[0, 1]$ | Importance weight for timestep $t$ |
| $\mathbf{c}$ | $2H$ | Weighted summary of all timesteps |

Attention Interpretation

The attention weights $\alpha_t$ have a powerful interpretation: they tell us which timesteps the model considers most important for the prediction.

Interpretability Bonus

Attention weights provide post-hoc explanations. When predicting low RUL, we can inspect which sensor readings at which timesteps triggered the prediction—valuable for maintenance engineers trying to understand why the model predicts imminent failure.


Self-Attention Preview

While our model uses the temporal attention mechanism described above, it's worth understanding self-attention—the foundation of modern Transformer architectures.

From Attention to Self-Attention

In the attention we described, the query typically comes from a different source (e.g., decoder hidden state querying encoder states). In self-attention, queries, keys, and values all come from the same sequence:

$$\mathbf{Q} = \mathbf{H} \mathbf{W}_Q, \quad \mathbf{K} = \mathbf{H} \mathbf{W}_K, \quad \mathbf{V} = \mathbf{H} \mathbf{W}_V$$

Where $\mathbf{H}$ is the sequence of hidden states. Each position can attend to every other position, including itself.

Self-Attention Formula

$$\text{SelfAttention}(\mathbf{H}) = \text{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{d_k}}\right) \mathbf{V}$$
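A NumPy sketch of this formula, with $T = 30$ and a 128-dimensional state as in this chapter; the projection dimension $d_k = 32$ and the random matrices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(H, W_Q, W_K, W_V):
    """Scaled dot-product self-attention: every position attends to every position."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (T, T) score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # each row is a softmax over positions
    return A @ V, A                               # (T, d_v) outputs, (T, T) weights

T, d_model, d_k = 30, 128, 32
H = rng.standard_normal((T, d_model))
W = lambda d_out: rng.standard_normal((d_model, d_out))
out, A = self_attention(H, W(d_k), W(d_k), W(d_k))
```

Note the output is a full sequence of context-aware vectors, one per position, rather than the single context vector produced by temporal attention.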

Key Difference: All-to-All Attention

| Aspect | Temporal Attention | Self-Attention |
|---|---|---|
| Query source | Learned global query | Each position queries |
| Output | Single context vector | Sequence of vectors |
| Computation | $O(T)$ | $O(T^2)$ |
| Parallelization | Limited by RNN | Fully parallel |

Self-attention computes $T \times T$ attention scores, allowing every timestep to directly attend to every other timestep. This enables:

  • Long-range dependencies: Timestep 1 can directly influence timestep 30 (no gradient path through intermediate states)
  • Parallel computation: All attention scores computed simultaneously
  • Rich representations: Each position gets a context-aware embedding

Why We Use Simpler Attention

For our 30-timestep windows, the $O(T^2) = O(900)$ cost of self-attention is manageable. However, our temporal attention approach produces a single context vector that feeds directly into the prediction heads, which is simpler and works well for our regression and classification tasks. Self-attention would be more beneficial for tasks requiring per-timestep outputs.


Attention in Our Architecture

Let's trace how attention integrates into the AMNL model pipeline.

Data Flow

  1. CNN Output: $\mathbf{X} \in \mathbb{R}^{30 \times 64}$ (30 timesteps, 64 features)
  2. BiLSTM Output: $\mathbf{H} \in \mathbb{R}^{30 \times 128}$ (forward + backward hidden states)
  3. Attention: $\mathbf{c} \in \mathbb{R}^{128}$ (weighted combination)
  4. Prediction Heads: RUL regression and health state classification
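The steps above can be traced with dummy tensors; the layer hyperparameters below (BiLSTM hidden size 64 per direction, attention dimension 64) are illustrative placeholders, not the exact AMNL configuration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, T = 8, 30
X = torch.randn(batch, T, 64)                    # CNN output: 30 timesteps, 64 features

# BiLSTM with hidden size 64 per direction -> 2H = 128 at the output
bilstm = nn.LSTM(input_size=64, hidden_size=64,
                 bidirectional=True, batch_first=True)
H, _ = bilstm(X)                                 # (batch, 30, 128)

# Temporal attention collapses the time dimension into one context vector per window
W_a, v_a = nn.Linear(128, 64), nn.Linear(64, 1)
alpha = torch.softmax(v_a(torch.tanh(W_a(H))).squeeze(-1), dim=1)  # (batch, 30)
context = (alpha.unsqueeze(-1) * H).sum(dim=1)   # (batch, 128), fed to prediction heads
```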

Attention Dimension Design

Common choices for the attention hidden dimension $d_a$:

| Choice | $d_a$ Value | Trade-off |
|---|---|---|
| Small | 16-32 | Fewer parameters, limited capacity |
| Match hidden | 64 | Balanced choice, common default |
| Match BiLSTM | 128 | Maximum capacity, more parameters |

Complete Attention Layer

The attention computation can be written as:

```python
# Attention layer (PyTorch-style pseudocode)
# Input:  H of shape (batch, T, 2*hidden_dim)
# Output: context of shape (batch, 2*hidden_dim)

# Step 1: Project hidden states to attention space
U = torch.tanh(self.W_a(H))          # (batch, T, attention_dim)

# Step 2: Compute alignment scores
e = self.v_a(U).squeeze(-1)          # (batch, T)

# Step 3: Softmax over the time dimension to get weights
alpha = torch.softmax(e, dim=1)      # (batch, T)

# Step 4: Weighted sum of hidden states
context = torch.sum(alpha.unsqueeze(-1) * H, dim=1)  # (batch, 2*hidden_dim)
```

Why Attention Helps RUL Prediction

For RUL prediction specifically, attention provides three key benefits:

  1. Selective focus: Weights identify which parts of the degradation history matter most
  2. Noise robustness: The model can downweight timesteps with anomalous sensor readings
  3. Interpretability: Attention weights explain which time periods influenced the prediction

Attention Weight Visualization

In practice, attention weights can be plotted as a heatmap over the 30-timestep window. This visualization helps maintenance engineers understand model decisions and builds trust in the predictions.
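A sketch of such a heatmap with matplotlib, using illustrative softmax weights in place of real model output:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; write the figure to a file
import matplotlib.pyplot as plt

# Illustrative attention weights for 5 windows of 30 timesteps
# (in practice these come from the trained attention layer)
rng = np.random.default_rng(0)
scores = rng.standard_normal((5, 30))
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

fig, ax = plt.subplots(figsize=(8, 2.5))
im = ax.imshow(alpha, aspect="auto", cmap="viridis")
ax.set_xlabel("Timestep within 30-step window")
ax.set_ylabel("Window")
fig.colorbar(im, label="Attention weight")
fig.savefig("attention_heatmap.png", bbox_inches="tight")
```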


Summary

In this section, we explored attention mechanisms as a solution to the sequence modeling bottleneck:

  1. Bottleneck problem: Fixed-size hidden states must compress entire sequences, losing information
  2. Attention solution: Compute weighted averages over all hidden states: $\mathbf{c} = \sum_t \alpha_t \mathbf{h}_t$
  3. Query-Key-Value: Framework where queries search keys to retrieve values
  4. Score functions: Additive (learned), dot-product (efficient), scaled dot-product (stable)
  5. Temporal attention: Applies attention over time dimension for sequence summarization
  6. Self-attention: Each position attends to all positions (foundation of Transformers)

| Component | Formula | Purpose |
|---|---|---|
| Attention hidden | $\mathbf{u}_t = \tanh(\mathbf{W}_a \mathbf{h}_t + \mathbf{b}_a)$ | Non-linear projection |
| Alignment score | $e_t = \mathbf{v}_a^{\top} \mathbf{u}_t$ | Query-key compatibility |
| Attention weight | $\alpha_t = \exp(e_t) / \sum_j \exp(e_j)$ | Normalized importance |
| Context vector | $\mathbf{c} = \sum_t \alpha_t \mathbf{h}_t$ | Weighted summary |

Looking Ahead: We have now covered the individual components of our model: CNN for local feature extraction, BiLSTM for temporal dependencies, and attention for focused summarization. But our AMNL model has one more innovation—it performs multi-task learning, predicting both continuous RUL and discrete health states simultaneously. In the next section, we'll derive the multi-task learning framework and understand how combining tasks improves overall performance.

With attention understood, we are ready to explore how multi-task learning enables our model to leverage complementary prediction targets.