Learning Objectives
By the end of this section, you will:
- Understand the bottleneck problem that motivates attention mechanisms
- Derive the general attention formula with queries, keys, and values
- Compare attention mechanisms: additive, dot-product, and scaled dot-product
- Apply temporal attention to sequence processing for RUL prediction
- Preview self-attention as the foundation of transformer architectures
- Understand attention weights as interpretable importance scores
Why This Matters: Our AMNL model uses temporal attention after the BiLSTM layer. Attention allows the model to focus on the most informative timesteps for RUL prediction—whether that's a sudden vibration spike or a gradual temperature trend. This section derives the mathematics that makes this focused reasoning possible.
Why Attention?
Despite solving the vanishing gradient problem, LSTMs still face a fundamental limitation we call the bottleneck problem.
The Bottleneck Problem
Consider a BiLSTM processing a 30-timestep sequence. The final hidden states are:
- $\overrightarrow{h}_{30}$: Forward LSTM's final state (summarizes the entire sequence left to right)
- $\overleftarrow{h}_{1}$: Backward LSTM's final state (summarizes the entire sequence right to left)
If we only use these final states, all 30 timesteps must be compressed into a fixed-size vector of dimension $2H$. This creates a bottleneck:
Information Loss
This compression forces the network to prioritize some information over others:
- Recent timesteps may dominate due to recency bias
- Subtle but important early signals can be lost
- The network cannot easily retrieve specific past information when needed
The Attention Solution
Attention addresses this by maintaining direct access to all timesteps:

$$c = \sum_{t=1}^{T} \alpha_t h_t$$

Where $\alpha_t$ are attention weights that sum to 1. The context vector $c$ is a weighted average of all hidden states, not just the final one.
| Approach | Information Access | Flexibility |
|---|---|---|
| Final hidden state | Compressed summary only | Fixed representation |
| Attention | All timesteps directly | Dynamic, query-dependent |
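The contrast between the two rows can be sketched numerically. This is a minimal NumPy illustration with random stand-ins for the hidden states and scores; the dimension 128 assumes our model's $2H$, and the learned score computation is replaced by random numbers purely to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 30, 128                     # timesteps, BiLSTM output dimension (assumed 2H = 128)
H = rng.standard_normal((T, d))    # hidden states h_1 ... h_T

# Final-hidden-state approach: keep only h_T, discard everything earlier
final_state = H[-1]                # (128,)

# Attention approach: weighted average with direct access to every timestep
scores = rng.standard_normal(T)                 # stand-in for learned scores e_t
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax -> weights sum to 1
context = alpha @ H                             # (128,) weighted combination

assert np.isclose(alpha.sum(), 1.0)
```

Both outputs have the same size, but the attention context is built from all 30 timesteps rather than the last one alone.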
Attention Fundamentals
At its core, attention is a mechanism for computing a weighted average where the weights depend on the content being aggregated.
The Query-Key-Value Framework
Modern attention mechanisms are formulated using three components:
| Component | Symbol | Role | Intuition |
|---|---|---|---|
| Query | q | What we are looking for | The question |
| Keys | K | What each item contains | Index/labels |
| Values | V | The actual content | The answers |
The attention mechanism computes how well each key matches the query, then uses these matches to weight the values.
General Attention Formula
Given:
- Query: $q \in \mathbb{R}^{d_q}$
- Keys: $K = \{k_1, \ldots, k_n\}$, $k_i \in \mathbb{R}^{d_k}$
- Values: $V = \{v_1, \ldots, v_n\}$, $v_i \in \mathbb{R}^{d_v}$

Attention computes:

$$\text{Attention}(q, K, V) = \sum_{i=1}^{n} \alpha_i v_i$$

Where the attention weights are:

$$\alpha_i = \frac{\exp(\text{score}(q, k_i))}{\sum_{j=1}^{n} \exp(\text{score}(q, k_j))}$$

The score function measures compatibility between query and key. The softmax ensures $\alpha_i \in (0, 1)$ and $\sum_i \alpha_i = 1$.
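The score-then-softmax-then-average pipeline can be written generically, with the score function passed in as a parameter. This NumPy sketch uses random queries, keys, and values and a dot-product score purely for illustration; the dimensions are arbitrary choices, not the model's.

```python
import numpy as np

def attention(q, K, V, score):
    """General attention: alpha_i = softmax(score(q, k_i)), output = sum_i alpha_i v_i."""
    e = np.array([score(q, k) for k in K])   # compatibility scores
    alpha = np.exp(e - e.max())              # numerically stable softmax
    alpha /= alpha.sum()                     # weights sum to 1
    return alpha @ V, alpha

rng = np.random.default_rng(1)
q = rng.standard_normal(8)                   # query
K = rng.standard_normal((5, 8))              # 5 keys
V = rng.standard_normal((5, 16))             # 5 values
out, alpha = attention(q, K, V, score=lambda q, k: q @ k)  # dot-product score
assert out.shape == (16,) and np.isclose(alpha.sum(), 1.0)
```

Swapping the `score` argument is all it takes to move between the attention variants described next.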
Types of Attention Mechanisms
Different score functions define different attention mechanisms. The three main types are additive, dot-product, and scaled dot-product.
1. Additive Attention (Bahdanau)
Introduced by Bahdanau et al. (2014) for neural machine translation:

$$\text{score}(q, k) = v^\top \tanh(W_q q + W_k k)$$

Components:
- $W_q \in \mathbb{R}^{d_a \times d_q}$: Projects the query to the attention dimension
- $W_k \in \mathbb{R}^{d_a \times d_k}$: Projects keys to the attention dimension
- $v \in \mathbb{R}^{d_a}$: Learned vector that produces the scalar score
- $\tanh$: Non-linearity that captures complex query-key interactions
Advantages: More expressive due to non-linearity; works well when query and key dimensions differ.
Disadvantages: Requires additional parameters; slower due to MLP computation.
2. Dot-Product Attention (Luong)
Simpler alternative using direct dot products:

$$\text{score}(q, k) = q^\top k$$
Advantages: No additional parameters; highly efficient using matrix multiplication.
Disadvantages: Requires $d_q = d_k$; can have numerical issues for large dimensions.
3. Scaled Dot-Product Attention
Used in Transformers (Vaswani et al., 2017) to address the scaling issue:

$$\text{score}(q, k) = \frac{q^\top k}{\sqrt{d_k}}$$

Why Scaling?
If the components of $q$ and $k$ are independent with zero mean and unit variance, the dot product $q^\top k$ has variance $d_k$, so its typical magnitude grows like $\sqrt{d_k}$. Large scores push the softmax into saturated regions where gradients are vanishingly small. Dividing by $\sqrt{d_k}$ keeps the scores at roughly unit variance regardless of dimension.
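The effect of scaling is easy to see empirically: as $d_k$ grows, raw dot-product scores spread out and the softmax concentrates almost all weight on one key, while scaled scores keep the distribution soft. This NumPy sketch uses random vectors; the dimensions are illustrative, not tied to our model.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(e):
    z = np.exp(e - e.max())
    return z / z.sum()

for d_k in (4, 64, 1024):
    q = rng.standard_normal(d_k)
    K = rng.standard_normal((10, d_k))
    raw = softmax(K @ q)                    # scores have std ~ sqrt(d_k): softmax saturates
    scaled = softmax(K @ q / np.sqrt(d_k))  # scores rescaled to std ~ 1: stays soft
    print(f"d_k={d_k:5d}  max weight raw={raw.max():.3f}  scaled={scaled.max():.3f}")
```

At large $d_k$ the unscaled softmax typically puts nearly all its mass on a single key, which is exactly the saturation that starves gradients during training.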
Comparison Summary
| Mechanism | Score Function | Parameters | Complexity |
|---|---|---|---|
| Additive | v^T tanh(W_q q + W_k k) | 3 weight matrices | O(d_a(d_q + d_k)) |
| Dot-product | q^T k | None | O(d_k) |
| Scaled dot-product | q^T k / √d_k | None | O(d_k) |
Temporal Attention for Sequences
For RUL prediction, we apply attention over the temporal dimension to identify which timesteps are most relevant for the prediction.
Setup
After the BiLSTM, we have hidden states for each timestep:

$$h_1, h_2, \ldots, h_T, \quad h_t \in \mathbb{R}^{2H}$$

Where $T = 30$ (window size) and $2H = 128$ (BiLSTM output dimension).
Temporal Attention Formulation
We use a learned attention mechanism to compute importance weights:

$$u_t = \tanh(W_a h_t + b_a)$$
$$e_t = v_a^\top u_t$$
$$\alpha_t = \frac{\exp(e_t)}{\sum_{j=1}^{T} \exp(e_j)}$$
$$c = \sum_{t=1}^{T} \alpha_t h_t$$
Interpreting Each Component
| Symbol | Dimension | Role |
|---|---|---|
| W_a | d_a × 2H | Projects hidden state to attention space |
| b_a | d_a | Bias for attention projection |
| u_t | d_a | Attention representation of timestep t |
| v_a | d_a | Query vector (what to look for) |
| e_t | scalar | Compatibility score for timestep t |
| α_t | [0, 1] | Importance weight for timestep t |
| c | 2H | Weighted summary of all timesteps |
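The four equations above map directly onto a few lines of array code. This NumPy sketch uses random hidden states and randomly initialized (untrained) parameters, with $T = 30$, $2H = 128$, and an assumed $d_a = 64$, purely to verify the shapes at each step.

```python
import numpy as np

rng = np.random.default_rng(3)
T, H2, d_a = 30, 128, 64                  # window size, 2H, attention dim (assumed)
H = rng.standard_normal((T, H2))          # BiLSTM hidden states h_1 ... h_T
W_a = rng.standard_normal((d_a, H2)) * 0.1
b_a = np.zeros(d_a)
v_a = rng.standard_normal(d_a) * 0.1

U = np.tanh(H @ W_a.T + b_a)              # u_t = tanh(W_a h_t + b_a)   -> (T, d_a)
e = U @ v_a                               # e_t = v_a^T u_t             -> (T,)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                      # softmax over timesteps      -> (T,)
c = alpha @ H                             # c = sum_t alpha_t h_t       -> (2H,)

assert c.shape == (H2,) and np.isclose(alpha.sum(), 1.0)
```

With trained parameters, `alpha` would concentrate on the informative timesteps; here it simply demonstrates that the output dimensions match the table above.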
Attention Interpretation
The attention weights have a powerful interpretation: they tell us which timesteps the model considers most important for the prediction.
Interpretability Bonus
Attention weights provide post-hoc explanations. When predicting low RUL, we can inspect which sensor readings at which timesteps triggered the prediction—valuable for maintenance engineers trying to understand why the model predicts imminent failure.
Self-Attention Preview
While our model uses the temporal attention mechanism described above, it's worth understanding self-attention—the foundation of modern Transformer architectures.
From Attention to Self-Attention
In the attention we described, the query typically comes from a different source (e.g., a decoder hidden state querying encoder states). In self-attention, queries, keys, and values all come from the same sequence:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Where $X$ is the sequence of hidden states. Each position can attend to every other position, including itself.
Self-Attention Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Key Difference: All-to-All Attention
| Aspect | Temporal Attention | Self-Attention |
|---|---|---|
| Query source | Learned global query | Each position generates its own query |
| Output | Single context vector | Sequence of vectors |
| Computation | O(T) | O(T²) |
| Parallelization | Limited by RNN | Fully parallel |
Self-attention computes $T \times T$ attention scores, allowing every timestep to directly attend to every other timestep. This enables:
- Long-range dependencies: Timestep 1 can directly influence timestep 30 (no gradient path through intermediate states)
- Parallel computation: All attention scores computed simultaneously
- Rich representations: Each position gets a context-aware embedding
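The all-to-all pattern can be sketched in a few matrix operations. This NumPy illustration uses random projection matrices as stand-ins for learned weights; the dimensions ($T = 30$, model width 128, $d_k = 64$) are assumptions for the sketch, not our architecture's self-attention (which we do not use).

```python
import numpy as np

rng = np.random.default_rng(4)
T, d_model, d_k = 30, 128, 64             # assumed dimensions
X = rng.standard_normal((T, d_model))     # sequence of hidden states

# Learned projections (random stand-ins here)
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) * 0.05 for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # (T, d_k) each

scores = Q @ K.T / np.sqrt(d_k)           # (T, T): every position vs. every position
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)         # row-wise softmax
out = A @ V                               # (T, d_k): one context-aware vector per position

assert out.shape == (T, d_k)
assert np.allclose(A.sum(axis=1), 1.0)
```

Note the output is a full sequence of vectors, one per position, unlike the single context vector our temporal attention produces.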
Why We Use Simpler Attention
For our 30-timestep windows, the $O(T^2)$ cost of self-attention would be manageable. However, our temporal attention approach produces a single context vector that feeds directly into the prediction heads, which is simpler and works well for our regression and classification tasks. Self-attention would be more beneficial for tasks requiring per-timestep outputs.
Attention in Our Architecture
Let's trace how attention integrates into the AMNL model pipeline.
Data Flow
- CNN Output: shape $(30, 64)$ (30 timesteps, 64 features)
- BiLSTM Output: shape $(30, 128)$ (forward + backward hidden states concatenated)
- Attention: shape $(128,)$ (weighted combination of all timesteps)
- Prediction Heads: RUL regression and health state classification
Attention Dimension Design
Common choices for the attention hidden dimension $d_a$:
| Choice | d_a Value | Trade-off |
|---|---|---|
| Small | 16-32 | Fewer parameters, limited capacity |
| Match hidden | 64 | Balanced choice, common default |
| Match BiLSTM | 128 | Maximum capacity, more parameters |
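The parameter cost of each choice is small and easy to compute: the layer holds $W_a$ ($d_a \times 2H$), $b_a$ ($d_a$), and $v_a$ ($d_a$). A quick sketch, assuming $2H = 128$ as in our pipeline:

```python
# Parameter count for the temporal attention layer: W_a (d_a x 2H) + b_a (d_a) + v_a (d_a)
H2 = 128                                  # BiLSTM output dimension (2H)
for d_a in (32, 64, 128):
    n_params = d_a * H2 + d_a + d_a       # W_a + b_a + v_a
    print(f"d_a={d_a:3d}: {n_params} parameters")
```

Even the largest option adds well under 20k parameters, so the choice of $d_a$ is driven more by capacity than by model size.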
Complete Attention Layer
The attention computation can be written as:
```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Maps BiLSTM output H (batch, T, 2*hidden_dim) to a context vector (batch, 2*hidden_dim)."""

    def __init__(self, input_dim: int, attention_dim: int):
        super().__init__()
        self.W_a = nn.Linear(input_dim, attention_dim)      # projection W_a with bias b_a
        self.v_a = nn.Linear(attention_dim, 1, bias=False)  # query vector v_a

    def forward(self, H):
        # Step 1: Project to attention space: u_t = tanh(W_a h_t + b_a)
        U = torch.tanh(self.W_a(H))          # (batch, T, attention_dim)
        # Step 2: Compute alignment scores: e_t = v_a^T u_t
        e = self.v_a(U).squeeze(-1)          # (batch, T)
        # Step 3: Softmax over timesteps to get weights alpha_t
        alpha = torch.softmax(e, dim=1)      # (batch, T)
        # Step 4: Weighted sum: c = sum_t alpha_t h_t
        context = torch.sum(alpha.unsqueeze(-1) * H, dim=1)  # (batch, 2*hidden_dim)
        return context, alpha
```

Why Attention Helps RUL Prediction
For RUL prediction specifically, attention provides three key benefits:
- Selective focus: Weights identify which parts of the degradation history matter most
- Noise robustness: The model can downweight timesteps with anomalous sensor readings
- Interpretability: Attention weights explain which time periods influenced the prediction
Attention Weight Visualization
In practice, attention weights can be plotted as a heatmap over the 30-timestep window. This visualization helps maintenance engineers understand model decisions and builds trust in the predictions.
Summary
In this section, we explored attention mechanisms as a solution to the sequence modeling bottleneck:
- Bottleneck problem: Fixed-size hidden states must compress entire sequences, losing information
- Attention solution: Compute weighted averages over all hidden states: $c = \sum_{t=1}^{T} \alpha_t h_t$
- Query-Key-Value: Framework where queries search keys to retrieve values
- Score functions: Additive (learned), dot-product (efficient), scaled dot-product (stable)
- Temporal attention: Applies attention over time dimension for sequence summarization
- Self-attention: Each position attends to all positions (foundation of Transformers)
| Component | Formula | Purpose |
|---|---|---|
| Attention hidden | uₜ = tanh(Wₐhₜ + bₐ) | Non-linear projection |
| Alignment score | eₜ = vₐᵀuₜ | Query-key compatibility |
| Attention weight | αₜ = exp(eₜ) / Σexp(eⱼ) | Normalized importance |
| Context vector | c = Σ αₜhₜ | Weighted summary |
Looking Ahead: We have now covered the individual components of our model: CNN for local feature extraction, BiLSTM for temporal dependencies, and attention for focused summarization. But our AMNL model has one more innovation—it performs multi-task learning, predicting both continuous RUL and discrete health states simultaneously. In the next section, we'll derive the multi-task learning framework and understand how combining tasks improves overall performance.
With attention understood, we are ready to explore how multi-task learning enables our model to leverage complementary prediction targets.