Learning Objectives
By the end of this section, you will:
- Understand the gating mechanism as the key innovation that enables long-range learning
- Derive all three LSTM gates plus the candidate state, and understand each component's role in controlling information flow
- Trace the cell state as the highway for gradient propagation
- Prove how LSTMs prevent vanishing gradients through additive updates
- Compare GRU formulations and understand when to choose each architecture
- Understand bidirectional processing and why we use it for RUL prediction
Why This Matters: Our AMNL model uses a BiLSTM (bidirectional LSTM) as its core sequence processor. Understanding how gates control information flow explains why our model can capture both short-term spikes and long-term degradation trends—essential for accurate RUL prediction across all operating conditions.
The Gating Intuition
Before diving into equations, let's build intuition. The vanilla RNN had a fundamental problem: the same transformation was applied at every timestep, with no ability to control what information to keep or discard.
The Gate Metaphor
Think of a gate as a learned, differentiable switch that outputs values between 0 and 1:
- Gate = 0: Block everything (the gate is closed)
- Gate = 1: Pass everything (the gate is open)
- Gate = 0.7: Allow 70% through (partially open)
Mathematically, if $g \in [0, 1]^n$ is a gate vector and $v$ is some signal, then the gated signal is:

$$g \odot v$$

Where $\odot$ is element-wise (Hadamard) multiplication. Each dimension can be gated independently!
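To make the metaphor concrete, here is a minimal NumPy sketch of element-wise gating (the values are illustrative, not from any trained model):

```python
import numpy as np

# A gate is just an element-wise multiplier in [0, 1].
gate = np.array([0.0, 1.0, 0.7])     # closed, open, partially open
signal = np.array([10.0, 10.0, 10.0])

gated = gate * signal                # element-wise (Hadamard) product
print(gated)                         # ≈ [ 0. 10.  7.]
```

Each dimension is gated on its own: the first component is blocked entirely, the second passes unchanged, the third is attenuated to 70%.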
Why Sigmoid Creates Gates
Gates use the sigmoid activation because:
- Output range: Always in $(0, 1)$
- Smooth and differentiable: Allows gradient-based learning
- Saturation: For extreme inputs, output approaches 0 or 1
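These three properties are easy to verify numerically; a quick sketch using only the standard library:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Extreme inputs saturate toward 0 or 1; zero gives a half-open gate.
for x in (-10.0, 0.0, 10.0):
    print(x, round(sigmoid(x), 4))
```

For large negative inputs the gate is effectively closed, for large positive inputs effectively open, and the function is smooth everywhere in between, so gradients can flow through it during training.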
Gates are Learned
Gate values are not fixed—they are computed from the current input and hidden state. The network learns when to open or close each gate based on context!
LSTM Architecture
The Long Short-Term Memory (LSTM) cell, introduced by Hochreiter & Schmidhuber (1997), addresses vanilla RNN limitations with two key innovations:
Innovation 1: The Cell State
LSTMs introduce a separate cell state $C_t$ that runs through the entire sequence with only linear interactions:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

This is an additive update (not multiplicative like vanilla RNNs). Information can flow unchanged for many timesteps—the gradient highway!
Innovation 2: Three Gates
LSTMs use three gates to control information flow:
| Gate | Symbol | Function | Intuition |
|---|---|---|---|
| Forget | fₜ | Which old info to discard | Clear irrelevant memory |
| Input | iₜ | Which new info to store | Write new memories |
| Output | oₜ | What to output from cell | Control visibility |
Information Flow Overview
At each timestep, the LSTM:
- Forget: Decide what to remove from cell state
- Store: Create new candidate information
- Update: Combine old (filtered) and new information
- Output: Produce hidden state from cell state
LSTM Equations Derived
Let's derive each LSTM component systematically. Our AMNL model uses hidden dimension $H = 64$.
Notation
| Symbol | Dimension | Meaning |
|---|---|---|
| xₜ | D = 64 | Input at time t (from CNN features) |
| hₜ₋₁ | H = 64 | Previous hidden state |
| Cₜ₋₁ | H = 64 | Previous cell state |
| W_f, W_i, W_c, W_o | (D + H) × H | Weight matrices |
| b_f, b_i, b_c, b_o | H | Bias vectors |
Step 1: Concatenate Input
All gates share the same input vector:

$$[h_{t-1}; x_t] \in \mathbb{R}^{D + H}$$

Where $[\cdot\,;\cdot]$ denotes concatenation. This combines history (hidden state) with new input.
Step 2: Forget Gate

$$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$$

Dimension: $f_t \in (0, 1)^H$

Intuition: For each dimension of the cell state, how much should we retain from the previous timestep?
Step 3: Input Gate and Candidate State
The input gate controls how much new information to add:

$$i_t = \sigma(W_i [h_{t-1}; x_t] + b_i)$$

The candidate cell state is what new information to add:

$$\tilde{C}_t = \tanh(W_c [h_{t-1}; x_t] + b_c)$$

Key insight: Separating "how much" (gate) from "what" (candidate) allows independent control. The candidate uses tanh to produce values in $(-1, 1)$—it can push the cell state in either direction.
Step 4: Cell State Update
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

This is the critical equation! Breaking it down:
- $f_t \odot C_{t-1}$: Selectively forget old cell state
- $i_t \odot \tilde{C}_t$: Selectively add new information
- Addition ($+$): Not multiplication! Gradients flow directly.
The Additive Update
The additive structure is fundamentally different from the vanilla RNN's multiplicative update. This single change is what enables long-range learning.
Step 5: Output Gate and Hidden State
$$o_t = \sigma(W_o [h_{t-1}; x_t] + b_o)$$

$$h_t = o_t \odot \tanh(C_t)$$

Intuition: The cell state contains all stored information, but we may not want to expose all of it. The output gate filters what becomes visible in the hidden state.
Complete LSTM Equations
For reference, here are all equations together:

$$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}; x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_c [h_{t-1}; x_t] + b_c)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$o_t = \sigma(W_o [h_{t-1}; x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$
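The complete set of equations translates almost line-for-line into code. Below is a minimal NumPy sketch of a single LSTM timestep using the dimensions from the notation table ($D = H = 64$); the random weights are placeholders, not the trained AMNL parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 64                       # input and hidden sizes from the notation table

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One (D + H) x H weight matrix and one H bias per component:
# forget (f), input (i), candidate (c), output (o).
W = {k: rng.normal(0, 0.1, (D + H, H)) for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
b["f"] += 1.0                       # bias the forget gate open (see note below)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}; x_t]
    f = sigmoid(z @ W["f"] + b["f"])           # forget gate
    i = sigmoid(z @ W["i"] + b["i"])           # input gate
    C_tilde = np.tanh(z @ W["c"] + b["c"])     # candidate cell state
    C = f * C_prev + i * C_tilde               # additive cell update
    o = sigmoid(z @ W["o"] + b["o"])           # output gate
    h = o * np.tanh(C)                         # hidden state
    return h, C

h, C = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H))
print(h.shape, C.shape)                        # (64,) (64,)
```

Note how the cell update is the only line with a `+` between two gated terms—that single addition is the gradient highway discussed next.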
How LSTM Solves Vanishing Gradients
Now we can prove why LSTMs don't suffer from vanishing gradients. Consider the gradient of the loss with respect to an earlier cell state.
Cell State Gradient
From the cell update equation $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$, treating the gates as constant with respect to $C_{t-1}$:

$$\frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t)$$

The gradient from time $T$ back to time $t$ along the cell state path:

$$\frac{\partial C_T}{\partial C_t} \approx \prod_{i=t+1}^{T} \operatorname{diag}(f_i)$$
Comparison with Vanilla RNN
| Model | Gradient Product | Typical Behavior |
|---|---|---|
| Vanilla RNN | ∏ diag(1 - h²ᵢ) · W_hh | Exponential decay (γ < 1) |
| LSTM (cell) | ∏ fᵢ | Controlled by learned gates |
Why This Prevents Vanishing
The Constant Error Carousel
When $f_t = 1$ and $i_t = 0$:

$$C_t = C_{t-1}$$
The cell state is copied exactly! This "constant error carousel" allows information and gradients to persist indefinitely—solving the core problem of vanilla RNNs.
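A small numerical sketch makes the contrast vivid (the per-step factors 0.9 and 0.999 are illustrative choices, not measured values):

```python
import numpy as np

T = 50  # number of timesteps the gradient must traverse

# Vanilla RNN path: repeated multiplication by factors below 1 decays
# exponentially with sequence length.
vanilla = np.prod(np.full(T, 0.9))

# LSTM cell path: the product of forget gates, which the network can
# learn to hold near 1 (the "constant error carousel").
lstm = np.prod(np.full(T, 0.999))

print(f"vanilla: {vanilla:.2e}, lstm cell path: {lstm:.3f}")
```

After 50 steps the vanilla-style product has shrunk below 1% of its starting magnitude, while the near-open forget gates preserve roughly 95% of the gradient signal.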
Forget Gate Bias
In practice, forget gate biases are often initialized to positive values (e.g., $b_f = 1$) so gates start near 1. This encourages the model to preserve information by default.
GRU: A Streamlined Alternative
The Gated Recurrent Unit (GRU), introduced by Cho et al. (2014), simplifies the LSTM by merging gates and eliminating the separate cell state.
GRU Design Philosophy
GRU asks: "Can we get LSTM-like benefits with fewer parameters?" The answer is yes, through clever gate merging:
| LSTM | GRU | Simplification |
|---|---|---|
| Forget + Input gates | Update gate (z) | Complementary: z and (1-z) |
| Cell state + Hidden state | Only hidden state | Merged into one |
| 4 weight matrices | 3 weight matrices | 25% fewer parameters |
GRU Equations
Reset Gate: How much of the past to forget when computing the candidate:

$$r_t = \sigma(W_r [h_{t-1}; x_t] + b_r)$$

Update Gate: How much of the new candidate to use versus keeping old state:

$$z_t = \sigma(W_z [h_{t-1}; x_t] + b_z)$$

Candidate Hidden State: Proposed new state (reset gate filters history):

$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}; x_t] + b_h)$$

Hidden State Update: Interpolation between old and new:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
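For comparison with the LSTM step, here is the matching NumPy sketch of one GRU timestep (random placeholder weights, same $D = H = 64$ dimensions as before):

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 64, 64

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Three weight/bias sets: reset (r), update (z), candidate (h).
W = {k: rng.normal(0, 0.1, (D + H, H)) for k in "rzh"}
b = {k: np.zeros(H) for k in "rzh"}

def gru_step(x_t, h_prev):
    z_in = np.concatenate([h_prev, x_t])                   # [h_{t-1}; x_t]
    r = sigmoid(z_in @ W["r"] + b["r"])                    # reset gate
    z = sigmoid(z_in @ W["z"] + b["z"])                    # update gate
    # Reset gate filters the history before the candidate sees it.
    h_tilde = np.tanh(np.concatenate([r * h_prev, x_t]) @ W["h"] + b["h"])
    return (1 - z) * h_prev + z * h_tilde                  # complementary mix

h = gru_step(rng.normal(size=D), np.zeros(H))
print(h.shape)                                             # (64,)
```

There is no separate cell state: the final line interpolates directly between the old hidden state and the new candidate.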
Key Insight: Complementary Gating
The GRU update uses $z_t$ and $(1 - z_t)$ as complementary weights:
- $z_t = 0$: Keep old state, ignore new input
- $z_t = 1$: Replace with new candidate
- $z_t = 0.5$: Equal mix of old and new

This is analogous to LSTM's forget and input gates, but with a constraint: the "forget" weight $(1 - z_t)$ and "input" weight $z_t$ sum to 1 (implicitly).
LSTM vs GRU Comparison
Parameter Count
Each gate or candidate requires a $(D + H) \times H$ weight matrix and an $H$-dimensional bias, so:

$$\text{LSTM: } 4 \times \big((D + H) H + H\big) \qquad \text{GRU: } 3 \times \big((D + H) H + H\big)$$

With $D = H = 64$, that is 33,024 parameters for the LSTM versus 24,768 for the GRU.
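The counts follow directly from the weight shapes in the notation table; a short sanity check:

```python
D, H = 64, 64  # dimensions from the notation table

# Each gate/candidate has a (D + H) x H weight matrix plus an H-dim bias.
per_component = (D + H) * H + H

lstm_params = 4 * per_component   # forget, input, candidate, output
gru_params = 3 * per_component    # reset, update, candidate

print(lstm_params, gru_params, 1 - gru_params / lstm_params)
```

The last value confirms the "25% fewer parameters" figure from the comparison table.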
Empirical Comparison
| Criterion | LSTM | GRU | Winner |
|---|---|---|---|
| Parameters | More (4 weight sets) | Fewer (3 weight sets) | GRU |
| Training speed | Slower | Faster | GRU |
| Long sequences | Better retention | Good retention | LSTM |
| Small datasets | May overfit | Better generalization | GRU |
| Language modeling | Standard choice | Competitive | LSTM |
| Speech recognition | Comparable | Often preferred | GRU |
Why We Use LSTM for RUL
For our AMNL model, we chose LSTM because:
- Long-range dependencies: Degradation patterns may span the entire 30-timestep window
- Separate cell state: Better for preserving subtle trend information across timesteps
- Dataset size: C-MAPSS has sufficient data to train LSTM without overfitting
- Proven performance: LSTM is well-established for industrial time series
Design Choice: For RUL prediction, the additional capacity of LSTM's separate cell state outweighs GRU's training efficiency. When long-term memory matters most, LSTM is the safer choice.
Bidirectional Processing
Standard RNNs/LSTMs process sequences left-to-right, computing each hidden state from past context only. But what about future context?
The Bidirectional Idea
A Bidirectional LSTM (BiLSTM) runs two LSTMs:
- Forward LSTM: Processes left-to-right
- Backward LSTM: Processes right-to-left
BiLSTM Equations
Forward pass:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h}_{t-1})$$

Backward pass:

$$\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h}_{t+1})$$

Combined output:

$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$$

The output at each timestep concatenates forward and backward hidden states, giving dimension $2H = 128$ in our model.
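The mechanics reduce to "run twice, flip once, concatenate." The sketch below uses a toy stand-in cell purely to show the data flow (a real BiLSTM uses two full LSTM cells with separate parameters per direction); dimensions match the text ($T = 30$, $H = 64$):

```python
import numpy as np

rng = np.random.default_rng(2)
T, D, H = 30, 64, 64   # window length and dimensions used in the text

def run_direction(xs, step):
    """Run a recurrent cell over a sequence, collecting every hidden state."""
    h, C = np.zeros(H), np.zeros(H)
    out = []
    for x in xs:
        h, C = step(x, h, C)
        out.append(h)
    return np.stack(out)             # shape (T, H)

def toy_step(x, h, C):
    """Stand-in cell for illustration; a real LSTM step would go here."""
    return np.tanh(0.1 * x[:H] + 0.5 * h), C

xs = rng.normal(size=(T, D))
fwd = run_direction(xs, toy_step)                 # left-to-right
bwd = run_direction(xs[::-1], toy_step)[::-1]     # right-to-left, re-aligned
out = np.concatenate([fwd, bwd], axis=1)          # [h_fwd; h_bwd] per timestep
print(out.shape)                                  # (30, 128)
```

Reversing the backward outputs re-aligns them so that row $t$ of `out` holds both directions' states for timestep $t$, which is exactly the $2H$-dimensional representation the downstream layers consume.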
Why BiLSTM for RUL?
Consider a sensor reading at timestep 15 in a 30-step window:
- Forward context: What happened before (cycles 1-14)
- Backward context: What happens after (cycles 16-30)
For RUL prediction, both matter! A temperature spike at cycle 15 is interpreted differently if:
- It returns to normal (backward context shows recovery)
- It continues rising (backward context shows degradation)
Parallel Computation
An important practical note: forward and backward passes are independent and can run in parallel! This makes BiLSTM only ~1.5× slower than unidirectional LSTM, not 2×.
When NOT to Use BiLSTM
BiLSTM requires seeing the entire sequence before producing output. For real-time prediction where you must output immediately as data arrives, use unidirectional LSTM. For our RUL application, we process complete windows, so BiLSTM is appropriate.
Summary
In this section, we derived the mathematics of LSTM and GRU architectures:
- Gates are learned sigmoid functions that control information flow with values in $(0, 1)$
- LSTM has four components: forget gate ($f_t$), input gate ($i_t$), candidate ($\tilde{C}_t$), output gate ($o_t$)
- Cell state update uses addition: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
- Gradient flow through cell state is $\prod_i f_i$—controlled by learned gates, not fixed weights
- GRU simplifies to two gates (reset and update) with complementary gating: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
- BiLSTM captures both past and future context by running forward and backward LSTMs
| Component | Formula | Purpose |
|---|---|---|
| Forget gate | fₜ = σ(W_f[hₜ₋₁; xₜ] + b_f) | Control memory erasure |
| Input gate | iₜ = σ(W_i[hₜ₋₁; xₜ] + b_i) | Control new information |
| Candidate | C̃ₜ = tanh(W_c[hₜ₋₁; xₜ] + b_c) | Propose new content |
| Cell update | Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ | Additive memory update |
| Output gate | oₜ = σ(W_o[hₜ₋₁; xₜ] + b_o) | Control visibility |
| Hidden state | hₜ = oₜ ⊙ tanh(Cₜ) | Output representation |
Looking Ahead: LSTMs process sequences step-by-step, compressing all history into a fixed-size vector. But what if we want to directly access any past timestep? The attention mechanism solves this by computing a weighted combination of all hidden states, allowing the model to focus on the most relevant parts of the sequence. This is especially powerful for RUL prediction where specific degradation events matter more than others.
With LSTM and BiLSTM understood, we are ready to explore how attention mechanisms further enhance sequence modeling.