Learning Objectives
By the end of this section, you will:
- Understand the LSTM cell architecture with its four main components
- Master the forget gate that controls memory erasure
- Explain the input gate that regulates new information
- Describe the cell state update mechanism
- Understand the output gate that filters the hidden state
- Appreciate why LSTMs avoid vanishing gradients
Why This Matters: The LSTM cell is an elegant solution to the vanishing gradient problem that plagues vanilla RNNs. Understanding its internal mechanics reveals why LSTMs can capture long-range dependencies in degradation trajectories—essential for accurate RUL prediction.
LSTM Cell Overview
The LSTM (Long Short-Term Memory) cell maintains two state vectors that are updated at each timestep:
- Cell state Cₜ: The "memory" that flows through time with minimal transformation
- Hidden state hₜ: The "output" exposed to other layers
The Four Gates
Three gates control information flow, plus a candidate cell state:
| Gate | Symbol | Purpose |
|---|---|---|
| Forget gate | fₜ | What to erase from cell state |
| Input gate | iₜ | How much new information to add |
| Candidate cell | C̃ₜ | What new information to add |
| Output gate | oₜ | What to expose as hidden state |
High-Level Flow
```
Inputs: x_t (current input), h_{t-1} (previous hidden), C_{t-1} (previous cell)

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  ┌────────┐   ┌────────┐   ┌─────────┐   ┌────────┐         │
│  │ Forget │   │ Input  │   │Candidate│   │ Output │         │
│  │  Gate  │   │  Gate  │   │  Cell   │   │  Gate  │         │
│  └───┬────┘   └───┬────┘   └───┬─────┘   └───┬────┘         │
│      │            │            │             │              │
│      ↓            ↓            ↓             ↓              │
│  ┌───────────────────────────────────────────────────┐      │
│  │              Cell State Update                    │      │
│  │         C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t           │      │
│  └───────────────────────────────────────────────────┘      │
│                        │                                    │
│                        ↓                                    │
│  ┌───────────────────────────────────────────────────┐      │
│  │              Hidden State Update                  │      │
│  │            h_t = o_t ⊙ tanh(C_t)                  │      │
│  └───────────────────────────────────────────────────┘      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Outputs: h_t (current hidden), C_t (current cell)
```
Forget Gate
The forget gate decides what information to discard from the cell state.
Mathematical Formulation
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Where:
- f_t: Forget gate activation (one value per cell dimension)
- W_f: Weight matrix of the forget gate
- [h_{t-1}, x_t]: Concatenation of previous hidden state and current input
- b_f: Bias vector
- σ: Sigmoid function, outputting values between 0 and 1
Interpretation
- f_t ≈ 1: Keep the memory (don't forget)
- f_t ≈ 0: Erase the memory (forget)
- Values between 0 and 1: Partial forgetting
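As a minimal NumPy sketch of the forget-gate computation (the layer sizes, random weights, and zero bias here are illustrative assumptions, not values from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden, inputs = 4, 3                      # toy dimensions (assumed)
W_f = rng.normal(size=(hidden, hidden + inputs))  # forget-gate weights
b_f = np.zeros(hidden)                     # forget-gate bias

h_prev = rng.normal(size=hidden)           # previous hidden state
x_t = rng.normal(size=inputs)              # current input

# f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f): one value per cell dimension
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)  # every entry lies strictly in (0, 1)
```

Because the sigmoid saturates at 0 and 1, each dimension of the cell state gets its own learned "erase vs. keep" dial.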
Input Gate
The input gate controls how much new information enters the cell state.
Two Components
1. Input gate activation (how much to write): i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
2. Candidate cell state (what to write): C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
Where:
- i_t: Input gate activation
- C̃_t: Candidate values to add to the cell state
- tanh: Hyperbolic tangent, outputting values between -1 and 1
Interpretation
- i_t controls the "volume" of new information
- C̃_t represents the new information itself
- Together: i_t ⊙ C̃_t is the filtered new information
Why tanh for Candidate?
The tanh activation outputs values in (-1, 1), allowing both positive and negative updates to the cell state. This enables the cell to increase or decrease stored values, not just accumulate.
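A two-element toy example (values chosen purely for illustration) makes this concrete: because C̃_t comes from tanh, the gated update i_t ⊙ C̃_t can be negative and therefore shrink a stored value:

```python
import numpy as np

i_t     = np.array([0.9, 0.5])   # input gate: how much to write
c_tilde = np.array([-0.8, 0.6])  # tanh candidate: may be negative

# The filtered new information added to the cell state
update = i_t * c_tilde
print(update)  # first entry is negative: it decreases the stored value
```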
Cell State Update
The cell state is the core memory of the LSTM, updated by combining forget and input operations.
Update Equation
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Where ⊙ denotes element-wise multiplication.
Decomposition
| Term | Meaning | Effect |
|---|---|---|
| fₜ ⊙ C_{t-1} | Selective retention | Keep relevant old information |
| iₜ ⊙ C̃ₜ | Selective writing | Add relevant new information |
| Sum | Updated cell state | Blended old + new memory |
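The decomposition in the table can be checked with a small NumPy example (all values illustrative):

```python
import numpy as np

C_prev  = np.array([ 2.0, -1.0, 0.5 ])  # old memory
f_t     = np.array([ 1.0,  0.0, 0.5 ])  # keep, erase, half-forget
i_t     = np.array([ 0.0,  1.0, 0.5 ])  # write nothing, write fully, half-write
C_tilde = np.array([ 0.9, -0.7, 0.4 ])  # candidate values

# C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t (⊙ = element-wise product)
C_t = f_t * C_prev + i_t * C_tilde
print(C_t)  # [2.0, -0.7, 0.45]: kept, overwritten, blended
```

The first dimension keeps its old value, the second is fully replaced by new information, and the third blends old and new memory.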
Output Gate
The output gate controls what parts of the cell state are exposed as the hidden state.
Mathematical Formulation
1. Output gate activation: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
2. Hidden state computation: h_t = o_t ⊙ tanh(C_t)
Where:
- o_t: Output gate activation
- tanh(C_t): Cell state squashed to (-1, 1)
- h_t: Hidden state (filtered cell state)
Interpretation
The output gate allows the cell to "think" (maintain memory) without immediately "speaking" (exposing to other layers):
- o_t ≈ 1: Fully expose the cell state
- o_t ≈ 0: Hide the cell state (memory is private)
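A small NumPy illustration of this gating (values are illustrative):

```python
import numpy as np

C_t = np.array([3.0, -2.0, 0.1])  # cell state: any real values
o_t = np.array([1.0,  0.0, 0.5])  # expose, hide, partially expose

# h_t = o_t ⊙ tanh(C_t): tanh squashes the cell to (-1, 1) before gating
h_t = o_t * np.tanh(C_t)
print(h_t)  # second entry is exactly 0: that memory stays private
```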
Why LSTMs Avoid Vanishing Gradients
The cell state pathway is the key to LSTM's ability to learn long-range dependencies.
The Cell State Highway
Consider how gradients flow backward through the cell state update C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t. Along the direct path, the Jacobian is simply the forget gate:
∂C_t / ∂C_{t-1} = diag(f_t)
The gradient of the loss with respect to C_{t-1} is therefore:
∂L/∂C_{t-1} = ∂L/∂C_t ⊙ f_t + (indirect terms through the gates)
Because this path involves no repeated multiplication by a weight matrix or a tanh derivative, the gradient is scaled only by the learned forget gate values.
Comparison with Vanilla RNN
| Property | Vanilla RNN | LSTM |
|---|---|---|
| Gradient path | h_t = tanh(Wh_{t-1} + Ux_t) | C_t = f_t ⊙ C_{t-1} + ... |
| Gradient factor | W × tanh'(·) | fₜ (learned, ~1) |
| After 100 steps | (0.9)¹⁰⁰ ≈ 0.00003 | (0.99)¹⁰⁰ ≈ 0.37 |
| Long-range learning | Very difficult | Feasible |
The Forget Gate Trick
When the forget gate is close to 1, gradients flow almost unchanged through time:
If f_t ≈ 0.99 for all timesteps, then even over T = 30 steps:
∏ f_t ≈ (0.99)³⁰ ≈ 0.74
The gradient is attenuated by only about 26%, not by 99.9% as in vanilla RNNs!
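The arithmetic in this comparison is easy to verify directly:

```python
# Repeated gradient factors over many timesteps
vanilla_100 = 0.9 ** 100   # vanilla-RNN-style factor over 100 steps
lstm_30     = 0.99 ** 30   # forget gate held near 1 over 30 steps

print(f"{vanilla_100:.1e}")  # ~2.7e-05: the gradient has essentially vanished
print(f"{lstm_30:.2f}")      # ~0.74: attenuated by only ~26%
```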
The LSTM Memory Highway
The cell state acts as a "highway" for gradient flow. The forget gate controls the "toll"—when fₜ ≈ 1, information (and gradients) pass through freely. This is why LSTMs can learn dependencies spanning hundreds of timesteps.
Summary
In this section, we detailed the LSTM cell mathematics:
- Forget gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
- Input gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
- Candidate cell: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
- Cell update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
- Output gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
- Hidden state: h_t = o_t ⊙ tanh(C_t)
| Gate/State | Activation | Range |
|---|---|---|
| Forget gate fₜ | Sigmoid | (0, 1) |
| Input gate iₜ | Sigmoid | (0, 1) |
| Candidate C̃ₜ | Tanh | (-1, 1) |
| Output gate oₜ | Sigmoid | (0, 1) |
| Cell state Cₜ | Linear combination | ℝ |
| Hidden state hₜ | Gated tanh | (-1, 1) |
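Putting the six equations together, a single LSTM timestep can be sketched in NumPy as follows. The stacked weight layout, toy sizes, and random initialization are implementation conveniences assumed here, not part of the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM timestep. W stacks W_f, W_i, W_C, W_o row-wise; b likewise."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:H])              # forget gate: what to erase
    i_t = sigmoid(z[H:2*H])            # input gate: how much to write
    C_tilde = np.tanh(z[2*H:3*H])      # candidate cell: what to write
    o_t = sigmoid(z[3*H:4*H])          # output gate: what to expose
    C_t = f_t * C_prev + i_t * C_tilde # cell state update
    h_t = o_t * np.tanh(C_t)           # hidden state update
    return h_t, C_t

rng = np.random.default_rng(42)
H, D = 4, 3                            # toy hidden/input sizes (assumed)
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)

h, C = np.zeros(H), np.zeros(H)        # initial states
for t in range(5):                     # run a short random sequence
    h, C = lstm_step(rng.normal(size=D), h, C, W, b)
print(h.shape, C.shape)
```

Stacking the four weight matrices into one `(4H, H + D)` array mirrors how production frameworks compute all gates with a single matrix multiply; slicing then recovers f_t, i_t, C̃_t, and o_t.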
Looking Ahead: A single LSTM layer captures temporal dependencies, but stacking multiple layers creates a hierarchy of temporal abstractions. The next section describes our two-layer BiLSTM design and how the layers interact.
With the LSTM cell understood, we now examine the two-layer BiLSTM architecture.