Chapter 6

LSTM Cell Mathematics

Bidirectional LSTM Encoder

Learning Objectives

By the end of this section, you will:

  1. Understand the LSTM cell architecture with its four main components
  2. Master the forget gate that controls memory erasure
  3. Explain the input gate that regulates new information
  4. Describe the cell state update mechanism
  5. Understand the output gate that filters the hidden state
  6. Appreciate why LSTMs avoid vanishing gradients

Why This Matters: The LSTM cell is an elegant solution to the vanishing gradient problem that plagues vanilla RNNs. Understanding its internal mechanics reveals why LSTMs can capture long-range dependencies in degradation trajectories—essential for accurate RUL prediction.

LSTM Cell Overview

The LSTM (Long Short-Term Memory) cell maintains two state vectors that are updated at each timestep:

  • Cell state C_t \in \mathbb{R}^H: The "memory" that flows through time with minimal transformation
  • Hidden state h_t \in \mathbb{R}^H: The "output" exposed to other layers

The Four Gates

Three gates control information flow, plus a candidate cell state:

Gate             Symbol   Purpose
Forget gate      fₜ       What to erase from cell state
Input gate       iₜ       How much new information to add
Candidate cell   C̃ₜ       What new information to add
Output gate      oₜ       What to expose as hidden state

High-Level Flow

Inputs: x_t (current input), h_{t-1} (previous hidden), C_{t-1} (previous cell)

  ┌─────────────────────────────────────────────────────────────┐
  │                                                             │
  │   ┌────────┐     ┌────────┐     ┌─────────┐    ┌────────┐   │
  │   │ Forget │     │ Input  │     │Candidate│    │ Output │   │
  │   │  Gate  │     │  Gate  │     │  Cell   │    │  Gate  │   │
  │   └───┬────┘     └───┬────┘     └───┬─────┘    └───┬────┘   │
  │       │              │              │              │        │
  │       ↓              ↓              ↓              ↓        │
  │   ┌───────────────────────────────────────────────────┐     │
  │   │              Cell State Update                    │     │
  │   │         C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t           │     │
  │   └───────────────────────────────────────────────────┘     │
  │                           │                                 │
  │                           ↓                                 │
  │   ┌───────────────────────────────────────────────────┐     │
  │   │              Hidden State Update                  │     │
  │   │              h_t = o_t ⊙ tanh(C_t)                │     │
  │   └───────────────────────────────────────────────────┘     │
  │                                                             │
  └─────────────────────────────────────────────────────────────┘

Outputs: h_t (current hidden), C_t (current cell)
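The flow above can be sketched directly in NumPy. The following is a minimal illustrative implementation of a single timestep, not the chapter's production code; the toy dimensions (D = 3 input features, H = 4 hidden units) and random weights are assumptions for demonstration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM timestep following the equations in this section.
    W maps gate name -> (H, H + D) weight matrix; b maps gate name -> (H,) bias."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])     # candidate cell state
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    C_t = f_t * C_prev + i_t * C_tilde         # cell state update
    h_t = o_t * np.tanh(C_t)                   # hidden state update
    return h_t, C_t

# Toy dimensions for illustration: D = 3, H = 4
rng = np.random.default_rng(0)
D, H = 3, 4
W = {g: rng.standard_normal((H, H + D)) * 0.1 for g in "fiCo"}
b = {g: np.zeros(H) for g in "fiCo"}
h_t, C_t = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```

Note that because h_t = o_t ⊙ tanh(C_t), every component of the hidden state is strictly inside (-1, 1), while the cell state itself is not bounded.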

Forget Gate

The forget gate decides what information to discard from the cell state.

Mathematical Formulation

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

Where:

  • f_t \in (0, 1)^H: Forget gate activation (one value per cell dimension)
  • W_f \in \mathbb{R}^{H \times (H + D)}: Weight matrix, where H is the hidden size and D the input dimension
  • [h_{t-1}, x_t]: Concatenation of previous hidden state and current input
  • b_f \in \mathbb{R}^H: Bias vector
  • \sigma: Sigmoid function, outputting values between 0 and 1

Interpretation

  • f_t \approx 1: Keep the memory (don't forget)
  • f_t \approx 0: Erase the memory (forget)
  • Values between 0 and 1: Partial forgetting
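A quick numeric check of this behavior; the pre-activation values of ±4 are illustrative assumptions, not learned quantities:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

keep   = sigmoid(4.0)   # large positive pre-activation -> f_t near 1 (retain memory)
forget = sigmoid(-4.0)  # large negative pre-activation -> f_t near 0 (erase memory)
```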

Input Gate

The input gate controls how much new information enters the cell state.

Two Components

1. Input gate activation: How much to write

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)

2. Candidate cell state: What to write

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)

Where:

  • i_t \in (0, 1)^H: Input gate activation
  • \tilde{C}_t \in (-1, 1)^H: Candidate values to add to cell state
  • \tanh: Hyperbolic tangent, outputs between -1 and 1

Interpretation

  • i_t controls the "volume" of new information
  • \tilde{C}_t represents the new information itself
  • Together: i_t \odot \tilde{C}_t is the filtered new information

Why tanh for Candidate?

The tanh activation outputs values in (-1, 1), allowing both positive and negative updates to the cell state. This enables the cell to increase or decrease stored values, not just accumulate.
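A small NumPy example makes this concrete; the gate values of exactly 1 are an illustrative assumption used to isolate the candidate's effect:

```python
import numpy as np

C_prev  = np.array([0.8, -0.5])   # previous cell state
C_tilde = np.array([-0.9, 0.6])   # tanh output: negative values are possible
f_t = np.ones(2)                  # gates fully open (illustrative only)
i_t = np.ones(2)

C_t = f_t * C_prev + i_t * C_tilde
# dimension 0 decreases (0.8 -> -0.1), dimension 1 increases (-0.5 -> 0.1)
```

A sigmoid candidate could only add non-negative values, so stored values could never be pushed back down.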


Cell State Update

The cell state is the core memory of the LSTM, updated by combining forget and input operations.

Update Equation

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t

Where \odot denotes element-wise multiplication.

Decomposition

Term           Meaning               Effect
fₜ ⊙ C_{t-1}   Selective retention   Keep relevant old information
iₜ ⊙ C̃ₜ        Selective writing     Add relevant new information
Sum            Updated cell state    Blended old + new memory
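The decomposition can be traced with a two-dimensional toy example (all values below are illustrative assumptions):

```python
import numpy as np

f_t     = np.array([0.9, 0.1])   # keep dimension 0, mostly erase dimension 1
C_prev  = np.array([2.0, 2.0])   # previous cell state
i_t     = np.array([0.2, 0.8])   # write little to dim 0, a lot to dim 1
C_tilde = np.array([1.0, -1.0])  # candidate values

retained = f_t * C_prev          # selective retention: [1.8, 0.2]
written  = i_t * C_tilde         # selective writing:   [0.2, -0.8]
C_t = retained + written         # blended memory:      [2.0, -0.6]
```

Each dimension is gated independently: the first keeps almost all of its old value, while the second is largely overwritten by new (negative) information.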

Output Gate

The output gate controls what parts of the cell state are exposed as the hidden state.

Mathematical Formulation

1. Output gate activation:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)

2. Hidden state computation:

h_t = o_t \odot \tanh(C_t)

Where:

  • o_t \in (0, 1)^H: Output gate activation
  • \tanh(C_t): Cell state squashed to (-1, 1)
  • h_t \in (-1, 1)^H: Hidden state (filtered cell)

Interpretation

The output gate allows the cell to "think" (maintain memory) without immediately "speaking" (exposing to other layers):

  • o_t \approx 1: Fully expose the cell state
  • o_t \approx 0: Hide the cell state (memory is private)

Why LSTMs Avoid Vanishing Gradients

The cell state pathway is the key to LSTM's ability to learn long-range dependencies.

The Cell State Highway

Consider how gradients flow backward through the cell state update:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t

The gradient of the loss with respect to C_{t-1}, taken along the cell-state path (ignoring the gates' own dependence on earlier states):

\frac{\partial L}{\partial C_{t-1}} = \frac{\partial L}{\partial C_t} \odot f_t

Comparison with Vanilla RNN

Property              Vanilla RNN                      LSTM
Gradient path         h_t = tanh(W h_{t-1} + U x_t)    C_t = f_t ⊙ C_{t-1} + ...
Gradient factor       W × tanh'(·)                     fₜ (learned, can stay ≈ 1)
After 100 steps       (0.9)¹⁰⁰ ≈ 0.00003               (0.99)¹⁰⁰ ≈ 0.37
Long-range learning   Very difficult                   Feasible

The Forget Gate Trick

When the forget gate is close to 1, gradients flow almost unchanged through time:

\frac{\partial L}{\partial C_1} = \frac{\partial L}{\partial C_T} \odot \prod_{t=2}^{T} f_t

If f_t \approx 0.99 for all timesteps, then even over T = 30 steps:

\prod_{t=2}^{30} 0.99 \approx 0.74

The gradient is attenuated by only 26%, not 99.9% as in vanilla RNNs!
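Both figures are easy to verify numerically (the 0.9 per-step factor for the vanilla RNN is the illustrative value from the comparison above):

```python
# Gradient attenuation from repeated multiplication of per-step factors.
lstm_attenuation = 0.99 ** 29   # product over t = 2..30 with f_t = 0.99
rnn_attenuation  = 0.9 ** 100   # vanilla RNN factor compounded over 100 steps
```

The LSTM window retains roughly three quarters of the gradient magnitude, while the vanilla RNN signal is effectively zero.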

The LSTM Memory Highway

The cell state acts as a "highway" for gradient flow. The forget gate controls the "toll"—when fₜ ≈ 1, information (and gradients) pass through freely. This is why LSTMs can learn dependencies spanning hundreds of timesteps.


Summary

In this section, we detailed the LSTM cell mathematics:

  1. Forget gate: f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
  2. Input gate: i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
  3. Candidate cell: \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
  4. Cell update: C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
  5. Output gate: o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
  6. Hidden state: h_t = o_t \odot \tanh(C_t)

Gate/State        Activation           Range
Forget gate fₜ    Sigmoid              (0, 1)
Input gate iₜ     Sigmoid              (0, 1)
Candidate C̃ₜ      Tanh                 (-1, 1)
Output gate oₜ    Sigmoid              (0, 1)
Cell state Cₜ     Linear combination   Unbounded
Hidden state hₜ   Gated tanh           (-1, 1)
Looking Ahead: A single LSTM layer captures temporal dependencies, but stacking multiple layers creates a hierarchy of temporal abstractions. The next section describes our two-layer BiLSTM design and how the layers interact.

With the LSTM cell understood, we now examine the two-layer BiLSTM architecture.