Chapter 6

LSTM Cell Mathematics

Bidirectional LSTM Encoder

Learning Objectives

By the end of this section, you will:

  1. Understand the LSTM cell architecture with its four main components
  2. Master the forget gate that controls memory erasure
  3. Explain the input gate that regulates new information
  4. Describe the cell state update mechanism
  5. Understand the output gate that filters the hidden state
  6. Appreciate why LSTMs avoid vanishing gradients

Why This Matters: The LSTM cell is an elegant solution to the vanishing gradient problem that plagues vanilla RNNs. Understanding its internal mechanics reveals why LSTMs can capture long-range dependencies in degradation trajectories—essential for accurate RUL prediction.

LSTM Cell Overview

The LSTM (Long Short-Term Memory) cell maintains two state vectors that are updated at each timestep:

  • Cell state C_t \in \mathbb{R}^H: The "memory" that flows through time with minimal transformation
  • Hidden state h_t \in \mathbb{R}^H: The "output" exposed to other layers

The Four Gates

Three gates control information flow, plus a candidate cell state:

Gate             Symbol   Purpose
Forget gate      fₜ       What to erase from cell state
Input gate       iₜ       How much new information to add
Candidate cell   C̃ₜ       What new information to add
Output gate      oₜ       What to expose as hidden state

High-Level Flow

Inputs: x_t (current input), h_{t-1} (previous hidden), C_{t-1} (previous cell)

  ┌─────────────────────────────────────────────────────────────┐
  │                                                             │
  │   ┌────────┐     ┌────────┐     ┌─────────┐    ┌────────┐   │
  │   │ Forget │     │ Input  │     │Candidate│    │ Output │   │
  │   │  Gate  │     │  Gate  │     │  Cell   │    │  Gate  │   │
  │   └───┬────┘     └───┬────┘     └───┬─────┘    └───┬────┘   │
  │       │              │              │              │        │
  │       ↓              ↓              ↓              ↓        │
  │   ┌───────────────────────────────────────────────────┐     │
  │   │              Cell State Update                    │     │
  │   │         C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t           │     │
  │   └───────────────────────────────────────────────────┘     │
  │                           │                                 │
  │                           ↓                                 │
  │   ┌───────────────────────────────────────────────────┐     │
  │   │              Hidden State Update                  │     │
  │   │              h_t = o_t ⊙ tanh(C_t)                │     │
  │   └───────────────────────────────────────────────────┘     │
  │                                                             │
  └─────────────────────────────────────────────────────────────┘

Outputs: h_t (current hidden), C_t (current cell)
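The flow above can be sketched directly in NumPy. The following is a minimal illustrative implementation of a single timestep, not the chapter's production code; the toy dimensions (D = 3 input features, H = 4 hidden units) and random weights are assumptions for demonstration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM timestep following the equations in this section.
    W maps gate name -> (H, H + D) weight matrix; b maps gate name -> (H,) bias."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])     # candidate cell state
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    C_t = f_t * C_prev + i_t * C_tilde         # cell state update
    h_t = o_t * np.tanh(C_t)                   # hidden state update
    return h_t, C_t

# Toy dimensions for illustration: D = 3, H = 4
rng = np.random.default_rng(0)
D, H = 3, 4
W = {g: rng.standard_normal((H, H + D)) * 0.1 for g in "fiCo"}
b = {g: np.zeros(H) for g in "fiCo"}
h_t, C_t = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```

Note that because h_t = o_t ⊙ tanh(C_t), every component of the hidden state is strictly inside (-1, 1), while the cell state itself is not bounded.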

Forget Gate

The forget gate decides what information to discard from the cell state.

Mathematical Formulation

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

Where:

  • f_t \in (0, 1)^H: Forget gate activation (one value per cell dimension)
  • W_f \in \mathbb{R}^{H \times (H + D)}: Weight matrix, where H is the hidden size and D the input dimension
  • [h_{t-1}, x_t]: Concatenation of previous hidden state and current input
  • b_f \in \mathbb{R}^H: Bias vector
  • \sigma: Sigmoid function, outputting values between 0 and 1

Interpretation

  • f_t \approx 1: Keep the memory (don't forget)
  • f_t \approx 0: Erase the memory (forget)
  • Values between 0 and 1: Partial forgetting
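A quick numeric check of this behavior; the pre-activation values of ±4 are illustrative assumptions, not learned quantities:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

keep   = sigmoid(4.0)   # large positive pre-activation -> f_t near 1 (retain memory)
forget = sigmoid(-4.0)  # large negative pre-activation -> f_t near 0 (erase memory)
```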

Input Gate

The input gate controls how much new information enters the cell state.

Two Components

1. Input gate activation: How much to write

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)

2. Candidate cell state: What to write

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)

Where:

  • i_t \in (0, 1)^H: Input gate activation
  • \tilde{C}_t \in (-1, 1)^H: Candidate values to add to cell state
  • \tanh: Hyperbolic tangent, outputs between -1 and 1

Interpretation

  • i_t controls the "volume" of new information
  • \tilde{C}_t represents the new information itself
  • Together: i_t \odot \tilde{C}_t is the filtered new information

Why tanh for Candidate?

The tanh activation outputs values in (-1, 1), allowing both positive and negative updates to the cell state. This enables the cell to increase or decrease stored values, not just accumulate.
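A small NumPy example makes this concrete; the gate values of exactly 1 are an illustrative assumption used to isolate the candidate's effect:

```python
import numpy as np

C_prev  = np.array([0.8, -0.5])   # previous cell state
C_tilde = np.array([-0.9, 0.6])   # tanh output: negative values are possible
f_t = np.ones(2)                  # gates fully open (illustrative only)
i_t = np.ones(2)

C_t = f_t * C_prev + i_t * C_tilde
# dimension 0 decreases (0.8 -> -0.1), dimension 1 increases (-0.5 -> 0.1)
```

A sigmoid candidate could only add non-negative values, so stored values could never be pushed back down.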


Cell State Update

The cell state is the core memory of the LSTM, updated by combining forget and input operations.

Update Equation

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t

Where \odot denotes element-wise multiplication.

Decomposition

Term           Meaning               Effect
fₜ ⊙ C_{t-1}   Selective retention   Keep relevant old information
iₜ ⊙ C̃ₜ        Selective writing     Add relevant new information
Sum            Updated cell state    Blended old + new memory
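The decomposition can be traced with a two-dimensional toy example (all values below are illustrative assumptions):

```python
import numpy as np

f_t     = np.array([0.9, 0.1])   # keep dimension 0, mostly erase dimension 1
C_prev  = np.array([2.0, 2.0])   # previous cell state
i_t     = np.array([0.2, 0.8])   # write little to dim 0, a lot to dim 1
C_tilde = np.array([1.0, -1.0])  # candidate values

retained = f_t * C_prev          # selective retention: [1.8, 0.2]
written  = i_t * C_tilde         # selective writing:   [0.2, -0.8]
C_t = retained + written         # blended memory:      [2.0, -0.6]
```

Each dimension is gated independently: the first keeps almost all of its old value, while the second is largely overwritten by new (negative) information.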

Output Gate

The output gate controls what parts of the cell state are exposed as the hidden state.

Mathematical Formulation

1. Output gate activation:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)

2. Hidden state computation:

h_t = o_t \odot \tanh(C_t)

Where:

  • o_t \in (0, 1)^H: Output gate activation
  • \tanh(C_t): Cell state squashed to (-1, 1)
  • h_t \in (-1, 1)^H: Hidden state (filtered cell)

Interpretation

The output gate allows the cell to "think" (maintain memory) without immediately "speaking" (exposing to other layers):

  • o_t \approx 1: Fully expose the cell state
  • o_t \approx 0: Hide the cell state (memory is private)

Why LSTMs Avoid Vanishing Gradients

The cell state pathway is the key to LSTM's ability to learn long-range dependencies.

The Cell State Highway

Consider how gradients flow backward through the cell state update:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t

The gradient of the loss with respect to C_{t-1}, taken along the cell-state path (ignoring the gates' own dependence on earlier states):

\frac{\partial L}{\partial C_{t-1}} = \frac{\partial L}{\partial C_t} \odot f_t

Comparison with Vanilla RNN

Property              Vanilla RNN                      LSTM
Gradient path         h_t = tanh(W h_{t-1} + U x_t)    C_t = f_t ⊙ C_{t-1} + ...
Gradient factor       W × tanh'(·)                     fₜ (learned, can stay ≈ 1)
After 100 steps       (0.9)¹⁰⁰ ≈ 0.00003               (0.99)¹⁰⁰ ≈ 0.37
Long-range learning   Very difficult                   Feasible

The Forget Gate Trick

When the forget gate is close to 1, gradients flow almost unchanged through time:

\frac{\partial L}{\partial C_1} = \frac{\partial L}{\partial C_T} \odot \prod_{t=2}^{T} f_t

If f_t \approx 0.99 for all timesteps, then even over T = 30 steps:

\prod_{t=2}^{30} 0.99 \approx 0.74

The gradient is attenuated by only 26%, not 99.9% as in vanilla RNNs!
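Both figures are easy to verify numerically (the 0.9 per-step factor for the vanilla RNN is the illustrative value from the comparison above):

```python
# Gradient attenuation from repeated multiplication of per-step factors.
lstm_attenuation = 0.99 ** 29   # product over t = 2..30 with f_t = 0.99
rnn_attenuation  = 0.9 ** 100   # vanilla RNN factor compounded over 100 steps
```

The LSTM window retains roughly three quarters of the gradient magnitude, while the vanilla RNN signal is effectively zero.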

The LSTM Memory Highway

The cell state acts as a "highway" for gradient flow. The forget gate controls the "toll"—when fₜ ≈ 1, information (and gradients) pass through freely. This is why LSTMs can learn dependencies spanning hundreds of timesteps.


Summary

In this section, we detailed the LSTM cell mathematics:

  1. Forget gate: f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
  2. Input gate: i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
  3. Candidate cell: \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
  4. Cell update: C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
  5. Output gate: o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
  6. Hidden state: h_t = o_t \odot \tanh(C_t)

Gate/State        Activation           Range
Forget gate fₜ    Sigmoid              (0, 1)
Input gate iₜ     Sigmoid              (0, 1)
Candidate C̃ₜ      Tanh                 (-1, 1)
Output gate oₜ    Sigmoid              (0, 1)
Cell state Cₜ     Linear combination   Unbounded
Hidden state hₜ   Gated tanh           (-1, 1)
Looking Ahead: A single LSTM layer captures temporal dependencies, but stacking multiple layers creates a hierarchy of temporal abstractions. The next section describes our two-layer BiLSTM design and how the layers interact.

With the LSTM cell understood, we now examine the two-layer BiLSTM architecture.