Chapter 2

LSTM and GRU Formulations

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Understand the gating mechanism as the key innovation that enables long-range learning
  2. Derive all four LSTM components (three gates plus the candidate state) and understand each one's role in controlling information flow
  3. Trace the cell state as the highway for gradient propagation
  4. Prove how LSTMs prevent vanishing gradients through additive updates
  5. Compare GRU formulations and understand when to choose each architecture
  6. Understand bidirectional processing and why we use it for RUL prediction
Why This Matters: Our AMNL model uses a BiLSTM (bidirectional LSTM) as its core sequence processor. Understanding how gates control information flow explains why our model can capture both short-term spikes and long-term degradation trends—essential for accurate RUL prediction across all operating conditions.

The Gating Intuition

Before diving into equations, let's build intuition. The vanilla RNN had a fundamental problem: the same transformation was applied at every timestep, with no ability to control what information to keep or discard.

The Gate Metaphor

Think of a gate as a learned, differentiable switch that outputs values between 0 and 1:

  • Gate = 0: Block everything (the gate is closed)
  • Gate = 1: Pass everything (the gate is open)
  • Gate = 0.7: Allow 70% through (partially open)

Mathematically, if $\mathbf{g} \in [0, 1]^d$ is a gate vector and $\mathbf{x} \in \mathbb{R}^d$ is some signal, then:

$$\mathbf{g} \odot \mathbf{x} = [g_1 x_1, g_2 x_2, \ldots, g_d x_d]$$

Where $\odot$ denotes element-wise multiplication. Each dimension can be gated independently!

Why Sigmoid Creates Gates

Gates use the sigmoid activation $\sigma(z) = \frac{1}{1 + e^{-z}}$ because:

  1. Output range: Always in $(0, 1)$
  2. Smooth and differentiable: Allows gradient-based learning
  3. Saturation: For extreme inputs, output approaches 0 or 1

Gates are Learned

Gate values are not fixed—they are computed from the current input and hidden state. The network learns when to open or close each gate based on context!
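The gating idea can be sketched in a few lines of NumPy. This is a toy illustration with made-up logit values, not learned parameters; it only shows how a sigmoid turns pre-activations into per-dimension pass-through fractions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A gate is a vector of values in (0, 1), applied elementwise:
# each dimension of the signal is scaled independently.
x = np.array([2.0, -1.0, 0.5, 3.0])                # signal to be gated
gate_logits = np.array([10.0, -10.0, 0.0, 0.847])  # toy pre-activations
g = sigmoid(gate_logits)                           # gate values in (0, 1)

gated = g * x  # elementwise product g ⊙ x

# Saturated logits give near-binary gates: ≈1 passes, ≈0 blocks.
print(np.round(g, 3))  # gate values ≈ [1.0, 0.0, 0.5, 0.7]
print(gated)
```

Note how the third dimension passes exactly half its signal while the first passes essentially everything: a single gate vector makes an independent keep/discard decision per dimension.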


LSTM Architecture

The Long Short-Term Memory (LSTM) cell, introduced by Hochreiter & Schmidhuber (1997), addresses vanilla RNN limitations with two key innovations:

Innovation 1: The Cell State

LSTMs introduce a separate cell state $\mathbf{C}_t$ that runs through the entire sequence with only linear interactions:

$$\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$$

This is an additive update (not multiplicative like vanilla RNNs). Information can flow unchanged for many timesteps—the gradient highway!

Innovation 2: Three Gates

LSTMs use three gates to control information flow:

| Gate | Symbol | Function | Intuition |
|---|---|---|---|
| Forget | fₜ | Which old info to discard | Clear irrelevant memory |
| Input | iₜ | Which new info to store | Write new memories |
| Output | oₜ | What to output from cell | Control visibility |

Information Flow Overview

At each timestep, the LSTM:

  1. Forget: Decide what to remove from cell state
  2. Store: Create new candidate information
  3. Update: Combine old (filtered) and new information
  4. Output: Produce hidden state from cell state

LSTM Equations Derived

Let's derive each LSTM component systematically. Our AMNL model uses hidden dimension $H = 64$.

Notation

| Symbol | Dimension | Meaning |
|---|---|---|
| xₜ | D = 64 | Input at time t (from CNN features) |
| hₜ₋₁ | H = 64 | Previous hidden state |
| Cₜ₋₁ | H = 64 | Previous cell state |
| W_f, W_i, W_c, W_o | H × (D + H) | Weight matrices |
| b_f, b_i, b_c, b_o | H | Bias vectors |

Step 1: Concatenate Input

All gates share the same input vector:

$$\mathbf{z}_t = [\mathbf{h}_{t-1}; \mathbf{x}_t] \in \mathbb{R}^{D+H}$$

Where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation. This combines history (hidden state) with new input.

Step 2: Forget Gate

$$\mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{z}_t + \mathbf{b}_f)$$

Dimension: $\mathbf{f}_t \in (0, 1)^H$

Intuition: For each dimension of the cell state, how much should we retain from the previous timestep?

Step 3: Input Gate and Candidate State

The input gate controls how much new information to add:

$$\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{z}_t + \mathbf{b}_i)$$

The candidate cell state is what new information to add:

$$\tilde{\mathbf{C}}_t = \tanh(\mathbf{W}_c \mathbf{z}_t + \mathbf{b}_c)$$

Key insight: Separating "how much" (gate) from "what" (candidate) allows independent control. The candidate uses tanh to produce values in $(-1, 1)$, so it can push the cell state in either direction.

Step 4: Cell State Update

$$\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$$

This is the critical equation! Breaking it down:

  • $\mathbf{f}_t \odot \mathbf{C}_{t-1}$: Selectively forget old cell state
  • $\mathbf{i}_t \odot \tilde{\mathbf{C}}_t$: Selectively add new information
  • Addition (+): Not multiplication! Gradients flow directly.

The Additive Update

The additive structure $\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \ldots$ is fundamentally different from the vanilla RNN's multiplicative update. This single change is what enables long-range learning.

Step 5: Output Gate and Hidden State

$$\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{z}_t + \mathbf{b}_o)$$

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{C}_t)$$

Intuition: The cell state $\mathbf{C}_t$ contains all stored information, but we may not want to expose all of it. The output gate filters what becomes visible in the hidden state.

Complete LSTM Equations

For reference, here are all equations together:

$$\begin{aligned}
\mathbf{f}_t &= \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_f) && \text{(Forget gate)} \\
\mathbf{i}_t &= \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_i) && \text{(Input gate)} \\
\tilde{\mathbf{C}}_t &= \tanh(\mathbf{W}_c [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_c) && \text{(Candidate)} \\
\mathbf{C}_t &= \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t && \text{(Cell update)} \\
\mathbf{o}_t &= \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_o) && \text{(Output gate)} \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{C}_t) && \text{(Hidden state)}
\end{aligned}$$
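As a concreteness check, the six equations can be implemented directly. This is a minimal NumPy sketch with randomly initialized toy weights (not the AMNL model's trained parameters), using the dimensions from the text, $D = H = 64$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 64  # input and hidden dimensions from the text

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One H x (D + H) weight matrix and one H-dim bias per component.
params = {k: (rng.normal(0, 0.1, (H, D + H)), np.zeros(H))
          for k in ("f", "i", "c", "o")}

def lstm_step(x_t, h_prev, C_prev):
    """One LSTM timestep following the six equations above."""
    z = np.concatenate([h_prev, x_t])                       # [h_{t-1}; x_t]
    f = sigmoid(params["f"][0] @ z + params["f"][1])        # forget gate
    i = sigmoid(params["i"][0] @ z + params["i"][1])        # input gate
    C_tilde = np.tanh(params["c"][0] @ z + params["c"][1])  # candidate
    C = f * C_prev + i * C_tilde                            # additive cell update
    o = sigmoid(params["o"][0] @ z + params["o"][1])        # output gate
    h = o * np.tanh(C)                                      # hidden state
    return h, C

# Run a short sequence of random inputs and check shapes.
h, C = np.zeros(H), np.zeros(H)
for t in range(5):
    h, C = lstm_step(rng.normal(size=D), h, C)
print(h.shape, C.shape)  # (64,) (64,)
```

Note that the hidden state is bounded ($|h| < 1$, since it is a gated tanh), while the cell state is not: the cell state is free to accumulate.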

How LSTM Solves Vanishing Gradients

Now we can prove why LSTMs don't suffer from vanishing gradients. Consider the gradient of the loss with respect to an earlier cell state.

Cell State Gradient

From the cell update equation $\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$, treating the gates as constants and following only the direct cell-state path:

$$\frac{\partial \mathbf{C}_t}{\partial \mathbf{C}_{t-1}} = \text{diag}(\mathbf{f}_t)$$

The gradient from time $t$ to time $k$ along the cell state path:

$$\frac{\partial \mathbf{C}_t}{\partial \mathbf{C}_k} = \prod_{i=k+1}^{t} \text{diag}(\mathbf{f}_i) = \text{diag}\!\left(\prod_{i=k+1}^{t} \mathbf{f}_i\right)$$
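A quick numerical illustration of this product, using toy scalar values rather than learned gates:

```python
T = 100  # number of timesteps the gradient must traverse

# Vanilla RNN: each step multiplies by a factor typically below 1
# (tanh derivative times recurrent weight), so the product decays.
vanilla_factor = 0.9
vanilla_grad = vanilla_factor ** T

# LSTM cell path: each step multiplies by the forget gate value.
# With forget gates learned close to 1, the signal survives.
forget_gate = 0.999
lstm_grad = forget_gate ** T

print(f"vanilla: {vanilla_grad:.2e}")  # 2.66e-05 — vanished
print(f"lstm:    {lstm_grad:.2e}")     # 9.05e-01 — intact
```

The difference is not that the LSTM's factors are magically equal to 1, but that they are *learned*: the network can set them near 1 exactly on the dimensions where long-range memory matters.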

Comparison with Vanilla RNN

| Model | Gradient Product | Typical Behavior |
|---|---|---|
| Vanilla RNN | ∏ diag(1 − h²ᵢ) · W_hh | Exponential decay (γ < 1) |
| LSTM (cell) | ∏ fᵢ | Controlled by learned gates |

Why This Prevents Vanishing

When $\mathbf{f}_t = \mathbf{1}$ and $\mathbf{i}_t = \mathbf{0}$:

$$\mathbf{C}_t = \mathbf{C}_{t-1}$$

The cell state is copied exactly! This "constant error carousel" allows information and gradients to persist indefinitely, solving the core problem of vanilla RNNs.

Forget Gate Bias

In practice, forget gate biases are often initialized to positive values (e.g., $\mathbf{b}_f = \mathbf{1}$) so gates start near 1 ($\sigma(1) \approx 0.73$ for zero-centered pre-activations). This encourages the model to preserve information by default.
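The effect of this initialization trick is easy to verify numerically (a sketch, not any particular framework's init routine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H = 64
b_f = np.ones(H)  # forget-gate bias initialized to 1, per the trick above

# At initialization the weight contribution is roughly zero-mean, so the
# gate starts near sigmoid(b_f): about 73% of the cell state is retained
# per step by default, instead of 50% with a zero bias.
print(round(float(sigmoid(1.0)), 3))  # 0.731
print(round(float(sigmoid(0.0)), 3))  # 0.5
```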


GRU: A Streamlined Alternative

The Gated Recurrent Unit (GRU), introduced by Cho et al. (2014), simplifies the LSTM by merging gates and eliminating the separate cell state.

GRU Design Philosophy

GRU asks: "Can we get LSTM-like benefits with fewer parameters?" The answer is yes, through clever gate merging:

| LSTM | GRU | Simplification |
|---|---|---|
| Forget + Input gates | Update gate (z) | Complementary: z and (1 − z) |
| Cell state + Hidden state | Only hidden state | Merged into one |
| 4 weight matrices | 3 weight matrices | 25% fewer parameters |

GRU Equations

Reset Gate: How much of the past to forget when computing the candidate:

$$\mathbf{r}_t = \sigma(\mathbf{W}_r [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_r)$$

Update Gate: How much of the new candidate to use versus keeping old state:

$$\mathbf{z}_t = \sigma(\mathbf{W}_z [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_z)$$

Candidate Hidden State: Proposed new state (reset gate filters history):

$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h [\mathbf{r}_t \odot \mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_h)$$

Hidden State Update: Interpolation between old and new:

$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$

Key Insight: Complementary Gating

The GRU update uses $(1 - \mathbf{z}_t)$ and $\mathbf{z}_t$ as complementary weights:

  • $\mathbf{z}_t \approx 0$: Keep old state, ignore new input
  • $\mathbf{z}_t \approx 1$: Replace with new candidate
  • $\mathbf{z}_t = 0.5$: Equal mix of old and new

This is analogous to LSTM's forget and input gates, but with the implicit constraint that they sum to one ($\mathbf{f}_t + \mathbf{i}_t = \mathbf{1}$).
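Putting the four GRU equations together as a NumPy sketch, with random toy weights and the same dimensions as the LSTM derivation ($D = H = 64$):

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 64, 64

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three weight matrices (reset, update, candidate), each H x (D + H).
Wr, Wz, Wh = (rng.normal(0, 0.1, (H, D + H)) for _ in range(3))
br, bz, bh = np.zeros(H), np.zeros(H), np.zeros(H)

def gru_step(x_t, h_prev):
    """One GRU timestep following the equations above."""
    z_in = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ z_in + br)  # reset gate
    z = sigmoid(Wz @ z_in + bz)  # update gate
    # Reset gate filters history before the candidate is computed.
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    # Complementary interpolation between old state and candidate.
    return (1 - z) * h_prev + z * h_tilde

h = np.zeros(H)
for t in range(5):
    h = gru_step(rng.normal(size=D), h)
print(h.shape)  # (64,)
```

Because the update is a convex combination of the previous state and a tanh candidate, the GRU hidden state stays bounded in $(-1, 1)$: there is no unbounded accumulator like the LSTM's cell state.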


LSTM vs GRU Comparison

Parameter Count
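The per-cell counts follow directly from the gate equations: each component needs one $H \times (D + H)$ weight matrix plus an $H$-dimensional bias, with four such sets for LSTM and three for GRU. A back-of-envelope check with this section's $D = H = 64$ (ignoring framework details such as separate input/hidden biases):

```python
D, H = 64, 64  # dimensions used throughout this section

per_component = H * (D + H) + H  # one weight matrix plus one bias

lstm_params = 4 * per_component  # forget, input, candidate, output
gru_params = 3 * per_component   # reset, update, candidate

print(lstm_params)  # 33024
print(gru_params)   # 24768
print(f"GRU saves {1 - gru_params / lstm_params:.0%}")  # GRU saves 25%
```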

Empirical Comparison

| Criterion | LSTM | GRU | Winner |
|---|---|---|---|
| Parameters | More (4 weight matrices) | Fewer (3 weight matrices) | GRU |
| Training speed | Slower | Faster | GRU |
| Long sequences | Better retention | Good retention | LSTM |
| Small datasets | May overfit | Better generalization | GRU |
| Language modeling | Standard choice | Competitive | LSTM |
| Speech recognition | Comparable | Often preferred | GRU |

Why We Use LSTM for RUL

For our AMNL model, we chose LSTM because:

  1. Long-range dependencies: Degradation patterns may span the entire 30-timestep window
  2. Separate cell state: Better for preserving subtle trend information across timesteps
  3. Dataset size: C-MAPSS has sufficient data to train LSTM without overfitting
  4. Proven performance: LSTM is well-established for industrial time series
Design Choice: For RUL prediction, the additional capacity of LSTM's separate cell state outweighs GRU's training efficiency. When long-term memory matters most, LSTM is the safer choice.

Bidirectional Processing

Standard RNNs/LSTMs process sequences left-to-right, computing $\overrightarrow{\mathbf{h}}_t$ from past context only. But what about future context?

The Bidirectional Idea

A Bidirectional LSTM (BiLSTM) runs two LSTMs:

  1. Forward LSTM: Processes $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$ left-to-right
  2. Backward LSTM: Processes $\mathbf{x}_T, \mathbf{x}_{T-1}, \ldots, \mathbf{x}_1$ right-to-left

BiLSTM Equations

Forward pass:

$$\overrightarrow{\mathbf{h}}_t = \text{LSTM}_{\text{forward}}(\mathbf{x}_t, \overrightarrow{\mathbf{h}}_{t-1})$$

Backward pass:

$$\overleftarrow{\mathbf{h}}_t = \text{LSTM}_{\text{backward}}(\mathbf{x}_t, \overleftarrow{\mathbf{h}}_{t+1})$$

Combined output:

$$\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t] \in \mathbb{R}^{2H}$$

The output at each timestep concatenates the forward and backward hidden states, giving dimension $2H = 128$ in our model.
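A shape-level sketch of the bidirectional wiring, using a simplified tanh recurrence in place of a full LSTM cell and sharing one set of toy weights across directions (a real BiLSTM uses separate parameters per direction); the point is the reversal and concatenation:

```python
import numpy as np

rng = np.random.default_rng(2)
T, D, H = 30, 64, 64  # window length and dimensions from the text

# Simplified recurrent step standing in for a full LSTM cell.
W_x = rng.normal(0, 0.1, (H, D))
W_h = rng.normal(0, 0.1, (H, H))

def step(x_t, h_prev):
    return np.tanh(W_x @ x_t + W_h @ h_prev)

def run(xs):
    """Run the recurrence over a sequence, returning all hidden states."""
    h, out = np.zeros(H), []
    for x_t in xs:
        h = step(x_t, h)
        out.append(h)
    return np.stack(out)  # shape (T, H)

x = rng.normal(size=(T, D))
h_fwd = run(x)              # forward pass over x_1 .. x_T
h_bwd = run(x[::-1])[::-1]  # backward pass, re-aligned to time order
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)
print(h_bi.shape)  # (30, 128)
```

At timestep 15, `h_bi[15, :64]` summarizes cycles 1-15 and `h_bi[15, 64:]` summarizes cycles 16-30, which is exactly the two-sided context the next subsection argues for.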

Why BiLSTM for RUL?

Consider a sensor reading at timestep 15 in a 30-step window:

  • Forward context: What happened before (cycles 1-14)
  • Backward context: What happens after (cycles 16-30)

For RUL prediction, both matter! A temperature spike at cycle 15 is interpreted differently if:

  • It returns to normal (backward context shows recovery)
  • It continues rising (backward context shows degradation)

Parallel Computation

An important practical note: forward and backward passes are independent and can run in parallel! This makes BiLSTM only ~1.5× slower than unidirectional LSTM, not 2×.

When NOT to Use BiLSTM

BiLSTM requires seeing the entire sequence before producing output. For real-time prediction where you must output immediately as data arrives, use unidirectional LSTM. For our RUL application, we process complete windows, so BiLSTM is appropriate.


Summary

In this section, we derived the mathematics of LSTM and GRU architectures:

  1. Gates are learned sigmoid functions that control information flow with values in $(0, 1)$
  2. LSTM has four components: forget gate ($\mathbf{f}_t$), input gate ($\mathbf{i}_t$), candidate ($\tilde{\mathbf{C}}_t$), output gate ($\mathbf{o}_t$)
  3. Cell state update uses addition: $\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$
  4. Gradient flow through the cell state is $\prod_i \mathbf{f}_i$, controlled by learned gates rather than a fixed weight matrix
  5. GRU simplifies to two gates and three weight matrices, with complementary gating: $\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$
  6. BiLSTM captures both past and future context by running forward and backward LSTMs
| Component | Formula | Purpose |
|---|---|---|
| Forget gate | fₜ = σ(W_f[hₜ₋₁; xₜ] + b_f) | Control memory erasure |
| Input gate | iₜ = σ(W_i[hₜ₋₁; xₜ] + b_i) | Control new information |
| Cell update | Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ | Additive memory update |
| Output gate | oₜ = σ(W_o[hₜ₋₁; xₜ] + b_o) | Control visibility |
| Hidden state | hₜ = oₜ ⊙ tanh(Cₜ) | Output representation |
Looking Ahead: LSTMs process sequences step-by-step, compressing all history into a fixed-size vector. But what if we want to directly access any past timestep? The attention mechanism solves this by computing a weighted combination of all hidden states, allowing the model to focus on the most relevant parts of the sequence. This is especially powerful for RUL prediction where specific degradation events matter more than others.

With LSTM and BiLSTM understood, we are ready to explore how attention mechanisms further enhance sequence modeling.