Chapter 2

LSTM and GRU Formulations

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Understand the gating mechanism as the key innovation that enables long-range learning
  2. Derive all four LSTM components (three gates plus the candidate state) and understand each one's role in controlling information flow
  3. Trace the cell state as the highway for gradient propagation
  4. Prove how LSTMs prevent vanishing gradients through additive updates
  5. Compare GRU formulations and understand when to choose each architecture
  6. Understand bidirectional processing and why we use it for RUL prediction
Why This Matters: Our AMNL model uses a BiLSTM (bidirectional LSTM) as its core sequence processor. Understanding how gates control information flow explains why our model can capture both short-term spikes and long-term degradation trends—essential for accurate RUL prediction across all operating conditions.

The Gating Intuition

Before diving into equations, let's build intuition. The vanilla RNN had a fundamental problem: the same transformation was applied at every timestep, with no ability to control what information to keep or discard.

The Gate Metaphor

Think of a gate as a learned, differentiable switch that outputs values between 0 and 1:

  • Gate = 0: Block everything (the gate is closed)
  • Gate = 1: Pass everything (the gate is open)
  • Gate = 0.7: Allow 70% through (partially open)

Mathematically, if $\mathbf{g} \in [0, 1]^d$ is a gate vector and $\mathbf{x} \in \mathbb{R}^d$ is some signal, then:

$$\mathbf{g} \odot \mathbf{x} = [g_1 x_1, g_2 x_2, \ldots, g_d x_d]$$

Where $\odot$ denotes element-wise multiplication. Each dimension can be gated independently!

Why Sigmoid Creates Gates

Gates use the sigmoid activation $\sigma(z) = \frac{1}{1 + e^{-z}}$ because:

  1. Output range: Always in $(0, 1)$
  2. Smooth and differentiable: Allows gradient-based learning
  3. Saturation: For extreme inputs, output approaches 0 or 1

Gates are Learned

Gate values are not fixed—they are computed from the current input and hidden state. The network learns when to open or close each gate based on context!
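The gating idea can be sketched in a few lines of NumPy. This is a toy illustration with made-up logit values, not learned parameters; it only shows how a sigmoid turns pre-activations into per-dimension pass-through fractions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A gate is a vector of values in (0, 1), applied elementwise:
# each dimension of the signal is scaled independently.
x = np.array([2.0, -1.0, 0.5, 3.0])                # signal to be gated
gate_logits = np.array([10.0, -10.0, 0.0, 0.847])  # toy pre-activations
g = sigmoid(gate_logits)                           # gate values in (0, 1)

gated = g * x  # elementwise product g ⊙ x

# Saturated logits give near-binary gates: ≈1 passes, ≈0 blocks.
print(np.round(g, 3))  # gate values ≈ [1.0, 0.0, 0.5, 0.7]
print(gated)
```

Note how the third dimension passes exactly half its signal while the first passes essentially everything: a single gate vector makes an independent keep/discard decision per dimension.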


LSTM Architecture

The Long Short-Term Memory (LSTM) cell, introduced by Hochreiter & Schmidhuber (1997), addresses vanilla RNN limitations with two key innovations:

Innovation 1: The Cell State

LSTMs introduce a separate cell state $\mathbf{C}_t$ that runs through the entire sequence with only linear interactions:

$$\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$$

This is an additive update (not multiplicative like vanilla RNNs). Information can flow unchanged for many timesteps—the gradient highway!

Innovation 2: Three Gates

LSTMs use three gates to control information flow:

| Gate | Symbol | Function | Intuition |
|---|---|---|---|
| Forget | fₜ | Which old info to discard | Clear irrelevant memory |
| Input | iₜ | Which new info to store | Write new memories |
| Output | oₜ | What to output from cell | Control visibility |

Information Flow Overview

At each timestep, the LSTM:

  1. Forget: Decide what to remove from cell state
  2. Store: Create new candidate information
  3. Update: Combine old (filtered) and new information
  4. Output: Produce hidden state from cell state

LSTM Equations Derived

Let's derive each LSTM component systematically. Our AMNL model uses hidden dimension $H = 64$.

Notation

| Symbol | Dimension | Meaning |
|---|---|---|
| xₜ | D = 64 | Input at time t (from CNN features) |
| hₜ₋₁ | H = 64 | Previous hidden state |
| Cₜ₋₁ | H = 64 | Previous cell state |
| W_f, W_i, W_c, W_o | H × (D + H) | Weight matrices |
| b_f, b_i, b_c, b_o | H | Bias vectors |

Step 1: Concatenate Input

All gates share the same input vector:

$$\mathbf{z}_t = [\mathbf{h}_{t-1}; \mathbf{x}_t] \in \mathbb{R}^{D+H}$$

Where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation. This combines history (hidden state) with new input.

Step 2: Forget Gate

$$\mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{z}_t + \mathbf{b}_f)$$

Dimension: $\mathbf{f}_t \in (0, 1)^H$

Intuition: For each dimension of the cell state, how much should we retain from the previous timestep?

Step 3: Input Gate and Candidate State

The input gate controls how much new information to add:

$$\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{z}_t + \mathbf{b}_i)$$

The candidate cell state is what new information to add:

$$\tilde{\mathbf{C}}_t = \tanh(\mathbf{W}_c \mathbf{z}_t + \mathbf{b}_c)$$

Key insight: Separating "how much" (gate) from "what" (candidate) allows independent control. The candidate uses tanh to produce values in $(-1, 1)$, so it can push the cell state in either direction.

Step 4: Cell State Update

$$\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$$

This is the critical equation! Breaking it down:

  • $\mathbf{f}_t \odot \mathbf{C}_{t-1}$: Selectively forget old cell state
  • $\mathbf{i}_t \odot \tilde{\mathbf{C}}_t$: Selectively add new information
  • Addition (+): Not multiplication! Gradients flow directly.

The Additive Update

The additive structure $\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \ldots$ is fundamentally different from the vanilla RNN's multiplicative update. This single change is what enables long-range learning.

Step 5: Output Gate and Hidden State

$$\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{z}_t + \mathbf{b}_o)$$

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{C}_t)$$

Intuition: The cell state $\mathbf{C}_t$ contains all stored information, but we may not want to expose all of it. The output gate filters what becomes visible in the hidden state.

Complete LSTM Equations

For reference, here are all equations together:

$$\begin{aligned}
\mathbf{f}_t &= \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_f) && \text{(Forget gate)} \\
\mathbf{i}_t &= \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_i) && \text{(Input gate)} \\
\tilde{\mathbf{C}}_t &= \tanh(\mathbf{W}_c [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_c) && \text{(Candidate)} \\
\mathbf{C}_t &= \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t && \text{(Cell update)} \\
\mathbf{o}_t &= \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_o) && \text{(Output gate)} \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{C}_t) && \text{(Hidden state)}
\end{aligned}$$
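As a concreteness check, the six equations can be implemented directly. This is a minimal NumPy sketch with randomly initialized toy weights (not the AMNL model's trained parameters), using the dimensions from the text, $D = H = 64$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 64  # input and hidden dimensions from the text

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One H x (D + H) weight matrix and one H-dim bias per component.
params = {k: (rng.normal(0, 0.1, (H, D + H)), np.zeros(H))
          for k in ("f", "i", "c", "o")}

def lstm_step(x_t, h_prev, C_prev):
    """One LSTM timestep following the six equations above."""
    z = np.concatenate([h_prev, x_t])                       # [h_{t-1}; x_t]
    f = sigmoid(params["f"][0] @ z + params["f"][1])        # forget gate
    i = sigmoid(params["i"][0] @ z + params["i"][1])        # input gate
    C_tilde = np.tanh(params["c"][0] @ z + params["c"][1])  # candidate
    C = f * C_prev + i * C_tilde                            # additive cell update
    o = sigmoid(params["o"][0] @ z + params["o"][1])        # output gate
    h = o * np.tanh(C)                                      # hidden state
    return h, C

# Run a short sequence of random inputs and check shapes.
h, C = np.zeros(H), np.zeros(H)
for t in range(5):
    h, C = lstm_step(rng.normal(size=D), h, C)
print(h.shape, C.shape)  # (64,) (64,)
```

Note that the hidden state is bounded ($|h| < 1$, since it is a gated tanh), while the cell state is not: the cell state is free to accumulate.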

How LSTM Solves Vanishing Gradients

Now we can prove why LSTMs don't suffer from vanishing gradients. Consider the gradient of the loss with respect to an earlier cell state.

Cell State Gradient

From the cell update equation $\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$, treating the gates as constants and following only the direct cell-state path:

$$\frac{\partial \mathbf{C}_t}{\partial \mathbf{C}_{t-1}} = \text{diag}(\mathbf{f}_t)$$

The gradient from time $t$ to time $k$ along the cell state path:

$$\frac{\partial \mathbf{C}_t}{\partial \mathbf{C}_k} = \prod_{i=k+1}^{t} \text{diag}(\mathbf{f}_i) = \text{diag}\!\left(\prod_{i=k+1}^{t} \mathbf{f}_i\right)$$
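A quick numerical illustration of this product, using toy scalar values rather than learned gates:

```python
T = 100  # number of timesteps the gradient must traverse

# Vanilla RNN: each step multiplies by a factor typically below 1
# (tanh derivative times recurrent weight), so the product decays.
vanilla_factor = 0.9
vanilla_grad = vanilla_factor ** T

# LSTM cell path: each step multiplies by the forget gate value.
# With forget gates learned close to 1, the signal survives.
forget_gate = 0.999
lstm_grad = forget_gate ** T

print(f"vanilla: {vanilla_grad:.2e}")  # 2.66e-05 — vanished
print(f"lstm:    {lstm_grad:.2e}")     # 9.05e-01 — intact
```

The difference is not that the LSTM's factors are magically equal to 1, but that they are *learned*: the network can set them near 1 exactly on the dimensions where long-range memory matters.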

Comparison with Vanilla RNN

| Model | Gradient Product | Typical Behavior |
|---|---|---|
| Vanilla RNN | ∏ diag(1 − h²ᵢ) · W_hh | Exponential decay (γ < 1) |
| LSTM (cell) | ∏ fᵢ | Controlled by learned gates |

Why This Prevents Vanishing

When $\mathbf{f}_t = \mathbf{1}$ and $\mathbf{i}_t = \mathbf{0}$:

$$\mathbf{C}_t = \mathbf{C}_{t-1}$$

The cell state is copied exactly! This "constant error carousel" allows information and gradients to persist indefinitely, solving the core problem of vanilla RNNs.

Forget Gate Bias

In practice, forget gate biases are often initialized to positive values (e.g., $\mathbf{b}_f = \mathbf{1}$) so gates start near 1 ($\sigma(1) \approx 0.73$ for zero-centered pre-activations). This encourages the model to preserve information by default.
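The effect of this initialization trick is easy to verify numerically (a sketch, not any particular framework's init routine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H = 64
b_f = np.ones(H)  # forget-gate bias initialized to 1, per the trick above

# At initialization the weight contribution is roughly zero-mean, so the
# gate starts near sigmoid(b_f): about 73% of the cell state is retained
# per step by default, instead of 50% with a zero bias.
print(round(float(sigmoid(1.0)), 3))  # 0.731
print(round(float(sigmoid(0.0)), 3))  # 0.5
```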


GRU: A Streamlined Alternative

The Gated Recurrent Unit (GRU), introduced by Cho et al. (2014), simplifies the LSTM by merging gates and eliminating the separate cell state.

GRU Design Philosophy

GRU asks: "Can we get LSTM-like benefits with fewer parameters?" The answer is yes, through clever gate merging:

| LSTM | GRU | Simplification |
|---|---|---|
| Forget + Input gates | Update gate (z) | Complementary: z and (1 − z) |
| Cell state + Hidden state | Only hidden state | Merged into one |
| 4 weight matrices | 3 weight matrices | 25% fewer parameters |

GRU Equations

Reset Gate: How much of the past to forget when computing the candidate:

$$\mathbf{r}_t = \sigma(\mathbf{W}_r [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_r)$$

Update Gate: How much of the new candidate to use versus keeping old state:

$$\mathbf{z}_t = \sigma(\mathbf{W}_z [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_z)$$

Candidate Hidden State: Proposed new state (reset gate filters history):

$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h [\mathbf{r}_t \odot \mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{b}_h)$$

Hidden State Update: Interpolation between old and new:

$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$

Key Insight: Complementary Gating

The GRU update uses $(1 - \mathbf{z}_t)$ and $\mathbf{z}_t$ as complementary weights:

  • $\mathbf{z}_t \approx 0$: Keep old state, ignore new input
  • $\mathbf{z}_t \approx 1$: Replace with new candidate
  • $\mathbf{z}_t = 0.5$: Equal mix of old and new

This is analogous to LSTM's forget and input gates, but with the implicit constraint that they sum to one ($\mathbf{f}_t + \mathbf{i}_t = \mathbf{1}$).
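Putting the four GRU equations together as a NumPy sketch, with random toy weights and the same dimensions as the LSTM derivation ($D = H = 64$):

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 64, 64

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three weight matrices (reset, update, candidate), each H x (D + H).
Wr, Wz, Wh = (rng.normal(0, 0.1, (H, D + H)) for _ in range(3))
br, bz, bh = np.zeros(H), np.zeros(H), np.zeros(H)

def gru_step(x_t, h_prev):
    """One GRU timestep following the equations above."""
    z_in = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ z_in + br)  # reset gate
    z = sigmoid(Wz @ z_in + bz)  # update gate
    # Reset gate filters history before the candidate is computed.
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    # Complementary interpolation between old state and candidate.
    return (1 - z) * h_prev + z * h_tilde

h = np.zeros(H)
for t in range(5):
    h = gru_step(rng.normal(size=D), h)
print(h.shape)  # (64,)
```

Because the update is a convex combination of the previous state and a tanh candidate, the GRU hidden state stays bounded in $(-1, 1)$: there is no unbounded accumulator like the LSTM's cell state.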


LSTM vs GRU Comparison

Parameter Count
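The per-cell counts follow directly from the gate equations: each component needs one $H \times (D + H)$ weight matrix plus an $H$-dimensional bias, with four such sets for LSTM and three for GRU. A back-of-envelope check with this section's $D = H = 64$ (ignoring framework details such as separate input/hidden biases):

```python
D, H = 64, 64  # dimensions used throughout this section

per_component = H * (D + H) + H  # one weight matrix plus one bias

lstm_params = 4 * per_component  # forget, input, candidate, output
gru_params = 3 * per_component   # reset, update, candidate

print(lstm_params)  # 33024
print(gru_params)   # 24768
print(f"GRU saves {1 - gru_params / lstm_params:.0%}")  # GRU saves 25%
```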

Empirical Comparison

| Criterion | LSTM | GRU | Winner |
|---|---|---|---|
| Parameters | More (4 weight matrices) | Fewer (3 weight matrices) | GRU |
| Training speed | Slower | Faster | GRU |
| Long sequences | Better retention | Good retention | LSTM |
| Small datasets | May overfit | Better generalization | GRU |
| Language modeling | Standard choice | Competitive | LSTM |
| Speech recognition | Comparable | Often preferred | GRU |

Why We Use LSTM for RUL

For our AMNL model, we chose LSTM because:

  1. Long-range dependencies: Degradation patterns may span the entire 30-timestep window
  2. Separate cell state: Better for preserving subtle trend information across timesteps
  3. Dataset size: C-MAPSS has sufficient data to train LSTM without overfitting
  4. Proven performance: LSTM is well-established for industrial time series
Design Choice: For RUL prediction, the additional capacity of LSTM's separate cell state outweighs GRU's training efficiency. When long-term memory matters most, LSTM is the safer choice.

Bidirectional Processing

Standard RNNs/LSTMs process sequences left-to-right, computing $\overrightarrow{\mathbf{h}}_t$ from past context only. But what about future context?

The Bidirectional Idea

A Bidirectional LSTM (BiLSTM) runs two LSTMs:

  1. Forward LSTM: Processes $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$ left-to-right
  2. Backward LSTM: Processes $\mathbf{x}_T, \mathbf{x}_{T-1}, \ldots, \mathbf{x}_1$ right-to-left

BiLSTM Equations

Forward pass:

$$\overrightarrow{\mathbf{h}}_t = \text{LSTM}_{\text{forward}}(\mathbf{x}_t, \overrightarrow{\mathbf{h}}_{t-1})$$

Backward pass:

$$\overleftarrow{\mathbf{h}}_t = \text{LSTM}_{\text{backward}}(\mathbf{x}_t, \overleftarrow{\mathbf{h}}_{t+1})$$

Combined output:

$$\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t] \in \mathbb{R}^{2H}$$

The output at each timestep concatenates the forward and backward hidden states, giving dimension $2H = 128$ in our model.
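A shape-level sketch of the bidirectional wiring, using a simplified tanh recurrence in place of a full LSTM cell and sharing one set of toy weights across directions (a real BiLSTM uses separate parameters per direction); the point is the reversal and concatenation:

```python
import numpy as np

rng = np.random.default_rng(2)
T, D, H = 30, 64, 64  # window length and dimensions from the text

# Simplified recurrent step standing in for a full LSTM cell.
W_x = rng.normal(0, 0.1, (H, D))
W_h = rng.normal(0, 0.1, (H, H))

def step(x_t, h_prev):
    return np.tanh(W_x @ x_t + W_h @ h_prev)

def run(xs):
    """Run the recurrence over a sequence, returning all hidden states."""
    h, out = np.zeros(H), []
    for x_t in xs:
        h = step(x_t, h)
        out.append(h)
    return np.stack(out)  # shape (T, H)

x = rng.normal(size=(T, D))
h_fwd = run(x)              # forward pass over x_1 .. x_T
h_bwd = run(x[::-1])[::-1]  # backward pass, re-aligned to time order
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)
print(h_bi.shape)  # (30, 128)
```

At timestep 15, `h_bi[15, :64]` summarizes cycles 1-15 and `h_bi[15, 64:]` summarizes cycles 16-30, which is exactly the two-sided context the next subsection argues for.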

Why BiLSTM for RUL?

Consider a sensor reading at timestep 15 in a 30-step window:

  • Forward context: What happened before (cycles 1-14)
  • Backward context: What happens after (cycles 16-30)

For RUL prediction, both matter! A temperature spike at cycle 15 is interpreted differently if:

  • It returns to normal (backward context shows recovery)
  • It continues rising (backward context shows degradation)

Parallel Computation

An important practical note: forward and backward passes are independent and can run in parallel! This makes BiLSTM only ~1.5× slower than unidirectional LSTM, not 2×.

When NOT to Use BiLSTM

BiLSTM requires seeing the entire sequence before producing output. For real-time prediction where you must output immediately as data arrives, use unidirectional LSTM. For our RUL application, we process complete windows, so BiLSTM is appropriate.


Summary

In this section, we derived the mathematics of LSTM and GRU architectures:

  1. Gates are learned sigmoid functions that control information flow with values in $(0, 1)$
  2. LSTM has four components: forget gate ($\mathbf{f}_t$), input gate ($\mathbf{i}_t$), candidate ($\tilde{\mathbf{C}}_t$), output gate ($\mathbf{o}_t$)
  3. Cell state update uses addition: $\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$
  4. Gradient flow through the cell state is $\prod_i \mathbf{f}_i$, controlled by learned gates rather than a fixed weight matrix
  5. GRU simplifies to two gates and three weight matrices, with complementary gating: $\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$
  6. BiLSTM captures both past and future context by running forward and backward LSTMs
| Component | Formula | Purpose |
|---|---|---|
| Forget gate | fₜ = σ(W_f[hₜ₋₁; xₜ] + b_f) | Control memory erasure |
| Input gate | iₜ = σ(W_i[hₜ₋₁; xₜ] + b_i) | Control new information |
| Cell update | Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ | Additive memory update |
| Output gate | oₜ = σ(W_o[hₜ₋₁; xₜ] + b_o) | Control visibility |
| Hidden state | hₜ = oₜ ⊙ tanh(Cₜ) | Output representation |
Looking Ahead: LSTMs process sequences step-by-step, compressing all history into a fixed-size vector. But what if we want to directly access any past timestep? The attention mechanism solves this by computing a weighted combination of all hidden states, allowing the model to focus on the most relevant parts of the sequence. This is especially powerful for RUL prediction where specific degradation events matter more than others.

With LSTM and BiLSTM understood, we are ready to explore how attention mechanisms further enhance sequence modeling.