Chapter 2

Recurrent Neural Networks Mathematics

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Understand recurrence as the fundamental mechanism for processing sequential data
  2. Derive vanilla RNN equations and interpret each component's role
  3. Trace forward propagation through time and understand how hidden states accumulate history
  4. Understand backpropagation through time (BPTT) and why gradient computation is complex for sequences
  5. Diagnose the vanishing gradient problem mathematically and understand why it limits vanilla RNNs
  6. Motivate the need for LSTMs based on the limitations we discover
Why This Matters: RNNs are the foundational architecture for sequence modeling. Even though we use LSTMs in our model, understanding vanilla RNNs is essential—LSTMs are designed specifically to solve the problems we identify here. Without understanding the disease, you cannot appreciate the cure.

What is a Recurrent Neural Network?

A Recurrent Neural Network (RNN) is a class of neural networks designed for sequential data. Unlike feedforward networks that process each input independently, RNNs maintain a hidden state that carries information from previous timesteps.

Historical Context

RNNs were conceived in the 1980s, with key contributions from John Hopfield (Hopfield networks, 1982), David Rumelhart (backpropagation through time, 1986), and Jeffrey Elman (Elman networks, 1990). They became practical for sequence modeling after efficient training algorithms were developed.

The Fundamental Idea

In a feedforward network, each input is processed in isolation:

$$\mathbf{y} = f(\mathbf{x})$$

In an RNN, the output depends on both the current input and the accumulated state from previous inputs:

$$\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1})$$

This simple change—feeding the previous hidden state back in as an input—creates a system with memory.


The Hidden State Concept

The hidden state $\mathbf{h}_t$ is the key innovation of RNNs. It serves as a compressed summary of all information seen up to time $t$.

What the Hidden State Encodes

  • Past context: Information from $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{t-1}$
  • Processed features: Non-linear transformations of the input history
  • Sufficient statistics: Ideally, everything needed to predict future outputs

Dimensionality

The hidden state is a vector of fixed size $H$ (the hidden dimension):

$$\mathbf{h}_t \in \mathbb{R}^H$$

This is a design choice. Larger $H$ allows more information storage but requires more parameters and computation. In our model, $H = 128$.

Information Bottleneck

The hidden state must compress an arbitrarily long history into a fixed-size vector. This is both a strength (bounded computation) and a weakness (limited capacity). We will see how this relates to the vanishing gradient problem.


Vanilla RNN Equations

The vanilla RNN (also called simple RNN or Elman network) has the following update equations.

Core Equations

At each timestep $t$:

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h)$$
$$\mathbf{y}_t = \mathbf{W}_{hy} \mathbf{h}_t + \mathbf{b}_y$$

Component Breakdown

| Symbol | Shape | Meaning |
| --- | --- | --- |
| $\mathbf{x}_t$ | $\mathbb{R}^D$ | Input at timestep $t$ (e.g., 64 CNN features) |
| $\mathbf{h}_t$ | $\mathbb{R}^H$ | Hidden state at timestep $t$ |
| $\mathbf{h}_{t-1}$ | $\mathbb{R}^H$ | Hidden state from previous timestep |
| $\mathbf{W}_{xh}$ | $\mathbb{R}^{H \times D}$ | Input-to-hidden weight matrix |
| $\mathbf{W}_{hh}$ | $\mathbb{R}^{H \times H}$ | Hidden-to-hidden weight matrix |
| $\mathbf{b}_h$ | $\mathbb{R}^H$ | Hidden state bias |
| $\mathbf{W}_{hy}$ | $\mathbb{R}^{O \times H}$ | Hidden-to-output weight matrix |
| $\mathbf{b}_y$ | $\mathbb{R}^O$ | Output bias |
| $\mathbf{y}_t$ | $\mathbb{R}^O$ | Output at timestep $t$ |
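The update equations translate directly into code. Below is a minimal NumPy sketch of a single timestep; the dimensions ($D = 64$, $H = 128$) match the text, while the output size, variable names, and small random initialization are illustrative assumptions rather than the model's actual code.

```python
import numpy as np

# Illustrative sizes from the text: D = 64 CNN features, H = 128 hidden units.
# O (output size) and the 0.01 initialization scale are assumptions.
D, H, O = 64, 128, 1
rng = np.random.default_rng(0)

W_xh = rng.standard_normal((H, D)) * 0.01  # input-to-hidden
W_hh = rng.standard_normal((H, H)) * 0.01  # hidden-to-hidden
W_hy = rng.standard_normal((O, H)) * 0.01  # hidden-to-output
b_h = np.zeros(H)
b_y = np.zeros(O)

def rnn_step(x_t, h_prev):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y  # y_t = W_hy h_t + b_y
    return h_t, y_t

h0 = np.zeros(H)             # initial hidden state
x1 = rng.standard_normal(D)  # one input vector
h1, y1 = rnn_step(x1, h0)
print(h1.shape, y1.shape)    # (128,) (1,)
```

Note how the same `W_hh` and `W_xh` would be reused at every timestep; only the state `h` changes.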

The Activation Function

The $\tanh$ function squashes the linear combination into the range $(-1, 1)$:

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Key properties:

  • Output range: $(-1, 1)$
  • Zero-centered (unlike sigmoid)
  • Derivative: $\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$
  • Maximum derivative is 1, attained at $z = 0$

Forward Propagation Through Time

Processing an entire sequence involves unrolling the RNN across time:

Unrolled Computation Graph

```text
           x₁        x₂        x₃              x_T
            ↓         ↓         ↓               ↓
┌──┐      ┌──┐      ┌──┐      ┌──┐           ┌───┐
│h₀│─────→│h₁│─────→│h₂│─────→│h₃│─────→ ... →│h_T│
└──┘      └──┘      └──┘      └──┘           └───┘
            ↓         ↓         ↓               ↓
           y₁        y₂        y₃              y_T
```

Each box represents the same RNN cell with shared weights. The unrolling is conceptual for understanding gradient flow.

Full Sequence Processing

$$\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T] \in \mathbb{R}^{T \times H}$$

The sequence of hidden states forms a matrix where each row is a timestep's hidden representation. For RUL prediction, we typically use $\mathbf{h}_T$, the final hidden state that has seen the entire sequence.
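This unrolling can be sketched as a plain loop: run the cell over a whole window and stack the hidden states into a $T \times H$ matrix. Shapes follow the text ($T = 30$, $D = 64$, $H = 128$); the random weights and names are hypothetical.

```python
import numpy as np

T, D, H = 30, 64, 128  # window length, input features, hidden size (from the text)
rng = np.random.default_rng(1)
W_xh = rng.standard_normal((H, D)) * 0.01
W_hh = rng.standard_normal((H, H)) * 0.01
b_h = np.zeros(H)

def rnn_forward(X):
    """Unroll over a (T, D) sequence; return the (T, H) matrix of hidden states."""
    h = np.zeros(H)  # h_0 = 0
    states = []
    for x_t in X:    # strictly sequential: h_t needs h_{t-1}
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        states.append(h)
    return np.stack(states)  # rows are h_1, ..., h_T

X = rng.standard_normal((T, D))
Hmat = rnn_forward(X)
h_T = Hmat[-1]  # final hidden state, the one used for RUL prediction
print(Hmat.shape)  # (30, 128)
```

The loop makes the sequential dependency explicit: there is no way to compute `h_t` before `h_{t-1}` is finished, which is the parallelization limitation discussed later.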


Backpropagation Through Time

Backpropagation Through Time (BPTT) is the algorithm for computing gradients in RNNs. It applies the chain rule across the temporal unrolling.

The Loss Function

For a sequence with outputs at each timestep:

$$\mathcal{L} = \sum_{t=1}^{T} \ell_t(\mathbf{y}_t, \hat{\mathbf{y}}_t)$$

where $\ell_t$ is the loss at timestep $t$.

Gradient Flow

The gradient of the loss with respect to $\mathbf{W}_{hh}$ involves contributions from all timesteps:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell_t}{\partial \mathbf{W}_{hh}}$$

Each term requires backpropagating through the chain of hidden states:

$$\frac{\partial \ell_t}{\partial \mathbf{W}_{hh}} = \sum_{k=1}^{t} \frac{\partial \ell_t}{\partial \mathbf{h}_t} \cdot \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} \cdot \frac{\partial \mathbf{h}_k}{\partial \mathbf{W}_{hh}}$$

The Critical Jacobian

The term $\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k}$ is a product of Jacobians:

$$\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} = \prod_{i=k+1}^{t} \frac{\partial \mathbf{h}_i}{\partial \mathbf{h}_{i-1}}$$

Each Jacobian is:

$$\frac{\partial \mathbf{h}_i}{\partial \mathbf{h}_{i-1}} = \text{diag}(1 - \mathbf{h}_i^2) \cdot \mathbf{W}_{hh}$$

where $\text{diag}(1 - \mathbf{h}_i^2)$ holds the elementwise derivative of $\tanh$.
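One way to convince yourself of this Jacobian formula is to check it against central finite differences for a single step. The sketch below does so with a small hypothetical state size; the constant vector `u` stands in for the fixed input term $\mathbf{W}_{xh}\mathbf{x}_i + \mathbf{b}_h$.

```python
import numpy as np

rng = np.random.default_rng(2)
H = 5  # small hidden size, just for the check
W_hh = rng.standard_normal((H, H)) * 0.5
u = rng.standard_normal(H)  # stands in for W_xh x_i + b_h, held fixed

def step(h_prev):
    return np.tanh(W_hh @ h_prev + u)

h_prev = rng.standard_normal(H)
h_i = step(h_prev)

# Analytic Jacobian: diag(1 - h_i^2) @ W_hh
J_analytic = np.diag(1.0 - h_i**2) @ W_hh

# Central finite differences, one column per perturbed coordinate
eps = 1e-6
J_numeric = np.zeros((H, H))
for j in range(H):
    e = np.zeros(H)
    e[j] = eps
    J_numeric[:, j] = (step(h_prev + e) - step(h_prev - e)) / (2 * eps)

# The two should agree up to numerical error
print(np.max(np.abs(J_analytic - J_numeric)))
```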


The Vanishing Gradient Problem

The product of Jacobians creates a fundamental problem for long sequences.

Mathematical Analysis

Consider the norm of the gradient flowing from timestep $k$ to timestep $t$:

$$\left\| \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} \right\| = \left\| \prod_{i=k+1}^{t} \text{diag}(1 - \mathbf{h}_i^2) \cdot \mathbf{W}_{hh} \right\|$$

Bounding the Gradient

Let $\gamma = \|\mathbf{W}_{hh}\|$ and note that $|1 - \tanh^2(z)| \leq 1$:

$$\left\| \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} \right\| \lesssim \gamma^{t-k}$$
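To make the bound concrete, take $\gamma = 0.9$ and a gap of $t - k = 30$ timesteps, one full window in our setting:

$$\left\| \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} \right\| \lesssim 0.9^{30} \approx 0.042$$

Only about 4% of the gradient signal survives the trip from the last timestep back to the first.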

Two Regimes

| Condition | Behavior | Problem |
| --- | --- | --- |
| $\gamma < 1$ | $\gamma^{t-k} \to 0$ exponentially | Vanishing gradients |
| $\gamma > 1$ | $\gamma^{t-k} \to \infty$ exponentially | Exploding gradients |
| $\gamma \approx 1$ | Stable (rare, hard to maintain) | Edge of chaos |
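Both regimes can be observed numerically. The sketch below builds a random $\mathbf{W}_{hh}$ rescaled to a chosen spectral norm $\gamma$, accumulates the product of Jacobians over 30 steps, and reports the product's norm; the setup (random inputs, spectral-norm rescaling) is an illustrative assumption, not part of any real model.

```python
import numpy as np

rng = np.random.default_rng(3)
H, T = 16, 30  # hidden size (illustrative) and window length (from the text)

def jacobian_product_norm(gamma):
    """Spectral norm of the product ∂h_T/∂h_0 for a random RNN with ||W_hh|| = gamma."""
    W = rng.standard_normal((H, H))
    W *= gamma / np.linalg.norm(W, 2)  # rescale so the spectral norm is exactly gamma
    J = np.eye(H)
    h = np.zeros(H)
    for _ in range(T):
        h = np.tanh(W @ h + rng.standard_normal(H))
        J = np.diag(1.0 - h**2) @ W @ J  # left-multiply the newest Jacobian
    return np.linalg.norm(J, 2)

# gamma < 1: the norm is bounded by gamma**T = 0.9**30 ≈ 0.042 and vanishes
print(jacobian_product_norm(0.9))
# gamma > 1: the norm can grow rapidly, though tanh saturation may damp it
print(jacobian_product_norm(1.5))
```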

Why This Matters for RUL

In RUL prediction, degradation patterns may span the entire window:

  • A temperature trend starting at cycle 1 may indicate degradation at cycle 30
  • Early sensor readings provide baseline context needed for accurate prediction
  • The vanishing gradient prevents the model from learning these long-range relationships

The Fundamental Limitation

Vanilla RNNs have a practical memory horizon of roughly 10-20 timesteps. Beyond this, the gradient signal is too weak for effective learning. For our 30-timestep windows, this is inadequate.


Limitations of Vanilla RNNs

The vanishing gradient problem is the most severe, but vanilla RNNs have other limitations:

1. No Selective Memory

The hidden state update has no mechanism to selectively retain or forget information:

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h)$$

Every update overwrites the previous state. Important early information gets progressively diluted.

2. Fixed Mixing of History and Input

The same transformation mixes past state and current input at every timestep. The network cannot adapt how much to rely on history versus new input based on context.

3. Sequential Computation

RNNs must process sequences step-by-step: $\mathbf{h}_t$ depends on $\mathbf{h}_{t-1}$. This prevents parallelization across timesteps, making training slow on modern GPUs.

4. Bounded Capacity

The fixed-size hidden state must encode the entire history. As sequences get longer, the same capacity must represent more information, leading to information loss.

| Limitation | Consequence | Solution (Preview) |
| --- | --- | --- |
| Vanishing gradients | Cannot learn long-range dependencies | LSTM gates control gradient flow |
| No selective memory | Important info gets diluted | LSTM cell state for long-term storage |
| Fixed mixing | Cannot adapt to context | Gates modulate information flow |
| Sequential computation | Slow training | Attention enables parallelization |
| Bounded capacity | Information loss for long sequences | Attention directly accesses all timesteps |

Summary

In this section, we explored the mathematics of Recurrent Neural Networks:

  1. RNNs process sequences by maintaining a hidden state $\mathbf{h}_t$ that accumulates information over time
  2. Vanilla RNN equation: $\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h)$
  3. BPTT computes gradients by unrolling the network and applying the chain rule through time
  4. Gradient flow involves products of Jacobians: $\prod_i \text{diag}(1 - \mathbf{h}_i^2) \cdot \mathbf{W}_{hh}$
  5. Vanishing gradients: when $\lVert \mathbf{W}_{hh} \rVert < 1$, gradients decay exponentially
  6. Practical memory horizon: ~10-20 timesteps for vanilla RNNs

| Concept | Formula | Implication |
| --- | --- | --- |
| Hidden update | $\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h)$ | Recurrent processing |
| Gradient product | $\partial \mathbf{h}_t / \partial \mathbf{h}_k = \prod$ Jacobians | Chain of dependencies |
| Vanishing rate | $\gamma^{t-k}$ where $\gamma = \lVert \mathbf{W}_{hh} \rVert$ | Exponential decay if $\gamma < 1$ |
| 30-step gradient | $0.9^{30} \approx 0.04$ | Only 4% of the signal survives |
Looking Ahead: The problems we identified—vanishing gradients, lack of selective memory, fixed mixing—motivated the development of Long Short-Term Memory (LSTM) networks. In the next section, we will derive the LSTM equations and see how gates elegantly solve each of these problems.

With vanilla RNN limitations clearly understood, we are ready to appreciate the elegant engineering of LSTM cells.