Chapter 2

Recurrent Neural Networks Mathematics

Mathematical Foundations

Learning Objectives

By the end of this section, you will:

  1. Understand recurrence as the fundamental mechanism for processing sequential data
  2. Derive vanilla RNN equations and interpret each component's role
  3. Trace forward propagation through time and understand how hidden states accumulate history
  4. Understand backpropagation through time (BPTT) and why gradient computation is complex for sequences
  5. Diagnose the vanishing gradient problem mathematically and understand why it limits vanilla RNNs
  6. Motivate the need for LSTMs based on the limitations we discover
Why This Matters: RNNs are the foundational architecture for sequence modeling. Even though we use LSTMs in our model, understanding vanilla RNNs is essential—LSTMs are designed specifically to solve the problems we identify here. Without understanding the disease, you cannot appreciate the cure.

What is a Recurrent Neural Network?

A Recurrent Neural Network (RNN) is a class of neural networks designed for sequential data. Unlike feedforward networks that process each input independently, RNNs maintain a hidden state that carries information from previous timesteps.

Historical Context

RNNs were conceived in the 1980s, with key contributions from John Hopfield (Hopfield networks, 1982), David Rumelhart (backpropagation through time, 1986), and Jeffrey Elman (Elman networks, 1990). They became practical for sequence modeling after efficient training algorithms were developed.

The Fundamental Idea

In a feedforward network, each input is processed in isolation:

$$\mathbf{y} = f(\mathbf{x})$$

In an RNN, the output depends on both the current input and the accumulated state from previous inputs:

$$\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1})$$

This simple change—feeding the previous hidden state back in as an input—creates a system with memory.


The Hidden State Concept

The hidden state $\mathbf{h}_t$ is the key innovation of RNNs. It serves as a compressed summary of all information seen up to time $t$.

What the Hidden State Encodes

  • Past context: Information from $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{t-1}$
  • Processed features: Non-linear transformations of the input history
  • Sufficient statistics: Ideally, everything needed to predict future outputs

Dimensionality

The hidden state is a vector of fixed size $H$ (the hidden dimension):

$$\mathbf{h}_t \in \mathbb{R}^H$$

This is a design choice. Larger $H$ allows more information storage but requires more parameters and computation. In our model, $H = 128$.

Information Bottleneck

The hidden state must compress an arbitrarily long history into a fixed-size vector. This is both a strength (bounded computation) and a weakness (limited capacity). We will see how this relates to the vanishing gradient problem.


Vanilla RNN Equations

The vanilla RNN (also called simple RNN or Elman network) has the following update equations.

Core Equations

At each timestep $t$:

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h)$$
$$\mathbf{y}_t = \mathbf{W}_{hy} \mathbf{h}_t + \mathbf{b}_y$$

Component Breakdown

| Symbol | Shape | Meaning |
| --- | --- | --- |
| $\mathbf{x}_t$ | $\mathbb{R}^D$ | Input at timestep $t$ (e.g., 64 CNN features) |
| $\mathbf{h}_t$ | $\mathbb{R}^H$ | Hidden state at timestep $t$ |
| $\mathbf{h}_{t-1}$ | $\mathbb{R}^H$ | Hidden state from previous timestep |
| $\mathbf{W}_{xh}$ | $\mathbb{R}^{H \times D}$ | Input-to-hidden weight matrix |
| $\mathbf{W}_{hh}$ | $\mathbb{R}^{H \times H}$ | Hidden-to-hidden weight matrix |
| $\mathbf{b}_h$ | $\mathbb{R}^H$ | Hidden state bias |
| $\mathbf{W}_{hy}$ | $\mathbb{R}^{O \times H}$ | Hidden-to-output weight matrix |
| $\mathbf{b}_y$ | $\mathbb{R}^O$ | Output bias |
| $\mathbf{y}_t$ | $\mathbb{R}^O$ | Output at timestep $t$ |
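The update equations translate directly into code. Below is a minimal NumPy sketch of a single timestep; the dimensions ($D = 64$, $H = 128$) match the text, while the output size, variable names, and small random initialization are illustrative assumptions rather than the model's actual code.

```python
import numpy as np

# Illustrative sizes from the text: D = 64 CNN features, H = 128 hidden units.
# O (output size) and the 0.01 initialization scale are assumptions.
D, H, O = 64, 128, 1
rng = np.random.default_rng(0)

W_xh = rng.standard_normal((H, D)) * 0.01  # input-to-hidden
W_hh = rng.standard_normal((H, H)) * 0.01  # hidden-to-hidden
W_hy = rng.standard_normal((O, H)) * 0.01  # hidden-to-output
b_h = np.zeros(H)
b_y = np.zeros(O)

def rnn_step(x_t, h_prev):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y  # y_t = W_hy h_t + b_y
    return h_t, y_t

h0 = np.zeros(H)             # initial hidden state
x1 = rng.standard_normal(D)  # one input vector
h1, y1 = rnn_step(x1, h0)
print(h1.shape, y1.shape)    # (128,) (1,)
```

Note how the same `W_hh` and `W_xh` would be reused at every timestep; only the state `h` changes.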

The Activation Function

The $\tanh$ function squashes the linear combination into the range $(-1, 1)$:

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Key properties:

  • Output range: $(-1, 1)$
  • Zero-centered (unlike sigmoid)
  • Derivative: $\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$
  • Maximum derivative is 1, attained at $z = 0$

Forward Propagation Through Time

Processing an entire sequence involves unrolling the RNN across time:

Unrolled Computation Graph

```text
           x₁        x₂        x₃              x_T
            ↓         ↓         ↓               ↓
┌──┐      ┌──┐      ┌──┐      ┌──┐           ┌───┐
│h₀│─────→│h₁│─────→│h₂│─────→│h₃│─────→ ... →│h_T│
└──┘      └──┘      └──┘      └──┘           └───┘
            ↓         ↓         ↓               ↓
           y₁        y₂        y₃              y_T
```

Each box represents the same RNN cell with shared weights. The unrolling is conceptual for understanding gradient flow.

Full Sequence Processing

$$\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T] \in \mathbb{R}^{T \times H}$$

The sequence of hidden states forms a matrix where each row is a timestep's hidden representation. For RUL prediction, we typically use $\mathbf{h}_T$, the final hidden state that has seen the entire sequence.
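This unrolling can be sketched as a plain loop: run the cell over a whole window and stack the hidden states into a $T \times H$ matrix. Shapes follow the text ($T = 30$, $D = 64$, $H = 128$); the random weights and names are hypothetical.

```python
import numpy as np

T, D, H = 30, 64, 128  # window length, input features, hidden size (from the text)
rng = np.random.default_rng(1)
W_xh = rng.standard_normal((H, D)) * 0.01
W_hh = rng.standard_normal((H, H)) * 0.01
b_h = np.zeros(H)

def rnn_forward(X):
    """Unroll over a (T, D) sequence; return the (T, H) matrix of hidden states."""
    h = np.zeros(H)  # h_0 = 0
    states = []
    for x_t in X:    # strictly sequential: h_t needs h_{t-1}
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        states.append(h)
    return np.stack(states)  # rows are h_1, ..., h_T

X = rng.standard_normal((T, D))
Hmat = rnn_forward(X)
h_T = Hmat[-1]  # final hidden state, the one used for RUL prediction
print(Hmat.shape)  # (30, 128)
```

The loop makes the sequential dependency explicit: there is no way to compute `h_t` before `h_{t-1}` is finished, which is the parallelization limitation discussed later.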


Backpropagation Through Time

Backpropagation Through Time (BPTT) is the algorithm for computing gradients in RNNs. It applies the chain rule across the temporal unrolling.

The Loss Function

For a sequence with outputs at each timestep:

$$\mathcal{L} = \sum_{t=1}^{T} \ell_t(\mathbf{y}_t, \hat{\mathbf{y}}_t)$$

where $\ell_t$ is the loss at timestep $t$.

Gradient Flow

The gradient of the loss with respect to $\mathbf{W}_{hh}$ involves contributions from all timesteps:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell_t}{\partial \mathbf{W}_{hh}}$$

Each term requires backpropagating through the chain of hidden states:

$$\frac{\partial \ell_t}{\partial \mathbf{W}_{hh}} = \sum_{k=1}^{t} \frac{\partial \ell_t}{\partial \mathbf{h}_t} \cdot \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} \cdot \frac{\partial \mathbf{h}_k}{\partial \mathbf{W}_{hh}}$$

The Critical Jacobian

The term $\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k}$ is a product of Jacobians:

$$\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} = \prod_{i=k+1}^{t} \frac{\partial \mathbf{h}_i}{\partial \mathbf{h}_{i-1}}$$

Each Jacobian is:

$$\frac{\partial \mathbf{h}_i}{\partial \mathbf{h}_{i-1}} = \text{diag}(1 - \mathbf{h}_i^2) \cdot \mathbf{W}_{hh}$$

where $\text{diag}(1 - \mathbf{h}_i^2)$ holds the elementwise derivative of $\tanh$.
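One way to convince yourself of this Jacobian formula is to check it against central finite differences for a single step. The sketch below does so with a small hypothetical state size; the constant vector `u` stands in for the fixed input term $\mathbf{W}_{xh}\mathbf{x}_i + \mathbf{b}_h$.

```python
import numpy as np

rng = np.random.default_rng(2)
H = 5  # small hidden size, just for the check
W_hh = rng.standard_normal((H, H)) * 0.5
u = rng.standard_normal(H)  # stands in for W_xh x_i + b_h, held fixed

def step(h_prev):
    return np.tanh(W_hh @ h_prev + u)

h_prev = rng.standard_normal(H)
h_i = step(h_prev)

# Analytic Jacobian: diag(1 - h_i^2) @ W_hh
J_analytic = np.diag(1.0 - h_i**2) @ W_hh

# Central finite differences, one column per perturbed coordinate
eps = 1e-6
J_numeric = np.zeros((H, H))
for j in range(H):
    e = np.zeros(H)
    e[j] = eps
    J_numeric[:, j] = (step(h_prev + e) - step(h_prev - e)) / (2 * eps)

# The two should agree up to numerical error
print(np.max(np.abs(J_analytic - J_numeric)))
```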


The Vanishing Gradient Problem

The product of Jacobians creates a fundamental problem for long sequences.

Mathematical Analysis

Consider the norm of the gradient flowing from timestep $k$ to timestep $t$:

$$\left\| \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} \right\| = \left\| \prod_{i=k+1}^{t} \text{diag}(1 - \mathbf{h}_i^2) \cdot \mathbf{W}_{hh} \right\|$$

Bounding the Gradient

Let $\gamma = \|\mathbf{W}_{hh}\|$ and note that $|1 - \tanh^2(z)| \leq 1$:

$$\left\| \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} \right\| \lesssim \gamma^{t-k}$$
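To make the bound concrete, take $\gamma = 0.9$ and a gap of $t - k = 30$ timesteps, one full window in our setting:

$$\left\| \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} \right\| \lesssim 0.9^{30} \approx 0.042$$

Only about 4% of the gradient signal survives the trip from the last timestep back to the first.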

Two Regimes

| Condition | Behavior | Problem |
| --- | --- | --- |
| $\gamma < 1$ | $\gamma^{t-k} \to 0$ exponentially | Vanishing gradients |
| $\gamma > 1$ | $\gamma^{t-k} \to \infty$ exponentially | Exploding gradients |
| $\gamma \approx 1$ | Stable (rare, hard to maintain) | Edge of chaos |
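Both regimes can be observed numerically. The sketch below builds a random $\mathbf{W}_{hh}$ rescaled to a chosen spectral norm $\gamma$, accumulates the product of Jacobians over 30 steps, and reports the product's norm; the setup (random inputs, spectral-norm rescaling) is an illustrative assumption, not part of any real model.

```python
import numpy as np

rng = np.random.default_rng(3)
H, T = 16, 30  # hidden size (illustrative) and window length (from the text)

def jacobian_product_norm(gamma):
    """Spectral norm of the product ∂h_T/∂h_0 for a random RNN with ||W_hh|| = gamma."""
    W = rng.standard_normal((H, H))
    W *= gamma / np.linalg.norm(W, 2)  # rescale so the spectral norm is exactly gamma
    J = np.eye(H)
    h = np.zeros(H)
    for _ in range(T):
        h = np.tanh(W @ h + rng.standard_normal(H))
        J = np.diag(1.0 - h**2) @ W @ J  # left-multiply the newest Jacobian
    return np.linalg.norm(J, 2)

# gamma < 1: the norm is bounded by gamma**T = 0.9**30 ≈ 0.042 and vanishes
print(jacobian_product_norm(0.9))
# gamma > 1: the norm can grow rapidly, though tanh saturation may damp it
print(jacobian_product_norm(1.5))
```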

Why This Matters for RUL

In RUL prediction, degradation patterns may span the entire window:

  • A temperature trend starting at cycle 1 may indicate degradation at cycle 30
  • Early sensor readings provide baseline context needed for accurate prediction
  • The vanishing gradient prevents the model from learning these long-range relationships

The Fundamental Limitation

Vanilla RNNs have a practical memory horizon of roughly 10-20 timesteps. Beyond this, the gradient signal is too weak for effective learning. For our 30-timestep windows, this is inadequate.


Limitations of Vanilla RNNs

The vanishing gradient problem is the most severe, but vanilla RNNs have other limitations:

1. No Selective Memory

The hidden state update has no mechanism to selectively retain or forget information:

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h)$$

Every update overwrites the previous state. Important early information gets progressively diluted.

2. Fixed Mixing of History and Input

The same transformation mixes past state and current input at every timestep. The network cannot adapt how much to rely on history versus new input based on context.

3. Sequential Computation

RNNs must process sequences step-by-step: $\mathbf{h}_t$ depends on $\mathbf{h}_{t-1}$. This prevents parallelization across timesteps, making training slow on modern GPUs.

4. Bounded Capacity

The fixed-size hidden state must encode the entire history. As sequences get longer, the same capacity must represent more information, leading to information loss.

| Limitation | Consequence | Solution (Preview) |
| --- | --- | --- |
| Vanishing gradients | Cannot learn long-range dependencies | LSTM gates control gradient flow |
| No selective memory | Important info gets diluted | LSTM cell state for long-term storage |
| Fixed mixing | Cannot adapt to context | Gates modulate information flow |
| Sequential computation | Slow training | Attention enables parallelization |
| Bounded capacity | Information loss for long sequences | Attention directly accesses all timesteps |

Summary

In this section, we explored the mathematics of Recurrent Neural Networks:

  1. RNNs process sequences by maintaining a hidden state $\mathbf{h}_t$ that accumulates information over time
  2. Vanilla RNN equation: $\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h)$
  3. BPTT computes gradients by unrolling the network and applying the chain rule through time
  4. Gradient flow involves products of Jacobians: $\prod_i \text{diag}(1 - \mathbf{h}_i^2) \cdot \mathbf{W}_{hh}$
  5. Vanishing gradients: when $\lVert \mathbf{W}_{hh} \rVert < 1$, gradients decay exponentially
  6. Practical memory horizon: ~10-20 timesteps for vanilla RNNs

| Concept | Formula | Implication |
| --- | --- | --- |
| Hidden update | $\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h)$ | Recurrent processing |
| Gradient product | $\partial \mathbf{h}_t / \partial \mathbf{h}_k = \prod$ Jacobians | Chain of dependencies |
| Vanishing rate | $\gamma^{t-k}$ where $\gamma = \lVert \mathbf{W}_{hh} \rVert$ | Exponential decay if $\gamma < 1$ |
| 30-step gradient | $0.9^{30} \approx 0.04$ | Only 4% of the signal survives |
Looking Ahead: The problems we identified—vanishing gradients, lack of selective memory, fixed mixing—motivated the development of Long Short-Term Memory (LSTM) networks. In the next section, we will derive the LSTM equations and see how gates elegantly solve each of these problems.

With vanilla RNN limitations clearly understood, we are ready to appreciate the elegant engineering of LSTM cells.