Learning Objectives
By the end of this section, you will:
- Understand recurrence as the fundamental mechanism for processing sequential data
- Derive vanilla RNN equations and interpret each component's role
- Trace forward propagation through time and understand how hidden states accumulate history
- Understand backpropagation through time (BPTT) and why gradient computation is complex for sequences
- Diagnose the vanishing gradient problem mathematically and understand why it limits vanilla RNNs
- Motivate the need for LSTMs based on the limitations we discover
Why This Matters: RNNs are the foundational architecture for sequence modeling. Even though we use LSTMs in our model, understanding vanilla RNNs is essential—LSTMs are designed specifically to solve the problems we identify here. Without understanding the disease, you cannot appreciate the cure.
What is a Recurrent Neural Network?
A Recurrent Neural Network (RNN) is a class of neural networks designed for sequential data. Unlike feedforward networks that process each input independently, RNNs maintain a hidden state that carries information from previous timesteps.
Historical Context
RNNs were conceived in the 1980s, with key contributions from John Hopfield (Hopfield networks, 1982), David Rumelhart (backpropagation through time, 1986), and Jeffrey Elman (Elman networks, 1990). They became practical for sequence modeling after efficient training algorithms were developed.
The Fundamental Idea
In a feedforward network, each input is processed in isolation:
y = f(x)
In an RNN, the output depends on both the current input and the accumulated state from previous inputs:
h_t = f(x_t, h_{t-1})
This simple change, feeding the previous hidden state back into the next update, creates a system with memory.
The Hidden State Concept
The hidden state is the key innovation of RNNs. It serves as a compressed summary of all information seen up to time t.
What the Hidden State Encodes
- Past context: Information from the earlier inputs x₁, ..., x_{t-1}
- Processed features: Non-linear transformations of the input history
- Sufficient statistics: Ideally, everything needed to predict future outputs
Dimensionality
The hidden state is a vector of fixed size H (the hidden dimension):
h_t ∈ ℝ^H
This is a design choice. Larger H allows more information storage but requires more parameters and computation. In our model, .
Information Bottleneck
The hidden state must compress an arbitrarily long history into a fixed-size vector. This is both a strength (bounded computation) and a weakness (limited capacity). We will see how this relates to the vanishing gradient problem.
Vanilla RNN Equations
The vanilla RNN (also called simple RNN or Elman network) has the following update equations.
Core Equations
At each timestep t:
h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
y_t = W_hy h_t + b_y
Component Breakdown
| Symbol | Shape | Meaning |
|---|---|---|
| x_t | ℝ^D | Input at timestep t (e.g., 64 CNN features) |
| h_t | ℝ^H | Hidden state at timestep t |
| h_{t-1} | ℝ^H | Hidden state from previous timestep |
| W_{xh} | ℝ^{H×D} | Input-to-hidden weight matrix |
| W_{hh} | ℝ^{H×H} | Hidden-to-hidden weight matrix |
| b_h | ℝ^H | Hidden state bias |
| W_{hy} | ℝ^{O×H} | Hidden-to-output weight matrix |
| b_y | ℝ^O | Output bias |
| y_t | ℝ^O | Output at timestep t |
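The two update equations and the shapes in the table can be sketched directly in NumPy. This is an illustrative toy, not the model's actual code: D = 64 matches the CNN feature count mentioned in the table, while H = 32, O = 1, and the weight scale are arbitrary choices for the sketch.

```python
import numpy as np

D, H, O = 64, 32, 1  # D = 64 CNN features (per the table); H and O are arbitrary here

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, D))   # input-to-hidden
W_hh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden
b_h = np.zeros(H)                           # hidden state bias
W_hy = rng.normal(scale=0.1, size=(O, H))   # hidden-to-output
b_y = np.zeros(O)                           # output bias

def rnn_step(x_t, h_prev):
    """One vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y                  # linear readout: y_t = W_hy h_t + b_y
    return h_t, y_t

x_t = rng.normal(size=D)
h_t, y_t = rnn_step(x_t, np.zeros(H))       # h_0 = 0 is a common initialization
assert h_t.shape == (H,) and y_t.shape == (O,)
assert np.all(np.abs(h_t) < 1.0)            # tanh keeps the state in (-1, 1)
```

Note that the same five parameter arrays are reused at every timestep; only x_t and h_prev change.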
The Activation Function
The tanh function squashes the linear combination W_xh x_t + W_hh h_{t-1} + b_h into the range (-1, 1):
tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z})
Key properties:
- Output range: (-1, 1)
- Zero-centered (unlike sigmoid)
- Derivative: tanh'(z) = 1 − tanh²(z)
- Maximum derivative, at z = 0, is 1
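These properties can be verified numerically; the grid below is an arbitrary choice that happens to include z = 0 exactly.

```python
import numpy as np

z = np.linspace(-4.0, 4.0, 801)   # step 0.01, so z[400] = 0 exactly
t = np.tanh(z)
d = 1.0 - t ** 2                  # tanh'(z) = 1 - tanh^2(z)

assert t.min() > -1.0 and t.max() < 1.0   # output range (-1, 1)
assert np.isclose(np.tanh(0.0), 0.0)      # zero-centered
assert np.isclose(d[400], 1.0)            # maximum derivative, at z = 0, is 1
assert d.max() <= 1.0                     # |tanh'(z)| never exceeds 1
```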
Forward Propagation Through Time
Processing an entire sequence involves unrolling the RNN across time:
Unrolled Computation Graph
         x₁       x₂       x₃       x₄             x_T
          ↓        ↓        ↓        ↓              ↓
┌──┐    ┌──┐     ┌──┐     ┌──┐     ┌──┐           ┌──┐
│h₀│───→│h₁│────→│h₂│────→│h₃│────→│h₄│─ ... ───→│h_T│
└──┘    └──┘     └──┘     └──┘     └──┘           └──┘
          ↓        ↓        ↓        ↓              ↓
          y₁       y₂       y₃       y₄            y_T
Each box represents the same RNN cell with shared weights. The unrolling is conceptual for understanding gradient flow.
Full Sequence Processing
The sequence of hidden states (h₁, h₂, ..., h_T) forms a T × H matrix where each row is a timestep's hidden representation. For RUL prediction, we typically use h_T, the final hidden state, which has seen the entire sequence.
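Processing a full window is just the single-step update applied in a loop. The sketch below uses small illustrative sizes (D = 4, H = 8 are arbitrary); T = 30 matches the window length used elsewhere in this section.

```python
import numpy as np

D, H, T = 4, 8, 30                  # toy sizes; T = 30 mirrors the 30-timestep window
rng = np.random.default_rng(1)
W_xh = rng.normal(scale=0.3, size=(H, D))
W_hh = rng.normal(scale=0.3, size=(H, H))
b_h = np.zeros(H)

x = rng.normal(size=(T, D))         # one input sequence, one row per timestep
h = np.zeros(H)                     # h_0 = 0
states = []
for t in range(T):                  # the same weights are reused at every step
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)
    states.append(h)

states = np.stack(states)           # (T, H): the full matrix of hidden states
h_T = states[-1]                    # final state, the one used for RUL prediction
assert states.shape == (T, H)
```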
Backpropagation Through Time
Backpropagation Through Time (BPTT) is the algorithm for computing gradients in RNNs. It applies the chain rule across the temporal unrolling.
The Loss Function
For a sequence with outputs at each timestep:
L = Σ_{t=1}^{T} L_t
where L_t is the loss at timestep t.
Gradient Flow
The gradient of the loss with respect to W_hh involves contributions from all timesteps:
∂L/∂W_hh = Σ_{t=1}^{T} ∂L_t/∂W_hh
Each term requires backpropagating through the chain of hidden states:
∂L_t/∂W_hh = Σ_{k=1}^{t} (∂L_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W_hh)
The Critical Jacobian
The term ∂h_t/∂h_k is a product of Jacobians:
∂h_t/∂h_k = ∏_{i=k+1}^{t} ∂h_i/∂h_{i-1}
Each Jacobian is:
∂h_i/∂h_{i-1} = diag(tanh'(a_i)) · W_hh
where a_i = W_xh x_i + W_hh h_{i-1} + b_h is the pre-activation and tanh'(a_i) = 1 − tanh²(a_i) is the derivative of tanh.
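This product can be formed explicitly for a toy network. The sketch below caches each h_i during the forward pass (since h_i = tanh(a_i), the derivative is simply 1 − h_i²), multiplies the Jacobians from t back toward k, and tracks the product's spectral norm; the small weight scale and sizes are arbitrary assumptions chosen to land in the contracting regime.

```python
import numpy as np

D, H, T = 4, 8, 30
rng = np.random.default_rng(2)
W_xh = rng.normal(scale=0.1, size=(H, D))
W_hh = rng.normal(scale=0.1, size=(H, H))   # small weights: contracting Jacobians

# Forward pass, caching every hidden state h_0, ..., h_T
x = rng.normal(size=(T, D))
h = np.zeros(H)
hs = [h]
for t in range(T):
    h = np.tanh(W_xh @ x[t] + W_hh @ h)     # bias omitted for brevity
    hs.append(h)

# Accumulate dh_T/dh_k = prod_{i=k+1}^{T} diag(1 - h_i^2) W_hh, longest product last
J = np.eye(H)
norms = []
for i in range(T, 0, -1):                   # i = T, T-1, ..., 1
    J = J @ (np.diag(1.0 - hs[i] ** 2) @ W_hh)
    norms.append(np.linalg.norm(J, 2))      # spectral norm of dh_T/dh_{i-1}

assert norms[-1] < norms[0]  # dh_T/dh_0 is vastly smaller than dh_T/dh_{T-1}
```

Each factor's norm is at most ‖W_hh‖ times 1, so the product's norm collapses as the chain gets longer, which is the vanishing gradient problem analyzed next.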
The Vanishing Gradient Problem
The product of Jacobians creates a fundamental problem for long sequences.
Mathematical Analysis
Consider the norm of the gradient flow from timestep k to t:
‖∂h_t/∂h_k‖ ≤ ∏_{i=k+1}^{t} ‖diag(tanh'(a_i))‖ · ‖W_hh‖
Bounding the Gradient
Let γ = ‖W_hh‖ and note that |tanh'(z)| ≤ 1:
‖∂h_t/∂h_k‖ ≤ γ^{t-k}
Two Regimes
| Condition | Behavior | Problem |
|---|---|---|
| γ < 1 | γ^{t-k} → 0 exponentially | Vanishing gradients |
| γ > 1 | γ^{t-k} → ∞ exponentially | Exploding gradients |
| γ ≈ 1 | Stable (rare, hard to maintain) | Edge of chaos |
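Plugging a 30-step window into the bound γ^{t−k} makes the two regimes concrete; the γ values below are chosen purely for illustration.

```python
steps = 29                  # from timestep k = 1 to t = 30
vanish = 0.9 ** steps       # gamma < 1: the gradient signal is nearly gone
stable = 1.0 ** steps       # gamma = 1: signal preserved (rare, hard to maintain)
explode = 1.1 ** steps      # gamma > 1: the gradient signal blows up

print(f"gamma=0.9 -> {vanish:.3f}, gamma=1.0 -> {stable:.1f}, gamma=1.1 -> {explode:.1f}")
assert vanish < 0.05        # under 5% of the signal survives 29 steps
assert explode > 15         # roughly 16x amplification after 29 steps
```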
Why This Matters for RUL
In RUL prediction, degradation patterns may span the entire window:
- A temperature trend starting at cycle 1 may indicate degradation at cycle 30
- Early sensor readings provide baseline context needed for accurate prediction
- The vanishing gradient prevents the model from learning these long-range relationships
The Fundamental Limitation
Vanilla RNNs have a practical memory horizon of 10-20 timesteps in most cases. Beyond this, gradient signal is too weak for effective learning. For our 30-timestep windows, this is inadequate.
Limitations of Vanilla RNNs
The vanishing gradient problem is the most severe, but vanilla RNNs have other limitations:
1. No Selective Memory
The hidden state update has no mechanism to selectively retain or forget information:
h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
Every update overwrites the previous state. Important early information gets progressively diluted.
2. Fixed Mixing of History and Input
The same transformation W_xh x_t + W_hh h_{t-1} mixes past state and current input at every timestep. The network cannot adapt how much to rely on history versus new input based on context.
3. Sequential Computation
RNNs must process sequences step-by-step: h_t depends on h_{t-1}. This prevents parallelization across timesteps, making training slow on modern GPUs.
4. Bounded Capacity
The fixed-size hidden state must encode the entire history. As sequences get longer, the same capacity must represent more information, leading to information loss.
| Limitation | Consequence | Solution (Preview) |
|---|---|---|
| Vanishing gradients | Cannot learn long-range dependencies | LSTM gates control gradient flow |
| No selective memory | Important info gets diluted | LSTM cell state for long-term storage |
| Fixed mixing | Cannot adapt to context | Gates modulate information flow |
| Sequential computation | Slow training | Attention enables parallelization |
| Bounded capacity | Information loss for long sequences | Attention directly accesses all timesteps |
Summary
In this section, we explored the mathematics of Recurrent Neural Networks:
- RNNs process sequences by maintaining a hidden state that accumulates information over time
- Vanilla RNN equation: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
- BPTT computes gradients by unrolling the network and applying the chain rule through time
- Gradient flow involves products of Jacobians: ∂h_t/∂h_k = ∏_{i=k+1}^{t} diag(tanh'(a_i)) W_hh
- Vanishing gradients: when γ = ‖W_hh‖ < 1, gradients decay exponentially as γ^{t-k}
- Practical memory horizon: ~10-20 timesteps for vanilla RNNs
| Concept | Formula | Implication |
|---|---|---|
| Hidden update | h_t = tanh(W_hh h_{t-1} + W_xh x_t + b) | Recurrent processing |
| Gradient product | ∂h_t/∂h_k = ∏ Jacobians | Chain of dependencies |
| Vanishing rate | γ^{t-k} where γ = ||W_hh|| | Exponential decay if γ < 1 |
| 30-step gradient | 0.9^{30} ≈ 0.04 | Only 4% signal survives |
Looking Ahead: The problems we identified—vanishing gradients, lack of selective memory, fixed mixing—motivated the development of Long Short-Term Memory (LSTM) networks. In the next section, we will derive the LSTM equations and see how gates elegantly solve each of these problems.
With vanilla RNN limitations clearly understood, we are ready to appreciate the elegant engineering of LSTM cells.