Learning Objectives
By the end of this section, you will be able to:
- Understand why vanilla RNNs fail at learning long-term dependencies in sequences
- Explain the vanishing gradient problem mathematically using the chain rule and Jacobian products
- Analyze gradient flow through time during backpropagation through time (BPTT)
- Identify the conditions under which gradients vanish or explode in RNNs
- Recognize long-term dependency tasks and why they are challenging for RNNs
- Appreciate the historical context that led to the invention of LSTM and GRU
Why This Matters: The vanishing gradient problem was the central barrier preventing RNNs from achieving their potential for over a decade. Understanding this problem deeply is essential because: (1) it explains why vanilla RNNs fail on real-world tasks like machine translation and speech recognition, (2) it motivates every architectural choice in LSTM and GRU, and (3) it illustrates a fundamental challenge in training any deep network. Without this understanding, LSTM architecture appears arbitrary rather than a carefully designed solution.
The Story Behind Vanishing Gradients
Imagine you're teaching a student to write essays. You give them a 500-word essay to improve. When providing feedback, you might say: "Your conclusion contradicts what you wrote in the introduction." This requires connecting information from the beginning to the end—a long-term dependency.
Now imagine you can only whisper, and with each sentence the student reads, your voice gets quieter. By the time they reach the conclusion and try to connect it back to the introduction, your feedback has become inaudible. This is exactly what happens to gradients in RNNs.
The Promise and Problem of RNNs
Recurrent Neural Networks seemed like the perfect solution for sequential data. Their elegant idea: maintain a hidden state that accumulates information over time:

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)

In theory, h_T can remember everything important from x_1, x_2, \ldots, x_T. In practice, by the time we reach timestep 20 or 50, the RNN has "forgotten" what happened in the early timesteps. The culprit? Vanishing gradients during training.
The Core Insight
Why RNNs Are Different from Feedforward Networks
You might recall that deep feedforward networks also suffer from vanishing gradients. So what makes RNNs worse?
Weight Sharing Across Time
In a feedforward network, each layer has its own weight matrix. Even if gradients shrink through each layer, different layers can have different weight magnitudes that might compensate.
In an RNN, the same weight matrix is applied at every timestep. This creates a multiplicative tunnel:
| Network Type | Gradient Path | Key Difference |
|---|---|---|
| Feedforward | W₁ × W₂ × W₃ × ... × Wₙ | Different weights per layer |
| RNN | W_hh × W_hh × W_hh × ... × W_hh | Same weight multiplied T times |
When you multiply the same matrix by itself many times, the result depends entirely on its eigenvalues:
- If the largest eigenvalue : the product shrinks exponentially → vanishing gradients
- If : the product grows exponentially → exploding gradients
- If : the product stays bounded → ideal (but rare)
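This behavior is easy to verify numerically. The sketch below (plain NumPy; the matrix size, probe vector, and T = 50 are arbitrary illustrative choices) rescales a random matrix to a chosen spectral radius and applies it 50 times:

```python
import numpy as np

def repeated_apply_norm(spectral_radius, T=50, n=4, seed=0):
    """Apply the same matrix W to a vector T times and return the final norm."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))  # set largest |eigenvalue|
    v = np.ones(n)
    for _ in range(T):
        v = W @ v
    return float(np.linalg.norm(v))

for r in (0.9, 1.0, 1.1):
    print(f"spectral radius {r}: ||W^50 v|| = {repeated_apply_norm(r):.3e}")
```

With radius 0.9 the norm collapses toward zero and with 1.1 it blows up, mirroring the vanishing/exploding dichotomy above.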
The Eigenvalue Trap
The Mathematics of Backpropagation Through Time
Let's derive exactly what happens to gradients as they flow backward through time. This mathematical understanding is crucial for appreciating why LSTM's architecture works.
Setting Up the Problem
Consider a simple RNN processing a sequence of length T:

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t), \quad t = 1, \ldots, T

Suppose we have a loss L computed at the final time T. We want to compute \frac{\partial L}{\partial h_1}—how does changing the first hidden state affect the final loss?
Applying the Chain Rule
By the chain rule, we need to trace how h_1 influences h_2, then h_3, and so on until h_T:

\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \cdot \frac{\partial h_T}{\partial h_{T-1}} \cdot \frac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \frac{\partial h_2}{\partial h_1}

This can be written compactly as:

\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}
The Jacobian at Each Step
Each term \frac{\partial h_t}{\partial h_{t-1}} is a Jacobian matrix. For our RNN:

\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(\tanh'(z_t)) \cdot W_{hh}

where z_t = W_{hh} h_{t-1} + W_{xh} x_t is the pre-activation value, and \tanh'(z) = 1 - \tanh^2(z).
Critical Observation
- Activation derivative: \tanh'(z) \leq 1 always (with max = 1 at z = 0)
- Weight matrix: the same W_{hh} appears at every step, with some spectral norm \|W_{hh}\|
The Product of Jacobians
The full gradient involves a product of T - 1 Jacobians:

\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \text{diag}(\tanh'(z_t)) \cdot W_{hh}

In the worst case (all activations saturated, \tanh'(z) \ll 1), this product shrinks exponentially. In the best case (all activations at zero, \tanh'(z) = 1), the growth depends solely on \|W_{hh}\|:

\left\| \frac{\partial L}{\partial h_1} \right\| \leq \left\| \frac{\partial L}{\partial h_T} \right\| \cdot \|W_{hh}\|^{T-1}
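We can watch this bound in action by forming the Jacobian product numerically. A minimal NumPy sketch (the dimensions, random inputs, and the 0.9 spectral norm are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n, T = 8, 30

W_hh = rng.standard_normal((n, n))
W_hh *= 0.9 / np.linalg.svd(W_hh, compute_uv=False)[0]  # spectral norm ||W_hh|| = 0.9
W_xh = rng.standard_normal((n, n)) * 0.5

# Forward pass: cache the pre-activations z_t = W_hh h_{t-1} + W_xh x_t
h, zs = np.zeros(n), []
for _ in range(T):
    z = W_hh @ h + W_xh @ rng.standard_normal(n)
    zs.append(z)
    h = np.tanh(z)

# Accumulate the product of per-step Jacobians diag(tanh'(z_t)) @ W_hh
J = np.eye(n)
for z in zs[1:]:
    J = np.diag(1.0 - np.tanh(z) ** 2) @ W_hh @ J

print(f"||product of {T-1} Jacobians|| = {np.linalg.norm(J, 2):.3e}")
print(f"upper bound 0.9^{T-1}         = {0.9 ** (T - 1):.3e}")
```

The measured norm always sits at or below the 0.9^{T-1} bound, and the saturated tanh' factors typically push it well below.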
Quick Check
If ||W_hh|| = 0.9 and the sequence length is T = 50, what is the upper bound on the gradient magnitude ratio?
Interactive: Gradient Flow in RNNs
Explore how gradients decay as they propagate backward through time. Adjust the sequence length, weight scale, and activation function to see how these factors affect gradient flow.
RNN Gradient Flow Through Time
Watch how gradients propagate backward through time during backpropagation through time (BPTT). The gradient at time t=1 determines how much the earliest hidden states can influence learning.
The Mathematics of Vanishing Gradients in RNNs
For a simple RNN: h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)
The gradient from time T back to time t is a product of T - t per-step factors, each bounded by \tanh'(z) \cdot \|W_{hh}\|.
For example, with 8 timesteps and weight scale 0.80:
- Max gradient factor per step: 0.80
- After 8 steps: (0.80 \times 1)^8 \approx 0.168
Why This Is Devastating for RNNs
- RNNs apply the same weights at every timestep
- Gradients are multiplied by W_{hh} for each timestep back
- A sequence of length T multiplies the gradient T - 1 times
- If \|W_{hh}\| < 1: vanishing (can't learn long-term)
- If \|W_{hh}\| > 1: exploding (training becomes unstable)
Key Insight: Unlike feedforward networks where we can use different weights per layer, RNNs share the same weight matrix across all timesteps. This weight sharing creates a "multiplicative tunnel" where gradients must pass through many identical transformations, making vanishing/exploding gradients inevitable for long sequences. This is why LSTM and GRU were invented—they create "shortcut paths" for gradients.
What to Explore
- Sequence length: Increase to 15+ timesteps and observe how quickly gradients vanish
- Weight scale: Try values below 1.0 (vanishing) and above 1.0 (exploding)
- Activation function: Compare sigmoid (max grad = 0.25) vs tanh (max grad = 1.0)
- Animation: Watch the gradient flow backward from the loss to early timesteps
The Long-Term Dependency Problem
The vanishing gradient problem has a direct practical consequence: RNNs cannot learn long-term dependencies—relationships between events that are far apart in a sequence.
Real-World Examples
| Task | Long-Term Dependency | Why RNNs Fail |
|---|---|---|
| Machine Translation | Gender agreement: 'La mesa... ella es roja' | Subject and pronoun may be 20+ tokens apart |
| Language Modeling | Context: 'I grew up in France... I speak fluent ___' | Answer 'French' requires remembering early context |
| Speech Recognition | Speaker identity across long utterances | Speaker characteristics from seconds ago needed |
| Music Generation | Returning to a theme after development | Musical structure spans hundreds of notes |
| Code Analysis | Matching opening and closing braces | Brackets may be nested deeply |
The Subject-Verb Agreement Test
One classic test for long-term dependencies is subject-verb agreement in natural language. The network must remember whether the subject was singular or plural to correctly predict the verb form.
Long-Term Dependency Problem
RNNs struggle with subject-verb agreement when the subject and verb are far apart. Watch how the gradient signal from the verb weakens as it travels back to the subject.
For example, when the subject 'cat' (singular) sits just 1 token away from its verb, agreement is easy for an RNN; as intervening words push that distance to 10 or more tokens, the gradient signal from the verb fades before it reaches the subject.
The Tragic Irony: Vanilla RNNs are theoretically capable of learning long-term dependencies—they have the representational power. But the training algorithm (BPTT) cannot find the right weights because the gradient signal needed to learn these connections effectively disappears.
Mathematical Analysis: When Do Gradients Vanish?
Let's be more precise about the conditions under which vanishing gradients occur.
Sufficient Condition for Vanishing
Theorem (Bengio et al., 1994): If the largest singular value of the recurrent Jacobian satisfies

\sigma_{\max}\left( \frac{\partial h_t}{\partial h_{t-1}} \right) < 1 \quad \text{for all } t

then the gradient \frac{\partial L}{\partial h_1} vanishes exponentially as T \to \infty.
Sufficient Condition for Exploding
Conversely, if

\sigma_{\max}\left( \frac{\partial h_t}{\partial h_{t-1}} \right) > 1 \quad \text{at every step}

then the gradient can explode exponentially.
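These conditions are straightforward to check numerically for a single step. A small NumPy sketch (the dimension and the 0.95 spectral norm are illustrative assumptions):

```python
import numpy as np

def one_step_sigma_max(W_hh, z):
    """Largest singular value of the one-step Jacobian diag(tanh'(z)) @ W_hh."""
    J = np.diag(1.0 - np.tanh(z) ** 2) @ W_hh
    return float(np.linalg.svd(J, compute_uv=False)[0])

rng = np.random.default_rng(0)
n = 6
W_hh = rng.standard_normal((n, n))
W_hh *= 0.95 / np.linalg.svd(W_hh, compute_uv=False)[0]  # spectral norm 0.95

print(one_step_sigma_max(W_hh, np.zeros(n)))       # best case tanh'(0) = 1: exactly 0.95
print(one_step_sigma_max(W_hh, 2.0 * np.ones(n)))  # saturated: far below 0.95
```

Both values are below 1, so the theorem's sufficient condition for vanishing holds for this matrix, and saturation only makes it worse.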
The Sigmoid Activation Makes It Worse
If we use sigmoid activation instead of tanh:

\sigma'(z) = \sigma(z)(1 - \sigma(z)) \leq 0.25

The maximum derivative is only 0.25! The per-step factor is then at most 0.25 \cdot \|W_{hh}\|, so gradients are guaranteed to shrink by at least 4× per timestep unless the weight norm exceeds 4, far beyond any sensible initialization.
| Activation | Max Derivative | After 10 Steps | After 50 Steps |
|---|---|---|---|
| Sigmoid | 0.25 | ≈ 10⁻⁶ | ≈ 10⁻³⁰ |
| Tanh | 1.0 | Depends on W | Depends on W |
| ReLU | 1.0 or 0 | 1.0 or 0 | 1.0 or 0 |
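The table's sigmoid numbers follow directly from the 0.25 bound, which a few lines of NumPy confirm:

```python
import numpy as np

# Sigmoid derivative sigma'(z) = sigma(z) * (1 - sigma(z)) peaks at z = 0
z = np.linspace(-6.0, 6.0, 10001)
s = 1.0 / (1.0 + np.exp(-z))
print(np.max(s * (1.0 - s)))  # 0.25, attained at z = 0

# Even in the best case, that factor compounds per timestep
print(f"{0.25 ** 10:.1e}")    # 9.5e-07, i.e. about 10^-6
print(f"{0.25 ** 50:.1e}")    # 7.9e-31, i.e. about 10^-30
```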
Why Tanh Is Preferred Over Sigmoid in RNNs
Historical Context
The vanishing gradient problem wasn't just an academic curiosity—it was a major roadblock that halted progress in sequence modeling for years. Understanding this history helps appreciate why LSTM was such a breakthrough.
The Journey to Solving Vanishing Gradients
The vanishing gradient problem was a major barrier in deep learning for over a decade. Here's how researchers identified and eventually solved it.
Backpropagation Popularized
1986: Rumelhart, Hinton, and Williams popularize backpropagation, enabling training of multi-layer networks.
Elman Networks (Simple RNNs)
1990: Jeffrey Elman introduces simple recurrent networks for processing sequential data.
Vanishing Gradient Problem Identified
1991: Sepp Hochreiter's diploma thesis provides the first rigorous analysis of why gradients vanish in RNNs, explaining why they cannot learn long-term dependencies.
Further Analysis of Gradient Problems
1994: Bengio, Simard, and Frasconi publish an influential analysis showing the fundamental difficulty of learning long-term dependencies with gradient descent.
LSTM Invented
1997: Hochreiter and Schmidhuber introduce Long Short-Term Memory networks with gates and cell state, specifically designed to solve the vanishing gradient problem.
Forget Gate Added to LSTM
2000: Gers, Schmidhuber, and Cummins add the forget gate to LSTM, making it more flexible and practical for real applications.
GRU Introduced
2014: Cho et al. introduce Gated Recurrent Units, a simplified alternative to LSTM with comparable performance but fewer parameters.
Sequence-to-Sequence Models
2014: Sutskever, Vinyals, and Le demonstrate LSTM-based encoder-decoder models for machine translation, showing the practical payoff of addressing vanishing gradients.
Transformers: A New Paradigm
2017: Vaswani et al. introduce Transformers with attention mechanisms, bypassing recurrence entirely and enabling direct gradient flow between any positions.
The Pattern: It took 6 years from identifying the vanishing gradient problem (1991) to inventing LSTM (1997). Another 20 years passed before Transformers (2017) offered a radically different solution by eliminating recurrence entirely. Great breakthroughs often require both deep understanding of the problem and creative architectural innovation.
Detecting Vanishing Gradients in Practice
How do you know if your RNN is suffering from vanishing gradients? Here are practical detection methods.
Symptoms of Vanishing Gradients
| Symptom | How to Detect | What It Means |
|---|---|---|
| Training stalls early | Loss plateaus after few epochs | Early layers stopped updating |
| Short-term only | Model predicts well locally but fails globally | Long-term dependencies not learned |
| Gradient norm drops | Monitor gradient norms per layer | Gradient signal is dying |
| Weight stasis | Early layer weights barely change | Gradient too small to update |
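One way to instrument the "gradient norm drops" symptom directly is to run BPTT by hand on a toy tanh RNN and log \|\partial L / \partial h_t\| at every timestep. A minimal NumPy sketch (the toy loss, sizes, and 0.9 weight scale are assumptions for illustration; with a real framework you would hook into the autograd graph instead):

```python
import numpy as np

def bptt_gradient_norms(T=30, n=16, weight_scale=0.9, seed=0):
    """Forward + backward pass of a toy tanh RNN; returns ||dL/dh_t|| per timestep."""
    rng = np.random.default_rng(seed)
    W_hh = rng.standard_normal((n, n))
    W_hh *= weight_scale / np.linalg.svd(W_hh, compute_uv=False)[0]
    W_xh = rng.standard_normal((n, n)) * 0.3

    # Forward pass, caching pre-activations for the backward pass
    h, zs = np.zeros(n), []
    for _ in range(T):
        z = W_hh @ h + W_xh @ rng.standard_normal(n)
        zs.append(z)
        h = np.tanh(z)

    # Toy loss L = sum(h_T), so dL/dh_T is a vector of ones
    grad = np.ones(n)
    norms = [np.linalg.norm(grad)]
    for z in reversed(zs[1:]):          # backprop through h_T ... h_2
        grad = W_hh.T @ ((1.0 - np.tanh(z) ** 2) * grad)
        norms.append(np.linalg.norm(grad))
    return norms[::-1]                  # norms[0] is the earliest timestep

norms = bptt_gradient_norms()
print(f"||dL/dh_T|| = {norms[-1]:.2e}, ||dL/dh_1|| = {norms[0]:.2e}")
```

A healthy run keeps the ratio norms[0] / norms[-1] within a few orders of magnitude; here it collapses, which is exactly the signal to watch for when monitoring per-layer gradient norms.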
Why LSTM Was Needed
By 1997, the deep learning community had tried many approaches to fix the vanishing gradient problem in RNNs:
- Better activation functions: Tanh instead of sigmoid helped, but didn't solve the problem
- Careful initialization: Orthogonal initialization of W_{hh} with eigenvalues near 1
- Gradient clipping: Prevents explosion but doesn't help with vanishing
- Skip connections: Early attempts, but not formalized for RNNs
None of these fully solved the problem. The fundamental issue remained: multiplying by the same matrix repeatedly will always lead to exponential behavior.
The Key Insight Behind LSTM
Hochreiter and Schmidhuber realized that the solution wasn't to prevent gradient decay—it was to create a parallel pathway where gradients could flow unchanged:
The LSTM Solution: Instead of forcing all information through multiplicative transformations, LSTM creates a "cell state" that uses additive updates. The gradient can flow through this cell state almost unchanged, like water through a pipe rather than through a series of filters.
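The difference between the two pathways can be stated in two lines. A multiplicative recurrence h_t = a \cdot h_{t-1} has gradient a^{T-1} back to the start, while an additive, LSTM-style recurrence c_t = c_{t-1} + g_t has gradient exactly 1. A scalar sketch (a = 0.9 and T = 50 are illustrative values):

```python
# Multiplicative recurrence h_t = a * h_{t-1}: dh_T/dh_1 = a^(T-1)
a, T = 0.9, 50
mult_grad = a ** (T - 1)

# Additive recurrence c_t = c_{t-1} + g_t: dc_T/dc_1 = 1, independent of T
add_grad = 1.0

print(f"multiplicative path: {mult_grad:.2e}")  # 5.73e-03
print(f"additive path:       {add_grad:.2e}")   # 1.00e+00
```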
In the next section, we'll see exactly how LSTM implements this idea with its famous gate mechanisms.
Summary
The vanishing gradient problem is the central challenge that motivated the development of modern sequence models. Here are the key takeaways:
Core Concepts
| Concept | Key Point | Implication |
|---|---|---|
| Weight sharing | RNNs use same W_hh at every timestep | Creates multiplicative gradient tunnel |
| Jacobian product | Gradient = product of T-1 Jacobians | Exponential decay or growth |
| Eigenvalue condition | ||W_hh|| < 1 → vanishing | Most initializations lead to vanishing |
| Activation saturation | tanh'(z) < 1 when |z| large | Makes vanishing worse |
| Long-term dependencies | Information far apart in sequence | Cannot be learned with vanishing gradients |
Key Equations
- RNN forward: h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)
- Gradient chain: \frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}
- Jacobian: \frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(\tanh'(z_t)) \cdot W_{hh}
- Bound: \left\| \frac{\partial L}{\partial h_1} \right\| \leq \left\| \frac{\partial L}{\partial h_T} \right\| \cdot \|W_{hh}\|^{T-1}
Looking Forward
In the next section, we'll see how Long Short-Term Memory (LSTM) networks solve this problem with three key innovations:
- Cell state: A separate pathway using additive updates for persistent memory
- Gates: Learned mechanisms to control information flow (forget, input, output)
- Constant error carousel: Gradients can flow unchanged through the cell state
Knowledge Check
Test your understanding of the vanishing gradient problem:
Why do vanilla RNNs suffer from vanishing gradients more severely than feedforward networks?
Exercises
Conceptual Questions
- Explain why the vanishing gradient problem is more severe in RNNs than in deep feedforward networks, even though both use backpropagation.
- If \|W_{hh}\| = 0.9 and the sequence length is 100, estimate the gradient magnitude at the first timestep relative to the last. What happens if \|W_{hh}\| = 1.1?
- Why doesn't gradient clipping solve the vanishing gradient problem? What does it solve?
- A researcher proposes using ReLU activation instead of tanh in an RNN. Analyze the pros and cons of this approach for gradient flow.
Mathematical Exercises
- Jacobian Calculation: For a 2D hidden state with a 2×2 recurrent weight matrix W_{hh} of your choice, compute the Jacobian \frac{\partial h_t}{\partial h_{t-1}} when z_t = 0.
- Eigenvalue Analysis: For the weight matrix in Exercise 1, compute its eigenvalues. Based on these, predict whether gradients will vanish or explode over long sequences.
- Gradient Bound: Prove that for sigmoid activation, the gradient at timestep 1 is bounded by \left\| \frac{\partial L}{\partial h_T} \right\| \cdot (0.25 \, \|W_{hh}\|)^{T-1}.
Coding Exercises
- Gradient Flow Visualization: Implement a function that trains a simple RNN on a synthetic sequence task and plots gradient norms at each layer over training steps. Compare sequences of length 10, 50, and 100.
- Long-Term Dependency Task: Create a "copy memory" task where the network must remember a pattern from the beginning of the sequence and reproduce it at the end. Show that vanilla RNNs fail when the delay exceeds 20-30 timesteps.
- Eigenvalue Experiment: Initialize W_{hh} with different spectral norms (0.8, 1.0, 1.2) and measure gradient norms after 50 timesteps. Plot the relationship between spectral norm and gradient magnitude.
Solution Hints
- Exercise 1: When z_t = 0, \tanh'(0) = 1, so the Jacobian simplifies to just W_{hh}.
- Exercise 2: The eigenvalues of a 2×2 matrix can be found by solving the characteristic polynomial \det(W_{hh} - \lambda I) = 0.
- Coding Exercise 2: The "copy memory" task is a classic benchmark. Present a pattern, then N blank steps, then ask for the pattern back.
Challenge Project
Build a Gradient Flow Dashboard: Create an interactive visualization tool that shows real-time gradient flow through an RNN during training. Include:
- Gradient magnitude at each timestep (color-coded heatmap)
- Eigenvalue spectrum of W_{hh} over training
- Comparison between vanilla RNN and LSTM gradient flows
- Automatic detection and alerting when gradients vanish below a threshold
Now that you understand why vanilla RNNs fail, you're ready to learn how LSTM solves this problem. In the next section, we'll dive deep into the Long Short-Term Memory architecture and see exactly how its gates create the "constant error carousel" that enables learning of long-term dependencies.