Chapter 15

Long Short-Term Memory


Learning Objectives

By the end of this section, you will be able to:

  1. Understand the LSTM architecture including its cell state and gating mechanisms
  2. Explain the purpose of each gate (forget, input, output) and how they control information flow
  3. Write the complete LSTM equations and understand what each component computes
  4. Explain why LSTM solves the vanishing gradient problem through additive cell state updates
  5. Visualize gradient flow through the "constant error carousel"
  6. Develop intuition for when and how LSTM learns to remember or forget information
Why This Matters: LSTM was a breakthrough that made sequence modeling practical. Before LSTM, training RNNs on sequences longer than 10-20 steps was nearly impossible. After LSTM, neural networks could learn dependencies spanning hundreds of timesteps. This enabled transformative applications in machine translation, speech recognition, and language modeling. Understanding LSTM deeply is essential because: (1) it remains widely used in production systems, (2) its gating mechanisms inspired later architectures including Transformers, and (3) it beautifully illustrates how architectural choices can solve fundamental training problems.

The Story Behind LSTM

In 1997, Sepp Hochreiter and Jürgen Schmidhuber published a paper that would change the course of deep learning: "Long Short-Term Memory." The title itself captures the key insight: they wanted to enable neural networks to have both long-term memory (information that persists over many timesteps) and short-term flexibility (the ability to update and use that information appropriately).

The Problem They Were Solving

Recall from the previous section that vanilla RNNs suffer from the vanishing gradient problem. The gradient from time $T$ back to time $t$ involves a product of Jacobians:

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} W_{hh}^T \cdot \text{diag}(\tanh'(z_k))$$

Each factor in this product is typically less than 1, so the gradient decays exponentially. Hochreiter and Schmidhuber's key insight was: what if we could create a pathway where the gradient multiplication factor is exactly 1?
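A quick numerical sketch (with 0.9 as an illustrative per-step factor) shows how fast such a product collapses:

```python
# Multiply fifty per-step gradient factors, each slightly below 1
grad = 1.0
for _ in range(50):
    grad *= 0.9
print(f"{grad:.1e}")  # 5.2e-03
```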

The Breakthrough Idea

The solution came from a simple but profound observation: addition doesn't shrink gradients the way multiplication does. If we update a quantity by adding to it rather than multiplying:

$$C_t = C_{t-1} + \text{(something)}$$

Then the gradient $\frac{\partial C_t}{\partial C_{t-1}} = 1$. The gradient flows through unchanged! This is the foundation of the LSTM cell state.
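This is easy to verify with PyTorch's autograd; the values below are arbitrary illustrations:

```python
import torch

# Additive update: C_t = C_{t-1} + (something)
c_prev = torch.tensor([1.5], requires_grad=True)
update = torch.tensor([0.25])  # stands in for "(something)"
c_t = c_prev + update

c_t.backward()
print(c_prev.grad)  # tensor([1.]): the gradient passes through unchanged
```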

The Constant Error Carousel

Hochreiter and Schmidhuber called this the "Constant Error Carousel" (CEC). When the network learns to keep the cell state unchanged ($C_t = C_{t-1}$), errors (gradients) can flow backward through time without decay. The network learns when to preserve and when to update its memory.

The Key Innovation: The Cell State

The LSTM introduces a new quantity called the cell state, denoted $C_t$. This cell state runs through the entire sequence like a "memory highway," carrying information with minimal modification.

Two Parallel Paths

Unlike vanilla RNNs, which have only the hidden state $h_t$, LSTM maintains two parallel quantities:

| Quantity | Symbol | Role | Update Mechanism |
|---|---|---|---|
| Cell state | Cₜ | Long-term memory storage | Additive updates (preserves gradients) |
| Hidden state | hₜ | Short-term output/representation | Multiplicative gating of cell state |

The cell state $C_t$ is the key to LSTM's success. It can remain unchanged across many timesteps if needed, or it can be selectively updated. The hidden state $h_t$ is derived from the cell state but filtered through an "output gate," so the LSTM can remember things internally that it doesn't expose in its outputs.

Quick Check

What is the key difference between the cell state update in LSTM compared to the hidden state update in vanilla RNNs?


LSTM Cell Architecture

[Figure: the LSTM cell. The cell state runs along the top as a path from $C_{t-1}$ to $C_t$; the forget ($f_t$), input ($i_t$), and output ($o_t$) gates, each a sigmoid over the concatenated $[h_{t-1}, x_t]$, together with the tanh candidate $\tilde{C}_t$, control what enters and leaves it, producing the hidden state $h_t$.]

Information Flow Order

  1. Forget Gate
  2. Input Gate
  3. Cell State Update
  4. Output Gate

The LSTM Key Insight: The cell state (top horizontal line) acts as a "memory highway" that allows information to flow unchanged through time. The gates control what gets added to or removed from this highway. Because the cell state update uses addition rather than multiplication, gradients can flow backward through many timesteps without vanishing.


The Four Components of LSTM

An LSTM cell has four key components: three gates (forget, input, output) and a candidate cell state generator. Each plays a distinct role in controlling information flow; the subsections below also cover the cell state update that combines them.

1. The Forget Gate (fₜ)

The forget gate decides what information from the previous cell state should be discarded. It looks at the previous hidden state $h_{t-1}$ and the current input $x_t$ and outputs a value between 0 and 1 for each element of $C_{t-1}$:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
| Gate Value | Meaning | Effect on Cₜ₋₁ |
|---|---|---|
| fₜ ≈ 0 | "Forget this" | Information is discarded |
| fₜ ≈ 1 | "Remember this" | Information is preserved |
| fₜ = 0.5 | "Partially remember" | Information is attenuated by 50% |

Why 'Forget' Gate?

The name is slightly misleading. The forget gate controls what to keep, not what to forget. When $f_t = 1$, everything is kept; when $f_t = 0$, everything is forgotten. The gate "forgets" when it outputs low values.
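The table rows above correspond to simple element-wise arithmetic; a minimal sketch with made-up values:

```python
import torch

c_prev = torch.tensor([4.0, 4.0, 4.0])  # three identical memory cells
f_t = torch.tensor([0.0, 0.5, 1.0])     # forget / attenuate / keep
print(f_t * c_prev)  # tensor([0., 2., 4.])
```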

2. The Input Gate (iₜ)

The input gate decides which new information to add to the cell state. Like the forget gate, it uses a sigmoid to output values between 0 and 1:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

3. The Candidate Cell State (C̃ₜ)

While the input gate decides how much new information to add, the candidate cell state determines what new information to potentially add. It uses tanh to produce values between -1 and 1:

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Why tanh for the candidate?

The candidate uses tanh (range: [-1, 1]) rather than sigmoid (range: [0, 1]) because we want to be able to both increase and decrease the cell state. Positive values add to the cell state, negative values subtract from it.
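A small sketch (illustrative values) shows a negative candidate pulling the cell state down:

```python
import torch

c_prev = torch.tensor([1.0])
i_t = torch.tensor([1.0])                   # input gate fully open
c_tilde = torch.tanh(torch.tensor([-2.0]))  # negative candidate, about -0.964
c_t = c_prev + i_t * c_tilde                # the cell state decreases
print(c_t)  # tensor([0.0360])
```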

4. The Cell State Update

Now we can update the cell state. This is where the magic happens:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

This equation says: take the old cell state, forget some parts (multiply by $f_t$), and add new information (scaled by $i_t$). The $\odot$ symbol denotes element-wise (Hadamard) multiplication.

Critical Observation

Notice the plus sign between the two terms! This is what enables gradients to flow backward through time. When $f_t \approx 1$ and $i_t \approx 0$, we have $C_t \approx C_{t-1}$, and the gradient passes through unchanged.
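Autograd confirms the observation directly; the gate values here are illustrative constants:

```python
import torch

f_t = torch.tensor([0.9])
i_t = torch.tensor([0.1])
c_tilde = torch.tensor([0.3])
c_prev = torch.tensor([2.0], requires_grad=True)

c_t = f_t * c_prev + i_t * c_tilde
c_t.backward()
print(c_prev.grad)  # tensor([0.9000]): exactly the forget gate value
```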

5. The Output Gate (oₜ)

Finally, the output gate controls what part of the cell state is exposed as the hidden state output:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

The hidden state output is computed by applying tanh to the cell state (to push values to [-1, 1]) and then filtering through the output gate:

$$h_t = o_t \odot \tanh(C_t)$$

Why Filter the Output?

The output gate allows the LSTM to "hide" internal memories. The network can remember something in $C_t$ without revealing it in $h_t$. This is useful when information is needed later but shouldn't influence immediate outputs.
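A two-cell sketch (values chosen for illustration) makes the hiding concrete:

```python
import torch

c_t = torch.tensor([3.0, 3.0])  # two memory cells with identical content
o_t = torch.tensor([0.0, 1.0])  # hide the first cell, expose the second
h_t = o_t * torch.tanh(c_t)
print(h_t)  # tensor([0.0000, 0.9951])
```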

Worked Example: One LSTM Step

The gates work together to update the cell state and produce the hidden state output. Suppose the pre-activations (i.e., $W \cdot [h_{t-1}, x_t] + b$ for each gate) are 0.260, 0.180, 0.300, and 0.460, and the previous cell state is $C_{t-1} = 0.800$. Then:

  • Forget gate: $f_t = \sigma(0.260) = 0.565$, so $C_{t-1}$ is partially kept
  • Input gate: $i_t = \sigma(0.180) = 0.545$, so some new information is stored
  • Candidate: $\tilde{C}_t = \tanh(0.300) = 0.291$, the new candidate memory value (range: -1 to 1)
  • Output gate: $o_t = \sigma(0.460) = 0.613$, so part of the cell state is output

The state updates follow:

$$C_t = 0.565 \times 0.800 + 0.545 \times 0.291 = 0.610$$
$$h_t = 0.613 \times \tanh(0.610) = 0.334$$

The updated long-term memory $C_t = 0.610$ changes by -0.190 from $C_{t-1}$; the new hidden state $h_t = 0.334$ is both the output and the input to the next timestep.

Try this: set the forget gate to favor keeping ($f_t > 0.9$) and the input gate to favor ignoring new information ($i_t < 0.1$). The cell state $C_t$ stays close to $C_{t-1}$. This is how LSTM "remembers" information across many timesteps!
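The example's numbers can be reproduced in a few lines of PyTorch (the pre-activation values 0.260, 0.180, 0.300, 0.460 and the previous cell state 0.800 are the example's inputs):

```python
import torch

f_t, i_t, o_t = torch.sigmoid(torch.tensor([0.260, 0.180, 0.460]))
c_tilde = torch.tanh(torch.tensor(0.300))
c_prev = torch.tensor(0.800)

c_t = f_t * c_prev + i_t * c_tilde
h_t = o_t * torch.tanh(c_t)
print(f"C_t = {c_t:.3f}, h_t = {h_t:.3f}")  # C_t = 0.610, h_t = 0.334
```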


Complete Mathematical Formulation

Let's bring all the equations together. Given input $x_t$ at time $t$, previous hidden state $h_{t-1}$, and previous cell state $C_{t-1}$, the LSTM computes:

Gate Computations

$$\begin{aligned} f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) && \text{(Forget gate)} \\ i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) && \text{(Input gate)} \\ \tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) && \text{(Candidate cell state)} \\ o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) && \text{(Output gate)} \end{aligned}$$

State Updates

$$\begin{aligned} C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(Cell state update)} \\ h_t &= o_t \odot \tanh(C_t) && \text{(Hidden state output)} \end{aligned}$$

Notation Summary

| Symbol | Description | Dimensions |
|---|---|---|
| xₜ | Input at timestep t | d (input dimension) |
| hₜ | Hidden state (output) | n (hidden dimension) |
| Cₜ | Cell state (internal memory) | n (hidden dimension) |
| fₜ, iₜ, oₜ | Gate activations | n (hidden dimension) |
| C̃ₜ | Candidate cell state | n (hidden dimension) |
| Wf, Wᵢ, WC, Wₒ | Weight matrices | n × (n + d) |
| bf, bᵢ, bC, bₒ | Bias vectors | n |
| σ | Sigmoid function (outputs 0 to 1) | |
| tanh | Hyperbolic tangent (outputs -1 to 1) | |
| ⊙ | Element-wise multiplication | |
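The table implies $4n(n+d)$ weights plus biases, which can be checked against PyTorch's built-in nn.LSTMCell (note that PyTorch stores two bias vectors, bias_ih and bias_hh, where the formulation above uses one):

```python
import torch.nn as nn

d, n = 64, 128  # input and hidden dimensions
cell = nn.LSTMCell(d, n)
total = sum(p.numel() for p in cell.parameters())

# Four gates, each an n x (n + d) weight block, plus two 4n bias vectors
assert total == 4 * n * (n + d) + 8 * n
print(total)  # 99328
```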
LSTM Cell Implementation from Scratch

lstm_cell.py

LSTM Cell Class

We implement the LSTM cell as an nn.Module, which manages parameters and enables integration with PyTorch's autograd system.

Combined Weight Matrix

Instead of four separate weight matrices (W_f, W_i, W_C, W_o), we use a single input-to-hidden matrix of size (4 * hidden_size, input_size) and a single hidden-to-hidden matrix of size (4 * hidden_size, hidden_size). This is more memory-efficient and lets all four gate pre-activations be computed with one matrix multiplication per input.

Forget Gate Bias Initialization

A common practice is to initialize the forget gate bias to 1, which makes the initial forget gate output close to 1 (after sigmoid). This encourages the network to remember information by default early in training.

Gate Computation

We compute all four gate pre-activations in one go and then split the result with chunk. This is mathematically equivalent to computing each gate separately but more efficient.

Gate Activations

The three gates (f, i, o) use sigmoid to produce values in [0, 1] for gating. The candidate cell state uses tanh to produce values in [-1, 1].

Cell State Update

This is the critical equation: C_t = f_t * C_{t-1} + i_t * C_tilde_t. The addition (not just multiplication) is what allows gradients to flow through time.

Hidden State Output

The hidden state is the cell state (pushed through tanh) filtered by the output gate. This controls what information is exposed as the LSTM's output.

```python
import torch
import torch.nn as nn


class LSTMCell(nn.Module):
    """LSTM cell implemented from first principles."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Combined weight matrices for all four gates:
        # one multiplication computes the [f, i, C_tilde, o] pre-activations
        self.weight_ih = nn.Parameter(
            torch.randn(4 * hidden_size, input_size) / (input_size ** 0.5)
        )
        self.weight_hh = nn.Parameter(
            torch.randn(4 * hidden_size, hidden_size) / (hidden_size ** 0.5)
        )
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

        # Initialize forget gate bias to 1 (common practice):
        # this encourages the network to remember by default
        nn.init.ones_(self.bias[0:hidden_size])

    def forward(
        self,
        x: torch.Tensor,
        hx: tuple[torch.Tensor, torch.Tensor] | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            x: Input tensor of shape (batch, input_size)
            hx: Tuple of (h_prev, c_prev), each (batch, hidden_size)

        Returns:
            Tuple of (h_t, c_t)
        """
        batch_size = x.size(0)

        # Initialize hidden and cell state if not provided
        if hx is None:
            h_prev = torch.zeros(batch_size, self.hidden_size, device=x.device)
            c_prev = torch.zeros(batch_size, self.hidden_size, device=x.device)
        else:
            h_prev, c_prev = hx

        # Compute all gate pre-activations in one go:
        # gates = x @ W_ih^T + h_prev @ W_hh^T + b
        gates = x @ self.weight_ih.T + h_prev @ self.weight_hh.T + self.bias

        # Split into the four gate pre-activations
        f_gate, i_gate, c_tilde, o_gate = gates.chunk(4, dim=1)

        # Apply activations
        f_t = torch.sigmoid(f_gate)      # Forget gate
        i_t = torch.sigmoid(i_gate)      # Input gate
        c_tilde_t = torch.tanh(c_tilde)  # Candidate cell state
        o_t = torch.sigmoid(o_gate)      # Output gate

        # Cell state update: C_t = f_t * C_{t-1} + i_t * C_tilde_t
        c_t = f_t * c_prev + i_t * c_tilde_t

        # Hidden state output: h_t = o_t * tanh(C_t)
        h_t = o_t * torch.tanh(c_t)

        return h_t, c_t


# Example usage
def example_forward_pass():
    batch_size, seq_len = 32, 50
    input_size, hidden_size = 64, 128

    lstm_cell = LSTMCell(input_size, hidden_size)

    # Create a random input sequence
    x_sequence = torch.randn(seq_len, batch_size, input_size)

    # Process the sequence one timestep at a time
    outputs = []
    hx = None
    for t in range(seq_len):
        h, c = lstm_cell(x_sequence[t], hx)
        hx = (h, c)
        outputs.append(h)

    output_sequence = torch.stack(outputs, dim=0)
    print(f"Output shape: {output_sequence.shape}")
    # Output shape: torch.Size([50, 32, 128])

    return output_sequence
```

Why LSTM Solves the Vanishing Gradient Problem

Now let's analyze why LSTM's architecture solves the vanishing gradient problem. The key is understanding the gradient of the cell state.

Gradient Through the Cell State

Consider the cell state update equation:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Taking the partial derivative with respect to $C_{t-1}$ (treating the gates as constants for this analysis):

$$\frac{\partial C_t}{\partial C_{t-1}} = f_t$$

This is remarkable! The gradient from $C_t$ to $C_{t-1}$ is simply the forget gate value $f_t$. If the network learns to set $f_t \approx 1$, the gradient passes through unchanged.

Gradient Over Many Timesteps

For a loss $\mathcal{L}$ at time $T$, the gradient with respect to an early cell state $C_1$ is:

$$\frac{\partial \mathcal{L}}{\partial C_1} = \frac{\partial \mathcal{L}}{\partial C_T} \cdot \prod_{t=2}^{T} f_t$$

Compare this to the vanilla RNN, where the product involves $W_{hh}$ and $\tanh'$ terms. In LSTM:

| Network | Gradient Product | Typical Value | After 50 Steps |
|---|---|---|---|
| Vanilla RNN | ∏(Wₕₕ · tanh′) | ≈ 0.8 per step | ≈ 10⁻⁵ |
| LSTM | ∏(fₜ) | ≈ 0.95 per step | ≈ 0.08 |

The Power of Learning to Remember

The forget gate is not fixed; it is learned. When the network needs to remember something for a long time, it can learn to set $f_t \approx 1$ for those memory cells. When information is no longer needed, it can set $f_t \approx 0$ to clear the memory. This adaptive memory management is what makes LSTM so powerful.
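A quick experiment with PyTorch's built-in recurrent modules measures how much gradient reaches the first timestep when the loss sits at the last one. Exact magnitudes depend on the random initialization, so no particular values are claimed here:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, dim = 50, 4, 32

def first_step_grad(rnn: nn.Module) -> float:
    """Mean |gradient| at the first input when the loss depends only on the last output."""
    x = torch.randn(seq_len, batch, dim, requires_grad=True)
    out, _ = rnn(x)
    out[-1].sum().backward()
    return x.grad[0].abs().mean().item()

print(f"vanilla RNN: {first_step_grad(nn.RNN(dim, dim)):.2e}")
print(f"LSTM:        {first_step_grad(nn.LSTM(dim, dim)):.2e}")
```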

Quick Check

If the forget gate outputs fₜ = 0.98 at every timestep, what fraction of the original gradient remains after 100 timesteps?


Gradient Flow: Vanilla RNN vs. LSTM

The cell state in LSTM acts as a "gradient highway" that preserves gradients across many timesteps. Compare the gradient flowing backward through each architecture:

  • Vanilla RNN: $\frac{\partial \mathcal{L}}{\partial h_1} = \prod W_{hh}^T \cdot \text{diag}(\tanh')$. Each backward step multiplies by $W_{hh}$; if $\|W_{hh}\| < 1$, gradients shrink exponentially.
  • LSTM cell state: $\frac{\partial C_t}{\partial C_{t-1}} = f_t$. Each backward step multiplies only by the forget gate; when $f_t \approx 1$, gradients flow unchanged.

After many timesteps, the vanilla RNN gradient at the earliest timestep is typically orders of magnitude weaker than the LSTM's, which is why LSTM can still learn long-term dependencies.

The Constant Error Carousel: Hochreiter and Schmidhuber (1997) called the cell state a "constant error carousel" because when the forget gate $f_t \approx 1$ and the input gate $i_t \approx 0$, the cell state update becomes $C_t \approx C_{t-1}$. This means the gradient $\frac{\partial C_t}{\partial C_{t-1}} \approx 1$, allowing errors (gradients) to flow unchanged through many timesteps. The network learns when to preserve and when to update memory.


Building LSTM Intuition

Let's develop intuition for how LSTM processes information through several examples.

Example 1: Remembering a Subject for Verb Agreement

Consider the sentence: "The cat, which was chasing the mice in the garden, is hungry."

The LSTM needs to remember that the subject is "cat" (singular) to correctly predict "is" instead of "are":

  • When processing "cat": The input gate opens ($i_t \approx 1$) to store the singular subject information in the cell state.
  • During "which was chasing the mice in the garden": The forget gate stays high ($f_t \approx 1$) to preserve the subject information, while the input gate stays low ($i_t \approx 0$) to avoid overwriting it.
  • When predicting the verb: The output gate opens ($o_t \approx 1$) to access the stored subject information.

Example 2: Language Modeling with Context

"I grew up in France. ... ... ... I speak fluent French."

Even with many sentences in between, the LSTM can remember "France" and use it to predict "French":

  • Storing context: The word "France" triggers high input gate activation, storing location information.
  • Preserving over time: High forget gate values maintain this information across the intervening text.
  • Using context: When predicting the language, the network accesses the stored location to generate "French".

Example 3: Closing Brackets in Code

Matching opening and closing brackets requires counting: "((())" has three opening and two closing brackets, leaving one unclosed.

  • Opening bracket "(": Increment the cell state ($i_t \cdot \tilde{C}_t > 0$).
  • Closing bracket ")": Decrement the cell state ($i_t \cdot \tilde{C}_t < 0$).
  • Predicting the next token: If the cell state is positive, more closing brackets are needed.
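The counter described above can be written directly; this idealized function (a hypothetical helper, not part of any LSTM library) computes the trace a trained cell state would approximate:

```python
def bracket_depth(s: str) -> list[int]:
    """Running bracket count: +1 for each '(' and -1 for each ')'."""
    depth, trace = 0, []
    for ch in s:
        depth += 1 if ch == "(" else -1
        trace.append(depth)
    return trace

print(bracket_depth("((())"))  # [1, 2, 3, 2, 1]
```

The final positive value signals that one closing bracket is still owed.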

Why These Examples Work

In each example, the LSTM uses its gates to selectively store information (input gate), preserve it over time (forget gate), and access it when needed (output gate). The cell state acts as a persistent memory that can maintain information across many timesteps without decay.

Summary

LSTM is a carefully designed architecture that solves the vanishing gradient problem through several key innovations:

Key Concepts

ConceptInnovationPurpose
Cell StateAdditive updates (Cₜ = f·C + i·C̃)Enables gradient flow without decay
Forget GateLearned forgetting (fₜ = σ(...))Selective memory erasure
Input GateLearned writing (iₜ = σ(...))Selective memory writing
Output GateLearned reading (oₜ = σ(...))Selective memory exposure
Dual StateCell state + Hidden stateInternal memory vs. external output

Key Equations

  1. Gates: $f_t, i_t, o_t = \sigma(W \cdot [h_{t-1}, x_t] + b)$
  2. Candidate: $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
  3. Cell update: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
  4. Output: $h_t = o_t \odot \tanh(C_t)$
  5. Gradient: $\frac{\partial C_t}{\partial C_{t-1}} = f_t$ (the key insight!)

Looking Forward

In the next section, we'll implement a complete LSTM from scratch in PyTorch and train it on real sequence tasks. We'll see how the theory translates to practice and explore training techniques specific to LSTMs.


Knowledge Check

Test your understanding of LSTM architecture and mechanics:

What is the primary purpose of the forget gate in an LSTM?

Exercises

Conceptual Questions

  1. Explain why the forget gate is crucial for learning long-term dependencies. What would happen if we removed the forget gate and always had $f_t = 1$?
  2. The output gate allows the LSTM to "hide" information in the cell state without exposing it in the hidden state. Give an example scenario where this capability would be useful.
  3. Compare the number of learnable parameters in a vanilla RNN cell vs. an LSTM cell with the same hidden size. Why is LSTM more parameter-efficient than simply stacking multiple RNN layers?
  4. Explain the role of tanh in the cell state update. Why is tanh applied before the output gate but not before the forget/input operations on the cell state?

Mathematical Exercises

  1. Gradient Computation: Derive the gradient $\frac{\partial C_T}{\partial C_1}$ for a sequence of length $T$. Show that it equals $\prod_{t=2}^{T} f_t$.
  2. Full Gradient: The gradient through LSTM also flows through the hidden state path. Write out the complete gradient $\frac{\partial \mathcal{L}}{\partial C_1}$ including both the cell state and hidden state paths.
  3. Parameter Count: For an LSTM with input dimension $d$ and hidden dimension $n$, calculate the total number of learnable parameters.

Coding Exercises

  1. Gate Visualization: Implement a function that processes a sequence through an LSTM and records the forget gate values at each timestep. Visualize these as a heatmap for a sentence processing task.
  2. Memory Persistence Test: Create a synthetic task where the network must remember a binary value for $n$ timesteps before using it. Compare vanilla RNN and LSTM performance as $n$ increases.
  3. Gradient Analysis: Modify the LSTM implementation to track gradient magnitudes at each timestep during backpropagation. Compare to vanilla RNN and verify that LSTM maintains better gradient flow.

Solution Hints

  • Exercise 1: Register forward hooks on the LSTM layer to capture intermediate activations.
  • Exercise 2: The "copy memory" task: input a bit, then $n$ zeros, then a signal to output the original bit.
  • Exercise 3: Use retain_graph=True in backward() and register backward hooks on the cell state tensor.

Challenge Project

Build an LSTM Debugger: Create an interactive tool that visualizes LSTM internals during sequence processing. Include:

  • Real-time visualization of all gate activations
  • Cell state evolution over time
  • Attention-style visualization showing which inputs most strongly affected each output
  • Gradient magnitude tracking through both cell state and hidden state paths
  • Comparison mode to contrast vanilla RNN behavior

Now that you understand the LSTM architecture and why it works, you're ready to implement it from scratch. In the next section, we'll build a complete LSTM in PyTorch, train it on a real task, and explore practical considerations for getting LSTMs to work well in production.