Boo-AI — Master Artificial Intelligence by Building from Scratch

Why Vanilla RNNs Forget

A vanilla RNN has exactly one knob for memory: the hidden state $h_t$ . Every timestep it is completely overwritten by $h_t = \tanh(W_{hh} \, h_{t-1} + W_{xh} \, x_t + b)$ . The word “completely” is the problem. To remember something from ten steps ago, that information must survive ten consecutive multiplications by $W_{hh}$ and ten applications of the tanh nonlinearity.

Mathematically, the gradient that flows back from step $T$ to step $t$ picks up a factor $\prod_{k=t+1}^{T} W_{hh}^{\top} \, \text{diag}(\tanh'(\cdot))$ . If $\|W_{hh}\| < 1$ this product shrinks to zero — the vanishing gradient. If it is bigger than one, the product explodes. Either way, learning long-range dependencies is brittle.

The core problem: a vanilla RNN has no way to selectively preserve information. Every step is a full rewrite of memory through a squashing nonlinearity. There is no highway for gradients.

The Long Short-Term Memory cell, introduced by Hochreiter and Schmidhuber in 1997, is the classic fix. It adds a second state variable — the cell state $C_t$ — whose update is additive rather than multiplicative, and three learned gates that decide how that memory is read, written, and exposed. Once you understand those four pieces, the entire architecture falls into place.

The Cell State: A Memory Highway

Picture a conveyor belt running along the top of each LSTM cell. On that belt rides the cell state $C_t$ , a vector the same size as the hidden state. The entire mechanism of the LSTM is about carefully editing what is on that belt:

The forget gate $f_t$ scales $C_{t-1}$ element-wise, erasing parts of memory the cell no longer needs.
The input gate $i_t$ and the candidate $\tilde{C}_t$ together decide what new content to write onto the belt.
The output gate $o_t$ decides what part of the belt's current contents to expose through the hidden state $h_t$ .

The whole update is captured in one short, additive line: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ . That plus sign is the entire reason LSTMs work. When we backpropagate through it, the derivative $\partial C_t / \partial C_{t-1} = f_t$ has no squashing nonlinearity wrapped around it — so if the network learns to keep $f_t \approx 1$ , gradients flow unchanged across hundreds of steps.

Mental model: think of

C_t

as private scratchpad memory and

h_t

as the public API. Gates decide what gets written to the scratchpad, and a separate gate decides what is readable from the outside.

The Four Sub-Networks Inside a Cell

Every LSTM cell contains four tiny feed-forward layers, and each of them takes exactly the same input: the concatenation $z_t = [h_{t-1}, x_t]$ . Three of them produce sigmoid-gated vectors in $(0, 1)$ ; one produces a tanh-bounded candidate in $(-1, 1)$ .

Symbol	What it computes	Activation	Role
f_t	σ(W_f · [h_{t-1}, x_t] + b_f)	sigmoid	erase fraction per channel of old memory
i_t	σ(W_i · [h_{t-1}, x_t] + b_i)	sigmoid	write fraction per channel of new memory
C̃_t	tanh(W_C · [h_{t-1}, x_t] + b_C)	tanh	content of the new memory proposal
o_t	σ(W_o · [h_{t-1}, x_t] + b_o)	sigmoid	expose fraction per channel to h_t

The choice of activation is not arbitrary. Gates are switches, and sigmoid's range of $(0, 1)$ is exactly the vocabulary of switches — “let 90% through” is a meaningful thing to say. The candidate is content, and content needs a sign — the negative half of tanh lets a memory slot move in the opposite direction (e.g. a word like not subtracting from an accumulated sentiment).

Parameter count: with hidden size

H

and input size

I

, a single LSTM cell has

4H(H+I) + 4H

parameters — four times a plain RNN. The extra cost buys the gates.

The Forget Gate

$f_t = \sigma(W_f \, [h_{t-1}, x_t] + b_f)$ . For each dimension of the cell state, $f_t$ returns a number between 0 and 1 that will multiply $C_{t-1}$ :

$f_t[k] \approx 1$ — channel $k$ of memory passes through essentially unchanged.
$f_t[k] \approx 0$ — channel $k$ is erased.
Intermediate values fade memory gradually.

A classic intuition: suppose one channel of $C$ encodes the subject's gender (for subject-verb agreement). When a new sentence begins, the forget gate on that channel should go to zero — “flush the gender, a new subject is coming.” The network learns to do this from data.

Forget-bias trick: many frameworks initialize

b_f

+1

(rather than 0). This pushes

f_t

toward

\sigma(1) \approx 0.73

at the start of training, so the cell defaults to remembering until gradients teach it otherwise. Without this, early training can wash out memory too fast.

Input Gate and Candidate Memory

Writing to memory is split into how much and what — the input gate and the candidate.

Input gate $i_t = \sigma(W_i \, [h_{t-1}, x_t] + b_i)$ . A sigmoid-valued vector. Close to 1 means “yes, write heavily into this channel this step”; close to 0 means “skip this channel.”
Candidate $\tilde{C}_t = \tanh(W_C \, [h_{t-1}, x_t] + b_C)$ . A tanh-bounded vector of proposed new content. Unlike the gates, this one is signed.

The two combine as $i_t \odot \tilde{C}_t$ . Think of it as (how loud) × (what to say). Splitting these two concerns is crucial: a single tanh-of-linear cannot express “I have strong content but I don't want to write it hard right now.” With the gate, the network can.

Updating the Cell State

Everything so far has been preparation for one line — the line that defines LSTM:

C_t \;=\; f_t \odot C_{t-1} \;+\; i_t \odot \tilde{C}_t

That's it. Two element-wise products and an element-wise sum. The first term erases; the second term writes. There is no activation wrapped around the result. That is what makes $C_t$ the highway.

Take the derivative along the highway: $\partial C_t / \partial C_{t-1} = f_t$ . No tanh derivative (which can be as small as 0.0), no repeated weight matrix (which magnifies or shrinks across steps). Just a direct element-wise multiplier the network can learn to keep near 1 when it needs to remember — and the memory of the sequence survives.

Constant Error Carousel. Hochreiter and Schmidhuber's original name for this mechanism. When $f_t = 1$ and $i_t = 0$ , the cell state literally just circulates: $C_t = C_{t-1}$ , and gradients pass through with factor 1. Carousels go around forever; so do these gradients.

The Output Gate and Hidden State

The cell state $C_t$ can hold values much larger than 1 if the network wants (it is bounded only by what the gates write over time). But downstream layers expect a bounded, stable signal. So the LSTM exposes a second, tamer view of memory:

h_t \;=\; o_t \odot \tanh(C_t)

Two things happen here. First, $\tanh(C_t)$ squashes memory into $(-1, 1)$ — a “readable” form. Second, $o_t$ masks it: channels the network chooses not to reveal this step are zeroed out in $h_t$ . The cell state itself is untouched.

Because $h_t$ , not $C_t$ , is what the next timestep's gates see, the output gate is also the LSTM's way of saying “this part of memory is not relevant to the next prediction” while still keeping it on the belt for later.

Interactive: LSTM Architecture

Click any gate in the diagram below to see its equation, description, and the intuition behind it. Press Animate Flow to watch information move through the cell in the canonical order: forget → write → update → expose.

Loading LSTM architecture…

Interactive: The Gates at Work

Numbers bring intuition. Drag the sliders to set the previous hidden state $h_{t-1}$ , the current input $x_t$ , and the previous cell state $C_{t-1}$ . Watch how each gate responds and how the two final outputs — $C_t$ and $h_t$ — are assembled from them. Try the perfect remembering configuration in the hint below the visualization: push the forget gate toward 1 and the input gate toward 0. You will see $C_t$ essentially equal $C_{t-1}$ — memory preserved.

Loading gate explorer…

A Full Forward Pass with Real Numbers

Before we write code, let's run the cell by hand for one step. Take hidden size $H = 2$ and input size $I = 2$ . Our first input is the token “I” with embedding $x_1 = [0.5, -0.2]$ , and both states start at zero.

Concatenate: $z_1 = [h_0, x_1] = [0, 0, 0.5, -0.2]$ .
Forget gate: $W_f z_1 + b_f = [0.28, 0.04]$ , so $f_1 = \sigma([0.28, 0.04]) = [0.5695, 0.5100]$ .
Input gate: $i_1 = \sigma([-0.23, 0.21]) = [0.4428, 0.5523]$ .
Candidate: $\tilde{C}_1 = \tanh([0.11, -0.05]) = [0.1096, -0.0500]$ .
Output gate: $o_1 = \sigma([-0.16, 0.12]) = [0.4601, 0.5300]$ .
Cell state: $C_1 = f_1 \odot 0 + i_1 \odot \tilde{C}_1 = [0.0485, -0.0276]$ .
Hidden state: $h_1 = o_1 \odot \tanh(C_1) = [0.0223, -0.0146]$ .

These are not made-up numbers; they are what you will see printed when you run the NumPy code below. The point is that once the formulas are written out, running the cell is just algebra — five affine layers and a few element-wise products.

LSTM From Scratch in Python

Here is the complete cell, implemented with nothing but NumPy. Click any line to see the exact tensor values at that point — including the actual matrix entries of each gate's weight matrix, the pre-activations, and the gate outputs for all three timesteps of our toy sequence “I love it”.

A Complete LSTM Cell in NumPy — Click Any Line

🐍lstm_numpy.py

Explanation(18)

Code(49)

1import numpy as np

NumPy gives us fast N-dimensional arrays and the basic math (matrix multiply via @, element-wise *, np.tanh, np.exp). Every mathematical symbol in the LSTM equations maps directly to a NumPy operation.

EXECUTION STATE

numpy = Scientific computing library — ndarray, linear algebra, elementwise math.

as np = Alias so we write np.exp() instead of numpy.exp() — universal Python convention.

3def sigmoid(x)

The gating activation. σ(x) = 1/(1+e^{-x}) squashes any real number into (0, 1). In an LSTM, gate values are sigmoids because we want each channel to act like a soft switch: 0 = block, 1 = pass through.

EXECUTION STATE

⬇ input: x = A NumPy array of pre-activations. Can be any real number, positive or negative.

→ why sigmoid? = Outputs live in (0, 1). Multiplying a memory vector by a sigmoid value is literally 'keep this much' — 0.9 keeps 90%, 0.1 keeps 10%. A ReLU would not cap the upper end, a tanh would flip signs — neither makes sense for a soft switch.

⬆ returns = Same shape as x, every element squashed into (0, 1).

4return 1.0 / (1.0 + np.exp(-x))

Computes sigmoid element-wise. np.exp(-x) flips the sign then applies the exponential to every element; NumPy broadcasts the scalar 1.0 over the resulting array.

EXECUTION STATE

📚 np.exp = Element-wise exponential e^x. np.exp(0)=1.0, np.exp(1)=2.718, np.exp(-1)=0.368.

→ example: x = [0, 1, -1] = -x = [0, -1, 1] → exp = [1.0, 0.368, 2.718] → 1/(1+exp) = [0.50, 0.731, 0.269]

7def lstm_cell(x_t, h_prev, C_prev, params)

One LSTM cell processing a SINGLE timestep. It takes the current input, the previous hidden state, and the previous cell state, and returns the new hidden and cell states. The whole sequence is processed by calling this function once per token.

EXECUTION STATE

⬇ input: x_t (shape (2,)) = The current token's feature vector. For step 1: x_t = [0.5, -0.2] (our toy embedding for 'I').

→ x_t purpose = The external signal driving this timestep. In a language model it would be a word embedding; in time-series it is the sensor reading at time t.

⬇ input: h_prev (shape (2,)) = Hidden state carried from t-1. For step 1 this is zeros; for later steps it is the previous h_t.

→ h_prev purpose = The short-term / exposed memory. It is the 'what the network said about the past' signal that gates look at when deciding what to do now.

⬇ input: C_prev (shape (2,)) = Cell state carried from t-1. Starts at zeros, then gets updated additively each step.

→ C_prev purpose = The long-term memory. Unlike h, C is NOT squashed each step — values can accumulate. This is the memory highway that makes LSTMs work.

⬇ input: params (dict) = All 8 weight arrays: 4 matrices W_f, W_i, W_C, W_o (each 2×4) and 4 bias vectors b_f, b_i, b_C, b_o (each shape (2,)).

→ params purpose = These are the learnable parameters. During training each is adjusted by gradient descent. Here we set them manually so we can trace values by hand.

⬆ returns = Tuple (h_t, C_t) — both shape (2,). h_t is the new exposed output and becomes h_prev for the next step; C_t is the updated long-term memory.

9z = np.concatenate([h_prev, x_t])

Stack the previous hidden state and the current input into one long vector. Every gate will take THIS vector as its input — so each gate sees both 'what the network remembers' (h_prev) and 'what just arrived' (x_t) at the same time.

EXECUTION STATE

📚 np.concatenate = Joins a list of arrays along an existing axis. For 1-D arrays it just appends them end-to-end. concatenate([[1,2],[3,4,5]]) → [1,2,3,4,5].

⬇ arg: [h_prev, x_t] = A Python list with two 1-D arrays: h_prev (shape (2,)) and x_t (shape (2,)).

→ step 1 values = h_prev = [0.00, 0.00], x_t = [0.50, -0.20]

⬆ z (shape (4,)) = [0.00, 0.00, 0.50, -0.20] — first 2 entries are the hidden state, last 2 are the input.

→ why concatenate? = Each gate needs to decide based on BOTH past context and current input. Concatenating lets a single weight matrix learn a joint function of both at once.

12f_t = sigmoid(W_f @ z + b_f) — forget gate

The forget gate computes, for each channel of the cell state, a number in (0,1) that will MULTIPLY C_{t-1}. A value near 1 keeps that channel of memory; a value near 0 erases it. Conceptually: 'should I keep what I already remember in this slot?'

EXECUTION STATE

📚 @ = Matrix multiply. W_f is (2, 4), z is (4,); W_f @ z is the (2,) vector of dot products row-by-row.

W_f (2×4) =

    h0     h1     x0     x1
c0  0.2  -0.3   0.4    0.1
c1  0.1   0.5  -0.2    0.3

→ W_f purpose = Row c0 decides what the forget gate says about channel 0 of memory; row c1 does channel 1. Each row mixes hidden state and input into one pre-activation number per channel.

b_f (shape (2,)) = [0.1, 0.2] — positive bias nudges the gate toward 'keep' at initialization. In practice this is why many frameworks initialize b_f near +1.

→ step 1 compute = z = [0, 0, 0.5, -0.2] row c0: 0.2·0 + (-0.3)·0 + 0.4·0.5 + 0.1·(-0.2) + 0.1 = 0.28 row c1: 0.1·0 + 0.5·0 + (-0.2)·0.5 + 0.3·(-0.2) + 0.2 = 0.04 σ([0.28, 0.04]) = [0.5695, 0.5100]

⬆ f_t (shape (2,)) = [0.5695, 0.5100] — channel 0 keeps ~57% of old memory, channel 1 keeps ~51%.

13i_t = sigmoid(W_i @ z + b_i) — input gate

The input gate decides, per channel, how much of the NEW candidate memory to admit. Sigmoid again, so each entry is a soft valve in (0, 1).

EXECUTION STATE

W_i (2×4) =

    h0     h1     x0     x1
c0  0.3   0.2  -0.1    0.4
c1 -0.2   0.1   0.5    0.2

b_i (shape (2,)) = [-0.1, 0.0]

→ step 1 compute = row c0: 0.3·0 + 0.2·0 + (-0.1)·0.5 + 0.4·(-0.2) + (-0.1) = -0.23 row c1: -0.2·0 + 0.1·0 + 0.5·0.5 + 0.2·(-0.2) + 0.0 = 0.21 σ([-0.23, 0.21]) = [0.4428, 0.5523]

⬆ i_t (shape (2,)) = [0.4428, 0.5523] — let in ~44% of the new info in channel 0, ~55% in channel 1.

→ i_t vs f_t = These are INDEPENDENT decisions. i_t and f_t don't need to sum to 1 — you can 'forget a lot AND write a lot' or 'keep a lot AND write a little' per channel.

14C_tilde = np.tanh(W_C @ z + b_C) — candidate memory

The candidate vector is a proposal for NEW content to add to memory. Tanh (not sigmoid) because candidate values are content, not switches — they should be signed and live in (-1, 1). The input gate will later decide how much of this proposal to actually write.

EXECUTION STATE

📚 np.tanh = Hyperbolic tangent. Squashes any real number into (-1, 1). tanh(0)=0, tanh(1)=0.7616, tanh(-1)=-0.7616.

W_C (2×4) =

    h0     h1     x0     x1
c0  0.1  -0.4   0.3    0.2
c1  0.4   0.2  -0.1    0.5

b_C (shape (2,)) = [0.0, 0.1]

→ step 1 compute = row c0: 0.1·0 + (-0.4)·0 + 0.3·0.5 + 0.2·(-0.2) + 0.0 = 0.11 row c1: 0.4·0 + 0.2·0 + (-0.1)·0.5 + 0.5·(-0.2) + 0.1 = -0.05 tanh([0.11, -0.05]) = [0.1096, -0.0500]

⬆ C_tilde (shape (2,)) = [0.1096, -0.0500] — the CONTENT we are proposing to add (sign matters; tanh allows negative values to subtract information).

→ why tanh, not sigmoid? = Memory content should be signed: a word like 'not' should push channels NEGATIVELY. A sigmoid (0,1) could only ever add positive content.

15o_t = sigmoid(W_o @ z + b_o) — output gate

The output gate is computed now but USED later (after the cell state is updated). It decides which parts of C_t will be exposed through the hidden state h_t. Same sigmoid-of-a-linear-layer pattern as the other gates.

EXECUTION STATE

W_o (2×4) =

    h0     h1     x0     x1
c0  0.5   0.1  -0.2    0.3
c1  0.2  -0.3   0.4   -0.1

b_o (shape (2,)) = [0.0, -0.1]

→ step 1 compute = row c0: 0.5·0 + 0.1·0 + (-0.2)·0.5 + 0.3·(-0.2) + 0.0 = -0.16 row c1: 0.2·0 + (-0.3)·0 + 0.4·0.5 + (-0.1)·(-0.2) + (-0.1) = 0.12 σ([-0.16, 0.12]) = [0.4601, 0.5300]

⬆ o_t (shape (2,)) = [0.4601, 0.5300] — expose ~46% of channel 0's memory and ~53% of channel 1's to the outside.

18C_t = f_t * C_prev + i_t * C_tilde

This single additive update is the HEART of the LSTM. It erases (f_t * C_prev) then writes (i_t * C_tilde). Because it is ADDITIVE, the derivative ∂C_t/∂C_{t-1} is simply f_t — no multiplicative chain of squashing derivatives. That is why gradients can survive over long sequences.

EXECUTION STATE

📚 * = Element-wise multiplication (Hadamard product) when both operands are 1-D arrays of the same shape. Channel 0 talks to channel 0 only.

⬇ f_t = [0.5695, 0.5100] — the keep-what-you-had fractions.

⬇ C_prev = [0.0, 0.0] — nothing was remembered yet at step 1.

→ f_t * C_prev = [0.5695·0, 0.5100·0] = [0.0, 0.0] — erased (well, there was nothing to keep).

⬇ i_t = [0.4428, 0.5523]

⬇ C_tilde = [0.1096, -0.0500]

→ i_t * C_tilde = [0.4428·0.1096, 0.5523·(-0.0500)] = [0.0485, -0.0276]

⬆ C_t (new cell state) = [0.0485, -0.0276] — the first piece of long-term memory has been written.

→ why addition matters = Compare a vanilla RNN: h_t = tanh(W · h_{t-1} + ...). The derivative passes through tanh AND a multiplicative W every step → shrinks exponentially. LSTM's C_t = f_t·C_{t-1} + ... passes through NO activation — only a multiplicative f_t that the network can push toward 1.

21h_t = o_t * np.tanh(C_t)

The hidden state is a SQUASHED, GATED view of the cell state. Tanh bounds the exposed memory into (-1, 1) so downstream layers see a stable signal; o_t then masks which channels are visible.

EXECUTION STATE

np.tanh(C_t) = tanh([0.0485, -0.0276]) = [0.0485, -0.0276] (tanh(x) ≈ x for small |x|).

⬇ o_t = [0.4601, 0.5300]

⬆ h_t (new hidden state) = [0.4601·0.0485, 0.5300·(-0.0276)] = [0.0223, -0.0146]

→ C vs h = C_t is PRIVATE long-term memory (can hold large, unsquashed values over many steps). h_t is the PUBLIC view (bounded by tanh, masked by o_t). Next step's gates read h_t, not C_t.

23return h_t, C_t

Both states are returned as a tuple. The caller feeds h_t forward as the new h_prev and C_t as the new C_prev.

EXECUTION STATE

⬆ return: (h_t, C_t) = step 1: ([0.0223, -0.0146], [0.0485, -0.0276])

26params = {…} — the 8 learnable tensors

Every gate has its own weight matrix (mixing h and x) and bias vector. For hidden size H and input size I, each W has shape (H, H+I) and each b has shape (H,). Total params for one LSTM cell: 4·H·(H+I) + 4·H.

EXECUTION STATE

parameter count (H=2, I=2) = 4 matrices of shape (2, 4) = 32 weights, plus 4 biases of shape (2,) = 8 biases. 40 parameters total, all learnable.

→ scaling = For real models (H=512, I=512): 4·512·1024 + 4·512 = 2,099,200 parameters PER LSTM cell. Large, but the mechanism is exactly the same as what we do here with 40.

40sequence = [np.array([0.5, -0.2]), …]

A 3-token sequence, each token represented by a tiny 2-D embedding. In a real model these would be word embeddings of dimension 256 or 512.

EXECUTION STATE

sequence[0] = [0.5, -0.2] = Our stand-in embedding for 'I'.

sequence[1] = [0.8, 0.3] = Embedding for 'love'.

sequence[2] = [0.1, 0.9] = Embedding for 'it'.

44h = np.zeros(2); C = np.zeros(2)

Initialize both states to zero vectors before we start processing. This is the convention: the LSTM has no memory yet at t=0.

EXECUTION STATE

h (shape (2,)) = [0.0, 0.0]

C (shape (2,)) = [0.0, 0.0]

→ why zeros? = Any fixed starting value works — zeros are simplest. Some architectures learn the initial states as additional parameters, but zero init is overwhelmingly the standard.

45for t, x_t in enumerate(sequence, 1):

Iterate through the sequence one token at a time. enumerate(..., 1) gives 1-based indices so our printed output reads step 1, step 2, step 3 instead of 0, 1, 2.

LOOP TRACE · 3 iterations

t=1, x_t = 'I' = [0.50, -0.20]

after lstm_cell = h = [0.0223, -0.0146] C = [0.0485, -0.0276]

t=2, x_t = 'love' = [0.80, 0.30]

f_t = [0.6127, 0.5312] — remember more of channel 0 than channel 1

i_t = [0.4859, 0.6116]

C_tilde = [0.2987, 0.1742]

o_t = [0.4849, 0.5495]

C_t = [0.1749, 0.0919] — grew from [0.0485, -0.0276]

h_t = [0.0839, 0.0504]

t=3, x_t = 'it' = [0.10, 0.90]

f_t = [0.5577, 0.6186]

i_t = [0.5708, 0.5543]

C_tilde = [0.1957, 0.5253] — note channel 1 wants to write a lot

o_t = [0.5737, 0.4630]

C_t = [0.2092, 0.3480] — channel 1 memory is accumulating

h_t = [0.1183, 0.1549]

46h, C = lstm_cell(x_t, h, C, params)

Call the cell with the current input and the PREVIOUS h and C. Reassign both — this is how state carries forward. After the loop, h and C hold the final memory for the whole sequence.

EXECUTION STATE

→ state threading = h on the left becomes h_prev on the next call; same for C. This is the ENTIRE 'recurrent' part of a recurrent network — everything else is just a feed-forward step.

47print(f"step {t}: h={h.round(4)}, C={C.round(4)}")

Report the state after each timestep, rounded for readability.

EXECUTION STATE

⬆ expected output = step 1: h=[ 0.0223 -0.0146], C=[ 0.0485 -0.0276] step 2: h=[0.0839 0.0504], C=[0.1749 0.0919] step 3: h=[0.1183 0.1549], C=[0.2092 0.3480]

→ what to notice = |C| grows over time (memory accumulating), |h| stays small (tanh + output gate bounding it). Channel 1 of C grows the fastest here — that channel is learning to react to inputs with a positive x_1 component.

31 lines without explanation

1import numpy as np
2
3def sigmoid(x):
4    return 1.0 / (1.0 + np.exp(-x))
5
6# ---- One LSTM cell: processes a single timestep ----
7def lstm_cell(x_t, h_prev, C_prev, params):
8    # Concatenate previous hidden state and current input: shape (hidden + input,)
9    z = np.concatenate([h_prev, x_t])
10
11    # Four gates — each is a linear layer on z, then an activation
12    f_t     = sigmoid(params["W_f"] @ z + params["b_f"])   # forget gate
13    i_t     = sigmoid(params["W_i"] @ z + params["b_i"])   # input gate
14    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # candidate memory
15    o_t     = sigmoid(params["W_o"] @ z + params["b_o"])   # output gate
16
17    # Update the long-term memory (the cell state)
18    C_t = f_t * C_prev + i_t * C_tilde
19
20    # Produce the new hidden state (the exposed output)
21    h_t = o_t * np.tanh(C_t)
22
23    return h_t, C_t
24
25# ---- Run over a sequence ----
26params = {
27    "W_f": np.array([[ 0.2, -0.3,  0.4,  0.1],
28                     [ 0.1,  0.5, -0.2,  0.3]]),
29    "b_f": np.array([ 0.1,  0.2]),
30    "W_i": np.array([[ 0.3,  0.2, -0.1,  0.4],
31                     [-0.2,  0.1,  0.5,  0.2]]),
32    "b_i": np.array([-0.1,  0.0]),
33    "W_C": np.array([[ 0.1, -0.4,  0.3,  0.2],
34                     [ 0.4,  0.2, -0.1,  0.5]]),
35    "b_C": np.array([ 0.0,  0.1]),
36    "W_o": np.array([[ 0.5,  0.1, -0.2,  0.3],
37                     [ 0.2, -0.3,  0.4, -0.1]]),
38    "b_o": np.array([ 0.0, -0.1]),
39}
40
41sequence = [np.array([0.5, -0.2]),   # "I"
42            np.array([0.8,  0.3]),   # "love"
43            np.array([0.1,  0.9])]   # "it"
44
45h = np.zeros(2)
46C = np.zeros(2)
47for t, x_t in enumerate(sequence, 1):
48    h, C = lstm_cell(x_t, h, C, params)
49    print(f"step {t}: h={h.round(4)}, C={C.round(4)}")

Run the code. You should see:

Step	Token	h_t	C_t
1	I	[ 0.0223, -0.0146]	[ 0.0485, -0.0276]
2	love	[ 0.0839, 0.0504]	[ 0.1749, 0.0919]
3	it	[ 0.1183, 0.1549]	[ 0.2092, 0.3480]

Notice how the magnitudes of $C_t$ are accumulating across steps — memory building up — while the magnitudes of $h_t$ stay small, bounded by $\tanh$ and masked by $o_t$ . That is the highway-vs-readout distinction playing out in actual numbers.

The Same LSTM in PyTorch

Now watch how the same mechanism becomes two lines in PyTorch. The NumPy version makes the math concrete; the PyTorch version is what you will actually use. We show both ways: nn.LSTMCell (one step at a time, mirroring our NumPy loop) and nn.LSTM (the whole sequence in a single call).

LSTM in PyTorch — Cell-Level and Sequence-Level

🐍lstm_pytorch.py

Explanation(16)

Code(27)

1import torch

The core PyTorch package. Gives us torch.Tensor (GPU-ready N-dim arrays), autograd, and the building blocks for neural networks.

EXECUTION STATE

torch.Tensor = Like NumPy's ndarray but with requires_grad tracking and GPU support.

2import torch.nn as nn

torch.nn is the module with ready-made layers. We import it as nn so nn.LSTMCell, nn.LSTM, nn.Linear etc. are all at hand.

EXECUTION STATE

nn.LSTMCell = A single-timestep LSTM cell. Equivalent to our lstm_cell function above.

nn.LSTM = A full multi-step LSTM. Wraps a cell in a loop over time. Also supports multi-layer stacking and bidirectional operation.

4torch.manual_seed(42)

Make the random weight initialization reproducible. Without this line, every run of nn.LSTMCell would get different weights and different outputs.

EXECUTION STATE

📚 torch.manual_seed = Seeds the global PyTorch RNG. Any random operations (Linear init, Dropout masks, etc.) become deterministic.

7cell = nn.LSTMCell(input_size=2, hidden_size=2)

Construct a single LSTM cell. Internally it holds weight_ih (shape (4·H, I)), weight_hh (shape (4·H, H)), and two bias vectors — all the four gates packed into one big matrix for speed.

EXECUTION STATE

📚 nn.LSTMCell(input_size, hidden_size) = Creates one LSTM cell. Input and hidden sizes determine tensor shapes; the 4× in the weight shapes is because forget/input/candidate/output are computed in one fused matmul.

⬇ arg: input_size=2 = Feature dimension of x_t. Must match the last dimension of the input tensor passed to the cell.

⬇ arg: hidden_size=2 = Dimension of h_t AND C_t. LSTM always has hidden and cell of the same size.

→ cell.weight_ih shape = (8, 2) = (4·hidden, input) — rows [0:2] are input gate, [2:4] are forget, [4:6] are candidate, [6:8] are output (PyTorch's packing order is i, f, g, o).

→ cell.weight_hh shape = (8, 2) = (4·hidden, hidden) — same packing.

→ params = Same 40 learnable numbers we had in NumPy (4 matrices + 4 biases). PyTorch also has a second bias vector by default for historical/perf reasons; the two biases add so the math is identical.

9h = torch.zeros(1, 2)

Initial hidden state. Shape is (batch, hidden). Batch is 1 because we're feeding one sequence at a time; hidden is 2 to match the cell.

EXECUTION STATE

📚 torch.zeros(*sizes) = Allocate a tensor of the given shape filled with 0.0. Same as NumPy's np.zeros but returns a torch.Tensor.

h shape (1, 2) = [[0.0, 0.0]]

→ why batch dim? = nn.LSTMCell ALWAYS expects a batch dimension, even if batch=1. Forgetting it is a very common error — the cell will broadcast incorrectly or throw a shape mismatch.

10C = torch.zeros(1, 2)

Initial cell state. Same shape as h; same reason.

EXECUTION STATE

C shape (1, 2) = [[0.0, 0.0]]

11seq = [torch.tensor([[0.5, -0.2]]), …]

Our 3-token sequence. Each element is a (1, 2) tensor — (batch=1, features=2). We use a Python list so we can loop one step at a time for this demo.

EXECUTION STATE

📚 torch.tensor(data) = Build a tensor from nested lists. dtype is inferred (float → torch.float32). Each inner list becomes one row.

seq[0].shape = torch.Size([1, 2]) — (batch, features). The extra dimension vs. NumPy is PyTorch's batch convention.

15for t, x_t in enumerate(seq, 1):

Walk through the 3 timesteps, passing state from one step to the next — exactly like the NumPy version.

LOOP TRACE · 3 iterations

t=1, x_t = [[0.5, -0.2]]

h after cell = [[0.3217, 0.1095]]

C after cell = [[0.5192, 0.2124]]

t=2, x_t = [[0.8, 0.3]]

h after cell = [[0.4319, 0.2324]]

C after cell = [[0.7113, 0.4321]]

t=3, x_t = [[0.1, 0.9]]

h after cell = [[0.1188, 0.2279]]

C after cell = [[0.1939, 0.4966]]

16h, C = cell(x_t, (h, C))

Forward one step. The cell takes the input tensor and a tuple of (h, C), applies the four gates internally (in a single fused matmul), and returns the new (h, C).

EXECUTION STATE

📚 cell.__call__(input, hx=None) = Calls forward(). input shape (batch, input); hx is a tuple (h, C) each shape (batch, hidden). Returns a new (h, C) tuple.

⬇ arg: x_t = (1, 2) tensor — this step's feature vector.

⬇ arg: (h, C) = Tuple of previous states. PyTorch accepts this as a SINGLE positional argument after input.

→ values differ from NumPy = PyTorch uses RANDOM weight initialization (Xavier-like, scaled by 1/√hidden). Our NumPy run used hand-picked weights. The MECHANISM is identical — only numbers differ.

17print(f"step {t}: h=…, C=…")

.detach() removes autograd tracking (we're just printing, not training); .numpy() converts to a NumPy array; .round(4) trims to 4 decimals for readability.

EXECUTION STATE

📚 .detach() = Returns a new tensor that shares storage but is not tracked by autograd. Necessary before .numpy() on a tensor that has requires_grad=True.

📚 .numpy() = Converts a CPU tensor to a NumPy array (zero-copy when possible). Errors if the tensor is on GPU.

21lstm = nn.LSTM(input_size=2, hidden_size=2, batch_first=True)

The idiomatic PyTorch LSTM. Instead of writing a Python for-loop, we give it the WHOLE sequence as a (batch, seq_len, features) tensor and it runs the cell over all timesteps in an optimized C++/CUDA kernel.

EXECUTION STATE

📚 nn.LSTM = Multi-step, multi-layer LSTM. Defaults: num_layers=1, bidirectional=False, dropout=0.

⬇ arg: batch_first=True = Changes the expected input shape from (seq_len, batch, features) to (batch, seq_len, features). Most users find batch_first=True more intuitive, but the PyTorch default is False for historical reasons — always double-check.

→ why batch_first matters = With batch_first=False, the first dim is TIME. If you pass (1, 3, 2) thinking 'batch=1, seq=3, features=2' but didn't set batch_first, PyTorch reads it as 'seq=1, batch=3, features=2' — you'd process 3 separate batches of length 1 instead of one batch of length 3.

22batch = torch.tensor([[[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]]])

Pack the whole sequence into one (1, 3, 2) tensor: one batch, three timesteps, two features each. The triple bracket matters — outer is batch, middle is time, inner is features.

EXECUTION STATE

batch.shape = torch.Size([1, 3, 2]) — (batch=1, seq_len=3, input=2)

→ ordering = batch[0] is the ONLY sequence in this batch; batch[0][0] is the first timestep; batch[0][0][0] is the first feature of the first timestep.

24output, (h_n, c_n) = lstm(batch)

Run the whole sequence through the LSTM in a single call. output contains the hidden state at EVERY timestep; (h_n, c_n) is the state AFTER the last timestep — the same thing h and C would hold at the end of your manual loop.

EXECUTION STATE

⬆ output = Shape (batch, seq_len, hidden) = (1, 3, 2). output[0][t] is h_t for this batch at step t.

→ output values = [[[0.3217, 0.1095], [0.4319, 0.2324], [0.1188, 0.2279]]]

⬆ h_n = Shape (num_layers·num_directions, batch, hidden) = (1, 1, 2). This is h at t=3: [[[0.1188, 0.2279]]].

⬆ c_n = Same shape as h_n. This is C at t=3: [[[0.1939, 0.4966]]].

→ output vs h_n = output[:, -1, :] (last timestep of output) equals h_n[-1] (last layer's final h). For a 1-layer LSTM they are literally the same numbers in different shapes.

25print("output shape:", tuple(output.shape))

Convert the torch.Size to a plain tuple for cleaner printing. Verifies the shape is the expected (1, 3, 2).

EXECUTION STATE

→ printed = output shape: (1, 3, 2)

26print("final h_n :", …)

Print the final hidden state. Useful for classification: for a sentiment classifier, this is the vector you would feed into a Linear layer.

EXECUTION STATE

→ printed = final h_n : [[[0.1188 0.2279]]]

27print("final c_n :", …)

Print the final cell state. Usually you don't pass this downstream — it stays inside the LSTM — but it is returned for cases like encoder-decoder models where the decoder needs to initialize its own state from the encoder's.

EXECUTION STATE

→ printed = final c_n : [[[0.1939 0.4966]]]

→ NumPy vs PyTorch reconciled = Different numbers because different weights. Same formulas, same five lines of math (f, i, C̃, o, then C_t = f·C_{t-1} + i·C̃ and h_t = o·tanh(C_t)).

11 lines without explanation

1import torch
2import torch.nn as nn
3
4torch.manual_seed(42)
5
6# ---- Option A: one cell, manual loop over timesteps ----
7cell = nn.LSTMCell(input_size=2, hidden_size=2)
8
9h = torch.zeros(1, 2)           # (batch=1, hidden=2)
10C = torch.zeros(1, 2)
11seq = [torch.tensor([[0.5, -0.2]]),
12       torch.tensor([[0.8,  0.3]]),
13       torch.tensor([[0.1,  0.9]])]
14
15for t, x_t in enumerate(seq, 1):
16    h, C = cell(x_t, (h, C))
17    print(f"step {t}: h={h.detach().numpy().round(4)}, "
18          f"C={C.detach().numpy().round(4)}")
19
20# ---- Option B: full nn.LSTM — one call processes the whole sequence ----
21lstm = nn.LSTM(input_size=2, hidden_size=2, batch_first=True)
22batch = torch.tensor([[[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]]])   # (1, 3, 2)
23
24output, (h_n, c_n) = lstm(batch)
25print("output shape:", tuple(output.shape))     # (1, 3, 2)
26print("final h_n   :", h_n.detach().numpy().round(4))
27print("final c_n   :", c_n.detach().numpy().round(4))

The output values differ from the NumPy version because PyTorch uses its own random initialization for the weight matrices. The mechanism is identical, line for line — the same four gates, the same additive cell update, the same gated hidden state.

Aspect	NumPy from scratch	nn.LSTMCell	nn.LSTM
Weights	You provide	Random init	Random init
Gates	Four separate matmuls	One fused matmul	One fused matmul, batched over time
Loop over time	Python for-loop	Python for-loop	Optimized C++/CUDA kernel
Batching	One sequence	Batch-aware	Batch + sequence in one tensor
GPU	None	.to('cuda')	.to('cuda')
When to use	Learning / debugging	Custom step logic	Standard training

Verify they match. For any one LSTM cell, the output of a Python loop over nn.LSTMCell must equal the output of a single nn.LSTM call with the same weights — the second is just a faster implementation of the first. PyTorch's tests enforce this. Your own custom RNN variants should do the same.

Why LSTMs Actually Learn Long Memory

We have talked about the gradient highway abstractly; the visualization below makes it quantitative. It compares the magnitude of the gradient $\partial L / \partial h_1$ (for a vanilla RNN) against $\partial L / \partial C_1$ (for an LSTM) as a function of sequence length. Drag the sliders and watch the RNN gradients collapse while the LSTM gradients survive.

Loading gradient flow…

Read off the math from the visualization. In a vanilla RNN the gradient at timestep 1 is $\prod_{k} W_{hh}^{\top} \tanh'(\cdot)$ — a product of many potentially-small terms. In the LSTM the dominant term is $\prod_{k} f_k$ , and the network is free to learn $f_k \approx 1$ whenever it needs memory to survive.

A good sanity check: if you are training an LSTM on long sequences and still seeing vanishing gradients, print the mean of

f_t

during training. If it is stuck near 0.5, the forget-bias-to-+1 initialization trick is missing.

Interactive: BPTT Gradient Unroller

Unroll the cell in 3D and watch a gradient particle travel backward along the memory highway. Drag $T$ , flip between LSTM and vanilla RNN, and compare the final gradient magnitude displayed above the canvas. The LSTM's particle stays bright while the RNN's fades — that is the exponential shrink from $\prod_k |W_{hh}| \cdot \tanh'(\cdot)$ playing out in pixels.

Loading BPTT unroller…

BPTT by Hand vs Autograd

The highway argument says $\partial C_t / \partial C_{t-1} = \text{diag}(f_t)$ — but that is only the dominant backward path. The full Jacobian also includes smaller contributions that flow through $h_{t-1}$ into the next step's gates. We check both against PyTorch's autograd.

BPTT by Hand — Highway Term vs Autograd

🐍bptt_by_hand.py

Explanation(21)

Code(45)

1import torch

We need torch for autograd, which is the whole point of this experiment — hand-derived gradients compared to the real thing.

EXECUTION STATE

torch = Root PyTorch package, provides tensors and autograd.

8f_1 = torch.tensor([0.5695, 0.5100])

Forget-gate output at t=1, copied from §17.1's worked example so the numbers are self-consistent across the whole section. Channel 0 keeps 57% of its memory, channel 1 keeps 51%.

EXECUTION STATE

⬆ f_1 = tensor([0.5695, 0.5100])

9f_2 = torch.tensor([0.6127, 0.5312])

Forget gate at t=2 (also from §17.1).

EXECUTION STATE

⬆ f_2 = tensor([0.6127, 0.5312])

10f_3 = torch.tensor([0.5577, 0.6186])

Forget gate at t=3.

EXECUTION STATE

⬆ f_3 = tensor([0.5577, 0.6186])

18dC3_dC1_highway = f_3 * f_2

The whole BPTT argument for the LSTM in one element-wise multiplication. ∂C_t/∂C_{t-1} = diag(f_t), so the highway contribution to ∂C_3/∂C_1 is f_3 ⊙ f_2. Because f_t ∈ (0, 1), repeated products shrink — but much more slowly than a vanilla RNN's matrix-product-through-tanh path, and the network can learn to keep f_t near 1 when it wants memory to survive.

EXECUTION STATE

📚 * (elemwise) = Hadamard product between two same-shape tensors. Gives a channel-wise diagonal Jacobian.

→ step-by-step = channel 0: 0.5577 · 0.6127 = 0.3418 channel 1: 0.6186 · 0.5312 = 0.3287

⬆ dC3_dC1_highway = tensor([0.3418, 0.3287])

→ intuition = Over just two steps the gradient shrinks by ~3×. After 30 steps it would be ~0.33^30 ≈ 5e-15 — but the network can push f closer to 1 during training to slow or eliminate that decay.

19print(f'highway term … = {dC3_dC1_highway.numpy().round(4)}')

Report the hand-derived highway term.

EXECUTION STATE

→ printed = highway term diag(∂C_3/∂C_1) = [0.3418 0.3287]

25from torch import nn

We need nn.LSTMCell for the autograd check.

26torch.manual_seed(42)

Same seed as §17.1's PyTorch run, so the weights match the earlier trace exactly.

EXECUTION STATE

📚 torch.manual_seed = Seeds the global RNG. Reproducibility across runs.

27cell = nn.LSTMCell(input_size=2, hidden_size=2)

One-step LSTM cell. Same shape and seed as the §17.1 PyTorch example.

EXECUTION STATE

cell.weight_ih.shape = (8, 2) — 4 gates × 2 hidden × 2 input

cell.weight_hh.shape = (8, 2) — 4 gates × 2 hidden × 2 hidden

29x1 = torch.tensor([[0.5, -0.2]])

Token 1 (‘I’), shape (batch=1, features=2).

EXECUTION STATE

x1 =

tensor([[0.5000, -0.2000]])

30x2 = torch.tensor([[0.8, 0.3]])

Token 2 (‘love’).

31x3 = torch.tensor([[0.1, 0.9]])

Token 3 (‘it’).

33h0 = torch.zeros(1, 2); C0 = torch.zeros(1, 2)

Start with both states at zero — the canonical initial condition.

EXECUTION STATE

h0 =

tensor([[0.0, 0.0]])

C0 =

tensor([[0.0, 0.0]])

34h1, C1 = cell(x1, (h0, C0))

Forward pass, step 1. At this point autograd is tracking the full graph — but we only care about the gradient from step 3 back to C_1, so we will detach and re-attach on the next line.

35C1 = C1.detach().clone().requires_grad_(True)

THIS is the trick. .detach() drops the computation graph leading from (x1, h0, C0) to C1; .clone() gives an independent tensor; .requires_grad_(True) makes autograd start tracking C1 as a leaf. Now when we compute C_3 from C_1 and call .backward(), the gradient will flow through exactly two time-steps (t=2, t=3) and populate C1.grad.

EXECUTION STATE

📚 .detach() = Returns a tensor that shares storage but has no autograd history. Used to 'break' the graph at a specific point.

📚 .clone() = Deep copy of the tensor's values — needed because requires_grad_() modifies in place and we want a fresh leaf.

📚 .requires_grad_(True) = In-place flag that turns this tensor into a differentiable leaf. Trailing underscore = in-place modification (PyTorch convention).

→ why all three? = detach removes history, clone gives independence, requires_grad_ registers it as a new starting point. Skipping any one of these would either crash .backward() or give the wrong gradient.

36h2, C2 = cell(x2, (h1, C1))

Forward step 2, this time tracked. Autograd records: C_2 depends on C_1 via f_2 (the highway) AND via h_1 that also routes into x_t concatenation.

37h3, C3 = cell(x3, (h2, C2))

Forward step 3. By now C_3 depends on C_1 through both paths.

40C3.sum().backward()

Seed the backward pass. .sum() collapses C_3 (shape (1, 2)) into a scalar so .backward() has a single output to differentiate. Because the seed ∇C_3 is [1, 1], C1.grad will literally equal ∂ΣC_3/∂C_1 — the column-sum of the Jacobian.

EXECUTION STATE

📚 .sum() = Sums all elements of a tensor into a single scalar — required by .backward() on non-scalar outputs.

📚 .backward() = Triggers reverse-mode autodiff from the scalar to every requires_grad=True leaf in the graph, populating each leaf's .grad attribute.

41print(f'autograd ∂ΣC_3/∂C_1 = {C1.grad.detach().numpy().round(4)}')

Print the full autograd-computed gradient — highway term PLUS the smaller contributions routed through h_1 into subsequent gates.

EXECUTION STATE

→ representative output = autograd ∂ΣC_3/∂C_1 = [0.3568 0.3421]

42print(f'highway approx = …')

Compare to f_3 * f_2 = [0.3418, 0.3287]. The autograd answer is slightly larger because C_1 also leaks into later cell-state updates via h_1, which enters the gate pre-activations at step 2.

EXECUTION STATE

→ printed = highway approx = [0.3418 0.3287]

→ relative difference = ~4% in channel 0, ~4% in channel 1. The highway is the DOMINANT backward path; the h-routed paths are corrections.

43print('Difference is the h-path contributions autograd captures exactly.')

The punchline. Hand-derivation tells you the shape of the story (highway + corrections). Autograd gives you the exact number. They agree to leading order.

EXECUTION STATE

→ printed = Difference is the h-path contributions autograd captures exactly.

→ the lesson = When you want to UNDERSTAND why gradients survive, derive the highway term by hand. When you want to TRAIN a model, use autograd — it captures every path exactly and for free.

24 lines without explanation

1import torch
2
3# Re-use the §17.1 hand-picked weights, cast to torch.
4# We only need the forget-gate outputs to make the highway argument, so rather
5# than reconstruct the full LSTM, we start from the pre-computed f_t values.
6#
7# From the §17.1 trace:
8f_1 = torch.tensor([0.5695, 0.5100])    # forget gate at t=1
9f_2 = torch.tensor([0.6127, 0.5312])    # forget gate at t=2
10f_3 = torch.tensor([0.5577, 0.6186])    # forget gate at t=3
11
12# ---- Analytical highway term ----
13# The LSTM cell-state update is additive: C_t = f_t * C_{t-1} + i_t * g_t.
14# Taking the Jacobian while holding the other paths fixed:
15#     ∂C_t / ∂C_{t-1} = diag(f_t)
16# So the backward product from C_3 all the way back to C_1 is simply
17#     ∂C_3 / ∂C_1 (diag) = f_3 * f_2
18# — two element-wise multiplications, no tanh, no W_hh, no repeated squash.
19dC3_dC1_highway = f_3 * f_2
20print(f"highway term diag(∂C_3/∂C_1) = {dC3_dC1_highway.numpy().round(4)}")
21
22# ---- Autograd verification ----
23# Run the forward pass again, this time with C_1 marked as requires_grad.
24# Then sum C_3 and call .backward() — C_1.grad gives the true gradient,
25# which also includes the (smaller) paths through h_1, h_2 into the gates.
26from torch import nn
27torch.manual_seed(42)
28cell = nn.LSTMCell(input_size=2, hidden_size=2)
29
30x1 = torch.tensor([[0.5, -0.2]])
31x2 = torch.tensor([[0.8,  0.3]])
32x3 = torch.tensor([[0.1,  0.9]])
33
34h0 = torch.zeros(1, 2)
35C0 = torch.zeros(1, 2)
36h1, C1 = cell(x1, (h0, C0))
37C1 = C1.detach().clone().requires_grad_(True)   # seed autograd here
38h2, C2 = cell(x2, (h1, C1))
39h3, C3 = cell(x3, (h2, C2))
40
41# Sum C_3 so .backward() populates C1.grad.
42C3.sum().backward()
43print(f"autograd  ∂ΣC_3/∂C_1 = {C1.grad.detach().numpy().round(4)}")
44print(f"highway   approx     = {dC3_dC1_highway.numpy().round(4)}")
45print("Difference is the h-path contributions autograd captures exactly.")

The autograd result is a few percent larger than the highway approximation — those extra percent are the h-path contributions your hand-derivation explicitly ignored. The key point stands: the dominant term is the diagonal $f_3 \odot f_2$ , and it is the one the network can keep near 1 by learning appropriate gate weights.

The Forget-Bias Trick, Ablated

The forget-bias-to-+1 heuristic (Jozefowicz, Zaremba & Sutskever, 2015) initializes $b_f$ so that at the start of training the forget gate is not halving memory every step. The following zero-input experiment isolates the effect: we seed one unit of memory and watch it decay over 20 steps with $b_f = 0$ vs $b_f = 1$ .

Forget-Bias Ablation — b_f = 0 vs b_f = +1

🐍forget_bias_ablation.py

Explanation(22)

Code(33)

1import numpy as np

We need NumPy for the 1-dimensional toy arrays and elementwise math. Everything here runs in microseconds because hidden size is 1.

EXECUTION STATE

numpy = Numerical-computing library — ndarray, linear algebra, elementwise math.

3def sigmoid(x)

Same σ(x) = 1/(1+e^(-x)) used everywhere in this chapter. The key fact we will lean on: σ(0) = 0.5, σ(1) = 0.731. Those two numbers are the entire forget-bias story.

EXECUTION STATE

σ(0) = 0.5 — a gate that 'starts half-closed' halves memory every step.

σ(1) = 0.7311 — a gate that 'starts mostly open' preserves 73% per step.

7def decay_experiment(b_f_value, steps=20, C0=1.0)

A minimal LSTM cell with ALL weights set to zero and only the forget-bias variable. With W=0 the pre-activation of every gate is just its own bias — so the gate values are constant in time. This isolates the effect of b_f.

EXECUTION STATE

⬇ input: b_f_value = The only thing we vary between runs. 0.0 → forget gate stays at σ(0)=0.5 (half-close every step). 1.0 → σ(1)=0.731 (keep most).

⬇ input: steps=20 = How many timesteps of pure-zero input we feed.

⬇ input: C0=1.0 = We write one unit of memory at t=0 and then just watch it decay. No new information arrives.

⬆ returns = Python list of length 21 — [C_0, C_1, …, C_20] in order.

9W = np.zeros((1, 2))

Every gate weight matrix is zero. Shape (hidden=1, hidden+input=2) — so W·z is always the zero vector regardless of z. Only biases drive the gates.

EXECUTION STATE

📚 np.zeros = Allocate an all-zeros array of the given shape. np.zeros((1, 2)) = [[0.0, 0.0]].

⬆ W =

[[0.0, 0.0]] — (hidden, hidden+input) = (1, 2)

→ why zero weights? = Strips the experiment down to the bias question. In a real cell W and b are both learned and interact; here we pin W to isolate b_f.

10b_i, b_g, b_o = 0.0, 0.0, 0.0 — neutral biases on the other gates

Every gate except the forget gate starts at σ(0)=0.5 (or tanh(0)=0 for the candidate). With zero weights and zero biases, none of these do anything interesting — i·g = 0.5 · 0 = 0, so the cell state never writes anything new. That is what lets us observe PURE decay.

EXECUTION STATE

b_i = 0.0 = Input gate = σ(0) = 0.5 (doesn't matter — g will be 0).

b_g = 0.0 = Candidate = tanh(0) = 0. Nothing to write ever.

b_o = 0.0 = Output gate = σ(0) = 0.5.

12h = np.array([0.0]); C = np.array([C0])

Seed the initial state: h starts at zero, C starts at 1.0. Now we just let the LSTM run — with zero input every step and no fresh content to write, the only thing that can happen to C is multiplication by f.

EXECUTION STATE

h (initial) = [0.0]

C (initial) = [1.0] — this is the memory we will watch decay

14trace = [C[0]]

Record the initial cell-state value. Each loop iteration will append C at the end of the step, so trace ends up length steps+1.

EXECUTION STATE

trace = [1.0]

16for _ in range(steps):

Run the pure-decay loop. With W=0 every gate is locked to its bias, so the dynamics per step are independent of t — the whole experiment reduces to 'multiply C by σ(b_f), 20 times'.

LOOP TRACE · 6 iterations

t = 1 (b_f = 0.0 case)

f_1 = σ(0) = 0.5

C_1 = f_1 · C_0 = 0.5 · 1.0 = 0.5000

t = 10 (b_f = 0.0)

C_10 = 0.5^10 · 1.0 = ≈ 9.77e-4 — less than a thousandth of the original

t = 20 (b_f = 0.0)

C_20 = 0.5^20 · 1.0 = ≈ 9.54e-7 — memory essentially wiped

t = 1 (b_f = 1.0 case)

f_1 = σ(1) = 0.7311

C_1 = f_1 · C_0 = 0.7311 · 1.0 = 0.7311

t = 10 (b_f = 1.0)

C_10 = 0.7311^10 · 1.0 = ≈ 0.0458 — still 5% of the original

t = 20 (b_f = 1.0)

C_20 = 0.7311^20 · 1.0 = ≈ 2.10e-3 — 2000× larger than the b_f=0 case

17z = np.concatenate([h, np.array([0.0])])

Input to every gate: [h, x] with x = 0.0 fixed. With h also starting at zero and W = 0, z will be multiplied by zero no matter what, so the gates are bias-driven.

EXECUTION STATE

z = [h_prev, 0.0] — second element is the zero input

→ W @ z = 0.0 — always, because W is zero

18f = sigmoid(W @ z + b_f_value)

The only gate whose value depends on the experimental variable. W·z is zero, so f = σ(b_f_value) — constant in t for this contrived setup. Two runs, two constants: σ(0)=0.5 vs σ(1)=0.731.

EXECUTION STATE

📚 @ = Matrix multiply — here it is W (1×2) @ z (2,) = a 1-element array holding 0.

→ run 1 (b_f=0) = f = σ(0 + 0) = 0.5 every step

→ run 2 (b_f=1) = f = σ(0 + 1) = 0.7311 every step

19i = sigmoid(W @ z + b_i)

Input gate. We set b_i = 0 so i = σ(0) = 0.5 every step. Its value does not matter for this experiment because the candidate g is always zero.

EXECUTION STATE

i = 0.5 every step (b_i = 0, W·z = 0)

20g = np.tanh(W @ z + b_g)

Candidate. W·z + b_g = 0 every step, so g = tanh(0) = 0. i·g = 0.5 · 0 = 0 — nothing is ever written into C. This is how we engineer a pure-decay experiment.

EXECUTION STATE

📚 np.tanh = tanh(0) = 0. The ONLY input to the cell-state update is zero → no new memory.

→ i · g = 0.5 · 0.0 = 0.0 — the write path is off by construction

21o = sigmoid(W @ z + b_o)

Output gate. We do not use it for the decay argument — we look at C, not h. Included for completeness.

EXECUTION STATE

o = σ(0) = 0.5

22C = f * C + i * g

The LSTM cell-state update — the ONE equation the forget-bias trick is really about. With i·g = 0 by construction, this collapses to C_t = f · C_{t-1}. A scalar multiplication every step. Run 20 times, the memory is multiplied by f^20.

EXECUTION STATE

→ simplified to C = f * C = because i*g = 0 by our setup

→ 20-step product (b_f=0) = C_20 = 0.5^20 · 1.0 ≈ 9.54e-7

→ 20-step product (b_f=1) = C_20 = 0.7311^20 · 1.0 ≈ 2.10e-3 — 2200× larger

→ the gradient version = ∂C_20/∂C_0 = f^20 for the same reason. So gradients through a 20-step unroll are 2200× stronger with the forget-bias trick.

23h = o * np.tanh(C)

Hidden state update — squash C then gate by o. Not the focus here, but included so the cell matches the canonical LSTM equations.

24trace.append(float(C[0]))

Record this step's cell state. Python floats for easy plotting.

25return trace

Hand back the full [C_0, C_1, …, C_20] record. The caller will compare two traces.

27no_bias = decay_experiment(b_f_value=0.0)

Run 1: canonical LSTM init, forget-bias = 0. Gate locks to σ(0) = 0.5 → memory halves every step.

EXECUTION STATE

⬆ no_bias[0] = 1.0000 — initial memory

⬆ no_bias[10] = ≈ 9.77e-4

⬆ no_bias[20] = ≈ 9.54e-7 — wiped

28with_bias = decay_experiment(b_f_value=1.0)

Run 2: Jozefowicz et al. 2015 recommendation, forget-bias = +1. Gate locks to σ(1) = 0.7311 → memory multiplied by 0.73 each step, much slower decay.

EXECUTION STATE

⬆ with_bias[0] = 1.0000

⬆ with_bias[10] = ≈ 0.0458 — 47× larger than the no-bias run at t=10

⬆ with_bias[20] = ≈ 2.10e-3 — 2200× larger than the no-bias run at t=20

30print(f"b_f = 0: ...")

Final printed line for the no-bias case. Notice the 3 orders of magnitude drop in 20 steps.

EXECUTION STATE

→ printed = b_f = 0: C_0=1.0000 C_20=9.537e-07

31print(f"b_f = 1: ...")

Final printed line for the with-bias case. Still decaying, but at a rate the rest of the training process can counter by learning a higher effective f_t on real data.

EXECUTION STATE

→ printed = b_f = 1: C_0=1.0000 C_20=2.105e-03

32print(f"ratio ...")

The take-away number: a +1 initial forget-bias makes the 20-step gradient product about 2200× stronger. Jozefowicz et al. 2015 Table 1 shows this alone accounts for a significant fraction of the training-speed gap between default and tuned LSTMs.

EXECUTION STATE

→ printed = ratio (with / without): 2207x larger

→ why this matters = During training, if gradients are 2000× stronger, the network has 2000× the signal to learn long-range patterns from. The forget-bias trick is almost free — one scalar — and closes most of the gap.

11 lines without explanation

1import numpy as np
2
3def sigmoid(x):
4    return 1.0 / (1.0 + np.exp(-x))
5
6# A minimal LSTM, zero inputs, with a CHOICE of forget-bias.
7# We write a non-zero cell state at t=0 and watch it decay.
8def decay_experiment(b_f_value, steps=20, C0=1.0):
9    # All weights zero → gates depend only on biases.
10    W = np.zeros((1, 2))               # shape (hidden=1, hidden+input=2)
11    b_i, b_g, b_o = 0.0, 0.0, 0.0      # neutral
12
13    h = np.array([0.0])
14    C = np.array([C0])                 # seed one unit of memory at t=0
15    trace = [C[0]]
16
17    for _ in range(steps):
18        z = np.concatenate([h, np.array([0.0])])   # zero input
19        f = sigmoid(W @ z + b_f_value)   # the only gate that matters here
20        i = sigmoid(W @ z + b_i)
21        g = np.tanh(W @ z + b_g)
22        o = sigmoid(W @ z + b_o)
23        C = f * C + i * g                # i*g = 0 because g = tanh(0) = 0
24        h = o * np.tanh(C)
25        trace.append(float(C[0]))
26    return trace
27
28no_bias    = decay_experiment(b_f_value=0.0)
29with_bias  = decay_experiment(b_f_value=1.0)
30
31print(f"b_f = 0:  C_0={no_bias[0]:.4f}   C_20={no_bias[-1]:.3e}")
32print(f"b_f = 1:  C_0={with_bias[0]:.4f}   C_20={with_bias[-1]:.3e}")
33print(f"ratio (with / without): {with_bias[-1] / no_bias[-1]:.0f}x larger")

Over 20 steps with

b_f = 1

the gradient product is roughly

0.73^{20} \approx 2.1 \times 10^{-3}

; with

b_f = 0

it is

0.5^{20} \approx 9.5 \times 10^{-7}

. One scalar of initialization buys ~2000× stronger gradients at the start of training, which is why this trick is a default in almost every modern LSTM implementation.

The Copy Task: An Experimental Proof

Hochreiter and Schmidhuber introduced the copy task in the original 1997 paper as a stress test for long-range memory. The network sees $N$ content tokens, then $L$ distractor steps, then a cue telling it to reproduce the original tokens. A vanilla RNN cannot solve this when $L$ exceeds ~10; an LSTM solves it at $L = 100$ and beyond (Arjovsky, Shah & Bengio, 2016; Bai, Kolter & Koltun, 2018).

Loading copy-task demo…

Slide

L

from 5 to 120 and press Run inference. Around

L = 20

the vanilla RNN's answers start going red; at

L = 60

it is guessing at chance. The LSTM stays confident deep into large-

L

territory because its highway lets the early-step embedding survive across the distractors.

LSTM Variants — Peephole, CIFG, ConvLSTM

The LSTM you have just learned is the 1997 / 2000 canonical cell. Over the next two decades researchers proposed dozens of mutations. Greff et al. (2017) — “LSTM: A Search Space Odyssey” — ran 5200 training runs sweeping eight of the most plausible variants on three sequence-modelling tasks and reported a striking result: none of them consistently beat the vanilla cell by a meaningful margin. The ones worth knowing about are below, not because you should usually reach for them, but because you will see them in the wild and the names should mean something.

Variant	What changes	When it is used
Peephole LSTM (Gers, Schraudolph & Schmidhuber, 2002)	Gates see the cell state directly: f_t, i_t, o_t also read C_{t-1} (or C_t for o_t), not just [h_{t-1}, x_t]	Timing-sensitive tasks (rhythm, precise interval detection) — lets the gate condition on the memory itself, not just the gated output.
CIFG / Coupled Input-Forget Gate (Greff et al., 2017)	i_t = 1 − f_t, forcing keep and write to sum to 1 (exactly the GRU's 1−z / z coupling, grafted onto an LSTM)	Reduces parameter count by ~25% with near-identical task performance — the explicit bridge between LSTM and GRU.
ConvLSTM (Shi et al., 2015)	Replace every matrix multiply inside the cell with a 2-D convolution; gates and states are feature maps, not vectors	Spatio-temporal problems — precipitation nowcasting (the original paper), video prediction, weather modelling.
Layer-Norm LSTM (Ba, Kiros & Hinton, 2016)	Apply LayerNorm to the pre-activations of each gate	Helps training stability on long sequences; standard in many speech models.
Bidirectional LSTM (Schuster & Paliwal, 1997; Graves & Schmidhuber, 2005)	Run one LSTM forward, one backward; concatenate hidden states per step	Tagging, NER, acoustic modelling — any task where both past and future context are available at inference.

The practical rule. Start with the vanilla LSTM (with the forget-bias +1 trick from §17.1). Reach for a variant only when you have a specific reason: precise-timing tasks → peephole; spatio-temporal fields → ConvLSTM; parameter budget tight → CIFG or GRU; long sequences with training instability → LayerNorm LSTM. The vanilla cell is a surprisingly strong default.

Check Your Understanding

Four short questions to make sure the pieces are connected. The answers lead with a one-sentence justification, not just “A”.

Loading quiz…

Summary

Two states, not one. $C_t$ is long-term memory (no activation on its update), $h_t$ is the gated, bounded output.
Four sub-networks, one input. All of $f_t, i_t, \tilde{C}_t, o_t$ are affine functions of the same vector $[h_{t-1}, x_t]$ .
Gates use sigmoid; content uses tanh. Switches want $(0, 1)$ ; signed content wants $(-1, 1)$ .
The whole point is one equation: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ . Additive updates make the gradient highway possible.
Hidden state reads memory, doesn't own it: $h_t = o_t \odot \tanh(C_t)$ .

With the LSTM cell in hand, the next section unpacks the GRU — a later simplification that merges the forget and input gates into a single update gate and drops the cell state entirely. You will see how many LSTM ideas survive in a smaller package, and where the two architectures diverge.

References

Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8), 1735–1780. [Introduces the LSTM cell and the Constant Error Carousel.]
Gers, F. A., Schmidhuber, J. & Cummins, F. (2000). Learning to Forget: Continual Prediction with LSTM. Neural Computation 12(10), 2451–2471. [Adds the forget gate to the original 1997 architecture.]
Jozefowicz, R., Zaremba, W. & Sutskever, I. (2015). An Empirical Exploration of Recurrent Network Architectures. ICML 2015. [Shows empirically that initializing the forget-gate bias to +1 consistently improves training.]
Pascanu, R., Mikolov, T. & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. ICML 2013 / arXiv:1211.5063. [The canonical analysis of the vanishing/exploding gradient problem in RNNs.]
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE TNNLS 28(10), 2222–2232 / arXiv:1503.04069. [Large-scale ablation of LSTM variants — the vanilla cell holds up.]
Gers, F. A., Schraudolph, N. N. & Schmidhuber, J. (2002). Learning Precise Timing with LSTM Recurrent Networks. Journal of Machine Learning Research 3, 115–143. [Introduces peephole connections.]
Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K. & Woo, W. (2015). Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. NeurIPS 2015 / arXiv:1506.04214. [ConvLSTM.]
Ba, J. L., Kiros, J. R. & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450. [Includes the Layer-Norm LSTM variant.]
Schuster, M. & Paliwal, K. K. (1997). Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing 45(11), 2673–2681. [Introduces the bidirectional recurrent architecture.]
Graves, A. & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6), 602–610. [Bidirectional LSTM applied to speech.]
Arjovsky, M., Shah, A. & Bengio, Y. (2016). Unitary Evolution Recurrent Neural Networks. ICML 2016 / arXiv:1511.06464. [Benchmarks the copy task at long distractor lengths — LSTM solves L=100+, vanilla RNN fails at L>10.]
Bai, S., Kolter, J. Z. & Koltun, V. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv:1803.01271. [Additional copy-task benchmarks cited by the Copy-Task demo.]