Boo-AI — Master Artificial Intelligence by Building from Scratch

Why the GRU Exists

The LSTM was a breakthrough — finally, a recurrent cell that could remember across many timesteps. But by 2014, researchers had noticed something curious: the LSTM has four sub-networks (forget, input, candidate, output) operating on two state vectors ( $h_t$ and $C_t$ ). Was all that machinery really necessary? Or could a simpler design solve the same vanishing-gradient problem with fewer moving parts?

In Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Kyunghyun Cho and colleagues proposed the Gated Recurrent Unit. It makes two ruthless simplifications. First, the cell state and hidden state are merged into a single vector $h_t$ . Second, the forget and input gates are coupled into a single update gate whose fractions sum to 1 by construction. A separate reset gate remains, but it plays a very narrow role — it only affects the candidate, not the memory highway.

The core bet: we don't need two independent gates for forgetting and writing. If we keep $1-z$ of the old memory, we have exactly $z$ budget left for new content. One knob, two effects, no loss of expressivity on most real tasks.

The result is a cell with three weight matrices instead of four, a single state vector, and — empirically — performance that tracks the LSTM closely on language modelling, machine translation, and music generation while training faster. For many practitioners the GRU is the default; for others the LSTM's extra flexibility earns its keep. Understanding the GRU's minimalism is the best way to appreciate both.

One State Vector, Two Gates

An LSTM carries two vectors through time: the cell state $C_t$ (internal memory) and the hidden state $h_t$ (external output). A GRU carries only $h_t$ . That single vector plays both roles — it is the memory that persists and the signal that flows to the next layer. This is not just notation: it is a genuine reduction in the cell's internal degrees of freedom.

The three learned linear layers are:

Component	Formula	Role
Reset gate	r_t = σ(W_r · [h_{t-1}, x_t] + b_r)	How much past memory to let into the candidate
Update gate	z_t = σ(W_z · [h_{t-1}, x_t] + b_z)	How much of the candidate to write into the new memory
Candidate	h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)	The proposed new content — signed (-1, 1)

And the single update rule that glues them together:

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

Read it like a budget. Every channel has one unit of memory to spend. It can spend

1-z

on the past and

z

on the present, and those two fractions are forced to sum to 1. There is no way to spend “zero on the past and zero on the present” — you cannot forget without writing, and you cannot write without forgetting. That coupling is the GRU's central commitment.

The Update Gate: One Knob Replaces Two

The update gate $z_t \in (0, 1)^d$ is a per-channel sigmoid. Its value at channel $c$ answers a single question: what fraction of the new candidate do I write into channel $c$ of memory? The remaining fraction $1 - z_c$ automatically goes to keeping the old value. Three extreme settings show the gate's range:

Setting	Meaning	Effect on h_t per channel
z_c ≈ 0	Keep everything	h_t,c ≈ h_{t-1},c — memory is frozen this step
z_c ≈ 1	Overwrite everything	h_t,c ≈ h̃_t,c — new candidate replaces memory
z_c ≈ 0.5	Balance	h_t,c ≈ half past + half candidate

Compare this to the LSTM, where the forget gate $f_t$ and input gate $i_t$ are two independent sigmoids. An LSTM can set $f = 0.9, i = 0.9$ — simultaneously keeping a lot of the old memory AND writing a lot of new content (letting the cell state grow in magnitude). A GRU cannot express that combination. If it keeps 90% of the past, it writes at most 10% of the candidate.

Is this a problem in practice? Usually not. On most sequence-modeling tasks, the coupling behaves as an implicit regularizer — it prevents the memory vector from drifting by encouraging partial replacement rather than unbounded accumulation. Chung et al. (2014) compared GRU and LSTM on polyphonic music modelling and speech-signal modelling: the two architectures tracked each other closely across tasks, with the GRU slightly preferred on music datasets and the two tied on speech. Jozefowicz et al. (2015) later swept a much larger space of gated variants and found no mutation that consistently dominated both. Treat them as interchangeable first choices, not as known winner-and-loser.

The Reset Gate: Forgetting Just for the Proposal

The reset gate is the more subtle of the two gates, and the one most beginners misunderstand. Its equation looks almost identical to the update gate's:

$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$

But its role in the network is completely different. The reset gate does not appear in the final mixing step. It appears only inside the candidate:

$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, \, x_t] + b_h)$

Look carefully at what $r_t$ multiplies: $h_{t-1}$ , but only for the purpose of building the proposal. When $r_c \approx 0$ on some channel, that channel of the past is hidden from the candidate — the proposal depends essentially on $x_t$ alone for that channel. When $r_c \approx 1$ , the past participates fully.

Intuition — starting a new sentence. Imagine a GRU reading a novel. At the end of a chapter, the reset gate can close — the candidate for the next step is built from the new chapter's first sentence with the previous chapter's plot pushed out of the proposal's input. The update gate then decides how much of the new candidate to actually write. Reset = a clean slate for proposing; update = how much to commit.

The identity path

(1 - z_t) \odot h_{t-1}

is completely untouched by

r_t

. That is what lets gradients flow through time unchanged — the reset gate cannot sever the memory highway, it can only influence what gets proposed for writing.

The Candidate Hidden State

With the reset gate in hand we can build $\tilde{h}_t$ :

$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, \, x_t] + b_h)$

Three things are worth pausing on. First, the activation is $\tanh$ , not $\sigma$ . Gates are soft switches in $(0, 1)$ — they should never flip signs. The candidate is content, and content must be able to cancel or flip previous content, so it lives in $(-1, 1)$ .

Second, the candidate's input is $[r_t \odot h_{t-1}, \, x_t]$ — reset-masked past concatenated with the current input. The update gate $z_t$ saw raw $h_{t-1}$ ; the candidate sees a possibly-silenced version. This asymmetry is a deliberate design choice that makes the reset gate's role well-defined: shape the proposal, not the keep-fraction.

Third, $\tilde{h}_t$ is just a proposal. It does not become the new hidden state on its own. The update gate in the next step decides how much of it to actually commit.

The Final Mix: Convex Combination in Time

Every piece has been assembled; the final step is mechanical:

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

Per channel this is a convex combination — a weighted average whose weights sum to 1. Geometrically, $h_t$ lies on the line segment between the old memory $h_{t-1}$ and the new candidate $\tilde{h}_t$ , at the position dictated by $z$ .

Why is this equation the heart of the GRU? Because of what it says about gradients. Taking a derivative of $h_t$ with respect to $h_{t-1}$ (treating gates as constants for the moment):

$\frac{\partial h_t}{\partial h_{t-1}} \approx (1 - z_t) \cdot I + z_t \cdot \frac{\partial \tilde{h}_t}{\partial h_{t-1}}$

The first term is a scaled identity matrix. When $z_t \approx 0$ (keep memory), the gradient flows backward essentially unchanged — no repeated multiplication by small numbers, no vanishing. This is exactly the same mechanism that makes the LSTM's cell state a gradient highway, accomplished with one fewer gate and one fewer state vector.

The unifying picture: LSTM and GRU both solve the vanishing-gradient problem by inserting an additive identity path into the recurrence. The LSTM's path runs on $C$ ; the GRU's runs on $h$ itself. Once you see that, the surface differences (number of gates, number of states) become a matter of taste and parameter count.

Interactive: GRU Architecture

Click through the four stages — reset gate, candidate, update gate, final mix — and watch each block of the cell light up. The dashed orange line is the identity path $(1 - z_t) \odot h_{t-1}$ that lets gradients travel through time unchanged.

Loading GRU architecture…

Interactive: The Gates at Work

The explorer uses scalar versions of $r_t$ , $z_t$ , and $\tilde{h}_t$ so you can watch all three values respond to the two sliders in real time. In a full network each gate is a vector and each channel acts independently — but the qualitative behavior is exactly what the scalar case shows.

Loading gate explorer…

Try the following. (1) Set

x_t

very negative. What happens to

z_t

? What happens to

h_t

? (2) Set

h_{t-1}

to a large positive value and slide

x_t

from −1 to +1 — observe how much the memory can change in a single step depending on

z_t

Interactive: GRU vs LSTM Side by Side

The following visualization puts the two architectures next to each other. You can slide the input and hidden dimensions and watch the parameter count update — the GRU consistently uses $3/4$ of the LSTM's parameters because it has three gate/candidate matrices instead of four, and no separate cell state.

Loading comparison…

A Full Forward Pass with Real Numbers

Before we look at code, let us trace one GRU cell by hand on the three-token toy sequence “I love it”. We use the same two-dimensional toy embeddings as the LSTM section — this makes every matmul a 2×4 matrix times a (4,) vector, small enough to write out.

Weights

Matrix	Values	Bias
W_r (2×4)	[[ 0.3, -0.2, 0.4, 0.1], [ 0.1, 0.5, -0.3, 0.2]]	b_r = [0.1, 0.0]
W_z (2×4)	[[ 0.2, 0.3, -0.1, 0.4], [-0.2, 0.1, 0.5, 0.2]]	b_z = [-0.1, 0.1]
W_h (2×4)	[[ 0.1, -0.4, 0.3, 0.2], [ 0.4, 0.2, -0.1, 0.5]]	b_h = [0.0, 0.1]

Step 1 — token “I”, x = [0.5, −0.2]

Concatenate: $[h_0, x_1] = [0, 0, 0.5, -0.2]$ . Compute the reset pre-activation row by row:

Row 0 of $W_r [h, x]$ : $0.3 \cdot 0 + (-0.2)\cdot 0 + 0.4\cdot 0.5 + 0.1\cdot(-0.2) = 0.18$ ; add bias 0.1 → $0.28$ .
Row 1: $0.1\cdot 0 + 0.5\cdot 0 + (-0.3)\cdot 0.5 + 0.2\cdot(-0.2) = -0.19$ ; add bias 0.0 → $-0.19$ .
$r_1 = \sigma([0.28, -0.19]) = [0.5695, 0.4526]$ .

Update gate, same drill but with $W_z, b_z$ : $z_1 = \sigma([-0.23, 0.31]) = [0.4428, 0.5769]$ .

Candidate: because $h_0 = 0$ , the reset-masked past is $r \odot h_0 = 0$ , so $[r \odot h_0, x_1] = [0, 0, 0.5, -0.2]$ .

Row 0 of $W_h$ : $0.3\cdot 0.5 + 0.2\cdot(-0.2) + 0.0 = 0.11$ .
Row 1: $(-0.1)\cdot 0.5 + 0.5\cdot(-0.2) + 0.1 = -0.05$ .
$\tilde{h}_1 = \tanh([0.11, -0.05]) = [0.1096, -0.0500]$ .

Final mix: $h_1 = (1 - z_1) \odot 0 + z_1 \odot \tilde{h}_1 = [0.4428\cdot 0.1096,\; 0.5769\cdot(-0.05)] = [0.0485, -0.0288]$ .

Step 2 — token “love”, x = [0.8, 0.3]

Now the reset gate has real memory to act on. After the arithmetic (exactly the same recipe, this time with $h_1 = [0.0485, -0.0288]$ ):

Quantity	Value
r_2	[0.6155, 0.4528]
z_2	[0.4853, 0.6335]
r_2 ⊙ h_1	[0.0299, -0.0130]
h̃_2 = tanh(W_h [r⊙h, x] + b)	[0.2988, 0.1774]
(1 − z_2) ⊙ h_1	[0.0250, -0.0106]
z_2 ⊙ h̃_2	[0.1450, 0.1124]
h_2 (sum)	[0.1700, 0.1018]

Step 3 — token “it”, x = [0.1, 0.9]

Quantity	Value
r_3	[0.5648, 0.5543]
z_3	[0.5780, 0.5760]
r_3 ⊙ h_2	[0.0960, 0.0565]
h̃_3	[0.1945, 0.5297]
(1 − z_3) ⊙ h_2	[0.0717, 0.0432]
z_3 ⊙ h̃_3	[0.1124, 0.3051]
h_3 (sum)	[0.1842, 0.3483]

Compare this to the LSTM section's worked example. There, the cell state grew in magnitude from step 1 to step 3 (≈0.21 → 0.35) while the hidden state stayed small. Here, the single $h$ vector grows from ≈0.05 to ≈0.35 across three steps — the GRU does not split memory from output, so you see the accumulation directly in the exposed vector.

GRU From Scratch in Python

Every line of the forward pass you just traced by hand shows up as a line of NumPy below. Click any line in the code — the explanation panel reveals shapes, values, and the mathematical role of every symbol at that point in the computation. This is the same cell you just computed; running it will reproduce the hand-traced numbers exactly.

A Complete GRU Cell in NumPy — Click Any Line

🐍gru_numpy.py

Explanation(18)

Code(45)

1import numpy as np

NumPy provides fast N-dimensional arrays and the basic math used everywhere in this cell: matrix multiply with @, element-wise multiplication with *, np.tanh, np.exp. Every symbol in the GRU equations maps directly to a NumPy operation below.

EXECUTION STATE

numpy = Scientific computing library — ndarray, linear algebra, elementwise math.

as np = Alias so we write np.exp() instead of numpy.exp() — universal Python convention.

3def sigmoid(x)

The gating activation. σ(x) = 1/(1+e^{-x}) squashes any real number into (0, 1). In a GRU, BOTH gates are sigmoids because each channel should behave like a soft switch: 0 = block, 1 = pass through.

EXECUTION STATE

⬇ input: x = A NumPy array of pre-activations. Can be any real number.

→ why sigmoid? = Outputs live in (0, 1). Multiplying memory by a sigmoid value is literally 'keep this much'. A ReLU would not cap the upper end; tanh flips signs. Neither fits the soft-switch role.

⬆ returns = Same shape as x, every element squashed into (0, 1).

4return 1.0 / (1.0 + np.exp(-x))

Computes sigmoid element-wise. np.exp(-x) flips the sign then applies the exponential to every element; NumPy broadcasts the scalar 1.0 across the resulting array.

EXECUTION STATE

📚 np.exp = Element-wise exponential e^x. np.exp(0)=1.0, np.exp(1)=2.718, np.exp(-1)=0.368.

→ example: x = [0, 1, -1] = -x = [0, -1, 1] → exp = [1.0, 0.368, 2.718] → 1/(1+exp) = [0.50, 0.731, 0.269]

7def gru_cell(x_t, h_prev, params)

One GRU cell processing a SINGLE timestep. Unlike LSTM there is only ONE state flowing in and out — h_prev. No separate cell state. The whole sequence is processed by calling this function once per token.

EXECUTION STATE

⬇ input: x_t (shape (2,)) = The current token's feature vector. For step 1: x_t = [0.5, -0.2] (our toy embedding for 'I').

→ x_t purpose = The external signal driving this timestep. In a language model it would be a word embedding; in time series, a sensor reading at time t.

⬇ input: h_prev (shape (2,)) = The hidden state carried from t-1. For step 1 this is zeros; for later steps it is the previous h_t.

→ h_prev purpose = Both memory AND output. Where LSTM had two states (C for memory, h for output), GRU fuses them — h_prev is simultaneously what we remember and what the cell just emitted.

⬇ input: params (dict) = Six weight arrays: three matrices W_r, W_z, W_h (each 2×4) and three bias vectors b_r, b_z, b_h (each shape (2,)).

→ params purpose = The learnable parameters. During training each is updated by gradient descent. Here we set them manually so every number below is hand-traceable.

⬆ returns = h_t — shape (2,). Becomes h_prev for the next step. No cell state is returned because there isn't one.

9z_in = np.concatenate([h_prev, x_t])

Stack the previous hidden state and the current input into one long vector. The reset and update gates will BOTH take this vector as input — they each need to see past context and current input simultaneously.

EXECUTION STATE

📚 np.concatenate = Joins a list of arrays along an existing axis. For 1-D arrays it just appends them end-to-end. concatenate([[1,2],[3,4,5]]) → [1,2,3,4,5].

⬇ arg: [h_prev, x_t] = Python list of two 1-D arrays: h_prev (shape (2,)) and x_t (shape (2,)).

→ step 1 values = h_prev = [0.00, 0.00], x_t = [0.50, -0.20]

⬆ z_in (shape (4,)) = [0.00, 0.00, 0.50, -0.20] — first 2 entries are the hidden state, last 2 are the input.

→ why this variable name? = z_in is the INPUT to the z (and r) gate's linear layer. Don't confuse with z_t, the update-gate OUTPUT we compute a few lines below.

12r_t = sigmoid(W_r @ z_in + b_r) — reset gate

The reset gate computes, for each channel, a number in (0, 1) that will multiply h_prev before it feeds the candidate. r ≈ 0 for a channel means 'ignore that memory channel when proposing new content'; r ≈ 1 means 'let it participate fully'. Crucially, r never touches the identity path (1 − z)·h_{t-1} — only the candidate.

EXECUTION STATE

📚 @ = Matrix multiply. W_r is (2, 4), z_in is (4,); W_r @ z_in is the (2,) vector of dot products row-by-row.

W_r (2×4) =

    h0     h1     x0     x1
c0  0.3  -0.2   0.4    0.1
c1  0.1   0.5  -0.3    0.2

→ W_r purpose = Row c0 decides the reset gate for channel 0 of h; row c1 for channel 1. Each row mixes hidden state and input into one pre-activation.

b_r (shape (2,)) = [0.1, 0.0] — small positive bias on channel 0 nudges toward 'let past participate' at initialization.

→ step 1 compute = z_in = [0, 0, 0.5, -0.2] row c0: 0.3·0 + (-0.2)·0 + 0.4·0.5 + 0.1·(-0.2) + 0.1 = 0.28 row c1: 0.1·0 + 0.5·0 + (-0.3)·0.5 + 0.2·(-0.2) + 0.0 = -0.19 σ([0.28, -0.19]) = [0.5695, 0.4526]

⬆ r_t (shape (2,)) = [0.5695, 0.4526] — channel 0 lets ~57% of h_prev through to the candidate, channel 1 lets ~45% through.

13z_t = sigmoid(W_z @ z_in + b_z) — update gate

The update gate decides per channel how much of h_prev to REPLACE with the candidate. This single gate does what an LSTM needs two gates (forget + input) to do: (1 − z_t) is the keep fraction, z_t is the write fraction, and they are forced to sum to 1 per channel.

EXECUTION STATE

W_z (2×4) =

    h0     h1     x0     x1
c0  0.2   0.3  -0.1    0.4
c1 -0.2   0.1   0.5    0.2

b_z (shape (2,)) = [-0.1, 0.1]

→ step 1 compute = row c0: 0.2·0 + 0.3·0 + (-0.1)·0.5 + 0.4·(-0.2) + (-0.1) = -0.23 row c1: -0.2·0 + 0.1·0 + 0.5·0.5 + 0.2·(-0.2) + 0.1 = 0.31 σ([-0.23, 0.31]) = [0.4428, 0.5769]

⬆ z_t (shape (2,)) = [0.4428, 0.5769] — channel 0 will replace ~44% of old memory with the candidate; channel 1 will replace ~58%.

→ z_t vs r_t = Two DIFFERENT decisions with two DIFFERENT weight matrices. r decides 'how much past to see inside the candidate'. z decides 'how much of the final h to be written from the candidate'. Both outputs of sigmoid, but they act at different points.

16rh = r_t * h_prev

Element-wise (Hadamard) product. The reset gate gates the previous hidden state BEFORE it enters the candidate's tanh. On step 1, h_prev is zero so rh is zero — the reset gate has nothing to act on yet.

EXECUTION STATE

📚 * (NumPy) = Element-wise multiplication when operands have the same shape. For shape (2,) arrays: rh[i] = r_t[i] * h_prev[i] for each i. NOT matrix multiply — that would be @.

→ step 1 values = r_t = [0.5695, 0.4526], h_prev = [0.00, 0.00] rh = [0.5695·0, 0.4526·0] = [0.00, 0.00]

⬆ rh (shape (2,)) = [0.0000, 0.0000] — the reset-masked past. Only non-zero once h_prev itself is non-zero, i.e. from step 2 onward.

→ step 2 peek = At step 2, h_prev = [0.0485, -0.0288]. Then rh = [0.6155·0.0485, 0.4528·(-0.0288)] = [0.0299, -0.0130]. That is the ONLY place r_t acts.

17cand_in = np.concatenate([rh, x_t])

Build the candidate's linear-layer input: reset-masked past hidden state followed by the current input. This is the precise difference between the two concatenations in this function — the reset gate shapes the input that the candidate sees, but not the input that r or z themselves saw.

EXECUTION STATE

⬇ arg: [rh, x_t] = List of two (2,) arrays: rh is h_prev after being multiplied by r_t; x_t is the raw input.

→ step 1 values = rh = [0.00, 0.00], x_t = [0.50, -0.20] cand_in = [0.00, 0.00, 0.50, -0.20]

⬆ cand_in (shape (4,)) = [0.00, 0.00, 0.50, -0.20]

→ why not reuse z_in? = z_in contained raw h_prev; cand_in contains r_t ⊙ h_prev. Using z_in for the candidate would bypass the reset gate and defeat its purpose.

18h_tilde = np.tanh(W_h @ cand_in + b_h) — candidate

Propose new content for h. Tanh (not sigmoid) because the candidate is CONTENT, not a switch — it needs to span (-1, 1) so individual channels can push memory up OR down. The update gate will decide later how much of this proposal to actually write.

EXECUTION STATE

📚 np.tanh = Hyperbolic tangent. Squashes any real number into (-1, 1). tanh(0)=0, tanh(1)=0.7616, tanh(-1)=-0.7616.

W_h (2×4) =

     rh0    rh1    x0     x1
c0  0.1  -0.4   0.3    0.2
c1  0.4   0.2  -0.1    0.5

b_h (shape (2,)) = [0.0, 0.1]

→ step 1 compute = cand_in = [0, 0, 0.5, -0.2] row c0: 0.1·0 + (-0.4)·0 + 0.3·0.5 + 0.2·(-0.2) + 0.0 = 0.11 row c1: 0.4·0 + 0.2·0 + (-0.1)·0.5 + 0.5·(-0.2) + 0.1 = -0.05 tanh([0.11, -0.05]) = [0.1096, -0.0500]

⬆ h_tilde (shape (2,)) = [0.1096, -0.0500] — signed, bounded proposal. Channel 0 wants to push h up a little; channel 1 wants to push it slightly down.

→ why tanh not sigmoid? = A word like 'not' should reverse the meaning of surrounding context — its contribution to memory must be negative. Sigmoid's (0,1) range could never encode that.

21h_t = (1.0 - z_t) * h_prev + z_t * h_tilde

The heart of the GRU. A per-channel convex combination: (1 − z) is the keep fraction, z is the write fraction, they sum to 1. When z ≈ 0 the old hidden state flows through unchanged — that is the identity path that defeats vanishing gradients. When z ≈ 1 we overwrite with the candidate.

EXECUTION STATE

📚 (1.0 - z_t) = NumPy broadcasts the scalar 1.0 across the (2,) array z_t and subtracts element-wise. Result shape (2,). This is the keep-fraction vector.

→ step 1 keep vec = 1 - z_t = 1 - [0.4428, 0.5769] = [0.5572, 0.4231]

→ step 1 keep term = (1 - z_t) * h_prev = [0.5572·0, 0.4231·0] = [0.0000, 0.0000]

→ step 1 write term = z_t * h_tilde = [0.4428·0.1096, 0.5769·(-0.0500)] = [0.0485, -0.0288]

⬆ h_t (shape (2,)) = [0.0485, -0.0288] — small in magnitude because this is the first step and there was no past memory to keep.

→ connection to LSTM = LSTM uses C_t = f·C_{t-1} + i·C̃ with f and i independent. GRU couples them: keep = (1 − z), write = z. Less flexibility, fewer parameters, nearly identical performance on many tasks.

23return h_t

Return the single new hidden state. The caller will pass this back in as h_prev at the next step, and it is also the cell's output for downstream layers.

EXECUTION STATE

⬆ return: h_t = Shape (2,) — the new memory AND the new output. One vector does both jobs.

26params = { ... }

A plain Python dictionary holding all six learnable arrays. Keeping them in a dict keeps the function signature clean and lets us pass everything in one argument. In PyTorch these live inside an nn.Module as registered parameters.

EXECUTION STATE

📚 dict = Python's built-in hash map. Keys are strings here; values are NumPy arrays.

W_r, W_z, W_h = Each (2, 4) — 8 numbers per matrix, 24 weights total across the three matrices.

b_r, b_z, b_h = Each (2,) — 2 numbers per bias, 6 biases total.

→ total params = 3 × (2 × (2 + 2) + 2) = 3 × 10 = 30 learnable numbers. (Here we set them by hand.)

38sequence = [...] (three toy tokens)

Three 2-dimensional feature vectors standing in for the word embeddings of 'I', 'love', 'it'. Real systems use 128- or 512-dim embeddings, but 2-D keeps every number on one line and every matmul inspectable.

EXECUTION STATE

sequence[0] = [ 0.5, -0.2] = toy embedding for 'I'

sequence[1] = [ 0.8, 0.3] = toy embedding for 'love'

sequence[2] = [ 0.1, 0.9] = toy embedding for 'it'

42h = np.zeros(2)

Initialize the hidden state to zeros. This is the canonical starting h_0 — the cell has no memory before step 1. Other strategies exist (learned initial state, random), but zero-init is standard and plays well with the identity path.

EXECUTION STATE

📚 np.zeros(shape) = Allocates an array of the given shape filled with 0.0. np.zeros(2) → [0.0, 0.0].

→ h shape = (2,) — must match hidden_size so the recurrence is dimensionally consistent.

⬆ h (shape (2,)) = [0.0, 0.0]

43for t, x_t in enumerate(sequence, 1):

Iterate over the three tokens, numbering them 1, 2, 3 for printing. Each iteration calls gru_cell once — this is where the recurrence actually happens.

LOOP TRACE · 3 iterations

t=1, x_t='I' = [0.5, -0.2]

h_prev (input) = [0.0000, 0.0000]

r_t = [0.5695, 0.4526]

z_t = [0.4428, 0.5769]

h_tilde = [0.1096, -0.0500]

h_t (output) = [0.0485, -0.0288]

t=2, x_t='love' = [0.8, 0.3]

h_prev (input) = [0.0485, -0.0288]

r_t = [0.6155, 0.4528]

z_t = [0.4853, 0.6335]

rh = r·h_prev = [0.0299, -0.0130]

h_tilde = [0.2988, 0.1774]

(1-z)·h_prev = [0.0250, -0.0106]

z·h_tilde = [0.1450, 0.1124]

h_t (output) = [0.1700, 0.1018]

t=3, x_t='it' = [0.1, 0.9]

h_prev (input) = [0.1700, 0.1018]

r_t = [0.5648, 0.5543]

z_t = [0.5780, 0.5760]

rh = r·h_prev = [0.0960, 0.0565]

h_tilde = [0.1945, 0.5297]

(1-z)·h_prev = [0.0717, 0.0432]

z·h_tilde = [0.1124, 0.3051]

h_t (output) = [0.1842, 0.3483]

44h = gru_cell(x_t, h, params)

Feed the current token and the old hidden state into the cell, replace h with the returned new hidden state. One line, one timestep. All the interesting math happened inside gru_cell.

EXECUTION STATE

⬇ arg: x_t = The embedding at this step (changes every iteration).

⬇ arg: h = The hidden state from the previous iteration — zeros on step 1, otherwise the last returned h.

⬇ arg: params = The dictionary of all six weight arrays. Unchanged each step (weights are fixed during inference).

⬆ h (reassigned) = The new hidden state. We reuse the name h so the loop is self-contained.

45print(f"step {t}: h={h.round(4)}")

Print the new hidden state at four decimals so it is easy to read. .round(4) is an NumPy method on arrays — it returns a new array with each element rounded.

EXECUTION STATE

📚 f-string = Python formatted string literal. {t} and {h.round(4)} are inlined expressions.

📚 .round(4) = NumPy method: rounds every element to 4 decimal places. Not in-place — returns a new array.

→ expected output = step 1: h=[ 0.0485 -0.0288] step 2: h=[0.17 0.1018] step 3: h=[0.1842 0.3483]

27 lines without explanation

1import numpy as np
2
3def sigmoid(x):
4    return 1.0 / (1.0 + np.exp(-x))
5
6# ---- One GRU cell: processes a single timestep ----
7def gru_cell(x_t, h_prev, params):
8    # Concatenate previous hidden state with current input: shape (hidden + input,)
9    z_in = np.concatenate([h_prev, x_t])
10
11    # Two gates — reset r and update z, each a sigmoid of a linear layer
12    r_t = sigmoid(params["W_r"] @ z_in + params["b_r"])   # reset gate
13    z_t = sigmoid(params["W_z"] @ z_in + params["b_z"])   # update gate
14
15    # Candidate hidden state — reset gate acts ONLY here, on h_prev
16    rh = r_t * h_prev
17    cand_in = np.concatenate([rh, x_t])
18    h_tilde = np.tanh(params["W_h"] @ cand_in + params["b_h"])
19
20    # Final hidden state — convex combination (1-z) keeps old, z writes new
21    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde
22
23    return h_t
24
25# ---- Run over a sequence ----
26params = {
27    "W_r": np.array([[ 0.3, -0.2,  0.4,  0.1],
28                     [ 0.1,  0.5, -0.3,  0.2]]),
29    "b_r": np.array([ 0.1,  0.0]),
30    "W_z": np.array([[ 0.2,  0.3, -0.1,  0.4],
31                     [-0.2,  0.1,  0.5,  0.2]]),
32    "b_z": np.array([-0.1,  0.1]),
33    "W_h": np.array([[ 0.1, -0.4,  0.3,  0.2],
34                     [ 0.4,  0.2, -0.1,  0.5]]),
35    "b_h": np.array([ 0.0,  0.1]),
36}
37
38sequence = [np.array([0.5, -0.2]),   # "I"
39            np.array([0.8,  0.3]),   # "love"
40            np.array([0.1,  0.9])]   # "it"
41
42h = np.zeros(2)
43for t, x_t in enumerate(sequence, 1):
44    h = gru_cell(x_t, h, params)
45    print(f"step {t}: h={h.round(4)}")

Running the code prints:

Step	Token	h_t
1	I	[ 0.0485, -0.0288]
2	love	[ 0.1700, 0.1018]
3	it	[ 0.1842, 0.3483]

Notice that there is no separate $C_t$ column. The GRU only has one state, and what you see is what downstream layers receive.

The Same GRU in PyTorch

PyTorch ships two ways to use a GRU: nn.GRUCell, which mirrors our NumPy loop exactly (one step at a time, you manage the recurrence in Python), and nn.GRU, which handles the whole sequence in a single optimized call. Both use the same underlying arithmetic; they differ in where the time-loop lives.

GRU in PyTorch — Cell-Level and Sequence-Level

🐍gru_pytorch.py

Explanation(15)

Code(23)

1import torch

PyTorch is a deep-learning framework built around tensors (GPU-accelerated multi-dim arrays) with automatic differentiation. Everything in this file goes through torch — tensors, modules, and the GRU kernel itself.

EXECUTION STATE

torch = Root PyTorch package: tensor ops (torch.tensor, torch.zeros), randomness (torch.manual_seed), device management (.to('cuda')).

2import torch.nn as nn

torch.nn is the neural-network module system. It holds every pre-built layer — including GRUCell and GRU — plus the Module base class you subclass when writing your own.

EXECUTION STATE

torch.nn = Layers as PyTorch Modules. Each Module owns its learnable parameters and defines a forward() method.

as nn = Universal alias so we write nn.GRU instead of torch.nn.GRU.

5torch.manual_seed(0)

Seed the global RNG so that the random weight initialization inside nn.GRUCell is reproducible. Without this, you would get different numbers every run, making it impossible to compare.

EXECUTION STATE

📚 torch.manual_seed(seed) = Sets the seed for CPU RNG. For GPU reproducibility you also need torch.cuda.manual_seed_all and torch.backends.cudnn.deterministic = True.

⬇ arg: 0 = Arbitrary integer; the same integer always gives the same weights on the same PyTorch build.

6cell = nn.GRUCell(input_size=2, hidden_size=2)

Construct a GRU cell that handles ONE timestep at a time. Internally PyTorch stores three fused weight matrices and two bias vectors (by default it uses two biases — one on the input projection and one on the hidden projection, which is equivalent to a single bias for inference).

EXECUTION STATE

📚 nn.GRUCell(input_size, hidden_size, bias=True) = Single-step GRU module. Stores: weight_ih_l0 shape (3·hidden, input) and weight_hh_l0 shape (3·hidden, hidden) — the 3 corresponds to r, z, h̃ stacked together.

⬇ arg: input_size=2 = Dimension of x_t. Must match the last dim of the tensor you pass into forward().

⬇ arg: hidden_size=2 = Dimension of the hidden state h. The output tensor will have this size, and you must pass an h of this size into forward().

→ param count = weight_ih (6, 2) + weight_hh (6, 6) + bias_ih (6) + bias_hh (6) = 12 + 36 + 12 = 60 learnable numbers. Exactly double our NumPy count because PyTorch uses TWO biases per gate by convention.

8sequence = torch.tensor([[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]])

A (3, 2) tensor: three tokens, two features each. We will index it as sequence[t] to get one token at a time. No batch dimension yet — we add it before calling the cell.

EXECUTION STATE

📚 torch.tensor(data) = Creates a tensor from nested Python lists. Inherits dtype from data (floats → float32 by default).

sequence.shape = torch.Size([3, 2]) — 3 timesteps, 2 features

→ content = row 0 = [0.5, -0.2] ('I') row 1 = [0.8, 0.3] ('love') row 2 = [0.1, 0.9] ('it')

12h = torch.zeros(1, 2)

Initial hidden state. Shape (batch, hidden) = (1, 2). The batch dimension is required even when batch size is 1 — GRUCell expects a 2-D tensor, not a 1-D vector.

EXECUTION STATE

📚 torch.zeros(*sizes) = Returns a tensor of the given shape filled with 0.0. torch.zeros(1, 2) → tensor([[0., 0.]]).

→ shape reasoning = Both args to cell(x, h) must have a leading batch dim. batch=1 here because we are processing one sequence at a time.

⬆ h = tensor([[0., 0.]]) — shape (1, 2)

13for t, x_t in enumerate(sequence, 1):

Iterate over the time dimension in Python. This mirrors exactly what our NumPy loop did — one Python-level step per timestep. (The more idiomatic nn.GRU version below does the loop in C++/CUDA.)

EXECUTION STATE

📚 enumerate(iterable, start) = Built-in Python: yields (index, item) tuples. start=1 makes the first index 1 instead of 0 — a cosmetic choice for readable prints.

→ iteration 1 = t=1, x_t = tensor([0.5, -0.2]) — shape (2,)

→ iteration 2 = t=2, x_t = tensor([0.8, 0.3])

→ iteration 3 = t=3, x_t = tensor([0.1, 0.9])

14x_t = x_t.unsqueeze(0)

Add a batch dimension at the front. sequence[t] is a 1-D tensor of shape (2,); GRUCell wants (batch, input) = (1, 2). .unsqueeze(0) inserts a size-1 axis at position 0.

EXECUTION STATE

📚 .unsqueeze(dim) = Returns a new tensor with a size-1 dim inserted at position dim. tensor of shape (2,).unsqueeze(0) → shape (1, 2); .unsqueeze(1) → shape (2, 1).

⬇ arg: 0 = Insert the new axis at position 0 (the very front). This adds the batch dim expected by GRUCell.

→ before = tensor([0.5, -0.2]) shape (2,)

⬆ after = tensor([[0.5, -0.2]]) shape (1, 2) — one batch, one timestep

15h = cell(x_t, h)

Forward one step. cell(x, h) runs the GRU update internally: one fused matmul for all three pre-activations, then splits them into r, z, h̃ slices, applies sigmoid/tanh, and returns (1 − z)·h + z·h̃. All in optimized C++.

EXECUTION STATE

📚 cell.__call__(input, hx=None) = Invokes forward(). input shape (batch, input_size); hx shape (batch, hidden_size) or None (zero-init). Returns new h of shape (batch, hidden_size).

⬇ arg: x_t = (1, 2) — this step's token with batch dim.

⬇ arg: h = (1, 2) — hidden state carried from previous step.

→ values differ from NumPy = PyTorch uses its own weight init (uniform in ±1/√hidden). Our NumPy run used hand-picked weights. The MECHANISM is identical — only the numbers differ.

⬆ h (new) = (1, 2) — the updated hidden state, ready to be fed in again next iteration.

16print(f"step {t}: h={h.detach().numpy().round(4)}")

.detach() drops autograd tracking so we can convert to NumPy for easy printing; .numpy() gives a zero-copy NumPy view (CPU only); .round(4) trims decimals. None of these change h itself — they only format the printout.

EXECUTION STATE

📚 .detach() = Returns a new tensor that shares storage but is not tracked by autograd. Necessary before .numpy() if requires_grad was ever True.

📚 .numpy() = Converts a CPU tensor to a NumPy ndarray (zero-copy when contiguous). Errors if the tensor lives on GPU — you'd need .cpu().numpy().

📚 .round(4) = NumPy method — rounds each element to 4 decimals for readable output.

19gru = nn.GRU(input_size=2, hidden_size=2, batch_first=True)

The idiomatic PyTorch GRU. Instead of a Python for-loop, give it the whole sequence as a (batch, seq_len, features) tensor and it runs the cell over all timesteps in a single optimized kernel.

EXECUTION STATE

📚 nn.GRU = Multi-step, multi-layer GRU. Defaults: num_layers=1, bidirectional=False, dropout=0.

⬇ arg: batch_first=True = Changes the expected input shape from (seq_len, batch, features) — PyTorch's historical default — to (batch, seq_len, features), which most people find more intuitive. If you forget this flag, PyTorch will silently reinterpret time as batch.

→ why batch_first matters = With batch_first=False, passing (1, 3, 2) thinking 'batch=1, seq=3, features=2' causes PyTorch to read it as seq=1, batch=3 — three separate length-1 sequences instead of one length-3 sequence. The GRU will run but the semantics are wrong.

20batch = torch.tensor([[[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]]])

Pack the whole sequence into one (1, 3, 2) tensor: one batch, three timesteps, two features. The triple bracket matters — outer is batch, middle is time, inner is features.

EXECUTION STATE

batch.shape = torch.Size([1, 3, 2]) — (batch=1, seq_len=3, input=2)

→ indexing = batch[0] — the only sequence (shape (3, 2)) batch[0][0] — first timestep (shape (2,)) batch[0][0][0] — first feature of first timestep (scalar 0.5)

21output, h_n = gru(batch)

Run the whole sequence through the GRU in a single call. output contains the hidden state at EVERY timestep; h_n is the hidden state AFTER the last timestep. Unlike LSTM, there is only one return element for state — no c_n — because GRU has no separate cell state.

EXECUTION STATE

⬆ output = Shape (batch, seq_len, hidden) = (1, 3, 2). output[0][t] is h_t at step t. Used when you need per-step outputs (sequence tagging, seq2seq encoder).

⬆ h_n = Shape (num_layers·num_directions, batch, hidden) = (1, 1, 2). The final hidden state. For sentence classification you typically feed this into a Linear head.

→ output vs h_n = output[:, -1, :] (last timestep of output) equals h_n[-1] — same numbers, different shapes. nn.LSTM returned (output, (h_n, c_n)); nn.GRU returns (output, h_n). One tuple element fewer.

22print("output shape:", tuple(output.shape))

Convert torch.Size to a plain tuple for cleaner printing. Verifies that the shape is the expected (1, 3, 2).

EXECUTION STATE

→ printed = output shape: (1, 3, 2)

23print("final h_n :", h_n.detach().numpy().round(4))

Print the final hidden state. For a sentiment classifier, this is the vector you would feed into an nn.Linear(hidden_size, num_classes) to produce logits.

EXECUTION STATE

→ printed (representative) = final h_n : [[[0.1734 0.2102]]] (exact values depend on torch version / seed, but shape is always (1, 1, 2))

→ reconciling with NumPy = Our NumPy run ended at h_3 = [0.1842, 0.3483] because we set weights by hand. PyTorch's randomly initialized weights produce different numbers. Same formulas, same three learned matrices — just different numerical fillings.

8 lines without explanation

1import torch
2import torch.nn as nn
3
4# ---- nn.GRUCell: one step at a time (mirrors our NumPy loop) ----
5torch.manual_seed(0)
6cell = nn.GRUCell(input_size=2, hidden_size=2)
7
8sequence = torch.tensor([[0.5, -0.2],
9                         [0.8,  0.3],
10                         [0.1,  0.9]])
11
12h = torch.zeros(1, 2)                 # (batch=1, hidden=2)
13for t, x_t in enumerate(sequence, 1):
14    x_t = x_t.unsqueeze(0)            # (1, 2) batch dim
15    h = cell(x_t, h)
16    print(f"step {t}: h={h.detach().numpy().round(4)}")
17
18# ---- nn.GRU: whole sequence in one call ----
19gru = nn.GRU(input_size=2, hidden_size=2, batch_first=True)
20batch = torch.tensor([[[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]]])  # (1, 3, 2)
21output, h_n = gru(batch)
22print("output shape:", tuple(output.shape))
23print("final h_n   :", h_n.detach().numpy().round(4))

The printed numbers will differ from the NumPy version because PyTorch uses random weight initialization instead of our hand-picked matrices. The mechanism is identical, line for line — three linear projections, two sigmoid gates, one tanh candidate, one convex mix.

Aspect	NumPy from scratch	nn.GRUCell	nn.GRU
Weights	You provide	Random init	Random init
Gates	Three separate matmuls	One fused matmul	One fused matmul, batched over time
Loop over time	Python for-loop	Python for-loop	Optimized C++/CUDA kernel
Batching	One sequence	Batch-aware	Batch + sequence in one tensor
GPU	None	.to('cuda')	.to('cuda')
When to use	Learning / debugging	Custom step logic	Standard training

Why GRUs Still Solve Vanishing Gradients

The canonical objection to vanilla RNNs is that gradients vanish when backpropagated through many timesteps because every step multiplies by the same small Jacobian. The LSTM escapes this via an additive cell state whose recurrence has an identity term. Does the GRU — with its single state — really preserve the same property?

Yes. Differentiate the update rule:

$\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(1 - z_t) \;+\; \text{diag}(z_t) \cdot \frac{\partial \tilde{h}_t}{\partial h_{t-1}} \;+\; (\text{terms from } \partial z_t / \partial h_{t-1})$

The first term, $\text{diag}(1 - z_t)$ , is a diagonal matrix whose entries are the keep-fractions. When the update gate chooses to preserve channel $c$ , that entry is close to 1, so the gradient of $h_{t,c}$ with respect to $h_{t-1,c}$ is close to 1 — no shrinkage. Over $T$ steps the backward path is a product of such near-identity factors, which stays bounded.

The additional terms come from gradients flowing through the gates and the candidate tanh. These paths do decay — but they are not the only path back. The identity path ensures that a non-vanishing gradient always exists, and that is enough to let learned weights find long-range dependencies.

The mathematical trick is identical to the LSTM's: add, don't multiply. The only difference is that the additive identity in GRU lives on

h_t

itself rather than on a separate

C_t

. Same gradient physics, one state vector fewer.

Interactive: Gradient Flow (RNN vs LSTM vs GRU)

The three curves below are the same $\|\partial \ell_t / \partial \text{state}_1\|$ measurement run through each cell type on a $\log_{10}$ axis. Slide the keep-fraction control — it is $f_t$ for the LSTM and $1 - z_t$ for the GRU. Both gated cells stay near 1; only the vanilla RNN collapses.

Loading gradient flow…

Numerical Gradient Check

To make the argument concrete, compare the highway approximation $(1 - z_3) \odot (1 - z_2)$ to the numerical Jacobian of the three-step GRU from §17.2's worked example. We use finite differences (step $\varepsilon = 10^{-6}$ ) — autograd would give the same answer, exactly, for free.

Highway Term vs Numerical Jacobian

🐍gru_grad_check.py

Explanation(31)

Code(59)

1import numpy as np

NumPy gives us element-wise arithmetic on the 2-element update-gate vectors and a clean finite-difference loop.

EXECUTION STATE

numpy = Numerical Python library.

4z_2 = np.array([0.4853, 0.6335])

Update-gate output at t=2, copied from §17.2's worked example. Per channel it says: 'write ~49% new content to channel 0, ~63% to channel 1'. Channel 0 will therefore keep (1 − 0.49) = 0.51 of the past; channel 1 keeps 0.37.

EXECUTION STATE

⬆ z_2 = [0.4853, 0.6335]

5z_3 = np.array([0.5780, 0.5760])

Update gate at t=3 (also from the §17.2 trace).

EXECUTION STATE

⬆ z_3 = [0.5780, 0.5760]

11dh3_dh1_highway = (1 - z_3) * (1 - z_2)

The GRU equivalent of the LSTM's f_t·f_{t-1} highway product. The identity path in h_t = (1−z_t)·h_{t-1} + z_t·h̃_t contributes a diag(1 − z_t) Jacobian each step. Multiplying two of those (and ignoring the h̃-path correction for now) gives a diagonal approximation for ∂h_3/∂h_1.

EXECUTION STATE

(1 − z_2) = [0.5147, 0.3665]

(1 − z_3) = [0.4220, 0.4240]

→ elem-wise product = [0.4220·0.5147, 0.4240·0.3665] = [0.2172, 0.1554]

⬆ dh3_dh1_highway = [0.2172, 0.1554]

→ intuition = Shrinking by ~3× per step — same order as the LSTM's worked example. Both cells depend on learning gates that push closer to 1 on genuinely long-range tasks.

12print((1 - z_2) …)

Report the keep-fractions per channel at t=2.

EXECUTION STATE

→ printed = (1 - z_2) = [0.5147 0.3665]

13print((1 - z_3) …)

Keep-fractions at t=3.

EXECUTION STATE

→ printed = (1 - z_3) = [0.4220 0.4240]

14print(highway …)

Report the highway-term approximation to the diagonal Jacobian.

EXECUTION STATE

→ printed = highway ∂h_3/∂h_1 (diag) ≈ [0.2172 0.1554]

21def sigmoid(x)

Same sigmoid we have used throughout — needed for the forward pass below.

24W_r, b_r, W_z, b_z, W_h, b_h

The §17.2 hand-picked weights repeated here so this block is self-contained. With these values plus the toy ‘I love it’ sequence, the forward pass reproduces the exact numbers in §17.2's worked example.

EXECUTION STATE

W_r (2×4) =

    h0     h1     x0     x1
c0  0.3  -0.2   0.4    0.1
c1  0.1   0.5  -0.3    0.2

W_z (2×4) =

    h0     h1     x0     x1
c0  0.2   0.3  -0.1    0.4
c1 -0.2   0.1   0.5    0.2

W_h (2×4) =

     rh0    rh1    x0     x1
c0  0.1  -0.4   0.3    0.2
c1  0.4   0.2  -0.1    0.5

33def gru_run(h0)

Forward pass through all three timesteps, parametrized by the initial hidden state h_0. We will call this twice per channel to estimate the numerical Jacobian.

EXECUTION STATE

⬇ input: h0 (shape (2,)) = The initial hidden state we will perturb.

⬆ returns h_3 = The final hidden state after 3 steps — shape (2,).

34seq = [np.array(...), np.array(...), np.array(...)]

Hard-code the same 3-token sequence used in §17.2 so this block stands alone and reproduces those numbers.

EXECUTION STATE

seq[0] = [0.5, -0.2] — ‘I’

seq[1] = [0.8, 0.3] — ‘love’

seq[2] = [0.1, 0.9] — ‘it’

37h = h0.copy()

.copy() because we do NOT want to mutate the caller's h0 — the finite-difference loop reuses the original to compare.

EXECUTION STATE

📚 np.ndarray.copy() = Independent copy with its own memory. Without it, h = h0 would alias and the perturbation would spread.

38for x in seq:

Iterate over the three timesteps.

LOOP TRACE · 3 iterations

step 1 (x = [0.5, -0.2])

r_1 = [0.5695, 0.4526]

z_1 = [0.4428, 0.5769]

h_tilde_1 = [0.1096, -0.0500]

h_1 = [0.0485, -0.0288]

step 2 (x = [0.8, 0.3])

z_2 = [0.4853, 0.6335] ← used in highway term above

h_2 = [0.1700, 0.1018]

step 3 (x = [0.1, 0.9])

z_3 = [0.5780, 0.5760] ← used in highway term above

h_3 = [0.1842, 0.3483]

39z_in = np.concatenate([h, x])

Build the gate input by concatenating the current hidden state with the current observation.

40r = sigmoid(W_r @ z_in + b_r)

Reset gate — controls only the candidate, not the highway.

41z = sigmoid(W_z @ z_in + b_z)

Update gate — the knob that drives the highway / write split.

42cand_in = np.concatenate([r * h, x])

Reset-masked past concatenated with the current input — the input to the candidate's linear layer.

43h_tilde = np.tanh(W_h @ cand_in + b_h)

Candidate new content in (-1, 1).

44h = (1 - z) * h + z * h_tilde

The GRU's heart: convex combination of old and new per channel. Its Jacobian contributes a diag(1 − z) term — the identity path that makes the gradient highway possible.

45return h

Hand back the final hidden state after three steps.

47eps = 1e-6

Finite-difference step. Smaller than 1e-6 and round-off noise dominates; larger than 1e-4 and the O(eps²) truncation error shows. 1e-6 is the sweet spot for float64 one-sided differences.

EXECUTION STATE

eps = 1.0e-06

48h0 = np.zeros(2)

Baseline initial state — same zero init used throughout the chapter.

49h3 = gru_run(h0)

Unperturbed final state. We need it as the reference in the difference quotient.

EXECUTION STATE

h3 (reference) = [0.1842, 0.3483]

50jacob = np.zeros((2, 2))

Pre-allocate the 2×2 Jacobian. Column j of this matrix will be the partial derivative of h_3 with respect to h_0[j].

EXECUTION STATE

jacob shape = (2, 2)

51for j in range(2):

Loop over the two channels of h_0. For each channel we bump it by eps and rerun the forward pass.

LOOP TRACE · 2 iterations

j = 0 (perturb channel 0)

h0_plus = [1e-06, 0]

h3_plus − h3 (per channel, ×1e6) = ≈ [0.2283, 0.0143]

jacob[:, 0] = [0.2283, 0.0143] — column 0 of ∂h_3/∂h_0

j = 1 (perturb channel 1)

h0_plus = [0, 1e-06]

h3_plus − h3 (per channel, ×1e6) = ≈ [0.0087, 0.1642]

jacob[:, 1] = [0.0087, 0.1642] — column 1 of ∂h_3/∂h_0

52h0_plus = h0.copy(); h0_plus[j] += eps

Make a private copy and push channel j by eps. Using .copy() avoids mutating the baseline h0.

EXECUTION STATE

→ one-sided FD = We use forward differences (f(x+eps) − f(x)) / eps for simplicity. Central differences would be more accurate but require twice the forward passes.

53h3_plus = gru_run(h0_plus)

Forward pass with the perturbed initial state.

54jacob[:, j] = (h3_plus - h3) / eps

Numerical partial derivative: change in h_3 divided by the change in h_0[j]. Storing it as column j of the Jacobian.

EXECUTION STATE

📚 slicing jacob[:, j] = jacob[:, j] selects all rows of column j — a length-2 vector in our 2×2 case.

56print(f'finite-diff ∂h_3/∂h_0 =\n{jacob.round(4)}')

The full 2×2 numerical Jacobian. Off-diagonal entries are small — cross-channel coupling comes only from the h-path through the gates, not from the diagonal highway.

EXECUTION STATE

→ printed = finite-diff ∂h_3/∂h_0 = [[0.2283 0.0087] [0.0143 0.1642]]

57print(f'diag of FD Jacobian = {np.diag(jacob).round(4)}')

Extract the diagonal — the same quantity the highway term approximates.

EXECUTION STATE

📚 np.diag(matrix) = Returns the diagonal entries of a 2-D array as a 1-D array. np.diag(np.eye(3)) = [1, 1, 1].

→ printed = diag of FD Jacobian = [0.2283 0.1642]

58print(f'highway approximation = {dh3_dh1_highway.round(4)}')

Side-by-side comparison. The highway approximation is [0.2172, 0.1554]; the true Jacobian diagonal is [0.2283, 0.1642]. Difference: ~5% in both channels. The extra mass is precisely the gradient flowing through the reset gate and the tanh candidate — autograd would capture it exactly, but our by-hand derivation shows the dominant path is the identity.

EXECUTION STATE

→ printed = highway approximation = [0.2172 0.1554]

→ take-away = LSTM gradient physics: ∏ f_t. GRU gradient physics: ∏ (1 − z_t) plus small corrections. Same additive-identity trick, fewer parameters, nearly the same behaviour.

28 lines without explanation

1import numpy as np
2
3# Update-gate values from §17.2's worked example, copied over.
4z_2 = np.array([0.4853, 0.6335])   # at t=2
5z_3 = np.array([0.5780, 0.5760])   # at t=3
6
7# ---- Analytical highway term ----
8# Just as the LSTM has ∂C_t/∂C_{t-1} = f_t, the GRU has
9#     ∂h_t / ∂h_{t-1}  ≈  diag(1 − z_t)   (ignoring the small tanh-candidate path).
10# So the highway contribution to ∂h_3/∂h_1 is (1 − z_3) ⊙ (1 − z_2):
11dh3_dh1_highway = (1 - z_3) * (1 - z_2)
12print(f"(1 - z_2) = {(1 - z_2).round(4)}")
13print(f"(1 - z_3) = {(1 - z_3).round(4)}")
14print(f"highway   ∂h_3/∂h_1 (diag) ≈ {dh3_dh1_highway.round(4)}")
15
16# ---- Finite-difference verification ----
17# Rerun the full GRU forward pass from §17.2 (same weights) but perturb
18# each channel of h_0 by ε and observe the effect on h_3.  The ratio
19# (Δh_3 / Δh_0) is the numerical Jacobian diagonal — agreeing with autograd
20# to O(ε) and with the highway term to leading order.
21def sigmoid(x):
22    return 1.0 / (1.0 + np.exp(-x))
23
24W_r = np.array([[ 0.3, -0.2,  0.4,  0.1],
25                [ 0.1,  0.5, -0.3,  0.2]])
26b_r = np.array([ 0.1,  0.0])
27W_z = np.array([[ 0.2,  0.3, -0.1,  0.4],
28                [-0.2,  0.1,  0.5,  0.2]])
29b_z = np.array([-0.1,  0.1])
30W_h = np.array([[ 0.1, -0.4,  0.3,  0.2],
31                [ 0.4,  0.2, -0.1,  0.5]])
32b_h = np.array([ 0.0,  0.1])
33
34def gru_run(h0):
35    seq = [np.array([0.5, -0.2]),
36           np.array([0.8,  0.3]),
37           np.array([0.1,  0.9])]
38    h = h0.copy()
39    for x in seq:
40        z_in    = np.concatenate([h, x])
41        r       = sigmoid(W_r @ z_in + b_r)
42        z       = sigmoid(W_z @ z_in + b_z)
43        cand_in = np.concatenate([r * h, x])
44        h_tilde = np.tanh(W_h @ cand_in + b_h)
45        h       = (1 - z) * h + z * h_tilde
46    return h
47
48eps   = 1e-6
49h0    = np.zeros(2)
50h3    = gru_run(h0)
51jacob = np.zeros((2, 2))
52for j in range(2):                          # perturb channel j
53    h0_plus      = h0.copy(); h0_plus[j]  += eps
54    h3_plus      = gru_run(h0_plus)
55    jacob[:, j]  = (h3_plus - h3) / eps     # numerical ∂h_3/∂h_0[:, j]
56
57print(f"finite-diff ∂h_3/∂h_0 =\n{jacob.round(4)}")
58print(f"diag of FD Jacobian   = {np.diag(jacob).round(4)}")
59print(f"highway approximation = {dh3_dh1_highway.round(4)}")

The diagonal of the finite-difference Jacobian is $[0.2283, 0.1642]$ ; the highway term approximation is $[0.2172, 0.1554]$ . The ~5% gap is the reset-gate and candidate-tanh contribution autograd would also include. The leading story — an additive identity path that keeps gradients from vanishing — is the highway term.

When to Pick GRU Over LSTM

The empirical literature (Chung et al., 2014; Jozefowicz et al., 2015; Greff et al., 2017) has not produced a clean winner. On many language, speech, and music tasks the two perform within noise of each other. Greff and colleagues ran the definitive ablation — eight LSTM variants (with and without the forget gate, peephole connections, coupled input-forget gate, and others) on 5200 runs covering speech, handwriting, and polyphonic music — and found that none of the mutations beat the vanilla LSTM by a meaningful margin, and the most impactful single change was the forget-bias initialization trick from §17.1. The choice between GRU and LSTM usually rests on three practical factors instead.

Parameter budget and training speed

With hidden size $n$ and input size $m$ :

GRU: $3 \cdot (n \cdot (n + m) + n)$ parameters.
LSTM: $4 \cdot (n \cdot (n + m) + n)$ parameters.

A GRU uses exactly $75\%$ of the LSTM's parameters for the same hidden size. On a constrained memory or latency budget — edge devices, mobile, real-time ASR — GRU lets you either run faster or spend the savings on a larger hidden state.

Dataset size

With fewer parameters to fit, a GRU tends to generalize slightly better on small datasets and overfit more gracefully. On very large datasets (hundreds of millions of tokens) the LSTM's extra flexibility sometimes pays off, especially at larger hidden sizes.

Task structure

Very long sequences with sparse dependencies (e.g., long-document summarization): an LSTM's decoupled forget + input gates sometimes model “remember this AND record that” more precisely.
Moderate sequences, streaming inference (speech, real-time translation): GRUs win on latency and usually match accuracy.
Encoder in an encoder-decoder: GRU is the modern default in many seq2seq implementations before the rise of Transformers.

Practical starter rule. If you are unsure, start with a GRU, measure, and only switch to an LSTM if the validation curve suggests the model is under-expressive (training loss plateaus well above what a bigger GRU also cannot beat). That way you pay for extra gates only when there is evidence the task needs them.

Check Your Understanding

Eight questions covering the two gates, the identity path, parameter count, and the practical trade-offs against the LSTM. Each explanation ties the answer back to the equation it came from.

Loading quiz…

Summary

One state, two gates. The GRU carries a single vector $h_t$ through time and uses two gates — reset $r_t$ and update $z_t$ — to control it.
The update gate couples forget and input. The rule $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ forces keep and write to sum to 1 per channel. One knob, no slack.
The reset gate only shapes the candidate. Inside $\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$ , the reset gate decides how much past context the proposal sees. It never touches the identity path.
The identity path preserves gradients. $\partial h_t/\partial h_{t-1}$ contains a $\text{diag}(1 - z_t)$ term — a scaled identity that carries gradients through time when the gate chooses to keep memory.
Three matrices, not four. A GRU uses $W_r, W_z, W_h$ — 75% of the LSTM's parameter count for the same hidden size. No cell state means no output gate.
Empirically on par. GRU and LSTM perform comparably on most benchmarks. GRU wins on training speed and small datasets; LSTM sometimes wins at very large scales.
Same mathematical trick as the LSTM. Both solve vanishing gradients with an additive recurrence — the GRU puts the addition directly on $h$ ; the LSTM puts it on $C$ . The rest is a matter of how aggressively you simplify.

The core idea in one line: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ — a soft slider between keeping memory and writing new content, and that slider is all the GRU needs.

With both cells understood, the next section — §17.3 Implementing Sequence Models in PyTorch — turns the equations into production-grade code. You will build an LSTM from scratch in NumPy, rebuild it in PyTorch, verify the two agree to machine precision, and wire up the full embedding → packed BiLSTM → linear sentiment classifier that was the 2016 workhorse of applied NLP.

References

Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014 / arXiv:1406.1078. [Introduces the GRU.]
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS 2014 Deep Learning Workshop / arXiv:1412.3555. [Compares GRU and LSTM across several sequence tasks; found them roughly comparable, with GRU slightly preferred on polyphonic music.]
Jozefowicz, R., Zaremba, W. & Sutskever, I. (2015). An Empirical Exploration of Recurrent Network Architectures. ICML 2015. [Large-scale search over gated cell variants; no mutation consistently beat both GRU and LSTM.]
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE TNNLS 28(10), 2222–2232 / arXiv:1503.04069. [Eight LSTM variants, none dominating the vanilla architecture; the Coupled Input-Forget Gate variant is effectively the GRU's 1-z/z idea grafted onto an LSTM.]
Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8), 1735–1780. [Original LSTM paper — provides the vanishing-gradient analysis that both LSTMs and GRUs address.]