Why the GRU Exists
The LSTM was a breakthrough: finally, a recurrent cell that could remember across many timesteps. But by 2014, researchers had noticed something curious: the LSTM has four sub-networks (forget, input, candidate, output) operating on two state vectors ($c_t$ and $h_t$). Was all that machinery really necessary? Or could a simpler design solve the same vanishing-gradient problem with fewer moving parts?
In Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Kyunghyun Cho and colleagues proposed the Gated Recurrent Unit. It makes two ruthless simplifications. First, the cell state and hidden state are merged into a single vector $h_t$. Second, the forget and input gates are coupled into a single update gate $z_t$ whose keep and write fractions, $1 - z_t$ and $z_t$, sum to 1 by construction. A separate reset gate $r_t$ remains, but it plays a very narrow role: it only affects the candidate, not the memory highway.
The core bet: we don't need two independent gates for forgetting and writing. If we keep $1 - z$ of the old memory, we have exactly $z$ of budget left for new content. One knob, two effects, and little loss of expressivity on most real tasks.
The result is a cell with three weight matrices instead of four, a single state vector, and — empirically — performance that tracks the LSTM closely on language modelling, machine translation, and music generation while training faster. For many practitioners the GRU is the default; for others the LSTM's extra flexibility earns its keep. Understanding the GRU's minimalism is the best way to appreciate both.
One State Vector, Two Gates
An LSTM carries two vectors through time: the cell state $c_t$ (internal memory) and the hidden state $h_t$ (external output). A GRU carries only $h_t$. That single vector plays both roles: it is the memory that persists and the signal that flows to the next layer. This is not just notation; it is a genuine reduction in the cell's internal degrees of freedom.
The three learned linear layers are:
| Component | Formula | Role |
|---|---|---|
| Reset gate | r_t = σ(W_r · [h_{t-1}, x_t] + b_r) | How much past memory to let into the candidate |
| Update gate | z_t = σ(W_z · [h_{t-1}, x_t] + b_z) | How much of the candidate to write into the new memory |
| Candidate | h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h) | The proposed new content — signed (-1, 1) |
And the single update rule that glues them together:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
The Update Gate: One Knob Replaces Two
The update gate $z_t$ is a per-channel sigmoid. Its value $z_{t,c}$ at channel $c$ answers a single question: what fraction of the new candidate do I write into channel $c$ of memory? The remaining fraction $1 - z_{t,c}$ automatically goes to keeping the old value. Three reference settings show the gate's range:
| Setting | Meaning | Effect on h_t per channel |
|---|---|---|
| z_c ≈ 0 | Keep everything | h_t,c ≈ h_{t-1},c — memory is frozen this step |
| z_c ≈ 1 | Overwrite everything | h_t,c ≈ h̃_t,c — new candidate replaces memory |
| z_c ≈ 0.5 | Balance | h_t,c ≈ half past + half candidate |
Compare this to the LSTM, where the forget gate $f_t$ and input gate $i_t$ are two independent sigmoids. An LSTM can set $f_t \approx 1$ and $i_t \approx 1$, simultaneously keeping a lot of the old memory AND writing a lot of new content (letting the cell state grow in magnitude). A GRU cannot express that combination. If it keeps 90% of the past, it writes exactly 10% of the candidate.
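The three gate settings in the table can be checked numerically. The sketch below is illustrative only: the function name `mix` and the values `0.8` and `0.6` are assumptions chosen for the example, not numbers from this section. The last line shows the LSTM-style combination (keep everything and write everything), which lands outside the range any single update-gate value can reach.

```python
import numpy as np

def mix(z, h_prev, h_cand):
    """GRU final mix: keep (1 - z) of the old state, write z of the candidate."""
    return (1.0 - z) * h_prev + z * h_cand

# Illustrative values (assumed): the same old memory and candidate in each channel.
h_prev = np.array([0.8, 0.8, 0.8])
h_cand = np.array([0.6, 0.6, 0.6])
z = np.array([0.0, 0.5, 1.0])  # one gate setting per channel

h_new = mix(z, h_prev, h_cand)
print(h_new)  # channel 0 keeps 0.8, channel 1 lands halfway, channel 2 takes 0.6

# LSTM-style decoupled gates with f = 1 and i = 1: the state grows past both inputs.
lstm_like = 1.0 * h_prev + 1.0 * h_cand
print(lstm_like)
```

Every GRU output stays between the old value and the candidate, which is what "convex combination" means here; the decoupled version does not have that constraint.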
The Reset Gate: Forgetting Just for the Proposal
The reset gate is the more subtle of the two gates, and the one most beginners misunderstand. Its equation looks almost identical to the update gate's:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$
But its role in the network is completely different. The reset gate does not appear in the final mixing step. It appears only inside the candidate:
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$
Look carefully at what $r_t$ multiplies: $h_{t-1}$, but only for the purpose of building the proposal. When $r_{t,c} \approx 0$ on some channel $c$, that channel of the past is hidden from the candidate, so as far as that channel is concerned the proposal is built from $x_t$ alone. When $r_{t,c} \approx 1$, the past participates fully.
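A small numerical check makes the reset gate's narrow role concrete. This is a sketch under stated assumptions: `W_h` and `b_h` are borrowed from the worked example later in this section, while `h_prev`, `x`, and the helper name `candidate` are made up for illustration. With the gate fully closed, the candidate is exactly what you would get from the input columns of `W_h` alone.

```python
import numpy as np

# Candidate weights from the worked example; columns 0-1 act on h, columns 2-3 on x.
W_h = np.array([[0.1, -0.4, 0.3, 0.2],
                [0.4, 0.2, -0.1, 0.5]])
b_h = np.array([0.0, 0.1])

def candidate(r, h_prev, x):
    """Candidate state: tanh(W_h [r * h_prev, x] + b_h)."""
    return np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)

h_prev = np.array([0.9, -0.7])  # assumed past state, for illustration only
x = np.array([0.8, 0.3])

closed = candidate(np.zeros(2), h_prev, x)   # r = 0: past hidden from the proposal
x_only = np.tanh(W_h[:, 2:] @ x + b_h)       # proposal built from x alone
opened = candidate(np.ones(2), h_prev, x)    # r = 1: past participates fully

print(closed, x_only, opened)
```

Note that `h_prev` itself is never modified: the reset gate changes only what the proposal sees, not what the final mix keeps.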
The Candidate Hidden State
With the reset gate in hand we can build $\tilde{h}_t$:
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$
Three things are worth pausing on. First, the activation is $\tanh$, not $\sigma$. Gates are soft switches in $(0, 1)$, so they should never flip signs. The candidate is content, and content must be able to cancel or flip previous content, so it lives in $(-1, 1)$.
Second, the candidate's input is $[r_t \odot h_{t-1}, x_t]$: reset-masked past concatenated with the current input. The update gate saw raw $h_{t-1}$; the candidate sees a possibly-silenced version. This asymmetry is a deliberate design choice that makes the reset gate's role well-defined: shape the proposal, not the keep-fraction.
Third, $\tilde{h}_t$ is just a proposal. It does not become the new hidden state on its own. The update gate decides, in the final mixing step, how much of it to actually commit.
The Final Mix: Convex Combination in Time
Every piece has been assembled; the final step is mechanical:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
Per channel this is a convex combination: a weighted average whose weights sum to 1. Geometrically, $h_{t,c}$ lies on the line segment between the old memory $h_{t-1,c}$ and the new candidate $\tilde{h}_{t,c}$, at the position dictated by $z_{t,c}$.
Why is this equation the heart of the GRU? Because of what it says about gradients. Taking a derivative of $h_t$ with respect to $h_{t-1}$ (treating gates as constants for the moment):
$$\frac{\partial h_t}{\partial h_{t-1}} \approx \mathrm{diag}(1 - z_t) + \mathrm{diag}(z_t)\,\frac{\partial \tilde{h}_t}{\partial h_{t-1}}$$
The first term is a per-channel scaled identity matrix. When $z_t \approx 0$ (keep memory), the gradient flows backward essentially unchanged: no repeated multiplication by small numbers, no vanishing. This is exactly the same mechanism that makes the LSTM's cell state a gradient highway, accomplished with one fewer gate and one fewer state vector.
The unifying picture: LSTM and GRU both solve the vanishing-gradient problem by inserting an additive identity path into the recurrence. The LSTM's path runs on $c_t$; the GRU's runs on $h_t$ itself. Once you see that, the surface differences (number of gates, number of states) become a matter of taste and parameter count.
Interactive: GRU Architecture
Click through the four stages — reset gate, candidate, update gate, final mix — and watch each block of the cell light up. The dashed orange line is the identity path that lets gradients travel through time unchanged.
Interactive: The Gates at Work
The explorer uses scalar versions of $r$, $z$, and $\tilde{h}$ so you can watch all three values respond to the two sliders in real time. In a full network each gate is a vector and each channel acts independently, but the qualitative behavior is exactly what the scalar case shows.
Interactive: GRU vs LSTM Side by Side
The following visualization puts the two architectures next to each other. You can slide the input and hidden dimensions and watch the parameter count update. The GRU consistently uses 75% of the LSTM's parameters because it has three gate/candidate matrices instead of four, and no separate cell state.
A Full Forward Pass with Real Numbers
Before we look at code, let us trace one GRU cell by hand on the three-token toy sequence “I love it”. We use the same two-dimensional toy embeddings as the LSTM section — this makes every matmul a 2×4 matrix times a (4,) vector, small enough to write out.
Weights
| Matrix | Values | Bias |
|---|---|---|
| W_r (2×4) | [[ 0.3, -0.2, 0.4, 0.1], [ 0.1, 0.5, -0.3, 0.2]] | b_r = [0.1, 0.0] |
| W_z (2×4) | [[ 0.2, 0.3, -0.1, 0.4], [-0.2, 0.1, 0.5, 0.2]] | b_z = [-0.1, 0.1] |
| W_h (2×4) | [[ 0.1, -0.4, 0.3, 0.2], [ 0.4, 0.2, -0.1, 0.5]] | b_h = [0.0, 0.1] |
Step 1 — token “I”, x = [0.5, −0.2]
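Concatenate: with $h_0 = [0, 0]$, $[h_0, x_1] = [0, 0, 0.5, -0.2]$. Compute the reset pre-activation row by row (the two $h_0$ terms are zero, so only the last two columns contribute):
- Row 0 of $W_r$: $0.4(0.5) + 0.1(-0.2) = 0.18$; add bias 0.1 → $0.28$.
- Row 1: $-0.3(0.5) + 0.2(-0.2) = -0.19$; add bias 0.0 → $-0.19$.
- $r_1 = \sigma([0.28, -0.19]) \approx [0.5695, 0.4526]$.
Update gate, same drill but with $W_z$ and $b_z$: the pre-activations are $[-0.13 - 0.1,\ 0.21 + 0.1] = [-0.23, 0.31]$, so $z_1 = \sigma([-0.23, 0.31]) \approx [0.4428, 0.5769]$.
Candidate: because $h_0 = 0$, the reset-masked past is $r_1 \odot h_0 = [0, 0]$, so the candidate input is again $[0, 0, 0.5, -0.2]$.
- Row 0 of $W_h$: $0.3(0.5) + 0.2(-0.2) + 0.0 = 0.11$.
- Row 1: $-0.1(0.5) + 0.5(-0.2) + 0.1 = -0.05$.
- $\tilde{h}_1 = \tanh([0.11, -0.05]) \approx [0.1096, -0.0500]$.
Final mix: $h_1 = (1 - z_1) \odot h_0 + z_1 \odot \tilde{h}_1 \approx [0.4428 \cdot 0.1096,\ 0.5769 \cdot (-0.0500)] \approx [0.0485, -0.0288]$.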
Step 2 — token “love”, x = [0.8, 0.3]
Now the reset gate has real memory to act on. After the arithmetic (exactly the same recipe, this time with $h_1 \approx [0.0485, -0.0288]$):
| Quantity | Value |
|---|---|
| r_2 | [0.6155, 0.4528] |
| z_2 | [0.4853, 0.6335] |
| r_2 ⊙ h_1 | [0.0299, -0.0130] |
| h̃_2 = tanh(W_h [r⊙h, x] + b) | [0.2988, 0.1774] |
| (1 − z_2) ⊙ h_1 | [0.0250, -0.0106] |
| z_2 ⊙ h̃_2 | [0.1450, 0.1124] |
| h_2 (sum) | [0.1700, 0.1018] |
Step 3 — token “it”, x = [0.1, 0.9]
| Quantity | Value |
|---|---|
| r_3 | [0.5648, 0.5543] |
| z_3 | [0.5780, 0.5760] |
| r_3 ⊙ h_2 | [0.0960, 0.0565] |
| h̃_3 | [0.1945, 0.5297] |
| (1 − z_3) ⊙ h_2 | [0.0717, 0.0432] |
| z_3 ⊙ h̃_3 | [0.1124, 0.3051] |
| h_3 (sum) | [0.1842, 0.3483] |
Compare this to the LSTM section's worked example. There, the cell state $c_t$ grew in magnitude from step 1 to step 3 (≈0.21 → 0.35) while the hidden state $h_t$ stayed small. Here, the largest component of the single vector $h_t$ grows from ≈0.05 to ≈0.35 across three steps. The GRU does not split memory from output, so you see the accumulation directly in the exposed vector.
GRU From Scratch in Python
Every line of the forward pass you just traced by hand shows up as a line of NumPy below. This is the same cell you just computed; running it reproduces the hand-traced numbers to the four decimal places shown.
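A minimal from-scratch sketch of that cell follows. The weights, biases, and inputs are the ones from the tables above; the function and variable names (`sigmoid`, `gru_step`, `history`) are choices made for this sketch rather than names fixed by the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hand-picked weights from the worked example. Each matrix is (hidden=2, hidden+input=4).
W_r = np.array([[0.3, -0.2, 0.4, 0.1], [0.1, 0.5, -0.3, 0.2]])
b_r = np.array([0.1, 0.0])
W_z = np.array([[0.2, 0.3, -0.1, 0.4], [-0.2, 0.1, 0.5, 0.2]])
b_z = np.array([-0.1, 0.1])
W_h = np.array([[0.1, -0.4, 0.3, 0.2], [0.4, 0.2, -0.1, 0.5]])
b_h = np.array([0.0, 0.1])

def gru_step(h_prev, x):
    """One GRU step using the convention h_t = (1 - z) * h_prev + z * candidate."""
    hx = np.concatenate([h_prev, x])        # [h_{t-1}, x_t]
    r = sigmoid(W_r @ hx + b_r)             # reset gate
    z = sigmoid(W_z @ hx + b_z)             # update gate
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)  # candidate
    return (1.0 - z) * h_prev + z * h_cand  # convex mix

tokens = ["I", "love", "it"]
xs = np.array([[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]])

h = np.zeros(2)  # h_0
history = []
for step, (token, x) in enumerate(zip(tokens, xs), start=1):
    h = gru_step(h, x)
    history.append(h.copy())
    print(step, token, np.round(h, 4))
```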
Running the code prints:
| Step | Token | h_t |
|---|---|---|
| 1 | I | [ 0.0485, -0.0288] |
| 2 | love | [ 0.1700, 0.1018] |
| 3 | it | [ 0.1842, 0.3483] |
Notice that there is no separate $c_t$ column. The GRU only has one state, and what you see is what downstream layers receive.
The Same GRU in PyTorch
PyTorch ships two ways to use a GRU: nn.GRUCell, which mirrors our NumPy loop exactly (one step at a time, you manage the recurrence in Python), and nn.GRU, which handles the whole sequence in a single optimized call. Both use the same underlying arithmetic; they differ in where the time-loop lives.
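The sketch below shows both entry points on the same three-token input. It assumes PyTorch is installed; the seed and the step of copying the cell's weights into the `nn.GRU` layer are additions made here so that the two outputs can be compared directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # random init, but reproducible

cell = nn.GRUCell(input_size=2, hidden_size=2)
gru = nn.GRU(input_size=2, hidden_size=2, batch_first=True)

# Share parameters so the two modules compute the same function.
with torch.no_grad():
    gru.weight_ih_l0.copy_(cell.weight_ih)
    gru.weight_hh_l0.copy_(cell.weight_hh)
    gru.bias_ih_l0.copy_(cell.bias_ih)
    gru.bias_hh_l0.copy_(cell.bias_hh)

# One sequence of three 2-d tokens: shape (batch=1, time=3, features=2).
xs = torch.tensor([[[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]]])

# nn.GRUCell: the time loop lives in Python.
h = torch.zeros(1, 2)
for t in range(xs.shape[1]):
    h = cell(xs[:, t, :], h)

# nn.GRU: the whole sequence in one call.
out, h_n = gru(xs)  # out: (1, 3, 2), h_n: (num_layers=1, batch=1, hidden=2)

print(h)
print(h_n[0])
print(torch.allclose(h, h_n[0], atol=1e-5))
```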
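The printed numbers will differ from the NumPy version because PyTorch uses random weight initialization instead of our hand-picked matrices. The mechanism is the same (three linear projections, two sigmoid gates, one tanh candidate, one convex mix), with two differences in convention. PyTorch's documented update is $h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1}$, so its $z_t$ is the keep fraction rather than the write fraction. It also applies the reset gate after the hidden-state projection, $n_t = \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn}))$, rather than before it.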
| Aspect | NumPy from scratch | nn.GRUCell | nn.GRU |
|---|---|---|---|
| Weights | You provide | Random init | Random init |
| Gates | Three separate matmuls | One fused matmul | One fused matmul, batched over time |
| Loop over time | Python for-loop | Python for-loop | Optimized C++/CUDA kernel |
| Batching | One sequence | Batch-aware | Batch + sequence in one tensor |
| GPU | None | .to('cuda') | .to('cuda') |
| When to use | Learning / debugging | Custom step logic | Standard training |
Why GRUs Still Solve Vanishing Gradients
The canonical objection to vanilla RNNs is that gradients vanish when backpropagated through many timesteps because every step multiplies by the same small Jacobian. The LSTM escapes this via an additive cell state whose recurrence has an identity term. Does the GRU — with its single state — really preserve the same property?
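Yes. Differentiate the update rule:
$$\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}(1 - z_t) + \mathrm{diag}(z_t)\,\frac{\partial \tilde{h}_t}{\partial h_{t-1}} + \mathrm{diag}(\tilde{h}_t - h_{t-1})\,\frac{\partial z_t}{\partial h_{t-1}}$$
The first term, $\mathrm{diag}(1 - z_t)$, is a diagonal matrix whose entries are the keep-fractions. When the update gate chooses to preserve channel $c$, that entry is close to 1, so the gradient of $h_{t,c}$ with respect to $h_{t-1,c}$ is close to 1, with no shrinkage. Over $T$ steps the backward path is a product of such near-identity factors, which does not collapse toward zero as long as the gate keeps choosing to preserve the channel.
The additional terms come from gradients flowing through the gates and the candidate tanh. These paths do decay, but they are not the only path back. The identity path makes a non-vanishing gradient available whenever the update gate keeps memory, and that is enough to let learned weights find long-range dependencies.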
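To put rough numbers on the contrast, the sketch below compares how much gradient survives along the identity path with how much survives repeated multiplication by a small factor. Both per-step values (`keep` and `rnn_scale`) are assumptions picked for illustration; they are scalar stand-ins, not measurements of any trained network.

```python
import numpy as np

T = 100          # number of timesteps to backpropagate through
keep = 0.98      # assumed keep-fraction 1 - z on one GRU channel
rnn_scale = 0.5  # assumed per-step Jacobian magnitude of a vanilla RNN

steps = np.arange(1, T + 1)
highway = keep ** steps       # product of (1 - z) factors along the identity path
vanilla = rnn_scale ** steps  # product of small factors in a vanilla recurrence

for t in (1, 10, 50, 100):
    print(f"T={t:3d}  highway={highway[t - 1]:.3e}  vanilla={vanilla[t - 1]:.3e}")
```

Under these assumptions the highway term is still above 0.1 after 100 steps, while the vanilla product is below 1e-20.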
Interactive: Gradient Flow (RNN vs LSTM vs GRU)
The three curves below are the same measurement run through each cell type. Slide the keep-fraction control: it is $f$ for the LSTM and $1 - z$ for the GRU. Both gated cells stay near 1; only the vanilla RNN collapses.
Numerical Gradient Check
To make the argument concrete, compare the highway approximation to the numerical Jacobian of the three-step GRU from §17.2's worked example. We use finite differences with a small step; autograd would give the same answer, up to finite-difference truncation error, for free.
The diagonal of the finite-difference Jacobian is ; the highway term approximation is . The ~5% gap is the reset-gate and candidate-tanh contribution autograd would also include. The leading story — an additive identity path that keeps gradients from vanishing — is the highway term.
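One way such a check could be set up is sketched below. It makes several assumptions the text does not pin down: the Jacobian measured is $\partial h_3 / \partial h_0$, central differences are used, and the step size `eps` is `1e-5`. Because of those choices, the printed figures need not match the ones quoted above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Weights and inputs from the worked example.
W_r = np.array([[0.3, -0.2, 0.4, 0.1], [0.1, 0.5, -0.3, 0.2]])
b_r = np.array([0.1, 0.0])
W_z = np.array([[0.2, 0.3, -0.1, 0.4], [-0.2, 0.1, 0.5, 0.2]])
b_z = np.array([-0.1, 0.1])
W_h = np.array([[0.1, -0.4, 0.3, 0.2], [0.4, 0.2, -0.1, 0.5]])
b_h = np.array([0.0, 0.1])
XS = np.array([[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]])

def run(h0):
    """Run the three-step GRU; return h_3 and the update gates z_1..z_3."""
    h, zs = h0, []
    for x in XS:
        hx = np.concatenate([h, x])
        r = sigmoid(W_r @ hx + b_r)
        z = sigmoid(W_z @ hx + b_z)
        cand = np.tanh(W_h @ np.concatenate([r * h, x]) + b_h)
        h = (1.0 - z) * h + z * cand
        zs.append(z)
    return h, np.array(zs)

eps = 1e-5  # assumed finite-difference step
h0 = np.zeros(2)

# Central-difference Jacobian d h_3 / d h_0, one input column at a time.
J = np.zeros((2, 2))
for j in range(2):
    d = np.zeros(2)
    d[j] = eps
    J[:, j] = (run(h0 + d)[0] - run(h0 - d)[0]) / (2 * eps)

# Highway term: per-channel product of keep-fractions (1 - z_t) over the three steps.
_, zs = run(h0)
highway = np.prod(1.0 - zs, axis=0)

print("finite-difference diagonal:", np.diag(J))
print("highway approximation:     ", highway)
```

The difference between the two printed vectors is the contribution of the gate and candidate paths that the highway term ignores.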
When to Pick GRU Over LSTM
The empirical literature (Chung et al., 2014; Jozefowicz et al., 2015; Greff et al., 2017) has not produced a clean winner. On many language, speech, and music tasks the two perform within noise of each other. Greff and colleagues ran a large-scale ablation of eight LSTM variants (with and without the forget gate, peephole connections, coupled input-forget gate, and others) across roughly 5400 runs covering speech, handwriting, and polyphonic music. They found that none of the variants beat the vanilla LSTM by a meaningful margin, and that the forget gate and the output activation function were the most critical components. Jozefowicz and colleagues, for their part, reported that the forget-bias initialization trick from §17.1 closes much of the gap between the LSTM and the GRU. The choice between GRU and LSTM usually rests on three practical factors instead.
Parameter budget and training speed
With hidden size $H$ and input size $D$:
- GRU: $3(H^2 + HD + H)$ parameters.
- LSTM: $4(H^2 + HD + H)$ parameters.
A GRU uses exactly $3/4$ of the LSTM's parameters for the same hidden and input sizes. On a constrained memory or latency budget (edge devices, mobile, real-time ASR), a GRU lets you either run faster or spend the savings on a larger hidden state.
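The two counts are easy to tabulate. The helper names `gru_params` and `lstm_params` and the example sizes are assumptions made for this sketch; the formulas count one weight matrix of shape (hidden, hidden + input) and one bias vector per gate or candidate, as in the from-scratch cell above.

```python
def gru_params(hidden, inputs):
    """Three blocks (reset, update, candidate), each a (hidden x (hidden + inputs)) matrix plus a bias."""
    return 3 * (hidden * (hidden + inputs) + hidden)

def lstm_params(hidden, inputs):
    """Four blocks (forget, input, candidate, output) of the same shape."""
    return 4 * (hidden * (hidden + inputs) + hidden)

# Example sizes: the toy cell from the worked example plus two larger, made-up settings.
for hidden, inputs in [(2, 2), (128, 64), (512, 300)]:
    g = gru_params(hidden, inputs)
    l = lstm_params(hidden, inputs)
    print(f"H={hidden:4d} D={inputs:4d}  GRU={g:8d}  LSTM={l:8d}  ratio={g / l:.2f}")
```

For the toy cell this gives 30 parameters: three 2×4 matrices and three 2-element biases. PyTorch's `nn.GRU` and `nn.LSTM` store separate input and hidden bias vectors, so their reported counts are slightly higher, but the 3:4 ratio is unchanged.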
Dataset size
With fewer parameters to fit, a GRU tends to generalize slightly better on small datasets and overfit more gracefully. On very large datasets (hundreds of millions of tokens) the LSTM's extra flexibility sometimes pays off, especially at larger hidden sizes.
Task structure
- Very long sequences with sparse dependencies (e.g., long-document summarization): an LSTM's decoupled forget + input gates sometimes model “remember this AND record that” more precisely.
- Moderate sequences, streaming inference (speech, real-time translation): GRUs win on latency and usually match accuracy.
- Encoder in an encoder-decoder: the GRU was a common default in many seq2seq implementations before the rise of Transformers.
Summary
- One state, two gates. The GRU carries a single vector $h_t$ through time and uses two gates, reset and update, to control it.
- The update gate couples forget and input. The rule $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ forces keep and write to sum to 1 per channel. One knob, no slack.
- The reset gate only shapes the candidate. Inside $\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$, the reset gate decides how much past context the proposal sees. It never touches the identity path.
- The identity path preserves gradients. $\partial h_t / \partial h_{t-1}$ contains a $\mathrm{diag}(1 - z_t)$ term: a scaled identity that carries gradients through time when the gate chooses to keep memory.
- Three matrices, not four. A GRU uses $3(H^2 + HD + H)$ parameters for hidden size $H$ and input size $D$, which is 75% of the LSTM's parameter count for the same sizes. No cell state means no output gate.
- Empirically on par. GRU and LSTM perform comparably on most benchmarks. GRU wins on training speed and small datasets; LSTM sometimes wins at very large scales.
- Same mathematical trick as the LSTM. Both solve vanishing gradients with an additive recurrence. The GRU puts the addition directly on $h_t$; the LSTM puts it on $c_t$. The rest is a matter of how aggressively you simplify.
The core idea in one line: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$. It is a soft slider between keeping memory and writing new content, and that slider is all the GRU needs.
With both cells understood, the next section — §17.3 Implementing Sequence Models in PyTorch — turns the equations into production-grade code. You will build an LSTM from scratch in NumPy, rebuild it in PyTorch, verify the two agree to machine precision, and wire up the full embedding → packed BiLSTM → linear sentiment classifier that was the 2016 workhorse of applied NLP.
References
- Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014 / arXiv:1406.1078. [Introduces the GRU.]
- Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS 2014 Deep Learning Workshop / arXiv:1412.3555. [Compares GRU and LSTM across several sequence tasks; found them roughly comparable, with GRU slightly preferred on polyphonic music.]
- Jozefowicz, R., Zaremba, W. & Sutskever, I. (2015). An Empirical Exploration of Recurrent Network Architectures. ICML 2015. [Large-scale search over gated cell variants; no mutation consistently beat both GRU and LSTM.]
- Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE TNNLS 28(10), 2222–2232 / arXiv:1503.04069. [Eight LSTM variants, none dominating the vanilla architecture; the Coupled Input-Forget Gate variant is effectively the GRU's 1-z/z idea grafted onto an LSTM.]
- Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8), 1735–1780. [Original LSTM paper — provides the vanishing-gradient analysis that both LSTMs and GRUs address.]