Chapter 16

Sequential Data and RNNs

Recurrent Neural Networks

Why Sequences Need a New Architecture

Every neural network we have built so far — perceptrons, MLPs, and CNNs — has a fundamental limitation: they process each input independently. Feed the same image to a CNN twice, and you get the same output both times. There is no memory of what came before.

This works perfectly for images. A photo of a cat is a photo of a cat, regardless of which photo you looked at previously. But language, music, stock prices, and biological signals are fundamentally different: the meaning of each element depends on what came before it.

Consider the word “bank” in these two sentences:

  • “The river bank is steep” — bank means the edge of a river
  • “The savings bank is closed” — bank means a financial institution

A feedforward network that sees only the word “bank” has no way to distinguish these two meanings. It maps the same input to the same output every time. To resolve the ambiguity, the network needs to remember the words that came before — it needs memory.

Key Insight: Sequential data has a property that images do not: the order of elements carries information. A network that processes sequences must somehow carry information forward from earlier elements to later ones.

What Makes Data Sequential

Data is sequential when its elements have a meaningful order, and changing that order changes the meaning. Three properties define sequential data:

  1. Order dependence — Rearranging elements changes the meaning. “Dog bites man” and “Man bites dog” use the same words but mean very different things.
  2. Variable length — Sequences can have different lengths. A tweet has 280 characters; a novel has millions. The network must handle both.
  3. Long-range dependencies — Elements far apart can influence each other. In “The cat, which sat on the mat, was happy,” the verb “was” (singular) depends on “cat” (singular), not the closer “mat.”

| Domain | Input Sequence | What Order Encodes |
|---|---|---|
| Natural Language | Words in a sentence | Grammar, meaning, intent |
| Time Series | Stock prices over days | Temporal trends, seasonality |
| Audio/Speech | Sound waveform samples | Phonemes, words, intonation |
| DNA/Protein | Nucleotide/amino acid chains | Gene structure, protein folding |
| Music | Notes over time | Melody, rhythm, harmony |
| Video | Frames over time | Motion, scene transitions |
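
Order dependence can be made concrete with a small sketch. The helper name `bag_of_words` is hypothetical and exists only for this illustration: it discards word order, so the two "dog/man" sentences from the list above become indistinguishable, while the ordered word lists remain different.

```python
from collections import Counter

def bag_of_words(sentence):
    """Order-insensitive view of a sentence: word -> count."""
    return Counter(sentence.lower().split())

a = "Dog bites man"
b = "Man bites dog"

# Same multiset of words, so an order-free representation cannot tell them apart.
same_bag = bag_of_words(a) == bag_of_words(b)

# Keeping the order preserves the difference in meaning.
same_sequence = a.lower().split() == b.lower().split()

print(same_bag)       # True
print(same_sequence)  # False
```

Any model that only sees the bag of words has already lost the information that separates the two sentences, which is why a sequence model has to consume elements in order.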

The fundamental challenge is this: a feedforward network takes a fixed-size input and produces a fixed-size output. But sequences have variable length. A 4-word sentence and a 40-word sentence must both go through the same network. We need an architecture that can process any number of elements, one at a time, while maintaining a running summary of everything it has seen.


The Core Idea: Memory Through Recurrence

The solution is surprisingly elegant: give the network a hidden state — a vector that persists across timesteps. At each timestep, the network reads the current input and the previous hidden state, then produces a new hidden state. This new state becomes the input for the next timestep.

This is recurrence: the output of one step feeds back as the input to the next step. The hidden state acts as the network's memory, carrying a compressed summary of everything it has seen so far.
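
Stripped of any learning, the recurrence pattern is just a loop that threads a state through the inputs. The sketch below is an illustration under assumptions: `run_recurrence` and `ema_step` are hypothetical names, and the hand-written exponential moving average stands in for the learned update that an RNN would use.

```python
def run_recurrence(step, inputs, h0):
    """Apply `step` to each input in order, threading the state through."""
    h = h0
    history = []
    for x in inputs:
        h = step(x, h)   # new state depends on the current input AND the old state
        history.append(h)
    return history

def ema_step(x, h_prev, alpha=0.5):
    # Hand-written update (not learned): blend the new input with the old state.
    return alpha * x + (1 - alpha) * h_prev

states = run_recurrence(ema_step, [4.0, 8.0, 2.0], h0=0.0)
print(states)  # [2.0, 5.0, 3.5]
```

The same `run_recurrence` works for an input list of any length, and the final state is a lossy summary of everything seen so far rather than a copy of it.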

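Figure: RNN unrolled through time. Each hidden state $h_t$ depends on the previous state $h_{t-1}$. At each of the four timesteps shown, the input $x_t$ feeds the hidden state $h_t$, which produces the output $y_t$; arrows between successive hidden states mark the sequential flow.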

The diagram above shows an RNN unrolled through time. Though it looks like four separate networks, it is actually the same network applied repeatedly. The same weight matrices are used at every timestep — this is called weight sharing.

The recurrence relation

At each timestep $t$, the RNN performs two operations:

  1. Combine the current input $x_t$ with the previous hidden state $h_{t-1}$ using learned weight matrices
  2. Squash the result through a nonlinearity (tanh) to produce the new hidden state $h_t$

Written as a single equation:

$$h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h)$$

And optionally, at each step (or just the final step), we can produce an output:

$$y_t = W_{hy} \cdot h_t + b_y$$

Let's break down every symbol:

| Symbol | Shape | Meaning |
|---|---|---|
| $x_t$ | (input_size,) | Input vector at timestep t (e.g., a word embedding) |
| $h_{t-1}$ | (hidden_size,) | Hidden state from the previous timestep — the "memory" |
| $h_t$ | (hidden_size,) | New hidden state after processing $x_t$ |
| $W_{xh}$ | (hidden_size, input_size) | Weight matrix transforming input to hidden space |
| $W_{hh}$ | (hidden_size, hidden_size) | Weight matrix transforming previous hidden state (the recurrence) |
| $b_h$ | (hidden_size,) | Bias vector for the hidden state |
| $W_{hy}$ | (output_size, hidden_size) | Weight matrix projecting hidden state to output |
| $b_y$ | (output_size,) | Bias vector for the output |
| tanh | element-wise | Squashes values to (-1, +1), preventing unbounded growth |

Why tanh and not ReLU? The hidden state serves double duty — it is both the output and the recurrent input. ReLU is unbounded above, so repeatedly multiplying by $W_{hh}$ could cause values to explode. tanh bounds the output to $(-1, +1)$, which keeps the hidden state stable across many timesteps.
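
The boundedness argument can be checked with a deliberately contrived sketch. The recurrence matrix below (1.5 times the identity) and the starting state are assumptions chosen to make growth obvious; they are not weights from this chapter, and the input term is omitted to isolate the effect of repeated multiplication by $W_{hh}$.

```python
import numpy as np

# Assumed recurrence matrix for this sketch: scales every unit by 1.5 per step.
W_hh = 1.5 * np.eye(3)
h_tanh = np.full(3, 0.5)
h_relu = np.full(3, 0.5)

for _ in range(20):
    h_tanh = np.tanh(W_hh @ h_tanh)          # squashed back into (-1, 1) every step
    h_relu = np.maximum(0.0, W_hh @ h_relu)  # ReLU: no upper bound

print(np.abs(h_tanh).max() < 1.0)  # True
print(h_relu.max() > 1000.0)       # True: grows like 0.5 * 1.5**20
```

With a smaller recurrence matrix the ReLU state would not blow up, so this shows a failure mode that tanh rules out by construction, not something that always happens.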

Weight sharing: the power of recurrence

A critical feature of RNNs is that $W_{xh}$, $W_{hh}$, and $b_h$ are the same at every timestep. Whether the network is processing the first word or the hundredth, it uses identical weights. This has three benefits:

  1. Parameter efficiency — A 100-word sequence uses the same number of parameters as a 5-word sequence. The model size does not grow with sequence length.
  2. Generalization — Patterns learned at one position transfer to others. If the network learns that “not” negates the next word, it applies this everywhere.
  3. Variable length handling — The same model processes sequences of any length. Simply run more iterations of the recurrence.
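
The three benefits above can be illustrated with a minimal sketch. The weights are random placeholders (the seed and scale are arbitrary assumptions) and `final_hidden` is a hypothetical helper name; the point is only that the parameter count is fixed while the sequence length varies.

```python
import numpy as np

input_size, hidden_size = 2, 3
rng = np.random.default_rng(0)   # arbitrary seed; the weights are placeholders
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def final_hidden(X):
    """Run the recurrence over a (seq_len, input_size) array, reusing the same weights."""
    h = np.zeros(hidden_size)
    for x_t in X:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

# Parameter count depends only on the layer sizes: 6 + 9 + 3 = 18.
n_params = W_xh.size + W_hh.size + b_h.size

short_seq = rng.normal(size=(5, input_size))
long_seq = rng.normal(size=(100, input_size))
print(n_params, final_hidden(short_seq).shape, final_hidden(long_seq).shape)
```

Both calls use the same 18 numbers and both return a vector of length 3; only the number of loop iterations differs.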

Inside the RNN Cell: The Mathematics

Let's build intuition for what happens inside the RNN cell at a single timestep. We will use concrete numbers to trace through every operation.

The two matrix multiplications

The RNN cell receives two inputs: the current word embedding $x_t$ and the previous hidden state $h_{t-1}$. It applies two separate weight matrices:

  • $W_{xh} \cdot x_t$ — transforms the current input into the hidden space. This answers: “What does this word contribute to the memory?”
  • $W_{hh} \cdot h_{t-1}$ — transforms the previous memory. This answers: “What should the network remember from before?”

These two contributions are added together, and then tanh squashes the sum into the range $(-1, +1)$. The intuition is:

The new hidden state is a blend of what the network just saw and what it remembers. The weight matrices control the balance. Large values in $W_{hh}$ mean the network holds onto old information; large values in $W_{xh}$ mean the network pays more attention to the new input.
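
One extreme of that balance is easy to test: if $W_{hh}$ is all zeros, the memory term vanishes and the final hidden state depends only on the last input. The sketch below uses small illustrative weights and two-dimensional input vectors that are assumptions for this example; `final_hidden` is a hypothetical helper.

```python
import numpy as np

# Illustrative weights, assumed for this sketch.
W_xh = np.array([[0.5, -0.3], [0.2, 0.7], [-0.1, 0.4]])
W_hh = np.array([[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5], [0.3, -0.1, 0.1]])

def final_hidden(X, W_rec):
    """Final hidden state after running the recurrence with recurrence matrix W_rec."""
    h = np.zeros(3)
    for x_t in X:
        h = np.tanh(W_xh @ x_t + W_rec @ h)
    return h

with_context = np.array([[1.0, 0.0], [0.9, 0.1]])  # two inputs
alone = np.array([[0.9, 0.1]])                     # only the last input

no_memory = np.zeros((3, 3))
forgets = bool(np.allclose(final_hidden(with_context, no_memory),
                           final_hidden(alone, no_memory)))
remembers = not np.allclose(final_hidden(with_context, W_hh),
                            final_hidden(alone, W_hh))
print(forgets, remembers)  # True True
```

With a zero recurrence matrix the earlier input leaves no trace; with a nonzero one, the same last input produces a different state depending on what preceded it.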

Worked example: processing “love” at t=1

Let's trace through one step in detail. At $t=1$, the RNN has already processed “I” and has hidden state $h_1 = [0.4621,\; 0.1974,\; -0.0997]$. Now it receives the embedding for “love”: $x_1 = [0.5,\; 0.8]$.

Step 1 — Transform the input:

$$W_{xh} \cdot x_1 = \begin{bmatrix} 0.5 & -0.3 \\ 0.2 & 0.7 \\ -0.1 & 0.4 \end{bmatrix} \cdot \begin{bmatrix} 0.5 \\ 0.8 \end{bmatrix} = \begin{bmatrix} 0.01 \\ 0.66 \\ 0.27 \end{bmatrix}$$

Step 2 — Transform the previous memory:

$$W_{hh} \cdot h_1 = \begin{bmatrix} 0.1 & 0.4 & -0.2 \\ -0.3 & 0.2 & 0.5 \\ 0.3 & -0.1 & 0.1 \end{bmatrix} \cdot \begin{bmatrix} 0.4621 \\ 0.1974 \\ -0.0997 \end{bmatrix} = \begin{bmatrix} 0.1451 \\ -0.1490 \\ 0.1089 \end{bmatrix}$$

Step 3 — Add and apply tanh:

$$h_2 = \tanh\!\left(\begin{bmatrix} 0.01 \\ 0.66 \\ 0.27 \end{bmatrix} + \begin{bmatrix} 0.1451 \\ -0.1490 \\ 0.1089 \end{bmatrix}\right) = \tanh\!\begin{bmatrix} 0.1551 \\ 0.5110 \\ 0.3789 \end{bmatrix} = \begin{bmatrix} 0.1539 \\ 0.4707 \\ 0.3618 \end{bmatrix}$$

The new hidden state $h_2 = [0.1539,\; 0.4707,\; 0.3618]$ now encodes a compressed representation of “I love”. Notice how it differs from $h_1$ — the memory has been updated to incorporate the emotional word “love.” In particular, hidden unit 1 jumped from 0.1974 to 0.4707. Most of that comes from the input term (0.66): unit 1 has a large weight (0.7) on the second input feature, and “love” has a high value (0.8) on that feature, so with these hand-picked weights the unit behaves like an emotional-intensity detector.
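
The three steps of the worked example can be re-checked numerically. This is a minimal sketch that copies the matrices, $h_1$, and $x_1$ from the example; the variable names `input_part` and `memory_part` are labels introduced here for the two intermediate products.

```python
import numpy as np

W_xh = np.array([[0.5, -0.3], [0.2, 0.7], [-0.1, 0.4]])
W_hh = np.array([[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5], [0.3, -0.1, 0.1]])

h_1 = np.array([0.4621, 0.1974, -0.0997])  # hidden state after "I" (rounded)
x_1 = np.array([0.5, 0.8])                 # embedding for "love"

input_part = W_xh @ x_1                    # Step 1: transform the input
memory_part = W_hh @ h_1                   # Step 2: transform the previous memory
h_2 = np.tanh(input_part + memory_part)    # Step 3: add and apply tanh

print(np.round(input_part, 4))
print(np.round(memory_part, 4))
print(np.round(h_2, 4))
```

The printed vectors should agree with the hand calculation to about four decimal places; small differences in the last digit can come from using the rounded $h_1$.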


Step-by-Step: An RNN Processing a Sentence

Now let's watch the complete forward pass. The interactive visualization below processes “I love this movie” one word at a time. Use the controls to step through each timestep and observe how the hidden state evolves.


Notice the key pattern: each hidden state is influenced by all previous words. By $t=3$, the hidden state $h_4 = [0.5597,\; 0.4605,\; -0.0660]$ is not just about “movie” — it carries traces of “I,” “love,” and “this” as well. This is the RNN's memory in action.

A compressed summary, not a recording. The hidden state does not store every word verbatim. With only 3 numbers, it cannot. Instead, it stores the features that the weight matrices have learned to extract. With proper training, these features capture sentiment, topic, syntactic structure — whatever is useful for the task.

Building an RNN from Scratch in Python

Let's implement the complete RNN forward pass in Python, using only NumPy. We will build a tiny sentiment classifier that processes “I love this movie” and outputs a probability of positive sentiment. Every line is annotated with what it does and the values it produces.

RNN Forward Pass — Pure NumPy Implementation
🐍rnn_from_scratch.py
import numpy as np

NumPy is Python’s numerical computing library. It provides ndarray — a fast, memory-efficient matrix type that runs optimized C code under the hood. We use it for all the matrix multiplications (@ operator), element-wise operations (tanh), and array creation in this RNN implementation.

EXECUTION STATE
numpy = Numerical computing library — provides ndarray, linear algebra (@ for matmul), mathematical functions (np.tanh, np.exp, np.zeros), and random number generation
as np = Standard alias — lets us write np.array() instead of numpy.array(). Universal Python convention.
# --- Configuration ---

These constants define the shape of our RNN. In a real network, input_size would be the word embedding dimension (e.g., 300), hidden_size would be 256-512, and seq_length would vary per sentence. We use tiny values so every matrix fits on screen.

input_size = 2

The dimension of each input vector (word embedding). Each word in our vocabulary is represented as a 2D vector. In production models, this would be 100-300 dimensions (Word2Vec, GloVe) or 768+ (BERT embeddings). We use 2 so you can see every single number in the matrices.

EXECUTION STATE
input_size = 2 = Each word is encoded as a 2-element vector. Example: 'I' → [1.0, 0.0], 'love' → [0.5, 0.8]. Think of these two dimensions as capturing two different aspects of word meaning.
hidden_size = 3

The dimension of the hidden state vector — the RNN’s “memory.” This determines how much information the network can remember from previous timesteps. Larger hidden size = more memory capacity but more parameters to learn. We use 3 for full visibility of every computation.

EXECUTION STATE
hidden_size = 3 = The hidden state h will be a 3-element vector. At each timestep, the RNN compresses all information seen so far into these 3 numbers. In practice, 128-512 is common.
seq_length = 4

Number of timesteps in our input sequence. The RNN will process 4 words one at a time, updating its hidden state after each word. In real applications, sequences can be hundreds or thousands of tokens long.

EXECUTION STATE
seq_length = 4 = We process 4 words: 'I', 'love', 'this', 'movie'. The RNN sees them in order, one per timestep: t=0, t=1, t=2, t=3.
tokens = ["I", "love", "this", "movie"]

The words in our sentence. We’ll convert each word to a numeric vector (embedding) before feeding it to the RNN. The order matters — “I love this movie” has a different meaning than “this movie love I.”

EXECUTION STATE
tokens = ["I", "love", "this", "movie"] — a 4-word sentence. The RNN will process these left-to-right, building up its understanding one word at a time.
X = np.array([...]) — word embeddings

Creates a 2D NumPy array where each row is a word’s embedding vector. In a real system, these would come from a pretrained embedding layer (Word2Vec, GloVe) or be learned during training. Here we assign simple hand-crafted values.

EXECUTION STATE
⬇ X shape = (4, 2) — 4 words, each represented as a 2D vector
X[0] = [1.0, 0.0] — "I" = The pronoun 'I' is encoded with feature 0 fully on, feature 1 off
X[1] = [0.5, 0.8] — "love" = The verb 'love' has moderate feature 0 and high feature 1 — perhaps feature 1 captures emotional intensity
X[2] = [0.3, 0.7] — "this" = The demonstrative 'this' has low-to-moderate values on both features
X[3] = [0.9, 0.1] — "movie" = The noun 'movie' has high feature 0 and low feature 1 — a concrete object, low emotional loading
W_xh = np.array([...]) — input-to-hidden weights

This weight matrix transforms each input vector into the hidden state space. It has shape (hidden_size, input_size) = (3, 2). Each row defines how one hidden unit responds to the two input features. These weights are shared across all timesteps — the same W_xh is used for every word. In training, gradient descent learns these values.

EXECUTION STATE
W_xh shape: (3, 2) = 3 rows (one per hidden unit) × 2 columns (one per input feature). When we compute W_xh @ x_t, we get a 3-element vector.
W_xh values =
      x0    x1
h0 [ 0.5  -0.3]
h1 [ 0.2   0.7]
h2 [-0.1   0.4]
→ W_xh[0] = [0.5, -0.3] = Hidden unit 0 responds positively to input feature 0 (weight 0.5) and negatively to feature 1 (weight -0.3)
→ W_xh[1] = [0.2, 0.7] = Hidden unit 1 responds mildly to feature 0 (0.2) but strongly to feature 1 (0.7) — it’s an 'emotional intensity detector'
→ Weight sharing = The SAME W_xh is used at t=0, t=1, t=2, t=3. This is the key idea: weight sharing across time makes RNNs parameter-efficient and able to handle variable-length sequences.
W_hh = np.array([...]) — hidden-to-hidden weights

This is the recurrence matrix — the heart of the RNN. It transforms the previous hidden state to influence the current one. This is what gives the RNN its “memory.” W_hh has shape (hidden_size, hidden_size) = (3, 3). It defines how each hidden unit at time t depends on ALL hidden units at time t-1.

EXECUTION STATE
W_hh shape: (3, 3) = 3 × 3 square matrix. The recurrence matrix must be square because it maps hidden state (size 3) back to hidden state (size 3).
W_hh values =
       h0    h1    h2
h0 [ 0.1   0.4  -0.2]
h1 [-0.3   0.2   0.5]
h2 [ 0.3  -0.1   0.1]
→ Why W_hh matters = Without W_hh, the RNN has no memory — it becomes a regular feedforward network. W_hh lets information from word 'I' (t=0) flow into the hidden state at 'love' (t=1), then 'this' (t=2), and 'movie' (t=3).
→ W_hh[0] = [0.1, 0.4, -0.2] = Hidden unit 0 at time t is influenced by: 0.1×(h0 at t-1) + 0.4×(h1 at t-1) + (-0.2)×(h2 at t-1). It cares most about what h1 remembered.
b_h = np.zeros(hidden_size)

The hidden bias vector, initialized to all zeros. In practice, biases are learned during training and shift the pre-activation values, allowing each hidden unit to have a different “default” activation level. Starting at zero is standard — it means no bias until learning adjusts them.

EXECUTION STATE
📚 np.zeros(n) = NumPy function: creates an array of n zeros. np.zeros(3) → [0.0, 0.0, 0.0]
b_h = [0.0, 0.0, 0.0] = One bias per hidden unit. Added to W_xh @ x_t + W_hh @ h before tanh. With zeros, the pre-activation depends only on the weighted inputs.
W_hy = np.array([[0.6, -0.4, 0.3]])

The hidden-to-output weight matrix. Shape (output_size, hidden_size) = (1, 3). It projects the final hidden state into a single output value (the sentiment logit). The weights determine which hidden units matter most for the prediction.

EXECUTION STATE
W_hy shape: (1, 3) = 1 output × 3 hidden units. The single output is a sentiment score — positive for 'positive sentiment', negative for 'negative sentiment'.
W_hy = [[0.6, -0.4, 0.3]] = Output = 0.6×h0 + (-0.4)×h1 + 0.3×h2. Hidden unit 0 contributes most positively (0.6), unit 1 is weighted negatively (-0.4), unit 2 moderately (0.3).
b_y = np.array([0.1])

The output bias. A small positive value (0.1) gives the model a slight default toward positive sentiment before seeing any input. This is added to W_hy @ h to produce the final logit.

EXECUTION STATE
b_y = [0.1] = Output bias — shifts the raw score by 0.1 before sigmoid. Acts as a prior: slight lean toward positive prediction.
# --- Forward pass ---

This section runs the RNN forward through all 4 timesteps. At each step: (1) read the current word’s embedding, (2) combine it with the previous hidden state, (3) apply tanh to get the new hidden state. The key equation is: h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h).

h = np.zeros(hidden_size) — initial hidden state

Initialize the hidden state to all zeros. This means the RNN starts with “no memory” — a blank slate. The first word 'I' will be processed with this zero vector as the “previous” hidden state, meaning W_hh @ h contributes nothing at t=0.

EXECUTION STATE
h = h_0 = [0.0, 0.0, 0.0] = The initial hidden state. All zeros = no prior information. After processing 'I', this becomes h_1 = [0.4621, 0.1974, -0.0997]. After 'love', h_2 = [0.1539, 0.4707, 0.3618]. And so on.
→ Why zeros? = Zero initialization is standard because we have no prior information about the sequence. Some advanced models learn the initial hidden state as a parameter.
for t in range(seq_length): — loop over timesteps

Iterate through each timestep t = 0, 1, 2, 3. At each step, the RNN reads one word and updates its hidden state. This loop IS the recurrence — each iteration depends on the hidden state from the previous iteration. The same weight matrices (W_xh, W_hh) are reused at every timestep.

LOOP TRACE · 4 iterations
t=0, token='I'
x_0 = [1.0, 0.0] = Embedding for 'I'
h_prev = [0.0, 0.0, 0.0] = Blank slate — no memory yet
→ h_1 = [0.4621, 0.1974, -0.0997] = First real hidden state — captures 'I'
t=1, token='love'
x_1 = [0.5, 0.8] = Embedding for 'love'
h_prev = [0.4621, 0.1974, -0.0997] = Memory of 'I'
→ h_2 = [0.1539, 0.4707, 0.3618] = Now captures 'I love'
t=2, token='this'
x_2 = [0.3, 0.7] = Embedding for 'this'
h_prev = [0.1539, 0.4707, 0.3618] = Memory of 'I love'
→ h_3 = [0.0712, 0.6521, 0.2778] = Now captures 'I love this'
t=3, token='movie'
x_3 = [0.9, 0.1] = Embedding for 'movie'
h_prev = [0.0712, 0.6521, 0.2778] = Memory of 'I love this'
→ h_4 = [0.5597, 0.4605, -0.0660] = Final hidden state: captures 'I love this movie'
x_t = X[t] — get current word embedding

Index into the embedding matrix to get the current word’s vector. X[t] extracts row t from the (4, 2) matrix. This is the input the RNN sees at this timestep — the only new information it receives. Everything else comes from the previous hidden state.

EXECUTION STATE
📚 X[t] (NumPy indexing) = Extracts row t from a 2D array. X[0] = [1.0, 0.0], X[1] = [0.5, 0.8], X[2] = [0.3, 0.7], X[3] = [0.9, 0.1]
x_t shape = (2,) — a 1D vector with 2 elements (input_size = 2)
h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

THE core RNN equation. This single line is the entire recurrence computation. It takes the current input x_t and previous hidden state h, combines them through learned weight matrices, and squashes the result with tanh to produce the new hidden state. Let’s break down each part:

EXECUTION STATE
📚 @ (matrix multiply operator) = Python’s matrix multiplication operator (equivalent to np.matmul). W_xh(3,2) @ x_t(2,) = result(3,). Computes dot products between W_xh rows and x_t.
W_xh @ x_t = Transforms the current word embedding into hidden space. Shape: (3,2) @ (2,) → (3,). This answers: 'what does the current word contribute to each hidden unit?'
W_hh @ h = Transforms the previous hidden state. Shape: (3,3) @ (3,) → (3,). This answers: 'what does the memory of previous words contribute to each hidden unit?'
+ (element-wise addition) = Adds the three vectors element-by-element: (W_xh @ x_t)[i] + (W_hh @ h)[i] + b_h[i] for each hidden unit i = 0, 1, 2
📚 np.tanh() = Element-wise hyperbolic tangent. Maps any real number to the range (-1, +1). tanh(0) = 0, tanh(2) ≈ 0.964, tanh(-2) ≈ -0.964. This nonlinearity prevents values from exploding and lets the network learn nonlinear patterns.
→ Why tanh not ReLU? = tanh outputs in (-1, +1) which is centered around 0. This helps because the hidden state is used as BOTH output and recurrent input. Centered outputs mean W_hh @ h doesn’t drift. ReLU (unbounded) can cause hidden states to grow without limit.
── Example: t=1, token='love' ── =
W_xh @ x_1 = [0.5×0.5 + (-0.3)×0.8, 0.2×0.5 + 0.7×0.8, (-0.1)×0.5 + 0.4×0.8] = [0.0100, 0.6600, 0.2700]
W_hh @ h_1 = [0.1×0.4621 + 0.4×0.1974 + (-0.2)×(-0.0997), ...] = [0.1451, -0.1490, 0.1089]
pre_activation = [0.0100+0.1451, 0.6600+(-0.1490), 0.2700+0.1089] = [0.1551, 0.5110, 0.3789]
⬆ h_2 = tanh(pre) = [tanh(0.1551), tanh(0.5110), tanh(0.3789)] = [0.1539, 0.4707, 0.3618]
print(f"t={t} '{tokens[t]}': h = {h}")

Prints the hidden state after processing each word. This is how we trace the RNN’s evolving memory. Watch how h changes after each word — it’s not just the current word’s representation, it’s the compressed summary of ALL words seen so far. NumPy prints more digits by default; the outputs below are shown rounded to four decimals.

EXECUTION STATE
Output at t=0 = t=0 'I': h = [ 0.4621 0.1974 -0.0997]
Output at t=1 = t=1 'love': h = [0.1539 0.4707 0.3618]
Output at t=2 = t=2 'this': h = [0.0712 0.6521 0.2778]
Output at t=3 = t=3 'movie': h = [ 0.5597 0.4605 -0.0660]
# --- Output from final hidden state ---

After processing all 4 words, the final hidden state h_4 = [0.5597, 0.4605, -0.0660] is a compressed summary of the entire sentence 'I love this movie'. We now project this into a single sentiment score.

y = W_hy @ h + b_y — compute output logit

Projects the final hidden state into a single output value (the sentiment logit). This is a simple linear transformation: a dot product between W_hy and the hidden state, plus a bias. The output y is a raw score — positive values lean toward positive sentiment, negative toward negative.

EXECUTION STATE
📚 @ (matrix multiply) = W_hy(1,3) @ h(3,) = result(1,). Computes: 0.6×h[0] + (-0.4)×h[1] + 0.3×h[2]
W_hy @ h = 0.6×0.5597 + (-0.4)×0.4605 + 0.3×(-0.0660) = 0.3358 - 0.1842 - 0.0198 = 0.1318
+ b_y = 0.1318 + 0.1 = 0.2318
⬆ y = [0.2318] = The raw logit is positive (0.2318), which leans toward positive sentiment. After sigmoid, this becomes a probability.
sentiment = 1 / (1 + np.exp(-y[0])) — sigmoid

Applies the sigmoid function to convert the raw logit into a probability between 0 and 1. Sigmoid(σ) maps any real number to (0, 1): large positive → close to 1, large negative → close to 0, zero → 0.5. Values > 0.5 predict positive sentiment.

EXECUTION STATE
📚 sigmoid formula = σ(x) = 1 / (1 + e^(-x)). Example: σ(0) = 0.5, σ(2) = 0.88, σ(-2) = 0.12
y[0] = 0.2318 = The raw logit from the previous line. Positive but small — a mild positive prediction.
np.exp(-y[0]) = e^(-0.2318) = 0.7931 — the denominator term
1 + np.exp(-y[0]) = 1 + 0.7931 = 1.7931
⬆ sentiment = 0.5577 = 1 / 1.7931 = 0.5577. Probability > 0.5, so the model predicts POSITIVE sentiment. With proper training, this confidence would be much higher (e.g., 0.95).
print(f"Output logit: {y[0]:.4f}")

Prints the raw logit value before sigmoid. The format specifier .4f shows 4 decimal places.

EXECUTION STATE
Output = Output logit: 0.2318
print(f"Sentiment: {sentiment:.4f}")

Prints the final sentiment probability. 0.5577 means the model is 55.77% confident the sentence has positive sentiment. This is barely above the 0.5 threshold because our weights are hand-picked, not trained — with training, the model would learn weights that push this much closer to 1.0 for 'I love this movie'.

EXECUTION STATE
Output = Sentiment: 0.5577
→ Interpretation = 0.5577 > 0.5 → Positive sentiment prediction. Untrained, hand-picked weights give a near-chance prediction. After training on thousands of labeled reviews, a model like this would be expected to output something close to 0.95 for 'I love this movie' and close to 0.05 for 'I hate this movie'.
```python
import numpy as np

# --- Configuration ---
input_size  = 2
hidden_size = 3
seq_length  = 4

tokens = ["I", "love", "this", "movie"]

# Simple 2D embeddings for each word
X = np.array([
    [1.0, 0.0],   # "I"
    [0.5, 0.8],   # "love"
    [0.3, 0.7],   # "this"
    [0.9, 0.1],   # "movie"
])

# --- Weight matrices (normally learned) ---
W_xh = np.array([[ 0.5, -0.3],
                 [ 0.2,  0.7],
                 [-0.1,  0.4]])

W_hh = np.array([[ 0.1,  0.4, -0.2],
                 [-0.3,  0.2,  0.5],
                 [ 0.3, -0.1,  0.1]])

b_h = np.zeros(hidden_size)

W_hy = np.array([[0.6, -0.4, 0.3]])
b_y  = np.array([0.1])

# --- Forward pass ---
h = np.zeros(hidden_size)

for t in range(seq_length):
    x_t = X[t]
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    print(f"t={t} '{tokens[t]}': h = {h}")

# --- Output from final hidden state ---
y = W_hy @ h + b_y
sentiment = 1 / (1 + np.exp(-y[0]))
print(f"Output logit: {y[0]:.4f}")
print(f"Sentiment: {sentiment:.4f}")
```

The entire forward pass boils down to a single line inside a loop: h = np.tanh(W_xh @ x_t + W_hh @ h + b_h). This is the beauty of recurrence — one equation, applied repeatedly, gives the network the ability to process sequences of any length while maintaining memory.
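
As a hedged follow-up, the loop can be wrapped in a small function that returns every hidden state. The function name `rnn_forward` and its `h0` argument are assumptions made for this sketch; the weights and embeddings are the ones used above. Running it on the reversed sentence shows that the final state depends on word order.

```python
import numpy as np

W_xh = np.array([[0.5, -0.3], [0.2, 0.7], [-0.1, 0.4]])
W_hh = np.array([[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5], [0.3, -0.1, 0.1]])
b_h = np.zeros(3)

def rnn_forward(X, h0=None):
    """Return the hidden state after every timestep, shape (seq_len, hidden_size)."""
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    states = []
    for x_t in X:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

# "I", "love", "this", "movie"
X = np.array([[1.0, 0.0], [0.5, 0.8], [0.3, 0.7], [0.9, 0.1]])

states = rnn_forward(X)
reversed_states = rnn_forward(X[::-1])   # "movie", "this", "love", "I"

print(np.round(states[-1], 4))
print(np.allclose(states[-1], reversed_states[-1]))  # False: order matters
```

The last row of `states` should agree with the final hidden state reported for "movie" to about four decimals, while the reversed sentence ends in a different state even though it contains the same four vectors.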

Tracing the hidden state evolution

| Timestep | Token | h[0] | h[1] | h[2] | Interpretation |
|---|---|---|---|---|---|
| t=0 | "I" | 0.4621 | 0.1974 | -0.0997 | Subject pronoun detected |
| t=1 | "love" | 0.1539 | 0.4707 | 0.3618 | Emotional word → h[1] jumps |
| t=2 | "this" | 0.0712 | 0.6521 | 0.2778 | Demonstrative reinforces context |
| t=3 | "movie" | 0.5597 | 0.4605 | -0.0660 | Object noun → h[0] rises |

Why do the hidden values change so much between steps? Because the weight matrices $W_{xh}$ and $W_{hh}$ each apply a linear transformation, and tanh is nonlinear. The combination means each new word can significantly reshape the hidden state. During training, gradient descent adjusts these weights so that the hidden state captures features relevant to the task (like sentiment) and suppresses irrelevant details.

The Same RNN in PyTorch

PyTorch provides nn.RNN, which implements everything we just built — the weight matrices, the recurrence loop, and the tanh activation — in a single optimized module. Let's see how the same sentiment classifier looks with PyTorch:

RNN Sentiment Classifier — PyTorch Implementation
🐍rnn_pytorch.py
import torch

PyTorch is an open-source deep learning framework. It provides tensor operations (like NumPy but with GPU support and automatic differentiation), neural network modules (nn.RNN, nn.Linear), and optimizers (SGD, Adam). All computations in PyTorch operate on torch.Tensor objects.

EXECUTION STATE
torch = Core PyTorch library — provides Tensor class (multidimensional array with autograd), mathematical functions, and the foundation for all neural network operations
import torch.nn as nn

The neural network module contains all building blocks for neural networks: layers (Linear, RNN, LSTM, Conv2d), loss functions (CrossEntropyLoss), and the Module base class. We alias it as ‘nn’ for conciseness.

EXECUTION STATE
torch.nn = Neural network module — provides nn.Module (base class), nn.RNN (recurrent layer), nn.Linear (fully connected layer), nn.Embedding, and many more
as nn = Standard alias so we write nn.RNN() instead of torch.nn.RNN()
# --- Define a simple RNN for sentiment ---

We define a custom neural network class that wraps PyTorch’s built-in nn.RNN with an output layer. This is the standard PyTorch pattern: define the architecture in __init__, define the computation in forward().

class SentimentRNN(nn.Module):

Defines a new neural network by inheriting from nn.Module. Every PyTorch model must extend nn.Module — this gives it parameter tracking (for optimization), GPU transfer (.to(device)), serialization (.state_dict()), and the ability to be composed with other modules.

EXECUTION STATE
📚 nn.Module = Base class for all PyTorch neural networks. Provides: .parameters() to list all learnable weights, .train()/.eval() for mode switching, .forward() as the computation method, .to(device) for GPU transfer
SentimentRNN = Our custom model: RNN layer + Linear output layer. Takes a sequence of word embeddings and outputs a single sentiment score.
def __init__(self, input_size, hidden_size, output_size):

Constructor that defines the network’s architecture (which layers exist and their sizes). Called once when the model is created. All nn.Module submodules must be created here so PyTorch can discover their parameters.

EXECUTION STATE
⬇ input: self = The model instance being constructed. Used to store layers as attributes (self.rnn, self.fc) so PyTorch can find them.
⬇ input: input_size = 2 = Dimension of each input vector (word embedding size). Each word is a 2D vector.
⬇ input: hidden_size = 3 = Dimension of the RNN’s hidden state (memory). The RNN compresses all sequence information into this many numbers.
⬇ input: output_size = 1 = Dimension of the final output. For binary sentiment (positive/negative), we need just 1 number.
super().__init__()

Calls nn.Module’s constructor. This is mandatory — it initializes PyTorch’s internal parameter tracking, hook system, and state management. Without this call, assigning self.rnn or self.fc raises an AttributeError, because the registries that hold submodules and parameters do not exist yet.

EXECUTION STATE
📚 super().__init__() = Initializes the nn.Module base class. Sets up internal dictionaries for parameters, buffers, and submodules. Must be called before assigning any nn.Module attributes.
self.hidden_size = hidden_size

Stores the hidden size as an instance attribute. We need this in forward() to create the initial hidden state h0 with the correct size. It’s just a plain integer, not a learnable parameter.

EXECUTION STATE
self.hidden_size = 3 = Saved for later use in forward() when creating h0 = torch.zeros(1, batch, 3). Not a learnable parameter — just a configuration value.
self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)

Creates PyTorch’s built-in RNN layer. This single line replaces our entire NumPy loop — it handles the recurrence, weight matrices, and tanh activation internally. nn.RNN creates and manages four parameter tensors: weight_ih, weight_hh, bias_ih, bias_hh.

EXECUTION STATE
📚 nn.RNN(input_size, hidden_size, ...) = PyTorch’s recurrent layer. Internally implements: h_t = tanh(x_t @ W_ihᵀ + b_ih + h_{t-1} @ W_hhᵀ + b_hh). Creates learnable weight matrices automatically.
⬇ arg 1: input_size = 2 = Size of each input vector. Determines the shape of weight_ih: (hidden_size, input_size) = (3, 2). Each input word is a 2D embedding.
⬇ arg 2: hidden_size = 3 = Size of the hidden state. Determines weight_hh shape: (hidden_size, hidden_size) = (3, 3). The memory vector has 3 elements.
⬇ keyword arg: batch_first = True = Changes input shape convention (it is passed by keyword; the third positional parameter of nn.RNN is num_layers). With batch_first=True: input shape is (batch, seq_len, input_size). Without (default): (seq_len, batch, input_size). batch_first=True is more intuitive because it matches how we think of data: N samples, each a sequence of T vectors.
→ Parameters created = weight_ih_l0: (3, 2) = 6 params; weight_hh_l0: (3, 3) = 9 params; bias_ih_l0: (3,) = 3 params; bias_hh_l0: (3,) = 3 params. Total: 21 learnable parameters
→ 'l0' suffix = Means 'layer 0'. nn.RNN supports stacking multiple layers (num_layers parameter). l0 = first (and only) layer.
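
The formula quoted above uses a row-vector, batched convention (`x_t @ W_ihᵀ`), while the NumPy implementation uses column vectors (`W_xh @ x_t`). The sketch below emulates the quoted formula in plain NumPy to show the two give the same numbers; it does not call PyTorch, and the weight values are copied from the NumPy example purely for illustration, with both biases assumed to be zero.

```python
import numpy as np

W_ih = np.array([[0.5, -0.3], [0.2, 0.7], [-0.1, 0.4]])   # (hidden_size, input_size)
W_hh = np.array([[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5], [0.3, -0.1, 0.1]])
b_ih = np.zeros(3)   # assumed zero for this sketch
b_hh = np.zeros(3)   # assumed zero for this sketch

x_t = np.array([[0.5, 0.8]])                     # (batch=1, input_size)
h_prev = np.array([[0.4621, 0.1974, -0.0997]])   # (batch=1, hidden_size)

# Row-vector (batched) form, following the formula quoted for nn.RNN.
h_rows = np.tanh(x_t @ W_ih.T + b_ih + h_prev @ W_hh.T + b_hh)

# Column-vector form used by the NumPy implementation; the two biases
# act as a single effective bias b_ih + b_hh.
h_cols = np.tanh(W_ih @ x_t[0] + W_hh @ h_prev[0] + (b_ih + b_hh))

print(h_rows.shape)                     # (1, 3)
print(np.allclose(h_rows[0], h_cols))   # True
```

The transposes are only a layout choice that lets one matrix product handle a whole batch of rows at once; the underlying recurrence is unchanged.
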
10self.fc = nn.Linear(hidden_size, output_size)

Creates a fully connected (linear) layer that maps the final hidden state to the output. This is equivalent to our W_hy and b_y from the NumPy version. nn.Linear stores a weight matrix and bias, computing: output = input @ weightᵀ + bias.

EXECUTION STATE
📚 nn.Linear(in_features, out_features) = Fully connected layer. Stores weight of shape (out_features, in_features) and bias of shape (out_features). Forward: y = x @ Wᵀ + b
⬇ arg 1: hidden_size = 3 = Input dimension — the hidden state from the RNN has 3 elements
⬇ arg 2: output_size = 1 = Output dimension — one number for binary sentiment classification
→ Parameters created = weight: (1, 3) = 3 params; bias: (1,) = 1 param. Total: 4 learnable parameters
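To make the formula y = x @ Wᵀ + b concrete, here is a small sketch comparing nn.Linear with the same computation written by hand. The input vector reuses the final hidden state quoted in this walkthrough; the layer's weights are randomly initialized, so only the equality is checked, not a specific value.

```python
import torch
import torch.nn as nn

fc = nn.Linear(3, 1)  # hidden_size=3 -> output_size=1

# A hidden state of shape (batch=1, hidden_size=3)
h = torch.tensor([[0.5597, 0.4605, -0.0660]])

with torch.no_grad():
    builtin = fc(h)                        # nn.Linear forward
    manual = h @ fc.weight.T + fc.bias     # same computation, written out

num_params = sum(p.numel() for p in fc.parameters())
print(builtin.shape, num_params)
```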
12def forward(self, x):

Defines the forward pass computation. This is called when you do model(input) — PyTorch automatically calls forward() and sets up the computation graph for backpropagation. The entire RNN + classification pipeline runs here.

EXECUTION STATE
⬇ input: self = The model instance with its layers (self.rnn, self.fc) and stored hidden_size
⬇ input: x, shape (batch, seq_len, input_size) = The input tensor. Example: (1, 4, 2) = 1 sentence, 4 words, 2D embeddings:
[[1.0, 0.0],   ← 'I'
 [0.5, 0.8],   ← 'love'
 [0.3, 0.7],   ← 'this'
 [0.9, 0.1]]   ← 'movie'
⬆ returns = Tensor of shape (batch, output_size) = (1, 1). A single sentiment logit per sentence.
13h0 = torch.zeros(1, x.size(0), self.hidden_size)

Creates the initial hidden state as a tensor of zeros. The shape follows PyTorch’s convention: (num_layers, batch_size, hidden_size). We use 1 layer, and x.size(0) gives the batch size from the input tensor.

EXECUTION STATE
📚 torch.zeros(*size) = Creates a tensor filled with zeros. torch.zeros(1, 1, 3) → tensor([[[0., 0., 0.]]])
⬇ dim 0: 1 (num_layers) = We have 1 RNN layer. For a 2-layer stacked RNN, this would be 2, providing a separate initial state for each layer.
⬇ dim 1: x.size(0) (batch_size) = x.size(0) extracts the first dimension of x. For x of shape (1, 4, 2), this returns 1. This ensures h0 matches the batch size of the input.
⬇ dim 2: self.hidden_size = 3 = Each hidden state is a 3-element vector, matching the RNN’s hidden_size
⬆ h0 = tensor([[[0., 0., 0.]]]) — shape (1, 1, 3). One layer, one batch, three hidden units all starting at zero.
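Two practical points about h0 can be sketched in a few lines. First, passing an explicit zero tensor and omitting h0 give the same result. Second, if you do build h0 yourself, creating it with the input's device and dtype (the `device=x.device, dtype=x.dtype` arguments below are an addition, not part of the walkthrough code) avoids a device mismatch when the model runs on a GPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=2, hidden_size=3, batch_first=True)
x = torch.randn(1, 4, 2)  # (batch, seq_len, input_size)

# Explicit initial state, allocated on the same device and dtype as x
h0 = torch.zeros(1, x.size(0), 3, device=x.device, dtype=x.dtype)

with torch.no_grad():
    out_explicit, hn_explicit = rnn(x, h0)
    out_default, hn_default = rnn(x)  # h0 omitted: PyTorch uses zeros

print(torch.allclose(out_explicit, out_default))
```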
14output, h_n = self.rnn(x, h0)

Runs the entire RNN forward pass in one call. PyTorch’s nn.RNN internally loops through all 4 timesteps, applying the recurrence equation at each step. It returns TWO tensors: output (hidden states at ALL timesteps) and h_n (hidden state at the LAST timestep only).

EXECUTION STATE
📚 self.rnn(x, h0) — nn.RNN forward = Runs the RNN recurrence for all timesteps. Internally: for each t, computes h_t = tanh(x_t @ W_ihᵀ + b_ih + h_{t-1} @ W_hhᵀ + b_hh). Returns (output, h_n).
⬇ arg 1: x — shape (1, 4, 2) = Input sequence: 1 batch, 4 timesteps, 2 features per timestep
⬇ arg 2: h0 — shape (1, 1, 3) = Initial hidden state: all zeros. If omitted, PyTorch uses zeros by default.
⬆ output, shape (1, 4, 3) = Hidden states at ALL timesteps. output[0, t, :] = h_{t+1}. Contains (values shown are for our specific weights):
  t=0: [0.4621, 0.1974, -0.0997] (after 'I')
  t=1: [0.1539, 0.4707, 0.3618] (after 'love')
  t=2: [0.0712, 0.6521, 0.2778] (after 'this')
  t=3: [0.5597, 0.4605, -0.0660] (after 'movie')
⬆ h_n — shape (1, 1, 3) = Final hidden state only. h_n[0, 0, :] = output[0, -1, :] = [0.5597, 0.4605, -0.0660]. This is the 'summary' of the entire sentence.
→ output vs h_n = output gives you h at every timestep (useful for sequence-to-sequence tasks). h_n gives you just the last one (useful for classification). For sentiment analysis, we only need h_n.
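The relationship between nn.RNN and the explicit recurrence can be checked by unrolling the loop by hand with the layer's own weights. This is a sketch under the assumption of a single-layer, unidirectional RNN with the default tanh nonlinearity; the weights here are randomly initialized, so the numbers will differ from the ones quoted above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=2, hidden_size=3, batch_first=True)
x = torch.tensor([[[1.0, 0.0], [0.5, 0.8], [0.3, 0.7], [0.9, 0.1]]])

with torch.no_grad():
    output, h_n = rnn(x)  # built-in forward pass

    # Manual recurrence using the same parameter tensors
    h = torch.zeros(1, 3)
    manual_states = []
    for t in range(x.size(1)):
        x_t = x[:, t, :]  # (batch, input_size)
        h = torch.tanh(x_t @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                       + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
        manual_states.append(h)
    manual_output = torch.stack(manual_states, dim=1)  # (1, 4, 3)

print(torch.allclose(output, manual_output, atol=1e-5))
```

The last row of `output` and `h_n[0]` hold the same vector, which is why classification code can use either one.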
15logit = self.fc(h_n.squeeze(0))

Passes the final hidden state through the linear layer to get a sentiment score. We first squeeze out the num_layers dimension (dim 0) because nn.Linear expects input of shape (batch, features), not (num_layers, batch, features).

EXECUTION STATE
📚 .squeeze(0) = Removes dimension 0 if it has size 1. h_n shape: (1, 1, 3) → squeeze(0) → (1, 3). This removes the num_layers dimension, leaving (batch, hidden_size).
⬇ h_n.squeeze(0) shape = (1, 3) — batch=1, hidden_size=3. Now compatible with nn.Linear(3, 1).
📚 self.fc() — nn.Linear forward = Computes: logit = h_n_squeezed @ W_fcᵀ + b_fc. Shape: (1, 3) @ (1, 3)ᵀ + (1,) → (1, 1).
⬆ logit = Shape (1, 1) — one sentiment score per sentence in the batch. With our specific weights: [[0.2318]]
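The "if it has size 1" condition on squeeze matters once the RNN has more than one layer. The sketch below uses placeholder zero tensors (not real hidden states) to show the shapes involved; indexing with `h_n[-1]` is shown as one possible alternative that selects the last layer's state for any number of layers.

```python
import torch

# Single-layer case: (num_layers=1, batch=5, hidden_size=3)
h_n = torch.zeros(1, 5, 3)
squeezed = h_n.squeeze(0)   # dim 0 has size 1, so it is removed
last_layer = h_n[-1]        # same shape, obtained by indexing

# Two-layer case: (num_layers=2, batch=5, hidden_size=3)
h_n_two = torch.zeros(2, 5, 3)
unchanged = h_n_two.squeeze(0)   # dim 0 has size 2, so nothing is removed
last_of_two = h_n_two[-1]        # still (batch, hidden_size)

print(squeezed.shape, unchanged.shape, last_of_two.shape)
```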
16return logit

Returns the raw logit (before sigmoid). In PyTorch, it’s common to return raw logits and apply sigmoid/softmax outside the model — this lets you use nn.BCEWithLogitsLoss (which combines sigmoid + loss for numerical stability) during training.

EXECUTION STATE
⬆ return: logit = Shape (batch, 1) = (1, 1). Raw sentiment score. Apply torch.sigmoid() externally to get probability.
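To see why returning raw logits pairs well with nn.BCEWithLogitsLoss, the following sketch compares it against applying sigmoid and then nn.BCELoss. The target label of 1.0 is a hypothetical "positive" label chosen for illustration.

```python
import torch
import torch.nn as nn

logit = torch.tensor([[0.2318]])   # raw score, shape (batch, 1)
target = torch.tensor([[1.0]])     # hypothetical label: positive sentiment

# Fused version: sigmoid and binary cross-entropy in one numerically stable op
loss_fused = nn.BCEWithLogitsLoss()(logit, target)

# Two-step version: explicit sigmoid, then binary cross-entropy
loss_two_step = nn.BCELoss()(torch.sigmoid(logit), target)

print(loss_fused.item(), loss_two_step.item())
```

For moderate logits the two agree; the fused version is preferred because it stays stable when logits are very large or very small.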
18# --- Create model and run inference ---

Now we instantiate the model with our chosen dimensions and run it on a sample sentence. The torch.no_grad() context disables gradient tracking since we’re only doing inference, not training.

19model = SentimentRNN(input_size=2, hidden_size=3, output_size=1)

Creates an instance of our SentimentRNN. This calls __init__, which creates the nn.RNN and nn.Linear layers with randomly initialized weights. The model now has 21 (RNN) + 4 (Linear) = 25 total learnable parameters.

EXECUTION STATE
📚 SentimentRNN(...) = Calls __init__, which creates:
  • self.rnn = nn.RNN(2, 3, batch_first=True) → 21 params
  • self.fc = nn.Linear(3, 1) → 4 params
  Total: 25 parameters
input_size = 2 = Each word embedding has 2 dimensions
hidden_size = 3 = The hidden state (memory) has 3 dimensions
output_size = 1 = Single output for binary sentiment
21X = torch.tensor([[[1.0, 0.0], ...]]) — input tensor

Creates the input tensor with shape (batch=1, seq_len=4, input_size=2). The triple nesting represents: one sentence, containing four words, each as a 2D vector. Same data as our NumPy example but in PyTorch tensor format.

EXECUTION STATE
📚 torch.tensor(data) = Creates a tensor from a Python list. Infers dtype from data (float32 for floats). Unlike np.array, tensors support autograd and GPU transfer.
X shape: (1, 4, 2) = Batch dim=1 (one sentence), seq_len=4 (four words), input_size=2 (2D embeddings). batch_first=True in nn.RNN expects this ordering.
X[0, 0] = [1.0, 0.0] = Batch 0, timestep 0: embedding for 'I'
X[0, 1] = [0.5, 0.8] = Batch 0, timestep 1: embedding for 'love'
X[0, 2] = [0.3, 0.7] = Batch 0, timestep 2: embedding for 'this'
X[0, 3] = [0.9, 0.1] = Batch 0, timestep 3: embedding for 'movie'
26with torch.no_grad():

Context manager that disables gradient computation. Since we’re doing inference (not training), we don’t need PyTorch to build the computation graph or store intermediate values for backpropagation. This saves memory and speeds up computation.

EXECUTION STATE
📚 torch.no_grad() = Disables autograd. Operations inside this block don’t track gradients, so PyTorch does not keep intermediate activations for a backward pass, which lowers memory usage during inference. Use it for evaluation and prediction.
→ When to use = During inference (model.eval() + no_grad), validation loops, and any computation where you don’t need gradients. Don’t use during training’s forward pass.
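The effect of the context manager is visible on the tensors themselves. This small sketch runs the same layer with and without torch.no_grad() and inspects whether the result is attached to a computation graph.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=2, hidden_size=3, batch_first=True)
x = torch.randn(1, 4, 2)

# Normal forward pass: autograd records the operations
output_tracked, _ = rnn(x)

# Inference-style forward pass: no graph is built
with torch.no_grad():
    output_untracked, _ = rnn(x)

print(output_tracked.requires_grad, output_tracked.grad_fn is not None)
print(output_untracked.requires_grad, output_untracked.grad_fn is None)
```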
27logit = model(X) — run forward pass

Calls the model’s forward() method. PyTorch routes model(X) through forward(X) automatically (plus any registered hooks). The entire pipeline runs: create h0, process all 4 timesteps through RNN, project final hidden state through Linear layer.

EXECUTION STATE
📚 model(X) = Equivalent to model.forward(X) but also runs any registered hooks. Always call model(X), never model.forward(X) directly.
⬇ X shape = (1, 4, 2) — batch=1, seq_len=4, input_size=2
⬆ logit shape = (1, 1) — one sentiment score for the one sentence in the batch
28prob = torch.sigmoid(logit)

Applies sigmoid to convert the raw logit into a probability. torch.sigmoid is the element-wise sigmoid function: σ(x) = 1 / (1 + e^(-x)). Maps any real number to the range (0, 1).

EXECUTION STATE
📚 torch.sigmoid(x) = Element-wise sigmoid: σ(x) = 1/(1 + e^(-x)). Maps R → (0, 1). Example: sigmoid(0)=0.5, sigmoid(2)=0.88, sigmoid(-2)=0.12
⬇ logit = Raw score from the model. With our weights: tensor([[0.2318]])
⬆ prob = tensor([[0.5577]]) — 55.77% probability of positive sentiment
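The conversion from logit to probability is simple enough to verify by hand. Here is a plain-Python sketch of the sigmoid formula applied to the logit quoted above; the helper name `sigmoid` is ours.

```python
import math

def sigmoid(z: float) -> float:
    """Element-wise logistic function for a single number."""
    return 1.0 / (1.0 + math.exp(-z))

prob = sigmoid(0.2318)
print(f"{prob:.4f}")            # matches the probability in the walkthrough
print(f"{sigmoid(0.0):.2f}")    # a logit of 0 means "undecided"
```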
29print(f"Logit: {logit.item():.4f}")

Prints the raw logit. .item() extracts a single number from a 1-element tensor as a Python float. The :.4f format shows 4 decimal places.

EXECUTION STATE
📚 .item() = Converts a single-element tensor to a Python number. tensor([[0.2318]]).item() → 0.2318. Only works on tensors with exactly one element.
Output = Logit: 0.2318 (with our specific weights)
30print(f"Probability: {prob.item():.4f}")

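Prints the final sentiment probability. This is the model’s answer: how likely is the sentence 'I love this movie' to have positive sentiment? With untrained weights the answer carries no real information and typically sits near 0.5 (uncertain). Because nn.RNN and nn.Linear initialize their weights randomly, the exact number changes from run to run unless the weights are set explicitly. After training on labeled data, it should move toward 1.0.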

EXECUTION STATE
Output = Probability: 0.5577 (with our specific weights)
→ After training = A well-trained model might produce probabilities such as:
  • 'I love this movie' → prob ≈ 0.95
  • 'I hate this movie' → prob ≈ 0.05
  • 'The movie was okay' → prob ≈ 0.55
 1  import torch
 2  import torch.nn as nn
 3
 4  # --- Define a simple RNN for sentiment ---
 5  class SentimentRNN(nn.Module):
 6      def __init__(self, input_size, hidden_size, output_size):
 7          super().__init__()
 8          self.hidden_size = hidden_size
 9          self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
10          self.fc  = nn.Linear(hidden_size, output_size)
11
12      def forward(self, x):
13          h0 = torch.zeros(1, x.size(0), self.hidden_size)
14          output, h_n = self.rnn(x, h0)
15          logit = self.fc(h_n.squeeze(0))
16          return logit
17
18  # --- Create model and run inference ---
19  model = SentimentRNN(input_size=2, hidden_size=3, output_size=1)
20
21  X = torch.tensor([[[1.0, 0.0],
22                      [0.5, 0.8],
23                      [0.3, 0.7],
24                      [0.9, 0.1]]])
25
26  with torch.no_grad():
27      logit = model(X)
28      prob  = torch.sigmoid(logit)
29      print(f"Logit: {logit.item():.4f}")
30      print(f"Probability: {prob.item():.4f}")

NumPy vs. PyTorch: what changed?

| Aspect | NumPy (from scratch) | PyTorch (nn.RNN) |
| --- | --- | --- |
| Weight creation | Manual np.array | Automatic (nn.RNN creates them) |
| Recurrence loop | Explicit for loop | Built into nn.RNN forward() |
| Backpropagation | Not implemented | Automatic (autograd) |
| GPU support | No | Yes (.to('cuda')) |
| Batch processing | Not implemented | Built-in (batch dimension) |
| Code lines | ~20 lines | ~10 lines |

The math is identical. PyTorch's nn.RNN uses the same equation internally: $h_t = \tanh(x_t W_{ih}^\top + b_{ih} + h_{t-1} W_{hh}^\top + b_{hh})$. The only difference is notation: PyTorch uses $W_{ih}$ (input-to-hidden) where we wrote $W_{xh}$, and stores separate biases for the input and hidden transformations.

nn.RNN returns two things: output contains the hidden state of the last layer at every timestep (shape with batch_first=True: batch, seq_len, hidden_size), while h_n contains the hidden state at only the last timestep, one per layer (shape: num_layers, batch, hidden_size, regardless of batch_first). For classification, we typically use h_n; for sequence-to-sequence tasks, we use output.
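A quick shape check makes the two return values easier to tell apart. The sketch below assumes a hypothetical batch of 5 sequences of length 7 and a 2-layer RNN, values chosen only so that every dimension is distinct.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=2, hidden_size=3, num_layers=2, batch_first=True)
x = torch.randn(5, 7, 2)  # (batch=5, seq_len=7, input_size=2)

with torch.no_grad():
    output, h_n = rnn(x)

print(output.shape)  # last layer's hidden state at every timestep
print(h_n.shape)     # final hidden state of each layer

# The last timestep of `output` is the final state of the top layer
same = torch.allclose(output[:, -1, :], h_n[-1], atol=1e-6)
print(same)
```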

Types of RNN Tasks

RNNs are flexible because you can choose which hidden states to use as output. This creates several common architectures, each suited to a different type of task:

Many-to-One

Process the entire sequence, then use only the final hidden state to produce a single output. This is what we just built — sentiment classification. The network reads all words and produces one sentiment score.

  • Sentiment analysis: sentence → positive/negative
  • Document classification: article → topic
  • Spam detection: email → spam/not-spam

One-to-Many

Start with a single input and generate a sequence of outputs. The input sets the initial hidden state, and the network produces one output per timestep.

  • Image captioning: image → “A cat sitting on a mat”
  • Music generation: seed note → melody

Many-to-Many (same length)

Produce an output at every timestep. Input and output sequences have the same length.

  • Part-of-speech tagging: each word → noun/verb/adjective
  • Named entity recognition: each word → person/location/organization/other
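A same-length many-to-many model differs from the sentiment classifier mainly in which hidden states reach the linear layer. The following is a minimal sketch, assuming a hypothetical `TaggerRNN` class with 4 tag classes and untrained weights; it applies the linear layer to `output` (all timesteps) instead of `h_n`.

```python
import torch
import torch.nn as nn

class TaggerRNN(nn.Module):
    """Hypothetical per-token tagger: one prediction per timestep."""

    def __init__(self, input_size, hidden_size, num_tags):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_tags)

    def forward(self, x):
        output, _ = self.rnn(x)   # (batch, seq_len, hidden_size)
        return self.fc(output)    # (batch, seq_len, num_tags)

model = TaggerRNN(input_size=2, hidden_size=3, num_tags=4)
x = torch.randn(1, 4, 2)          # one sentence of four 2D word vectors

with torch.no_grad():
    tag_logits = model(x)
    tags = tag_logits.argmax(dim=-1)  # predicted tag index per word

print(tag_logits.shape, tags.shape)
```

nn.Linear acts on the last dimension, so the same weights score every timestep without an explicit loop.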

Many-to-Many (different lengths — Encoder-Decoder)

An encoder RNN reads the input sequence and compresses it into a hidden state. A decoder RNN then generates an output sequence of potentially different length. This is the foundation of early machine translation systems.

  • Machine translation: “I love cats” → “J'aime les chats”
  • Text summarization: long article → short summary
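The encoder-decoder idea can be sketched with two small recurrent modules. This is an illustrative toy, not a translation system: the weights are untrained, the all-zero start vector and `max_steps = 6` are hypothetical choices, and the decoder simply feeds each output back in as its next input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.RNN(input_size=2, hidden_size=3, batch_first=True)
decoder_cell = nn.RNNCell(input_size=2, hidden_size=3)
to_output = nn.Linear(3, 2)   # maps decoder state to a 2D output vector

src = torch.randn(1, 4, 2)    # source sequence: 4 timesteps
max_steps = 6                 # hypothetical output length, differs from input

with torch.no_grad():
    _, h_n = encoder(src)     # compress the source into a hidden state
    h = h_n[-1]               # (batch, hidden_size), initial decoder state
    y = torch.zeros(1, 2)     # hypothetical start-of-sequence vector
    outputs = []
    for _ in range(max_steps):
        h = decoder_cell(y, h)   # one decoder step
        y = to_output(h)         # output becomes the next input
        outputs.append(y)
    generated = torch.stack(outputs, dim=1)

print(generated.shape)
```

The input has 4 timesteps and the output has 6, which is the defining feature of this architecture: the only link between them is the encoder's final hidden state.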
| Architecture | Input | Output | Example Task |
| --- | --- | --- | --- |
| Many-to-One | Sequence | Single value | Sentiment classification |
| One-to-Many | Single value | Sequence | Image captioning |
| Many-to-Many (synced) | Sequence | Same-length sequence | POS tagging |
| Many-to-Many (encoder-decoder) | Sequence | Different-length sequence | Translation |

What Lies Ahead: The Limits of Vanilla RNNs

The vanilla RNN we have built is elegant and powerful in principle. But in practice, it has a critical weakness: it struggles with long sequences.

Think about what happens when you process a 100-word sentence. The hidden state from word 1 must pass through 99 matrix multiplications (by $W_{hh}$) to influence the output at word 100. During backpropagation, gradients must flow backward through those same 99 multiplications, each also scaled by the derivative of tanh, which is at most 1. If the largest singular value of $W_{hh}$ is less than 1, the gradients shrink exponentially with distance: this is the vanishing gradient problem.
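The exponential shrinkage can be illustrated numerically. The sketch below is a simplification: it uses a random 3×3 matrix rescaled so that its largest singular value is 0.9 (an arbitrary choice), and it ignores the tanh derivative, which would only shrink the result further.

```python
import torch

torch.manual_seed(0)
W = torch.randn(3, 3)
# Rescale so the largest singular value (spectral norm) is 0.9
W = 0.9 * W / torch.linalg.matrix_norm(W, ord=2)

g = torch.ones(3)   # stand-in for the gradient arriving at the last timestep
norms = []
for _ in range(99):           # 99 backward steps, as in the 100-word example
    g = W.T @ g               # one step of backpropagation through the recurrence
    norms.append(g.norm().item())

print(f"after 1 step:   {norms[0]:.6f}")
print(f"after 99 steps: {norms[-1]:.2e}")
```

Each step multiplies the gradient norm by at most 0.9, so after 99 steps it is bounded by 0.9^99 times its starting value, a factor of roughly 3e-5.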

The result: vanilla RNNs often struggle in practice to learn dependencies longer than roughly 10-20 timesteps. For longer dependencies, like remembering the subject of a sentence when the verb is 50 words away, the signal fades toward zero.

Coming next: In Section 2, we will implement a complete RNN training loop from scratch, including backpropagation through time (BPTT). In Section 3, we will dissect the vanishing gradient problem mathematically and see exactly why gradients decay — setting the stage for LSTM and GRU in Chapter 17.

For now, the key takeaway is this: RNNs introduced the revolutionary idea that a neural network can have memory. By sharing weights across timesteps and maintaining a hidden state, a single small network can process sequences of any length, carrying information forward from the past to influence the present. This insight — that you can build a network with a feedback loop — is one of the most important ideas in deep learning.
