Why Sequences Need a New Architecture
Every neural network we have built so far (perceptrons, MLPs, and CNNs) shares a fundamental limitation: it processes each input independently. Feed the same image to a CNN twice, and you get the same output both times. There is no memory of what came before.
This works perfectly for images. A photo of a cat is a photo of a cat, regardless of which photo you looked at previously. But language, music, stock prices, and biological signals are fundamentally different: the meaning of each element depends on what came before it.
Consider the word “bank” in these two sentences:
- “The river bank is steep” — bank means the edge of a river
- “The savings bank is closed” — bank means a financial institution
A feedforward network that sees only the word “bank” has no way to distinguish these two meanings. It maps the same input to the same output every time. To resolve the ambiguity, the network needs to remember the words that came before — it needs memory.
Key Insight: Sequential data has a property that images do not: the order of elements carries information. A network that processes sequences must somehow carry information forward from earlier elements to later ones.
What Makes Data Sequential
Data is sequential when its elements have a meaningful order, and changing that order changes the meaning. Three properties define sequential data:
- Order dependence — Rearranging elements changes the meaning. “Dog bites man” and “Man bites dog” use the same words but mean very different things.
- Variable length — Sequences can have different lengths. A tweet has 280 characters; a novel has millions. The network must handle both.
- Long-range dependencies — Elements far apart can influence each other. In “The cat, which sat on the mat, was happy,” the verb “was” (singular) depends on “cat” (singular), not the closer “mat.”
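To make order dependence concrete, here is a minimal sketch using only the Python standard library. The helper name bag_of_words is hypothetical and chosen for illustration; it counts words and throws away their order, so it cannot tell the two sentences apart.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count words while discarding their order (illustrative helper)."""
    return Counter(sentence.lower().split())

a = "Dog bites man"
b = "Man bites dog"

# An order-blind representation sees the two sentences as identical...
same_counts = bag_of_words(a) == bag_of_words(b)
# ...while the ordered word lists are different.
same_order = a.lower().split() == b.lower().split()

print(same_counts)  # True
print(same_order)   # False
```

Any model that consumes only the word counts would have to give both sentences the same output, which is why a sequence model needs access to order.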
| Domain | Input Sequence | What Order Encodes |
|---|---|---|
| Natural Language | Words in a sentence | Grammar, meaning, intent |
| Time Series | Stock prices over days | Temporal trends, seasonality |
| Audio/Speech | Sound waveform samples | Phonemes, words, intonation |
| DNA/Protein | Nucleotide/amino acid chains | Gene structure, protein folding |
| Music | Notes over time | Melody, rhythm, harmony |
| Video | Frames over time | Motion, scene transitions |
The fundamental challenge is this: a feedforward network takes a fixed-size input and produces a fixed-size output. But sequences have variable length. A 4-word sentence and a 40-word sentence must both go through the same network. We need an architecture that can process any number of elements, one at a time, while maintaining a running summary of everything it has seen.
The Core Idea: Memory Through Recurrence
The solution is surprisingly elegant: give the network a hidden state, a vector that persists across timesteps. At each timestep, the network reads the current input and the previous hidden state, then produces a new hidden state. This new state is fed back in at the next timestep, alongside the next input.
This is recurrence: the output of one step feeds back as the input to the next step. The hidden state acts as the network's memory, carrying a compressed summary of everything it has seen so far.
Figure: RNN unrolled through time. Each hidden state $h_t$ depends on the previous state $h_{t-1}$.
The diagram above shows an RNN unrolled through time. Though it looks like four separate networks, it is actually the same network applied repeatedly. The same weight matrices are used at every timestep — this is called weight sharing.
The recurrence relation
At each timestep $t$, the RNN performs two operations:
- Combine the current input with the previous hidden state using learned weight matrices
- Squash the result through a nonlinearity (tanh) to produce the new hidden state
Written as a single equation:
$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
And optionally, at each step (or just the final step), we can produce an output:
$$y_t = W_{hy} h_t + b_y$$
Let's break down every symbol:
| Symbol | Shape | Meaning |
|---|---|---|
| x_t | (input_size,) | Input vector at timestep t (e.g., a word embedding) |
| h_{t-1} | (hidden_size,) | Hidden state from the previous timestep — the "memory" |
| h_t | (hidden_size,) | New hidden state after processing x_t |
| W_{xh} | (hidden_size, input_size) | Weight matrix transforming input to hidden space |
| W_{hh} | (hidden_size, hidden_size) | Weight matrix transforming previous hidden state (the recurrence) |
| b_h | (hidden_size,) | Bias vector for the hidden state |
| W_{hy} | (output_size, hidden_size) | Weight matrix projecting hidden state to output |
| b_y | (output_size,) | Bias vector for the output |
| tanh | element-wise | Squashes values to (-1, +1), preventing unbounded growth |
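The symbols in the table map directly onto a few lines of NumPy. The sketch below is illustrative only: the sizes (input_size=4, hidden_size=3, output_size=1), the random weights, and the function names rnn_step and rnn_output are assumptions made for this example, not values taken from the text.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One application of h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def rnn_output(h_t, W_hy, b_y):
    """Optional readout y_t = W_hy h_t + b_y."""
    return W_hy @ h_t + b_y

# Hypothetical sizes and random weights, chosen only to show the shapes.
input_size, hidden_size, output_size = 4, 3, 1
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_hy = rng.normal(scale=0.5, size=(output_size, hidden_size))
b_y = np.zeros(output_size)

x_t = rng.normal(size=input_size)   # stand-in for a word embedding
h_prev = np.zeros(hidden_size)      # no memory yet at the first step

h_t = rnn_step(x_t, h_prev, W_xh, W_hh, b_h)
y_t = rnn_output(h_t, W_hy, b_y)

print(h_t.shape, y_t.shape)  # (3,) (1,)
```

Because tanh is applied element-wise, every entry of h_t lies strictly between -1 and +1 regardless of how large the weights are.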
Weight sharing: the power of recurrence
A critical feature of RNNs is that $W_{xh}$, $W_{hh}$, and $W_{hy}$ are the same at every timestep. Whether the network is processing the first word or the hundredth, it uses identical weights. This has three benefits:
- Parameter efficiency — A 100-word sequence uses the same number of parameters as a 5-word sequence. The model size does not grow with sequence length.
- Generalization — Patterns learned at one position transfer to others. If the network learns that “not” negates the next word, it applies this everywhere.
- Variable length handling — The same model processes sequences of any length. Simply run more iterations of the recurrence.
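The three benefits above can be checked with a small sketch. The sizes, random weights, and the helper name run_rnn are assumptions for illustration; the point is that one fixed set of parameters handles a 5-step and a 100-step sequence alike.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh, b_h):
    """Apply the same weights at every timestep and return the final hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

# Hypothetical sizes and weights for illustration.
rng = np.random.default_rng(1)
input_size, hidden_size = 4, 3
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

# The parameter count depends only on the layer sizes: 4*3 + 3*3 + 3.
n_params = W_xh.size + W_hh.size + b_h.size

short_seq = rng.normal(size=(5, input_size))
long_seq = rng.normal(size=(100, input_size))

h_short = run_rnn(short_seq, W_xh, W_hh, b_h)
h_long = run_rnn(long_seq, W_xh, W_hh, b_h)

print(n_params)                     # 24
print(h_short.shape, h_long.shape)  # (3,) (3,)
```

Handling the longer sequence costs more loop iterations, not more parameters.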
Inside the RNN Cell: The Mathematics
Let's build intuition for what happens inside the RNN cell at a single timestep. We will use concrete numbers to trace through every operation.
The two matrix multiplications
The RNN cell receives two inputs: the current word embedding $x_t$ and the previous hidden state $h_{t-1}$. It applies two separate weight matrices:
- $W_{xh} x_t$: transforms the current input into the hidden space. This answers: “What does this word contribute to the memory?”
- $W_{hh} h_{t-1}$: transforms the previous memory. This answers: “What should the network remember from before?”
These two contributions are added together with the bias $b_h$, and then tanh squashes the sum into the range $(-1, 1)$. The intuition is:
The new hidden state is a blend of what the network just saw and what it remembers. The weight matrices control the balance. Large values in $W_{hh}$ mean the network holds onto old information; large values in $W_{xh}$ mean the network pays more attention to the new input.
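The balance between memory and new input is easiest to see with a single hidden unit. The following is a toy sketch under stated assumptions: scalar weights w_xh and w_hh, zero bias, and hand-picked values for the input and the previous state. The helper name scalar_step is hypothetical.

```python
import numpy as np

def scalar_step(x_t, h_prev, w_xh, w_hh):
    """Scalar version of the recurrence with the bias set to zero."""
    return np.tanh(w_xh * x_t + w_hh * h_prev)

h_prev = 0.8   # the memory is positive
x_t = -0.5     # the new input pulls in the negative direction

# Large recurrent weight, small input weight: the old memory dominates.
memory_heavy = scalar_step(x_t, h_prev, w_xh=0.1, w_hh=2.0)

# Small recurrent weight, large input weight: the new input dominates.
input_heavy = scalar_step(x_t, h_prev, w_xh=2.0, w_hh=0.1)

print(memory_heavy > 0)  # True: the state stays on the side of the memory
print(input_heavy < 0)   # True: the state flips toward the new input
```

In a real RNN the weights are matrices and are learned during training, so the balance can differ for each hidden unit.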
Worked example: processing “love” at t=1
Let's trace through one step in detail. At $t=1$, the RNN has already processed “I” and has hidden state $h_0 = [0.4621, 0.1974, -0.0997]$. Now it receives the embedding for “love”, $x_1$.
Step 1: transform the input by computing $W_{xh} x_1$.
Step 2: transform the previous memory by computing $W_{hh} h_0$.
Step 3: add the two results and the bias, then apply tanh:
$$h_1 = \tanh(W_{xh} x_1 + W_{hh} h_0 + b_h) = [0.1539, 0.4707, 0.3618]$$
The new hidden state $h_1$ now encodes a compressed representation of “I love”. Notice how it differs from $h_0$: the memory has been updated to incorporate the emotional word “love.” In particular, hidden unit 1 jumped from 0.1974 to 0.4707, which suggests it may respond to emotional content.
Step-by-Step: An RNN Processing a Sentence
Now let's watch the complete forward pass. The RNN processes “I love this movie” one word at a time, updating its hidden state at each timestep.
Notice the key pattern: each hidden state is influenced by all previous words. By $t=3$, the hidden state $h_3$ is not just about “movie”; it carries traces of “I,” “love,” and “this” as well. This is the RNN's memory in action.
Building an RNN from Scratch in Python
Let's implement the complete RNN forward pass in pure Python with NumPy. We will build a tiny sentiment classifier that processes “I love this movie” and outputs a probability of positive sentiment.
The entire forward pass boils down to a single line inside a loop: h = np.tanh(W_xh @ x_t + W_hh @ h + b_h). This is the beauty of recurrence — one equation, applied repeatedly, gives the network the ability to process sequences of any length while maintaining memory.
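One way the full forward pass might look is sketched below. The embeddings and weights here are random placeholders chosen for illustration (an assumption, since the section's actual values are not shown), so the printed hidden states will not match the numbers quoted elsewhere in this section, and the untrained probability carries no meaning.

```python
import numpy as np

rng = np.random.default_rng(42)
input_size, hidden_size = 4, 3
tokens = ["I", "love", "this", "movie"]

# Hypothetical embeddings and weights: random stand-ins, not trained values.
embeddings = {tok: rng.normal(scale=0.5, size=input_size) for tok in tokens}
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_hy = rng.normal(scale=0.5, size=(1, hidden_size))
b_y = np.zeros(1)

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

h = np.zeros(hidden_size)   # empty memory before the first word
history = []                # hidden state after each token
for tok in tokens:
    x_t = embeddings[tok]
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # the recurrence
    history.append(h.copy())

# Many-to-one readout: only the final hidden state feeds the classifier.
logit = W_hy @ h + b_y
p_positive = float(sigmoid(logit)[0])

for tok, h_t in zip(tokens, history):
    print(f"{tok:>6}: {np.round(h_t, 4)}")
print(f"P(positive) = {p_positive:.4f}")
```

Training would adjust the weights so that the final probability reflects the sentiment; this sketch covers only the forward pass.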
Tracing the hidden state evolution
| Timestep | Token | h[0] | h[1] | h[2] | Interpretation |
|---|---|---|---|---|---|
| t=0 | "I" | 0.4621 | 0.1974 | -0.0997 | Subject pronoun detected |
| t=1 | "love" | 0.1539 | 0.4707 | 0.3618 | Emotional word → h[1] jumps |
| t=2 | "this" | 0.0712 | 0.6521 | 0.2778 | Demonstrative reinforces context |
| t=3 | "movie" | 0.5597 | 0.4605 | -0.0660 | Object noun → h[0] rises |
The Same RNN in PyTorch
PyTorch provides nn.RNN, which implements everything we just built — the weight matrices, the recurrence loop, and the tanh activation — in a single optimized module. Let's see how the same sentiment classifier looks with PyTorch:
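A minimal sketch of such a classifier follows. The class name SentimentRNN, the layer sizes, and the random input tensor are assumptions for illustration; nn.RNN, nn.Linear, and torch.sigmoid are standard PyTorch calls. The model is untrained, so only the shapes are meaningful.

```python
import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    """Many-to-one classifier: nn.RNN followed by a linear readout (illustrative)."""

    def __init__(self, input_size=4, hidden_size=3):
        super().__init__()
        # batch_first=True means inputs are shaped (batch, seq_len, input_size).
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        output, h_n = self.rnn(x)      # the recurrence loop runs inside nn.RNN
        logit = self.fc(h_n[-1])       # final hidden state of the last layer
        return torch.sigmoid(logit), output, h_n

torch.manual_seed(0)
model = SentimentRNN()

# One sentence of 4 tokens, each a 4-dimensional placeholder embedding.
x = torch.randn(1, 4, 4)

with torch.no_grad():
    prob, output, h_n = model(x)

print(prob.shape, output.shape, h_n.shape)
# torch.Size([1, 1]) torch.Size([1, 4, 3]) torch.Size([1, 1, 3])
```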
NumPy vs. PyTorch: what changed?
| Aspect | NumPy (from scratch) | PyTorch (nn.RNN) |
|---|---|---|
| Weight creation | Manual np.array | Automatic (nn.RNN creates them) |
| Recurrence loop | Explicit for loop | Built into nn.RNN forward() |
| Backpropagation | Not implemented | Automatic (autograd) |
| GPU support | No | Yes (.to('cuda')) |
| Batch processing | Not implemented | Built-in (batch dimension) |
| Code lines | ~20 lines | ~10 lines |
The math is identical. PyTorch's nn.RNN uses the same equation internally: $h_t = \tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{t-1} + b_{hh})$. The only difference is notation: PyTorch uses $W_{ih}$ (input-to-hidden) where we wrote $W_{xh}$, and stores separate biases for the input and hidden transformations.
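The equivalence can be checked directly by reading the parameters out of a single-layer nn.RNN and running the recurrence by hand. This is a sketch with assumed sizes and random input; weight_ih_l0, weight_hh_l0, bias_ih_l0, and bias_hh_l0 are the parameter names PyTorch uses for the first layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)
x = torch.randn(1, 4, 4)   # (batch, seq_len, input_size), placeholder data

with torch.no_grad():
    output, h_n = rnn(x)

    # Parameters of layer 0 as stored by PyTorch.
    W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
    b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0

    # Manual recurrence with the same weights, starting from a zero state.
    h = torch.zeros(3)
    for t in range(x.shape[1]):
        h = torch.tanh(W_ih @ x[0, t] + b_ih + W_hh @ h + b_hh)

manual_matches = torch.allclose(h, h_n[0, 0], atol=1e-5)
last_step_matches = torch.allclose(output[0, -1], h_n[0, 0], atol=1e-5)

print(manual_matches, last_step_matches)
```

The second comparison shows that, for a single-layer unidirectional RNN, the last timestep of the per-step outputs is the same vector as the final hidden state.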
output contains the hidden state at every timestep (shape: batch, seq_len, hidden_size when batch_first=True), while h_n contains the hidden state at only the last timestep (shape: num_layers, batch, hidden_size). For classification, we typically use h_n; for sequence-to-sequence tasks, we use output.
Types of RNN Tasks
RNNs are flexible because you can choose which hidden states to use as output. This creates several common architectures, each suited to a different type of task:
Many-to-One
Process the entire sequence, then use only the final hidden state to produce a single output. This is what we just built — sentiment classification. The network reads all words and produces one sentiment score.
- Sentiment analysis: sentence → positive/negative
- Document classification: article → topic
- Spam detection: email → spam/not-spam
One-to-Many
Start with a single input and generate a sequence of outputs. The input sets the initial hidden state, and the network produces one output per timestep.
- Image captioning: image → “A cat sitting on a mat”
- Music generation: seed note → melody
Many-to-Many (same length)
Produce an output at every timestep. Input and output sequences have the same length.
- Part-of-speech tagging: each word → noun/verb/adjective
- Named entity recognition: each word → person/location/organization/other
Many-to-Many (different lengths — Encoder-Decoder)
An encoder RNN reads the input sequence and compresses it into a hidden state. A decoder RNN then generates an output sequence of potentially different length. This is the foundation of early machine translation systems.
- Machine translation: “I love cats” → “J'aime les chats”
- Text summarization: long article → short summary
| Architecture | Input | Output | Example Task |
|---|---|---|---|
| Many-to-One | Sequence | Single value | Sentiment classification |
| One-to-Many | Single value | Sequence | Image captioning |
| Many-to-Many (synced) | Sequence | Same-length sequence | POS tagging |
| Many-to-Many (encoder-decoder) | Sequence | Different-length sequence | Translation |
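The difference between the many-to-one and synced many-to-many rows comes down to which hidden states are passed to the output layer. The sketch below uses assumed sizes and random weights; the encoder-decoder case is not implemented here, though its encoder half would be the same loop, with the final state handed to a second RNN.

```python
import numpy as np

# Hypothetical sizes and random weights for illustration.
rng = np.random.default_rng(0)
seq_len, input_size, hidden_size, n_classes = 5, 4, 3, 2
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_hy = rng.normal(scale=0.5, size=(n_classes, hidden_size))
b_y = np.zeros(n_classes)

xs = rng.normal(size=(seq_len, input_size))

# Run the recurrence once and keep every hidden state.
h = np.zeros(hidden_size)
states = []
for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    states.append(h)
states = np.stack(states)                 # (seq_len, hidden_size)

# Many-to-one: read out only the final hidden state.
many_to_one = W_hy @ states[-1] + b_y     # (n_classes,)

# Many-to-many (synced): read out every hidden state.
many_to_many = states @ W_hy.T + b_y      # (seq_len, n_classes)

print(many_to_one.shape, many_to_many.shape)  # (2,) (5, 2)
```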
What Lies Ahead: The Limits of Vanilla RNNs
The vanilla RNN we have built is elegant and powerful in principle. But in practice, it has a critical weakness: it struggles with long sequences.
Think about what happens when you process a 100-word sentence. The hidden state from word 1 must pass through 99 matrix multiplications (by $W_{hh}$) to influence the output at word 100. During backpropagation, gradients must flow backward through those same 99 multiplications. If the eigenvalues of $W_{hh}$ all have magnitude less than 1, the gradients shrink exponentially, and the tanh derivative, which is at most 1, only shrinks them further. This is the vanishing gradient problem.
The result: in practice, vanilla RNNs tend to retain useful information over only a short span, often put at roughly 10-20 timesteps. For longer dependencies, like remembering the subject of a sentence when the verb is 50 words away, the signal fades toward zero.
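The exponential decay can be reproduced in a few lines. This sketch makes two simplifying assumptions, both labeled in the code: the recurrent matrix is a scaled orthogonal matrix whose singular values all equal 0.9, and the tanh derivative is left out (including it would shrink the gradient further, since it never exceeds 1).

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 3, 99

# Assumption: W_hh = 0.9 * Q with Q orthogonal, so each multiplication
# shrinks the gradient norm by exactly a factor of 0.9.
Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))
W_hh = 0.9 * Q

# Start with a unit-norm gradient at the last timestep.
grad = np.ones(hidden_size) / np.sqrt(hidden_size)
norms = [np.linalg.norm(grad)]

# Simplified backward pass: multiply by W_hh^T once per timestep
# (the tanh derivative is omitted in this sketch).
for _ in range(steps):
    grad = W_hh.T @ grad
    norms.append(np.linalg.norm(grad))

print(f"after 10 steps: {norms[10]:.3e}")
print(f"after 50 steps: {norms[50]:.3e}")
print(f"after 99 steps: {norms[-1]:.3e}")
```

After 99 steps the norm equals 0.9 raised to the 99th power, which is on the order of 1e-5, so the gradient reaching word 1 is far too small to drive learning.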
Coming next: In Section 2, we will implement a complete RNN training loop from scratch, including backpropagation through time (BPTT). In Section 3, we will dissect the vanishing gradient problem mathematically and see exactly why gradients decay — setting the stage for LSTM and GRU in Chapter 17.
For now, the key takeaway is this: RNNs introduced the revolutionary idea that a neural network can have memory. By sharing weights across timesteps and maintaining a hidden state, a single small network can process sequences of any length, carrying information forward from the past to influence the present. This insight — that you can build a network with a feedback loop — is one of the most important ideas in deep learning.