Chapter 1

The Evolution of Sequence Modeling

Introduction to Transformers

Introduction

Before we dive into transformers, it's essential to understand the journey that led to their creation. This section traces the evolution of sequence modeling in deep learning, from the earliest recurrent neural networks to the challenges that eventually motivated the development of attention mechanisms and transformers.

Understanding this history will help you appreciate why transformers represent such a fundamental breakthrough, and why they've become the dominant architecture for nearly all modern AI systems.

The Road to Transformers

20 years of innovation in sequence modeling

  • 1997: LSTM (Hochreiter & Schmidhuber). Gating mechanisms to control information flow, addressing vanishing gradients.
  • 2014: Seq2Seq (Sutskever et al.). Encoder-decoder architecture for machine translation.
  • 2014: GRU (Cho et al.). Simplified gating with fewer parameters, comparable performance.
  • 2014: Bahdanau Attention (Bahdanau, Cho & Bengio). First attention mechanism: direct connections between encoder and decoder.
  • 2015: Luong Attention (Luong et al.). Simplified dot-product attention, global vs local variants.
  • 2017: Transformer (Vaswani et al.). "Attention Is All You Need" removes recurrence entirely.

Key insight: Each innovation addressed a specific limitation of its predecessor. LSTMs fixed vanishing gradients, attention fixed the bottleneck, and Transformers removed sequential processing entirely.

Why this backstory matters

Knowing what RNNs and LSTMs struggle with (long-range context, parallelism) makes it obvious why attention was a breakthrough and helps you debug models that still mix these components (e.g., hybrid speech models or time-series stacks).

1.1 The Sequence Modeling Problem

What is Sequence Modeling?

Sequence modeling refers to the task of processing and generating sequential data—data where the order of elements matters. Examples include:

  • Natural Language: Sentences are sequences of words where order changes meaning
    • "The dog bit the man" vs "The man bit the dog"
  • Time Series: Stock prices, weather readings, sensor data
  • Audio: Speech as sequences of acoustic features
  • Video: Sequences of image frames
  • DNA: Sequences of nucleotides (A, T, G, C)

Key Challenges in Sequence Modeling

  1. Variable Length: Sequences can have different lengths
  2. Order Dependency: The meaning depends on element positions
  3. Long-Range Dependencies: Elements far apart may be related
  4. Context Sensitivity: The meaning of an element depends on surrounding elements

Typical failure modes

Subtle word order swaps invert meaning; long-distance references (coreference, long time-series drift) get lost; repeated patterns make models overfit to the most recent signal instead of the right one.

Word Representations Before Transformers

Before diving into RNNs, it's worth understanding how words were represented. This context explains why contextual representations (from RNNs and later Transformers) were such a breakthrough.

Static Word Embeddings

| Method   | Year    | Key Idea                         | Limitation                       |
|----------|---------|----------------------------------|----------------------------------|
| One-hot  | Classic | Sparse vector, 1 at word index   | No similarity info, huge vectors |
| Word2Vec | 2013    | Predict neighbors, dense vectors | One vector per word (no context) |
| GloVe    | 2014    | Global co-occurrence statistics  | Same: one vector per word        |
| FastText | 2016    | Subword n-grams                  | Better OOV, still static         |

The fundamental problem: In Word2Vec and GloVe, the word "bank" has exactly ONE vector, whether it means a financial institution or a river bank. Context is ignored.

🐍static_embeddings.py

# Word2Vec/GloVe: Static embeddings
embeddings["bank"]  # Same vector for ALL uses of "bank"

# Sentences with different meanings:
"I went to the bank to deposit money"  # financial
"The river bank was steep"             # geographical
"Bank left in the turn"                # aviation/driving

# All three use THE SAME embedding for "bank"!

The Shift to Contextual Embeddings

The realization that word meaning depends on context led to:

  1. RNN-based representations: Hidden states encode context (but limited by sequential processing)
  2. ELMo (2018): BiLSTM-based contextual embeddings—same word gets different vectors in different sentences
  3. Transformers (BERT, GPT): Attention-based contextual embeddings—each token's representation depends on ALL other tokens

The key insight: A word's meaning isn't fixed—it emerges from its context. Transformers make computing context-dependent representations efficient and parallelizable.

Real-World Failure Examples

These aren't hypothetical—these are the kinds of errors that plagued pre-Transformer NLP systems:

| Task | Failure Example | Why RNNs Struggle |
|------|-----------------|-------------------|
| Translation | "The old man the boats" → mistranslated as elderly person | Garden-path sentences require reanalysis after "man" is seen as a verb, but RNNs commit early |
| Coreference | "The trophy doesn't fit in the suitcase because it is too big" → wrong "it" resolution | Must track which noun "it" refers to across 10+ tokens |
| Summarization | First paragraph of article ignored in summary | By the end, information from the beginning has decayed |
| Question Answering | Answer found in first sentence, question asks about it at the end | Gradient from answer position can't reach the evidence |
| Sentiment | "I thought the movie would be bad, but it was actually amazing" → negative | Initial negative words dominate if final context is lost |

The pattern: Whenever the "answer" or critical information is far from where it's needed, RNNs fail. This isn't a bug in specific implementations—it's a fundamental architectural limitation.


1.2 Recurrent Neural Networks (RNNs)

The RNN Architecture

RNNs were the first neural architecture designed specifically for sequences. The key idea: maintain a hidden state that gets updated as you process each element.

h_t = f(W_h \cdot h_{t-1} + W_x \cdot x_t + b)

Where:

  • h_t is the hidden state at time t
  • x_t is the input at time t
  • W_h, W_x are weight matrices
  • f is an activation function (typically tanh)

Micro Example: two time steps

What to notice

Each step mixes prior hidden state with the new input; the tanh squashes values, which is why saturation hurts gradient flow in long sequences.
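
A minimal numeric sketch of those two steps, with a scalar hidden state and made-up weights (all values here are illustrative, not from a trained model):

```python
import numpy as np

# Scalar RNN step: h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)
W_h, W_x, b = 0.8, 0.5, 0.0   # illustrative weights
h0 = 0.0                       # initial hidden state

# Step 1: mix the initial state with the first input
x1 = 1.0
h1 = np.tanh(W_h * h0 + W_x * x1 + b)   # tanh(0.5) ≈ 0.462

# Step 2: mix the previous hidden state with the second input
x2 = -2.0
h2 = np.tanh(W_h * h1 + W_x * x2 + b)   # tanh(-0.630) ≈ -0.558

# The local gradient factor at step 2 is the tanh derivative, 1 - h2**2;
# the closer |h2| gets to 1 (saturation), the closer this factor gets to 0.
print(h1, h2, 1 - h2**2)
```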

Visual Representation

RNN Unrolled Through Time

Each hidden state h_t depends on the previous state h_{t-1}.

[Diagram: the RNN unrolled across time steps t = 1…4. Inputs x_1…x_4 feed hidden states h_1…h_4, each producing an output y_t; information flows sequentially from left to right.]

The Sequential Processing Bottleneck

Critical Limitation: RNNs process sequences one element at a time, in order.

  • To compute h_3, you must first compute h_1 and h_2
  • No parallelization possible during training or inference
  • Training time scales linearly with sequence length
  • Cannot leverage modern GPU parallelism effectively

For a sequence of length n:

  • RNN: O(n) sequential operations
  • Transformer (spoiler): O(1) sequential steps, with all positions processed in parallel

Mental model: telephone game

Information passes token by token; by the time it reaches the end, details may be distorted or forgotten. The longer the chain, the harder it is to keep fidelity.

1.3 The Vanishing and Exploding Gradient Problem

What Happens During Backpropagation?

When training RNNs, gradients must flow backward through time. For a sequence of length T:

\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \times \frac{\partial h_T}{\partial h_{T-1}} \times \cdots \times \frac{\partial h_2}{\partial h_1}

This involves multiplying many Jacobian matrices together.

Gradient Flow Through Time (BPTT)

Watch how gradients change as they flow backward through the network. With |W_h| = 0.5, each backward step halves the gradient:

∂L/∂h_6 = 1.000 → ∂L/∂h_5 = 0.500 → ∂L/∂h_4 = 0.250 → ∂L/∂h_3 = 0.125 → ∂L/∂h_2 = 0.063 → ∂L/∂h_1 = 0.031

Vanishing Gradient Problem

With |W_h| = 0.5, gradients shrink exponentially: 1.0 → 0.5 → 0.25 → ... → ≈ 0.
Effect: Early layers learn extremely slowly or not at all. Long-range dependencies are forgotten.

Each step multiplies by ∂h_t/∂h_{t-1} ≈ W_h · f'(·), so after T steps: gradient ∝ |W|^T

Why Does This Happen Mechanically?

Recall the RNN recurrence:

h_t = \tanh(W_h h_{t-1} + W_x x_t)

During backpropagation, the derivative at each step is:

\frac{\partial h_t}{\partial h_{t-1}} = W_h^T \cdot \text{diag}(1 - h_t^2)

Two problems immediately appear:

Problem 1: The tanh Derivative

The derivative of tanh is:

\tanh'(x) = 1 - \tanh^2(x)

This derivative is always ≤ 1 and often much smaller:

| Activation level | tanh(x) | Derivative tanh'(x) |
|------------------|---------|---------------------|
| Saturated high   | 0.99    | 1 − 0.99² = 0.02    |
| Moderately high  | 0.80    | 1 − 0.64 = 0.36     |
| Near zero        | 0.20    | 1 − 0.04 = 0.96     |

Saturation kills gradients

When RNN activations saturate (tanh ≈ ±1), the derivative approaches zero. This is extremely common in practice, especially for long sequences where hidden states accumulate information.
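
Compounding those per-step derivative factors shows why saturation is fatal: even a moderately saturated activation leaves almost no gradient after a few dozen steps. A quick sketch using the values from the table:

```python
# Per-step tanh derivative is 1 - tanh(x)^2; plug in the activation
# levels from the table and compound the factor over 20 steps.
for h in [0.99, 0.80, 0.20]:
    factor = 1 - h**2
    print(f"tanh(x) = {h}: per-step factor {factor:.2f}, after 20 steps ≈ {factor**20:.2e}")
```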

Problem 2: Weight Matrix Magnitudes

Even small deviations from 1 create exponential effects over time:

| Multiplier | After 30 steps | After 50 steps | After 100 steps |
|------------|----------------|----------------|-----------------|
| \|W\| = 0.9  | 0.042  | 0.005  | ≈ 0    |
| \|W\| = 0.95 | 0.215  | 0.077  | 0.006  |
| \|W\| = 1.05 | 4.32   | 11.47  | 131.5  |
| \|W\| = 1.1  | 17.45  | 117.39 | 13,780 |

The core insight

It's like repeatedly multiplying by 0.9 or 1.1. A tiny deviation from 1 is enough to kill or blow up gradients: 0.9^{50} \approx 0.005 while 1.1^{50} \approx 117.

Vanishing Gradients: Symptoms

If the weight matrices have eigenvalues < 1, gradients shrink exponentially:

  • Loss decreases extremely slowly or plateaus early
  • Gradients become literally 0.0 in early time steps
  • Early positions receive almost no learning signal
  • Long-term dependencies are impossible to learn

Real-world example: In "The cat that sat on the mat that was red was happy", the verb "was" must agree with "cat" (singular). But with vanishing gradients, the model cannot learn this dependency—the gradient from "was" never reaches "cat".

The subtle danger

Vanishing gradients are more dangerous than exploding because they're silent. Your model trains without errors, but simply fails to learn long-range patterns. You might not notice until evaluation.

Exploding Gradients: Symptoms

If eigenvalues > 1, gradients grow exponentially:

  • Loss suddenly becomes NaN or infinity
  • Weights become enormously large (1e⁸, 1e¹⁵, ...)
  • Training is unstable—loss oscillates wildly
  • Model outputs become nonsensical after a few steps

This happens because the weight update becomes massive:

W_{\text{new}} = W - \eta \cdot \underbrace{\frac{\partial L}{\partial W}}_{\text{huge!}} \rightarrow \text{weights shoot to } \pm\infty

Numerical Deep Dive

See It In Code: Gradient Decay Visualization

Here's a minimal PyTorch example that demonstrates gradient vanishing in a simple RNN. Run this to see how gradients decay:

🐍gradient_decay_demo.py

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

class SimpleRNN(nn.Module):
    def __init__(self, hidden_size=64):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x, hidden=None):
        out, hidden = self.rnn(x, hidden)
        return self.fc(out[:, -1, :]), hidden

# Create a sequence and track gradients at each position
seq_length = 100
model = SimpleRNN()
x = torch.randn(1, seq_length, 1, requires_grad=True)

# Forward pass
output, _ = model(x)
loss = output.sum()

# Backward pass - capture gradients
loss.backward()

# Measure gradient magnitude at each time step
# (This is a simplified proxy - actual gradient flow is more complex)
grad_norms = []
for t in range(seq_length):
    # Gradient of loss w.r.t. input at position t
    grad_norm = x.grad[0, t, 0].abs().item()
    grad_norms.append(grad_norm)

# Plot the decay
plt.figure(figsize=(10, 4))
plt.plot(range(seq_length), grad_norms)
plt.xlabel('Position in sequence (earlier → later)')
plt.ylabel('Gradient magnitude')
plt.title('Gradient Decay in RNN: Earlier positions receive weaker gradients')
plt.yscale('log')  # Log scale to see the exponential decay
plt.grid(True, alpha=0.3)
plt.savefig('gradient_decay.png')
print(f"Gradient at position 0: {grad_norms[0]:.6f}")
print(f"Gradient at position 99: {grad_norms[-1]:.6f}")
print(f"Decay ratio: {grad_norms[-1]/grad_norms[0]:.2f}x")  # how much larger the late gradient is

Expected output: The gradient at position 0 will be orders of magnitude smaller than at position 99, demonstrating the vanishing gradient problem visually.

Try this experiment

Modify the code to use an LSTM instead of RNN (nn.LSTM) and compare the gradient decay. You'll see that LSTMs maintain more uniform gradients—but the sequential bottleneck remains.

Practical Debugging: Detecting Gradient Problems

Here's how to diagnose gradient issues in your own RNN training:

Symptoms Checklist

| Symptom | Likely Cause | Quick Fix |
|---------|--------------|-----------|
| Loss stuck at initial value | Vanishing gradients | Reduce sequence length, check initialization |
| Loss becomes NaN after N steps | Exploding gradients | Add gradient clipping, reduce learning rate |
| Early tokens never learned | Vanishing + truncated BPTT | Increase BPTT window, use LSTM/GRU |
| Training unstable (oscillates) | Exploding gradients | Gradient clipping, smaller learning rate |
| Works on short sequences, fails on long | Vanishing gradients | Switch architecture (Transformer) |

Monitoring Code

🐍gradient_monitoring.py

def monitor_gradients(model, threshold_low=1e-7, threshold_high=1e3):
    """Call after loss.backward() to check gradient health."""
    total_norm = 0
    for name, param in model.named_parameters():
        if param.grad is not None:
            param_norm = param.grad.data.norm(2).item()
            total_norm += param_norm ** 2

            # Flag problematic gradients
            if param_norm < threshold_low:
                print(f"⚠️  VANISHING: {name} grad norm = {param_norm:.2e}")
            elif param_norm > threshold_high:
                print(f"🔥 EXPLODING: {name} grad norm = {param_norm:.2e}")

    total_norm = total_norm ** 0.5
    print(f"Total gradient norm: {total_norm:.4f}")
    return total_norm

# Usage in training loop:
loss.backward()
grad_norm = monitor_gradients(model)
if grad_norm > 10:
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

TensorBoard integration

For production training, log gradient norms to TensorBoard/W&B. A healthy training run shows stable gradient norms; sudden spikes or decay to zero indicate problems.

Why Transformers Don't Have This Problem

Transformers fundamentally avoid this issue because they don't multiply through time:

  • No recurrence: No chaining of derivatives over T steps
  • Direct connections: Each attention layer connects tokens directly via QK^T
  • Depth vs. sequence: A 24-layer Transformer has 24 gradient steps, not 512 like an RNN processing 512 tokens
  • Residual connections: Skip connections allow gradients to flow unchanged

This is one of the key reasons Transformers can handle sequences of thousands of tokens while RNNs struggle with hundreds.

Mitigations you should know

Gradient clipping (e.g., clip norm to 1.0), orthogonal/identity initialization for W_h, gating (LSTMs/GRUs), residual connections, and truncated BPTT (shorter backprop windows) help tame exploding or vanishing gradients—but they do not fix the sequential bottleneck.

Weight Initialization: The First Line of Defense

Proper initialization can delay (but not prevent) gradient problems. The goal: keep the eigenvalues of W_h close to 1.

Initialization Strategies for RNNs

| Strategy | How It Works | Best For |
|----------|--------------|----------|
| Xavier/Glorot | Scale by √(2 / (fan_in + fan_out)) | Feedforward layers, not RNNs |
| Orthogonal | W^T W = I (eigenvalues = 1) | RNN hidden-to-hidden weights |
| Identity | W = I (start as identity) | Simple RNNs, preserves gradients |
| LSTM bias init | Set forget gate bias to 1-2 | Encourages remembering early |

Why Orthogonal Initialization Works

For a matrix WW to preserve gradient norms, we want:

\|W^T g\| = \|g\| \quad \text{(gradient magnitude preserved)}

Orthogonal matrices satisfy W^T W = I, which means \|W^T g\|^2 = g^T W W^T g = g^T g = \|g\|^2. This preserves gradient norms exactly—at least initially.

🐍orthogonal_init.py

import torch
import torch.nn as nn

# Method 1: PyTorch built-in
rnn = nn.RNN(input_size=64, hidden_size=128)
nn.init.orthogonal_(rnn.weight_hh_l0)  # Hidden-to-hidden weights

# Method 2: Manual orthogonal initialization
def orthogonal_init(shape):
    """Create an orthogonal matrix of given shape."""
    flat_shape = (shape[0], max(1, shape[1] if len(shape) > 1 else 1))
    a = torch.randn(flat_shape)
    q, r = torch.linalg.qr(a)
    # Fix the signs of the diagonal of r
    d = torch.diag(r)
    ph = d.sign()
    q *= ph
    return q[:shape[0], :shape[1]] if len(shape) > 1 else q[:shape[0], 0]

# Method 3: LSTM forget gate bias trick
lstm = nn.LSTM(input_size=64, hidden_size=128)
# Set forget gate bias to 1.0 (encourages remembering)
# Bias is stored as [input_gate, forget_gate, cell_gate, output_gate]
nn.init.constant_(lstm.bias_ih_l0[128:256], 1.0)
nn.init.constant_(lstm.bias_hh_l0[128:256], 1.0)

The catch

Orthogonal initialization only helps at the start of training. As weights update, they drift away from orthogonality. The gradient problem eventually returns for long sequences—which is why architectural solutions (LSTM gates, attention) are needed.

Truncated Backpropagation Through Time (TBPTT)

In practice, full backpropagation through an entire sequence is often infeasible. Truncated BPTT is a workaround that limits how far back gradients flow.

How It Works

Instead of backpropagating through all T time steps, we split the sequence into chunks and backpropagate only within each chunk:

\text{Full BPTT: } \frac{\partial L}{\partial W} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W} \quad \text{(through all } T \text{ steps)}

\text{Truncated BPTT}(k_1, k_2): \text{Forward } k_1 \text{ steps, backprop } k_2 \text{ steps}

| Parameter | Typical Value | Effect |
|-----------|---------------|--------|
| k₁ (forward steps) | 35-100 | How many tokens to process before an update |
| k₂ (backward steps) | 35-100 | How far back gradients flow |
| k₁ = k₂ | Common choice | Simpler implementation, balanced updates |

The Trade-off

| k₂ Value | Memory | Long-Range Learning | Training Speed |
|----------|--------|---------------------|----------------|
| Small (20-35) | Low | Poor - can't learn distant dependencies | Fast |
| Medium (50-100) | Moderate | Okay - captures paragraph-level context | Moderate |
| Large (200+) | High | Better - but still truncated | Slow |
| Full sequence | Very High | Best (in theory) - but gradients vanish anyway | Very Slow |

The fundamental limitation

Even with large truncation windows, you're still limited by the window size. A document with 1000 tokens and k₂ = 100 means the first 900 tokens can never directly influence learning of the last token. This is why LSTMs and GRUs struggle with long documents regardless of truncation settings.

🐍truncated_bptt.py

# Typical PyTorch truncated BPTT pattern
k1, k2 = 50, 50  # forward and backward window sizes
hidden = None

for i in range(0, seq_len, k1):
    # Forward k1 steps
    chunk = sequence[i:i+k1]
    output, hidden = rnn(chunk, hidden)

    # Compute loss for this chunk
    loss = criterion(output, targets[i:i+k1])

    # Backprop only k2 steps (gradient flow stops at the detached hidden state)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # CRITICAL: Detach hidden state to truncate gradient flow
    hidden = hidden.detach()  # Gradient stops here!

When to use larger windows

If your task requires understanding context beyond ~100 tokens (legal docs, scientific papers, novels), RNNs with truncated BPTT will struggle. This is a strong signal to consider Transformers or other architectures.

1.4 LSTMs and GRUs: Addressing Vanishing Gradients

Long Short-Term Memory (LSTM)

Hochreiter & Schmidhuber (1997) introduced LSTMs with a key innovation: gating mechanisms that control information flow.

LSTM Components

  1. Cell State (C_t): The "memory highway" that can carry information unchanged
  2. Forget Gate (f_t): Decides what to remove from cell state
  3. Input Gate (i_t): Decides what new information to store
  4. Output Gate (o_t): Decides what to output

\begin{aligned} f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) & \text{(Forget gate)} \\ i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) & \text{(Input gate)} \\ \tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) & \text{(Candidate values)} \\ C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t & \text{(New cell state)} \\ o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) & \text{(Output gate)} \\ h_t &= o_t \odot \tanh(C_t) & \text{(Hidden state)} \end{aligned}
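
The six equations map one-to-one onto code. Here is a from-scratch sketch of a single LSTM step (NumPy, tiny hypothetical dimensions; each weight matrix acts on the concatenation [h_{t-1}, x_t] exactly as in the formulas):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step, following the six equations above."""
    concat = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ concat + b_f)           # forget gate
    i_t = sigmoid(W_i @ concat + b_i)           # input gate
    C_tilde = np.tanh(W_C @ concat + b_C)       # candidate values
    C_t = f_t * C_prev + i_t * C_tilde          # new cell state
    o_t = sigmoid(W_o @ concat + b_o)           # output gate
    h_t = o_t * np.tanh(C_t)                    # hidden state
    return h_t, C_t

# Tiny illustrative sizes: input dim 3, hidden dim 4
rng = np.random.default_rng(0)
d_in, d_h = 3, 4
Ws = [rng.standard_normal((d_h, d_h + d_in)) * 0.1 for _ in range(4)]
bs = [np.zeros(d_h) for _ in range(4)]

h, C = np.zeros(d_h), np.zeros(d_h)
h, C = lstm_step(rng.standard_normal(d_in), h, C, *Ws, *bs)
print(h.shape, C.shape)  # (4,) (4,)
```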

Why LSTMs Help

The cell state acts as a gradient highway:

  • When f_t \approx 1, gradients flow through unchanged
  • Mitigates vanishing gradient problem
  • Can maintain information over hundreds of time steps

Gated Recurrent Units (GRUs)

Cho et al. (2014) proposed a simplified variant:

\begin{aligned} z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]) & \text{(Update gate)} \\ r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]) & \text{(Reset gate)} \\ \tilde{h}_t &= \tanh(W \cdot [r_t \odot h_{t-1}, x_t]) & \text{(Candidate)} \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t & \text{(New hidden state)} \end{aligned}

  • Fewer parameters than LSTM
  • Often comparable performance
  • Faster to train
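
The "fewer parameters" claim is easy to check: with three gates instead of four, a GRU layer has exactly 3/4 of the parameters of an LSTM layer of the same size (a quick sketch using PyTorch's built-in modules; the sizes here are arbitrary):

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=64, hidden_size=128)
gru = nn.GRU(input_size=64, hidden_size=128)

print(count_params(lstm))  # 99328: four gates' worth of weights and biases
print(count_params(gru))   # 74496: three gates, exactly 3/4 of the LSTM
```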

Gate intuition

The cell state in an LSTM is like a protected highway with on-ramps (input gate) and off-ramps (forget gate). GRUs merge some ramps, making the road shorter and cheaper to build but sometimes a bit less flexible on very long routes.

LSTM vs GRU: quick guide

| Use case | Pick LSTM when... | Pick GRU when... |
|----------|-------------------|------------------|
| Long context | You need stronger control over what to forget/keep | Context is short-to-medium |
| Compute budget | You can afford more parameters | You need faster training/inference |
| Data size | Plenty of data to fit extra params | Data is smaller or noisier |

1.5 Remaining Limitations

Despite their improvements, LSTMs and GRUs still suffer from fundamental limitations:

1. Sequential Processing

Both architectures still process sequences step-by-step:

  • Cannot parallelize across time steps
  • Training time still O(n) in sequence length
  • GPU utilization is poor

2. Limited Long-Range Modeling

Even with gates, information must still pass through many steps:

  • Effective context window is limited (typically 100-300 tokens)
  • Very long documents remain challenging
  • Information gets diluted over distance

3. Fixed Computation Per Step

Every step uses the same computation regardless of:

  • How relevant the current token is
  • How much context is needed
  • Whether the task requires local or global information

4. Difficulty with Bidirectional Context

Understanding text often requires both past and future context. Consider:

"The bank was steep" - Is this a river bank or a financial institution? You need future context ("steep") to disambiguate.

Bidirectional RNNs (BiRNNs)

The solution was to run two RNNs: one forward, one backward, then concatenate their hidden states:

\overrightarrow{h_t} = \text{RNN}(x_t, \overrightarrow{h_{t-1}}) \quad \text{(forward)}

\overleftarrow{h_t} = \text{RNN}(x_t, \overleftarrow{h_{t+1}}) \quad \text{(backward)}

h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \quad \text{(concatenated)}

| Property | Forward RNN | BiRNN | Transformer |
|----------|-------------|-------|-------------|
| Sees past context | Yes | Yes | Yes |
| Sees future context | No | Yes | Yes |
| Computation cost | O(n) | O(2n) | O(n²) but parallel |
| Parallelizable | No | No | Yes |
| Direct long-range links | No | No | Yes |
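
A quick shape check of the two-pass design, using PyTorch's bidirectional flag (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

# One forward RNN + one backward RNN, outputs concatenated per position
birnn = nn.RNN(input_size=10, hidden_size=16, batch_first=True, bidirectional=True)
x = torch.randn(2, 50, 10)   # (batch, seq_len, features)

out, h_n = birnn(x)
print(out.shape)   # (2, 50, 32): forward and backward states concatenated
print(h_n.shape)   # (2, 2, 16): one final state per direction
```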

Why BiRNNs still aren't enough:

  • Still sequential: You must wait for both passes to complete
  • Separate passes: Forward and backward don't directly interact during processing
  • No direct token connections: To relate token 1 to token 100, information still flows through all intermediate states in both directions
  • Double memory: Hidden states from both directions must be stored

Where BiRNNs are still used

Despite their limitations, BiLSTMs remain competitive for some tasks (NER, POS tagging) where sequences are short and bidirectional context is essential. ELMo (2018) used BiLSTMs to create contextual word embeddings before BERT made Transformers dominant.

Why this still fell short

Even with gates, training relies on truncated BPTT (you backprop only through a limited window), so documents longer than that window still lose signal; and the inability to parallelize across time is a fundamental hardware mismatch for GPUs.

1.6 The Encoder-Decoder Bottleneck

Before attention mechanisms, the dominant approach for sequence-to-sequence tasks (like machine translation) was the encoder-decoder architecture. While revolutionary, it had a critical flaw.

The Seq2Seq Architecture (Sutskever et al., 2014)

The basic idea was elegant: use one RNN (the encoder) to read the input sequence and compress it into a fixed-size vector, then use another RNN (the decoder) to generate the output sequence from that vector.

\text{Encoder: } h_T = \text{RNN}_{\text{enc}}(x_1, x_2, \ldots, x_T)

\text{Context: } c = h_T \text{ (or some function of final hidden state)}

\text{Decoder: } y_t = \text{RNN}_{\text{dec}}(y_{t-1}, c)
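
A minimal sketch of this handoff (GRU encoder and decoder; the vocabulary and sizes are hypothetical). Note that the decoder's only view of the source is the single context tensor:

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Minimal encoder-decoder: all source info flows through one vector."""
    def __init__(self, vocab=1000, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(vocab, emb)
        self.tgt_emb = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, src, tgt):
        _, context = self.encoder(self.src_emb(src))   # context: (1, batch, hidden)
        # The ONLY thing the decoder receives about the source is `context`
        dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
        return self.out(dec_out)

model = TinySeq2Seq()
src = torch.randint(0, 1000, (2, 30))   # 30 source tokens squeezed into 64 dims
tgt = torch.randint(0, 1000, (2, 25))
logits = model(src, tgt)
print(logits.shape)  # (2, 25, 1000): every output conditioned on one 64-dim vector
```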

The Encoder-Decoder Bottleneck

All source information must squeeze through a single fixed-size vector.

[Diagram: the encoder reads "The cat sat on the mat" into a sequence of hidden states, which collapse into a single fixed-size context vector c (the bottleneck); the decoder then generates "Le chat ..." from that one vector. The decoder only sees the single context vector, not the individual encoder states.]

Problem 1: Capacity. A 256-dim vector must encode an entire paragraph? That's like compressing a movie into a thumbnail.

Problem 2: Long Sentences. Performance degrades sharply after ~20-30 tokens. Early words get "overwritten" by later ones.

Problem 3: No Selection. When generating "chat", the decoder can't specifically look at "cat" — it only sees the blended context.

Why This Is a Problem

Imagine trying to translate a complex legal document or a technical paper. The entire meaning of potentially thousands of words must be compressed into a vector of, say, 256 or 512 dimensions. This creates several issues:

  1. Information Compression: Early tokens in long sequences get "overwritten" as the encoder processes more tokens. By the time you reach the end, the beginning is a faint echo.
  2. Fixed Capacity: The context vector has the same size regardless of whether you're encoding "Hello" or "War and Peace." Clearly, more information requires more capacity.
  3. No Selective Access: When the decoder generates word 50, it might need to specifically look at word 3 of the input. But it can only see the blended context vector—it cannot selectively attend to specific parts of the input.

Empirical Evidence

Researchers observed a sharp performance cliff:

| Sentence Length | BLEU Score (approx.) | Quality |
|-----------------|----------------------|---------|
| 5-10 words | 35-40 | Good translations |
| 15-20 words | 28-32 | Acceptable |
| 25-30 words | 20-25 | Noticeable errors |
| 40+ words | < 15 | Often incoherent |

The Bottleneck Insight: It's not that RNNs are bad at processing sequences—it's that forcing all information through a single fixed-size vector is fundamentally limiting. What if the decoder could look back at any encoder state, not just the final compressed one?

This motivated attention

The bottleneck problem directly led to attention mechanisms. Instead of a single context vector, what if we computed a different context for each decoder step, weighted by relevance to the current output position?

1.7 The Path to Attention

The Core Insight

The key limitation of RNNs: indirect access to distant elements.

To relate the first and last word of a sentence:

  • RNN: Information must pass through ALL intermediate hidden states
  • What we want: Direct connection between ANY two positions

Bahdanau Attention (2014)

The first attention mechanism for sequence-to-sequence models:

\begin{aligned} e_{ij} &= \text{alignment}(s_{i-1}, h_j) & \text{(How relevant is source position } j \text{?)} \\ \alpha_{ij} &= \text{softmax}(e_{ij}) & \text{(Attention weights, sum to 1)} \\ c_i &= \sum_j \alpha_{ij} \times h_j & \text{(Context vector)} \end{aligned}

Key insight: Create shortcuts in the computation graph!

📝bahdanau-mini.txt

Target word to generate: "bank" (river sense)
Encoder states: [h_river, h_flowing, h_fast]
Alignment scores (dot products): [2.1, 0.2, 0.1]
Softmax -> attention: [0.78, 0.12, 0.11]
Context vector: 0.78*h_river + 0.12*h_flowing + 0.11*h_fast

What the weights mean

The decoder explicitly leans on the source token that disambiguates the sense of "bank." Early attention made this link possible without forcing the gradient through every intermediate step.
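
Running the mini example through the actual softmax arithmetic (the encoder vectors here are made-up 2-d placeholders for h_river, h_flowing, h_fast):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

# Hypothetical 2-d stand-ins for h_river, h_flowing, h_fast
H = np.array([[1.0, 0.0],
              [0.2, 0.4],
              [0.1, 0.3]])

scores = np.array([2.1, 0.2, 0.1])   # alignment scores from the example
alpha = softmax(scores)              # attention weights, sum to 1
print(alpha.round(2))                # [0.78 0.12 0.11]

context = alpha @ H                  # weighted sum of encoder states
print(context)                       # dominated by the h_river row
```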

Luong Attention (2015): Simplifying the Score

Luong et al. proposed several simplifications that would directly influence the Transformer's design:

Scoring Functions

| Name | Formula | Complexity |
|------|---------|------------|
| Bahdanau (additive) | v^T tanh(W₁h_j + W₂s_i) | O(d²) per pair |
| Luong (dot) | h_j^T s_i | O(d) per pair |
| Luong (general) | h_j^T W s_i | O(d²) per pair |
| Luong (concat) | v^T tanh(W[h_j; s_i]) | O(d²) per pair |

The key insight: Simple dot-product attention (h_j^T s_i) works nearly as well as the more complex additive variant—and it's much faster because it's just a matrix multiply.

\text{score}(h_j, s_i) = h_j^T s_i \quad \text{(Dot-product attention)}

Global vs. Local Attention

  • Global attention: Attend to ALL encoder positions (like Bahdanau)
  • Local attention: Attend only to a window around an aligned position

Local attention was an attempt to reduce the O(n \times m) cost, foreshadowing later work on efficient Transformers (Longformer, BigBird).

Direct link to Transformers

The Transformer's "scaled dot-product attention" is essentially Luong's dot-product attention with a scaling factor: \frac{QK^T}{\sqrt{d_k}}. The transition from Luong to Transformers was incremental, not revolutionary—the revolution was removing the RNN backbone entirely.
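
For concreteness, here is a minimal NumPy sketch of that scaled dot-product form (single head, no masking; the shapes are arbitrary):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V: Luong's dot product plus scaling."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 query positions, d_k = 8
K = rng.standard_normal((6, 8))   # 6 key positions
V = rng.standard_normal((6, 8))

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)                  # (4, 8): one context vector per query
print(w.sum(axis=-1))             # each row of attention weights sums to 1
```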

Benefits of Attention

  1. Direct connections: Any position can attend to any other
  2. Interpretability: Attention weights show what the model "looks at"
  3. Variable context: Different queries access different information

Limitations of Early Attention

  • Still used RNNs as the backbone
  • Attention was a "add-on" to sequential processing
  • The question emerged: What if we use ONLY attention?

Alternative Approaches (The Road Not Taken)

While attention was gaining traction, researchers explored other ways to solve the parallelization problem. These approaches are worth knowing because they influenced Transformer design and remain useful in specific domains.

1. Convolutional Models for Sequences

CNNs process data in parallel by design. Several architectures adapted them for sequences:

| Model | Year | Key Idea | Limitation |
|-------|------|----------|------------|
| WaveNet | 2016 | Dilated causal convolutions for audio | Very deep for long range |
| ByteNet | 2016 | Encoder-decoder with dilated convs | Still O(log n) depth for range n |
| ConvSeq2Seq | 2017 | Fully convolutional translation | Required many layers for long context |

Why CNNs didn't win: While parallelizable, they need many stacked layers to capture long-range dependencies—even with dilation, the receptive field grows exponentially with depth, so covering a range of n tokens still takes O(log n) layers. A 512-token dependency requires ~9 layers of dilated convolutions vs. a single attention layer.
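The depth-vs-range arithmetic can be checked in a few lines (assuming kernel size 2 with dilation doubling each layer, as in WaveNet-style stacks):

```python
def dilated_receptive_field(num_layers, kernel_size=2):
    """Receptive field of stacked causal convs whose dilation doubles per layer."""
    rf = 1
    for layer in range(num_layers):
        rf += (kernel_size - 1) * (2 ** layer)  # each layer adds (k-1)*dilation
    return rf

# 9 layers of kernel-2 dilated convs span 512 tokens;
# a single attention layer connects all 512 positions directly.
print(dilated_receptive_field(9))
```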

2. Memory Networks & Neural Turing Machines

These architectures added explicit external memory that the model could read from and write to:

  • Memory Networks (Weston, 2014): External memory + attention-based retrieval
  • Neural Turing Machines (Graves, 2014): Differentiable tape with read/write heads
  • Differentiable Neural Computers (Graves, 2016): More sophisticated memory management

Why they didn't win: Complex to train, hard to scale, and the memory addressing mechanisms introduced their own bottlenecks. However, the idea of attending to stored representations directly influenced the key-value attention in Transformers.

Legacy of these approaches

WaveNet is still used for audio synthesis; memory mechanisms appear in retrieval-augmented models (RAG); and the idea of treating self-attention as "all tokens attending to a shared memory of all tokens" is a direct conceptual descendant.

1.8 The Stage is Set

By 2017, the NLP community recognized:

  1. Sequential processing is a bottleneck - We need parallelization
  2. Long-range dependencies are hard - Direct connections help
  3. Attention mechanisms work - They capture relationships effectively
  4. Hardware is evolving - GPUs excel at parallel matrix operations

The Hardware Reality: GPU Utilization

The shift to Transformers wasn't just about model quality—it was about hardware efficiency. Here's what researchers were observing:

| Model | GPU Utilization | Tokens/sec (V100) | WMT Training Time |
|---|---|---|---|
| LSTM (seq) | 15-25% | ~8,000 | 2-3 weeks |
| LSTM (optimized) | 30-40% | ~15,000 | 1-2 weeks |
| ConvSeq2Seq | 50-60% | ~25,000 | 4-5 days |
| Transformer | 80-95% | ~50,000+ | 12-36 hours |

Why Transformers utilize GPUs better:

  • Matrix multiplication: Attention is essentially batched matrix multiplies ($QK^T$, then softmax, then multiply by V)—exactly what GPUs are optimized for
  • No sequential dependencies: All positions can be computed in parallel within a layer
  • Predictable memory access: Dense operations with regular access patterns, unlike RNN's scattered hidden state updates
  • Tensor cores: Modern GPUs have specialized hardware for the exact operations Transformers need
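The contrast can be seen in miniature: an RNN must loop over time steps because each hidden state depends on the previous one, while self-attention covers every pair of positions with one matrix product (toy sizes, illustrative code only):

```python
import numpy as np

n, d = 6, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))   # a toy sequence of n token vectors

# RNN: n sequential steps — each hidden state needs the previous one first
W = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
steps = 0
for t in range(n):
    h = np.tanh(W @ h + x[t])
    steps += 1                    # cannot be parallelized across time

# Self-attention: one batched matrix product handles all position pairs
scores = x @ x.T / np.sqrt(d)                 # (n, n) — every pair at once
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ x                             # (n, d), computed in parallel
```

The RNN's `for` loop is exactly the serial dependency that leaves GPU cores idle; the attention path is three dense operations a GPU can saturate.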

The economic argument

Training a large LSTM for 3 weeks on 8 V100s costs roughly $15,000-20,000 in cloud compute. The same task with a Transformer might take 2 days on the same hardware: ~$2,000. This 10x cost reduction accelerated research dramatically.

The question was: Can we build a sequence model using only attention, no recurrence?

The answer came in June 2017 with "Attention Is All You Need" - the Transformer architecture.

The Paper That Changed Everything

"Attention Is All You Need" (Vaswani et al., 2017) was published at NeurIPS 2017 by a team at Google Brain and Google Research. The title itself was provocative—a direct challenge to the RNN orthodoxy.

The Authors

The paper had eight authors, several of whom have become highly influential:

  • Ashish Vaswani - Lead author, now at Essential AI
  • Noam Shazeer - Co-inventor of many key techniques, co-founded Character.AI
  • Niki Parmar - Research scientist, contributed to architecture design
  • Jakob Uszkoreit - Co-founded Inceptive (RNA design)
  • Llion Jones - Co-founded Sakana AI
  • Aidan Gomez - Co-founded Cohere
  • Łukasz Kaiser - Key contributor to Tensor2Tensor, now at OpenAI
  • Illia Polosukhin - Co-founded NEAR Protocol

The diaspora effect

Nearly every author went on to found or lead major AI companies. The Transformer didn't just change NLP—it launched an industry.

Initial Reception

The paper's reception was mixed at first:

| Reaction | Argument |
|---|---|
| Skeptical | "RNNs have inductive biases for sequences—you need recurrence for temporal modeling" |
| Skeptical | "O(n²) attention won't scale to long sequences" |
| Impressed | "The translation results are state-of-the-art with much less training time" |
| Curious | "If this works for translation, what else might it work for?" |
Within a year, BERT (2018) and GPT (2018) would answer that last question definitively: everything.

The provocative title: By claiming "Attention Is All You Need," the authors were making a bold statement: the inductive biases we thought were necessary (recurrence, convolution) were actually holding us back. The simplest architecture—just attention and feedforward layers—worked best.

Design checklist for the next step

Parallelize across time; connect any two positions directly; keep gradients short; let the model decide how much context to use per token; map cleanly onto GPU-friendly matrix multiplies. Transformers check all of these boxes.

Summary

| Architecture | Long-Range | Parallelizable | Key Innovation |
|---|---|---|---|
| Vanilla RNN | Poor | No | Hidden state |
| LSTM | Better | No | Gating mechanisms |
| GRU | Better | No | Simplified gates |
| Transformer | Excellent | Yes | Self-attention |

Key Takeaways

  1. RNNs introduced sequential hidden states but suffer from vanishing gradients and slow training
  2. LSTMs/GRUs added gating mechanisms that help with gradient flow but don't solve the parallelization problem
  3. Attention mechanisms enable direct connections between any positions in a sequence
  4. The transformer removes recurrence entirely, using only attention for sequence modeling

Exercises

Conceptual Questions

  1. Why can't RNNs be parallelized across time steps during training?
  2. Explain intuitively why gradients vanish when multiplying many small numbers together.
  3. How does the cell state in an LSTM act as a "gradient highway"?
  4. What is the computational complexity of processing a sequence of length $n$ with an RNN vs. a Transformer?

Thought Experiments

  1. Consider the sentence: "The trophy doesn't fit in the suitcase because it is too big."
    • What does "it" refer to?
    • Why is this hard for an RNN to model?
  2. If you had to design a neural network for sequence modeling from scratch, knowing what you know now, what properties would you want it to have?

Hands-on Drills

  1. Implement a two-layer RNN in NumPy/PyTorch and log hidden states over a toy sequence; visualize where information decays.
  2. Add gradient clipping to that RNN and compare loss curves with and without clipping.
  3. Write a tiny Bahdanau attention over a 3-token source and 2-token target; print the attention weights to see which source positions are favored.

Further Reading

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation.
  • Cho, K., et al. (2014). Learning phrase representations using RNN encoder-decoder.
  • Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate.
  • Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks.

In the next section, we'll dive into the Transformer architecture itself, understanding how it eliminates recurrence entirely and achieves state-of-the-art performance across virtually all sequence modeling tasks.