Chapter 0

The Core Insight: How Transformers Actually Work

The Big Idea

Before we dive into the technical details of Transformers, let's understand the fundamental insight that makes them work. This single idea will help you understand not just Transformers, but modern AI in general.

The Core Insight: If you can convert something into a vector (a list of numbers), you can train a neural network to learn meaningful patterns from it. This applies to text, images, audio, or anything else.

Here's the journey from text to intelligence:

| Step | What Happens | Result |
|------|--------------|--------|
| 1 | Convert text to tokens | "Hello world" → [15496, 995] |
| 2 | Assign each token a vector | [15496, 995] → [[0.2, -0.5, ...], [0.8, 0.1, ...]] |
| 3 | Train the network to update vectors | Vectors encode contextual meaning |
| 4 | Use attention to mix information | Each word understands its context |
| 5 | Predict next token probabilities | P(next) → "!" (0.7), "." (0.2), ... |

Let's walk through each step to truly understand how this creates intelligence.


The Whole Flow in One Glance

Pipeline: Text → Tokens → Embeddings → (Multi-layer Attention + Feedforward) → Logits → Probabilities → Next token.

| Stage | Shape Example | What Changes |
|-------|---------------|--------------|
| Tokens | [seq] | Discrete IDs only |
| Embeddings | [seq, d_model] | Static meaning lookup |
| Self-Attention | [seq, d_model] | Context blended in |
| Logits | [seq, vocab] | Scores for each token |
| Probabilities | [seq, vocab] | Softmax → sums to 1 |

You'll see this exact flow repeatedly—everything else is improving these steps or stacking them deeper.


Step 1: Everything is a Vector

The first breakthrough in modern AI is understanding that anything can be represented as a vector – a simple list of numbers. Once something is a vector, mathematics can work on it.

What is a Vector?

Think of a vector as coordinates in space. In 2D, you might have [3, 4] representing a point. But vectors can have hundreds or thousands of dimensions:

🐍vectors_demo.py
# A simple 2D vector
point = [3, 4]  # x=3, y=4

# A word embedding vector (512 dimensions)
word_vector = [0.23, -0.45, 0.12, 0.89, ..., -0.34]  # 512 numbers

# Why 512? More dimensions = more capacity to encode meaning
# Each dimension can capture different aspects of meaning

Key Insight: After training, similar things end up with similar vectors. The word "king" will have a vector close to "queen" but far from "banana". Neural networks learn to position vectors in this space meaningfully.
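One common way to put a number on "close" is cosine similarity, which compares the direction of two vectors. The sketch below is a minimal illustration in plain Python; the three 3D vectors are toy values (the same ones listed for the 3D embedding example later in this section), not real learned embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3D vectors; real embeddings have hundreds of dimensions
king = [-1.8, 1.2, -0.8]
queen = [-2.2, 1.5, -0.5]
banana = [-1.2, -1.8, 2.3]

sim_king_queen = cosine_similarity(king, queen)
sim_king_banana = cosine_similarity(king, banana)
print(f"king vs queen:  {sim_king_queen:.2f}")
print(f"king vs banana: {sim_king_banana:.2f}")
```

With these toy values, "king" and "queen" point in nearly the same direction, while "king" and "banana" do not.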

From Discrete to Continuous

Text is inherently discrete – just symbols. But math works best with continuous numbers. Vectors bridge this gap:

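| Representation | Type | Example | Math Friendly? |
|----------------|------|---------|----------------|
| Text | Discrete symbols | "cat" | ❌ No |
| Token ID | Discrete integer | 4521 | ❌ Limited |
| Vector | Continuous numbers | [0.2, -0.5, 0.8, ...] | ✅ Yes! |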

Step 2: Tokens and Initial Vectors

Before we can work with text, we need to break it into pieces called tokens and give each piece a vector.

Tokenization: Breaking Text into Pieces

🐍tokenization_example.py
text = "The cat sat on the mat"

# Simple word tokenization
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Each token gets a unique ID from a vocabulary
vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5}
token_ids = [vocab[token] for token in tokens]  # [0, 1, 2, 3, 4, 5]

# Real tokenizers use subwords (we'll cover this later)
# "unhappiness" → ["un", "happiness"] or ["un", "hap", "pi", "ness"]
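To make the text-to-IDs step concrete, here is a minimal word-level sketch that builds the vocabulary from the text instead of hard-coding it. The helper names (`build_vocab`, `encode`, `decode`) and the `-1` ID for unknown words are assumptions made for this illustration; real tokenizers work on subwords and handle unknown input differently.

```python
def build_vocab(texts):
    """Assign each distinct whitespace-separated word an ID in first-seen order."""
    vocab = {}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, unk_id=-1):
    """Map words to IDs; unseen words get unk_id (a placeholder choice)."""
    return [vocab.get(word, unk_id) for word in text.split()]

def decode(token_ids, vocab):
    """Invert the mapping to recover the words."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word.get(i, "<unk>") for i in token_ids)

vocab = build_vocab(["The cat sat on the mat"])
token_ids = encode("The cat sat on the mat", vocab)
print(token_ids)
print(decode(token_ids, vocab))
print(encode("the dog sat", vocab))  # "dog" is not in the vocabulary
```

Note that "The" and "the" get different IDs here, which is one reason casing and whitespace matter in tokenization.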

The Embedding Matrix: Token → Vector

Here's where the magic starts. We create a giant lookup table called an embedding matrix:

🐍embedding_matrix.py
import torch
import torch.nn as nn

# Vocabulary size: 50,000 unique tokens
# Embedding dimension: 512 numbers per token
vocab_size = 50000
embed_dim = 512

# The embedding matrix: 50,000 × 512 = 25.6 million parameters!
embedding = nn.Embedding(vocab_size, embed_dim)

# Look up vectors for our tokens
token_ids = torch.tensor([0, 1, 2, 3, 4, 5])  # "The cat sat on the mat"
vectors = embedding(token_ids)

print(vectors.shape)  # torch.Size([6, 512]) - 6 tokens, each with 512 numbers

# Initially, these vectors are RANDOM!
# Training will make them meaningful

Important: Initially, the embedding vectors are just random numbers. Training adjusts these numbers, together with the rest of the model's weights, so that tokens used in similar ways end up with similar vectors.
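Under the hood, an embedding lookup is just row indexing into a table. The following plain-Python sketch mirrors that idea with a toy 6 × 4 table; the sizes, the seed, and the `embed` helper are assumptions for illustration only.

```python
import random

random.seed(0)  # fixed seed so the "random" initialisation is repeatable

vocab_size = 6   # toy vocabulary: The, cat, sat, on, the, mat
embed_dim = 4    # toy dimension (the text uses 512)

# One row per token ID, filled with small random numbers before training
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up one row per token ID; no arithmetic, just indexing."""
    return [embedding_matrix[i] for i in token_ids]

vectors = embed([0, 1, 2, 3, 4, 5])
print(len(vectors), len(vectors[0]))  # number of tokens, numbers per token

# Parameter count is rows x columns, e.g. the 50,000 x 512 table in the text
print(50_000 * 512)
```

Because the lookup only selects rows, all of the "meaning" has to come from how training changes the numbers stored in those rows.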

Visualizing the Embedding Space

Imagine a vast 512-dimensional space where every word in the vocabulary has a position. The toy example below shrinks that space to 3 dimensions to show how similar words cluster together:

3D Embedding Space

Clusters shown: Royalty/People, Age Groups, Activities, Animals, Fruits

💡 Each vector's direction and magnitude encode semantic meaning

Token Vectors (d_model = 3)

| Token | X | Y | Z |
|-------|------|------|------|
| King | -1.8 | 1.2 | -0.8 |
| Queen | -2.2 | 1.5 | -0.5 |
| Boy | 1.5 | 2.2 | 0.8 |
| Girl | 1.0 | 2.0 | 0.5 |
| Sport | 2.2 | -0.8 | 1.5 |
| Game | 1.8 | -1.2 | 1.8 |
| Man | -1.2 | 0.5 | -0.3 |
| Woman | -1.5 | 0.8 | 0.0 |
| Cat | 0.5 | -1.8 | -1.5 |
| Dog | 0.8 | -1.5 | -1.8 |
| Apple | -0.8 | -1.5 | 2.0 |
| Banana | -1.2 | -1.8 | 2.3 |
| Orange | -0.5 | -2.0 | 1.8 |
| Mango | -1.0 | -1.2 | 2.5 |

Notice how the visualization demonstrates key properties of embedding spaces:

  • Semantic Clustering: Related words (royalty, people, animals, food) form distinct clusters
  • Gender as a Direction: The vector from "man" to "woman" is similar to the vector from "king" to "queen"
  • Vector Arithmetic: king − man + woman ≈ queen
Tokenization gotchas: Subwords, punctuation, and whitespace matter. Modern tokenizers split "unhappiness" into subpieces, preserve spaces, and fall back to bytes for rare characters.
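The vector-arithmetic property from the list above can be checked directly on the toy 3D vectors from the "Token Vectors" listing. This is a sketch under simplifying assumptions: it uses Euclidean distance, a handful of words, and a hypothetical `analogy` helper; it says nothing about how well the property holds in real learned embeddings.

```python
import math

# Toy 3D vectors copied from the "Token Vectors" listing
vectors = {
    "king":   [-1.8, 1.2, -0.8],
    "queen":  [-2.2, 1.5, -0.5],
    "man":    [-1.2, 0.5, -0.3],
    "woman":  [-1.5, 0.8, 0.0],
    "boy":    [1.5, 2.2, 0.8],
    "girl":   [1.0, 2.0, 0.5],
    "banana": [-1.2, -1.8, 2.3],
}

def analogy(a, b, c, exclude_inputs=True):
    """Return the word whose vector is nearest (Euclidean) to a - b + c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {
        word: vec for word, vec in vectors.items()
        if not (exclude_inputs and word in (a, b, c))
    }
    return min(candidates, key=lambda word: math.dist(candidates[word], target))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates is a common convention for this kind of query, since the result otherwise tends to land on one of the inputs.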

Step 3: Learning Contextual Meaning

Here's the crucial insight: the same word can mean different things in different contexts. The word "bank" in "river bank" vs "bank account" should have different representations.

Static vs Contextual Embeddings

| Type | Description | Example |
|------|-------------|---------|
| Static (Word2Vec) | Same vector always | "bank" → same vector everywhere |
| Contextual (Transformer) | Vector depends on context | "river bank" → different from "money bank" |

The Transformer's Job: Take the initial static embeddings and transform them into contextual embeddings. Each vector gets updated based on ALL the other words in the sentence.

How Context Changes Vectors

🐍contextual_embeddings.py
# Initial embeddings (from lookup table)
# These are the SAME for "bank" regardless of context
sentence_1 = "I walked along the river bank"
sentence_2 = "I deposited money in the bank"

# After the Transformer processes these sentences:
# The vector for "bank" is DIFFERENT in each case!

# Conceptually:
# bank_river = original_bank + context_from_river_walked_along
# bank_money = original_bank + context_from_deposited_money

# The neural network learns to:
# 1. Look at surrounding words
# 2. Decide which aspects of meaning are relevant
# 3. Update the vector accordingly

This is why Transformers are so powerful. Each vector in the output contains information from the entire input sequence, not just the original token.
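A deliberately crude sketch can show how one static vector turns into two different contextual vectors. Everything here is an assumption for illustration: the 2D embeddings are invented, and plain averaging with a fixed `mix` factor stands in for what attention learns to do with trained weights.

```python
# Hypothetical 2D static embeddings, invented for illustration
static = {
    "bank":  [0.5, 0.5],
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
}

def contextualize(word, context_words, mix=0.5):
    """Blend a word's static vector with the mean of its context vectors."""
    dim = len(static[word])
    context_mean = [
        sum(static[c][d] for c in context_words) / len(context_words)
        for d in range(dim)
    ]
    return [(1 - mix) * static[word][d] + mix * context_mean[d] for d in range(dim)]

bank_river = contextualize("bank", ["river"])
bank_money = contextualize("bank", ["money"])
print(bank_river)  # pulled toward "river"
print(bank_money)  # pulled toward "money"
```

The starting vector for "bank" is identical in both calls; only the context differs, and that alone is enough to move the result in different directions.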


Step 4: Attention – The Magic Ingredient

How does a Transformer update vectors based on context? Through Attention – a mechanism that allows each token to "look at" every other token and decide what information to gather.

The Intuition Behind Attention

📝attention_intuition.txt
Consider: "The cat sat on the mat because it was tired"

When processing "it", which words should influence its meaning?
───────────────────────────────────────────────────────────

The    → 0.02  (not very relevant)
cat    → 0.85  (highly relevant! "it" refers to "cat")
sat    → 0.05  (somewhat relevant)
on     → 0.01  (not relevant)
the    → 0.01  (not relevant)
mat    → 0.03  (could be relevant, but "cat" is more likely)
because→ 0.01  (not relevant)
it     → 0.01  (self-reference)
was    → 0.01  (not relevant)
tired  → 0.00  (not relevant)
       ─────
       = 1.00  (probabilities sum to 1)

Attention learns these weights automatically!
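Weights like these come from applying softmax to raw relevance scores. The sketch below shows that conversion for three tokens; the scores are made-up numbers, chosen only so that "cat" comes out on top.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the query "it" against three earlier tokens
tokens = ["The", "cat", "mat"]
scores = [0.5, 4.0, 1.0]

weights = softmax(scores)
for token, weight in zip(tokens, weights):
    print(f"{token:>4} -> {weight:.2f}")
```

Because softmax exponentiates the scores, a moderately larger score turns into a much larger share of the total weight.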

The Math in One Slide

Each token asks three questions:

  1. Query: "What am I looking for?"
  2. Key: "What do I contain?"
  3. Value: "What information should I share?"
🐍attention_simple.py
import torch

# Simplified attention mechanism (single query, no 1/sqrt(d_k) scaling yet)
def attention(query, keys, values):
    """
    query: What this token is looking for [d_k]
    keys: What each token contains [seq_len, d_k]
    values: What each token can share [seq_len, d_v]
    """
    # Step 1: How well does my query match each key?
    scores = query @ keys.T  # [seq_len] - one score per token

    # Step 2: Convert to probabilities (which tokens to focus on)
    weights = torch.softmax(scores, dim=-1)  # [seq_len] - sums to 1

    # Step 3: Weighted sum of values
    output = weights @ values  # [d_v] - blend of all values

    return output

# The result: A new vector that contains information
# from all tokens, weighted by relevance!
Why "Attention"? Just like humans pay attention to relevant words when understanding a sentence, the Transformer learns to focus on the most relevant tokens for each position.

Bridging to the Math (Shapes)

📝attention_shapes.txt
Input: X [seq, d_model] (one row per token)
Linear projections: W_Q, W_K [d_model, d_k] and W_V [d_model, d_v]
Q = X W_Q  -> [seq, d_k]
K = X W_K  -> [seq, d_k]
V = X W_V  -> [seq, d_v]
Scores = Q K^T / sqrt(d_k) -> [seq, seq]
Weights = softmax(Scores, dim=-1) -> [seq, seq] (rows sum to 1)
Output = Weights V -> [seq, d_v] (context-blended vectors)
Micro-exercise: Pick one token (e.g., "it"). Imagine assigning higher weights to relevant words (e.g., "cat"). The weighted sum of value vectors is the new, contextual embedding for that token.
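To trace the shapes by hand, here is a dependency-free sketch of scaled dot-product attention on tiny matrices. It assumes Q, K, and V have already been produced (the learned projections are skipped), and the numbers are invented for illustration.

```python
import math

def matmul(a, b):
    """Multiply matrix a [n, k] by matrix b [k, m] (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply softmax independently to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def scaled_dot_product_attention(Q, K, V):
    d_k = len(Q[0])
    scores = matmul(Q, transpose(K))                                # [seq, seq]
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]  # scale
    weights = softmax_rows(scores)                                  # rows sum to 1
    return matmul(weights, V), weights                              # [seq, d_v], [seq, seq]

# Toy inputs: seq = 3, d_k = 2, d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
print(len(output), len(output[0]))              # seq, d_v
print([round(sum(row), 6) for row in weights])  # each row sums to 1
```

Each output row is a weighted average of the rows of V, so every value in it stays within the range spanned by the corresponding column of V.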

Step 5: Predicting the Next Token

After all the attention layers have processed the input, we have rich contextual vectors. Now we need to make a prediction: what comes next?

From Vectors to Probabilities

🐍prediction_head.py
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins so the snippet runs on its own (a real model supplies both)
transformer_output = torch.randn(1, 6, 512)  # [batch, seq, hidden_dim]
linear_layer = nn.Linear(512, 50000)         # hidden_dim -> vocab_size

# After Transformer processing, we have a contextual vector
# for the last position
last_hidden_state = transformer_output[:, -1, :]  # [batch, hidden_dim]

# Project to vocabulary size
logits = linear_layer(last_hidden_state)  # [batch, vocab_size]
# If vocab_size = 50,000, we get 50,000 numbers

# Convert to probabilities
probabilities = F.softmax(logits, dim=-1)  # [batch, vocab_size]
# Now each of the 50,000 tokens has a probability

# Illustrative output of a trained model for "The cat sat on the ___"
# mat:     0.15  ← highest probability
# floor:   0.12
# ground:  0.08
# chair:   0.07
# ...
# banana:  0.0001  ← very unlikely in this context
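The projection step is small enough to work through by hand. The sketch below uses a 4-token vocabulary and a 3-dimensional hidden state; the hidden vector and output weights are invented values, not taken from any trained model.

```python
import math

vocab = ["mat", "floor", "chair", "banana"]  # toy 4-token vocabulary

# Hypothetical final hidden state (d_model = 3) and output weights [vocab, d_model]
hidden = [1.0, 0.5, -0.5]
output_weights = [
    [2.0, 1.0, 0.0],    # mat
    [1.5, 1.0, 0.0],    # floor
    [1.0, 0.5, 0.5],    # chair
    [-1.0, 0.0, 2.0],   # banana
]

# Linear projection: one dot product (logit) per vocabulary entry
logits = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]

# Softmax turns logits into a probability distribution over the vocabulary
peak = max(logits)
exps = [math.exp(value - peak) for value in logits]
total = sum(exps)
probabilities = [e / total for e in exps]

ranked = sorted(zip(vocab, probabilities), key=lambda pair: pair[1], reverse=True)
for token, p in ranked:
    print(f"{token:>6}: {p:.3f}")
```

Each row of the output weights acts like a template for one token: the closer the hidden state aligns with a row, the higher that token's logit.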

Sampling the Next Token

Once we have probabilities, we can choose the next token:

| Method | Description | Use Case |
|--------|-------------|----------|
| Greedy | Always pick highest probability | Deterministic, but boring |
| Top-k | Sample from top k tokens | More creative, some randomness |
| Top-p (nucleus) | Sample from smallest set with cumulative prob > p | Most natural balance |
| Temperature | Sharpen or flatten distribution | Control creativity level |

🐍sampling.py
import torch

def generate_next_token(probabilities, method="greedy", temperature=1.0):
    if method == "greedy":
        # Always pick the most likely token
        return probabilities.argmax()

    elif method == "sample":
        # Apply temperature (higher = more random); equivalent to
        # dividing the logits by temperature before the softmax
        scaled = probabilities ** (1 / temperature)
        scaled = scaled / scaled.sum()  # renormalize

        # Randomly sample based on probabilities
        return torch.multinomial(scaled, 1)

    raise ValueError(f"unknown method: {method!r}")

# With temperature:
# - temperature < 1: More focused (picks likely tokens)
# - temperature = 1: Original distribution
# - temperature > 1: More random (explores unlikely tokens)
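Temperature is more often described as dividing the logits by T before the softmax, whereas sampling.py raises the probabilities to the power 1/T and renormalizes. The two give the same distribution, and the short check below illustrates that on made-up logits (the function names are hypothetical).

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_on_logits(logits, temperature):
    """Divide logits by T, then softmax."""
    return softmax([value / temperature for value in logits])

def temperature_on_probs(probs, temperature):
    """Raise probabilities to 1/T, then renormalize (as in sampling.py)."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
probs = softmax(logits)

for t in (0.5, 1.0, 2.0):
    a = temperature_on_logits(logits, t)
    b = temperature_on_probs(probs, t)
    assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))  # same distribution
    print(t, [round(x, 3) for x in a])
```

Lower temperatures concentrate probability on the top token, and higher temperatures spread it out toward a uniform distribution.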
Sampling pitfalls: Greedy can loop on common words; high temperature can produce nonsense; top-k/top-p balance surprise vs coherence. Repetition penalties and minimum-length constraints often help avoid degenerate loops.
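Top-k and top-p both work by truncating the distribution before sampling. The following is a minimal sketch in plain Python; the helper names and the example distribution are assumptions, and this version keeps tokens until the cumulative probability reaches p (conventions differ slightly on the exact cut-off).

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely token indices and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(filtered, rng=random):
    """Draw one token index from a {index: probability} mapping."""
    indices = list(filtered)
    return rng.choices(indices, weights=[filtered[i] for i in indices], k=1)[0]

probs = [0.5, 0.3, 0.1, 0.06, 0.04]   # hypothetical next-token distribution
print(top_k_filter(probs, k=2))       # only tokens 0 and 1 survive
print(top_p_filter(probs, p=0.85))    # tokens 0, 1, 2 survive
print(sample(top_p_filter(probs, p=0.85)))
```

The practical difference is that top-k always keeps a fixed number of tokens, while top-p keeps fewer tokens when the model is confident and more when it is unsure.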

Putting It All Together

Now let's see the complete flow from input text to generated output:

The Complete Transformer Flow

Input → Step 1: Tokenize → Step 2: Embed → Step 3: Add position → Step 4: Transformer layers → Step 5: Predict next token → Final output

Why This Works So Well

The Transformer architecture has several key properties that make it remarkably effective:

1. Parallel Processing

Unlike RNNs, which process one token at a time, Transformers process all tokens of a sequence in parallel during training. This maps well onto GPUs and makes training far more efficient. Generation still produces one new token at a time.

2. Long-Range Dependencies

Every token can directly attend to every other token, no matter how far apart:

📝long_range.txt
"The cat that the dog chased ran away"
      ↑                       ↑
      └───────────────────────┘
      Direct attention connection!

An RNN would need to carry information from "cat" to "ran" through 5 sequential steps.
A Transformer connects them directly in one attention operation.

3. Learned Representations

The embedding vectors aren't hand-crafted – they're learned from data:

| Training Objective | What the Model Learns |
|--------------------|-----------------------|
| Predict next token | Grammar, facts, reasoning patterns |
| Masked word prediction | Bidirectional context understanding |
| Translation | Cross-lingual meaning preservation |
| Question answering | Information retrieval and synthesis |

4. Scalable Architecture

🐍scaling.py
# The power of scale
models = {
    "GPT-2 Small":  {"params": "117M",  "layers": 12,  "d_model": 768},
    "GPT-2 Large":  {"params": "774M",  "layers": 36,  "d_model": 1280},
    "GPT-3":        {"params": "175B",  "layers": 96,  "d_model": 12288},
    # GPT-4's size and architecture have not been published; figures such as
    # "~1.7T" are unofficial estimates
    "GPT-4":        {"params": "undisclosed", "layers": "?", "d_model": "?"},
}

# More parameters = more capacity to learn patterns
# The same basic architecture scales from millions to hundreds of billions of parameters
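The layer count and d_model in this listing are enough for a back-of-the-envelope parameter estimate. The sketch below uses the common approximation of about 12 × layers × d_model² weights in the Transformer blocks; that formula assumes a feedforward layer with 4× expansion and ignores embeddings, biases, and normalization, so treat the results as rough.

```python
def approx_block_params(n_layers, d_model):
    """Rough count of weights in the Transformer blocks: ~12 * layers * d_model^2.

    Assumes 4 * d_model^2 for the attention projections and 8 * d_model^2 for a
    feedforward layer with 4x expansion; ignores embeddings, biases, and norms.
    """
    return 12 * n_layers * d_model ** 2

for name, layers, d_model in [
    ("GPT-2 Small", 12, 768),
    ("GPT-2 Large", 36, 1280),
    ("GPT-3", 96, 12288),
]:
    millions = approx_block_params(layers, d_model) / 1e6
    print(f"{name}: ~{millions:,.0f}M block parameters")
```

For the smaller models the estimate falls short of the listed totals mainly because the token embedding table is left out; for GPT-3 the blocks account for nearly all of the parameters.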
| Factor | Effect | Trade-off |
|--------|--------|-----------|
| Longer context | Can attend farther back | Attention cost grows ~seq^2 |
| More layers | Deeper reasoning signal | Training time ↑, overfitting risk |
| Bigger d_model | Richer embeddings | Memory/compute ↑, harder to train |

Longer context doesn't mean perfect memory of everything—attention weights still dilute across many tokens.
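The quadratic cost of longer context is easy to see with a little arithmetic. This sketch counts the pairwise scores in a single attention head for one sequence, assuming 4 bytes per score (float32) and a naive implementation that stores the full matrix.

```python
def attention_matrix_entries(seq_len):
    """Number of pairwise scores in one attention head for one sequence."""
    return seq_len * seq_len

for seq_len in (512, 1024, 2048, 4096):
    entries = attention_matrix_entries(seq_len)
    megabytes = entries * 4 / 1e6  # assuming 4 bytes per score (float32)
    print(f"seq={seq_len:>5}: {entries:>12,} scores, ~{megabytes:,.1f} MB per head")
```

Doubling the sequence length quadruples the number of scores, and the total multiplies again by the number of heads, layers, and sequences in a batch.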

Common Questions

  • Are embeddings the same as meaning? No—they're learned coordinates that correlate with usage; they can encode bias or spurious patterns.
  • Why 512 or 768 dimensions? It's a capacity/efficiency trade-off; more dims can express more nuance but cost more.
  • Why do probabilities sum to 1? Softmax normalizes logits so each position represents a valid distribution over the vocab.
  • Is attention a filter or a blender? It blends values using learned weights; low weights effectively filter.
  • Do Transformers read left-to-right? Only if masked; unmasked self-attention sees the whole sequence at once.
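The last question above, about left-to-right reading, comes down to a causal mask. Here is a minimal sketch of masked attention weights in plain Python; the function name is hypothetical, and the all-zero scores are chosen so that the mask is the only thing shaping the result.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # drop scores for future positions
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical 3x3 score matrix with all scores equal
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
masked = causal_attention_weights(scores)
for row in masked:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, the second splits its weight over two positions, and only the last token sees the whole sequence.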

Check Your Understanding

  1. Why do we convert tokens to continuous vectors before doing anything else?
  2. How does attention decide which tokens influence each other?
  3. When would you prefer top-p sampling over greedy decoding?

Try answering before peeking back—if you can explain these in your own words, you're ready to dive into the math in the next chapters.


Summary

You now understand the fundamental insight behind Transformers:

  1. Vectorize: Convert tokens into learnable vectors
  2. Contextualize: Use attention to update vectors with context
  3. Predict: Convert final vectors to probability distributions
  4. Generate: Sample from probabilities to produce output
The Key Takeaway: Each number in a token's vector encodes information about how that token is typically used. Training adjusts these numbers so the model can predict what comes next. The better the predictions, the more the vectors capture about language, knowledge, and reasoning.

In the following chapters, we'll implement each of these components from scratch. You'll build:

  • Embedding layers that convert tokens to vectors
  • Attention mechanisms that enable contextual understanding
  • Transformer layers that stack to create deep representations
  • Output heads that convert vectors to predictions
  • A complete translation model that puts it all together

Next: In the following sections, we'll cover the technical prerequisites you need to implement these ideas in code.
