The Big Idea
Before we dive into the technical details of Transformers, let's understand the fundamental insight that makes them work. This single idea will help you understand not just Transformers, but modern AI in general.
The Core Insight: If you can convert something into a vector (a list of numbers), you can train a neural network to learn meaningful patterns from it. This applies to text, images, audio, or anything else.
Here's the journey from text to intelligence:
| Step | What Happens | Result |
|---|---|---|
| 1 | Convert text to tokens | "Hello world" → [15496, 995] |
| 2 | Assign each token a vector | [15496, 995] → [[0.2, -0.5, ...], [0.8, 0.1, ...]] |
| 3 | Train the network to update vectors | Vectors encode contextual meaning |
| 4 | Use attention to mix information | Each word understands its context |
| 5 | Predict next token probabilities | P(next) → "!" (0.7), "." (0.2), ... |
Let's walk through each step to truly understand how this creates intelligence.
The Whole Flow in One Glance
Pipeline: Text → Tokens → Embeddings → (Multi-layer Attention + Feedforward) → Logits → Probabilities → Next token.
| Stage | Shape Example | What Changes |
|---|---|---|
| Tokens | [seq] | Discrete IDs only |
| Embeddings | [seq, d_model] | Static meaning lookup |
| Self-Attention | [seq, d_model] | Context blended in |
| Logits | [seq, vocab] | Scores for each token |
| Probabilities | [seq, vocab] | Softmax → sums to 1 |
You'll see this exact flow repeatedly—everything else is improving these steps or stacking them deeper.
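To make the table concrete, here is a minimal pure-Python sketch that pushes two tokens through every stage and tracks the shapes. The four-word vocabulary, `d_model = 4`, and the random values are all assumptions chosen for readability. It also skips the learned Q/K/V projections and the feedforward layer, and reuses the embedding table as the output projection, so treat it as an outline of the data flow rather than a real model.

```python
import math
import random

random.seed(0)

vocab = ["hello", "world", "!", "."]  # toy vocabulary (assumption)
d_model = 4                           # toy embedding width (assumption)

# Embedding table with shape [vocab, d_model], randomly initialised
embedding = [[random.gauss(0, 1) for _ in range(d_model)] for _ in vocab]

def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Tokens: [seq]
token_ids = [vocab.index("hello"), vocab.index("world")]

# Embeddings: [seq, d_model]
x = [embedding[t] for t in token_ids]

# Self-attention without learned projections: [seq, d_model]
mixed = []
for q in x:
    weights = softmax([dot(q, k) / math.sqrt(d_model) for k in x])
    mixed.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d_model)])

# Logits: [seq, vocab], one score per vocabulary entry at each position
logits = [[dot(h, e) for e in embedding] for h in mixed]

# Probabilities: [seq, vocab], each row sums to 1
probs = [softmax(row) for row in logits]

print("tokens:", len(token_ids))
print("embeddings:", (len(x), len(x[0])))
print("probabilities:", (len(probs), len(probs[0])))
```

Because the weights are random, the predicted probabilities are meaningless here; only the shapes and the order of operations carry over to a trained model.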
Step 1: Everything is a Vector
The first breakthrough in modern AI is understanding that anything can be represented as a vector – a simple list of numbers. Once something is a vector, mathematics can work on it.
What is a Vector?
Think of a vector as coordinates in space. In 2D, you might have [3, 4] representing a point. But vectors can have hundreds or thousands of dimensions:

    # A simple 2D vector
    point = [3, 4]  # x=3, y=4

    # A word embedding vector (512 dimensions)
    word_vector = [0.23, -0.45, 0.12, 0.89, ..., -0.34]  # 512 numbers

    # Why 512? More dimensions = more capacity to encode meaning
    # Each dimension can capture different aspects of meaning

Key Insight: The magic is that similar things end up with similar vectors. The word "king" will have a vector close to "queen" but far from "banana". Neural networks learn to position vectors in this space meaningfully.
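One common way to measure how "close" two vectors are is cosine similarity. The sketch below uses three hand-picked 3D vectors, which are assumptions for illustration and not learned embeddings, to show "king" scoring much closer to "queen" than to "banana".

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy vectors (NOT real embeddings)
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.9, 0.15]
banana = [0.1, 0.05, 0.95]

sim_royal = cosine_similarity(king, queen)
sim_fruit = cosine_similarity(king, banana)

print(f"king vs queen:  {sim_royal:.2f}")
print(f"king vs banana: {sim_fruit:.2f}")
```

In a trained model nobody picks these numbers by hand; the point is only that "similar" has a precise geometric meaning once words are vectors.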
From Discrete to Continuous
Text is inherently discrete – just symbols. But math works best with continuous numbers. Vectors bridge this gap:
| Representation | Type | Example | Math Friendly? |
|---|---|---|---|
| Text | Discrete symbols | "cat" | ❌ No |
| Token ID | Discrete integer | 4521 | ❌ Limited |
| Vector | Continuous numbers | [0.2, -0.5, 0.8, ...] | ✅ Yes! |
Step 2: Tokens and Initial Vectors
Before we can work with text, we need to break it into pieces called tokens and give each piece a vector.
Tokenization: Breaking Text into Pieces

    text = "The cat sat on the mat"

    # Simple word tokenization (split on whitespace)
    tokens = text.split()  # ["The", "cat", "sat", "on", "the", "mat"]

    # Each token gets a unique ID from a vocabulary
    # Note that "The" and "the" are different tokens with different IDs
    vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5}
    token_ids = [vocab[token] for token in tokens]  # [0, 1, 2, 3, 4, 5]

    # Real tokenizers use subwords (we'll cover this later)
    # "unhappiness" → ["un", "happiness"] or ["un", "hap", "pi", "ness"]

The Embedding Matrix: Token → Vector
Here's where the magic starts. We create a giant lookup table called an embedding matrix:

    import torch
    import torch.nn as nn

    # Vocabulary size: 50,000 unique tokens
    # Embedding dimension: 512 numbers per token
    vocab_size = 50000
    embed_dim = 512

    # The embedding matrix: 50,000 × 512 = 25.6 million parameters!
    embedding = nn.Embedding(vocab_size, embed_dim)

    # Look up vectors for our tokens
    token_ids = torch.tensor([0, 1, 2, 3, 4, 5])  # "The cat sat on the mat"
    vectors = embedding(token_ids)

    print(vectors.shape)  # torch.Size([6, 512]): 6 tokens, each with 512 numbers

    # Initially, these vectors are RANDOM!
    # Training will make them meaningful

Important: Initially, the embedding vectors are just random numbers. Training adjusts these numbers, along with the rest of the network's weights, so that semantically similar words get similar vectors.
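If PyTorch is not at hand, the lookup itself is easy to mimic with plain lists. The sketch below assumes a tiny `embed_dim = 4` and a hypothetical helper named `embed`; it is meant to show that an embedding layer is a table indexed by token ID, with no arithmetic involved in the lookup.

```python
import random

random.seed(42)

vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5}
embed_dim = 4  # tiny on purpose; the text above uses 512

# One row of random numbers per token ID: shape [vocab_size, embed_dim]
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(embed_dim)]
    for _ in range(len(vocab))
]

def embed(token_ids):
    """Look up one row per token ID (hypothetical helper, pure indexing)."""
    return [embedding_matrix[i] for i in token_ids]

token_ids = [vocab[word] for word in "The cat sat on the mat".split()]
vectors = embed(token_ids)

print(len(vectors), "tokens,", len(vectors[0]), "numbers each")

# The same ID always returns the same row, wherever it appears in the sentence
cat_twice = embed([vocab["cat"], vocab["cat"]])
print(cat_twice[0] == cat_twice[1])
```

This position-independence is exactly why the initial embeddings are called static: context only enters later, through attention.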
Visualizing the Embedding Space
Imagine a vast 512-dimensional space where every word in the vocabulary has a position. A simplified 3D view of such a space shows how similar words cluster together:
[Interactive figure: 3D Embedding Space, with each token shown as a 3D vector (d_model = 3)]
The visualization demonstrates key properties of embedding spaces:
- Semantic Clustering: Related words (royalty, people, animals, food) form distinct clusters
- Gender as a Direction: The vector from "man" to "woman" is similar to "king" to "queen"
- Vector Arithmetic: king − man + woman ≈ queen
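The vector-arithmetic property can be reproduced with a few hand-built vectors. In the sketch below, the two axes (roughly "royalty" and "femininity") and every coordinate are assumptions made up for illustration; real embeddings have hundreds of dimensions with no such clean labels.

```python
import math

# Hand-built 2D vectors: axis 0 ~ "royalty", axis 1 ~ "femininity" (both made up)
vectors = {
    "king":  [0.95, 0.05],
    "queen": [0.90, 0.95],
    "man":   [0.05, 0.10],
    "woman": [0.10, 0.90],
    "apple": [-0.80, 0.40],
}

def nearest(target, candidates):
    """Return the word whose vector has the smallest Euclidean distance to target."""
    return min(candidates, key=lambda word: math.dist(candidates[word], target))

king, man, woman = vectors["king"], vectors["man"], vectors["woman"]

# king - man + woman, computed dimension by dimension
result = [k - m + w for k, m, w in zip(king, man, woman)]

print(nearest(result, vectors))  # prints: queen
```

Subtracting "man" and adding "woman" moves the point along the gender direction while leaving the royalty coordinate almost untouched, which is why it lands next to "queen".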
Tokenization gotchas: Subwords, punctuation, and whitespace matter. Modern tokenizers split "unhappiness" into subpieces, preserve spaces, and fall back to bytes for rare characters.
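To see why whitespace and subwords matter, here is a toy greedy longest-match tokenizer. It is a sketch under stated assumptions: the vocabulary `toy_vocab` is invented, and the matching rule is far simpler than the learned merge rules real tokenizers use. Unknown characters fall back to single-character tokens, loosely mirroring the byte fallback mentioned above.

```python
def tokenize(text, vocab):
    """Greedy longest-match segmentation with single-character fallback."""
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Nothing matched: emit the raw character as its own token
            tokens.append(text[i])
            i += 1
    return tokens

# Invented vocabulary; note that " cat" includes its leading space
toy_vocab = {"un", "happi", "ness", "the", " cat", " the"}

print(tokenize("unhappiness", toy_vocab))  # ['un', 'happi', 'ness']
print(tokenize("the cat", toy_vocab))      # ['the', ' cat']
print(tokenize("the cat!", toy_vocab))     # ['the', ' cat', '!']
```

Keeping the space attached to " cat" is a design choice many modern tokenizers make so that the original text can be reconstructed exactly by concatenating tokens.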
Step 3: Learning Contextual Meaning
Here's the crucial insight: the same word can mean different things in different contexts. The word "bank" in "river bank" vs "bank account" should have different representations.
Static vs Contextual Embeddings
| Type | Description | Example |
|---|---|---|
| Static (Word2Vec) | Same vector always | "bank" → same vector everywhere |
| Contextual (Transformer) | Vector depends on context | "river bank" → different from "money bank" |
The Transformer's Job: Take the initial static embeddings and transform them into contextual embeddings. Each vector gets updated based on ALL the other words in the sentence.
How Context Changes Vectors

    # Initial embeddings (from lookup table)
    # These are the SAME for "bank" regardless of context
    sentence_1 = "I walked along the river bank"
    sentence_2 = "I deposited money in the bank"

    # After the Transformer processes these sentences:
    # The vector for "bank" is DIFFERENT in each case!

    # Conceptually:
    # bank_river = original_bank + context_from_river_walked_along
    # bank_money = original_bank + context_from_deposited_money

    # The neural network learns to:
    # 1. Look at surrounding words
    # 2. Decide which aspects of meaning are relevant
    # 3. Update the vector accordingly

This is why Transformers are so powerful. Each vector in the output contains information from the entire input sequence, not just the original token.
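The "original vector plus context" idea for "bank" can be mimicked with a deliberately crude rule: blend a word's static vector with the average of its neighbours. The 2D vectors, the `mix` parameter, and the helper `contextualize` are all assumptions for illustration. A real Transformer learns how much to take from each neighbour instead of averaging uniformly.

```python
# Made-up static vectors: axis 0 ~ "nature", axis 1 ~ "finance"
static = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank":  [0.5, 0.5],  # ambiguous on its own
}

def contextualize(word, context_words, mix=0.5):
    """Blend a word's static vector with the mean of its context vectors."""
    base = static[word]
    dims = range(len(base))
    context_mean = [
        sum(static[c][d] for c in context_words) / len(context_words)
        for d in dims
    ]
    return [(1 - mix) * base[d] + mix * context_mean[d] for d in dims]

bank_river = contextualize("bank", ["river"])
bank_money = contextualize("bank", ["money"])

print("bank near 'river':", bank_river)
print("bank near 'money':", bank_money)
```

The same starting vector ends up in two different places depending on its neighbours, which is the whole point of contextual embeddings.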
Step 4: Attention – The Magic Ingredient
How does a Transformer update vectors based on context? Through Attention – a mechanism that allows each token to "look at" every other token and decide what information to gather.
The Intuition Behind Attention

    Consider: "The cat sat on the mat because it was tired"

    When processing "it", which words should influence its meaning?
    ───────────────────────────────────────────────────────────

    The     → 0.02  (not very relevant)
    cat     → 0.85  (highly relevant! "it" refers to "cat")
    sat     → 0.05  (somewhat relevant)
    on      → 0.01  (not relevant)
    the     → 0.01  (not relevant)
    mat     → 0.03  (could be relevant, but "cat" is more likely)
    because → 0.01  (not relevant)
    it      → 0.01  (self-reference)
    was     → 0.01  (not relevant)
    tired   → 0.00  (not relevant)
              ─────
            = 1.00  (probabilities sum to 1)

    Attention learns these weights automatically!

The Math in One Slide
Each token asks three questions:
- Query: "What am I looking for?"
- Key: "What do I contain?"
- Value: "What information should I share?"

    import torch

    # Simplified attention mechanism for a single query token
    def attention(query, keys, values):
        """
        query:  What this token is looking for  [d_k]
        keys:   What each token contains        [seq_len, d_k]
        values: What each token can share       [seq_len, d_v]
        """
        # Step 1: How well does my query match each key?
        scores = query @ keys.T  # [seq_len], one score per token

        # Step 2: Convert to probabilities (which tokens to focus on)
        weights = torch.softmax(scores, dim=-1)  # [seq_len], sums to 1

        # Step 3: Weighted sum of values
        output = weights @ values  # [d_v], blend of all values

        return output

    # The result: A new vector that contains information
    # from all tokens, weighted by relevance!

Why "Attention"? Just like humans pay attention to relevant words when understanding a sentence, the Transformer learns to focus on the most relevant tokens for each position.
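The same three steps (score, softmax, weighted sum) can be traced by hand with tiny numbers. In this sketch the tokens, keys, values, and query are all made-up 2D vectors, chosen so that the query for "it" matches the key for "cat" best. The helper name `attend` is hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query attention: returns (weights, blended value vector)."""
    # Step 1: dot product of the query with every key
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 2: scores -> weights
    weights = softmax(scores)
    # Step 3: weighted sum of the value vectors
    d_v = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_v)]
    return weights, output

tokens = ["cat", "mat", "it"]
keys = [[4.0, 0.0], [1.0, 1.0], [0.0, 0.0]]    # made-up keys
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # made-up values
query_it = [1.0, 0.0]                          # what "it" is looking for

weights, output = attend(query_it, keys, values)

for token, weight in zip(tokens, weights):
    print(f"{token:>4}: {weight:.3f}")
print("new vector for 'it':", [round(x, 3) for x in output])
```

Because "cat" receives most of the weight, the new vector for "it" ends up very close to the value vector of "cat".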
Bridging to the Math (Shapes)

    Input:   X                         -> [seq, d_model]
    Linear projections: W_Q, W_K map d_model -> d_k; W_V maps d_model -> d_v
    Q = X W_Q                          -> [seq, d_k]
    K = X W_K                          -> [seq, d_k]
    V = X W_V                          -> [seq, d_v]
    Scores  = Q K^T / sqrt(d_k)        -> [seq, seq]
    Weights = softmax(Scores, dim=-1)  -> [seq, seq]  (rows sum to 1)
    Output  = Weights V                -> [seq, d_v]  (context-blended vectors)

Micro-exercise: Pick one token (e.g., "it"). Imagine assigning higher weights to relevant words (e.g., "cat"). The weighted sum of value vectors is the new, contextual embedding for that token.
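The shape bookkeeping above can be checked with a small dependency-free sketch. The sizes (`seq = 5`, `d_model = 8`, `d_k = 4`, `d_v = 6`) and the random weights are assumptions, and the list-based `matmul` is written for clarity rather than speed; in practice a tensor library does this work.

```python
import math
import random

random.seed(1)

def matmul(a, b):
    """Multiply matrix a [n, m] by matrix b [m, p], giving [n, p]."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax_rows(m):
    """Apply softmax to each row independently."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def shape(m):
    return (len(m), len(m[0]))

seq, d_model, d_k, d_v = 5, 8, 4, 6  # toy sizes (assumptions)

X = rand_matrix(seq, d_model)
W_Q = rand_matrix(d_model, d_k)
W_K = rand_matrix(d_model, d_k)
W_V = rand_matrix(d_model, d_v)

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
weights = softmax_rows(scores)
output = matmul(weights, V)

print("Q:", shape(Q), "K:", shape(K), "V:", shape(V))
print("scores:", shape(scores), "output:", shape(output))
```

Dividing by sqrt(d_k) keeps the scores from growing with the key dimension, which would otherwise push the softmax toward near one-hot weights.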
Step 5: Predicting the Next Token
After all the attention layers have processed the input, we have rich contextual vectors. Now we need to make a prediction: what comes next?
From Vectors to Probabilities

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Stand-ins so the snippet runs on its own; in a real model these come
    # from the Transformer stack and its output head
    batch, seq_len, hidden_dim, vocab_size = 1, 6, 512, 50000
    transformer_output = torch.randn(batch, seq_len, hidden_dim)
    linear_layer = nn.Linear(hidden_dim, vocab_size)

    # After Transformer processing, we have a contextual vector
    # for the last position
    last_hidden_state = transformer_output[:, -1, :]  # [batch, hidden_dim]

    # Project to vocabulary size
    logits = linear_layer(last_hidden_state)  # [batch, vocab_size]
    # If vocab_size = 50,000, we get 50,000 numbers

    # Convert to probabilities
    probabilities = F.softmax(logits, dim=-1)  # [batch, vocab_size]
    # Now each of the 50,000 tokens has a probability

    # Example output of a trained model for "The cat sat on the ___"
    # mat: 0.15 ← highest probability
    # floor: 0.12
    # ground: 0.08
    # chair: 0.07
    # ...
    # banana: 0.0001 ← very unlikely in this context

Sampling the Next Token
Once we have probabilities, we can choose the next token:
| Method | Description | Use Case |
|---|---|---|
| Greedy | Always pick highest probability | Deterministic, but boring |
| Top-k | Sample from top k tokens | More creative, some randomness |
| Top-p (nucleus) | Sample from smallest set with cumulative prob ≥ p | Most natural balance |
| Temperature | Sharpen or flatten distribution | Control creativity level |

    import torch

    def generate_next_token(probabilities, method="greedy", temperature=1.0):
        if method == "greedy":
            # Always pick the most likely token
            return probabilities.argmax()

        elif method == "sample":
            # Apply temperature (higher = more random).
            # p ** (1 / T), renormalized, is equivalent to softmax(logits / T).
            scaled = probabilities ** (1 / temperature)
            scaled = scaled / scaled.sum()  # renormalize

            # Randomly sample based on probabilities
            return torch.multinomial(scaled, 1)

        raise ValueError(f"Unknown method: {method}")

    # With temperature:
    # - temperature < 1: More focused (picks likely tokens)
    # - temperature = 1: Original distribution
    # - temperature > 1: More random (explores unlikely tokens)

Sampling pitfalls: Greedy can loop on common words; high temperature can produce nonsense; top-k/top-p balance surprise vs coherence. Repetition penalties and minimum-length constraints often help avoid degenerate loops.
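Top-k and top-p from the table above both work by trimming the distribution before sampling. The following sketch shows one straightforward way to do that on a plain list of probabilities. The five-word vocabulary and its probabilities are invented, and the helper names `top_k_filter` and `top_p_filter` are hypothetical.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

vocab = ["mat", "floor", "ground", "chair", "banana"]  # invented example
probs = [0.40, 0.30, 0.15, 0.10, 0.05]

top2 = top_k_filter(probs, k=2)       # only "mat" and "floor" survive
nucleus = top_p_filter(probs, p=0.8)  # "mat", "floor", "ground" survive

rng = random.Random(0)
choice = rng.choices(vocab, weights=nucleus, k=1)[0]

print("top-k:", [round(x, 3) for x in top2])
print("top-p:", [round(x, 3) for x in nucleus])
print("sampled:", choice)
```

Top-k always keeps a fixed number of candidates, while top-p adapts: a confident distribution keeps few tokens and a flat one keeps many.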
Putting It All Together
Now let's see the complete flow from input text to generated output:
[Interactive figure: The Complete Transformer Flow, from input text to generated output]
Input → Step 1: Tokenize → Step 2: Embed → Step 3: Add position → Step 4: Transformer layers → Step 5: Predict next token → Final output
Why This Works So Well
The Transformer architecture has several key properties that make it remarkably effective:
1. Parallel Processing
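Unlike RNNs, which process one token at a time, Transformers process all tokens of a sequence simultaneously during training. This maps well onto GPUs and makes training far more efficient. (Generation is still sequential: each new token is produced one at a time.)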
2. Long-Range Dependencies
Every token can directly attend to every other token, no matter how far apart:

    "The cat that the dog chased ran away"
          ↑                       ↑
          └───────────────────────┘
          Direct attention connection!

    An RNN would need to pass information about "cat" through every
    token in between ("that the dog chased") before reaching "ran".
    A Transformer connects them directly in one attention operation.

3. Learned Representations
The embedding vectors aren't hand-crafted – they're learned from data:
| Training Objective | What the Model Learns |
|---|---|
| Predict next token | Grammar, facts, reasoning patterns |
| Masked word prediction | Bidirectional context understanding |
| Translation | Cross-lingual meaning preservation |
| Question answering | Information retrieval and synthesis |
4. Scalable Architecture

    # The power of scale
    models = {
        "GPT-2 Small": {"params": "117M", "layers": 12, "d_model": 768},
        "GPT-2 Large": {"params": "774M", "layers": 36, "d_model": 1280},
        "GPT-3": {"params": "175B", "layers": 96, "d_model": 12288},
        # GPT-4's size has not been officially disclosed; this is an unconfirmed estimate
        "GPT-4": {"params": "~1.7T (unconfirmed)", "layers": "?", "d_model": "?"},
    }

    # More parameters = more capacity to learn patterns
    # The same basic architecture scales from millions to hundreds of billions of parameters!

| Factor | Effect | Trade-off |
|---|---|---|
| Longer context | Can attend farther back | Attention cost grows ~seq^2 |
| More layers | Deeper reasoning signal | Training time ↑, overfitting risk |
| Bigger d_model | Richer embeddings | Memory/compute ↑, harder to train |
Longer context doesn't mean perfect memory of everything—attention weights still dilute across many tokens.
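The "~seq^2" entry in the table follows from the attention score matrix having one entry for every pair of tokens. A quick back-of-the-envelope sketch, with a hypothetical helper `attention_score_entries` and arbitrarily chosen context lengths, makes the growth visible.

```python
def attention_score_entries(seq_len, n_heads=1, n_layers=1):
    """Number of query-key scores computed in one forward pass."""
    return seq_len * seq_len * n_heads * n_layers

for seq_len in (1_024, 2_048, 4_096):
    entries = attention_score_entries(seq_len)
    print(f"{seq_len:>5} tokens -> {entries:>12,} scores per head per layer")

# Doubling the context length quadruples the number of scores
ratio = attention_score_entries(2_048) / attention_score_entries(1_024)
print("growth factor when context doubles:", ratio)
```

This counts only the score matrix; the projections and feedforward layers add further cost that grows linearly with sequence length.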
Common Questions
- Are embeddings the same as meaning? No—they're learned coordinates that correlate with usage; they can encode bias or spurious patterns.
- Why 512 or 768 dimensions? It's a capacity/efficiency trade-off; more dims can express more nuance but cost more.
- Why do probabilities sum to 1? Softmax normalizes logits so each position represents a valid distribution over the vocab.
- Is attention a filter or a blender? It blends values using learned weights; low weights effectively filter.
- Do Transformers read left-to-right? Only if masked; unmasked self-attention sees the whole sequence at once.
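The last answer, about masking, is easiest to see in code. The sketch below applies a causal mask by letting position i compute its softmax only over positions 0..i. The helper name `causal_softmax` is hypothetical, and the all-zero scores are chosen so the effect of the mask stands out. Library implementations usually get the same result by setting masked scores to negative infinity before the softmax.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # current and earlier positions only
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

seq = 4
scores = [[0.0] * seq for _ in range(seq)]  # equal scores everywhere
weights = causal_softmax(scores)

for row in weights:
    print([round(w, 2) for w in row])
```

With equal scores, each row spreads its weight evenly over the tokens it is allowed to see, so the first token attends only to itself and the last attends to all four.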
Check Your Understanding
- Why do we convert tokens to continuous vectors before doing anything else?
- How does attention decide which tokens influence each other?
- When would you prefer top-p sampling over greedy decoding?
Try answering before peeking back—if you can explain these in your own words, you're ready to dive into the math in the next chapters.
Summary
You now understand the fundamental insight behind Transformers:
- Vectorize: Convert tokens into learnable vectors
- Contextualize: Use attention to update vectors with context
- Predict: Convert final vectors to probability distributions
- Generate: Sample from probabilities to produce output
The Key Takeaway: Each number in a token's vector encodes information about how that token is typically used. Training adjusts these numbers so the model can predict what comes next. The better the predictions, the more the vectors capture about language, knowledge, and reasoning.
In the following chapters, we'll implement each of these components from scratch. You'll build:
- Embedding layers that convert tokens to vectors
- Attention mechanisms that enable contextual understanding
- Transformer layers that stack to create deep representations
- Output heads that convert vectors to predictions
- A complete translation model that puts it all together
Next: In the following sections, we'll cover the technical prerequisites you need to implement these ideas in code.