Chapter 18
22 min read
Section 57 of 65

Text Preprocessing and Embeddings

Text Classification Project

From Text to Numbers: The Core Problem

A neural network does not understand words. It understands tensors of floating-point numbers. Every layer we have studied so far — linear projections, convolutions, recurrences, and attention — is defined as operations over real-valued vectors. So before a single weight can be applied to the sentence “The cat sat on the mat”, we must answer a deceptively hard question: how do we turn a string into a vector without destroying its meaning?

That question has one of the most consequential answers in modern machine learning: we use an embedding — a learnable map from discrete tokens into a continuous vector space where geometry encodes meaning. Everything that comes later in the model — the attention scores in a Transformer, the hidden state of an LSTM, the logits a language model produces — is ultimately a function of those initial vectors. Get the embedding wrong and nothing downstream can rescue the model.

The promise of embeddings: if we can arrange words in space so that semantically similar words land near each other, then linear operations in that space become semantic operations. The famous example kingman+womanqueen\text{king} - \text{man} + \text{woman} \approx \text{queen} is possible precisely because the learned geometry reflects meaning.

In this section we will walk the entire pipeline — from raw Unicode text to the matrix of dense vectors that the next layer consumes. We will build every piece twice: first from scratch in NumPy, so every number is verifiable; then in PyTorch with nn.Embedding\texttt{nn.Embedding} so you see how the same idea scales into a trainable layer. Along the way we will connect embeddings to the layers that will dominate the rest of the book: positional encodings, multi-head attention, Flash Attention, and the KV-cache that makes modern LLM inference feasible.


The Preprocessing Pipeline

Preprocessing is the sequence of transformations that converts raw text into a tensor of integer IDs. Different systems differ in the details but almost every modern pipeline has the same five stages:

  1. Normalize — collapse surface variation that is not semantically meaningful: case, whitespace, Unicode compatibility forms, and sometimes accents. The string “The” and “the” should hit the same vocabulary entry.
  2. Tokenize — split the normalized string into atomic units. Word-level splits on whitespace; character-level uses single characters; subword approaches like Byte-Pair Encoding (BPE) sit in between. The choice fundamentally shapes what the model can and cannot express.
  3. Build a vocabulary — a bijection between token strings and small integers. This vocabulary is fixed at training time and shared between training and inference.
  4. Index — look up each token in the vocabulary and replace it with its integer ID. The string disappears; only an array of integers remains.
  5. Embed — map each integer ID to a dense, d-dimensional vector via a learnable lookup table. This is the bridge from discrete language to continuous geometry.

Visually, the data flows from a string on the left to a T×dT \times d matrix of floats on the right:

Loading preprocessing pipeline…

The first four stages are deterministic string manipulation. The fifth — embedding — is where the learning happens. Everything interesting in this section lives in that last arrow.


One-Hot Encoding: The Naive Baseline

The simplest way to turn an integer ID into a vector is one-hot encoding. For a vocabulary of size VV, token ii becomes a vector of length VV with a 11 at position ii and 00 everywhere else. Formally, the one-hot vector eiRV\mathbf{e}_i \in \mathbb{R}^V satisfies (ei)j=δij(\mathbf{e}_i)_j = \delta_{ij}, the Kronecker delta.

For our sentence with vocabulary {pad:0, unk:1, the:2, cat:3, sat:4, on:5, mat:6}\{\langle \text{pad}\rangle:0,\ \langle \text{unk}\rangle:1,\ \text{the}:2,\ \text{cat}:3,\ \text{sat}:4,\ \text{on}:5,\ \text{mat}:6\}, the six tokens become a 6×76 \times 7 matrix:

⟨pad⟩⟨unk⟩thecatsatonmat
the0010000
cat0001000
sat0000100
on0000010
the0010000
mat0000001

Notice that both occurrences of the have identical rows — the vocabulary guarantees that identical strings produce identical vectors. This is already a property we want: a neural network should not care which copy of the it is looking at.

Why one-hot is sometimes enough: for small vocabularies (classification into a handful of categories) one-hot is perfectly reasonable. It is the loss target of every softmax classifier you will ever train. The trouble starts when VV gets large and we want similar inputs to have similar vectors.

Why One-Hot Fails: The Curse of Orthogonality

One-hot vectors have two fatal flaws for language modelling.

1. They are orthogonal — so every pair of different words is equally dissimilar

The dot product of any two distinct one-hot vectors is zero: eiej=δij\mathbf{e}_i \cdot \mathbf{e}_j = \delta_{ij}. That means the model has no way to learn that cat and kitten are closer in meaning than cat and photosynthesis. Every word is equidistant from every other word. Geometry carries zero semantic information.

2. They are extremely high-dimensional and sparse

A vocabulary of V=128,000V = 128{,}000 (LLaMA 3) means every word becomes a 128,000-dimensional vector. Storing a 1,000-token sequence at float32 would take 1000×128000×4 bytes512MB1000 \times 128000 \times 4 \text{ bytes} \approx 512\,\text{MB} — and the first layer would need to multiply that by a 128000×d128000 \times d weight matrix, almost all of which gets multiplied by zeros. This is wasteful in both memory and compute.

3. The number of parameters explodes linearly in vocab size

Any dense layer fed by a one-hot input has V×dV \times d weights. A word-level model for English can easily reach V>105V > 10^5 — already in the tens of millions of parameters just to consume the input.

We need a representation where (i) similar words live near each other, (ii) the dimensionality dd is much smaller than VV, and (iii) the model learns the geometry from data instead of us hand-designing it. That representation is an embedding.

Embeddings: Dense Vectors with Meaning

An embedding is a function Embed:{0,1,,V1}Rd\text{Embed}: \{0, 1, \ldots, V-1\} \to \mathbb{R}^d that assigns every token ID a dense vector of dimension dVd \ll V. The function is represented as a lookup table: a matrix ERV×d\mathbf{E} \in \mathbb{R}^{V \times d} whose ii-th row is the vector for token ii. The lookup is simply row indexing:

Embed(i)=E[i,:]Rd\text{Embed}(i) = \mathbf{E}[i, :] \in \mathbb{R}^d

Three design decisions give embeddings their power:

  1. Every row is a separate learnable parameter. During training, gradient descent can move each token’s vector independently, so the model discovers its own geometry.
  2. The dimension dd is small. Typical values: 128 (word2vec), 300 (GloVe), 768 (BERT-base), 4096 (LLaMA-2-7B), 12288 (GPT-3). Much smaller than any real vocabulary.
  3. Rows can be close. Nothing stops cat and kitten from pointing in nearly the same direction — and training actively encourages it when their contexts are similar. That is the Distributional Hypothesis made concrete: you shall know a word by the company it keeps.
Embeddings are parameters, not hyperparameters. They are learned end-to-end with the rest of the network. You do not need to pre-train with word2vec first (though you can). The exact same backpropagation signal that tunes the attention weights also tunes the embedding rows.

The Mathematics of Embedding Lookup

Here is a piece of trivia with serious consequences: embedding lookup is a matrix multiplication in disguise.

Let O{0,1}T×V\mathbf{O} \in \{0,1\}^{T \times V} be the one-hot matrix for a sequence of TT tokens, and ERV×d\mathbf{E} \in \mathbb{R}^{V \times d} the embedding matrix. Then the embedded sequence is:

X=OERT×d\mathbf{X} = \mathbf{O}\mathbf{E} \in \mathbb{R}^{T \times d}

Because row ii of O\mathbf{O} contains a single 11 at column idi\text{id}_i and zeros elsewhere, the inner product O[i,:]E[:,j]=E[idi,j]\mathbf{O}[i,:] \cdot \mathbf{E}[:,j] = \mathbf{E}[\text{id}_i, j]. The full product collapses to row selection:

X[i,:]=E[idi,:]\mathbf{X}[i, :] = \mathbf{E}[\text{id}_i, :]

So OE\mathbf{O}\mathbf{E} and the fancy index E[ids]\mathbf{E}[\text{ids}] compute the exact same thing. The matmul form costs O(TVd)O(T \cdot V \cdot d); the lookup form costs O(Td)O(T \cdot d). For modern vocabularies with V>105V > 10^5 the speedup is five orders of magnitude per token.

The interactive below makes the identity concrete. Click any row of O\mathbf{O} to see exactly which row of E\mathbf{E} is “picked up”. Toggle between the full matmul and the direct lookup and notice that the output matrix is bit-identical.

Loading one-hot × E visualizer…
This identity is why every framework implements embeddings as lookup, not matmul. PyTorch’s nn.Embedding\texttt{nn.Embedding}, TensorFlow’s tf.nn.embedding_lookup\texttt{tf.nn.embedding\_lookup}, and JAX’s jax.numpy.take\texttt{jax.numpy.take} all do the same thing under the hood: integer-indexed row selection with a sparse backward pass.

The sparse gradient

The matmul view also explains the gradient. If loss LL depends on X=OE\mathbf{X} = \mathbf{O}\mathbf{E}, then by the chain rule:

LE=OLX\frac{\partial L}{\partial \mathbf{E}} = \mathbf{O}^{\top} \frac{\partial L}{\partial \mathbf{X}}

Because O\mathbf{O}^{\top} is almost entirely zeros, almost every entry of the gradient L/E\partial L / \partial \mathbf{E} is also zero. Only the rows corresponding to token IDs that actually appeared in the batch receive updates. Frameworks exploit this by storing the embedding gradient in a sparse format and by providing specialized optimizer variants (e.g. torch.optim.SparseAdam\texttt{torch.optim.SparseAdam}).


Building It From Scratch in NumPy

Let us implement the whole pipeline — normalize, tokenize, build vocab, index, and embed — using nothing but NumPy. Every line is annotated with the exact values it produces. Click any line on the right to see the execution state on the left.

From raw string to embedded matrix — pure NumPy
🐍tokenize_and_embed.py
1import numpy as np

NumPy is Python's numerical computing library. It gives us the ndarray (N-dimensional array), matrix multiplication via @, fancy indexing (E[input_ids]) and broadcasting — everything we need to turn text into numbers without a single Python loop over the heavy math.

EXECUTION STATE
numpy = Fast C-backed array library. We will use np.array, np.zeros, the @ operator, and fancy indexing E[ids].
as np = Universal alias — lets us write np.array() instead of numpy.array().
3# Section header — Raw input

Comment marking the start of the pipeline. Our input is plain Unicode text — the hardest format for a neural network to consume, since layers only understand floating-point tensors.

4text = "The cat sat on the mat"

A six-word sentence. Notice it contains the word 'the' twice and starts with a capital T — both details the pipeline must handle. Neural networks cannot ingest strings: every character here must eventually become a number.

EXECUTION STATE
text = "The cat sat on the mat"
type(text) = Python str — a sequence of Unicode code points
len(text) = 22 characters (including the 5 spaces)
6# Section header — Normalize

Before tokenizing we collapse surface variation. 'The' and 'the' are the same word semantically, but NumPy would treat them as different strings. Lowercasing removes that accident of capitalization.

7normalized = text.lower().strip()

Two Python string methods chained together. .lower() turns every uppercase letter into its lowercase form; .strip() removes leading/trailing whitespace (harmless here but essential for user input). This is the simplest form of text normalization — real pipelines also strip accents, collapse Unicode compatibility forms, and optionally remove punctuation.

EXECUTION STATE
📚 str.lower() = Returns a new string with every character in its lowercase form. 'The' → 'the'. Does not modify the original string — Python strings are immutable.
📚 str.strip() = Returns a new string with leading and trailing whitespace removed. Example: ' hi '.strip() → 'hi'. Does NOT remove internal spaces between words.
⬆ result: normalized = "the cat sat on the mat"
→ why normalize? = Without lowercasing, 'The' and 'the' would occupy two slots in the vocabulary, doubling the parameter count and splitting learning signal. Modern LLMs normalize less aggressively because subword tokenizers handle the variance, but classic NLP always normalizes first.
9# Section header — Tokenize

Tokenization slices the string into units the model will operate on. Word-level (what we do here), character-level, and subword-level (BPE, WordPiece) are the three common choices.

10tokens = normalized.split()

.split() with no argument splits on runs of any whitespace and discards empty strings. It is the simplest word tokenizer — fast and good enough for English sentences without punctuation.

EXECUTION STATE
📚 str.split(sep=None) = Returns a list of substrings. Called with no argument, it splits on any whitespace (spaces, tabs, newlines). Example: 'a b\t c'.split() → ['a', 'b', 'c'].
⬇ input: normalized = "the cat sat on the mat"
⬆ result: tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
→ length = 6 tokens. Note that 'the' appears twice — the tokenizer does not deduplicate, it preserves sequence order.
11# tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']

Inline comment showing the exact result so you do not have to trace the code mentally. Notice the repeated 'the' — later we will confirm that both occurrences resolve to the same integer ID, and therefore the same embedding vector.

13# Section header — Build vocabulary

A vocabulary is a dictionary mapping every unique token string to a unique integer. This is how we turn a list of words into a list of numbers the model can process.

14vocab = {'<pad>': 0, '<unk>': 1}

We seed the vocabulary with two special symbols before any real word. <pad> is used to fill short sequences in a batch so every sample has the same length (required for matrix math). <unk> is the fallback for words the model has never seen during training. Reserving these at fixed IDs (0 and 1) is convention — PyTorch's nn.Embedding has a dedicated padding_idx argument that expects the pad row.

EXECUTION STATE
<pad> = The padding token. Its embedding row is usually frozen at zeros so it contributes nothing to attention or loss.
<unk> = The unknown-word token. At inference time, any token not in the vocab gets mapped to <unk>'s ID. Rare in modern LLMs because subword tokenizers can represent any string.
⬆ vocab (so far) = {'<pad>': 0, '<unk>': 1}
15for tok in tokens:

Iterate through every token in sequence order. We add each new word to the vocabulary the first time we see it.

LOOP TRACE · 6 iterations
iter 1: tok='the'
new? = yes — 'the' not in vocab
assigned id = 2 (len(vocab) before adding)
iter 2: tok='cat'
new? = yes
assigned id = 3
iter 3: tok='sat'
new? = yes
assigned id = 4
iter 4: tok='on'
new? = yes
assigned id = 5
iter 5: tok='the'
new? = no — already has id 2
action = skip
iter 6: tok='mat'
new? = yes
assigned id = 6
16if tok not in vocab:

Guard so we only assign an ID the first time a word appears. Python dict membership test runs in O(1) average time — great for vocabularies with millions of entries.

EXECUTION STATE
tok not in vocab = True when 'tok' is a string that has never been seen. False for repeat occurrences like the second 'the'.
17vocab[tok] = len(vocab)

Assign the next sequential integer. len(vocab) returns the current count of entries — since we add one per call, IDs grow 2, 3, 4, … in insertion order.

EXECUTION STATE
📚 len(dict) = Returns the number of key-value pairs. O(1) — Python stores the count internally.
⬆ final vocab = {'<pad>': 0, '<unk>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'on': 5, 'mat': 6}
→ vocab_size = 7 total entries
19# Section header — Tokens to IDs

Now that we have a word→id mapping, we can convert the token list into a pure array of integers — exactly the format nn.Embedding expects as input.

20input_ids = np.array([vocab.get(t, vocab['<unk>']) for t in tokens])

A list comprehension wrapped in np.array() — the Pythonic way to build an ndarray. For each token, look up its ID in the vocab, defaulting to the <unk> ID if missing. Wrapping in np.array() converts the Python list into a dense integer ndarray.

EXECUTION STATE
📚 dict.get(key, default) = Return vocab[key] if the key exists, otherwise return the default. Example: vocab.get('hello', 1) → 1 (since 'hello' not in vocab). The default argument is why unseen tokens silently become <unk> instead of crashing with KeyError.
📚 np.array([...]) = Convert a Python list (of ints here) into an ndarray. Gives us vectorized operations, broadcasting, and the @ operator. dtype is inferred as int64 for a list of ints.
⬆ result: input_ids = [2, 3, 4, 5, 2, 6]
→ shape = (6,) — a 1-D array of 6 integers
→ observation = Notice positions 0 and 4 both hold 2 — the two occurrences of 'the' map to the same ID. This is the whole point: identical words → identical IDs → identical embedding rows.
22# Section header — One-hot encoding

One-hot is the naive first attempt at turning IDs into vectors. Every token becomes a vector of length vocab_size containing a single 1.0 and zeros everywhere else. It works, but it is enormously wasteful — we are about to see why.

23vocab_size = len(vocab)

Cache the vocabulary size — it will be the second dimension of the one-hot matrix.

EXECUTION STATE
vocab_size = 7 — because our vocab has 7 entries: 2 specials + 5 unique words
→ scale reality = In production: GPT-2 used 50,257. LLaMA 3 uses 128,256. A naive one-hot for a 128k-word vocab means every single token would be a 128,000-dimensional vector — clearly unusable.
24one_hot = np.zeros((len(input_ids), vocab_size))

Pre-allocate a T×V matrix of zeros, where T=6 is the sequence length and V=7 is the vocabulary size. We will flip exactly one entry per row to 1.0 in the next loop.

EXECUTION STATE
📚 np.zeros(shape) = Creates an ndarray filled with 0.0 of the given shape. Default dtype is float64. Example: np.zeros((2,3)) → [[0,0,0],[0,0,0]].
⬇ arg: shape = (6, 7) — a tuple: (seq_len, vocab_size)
⬆ result: one_hot (6×7) =
all zeros — 42 floats, all 0.0
25for i, idx in enumerate(input_ids):

Loop over (position, token_id) pairs. enumerate yields (0, 2), (1, 3), (2, 4), (3, 5), (4, 2), (5, 6) — exactly the positions where we need to flip a zero to a one.

LOOP TRACE · 6 iterations
i=0, idx=2 (the)
action = one_hot[0, 2] = 1.0 — flip row 0, column 2
i=1, idx=3 (cat)
action = one_hot[1, 3] = 1.0
i=2, idx=4 (sat)
action = one_hot[2, 4] = 1.0
i=3, idx=5 (on)
action = one_hot[3, 5] = 1.0
i=4, idx=2 (the again)
action = one_hot[4, 2] = 1.0 — same column as row 0
i=5, idx=6 (mat)
action = one_hot[5, 6] = 1.0
26one_hot[i, idx] = 1.0

NumPy fancy indexing: one_hot[i, idx] selects row i column idx. Assigning 1.0 flips that single float. After the loop, each row holds exactly one 1.0 — the signature of a one-hot encoding.

EXECUTION STATE
⬆ one_hot (6×7) after loop =
       <pad> <unk>  the  cat  sat   on  mat
row0    0    0    1    0    0    0    0
row1    0    0    0    1    0    0    0
row2    0    0    0    0    1    0    0
row3    0    0    0    0    0    1    0
row4    0    0    1    0    0    0    0
row5    0    0    0    0    0    0    1
→ density = 6 ones out of 42 floats = 14% density. For a 128k vocab it would be 0.0008% density — almost all memory is wasted zeros.
28# Section header — Embedding matrix

The embedding matrix E is the lookup table. Row i is the d_embed-dimensional dense vector for token ID i. Real models learn these rows via backpropagation; here we hand-pick values so you can verify every arithmetic step.

29d_embed = 4

Embedding dimension — how many numbers we use to represent each word. Four is a pedagogical toy; real models use 128 (word2vec), 768 (BERT-base), 4096 (LLaMA-2-7B), or 12288 (GPT-3).

EXECUTION STATE
d_embed = 4 = Each token becomes a 4-dim vector. Parameter count = vocab_size × d_embed = 7 × 4 = 28 floats.
→ scaling = GPT-2 small: 50257 × 768 ≈ 38M parameters just for the embedding layer — about 30% of the entire model's parameters.
30E = np.array([...])

Hand-crafted 7×4 embedding matrix. Each row is one token's 'meaning vector.' We chose these numbers by hand so the arithmetic is verifiable — in practice, they start as small random numbers and move during training.

EXECUTION STATE
⬆ E shape = (7, 4) — vocab_size × d_embed
⬆ E (full matrix) =
         d0    d1    d2    d3
<pad>   0.00  0.00  0.00  0.00
<unk>   0.10  0.20  0.30  0.40
the     0.20  0.10  0.05  0.30
cat     0.80  0.60  0.20  0.10
sat     0.40  0.70  0.90  0.20
on      0.30  0.30  0.10  0.50
mat     0.70  0.50  0.30  0.20
→ <pad> row = All zeros on purpose — padded positions should contribute nothing to downstream computations.
31[0.00, 0.00, 0.00, 0.00], # <pad>

The pad row. Deliberately zeros so that pad positions pass through the network without affecting activations. PyTorch's padding_idx freezes this row automatically.

EXECUTION STATE
E[0] = [0.00, 0.00, 0.00, 0.00]
32[0.10, 0.20, 0.30, 0.40], # <unk>

The unknown-word row — used for any token not seen during training.

EXECUTION STATE
E[1] = [0.10, 0.20, 0.30, 0.40]
33[0.20, 0.10, 0.05, 0.30], # the

The row for 'the'. Small magnitudes — determiners like 'the' usually end up near the origin in learned embeddings because they carry little content.

EXECUTION STATE
E[2] = [0.20, 0.10, 0.05, 0.30]
34[0.80, 0.60, 0.20, 0.10], # cat

The 'cat' row. Larger magnitudes — content words tend to have stronger activations.

EXECUTION STATE
E[3] = [0.80, 0.60, 0.20, 0.10]
35[0.40, 0.70, 0.90, 0.20], # sat

The 'sat' row — a verb.

EXECUTION STATE
E[4] = [0.40, 0.70, 0.90, 0.20]
36[0.30, 0.30, 0.10, 0.50], # on

The 'on' row — a preposition.

EXECUTION STATE
E[5] = [0.30, 0.30, 0.10, 0.50]
37[0.70, 0.50, 0.30, 0.20], # mat

The 'mat' row. Notice its similarity to 'cat' — both are physical objects, and in a real trained embedding they would land close together in vector space.

EXECUTION STATE
E[6] = [0.70, 0.50, 0.30, 0.20]
38])

Closing bracket for np.array(). All 7 rows are now packed into a single (7,4) ndarray.

40# Way A: one-hot @ E (wasteful)

The mathematically clean but slow way to get embeddings: multiply the one-hot matrix by the embedding matrix. This is pedagogically valuable because it shows that embedding lookup IS a matrix multiplication in disguise.

41emb_matmul = one_hot @ E

Matrix multiplication: (6×7) @ (7×4) → (6×4). Each row of the result is one_hot[i] (a row of zeros with a single 1.0) times E (all 7 rows of the embedding matrix). Because only one entry of one_hot[i] is nonzero, the dot product simply picks out that one row of E.

EXECUTION STATE
📚 @ operator = NumPy matrix multiply (also written np.matmul). For 2-D arrays: (M,K) @ (K,N) → (M,N). Under the hood: result[i,j] = sum_k a[i,k] * b[k,j].
⬇ shapes = one_hot (6,7) @ E (7,4) → emb_matmul (6,4). Inner dim 7 matches — required for matmul.
⬆ emb_matmul (6×4) =
       d0    d1    d2    d3
row0  0.20  0.10  0.05  0.30   ← the
row1  0.80  0.60  0.20  0.10   ← cat
row2  0.40  0.70  0.90  0.20   ← sat
row3  0.30  0.30  0.10  0.50   ← on
row4  0.20  0.10  0.05  0.30   ← the (identical to row0)
row5  0.70  0.50  0.30  0.20   ← mat
→ cost analysis = This multiply does 6 × 7 × 4 = 168 multiply-adds. 156 of them multiply by zero — pure waste. For vocab=128k the waste grows linearly in vocab_size.
43# Way B: direct row-lookup (what frameworks do)

The efficient shortcut. Because matmul with a one-hot just selects a row, we can skip the matrix entirely and index directly.

44emb_lookup = E[input_ids]

NumPy fancy indexing: when you index an ndarray with another integer array, you get rows in that order. E[[2,3,4,5,2,6]] returns rows 2, 3, 4, 5, 2, 6 of E in exactly that order — the embeddings for our sentence.

EXECUTION STATE
📚 Fancy indexing E[ids] = Equivalent to np.stack([E[i] for i in ids]). Returns a new array whose i-th row is E[ids[i]]. Crucially, it does NOT build a one-hot matrix internally — it is O(T * d), independent of vocab_size.
⬇ input_ids = [2, 3, 4, 5, 2, 6]
⬆ emb_lookup (6×4) =
       d0    d1    d2    d3
row0  0.20  0.10  0.05  0.30   ← E[2]
row1  0.80  0.60  0.20  0.10   ← E[3]
row2  0.40  0.70  0.90  0.20   ← E[4]
row3  0.30  0.30  0.10  0.50   ← E[5]
row4  0.20  0.10  0.05  0.30   ← E[2] again
row5  0.70  0.50  0.30  0.20   ← E[6]
→ identity check = emb_lookup is bit-exact equal to emb_matmul. Confirmed with np.allclose → True. Both compute the same thing; only their cost differs.
→ gradient implication = In PyTorch the lookup path also has a sparse gradient: only the rows that were looked up receive updates during backprop. This is why embedding gradients are memory-efficient even when vocab_size is huge.
46print("input_ids:", input_ids)

Sanity check — verify the integer sequence that went into the lookup.

EXECUTION STATE
Output = input_ids: [2 3 4 5 2 6]
47print("embedded shape:", emb_lookup.shape)

Shape confirmation — (seq_len, d_embed) is the canonical 2-D shape of a sentence after embedding.

EXECUTION STATE
Output = embedded shape: (6, 4)
48print(emb_lookup)

Print the full matrix. This is the data that the next layer (RNN, attention, CNN, …) receives as input.

EXECUTION STATE
Output =
[[0.2  0.1  0.05 0.3 ]
 [0.8  0.6  0.2  0.1 ]
 [0.4  0.7  0.9  0.2 ]
 [0.3  0.3  0.1  0.5 ]
 [0.2  0.1  0.05 0.3 ]
 [0.7  0.5  0.3  0.2 ]]
10 lines without explanation
1import numpy as np
2
3# --- 1. Raw input text ---
4text = "The cat sat on the mat"
5
6# --- 2. Normalize (lowercase + strip whitespace) ---
7normalized = text.lower().strip()
8
9# --- 3. Tokenize by whitespace ---
10tokens = normalized.split()
11# tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
12
13# --- 4. Build vocabulary with special symbols ---
14vocab = {'<pad>': 0, '<unk>': 1}
15for tok in tokens:
16    if tok not in vocab:
17        vocab[tok] = len(vocab)
18
19# --- 5. Convert tokens to integer IDs ---
20input_ids = np.array([vocab.get(t, vocab['<unk>']) for t in tokens])
21
22# --- 6. One-hot encode each token (the naive approach) ---
23vocab_size = len(vocab)
24one_hot = np.zeros((len(input_ids), vocab_size))
25for i, idx in enumerate(input_ids):
26    one_hot[i, idx] = 1.0
27
28# --- 7. Define an embedding matrix (normally learned) ---
29d_embed = 4
30E = np.array([
31    [0.00, 0.00, 0.00, 0.00],   # <pad>
32    [0.10, 0.20, 0.30, 0.40],   # <unk>
33    [0.20, 0.10, 0.05, 0.30],   # the
34    [0.80, 0.60, 0.20, 0.10],   # cat
35    [0.40, 0.70, 0.90, 0.20],   # sat
36    [0.30, 0.30, 0.10, 0.50],   # on
37    [0.70, 0.50, 0.30, 0.20],   # mat
38])
39
40# --- 8. Way A: one-hot @ E   (O(T*V*d) — wasteful) ---
41emb_matmul = one_hot @ E
42
43# --- 9. Way B: direct row-lookup   (O(T*d) — what real frameworks do) ---
44emb_lookup = E[input_ids]
45
46print("input_ids:", input_ids)
47print("embedded shape:", emb_lookup.shape)
48print(emb_lookup)

Two lines deserve extra attention. Line 41 does the lookup the slow way — one_hot @ E\texttt{one\_hot @ E} — to prove the mathematical identity. Line 44 does it the fast way — E[input_ids]\texttt{E[input\_ids]} — which is what every production framework actually runs. Both produce the same 6×46 \times 4 matrix. Verify it yourself: rows 0 and 4 are bit-for-bit identical because both correspond to the token the.


The Same Thing in PyTorch: nn.Embedding

The NumPy implementation is crystal-clear but it cannot be trained. PyTorch’s nn.Embedding\texttt{nn.Embedding} takes the same idea, wraps the matrix in a nn.Parameter\texttt{nn.Parameter}, and plugs it into autograd — so every row becomes a learnable parameter of the model.

Three features of nn.Embedding\texttt{nn.Embedding} are worth calling out before the code:

  • Sparse gradient: only rows whose indices appeared in the forward pass receive gradient updates.
  • padding_idx\texttt{padding\_idx}: the row at this index is initialized to zeros and its gradient is permanently masked to zero — so pad positions stay dead forever.
  • max_norm\texttt{max\_norm}: optionally renormalizes each looked-up row to a fixed L2 norm on the fly — useful when you want to prevent a runaway-embedding pathology.

Below we run a full forward pass, define a contrived squared-error loss, and call loss.backward()\texttt{loss.backward()}. The annotations walk through the exact gradient computed for each row — including the critical observation that row 2 (the word “the”) accumulates contributions from both positions where it appears.

nn.Embedding + gradient flow — PyTorch
🐍pytorch_embedding.py
1import torch

PyTorch's core module. Provides torch.tensor (GPU-capable n-D arrays), autograd (reverse-mode automatic differentiation), and the linear algebra primitives we will use below.

EXECUTION STATE
torch = Deep-learning framework with autograd and GPU support. Every variable that will be optimized lives inside a torch.Tensor with requires_grad=True.
2import torch.nn as nn

torch.nn collects the building-block layers — Linear, Embedding, LayerNorm, MultiheadAttention, etc. Aliased as 'nn' by convention.

EXECUTION STATE
nn = Namespace for neural-network modules. We use nn.Embedding in a moment — the canonical way to implement a learnable lookup table in PyTorch.
4# Same vocab as before — 7 tokens total

Reusing the vocabulary we built with NumPy so the two implementations are directly comparable.

5vocab_size = 7

Row count of the embedding table — one row per token ID.

EXECUTION STATE
vocab_size = 7
6d_embed = 4

Column count of the embedding table — the dimensionality of each token's vector.

EXECUTION STATE
d_embed = 4
7input_ids = torch.tensor([2, 3, 4, 5, 2, 6])

The same integer sequence we produced in NumPy, now wrapped in a torch.Tensor. PyTorch expects LongTensor (int64) as input to nn.Embedding.

EXECUTION STATE
📚 torch.tensor(data) = Builds a Tensor from a Python list/array. Infers dtype — a list of Python ints produces torch.int64. If you need floats, pass floats or use dtype=torch.float32.
⬆ input_ids = tensor([2, 3, 4, 5, 2, 6])
→ dtype = torch.int64 (long). nn.Embedding REQUIRES a long/int tensor — passing float IDs raises RuntimeError: 'Expected tensor for argument #1 index to have scalar type Long'.
9# nn.Embedding: a learnable lookup table

Header for the layer construction. nn.Embedding is the PyTorch realization of the embedding matrix E we built by hand — but with autograd support so the rows can be learned.

10embedding = nn.Embedding(

Construct a new Embedding module. Under the hood this creates a single learnable parameter: embedding.weight of shape (num_embeddings, embedding_dim). Forward pass does a sparse row-lookup and backward pass accumulates gradients only on the rows that were looked up.

EXECUTION STATE
📚 nn.Embedding = A PyTorch Module that stores a (num_embeddings, embedding_dim) weight matrix and, on forward, returns weight[input_ids]. Internally implemented as a C++ kernel with a sparse backward for efficiency.
11num_embeddings=vocab_size,

Number of rows — how many distinct IDs the table supports. Passing an ID ≥ num_embeddings raises an out-of-bounds error.

EXECUTION STATE
⬇ num_embeddings = 7 — must equal max(input_ids)+1 or larger
12embedding_dim=d_embed,

Number of columns per row — dimensionality of each embedding vector. In full-size transformers this is d_model (768, 4096, 12288).

EXECUTION STATE
⬇ embedding_dim = 4
13padding_idx=0,

Two effects at once: (1) the gradient for row 0 is forced to zero on every backward pass (so its row stays frozen if you never reassign it), and (2) if you initialize naively, row 0 is forced to all-zeros during __init__ so pad positions contribute exactly nothing to subsequent layers.

EXECUTION STATE
⬇ padding_idx = 0 = Index of the <pad> token. PyTorch zeros this row on init and blocks its gradient forever.
→ why it matters = In a batch, short sequences are padded to the max length. Without padding_idx the model would still accumulate attention mass onto pad positions and learn garbage from them. Setting padding_idx tells the model these positions are structurally meaningless.
14)

Close the constructor call. embedding.weight now exists as a learnable parameter of shape (7, 4).

EXECUTION STATE
⬆ embedding.weight.shape = torch.Size([7, 4])
⬆ embedding.weight.requires_grad = True — autograd will track operations on this tensor
16# Pin the weights for reproducibility

Next we overwrite the random initialization with our fixed values so every reader computes the exact same numbers.

17with torch.no_grad():

Open a context where autograd is disabled. Inside this block, assignments to .weight do not get recorded on the computation graph — essential for weight surgery, initialization, and inference.

EXECUTION STATE
📚 torch.no_grad() = Context manager that temporarily sets requires_grad=False and skips graph building. Saves memory and prevents accidentally making your initialization part of the loss graph.
→ when to use = Initialization, weight loading, inference/evaluation, validation loops. Do NOT use during training's forward pass — you would lose the gradient.
18embedding.weight.copy_(torch.tensor([

In-place copy — notice the trailing underscore on .copy_(). PyTorch convention: any method whose name ends in _ modifies the tensor in place. We need in-place here because embedding.weight is a nn.Parameter; we cannot simply replace it or the autograd linkage breaks.

EXECUTION STATE
📚 tensor.copy_(src) = In-place copy from src into self. Preserves self's identity (memory address and Parameter wrapping). Returns self. Required when you want to hot-swap values while keeping requires_grad/Parameter status.
→ alternative = Assigning a new Tensor (embedding.weight = torch.tensor(...)) would break because .weight is a nn.Parameter, not a plain attribute. .copy_() respects the Parameter wrapper.
19[0.00, 0.00, 0.00, 0.00],

Row 0 — the <pad> token. PyTorch already forced this to zeros during __init__ because of padding_idx=0; we re-assert it explicitly.

EXECUTION STATE
weight[0] = [0, 0, 0, 0] — stays zero forever because of padding_idx
20[0.10, 0.20, 0.30, 0.40],

Row 1 — <unk>.

EXECUTION STATE
weight[1] = [0.10, 0.20, 0.30, 0.40]
21[0.20, 0.10, 0.05, 0.30],

Row 2 — 'the'.

EXECUTION STATE
weight[2] = [0.20, 0.10, 0.05, 0.30]
22[0.80, 0.60, 0.20, 0.10],

Row 3 — 'cat'.

EXECUTION STATE
weight[3] = [0.80, 0.60, 0.20, 0.10]
23[0.40, 0.70, 0.90, 0.20],

Row 4 — 'sat'.

EXECUTION STATE
weight[4] = [0.40, 0.70, 0.90, 0.20]
24[0.30, 0.30, 0.10, 0.50],

Row 5 — 'on'.

EXECUTION STATE
weight[5] = [0.30, 0.30, 0.10, 0.50]
25[0.70, 0.50, 0.30, 0.20],

Row 6 — 'mat'.

EXECUTION STATE
weight[6] = [0.70, 0.50, 0.30, 0.20]
26]))

Close the torch.tensor([...]) and the .copy_() call. embedding.weight now holds our fixed values.

28# --- Forward: lookup by ID ---

Everything above was setup. The actual work — turning IDs into vectors — happens in one line below.

29embedded = embedding(input_ids) # shape: (6, 4)

Calling a Module like a function routes through its __call__ → forward(). For Embedding, forward does: return self.weight[input] — pure fancy indexing, identical to E[input_ids] in NumPy. Autograd records the lookup so we can backpropagate into weight later.

EXECUTION STATE
📚 embedding(input_ids) = Shortcut for embedding.__call__(input_ids) which calls embedding.forward(input_ids) plus any registered hooks. Always call the module, not .forward() directly.
⬇ input: input_ids = tensor([2, 3, 4, 5, 2, 6])
⬆ embedded shape = torch.Size([6, 4]) — (seq_len, d_embed)
⬆ embedded values =
tensor([[0.2000, 0.1000, 0.0500, 0.3000],
        [0.8000, 0.6000, 0.2000, 0.1000],
        [0.4000, 0.7000, 0.9000, 0.2000],
        [0.3000, 0.3000, 0.1000, 0.5000],
        [0.2000, 0.1000, 0.0500, 0.3000],
        [0.7000, 0.5000, 0.3000, 0.2000]],
       grad_fn=<EmbeddingBackward0>)
→ grad_fn = EmbeddingBackward0 — the autograd node that will compute ∂loss/∂weight[ids] on backward.
32# --- Training signal: target vectors ---

To see gradients, we need a loss. We invent a target for each position and minimize squared error — just enough to expose the gradient structure.

33target = torch.tensor([

Start of the target tensor. Same shape as 'embedded' — one target vector per position.

34[0.0, 0.0, 0.0, 0.0], # 'the' → 0

Target for position 0: we want 'the' to be pulled toward the origin.

EXECUTION STATE
target[0] = [0.0, 0.0, 0.0, 0.0]
35[1.0, 1.0, 1.0, 1.0], # 'cat' → 1

Target for position 1: 'cat' should be pulled toward all-ones.

EXECUTION STATE
target[1] = [1.0, 1.0, 1.0, 1.0]
36[1.0, 1.0, 1.0, 1.0], # 'sat' → 1

Same target vector for 'sat'.

EXECUTION STATE
target[2] = [1.0, 1.0, 1.0, 1.0]
37[0.5, 0.5, 0.5, 0.5], # 'on' → 0.5

Target for 'on' — the middle point.

EXECUTION STATE
target[3] = [0.5, 0.5, 0.5, 0.5]
38[0.0, 0.0, 0.0, 0.0], # 'the' again → 0

IMPORTANT — same target as position 0. Because 'the' appears twice and its embedding is shared, both positions contribute to the same weight[2] gradient. Their contributions accumulate.

EXECUTION STATE
target[4] = [0.0, 0.0, 0.0, 0.0]
39[1.0, 1.0, 1.0, 1.0], # 'mat' → 1

Target for 'mat'.

EXECUTION STATE
target[5] = [1.0, 1.0, 1.0, 1.0]
40])

Close the target tensor. Shape: (6, 4) — matches 'embedded'.

42loss = ((embedded - target) ** 2).mean()

Mean squared error. embedded - target is element-wise subtraction (shape 6×4). ** 2 squares every element. .mean() averages all 24 numbers into a single scalar. This scalar is the root of the autograd graph we will backpropagate from.

EXECUTION STATE
📚 ** operator on Tensor = Element-wise power. (x ** 2)[i,j] = x[i,j] * x[i,j]. Differentiable — d(x^2)/dx = 2x.
📚 tensor.mean() = Average over all elements by default; pass dim=... to reduce only along an axis. Returns a 0-D tensor (a scalar).
→ per-element step-by-step = diff = embedded - target = row0: [0.20, 0.10, 0.05, 0.30] row1: [-0.20, -0.40, -0.80, -0.90] ... squared, then averaged over all 24 entries.
⬆ loss = tensor(0.1977, grad_fn=<MeanBackward0>)
43loss.backward()

Reverse-mode automatic differentiation. Starting from 'loss' (a scalar), PyTorch walks the computation graph in reverse, computing ∂loss/∂tensor for every tensor with requires_grad=True. For Embedding the crucial fact is that the gradient is sparse — only the rows that were looked up receive non-zero accumulation.

EXECUTION STATE
📚 tensor.backward() = Populates the .grad attribute of every leaf tensor in the graph. Must be called on a scalar (or with a gradient argument for non-scalar). Gradients ACCUMULATE — call .zero_grad() between steps.
→ chain rule at Embedding = Upstream grad arrives as ∂loss/∂embedded (6×4). For weight[k] we sum the rows i where input_ids[i] == k. That is why 'the' (ids 0 and 4) accumulates contributions from TWO positions into weight[2].
⬆ embedding.weight.grad = A new (7, 4) tensor filled by backward. Printed on the next three lines.
45# Only rows for tokens that appeared get non-zero gradients

The takeaway. Rows for <pad> (id 0, frozen by padding_idx) and <unk> (id 1, never appeared) receive zero gradient. Rows for tokens that did appear receive real gradients proportional to the position-wise errors.

46print("grad for row 2 ('the'):", embedding.weight.grad[2])

Row 2 corresponds to 'the'. It appears at positions 0 and 4, both with target [0,0,0,0]. The per-position error is embedded - target = [0.20, 0.10, 0.05, 0.30] (same each time). The gradient accumulates: grad[2] = (2 / 24) × [0.20, 0.10, 0.05, 0.30] = [0.0333, 0.0167, 0.0083, 0.0500].

EXECUTION STATE
derivation = ∂loss/∂weight[2][j] = (1/24) × Σ_i 2 × (embedded[i,j] − target[i,j]) × 1[ids[i]==2] = (1/12) × (diff[0,j] + diff[4,j]) Both diff rows equal [0.20, 0.10, 0.05, 0.30] here, so grad[2] = (2/12) × [0.20, 0.10, 0.05, 0.30].
Output = grad for row 2 ('the'): tensor([0.0333, 0.0167, 0.0083, 0.0500])
→ accumulation proof = If we changed input_ids so 'the' appeared only once, the gradient would halve. Repeated tokens get stronger gradient signal.
47print("grad for row 3 ('cat'):", embedding.weight.grad[3])

Row 3 — 'cat' — appears only at position 1. diff[1] = embedded[1] − target[1] = [0.80, 0.60, 0.20, 0.10] − [1.0, 1.0, 1.0, 1.0] = [−0.20, −0.40, −0.80, −0.90]. Gradient = (1/12) × [−0.20, −0.40, −0.80, −0.90] = [−0.0167, −0.0333, −0.0667, −0.0750].

EXECUTION STATE
Output = grad for row 3 ('cat'): tensor([-0.0167, -0.0333, -0.0667, -0.0750])
→ sign meaning = Negative gradient means 'increase this weight to decrease loss'. SGD update: weight ← weight − lr * grad. With lr=1, weight[3] moves from [0.80, 0.60, 0.20, 0.10] toward [0.82, 0.63, 0.27, 0.18] — closer to the target [1,1,1,1].
48print("grad for row 0 ('<pad>'):", embedding.weight.grad[0])

Row 0 — <pad>. Two reasons its gradient is zero: (1) the token never appears in input_ids, so no forward path touches it; (2) even if it did appear, padding_idx=0 explicitly zeros the gradient. Defense in depth.

EXECUTION STATE
Output = grad for row 0 ('<pad>'): tensor([0., 0., 0., 0.])
→ analogous row = Row 1 ('<unk>') also has all-zero gradient — not because of padding_idx (only row 0 gets that treatment) but because <unk> simply never appeared in input_ids this batch.
7 lines without explanation
1import torch
2import torch.nn as nn
3
4# Same vocab as before — 7 tokens total
5vocab_size = 7
6d_embed = 4
7input_ids = torch.tensor([2, 3, 4, 5, 2, 6])
8
9# nn.Embedding: a learnable lookup table
10embedding = nn.Embedding(
11    num_embeddings=vocab_size,
12    embedding_dim=d_embed,
13    padding_idx=0,
14)
15
16# Pin the weights to our hand-designed matrix for reproducibility
17with torch.no_grad():
18    embedding.weight.copy_(torch.tensor([
19        [0.00, 0.00, 0.00, 0.00],
20        [0.10, 0.20, 0.30, 0.40],
21        [0.20, 0.10, 0.05, 0.30],
22        [0.80, 0.60, 0.20, 0.10],
23        [0.40, 0.70, 0.90, 0.20],
24        [0.30, 0.30, 0.10, 0.50],
25        [0.70, 0.50, 0.30, 0.20],
26    ]))
27
28# --- Forward: lookup by ID ---
29embedded = embedding(input_ids)   # shape: (6, 4)
30
31# --- Training signal: push each token toward a target vector ---
32target = torch.tensor([
33    [0.0, 0.0, 0.0, 0.0],   # want "the"  ~ 0
34    [1.0, 1.0, 1.0, 1.0],   # "cat" ~ 1
35    [1.0, 1.0, 1.0, 1.0],   # "sat" ~ 1
36    [0.5, 0.5, 0.5, 0.5],   # "on"  ~ 0.5
37    [0.0, 0.0, 0.0, 0.0],   # "the" ~ 0  (same target as row 0)
38    [1.0, 1.0, 1.0, 1.0],   # "mat" ~ 1
39])
40
41loss = ((embedded - target) ** 2).mean()
42loss.backward()
43
44# Only rows for tokens that appeared get non-zero gradients
45print("grad for row 2 ('the'):", embedding.weight.grad[2])
46print("grad for row 3 ('cat'):", embedding.weight.grad[3])
47print("grad for row 0 ('<pad>'):", embedding.weight.grad[0])
Gradient accumulation is a feature, not a bug. Because the same row is selected twice (once per occurrence of the), its gradient is the sum of the per-position gradients. That is how a token learns from every context it appears in — the more contexts, the stronger the learning signal. This is also why embeddings for frequent words (like the) typically converge much faster than embeddings for rare words.

The panel below traces exactly which rows of E\mathbf{E} receive gradient for this batch. Rows that never appeared stay dark. Toggle padding_idx\texttt{padding\_idx} off and watch row 0 light up; toggle it on again and the lock reinstates.

Loading sparse-gradient visualizer…

Visualizing the Embedding Space

Numbers are hard to feel. Geometry is easy to feel. The visualization below shows a toy embedding space with d=3d = 3 (so we can actually draw it). Each token is a coloured point in 3-space, and tokens in the same semantic cluster — royalty, people, animals, fruits, activities — were placed near each other. Rotate, zoom, and pan to see the clusters from every angle.

Loading embedding space visualization…

What you are seeing is a learned structure. Nothing in the training objective said “put royalty on one side and fruit on the other”. The distributional hypothesis — that words with similar contexts should have similar vectors — is enough. Once the loss has been minimized, similar words inevitably land near each other because pushing them apart would increase the loss.

Two geometric facts drop out of the training that seem magical at first:

  • Direction encodes semantic relations. kingman+womanqueen\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}} — the male→female direction is (approximately) a translation in the embedding space.
  • Distance encodes semantic similarity. Cosine similarity between embedding vectors is one of the most widely used proxies for semantic similarity in information retrieval, semantic search, and RAG systems.

How Embeddings Learn Meaning

Where does the geometry come from? It is not hand-designed — it is a byproduct of training the model on its primary task. Whether the task is classifying sentiment, predicting the next token, or answering a question, the gradient signal has to pass through the embedding layer on its way back to the weights. Each such pass nudges the rows of E\mathbf{E} in whatever direction reduces the loss.

Concretely, for a language-modelling objective with cross-entropy lossL=logp(wtw<t)L = -\log p(w_t \mid w_{<t}), the gradient with respect to the embedding of an input word is:

LE[idi]=LX[i,:]\frac{\partial L}{\partial \mathbf{E}[\text{id}_i]} = \frac{\partial L}{\partial \mathbf{X}[i,:]}

If two words tend to appear in the same contexts, the upstream gradient flowing into their embedding rows is, on average, similar. Over many updates their vectors drift in the same direction — and they end up geometrically close. This is exactly the distributional hypothesis expressed as an equation.

The same idea, pre-trained on huge text corpora, produced the classic static embeddings — word2vec (skip-gram and CBOW, Mikolov et al., 2013) and GloVe (Pennington et al., 2014). Today’s LLMs learn their embeddings jointly with the rest of the Transformer, so word2vec-style pre-training is largely historical. But the lesson is the same: geometry emerges from prediction.


Embeddings in Modern Transformers

The embedding layer is the first layer in every decoder-only LLM on the planet. Everything downstream — self-attention, Flash Attention, KV-cache, RoPE — is a transformation of the matrix the embedding layer produces. Understanding how embeddings connect to those mechanisms is the difference between reading the diagrams and understanding them.

Token embedding + positional encoding

A vanilla self-attention layer is permutation-invariant: shuffle the input tokens and the output shuffles the same way. For language that is disastrous — “dog bites man” and “man bites dog” must produce different outputs. The standard fix is to add a positional encoding PRT×d\mathbf{P} \in \mathbb{R}^{T \times d} to the token embedding before feeding the sum into the first attention block:

X0=E[ids]+P\mathbf{X}_0 = \mathbf{E}[\text{ids}] + \mathbf{P}

In the original Transformer of Vaswani et al. (2017), P\mathbf{P} is the deterministic sinusoidal function; in GPT-2 it is a learnable parameter of the same shape asE[ids]\mathbf{E}[\text{ids}]. Modern LLMs (LLaMA, Gemma, Mistral) drop additive positions entirely in favour of RoPE (Rotary Position Embedding), which rotates the query and key vectors inside attention itself. Either way, the token embedding from our nn.Embedding\texttt{nn.Embedding} is the starting material.

The dmodel\sqrt{d_{\text{model}}} scaling trick

You will often see embeddings multiplied by dmodel\sqrt{d_{\text{model}}} before being added to positional encodings: X0=E[ids]dmodel+P\mathbf{X}_0 = \mathbf{E}[\text{ids}] \cdot \sqrt{d_{\text{model}}} + \mathbf{P}. This is in the original Transformer paper. The motivation: if embeddings are initialized with variance 1/dmodel1/d_{\text{model}} (Xavier init), scaling by dmodel\sqrt{d_{\text{model}}} restores unit variance so the embedding does not get drowned out by the positional encoding at initialization.

Weight tying with the output projection

A language model needs to do two conversions: tokens → vectors at the input, and vectors → token probabilities at the output. The input uses ERV×d\mathbf{E} \in \mathbb{R}^{V \times d}; the output uses a linear layer WoutRd×V\mathbf{W}_{\text{out}} \in \mathbb{R}^{d \times V}. Modern LLMs almost universally tie these: Wout=E\mathbf{W}_{\text{out}} = \mathbf{E}^{\top}. This cuts parameter count roughly in half for the embedding/unembedding pair (40M → 20M for GPT-2) and often improves quality because every gradient step updates both roles of each row coherently.

Embeddings, multi-head attention, and Flash Attention

The matrix X0RT×d\mathbf{X}_0 \in \mathbb{R}^{T \times d} coming out of the embedding + position stage feeds directly into multi-head attention. The queries, keys, and values are linear projections of this matrix:

Q=X0WQ,K=X0WK,V=X0WV\mathbf{Q} = \mathbf{X}_0 \mathbf{W}_Q,\quad \mathbf{K} = \mathbf{X}_0 \mathbf{W}_K,\quad \mathbf{V} = \mathbf{X}_0 \mathbf{W}_V

So the embedding is literally the ink that attention paints with. This matters for two modern optimizations:

  • Flash Attention (Dao et al., 2022) computes softmax(QK/dk)V\text{softmax}(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d_k})\mathbf{V} without ever materializing the full T×TT \times T attention matrix. Its memory cost is O(Td)O(T \cdot d) instead of O(T2)O(T^2). The TdT \cdot d factor is set by the embedding dimension — making dd too large directly raises Flash Attention’s on-chip memory requirement.
  • Multi-head attention splits dd across hh heads, giving each head dk=d/hd_k = d / h dimensions. The embedding dimension is therefore constrained to be divisible by the number of heads. LLaMA-2-7B uses d=4096d = 4096 and h=32h = 32, giving dk=128d_k = 128 per head.

The KV-cache and the embedding’s role at inference

At autoregressive inference, a model generates one token at a time. Naive implementations re-run attention over the full sequence on every step — O(T2d)O(T^2 \cdot d) per step. The KV-cache fixes this: it stores every layer’s K\mathbf{K} and V\mathbf{V} from previous steps so only the new token’s single row must be projected and attended. The cache lives in GPU memory and its size is:

KV-cache bytes=2×L×T×d×bytes-per-float\text{KV-cache bytes} = 2 \times L \times T \times d \times \text{bytes-per-float}

where LL is the number of layers, TT the context length, and dd the model dimension — the same dd set by the embedding layer. At L=32, T=8192, d=4096L=32,\ T=8192,\ d=4096, float16, the KV-cache per sequence is about 2 GB. This is why LLM inference is often memory-bound, and why techniques like Grouped-Query Attention, Multi-Query Attention, and KV-cache quantization exist: they all attack the T×dT \times d term that the embedding layer fixes.

Mental model: the embedding dimension is like the diameter of the pipe that carries every downstream tensor through the network. Double it, and attention cost, Flash Attention’s on-chip tile size, KV-cache memory, and output-projection compute all roughly double too. That is why model sizes usually grow by increasing depth and number of heads together with dd — the embedding dimension is a central architectural knob.

Subword Tokenization and the Scaling Story

Our toy pipeline used whitespace tokenization. Real LLMs use subword tokenization — usually Byte-Pair Encoding (BPE) or its byte-level variant. The motivation is practical:

  1. Word-level vocabularies are either enormous (to cover every possible word form) or suffer massive out-of-vocabulary rates. Morphologically rich languages (Finnish, Turkish) are catastrophic for word tokenizers.
  2. Character-level tokenizers have tiny vocabularies (~256 bytes) but produce very long sequences, which is expensive for attention because of the T2T^2 cost.
  3. Subword (BPE, WordPiece, SentencePiece) hits the sweet spot: a vocabulary of a few tens of thousands of subword units that can represent any string by concatenation. Common words get their own token; rare words decompose into meaningful pieces.

Watch the algorithm run on a small toy corpus. Each step merges the most-frequent adjacent pair; after ten merges common substrings like low, est, and er have been absorbed into single subword tokens.

Loading BPE merge animator…
ModelVocab sized_embedEmbedding params
word2vec (2013)~3M300900M
BERT-base (2018)30,52276823.4M
GPT-2 (2019)50,25776838.6M
GPT-3 (2020)50,25712,288617M
LLaMA-2-7B (2023)32,0004,096131M
LLaMA-3 (2024)128,2564,096525M
Gemma-2-9B (2024)256,1283,584918M

Notice the trend. Modern LLMs have kept dmodeld_{\text{model}} roughly fixed (~4k) while growing the vocabulary — the embedding layer keeps getting heavier. That is because more vocabulary means fewer tokens per sentence (better compression → shorter sequences → cheaper attention) and better coverage of rare languages and code. The tradeoff: more parameters and bigger gradients on the embedding layer.

Byte-level BPE never OOVs. GPT-2 and LLaMA use byte-level BPE: every possible byte is its own token. Any string on Earth — emoji, CJK, binary — can be tokenized without an unk\langle \text{unk}\rangle. That is why you rarely see unk\langle \text{unk}\rangle tokens in modern LLM outputs.

Summary: The Gateway to Every Language Model

Embeddings are the layer that turns language into geometry. Everything a language model does — attention, generation, classification — is a transformation of the matrix produced by the embedding lookup. Mastering this one operation unlocks the rest of the stack.

Concepts to carry forward

  1. Preprocessing is a pipeline: normalize → tokenize → build vocab → index → embed. The first four stages are deterministic; the fifth is learned.
  2. One-hot is a stepping stone, not a solution. Its orthogonality makes it semantically blind and its sparsity makes it computationally wasteful.
  3. Embedding lookup equals one-hot matmul. OE\mathbf{O}\mathbf{E} and E[ids]\mathbf{E}[\text{ids}] compute the same thing; only cost differs.
  4. Gradients are sparse. Only rows for tokens that appeared in the forward pass receive updates — the reason embedding layers are cheap to train despite their size.
  5. Padding and unknown tokens are special. padding_idx\texttt{padding\_idx} freezes the pad row; unk\langle \text{unk}\rangle catches out-of-vocab inputs.
  6. Embeddings set dd. The embedding dimension becomes the dmodeld_{\text{model}} of attention, the width of every hidden state, and the constant in Flash Attention memory and KV-cache size.

What is next

In the next section we will take the embedded matrix X0RT×d\mathbf{X}_0 \in \mathbb{R}^{T \times d} we built here and feed it into an LSTM classifier. You will see the embedding layer sitting as the first module of a real, trainable sentiment analyzer — the gateway through which every token in your dataset passes on its way to a prediction.


References

  • Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
  • Pennington, J., Socher, R. & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. EMNLP 2014.
  • Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017 / arXiv:1706.03762.
  • Sennrich, R., Haddow, B. & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016 / arXiv:1508.07909.
  • Dao, T., Fu, D. Y., Ermon, S., Rudra, A. & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022 / arXiv:2205.14135.
Loading comments...