Introduction
Before we can process text with transformers, we need to convert words (or subwords) into dense vector representations. This is the job of token embeddings: a lookup table that maps discrete token IDs to continuous vectors.
This section covers vocabulary construction, special tokens, and implementing the embedding layer in PyTorch.
4.1 From Text to Vectors
The Pipeline
```
Raw text: "The cat sat on the mat"
        ↓
Tokenization: ["The", "cat", "sat", "on", "the", "mat"]
        ↓
Vocabulary lookup: [45, 892, 1203, 28, 45, 3421]
        ↓
Embedding lookup: [[0.23, ...], [0.87, ...], ...]
        ↓
Add positional encoding
        ↓
Ready for transformer
```

Why Dense Vectors?
One-hot encoding (sparse):

```
"cat" → [0, 0, 0, ..., 1, ..., 0, 0]  (30,000 dimensions!)
```

- Huge dimensionality
- No similarity information
- "cat" is exactly as distant from "dog" as it is from "happiness"
Dense embeddings:
```
"cat" → [0.23, -0.45, 0.12, ...]  (512 dimensions)
"dog" → [0.21, -0.42, 0.15, ...]  (similar to cat!)
```

- Compact representation
- Captures semantic similarity
- Learnable from data
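The similarity claim is easy to check numerically. A minimal sketch with made-up vectors (the values are illustrative, not learned embeddings):

```python
import torch
import torch.nn.functional as F

# Illustrative dense vectors (hypothetical values, not real learned weights)
cat = torch.tensor([0.23, -0.45, 0.12, 0.80])
dog = torch.tensor([0.21, -0.42, 0.15, 0.75])
happiness = torch.tensor([-0.60, 0.30, 0.90, -0.10])

print(F.cosine_similarity(cat, dog, dim=0))        # high: related concepts
print(F.cosine_similarity(cat, happiness, dim=0))  # low: unrelated concepts

# One-hot vectors, by contrast, are all mutually orthogonal:
cat_oh, dog_oh = torch.zeros(30000), torch.zeros(30000)
cat_oh[892], dog_oh[1077] = 1.0, 1.0
print(F.cosine_similarity(cat_oh, dog_oh, dim=0))  # always 0.0
```

With one-hot vectors every pairwise similarity is zero, so no amount of training data can make "cat" closer to "dog" than to "happiness"; dense vectors give the model that freedom.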
4.2 Vocabulary Construction
Building a Vocabulary
```python
from collections import Counter
from typing import List, Dict, Optional


class Vocabulary:
    """
    A vocabulary mapping tokens to indices and vice versa.

    Includes special tokens for padding, unknown words, and
    sequence boundaries.
    """

    # Special tokens
    PAD_TOKEN = "<pad>"
    UNK_TOKEN = "<unk>"
    BOS_TOKEN = "<bos>"  # Beginning of sequence
    EOS_TOKEN = "<eos>"  # End of sequence

    def __init__(
        self,
        min_freq: int = 1,
        max_size: Optional[int] = None,
        special_tokens: Optional[List[str]] = None
    ):
        self.min_freq = min_freq
        self.max_size = max_size

        # Special tokens
        if special_tokens is None:
            special_tokens = [self.PAD_TOKEN, self.UNK_TOKEN,
                              self.BOS_TOKEN, self.EOS_TOKEN]
        self.special_tokens = special_tokens

        # Mappings (populated by build())
        self.token2idx: Dict[str, int] = {}
        self.idx2token: Dict[int, str] = {}

        # Special indices
        self.pad_idx = 0
        self.unk_idx = 1
        self.bos_idx = 2
        self.eos_idx = 3

    def build(self, texts: List[List[str]]) -> 'Vocabulary':
        """
        Build vocabulary from tokenized texts.

        Args:
            texts: List of tokenized documents (list of token lists)

        Returns:
            self (for chaining)
        """
        # Count token frequencies
        counter = Counter()
        for tokens in texts:
            counter.update(tokens)

        # Add special tokens first
        for idx, token in enumerate(self.special_tokens):
            self.token2idx[token] = idx
            self.idx2token[idx] = token

        # Add tokens meeting frequency threshold
        idx = len(self.special_tokens)
        for token, freq in counter.most_common():
            if freq < self.min_freq:
                continue
            if self.max_size and len(self.token2idx) >= self.max_size:
                break
            if token not in self.token2idx:
                self.token2idx[token] = idx
                self.idx2token[idx] = token
                idx += 1

        return self

    def __len__(self) -> int:
        return len(self.token2idx)

    def encode(self, tokens: List[str]) -> List[int]:
        """Convert tokens to indices."""
        return [self.token2idx.get(t, self.unk_idx) for t in tokens]

    def decode(self, indices: List[int]) -> List[str]:
        """Convert indices to tokens."""
        return [self.idx2token.get(i, self.UNK_TOKEN) for i in indices]

    def __repr__(self) -> str:
        return f"Vocabulary(size={len(self)}, special={self.special_tokens})"


# Example usage
def demo_vocabulary():
    texts = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "ran", "in", "the", "park"],
        ["a", "cat", "and", "a", "dog", "played"],
    ]

    vocab = Vocabulary(min_freq=1, max_size=20).build(texts)

    print(f"Vocabulary: {vocab}")
    print(f"Size: {len(vocab)}")
    print("\nToken to Index:")
    for token, idx in list(vocab.token2idx.items())[:10]:
        print(f"  '{token}' → {idx}")

    # Test encoding/decoding
    test_tokens = ["the", "cat", "chased", "the", "mouse"]
    encoded = vocab.encode(test_tokens)
    decoded = vocab.decode(encoded)

    print(f"\nOriginal: {test_tokens}")
    print(f"Encoded:  {encoded}")
    print(f"Decoded:  {decoded}")  # "chased" and "mouse" become <unk>


demo_vocabulary()
```

4.3 Special Tokens
Common Special Tokens
| Token | Purpose | Typical Index |
|---|---|---|
| <pad> | Padding for batching | 0 |
| <unk> | Unknown/OOV words | 1 |
| <bos> / <s> | Beginning of sequence | 2 |
| <eos> / </s> | End of sequence | 3 |
| <sep> | Separator between segments | varies |
| <cls> | Classification token (BERT) | varies |
| <mask> | Masked token (BERT) | varies |
Why Each Token Matters
<pad>: Variable-length sequences need padding to batch together.
```
Sequence 1: [The, cat, sat, <pad>, <pad>]
Sequence 2: [A, dog, ran, fast, today]
```

<unk>: Handle words not in vocabulary gracefully.

```
"quantum" → <unk>  (if not in vocab)
```

<bos> and <eos>: Mark sequence boundaries.

```
[<bos>, The, cat, sat, <eos>]
```

Marking where sequences start and end helps the model, and is crucial for generation.
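Putting the special tokens together: a minimal sketch of batch preparation that wraps each sequence in <bos>/<eos> and pads to a common length. The `prepare_batch` helper is hypothetical; the index conventions (0-3) follow the Vocabulary class above.

```python
import torch

# Index conventions from the Vocabulary class above
PAD, UNK, BOS, EOS = 0, 1, 2, 3

def prepare_batch(sequences: list) -> torch.Tensor:
    """Wrap each token-ID sequence in <bos>/<eos>, then pad to the longest."""
    wrapped = [[BOS] + seq + [EOS] for seq in sequences]
    max_len = max(len(s) for s in wrapped)
    padded = [s + [PAD] * (max_len - len(s)) for s in wrapped]
    return torch.tensor(padded)

batch = prepare_batch([[45, 892, 1203], [45, 892, 1203, 28, 45]])
print(batch.shape)       # torch.Size([2, 7])
print(batch[0].tolist()) # [2, 45, 892, 1203, 3, 0, 0]
```

Note that the shorter sequence ends up with trailing `<pad>` indices; downstream attention masks are built from exactly this padding pattern.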
4.4 Token Embedding Layer
Implementation
```python
import torch
import torch.nn as nn
import math


class TokenEmbedding(nn.Module):
    """
    Token embedding layer.

    Maps token indices to dense vectors using a learnable lookup table.

    Args:
        vocab_size: Size of vocabulary
        d_model: Embedding dimension
        padding_idx: Index of padding token (embeddings will be zero)
        scale: Whether to scale embeddings by sqrt(d_model)

    Example:
        >>> emb = TokenEmbedding(vocab_size=10000, d_model=512)
        >>> input_ids = torch.tensor([[1, 45, 892, 0, 0]])  # [batch, seq_len]
        >>> output = emb(input_ids)  # [batch, seq_len, d_model]
    """

    def __init__(
        self,
        vocab_size: int,
        d_model: int,
        padding_idx: int = 0,
        scale: bool = True
    ):
        super().__init__()

        self.vocab_size = vocab_size
        self.d_model = d_model
        self.scale = scale
        self.scale_factor = math.sqrt(d_model) if scale else 1.0

        # Embedding table
        self.embedding = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=d_model,
            padding_idx=padding_idx
        )

        # Initialize
        self._reset_parameters()

    def _reset_parameters(self):
        """Initialize embedding weights."""
        # Normal initialization scaled by 1/sqrt(d_model)
        nn.init.normal_(self.embedding.weight, mean=0.0, std=self.d_model ** -0.5)

        # Ensure padding vector stays zero
        if self.embedding.padding_idx is not None:
            with torch.no_grad():
                self.embedding.weight[self.embedding.padding_idx].fill_(0)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        """
        Look up embeddings for input token IDs.

        Args:
            input_ids: Token indices [batch, seq_len]

        Returns:
            embeddings: Dense vectors [batch, seq_len, d_model]
        """
        # Look up embeddings
        embeddings = self.embedding(input_ids)

        # Scale by sqrt(d_model)
        if self.scale:
            embeddings = embeddings * self.scale_factor

        return embeddings

    def extra_repr(self) -> str:
        return f'vocab_size={self.vocab_size}, d_model={self.d_model}, scale={self.scale}'


# Test
def test_token_embedding():
    vocab_size = 10000
    d_model = 512
    batch_size = 2
    seq_len = 10

    emb = TokenEmbedding(vocab_size, d_model, padding_idx=0)

    # Create input with some padding
    input_ids = torch.randint(1, vocab_size, (batch_size, seq_len))
    input_ids[:, -2:] = 0  # Last 2 positions are padding

    output = emb(input_ids)

    print(f"Input shape: {input_ids.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Embedding table shape: {emb.embedding.weight.shape}")

    # Verify padding embeddings are zero
    padding_emb = output[:, -1, :]  # Last position (padding)
    print(f"\nPadding embedding (should be zero):")
    print(f"  Sum: {padding_emb.abs().sum().item():.6f}")

    # Verify scaling
    raw_emb = emb.embedding(input_ids)
    expected_scale = math.sqrt(d_model)
    actual_scale = (output[0, 0, 0] / raw_emb[0, 0, 0]).item()
    print(f"\nScaling factor: {actual_scale:.4f} (expected {expected_scale:.4f})")

    print("\n✓ Token embedding test passed!")


test_token_embedding()
```

4.5 Why Scale by √d_model?
The Problem
Token embeddings and positional encodings are added together:

```
input = token_embedding + positional_encoding
```

If they have very different magnitudes, one will dominate.
Positional Encoding Magnitude
Sinusoidal PE values are in [-1, 1].
Mean magnitude per position: approximately 0.5-0.7.
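That figure can be checked directly. A sketch using the standard sinusoidal formula (|sin| and |cos| of a uniformly distributed phase each average 2/π ≈ 0.64; the high-frequency/low-frequency mix pulls the overall mean into the quoted range):

```python
import torch

d_model, n_pos = 512, 100
pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)     # positions
two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # even dims
angles = pos / (10000 ** (two_i / d_model))

pe = torch.zeros(n_pos, d_model)
pe[:, 0::2] = torch.sin(angles)  # even channels
pe[:, 1::2] = torch.cos(angles)  # odd channels

print(f"Mean |PE| value: {pe.abs().mean().item():.3f}")  # lands in ~0.5-0.7
```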
Token Embedding Magnitude
With normal initialization N(0, σ²):
- Mean: 0
- Std: σ

If σ is too small, positions dominate.
If σ is too large, tokens dominate.
The Solution
Initialize embeddings with std = 1/√d_model, then scale by √d_model:

```python
# Initialization
std = 1 / sqrt(d_model)  # Small values

# In forward pass
embeddings = embeddings * sqrt(d_model)  # Scale up

# Result: embeddings have similar magnitude to positional encoding
```

This balancing ensures both token identity and position contribute meaningfully.
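A quick empirical check of the balance (a sketch; the exact numbers vary slightly with the random draw):

```python
import math
import torch

d_model = 512

# Embeddings initialized with std = 1/sqrt(d_model), as in TokenEmbedding
emb = torch.randn(1000, d_model) * (d_model ** -0.5)
print(f"Raw embedding std:    {emb.std().item():.3f}")     # ~0.044

# After scaling by sqrt(d_model), values are roughly unit scale...
scaled = emb * math.sqrt(d_model)
print(f"Scaled embedding std: {scaled.std().item():.3f}")  # ~1.0
# ...comparable to sinusoidal PE values, which lie in [-1, 1]
```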
4.6 Shared Embeddings (Weight Tying)
The Concept
In language models, the same vocabulary is used for input and output:
- Input: token โ embedding โ transformer
- Output: transformer โ linear โ logits over vocabulary
We can share the embedding weights:
```python
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embedding = TokenEmbedding(vocab_size, d_model)
        # ... transformer layers ...
        self.output_projection = nn.Linear(d_model, vocab_size, bias=False)

        # Tie weights: output projection uses embedding weights
        self.output_projection.weight = self.embedding.embedding.weight
```

Benefits
1. Fewer parameters: One embedding matrix instead of two
2. Consistency: Same representation for input and output
3. Better generalization: Works especially well for smaller models
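The parameter savings are easy to quantify. A sketch of the arithmetic (the vocabulary size and dimension are illustrative, chosen to match common model configs):

```python
vocab_size, d_model = 50_000, 1024

embedding_params = vocab_size * d_model  # input embedding table
output_params = vocab_size * d_model     # output projection (no bias)

untied = embedding_params + output_params
tied = embedding_params                  # one shared matrix

print(f"Untied: {untied:,} params")  # 102,400,000
print(f"Tied:   {tied:,} params")    # 51,200,000
print(f"Saved:  {untied - tied:,}")  # 51,200,000
```

For small and medium models, the embedding matrices are a large fraction of the total parameter count, which is why tying helps most there.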
In PyTorch
```python
class TiedEmbeddingModel(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, d_model)
        self.output = nn.Linear(d_model, vocab_size, bias=False)

        # Tie weights
        self.output.weight = self.embedding.weight

    def forward(self, x):
        # x: [batch, seq_len]
        emb = self.embedding(x)  # [batch, seq_len, d_model]
        # ... transformer processing ...
        logits = self.output(emb)  # [batch, seq_len, vocab_size]
        return logits


# Verify weight tying
model = TiedEmbeddingModel(1000, 256)
print(f"Embedding weight id: {id(model.embedding.weight)}")
print(f"Output weight id: {id(model.output.weight)}")
print(f"Same object: {model.embedding.weight is model.output.weight}")
```

4.7 Pre-trained Embeddings
Loading Pre-trained Vectors
```python
import torch


def load_pretrained_embeddings(
    vocab: Vocabulary,
    embedding_path: str,
    d_model: int
) -> torch.Tensor:
    """
    Load pre-trained word vectors (e.g., GloVe, Word2Vec).

    Args:
        vocab: Vocabulary object
        embedding_path: Path to embedding file
        d_model: Expected embedding dimension

    Returns:
        Embedding matrix [vocab_size, d_model]
    """
    # Initialize with random
    embeddings = torch.randn(len(vocab), d_model) * 0.01

    # Load pre-trained vectors
    loaded = 0
    with open(embedding_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            word = parts[0]
            if word in vocab.token2idx:
                vector = torch.tensor([float(x) for x in parts[1:]])
                if len(vector) == d_model:
                    embeddings[vocab.token2idx[word]] = vector
                    loaded += 1

    print(f"Loaded {loaded}/{len(vocab)} pre-trained embeddings")

    # Keep special tokens as random/zero
    embeddings[vocab.pad_idx] = 0

    return embeddings


def create_embedding_with_pretrained(
    vocab: Vocabulary,
    pretrained_path: str,
    d_model: int,
    freeze: bool = False
) -> TokenEmbedding:
    """Create TokenEmbedding initialized with pre-trained vectors."""

    emb = TokenEmbedding(len(vocab), d_model, padding_idx=vocab.pad_idx)

    # Load and set weights
    pretrained = load_pretrained_embeddings(vocab, pretrained_path, d_model)
    emb.embedding.weight.data = pretrained

    # Optionally freeze
    if freeze:
        emb.embedding.weight.requires_grad = False

    return emb
```

4.8 Handling Large Vocabularies
Memory Considerations
```
Vocabulary size: 50,000
Embedding dimension: 1024
Memory: 50,000 × 1024 × 4 bytes ≈ 200 MB (just for embeddings!)
```

Strategies for Large Vocabularies
1. Reduce vocabulary size
- Use subword tokenization (BPE, WordPiece)
- Typical sizes: 30K-50K tokens
2. Embedding factorization
```python
class FactorizedEmbedding(nn.Module):
    """Project through a smaller dimension first."""
    def __init__(self, vocab_size, d_model, factor_dim=128):
        super().__init__()
        self.embed1 = nn.Embedding(vocab_size, factor_dim)
        self.embed2 = nn.Linear(factor_dim, d_model)

    def forward(self, x):
        return self.embed2(self.embed1(x))
```

3. Adaptive embeddings
Common words get full dimension, rare words get smaller dimension.
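The factorization above trades a small extra projection for a large reduction in table size. A sketch of the arithmetic, using the same illustrative sizes as the memory example:

```python
vocab_size, d_model, factor_dim = 50_000, 1024, 128

full = vocab_size * d_model  # standard embedding table
factorized = vocab_size * factor_dim + factor_dim * d_model

print(f"Full:       {full:,}")        # 51,200,000
print(f"Factorized: {factorized:,}")  # 6,531,072
print(f"Reduction:  {full / factorized:.1f}x")
```

The savings come from the fact that `vocab_size` multiplies the small `factor_dim` instead of the full `d_model`; the `factor_dim × d_model` projection is comparatively tiny.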
Summary
Key Components
| Component | Purpose |
|---|---|
| Vocabulary | Maps tokens → indices |
| Special tokens | Handle padding, unknown, boundaries |
| TokenEmbedding | Lookup table for dense vectors |
| Scaling | Balance with positional encoding |
| Weight tying | Share input/output embeddings |
Implementation Checklist
- Build vocabulary with special tokens
- Set padding_idx in nn.Embedding
- Scale embeddings by √d_model
- Initialize appropriately
- Consider weight tying for LMs
Next Section Preview
In the final section of this chapter, we'll combine everythingโtoken embeddings, positional encoding, and proper scalingโinto a complete TransformerEmbedding module ready for use in our translation model.