Chapter 4

Positional Encoding and Embeddings

Token Embeddings and Vocabulary

Introduction

Before we can process text with transformers, we need to convert words (or subwords) into dense vector representations. This is the job of token embeddings: a lookup table that maps discrete token IDs to continuous vectors.

This section covers vocabulary construction, special tokens, and implementing the embedding layer in PyTorch.


4.1 From Text to Vectors

The Pipeline

๐Ÿ“text
1Raw text: "The cat sat on the mat"
2    โ†“
3Tokenization: ["The", "cat", "sat", "on", "the", "mat"]
4    โ†“
5Vocabulary lookup: [45, 892, 1203, 28, 45, 3421]
6    โ†“
7Embedding lookup: [[0.23, ...], [0.87, ...], ...]
8    โ†“
9Add positional encoding
10    โ†“
11Ready for transformer

Why Dense Vectors?

One-hot encoding (sparse):

๐Ÿ“text
1"cat" โ†’ [0, 0, 0, ..., 1, ..., 0, 0]  (30,000 dimensions!)

- Huge dimensionality

- No similarity information

- In one-hot space, "cat" is exactly as distant from "dog" as it is from "happiness"

Dense embeddings:

๐Ÿ“text
1"cat" โ†’ [0.23, -0.45, 0.12, ...]  (512 dimensions)
2"dog" โ†’ [0.21, -0.42, 0.15, ...]  (similar to cat!)

- Compact representation

- Captures semantic similarity

- Learnable from data
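For intuition, cosine similarity over dense vectors puts related words close together. A toy check with made-up 4-dimensional values (the numbers are illustrative, not learned embeddings):

```python
import torch
import torch.nn.functional as F

# Illustrative dense vectors (values are invented for demonstration)
cat = torch.tensor([0.23, -0.45, 0.12, 0.80])
dog = torch.tensor([0.21, -0.42, 0.15, 0.75])
happiness = torch.tensor([-0.60, 0.30, 0.90, -0.10])

print(F.cosine_similarity(cat, dog, dim=0).item())        # high: nearly parallel vectors
print(F.cosine_similarity(cat, happiness, dim=0).item())  # low: unrelated directions
```

With one-hot vectors, both similarities would be exactly 0; dense embeddings can express "cat is more like dog than like happiness."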


4.2 Vocabulary Construction

Building a Vocabulary

๐Ÿpython
1from collections import Counter
2from typing import List, Dict, Optional
3
4
5class Vocabulary:
6    """
7    A vocabulary mapping tokens to indices and vice versa.
8
9    Includes special tokens for padding, unknown words, and
10    sequence boundaries.
11    """
12
13    # Special tokens
14    PAD_TOKEN = "<pad>"
15    UNK_TOKEN = "<unk>"
16    BOS_TOKEN = "<bos>"  # Beginning of sequence
17    EOS_TOKEN = "<eos>"  # End of sequence
18
19    def __init__(
20        self,
21        min_freq: int = 1,
22        max_size: Optional[int] = None,
23        special_tokens: List[str] = None
24    ):
25        self.min_freq = min_freq
26        self.max_size = max_size
27
28        # Special tokens
29        if special_tokens is None:
30            special_tokens = [self.PAD_TOKEN, self.UNK_TOKEN,
31                            self.BOS_TOKEN, self.EOS_TOKEN]
32        self.special_tokens = special_tokens
33
34        # Mappings (populated by build())
35        self.token2idx: Dict[str, int] = {}
36        self.idx2token: Dict[int, str] = {}
37
38        # Special indices
39        self.pad_idx = 0
40        self.unk_idx = 1
41        self.bos_idx = 2
42        self.eos_idx = 3
43
44    def build(self, texts: List[List[str]]) -> 'Vocabulary':
45        """
46        Build vocabulary from tokenized texts.
47
48        Args:
49            texts: List of tokenized documents (list of token lists)
50
51        Returns:
52            self (for chaining)
53        """
54        # Count token frequencies
55        counter = Counter()
56        for tokens in texts:
57            counter.update(tokens)
58
59        # Add special tokens first
60        for idx, token in enumerate(self.special_tokens):
61            self.token2idx[token] = idx
62            self.idx2token[idx] = token
63
64        # Add tokens meeting frequency threshold
65        idx = len(self.special_tokens)
66        for token, freq in counter.most_common():
67            if freq < self.min_freq:
68                continue
69            if self.max_size and len(self.token2idx) >= self.max_size:
70                break
71            if token not in self.token2idx:
72                self.token2idx[token] = idx
73                self.idx2token[idx] = token
74                idx += 1
75
76        return self
77
78    def __len__(self) -> int:
79        return len(self.token2idx)
80
81    def encode(self, tokens: List[str]) -> List[int]:
82        """Convert tokens to indices."""
83        return [self.token2idx.get(t, self.unk_idx) for t in tokens]
84
85    def decode(self, indices: List[int]) -> List[str]:
86        """Convert indices to tokens."""
87        return [self.idx2token.get(i, self.UNK_TOKEN) for i in indices]
88
89    def __repr__(self) -> str:
90        return f"Vocabulary(size={len(self)}, special={self.special_tokens})"
91
92
93# Example usage
94def demo_vocabulary():
95    texts = [
96        ["the", "cat", "sat", "on", "the", "mat"],
97        ["the", "dog", "ran", "in", "the", "park"],
98        ["a", "cat", "and", "a", "dog", "played"],
99    ]
100
101    vocab = Vocabulary(min_freq=1, max_size=20).build(texts)
102
103    print(f"Vocabulary: {vocab}")
104    print(f"Size: {len(vocab)}")
105    print(f"\nToken to Index:")
106    for token, idx in list(vocab.token2idx.items())[:10]:
107        print(f"  '{token}' โ†’ {idx}")
108
109    # Test encoding/decoding
110    test_tokens = ["the", "cat", "chased", "the", "mouse"]
111    encoded = vocab.encode(test_tokens)
112    decoded = vocab.decode(encoded)
113
114    print(f"\nOriginal: {test_tokens}")
115    print(f"Encoded: {encoded}")
116    print(f"Decoded: {decoded}")  # "chased" and "mouse" become <unk>
117
118
119demo_vocabulary()

4.3 Special Tokens

Common Special Tokens

| Token | Purpose | Typical Index |
|---|---|---|
| `<pad>` | Padding for batching | 0 |
| `<unk>` | Unknown/OOV words | 1 |
| `<bos>` / `<s>` | Beginning of sequence | 2 |
| `<eos>` / `</s>` | End of sequence | 3 |
| `<sep>` | Separator between segments | varies |
| `<cls>` | Classification token (BERT) | varies |
| `<mask>` | Masked token (BERT) | varies |

Why Each Token Matters

<pad>: Variable-length sequences need padding to batch together.

๐Ÿ“text
1Sequence 1: [The, cat, sat, <pad>, <pad>]
2Sequence 2: [A, dog, ran, fast, today]

<unk>: Handle words not in vocabulary gracefully.

๐Ÿ“text
1"quantum" โ†’ <unk> (if not in vocab)

<bos> and <eos>: Mark sequence boundaries.

๐Ÿ“text
1[<bos>, The, cat, sat, <eos>]

Boundary tokens help the model recognize where sequences start and end, which is crucial for generation.
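The pieces above can be combined into a minimal batching sketch. The `pad_batch` helper below is hypothetical (not part of this chapter's classes) and assumes the default special-token indices from Section 4.2:

```python
import torch

def pad_batch(sequences, pad_idx=0, bos_idx=2, eos_idx=3):
    """Wrap each ID sequence in <bos>/<eos>, pad to the longest length,
    and return a mask marking the real (non-padding) positions."""
    wrapped = [[bos_idx] + seq + [eos_idx] for seq in sequences]
    max_len = max(len(s) for s in wrapped)
    padded = [s + [pad_idx] * (max_len - len(s)) for s in wrapped]
    input_ids = torch.tensor(padded)
    attention_mask = (input_ids != pad_idx).long()  # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_batch([[45, 892, 1203], [45, 3421]])
print(ids)
# tensor([[   2,   45,  892, 1203,    3],
#         [   2,   45, 3421,    3,    0]])
```

The attention mask is what later lets the transformer ignore padded positions.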


4.4 Token Embedding Layer

Implementation

๐Ÿpython
1import torch
2import torch.nn as nn
3import math
4from typing import Optional
5
6
7class TokenEmbedding(nn.Module):
8    """
9    Token embedding layer.
10
11    Maps token indices to dense vectors using a learnable lookup table.
12
13    Args:
14        vocab_size: Size of vocabulary
15        d_model: Embedding dimension
16        padding_idx: Index of padding token (embeddings will be zero)
17        scale: Whether to scale embeddings by sqrt(d_model)
18
19    Example:
20        >>> emb = TokenEmbedding(vocab_size=10000, d_model=512)
21        >>> input_ids = torch.tensor([[1, 45, 892, 0, 0]])  # [batch, seq_len]
22        >>> output = emb(input_ids)  # [batch, seq_len, d_model]
23    """
24
25    def __init__(
26        self,
27        vocab_size: int,
28        d_model: int,
29        padding_idx: int = 0,
30        scale: bool = True
31    ):
32        super().__init__()
33
34        self.vocab_size = vocab_size
35        self.d_model = d_model
36        self.scale = scale
37        self.scale_factor = math.sqrt(d_model) if scale else 1.0
38
39        # Embedding table
40        self.embedding = nn.Embedding(
41            num_embeddings=vocab_size,
42            embedding_dim=d_model,
43            padding_idx=padding_idx
44        )
45
46        # Initialize
47        self._reset_parameters()
48
49    def _reset_parameters(self):
50        """Initialize embedding weights."""
51        # Normal initialization scaled by 1/sqrt(d_model)
52        nn.init.normal_(self.embedding.weight, mean=0.0, std=self.d_model ** -0.5)
53
54        # Ensure padding vector stays zero
55        if self.embedding.padding_idx is not None:
56            with torch.no_grad():
57                self.embedding.weight[self.embedding.padding_idx].fill_(0)
58
59    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
60        """
61        Look up embeddings for input token IDs.
62
63        Args:
64            input_ids: Token indices [batch, seq_len]
65
66        Returns:
67            embeddings: Dense vectors [batch, seq_len, d_model]
68        """
69        # Look up embeddings
70        embeddings = self.embedding(input_ids)
71
72        # Scale by sqrt(d_model)
73        if self.scale:
74            embeddings = embeddings * self.scale_factor
75
76        return embeddings
77
78    def extra_repr(self) -> str:
79        return f'vocab_size={self.vocab_size}, d_model={self.d_model}, scale={self.scale}'
80
81
82# Test
83def test_token_embedding():
84    vocab_size = 10000
85    d_model = 512
86    batch_size = 2
87    seq_len = 10
88
89    emb = TokenEmbedding(vocab_size, d_model, padding_idx=0)
90
91    # Create input with some padding
92    input_ids = torch.randint(1, vocab_size, (batch_size, seq_len))
93    input_ids[:, -2:] = 0  # Last 2 positions are padding
94
95    output = emb(input_ids)
96
97    print(f"Input shape: {input_ids.shape}")
98    print(f"Output shape: {output.shape}")
99    print(f"Embedding table shape: {emb.embedding.weight.shape}")
100
101    # Verify padding embeddings are zero
102    padding_emb = output[:, -1, :]  # Last position (padding)
103    print(f"\nPadding embedding (should be zero):")
104    print(f"  Sum: {padding_emb.abs().sum().item():.6f}")
105
106    # Verify scaling
107    raw_emb = emb.embedding(input_ids)
108    expected_scale = math.sqrt(d_model)
109    actual_scale = output[0, 0, 0] / raw_emb[0, 0, 0]
110    print(f"\nScaling factor: {actual_scale:.4f} (expected {expected_scale:.4f})")
111
112    print("\nโœ“ Token embedding test passed!")
113
114
115test_token_embedding()

4.5 Why Scale by √d_model?

The Problem

Token embeddings and positional encodings are added together:

๐Ÿ“text
1input = token_embedding + positional_encoding

If they have very different magnitudes, one will dominate.

Positional Encoding Magnitude

Sinusoidal PE values are in [-1, 1].

Mean magnitude per position: approximately 0.5-0.7.

Token Embedding Magnitude

With normal initialization N(0, σ²):

- Mean: 0

- Std: σ

If ฯƒ is too small, positions dominate.

If ฯƒ is too large, tokens dominate.

The Solution

Initialize embeddings with std = 1/√d_model, then scale by √d_model:

๐Ÿpython
1# Initialization
2std = 1 / sqrt(d_model)  # Small values
3
4# In forward pass
5embeddings = embeddings * sqrt(d_model)  # Scale up
6
7# Result: embeddings have similar magnitude to positional encoding

This balancing ensures both token identity and position contribute meaningfully.
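A quick numeric sanity check of this balancing, using random weights at the stated initialization (illustrative only):

```python
import math
import torch

d_model = 512
# Initialize with std = 1/sqrt(d_model), as in TokenEmbedding above
emb = torch.randn(10000, d_model) * d_model ** -0.5
print(f"std before scaling: {emb.std().item():.4f}")   # ~0.044

scaled = emb * math.sqrt(d_model)
print(f"std after scaling:  {scaled.std().item():.4f}")  # ~1.0, comparable to PE values in [-1, 1]
```

Without the forward-pass scaling, embedding values around 0.04 would be swamped by positional-encoding values of order 1.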


4.6 Shared Embeddings (Weight Tying)

The Concept

In language models, the same vocabulary is used for input and output:

- Input: token โ†’ embedding โ†’ transformer

- Output: transformer โ†’ linear โ†’ logits over vocabulary

We can share the embedding weights:

๐Ÿpython
1class LanguageModel(nn.Module):
2    def __init__(self, vocab_size, d_model):
3        self.embedding = TokenEmbedding(vocab_size, d_model)
4        # ... transformer layers ...
5        self.output_projection = nn.Linear(d_model, vocab_size, bias=False)
6
7        # Tie weights: output projection uses embedding weights
8        self.output_projection.weight = self.embedding.embedding.weight

Benefits

1. Fewer parameters: One embedding matrix instead of two

2. Consistency: Same representation for input and output

3. Better generalization: Works especially well for smaller models

In PyTorch

๐Ÿpython
1class TiedEmbeddingModel(nn.Module):
2    def __init__(self, vocab_size, d_model):
3        super().__init__()
4
5        self.embedding = nn.Embedding(vocab_size, d_model)
6        self.output = nn.Linear(d_model, vocab_size, bias=False)
7
8        # Tie weights
9        self.output.weight = self.embedding.weight
10
11    def forward(self, x):
12        # x: [batch, seq_len]
13        emb = self.embedding(x)  # [batch, seq_len, d_model]
14        # ... transformer processing ...
15        logits = self.output(emb)  # [batch, seq_len, vocab_size]
16        return logits
17
18
19# Verify weight tying
20model = TiedEmbeddingModel(1000, 256)
21print(f"Embedding weight id: {id(model.embedding.weight)}")
22print(f"Output weight id: {id(model.output.weight)}")
23print(f"Same object: {model.embedding.weight is model.output.weight}")

4.7 Pre-trained Embeddings

Loading Pre-trained Vectors

๐Ÿpython
1import numpy as np
2
3
4def load_pretrained_embeddings(
5    vocab: Vocabulary,
6    embedding_path: str,
7    d_model: int
8) -> torch.Tensor:
9    """
10    Load pre-trained word vectors (e.g., GloVe, Word2Vec).
11
12    Args:
13        vocab: Vocabulary object
14        embedding_path: Path to embedding file
15        d_model: Expected embedding dimension
16
17    Returns:
18        Embedding matrix [vocab_size, d_model]
19    """
20    # Initialize with random
21    embeddings = torch.randn(len(vocab), d_model) * 0.01
22
23    # Load pre-trained vectors
24    loaded = 0
25    with open(embedding_path, 'r', encoding='utf-8') as f:
26        for line in f:
27            parts = line.strip().split()
28            word = parts[0]
29            if word in vocab.token2idx:
30                vector = torch.tensor([float(x) for x in parts[1:]])
31                if len(vector) == d_model:
32                    embeddings[vocab.token2idx[word]] = vector
33                    loaded += 1
34
35    print(f"Loaded {loaded}/{len(vocab)} pre-trained embeddings")
36
37    # Keep special tokens as random/zero
38    embeddings[vocab.pad_idx] = 0
39
40    return embeddings
41
42
43def create_embedding_with_pretrained(
44    vocab: Vocabulary,
45    pretrained_path: str,
46    d_model: int,
47    freeze: bool = False
48) -> TokenEmbedding:
49    """Create TokenEmbedding initialized with pre-trained vectors."""
50
51    emb = TokenEmbedding(len(vocab), d_model, padding_idx=vocab.pad_idx)
52
53    # Load and set weights
54    pretrained = load_pretrained_embeddings(vocab, pretrained_path, d_model)
55    emb.embedding.weight.data = pretrained
56
57    # Optionally freeze
58    if freeze:
59        emb.embedding.weight.requires_grad = False
60
61    return emb

4.8 Handling Large Vocabularies

Memory Considerations

๐Ÿ“text
1Vocabulary size: 50,000
2Embedding dimension: 1024
3Memory: 50,000 ร— 1024 ร— 4 bytes = 200 MB (just for embeddings!)
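The arithmetic above can be checked directly (assuming float32, i.e. 4 bytes per parameter):

```python
vocab_size = 50_000
d_model = 1024
bytes_per_param = 4  # float32

mem_bytes = vocab_size * d_model * bytes_per_param
print(f"Embedding table: {mem_bytes / 1e6:.1f} MB")  # 204.8 MB
```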

Strategies for Large Vocabularies

1. Reduce vocabulary size

- Use subword tokenization (BPE, WordPiece)

- Typical sizes: 30K-50K tokens

2. Embedding factorization

๐Ÿpython
1class FactorizedEmbedding(nn.Module):
2    """Project through smaller dimension first."""
3    def __init__(self, vocab_size, d_model, factor_dim=128):
4        super().__init__()
5        self.embed1 = nn.Embedding(vocab_size, factor_dim)
6        self.embed2 = nn.Linear(factor_dim, d_model)
7
8    def forward(self, x):
9        return self.embed2(self.embed1(x))
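For a sense of the savings, compare parameter counts at the sizes used earlier (ignoring the projection layer's small bias term):

```python
vocab_size, d_model, factor_dim = 50_000, 1024, 128

full = vocab_size * d_model                                   # one big table
factorized = vocab_size * factor_dim + factor_dim * d_model   # small table + projection
print(f"Full:       {full:,}")        # 51,200,000
print(f"Factorized: {factorized:,}")  # 6,531,072 (~13% of the full table)
```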

3. Adaptive embeddings

Common words get full dimension, rare words get smaller dimension.
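A minimal sketch of this idea, loosely following Baevski and Auli's adaptive input embeddings; the class name, cutoff, and dimensions here are illustrative, and token IDs are assumed to be sorted by descending frequency:

```python
import torch
import torch.nn as nn


class AdaptiveEmbedding(nn.Module):
    """Frequent tokens (IDs below cutoff) get a full d_model embedding;
    rare tokens get a smaller one, projected up to d_model."""

    def __init__(self, vocab_size, d_model, cutoff=5000, rare_dim=64):
        super().__init__()
        self.cutoff = cutoff
        self.frequent = nn.Embedding(cutoff, d_model)
        self.rare = nn.Embedding(vocab_size - cutoff, rare_dim)
        self.rare_proj = nn.Linear(rare_dim, d_model, bias=False)

    def forward(self, x):
        # Allocate output, then fill the frequent and rare positions separately
        out = torch.zeros(*x.shape, self.frequent.embedding_dim, device=x.device)
        freq_mask = x < self.cutoff
        out[freq_mask] = self.frequent(x[freq_mask])
        rare_ids = x[~freq_mask] - self.cutoff  # shift into the rare table's range
        out[~freq_mask] = self.rare_proj(self.rare(rare_ids))
        return out


emb = AdaptiveEmbedding(vocab_size=10000, d_model=512)
x = torch.tensor([[1, 6000]])   # one frequent, one rare token
print(emb(x).shape)             # torch.Size([1, 2, 512])
```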


Summary

Key Components

| Component | Purpose |
|---|---|
| Vocabulary | Maps tokens ↔ indices |
| Special tokens | Handle padding, unknown, boundaries |
| TokenEmbedding | Lookup table for dense vectors |
| Scaling | Balance with positional encoding |
| Weight tying | Share input/output embeddings |

Implementation Checklist

- Build vocabulary with special tokens

- Set padding_idx in nn.Embedding

- Scale embeddings by √d_model

- Initialize appropriately

- Consider weight tying for LMs


Next Section Preview

In the final section of this chapter, we'll combine everything (token embeddings, positional encoding, and proper scaling) into a complete TransformerEmbedding module ready for use in our translation model.