Chapter 0

Prerequisite Knowledge

Introduction

This chapter covers the foundational knowledge required to understand and implement Transformers from scratch. Before diving into attention mechanisms and model architecture, you should be comfortable with the concepts presented here.

What you need:

  • Python programming (intermediate level)
  • NumPy array operations
  • PyTorch basics
  • Linear algebra fundamentals
  • Basic deep learning concepts

0.1 Python Essentials

Functions and Lambda Expressions

🐍functions.py
# Standard function
def multiply(a, b):
    """Multiply two numbers."""
    return a * b

# Lambda (anonymous function)
multiply_lambda = lambda a, b: a * b

# Higher-order function (takes and/or returns functions)
def apply_twice(func, x):
    return func(func(x))

result = apply_twice(lambda x: x * 2, 3)  # 12
print(f"Apply twice: {result}")


# Default arguments and keyword arguments
def create_layer(input_dim, output_dim, activation="relu", dropout=0.1):
    return {
        "input": input_dim,
        "output": output_dim,
        "activation": activation,
        "dropout": dropout
    }

layer = create_layer(512, 256, dropout=0.2)
print(f"Layer config: {layer}")

Classes and Object-Oriented Programming

🐍classes.py
class NeuralLayer:
    """
    Example neural network layer class.

    This demonstrates OOP concepts used throughout the course.
    """

    def __init__(self, input_dim: int, output_dim: int):
        """
        Initialize the layer.

        Args:
            input_dim: Input dimension
            output_dim: Output dimension
        """
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.weights = None  # Would be initialized with actual weights

    def forward(self, x):
        """Forward pass through the layer."""
        raise NotImplementedError("Subclasses must implement forward()")

    def __repr__(self):
        # type(self).__name__ so subclasses print their own class name
        return f"{type(self).__name__}(input={self.input_dim}, output={self.output_dim})"


class LinearLayer(NeuralLayer):
    """Linear layer inheriting from NeuralLayer."""

    def __init__(self, input_dim: int, output_dim: int, bias: bool = True):
        super().__init__(input_dim, output_dim)  # Call parent __init__
        self.bias = bias

    def forward(self, x):
        # Simplified - actual implementation would do matrix multiplication
        return f"Linear transform of shape {x.shape}"


# Usage
layer = LinearLayer(512, 256)
print(layer)  # LinearLayer(input=512, output=256)

Decorators

🐍decorators.py
import time
from functools import wraps

def timer(func):
    """Decorator to measure function execution time."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end - start:.4f} seconds")
        return result
    return wrapper

@timer
def slow_function():
    time.sleep(0.1)
    return "Done"

slow_function()


# Property decorator (used in PyTorch modules)
class Model:
    def __init__(self):
        self._device = "cpu"

    @property
    def device(self):
        """Get current device."""
        return self._device

    @device.setter
    def device(self, value):
        """Set device with validation."""
        if value not in ["cpu", "cuda"]:
            raise ValueError("Device must be 'cpu' or 'cuda'")
        self._device = value

model = Model()
print(f"Device: {model.device}")  # Uses getter
model.device = "cuda"  # Uses setter

Type Hints

🐍type_hints.py
from typing import List, Optional, Tuple, Union

def process_batch(
    tokens: List[int],
    max_length: int = 512,
    padding: Optional[str] = None
) -> Tuple[List[int], int]:
    """
    Process a batch of tokens.

    Args:
        tokens: List of token IDs
        max_length: Maximum sequence length
        padding: Padding strategy (optional)

    Returns:
        Tuple of (processed tokens, actual length)
    """
    actual_length = min(len(tokens), max_length)
    processed = tokens[:max_length]
    return processed, actual_length


# Union for multiple types
def get_embedding(token: Union[int, str]) -> List[float]:
    """Get embedding for token (ID or string)."""
    if isinstance(token, str):
        token = hash(token) % 1000  # Simplified
    return [0.0] * 64  # Return dummy embedding

Context Managers

🐍context_managers.py
import torch

# Context managers are used extensively in PyTorch
# Example: torch.no_grad()

class NoGradContext:
    """Simplified version of torch.no_grad()."""

    def __enter__(self):
        self.prev_state = torch.is_grad_enabled()
        torch.set_grad_enabled(False)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        torch.set_grad_enabled(self.prev_state)
        return False  # Don't suppress exceptions

# Usage
with torch.no_grad():
    # Gradients not computed here
    pass
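For quick, throwaway context managers, the same enter/exit pattern can also be written as a generator with the standard library's `contextlib`. A sketch equivalent to the `NoGradContext` class above (the name `no_grad_context` is illustrative):

```python
import contextlib

import torch


@contextlib.contextmanager
def no_grad_context():
    """Generator-based equivalent of the NoGradContext class above."""
    prev_state = torch.is_grad_enabled()
    torch.set_grad_enabled(False)
    try:
        yield  # Body of the `with` block runs here
    finally:
        # Restore the previous grad mode even if an exception occurred
        torch.set_grad_enabled(prev_state)


x = torch.ones(3, requires_grad=True)
with no_grad_context():
    y = x * 2  # Computed without tracking gradients

print(f"Inside context, requires_grad: {y.requires_grad}")  # False
print(f"Grad mode restored: {torch.is_grad_enabled()}")     # True
```

The `try/finally` mirrors what `__exit__` guarantees: cleanup runs even when the body raises.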

0.2 NumPy Essentials

Array Creation and Manipulation

🐍numpy_arrays.py
import numpy as np

# Array creation
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.zeros((3, 4))           # 3x4 zeros
arr3 = np.ones((2, 3))            # 2x3 ones
arr4 = np.random.randn(2, 3)      # Random normal distribution
arr5 = np.arange(0, 10, 2)        # [0, 2, 4, 6, 8]
arr6 = np.linspace(0, 1, 5)       # 5 values from 0 to 1

print(f"Shapes: {arr1.shape}, {arr2.shape}, {arr4.shape}")


# Reshaping
arr = np.arange(12)
reshaped = arr.reshape(3, 4)        # 3x4 matrix
reshaped_auto = arr.reshape(3, -1)  # -1 infers dimension
print(f"Original: {arr.shape}, Reshaped: {reshaped.shape}")


# Indexing and slicing
matrix = np.arange(20).reshape(4, 5)
print(f"Original matrix:\n{matrix}")
print(f"First row: {matrix[0]}")
print(f"First column: {matrix[:, 0]}")
print(f"Submatrix [1:3, 2:4]:\n{matrix[1:3, 2:4]}")

Broadcasting

🐍broadcasting.py
"""
Broadcasting: NumPy's way of handling arrays of different shapes.
Critical for understanding tensor operations in transformers!
"""
import numpy as np

# Example 1: Scalar and array
arr = np.array([1, 2, 3])
result = arr * 2  # [2, 4, 6] - scalar broadcasts to all elements
print(f"Scalar broadcast: {result}")


# Example 2: Different shaped arrays
a = np.array([[1], [2], [3]])  # Shape: (3, 1)
b = np.array([10, 20, 30])     # Shape: (3,)

# Broadcasting rules:
# 1. Align shapes from right
# 2. Dimensions must be equal OR one of them is 1
# a: (3, 1)
# b:    (3,) -> treated as (1, 3)
# Result: (3, 3)

result = a + b
print(f"a shape: {a.shape}, b shape: {b.shape}")
print(f"Result shape: {result.shape}")
print(f"Result:\n{result}")
# [[11, 21, 31],
#  [12, 22, 32],
#  [13, 23, 33]]


# Example 3: Attention score broadcasting (preview!)
# Query: (batch, heads, seq_len, d_k) = (2, 8, 10, 64)
# Key:   (batch, heads, seq_len, d_k) = (2, 8, 10, 64)
# We want Q @ K^T -> (2, 8, 10, 10)

batch, heads, seq_len, d_k = 2, 8, 10, 64
Q = np.random.randn(batch, heads, seq_len, d_k)
K = np.random.randn(batch, heads, seq_len, d_k)

# Matrix multiply on last two dimensions
scores = np.matmul(Q, K.transpose(0, 1, 3, 2))  # K^T
print(f"Q shape: {Q.shape}")
print(f"K shape: {K.shape}")
print(f"Scores shape: {scores.shape}")  # (2, 8, 10, 10)

Common Operations

🐍numpy_operations.py
import numpy as np

# Element-wise operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(f"Add: {a + b}")        # [5, 7, 9]
print(f"Multiply: {a * b}")   # [4, 10, 18]
print(f"Power: {a ** 2}")     # [1, 4, 9]
print(f"Exp: {np.exp(a)}")    # [2.71, 7.39, 20.09]


# Reduction operations
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(f"Sum all: {np.sum(matrix)}")             # 21
print(f"Sum axis 0: {np.sum(matrix, axis=0)}")  # [5, 7, 9] (column sums)
print(f"Sum axis 1: {np.sum(matrix, axis=1)}")  # [6, 15] (row sums)
print(f"Mean: {np.mean(matrix)}")               # 3.5
print(f"Max: {np.max(matrix)}")                 # 6
print(f"Argmax: {np.argmax(matrix)}")           # 5 (flat index)


# Matrix operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(f"Matrix multiply:\n{np.matmul(A, B)}")  # or A @ B
print(f"Transpose:\n{A.T}")
print(f"Inverse:\n{np.linalg.inv(A)}")

Softmax Implementation

🐍softmax.py
import numpy as np

def softmax(x, axis=-1):
    """
    Compute softmax values.

    Softmax converts logits to probabilities.
    Used extensively in attention mechanisms!

    softmax(x_i) = exp(x_i) / sum(exp(x_j))

    Args:
        x: Input array
        axis: Axis to compute softmax over

    Returns:
        Softmax probabilities (sum to 1 along axis)
    """
    # Subtract max for numerical stability
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


# Example
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Logits: {logits}")
print(f"Softmax: {probs}")
print(f"Sum: {probs.sum()}")  # 1.0


# 2D example (like attention scores)
scores = np.array([[2.0, 1.0, 0.5],
                   [1.0, 3.0, 1.0]])
attention_weights = softmax(scores, axis=-1)
print(f"Attention weights:\n{attention_weights}")
print(f"Row sums: {attention_weights.sum(axis=-1)}")  # [1.0, 1.0]

0.3 PyTorch Fundamentals

Tensors: The Building Block

🐍tensors.py
import numpy as np
import torch

# Tensor creation
t1 = torch.tensor([1, 2, 3])
t2 = torch.zeros(3, 4)
t3 = torch.ones(2, 3)
t4 = torch.randn(2, 3)         # Normal distribution
t5 = torch.arange(0, 10, 2)
t6 = torch.eye(3)              # Identity matrix

print(f"Tensor: {t1}")
print(f"Shape: {t1.shape}, Dtype: {t1.dtype}, Device: {t1.device}")


# NumPy conversion
np_array = np.array([1, 2, 3])
tensor_from_np = torch.from_numpy(np_array)
np_from_tensor = t1.numpy()


# Device management (CPU/GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

t_gpu = t4.to(device)
print(f"Tensor device: {t_gpu.device}")


# Data types
t_float = torch.tensor([1.0, 2.0], dtype=torch.float32)
t_half = t_float.half()  # float16 - used for efficiency
t_int = torch.tensor([1, 2], dtype=torch.long)
print(f"Float32: {t_float.dtype}, Float16: {t_half.dtype}")

Tensor Operations

🐍tensor_operations.py
import torch

# Basic operations
a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
b = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)

# Element-wise
print(f"Add: {a + b}")
print(f"Multiply: {a * b}")

# Matrix multiplication (3 equivalent ways)
print(f"Matmul 1: {torch.matmul(a, b)}")
print(f"Matmul 2: {a @ b}")
print(f"Matmul 3: {torch.mm(a, b)}")  # Only for 2D

# Batch matrix multiplication (for attention!)
batch_a = torch.randn(2, 3, 4)  # (batch, m, n)
batch_b = torch.randn(2, 4, 5)  # (batch, n, p)
result = torch.bmm(batch_a, batch_b)  # (batch, m, p)
print(f"Batch matmul: {batch_a.shape} @ {batch_b.shape} = {result.shape}")


# Reshaping
t = torch.arange(12)
print(f"Original: {t.shape}")
print(f"View: {t.view(3, 4).shape}")
print(f"Reshape: {t.reshape(2, 6).shape}")
print(f"Unsqueeze: {t.unsqueeze(0).shape}")  # Add dim at position 0
print(f"Squeeze: {torch.zeros(1, 3, 1).squeeze().shape}")  # Remove dims of size 1


# Transpose and permute
t = torch.randn(2, 3, 4)
print(f"Original: {t.shape}")
print(f"Transpose (1,2): {t.transpose(1, 2).shape}")  # Swap dims 1 and 2
print(f"Permute: {t.permute(2, 0, 1).shape}")  # Arbitrary reorder


# Concatenation
t1 = torch.randn(2, 3)
t2 = torch.randn(2, 3)
print(f"Cat dim 0: {torch.cat([t1, t2], dim=0).shape}")  # (4, 3)
print(f"Cat dim 1: {torch.cat([t1, t2], dim=1).shape}")  # (2, 6)
print(f"Stack: {torch.stack([t1, t2], dim=0).shape}")    # (2, 2, 3)

Autograd: Automatic Differentiation

🐍autograd.py
"""
Autograd is PyTorch's automatic differentiation engine.
It computes gradients automatically - essential for training!
"""
import torch

# Basic gradient computation
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1  # y = x² + 3x + 1

y.backward()  # Compute dy/dx
print(f"x = {x.item()}, y = {y.item()}")
print(f"dy/dx = {x.grad.item()}")  # dy/dx = 2x + 3 = 7


# Multi-variable gradient
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # y = x₁² + x₂² + x₃²

y.backward()
print(f"Gradient: {x.grad}")  # [2, 4, 6] = 2x


# Neural network style gradient
W = torch.randn(3, 2, requires_grad=True)
x = torch.randn(2)
y = torch.randn(3)

# Forward pass
pred = W @ x
loss = ((pred - y) ** 2).mean()

# Backward pass
loss.backward()
print(f"Weight gradient shape: {W.grad.shape}")


# Disabling gradients (for inference)
with torch.no_grad():
    # No gradients computed here - faster!
    output = W @ x

# Or using the decorator form
@torch.no_grad()
def inference(model, x):
    return model(x)

nn.Module: Building Neural Networks

🐍nn_module.py
import torch
import torch.nn as nn

class SimpleNetwork(nn.Module):
    """
    Example neural network using nn.Module.

    All PyTorch models inherit from nn.Module.
    """

    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()  # Always call parent __init__

        # Define layers as attributes
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.activation = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.

        Args:
            x: Input tensor [batch_size, input_dim]

        Returns:
            Output tensor [batch_size, output_dim]
        """
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        return x


# Create and use model
model = SimpleNetwork(input_dim=10, hidden_dim=32, output_dim=5)
print(model)

# Forward pass
x = torch.randn(4, 10)  # Batch of 4
output = model(x)
print(f"Input: {x.shape}, Output: {output.shape}")

# Access parameters
print(f"\nNumber of parameters: {sum(p.numel() for p in model.parameters())}")
for name, param in model.named_parameters():
    print(f"  {name}: {param.shape}")


# Move to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
x = x.to(device)
output = model(x)

Common nn Layers

🐍nn_layers.py
1# Layers used in Transformers
2
3# Linear (fully connected)
4linear = nn.Linear(in_features=512, out_features=256, bias=True)
5# y = xW^T + b
6
7# Embedding (lookup table)
8embedding = nn.Embedding(num_embeddings=10000, embedding_dim=512)
9# Maps integer indices to dense vectors
10token_ids = torch.tensor([1, 5, 100, 999])
11embedded = embedding(token_ids)  # [4, 512]
12
13# Layer Normalization
14layer_norm = nn.LayerNorm(normalized_shape=512)
15# Normalizes across feature dimension
16
17# Dropout
18dropout = nn.Dropout(p=0.1)
19# Randomly zeros elements during training
20
21# Softmax
22softmax = nn.Softmax(dim=-1)
23# Converts logits to probabilities
24
25# Common activations
26relu = nn.ReLU()
27gelu = nn.GELU()  # Used in transformers
28tanh = nn.Tanh()
29
30
31# Sequential container
32mlp = nn.Sequential(
33    nn.Linear(512, 2048),
34    nn.GELU(),
35    nn.Dropout(0.1),
36    nn.Linear(2048, 512)
37)

Training Loop Basics

🐍training_loop.py
1import torch.optim as optim
2
3# Complete training example
4model = SimpleNetwork(10, 32, 2)
5criterion = nn.CrossEntropyLoss()
6optimizer = optim.Adam(model.parameters(), lr=0.001)
7
8# Dummy data
9X = torch.randn(100, 10)
10y = torch.randint(0, 2, (100,))
11
12# Training loop
13num_epochs = 5
14batch_size = 16
15
16for epoch in range(num_epochs):
17    model.train()  # Set to training mode
18    total_loss = 0
19
20    # Mini-batch training
21    for i in range(0, len(X), batch_size):
22        batch_X = X[i:i+batch_size]
23        batch_y = y[i:i+batch_size]
24
25        # Forward pass
26        outputs = model(batch_X)
27        loss = criterion(outputs, batch_y)
28
29        # Backward pass
30        optimizer.zero_grad()  # Clear old gradients
31        loss.backward()        # Compute gradients
32        optimizer.step()       # Update weights
33
34        total_loss += loss.item()
35
36    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")
37
38# Evaluation
39model.eval()  # Set to evaluation mode
40with torch.no_grad():
41    test_output = model(X[:5])
42    predictions = test_output.argmax(dim=-1)
43    print(f"Predictions: {predictions}")

0.4 Linear Algebra for Transformers

Vectors and Matrices

🐍vectors_matrices.py
1import numpy as np
2
3"""
4Linear algebra is the foundation of neural networks.
5Key operations you'll see constantly in transformers.
6"""
7
8# Vectors
9v1 = np.array([1, 2, 3])
10v2 = np.array([4, 5, 6])
11
12# Dot product (inner product)
13# Used in attention: query · key
14dot = np.dot(v1, v2)  # 1*4 + 2*5 + 3*6 = 32
15print(f"Dot product: {dot}")
16
17# Vector norm (length)
18norm = np.linalg.norm(v1)  # sqrt(1² + 2² + 3²)
19print(f"Norm: {norm:.4f}")
20
21
22# Matrices
23A = np.array([[1, 2, 3],
24              [4, 5, 6]])  # 2x3 matrix
25
26B = np.array([[1, 2],
27              [3, 4],
28              [5, 6]])  # 3x2 matrix
29
30# Matrix multiplication
31# (2x3) @ (3x2) = (2x2)
32C = A @ B
33print(f"A shape: {A.shape}, B shape: {B.shape}, C shape: {C.shape}")
34print(f"A @ B:\n{C}")
35
36# Rule: Inner dimensions must match
37# (m, n) @ (n, p) = (m, p)

Matrix Multiplication in Attention

🐍attention_matmul.py
1"""
2Understanding matrix multiplication is CRITICAL for attention.
3
4Attention: Attention(Q, K, V) = softmax(QK^T / √d) V
5
6Let's trace through the shapes step by step.
7"""
8
9# Setup
10batch_size = 2
11seq_len = 4      # Number of tokens
12d_model = 8      # Model dimension (embedding size)
13d_k = 8          # Key/Query dimension (often = d_model)
14
15# Input: sequence of embeddings
16# Shape: (batch, seq_len, d_model)
17X = np.random.randn(batch_size, seq_len, d_model)
18print(f"Input X: {X.shape}")
19
20# Project to Q, K, V (simplified - just use X)
21Q = X  # (batch, seq_len, d_k)
22K = X  # (batch, seq_len, d_k)
23V = X  # (batch, seq_len, d_v)
24
25print(f"Q: {Q.shape}, K: {K.shape}, V: {V.shape}")
26
27# Step 1: QK^T - compute attention scores
28# Q: (batch, seq_len, d_k)
29# K^T: (batch, d_k, seq_len)  <- transpose last two dims
30# Result: (batch, seq_len, seq_len)
31
32K_T = np.transpose(K, (0, 2, 1))  # Transpose last two dimensions
33print(f"K^T: {K_T.shape}")
34
35scores = Q @ K_T
36print(f"QK^T (scores): {scores.shape}")  # (2, 4, 4)
37
38# This gives us a seq_len × seq_len attention matrix!
39# scores[b, i, j] = how much token i attends to token j
40
41# Step 2: Scale by sqrt(d_k)
42d_k_value = d_k
43scores_scaled = scores / np.sqrt(d_k_value)
44
45# Step 3: Softmax to get attention weights
46def softmax(x, axis=-1):
47    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
48    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
49
50attention_weights = softmax(scores_scaled, axis=-1)
51print(f"Attention weights: {attention_weights.shape}")  # (2, 4, 4)
52print(f"Sum of weights (should be 1): {attention_weights[0, 0].sum():.4f}")
53
54# Step 4: Multiply by V to get output
55# attention_weights: (batch, seq_len, seq_len)
56# V: (batch, seq_len, d_v)
57# Result: (batch, seq_len, d_v)
58
59output = attention_weights @ V
60print(f"Output: {output.shape}")  # (2, 4, 8)
61
62# Each output token is a weighted combination of all V vectors!

Batch Matrix Multiplication

🐍batch_matmul.py
1"""
2In transformers, we do matrix multiplication across batches.
3PyTorch handles this with broadcasting and batched operations.
4"""
5
6import torch
7
8# Batched matrix multiplication
9# (batch, m, n) @ (batch, n, p) = (batch, m, p)
10
11A = torch.randn(32, 10, 64)  # 32 batches of 10x64 matrices
12B = torch.randn(32, 64, 20)  # 32 batches of 64x20 matrices
13
14C = torch.bmm(A, B)  # Batch matrix multiply
15print(f"Batched matmul: {A.shape} @ {B.shape} = {C.shape}")
16# (32, 10, 20)
17
18# With more dimensions (multi-head attention)
19# (batch, heads, seq, dim) @ (batch, heads, dim, seq)
20Q = torch.randn(2, 8, 100, 64)  # 2 batches, 8 heads, 100 tokens, 64 dim
21K = torch.randn(2, 8, 100, 64)
22
23# torch.matmul handles arbitrary batch dimensions!
24scores = torch.matmul(Q, K.transpose(-2, -1))
25print(f"Multi-head scores: {scores.shape}")  # (2, 8, 100, 100)

Einsum: Einstein Summation

🐍einsum.py
1"""
2Einsum is a powerful notation for tensor operations.
3You'll see it in many transformer implementations.
4"""
5
6import torch
7
8# Basic matrix multiply: C[i,j] = sum_k A[i,k] * B[k,j]
9A = torch.randn(3, 4)
10B = torch.randn(4, 5)
11C = torch.einsum('ik,kj->ij', A, B)  # Same as A @ B
12print(f"Einsum matmul: {C.shape}")
13
14# Batch matrix multiply
15A = torch.randn(2, 3, 4)
16B = torch.randn(2, 4, 5)
17C = torch.einsum('bij,bjk->bik', A, B)  # Same as torch.bmm
18print(f"Einsum batch matmul: {C.shape}")
19
20# Multi-head attention
21# Q: (batch, heads, seq_q, d_k)
22# K: (batch, heads, seq_k, d_k)
23# Output: (batch, heads, seq_q, seq_k)
24Q = torch.randn(2, 8, 10, 64)
25K = torch.randn(2, 8, 20, 64)
26scores = torch.einsum('bhqd,bhkd->bhqk', Q, K)
27print(f"Einsum attention scores: {scores.shape}")
28
29# Einsum is readable once you understand it:
30# 'bhqd,bhkd->bhqk' means:
31# - b: batch (same in both)
32# - h: heads (same in both)
33# - q: query sequence position
34# - k: key sequence position
35# - d: dimension (summed over - contracted)

0.5 Deep Learning Fundamentals

Neural Network Basics

🐍nn_basics.py
1"""
2Core concepts you need before studying transformers.
3"""
4
5print("""
6NEURAL NETWORK BASICS:
7──────────────────────
8
91. FORWARD PASS
10   Input → Layers → Output
11
12   Each layer: y = activation(Wx + b)
13
142. LOSS FUNCTION
15   Measures how wrong the prediction is
16   - Classification: Cross-entropy loss
17   - Regression: MSE loss
18
193. BACKWARD PASS (Backpropagation)
20   Compute gradients: ∂Loss/∂weights
21
22   Chain rule: ∂L/∂w = ∂L/∂y × ∂y/∂w
23
244. OPTIMIZATION
25   Update weights: w = w - lr × ∂L/∂w
26
27
28ACTIVATION FUNCTIONS:
29─────────────────────
30
31ReLU(x) = max(0, x)
32- Simple, effective
33- Can have "dead neurons"
34
35GELU(x) = x × Φ(x)  where Φ is CDF of normal
36- Smoother than ReLU
37- Used in transformers (BERT, GPT)
38
39Softmax(x_i) = exp(x_i) / Σ exp(x_j)
40- Converts logits to probabilities
41- Used in attention and classification
42
43
44REGULARIZATION:
45───────────────
46
47Dropout: Randomly zero neurons during training
48- Prevents overfitting
49- dropout_rate = 0.1 means 10% dropped
50
51Layer Normalization: Normalize across features
52- Stabilizes training
53- LayerNorm(x) = (x - μ) / σ × γ + β
54""")
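To make steps 1-4 concrete, here is a minimal NumPy sketch of gradient descent fitting y = 2x with a single weight; the data, learning rate, and variable names are all illustrative:

```python
import numpy as np

# Toy data: y = 2x, so the optimal weight is 2.0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0    # Initial weight
lr = 0.05  # Learning rate

for step in range(100):
    pred = w * x                         # 1. Forward pass
    loss = ((pred - y) ** 2).mean()      # 2. MSE loss
    grad = (2 * (pred - y) * x).mean()   # 3. dL/dw via the chain rule
    w = w - lr * grad                    # 4. Update rule: w = w - lr × ∂L/∂w

print(f"Learned w: {w:.4f}")  # Converges toward 2.0
```

The same four steps repeat in every PyTorch training loop; autograd just computes `grad` for you.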

Loss Functions

🐍loss_functions.py
1import torch
2import torch.nn.functional as F
3
4# Cross-Entropy Loss (for classification)
5# Used when predicting next token in language models
6
7logits = torch.tensor([[2.0, 1.0, 0.5],   # Predictions for sample 1
8                       [0.5, 2.5, 1.0]])   # Predictions for sample 2
9targets = torch.tensor([0, 1])  # True classes
10
11loss = F.cross_entropy(logits, targets)
12print(f"Cross-entropy loss: {loss.item():.4f}")
13
14# Manual computation
15probs = F.softmax(logits, dim=-1)
16print(f"Probabilities:\n{probs}")
17# Loss = -log(prob of correct class)
18manual_loss = -torch.log(probs[0, 0]) - torch.log(probs[1, 1])
19print(f"Manual loss: {(manual_loss/2).item():.4f}")
20
21
22# Label Smoothing (used in transformers)
23# Instead of hard targets [1, 0, 0], use soft targets [0.9, 0.05, 0.05]
24def label_smoothed_loss(logits, targets, smoothing=0.1):
25    n_classes = logits.size(-1)
26
27    # Create smoothed targets
28    smooth_targets = torch.full_like(logits, smoothing / (n_classes - 1))
29    smooth_targets.scatter_(-1, targets.unsqueeze(-1), 1.0 - smoothing)
30
31    # KL divergence
32    log_probs = F.log_softmax(logits, dim=-1)
33    loss = -(smooth_targets * log_probs).sum(dim=-1).mean()
34
35    return loss
36
37smoothed = label_smoothed_loss(logits, targets, smoothing=0.1)
38print(f"Label smoothed loss: {smoothed.item():.4f}")

Optimization

🐍optimization.py
1import torch.optim as optim
2
3# Create a simple model
4model = nn.Linear(10, 2)
5
6# Different optimizers
7
8# SGD: Simple but requires tuning
9sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
10
11# Adam: Adaptive learning rates (most common)
12adam = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
13
14# AdamW: Adam with proper weight decay (used in transformers)
15adamw = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
16
17
18# Learning rate scheduling
19from torch.optim.lr_scheduler import LambdaLR
20
21optimizer = optim.Adam(model.parameters(), lr=0.001)
22
23# Warmup + decay (transformer-style)
24def lr_lambda(step):
25    warmup_steps = 1000
26    if step < warmup_steps:
27        return step / warmup_steps  # Linear warmup
28    else:
29        return 1.0  # Could add decay here
30
31scheduler = LambdaLR(optimizer, lr_lambda)
32
33# In training loop:
34for step in range(100):
35    # ... training code ...
36    optimizer.step()
37    scheduler.step()  # Update learning rate
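The warmup-then-decay idea above has a standard closed form in the original Transformer paper ("Attention Is All You Need"): lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5). A small sketch (the function name is illustrative):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """LR schedule from 'Attention Is All You Need':
    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    Rises linearly during warmup, then decays as 1/sqrt(step)."""
    step = max(step, 1)  # Avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The peak is reached exactly at step == warmup_steps
print(f"step   100: {transformer_lr(100):.2e}")
print(f"step  4000: {transformer_lr(4000):.2e}")
print(f"step 40000: {transformer_lr(40000):.2e}")
```

This is what `lr_lambda` above would return (as a multiplier) if the "could add decay here" branch were filled in.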

0.6 NLP Basics

Tokenization

🐍tokenization.py
1"""
2Tokenization: Converting text to numbers that models can process.
3"""
4
5# Word tokenization (simple)
6text = "The cat sat on the mat."
7words = text.lower().split()
8print(f"Words: {words}")
9
10# Build vocabulary
11vocab = {word: idx for idx, word in enumerate(set(words))}
12vocab['<PAD>'] = len(vocab)
13vocab['<UNK>'] = len(vocab)
14print(f"Vocabulary: {vocab}")
15
16# Encode
17token_ids = [vocab.get(w, vocab['<UNK>']) for w in words]
18print(f"Token IDs: {token_ids}")
19
20
21# Subword tokenization (used in transformers)
22# BPE, WordPiece, SentencePiece
23print("""
24SUBWORD TOKENIZATION:
25─────────────────────
26
27"unhappiness" → ["un", "happiness"] or ["un", "happ", "iness"]
28
29Benefits:
30- Handles unknown words
31- Smaller vocabulary
32- Captures morphology
33
34Common methods:
35- BPE (Byte Pair Encoding): GPT-2, RoBERTa
36- WordPiece: BERT
37- SentencePiece: T5, mBART
38""")
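To see how BPE grows subwords, here is a toy sketch of a single merge step over a hand-made corpus. This is a teaching sketch, not a production tokenizer, and the corpus and helper names are invented for illustration:

```python
from collections import Counter

# Toy corpus: words as tuples of symbols, with their frequencies
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w"): 3, ("n", "e", "w", "e", "r"): 4}

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # e.g. ("l","o") -> "lo"
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(vocab)
print(f"Most frequent pair: {pair}")
vocab = merge_pair(vocab, pair)
print(f"After merge: {list(vocab)}")
```

Real BPE repeats this merge step thousands of times, recording each merge; tokenizing new text replays those merges in order.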

Embeddings

🐍embeddings.py
1import torch.nn as nn
2
3"""
4Embeddings: Dense vector representations of tokens.
5"""
6
7vocab_size = 10000
8embedding_dim = 512
9
10# Embedding layer
11embedding = nn.Embedding(vocab_size, embedding_dim)
12
13# Usage
14token_ids = torch.tensor([1, 5, 100, 2, 8])
15embeddings = embedding(token_ids)
16print(f"Token IDs: {token_ids.shape}")
17print(f"Embeddings: {embeddings.shape}")  # [5, 512]
18
19# Each token gets a 512-dimensional vector
20# These vectors are LEARNED during training
21
22print("""
23EMBEDDING PROPERTIES:
24─────────────────────
25
261. Similar words have similar embeddings
27   dog ≈ cat (both animals)
28
292. Semantic relationships encoded
30   king - man + woman ≈ queen
31
323. Learned from data
33   - Random initialization
34   - Updated during training
35   - Can also use pre-trained (Word2Vec, GloVe)
36""")
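"Similar embeddings" is usually measured with cosine similarity (the normalized dot product). A quick sketch with hand-picked toy vectors; in a trained model these would be learned, high-dimensional, and not interpretable by eye:

```python
import torch
import torch.nn.functional as F

# Hand-picked toy vectors (NOT learned embeddings)
dog = torch.tensor([0.9, 0.8, 0.1])
cat = torch.tensor([0.8, 0.9, 0.2])  # Deliberately close to `dog`
car = torch.tensor([0.1, 0.2, 0.9])  # Deliberately far from both

sim_dog_cat = F.cosine_similarity(dog, cat, dim=0)
sim_dog_car = F.cosine_similarity(dog, car, dim=0)
print(f"dog·cat similarity: {sim_dog_cat:.3f}")  # Close to 1
print(f"dog·car similarity: {sim_dog_car:.3f}")  # Much lower
```

Cosine similarity ignores vector length and compares direction only, which is why it is the usual metric for "king - man + woman ≈ queen"-style comparisons.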

Sequence Modeling

🐍sequence_modeling.py
1"""
2Key concepts for understanding transformers.
3"""
4
5print("""
6SEQUENCE MODELING CHALLENGES:
7─────────────────────────────
8
91. VARIABLE LENGTH
10   "Hi" vs "The quick brown fox jumps over the lazy dog"
11   Solution: Padding + masking
12
132. ORDER MATTERS
14   "Dog bites man" ≠ "Man bites dog"
15   Solution: Position encodings
16
173. LONG-RANGE DEPENDENCIES
18   "The cat, which was very hungry, ate..."
19   'cat' and 'ate' are related but far apart
20   Solution: Attention mechanism!
21
22
23BEFORE TRANSFORMERS:
24────────────────────
25
26RNNs (Recurrent Neural Networks):
27- Process one token at a time
28- Hidden state carries information
29- Problem: Hard to parallelize, forgets long context
30
31LSTMs (Long Short-Term Memory):
32- Better at long sequences
33- Still sequential processing
34- Problem: Still slow
35
36
37TRANSFORMERS SOLVE THESE:
38─────────────────────────
39
401. Attention: Every token can attend to every other token
41   - No information bottleneck
42   - Captures long-range dependencies
43
442. Parallel processing: All tokens processed simultaneously
45   - Much faster training
46   - Efficient use of GPUs
47
483. Position encodings: Inject position information
49   - Sine/cosine functions (original)
50   - Learned embeddings
51   - Rotary embeddings (modern)
52""")
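The sine/cosine encodings mentioned above follow the original Transformer formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch (the function name is illustrative, and d_model is assumed even):

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)  # Even dims: sine
    pe[:, 1::2] = np.cos(positions / div)  # Odd dims: cosine
    return pe

pe = sinusoidal_position_encoding(seq_len=10, d_model=8)
print(f"Encoding shape: {pe.shape}")  # (10, 8)
print(f"Position 0: {pe[0]}")         # sin terms are 0, cos terms are 1
```

Each position gets a unique pattern of frequencies, and the encoding is simply added to the token embeddings before the first layer.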

0.7 Quick Reference

Common Tensor Shapes in Transformers

🐍tensor_shapes.py
1print("""
2COMMON TENSOR SHAPES:
3─────────────────────
4
5Input text:
6  token_ids: (batch_size, seq_len)
7  Example: (32, 128) = 32 sentences, 128 tokens each
8
9Embeddings:
10  embeddings: (batch_size, seq_len, d_model)
11  Example: (32, 128, 512)
12
13Attention (single head):
14  Q, K, V: (batch_size, seq_len, d_k)
15  Scores: (batch_size, seq_len, seq_len)
16  Output: (batch_size, seq_len, d_v)
17
18Attention (multi-head):
19  Q, K, V: (batch_size, num_heads, seq_len, d_k)
20  Scores: (batch_size, num_heads, seq_len, seq_len)
21  Output: (batch_size, num_heads, seq_len, d_v)
22
23After concatenating heads:
24  Output: (batch_size, seq_len, num_heads * d_v) = (batch_size, seq_len, d_model)
25
26FFN:
27  Input: (batch_size, seq_len, d_model)
28  Hidden: (batch_size, seq_len, d_ff)  # d_ff = 4 * d_model typically
29  Output: (batch_size, seq_len, d_model)
30
31Final output (translation):
32  logits: (batch_size, seq_len, vocab_size)
33""")
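The multi-head shapes in the table can be verified with a quick trace; the small dimensions here are arbitrary examples:

```python
import torch

batch, heads, seq_len = 2, 4, 6
d_model = 32
d_k = d_model // heads  # 8 per head

# Per-head queries, keys, values: (batch, heads, seq_len, d_k)
q = torch.randn(batch, heads, seq_len, d_k)
k = torch.randn(batch, heads, seq_len, d_k)
v = torch.randn(batch, heads, seq_len, d_k)

scores = q @ k.transpose(-2, -1)                 # (batch, heads, seq, seq)
weights = torch.softmax(scores / d_k ** 0.5, dim=-1)
out = weights @ v                                # (batch, heads, seq, d_k)

# Concatenate heads back to d_model
merged = out.transpose(1, 2).reshape(batch, seq_len, heads * d_k)

print(scores.shape)  # torch.Size([2, 4, 6, 6])
print(merged.shape)  # torch.Size([2, 6, 32])
```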

Common Hyperparameters

🐍hyperparameters.py
1print("""
2COMMON HYPERPARAMETERS:
3───────────────────────
4
5Architecture:
6  d_model: 512 (base), 768 (BERT), 1024 (large)
7  num_heads: 8 (base), 12 (BERT), 16 (large)
8  num_layers: 6 (base), 12 (BERT), 24+ (large)
9  d_ff: 2048 (= 4 * d_model typically)
10  vocab_size: 30000-50000
11  max_seq_len: 512, 1024, 2048+
12
13Training:
14  batch_size: 16-128 (depends on GPU memory)
15  learning_rate: 1e-4 to 3e-4 (with warmup)
16  warmup_steps: 4000-10000
17  dropout: 0.1
18  label_smoothing: 0.1
19
20Optimizer:
21  Adam/AdamW
22  β₁ = 0.9, β₂ = 0.98 (transformers) or 0.999 (default)
23  ε = 1e-9 (transformers) or 1e-8 (default)
24  weight_decay: 0.01
25""")

Summary

What You Should Know Before Starting

Topic            Key Concepts
─────────────────────────────
Python           Classes, decorators, type hints, context managers
NumPy            Arrays, broadcasting, reshaping, axis operations
PyTorch          Tensors, autograd, nn.Module, optimizers
Linear Algebra   Matrix multiplication, transpose, batch operations
Deep Learning    Forward/backward pass, loss functions, optimization
NLP              Tokenization, embeddings, sequence modeling

Self-Assessment Checklist

  • Can implement a simple neural network in PyTorch
  • Understand broadcasting in NumPy/PyTorch
  • Know what matrix multiplication dimensions mean
  • Can explain what backpropagation does
  • Understand softmax and cross-entropy loss
  • Know what an embedding layer does
  • Understand why sequence order matters in NLP

Exercises

  1. Implement softmax from scratch in NumPy and verify against torch.softmax.
  2. Build a simple 2-layer neural network in PyTorch and train it on MNIST.
  3. Practice matrix multiplication: Given shapes, predict output shape.
  4. Implement a basic tokenizer that splits text and builds a vocabulary.
  5. Trace through the attention score computation with concrete shapes.

Ready to Start!

If you're comfortable with the concepts in this chapter, you're ready to dive into transformers! The course will build on these fundamentals to construct attention mechanisms, encoder-decoder architectures, and complete translation systems.

Next up: Chapter 1 - Introduction to Transformers