Chapter 8: Backpropagation from Scratch

Gradient Descent Variants
Learning Objectives

By the end of this section, you will be able to:

  1. Understand the three main gradient descent variants: batch, stochastic, and mini-batch gradient descent
  2. Analyze the trade-offs between noise, computational efficiency, and convergence speed
  3. Derive the gradient variance scaling with batch size mathematically
  4. Select appropriate hyperparameters (learning rate, batch size) for different scenarios
  5. Implement all three variants in PyTorch and understand when to use each
  6. Apply the linear scaling rule for large-batch training
Why This Matters: The choice between gradient descent variants is one of the most impactful decisions in training neural networks. It affects not just training speed, but also generalization, memory usage, and ultimately model quality. Understanding these trade-offs is essential for any practitioner.

The Optimization Challenge

In the previous section, we established that training a neural network means minimizing a loss function $\mathcal{L}(\theta)$ over the parameter space. The fundamental approach is gradient descent:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta \mathcal{L}(\theta^{(t)})$$

But here's the catch: computing the full gradient requires evaluating the loss on every training example:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x^{(i)}), y^{(i)})$$

For modern datasets with millions or billions of examples, computing this sum at every step is computationally prohibitive. This motivates the development of stochastic optimization methods that use subsets of the data.

The Core Trade-off

All gradient descent variants navigate a fundamental trade-off:

| Aspect | More Data Per Update | Less Data Per Update |
|---|---|---|
| Gradient Accuracy | More accurate (lower variance) | Noisier (higher variance) |
| Computational Cost | Expensive per update | Cheap per update |
| Memory Usage | Higher | Lower |
| Updates Per Second | Fewer | Many |
| Generalization | May overfit | Regularization effect |

Batch Gradient Descent

Batch gradient descent (also called "full-batch" or "vanilla" gradient descent) uses the entire training set to compute each gradient update:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell(f_\theta(x^{(i)}), y^{(i)})$$

The Algorithm

📝 batch_gd.txt

```
Batch Gradient Descent
======================
Input: Dataset D = {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᴺ⁾, y⁽ᴺ⁾)}
       Learning rate η
       Number of epochs T

1. Initialize parameters θ randomly
2. For epoch = 1 to T:
     a. Compute gradient on ENTIRE dataset:
        g = (1/N) Σᵢ ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾)
     b. Update parameters:
        θ = θ - η · g
3. Return θ
```

Advantages

  • Deterministic: Same starting point always produces the same trajectory
  • Smooth convergence: No noise means clean descent toward minimum
  • Theoretical guarantees: Convergence proofs are straightforward for convex functions
  • Stable learning rate: Can use larger learning rates without divergence

Disadvantages

  • Computational cost: Each update requires passing through all N examples
  • Memory intensive: Must store gradients for all examples (or accumulate)
  • Slow for large datasets: One epoch = one gradient update
  • May get stuck: Deterministic path can get trapped in sharp local minima

When to Use Batch GD

Batch gradient descent is suitable when: (1) the dataset fits in memory, (2) the dataset is small (thousands of examples), or (3) you need very precise gradient estimates (e.g., for second-order methods).

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) takes the opposite extreme: it uses only one training example per update:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \cdot \nabla_\theta \ell(f_\theta(x^{(i)}), y^{(i)})$$

where $(x^{(i)}, y^{(i)})$ is a randomly sampled training example.

The Key Insight: Unbiased Estimator

The single-sample gradient is an unbiased estimator of the true gradient:

$$\mathbb{E}_{i \sim \text{Uniform}(1..N)} \left[ \nabla_\theta \ell(f_\theta(x^{(i)}), y^{(i)}) \right] = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell(f_\theta(x^{(i)}), y^{(i)})$$

This means that on average, we're moving in the right direction. The noise from individual samples cancels out over many updates.
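This unbiasedness is easy to check numerically. The sketch below uses a toy linear-regression setup of our own (not the chapter's running example): it averages the per-sample gradients over every index and compares the result to the full-batch gradient.

```python
import torch

torch.manual_seed(0)
N = 500
X = torch.randn(N, 3)
w_true = torch.tensor([[1.0], [-2.0], [0.5]])
y = X @ w_true

w = torch.zeros(3, 1)  # current parameters

# Full gradient of the MSE loss (1/N) Σᵢ (xᵢᵀw - yᵢ)² with respect to w
def full_gradient(w):
    residual = X @ w - y            # shape (N, 1)
    return 2 * X.T @ residual / N   # shape (3, 1)

# Gradient of the loss on a single example i
def single_gradient(w, i):
    residual = X[i:i+1] @ w - y[i:i+1]
    return 2 * X[i:i+1].T @ residual

# Averaging the per-sample gradient over all i recovers the full gradient
avg_single = torch.stack([single_gradient(w, i) for i in range(N)]).mean(dim=0)
print(torch.allclose(avg_single, full_gradient(w), atol=1e-4))  # True
```

Under uniform sampling, the expectation over $i$ is exactly this average, which is why the single-sample gradient is unbiased.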

The Algorithm

📝 sgd.txt

```
Stochastic Gradient Descent
============================
Input: Dataset D = {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᴺ⁾, y⁽ᴺ⁾)}
       Learning rate η
       Number of epochs T

1. Initialize parameters θ randomly
2. For epoch = 1 to T:
     Shuffle the dataset
     For each example (x⁽ⁱ⁾, y⁽ⁱ⁾):
       a. Compute gradient on single example:
          g = ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾)
       b. Update parameters:
          θ = θ - η · g
3. Return θ
```

Advantages

  • Very fast iterations: Each update only processes one example
  • Low memory: Only need to store one example at a time
  • Online learning: Can learn from streaming data
  • Escapes local minima: Noise helps explore the loss landscape
  • Implicit regularization: Noise acts as a regularizer, often improving generalization

Disadvantages

  • High variance: Updates are noisy, causing erratic convergence
  • No vectorization: Cannot leverage GPU parallelism effectively
  • Requires careful tuning: Learning rate must be smaller to handle noise
  • Never truly converges: Oscillates around the minimum due to noise

The Noise Paradox

The noise in SGD is both a bug and a feature. While it prevents perfect convergence, research shows that this noise helps SGD find flatter minima that generalize better. This is one reason why SGD often outperforms more sophisticated optimizers on generalization.

Mini-batch Gradient Descent

Mini-batch gradient descent strikes a balance between the extremes, using a small batch of $B$ examples per update:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \cdot \frac{1}{B} \sum_{j=1}^{B} \nabla_\theta \ell(f_\theta(x^{(i_j)}), y^{(i_j)})$$

where $\{i_1, ..., i_B\}$ is a randomly sampled subset of indices.

Why Mini-batch is the Standard

Mini-batch gradient descent has become the de facto standard for training neural networks because it optimally balances:

  1. Reduced variance: Averaging over $B$ samples reduces gradient variance by a factor of $B$ (standard deviation by $\sqrt{B}$)
  2. GPU efficiency: Modern GPUs are optimized for batch matrix operations
  3. Regularization: Still retains some noise for implicit regularization
  4. Memory efficiency: Batch size can be tuned to available memory

Variance Reduction Analysis

Let $g_i = \nabla_\theta \ell(f_\theta(x^{(i)}), y^{(i)})$ be the gradient for example $i$, with variance $\text{Var}(g_i) = \sigma^2$. The mini-batch gradient is:

$$\hat{g}_B = \frac{1}{B} \sum_{j=1}^{B} g_{i_j}$$

By properties of variance (for independent samples):

$$\text{Var}(\hat{g}_B) = \text{Var}\left(\frac{1}{B} \sum_{j=1}^{B} g_{i_j}\right) = \frac{1}{B^2} \cdot B \cdot \sigma^2 = \frac{\sigma^2}{B}$$

Therefore, the standard deviation of the gradient estimate scales as:

$$\text{Std}(\hat{g}_B) = \frac{\sigma}{\sqrt{B}}$$
Key Result: Doubling the batch size reduces gradient noise by a factor of $\sqrt{2} \approx 1.41$, not by a factor of 2. This has important implications for computational efficiency.
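A quick way to see this scaling is to average iid noise, which stands in here for per-example gradient noise (a simplified stand-in, not a real training run; the batch sizes and trial count are illustrative):

```python
import torch

torch.manual_seed(0)
sigma = 1.0
trials = 20000

# The std of the mean of B iid samples should be sigma / sqrt(B)
for B in [1, 4, 16, 64]:
    means = torch.randn(trials, B).mean(dim=1)  # each row: one "mini-batch" average
    print(f"B={B:3d}  measured std={means.std():.3f}  predicted={sigma / B**0.5:.3f}")
```

The measured and predicted columns should agree closely, confirming the $1/\sqrt{B}$ behavior.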

Common Batch Sizes

| Batch Size | Use Case | Typical Hardware |
|---|---|---|
| 1 | Pure SGD, streaming data | CPU |
| 16-32 | Small datasets, limited memory | Single GPU |
| 64-128 | Standard training | Single GPU |
| 256-512 | Large-scale training | Multi-GPU |
| 1024+ | Distributed training | GPU cluster |

Powers of 2

Batch sizes are typically powers of 2 (32, 64, 128, 256) because they align well with GPU memory architectures and matrix operation optimizations.

Interactive: Compare Variants

The visualization below shows all three gradient descent variants navigating the same loss landscape. Watch how batch GD follows a smooth path, while SGD is noisy, and mini-batch balances the two.

Gradient Descent Variants: Loss Landscape Navigation

[Interactive visualization: each variant's trajectory on the landscape, with per-iteration readouts of current loss and distance to the optimum. Batch GD uses all data: smooth but slow per iteration.]

Quick Check

Looking at the visualization, which variant reaches the minimum fastest in terms of wall-clock time?


Convergence Analysis

Let's analyze how these variants converge over training epochs:

Convergence Comparison

Watch how different gradient descent variants converge over training epochs. Notice the trade-off between noise and convergence speed.

[Interactive chart: training loss (0 to 3) vs. epoch (0 to 100) for Batch GD, Stochastic GD, and Mini-batch GD, with per-epoch loss readouts for each method. Early in training, all methods show a rapid initial decrease.]

Convergence Rates

For a strongly convex function with condition number κ\kappa:

| Method | Convergence Rate | Notes |
|---|---|---|
| Batch GD | O(κ log(1/ε)) | Linear convergence, deterministic |
| SGD | O(σ²/ε) | Sublinear, limited by noise floor |
| Mini-batch | O(σ²/(Bε)) | Improves with batch size B |

However, neural network loss functions are non-convex, so these theoretical rates don't directly apply. In practice, all three methods can find good solutions, but with different characteristics.

The Generalization Advantage of Noise

Empirically, noisier methods (SGD, small-batch) often find solutions that generalize better. Several theories explain this:

  1. Flat minima hypothesis: Noise helps escape sharp minima that overfit, finding flatter minima that generalize
  2. Implicit regularization: The noise term effectively adds an L2 regularization-like penalty
  3. Exploration: Random fluctuations explore more of the loss landscape

Learning Rate Dynamics

The learning rate η\eta is perhaps the most critical hyperparameter in gradient descent. Too small and training is slow; too large and training diverges.

Learning Rate: The Critical Hyperparameter

Adjust the learning rate and observe how it affects the optimization trajectory on a simple quadratic loss function.

[Interactive slider: learning rate from 0.01 (too small) to 1.5 (diverging). In the optimal range, the algorithm converges smoothly and efficiently to the minimum.]

Mathematical Insight

For a quadratic loss $L(w) = (w - w^*)^2$, the update rule is:

$$w_{n+1} = w_n - \eta \cdot 2(w_n - w^*)$$

Convergence requires $|1 - 2\eta| < 1$, meaning $0 < \eta < 1$.
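A few lines of code make this condition concrete. The learning-rate values below are our own illustrative choices: one inside the convergent range, one outside it.

```python
# Iterate w <- w - eta * 2 (w - w_star) for the quadratic loss L(w) = (w - w_star)^2
def run(eta, w0=-4.0, w_star=2.0, steps=50):
    w = w0
    for _ in range(steps):
        w = w - eta * 2 * (w - w_star)
    return w

print(run(0.3))  # eta in (0, 1): converges to w* = 2
print(run(1.2))  # |1 - 2*eta| > 1: the iterates blow up
```

Each step multiplies the error $(w_n - w^*)$ by $(1 - 2\eta)$, so the iterates contract toward $w^*$ exactly when that factor has magnitude below 1.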

Loss Landscape & Trajectory

[Plot: quadratic loss vs. weight $w$, with the minimum at $w^* = 2$ and the optimization trajectory overlaid.]

Learning Rate and Batch Size Coupling

When you change the batch size, you often need to adjust the learning rate. The linear scaling rule provides guidance:

$$\eta_{\text{new}} = \eta_{\text{base}} \cdot \frac{B_{\text{new}}}{B_{\text{base}}}$$

This rule ensures that the expected weight update magnitude stays constant when batch size changes.

Derivation of Linear Scaling

Consider the expected gradient step after $k$ mini-batch updates with batch size $B$:

$$\mathbb{E}\left[\sum_{t=1}^{k} \eta \cdot \hat{g}_t^{(B)}\right] = k \cdot \eta \cdot \bar{g}$$

To match the same expected update with batch size $B' = k \cdot B$ in one update:

$$\eta' \cdot \bar{g} = k \cdot \eta \cdot \bar{g} \quad \Rightarrow \quad \eta' = k \cdot \eta = \frac{B'}{B} \cdot \eta$$

Learning Rate Warmup

When using large batch sizes with linear scaling, the learning rate can be quite large at the start. This can cause instability. The solution is learning rate warmup: start with a small learning rate and gradually increase to the target over several epochs.
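One way to sketch linear scaling plus warmup in PyTorch is with `torch.optim.lr_scheduler.LambdaLR`. The base learning rate, batch sizes, and warmup length below are hypothetical placeholder values, not recommendations:

```python
import torch
import torch.nn as nn

# Hypothetical values: base LR tuned at batch size 256, scaled for batch size 1024
base_lr, base_batch, new_batch = 0.1, 256, 1024
target_lr = base_lr * new_batch / base_batch  # linear scaling rule -> 0.4
warmup_steps = 500

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=target_lr)

# Ramp the LR linearly from near 0 to target_lr over warmup_steps, then hold it
def warmup_factor(step):
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for step in range(3):
    optimizer.step()   # gradient computation omitted in this sketch
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # still early in the ramp, far below 0.4
```

After warmup completes, a decay schedule (cosine, step, etc.) typically takes over; the warmup factor here simply clamps at 1.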

Batch Size Trade-offs

Choosing the right batch size involves balancing multiple factors. Use the explorer below to understand these trade-offs:

Batch Size Trade-offs Explorer

Explore how batch size affects various aspects of training. Each choice involves trade-offs between speed, memory, and generalization.

[Interactive explorer: batch sizes from 1 to 1024. At a medium batch size (16-64): balanced noise and stability, the most common choice for training, with good GPU utilization on most hardware. Example state: a dataset of 50,000 samples with B = 32 gives 1,563 iterations per epoch.]

Gradient Noise Scaling

The variance of the stochastic gradient estimate decreases with batch size:

$$\text{Var}(\hat{g}_B) = \frac{\sigma^2}{B}$$

Standard deviation scales as $1/\sqrt{B}$, so doubling the batch size reduces noise by about 30%.


The Batch Size Spectrum

| Batch Size | Gradient Noise | Updates/Epoch | GPU Util. | Generalization |
|---|---|---|---|---|
| B=1 (SGD) | Very High | N | Poor | Often Best |
| B=32 | Moderate | N/32 | Good | Good |
| B=256 | Low | N/256 | Excellent | May Degrade |
| B=8192+ | Very Low | N/8192 | Excellent | Requires Care |

Large Batch Training Challenges

Training with very large batches (1000+) presents unique challenges:

  1. Generalization gap: Models trained with large batches often generalize worse
  2. Sharp minima: Large batches tend to find sharper minima
  3. Learning rate sensitivity: Requires careful tuning and warmup
  4. Communication overhead: In distributed training, gradient synchronization becomes a bottleneck

Solutions include LARS (Layer-wise Adaptive Rate Scaling), LAMB optimizer, and gradient checkpointing.


Implementation in PyTorch

Let's implement all three gradient descent variants in PyTorch. We'll use a simple quadratic loss for clarity.

Implementing Gradient Descent Variants

🐍 gradient_descent_variants.py

Synthetic Dataset: We create a simple linear regression problem where the true relationship is y = Xw + noise. This lets us visualize convergence clearly.

Simple Linear Model: A single nn.Linear layer is equivalent to linear regression. The optimization landscape is convex, making it ideal for comparing GD variants.

Unified Training Function: This function handles all three variants. The key difference is how we sample data for each gradient computation.

Batch Gradient Descent: Uses ALL data (X, y) for each gradient update. One epoch = one gradient step. Smoothest convergence but expensive per step. (Example: 1000 samples → 1 gradient computation per epoch.)

True Stochastic GD: Processes one sample at a time: N gradient updates per epoch, but each is noisy. Note the lower learning rate (0.01 vs 0.1) needed for stability. (Example: 1000 samples → 1000 gradient updates per epoch.)

Mini-batch with DataLoader: PyTorch's DataLoader handles batching and shuffling automatically. This is the standard approach in practice.

Different Learning Rates: Notice how each variant uses a different learning rate: batch GD can use a larger LR (0.1), SGD needs a smaller one (0.01), and mini-batch is in between (0.05).
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Create synthetic dataset
torch.manual_seed(42)
N = 1000  # Dataset size
X = torch.randn(N, 10)  # 10 features
true_weights = torch.randn(10, 1)
y = X @ true_weights + 0.1 * torch.randn(N, 1)  # Add noise

# Simple linear model
class LinearModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, x):
        return self.linear(x)

# Training function for different variants
def train_variant(variant, batch_size, lr, epochs=50):
    """
    variant: 'batch', 'sgd', 'minibatch'
    """
    model = LinearModel(10)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    losses = []

    if variant == 'batch':
        # Full batch: use entire dataset each iteration
        for epoch in range(epochs):
            optimizer.zero_grad()
            pred = model(X)
            loss = criterion(pred, y)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())

    elif variant == 'sgd':
        # True SGD: one sample at a time
        for epoch in range(epochs):
            indices = torch.randperm(N)
            epoch_loss = 0
            for i in indices:
                optimizer.zero_grad()
                pred = model(X[i:i+1])
                loss = criterion(pred, y[i:i+1])
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()
            losses.append(epoch_loss / N)

    else:  # minibatch
        dataset = TensorDataset(X, y)
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        for epoch in range(epochs):
            epoch_loss = 0
            for batch_X, batch_y in loader:
                optimizer.zero_grad()
                pred = model(batch_X)
                loss = criterion(pred, batch_y)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item() * len(batch_X)
            losses.append(epoch_loss / N)

    return losses, model

# Run all three variants
print("Training Batch GD...")
batch_losses, _ = train_variant('batch', N, lr=0.1)

print("Training SGD...")
sgd_losses, _ = train_variant('sgd', 1, lr=0.01)  # Lower LR for stability

print("Training Mini-batch (B=32)...")
minibatch_losses, _ = train_variant('minibatch', 32, lr=0.05)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(batch_losses, label='Batch GD', linewidth=2)
plt.plot(sgd_losses, label='SGD (B=1)', linewidth=2, alpha=0.8)
plt.plot(minibatch_losses, label='Mini-batch (B=32)', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Convergence Comparison')
plt.yscale('log')
plt.show()
```

PyTorch's Built-in Approach

In practice, you almost always use mini-batch gradient descent through PyTorch's DataLoader:

Standard PyTorch Training Loop

🐍 standard_training.py

Train Epoch Function: Encapsulating one epoch of training in a function is a common pattern. It processes all batches in the DataLoader.

Iterate Over Batches: The DataLoader yields mini-batches. Each iteration of this loop processes batch_size examples.

Zero Gradients First: CRITICAL: Always call zero_grad() before computing new gradients with backward(). PyTorch accumulates gradients by default, so we must clear them.

Accumulate Loss: We track total loss weighted by batch size for an accurate epoch loss. The last batch might be smaller.

DataLoader Configuration: batch_size=64 is a common default; shuffle=True ensures random sampling; num_workers enables parallel data loading.

pin_memory for GPU: pin_memory=True enables faster CPU-to-GPU transfer. Use it when training on a GPU.
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Assumes model, optimizer, criterion, X_train, y_train, num_epochs are defined elsewhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Standard PyTorch training loop (mini-batch)
def train_epoch(model, dataloader, optimizer, criterion):
    model.train()
    total_loss = 0

    for batch_X, batch_y in dataloader:
        # Move to device if using GPU
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)

        # Forward pass
        optimizer.zero_grad()
        predictions = model(batch_X)
        loss = criterion(predictions, batch_y)

        # Backward pass
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * len(batch_X)

    return total_loss / len(dataloader.dataset)

# Usage example
dataset = TensorDataset(X_train, y_train)
dataloader = DataLoader(
    dataset,
    batch_size=64,        # Mini-batch size
    shuffle=True,         # Shuffle each epoch
    num_workers=4,        # Parallel data loading
    pin_memory=True       # Faster GPU transfer
)

# Training loop
for epoch in range(num_epochs):
    train_loss = train_epoch(model, dataloader, optimizer, criterion)
    print(f"Epoch {epoch+1}: Loss = {train_loss:.4f}")
```

Practical Guidelines

Choosing Batch Size

  1. Start with 32 or 64: These are safe defaults that work well for most problems
  2. Increase if training is too slow: Larger batches = fewer iterations per epoch
  3. Decrease if out of memory: Smaller batches use less GPU memory
  4. Use largest batch that fits in memory: Up to a point (generalization may degrade)

Choosing Learning Rate

  1. Start with 0.001 or 0.01: Common starting points for most optimizers
  2. Use learning rate finder: Train for a few iterations with exponentially increasing LR, find where loss starts increasing
  3. Apply linear scaling: When doubling batch size, double learning rate
  4. Use warmup for large batches: Start low, ramp up over first few epochs
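The learning rate finder from step 2 can be sketched in a few lines. The synthetic data, sweep range, and growth factor below are illustrative choices of ours; production implementations typically smooth the loss curve before picking a value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 10)
y = X @ torch.randn(10, 1)

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

# Sweep the learning rate exponentially upward, recording the loss at each step
lrs, losses = [], []
lr = 1e-5
for step in range(100):
    for group in optimizer.param_groups:
        group["lr"] = lr
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())
    lr *= 1.15  # exponential increase; eventually the loss blows up

# Heuristic: pick an LR somewhat below where the loss bottoms out and starts rising
best = min(range(len(losses)), key=lambda i: losses[i])
print(f"loss was lowest near lr = {lrs[best]:.2e}")
```

Plotting `losses` against `lrs` on a log axis gives the familiar U-shaped curve: flat at tiny learning rates, falling in the useful range, then exploding.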

Decision Flowchart

| Scenario | Recommended Approach |
|---|---|
| Small dataset (<10k) | Full batch or large mini-batch |
| Standard training | Mini-batch (32-128) |
| Memory limited | Smaller batch + gradient accumulation |
| Very large dataset | Mini-batch with distributed training |
| Need best generalization | Smaller batches (16-32) |
| Need fastest training | Largest batch that fits + linear scaling |

Related Topics

  • Chapter 9 Section 2: Optimizers - PyTorch optimizer implementations (SGD, Adam, AdamW) with momentum and adaptive learning rates
  • Chapter 9 Section 3: Learning Rate Strategies - Learning rate schedules, warmup, and the learning rate finder

Summary

In this section, we explored the three main variants of gradient descent that power modern deep learning:

| Variant | Batch Size | Updates/Epoch | Key Property |
|---|---|---|---|
| Batch GD | N (full) | 1 | Smooth, deterministic |
| SGD | 1 | N | Noisy, regularizing |
| Mini-batch | B (e.g., 32) | N/B | Balanced, GPU-efficient |

Key Takeaways

  1. Mini-batch is the standard because it balances noise, efficiency, and GPU utilization
  2. Gradient variance scales as 1/B, so larger batches give more accurate gradient estimates
  3. Noise is a feature, not just a bug: It provides implicit regularization and helps escape local minima
  4. Learning rate and batch size are coupled: Use the linear scaling rule when changing batch size
  5. There's no universally best choice: The optimal variant depends on your dataset, hardware, and goals

Looking Ahead

In the next section, we'll explore computational graphs—the data structure that makes automatic differentiation possible. Understanding computational graphs is essential for understanding how backpropagation actually works in PyTorch.


Exercises

Conceptual Questions

  1. Explain why the gradient estimate from a mini-batch is an unbiased estimator of the true gradient.
  2. If you double the batch size from 32 to 64, by what factor does the gradient standard deviation decrease?
  3. Why does SGD often find solutions that generalize better than batch GD, despite using noisier gradients?
  4. Derive the linear scaling rule for learning rate when batch size changes.

Coding Exercises

  1. Gradient Variance Experiment: For a simple linear regression problem, empirically measure the gradient variance for batch sizes 1, 10, 100, and 1000. Plot the variance vs. batch size and verify the $1/B$ scaling.
  2. Learning Rate Sensitivity: Train the same model with learning rates 0.001, 0.01, 0.1, and 1.0. For each, record whether training converges, diverges, or oscillates.
  3. Gradient Accumulation: Implement gradient accumulation to simulate a batch size of 256 while only fitting 32 samples in memory at a time.

Gradient Accumulation Hint

Instead of calling optimizer.step() after every batch, accumulate gradients over multiple batches before stepping. Remember to divide the loss by the accumulation factor.
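Following the hint, a skeleton of the accumulation pattern might look like the sketch below (the synthetic data and hyperparameters are illustrative; the exercise asks you to flesh this out for B = 256 with micro-batches of 32):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(512, 10)
y = X @ torch.randn(10, 1)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.MSELoss()

accum_steps = 8  # 8 micro-batches of 32 simulate one effective batch of 256
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

init_loss = criterion(model(X), y).item()

optimizer.zero_grad()
for step, (bx, by) in enumerate(loader):
    loss = criterion(model(bx), by) / accum_steps  # divide so summed grads average correctly
    loss.backward()                                # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one update per accum_steps micro-batches
        optimizer.zero_grad()

final_loss = criterion(model(X), y).item()
print(init_loss, final_loss)
```

Only 32 samples are ever in the forward pass at once, yet each `optimizer.step()` applies the averaged gradient of 256 samples.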

Challenge Exercise

Large Batch Training with LARS: Implement the LARS (Layer-wise Adaptive Rate Scaling) algorithm, which enables training with very large batches by adapting the learning rate per layer based on the ratio of weight norm to gradient norm.

$$\eta^{(l)} = \eta_{\text{global}} \cdot \frac{\|w^{(l)}\|}{\|\nabla w^{(l)}\| + \lambda \|w^{(l)}\|}$$

Difficulty Level

This is an advanced exercise. You'll need to create a custom optimizer that computes layer-wise learning rates at each step.
