Chapter 8: Backpropagation from Scratch

Gradient Descent Variants
Learning Objectives

By the end of this section, you will be able to:

  1. Understand the three main gradient descent variants: batch, stochastic, and mini-batch gradient descent
  2. Analyze the trade-offs between noise, computational efficiency, and convergence speed
  3. Derive the gradient variance scaling with batch size mathematically
  4. Select appropriate hyperparameters (learning rate, batch size) for different scenarios
  5. Implement all three variants in PyTorch and understand when to use each
  6. Apply the linear scaling rule for large-batch training
Why This Matters: The choice between gradient descent variants is one of the most impactful decisions in training neural networks. It affects not just training speed, but also generalization, memory usage, and ultimately model quality. Understanding these trade-offs is essential for any practitioner.

The Optimization Challenge

In the previous section, we established that training a neural network means minimizing a loss function $\mathcal{L}(\theta)$ over the parameter space. The fundamental approach is gradient descent:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta \mathcal{L}(\theta^{(t)})$$

But here's the catch: computing the full gradient requires evaluating the loss on every training example:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x^{(i)}), y^{(i)})$$

For modern datasets with millions or billions of examples, computing this sum at every step is computationally prohibitive. This motivates the development of stochastic optimization methods that use subsets of the data.

The Core Trade-off

All gradient descent variants navigate a fundamental trade-off:

| Aspect | More Data Per Update | Less Data Per Update |
|---|---|---|
| Gradient Accuracy | More accurate (lower variance) | Noisier (higher variance) |
| Computational Cost | Expensive per update | Cheap per update |
| Memory Usage | Higher | Lower |
| Updates Per Second | Fewer | Many |
| Generalization | May overfit | Regularization effect |

Batch Gradient Descent

Batch gradient descent (also called "full-batch" or "vanilla" gradient descent) uses the entire training set to compute each gradient update:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell(f_\theta(x^{(i)}), y^{(i)})$$

The Algorithm

📝 batch_gd.txt

```
Batch Gradient Descent
======================
Input: Dataset D = {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᴺ⁾, y⁽ᴺ⁾)}
       Learning rate η
       Number of epochs T

1. Initialize parameters θ randomly
2. For epoch = 1 to T:
     a. Compute gradient on ENTIRE dataset:
        g = (1/N) Σᵢ ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾)
     b. Update parameters:
        θ = θ - η · g
3. Return θ
```

Advantages

  • Deterministic: Same starting point always produces the same trajectory
  • Smooth convergence: No noise means clean descent toward minimum
  • Theoretical guarantees: Convergence proofs are straightforward for convex functions
  • Stable learning rate: Can use larger learning rates without divergence

Disadvantages

  • Computational cost: Each update requires passing through all N examples
  • Memory intensive: Must store gradients for all examples (or accumulate)
  • Slow for large datasets: One epoch = one gradient update
  • May get stuck: Deterministic path can get trapped in sharp local minima

When to Use Batch GD

Batch gradient descent is suitable when: (1) the dataset fits in memory, (2) the dataset is small (thousands of examples), or (3) you need very precise gradient estimates (e.g., for second-order methods).

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) takes the opposite extreme: it uses only one training example per update:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \cdot \nabla_\theta \ell(f_\theta(x^{(i)}), y^{(i)})$$

where $(x^{(i)}, y^{(i)})$ is a randomly sampled training example.

The Key Insight: Unbiased Estimator

The single-sample gradient is an unbiased estimator of the true gradient:

$$\mathbb{E}_{i \sim \text{Uniform}(1..N)} \left[ \nabla_\theta \ell(f_\theta(x^{(i)}), y^{(i)}) \right] = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell(f_\theta(x^{(i)}), y^{(i)})$$

This means that on average, we're moving in the right direction. The noise from individual samples cancels out over many updates.
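This unbiasedness is easy to check numerically. The sketch below uses a toy linear-regression setup of our own (not the chapter's running example): it averages the per-sample gradients over every index and compares the result to the full-batch gradient.

```python
import torch

torch.manual_seed(0)
N = 500
X = torch.randn(N, 3)
w_true = torch.tensor([[1.0], [-2.0], [0.5]])
y = X @ w_true

w = torch.zeros(3, 1)  # current parameters

# Full gradient of the MSE loss (1/N) Σᵢ (xᵢᵀw - yᵢ)² with respect to w
def full_gradient(w):
    residual = X @ w - y            # shape (N, 1)
    return 2 * X.T @ residual / N   # shape (3, 1)

# Gradient of the loss on a single example i
def single_gradient(w, i):
    residual = X[i:i+1] @ w - y[i:i+1]
    return 2 * X[i:i+1].T @ residual

# Averaging the per-sample gradient over all i recovers the full gradient
avg_single = torch.stack([single_gradient(w, i) for i in range(N)]).mean(dim=0)
print(torch.allclose(avg_single, full_gradient(w), atol=1e-4))  # True
```

Under uniform sampling, the expectation over $i$ is exactly this average, which is why the single-sample gradient is unbiased.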

The Algorithm

📝 sgd.txt

```
Stochastic Gradient Descent
============================
Input: Dataset D = {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᴺ⁾, y⁽ᴺ⁾)}
       Learning rate η
       Number of epochs T

1. Initialize parameters θ randomly
2. For epoch = 1 to T:
     Shuffle the dataset
     For each example (x⁽ⁱ⁾, y⁽ⁱ⁾):
       a. Compute gradient on single example:
          g = ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾)
       b. Update parameters:
          θ = θ - η · g
3. Return θ
```

Advantages

  • Very fast iterations: Each update only processes one example
  • Low memory: Only need to store one example at a time
  • Online learning: Can learn from streaming data
  • Escapes local minima: Noise helps explore the loss landscape
  • Implicit regularization: Noise acts as a regularizer, often improving generalization

Disadvantages

  • High variance: Updates are noisy, causing erratic convergence
  • No vectorization: Cannot leverage GPU parallelism effectively
  • Requires careful tuning: Learning rate must be smaller to handle noise
  • Never truly converges: Oscillates around the minimum due to noise

The Noise Paradox

The noise in SGD is both a bug and a feature. While it prevents perfect convergence, research shows that this noise helps SGD find flatter minima that generalize better. This is one reason why SGD often outperforms more sophisticated optimizers on generalization.

Mini-batch Gradient Descent

Mini-batch gradient descent strikes a balance between the extremes, using a small batch of $B$ examples per update:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \cdot \frac{1}{B} \sum_{j=1}^{B} \nabla_\theta \ell(f_\theta(x^{(i_j)}), y^{(i_j)})$$

where $\{i_1, ..., i_B\}$ is a randomly sampled subset of indices.

Why Mini-batch is the Standard

Mini-batch gradient descent has become the de facto standard for training neural networks because it optimally balances:

  1. Reduced variance: Averaging over $B$ samples reduces gradient variance by a factor of $B$ (standard deviation by $\sqrt{B}$)
  2. GPU efficiency: Modern GPUs are optimized for batch matrix operations
  3. Regularization: Still retains some noise for implicit regularization
  4. Memory efficiency: Batch size can be tuned to available memory

Variance Reduction Analysis

Let $g_i = \nabla_\theta \ell(f_\theta(x^{(i)}), y^{(i)})$ be the gradient for example $i$, with variance $\text{Var}(g_i) = \sigma^2$. The mini-batch gradient is:

$$\hat{g}_B = \frac{1}{B} \sum_{j=1}^{B} g_{i_j}$$

By properties of variance (for independent samples):

$$\text{Var}(\hat{g}_B) = \text{Var}\left(\frac{1}{B} \sum_{j=1}^{B} g_{i_j}\right) = \frac{1}{B^2} \cdot B \cdot \sigma^2 = \frac{\sigma^2}{B}$$

Therefore, the standard deviation of the gradient estimate scales as:

$$\text{Std}(\hat{g}_B) = \frac{\sigma}{\sqrt{B}}$$
Key Result: Doubling the batch size reduces gradient noise by a factor of $\sqrt{2} \approx 1.41$, not by a factor of 2. This has important implications for computational efficiency.
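A quick way to see this scaling is to average iid noise, which stands in here for per-example gradient noise (a simplified stand-in, not a real training run; the batch sizes and trial count are illustrative):

```python
import torch

torch.manual_seed(0)
sigma = 1.0
trials = 20000

# The std of the mean of B iid samples should be sigma / sqrt(B)
for B in [1, 4, 16, 64]:
    means = torch.randn(trials, B).mean(dim=1)  # each row: one "mini-batch" average
    print(f"B={B:3d}  measured std={means.std():.3f}  predicted={sigma / B**0.5:.3f}")
```

The measured and predicted columns should agree closely, confirming the $1/\sqrt{B}$ behavior.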

Common Batch Sizes

| Batch Size | Use Case | Typical Hardware |
|---|---|---|
| 1 | Pure SGD, streaming data | CPU |
| 16-32 | Small datasets, limited memory | Single GPU |
| 64-128 | Standard training | Single GPU |
| 256-512 | Large-scale training | Multi-GPU |
| 1024+ | Distributed training | GPU cluster |

Powers of 2

Batch sizes are typically powers of 2 (32, 64, 128, 256) because they align well with GPU memory architectures and matrix operation optimizations.

Interactive: Compare Variants

The visualization below shows all three gradient descent variants navigating the same loss landscape. Watch how batch GD follows a smooth path, while SGD is noisy, and mini-batch balances the two.

Gradient Descent Variants: Loss Landscape Navigation

[Interactive visualization: each variant's trajectory on the landscape, with per-iteration readouts of current loss and distance to the optimum. Batch GD uses all data: smooth but slow per iteration.]

Quick Check

Looking at the visualization, which variant reaches the minimum fastest in terms of wall-clock time?


Convergence Analysis

Let's analyze how these variants converge over training epochs:

Convergence Comparison

Watch how different gradient descent variants converge over training epochs. Notice the trade-off between noise and convergence speed.

[Interactive chart: training loss (0 to 3) vs. epoch (0 to 100) for Batch GD, Stochastic GD, and Mini-batch GD, with per-epoch loss readouts for each method. Early in training, all methods show a rapid initial decrease.]

Convergence Rates

For a strongly convex function with condition number κ\kappa:

| Method | Convergence Rate | Notes |
|---|---|---|
| Batch GD | O(κ log(1/ε)) | Linear convergence, deterministic |
| SGD | O(σ²/ε) | Sublinear, limited by noise floor |
| Mini-batch | O(σ²/(Bε)) | Improves with batch size B |

However, neural network loss functions are non-convex, so these theoretical rates don't directly apply. In practice, all three methods can find good solutions, but with different characteristics.

The Generalization Advantage of Noise

Empirically, noisier methods (SGD, small-batch) often find solutions that generalize better. Several theories explain this:

  1. Flat minima hypothesis: Noise helps escape sharp minima that overfit, finding flatter minima that generalize
  2. Implicit regularization: The noise term effectively adds an L2 regularization-like penalty
  3. Exploration: Random fluctuations explore more of the loss landscape

Learning Rate Dynamics

The learning rate η\eta is perhaps the most critical hyperparameter in gradient descent. Too small and training is slow; too large and training diverges.

Learning Rate: The Critical Hyperparameter

Adjust the learning rate and observe how it affects the optimization trajectory on a simple quadratic loss function.

[Interactive slider: learning rate from 0.01 (too small) to 1.5 (diverging). In the optimal range, the algorithm converges smoothly and efficiently to the minimum.]

Mathematical Insight

For a quadratic loss $L(w) = (w - w^*)^2$, the update rule is:

$$w_{n+1} = w_n - \eta \cdot 2(w_n - w^*)$$

Convergence requires $|1 - 2\eta| < 1$, meaning $0 < \eta < 1$.
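A few lines of code make this condition concrete. The learning-rate values below are our own illustrative choices: one inside the convergent range, one outside it.

```python
# Iterate w <- w - eta * 2 (w - w_star) for the quadratic loss L(w) = (w - w_star)^2
def run(eta, w0=-4.0, w_star=2.0, steps=50):
    w = w0
    for _ in range(steps):
        w = w - eta * 2 * (w - w_star)
    return w

print(run(0.3))  # eta in (0, 1): converges to w* = 2
print(run(1.2))  # |1 - 2*eta| > 1: the iterates blow up
```

Each step multiplies the error $(w_n - w^*)$ by $(1 - 2\eta)$, so the iterates contract toward $w^*$ exactly when that factor has magnitude below 1.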

Loss Landscape & Trajectory

[Plot: quadratic loss vs. weight $w$, with the minimum at $w^* = 2$ and the optimization trajectory overlaid.]

Learning Rate and Batch Size Coupling

When you change the batch size, you often need to adjust the learning rate. The linear scaling rule provides guidance:

$$\eta_{\text{new}} = \eta_{\text{base}} \cdot \frac{B_{\text{new}}}{B_{\text{base}}}$$

This rule ensures that the expected weight update magnitude stays constant when batch size changes.

Derivation of Linear Scaling

Consider the expected gradient step after $k$ mini-batch updates with batch size $B$:

$$\mathbb{E}\left[\sum_{t=1}^{k} \eta \cdot \hat{g}_t^{(B)}\right] = k \cdot \eta \cdot \bar{g}$$

To match the same expected update with batch size $B' = k \cdot B$ in one update:

$$\eta' \cdot \bar{g} = k \cdot \eta \cdot \bar{g} \quad \Rightarrow \quad \eta' = k \cdot \eta = \frac{B'}{B} \cdot \eta$$

Learning Rate Warmup

When using large batch sizes with linear scaling, the learning rate can be quite large at the start. This can cause instability. The solution is learning rate warmup: start with a small learning rate and gradually increase to the target over several epochs.
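One way to sketch linear scaling plus warmup in PyTorch is with `torch.optim.lr_scheduler.LambdaLR`. The base learning rate, batch sizes, and warmup length below are hypothetical placeholder values, not recommendations:

```python
import torch
import torch.nn as nn

# Hypothetical values: base LR tuned at batch size 256, scaled for batch size 1024
base_lr, base_batch, new_batch = 0.1, 256, 1024
target_lr = base_lr * new_batch / base_batch  # linear scaling rule -> 0.4
warmup_steps = 500

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=target_lr)

# Ramp the LR linearly from near 0 to target_lr over warmup_steps, then hold it
def warmup_factor(step):
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for step in range(3):
    optimizer.step()   # gradient computation omitted in this sketch
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # still early in the ramp, far below 0.4
```

After warmup completes, a decay schedule (cosine, step, etc.) typically takes over; the warmup factor here simply clamps at 1.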

Batch Size Trade-offs

Choosing the right batch size involves balancing multiple factors. Use the explorer below to understand these trade-offs:

Batch Size Trade-offs Explorer

Explore how batch size affects various aspects of training. Each choice involves trade-offs between speed, memory, and generalization.

[Interactive explorer: batch sizes from 1 to 1024. At a medium batch size (16-64): balanced noise and stability, the most common choice for training, with good GPU utilization on most hardware. Example state: a dataset of 50,000 samples with B = 32 gives 1,563 iterations per epoch.]

Gradient Noise Scaling

The variance of the stochastic gradient estimate decreases with batch size:

$$\text{Var}(\hat{g}_B) = \frac{\sigma^2}{B}$$

Standard deviation scales as $1/\sqrt{B}$, so doubling the batch size reduces noise by about 30%.


The Batch Size Spectrum

| Batch Size | Gradient Noise | Updates/Epoch | GPU Util. | Generalization |
|---|---|---|---|---|
| B=1 (SGD) | Very High | N | Poor | Often Best |
| B=32 | Moderate | N/32 | Good | Good |
| B=256 | Low | N/256 | Excellent | May Degrade |
| B=8192+ | Very Low | N/8192 | Excellent | Requires Care |

Large Batch Training Challenges

Training with very large batches (1000+) presents unique challenges:

  1. Generalization gap: Models trained with large batches often generalize worse
  2. Sharp minima: Large batches tend to find sharper minima
  3. Learning rate sensitivity: Requires careful tuning and warmup
  4. Communication overhead: In distributed training, gradient synchronization becomes a bottleneck

Solutions include LARS (Layer-wise Adaptive Rate Scaling), LAMB optimizer, and gradient checkpointing.


Implementation in PyTorch

Let's implement all three gradient descent variants in PyTorch. We'll use a simple quadratic loss for clarity.

Implementing Gradient Descent Variants

🐍 gradient_descent_variants.py

Synthetic Dataset: We create a simple linear regression problem where the true relationship is y = Xw + noise. This lets us visualize convergence clearly.

Simple Linear Model: A single nn.Linear layer is equivalent to linear regression. The optimization landscape is convex, making it ideal for comparing GD variants.

Unified Training Function: This function handles all three variants. The key difference is how we sample data for each gradient computation.

Batch Gradient Descent: Uses ALL data (X, y) for each gradient update. One epoch = one gradient step. Smoothest convergence but expensive per step. (Example: 1000 samples → 1 gradient computation per epoch.)

True Stochastic GD: Processes one sample at a time: N gradient updates per epoch, but each is noisy. Note the lower learning rate (0.01 vs 0.1) needed for stability. (Example: 1000 samples → 1000 gradient updates per epoch.)

Mini-batch with DataLoader: PyTorch's DataLoader handles batching and shuffling automatically. This is the standard approach in practice.

Different Learning Rates: Notice how each variant uses a different learning rate: batch GD can use a larger LR (0.1), SGD needs a smaller one (0.01), and mini-batch is in between (0.05).
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Create synthetic dataset
torch.manual_seed(42)
N = 1000  # Dataset size
X = torch.randn(N, 10)  # 10 features
true_weights = torch.randn(10, 1)
y = X @ true_weights + 0.1 * torch.randn(N, 1)  # Add noise

# Simple linear model
class LinearModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, x):
        return self.linear(x)

# Training function for different variants
def train_variant(variant, batch_size, lr, epochs=50):
    """
    variant: 'batch', 'sgd', 'minibatch'
    """
    model = LinearModel(10)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    losses = []

    if variant == 'batch':
        # Full batch: use entire dataset each iteration
        for epoch in range(epochs):
            optimizer.zero_grad()
            pred = model(X)
            loss = criterion(pred, y)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())

    elif variant == 'sgd':
        # True SGD: one sample at a time
        for epoch in range(epochs):
            indices = torch.randperm(N)
            epoch_loss = 0
            for i in indices:
                optimizer.zero_grad()
                pred = model(X[i:i+1])
                loss = criterion(pred, y[i:i+1])
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()
            losses.append(epoch_loss / N)

    else:  # minibatch
        dataset = TensorDataset(X, y)
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        for epoch in range(epochs):
            epoch_loss = 0
            for batch_X, batch_y in loader:
                optimizer.zero_grad()
                pred = model(batch_X)
                loss = criterion(pred, batch_y)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item() * len(batch_X)
            losses.append(epoch_loss / N)

    return losses, model

# Run all three variants
print("Training Batch GD...")
batch_losses, _ = train_variant('batch', N, lr=0.1)

print("Training SGD...")
sgd_losses, _ = train_variant('sgd', 1, lr=0.01)  # Lower LR for stability

print("Training Mini-batch (B=32)...")
minibatch_losses, _ = train_variant('minibatch', 32, lr=0.05)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(batch_losses, label='Batch GD', linewidth=2)
plt.plot(sgd_losses, label='SGD (B=1)', linewidth=2, alpha=0.8)
plt.plot(minibatch_losses, label='Mini-batch (B=32)', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Convergence Comparison')
plt.yscale('log')
plt.show()
```

PyTorch's Built-in Approach

In practice, you almost always use mini-batch gradient descent through PyTorch's DataLoader:

Standard PyTorch Training Loop

🐍 standard_training.py

Train Epoch Function: Encapsulating one epoch of training in a function is a common pattern. It processes all batches in the DataLoader.

Iterate Over Batches: The DataLoader yields mini-batches. Each iteration of this loop processes batch_size examples.

Zero Gradients First: CRITICAL: Always call zero_grad() before computing new gradients with backward(). PyTorch accumulates gradients by default, so we must clear them.

Accumulate Loss: We track total loss weighted by batch size for an accurate epoch loss. The last batch might be smaller.

DataLoader Configuration: batch_size=64 is a common default; shuffle=True ensures random sampling; num_workers enables parallel data loading.

pin_memory for GPU: pin_memory=True enables faster CPU-to-GPU transfer. Use it when training on a GPU.
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Assumes model, optimizer, criterion, X_train, y_train, num_epochs are defined elsewhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Standard PyTorch training loop (mini-batch)
def train_epoch(model, dataloader, optimizer, criterion):
    model.train()
    total_loss = 0

    for batch_X, batch_y in dataloader:
        # Move to device if using GPU
        batch_X = batch_X.to(device)
        batch_y = batch_y.to(device)

        # Forward pass
        optimizer.zero_grad()
        predictions = model(batch_X)
        loss = criterion(predictions, batch_y)

        # Backward pass
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * len(batch_X)

    return total_loss / len(dataloader.dataset)

# Usage example
dataset = TensorDataset(X_train, y_train)
dataloader = DataLoader(
    dataset,
    batch_size=64,        # Mini-batch size
    shuffle=True,         # Shuffle each epoch
    num_workers=4,        # Parallel data loading
    pin_memory=True       # Faster GPU transfer
)

# Training loop
for epoch in range(num_epochs):
    train_loss = train_epoch(model, dataloader, optimizer, criterion)
    print(f"Epoch {epoch+1}: Loss = {train_loss:.4f}")
```

Practical Guidelines

Choosing Batch Size

  1. Start with 32 or 64: These are safe defaults that work well for most problems
  2. Increase if training is too slow: Larger batches = fewer iterations per epoch
  3. Decrease if out of memory: Smaller batches use less GPU memory
  4. Use largest batch that fits in memory: Up to a point (generalization may degrade)

Choosing Learning Rate

  1. Start with 0.001 or 0.01: Common starting points for most optimizers
  2. Use learning rate finder: Train for a few iterations with exponentially increasing LR, find where loss starts increasing
  3. Apply linear scaling: When doubling batch size, double learning rate
  4. Use warmup for large batches: Start low, ramp up over first few epochs
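The learning rate finder from step 2 can be sketched in a few lines. The synthetic data, sweep range, and growth factor below are illustrative choices of ours; production implementations typically smooth the loss curve before picking a value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 10)
y = X @ torch.randn(10, 1)

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

# Sweep the learning rate exponentially upward, recording the loss at each step
lrs, losses = [], []
lr = 1e-5
for step in range(100):
    for group in optimizer.param_groups:
        group["lr"] = lr
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())
    lr *= 1.15  # exponential increase; eventually the loss blows up

# Heuristic: pick an LR somewhat below where the loss bottoms out and starts rising
best = min(range(len(losses)), key=lambda i: losses[i])
print(f"loss was lowest near lr = {lrs[best]:.2e}")
```

Plotting `losses` against `lrs` on a log axis gives the familiar U-shaped curve: flat at tiny learning rates, falling in the useful range, then exploding.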

Decision Flowchart

| Scenario | Recommended Approach |
|---|---|
| Small dataset (<10k) | Full batch or large mini-batch |
| Standard training | Mini-batch (32-128) |
| Memory limited | Smaller batch + gradient accumulation |
| Very large dataset | Mini-batch with distributed training |
| Need best generalization | Smaller batches (16-32) |
| Need fastest training | Largest batch that fits + linear scaling |

Related Topics

  • Chapter 9 Section 2: Optimizers - PyTorch optimizer implementations (SGD, Adam, AdamW) with momentum and adaptive learning rates
  • Chapter 9 Section 3: Learning Rate Strategies - Learning rate schedules, warmup, and the learning rate finder

Summary

In this section, we explored the three main variants of gradient descent that power modern deep learning:

| Variant | Batch Size | Updates/Epoch | Key Property |
|---|---|---|---|
| Batch GD | N (full) | 1 | Smooth, deterministic |
| SGD | 1 | N | Noisy, regularizing |
| Mini-batch | B (e.g., 32) | N/B | Balanced, GPU-efficient |

Key Takeaways

  1. Mini-batch is the standard because it balances noise, efficiency, and GPU utilization
  2. Gradient variance scales as 1/B, so larger batches give more accurate gradient estimates
  3. Noise is a feature, not just a bug: It provides implicit regularization and helps escape local minima
  4. Learning rate and batch size are coupled: Use the linear scaling rule when changing batch size
  5. There's no universally best choice: The optimal variant depends on your dataset, hardware, and goals

Looking Ahead

In the next section, we'll explore computational graphs—the data structure that makes automatic differentiation possible. Understanding computational graphs is essential for understanding how backpropagation actually works in PyTorch.


Exercises

Conceptual Questions

  1. Explain why the gradient estimate from a mini-batch is an unbiased estimator of the true gradient.
  2. If you double the batch size from 32 to 64, by what factor does the gradient standard deviation decrease?
  3. Why does SGD often find solutions that generalize better than batch GD, despite using noisier gradients?
  4. Derive the linear scaling rule for learning rate when batch size changes.

Coding Exercises

  1. Gradient Variance Experiment: For a simple linear regression problem, empirically measure the gradient variance for batch sizes 1, 10, 100, and 1000. Plot the variance vs. batch size and verify the $1/B$ scaling.
  2. Learning Rate Sensitivity: Train the same model with learning rates 0.001, 0.01, 0.1, and 1.0. For each, record whether training converges, diverges, or oscillates.
  3. Gradient Accumulation: Implement gradient accumulation to simulate a batch size of 256 while only fitting 32 samples in memory at a time.

Gradient Accumulation Hint

Instead of calling optimizer.step() after every batch, accumulate gradients over multiple batches before stepping. Remember to divide the loss by the accumulation factor.
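Following the hint, a skeleton of the accumulation pattern might look like the sketch below (the synthetic data and hyperparameters are illustrative; the exercise asks you to flesh this out for B = 256 with micro-batches of 32):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(512, 10)
y = X @ torch.randn(10, 1)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.MSELoss()

accum_steps = 8  # 8 micro-batches of 32 simulate one effective batch of 256
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

init_loss = criterion(model(X), y).item()

optimizer.zero_grad()
for step, (bx, by) in enumerate(loader):
    loss = criterion(model(bx), by) / accum_steps  # divide so summed grads average correctly
    loss.backward()                                # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one update per accum_steps micro-batches
        optimizer.zero_grad()

final_loss = criterion(model(X), y).item()
print(init_loss, final_loss)
```

Only 32 samples are ever in the forward pass at once, yet each `optimizer.step()` applies the averaged gradient of 256 samples.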

Challenge Exercise

Large Batch Training with LARS: Implement the LARS (Layer-wise Adaptive Rate Scaling) algorithm, which enables training with very large batches by adapting the learning rate per layer based on the ratio of weight norm to gradient norm.

$$\eta^{(l)} = \eta_{\text{global}} \cdot \frac{\|w^{(l)}\|}{\|\nabla w^{(l)}\| + \lambda \|w^{(l)}\|}$$

Difficulty Level

This is an advanced exercise. You'll need to create a custom optimizer that computes layer-wise learning rates at each step.
