Learning Objectives
By the end of this section, you will be able to:
- Understand the three main gradient descent variants: batch, stochastic, and mini-batch gradient descent
- Analyze the trade-offs between noise, computational efficiency, and convergence speed
- Derive the gradient variance scaling with batch size mathematically
- Select appropriate hyperparameters (learning rate, batch size) for different scenarios
- Implement all three variants in PyTorch and understand when to use each
- Apply the linear scaling rule for large-batch training
Why This Matters: The choice between gradient descent variants is one of the most impactful decisions in training neural networks. It affects not just training speed, but also generalization, memory usage, and ultimately model quality. Understanding these trade-offs is essential for any practitioner.
The Optimization Challenge
In the previous section, we established that training a neural network means minimizing a loss function over the parameter space. The fundamental approach is gradient descent:

θ ← θ - η · ∇θ L(θ)

But here's the catch: computing the full gradient requires evaluating the loss on every training example:

∇θ L(θ) = (1/N) Σᵢ₌₁ᴺ ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾)
For modern datasets with millions or billions of examples, computing this sum at every step is computationally prohibitive. This motivates the development of stochastic optimization methods that use subsets of the data.
The Core Trade-off
All gradient descent variants navigate a fundamental trade-off:
| Aspect | More Data Per Update | Less Data Per Update |
|---|---|---|
| Gradient Accuracy | More accurate (lower variance) | Noisier (higher variance) |
| Computational Cost | Expensive per update | Cheap per update |
| Memory Usage | Higher | Lower |
| Updates Per Second | Fewer | Many |
| Generalization | May overfit | Regularization effect |
Batch Gradient Descent
Batch gradient descent (also called "full-batch" or "vanilla" gradient descent) uses the entire training set to compute each gradient update:

θ ← θ - η · (1/N) Σᵢ₌₁ᴺ ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾)
The Algorithm
```
Batch Gradient Descent
======================
Input: Dataset D = {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᴺ⁾, y⁽ᴺ⁾)}
       Learning rate η
       Number of epochs T

1. Initialize parameters θ randomly
2. For epoch = 1 to T:
   a. Compute gradient on ENTIRE dataset:
      g = (1/N) Σᵢ ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾)
   b. Update parameters:
      θ = θ - η · g
3. Return θ
```

Advantages
- Deterministic: Same starting point always produces the same trajectory
- Smooth convergence: No noise means clean descent toward minimum
- Theoretical guarantees: Convergence proofs are straightforward for convex functions
- Stable learning rate: Can use larger learning rates without divergence
Disadvantages
- Computational cost: Each update requires passing through all N examples
- Memory intensive: Must store gradients for all examples (or accumulate)
- Slow for large datasets: One epoch = one gradient update
- May get stuck: Deterministic path can get trapped in sharp local minima
When to Use Batch GD
Batch GD is a reasonable choice for small datasets that fit comfortably in memory, for convex problems where its convergence guarantees apply, and whenever a deterministic, reproducible training trajectory matters more than speed.
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) takes the opposite extreme: it uses only one training example per update:

θ ← θ - η · ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾)

where (x⁽ⁱ⁾, y⁽ⁱ⁾) is a randomly sampled training example.
The Key Insight: Unbiased Estimator
The single-sample gradient is an unbiased estimator of the true gradient:

𝔼ᵢ[∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾)] = (1/N) Σᵢ₌₁ᴺ ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾) = ∇θ L(θ)
This means that on average, we're moving in the right direction. The noise from individual samples cancels out over many updates.
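This unbiasedness is easy to check numerically: the average of many randomly sampled single-example gradients converges to the full-batch gradient. A minimal sketch, using a hypothetical 1-D least-squares setup (the data, the fixed parameter value w, and all variable names are illustrative, not from the text):

```python
import torch

torch.manual_seed(0)
x = torch.randn(100)
y = 3.0 * x + torch.randn(100)
w = 1.0  # evaluate gradients at an arbitrary fixed parameter value

# Gradient of ℓᵢ(w) = (w·xᵢ - yᵢ)² for each example i
per_example = 2 * x * (w * x - y)

full_gradient = per_example.mean()  # the true (full-batch) gradient

# Draw many single-example gradients uniformly at random and average them
samples = per_example[torch.randint(0, 100, (100_000,))]
print(full_gradient.item(), samples.mean().item())  # nearly equal
```

Individual samples are very noisy, but their empirical mean lands on the full gradient, which is exactly what "unbiased" means.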
The Algorithm
```
Stochastic Gradient Descent
============================
Input: Dataset D = {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᴺ⁾, y⁽ᴺ⁾)}
       Learning rate η
       Number of epochs T

1. Initialize parameters θ randomly
2. For epoch = 1 to T:
   Shuffle the dataset
   For each example (x⁽ⁱ⁾, y⁽ⁱ⁾):
     a. Compute gradient on single example:
        g = ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾)
     b. Update parameters:
        θ = θ - η · g
3. Return θ
```

Advantages
- Very fast iterations: Each update only processes one example
- Low memory: Only need to store one example at a time
- Online learning: Can learn from streaming data
- Escapes local minima: Noise helps explore the loss landscape
- Implicit regularization: Noise acts as a regularizer, often improving generalization
Disadvantages
- High variance: Updates are noisy, causing erratic convergence
- No vectorization: Cannot leverage GPU parallelism effectively
- Requires careful tuning: Learning rate must be smaller to handle noise
- Never truly converges: Oscillates around the minimum due to noise
The Noise Paradox
Notice that noise appears on both lists: the same randomness that prevents SGD from settling exactly at a minimum also helps it escape poor local minima and often improves generalization. Noise is simultaneously SGD's main weakness and its main strength.
Mini-batch Gradient Descent
Mini-batch gradient descent strikes a balance between the extremes, using a small batch of examples per update:

θ ← θ - η · (1/B) Σᵢ∈𝓑 ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾)

where 𝓑 is a randomly sampled subset of indices with |𝓑| = B.
Why Mini-batch is the Standard
Mini-batch gradient descent has become the de facto standard for training neural networks because it optimally balances:
- Reduced variance: Averaging over B samples reduces gradient variance by a factor of B
- GPU efficiency: Modern GPUs are optimized for batch matrix operations
- Regularization: Still retains some noise for implicit regularization
- Memory efficiency: Batch size can be tuned to available memory
Variance Reduction Analysis
Let gᵢ = ∇θ ℓ(fθ(x⁽ⁱ⁾), y⁽ⁱ⁾) be the gradient for example i, with variance σ². The mini-batch gradient is:

ĝ = (1/B) Σᵢ∈𝓑 gᵢ

By properties of variance (for independent samples):

Var(ĝ) = (1/B²) Σᵢ∈𝓑 Var(gᵢ) = σ²/B

Therefore, the standard deviation of the gradient estimate scales as:

std(ĝ) = σ/√B

Key Result: Doubling the batch size reduces gradient noise by a factor of √2 ≈ 1.41, not by a factor of 2. This has important implications for computational efficiency.
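The 1/B variance scaling can be verified empirically by resampling mini-batch gradients of different sizes and measuring their spread. A sketch on a hypothetical 1-D least-squares problem (all data and names are illustrative):

```python
import torch

torch.manual_seed(0)

# Per-example gradients of ℓᵢ(w) = (w·xᵢ - yᵢ)² at a fixed parameter value w = 1
N = 10_000
x = torch.randn(N)
y = 3.0 * x + torch.randn(N)
w = 1.0
per_example_grad = 2 * x * (w * x - y)

variances = {}
for B in [1, 4, 16, 64]:
    # Draw 5000 mini-batches of size B and average the gradients within each
    idx = torch.randint(0, N, (5000, B))
    batch_grads = per_example_grad[idx].mean(dim=1)
    variances[B] = batch_grads.var().item()
    print(f"B = {B:3d}  gradient variance ≈ {variances[B]:.3f}")
```

The measured variances shrink roughly as σ²/B: going from B = 1 to B = 64 divides the variance by about 64, i.e. the standard deviation by only √64 = 8.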
Common Batch Sizes
| Batch Size | Use Case | Typical Hardware |
|---|---|---|
| 1 | Pure SGD, streaming data | CPU |
| 16-32 | Small datasets, limited memory | Single GPU |
| 64-128 | Standard training | Single GPU |
| 256-512 | Large-scale training | Multi-GPU |
| 1024+ | Distributed training | GPU cluster |
Powers of 2
By convention, batch sizes are usually powers of 2 (32, 64, 128, ...). GPU memory layouts and compute kernels are often most efficient at these sizes, though the measured benefit varies by hardware, so treat this as a default rather than a hard rule.
Interactive: Compare Variants
The visualization below shows all three gradient descent variants navigating the same loss landscape. Watch how batch GD follows a smooth path, while SGD is noisy, and mini-batch balances the two.
Gradient Descent Variants: Loss Landscape Navigation
Quick Check
Looking at the visualization, which variant reaches the minimum fastest in terms of wall-clock time?
Convergence Analysis
Let's analyze how these variants converge over training epochs:
Convergence Comparison
Watch how different gradient descent variants converge over training epochs. Notice the trade-off between noise and convergence speed.
Convergence Rates
For a strongly convex function with condition number κ:
| Method | Convergence Rate | Notes |
|---|---|---|
| Batch GD | O(κ log(1/ε)) | Linear convergence, deterministic |
| SGD | O(σ²/ε) | Sublinear, limited by noise floor |
| Mini-batch | O(σ²/(Bε)) | Improves with batch size B |
However, neural network loss functions are non-convex, so these theoretical rates don't directly apply. In practice, all three methods can find good solutions, but with different characteristics.
The Generalization Advantage of Noise
Empirically, noisier methods (SGD, small-batch) often find solutions that generalize better. Several theories explain this:
- Flat minima hypothesis: Noise helps escape sharp minima that overfit, finding flatter minima that generalize
- Implicit regularization: The noise term effectively adds an L2 regularization-like penalty
- Exploration: Random fluctuations explore more of the loss landscape
Learning Rate Dynamics
The learning rate is perhaps the most critical hyperparameter in gradient descent. Too small and training is slow; too large and training diverges.
Learning Rate: The Critical Hyperparameter
Adjust the learning rate and observe how it affects the optimization trajectory on a simple quadratic loss function.
Mathematical Insight
For a quadratic loss L(w) = (w - w*)², the update rule is:

w ← w - η · 2(w - w*), which rearranges to (w - w*) ← (1 - 2η)(w - w*)

Convergence requires |1 - 2η| < 1, meaning 0 < η < 1.
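The |1 - 2η| < 1 condition is easy to confirm by just iterating the update. A toy check with an assumed minimizer w* = 5 (the function name and values are illustrative):

```python
def run(eta, w_star=5.0, steps=50):
    """Iterate w ← w - η·2(w - w*) on L(w) = (w - w*)², starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        w -= eta * 2 * (w - w_star)
    return w

for eta in [0.1, 0.45, 1.1]:
    print(f"η = {eta}: w after 50 steps = {run(eta)}")
```

With η = 0.1 and η = 0.45 the iterate settles at w* = 5 (contraction factors 0.8 and 0.1); with η = 1.1 the factor is |1 - 2.2| = 1.2 > 1 and the iterate explodes.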
Learning Rate and Batch Size Coupling
When you change the batch size, you often need to adjust the learning rate. The linear scaling rule provides guidance:

When multiplying the batch size by k, multiply the learning rate by k: η_new = k · η_base

This rule ensures that the expected weight update magnitude stays constant when batch size changes.
Derivation of Linear Scaling
Consider the expected total update after k mini-batch steps with batch size B and learning rate η. Since each mini-batch gradient has expectation ∇θ L(θ), the k steps move the parameters by approximately -kη · ∇θ L(θ).

To match the same expected update with batch size kB in a single step, the new learning rate η′ must satisfy -η′ · ∇θ L(θ) = -kη · ∇θ L(θ), giving η′ = kη.
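The rule itself is one line of arithmetic; a helper function makes the convention explicit. The reference recipe of η = 0.1 at B = 256 is a commonly cited large-batch baseline, used here only as an assumed example:

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: η_new = η_base · (B_new / B_base)."""
    return base_lr * new_batch / base_batch

# Example: scaling a baseline of η = 0.1 at B = 256 up to B = 1024
print(scaled_lr(0.1, 256, 1024))  # ≈ 0.4

# Doubling the batch size doubles the learning rate
print(scaled_lr(0.05, 32, 64))    # ≈ 0.1
```

Remember that this is a heuristic: it tends to break down at very large batch sizes, which is one motivation for the warmup and layer-wise methods discussed below.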
Learning Rate Warmup
The linear scaling rule can make the initial learning rate dangerously large for big batches, when the parameters are still far from any good region. A common remedy is warmup: start with a small learning rate and ramp it up (often linearly) over the first few epochs before switching to the scaled value.
Batch Size Trade-offs
Choosing the right batch size involves balancing multiple factors. Use the explorer below to understand these trade-offs:
Batch Size Trade-offs Explorer
Explore how batch size affects various aspects of training. Each choice involves trade-offs between speed, memory, and generalization.
Gradient Noise Scaling
The variance of the stochastic gradient estimate decreases with batch size:

Var(ĝ) = σ²/B
Standard deviation scales as 1/√B, so doubling batch size reduces noise by ~30%.
The Batch Size Spectrum
| Batch Size | Gradient Noise | Updates/Epoch | GPU Util. | Generalization |
|---|---|---|---|---|
| B=1 (SGD) | Very High | N | Poor | Often Best |
| B=32 | Moderate | N/32 | Good | Good |
| B=256 | Low | N/256 | Excellent | May Degrade |
| B=8192+ | Very Low | N/8192 | Excellent | Requires Care |
Large Batch Training Challenges
Training with very large batches (1000+) presents unique challenges:
- Generalization gap: Models trained with large batches often generalize worse
- Sharp minima: Large batches tend to find sharper minima
- Learning rate sensitivity: Requires careful tuning and warmup
- Communication overhead: In distributed training, synchronization becomes bottleneck
Solutions include learning rate warmup, LARS (Layer-wise Adaptive Rate Scaling), and the LAMB optimizer; gradient checkpointing can additionally help with the memory demands of large batches.
Implementation in PyTorch
Let's implement all three gradient descent variants in PyTorch. We'll use a simple quadratic loss for clarity.
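One loop suffices for all three variants, since batch GD and SGD are just the extreme batch sizes. A minimal sketch on synthetic linear-regression data (dataset, hyperparameters, and the `train` helper are assumptions for illustration, not a prescribed recipe):

```python
import torch

def train(batch_size, eta=0.05, epochs=50, seed=0):
    """batch_size=N gives batch GD, batch_size=1 gives SGD,
    anything in between is mini-batch gradient descent."""
    torch.manual_seed(seed)
    N = 256
    x = torch.randn(N, 1)
    y = 3.0 * x + 0.1 * torch.randn(N, 1)   # true slope is 3.0

    w = torch.zeros(1, requires_grad=True)
    for _ in range(epochs):
        perm = torch.randperm(N)            # reshuffle every epoch
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            loss = ((x[idx] * w - y[idx]) ** 2).mean()
            loss.backward()
            with torch.no_grad():
                w -= eta * w.grad           # manual gradient descent step
            w.grad.zero_()
    return w.item()

results = {
    "batch GD   (B=256)": train(batch_size=256),
    "SGD        (B=1)":   train(batch_size=1, eta=0.01),  # smaller η to handle noise
    "mini-batch (B=32)":  train(batch_size=32),
}
for name, w in results.items():
    print(name, w)   # all three end up near the true slope 3.0
```

Note the practical differences predicted by the theory: batch GD takes one smooth step per epoch, SGD needs a smaller learning rate to tolerate single-sample noise, and mini-batch converges quickly with vectorized batches.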
PyTorch's Built-in Approach
In practice, you almost always use mini-batch gradient descent through PyTorch's DataLoader:
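A minimal sketch of the standard pattern — the synthetic dataset and hyperparameters here are illustrative assumptions, but the `DataLoader`/optimizer loop itself is the idiomatic PyTorch structure:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
x = torch.randn(256, 1)
y = 3.0 * x + 0.1 * torch.randn(256, 1)    # true slope 3.0, zero intercept

dataset = TensorDataset(x, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # reshuffles each epoch

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = torch.nn.MSELoss()

for epoch in range(50):
    for xb, yb in loader:                  # each iteration yields one mini-batch
        optimizer.zero_grad()              # clear accumulated gradients
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()                   # one mini-batch gradient update

print(model.weight.item())  # close to 3.0
```

Changing the variant is just a matter of `batch_size`: `batch_size=len(dataset)` recovers batch GD and `batch_size=1` recovers pure SGD, with everything else unchanged.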
Practical Guidelines
Choosing Batch Size
- Start with 32 or 64: These are safe defaults that work well for most problems
- Increase if training is too slow: Larger batches = fewer iterations per epoch
- Decrease if out of memory: Smaller batches use less GPU memory
- Use largest batch that fits in memory: Up to a point (generalization may degrade)
Choosing Learning Rate
- Start with 0.001 or 0.01: Common starting points for most optimizers
- Use learning rate finder: Train for a few iterations with exponentially increasing LR, find where loss starts increasing
- Apply linear scaling: When doubling batch size, double learning rate
- Use warmup for large batches: Start low, ramp up over first few epochs
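The learning rate finder mentioned above can be sketched in a few lines: repeatedly step with an exponentially growing learning rate while recording the loss, then pick a rate somewhat below where the loss curve turns upward. This is a simplified version of the idea (real implementations, e.g. in fastai, smooth the loss curve and restore the model afterwards); the data and growth factor are assumptions:

```python
import torch

torch.manual_seed(0)
x = torch.randn(256, 1)
y = 3.0 * x + 0.1 * torch.randn(256, 1)

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)
loss_fn = torch.nn.MSELoss()

lrs, losses = [], []
lr = 1e-5
while lr < 10:
    for group in optimizer.param_groups:   # set the current learning rate
        group["lr"] = lr
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())
    lr *= 1.5                              # exponential learning-rate sweep

# Inspect (lrs, losses): choose an lr a bit below where the loss starts climbing
```

Plotting `losses` against `lrs` on a log axis shows the characteristic shape: flat at tiny rates, a dip in the useful range, then a blow-up once the rate is too large.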
Decision Flowchart
| Scenario | Recommended Approach |
|---|---|
| Small dataset (<10k) | Full batch or large mini-batch |
| Standard training | Mini-batch (32-128) |
| Memory limited | Smaller batch + gradient accumulation |
| Very large dataset | Mini-batch with distributed training |
| Need best generalization | Smaller batches (16-32) |
| Need fastest training | Largest batch that fits + linear scaling |
Related Topics
- Chapter 9 Section 2: Optimizers - PyTorch optimizer implementations (SGD, Adam, AdamW) with momentum and adaptive learning rates
- Chapter 9 Section 3: Learning Rate Strategies - Learning rate schedules, warmup, and the learning rate finder
Summary
In this section, we explored the three main variants of gradient descent that power modern deep learning:
| Variant | Batch Size | Updates/Epoch | Key Property |
|---|---|---|---|
| Batch GD | N (full) | 1 | Smooth, deterministic |
| SGD | 1 | N | Noisy, regularizing |
| Mini-batch | B (e.g., 32) | N/B | Balanced, GPU-efficient |
Key Takeaways
- Mini-batch is the standard because it balances noise, efficiency, and GPU utilization
- Gradient variance scales as 1/B, so larger batches give more accurate gradient estimates
- Noise is a feature, not just a bug: It provides implicit regularization and helps escape local minima
- Learning rate and batch size are coupled: Use the linear scaling rule when changing batch size
- There's no universally best choice: The optimal variant depends on your dataset, hardware, and goals
Looking Ahead
In the next section, we'll explore computational graphs—the data structure that makes automatic differentiation possible. Understanding computational graphs is essential for understanding how backpropagation actually works in PyTorch.
Exercises
Conceptual Questions
- Explain why the gradient estimate from a mini-batch is an unbiased estimator of the true gradient.
- If you double the batch size from 32 to 64, by what factor does the gradient standard deviation decrease?
- Why does SGD often find solutions that generalize better than batch GD, despite using noisier gradients?
- Derive the linear scaling rule for learning rate when batch size changes.
Coding Exercises
- Gradient Variance Experiment: For a simple linear regression problem, empirically measure the gradient variance for batch sizes 1, 10, 100, and 1000. Plot the variance vs. batch size and verify the scaling.
- Learning Rate Sensitivity: Train the same model with learning rates 0.001, 0.01, 0.1, and 1.0. For each, record whether training converges, diverges, or oscillates.
- Gradient Accumulation: Implement gradient accumulation to simulate a batch size of 256 while only fitting 32 samples in memory at a time.
Gradient Accumulation Hint
Instead of calling optimizer.step() after every batch, accumulate gradients over multiple batches before stepping. Remember to divide the loss by the accumulation factor.
Challenge Exercise
Large Batch Training with LARS: Implement the LARS (Layer-wise Adaptive Rate Scaling) algorithm, which enables training with very large batches by adapting the learning rate per layer based on the ratio of weight norm to gradient norm.