Learning Objectives
By the end of this section, you will be able to:
- Distinguish parameters from hyperparameters: Understand what makes a hyperparameter different from a learned parameter and why this distinction matters
- Identify key hyperparameters: Know which hyperparameters have the biggest impact on training and when to tune them
- Apply search strategies: Use grid search, random search, and Bayesian optimization to find good hyperparameter configurations
- Implement automated tuning: Use tools like Optuna and Ray Tune to automate hyperparameter search in PyTorch
- Develop a practical workflow: Know when to stop tuning and how to allocate your compute budget effectively
Why This Matters: The same neural network architecture can perform poorly with bad hyperparameters or achieve state-of-the-art results with good ones. Learning rate alone can make the difference between a model that diverges, one that trains for weeks without converging, and one that learns efficiently. Hyperparameter tuning is often where the real gains come from.
The Big Picture
The Meta-Optimization Problem
Training a neural network is an optimization problem: we minimize the loss function by adjusting the model's parameters. But there's a level above this—we also need to choose how to optimize. The learning rate, batch size, network architecture, and regularization strength are all choices that affect whether optimization succeeds or fails.
These meta-choices are called hyperparameters, and finding good values for them is itself an optimization problem. The challenge? We can't use gradient descent because we can't differentiate through the entire training process (though methods like MAML try to approximate this).
The History
For decades, hyperparameter tuning was primarily done by intuition and trial-and-error. Researchers would hand-tune networks based on experience. In 2012, James Bergstra and Yoshua Bengio published an influential paper showing that random search is more efficient than grid search—a surprising result that changed how practitioners approach the problem.
More recently, Bayesian optimization and neural architecture search (NAS) have automated much of this process, enabling the discovery of architectures like EfficientNet that outperform hand-designed networks.
Parameters vs Hyperparameters
The Key Distinction
Understanding the difference between parameters and hyperparameters is fundamental:
| Aspect | Parameters | Hyperparameters |
|---|---|---|
| Definition | Learned from data during training | Set before training begins |
| Examples | Weights, biases | Learning rate, batch size, architecture |
| How optimized | Gradient descent (backpropagation) | Search algorithms, human intuition |
| Quantity | Millions to billions | Tens to hundreds |
| Gradient available? | Yes | No (typically) |
Mathematical Formulation
Let $\theta$ denote the model parameters and $\lambda$ denote the hyperparameters. Training optimizes:

$$\theta^*(\lambda) = \arg\min_{\theta} \mathcal{L}_{\text{train}}(\theta, \lambda)$$

But we want hyperparameters that lead to good generalization, so we optimize:

$$\lambda^* = \arg\min_{\lambda} \mathcal{L}_{\text{val}}(\theta^*(\lambda))$$

Notice the nested structure: for each hyperparameter configuration $\lambda$, we need to fully train the model to get $\theta^*(\lambda)$, then evaluate on validation data. This makes hyperparameter optimization expensive!
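This nested structure can be written down directly. In the sketch below, `train` and `validate` are hypothetical stand-ins: a real `train` would run gradient descent and return learned parameters, and a real `validate` would score them on held-out data.

```python
import math

# Hypothetical stand-ins: a real train() would run gradient descent and
# return learned parameters; validate() would score them on held-out data.
def train(lam):
    return {"lr": lam}                         # inner optimization (simulated)

def validate(theta):
    return (math.log10(theta["lr"]) + 3) ** 2  # outer objective (simulated)

# Outer loop: one full training run per hyperparameter configuration
candidates = [1e-4, 1e-3, 1e-2, 1e-1]
best_lam = min(candidates, key=lambda lam: validate(train(lam)))
print(best_lam)
```

Every candidate costs a full (simulated) training run, which is exactly why the outer search must be frugal with evaluations.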
Why Not Use Test Data?
Hyperparameters must be chosen on a separate validation set. If you select them based on test performance, the test set stops being an unbiased estimate of generalization—you have effectively trained on it.
Quick Check
Which of the following is a hyperparameter?
Common Hyperparameters
Not all hyperparameters are equally important. Here's a prioritized list of what typically matters most:
Tier 1: Critical (Always Tune)
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Learning rate | 1e-5 to 1e-1 (log scale) | Most important; wrong values cause divergence or no learning |
| Batch size | 8 to 512 (powers of 2) | Affects gradient noise, training speed, and generalization |
| Number of epochs | 10 to 1000+ | Must balance underfitting vs. overfitting |
Tier 2: Important (Tune After Tier 1)
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Network depth | 2 to 100+ layers | More depth = more capacity but harder to train |
| Hidden layer width | 32 to 4096 | More width = more capacity |
| Regularization (L2/dropout) | 0.0 to 0.9 | Controls overfitting |
| Optimizer momentum | 0.9 to 0.999 | Affects convergence speed and stability |
Tier 3: Fine-Tuning (Optional)
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Learning rate schedule | Step/cosine/warmup | Can improve final performance |
| Weight decay | 1e-6 to 1e-2 | Additional regularization |
| Activation function | ReLU/GELU/SiLU | Usually minor impact for feedforward networks |
| Initialization scheme | Xavier/He/etc. | Usually use defaults unless debugging gradient issues |
The 80/20 Rule
Most of the gains come from Tier 1: getting the learning rate (and, to a lesser extent, batch size and epochs) right typically delivers far more improvement than exhaustively tuning Tier 2 and Tier 3 hyperparameters.
Search Strategies
Grid Search
The simplest approach: define a grid of values for each hyperparameter and evaluate all combinations.
```python
# Grid search example
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]

# Evaluate all 3 × 3 = 9 combinations
for lr in learning_rates:
    for bs in batch_sizes:
        model = train_model(lr=lr, batch_size=bs)
        val_loss = evaluate(model)
        print(f"LR={lr}, BS={bs}: Val Loss = {val_loss:.4f}")
```

Problem: The number of combinations grows exponentially with dimensions. With 5 hyperparameters and 4 values each, you need 4^5 = 1024 evaluations!
Random Search
Instead of a fixed grid, sample hyperparameters randomly from distributions:
```python
import numpy as np

# Random search: sample 20 configurations
for trial in range(20):
    lr = 10 ** np.random.uniform(-4, -1)    # Log-uniform over [0.0001, 0.1]
    bs = int(2 ** np.random.randint(4, 9))  # Powers of 2: {16, 32, ..., 256}
    dropout = np.random.uniform(0.0, 0.5)

    model = train_model(lr=lr, batch_size=bs, dropout=dropout)
    val_loss = evaluate(model)
    print(f"Trial {trial}: LR={lr:.4f}, BS={bs}, Dropout={dropout:.2f}")
```

Why Random Beats Grid
The key insight from Bergstra & Bengio (2012): not all hyperparameters matter equally. If learning rate is crucial but dropout barely matters, grid search wastes evaluations varying dropout while keeping learning rate fixed.
Random search samples the important dimensions more densely by chance. With 9 random samples, you get 9 different learning rates. With a 3×3 grid, you only get 3.
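A quick way to see this with the same budget of 9 evaluations (an illustrative NumPy sketch; the specific values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A 3 x 3 grid over (learning rate, dropout): each lr value is reused
# three times, so only 3 distinct learning rates are ever evaluated.
grid_lrs = np.repeat([1e-3, 1e-2, 1e-1], 3)

# 9 random samples: 9 distinct learning rates for the same budget
random_lrs = 10 ** rng.uniform(-4, -1, size=9)

print(f"grid:   {len(set(grid_lrs.tolist()))} unique learning rates")
print(f"random: {len(set(random_lrs.tolist()))} unique learning rates")
```

If learning rate is the dimension that matters, random search has effectively explored it three times more densely for free.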
Use Log Scale for Learning Rate
Sample as `10 ** uniform(-4, -1)`, not `uniform(0.0001, 0.1)`; a linear scale would put almost all samples near the top of the range.
Interactive: Search Strategies
Visualize how different search strategies explore the hyperparameter space. The heatmap shows a loss landscape—blue regions have low loss (good), red regions have high loss (bad). Watch how each method finds the optimum:
Grid Search: Systematically evaluates points on a regular grid (here 5 × 5 = 25 points). Simple but scales poorly with dimensions. The global minimum is at (0, 0) with loss = 0, marked with a yellow dot.
Key Insight: Notice how grid search wastes evaluations in regions far from the optimum, while Bayesian optimization quickly focuses on promising areas. Random search often finds good solutions faster than grid search because it can sample anywhere in the space.
Quick Check
With 4 hyperparameters and a budget of 81 evaluations, which approach explores more unique values per hyperparameter?
Bayesian Optimization
The Core Idea
Random search treats all unexplored points equally. But after a few evaluations, we have information! If low learning rates have been consistently better, we should focus our search there. Bayesian optimization does exactly this: it builds a surrogate model of the objective function and uses it to decide where to sample next.
The Algorithm
- Initialize: Evaluate a few random configurations
- Fit surrogate: Build a probabilistic model (usually a Gaussian Process) that predicts loss given hyperparameters, along with uncertainty
- Maximize acquisition: Find the point that balances predicted low loss (exploitation) with high uncertainty (exploration)
- Evaluate: Train a model with the chosen hyperparameters
- Update: Add the new result to observations and repeat from step 2
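The loop above can be sketched end to end. This is a deliberately minimal 1-D implementation with a hand-rolled RBF-kernel Gaussian Process and Expected Improvement; a toy function stands in for validation loss, and real libraries (Optuna, BoTorch, etc.) handle all of this for you:

```python
import numpy as np
from math import erf, sqrt, pi

# Toy 1-D "validation loss" standing in for a real training run
def objective(x):
    return np.sin(3 * x) + 0.5 * x ** 2

# RBF kernel between two sets of 1-D points
def rbf(a, b, length=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

# GP posterior mean and std on grid points Xs, given observations (X, y)
def gp_posterior(X, y, Xs, jitter=1e-6):
    K_inv = np.linalg.inv(rbf(X, X) + jitter * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = np.diag(rbf(Xs, Xs) - Ks.T @ K_inv @ Ks)
    return mu, np.sqrt(np.maximum(var, 1e-12))

# Expected Improvement for minimization
def expected_improvement(mu, sigma, best):
    z = (best - mu) / sigma
    Phi = 0.5 * (1 + np.array([erf(v / sqrt(2)) for v in z]))  # normal CDF
    phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)                 # normal PDF
    return (best - mu) * Phi + sigma * phi

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=3)             # step 1: random initialization
y = objective(X)
grid = np.linspace(-2, 2, 200)

for _ in range(10):                        # steps 2-5, repeated
    mu, sigma = gp_posterior(X, y, grid)   # fit surrogate
    ei = expected_improvement(mu, sigma, y.min())
    x_next = grid[np.argmax(ei)]           # maximize acquisition
    X = np.append(X, x_next)               # evaluate and update
    y = np.append(y, objective(x_next))

print(f"best x = {X[np.argmin(y)]:.3f}, best loss = {y.min():.3f}")
```

Note how each iteration spends one expensive evaluation exactly where the surrogate says it is most worthwhile, rather than on a point chosen blindly.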
Acquisition Functions
The acquisition function decides where to sample next. Common choices:
| Function | Formula | Behavior |
|---|---|---|
| Expected Improvement (EI) | E[max(f* - f(x), 0)] | Prefers points likely to beat current best |
| Upper Confidence Bound (UCB) | μ(x) - κ·σ(x) | Explicitly trades off mean and uncertainty |
| Probability of Improvement | P(f(x) < f*) | Prefers any improvement, even small |
Gaussian Process Surrogate
A Gaussian Process (GP) provides not just a prediction but a full probability distribution over possible loss values at each point. This uncertainty is key:
- Near observed points: Low uncertainty, predictions are reliable
- Far from observed points: High uncertainty, might be worth exploring
A GP prior is written $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, where $m(x)$ is the mean function and $k(x, x')$ is the kernel (covariance function) that encodes smoothness assumptions.
Interactive: Bayesian Optimization
Watch Bayesian optimization in action on a 1D function. The blue line shows the surrogate model's prediction, the shaded region shows uncertainty, and the orange line shows the acquisition function (Expected Improvement). Click to add samples and observe how the model updates:
How it works: The blue line shows the surrogate model's prediction of the loss function. The shaded area represents uncertainty—wider means less confident. The orange dashed line is the acquisition function (Expected Improvement), which balances exploration (high uncertainty) and exploitation (low predicted loss).
Click "Add Suggested Point" to see Bayesian optimization choose where to sample next. Notice how it focuses on regions where the model predicts low loss OR has high uncertainty.
When to Use Bayesian Optimization
Bayesian optimization pays off when each evaluation is expensive (a full training run) and the search space is modest (roughly up to 20 hyperparameters). When evaluations are cheap or the space is very high-dimensional, random search is often the better first choice.
Sensitivity Analysis
Before tuning all hyperparameters equally, it's valuable to understand which ones actually matter. Sensitivity analysis measures how much the output changes when each hyperparameter varies.
Why Sensitivity Matters
- High sensitivity: Small changes cause large performance differences—tune carefully!
- Low sensitivity: The hyperparameter can be left at a reasonable default
Experiment with the interactive demo below to see how each hyperparameter affects training:
Adjust the hyperparameters below and observe how sensitive the model performance is to each one. Parameters marked as "high sensitivity" need careful tuning.
Tuning Priority
Based on the demo's current sensitivity estimates, tune these hyperparameters first:
1. Dropout (54% relative sensitivity)
2. Hidden units (45% relative sensitivity)
3. Batch size (14% relative sensitivity)
4. Learning rate (10% relative sensitivity)
Search Strategy Comparison
See how different search strategies perform as the number of hyperparameters increases:
Key Observations:
- Grid Search struggles as dimensions increase because the number of points needed grows exponentially (curse of dimensionality).
- Random Search maintains consistent efficiency regardless of dimensionality because it can sample any point in the space.
- Bayesian Optimization uses past results to make smarter choices about where to sample next, often finding better solutions faster.
Try increasing the number of hyperparameters to see how grid search degrades rapidly compared to the other methods.
Practical Workflow
A Recommended Approach
- Start with reasonable defaults: Use published baselines or framework defaults. Don't tune from scratch.
- Get a working baseline: Train once with defaults. This is your reference point.
- Tune learning rate first: Do a log-scale search from 1e-5 to 1e-1. This single hyperparameter often determines success.
- Adjust batch size: Larger batches can use larger learning rates (linear scaling rule). Find the largest batch your GPU can handle.
- Tune regularization: If overfitting, increase dropout/weight decay. If underfitting, decrease them.
- Fine-tune architecture: Only adjust depth/width if you have budget remaining.
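Step 4 mentions the linear scaling rule: when you multiply the batch size by k, multiply the learning rate by k as well. A sketch (the base values below are illustrative, not recommendations):

```python
# Linear scaling rule: lr grows in proportion to batch size.
# BASE_LR and BASE_BS are an illustrative reference point, not recommendations.
BASE_LR, BASE_BS = 0.1, 256

def scaled_lr(batch_size):
    return BASE_LR * batch_size / BASE_BS

for bs in [32, 64, 128, 256, 512]:
    print(f"batch={bs:4d} -> lr={scaled_lr(bs):.4f}")
```

The rule is a heuristic: it works well up to a point, after which large batches typically also need warmup to train stably.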
Premature Optimization
Resist tuning architecture or exotic hyperparameters before the basics work. A wrong learning rate will mask the effect of everything else you change.
Budget Allocation
With limited compute, how should you allocate your budget?
| Budget Size | Recommended Approach |
|---|---|
| Very small (< 10 trials) | Tune learning rate only (log scale) |
| Small (10-50 trials) | Random search over LR, batch size, and 1-2 other hyperparameters |
| Medium (50-200 trials) | Bayesian optimization or multi-fidelity methods |
| Large (200+ trials) | Full Bayesian optimization with early stopping |
PyTorch Implementation
Manual Hyperparameter Search
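A minimal sketch of a manual search loop in PyTorch. The synthetic regression data and toy MLP below are stand-ins for a real dataset and model; `train_model` simply trains once per candidate learning rate and reports validation loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data standing in for a real dataset
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

def train_model(lr, epochs=50):
    """Train a small MLP with the given learning rate; return val loss."""
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return loss_fn(model(X_val), y_val).item()

# Manual search over log-spaced learning rates
results = {lr: train_model(lr) for lr in [1e-4, 1e-3, 1e-2, 1e-1]}
best_lr = min(results, key=results.get)
print(f"Best LR: {best_lr}, val loss: {results[best_lr]:.4f}")
```

This is essentially grid search over one dimension; it works fine for one or two hyperparameters but does not scale, which is where the tools below come in.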
Using Optuna
Optuna is a popular hyperparameter optimization framework that implements efficient search algorithms:
Tools and Frameworks
Several excellent tools exist for hyperparameter tuning. Here's a comparison:
| Tool | Best For | Key Features |
|---|---|---|
| Optuna | General-purpose tuning | Easy API, pruning, visualization, many samplers |
| Ray Tune | Distributed tuning | Scales to clusters, integrates with many frameworks |
| Weights & Biases Sweeps | Experiment tracking + tuning | Beautiful dashboards, easy setup |
| Hyperopt | Bayesian optimization | Mature library, Tree-Parzen Estimator |
| Ax/BoTorch | Advanced BO | Facebook's production-grade Bayesian optimization |
Installing Tools
```shell
# Optuna (recommended for beginners)
pip install optuna optuna-dashboard

# Ray Tune (for distributed tuning)
pip install "ray[tune]"

# Weights & Biases (for experiment tracking + sweeps)
pip install wandb
```

Start Simple
Begin with Optuna on a single machine; it covers random search, TPE, and pruning with a few lines of code. Reach for Ray Tune only when you actually need to distribute trials across a cluster.
Summary
Hyperparameter tuning is a critical skill for training high-performance neural networks. Let's review the key takeaways:
| Concept | Key Point |
|---|---|
| Parameters vs Hyperparameters | Parameters are learned; hyperparameters are set before training |
| Most Important Hyperparameter | Learning rate—wrong values cause divergence or no learning |
| Grid Search | Simple but scales poorly with dimensions (curse of dimensionality) |
| Random Search | More efficient than grid because it samples important dims more densely |
| Bayesian Optimization | Uses past results to decide where to sample; balances exploration/exploitation |
| Sensitivity Analysis | Identifies which hyperparameters matter most for your problem |
| Practical Workflow | Start with defaults, tune learning rate first, use random/Bayesian search |
Common Mistakes to Avoid
- Tuning on test data (contaminates final evaluation)
- Using linear scale for learning rate (use log scale!)
- Tuning architecture before getting learning rate right
- Not using early stopping for expensive evaluations
- Ignoring the validation-test gap (validation performance != test performance)
Knowledge Check
Test your understanding of hyperparameter tuning concepts:
Which of the following is a hyperparameter (not a learned parameter)?
Exercises
Conceptual Questions
- Explain why the number of evaluations needed for grid search grows exponentially with the number of hyperparameters.
- You have a budget of 50 evaluations and 6 hyperparameters to tune. Would you use grid search, random search, or Bayesian optimization? Justify your answer.
- What is the role of the acquisition function in Bayesian optimization? Describe the exploration-exploitation tradeoff.
- Your colleague suggests using test set performance to choose the best hyperparameters. Explain why this is problematic.
Solution Hints
- Q1: With k values per dimension and d dimensions, grid search needs k^d evaluations. Each additional dimension multiplies the number of points by k.
- Q2: Grid search would need at least 2^6 = 64 points for just 2 values per dimension. Random or Bayesian with 50 trials is more practical.
- Q3: The acquisition function balances sampling where the surrogate predicts low loss (exploitation) and where uncertainty is high (exploration).
- Q4: Tuning on test data biases the reported test performance upward. The test set is no longer an unbiased estimate of generalization.
Coding Exercises
- Implement grid search: Write a function that takes a dictionary of hyperparameter lists and evaluates all combinations. Track the best configuration.
- Compare search strategies: Using MNIST or CIFAR-10, compare random search vs Optuna's TPE sampler with 30 trials each. Plot validation loss over trials.
- Early stopping integration: Extend the Optuna example to implement your own pruning strategy based on validation accuracy rather than loss.
- Learning rate finder: Implement the learning rate finder (gradually increase LR while monitoring loss). Find the optimal LR range for a model.
Coding Exercise Hints
- Exercise 1: Use `itertools.product` to generate all combinations from the hyperparameter lists.
- Exercise 2: Create two Optuna studies, one with `RandomSampler` and one with `TPESampler`. Plot `study.trials_dataframe()`.
- Exercise 3: Create a custom pruner by subclassing `optuna.pruners.BasePruner`.
- Exercise 4: Start with a very small LR, increase exponentially each batch, and record loss. Plot LR vs loss to find the "knee".
In the next section, we'll explore debugging neural networks—a practical guide to diagnosing and fixing common training problems, from vanishing gradients to data issues.