Chapter 9

Hyperparameter Tuning

Training Neural Networks

Learning Objectives

By the end of this section, you will be able to:

  1. Distinguish parameters from hyperparameters: Understand what makes a hyperparameter different from a learned parameter and why this distinction matters
  2. Identify key hyperparameters: Know which hyperparameters have the biggest impact on training and when to tune them
  3. Apply search strategies: Use grid search, random search, and Bayesian optimization to find good hyperparameter configurations
  4. Implement automated tuning: Use tools like Optuna and Ray Tune to automate hyperparameter search in PyTorch
  5. Develop a practical workflow: Know when to stop tuning and how to allocate your compute budget effectively
Why This Matters: The same neural network architecture can perform poorly with bad hyperparameters or achieve state-of-the-art results with good ones. Learning rate alone can make the difference between a model that diverges, one that trains for weeks without converging, and one that learns efficiently. Hyperparameter tuning is often where the real gains come from.

The Big Picture

The Meta-Optimization Problem

Training a neural network is an optimization problem: we minimize the loss function by adjusting the model's parameters. But there's a level above this—we also need to choose how to optimize. The learning rate, batch size, network architecture, and regularization strength are all choices that affect whether optimization succeeds or fails.

These meta-choices are called hyperparameters, and finding good values for them is itself an optimization problem. The challenge? We can't use gradient descent because we can't differentiate through the entire training process (though methods like MAML try to approximate this).

The History

For decades, hyperparameter tuning was done primarily by intuition and trial and error. Researchers would hand-tune networks based on experience. In 2012, James Bergstra and Yoshua Bengio published an influential paper showing that random search is more efficient than grid search—a surprising result that changed how practitioners approach the problem.

More recently, Bayesian optimization and neural architecture search (NAS) have automated much of this process, enabling the discovery of architectures like EfficientNet that outperform hand-designed networks.


Parameters vs Hyperparameters

The Key Distinction

Understanding the difference between parameters and hyperparameters is fundamental:

| Aspect | Parameters | Hyperparameters |
|---|---|---|
| Definition | Learned from data during training | Set before training begins |
| Examples | Weights, biases | Learning rate, batch size, architecture |
| How optimized | Gradient descent (backpropagation) | Search algorithms, human intuition |
| Quantity | Millions to billions | Tens to hundreds |
| Gradient available? | Yes | No (typically) |

Mathematical Formulation

Let $\theta$ denote the model parameters and $\lambda$ denote the hyperparameters. Training optimizes:

$$\theta^* = \arg\min_\theta \mathcal{L}_{\text{train}}(\theta; \lambda)$$

But we want hyperparameters that lead to good generalization, so we optimize:

$$\lambda^* = \arg\min_\lambda \mathcal{L}_{\text{val}}(\theta^*(\lambda))$$

Notice the nested structure: for each hyperparameter configuration $\lambda$, we need to fully train the model to get $\theta^*(\lambda)$, then evaluate on validation data. This makes hyperparameter optimization expensive!
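The nested structure can be made concrete with a toy sketch, where hypothetical `train` and `val_loss` helpers stand in for a real training pipeline:

```python
def train(lmbda):
    """Inner problem: return theta*(lambda), the parameters that training
    finds for hyperparameters lambda. Toy stand-in with a closed form."""
    return 1.0 / lmbda["lr"]  # hypothetical optimum depends on the LR

def val_loss(theta):
    """Validation loss of the trained parameters (toy quadratic)."""
    return (theta - 100.0) ** 2

# Outer problem: search over hyperparameter configurations lambda
best_lmbda, best_loss = None, float("inf")
for lr in [0.001, 0.01, 0.1]:
    lmbda = {"lr": lr}
    theta_star = train(lmbda)    # full inner training run
    loss = val_loss(theta_star)  # evaluate on validation data
    if loss < best_loss:
        best_lmbda, best_loss = lmbda, loss

print(best_lmbda)  # {'lr': 0.01}
```

Each pass through the outer loop pays for a complete training run, which is why the outer search must be frugal with evaluations.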

Why Not Use Test Data?

We use validation data for hyperparameter tuning, never test data. If we tuned on test data, our reported test performance would be overly optimistic—we'd have implicitly fitted to the test set through our hyperparameter choices.

Quick Check

Which of the following is a hyperparameter?


Common Hyperparameters

Not all hyperparameters are equally important. Here's a prioritized list of what typically matters most:

Tier 1: Critical (Always Tune)

| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Learning rate | 1e-5 to 1e-1 (log scale) | Most important; wrong values cause divergence or no learning |
| Batch size | 8 to 512 (powers of 2) | Affects gradient noise, training speed, and generalization |
| Number of epochs | 10 to 1000+ | Must balance underfitting vs. overfitting |

Tier 2: Important (Tune After Tier 1)

| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Network depth | 2 to 100+ layers | More depth = more capacity but harder to train |
| Hidden layer width | 32 to 4096 | More width = more capacity |
| Regularization (L2/dropout) | 0.0 to 0.9 | Controls overfitting |
| Optimizer momentum | 0.9 to 0.999 | Affects convergence speed and stability |

Tier 3: Fine-Tuning (Optional)

| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Learning rate schedule | Step/cosine/warmup | Can improve final performance |
| Weight decay | 1e-6 to 1e-2 | Additional regularization |
| Activation function | ReLU/GELU/SiLU | Usually minor impact for feedforward networks |
| Initialization scheme | Xavier/He/etc. | Usually use defaults unless debugging gradient issues |

The 80/20 Rule

Roughly 80% of your hyperparameter improvement will come from tuning the learning rate and batch size. Don't spend hours tuning minor hyperparameters before you've nailed these two.

Search Strategies

Grid Search

The simplest approach: define a grid of values for each hyperparameter and evaluate all combinations.

🐍grid_search.py
# Grid search example
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]

# Evaluate all 3 × 3 = 9 combinations
for lr in learning_rates:
    for bs in batch_sizes:
        model = train_model(lr=lr, batch_size=bs)
        val_loss = evaluate(model)
        print(f"LR={lr}, BS={bs}: Val Loss = {val_loss:.4f}")

Problem: The number of combinations grows exponentially with dimensions. With 5 hyperparameters and 4 values each, you need $4^5 = 1024$ evaluations!

Random Search

Instead of a fixed grid, sample hyperparameters randomly from distributions:

🐍random_search.py
import numpy as np

# Random search: sample 20 configurations
for trial in range(20):
    lr = 10 ** np.random.uniform(-4, -1)  # Log-uniform: [0.0001, 0.1]
    bs = 2 ** np.random.randint(4, 9)  # Power of 2: 16 to 256
    dropout = np.random.uniform(0.0, 0.5)

    model = train_model(lr=lr, batch_size=bs, dropout=dropout)
    val_loss = evaluate(model)
    print(f"Trial {trial}: LR={lr:.4f}, BS={bs}, Dropout={dropout:.2f}")

Why Random Beats Grid

The key insight from Bergstra & Bengio (2012): not all hyperparameters matter equally. If learning rate is crucial but dropout barely matters, grid search wastes evaluations varying dropout while keeping learning rate fixed.

Random search samples the important dimensions more densely by chance. With 9 random samples, you get 9 different learning rates. With a 3×3 grid, you only get 3.

$$\text{Grid: unique values per dimension} = k$$
$$\text{Random: unique values per dimension} = n \text{ (total samples)}$$
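The counting argument can be checked directly: nine evaluations either way, but random search tries nine distinct learning rates where a 3×3 grid tries only three:

```python
import itertools
import random

random.seed(0)

# Grid: 3 learning rates x 3 dropout values = 9 evaluations
grid = list(itertools.product([0.001, 0.01, 0.1], [0.0, 0.25, 0.5]))
grid_unique_lrs = {lr for lr, _ in grid}

# Random: 9 samples, each with its own log-uniform learning rate
rand = [(10 ** random.uniform(-4, -1), random.uniform(0.0, 0.5))
        for _ in range(9)]
rand_unique_lrs = {lr for lr, _ in rand}

print(len(grid), len(grid_unique_lrs))  # 9 evaluations, 3 unique LRs
print(len(rand), len(rand_unique_lrs))  # 9 evaluations, 9 unique LRs
```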

Use Log Scale for Learning Rate

Learning rates vary over orders of magnitude (0.0001 to 0.1). Always search on a log scale, not linear. Use 10 ** uniform(-4, -1), not uniform(0.0001, 0.1).
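A quick simulation shows why: with linear sampling over [1e-4, 1e-1], only about 10% of samples land below 1e-2, so the two smallest decades are barely explored, while log-uniform sampling covers each decade equally:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

linear = rng.uniform(1e-4, 1e-1, size=n)         # uniform(0.0001, 0.1)
log_uniform = 10 ** rng.uniform(-4, -1, size=n)  # 10 ** uniform(-4, -1)

# Fraction of samples below 1e-2 (the bottom two of the three decades)
print(round((linear < 1e-2).mean(), 2))       # ~0.10
print(round((log_uniform < 1e-2).mean(), 2))  # ~0.67
```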

Visualize how different search strategies explore the hyperparameter space. The heatmap shows a loss landscape—blue regions have low loss (good), red regions have high loss (bad). Watch how each method finds the optimum:

Hyperparameter Search Strategies

[Interactive demo: grid search, random search, and Bayesian optimization explore a 2D loss landscape, with a 5 × 5 grid option (25 points) and running statistics for points evaluated and best loss. The global minimum is at (0, 0) with loss = 0, marked with a yellow dot; blue regions have low loss (good), red regions high loss (bad).]

Key Insight: Notice how grid search wastes evaluations in regions far from the optimum, while Bayesian optimization quickly focuses on promising areas. Random search often finds good solutions faster than grid search because it can sample anywhere in the space.

Quick Check

With 4 hyperparameters and a budget of 81 evaluations, which approach explores more unique values per hyperparameter?


Bayesian Optimization

The Core Idea

Random search treats all unexplored points equally. But after a few evaluations, we have information! If low learning rates have been consistently better, we should focus our search there. Bayesian optimization does exactly this: it builds a surrogate model of the objective function and uses it to decide where to sample next.

The Algorithm

  1. Initialize: Evaluate a few random configurations
  2. Fit surrogate: Build a probabilistic model (usually a Gaussian Process) that predicts loss given hyperparameters, along with uncertainty
  3. Maximize acquisition: Find the point that balances predicted low loss (exploitation) with high uncertainty (exploration)
  4. Evaluate: Train a model with the chosen hyperparameters
  5. Update: Add the new result to observations and repeat from step 2
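These five steps fit in a short script. The sketch below implements them from scratch with a squared-exponential kernel on a toy 1D objective; in practice you would reach for a library like Optuna rather than hand-rolling the Gaussian Process:

```python
import numpy as np
from math import erf, sqrt, pi

def objective(x):
    """Toy 'validation loss', minimized at x = 0.35."""
    return (x - 0.35) ** 2

def rbf(a, b, length=0.15):
    """Squared-exponential kernel: encodes smoothness assumptions."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Surrogate: GP posterior mean and std at test points Xs."""
    K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """Acquisition: E[max(best - f(x), 0)] for minimization."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1 + np.vectorize(erf)(z / sqrt(2)))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (best - mu) * cdf + sigma * pdf

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=3)                      # step 1: random init
y = objective(X)
candidates = np.linspace(0, 1, 201)

for _ in range(7):
    mu, sigma = gp_posterior(X, y, candidates)     # step 2: fit surrogate
    ei = expected_improvement(mu, sigma, y.min())  # step 3: acquisition
    x_next = candidates[np.argmax(ei)]             #         maximize it
    X = np.append(X, x_next)                       # step 4: evaluate
    y = np.append(y, objective(x_next))            # step 5: update, repeat

print(round(float(X[np.argmin(y)]), 2))  # close to 0.35
```

Note how the loop never needs gradients of the objective: it only ever asks for function values, which is exactly the setting of hyperparameter search.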

Acquisition Functions

The acquisition function decides where to sample next. Common choices:

| Function | Formula | Behavior |
|---|---|---|
| Expected Improvement (EI) | E[max(f* − f(x), 0)] | Prefers points likely to beat current best |
| Upper Confidence Bound (UCB) | μ(x) − κ·σ(x) | Explicitly trades off mean and uncertainty |
| Probability of Improvement | P(f(x) < f*) | Prefers any improvement, even small |

Gaussian Process Surrogate

A Gaussian Process (GP) provides not just a prediction but a full probability distribution over possible loss values at each point. This uncertainty is key:

  • Near observed points: Low uncertainty, predictions are reliable
  • Far from observed points: High uncertainty, might be worth exploring
$$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$$

where $m(x)$ is the mean function and $k(x, x')$ is the kernel (covariance function) that encodes smoothness assumptions.


Interactive: Bayesian Optimization

Watch Bayesian optimization in action on a 1D function. The blue line shows the surrogate model's prediction, the shaded region shows uncertainty, and the orange line shows the acquisition function (Expected Improvement). Click to add samples and observe how the model updates:

Bayesian Optimization in Action

[Interactive demo: a 1D run showing the surrogate model's mean, its 95% CI uncertainty band, the Expected Improvement acquisition function, and the best point found so far, with counters for evaluations, best loss, and best parameter.]

How it works: The blue line shows the surrogate model's prediction of the loss function. The shaded area represents uncertainty—wider means less confident. The orange dashed line is the acquisition function (Expected Improvement), which balances exploration (high uncertainty) and exploitation (low predicted loss).

Click "Add Suggested Point" to see Bayesian optimization choose where to sample next. Notice how it focuses on regions where the model predicts low loss OR has high uncertainty.

When to Use Bayesian Optimization

Bayesian optimization shines when: (1) each evaluation is expensive (training takes hours), (2) you have a small budget (less than 100 trials), and (3) hyperparameters are continuous. For cheap evaluations or discrete/categorical hyperparameters, random search may be simpler and equally effective.

Sensitivity Analysis

Before tuning all hyperparameters equally, it's valuable to understand which ones actually matter. Sensitivity analysis measures how much the output changes when each hyperparameter varies.

Why Sensitivity Matters

  • High sensitivity: Small changes cause large performance differences—tune carefully!
  • Low sensitivity: The hyperparameter can be left at a reasonable default
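A crude but useful version of this is one-at-a-time perturbation: wiggle each hyperparameter around a base configuration and record how much the validation loss moves. The sketch below uses a hypothetical closed-form `val_loss` as a stand-in for an actual train-and-evaluate run:

```python
import numpy as np

def val_loss(config):
    """Hypothetical stand-in for training + evaluation: very sensitive
    to learning rate, only mildly sensitive to dropout."""
    lr_term = (np.log10(config["lr"]) + 2.5) ** 2        # best near lr ~ 3e-3
    dropout_term = 0.1 * (config["dropout"] - 0.2) ** 2  # best near 0.2
    return lr_term + dropout_term

base = {"lr": 3e-3, "dropout": 0.3}

# Perturb each hyperparameter by +/-50%, one at a time
sensitivity = {}
for name in base:
    lo, hi = dict(base), dict(base)
    lo[name] *= 0.5
    hi[name] *= 1.5
    sensitivity[name] = abs(val_loss(hi) - val_loss(lo))

ranking = sorted(sensitivity, key=sensitivity.get, reverse=True)
print(ranking)  # ['lr', 'dropout']: tune the learning rate first
```

The ranking tells you where to spend your tuning budget; low-sensitivity hyperparameters can stay at their defaults.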

Experiment with the interactive demo below to see how each hyperparameter affects training:

Hyperparameter Sensitivity Analysis

Adjust the hyperparameters below and observe how sensitive the model performance is to each one. Parameters marked as "high sensitivity" need careful tuning.

[Interactive demo: sliders for learning rate (1e-5 to 0.3, log scale), batch size, hidden units, and dropout, with live readouts of training loss, validation loss, and the generalization gap, plus a ranked tuning-priority list based on each hyperparameter's relative sensitivity.]

Search Strategy Comparison

See how different search strategies perform as the number of hyperparameters increases:

[Interactive demo: plots the best loss found by each strategy as the number of hyperparameter dimensions grows, for a fixed budget of total trials.]

Key Observations:

  • Grid Search struggles as dimensions increase because the number of points needed grows exponentially (curse of dimensionality).
  • Random Search maintains consistent efficiency regardless of dimensionality because it can sample any point in the space.
  • Bayesian Optimization uses past results to make smarter choices about where to sample next, often finding better solutions faster.

Try increasing the number of hyperparameters to see how grid search degrades rapidly compared to the other methods.


Practical Workflow

  1. Start with reasonable defaults: Use published baselines or framework defaults. Don't tune from scratch.
  2. Get a working baseline: Train once with defaults. This is your reference point.
  3. Tune learning rate first: Do a log-scale search from 1e-5 to 1e-1. This single hyperparameter often determines success.
  4. Adjust batch size: Larger batches can use larger learning rates (linear scaling rule). Find the largest batch your GPU can handle.
  5. Tune regularization: If overfitting, increase dropout/weight decay. If underfitting, decrease them.
  6. Fine-tune architecture: Only adjust depth/width if you have budget remaining.
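Step 4's linear scaling rule can be written as a one-line helper: when you multiply the batch size by k, multiply the learning rate by k as well (the base values below are illustrative):

```python
def scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: learning rate grows in proportion
    to the batch size."""
    return base_lr * new_batch_size / base_batch_size

# Tuned lr=0.01 at batch size 64; moving to a larger or smaller batch:
print(scaled_lr(0.01, 64, 256))  # 0.04
print(scaled_lr(0.01, 64, 32))   # 0.005
```

At large batch sizes this rule is often paired with a warmup period, since the scaled learning rate can be unstable early in training.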

Premature Optimization

Don't spend hours tuning hyperparameters on a broken pipeline. First ensure your data loading, loss computation, and evaluation metrics are correct. One bug can make all your tuning worthless.

Budget Allocation

With limited compute, how should you allocate your budget?

| Budget Size | Recommended Approach |
|---|---|
| Very small (< 10 trials) | Tune learning rate only (log scale) |
| Small (10-50 trials) | Random search over LR, batch size, and 1-2 other hyperparameters |
| Medium (50-200 trials) | Bayesian optimization or multi-fidelity methods |
| Large (200+ trials) | Full Bayesian optimization with early stopping |
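The "multi-fidelity methods" mentioned for medium budgets include successive halving: train many configurations briefly, discard the worst half, and give the survivors more epochs. A toy sketch, with a hypothetical `partial_train` standing in for a real budgeted training run:

```python
import random

random.seed(0)

def partial_train(config, epochs):
    """Hypothetical stand-in: loss shrinks with more epochs, and
    configurations with lr closer to 1e-2 score better."""
    return abs(config["lr"] - 1e-2) + 1.0 / epochs

# Successive halving: 16 configs at 1 epoch each; every round keeps
# the best half and doubles the per-config epoch budget.
configs = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(16)]
epochs = 1
while len(configs) > 1:
    configs = sorted(configs, key=lambda c: partial_train(c, epochs))
    configs = configs[: len(configs) // 2]
    epochs *= 2

print(configs[0])  # the surviving config: lr nearest 1e-2
```

Total cost here is 16·1 + 8·2 + 4·4 + 2·8 = 64 config-epochs, versus 16·8 = 128 to train all sixteen configurations for the full 8 epochs.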

PyTorch Implementation

Manual Random Search
Key points in the code below:

  • Train function: train_with_config takes a hyperparameter configuration and returns the validation loss after training. This is the "objective function" we're trying to minimize.
  • Extract hyperparameters: The configuration is passed as a dictionary, which makes it easy to add new hyperparameters without changing function signatures.
  • Build model: The architecture uses hyperparameters from config; here we parameterize hidden_size and the dropout rate.
  • Log-uniform sampling: The learning rate is sampled on a log scale as 10 ** uniform(-4, -2), exploring 0.0001 to 0.01 evenly in log space.
  • Power-of-2 sampling: Hidden sizes are powers of 2 for GPU efficiency; we sample the exponent as an integer, then compute 2 to that power.

🐍manual_search.py
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import numpy as np
from typing import Dict, Any

def train_with_config(
    config: Dict[str, Any],
    train_loader: DataLoader,
    val_loader: DataLoader,
    num_epochs: int = 10,
) -> float:
    """Train a model with given hyperparameters and return validation loss."""

    # Extract hyperparameters
    lr = config["learning_rate"]
    hidden_size = config["hidden_size"]
    dropout = config["dropout"]

    # Build model with hyperparameters
    model = nn.Sequential(
        nn.Linear(784, hidden_size),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_size, hidden_size),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_size, 10),
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    # Training loop
    for epoch in range(num_epochs):
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

    # Validation
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            total_loss += criterion(outputs, targets).item()

    return total_loss / len(val_loader)


def random_search(
    train_loader: DataLoader,
    val_loader: DataLoader,
    n_trials: int = 20,
) -> Dict[str, Any]:
    """Perform random hyperparameter search."""

    best_config = None
    best_loss = float("inf")

    for trial in range(n_trials):
        # Sample hyperparameters
        config = {
            "learning_rate": 10 ** np.random.uniform(-4, -2),
            "hidden_size": 2 ** np.random.randint(6, 10),  # power of 2: 64 to 512
            "dropout": np.random.uniform(0.0, 0.5),
        }

        # Train and evaluate
        val_loss = train_with_config(config, train_loader, val_loader)

        print(f"Trial {trial + 1}: LR={config['learning_rate']:.4f}, "
              f"Hidden={config['hidden_size']}, "
              f"Dropout={config['dropout']:.2f}, "
              f"Val Loss={val_loss:.4f}")

        if val_loss < best_loss:
            best_loss = val_loss
            best_config = config

    print(f"\nBest config: {best_config}")
    print(f"Best validation loss: {best_loss:.4f}")
    return best_config

Using Optuna

Optuna is a popular hyperparameter optimization framework that implements efficient search algorithms:

Hyperparameter Tuning with Optuna
Key points in the code below:

  • Log-scale float: suggest_float with log=True samples on a logarithmic scale, which is essential for learning rates that span multiple orders of magnitude.
  • Integer with step: suggest_int with step=64 constrains hidden sizes to multiples of 64, which can be more efficient on GPUs.
  • Categorical hyperparameters: suggest_categorical samples from a discrete set, useful for optimizer choice, activation functions, etc.
  • Intermediate reporting: Validation loss is reported after each epoch; Optuna uses this for early stopping (pruning) of unpromising trials.
  • Pruning: should_prune() checks whether this trial is performing worse than the median of completed trials; if so, it stops early to save compute.
  • TPE sampler: The Tree-structured Parzen Estimator is Optuna's default Bayesian optimization algorithm, more efficient than random search for most problems.
  • Median pruner: Prunes trials that fall below the median performance at any epoch, dramatically reducing compute for large searches.

🐍optuna_search.py
import optuna
from optuna.trial import Trial
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def objective(trial: Trial) -> float:
    """Optuna objective function to minimize."""

    # Suggest hyperparameters
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    hidden_size = trial.suggest_int("hidden_size", 64, 512, step=64)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "SGD", "AdamW"])
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])

    # Build model
    model = nn.Sequential(
        nn.Linear(784, hidden_size),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_size, hidden_size),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_size, 10),
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Create optimizer based on suggestion
    if optimizer_name == "Adam":
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    elif optimizer_name == "SGD":
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    else:
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    criterion = nn.CrossEntropyLoss()

    # Create data loaders with the suggested batch size
    # (train_dataset and val_dataset are assumed to be defined elsewhere)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)

    # Training loop with pruning
    for epoch in range(20):
        model.train()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

        # Validation and pruning
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                val_loss += criterion(model(inputs), targets).item()
        val_loss /= len(val_loader)

        # Report intermediate result for pruning
        trial.report(val_loss, epoch)

        # Prune unpromising trials early
        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_loss


# Create study and optimize
study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(),  # Tree-structured Parzen Estimator
    pruner=optuna.pruners.MedianPruner(),  # Prune below-median trials
)

study.optimize(objective, n_trials=100, timeout=3600)

# Get best results
print(f"Best trial: {study.best_trial.params}")
print(f"Best value: {study.best_value:.4f}")

# Visualization
optuna.visualization.plot_optimization_history(study)
optuna.visualization.plot_param_importances(study)

Tools and Frameworks

Several excellent tools exist for hyperparameter tuning. Here's a comparison:

| Tool | Best For | Key Features |
|---|---|---|
| Optuna | General-purpose tuning | Easy API, pruning, visualization, many samplers |
| Ray Tune | Distributed tuning | Scales to clusters, integrates with many frameworks |
| Weights & Biases Sweeps | Experiment tracking + tuning | Beautiful dashboards, easy setup |
| Hyperopt | Bayesian optimization | Mature library, Tree-structured Parzen Estimator |
| Ax/BoTorch | Advanced BO | Facebook's production-grade Bayesian optimization |

Installing Tools

install.sh
# Optuna (recommended for beginners)
pip install optuna optuna-dashboard

# Ray Tune (for distributed tuning)
pip install "ray[tune]"

# Weights & Biases (for experiment tracking + sweeps)
pip install wandb

Start Simple

If you're new to hyperparameter tuning, start with Optuna. It has the gentlest learning curve while still being powerful enough for serious research. Move to Ray Tune when you need to distribute across multiple GPUs or machines.

Summary

Hyperparameter tuning is a critical skill for training high-performance neural networks. Let's review the key takeaways:

| Concept | Key Point |
|---|---|
| Parameters vs Hyperparameters | Parameters are learned; hyperparameters are set before training |
| Most Important Hyperparameter | Learning rate—wrong values cause divergence or no learning |
| Grid Search | Simple but scales poorly with dimensions (curse of dimensionality) |
| Random Search | More efficient than grid because it samples important dimensions more densely |
| Bayesian Optimization | Uses past results to decide where to sample; balances exploration/exploitation |
| Sensitivity Analysis | Identifies which hyperparameters matter most for your problem |
| Practical Workflow | Start with defaults, tune learning rate first, use random/Bayesian search |

Common Mistakes to Avoid

  • Tuning on test data (contaminates final evaluation)
  • Using linear scale for learning rate (use log scale!)
  • Tuning architecture before getting learning rate right
  • Not using early stopping for expensive evaluations
  • Ignoring the validation-test gap (validation performance != test performance)

Knowledge Check

Test your understanding of hyperparameter tuning concepts:


Which of the following is a hyperparameter (not a learned parameter)?


Exercises

Conceptual Questions

  1. Explain why the number of evaluations needed for grid search grows exponentially with the number of hyperparameters.
  2. You have a budget of 50 evaluations and 6 hyperparameters to tune. Would you use grid search, random search, or Bayesian optimization? Justify your answer.
  3. What is the role of the acquisition function in Bayesian optimization? Describe the exploration-exploitation tradeoff.
  4. Your colleague suggests using test set performance to choose the best hyperparameters. Explain why this is problematic.

Solution Hints

  1. Q1: With k values per dimension and d dimensions, grid search needs k^d evaluations. Each additional dimension multiplies the number of points by k.
  2. Q2: Grid search would need at least 2^6 = 64 points for just 2 values per dimension. Random or Bayesian with 50 trials is more practical.
  3. Q3: The acquisition function balances sampling where the surrogate predicts low loss (exploitation) and where uncertainty is high (exploration).
  4. Q4: Tuning on test data biases the reported test performance upward. The test set is no longer an unbiased estimate of generalization.

Coding Exercises

  1. Implement grid search: Write a function that takes a dictionary of hyperparameter lists and evaluates all combinations. Track the best configuration.
  2. Compare search strategies: Using MNIST or CIFAR-10, compare random search vs Optuna's TPE sampler with 30 trials each. Plot validation loss over trials.
  3. Early stopping integration: Extend the Optuna example to implement your own pruning strategy based on validation accuracy rather than loss.
  4. Learning rate finder: Implement the learning rate finder (gradually increase LR while monitoring loss). Find the optimal LR range for a model.

Coding Exercise Hints

  • Exercise 1: Use itertools.product to generate all combinations from the hyperparameter lists.
  • Exercise 2: Create two Optuna studies, one with RandomSampler and one with TPESampler. Plot study.trials_dataframe().
  • Exercise 3: Create a custom pruner by subclassing optuna.pruners.BasePruner.
  • Exercise 4: Start with a very small LR, increase exponentially each batch, and record loss. Plot LR vs loss to find the "knee".

In the next section, we'll explore debugging neural networks—a practical guide to diagnosing and fixing common training problems, from vanishing gradients to data issues.