Learning Objectives
By the end of this section, you will be able to:
- Distinguish parameters from hyperparameters: Understand what makes a hyperparameter different from a learned parameter and why this distinction matters
- Identify key hyperparameters: Know which hyperparameters have the biggest impact on training and when to tune them
- Apply search strategies: Use grid search, random search, and Bayesian optimization to find good hyperparameter configurations
- Implement automated tuning: Use tools like Optuna and Ray Tune to automate hyperparameter search in PyTorch
- Develop a practical workflow: Know when to stop tuning and how to allocate your compute budget effectively
Why This Matters: The same neural network architecture can perform poorly with bad hyperparameters or achieve state-of-the-art results with good ones. Learning rate alone can make the difference between a model that diverges, one that trains for weeks without converging, and one that learns efficiently. Hyperparameter tuning is often where the real gains come from.
The Big Picture
The Meta-Optimization Problem
Training a neural network is an optimization problem: we minimize the loss function by adjusting the model's parameters. But there's a level above this—we also need to choose how to optimize. The learning rate, batch size, network architecture, and regularization strength are all choices that affect whether optimization succeeds or fails.
These meta-choices are called hyperparameters, and finding good values for them is itself an optimization problem. The challenge? We can't use gradient descent because we can't differentiate through the entire training process (though methods like MAML try to approximate this).
The History
For decades, hyperparameter tuning was primarily done by intuition and trial-and-error. Researchers would hand-tune networks based on experience. In 2012, James Bergstra and Yoshua Bengio published an influential paper showing that random search is more efficient than grid search—a surprising result that changed how practitioners approach the problem.
More recently, Bayesian optimization and neural architecture search (NAS) have automated much of this process, enabling the discovery of architectures like EfficientNet that outperform hand-designed networks.
Parameters vs Hyperparameters
The Key Distinction
Understanding the difference between parameters and hyperparameters is fundamental:
| Aspect | Parameters | Hyperparameters |
|---|---|---|
| Definition | Learned from data during training | Set before training begins |
| Examples | Weights, biases | Learning rate, batch size, architecture |
| How optimized | Gradient descent (backpropagation) | Search algorithms, human intuition |
| Quantity | Millions to billions | Tens to hundreds |
| Gradient available? | Yes | No (typically) |
Mathematical Formulation
Let $\theta$ denote the model parameters and $\lambda$ denote the hyperparameters. Training optimizes:

$$\theta^*(\lambda) = \arg\min_{\theta} \mathcal{L}_{\text{train}}(\theta, \lambda)$$

But we want hyperparameters that lead to good generalization, so we optimize:

$$\lambda^* = \arg\min_{\lambda} \mathcal{L}_{\text{val}}(\theta^*(\lambda))$$

Notice the nested structure: for each hyperparameter configuration $\lambda$, we need to fully train the model to get $\theta^*(\lambda)$, then evaluate on validation data. This makes hyperparameter optimization expensive!
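This nested structure can be written down directly. In the sketch below, `train` and `validate` are hypothetical stand-ins: a real `train` would run gradient descent and return learned parameters, and a real `validate` would score them on held-out data.

```python
import math

# Hypothetical stand-ins: a real train() would run gradient descent and
# return learned parameters; validate() would score them on held-out data.
def train(lam):
    return {"lr": lam}                         # inner optimization (simulated)

def validate(theta):
    return (math.log10(theta["lr"]) + 3) ** 2  # outer objective (simulated)

# Outer loop: one full training run per hyperparameter configuration
candidates = [1e-4, 1e-3, 1e-2, 1e-1]
best_lam = min(candidates, key=lambda lam: validate(train(lam)))
print(best_lam)
```

Every candidate costs a full (simulated) training run, which is exactly why the outer search must be frugal with evaluations.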
Why Not Use Test Data?
Hyperparameters must be chosen on a separate validation set. If you select them based on test performance, the test set stops being an unbiased estimate of generalization—you have effectively trained on it.
Quick Check
Which of the following is a hyperparameter?
Common Hyperparameters
Not all hyperparameters are equally important. Here's a prioritized list of what typically matters most:
Tier 1: Critical (Always Tune)
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Learning rate | 1e-5 to 1e-1 (log scale) | Most important; wrong values cause divergence or no learning |
| Batch size | 8 to 512 (powers of 2) | Affects gradient noise, training speed, and generalization |
| Number of epochs | 10 to 1000+ | Must balance underfitting vs. overfitting |
Tier 2: Important (Tune After Tier 1)
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Network depth | 2 to 100+ layers | More depth = more capacity but harder to train |
| Hidden layer width | 32 to 4096 | More width = more capacity |
| Regularization (L2/dropout) | 0.0 to 0.9 | Controls overfitting |
| Optimizer momentum | 0.9 to 0.999 | Affects convergence speed and stability |
Tier 3: Fine-Tuning (Optional)
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Learning rate schedule | Step/cosine/warmup | Can improve final performance |
| Weight decay | 1e-6 to 1e-2 | Additional regularization |
| Activation function | ReLU/GELU/SiLU | Usually minor impact for feedforward networks |
| Initialization scheme | Xavier/He/etc. | Usually use defaults unless debugging gradient issues |
The 80/20 Rule
Most of the gains come from Tier 1: getting the learning rate (and, to a lesser extent, batch size and epochs) right typically delivers far more improvement than exhaustively tuning Tier 2 and Tier 3 hyperparameters.
Search Strategies
Grid Search
The simplest approach: define a grid of values for each hyperparameter and evaluate all combinations.
```python
# Grid search example
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]

# Evaluate all 3 × 3 = 9 combinations
for lr in learning_rates:
    for bs in batch_sizes:
        model = train_model(lr=lr, batch_size=bs)
        val_loss = evaluate(model)
        print(f"LR={lr}, BS={bs}: Val Loss = {val_loss:.4f}")
```

Problem: The number of combinations grows exponentially with dimensions. With 5 hyperparameters and 4 values each, you need 4^5 = 1024 evaluations!
Random Search
Instead of a fixed grid, sample hyperparameters randomly from distributions:
```python
import numpy as np

# Random search: sample 20 configurations
for trial in range(20):
    lr = 10 ** np.random.uniform(-4, -1)    # Log-uniform over [0.0001, 0.1]
    bs = int(2 ** np.random.randint(4, 9))  # Powers of 2: {16, 32, ..., 256}
    dropout = np.random.uniform(0.0, 0.5)

    model = train_model(lr=lr, batch_size=bs, dropout=dropout)
    val_loss = evaluate(model)
    print(f"Trial {trial}: LR={lr:.4f}, BS={bs}, Dropout={dropout:.2f}")
```

Why Random Beats Grid
The key insight from Bergstra & Bengio (2012): not all hyperparameters matter equally. If learning rate is crucial but dropout barely matters, grid search wastes evaluations varying dropout while keeping learning rate fixed.
Random search samples the important dimensions more densely by chance. With 9 random samples, you get 9 different learning rates. With a 3×3 grid, you only get 3.
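A quick way to see this with the same budget of 9 evaluations (an illustrative NumPy sketch; the specific values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A 3 x 3 grid over (learning rate, dropout): each lr value is reused
# three times, so only 3 distinct learning rates are ever evaluated.
grid_lrs = np.repeat([1e-3, 1e-2, 1e-1], 3)

# 9 random samples: 9 distinct learning rates for the same budget
random_lrs = 10 ** rng.uniform(-4, -1, size=9)

print(f"grid:   {len(set(grid_lrs.tolist()))} unique learning rates")
print(f"random: {len(set(random_lrs.tolist()))} unique learning rates")
```

If learning rate is the dimension that matters, random search has effectively explored it three times more densely for free.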
Use Log Scale for Learning Rate
Sample as `10 ** uniform(-4, -1)`, not `uniform(0.0001, 0.1)`; a linear scale would put almost all samples near the top of the range.
Interactive: Search Strategies
Visualize how different search strategies explore the hyperparameter space. The heatmap shows a loss landscape—blue regions have low loss (good), red regions have high loss (bad). Watch how each method finds the optimum:
Grid Search: Systematically evaluates points on a regular grid (here 5 × 5 = 25 points). Simple but scales poorly with dimensions. The global minimum is at (0, 0) with loss = 0, marked with a yellow dot.
Key Insight: Notice how grid search wastes evaluations in regions far from the optimum, while Bayesian optimization quickly focuses on promising areas. Random search often finds good solutions faster than grid search because it can sample anywhere in the space.
Quick Check
With 4 hyperparameters and a budget of 81 evaluations, which approach explores more unique values per hyperparameter?
Bayesian Optimization
The Core Idea
Random search treats all unexplored points equally. But after a few evaluations, we have information! If low learning rates have been consistently better, we should focus our search there. Bayesian optimization does exactly this: it builds a surrogate model of the objective function and uses it to decide where to sample next.
The Algorithm
- Initialize: Evaluate a few random configurations
- Fit surrogate: Build a probabilistic model (usually a Gaussian Process) that predicts loss given hyperparameters, along with uncertainty
- Maximize acquisition: Find the point that balances predicted low loss (exploitation) with high uncertainty (exploration)
- Evaluate: Train a model with the chosen hyperparameters
- Update: Add the new result to observations and repeat from step 2
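The loop above can be sketched end to end. This is a deliberately minimal 1-D implementation with a hand-rolled RBF-kernel Gaussian Process and Expected Improvement; a toy function stands in for validation loss, and real libraries (Optuna, BoTorch, etc.) handle all of this for you:

```python
import numpy as np
from math import erf, sqrt, pi

# Toy 1-D "validation loss" standing in for a real training run
def objective(x):
    return np.sin(3 * x) + 0.5 * x ** 2

# RBF kernel between two sets of 1-D points
def rbf(a, b, length=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

# GP posterior mean and std on grid points Xs, given observations (X, y)
def gp_posterior(X, y, Xs, jitter=1e-6):
    K_inv = np.linalg.inv(rbf(X, X) + jitter * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = np.diag(rbf(Xs, Xs) - Ks.T @ K_inv @ Ks)
    return mu, np.sqrt(np.maximum(var, 1e-12))

# Expected Improvement for minimization
def expected_improvement(mu, sigma, best):
    z = (best - mu) / sigma
    Phi = 0.5 * (1 + np.array([erf(v / sqrt(2)) for v in z]))  # normal CDF
    phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)                 # normal PDF
    return (best - mu) * Phi + sigma * phi

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=3)             # step 1: random initialization
y = objective(X)
grid = np.linspace(-2, 2, 200)

for _ in range(10):                        # steps 2-5, repeated
    mu, sigma = gp_posterior(X, y, grid)   # fit surrogate
    ei = expected_improvement(mu, sigma, y.min())
    x_next = grid[np.argmax(ei)]           # maximize acquisition
    X = np.append(X, x_next)               # evaluate and update
    y = np.append(y, objective(x_next))

print(f"best x = {X[np.argmin(y)]:.3f}, best loss = {y.min():.3f}")
```

Note how each iteration spends one expensive evaluation exactly where the surrogate says it is most worthwhile, rather than on a point chosen blindly.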
Acquisition Functions
The acquisition function decides where to sample next. Common choices:
| Function | Formula | Behavior |
|---|---|---|
| Expected Improvement (EI) | E[max(f* - f(x), 0)] | Prefers points likely to beat current best |
| Upper Confidence Bound (UCB) | μ(x) - κ·σ(x) | Explicitly trades off mean and uncertainty |
| Probability of Improvement | P(f(x) < f*) | Prefers any improvement, even small |
Gaussian Process Surrogate
A Gaussian Process (GP) provides not just a prediction but a full probability distribution over possible loss values at each point. This uncertainty is key:
- Near observed points: Low uncertainty, predictions are reliable
- Far from observed points: High uncertainty, might be worth exploring
A GP prior is written $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, where $m(x)$ is the mean function and $k(x, x')$ is the kernel (covariance function) that encodes smoothness assumptions.
Interactive: Bayesian Optimization
Watch Bayesian optimization in action on a 1D function. The blue line shows the surrogate model's prediction, the shaded region shows uncertainty, and the orange line shows the acquisition function (Expected Improvement). Click to add samples and observe how the model updates:
How it works: The blue line shows the surrogate model's prediction of the loss function. The shaded area represents uncertainty—wider means less confident. The orange dashed line is the acquisition function (Expected Improvement), which balances exploration (high uncertainty) and exploitation (low predicted loss).
Click "Add Suggested Point" to see Bayesian optimization choose where to sample next. Notice how it focuses on regions where the model predicts low loss OR has high uncertainty.
When to Use Bayesian Optimization
Bayesian optimization pays off when each evaluation is expensive (a full training run) and the search space is modest (roughly up to 20 hyperparameters). When evaluations are cheap or the space is very high-dimensional, random search is often the better first choice.
Sensitivity Analysis
Before tuning all hyperparameters equally, it's valuable to understand which ones actually matter. Sensitivity analysis measures how much the output changes when each hyperparameter varies.
Why Sensitivity Matters
- High sensitivity: Small changes cause large performance differences—tune carefully!
- Low sensitivity: The hyperparameter can be left at a reasonable default
Experiment with the interactive demo below to see how each hyperparameter affects training:
Adjust the hyperparameters below and observe how sensitive the model performance is to each one. Parameters marked as "high sensitivity" need careful tuning.
Tuning Priority
Based on the demo's current sensitivity estimates, tune these hyperparameters first:
1. Dropout (54% relative sensitivity)
2. Hidden units (45% relative sensitivity)
3. Batch size (14% relative sensitivity)
4. Learning rate (10% relative sensitivity)
Search Strategy Comparison
See how different search strategies perform as the number of hyperparameters increases:
Key Observations:
- Grid Search struggles as dimensions increase because the number of points needed grows exponentially (curse of dimensionality).
- Random Search maintains consistent efficiency regardless of dimensionality because it can sample any point in the space.
- Bayesian Optimization uses past results to make smarter choices about where to sample next, often finding better solutions faster.
Try increasing the number of hyperparameters to see how grid search degrades rapidly compared to the other methods.
Practical Workflow
A Recommended Approach
- Start with reasonable defaults: Use published baselines or framework defaults. Don't tune from scratch.
- Get a working baseline: Train once with defaults. This is your reference point.
- Tune learning rate first: Do a log-scale search from 1e-5 to 1e-1. This single hyperparameter often determines success.
- Adjust batch size: Larger batches can use larger learning rates (linear scaling rule). Find the largest batch your GPU can handle.
- Tune regularization: If overfitting, increase dropout/weight decay. If underfitting, decrease them.
- Fine-tune architecture: Only adjust depth/width if you have budget remaining.
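Step 4 mentions the linear scaling rule: when you multiply the batch size by k, multiply the learning rate by k as well. A sketch (the base values below are illustrative, not recommendations):

```python
# Linear scaling rule: lr grows in proportion to batch size.
# BASE_LR and BASE_BS are an illustrative reference point, not recommendations.
BASE_LR, BASE_BS = 0.1, 256

def scaled_lr(batch_size):
    return BASE_LR * batch_size / BASE_BS

for bs in [32, 64, 128, 256, 512]:
    print(f"batch={bs:4d} -> lr={scaled_lr(bs):.4f}")
```

The rule is a heuristic: it works well up to a point, after which large batches typically also need warmup to train stably.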
Premature Optimization
Resist tuning architecture or exotic hyperparameters before the basics work. A wrong learning rate will mask the effect of everything else you change.
Budget Allocation
With limited compute, how should you allocate your budget?
| Budget Size | Recommended Approach |
|---|---|
| Very small (< 10 trials) | Tune learning rate only (log scale) |
| Small (10-50 trials) | Random search over LR, batch size, and 1-2 other hyperparameters |
| Medium (50-200 trials) | Bayesian optimization or multi-fidelity methods |
| Large (200+ trials) | Full Bayesian optimization with early stopping |
PyTorch Implementation
Manual Hyperparameter Search
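A minimal sketch of a manual search loop in PyTorch. The synthetic regression data and toy MLP below are stand-ins for a real dataset and model; `train_model` simply trains once per candidate learning rate and reports validation loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data standing in for a real dataset
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

def train_model(lr, epochs=50):
    """Train a small MLP with the given learning rate; return val loss."""
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X_train), y_train)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return loss_fn(model(X_val), y_val).item()

# Manual search over log-spaced learning rates
results = {lr: train_model(lr) for lr in [1e-4, 1e-3, 1e-2, 1e-1]}
best_lr = min(results, key=results.get)
print(f"Best LR: {best_lr}, val loss: {results[best_lr]:.4f}")
```

This is essentially grid search over one dimension; it works fine for one or two hyperparameters but does not scale, which is where the tools below come in.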
Using Optuna
Optuna is a popular hyperparameter optimization framework that implements efficient search algorithms:
Tools and Frameworks
Several excellent tools exist for hyperparameter tuning. Here's a comparison:
| Tool | Best For | Key Features |
|---|---|---|
| Optuna | General-purpose tuning | Easy API, pruning, visualization, many samplers |
| Ray Tune | Distributed tuning | Scales to clusters, integrates with many frameworks |
| Weights & Biases Sweeps | Experiment tracking + tuning | Beautiful dashboards, easy setup |
| Hyperopt | Bayesian optimization | Mature library, Tree-Parzen Estimator |
| Ax/BoTorch | Advanced BO | Facebook's production-grade Bayesian optimization |
Installing Tools
```shell
# Optuna (recommended for beginners)
pip install optuna optuna-dashboard

# Ray Tune (for distributed tuning)
pip install "ray[tune]"

# Weights & Biases (for experiment tracking + sweeps)
pip install wandb
```

Start Simple
Begin with Optuna on a single machine; it covers random search, TPE, and pruning with a few lines of code. Reach for Ray Tune only when you actually need to distribute trials across a cluster.
Summary
Hyperparameter tuning is a critical skill for training high-performance neural networks. Let's review the key takeaways:
| Concept | Key Point |
|---|---|
| Parameters vs Hyperparameters | Parameters are learned; hyperparameters are set before training |
| Most Important Hyperparameter | Learning rate—wrong values cause divergence or no learning |
| Grid Search | Simple but scales poorly with dimensions (curse of dimensionality) |
| Random Search | More efficient than grid because it samples important dims more densely |
| Bayesian Optimization | Uses past results to decide where to sample; balances exploration/exploitation |
| Sensitivity Analysis | Identifies which hyperparameters matter most for your problem |
| Practical Workflow | Start with defaults, tune learning rate first, use random/Bayesian search |
Common Mistakes to Avoid
- Tuning on test data (contaminates final evaluation)
- Using linear scale for learning rate (use log scale!)
- Tuning architecture before getting learning rate right
- Not using early stopping for expensive evaluations
- Ignoring the validation-test gap (validation performance != test performance)
Knowledge Check
Test your understanding of hyperparameter tuning concepts:
Which of the following is a hyperparameter (not a learned parameter)?
Exercises
Conceptual Questions
- Explain why the number of evaluations needed for grid search grows exponentially with the number of hyperparameters.
- You have a budget of 50 evaluations and 6 hyperparameters to tune. Would you use grid search, random search, or Bayesian optimization? Justify your answer.
- What is the role of the acquisition function in Bayesian optimization? Describe the exploration-exploitation tradeoff.
- Your colleague suggests using test set performance to choose the best hyperparameters. Explain why this is problematic.
Solution Hints
- Q1: With k values per dimension and d dimensions, grid search needs k^d evaluations. Each additional dimension multiplies the number of points by k.
- Q2: Grid search would need at least 2^6 = 64 points for just 2 values per dimension. Random or Bayesian with 50 trials is more practical.
- Q3: The acquisition function balances sampling where the surrogate predicts low loss (exploitation) and where uncertainty is high (exploration).
- Q4: Tuning on test data biases the reported test performance upward. The test set is no longer an unbiased estimate of generalization.
Coding Exercises
- Implement grid search: Write a function that takes a dictionary of hyperparameter lists and evaluates all combinations. Track the best configuration.
- Compare search strategies: Using MNIST or CIFAR-10, compare random search vs Optuna's TPE sampler with 30 trials each. Plot validation loss over trials.
- Early stopping integration: Extend the Optuna example to implement your own pruning strategy based on validation accuracy rather than loss.
- Learning rate finder: Implement the learning rate finder (gradually increase LR while monitoring loss). Find the optimal LR range for a model.
Coding Exercise Hints
- Exercise 1: Use `itertools.product` to generate all combinations from the hyperparameter lists.
- Exercise 2: Create two Optuna studies, one with `RandomSampler` and one with `TPESampler`. Plot `study.trials_dataframe()`.
- Exercise 3: Create a custom pruner by subclassing `optuna.pruners.BasePruner`.
- Exercise 4: Start with a very small LR, increase exponentially each batch, and record loss. Plot LR vs loss to find the "knee".
In the next section, we'll explore debugging neural networks—a practical guide to diagnosing and fixing common training problems, from vanishing gradients to data issues.