Learning Objectives
This section is the culmination of everything we've learned in Chapter 6. By the end, you will be able to:
- Design a complete neural network architecture with input, hidden, and output layers
- Implement the forward pass that transforms inputs into predictions
- Select appropriate loss functions for classification and regression tasks
- Understand how backpropagation computes gradients through the network
- Build a complete training loop that iteratively improves your model
- Debug common issues that arise when training neural networks
Why This Matters: Building a neural network from scratch—even with a framework like PyTorch—solidifies your understanding of how deep learning actually works. This knowledge will be invaluable when debugging models, designing new architectures, or understanding research papers.
The Big Picture
From Perceptrons to Modern Neural Networks
In the previous sections, we traced the evolution from simple perceptrons to multi-layer networks. We learned that:
- Perceptrons can only solve linearly separable problems
- Multi-layer perceptrons (MLPs) with nonlinear activations can approximate any function
- Depth matters—deeper networks can represent complex functions more efficiently
Now it's time to put theory into practice. We'll build a complete neural network that can learn to classify data, starting from random weights and ending with a trained model.
The Neural Network as a Function
At its core, a neural network is a parameterized function that maps inputs to outputs:
ŷ = f_θ(x) = f_L(f_{L−1}(⋯ f_1(x)))

Where θ represents all learnable parameters (weights and biases), and each f_i is a layer transformation. Training means finding the θ that minimizes some loss function on our data.
The Learning Algorithm
The high-level algorithm for training a neural network is surprisingly simple:
- Initialize weights randomly (or with smart initialization)
- Forward pass: Compute predictions from inputs
- Compute loss: Measure how wrong the predictions are
- Backward pass: Compute gradients of loss with respect to all weights
- Update weights: Move weights in the direction that reduces loss
- Repeat steps 2-5 until convergence
| Step | Mathematical Operation | PyTorch |
|---|---|---|
| Forward Pass | ŷ = f_θ(x) | output = model(x) |
| Compute Loss | L = loss(ŷ, y) | loss = criterion(output, y) |
| Backward Pass | ∇_θL | loss.backward() |
| Update Weights | θ ← θ - η∇_θL | optimizer.step() |
Anatomy of a Neural Network
Before writing code, let's understand the components of a neural network. We'll build a simple but complete network for binary classification.
Network Architecture
Our network will have:
- Input layer: 2 neurons (for 2D data points)
- Hidden layer 1: 16 neurons with ReLU activation
- Hidden layer 2: 8 neurons with ReLU activation
- Output layer: 1 neuron with sigmoid activation (for binary classification)
Parameters to Learn
Let's count the learnable parameters:
| Layer | Weights | Biases | Total |
|---|---|---|---|
| Input → Hidden 1 | 2 × 16 = 32 | 16 | 48 |
| Hidden 1 → Hidden 2 | 16 × 8 = 128 | 8 | 136 |
| Hidden 2 → Output | 8 × 1 = 8 | 1 | 9 |
| Total | 168 | 25 | 193 |
Our network has 193 learnable parameters. Despite this small size, it can learn complex decision boundaries—as you'll see in the interactive playground below.
Quick Check
If we added a third hidden layer with 4 neurons between Hidden 2 and Output, how many total parameters would the network have?
The Forward Pass
The forward pass is how data flows through the network from input to output. Each layer applies a linear transformation followed by a nonlinear activation.
Layer-by-Layer Computation
For layer ℓ, the forward pass computes:

z⁽ˡ⁾ = W⁽ˡ⁾ a⁽ˡ⁻¹⁾ + b⁽ˡ⁾
a⁽ˡ⁾ = g⁽ˡ⁾(z⁽ˡ⁾)

Where:
- W⁽ˡ⁾ is the weight matrix of layer ℓ
- b⁽ˡ⁾ is the bias vector of layer ℓ
- a⁽ˡ⁻¹⁾ is the activation from the previous layer (or the input x for ℓ = 1)
- z⁽ˡ⁾ is the pre-activation (before nonlinearity)
- g⁽ˡ⁾ is the activation function (ReLU, sigmoid, etc.)
- a⁽ˡ⁾ is the post-activation output
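As a concrete sketch, one hidden layer's computation looks like this in PyTorch (the random weights here are hypothetical, standing in for trained values):

```python
import torch

torch.manual_seed(0)
x = torch.tensor([0.5, -0.3])   # a⁽⁰⁾: the 2-D input
W1 = torch.randn(16, 2)         # W⁽¹⁾: weight matrix of layer 1
b1 = torch.zeros(16)            # b⁽¹⁾: bias vector of layer 1

z1 = W1 @ x + b1                # pre-activation z⁽¹⁾
a1 = torch.relu(z1)             # post-activation a⁽¹⁾
print(a1.shape)                 # torch.Size([16])
```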
Activation Functions
Different layers use different activation functions:
| Function | Formula | When to Use | Range |
|---|---|---|---|
| ReLU | max(0, z) | Hidden layers (default) | [0, ∞) |
| Sigmoid | 1/(1+e⁻ᶻ) | Binary classification output | (0, 1) |
| Softmax | eᶻⁱ/Σeᶻʲ | Multi-class classification output | (0, 1), sums to 1 |
| Tanh | (eᶻ−e⁻ᶻ)/(eᶻ+e⁻ᶻ) | Hidden layers (sometimes) | (-1, 1) |
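The four activations in the table are all one-liners in PyTorch; a quick sketch to compare their outputs on the same values:

```python
import torch

z = torch.tensor([-2.0, 0.0, 3.0])
print(torch.relu(z))            # tensor([0., 0., 3.]) — negatives zeroed
print(torch.sigmoid(z))         # values squashed into (0, 1)
print(torch.softmax(z, dim=0))  # non-negative values that sum to 1
print(torch.tanh(z))            # values squashed into (-1, 1)
```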
Why ReLU for hidden layers? It is cheap to compute, and its gradient is exactly 1 for positive inputs, so gradients don't shrink as they flow backward through many layers the way they do with sigmoid or tanh.
Interactive: Forward Pass Visualizer
Watch how data flows through a neural network step by step. Adjust the input values and observe how each neuron computes its output.
Forward Pass Step-by-Step
Watch how data flows through a neural network, layer by layer
Input Layer
We start with two input features: x₁ = 0.50 and x₂ = -0.30
Quick Reference
Key observations from the visualization:
- Linear transformation (z = Wx + b) computes weighted sums of inputs
- ReLU activation zeros out negative values, creating the nonlinearity needed to learn complex patterns
- Sigmoid output squashes the final value to a probability between 0 and 1
- The decision boundary (class 0 vs class 1) is determined by whether the output exceeds 0.5
Choosing a Loss Function
The loss function measures how "wrong" our predictions are. It provides the signal that guides learning.
Binary Cross-Entropy Loss
For binary classification (our case), we use binary cross-entropy (BCE):

L(y, ŷ) = −[y log(ŷ) + (1 − y) log(1 − ŷ)]

Where y is the true label and ŷ is the predicted probability.
Intuition Behind BCE
- When y = 1: Loss is −log(ŷ). High if ŷ is low (confident but wrong)
- When y = 0: Loss is −log(1 − ŷ). High if ŷ is high (confident but wrong)
- Perfect predictions (ŷ = y) give loss of 0
| True Label y | Predicted ŷ | Loss | Interpretation |
|---|---|---|---|
| 1 | 0.99 | 0.01 | Correct, confident → low loss |
| 1 | 0.01 | 4.61 | Wrong, confident → high loss |
| 0 | 0.01 | 0.01 | Correct, confident → low loss |
| 0 | 0.99 | 4.61 | Wrong, confident → high loss |
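You can verify the table's numbers directly from the formula (natural log, as PyTorch uses):

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(round(bce(1, 0.99), 2))  # 0.01 — correct, confident
print(round(bce(1, 0.01), 2))  # 4.61 — wrong, confident
print(round(bce(0, 0.01), 2))  # 0.01
print(round(bce(0, 0.99), 2))  # 4.61
```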
Other Loss Functions
| Task | Loss Function | PyTorch | Notes |
|---|---|---|---|
| Binary Classification | Binary Cross-Entropy | nn.BCEWithLogitsLoss() | Combines sigmoid + BCE |
| Multi-class Classification | Cross-Entropy | nn.CrossEntropyLoss() | Combines softmax + CE |
| Regression | Mean Squared Error | nn.MSELoss() | Sensitive to outliers |
| Regression | Mean Absolute Error | nn.L1Loss() | Robust to outliers |
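Each of these loss functions is a callable module; a minimal sketch with made-up predictions and targets:

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0])      # raw scores from the model
targets = torch.tensor([1.0, 0.0])

bce = nn.BCEWithLogitsLoss()(logits, targets)            # sigmoid applied internally
mse = nn.MSELoss()(torch.tensor([2.5]), torch.tensor([3.0]))  # (2.5 − 3.0)² = 0.25
mae = nn.L1Loss()(torch.tensor([2.5]), torch.tensor([3.0]))   # |2.5 − 3.0| = 0.50
print(bce.item(), mse.item(), mae.item())
```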
Use BCEWithLogitsLoss
Prefer nn.BCEWithLogitsLoss() over nn.BCELoss(). It combines the sigmoid activation with BCE loss in a numerically stable way. Your network's output layer should produce logits (raw scores), not probabilities.
Quick Check
For a sample with true label y=1, what happens to the loss as our predicted probability ŷ approaches 0?
The Backward Pass (Backpropagation)
The backward pass computes how much each weight contributed to the loss. This is done using the chain rule of calculus.
The Chain Rule in Action
Consider a simple network with a single weight at each step:

x → z = wx + b → a = g(z) → L

To compute ∂L/∂w, we use the chain rule:

∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w)

Each term in this chain corresponds to a local gradient that can be computed efficiently.
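We can check the chain rule against autograd on a toy one-weight network (the squared-error loss here is a stand-in chosen for illustration):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x, y = torch.tensor(3.0), torch.tensor(1.0)

z = w * x                  # linear step: z = wx
a = torch.sigmoid(z)       # activation: a = g(z)
L = (a - y) ** 2           # toy loss
L.backward()               # autograd applies the chain rule for us

# Manual chain rule: dL/dw = dL/da · da/dz · dz/dw
manual = 2 * (a - y) * a * (1 - a) * x
assert torch.isclose(w.grad, manual)
```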
Backpropagation Through Layers
For layer ℓ, we need to compute the error term δ⁽ˡ⁾ = ∂L/∂z⁽ˡ⁾, which propagates backward as:

δ⁽ˡ⁾ = ((W⁽ˡ⁺¹⁾)ᵀ δ⁽ˡ⁺¹⁾) ⊙ g′(z⁽ˡ⁾)

Where ⊙ is element-wise multiplication and g′ is the derivative of the activation function.
Once we have δ⁽ˡ⁾, the gradients for weights and biases are:

∂L/∂W⁽ˡ⁾ = δ⁽ˡ⁾ (a⁽ˡ⁻¹⁾)ᵀ    ∂L/∂b⁽ˡ⁾ = δ⁽ˡ⁾
PyTorch handles this automatically
When you call loss.backward(), PyTorch automatically computes all gradients using its autograd system. Understanding the math helps you debug and design better architectures.
The Training Loop
The training loop is where learning happens. We repeatedly show the network data, compute how wrong it is, and update the weights to reduce error.
Key Concepts
- Epoch: One complete pass through the entire training dataset
- Batch: A subset of training examples processed together
- Iteration: One weight update (processing one batch)
- Learning rate: Step size for weight updates (typically 0.001 to 0.01)
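These terms map directly onto PyTorch's DataLoader; a small sketch with made-up data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(100, 2)                  # 100 training examples
y = (torch.rand(100, 1) > 0.5).float()   # random binary labels
loader = DataLoader(TensorDataset(X, y), batch_size=25, shuffle=True)

# One epoch = one pass over all batches; processing each batch = one iteration
assert len(loader) == 4                  # 100 examples / 25 per batch
for xb, yb in loader:                    # 4 iterations per epoch
    pass                                 # forward/loss/backward/update goes here
```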
The Basic Training Loop
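A minimal sketch of the loop, with a stand-in model and random data (the real dataset and architecture come in the full implementation later in this section):

```python
import torch
import torch.nn as nn

# Stand-in model, data, loss, and optimizer for illustration
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
X = torch.randn(64, 2)
y = (torch.rand(64, 1) > 0.5).float()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()          # 1. clear gradients from the previous step
    output = model(X)              # 2. forward pass
    loss = criterion(output, y)    # 3. compute loss
    loss.backward()                # 4. backward pass
    optimizer.step()               # 5. update weights
```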
The Most Common Bug
Forgetting optimizer.zero_grad() is the most common training bug. Without it, gradients accumulate across batches, causing erratic training. Always zero gradients at the start of each iteration!
Complete Implementation in PyTorch
Now let's put everything together into a complete, working implementation. We'll train a network to classify points in a spiral pattern.
Step 1: Create the Dataset
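A sketch of a spiral dataset generator (the point count, noise level, and spiral shape parameters here are illustrative choices, not canonical):

```python
import numpy as np
import torch

def make_spiral(n_per_class=200, noise=0.02, seed=0):
    """Two interleaved spirals, one per class — a classic toy dataset."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for label in (0, 1):
        t = np.linspace(0.25, 3.5 * np.pi, n_per_class)  # angle along the spiral
        r = t / (3.5 * np.pi)                            # radius grows with angle
        x1 = r * np.cos(t + label * np.pi) + rng.normal(0, noise, n_per_class)
        x2 = r * np.sin(t + label * np.pi) + rng.normal(0, noise, n_per_class)
        X_parts.append(np.stack([x1, x2], axis=1))
        y_parts.append(np.full((n_per_class, 1), label, dtype=np.float64))
    X = torch.tensor(np.concatenate(X_parts), dtype=torch.float32)
    y = torch.tensor(np.concatenate(y_parts), dtype=torch.float32)  # (N, 1) for BCE
    return X, y

X, y = make_spiral()
print(X.shape, y.shape)   # torch.Size([400, 2]) torch.Size([400, 1])
```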
Step 2: Define the Model
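The architecture described earlier (2 → 16 → 8 → 1) translates directly to nn.Sequential. Note the output layer produces logits — the sigmoid lives inside BCEWithLogitsLoss:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 16),   # input → hidden 1
    nn.ReLU(),
    nn.Linear(16, 8),   # hidden 1 → hidden 2
    nn.ReLU(),
    nn.Linear(8, 1),    # hidden 2 → output (logits)
)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)   # 193, matching the parameter table above
```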
Step 3: Training Function
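A full-batch training function is enough for a dataset this small; a sketch with Adam and BCE-with-logits (hyperparameter defaults here are illustrative):

```python
import torch
import torch.nn as nn

def train(model, X, y, epochs=1000, lr=0.01):
    """Train with full-batch Adam, logging loss and accuracy periodically."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        optimizer.zero_grad()          # 1. clear old gradients
        logits = model(X)              # 2. forward pass
        loss = criterion(logits, y)    # 3. compute loss
        loss.backward()                # 4. backward pass
        optimizer.step()               # 5. update weights
        if (epoch + 1) % 100 == 0:
            preds = (torch.sigmoid(logits) > 0.5).float()
            acc = (preds == y).float().mean().item()
            print(f"epoch {epoch + 1:4d}  loss {loss.item():.4f}  acc {acc:.3f}")
    return model
```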
Step 4: Visualize Results
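One common way to visualize a 2-D classifier is to evaluate it on a grid and color the plane by predicted probability; a sketch using matplotlib (grid resolution and styling are arbitrary choices):

```python
import matplotlib.pyplot as plt
import torch

def plot_decision_boundary(model, X, y, steps=200):
    """Shade the plane by predicted probability and overlay the data points."""
    x_min = float(X[:, 0].min()) - 0.2
    x_max = float(X[:, 0].max()) + 0.2
    y_min = float(X[:, 1].min()) - 0.2
    y_max = float(X[:, 1].max()) + 0.2
    xx, yy = torch.meshgrid(
        torch.linspace(x_min, x_max, steps),
        torch.linspace(y_min, y_max, steps),
        indexing="xy",
    )
    grid = torch.stack([xx.reshape(-1), yy.reshape(-1)], dim=1)
    with torch.no_grad():
        probs = torch.sigmoid(model(grid)).reshape(xx.shape)
    plt.contourf(xx, yy, probs, levels=20, cmap="RdBu", alpha=0.6)
    plt.scatter(X[:, 0], X[:, 1], c=y.squeeze(), cmap="RdBu", edgecolors="k", s=15)
    plt.title("Learned decision boundary")
    plt.show()
```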
Expected Results
Interactive: Neural Network Playground
Now it's time to experiment! Use the interactive playground below to train a neural network in real-time. You can:
- Choose different datasets (spiral, XOR, circular)
- Adjust the learning rate to see how it affects training speed and stability
- Watch the decision boundary evolve as the network learns
- Reset and try again with different random initializations
Neural Network Playground
Train a neural network in real-time
Experiments to Try
- Learning rate too high: Set it to 0.1 and watch the loss oscillate or explode
- Learning rate too low: Set it to 0.001 and see how slowly the boundary forms
- Different datasets: The XOR pattern requires learning curved boundaries
- Multiple runs: Reset several times to see how random initialization affects the final solution
Quick Check
While experimenting with the playground, what happens if you set the learning rate to a very high value (like 0.1)?
The XOR Problem: A Complete Walkthrough
The XOR (exclusive OR) problem is a classic benchmark that demonstrates why neural networks need hidden layers. Let's work through it step by step, from understanding the problem to implementing a solution from scratch.
Why XOR Matters
XOR outputs 1 when exactly one of its inputs is 1, and 0 otherwise:
| x₁ | x₂ | XOR(x₁, x₂) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
If you plot these points, you'll see that no single straight line can separate the 0s from the 1s. The points (0,0) and (1,1) output 0, while (0,1) and (1,0) output 1. They form an "X" pattern that requires a curved or multi-line decision boundary.
Historical Note: In 1969, Minsky and Papert proved that perceptrons cannot solve XOR, which contributed to the first "AI Winter." The solution—hidden layers—was known theoretically but couldn't be trained efficiently until backpropagation was popularized in the 1980s.
The Solution: A Hidden Layer
To solve XOR, we need a network with at least one hidden layer. The minimal architecture is:
- Input layer: 2 neurons (x₁, x₂)
- Hidden layer: 2 neurons with sigmoid activation
- Output layer: 1 neuron with sigmoid activation
The hidden layer transforms the input space so that the outputs become linearly separable. Think of it as creating new features that make the problem easier.
Interactive: XOR Step by Step
Use this interactive visualizer to watch how the network learns XOR. You can:
- Click on any XOR input to see the forward pass computation
- Watch the forward pass animation to see values flow through the network
- Watch the backward pass to see how gradients flow
- Click "Train" to watch the network learn
XOR Step-by-Step Visualizer
Watch how a neural network learns XOR
XOR Truth Table (click to select)
Calculation Details
Forward Pass with Real Numbers
Let's trace through a forward pass for input (expected output: 1). Suppose our trained weights are:
1Hidden layer weights (W₁): Hidden layer biases (b₁):
2 w₁₁ = 5.0, w₁₂ = 5.0 b₁ = -2.5
3 w₂₁ = 5.0, w₂₂ = 5.0 b₂ = -7.5
4
5Output layer weights (W₂): Output layer bias (b₂):
6 v₁ = 10.0, v₂ = -10.0 b_out = -5.0Step 1: Compute Hidden Layer Pre-activations
Step 2: Apply Sigmoid Activation
Step 3: Compute Output
The network outputs 0.970, which is very close to the target of 1. Success!
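To double-check the arithmetic for the input (1, 0) with these weights:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x1, x2 = 1.0, 0.0                            # XOR input (1, 0)
h1 = sigmoid(5.0 * x1 + 5.0 * x2 - 2.5)      # ≈ 0.924
h2 = sigmoid(5.0 * x1 + 5.0 * x2 - 7.5)      # ≈ 0.076
y_hat = sigmoid(10.0 * h1 - 10.0 * h2 - 5.0)
print(f"{y_hat:.3f}")                        # 0.970
```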
Backward Pass: Computing Gradients
Now let's trace through the backward pass to see how gradients are computed. We'll use the Binary Cross-Entropy (BCE) loss:

L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]

Step 1: Gradient at Output
The gradient of BCE loss with respect to the pre-sigmoid output simplifies beautifully:

∂L/∂z_out = ŷ − y = 0.970 − 1 = −0.030
📐 Show derivation: Why does BCE + Sigmoid give us ŷ - y?
Step 2: Gradients for Output Layer Weights
Each output weight's gradient is the output error times the hidden activation it multiplies:

∂L/∂v₁ = (ŷ − y)·h₁    ∂L/∂v₂ = (ŷ − y)·h₂    ∂L/∂b_out = ŷ − y
Step 3: Gradients at Hidden Layer
To propagate gradients backward, we need to find how the loss changes with respect to each hidden neuron's activation:

∂L/∂h₁ = (ŷ − y)·v₁    ∂L/∂h₂ = (ŷ − y)·v₂

📐 Show derivation: How do we get ∂L/∂h using chain rule?
Step 4: Through Sigmoid to Hidden Pre-activations
Now we push the gradient through the sigmoid activation to get gradients w.r.t. the pre-activation values, using σ′(z) = σ(z)(1 − σ(z)):

∂L/∂z₁ = (∂L/∂h₁)·h₁(1 − h₁)    ∂L/∂z₂ = (∂L/∂h₂)·h₂(1 − h₂)

📐 Show derivation: How does gradient flow through sigmoid?
Step 5: Gradients for Hidden Layer Weights
Finally, we compute gradients for the input-to-hidden weights and biases:

∂L/∂wᵢⱼ = (∂L/∂zᵢ)·xⱼ    ∂L/∂bᵢ = ∂L/∂zᵢ

📐 Show derivation: How do we get weight gradients?
XOR From Scratch (NumPy)
Here's a complete implementation that trains a neural network to solve XOR without any deep learning framework:
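A sketch of such an implementation, matching the minimal 2 → 2 → 1 sigmoid architecture (seed, learning rate, and epoch count are illustrative; some random initializations may converge slowly or get stuck):

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR dataset: all four input combinations and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 2 → 2 → 1 network, sigmoid activations everywhere
W1 = rng.normal(0, 1, (2, 2)); b1 = rng.normal(0, 1, (1, 2))
W2 = rng.normal(0, 1, (2, 1)); b2 = rng.normal(0, 1, (1, 1))
lr = 1.0

for epoch in range(10000):
    # --- Forward pass ---
    h = sigmoid(X @ W1 + b1)          # hidden activations
    y_hat = sigmoid(h @ W2 + b2)      # output probability

    # --- BCE loss, averaged over the 4 examples ---
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    # --- Backward pass (BCE + sigmoid: output gradient is ŷ − y) ---
    dz2 = (y_hat - y) / len(X)
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dh = dz2 @ W2.T
    dz1 = dh * h * (1 - h)            # through the hidden sigmoid
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # --- SGD update ---
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if (epoch + 1) % 2000 == 0:
        print(f"epoch {epoch + 1:5d}  loss {loss:.4f}")

print("predictions:", y_hat.round(3).ravel())
```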
Expected Output
With a successful run, the printed loss falls steadily toward zero, and the final predictions are close to [0, 1, 1, 0] — the XOR truth table.
Key Insights from XOR
- Hidden layers enable nonlinear decision boundaries. The hidden layer transforms the input space so that XOR outputs become linearly separable.
- Backpropagation distributes credit. Even though the output error is just one number, backprop figures out how each weight contributed and adjusts accordingly.
- BCE + Sigmoid = elegant gradients. The derivative simplifies to ŷ − y, making implementation clean.
- XOR is a "sanity check" for networks. If your implementation can't learn XOR, something is fundamentally broken.
Quick Check
Why does a single-layer perceptron fail to learn XOR?
Debugging Your Neural Network
Training neural networks can be tricky. Here are common issues and how to diagnose them:
Problem: Loss Not Decreasing
| Symptom | Likely Cause | Solution |
|---|---|---|
| Loss stays constant | Learning rate too small | Increase lr by 10x |
| Loss increases | Learning rate too high | Decrease lr by 10x |
| Loss oscillates wildly | Learning rate too high | Decrease lr |
| Loss stuck after initial drop | Local minimum or plateau | Try different initialization, add momentum |
Problem: Training Loss Good, Test Loss Bad (Overfitting)
- Add regularization (L2 weight decay, dropout)
- Reduce model capacity (fewer layers or neurons)
- Get more training data or use data augmentation
- Use early stopping based on validation loss
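The first two fixes can be sketched in a couple of lines each — dropout as a layer, and L2 weight decay as an optimizer argument (layer sizes and hyperparameter values here are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 16),
    nn.ReLU(),
    nn.Dropout(p=0.2),   # randomly zeroes 20% of activations during training
    nn.Linear(16, 1),
)
# L2 weight decay is built into most PyTorch optimizers
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)
```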
Problem: Accuracy Stuck at 50% (Binary Classification)
This usually means the network is outputting the same prediction for all inputs:
- Check that your labels are correct (0s and 1s, not 1s and 2s)
- Verify the loss function matches your output (BCEWithLogitsLoss for sigmoid output)
- Make sure data is shuffled properly
- Try better weight initialization (PyTorch defaults are usually fine)
The Debugging Checklist
Practical Checklist
Before training any neural network, run through this checklist:
Data Preparation
- ☐ Data is normalized (typically mean 0, std 1)
- ☐ Labels are in the correct format (0/1 for binary, one-hot for multi-class)
- ☐ Training and validation sets are properly split
- ☐ Data is shuffled before training
Model Setup
- ☐ Loss function matches the task (BCE for binary, CE for multi-class, MSE for regression)
- ☐ Output layer matches the loss (no sigmoid if using BCEWithLogitsLoss)
- ☐ Model parameters are counted and reasonable for the data size
- ☐ Model is on the correct device (CPU or GPU)
Training Loop
- ☐ optimizer.zero_grad() is called each iteration
- ☐ model.train() is set during training
- ☐ model.eval() is set during validation
- ☐ Loss and accuracy are being tracked
Summary
Congratulations! You've built your first neural network from scratch. Let's review what you've learned:
| Concept | Key Point | PyTorch |
|---|---|---|
| Architecture | Input → Hidden (ReLU) → Output (Sigmoid) | nn.Sequential(nn.Linear, nn.ReLU, ...) |
| Forward Pass | z = Wx + b, then a = g(z) | output = model(x) |
| Loss Function | BCE measures prediction error | nn.BCEWithLogitsLoss() |
| Backward Pass | Chain rule computes all gradients | loss.backward() |
| Weight Update | θ = θ - η∇L | optimizer.step() |
| Training Loop | Repeat: forward → loss → backward → update | The complete training loop |
Key Takeaways
- Neural networks are parameterized functions that learn to map inputs to outputs through gradient-based optimization
- The forward pass computes predictions by applying linear transformations and nonlinear activations layer by layer
- The loss function measures how wrong predictions are; BCE for classification, MSE for regression
- Backpropagation efficiently computes gradients using the chain rule
- The training loop follows a sacred order: zero gradients → forward → loss → backward → update
- Debugging requires systematically checking data, model outputs, loss values, and gradients
Exercises
Conceptual Questions
- Explain why we need nonlinear activation functions. What would happen if we used only linear layers?
- Why do we zero the gradients at the start of each iteration instead of after the weight update?
- What is the difference between nn.BCELoss and nn.BCEWithLogitsLoss? When would you use each?
- If you double the learning rate, how does this affect the size of each weight update?
Solution Hints
- Q1: Without nonlinearity, the entire network collapses to a single linear transformation (matrix multiplication). No matter how many layers, f(x) = W₃(W₂(W₁x)) = Wx.
- Q2: PyTorch accumulates gradients. Zeroing before backward() ensures we only have gradients from the current batch.
- Q3: BCELoss expects probabilities (after sigmoid), BCEWithLogitsLoss expects logits (before sigmoid). The latter is more numerically stable.
- Q4: The weight update is θ = θ - lr × ∇L. Doubling lr doubles the step size.
Coding Exercises
- Modify the architecture: Add a third hidden layer with 4 neurons. How does this affect training speed and final accuracy?
- Try different optimizers: Replace Adam with SGD (with and without momentum). Compare convergence.
- Implement early stopping: Track validation loss and stop training when it starts increasing.
- Add regularization: Implement L2 weight decay using the optimizer's weight_decay parameter.
Challenge Exercise
Implement a neural network from scratch without PyTorch's autograd. Create a two-layer network (input → hidden → output) and manually implement:
- Forward pass with ReLU and sigmoid
- Backward pass computing gradients for all weights
- Weight update using SGD
Train it on the spiral dataset and verify it achieves similar accuracy to the PyTorch version.
You've now mastered the fundamentals of building neural networks! In the next chapter, we'll explore data loading and processing—essential skills for working with real-world datasets.