Chapter 6

Building Your First Neural Network

From Perceptrons to Deep Networks

Learning Objectives

This section is the culmination of everything we've learned in Chapter 6. By the end, you will be able to:

  1. Design a complete neural network architecture with input, hidden, and output layers
  2. Implement the forward pass that transforms inputs into predictions
  3. Select appropriate loss functions for classification and regression tasks
  4. Understand how backpropagation computes gradients through the network
  5. Build a complete training loop that iteratively improves your model
  6. Debug common issues that arise when training neural networks
Why This Matters: Building a neural network from scratch—even with a framework like PyTorch—solidifies your understanding of how deep learning actually works. This knowledge will be invaluable when debugging models, designing new architectures, or understanding research papers.

The Big Picture

From Perceptrons to Modern Neural Networks

In the previous sections, we traced the evolution from simple perceptrons to multi-layer networks. We learned that:

  • Perceptrons can only solve linearly separable problems
  • Multi-layer perceptrons (MLPs) with nonlinear activations can approximate any function
  • Depth matters—deeper networks can represent complex functions more efficiently

Now it's time to put theory into practice. We'll build a complete neural network that can learn to classify data, starting from random weights and ending with a trained model.

The Neural Network as a Function

At its core, a neural network is a parameterized function $f_\theta$ that maps inputs to outputs:

$$\hat{y} = f_\theta(x) = f^{(L)}(f^{(L-1)}(\cdots f^{(1)}(x)))$$

where $\theta$ represents all learnable parameters (weights and biases), and each $f^{(l)}$ is a layer transformation. Training means finding $\theta^*$ that minimizes some loss function on our data.

The Learning Algorithm

The high-level algorithm for training a neural network is surprisingly simple:

  1. Initialize weights randomly (or with smart initialization)
  2. Forward pass: Compute predictions from inputs
  3. Compute loss: Measure how wrong the predictions are
  4. Backward pass: Compute gradients of loss with respect to all weights
  5. Update weights: Move weights in the direction that reduces loss
  6. Repeat steps 2-5 until convergence
| Step | Mathematical Operation | PyTorch |
|---|---|---|
| Forward pass | $\hat{y} = f_\theta(x)$ | `output = model(x)` |
| Compute loss | $L = \text{loss}(\hat{y}, y)$ | `loss = criterion(output, y)` |
| Backward pass | $\nabla_\theta L$ | `loss.backward()` |
| Update weights | $\theta \leftarrow \theta - \eta \nabla_\theta L$ | `optimizer.step()` |

Anatomy of a Neural Network

Before writing code, let's understand the components of a neural network. We'll build a simple but complete network for binary classification.

Network Architecture

Our network will have:

  • Input layer: 2 neurons (for 2D data points)
  • Hidden layer 1: 16 neurons with ReLU activation
  • Hidden layer 2: 8 neurons with ReLU activation
  • Output layer: 1 neuron with sigmoid activation (for binary classification)
$$\text{Input}_{2} \xrightarrow{\text{Linear}} \text{Hidden}_{16} \xrightarrow{\text{ReLU}} \text{Hidden}_{8} \xrightarrow{\text{ReLU}} \text{Output}_{1} \xrightarrow{\sigma} \hat{y}$$

Parameters to Learn

Let's count the learnable parameters:

| Layer | Weights | Biases | Total |
|---|---|---|---|
| Input → Hidden 1 | 2 × 16 = 32 | 16 | 48 |
| Hidden 1 → Hidden 2 | 16 × 8 = 128 | 8 | 136 |
| Hidden 2 → Output | 8 × 1 = 8 | 1 | 9 |
| Total | 168 | 25 | 193 |

Our network has 193 learnable parameters. Despite this small size, it can learn complex decision boundaries—as you'll see in the interactive playground below.
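As a sanity check, the table's arithmetic can be reproduced in a few lines of Python (`count_params` is a small helper written here for illustration):

```python
def count_params(layer_sizes):
    """Total weights + biases in a fully connected network."""
    return sum(n_in * n_out + n_out          # weight matrix + bias vector
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_params([2, 16, 8, 1]))  # 193
```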

Quick Check

If we added a third hidden layer with 4 neurons between Hidden 2 and Output, how many total parameters would the network have?


The Forward Pass

The forward pass is how data flows through the network from input to output. Each layer applies a linear transformation followed by a nonlinear activation.

Layer-by-Layer Computation

For layer $l$, the forward pass computes:

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$$
$$a^{(l)} = g(z^{(l)})$$

Where:

  • $W^{(l)}$ is the weight matrix of layer $l$
  • $b^{(l)}$ is the bias vector of layer $l$
  • $a^{(l-1)}$ is the activation from the previous layer (or the input $x$ for $l=1$)
  • $z^{(l)}$ is the pre-activation (before the nonlinearity)
  • $g(\cdot)$ is the activation function (ReLU, sigmoid, etc.)
  • $a^{(l)}$ is the post-activation output
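A minimal NumPy sketch of these two equations, using made-up random weights with the shapes of our first hidden layer:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def layer_forward(a_prev, W, b, g):
    """One layer: z = W a + b, then a = g(z)."""
    z = W @ a_prev + b
    return g(z)

rng = np.random.default_rng(0)
x = np.array([0.5, -0.3])                        # 2D input
W1, b1 = rng.normal(size=(16, 2)), np.zeros(16)  # illustrative weights
a1 = layer_forward(x, W1, b1, relu)

print(a1.shape)         # (16,)
print((a1 >= 0).all())  # True — ReLU output is never negative
```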

Activation Functions

Different layers use different activation functions:

| Function | Formula | When to Use | Range |
|---|---|---|---|
| ReLU | max(0, z) | Hidden layers (default) | [0, ∞) |
| Sigmoid | 1/(1+e⁻ᶻ) | Binary classification output | (0, 1) |
| Softmax | eᶻⁱ/Σⱼeᶻʲ | Multi-class classification output | (0, 1), sums to 1 |
| Tanh | (eᶻ−e⁻ᶻ)/(eᶻ+e⁻ᶻ) | Hidden layers (sometimes) | (−1, 1) |

Why ReLU for hidden layers?

ReLU is the default choice for hidden layers because: (1) it's computationally efficient, (2) it doesn't saturate for positive values, avoiding the vanishing gradient problem, and (3) it induces sparsity (many neurons output 0), which can help generalization.
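For concreteness, here are short NumPy versions of the four activations from the table (the max-subtraction in softmax is the standard numerical-stability trick):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    e = np.exp(z - z.max())  # subtract max so exp never overflows
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))                            # [0. 0. 3.]
print(sigmoid(0.0))                       # 0.5
print(round(float(softmax(z).sum()), 6))  # 1.0
```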

Interactive: Forward Pass Visualizer

Watch how data flows through a neural network step by step. Adjust the input values and observe how each neuron computes its output.


Key observations from the visualization:

  1. Linear transformation ($z = Wx + b$) computes weighted sums of inputs
  2. ReLU activation zeros out negative values, creating the nonlinearity needed to learn complex patterns
  3. Sigmoid output squashes the final value to a probability between 0 and 1
  4. The decision boundary (class 0 vs class 1) is determined by whether the output exceeds 0.5

Choosing a Loss Function

The loss function measures how "wrong" our predictions are. It provides the signal that guides learning.

Binary Cross-Entropy Loss

For binary classification (our case), we use binary cross-entropy (BCE):

$$\mathcal{L}(\hat{y}, y) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$$

where $y_i \in \{0, 1\}$ is the true label and $\hat{y}_i \in (0, 1)$ is the predicted probability.

Intuition Behind BCE

  • When $y=1$: loss is $-\log(\hat{y})$, which is high if $\hat{y}$ is low (confident but wrong)
  • When $y=0$: loss is $-\log(1-\hat{y})$, which is high if $\hat{y}$ is high (confident but wrong)
  • Perfect predictions ($\hat{y}=y$) give a loss of 0
| True Label y | Predicted ŷ | Loss | Interpretation |
|---|---|---|---|
| 1 | 0.99 | 0.01 | Correct, confident → low loss |
| 1 | 0.01 | 4.61 | Wrong, confident → high loss |
| 0 | 0.01 | 0.01 | Correct, confident → low loss |
| 0 | 0.99 | 4.61 | Wrong, confident → high loss |
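You can reproduce these loss values directly from the BCE formula:

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

for y, y_hat in [(1, 0.99), (1, 0.01), (0, 0.01), (0, 0.99)]:
    print(f"y={y}, ŷ={y_hat}: loss={bce(y, y_hat):.2f}")  # matches the table
```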

Other Loss Functions

| Task | Loss Function | PyTorch | Notes |
|---|---|---|---|
| Binary classification | Binary cross-entropy | nn.BCEWithLogitsLoss() | Combines sigmoid + BCE |
| Multi-class classification | Cross-entropy | nn.CrossEntropyLoss() | Combines softmax + CE |
| Regression | Mean squared error | nn.MSELoss() | Sensitive to outliers |
| Regression | Mean absolute error | nn.L1Loss() | Robust to outliers |

Use BCEWithLogitsLoss

In PyTorch, prefer nn.BCEWithLogitsLoss() over nn.BCELoss(). It combines the sigmoid activation with BCE loss in a numerically stable way. Your network's output layer should produce logits (raw scores), not probabilities.
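To see why the fused version matters, here is a small NumPy illustration: the naive sigmoid-then-log computation blows up for large logits, while the log-sum-exp form (the same kind of trick BCEWithLogitsLoss uses internally, written out by hand here) stays finite:

```python
import numpy as np

def bce_naive(logit, y):
    """Sigmoid, then log: breaks down for large |logit|."""
    p = 1.0 / (1.0 + np.exp(-logit))
    with np.errstate(divide='ignore'):
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_stable(logit, y):
    """Stable form: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(logit, 0) - logit * y + np.log1p(np.exp(-abs(logit)))

print(bce_naive(40.0, 0))   # inf — 1 - sigmoid(40) rounds to 0, log(0) explodes
print(bce_stable(40.0, 0))  # 40.0
```

For moderate logits the two forms agree to machine precision; only the extreme cases differ.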

Quick Check

For a sample with true label y=1, what happens to the loss as our predicted probability ŷ approaches 0?


The Backward Pass (Backpropagation)

The backward pass computes how much each weight contributed to the loss. This is done using the chain rule of calculus.

The Chain Rule in Action

Consider a simple network: $x \xrightarrow{W_1} h \xrightarrow{W_2} \hat{y} \rightarrow L$

To compute $\frac{\partial L}{\partial W_1}$, we use the chain rule:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial W_1}$$

Each term in this chain corresponds to a local gradient that can be computed efficiently.

Backpropagation Through Layers

For layer $l$, we need to compute:

$$\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot g'(z^{(l)})$$

where $\odot$ is element-wise multiplication and $g'$ is the derivative of the activation function.

Once we have $\delta^{(l)}$, the gradients for the weights and biases are:

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T$$
$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$

PyTorch handles this automatically

You don't need to implement backpropagation manually! When you call loss.backward(), PyTorch automatically computes all gradients using its autograd system. Understanding the math helps you debug and design better architectures.
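Still, it's worth convincing yourself the formulas are right. The sketch below applies the δ-formulas above to a tiny random one-hidden-layer network (with a toy loss $L = z^2/2$, chosen only to keep the example short) and checks one weight gradient against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                        # input a^(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def forward(W1):
    z1 = W1 @ x + b1                          # pre-activation, layer 1
    a1 = np.maximum(0, z1)                    # ReLU
    z2 = W2 @ a1 + b2                         # scalar output
    return z1, a1, 0.5 * (z2 ** 2).sum()      # toy loss L = z2²/2

z1, a1, L = forward(W1)

# Manual backprop using the delta formulas
z2 = W2 @ a1 + b2
delta2 = z2                                   # dL/dz2 for L = z2²/2
delta1 = (W2.T @ delta2) * (z1 > 0)           # (Wᵀ δ) ⊙ g'(z), ReLU' = 1[z>0]
dW1 = np.outer(delta1, x)                     # δ · (a^(0))ᵀ

# Central finite-difference check on one entry
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (forward(Wp)[2] - forward(Wm)[2]) / (2 * eps)
print(abs(numeric - dW1[0, 0]) < 1e-5)        # True
```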

The Training Loop

The training loop is where learning happens. We repeatedly show the network data, compute how wrong it is, and update the weights to reduce error.

Key Concepts

  • Epoch: One complete pass through the entire training dataset
  • Batch: A subset of training examples processed together
  • Iteration: One weight update (processing one batch)
  • Learning rate: Step size for weight updates (typically 0.001 to 0.01)
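A quick arithmetic example tying these terms together, with hypothetical numbers (1,000 samples, batch size 32):

```python
import math

n_samples, batch_size, num_epochs = 1000, 32, 100  # illustrative values

iterations_per_epoch = math.ceil(n_samples / batch_size)  # last batch may be partial
total_updates = iterations_per_epoch * num_epochs

print(iterations_per_epoch)  # 32 iterations (weight updates) per epoch
print(total_updates)         # 3200 updates over the whole run
```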

The Basic Training Loop

The Complete Training Loop
🐍training_loop.py
Key points, step by step:

  • Optimizer creation: Adam adaptively adjusts the learning rate for each parameter. It's the most common choice and works well out of the box (a common alternative: optim.SGD(model.parameters(), lr=0.01)).
  • Training mode: model.train() enables training-specific behaviors like dropout. Always call it before training iterations.
  • Zero gradients (critical): PyTorch accumulates gradients by default. We must zero them before each batch, or gradients from previous batches will pollute the current update — without this, gradients keep growing and training fails.
  • Forward pass: run the input through the network to get predictions. PyTorch automatically tracks all operations for backprop.
  • Compute loss: compare predictions to the true labels with the loss function. This scalar measures "how wrong" we are.
  • Backward pass: compute gradients of the loss with respect to all parameters; they are stored in each parameter's .grad attribute.
  • Update weights: apply the optimizer's update rule. For SGD: θ ← θ − lr × ∇L; Adam uses adaptive learning rates.
  • Track progress: loss.item() extracts the Python number from the loss tensor; we accumulate it for averaging.
import torch
import torch.nn as nn
import torch.optim as optim

# Assume we have: model, train_loader, criterion (loss fn)

# Choose an optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    model.train()  # Set to training mode
    total_loss = 0

    for batch_x, batch_y in train_loader:
        # 1. Zero the gradients (IMPORTANT!)
        optimizer.zero_grad()

        # 2. Forward pass
        outputs = model(batch_x)

        # 3. Compute loss
        loss = criterion(outputs, batch_y)

        # 4. Backward pass (compute gradients)
        loss.backward()

        # 5. Update weights
        optimizer.step()

        total_loss += loss.item()

    # Print progress
    avg_loss = total_loss / len(train_loader)
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

The Most Common Bug

Forgetting optimizer.zero_grad() is the most common training bug. Without it, gradients accumulate across batches, causing erratic training. Always zero gradients at the start of each iteration!

Complete Implementation in PyTorch

Now let's put everything together into a complete, working implementation. We'll train a network to classify points in a spiral pattern.

Step 1: Create the Dataset

Creating a Spiral Dataset
🐍create_data.py
Key points:

  • Spiral dataset: two interleaved spirals — a classic linearly inseparable problem that requires a nonlinear classifier.
  • Class 0 spiral: points placed along a spiral path (theta runs from 0 to 4π, two full rotations), with Gaussian noise added to make it more realistic and challenging.
  • Class 1 spiral: the same spiral rotated by π radians (180°), creating the interlocking pattern.
  • Stack features: the x and y coordinates are combined into a feature matrix of shape (N, 2).
  • Shuffle: randomly shuffling mixes the classes, which is crucial for training stability with mini-batches.
  • Convert to tensors: PyTorch requires torch.Tensor inputs; labels need shape (N, 1) for BCEWithLogitsLoss.
import torch
import numpy as np
import matplotlib.pyplot as plt

def create_spiral_data(n_points=100, noise=0.2):
    """Create a two-class spiral dataset."""
    np.random.seed(42)

    # Class 0: one arm of spiral
    theta0 = np.linspace(0, 4*np.pi, n_points)
    r0 = np.linspace(0.5, 2, n_points)
    x0 = r0 * np.cos(theta0) + np.random.randn(n_points) * noise
    y0 = r0 * np.sin(theta0) + np.random.randn(n_points) * noise

    # Class 1: other arm (rotated by pi)
    theta1 = np.linspace(0, 4*np.pi, n_points)
    r1 = np.linspace(0.5, 2, n_points)
    x1 = r1 * np.cos(theta1 + np.pi) + np.random.randn(n_points) * noise
    y1 = r1 * np.sin(theta1 + np.pi) + np.random.randn(n_points) * noise

    # Combine into dataset
    X = np.column_stack([
        np.concatenate([x0, x1]),
        np.concatenate([y0, y1])
    ])
    y = np.concatenate([np.zeros(n_points), np.ones(n_points)])

    # Shuffle
    idx = np.random.permutation(len(y))
    X, y = X[idx], y[idx]

    return torch.FloatTensor(X), torch.FloatTensor(y).unsqueeze(1)

# Create dataset
X_train, y_train = create_spiral_data(n_points=200)
print(f"Dataset shape: X={X_train.shape}, y={y_train.shape}")

Step 2: Define the Model

Defining the Neural Network Model
🐍model.py
Key points:

  • Inheriting from nn.Module: all PyTorch models inherit from nn.Module, which provides parameter tracking, GPU support, and many utilities.
  • Flexible architecture: passing hidden_sizes as a list makes it easy to build networks of different depths and widths (e.g. hidden_sizes=[32, 16, 8] for three hidden layers).
  • Dynamic layer building: looping over the hidden sizes to create Linear + ReLU pairs is cleaner than hardcoding each layer.
  • No output activation: we don't apply sigmoid here because BCEWithLogitsLoss expects raw logits; it applies sigmoid internally for numerical stability.
  • nn.Sequential: bundles the layers into a single callable; input flows through them in order (Linear → ReLU → Linear → ReLU → Linear).
  • forward method: defines how data flows through the network; PyTorch calls it when you write model(x).
  • predict method: for inference we apply sigmoid and threshold at 0.5, inside torch.no_grad() to disable gradient tracking for efficiency.
import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):
    """A simple feedforward neural network for binary classification."""

    def __init__(self, input_size=2, hidden_sizes=[16, 8], output_size=1):
        super().__init__()

        # Build layers dynamically
        layers = []
        prev_size = input_size

        # Hidden layers with ReLU
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size

        # Output layer (no activation - handled by loss function)
        layers.append(nn.Linear(prev_size, output_size))

        # Combine into Sequential
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        """Forward pass through the network."""
        return self.network(x)

    def predict(self, x):
        """Get class predictions (0 or 1)."""
        with torch.no_grad():
            logits = self.forward(x)
            probs = torch.sigmoid(logits)
            return (probs > 0.5).float()

# Create model and count parameters
model = NeuralNetwork()
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params}")  # 193

Step 3: Training Function

Training Function
🐍train.py
Key points:

  • BCEWithLogitsLoss: combines sigmoid + BCE in one numerically stable function and works with the raw logits from our model.
  • Adam optimizer: the go-to optimizer; it adapts the learning rate per parameter and includes momentum. Try lr=0.001 if training is unstable, or lr=0.1 if it's too slow.
  • Full-batch training: we pass all the data at once (full-batch gradient descent); for large datasets, use mini-batches with a DataLoader.
  • Gradient computation order: zero_grad() → backward() → step() is the required order — changing it breaks training.
  • Accuracy: computed inside a no_grad() block for efficiency: logits → probabilities → predictions → compare to labels.
  • History: storing the loss and accuracy each epoch helps visualize training progress and diagnose issues.
import torch
import torch.nn as nn
import torch.optim as optim

def train_model(model, X_train, y_train, epochs=500, lr=0.01):
    """Train the neural network."""

    # Setup
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    # Track history for plotting
    history = {'loss': [], 'accuracy': []}

    for epoch in range(epochs):
        model.train()

        # Forward pass
        outputs = model(X_train)
        loss = criterion(outputs, y_train)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Track metrics
        with torch.no_grad():
            predictions = (torch.sigmoid(outputs) > 0.5).float()
            accuracy = (predictions == y_train).float().mean()

        history['loss'].append(loss.item())
        history['accuracy'].append(accuracy.item())

        # Print progress
        if (epoch + 1) % 100 == 0:
            print(f"Epoch {epoch+1}/{epochs} | "
                  f"Loss: {loss.item():.4f} | "
                  f"Accuracy: {accuracy.item():.2%}")

    return history

# Train the model
history = train_model(model, X_train, y_train, epochs=500, lr=0.01)

print(f"\nFinal accuracy: {history['accuracy'][-1]:.2%}")

Step 4: Visualize Results

🐍visualize.py
Key points:

  • Mesh grid: a 100×100 grid over the data region (with 0.5 padding) gives a smooth visualization.
  • Predictions: the model is evaluated on every grid point inside no_grad(), and sigmoid converts the logits to probabilities in [0, 1].
  • Plot: filled contours show the predicted probability surface, a colorbar gives the scale, and the training points are overlaid, colored by class.
import numpy as np
import torch
import matplotlib.pyplot as plt

def plot_decision_boundary(model, X, y, title="Decision Boundary"):
    """Plot the decision boundary learned by the model."""

    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))

    # Get predictions for each point in the grid
    grid_points = torch.FloatTensor(np.c_[xx.ravel(), yy.ravel()])
    with torch.no_grad():
        Z = torch.sigmoid(model(grid_points)).numpy()
    Z = Z.reshape(xx.shape)

    # Plot
    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, levels=np.linspace(0, 1, 11),
                 cmap='RdBu_r', alpha=0.6)
    plt.colorbar(label='P(Class 1)')

    # Plot training points
    plt.scatter(X[:, 0], X[:, 1], c=y.flatten(),
                cmap='RdBu_r', edgecolors='k', s=40)

    plt.xlabel('$x_1$')
    plt.ylabel('$x_2$')
    plt.title(title)
    plt.show()

# Visualize the learned boundary
plot_decision_boundary(model, X_train.numpy(), y_train.numpy(),
                       title="Neural Network Decision Boundary")

Expected Results

After 500 epochs, you should see: (1) Loss decreasing from ~0.7 to ~0.1, (2) Accuracy increasing from ~50% to ~95%+, and (3) A smooth, nonlinear decision boundary separating the two spirals.

Interactive: Neural Network Playground

Now it's time to experiment! Use the interactive playground below to train a neural network in real-time. You can:

  • Choose different datasets (spiral, XOR, circular)
  • Adjust the learning rate to see how it affects training speed and stability
  • Watch the decision boundary evolve as the network learns
  • Reset and try again with different random initializations


Experiments to Try

  1. Learning rate too high: Set it to 0.1 and watch the loss oscillate or explode
  2. Learning rate too low: Set it to 0.001 and see how slowly the boundary forms
  3. Different datasets: The XOR pattern requires learning curved boundaries
  4. Multiple runs: Reset several times to see how random initialization affects the final solution

Quick Check

While experimenting with the playground, what happens if you set the learning rate to a very high value (like 0.1)?


The XOR Problem: A Complete Walkthrough

The XOR (exclusive OR) problem is a classic benchmark that demonstrates why neural networks need hidden layers. Let's work through it step by step, from understanding the problem to implementing a solution from scratch.

Why XOR Matters

XOR outputs 1 when exactly one of its inputs is 1, and 0 otherwise:

| x₁ | x₂ | XOR(x₁, x₂) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

If you plot these points, you'll see that no single straight line can separate the 0s from the 1s. The points (0,0) and (1,1) output 0, while (0,1) and (1,0) output 1. They form an "X" pattern that requires a curved or multi-line decision boundary.
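A quick brute-force check makes this concrete: even after trying 100,000 random lines, no linear classifier gets more than 3 of the 4 XOR points right (this samples randomly rather than proving impossibility, but the pattern is clear):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

rng = np.random.default_rng(0)
best = 0.0
for _ in range(100_000):
    w, b = rng.normal(size=2), rng.normal()   # a random line w·x + b = 0
    pred = (X @ w + b > 0).astype(int)
    best = max(best, (pred == y).mean())

print(best)  # 0.75 — three of four is the best any single line can do
```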

Historical Note: In 1969, Minsky and Papert proved that perceptrons cannot solve XOR, which contributed to the first "AI Winter." The solution—hidden layers—was known theoretically but couldn't be trained efficiently until backpropagation was popularized in the 1980s.

The Solution: A Hidden Layer

To solve XOR, we need a network with at least one hidden layer. The minimal architecture is:

  • Input layer: 2 neurons (x₁, x₂)
  • Hidden layer: 2 neurons with sigmoid activation
  • Output layer: 1 neuron with sigmoid activation

The hidden layer transforms the input space so that the outputs become linearly separable. Think of it as creating new features that make the problem easier.

Interactive: XOR Step by Step

Use this interactive visualizer to watch how the network learns XOR. You can:

  • Click on any XOR input to see the forward pass computation
  • Watch the forward pass animation to see values flow through the network
  • Watch the backward pass to see how gradients flow
  • Click "Train" to watch the network learn


Forward Pass with Real Numbers

Let's trace through a forward pass for input $x = [1, 0]$ (expected output: 1). Suppose our trained weights are:

📝trained_weights.txt
Hidden layer weights (W₁):     Hidden layer biases (b₁):
  w₁₁ = 5.0,  w₁₂ = 5.0          b₁ = -2.5
  w₂₁ = 5.0,  w₂₂ = 5.0          b₂ = -7.5

Output layer weights (W₂):     Output layer bias:
  v₁ = 10.0,  v₂ = -10.0          b_out = -5.0

Step 1: Compute Hidden Layer Pre-activations

$$z_1 = x_1 \cdot w_{11} + x_2 \cdot w_{12} + b_1 = 1 \cdot 5 + 0 \cdot 5 + (-2.5) = 2.5$$
$$z_2 = x_1 \cdot w_{21} + x_2 \cdot w_{22} + b_2 = 1 \cdot 5 + 0 \cdot 5 + (-7.5) = -2.5$$

Step 2: Apply Sigmoid Activation

$$h_1 = \sigma(z_1) = \sigma(2.5) = \frac{1}{1 + e^{-2.5}} \approx 0.924$$
$$h_2 = \sigma(z_2) = \sigma(-2.5) = \frac{1}{1 + e^{2.5}} \approx 0.076$$

Step 3: Compute Output

$$z_{out} = h_1 \cdot v_1 + h_2 \cdot v_2 + b_{out} = 0.924 \cdot 10 + 0.076 \cdot (-10) + (-5) = 3.48$$
$$\hat{y} = \sigma(z_{out}) = \sigma(3.48) \approx 0.970$$

The network outputs 0.970, which is very close to the target of 1. Success!
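You can replay this computation — and the other three XOR inputs — with a few lines of NumPy using the same hand-picked weights:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W1 = np.array([[5.0, 5.0],
               [5.0, 5.0]])
b1 = np.array([-2.5, -7.5])
v = np.array([10.0, -10.0])
b_out = -5.0

outputs = {}
for x, target in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    h = sigmoid(W1 @ np.array(x) + b1)          # hidden activations
    y_hat = float(sigmoid(v @ h + b_out))       # output probability
    outputs[x] = y_hat
    print(f"{x} -> {y_hat:.3f}  (target {target})")
# (0, 0) -> 0.014  (0, 1) -> 0.970  (1, 0) -> 0.970  (1, 1) -> 0.014
```

Thresholding at 0.5 classifies all four inputs correctly.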

Backward Pass: Computing Gradients

Now let's trace through the backward pass to see how gradients are computed. We'll use the Binary Cross-Entropy (BCE) loss:

$$\mathcal{L} = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$

Step 1: Gradient at Output

The gradient of BCE loss with respect to the output (before sigmoid) simplifies beautifully:

$$\frac{\partial \mathcal{L}}{\partial z_{out}} = \hat{y} - y = 0.970 - 1 = -0.030$$
📐 Show derivation: Why does BCE + sigmoid give us ŷ − y?
Given: $\mathcal{L} = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$ and $\hat{y} = \sigma(z_{out})$
Step 1a: $\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}$
Step 1b: $\frac{\partial \hat{y}}{\partial z_{out}} = \hat{y}(1-\hat{y})$ (sigmoid derivative)
Step 1c: $\frac{\partial \mathcal{L}}{\partial z_{out}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \cdot \hat{y}(1-\hat{y}) = \hat{y} - y$ ✓ The terms cancel!

Step 2: Gradients for Output Layer Weights

$$\frac{\partial \mathcal{L}}{\partial v_1} = \frac{\partial \mathcal{L}}{\partial z_{out}} \cdot h_1 = -0.030 \cdot 0.924 = -0.028$$
$$\frac{\partial \mathcal{L}}{\partial v_2} = \frac{\partial \mathcal{L}}{\partial z_{out}} \cdot h_2 = -0.030 \cdot 0.076 = -0.002$$

Step 3: Gradients at Hidden Layer

To propagate gradients backward, we need to find how the loss changes with respect to each hidden neuron's activation:

$$\frac{\partial \mathcal{L}}{\partial h_1} = -0.30 \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial h_2} = 0.30$$
📐 Show derivation: How do we get ∂L/∂h using the chain rule?
Given: $z_{out} = h_1 v_1 + h_2 v_2 + b_{out}$, so $\frac{\partial z_{out}}{\partial h_1} = v_1$ and $\frac{\partial z_{out}}{\partial h_2} = v_2$
Chain rule: $\frac{\partial \mathcal{L}}{\partial h} = \frac{\partial \mathcal{L}}{\partial z_{out}} \cdot \frac{\partial z_{out}}{\partial h}$
Result: $\frac{\partial \mathcal{L}}{\partial h_1} = -0.030 \cdot 10 = -0.30$ and $\frac{\partial \mathcal{L}}{\partial h_2} = -0.030 \cdot (-10) = 0.30$

Step 4: Through Sigmoid to Hidden Pre-activations

Now we push the gradient through the sigmoid activation to get gradients w.r.t. the pre-activation values:

$$\frac{\partial \mathcal{L}}{\partial z_1} = -0.021 \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial z_2} = 0.021$$
📐 Show derivation: How does the gradient flow through sigmoid?
Sigmoid derivative: $\sigma'(z) = h(1-h)$
Chain rule: $\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial h} \cdot h(1-h)$
For z₁: $\frac{\partial \mathcal{L}}{\partial z_1} = -0.30 \cdot 0.924 \cdot 0.076 = -0.021$
For z₂: $\frac{\partial \mathcal{L}}{\partial z_2} = 0.30 \cdot 0.076 \cdot 0.924 = 0.021$
⚠️ When h≈0 or h≈1, h(1−h)→0, causing vanishing gradients. This is why ReLU is preferred in deep networks.

Step 5: Gradients for Hidden Layer Weights

Finally, we compute gradients for the input-to-hidden weights:

$$\frac{\partial \mathcal{L}}{\partial w_{11}} = -0.021, \quad \frac{\partial \mathcal{L}}{\partial w_{12}} = 0, \quad \frac{\partial \mathcal{L}}{\partial w_{21}} = 0.021, \quad \frac{\partial \mathcal{L}}{\partial w_{22}} = 0$$
📐 Show derivation: How do we get the weight gradients?
Given: $z_1 = x_1 w_{11} + x_2 w_{12} + b_1$, so $\frac{\partial z_1}{\partial w_{11}} = x_1$
Chain rule: $\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial z} \cdot x$
For x=[1,0]: $\frac{\partial \mathcal{L}}{\partial w_{11}} = -0.021 \cdot 1 = -0.021$, $\frac{\partial \mathcal{L}}{\partial w_{12}} = -0.021 \cdot 0 = 0$
Second neuron: $\frac{\partial \mathcal{L}}{\partial w_{21}} = 0.021 \cdot 1 = 0.021$, $\frac{\partial \mathcal{L}}{\partial w_{22}} = 0.021 \cdot 0 = 0$
💡 When an input is 0, its weight gradient is 0. Only active inputs update their weights!
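As a check on the hand-computed value, we can compare ∂L/∂w₁₁ against a finite difference through the whole forward pass (same weights and input as above):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_fn(w11):
    """Full forward pass + BCE loss for x=[1,0], y=1, as a function of w11."""
    x, y = np.array([1.0, 0.0]), 1.0
    W1 = np.array([[w11, 5.0], [5.0, 5.0]])
    b1 = np.array([-2.5, -7.5])
    v, b_out = np.array([10.0, -10.0]), -5.0
    h = sigmoid(W1 @ x + b1)
    y_hat = sigmoid(v @ h + b_out)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

eps = 1e-6
grad = (loss_fn(5.0 + eps) - loss_fn(5.0 - eps)) / (2 * eps)
print(round(float(grad), 3))  # -0.021 — matches the chain-rule result
```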

XOR From Scratch (NumPy)

Here's a complete implementation that trains a neural network to solve XOR without any deep learning framework:

XOR Network: Pure NumPy Implementation
🐍xor_from_scratch.py
Key points:

  • Sigmoid: maps any input to (0, 1); extreme values are clipped to prevent overflow.
  • Sigmoid derivative: for backprop, σ'(x) = σ(x)(1 − σ(x)) — note the function receives the sigmoid output, not the raw input.
  • XOR dataset: all four input-output pairs; this is the complete training set.
  • Weight initialization: small random weights (scale 0.5), seeded for reproducibility.
  • High learning rate: lr=1.0 works here because XOR is tiny and we train on all four examples each step.
  • Forward pass: z1 = linear, h = activation, z2 = linear, y_pred = final activation.
  • BCE loss: the 1e-10 prevents log(0) errors.
  • Output gradient: the elegant simplification ∂L/∂z = ŷ − y for BCE + sigmoid.
  • Weight gradients: chain rule, ∂L/∂W = inputᵀ @ gradient, divided by 4 for the mean.
  • Hidden gradient: flows backward through W2, then through the sigmoid derivative.
  • Weight update: gradient descent, W ← W − lr × gradient — move opposite to the gradient.
import numpy as np

# Sigmoid activation and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    return x * (1 - x)

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Initialize weights randomly
np.random.seed(42)
W1 = np.random.randn(2, 2) * 0.5  # Input -> Hidden
b1 = np.zeros((1, 2))
W2 = np.random.randn(2, 1) * 0.5  # Hidden -> Output
b2 = np.zeros((1, 1))

# Training parameters
learning_rate = 1.0
epochs = 10000

# Training loop
for epoch in range(epochs):
    # Forward pass
    z1 = X @ W1 + b1
    h = sigmoid(z1)
    z2 = h @ W2 + b2
    y_pred = sigmoid(z2)

    # Compute loss (BCE)
    loss = -np.mean(y * np.log(y_pred + 1e-10) +
                    (1 - y) * np.log(1 - y_pred + 1e-10))

    # Backward pass
    dz2 = y_pred - y                    # Output gradient
    dW2 = h.T @ dz2 / 4                 # Gradient for W2
    db2 = np.mean(dz2, axis=0, keepdims=True)

    dh = dz2 @ W2.T                     # Gradient at hidden
    dz1 = dh * sigmoid_derivative(h)    # Through sigmoid
    dW1 = X.T @ dz1 / 4                 # Gradient for W1
    db1 = np.mean(dz1, axis=0, keepdims=True)

    # Update weights
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

    if epoch % 1000 == 0:
        predictions = (y_pred > 0.5).astype(int)
        accuracy = np.mean(predictions == y)
        print(f"Epoch {epoch}: Loss={loss:.4f}, Acc={accuracy:.0%}")

# Final predictions
print("\nFinal Predictions:")
for i, (inp, pred) in enumerate(zip(X, y_pred)):
    print(f"  {inp} -> {pred[0]:.4f} (expected: {y[i][0]})")

Expected Output

After ~5000 epochs, the network should achieve 100% accuracy. Final predictions will be very close to 0 or 1 for each XOR input.
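For comparison, here is one way the same network could look in PyTorch (a sketch; the 2→4→1 ReLU architecture and the hyperparameters are choices for this example, not the only ones). The framework's autograd replaces our hand-written backward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# 2 -> 4 -> 1 network; the final sigmoid lives inside BCEWithLogitsLoss
model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

for epoch in range(2000):
    optimizer.zero_grad()          # clear old gradients
    loss = criterion(model(X), y)  # forward pass + loss
    loss.backward()                # autograd computes all gradients
    optimizer.step()               # gradient descent update

preds = (torch.sigmoid(model(X)) > 0.5).int()
print(preds.flatten())
```

Note the absence of any sigmoid_derivative or manual chain rule: loss.backward() does the work of our entire backward-pass section.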

Key Insights from XOR

  1. Hidden layers enable nonlinear decision boundaries. The hidden layer transforms the input space so that XOR outputs become linearly separable.
  2. Backpropagation distributes credit. Even though the output error is just one number, backprop figures out how each weight contributed and adjusts accordingly.
  3. BCE + Sigmoid = elegant gradients. The derivative simplifies to ŷ − y, making implementation clean.
  4. XOR is a "sanity check" for networks. If your implementation can't learn XOR, something is fundamentally broken.
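The ŷ − y simplification from insight 3 is easy to verify numerically: perturb the logit, recompute the loss, and compare the finite-difference slope to ŷ − y. A standalone sketch with an arbitrary logit and label:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(y_true, y_prob):
    # Binary cross-entropy for a single prediction
    return -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

z, y_true = 0.3, 1.0            # arbitrary logit and label
analytic = sigmoid(z) - y_true  # claimed gradient dL/dz = y_hat - y

# Central finite difference of the loss with respect to z
eps = 1e-6
numeric = (bce(y_true, sigmoid(z + eps)) - bce(y_true, sigmoid(z - eps))) / (2 * eps)
print(analytic, numeric)  # the two values agree to several decimal places
```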

Quick Check

Why does a single-layer perceptron fail to learn XOR?


Debugging Your Neural Network

Training neural networks can be tricky. Here are common issues and how to diagnose them:

Problem: Loss Not Decreasing

| Symptom | Likely Cause | Solution |
| --- | --- | --- |
| Loss stays constant | Learning rate too small | Increase lr by 10x |
| Loss increases | Learning rate too high | Decrease lr by 10x |
| Loss oscillates wildly | Learning rate too high | Decrease lr |
| Loss stuck after initial drop | Local minimum or plateau | Try a different initialization, add momentum |
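The behaviors in the table above are easy to reproduce on a toy problem. A sketch using plain gradient descent on f(w) = w², whose gradient is 2w (the learning rates here are illustrative):

```python
# Gradient descent on f(w) = w^2. Each step is w <- w - lr * 2w = w * (1 - 2*lr),
# so 0 < lr < 1 converges, tiny lr barely moves, and lr > 1 diverges.
results = {}
for lr in [0.001, 0.1, 1.5]:
    w = 1.0
    for _ in range(20):
        w -= lr * 2 * w
    results[lr] = w
    print(f"lr={lr}: w after 20 steps = {w:.4g}")
```

With lr=0.001 the parameter barely moves ("loss stays constant"), lr=0.1 converges smoothly, and lr=1.5 blows up ("loss increases/oscillates wildly").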

Problem: Training Loss Good, Test Loss Bad (Overfitting)

  • Add regularization (L2 weight decay, dropout)
  • Reduce model capacity (fewer layers or neurons)
  • Get more training data or use data augmentation
  • Use early stopping based on validation loss
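In PyTorch, the first bullet's two techniques are essentially one-liners. A sketch (the layer sizes and hyperparameter values are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations during training
    nn.Linear(32, 1),
)

# weight_decay adds L2 regularization to every parameter update
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```

Remember that dropout is only active in training mode; calling model.eval() disables it, which is one reason the train/eval switch in the checklist below matters.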

Problem: Accuracy Stuck at 50% (Binary Classification)

This usually means the network is outputting the same prediction for all inputs:

  • Check that your labels are correct (0s and 1s, not 1s and 2s)
  • Verify the loss function matches your output (BCEWithLogitsLoss expects raw logits; BCELoss expects sigmoid outputs)
  • Make sure data is shuffled properly
  • Try better weight initialization (PyTorch defaults are usually fine)

The Debugging Checklist

🐍debug_checklist.py
1. Check data. Always start by verifying your data shapes and contents.

  • Verify shapes: X should be (samples, features), y should be (samples, 1) or (samples,).
  • Check labels: for binary classification, y should contain only [0, 1].
  • Class balance: ~50% means balanced. Imbalanced data may need a weighted loss.

2. Model output. Check what the untrained model produces.

  • No-gradient context: disable gradients for inference-only operations.
  • Sample outputs: should vary across inputs. If they are all identical, check the model's initialization and wiring.

3. Loss check. Verify the loss computation works correctly.

  • Loss function: BCEWithLogitsLoss combines sigmoid + BCE for numerical stability.
  • Initial loss: for balanced binary data, the initial loss should be ~0.69 (= −ln(0.5)).

4. Gradient check. Verify gradients flow through the network.

  • Backward pass: compute gradients for all parameters.
  • Iterate parameters: check each layer's gradients.
  • Gradient mean: should be non-zero. Zero gradients = no learning!

# 1. Check data
print(f"X shape: {X.shape}, y shape: {y.shape}")
print(f"y values: {torch.unique(y)}")
print(f"Class balance: {y.float().mean().item():.2%}")

# 2. Check model output before training (no gradients needed here)
with torch.no_grad():
    sample_out = model(X[:5])
    print(f"Sample outputs: {sample_out.flatten()}")

# 3. Verify loss computation (recompute WITH gradients so backward() works;
#    outputs made under torch.no_grad() cannot be backpropagated through)
criterion = nn.BCEWithLogitsLoss()
loss = criterion(model(X[:5]), y[:5])
print(f"Initial loss: {loss.item()}")

# 4. Check gradients after one backward pass
loss.backward()
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad mean={param.grad.mean():.6f}")

Practical Checklist

Before training any neural network, run through this checklist:

Data Preparation

  • ☐ Data is normalized (typically mean 0, std 1)
  • ☐ Labels are in the correct format (0/1 for binary, one-hot for multi-class)
  • ☐ Training and validation sets are properly split
  • ☐ Data is shuffled before training
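The first item, standardization, takes only a couple of lines. The key detail is to compute the statistics on the training set only and reuse them for validation, so no information leaks. A sketch with random data (the shapes and distribution are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_val = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

# Statistics come from the training set ONLY
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_norm = (X_train - mean) / std  # now mean 0, std 1 per feature
X_val_norm = (X_val - mean) / std      # same statistics: no leakage
print(X_train_norm.mean(axis=0).round(6))
```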

Model Setup

  • ☐ Loss function matches the task (BCE for binary, CE for multi-class, MSE for regression)
  • ☐ Output layer matches the loss (no sigmoid if using BCEWithLogitsLoss)
  • ☐ Model parameters are counted and reasonable for the data size
  • ☐ Model is on the correct device (CPU or GPU)
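Counting parameters (the third item) is a one-liner in PyTorch; as a rough rule of thumb, be suspicious when the count approaches the number of training examples. A sketch with an arbitrary small model:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))

# numel() gives the element count of each weight/bias tensor
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # (2*8 + 8) + (8*1 + 1) = 33
```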

Training Loop

  • ☐ optimizer.zero_grad() is called each iteration
  • ☐ model.train() is set during training
  • ☐ model.eval() is set during validation
  • ☐ Loss and accuracy are being tracked
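The checklist items above fall out naturally from a loop shaped like this (a minimal sketch; the toy dataset, model, and hyperparameters are placeholders for your own):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy linearly-separable binary problem: label is 1 when the features sum > 0
X = torch.randn(64, 2)
y = (X.sum(dim=1, keepdim=True) > 0).float()
X_val = torch.randn(16, 2)
y_val = (X_val.sum(dim=1, keepdim=True) > 0).float()

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    model.train()                  # enable training-mode layers (dropout, etc.)
    optimizer.zero_grad()          # clear accumulated gradients
    loss = criterion(model(X), y)  # forward pass + loss
    loss.backward()                # backward pass
    optimizer.step()               # weight update

    model.eval()                   # switch to evaluation mode
    with torch.no_grad():          # track validation loss without gradients
        val_loss = criterion(model(X_val), y_val)

print(f"train={loss.item():.3f} val={val_loss.item():.3f}")
```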

Summary

Congratulations! You've built your first neural network from scratch. Let's review what you've learned:

| Concept | Key Point | PyTorch |
| --- | --- | --- |
| Architecture | Input → Hidden (ReLU) → Output (Sigmoid) | nn.Sequential(nn.Linear, nn.ReLU, ...) |
| Forward Pass | z = Wx + b, then a = g(z) | output = model(x) |
| Loss Function | BCE measures prediction error | nn.BCEWithLogitsLoss() |
| Backward Pass | Chain rule computes all gradients | loss.backward() |
| Weight Update | θ = θ − η∇L | optimizer.step() |
| Training Loop | Repeat: forward → loss → backward → update | zero_grad() → forward → backward() → step() |

Key Takeaways

  1. Neural networks are parameterized functions that learn to map inputs to outputs through gradient-based optimization
  2. The forward pass computes predictions by applying linear transformations and nonlinear activations layer by layer
  3. The loss function measures how wrong predictions are; BCE for classification, MSE for regression
  4. Backpropagation efficiently computes gradients using the chain rule
  5. The training loop follows a sacred order: zero gradients → forward → loss → backward → update
  6. Debugging requires systematically checking data, model outputs, loss values, and gradients

Exercises

Conceptual Questions

  1. Explain why we need nonlinear activation functions. What would happen if we used only linear layers?
  2. Why do we zero the gradients at the start of each iteration instead of after the weight update?
  3. What is the difference between nn.BCELoss and nn.BCEWithLogitsLoss? When would you use each?
  4. If you double the learning rate, how does this affect the size of each weight update?

Solution Hints

  1. Q1: Without nonlinearity, the entire network collapses to a single linear transformation (matrix multiplication). No matter how many layers, f(x) = W₃(W₂(W₁x)) = Wx.
  2. Q2: PyTorch accumulates gradients. Zeroing before backward() ensures we only have gradients from the current batch.
  3. Q3: BCELoss expects probabilities (after sigmoid), BCEWithLogitsLoss expects logits (before sigmoid). The latter is more numerically stable.
  4. Q4: The weight update is θ = θ - lr × ∇L. Doubling lr doubles the step size.

Coding Exercises

  1. Modify the architecture: Add a third hidden layer with 4 neurons. How does this affect training speed and final accuracy?
  2. Try different optimizers: Replace Adam with SGD (with and without momentum). Compare convergence.
  3. Implement early stopping: Track validation loss and stop training when it starts increasing.
  4. Add regularization: Implement L2 weight decay using the optimizer's weight_decay parameter.

Challenge Exercise

Implement a neural network from scratch without PyTorch's autograd. Create a two-layer network (input → hidden → output) and manually implement:

  • Forward pass with ReLU and sigmoid
  • Backward pass computing gradients for all weights
  • Weight update using SGD

Train it on the spiral dataset and verify it achieves similar accuracy to the PyTorch version.

This is Advanced

The challenge exercise requires solid understanding of matrix calculus. Start by deriving the gradients on paper, then implement. This exercise will cement your understanding of backpropagation.

You've now mastered the fundamentals of building neural networks! In the next chapter, we'll explore data loading and processing—essential skills for working with real-world datasets.