Build an MLP classifier from raw NumPy, including the sigmoid output layer and binary cross-entropy loss
Explain why cross-entropy is the correct loss for classification and why MSE fails
Define custom networks using PyTorch's nn.Module with __init__() and forward()
Implement the full training pipeline: DataLoader, BCEWithLogitsLoss, Adam optimizer, train/eval modes
Split data into train/validation and use validation accuracy to detect overfitting
Systematically compare architectures by varying width and depth on a fixed task
From Architecture to Classification
In Section 1, we designed MLP architectures and compared them on the diagonal flip task — a regression problem where the network predicts continuous values. But most real-world applications require classification: deciding which category an input belongs to. Is this email spam or not? Is this image a cat or a dog? Is this patient's tumor benign or malignant?
Classification requires two changes to our network:
Output activation: replace the linear output with a sigmoid function that produces a probability p∈(0,1)
Loss function: replace MSE with binary cross-entropy (BCE), which is the correct information-theoretic measure for probability predictions
Everything else — hidden layers, ReLU activations, He initialization, backpropagation — stays exactly the same. Classification is regression with a probability wrapper.
The Two-Moons Dataset
We need a classification task that is impossible to solve with a linear classifier but easy to visualize. The two-moons dataset is a classic choice: two interleaving crescent-shaped point clouds in 2D space.
Each data point has two features (x1,x2) and a binary label (y∈{0,1}). Class 0 forms an upper crescent, class 1 forms a lower crescent offset to the right. The crescents interleave, so no straight line can separate them — but an MLP with ReLU can learn a curved decision boundary that snakes between the two moons.
Property
Value
Samples
300 (150 per class)
Features
2 (x, y coordinates)
Classes
2 (upper moon = 0, lower moon = 1)
Noise
σ = 0.1 (Gaussian)
Range
x ∈ [−1.25, 1.63], y ∈ [−1.25, 1.63]
Why two moons? Because the required decision boundary is a smooth curve — exactly the kind of function an MLP builds by composing linear regions. With 16 ReLU neurons, we get 17 linear regions, more than enough to approximate the curved boundary. With only 2 neurons, we get just 3 linear regions — not enough for the interleaving shape.
Cross-Entropy Loss: The Right Tool for Classification
In Chapters 7–9, we used mean squared error (MSE) as our loss function: L=N1∑i(y^i−yi)2. This works well for regression (predicting continuous values), but for classification it has a fatal flaw.
Why MSE Fails for Classification
Consider a sample with true label y=1 and predicted probability y^=0.01 (confidently wrong). The MSE gradient is:
∂y^∂L=2(y^−y)=2(0.01−1)=−1.98
Now consider y^=0.99 (confidently right). The gradient is:
∂y^∂L=2(0.99−1)=−0.02
The problem is that the gradient passes through sigmoid's saturated region. When y^ is near 0 or 1, sigmoid's derivative is nearly 0, and the product ∂z∂L=∂y^∂L⋅σ′(z) vanishes. The network is confidently wrong but the gradient is nearly zero — it cannot fix its mistake.
Binary Cross-Entropy (BCE)
Cross-entropy solves this by using the logarithm, which has an infinite gradient near zero:
LBCE=−N1∑i[yilog(y^i)+(1−yi)log(1−y^i)]
When the network combines BCE with sigmoid, a beautiful cancellation occurs. The gradient of BCE + sigmoid with respect to the logitz (the pre-sigmoid value) simplifies to:
∂z∂L=σ(z)−y=y^−y
No sigmoid derivative involved! The gradient is simply the prediction error: positive when the prediction is too high, negative when too low. This means the gradient is always proportional to how wrong the prediction is, regardless of how saturated the sigmoid is.
Scenario
MSE Gradient on z
BCE Gradient on z
y=1, ŷ=0.01 (very wrong)
Vanishes (sigmoid saturated)
ŷ - y = -0.99 (strong!)
y=1, ŷ=0.50 (uncertain)
Moderate
ŷ - y = -0.50 (moderate)
y=1, ŷ=0.99 (correct)
Tiny
ŷ - y = -0.01 (tiny)
Information theory connection: Cross-entropy measures the expected number of bits needed to encode events from the true distribution using a code optimized for the predicted distribution. When prediction = truth, cross-entropy equals the entropy of the data (minimum possible). Every bit of prediction error adds cost. This is why cross-entropy is the natural loss for probability predictions.
NumPy: Binary Classifier from Scratch
Let us build a complete binary classifier from scratch using NumPy. The architecture is 2→16→1: 2 input features, 16 hidden neurons with ReLU, and 1 output neuron with sigmoid. This code generates the two-moons dataset, trains using full-batch gradient descent with BCE loss, and reaches 99% accuracy in 500 epochs.
NumPy — Two-Moons Binary Classifier
🐍two_moons_classifier.py
Explanation(45)
Code(55)
1import numpy as np
NumPy provides fast numerical arrays and vectorized math. All matrix multiplications, element-wise operations, and random number generation in this code run as optimized C code under the hood.
EXECUTION STATE
numpy = Library for numerical computing — ndarray, linear algebra, random numbers, vectorized math. We alias it as np.
3# Generate “two moons” dataset
The two-moons dataset is a classic 2D classification benchmark. It consists of two interleaving crescent shapes that cannot be separated by a straight line — requiring a non-linear classifier like our MLP. Unlike the diagonal flip (which was a regression task), this is binary classification: each point belongs to class 0 or class 1.
EXECUTION STATE
two moons = Two crescent-shaped point clouds that overlap. A linear classifier draws a straight line and gets ~75% accuracy. An MLP with ReLU can curve the boundary and get >99%.
4np.random.seed(42) — Reproducibility
Fixes the random number generator so we get identical data every run. This is critical for fair comparisons — the only variable should be the architecture, not random luck in data generation.
EXECUTION STATE
📚 np.random.seed(42) = Sets NumPy’s PRNG to a fixed state. All subsequent calls to np.random.randn(), np.random.permutation(), etc. produce the same sequence.
5n = 150 — Points per class
We generate 150 points for each class (300 total). This is large enough to learn a good boundary but small enough to train quickly with full-batch gradient descent.
EXECUTION STATE
n = 150 = 150 points in class 0 (upper moon) + 150 points in class 1 (lower moon) = 300 total samples
6t = np.linspace(0, π, n) — Angle range
Creates 150 evenly spaced angles from 0 to π (a half-circle). These angles trace out the crescent shapes. linspace(0, π, 150) = [0, 0.0211, 0.0422, ..., 3.1416].
EXECUTION STATE
📚 np.linspace(start, stop, n) = Creates n evenly spaced values between start and stop (inclusive). linspace(0, π, 150) produces 150 values from 0 to 3.1416.
t shape = (150,) — one angle per data point
7X_top = np.c_[np.cos(t), np.sin(t)] — Upper moon
Creates the upper crescent. cos(t) gives x-coordinates, sin(t) gives y-coordinates. As t goes from 0 to π, this traces the top half of a unit circle: starting at (1, 0), arcing through (0, 1), ending at (-1, 0).
EXECUTION STATE
📚 np.c_[a, b] = Column-stacks two 1D arrays into a 2D matrix. np.c_[[1,2,3], [4,5,6]] = [[1,4], [2,5], [3,6]]. Creates (n, 2) from two (n,) arrays.
X_top shape = (150, 2) — 150 points, each with (x, y) coordinates
X_top[0] = [cos(0), sin(0)] = [1.0, 0.0] — rightmost point
X_top[75] = [cos(π/2), sin(π/2)] = [0.0, 1.0] — top point
X_top[149] = [cos(π), sin(π)] = [-1.0, 0.0] — leftmost point
Creates the lower crescent. Same semicircle as the upper moon, but flipped vertically (-sin) and shifted right (+0.5) and up (+0.5). The offset makes the two moons interleave — they overlap in the middle, creating a non-linearly-separable pattern.
+0.5 horizontal shift = Moves the lower moon 0.5 units to the right, creating the interleaving pattern
-np.sin(t)+0.5 = Flips the arc downward (negative sin) then shifts up by 0.5. Without the shift, the moons would be too far apart and easily separable.
9X = np.vstack([X_top, X_bot]) — Combine
Stacks upper and lower moon into one dataset. Rows 0–149 are class 0 (upper), rows 150–299 are class 1 (lower).
EXECUTION STATE
📚 np.vstack([A, B]) = Vertically stacks arrays: puts A on top, B below. For (150,2) + (150,2) = (300,2).
X shape = (300, 2) — 300 data points, 2 features each (x, y)
10X += np.random.randn(300, 2) * 0.1 — Add noise
Adds Gaussian noise to make the task realistic. Without noise, the moons are clean curves that could be perfectly separated. With noise = 0.1, points scatter slightly around the crescents, creating overlap near the boundary. This forces the network to learn a robust decision boundary.
EXECUTION STATE
📚 np.random.randn(300, 2) = Draws 600 values from standard normal N(0, 1). Shape (300, 2) matches X. Each point gets independent x and y jitter.
* 0.1 = Scales noise to std=0.1. Larger noise = more overlap = harder task. 0.1 gives mild noise — challenging but learnable.
Creates binary labels: 0 for upper moon (first 150 points), 1 for lower moon (last 150 points). These are the targets our classifier will learn to predict.
EXECUTION STATE
y shape = (300,) — one label per data point
y[:5] = [0, 0, 0, 0, 0] — upper moon
y[150:155] = [1, 1, 1, 1, 1] — lower moon
13perm = np.random.permutation(300) — Random shuffle
Creates a random ordering of indices 0–299. We use this to shuffle the data so class 0 and class 1 are interleaved, not grouped. This prevents the network from learning spurious patterns based on sample order.
EXECUTION STATE
📚 np.random.permutation(n) = Returns a random permutation of [0, 1, 2, ..., n-1]. Example: permutation(5) might return [3, 0, 4, 1, 2].
14X, y = X[perm], y[perm] — Apply shuffle
Reorders both X and y using the same permutation. This ensures each data point keeps its correct label after shuffling.
EXECUTION STATE
X[0] after shuffle = [-0.2139, 1.0612], y[0] = 0 — a class-0 point from the upper moon
X[2] after shuffle = [0.9309, -0.3566], y[2] = 1 — a class-1 point from the lower moon
16print(f"Dataset: X{X.shape}, y{y.shape}")
Confirms the dataset dimensions. Output: Dataset: X(300, 2), y(300,). 300 points, 2 features per point, 1 label per point.
18def sigmoid(z): — Sigmoid activation
The sigmoid function squashes any real number into the range (0, 1), making it interpretable as a probability. For classification, the output neuron uses sigmoid (not ReLU) so we get P(class=1|x). Sigmoid is defined as σ(z) = 1/(1 + e^(-z)).
EXECUTION STATE
⬇ input: z = A NumPy array of any shape. Sigmoid is applied element-wise. For the output layer: z is the raw logit score.
→ Why sigmoid for classification? = ReLU outputs [0, ∞) which is not a probability. Sigmoid outputs (0, 1) which we interpret as P(y=1). If sigmoid > 0.5, predict class 1; otherwise predict class 0.
19z = np.clip(z, -500, 500) — Prevent overflow
Clips extreme values to prevent numerical overflow in exp(-z). Without clipping: exp(-(-1000)) = exp(1000) = infinity. With clipping: exp(-(-500)) = exp(500) is still huge but finite. The sigmoid of any z > 20 is effectively 1.0, so clipping at 500 has zero effect on the result.
EXECUTION STATE
📚 np.clip(a, min, max) = Restricts values to [min, max]. Values below min become min, above max become max. clip([-600, 0, 600], -500, 500) = [-500, 0, 500].
20return 1 / (1 + np.exp(-z)) — Sigmoid formula
The core sigmoid computation: σ(z) = 1/(1 + e^(-z)). When z is large and positive, e^(-z) ≈ 0, so σ ≈ 1. When z is large and negative, e^(-z) is huge, so σ ≈ 0. When z = 0, e^0 = 1, so σ = 1/2 = 0.5.
EXECUTION STATE
📚 np.exp(-z) = Element-wise e^(-z). For z = [-2, 0, 2]: exp(-z) = [7.389, 1.0, 0.135]
We build a 2-input, 16-hidden, 1-output network. 2 inputs because our data is 2D (x, y coordinates). 16 hidden neurons with ReLU (from Section 1, we know 16 neurons give ample capacity). 1 output with sigmoid for binary probability.
23np.random.seed(0) — Weight seed
Sets a separate seed for weight initialization. Using seed 0 (different from data seed 42) ensures independence between data randomness and weight randomness.
Creates the weight matrix for the hidden layer using He initialization. Shape (2, 16) means 2 inputs and 16 hidden neurons. He scale: √(2/fan_in) = √(2/2) = 1.0, so we just use standard normal values.
EXECUTION STATE
📚 np.random.randn(2, 16) = Draws 32 values from N(0, 1). Shape (2, 16): each of 16 neurons has 2 weights (one per input feature).
√(2/2) = 1.0 = He initialization scale. fan_in = 2 (input features). With only 2 inputs, the scale happens to be 1.0, so weights are standard normal.
⬆ W1 shape = (2, 16) — 32 weights. W1[:,j] are the 2 weights for hidden neuron j.
25b1 = np.zeros(16) — Hidden biases
One bias per hidden neuron, all initialized to zero. Standard practice — He initialization handles the variance through weights.
EXECUTION STATE
b1 shape = (16,) — [0, 0, 0, ..., 0]. One bias per hidden neuron.
Weight matrix for the output layer. Shape (16, 1): 16 inputs from hidden layer, 1 output. He scale: √(2/16) = √0.125 = 0.354.
EXECUTION STATE
√(2/16) = 0.354 = Smaller scale because fan_in is larger (16). More inputs means each weight should be smaller to keep the weighted sum’s variance around 1.
⬆ W2 shape = (16, 1) — 16 weights connecting hidden layer to the single output neuron
27b2 = np.zeros(1) — Output bias
Single output bias. With b2 = 0 and random weights, the initial sigmoid output is centered around 0.5 (no class preference).
Learning rate for gradient descent. We use lr = 1.0 because the gradient is averaged over 300 samples (divided by N), making the effective step size reasonable. With lr = 0.1, convergence would be 10× slower.
EXECUTION STATE
lr = 1.0 = Each weight update: w_new = w_old - 1.0 × gradient. Since gradient is averaged over 300 samples, the per-sample step is effectively 1.0/300 ≈ 0.0033.
30for epoch in range(500): — Training loop
Train for 500 epochs. Each epoch does one full forward pass (all 300 samples), computes the loss, does one full backward pass, and updates all weights. This is full-batch gradient descent — every epoch uses the entire dataset.
LOOP TRACE · 4 iterations
Epoch 0
BCE=0.4991, acc=77.3% = Random initialization. The sigmoid outputs are near 0.5 for most points, so accuracy is just above chance (50%).
Epoch 100
BCE=0.2186, acc=89.7% = Network has learned the rough shape of the boundary. Most points classified correctly.
Epoch 200
BCE=0.1079, acc=97.3% = Boundary is much sharper. Only a few ambiguous points near the overlap region are misclassified.
Epoch 499
BCE=0.0377, acc=99.0% = Nearly perfect. Only 3 out of 300 points misclassified — likely noisy outliers near the boundary.
31# Forward pass
The forward pass computes predictions for all 300 samples simultaneously using matrix operations. Data flows: input (300,2) → hidden (300,16) → output (300,1) → probability (300,).
32Z1 = X @ W1 + b1 — Hidden pre-activations
Computes the weighted sum for every hidden neuron on every sample. X is (300, 2) and W1 is (2, 16), so X @ W1 is (300, 16) — 300 samples, 16 pre-activations each. Adding b1 broadcasts the (16,) bias across all 300 rows.
EXECUTION STATE
📚 @ (matrix multiply) = Python’s matrix multiplication. X(300,2) @ W1(2,16) = Z(300,16). Each row of Z is the 16 dot products between one input and all 16 weight vectors.
+ b1 (broadcast) = b1 is shape (16,). NumPy broadcasts it to (300, 16): each row gets the same bias vector added.
Applies ReLU element-wise: negative values become 0, positive values pass through. This creates the non-linearity that lets the MLP learn curved boundaries. ~50% of values are zeroed out (sparse activation).
⬆ H shape = (300, 16) — same shape as Z1. For sample 0: 10 of 16 neurons are active (positive).
34Z2 = H @ W2 + b2 — Output logit
Computes the raw output (logit) by multiplying hidden activations by output weights. H is (300, 16) and W2 is (16, 1), so Z2 is (300, 1) — one logit per sample.
EXECUTION STATE
Z2 shape = (300, 1) — one raw score per sample. Positive logit → sigmoid > 0.5 → predict class 1.
→ Z2[0] = -1.4615 — negative, so sigmoid will produce a value < 0.5 (predicts class 0)
35A2 = sigmoid(Z2.squeeze()) — Output probability
Applies sigmoid to convert logits to probabilities. squeeze() removes the size-1 last dimension: (300, 1) → (300,). Each value is now P(class=1) for that sample.
EXECUTION STATE
📚 .squeeze() = Removes dimensions of size 1. Shape (300, 1) becomes (300,). This makes A2 a 1D array matching y’s shape for easy comparison.
⬆ A2 shape = (300,) — one probability per sample, all values in (0, 1)
→ A2[0] = sigmoid(-1.4615) = 0.1882 — 18.8% probability of class 1. Since y[0]=0, this is a correct prediction (probability < 0.5).
37# Binary cross-entropy loss
Cross-entropy is the correct loss function for classification. Unlike MSE (which treats outputs as regression targets), cross-entropy measures the information-theoretic distance between the predicted probability distribution and the true label. It produces stronger gradients when predictions are confidently wrong.
38eps = 1e-7 — Numerical safety
A tiny constant to prevent log(0) = -infinity. Since A2 can be extremely close to 0 or 1, we add eps before taking the log. This is standard practice in every ML framework.
EXECUTION STATE
eps = 1e-7 = 0.0000001. Adding this to probabilities before log() has negligible effect on the result but prevents NaN/infinity.
Binary Cross-Entropy (BCE) loss: L = -(1/N) ∑ [yᵢ · log(aᵢ) + (1-yᵢ) · log(1-aᵢ)]. For each sample, only one term is active: if y=1, the loss is -log(a) (penalizes low predictions); if y=0, the loss is -log(1-a) (penalizes high predictions). The minimum is 0 when every prediction perfectly matches its label.
EXECUTION STATE
y*np.log(A2+eps) = Active when y=1. If a=0.9: -log(0.9) = 0.105 (low penalty, good prediction). If a=0.1: -log(0.1) = 2.303 (high penalty, bad prediction).
(1-y)*np.log(1-A2+eps) = Active when y=0. If a=0.1: -log(0.9) = 0.105 (low penalty). If a=0.9: -log(0.1) = 2.303 (high penalty).
⬆ bce (epoch 0) = 0.4991 — average over all 300 samples. Random predictions (≈0.5) give BCE ≈ -log(0.5) = 0.693, so 0.4991 shows the init is slightly better than random.
Converts probabilities to binary predictions (>0.5 → class 1, ≤0.5 → class 0), then compares with true labels. The mean of this boolean array is the fraction of correct predictions.
EXECUTION STATE
(A2 > 0.5) = Boolean array: True where we predict class 1, False where we predict class 0
== y = Element-wise comparison with true labels. True where prediction matches label.
⬆ acc (epoch 0) = 0.773 = 77.3% — 232 out of 300 correct. Better than 50% chance because He init gives reasonable initial weights.
42# Backward pass
Backpropagation computes the gradient of the loss with respect to every weight. The chain rule flows the error signal from the output back through each layer. For BCE + sigmoid, the output gradient has the elegant form: dZ2 = (A2 - y) / N.
The gradient of BCE loss with respect to the output logit Z2. The math works out beautifully: d(BCE)/d(Z2) = (sigmoid(Z2) - y) / N = (A2 - y) / N. The /300 averages over all samples. Reshape from (300,) to (300, 1) for matrix math.
EXECUTION STATE
(A2 - y) = Prediction error. Positive where A2 is too high (predicted class 1 but true class 0), negative where too low. Shape (300,).
/ 300 = Averages the gradient. Without this, the gradient magnitude would scale with dataset size, requiring a tiny learning rate.
.reshape(-1, 1) = Changes shape from (300,) to (300, 1) so matrix multiplications work correctly. -1 means ‘infer this dimension’.
⬆ dZ2 shape = (300, 1) — one gradient value per sample
44dW2 = H.T @ dZ2 — Output weight gradient
Gradient for W2: the matrix product of hidden activations (transposed) and the output gradient. H.T is (16, 300) and dZ2 is (300, 1), giving dW2 of shape (16, 1) — matching W2’s shape.
EXECUTION STATE
📚 .T (transpose) = H is (300, 16). H.T is (16, 300). We need this shape for the matrix multiply: (16, 300) @ (300, 1) = (16, 1).
⬆ dW2 shape = (16, 1) — matches W2. Each element is the sum of (hidden activation × output error) across all 300 samples.
45db2 = dZ2.sum(axis=0) — Output bias gradient
Bias gradient is the sum of output gradients across all samples. Shape (1,) — matches b2.
EXECUTION STATE
sum(axis=0) = Sum along the sample dimension (axis 0). (300, 1) → (1,). The bias gradient is simpler because bias doesn’t multiply any input.
46dH = dZ2 @ W2.T — Hidden activation gradient
Sends the error signal back through W2 to the hidden layer. dZ2 is (300, 1) and W2.T is (1, 16), giving dH of shape (300, 16). Each hidden neuron receives a portion of the output error, weighted by its connection strength W2.
EXECUTION STATE
W2.T shape = (1, 16) — transpose of W2 (16, 1). The same weights used in the forward pass are used (transposed) in the backward pass.
⬆ dH shape = (300, 16) — gradient for each hidden activation, for each sample
47dZ1 = dH * (Z1 > 0) — ReLU backward
Applies the ReLU derivative: gradient passes through active neurons (Z1 > 0) and is blocked by dead neurons (Z1 ≤ 0). The expression (Z1 > 0) creates a boolean mask that is 1 where ReLU was active and 0 where it was not.
EXECUTION STATE
(Z1 > 0) = Boolean mask, shape (300, 16). True where pre-activation was positive (ReLU let the value through). False where it was negative (ReLU output 0, gradient is 0).
* (element-wise) = Multiplies dH by the mask. Where ReLU was active: gradient passes. Where ReLU was dead: gradient is zeroed out.
→ ReLU derivative = d/dz max(0, z) = 1 if z > 0, 0 if z ≤ 0. This is why we check Z1 (pre-activation), not H (post-activation).
48dW1 = X.T @ dZ1 — Hidden weight gradient
Same pattern as dW2: gradient = input.T @ layer_gradient. X.T is (2, 300) and dZ1 is (300, 16), giving dW1 of shape (2, 16) — matching W1.
EXECUTION STATE
⬆ dW1 shape = (2, 16) — matches W1. Each element measures how much the loss changes when that weight changes.
49db1 = dZ1.sum(axis=0) — Hidden bias gradient
Sum of hidden gradients across all 300 samples. Shape (16,) matches b1.
Gradient descent: move weights in the direction that reduces loss. W_new = W_old - lr × gradient. We update the output layer first, but the order does not matter since we already computed all gradients.
EXECUTION STATE
lr*dW2 = 1.0 × dW2. With lr=1.0 and averaged gradients, each weight moves by the mean gradient — a reasonable step size.
Prints the epoch number, BCE loss, and classification accuracy. The :3d formats the epoch as a 3-digit integer, .4f gives 4 decimal places, .1% gives one decimal percentage.
Epoch 0 (77.3%): Random weights already do better than chance (50%) because He initialization gives reasonable starting activations.
Epoch 100 (89.7%): The network has learned the rough shape of the decision boundary. Most points are correct, but some near the overlap region are wrong.
Epoch 200 (97.3%): The boundary has sharpened. Only 8 points out of 300 are misclassified.
Epoch 499 (99.0%): Nearly perfect. Only 3 noisy outliers near the boundary are wrong.
Key insight from the backward pass: Notice how compact the gradient computation is. The output gradient dZ2=(y^−y)/N is just the prediction error, averaged over samples. This simplicity comes from the BCE + sigmoid cancellation — one of the most elegant results in neural network theory.
The nn.Module Blueprint
Writing 15 lines of manual backprop for a 2-layer network is manageable. But for a 96-layer transformer with attention, layer norm, and residual connections, manual gradients are impractical. PyTorch's nn.Module solves this: you define the architecture, and autograd computes all gradients automatically.
Every PyTorch neural network follows the same blueprint:
Inherit from nn.Module:class MLP(nn.Module):
Define layers in __init__(): Create nn.Linear, nn.ReLU, etc. and assign them to self
Define data flow in forward(): Specify how input transforms to output
Never call forward() directly: Use model(x) which also triggers hooks and autograd
The power of nn.Module is automatic parameter tracking. When you assign an nn.Linear to self.layer = nn.Linear(2, 16), PyTorch automatically registers its weight and bias as parameters. Calling model.parameters() returns every weight in the entire network — no matter how deep or complex — ready for the optimizer.
nn.Sequential vs nn.Module: In Section 1, we used nn.Sequential to build MLPs. Sequential is convenient but limited: data must flow straight through the layers. nn.Module lets you define any computation graph: skip connections (ResNet), attention (Transformer), branching (Inception). Every nn.Sequential is an nn.Module, but not every nn.Module is a Sequential.
PyTorch: Complete Training Pipeline
The following code demonstrates the complete PyTorch workflow: define a network class, prepare data with train/validation split, train with mini-batches, and compare architectures. This pattern scales from our 65-parameter MLP to billion-parameter transformers — only the model definition and data pipeline change.
PyTorch — nn.Module MLP with Full Pipeline
🐍pytorch_mlp_pipeline.py
Explanation(53)
Code(70)
1import torch
PyTorch’s core library. Provides tensors (GPU-accelerated arrays), automatic differentiation (autograd), and the computational graph that enables loss.backward() to compute all gradients in one call.
EXECUTION STATE
torch = Core PyTorch — tensors, autograd, GPU computing. Version 2.x adds torch.compile() for performance.
2import torch.nn as nn
The neural network module. Contains layer types (nn.Linear, nn.Conv2d), activations (nn.ReLU, nn.GELU), loss functions (nn.CrossEntropyLoss), containers (nn.Sequential), and the nn.Module base class that all networks must extend.
DataLoader handles batching, shuffling, and iterating over data. TensorDataset wraps tensors into a dataset object that DataLoader can consume. Together, they replace our manual for-loop over samples.
EXECUTION STATE
📚 DataLoader = Wraps a dataset and provides an iterator that yields batches. Handles shuffling, batch size, and parallel loading. The workhorse of PyTorch data pipelines.
📚 TensorDataset = Simple dataset that wraps one or more tensors. TensorDataset(X, y) creates a dataset where dataset[i] = (X[i], y[i]).
5class MLP(nn.Module): — Custom neural network
Every PyTorch neural network is a class that extends nn.Module. This is the fundamental pattern: define layers in __init__(), define the data flow in forward(). nn.Module gives us automatic parameter tracking, GPU support, save/load, and gradient computation for free.
EXECUTION STATE
📚 nn.Module = Base class for all neural networks in PyTorch. Provides: parameters() for listing weights, to() for GPU transfer, train()/eval() for mode switching, state_dict() for saving/loading.
→ Why a class? = A class bundles architecture (layers) and behavior (forward pass) together. You can create multiple instances with different architectures from the same class.
Constructor that defines the network architecture. Takes flexible parameters so one class can build any MLP: MLP(2, [16], 1) for a 2→16→1 network, MLP(2, [8, 8], 1) for a 2→8→8→1 network.
EXECUTION STATE
⬇ input: in_dim = Number of input features. For our 2D data: in_dim = 2 (x and y coordinates).
⬇ input: hidden_dims = List of hidden layer widths. [16] = one layer of 16 neurons. [8, 8] = two layers of 8 neurons each.
⬇ input: out_dim = Number of outputs. For binary classification: out_dim = 1 (one logit).
7super().__init__() — Initialize nn.Module
Calls the parent class (nn.Module) constructor. This sets up PyTorch’s internal parameter tracking, hooks, and module registration. You MUST call this before defining any layers — omitting it causes subtle bugs.
EXECUTION STATE
📚 super().__init__() = Initializes nn.Module internals: _parameters dict, _modules dict, _buffers dict, training flag, hooks. Without this, self.parameters() would return nothing.
8layers = [] — Build layer list dynamically
We accumulate layer objects in a list, then pass them to nn.Sequential. This lets us build networks of any depth from the hidden_dims parameter.
9prev = in_dim — Track input size
The input dimension for the next layer. Starts at in_dim (2), then updates to each hidden width. For MLP(2, [16], 1): prev goes 2 → 16.
10for h in hidden_dims: — Create each hidden layer
Iterates over hidden layer widths. For hidden_dims = [16]: one iteration creating Linear(2, 16) + ReLU. For [8, 8]: two iterations creating Linear(2, 8) + ReLU + Linear(8, 8) + ReLU.
11layers.append(nn.Linear(prev, h)) — Linear layer
Creates a fully-connected layer mapping prev inputs to h outputs. Internally stores weight matrix W of shape (h, prev) and bias vector b of shape (h,). The forward pass computes: output = input @ W.T + b.
EXECUTION STATE
📚 nn.Linear(in_features, out_features) = Stores W(out, in) and b(out). Forward: y = x @ W.T + b. For Linear(2, 16): W is (16, 2) = 32 weights, b is (16,) = 16 biases = 48 params.
12layers.append(nn.ReLU()) — Activation
Adds ReLU after each hidden layer. nn.ReLU() has no learnable parameters — it just applies max(0, x) element-wise. We add it after every hidden Linear but NOT after the output Linear.
13prev = h — Update dimension tracker
The output of this layer (h neurons) becomes the input to the next layer. For hidden_dims = [8, 8]: after first iteration prev = 8, so the second Linear is Linear(8, 8).
Creates the final layer: prev inputs → out_dim outputs. For binary classification with out_dim = 1: this produces a single logit (raw score before sigmoid). No activation is added here — we use BCEWithLogitsLoss which applies sigmoid internally.
15self.net = nn.Sequential(*layers) — Build the model
Wraps all layers into an nn.Sequential container. The * unpacks the list. When you call self.net(x), it passes x through each layer in order. Assigning to self.net (a nn.Module attribute) registers all sub-layers so model.parameters() finds them.
self.net = = Assigning a nn.Module to self registers it. PyTorch then tracks all sub-modules’ parameters. If you used a plain Python variable instead of self, parameters() would miss them!
16self._init_weights() — Apply He initialization
Calls our custom initialization method. PyTorch’s default is Kaiming uniform, which is close to He init. We explicitly apply Kaiming normal for consistency with our NumPy code.
18def _init_weights(self): — Custom initializer
Iterates through all layers and applies He (Kaiming) initialization to Linear layers. The leading underscore _ is a Python convention indicating this is an internal method not meant to be called externally.
19for m in self.modules(): — Iterate all sub-modules
self.modules() returns every module in the model tree: the MLP itself, the Sequential, each Linear, and each ReLU. We filter for nn.Linear to only initialize weight-bearing layers.
EXECUTION STATE
📚 self.modules() = Recursive iterator over all modules. For MLP(2,[16],1): yields MLP, Sequential, Linear(2,16), ReLU, Linear(16,1).
20if isinstance(m, nn.Linear): — Filter for Linear layers
Only initialize nn.Linear layers (which have weights). nn.ReLU has no parameters, so initializing it would be meaningless.
He initialization: fills weights from N(0, √(2/fan_in)). The trailing _ means in-place modification. nonlinearity=‘relu’ tells PyTorch to use the factor of 2 (because ReLU zeroes half the activations).
EXECUTION STATE
📚 nn.init.kaiming_normal_(tensor, nonlinearity) = Fills tensor with values from N(0, std²) where std = √(2/fan_in). For Linear(2,16): fan_in=2, std = √(2/2) = 1.0. For Linear(16,1): fan_in=16, std = √(2/16) = 0.354.
22nn.init.zeros_(m.bias) — Zero biases
Sets all biases to 0. Same as our b1 = np.zeros(16) in NumPy.
24def forward(self, x): — Define data flow
The forward method defines how data flows through the network. PyTorch calls this automatically when you do model(x). You NEVER call model.forward(x) directly — model(x) also triggers hooks and autograd tracking.
EXECUTION STATE
⬇ input: x = Input tensor. Shape (batch_size, in_dim). For our data: (32, 2) for a mini-batch or (240, 2) for the full training set.
⬆ returns = Output logits. Shape (batch_size, 1). Raw scores before sigmoid — positive means class 1, negative means class 0.
25return self.net(x) — Run through all layers
Passes x through the Sequential container: Linear → ReLU → Linear. The output is raw logits (no sigmoid). For a batch of 32: input (32, 2) → hidden (32, 16) → output (32, 1).
27# Prepare data with train/val split (80/20)
We split the 300-point dataset into 240 training and 60 validation samples. Training data is used to update weights; validation data is held out to detect overfitting — when the model memorizes training data but fails on new data.
28X_t = torch.tensor(X[:240], dtype=torch.float32) — Training inputs
Converts the first 240 NumPy rows to a PyTorch tensor. float32 is required because nn.Linear expects 32-bit floats (not 64-bit doubles or integers).
EXECUTION STATE
X[:240] = NumPy fancy indexing: rows 0 through 239 = first 80% of shuffled data. 240 training samples.
📚 torch.tensor(data, dtype) = Creates a new tensor from data (NumPy array, list, etc.). dtype=torch.float32 ensures 32-bit precision for GPU compatibility.
⬆ X_t shape = torch.Size([240, 2]) — 240 training samples, 2 features
29y_t = torch.tensor(y[:240], dtype=torch.float32) — Training labels
Training labels as float32 tensor. BCEWithLogitsLoss requires float targets, not integers. Shape: (240,).
Creates a mini-batch data pipeline. TensorDataset bundles (X_t, y_t) so dataset[i] = (X_t[i], y_t[i]). DataLoader splits the 240 samples into batches of 32, shuffled each epoch. This gives 8 batches per epoch (7 full batches of 32 + 1 batch of 16).
EXECUTION STATE
📚 TensorDataset(X_t, y_t) = Pairs inputs and labels: dataset[0] = (X_t[0], y_t[0]). Supports indexing, slicing, and len().
📚 DataLoader(..., batch_size=32) = Yields batches of 32 samples. For 240 samples: 7 batches of 32 + 1 batch of 16 = 240 total. Each batch is a tuple (xb, yb) with shapes (32, 2) and (32,).
shuffle=True = Randomizes sample order each epoch. Prevents the network from learning order-dependent patterns. Critical for SGD convergence.
35# Training function
We encapsulate training in a function so we can easily compare different architectures. Each call creates a fresh optimizer, trains for 300 epochs, and returns final accuracies.
Takes a model, trains it, and returns train/validation accuracy. Separating training into a function is good practice — it avoids copy-pasting the training loop for each architecture.
EXECUTION STATE
⬇ input: model = An MLP instance (already initialized). Different calls get different architectures: MLP(2, [4], 1) vs MLP(2, [16], 1).
⬇ input: epochs = 300 = Number of training epochs. 300 is enough for these small networks to converge on this dataset.
⬇ input: lr = 0.01 = Learning rate for Adam optimizer. Adam uses adaptive per-parameter rates, so the effective step is typically much smaller than lr.
⬆ returns = (train_accuracy, val_accuracy) as Python floats
Creates an Adam optimizer for this model. Adam maintains per-parameter momentum (m) and squared-gradient (v) buffers, then updates: w -= lr × m̂/(√v̂ + ε). This adaptive rate lets Adam converge faster than vanilla SGD on most problems.
EXECUTION STATE
📚 torch.optim.Adam(params, lr) = Adam optimizer: adaptive learning rate per parameter. Combines momentum (for speed) and RMSprop (for stability). Default β1=0.9, β2=0.999, ε=1e-8.
model.parameters() = Iterator over all learnable tensors in the model. For MLP(2,[16],1): W1(16,2), b1(16), W2(1,16), b2(1) = 65 parameters.
38loss_fn = nn.BCEWithLogitsLoss() — Loss function
Binary Cross-Entropy loss with built-in sigmoid. Unlike calling sigmoid() + BCE separately, BCEWithLogitsLoss uses the log-sum-exp trick for numerical stability. It expects raw logits (not probabilities) as input.
EXECUTION STATE
📚 nn.BCEWithLogitsLoss() = Computes: -mean(y·log(σ(z)) + (1-y)·log(1-σ(z))) where σ is sigmoid. Numerically stable because it never computes sigmoid directly — uses log(sigmoid(z)) = z - softplus(z). This avoids exp overflow.
→ Why not sigmoid + BCE? = sigmoid(1000) = 1.0 exactly in float32, then log(1-1.0) = log(0) = -inf = NaN loss. BCEWithLogitsLoss avoids this by never computing sigmoid explicitly.
40for epoch in range(epochs): — Training loop
Each epoch iterates over all mini-batches in the DataLoader. With batch_size=32 and 240 samples, that is 8 optimizer steps per epoch, 2400 total steps over 300 epochs.
41model.train() — Training mode
Switches the model to training mode. This enables dropout (if any) and uses running statistics for batch normalization. Our simple MLP has no dropout or batch norm, but it is good practice to always call model.train() before training.
EXECUTION STATE
📚 model.train() = Sets self.training = True for the model and all sub-modules. Affects behavior of nn.Dropout (active in train, off in eval) and nn.BatchNorm (uses batch stats in train, running stats in eval).
42for xb, yb in loader: — Iterate mini-batches
The DataLoader yields one batch at a time: (xb, yb) where xb is shape (32, 2) and yb is shape (32,). The last batch may be smaller (16 samples if 240 is not divisible by 32).
Passes the batch through the model. model(xb) calls forward(), which returns shape (32, 1). squeeze(-1) removes the last dimension: (32, 1) → (32,) to match yb’s shape.
EXECUTION STATE
model(xb) = Calls MLP.forward(xb). Returns raw logits (no sigmoid). Shape (32, 1).
.squeeze(-1) = Removes the last dimension of size 1. (32, 1) → (32,). Now logits and yb have the same shape for the loss function.
⬆ logits shape = torch.Size([32]) — one raw score per sample. Positive → class 1, negative → class 0.
44loss = loss_fn(logits, yb) — Compute loss
Computes BCE loss over the batch. The loss function applies sigmoid internally, then computes cross-entropy, then averages over the batch. Returns a scalar tensor with gradient tracking.
EXECUTION STATE
⬆ loss = Scalar tensor. Epoch 0: ~0.54. Epoch 100: ~0.05. Epoch 299: ~0.01. The loss has a computational graph attached for backward().
45opt.zero_grad() — Clear old gradients
Resets all parameter gradients to zero. PyTorch accumulates gradients by default (adding new gradients to existing ones). Without this, gradients from batch 1 would carry into batch 2, corrupting the update.
EXECUTION STATE
📚 opt.zero_grad() = Sets .grad = None for every parameter tracked by the optimizer. Equivalent to: for p in model.parameters(): p.grad = None
46loss.backward() — Backpropagation
Computes gradients for ALL model parameters via reverse-mode automatic differentiation. This single call replaces our entire NumPy backward pass (dZ2, dW2, db2, dH, dZ1, dW1, db1). PyTorch traces the computational graph from loss back to every weight.
EXECUTION STATE
📚 loss.backward() = Traverses the computational graph in reverse: loss → logits → Linear → ReLU → Linear → input. Computes d(loss)/d(param) for every parameter and stores it in param.grad.
47opt.step() — Update weights with Adam
Applies Adam’s update rule to every parameter: updates momentum m, updates squared gradient v, applies bias correction, then updates the weight. This is the equivalent of our W -= lr*dW but with adaptive per-parameter rates.
EXECUTION STATE
📚 opt.step() = For each param p: m = β1*m + (1-β1)*p.grad, v = β2*v + (1-β2)*p.grad², p -= lr * m̂/(√v̂ + ε). Each weight gets a custom effective learning rate based on its gradient history.
49model.eval() — Evaluation mode
Switches to evaluation mode: disables dropout and uses running statistics for batch norm. For our simple MLP, the behavior is identical to train mode, but always switch modes as a habit.
EXECUTION STATE
📚 model.eval() = Sets self.training = False. Returns the model, so you can chain: model.eval().to(’cpu’). Always pair with model.train() when resuming training.
Inside this block, PyTorch skips building the computational graph for gradient computation. This saves memory and speeds up inference. We don’t need gradients during evaluation — only during training.
EXECUTION STATE
📚 torch.no_grad() = Context manager that disables autograd. Tensors created inside have requires_grad=False. Saves ~50% memory and is ~20% faster than with gradients.
Computes training accuracy in one line. The logit > 0 check is equivalent to sigmoid > 0.5 (sigmoid(0) = 0.5). We compare predictions with labels, convert booleans to float (True=1, False=0), and take the mean.
EXECUTION STATE
model(X_t).squeeze(-1) = Forward pass on all 240 training samples. Returns (240,) logits.
> 0 = Threshold at 0 (same as sigmoid > 0.5). Returns boolean tensor.
== y_t.bool() = Element-wise comparison: True where prediction matches label. y_t.bool() converts 0.0/1.0 to False/True.
.float().mean() = Converts True/False to 1.0/0.0, then averages. Result is the fraction of correct predictions.
52v_acc = ... — Validation accuracy
Same accuracy computation on the 60 validation samples. If val_acc << train_acc, the model is overfitting (memorizing training data). If both are high, the model generalizes well.
EXECUTION STATE
⬆ return (t_acc, v_acc) = Two floats: training accuracy and validation accuracy. .item() extracts the Python float from the 0-d tensor.
53return t_acc.item(), v_acc.item()
Returns accuracy as plain Python floats. .item() converts a single-element tensor to a float. This is needed for printing — f-strings work better with Python floats than torch tensors.
EXECUTION STATE
📚 .item() = Extracts the scalar value from a 0-dimensional tensor. tensor(0.95).item() = 0.95 (Python float). Only works on single-element tensors.
55# Compare architectures
Now we systematically test five architectures: three single-hidden-layer networks (varying width) and two two-hidden-layer networks (varying width). This reveals how architecture affects generalization.
56configs = { ... } — Architecture dictionary
Maps architecture names to hidden layer dimension lists. The name uses → arrows to show data flow. Each list is passed to MLP() as hidden_dims.
EXECUTION STATE
2→4→1: [4] = Narrow: 17 params. 4 hidden neurons may not be enough for the curved boundary.
2→16→1: [16] = Medium: 65 params. Same as our NumPy network. Should work well.
2→32→1: [32] = Wide: 129 params. Extra capacity — risk of overfitting?
2→8→8→1: [8, 8] = Deep: 105 params. Two hidden layers of 8 neurons each.
2→16→16→1: [16, 16] = Deep + wide: 337 params. Most capacity. Does depth help for 2D data?
63torch.manual_seed(42) — Reproducible comparison
Sets PyTorch’s random seed before the comparison loop. This ensures each model gets a consistent random initialization, making the comparison fair.
64for name, hidden in configs.items(): — Test each architecture
Iterates over all five architectures. For each: builds the model, counts parameters, trains for 300 epochs, and evaluates.
LOOP TRACE · 5 iterations
2→4→1
Result = params=17, train=100.0%, val=96.7% — only 17 params but reaches 100% train!
2→16→1
Result = params=65, train=100.0%, val=98.3% — our standard network, great generalization
2→32→1
Result = params=129, train=100.0%, val=98.3% — more params but same val accuracy
2→8→8→1
Result = params=105, train=100.0%, val=98.3% — depth doesn’t help on this simple 2D task
2→16→16→1
Result = params=337, train=100.0%, val=98.3% — 5× more params, same accuracy. Overkill!
65model = MLP(2, hidden, 1) — Build model
Creates a fresh MLP instance with the specified architecture. Each call to MLP() runs __init__ which builds the layers and applies He initialization.
66params = sum(p.numel() for p in model.parameters())
Counts total learnable parameters using a generator expression. model.parameters() yields each weight/bias tensor, p.numel() counts elements in each.
EXECUTION STATE
📚 p.numel() = Number of elements in tensor p. For a (16, 2) weight matrix: 32 elements. For a (16,) bias: 16 elements.
67t_acc, v_acc = train_and_eval(model) — Train and measure
Calls our training function. Returns train and validation accuracy after 300 epochs of Adam optimization.
1import torch
2import torch.nn as nn
3from torch.utils.data import DataLoader, TensorDataset
45classMLP(nn.Module):6def__init__(self, in_dim, hidden_dims, out_dim):7super().__init__()8 layers =[]9 prev = in_dim
10for h in hidden_dims:11 layers.append(nn.Linear(prev, h))12 layers.append(nn.ReLU())13 prev = h
14 layers.append(nn.Linear(prev, out_dim))15 self.net = nn.Sequential(*layers)16 self._init_weights()1718def_init_weights(self):19for m in self.modules():20ifisinstance(m, nn.Linear):21 nn.init.kaiming_normal_(m.weight, nonlinearity='relu')22 nn.init.zeros_(m.bias)2324defforward(self, x):25return self.net(x)2627# ── Prepare data with train/val split (80/20) ──28X_t = torch.tensor(X[:240], dtype=torch.float32)29y_t = torch.tensor(y[:240], dtype=torch.float32)30X_v = torch.tensor(X[240:], dtype=torch.float32)31y_v = torch.tensor(y[240:], dtype=torch.float32)3233loader = DataLoader(TensorDataset(X_t, y_t),34 batch_size=32, shuffle=True)3536# ── Training function ──37deftrain_and_eval(model, epochs=300, lr=0.01):38 opt = torch.optim.Adam(model.parameters(), lr=lr)39 loss_fn = nn.BCEWithLogitsLoss()4041for epoch inrange(epochs):42 model.train()43for xb, yb in loader:44 logits = model(xb).squeeze(-1)45 loss = loss_fn(logits, yb)46 opt.zero_grad()47 loss.backward()48 opt.step()4950 model.eval()51with torch.no_grad():52 t_acc =((model(X_t).squeeze(-1)>0)== y_t.bool()).float().mean()53 v_acc =((model(X_v).squeeze(-1)>0)== y_v.bool()).float().mean()54return t_acc.item(), v_acc.item()5556# ── Compare architectures ──57configs ={58"2→4→1":[4],59"2→16→1":[16],60"2→32→1":[32],61"2→8→8→1":[8,8],62"2→16→16→1":[16,16],63}6465torch.manual_seed(42)66for name, hidden in configs.items():67 model = MLP(2, hidden,1)68 params =sum(p.numel()for p in model.parameters())69 t_acc, v_acc = train_and_eval(model)70print(f"{name:14s} params={params:4d} train={t_acc:.1%} val={v_acc:.1%}")
Compare this with the NumPy version. The PyTorch code has zero manual gradient computation. The entire backward pass is replaced by loss.backward(). The weight update is replaced by opt.step(). All the mathematical complexity we manually coded in NumPy — chain rule, matrix transposes, ReLU gradients — is handled automatically by autograd.
Concept
NumPy (Manual)
PyTorch (Automatic)
Forward pass
Z1 = X @ W1 + b1
model(X)
Loss
Manual BCE formula
nn.BCEWithLogitsLoss()
Backward pass
15 lines of chain rule
loss.backward()
Weight update
W -= lr * dW
opt.step()
Batching
Manual loop over samples
DataLoader
GPU support
Not available
model.to('cuda')
Interactive: Decision Boundary Explorer
This visualization shows a real neural network learning to classify two-moons data in real time. The background color shows the network's classification: blue regions are predicted class 0, orange regions are predicted class 1. The boundary between colors is the decision boundary — the curve where σ(z)=0.5.
Loading Decision Boundary Explorer...
Try these experiments:
Width = 4, Depth = 1: Watch the boundary struggle. With only 4 ReLU neurons (5 linear regions), the network can only make a crude, angular boundary.
Width = 16, Depth = 1: Now 17 linear regions are available. The boundary becomes smooth, curving neatly between the moons. This matches our NumPy result.
Width = 4, Depth = 3: Can depth compensate for narrow width? Theoretically yes (O(43)=64 regions), but optimization may struggle with the deeper network.
Learning rate = 3.0: Try cranking up the learning rate. The boundary may oscillate wildly or even diverge. Too-fast learning destroys the structure the network has built.
Mini-Batch Training and Validation
Why Mini-Batches?
Our NumPy code used full-batch gradient descent: compute the gradient over all 300 samples, then update weights once. This gives the most accurate gradient estimate but has three problems at scale:
Memory: With 1 million images, computing all activations simultaneously requires storing 1M × width matrices in RAM/VRAM. A batch of 32 uses 31,000× less memory.
Speed: One full-batch update is much more work than 31,250 mini-batch updates (each using 32 samples). Mini-batch SGD makes progress after seeing just 32 examples.
Regularization: The noise from random mini-batches actually helps generalization. Full-batch gradient descent finds sharp minima; mini-batch SGD tends to find flatter minima that generalize better (Smith & Le, 2018).
Strategy
Samples per Update
Updates per Epoch
Gradient Noise
Full-batch (our NumPy code)
300
1
None (exact gradient)
Mini-batch (batch_size=32)
32
8
Moderate (good balance)
Stochastic (batch_size=1)
1
300
High (noisy but fast)
In practice, batch sizes of 32–256 give the best tradeoff. Our PyTorch code uses batch_size=32 with shuffle=True, meaning each epoch presents the data in a different random order, with 8 weight updates per epoch.
Train vs Validation: Detecting Overfitting
We split the 300 samples into 240 training and 60 validation. Training data is used to compute gradients and update weights. Validation data is held out — the network never sees it during training. By comparing train accuracy and validation accuracy, we can detect overfitting:
train \u2248 val (e.g., 100% / 98.3%): Good generalization. The model has learned the true pattern.
train >> val (e.g., 100% / 70%): Overfitting. The model memorized training data but fails on new data.
train \u2248 val but both low (e.g., 75% / 73%): Underfitting. The model is too simple for the task.
Our 2→16→1 network achieves 100% train / 98.3% val — excellent generalization. The 1.7% gap (1 misclassified validation point out of 60) is likely due to noisy outliers, not overfitting.
Production practice: Large models use a three-way split: train (80%), validation (10%), test (10%). Validation guides hyperparameter tuning (learning rate, architecture). Test is touched once at the very end to report the final number. Using test data for tuning is a form of information leakage.
Architecture Comparison
Our PyTorch experiment compared five architectures on the two-moons task:
Architecture
Parameters
Train Accuracy
Val Accuracy
Verdict
2→4→1
17
100.0%
96.7%
Tight fit, slight underfit on val
2→16→1
65
100.0%
98.3%
Sweet spot for this task
2→32→1
129
100.0%
98.3%
Extra capacity unused
2→8→8→1
105
100.0%
98.3%
Depth = no benefit here
2→16→16→1
337
100.0%
98.3%
5× more params, same result
Several important observations emerge:
All architectures reach 100% training accuracy. The two-moons task is simple enough that even 17 parameters can memorize the training set with Adam (but not necessarily generalize).
Width ≥ 16 reaches the generalization ceiling. Going from 4 to 16 hidden neurons improves val accuracy from 96.7% to 98.3%. But going from 16 to 32 adds no benefit — we hit the noise floor.
Depth does not help for this task. 2→8→8→1 (105 params) performs identically to 2→16→1 (65 params). The two-moons boundary has no hierarchical structure — it is a single smooth curve. Depth excels on tasks with hierarchy (edges → shapes → objects).
Diminishing returns. 2→16→16→1 has 5× more parameters than 2→16→1 but identical accuracy. Those extra parameters are wasted capacity that could slow training or cause overfitting on harder tasks.
The architect's lesson: Start with the simplest architecture that works. Add capacity only when you see underfitting (training accuracy too low). Add depth only when the task has hierarchical structure. For two moons, a single hidden layer with 16 neurons is the right answer.
Connection to Modern Systems
Every concept from this section scales directly to modern deep learning systems:
The nn.Module Pattern Everywhere
Every major architecture — ResNet, Transformer, Diffusion models, LLaMA — is implemented as a class extending nn.Module. The pattern is always the same: define layers in __init__, define data flow in forward. A GPT-4-scale model is just a very large nn.Module with hundreds of sub-modules nested inside each other.
BCE Loss in Language Models
While language models use categorical cross-entropy (the multi-class generalization of BCE), the core idea is identical: measure the information-theoretic distance between predicted probabilities and true labels. For a vocabulary of 50,000 tokens, the output layer has 50,000 neurons instead of 1, and we use nn.CrossEntropyLoss instead of nn.BCEWithLogitsLoss. But the gradient is still y^−y — the beautiful cancellation carries over.
DataLoader at Scale
Our DataLoader with batch_size=32 and shuffle=True is the same pattern used for training LLaMA on trillions of tokens. The only additions at scale are: num_workers for parallel data loading across CPU cores, pin_memory for faster CPU-to-GPU transfer, and DistributedSampler for splitting data across multiple GPUs. The core abstraction — iterate over shuffled mini-batches — is unchanged.
Train/Eval Mode in Production
Our model.train() / model.eval() switching becomes critical in modern networks. Transformers use dropout (randomly zeroing activations during training for regularization) and layer normalization (which may behave differently in train vs eval modes in some implementations). The KV-cache optimization used during inference (storing previously computed key/value pairs) only works in eval mode. In Flash Attention, the causal mask applied during training is different from the mask during generation. The two modes are not cosmetic — they enable fundamental optimizations.
Architecture Search at Scale
Our manual comparison of 5 architectures is the simplest form of architecture search. At scale, teams use Neural Architecture Search (NAS) to automatically explore thousands of architectures. Google's EfficientNet and Meta's RegNet were discovered by automated architecture search. The principle is the same: define a search space, train each candidate, and pick the one with the best validation accuracy per parameter.
Summary
This section bridged the gap from architecture theory to practical PyTorch implementation:
Classification requires sigmoid + cross-entropy. Sigmoid converts logits to probabilities. Cross-entropy provides gradients proportional to prediction error, even when sigmoid saturates. The gradient y^−y is elegant and numerically stable.
nn.Module is the universal pattern. Define layers in __init__, define data flow in forward(). Every PyTorch network from a 65-parameter MLP to GPT-4 follows this blueprint.
The training pipeline is fixed. DataLoader → forward → loss → backward → step. This 5-line inner loop does not change between our tiny MLP and billion-parameter models.
Mini-batch training saves memory and improves generalization. Batch size 32 with shuffling provides a good balance of gradient accuracy and computational efficiency.
Train/validation splits detect overfitting. When train accuracy >> val accuracy, the model is memorizing instead of learning. Our 2→16→1 MLP shows healthy generalization: 100% train, 98.3% val.
Architecture matters, but diminishing returns are real. On two moons, 16 hidden neurons is the sweet spot. Going wider or deeper wastes parameters without improving accuracy.
In the next section, we explore the Universal Approximation Theorem — the mathematical guarantee that a sufficiently wide MLP can approximate any continuous function to any desired precision.