Chapter 10

Network Architecture Design

Multi-Layer Perceptrons

Learning Objectives

By the end of this section, you will be able to:

  1. Identify the four design choices that define an MLP: width, depth, activation function, and weight initialization
  2. Explain the effect of width on learning capacity and show that more neurons means more "bends" in the function the network can represent
  3. Explain the effect of depth and why deeper networks can represent exponentially more complex functions per parameter
  4. Compare activation functions (ReLU, Leaky ReLU, GELU, Tanh) and know when to use each
  5. Apply He and Xavier initialization and explain why proper initialization prevents signals from vanishing or exploding
  6. Build flexible MLPs in PyTorch using nn.Sequential and nn.init

The Architecture Question

In Chapters 7–9, we learned how neural networks learn: forward propagation computes predictions, backpropagation computes gradients, and optimizers update weights. But we always used the same architecture: 4→3→4 with ReLU and random initialization.

Why 3 hidden neurons? Why not 2, or 8, or 100? Why one hidden layer instead of three? Why ReLU instead of tanh? These choices were arbitrary — and that's a problem, because architecture determines what a network can and cannot learn.

The Central Question of This Chapter: Given a task (like diagonal flip), how do we choose the right architecture? What are the design knobs, and what happens when we turn them?

Think of it like building a house. Knowing how to lay bricks (forward pass), mix mortar (backprop), and plan the schedule (optimizer) is essential — but it does not tell you how many rooms to build or how tall to make the ceilings. Architecture is the blueprint that determines what the finished structure can do.


What Defines an MLP?

A Multi-Layer Perceptron (MLP) — also called a feedforward neural network or fully connected network — has four architectural choices:

| Design Choice | What It Controls | Example |
|---|---|---|
| Width | Neurons per hidden layer | 4→8→4 (8 neurons wide) |
| Depth | Number of hidden layers | 4→4→4→4 (2 hidden layers) |
| Activation | Non-linearity between layers | ReLU, GELU, Tanh |
| Initialization | Starting weight values | He, Xavier, random |

Each choice has consequences. Width controls how many features a single layer can detect. Depth controls how many levels of abstraction the network builds. Activation controls the shape of the non-linearity. Initialization controls whether training starts in a healthy state.

Let us examine each one, starting with the most intuitive: width.


Width: How Many Neurons Per Layer?

Consider our diagonal flip task. The network must learn to swap positions 1 and 2 of a 4-element vector while keeping positions 0 and 3 unchanged. How many hidden neurons does it need?

The Effect of Width

Each hidden neuron with ReLU activation contributes one "bend", a point where the function changes slope. A neuron computes $h_j = \max(0, \mathbf{w}_j^T \mathbf{x} + b_j)$. This is a hinge function: linear on one side, zero on the other. The boundary between the two sides is a hyperplane defined by $\mathbf{w}_j^T \mathbf{x} + b_j = 0$.

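With $n$ hidden neurons in a single layer, the network can produce at most $n + 1$ linear pieces along any straight line through the input space, because each neuron's hyperplane crosses the line at most once. For a one-dimensional input this is the total number of linear regions; in higher dimensions the hyperplanes can intersect to carve out more regions, but $n + 1$ remains a convenient measure of how capacity grows with width. Within each region, the network computes a different linear function. The more regions, the more complex the overall function.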
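The $n + 1$ count can be checked directly for a one-dimensional input. The sketch below is illustrative only: the helper name `count_linear_pieces` and the example weights are assumptions, not part of the chapter's code. It places each neuron's kink at $x = -b/w$ and counts how often the slope of the output $\sum_j v_j \max(0, w_j x + b_j)$ changes.

```python
def count_linear_pieces(w, b, v):
    """Count linear pieces of f(x) = sum_j v[j] * relu(w[j] * x + b[j]) on the real line."""
    # Each neuron with a nonzero weight has exactly one kink, at x = -b / w.
    kinks = sorted({-bj / wj for wj, bj in zip(w, b) if wj != 0})
    if not kinks:
        return 1

    # One probe point strictly inside every interval between consecutive kinks.
    probes = [kinks[0] - 1.0]
    probes += [(lo + hi) / 2 for lo, hi in zip(kinks, kinks[1:])]
    probes.append(kinks[-1] + 1.0)

    def slope(x):
        # Only neurons that are active at x contribute v * w to the slope.
        return sum(vj * wj for wj, bj, vj in zip(w, b, v) if wj * x + bj > 0)

    slopes = [slope(x) for x in probes]
    changes = sum(1 for s0, s1 in zip(slopes, slopes[1:]) if abs(s1 - s0) > 1e-12)
    return changes + 1


# Three hidden neurons with kinks at x = 0, 1, 2 (hypothetical weights).
print(count_linear_pieces(w=[1.0, 1.0, 1.0], b=[0.0, -1.0, -2.0], v=[1.0, -2.0, 2.0]))  # 4
```

The count can fall below $n + 1$ when two kinks coincide or an output weight is zero, which is why $n + 1$ is an upper bound rather than a guarantee.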

| Width | Parameters | Max Linear Regions | Final Loss (200 epochs) |
|---|---|---|---|
| 2 neurons | 22 | 3 | 0.0650 |
| 3 neurons (Ch 7-9) | 31 | 4 | 0.0640 |
| 8 neurons | 76 | 9 | 0.0243 |
| 16 neurons | 148 | 17 | 0.0005 |

The pattern is clear: more neurons → lower loss. With only 2 neurons, the network can create 3 linear regions — not enough to capture the diagonal flip mapping for all 16 images. With 16 neurons, it has 17 regions and achieves nearly zero loss.

Why not always use maximum width? More parameters means more memory, slower training, and a higher risk of overfitting (memorizing the training data instead of learning the pattern). With only 16 training images and 148 parameters, the network has over 9 parameters per training example — dangerous territory. We will study overfitting and regularization in Chapter 12.

Width as Feature Detectors

Each hidden neuron acts as a feature detector. It learns to respond to a specific pattern in the input. For example, one neuron might detect "position 1 is 1 AND position 2 is 0", while another detects the reverse. The output layer then combines these features to produce the final mapping.

With too few neurons, the network lacks enough feature detectors to distinguish all the patterns it needs. It is forced to make compromises — mapping several distinct inputs to similar outputs. This is underfitting: the model is too simple for the task.


Depth: How Many Hidden Layers?

Width is not the only way to add capacity. We can also add depth — more hidden layers. And depth has a remarkable property: it adds capacity exponentially rather than linearly.

Why Depth Matters

Consider a single hidden layer with $n$ ReLU neurons. It creates at most $n + 1$ linear regions. Now add a second hidden layer with $n$ neurons. Each neuron in the second layer can "fold" the first layer's regions, effectively multiplying the number of regions:

  • 1 hidden layer with $n$ neurons: up to $n + 1$ regions
  • 2 hidden layers with $n$ neurons each: up to $O(n^2)$ regions
  • $L$ hidden layers with $n$ neurons each: up to $O(n^L)$ regions

This is the exponential advantage of depth. A network with 4 inputs and 3 hidden layers of 4 neurons each has only $3 \times (4 \times 4 + 4) = 60$ parameters in its hidden layers (80 once a 4-unit output layer is added), but can theoretically represent on the order of $4^3 = 64$ linear regions. A single-hidden-layer network would need 63 neurons to match that number of regions, which for a 4→63→4 network means $4 \times 63 + 63 + 63 \times 4 + 4 = 571$ parameters.
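The "folding" effect can be seen in a classic one-dimensional construction, sketched here as an illustration rather than as the chapter's own code. Each layer uses two ReLU neurons to build a tent shape on $[0, 1]$; stacking the same layer `depth` times doubles the number of linear pieces each time. The names `tent_layer`, `deep_tent`, and `count_linear_pieces` are hypothetical helpers.

```python
def relu(x):
    return max(0.0, x)

def tent_layer(x):
    # Two ReLU neurons, relu(x) and relu(x - 0.5), combined linearly:
    # rises from 0 to 1 on [0, 0.5], then falls back to 0 on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_tent(x, depth):
    # Apply the same 2-neuron layer `depth` times.
    for _ in range(depth):
        x = tent_layer(x)
    return x

def count_linear_pieces(f, n_intervals):
    # Sample f on a uniform grid over [0, 1] and count slope changes.
    # The grid must be fine enough that every kink lies on a grid point.
    xs = [i / n_intervals for i in range(n_intervals + 1)]
    ys = [f(x) for x in xs]
    slopes = [(ys[i + 1] - ys[i]) * n_intervals for i in range(n_intervals)]
    changes = sum(1 for s0, s1 in zip(slopes, slopes[1:]) if abs(s1 - s0) > 1e-9)
    return changes + 1

for depth in (1, 2, 3, 4):
    pieces = count_linear_pieces(lambda x: deep_tent(x, depth), 8 * 2 ** depth)
    print(f"depth={depth} neurons={2 * depth} pieces={pieces}")
# depth=1 neurons=2 pieces=2
# depth=2 neurons=4 pieces=4
# depth=3 neurons=6 pieces=8
# depth=4 neurons=8 pieces=16
```

With 8 neurons arranged in 4 layers the function has 16 pieces, whereas 8 neurons in a single layer could produce at most 9 along a line. This hand-built example shows what depth makes possible; it says nothing about whether gradient descent finds such a solution.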

But deeper is not always better. In practice, the exponential gain is limited by two factors: (1) vanishing gradients, where the error signal weakens as it flows backward through many layers, making early layers learn slowly; and (2) optimization difficulty, since deeper loss landscapes have more saddle points and flat regions. For our tiny diagonal flip task, a network with two hidden layers actually converges more slowly than one with a single hidden layer because the optimization challenge outweighs the expressiveness gain at this scale.

Depth vs Width: A Practical Comparison

| Architecture | Parameters | Theoretical Regions | Best For |
|---|---|---|---|
| 4→16→4 (wide) | 148 | 17 | Simple pattern matching |
| 4→4→4→4 (deep) | 60 | ~16 | Hierarchical features |
| 4→8→8→4 (balanced) | 148 | ~64 | General-purpose |
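Parameter counts follow mechanically from the layer sizes, so they are easy to verify. The helper below is a small sketch (the name `count_params` is an assumption) that sums weights plus biases for every pair of adjacent layers in a fully connected network.

```python
def count_params(layer_sizes):
    """Total weights + biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

for sizes in ([4, 2, 4], [4, 16, 4], [4, 4, 4, 4], [4, 8, 8, 4], [4, 63, 4]):
    print("→".join(map(str, sizes)), count_params(sizes))
# 4→2→4 22
# 4→16→4 148
# 4→4→4→4 60
# 4→8→8→4 148
# 4→63→4 571
```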

For small problems like diagonal flip, width often wins because the optimization is easier. For complex problems like image recognition or language understanding, depth wins because real-world data has hierarchical structure — edges combine into shapes, shapes combine into objects, objects combine into scenes. Each layer captures one level of abstraction.

The depth principle: If your task has natural hierarchy (low-level features composing into high-level concepts), add depth. If your task is a complex mapping without obvious hierarchy, add width. Most real-world tasks have hierarchy, which is why modern networks are deep.

Interactive: Architecture Explorer

Use the sliders below to adjust the number of neurons in each layer. Watch how the network structure changes — more neurons mean more connections (weights), and each connection is a learnable parameter. Click "Animate Forward Pass" to see data flow through the layers.


Try these experiments:

  1. Set all hidden layers to 1 neuron — notice the extreme bottleneck. Information must squeeze through a single number.
  2. Set a hidden layer to 8 neurons — the web of connections grows dramatically. Each connection is a weight the optimizer must tune.
  3. Compare 4→8→4 versus 4→4→4→4 — similar parameter counts, but very different structures.

Interactive: Width vs Depth

This visualization shows a real neural network learning a curved target function (dashed line). Adjust width and depth, then click Train to watch the cyan prediction line converge toward the target. The network diagram on the left updates to show the current architecture.


Try these experiments:

  1. Width = 2, Depth = 1: Train and watch the network struggle. With only 2 neurons, it can make at most 3 linear segments — not enough to match the wavy target.
  2. Width = 8, Depth = 1: Now 9 segments are possible. The fit is much better, but look at the parameter count.
  3. Width = 4, Depth = 3: Fewer parameters than Width=8/Depth=1, but potentially more linear regions due to the exponential depth effect. Does it learn better?
  4. Width = 16, Depth = 1: Brute force with many neurons. Fast convergence but many parameters.

Activation Functions: Beyond ReLU

We have used ReLU exclusively so far. But there are several activation functions, each with distinct properties. The choice of activation is the third architectural decision.

The Activation Function Zoo

| Function | Formula | Range | Key Property |
|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | Fast, sparse, but dead neurons |
| Leaky ReLU | max(αx, x), α=0.01 | (-∞, ∞) | No dead neurons |
| GELU | x · Φ(x) | ≈ (-0.17, ∞) | Smooth, used in transformers |
| Tanh | tanh(x) | (-1, 1) | Zero-centered, bounded |
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | Output as probability |

ReLU: The Workhorse

ReLU ($\max(0, x)$) is the default activation for hidden layers. It is fast to compute (just a comparison), produces sparse activations (many neurons output exactly 0), and has a simple gradient (1 if active, 0 if not).

The main weakness is the dying ReLU problem. If a neuron's pre-activation is always negative (for all training inputs), its gradient is always 0, and it can never recover. The neuron is permanently "dead." This happens more with large learning rates or poor initialization.

Leaky ReLU: Fixing Dead Neurons

Leaky ReLU replaces the flat zero region with a small negative slope: $f(x) = \max(\alpha x, x)$ where $\alpha = 0.01$. Now even negative inputs produce a small gradient ($\alpha$), so no neuron can completely die. The output is $x$ when positive, $0.01x$ when negative.
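To make the dead-neuron argument concrete, here is a small sketch with a hypothetical neuron whose weights and bias were chosen by hand (they are not learned values) so that its pre-activation is negative on all 16 binary inputs. With ReLU the neuron passes back no gradient for any input; with Leaky ReLU every input still contributes a gradient of $\alpha$.

```python
from itertools import product

# All 16 flattened 2x2 binary images.
inputs = list(product([0.0, 1.0], repeat=4))

# Hypothetical "dead" neuron: z = w.x + b is negative for every input above.
w = [-1.0, -1.0, -1.0, -1.0]
b = -0.5

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

pre_acts = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in inputs]

relu_total = sum(relu_grad(z) for z in pre_acts)
leaky_total = sum(leaky_relu_grad(z) for z in pre_acts)

print(f"ReLU:       summed local gradient = {relu_total:.2f}")   # 0.00
print(f"Leaky ReLU: summed local gradient = {leaky_total:.2f}")  # 0.16
```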

GELU: The Transformer Standard

GELU (Gaussian Error Linear Unit) is used in GPT, BERT, and most modern transformers. It is defined as $\text{GELU}(x) = x \cdot \Phi(x)$ where $\Phi(x)$ is the standard Gaussian CDF. Unlike ReLU's hard cutoff at 0, GELU provides a smooth transition. Small negative values are slightly attenuated rather than completely zeroed out. This smoothness helps gradient-based optimization, especially in very deep networks.
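The definition translates directly into code, since the Gaussian CDF can be written with the error function: $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. The sketch below uses only Python's standard library and compares GELU with ReLU at a few points; the function names are illustrative.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")

# Smallest GELU value on a grid over [-3, 0]; it sits near x = -0.75.
lowest = min(gelu(i / 100) for i in range(-300, 1))
print(f"minimum of GELU is about {lowest:.3f}")  # about -0.170
```

The small negative dip is where the lower end of GELU's range (roughly -0.17) comes from.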

Tanh and Sigmoid: The Classics

Before ReLU, tanh and sigmoid were standard. Tanh ($\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$) outputs values in $(-1, 1)$ and is zero-centered, which helps optimization. Sigmoid ($\sigma(x) = \frac{1}{1 + e^{-x}}$) outputs values in $(0, 1)$ and is mainly used for output layers where we need probabilities.

The main problem with both is vanishing gradients: for large $|x|$, the derivative approaches 0. In deep networks, this means early layers receive almost no gradient signal and learn extremely slowly. ReLU mitigates this by having a constant gradient of 1 for positive inputs.
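A few numbers make the saturation visible. The sketch below evaluates the derivatives $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and $\tanh'(x) = 1 - \tanh^2(x)$ at increasing $|x|$, then multiplies the best-case sigmoid derivative across 10 layers. Weights are ignored here, which is a simplifying assumption made only to isolate the effect of the activation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:4.1f}  sigmoid'={sigmoid_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")

# The sigmoid derivative never exceeds 0.25 (reached at x = 0), so even in the
# best case the activation alone shrinks the gradient by 0.25 per layer.
best_case_10_layers = sigmoid_grad(0.0) ** 10
print(f"best-case factor after 10 sigmoid layers: {best_case_10_layers:.2e}")  # 9.54e-07
```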

When to use what: Use ReLU as your default for hidden layers. Use GELU if you are building a transformer or very deep network. Use Leaky ReLU if you observe many dead neurons. Use sigmoid only at the output layer for binary classification. Use tanh for RNNs or when zero-centered output is important.

Weight Initialization: Why It Matters

The fourth architectural choice is how to set the initial weight values. This might seem trivial — they are just starting values that will be trained away — but initialization has a dramatic impact on whether training succeeds at all.

The Variance Problem

Consider a layer with $n$ inputs: $z = \sum_{i=1}^{n} w_i x_i$. If the inputs and weights are independent with zero mean, the inputs have variance $\text{Var}(x) = 1$, and the weights have variance $\text{Var}(w) = \sigma^2$, then by the properties of independent random variables:

$$\text{Var}(z) = n \cdot \sigma^2 \cdot \text{Var}(x) = n\sigma^2$$

The output variance is $n$ times the weight variance. For a layer with $n = 512$ inputs and naive $\sigma = 1$ initialization, the output variance is 512, so the signal explodes: its standard deviation grows by a factor of about 22.6 ($\sqrt{512}$) per layer, activations become astronomically large after only a few layers, and a deep enough stack overflows. Conversely, if $\sigma = 0.01$, the output variance is $512 \times 0.0001 = 0.0512$, and the signal vanishes toward zero.
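The relation $\text{Var}(z) = n\sigma^2$ is easy to confirm empirically. The following NumPy sketch draws many independent input and weight vectors (zero-mean Gaussians, which is an assumption of this demo) and compares the measured variance of $z$ with the prediction. The sample count and the function name `output_variance` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples = 512, 5000

def output_variance(sigma):
    # Each row is one independent draw of an input vector and a weight vector.
    x = rng.standard_normal((samples, n))           # Var(x) = 1
    w = rng.standard_normal((samples, n)) * sigma   # Var(w) = sigma^2
    z = np.sum(w * x, axis=1)                       # z = sum_i w_i * x_i
    return z.var()

for sigma in (1.0, 0.01):
    measured = output_variance(sigma)
    predicted = n * sigma ** 2
    print(f"sigma={sigma}: measured Var(z)={measured:.4f}, predicted={predicted:.4f}")
```

The measured values fluctuate by a few percent from run to run with a different seed, but they track $n\sigma^2$ closely.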

Xavier Initialization (for Tanh/Sigmoid)

Xavier Glorot and Yoshua Bengio (2010) proposed setting $\sigma = \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}$. This keeps the variance of both the forward pass and the backward pass approximately constant from layer to layer. It is designed for linear or tanh activations that approximately preserve their input variance.

He Initialization (for ReLU)

Kaiming He (2015) observed that ReLU zeroes approximately half the activations (all negative values become 0), which roughly halves the variance. To compensate, He init uses $\sigma = \sqrt{\frac{2}{n_{\text{in}}}}$, doubling the variance compared to Xavier when $n_{\text{in}} = n_{\text{out}}$. This is the standard for ReLU networks and what we have been using in our code.
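The difference between the two scales compounds with depth. The NumPy sketch below pushes a random batch through 10 square ReLU layers and reports the mean squared activation at the end. The width, depth, batch size, and helper names are assumptions chosen for the demonstration, not values from the chapter.

```python
import numpy as np

def final_mean_square(weight_std, width=512, depth=10, batch=1000, seed=0):
    """Mean squared activation after `depth` ReLU layers of size width x width."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))  # unit-variance inputs
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std(width, width)
        h = np.maximum(0.0, h @ W)            # linear layer followed by ReLU
    return float(np.mean(h ** 2))

def he_std(n_in, n_out):
    return np.sqrt(2.0 / n_in)

def xavier_std(n_in, n_out):
    return np.sqrt(2.0 / (n_in + n_out))

he_ms = final_mean_square(he_std)
xavier_ms = final_mean_square(xavier_std)
print(f"He:     mean square after 10 layers = {he_ms:.4f}")
print(f"Xavier: mean square after 10 layers = {xavier_ms:.6f}")
print(f"ratio = {he_ms / xavier_ms:.1f}")
```

With He scaling the signal stays on the order of 1, while with Xavier scaling it is halved at every ReLU layer. Because both runs share a seed and differ only by a factor of $\sqrt{2}$ in every weight matrix, the ratio comes out as $2^{10} = 1024$.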

| Initialization | Scale σ | Best For | Epoch 0 Loss | Epoch 99 Loss |
|---|---|---|---|---|
| Small random (σ=0.01) | 0.01 | Nothing (don't use) | 0.2302 | 0.0984 |
| Large random (σ=1.0) | 1.0 | Nothing (don't use) | 1.3312 | 0.0658 |
| Xavier | √(2/(n_in+n_out)) | Tanh, Sigmoid | 0.3532 | 0.0305 |
| He (Kaiming) | √(2/n_in) | ReLU, Leaky ReLU | 0.5485 | 0.0244 |

Look at the epoch 99 losses. He initialization reaches the lowest loss (0.0244), as expected for a scheme designed for ReLU. Xavier is close (0.0305) but slightly worse because it underestimates the variance needed for ReLU. Small random initialization ends with a loss about 4× higher (0.0984) because the signal vanishes through the layers and learning is slow. Large random (0.0658) starts with a high loss because the initial predictions are wildly wrong, so early training steps are spent recovering.

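Rule of thumb: Use He initialization with ReLU (in PyTorch, nn.init.kaiming_normal_ or nn.init.kaiming_uniform_; note that nn.Linear's built-in default is a different, smaller-scale uniform scheme, so apply He explicitly if you want it). Use Xavier initialization with tanh or sigmoid. Never pick the scale of a random initialization arbitrarily, because the scale matters enormously.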

NumPy: Comparing Widths from Scratch

Let us put the theory into practice. The following code trains three architectures — 4→2→4, 4→8→4, and 4→16→4 — on our diagonal flip task and compares their final losses. This is the same forward/backward code from Chapters 7–8, now wrapped in a function so we can vary the width.

NumPy — Width Comparison on Diagonal Flip
🐍compare_widths.py
Line 1 · import numpy as np

NumPy provides fast numerical arrays and vectorized operations. We use it for matrix multiplication (@ operator), element-wise operations, and random number generation. All math runs as optimized C code under the hood.

EXECUTION STATE
numpy = Library for numerical computing — ndarray, linear algebra, random numbers, vectorized math
Line 3 · # Dataset: 2×2 binary images → diagonal flip

Our running example from Chapters 7–9. We have 16 possible 2×2 binary images. The task is to flip each image along the diagonal (transpose it). In flattened form: [a,b,c,d] → [a,c,b,d]. Positions 1 and 2 swap.

EXECUTION STATE
diagonal flip = Transpose of 2×2 matrix: [[a,b],[c,d]] → [[a,c],[b,d]]. As flat vector: [a,b,c,d] → [a,c,b,d]. Positions 0,3 stay, positions 1,2 swap.
Line 5 · X = np.array([...]) · All 16 input images

Generates all 16 binary 2×2 images as flattened 4-element vectors using a nested list comprehension. Each of i,j,k,l is 0 or 1, giving 2⁴ = 16 combinations.

EXECUTION STATE
⬇ X shape = (16, 4) — 16 images, each a 4-element vector
X[0] = [0, 0, 0, 0] — all-black image
X[5] = [0, 1, 0, 1]
X[10] = [1, 0, 1, 0]
X[15] = [1, 1, 1, 1] — all-white image
Line 8 · Y = np.array([...]) · All 16 target outputs

For each input x, the target is [x[0], x[2], x[1], x[3]] — swapping positions 1 and 2. This is the flattened diagonal flip.

EXECUTION STATE
⬇ Y shape = (16, 4) — 16 target vectors
X[5]=[0,1,0,1] → Y[5] = [0, 0, 1, 1] — positions 1 and 2 swapped
X[10]=[1,0,1,0] → Y[10] = [1, 1, 0, 0] — positions 1 and 2 swapped
Line 10 · def relu(x): · ReLU activation

ReLU (Rectified Linear Unit): returns x if x > 0, else 0. This is the non-linearity that gives the network its power to learn non-linear mappings. Without it, stacking linear layers would collapse to a single linear transformation.

EXECUTION STATE
⬇ input: x = A NumPy array of any shape. ReLU is applied element-wise.
⬆ returns = np.maximum(0, x) — same shape as input, all negatives replaced by 0
→ Examples = relu([-2, -1, 0, 1, 2]) = [0, 0, 0, 1, 2]
Line 11 · return np.maximum(0, x)

np.maximum(0, x) compares each element of x with 0 and keeps the larger. This is an element-wise operation — it works on arrays of any shape.

EXECUTION STATE
📚 np.maximum(a, b) = NumPy function: element-wise maximum. Compares each pair of elements and returns the larger. np.maximum(0, [-3, 2, -1, 5]) = [0, 2, 0, 5]
Line 14 · def train(width, epochs=200, lr=0.05): · Training function

Trains a single-hidden-layer MLP with the given width on the diagonal flip task. The architecture is always 4→width→4: 4 input neurons, 'width' hidden neurons with ReLU, 4 output neurons (linear). Returns the final loss and parameter count.

EXECUTION STATE
⬇ input: width = Number of neurons in the hidden layer. We will try width = 2, 8, and 16 to see how capacity affects learning.
⬇ input: epochs = 200 = Number of complete passes through all 16 images. Each epoch: forward all 16, backward all 16, update weights 16 times.
⬇ input: lr = 0.05 = Learning rate. Same as Chapter 8–9. Controls step size: w_new = w_old - 0.05 × gradient.
⬆ returns = (final_loss, param_count) — the average loss after training and total learnable parameters
Line 15 · np.random.seed(42) · Reproducible initialization

Sets the random seed so every call to train() starts with the exact same random weights. This lets us compare widths fairly — the only difference is architecture, not luck of initialization.

EXECUTION STATE
📚 np.random.seed(42) = Fixes the random number generator’s starting state. All subsequent np.random.randn() calls produce the same sequence. Seed 42 is a convention (from Hitchhiker’s Guide).
Line 16 · W1 = np.random.randn(4, width) * np.sqrt(2/4) · He initialization

Creates the weight matrix for Layer 1 (input → hidden) using He initialization. The shape is (4, width): 4 input features, 'width' output neurons. He init scales random weights by √(2/fan_in) to keep activations at healthy variance through ReLU layers.

EXECUTION STATE
📚 np.random.randn(4, width) = Draws random values from standard normal distribution N(0,1). Shape (4, width) = fan_in × fan_out.
📚 np.sqrt(2/4) = 0.7071 = He initialization scale factor: √(2/fan_in) = √(2/4) = 0.7071. Named after Kaiming He who proved this preserves variance through ReLU layers. Without proper scaling, signals either explode or vanish as they pass through layers.
⬆ W1 shape = (4, width) — For width=8: shape (4, 8) with 32 weights
→ Why He for ReLU? = ReLU kills ~50% of activations (all negatives become 0). He init compensates by using √(2/n) instead of √(1/n), effectively doubling the variance to account for the halving by ReLU.
Line 17 · b1 = np.zeros(width) · Hidden layer biases

Initializes biases to zero. One bias per hidden neuron. Biases are usually initialized to 0 because the weight initialization already handles the variance. Some practitioners use small positive values for ReLU to avoid dead neurons, but zero is standard.

EXECUTION STATE
b1 shape = (width,) — one bias per hidden neuron. For width=8: [0, 0, 0, 0, 0, 0, 0, 0]
Line 18 · W2 = np.random.randn(width, 4) * np.sqrt(2/width) · Output weights

Weight matrix for Layer 2 (hidden → output). Shape (width, 4): 'width' inputs from hidden layer, 4 outputs. He init with fan_in = width.

EXECUTION STATE
📚 np.sqrt(2/width) = For width=2: √(2/2) = 1.0. For width=8: √(2/8) = 0.5. For width=16: √(2/16) = 0.354. Wider layers → smaller initial weights to compensate for more inputs being summed.
⬆ W2 shape = (width, 4) — For width=8: shape (8, 4) with 32 weights
Line 19 · b2 = np.zeros(4) · Output biases

Four output biases, one per output neuron. Always initialized to zero.

EXECUTION STATE
b2 = [0, 0, 0, 0] — one bias per output neuron
Line 21 · for epoch in range(epochs): · Training loop

Each epoch processes all 16 images once. With 200 epochs, the network sees each image 200 times. The loss should decrease over epochs as the weights improve.

LOOP TRACE · 7 iterations
width=2, epoch=0
avg loss = 0.3863 — high error, random predictions
width=2, epoch=50
avg loss = 0.0935 — learning, but plateauing
width=2, epoch=199
avg loss = 0.0650 — stuck! Not enough capacity
width=8, epoch=0
avg loss = 0.4278 — similar start
width=8, epoch=199
avg loss = 0.0243 — 2.7× better than width=2!
width=16, epoch=0
avg loss = 0.2583 — slightly better start
width=16, epoch=199
avg loss = 0.0005 — nearly perfect!
Line 22 · loss = 0.0 · Accumulate epoch loss

Reset the loss counter at the start of each epoch. We sum up losses from all 16 images, then divide by 16 to get the average.

Line 23 · for i in range(16): · Loop over all images

Process each of the 16 binary images one at a time. For each image: forward pass, compute loss, backward pass, update weights. This is stochastic gradient descent (SGD) — one update per image.

Line 24 · h = relu(X[i] @ W1 + b1) · Hidden layer forward

The full hidden layer computation in one line: (1) X[i] @ W1 = matrix multiply input by weights, (2) + b1 = add biases, (3) relu() = apply activation. The result h has shape (width,) — one activation per hidden neuron.

EXECUTION STATE
📚 @ (matrix multiply) = Python’s matrix multiplication operator. X[i] is shape (4,) and W1 is shape (4, width), so X[i] @ W1 produces shape (width,). Each element is a dot product of the input with one column of W1.
X[i] @ W1 + b1 = Pre-activation: z = weighted sum + bias. The linear transformation before the non-linearity.
relu(z) = Post-activation: h = max(0, z). Negative pre-activations become 0 (the neuron is ‘off’). This creates the non-linearity that makes the network powerful.
⬆ h shape = (width,) — For width=8: 8 activation values, roughly half will be 0 (killed by ReLU)
Line 25 · y_hat = h @ W2 + b2 · Output layer forward

The output layer is linear (no activation). h is shape (width,), W2 is shape (width, 4), so h @ W2 produces shape (4,). These are the network’s 4 predictions.

EXECUTION STATE
y_hat shape = (4,) — one predicted value per output pixel
Line 26 · loss += 0.5 * np.mean((y_hat - Y[i])**2) · MSE loss

Mean Squared Error between prediction and target, scaled by 0.5 for cleaner gradients (the 2 from the derivative cancels the 0.5). This is the same loss from Chapter 7.

EXECUTION STATE
📚 np.mean() = Average across all 4 elements. MSE = (1/4) × ∑(y_hat_i - y_i)²
0.5 factor = Makes the gradient d/dy of 0.5(y-t)² = (y-t) instead of 2(y-t). A convenience that simplifies backprop math.
Line 28 · dout = (y_hat - Y[i]) / 4 · Output gradient

Gradient of the loss with respect to y_hat. The /4 comes from the np.mean() over 4 elements. This is the starting point for backpropagation — the error signal that flows backward through the network.

EXECUTION STATE
dout shape = (4,) — one gradient per output neuron
→ Math = ∂L/∂ŷ = (ŷ - y) / 4. Positive dout[j] means output j was too high. Negative means too low.
Line 29 · dW2 = np.outer(h, dout) · Output weight gradient

Gradient for W2. The outer product of the hidden activations h (what went IN to the layer) and dout (the error signal). Shape matches W2: (width, 4).

EXECUTION STATE
📚 np.outer(a, b) = Outer product: every element of a times every element of b. For a of shape (m,) and b of shape (n,): produces matrix of shape (m, n). outer([1,2], [3,4]) = [[3,4],[6,8]]
dW2 shape = (width, 4) — same shape as W2. Each element tells how much to adjust the corresponding weight.
Line 30 · db2 = dout · Output bias gradient

Bias gradient equals the output error directly. Shape (4,) — one gradient per output bias. The bias derivative is always 1 (bias doesn’t multiply anything), so dL/db = dL/dy_hat × dy_hat/db = dout × 1 = dout.

Line 31 · dh = dout @ W2.T * (X[i] @ W1 + b1 > 0) · Hidden gradient

Backpropagate through the hidden layer in two steps: (1) dout @ W2.T sends the error back through W2, (2) * (z > 0) applies the ReLU derivative, so gradients flow through neurons that were active for this input (z > 0) and are blocked at inactive ones (z ≤ 0).

EXECUTION STATE
dout @ W2.T = Shape: (4,) @ (4, width) = (width,). Distributes the output error to each hidden neuron, weighted by W2.
(X[i] @ W1 + b1 > 0) = Boolean mask, shape (width,). True where pre-activation was positive (ReLU was active). ReLU derivative: 1 if z>0, 0 if z≤0.
.T (transpose) = W2 has shape (width, 4). W2.T has shape (4, width). We need this shape for the matrix multiply to work.
Line 32 · dW1 = np.outer(X[i], dh) · Input weight gradient

Same pattern as dW2: outer product of the layer’s input (X[i]) and the layer’s gradient (dh). Shape (4, width) matches W1.

Line 33 · db1 = dh · Hidden bias gradient

Same pattern as db2: bias gradient equals the layer gradient directly. Shape (width,).
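Hand-derived gradients like these are worth checking numerically. The sketch below is an illustrative addition, not part of compare_widths.py: it evaluates the same four formulas on a single image and compares them with central finite differences of the loss. The seed, the width, the small nonzero `b1`, and the helper names `loss_fn` and `numeric_grad` are assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 0.0, 1.0])   # one input image
y = np.array([0.0, 0.0, 1.0, 1.0])   # its diagonal-flip target
width = 8
W1 = rng.standard_normal((4, width)) * np.sqrt(2 / 4)
b1 = rng.standard_normal(width) * 0.1  # nonzero so the bias gradient is exercised
W2 = rng.standard_normal((width, 4)) * np.sqrt(2 / width)
b2 = np.zeros(4)
params = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

def loss_fn():
    h = np.maximum(0, x @ W1 + b1)
    y_hat = h @ W2 + b2
    return 0.5 * np.mean((y_hat - y) ** 2)

# Analytic gradients, using the same formulas as the walkthrough.
z = x @ W1 + b1
h = np.maximum(0, z)
dout = (h @ W2 + b2 - y) / 4
dh = dout @ W2.T * (z > 0)
analytic = {"W1": np.outer(x, dh), "b1": dh, "W2": np.outer(h, dout), "b2": dout}

def numeric_grad(p, eps=1e-6):
    # Central differences: nudge one entry at a time, in place, then restore it.
    grad = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        loss_plus = loss_fn()
        p[idx] = old - eps
        loss_minus = loss_fn()
        p[idx] = old
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

max_error = max(np.max(np.abs(numeric_grad(params[k]) - analytic[k])) for k in params)
print(f"largest gradient mismatch: {max_error:.2e}")
```

A mismatch many orders of magnitude below the gradient values themselves indicates the backward pass agrees with the loss; a large mismatch usually points to a missing factor such as the /4 from the mean.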

Line 35 · W1 -= lr * dW1; b1 -= lr * db1 · Update hidden layer

Gradient descent update for the hidden layer weights and biases. w_new = w_old - 0.05 × gradient. The -= operator modifies the arrays in place.

EXECUTION STATE
lr = 0.05 = Each weight adjusts by 5% of its gradient per image. Over 16 images per epoch, that’s 16 small updates per epoch.
Line 36 · W2 -= lr * dW2; b2 -= lr * db2 · Update output layer

Same gradient descent update for the output layer. All four weight/bias arrays are updated after each image.

Line 38 · if epoch % 50 == 0 or epoch == 199: · Print progress

Print the loss every 50 epochs (0, 50, 100, 150) and at the final epoch (199). This lets us see how fast each architecture converges.

Line 39 · print(f"  Epoch {epoch:3d}: loss = {loss/16:.4f}")

Prints the average loss (total loss / 16 images). Lower is better. We’ll compare these numbers across widths to see which architecture learns faster.

Line 41 · params = 4*width + width + width*4 + 4 · Count parameters

Total learnable parameters in a 4→width→4 network. Layer 1: 4×width weights + width biases. Layer 2: width×4 weights + 4 biases.

EXECUTION STATE
width=2 = 4×2 + 2 + 2×4 + 4 = 8 + 2 + 8 + 4 = 22 params
width=8 = 4×8 + 8 + 8×4 + 4 = 32 + 8 + 32 + 4 = 76 params
width=16 = 4×16 + 16 + 16×4 + 4 = 64 + 16 + 64 + 4 = 148 params
Line 42 · return loss/16, params · Return final metrics

Returns the average loss from the final epoch and the total parameter count. These two numbers tell the full story: how well did the network learn, and how many parameters did it need?

EXECUTION STATE
⬆ return for width=2 = (0.0650, 22) — 22 params, loss stuck at 0.065
⬆ return for width=8 = (0.0243, 76) — 76 params, loss 2.7× better
⬆ return for width=16 = (0.0005, 148) — 148 params, nearly perfect
Line 44 · # ── Compare three widths ──

Now we run the experiment: train the same task with width=2, 8, and 16. Same data, same seed, same optimizer. The ONLY difference is the number of hidden neurons.

Line 45 · for w in [2, 8, 16]: · Width sweep

Test three architectures: a narrow network (2 neurons), a medium network (8 neurons), and a wide network (16 neurons). This reveals the effect of width on learning capacity.

LOOP TRACE · 3 iterations
w=2 (4→2→4)
Parameters = 22 — the minimum viable network
Final loss = 0.065020 — poor. Only 3 linear regions possible.
w=8 (4→8→4)
Parameters = 76 — 3.5× more capacity
Final loss = 0.024271 — 2.7× better than width=2
w=16 (4→16→4)
Parameters = 148 — 6.7× more capacity than width=2
Final loss = 0.000498 — 130× better than width=2!
Line 46 · print(f"\nWidth = {w} (4→{w}→4):")

Prints the architecture description before training. The → notation shows data flow: 4 inputs → w hidden neurons → 4 outputs.

Line 47 · final_loss, params = train(w) · Train and measure

Calls the train function with the current width. Returns both the final loss and parameter count.

Line 48 · print(f"  Parameters: {params}, Final loss: {final_loss:.6f}")

Prints the summary for this width. Comparing across widths reveals the core insight: more neurons = more capacity = lower loss. The improvement is not proportional to width: going from 2 to 8 neurons lowers the loss about 2.7×, while going from 8 to 16 lowers it roughly 49×, once the network has enough ReLU "bends" to fit the mapping.

import numpy as np

# ── Dataset: 2×2 binary images → diagonal flip ──
# Flattened: [a,b,c,d] → [a,c,b,d] (transpose)
X = np.array([[i,j,k,l]
              for i in [0,1] for j in [0,1]
              for k in [0,1] for l in [0,1]], dtype=float)
Y = np.array([[x[0],x[2],x[1],x[3]] for x in X])

def relu(x):
    return np.maximum(0, x)

# ── Train a 4→width→4 network ──
def train(width, epochs=200, lr=0.05):
    np.random.seed(42)  # same starting weights for every width
    W1 = np.random.randn(4, width) * np.sqrt(2/4)      # He init, fan_in = 4
    b1 = np.zeros(width)
    W2 = np.random.randn(width, 4) * np.sqrt(2/width)  # He init, fan_in = width
    b2 = np.zeros(4)

    for epoch in range(epochs):
        loss = 0.0
        for i in range(16):
            h = relu(X[i] @ W1 + b1)  # forward pass
            y_hat = h @ W2 + b2
            loss += 0.5 * np.mean((y_hat - Y[i])**2)

            dout = (y_hat - Y[i]) / 4  # backward pass
            dW2 = np.outer(h, dout)
            db2 = dout
            dh = dout @ W2.T * (X[i] @ W1 + b1 > 0)
            dW1 = np.outer(X[i], dh)
            db1 = dh

            W1 -= lr * dW1; b1 -= lr * db1  # one SGD update per image
            W2 -= lr * dW2; b2 -= lr * db2

        if epoch % 50 == 0 or epoch == 199:  # 199 is the last epoch for the default epochs=200
            print(f"  Epoch {epoch:3d}: loss = {loss/16:.4f}")

    params = 4*width + width + width*4 + 4
    return loss/16, params

# ── Compare three widths ──
for w in [2, 8, 16]:
    print(f"\nWidth = {w} (4→{w}→4):")
    final_loss, params = train(w)
    print(f"  Parameters: {params}, Final loss: {final_loss:.6f}")

The results tell a clear story:

  • Width 2 (22 params): Loss plateaus at 0.065 — the network simply cannot represent the diagonal flip with only 3 linear regions.
  • Width 8 (76 params): Loss reaches 0.024 — 2.7× better. With 9 linear regions, the network has enough capacity to approximate the mapping.
  • Width 16 (148 params): Loss reaches 0.0005 — nearly perfect. Each of the 17 linear regions precisely captures part of the input-output mapping.

Key insight: The diagonal flip requires position-wise independence: position 0 maps to itself, position 3 maps to itself, and positions 1 and 2 swap. With 2 hidden neurons, every input must be squeezed through just two numbers, so the network cannot build four independent pathways. With 16 neurons, each position gets dedicated feature detectors.

PyTorch: Flexible MLP Builder

In practice, we do not write the forward and backward passes by hand. PyTorch's nn.Sequential lets us define any MLP architecture in a few lines, and autograd handles all the gradient computation. The following code introduces a factory function pattern that builds any MLP from a list of layer sizes.

PyTorch — Building and Comparing MLP Architectures
🐍flexible_mlp.py
Line 1 · import torch

PyTorch is the deep learning framework we use throughout this book. It provides tensors (like NumPy arrays but with GPU support), automatic differentiation (autograd), and neural network building blocks.

EXECUTION STATE
torch = PyTorch core library — tensors, autograd, GPU computing
Line 2 · import torch.nn as nn

torch.nn contains all neural network building blocks: layers (nn.Linear, nn.Conv2d), activations (nn.ReLU, nn.GELU), loss functions (nn.MSELoss), and the nn.Module base class. We alias it as nn for convenience.

EXECUTION STATE
torch.nn = Neural network module — provides nn.Linear, nn.ReLU, nn.Sequential, nn.Module, nn.init, and more
Line 4 · # Build any MLP architecture with nn.Sequential

nn.Sequential is PyTorch’s simplest way to stack layers. You pass it a list of layers and it chains them together: output of layer 1 becomes input of layer 2, and so on. Perfect for MLPs where data flows straight through.

Line 5 · def make_mlp(layer_sizes): · MLP factory function

A function that takes a list of layer sizes and returns a complete MLP model. For example, [4, 8, 4] creates a 4→8→4 network. [4, 4, 4, 4] creates a 4→4→4→4 network with 2 hidden layers. This is the power of abstraction — one function builds any architecture.

EXECUTION STATE
⬇ input: layer_sizes = List of integers defining the architecture. [4, 8, 4] = input(4) → hidden(8) → output(4). [4, 4, 4, 4] = input(4) → hidden(4) → hidden(4) → output(4).
⬆ returns = nn.Sequential model ready for training. Call model(X) for forward pass.
Line 6 (accumulate layer modules): layers = []

We build the layer list dynamically, then pass it to nn.Sequential. Each element will be either an nn.Linear or an nn.ReLU.

Line 7 (pair adjacent sizes): for i in range(len(layer_sizes) - 1):

For [4, 8, 4], this loops i=0 (4→8) and i=1 (8→4). For [4, 4, 4, 4], it loops i=0 (4→4), i=1 (4→4), i=2 (4→4). Each iteration creates one nn.Linear layer connecting adjacent sizes.

LOOP TRACE · 2 iterations
For [4, 8, 4]:
i=0 = Creates nn.Linear(4, 8) + nn.ReLU() — input to hidden
i=1 = Creates nn.Linear(8, 4) — hidden to output (no ReLU)
For [4, 4, 4, 4]:
i=0 = Creates nn.Linear(4, 4) + nn.ReLU() — input to hidden 1
i=1 = Creates nn.Linear(4, 4) + nn.ReLU() — hidden 1 to hidden 2
i=2 = Creates nn.Linear(4, 4) — hidden 2 to output (no ReLU)
Line 8: layers.append(nn.Linear(layer_sizes[i], layer_sizes[i+1]))

Creates a fully-connected layer that maps from layer_sizes[i] inputs to layer_sizes[i+1] outputs. Internally, nn.Linear stores a weight matrix of shape (out, in) and a bias vector of shape (out,). The forward pass computes: output = input @ W.T + b.

EXECUTION STATE
📚 nn.Linear(in_features, out_features) = Creates a linear transformation layer. Stores weight matrix W of shape (out, in) and bias b of shape (out,). Forward: y = x @ W.T + b. Note: PyTorch stores W transposed compared to our NumPy code!
Line 9 (not the last layer): if i < len(layer_sizes) - 2:

Add ReLU after every hidden layer but NOT after the output layer. The output layer is linear because we want unrestricted output values (regression), not clamped-to-positive values. For [4, 8, 4]: add ReLU after i=0 (the hidden layer), skip after i=1 (the output layer).

EXECUTION STATE
len(layer_sizes) - 2 = Index of the last Linear layer. For [4, 8, 4]: len=3, so last is i=1. We add ReLU only for i < 1, i.e., i=0.
Line 10: layers.append(nn.ReLU())

Adds a ReLU activation layer. In nn.Sequential, layers execute in order, so the data flow is: Linear → ReLU → Linear → ReLU → ... → Linear.

EXECUTION STATE
📚 nn.ReLU() = PyTorch’s ReLU module. Applies max(0, x) element-wise. Has no learnable parameters. Equivalent to our def relu(x) in NumPy.
Line 11 (build the model): model = nn.Sequential(*layers)

nn.Sequential takes the list of layers and chains them into a model. The * unpacks the list: nn.Sequential(Linear, ReLU, Linear). When you call model(x), it passes x through each layer in sequence.

EXECUTION STATE
📚 nn.Sequential(*layers) = Creates a container that runs layers in order. For [Linear(4,8), ReLU(), Linear(8,4)]: model(x) = Linear2(ReLU(Linear1(x))). The * operator unpacks the list into separate arguments.
→ For [4, 8, 4] = Sequential(Linear(4,8), ReLU(), Linear(8,4))
→ For [4, 4, 4, 4] = Sequential(Linear(4,4), ReLU(), Linear(4,4), ReLU(), Linear(4,4))
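Line 13: # Apply He initialization

PyTorch's default initialization for nn.Linear is kaiming_uniform_ with a=√5, which works out to weights drawn from U(−1/√fan_in, 1/√fan_in). That has a smaller variance than He initialization, so we explicitly apply Kaiming normal (He init) for consistency with our NumPy code. This keeps the signal variance roughly constant through ReLU layers.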

Line 14 (iterate all layers): for m in model.modules():

model.modules() returns all sub-modules in the Sequential container: the Sequential itself, each Linear, and each ReLU. We filter for nn.Linear to initialize only the weight-bearing layers.

EXECUTION STATE
📚 model.modules() = Iterator over all modules and sub-modules. For Sequential(Linear, ReLU, Linear): yields Sequential, Linear(4,8), ReLU(), Linear(8,4).
Line 15 (only init Linear layers): if isinstance(m, nn.Linear):

Checks if the current module is an nn.Linear layer. ReLU has no parameters, so we skip it. isinstance() is a Python built-in that checks the type of an object.

Line 16: nn.init.kaiming_normal_(m.weight, nonlinearity='relu')

He initialization (called Kaiming in PyTorch, after Kaiming He’s first name). Sets weights to random normal with std = √(2/fan_in). The trailing _ means in-place: it modifies m.weight directly.

EXECUTION STATE
📚 nn.init.kaiming_normal_(tensor, nonlinearity) = Fills tensor with values from N(0, std²) where std = gain/√fan_in. With nonlinearity='relu' the gain is √2, giving std = √(2/fan_in); the factor of 2 compensates for ReLU zeroing roughly half the activations. With nonlinearity='linear' the gain is 1 (std = √(1/fan_in)), and with 'tanh' the gain is 5/3.
→ For Linear(4, 8) = fan_in=4, std = √(2/4) = 0.707. Each weight drawn from N(0, 0.707²)
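
The std formula can be checked empirically. The following is a minimal sketch, not part of the chapter's code: the tensor shape, the seed, and the variable names are arbitrary choices made so that the sample is large enough for a stable estimate.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in = 4
# A tall weight tensor (many rows, fan_in columns) gives 200,000 samples.
w = torch.empty(50_000, fan_in)
nn.init.kaiming_normal_(w, nonlinearity='relu')

expected_std = math.sqrt(2.0 / fan_in)   # He formula: sqrt(2 / fan_in)
measured_std = w.std().item()

print(f"expected {expected_std:.3f}, measured {measured_std:.3f}")
```

The measured value should land close to 0.707, the same figure quoted for Linear(4, 8).
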
Line 17 (zero-init biases): nn.init.zeros_(m.bias)

Sets all biases to 0. Standard practice — the weight initialization handles the variance, biases start neutral.

EXECUTION STATE
📚 nn.init.zeros_(tensor) = Fills tensor with zeros in-place. Equivalent to m.bias.data.fill_(0).
Line 19 (ready-to-train MLP): return model

Returns the fully initialized model. Call model(X) for forward pass, optimizer.step() for weight updates. The model tracks all parameters automatically for autograd.

EXECUTION STATE
⬆ return: model = nn.Sequential with He-initialized weights. Ready for model(X) → predictions, loss.backward() → gradients, optimizer.step() → updates.
Line 21: # Dataset

Same diagonal flip dataset as the NumPy code, now as PyTorch tensors. We use torch.float32 because the input dtype must match the dtype of nn.Linear's weights, which is float32 by default; integer or float64 inputs raise a dtype error.

Line 22 (input tensor): X = torch.tensor([...], dtype=torch.float32)

All 16 binary 2×2 images as a (16, 4) float tensor. Identical to the NumPy X array but as a PyTorch tensor.

EXECUTION STATE
X shape = torch.Size([16, 4]) — 16 images, 4 features each
dtype=torch.float32 = 32-bit floating point. Required by nn.Linear. Also called torch.float.
Line 25 (target tensor): Y = torch.stack([...])

Target outputs for diagonal flip. torch.stack() converts a list of 1D tensors into a 2D tensor by stacking them as rows.

EXECUTION STATE
📚 torch.stack(tensors) = Concatenates a sequence of tensors along a new dimension. [tensor(4,), tensor(4,), ...] → tensor(16, 4). Unlike torch.cat which joins along an existing dimension.
Y shape = torch.Size([16, 4]) — same shape as X
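
The difference between stacking and concatenating is easiest to see from the resulting shapes. A small sketch with made-up row values (the `rows` list is illustrative, not the chapter's dataset):

```python
import torch

rows = [torch.tensor([0.0, 1.0, 0.0, 1.0]) for _ in range(16)]

stacked = torch.stack(rows)   # adds a new leading dimension: 16 x (4,) -> (16, 4)
joined = torch.cat(rows)      # grows the existing dimension 0: 16 * 4 -> (64,)

print(stacked.shape)  # torch.Size([16, 4])
print(joined.shape)   # torch.Size([64])
```
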
Line 27: # Compare architectures

We test five architectures: three single-hidden-layer (varying width) and two double-hidden-layer (varying width). This reveals both the width effect and the depth effect.

Line 28 (architecture dictionary): configs = { ... }

Maps architecture names to layer size lists. Each list defines a complete architecture. The dictionary makes it easy to loop over architectures and print results.

EXECUTION STATE
4→2→4: [4, 2, 4] = Narrow: 22 params. 1 hidden layer, 2 neurons.
4→8→4: [4, 8, 4] = Medium: 76 params. 1 hidden layer, 8 neurons.
4→16→4: [4, 16, 4] = Wide: 148 params. 1 hidden layer, 16 neurons.
4→4→4→4: [4, 4, 4, 4] = Deep narrow: 60 params (three layers of 4×4 weights + 4 biases). 2 hidden layers, 4 neurons each.
4→8→8→4: [4, 8, 8, 4] = Deep wide: 148 params (40 + 72 + 36). 2 hidden layers, 8 neurons each.
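Line 36 (reproducible results): torch.manual_seed(42)

Sets PyTorch's random seed for reproducibility. torch.manual_seed seeds the generators on all devices, so it also covers CUDA when using a GPU. We set it once before the loop, which makes the whole run repeatable; each architecture still draws different random weights because the models consume the random stream one after another.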

Line 37 (test each architecture): for name, sizes in configs.items():

Iterates over each architecture. configs.items() yields (name, sizes) pairs. For each, we build a model, train it for 200 epochs, and print the result.

Line 38 (build model): model = make_mlp(sizes)

Calls our factory function to create an He-initialized MLP with the specified layer sizes. Each call creates a fresh model with new random weights.

Line 39: params = sum(p.numel() for p in model.parameters())

Counts total learnable parameters by summing the element count of every parameter tensor. This is the standard PyTorch idiom for parameter counting.

EXECUTION STATE
📚 model.parameters() = Iterator over all learnable parameter tensors in the model. For a 4→8→4 network: yields W1(8,4), b1(8,), W2(4,8), b2(4,).
📚 p.numel() = Number of elements in tensor p. For shape (8, 4): numel() = 32. For shape (8,): numel() = 8.
Line 40: opt = torch.optim.Adam(model.parameters(), lr=0.01)

Creates an Adam optimizer (from Chapter 9) for this model. Adam adapts the step size per parameter by combining momentum with RMSprop-style scaling. PyTorch's default for Adam is lr=0.001; here we use lr=0.01, ten times larger, a choice that suits this tiny problem.

EXECUTION STATE
📚 torch.optim.Adam(params, lr) = Adam optimizer: maintains per-parameter momentum (m) and squared gradient (v) buffers. Update: w -= lr × m̂ / (√v̂ + ε). Often converges faster than plain SGD.
lr=0.01 = The update is lr × m̂/√v̂, and the ratio m̂/√v̂ usually has magnitude at most about 1, so each parameter moves by roughly lr or less per step regardless of the raw gradient scale. This is why Adam and SGD learning rates are not directly comparable.
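
One way to see Adam's normalization at work is to take a single step with gradients of very different sizes. This is a minimal sketch: the gradients are set by hand purely for illustration, and the parameter `w` is hypothetical rather than taken from the chapter's model.

```python
import torch

# Two parameters whose gradients differ by six orders of magnitude.
w = torch.nn.Parameter(torch.tensor([1.0, 1.0]))
opt = torch.optim.Adam([w], lr=0.01)

before = w.detach().clone()
w.grad = torch.tensor([1000.0, 0.001])   # hand-set gradients, illustration only
opt.step()

# On the first step m_hat / sqrt(v_hat) is +1 or -1, so each move is about lr.
step_sizes = (before - w.detach()).abs()
print(step_sizes)
```

Both entries come out close to 0.01. With plain SGD at the same learning rate, the two steps would instead differ by the same factor of a million as the gradients.
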
Line 42 (training loop): for epoch in range(200):

Train for 200 epochs. Unlike our NumPy code that processes one image at a time, here we pass ALL 16 images at once (batch training). This is more efficient and gives a cleaner gradient.

Line 43 (full batch forward pass): pred = model(X)

Passes all 16 images through the model simultaneously. X is (16, 4), pred is (16, 4). PyTorch handles the batch dimension automatically — each of the 16 rows is processed independently through the same weights.

EXECUTION STATE
model(X) = Calls the model’s forward() method. nn.Sequential runs each layer in order on the entire batch. Shape: (16, 4) → (16, width) → ... → (16, 4).
pred shape = torch.Size([16, 4]) — 16 predictions, one per image
Line 44 (compute MSE): loss = nn.functional.mse_loss(pred, Y)

Mean Squared Error loss over all 16 images and all 4 outputs. Unlike our NumPy code where we scaled by 0.5, PyTorch’s mse_loss uses the standard formula: mean((pred - Y)²) with no 0.5 factor.

EXECUTION STATE
📚 nn.functional.mse_loss(input, target) = Computes mean((input - target)²) averaged over all elements. For shape (16, 4): averages over 64 values total. Returns a scalar tensor.
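
To confirm that no 0.5 factor is involved, the built-in loss can be compared with the formula written out by hand. The tensors below are small made-up values, not the chapter's data:

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.tensor([[1.0, 0.0], [0.0, 4.0]])

builtin = F.mse_loss(pred, target)
by_hand = ((pred - target) ** 2).mean()   # squared errors 0, 4, 9, 0 -> mean 3.25

print(builtin.item(), by_hand.item())  # 3.25 3.25
```
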
Line 45 (clear old gradients): opt.zero_grad()

Resets all parameter gradients to zero before backward(). PyTorch accumulates gradients by default (useful for gradient accumulation), so we must zero them each step to avoid mixing gradients from different epochs.
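
The accumulation behaviour can be observed directly on a single tensor. This is an illustrative sketch with a hypothetical one-element parameter `w`; it resets the gradient by hand, which stands in for what opt.zero_grad() does across all parameters.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()        # d(w^2)/dw = 2w = 4

(w ** 2).sum().backward()     # no reset in between
accumulated = w.grad.clone()  # 4 + 4 = 8: the new gradient was added to the old one

w.grad.zero_()                # manual reset of the stored gradient
(w ** 2).sum().backward()
fresh = w.grad.clone()        # back to 4

print(first.item(), accumulated.item(), fresh.item())  # 4.0 8.0 4.0
```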

Line 46 (backpropagation): loss.backward()

Computes gradients for ALL parameters in the model via reverse-mode autodiff. This single call replaces the entire manual backprop from our NumPy code (dout, dW2, db2, dh, dW1, db1). PyTorch traces the computation graph from loss back to the weights.

Line 47 (update weights with Adam): opt.step()

Applies the Adam update rule to every parameter. For each weight: updates momentum buffer m, updates squared gradient buffer v, applies bias correction, then updates the weight. Replaces our manual W -= lr * dW.

Line 49: print(f"{name:14s}  params={params:4d}  loss={loss.item():.6f}")

Prints the architecture name, parameter count, and final loss. The :14s format pads the name to 14 characters for aligned output. loss.item() converts the 0-dimensional tensor to a Python float.

EXECUTION STATE
📚 loss.item() = Extracts the scalar value from a 0-d tensor as a Python float. Required because loss is a torch.Tensor, not a plain number. Only works on single-element tensors.

import torch
import torch.nn as nn

# ── Build any MLP architecture with nn.Sequential ──
def make_mlp(layer_sizes):
    layers = []
    for i in range(len(layer_sizes) - 1):
        layers.append(nn.Linear(layer_sizes[i], layer_sizes[i+1]))
        if i < len(layer_sizes) - 2:
            layers.append(nn.ReLU())
    model = nn.Sequential(*layers)

    # Apply He initialization
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            nn.init.zeros_(m.bias)

    return model

# ── Dataset ──
X = torch.tensor([[i,j,k,l]
                  for i in [0,1] for j in [0,1]
                  for k in [0,1] for l in [0,1]], dtype=torch.float32)
Y = torch.stack([torch.tensor([x[0],x[2],x[1],x[3]]) for x in X])

# ── Compare architectures ──
configs = {
    "4→2→4":       [4, 2, 4],
    "4→8→4":       [4, 8, 4],
    "4→16→4":      [4, 16, 4],
    "4→4→4→4":     [4, 4, 4, 4],
    "4→8→8→4":     [4, 8, 8, 4],
}

torch.manual_seed(42)
for name, sizes in configs.items():
    model = make_mlp(sizes)
    params = sum(p.numel() for p in model.parameters())
    opt = torch.optim.Adam(model.parameters(), lr=0.01)

    for epoch in range(200):
        pred = model(X)
        loss = nn.functional.mse_loss(pred, Y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(f"{name:14s}  params={params:4d}  loss={loss.item():.6f}")

The make_mlp function is a pattern you will use throughout your deep learning career. It separates architecture definition (the list of sizes) from architecture construction (the loop that builds layers). This makes experimentation trivial: just change the list.
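
Because an architecture is just a list of sizes, its parameter count can be worked out before any model is built. The helper below is a hypothetical sketch (the name `count_params` is not part of the chapter's code); it assumes every layer is fully connected and has a bias, as nn.Linear does by default.

```python
def count_params(layer_sizes):
    """Weights plus biases for a fully connected net described by layer_sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + bias vector
    return total

for sizes in ([4, 2, 4], [4, 8, 4], [4, 16, 4], [4, 4, 4, 4], [4, 8, 8, 4]):
    print(sizes, count_params(sizes))
# prints 22, 76, 148, 60, 148 for the five architectures
```

The result should agree with sum(p.numel() for p in model.parameters()) on the corresponding model, which makes it a cheap sanity check when sketching architectures on paper.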

PyTorch tip: After building a model, you can inspect it with print(model). This shows all layers and their shapes. To see all parameter shapes: for name, p in model.named_parameters(): print(name, p.shape).
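
As a short illustration of the tip, here is what the inspection looks like on a hand-built 4→8→4 model. The `shapes` dictionary is just a convenience for this sketch.

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 4))

print(model)  # lists each layer with its in/out feature counts

# nn.Sequential names parameters by layer index; ReLU (index 1) has none.
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in shapes.items():
    print(name, shape)
```

The parameter names come out as 0.weight, 0.bias, 2.weight, and 2.bias, with weight shapes (8, 4) and (4, 8), reflecting the (out, in) storage order of nn.Linear.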

Architecture Design Heuristics

There is no formula for the optimal architecture. But decades of practice have produced reliable heuristics:

Starting Points

  1. Start with one hidden layer. A single hidden layer with enough neurons can approximate any continuous function on a bounded input domain to arbitrary accuracy (the Universal Approximation Theorem, which we prove in Section 3 of this chapter). Start simple and add depth only if needed.
  2. Set width to 2–4× the input dimension when the input is small. For our 4-input task, that suggests 8–16 hidden neurons, which matches our experimental results. The rule does not extend to high-dimensional inputs: for a 784-input MNIST task, a first hidden layer of 256–512 neurons, narrower than the input, is a common starting point.
  3. Use a pyramid or funnel shape. In multi-layer networks, gradually decrease width: 784→512→256→128→10. This forces the network to compress information into progressively more abstract representations.
  4. Equal-width layers work too. Research by Hanin and Rolnick (2019) showed that equal-width networks (same number of neurons in every hidden layer) perform surprisingly well. The funnel shape is traditional but not always superior.

Practical Decision Tree

| Situation | Recommendation |
|---|---|
| Task is simple (few inputs, clear pattern) | 1 hidden layer, width ≈ 2–4× input size |
| Task has natural hierarchy | 2–3 hidden layers, funnel or equal width |
| Model underfits (training loss stays high) | Add more neurons (width) or more layers (depth) |
| Model overfits (train low, test high) | Reduce width, add dropout (Chapter 12) |
| Training is unstable (loss oscillates/NaN) | Check initialization, reduce learning rate |
| Very deep network (>5 layers) | Consider skip connections (Chapter 14) |

The architect's mindset: Architecture design is empirical, not theoretical. You will not solve it with a formula. Instead, build a baseline, measure its performance, identify the bottleneck (underfitting or overfitting), and make a targeted change. The heuristics above tell you which direction to search.

Connection to Modern Systems

Every concept in this section scales directly to the architectures powering today's AI systems:

Transformers Use MLPs

Every transformer block contains a feed-forward network (FFN), which is just a 2-layer MLP. In the original Transformer and in GPT-2 and GPT-3, each FFN takes the form $d_{\text{model}} \to 4d_{\text{model}} \to d_{\text{model}}$. For a model with $d_{\text{model}} = 4096$, that is a $4096 \to 16384 \to 4096$ MLP: each of its two weight matrices holds about 67 million parameters, so the FFN has roughly 134 million parameters per layer. The width factor of 4× matches our heuristic.
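
The parameter count for such an FFN follows from the same weights-plus-biases arithmetic used for the small MLPs earlier. A quick sketch, assuming both layers carry a bias term (some implementations omit it):

```python
d_model = 4096
d_ff = 4 * d_model                     # 16384

w_up = d_model * d_ff                  # 4096 -> 16384 weight matrix
w_down = d_ff * d_model                # 16384 -> 4096 weight matrix
biases = d_ff + d_model                # assumption: both layers have biases

total = w_up + w_down + biases
print(f"{w_up:,} per matrix, {total:,} per FFN")
# 67,108,864 per matrix, 134,238,208 per FFN
```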

GELU in Practice

BERT and the GPT family use GELU activation in their FFN layers rather than ReLU. GELU is smooth around zero, which is often credited with helping training stability in deep networks. LLaMA instead uses SwiGLU, a gated variant that combines two linear transformations: $\text{SwiGLU}(x) = \text{SiLU}(xW_1) \odot (xW_2)$, where $\text{SiLU}(z) = z \cdot \sigma(z)$ and $\sigma$ is the sigmoid function. This gives each hidden unit a learnable gate that controls information flow.
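
A gated FFN can be sketched in a few lines of PyTorch. The block below is illustrative only: it assumes the commonly used formulation in which the gate branch passes through SiLU, SiLU(z) = z·σ(z), and it adds a down-projection back to the model width. The class name `GatedFFN`, the attribute names, and the toy sizes are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    """Hypothetical SwiGLU-style feed-forward block (names are illustrative)."""

    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # back to d_model

    def forward(self, x):
        # The SiLU-activated gate scales the value branch element-wise.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

torch.manual_seed(0)
ffn = GatedFFN(d_model=8, d_hidden=32)
out = ffn(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 8])
```

Compared with the plain two-layer FFN, the gated form uses three weight matrices instead of two, which is one reason such models often choose a hidden width below 4× the model width.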

Initialization at Scale

Proper initialization becomes even more critical at scale. GPT-3 (175B parameters, 96 layers) inherits GPT-2's modified initialization, in which the output projection of each residual branch is scaled by $1/\sqrt{2N}$, where $N$ is the number of transformer layers (each layer contributes two residual branches). Without this, the residual connections cause variance to grow with depth. The principle is the same as He initialization (keep the signal variance stable) but adapted to the specific architecture.
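
The depth-dependent scaling can be expressed with the same nn.init tools used in make_mlp. This is a rough sketch under stated assumptions: the base standard deviation of 0.02 and the toy width of 64 are illustrative values that this section does not specify, and `out_proj` is a hypothetical stand-in for one residual output projection.

```python
import math
import torch
import torch.nn as nn

n_layers = 96          # depth quoted above
base_std = 0.02        # assumed base std; not given in this section
d_model = 64           # toy width so the example runs instantly

torch.manual_seed(0)
out_proj = nn.Linear(d_model, d_model)
scaled_std = base_std / math.sqrt(2 * n_layers)
nn.init.normal_(out_proj.weight, mean=0.0, std=scaled_std)

scale = 1 / math.sqrt(2 * n_layers)
measured = out_proj.weight.std().item()
print(f"scale factor 1/sqrt(2N) = {scale:.4f}")
print(f"target std {scaled_std:.5f}, measured {measured:.5f}")
```

For 96 layers the scale factor is about 0.072, so these projections start roughly fourteen times smaller than they would under the unscaled base value.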

Width and Memory

The FFN layers typically hold the largest share of a transformer's weights. During inference with a KV-cache, the attention layers store cached key/value pairs (memory proportional to sequence length), while the FFN layers store full weight matrices (memory proportional to $d_{\text{model}}^2$). FFN width is also a major compute cost, which is why Mixture of Experts (MoE) models such as Mixtral-8x7B route each token through only a subset of the FFN experts. All experts stay in memory, but each token gets the capacity of a wide network at the compute cost of a narrow one.


Summary

MLP architecture design comes down to four choices, each with measurable consequences:

  1. Width (neurons per layer) controls how many features the network can detect. More width = more linear regions = more complex functions. But more parameters risk overfitting.
  2. Depth (number of hidden layers) adds capacity exponentially. $L$ layers of width $n$ can create $O(n^L)$ linear regions. But deeper networks are harder to optimize (vanishing gradients).
  3. Activation function determines the shape of the non-linearity. ReLU is the default, GELU for transformers, Leaky ReLU if dead neurons are a problem.
  4. Initialization determines whether the signal survives through layers. He init for ReLU ($\sigma = \sqrt{2/n_{\text{in}}}$), Xavier for tanh/sigmoid ($\sigma = \sqrt{2/(n_{\text{in}} + n_{\text{out}})}$).
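
The two initialization formulas in item 4 are simple enough to evaluate by hand. A small sketch, using hypothetical helper names and the first layer of the 4→8→4 network as the example:

```python
import math

def he_std(n_in):
    """He init for ReLU: sqrt(2 / n_in)."""
    return math.sqrt(2.0 / n_in)

def xavier_std(n_in, n_out):
    """Xavier init for tanh/sigmoid: sqrt(2 / (n_in + n_out))."""
    return math.sqrt(2.0 / (n_in + n_out))

# First layer of the 4 -> 8 -> 4 network: n_in = 4, n_out = 8.
print(f"He:     {he_std(4):.3f}")         # 0.707
print(f"Xavier: {xavier_std(4, 8):.3f}")  # 0.408
```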

We demonstrated these concepts on the diagonal flip task: width 2 could not learn (loss 0.065), width 8 learned reasonably (loss 0.024), and width 16 achieved a near-perfect fit (loss 0.0005). In PyTorch, the make_mlp factory pattern makes architecture experimentation trivial.

In the next section, we will put this knowledge to work: building complete MLPs in PyTorch, training them on real data, and systematically comparing architectures using proper train/validation splits.
