Learning Objectives
By the end of this section, you will be able to:
- Identify the four design choices that define an MLP: width, depth, activation function, and weight initialization
- Explain the effect of width on learning capacity and show that more neurons means more "bends" in the function the network can represent
- Explain the effect of depth and why deeper networks can represent exponentially more complex functions per parameter
- Compare activation functions (ReLU, Leaky ReLU, GELU, Tanh) and know when to use each
- Apply He and Xavier initialization and explain why proper initialization prevents signals from vanishing or exploding
- Build flexible MLPs in PyTorch using nn.Sequential and nn.init
The Architecture Question
In Chapters 7–9, we learned how neural networks learn: forward propagation computes predictions, backpropagation computes gradients, and optimizers update weights. But we always used the same architecture: 4→3→4 with ReLU and random initialization.
Why 3 hidden neurons? Why not 2, or 8, or 100? Why one hidden layer instead of three? Why ReLU instead of tanh? These choices were arbitrary — and that's a problem, because architecture determines what a network can and cannot learn.
The Central Question of This Chapter: Given a task (like diagonal flip), how do we choose the right architecture? What are the design knobs, and what happens when we turn them?
Think of it like building a house. Knowing how to lay bricks (forward pass), mix mortar (backprop), and plan the schedule (optimizer) is essential — but it does not tell you how many rooms to build or how tall to make the ceilings. Architecture is the blueprint that determines what the finished structure can do.
What Defines an MLP?
A Multi-Layer Perceptron (MLP) — also called a feedforward neural network or fully connected network — has four architectural choices:
| Design Choice | What It Controls | Example |
|---|---|---|
| Width | Neurons per hidden layer | 4→8→4 (8 neurons wide) |
| Depth | Number of hidden layers | 4→4→4→4 (2 hidden layers) |
| Activation | Non-linearity between layers | ReLU, GELU, Tanh |
| Initialization | Starting weight values | He, Xavier, random |
Each choice has consequences. Width controls how many features a single layer can detect. Depth controls how many levels of abstraction the network builds. Activation controls the shape of the non-linearity. Initialization controls whether training starts in a healthy state.
Let us examine each one, starting with the most intuitive: width.
Width: How Many Neurons Per Layer?
Consider our diagonal flip task. The network must learn to swap positions 1 and 2 of a 4-element vector while keeping positions 0 and 3 unchanged. How many hidden neurons does it need?
The Effect of Width
Each hidden neuron with ReLU activation contributes one "bend": a point where the function changes slope. A neuron computes $h = \max(0, \mathbf{w}^\top \mathbf{x} + b)$. This is a hinge function: linear on one side, zero on the other. The boundary between the two sides is a hyperplane defined by $\mathbf{w}^\top \mathbf{x} + b = 0$.
With $n$ hidden neurons in a single layer, the network can produce at most $n + 1$ linear regions along any straight line through the input space, because each neuron contributes at most one bend on that line. Within each region, the network computes a different linear function. The more regions, the more complex the overall function.
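The bend-counting argument can be checked directly on a scalar input. The following is a minimal sketch, assuming a randomly initialized one-hidden-layer network; the helper names `relu_net_1d` and `breakpoints` are illustrative and not taken from earlier chapters. Each neuron switches on or off where its pre-activation crosses zero, so the number of linear pieces is at most one more than the number of neurons.

```python
import numpy as np

def relu_net_1d(x, w, b, v, c):
    """One hidden layer on scalar input: f(x) = sum_i v_i * max(0, w_i * x + b_i) + c."""
    hidden = np.maximum(0.0, np.outer(x, w) + b)   # shape (len(x), n)
    return hidden @ v + c

def breakpoints(w, b):
    """A neuron bends the function where w_i * x + b_i = 0, i.e. at x = -b_i / w_i."""
    active = w != 0
    return np.unique(-b[active] / w[active])

rng = np.random.default_rng(0)
n = 3                                              # hidden neurons (assumed for the demo)
w, b, v, c = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n), 0.5
kinks = breakpoints(w, b)
num_pieces = len(kinks) + 1
print(f"{n} neurons -> {len(kinks)} bends -> {num_pieces} linear pieces (at most {n + 1})")
```

Between two neighbouring bends every neuron stays either on or off, so the function there is exactly linear; only the bends themselves change the slope.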
| Width | Parameters | Max Linear Regions | Final Loss (200 epochs) |
|---|---|---|---|
| 2 neurons | 22 | 3 | 0.0650 |
| 3 neurons (Ch 7-9) | 31 | 4 | 0.0640 |
| 8 neurons | 76 | 9 | 0.0243 |
| 16 neurons | 148 | 17 | 0.0005 |
The pattern is clear: more neurons → lower loss. With only 2 neurons, the network can create 3 linear regions — not enough to capture the diagonal flip mapping for all 16 images. With 16 neurons, it has 17 regions and achieves nearly zero loss.
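The parameter counts in the table follow from counting one weight per connection plus one bias per non-input neuron. A small sketch, with `mlp_param_count` as an illustrative helper name:

```python
def mlp_param_count(sizes):
    """Weights plus biases for a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

counts = {width: mlp_param_count([4, width, 4]) for width in (2, 3, 8, 16)}
for width, count in counts.items():
    print(f"4->{width}->4: {count} parameters")
# 4->2->4: 22, 4->3->4: 31, 4->8->4: 76, 4->16->4: 148
```

The same helper works for deeper architectures by passing a longer list of sizes.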
Width as Feature Detectors
Each hidden neuron acts as a feature detector. It learns to respond to a specific pattern in the input. For example, one neuron might detect "position 1 is 1 AND position 2 is 0", while another detects the reverse. The output layer then combines these features to produce the final mapping.
With too few neurons, the network lacks enough feature detectors to distinguish all the patterns it needs. It is forced to make compromises — mapping several distinct inputs to similar outputs. This is underfitting: the model is too simple for the task.
Depth: How Many Hidden Layers?
Width is not the only way to add capacity. We can also add depth — more hidden layers. And depth has a remarkable property: it adds capacity exponentially rather than linearly.
Why Depth Matters
Consider a single hidden layer with $n$ ReLU neurons. It creates at most $n + 1$ linear regions. Now add a second hidden layer with $n$ neurons. Each neuron in the second layer can "fold" the first layer's regions, effectively multiplying the number of regions:
- 1 hidden layer with $n$ neurons: up to $n + 1$ regions
- 2 hidden layers with $n$ neurons each: up to roughly $n^2$ regions
- $L$ hidden layers with $n$ neurons each: up to roughly $n^L$ regions
This is the exponential advantage of depth. A network with 4 neurons per layer and 3 hidden layers (4→4→4→4→4) has only 80 parameters, counting weights and biases, but can theoretically represent on the order of $4^3 = 64$ linear regions. A single-hidden-layer network would need 63 neurons (and $4 \cdot 63 + 63 + 63 \cdot 4 + 4 = 571$ parameters) to match that number of regions.
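The folding effect can be seen in a deliberately simple construction. The sketch below is not the diagonal flip network: it assumes a hand-built "tent" layer made of two ReLU neurons that folds the interval [0, 1] onto itself, and stacks it several times. The names `tent_layer` and `count_pieces` are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent_layer(x):
    """Two ReLU neurons that fold [0, 1] onto itself: 2x on [0, 0.5], 2 - 2x on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_pieces(y, x):
    """Count linear pieces by counting slope changes between neighbouring grid cells."""
    slopes = np.diff(y) / np.diff(x)
    return 1 + int(np.count_nonzero(~np.isclose(slopes[1:], slopes[:-1])))

x = np.linspace(0.0, 1.0, 1025)    # dyadic grid, so every bend lands on a grid point
y = x.copy()
pieces = []
for depth in range(1, 5):
    y = tent_layer(y)              # stack one more 2-neuron layer
    pieces.append(count_pieces(y, x))
    print(f"depth {depth}: {2 * depth} hidden neurons, {pieces[-1]} linear pieces")
```

Each extra layer doubles the number of pieces, so four layers of two neurons reach 16 pieces, while the same 8 neurons in a single hidden layer are limited to 9.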
Depth vs Width: A Practical Comparison
| Architecture | Parameters | Theoretical Regions | Best For |
|---|---|---|---|
| 4→16→4 (wide) | 148 | 17 | Simple pattern matching |
| 4→4→4→4 (deep) | 60 | ~64 | Hierarchical features |
| 4→8→8→4 (balanced) | 148 | ~64 | General-purpose |
For small problems like diagonal flip, width often wins because the optimization is easier. For complex problems like image recognition or language understanding, depth wins because real-world data has hierarchical structure — edges combine into shapes, shapes combine into objects, objects combine into scenes. Each layer captures one level of abstraction.
The depth principle: If your task has natural hierarchy (low-level features composing into high-level concepts), add depth. If your task is a complex mapping without obvious hierarchy, add width. Most real-world tasks have hierarchy, which is why modern networks are deep.
Interactive: Architecture Explorer
Use the sliders below to adjust the number of neurons in each layer. Watch how the network structure changes — more neurons mean more connections (weights), and each connection is a learnable parameter. Click "Animate Forward Pass" to see data flow through the layers.
Try these experiments:
- Set all hidden layers to 1 neuron — notice the extreme bottleneck. Information must squeeze through a single number.
- Set a hidden layer to 8 neurons — the web of connections grows dramatically. Each connection is a weight the optimizer must tune.
- Compare 4→8→4 versus 4→4→4→4 — similar parameter counts, but very different structures.
Interactive: Width vs Depth
This visualization shows a real neural network learning a curved target function (dashed line). Adjust width and depth, then click Train to watch the cyan prediction line converge toward the target. The network diagram on the left updates to show the current architecture.
Try these experiments:
- Width = 2, Depth = 1: Train and watch the network struggle. With only 2 neurons, it can make at most 3 linear segments — not enough to match the wavy target.
- Width = 8, Depth = 1: Now 9 segments are possible. The fit is much better, but look at the parameter count.
- Width = 4, Depth = 3: Fewer parameters than Width=8/Depth=1, but potentially more linear regions due to the exponential depth effect. Does it learn better?
- Width = 16, Depth = 1: Brute force with many neurons. Fast convergence but many parameters.
Activation Functions: Beyond ReLU
We have used ReLU exclusively so far. But there are several activation functions, each with distinct properties. The choice of activation is the third architectural decision.
The Activation Function Zoo
| Function | Formula | Range | Key Property |
|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | Fast, sparse, but dead neurons |
| Leaky ReLU | max(αx, x), α=0.01 | (-∞, ∞) | No dead neurons |
| GELU | x · Φ(x) | ≈ (-0.17, ∞) | Smooth, used in transformers |
| Tanh | tanh(x) | (-1, 1) | Zero-centered, bounded |
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | Output as probability |
ReLU: The Workhorse
ReLU ($\max(0, x)$) is the default activation for hidden layers. It is fast to compute (just a comparison), produces sparse activations (many neurons output exactly 0), and has a simple gradient (1 if active, 0 if not).
The main weakness is the dying ReLU problem. If a neuron's pre-activation is always negative (for all training inputs), its gradient is always 0, and it can never recover. The neuron is permanently "dead." This happens more with large learning rates or poor initialization.
Leaky ReLU: Fixing Dead Neurons
Leaky ReLU replaces the flat zero region with a small slope for negative inputs: $f(x) = \max(\alpha x, x)$ where $\alpha = 0.01$. Now even negative inputs produce a small gradient ($\alpha$), so no neuron can completely die. The output is $x$ when the input is positive and $\alpha x$ when it is negative.
GELU: The Transformer Standard
GELU (Gaussian Error Linear Unit) is used in GPT, BERT, and many other transformers. It is defined as $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the standard Gaussian CDF. Unlike ReLU's hard cutoff at 0, GELU provides a smooth transition. Small negative values are slightly attenuated rather than completely zeroed out. This smoothness helps gradient-based optimization, especially in very deep networks.
Tanh and Sigmoid: The Classics
Before ReLU, tanh and sigmoid were standard. Tanh ($\tanh(x)$) outputs values in $(-1, 1)$ and is zero-centered, which helps optimization. Sigmoid ($\sigma(x) = 1/(1 + e^{-x})$) outputs values in $(0, 1)$ and is mainly used for output layers where we need probabilities.
The main problem with both is vanishing gradients: for large $|x|$, the derivative approaches 0. In deep networks, this means early layers receive almost no gradient signal and learn extremely slowly. ReLU solved this by having a constant gradient of 1 for positive inputs.
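The formulas from the table can be written out in a few lines of NumPy. This is a minimal sketch for comparison purposes; the function names are illustrative, and GELU uses the exact form $x \cdot \Phi(x)$ rather than the faster approximations that libraries often offer.

```python
import math
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def gelu(x):
    # Exact form x * Phi(x), with the standard normal CDF written via erf.
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_grad(x):
    # Derivative of tanh, which shrinks toward 0 as |x| grows.
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
for name, fn in [("relu", relu), ("leaky_relu", leaky_relu), ("gelu", gelu),
                 ("tanh", np.tanh), ("sigmoid", sigmoid)]:
    print(f"{name:>10}: {np.round(fn(x), 4)}")
print("tanh gradient at x = 0 and x = 5:", tanh_grad(0.0), tanh_grad(5.0))
```

The last line illustrates the vanishing gradient: the tanh derivative is 1 at the origin but falls below 0.001 by the time the input reaches 5.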
Weight Initialization: Why It Matters
The fourth architectural choice is how to set the initial weight values. This might seem trivial — they are just starting values that will be trained away — but initialization has a dramatic impact on whether training succeeds at all.
The Variance Problem
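Consider a layer with $n$ inputs: $z = \sum_{i=1}^{n} w_i x_i$. If the inputs are zero-mean with variance 1 and the weights are zero-mean with variance $\sigma_w^2$, then by the properties of independent random variables:
$$\text{Var}(z) = n \cdot \sigma_w^2$$
The output variance is $n$ times the weight variance. For a layer with $n = 512$ inputs and naive $\sigma_w = 1$ initialization, the output variance is 512: the signal explodes. Because every further layer multiplies the variance again, activations grow rapidly with depth and can eventually overflow. Conversely, if $\sigma_w = 0.01$, the output variance is $512 \times 0.0001 = 0.0512$: the signal vanishes toward zero.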
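The variance rule is easy to confirm by simulation. The sketch below assumes standard normal inputs and zero-mean normal weights, and estimates the variance of a single pre-activation over many random draws; `output_variance` is an illustrative helper name.

```python
import numpy as np

def output_variance(n_inputs, weight_std, num_samples=4000, seed=0):
    """Empirical variance of z = sum_i w_i * x_i over many independent draws."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(num_samples, n_inputs))
    w = rng.normal(0.0, weight_std, size=(num_samples, n_inputs))
    z = np.sum(w * x, axis=1)
    return float(z.var())

var_large = output_variance(512, 1.0)    # theory: 512 * 1.0**2 = 512
var_small = output_variance(512, 0.01)   # theory: 512 * 0.01**2 = 0.0512
print(f"sigma = 1.0  -> Var(z) is about {var_large:.1f}")
print(f"sigma = 0.01 -> Var(z) is about {var_small:.4f}")
```

The estimates fluctuate by a few percent around the theoretical values because they come from a finite sample.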
Xavier Initialization (for Tanh/Sigmoid)
Xavier Glorot and Yoshua Bengio (2010) proposed setting $\text{Var}(w) = \frac{2}{n_{in} + n_{out}}$. This keeps the variance of both the forward pass and the backward pass approximately 1. It is designed for linear or tanh activations that preserve their input variance.
He Initialization (for ReLU)
Kaiming He and colleagues (2015) observed that ReLU kills approximately half the activations (all negative values become 0), which halves the variance. To compensate, He init uses $\text{Var}(w) = \frac{2}{n_{in}}$, which doubles the variance compared to Xavier when $n_{in} = n_{out}$. This is the standard for ReLU networks and what we have been using in our code.
| Initialization | Scale σ | Best For | Epoch 0 Loss | Epoch 99 Loss |
|---|---|---|---|---|
| Small random (σ=0.01) | 0.01 | Nothing — don't use | 0.2302 | 0.0984 |
| Large random (σ=1.0) | 1.0 | Nothing — don't use | 1.3312 | 0.0658 |
| Xavier | √(2/(n_in+n_out)) | Tanh, Sigmoid | 0.3532 | 0.0305 |
| He (Kaiming) | √(2/n_in) | ReLU, Leaky ReLU | 0.5485 | 0.0244 |
Look at the epoch 99 losses. He initialization reaches the lowest loss (0.0244), as expected for a scheme designed for ReLU. Xavier is close (0.0305) but slightly worse because it underestimates the variance needed for ReLU. Small random initialization ends at 0.0984, about 4× the loss of He, because the signal vanishes through the layers. Large random (0.0658) starts with high loss because the initial predictions are wildly wrong, so early training steps are spent recovering.
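Rule of thumb: Use He initialization with ReLU. Note that PyTorch's nn.Linear default is a Kaiming-uniform variant with a smaller scale (weights drawn uniformly within ±1/√n_in), so apply nn.init.kaiming_normal_ explicitly when you want the He scheme described here. Use Xavier initialization with tanh or sigmoid. Never pick the weight scale arbitrarily: the scale matters enormously.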
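To see why the scale matters before any training happens, the following sketch pushes a random batch through a stack of ReLU layers and reports the mean squared activation at the end. It is a forward-pass illustration only, not the experiment behind the loss table; the width, depth, and the helper name `final_signal_scale` are assumptions chosen for the demo.

```python
import numpy as np

def final_signal_scale(weight_std_fn, width=512, depth=10, batch=256, seed=0):
    """Mean squared activation after `depth` ReLU layers with the given weight scale."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))            # input with mean square of about 1
    for _ in range(depth):
        std = weight_std_fn(width, width)
        W = rng.normal(0.0, std, size=(width, width))
        h = np.maximum(0.0, h @ W)
    return float(np.mean(h ** 2))

schemes = {
    "small (0.01)": lambda n_in, n_out: 0.01,
    "large (1.0)": lambda n_in, n_out: 1.0,
    "xavier": lambda n_in, n_out: np.sqrt(2.0 / (n_in + n_out)),
    "he": lambda n_in, n_out: np.sqrt(2.0 / n_in),
}
scales = {name: final_signal_scale(fn) for name, fn in schemes.items()}
for name, value in scales.items():
    print(f"{name:>12}: mean squared activation after 10 layers = {value:.3e}")
```

With He scaling the signal stays near its starting magnitude, Xavier shrinks it by roughly half per ReLU layer, and the two fixed scales collapse toward zero or blow up by many orders of magnitude.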
NumPy: Comparing Widths from Scratch
Let us put the theory into practice. The following code trains three architectures — 4→2→4, 4→8→4, and 4→16→4 — on our diagonal flip task and compares their final losses. This is the same forward/backward code from Chapters 7–8, now wrapped in a function so we can vary the width.
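A minimal sketch of such an experiment is shown here. The dataset follows the task description (all 16 binary 4-vectors, with positions 1 and 2 swapped), but the learning rate, seed, loss, and plain full-batch gradient descent are assumptions, so the losses it prints will not necessarily match the figures quoted in this section.

```python
import itertools
import numpy as np

# All 16 binary 4-vectors; the target swaps positions 1 and 2 (the diagonal flip).
X = np.array(list(itertools.product([0.0, 1.0], repeat=4)))
Y = X[:, [0, 2, 1, 3]]

def train_mlp(width, epochs=200, lr=0.1, seed=0):
    """Train 4 -> width -> 4 with ReLU, He init, MSE loss, full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, np.sqrt(2.0 / 4), size=(4, width))
    b1 = np.zeros(width)
    W2 = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, 4))
    b2 = np.zeros(4)
    losses = []
    for _ in range(epochs):
        # Forward pass
        z1 = X @ W1 + b1
        h = np.maximum(0.0, z1)
        pred = h @ W2 + b2
        err = pred - Y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: gradients of the mean squared error
        d_pred = 2.0 * err / err.size
        dW2 = h.T @ d_pred
        db2 = d_pred.sum(axis=0)
        d_z1 = (d_pred @ W2.T) * (z1 > 0)
        dW1 = X.T @ d_z1
        db1 = d_z1.sum(axis=0)
        # Gradient descent update
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return losses

results = {width: train_mlp(width) for width in (2, 8, 16)}
for width, losses in results.items():
    print(f"4->{width}->4: loss {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because `train_mlp` takes the width as an argument, adding another architecture to the comparison only requires adding a number to the tuple.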
The results tell a clear story:
- Width 2 (22 params): Loss plateaus at 0.065 — the network simply cannot represent the diagonal flip with only 3 linear regions.
- Width 8 (76 params): Loss reaches 0.024 — 2.7× better. With 9 linear regions, the network has enough capacity to approximate the mapping.
- Width 16 (148 params): Loss reaches 0.0005 — nearly perfect. Each of the 17 linear regions precisely captures part of the input-output mapping.
PyTorch: Flexible MLP Builder
In practice, we do not write the forward and backward passes by hand. PyTorch's nn.Sequential lets us define any MLP architecture in a few lines, and autograd handles all the gradient computation. The following code introduces a factory function pattern that builds any MLP from a list of layer sizes.
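One way to write such a factory is sketched below. The function name `build_mlp` and its signature are illustrative choices, and the He initialization via nn.init.kaiming_normal_ assumes ReLU hidden layers; for tanh or sigmoid you would swap in nn.init.xavier_uniform_ instead.

```python
import torch
import torch.nn as nn

def build_mlp(sizes, activation=nn.ReLU):
    """Build an MLP from a list of layer sizes, e.g. [4, 8, 4]."""
    layers = []
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        linear = nn.Linear(n_in, n_out)
        nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")  # He init
        nn.init.zeros_(linear.bias)
        layers.append(linear)
        if i < len(sizes) - 2:           # no activation after the output layer
            layers.append(activation())
    return nn.Sequential(*layers)

model = build_mlp([4, 8, 4])
num_params = sum(p.numel() for p in model.parameters())
out = model(torch.zeros(16, 4))
print(model)
print("parameters:", num_params, "output shape:", tuple(out.shape))
```

Changing the architecture is then a one-line edit, for example `build_mlp([4, 4, 4, 4])` for a deeper network or `build_mlp([4, 16, 4], activation=nn.GELU)` for a wider one with a different activation.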
The factory function is a pattern you will use throughout your deep learning career. It separates architecture definition (the list of sizes) from architecture construction (the loop that builds layers). This makes experimentation trivial: just change the list.
Architecture Design Heuristics
There is no formula for the optimal architecture. But decades of practice have produced reliable heuristics:
Starting Points
- Start with one hidden layer. Given enough neurons, a single hidden layer can approximate any continuous function on a bounded input domain to any desired accuracy (the Universal Approximation Theorem, which we prove in Section 3 of this chapter). Start simple and add depth only if needed.
- Set width to 2–4× the input dimension. For our 4-input task, that suggests 8–16 hidden neurons — which matches our experimental results. For a 784-input MNIST task, start with 256–512 hidden neurons.
- Use a pyramid or funnel shape. In multi-layer networks, gradually decrease width: 784→512→256→128→10. This forces the network to compress information into progressively more abstract representations.
- Equal-width layers work too. Research by Hanin and Rolnick (2019) showed that equal-width networks (same number of neurons in every hidden layer) perform surprisingly well. The funnel shape is traditional but not always superior.
Practical Decision Tree
| Situation | Recommendation |
|---|---|
| Task is simple (few inputs, clear pattern) | 1 hidden layer, width ≈ 2-4× input size |
| Task has natural hierarchy | 2-3 hidden layers, funnel or equal width |
| Model underfits (training loss stays high) | Add more neurons (width) or more layers (depth) |
| Model overfits (train low, test high) | Reduce width, add dropout (Chapter 12) |
| Training is unstable (loss oscillates/NaN) | Check initialization, reduce learning rate |
| Very deep network (>5 layers) | Consider skip connections (Chapter 14) |
Connection to Modern Systems
Every concept in this section scales directly to the architectures powering today's AI systems:
Transformers Use MLPs
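Every transformer block contains a feed-forward network (FFN), which is just a 2-layer MLP. In GPT-style models such as GPT-2 and GPT-3, each FFN takes the form $\text{FFN}(x) = W_2\,\phi(W_1 x + b_1) + b_2$, where $\phi$ is the activation function. For a model with $d_{\text{model}} = 4096$, that is a 4096→16384→4096 MLP with over 67 million weights in each of its two linear layers. The width factor of 4× matches the upper end of our heuristic.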
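The 67 million figure can be reproduced with a short calculation. The sketch assumes a model width of 4096 and the 4× expansion factor, which together give that number for one weight matrix; biases are ignored because they are negligible at this scale.

```python
d_model = 4096                 # assumed model width
d_ff = 4 * d_model             # 4x expansion in the hidden layer of the FFN

weights_per_linear = d_model * d_ff      # one weight matrix: 4096 x 16384
ffn_weights = 2 * weights_per_linear     # up-projection plus down-projection

print(f"hidden width: {d_ff}")
print(f"weights per linear layer: {weights_per_linear:,}")   # 67,108,864
print(f"weights per FFN block:    {ffn_weights:,}")          # 134,217,728
```

Both linear layers together hold roughly 134 million weights per transformer block.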
GELU in Practice
Many transformers (BERT, GPT-2, GPT-3) use GELU activation in their FFN layers, not ReLU. The smooth gradient of GELU helps training stability in very deep networks. Other models, such as LLaMA, use SwiGLU, a gated variant that combines two linear transformations: $\text{SwiGLU}(x) = \text{Swish}(W x) \odot (V x)$ with $\text{Swish}(z) = z \cdot \sigma(z)$, where $\sigma$ is the sigmoid function. This gives each neuron a learnable gate that controls information flow.
Initialization at Scale
Proper initialization becomes even more critical at scale. GPT-3 (175B parameters, 96 layers) follows the GPT-2 scheme, in which the weights of each residual block's output projection are scaled by $1/\sqrt{N}$ at initialization, where $N$ is the number of residual layers. Without this, the residual connections cause variance to grow with depth. The principle is the same as He initialization (keep the signal variance stable), but adapted for the specific architecture.
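The effect of depth-dependent scaling can be illustrated with a toy residual stream. This is a simplified sketch, not the GPT initialization itself: it assumes each residual branch is a single random linear map, and `residual_stream_variance` is an illustrative helper name.

```python
import numpy as np

def residual_stream_variance(num_layers, scale_branches, width=256, batch=512, seed=0):
    """Variance of x after num_layers updates of the form x <- x + s * (x @ W)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))
    branch_scale = 1.0 / np.sqrt(num_layers) if scale_branches else 1.0
    for _ in range(num_layers):
        W = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
        x = x + branch_scale * (x @ W)    # residual connection around a linear branch
    return float(x.var())

num_layers = 24                           # assumed depth for the demo
unscaled = residual_stream_variance(num_layers, scale_branches=False)
scaled = residual_stream_variance(num_layers, scale_branches=True)
print(f"without scaling: variance is about {unscaled:.3e}")
print(f"with 1/sqrt(N):  variance is about {scaled:.3f}")
```

In this toy setting every unscaled branch roughly doubles the variance, while the scaled branches keep the total growth bounded by a small constant regardless of depth.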
Width and Memory
The FFN weights are the largest block of parameters in a standard transformer layer (about two thirds of it with a 4× expansion). During inference with KV-cache, the attention layers store cached key/value pairs (proportional to sequence length), but the FFN layers store the full weight matrices (proportional to $d_{\text{model}} \times d_{\text{ff}}$). This is why techniques like MoE (Mixture of Experts) in models like Mixtral-8x7B route each token to only a few expert FFNs (2 of 8 in Mixtral), getting the capacity of a wide network with roughly the compute cost of a narrow one.
Summary
MLP architecture design comes down to four choices, each with measurable consequences:
- Width (neurons per layer) controls how many features the network can detect. More width = more linear regions = more complex functions. But more parameters risk overfitting.
- Depth (number of hidden layers) adds capacity exponentially. $L$ layers of width $n$ can create on the order of $n^L$ linear regions. But deeper networks are harder to optimize (vanishing gradients).
- Activation function determines the shape of the non-linearity. ReLU is the default, GELU for transformers, Leaky ReLU if dead neurons are a problem.
- Initialization determines whether the signal survives through layers. He init for ReLU ($\text{Var}(w) = 2/n_{in}$), Xavier for tanh/sigmoid ($\text{Var}(w) = 2/(n_{in} + n_{out})$).
We demonstrated these concepts on the diagonal flip task: width 2 could not learn (loss 0.065), width 8 learned reasonably (loss 0.024), and width 16 achieved near-perfect accuracy (loss 0.0005). In PyTorch, the factory pattern makes architecture experimentation trivial.
In the next section, we will put this knowledge to work: building complete MLPs in PyTorch, training them on real data, and systematically comparing architectures using proper train/validation splits.