Chapter 6

Multi-Layer Perceptrons (MLPs)

From Perceptrons to Deep Networks

Learning Objectives

By the end of this section, you will:

  • Understand why single-layer perceptrons fail on non-linear problems like XOR
  • Master the architecture of Multi-Layer Perceptrons (MLPs)
  • Trace data flow through the network during forward propagation
  • See how hidden layers transform the input space to solve non-linear problems
  • Implement MLPs from scratch in PyTorch
  • Make informed decisions about network width and depth
  • Avoid common pitfalls when designing and training MLPs

The Story Behind MLPs

The story of Multi-Layer Perceptrons is one of the most dramatic tales in the history of artificial intelligence. It involves a crisis, a decade of doubt, and ultimately a triumphant comeback.

The Rise of the Perceptron (1958)

In 1958, Frank Rosenblatt introduced the Perceptron, a simple computational model inspired by biological neurons. The New York Times declared it "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."

The perceptron was remarkably good at learning linearly separable patterns—problems where a straight line (or hyperplane) could separate different classes. It could learn to distinguish AND, OR, and even recognize simple patterns.

The Fall: Minsky and Papert's Critique (1969)

In 1969, Marvin Minsky and Seymour Papert published "Perceptrons", a mathematical analysis proving fundamental limitations of single-layer perceptrons. Most devastatingly, they showed that a single perceptron cannot learn the XOR function.

The XOR Problem: XOR (exclusive or) outputs 1 when exactly one input is 1. No single straight line can separate the (0,0)→0, (1,1)→0 points from the (0,1)→1, (1,0)→1 points. This proved that perceptrons couldn't solve even simple non-linear problems.

This critique, combined with the overhyped promises, triggered the first "AI Winter"—a period of reduced funding and interest in neural network research that lasted over a decade.

The Resurrection: Hidden Layers and Backpropagation (1986)

The solution was actually known but not widely appreciated: add hidden layers. In 1986, Rumelhart, Hinton, and Williams popularized the backpropagation algorithm, demonstrating how to train multi-layer networks efficiently.

The Key Insight

Hidden layers can transform the input space to make non-linear problems linearly separable. The hidden layer essentially performs automatic feature engineering!

The XOR Problem: Why One Layer Isn't Enough

To truly understand why MLPs were revolutionary, we need to understand the XOR problem deeply. Let's look at the truth table:

| x₁ | x₂ | XOR Output | Geometric Location |
|----|----|------------|--------------------|
| 0  | 0  | 0          | Bottom-left corner |
| 0  | 1  | 1          | Top-left corner |
| 1  | 0  | 1          | Bottom-right corner |
| 1  | 1  | 0          | Top-right corner |

Notice how the outputs form a checkerboard pattern: opposite corners have the same label. This is the defining characteristic of XOR—and why it's impossible to separate with a single line.

Mathematical Proof

A single perceptron computes:

$$y = \sigma(w_1 x_1 + w_2 x_2 + b)$$

The decision boundary is where $w_1 x_1 + w_2 x_2 + b = 0$, which is a straight line. No matter how you rotate or shift this line, you cannot separate the XOR points!
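One way to convince yourself numerically: a brute-force search over a grid of candidate lines never finds one that classifies all four XOR points (this sketch's grid resolution and range are arbitrary — by the argument above, no grid would help):

```python
import itertools

# XOR truth table: ((x1, x2), label)
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def classify(w1, w2, b, x1, x2):
    """Hard-threshold perceptron: predict 1 when w1*x1 + w2*x2 + b > 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

# Search a coarse grid of lines; none classifies all four points correctly
grid = [i / 5 for i in range(-10, 11)]  # -2.0 to 2.0 in steps of 0.2
solved = any(
    all(classify(w1, w2, b, x1, x2) == label for (x1, x2), label in points)
    for w1, w2, b in itertools.product(grid, grid, grid)
)
print("Found a separating line?", solved)  # False
```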

Try It Yourself

In the interactive demo below, try to find any straight line that separates the XOR points. You'll quickly discover it's impossible. Then add a hidden layer and watch the magic happen!

Interactive: The XOR Problem

Experiment with different problems (AND, OR, XOR) and network architectures. Notice how XOR cannot be solved by a single-layer perceptron, but can be solved when you add a hidden layer.



MLP Architecture

A Multi-Layer Perceptron consists of layers of neurons, where each neuron in one layer connects to every neuron in the next layer. This is why MLPs are also called fully connected or dense networks.

Components of an MLP

  1. Input Layer: Receives the raw features. The number of neurons equals the number of input features. This layer performs no computation—it just passes data forward.
  2. Hidden Layers: One or more layers between input and output. Each neuron computes a weighted sum of its inputs, adds a bias, and applies an activation function. These layers learn intermediate representations.
  3. Output Layer: Produces the final prediction. The number of neurons depends on the task: 1 for binary classification, C for C-class classification, or any number for regression.

Key Terminology

| Term | Definition | Example |
|------|------------|---------|
| Width | Number of neurons in a layer | A layer with 64 neurons has width 64 |
| Depth | Number of hidden layers | A network with 3 hidden layers has depth 3 |
| Parameters | Total weights + biases | A 2→4→1 network has 2×4 + 4 + 4×1 + 1 = 17 params |
| Capacity | Ability to model complex functions | Deeper/wider networks have more capacity |
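The parameter count from the table can be checked directly in PyTorch — a quick sketch using the same 2→4→1 example:

```python
import torch.nn as nn

# The 2 -> 4 -> 1 network from the table
net = nn.Sequential(
    nn.Linear(2, 4),  # 2*4 weights + 4 biases = 12 parameters
    nn.ReLU(),        # activations have no parameters
    nn.Linear(4, 1),  # 4*1 weights + 1 bias  =  5 parameters
)

total = sum(p.numel() for p in net.parameters())
print(total)  # 17
```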

Interactive: Network Architecture

Explore how changing the number of neurons in each layer affects the network structure. Watch how connections are drawn between layers and how activations flow during forward propagation.



Forward Propagation

Forward propagation is the process of computing the network's output given an input. Data flows from the input layer through hidden layers to the output layer, with each neuron performing the same basic computation.

The Neuron Computation

Each neuron performs three steps:

  1. Weighted Sum: Multiply each input by its corresponding weight and sum them:
     $$z = \sum_{i=1}^{n} w_i x_i = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$
  2. Add Bias: Add a learnable bias term:
     $$z = \sum_{i=1}^{n} w_i x_i + b$$
  3. Apply Activation: Pass through a non-linear activation function:
     $$a = \sigma(z)$$

Matrix Notation

For a layer with $n$ inputs and $m$ outputs, we can write the forward pass compactly as:

$$\mathbf{a} = \sigma(W\mathbf{x} + \mathbf{b})$$

where $W$ is an $m \times n$ weight matrix, $\mathbf{x}$ is the input vector, and $\mathbf{b}$ is the bias vector.
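This matrix form maps directly onto tensor operations. A minimal sketch in raw PyTorch (the sizes $n=3$, $m=2$ and the sigmoid activation are illustrative choices):

```python
import torch

torch.manual_seed(0)

n, m = 3, 2            # n inputs, m outputs
W = torch.randn(m, n)  # weight matrix W, shape (m, n)
b = torch.randn(m)     # bias vector b, shape (m,)
x = torch.randn(n)     # input vector x, shape (n,)

# a = sigma(Wx + b), with sigmoid as the activation here
z = W @ x + b
a = torch.sigmoid(z)

print(a.shape)  # torch.Size([2]) -- one activation per output neuron
```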

Interactive: Step-by-Step Forward Pass

Step through a complete forward pass and see exactly how inputs are transformed at each stage. Watch the weighted sums and activation values being computed in real-time.



The Magic of Hidden Layers

Hidden layers are where the real magic happens. They transform the input space in a way that makes previously unsolvable problems suddenly solvable. But how?

Feature Learning

Each hidden neuron creates a new feature by computing a weighted combination of inputs. These learned features capture patterns in the data that are useful for the task.

Geometric Interpretation: Each hidden neuron draws a hyperplane through the input space. Points on one side activate the neuron (output near 1), while points on the other side don't (output near 0). The hidden layer output is a new coordinate system based on distances from these hyperplanes!

Solving XOR with Hidden Layers

For XOR, we need two hidden neurons that together create a new feature space where the points become linearly separable:

  • Hidden Neuron 1: Fires when $x_1 + x_2 > 0.5$ (at least one input is 1)
  • Hidden Neuron 2: Fires when $x_1 + x_2 > 1.5$ (both inputs are 1)

In this new (h₁, h₂) space:

  • (0,0) → (0, 0): Both neurons inactive → output 0
  • (0,1) or (1,0) → (1, 0): First neuron active → output 1
  • (1,1) → (1, 1): Both neurons active → output 0

Now a simple line like $h_1 - h_2 = 0.5$ can separate the classes!
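This hand-crafted solution can be verified in a few lines of Python. The hard-threshold `step` function below stands in for a very steep sigmoid:

```python
def step(z):
    """Hard-threshold activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden neuron 1: fires when x1 + x2 > 0.5
    h2 = step(x1 + x2 - 1.5)    # hidden neuron 2: fires when x1 + x2 > 1.5
    return step(h1 - h2 - 0.5)  # output neuron: fires when h1 - h2 > 0.5

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"{x1} XOR {x2} = {xor_mlp(x1, x2)}")
```

Running this prints 0, 1, 1, 0 — exactly the XOR truth table.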


Interactive: Space Transformation

This visualization shows how hidden layers warp the input space. Adjust the weights of two hidden neurons and watch how the XOR points get rearranged in the hidden layer's feature space.



Activation Functions

Activation functions introduce non-linearity into the network. Without them, no matter how many layers we stack, the network can only learn linear transformations.

Why Non-linearity Matters

A composition of linear functions is still linear: $f(g(\mathbf{x})) = A(B\mathbf{x}) = (AB)\mathbf{x} = C\mathbf{x}$. Without activation functions, a 100-layer network would be equivalent to a single-layer network!
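This collapse is easy to demonstrate numerically. The sketch below stacks two bias-free `nn.Linear` layers with no activation in between and checks that they equal a single linear map (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two Linear layers with NO activation in between
f = nn.Linear(3, 4, bias=False)  # x -> Ax
g = nn.Linear(4, 2, bias=False)  # h -> Bh

x = torch.randn(5, 3)

# The stacked layers compute B(Ax)...
out_stacked = g(f(x))

# ...which equals one linear map with matrix C = BA
C = g.weight @ f.weight
out_single = x @ C.T

print(torch.allclose(out_stacked, out_single, atol=1e-6))  # True
```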

Common Activation Functions

| Function | Formula | Range | Use Case |
|----------|---------|-------|----------|
| Sigmoid | σ(z) = 1/(1+e⁻ᶻ) | (0, 1) | Binary output, probabilities |
| Tanh | tanh(z) = (eᶻ-e⁻ᶻ)/(eᶻ+e⁻ᶻ) | (-1, 1) | Zero-centered outputs |
| ReLU | max(0, z) | [0, ∞) | Hidden layers (most common) |
| Leaky ReLU | max(αz, z), α≈0.01 | (-∞, ∞) | Prevents dying neurons |
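A quick way to build intuition for these ranges is to evaluate each function on a few sample inputs (the values below are arbitrary):

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.sigmoid(z))       # all values in (0, 1)
print(torch.tanh(z))          # all values in (-1, 1)
print(F.relu(z))              # negative inputs clipped to 0
print(F.leaky_relu(z, 0.01))  # negative inputs scaled by alpha = 0.01
```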

ReLU: The Modern Default

ReLU (Rectified Linear Unit) has become the default activation for hidden layers because:

  • Efficient: Just a comparison operation—much faster than exp() in sigmoid/tanh
  • No vanishing gradient: Gradient is 1 for positive inputs, not squashed like sigmoid
  • Sparse activation: Negative inputs produce 0, creating sparse representations

Dying ReLU Problem

If a neuron's weights push all inputs to negative values, it outputs 0 for all inputs and receives 0 gradients—it "dies" and never recovers. Leaky ReLU fixes this by allowing small gradients for negative inputs.

Mathematical Formulation

Let's formalize the MLP mathematically. Consider a network with $L$ weighted layers, where the input is layer $0$ and the output is layer $L$.

Notation

  • $\mathbf{a}^{(l)}$: Activation vector at layer $l$
  • $W^{(l)}$: Weight matrix from layer $l-1$ to layer $l$
  • $\mathbf{b}^{(l)}$: Bias vector at layer $l$
  • $\sigma^{(l)}$: Activation function at layer $l$

Forward Pass Equations

Starting with input $\mathbf{a}^{(0)} = \mathbf{x}$, for each layer $l = 1, 2, \ldots, L$:

$$\mathbf{z}^{(l)} = W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$
$$\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})$$

The final output is $\hat{\mathbf{y}} = \mathbf{a}^{(L)}$.
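These two equations translate almost line-for-line into code. A sketch with random weights, illustrative layer sizes [2, 4, 1], and ReLU as every $\sigma^{(l)}$ (in practice the output layer usually gets a task-specific activation instead):

```python
import torch

torch.manual_seed(0)

# Illustrative layer sizes: n0 = 2 inputs, n1 = 4 hidden, n2 = 1 output
sizes = [2, 4, 1]

# W^(l) has shape (n_l, n_{l-1}); b^(l) has shape (n_l,)
Ws = [torch.randn(sizes[l], sizes[l - 1]) for l in range(1, len(sizes))]
bs = [torch.randn(sizes[l]) for l in range(1, len(sizes))]

a = torch.tensor([0.5, 0.8])  # a^(0) = x
for W, b in zip(Ws, bs):
    z = W @ a + b             # z^(l) = W^(l) a^(l-1) + b^(l)
    a = torch.relu(z)         # a^(l) = sigma(z^(l))

print(a)  # y_hat = a^(L), shape (1,)
```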

Dimensionality

For a layer mapping $n_{l-1}$ inputs to $n_l$ outputs:

  • $W^{(l)}$ has shape $n_l \times n_{l-1}$
  • $\mathbf{b}^{(l)}$ has shape $n_l$
  • Total parameters for this layer: $n_l \times n_{l-1} + n_l = n_l(n_{l-1} + 1)$

PyTorch Implementation

Let's implement a flexible MLP in PyTorch that can have any number of hidden layers.

Flexible MLP Implementation
mlp.py

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Multi-Layer Perceptron for classification."""

    def __init__(self, input_dim, hidden_dims, output_dim):
        # input_dim: number of input features
        # hidden_dims: list of hidden layer sizes, e.g. [64, 32] for two hidden layers
        # output_dim: number of outputs (1 for binary classification, sigmoid applied later)
        super().__init__()  # initialize nn.Module bookkeeping (parameter tracking, GPU support)

        # Build layers dynamically so any number of hidden layers is supported
        layers = []
        prev_dim = input_dim

        # Hidden layers: a Linear map (y = Wx + b) followed by a ReLU non-linearity.
        # nn.Linear(in, out) has in*out weights and out biases,
        # e.g. nn.Linear(2, 4) has 2*4 = 8 weights and 4 biases.
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            prev_dim = hidden_dim

        # Output layer maps to output_dim (no activation applied here)
        layers.append(nn.Linear(prev_dim, output_dim))

        # nn.Sequential chains the modules; the forward pass runs them in order
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        # Defines how data flows through the network; called automatically by model(x)
        return self.network(x)

# Create an MLP: 2 inputs -> hidden layers of 4 and 3 neurons -> 1 output
model = MLP(input_dim=2, hidden_dims=[4, 3], output_dim=1)

# Example forward pass: a batch of 1 sample with 2 features, shape (batch_size=1, features=2)
x = torch.tensor([[0.5, 0.8]])
output = model(x)
print(f"Input: {x}")
print(f"Output: {output}")
```

Complete XOR Implementation

For a comprehensive walkthrough of training a neural network on the XOR problem—including step-by-step forward pass, backward pass, and gradient calculations—see Section 6: Building Your First Neural Network.

Interactive: MLP Playground

Experiment with different datasets, network architectures, and hyperparameters. Watch how the decision boundary evolves during training.



MLP Design Choices

Width vs Depth

Should you make your network wider (more neurons per layer) or deeper (more layers)? The answer depends on the problem:

| Aspect | Wider Networks | Deeper Networks |
|--------|----------------|-----------------|
| Capacity | Can memorize more examples | Can learn more complex hierarchies |
| Training | Often easier to train | Can suffer from vanishing gradients |
| Generalization | May overfit more easily | Often generalize better with same param count |
| Computation | More parallelizable | Sequential dependency between layers |

Universal Approximation Theorem: A single hidden layer with enough neurons can approximate any continuous function. However, for certain function classes, deeper networks can represent the same function with exponentially fewer parameters.

Practical Guidelines

  1. Start small: Begin with 1-2 hidden layers and increase if needed
  2. Power of 2: Layer sizes like 32, 64, 128, 256 are common (GPU efficiency)
  3. Tapered: Decreasing layer sizes (e.g., 256→128→64) often work well
  4. ReLU for hidden: Use ReLU in hidden layers unless you have a specific reason not to
  5. Output activation: Sigmoid for binary classification, softmax for multiclass, none for regression
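Putting guidelines 2-5 together, a tapered regression MLP might look like this sketch (the input size of 10 and the exact widths are illustrative choices, not a recommendation for any specific dataset):

```python
import torch.nn as nn

# Tapered MLP: power-of-2 widths that decrease layer by layer,
# ReLU in hidden layers, no activation on the regression output
model = nn.Sequential(
    nn.Linear(10, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),  # regression head: no output activation
)

total = sum(p.numel() for p in model.parameters())
print(total)  # 44033
```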

Common Pitfalls

1. Forgetting Non-linearity

If you forget activation functions between layers, your network is equivalent to a single linear layer—no matter how deep it is!

2. Wrong Output Activation

Using sigmoid in the output layer for multiclass classification, or ReLU for regression with negative targets. Match your output activation to your task.

3. Vanishing/Exploding Gradients

Very deep networks can have gradients that shrink or explode as they propagate backwards. Solutions: ReLU, batch normalization, skip connections.

4. Too Many Parameters

A network that's too large will memorize the training data instead of learning general patterns. Start small and increase capacity only if needed.

Debugging Tips

  • Always check that your loss decreases during training
  • If loss is NaN or Inf, check for numerical issues (learning rate too high, bad inputs)
  • Print intermediate shapes to catch dimension mismatches
  • Start with a tiny dataset to verify learning works
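These tips can be folded into a small sanity-check script. This sketch (the sizes, learning rate, and step count are arbitrary) tries to overfit a tiny random dataset while guarding against NaN loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny random dataset: if the network can't fit this, something is wrong
X = torch.randn(8, 2)
y = torch.randint(0, 2, (8, 1)).float()

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + BCE in one numerically stable op

losses = []
for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    # Guard against numerical issues (e.g., learning rate too high, bad inputs)
    assert not torch.isnan(loss), "loss is NaN -- check learning rate and inputs"
    loss.backward()
    opt.step()
    losses.append(loss.item())

# The loss should decrease over training
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```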

Test Your Understanding



Summary

Key Takeaways

  1. Single-layer perceptrons fail on non-linear problems like XOR because they can only create linear decision boundaries
  2. Hidden layers solve this by transforming the input space to make non-linear problems linearly separable
  3. Forward propagation computes outputs by passing data through layers: weighted sum → add bias → apply activation
  4. Non-linear activation functions (like ReLU) are essential—without them, deep networks collapse to linear models
  5. The Universal Approximation Theorem guarantees MLPs can approximate any continuous function, but deeper networks often need fewer parameters
  6. PyTorch's nn.Module makes it easy to build flexible MLPs with automatic parameter tracking and gradient computation

In the next section, we'll explore the Universal Approximation Theorem in depth—the mathematical foundation that explains why neural networks are such powerful function approximators.