Chapter 7

From Input to Output

Forward Propagation

Learning Objectives

By the end of this section, you will be able to:

  1. Flatten a 2D image into a 1D vector and understand why neural networks need this
  2. Trace data through every neuron in a network, calculating each value by hand
  3. Apply weights, biases, and activation functions step by step
  4. See why an untrained network produces garbage and why that's the starting point for learning

The Philosophy: Most books show you formulas. We're going to do something different. We'll take one tiny, concrete example and trace every single number through a neural network by hand. No shortcuts. No "the details are left as an exercise." Every multiplication, every addition, every activation: you'll see it all.

The Problem: Flipping a Tiny Image

Imagine the world's smallest camera — it captures images that are just 2 pixels wide and 2 pixels tall. Each pixel is either black (0) or white (1). We want to build a neural network that learns to flip the image diagonally — reflecting it across the diagonal from top-left to bottom-right. A diagonal flip swaps the row and column of each pixel: pixel at (row, col) moves to (col, row).

| Input Image | Flattened | Target (flipped) | Output Image |
|---|---|---|---|
| $\begin{bmatrix}1&0\\0&1\end{bmatrix}$ | $[1,0,0,1]$ | $[1,0,0,1]$ | $\begin{bmatrix}1&0\\0&1\end{bmatrix}$ |
| $\begin{bmatrix}0&1\\0&0\end{bmatrix}$ | $[0,1,0,0]$ | $[0,0,1,0]$ | $\begin{bmatrix}0&0\\1&0\end{bmatrix}$ |
| $\begin{bmatrix}1&1\\0&0\end{bmatrix}$ | $[1,1,0,0]$ | $[1,0,1,0]$ | $\begin{bmatrix}1&0\\1&0\end{bmatrix}$ |
| $\begin{bmatrix}1&0\\1&1\end{bmatrix}$ | $[1,0,1,1]$ | $[1,1,0,1]$ | $\begin{bmatrix}1&1\\0&1\end{bmatrix}$ |

Look at the second row. The input has a white pixel at position (row 0, col 1). After the flip, it moves to (row 1, col 0). The diagonal swap is $[p_{00}, p_{01}, p_{10}, p_{11}] \to [p_{00}, p_{10}, p_{01}, p_{11}]$, so positions 1 and 2 swap while 0 and 3 stay.
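As a minimal sketch, the flip rule can be written in a few lines of Python. The function name `flip_diagonal` and the list-of-rows image representation are assumptions made for this illustration; the network itself never sees this function, it has to learn the same mapping from examples.

```python
def flip_diagonal(image):
    """Reflect a square image across its top-left to bottom-right diagonal.

    `image` is a list of rows; the pixel at (row, col) moves to (col, row).
    """
    size = len(image)
    return [[image[col][row] for col in range(size)] for row in range(size)]


# The four example images from the table above.
examples = [
    [[1, 0], [0, 1]],
    [[0, 1], [0, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [1, 1]],
]

for image in examples:
    print(image, "->", flip_diagonal(image))
```

Each printed pair should match the Input Image and Output Image columns of the table.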

Why this example? The diagonal flip is a real geometric transformation with a clear pattern. It's simple enough to calculate by hand, yet a single neuron can't do it: the output has four values, so several neurons have to work together, which makes it a natural fit for a small multi-layer network.

Step 1: Flatten the Image

Neural networks eat 1D vectors, not 2D grids. We read pixels left-to-right, top-to-bottom. A 2×2 image with pixels $p_{00}, p_{01}, p_{10}, p_{11}$ becomes a 4-element vector $\mathbf{x} = [p_{00}, p_{01}, p_{10}, p_{11}]$.

For our running example, we'll use the input $\begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, which flattens to $\mathbf{x} = [1, 0, 1, 1]$. The expected output after the diagonal flip is $\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$, which flattens to $\mathbf{y} = [1, 1, 0, 1]$.

Reading order matters. As long as you flatten input and output using the same order (row-major), the network can learn the mapping. The network doesn't know it's looking at an image — it just sees numbers.
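Row-major flattening can be sketched in plain Python as follows. The helper names `flatten` and `unflatten` are hypothetical, chosen only for this example, and `unflatten` assumes the vector length is a multiple of `width`.

```python
def flatten(image):
    """Read pixels left-to-right, top-to-bottom (row-major order)."""
    return [pixel for row in image for pixel in row]


def unflatten(vector, width=2):
    """Inverse of flatten: cut the vector back into rows of `width` pixels."""
    return [vector[i:i + width] for i in range(0, len(vector), width)]


x = flatten([[1, 0], [1, 1]])  # running-example input
y = flatten([[1, 1], [0, 1]])  # its diagonal flip, the target

print("x =", x)
print("y =", y)
print("x as an image again:", unflatten(x))
```

Using the same pair of helpers for inputs and targets is what keeps the two reading orders consistent.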

Quick Check

If the input image is [[0,1],[1,0]], what is the flattened vector?


Step 2: Design the Network

We'll build the simplest multi-layer network: $\text{Input}(4) \to \text{Hidden}(3) \to \text{Output}(4)$. That's 4 input neurons, 3 hidden neurons with ReLU activation, and 4 output neurons.

| Layer | Size | Parameters | Count |
|---|---|---|---|
| Input $\to$ Hidden | $4 \to 3$ | $\mathbf{W}_1$ $(4 \times 3)$ + $\mathbf{b}_1$ $(3)$ | $12 + 3 = 15$ |
| Hidden $\to$ Output | $3 \to 4$ | $\mathbf{W}_2$ $(3 \times 4)$ + $\mathbf{b}_2$ $(4)$ | $12 + 4 = 16$ |
| Total | | | 31 parameters |

Each connection has a weight and each neuron has a bias. These 31 numbers are everything the network knows. Right now they're random. After training, they'll encode the diagonal flip.
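The parameter count can be double-checked with a short sketch. The function `count_parameters` is a hypothetical helper, and it assumes a fully connected network where every non-input neuron has one bias.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    Between consecutive layers of sizes n_in and n_out there are
    n_in * n_out weights plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


print(count_parameters([4, 3, 4]))  # 15 + 16 = 31
```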

Why 3 hidden neurons?

Three is the smallest hidden layer that can learn this task. With 2, there aren't enough degrees of freedom. With more, it works but there are more numbers to track by hand. Three is the sweet spot for learning.


Step 3: Initialize Weights

Before training, we start with random weights. Here are the exact values we'll trace through:

Layer 1: $\mathbf{W}_1$ (4×3) and $\mathbf{b}_1$ (3)

$$\mathbf{W}_1 = \begin{bmatrix} 0.2 & -0.5 & 0.1 \\ -0.3 & 0.4 & -0.2 \\ 0.1 & 0.3 & 0.5 \\ -0.4 & 0.2 & -0.1 \end{bmatrix}, \quad \mathbf{b}_1 = [0.1,\; -0.1,\; 0.0]$$

Each row of $\mathbf{W}_1$ corresponds to one input. Each column corresponds to one hidden neuron. So $W_1[0][1] = -0.5$ is the weight from input 0 to hidden neuron 1.

Layer 2: $\mathbf{W}_2$ (3×4) and $\mathbf{b}_2$ (4)

$$\mathbf{W}_2 = \begin{bmatrix} 0.3 & -0.2 & 0.4 & 0.1 \\ -0.1 & 0.5 & -0.3 & 0.2 \\ 0.2 & -0.4 & 0.1 & -0.5 \end{bmatrix}, \quad \mathbf{b}_2 = [0.0,\; 0.1,\; -0.1,\; 0.0]$$

These values are arbitrary starting points; the specific numbers don't matter. After training (Chapter 8), they will change to encode the diagonal flip.
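For readers who want to follow along in code, here is one way to write these parameters down in plain Python. Storing each matrix as a list of rows is an assumption of this sketch (it mirrors the `W[i][j]` indexing used in the text); the values are copied from the matrices above.

```python
# Layer 1: rows are inputs (4), columns are hidden neurons (3).
W1 = [
    [0.2, -0.5, 0.1],
    [-0.3, 0.4, -0.2],
    [0.1, 0.3, 0.5],
    [-0.4, 0.2, -0.1],
]
b1 = [0.1, -0.1, 0.0]

# Layer 2: rows are hidden neurons (3), columns are output neurons (4).
W2 = [
    [0.3, -0.2, 0.4, 0.1],
    [-0.1, 0.5, -0.3, 0.2],
    [0.2, -0.4, 0.1, -0.5],
]
b2 = [0.0, 0.1, -0.1, 0.0]

# W1[0][1] is the weight from input 0 to hidden neuron 1.
print(W1[0][1])
```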


Step 4: First Layer — Weighted Sums

We push $\mathbf{x} = [1, 0, 1, 1]$ through the first layer. For each hidden neuron $j$, we compute a weighted sum of all inputs plus a bias: $z_j = \sum_{i=0}^{3} x_i \cdot W_1[i][j] + b_1[j]$.

Hidden neuron 0: $z_0$

$$\begin{aligned} z_0 &= (1)(0.2) + (0)(-0.3) + (1)(0.1) + (1)(-0.4) + 0.1 \\ &= 0.2 + 0 + 0.1 - 0.4 + 0.1 = \mathbf{0.0} \end{aligned}$$

Hidden neuron 1: $z_1$

$$\begin{aligned} z_1 &= (1)(-0.5) + (0)(0.4) + (1)(0.3) + (1)(0.2) + (-0.1) \\ &= -0.5 + 0 + 0.3 + 0.2 - 0.1 = \mathbf{-0.1} \end{aligned}$$

Hidden neuron 2: $z_2$

$$\begin{aligned} z_2 &= (1)(0.1) + (0)(-0.2) + (1)(0.5) + (1)(-0.1) + 0.0 \\ &= 0.1 + 0 + 0.5 - 0.1 + 0 = \mathbf{0.5} \end{aligned}$$

Pre-activation values: $\mathbf{z}^{(1)} = [0.0,\; -0.1,\; 0.5]$

What just happened? Each hidden neuron looked at ALL four input pixels and computed a weighted combination. Neuron 2 got a high value (0.5) because its weights align well with this input. Neuron 1 got a negative value (−0.1), meaning this input "disagrees" with what neuron 1 looks for.
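The three hand calculations can be checked with a small sketch that follows the same formula, one neuron at a time. The function name `weighted_sums` is an assumption of this example. Floating-point arithmetic may leave tiny rounding residue, so the values are rounded for display.

```python
def weighted_sums(inputs, weights, biases):
    """Compute z_j = sum_i inputs[i] * weights[i][j] + biases[j] for every j."""
    return [
        sum(x_i * weights[i][j] for i, x_i in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]


x = [1, 0, 1, 1]
W1 = [
    [0.2, -0.5, 0.1],
    [-0.3, 0.4, -0.2],
    [0.1, 0.3, 0.5],
    [-0.4, 0.2, -0.1],
]
b1 = [0.1, -0.1, 0.0]

z1 = weighted_sums(x, W1, b1)
# Close to [0.0, -0.1, 0.5], up to floating-point rounding.
print([round(z, 4) for z in z1])
```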

Step 5: Apply ReLU Activation

We can't pass raw weighted sums to the next layer. If we did, the network would collapse to a single linear transformation (we prove this in Section 2). We apply ReLU (Rectified Linear Unit): $\text{ReLU}(z) = \max(0, z)$. Positive values pass through, negative values become zero.

| Neuron | $z$ (pre-activation) | $\text{ReLU}(z)$ | Status |
|---|---|---|---|
| $h_0$ | $0.0$ | $\max(0,\; 0.0) = 0.0$ | Zero |
| $h_1$ | $-0.1$ | $\max(0,\; -0.1) = 0.0$ | ❌ Dead (clamped) |
| $h_2$ | $0.5$ | $\max(0,\; 0.5) = 0.5$ | ✅ Alive |

After activation: $\mathbf{h} = [0.0,\; 0.0,\; 0.5]$
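In code, ReLU is a one-liner. The sketch below applies it to the pre-activation values computed by hand; the function name `relu` is simply the conventional choice, not something defined earlier in the text.

```python
def relu(z):
    """Rectified Linear Unit: pass positive values through, clamp the rest to 0."""
    return max(0.0, z)


z1 = [0.0, -0.1, 0.5]          # pre-activation values from Step 4
h = [relu(z) for z in z1]      # apply ReLU to each neuron independently

print(h)
```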

Dead neurons: Neurons 0 and 1 output zero, so they're "dead" for this input. Different inputs will activate different neurons. The network learns which neurons should fire for which patterns. This selective activation is what gives neural networks their power.

What is each neuron "looking for"? Each hidden neuron has a weight vector, the pattern of inputs that excites it most:

| Neuron | Weights from input | Responds to | Status for [1,0,1,1] |
|---|---|---|---|
| $h_0$ | $[0.2,\; -0.3,\; 0.1,\; -0.4]$ | Pixels 0,2 ON; pixels 1,3 OFF | $z_0 = 0.0$ |
| $h_1$ | $[-0.5,\; 0.4,\; 0.3,\; 0.2]$ | Pixel 0 OFF; pixels 1,2,3 ON | $z_1 = -0.1$ |
| $h_2$ | $[0.1,\; -0.2,\; 0.5,\; -0.1]$ | Strongly: pixel 2 (bottom-left) ON | $z_2 = 0.5$ |

Right now these are random, meaningless patterns. After training (Chapter 8), each neuron will learn a specific spatial detector that helps compute the diagonal flip — for instance, one neuron might learn to detect whether pixels 1 and 2 need to swap.
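The weight vector of a hidden neuron is one column of $\mathbf{W}_1$. As a rough sketch, the snippet below pulls out each column and lists which pixels push the neuron up (positive weight) or down (negative weight). The helper name `incoming_weights` is hypothetical.

```python
W1 = [
    [0.2, -0.5, 0.1],
    [-0.3, 0.4, -0.2],
    [0.1, 0.3, 0.5],
    [-0.4, 0.2, -0.1],
]


def incoming_weights(weights, neuron):
    """Return column `neuron` of the weight matrix: one weight per input pixel."""
    return [row[neuron] for row in weights]


for j in range(3):
    w = incoming_weights(W1, j)
    excites = [i for i, w_i in enumerate(w) if w_i > 0]
    inhibits = [i for i, w_i in enumerate(w) if w_i < 0]
    print(f"h{j}: weights={w}, excited by pixels {excites}, inhibited by pixels {inhibits}")
```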

Quick Check

What is ReLU(-3.7)?


Step 6: Second Layer — Output

Push $\mathbf{h} = [0.0, 0.0, 0.5]$ through the output layer. Since $h_0 = h_1 = 0$, only $h_2 = 0.5$ contributes: the dead neurons pass nothing forward.

Output neuron 0: $\hat{y}_0$

$$\hat{y}_0 = (0.0)(0.3) + (0.0)(-0.1) + (0.5)(0.2) + 0.0 = \mathbf{0.1}$$

Output neuron 1: $\hat{y}_1$

$$\hat{y}_1 = (0.0)(-0.2) + (0.0)(0.5) + (0.5)(-0.4) + 0.1 = \mathbf{-0.1}$$

Output neuron 2: $\hat{y}_2$

$$\hat{y}_2 = (0.0)(0.4) + (0.0)(-0.3) + (0.5)(0.1) + (-0.1) = \mathbf{-0.05}$$

Output neuron 3: $\hat{y}_3$

$$\hat{y}_3 = (0.0)(0.1) + (0.0)(0.2) + (0.5)(-0.5) + 0.0 = \mathbf{-0.25}$$

Network prediction: $\hat{\mathbf{y}} = [0.1,\; -0.1,\; -0.05,\; -0.25]$
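The output layer uses the same weighted-sum recipe as the hidden layer, only without an activation afterwards. A minimal sketch, starting from the hidden activations computed by hand (the function name `weighted_sums` is an assumption of this example):

```python
def weighted_sums(inputs, weights, biases):
    """Compute sum_i inputs[i] * weights[i][j] + biases[j] for every output j."""
    return [
        sum(v * weights[i][j] for i, v in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]


h = [0.0, 0.0, 0.5]  # hidden activations after ReLU
W2 = [
    [0.3, -0.2, 0.4, 0.1],
    [-0.1, 0.5, -0.3, 0.2],
    [0.2, -0.4, 0.1, -0.5],
]
b2 = [0.0, 0.1, -0.1, 0.0]

y_hat = weighted_sums(h, W2, b2)
# Close to [0.1, -0.1, -0.05, -0.25], up to floating-point rounding.
print([round(v, 4) for v in y_hat])
```

Because the first two entries of `h` are zero, only the last row of `W2` (scaled by 0.5) and the biases end up in the result.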


The Full Forward Pass

| Stage | Values | Operation |
|---|---|---|
| Input $\mathbf{x}$ | $[1,\; 0,\; 1,\; 1]$ | $2 \times 2$ image flattened |
| $\mathbf{z}^{(1)}$ | $[0.0,\; -0.1,\; 0.5]$ | $\mathbf{x} \cdot \mathbf{W}_1 + \mathbf{b}_1$ |
| Hidden $\mathbf{h}$ | $[0.0,\; 0.0,\; 0.5]$ | $\text{ReLU}(\mathbf{z}^{(1)})$ |
| Output $\hat{\mathbf{y}}$ | $[0.1,\; -0.1,\; -0.05,\; -0.25]$ | $\mathbf{h} \cdot \mathbf{W}_2 + \mathbf{b}_2$ |
| Target $\mathbf{y}$ | $[1,\; 1,\; 0,\; 1]$ | Diagonal flip of input |

As a single equation: $\hat{\mathbf{y}} = \mathbf{W}_2^T \cdot \text{ReLU}(\mathbf{W}_1^T \cdot \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$
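Putting the stages together, the whole forward pass fits in one short function. This is a sketch in plain Python with explicit loops, not an optimized implementation; the names `affine` and `forward` are assumptions of this example, and the parameter values are the ones from Step 3.

```python
def affine(inputs, weights, biases):
    """Weighted sum plus bias for every neuron in a layer."""
    return [
        sum(v * weights[i][j] for i, v in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]


def forward(x, W1, b1, W2, b2):
    """Run Input -> Hidden (ReLU) -> Output and return every intermediate stage."""
    z1 = affine(x, W1, b1)
    h = [max(0.0, z) for z in z1]
    y_hat = affine(h, W2, b2)
    return z1, h, y_hat


W1 = [[0.2, -0.5, 0.1], [-0.3, 0.4, -0.2], [0.1, 0.3, 0.5], [-0.4, 0.2, -0.1]]
b1 = [0.1, -0.1, 0.0]
W2 = [[0.3, -0.2, 0.4, 0.1], [-0.1, 0.5, -0.3, 0.2], [0.2, -0.4, 0.1, -0.5]]
b2 = [0.0, 0.1, -0.1, 0.0]

z1, h, y_hat = forward([1, 0, 1, 1], W1, b1, W2, b2)
print("z1    =", [round(v, 4) for v in z1])
print("h     =", [round(v, 4) for v in h])
print("y_hat =", [round(v, 4) for v in y_hat])
```

Returning the intermediate values as well as the prediction makes it easy to compare each stage against the table, and calling `forward` with a different 4-element input shows how other images activate different hidden neurons.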


What Went Wrong? Measuring the Error

| Output | Predicted | Expected | Error |
|---|---|---|---|
| $\hat{y}_0$ | $0.1$ | $1$ | $0.9$ |
| $\hat{y}_1$ | $-0.1$ | $1$ | $1.1$ |
| $\hat{y}_2$ | $-0.05$ | $0$ | $0.05$ (close!) |
| $\hat{y}_3$ | $-0.25$ | $1$ | $1.25$ |

The prediction is terrible, which is exactly what we expect from random weights. We measure the error with Mean Squared Error (MSE): $L = \frac{1}{4} \sum_{i=0}^{3} (\hat{y}_i - y_i)^2$

$$\begin{aligned} L &= \frac{1}{4}\left[(0.1-1)^2 + (-0.1-1)^2 + (-0.05-0)^2 + (-0.25-1)^2\right] \\ &= \frac{1}{4}\left[0.81 + 1.21 + 0.0025 + 1.5625\right] = \frac{3.585}{4} = \mathbf{0.896} \end{aligned}$$

An MSE of 0.896 is very high (perfect = 0). This is the loss — it tells the network how wrong it is. In Chapter 8, we'll use this loss to compute gradients and update the weights.
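The loss calculation is easy to reproduce. The sketch below uses the hand-computed prediction and the target from this section; the function name `mse` is an assumption of this example.

```python
def mse(predicted, target):
    """Mean Squared Error: average of the squared differences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(squared_errors) / len(squared_errors)


y_hat = [0.1, -0.1, -0.05, -0.25]  # network prediction
y = [1, 1, 0, 1]                   # target: diagonal flip of the input

loss = mse(y_hat, y)
print(round(loss, 3))  # about 0.896
```

A perfect prediction gives a loss of exactly 0, which is a quick sanity check for any loss function.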

The key insight: Forward propagation is just arithmetic, multiplications and additions organized in layers. Nothing magical. The magic happens when we run it backward (backpropagation) to figure out how to adjust each of the 31 parameters (24 weights and 7 biases).

Summary

  1. Flatten first: Convert 2D images into 1D vectors. Our 2×2 image becomes a 4-element vector.
  2. Layer by layer: Each layer computes weighted sum + bias, then applies activation.
  3. Forward pass: $\hat{\mathbf{y}} = \mathbf{W}_2^T \cdot \text{ReLU}(\mathbf{W}_1^T \cdot \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$
  4. Random weights = garbage. The network needs training to produce meaningful results.
  5. Loss measures error. MSE = 0.896 tells us the predictions are far from target.

In the next section, we'll rewrite this entire computation using matrix multiplication — one clean equation instead of computing each neuron individually — and implement it in Python.
