Flatten a 2D image into a 1D vector and understand why neural networks need this
Trace data through every neuron in a network, calculating each value by hand
Apply weights, biases, and activation functions step by step
See why an untrained network produces garbage and why that's the starting point for learning
The Philosophy: Most books show you formulas. We're going to do something different. We'll take one tiny, concrete example and trace every single number through a neural network by hand. No shortcuts. No "the details are left as an exercise." Every multiplication, every addition, every activation — you'll see it all.
The Problem: Flipping a Tiny Image
Imagine the world's smallest camera — it captures images that are just 2 pixels wide and 2 pixels tall. Each pixel is either black (0) or white (1). We want to build a neural network that learns to flip the image diagonally — reflecting it across the diagonal from top-left to bottom-right. A diagonal flip swaps the row and column of each pixel: pixel at (row, col) moves to (col, row).
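| Input Image | Flattened | Target (flipped) | Output Image |
|---|---|---|---|
| [[1,0],[0,1]] | [1,0,0,1] | [1,0,0,1] | [[1,0],[0,1]] |
| [[0,1],[0,0]] | [0,1,0,0] | [0,0,1,0] | [[0,0],[1,0]] |
| [[1,1],[0,0]] | [1,1,0,0] | [1,0,1,0] | [[1,0],[1,0]] |
| [[1,0],[1,1]] | [1,0,1,1] | [1,1,0,1] | [[1,1],[0,1]] |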
Look at the second row. The input has a white pixel at position (row 0, col 1). After the flip, it moves to (row 1, col 0). On the flattened vector, the diagonal swap is $[p_{00}, p_{01}, p_{10}, p_{11}] \to [p_{00}, p_{10}, p_{01}, p_{11}]$: positions 1 and 2 swap while positions 0 and 3 stay put.
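The swap rule can be checked against all four table rows with a short sketch. The function name `flip_flat` is an illustrative choice, not something defined in this chapter.

```python
def flip_flat(pixels):
    """Diagonal flip of a flattened 2x2 image: swap positions 1 and 2."""
    p00, p01, p10, p11 = pixels
    return [p00, p10, p01, p11]


# (flattened input, expected target) pairs taken from the table above
examples = [
    ([1, 0, 0, 1], [1, 0, 0, 1]),
    ([0, 1, 0, 0], [0, 0, 1, 0]),
    ([1, 1, 0, 0], [1, 0, 1, 0]),
    ([1, 0, 1, 1], [1, 1, 0, 1]),
]

for flat_input, target in examples:
    flipped = flip_flat(flat_input)
    print(flat_input, "->", flipped, "matches table:", flipped == target)
```

Flipping twice returns the original vector, which is a quick sanity check that the rule really is a reflection.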
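Why this example? The diagonal flip is a real geometric transformation with a clear pattern, and it is simple enough to calculate by hand. One neuron cannot do it, because the output has four pixels and each one needs its own neuron. Strictly speaking, the flip only rearranges pixels, so a single linear layer with four outputs could represent it; we add a hidden layer anyway so that there is a complete multi-layer forward pass to trace.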
Step 1: Flatten the Image
Neural networks eat 1D vectors, not 2D grids. We read pixels left-to-right, top-to-bottom. A 2×2 image with pixels $p_{00}, p_{01}, p_{10}, p_{11}$ becomes a 4-element vector $x = [p_{00}, p_{01}, p_{10}, p_{11}]$.
For our running example, we'll use the input [[1,0],[1,1]], which flattens to x=[1,0,1,1]. The expected output after the diagonal flip is [[1,1],[0,1]], which flattens to y=[1,1,0,1].
Reading order matters. As long as you flatten input and output using the same order (row-major), the network can learn the mapping. The network doesn't know it's looking at an image — it just sees numbers.
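Row-major flattening takes only a few lines. The sketch below uses plain Python lists; the helper names `flatten` and `unflatten` are illustrative, not part of this chapter.

```python
def flatten(image):
    """Read a 2D grid left-to-right, top-to-bottom (row-major order)."""
    return [pixel for row in image for pixel in row]


def unflatten(vector, width):
    """Inverse of flatten: cut the vector back into rows of `width` pixels."""
    return [vector[i:i + width] for i in range(0, len(vector), width)]


image = [[1, 0],
         [1, 1]]
x = flatten(image)
print(x)                # [1, 0, 1, 1]
print(unflatten(x, 2))  # [[1, 0], [1, 1]]
```

Using the same pair of helpers for both input and target is one simple way to guarantee that the two share a reading order.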
Quick Check
If the input image is [[0,1],[1,0]], what is the flattened vector?
Step 2: Design the Network
We'll build the simplest multi-layer network: Input(4)→Hidden(3)→Output(4). That's 4 input neurons, 3 hidden neurons with ReLU activation, and 4 output neurons.
| Layer | Size | Parameters | Count |
|---|---|---|---|
| Input → Hidden | 4→3 | W1 (4×3) + b1 (3) | 12+3=15 |
| Hidden → Output | 3→4 | W2 (3×4) + b2 (4) | 12+4=16 |
| Total | | | 31 parameters |
Each connection has a weight, and each hidden and output neuron has a bias. These 31 numbers are everything the network knows. Right now they're random. After training, they'll encode the diagonal flip.
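The count in the table can be reproduced for any fully connected layout. This is a minimal sketch; `count_parameters` is a hypothetical helper name.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the neurons per layer, e.g. [4, 3, 4].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out  # one weight per connection
        biases = n_out          # one bias per receiving neuron
        print(f"{n_in} -> {n_out}: {weights} weights + {biases} biases = {weights + biases}")
        total += weights + biases
    return total


print("Total:", count_parameters([4, 3, 4]))  # Total: 31
```

Changing the middle entry shows how quickly the amount of hand arithmetic grows with the hidden layer size.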
Why 3 hidden neurons?
Three is the smallest hidden layer that can learn this task. With 2, there aren't enough degrees of freedom. With more, it works but there are more numbers to track by hand. Three is the sweet spot for learning.
Step 3: Initialize Weights
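Before training, we start with random weights. Here are the first-layer values we'll trace through:
$$W_1 = \begin{bmatrix} 0.2 & -0.5 & 0.1 \\ -0.3 & 0.4 & -0.2 \\ 0.1 & 0.3 & 0.5 \\ -0.4 & 0.2 & -0.1 \end{bmatrix}, \qquad b_1 = [0.1, -0.1, 0.0]$$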
Each row of W1 corresponds to one input. Each column corresponds to one hidden neuron. So W1[0][1]=−0.5 is the weight from input 0 to hidden neuron 1.
The specific values don't matter; they're just random starting points. After training (Chapter 8), these numbers will change to encode the diagonal flip.
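For readers who want to generate their own starting point, here is one way random initialization might look. The uniform range, the seed, and the helper name `init_layer` are all assumptions made for this sketch; the values it produces are not the ones traced in this chapter.

```python
import random


def init_layer(n_in, n_out, rng, scale=0.5):
    """Return (W, b): W has n_in rows and n_out columns, b has n_out entries.

    Values are drawn uniformly from [-scale, scale]; the range is an
    assumption for this sketch, not the chapter's initialization scheme.
    """
    W = [[rng.uniform(-scale, scale) for _ in range(n_out)] for _ in range(n_in)]
    b = [rng.uniform(-scale, scale) for _ in range(n_out)]
    return W, b


rng = random.Random(0)          # fixed seed so the sketch is repeatable
W1, b1 = init_layer(4, 3, rng)  # input -> hidden
W2, b2 = init_layer(3, 4, rng)  # hidden -> output

print(len(W1), "x", len(W1[0]))  # 4 x 3
print(len(W2), "x", len(W2[0]))  # 3 x 4
```

The row-per-input, column-per-neuron layout matches the indexing convention described above, so `W1[0][1]` is again the weight from input 0 to hidden neuron 1.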
Step 4: First Layer — Weighted Sums
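We push $x = [1, 0, 1, 1]$ through the first layer. For each hidden neuron $j$, we compute a weighted sum of all inputs plus a bias: $z_j = \sum_{i=0}^{3} x_i \cdot W_1[i][j] + b_1[j]$. For our input, the three sums come out to $z = [0.0, -0.1, 0.5]$.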
What just happened? Each hidden neuron looked at ALL four input pixels and computed a weighted combination. Neuron 2 got a high value (0.5) because its weights align well with this input. Neuron 1 got a negative value (−0.1), meaning this input "disagrees" with what neuron 1 looks for.
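The three weighted sums can be reproduced neuron by neuron with a minimal sketch. The columns of `W1` are the weight vectors listed in the "Weights from input" table later in this section. The bias values in `b1` are an assumption, inferred so that the sums match the stated z values.

```python
x = [1, 0, 1, 1]

# W1[i][j] is the weight from input i to hidden neuron j (4 rows, 3 columns)
W1 = [
    [0.2, -0.5, 0.1],
    [-0.3, 0.4, -0.2],
    [0.1, 0.3, 0.5],
    [-0.4, 0.2, -0.1],
]
# Assumed biases: inferred from the stated z values, not quoted from the text
b1 = [0.1, -0.1, 0.0]


def weighted_sums(x, W, b):
    """Compute z_j = sum_i x[i] * W[i][j] + b[j] for every neuron j."""
    z = []
    for j in range(len(b)):
        total = b[j]
        for i in range(len(x)):
            total += x[i] * W[i][j]
        z.append(total)
    return z


z = weighted_sums(x, W1, b1)
# Round away floating-point noise; adding 0.0 turns a possible -0.0 into 0.0
print([round(value, 6) + 0.0 for value in z])  # [0.0, -0.1, 0.5]
```

The explicit double loop mirrors the by-hand procedure: one pass over the inputs for each hidden neuron.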
Step 5: Apply ReLU Activation
We can't pass raw weighted sums to the next layer — if we did, the network would collapse to a single linear transformation (we prove this in Section 2). We apply ReLU (Rectified Linear Unit): ReLU(z)=max(0,z). Positive values pass through, negative values become zero.
| Neuron | z (pre-activation) | ReLU(z) | Status |
|---|---|---|---|
| h0 | 0.0 | max(0,0.0)=0.0 | Zero |
| h1 | −0.1 | max(0,−0.1)=0.0 | ❌ Dead (clamped) |
| h2 | 0.5 | max(0,0.5)=0.5 | ✅ Alive |
After activation: h=[0.0,0.0,0.5]
Dead neurons: Neurons 0 and 1 output zero — they're "dead" for this input. Different inputs will activate different neurons. The network learns which neurons should fire for which patterns. This selective activation is what gives neural networks their power.
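The activation step is a one-line function applied to each pre-activation. The sketch below starts from the z values computed in Step 4; the name `silent` is illustrative.

```python
def relu(z):
    """Rectified Linear Unit: pass positive values, clamp the rest to zero."""
    return max(0.0, z)


z = [0.0, -0.1, 0.5]  # pre-activations from Step 4
h = [relu(value) for value in z]
print(h)              # [0.0, 0.0, 0.5]

# Indices of neurons that output zero for this input
silent = [j for j, value in enumerate(h) if value == 0.0]
print(silent)         # [0, 1]
```

Running the same two lines on the z values of a different input would generally give a different `silent` list, which is the selective activation described above.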
What is each neuron "looking for"? Each hidden neuron has a weight vector — the pattern of inputs that excites it most:
| Neuron | Weights from input | Responds to | Status for [1,0,1,1] |
|---|---|---|---|
| h0 | [0.2,−0.3,0.1,−0.4] | Pixels 0,2 ON; pixels 1,3 OFF | z0=0.0 |
| h1 | [−0.5,0.4,0.3,0.2] | Pixel 0 OFF; pixels 1,2,3 ON | z1=−0.1 |
| h2 | [0.1,−0.2,0.5,−0.1] | Strongly: pixel 2 (bottom-left) ON | z2=0.5 |
Right now these are random, meaningless patterns. After training (Chapter 8), each neuron will learn a specific spatial detector that helps compute the diagonal flip — for instance, one neuron might learn to detect whether pixels 1 and 2 need to swap.
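The "Responds to" column can be derived mechanically. For binary pixels, the input that maximizes a neuron's weighted sum turns a pixel ON exactly where the weight is positive. The sketch below applies that rule to the three weight vectors from the table; `preferred_input` is a hypothetical helper name.

```python
# Weight vector of each hidden neuron, copied from the table above
neuron_weights = {
    "h0": [0.2, -0.3, 0.1, -0.4],
    "h1": [-0.5, 0.4, 0.3, 0.2],
    "h2": [0.1, -0.2, 0.5, -0.1],
}


def preferred_input(weights):
    """Binary input that maximizes the weighted sum: ON where the weight is positive."""
    return [1 if w > 0 else 0 for w in weights]


for name, weights in neuron_weights.items():
    # Pixel whose weight has the largest absolute value
    strongest = max(range(len(weights)), key=lambda i: abs(weights[i]))
    print(name, "prefers", preferred_input(weights),
          "| largest-magnitude weight at pixel", strongest)
```

For h2 the preferred pattern also turns pixel 0 ON, but that weight is only 0.1 against 0.5 for pixel 2, which is why the table singles out pixel 2.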
Quick Check
What is ReLU(-3.7)?
Step 6: Second Layer — Output
Push h=[0.0,0.0,0.5] through the output layer. Since h0=h1=0, only h2=0.5 contributes — the dead neurons pass nothing forward.
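The observation that only h2 contributes can be checked in code. The second-layer values are not listed in this text, so `W2` and `b2` below are hypothetical numbers made up for illustration, and the printed output will not match the chapter's own result. The sketch also assumes no activation on the output layer, in line with the forward-pass formula in the summary.

```python
h = [0.0, 0.0, 0.5]

# HYPOTHETICAL second-layer parameters, made up for this sketch only.
# W2[j][k] is the weight from hidden neuron j to output neuron k (3 rows, 4 columns).
W2 = [
    [0.3, -0.1, 0.2, 0.4],
    [-0.2, 0.5, -0.3, 0.1],
    [0.4, 0.2, -0.6, 0.8],
]
b2 = [0.1, 0.0, -0.1, 0.2]


def output_layer(h, W, b):
    """Compute y_k = sum_j h[j] * W[j][k] + b[k]; no activation on the output."""
    return [b[k] + sum(h[j] * W[j][k] for j in range(len(h)))
            for k in range(len(b))]


y_hat = output_layer(h, W2, b2)

# Because h[0] = h[1] = 0, only row 2 of W2 (scaled by 0.5) and b2 matter.
shortcut = [b2[k] + 0.5 * W2[2][k] for k in range(4)]

print([round(v, 6) for v in y_hat])
print(all(abs(a - b) < 1e-12 for a, b in zip(y_hat, shortcut)))  # True
```

Rows 0 and 1 of `W2` could hold any values at all without changing `y_hat` for this input, which is what "pass nothing forward" means in practice.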
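Comparing the network's output with the target y=[1,1,0,1] gives a mean squared error (MSE) of 0.896, which is very high (a perfect prediction scores 0). This is the loss: it tells the network how wrong it is. In Chapter 8, we'll use this loss to compute gradients and update the weights.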
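As a reference for how such a loss is computed, here is a minimal MSE sketch. It assumes the squared errors are averaged over the four output pixels. The two predictions are simple illustrative cases, not the network's actual output.

```python
def mse(prediction, target):
    """Mean squared error: average of the squared differences."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


target = [1, 1, 0, 1]  # flattened diagonal flip of the running example

print(mse(target, target))        # 0.0, a perfect prediction
print(mse([0, 0, 0, 0], target))  # 0.75, an all-black guess misses three pixels
```

Against these two reference points, a loss above 0.75 means the untrained network does worse than simply predicting an all-black image.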
The key insight: Forward propagation is just arithmetic: multiplications and additions organized in layers. Nothing magical. The magic happens when we run it backward (backpropagation) to figure out how to adjust each of the 31 parameters.
Summary
Flatten first: Convert 2D images into 1D vectors. Our 2×2 image becomes a 4-element vector.
Layer by layer: Each layer computes weighted sum + bias, then applies activation.
Forward pass: $\hat{y} = W_2^T \cdot \mathrm{ReLU}(W_1^T \cdot x + b_1) + b_2$
Random weights = garbage. The network needs training to produce meaningful results.
Loss measures error. MSE = 0.896 tells us the predictions are far from target.
In the next section, we'll rewrite this entire computation using matrix multiplication — one clean equation instead of computing each neuron individually — and implement it in Python.