Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Flatten a 2D image into a 1D vector and understand why neural networks need this
Trace data through every neuron in a network, calculating each value by hand
Apply weights, biases, and activation functions step by step
See why an untrained network produces garbage and why that's the starting point for learning

The Philosophy: Most books show you formulas. We're going to do something different. We'll take one tiny, concrete example and trace every single number through a neural network by hand. No shortcuts. No "the details are left as an exercise." Every multiplication, every addition, every activation — you'll see it all.

The Problem: Flipping a Tiny Image

Imagine the world's smallest camera — it captures images that are just 2 pixels wide and 2 pixels tall. Each pixel is either black (0) or white (1). We want to build a neural network that learns to flip the image diagonally — reflecting it across the diagonal from top-left to bottom-right. A diagonal flip swaps the row and column of each pixel: pixel at (row, col) moves to (col, row).

Input Image	Flattened	Target (flipped)	Output Image
$\begin{bmatrix}1&0\\0&1\end{bmatrix}$	$[1,0,0,1]$	$[1,0,0,1]$	$\begin{bmatrix}1&0\\0&1\end{bmatrix}$
$\begin{bmatrix}0&1\\0&0\end{bmatrix}$	$[0,1,0,0]$	$[0,0,1,0]$	$\begin{bmatrix}0&0\\1&0\end{bmatrix}$
$\begin{bmatrix}1&1\\0&0\end{bmatrix}$	$[1,1,0,0]$	$[1,0,1,0]$	$\begin{bmatrix}1&0\\1&0\end{bmatrix}$
$\begin{bmatrix}1&0\\1&1\end{bmatrix}$	$[1,0,1,1]$	$[1,1,0,1]$	$\begin{bmatrix}1&1\\0&1\end{bmatrix}$

Look at the second row. The input has a white pixel at position (row 0, col 1). After the flip, it moves to (row 1, col 0). The diagonal swap is: $[p_{00}, p_{01}, p_{10}, p_{11}] \to [p_{00}, p_{10}, p_{01}, p_{11}]$ — positions 1 and 2 swap while 0 and 3 stay.
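The index swap described above can be sketched in a few lines of Python. This is only an illustration of the target mapping, not part of the network; the function name `diagonal_flip` is our own, and it assumes a row-major flattened 2×2 image.

```python
def diagonal_flip(v):
    """Transpose a row-major flattened 2x2 image: positions 1 and 2 swap."""
    p00, p01, p10, p11 = v
    return [p00, p10, p01, p11]

# The four rows of the table above: (flattened input, expected target).
examples = [
    ([1, 0, 0, 1], [1, 0, 0, 1]),
    ([0, 1, 0, 0], [0, 0, 1, 0]),
    ([1, 1, 0, 0], [1, 0, 1, 0]),
    ([1, 0, 1, 1], [1, 1, 0, 1]),
]

for flat, target in examples:
    print(flat, "->", diagonal_flip(flat), "expected", target)
```

Flipping twice returns the original image, which is a quick sanity check that the swap is written correctly.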

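Why this example? The diagonal flip is a real geometric transformation with a clear pattern. It's simple enough to calculate by hand, yet a single neuron can't do it: one neuron produces one number, and the flip needs four outputs. A small multi-layer network lets us exercise every part of forward propagation (weights, biases, and an activation function) on numbers you can check yourself.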

Step 1: Flatten the Image

Neural networks eat 1D vectors, not 2D grids. We read pixels left-to-right, top-to-bottom. A 2×2 image with pixels $p_{00}, p_{01}, p_{10}, p_{11}$ becomes a 4-element vector $\mathbf{x} = [p_{00}, p_{01}, p_{10}, p_{11}]$.

For our running example, we'll use the input $\begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$ which flattens to $\mathbf{x} = [1, 0, 1, 1]$. The expected output after diagonal flip is $\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ which flattens to $\mathbf{y} = [1, 1, 0, 1]$.

Reading order matters. As long as you flatten input and output using the same order (row-major), the network can learn the mapping. The network doesn't know it's looking at an image — it just sees numbers.
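Row-major flattening can be sketched as follows. The helper names `flatten` and `unflatten` are our own, and images are assumed to be plain nested lists.

```python
def flatten(image):
    """Read a 2D grid row by row (row-major) into a flat list."""
    return [pixel for row in image for pixel in row]

def unflatten(vector, width):
    """Inverse of flatten: cut the flat list back into rows of `width` pixels."""
    return [vector[i:i + width] for i in range(0, len(vector), width)]

x = flatten([[1, 0], [1, 1]])   # running example input
y = flatten([[1, 1], [0, 1]])   # its diagonal flip

print(x)                # [1, 0, 1, 1]
print(y)                # [1, 1, 0, 1]
print(unflatten(y, 2))  # [[1, 1], [0, 1]]
```

Because `unflatten` undoes `flatten`, any prediction the network makes can be turned back into a 2×2 grid for display.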

Quick Check

If the input image is [[0,1],[1,0]], what is the flattened vector?

Step 2: Design the Network

We'll build the simplest multi-layer network: $\text{Input}(4) \to \text{Hidden}(3) \to \text{Output}(4)$. That's 4 input neurons, 3 hidden neurons with ReLU activation, and 4 output neurons.

Layer	Size	Parameters	Count
Input $\to$ Hidden	$4 \to 3$	$\mathbf{W}_1$ $(4 \times 3)$ + $\mathbf{b}_1$ (3)	$12 + 3 = 15$
Hidden $\to$ Output	$3 \to 4$	$\mathbf{W}_2$ $(3 \times 4)$ + $\mathbf{b}_2$ (4)	$12 + 4 = 16$
Total			31 parameters

Each connection has a weight and each neuron has a bias. These 31 numbers are everything the network knows. Right now they're random. After training, they'll encode the diagonal flip.
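The count in the table can be reproduced with a short sketch. The function `count_parameters` is a hypothetical helper of ours; it assumes every layer is fully connected and every non-input neuron has one bias.

```python
def count_parameters(layer_sizes):
    """Sum weights (n_in * n_out) and biases (n_out) over consecutive layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out
        biases = n_out
        print(f"{n_in} -> {n_out}: {weights} weights + {biases} biases = {weights + biases}")
        total += weights + biases
    return total

total = count_parameters([4, 3, 4])
print("total:", total)  # total: 31
```

Of the 31 parameters, 24 are weights and 7 are biases.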

Why 3 hidden neurons?

Three keeps the hand calculation manageable while still giving the network a real hidden layer. With fewer hidden neurons there are fewer degrees of freedom to work with; with more, the network still works but there are more numbers to track by hand.

Step 3: Initialize Weights

Before training, we start with random weights. Here are the exact values we'll trace through:

Layer 1: $\mathbf{W}_1$ (4×3) and $\mathbf{b}_1$ (3)

$\mathbf{W}_1 = \begin{bmatrix} 0.2 & -0.5 & 0.1 \\ -0.3 & 0.4 & -0.2 \\ 0.1 & 0.3 & 0.5 \\ -0.4 & 0.2 & -0.1 \end{bmatrix}$ , $\;\;\mathbf{b}_1 = [0.1,\; -0.1,\; 0.0]$

Each row of $\mathbf{W}_1$ corresponds to one input. Each column corresponds to one hidden neuron. So $W_1[0][1] = -0.5$ is the weight from input 0 to hidden neuron 1.

Layer 2: $\mathbf{W}_2$ (3×4) and $\mathbf{b}_2$ (4)

$\mathbf{W}_2 = \begin{bmatrix} 0.3 & -0.2 & 0.4 & 0.1 \\ -0.1 & 0.5 & -0.3 & 0.2 \\ 0.2 & -0.4 & 0.1 & -0.5 \end{bmatrix}$ , $\;\;\mathbf{b}_2 = [0.0,\; 0.1,\; -0.1,\; 0.0]$

These values are arbitrary starting points; the specific numbers don't matter. After training (Chapter 8), they will change to encode the diagonal flip.
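If you want to follow along on a computer, the same values can be written as nested Python lists. This is just one possible representation (an assumption on our part, chosen to match the `W[i][j]` indexing used above); the helper `shape` is our own.

```python
# Layer 1: rows = inputs (4), columns = hidden neurons (3).
W1 = [[ 0.2, -0.5,  0.1],
      [-0.3,  0.4, -0.2],
      [ 0.1,  0.3,  0.5],
      [-0.4,  0.2, -0.1]]
b1 = [0.1, -0.1, 0.0]

# Layer 2: rows = hidden neurons (3), columns = output neurons (4).
W2 = [[ 0.3, -0.2,  0.4,  0.1],
      [-0.1,  0.5, -0.3,  0.2],
      [ 0.2, -0.4,  0.1, -0.5]]
b2 = [0.0, 0.1, -0.1, 0.0]

def shape(matrix):
    """Return (rows, columns) of a nested-list matrix."""
    return (len(matrix), len(matrix[0]))

print(shape(W1), shape(W2))  # (4, 3) (3, 4)
print(W1[0][1])              # -0.5, the weight from input 0 to hidden neuron 1
```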

Step 4: First Layer — Weighted Sums

We push $\mathbf{x} = [1, 0, 1, 1]$ through the first layer. For each hidden neuron $j$, we compute a weighted sum of all inputs plus a bias: $z_j = \sum_{i=0}^{3} x_i \cdot W_1[i][j] + b_1[j]$.

Hidden neuron 0: $z_0$

$z_0 = (1)(0.2) + (0)(-0.3) + (1)(0.1) + (1)(-0.4) + 0.1$
$= 0.2 + 0 + 0.1 - 0.4 + 0.1 = \mathbf{0.0}$

Hidden neuron 1: $z_1$

$z_1 = (1)(-0.5) + (0)(0.4) + (1)(0.3) + (1)(0.2) + (-0.1)$
$= -0.5 + 0 + 0.3 + 0.2 - 0.1 = \mathbf{-0.1}$

Hidden neuron 2: $z_2$

$z_2 = (1)(0.1) + (0)(-0.2) + (1)(0.5) + (1)(-0.1) + 0.0$
$= 0.1 + 0 + 0.5 - 0.1 + 0 = \mathbf{0.5}$

Pre-activation values: $\mathbf{z}^{(1)} = [0.0,\; -0.1,\; 0.5]$

What just happened? Each hidden neuron looked at ALL four input pixels and computed a weighted combination. Neuron 2 got a high value (0.5) because its weights align well with this input. Neuron 1 got a negative value (−0.1), meaning this input "disagrees" with what neuron 1 looks for.
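To check the three hand calculations, here is a minimal sketch of the same sum in plain Python. The function name `weighted_sums` is our own. Note that floating-point arithmetic can leave a tiny rounding error, so $z_0$ may come out as a number like `2.8e-17` rather than exactly `0.0`.

```python
x = [1, 0, 1, 1]
W1 = [[ 0.2, -0.5,  0.1],
      [-0.3,  0.4, -0.2],
      [ 0.1,  0.3,  0.5],
      [-0.4,  0.2, -0.1]]
b1 = [0.1, -0.1, 0.0]

def weighted_sums(inputs, weights, biases):
    """z_j = sum_i inputs[i] * weights[i][j] + biases[j], for every neuron j."""
    return [
        sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
        for j in range(len(biases))
    ]

z1 = weighted_sums(x, W1, b1)
# Rounded for display; values are close to [0.0, -0.1, 0.5].
print([round(z, 10) for z in z1])
```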

Step 5: Apply ReLU Activation

We can't pass raw weighted sums to the next layer: if we did, the network would collapse to a single linear transformation (we prove this in Section 2). We apply ReLU (Rectified Linear Unit): $\text{ReLU}(z) = \max(0, z)$. Positive values pass through, negative values become zero.

Neuron	$z$ (pre-activation)	$\text{ReLU}(z)$	Status
$h_0$	$0.0$	$\max(0,\; 0.0) = 0.0$	Zero
$h_1$	$-0.1$	$\max(0,\; -0.1) = 0.0$	❌ Dead (clamped)
$h_2$	$0.5$	$\max(0,\; 0.5) = 0.5$	✅ Alive

After activation: $\mathbf{h} = [0.0,\; 0.0,\; 0.5]$

Dead neurons: Neurons 0 and 1 output zero — they're "dead" for this input. Different inputs will activate different neurons. The network learns which neurons should fire for which patterns. This selective activation is what gives neural networks their power.
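ReLU is a one-liner in code. The sketch below applies it to the pre-activation values computed by hand above; the names `relu` and `active` are our own.

```python
def relu(z):
    """Rectified Linear Unit: pass positive values, clamp the rest to zero."""
    return max(0.0, z)

z1 = [0.0, -0.1, 0.5]          # pre-activations from Step 4
h = [relu(z) for z in z1]
active = [j for j, value in enumerate(h) if value > 0]

print(h)       # [0.0, 0.0, 0.5]
print(active)  # [2]  (only hidden neuron 2 fires for this input)
```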

What is each neuron "looking for"? Each hidden neuron has a weight vector — the pattern of inputs that excites it most:

Neuron	Weights from input	Responds to	Status for [1,0,1,1]
$h_0$	$[0.2,\; -0.3,\; 0.1,\; -0.4]$	Pixels 0,2 ON; pixels 1,3 OFF	$z_0 = 0.0$
$h_1$	$[-0.5,\; 0.4,\; 0.3,\; 0.2]$	Pixel 0 OFF; pixels 1,2,3 ON	$z_1 = -0.1$
$h_2$	$[0.1,\; -0.2,\; 0.5,\; -0.1]$	Strongly: pixel 2 (bottom-left) ON	$z_2 = 0.5$

Right now these are random, meaningless patterns. After training (Chapter 8), each neuron will learn a specific spatial detector that helps compute the diagonal flip — for instance, one neuron might learn to detect whether pixels 1 and 2 need to swap.

Quick Check

What is ReLU(-3.7)?

Step 6: Second Layer — Output

Push $\mathbf{h} = [0.0, 0.0, 0.5]$ through the output layer. Since $h_0 = h_1 = 0$, only $h_2 = 0.5$ contributes; the dead neurons pass nothing forward.

Output neuron 0: $\hat{y}_0$

$\hat{y}_0 = (0.0)(0.3) + (0.0)(-0.1) + (0.5)(0.2) + 0.0 = \mathbf{0.1}$

Output neuron 1: $\hat{y}_1$

$\hat{y}_1 = (0.0)(-0.2) + (0.0)(0.5) + (0.5)(-0.4) + 0.1 = \mathbf{-0.1}$

Output neuron 2: $\hat{y}_2$

$\hat{y}_2 = (0.0)(0.4) + (0.0)(-0.3) + (0.5)(0.1) + (-0.1) = \mathbf{-0.05}$

Output neuron 3: $\hat{y}_3$

$\hat{y}_3 = (0.0)(0.1) + (0.0)(0.2) + (0.5)(-0.5) + 0.0 = \mathbf{-0.25}$

Network prediction: $\hat{\mathbf{y}} = [0.1,\; -0.1,\; -0.05,\; -0.25]$
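The four output sums can be checked the same way. This sketch reuses the hand-computed hidden vector and the Layer 2 values from Step 3; the variable name `y_hat` is our own, and results may differ from the hand values by a tiny floating-point rounding error.

```python
h = [0.0, 0.0, 0.5]
W2 = [[ 0.3, -0.2,  0.4,  0.1],
      [-0.1,  0.5, -0.3,  0.2],
      [ 0.2, -0.4,  0.1, -0.5]]
b2 = [0.0, 0.1, -0.1, 0.0]

# y_hat[k] = sum_j h[j] * W2[j][k] + b2[k]; the output layer has no activation.
y_hat = [
    sum(h[j] * W2[j][k] for j in range(len(h))) + b2[k]
    for k in range(len(b2))
]

# Rounded for display; values are close to [0.1, -0.1, -0.05, -0.25].
print([round(v, 10) for v in y_hat])
```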

The Full Forward Pass

Stage	Values	Operation
Input $\mathbf{x}$	$[1,\; 0,\; 1,\; 1]$	$2 \times 2$ image flattened
$\mathbf{z}^{(1)}$	$[0.0,\; -0.1,\; 0.5]$	$\mathbf{x} \cdot \mathbf{W}_1 + \mathbf{b}_1$
Hidden $\mathbf{h}$	$[0.0,\; 0.0,\; 0.5]$	$\text{ReLU}(\mathbf{z}^{(1)})$
Output $\hat{\mathbf{y}}$	$[0.1,\; -0.1,\; -0.05,\; -0.25]$	$\mathbf{h} \cdot \mathbf{W}_2 + \mathbf{b}_2$
Target $\mathbf{y}$	$[1,\; 1,\; 0,\; 1]$	Diagonal flip of input

As a single equation: $\hat{\mathbf{y}} = \mathbf{W}_2^T \cdot \text{ReLU}(\mathbf{W}_1^T \cdot \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$
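The table writes each layer as a row vector times a matrix ($\mathbf{x} \cdot \mathbf{W}_1$), while the single equation uses the column-vector form ($\mathbf{W}_1^T \cdot \mathbf{x}$); both describe the same sums. A minimal end-to-end sketch in plain Python, with helper names `dense` and `forward` of our own choosing, might look like this:

```python
W1 = [[ 0.2, -0.5,  0.1],
      [-0.3,  0.4, -0.2],
      [ 0.1,  0.3,  0.5],
      [-0.4,  0.2, -0.1]]
b1 = [0.1, -0.1, 0.0]
W2 = [[ 0.3, -0.2,  0.4,  0.1],
      [-0.1,  0.5, -0.3,  0.2],
      [ 0.2, -0.4,  0.1, -0.5]]
b2 = [0.0, 0.1, -0.1, 0.0]

def dense(inputs, weights, biases):
    """One layer: weighted sum of all inputs plus a bias, per output neuron."""
    return [
        sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
        for j in range(len(biases))
    ]

def forward(x):
    """Input -> hidden (ReLU) -> output (no activation)."""
    z1 = dense(x, W1, b1)
    h = [max(0.0, z) for z in z1]
    return dense(h, W2, b2)

y_hat = forward([1, 0, 1, 1])
# Rounded for display; values are close to [0.1, -0.1, -0.05, -0.25].
print([round(v, 4) for v in y_hat])
```

Calling `forward` with a different flattened image, such as `[0, 1, 0, 0]`, shows how other inputs activate different hidden neurons.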

What Went Wrong? Measuring the Error

Output	Predicted	Expected	Error
$\hat{y}_0$	$0.1$	$1$	$0.9$
$\hat{y}_1$	$-0.1$	$1$	$1.1$
$\hat{y}_2$	$-0.05$	$0$	$0.05$ (close!)
$\hat{y}_3$	$-0.25$	$1$	$1.25$

The prediction is terrible — exactly what we expect from random weights. We measure the error with Mean Squared Error (MSE): $L = \frac{1}{4} \sum_{i=0}^{3} (\hat{y}_i - y_i)^2$

$L = \frac{1}{4}[(0.1-1)^2 + (-0.1-1)^2 + (-0.05-0)^2 + (-0.25-1)^2]$
$= \frac{1}{4}[0.81 + 1.21 + 0.0025 + 1.5625] = \frac{3.585}{4} = \mathbf{0.896}$

An MSE of 0.896 is very high (perfect = 0). This is the loss — it tells the network how wrong it is. In Chapter 8, we'll use this loss to compute gradients and update the weights.
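The loss calculation is short enough to verify in code. The function name `mse` is our own; it implements the formula above directly.

```python
def mse(predicted, target):
    """Mean Squared Error: average of the squared differences."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

y_hat = [0.1, -0.1, -0.05, -0.25]   # network prediction
y = [1, 1, 0, 1]                    # target: diagonal flip of the input

loss = mse(y_hat, y)
print(round(loss, 3))  # 0.896
```

A perfect prediction gives a loss of exactly zero: `mse(y, y)` returns `0.0`.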

The key insight: Forward propagation is just arithmetic: multiplications and additions organized in layers. Nothing magical. The magic happens when we run it backward (backpropagation) to figure out how to adjust each of the 31 parameters (24 weights and 7 biases).

Summary

Flatten first: Convert 2D images into 1D vectors. Our 2×2 image becomes a 4-element vector.
Layer by layer: Each layer computes weighted sum + bias; the hidden layer then applies ReLU, while the output layer in this network applies no activation.
Forward pass: $\hat{\mathbf{y}} = W_2^T \cdot \text{ReLU}(W_1^T \cdot \mathbf{x} + b_1) + b_2$
Random weights = garbage. The network needs training to produce meaningful results.
Loss measures error. MSE = 0.896 tells us the predictions are far from target.

In the next section, we'll rewrite this entire computation using matrix multiplication — one clean equation instead of computing each neuron individually — and implement it in Python.
