Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Rewrite per-neuron calculations as matrix multiplications and verify they produce identical results
Read dimension annotations and know the shape of every tensor at each stage
Prove why non-linearity is essential — without it, deep networks collapse to shallow ones
Implement the forward pass in Python using NumPy in just 3 lines

Why this matters: In Section 1, we computed each neuron one at a time — 12 separate multiplications for each layer. Matrix multiplication does an entire layer in one operation. Every framework (PyTorch, TensorFlow) works this way. The code mirrors the math line for line.

From Neurons to One Matrix Operation

In Section 1, we computed each hidden neuron separately. Look at the pattern:

$z_0 = x_0 W_1[0][0] + x_1 W_1[1][0] + x_2 W_1[2][0] + x_3 W_1[3][0] + b_1[0]$
$z_1 = x_0 W_1[0][1] + x_1 W_1[1][1] + x_2 W_1[2][1] + x_3 W_1[3][1] + b_1[1]$
$z_2 = x_0 W_1[0][2] + x_1 W_1[1][2] + x_2 W_1[2][2] + x_3 W_1[3][2] + b_1[2]$

Each equation is a dot product between $\mathbf{x}$ and a column of $\mathbf{W}_1$ . All three at once is a matrix-vector multiplication: $\mathbf{z}^{(1)} = \mathbf{x} \cdot \mathbf{W}_1 + \mathbf{b}_1$

In NumPy this is just: $\texttt{z1 = x @ W1 + b1}$ . One line replaces three calculations.

Loading matrix visualization...

Layer 1 as Matrix Multiplication

$\mathbf{x} \cdot \mathbf{W}_1 = [1, 0, 1, 1] \cdot \begin{bmatrix} 0.2 & -0.5 & 0.1 \\ -0.3 & 0.4 & -0.2 \\ 0.1 & 0.3 & 0.5 \\ -0.4 & 0.2 & -0.1 \end{bmatrix}$

The vector $\mathbf{x}$ dots with each column:

Column 0: $(1)(0.2) + (0)(-0.3) + (1)(0.1) + (1)(-0.4) = -0.1$
Column 1: $(1)(-0.5) + (0)(0.4) + (1)(0.3) + (1)(0.2) = 0.0$
Column 2: $(1)(0.1) + (0)(-0.2) + (1)(0.5) + (1)(-0.1) = 0.5$

$\mathbf{x} \cdot \mathbf{W}_1 + \mathbf{b}_1 = [-0.1, 0.0, 0.5] + [0.1, -0.1, 0.0] = [0.0, -0.1, 0.5]$ — same as Section 1.

Quick Check

When multiplying x with shape (4,) by W₁ with shape (4,3), what is the result shape?

Layer 2 as Matrix Multiplication

After ReLU: $\mathbf{h} = [0.0, 0.0, 0.5]$ . The output layer is $\hat{\mathbf{y}} = \mathbf{h} \cdot \mathbf{W}_2 + \mathbf{b}_2$ .

Since $h_0 = h_1 = 0$ , only the third row of $\mathbf{W}_2$ contributes: $[0.5 \times 0.2,\; 0.5 \times (-0.4),\; 0.5 \times 0.1,\; 0.5 \times (-0.5)] = [0.1, -0.2, 0.05, -0.25]$

Adding bias: $[0.1, -0.2, 0.05, -0.25] + [0.0, 0.1, -0.1, 0.0] = [0.1, -0.1, -0.05, -0.25]$ — matches Section 1 exactly.

The Complete Forward Pass Equation

Two lines of math, or one combined:

$\mathbf{h} = \text{ReLU}(\mathbf{x} \cdot \mathbf{W}_1 + \mathbf{b}_1)$
$\hat{\mathbf{y}} = \mathbf{h} \cdot \mathbf{W}_2 + \mathbf{b}_2$

Dimension annotations at each step:

Expression	Dimensions	Description
$\mathbf{x}$	$(4,)$	Input vector
$\mathbf{W}_1$	$(4, 3)$	Weight matrix
$\mathbf{x} \cdot \mathbf{W}_1$	$(4,) \times (4,3) = (3,)$	Pre-activation
$+ \mathbf{b}_1$	$(3,) + (3,) = (3,)$	Add bias
$\text{ReLU}(\cdot)$	$(3,)$	Element-wise, same shape
$\mathbf{W}_2$	$(3, 4)$	Weight matrix
$\mathbf{h} \cdot \mathbf{W}_2$	$(3,) \times (3,4) = (4,)$	Output
$+ \mathbf{b}_2$	$(4,) + (4,) = (4,)$	Final output

Dimension rule: When multiplying (a,) × (a, b), the shared dimension

a

cancels out, giving (b,). If dimensions don't match, you have a bug. Checking dimensions is the #1 debugging tool.

Why Non-Linearity Is Essential

Critical question: what if we remove ReLU and just stack two linear layers?

$\hat{\mathbf{y}} = (\mathbf{x} \cdot \mathbf{W}_1 + \mathbf{b}_1) \cdot \mathbf{W}_2 + \mathbf{b}_2 = \mathbf{x} \cdot \mathbf{W}_1 \mathbf{W}_2 + \mathbf{b}_1 \mathbf{W}_2 + \mathbf{b}_2$

Define $\mathbf{W}_{\text{eff}} = \mathbf{W}_1 \mathbf{W}_2$ and $\mathbf{b}_{\text{eff}} = \mathbf{b}_1 \mathbf{W}_2 + \mathbf{b}_2$ . Then: $\hat{\mathbf{y}} = \mathbf{x} \cdot \mathbf{W}_{\text{eff}} + \mathbf{b}_{\text{eff}}$

This is just one linear layer! Two linear layers without activation collapse into a single linear transformation. You could stack 100 layers and it'd still be equivalent to one.

Proof with our numbers

Let's verify with the exact weights from Section 1. Compute the effective single-layer matrix $\mathbf{W}_{\text{eff}} = \mathbf{W}_1 \mathbf{W}_2$ (shape 4×4):

$\mathbf{W}_{\text{eff}} = \begin{bmatrix} 0.2 & -0.5 & 0.1 \\ -0.3 & 0.4 & -0.2 \\ 0.1 & 0.3 & 0.5 \\ -0.4 & 0.2 & -0.1 \end{bmatrix} \begin{bmatrix} 0.3 & -0.2 & 0.4 & 0.1 \\ -0.1 & 0.5 & -0.3 & 0.2 \\ 0.2 & -0.4 & 0.1 & -0.5 \end{bmatrix} = \begin{bmatrix} 0.09 & -0.33 & 0.24 & -0.05 \\ -0.17 & 0.34 & -0.26 & 0.15 \\ 0.10 & -0.17 & 0.00 & -0.18 \\ -0.16 & 0.22 & -0.23 & 0.02 \end{bmatrix}$

And the effective bias: $\mathbf{b}_{\text{eff}} = \mathbf{b}_1 \cdot \mathbf{W}_2 + \mathbf{b}_2 = [0.04,\; 0.03,\; -0.03,\; -0.01]$

Now compute the output using just this single layer: $\mathbf{x} \cdot \mathbf{W}_{\text{eff}} + \mathbf{b}_{\text{eff}} = [1,0,1,1] \cdot \mathbf{W}_{\text{eff}} + \mathbf{b}_{\text{eff}} = [0.07,\; -0.25,\; 0.01,\; -0.22]$ — identical to running the two-layer network without ReLU. Two layers bought us nothing.

Quick Check

If you stack 5 linear layers with no activation function between them, the result is equivalent to how many layers?

Without non-linearity, depth is an illusion. ReLU breaks the linearity, allowing each layer to genuinely add computational power. A 2-layer network with ReLU can represent functions that no single linear layer can.

Dimensions Cheat Sheet

General pattern for any layer:

What	Shape	Rule
Input to layer	$(n_{\text{in}},)$	Previous layer’s output size
Weight matrix $\mathbf{W}$	$(n_{\text{in}}, n_{\text{out}})$	Rows = inputs, Cols = outputs
Bias vector $\mathbf{b}$	$(n_{\text{out}},)$	One per output neuron
$\mathbf{x} \cdot \mathbf{W} + \mathbf{b}$	$(n_{\text{out}},)$	Output of the layer

Memory trick: W has shape (from, to). Input features come first, output neurons come second. Multiply: (from,) × (from, to) = (to,).

Python Implementation

Here's the complete forward pass in Python with NumPy. Click any line to see the exact values flowing through. The core is just 3 lines: $\texttt{z1 = x @ W1 + b1}$ , $\texttt{h = np.maximum(0, z1)}$ , $\texttt{y\_hat = h @ W2 + b2}$ .

Forward Pass — NumPy Implementation

🐍forward_pass.py

Explanation(18)

Code(38)

1import numpy as np

NumPy provides fast N-dimensional arrays and matrix operations. All math here — matrix multiply via @, element-wise max, mean — runs as optimized C code, not slow Python loops.

EXECUTION STATE

numpy = Library for numerical computing — ndarray type, linear algebra, element-wise math

as np = Alias so we write np.array() instead of numpy.array()

4W1 = np.array([...]) — First layer weights (4×3)

Weight matrix connecting 4 inputs to 3 hidden neurons. Each row is one input feature, each column is one hidden neuron. W1[i][j] = weight from input i to hidden neuron j.

EXECUTION STATE

W1 shape = (4, 3) — 4 input features × 3 hidden neurons = 12 weights

W1 =

       h0    h1    h2
x0   0.2  -0.5   0.1
x1  -0.3   0.4  -0.2
x2   0.1   0.3   0.5
x3  -0.4   0.2  -0.1

Reading W1 = Column 0 = [0.2, -0.3, 0.1, -0.4] — all weights feeding into hidden neuron 0

10b1 = np.array([0.1, -0.1, 0.0]) — First layer biases

One bias per hidden neuron. Added after the weighted sum. Lets each neuron shift its activation threshold.

EXECUTION STATE

b1 = [0.1, -0.1, 0.0] — one per hidden neuron (h0, h1, h2)

b1[0] = 0.1 = Hidden neuron 0 starts with a small positive push

12W2 = np.array([...]) — Second layer weights (3×4)

Weight matrix connecting 3 hidden neurons to 4 output neurons. W2[i][j] = weight from hidden neuron i to output j.

EXECUTION STATE

W2 shape = (3, 4) — 3 hidden neurons × 4 outputs = 12 weights

W2 =

       y0    y1    y2    y3
h0   0.3  -0.2   0.4   0.1
h1  -0.1   0.5  -0.3   0.2
h2   0.2  -0.4   0.1  -0.5

17b2 = np.array([0.0, 0.1, -0.1, 0.0]) — Output biases

One bias per output neuron. b2[1] = 0.1 gives output neuron 1 a slight positive offset.

EXECUTION STATE

b2 = [0.0, 0.1, -0.1, 0.0] — one per output (ŷ0, ŷ1, ŷ2, ŷ3)

20image = np.array([[1,0],[1,1]]) — Input image

Our 2×2 binary image. Top row: pixel(0,0)=1 (white), pixel(0,1)=0 (black). Bottom row: pixel(1,0)=1, pixel(1,1)=1.

EXECUTION STATE

image (2×2) =

1  0
1  1

22x = image.flatten() → [1, 0, 1, 1]

Flatten 2D image to 1D vector, reading left-to-right, top-to-bottom. This is what the network actually receives as input.

EXECUTION STATE

📚 .flatten() = NumPy method: collapses all dimensions into a 1D array. For a 2×2 array, reads row 0 then row 1: [[1,0],[1,1]] → [1,0,1,1]

.astype(float) = Convert from integer to float64 — needed for matrix multiplication with float weights

⬆ result: x = [1.0, 0.0, 1.0, 1.0]

25z1 = x @ W1 + b1 — Hidden layer pre-activation

THE KEY LINE. One matrix multiplication replaces the 3 separate neuron calculations from Section 1. x (1×4) @ W1 (4×3) produces a (1×3) vector — one value per hidden neuron — then we add the bias.

EXECUTION STATE

📚 @ operator = Python matrix multiplication. x @ W1 computes the dot product of x with each column of W1 simultaneously. Equivalent to: [x·col0, x·col1, x·col2]

⬇ x (4,) = [1.0, 0.0, 1.0, 1.0]

⬇ W1 (4×3) =

       h0    h1    h2
x0   0.2  -0.5   0.1
x1  -0.3   0.4  -0.2
x2   0.1   0.3   0.5
x3  -0.4   0.2  -0.1

── Neuron 0 ── =

x·col0 = (1)(0.2) + (0)(-0.3) + (1)(0.1) + (1)(-0.4) = -0.1

── Neuron 1 ── =

x·col1 = (1)(-0.5) + (0)(0.4) + (1)(0.3) + (1)(0.2) = 0.0

── Neuron 2 ── =

x·col2 = (1)(0.1) + (0)(-0.2) + (1)(0.5) + (1)(-0.1) = 0.5

x @ W1 = [-0.1, 0.0, 0.5]

+ b1 = [-0.1+0.1, 0.0+(-0.1), 0.5+0.0] = [0.0, -0.1, 0.5]

⬆ result: z1 = [0.0, -0.1, 0.5] ← matches Section 1!

26h = np.maximum(0, z1) — ReLU activation

Element-wise ReLU: keep positive values, clamp negatives to zero. This non-linearity is what makes the network more powerful than a single linear transformation.

EXECUTION STATE

📚 np.maximum(a, b) = Element-wise maximum of two arrays (or array and scalar). np.maximum(0, [-0.1, 0.5]) → [0.0, 0.5]. Different from np.max() which finds the single largest element.

⬇ z1 = [0.0, -0.1, 0.5]

z1[0] = 0.0 = max(0, 0.0) = 0.0 — on the boundary

z1[1] = -0.1 = max(0, -0.1) = 0.0 — DEAD: negative → clamped to zero

z1[2] = 0.5 = max(0, 0.5) = 0.5 — ALIVE: positive → passes through

⬆ result: h = [0.0, 0.0, 0.5] — only neuron 2 is active!

27y_hat = h @ W2 + b2 — Output layer

Second matrix multiplication: h (1×3) @ W2 (3×4) = output (1×4). Since h0=h1=0, only the third row of W2 contributes — the dead neurons pass NOTHING forward.

EXECUTION STATE

⬇ h = [0.0, 0.0, 0.5] — only h2 matters

⬇ W2 (3×4) =

       y0    y1    y2    y3
h0   0.3  -0.2   0.4   0.1  ← ×0.0 (dead)
h1  -0.1   0.5  -0.3   0.2  ← ×0.0 (dead)
h2   0.2  -0.4   0.1  -0.5  ← ×0.5 (alive!)

h @ W2 = [0.5×0.2, 0.5×(-0.4), 0.5×0.1, 0.5×(-0.5)] = [0.1, -0.2, 0.05, -0.25]

+ b2 = [0.1+0.0, -0.2+0.1, 0.05+(-0.1), -0.25+0.0]

⬆ result: y_hat = [0.1, -0.1, -0.05, -0.25] ← matches Section 1!

30target = image.T.flatten() — Diagonal flip

The target output is the diagonally flipped image. NumPy's .T transposes the 2×2 matrix (swaps rows and columns), then we flatten it.

EXECUTION STATE

📚 .T (transpose) = Swaps rows and columns. [[1,0],[1,1]].T = [[1,1],[0,1]]. For 2×2: pixel(r,c) → pixel(c,r)

image.T (2×2) =

1  1
0  1

⬆ result: target = [1.0, 1.0, 0.0, 1.0]

31mse = np.mean((y_hat - target) ** 2) — Loss

Mean Squared Error: average of squared differences between prediction and target. This single number tells the network how wrong it is.

EXECUTION STATE

y_hat - target = [0.1-1, -0.1-1, -0.05-0, -0.25-1] = [-0.9, -1.1, -0.05, -1.25]

(…) ** 2 = [0.81, 1.21, 0.0025, 1.5625]

📚 np.mean() = Average of all elements: (0.81+1.21+0.0025+1.5625)/4

⬆ result: mse = 0.8963 — very high (perfect = 0.0)

33print(f"Input: {x}")

Display the flattened input vector.

EXECUTION STATE

output = Input: [1. 0. 1. 1.]

34print(f"z1: {z1}")

Display hidden layer pre-activation values.

EXECUTION STATE

output = z1: [ 0. -0.1 0.5]

35print(f"h (ReLU): {h}")

Display hidden layer post-ReLU values. Two neurons are dead (zero).

EXECUTION STATE

output = h (ReLU): [0. 0. 0.5]

36print(f"Prediction: ...")

Display the network's output, rounded to 4 decimal places.

EXECUTION STATE

output = Prediction: [ 0.1 -0.1 -0.05 -0.25]

37print(f"Target: {target}")

Display the target (diagonally flipped image).

EXECUTION STATE

output = Target: [1. 1. 0. 1.]

38print(f"MSE Loss: {mse:.4f}")

Display the loss value. 0.8963 is terrible — random weights produce garbage.

EXECUTION STATE

output = MSE Loss: 0.8963

20 lines without explanation

1import numpy as np
2
3# ── Network weights (same values from Section 1) ──
4W1 = np.array([
5    [ 0.2, -0.5,  0.1],
6    [-0.3,  0.4, -0.2],
7    [ 0.1,  0.3,  0.5],
8    [-0.4,  0.2, -0.1]
9])
10b1 = np.array([0.1, -0.1, 0.0])
11
12W2 = np.array([
13    [ 0.3, -0.2,  0.4,  0.1],
14    [-0.1,  0.5, -0.3,  0.2],
15    [ 0.2, -0.4,  0.1, -0.5]
16])
17b2 = np.array([0.0, 0.1, -0.1, 0.0])
18
19# ── Input image ──
20image = np.array([[1, 0],
21                  [1, 1]])
22x = image.flatten().astype(float)
23
24# ── Forward pass (one matrix op per layer) ──
25z1 = x @ W1 + b1
26h = np.maximum(0, z1)
27y_hat = h @ W2 + b2
28
29# ── Target and loss ──
30target = image.T.flatten().astype(float)
31mse = np.mean((y_hat - target) ** 2)
32
33print(f"Input:      {x}")
34print(f"z1:         {z1}")
35print(f"h (ReLU):   {h}")
36print(f"Prediction: {np.round(y_hat, 4)}")
37print(f"Target:     {target}")
38print(f"MSE Loss:   {mse:.4f}")

Summary

Matrix multiplication replaces loops. One $\texttt{x @ W1 + b1}$ computes all 3 hidden neurons at once.
Two matrix multiplications + ReLU = complete forward pass. Three lines of Python.
Without activation, depth is useless. Multiple linear layers collapse to one.
Always check dimensions. (a,) × (a, b) = (b,). Inner must match.

In the next section, we'll implement this in PyTorch using $\texttt{nn.Module}$ — the production way to build neural networks — and create the full training dataset.