Rewrite per-neuron calculations as matrix multiplications and verify they produce identical results
Read dimension annotations and know the shape of every tensor at each stage
Prove why non-linearity is essential — without it, deep networks collapse to shallow ones
Implement the forward pass in Python using NumPy in just 3 lines
Why this matters: In Section 1, we computed each neuron one at a time — 12 separate multiplications for each layer. Matrix multiplication does an entire layer in one operation. Every framework (PyTorch, TensorFlow) works this way. The code mirrors the math line for line.
From Neurons to One Matrix Operation
In Section 1, we computed each hidden neuron separately. Look at the pattern:
Dimension rule: When multiplying (a,) × (a, b), the shared dimension a cancels out, giving (b,). If dimensions don't match, you have a bug. Checking dimensions is the #1 debugging tool.
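The pattern can be checked directly. The sketch below computes each hidden neuron with an explicit loop, then again with a single matrix product, and confirms the two agree. It assumes the Section 1 weights, bias, and input as listed in the Python Implementation part of this section; the names z1_loop and z1_matrix are illustrative, not from the original.

```python
import numpy as np

# Weights, bias, and input assumed to match Section 1
W1 = np.array([
    [0.2, -0.5, 0.1],
    [-0.3, 0.4, -0.2],
    [0.1, 0.3, 0.5],
    [-0.4, 0.2, -0.1],
])
b1 = np.array([0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])

# Per-neuron version: one weighted sum per hidden neuron
z1_loop = np.zeros(3)
for j in range(3):          # hidden neuron index
    total = b1[j]
    for i in range(4):      # input index
        total += x[i] * W1[i, j]
    z1_loop[j] = total

# Matrix version: all three neurons in one operation
z1_matrix = x @ W1 + b1

print(z1_loop.shape, z1_matrix.shape)    # (3,) (3,)
print(np.allclose(z1_loop, z1_matrix))   # True
```

The comparison uses np.allclose rather than == because the two versions may add the terms in a different order, which can change the last floating-point digit.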
Why Non-Linearity Is Essential
Critical question: what if we remove ReLU and just stack two linear layers?
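$$\hat{y} = (x \cdot W_1 + b_1) \cdot W_2 + b_2 = x \cdot W_1 W_2 + b_1 W_2 + b_2$$
Define $W_{\text{eff}} = W_1 W_2$ and $b_{\text{eff}} = b_1 W_2 + b_2$. Then: $\hat{y} = x \cdot W_{\text{eff}} + b_{\text{eff}}$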
This is just one linear layer! Two linear layers without activation collapse into a single linear transformation. You could stack 100 layers and it'd still be equivalent to one.
Proof with our numbers
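Let's verify with the exact weights from Section 1. Compute the effective single-layer matrix $W_{\text{eff}} = W_1 W_2$ (shape 4×4):
$$W_{\text{eff}} = \begin{bmatrix} 0.13 & -0.33 & 0.24 & -0.13 \\ -0.17 & 0.34 & -0.26 & 0.15 \\ 0.10 & -0.07 & 0.00 & -0.18 \\ -0.16 & 0.22 & -0.23 & 0.05 \end{bmatrix}$$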
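And the effective bias: $b_{\text{eff}} = b_1 \cdot W_2 + b_2 = [0.04, 0.03, -0.03, -0.01]$
Now compute the output using just this single layer: $x \cdot W_{\text{eff}} + b_{\text{eff}} = [1, 0, 1, 1] \cdot W_{\text{eff}} + b_{\text{eff}} = [0.11, -0.15, -0.02, -0.27]$. This is identical to running the two-layer network without ReLU: $(x \cdot W_1 + b_1) \cdot W_2 + b_2 = [0.0, -0.1, 0.5] \cdot W_2 + b_2 = [0.11, -0.15, -0.02, -0.27]$. Two layers bought us nothing.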
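The same check can be run numerically. This is a minimal sketch, assuming the Section 1 weights and input as listed in the Python Implementation part of this section; W_eff, b_eff, y_two_layer, and y_one_layer are illustrative names.

```python
import numpy as np

# Weights, biases, and input assumed to match Section 1
W1 = np.array([
    [0.2, -0.5, 0.1],
    [-0.3, 0.4, -0.2],
    [0.1, 0.3, 0.5],
    [-0.4, 0.2, -0.1],
])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([
    [0.3, -0.2, 0.4, 0.1],
    [-0.1, 0.5, -0.3, 0.2],
    [0.2, -0.4, 0.1, -0.5],
])
b2 = np.array([0.0, 0.1, -0.1, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])

# Two linear layers with no ReLU in between
y_two_layer = (x @ W1 + b1) @ W2 + b2

# The equivalent single layer
W_eff = W1 @ W2           # shape (4, 4)
b_eff = b1 @ W2 + b2      # shape (4,)
y_one_layer = x @ W_eff + b_eff

print(W_eff.shape, b_eff.shape)                # (4, 4) (4,)
print(np.allclose(y_two_layer, y_one_layer))   # True
```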
Quick Check
If you stack 5 linear layers with no activation function between them, the result is equivalent to how many layers?
Without non-linearity, depth is an illusion. ReLU breaks the linearity, allowing each layer to genuinely add computational power. A 2-layer network with ReLU can represent functions that no single linear layer can.
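Both halves of this claim can be tried in code. The first part of the sketch below folds five randomly generated linear layers into one effective layer; the random weights, the seed, and the width of 4 are arbitrary assumptions made for illustration. The second part uses the Section 1 weights (assumed to be the ones listed in the Python Implementation part of this section) with ReLU, and tests a property every single linear layer must have: f(v) + f(-v) = 2·f(0).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the value is arbitrary

# Five linear layers of width 4 with random weights and biases (illustrative only)
layers = [(rng.normal(size=(4, 4)), rng.normal(size=4)) for _ in range(5)]
x = rng.normal(size=4)

# Run the five layers one after another, with no activation in between
out_stacked = x
for W, b in layers:
    out_stacked = out_stacked @ W + b

# Fold the same five layers into one effective weight matrix and bias
W_eff = np.eye(4)
b_eff = np.zeros(4)
for W, b in layers:
    W_eff = W_eff @ W
    b_eff = b_eff @ W + b
out_single = x @ W_eff + b_eff

print(np.allclose(out_stacked, out_single))  # True: five layers act as one

# Section 1 weights (assumed), now with ReLU between the two layers
W1 = np.array([[0.2, -0.5, 0.1], [-0.3, 0.4, -0.2],
               [0.1, 0.3, 0.5], [-0.4, 0.2, -0.1]])
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[0.3, -0.2, 0.4, 0.1], [-0.1, 0.5, -0.3, 0.2],
               [0.2, -0.4, 0.1, -0.5]])
b2 = np.array([0.0, 0.1, -0.1, 0.0])

def relu_net(v):
    """Two-layer forward pass with ReLU on the hidden layer."""
    return np.maximum(0, v @ W1 + b1) @ W2 + b2

# Any single linear layer f(v) = v @ W + b satisfies f(v) + f(-v) == 2 * f(0)
v = np.array([1.0, 0.0, 1.0, 1.0])
lhs = relu_net(v) + relu_net(-v)
rhs = 2 * relu_net(np.zeros(4))
print(np.allclose(lhs, rhs))  # False: no single linear layer matches this network
```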
Dimensions Cheat Sheet
General pattern for any layer:
| What | Shape | Rule |
|---|---|---|
| Input to layer | (n_in,) | Previous layer’s output size |
| Weight matrix W | (n_in, n_out) | Rows = inputs, Cols = outputs |
| Bias vector b | (n_out,) | One per output neuron |
| x⋅W+b | (n_out,) | Output of the layer |
Memory trick: W has shape (from, to). Input features come first, output neurons come second. Multiply: (from,) × (from, to) = (to,).
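One way to put the cheat sheet to work is to check shapes before multiplying. The helper below, layer_forward, is a hypothetical function written for this illustration (it is not part of NumPy or of the original lesson); it is a sketch of the (from,) × (from, to) = (to,) rule, using zero-filled placeholder weights since only the shapes matter here.

```python
import numpy as np

def layer_forward(x, W, b):
    """Compute x @ W + b after checking shapes (hypothetical helper)."""
    n_in, n_out = W.shape
    if x.shape != (n_in,):
        raise ValueError(f"input shape {x.shape} does not match W rows ({n_in},)")
    if b.shape != (n_out,):
        raise ValueError(f"bias shape {b.shape} does not match W cols ({n_out},)")
    return x @ W + b

# Placeholder arrays: only the shapes matter for this check
x = np.ones(4)
W1, b1 = np.zeros((4, 3)), np.zeros(3)
W2, b2 = np.zeros((3, 4)), np.zeros(4)

h = layer_forward(x, W1, b1)
print(h.shape)   # (3,)
y = layer_forward(h, W2, b2)
print(y.shape)   # (4,)

# Feeding the 4-element input straight into the second layer is a shape bug
try:
    layer_forward(x, W2, b2)
except ValueError as err:
    print("caught:", err)
```

NumPy's @ operator also raises a ValueError on mismatched inner dimensions; the explicit check simply produces a message phrased in terms of inputs and outputs.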
Python Implementation
Here's the complete forward pass in Python with NumPy. The core is just 3 lines: z1 = x @ W1 + b1, h = np.maximum(0, z1), y_hat = h @ W2 + b2.
Forward Pass — NumPy Implementation
🐍forward_pass.py
import numpy as np
NumPy provides fast N-dimensional arrays and matrix operations. All math here (matrix multiply via @, element-wise max, mean) runs as optimized C code, not slow Python loops.
numpy: library for numerical computing (ndarray type, linear algebra, element-wise math)
as np: alias so we write np.array() instead of numpy.array()
W1 = np.array([...]): first layer weights (4×3)
Weight matrix connecting 4 inputs to 3 hidden neurons. Each row is one input feature, each column is one hidden neuron. W1[i][j] = weight from input i to hidden neuron j.
x = image.flatten().astype(float)
Flatten the 2D image to a 1D vector, reading left-to-right, top-to-bottom. This is what the network actually receives as input.
.flatten(): NumPy method that collapses all dimensions into a 1D array. For a 2×2 array, it reads row 0 then row 1: [[1,0],[1,1]] → [1,0,1,1]
.astype(float): converts from integer to float64, matching the float weights
result: x = [1.0, 0.0, 1.0, 1.0]
z1 = x @ W1 + b1: hidden layer pre-activation
THE KEY LINE. One matrix multiplication replaces the 3 separate neuron calculations from Section 1. x (4,) @ W1 (4, 3) produces a (3,) vector, one value per hidden neuron, and then we add the bias.
@ operator: Python matrix multiplication. x @ W1 computes the dot product of x with each column of W1 simultaneously. Equivalent to: [x·col0, x·col1, x·col2]
h = np.maximum(0, z1)
Element-wise ReLU: keep positive values, clamp negatives to zero. This non-linearity is what makes the network more powerful than a single linear transformation.
np.maximum(a, b): element-wise maximum of two arrays (or an array and a scalar). np.maximum(0, [-0.1, 0.5]) → [0.0, 0.5]. Different from np.max(), which finds the single largest element.
input: z1 = [0.0, -0.1, 0.5]
z1[0] = 0.0 → max(0, 0.0) = 0.0 (on the boundary)
z1[1] = -0.1 → max(0, -0.1) = 0.0 (negative, clamped to zero)
result: h = [0.0, 0.0, 0.5], so only neuron 2 is active
y_hat = h @ W2 + b2: output layer
Second matrix multiplication: h (3,) @ W2 (3, 4) = output (4,). Since h0=h1=0, only the third row of W2 contributes: the inactive neurons pass nothing forward.
mse = np.mean((y_hat - target)**2)
np.mean(): average of all elements: (0.81+1.21+0.0025+1.5625)/4
result: mse = 0.8963, very high (perfect = 0.0)
print(f"Input: {x}")
Display the flattened input vector.
output: Input: [1. 0. 1. 1.]
print(f"z1: {z1}")
Display hidden layer pre-activation values.
output: z1: [ 0. -0.1 0.5]
print(f"h (ReLU): {h}")
Display hidden layer post-ReLU values. Two neurons are inactive (zero) for this input.
output: h (ReLU): [0. 0. 0.5]
print(f"Prediction: {np.round(y_hat, 4)}")
Display the network's output, rounded to 4 decimal places.
output: Prediction: [ 0.1 -0.1 -0.05 -0.25]
print(f"Target: {target}")
Display the target (the image flipped across its diagonal, i.e. transposed).
output: Target: [1. 1. 0. 1.]
print(f"MSE Loss: {mse:.4f}")
Display the loss value. 0.8963 is terrible: random weights produce garbage.
output: MSE Loss: 0.8963
```python
import numpy as np

# ── Network weights (same values from Section 1) ──
W1 = np.array([
    [0.2, -0.5, 0.1],
    [-0.3, 0.4, -0.2],
    [0.1, 0.3, 0.5],
    [-0.4, 0.2, -0.1]
])
b1 = np.array([0.1, -0.1, 0.0])

W2 = np.array([
    [0.3, -0.2, 0.4, 0.1],
    [-0.1, 0.5, -0.3, 0.2],
    [0.2, -0.4, 0.1, -0.5]
])
b2 = np.array([0.0, 0.1, -0.1, 0.0])

# ── Input image ──
image = np.array([[1, 0],
                  [1, 1]])
x = image.flatten().astype(float)

# ── Forward pass (one matrix op per layer) ──
z1 = x @ W1 + b1
h = np.maximum(0, z1)
y_hat = h @ W2 + b2

# ── Target and loss ──
target = image.T.flatten().astype(float)
mse = np.mean((y_hat - target)**2)

print(f"Input: {x}")
print(f"z1: {z1}")
print(f"h (ReLU): {h}")
print(f"Prediction: {np.round(y_hat, 4)}")
print(f"Target: {target}")
print(f"MSE Loss: {mse:.4f}")
```
Summary
Matrix multiplication replaces loops. One x @ W1 + b1 computes all 3 hidden neurons at once.
Two matrix multiplications + ReLU = complete forward pass. Three lines of Python.
Without activation, depth is useless. Multiple linear layers collapse to one.
Always check dimensions. (a,) × (a, b) = (b,). Inner must match.
In the next section, we'll implement this in PyTorch using nn.Module — the production way to build neural networks — and create the full training dataset.