Chapter 7
22 min read
Section 22 of 65

Matrix Multiplication in Networks

Forward Propagation

Learning Objectives

By the end of this section, you will be able to:

  1. Rewrite per-neuron calculations as matrix multiplications and verify they produce identical results
  2. Read dimension annotations and know the shape of every tensor at each stage
  3. Prove why non-linearity is essential — without it, deep networks collapse to shallow ones
  4. Implement the forward pass in Python using NumPy in just 3 lines
Why this matters: In Section 1, we computed each neuron one at a time — 12 separate multiplications for each layer. Matrix multiplication does an entire layer in one operation. Every framework (PyTorch, TensorFlow) works this way. The code mirrors the math line for line.

From Neurons to One Matrix Operation

In Section 1, we computed each hidden neuron separately. Look at the pattern:

z0=x0W1[0][0]+x1W1[1][0]+x2W1[2][0]+x3W1[3][0]+b1[0]z_0 = x_0 W_1[0][0] + x_1 W_1[1][0] + x_2 W_1[2][0] + x_3 W_1[3][0] + b_1[0]
z1=x0W1[0][1]+x1W1[1][1]+x2W1[2][1]+x3W1[3][1]+b1[1]z_1 = x_0 W_1[0][1] + x_1 W_1[1][1] + x_2 W_1[2][1] + x_3 W_1[3][1] + b_1[1]
z2=x0W1[0][2]+x1W1[1][2]+x2W1[2][2]+x3W1[3][2]+b1[2]z_2 = x_0 W_1[0][2] + x_1 W_1[1][2] + x_2 W_1[2][2] + x_3 W_1[3][2] + b_1[2]

Each equation is a dot product between x\mathbf{x} and a column of W1\mathbf{W}_1. All three at once is a matrix-vector multiplication: z(1)=xW1+b1\mathbf{z}^{(1)} = \mathbf{x} \cdot \mathbf{W}_1 + \mathbf{b}_1

In NumPy this is just: z1 = x @ W1 + b1\texttt{z1 = x @ W1 + b1}. One line replaces three calculations.


Loading matrix visualization...

Layer 1 as Matrix Multiplication

xW1=[1,0,1,1][0.20.50.10.30.40.20.10.30.50.40.20.1]\mathbf{x} \cdot \mathbf{W}_1 = [1, 0, 1, 1] \cdot \begin{bmatrix} 0.2 & -0.5 & 0.1 \\ -0.3 & 0.4 & -0.2 \\ 0.1 & 0.3 & 0.5 \\ -0.4 & 0.2 & -0.1 \end{bmatrix}

The vector x\mathbf{x} dots with each column:

Column 0: (1)(0.2)+(0)(0.3)+(1)(0.1)+(1)(0.4)=0.1(1)(0.2) + (0)(-0.3) + (1)(0.1) + (1)(-0.4) = -0.1
Column 1: (1)(0.5)+(0)(0.4)+(1)(0.3)+(1)(0.2)=0.0(1)(-0.5) + (0)(0.4) + (1)(0.3) + (1)(0.2) = 0.0
Column 2: (1)(0.1)+(0)(0.2)+(1)(0.5)+(1)(0.1)=0.5(1)(0.1) + (0)(-0.2) + (1)(0.5) + (1)(-0.1) = 0.5

xW1+b1=[0.1,0.0,0.5]+[0.1,0.1,0.0]=[0.0,0.1,0.5]\mathbf{x} \cdot \mathbf{W}_1 + \mathbf{b}_1 = [-0.1, 0.0, 0.5] + [0.1, -0.1, 0.0] = [0.0, -0.1, 0.5]same as Section 1.

Quick Check

When multiplying x with shape (4,) by W₁ with shape (4,3), what is the result shape?


Layer 2 as Matrix Multiplication

After ReLU: h=[0.0,0.0,0.5]\mathbf{h} = [0.0, 0.0, 0.5]. The output layer is y^=hW2+b2\hat{\mathbf{y}} = \mathbf{h} \cdot \mathbf{W}_2 + \mathbf{b}_2.

Since h0=h1=0h_0 = h_1 = 0, only the third row of W2\mathbf{W}_2 contributes: [0.5×0.2,  0.5×(0.4),  0.5×0.1,  0.5×(0.5)]=[0.1,0.2,0.05,0.25][0.5 \times 0.2,\; 0.5 \times (-0.4),\; 0.5 \times 0.1,\; 0.5 \times (-0.5)] = [0.1, -0.2, 0.05, -0.25]

Adding bias: [0.1,0.2,0.05,0.25]+[0.0,0.1,0.1,0.0]=[0.1,0.1,0.05,0.25][0.1, -0.2, 0.05, -0.25] + [0.0, 0.1, -0.1, 0.0] = [0.1, -0.1, -0.05, -0.25]matches Section 1 exactly.


The Complete Forward Pass Equation

Two lines of math, or one combined:

h=ReLU(xW1+b1)\mathbf{h} = \text{ReLU}(\mathbf{x} \cdot \mathbf{W}_1 + \mathbf{b}_1)
y^=hW2+b2\hat{\mathbf{y}} = \mathbf{h} \cdot \mathbf{W}_2 + \mathbf{b}_2

Dimension annotations at each step:

ExpressionDimensionsDescription
x\mathbf{x}(4,)(4,)Input vector
W1\mathbf{W}_1(4,3)(4, 3)Weight matrix
xW1\mathbf{x} \cdot \mathbf{W}_1(4,)×(4,3)=(3,)(4,) \times (4,3) = (3,)Pre-activation
+b1+ \mathbf{b}_1(3,)+(3,)=(3,)(3,) + (3,) = (3,)Add bias
ReLU()\text{ReLU}(\cdot)(3,)(3,)Element-wise, same shape
W2\mathbf{W}_2(3,4)(3, 4)Weight matrix
hW2\mathbf{h} \cdot \mathbf{W}_2(3,)×(3,4)=(4,)(3,) \times (3,4) = (4,)Output
+b2+ \mathbf{b}_2(4,)+(4,)=(4,)(4,) + (4,) = (4,)Final output
Dimension rule: When multiplying (a,) × (a, b), the shared dimension aa cancels out, giving (b,). If dimensions don't match, you have a bug. Checking dimensions is the #1 debugging tool.

Why Non-Linearity Is Essential

Critical question: what if we remove ReLU and just stack two linear layers?

y^=(xW1+b1)W2+b2=xW1W2+b1W2+b2\hat{\mathbf{y}} = (\mathbf{x} \cdot \mathbf{W}_1 + \mathbf{b}_1) \cdot \mathbf{W}_2 + \mathbf{b}_2 = \mathbf{x} \cdot \mathbf{W}_1 \mathbf{W}_2 + \mathbf{b}_1 \mathbf{W}_2 + \mathbf{b}_2

Define Weff=W1W2\mathbf{W}_{\text{eff}} = \mathbf{W}_1 \mathbf{W}_2 and beff=b1W2+b2\mathbf{b}_{\text{eff}} = \mathbf{b}_1 \mathbf{W}_2 + \mathbf{b}_2. Then: y^=xWeff+beff\hat{\mathbf{y}} = \mathbf{x} \cdot \mathbf{W}_{\text{eff}} + \mathbf{b}_{\text{eff}}

This is just one linear layer! Two linear layers without activation collapse into a single linear transformation. You could stack 100 layers and it'd still be equivalent to one.

Proof with our numbers

Let's verify with the exact weights from Section 1. Compute the effective single-layer matrix Weff=W1W2\mathbf{W}_{\text{eff}} = \mathbf{W}_1 \mathbf{W}_2 (shape 4×4):

Weff=[0.20.50.10.30.40.20.10.30.50.40.20.1][0.30.20.40.10.10.50.30.20.20.40.10.5]=[0.090.330.240.050.170.340.260.150.100.170.000.180.160.220.230.02]\mathbf{W}_{\text{eff}} = \begin{bmatrix} 0.2 & -0.5 & 0.1 \\ -0.3 & 0.4 & -0.2 \\ 0.1 & 0.3 & 0.5 \\ -0.4 & 0.2 & -0.1 \end{bmatrix} \begin{bmatrix} 0.3 & -0.2 & 0.4 & 0.1 \\ -0.1 & 0.5 & -0.3 & 0.2 \\ 0.2 & -0.4 & 0.1 & -0.5 \end{bmatrix} = \begin{bmatrix} 0.09 & -0.33 & 0.24 & -0.05 \\ -0.17 & 0.34 & -0.26 & 0.15 \\ 0.10 & -0.17 & 0.00 & -0.18 \\ -0.16 & 0.22 & -0.23 & 0.02 \end{bmatrix}

And the effective bias: beff=b1W2+b2=[0.04,  0.03,  0.03,  0.01]\mathbf{b}_{\text{eff}} = \mathbf{b}_1 \cdot \mathbf{W}_2 + \mathbf{b}_2 = [0.04,\; 0.03,\; -0.03,\; -0.01]

Now compute the output using just this single layer: xWeff+beff=[1,0,1,1]Weff+beff=[0.07,  0.25,  0.01,  0.22]\mathbf{x} \cdot \mathbf{W}_{\text{eff}} + \mathbf{b}_{\text{eff}} = [1,0,1,1] \cdot \mathbf{W}_{\text{eff}} + \mathbf{b}_{\text{eff}} = [0.07,\; -0.25,\; 0.01,\; -0.22]identical to running the two-layer network without ReLU. Two layers bought us nothing.

Quick Check

If you stack 5 linear layers with no activation function between them, the result is equivalent to how many layers?

Without non-linearity, depth is an illusion. ReLU breaks the linearity, allowing each layer to genuinely add computational power. A 2-layer network with ReLU can represent functions that no single linear layer can.

Dimensions Cheat Sheet

General pattern for any layer:

WhatShapeRule
Input to layer(nin,)(n_{\text{in}},)Previous layer’s output size
Weight matrix W\mathbf{W}(nin,nout)(n_{\text{in}}, n_{\text{out}})Rows = inputs, Cols = outputs
Bias vector b\mathbf{b}(nout,)(n_{\text{out}},)One per output neuron
xW+b\mathbf{x} \cdot \mathbf{W} + \mathbf{b}(nout,)(n_{\text{out}},)Output of the layer
Memory trick: W has shape (from, to). Input features come first, output neurons come second. Multiply: (from,) × (from, to) = (to,).

Python Implementation

Here's the complete forward pass in Python with NumPy. Click any line to see the exact values flowing through. The core is just 3 lines: z1 = x @ W1 + b1\texttt{z1 = x @ W1 + b1}, h = np.maximum(0, z1)\texttt{h = np.maximum(0, z1)}, y_hat = h @ W2 + b2\texttt{y\_hat = h @ W2 + b2}.

Forward Pass — NumPy Implementation
🐍forward_pass.py
1import numpy as np

NumPy provides fast N-dimensional arrays and matrix operations. All math here — matrix multiply via @, element-wise max, mean — runs as optimized C code, not slow Python loops.

EXECUTION STATE
numpy = Library for numerical computing — ndarray type, linear algebra, element-wise math
as np = Alias so we write np.array() instead of numpy.array()
4W1 = np.array([...]) — First layer weights (4×3)

Weight matrix connecting 4 inputs to 3 hidden neurons. Each row is one input feature, each column is one hidden neuron. W1[i][j] = weight from input i to hidden neuron j.

EXECUTION STATE
W1 shape = (4, 3) — 4 input features × 3 hidden neurons = 12 weights
W1 =
       h0    h1    h2
x0   0.2  -0.5   0.1
x1  -0.3   0.4  -0.2
x2   0.1   0.3   0.5
x3  -0.4   0.2  -0.1
Reading W1 = Column 0 = [0.2, -0.3, 0.1, -0.4] — all weights feeding into hidden neuron 0
10b1 = np.array([0.1, -0.1, 0.0]) — First layer biases

One bias per hidden neuron. Added after the weighted sum. Lets each neuron shift its activation threshold.

EXECUTION STATE
b1 = [0.1, -0.1, 0.0] — one per hidden neuron (h0, h1, h2)
b1[0] = 0.1 = Hidden neuron 0 starts with a small positive push
12W2 = np.array([...]) — Second layer weights (3×4)

Weight matrix connecting 3 hidden neurons to 4 output neurons. W2[i][j] = weight from hidden neuron i to output j.

EXECUTION STATE
W2 shape = (3, 4) — 3 hidden neurons × 4 outputs = 12 weights
W2 =
       y0    y1    y2    y3
h0   0.3  -0.2   0.4   0.1
h1  -0.1   0.5  -0.3   0.2
h2   0.2  -0.4   0.1  -0.5
17b2 = np.array([0.0, 0.1, -0.1, 0.0]) — Output biases

One bias per output neuron. b2[1] = 0.1 gives output neuron 1 a slight positive offset.

EXECUTION STATE
b2 = [0.0, 0.1, -0.1, 0.0] — one per output (ŷ0, ŷ1, ŷ2, ŷ3)
20image = np.array([[1,0],[1,1]]) — Input image

Our 2×2 binary image. Top row: pixel(0,0)=1 (white), pixel(0,1)=0 (black). Bottom row: pixel(1,0)=1, pixel(1,1)=1.

EXECUTION STATE
image (2×2) =
1  0
1  1
22x = image.flatten() → [1, 0, 1, 1]

Flatten 2D image to 1D vector, reading left-to-right, top-to-bottom. This is what the network actually receives as input.

EXECUTION STATE
📚 .flatten() = NumPy method: collapses all dimensions into a 1D array. For a 2×2 array, reads row 0 then row 1: [[1,0],[1,1]] → [1,0,1,1]
.astype(float) = Convert from integer to float64 — needed for matrix multiplication with float weights
⬆ result: x = [1.0, 0.0, 1.0, 1.0]
25z1 = x @ W1 + b1 — Hidden layer pre-activation

THE KEY LINE. One matrix multiplication replaces the 3 separate neuron calculations from Section 1. x (1×4) @ W1 (4×3) produces a (1×3) vector — one value per hidden neuron — then we add the bias.

EXECUTION STATE
📚 @ operator = Python matrix multiplication. x @ W1 computes the dot product of x with each column of W1 simultaneously. Equivalent to: [x·col0, x·col1, x·col2]
⬇ x (4,) = [1.0, 0.0, 1.0, 1.0]
⬇ W1 (4×3) =
       h0    h1    h2
x0   0.2  -0.5   0.1
x1  -0.3   0.4  -0.2
x2   0.1   0.3   0.5
x3  -0.4   0.2  -0.1
── Neuron 0 ── =
x·col0 = (1)(0.2) + (0)(-0.3) + (1)(0.1) + (1)(-0.4) = -0.1
── Neuron 1 ── =
x·col1 = (1)(-0.5) + (0)(0.4) + (1)(0.3) + (1)(0.2) = 0.0
── Neuron 2 ── =
x·col2 = (1)(0.1) + (0)(-0.2) + (1)(0.5) + (1)(-0.1) = 0.5
x @ W1 = [-0.1, 0.0, 0.5]
+ b1 = [-0.1+0.1, 0.0+(-0.1), 0.5+0.0] = [0.0, -0.1, 0.5]
⬆ result: z1 = [0.0, -0.1, 0.5] ← matches Section 1!
26h = np.maximum(0, z1) — ReLU activation

Element-wise ReLU: keep positive values, clamp negatives to zero. This non-linearity is what makes the network more powerful than a single linear transformation.

EXECUTION STATE
📚 np.maximum(a, b) = Element-wise maximum of two arrays (or array and scalar). np.maximum(0, [-0.1, 0.5]) → [0.0, 0.5]. Different from np.max() which finds the single largest element.
⬇ z1 = [0.0, -0.1, 0.5]
z1[0] = 0.0 = max(0, 0.0) = 0.0 — on the boundary
z1[1] = -0.1 = max(0, -0.1) = 0.0 — DEAD: negative → clamped to zero
z1[2] = 0.5 = max(0, 0.5) = 0.5 — ALIVE: positive → passes through
⬆ result: h = [0.0, 0.0, 0.5] — only neuron 2 is active!
27y_hat = h @ W2 + b2 — Output layer

Second matrix multiplication: h (1×3) @ W2 (3×4) = output (1×4). Since h0=h1=0, only the third row of W2 contributes — the dead neurons pass NOTHING forward.

EXECUTION STATE
⬇ h = [0.0, 0.0, 0.5] — only h2 matters
⬇ W2 (3×4) =
       y0    y1    y2    y3
h0   0.3  -0.2   0.4   0.1  ← ×0.0 (dead)
h1  -0.1   0.5  -0.3   0.2  ← ×0.0 (dead)
h2   0.2  -0.4   0.1  -0.5  ← ×0.5 (alive!)
h @ W2 = [0.5×0.2, 0.5×(-0.4), 0.5×0.1, 0.5×(-0.5)] = [0.1, -0.2, 0.05, -0.25]
+ b2 = [0.1+0.0, -0.2+0.1, 0.05+(-0.1), -0.25+0.0]
⬆ result: y_hat = [0.1, -0.1, -0.05, -0.25] ← matches Section 1!
30target = image.T.flatten() — Diagonal flip

The target output is the diagonally flipped image. NumPy's .T transposes the 2×2 matrix (swaps rows and columns), then we flatten it.

EXECUTION STATE
📚 .T (transpose) = Swaps rows and columns. [[1,0],[1,1]].T = [[1,1],[0,1]]. For 2×2: pixel(r,c) → pixel(c,r)
image.T (2×2) =
1  1
0  1
⬆ result: target = [1.0, 1.0, 0.0, 1.0]
31mse = np.mean((y_hat - target) ** 2) — Loss

Mean Squared Error: average of squared differences between prediction and target. This single number tells the network how wrong it is.

EXECUTION STATE
y_hat - target = [0.1-1, -0.1-1, -0.05-0, -0.25-1] = [-0.9, -1.1, -0.05, -1.25]
(…) ** 2 = [0.81, 1.21, 0.0025, 1.5625]
📚 np.mean() = Average of all elements: (0.81+1.21+0.0025+1.5625)/4
⬆ result: mse = 0.8963 — very high (perfect = 0.0)
33print(f"Input: {x}")

Display the flattened input vector.

EXECUTION STATE
output = Input: [1. 0. 1. 1.]
34print(f"z1: {z1}")

Display hidden layer pre-activation values.

EXECUTION STATE
output = z1: [ 0. -0.1 0.5]
35print(f"h (ReLU): {h}")

Display hidden layer post-ReLU values. Two neurons are dead (zero).

EXECUTION STATE
output = h (ReLU): [0. 0. 0.5]
36print(f"Prediction: ...")

Display the network's output, rounded to 4 decimal places.

EXECUTION STATE
output = Prediction: [ 0.1 -0.1 -0.05 -0.25]
37print(f"Target: {target}")

Display the target (diagonally flipped image).

EXECUTION STATE
output = Target: [1. 1. 0. 1.]
38print(f"MSE Loss: {mse:.4f}")

Display the loss value. 0.8963 is terrible — random weights produce garbage.

EXECUTION STATE
output = MSE Loss: 0.8963
20 lines without explanation
1import numpy as np
2
3# ── Network weights (same values from Section 1) ──
4W1 = np.array([
5    [ 0.2, -0.5,  0.1],
6    [-0.3,  0.4, -0.2],
7    [ 0.1,  0.3,  0.5],
8    [-0.4,  0.2, -0.1]
9])
10b1 = np.array([0.1, -0.1, 0.0])
11
12W2 = np.array([
13    [ 0.3, -0.2,  0.4,  0.1],
14    [-0.1,  0.5, -0.3,  0.2],
15    [ 0.2, -0.4,  0.1, -0.5]
16])
17b2 = np.array([0.0, 0.1, -0.1, 0.0])
18
19# ── Input image ──
20image = np.array([[1, 0],
21                  [1, 1]])
22x = image.flatten().astype(float)
23
24# ── Forward pass (one matrix op per layer) ──
25z1 = x @ W1 + b1
26h = np.maximum(0, z1)
27y_hat = h @ W2 + b2
28
29# ── Target and loss ──
30target = image.T.flatten().astype(float)
31mse = np.mean((y_hat - target) ** 2)
32
33print(f"Input:      {x}")
34print(f"z1:         {z1}")
35print(f"h (ReLU):   {h}")
36print(f"Prediction: {np.round(y_hat, 4)}")
37print(f"Target:     {target}")
38print(f"MSE Loss:   {mse:.4f}")

Summary

  1. Matrix multiplication replaces loops. One x @ W1 + b1\texttt{x @ W1 + b1} computes all 3 hidden neurons at once.
  2. Two matrix multiplications + ReLU = complete forward pass. Three lines of Python.
  3. Without activation, depth is useless. Multiple linear layers collapse to one.
  4. Always check dimensions. (a,) × (a, b) = (b,). Inner must match.

In the next section, we'll implement this in PyTorch using nn.Module\texttt{nn.Module} — the production way to build neural networks — and create the full training dataset.

Loading comments...