Boo-AI — Master Artificial Intelligence by Building from Scratch

Why the Chain Rule Matters

In Section 2, we learned how derivatives measure instantaneous change and how gradients point in the direction of steepest ascent. In Section 3, we saw that cross-entropy loss measures how well a network's predictions match the true labels. But there's a critical gap: how does the network compute the derivative of the loss with respect to a weight buried deep inside the network?

Consider a 10-layer neural network. The loss depends on the output of layer 10, which depends on layer 9, which depends on layer 8, and so on down to layer 1. To compute $\frac{\partial L}{\partial w_1}$ (how a weight in layer 1 affects the final loss), we need to trace the influence through all 10 layers. The chain rule is the mathematical tool that lets us do exactly this.

The Big Idea: If $y$ depends on $u$, and $u$ depends on $x$, then the rate at which $y$ changes with $x$ is the product of the individual rates: $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$. Backpropagation is nothing more than the chain rule applied systematically to every layer of a neural network.

Here is the chain rule in the context of neural network training:

Step	What Happens	Direction
Forward pass	Input flows through layers to produce output and loss	Left → Right
Compute loss	Compare prediction with true label	At the end
Backward pass	Chain rule propagates gradients from loss back through every layer	Right → Left
Update weights	Each weight adjusts by -lr × gradient	At every layer
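
The four steps in the table can be sketched for a single weight. This is a minimal illustration, assuming a one-parameter model `y_hat = w * x` with squared-error loss; the function name `train_step`, the learning rate `lr`, and the sample values are hypothetical and not taken from the text.

```python
def train_step(w, x, target, lr=0.1):
    """Run the four steps once for the toy model y_hat = w * x."""
    # 1. Forward pass: the input flows through the model to a prediction.
    y_hat = w * x
    # 2. Compute loss: compare the prediction with the target.
    loss = (y_hat - target) ** 2
    # 3. Backward pass: chain rule, dL/dw = dL/dy_hat * dy_hat/dw.
    dloss_dyhat = 2 * (y_hat - target)
    dyhat_dw = x
    grad_w = dloss_dyhat * dyhat_dw
    # 4. Update: move the weight against the gradient.
    w_new = w - lr * grad_w
    return w_new, loss


w = 0.0
losses = []
for _ in range(20):
    w, loss = train_step(w, x=2.0, target=6.0)
    losses.append(loss)

print(f"final w = {w:.4f}, first loss = {losses[0]:.4f}, last loss = {losses[-1]:.6f}")
```

With these made-up values the weight moves toward 3, the value at which `w * x` equals the target, and the recorded loss shrinks from one step to the next.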

The Chain Rule: Derivatives of Composed Functions

When one function is applied after another — $h(x) = f(g(x))$ — we call this function composition. The chain rule tells us how to differentiate the composed function:

$\frac{dh}{dx} = \frac{df}{du} \bigg|_{u=g(x)} \cdot \frac{dg}{dx}$

In words: the derivative of the composition equals the derivative of the outer function (evaluated at the inner function's output) times the derivative of the inner function. Think of it as a chain of gear ratios: if the inner function amplifies change by 3× and the outer function amplifies by 14×, the total amplification is 3 × 14 = 42×.

Step-by-Step Process

1. Identify the composition: Express $h(x)$ as $f(g(x))$. Which is the inner function and which is the outer?
2. Differentiate each part separately: Find $g'(x)$ (inner derivative) and $f'(u)$ (outer derivative).
3. Evaluate the outer derivative at the inner function's output: Compute $f'(g(x))$, not $f'(x)$.
4. Multiply: $h'(x) = f'(g(x)) \cdot g'(x)$.

Let's apply this to a concrete example: $h(x) = (3x + 1)^2$.

Chain Rule — Basic Example

🐍chain_rule_basic.py

1import numpy as np

NumPy is imported for potential array operations, though this example uses plain Python arithmetic to make the chain rule mechanics crystal clear.

EXECUTION STATE

📚 numpy = Not strictly needed here, but imported for consistency. The chain rule concepts apply equally to scalars and arrays.

3# Two simple functions composed: h(x) = f(g(x))

Function composition is when you feed the output of one function as the input to another. In neural networks, every layer is a function, and the whole network is a composition: layer3(layer2(layer1(input))). The chain rule tells us how to differentiate through this entire chain.

4# g(x) = 3x + 1 (inner function)

The inner function is applied first. Think of this as one layer of a neural network: a linear transformation w·x + b with w=3, b=1.

5# f(u) = u**2 (outer function)

The outer function is applied second, to the output of g. Think of this as the loss function: it takes the layer’s output and squares it (like MSE loss).

6# h(x) = (3x + 1)**2

The full composition. We want dh/dx: how does the final output change when x changes? The chain rule lets us compute this by breaking it into two simpler derivatives.

8def g(x): — Inner function (the “layer”)

A simple linear function. In a neural network, this represents the linear transformation z = w·x + b before the activation function.

EXECUTION STATE

⬇ input: x = The input variable. At x=2.0: g(2.0) = 3×2 + 1 = 7.

⬆ returns = 3x + 1 — a linear function with slope 3 and intercept 1.

9return 3 * x + 1

The linear computation. Its derivative with respect to x is simply 3 (the coefficient of x).

EXECUTION STATE

3 * x + 1 = At x=2.0: 3×2.0 + 1 = 7.0

derivative dg/dx = d(3x+1)/dx = 3 — constant, independent of x. The slope of a line is always its coefficient.

11def f(u): — Outer function (the “loss”)

A quadratic function. In a neural network, this could represent part of the MSE loss: squaring the difference between prediction and target.

EXECUTION STATE

⬇ input: u = The output of g(x). At u=7.0: f(7.0) = 49.0.

⬆ returns = u² — a quadratic function. Always non-negative.

12return u ** 2

Squares the input. Its derivative with respect to u is 2u (power rule).

EXECUTION STATE

u ** 2 = At u=7.0: 7.0² = 49.0

derivative df/du = d(u²)/du = 2u. At u=7: df/du = 14.0

14def h(x): — The full composition f(g(x))

The composed function. h(x) = f(g(x)) = (3x + 1)². We want dh/dx. Here we could expand the square and differentiate directly, but that does not scale to general compositions. The chain rule breaks the problem into parts we can differentiate one at a time.

EXECUTION STATE

⬇ input: x = The original input. h(2.0) = f(g(2.0)) = f(7.0) = 49.0.

⬆ returns = (3x + 1)² — the result of applying g then f.

15return f(g(x))

First computes g(x), then passes the result to f. This is the execution order of a neural network’s forward pass: data flows through layers sequentially.

EXECUTION STATE

g(x) = g(2.0) = 7.0 — the intermediate value (hidden layer output)

f(g(x)) = f(7.0) = 49.0 — the final output (loss value)

⬆ return: 49.0 = h(2.0) = (3×2+1)² = 7² = 49

17# Evaluate at x = 2.0

We pick x=2.0 as our evaluation point. The chain rule gives us dh/dx at this specific point.

18x = 2.0

The point where we want to know the derivative.

EXECUTION STATE

x = 2.0 = Our input value. The chain rule will tell us: if x increases from 2.0 by a tiny amount, how much does h(x) change?

19u = g(x) — Compute the intermediate value

The intermediate value u = g(x) is crucial — we need it to evaluate df/du at the right point.

EXECUTION STATE

u = g(2.0) = 3×2.0 + 1 = 7.0 — this is the value that f sees as input. The chain rule evaluates df/du at THIS point.

20result = h(x)

The final output of the composed function.

EXECUTION STATE

result = h(2.0) = f(7.0) = 49.0

21print(f"x = {x}")

Prints the input value.

EXECUTION STATE

output = x = 2.0

22print(f"g(x) = 3*{x} + 1 = {u}")

Prints the intermediate value.

EXECUTION STATE

output = g(x) = 3*2.0 + 1 = 7.0

23print(f"h(x) = f(g(x)) = {u}**2 = {result}")

Prints the final result.

EXECUTION STATE

output = h(x) = f(g(x)) = 7.0**2 = 49.0

25# Chain rule: dh/dx = df/du * dg/dx

THE chain rule formula: the derivative of a composition is the product of the individual derivatives. Think of it as a chain of gears: each gear ratio multiplies the previous one. If g amplifies change by 3× and f amplifies by 14×, then the total amplification is 3×14 = 42×.

26dg_dx = 3.0 — Derivative of the inner function

dg/dx = d(3x+1)/dx = 3. The slope of the linear function g. This tells us: a tiny change in x produces 3× that change in u.

EXECUTION STATE

dg_dx = 3.0 = The inner derivative. For every 1 unit increase in x, g(x) increases by 3 units. This is constant for a linear function.

27df_du = 2 * u — Derivative of the outer function

df/du = d(u²)/du = 2u. Evaluated at u = g(x) = 7.0, this gives 14.0. This tells us: at u=7, a tiny change in u produces 14× that change in f.

EXECUTION STATE

df_du = 2 * 7.0 = 14.0 = The outer derivative, evaluated at the intermediate point u=7.0. If u were 3, this would be 6. The key insight: we evaluate at u=g(x), not at some other point.

28dh_dx = df_du * dg_dx — The chain rule multiplication!

The chain rule: multiply the outer derivative by the inner derivative. The total rate of change is the product of the individual rates of change along the chain.

EXECUTION STATE

dh_dx = 14.0 × 3.0 = 42.0 = The derivative of the composed function. If x increases by 0.001, h(x) increases by approximately 0.001 × 42 = 0.042. Think of it as: x changes by ε, u changes by 3ε (via g), and f changes by 14×(3ε) = 42ε (via f).

30print(f"--- Chain Rule ---")

Header for the chain rule output.

EXECUTION STATE

output = --- Chain Rule ---

31print(f"dg/dx = 3")

Displays the inner derivative.

EXECUTION STATE

output = dg/dx = 3

32print(f"df/du = 2*{u} = {df_du}")

Displays the outer derivative evaluated at u=7.

EXECUTION STATE

output = df/du = 2*7.0 = 14.0

33print(f"dh/dx = df/du * dg/dx = {df_du} * {dg_dx} = {dh_dx}")

The chain rule result: 14 × 3 = 42.

EXECUTION STATE

output = dh/dx = df/du * dg/dx = 14.0 * 3.0 = 42.0

35# Numerical verification

We verify our analytical chain rule result using numerical differentiation (the difference quotient from Section 2). If the numerical approximation matches our analytical result, we know our chain rule calculation is correct.

36h_val = 1e-7

A tiny step size for the numerical approximation. Small enough for accuracy, but not so small that floating-point errors dominate.

EXECUTION STATE

h_val = 1e-7 = 0.0000001 — one ten-millionth. The finite difference approximation converges to the true derivative as h→0.

37numerical = (h(x + h_val) - h(x)) / h_val

The difference quotient: [h(x+ε) - h(x)] / ε. This approximates dh/dx numerically.

EXECUTION STATE

h(x + h_val) = h(2.0000001) = (3×2.0000001 + 1)² = 7.0000003² = 49.0000042...

h(x) = h(2.0) = 49.0

numerical = (49.0000042 - 49.0) / 0.0000001 ≈ 42.000001 — matches our analytical result!

38print(f"Numerical dh/dx = {numerical:.6f}")

Prints the numerical approximation for comparison.

EXECUTION STATE

output = Numerical dh/dx = 42.000001

39print(f"Match: {abs(dh_dx - numerical) < 1e-4}")

Confirms the analytical and numerical results agree to within a small tolerance.

EXECUTION STATE

|42.0 - 42.000001| = 0.000001 < 0.0001 → True. Our chain rule calculation is verified! ✓

```python
import numpy as np

# Two simple functions composed: h(x) = f(g(x))
# g(x) = 3x + 1   (inner function)
# f(u) = u**2     (outer function)
# h(x) = (3x + 1)**2

def g(x):
    return 3 * x + 1

def f(u):
    return u ** 2

def h(x):
    return f(g(x))

# Evaluate at x = 2.0
x = 2.0
u = g(x)
result = h(x)
print(f"x = {x}")
print(f"g(x) = 3*{x} + 1 = {u}")
print(f"h(x) = f(g(x)) = {u}**2 = {result}")

# Chain rule: dh/dx = df/du * dg/dx
dg_dx = 3.0             # derivative of g: d(3x+1)/dx = 3
df_du = 2 * u           # derivative of f: d(u**2)/du = 2u
dh_dx = df_du * dg_dx   # chain rule: multiply!

print(f"\n--- Chain Rule ---")
print(f"dg/dx = 3")
print(f"df/du = 2*{u} = {df_du}")
print(f"dh/dx = df/du * dg/dx = {df_du} * {dg_dx} = {dh_dx}")

# Numerical verification
h_val = 1e-7
numerical = (h(x + h_val) - h(x)) / h_val
print(f"\nNumerical dh/dx = {numerical:.6f}")
print(f"Match: {abs(dh_dx - numerical) < 1e-4}")
```
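
The step size in the numerical check is a trade-off, and a short experiment makes that visible. The sketch below reuses the same $h(x) = (3x + 1)^2$ and compares the forward difference used above with a central difference. The central difference is not used in the text; it is included here only for comparison, and the helper names `forward_diff` and `central_diff` are hypothetical.

```python
def h(x):
    return (3 * x + 1) ** 2


def forward_diff(fn, x, eps):
    """One-sided difference quotient: (fn(x + eps) - fn(x)) / eps."""
    return (fn(x + eps) - fn(x)) / eps


def central_diff(fn, x, eps):
    """Two-sided difference quotient: (fn(x + eps) - fn(x - eps)) / (2 * eps)."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


x = 2.0
exact = 42.0  # chain-rule result: 2 * (3x + 1) * 3 at x = 2

for eps in (1e-1, 1e-3, 1e-5, 1e-7):
    fwd_err = abs(forward_diff(h, x, eps) - exact)
    ctr_err = abs(central_diff(h, x, eps) - exact)
    print(f"eps={eps:.0e}  forward error={fwd_err:.2e}  central error={ctr_err:.2e}")
```

For this quadratic, the forward difference overshoots the exact derivative by $9 \cdot \text{eps}$ before rounding, so shrinking the step helps until floating-point cancellation starts to matter. The central difference has no truncation error for a quadratic, so whatever error it shows is rounding only.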

Longer Chains

The chain rule extends naturally to longer compositions. For $h(x) = f(g(k(x)))$:

$\frac{dh}{dx} = \frac{df}{dg} \cdot \frac{dg}{dk} \cdot \frac{dk}{dx}$

Each additional function in the chain adds one more factor to the product. A 10-layer neural network has 10 (or more) factors in its chain rule product. This is exactly what backpropagation computes — one factor per layer.
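
A three-link chain can be checked the same way as the two-link one. The functions below are chosen purely for illustration (they are assumptions, not from the text): $k(x) = x^2$, $g(v) = 3v + 1$, and $f(u) = u^2$, so that $h(x) = (3x^2 + 1)^2$.

```python
def k(x):
    return x ** 2        # innermost function

def g(v):
    return 3 * v + 1     # middle function

def f(u):
    return u ** 2        # outermost function

def h(x):
    return f(g(k(x)))


x = 1.0
v = k(x)                 # intermediate value fed to g
u = g(v)                 # intermediate value fed to f

# One local derivative per link, each evaluated at its own input.
dk_dx = 2 * x
dg_dv = 3.0
df_du = 2 * u
dh_dx = df_du * dg_dv * dk_dx   # three factors, one per function

# Central-difference check
eps = 1e-6
numerical = (h(x + eps) - h(x - eps)) / (2 * eps)
print(f"chain rule: {dh_dx}, numerical: {numerical:.6f}")
```

At $x = 1$ the three factors are 8, 3, and 2, giving $dh/dx = 48$, which agrees with differentiating $(3x^2 + 1)^2$ directly: $2(3x^2 + 1) \cdot 6x = 48$.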

Computational Graphs: Visualizing the Chain Rule

A computational graph makes the chain rule visual. Each node represents an operation (add, multiply, ReLU), and edges carry values (forward) or gradients (backward).

For a simple neuron with $z = w \cdot x + b$, $a = \text{ReLU}(z)$, and $L = (a - t)^2$, the graph is a single chain: the inputs $w$, $x$, and $b$ feed $z$, which feeds $a$, which (together with the target $t$) feeds $L$.

Forward pass (left to right): Values flow through the graph. Each node computes its output from its inputs. This is what happens when you call $\texttt{model(x)}$ in PyTorch.

Backward pass (right to left): Gradients flow in reverse. At each node, the incoming gradient is multiplied by the local derivative. This is what happens when you call $\texttt{loss.backward()}$.

Explore the computational graph interactively — run the forward pass to see values flow, then the backward pass to see gradients propagate via the chain rule:

[Interactive figure: computational graph visualization]

Key Insight: Each edge in the backward pass carries one chain rule factor. Along a single path, the gradient at a node equals the product of all local derivatives between that node and the loss; when a node reaches the loss through several paths, the contributions of the paths are summed. Backpropagation is just the chain rule applied to a graph, computing all gradients in a single backward traversal.
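
To make the graph view concrete, here is a minimal sketch of a scalar computational graph in plain Python. The `Value` class and its methods are a hypothetical design for illustration only, not PyTorch's implementation. Each node stores its parents and the local derivative on each incoming edge, and `backward()` walks the graph in reverse. It is applied to the neuron above with assumed sample values $x = 2$, $w = 1.5$, $b = -1$, $t = 3$.

```python
class Value:
    """A scalar node in a computational graph (hypothetical minimal design)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        gate = 1.0 if self.data > 0 else 0.0
        return Value(max(0.0, self.data), (self,), (gate,))

    def backward(self):
        # Build a topological order so that a node's gradient is complete
        # before it is passed on to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dL/dL
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local  # chain rule on one edge


x, w, b = Value(2.0), Value(1.5), Value(-1.0)
neg_t = Value(-3.0)            # the target enters as -t so only + and * are needed

z = w * x + b                  # linear step
a = z.relu()                   # activation
diff = a + neg_t               # a - t
loss = diff * diff             # (a - t)^2; diff reaches the loss along two edges

loss.backward()
print(f"loss = {loss.data}, dL/dw = {w.grad}, dL/db = {b.grad}")
```

Because `diff` feeds the final multiplication twice, its gradient is accumulated from two edges with `+=`, which is the sum-over-paths behaviour described above.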

Chain Rule Through a Neural Network Layer

Now let's apply the chain rule to an actual neural network computation. A single neuron with ReLU activation and MSE loss has this structure:

Linear: $z = w \cdot x + b$
Activation: $a = \text{ReLU}(z) = \max(0, z)$
Loss: $L = (a - t)^2$

The chain rule for $\frac{\partial L}{\partial w}$ multiplies three derivatives along the chain $L \leftarrow a \leftarrow z \leftarrow w$:

$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$

Each factor has a clear meaning:

Factor and formula	What it is	Meaning
$\frac{\partial L}{\partial a} = 2(a-t)$	Derivative of loss w.r.t. activation	How sensitive is the loss to the output?
$\frac{\partial a}{\partial z} = \begin{cases} 1 & z > 0 \\ 0 & z \leq 0 \end{cases}$	ReLU derivative	Is the neuron active? (gate: open or closed)
$\frac{\partial z}{\partial w} = x$	Derivative of linear w.r.t. weight	How does the weight affect the pre-activation?

Let's compute all of these step by step:

[Interactive playground: backprop calculus]

Adding a Second Input

A real neuron almost never has just one input. Let's see what changes when the same neuron is fed by two inputs $a_1^{(L-1)}$ and $a_2^{(L-1)}$ through weights $w_1$ and $w_2$. (Here the superscript $(L-1)$ indexes the previous layer, and $C$ denotes the cost, which plays the role of the loss $L$ used elsewhere in this section.) The chain rule still gives one gradient per weight, but both gradients share an upstream factor $\delta = \frac{\partial a}{\partial z} \cdot \frac{\partial C}{\partial a}$, the insight backpropagation exploits to avoid recomputing work.

[Interactive playground: two-input neuron]
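
The shared factor $\delta$ can be seen in a few lines of plain Python. All numbers below are hypothetical values chosen for illustration, not the playground's; the neuron is assumed to use ReLU and a squared-error cost, as in the single-input case.

```python
# Hypothetical values chosen for illustration.
a1, a2 = 1.0, 2.0              # inputs coming from the previous layer
w1, w2, b = 0.5, -0.25, 0.1    # two weights and one bias
target = 1.0

# Forward pass
z = w1 * a1 + w2 * a2 + b
a = max(0.0, z)
cost = (a - target) ** 2

# Shared upstream factor, computed once
dcost_da = 2 * (a - target)
da_dz = 1.0 if z > 0 else 0.0
delta = da_dz * dcost_da

# Each parameter gradient is delta times that parameter's own local derivative.
dcost_dw1 = delta * a1         # dz/dw1 = a1
dcost_dw2 = delta * a2         # dz/dw2 = a2
dcost_db = delta * 1.0         # dz/db = 1

print(f"delta = {delta:.4f}")
print(f"dC/dw1 = {dcost_dw1:.4f}, dC/dw2 = {dcost_dw2:.4f}, dC/db = {dcost_db:.4f}")
```

The two weight gradients differ only by the input attached to each weight, so with $a_2 = 2 a_1$ the second gradient is exactly twice the first.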

Scaling to the Full Network

Now let's apply the same chain rule to the full 4 → 3 → 4 diagonal-flip network from Chapter 7 — same math, just more indices. You'll see every gradient equation play out with real numbers, watch dead ReLU neurons block gradient flow, and see how zero inputs zero out entire rows of the weight gradient. Toggle the input and target bits below the playground to explore different configurations; the weights and biases stay fixed at the values we'll reuse in Chapter 8.

[Interactive playground: full 4 → 3 → 4 network]
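
The two effects mentioned above, dead ReLU neurons and zero inputs, can be reproduced with a small NumPy sketch. Everything here is an assumption made for illustration: the weights are made up (they are not the Chapter 7 values), the output layer is taken to be linear with a summed squared-error loss, and the convention is `z = x @ W + b`, so each row of a weight matrix belongs to one input.

```python
import numpy as np

# Hypothetical parameters; NOT the Chapter 7 weights.
x = np.array([1.0, 0.0, 0.0, 1.0])       # two of the four input bits are zero
t = np.array([1.0, 0.0, 0.0, 1.0])       # placeholder target bits
W1 = np.array([[ 0.5, -1.0,  0.2],
               [ 0.3,  0.4, -0.6],
               [-0.2,  0.1,  0.7],
               [ 0.1, -0.5,  0.3]])      # shape (4, 3): rows = inputs
b1 = np.zeros(3)
W2 = np.array([[ 0.4, -0.3,  0.2,  0.1],
               [ 0.6,  0.5, -0.4,  0.3],
               [-0.1,  0.2,  0.3, -0.2]])  # shape (3, 4): rows = hidden units
b2 = np.zeros(4)

# Forward pass: 4 -> 3 (ReLU) -> 4 (linear output assumed)
z1 = x @ W1 + b1
h = np.maximum(0.0, z1)
y = h @ W2 + b2
loss = np.sum((y - t) ** 2)

# Backward pass
dL_dy = 2 * (y - t)                 # shape (4,)
dL_dW2 = np.outer(h, dL_dy)         # shape (3, 4)
dL_dh = W2 @ dL_dy                  # sums over the four output paths
dL_dz1 = dL_dh * (z1 > 0)           # ReLU gate: zero where z1 <= 0
dL_dW1 = np.outer(x, dL_dz1)        # shape (4, 3)

print("z1 =", z1)
print("dL/dW1 =\n", dL_dW1)
```

With these values the second hidden unit has a negative pre-activation, so its column of `dL_dW1` is zero, and the two zero inputs zero out their rows. Under the opposite convention (`z = W @ x`), rows and columns swap roles.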

Chain Rule Through a Single Neuron

🐍neural_layer_chain_rule.py

1import numpy as np

NumPy imported for consistency. This example uses plain Python to make the chain rule steps transparent.

EXECUTION STATE

📚 numpy = Not used directly here, but consistent with our other code blocks.

3# Single neuron: input -> linear -> ReLU -> loss

This is the simplest possible neural network: one input, one weight, one bias, one activation function, and a loss. Despite its simplicity, the chain rule pattern here is EXACTLY the same pattern used in networks with millions of parameters — just repeated many times.

4# z = w*x + b (linear transformation)

The linear step: multiply input by weight and add bias. This is one matrix multiplication in a real network.

5# a = ReLU(z) (activation function)

The non-linear activation. ReLU(z) = max(0, z). Without this, stacking layers would collapse to a single linear transformation.

6# loss = (a - target)**2 (MSE loss)

Mean Squared Error: penalizes the squared difference between prediction and target.

8x = 2.0 — Input value

A single input to our neuron.

EXECUTION STATE

x = 2.0 = The input feature. In a real network, this would be one element of the input vector (e.g., a pixel intensity).

9w = 1.5 — Weight (learnable parameter)

The weight that we want to optimize. The chain rule will compute dloss/dw, telling us how to adjust this weight to reduce the loss.

EXECUTION STATE

w = 1.5 = The learnable parameter. After computing its gradient, gradient descent updates: w_new = w - lr * dloss/dw.

10b = -1.0 — Bias (learnable parameter)

The bias term. Also learnable — the chain rule computes dloss/db too.

EXECUTION STATE

b = -1.0 = The bias shifts the activation threshold. Without bias, the neuron’s output would always pass through the origin.

11target = 3.0 — Desired output

What we want the neuron to output. The loss measures how far we are from this target.

EXECUTION STATE

target = 3.0 = The ground truth label for this training example. The loss is 0 only when a = target = 3.0.

13# === Forward Pass ===

The forward pass computes the output by flowing data left-to-right through the computation graph: x → z → a → loss.

14z = w * x + b — Linear transformation

The linear step: weight times input plus bias. This is the fundamental computation of every neuron.

EXECUTION STATE

w * x = 1.5 × 2.0 = 3.0 — the weighted input

w * x + b = 3.0 + (-1.0) = 2.0 — the pre-activation value

z = 2.0 = The linear output before activation. We need to remember this value for the backward pass (ReLU’s derivative depends on the sign of z).

15print(f"z = w*x + b = {w}*{x} + {b} = {z}")

Displays the linear computation.

EXECUTION STATE

output = z = w*x + b = 1.5*2.0 + -1.0 = 2.0

17a = max(0.0, z) — ReLU activation

ReLU (Rectified Linear Unit): passes positive values unchanged, clamps negatives to zero. Since z=2.0 > 0, the value passes through unchanged.

EXECUTION STATE

📚 max(0, z) = Python’s built-in max function. ReLU(z) = max(0, z). If z > 0: output = z. If z ≤ 0: output = 0 (kills the signal).

a = max(0, 2.0) = 2.0 = Since z=2.0 > 0, ReLU passes it unchanged. The neuron is “active.” If z were negative, a=0 and the neuron would be inactive for this input (no gradient flows); a neuron that is inactive for every input is called “dead.”

18print(f"a = ReLU({z}) = max(0, {z}) = {a}")

Displays the activation output.

EXECUTION STATE

output = a = ReLU(2.0) = max(0, 2.0) = 2.0

20loss = (a - target) ** 2 — MSE loss

The loss function: squared difference between our prediction (a=2.0) and the target (3.0). The loss is 1.0 — we’re off by 1 unit, and squaring gives 1.0.

EXECUTION STATE

a - target = 2.0 - 3.0 = -1.0 — the error (prediction is below target)

(a - target)**2 = (-1.0)² = 1.0 — squaring makes the loss positive and penalizes large errors more than small ones

loss = 1.0 = Our prediction (2.0) is 1 unit away from the target (3.0). The gradient will tell us to increase w and b to push the output closer to 3.0.

21print(f"loss = (a - target)^2 = ({a} - {target})^2 = {loss}")

Displays the loss computation.

EXECUTION STATE

output = loss = (a - target)^2 = (2.0 - 3.0)^2 = 1.0

23# === Backward Pass (Chain Rule) ===

Now we flow gradients right-to-left through the computation graph. At each step, we compute the local derivative and multiply it with the accumulated gradient from later in the chain. This is backpropagation — the chain rule applied systematically.

24print(f"--- Backward Pass ---")

Header for backward pass output.

EXECUTION STATE

output = --- Backward Pass ---

26# Step 1: dloss/da

Start at the loss and work backward. The first local derivative is dloss/da: how does the loss change when the activation changes?

27dloss_da = 2 * (a - target) — Loss gradient

Derivative of (a - target)² with respect to a: d/da[(a-t)²] = 2(a-t). This is the starting point of backpropagation.

EXECUTION STATE

derivative rule = d/da[(a-t)²] = 2(a-t) by the power rule. The chain rule starts here and works backward.

dloss_da = 2×(2.0-3.0) = -2.0 = Negative gradient: the loss decreases if we increase a. This makes sense — we need a to go from 2.0 toward 3.0.

28print(f"dloss/da = 2*(a-target) = 2*({a}-{target}) = {dloss_da}")

Displays the loss gradient.

EXECUTION STATE

output = dloss/da = 2*(a-target) = 2*(2.0-3.0) = -2.0

30# Step 2: da/dz (ReLU derivative)

Next link in the chain: how does the activation change when z changes? The ReLU derivative is a simple switch: 1 if z > 0, 0 if z ≤ 0.

31da_dz = 1.0 if z > 0 else 0.0 — ReLU derivative

ReLU’s derivative is piecewise: for z > 0, the output equals z, so the derivative is 1 (pass-through). For z ≤ 0, the output is constant at 0, so the derivative is 0 (gradient is killed).

EXECUTION STATE

z = 2.0 > 0 = True — the neuron is active, so gradient passes through.

da_dz = 1.0 = ReLU passes the gradient unchanged when z > 0. If z were -1.0, da_dz would be 0.0, blocking all gradient flow through this neuron. When that happens for every training input, the neuron stops learning: the “dying ReLU” problem.

32print(f"da/dz = {da_dz} (z={z} > 0, so ReLU passes gradient)")

Displays the ReLU derivative.

EXECUTION STATE

output = da/dz = 1.0 (z=2.0 > 0, so ReLU passes gradient)

34# Step 3: Chain to get dloss/dz

Apply the chain rule: dloss/dz = dloss/da × da/dz. We multiply the accumulated gradient by the local derivative at each step.

35dloss_dz = dloss_da * da_dz — Chain rule multiplication

Multiply the gradient from the loss (-2.0) by the ReLU derivative (1.0). The gradient passes through unchanged because ReLU is active.

EXECUTION STATE

dloss_dz = -2.0 × 1.0 = -2.0 = The gradient at the pre-activation z. This will be used to compute gradients for both w and b (since z depends on both).

36print(f"dloss/dz = {dloss_da} * {da_dz} = {dloss_dz}")

Displays the chained gradient.

EXECUTION STATE

output = dloss/dz = -2.0 * 1.0 = -2.0

38# Step 4: Gradients for w and b

The final chain rule step: z = w·x + b, so dz/dw = x and dz/db = 1. We multiply dloss/dz by each to get the parameter gradients.

39dz_dw = x — How z changes with w

Since z = w·x + b, the derivative of z with respect to w is just x. The input acts as a “scaling factor” for the weight gradient.

EXECUTION STATE

dz_dw = x = 2.0 = d(w·x + b)/dw = x. Larger inputs produce larger weight gradients — this is why input normalization matters.

40dz_db = 1.0 — How z changes with b

Since z = w·x + b, the derivative of z with respect to b is always 1. The bias gradient equals the upstream gradient directly.

EXECUTION STATE

dz_db = 1.0 = d(w·x + b)/db = 1. The bias gradient is always the upstream gradient unchanged.

41dloss_dw = dloss_dz * dz_dw — Weight gradient

The final chain rule for the weight: dloss/dw = dloss/dz × dz/dw = -2.0 × 2.0 = -4.0.

EXECUTION STATE

dloss_dw = -2.0 × 2.0 = -4.0 = Negative gradient: increase w to reduce loss. With learning rate 0.01: w_new = 1.5 - 0.01×(-4.0) = 1.5 + 0.04 = 1.54.

42dloss_db = dloss_dz * dz_db — Bias gradient

The chain rule for the bias: dloss/db = dloss/dz × dz/db = -2.0 × 1.0 = -2.0.

EXECUTION STATE

dloss_db = -2.0 × 1.0 = -2.0 = Negative gradient: increase b to reduce loss. With learning rate 0.01: b_new = -1.0 - 0.01×(-2.0) = -1.0 + 0.02 = -0.98.

43print(f"dloss/dw = dloss/dz * x = {dloss_dz} * {x} = {dloss_dw}")

Displays the weight gradient.

EXECUTION STATE

output = dloss/dw = dloss/dz * x = -2.0 * 2.0 = -4.0

44print(f"dloss/db = dloss/dz * 1 = {dloss_db}")

Displays the bias gradient.

EXECUTION STATE

output = dloss/db = dloss/dz * 1 = -2.0

```python
import numpy as np

# Single neuron: input -> linear -> ReLU -> loss
# z = w*x + b     (linear transformation)
# a = ReLU(z)     (activation function)
# loss = (a - target)**2  (MSE loss)

x = 2.0        # input
w = 1.5        # weight
b = -1.0       # bias
target = 3.0   # desired output

# === Forward Pass ===
z = w * x + b
print(f"z = w*x + b = {w}*{x} + {b} = {z}")

a = max(0.0, z)
print(f"a = ReLU({z}) = max(0, {z}) = {a}")

loss = (a - target) ** 2
print(f"loss = (a - target)^2 = ({a} - {target})^2 = {loss}")

# === Backward Pass (Chain Rule) ===
print(f"\n--- Backward Pass ---")

# Step 1: dloss/da
dloss_da = 2 * (a - target)
print(f"dloss/da = 2*(a-target) = 2*({a}-{target}) = {dloss_da}")

# Step 2: da/dz (ReLU derivative)
da_dz = 1.0 if z > 0 else 0.0
print(f"da/dz = {da_dz} (z={z} > 0, so ReLU passes gradient)")

# Step 3: Chain to get dloss/dz
dloss_dz = dloss_da * da_dz
print(f"dloss/dz = {dloss_da} * {da_dz} = {dloss_dz}")

# Step 4: Gradients for w and b
dz_dw = x
dz_db = 1.0
dloss_dw = dloss_dz * dz_dw
dloss_db = dloss_dz * dz_db
print(f"\ndloss/dw = dloss/dz * x = {dloss_dz} * {x} = {dloss_dw}")
print(f"dloss/db = dloss/dz * 1 = {dloss_db}")
```

The ReLU Gate: Notice how ReLU acts as a gate in the backward pass. When $z > 0$, the gate is open and gradients flow through unchanged ($\frac{\partial a}{\partial z} = 1$). When $z \leq 0$, the gate is closed and every gradient that has to pass back through this neuron becomes zero, including the gradients of its own weight and bias. This leads to the dying ReLU problem: if a neuron's pre-activation is negative for every training input, it can never learn because no gradient flows through it.
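
The closed gate is easy to reproduce by rerunning the same single-neuron computation with a more negative bias. The helper `neuron_grads` and the bias value $b = -4$ are assumptions introduced for this sketch; the other values match the example above.

```python
def neuron_grads(x, w, b, target):
    """Return (dloss/dw, dloss/db) for z = w*x + b, a = ReLU(z), loss = (a - target)**2."""
    z = w * x + b
    a = max(0.0, z)
    dloss_da = 2 * (a - target)
    # ReLU gate: pass the gradient through when z > 0, block it otherwise.
    dloss_dz = dloss_da if z > 0 else 0.0
    return dloss_dz * x, dloss_dz * 1.0


open_gate = neuron_grads(x=2.0, w=1.5, b=-1.0, target=3.0)    # z = 2.0, gate open
closed_gate = neuron_grads(x=2.0, w=1.5, b=-4.0, target=3.0)  # z = -1.0, gate closed

print("gate open:  ", open_gate)
print("gate closed:", closed_gate)
```

In the closed case the loss is larger (the output is 0 while the target is 3), yet both parameter gradients are zero, so a gradient descent step would leave `w` and `b` unchanged for this input.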

The Multivariable Chain Rule

In neural networks, a single intermediate variable often affects the loss through multiple paths. For example, the hidden layer output $h$ might affect multiple neurons in the next layer. When this happens, the gradient is the sum of contributions from all paths:

$\frac{\partial L}{\partial h} = \sum_{j} \frac{\partial L}{\partial z_j} \cdot \frac{\partial z_j}{\partial h}$

In the two-layer network example in the next section, $h$ connects to only one output neuron, so there is only one path. But in real networks with many neurons per layer, each hidden neuron contributes to many output neurons, and the multivariable chain rule sums all contributions. This is why matrix multiplication appears in backpropagation: multiplying by the transpose of the weight matrix automatically sums gradients across all paths.

[Figure: multi-neuron network]

Key Principle: Sum Over Paths

If a variable $h$ affects the loss through $n$ different pathways, the total gradient is the sum of the gradients along each pathway. This is the multivariable chain rule in action, and it is automatically handled by backpropagation.
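
The link between summing over paths and multiplying by the transposed weight matrix can be checked numerically. The sketch below assumes a layer written as `z = W @ h` with two hidden units feeding three next-layer units; the matrix `W` and the upstream gradients `dL_dz` are made-up numbers.

```python
import numpy as np

# Hypothetical layer: z = W @ h, so W[j, i] = dz_j / dh_i.
W = np.array([[ 1.0, -2.0],
              [ 0.5,  0.0],
              [-1.5,  3.0]])            # shape (3, 2)
dL_dz = np.array([0.2, -1.0, 0.4])      # upstream gradients, one per z_j (made up)

# Explicit sum over paths: dL/dh_i = sum_j dL/dz_j * dz_j/dh_i
dL_dh_loop = np.zeros(2)
for i in range(2):
    for j in range(3):
        dL_dh_loop[i] += dL_dz[j] * W[j, i]

# The same sum written as one matrix-vector product
dL_dh_matrix = W.T @ dL_dz

print("loop:  ", dL_dh_loop)
print("matrix:", dL_dh_matrix)
```

Both computations give the same vector, because row $i$ of $W^\top$ lists exactly the local derivatives $\partial z_j / \partial h_i$ along the paths leaving hidden unit $i$.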

Backpropagation: Chain Rule Through Multiple Layers

Now let's see the chain rule in action through a two-layer network. This is where the pattern of backpropagation becomes clear: at each layer, we (1) compute local parameter gradients and (2) pass the gradient backward to the previous layer.

For the network $z_1 = w_1 x + b_1$, $h = \text{ReLU}(z_1)$, $y = w_2 h + b_2$, $L = (y - t)^2$, the full chain for $\frac{\partial L}{\partial w_1}$ has four factors:

$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$

Each factor corresponds to passing through one operation in reverse:

$\frac{\partial L}{\partial y} = 2(y - t)$ — loss derivative
$\frac{\partial y}{\partial h} = w_2$ — through layer 2's weight
$\frac{\partial h}{\partial z_1} \in \{0, 1\}$ — through ReLU gate
$\frac{\partial z_1}{\partial w_1} = x$ — the input

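Watch how the gradient grows as it flows backward, using the values from the code below ($x = 3$, $w_1 = 0.5$, $b_1 = 0$, $w_2 = 2$, $b_2 = -1$, $t = 4$, which give $h = 1.5$ and $y = 2$):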

Location	Gradient Value	How It Got There
$\partial L / \partial y$	−4.0	Starting point: 2(2−4)
$\partial L / \partial w_2$	−6.0	−4.0 × h = −4.0 × 1.5
$\partial L / \partial h$	−8.0	−4.0 × w₂ = −4.0 × 2.0
$\partial L / \partial z_1$	−8.0	−8.0 × 1.0 (ReLU open)
$\partial L / \partial w_1$	−24.0	−8.0 × x = −8.0 × 3.0
$\partial L / \partial b_1$	−8.0	−8.0 × 1.0

Notice that $|\partial L / \partial w_1| = 24$ is much larger than $|\partial L / \partial w_2| = 6$. The gradient grew as it passed through layer 2's weight ($w_2 = 2$) and the input ($x = 3$). In very deep networks, this repeated multiplication can cause gradients to either explode (if the factors are consistently larger than 1 in magnitude) or vanish (if they are consistently smaller than 1), two fundamental challenges in deep learning.
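
A rough feel for exploding and vanishing gradients comes from stacking identical layers. The sketch below makes a strong simplifying assumption: each layer is just `y = weight * x` with no bias and no activation, so every layer contributes the same factor to the chain rule product. The function name is hypothetical.

```python
def input_gradient_through_chain(weight, depth):
    """d(output)/d(input) for `depth` stacked layers of the form y = weight * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight  # each layer contributes one factor dy/dx = weight
    return grad


for weight in (0.5, 1.0, 2.0):
    grads = [input_gradient_through_chain(weight, depth) for depth in (1, 10, 50)]
    print(f"weight={weight}: depth 1, 10, 50 -> {grads}")
```

With a factor of 2 per layer the gradient is 1024 after 10 layers, and with a factor of 0.5 it is 1/1024; only a factor of exactly 1 keeps it constant. Real networks mix weights, activations, and many paths, so this is a caricature of the effect rather than a prediction.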

Backpropagation Through Two Layers

🐍two_layer_backprop.py

1import numpy as np

NumPy imported for consistency.

EXECUTION STATE

📚 numpy = Standard numerical computing import.

3# Two-layer network:

A two-layer network demonstrates the key insight of backpropagation: the chain rule propagates gradients backward through EVERY layer. The gradient at layer 1 depends on the gradient at layer 2, multiplied by the connection between them. This is how deep networks learn — each layer gets a gradient signal that tells it how to adjust its weights.

4# Layer 1: z1 = w1*x + b1, h = ReLU(z1)

Layer 1 takes the input x, applies a linear transform, then ReLU activation.

5# Layer 2: y = w2*h + b2

Layer 2 takes layer 1’s output h, applies another linear transform to produce the final prediction y.

6# Loss: L = (y - target)**2

MSE loss comparing prediction y to target.

8x = 3.0 — Input

The input to the two-layer network.

EXECUTION STATE

x = 3.0 = Input feature value.

9w1, b1 = 0.5, 0.0 — Layer 1 parameters

Weight and bias for the first layer.

EXECUTION STATE

w1 = 0.5 = Layer 1 weight. Controls how much the input is amplified.

b1 = 0.0 = Layer 1 bias. No shift in this case.

10w2, b2 = 2.0, -1.0 — Layer 2 parameters

Weight and bias for the second layer.

EXECUTION STATE

w2 = 2.0 = Layer 2 weight. This will appear in the gradient chain when propagating back to layer 1.

b2 = -1.0 = Layer 2 bias.

11target = 4.0

The desired output for this training example.

EXECUTION STATE

target = 4.0 = We want the network to output 4.0. Currently it outputs 2.0, so loss = 4.0.

13# === Forward Pass ===

Data flows forward through both layers: x → z1 → h → z2 → y → loss.

14z1 = w1 * x + b1 — Layer 1 linear

First linear transformation.

EXECUTION STATE

z1 = 0.5 × 3.0 + 0.0 = 1.5 = Pre-activation for layer 1. Positive, so ReLU will pass it through.

15h = max(0.0, z1) — Layer 1 ReLU

ReLU activation. Since z1=1.5 > 0, the output is 1.5.

EXECUTION STATE

h = max(0, 1.5) = 1.5 = The hidden layer output. This is what layer 2 sees as input.

16z2 = w2 * h + b2 — Layer 2 linear

Second linear transformation, using layer 1’s output as input.

EXECUTION STATE

z2 = 2.0 × 1.5 + (-1.0) = 2.0 = The network’s final output (no activation on the output layer for regression).

17y = z2 — Final prediction

For regression, the output layer has no activation — the linear output is the prediction directly.

EXECUTION STATE

y = 2.0 = The network predicts 2.0, but the target is 4.0. Loss = (2-4)² = 4.0.

18loss = (y - target) ** 2 — MSE loss

Squared error between prediction and target.

EXECUTION STATE

loss = (2.0 - 4.0)² = 4.0 = The error is -2.0, squared gives 4.0. All 4 parameters need updating.

20print("=== Forward Pass ===")

Header for the forward pass output section.

EXECUTION STATE

output = === Forward Pass ===

21print(f"z1 = {w1}*{x} + {b1} = {z1}")

Displays layer 1 linear output.

EXECUTION STATE

output = z1 = 0.5*3.0 + 0.0 = 1.5

22print(f"h = ReLU({z1}) = {h}")

Displays the hidden activation.

EXECUTION STATE

output = h = ReLU(1.5) = 1.5

23print(f"z2 = {w2}*{h} + {b2} = {z2}")

Displays layer 2 linear output.

EXECUTION STATE

output = z2 = 2.0*1.5 + -1.0 = 2.0

24print(f"y = {y}")

Displays the final prediction.

EXECUTION STATE

output = y = 2.0

25print(f"loss = ({y} - {target})^2 = {loss}")

Displays the loss.

EXECUTION STATE

output = loss = (2.0 - 4.0)^2 = 4.0

27# === Backward Pass ===

Gradients flow backward: loss → y → z2 → h → z1 → (w1, b1). At each step, we multiply the upstream gradient by the local derivative.

28print("=== Backward Pass ===")

Header for backward pass.

EXECUTION STATE

output = === Backward Pass ===

30# Start from the loss

The backward pass always starts from the loss. The initial gradient is dloss/dloss = 1.0 (implicitly). The first computation is dloss/dy.

31dloss_dy = 2 * (y - target) — Starting gradient

d/dy[(y-t)²] = 2(y-t). This is the starting point for backpropagation.

EXECUTION STATE

dloss_dy = 2×(2.0-4.0) = -4.0 = The gradient at the output. Negative = increase y to decrease loss. Magnitude 4.0 = the error is significant.

32print(f"dloss/dy = 2*({y}-{target}) = {dloss_dy}")

Displays the output gradient.

EXECUTION STATE

output = dloss/dy = 2*(2.0-4.0) = -4.0

34# Layer 2 gradients

Since y = w2·h + b2, we compute dloss/dw2 and dloss/db2 using the chain rule.

35dloss_dw2 = dloss_dy * h — Layer 2 weight gradient

dy/dw2 = h (the input to layer 2). So dloss/dw2 = dloss/dy × h.

EXECUTION STATE

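dloss_dw2 = -4.0 × 1.5 = -6.0 = Layer 2’s weight gradient. Negative = increase w2 to reduce loss. It is scaled only by h, the input to layer 2. In this example it is smaller in magnitude than dloss/dw1 (-24.0), which picks up the extra factors w2 and x on its way back.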

36dloss_db2 = dloss_dy * 1.0 — Layer 2 bias gradient

dy/db2 = 1 (bias derivative is always 1). So dloss/db2 = dloss/dy directly.

EXECUTION STATE

dloss_db2 = -4.0 = Bias gradient equals the upstream gradient unchanged.

37print(f"dloss/dw2 = dloss/dy * h = ...")

Displays layer 2 weight gradient.

EXECUTION STATE

output = dloss/dw2 = dloss/dy * h = -4.0 * 1.5 = -6.0

38print(f"dloss/db2 = {dloss_db2}")

Displays layer 2 bias gradient.

EXECUTION STATE

output = dloss/db2 = -4.0

40# Pass gradient through to layer 1

To compute layer 1’s gradients, we need to propagate the gradient through layer 2’s weight. This is the key step that connects layers: dy/dh = w2.

41dloss_dh = dloss_dy * w2 — Gradient flowing to layer 1

Since y = w2·h + b2, we have dy/dh = w2. The gradient is scaled by the weight w2 as it passes backward. This is why weight magnitude matters: large weights amplify gradients, small weights diminish them.

EXECUTION STATE

dloss_dh = -4.0 × 2.0 = -8.0 = The gradient arriving at layer 1’s output. Its magnitude doubled (from 4 to 8) because w2=2.0 scaled it. Repeated across many layers whose weights exceed 1 in magnitude, this scaling is what produces exploding gradients.

42print(f"dloss/dh = dloss/dy * w2 = ...")

Displays the gradient flowing backward to layer 1.

EXECUTION STATE

output = dloss/dh = dloss/dy * w2 = -4.0 * 2.0 = -8.0

44# Through ReLU

The gradient must pass through ReLU. Since z1=1.5 > 0, the gradient passes unchanged.

45dh_dz1 = 1.0 if z1 > 0 else 0.0 — ReLU gate

ReLU either passes the gradient (z1 > 0) or blocks it entirely (z1 ≤ 0).

EXECUTION STATE

dh_dz1 = 1.0 = z1=1.5 > 0 → gradient flows. If z1 were zero or negative, ALL layer 1 gradients would be zero for this input. A neuron that stays in that state for every input stops learning, which is the dying ReLU problem.

46dloss_dz1 = dloss_dh * dh_dz1 — Pre-activation gradient

Chain rule through ReLU.

EXECUTION STATE

dloss_dz1 = -8.0 × 1.0 = -8.0 = Gradient at layer 1’s pre-activation. Will be used to compute w1 and b1 gradients.

47print(f"dh/dz1 = {dh_dz1} (ReLU derivative)")

Displays the ReLU derivative.

EXECUTION STATE

output = dh/dz1 = 1.0 (ReLU derivative)

48print(f"dloss/dz1 = {dloss_dh} * {dh_dz1} = {dloss_dz1}")

Displays the chained gradient at z1.

EXECUTION STATE

output = dloss/dz1 = -8.0 * 1.0 = -8.0
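To see the gate close, the sketch below reruns the same forward and backward pass with a hypothetical weight w1 = -0.5. That value is an assumption chosen only to make z1 negative; every other number is unchanged from the example.

```python
# Same two-layer network, but with a hypothetical w1 = -0.5 so that z1 < 0.
x, target = 3.0, 4.0
w1, b1 = -0.5, 0.0   # assumed value, chosen to switch the ReLU off
w2, b2 = 2.0, -1.0

# Forward pass
z1 = w1 * x + b1          # -1.5
h = max(0.0, z1)          # 0.0, ReLU clips the negative input
y = w2 * h + b2           # -1.0
loss = (y - target) ** 2  # 25.0

# Backward pass
dloss_dy = 2 * (y - target)       # -10.0
dloss_dh = dloss_dy * w2          # -20.0 arrives at the ReLU output
dh_dz1 = 1.0 if z1 > 0 else 0.0   # 0.0, the gate is closed
dloss_dz1 = dloss_dh * dh_dz1     # nothing passes through
dloss_dw1 = dloss_dz1 * x
dloss_db1 = dloss_dz1 * 1.0

print(f"z1 = {z1}, h = {h}, loss = {loss}")
print(f"dloss/dh  = {dloss_dh}")
print(f"dloss/dw1 = {dloss_dw1}, dloss/db1 = {dloss_db1}")
```

A gradient of -20.0 reaches the ReLU, yet both layer 1 gradients come out as zero, so w1 and b1 receive no update from this input even though the loss is large.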

50# Layer 1 gradients

Finally, compute layer 1’s parameter gradients using the accumulated gradient.

51dloss_dw1 = dloss_dz1 * x — Layer 1 weight gradient

Since z1 = w1·x + b1, dz1/dw1 = x. The chain rule gives dloss/dw1 = dloss/dz1 × x.

EXECUTION STATE

dloss_dw1 = -8.0 × 3.0 = -24.0 = The LARGEST gradient! Layer 1’s weight gradient is amplified by both w2 (layer 2’s weight) AND x (the input). This shows how local derivatives multiply along the chain: (-4) × 2 × 1 × 3 = -24.

52dloss_db1 = dloss_dz1 * 1.0 — Layer 1 bias gradient

dz1/db1 = 1, so the bias gradient equals the upstream gradient.

EXECUTION STATE

dloss_db1 = -8.0 = Layer 1 bias gradient. Same magnitude as dloss/dz1.

53print(f"dloss/dw1 = dloss/dz1 * x = ...")

Displays layer 1 weight gradient.

EXECUTION STATE

output = dloss/dw1 = dloss/dz1 * x = -8.0 * 3.0 = -24.0

54print(f"dloss/db1 = {dloss_db1}")

Displays layer 1 bias gradient.

EXECUTION STATE

output = dloss/db1 = -8.0
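With all four gradients in hand, the update step (each parameter moves by -lr × gradient) can be sketched as follows. The learning rate lr = 0.01 and the helper name forward_loss are assumptions made for illustration; they do not come from the walkthrough.

```python
def forward_loss(w1, b1, w2, b2, x=3.0, target=4.0):
    """Forward pass of the two-layer network; returns the squared error."""
    h = max(0.0, w1 * x + b1)
    y = w2 * h + b2
    return (y - target) ** 2

params = {"w1": 0.5, "b1": 0.0, "w2": 2.0, "b2": -1.0}
grads = {"w1": -24.0, "b1": -8.0, "w2": -6.0, "b2": -4.0}  # from the backward pass
lr = 0.01  # assumed learning rate

loss_before = forward_loss(**params)

# Gradient descent: step each parameter against its gradient
updated = {name: value - lr * grads[name] for name, value in params.items()}
loss_after = forward_loss(**updated)

print(f"loss before: {loss_before:.4f}")
print(f"loss after:  {loss_after:.4f}")
```

Under these assumptions a single step lowers the loss from 4.0 to roughly 0.049. All four gradients are negative, so every parameter increases.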


import numpy as np

# Two-layer network:
# Layer 1: z1 = w1*x + b1, h = ReLU(z1)
# Layer 2: y = w2*h + b2
# Loss: L = (y - target)**2

x = 3.0
w1, b1 = 0.5, 0.0     # layer 1 parameters
w2, b2 = 2.0, -1.0     # layer 2 parameters
target = 4.0

# === Forward Pass ===
z1 = w1 * x + b1
h = max(0.0, z1)
z2 = w2 * h + b2
y = z2
loss = (y - target) ** 2

print("=== Forward Pass ===")
print(f"z1 = {w1}*{x} + {b1} = {z1}")
print(f"h  = ReLU({z1}) = {h}")
print(f"z2 = {w2}*{h} + {b2} = {z2}")
print(f"y  = {y}")
print(f"loss = ({y} - {target})^2 = {loss}")

# === Backward Pass ===
print("\n=== Backward Pass ===")

# Start from the loss
dloss_dy = 2 * (y - target)
print(f"dloss/dy = 2*({y}-{target}) = {dloss_dy}")

# Layer 2 gradients
dloss_dw2 = dloss_dy * h
dloss_db2 = dloss_dy * 1.0
print(f"dloss/dw2 = dloss/dy * h = {dloss_dy} * {h} = {dloss_dw2}")
print(f"dloss/db2 = {dloss_db2}")

# Pass gradient through to layer 1
dloss_dh = dloss_dy * w2
print(f"\ndloss/dh = dloss/dy * w2 = {dloss_dy} * {w2} = {dloss_dh}")

# Through ReLU
dh_dz1 = 1.0 if z1 > 0 else 0.0
dloss_dz1 = dloss_dh * dh_dz1
print(f"dh/dz1 = {dh_dz1} (ReLU derivative)")
print(f"dloss/dz1 = {dloss_dh} * {dh_dz1} = {dloss_dz1}")

# Layer 1 gradients
dloss_dw1 = dloss_dz1 * x
dloss_db1 = dloss_dz1 * 1.0
print(f"\ndloss/dw1 = dloss/dz1 * x = {dloss_dz1} * {x} = {dloss_dw1}")
print(f"dloss/db1 = {dloss_db1}")
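One way to sanity-check hand-derived gradients without any framework is a finite-difference test: nudge each parameter slightly and measure how the loss changes. The sketch below is an illustrative check, not part of the original listing; the helper name loss_fn and the step size eps = 1e-5 are assumptions.

```python
def loss_fn(w1, b1, w2, b2, x=3.0, target=4.0):
    """Forward pass of the two-layer network; returns the squared error."""
    h = max(0.0, w1 * x + b1)
    y = w2 * h + b2
    return (y - target) ** 2

names = ["w1", "b1", "w2", "b2"]
params = [0.5, 0.0, 2.0, -1.0]
manual = [-24.0, -8.0, -6.0, -4.0]  # gradients from the chain rule walkthrough

eps = 1e-5  # assumed step size
numeric = []
for i in range(len(params)):
    plus, minus = list(params), list(params)
    plus[i] += eps
    minus[i] -= eps
    # Central difference: (L(p + eps) - L(p - eps)) / (2 * eps)
    numeric.append((loss_fn(*plus) - loss_fn(*minus)) / (2 * eps))

for name, m, n in zip(names, manual, numeric):
    print(f"dloss/d{name}: manual = {m:.4f}, numeric = {n:.4f}")
```

The numeric estimates agree with the manual values to within floating-point error. The check needs two forward passes per parameter, which is why it serves as a debugging tool and not as a training method.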

PyTorch Autograd: Automatic Chain Rule

In practice, you rarely implement backpropagation by hand. PyTorch's autograd system applies the chain rule automatically: it records the computation graph during the forward pass and traverses it in reverse when you call $\texttt{.backward()}$ .
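The recorded graph can be inspected through the grad_fn attribute of a result tensor and the next_functions attribute of each graph node. The sketch below walks that structure for the two-layer network used in this section. The helper walk is hypothetical, and the exact node names it prints may differ between PyTorch versions.

```python
import torch

x = torch.tensor(3.0)
w1 = torch.tensor(0.5, requires_grad=True)
b1 = torch.tensor(0.0, requires_grad=True)
w2 = torch.tensor(2.0, requires_grad=True)
b2 = torch.tensor(-1.0, requires_grad=True)
target = torch.tensor(4.0)

loss = (w2 * torch.relu(w1 * x + b1) + b2 - target) ** 2

def walk(node, depth=0, names=None):
    """Depth-first walk over the recorded graph, collecting node type names."""
    if names is None:
        names = []
    if node is None:  # inputs that do not require gradients have no node
        return names
    names.append(type(node).__name__)
    print("  " * depth + type(node).__name__)
    for parent, _ in node.next_functions:
        walk(parent, depth + 1, names)
    return names

node_names = walk(loss.grad_fn)
```

The walk starts at the node for the squaring operation and ends at AccumulateGrad nodes, one per parameter created with requires_grad=True. Those are the places where .backward() deposits the final gradients.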

The workflow is beautifully simple:

Create parameters with $\texttt{requires\_grad=True}$ — this tells PyTorch to track operations.
Forward pass: Write the computation normally. PyTorch silently builds the computation graph behind the scenes.
Call $\texttt{loss.backward()}$ : PyTorch traverses the graph in reverse, applying the chain rule at every node. All gradients appear in $\texttt{param.grad}$ .
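As a minimal sketch of these three steps, consider the function f(w) = (3w + 1)² at w = 1. The function and the value of w are chosen purely for illustration.

```python
import torch

# Step 1: a parameter that PyTorch should track
w = torch.tensor(1.0, requires_grad=True)

# Step 2: ordinary forward computation, f(w) = (3w + 1)^2
f = (3 * w + 1) ** 2

# Step 3: the reverse traversal fills w.grad with df/dw
f.backward()

# By hand: df/dw = 2 * (3w + 1) * 3 = 24 at w = 1
print(w.grad.item())
```

The printed gradient matches the hand calculation: outer derivative 2(3w + 1) = 8 times inner derivative 3.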

Let's verify that autograd produces exactly the same gradients as our manual chain rule calculation:

PyTorch Autograd — Automatic Chain Rule

🐍pytorch_autograd.py


1import torch

PyTorch’s autograd system implements the chain rule automatically. You define the forward pass, and PyTorch handles the entire backward pass — computing gradients for every parameter with a single call to .backward().

EXECUTION STATE

📚 torch = PyTorch’s core library. The autograd engine records operations during the forward pass and applies the chain rule during .backward() to compute all gradients.

3# Same two-layer computation as above

We reproduce the exact same network from our manual chain rule calculation. The gradients should match exactly, verifying our hand calculations.

4x = torch.tensor(3.0)

Input as a PyTorch tensor. No requires_grad needed for x — we don’t optimize the input.

EXECUTION STATE

x = tensor(3.0) = The input. No gradient tracking needed — we optimize weights, not inputs.

5w1 = torch.tensor(0.5, requires_grad=True)

Layer 1 weight with gradient tracking enabled. PyTorch will record every operation involving this tensor.

EXECUTION STATE

📚 requires_grad=True = Tells PyTorch to track operations on this tensor for gradient computation. Every operation creates a node in the computation graph.

w1 = tensor(0.5, requires_grad=True) = Layer 1 weight. PyTorch will compute dloss/dw1 during .backward().

6b1 = torch.tensor(0.0, requires_grad=True)

Layer 1 bias with gradient tracking.

EXECUTION STATE

b1 = tensor(0.0, requires_grad=True) = Layer 1 bias. Gradient tracking enabled.

7w2 = torch.tensor(2.0, requires_grad=True)

Layer 2 weight with gradient tracking.

EXECUTION STATE

w2 = tensor(2.0, requires_grad=True) = Layer 2 weight.

8b2 = torch.tensor(-1.0, requires_grad=True)

Layer 2 bias with gradient tracking.

EXECUTION STATE

b2 = tensor(-1.0, requires_grad=True) = Layer 2 bias.

9target = torch.tensor(4.0)

Target value. No gradient needed for the target.

EXECUTION STATE

target = tensor(4.0) = Ground truth label. Not a learnable parameter.

11# Forward pass — PyTorch records the computation graph

As we perform operations, PyTorch silently builds a directed acyclic graph (DAG) connecting inputs to outputs. Each operation (multiply, add, relu, square) becomes a node. This graph is later traversed in reverse to compute gradients.

12z1 = w1 * x + b1

Layer 1 linear transformation. PyTorch records: MulBackward(w1, x) → AddBackward(result, b1).

EXECUTION STATE

z1 = tensor(1.5) = 0.5 × 3.0 + 0.0 = 1.5. Same as our manual calculation. Has grad_fn attached (points to the computation graph).

13h = torch.relu(z1)

ReLU activation. PyTorch records: ReluBackward(z1). During backward, it will apply the ReLU derivative (1 if z1 > 0, else 0).

EXECUTION STATE

📚 torch.relu() = ReLU activation function. relu(x) = max(0, x). PyTorch automatically computes its derivative during .backward().

h = tensor(1.5) = relu(1.5) = 1.5 (z1 > 0, passes through).

14z2 = w2 * h + b2

Layer 2 linear transformation.

EXECUTION STATE

z2 = tensor(2.0) = 2.0 × 1.5 + (-1.0) = 2.0. Same as manual.

15y = z2

Final prediction (identity activation for regression).

EXECUTION STATE

y = tensor(2.0) = The network’s prediction. Target is 4.0.

16loss = (y - target) ** 2

MSE loss. This is the root of the computation graph — backward will start here.

EXECUTION STATE

loss = tensor(4.0, grad_fn=<PowBackward0>) = (2.0 - 4.0)² = 4.0. The grad_fn tells us PyTorch recorded the power operation.

18print(f"z1 = {z1.item():.4f}")

Displays layer 1 output.

EXECUTION STATE

output = z1 = 1.5000

19print(f"h = {h.item():.4f}")

Displays hidden activation.

EXECUTION STATE

output = h = 1.5000

20print(f"y = {y.item():.4f}")

Displays prediction.

EXECUTION STATE

output = y = 2.0000

21print(f"loss = {loss.item():.4f}")

Displays loss.

EXECUTION STATE

output = loss = 4.0000

23# Backward pass — one line computes ALL gradients!

This is the magic of autograd. One call to .backward() traverses the entire computation graph in reverse, applying the chain rule at every node. In a real network with millions of parameters, this single call computes all gradients.

24loss.backward() — Automatic chain rule!

Triggers reverse-mode automatic differentiation. PyTorch walks backward through the graph: PowBackward → SubBackward → AddBackward → MulBackward → ReluBackward → AddBackward → MulBackward. At each node, it computes the local derivative and multiplies by the upstream gradient — exactly the chain rule we did manually.

EXECUTION STATE

📚 .backward() = Computes gradients of the loss with respect to every leaf tensor in the graph that has requires_grad=True. Each gradient is added to the .grad attribute of its tensor, so repeated calls accumulate unless .grad is reset. This is the entire backpropagation algorithm in one line.

→ internally = 1. dloss/dy = 2(y-t) = -4.0 2. dloss/dw2 = -4.0 × h = -6.0 3. dloss/db2 = -4.0 4. dloss/dh = -4.0 × w2 = -8.0 5. dloss/dz1 = -8.0 × relu’(z1) = -8.0 6. dloss/dw1 = -8.0 × x = -24.0 7. dloss/db1 = -8.0
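Because gradients are accumulated in .grad and not overwritten, a second backward pass adds to whatever is already stored. The small sketch below uses a single illustrative parameter w and the function w², neither of which belongs to the network above, to show the effect and one way to reset it.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).backward()
first = w.grad.item()

# Second backward pass without resetting: the new 4 is added to the stored 4
(w ** 2).backward()
second = w.grad.item()

# Zeroing .grad in place before the next pass restores the plain gradient
w.grad.zero_()
(w ** 2).backward()
third = w.grad.item()

print(first, second, third)
```

This is why training loops clear the gradients before each backward pass, for example through an optimizer's zero_grad() method.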

26print(f"--- Gradients (PyTorch autograd) ---")

Header for autograd gradient output.

EXECUTION STATE

output = --- Gradients (PyTorch autograd) ---

27print(f"dloss/dw1 = {w1.grad.item():.4f}")

The gradient for w1, computed automatically by autograd.

EXECUTION STATE

w1.grad = -24.0 = Matches our manual calculation! ✓ autograd applied the chain rule: dloss/dy × dy/dh × dh/dz1 × dz1/dw1 = (-4) × 2 × 1 × 3 = -24.

28print(f"dloss/db1 = {b1.grad.item():.4f}")

Gradient for b1.

EXECUTION STATE

b1.grad = -8.0 = Matches manual! ✓

29print(f"dloss/dw2 = {w2.grad.item():.4f}")

Gradient for w2.

EXECUTION STATE

w2.grad = -6.0 = Matches manual! ✓

30print(f"dloss/db2 = {b2.grad.item():.4f}")

Gradient for b2.

EXECUTION STATE

b2.grad = -4.0 = Matches manual! ✓ All four gradients computed by autograd match our hand calculations exactly.

32# Compare with our manual calculations

Side-by-side comparison to confirm autograd matches our manual chain rule. This is the ultimate verification: if our understanding of the chain rule is correct, the numbers must match.

33print(f"--- Manual values ---")

Header for manual values comparison.

EXECUTION STATE

output = --- Manual values ---

34print(f"dloss/dw1 = -24.0")

Manual value for comparison.

EXECUTION STATE

autograd: -24.0 vs manual: -24.0 = ✓ Match! The chain rule derivation is verified.

35print(f"dloss/db1 = -8.0")

Manual bias gradient.

EXECUTION STATE

autograd: -8.0 vs manual: -8.0 = ✓ Match!

36print(f"dloss/dw2 = -6.0")

Manual layer 2 weight gradient.

EXECUTION STATE

autograd: -6.0 vs manual: -6.0 = ✓ Match!

37print(f"dloss/db2 = -4.0")

Manual layer 2 bias gradient.

EXECUTION STATE

autograd: -4.0 vs manual: -4.0 = ✓ All four gradients match! PyTorch’s autograd correctly applied the chain rule through both layers.


import torch

# Same two-layer computation as above, using PyTorch autograd
x = torch.tensor(3.0)
w1 = torch.tensor(0.5, requires_grad=True)
b1 = torch.tensor(0.0, requires_grad=True)
w2 = torch.tensor(2.0, requires_grad=True)
b2 = torch.tensor(-1.0, requires_grad=True)
target = torch.tensor(4.0)

# Forward pass: PyTorch records the computation graph
z1 = w1 * x + b1
h = torch.relu(z1)
z2 = w2 * h + b2
y = z2
loss = (y - target) ** 2

print(f"z1 = {z1.item():.4f}")
print(f"h  = {h.item():.4f}")
print(f"y  = {y.item():.4f}")
print(f"loss = {loss.item():.4f}")

# Backward pass: one line computes ALL gradients!
loss.backward()

print(f"\n--- Gradients (PyTorch autograd) ---")
print(f"dloss/dw1 = {w1.grad.item():.4f}")
print(f"dloss/db1 = {b1.grad.item():.4f}")
print(f"dloss/dw2 = {w2.grad.item():.4f}")
print(f"dloss/db2 = {b2.grad.item():.4f}")

# Compare with our manual calculations
print(f"\n--- Manual values ---")
print(f"dloss/dw1 = -24.0")
print(f"dloss/db1 = -8.0")
print(f"dloss/dw2 = -6.0")
print(f"dloss/db2 = -4.0")
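The listing prints the manual values as hard-coded strings. A variant that recomputes the manual chain rule from the same forward values and compares the two programmatically might look like the sketch below. The dictionary names manual and autograd and the flag all_match are illustrative choices.

```python
import torch

x, target = torch.tensor(3.0), torch.tensor(4.0)
w1 = torch.tensor(0.5, requires_grad=True)
b1 = torch.tensor(0.0, requires_grad=True)
w2 = torch.tensor(2.0, requires_grad=True)
b2 = torch.tensor(-1.0, requires_grad=True)

# Forward and backward pass with autograd
z1 = w1 * x + b1
h = torch.relu(z1)
y = w2 * h + b2
loss = (y - target) ** 2
loss.backward()

# Manual chain rule on the same forward values, outside the graph
with torch.no_grad():
    dloss_dy = 2 * (y - target)
    dloss_dz1 = dloss_dy * w2 * (z1 > 0).float()  # ReLU gate as 0 or 1
    manual = {
        "w1": dloss_dz1 * x,
        "b1": dloss_dz1,
        "w2": dloss_dy * h,
        "b2": dloss_dy,
    }

autograd = {"w1": w1.grad, "b1": b1.grad, "w2": w2.grad, "b2": b2.grad}
all_match = all(torch.allclose(autograd[n], manual[n]) for n in manual)

for name in manual:
    print(f"dloss/d{name}: autograd = {autograd[name].item():.4f}, "
          f"manual = {manual[name].item():.4f}")
print(f"all match: {all_match}")
```

Comparing with torch.allclose instead of exact equality tolerates the small rounding differences that appear in larger networks.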

The Takeaway: PyTorch’s autograd is not magic — it is the chain rule, implemented as a graph traversal algorithm. Every time you call $\texttt{loss.backward()}$ , PyTorch walks through the computational graph in reverse order, multiplying local derivatives at each node. This is exactly what we did by hand, but automated for networks with millions of parameters.
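To make the graph-traversal idea concrete, here is a deliberately tiny reverse-mode differentiator in plain Python. It is an illustrative sketch and not how PyTorch is implemented: the class name Value and its methods are hypothetical, and it handles only scalars and the handful of operations this section's network needs.

```python
class Value:
    """A scalar that remembers how it was computed (illustrative only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def relu(self):
        gate = 1.0 if self.data > 0 else 0.0
        return Value(max(0.0, self.data), (self,), (gate,))

    def square(self):
        return Value(self.data ** 2, (self,), (2 * self.data,))

    def backward(self):
        # Post-order DFS; reversed, each node comes before the nodes it was built from
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dloss/dloss
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                # Chain rule: upstream gradient times local derivative
                parent.grad += node.grad * local


x, target = Value(3.0), Value(4.0)
w1, b1, w2, b2 = Value(0.5), Value(0.0), Value(2.0), Value(-1.0)

loss = ((w2 * (w1 * x + b1).relu() + b2) - target).square()
loss.backward()

print(w1.grad, b1.grad, w2.grad, b2.grad)
```

Running it reproduces the four gradients from the manual calculation (-24.0, -8.0, -6.0, -4.0). The `+=` in the inner loop is what sums contributions when a value feeds more than one downstream node.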

Summary and What's Next

The chain rule is the mathematical engine that powers neural network training. Here's what we covered:

The chain rule says the derivative of a composition $h(x) = f(g(x))$ is the product of the individual derivatives: $\frac{dh}{dx} = \frac{df}{du} \cdot \frac{dg}{dx}$ , where $u = g(x)$ .
Computational graphs visualize the chain rule as forward-flowing values and backward-flowing gradients through nodes and edges.
Through a single neuron, the chain rule produces three factors: loss derivative × activation derivative × linear derivative.
Through multiple layers, the gradient accumulates as a product of all local derivatives along the path, which can lead to exploding or vanishing gradients.
The multivariable chain rule sums gradient contributions across multiple paths when a variable affects the loss through multiple downstream neurons.
PyTorch autograd implements the chain rule automatically. You define the forward pass; $\texttt{.backward()}$ handles the rest.
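The point about exploding and vanishing gradients can be illustrated with a toy model: a chain of identical scalar linear layers with no activations, so that every local derivative equals the same weight. Both the model and the helper name gradient_magnitude are simplifying assumptions for illustration.

```python
def gradient_magnitude(weight, depth):
    """|dy/dx| for y = weight * weight * ... * x, a chain of `depth` linear layers."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight  # each layer contributes one local derivative
    return abs(grad)

for weight in (0.5, 1.0, 2.0):
    print(f"weight = {weight}: "
          f"depth 10 -> {gradient_magnitude(weight, 10):.6g}, "
          f"depth 50 -> {gradient_magnitude(weight, 50):.6g}")
```

With a weight of 0.5 the gradient shrinks to about 0.001 after 10 layers, and with a weight of 2.0 it grows to 1024. Only a weight of exactly 1.0 keeps it constant.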

Chapter Complete! With vectors (Section 1), derivatives (Section 2), probability (Section 3), and the chain rule (this section), you now have all the mathematical foundations needed for neural networks. In the next chapter, we build on these foundations to construct the perceptron — the simplest neural network — and see these mathematical tools in action.

Concept	Formula	Role in Neural Networks
Chain rule (single)	$\frac{dh}{dx} = \frac{df}{du} \cdot \frac{dg}{dx}$	Differentiate through composed layers
Chain rule (multi-path)	$\frac{\partial L}{\partial h} = \sum_{j} \frac{\partial L}{\partial z_j} \cdot \frac{\partial z_j}{\partial h}$	Sum gradients from multiple outputs
ReLU derivative	$\begin{cases} 1 & z > 0 \\ 0 & z \leq 0 \end{cases}$	Gate that passes or blocks gradient
Linear derivative	$\frac{\partial z}{\partial w} = x, \; \frac{\partial z}{\partial b} = 1$	Input scales weight gradient
Backpropagation	Repeated chain rule, layer by layer	Computes all gradients in one pass
PyTorch autograd	$\texttt{loss.backward()}$	Automatic chain rule via graph traversal
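The multi-path row of the table can be checked on a small example in which one hidden value h feeds two downstream units. The network shape, the numbers, and the names z_a, z_b and loss_fn below are all assumptions chosen for illustration.

```python
def loss_fn(w, x=2.0, a=0.5, b=-2.0):
    """h = w*x feeds two units z_a = a*h and z_b = b*h; loss = z_a^2 + z_b^2."""
    h = w * x
    z_a, z_b = a * h, b * h
    return z_a ** 2 + z_b ** 2

w, x, a, b = 1.5, 2.0, 0.5, -2.0
h = w * x
z_a, z_b = a * h, b * h

# Local pieces of the chain rule
dL_dza, dL_dzb = 2 * z_a, 2 * z_b
# Multi-path rule: sum the contribution of every path from h to the loss
dL_dh = dL_dza * a + dL_dzb * b
dL_dw = dL_dh * x

# Independent finite-difference estimate of dL/dw
eps = 1e-6
numeric = (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

print(f"dL/dh = {dL_dh}")
print(f"dL/dw = {dL_dw} (finite difference: {numeric:.4f})")
```

Dropping either term from the sum gives 1.5 or 24.0 for dL/dh, where the correct value is 25.5, and the finite-difference estimate would no longer agree.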