In Section 2, we learned how derivatives measure instantaneous change and how gradients point in the direction of steepest ascent. In Section 3, we saw that cross-entropy loss measures how well a network's predictions match the true labels. But there's a critical gap: how does the network compute the derivative of the loss with respect to a weight buried deep inside the network?
Consider a 10-layer neural network. The loss depends on the output of layer 10, which depends on layer 9, which depends on layer 8, and so on down to layer 1. To compute $\frac{\partial L}{\partial w_1}$ (how a weight in layer 1 affects the final loss), we need to trace the influence through all 10 layers. The chain rule is the mathematical tool that lets us do exactly this.
The Big Idea: if y depends on u, and u depends on x, then the rate at which y changes with x is the product of the individual rates: $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$. Backpropagation is nothing more than the chain rule applied systematically to every layer of a neural network.
Here is the chain rule in the context of neural network training:
| Step | What Happens | Direction |
| --- | --- | --- |
| Forward pass | Input flows through layers to produce output and loss | Left → Right |
| Compute loss | Compare prediction with true label | At the end |
| Backward pass | Chain rule propagates gradients from loss back through every layer | Right → Left |
| Update weights | Each weight adjusts by -lr × gradient | At every layer |
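The four steps in the table can be traced on a single weight. This is a minimal sketch: the one-weight loss `(w*x - t)**2`, the starting weight, and the learning rate `lr = 0.1` are assumptions chosen for illustration, not values used elsewhere in this section.

```python
# One training step on a single weight (all values are illustrative assumptions).
x, t = 2.0, 3.0   # input and target
w = 1.0           # initial weight
lr = 0.1          # learning rate


def loss_fn(weight):
    """Squared error of the prediction weight * x against the target t."""
    return (weight * x - t) ** 2


# Forward pass and loss
loss_before = loss_fn(w)

# Backward pass: dL/dw = 2 * (w*x - t) * x, by the chain rule
grad = 2 * (w * x - t) * x

# Update: move the weight against the gradient
w_new = w - lr * grad
loss_after = loss_fn(w_new)

print(f"grad = {grad}")
print(f"w: {w} -> {w_new}")
print(f"loss: {loss_before} -> {loss_after}")
```

The gradient is negative here, so subtracting `lr * grad` increases the weight and the loss drops.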
The Chain Rule: Derivatives of Composed Functions
When one function is applied after another — h(x)=f(g(x)) — we call this function composition. The chain rule tells us how to differentiate the composed function:
$$\frac{dh}{dx} = \left.\frac{df}{du}\right|_{u=g(x)} \cdot \frac{dg}{dx}$$
In words: the derivative of the composition equals the derivative of the outer function (evaluated at the inner function's output) times the derivative of the inner function. Think of it as a chain of gear ratios: if the inner function amplifies change by 3× and the outer function amplifies by 14×, the total amplification is 3 × 14 = 42×.
Step-by-Step Process
Identify the composition: Express h(x) as f(g(x)). Which is the inner function and which is the outer?
Differentiate each part separately: Find g′(x) (inner derivative) and f′(u) (outer derivative).
Evaluate the outer derivative at the inner function's output: Compute f′(g(x)), not f′(x).
Multiply: h′(x) = f′(g(x)) ⋅ g′(x).
Let's apply this to a concrete example: $h(x) = (3x+1)^2$.
Chain Rule: Basic Example (chain_rule_basic.py)
`import numpy as np`
NumPy is imported for potential array operations, though this example uses plain Python arithmetic to make the chain rule mechanics crystal clear.
EXECUTION STATE
📚 numpy = Not strictly needed here, but imported for consistency. The chain rule concepts apply equally to scalars and arrays.
`# Two simple functions composed: h(x) = f(g(x))`
Function composition is when you feed the output of one function as the input to another. In neural networks, every layer is a function, and the whole network is a composition: layer3(layer2(layer1(input))). The chain rule tells us how to differentiate through this entire chain.
`# g(x) = 3x + 1 (inner function)`
The inner function is applied first. Think of this as one layer of a neural network: a linear transformation w·x + b with w=3, b=1.
`# f(u) = u**2 (outer function)`
The outer function is applied second, to the output of g. Think of this as the loss function: it takes the layer’s output and squares it (like MSE loss).
`# h(x) = (3x + 1)**2`
The full composition. We want dh/dx: how does the final output change when x changes? The chain rule lets us compute this by breaking it into two simpler derivatives.
`def g(x):` (inner function, the “layer”)
A simple linear function. In a neural network, this represents the linear transformation z = w·x + b before the activation function.
EXECUTION STATE
⬇ input: x = The input variable. At x=2.0: g(2.0) = 3×2 + 1 = 7.
⬆ returns = 3x + 1 — a linear function with slope 3 and intercept 1.
`return 3 * x + 1`
The linear computation. Its derivative with respect to x is simply 3 (the coefficient of x).
EXECUTION STATE
3 * x + 1 = At x=2.0: 3×2.0 + 1 = 7.0
derivative dg/dx = d(3x+1)/dx = 3 — constant, independent of x. The slope of a line is always its coefficient.
`def f(u):` (outer function, the “loss”)
A quadratic function. In a neural network, this could represent part of the MSE loss: squaring the difference between prediction and target.
EXECUTION STATE
⬇ input: u = The output of g(x). At u=7.0: f(7.0) = 49.0.
⬆ returns = u² — a quadratic function. Always non-negative.
`return u ** 2`
Squares the input. Its derivative with respect to u is 2u (power rule).
`def h(x):`
The composed function: h(x) = f(g(x)) = (3x + 1)². We want dh/dx. Expanding (3x+1)² by hand works here, but that approach does not carry over to general compositions. The chain rule breaks the derivative into parts we can differentiate separately.
EXECUTION STATE
⬇ input: x = The original input. h(2.0) = f(g(2.0)) = f(7.0) = 49.0.
⬆ returns = (3x + 1)² — the result of applying g then f.
`return f(g(x))`
First computes g(x), then passes the result to f. This is the execution order of a neural network’s forward pass: data flows through layers sequentially.
EXECUTION STATE
g(x) = g(2.0) = 7.0 — the intermediate value (hidden layer output)
f(g(x)) = f(7.0) = 49.0 — the final output (loss value)
⬆ return: 49.0 = h(2.0) = (3×2+1)² = 7² = 49
`# Evaluate at x = 2.0`
We pick x=2.0 as our evaluation point. The chain rule gives us dh/dx at this specific point.
`x = 2.0`
The point where we want to know the derivative.
EXECUTION STATE
x = 2.0 = Our input value. The chain rule will tell us: if x increases from 2.0 by a tiny amount, how much does h(x) change?
`u = g(x)` (compute the intermediate value)
The intermediate value u = g(x) is crucial — we need it to evaluate df/du at the right point.
EXECUTION STATE
u = g(2.0) = 3×2.0 + 1 = 7.0 — this is the value that f sees as input. The chain rule evaluates df/du at THIS point.
`result = h(x)`
The final output of the composed function.
EXECUTION STATE
result = h(2.0) = f(7.0) = 49.0
`print(f"x = {x}")`
Prints the input value.
EXECUTION STATE
output = x = 2.0
`print(f"g(x) = 3*{x} + 1 = {u}")`
Prints the intermediate value.
EXECUTION STATE
output = g(x) = 3*2.0 + 1 = 7.0
`print(f"h(x) = f(g(x)) = {u}**2 = {result}")`
Prints the final result.
EXECUTION STATE
output = h(x) = f(g(x)) = 7.0**2 = 49.0
`# Chain rule: dh/dx = df/du * dg/dx`
THE chain rule formula: the derivative of a composition is the product of the individual derivatives. Think of it as a chain of gears: each gear ratio multiplies the previous one. If g amplifies change by 3× and f amplifies by 14×, then the total amplification is 3×14 = 42×.
`dg_dx = 3.0` (derivative of the inner function)
dg/dx = d(3x+1)/dx = 3. The slope of the linear function g. This tells us: a tiny change in x produces 3× that change in u.
EXECUTION STATE
dg_dx = 3.0 = The inner derivative. For every 1 unit increase in x, g(x) increases by 3 units. This is constant for a linear function.
`df_du = 2 * u` (derivative of the outer function)
df/du = d(u²)/du = 2u. Evaluated at u = g(x) = 7.0, this gives 14.0. This tells us: at u=7, a tiny change in u produces 14× that change in f.
EXECUTION STATE
df_du = 2 * 7.0 = 14.0 = The outer derivative, evaluated at the intermediate point u=7.0. If u were 3, this would be 6. The key insight: we evaluate at u=g(x), not at some other point.
`dh_dx = df_du * dg_dx` (the chain rule multiplication)
The chain rule: multiply the outer derivative by the inner derivative. The total rate of change is the product of the individual rates of change along the chain.
EXECUTION STATE
dh_dx = 14.0 × 3.0 = 42.0 = The derivative of the composed function. If x increases by 0.001, h(x) increases by approximately 0.001 × 42 = 0.042. Think of it as: x changes by ε, u changes by 3ε (via g), and f changes by 14×(3ε) = 42ε (via f).
We verify our analytical chain rule result using numerical differentiation (the difference quotient from Section 2). If the numerical approximation matches our analytical result, we know our chain rule calculation is correct.
`h_val = 1e-7`
A tiny step size for the numerical approximation. Small enough for accuracy, but not so small that floating-point errors dominate.
EXECUTION STATE
h_val = 1e-7 = 0.0000001 — one ten-millionth. The finite difference approximation converges to the true derivative as h→0.
`numerical = (h(x + h_val) - h(x)) / h_val`
The difference quotient: [h(x+ε) - h(x)] / ε. This approximates dh/dx numerically.
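Assembled into one script, the annotated lines above look roughly like this. The function bodies and values come from the walkthrough; the definition of `h` is inferred from `return f(g(x))`, and the two final `print` calls are assumptions added so the script reports the comparison.

```python
def g(x):
    return 3 * x + 1          # inner function


def f(u):
    return u ** 2             # outer function


def h(x):
    return f(g(x))            # composition: h(x) = (3x + 1)**2


x = 2.0
u = g(x)                      # intermediate value seen by f
result = h(x)

# Chain rule: dh/dx = df/du * dg/dx
dg_dx = 3.0                   # slope of the linear inner function
df_du = 2 * u                 # outer derivative, evaluated at u = g(x)
dh_dx = df_du * dg_dx

# Numerical check with a forward difference quotient
h_val = 1e-7
numerical = (h(x + h_val) - h(x)) / h_val

print(f"chain rule: dh/dx = {dh_dx}")
print(f"numerical:  dh/dx = {numerical:.4f}")
```

The two values agree to several decimal places; the small remaining gap is the truncation and rounding error of the finite difference.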
The chain rule extends naturally to longer compositions. For h(x)=f(g(k(x))):
$$\frac{dh}{dx} = \frac{df}{dg} \cdot \frac{dg}{dk} \cdot \frac{dk}{dx}$$
Each additional function in the chain adds one more factor to the product. A 10-layer neural network has 10 (or more) factors in its chain rule product. This is exactly what backpropagation computes — one factor per layer.
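A three-function chain can be checked the same way. In this sketch the particular functions `k`, `g`, and `f` and the evaluation point `x = 1.0` are assumptions picked so the arithmetic is easy to follow by hand.

```python
def k(x):
    return x ** 2             # innermost function


def g(v):
    return 3 * v + 1          # middle function


def f(u):
    return u ** 2             # outermost function


x = 1.0
v = k(x)                      # forward pass, stage 1
u = g(v)                      # forward pass, stage 2

# One local derivative per link in the chain
dk_dx = 2 * x                 # d(x**2)/dx
dg_dk = 3.0                   # d(3v + 1)/dv
df_dg = 2 * u                 # d(u**2)/du, evaluated at u = g(k(x))

dh_dx = df_dg * dg_dk * dk_dx

# Central difference as an independent check
eps = 1e-6
numerical = (f(g(k(x + eps))) - f(g(k(x - eps)))) / (2 * eps)

print(f"chain rule: {dh_dx}")
print(f"numerical:  {numerical:.6f}")
```

Each extra stage contributes exactly one more factor to the product, which is the pattern a deeper network repeats layer by layer.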
Computational Graphs: Visualizing the Chain Rule
A computational graph makes the chain rule visual. Each node represents an operation (add, multiply, ReLU), and edges carry values (forward) or gradients (backward).
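For a simple neuron with $z = w \cdot x + b$, $a = \mathrm{ReLU}(z)$, and $L = (a - t)^2$, the graph is a chain: the inputs $w$, $x$, and $b$ feed the linear node $z$, which feeds the ReLU node $a$, which feeds the loss node $L$ together with the target $t$.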
Forward pass (left to right): Values flow through the graph. Each node computes its output from its inputs. This is what happens when you call model(x) in PyTorch.
Backward pass (right to left): Gradients flow in reverse. At each node, the incoming gradient is multiplied by the local derivative. This is what happens when you call loss.backward().
[Interactive figure: computational graph. Running the forward pass shows values flowing left to right; running the backward pass shows gradients propagating right to left via the chain rule.]
Key Insight: Each edge in the backward pass carries one chain rule factor. The gradient at any node equals the product of all local derivatives along the path from that node to the loss (summed over paths when more than one exists). Backpropagation is just the chain rule applied to a graph, computing all gradients in a single backward traversal.
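To make the graph idea concrete, here is a minimal sketch of a scalar computational graph with a backward traversal. The `Value` class and its method names are hypothetical and written only for this illustration; this is not how PyTorch is implemented, only the same principle at toy scale. The numbers reuse the single-neuron example from this section (x = 2.0, w = 1.5, b = -1.0, target = 3.0).

```python
class Value:
    """A scalar graph node that remembers its inputs and local derivatives."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents            # nodes this value was computed from
        self.local_grads = local_grads    # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def relu(self):
        gate = 1.0 if self.data > 0 else 0.0
        return Value(max(0.0, self.data), (self,), (gate,))

    def square(self):
        return Value(self.data ** 2, (self,), (2 * self.data,))

    def backward(self):
        # Build a topological order, then walk it in reverse so that a node's
        # gradient is complete before it is passed on to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                # Chain rule: upstream gradient times local derivative.
                parent.grad += node.grad * local


x, w, b, t = Value(2.0), Value(1.5), Value(-1.0), Value(3.0)

z = w * x + b                 # forward pass builds the graph
a = z.relu()
loss = (a - t).square()

loss.backward()               # backward pass fills in .grad on every node
print(f"loss = {loss.data}, dL/dw = {w.grad}, dL/db = {b.grad}")
```

The `+=` in `backward` matters: when a node feeds several later nodes, its gradient is the sum of the contributions from each of them.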
Chain Rule Through a Neural Network Layer
Now let's apply the chain rule to an actual neural network computation. A single neuron with ReLU activation and MSE loss has this structure:
Linear: $z = w \cdot x + b$
Activation: $a = \mathrm{ReLU}(z) = \max(0, z)$
Loss: $L = (a - t)^2$
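The chain rule for $\frac{\partial L}{\partial w}$ multiplies three derivatives along the chain $L \leftarrow a \leftarrow z \leftarrow w$:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$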
Each factor has a clear meaning:
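| Factor | Description | Meaning |
| --- | --- | --- |
| $\frac{\partial L}{\partial a} = 2(a - t)$ | Derivative of loss w.r.t. activation | How sensitive is the loss to the output? |
| $\frac{\partial a}{\partial z} = 1$ if $z > 0$, else $0$ | ReLU derivative | Is the neuron active? (gate: open or closed) |
| $\frac{\partial z}{\partial w} = x$ | Derivative of linear w.r.t. weight | How does the weight affect the pre-activation? |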
Let's compute all of these step by step:
[Interactive playground: backprop calculus for a single neuron]
Adding a Second Input
A real neuron almost never has just one input. Let's see what changes when the same neuron is fed by two inputs $a_1^{(L-1)}$ and $a_2^{(L-1)}$ through weights $w_1$ and $w_2$. The chain rule still gives one gradient per weight, but both gradients share an upstream factor $\delta = \frac{\partial a}{\partial z} \cdot \frac{\partial C}{\partial a}$ (here $C$ denotes the loss, because $L$ is used as the layer index). Backpropagation exploits this shared factor to avoid recomputing work.
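The shared factor can be seen in a few lines of plain Python. This is a sketch under assumed values: the inputs, weights, bias, and target below are invented for illustration, and `C` is a squared-error cost as in the single-neuron example.

```python
# Two-input neuron: z = w1*a1 + w2*a2 + b, a = ReLU(z), C = (a - t)**2
a1, a2 = 1.0, 2.0             # inputs from the previous layer (assumed)
w1, w2 = 0.5, 0.75            # weights (assumed)
b = 0.5                       # bias (assumed)
t = 2.0                       # target (assumed)

# Forward pass
z = w1 * a1 + w2 * a2 + b
a = max(0.0, z)
C = (a - t) ** 2

# Shared upstream factor, computed once
dC_da = 2 * (a - t)
da_dz = 1.0 if z > 0 else 0.0
delta = da_dz * dC_da

# Each parameter gradient reuses delta and differs only in dz/d(parameter)
dC_dw1 = delta * a1           # dz/dw1 = a1
dC_dw2 = delta * a2           # dz/dw2 = a2
dC_db = delta                 # dz/db = 1

print(f"delta = {delta}")
print(f"dC/dw1 = {dC_dw1}, dC/dw2 = {dC_dw2}, dC/db = {dC_db}")
```

With more inputs the pattern is unchanged: one `delta` per neuron, multiplied by whichever input each weight is attached to.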
[Interactive playground: two-input neuron]
Scaling to the Full Network
Now let's apply the same chain rule to the full 4 → 3 → 4 diagonal-flip network from Chapter 7 — same math, just more indices. You'll see every gradient equation play out with real numbers, watch dead ReLU neurons block gradient flow, and see how zero inputs zero out entire rows of the weight gradient. Toggle the input and target bits below the playground to explore different configurations; the weights and biases stay fixed at the values we'll reuse in Chapter 8.
[Interactive playground: full 4 → 3 → 4 network]
Chain Rule Through a Single Neuron (neural_layer_chain_rule.py)
`import numpy as np`
NumPy imported for consistency. This example uses plain Python to make the chain rule steps transparent.
EXECUTION STATE
📚 numpy = Not used directly here, but consistent with our other code blocks.
`# Single neuron: input -> linear -> ReLU -> loss`
This is the simplest possible neural network: one input, one weight, one bias, one activation function, and a loss. Despite its simplicity, the chain rule pattern here is EXACTLY the same pattern used in networks with millions of parameters — just repeated many times.
`# z = w*x + b (linear transformation)`
The linear step: multiply input by weight and add bias. This is one matrix multiplication in a real network.
`# a = ReLU(z) (activation function)`
The non-linear activation. ReLU(z) = max(0, z). Without this, stacking layers would collapse to a single linear transformation.
`# loss = (a - target)**2 (MSE loss)`
Mean Squared Error: penalizes the squared difference between prediction and target.
`x = 2.0` (input value)
A single input to our neuron.
EXECUTION STATE
x = 2.0 = The input feature. In a real network, this would be one element of the input vector (e.g., a pixel intensity).
`w = 1.5` (weight, a learnable parameter)
The weight that we want to optimize. The chain rule will compute dloss/dw, telling us how to adjust this weight to reduce the loss.
EXECUTION STATE
w = 1.5 = The learnable parameter. After computing its gradient, gradient descent updates: w_new = w - lr * dloss/dw.
`b = -1.0` (bias, a learnable parameter)
The bias term. Also learnable — the chain rule computes dloss/db too.
EXECUTION STATE
b = -1.0 = The bias shifts the activation threshold. Without bias, the neuron’s output would always pass through the origin.
`target = 3.0` (desired output)
What we want the neuron to output. The loss measures how far we are from this target.
EXECUTION STATE
target = 3.0 = The ground truth label for this training example. The loss is 0 only when a = target = 3.0.
`# === Forward Pass ===`
The forward pass computes the output by flowing data left-to-right through the computation graph: x → z → a → loss.
`z = w * x + b` (linear transformation)
The linear step: weight times input plus bias. This is the fundamental computation of every neuron.
EXECUTION STATE
w * x = 1.5 × 2.0 = 3.0 — the weighted input
w * x + b = 3.0 + (-1.0) = 2.0 — the pre-activation value
z = 2.0 = The linear output before activation. We need to remember this value for the backward pass (ReLU’s derivative depends on the sign of z).
`print(f"z = w*x + b = {w}*{x} + {b} = {z}")`
Displays the linear computation.
EXECUTION STATE
output = z = w*x + b = 1.5*2.0 + -1.0 = 2.0
`a = max(0.0, z)` (ReLU activation)
ReLU (Rectified Linear Unit): passes positive values unchanged, clamps negatives to zero. Since z=2.0 > 0, the value passes through unchanged.
EXECUTION STATE
📚 max(0, z) = Python’s built-in max function. ReLU(z) = max(0, z). If z > 0: output = z. If z ≤ 0: output = 0 (kills the signal).
a = max(0, 2.0) = 2.0 = Since z=2.0 > 0, ReLU passes it unchanged. The neuron is “active.” If z were negative, a=0 and the neuron would be “dead” (no gradient flows).
`print(f"a = ReLU({z}) = max(0, {z}) = {a}")`
Displays the activation output.
EXECUTION STATE
output = a = ReLU(2.0) = max(0, 2.0) = 2.0
`loss = (a - target) ** 2` (MSE loss)
The loss function: squared difference between our prediction (a=2.0) and the target (3.0). The loss is 1.0 — we’re off by 1 unit, and squaring gives 1.0.
EXECUTION STATE
a - target = 2.0 - 3.0 = -1.0 — the error (prediction is below target)
(a - target)**2 = (-1.0)² = 1.0 — squaring makes the loss positive and penalizes large errors more than small ones
loss = 1.0 = Our prediction (2.0) is 1 unit away from the target (3.0). The gradient will tell us to increase w and b to push the output closer to 3.0.
output = loss = (a - target)^2 = (2.0 - 3.0)^2 = 1.0
`# === Backward Pass (Chain Rule) ===`
Now we flow gradients right-to-left through the computation graph. At each step, we compute the local derivative and multiply it with the accumulated gradient from later in the chain. This is backpropagation — the chain rule applied systematically.
`print(f"--- Backward Pass ---")`
Header for backward pass output.
EXECUTION STATE
output = --- Backward Pass ---
`# Step 1: dloss/da`
Start at the loss and work backward. The first local derivative is dloss/da: how does the loss change when the activation changes?
`dloss_da = 2 * (a - target)` (loss gradient)
Derivative of (a - target)² with respect to a: d/da[(a-t)²] = 2(a-t). This is the starting point of backpropagation.
EXECUTION STATE
derivative rule = d/da[(a-t)²] = 2(a-t) by the power rule. The chain rule starts here and works backward.
dloss_da = 2×(2.0-3.0) = -2.0 = Negative gradient: the loss decreases if we increase a. This makes sense — we need a to go from 2.0 toward 3.0.
`# Step 2: da/dz (ReLU derivative)`
Next link in the chain: how does the activation change when z changes? The ReLU derivative is a simple switch: 1 if z > 0, 0 if z ≤ 0.
`da_dz = 1.0 if z > 0 else 0.0` (ReLU derivative)
ReLU’s derivative is piecewise: for z > 0, the output equals z, so the derivative is 1 (pass-through). For z ≤ 0, the output is constant at 0, so the derivative is 0 (gradient is killed).
EXECUTION STATE
z = 2.0 > 0 = True — the neuron is active, so gradient passes through.
da_dz = 1.0 = ReLU passes the gradient unchanged when z > 0. If z were -1.0, da_dz would be 0.0, blocking ALL gradient flow. This is the “dying ReLU” problem.
`dloss_dz = dloss_da * da_dz` (chain to get dloss/dz)
Multiply the gradient from the loss (-2.0) by the ReLU derivative (1.0). The gradient passes through unchanged because ReLU is active.
EXECUTION STATE
dloss_dz = -2.0 × 1.0 = -2.0 = The gradient at the pre-activation z. This will be used to compute gradients for both w and b (since z depends on both).
```python
import numpy as np

# Single neuron: input -> linear -> ReLU -> loss
# z = w*x + b  (linear transformation)
# a = ReLU(z)  (activation function)
# loss = (a - target)**2  (MSE loss)

x = 2.0        # input
w = 1.5        # weight
b = -1.0       # bias
target = 3.0   # desired output

# === Forward Pass ===
z = w * x + b
print(f"z = w*x + b = {w}*{x} + {b} = {z}")

a = max(0.0, z)
print(f"a = ReLU({z}) = max(0, {z}) = {a}")

loss = (a - target) ** 2
print(f"loss = (a - target)^2 = ({a} - {target})^2 = {loss}")

# === Backward Pass (Chain Rule) ===
print(f"\n--- Backward Pass ---")

# Step 1: dloss/da
dloss_da = 2 * (a - target)
print(f"dloss/da = 2*(a-target) = 2*({a}-{target}) = {dloss_da}")

# Step 2: da/dz (ReLU derivative)
da_dz = 1.0 if z > 0 else 0.0
print(f"da/dz = {da_dz} (z={z} > 0, so ReLU passes gradient)")

# Step 3: Chain to get dloss/dz
dloss_dz = dloss_da * da_dz
print(f"dloss/dz = {dloss_da} * {da_dz} = {dloss_dz}")

# Step 4: Gradients for w and b
dz_dw = x
dz_db = 1.0
dloss_dw = dloss_dz * dz_dw
dloss_db = dloss_dz * dz_db
print(f"\ndloss/dw = dloss/dz * x = {dloss_dz} * {x} = {dloss_dw}")
print(f"dloss/db = dloss/dz * 1 = {dloss_db}")
```
The ReLU Gate: Notice how ReLU acts as a gate in the backward pass. When $z > 0$, the gate is open and gradients flow through unchanged ($\frac{\partial a}{\partial z} = 1$). When $z \le 0$, the gate is closed and every gradient that would pass back through this neuron becomes zero. This is the dying ReLU problem: if a neuron's pre-activation is negative for every training input, its weights receive no gradient and the neuron can never learn.
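The closed gate is easy to reproduce by lowering the bias until the pre-activation turns negative. In this sketch the helper `single_neuron_grads` is a hypothetical name, and the bias `-4.0` in the second call is an assumed value chosen to close the gate; the first call uses the values from the single-neuron example.

```python
def single_neuron_grads(x, w, b, target):
    """Return (dloss/dw, dloss/db) for z = w*x + b, a = ReLU(z), loss = (a - target)**2."""
    z = w * x + b
    a = max(0.0, z)

    dloss_da = 2 * (a - target)
    da_dz = 1.0 if z > 0 else 0.0     # the ReLU gate
    dloss_dz = dloss_da * da_dz

    return dloss_dz * x, dloss_dz * 1.0


# Gate open: z = 1.5*2.0 - 1.0 = 2.0 > 0
open_grads = single_neuron_grads(x=2.0, w=1.5, b=-1.0, target=3.0)

# Gate closed: z = 1.5*2.0 - 4.0 = -1.0 <= 0
closed_grads = single_neuron_grads(x=2.0, w=1.5, b=-4.0, target=3.0)

print("gate open:  ", open_grads)
print("gate closed:", closed_grads)
```

In the closed case the loss is large (the output is 0 while the target is 3), yet both gradients are zero, so gradient descent has no signal to move `w` or `b`.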
The Multivariable Chain Rule
In neural networks, a single intermediate variable often affects the loss through multiple paths. For example, the hidden layer output h might affect multiple neurons in the next layer. When this happens, the gradient is the sum of contributions from all paths:
$$\frac{\partial L}{\partial h} = \sum_j \frac{\partial L}{\partial z_j} \cdot \frac{\partial z_j}{\partial h}$$
In the two-layer network example later in this section, h only connects to one output neuron, so there's only one path. But in real networks with many neurons per layer, each hidden neuron contributes to many output neurons, and the multivariable chain rule sums all contributions. This is why matrix multiplication appears in backpropagation: multiplying by the transpose of the weight matrix automatically sums gradients across all paths.
[Figure: one hidden neuron feeding multiple output neurons]
Key Principle: Sum Over Paths
If a variable h affects the loss through n different pathways, the total gradient is the sum of the gradients along each pathway. This is the multivariable chain rule in action, and it is automatically handled by backpropagation.
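A small numerical case shows the sum over paths and its matrix form side by side. This is a sketch with assumed numbers: one hidden value `h` feeds two linear outputs through a weight matrix `W2`, and the loss is taken to be the sum of squared errors over both outputs. None of these values come from the chapter's networks.

```python
import numpy as np

h = 1.5                               # one hidden activation (assumed)
W2 = np.array([[2.0], [-1.0]])        # weights from h to two outputs, shape (2, 1)
t = np.array([1.0, 0.0])              # targets (assumed)

# Forward: z_j = W2[j, 0] * h, and L = sum_j (z_j - t_j)**2
z = W2[:, 0] * h
dL_dz = 2 * (z - t)                   # gradient arriving at each output

# Path by path: output j contributes dL/dz_j * dz_j/dh = dL/dz_j * W2[j, 0]
per_path = dL_dz * W2[:, 0]
dL_dh_sum = per_path.sum()

# Matrix form: multiplying by the transpose performs the same sum
dL_dh_matrix = (W2.T @ dL_dz)[0]

print("contribution per path:", per_path)
print("sum over paths:       ", dL_dh_sum)
print("W2.T @ dL_dz:         ", dL_dh_matrix)
```

With a full hidden layer, `W2` has one column per hidden neuron and the same product `W2.T @ dL_dz` yields every hidden gradient at once.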
Backpropagation: Chain Rule Through Multiple Layers
Now let's see the chain rule in action through a two-layer network. This is where the pattern of backpropagation becomes clear: at each layer, we (1) compute local parameter gradients and (2) pass the gradient backward to the previous layer.
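The full chain for $\frac{\partial L}{\partial w_1}$ has four factors:
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$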
Each factor corresponds to passing through one operation in reverse:
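$\frac{\partial L}{\partial y} = 2(y - t)$: the loss derivative
$\frac{\partial y}{\partial h} = w_2$: through layer 2's weight
$\frac{\partial h}{\partial z_1} = 1$ or $0$: through the ReLU gate
$\frac{\partial z_1}{\partial w_1} = x$: the input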
Watch how the gradient grows as it flows backward:
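| Location | Gradient Value | How It Got There |
| --- | --- | --- |
| $\partial L/\partial y$ | −4.0 | Starting point: 2(2−4) |
| $\partial L/\partial w_2$ | −6.0 | −4.0 × h = −4.0 × 1.5 |
| $\partial L/\partial h$ | −8.0 | −4.0 × w₂ = −4.0 × 2.0 |
| $\partial L/\partial z_1$ | −8.0 | −8.0 × 1.0 (ReLU open) |
| $\partial L/\partial w_1$ | −24.0 | −8.0 × x = −8.0 × 3.0 |
| $\partial L/\partial b_1$ | −8.0 | −8.0 × 1.0 |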
Notice that $|\partial L/\partial w_1| = 24$ is much larger than $|\partial L/\partial w_2| = 6$. The gradient grew as it passed through layer 2's weight ($w_2 = 2$) and was then multiplied by the input ($x = 3$). In very deep networks, this repeated multiplication can cause gradients to explode (if the factors are consistently larger than 1 in magnitude) or vanish (if they are consistently smaller than 1), two fundamental challenges in deep learning.
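A rough way to see both failure modes is to stack identical scalar layers and multiply their chain rule factors. The helper `input_gradient` and the weights 0.5, 1.0, and 1.5 are illustrative assumptions; real networks have different weights and nonlinear activations in every layer, so this only shows the trend.

```python
def input_gradient(weight, depth):
    """d(output)/d(input) for `depth` stacked scalar layers of the form y = weight * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight        # one chain rule factor per layer
    return grad


for weight in (0.5, 1.0, 1.5):
    row = ", ".join(
        f"depth {depth}: {input_gradient(weight, depth):.3e}"
        for depth in (1, 10, 50)
    )
    print(f"weight {weight}: {row}")
```

Factors below 1 shrink the gradient geometrically with depth, factors above 1 grow it geometrically, and only factors near 1 keep it stable.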
Backpropagation Through Two Layers (two_layer_backprop.py)
`import numpy as np`
NumPy imported for consistency.
EXECUTION STATE
📚 numpy = Standard numerical computing import.
`# Two-layer network:`
A two-layer network demonstrates the key insight of backpropagation: the chain rule propagates gradients backward through EVERY layer. The gradient at layer 1 depends on the gradient at layer 2, multiplied by the connection between them. This is how deep networks learn — each layer gets a gradient signal that tells it how to adjust its weights.
`# Layer 1: z1 = w1*x + b1, h = ReLU(z1)`
Layer 1 takes the input x, applies a linear transform, then ReLU activation.
`# Layer 2: y = w2*h + b2`
Layer 2 takes layer 1’s output h, applies another linear transform to produce the final prediction y.
`# Loss: L = (y - target)**2`
MSE loss comparing prediction y to target.
`x = 3.0` (input)
The input to the two-layer network.
EXECUTION STATE
x = 3.0 = Input feature value.
`w1, b1 = 0.5, 0.0` (layer 1 parameters)
Weight and bias for the first layer.
EXECUTION STATE
w1 = 0.5 = Layer 1 weight. Controls how much the input is amplified.
b1 = 0.0 = Layer 1 bias. No shift in this case.
`w2, b2 = 2.0, -1.0` (layer 2 parameters)
Weight and bias for the second layer.
EXECUTION STATE
w2 = 2.0 = Layer 2 weight. This will appear in the gradient chain when propagating back to layer 1.
b2 = -1.0 = Layer 2 bias.
`target = 4.0`
The desired output for this training example.
EXECUTION STATE
target = 4.0 = We want the network to output 4.0. Currently it outputs 2.0, so loss = 4.0.
`# === Forward Pass ===`
Data flows forward through both layers: x → z1 → h → z2 → y → loss.
`z1 = w1 * x + b1` (layer 1 linear)
First linear transformation.
EXECUTION STATE
z1 = 0.5 × 3.0 + 0.0 = 1.5 = Pre-activation for layer 1. Positive, so ReLU will pass it through.
`h = max(0.0, z1)` (layer 1 ReLU)
ReLU activation. Since z1=1.5 > 0, the output is 1.5.
EXECUTION STATE
h = max(0, 1.5) = 1.5 = The hidden layer output. This is what layer 2 sees as input.
`z2 = w2 * h + b2` (layer 2 linear)
Second linear transformation, using layer 1’s output as input.
EXECUTION STATE
z2 = 2.0 × 1.5 + (-1.0) = 2.0 = The network’s final output (no activation on the output layer for regression).
`y = z2` (final prediction)
For regression, the output layer has no activation — the linear output is the prediction directly.
EXECUTION STATE
y = 2.0 = The network predicts 2.0, but the target is 4.0. Loss = (2-4)² = 4.0.
`loss = (y - target) ** 2` (MSE loss)
Squared error between prediction and target.
EXECUTION STATE
loss = (2.0 - 4.0)² = 4.0 = The error is -2.0, squared gives 4.0. All 4 parameters need updating.
`print("=== Forward Pass ===")`
Header for the forward pass output section.
EXECUTION STATE
output = === Forward Pass ===
`print(f"z1 = {w1}*{x} + {b1} = {z1}")`
Displays layer 1 linear output.
EXECUTION STATE
output = z1 = 0.5*3.0 + 0.0 = 1.5
`print(f"h = ReLU({z1}) = {h}")`
Displays the hidden activation.
EXECUTION STATE
output = h = ReLU(1.5) = 1.5
`print(f"z2 = {w2}*{h} + {b2} = {z2}")`
Displays layer 2 linear output.
EXECUTION STATE
output = z2 = 2.0*1.5 + -1.0 = 2.0
`print(f"y = {y}")`
Displays the final prediction.
EXECUTION STATE
output = y = 2.0
`print(f"loss = ({y} - {target})^2 = {loss}")`
Displays the loss.
EXECUTION STATE
output = loss = (2.0 - 4.0)^2 = 4.0
`# === Backward Pass ===`
Gradients flow backward: loss → y → z2 → h → z1 → (w1, b1). At each step, we multiply the upstream gradient by the local derivative.
`print("=== Backward Pass ===")`
Header for backward pass.
EXECUTION STATE
output = === Backward Pass ===
`# Start from the loss`
The backward pass always starts from the loss. The initial gradient is dloss/dloss = 1.0 (implicitly). The first computation is dloss/dy.
`dloss_dy = 2 * (y - target)` (starting gradient)
d/dy[(y-t)²] = 2(y-t). This is the starting point for backpropagation.
EXECUTION STATE
dloss_dy = 2×(2.0-4.0) = -4.0 = The gradient at the output. Negative = increase y to decrease loss. Magnitude 4.0 = the error is significant.
`dloss_dh = dloss_dy * w2`
Since y = w2·h + b2, we have dy/dh = w2. The gradient is scaled by the weight w2 as it passes backward. This is why weight magnitude matters: large weights amplify gradients, small weights diminish them.
EXECUTION STATE
dloss_dh = -4.0 × 2.0 = -8.0 = The gradient arriving at layer 1’s output. Notice it grew larger (from -4 to -8) because w2=2.0 amplified it. This is why deep networks can have exploding gradients.
`print(f"dloss/dh = dloss/dy * w2 = ...")`
Displays the gradient flowing backward to layer 1.
Finally, compute layer 1’s parameter gradients using the accumulated gradient.
`dloss_dw1 = dloss_dz1 * x` (layer 1 weight gradient)
Since z1 = w1·x + b1, dz1/dw1 = x. The chain rule gives dloss/dw1 = dloss/dz1 × x.
EXECUTION STATE
dloss_dw1 = -8.0 × 3.0 = -24.0 = The LARGEST gradient! Layer 1’s weight gradient is amplified by both w2 (layer 2’s weight) AND x (the input). This shows how gradients accumulate through the chain rule.
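Put together, the two-layer walkthrough can be sketched as the script below. The parameter values and the lines quoted in the annotations come from the walkthrough; the intermediate names `dh_dz1`, `dloss_dw2`, `dloss_db2`, `dloss_dz1`, and `dloss_db1`, and the finite-difference check at the end, are assumptions filled in to make the script self-contained.

```python
x = 3.0
w1, b1 = 0.5, 0.0
w2, b2 = 2.0, -1.0
target = 4.0

# === Forward Pass ===
z1 = w1 * x + b1
h = max(0.0, z1)
z2 = w2 * h + b2
y = z2
loss = (y - target) ** 2

# === Backward Pass ===
dloss_dy = 2 * (y - target)           # start from the loss

# Layer 2 parameter gradients
dloss_dw2 = dloss_dy * h              # dy/dw2 = h
dloss_db2 = dloss_dy * 1.0            # dy/db2 = 1

# Pass the gradient back to layer 1's output, then through the ReLU gate
dloss_dh = dloss_dy * w2              # dy/dh = w2
dh_dz1 = 1.0 if z1 > 0 else 0.0
dloss_dz1 = dloss_dh * dh_dz1

# Layer 1 parameter gradients
dloss_dw1 = dloss_dz1 * x             # dz1/dw1 = x
dloss_db1 = dloss_dz1 * 1.0           # dz1/db1 = 1


def loss_for_w1(w1_value):
    """Recompute the loss with a different layer 1 weight, for a numerical check."""
    hidden = max(0.0, w1_value * x + b1)
    return (w2 * hidden + b2 - target) ** 2


eps = 1e-6
numerical_dw1 = (loss_for_w1(w1 + eps) - loss_for_w1(w1 - eps)) / (2 * eps)

print(f"dloss/dw1 = {dloss_dw1} (numerical: {numerical_dw1:.4f})")
print(f"dloss/db1 = {dloss_db1}")
print(f"dloss/dw2 = {dloss_dw2}")
print(f"dloss/db2 = {dloss_db2}")
```

Note that `dloss_dz1` is computed once and reused for both layer 1 gradients, the same sharing that makes backpropagation efficient in larger networks.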
In practice, you rarely implement backpropagation by hand. PyTorch's autograd system automatically applies the chain rule by recording the computation graph during the forward pass and traversing it in reverse during .backward().
The workflow is beautifully simple:
Create parameters with requires_grad=True — this tells PyTorch to track operations.
Forward pass: Write the computation normally. PyTorch silently builds the computation graph behind the scenes.
Call loss.backward(): PyTorch traverses the graph in reverse, applying the chain rule at every node. All gradients appear in param.grad.
Let's verify that autograd produces exactly the same gradients as our manual chain rule calculation:
PyTorch Autograd: Automatic Chain Rule (pytorch_autograd.py)
`import torch`
PyTorch’s autograd system implements the chain rule automatically. You define the forward pass, and PyTorch handles the entire backward pass — computing gradients for every parameter with a single call to .backward().
EXECUTION STATE
📚 torch = PyTorch’s core library. The autograd engine records operations during the forward pass and applies the chain rule during .backward() to compute all gradients.
`# Same two-layer computation as above`
We reproduce the exact same network from our manual chain rule calculation. The gradients should match exactly, verifying our hand calculations.
`x = torch.tensor(3.0)`
Input as a PyTorch tensor. No requires_grad needed for x — we don’t optimize the input.
EXECUTION STATE
x = tensor(3.0) = The input. No gradient tracking needed — we optimize weights, not inputs.
`w1 = torch.tensor(0.5, requires_grad=True)`
Layer 1 weight with gradient tracking enabled. PyTorch will record every operation involving this tensor.
EXECUTION STATE
📚 requires_grad=True = Tells PyTorch to track operations on this tensor for gradient computation. Every operation creates a node in the computation graph.
w1 = tensor(0.5, requires_grad=True) = Layer 1 weight. PyTorch will compute dloss/dw1 during .backward().
target = tensor(4.0) = Ground truth label. Not a learnable parameter.
`# Forward pass: PyTorch records the computation graph`
As we perform operations, PyTorch silently builds a directed acyclic graph (DAG) connecting inputs to outputs. Each operation (multiply, add, relu, square) becomes a node. This graph is later traversed in reverse to compute gradients.
z2 = tensor(2.0) = 2.0 × 1.5 + (-1.0) = 2.0. Same as manual.
`y = z2`
Final prediction (identity activation for regression).
EXECUTION STATE
y = tensor(2.0) = The network’s prediction. Target is 4.0.
`loss = (y - target) ** 2`
MSE loss. This is the root of the computation graph — backward will start here.
EXECUTION STATE
loss = tensor(4.0, grad_fn=<PowBackward0>) = (2.0 - 4.0)² = 4.0. The grad_fn tells us PyTorch recorded the power operation.
`print(f"z1 = {z1.item():.4f}")`
Displays layer 1 output.
EXECUTION STATE
output = z1 = 1.5000
`print(f"h = {h.item():.4f}")`
Displays hidden activation.
EXECUTION STATE
output = h = 1.5000
`print(f"y = {y.item():.4f}")`
Displays prediction.
EXECUTION STATE
output = y = 2.0000
`print(f"loss = {loss.item():.4f}")`
Displays loss.
EXECUTION STATE
output = loss = 4.0000
`# Backward pass: one line computes ALL gradients!`
This is the magic of autograd. One call to .backward() traverses the entire computation graph in reverse, applying the chain rule at every node. In a real network with millions of parameters, this single call computes all gradients.
`loss.backward()` (automatic chain rule)
Triggers reverse-mode automatic differentiation. PyTorch walks backward through the graph: PowBackward → SubBackward → AddBackward → MulBackward → ReluBackward → AddBackward → MulBackward. At each node, it computes the local derivative and multiplies by the upstream gradient — exactly the chain rule we did manually.
EXECUTION STATE
📚 .backward() = Computes gradients of the loss with respect to all leaf tensors created with requires_grad=True that the loss depends on. Gradients are accumulated in the .grad attribute of each parameter. This is the entire backpropagation algorithm in one line.
b2.grad = -4.0 = Matches manual! ✓ All four gradients computed by autograd match our hand calculations exactly.
`# Compare with our manual calculations`
Side-by-side comparison to confirm autograd matches our manual chain rule. This is the ultimate verification: if our understanding of the chain rule is correct, the numbers must match.
`print(f"--- Manual values ---")`
Header for manual values comparison.
EXECUTION STATE
output = --- Manual values ---
`print(f"dloss/dw1 = -24.0")`
Manual value for comparison.
EXECUTION STATE
autograd: -24.0 vs manual: -24.0 = ✓ Match! The chain rule derivation is verified.
`print(f"dloss/db1 = -8.0")`
Manual bias gradient.
EXECUTION STATE
autograd: -8.0 vs manual: -8.0 = ✓ Match!
`print(f"dloss/dw2 = -6.0")`
Manual layer 2 weight gradient.
EXECUTION STATE
autograd: -6.0 vs manual: -6.0 = ✓ Match!
`print(f"dloss/db2 = -4.0")`
Manual layer 2 bias gradient.
EXECUTION STATE
autograd: -4.0 vs manual: -4.0 = ✓ All four gradients match! PyTorch’s autograd correctly applied the chain rule through both layers.
```python
import torch

# Same two-layer computation as above, using PyTorch autograd
x = torch.tensor(3.0)
w1 = torch.tensor(0.5, requires_grad=True)
b1 = torch.tensor(0.0, requires_grad=True)
w2 = torch.tensor(2.0, requires_grad=True)
b2 = torch.tensor(-1.0, requires_grad=True)
target = torch.tensor(4.0)

# Forward pass: PyTorch records the computation graph
z1 = w1 * x + b1
h = torch.relu(z1)
z2 = w2 * h + b2
y = z2
loss = (y - target) ** 2

print(f"z1 = {z1.item():.4f}")
print(f"h = {h.item():.4f}")
print(f"y = {y.item():.4f}")
print(f"loss = {loss.item():.4f}")

# Backward pass: one line computes ALL gradients!
loss.backward()

print(f"\n--- Gradients (PyTorch autograd) ---")
print(f"dloss/dw1 = {w1.grad.item():.4f}")
print(f"dloss/db1 = {b1.grad.item():.4f}")
print(f"dloss/dw2 = {w2.grad.item():.4f}")
print(f"dloss/db2 = {b2.grad.item():.4f}")

# Compare with our manual calculations
print(f"\n--- Manual values ---")
print(f"dloss/dw1 = -24.0")
print(f"dloss/db1 = -8.0")
print(f"dloss/dw2 = -6.0")
print(f"dloss/db2 = -4.0")
```
The Takeaway: PyTorch’s autograd is not magic — it is the chain rule, implemented as a graph traversal algorithm. Every time you call loss.backward(), PyTorch walks through the computational graph in reverse order, multiplying local derivatives at each node. This is exactly what we did by hand, but automated for networks with millions of parameters.
Summary and What's Next
The chain rule is the mathematical engine that powers neural network training. Here's what we covered:
The chain rule says the derivative of a composition $h(x) = f(g(x))$ is the product of individual derivatives: $\frac{dh}{dx} = \frac{df}{du} \cdot \frac{dg}{dx}$, with $u = g(x)$.
Computational graphs visualize the chain rule as forward-flowing values and backward-flowing gradients through nodes and edges.
Through a single neuron, the chain rule produces three factors: loss derivative × activation derivative × linear derivative.
Through multiple layers, the gradient accumulates as a product of all local derivatives along the path, which can lead to exploding or vanishing gradients.
The multivariable chain rule sums gradient contributions across multiple paths when a variable affects the loss through multiple downstream neurons.
PyTorch autograd implements the chain rule automatically. You define the forward pass; .backward() handles the rest.
Chapter Complete! With vectors (Section 1), derivatives (Section 2), probability (Section 3), and the chain rule (this section), you now have all the mathematical foundations needed for neural networks. In the next chapter, we build on these foundations to construct the perceptron — the simplest neural network — and see these mathematical tools in action.