Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

State and apply the chain rule for differentiating compositions of functions
Identify the inner and outer functions in a composition and differentiate each correctly
Understand intuitively why rates of change multiply when functions are composed
Apply the chain rule multiple times for nested compositions
Use Leibniz notation to express the chain rule as cancellation of differentials
Connect the chain rule to backpropagation, the algorithm that powers neural network training
Avoid common mistakes like forgetting the inner derivative

The Big Picture: Differentiating Compositions

"The chain rule is the single most important differentiation technique. It's the mathematical foundation of how neural networks learn."

We've learned the product rule and quotient rule for combining functions through multiplication and division. But what about when we compose functions — feed the output of one function into another?

Consider $f(x) = \sin(x^2)$ . This is a composition: we first square $x$ , then take the sine of the result. If we write $u = x^2$ , then $f(x) = \sin(u)$ .

How do we find $f'(x)$ ? The chain rule tells us:

The Chain Rule

$\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)$

"The derivative of the outer function (evaluated at the inner) times the derivative of the inner function."

For our example $\sin(x^2)$ :

Outer function: $f(u) = \sin(u)$ , so $f'(u) = \cos(u)$
Inner function: $g(x) = x^2$ , so $g'(x) = 2x$
Chain rule: $\frac{d}{dx}[\sin(x^2)] = \cos(x^2) \cdot 2x = 2x\cos(x^2)$
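
A quick numerical check can make this result concrete. The sketch below compares the formula $2x\cos(x^2)$ against a central-difference estimate. The helper names (`F`, `F_prime`, `central_difference`) and the step size `h` are illustrative assumptions, not part of the text above.

```python
import math

def F(x):
    """The composition F(x) = sin(x^2)."""
    return math.sin(x ** 2)

def F_prime(x):
    """Chain-rule result: outer derivative cos(x^2) times inner derivative 2x."""
    return math.cos(x ** 2) * 2 * x

def central_difference(func, x, h=1e-5):
    """Approximate func'(x) with a symmetric difference quotient (h is an assumed step size)."""
    return (func(x + h) - func(x - h)) / (2 * h)

for x in (0.5, 1.0, 2.0):
    exact = F_prime(x)
    approx = central_difference(F, x)
    print(f"x = {x}: chain rule = {exact:.6f}, finite difference = {approx:.6f}")
```

The two columns should agree to several decimal places; dropping the factor `2 * x` from `F_prime` makes them disagree immediately.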

Why the Chain Rule Matters

The chain rule is arguably the most important differentiation rule because:

Most real-world functions are compositions (e.g., $e^{-x^2}$ , $\ln(\sin x)$ )
It's the mathematical foundation of backpropagation in neural networks
It enables automatic differentiation in deep learning frameworks like PyTorch and TensorFlow
Without it, we couldn't train modern AI systems!

Historical Context

The chain rule was developed alongside calculus itself by Leibniz in the late 17th century. Leibniz's notation for derivatives made the chain rule almost self-evident — it looks like cancellation of fractions:

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$

This "fraction-like" behavior isn't just a coincidence — it reflects something deep about how rates of change compose. If $y$ changes with $u$ at a certain rate, and $u$ changes with $x$ at another rate, then the rates multiply.

The chain rule took on new importance in the 1980s when Rumelhart, Hinton, and Williams showed that it could be used to train multi-layer neural networks through an algorithm called backpropagation. Today, every time you use ChatGPT, image recognition, or recommendation systems, the chain rule is working behind the scenes!

Intuitive Understanding: Rates Multiply

The key insight of the chain rule is that rates of change multiply when functions are composed. Let's build intuition with a concrete example.

The Gear Analogy

Imagine two connected gears:

Gear 1 (inner function): For every rotation of the input shaft, Gear 1 rotates 3 times. (Rate: 3)
Gear 2 (outer function): For every rotation of Gear 1, Gear 2 rotates 2 times. (Rate: 2)

Question: For every rotation of the input shaft, how many times does Gear 2 rotate?

Answer: 3 × 2 = 6 times! The rates multiply.

Rate Multiplication (diagram): Input $x$ → [rate $g'(x)$] → Intermediate $u = g(x)$ → [rate $f'(u)$] → Output $y = f(u)$. Total rate: $f'(g(x)) \cdot g'(x)$.

Currency Conversion Analogy

Consider converting US Dollars to Japanese Yen through Euros:

1 USD = 0.92 EUR (rate: 0.92)
1 EUR = 158 JPY (rate: 158)

To find how many Yen per Dollar, we multiply the rates: 0.92 × 158 = 145.36 JPY/USD.

This is exactly what the chain rule does — it chains together rates of change by multiplying them!
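
The currency analogy can be sketched in a few lines, treating each conversion as a linear function whose slope is the exchange rate. The function names and the `slope` helper are hypothetical; the rates are the ones quoted above.

```python
def usd_to_eur(usd):
    """Inner function: 0.92 EUR per USD (rate from the example above)."""
    return 0.92 * usd

def eur_to_jpy(eur):
    """Outer function: 158 JPY per EUR (rate from the example above)."""
    return 158 * eur

def usd_to_jpy(usd):
    """Composition: convert USD to EUR, then EUR to JPY."""
    return eur_to_jpy(usd_to_eur(usd))

def slope(func, x, h=1.0):
    """Rate of change of func at x; exact for linear functions, up to rounding."""
    return (func(x + h) - func(x)) / h

inner_rate = slope(usd_to_eur, 10.0)               # EUR per USD
outer_rate = slope(eur_to_jpy, usd_to_eur(10.0))   # JPY per EUR
composed_rate = slope(usd_to_jpy, 10.0)            # JPY per USD

print(f"inner rate    = {inner_rate:.2f}")
print(f"outer rate    = {outer_rate:.2f}")
print(f"product       = {inner_rate * outer_rate:.2f}")
print(f"composed rate = {composed_rate:.2f}")
```

For linear functions the rates are constant, so the product matches the composed rate at every input. For nonlinear functions the same relation holds, but each rate has to be evaluated at the right point.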

Formal Statement and Proof

Theorem (Chain Rule): If $g$ is differentiable at $x$ and $f$ is differentiable at $g(x)$ , then the composite function $F(x) = f(g(x))$ is differentiable at $x$ , and:

$F'(x) = f'(g(x)) \cdot g'(x)$

Proof:

Let $u = g(x)$ and $y = f(u)$ .

We want to find $\frac{dy}{dx} = \lim_{h \to 0} \frac{f(g(x+h)) - f(g(x))}{h}$

Let $\Delta u = g(x+h) - g(x)$ . Then:

$\frac{dy}{dx} = \lim_{h \to 0} \frac{f(g(x) + \Delta u) - f(g(x))}{h}$

Multiply and divide by $\Delta u$ (when $\Delta u \neq 0$ ):

$= \lim_{h \to 0} \frac{f(g(x) + \Delta u) - f(g(x))}{\Delta u} \cdot \frac{\Delta u}{h}$

As $h \to 0$ , we have $\Delta u \to 0$ (by continuity of $g$ ):

$= \lim_{\Delta u \to 0} \frac{f(g(x) + \Delta u) - f(g(x))}{\Delta u} \cdot \lim_{h \to 0} \frac{g(x+h) - g(x)}{h}$

$= f'(g(x)) \cdot g'(x)$ ∎

Technical Note

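The proof above has a subtle issue: what if $\Delta u = 0$ for some $h \neq 0$ ? Then the step that divides by $\Delta u$ is not allowed. A rigorous proof avoids the division by replacing the quotient $\frac{f(g(x) + \Delta u) - f(g(x))}{\Delta u}$ with an auxiliary function that equals this quotient when $\Delta u \neq 0$ and equals $f'(g(x))$ when $\Delta u = 0$ . For our purposes, the intuitive idea of "rates multiply" is the key insight.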
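
To see the factorization from the proof at work, the sketch below evaluates the two difference quotients for shrinking $h$ and multiplies them. The choice of $f(u) = \sin(u)$, $g(x) = x^2$, the point $x = 1$, and the list of step sizes are assumptions made for illustration; at this point $\Delta u \neq 0$, so the division is safe.

```python
import math

def f(u):
    return math.sin(u)      # outer function

def g(x):
    return x ** 2           # inner function

x = 1.0
exact = math.cos(g(x)) * 2 * x   # f'(g(x)) * g'(x)

for h in (1e-1, 1e-2, 1e-3, 1e-4):
    delta_u = g(x + h) - g(x)                                   # change in the inner value
    outer_quotient = (f(g(x) + delta_u) - f(g(x))) / delta_u    # tends to f'(g(x))
    inner_quotient = delta_u / h                                # tends to g'(x)
    product = outer_quotient * inner_quotient
    print(f"h = {h:g}: product = {product:.6f}, error = {abs(product - exact):.2e}")
```

The error shrinks roughly in proportion to $h$, which is what one would expect from one-sided difference quotients.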

Information Flow Demonstration

Watch how values and derivatives flow through a composed function. The forward pass computes the output, while the backward pass applies the chain rule to compute derivatives.

Information Flow Through Composition

Function: $y = (3x + 1)^2$, with outer $f(u) = u^2$ and inner $g(x) = 3x + 1$, evaluated at $x = 2.00$.

Forward Pass: Computing the output value

Input: start with the input value, $x = 2.00$
Inner function: apply $g(x) = 3x + 1$, giving $u = 3(2.00) + 1 = 7.00$
Outer function: apply $f(u) = u^2$, giving $y = (7.00)^2 = 49.00$

Backward Pass: Computing the derivative (Chain Rule)

dy/du: derivative of the outer function, $f'(u) = 2u = 2(7.00) = 14.00$
du/dx: derivative of the inner function, $g'(x) = 3$
dy/dx: chain rule, multiply the rates: $14.00 \times 3 = 42.00$

The Chain Rule Formula

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = f'(g(x)) \cdot g'(x)$, so here $42.00 = 14.00 \times 3$.

Key Insight: The chain rule tells us that rates of change multiply when functions are composed.

If y changes 14× as fast as u
And u changes 3× as fast as x
Then y changes 42× as fast as x
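
The same forward and backward pass can be written out by hand. This is a minimal sketch for $y = (3x + 1)^2$ at $x = 2$ ; the function names `forward` and `backward` are illustrative, not taken from any library.

```python
def forward(x):
    """Forward pass: return the intermediate value u and the output y."""
    u = 3 * x + 1      # inner function g(x)
    y = u ** 2         # outer function f(u)
    return u, y

def backward(u):
    """Backward pass: local derivatives, then their product."""
    dy_du = 2 * u      # derivative of the outer function at u
    du_dx = 3          # derivative of the inner function
    dy_dx = dy_du * du_dx
    return dy_du, du_dx, dy_dx

x = 2.0
u, y = forward(x)
dy_du, du_dx, dy_dx = backward(u)
print(f"u = {u:.2f}, y = {y:.2f}")                                   # u = 7.00, y = 49.00
print(f"dy/du = {dy_du:.2f}, du/dx = {du_dx}, dy/dx = {dy_dx:.2f}")  # 14.00, 3, 42.00
```

Note that the backward pass reuses `u` from the forward pass. Keeping intermediate values around is what lets backpropagation evaluate each outer derivative at the right point.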

Interactive Exploration

Use the visualizer below to explore the chain rule with different composed functions. Watch how the derivative of the composition depends on both the inner and outer derivatives.

Chain Rule Interactive Explorer (interactive figure; snapshot for $y = \sin(x^2)$ at $x = 1.00$)

Pipeline: Input $x$ → Inner $g(x) = x^2$ → $u = g(x)$ → Outer $f(u) = \sin(u)$ → Output $y = f(g(x))$

Values at $x = 1.00$: $g(x) = 1.000$, $f(g(x)) = 0.841$, $(f \circ g)'(x) = 1.081$

Chain Rule Breakdown at $x = 1.00$: $f'(g(x)) = \cos(u) = 0.540$ and $g'(x) = 2x = 2.000$, so $(f \circ g)'(x) = \cos(x^2) \cdot 2x = 1.081$

The chain rule shows how changes propagate through composed functions: the rate at which y changes with x equals the rate at which y changes with u, multiplied by the rate at which u changes with x.

Worked Examples

Example 1: Power of a Linear Function

Find $\frac{d}{dx}[(3x + 1)^5]$

Solution: Identify inner and outer functions:

Outer: $f(u) = u^5$ , so $f'(u) = 5u^4$
Inner: $g(x) = 3x + 1$ , so $g'(x) = 3$

Applying the chain rule:

$\frac{d}{dx}[(3x+1)^5] = 5(3x+1)^4 \cdot 3$

$= 15(3x+1)^4$

Example 2: Exponential Composition

Find $\frac{d}{dx}[e^{x^2}]$

Solution:

Outer: $f(u) = e^u$ , so $f'(u) = e^u$
Inner: $g(x) = x^2$ , so $g'(x) = 2x$

Applying the chain rule:

$\frac{d}{dx}[e^{x^2}] = e^{x^2} \cdot 2x = 2xe^{x^2}$

Example 3: Trigonometric Composition

Find $\frac{d}{dx}[\cos(3x^2)]$

Solution:

Outer: $f(u) = \cos(u)$ , so $f'(u) = -\sin(u)$
Inner: $g(x) = 3x^2$ , so $g'(x) = 6x$

Applying the chain rule:

$\frac{d}{dx}[\cos(3x^2)] = -\sin(3x^2) \cdot 6x = -6x\sin(3x^2)$

Example 4: Square Root Composition

Find $\frac{d}{dx}[\sqrt{1 + x^2}]$

Solution: Write as $(1 + x^2)^{1/2}$ :

Outer: $f(u) = u^{1/2}$ , so $f'(u) = \frac{1}{2}u^{-1/2} = \frac{1}{2\sqrt{u}}$
Inner: $g(x) = 1 + x^2$ , so $g'(x) = 2x$

Applying the chain rule:

$\frac{d}{dx}[\sqrt{1+x^2}] = \frac{1}{2\sqrt{1+x^2}} \cdot 2x$

$= \frac{x}{\sqrt{1+x^2}}$

Example 5: Logarithmic Composition

Find $\frac{d}{dx}[\ln(\sin x)]$

Solution:

Outer: $f(u) = \ln(u)$ , so $f'(u) = \frac{1}{u}$
Inner: $g(x) = \sin(x)$ , so $g'(x) = \cos(x)$

Applying the chain rule:

$\frac{d}{dx}[\ln(\sin x)] = \frac{1}{\sin x} \cdot \cos x = \frac{\cos x}{\sin x} = \cot x$
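
The five results above can be checked numerically. The sketch below pairs each function with its chain-rule derivative and compares against a central-difference estimate. The test point `x0 = 0.7` and the step size are assumptions chosen for illustration.

```python
import math

def central_difference(func, x, h=1e-5):
    """Approximate func'(x) with a symmetric difference quotient."""
    return (func(x + h) - func(x - h)) / (2 * h)

# (label, function, chain-rule derivative from the worked examples)
examples = [
    ("(3x+1)^5",    lambda x: (3 * x + 1) ** 5,      lambda x: 15 * (3 * x + 1) ** 4),
    ("e^(x^2)",     lambda x: math.exp(x ** 2),      lambda x: 2 * x * math.exp(x ** 2)),
    ("cos(3x^2)",   lambda x: math.cos(3 * x ** 2),  lambda x: -6 * x * math.sin(3 * x ** 2)),
    ("sqrt(1+x^2)", lambda x: math.sqrt(1 + x ** 2), lambda x: x / math.sqrt(1 + x ** 2)),
    ("ln(sin x)",   lambda x: math.log(math.sin(x)), lambda x: math.cos(x) / math.sin(x)),
]

x0 = 0.7  # test point chosen so that sin(x0) > 0, which ln(sin x) requires
for label, func, derivative in examples:
    exact = derivative(x0)
    approx = central_difference(func, x0)
    print(f"{label:12s} formula = {exact:12.6f}   numeric = {approx:12.6f}")
```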

Nested Compositions: Chains of Chains

What if we have three or more functions composed together? The chain rule extends naturally — we just multiply all the derivatives:

For $F(x) = f(g(h(x)))$ :

$F'(x) = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$

Example: Triple Composition

Find $\frac{d}{dx}[\sin(e^{x^2})]$

Solution: Three layers of composition:

Outermost: $f(u) = \sin(u)$ → $f'(u) = \cos(u)$
Middle: $g(v) = e^v$ → $g'(v) = e^v$
Innermost: $h(x) = x^2$ → $h'(x) = 2x$

Applying the chain rule from outside in:

$\frac{d}{dx}[\sin(e^{x^2})] = \cos(e^{x^2}) \cdot e^{x^2} \cdot 2x$

$= 2x e^{x^2} \cos(e^{x^2})$

Strategy for Nested Compositions

Identify all the layers of composition from outside to inside
Write down each function and its derivative
Multiply all the derivatives together, evaluating each at the appropriate composed value
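
The strategy generalizes to any number of layers. The sketch below uses a hypothetical helper, `chain_derivative`, that takes (function, derivative) pairs. It walks from the innermost layer outward because each local derivative needs the value produced by the layers inside it. The product is the same as when the factors are written from outside in.

```python
import math

def chain_derivative(layers, x):
    """
    Differentiate a nested composition at x.

    `layers` lists (function, derivative) pairs from innermost to outermost.
    Returns the composed value and the product of local derivatives.
    """
    value = x
    derivative = 1.0
    for func, func_prime in layers:
        derivative *= func_prime(value)  # local derivative at the current input
        value = func(value)              # move one layer outward
    return value, derivative

layers = [
    (lambda x: x ** 2, lambda x: 2 * x),   # innermost h(x) = x^2
    (math.exp,         math.exp),          # middle    g(v) = e^v
    (math.sin,         math.cos),          # outermost f(u) = sin(u)
]

x = 0.5
value, derivative = chain_derivative(layers, x)
closed_form = 2 * x * math.exp(x ** 2) * math.cos(math.exp(x ** 2))
print(f"sin(e^(x^2)) at x = {x}: {value:.6f}")
print(f"product of local derivatives: {derivative:.6f}")
print(f"closed form from the example: {closed_form:.6f}")
```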

The Chain Rule in Leibniz Notation

Leibniz notation makes the chain rule look like fraction "cancellation":

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$

The $du$ 's appear to "cancel" like fractions!

While this isn't literally fraction cancellation ( $\frac{dy}{du}$ and $\frac{du}{dx}$ are limits of ratios, not ratios of separate quantities), the notation is incredibly useful for remembering the chain rule and extends naturally to multiple variables.

Example Using Leibniz Notation

Let $y = u^3$ and $u = 2x + 1$ . Find $\frac{dy}{dx}$ .

$\frac{dy}{du} = 3u^2$
$\frac{du}{dx} = 2$

Chain rule:

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 3u^2 \cdot 2 = 6u^2 = 6(2x+1)^2$

Machine Learning Applications: Backpropagation

The chain rule is the mathematical foundation of how neural networks learn. The algorithm is called backpropagation.

How Neural Networks Use the Chain Rule

A neural network is essentially a giant composition of functions:

$\text{output} = f_L(f_{L-1}(\cdots f_2(f_1(\text{input})) \cdots))$

Each $f_i$ is a layer (linear transformation + activation function)

To train the network, we need to compute how the loss changes with respect to each weight. This requires differentiating through the entire composition — exactly what the chain rule does!

Forward Pass

Compute output from input by applying each layer in sequence. Values flow forward through the network.

Backward Pass

Compute gradients from output to input using the chain rule. Gradients flow backward through the network.

The Backpropagation Equation

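For a network with output $y$ and loss $L$ , write $h_k$ for the output of layer $k$ (the subscript in $h_L$ is the index of the last layer, not the loss). The gradient with respect to a parameter $w$ in layer $i$ is: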

$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h_L} \cdot \frac{\partial h_L}{\partial h_{L-1}} \cdots \frac{\partial h_{i+1}}{\partial h_i} \cdot \frac{\partial h_i}{\partial w}$

Each term is a local derivative; the chain rule multiplies them all
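
To make the product of local derivatives concrete, here is a minimal sketch of a two-layer scalar network. The architecture ( $h_1 = \tanh(w_1 x)$ , $y = w_2 h_1$ , squared-error loss) and all numeric values are assumptions chosen for illustration, not a model from the text.

```python
import math

def loss_fn(w1, w2, x, target):
    """Tiny scalar network: h1 = tanh(w1 * x), y = w2 * h1, loss = (y - target)^2."""
    h1 = math.tanh(w1 * x)
    y = w2 * h1
    return (y - target) ** 2

def grad_w1(w1, w2, x, target):
    """dL/dw1 as a product of local derivatives, following the chain rule."""
    # Forward pass, keeping intermediate values for the backward pass
    h1 = math.tanh(w1 * x)
    y = w2 * h1
    # Backward pass: one local derivative per link in the chain
    dL_dy = 2 * (y - target)
    dy_dh1 = w2
    dh1_dw1 = (1 - h1 ** 2) * x      # derivative of tanh(w1 * x) with respect to w1
    return dL_dy * dy_dh1 * dh1_dw1

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0   # arbitrary illustrative values
analytic = grad_w1(w1, w2, x, target)

# Independent check: central difference on the loss itself
h = 1e-6
numeric = (loss_fn(w1 + h, w2, x, target) - loss_fn(w1 - h, w2, x, target)) / (2 * h)
print(f"chain rule: {analytic:.6f}")
print(f"numeric:    {numeric:.6f}")
```

Real networks replace the scalars with vectors and matrices, so the local derivatives become Jacobians, but the structure of the product is the same.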

Why Deep Learning Works

The chain rule enables us to compute gradients through arbitrarily deep compositions efficiently. This is why we can train neural networks with hundreds of layers!

Automatic differentiation (autograd) libraries like PyTorch implement the chain rule automatically
Gradient descent uses these gradients to update weights and minimize loss
Every modern AI system — from GPT to image classifiers — relies on the chain rule
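
As a rough illustration of how gradient descent consumes these gradients, the sketch below fits a single weight. The model ( $y = wx$ ), the data point, the learning rate, and the step count are all assumptions picked to keep the example tiny.

```python
def loss(w, x, target):
    """Squared error of a one-weight model y = w * x."""
    return (w * x - target) ** 2

def loss_gradient(w, x, target):
    """Chain rule: outer derivative 2(wx - target) times inner derivative x."""
    return 2 * (w * x - target) * x

x, target = 2.0, 6.0     # a single training example; the best weight is target / x = 3
w = 0.0                  # initial guess
learning_rate = 0.1

for step in range(20):
    w -= learning_rate * loss_gradient(w, x, target)   # step against the gradient

print(f"learned w = {w:.6f}, loss = {loss(w, x, target):.2e}")
```

Each update moves `w` in the direction that reduces the loss. With these particular values the iterates converge toward 3, the weight that reproduces the target exactly.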

Python Implementation

Numerical Verification

Let's verify the chain rule numerically by computing the derivative two ways:

Verifying the Chain Rule Numerically

File: chain_rule_verify.py

Chain Rule Verification: This function numerically verifies the chain rule by computing F'(x) two ways: using the formula f'(g(x)) · g'(x), and directly from the definition.
Inner Function Evaluation: First we compute g(x), the value that will be fed into the outer function f. This is the 'u' in f(u).
Inner Derivative: We compute g'(x) using the central difference formula. This tells us how fast the inner function is changing at x.
Outer Derivative at Inner Value: We compute f'(g(x)), the derivative of f evaluated at the point u = g(x). This is the key insight of the chain rule: evaluate the outer derivative at the inner value.
Chain Rule Formula: The chain rule: F'(x) = f'(g(x)) · g'(x). We multiply the outer derivative (evaluated at g(x)) by the inner derivative.

import numpy as np

def chain_rule_numerical(f, g, x, h=1e-7):
    """
    Verify the chain rule numerically.

    For F(x) = f(g(x)), the chain rule states:
    F'(x) = f'(g(x)) * g'(x)

    We'll compute both sides and compare.
    """
    # Compute g(x) - the inner function value
    g_x = g(x)

    # Numerical derivative of g at x
    g_prime_x = (g(x + h) - g(x - h)) / (2 * h)

    # Numerical derivative of f at g(x)
    f_prime_at_gx = (f(g_x + h) - f(g_x - h)) / (2 * h)

    # Chain rule: f'(g(x)) * g'(x)
    chain_derivative = f_prime_at_gx * g_prime_x

    # Direct numerical derivative of composition
    F = lambda t: f(g(t))
    F_prime_numerical = (F(x + h) - F(x - h)) / (2 * h)

    return {
        "x": x,
        "g(x)": g_x,
        "g'(x)": g_prime_x,
        "f'(g(x))": f_prime_at_gx,
        "Chain Rule": chain_derivative,
        "Numerical": F_prime_numerical,
        "Difference": abs(chain_derivative - F_prime_numerical)
    }

# Example 1: sin(x^2) where f(u) = sin(u), g(x) = x^2
f1 = lambda u: np.sin(u)
g1 = lambda x: x**2

result1 = chain_rule_numerical(f1, g1, x=np.sqrt(np.pi/4))
print("Example: F(x) = sin(x²)")
for key, value in result1.items():
    print(f"  {key}: {value:.6f}")
print()

# Example 2: e^(3x) where f(u) = e^u, g(x) = 3x
f2 = lambda u: np.exp(u)
g2 = lambda x: 3 * x

result2 = chain_rule_numerical(f2, g2, x=1.0)
print("Example: F(x) = e^(3x)")
for key, value in result2.items():
    print(f"  {key}: {value:.6f}")
print()

# Verify: d/dx[e^(3x)] = 3*e^(3x)
print(f"Expected: 3*e^3 = {3 * np.exp(3):.6f}")

The Chain Rule in Autograd

Here's how the chain rule appears in automatic differentiation, the foundation of deep learning frameworks:

Chain Rule in Automatic Differentiation

File: chain_rule_autograd.py

Autograd Tensor: A tensor that tracks both its value and gradient. The chain rule is what makes automatic differentiation possible!
Exponential Backward Pass: The local derivative of e^x is e^x. The chain rule says: multiply by the upstream gradient (result.grad) to get the gradient flowing to the input.
Square Backward Pass: For y = x², the local derivative is 2x. Chain rule: multiply 2x by the upstream gradient to continue the chain of derivatives.
Product Rule in Backward: When we have z = a*b, the backward pass uses both the product rule (for local gradients) and chain rule (to connect to upstream gradients).
Topological Sort: We sort nodes so that each node is processed after all nodes that depend on it. This ensures we have the upstream gradient before computing local gradients.
Composition Example: f(x) = e^(x²) is a composition of two functions. The chain rule gives f'(x) = e^(x²) · 2x. This is exactly what backpropagation computes!

import numpy as np

class Tensor:
    """
    A simple autograd tensor that tracks gradients.
    The chain rule is the mathematical foundation of backpropagation!
    """
    def __init__(self, value, children=(), op=''):
        self.value = value
        self.grad = 0.0
        self._backward = lambda: None
        self._children = children
        self._op = op

    def __repr__(self):
        return f"Tensor({self.value:.4f}, grad={self.grad:.4f})"

# Define operations with backward pass (chain rule!)

def exp(x: Tensor) -> Tensor:
    """
    Forward: y = e^x
    Backward: dy/dx = e^x (chain rule: multiply by upstream gradient)
    """
    result = Tensor(np.exp(x.value), (x,), 'exp')

    def _backward():
        # Chain rule: local gradient * upstream gradient
        # For exp: local gradient is e^x = result.value
        x.grad += result.value * result.grad

    result._backward = _backward
    return result

def square(x: Tensor) -> Tensor:
    """
    Forward: y = x^2
    Backward: dy/dx = 2x (chain rule)
    """
    result = Tensor(x.value ** 2, (x,), 'square')

    def _backward():
        # Chain rule: local gradient * upstream gradient
        # For square: local gradient is 2x
        x.grad += 2 * x.value * result.grad

    result._backward = _backward
    return result

def multiply(a: Tensor, b: Tensor) -> Tensor:
    """
    Forward: z = a * b
    Backward uses product rule AND chain rule
    """
    result = Tensor(a.value * b.value, (a, b), 'mul')

    def _backward():
        # Product rule gives local gradients: dz/da = b, dz/db = a
        # Chain rule: multiply by upstream gradient
        a.grad += b.value * result.grad
        b.grad += a.value * result.grad

    result._backward = _backward
    return result

def backward(tensor: Tensor):
    """Backpropagate gradients using chain rule."""
    # Topological sort
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(tensor)

    # Backpropagate
    tensor.grad = 1.0  # dL/dL = 1
    for node in reversed(topo):
        node._backward()

# Example: f(x) = e^(x²)
# This is a composition! Chain rule: f'(x) = e^(x²) * 2x

x = Tensor(2.0)
y = square(x)      # y = x² = 4
z = exp(y)         # z = e^4

print("Forward pass:")
print(f"  x = {x}")
print(f"  y = x² = {y}")
print(f"  z = e^(x²) = {z}")
print()

backward(z)

print("Backward pass (Chain Rule!):")
print(f"  dz/dy = e^y = {np.exp(y.value):.4f}")
print(f"  dy/dx = 2x = {2 * x.value:.4f}")
print(f"  dz/dx = dz/dy * dy/dx = {x.grad:.4f}")
print()
print(f"Verification: e^(x²) * 2x = e^4 * 4 = {np.exp(4) * 4:.4f}")

Common Mistakes to Avoid

Mistake 1: Forgetting the inner derivative

Wrong: $\frac{d}{dx}[\sin(3x)] = \cos(3x)$

Correct: $\frac{d}{dx}[\sin(3x)] = \cos(3x) \cdot 3 = 3\cos(3x)$

Always remember to multiply by the derivative of the inner function!

Mistake 2: Wrong evaluation point for outer derivative

Wrong: $\frac{d}{dx}[e^{x^2}] = e^x \cdot 2x$

Correct: $\frac{d}{dx}[e^{x^2}] = e^{x^2} \cdot 2x$

The outer derivative must be evaluated at $u = g(x)$ , not at $x$ !

Mistake 3: Applying chain rule when unnecessary

For $f(x) = x^5$ , just use the power rule directly: $f'(x) = 5x^4$

But for $f(x) = (2x)^5 = 32x^5$ , you can use either the chain rule or simplify first; both give $f'(x) = 160x^4$ .

Use the chain rule only when there's actual composition involved.

Mistake 4: Confusing the order of composition

Be careful: $f(g(x)) \neq g(f(x))$ in general!

$\sin(x^2)$ : first square, then take sine
$(\sin x)^2$ : first take sine, then square

These have different derivatives! Make sure you identify the correct inner and outer functions.

Mistake 5: Missing the chain rule with implicit inner functions

Wrong: $\frac{d}{dx}[\sqrt{x^2 + 1}] = \frac{1}{2\sqrt{x^2+1}}$

Correct: $\frac{d}{dx}[\sqrt{x^2+1}] = \frac{1}{2\sqrt{x^2+1}} \cdot 2x = \frac{x}{\sqrt{x^2+1}}$

Whenever anything other than a plain $x$ is inside another function, you need the chain rule!
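
A numerical derivative is a cheap way to catch these mistakes. The sketch below tests the "wrong" and "correct" formulas from Mistakes 1 and 2 against a central-difference estimate, and shows that the two compositions from Mistake 4 really do have different derivatives. The test point and helper name are illustrative assumptions.

```python
import math

def central_difference(func, x, h=1e-5):
    """Approximate func'(x) with a symmetric difference quotient."""
    return (func(x + h) - func(x - h)) / (2 * h)

x = 0.8  # arbitrary test point

# Mistake 1: forgetting the inner derivative of sin(3x)
numeric_1 = central_difference(lambda t: math.sin(3 * t), x)
wrong_1 = math.cos(3 * x)
correct_1 = 3 * math.cos(3 * x)
print(f"sin(3x):  numeric = {numeric_1:.4f}, wrong = {wrong_1:.4f}, correct = {correct_1:.4f}")

# Mistake 2: evaluating the outer derivative at x instead of g(x)
numeric_2 = central_difference(lambda t: math.exp(t ** 2), x)
wrong_2 = math.exp(x) * 2 * x
correct_2 = math.exp(x ** 2) * 2 * x
print(f"e^(x^2):  numeric = {numeric_2:.4f}, wrong = {wrong_2:.4f}, correct = {correct_2:.4f}")

# Mistake 4: sin(x^2) and (sin x)^2 are different compositions
d_sin_of_square = central_difference(lambda t: math.sin(t ** 2), x)
d_square_of_sin = central_difference(lambda t: math.sin(t) ** 2, x)
print(f"d/dx sin(x^2) = {d_sin_of_square:.4f}, d/dx (sin x)^2 = {d_square_of_sin:.4f}")
```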

Test Your Understanding

Chain Rule Quiz (question 1 of 6)

Find the derivative of f(x) = sin(2x)

Outer: sin(u), Inner: u = 2x

Summary

The chain rule is the fundamental technique for differentiating composed functions. It states that rates of change multiply when functions are chained together.

Key Formula

$\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)$

Or in Leibniz notation: $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$

Key Concepts

Concept	Description
Outer function	The function on the "outside" — the last operation applied
Inner function	The function on the "inside" — the first operation applied to x
Rates multiply	The chain rule says rates of change multiply through compositions
Backpropagation	The chain rule applied repeatedly to compute gradients in neural networks
Leibniz notation	dy/dx = (dy/du)(du/dx) — looks like fraction cancellation
Common mistake	Forgetting to multiply by the inner derivative g'(x)

Key Takeaways

The chain rule is the most important differentiation rule — it handles all function compositions
Identify the outer function f and inner function g before applying the rule
Evaluate the outer derivative at the inner function value: f'(g(x)), not f'(x)
Multiply by the inner derivative g'(x) — never forget this step!
For nested compositions, apply the chain rule from outside in, multiplying all derivatives
The chain rule is the foundation of backpropagation, making modern AI possible

The Chain Rule in One Sentence:

"Differentiate the outside, leave the inside alone, then multiply by the derivative of the inside."

Coming Next: In the next section, we'll learn Implicit Differentiation — how to find derivatives when the function isn't given explicitly, using the chain rule as our key tool.