Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Apply the chain rule to functions of several variables when the independent variables are themselves functions of other variables
Construct dependency trees to identify all paths through which changes propagate
Compute partial derivatives using the chain rule for various cases: one independent variable, two independent variables, and the general case
Use implicit differentiation as a special application of the chain rule
Connect the multivariable chain rule to backpropagation in neural networks and understand why this is the foundation of deep learning

The Big Picture: Tracking Change Through Networks

"When a butterfly flaps its wings in Brazil, how does that affect the weather in Tokyo? The chain rule tells us exactly how changes propagate through interconnected systems."

In single-variable calculus, the chain rule tells us how to differentiate composite functions: if $z = f(y)$ and $y = g(x)$, then $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$. But what happens when functions depend on multiple variables, and those variables depend on other variables?

The multivariable chain rule extends this powerful concept to functions of several variables. The key insight is that when a quantity $z$ depends on intermediate variables like $x$ and $y$, changes can propagate through multiple paths. We must account for all of them.

Why This Matters

The multivariable chain rule is essential in:

Machine Learning: Backpropagation in neural networks is nothing but the chain rule applied systematically
Physics: Computing how quantities change in different coordinate systems (Cartesian to polar, etc.)
Economics: Understanding how changes in base inputs affect downstream economic indicators
Engineering: Sensitivity analysis and related rates problems in multiple dimensions

Historical Context

The chain rule in its single-variable form was known to Leibniz in the 17th century. The extension to multiple variables developed throughout the 18th and 19th centuries as mathematicians like Euler, Lagrange, and Cauchy formalized the calculus of several variables.

The notation $\frac{\partial z}{\partial x}$ for partial derivatives was introduced by Legendre in 1786 and popularized by Jacobi in the 1840s. This notation elegantly captures the key idea: the "curly d" reminds us that we hold other variables constant.

In the 1960s and 1970s, researchers in control theory, numerical analysis, and optimization worked out how to apply the chain rule to compute gradients efficiently. Backpropagation, popularized for neural networks by Rumelhart, Hinton, and Williams in 1986, is the multivariable chain rule applied to computational graphs, and it became the foundation of modern deep learning.

Review: Single-Variable Chain Rule

Let's first recall the single-variable chain rule. If $z = f(y)$ and $y = g(x)$, then the derivative of $z$ with respect to $x$ is:

Single-Variable Chain Rule

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

Multiply the derivatives along the chain

Intuition: If $y$ increases by 3 units when $x$ increases by 1 unit (i.e., $\frac{dy}{dx} = 3$), and $z$ increases by 2 units when $y$ increases by 1 unit (i.e., $\frac{dz}{dy} = 2$), then $z$ increases by $2 \cdot 3 = 6$ units when $x$ increases by 1 unit. The rates multiply.
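As a quick numerical check of the single-variable rule, the sketch below compares the product of the two link derivatives with a direct derivative of the composite. The functions $g(x) = x^2$ and $f(y) = \sin y$, the point $x_0 = 1.5$, and the helper name `central_diff` are illustrative choices, not part of the text above.

```python
import math

def g(x):
    """Inner function y = g(x) = x^2 (illustrative choice)."""
    return x ** 2

def f(y):
    """Outer function z = f(y) = sin(y) (illustrative choice)."""
    return math.sin(y)

def central_diff(func, a, h=1e-6):
    """Central-difference estimate of func'(a)."""
    return (func(a + h) - func(a - h)) / (2 * h)

x0 = 1.5
dy_dx = central_diff(g, x0)        # slope of the inner link at x0
dz_dy = central_diff(f, g(x0))     # slope of the outer link at y = g(x0)
chain = dz_dy * dy_dx              # multiply the derivatives along the chain

direct = central_diff(lambda x: f(g(x)), x0)   # differentiate the composite directly
exact = math.cos(x0 ** 2) * 2 * x0             # closed form: cos(x^2) * 2x

print(f"chain rule: {chain:.6f}")
print(f"direct:     {direct:.6f}")
print(f"exact:      {exact:.6f}")
```

All three values should agree to several decimal places, up to finite-difference error.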

Case 1: One Independent Variable

Suppose $z = f(x, y)$ where both $x$ and $y$ are functions of a single variable $t$:

$x = g(t)$
$y = h(t)$

How does $z$ change with $t$? A change in $t$ causes both $x$ and $y$ to change, and each of these changes affects $z$.

Chain Rule: Case 1

$$\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}$$

Sum the contributions from all paths: t → x → z and t → y → z
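A short worked instance may help make the formula concrete. The functions below ($z = x^2 y$, $x = t^2$, $y = t^3$) are chosen purely for illustration; substituting first gives $z = t^7$, so the chain rule result can be checked against the direct derivative $7t^6$.

```latex
\begin{aligned}
z &= x^2 y, \qquad x = t^2, \qquad y = t^3 \\
\frac{dz}{dt} &= \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}
  = (2xy)(2t) + (x^2)(3t^2) \\
&= (2t^5)(2t) + (t^4)(3t^2) = 4t^6 + 3t^6 = 7t^6
\end{aligned}
```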

The Dependency Tree

A dependency tree (also called a computational graph) shows how variables depend on each other. For Case 1:

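[Figure: dependency tree with $z$ at the top, $x$ and $y$ in the middle, and $t$ at the bottom. Arrows run t → x → z and t → y → z.]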

The Rule: To find $\frac{dz}{dt}$, identify all paths from $t$ to $z$. For each path, multiply the derivatives along that path. Then add all the products together.

Path 1: $t \to x \to z$ contributes $\frac{\partial z}{\partial x} \cdot \frac{dx}{dt}$
Path 2: $t \to y \to z$ contributes $\frac{\partial z}{\partial y} \cdot \frac{dy}{dt}$

Interactive Dependency Explorer

[Interactive figure: chain rule dependency tree with input $t$, intermediate variables $x$ and $y$, and output $z$. Each path from $t$ to $z$ contributes a product of derivatives.]

Key Insight: To find how z changes with t, we must account for all paths: t affects x and y, which both affect z. We multiply derivatives along each path and add them together.

Case 2: Two Independent Variables

Now suppose $z = f(x, y)$ where $x$ and $y$ are both functions of two independent variables $s$ and $t$:

$x = g(s, t)$
$y = h(s, t)$

Now we need to find both $\frac{\partial z}{\partial s}$ and $\frac{\partial z}{\partial t}$.

Chain Rule: Case 2

$$\frac{\partial z}{\partial s} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial s} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial s}$$

$$\frac{\partial z}{\partial t} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t}$$

Partial vs Total Derivatives

Notice that we use partial derivative notation $\frac{\partial}{\partial s}$ rather than $\frac{d}{ds}$ because $z$ ultimately depends on both $s$ and $t$. When we compute $\frac{\partial z}{\partial s}$, we hold $t$ constant.
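To see Case 2 with numbers, here is a minimal sketch using the illustrative functions $z = xy$, $x = st$, $y = s + t$ (not taken from the text). It evaluates both chain rule formulas at $(s, t) = (2, 3)$ and compares them with central differences of $z$ written directly in terms of $s$ and $t$. The function names are hypothetical.

```python
def chain_rule_case2(s, t):
    """Return (dz/ds, dz/dt) for z = x*y with x = s*t and y = s + t."""
    x, y = s * t, s + t
    dz_dx, dz_dy = y, x          # partials of z = x*y
    dx_ds, dx_dt = t, s          # partials of x = s*t
    dy_ds, dy_dt = 1.0, 1.0      # partials of y = s + t
    dz_ds = dz_dx * dx_ds + dz_dy * dy_ds   # hold t constant
    dz_dt = dz_dx * dx_dt + dz_dy * dy_dt   # hold s constant
    return dz_ds, dz_dt

def z_direct(s, t):
    """z written directly in terms of s and t: z = s*t*(s + t)."""
    return s * t * (s + t)

s0, t0, h = 2.0, 3.0, 1e-6
dz_ds, dz_dt = chain_rule_case2(s0, t0)

# Finite-difference check: vary one input while holding the other fixed
fd_s = (z_direct(s0 + h, t0) - z_direct(s0 - h, t0)) / (2 * h)
fd_t = (z_direct(s0, t0 + h) - z_direct(s0, t0 - h)) / (2 * h)

print(f"dz/ds: chain rule = {dz_ds:.4f}, finite difference = {fd_s:.4f}")
print(f"dz/dt: chain rule = {dz_dt:.4f}, finite difference = {fd_t:.4f}")
```

By hand, $z = s^2 t + s t^2$, so $\partial z/\partial s = 2st + t^2 = 21$ and $\partial z/\partial t = s^2 + 2st = 16$ at this point.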

The General Chain Rule

The pattern generalizes to any number of intermediate and independent variables. If $z$ depends on $x_1, x_2, \ldots, x_m$, and each $x_i$ depends on $t_1, t_2, \ldots, t_n$, then:

General Chain Rule

$$\frac{\partial z}{\partial t_j} = \sum_{i=1}^{m} \frac{\partial z}{\partial x_i} \frac{\partial x_i}{\partial t_j}$$

Sum over all intermediate variables $x_i$

Reading the formula: To find how $z$ changes with $t_j$, look at every intermediate variable $x_i$. For each, multiply how $z$ changes with $x_i$ by how $x_i$ changes with $t_j$. Sum all these contributions.
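The general formula is a sum over intermediate variables for each input, which is the same as multiplying the gradient of $z$ by the table of partials $\partial x_i / \partial t_j$. The sketch below spells that out with plain Python lists. The function name `chain_rule_general` and the example ($z = x_1 + x_2 x_3$ with $x_1 = t_1 t_2$, $x_2 = t_1 + t_2$, $x_3 = t_1 - t_2$) are assumptions made for illustration.

```python
def chain_rule_general(grad_z_x, jac_x_t):
    """
    grad_z_x[i]   = dz/dx_i      (length m)
    jac_x_t[i][j] = dx_i/dt_j    (m rows, n columns)
    Returns a list whose j-th entry is dz/dt_j = sum_i (dz/dx_i)(dx_i/dt_j).
    """
    m = len(grad_z_x)
    n = len(jac_x_t[0])
    return [sum(grad_z_x[i] * jac_x_t[i][j] for i in range(m)) for j in range(n)]

# Illustrative example evaluated at (t1, t2) = (2, 1)
t1, t2 = 2.0, 1.0
x1, x2, x3 = t1 * t2, t1 + t2, t1 - t2

grad_z_x = [1.0, x3, x2]     # (dz/dx1, dz/dx2, dz/dx3) for z = x1 + x2*x3
jac_x_t = [
    [t2, t1],      # dx1/dt1, dx1/dt2
    [1.0, 1.0],    # dx2/dt1, dx2/dt2
    [1.0, -1.0],   # dx3/dt1, dx3/dt2
]

grad_z_t = chain_rule_general(grad_z_x, jac_x_t)
print(grad_z_t)   # [5.0, 0.0]
```

Substituting directly gives $z = t_1 t_2 + t_1^2 - t_2^2$, so $\partial z/\partial t_1 = t_2 + 2t_1 = 5$ and $\partial z/\partial t_2 = t_1 - 2t_2 = 0$, matching the output.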

The Tree Rule

To apply the chain rule:

Draw the dependency tree from inputs to outputs
Identify all paths from your input variable to the output
For each path, multiply all derivatives along the path
Add all the path contributions together
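The four steps above can be sketched as a small recursive routine over a dependency graph. The graph encoding (a dictionary mapping each node to `(child, derivative)` pairs), the function name `total_derivative`, and the edge values are all assumptions made for illustration. The routine assumes the graph has no cycles.

```python
def total_derivative(graph, source, target):
    """
    Sum, over every path from `source` to `target`, the product of the
    edge derivatives along that path.

    graph maps a node to a list of (child, d_child/d_node) pairs.
    """
    if source == target:
        return 1.0   # empty path: a product with no factors
    return sum(d * total_derivative(graph, child, target)
               for child, d in graph.get(source, []))

# Case 1 tree: t feeds x and y, and both feed z (edge values are made up)
graph = {
    "t": [("x", 2.0), ("y", 0.5)],   # dx/dt = 2.0, dy/dt = 0.5
    "x": [("z", 1.0)],               # dz/dx = 1.0
    "y": [("z", 4.0)],               # dz/dy = 4.0
}

dz_dt = total_derivative(graph, "t", "z")
print(dz_dt)   # 2.0*1.0 + 0.5*4.0 = 4.0
```

Enumerating paths like this is fine for small trees. In large graphs the number of paths can grow very quickly, which is one reason backpropagation reuses intermediate results instead.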

Interactive Chain Rule Visualizer

The following concrete example shows how a change in the parameter $t$ propagates through a composite function.

Chain Rule in Action: How Changes Propagate

z(t) = x(t) + y(t)² = t² + sin²(t), evaluated at t = 1.00 with step Δt = 0.50.

At t = 1.00:

x(t) = t² = 1.000
y(t) = sin(t) = 0.841
z = x + y² = 1.708

Derivatives:

dx/dt = 2t = 2.000
dy/dt = cos(t) = 0.540
∂z/∂x = 1.000
∂z/∂y = 2y = 1.683

Chain Rule:

dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
= (1.00)(2.00) + (1.68)(0.54)
= 2.9093

Linear Approximation vs Reality:

Predicted Δz = (dz/dt)·Δt = 1.4546
Actual Δz = z(1.50) - z(1.00) = 1.5369
Error: 0.082274 (smaller Δt → better approximation)
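The figures in this example can be reproduced with a few lines of Python. This is a minimal sketch that hard-codes the same functions ($x = t^2$, $y = \sin t$, $z = x + y^2$). The variable and function names are illustrative.

```python
import math

def x(t):
    return t ** 2

def y(t):
    return math.sin(t)

def z(t):
    return x(t) + y(t) ** 2

t, dt = 1.0, 0.5

# Chain rule: dz/dt = (dz/dx)(dx/dt) + (dz/dy)(dy/dt)
dz_dt = 1.0 * (2 * t) + (2 * y(t)) * math.cos(t)

predicted = dz_dt * dt          # linear (tangent-line) estimate of the change in z
actual = z(t + dt) - z(t)       # true change in z
error = actual - predicted

print(f"dz/dt        = {dz_dt:.4f}")
print(f"predicted dz = {predicted:.4f}")
print(f"actual dz    = {actual:.4f}")
print(f"error        = {error:.6f}")

# The error shrinks roughly like dt^2 as the step gets smaller
for step in (0.5, 0.1, 0.01):
    gap = (z(t + step) - z(t)) - dz_dt * step
    print(f"dt = {step:<5} error = {gap:.6f}")
```

The first four printed values should match the numbers quoted in the example (2.9093, 1.4546, 1.5369, 0.082274).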

Implicit Differentiation Revisited

Implicit differentiation, which we learned earlier for single-variable calculus, is actually a special case of the chain rule for multivariable functions.

Suppose $F(x, y) = 0$ defines $y$ implicitly as a function of $x$. Since $F(x, y(x)) = 0$ for all $x$, differentiating both sides with respect to $x$ using the chain rule gives:

$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y}\frac{dy}{dx} = 0$$

Solving for $\frac{dy}{dx}$:

Implicit Differentiation Formula

$$\frac{dy}{dx} = -\frac{F_x}{F_y} = -\frac{\partial F/\partial x}{\partial F/\partial y}$$

(provided $F_y \neq 0$)
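As an illustration of the formula, the sketch below applies it to the circle $x^2 + y^2 = 25$ at the point $(3, 4)$, where solving explicitly for $y = \sqrt{25 - x^2}$ gives slope $-3/4$. The circle, the function name `implicit_slope`, and the zero threshold are assumptions chosen for this example.

```python
import math

def F(x, y):
    """Illustrative constraint: the circle x^2 + y^2 - 25 = 0."""
    return x ** 2 + y ** 2 - 25.0

def implicit_slope(F, x, y, h=1e-6):
    """dy/dx = -F_x / F_y, with both partials estimated by central differences."""
    F_x = (F(x + h, y) - F(x - h, y)) / (2 * h)
    F_y = (F(x, y + h) - F(x, y - h)) / (2 * h)
    if abs(F_y) < 1e-12:
        # The formula requires F_y != 0 (e.g. it fails at (5, 0), where the tangent is vertical)
        raise ValueError("F_y is numerically zero; the formula does not apply here")
    return -F_x / F_y

slope = implicit_slope(F, 3.0, 4.0)
explicit = -3.0 / math.sqrt(25.0 - 3.0 ** 2)   # derivative of sqrt(25 - x^2) at x = 3

print(f"implicit formula: {slope:.6f}")
print(f"explicit formula: {explicit:.6f}")
```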

Worked Examples

Example: Polar Coordinates

Suppose $z = f(x, y)$ and we want to express the partial derivatives in polar coordinates, where $x = r\cos\theta$ and $y = r\sin\theta$.

Find $\frac{\partial z}{\partial r}$:

$$\frac{\partial z}{\partial r} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial r} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial r} = \frac{\partial z}{\partial x}\cos\theta + \frac{\partial z}{\partial y}\sin\theta$$

Find $\frac{\partial z}{\partial \theta}$:

$$\frac{\partial z}{\partial \theta} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial \theta} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial \theta} = \frac{\partial z}{\partial x}(-r\sin\theta) + \frac{\partial z}{\partial y}(r\cos\theta)$$

This is exactly how coordinate transformations work—the chain rule tells us how derivatives transform from one coordinate system to another.
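The two polar formulas can be checked numerically for a specific function. The sketch below uses $f(x, y) = x^2 y$, which is an arbitrary choice for illustration, and compares the chain rule results with central differences of $z$ taken directly in $r$ and $\theta$. All function names are hypothetical.

```python
import math

def f(x, y):
    """Illustrative function z = f(x, y) = x^2 * y."""
    return x ** 2 * y

def f_x(x, y):
    return 2 * x * y     # partial of f with respect to x

def f_y(x, y):
    return x ** 2        # partial of f with respect to y

def polar_partials(r, theta):
    """Chain rule with x = r cos(theta), y = r sin(theta)."""
    x, y = r * math.cos(theta), r * math.sin(theta)
    dz_dr = f_x(x, y) * math.cos(theta) + f_y(x, y) * math.sin(theta)
    dz_dtheta = f_x(x, y) * (-r * math.sin(theta)) + f_y(x, y) * (r * math.cos(theta))
    return dz_dr, dz_dtheta

def z_polar(r, theta):
    """z expressed directly in polar coordinates."""
    return f(r * math.cos(theta), r * math.sin(theta))

r0, theta0, h = 2.0, math.pi / 6, 1e-6
dz_dr, dz_dtheta = polar_partials(r0, theta0)
fd_r = (z_polar(r0 + h, theta0) - z_polar(r0 - h, theta0)) / (2 * h)
fd_theta = (z_polar(r0, theta0 + h) - z_polar(r0, theta0 - h)) / (2 * h)

print(f"dz/dr:     chain rule = {dz_dr:.6f}, finite difference = {fd_r:.6f}")
print(f"dz/dtheta: chain rule = {dz_dtheta:.6f}, finite difference = {fd_theta:.6f}")
```

For this choice $z = r^3 \cos^2\theta \sin\theta$, so at $r = 2$, $\theta = \pi/6$ the exact values are $\partial z/\partial r = 4.5$ and $\partial z/\partial\theta = \sqrt{3}$.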

Real-World Applications

Physics: Related Rates in Multiple Dimensions

In physics, quantities often depend on multiple variables that change with time. The chain rule is essential for computing how rates of change propagate.

Example: The pressure $P$ of an ideal gas depends on volume $V$ and temperature $T$ through $P = nRT/V$. If both $V$ and $T$ change with time, how fast does $P$ change?

$$\frac{dP}{dt} = \frac{\partial P}{\partial V}\frac{dV}{dt} + \frac{\partial P}{\partial T}\frac{dT}{dt} = -\frac{nRT}{V^2}\frac{dV}{dt} + \frac{nR}{V}\frac{dT}{dt}$$
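To put numbers on this formula, the sketch below evaluates it for one mole of gas. The state ($T = 300$ K, $V = 0.025$ m³) and the rates ($dV/dt = 10^{-4}$ m³/s, $dT/dt = 2$ K/s) are hypothetical values chosen only for illustration, and the function names are my own.

```python
R = 8.314  # gas constant in J/(mol*K)

def pressure(n, T, V):
    """Ideal gas law P = nRT/V, in pascals."""
    return n * R * T / V

def dP_dt(n, T, V, dV_dt, dT_dt):
    """Chain rule: dP/dt = (dP/dV)(dV/dt) + (dP/dT)(dT/dt)."""
    dP_dV = -n * R * T / V ** 2
    dP_dT = n * R / V
    return dP_dV * dV_dt + dP_dT * dT_dt

# Hypothetical state and rates of change
n, T0, V0 = 1.0, 300.0, 0.025
dV_dt, dT_dt = 1e-4, 2.0

rate = dP_dt(n, T0, V0, dV_dt, dT_dt)

# Cross-check: move along V(t) = V0 + dV_dt*t, T(t) = T0 + dT_dt*t for a small time h
h = 1e-4
P_plus = pressure(n, T0 + dT_dt * h, V0 + dV_dt * h)
P_minus = pressure(n, T0 - dT_dt * h, V0 - dV_dt * h)
rate_fd = (P_plus - P_minus) / (2 * h)

print(f"chain rule:        {rate:.3f} Pa/s")
print(f"finite difference: {rate_fd:.3f} Pa/s")
```

With these inputs the expansion term lowers the pressure by about 399 Pa/s and the heating term raises it by about 665 Pa/s, so the net rate is positive.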

Other physics applications include:

Thermodynamics: Changes in entropy, internal energy, and free energy with multiple state variables
Fluid mechanics: The material derivative $\frac{D}{Dt} = \frac{\partial}{\partial t} + \mathbf{v} \cdot \nabla$
Electromagnetism: How fields change in moving reference frames

Machine Learning: Backpropagation

The multivariable chain rule is the mathematical foundation of deep learning. When training a neural network, we need to compute how the loss $L$ changes with respect to every weight in the network. This is exactly what the chain rule does!

In a neural network, we have a chain of computations:

input → linear → activation → linear → activation → ... → loss

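Each layer's output depends on the previous layer's output and the layer's weights. To find $\frac{\partial L}{\partial w}$ for a weight $w$ deep in the network, we multiply the partial derivatives along each path connecting $w$ to $L$ and add the products over all such paths. Backpropagation organizes this computation by working backward from $L$, so that intermediate products are reused rather than recomputed.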

The Vanishing/Exploding Gradient Problem

Since backpropagation multiplies many derivatives together, the product can become vanishingly small if each factor is much smaller than 1 in magnitude, or explosively large if each factor is much larger than 1 in magnitude. This is why careful initialization and normalization are crucial in deep learning!
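A toy calculation shows how quickly such products shrink or grow. The sketch below builds a chain of scalar layers $a_k = \sigma(w \cdot a_{k-1})$ (or $a_k = w \cdot a_{k-1}$ for the identity case) and accumulates $\partial a_n / \partial a_0$ one factor at a time. The scalar architecture, the shared weight, and the function name `chain_gradient` are simplifying assumptions, not a model of a real network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, weight, a0=0.5, activation="sigmoid"):
    """Derivative of the final activation with respect to a0 through `depth` scalar layers."""
    a, grad = a0, 1.0
    for _ in range(depth):
        z = weight * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local = a * (1.0 - a) * weight   # da_k/da_{k-1} = sigma'(z) * w
        else:                                # identity activation
            a = z
            local = weight                   # da_k/da_{k-1} = w
        grad *= local                        # chain rule: multiply along the path
    return grad

for depth in (1, 5, 10, 20):
    vanishing = chain_gradient(depth, 1.0)
    exploding = chain_gradient(depth, 2.0, activation="identity")
    print(f"depth {depth:>2}: sigmoid chain = {vanishing:.3e}, linear chain (w=2) = {exploding:.3e}")
```

Because $\sigma'(z) \le 0.25$, a sigmoid chain with $w = 1$ is bounded by $0.25^{\text{depth}}$, while the linear chain with $w = 2$ grows as $2^{\text{depth}}$.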

Interactive Backpropagation Demo

The following snapshot of the demo traces the chain rule through a simple neural network, showing how gradients flow backward from the loss to the weights.

The Chain Rule IS Backpropagation

Weights: w₁ = 0.50, w₂ = -0.30
Inputs: x₁ = 2, x₂ = 3

Forward Pass: Compute z (linear combination of inputs and weights)

z = w₁x₁ + w₂x₂ = (0.50)(2) + (-0.30)(3) = 0.100

Gradients (for gradient descent update):

∂L/∂w₁ = -0.473835
∂L/∂w₂ = -0.710753

Update rule: w₁ ← w₁ - η·(∂L/∂w₁), w₂ ← w₂ - η·(∂L/∂w₂)

Key Insight: Backpropagation is simply the chain rule applied systematically. To find how the loss L changes with any weight w, we multiply the partial derivatives along each path from L back to w and add the results across paths. This is exactly the multivariable chain rule!
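The demo does not spell out its activation, target, or loss, but its gradient values are consistent with one plausible setup: a sigmoid activation, a target of $y = 1$, a squared-error loss $L = (a - y)^2$, and no bias. The sketch below works through the chain rule under those assumptions, which are mine rather than the text's, and arrives at the same numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values shown in the demo
x1, x2 = 2.0, 3.0
w1, w2 = 0.50, -0.30

# ASSUMPTIONS (not stated in the text): sigmoid activation, target y = 1,
# squared-error loss L = (a - y)^2, no bias term.
y_target = 1.0

# Forward pass
z = w1 * x1 + w2 * x2
a = sigmoid(z)
loss = (a - y_target) ** 2

# Backward pass: multiply derivatives along L -> a -> z -> w
dL_da = 2.0 * (a - y_target)
da_dz = a * (1.0 - a)
dL_dz = dL_da * da_dz
dL_dw1 = dL_dz * x1      # dz/dw1 = x1
dL_dw2 = dL_dz * x2      # dz/dw2 = x2

print(f"z      = {z:.3f}")
print(f"dL/dw1 = {dL_dw1:.6f}")
print(f"dL/dw2 = {dL_dw2:.6f}")
```

Both gradients are negative, so a gradient descent step with a positive learning rate would increase both weights.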

Python Implementation

Numerical Chain Rule

Computing the Chain Rule Numerically

File: chain_rule.py

Key steps in the code:

Numerical Chain Rule Function: This function computes dz/dt for the composition z = f(x(t), y(t)) using numerical derivatives. It demonstrates the multivariable chain rule: we must account for how z changes through both x and y.

Computing Current Values: First, we evaluate x(t), y(t), and z = f(x, y) at the current point t. These are the values around which we'll compute derivatives.

Partial Derivative ∂z/∂x: We compute the partial derivative of z with respect to x using central differences. This tells us how z changes when only x changes, holding y fixed.

Partial Derivative ∂z/∂y: Similarly, we compute ∂z/∂y, which is how z changes with y, holding x fixed. Together with ∂z/∂x, these capture how z responds to changes in its inputs.

Derivatives dx/dt and dy/dt: These are ordinary derivatives showing how x and y change with t. Since x and y are single-variable functions of t, no other variable needs to be held fixed; the code again estimates them with central differences.

The Chain Rule Formula: The key formula is dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt). Each path from t to z contributes a product of derivatives, and we sum all contributions.

import numpy as np
from typing import Callable, Tuple

def numerical_chain_rule(
    f: Callable[[float, float], float],
    x: Callable[[float], float],
    y: Callable[[float], float],
    t: float,
    h: float = 1e-7
) -> Tuple[float, dict]:
    """
    Numerically compute dz/dt using the chain rule where
    z = f(x(t), y(t)).

    This demonstrates the multivariable chain rule:
    dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)

    Args:
        f: Function z = f(x, y)
        x: Function x(t)
        y: Function y(t)
        t: Point at which to evaluate
        h: Step size for numerical derivatives

    Returns:
        dz/dt and a dictionary of intermediate values
    """
    # Current values
    x_t = x(t)
    y_t = y(t)
    z_t = f(x_t, y_t)

    # Partial derivatives of z with respect to x and y
    # ∂z/∂x ≈ [f(x+h, y) - f(x-h, y)] / (2h)
    dz_dx = (f(x_t + h, y_t) - f(x_t - h, y_t)) / (2 * h)

    # ∂z/∂y ≈ [f(x, y+h) - f(x, y-h)] / (2h)
    dz_dy = (f(x_t, y_t + h) - f(x_t, y_t - h)) / (2 * h)

    # Derivatives of x and y with respect to t
    dx_dt = (x(t + h) - x(t - h)) / (2 * h)
    dy_dt = (y(t + h) - y(t - h)) / (2 * h)

    # Chain rule: dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
    dz_dt = dz_dx * dx_dt + dz_dy * dy_dt

    return dz_dt, {
        'x': x_t, 'y': y_t, 'z': z_t,
        'dz_dx': dz_dx, 'dz_dy': dz_dy,
        'dx_dt': dx_dt, 'dy_dt': dy_dt,
        'contribution_x': dz_dx * dx_dt,
        'contribution_y': dz_dy * dy_dt
    }

# Example: z = x² + xy, x = cos(t), y = sin(t)
def f(x, y):
    return x**2 + x*y

def x(t):
    return np.cos(t)

def y(t):
    return np.sin(t)

# Evaluate at t = π/4
t_val = np.pi / 4
dz_dt, info = numerical_chain_rule(f, x, y, t_val)

print("Chain Rule Calculation for z = x² + xy")
print("where x = cos(t), y = sin(t)")
print(f"\nAt t = π/4 = {t_val:.4f}:")
print(f"  x(t) = {info['x']:.4f}")
print(f"  y(t) = {info['y']:.4f}")
print(f"  z = f(x,y) = {info['z']:.4f}")
print(f"\nPartial derivatives:")
print(f"  ∂z/∂x = 2x + y = {info['dz_dx']:.4f}")
print(f"  ∂z/∂y = x = {info['dz_dy']:.4f}")
print(f"\nDerivatives with respect to t:")
print(f"  dx/dt = -sin(t) = {info['dx_dt']:.4f}")
print(f"  dy/dt = cos(t) = {info['dy_dt']:.4f}")
print(f"\nChain rule components:")
print(f"  (∂z/∂x)(dx/dt) = {info['contribution_x']:.4f}")
print(f"  (∂z/∂y)(dy/dt) = {info['contribution_y']:.4f}")
print(f"\ndz/dt = {dz_dt:.4f}")
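For this particular example the answer is also available in closed form, which gives a handy sanity check on the numerical routine: substituting gives $z(t) = \cos^2 t + \cos t \sin t$, so $dz/dt = \cos 2t - \sin 2t$, which equals $-1$ at $t = \pi/4$. The standalone sketch below evaluates the chain rule with the exact partial derivatives rather than finite differences. The variable names are illustrative.

```python
import math

t = math.pi / 4
x, y = math.cos(t), math.sin(t)

# Exact partials of z = x^2 + x*y
dz_dx = 2 * x + y
dz_dy = x

# Exact derivatives of x = cos(t), y = sin(t)
dx_dt = -math.sin(t)
dy_dt = math.cos(t)

# Chain rule: sum the contributions of the paths t -> x -> z and t -> y -> z
dz_dt = dz_dx * dx_dt + dz_dy * dy_dt

# Closed form obtained by substituting x(t), y(t) into z and differentiating
closed_form = math.cos(2 * t) - math.sin(2 * t)

print(f"chain rule:  {dz_dt:.6f}")
print(f"closed form: {closed_form:.6f}")
```

A finite-difference implementation should print a value close to $-1.0000$ for the same inputs.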

Backpropagation Implementation

Backpropagation: Chain Rule in Neural Networks

File: backprop.py

Key steps in the code:

Forward and Backward Pass: This function demonstrates a complete forward pass (compute output from input) and backward pass (compute gradients using chain rule) for a single neuron.

Forward Pass: First, compute z = wx + b (linear combination), then a = σ(z) (apply activation), then L = (a-y)² (compute loss). This builds the computational graph.

Backward Pass Begins: Now we work backwards, applying the chain rule at each step. We compute how L changes with respect to each intermediate variable.

∂L/∂a: The derivative of L = (a-y)² with respect to a is 2(a-y). This is where gradients start flowing backward.

∂a/∂z = σ'(z): The sigmoid derivative σ'(z) = σ(z)(1-σ(z)) = a(1-a). This tells us how the activation changes with z.

Chain Rule, ∂L/∂z: Now we multiply: ∂L/∂z = (∂L/∂a)(∂a/∂z). This is the chain rule in action: multiplying derivatives along the path from L to z.

Gradient with respect to weight: Since ∂z/∂w = x, we get ∂L/∂w = (∂L/∂z)·x. This is the gradient we use to update the weight.

Gradient Descent Update: Finally, we update w and b in the direction that decreases loss: w ← w - η·(∂L/∂w). The chain rule gave us exactly the gradients we needed!

import numpy as np

def sigmoid(z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of sigmoid: σ'(z) = σ(z)(1 - σ(z))"""
    s = sigmoid(z)
    return s * (1 - s)

def forward_backward_pass(x, w, b, y_true):
    """
    Demonstrates forward and backward pass through a single
    neuron, showing the chain rule in action.

    Network: z = w·x + b, a = σ(z), L = (a - y)²

    The chain rule gives us:
    ∂L/∂w = (∂L/∂a)(∂a/∂z)(∂z/∂w)
    """
    # Forward pass
    z = np.dot(w, x) + b          # Linear combination
    a = sigmoid(z)                 # Activation
    loss = (a - y_true) ** 2       # Loss

    # Backward pass (chain rule!)
    # Each step multiplies derivatives along the path

    # Step 1: ∂L/∂a = 2(a - y)
    dL_da = 2 * (a - y_true)

    # Step 2: ∂a/∂z = σ'(z) = a(1-a)
    da_dz = sigmoid_derivative(z)

    # Step 3: Combine using chain rule
    # ∂L/∂z = (∂L/∂a)(∂a/∂z)
    dL_dz = dL_da * da_dz

    # Step 4: ∂z/∂w = x, so ∂L/∂w = (∂L/∂z)(∂z/∂w)
    dL_dw = dL_dz * x

    # ∂z/∂b = 1, so ∂L/∂b = ∂L/∂z
    dL_db = dL_dz

    return {
        'z': z, 'a': a, 'loss': loss,
        'dL_da': dL_da, 'da_dz': da_dz,
        'dL_dz': dL_dz, 'dL_dw': dL_dw, 'dL_db': dL_db
    }

# Example: single input, single output
x = np.array([2.0])      # Input
w = np.array([0.5])      # Weight
b = 0.1                  # Bias
y_true = 1.0             # Target

result = forward_backward_pass(x, w, b, y_true)

print("Forward Pass:")
print(f"  z = w·x + b = {w[0]}·{x[0]} + {b} = {result['z']:.4f}")
print(f"  a = σ(z) = {result['a']:.4f}")
print(f"  Loss = (a - y)² = {result['loss']:.6f}")

print("\nBackward Pass (Chain Rule):")
print(f"  ∂L/∂a = 2(a - y) = {result['dL_da']:.4f}")
print(f"  ∂a/∂z = σ'(z) = a(1-a) = {result['da_dz']:.4f}")
print(f"  ∂L/∂z = (∂L/∂a)(∂a/∂z) = {result['dL_dz']:.4f}")
print(f"  ∂L/∂w = (∂L/∂z)·x = {result['dL_dw'][0]:.4f}")
print(f"  ∂L/∂b = ∂L/∂z = {result['dL_db']:.4f}")

# Gradient descent update
learning_rate = 0.1
w_new = w - learning_rate * result['dL_dw']
b_new = b - learning_rate * result['dL_db']

print(f"\nGradient Descent Update (η = {learning_rate}):")
print(f"  w_new = w - η·(∂L/∂w) = {w[0]} - {learning_rate}·{result['dL_dw'][0]:.4f} = {w_new[0]:.4f}")
print(f"  b_new = b - η·(∂L/∂b) = {b} - {learning_rate}·{result['dL_db']:.4f} = {b_new:.4f}")
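A common way to gain confidence in hand-derived gradients like these is a gradient check: compare the chain rule result with a finite-difference estimate of the loss. The standalone sketch below does this for the same single neuron ($z = wx + b$, $a = \sigma(z)$, $L = (a - y)^2$) and the same example values. The function names and the step size `h` are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y_true):
    """Forward pass only: L = (sigmoid(w*x + b) - y)^2."""
    return (sigmoid(w * x + b) - y_true) ** 2

def analytic_grads(w, b, x, y_true):
    """Chain rule: dL/dw = (dL/da)(da/dz)(dz/dw), dL/db = (dL/da)(da/dz)."""
    a = sigmoid(w * x + b)
    dL_dz = 2.0 * (a - y_true) * a * (1.0 - a)
    return dL_dz * x, dL_dz

def numeric_grads(w, b, x, y_true, h=1e-6):
    """Central differences on the loss, perturbing one parameter at a time."""
    dL_dw = (loss(w + h, b, x, y_true) - loss(w - h, b, x, y_true)) / (2 * h)
    dL_db = (loss(w, b + h, x, y_true) - loss(w, b - h, x, y_true)) / (2 * h)
    return dL_dw, dL_db

x, w, b, y_true = 2.0, 0.5, 0.1, 1.0
gw, gb = analytic_grads(w, b, x, y_true)
nw, nb = numeric_grads(w, b, x, y_true)

print(f"dL/dw: chain rule = {gw:.6f}, finite difference = {nw:.6f}")
print(f"dL/db: chain rule = {gb:.6f}, finite difference = {nb:.6f}")
```

If the two columns disagree by more than finite-difference error, the backward pass most likely has a bug.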

Test Your Understanding

If z = f(x, y), x = g(t), and y = h(t), which formula gives dz/dt?

Summary

Key Formulas

Case	Formula
z = f(x,y), x = g(t), y = h(t)	dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
z = f(x,y), x = g(s,t), y = h(s,t)	∂z/∂s = (∂z/∂x)(∂x/∂s) + (∂z/∂y)(∂y/∂s); ∂z/∂t = (∂z/∂x)(∂x/∂t) + (∂z/∂y)(∂y/∂t)
General case	∂z/∂tⱼ = Σᵢ (∂z/∂xᵢ)(∂xᵢ/∂tⱼ)
Implicit: F(x,y) = 0	dy/dx = -(∂F/∂x)/(∂F/∂y)

The Tree Rule

Draw the dependency tree from inputs to outputs
Find all paths from your variable of interest to the output
Multiply derivatives along each path
Add all the path contributions

Key Takeaways

The multivariable chain rule extends the single-variable rule to functions of several variables
When changes can propagate through multiple paths, we must sum the contributions from each path
Each path contributes a product of derivatives along that path
Implicit differentiation is a special case of the chain rule
Backpropagation in neural networks is exactly the chain rule applied systematically—this is the foundation of deep learning

The Chain of Change:

"When variables depend on other variables that depend on yet other variables, the chain rule tells us exactly how changes propagate—multiply along paths, add across paths."

Coming Next: In the next section, we explore Directional Derivatives and the Gradient—learning how to find the rate of change of a function in any direction, and discovering the gradient vector that points in the direction of steepest ascent.