Chapter 17

The Chain Rule for Multivariable Functions

Partial Derivatives

Learning Objectives

By the end of this section, you will be able to:

  1. Apply the chain rule to functions of several variables when the independent variables are themselves functions of other variables
  2. Construct dependency trees to identify all paths through which changes propagate
  3. Compute partial derivatives using the chain rule for various cases: one independent variable, two independent variables, and the general case
  4. Use implicit differentiation as a special application of the chain rule
  5. Connect the multivariable chain rule to backpropagation in neural networks and understand why this is the foundation of deep learning

The Big Picture: Tracking Change Through Networks

"When a butterfly flaps its wings in Brazil, how does that affect the weather in Tokyo? The chain rule tells us exactly how changes propagate through interconnected systems."

In single-variable calculus, the chain rule tells us how to differentiate composite functions: if $z = f(y)$ and $y = g(x)$, then $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$. But what happens when functions depend on multiple variables, and those variables depend on other variables?

The multivariable chain rule extends this powerful concept to functions of several variables. The key insight is that when a quantity $z$ depends on intermediate variables like $x$ and $y$, changes can propagate through multiple paths. We must account for all of them.

Why This Matters

The multivariable chain rule is essential in:

  • Machine Learning: Backpropagation in neural networks is nothing but the chain rule applied systematically
  • Physics: Computing how quantities change in different coordinate systems (Cartesian to polar, etc.)
  • Economics: Understanding how changes in base inputs affect downstream economic indicators
  • Engineering: Sensitivity analysis and related rates problems in multiple dimensions

Historical Context

The chain rule in its single-variable form was known to Leibniz in the 17th century. The extension to multiple variables developed throughout the 18th and 19th centuries as mathematicians like Euler, Lagrange, and Cauchy formalized the calculus of several variables.

The notation $\frac{\partial z}{\partial x}$ for partial derivatives was introduced by Legendre in 1786 and popularized by Jacobi in the 1840s. This notation elegantly captures the key idea: the "curly d" reminds us that we hold other variables constant.

In the 1960s and 1970s, researchers in machine learning and optimization rediscovered the chain rule's power for computing gradients efficiently. The technique of backpropagation, popularized by Rumelhart, Hinton, and Williams in 1986, is precisely the multivariable chain rule applied to computational graphs, and it became the foundation of modern deep learning.


Review: Single-Variable Chain Rule

Let's first recall the single-variable chain rule. If $z = f(y)$ and $y = g(x)$, then the derivative of $z$ with respect to $x$ is:

Single-Variable Chain Rule

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

Multiply the derivatives along the chain

Intuition: If $y$ increases by 3 units when $x$ increases by 1 unit (i.e., $\frac{dy}{dx} = 3$), and $z$ increases by 2 units when $y$ increases by 1 unit (i.e., $\frac{dz}{dy} = 2$), then $z$ increases by $2 \cdot 3 = 6$ units when $x$ increases by 1 unit. The rates multiply.
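
The "rates multiply" idea can be checked numerically. The sketch below is only an illustration: the functions `g` and `f`, the helper `central_diff`, and the evaluation point `x0` are assumptions chosen for this example, not part of the section.

```python
import math

def g(x):
    return x ** 2          # inner function y = g(x) (assumed example)

def f(y):
    return math.sin(y)     # outer function z = f(y) (assumed example)

def central_diff(func, point, h=1e-6):
    """Central-difference estimate of the derivative of func at point."""
    return (func(point + h) - func(point - h)) / (2 * h)

x0 = 1.2
dy_dx = central_diff(g, x0)                    # how fast y changes with x
dz_dy = central_diff(f, g(x0))                 # how fast z changes with y
chain_rule = dz_dy * dy_dx                     # multiply the rates along the chain
direct = central_diff(lambda x: f(g(x)), x0)   # differentiate the composite directly
exact = math.cos(x0 ** 2) * 2 * x0             # hand-derived: cos(x^2) * 2x

print(f"chain rule: {chain_rule:.6f}")
print(f"direct:     {direct:.6f}")
print(f"exact:      {exact:.6f}")
```

All three values should agree to several decimal places; small differences come from the finite-difference step size.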


Case 1: One Independent Variable

Suppose $z = f(x, y)$ where both $x$ and $y$ are functions of a single variable $t$:

  • $x = g(t)$
  • $y = h(t)$

How does $z$ change with $t$? A change in $t$ causes both $x$ and $y$ to change, and each of these changes affects $z$.

Chain Rule: Case 1

$$\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}$$

Sum the contributions from all paths: t → x → z and t → y → z

The Dependency Tree

A dependency tree (also called a computational graph) shows how variables depend on each other. For Case 1:

        z
       / \
      x   y
       \ /
        t

The Rule: To find $\frac{dz}{dt}$, identify all paths from $t$ to $z$. For each path, multiply the derivatives along that path. Then add all the products together.

  • Path 1: $t \to x \to z$ contributes $\frac{\partial z}{\partial x} \cdot \frac{dx}{dt}$
  • Path 2: $t \to y \to z$ contributes $\frac{\partial z}{\partial y} \cdot \frac{dy}{dt}$
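
As a worked instance of the two-path rule, consider the following example. The functions $z = x^2 y$, $x = t^2$, $y = t^3$ are chosen here purely for illustration; they make the result easy to verify by direct substitution.

```latex
\begin{aligned}
\frac{\partial z}{\partial x} &= 2xy, \qquad
\frac{\partial z}{\partial y} = x^2, \qquad
\frac{dx}{dt} = 2t, \qquad
\frac{dy}{dt} = 3t^2 \\[4pt]
\frac{dz}{dt}
  &= \underbrace{(2xy)(2t)}_{t \to x \to z}
   + \underbrace{(x^2)(3t^2)}_{t \to y \to z} \\
  &= 2(t^2)(t^3)(2t) + (t^2)^2(3t^2) \\
  &= 4t^6 + 3t^6 = 7t^6
\end{aligned}
```

Substituting first gives $z = (t^2)^2 \cdot t^3 = t^7$, whose derivative is $7t^6$, in agreement with the chain rule.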

Interactive Dependency Explorer

[Interactive figure: Chain Rule Dependency Tree. Nodes: z (output), x and y (intermediate), t (input). dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt); each path from t to z contributes a product of derivatives.]

Key Insight: To find how z changes with t, we must account for all paths: t affects x and y, which both affect z. We multiply derivatives along each path and add them together.


Case 2: Two Independent Variables

Now suppose $z = f(x, y)$ where $x$ and $y$ are both functions of two independent variables $s$ and $t$:

  • $x = g(s, t)$
  • $y = h(s, t)$

Now we need to find both $\frac{\partial z}{\partial s}$ and $\frac{\partial z}{\partial t}$.

Chain Rule: Case 2

$$\frac{\partial z}{\partial s} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial s} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial s}$$
$$\frac{\partial z}{\partial t} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t}$$

Partial vs Total Derivatives

Notice that we use partial derivative notation $\frac{\partial}{\partial s}$ rather than $\frac{d}{ds}$ because $z$ ultimately depends on both $s$ and $t$. When we compute $\frac{\partial z}{\partial s}$, we hold $t$ constant.
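
A small sketch of Case 2 follows. The functions `x_of`, `y_of`, and `f` and the point `(s0, t0)` are assumptions made for this illustration; the partial derivatives inside `chain_rule_partials` are derived by hand for those specific functions and are checked against finite differences of the composite.

```python
def x_of(s, t):
    return s ** 2 + t      # assumed example: x = g(s, t) = s^2 + t

def y_of(s, t):
    return s * t           # assumed example: y = h(s, t) = s*t

def f(x, y):
    return x * y           # assumed example: z = f(x, y) = x*y

def chain_rule_partials(s, t):
    """Return (dz/ds, dz/dt) from the Case 2 formulas."""
    x, y = x_of(s, t), y_of(s, t)
    dz_dx, dz_dy = y, x            # partials of f(x, y) = x*y
    dx_ds, dx_dt = 2 * s, 1.0      # partials of x = s^2 + t
    dy_ds, dy_dt = t, s            # partials of y = s*t
    dz_ds = dz_dx * dx_ds + dz_dy * dy_ds
    dz_dt = dz_dx * dx_dt + dz_dy * dy_dt
    return dz_ds, dz_dt

def composite(s, t):
    return f(x_of(s, t), y_of(s, t))

s0, t0, h = 1.5, 2.0, 1e-6
dz_ds, dz_dt = chain_rule_partials(s0, t0)

# Finite-difference checks: vary one input while holding the other fixed.
fd_ds = (composite(s0 + h, t0) - composite(s0 - h, t0)) / (2 * h)
fd_dt = (composite(s0, t0 + h) - composite(s0, t0 - h)) / (2 * h)

print(f"dz/ds: chain rule {dz_ds:.6f}, finite difference {fd_ds:.6f}")
print(f"dz/dt: chain rule {dz_dt:.6f}, finite difference {fd_dt:.6f}")
```

Holding `t0` fixed while perturbing `s0` is exactly what the partial derivative notation expresses.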


The General Chain Rule

The pattern generalizes to any number of intermediate and independent variables. If $z$ depends on $x_1, x_2, \ldots, x_m$, and each $x_i$ depends on $t_1, t_2, \ldots, t_n$, then:

General Chain Rule

$$\frac{\partial z}{\partial t_j} = \sum_{i=1}^{m} \frac{\partial z}{\partial x_i} \frac{\partial x_i}{\partial t_j}$$

Sum over all intermediate variables $x_i$

Reading the formula: To find how $z$ changes with $t_j$, look at every intermediate variable $x_i$. For each, multiply how $z$ changes with $x_i$ by how $x_i$ changes with $t_j$. Sum all these contributions.
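
The general formula is a sum over the index $i$ for each fixed $j$, which translates directly into a short loop. In this sketch the function name `chain_rule_gradient` and the example functions ($z = x_1 x_2 + x_3$ with $x_1 = t_1 t_2$, $x_2 = t_1 + t_2$, $x_3 = t_1^2$) are assumptions for illustration; the partial derivatives are worked out by hand at the point $(t_1, t_2) = (1, 2)$.

```python
def chain_rule_gradient(dz_dx, dx_dt):
    """
    dz_dx: list of m values, dz_dx[i] = dz/dx_i
    dx_dt: m-by-n nested list, dx_dt[i][j] = dx_i/dt_j
    Returns a list of n values, one dz/dt_j per independent variable.
    """
    m = len(dz_dx)
    n = len(dx_dt[0])
    return [sum(dz_dx[i] * dx_dt[i][j] for i in range(m)) for j in range(n)]

# Assumed example evaluated at (t1, t2) = (1, 2).
t1, t2 = 1.0, 2.0
x1, x2, x3 = t1 * t2, t1 + t2, t1 ** 2

dz_dx = [x2, x1, 1.0]            # partials of z = x1*x2 + x3
dx_dt = [
    [t2, t1],                    # x1 = t1*t2
    [1.0, 1.0],                  # x2 = t1 + t2
    [2 * t1, 0.0],               # x3 = t1^2
]

grad = chain_rule_gradient(dz_dx, dx_dt)
print(grad)
```

Expanding the composite by hand gives $z = t_1^2 t_2 + t_1 t_2^2 + t_1^2$, so $\partial z/\partial t_1 = 2t_1 t_2 + t_2^2 + 2t_1 = 10$ and $\partial z/\partial t_2 = t_1^2 + 2 t_1 t_2 = 5$ at this point, which is what the loop should reproduce.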

The Tree Rule

To apply the chain rule:

  1. Draw the dependency tree from inputs to outputs
  2. Identify all paths from your input variable to the output
  3. For each path, multiply all derivatives along the path
  4. Add all the path contributions together
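
The four steps above can be sketched as a small path enumeration over a dependency graph. The function name `path_products` and the dictionary layout are assumptions made for this illustration; the example graph uses $z = x + y^2$ with $x = t^2$ and $y = \sin t$, evaluated at $t = 1$.

```python
import math

def path_products(edges, source, target):
    """
    Return one product of local derivatives per path from source to target.
    edges maps a node to a list of (next_node, local_derivative) pairs.
    Assumes the graph has no cycles.
    """
    if source == target:
        return [1.0]
    products = []
    for next_node, local_derivative in edges.get(source, []):
        for tail in path_products(edges, next_node, target):
            products.append(local_derivative * tail)   # multiply along the path
    return products

t = 1.0
y = math.sin(t)
edges = {
    "t": [("x", 2 * t), ("y", math.cos(t))],   # dx/dt = 2t, dy/dt = cos(t)
    "x": [("z", 1.0)],                         # dz/dx = 1
    "y": [("z", 2 * y)],                       # dz/dy = 2y
}

contributions = path_products(edges, "t", "z")
dz_dt = sum(contributions)                     # add across paths

print(contributions)
print(f"dz/dt = {dz_dt:.4f}")
```

Enumerating every path is fine for small trees but grows quickly in large graphs, which is one reason practical systems reuse intermediate products instead of listing paths explicitly.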

Interactive Chain Rule Visualizer

Explore how the chain rule works for a concrete example. Adjust the parameter $t$ and see how changes propagate through the composite function.

[Interactive figure: Chain Rule in Action: How Changes Propagate. Plot of z(t) = x(t) + y(t)² = t² + sin²(t) showing the current point, the new point, and the tangent line.]

At t = 1.00:

  • x(t) = t² = 1.000
  • y(t) = sin(t) = 0.841
  • z = x + y² = 1.708

Derivatives:

  • dx/dt = 2t = 2.000
  • dy/dt = cos(t) = 0.540
  • ∂z/∂x = 1.000
  • ∂z/∂y = 2y = 1.683

Chain Rule:

dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
      = (1.00)(2.00) + (1.68)(0.54)
      = 2.9093

Linear Approximation vs Reality:

  • Predicted Δz: 1.4546
  • Actual Δz: 1.5369
  • Error: 0.082274 (smaller Δt → better approximation)

Implicit Differentiation Revisited

Implicit differentiation, which we learned earlier for single-variable calculus, is actually a special case of the chain rule for multivariable functions.

Suppose $F(x, y) = 0$ defines $y$ implicitly as a function of $x$. Since $F(x, y(x)) = 0$ for all $x$, differentiating both sides with respect to $x$ using the chain rule gives:

$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y}\frac{dy}{dx} = 0$$

Solving for $\frac{dy}{dx}$:

Implicit Differentiation Formula

$$\frac{dy}{dx} = -\frac{F_x}{F_y} = -\frac{\partial F/\partial x}{\partial F/\partial y}$$

(provided $F_y \neq 0$)
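
To see the formula at work, the sketch below applies it to the circle $x^2 + y^2 = 25$ at the point $(3, 4)$. The function `implicit_slope`, the choice of curve, and the tolerance used to detect $F_y = 0$ are assumptions for this illustration.

```python
import math

def F(x, y):
    return x ** 2 + y ** 2 - 25.0   # assumed example: circle of radius 5

def implicit_slope(F, x, y, h=1e-6):
    """Estimate dy/dx = -F_x / F_y using central differences."""
    F_x = (F(x + h, y) - F(x - h, y)) / (2 * h)
    F_y = (F(x, y + h) - F(x, y - h)) / (2 * h)
    if abs(F_y) < 1e-12:
        raise ValueError("F_y is numerically zero; the formula does not apply here")
    return -F_x / F_y

slope = implicit_slope(F, 3.0, 4.0)

# Cross-check against the explicit upper branch y = sqrt(25 - x^2).
explicit = -3.0 / math.sqrt(25.0 - 3.0 ** 2)

print(f"implicit formula: {slope:.6f}")
print(f"explicit branch:  {explicit:.6f}")
```

Both approaches give a slope of $-3/4$ at this point. At $(5, 0)$ the tangent is vertical and $F_y = 0$, which is the case the condition $F_y \neq 0$ excludes.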


Worked Examples

Example: Polar Coordinates

Suppose $z = f(x, y)$ and we want to express the partial derivatives in polar coordinates, where $x = r\cos\theta$ and $y = r\sin\theta$.

Find $\frac{\partial z}{\partial r}$:

$$\begin{aligned}
\frac{\partial z}{\partial r} &= \frac{\partial z}{\partial x}\frac{\partial x}{\partial r} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial r} \\
&= \frac{\partial z}{\partial x}\cos\theta + \frac{\partial z}{\partial y}\sin\theta
\end{aligned}$$

Find $\frac{\partial z}{\partial \theta}$:

$$\begin{aligned}
\frac{\partial z}{\partial \theta} &= \frac{\partial z}{\partial x}\frac{\partial x}{\partial \theta} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial \theta} \\
&= \frac{\partial z}{\partial x}(-r\sin\theta) + \frac{\partial z}{\partial y}(r\cos\theta)
\end{aligned}$$

This is exactly how coordinate transformations work—the chain rule tells us how derivatives transform from one coordinate system to another.
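
The two polar formulas can be tried on a concrete function. In this sketch the helper `polar_partials`, the example $z = xy$, and the evaluation point are assumptions for illustration. For that example, $z = r^2 \sin\theta\cos\theta$, so the closed forms are $\partial z/\partial r = r\sin 2\theta$ and $\partial z/\partial\theta = r^2\cos 2\theta$.

```python
import math

def polar_partials(z_x, z_y, r, theta):
    """Convert Cartesian partials (z_x, z_y) into (dz/dr, dz/dtheta)."""
    dz_dr = z_x * math.cos(theta) + z_y * math.sin(theta)
    dz_dtheta = z_x * (-r * math.sin(theta)) + z_y * (r * math.cos(theta))
    return dz_dr, dz_dtheta

r, theta = 2.0, math.pi / 6
x, y = r * math.cos(theta), r * math.sin(theta)

z_x, z_y = y, x   # partials of the assumed example z = x*y
dz_dr, dz_dtheta = polar_partials(z_x, z_y, r, theta)

# Closed forms obtained by writing z directly in polar coordinates.
expected_dr = r * math.sin(2 * theta)
expected_dtheta = r ** 2 * math.cos(2 * theta)

print(f"dz/dr:     {dz_dr:.6f} (closed form {expected_dr:.6f})")
print(f"dz/dtheta: {dz_dtheta:.6f} (closed form {expected_dtheta:.6f})")
```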


Real-World Applications

Physics: Related Rates in Multiple Dimensions

In physics, quantities often depend on multiple variables that change with time. The chain rule is essential for computing how rates of change propagate.

Example: The pressure $P$ of an ideal gas depends on volume $V$ and temperature $T$ through $P = nRT/V$. If both $V$ and $T$ change with time, how fast does $P$ change?

$$\frac{dP}{dt} = \frac{\partial P}{\partial V}\frac{dV}{dt} + \frac{\partial P}{\partial T}\frac{dT}{dt} = -\frac{nRT}{V^2}\frac{dV}{dt} + \frac{nR}{V}\frac{dT}{dt}$$
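
Plugging numbers into this formula makes the competing effects visible. The state and rates below (1 mol of gas at 300 K in 0.025 m³, expanding at 1e-4 m³/s while heating at 2 K/s) are hypothetical values chosen only for illustration, and the function names are assumptions of this sketch.

```python
R = 8.314  # gas constant in J/(mol*K)

def pressure(n, T, V):
    """Ideal gas law P = nRT/V (mol, K, m^3 give Pa)."""
    return n * R * T / V

def pressure_rate(n, T, V, dV_dt, dT_dt):
    """dP/dt from the chain rule, using the partials of P = nRT/V."""
    dP_dV = -n * R * T / V ** 2
    dP_dT = n * R / V
    return dP_dV * dV_dt + dP_dT * dT_dt

# Hypothetical state and rates, chosen only for illustration.
n, T0, V0 = 1.0, 300.0, 0.025
dV_dt, dT_dt = 1e-4, 2.0

dP_dt = pressure_rate(n, T0, V0, dV_dt, dT_dt)

# Cross-check: finite difference of P along V(t) = V0 + dV_dt*t, T(t) = T0 + dT_dt*t.
h = 1e-5

def P_at(t):
    return pressure(n, T0 + dT_dt * t, V0 + dV_dt * t)

fd = (P_at(h) - P_at(-h)) / (2 * h)

print(f"chain rule dP/dt:  {dP_dt:.3f} Pa/s")
print(f"finite difference: {fd:.3f} Pa/s")
```

With these values the expansion lowers the pressure while the heating raises it, and the heating term is the larger of the two, so the net rate is positive.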

Other physics applications include:

  • Thermodynamics: Changes in entropy, internal energy, and free energy with multiple state variables
  • Fluid mechanics: The material derivative $\frac{D}{Dt} = \frac{\partial}{\partial t} + \mathbf{v} \cdot \nabla$
  • Electromagnetism: How fields change in moving reference frames

Machine Learning: Backpropagation

The multivariable chain rule is the mathematical foundation of deep learning. When training a neural network, we need to compute how the loss $L$ changes with respect to every weight in the network. This is exactly what the chain rule does!

In a neural network, we have a chain of computations:

input → linear → activation → linear → activation → ... → loss

Each layer's output depends on the previous layer's output and the layer's weights. To find $\frac{\partial L}{\partial w}$ for a weight $w$ deep in the network, we multiply the partial derivatives along each path from $L$ back to $w$ and, when several paths exist, add the path products together.

The Vanishing/Exploding Gradient Problem

Since backpropagation multiplies many derivatives together, if each factor is consistently much smaller than 1 in magnitude (or much larger than 1), the product can become vanishingly small (or explosively large). This is why careful initialization and normalization are crucial in deep learning!
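
A rough numerical picture of this effect is sketched below. It is a deliberate simplification: real backpropagation factors also involve the weights, whereas here each layer is assumed to contribute one fixed factor. The value 0.25 is the maximum of the sigmoid derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$; the factor 1.5 and the helper name `product_of_factors` are assumptions for illustration.

```python
def product_of_factors(factor, depth):
    """Multiply the same per-layer factor together `depth` times."""
    product = 1.0
    for _ in range(depth):
        product *= factor
    return product

for depth in (5, 10, 20):
    shrinking = product_of_factors(0.25, depth)   # best case for sigmoid'(z)
    growing = product_of_factors(1.5, depth)      # assumed factor above 1
    print(f"depth {depth:2d}: 0.25^depth = {shrinking:.3e}, 1.5^depth = {growing:.3e}")
```

Even in the best case for the sigmoid, the product shrinks geometrically with depth, which is the intuition behind the vanishing gradient problem.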

Interactive Backpropagation Demo

See the chain rule in action for a simple neural network. Watch how gradients flow backward from the loss to the weights.

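The Chain Rule IS Backpropagation

[Interactive figure: a single neuron with inputs x₁ = 2 and x₂ = 3, weights w₁ = 0.50 and w₂ = -0.30, pre-activation z = 0.10, activation a = σ(z) = 0.525, target y = 1, and loss L = 0.2256.]

Forward Pass: Compute z
Linear combination of inputs and weights:
z = w₁x₁ + w₂x₂ = (0.50)(2) + (-0.30)(3) = 0.100

Gradients (for gradient descent update):

  • ∂L/∂w₁ = -0.473835
  • ∂L/∂w₂ = -0.710753

Update rule: w₁ ← w₁ - η·(∂L/∂w₁), w₂ ← w₂ - η·(∂L/∂w₂)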

Key Insight: Backpropagation is simply the chain rule applied systematically. To find how the loss L changes with any weight w, we multiply all the partial derivatives along the path from L back to w. This is exactly the multivariable chain rule!
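
The demo's gradient values can be reproduced in a few lines. This sketch assumes the loss is the squared error $L = (a - y)^2$, which is consistent with the displayed loss of 0.2256; the variable names are chosen here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values shown in the demo.
w1, w2 = 0.50, -0.30
x1, x2 = 2.0, 3.0
y = 1.0

# Forward pass.
z = w1 * x1 + w2 * x2
a = sigmoid(z)
loss = (a - y) ** 2            # assumed squared-error loss

# Backward pass: multiply local derivatives along L -> a -> z -> w.
dL_da = 2.0 * (a - y)
da_dz = a * (1.0 - a)          # sigmoid derivative expressed through a
dL_dz = dL_da * da_dz
dL_dw1 = dL_dz * x1            # dz/dw1 = x1
dL_dw2 = dL_dz * x2            # dz/dw2 = x2

print(f"z = {z:.3f}, a = {a:.3f}, L = {loss:.4f}")
print(f"dL/dw1 = {dL_dw1:.6f}")
print(f"dL/dw2 = {dL_dw2:.6f}")
```

Both weights share the factor `dL_dz`; only the last link in the chain (the input each weight multiplies) differs, which is why the two gradients are in the ratio 2 : 3.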


Python Implementation

Numerical Chain Rule

Computing the Chain Rule Numerically
chain_rule.py

Numerical Chain Rule Function

This function computes dz/dt for the composition z = f(x(t), y(t)) using numerical derivatives. It demonstrates the multivariable chain rule: we must account for how z changes through both x and y.

Computing Current Values

First, we evaluate x(t), y(t), and z = f(x, y) at the current point t. These are the values around which we'll compute derivatives.

Partial Derivative ∂z/∂x

We compute the partial derivative of z with respect to x using central differences. This tells us how z changes when only x changes, holding y fixed.

Partial Derivative ∂z/∂y

Similarly, we compute ∂z/∂y: how z changes with y, holding x fixed. Together with ∂z/∂x, these capture how z responds to changes in its inputs.

Derivatives dx/dt and dy/dt

These are ordinary derivatives showing how x and y change with t. Since x and y are single-variable functions of t, we use standard differentiation.

The Chain Rule Formula

The key formula: dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt). Each path from t to z contributes a product of derivatives, and we sum all contributions.
import numpy as np
from typing import Callable, Tuple

def numerical_chain_rule(
    f: Callable[[float, float], float],
    x: Callable[[float], float],
    y: Callable[[float], float],
    t: float,
    h: float = 1e-7
) -> Tuple[float, dict]:
    """
    Numerically compute dz/dt using the chain rule where
    z = f(x(t), y(t)).

    This demonstrates the multivariable chain rule:
    dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)

    Args:
        f: Function z = f(x, y)
        x: Function x(t)
        y: Function y(t)
        t: Point at which to evaluate
        h: Step size for numerical derivatives

    Returns:
        dz/dt and a dictionary of intermediate values
    """
    # Current values
    x_t = x(t)
    y_t = y(t)
    z_t = f(x_t, y_t)

    # Partial derivatives of z with respect to x and y
    # ∂z/∂x ≈ [f(x+h, y) - f(x-h, y)] / (2h)
    dz_dx = (f(x_t + h, y_t) - f(x_t - h, y_t)) / (2 * h)

    # ∂z/∂y ≈ [f(x, y+h) - f(x, y-h)] / (2h)
    dz_dy = (f(x_t, y_t + h) - f(x_t, y_t - h)) / (2 * h)

    # Derivatives of x and y with respect to t
    dx_dt = (x(t + h) - x(t - h)) / (2 * h)
    dy_dt = (y(t + h) - y(t - h)) / (2 * h)

    # Chain rule: dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt)
    dz_dt = dz_dx * dx_dt + dz_dy * dy_dt

    return dz_dt, {
        'x': x_t, 'y': y_t, 'z': z_t,
        'dz_dx': dz_dx, 'dz_dy': dz_dy,
        'dx_dt': dx_dt, 'dy_dt': dy_dt,
        'contribution_x': dz_dx * dx_dt,
        'contribution_y': dz_dy * dy_dt
    }

# Example: z = x² + xy, x = cos(t), y = sin(t)
def f(x, y):
    return x**2 + x*y

def x(t):
    return np.cos(t)

def y(t):
    return np.sin(t)

# Evaluate at t = π/4
t_val = np.pi / 4
dz_dt, info = numerical_chain_rule(f, x, y, t_val)

print("Chain Rule Calculation for z = x² + xy")
print("where x = cos(t), y = sin(t)")
print(f"\nAt t = π/4 = {t_val:.4f}:")
print(f"  x(t) = {info['x']:.4f}")
print(f"  y(t) = {info['y']:.4f}")
print(f"  z = f(x,y) = {info['z']:.4f}")
print(f"\nPartial derivatives:")
print(f"  ∂z/∂x = 2x + y = {info['dz_dx']:.4f}")
print(f"  ∂z/∂y = x = {info['dz_dy']:.4f}")
print(f"\nDerivatives with respect to t:")
print(f"  dx/dt = -sin(t) = {info['dx_dt']:.4f}")
print(f"  dy/dt = cos(t) = {info['dy_dt']:.4f}")
print(f"\nChain rule components:")
print(f"  (∂z/∂x)(dx/dt) = {info['contribution_x']:.4f}")
print(f"  (∂z/∂y)(dy/dt) = {info['contribution_y']:.4f}")
print(f"\ndz/dt = {dz_dt:.4f}")
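
The numerical result for this example can be compared with a hand derivation. The steps below use only the functions defined in the listing ($z = x^2 + xy$, $x = \cos t$, $y = \sin t$) and are offered as a check, not as output copied from a run.

```latex
\begin{aligned}
\frac{dz}{dt}
  &= \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}
   = (2x + y)(-\sin t) + x\cos t \\
  &= -2\cos t\,\sin t - \sin^2 t + \cos^2 t
   = -\sin 2t + \cos 2t \\[4pt]
\text{At } t = \tfrac{\pi}{4}:\quad
  (2x + y)(-\sin t) &= \tfrac{3\sqrt{2}}{2}\cdot\left(-\tfrac{\sqrt{2}}{2}\right) = -\tfrac{3}{2},
  \qquad x\cos t = \tfrac{\sqrt{2}}{2}\cdot\tfrac{\sqrt{2}}{2} = \tfrac{1}{2} \\
\frac{dz}{dt}\bigg|_{t = \pi/4} &= -\tfrac{3}{2} + \tfrac{1}{2} = -1
\end{aligned}
```

The two path contributions should therefore come out close to $-1.5$ and $0.5$, with a total close to $-1$, up to finite-difference error.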

Backpropagation Implementation

Backpropagation: Chain Rule in Neural Networks
backprop.py

Forward and Backward Pass

This function demonstrates a complete forward pass (compute output from input) and backward pass (compute gradients using chain rule) for a single neuron.

Forward Pass

First, compute z = wx + b (linear combination), then a = σ(z) (apply activation), then L = (a-y)² (compute loss). This builds the computational graph.

Backward Pass Begins

Now we work backwards, applying the chain rule at each step. We compute how L changes with respect to each intermediate variable.

∂L/∂a

The derivative of L = (a-y)² with respect to a is 2(a-y). This is where gradients start flowing backward.

∂a/∂z = σ'(z)

The sigmoid derivative σ'(z) = σ(z)(1-σ(z)) = a(1-a). This tells us how the activation changes with z.

Chain Rule: ∂L/∂z

Now we multiply: ∂L/∂z = (∂L/∂a)(∂a/∂z). This is the chain rule in action, multiplying derivatives along the path from L to z.

Gradient with respect to weight

Since ∂z/∂w = x, we get ∂L/∂w = (∂L/∂z)·x. This is the gradient we use to update the weight.

Gradient Descent Update

Finally, we update w and b in the direction that decreases loss: w ← w - η·(∂L/∂w). The chain rule gave us exactly the gradients we needed!
import numpy as np

def sigmoid(z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of sigmoid: σ'(z) = σ(z)(1 - σ(z))"""
    s = sigmoid(z)
    return s * (1 - s)

def forward_backward_pass(x, w, b, y_true):
    """
    Demonstrates forward and backward pass through a single
    neuron, showing the chain rule in action.

    Network: z = w·x + b, a = σ(z), L = (a - y)²

    The chain rule gives us:
    ∂L/∂w = (∂L/∂a)(∂a/∂z)(∂z/∂w)
    """
    # Forward pass
    z = np.dot(w, x) + b          # Linear combination
    a = sigmoid(z)                 # Activation
    loss = (a - y_true) ** 2       # Loss

    # Backward pass (chain rule!)
    # Each step multiplies derivatives along the path

    # Step 1: ∂L/∂a = 2(a - y)
    dL_da = 2 * (a - y_true)

    # Step 2: ∂a/∂z = σ'(z) = a(1-a)
    da_dz = sigmoid_derivative(z)

    # Step 3: Combine using chain rule
    # ∂L/∂z = (∂L/∂a)(∂a/∂z)
    dL_dz = dL_da * da_dz

    # Step 4: ∂z/∂w = x, so ∂L/∂w = (∂L/∂z)(∂z/∂w)
    dL_dw = dL_dz * x

    # ∂z/∂b = 1, so ∂L/∂b = ∂L/∂z
    dL_db = dL_dz

    return {
        'z': z, 'a': a, 'loss': loss,
        'dL_da': dL_da, 'da_dz': da_dz,
        'dL_dz': dL_dz, 'dL_dw': dL_dw, 'dL_db': dL_db
    }

# Example: single input, single output
x = np.array([2.0])      # Input
w = np.array([0.5])      # Weight
b = 0.1                  # Bias
y_true = 1.0             # Target

result = forward_backward_pass(x, w, b, y_true)

print("Forward Pass:")
print(f"  z = w·x + b = {w[0]}·{x[0]} + {b} = {result['z']:.4f}")
print(f"  a = σ(z) = {result['a']:.4f}")
print(f"  Loss = (a - y)² = {result['loss']:.6f}")

print("\nBackward Pass (Chain Rule):")
print(f"  ∂L/∂a = 2(a - y) = {result['dL_da']:.4f}")
print(f"  ∂a/∂z = σ'(z) = a(1-a) = {result['da_dz']:.4f}")
print(f"  ∂L/∂z = (∂L/∂a)(∂a/∂z) = {result['dL_dz']:.4f}")
print(f"  ∂L/∂w = (∂L/∂z)·x = {result['dL_dw'][0]:.4f}")
print(f"  ∂L/∂b = ∂L/∂z = {result['dL_db']:.4f}")

# Gradient descent update
learning_rate = 0.1
w_new = w - learning_rate * result['dL_dw']
b_new = b - learning_rate * result['dL_db']

print(f"\nGradient Descent Update (η = {learning_rate}):")
print(f"  w_new = w - η·(∂L/∂w) = {w[0]} - {learning_rate}·{result['dL_dw'][0]:.4f} = {w_new[0]:.4f}")
print(f"  b_new = b - η·(∂L/∂b) = {b} - {learning_rate}·{result['dL_db']:.4f} = {b_new:.4f}")
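
The listing performs a single update. Repeating the same forward pass, backward pass, and update in a loop shows the loss falling over time. The sketch below is a self-contained scalar version of the same neuron; the helper name `loss_and_grads` and the choice of 50 steps are assumptions for illustration, while the starting values and learning rate match the example above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grads(w, b, x, y_true):
    """Forward pass, then chain-rule gradients for L = (sigmoid(w*x + b) - y)^2."""
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y_true) ** 2
    dL_dz = 2.0 * (a - y_true) * a * (1.0 - a)   # (dL/da)(da/dz)
    return loss, dL_dz * x, dL_dz                # dz/dw = x, dz/db = 1

w, b = 0.5, 0.1
x, y_true = 2.0, 1.0
learning_rate = 0.1

losses = []
for step in range(50):
    loss, dL_dw, dL_db = loss_and_grads(w, b, x, y_true)
    losses.append(loss)
    w -= learning_rate * dL_dw    # step against the gradient
    b -= learning_rate * dL_db

print(f"initial loss: {losses[0]:.6f}")
print(f"final loss:   {losses[-1]:.6f}")
print(f"w = {w:.4f}, b = {b:.4f}")
```

Every iteration reuses the same chain-rule products; only the point at which they are evaluated changes as `w` and `b` move.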

Test Your Understanding

If z = f(x, y), x = g(t), and y = h(t), which formula gives dz/dt?


Summary

Key Formulas

| Case | Formula |
| --- | --- |
| z = f(x,y), x = g(t), y = h(t) | dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt) |
| z = f(x,y), x = g(s,t), y = h(s,t) | ∂z/∂s = (∂z/∂x)(∂x/∂s) + (∂z/∂y)(∂y/∂s) |
| General case | ∂z/∂tⱼ = Σᵢ (∂z/∂xᵢ)(∂xᵢ/∂tⱼ) |
| Implicit: F(x,y) = 0 | dy/dx = -(∂F/∂x)/(∂F/∂y) |

The Tree Rule

  1. Draw the dependency tree from inputs to outputs
  2. Find all paths from your variable of interest to the output
  3. Multiply derivatives along each path
  4. Add all the path contributions

Key Takeaways

  1. The multivariable chain rule extends the single-variable rule to functions of several variables
  2. When changes can propagate through multiple paths, we must sum the contributions from each path
  3. Each path contributes a product of derivatives along that path
  4. Implicit differentiation is a special case of the chain rule
  5. Backpropagation in neural networks is exactly the chain rule applied systematically—this is the foundation of deep learning
The Chain of Change:
"When variables depend on other variables that depend on yet other variables, the chain rule tells us exactly how changes propagate—multiply along paths, add across paths."
Coming Next: In the next section, we explore Directional Derivatives and the Gradient—learning how to find the rate of change of a function in any direction, and discovering the gradient vector that points in the direction of steepest ascent.