Learning Objectives
By the end of this section, you will be able to:
- Apply the chain rule to functions of several variables when the independent variables are themselves functions of other variables
- Construct dependency trees to identify all paths through which changes propagate
- Compute partial derivatives using the chain rule for various cases: one independent variable, two independent variables, and the general case
- Use implicit differentiation as a special application of the chain rule
- Connect the multivariable chain rule to backpropagation in neural networks and understand why this is the foundation of deep learning
The Big Picture: Tracking Change Through Networks
"When a butterfly flaps its wings in Brazil, how does that affect the weather in Tokyo? The chain rule tells us exactly how changes propagate through interconnected systems."
In single-variable calculus, the chain rule tells us how to differentiate composite functions: if and , then . But what happens when functions depend on multiple variables, and those variables depend on other variables?
The multivariable chain rule extends this powerful concept to functions of several variables. The key insight is that when a quantity depends on intermediate variables like and , changes can propagate through multiple paths. We must account for all of them.
Why This Matters
The multivariable chain rule is essential in:
- Machine Learning: Backpropagation in neural networks is nothing but the chain rule applied systematically
- Physics: Computing how quantities change in different coordinate systems (Cartesian to polar, etc.)
- Economics: Understanding how changes in base inputs affect downstream economic indicators
- Engineering: Sensitivity analysis and related rates problems in multiple dimensions
Historical Context
The chain rule in its single-variable form was known to Leibniz in the 17th century. The extension to multiple variables developed throughout the 18th and 19th centuries as mathematicians like Euler, Lagrange, and Cauchy formalized the calculus of several variables.
The notation for partial derivatives was introduced by Legendre in 1786 and popularized by Jacobi in the 1840s. This notation elegantly captures the key idea: the "curly d" reminds us that we hold other variables constant.
In the 1960s and 1970s, researchers in machine learning and optimization rediscovered the chain rule's power for computing gradients efficiently. The technique of backpropagation, formalized by Rumelhart, Hinton, and Williams in 1986, is precisely the multivariable chain rule applied to computational graphs—and it became the foundation of modern deep learning.
Review: Single-Variable Chain Rule
Let's first recall the single-variable chain rule. If and , then the derivative of with respect to is:
Single-Variable Chain Rule
Multiply the derivatives along the chain
Intuition: If increases by 1 unit when increases by 1 unit (i.e., ), and increases by 2 units when increases by 1 unit (i.e., ), then increases by 2 units when increases by 1 unit. The rates multiply.
Case 1: One Independent Variable
Suppose where both and are functions of a single variable :
How does change with ? A change in causes both and to change, and each of these changes affects .
Chain Rule: Case 1
Sum the contributions from all paths: t → x → z and t → y → z
The Dependency Tree
A dependency tree (also called a computational graph) shows how variables depend on each other. For Case 1:
The Rule: To find , identify all paths from to . For each path, multiply the derivatives along that path. Then add all the products together.
- Path 1: contributes
- Path 2: contributes
Interactive Dependency Explorer
Chain Rule Dependency Tree
Key Insight: To find how z changes with t, we must account for all paths: t affects x and y, which both affect z. We multiply derivatives along each path and add them together.
Case 2: Two Independent Variables
Now suppose where and are both functions of two independent variables and :
Now we need to find both and .
Chain Rule: Case 2
Partial vs Total Derivatives
Notice that we use partial derivative notation rather than because ultimately depends on both and . When we compute , we hold constant.
The General Chain Rule
The pattern generalizes to any number of intermediate and independent variables. If depends on , and each depends on , then:
General Chain Rule
Sum over all intermediate variables
Reading the formula: To find how changes with , look at every intermediate variable . For each, multiply how changes with by how changes with . Sum all these contributions.
The Tree Rule
To apply the chain rule:
- Draw the dependency tree from inputs to outputs
- Identify all paths from your input variable to the output
- For each path, multiply all derivatives along the path
- Add all the path contributions together
Interactive Chain Rule Visualizer
Explore how the chain rule works for a concrete example. Adjust the parameter and see how changes propagate through the composite function.
Chain Rule in Action: How Changes Propagate
Implicit Differentiation Revisited
Implicit differentiation, which we learned earlier for single-variable calculus, is actually a special case of the chain rule for multivariable functions.
Suppose defines implicitly as a function of . Since for all , differentiating both sides with respect to using the chain rule gives:
Solving for :
Implicit Differentiation Formula
(provided )
Worked Examples
Example: Polar Coordinates
Suppose and we want to express the partial derivatives in polar coordinates, where and .
Find :
Find :
This is exactly how coordinate transformations work—the chain rule tells us how derivatives transform from one coordinate system to another.
Real-World Applications
Physics: Related Rates in Multiple Dimensions
In physics, quantities often depend on multiple variables that change with time. The chain rule is essential for computing how rates of change propagate.
Example: The pressure of an ideal gas depends on volume and temperature through . If both and change with time, how fast does change?
Other physics applications include:
- Thermodynamics: Changes in entropy, internal energy, and free energy with multiple state variables
- Fluid mechanics: The material derivative
- Electromagnetism: How fields change in moving reference frames
Machine Learning: Backpropagation
The multivariable chain rule is the mathematical foundation of deep learning. When training a neural network, we need to compute how the loss changes with respect to every weight in the network. This is exactly what the chain rule does!
In a neural network, we have a chain of computations:
Each layer's output depends on the previous layer's output and the layer's weights. To find for a weight deep in the network, we multiply all the partial derivatives along the path from back to .
The Vanishing/Exploding Gradient Problem
Since backpropagation multiplies many derivatives together, if each derivative is much smaller than 1 (or much larger than 1), the product can become vanishingly small (or explosively large). This is why careful initialization and normalization are crucial in deep learning!
Interactive Backpropagation Demo
See the chain rule in action for a simple neural network. Watch how gradients flow backward from the loss to the weights.
The Chain Rule IS Backpropagation
Key Insight: Backpropagation is simply the chain rule applied systematically. To find how the loss L changes with any weight w, we multiply all the partial derivatives along the path from L back to w. This is exactly the multivariable chain rule!
Python Implementation
Numerical Chain Rule
Backpropagation Implementation
Test Your Understanding
Test Your Understanding
If z = f(x, y), x = g(t), and y = h(t), which formula gives dz/dt?
Summary
Key Formulas
| Case | Formula |
|---|---|
| z = f(x,y), x = g(t), y = h(t) | dz/dt = (∂z/∂x)(dx/dt) + (∂z/∂y)(dy/dt) |
| z = f(x,y), x = g(s,t), y = h(s,t) | ∂z/∂s = (∂z/∂x)(∂x/∂s) + (∂z/∂y)(∂y/∂s) |
| General case | ∂z/∂tⱼ = Σᵢ (∂z/∂xᵢ)(∂xᵢ/∂tⱼ) |
| Implicit: F(x,y) = 0 | dy/dx = -(∂F/∂x)/(∂F/∂y) |
The Tree Rule
- Draw the dependency tree from inputs to outputs
- Find all paths from your variable of interest to the output
- Multiply derivatives along each path
- Add all the path contributions
Key Takeaways
- The multivariable chain rule extends the single-variable rule to functions of several variables
- When changes can propagate through multiple paths, we must sum the contributions from each path
- Each path contributes a product of derivatives along that path
- Implicit differentiation is a special case of the chain rule
- Backpropagation in neural networks is exactly the chain rule applied systematically—this is the foundation of deep learning
Coming Next: In the next section, we explore Directional Derivatives and the Gradient—learning how to find the rate of change of a function in any direction, and discovering the gradient vector that points in the direction of steepest ascent.