Learning Objectives
By the end of this section, you will be able to:
- State and apply the chain rule for differentiating compositions of functions
- Identify the inner and outer functions in a composition and differentiate each correctly
- Understand intuitively why rates of change multiply when functions are composed
- Apply the chain rule multiple times for nested compositions
- Use Leibniz notation to express the chain rule as cancellation of differentials
- Connect the chain rule to backpropagation, the algorithm that powers neural network training
- Avoid common mistakes like forgetting the inner derivative
The Big Picture: Differentiating Compositions
"The chain rule is the single most important differentiation technique. It's the mathematical foundation of how neural networks learn."
We've learned the product rule and quotient rule for combining functions through multiplication and division. But what about when we compose functions — feed the output of one function into another?
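Consider $h(x) = \sin(x^2)$. This is a composition: we first square $x$, then take the sine of the result. If we write $g(x) = x^2$ and $f(u) = \sin u$, then $h(x) = f(g(x))$.
How do we find $h'(x)$? The chain rule tells us:
The Chain Rule
$$\frac{d}{dx}\, f(g(x)) = f'(g(x)) \cdot g'(x)$$
"The derivative of the outer function (evaluated at the inner) times the derivative of the inner function."
For our example $h(x) = \sin(x^2)$:
- Outer function: $f(u) = \sin u$, so $f'(u) = \cos u$
- Inner function: $g(x) = x^2$, so $g'(x) = 2x$
- Chain rule: $h'(x) = f'(g(x)) \cdot g'(x) = \cos(x^2) \cdot 2x$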
Why the Chain Rule Matters
The chain rule is arguably the most important differentiation rule because:
- Most real-world functions are compositions (e.g., $\sin(x^2)$, $(3x + 1)^2$)
- It's the mathematical foundation of backpropagation in neural networks
- It enables automatic differentiation in deep learning frameworks like PyTorch and TensorFlow
- Without it, we couldn't train modern AI systems!
Historical Context
The chain rule was developed alongside calculus itself by Leibniz in the late 17th century. Leibniz's notation for derivatives made the chain rule almost self-evident, because it looks like cancellation of fractions:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
This "fraction-like" behavior isn't just a coincidence; it reflects something deep about how rates of change compose. If $y$ changes with $u$ at a certain rate, and $u$ changes with $x$ at another rate, then the rates multiply.
The chain rule took on new importance in the 1980s when Rumelhart, Hinton, and Williams showed that it could be used to train multi-layer neural networks through an algorithm called backpropagation. Today, every time you use ChatGPT, image recognition, or recommendation systems, the chain rule is working behind the scenes!
Intuitive Understanding: Rates Multiply
The key insight of the chain rule is that rates of change multiply when functions are composed. Let's build intuition with a concrete example.
The Gear Analogy
Imagine two connected gears:
- Gear 1 (inner function): For every rotation of the input shaft, Gear 1 rotates 3 times. (Rate: 3)
- Gear 2 (outer function): For every rotation of Gear 1, Gear 2 rotates 2 times. (Rate: 2)
Question: For every rotation of the input shaft, how many times does Gear 2 rotate?
Answer: 3 × 2 = 6 times! The rates multiply.
Rate Multiplication
Total rate: f'(g(x)) × g'(x)
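The gear picture can be checked with a few lines of Python. This is a minimal sketch: the function names `gear1`, `gear2`, `gear_train`, and `average_rate` are made up for illustration, and the gears are modeled as simple linear functions.

```python
def gear1(shaft_turns):
    """Inner function g: Gear 1 turns 3 times per input-shaft rotation."""
    return 3.0 * shaft_turns


def gear2(gear1_turns):
    """Outer function f: Gear 2 turns 2 times per Gear 1 rotation."""
    return 2.0 * gear1_turns


def gear_train(shaft_turns):
    """Composition f(g(x)): input shaft -> Gear 1 -> Gear 2."""
    return gear2(gear1(shaft_turns))


def average_rate(func, x, dx=1.0):
    """Change in output per unit change in input over [x, x + dx]."""
    return (func(x + dx) - func(x)) / dx


inner_rate = average_rate(gear1, 0.0)          # rate of Gear 1 vs. shaft
outer_rate = average_rate(gear2, gear1(0.0))   # rate of Gear 2 vs. Gear 1
total_rate = average_rate(gear_train, 0.0)     # rate of Gear 2 vs. shaft

print(inner_rate, outer_rate, total_rate)
```

Because both gears are linear, the rate of the composition is exactly the product of the two individual rates. For curved functions the same statement holds for instantaneous rates, which is what the chain rule formalizes.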
Currency Conversion Analogy
Consider converting US Dollars to Japanese Yen through Euros:
- 1 USD = 0.92 EUR (rate: 0.92)
- 1 EUR = 158 JPY (rate: 158)
To find how many Yen per Dollar, we multiply the rates: 0.92 × 158 = 145.36 JPY/USD.
This is exactly what the chain rule does — it chains together rates of change by multiplying them!
Formal Statement and Proof
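Theorem (Chain Rule): If $g$ is differentiable at $x$ and $f$ is differentiable at $g(x)$, then the composite function $f \circ g$ is differentiable at $x$, and:
$$(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$$
Proof:
Let $y = f(u)$ and $u = g(x)$.
We want to find $\frac{dy}{dx} = \lim_{h \to 0} \frac{f(g(x+h)) - f(g(x))}{h}$
Let $\Delta u = g(x+h) - g(x)$. Then:
$$\frac{f(g(x+h)) - f(g(x))}{h} = \frac{f(u + \Delta u) - f(u)}{h}$$
Multiply and divide by $\Delta u$ (when $\Delta u \neq 0$):
$$\frac{f(u + \Delta u) - f(u)}{h} = \frac{f(u + \Delta u) - f(u)}{\Delta u} \cdot \frac{\Delta u}{h}$$
As $h \to 0$, we have $\Delta u \to 0$ (by continuity of $g$):
$$\frac{dy}{dx} = \lim_{\Delta u \to 0} \frac{f(u + \Delta u) - f(u)}{\Delta u} \cdot \lim_{h \to 0} \frac{g(x+h) - g(x)}{h} = f'(u) \cdot g'(x) = f'(g(x)) \cdot g'(x)$$
∎
Technical Note
The proof above has a subtle issue: what if $\Delta u = 0$ for some $h \neq 0$? A rigorous proof handles this case carefully using a modified definition. For our purposes, the intuitive idea of "rates multiply" is the key insight.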
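One standard way to close this gap is sketched below. The names $Q$ and $k$ are chosen here for illustration; the idea is to replace the difference quotient of $f$ with a function that is still defined when the change in the inner function is zero.

```latex
\[
Q(k) =
\begin{cases}
\dfrac{f(u + k) - f(u)}{k}, & k \neq 0, \\[2ex]
f'(u), & k = 0,
\end{cases}
\qquad u = g(x), \quad k = g(x + h) - g(x).
\]
% Q is continuous at k = 0 because f is differentiable at u.
% The identity below holds for every h, including those with k = 0:
\[
f(g(x + h)) - f(g(x)) = Q(k)\, k .
\]
% Dividing by h and letting h -> 0 (so k -> 0 by continuity of g):
\[
\lim_{h \to 0} \frac{f(g(x + h)) - f(g(x))}{h}
= \lim_{h \to 0} Q(k) \cdot \frac{g(x + h) - g(x)}{h}
= f'(g(x)) \cdot g'(x).
\]
```

No division by $k$ happens in the final limit, so the case $k = 0$ no longer needs special treatment.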
Information Flow Demonstration
Watch how values and derivatives flow through a composed function. The forward pass computes the output, while the backward pass applies the chain rule to compute derivatives.
Information Flow Through Composition
Function: y = (3x + 1)²
Outer: f(u) = u² | Inner: g(x) = 3x + 1
Forward Pass: Computing the output value
Backward Pass: Computing the derivative (Chain Rule)
The Chain Rule Formula
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 2u \cdot 3 = 6(3x + 1)$$
Key Insight: The chain rule tells us that rates of change multiply when functions are composed. At x = 2 (where u = 7):
- If y changes 14.0× as fast as u
- And u changes 3× as fast as x
- Then y changes 42.0× as fast as x
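The forward and backward passes for y = (3x + 1)² can be written out as a small sketch. The function names `forward` and `backward` are illustrative, and the evaluation point x = 2 is inferred from the rates 14 and 3 quoted above.

```python
def forward(x):
    """Forward pass: compute the intermediate value u and the output y."""
    u = 3.0 * x + 1.0   # inner function g(x) = 3x + 1
    y = u ** 2          # outer function f(u) = u^2
    return u, y


def backward(u):
    """Backward pass: local derivatives, then their product (chain rule)."""
    dy_du = 2.0 * u     # f'(u) = 2u, evaluated at the stored value of u
    du_dx = 3.0         # g'(x) = 3
    dy_dx = dy_du * du_dx
    return dy_du, du_dx, dy_dx


u, y = forward(2.0)
dy_du, du_dx, dy_dx = backward(u)

print(f"u = {u}, y = {y}")
print(f"dy/du = {dy_du}, du/dx = {du_dx}, dy/dx = {dy_dx}")
```

The backward pass reuses the value of u saved during the forward pass. That is why the outer derivative is evaluated at u = g(x) rather than at x.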
Interactive Exploration
Use the visualizer below to explore the chain rule with different composed functions. Watch how the derivative of the composition depends on both the inner and outer derivatives.
[Interactive widget: Chain Rule Interactive Explorer. Composed function: y = sin(x²). Derivative: (f∘g)'(x) = cos(x²) · 2x, with the chain rule breakdown shown at x = 1.00.]
The chain rule shows how changes propagate through composed functions: the rate at which y changes with x equals the rate at which y changes with u, multiplied by the rate at which u changes with x.
Worked Examples
Example 1: Power of a Linear Function
Find
Solution: Identify inner and outer functions:
- Outer: , so
- Inner: , so
Applying the chain rule:
Example 2: Exponential Composition
Find
Solution:
- Outer: , so
- Inner: , so
Applying the chain rule:
Example 3: Trigonometric Composition
Find
Solution:
- Outer: , so
- Inner: , so
Applying the chain rule:
Example 4: Square Root Composition
Find
Solution: Write as :
- Outer: , so
- Inner: , so
Applying the chain rule:
Example 5: Logarithmic Composition
Find
Solution:
- Outer: , so
- Inner: , so
Applying the chain rule:
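As a hedged illustration of the five patterns above, the sketch below picks one hypothetical function of each kind. These particular functions, and the helper `chain_rule`, are assumptions chosen for illustration. It assembles each derivative from its outer and inner pieces, then compares the result with a hand-simplified closed form.

```python
import math


def chain_rule(outer_prime, inner, inner_prime):
    """Build x -> f'(g(x)) * g'(x) from the three pieces identified by hand."""
    return lambda x: outer_prime(inner(x)) * inner_prime(x)


# (label, f', g, g', hand-simplified derivative) for hypothetical examples
examples = [
    ("(2x + 1)^5", lambda u: 5 * u**4, lambda x: 2 * x + 1, lambda x: 2.0,
     lambda x: 10 * (2 * x + 1) ** 4),
    ("exp(3x)", math.exp, lambda x: 3 * x, lambda x: 3.0,
     lambda x: 3 * math.exp(3 * x)),
    ("cos(5x)", lambda u: -math.sin(u), lambda x: 5 * x, lambda x: 5.0,
     lambda x: -5 * math.sin(5 * x)),
    ("sqrt(x^2 + 1)", lambda u: 0.5 / math.sqrt(u), lambda x: x**2 + 1,
     lambda x: 2 * x, lambda x: x / math.sqrt(x**2 + 1)),
    ("ln(x^2 + 1)", lambda u: 1 / u, lambda x: x**2 + 1,
     lambda x: 2 * x, lambda x: 2 * x / (x**2 + 1)),
]

x0 = 0.7  # arbitrary test point
results = []
for label, outer_prime, inner, inner_prime, simplified in examples:
    derivative = chain_rule(outer_prime, inner, inner_prime)
    results.append((label, derivative(x0), simplified(x0)))
    print(f"{label:>14}: chain rule = {derivative(x0):.6f}, "
          f"simplified = {simplified(x0):.6f}")
```

In every row the outer derivative is applied to `inner(x)`, not to `x` itself, and is then multiplied by the inner derivative.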
Nested Compositions: Chains of Chains
What if we have three or more functions composed together? The chain rule extends naturally: we just multiply all the derivatives.
For $y = f(g(h(x)))$:
$$\frac{dy}{dx} = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$$
Example: Triple Composition
Find
Solution: Three layers of composition:
- Outermost: →
- Middle: →
- Innermost: →
Applying the chain rule from outside in:
Strategy for Nested Compositions
- Identify all the layers of composition from outside to inside
- Write down each function and its derivative
- Multiply all the derivatives together, evaluating each at the appropriate composed value
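The strategy above can be sketched as a loop over layers. The helper `chain_derivative` and the test function y = sin(exp(x²)) are hypothetical choices for illustration. The loop runs from the innermost layer outward, which gives the same product as working from the outside in, because each local derivative is evaluated at the value entering its layer.

```python
import math


def chain_derivative(layers, x):
    """Differentiate a nested composition at x.

    `layers` lists (function, derivative) pairs from innermost to outermost.
    Each derivative is evaluated at the value fed into that layer.
    """
    value = x
    product = 1.0
    for func, func_prime in layers:
        product *= func_prime(value)  # local derivative at this layer's input
        value = func(value)           # pass the value on to the next layer
    return value, product


# Hypothetical triple composition: y = sin(exp(x^2))
layers = [
    (lambda x: x**2, lambda x: 2 * x),  # innermost: x^2
    (math.exp, math.exp),               # middle: exp(u)
    (math.sin, math.cos),               # outermost: sin(v)
]

x0 = 0.5
y, dy_dx = chain_derivative(layers, x0)

# The same derivative written out by hand, from outside in
by_hand = math.cos(math.exp(x0**2)) * math.exp(x0**2) * 2 * x0

print(y, dy_dx, by_hand)
```

Adding a fourth or fifth layer only means appending another pair to `layers`.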
The Chain Rule in Leibniz Notation
Leibniz notation makes the chain rule look like fraction "cancellation":
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
The $du$'s appear to "cancel" like fractions!
While this isn't literally fraction cancellation (each derivative is a limit of ratios, not a ratio itself), the notation is incredibly useful for remembering the chain rule and extends naturally to multiple variables.
Example Using Leibniz Notation
Let and . Find .
Chain rule:
Machine Learning Applications: Backpropagation
The chain rule is the mathematical foundation of how neural networks learn. The algorithm is called backpropagation.
How Neural Networks Use the Chain Rule
A neural network is essentially a giant composition of functions:
$$\hat{y} = f_n(f_{n-1}(\cdots f_2(f_1(x)) \cdots))$$
Each $f_i$ is a layer (linear transformation + activation function)
To train the network, we need to compute how the loss changes with respect to each weight. This requires differentiating through the entire composition — exactly what the chain rule does!
Forward Pass
Compute output from input by applying each layer in sequence. Values flow forward through the network.
Backward Pass
Compute gradients from output to input using the chain rule. Gradients flow backward through the network.
The Backpropagation Equation
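For a network with output $\hat{y}$ and loss $L$, the gradient with respect to a parameter $w_k$ in layer $k$ is:
$$\frac{\partial L}{\partial w_k} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_{n-1}} \cdots \frac{\partial a_{k+1}}{\partial a_k} \cdot \frac{\partial a_k}{\partial w_k}$$
Here $a_i$ denotes the output of layer $i$ (with $\hat{y} = a_n$ for an $n$-layer network).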
Each term is a local derivative; the chain rule multiplies them all
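To make the product of local derivatives concrete, here is a minimal sketch of a hand-written backward pass. The two-layer scalar network, the tanh activation, the squared-error loss, and all names (`forward`, `loss_fn`, `backward`, `params`) are assumptions made for illustration, not a reference implementation.

```python
import math


def forward(x, params):
    """Forward pass through a hypothetical 2-layer scalar network."""
    w1, b1, w2, b2 = params
    z1 = w1 * x + b1        # layer 1: linear part
    a1 = math.tanh(z1)      # layer 1: activation
    y_hat = w2 * a1 + b2    # layer 2: linear output
    return z1, a1, y_hat


def loss_fn(y_hat, target):
    """Squared-error loss."""
    return (y_hat - target) ** 2


def backward(x, target, params):
    """Backward pass: multiply local derivatives from the loss back to each weight."""
    w1, b1, w2, b2 = params
    z1, a1, y_hat = forward(x, params)
    dL_dyhat = 2.0 * (y_hat - target)   # d(loss)/d(y_hat)
    dL_dw2 = dL_dyhat * a1              # d(y_hat)/d(w2) = a1
    dL_db2 = dL_dyhat                   # d(y_hat)/d(b2) = 1
    dL_da1 = dL_dyhat * w2              # d(y_hat)/d(a1) = w2
    dL_dz1 = dL_da1 * (1.0 - a1 ** 2)   # d(tanh z)/dz = 1 - tanh(z)^2
    dL_dw1 = dL_dz1 * x                 # d(z1)/d(w1) = x
    dL_db1 = dL_dz1                     # d(z1)/d(b1) = 1
    return dL_dw1, dL_db1, dL_dw2, dL_db2


params = (0.5, -0.1, 1.5, 0.2)  # arbitrary values for (w1, b1, w2, b2)
grads = backward(x=0.8, target=1.0, params=params)
print(grads)
```

Each line of `backward` multiplies the gradient arriving from the layer above by one local derivative. The gradient for `w1` is therefore a product of four factors, one per step between the loss and that weight.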
Why Deep Learning Works
The chain rule enables us to compute gradients through arbitrarily deep compositions efficiently. This is why we can train neural networks with hundreds of layers!
- Automatic differentiation (autograd) libraries like PyTorch implement the chain rule automatically
- Gradient descent uses these gradients to update weights and minimize loss
- Every modern AI system — from GPT to image classifiers — relies on the chain rule
Python Implementation
Numerical Verification
Let's verify the chain rule numerically by computing the derivative two ways:
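A minimal sketch of such a check, using the running example h(x) = sin(x²), is given below. The function names, the step size, and the sample points are illustrative choices. The first way is the analytic chain-rule formula and the second is a central-difference approximation.

```python
import math


def h(x):
    """Composite function h(x) = sin(x^2)."""
    return math.sin(x ** 2)


def h_prime_chain_rule(x):
    """Analytic derivative from the chain rule: cos(x^2) * 2x."""
    return math.cos(x ** 2) * 2 * x


def h_prime_numerical(x, step=1e-5):
    """Central-difference approximation of h'(x)."""
    return (h(x + step) - h(x - step)) / (2 * step)


for x in [0.0, 0.5, 1.0, 2.0]:
    analytic = h_prime_chain_rule(x)
    numeric = h_prime_numerical(x)
    print(f"x = {x:.1f}: chain rule = {analytic:+.6f}, "
          f"numerical = {numeric:+.6f}, "
          f"difference = {abs(analytic - numeric):.2e}")
```

The two columns should agree to many decimal places. Dropping the factor `2 * x` from `h_prime_chain_rule` makes the mismatch obvious at every point except where 2x happens to equal 1.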
The Chain Rule in Autograd
Here's how the chain rule appears in automatic differentiation, the foundation of deep learning frameworks:
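The sketch below is a toy reverse-mode autodiff for scalars, written in plain Python. It is not how PyTorch or TensorFlow are implemented; the `Value` class and its methods are hypothetical names used only to show the idea. Each operation records its local derivatives, and `backward` multiplies them along the graph.

```python
import math


class Value:
    """A scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def sin(self):
        return Value(math.sin(self.data), (self,), (math.cos(self.data),))

    def backward(self):
        """Reverse pass: visit nodes from output to inputs, applying the chain rule."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(output)/d(output)
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                # upstream gradient times local derivative
                parent.grad += node.grad * local


x = Value(1.0)
y = (x * x).sin()   # y = sin(x^2)
y.backward()
print(y.data, x.grad)
```

After `y.backward()`, `x.grad` holds cos(x²) · 2x evaluated at x = 1. Gradients are accumulated with `+=` because a value such as `x` can feed into the result along more than one path, and the contributions from each path add up.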
Common Mistakes to Avoid
Mistake 1: Forgetting the inner derivative
Wrong: $\frac{d}{dx}\sin(x^2) = \cos(x^2)$
Correct: $\frac{d}{dx}\sin(x^2) = \cos(x^2) \cdot 2x$
Always remember to multiply by the derivative of the inner function!
Mistake 2: Wrong evaluation point for outer derivative
Wrong: $\frac{d}{dx}\sin(x^2) = \cos(x) \cdot 2x$
Correct: $\frac{d}{dx}\sin(x^2) = \cos(x^2) \cdot 2x$
The outer derivative must be evaluated at $g(x)$, not at $x$!
Mistake 3: Applying chain rule when unnecessary
For , just use the power rule directly:
But for , you can use either the chain rule or simplify first.
Use the chain rule only when there's actual composition involved.
Mistake 4: Confusing the order of composition
Be careful: $f(g(x)) \neq g(f(x))$ in general!
- $\sin(x^2)$: first square, then take sine
- $(\sin x)^2$: first take sine, then square
These have different derivatives! Make sure you identify the correct inner and outer functions.
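A quick sketch makes the difference visible. The function names and the test point are illustrative; each derivative is written from the chain rule with the inner and outer roles swapped.

```python
import math


def d_sin_of_square(x):
    """d/dx sin(x^2): outer sin(u), inner x^2."""
    return math.cos(x ** 2) * 2 * x


def d_square_of_sin(x):
    """d/dx (sin x)^2: outer u^2, inner sin x."""
    return 2 * math.sin(x) * math.cos(x)


x0 = 1.0
print("d/dx sin(x^2)   at x = 1:", d_sin_of_square(x0))
print("d/dx (sin x)^2  at x = 1:", d_square_of_sin(x0))
```

The two printed values differ, confirming that swapping the inner and outer functions changes the derivative.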
Mistake 5: Missing the chain rule with implicit inner functions
Wrong: $\frac{d}{dx}\sin(2x) = \cos(2x)$
Correct: $\frac{d}{dx}\sin(2x) = 2\cos(2x)$
Whenever anything other than a plain $x$ is inside another function, you need the chain rule!
Test Your Understanding
Chain Rule Quiz
Question 1 of 6: Find the derivative of f(x) = sin(2x)
Outer: sin(u), Inner: u = 2x
Summary
The chain rule is the fundamental technique for differentiating composed functions. It states that rates of change multiply when functions are chained together.
Key Formula
$$(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$$
Or in Leibniz notation:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
Key Concepts
| Concept | Description |
|---|---|
| Outer function | The function on the "outside" — the last operation applied |
| Inner function | The function on the "inside" — the first operation applied to x |
| Rates multiply | The chain rule says rates of change multiply through compositions |
| Backpropagation | The chain rule applied repeatedly to compute gradients in neural networks |
| Leibniz notation | dy/dx = (dy/du)(du/dx) — looks like fraction cancellation |
| Common mistake | Forgetting to multiply by the inner derivative g'(x) |
Key Takeaways
- The chain rule is the most important differentiation rule — it handles all function compositions
- Identify the outer function f and inner function g before applying the rule
- Evaluate the outer derivative at the inner function value: f'(g(x)), not f'(x)
- Multiply by the inner derivative g'(x) — never forget this step!
- For nested compositions, apply the chain rule from outside in, multiplying all derivatives
- The chain rule is the foundation of backpropagation, making modern AI possible
Coming Next: In the next section, we'll learn Implicit Differentiation — how to find derivatives when the function isn't given explicitly, using the chain rule as our key tool.