Learning Objectives
By the end of this section, you will be able to:
- Identify critical points of functions of two variables, where $\nabla f = \mathbf{0}$ or the gradient does not exist
- Apply the Second Derivative Test using the Hessian matrix to classify critical points as local maxima, local minima, or saddle points
- Distinguish between local and absolute (global) extrema
- Find absolute extrema on closed, bounded regions by checking both interior critical points and boundary values
- Connect these concepts to optimization in machine learning, particularly gradient descent algorithms
- Visualize saddle points and understand why they are a unique feature of multivariable optimization
The Big Picture: Why Optimization Matters
"The essence of mathematics is not to make simple things complicated, but to make complicated things simple." — Stan Gudder
Optimization — the art of finding the best possible outcome — is one of the most important applications of calculus. In single-variable calculus, we learned to find where a function reaches its maximum or minimum values. Now, with functions of two or more variables, the geometry becomes richer and the applications become even more powerful.
Consider these real-world optimization problems:
📊 Business & Economics
- Maximize profit as a function of price and quantity
- Minimize cost given constraints on resources
- Optimize portfolio allocation across assets
- Find optimal supply chain logistics
🔬 Science & Engineering
- Design structures that minimize material while maximizing strength
- Find equilibrium states in physical systems
- Optimize chemical reaction conditions
- Design efficient heat exchangers
🤖 Machine Learning
- Train neural networks by minimizing loss functions
- Find optimal hyperparameters
- Fit models to data (regression, classification)
- Optimize reinforcement learning policies
🎯 Everyday Life
- Find the shortest path between locations
- Maximize signal strength in wireless networks
- Optimize nutrition while minimizing cost
- Design buildings for maximum natural light
The Central Question
Given a function $f(x, y)$, how do we find the points where $f$ reaches its largest or smallest values? And once we find candidate points, how do we determine which are maxima, which are minima, and which are neither?
Critical Points in Multiple Dimensions
In single-variable calculus, we found that extrema occur where $f'(x) = 0$ or where $f'(x)$ doesn't exist. The multivariable version extends this naturally.
Definition: Critical Points
Critical Point
A point $(a, b)$ is a critical point of $f$ if:
- $\nabla f(a, b) = \mathbf{0}$, meaning both $f_x(a, b) = 0$ and $f_y(a, b) = 0$
- OR one or both partial derivatives do not exist at $(a, b)$
Geometrically, at a critical point where the gradient is zero, the tangent plane to the surface is horizontal. The surface is "flat" in all directions at that point.
Finding Critical Points: The Process
- Compute the partial derivatives $f_x$ and $f_y$
- Set both equal to zero and solve the system of equations: $f_x(x, y) = 0$ and $f_y(x, y) = 0$
- Each solution $(a, b)$ is a critical point
- Also check for points where derivatives don't exist
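The process above can be sketched in Python for a hypothetical quadratic, chosen only for illustration. For $f(x, y) = x^2 + y^2 - 2x - 4y + 5$, the equations $f_x = 0$ and $f_y = 0$ form a linear system, solved here with Cramer's rule. The function and the helper name `solve_2x2` are assumptions, not part of the original text.

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve [[a11, a12], [a21, a22]] @ [x, y] = [b1, b2] by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("singular system: no unique critical point")
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - b1 * a21) / det
    return x, y


# Hypothetical example function: f(x, y) = x^2 + y^2 - 2x - 4y + 5
def f(x, y):
    return x**2 + y**2 - 2 * x - 4 * y + 5


# f_x = 2x - 2 = 0 and f_y = 2y - 4 = 0, written as 2x + 0y = 2, 0x + 2y = 4
critical_point = solve_2x2(2, 0, 0, 2, 2, 4)
print(critical_point, f(*critical_point))
```

For non-quadratic functions the system is usually nonlinear, so a symbolic or iterative solver would be needed instead of Cramer's rule.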
Example: Finding Critical Points
Find all critical points of .
Solution:
Step 1: Compute partial derivatives:
Step 2: Solve :
Critical point:
Value at critical point:
Gradient vectors point in the direction of steepest ascent; at critical points where the gradient exists, these vectors vanish.
The Second Derivative Test
Once we find a critical point, we need to determine whether it's a local maximum, local minimum, or saddle point. Just as in single-variable calculus, we use second derivatives — but now we need all of them, organized into the Hessian matrix.
The Hessian Matrix
Hessian Matrix
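For a function $f(x, y)$ with continuous second partial derivatives, the Hessian matrix is:
$$H = \begin{bmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{bmatrix}$$
By Clairaut's theorem, $f_{xy} = f_{yx}$, so the Hessian is symmetric.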
The Hessian captures how the function curves in all directions. Its eigenvalues reveal the principal curvatures:
- Both eigenvalues positive: Surface curves upward in all directions (bowl shape) → local minimum
- Both eigenvalues negative: Surface curves downward in all directions (inverted bowl) → local maximum
- Eigenvalues with opposite signs: Surface curves up in some directions and down in others → saddle point
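As a rough illustration of the eigenvalue rules above, a symmetric $2 \times 2$ Hessian has eigenvalues available in closed form. The sketch below assumes the second partials have already been evaluated at a critical point; the function names are hypothetical.

```python
import math


def hessian_eigenvalues(fxx, fxy, fyy):
    """Eigenvalues (smaller, larger) of the symmetric matrix [[fxx, fxy], [fxy, fyy]]."""
    mean = (fxx + fyy) / 2
    radius = math.hypot((fxx - fyy) / 2, fxy)
    return mean - radius, mean + radius


def curvature_type(fxx, fxy, fyy):
    """Classify a critical point from the signs of the Hessian eigenvalues."""
    lo, hi = hessian_eigenvalues(fxx, fxy, fyy)
    if lo > 0:
        return "local minimum"   # curves upward in every direction
    if hi < 0:
        return "local maximum"   # curves downward in every direction
    if lo < 0 < hi:
        return "saddle point"    # up in some directions, down in others
    return "inconclusive"        # at least one zero eigenvalue


print(curvature_type(2, 0, 2))    # bowl
print(curvature_type(-2, 0, -2))  # inverted bowl
print(curvature_type(2, 0, -2))   # saddle
```

Exact comparisons with zero are fine for hand-entered values like these; with derivatives computed numerically, a small tolerance would be a safer choice.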
The Second Derivative Test
Second Derivative Test for Functions of Two Variables
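Let $(a, b)$ be a critical point of $f$. Define the discriminant:
$$D = f_{xx}(a, b)\, f_{yy}(a, b) - \left[f_{xy}(a, b)\right]^2$$
This is the determinant of the Hessian: $D = \det H(a, b)$.
(a) If $D > 0$ and $f_{xx}(a, b) > 0$, then $f$ has a local minimum at $(a, b)$.
(b) If $D > 0$ and $f_{xx}(a, b) < 0$, then $f$ has a local maximum at $(a, b)$.
(c) If $D < 0$, then $f$ has a saddle point at $(a, b)$.
(d) If $D = 0$, the test is inconclusive.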
Why D is the Key
The discriminant $D$ equals the product of the Hessian's eigenvalues. When $D > 0$, both eigenvalues have the same sign, and the sign of $f_{xx}$ tells us which sign that is. When $D < 0$, they have opposite signs, which is the hallmark of a saddle point.
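The test can be written as a small helper. This is a minimal sketch that takes the three second partials evaluated at a critical point as plain numbers; the function name and the sample values are made up for illustration.

```python
def second_derivative_test(fxx, fyy, fxy):
    """Classify a critical point from second partials evaluated at that point."""
    D = fxx * fyy - fxy**2  # determinant of the Hessian
    if D > 0:
        # D > 0 forces fxx != 0, so its sign decides between min and max
        return "local minimum" if fxx > 0 else "local maximum"
    if D < 0:
        return "saddle point"
    return "inconclusive"


print(second_derivative_test(2, 2, 0))    # D = 4
print(second_derivative_test(-2, -3, 1))  # D = 5
print(second_derivative_test(1, 1, 2))    # D = -3
print(second_derivative_test(1, 4, 2))    # D = 0
```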
Example: Applying the Second Derivative Test
Classify the critical point of at .
Solution:
Compute second partial derivatives:
Compute the discriminant:
Since $D > 0$ and $f_{xx} > 0$, the point is a local minimum.
In fact, we can rewrite: , confirming this is a paraboloid with minimum at .
Saddle Points: A Unique Phenomenon
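Saddle points are one of the most distinctive features of multivariable calculus. A saddle point is a critical point that is neither a local maximum nor a local minimum: the function rises in some directions and falls in others. The closest single-variable analog is a horizontal inflection point, such as $x = 0$ for $f(x) = x^3$, but curving up in one direction while curving down in another requires at least two variables.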
What Makes a Saddle Point Special?
At a saddle point:
- The gradient is zero (it's a critical point)
- The function increases in some directions from the point
- The function decreases in other directions from the point
- The Hessian is indefinite (has both positive and negative eigenvalues)
The name comes from the shape of a horse saddle: it curves upward front-to-back (like the rider sits) but curves downward side-to-side (so the rider's legs hang down).
Example: The Classic Saddle
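Consider $f(x, y) = x^2 - y^2$. At the origin:
- $f_x = 2x = 0$ and $f_y = -2y = 0$ at $(0, 0)$
- $f_{xx} = 2$, $f_{yy} = -2$, $f_{xy} = 0$
- $D = (2)(-2) - 0^2 = -4 < 0$ → Saddle point!
Along the $x$-axis, $f(x, 0) = x^2$ looks like a parabola opening upward. Along the $y$-axis, $f(0, y) = -y^2$ looks like a parabola opening downward.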
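The opposite behavior along the two axes is easy to check numerically. The sketch below assumes the standard saddle $f(x, y) = x^2 - y^2$ and samples a few small offsets from the origin; the step values are arbitrary.

```python
def f(x, y):
    """Standard saddle surface, assumed here for illustration."""
    return x**2 - y**2


offsets = [-0.2, -0.1, 0.1, 0.2]
along_x = [f(t, 0.0) for t in offsets]  # move along the x-axis
along_y = [f(0.0, t) for t in offsets]  # move along the y-axis

print("f(0, 0) =", f(0.0, 0.0))
print("along x:", along_x)  # all above f(0, 0)
print("along y:", along_y)  # all below f(0, 0)
```

Because nearby values lie both above and below $f(0, 0)$, the origin can be neither a local maximum nor a local minimum.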
Saddle Points in Machine Learning
In high-dimensional optimization (such as training neural networks), saddle points are typically far more common than local minima. Theoretical and empirical work on high-dimensional loss surfaces suggests that most critical points are saddle points, not local minima. In practice, stochastic methods such as SGD with momentum tend to move past saddle points rather than getting stuck at them.
Finding Absolute Extrema on Closed Regions
So far, we've discussed local extrema — points that are maxima or minima in some neighborhood. But many applications require finding absolute (global) extrema — the largest and smallest values on an entire region.
The Extreme Value Theorem
Extreme Value Theorem
If $f$ is continuous on a closed, bounded region $R$ in $\mathbb{R}^2$, then $f$ attains both an absolute maximum and an absolute minimum somewhere on $R$.
Importantly, these absolute extrema can occur either at interior critical points or on the boundary of the region. We must check both!
Method for Finding Absolute Extrema
- Find all interior critical points: Solve $\nabla f = \mathbf{0}$ inside the region
- Find all boundary critical points: Parametrize the boundary and find extrema of $f$ restricted to the boundary (often using Lagrange multipliers or substitution)
- Evaluate $f$ at all candidate points
- Compare all values: the largest is the absolute maximum, the smallest is the absolute minimum
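The four steps above can be sketched for a hypothetical function on a rectangle. Here $f(x, y) = x^2 + y^2 - 2x - 4y$ on $0 \le x \le 3$, $0 \le y \le 3$ is an assumed example, and the boundary is handled by dense sampling rather than exact single-variable calculus, so boundary results are approximate in general.

```python
def f(x, y):
    """Hypothetical objective on the rectangle [0, 3] x [0, 3]."""
    return x**2 + y**2 - 2 * x - 4 * y


X_MAX, Y_MAX = 3.0, 3.0

# Step 1: interior critical point from f_x = 2x - 2 = 0 and f_y = 2y - 4 = 0
candidates = [(1.0, 2.0)]

# Step 2: sample the four boundary edges (an approximation of the boundary search)
n = 300
for i in range(n + 1):
    t = i / n
    candidates += [
        (t * X_MAX, 0.0), (t * X_MAX, Y_MAX),  # bottom and top edges
        (0.0, t * Y_MAX), (X_MAX, t * Y_MAX),  # left and right edges
    ]

# Steps 3 and 4: evaluate f at every candidate and compare
abs_min = min(candidates, key=lambda p: f(*p))
abs_max = max(candidates, key=lambda p: f(*p))
print("absolute min:", abs_min, f(*abs_min))
print("absolute max:", abs_max, f(*abs_max))
```

In this example the minimum sits at the interior critical point while the maximum sits at a corner of the boundary, which is why both kinds of candidates have to be checked.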
When the Region Matters
A critical point that gives a local minimum in the interior might not be the absolute minimum if the boundary has even lower values! Always check the boundary carefully.
Real-World Applications
1. Engineering Design Optimization
Consider designing a rectangular box with maximum volume given a fixed surface area. If the box has dimensions $x$, $y$, and $z$:
- Volume: $V = xyz$
- Surface area constraint: $2(xy + yz + xz) = S$ (constant)
This becomes an optimization problem that we can solve using the techniques of this section (or, more elegantly, using Lagrange multipliers from the next section).
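One way to see the answer numerically is to eliminate one dimension using the constraint and search over the other two. The sketch below assumes a surface area of $S = 6$ and uses a coarse grid search, which is only a sanity check and not a substitute for solving $\nabla V = \mathbf{0}$; all names are hypothetical.

```python
S = 6.0  # assumed fixed surface area


def volume(x, y):
    """Volume after eliminating z via 2(xy + yz + xz) = S."""
    z = (S / 2 - x * y) / (x + y)
    return x * y * z if z > 0 else float("-inf")  # reject infeasible boxes


grid = [i / 20 for i in range(1, 41)]  # 0.05, 0.10, ..., 2.00
best_x, best_y = max(
    ((x, y) for x in grid for y in grid),
    key=lambda p: volume(*p),
)
best_z = (S / 2 - best_x * best_y) / (best_x + best_y)
print(best_x, best_y, best_z, volume(best_x, best_y))
```

The search lands on equal side lengths, consistent with the known result that a cube maximizes volume for a fixed surface area.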
2. Economic Optimization
A company's profit depends on how many units of two products to produce. If $x$ and $y$ are the quantities:
- Profit:
- The $x^2$ and $y^2$ terms model diminishing returns
- The $xy$ term models competition between products
Finding $\nabla P = \mathbf{0}$ gives the optimal production levels.
3. Least Squares Fitting
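When fitting a line $y = mx + b$ to data points $(x_i, y_i)$, we minimize the sum of squared errors:
$$E(m, b) = \sum_{i=1}^{n} \left(y_i - (m x_i + b)\right)^2$$
Setting $\partial E/\partial m = 0$ and $\partial E/\partial b = 0$ gives the optimal slope and intercept. The Hessian confirms this is a minimum.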
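A minimal sketch of this calculation, assuming the line is written as $y = mx + b$ and the error as $E(m, b) = \sum_i (y_i - (m x_i + b))^2$. The data points are invented and lie exactly on $y = 2x + 1$, so the fit should recover that line.

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # hypothetical data on the line y = 2x + 1

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Setting dE/dm = 0 and dE/db = 0 gives the normal equations:
#   m * sxx + b * sx = sxy
#   m * sx  + b * n  = sy
m = (n * sxy - sx * sy) / (n * sxx - sx**2)
b = (sy - m * sx) / n

# Hessian of E(m, b) is [[2*sxx, 2*sx], [2*sx, 2*n]]
D = (2 * sxx) * (2 * n) - (2 * sx) ** 2
print(m, b, D)
```

Here $D > 0$ and $E_{mm} = 2\sum x_i^2 > 0$, so the second derivative test confirms a minimum for this data set.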
Machine Learning Applications
The concepts in this section form the mathematical foundation for training machine learning models. When we "train" a model, we're really finding the minimum of a loss function.
Loss Functions and Optimization Landscapes
A neural network with weights $\mathbf{w}$ makes predictions and incurs a loss $L(\mathbf{w})$ measuring how wrong those predictions are. Training means finding:
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} L(\mathbf{w})$$
This is exactly finding a minimum of a multivariable function!
Gradient Descent
The gradient $\nabla L(\mathbf{w})$ points in the direction of steepest ascent. To minimize, we move in the opposite direction:
$$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla L(\mathbf{w})$$
where $\eta$ is the learning rate. This is called gradient descent.
| Concept | Single Variable | Multivariable (ML) |
|---|---|---|
| Derivative/Gradient | f'(x) | ∇L(w) = (∂L/∂w₁, ..., ∂L/∂wₙ) |
| Critical Point | f'(x) = 0 | ∇L(w) = 0 |
| Second Derivative Test | f''(x) > 0 → min | Hessian positive definite → min |
| Descent Direction | -f'(x) | -∇L(w) (steepest descent) |
| Update Rule | x ← x - η·f'(x) | w ← w - η·∇L(w) |
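The update rule in the last row of the table can be sketched in a few lines. The loss below is a hypothetical two-parameter quadratic with its minimum at $(3, -1)$, and the learning rate and step count are arbitrary choices.

```python
def grad_L(w):
    """Gradient of the hypothetical loss L(w) = (w1 - 3)^2 + 2*(w2 + 1)^2."""
    w1, w2 = w
    return (2 * (w1 - 3), 4 * (w2 + 1))


def gradient_descent(w, eta=0.1, steps=200):
    """Repeatedly apply w <- w - eta * grad L(w)."""
    for _ in range(steps):
        g = grad_L(w)
        w = (w[0] - eta * g[0], w[1] - eta * g[1])
    return w


w_star = gradient_descent((0.0, 0.0))
print(w_star)  # close to the minimizer (3, -1)
```

Real training loops compute the gradient by automatic differentiation over mini-batches, but the update step has the same shape.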
The Challenge of Saddle Points
In high-dimensional neural network training, most critical points are expected to be saddle points rather than local minima. Intuitively, in $n$ dimensions a local minimum certified by the second derivative test requires all $n$ eigenvalues of the Hessian to be positive, which becomes increasingly unlikely as $n$ grows.
Escaping Saddle Points
Modern optimization algorithms like SGD with momentum, Adam, and RMSprop have features that help them move past saddle points:
- Momentum helps roll past flat regions
- Noise in stochastic gradients provides random perturbations
- Adaptive learning rates adjust step sizes per-parameter
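The role of perturbations can be seen on the saddle $f(x, y) = x^2 - y^2$, assumed here as a toy example. Plain gradient descent started exactly on the $x$-axis slides into the saddle and stays there, while a tiny deterministic nudge in $y$ (standing in for stochastic gradient noise) is amplified until the iterate leaves.

```python
def grad(x, y):
    """Gradient of the toy saddle f(x, y) = x^2 - y^2."""
    return 2 * x, -2 * y


def descend(x, y, eta=0.1, steps=100):
    """Plain gradient descent with a fixed learning rate."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y


stuck = descend(1.0, 0.0)     # starts exactly on the x-axis
escaped = descend(1.0, 1e-8)  # tiny nudge in y, standing in for gradient noise

print("no perturbation:  ", stuck)    # y stays exactly 0, x shrinks toward 0
print("with perturbation:", escaped)  # |y| has grown far beyond 1e-8
```

This toy function is unbounded below, so "escaping" here only means moving away from the saddle along a descending direction; in a real loss landscape the iterate would continue toward a lower region.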
The Hessian in Deep Learning
Computing the full Hessian is impractical for large neural networks (it's an $n \times n$ matrix, where $n$ can be in the millions). However, understanding the Hessian's properties helps:
- Condition number: Ratio of largest to smallest eigenvalue affects convergence speed
- Second-order methods (like Newton's method) use Hessian information for faster convergence
- Fisher Information Matrix: An approximation used in natural gradient descent
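To make the condition-number point concrete, the sketch below runs gradient descent on a hypothetical quadratic $f(x, y) = \tfrac{1}{2}(a x^2 + b y^2)$, whose Hessian is diagonal with eigenvalues $a$ and $b$. The step size $1/\max(a, b)$ and the tolerance are assumptions chosen for the illustration.

```python
def steps_to_converge(a, b, tol=1e-6, max_steps=100_000):
    """Count gradient descent steps on f(x, y) = 0.5*(a*x^2 + b*y^2)."""
    eta = 1.0 / max(a, b)  # step size limited by the largest eigenvalue
    x, y = 1.0, 1.0
    for step in range(max_steps):
        if max(abs(x), abs(y)) < tol:
            return step
        # gradient is (a*x, b*y)
        x, y = x - eta * a * x, y - eta * b * y
    return max_steps


well = steps_to_converge(1.0, 1.0)   # condition number 1
ill = steps_to_converge(1.0, 100.0)  # condition number 100
print(well, ill)
```

The ill-conditioned case needs far more steps because the step size that is safe for the steep direction is very small for the shallow one.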
Python Implementation
Finding and Classifying Critical Points
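A minimal sketch of this workflow, assuming a hypothetical function $f(x, y) = x^3 - 3x + y^2$ with hand-coded derivatives. Critical points are located with Newton's method applied to $\nabla f = \mathbf{0}$ and then classified with the second derivative test; the starting points are arbitrary.

```python
def gradient(x, y):
    """Gradient of the hypothetical f(x, y) = x^3 - 3x + y^2."""
    return 3 * x**2 - 3, 2 * y


def hessian(x, y):
    """Second partials (fxx, fxy, fyy) of the same f."""
    return 6 * x, 0.0, 2.0


def newton_critical_point(x, y, iters=50):
    """Newton's method on grad f = 0: subtract H^{-1} @ grad at each iterate."""
    for _ in range(iters):
        gx, gy = gradient(x, y)
        fxx, fxy, fyy = hessian(x, y)
        det = fxx * fyy - fxy**2
        if det == 0:
            raise ZeroDivisionError("singular Hessian; try another start")
        x -= (fyy * gx - fxy * gy) / det
        y -= (fxx * gy - fxy * gx) / det
    return x, y


def classify(x, y):
    """Second derivative test at (x, y)."""
    fxx, fxy, fyy = hessian(x, y)
    D = fxx * fyy - fxy**2
    if D > 0:
        return "local minimum" if fxx > 0 else "local maximum"
    return "saddle point" if D < 0 else "inconclusive"


results = []
for start in [(2.0, 1.0), (-2.0, 1.0)]:
    cp = newton_critical_point(*start)
    results.append((cp, classify(*cp)))
    print(cp, results[-1][1])
```

Newton's method converges to whichever critical point is nearby, including saddles and maxima, so several starting points are generally needed to find them all.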
Gradient Descent in Action
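As one possible demonstration, the sketch below tracks the function value during gradient descent on a hypothetical bowl $f(x, y) = x^2 + y^2$ for two learning rates. The values $0.1$ and $1.1$ are assumptions chosen to show one converging run and one diverging run.

```python
def run(eta, steps=50):
    """Gradient descent on f(x, y) = x^2 + y^2 from (1, 1); returns f per step."""
    x, y = 1.0, 1.0
    history = [x**2 + y**2]
    for _ in range(steps):
        # gradient is (2x, 2y)
        x, y = x - eta * 2 * x, y - eta * 2 * y
        history.append(x**2 + y**2)
    return history


small = run(eta=0.1)  # each step multiplies x and y by 0.8
large = run(eta=1.1)  # each step multiplies x and y by -1.2

print("eta = 0.1:", small[0], "->", small[-1])
print("eta = 1.1:", large[0], "->", large[-1])
```

With the smaller rate the function value shrinks toward the minimum at the origin; with the larger rate every step overshoots and the value grows.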
Hessian Analysis
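When second partials are tedious to derive by hand, they can be estimated with central differences. The sketch below uses a hypothetical polynomial $f(x, y) = x^3 + x y^2 - 3xy$ and an assumed step size $h = 10^{-4}$; it also computes both mixed partials to check the symmetry promised by Clairaut's theorem.

```python
def f(x, y):
    """Hypothetical test function: f(x, y) = x^3 + x*y^2 - 3*x*y."""
    return x**3 + x * y**2 - 3 * x * y


def numerical_hessian(func, x, y, h=1e-4):
    """Central-difference estimates of (fxx, fxy, fyx, fyy) at (x, y)."""
    def fx(u, v):
        return (func(u + h, v) - func(u - h, v)) / (2 * h)

    def fy(u, v):
        return (func(u, v + h) - func(u, v - h)) / (2 * h)

    fxx = (fx(x + h, y) - fx(x - h, y)) / (2 * h)
    fxy = (fx(x, y + h) - fx(x, y - h)) / (2 * h)  # d/dy of f_x
    fyx = (fy(x + h, y) - fy(x - h, y)) / (2 * h)  # d/dx of f_y
    fyy = (fy(x, y + h) - fy(x, y - h)) / (2 * h)
    return fxx, fxy, fyx, fyy


fxx, fxy, fyx, fyy = numerical_hessian(f, 1.0, 2.0)
D = fxx * fyy - fxy * fyx
print(fxx, fxy, fyx, fyy, D)
```

The exact values at $(1, 2)$ are $f_{xx} = 6$, $f_{xy} = f_{yx} = 1$, and $f_{yy} = 2$; the estimates should agree up to small rounding error. Note that $(1, 2)$ is not a critical point of this function, so the code illustrates only the Hessian computation, not a classification.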
Summary
Finding maximum and minimum values of functions of several variables is a fundamental skill in multivariable calculus with profound applications in optimization, machine learning, and science.
Key Concepts
| Concept | Definition/Formula |
|---|---|
| Critical Point | Where ∇f = 0 or ∇f doesn't exist |
| Hessian Matrix | H = [[fxx, fxy], [fxy, fyy]] |
| Discriminant | D = fxx·fyy - (fxy)² |
| Local Minimum | D > 0 and fxx > 0 |
| Local Maximum | D > 0 and fxx < 0 |
| Saddle Point | D < 0 |
| Inconclusive | D = 0 |
| Gradient Descent | w ← w - η·∇L(w) |
Key Takeaways
- Critical points occur where the gradient is zero or undefined — these are the candidates for extrema
- The Hessian matrix captures the curvature of a surface in all directions; its eigenvalues determine the nature of critical points
- Saddle points are critical points that are neither maxima nor minima — they curve up in some directions and down in others
- For absolute extrema on a closed region, check both interior critical points AND boundary values
- Gradient descent uses the update $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla L(\mathbf{w})$ to find minima iteratively; this is how we train neural networks
- In high dimensions, saddle points dominate over local minima, making optimization challenging but not impossible
Coming Next: In the next section, we'll study Lagrange Multipliers, an elegant technique for optimization with constraints. Instead of finding the minimum of $f$ everywhere, we'll find the minimum subject to a constraint like $g(x, y) = c$. This has a beautiful geometric interpretation and wide applications in physics and machine learning.