Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Identify critical points of functions of two variables where $\nabla f = \mathbf{0}$ or the gradient does not exist
Apply the Second Derivative Test using the Hessian matrix to classify critical points as local maxima, local minima, or saddle points
Distinguish between local and absolute (global) extrema
Find absolute extrema on closed, bounded regions by checking both interior critical points and boundary values
Connect these concepts to optimization in machine learning, particularly gradient descent algorithms
Visualize saddle points and understand why they are a unique feature of multivariable optimization

The Big Picture: Why Optimization Matters

"The essence of mathematics is not to make simple things complicated, but to make complicated things simple." — Stan Gudder

Optimization — the art of finding the best possible outcome — is one of the most important applications of calculus. In single-variable calculus, we learned to find where a function reaches its maximum or minimum values. Now, with functions of two or more variables, the geometry becomes richer and the applications become even more powerful.

Consider these real-world optimization problems:

📊 Business & Economics

Maximize profit as a function of price and quantity
Minimize cost given constraints on resources
Optimize portfolio allocation across assets
Find optimal supply chain logistics

🔬 Science & Engineering

Design structures that minimize material while maximizing strength
Find equilibrium states in physical systems
Optimize chemical reaction conditions
Design efficient heat exchangers

🤖 Machine Learning

Train neural networks by minimizing loss functions
Find optimal hyperparameters
Fit models to data (regression, classification)
Optimize reinforcement learning policies

🎯 Everyday Life

Find the shortest path between locations
Maximize signal strength in wireless networks
Optimize nutrition while minimizing cost
Design buildings for maximum natural light

The Central Question

Given a function $f(x, y)$, how do we find the points where $f$ reaches its largest or smallest values? And once we find candidate points, how do we determine which are maxima, which are minima, and which are neither?

Critical Points in Multiple Dimensions

In single-variable calculus, we found that local extrema in the interior of the domain can occur only where $f'(x) = 0$ or where $f'(x)$ doesn't exist. The multivariable version extends this naturally.

Definition: Critical Points

Critical Point

A point $(a, b)$ is a critical point of $f(x, y)$ if:

$\nabla f(a, b) = \mathbf{0}$, meaning both $f_x(a, b) = 0$ and $f_y(a, b) = 0$
OR one or both partial derivatives do not exist at $(a, b)$

Geometrically, at a critical point where the gradient is zero, the tangent plane to the surface $z = f(x, y)$ is horizontal. The surface is "flat" in all directions at that point.

Finding Critical Points: The Process

Compute the partial derivatives $f_x$ and $f_y$
Set both equal to zero and solve the system of equations:
$f_x(x, y) = 0 \quad \text{and} \quad f_y(x, y) = 0$
Each solution $(a, b)$ is a critical point
Also check for points where derivatives don't exist

Example: Finding Critical Points

Find all critical points of $f(x, y) = x^2 + y^2 - 2x - 4y + 5$.

Solution:

Step 1: Compute partial derivatives:

$f_x = 2x - 2$

$f_y = 2y - 4$

Step 2: Solve $\nabla f = \mathbf{0}$:

$2x - 2 = 0 \implies x = 1$

$2y - 4 = 0 \implies y = 2$

Critical point: $(1, 2)$

Value at critical point: $f(1, 2) = 1 + 4 - 2 - 8 + 5 = 0$
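
The two steps above can also be checked symbolically. The following is a minimal sketch using SymPy; the variable names (`solutions`, `critical_point`, `value`) are illustrative choices, not part of the worked example.

```python
import sympy as sp

x, y = sp.symbols("x y", real=True)
f = x**2 + y**2 - 2*x - 4*y + 5

# Step 1: partial derivatives (the components of the gradient)
f_x = sp.diff(f, x)
f_y = sp.diff(f, y)

# Step 2: solve grad f = 0; dict=True returns a list of {symbol: value} solutions
solutions = sp.solve([f_x, f_y], [x, y], dict=True)
critical_point = (solutions[0][x], solutions[0][y])
value = f.subs(solutions[0])

print("critical point:", critical_point)
print("f at critical point:", value)
```

Because the gradient equations are linear here, there is exactly one solution, which matches the point $(1, 2)$ and the value $0$ found by hand.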

The Second Derivative Test

Once we find a critical point, we need to determine whether it's a local maximum, local minimum, or saddle point. Just as in single-variable calculus, we use second derivatives — but now we need all of them, organized into the Hessian matrix.

The Hessian Matrix

Hessian Matrix

For a function $f(x, y)$ with continuous second partial derivatives, the Hessian matrix is:

$H = \begin{bmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{bmatrix}$

By Clairaut's theorem, $f_{xy} = f_{yx}$, so the Hessian is symmetric.

The Hessian captures how the function curves in all directions. Its eigenvalues reveal the principal curvatures:

Both eigenvalues positive: Surface curves upward in all directions (bowl shape) → local minimum
Both eigenvalues negative: Surface curves downward in all directions (inverted bowl) → local maximum
Eigenvalues with opposite signs: Surface curves up in some directions and down in others → saddle point

The Second Derivative Test

Second Derivative Test for Functions of Two Variables

Let $(a, b)$ be a critical point of $f$ with $\nabla f(a, b) = \mathbf{0}$, and suppose the second partial derivatives of $f$ are continuous near $(a, b)$. Define the discriminant:

$D = f_{xx}(a,b) \cdot f_{yy}(a,b) - [f_{xy}(a,b)]^2$

This is the determinant of the Hessian: $D = \det(H)$.

(a) If $D > 0$ and $f_{xx}(a,b) > 0$, then $f$ has a local minimum at $(a, b)$.

(b) If $D > 0$ and $f_{xx}(a,b) < 0$, then $f$ has a local maximum at $(a, b)$.

(c) If $D < 0$, then $f$ has a saddle point at $(a, b)$.

(d) If $D = 0$, the test is inconclusive: the point may be a local minimum, a local maximum, or a saddle point.
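
Case (d) is easiest to appreciate with examples. The sketch below uses three functions chosen purely for illustration (they do not appear elsewhere in this section). Each has a critical point at the origin with $D = 0$, yet the origin is a minimum for the first, a maximum for the second, and a saddle for the third, as can be seen directly from the signs of the fourth powers.

```python
import sympy as sp

x, y = sp.symbols("x y", real=True)

def discriminant_at_origin(f):
    """Return D = f_xx * f_yy - f_xy**2 evaluated at (0, 0)."""
    H = sp.hessian(f, (x, y))          # 2x2 matrix of second partial derivatives
    return H.det().subs({x: 0, y: 0})

# Illustrative functions; the labels describe the true behavior at the origin
examples = {
    "x**4 + y**4    (minimum at origin)": x**4 + y**4,
    "-(x**4 + y**4) (maximum at origin)": -(x**4 + y**4),
    "x**4 - y**4    (saddle at origin)": x**4 - y**4,
}

results = {name: discriminant_at_origin(f) for name, f in examples.items()}
for name, D in results.items():
    print(f"{name}: D = {D}")
```

All three discriminants vanish, so the Second Derivative Test cannot tell these cases apart; higher-order information or a direct argument is needed.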

Why D is the Key

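The discriminant $D = f_{xx} f_{yy} - f_{xy}^2$ equals the product of the Hessian's eigenvalues. When $D > 0$, both eigenvalues have the same sign. In that case $f_{xx} f_{yy} > f_{xy}^2 \geq 0$, so $f_{xx}$ and $f_{yy}$ share a sign, and because the trace $f_{xx} + f_{yy}$ is the sum of the eigenvalues, the sign of $f_{xx}$ alone tells us whether both eigenvalues are positive (minimum) or negative (maximum). When $D < 0$, the eigenvalues have opposite signs, which is the hallmark of a saddle point.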
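
The relationship between $D$ and the eigenvalues can be checked numerically. The sketch below uses a hypothetical Hessian with $f_{xx} = f_{yy} = 2$ and $f_{xy} = 3$ (values picked for illustration). Both pure second derivatives are positive, yet the cross term is large enough to make the point a saddle.

```python
import numpy as np

# Hypothetical Hessian at some critical point: f_xx = 2, f_yy = 2, f_xy = 3
H = np.array([[2.0, 3.0],
              [3.0, 2.0]])

D = H[0, 0] * H[1, 1] - H[0, 1] ** 2   # discriminant from the Second Derivative Test
eigenvalues = np.linalg.eigvalsh(H)    # eigvalsh: symmetric matrices, ascending order

print("D =", D)
print("eigenvalues =", eigenvalues)
print("product of eigenvalues =", np.prod(eigenvalues))
print("trace =", np.trace(H), "| sum of eigenvalues =", eigenvalues.sum())
```

Here $D = -5$ and the eigenvalues are $-1$ and $5$: their product equals $D$, their sum equals the trace, and the opposite signs identify a saddle even though $f_{xx} > 0$.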

Example: Applying the Second Derivative Test

Classify the critical point of $f(x, y) = x^2 + y^2 - 2x - 4y + 5$ at $(1, 2)$.

Solution:

Compute second partial derivatives:

$f_{xx} = 2, \quad f_{yy} = 2, \quad f_{xy} = 0$

Compute the discriminant:

$D = (2)(2) - (0)^2 = 4 > 0$

Since $D > 0$ and $f_{xx} = 2 > 0$, the point $(1, 2)$ is a local minimum.

In fact, completing the square gives $f(x,y) = (x-1)^2 + (y-2)^2$, confirming this is a paraboloid with its minimum at $(1, 2)$.

Saddle Points: A Unique Phenomenon

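Saddle points are one of the most fascinating features of multivariable calculus. A saddle point is a critical point that is neither a local maximum nor a local minimum. The closest single-variable analog is a horizontal inflection point, such as $x = 0$ for $f(x) = x^3$, but a surface that rises in one direction and falls in another requires at least two variables.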

What Makes a Saddle Point Special?

At a saddle point:

The gradient is zero (it's a critical point)
The function increases in some directions from the point
The function decreases in other directions from the point
In the nondegenerate case ($D \neq 0$), the Hessian is indefinite (has both positive and negative eigenvalues)

The name comes from the shape of a horse saddle: it curves upward front-to-back (where the rider sits) but curves downward side-to-side (so the rider's legs hang down).

Example: The Classic Saddle

Consider $f(x, y) = x^2 - y^2$. At the origin:

$f_x = 2x = 0$ and $f_y = -2y = 0$ at $(0, 0)$
$f_{xx} = 2$, $f_{yy} = -2$, $f_{xy} = 0$
$D = (2)(-2) - 0^2 = -4 < 0$ → Saddle point!

Along the $x$-axis, $f(x, 0) = x^2$ looks like a parabola opening upward. Along the $y$-axis, $f(0, y) = -y^2$ looks like a parabola opening downward.
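
The two axes are only the extreme cases. As a rough sketch, the curvature of $f$ along any unit direction $\mathbf{v} = (\cos\theta, \sin\theta)$ is $\mathbf{v}^T H \mathbf{v}$, which for this function works out to $2\cos(2\theta)$. The helper name `directional_curvature` and the sampled angles are illustrative choices.

```python
import numpy as np

# Hessian of f(x, y) = x**2 - y**2 (constant everywhere)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

def directional_curvature(H, theta):
    """Second derivative of f along the unit direction (cos theta, sin theta)."""
    v = np.array([np.cos(theta), np.sin(theta)])
    return v @ H @ v

# Sweep from the x-axis (0 degrees) to the y-axis (90 degrees)
for degrees in (0, 30, 45, 60, 90):
    c = directional_curvature(H, np.radians(degrees))
    print(f"{degrees:3d} deg: curvature = {c:.3f}")
```

The curvature is positive near the $x$-axis, negative near the $y$-axis, and passes through zero along the diagonals $y = \pm x$, where $f$ is identically zero.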

Saddle Points in Machine Learning

In high-dimensional optimization (like training neural networks), saddle points are thought to be far more common than local minima. Theoretical arguments and empirical studies of neural network loss surfaces both suggest that, in high dimensions, most critical points are saddle points rather than local minima. In practice, the gradient noise in SGD and the accumulated velocity of momentum help optimizers move away from saddle points instead of stalling at them.

Finding Absolute Extrema on Closed Regions

So far, we've discussed local extrema — points that are maxima or minima in some neighborhood. But many applications require finding absolute (global) extrema — the largest and smallest values on an entire region.

The Extreme Value Theorem

Extreme Value Theorem

If $f$ is continuous on a closed, bounded region $R$ in $\mathbb{R}^2$, then $f$ attains both an absolute maximum and an absolute minimum somewhere on $R$.

Importantly, these absolute extrema can occur either at interior critical points or on the boundary of the region. We must check both!

Method for Finding Absolute Extrema

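Find all interior critical points: Solve $\nabla f = \mathbf{0}$ inside the region, and include any interior points where the gradient does not exist
Find all boundary candidates: Parametrize each piece of the boundary and find extrema of $f$ restricted to it (often using substitution or Lagrange multipliers); include corners and endpoints where boundary pieces meet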
Evaluate $f$ at all candidate points
Compare all values: the largest is the absolute maximum, the smallest is the absolute minimum
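
The four steps above can be sketched numerically. This example reuses $f(x, y) = x^2 + y^2 - 2x - 4y + 5$ from earlier and assumes, purely for illustration, that the region is the closed disk $x^2 + y^2 \le 9$. The boundary is sampled densely rather than solved exactly, so the boundary values are approximations.

```python
import numpy as np

def f(x, y):
    return x**2 + y**2 - 2*x - 4*y + 5

R = 3.0  # radius of the closed disk x**2 + y**2 <= R**2 (an assumed region)
candidates = []  # list of (value, (x, y)) pairs

# Step 1: interior critical point (1, 2), found earlier by solving grad f = 0
cx, cy = 1.0, 2.0
if cx**2 + cy**2 < R**2:
    candidates.append((f(cx, cy), (cx, cy)))

# Step 2: boundary, parametrized as (R cos t, R sin t) and sampled densely
t = np.linspace(0.0, 2.0 * np.pi, 100_001)
bx, by = R * np.cos(t), R * np.sin(t)
bvals = f(bx, by)
for i in (np.argmin(bvals), np.argmax(bvals)):
    candidates.append((float(bvals[i]), (float(bx[i]), float(by[i]))))

# Steps 3 and 4: evaluate f at every candidate and compare
abs_min = min(candidates, key=lambda c: c[0])
abs_max = max(candidates, key=lambda c: c[0])
print("absolute minimum:", abs_min)
print("absolute maximum:", abs_max)
```

For this region the absolute minimum is the interior critical point, while the absolute maximum lies on the boundary, at the point of the circle farthest from $(1, 2)$. On the boundary $f = 14 - 6\cos t - 12\sin t$, so the exact maximum is $14 + 6\sqrt{5}$.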

When the Region Matters

A critical point that gives a local minimum in the interior might not be the absolute minimum if the boundary has even lower values! Always check the boundary carefully.

Real-World Applications

1. Engineering Design Optimization

Consider designing a rectangular box with maximum volume given a fixed surface area. If the box has dimensions $x$, $y$, and $z$:

Volume: $V = xyz$
Surface area constraint: $2xy + 2xz + 2yz = S$ (constant)

This becomes an optimization problem that we can solve using the techniques of this section (or, more elegantly, using Lagrange multipliers from the next section).
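
One way to apply this section's techniques is to eliminate $z$ with the constraint and look for critical points of the resulting two-variable function. The derivation below is a sketch under the assumption $x, y, z > 0$:

```latex
\begin{aligned}
z &= \frac{S - 2xy}{2(x + y)}, \qquad
V(x, y) = \frac{xy\,(S - 2xy)}{2(x + y)} \\[4pt]
V_x &= \frac{y^2\,(S - 2x^2 - 4xy)}{2(x + y)^2}, \qquad
V_y = \frac{x^2\,(S - 2y^2 - 4xy)}{2(x + y)^2} \\[4pt]
V_x = V_y = 0 &\implies 2x^2 + 4xy = S = 2y^2 + 4xy \implies x = y \\[4pt]
&\implies 6x^2 = S \implies x = y = z = \sqrt{S/6}, \qquad V = (S/6)^{3/2}
\end{aligned}
```

The only critical point with positive dimensions is a cube, which is the box of maximum volume for a given surface area.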

2. Economic Optimization

A company's profit depends on how many units of two products to produce. If $x$ and $y$ are the quantities:

Profit: $P(x, y) = 100x + 80y - x^2 - y^2 - xy$
The $-x^2$ and $-y^2$ terms model diminishing returns
The $-xy$ term models competition between products

Finding $\nabla P = \mathbf{0}$ gives the optimal production levels.
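
As a quick check of this claim, the sketch below solves $\nabla P = \mathbf{0}$ with SymPy and applies the Second Derivative Test. The variable names are illustrative.

```python
import sympy as sp

x, y = sp.symbols("x y", real=True)
P = 100*x + 80*y - x**2 - y**2 - x*y

# Solve grad P = 0 (a linear system for this quadratic profit function)
sol = sp.solve([sp.diff(P, x), sp.diff(P, y)], [x, y], dict=True)[0]

# Second Derivative Test via the Hessian
H = sp.hessian(P, (x, y))
D = H.det()
P_xx = H[0, 0]

print("optimal quantities:", sol)
print("profit at optimum:", P.subs(sol))
print("D =", D, "| P_xx =", P_xx)
```

The solution is $x = 40$, $y = 20$ with profit $2800$. Since $D = 3 > 0$ and $P_{xx} = -2 < 0$, the critical point is a local maximum, and because $P$ is a concave quadratic it is also the global maximum.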

3. Least Squares Fitting

When fitting a line $y = mx + b$ to data points $(x_i, y_i)$, we minimize the sum of squared errors:

$E(m, b) = \sum_{i=1}^{n} (y_i - mx_i - b)^2$

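Setting $\partial E/\partial m = 0$ and $\partial E/\partial b = 0$ gives two linear equations (the normal equations) for the optimal slope and intercept. The Hessian of $E$ is positive definite whenever the $x_i$ are not all equal, which confirms that the solution is a minimum.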
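
The sketch below works through this for a small made-up data set (the four points are invented for illustration). It builds the two linear equations obtained from the partial derivatives, solves them, and checks the Second Derivative Test on the Hessian of $E$.

```python
import numpy as np

# Made-up data points for illustration
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.0, 3.0, 4.0, 7.0])
n = len(xs)

# Normal equations from dE/dm = 0 and dE/db = 0:
#   m * sum(x^2) + b * sum(x) = sum(x * y)
#   m * sum(x)   + b * n      = sum(y)
A = np.array([[np.sum(xs**2), np.sum(xs)],
              [np.sum(xs),    n]])
rhs = np.array([np.sum(xs * ys), np.sum(ys)])
m, b = np.linalg.solve(A, rhs)

# The Hessian of E with respect to (m, b) is 2 * A
hessian = 2 * A
D = np.linalg.det(hessian)

print(f"m = {m:.4f}, b = {b:.4f}")
print(f"D = {D:.1f}, E_mm = {hessian[0, 0]:.1f}")
```

For this data the fit is $m = 1.9$, $b = 0.9$, and both $D$ and $E_{mm}$ are positive, so the critical point is a minimum. The same line can be obtained with `np.polyfit(xs, ys, 1)`.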

Machine Learning Applications

The concepts in this section form the mathematical foundation for training machine learning models. When we "train" a model, we're really finding the minimum of a loss function.

Loss Functions and Optimization Landscapes

A neural network with weights $\mathbf{w}$ makes predictions and incurs a loss $L(\mathbf{w})$ measuring how wrong those predictions are. Training means finding:

$\mathbf{w}^* = \arg\min_{\mathbf{w}} L(\mathbf{w})$

This is exactly finding a minimum of a multivariable function!

Gradient Descent

The gradient $\nabla L$ points in the direction of steepest ascent. To minimize, we move in the opposite direction:

$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t)$

where $\eta$ is the learning rate. This is called gradient descent.

Concept	Single Variable	Multivariable (ML)
Derivative/Gradient	f'(x)	∇L(w) = (∂L/∂w₁, ..., ∂L/∂wₙ)
Critical Point	f'(x) = 0	∇L(w) = 0
Second Derivative Test	f''(x) > 0 → min	Hessian positive definite → min
Descent Direction	-f'(x)	-∇L(w) (steepest descent)
Update Rule	x ← x - η·f'(x)	w ← w - η·∇L(w)

The Challenge of Saddle Points

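In high-dimensional neural network training, most critical points are believed to be saddle points, not local minima. The intuition is that in $n$ dimensions, a nondegenerate local minimum requires all $n$ eigenvalues of the Hessian to be positive. As a rough heuristic, if the sign of each eigenvalue were an independent coin flip, this would happen with probability $2^{-n}$, which becomes vanishingly small as $n$ grows.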

Escaping Saddle Points

Optimizers such as SGD with momentum, Adam, and RMSprop have properties that help them escape saddle points in practice:

Momentum helps roll past flat regions
Noise in stochastic gradients provides random perturbations
Adaptive learning rates adjust step sizes per-parameter
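
The role of perturbations can be illustrated on the classic saddle $f(x, y) = x^2 - y^2$. In the sketch below, plain gradient descent started exactly on the $x$-axis converges to the saddle at the origin, while a tiny offset in $y$ (standing in for stochastic gradient noise) is enough to escape. The helper names, step size, and offset are illustrative assumptions.

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x**2 - y**2."""
    return np.array([2.0 * p[0], -2.0 * p[1]])

def run_gd(start, lr=0.1, steps=100):
    """Plain gradient descent with a fixed learning rate."""
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

stuck = run_gd([1.0, 0.0])      # starts exactly on the x-axis
escaped = run_gd([1.0, 1e-8])   # same start with a tiny perturbation in y

print("no perturbation:  ", stuck)
print("with perturbation:", escaped)
```

Each step multiplies $x$ by $0.8$ and $y$ by $1.2$, so any nonzero $y$ component grows geometrically. This toy function is unbounded below, so "escaping" here means the iterate keeps sliding down along the $y$ direction; on a real loss surface it would instead move toward a lower region.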

The Hessian in Deep Learning

Computing the full Hessian is impractical for neural networks (it's $n \times n$ where $n$ can be millions). However, understanding the Hessian's properties helps:

Condition number: The ratio of the largest to the smallest Hessian eigenvalue at a minimum; a large ratio slows the convergence of gradient descent
Second-order methods (like Newton's method) use Hessian information for faster convergence
Fisher Information Matrix: A curvature matrix used in place of the Hessian in natural gradient descent
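
The effect of the condition number can be seen on a simple quadratic. The sketch below runs gradient descent on $f(\mathbf{w}) = \tfrac{1}{2}\sum_i \lambda_i w_i^2$, whose Hessian is the diagonal matrix of the $\lambda_i$, and counts iterations until the gradient norm falls below a tolerance. The function name, starting point, and step size rule ($1/\lambda_{\max}$) are assumptions made for illustration.

```python
import numpy as np

def iterations_to_converge(eigs, tol=1e-6, max_iters=100_000):
    """Gradient descent on f(w) = 0.5 * sum(eigs * w**2); Hessian is diag(eigs)."""
    eigs = np.asarray(eigs, dtype=float)
    w = np.ones_like(eigs)             # start at (1, 1)
    lr = 1.0 / eigs.max()              # step size 1 / largest eigenvalue
    for k in range(max_iters):
        g = eigs * w                   # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            return k
        w = w - lr * g
    return max_iters

counts = {}
for kappa in (1, 10, 100):
    counts[kappa] = iterations_to_converge([1.0, float(kappa)])
    print(f"condition number {kappa:3d}: {counts[kappa]} iterations")
```

With this step size, the error along the low-curvature direction shrinks by a factor of $1 - 1/\kappa$ per step, so the iteration count grows roughly in proportion to the condition number $\kappa$.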

Python Implementation

Finding and Classifying Critical Points

Symbolic Analysis of Critical Points

critical_points.py

Symbolic Mathematics

SymPy allows exact symbolic computation of derivatives and solving systems of equations, avoiding numerical approximation errors.

Computing Partial Derivatives

sp.diff(f, x) computes ∂f/∂x symbolically. We need both partial derivatives to find where the gradient equals zero.

Solving for Critical Points

sp.solve() finds all points where both ∂f/∂x = 0 and ∂f/∂y = 0 simultaneously. These are the critical points.

Second Partial Derivatives

The Hessian matrix requires fxx, fyy, and fxy. These determine the curvature of the surface at each critical point.

The Discriminant D

D = fxx·fyy - (fxy)² is the determinant of the Hessian. Its sign determines whether the critical point is an extremum or saddle point.

import sympy as sp

# Define real symbolic variables so that solve() discards complex solutions of ∇f = 0
x, y = sp.symbols('x y', real=True)

# Example function: f(x,y) = x³ - 3xy + y³
f = x**3 - 3*x*y + y**3

# Compute partial derivatives
f_x = sp.diff(f, x)  # ∂f/∂x = 3x² - 3y
f_y = sp.diff(f, y)  # ∂f/∂y = -3x + 3y²

print("Finding Critical Points")
print("=" * 50)
print(f"f(x,y) = {f}")
print(f"\n∂f/∂x = {f_x}")
print(f"∂f/∂y = {f_y}")

# Solve the system ∇f = 0 (returns a list of (x, y) tuples)
critical_points = sp.solve([f_x, f_y], [x, y])
print(f"\nCritical points: {critical_points}")

# Compute second partial derivatives
f_xx = sp.diff(f_x, x)  # ∂²f/∂x² = 6x
f_yy = sp.diff(f_y, y)  # ∂²f/∂y² = 6y
f_xy = sp.diff(f_x, y)  # ∂²f/∂x∂y = -3

print(f"\nSecond derivatives:")
print(f"f_xx = {f_xx}")
print(f"f_yy = {f_yy}")
print(f"f_xy = {f_xy}")

# Classify each critical point
print("\nClassification using Second Derivative Test:")
for px, py in critical_points:
    # Guard: float() below would fail on a non-real solution
    if not (px.is_real and py.is_real):
        continue

    # Evaluate second derivatives at the critical point
    fxx_val = float(f_xx.subs([(x, px), (y, py)]))
    fyy_val = float(f_yy.subs([(x, px), (y, py)]))
    fxy_val = float(f_xy.subs([(x, px), (y, py)]))

    # Compute discriminant D = fxx * fyy - (fxy)²
    D = fxx_val * fyy_val - fxy_val**2

    print(f"\nAt ({px}, {py}):")
    print(f"  f_xx = {fxx_val}, f_yy = {fyy_val}, f_xy = {fxy_val}")
    print(f"  D = {D:.4f}")

    if D > 0:
        if fxx_val > 0:
            print("  → Local MINIMUM")
        else:
            print("  → Local MAXIMUM")
    elif D < 0:
        print("  → SADDLE POINT")
    else:
        print("  → Test INCONCLUSIVE (D = 0)")

Gradient Descent in Action

Gradient Descent Optimization

gradient_descent.py

Gradient Descent Algorithm

This is the foundational optimization algorithm used in machine learning. It iteratively moves in the direction of steepest descent (-∇f) to find local minima.

Gradient Computation

The gradient tells us the direction of steepest ascent. We move in the opposite direction to descend toward a minimum.

The Update Rule

x_new = x_old - η·∇f. The learning rate η controls how big each step is. Too large causes overshooting; too small causes slow convergence.

Rosenbrock Function

A famous test function with a curved valley. The global minimum is at (1,1). It's challenging because the valley is narrow and curved.

Analytical Gradient

We compute ∂f/∂x and ∂f/∂y analytically. In neural networks, backpropagation computes these gradients automatically.

import numpy as np

def gradient_descent(f, grad_f, x0, learning_rate=0.1, max_iters=100, tol=1e-6):
    """
    Gradient descent optimizer for finding local minima.

    In machine learning, this is how we train neural networks:
    - f is the loss function (kept for reference; only grad_f is evaluated)
    - grad_f is the gradient (computed via backpropagation)
    - x0 is the initial weights
    - learning_rate controls step size

    Returns the final point and the array of all visited points.
    """
    path = [x0.copy()]
    x = x0.copy()

    for i in range(max_iters):
        # Compute gradient at current point
        grad = grad_f(x)

        # Check for convergence: stop once the gradient is (nearly) zero
        if np.linalg.norm(grad) < tol:
            print(f"Converged at iteration {i}")
            break

        # Update: move opposite to gradient (steepest descent)
        x = x - learning_rate * grad
        path.append(x.copy())

    return x, np.array(path)

# Example: Minimize the Rosenbrock function
# f(x,y) = (1-x)² + 100(y-x²)²
# This is a classic benchmark for optimization algorithms

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    """
    Gradient of the Rosenbrock function:
    ∂f/∂x = -2(1-x) - 400x(y-x²)
    ∂f/∂y = 200(y-x²)
    """
    dfdx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    dfdy = 200 * (x[1] - x[0]**2)
    return np.array([dfdx, dfdy])

# Start from a fixed point away from the minimum
x0 = np.array([-1.5, 2.0])

# Run gradient descent with small learning rate (Rosenbrock is tricky!)
result, path = gradient_descent(rosenbrock, rosenbrock_grad, x0,
                                learning_rate=0.001, max_iters=10000)

print(f"\nRosenbrock Function Optimization")
print("=" * 50)
print(f"Starting point: {x0}")
print(f"Final point: ({result[0]:.6f}, {result[1]:.6f})")
print(f"Known minimum at: (1, 1)")
print(f"Final value: {rosenbrock(result):.10f}")
# path includes the starting point, so the number of updates is len(path) - 1
print(f"Iterations: {len(path) - 1}")

# The gradient at a minimum should be (nearly) zero
print(f"\nGradient at solution: {rosenbrock_grad(result)}")
print(f"Gradient magnitude: {np.linalg.norm(rosenbrock_grad(result)):.6f}")

Hessian Analysis

Second Derivative Test with Hessian Matrix

hessian_analysis.py

Hessian Matrix

The Hessian is the matrix of all second partial derivatives. It captures the curvature of the surface in all directions.

Eigenvalue Analysis

Eigenvalues of the Hessian reveal the principal curvatures. Both positive → minimum. Both negative → maximum. Mixed signs → saddle.

The Determinant Criterion

D = fxx·fyy - (fxy)² is the product of eigenvalues. D > 0 means same-sign eigenvalues. D < 0 means opposite-sign eigenvalues.

import numpy as np

def hessian_classification(f, f_xx, f_yy, f_xy, point):
    """
    Classify a critical point using the Second Derivative Test.

    The Hessian matrix H at a critical point determines the nature:
    H = [[f_xx, f_xy],
         [f_xy, f_yy]]

    - If det(H) > 0 and f_xx > 0: Local minimum
    - If det(H) > 0 and f_xx < 0: Local maximum
    - If det(H) < 0: Saddle point
    - If det(H) = 0: Test inconclusive

    f itself is not evaluated; it is accepted only to document which
    function the second derivatives belong to.
    """
    x, y = point

    # Evaluate second derivatives at the point
    fxx = f_xx(x, y)
    fyy = f_yy(x, y)
    fxy = f_xy(x, y)

    # Build Hessian matrix
    H = np.array([[fxx, fxy],
                  [fxy, fyy]])

    # Compute determinant and eigenvalues
    # (eigvalsh is the routine for symmetric matrices and returns real values)
    D = np.linalg.det(H)
    eigenvalues = np.linalg.eigvalsh(H)

    print(f"\nAnalysis at point ({x}, {y}):")
    print(f"Hessian matrix:")
    print(f"  [{fxx:.3f}  {fxy:.3f}]")
    print(f"  [{fxy:.3f}  {fyy:.3f}]")
    print(f"\nDeterminant D = {D:.4f}")
    print(f"Eigenvalues: {eigenvalues}")
    print(f"Trace (f_xx + f_yy): {fxx + fyy:.3f}")

    if D > 0:
        if fxx > 0:
            print("\n✓ D > 0 and f_xx > 0 → LOCAL MINIMUM")
            print("  Both eigenvalues are positive (positive definite)")
        else:
            print("\n✓ D > 0 and f_xx < 0 → LOCAL MAXIMUM")
            print("  Both eigenvalues are negative (negative definite)")
    elif D < 0:
        print("\n✓ D < 0 → SADDLE POINT")
        print("  Eigenvalues have opposite signs (indefinite)")
    else:
        print("\n? D = 0 → TEST INCONCLUSIVE")
        print("  At least one eigenvalue is zero (degenerate)")

    return D, eigenvalues, H

# Example: Analyze f(x,y) = x² - y² (classic saddle)
print("Example 1: f(x,y) = x² - y² (Saddle surface)")
print("=" * 50)

f = lambda x, y: x**2 - y**2
f_xx = lambda x, y: 2
f_yy = lambda x, y: -2
f_xy = lambda x, y: 0

hessian_classification(f, f_xx, f_yy, f_xy, (0, 0))

# Example: Analyze f(x,y) = x² + y² (Paraboloid)
print("\n" + "=" * 50)
print("Example 2: f(x,y) = x² + y² (Bowl/Paraboloid)")
print("=" * 50)

f = lambda x, y: x**2 + y**2
f_xx = lambda x, y: 2
f_yy = lambda x, y: 2
f_xy = lambda x, y: 0

hessian_classification(f, f_xx, f_yy, f_xy, (0, 0))

Summary

Finding maximum and minimum values of functions of several variables is a fundamental skill in multivariable calculus with profound applications in optimization, machine learning, and science.

Key Concepts

Concept	Definition/Formula
Critical Point	Where ∇f = 0 or ∇f doesn't exist
Hessian Matrix	H = [[fxx, fxy], [fxy, fyy]]
Discriminant	D = fxx·fyy - (fxy)²
Local Minimum	D > 0 and fxx > 0
Local Maximum	D > 0 and fxx < 0
Saddle Point	D < 0
Inconclusive	D = 0
Gradient Descent	w ← w - η·∇L(w)

Key Takeaways

Critical points occur where the gradient is zero or undefined — these are the candidates for extrema
The Hessian matrix captures the curvature of a surface in all directions; its eigenvalues determine the nature of critical points
Saddle points are critical points that are neither maxima nor minima — they curve up in some directions and down in others
For absolute extrema on a closed region, check both interior critical points AND boundary values
Gradient descent uses $-\nabla f$ to find minima iteratively — this is how we train neural networks
In high dimensions, saddle points dominate over local minima, making optimization challenging but not impossible

The Essence of Optimization:

"At a critical point, the gradient vanishes. The Hessian reveals whether we've found a peak, a valley, or a mountain pass — information that guides every optimization algorithm from gradient descent to neural network training."

Coming Next: In the next section, we'll study Lagrange Multipliers, an elegant technique for optimization with constraints. Instead of finding the minimum of $f(x, y)$ everywhere, we'll find the minimum subject to a constraint like $g(x, y) = c$. This has a beautiful geometric interpretation and wide applications in physics and machine learning.