Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Compute directional derivatives to find the rate of change of a multivariable function in any direction, not just along coordinate axes
Understand and calculate the gradient vector $\nabla f$, which captures all partial derivative information and points toward steepest ascent
Apply the gradient-directional derivative formula $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$ to efficiently compute rates of change in arbitrary directions
Visualize gradient fields and understand their relationship to level curves and surface geometry
Connect these concepts to gradient descent, the fundamental optimization algorithm powering modern machine learning

Why This Matters: The gradient is arguably the most important concept in multivariable calculus for applications. It answers the question "which way is up?" at any point on a surface. This simple idea underlies GPS positioning (iteratively solving for the receiver's location), neural network training (minimizing loss functions), physical simulations (heat flow, fluid dynamics), and countless optimization problems across science and engineering.

The Big Picture

In single-variable calculus, the derivative $f'(x)$ tells us the rate of change of $f$ as we move along the number line. There is only one line to move along, so the only choice is left or right.

But for a function of two variables $f(x, y)$, we have infinitely many directions to explore from any point. We could move along the x-axis, the y-axis, or at any angle in between. Each direction gives a potentially different rate of change.

The Central Questions:

How fast does $f$ change if we move in a specific direction?
Which direction gives the maximum rate of increase?
How can we find this direction efficiently?

The directional derivative answers the first question, and the gradient vector elegantly answers the second and third. Together, they provide a complete picture of how a function changes in all directions at once.

Historical Context

The concept of the gradient emerged from the work of several 19th-century mathematicians who sought to generalize calculus to functions of multiple variables.

Augustin-Louis Cauchy (1789-1857) developed rigorous foundations for multivariable calculus and used gradient-like concepts in optimization problems
William Rowan Hamilton (1805-1865) introduced the operator $\nabla$ while developing his theory of quaternions; the symbol was later named "nabla" after the Greek word for a harp of similar shape
Peter Guthrie Tait (1831-1901) and James Clerk Maxwell (1831-1879) extensively used the gradient in their formulation of electromagnetism, where it describes how potential energy varies in space

Maxwell wrote: "The gradient of a scalar field gives both the magnitude and direction of its most rapid increase." This physical intuition—imagining a scalar field like temperature or pressure, and asking which way it increases fastest—remains the best way to understand the gradient.

The Directional Derivative

Definition and Formula

Given a function $f(x, y)$ and a unit vector $\mathbf{u} = (u_1, u_2)$ with $\|\mathbf{u}\| = 1$, the directional derivative of $f$ at point $(a, b)$ in the direction $\mathbf{u}$ is:

$D_{\mathbf{u}} f(a, b) = \lim_{h \to 0} \frac{f(a + hu_1, b + hu_2) - f(a, b)}{h}$

This measures how fast $f$ changes as we move from $(a, b)$ in the direction $\mathbf{u}$.
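To make the limit concrete, here is a minimal sketch that evaluates the difference quotient for a shrinking step $h$. The function `f`, the point $(1, 1)$, and the 45° direction are illustrative choices, not part of the definition. For this particular `f`, expanding the quotient by hand gives $3\sqrt{2} + 1.5h$, so the printed values should approach $3\sqrt{2} \approx 4.2426$.

```python
import math

def f(x, y):
    # Illustrative function (an assumption for this sketch): f(x, y) = x^2 + 2y^2
    return x**2 + 2 * y**2

def difference_quotient(f, a, b, u1, u2, h):
    """Difference quotient from the limit definition, for one fixed step h."""
    return (f(a + h * u1, b + h * u2) - f(a, b)) / h

a, b = 1.0, 1.0
u1, u2 = 1 / math.sqrt(2), 1 / math.sqrt(2)  # unit vector at 45 degrees

steps = (1e-1, 1e-2, 1e-3, 1e-4)
quotients = [difference_quotient(f, a, b, u1, u2, h) for h in steps]

# Value the quotient approaches as h -> 0, from the hand expansion in the lead-in
limit_value = 3 * math.sqrt(2)

for h, q in zip(steps, quotients):
    print(f"h = {h:<7g} quotient = {q:.6f}  error = {abs(q - limit_value):.2e}")
```

The error shrinks roughly in proportion to $h$, which is what a one-sided difference quotient is expected to do.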

Special Cases: When $\mathbf{u} = \mathbf{i} = (1, 0)$, the directional derivative equals the partial derivative $\frac{\partial f}{\partial x}$. Similarly, $\mathbf{u} = \mathbf{j} = (0, 1)$ gives $\frac{\partial f}{\partial y}$. Partial derivatives are just directional derivatives along coordinate axes!

The remarkable fact is that if $f$ is differentiable, we can compute $D_{\mathbf{u}} f$ for any direction using just the partial derivatives:

$D_{\mathbf{u}} f(a, b) = f_x(a, b) \cdot u_1 + f_y(a, b) \cdot u_2 = \nabla f \cdot \mathbf{u}$
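One way to see where this formula comes from is the chain rule. The sketch below assumes $f$ is differentiable at $(a, b)$; the helper function $g$ is introduced only for this derivation.

```latex
\begin{aligned}
g(h) &= f(a + h u_1,\; b + h u_2) \\
D_{\mathbf{u}} f(a, b) &= \lim_{h \to 0} \frac{g(h) - g(0)}{h} = g'(0) \\
g'(0) &= f_x(a, b)\,\frac{d}{dh}(a + h u_1) + f_y(a, b)\,\frac{d}{dh}(b + h u_2) \\
      &= f_x(a, b)\,u_1 + f_y(a, b)\,u_2 \\
      &= \nabla f(a, b) \cdot \mathbf{u}
\end{aligned}
```

In other words, the limit definition restricts $f$ to a line through $(a, b)$, and the chain rule turns the derivative along that line into a dot product.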

Interactive: Directional Derivative Explorer

Explore how the directional derivative changes as you vary the direction. Notice how it equals the maximum value when your direction aligns with the gradient (orange arrow) and zero when perpendicular to it:


The Gradient Vector

Definition of the Gradient

The gradient of a function $f(x, y)$ is the vector of its partial derivatives:

$\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) = f_x \mathbf{i} + f_y \mathbf{j}$

The symbol $\nabla$ (nabla or del) is a vector operator. For functions of three variables:

$\nabla f(x, y, z) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z} \right)$

Key Properties of the Gradient

The gradient has several remarkable properties that make it central to calculus:

Property	Statement	Meaning
Direction of Steepest Ascent	∇f points in the direction of maximum increase	Follow ∇f to climb the surface fastest
Maximum Rate of Change	‖∇f‖ equals the maximum directional derivative	The gradient magnitude tells you the steepest slope
Perpendicular to Level Curves	∇f ⟂ level curves f(x,y) = c	Level curves are always crossed at right angles by ∇f
Directional Derivative Formula	D_u f = ∇f · u	Project the gradient onto any direction to get that rate of change

The Gradient Points Toward Steepest Ascent

Among all possible directions, moving in the direction of $\nabla f$ increases $f$ at the maximum possible rate (provided $\nabla f \neq \mathbf{0}$). The rate of increase in this direction is $\|\nabla f\|$, the magnitude of the gradient.

Conversely, moving in the direction of $-\nabla f$ decreases $f$ at the maximum rate. This is the basis of gradient descent.
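Both claims follow from the geometric form of the dot product. As a short derivation, let $\theta$ be the angle between $\nabla f$ and the unit vector $\mathbf{u}$, and assume $\nabla f \neq \mathbf{0}$:

```latex
\begin{aligned}
D_{\mathbf{u}} f &= \nabla f \cdot \mathbf{u}
   = \|\nabla f\| \, \|\mathbf{u}\| \cos\theta
   = \|\nabla f\| \cos\theta
   && (\|\mathbf{u}\| = 1) \\
\theta = 0 &: \quad D_{\mathbf{u}} f = \|\nabla f\|
   && \text{(maximum, } \mathbf{u} \text{ along } \nabla f\text{)} \\
\theta = \tfrac{\pi}{2} &: \quad D_{\mathbf{u}} f = 0
   && \text{(no change, } \mathbf{u} \perp \nabla f\text{)} \\
\theta = \pi &: \quad D_{\mathbf{u}} f = -\|\nabla f\|
   && \text{(minimum, } \mathbf{u} \text{ along } -\nabla f\text{)}
\end{aligned}
```

Since $-1 \le \cos\theta \le 1$, no direction can do better than $\|\nabla f\|$ or worse than $-\|\nabla f\|$.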

Interactive: Gradient Field Visualizer

Explore the gradient vector field on different surfaces. The arrows show the direction of steepest ascent at each point. Notice how their color and length indicate the magnitude of the gradient (rate of steepest climb):


What to Observe:

On the paraboloid, all gradients point outward from the minimum at the origin
On the saddle, gradients point in different directions depending on location
Gradient arrows are longer where the surface is steeper (higher rate of change)
Near flat regions (minima, maxima, or saddle points), arrows become very short

The Gradient and Level Curves

One of the most beautiful facts about the gradient is its relationship to level curves (contour lines). A level curve of $f(x, y)$ is the set of points where $f(x, y) = c$ for some constant $c$.

Theorem: Wherever $\nabla f \neq \mathbf{0}$, the gradient $\nabla f$ is perpendicular to the level curve of $f$ through that point.

Why? If you move along a level curve, $f$ doesn't change, so the directional derivative along the level curve is zero. Since $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u} = 0$, the direction $\mathbf{u}$ tangent to the level curve must be perpendicular to $\nabla f$.
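The perpendicularity can be checked numerically. The sketch below uses an illustrative function, $f(x, y) = x^2 + 2y^2$, whose level curve $f = 4$ is an ellipse. The parametrization and the sample points are assumptions made for this example.

```python
import math

def grad_f(x, y):
    # Gradient of the illustrative function f(x, y) = x^2 + 2y^2
    return (2 * x, 4 * y)

c = 4.0  # level value: the curve x^2 + 2y^2 = 4 is an ellipse
dots = []
for k in range(8):
    t = 2 * math.pi * k / 8
    # Point on the level curve: x = sqrt(c) cos t, y = sqrt(c/2) sin t
    x = math.sqrt(c) * math.cos(t)
    y = math.sqrt(c / 2) * math.sin(t)
    # Tangent vector: derivative of the parametrization with respect to t
    tx = -math.sqrt(c) * math.sin(t)
    ty = math.sqrt(c / 2) * math.cos(t)
    gx, gy = grad_f(x, y)
    dots.append(gx * tx + gy * ty)  # gradient . tangent

print("largest |gradient . tangent|:", max(abs(d) for d in dots))
```

Every dot product comes out as zero up to floating-point rounding, as the theorem predicts.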

Hiking Analogy: Imagine hiking on a mountain where level curves represent constant elevation (like contour lines on a topographic map). The gradient at any point is the direction you'd walk to climb most steeply uphill. This direction is always perpendicular to the elevation contour you're standing on.

Computing Directional Derivatives

Example 1: Find the directional derivative of $f(x, y) = x^2 y + y^3$ at the point $(2, 1)$ in the direction toward the origin.

Solution:

Step 1: Compute the partial derivatives and evaluate them at $(2, 1)$: $f_x = 2xy$, so $f_x(2, 1) = 2(2)(1) = 4$, and $f_y = x^2 + 3y^2$, so $f_y(2, 1) = 4 + 3 = 7$.

Step 2: The gradient at $(2, 1)$ is $\nabla f(2, 1) = (4, 7)$.

Step 3: The direction from $(2, 1)$ to the origin is $(0, 0) - (2, 1) = (-2, -1)$. The unit vector is $\mathbf{u} = \frac{(-2, -1)}{\sqrt{5}} = \left(-\frac{2}{\sqrt{5}}, -\frac{1}{\sqrt{5}}\right)$.

Step 4: Compute the directional derivative: $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u} = (4, 7) \cdot \left(-\frac{2}{\sqrt{5}}, -\frac{1}{\sqrt{5}}\right) = \frac{-8 - 7}{\sqrt{5}} = -\frac{15}{\sqrt{5}} = -3\sqrt{5} \approx -6.71$.

The negative value means $f$ is decreasing as we move toward the origin.
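As a sanity check on Example 1, the following sketch computes the same quantity two ways: with the gradient formula and with a central difference along $\mathbf{u}$. The step size `h` is an arbitrary small value chosen for illustration.

```python
import math

def f(x, y):
    return x**2 * y + y**3

def grad_f(x, y):
    # Partial derivatives from Step 1: f_x = 2xy, f_y = x^2 + 3y^2
    return (2 * x * y, x**2 + 3 * y**2)

a, b = 2.0, 1.0
dx, dy = 0.0 - a, 0.0 - b        # direction from (2, 1) toward the origin
norm = math.hypot(dx, dy)        # sqrt(5)
u = (dx / norm, dy / norm)       # unit vector

# Gradient formula: D_u f = grad f . u
gx, gy = grad_f(a, b)
d_formula = gx * u[0] + gy * u[1]

# Central difference along u as an independent numerical estimate
h = 1e-6
d_numeric = (f(a + h * u[0], b + h * u[1]) - f(a - h * u[0], b - h * u[1])) / (2 * h)

print(f"gradient formula:   {d_formula:.6f}")
print(f"central difference: {d_numeric:.6f}")
```

Both values agree with $-15/\sqrt{5} \approx -6.71$ from Step 4.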

Example 2: At what rate does $f(x, y, z) = x^2 + 2y^2 + 3z^2$ increase most rapidly at $(1, 1, 1)$? In what direction?

Solution: The gradient is $\nabla f = (2x, 4y, 6z)$. At $(1, 1, 1)$: $\nabla f(1, 1, 1) = (2, 4, 6)$.

The maximum rate of increase is $\|\nabla f\| = \sqrt{4 + 16 + 36} = \sqrt{56} \approx 7.48$, occurring in the direction $\frac{(2, 4, 6)}{\sqrt{56}} = \left(\frac{1}{\sqrt{14}}, \frac{2}{\sqrt{14}}, \frac{3}{\sqrt{14}}\right)$.
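A brute-force experiment illustrates why this is the maximum. The sketch below samples random unit directions and confirms that none of them yields a rate larger than $\|\nabla f\|$. The sample count and the random seed are arbitrary choices.

```python
import math
import random

grad = (2.0, 4.0, 6.0)  # gradient of x^2 + 2y^2 + 3z^2 at (1, 1, 1)
grad_norm = math.sqrt(sum(g * g for g in grad))

random.seed(0)
best = -math.inf
for _ in range(10_000):
    # Normalizing a Gaussian vector gives a random direction on the unit sphere
    v = [random.gauss(0, 1) for _ in range(3)]
    length = math.sqrt(sum(c * c for c in v))
    u = [c / length for c in v]
    rate = sum(g * c for g, c in zip(grad, u))  # D_u f = grad f . u
    best = max(best, rate)

# Rate along the normalized gradient itself
u_star = [g / grad_norm for g in grad]
rate_star = sum(g * c for g, c in zip(grad, u_star))

print(f"|grad f|                  = {grad_norm:.4f}")
print(f"best sampled rate         = {best:.4f}")
print(f"rate along grad f/|grad f| = {rate_star:.4f}")
```

The best sampled rate approaches $\sqrt{56}$ from below, while the gradient direction attains it.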

Real-World Applications

Physics: Temperature and Electric Fields

Temperature Gradient: If $T(x, y, z)$ represents temperature at position $(x, y, z)$, then $\nabla T$ points in the direction of fastest temperature increase. Heat flows in the opposite direction, from hot to cold: Fourier's law of heat conduction states that the heat flux is proportional to $-\nabla T$.

Electric Field: If $V(x, y, z)$ is the electric potential (voltage), the electric field is $\mathbf{E} = -\nabla V$. Positive charges "roll downhill" in the potential landscape.
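As a small illustration of $\mathbf{E} = -\nabla V$, the sketch below takes the potential of a point charge at the origin in units where the prefactor $kq$ equals 1 (an assumption made to keep the numbers simple), so that $V = 1/r$ and the exact field is $\mathbf{r}/r^3$. The function names and the step size are hypothetical choices for this example.

```python
import math

def V(x, y, z):
    # Illustrative point-charge potential in units where k*q = 1: V = 1/r
    return 1.0 / math.sqrt(x * x + y * y + z * z)

def electric_field(V, p, h=1e-5):
    """E = -grad V, each partial derivative estimated by a central difference."""
    E = []
    for i in range(3):
        plus = list(p)
        minus = list(p)
        plus[i] += h
        minus[i] -= h
        E.append(-(V(*plus) - V(*minus)) / (2 * h))
    return E

p = (1.0, 2.0, 2.0)                      # distance from the origin is r = 3
E = electric_field(V, p)

r = math.sqrt(sum(c * c for c in p))
E_exact = [c / r**3 for c in p]          # closed form: r_vec / r^3

print("numerical E:", [round(c, 6) for c in E])
print("exact E:    ", [round(c, 6) for c in E_exact])
```

The field points away from the charge, which is the direction in which the potential falls off fastest.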

Engineering: Optimization and Design

Engineers use gradients to optimize designs. If $C(x_1, ..., x_n)$ represents cost as a function of design parameters, then $-\nabla C$ points in the direction of greatest cost reduction. Iteratively following this direction leads toward a locally optimal (minimum cost) design.

Machine Learning: Gradient Descent

Perhaps the most important modern application of gradients is in machine learning. Models ranging from simple linear regression to large language models like GPT are routinely trained with gradient descent or one of its variants.

The idea is beautifully simple: we have a loss function $L(\mathbf{w})$ that measures how wrong our model's predictions are, where $\mathbf{w}$ represents all the model's parameters (weights). We want to find the weights that minimize the loss.

The Gradient Descent Update Rule:

$\mathbf{w}_{\text{new}} = \mathbf{w}_{\text{old}} - \alpha \nabla L(\mathbf{w}_{\text{old}})$

where $\alpha$ is the learning rate (step size). We move opposite to the gradient because we want to descend (minimize), not ascend.
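A single update worked by hand shows the rule in action. The loss $L(\mathbf{w}) = w_1^2 + 2w_2^2$, the starting point $(3, 3)$, and $\alpha = 0.1$ are illustrative choices:

```latex
\begin{aligned}
\nabla L(\mathbf{w}) &= (2w_1,\; 4w_2) \\
\nabla L(3, 3) &= (6,\; 12) \\
\mathbf{w}_{\text{new}} &= (3, 3) - 0.1\,(6, 12) = (2.4,\; 1.8) \\
L(3, 3) &= 9 + 18 = 27 \\
L(2.4, 1.8) &= 5.76 + 6.48 = 12.24
\end{aligned}
```

One step cuts the loss from 27 to 12.24. Repeating the update keeps shrinking it toward the minimum at the origin.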

The Learning Rate Matters!

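Too small: Learning is slow, and the run may stall on a plateau or in a poor local minimum before the iteration budget runs out
Too large: May overshoot the minimum and diverge
Just right: Converges efficiently to a good minimum

Optimizers like Adam adapt the step size for each parameter, while SGD with momentum keeps a single learning rate but accumulates a velocity from past gradients to smooth and speed up the updates.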
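The slow, fast, and divergent regimes are easy to reproduce on a one-dimensional toy loss. The sketch below runs plain gradient descent on $L(w) = w^2$ with three hand-picked learning rates. The loss, the rates, and the step count are assumptions for illustration, and a convex loss like this one cannot show the local-minimum issue.

```python
def run_gradient_descent(alpha, w0=1.0, steps=20):
    """Plain gradient descent on L(w) = w^2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w   # each step multiplies w by (1 - 2*alpha)
    return w

results = {alpha: run_gradient_descent(alpha) for alpha in (0.01, 0.4, 1.1)}

for alpha, w in results.items():
    print(f"alpha = {alpha:<5} final w = {w:.3e}")
```

With $\alpha = 0.01$ the iterate has barely moved after 20 steps, with $\alpha = 0.4$ it is essentially at the minimum, and with $\alpha = 1.1$ the factor $1 - 2\alpha = -1.2$ has magnitude above 1, so the iterate grows instead of shrinking.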

Interactive: Gradient Descent Demo

Watch gradient descent navigate different loss landscapes. Try the Rosenbrock function to see how challenging optimization can be, or the saddle function to see a case where gradient descent can get stuck:
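The saddle behaviour can also be reproduced in a few lines. The sketch below assumes the saddle $f(x, y) = x^2 - y^2$ (the demo's exact function is not shown here) and compares a start exactly on the x-axis with one nudged slightly off it. The learning rate and step count are arbitrary.

```python
def descend_on_saddle(x, y, alpha=0.1, steps=50):
    """Gradient descent on f(x, y) = x^2 - y^2, whose gradient is (2x, -2y)."""
    for _ in range(steps):
        x, y = x - alpha * 2 * x, y - alpha * (-2 * y)
    return x, y

# Start exactly on the x-axis: the y-component of the gradient is always zero
stuck = descend_on_saddle(1.0, 0.0)

# Start slightly off the axis: the small y-offset grows by a factor 1.2 per step
escaped = descend_on_saddle(1.0, 1e-3)

print("start (1, 0):     ", stuck)
print("start (1, 0.001): ", escaped)
```

From $(1, 0)$ the iterates slide into the saddle point at the origin and stay there, because the gradient vanishes. The tiny offset in the second run is enough for the iterates to move away along the y-axis, which is why noise often helps optimizers leave saddle points in practice.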


Python Implementation

Here's a complete Python implementation showing numerical gradient computation, directional derivatives, and gradient descent optimization:

Gradient Computation and Gradient Descent

The listing below is organized into seven parts:

Imports: Import NumPy for numerical operations and matplotlib for visualization.

Numerical Gradient Function: Computes the gradient numerically using central differences: ∂f/∂xᵢ ≈ (f(x + heᵢ) - f(x - heᵢ)) / (2h). This finite-difference approach is a standard way to check gradients. Automatic differentiation, by contrast, applies the chain rule exactly rather than approximating.

Directional Derivative Function: Computes the directional derivative D_u f = ∇f · u. The formula shows that the directional derivative is the component of the gradient in the direction u.

Gradient Descent Algorithm: Implements gradient descent: x_new = x_old - α∇f. By moving in the negative gradient direction, we descend toward the minimum of the loss landscape.

Example Function Definition: Example function f(x,y) = x² + 2y² with its minimum at the origin. The analytical gradient ∇f = (2x, 4y) allows us to verify our numerical computation.

Gradient Computation Demo: Demonstrates computing the gradient both numerically and analytically, then computing the directional derivative in a 45° direction.

Gradient Descent Demo: Runs gradient descent from (3, 3) to find the minimum at (0, 0). The algorithm follows the path of steepest descent at each step.

```python
import numpy as np
import matplotlib.pyplot as plt  # for optional plotting of `history`; not used in this listing


def numerical_gradient(f, x, h=1e-5):
    """
    Compute the gradient of f at point x numerically.

    Uses central finite differences. This is numerical differentiation,
    not automatic differentiation: the result is an approximation whose
    accuracy depends on the step h.

    Args:
        f: Function f(x) where x is a numpy array
        x: Point at which to compute the gradient
        h: Small step for finite difference

    Returns:
        grad: Gradient vector (same shape as x)
    """
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)

    for i in range(len(x)):
        # Step of size h along the i-th coordinate axis
        step = np.zeros_like(x)
        step[i] = h

        # Central difference formula for partial derivative
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)

    return grad


def directional_derivative(f, x, u, h=1e-5):
    """
    Compute the directional derivative of f at x in direction u.

    D_u f = ∇f · u (gradient dot direction)

    Args:
        f: Scalar function
        x: Point of evaluation
        u: Direction vector (will be normalized)
        h: Step size for numerical differentiation

    Returns:
        D_u_f_formula: Rate of change from the gradient formula ∇f · u
        D_u_f_limit: Rate of change from a difference quotient along u
    """
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)

    # Ensure u is a unit vector
    norm = np.linalg.norm(u)
    if norm == 0:
        raise ValueError("direction u must be non-zero")
    u_normalized = u / norm

    # Method 1: Use gradient formula
    grad = numerical_gradient(f, x, h)
    D_u_f_formula = np.dot(grad, u_normalized)

    # Method 2: Difference quotient along u (central form of the limit definition)
    D_u_f_limit = (f(x + h * u_normalized) - f(x - h * u_normalized)) / (2 * h)

    return D_u_f_formula, D_u_f_limit


def gradient_descent(f, x0, learning_rate=0.1, max_iters=100, tol=1e-6):
    """
    Minimize f using gradient descent.

    The gradient points toward steepest ascent,
    so we move in the NEGATIVE gradient direction.

    Args:
        f: Function to minimize
        x0: Starting point
        learning_rate: Step size (alpha)
        max_iters: Maximum iterations
        tol: Convergence tolerance on the gradient norm

    Returns:
        x: Final point (approximate minimum)
        history: List of all points visited, including the start
    """
    x = np.asarray(x0, dtype=float).copy()
    history = [x.copy()]

    for i in range(max_iters):
        grad = numerical_gradient(f, x)
        grad_norm = np.linalg.norm(grad)

        # Check convergence: stop once the gradient is nearly zero
        if grad_norm < tol:
            print(f"Converged at iteration {i}")
            break

        # Update rule: move opposite to gradient
        x = x - learning_rate * grad
        history.append(x.copy())

    return x, history


# Example: f(x,y) = x² + 2y² (elliptical paraboloid)
def f(p):
    x, y = p
    return x**2 + 2*y**2


# Analytical gradient: ∇f = (2x, 4y)
def grad_f_analytical(p):
    x, y = p
    return np.array([2*x, 4*y])


# Test at point (1, 1)
point = np.array([1.0, 1.0])
direction = np.array([1.0, 1.0])  # 45° direction

# Compute gradient
grad_numerical = numerical_gradient(f, point)
grad_analytical = grad_f_analytical(point)

print(f"Point: {point}")
print(f"Numerical gradient: {grad_numerical}")
print(f"Analytical gradient: {grad_analytical}")
print(f"Gradient magnitude: {np.linalg.norm(grad_numerical):.4f}")

# Compute directional derivative (both methods should agree)
D_u_f, D_u_f_limit = directional_derivative(f, point, direction)
print(f"\nDirection: {direction / np.linalg.norm(direction)}")
print(f"Directional derivative (gradient formula): {D_u_f:.4f}")
print(f"Directional derivative (difference quotient): {D_u_f_limit:.4f}")

# Run gradient descent from (3, 3)
x0 = np.array([3.0, 3.0])
x_min, history = gradient_descent(f, x0, learning_rate=0.1)
print("\nGradient Descent:")
print(f"Starting point: {x0}")
print(f"Final point: {x_min}")
print(f"Function value at minimum: {f(x_min):.6f}")
print(f"Iterations: {len(history) - 1}")  # history also holds the starting point
```


Summary

In this section, we explored two fundamental concepts that bridge single-variable calculus to multivariable optimization:

Directional Derivative $D_{\mathbf{u}} f$: The rate of change of $f$ in direction $\mathbf{u}$. Computed as $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$.
Gradient Vector $\nabla f$: The vector of partial derivatives that points in the direction of steepest ascent with magnitude equal to the maximum rate of change.
Key Relationship: The gradient is perpendicular to level curves, and its magnitude gives the maximum directional derivative.
Gradient Descent: Moving in the direction $-\nabla L$ minimizes loss functions, forming the backbone of machine learning optimization.

Looking Ahead: In the next section, we'll use the gradient to find maximum and minimum values of multivariable functions, extending the critical point analysis from single-variable calculus. We'll see that points where $\nabla f = \mathbf{0}$ are candidates for extrema, and we'll develop the second derivative test using the Hessian matrix to classify them.