Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Compute second, third, and higher-order derivatives of functions using differentiation rules
Interpret the physical meaning of higher derivatives in terms of motion (velocity, acceleration, jerk)
Use multiple notation systems (Leibniz, Lagrange, Newton) for higher-order derivatives
Connect the second derivative to concavity and curvature of curves
Understand how higher-order derivatives appear in Taylor series approximations
Apply second-order derivatives (the Hessian) in machine learning optimization
Recognize patterns in derivatives of common functions (polynomials, exponentials, trig)

The Big Picture: Derivatives of Derivatives

"The first derivative tells us how a quantity changes. The second tells us how that change is changing. And so on, layer by layer, into the infinite depths of change."

We've learned that the derivative $f'(x)$ measures the instantaneous rate of change of a function. But $f'(x)$ is itself a function — so we can ask: how fast is the rate of change changing?

This leads us to the second derivative $f''(x)$, which is simply the derivative of the derivative. And we can continue: the third derivative $f'''(x)$, fourth derivative $f^{(4)}(x)$, and so on.

The Chain of Derivatives

Original f(x) → 1st derivative f'(x) → 2nd derivative f''(x) → 3rd derivative f'''(x) → ...

Why Higher-Order Derivatives Matter

Higher-order derivatives appear throughout mathematics and its applications:

Physics: Acceleration is the second derivative of position; jerk (important for passenger comfort) is the third
Curve Analysis: The second derivative tells us about concavity and inflection points
Taylor Series: Higher derivatives determine how well polynomials approximate functions
Differential Equations: Many physical laws involve second derivatives (F = ma, wave equation, heat equation)
Machine Learning: The Hessian (matrix of second derivatives) is crucial for optimization algorithms like Newton's method

Historical Context

The concept of higher-order derivatives emerged naturally from the work of Newton and Leibniz in the 17th century. Newton, in his work on mechanics, recognized that acceleration (what we now call the second derivative of position) was the key quantity in his laws of motion.

Leibniz's notation $\frac{d^2y}{dx^2}$ made the concept of repeated differentiation intuitive: it suggests "differentiating twice with respect to x." This notation proved especially powerful for working with differential equations.

Brook Taylor (1685-1731) showed how the derivatives of a function at a single point encode the function's behavior nearby (completely so for analytic functions), a result that became the famous Taylor series. This discovery revealed that higher-order derivatives are not just an abstract concept but carry deep geometric and analytical meaning.

The Second Derivative

The second derivative of $f$ is the derivative of $f'$:

$$f''(x) = \frac{d}{dx}[f'(x)] = \frac{d}{dx}\left[\frac{df}{dx}\right] = \frac{d^2f}{dx^2}$$

Example 1: Polynomial

Find all derivatives of $f(x) = x^4$:

$f(x) = x^4$
$f'(x) = 4x^3$
$f''(x) = 12x^2$
$f'''(x) = 24x$
$f^{(4)}(x) = 24$
$f^{(5)}(x) = 0$ (and all higher derivatives are 0)

For a polynomial of degree n, the (n+1)th derivative and beyond are all zero.
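The termination rule can be checked mechanically. The sketch below is one possible illustration, not the lesson's own code: it stores a polynomial as a coefficient list (a representation assumed here) and differentiates until nothing is left. The helper names `differentiate` and `all_derivatives` are hypothetical.

```python
def differentiate(coeffs):
    """Differentiate a polynomial given as [c0, c1, c2, ...], where c_i multiplies x**i."""
    # d/dx of c_i * x**i is i * c_i * x**(i-1); the constant term drops out.
    return [i * c for i, c in enumerate(coeffs)][1:]


def all_derivatives(coeffs):
    """Return the polynomial followed by its successive derivatives until one vanishes."""
    chain = [list(coeffs)]
    while chain[-1]:  # an empty list stands for the zero polynomial
        chain.append(differentiate(chain[-1]))
    return chain


# f(x) = x^4, written as coefficients of 1, x, x^2, x^3, x^4
x4 = [0, 0, 0, 0, 1]
for order, poly in enumerate(all_derivatives(x4)):
    print(f"derivative {order}: {poly}")
```

The fifth entry of the chain is the constant 24 and the sixth is the empty list, matching the hand computation for a degree-4 polynomial.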

Example 2: Trigonometric Function

Find the first four derivatives of $f(x) = \sin(x)$:

$f(x) = \sin(x)$
$f'(x) = \cos(x)$
$f''(x) = -\sin(x)$
$f'''(x) = -\cos(x)$
$f^{(4)}(x) = \sin(x)$

The derivatives of sin(x) cycle with period 4!

Third and Higher Derivatives

We can continue differentiating indefinitely. The nth derivative is written:

$$f^{(n)}(x) \quad \text{or} \quad \frac{d^n f}{dx^n}$$

For most common functions, there are patterns in their higher-order derivatives:

Function	Pattern of Derivatives
xⁿ (polynomial)	Decreases degree by 1 each time; becomes 0 after n+1 derivatives
eˣ	Every derivative equals eˣ (the function is its own derivative!)
sin(x), cos(x)	Cycle with period 4: sin → cos → -sin → -cos → sin
ln(x)	f^(n)(x) = (-1)^(n+1) · (n-1)! / xⁿ for n ≥ 1
e^(ax)	f^(n)(x) = aⁿ · e^(ax)
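The ln(x) row is the least obvious pattern in the table. Differentiating by hand a few times shows where the alternating sign and the factorial come from:

```latex
\begin{aligned}
f(x)       &= \ln x \\
f'(x)      &= x^{-1} \\
f''(x)     &= -x^{-2} \\
f'''(x)    &= 2\,x^{-3} \\
f^{(4)}(x) &= -6\,x^{-4} \\
f^{(n)}(x) &= (-1)^{n+1}\,\frac{(n-1)!}{x^{n}} \quad (n \ge 1)
\end{aligned}
```

Each differentiation multiplies by the current (negative) exponent, which flips the sign and builds up the factorial one factor at a time.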

The Special Property of eˣ

The function $f(x) = e^x$ is remarkable: every derivative is eˣ. This is why eˣ appears so frequently in differential equations: up to a constant multiple, it is the only function that equals its own derivative!

Interactive Explorer

Use this interactive visualization to explore how the original function and its first four derivatives relate to each other. Notice how:

When f is increasing, f' is positive
When f is concave up, f'' is positive
Inflection points of f occur where f'' changes sign (which requires f'' = 0 at that point when f'' exists there)
Each derivative captures finer details about the function's shape

[Interactive figure: Higher-Order Derivative Explorer. Sample readout for the polynomial f(x) = x⁴ at x = 1.00: f(x) = 1.0000, f'(x) = 4.0000, f''(x) = 12.0000.]

Notation Systems for Higher-Order Derivatives

There are several common notation systems for higher-order derivatives, each with its own advantages:

Name	1st	2nd	3rd	nth	Best For
Lagrange (Prime)	f'(x)	f''(x)	f'''(x)	f^(n)(x)	Quick calculations
Leibniz	dy/dx	d²y/dx²	d³y/dx³	dⁿy/dxⁿ	Chain rule, physics
Newton (Dot)	ẋ	ẍ	x⃛	—	Physics (time derivatives)
D-operator	Df	D²f	D³f	Dⁿf	Differential equations

Choosing Notation

Use prime notation (f', f'') for quick calculations. Use Leibniz notation (dy/dx) when you need to be explicit about variables, especially in the chain rule. Use dot notation (ẋ, ẍ) for derivatives with respect to time in physics.

Physics: Motion in Detail

In physics, higher-order derivatives of position have specific names and physical meanings:

Derivative	Name	Symbol	Physical Meaning	Units (SI)
0th (position)	Position	s	Where the object is	meters (m)
1st	Velocity	v = ds/dt	How fast position changes	m/s
2nd	Acceleration	a = dv/dt	How fast velocity changes	m/s²
3rd	Jerk	j = da/dt	How fast acceleration changes	m/s³
4th	Snap (Jounce)	—	How fast jerk changes	m/s⁴
5th	Crackle	—	How fast snap changes	m/s⁵
6th	Pop	—	How fast crackle changes	m/s⁶

Newton's Second Law $F = ma$ relates force to the second derivative of position. This is why so many physical laws involve second-order differential equations!
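To make the chain position → velocity → acceleration → jerk concrete, here is a small sketch for an object thrown upward. The trajectory s(t) = 10 + 5t − 4.9t² is an assumption chosen for illustration (constant gravitational acceleration of −9.8 m/s²), and the helpers `differentiate` and `evaluate` are hypothetical names.

```python
def differentiate(coeffs):
    """Differentiate a polynomial given as [c0, c1, c2, ...], where c_i multiplies t**i."""
    return [i * c for i, c in enumerate(coeffs)][1:]


def evaluate(coeffs, t):
    """Evaluate the polynomial at time t (an empty list is the zero polynomial)."""
    return sum(c * t**i for i, c in enumerate(coeffs))


# Assumed trajectory: s(t) = 10 + 5t - 4.9t^2 (meters, seconds)
position = [10.0, 5.0, -4.9]
velocity = differentiate(position)          # v(t) = 5 - 9.8t
acceleration = differentiate(velocity)      # a(t) = -9.8
jerk = differentiate(acceleration)          # j(t) = 0

for t in (0.0, 0.5, 1.0):
    print(
        f"t={t:.1f}s  s={evaluate(position, t):6.2f} m  "
        f"v={evaluate(velocity, t):6.2f} m/s  "
        f"a={evaluate(acceleration, t):6.2f} m/s^2  "
        f"j={evaluate(jerk, t):4.2f} m/s^3"
    )
```

Because the assumed position is quadratic in time, the acceleration is constant and the jerk is identically zero, which is the situation for free fall without air resistance.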

[Interactive figure: Physics of Motion, Higher-Order Derivatives in Action. Sample readout at t = 0.00 s: position s(t) = 10.00, velocity v(t) = s'(t) = 5.00, acceleration a(t) = s''(t) = -9.80, jerk j(t) = s'''(t) = 0.00.]

What's Happening?

In physics, each derivative tells us something important about motion:

Position s(t): Where the object is
Velocity v(t) = s'(t): How fast position is changing (speed and direction)
Acceleration a(t) = s''(t): How fast velocity is changing (force/mass by Newton's 2nd law)
Jerk j(t) = s'''(t): How fast acceleration is changing (affects passenger comfort!)

Why Jerk Matters

Engineers designing elevators, trains, and roller coasters pay close attention to jerk (the third derivative of position). High jerk can cause discomfort and, in extreme cases, injury. A smooth ride requires more than bounded acceleration: the acceleration itself must change gradually!

Concavity and Curvature

The second derivative reveals crucial information about a curve's shape:

f''(x) > 0: Concave Up

The curve bends upward like a bowl that holds water. The tangent line lies below the curve. The slope f'(x) is increasing.

f''(x) < 0: Concave Down

The curve bends downward like an upside-down bowl. The tangent line lies above the curve. The slope f'(x) is decreasing.

An inflection point occurs where concavity changes — typically where $f''(x) = 0$ (though we must verify the sign actually changes).
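The caveat about verifying the sign change is easy to see numerically. The sketch below compares x³ (a true inflection point at 0) with x⁴ (where f''(0) = 0 but the curve stays concave up). The function names, the step `h`, and the probe distance `delta` are illustrative assumptions.

```python
def second_derivative(f, x, h=1e-3):
    """Central-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


def is_inflection(f, x, delta=0.1):
    """True if f'' has opposite signs a small distance left and right of x."""
    left = second_derivative(f, x - delta)
    right = second_derivative(f, x + delta)
    return left * right < 0


def cubic(x):
    return x**3  # f''(x) = 6x changes sign at 0


def quartic(x):
    return x**4  # f''(x) = 12x^2 is zero at 0 but never negative


print("x^3 has an inflection point at 0:", is_inflection(cubic, 0.0))
print("x^4 has an inflection point at 0:", is_inflection(quartic, 0.0))
```

Both functions satisfy f''(0) = 0, yet only the cubic passes the sign-change test, so f'' = 0 alone is not enough.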

Curvature: Quantifying Bending

The curvature $\kappa$ at a point measures how sharply the curve bends. It's defined as:

$$\kappa = \frac{|f''(x)|}{(1 + [f'(x)]^2)^{3/2}}$$

The denominator corrects for the slope: it converts bending per unit of x into bending per unit of arc length, so the curvature of a given shape does not depend on how steeply it is tilted.

The osculating circle at a point is the circle that best approximates the curve there. Its radius is $R = 1/\kappa$, the reciprocal of curvature.
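As a quick numerical check of the formula, the sketch below evaluates the curvature of the parabola f(x) = x² at x = 0.5. The choice of function and the helper names `curvature` and `osculating_radius` are assumptions made for illustration.

```python
import math


def curvature(first, second):
    """Curvature from the values f'(x) and f''(x) at a single point."""
    return abs(second) / (1 + first**2) ** 1.5


def osculating_radius(kappa):
    """Radius of the osculating circle; infinite where the curve is locally straight."""
    return math.inf if kappa == 0 else 1 / kappa


# Assumed example: f(x) = x^2, so f'(x) = 2x and f''(x) = 2
x = 0.5
kappa = curvature(2 * x, 2.0)
print(f"kappa = {kappa:.4f}, R = {osculating_radius(kappa):.4f}")

# A straight line has f'' = 0, so its curvature is 0 whatever its slope
print(curvature(3.0, 0.0), osculating_radius(curvature(3.0, 0.0)))
```

At x = 0.5 the slope is 1 and the second derivative is 2, giving κ = 2 / 2^{3/2} ≈ 0.7071 and R ≈ 1.4142.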

[Interactive figure: Curvature and the Second Derivative, showing the curve, its tangent, and the osculating circle. Sample readout at x = 0.50: slope f'(x) = 1.0000, concavity f''(x) = 2.0000 (concave up), curvature κ = 0.7071, radius R = 1.41.]

Notice how:

Where f''(x) = 0 (for example, at inflection points), curvature is 0 and the radius is infinite (the curve is locally straight)
Large |f''(x)| means tight curvature (small radius)
The sign of f''(x) determines on which side of the curve the center lies

Taylor Series: Higher Derivatives Build Approximations

One of the most profound uses of higher-order derivatives is in Taylor series. The Taylor series of f centered at a is:

$$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots$$

In summation form: $f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x-a)^n$

Each term uses a higher-order derivative to capture more detail about the function's behavior near the point a:

0th derivative (f(a)): The value at a (constant term)
1st derivative (f'(a)): The slope at a (linear approximation)
2nd derivative (f''(a)): The curvature at a (quadratic approximation)
Higher derivatives: Finer and finer details of the shape
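The effect of adding terms can be seen on eˣ, whose derivatives at a are all equal to eᵃ. The sketch below is a minimal illustration under that assumption. The function name `taylor_exp` and the evaluation point x = 0.5 are arbitrary choices.

```python
import math


def taylor_exp(x, order, a=0.0):
    """Taylor polynomial of e^x of the given order, centered at a.

    Every derivative of e^x at a equals e^a, so the nth coefficient is e^a / n!.
    """
    return sum(
        math.exp(a) * (x - a) ** n / math.factorial(n) for n in range(order + 1)
    )


x = 0.5
exact = math.exp(x)
for order in range(5):
    approx = taylor_exp(x, order)
    print(f"order {order}: T(x) = {approx:.6f}, error = {abs(exact - approx):.2e}")
```

With a = 0 the order-2 polynomial is 1 + x + x²/2, and each extra derivative shrinks the error at x = 0.5 further.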

[Interactive figure: Taylor Series, Higher-Order Derivatives Build Approximations. Sample readout: the Taylor polynomial of order 2 (using derivatives up to f⁽²⁾) is T₂(x) = 1 + x + 1/2 · x². The n-th term is f^(n)(a) / n! · (x-a)^n.]

Key Insight

Higher-order derivatives capture more and more information about the function's behavior:

0th derivative (value): Where the function is
1st derivative (slope): Which direction it's going
2nd derivative (curvature): How it's bending
Higher: Finer details of the shape

ML Connection

Taylor expansions are fundamental to machine learning:

Gradient descent: Uses 1st-order (gradient)
Newton's method: Uses 2nd-order (Hessian)
Natural gradient: Uses Fisher information
Approximating losses: A low-order Taylor expansion around an optimum

Patterns in Derivatives of Common Functions

Recognizing patterns in higher-order derivatives saves time and provides insight:

Exponential Functions

For $f(x) = e^{ax}$:

$$f^{(n)}(x) = a^n e^{ax}$$

The special case a = 1 gives $f^{(n)}(x) = e^x$ for all n.
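The formula follows from the chain rule, which contributes one factor of a per differentiation. A short derivation, with a = -2 as an arbitrary example:

```latex
\begin{aligned}
f(x)         &= e^{ax} \\
f'(x)        &= a\,e^{ax} \\
f''(x)       &= a \cdot a\,e^{ax} = a^{2} e^{ax} \\
f^{(n+1)}(x) &= \frac{d}{dx}\left[a^{n} e^{ax}\right] = a^{n+1} e^{ax} \\[4pt]
a = -2:\quad f^{(3)}(x) &= (-2)^{3} e^{-2x} = -8\,e^{-2x}
\end{aligned}
```

For negative a the derivatives alternate in sign, while their magnitude grows or shrinks geometrically with |a|.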

Sine and Cosine

The derivatives cycle with period 4:

sin(x)

sin → cos → -sin → -cos → sin

cos(x)

cos → -sin → -cos → sin → cos

General formula: $\frac{d^n}{dx^n}[\sin(x)] = \sin(x + n\pi/2)$
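The closed form and the four-step cycle describe the same thing: each differentiation shifts the phase by π/2. The sketch below compares the two at x = π/4. The names `cycle` and `sin_derivative` are illustrative.

```python
import math

# The four functions visited by repeatedly differentiating sin
cycle = [
    math.sin,
    math.cos,
    lambda t: -math.sin(t),
    lambda t: -math.cos(t),
]


def sin_derivative(n, x):
    """nth derivative of sin at x via the closed form sin(x + n*pi/2)."""
    return math.sin(x + n * math.pi / 2)


x = math.pi / 4
for n in range(8):
    from_cycle = cycle[n % 4](x)
    from_formula = sin_derivative(n, x)
    print(f"n={n}: cycle = {from_cycle:+.4f}, formula = {from_formula:+.4f}")
```

The two columns agree to floating-point precision, and the values repeat after every fourth derivative.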

Polynomials

For $f(x) = x^n$ and $0 \le k \le n$:

$$f^{(k)}(x) = \frac{n!}{(n-k)!} x^{n-k} = n(n-1)(n-2)\cdots(n-k+1)x^{n-k}$$

At the nth derivative: $f^{(n)}(x) = n!$ (a constant)
Beyond that: $f^{(n+1)}(x) = 0$
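The coefficient n!/(n−k)! is the number of k-permutations of n items, which Python exposes as `math.perm` (available from Python 3.8). A minimal sketch of the closed form, with the hypothetical helper name `power_derivative`:

```python
import math


def power_derivative(n, k, x):
    """k-th derivative of x**n evaluated at x, for non-negative integers n and k."""
    if k > n:
        return 0  # differentiating past the degree gives the zero function
    return math.perm(n, k) * x ** (n - k)


# Derivatives of x^4 at x = 2
for k in range(6):
    print(f"k={k}: {power_derivative(4, k, 2)}")
```

For n = 4 at x = 2 this gives 16, 32, 48, 48, 24, 0 for k = 0 through 5.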

Machine Learning Applications

Second-order derivatives are crucial in machine learning optimization:

The Hessian Matrix

For a function of multiple variables $f(x_1, x_2, \ldots, x_n)$, the Hessian matrix contains all second-order partial derivatives:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

The Hessian is symmetric ($H_{ij} = H_{ji}$) whenever the second partial derivatives are continuous, which holds for smooth functions.
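A worked example helps fix the definition. For the quadratic f(x, y) = x² + 2y² + xy (the same function used in this section's Python example), the gradient and Hessian are:

```latex
\begin{aligned}
f(x, y)  &= x^2 + 2y^2 + xy \\[4pt]
\nabla f &= \begin{pmatrix} 2x + y \\ x + 4y \end{pmatrix} \\[4pt]
H        &= \begin{pmatrix}
              \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x\,\partial y} \\[8pt]
              \dfrac{\partial^2 f}{\partial y\,\partial x} & \dfrac{\partial^2 f}{\partial y^2}
            \end{pmatrix}
          = \begin{pmatrix} 2 & 1 \\ 1 & 4 \end{pmatrix}
\end{aligned}
```

The off-diagonal entries both come from the xy term, which is why they are equal. Because f is quadratic, the Hessian is the same at every point.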

Newton's Method for Optimization

While gradient descent uses only first-order information (the gradient), Newton's method uses the Hessian to take smarter steps:

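$$x_{n+1} = x_n - H^{-1} \nabla f$$

Here $H$ and $\nabla f$ are evaluated at $x_n$. Near a minimum where the Hessian is positive definite, this often converges faster (quadratically vs linearly) but requires computing and inverting the Hessian at every step.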

Gradient Descent

Uses 1st derivative (gradient)
Step: -α∇f
Linear convergence
Cheap per iteration

Newton's Method

Uses 2nd derivative (Hessian)
Step: -H⁻¹∇f
Quadratic convergence
Expensive per iteration
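The trade-off can be seen on a small problem. The sketch below counts iterations for both methods on the quadratic f(x, y) = x² + 2y² + xy. The learning rate 0.1, the tolerance, and all function names are assumptions made for illustration. A quadratic is the best case for Newton's method, since its second-order model is exact there, so the gap shown is not typical of general loss surfaces.

```python
def grad(x, y):
    """Gradient of f(x, y) = x^2 + 2y^2 + xy."""
    return 2 * x + y, x + 4 * y


# The Hessian of this quadratic is constant
H = ((2.0, 1.0), (1.0, 4.0))


def newton_step(x, y):
    """x - H^{-1} grad, solving the 2x2 system with Cramer's rule."""
    gx, gy = grad(x, y)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    dx = (gx * H[1][1] - H[0][1] * gy) / det
    dy = (H[0][0] * gy - H[1][0] * gx) / det
    return x - dx, y - dy


def gradient_descent_step(x, y, lr=0.1):
    """x - lr * grad, using first-order information only."""
    gx, gy = grad(x, y)
    return x - lr * gx, y - lr * gy


def iterations_to_converge(step, start=(1.0, 1.0), tol=1e-8, max_iter=10_000):
    """Count steps until the gradient norm falls below tol."""
    x, y = start
    for k in range(max_iter):
        gx, gy = grad(x, y)
        if (gx * gx + gy * gy) ** 0.5 < tol:
            return k
        x, y = step(x, y)
    return max_iter


newton_iters = iterations_to_converge(newton_step)
gd_iters = iterations_to_converge(gradient_descent_step)
print(f"Newton's method: {newton_iters} iteration(s)")
print(f"Gradient descent: {gd_iters} iterations")
```

Newton's method lands on the minimum (0, 0) in a single step here, while gradient descent needs many small steps. Each Newton step, however, required solving a linear system with the Hessian.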

Curvature Information for Better Training

The Hessian provides crucial information:

Eigenvalues: Tell us about the curvature in different directions
Condition number: Large = ill-conditioned loss surface = harder to optimize
Saddle points: A critical point whose Hessian has both positive and negative eigenvalues
Local minimum: A critical point whose Hessian has all positive eigenvalues
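For two variables, the eigenvalue test can be written out in closed form. The sketch below handles only symmetric 2×2 Hessians [[a, b], [b, d]], assumed to be evaluated at a critical point. The function names are hypothetical.

```python
import math


def symmetric_2x2_eigenvalues(a, b, d):
    """Eigenvalues of [[a, b], [b, d]] in ascending order."""
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return mean - radius, mean + radius


def classify_critical_point(a, b, d):
    """Second derivative test for a 2x2 Hessian at a critical point."""
    lo, hi = symmetric_2x2_eigenvalues(a, b, d)
    if lo > 0:
        return "local minimum"
    if hi < 0:
        return "local maximum"
    if lo < 0 < hi:
        return "saddle point"
    return "inconclusive"  # some eigenvalue is zero


# Hessian of x^2 + 2y^2 + xy
print(classify_critical_point(2.0, 1.0, 4.0))
# Hessian of x^2 - y^2
print(classify_critical_point(2.0, 0.0, -2.0))

# Condition number: ratio of largest to smallest eigenvalue (positive definite case)
lo, hi = symmetric_2x2_eigenvalues(2.0, 1.0, 4.0)
print(f"eigenvalues = ({lo:.3f}, {hi:.3f}), condition number = {hi / lo:.3f}")
```

The first Hessian has eigenvalues 3 ± √2, both positive, so the critical point is a minimum. The second has eigenvalues of opposite sign and describes a saddle.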

Practical Approaches

Computing the full Hessian is expensive (O(n²) storage, O(n³) to invert). Practical methods include:

Diagonal approximations: Only keep diagonal elements
L-BFGS: Approximate inverse Hessian from gradient history
Natural gradient: Use Fisher information instead
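Adaptive methods: Adam and AdaGrad rescale each parameter's step using running gradient statistics, which acts as a rough, implicit stand-in for curvature information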
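To give a feel for the cheapest of these options, the sketch below applies a diagonal approximation to the quadratic f(x, y) = x² + 2y² + xy: it keeps only the diagonal Hessian entries (2 and 4) and drops the cross term. The function names and the starting point are assumptions for illustration.

```python
def grad(x, y):
    """Gradient of f(x, y) = x^2 + 2y^2 + xy."""
    return 2 * x + y, x + 4 * y


# Diagonal of the Hessian [[2, 1], [1, 4]]; the off-diagonal 1s are ignored
HESSIAN_DIAGONAL = (2.0, 4.0)


def diagonal_newton_step(x, y):
    """Newton-like step that divides each gradient component by its own curvature."""
    gx, gy = grad(x, y)
    return x - gx / HESSIAN_DIAGONAL[0], y - gy / HESSIAN_DIAGONAL[1]


point = (1.0, 1.0)
for k in range(1, 6):
    point = diagonal_newton_step(*point)
    print(f"step {k}: ({point[0]:+.5f}, {point[1]:+.5f})")
```

A full Newton step would reach the minimum (0, 0) at once on this function. The diagonal version overshoots on the first step because it ignores the xy coupling, then closes in over several iterations, while needing only n stored values and n divisions per step.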

Python Implementation

Computing Higher-Order Derivatives

Here's how to compute higher-order derivatives numerically and recognize their patterns:

Computing nth Derivatives (higher_order_derivatives.py)

Recursive nth Derivative: We compute higher-order derivatives recursively: the nth derivative is the derivative of the (n-1)th derivative. This mirrors the mathematical definition.

Central Difference: The first derivative uses the central difference formula f'(x) ≈ [f(x+h) - f(x-h)]/(2h), whose error shrinks like h², compared with h for a forward or backward difference.

Recursive Case: For n > 1, we define a function that computes the (n-1)th derivative, then differentiate that. This chains the differentiation process.

Step Size: Each level of recursion divides by 2h again, so round-off error grows roughly like 1/hⁿ. A very small step such as 1e-5 becomes unreliable from about the third derivative on, so the code uses a larger default step and treats high-order results as rough estimates.

Polynomial Termination: For x^4, the 5th derivative and beyond are 0. Each derivative reduces the degree by 1, and after degree + 1 derivatives, we get zero.

Trigonometric Cycling: sin(x) derivatives cycle with period 4: sin → cos → -sin → -cos → sin. This is because d/dx(sin) = cos and d/dx(cos) = -sin.

```python
import numpy as np


def nth_derivative(f, x, n, h=1e-2):
    """
    Estimate the nth derivative of f at x numerically.

    Applies the central difference formula recursively. Round-off error
    grows roughly like 1/h**n, so h must not be too small, and accuracy
    still degrades as n increases.
    """
    if n == 0:
        return f(x)
    if n == 1:
        # Central difference: f'(x) ≈ [f(x+h) - f(x-h)] / (2h)
        return (f(x + h) - f(x - h)) / (2 * h)

    # Recursive case: the nth derivative is the derivative of the (n-1)th
    def f_prev(t):
        return nth_derivative(f, t, n - 1, h)

    return (f_prev(x + h) - f_prev(x - h)) / (2 * h)


# Example: analyze f(x) = x^4
def f(x):
    return x**4


print("Derivatives of f(x) = x^4 at x = 2:")
for n in range(6):
    deriv = nth_derivative(f, 2.0, n)
    print(f"  f^({n})(2) ≈ {deriv:.2f}")

# Exact values at x = 2, for comparison:
# f(2)     = 16
# f'(2)    = 4*2^3 = 32
# f''(2)   = 12*2^2 = 48
# f'''(2)  = 24*2 = 48
# f''''(2) = 24
# f^(5)(2) = 0

print()

# Example: derivatives of sin cycle every 4 steps
g = np.sin
x0 = np.pi / 4

print("Derivatives of sin(x) at x = π/4:")
for n in range(8):
    deriv = nth_derivative(g, x0, n)
    print(f"  sin^({n})(π/4) ≈ {deriv:.4f}")

# Pattern: sin -> cos -> -sin -> -cos -> sin (period 4)
# The highest-order estimates are only approximate at this step size.
```

The Hessian in Machine Learning

Computing the Hessian matrix for optimization:

Hessian Matrix for Optimization (hessian_optimization.py)

The Hessian Matrix: The Hessian is the matrix of all second-order partial derivatives. For a function of n variables, it's an n×n symmetric matrix.

Mixed Partials: We compute ∂²f/∂x_i∂x_j using a formula with four function evaluations. By Schwarz's theorem, mixed partials are equal when they are continuous: ∂²f/∂x∂y = ∂²f/∂y∂x.

Finite Difference Formula: This formula approximates the second mixed partial derivative using four strategically placed points, similar to how we approximate first derivatives.

Eigenvalue Test: If all eigenvalues of the Hessian are positive, the function is locally convex (bowl-shaped upward). At a critical point, this is the second derivative test for multiple variables!

Newton's Method: Newton's method uses the Hessian to choose its step: Δx = -H^(-1)∇f. This uses second-order information and, near a minimum, typically converges faster than gradient descent.

```python
import numpy as np


def compute_hessian(f, x, h=1e-5):
    """
    Estimate the Hessian matrix of f at point x.

    The Hessian contains all second-order partial derivatives:
    H[i, j] = ∂²f / ∂x_i ∂x_j

    Used in Newton's method for optimization.
    """
    x = np.asarray(x, dtype=float)  # ensure a float array, even for integer input
    n = len(x)
    H = np.zeros((n, n))

    for i in range(n):
        for j in range(n):
            # Four evaluation points, offset by ±h along axes i and j
            x_pp = x.copy()
            x_pm = x.copy()
            x_mp = x.copy()
            x_mm = x.copy()

            x_pp[i] += h
            x_pp[j] += h
            x_pm[i] += h
            x_pm[j] -= h
            x_mp[i] -= h
            x_mp[j] += h
            x_mm[i] -= h
            x_mm[j] -= h

            # Central difference for the second partial derivative
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * h * h)

    return H


# Example: quadratic function (bowl shape)
# f(x, y) = x^2 + 2y^2 + xy
# Hessian should be [[2, 1], [1, 4]]
def f(x):
    return x[0]**2 + 2*x[1]**2 + x[0]*x[1]


x0 = np.array([1.0, 1.0])
H = compute_hessian(f, x0)

print("Function: f(x,y) = x² + 2y² + xy")
print("Hessian at (1, 1):")
print(H)
print()

# Check eigenvalues for local convexity
eigenvalues = np.linalg.eigvalsh(H)
is_positive_definite = bool(np.all(eigenvalues > 0))
print(f"Eigenvalues: {eigenvalues}")
print(f"All positive? {is_positive_definite}")
if is_positive_definite:
    print("Hessian is positive definite: f is locally convex here")

# Newton's step: Δx = -H^(-1) ∇f, solved without forming the inverse
grad = np.array([2*x0[0] + x0[1], 4*x0[1] + x0[0]])  # analytic gradient of f
delta = -np.linalg.solve(H, grad)
print(f"Newton step from (1,1): {delta}")
print(f"New point: {x0 + delta}")  # close to (0, 0), the minimum of this quadratic
```


Summary

Higher-order derivatives extend the power of calculus, revealing progressively finer details about how functions behave.

Key Formulas

Concept	Formula
Second derivative	f′′(x) = d/dx[f′(x)] = d²f/dx²
nth derivative	f^(n)(x) = d^n f/dx^n
Curvature	κ = |f′′|/(1 + f′²)^(3/2)
Taylor series term	f^(n)(a)/n! · (x-a)^n
Newton optimization	x_{n+1} = x_n - H⁻¹∇f

Key Concepts

The second derivative measures the rate of change of the rate of change — it tells us about concavity and curvature
In physics, derivatives of position give velocity (1st), acceleration (2nd), and jerk (3rd)
Polynomials terminate: after degree + 1 derivatives, all higher derivatives are 0
Exponentials are special: eˣ equals all its own derivatives
Trig functions cycle with period 4: sin → cos → -sin → -cos → sin
Taylor series use all derivatives to build polynomial approximations
The Hessian (matrix of second partials) is crucial for optimization in ML

The Takeaway:

Higher-order derivatives peel back layers of change — each one reveals more about how functions evolve, from their slope to their curvature to the finest details of their shape.

Coming Next: In the next chapter, we'll explore Derivatives of Transcendental Functions — how to differentiate exponentials, logarithms, and trigonometric functions.