Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Explain the difference between average rate of change and instantaneous rate of change
Describe how the derivative arises as a limit of difference quotients
Visualize the transition from secant lines to tangent lines as $\Delta x \to 0$
State the formal definition of the derivative using limit notation
Interpret the derivative geometrically as the slope of the tangent line
Apply the derivative concept to physics problems involving velocity
Connect derivatives to optimization in machine learning (gradient descent)

The Big Picture: Why We Need Instantaneous Rates

"The derivative is the fundamental new idea of calculus — the concept that distinguishes calculus from algebra." — Richard Courant

Imagine you're driving on a highway. Your speedometer reads 65 mph. But what does "65 mph" actually mean? You haven't traveled 65 miles in the past hour — you just started driving 10 minutes ago. The speedometer is showing your instantaneous velocity: how fast you're going right now, at this exact moment.

This is fundamentally different from average velocity. If you drive 130 miles in 2 hours, your average velocity is 65 mph. But during those 2 hours, you might have sped up, slowed down, or even stopped — the average hides all those details.

The Core Question

How do we calculate a rate of change at a single instant when, by definition, change requires two different moments?

This is the central problem that calculus was invented to solve. The answer — the derivative — revolutionized science and remains essential today.

Where Instantaneous Rates Appear

🚗 Physics

Velocity (rate of position change)
Acceleration (rate of velocity change)
Electric current (rate of charge flow)
Power (rate of energy transfer)

📈 Economics

Marginal cost (rate of cost change)
Marginal revenue
Elasticity of demand
Growth rates

🧬 Biology

Population growth rates
Reaction velocities
Drug concentration decay
Mutation rates

🤖 Machine Learning

Gradient descent optimization
Backpropagation in neural networks
Loss function sensitivity
Learning rate adaptation

Historical Origins: Newton, Leibniz, and the Birth of Calculus

The derivative concept was developed independently by Isaac Newton (1642–1727) in England, who worked out his method in the mid-1660s, and Gottfried Wilhelm Leibniz (1646–1716) in Germany, who developed his in the 1670s. Despite a bitter priority dispute, both are credited as co-founders of calculus.

Newton's Motivation: Motion and Gravity

Newton needed calculus to develop his laws of motion and universal gravitation. He asked: How does the Moon's velocity change as it orbits Earth? How does a falling apple accelerate? These questions required understanding instantaneous rates of change.

Newton called derivatives "fluxions" — he imagined quantities as flowing through time, and the fluxion was the rate of flow at any instant. His notation used dots: $\dot{y}$ for dy/dt.

Leibniz's Insight: Infinitesimals

Leibniz approached the problem through infinitely small quantities. He imagined dx as an "infinitesimal" change in x — not zero, but smaller than any positive number. The ratio dy/dx captured the instantaneous rate.

Leibniz's notation ( $\frac{dy}{dx}$ ) proved more practical and is still used today. It suggests that the derivative is a ratio of infinitesimal changes, even though we formalize it as a limit.

The Rigorous Foundation

For nearly 200 years, calculus worked well but lacked rigorous foundations. In the 1800s, Augustin-Louis Cauchy and Karl Weierstrass developed the epsilon-delta definition of limits, finally putting derivatives on solid mathematical ground.

Average Rate of Change: The Starting Point

Before we can understand instantaneous rates, let's be precise about average rates.

Definition: Average Rate of Change

For a function $f(x)$ , the average rate of change from $x = a$ to $x = b$ is:

$$\text{Average Rate} = \frac{f(b) - f(a)}{b - a} = \frac{\Delta y}{\Delta x}$$

This is simply the slope of the secant line connecting the points $(a, f(a))$ and $(b, f(b))$ .

Symbol	Name	Meaning
Δy = f(b) - f(a)	Change in output	How much the function value changed
Δx = b - a	Change in input	How much the input changed
Δy/Δx	Difference quotient	Ratio of changes = slope of secant line

Example: Average Velocity

A ball is thrown upward and its height at time $t$ seconds is $s(t) = 40t - 5t^2$ meters.

Question: What is the average velocity from $t = 1$ to $t = 3$ ?

Solution:

$s(1) = 40(1) - 5(1)^2 = 35$ meters

$s(3) = 40(3) - 5(3)^2 = 120 - 45 = 75$ meters

Average velocity = $\frac{75 - 35}{3 - 1} = \frac{40}{2} = 20$ m/s

But this tells us the ball's average behavior over 2 seconds — not its velocity at any particular instant. At $t = 2$ , is the ball moving faster or slower than 20 m/s?
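The calculation above can be checked with a few lines of Python. This is a minimal sketch; the helper name `average_rate` is an illustrative choice, not something defined elsewhere in this lesson.

```python
def s(t):
    """Height in meters of the ball at time t seconds: s(t) = 40t - 5t^2."""
    return 40 * t - 5 * t ** 2


def average_rate(f, a, b):
    """Slope of the secant line through (a, f(a)) and (b, f(b))."""
    return (f(b) - f(a)) / (b - a)


# Average velocity of the ball on the interval [1, 3], in m/s
print(average_rate(s, 1, 3))  # prints 20.0
```

The same helper works for any function and any interval with a ≠ b, which is all the definition of average rate of change requires.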

The Limit Process: Shrinking the Interval

Here's Newton and Leibniz's key insight: to find the instantaneous rate at a point, compute the average rate over smaller and smaller intervals containing that point, and see what value the average approaches.

Zooming In on a Single Point

Let's find the instantaneous velocity at $t = 2$ for our ball with $s(t) = 40t - 5t^2$ .

We'll compute the average velocity from $t = 2$ to $t = 2 + h$ for smaller and smaller values of $h$ :

h	Interval	s(2)	s(2+h)	Average Velocity
1	[2, 3]	60	75	(75-60)/1 = 15 m/s
0.5	[2, 2.5]	60	68.75	(68.75-60)/0.5 = 17.5 m/s
0.1	[2, 2.1]	60	61.95	(61.95-60)/0.1 = 19.5 m/s
0.01	[2, 2.01]	60	60.1995	(60.1995-60)/0.01 = 19.95 m/s
0.001	[2, 2.001]	60	60.019995	19.995 m/s

As $h \to 0$ , the average velocity approaches 20 m/s. This limiting value is the instantaneous velocity at $t = 2$ .
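The table can be reproduced with a short loop. This sketch assumes the same height function $s(t) = 40t - 5t^2$; the name `average_velocity` is hypothetical.

```python
def s(t):
    """Height of the ball in meters at time t seconds."""
    return 40 * t - 5 * t ** 2


def average_velocity(t, h):
    """Average velocity over the interval [t, t + h]."""
    return (s(t + h) - s(t)) / h


# Shrink the interval around t = 2 and watch the average velocity settle
for h in [1, 0.5, 0.1, 0.01, 0.001]:
    print(f"h = {h:<6} average velocity = {average_velocity(2, h):.4f} m/s")
```

Each printed value should match the corresponding row of the table, up to floating-point rounding.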

Computing the Limit Algebraically

For $s(t) = 40t - 5t^2$ :

$\frac{s(2+h) - s(2)}{h} = \frac{[40(2+h) - 5(2+h)^2] - 60}{h}$

$= \frac{80 + 40h - 5(4 + 4h + h^2) - 60}{h} = \frac{80 + 40h - 20 - 20h - 5h^2 - 60}{h}$

$= \frac{20h - 5h^2}{h} = 20 - 5h$

As $h \to 0$ : $20 - 5h \to 20$ m/s
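One way to double-check the algebra is exact rational arithmetic, which avoids floating-point rounding entirely. The sketch below uses Python's standard `fractions` module and confirms that the difference quotient equals $20 - 5h$ for a few sample values of $h$; the function name `difference_quotient` is an illustrative choice.

```python
from fractions import Fraction


def s(t):
    """Height of the ball; works for ints and Fractions alike."""
    return 40 * t - 5 * t ** 2


def difference_quotient(h):
    """[s(2 + h) - s(2)] / h, computed exactly when h is a Fraction."""
    return (s(2 + h) - s(2)) / h


for h in [Fraction(1, 2), Fraction(1, 10), Fraction(1, 1000)]:
    # The simplified form derived above is 20 - 5h
    assert difference_quotient(h) == 20 - 5 * h
    print(h, difference_quotient(h))
```

A check on sample values is not a proof, but it catches sign and expansion mistakes in the hand calculation.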

Interactive: From Secant to Tangent

The geometric interpretation is powerful: as $\Delta x \to 0$ , the secant line (connecting two points) approaches the tangent line (touching at one point).

[Interactive demo: secant line versus tangent line. Sample state: x = 1, Δx = 2, secant slope Δy/Δx = 4, true derivative f'(1) = 2, error = 2.]

Key Insight

As Δx approaches 0, the secant line slope approaches the tangent line slope. This limiting value is the derivative — the instantaneous rate of change at the point. Notice how the error decreases as Δx shrinks!
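The shrinking error can also be seen numerically. The sketch below assumes the curve is $f(x) = x^2$ and the point is $x = 1$ (an assumption chosen for illustration); `secant_slope` is a hypothetical helper name.

```python
def f(x):
    return x ** 2  # assumed example curve


def secant_slope(f, x, dx):
    """Slope of the line through (x, f(x)) and (x + dx, f(x + dx))."""
    return (f(x + dx) - f(x)) / dx


true_slope = 2 * 1  # tangent slope of x^2 at x = 1
for dx in [2, 1, 0.5, 0.1, 0.01]:
    slope = secant_slope(f, 1, dx)
    error = abs(slope - true_slope)
    print(f"dx = {dx:<5} secant slope = {slope:.4f}  error = {error:.4f}")
```

For this particular curve the error equals Δx exactly (before rounding), so halving Δx halves the error.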

The Formal Definition of the Derivative

We now have all the pieces to state the formal definition that makes calculus rigorous.

Definition: The Derivative

The derivative of $f(x)$ at $x = a$ , denoted $f'(a)$ , is:

$$f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h}$$

provided this limit exists.

Equivalent notation: We also write $\frac{df}{dx}\bigg|_{x=a}$ or $\frac{d}{dx}f(x)\bigg|_{x=a}$ .

Understanding Each Part

Expression	Meaning
f(a + h)	Function value at a nearby point
f(a + h) - f(a)	Change in output (Δy)
h	Change in input (Δx)
[f(a+h) - f(a)]/h	Slope of secant line (average rate)
lim_{h→0}	Take the limit as interval shrinks to zero
f'(a)	Slope of tangent line at x = a (instantaneous rate)

Alternative Definition

Sometimes it's convenient to use $x$ instead of $a + h$ :

$$f'(a) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a}$$

Both definitions are equivalent. Use whichever is more convenient for the problem at hand.
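The equivalence comes from a change of variable. Writing $x = a + h$ turns one limit into the other, as this short derivation shows:

```latex
\begin{aligned}
x = a + h \;&\Longrightarrow\; h = x - a, \qquad h \to 0 \iff x \to a, \\
\lim_{h \to 0} \frac{f(a + h) - f(a)}{h}
  &= \lim_{x \to a} \frac{f(x) - f(a)}{x - a}.
\end{aligned}
```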

Geometric Interpretation: Tangent Line Slope

The derivative $f'(a)$ equals the slope of the tangent line to the curve $y = f(x)$ at the point $(a, f(a))$ .

What Makes Tangent Lines Special

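Touches the curve locally: Near the point of tangency, the tangent line meets the curve at that point and usually stays on one side of it. At an inflection point (for example, $y = x^3$ at $x = 0$) the tangent line crosses the curve.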
Best linear approximation: Near the point of tangency, the tangent line is the best straight-line approximation to the curve
Unique direction: The tangent line shows the direction the curve is "heading" at that instant

The Tangent Line Equation

Once we know the derivative, we can write the equation of the tangent line using point-slope form:

$$y - f(a) = f'(a)(x - a)$$

Or, solving for $y$: $y = f'(a)(x - a) + f(a)$

Example: Tangent to a Parabola

Find the tangent line to $f(x) = x^2$ at $x = 2$ .

Step 1: Find the point: $f(2) = 4$ , so the point is $(2, 4)$

Step 2: Find the derivative at this point. Using the limit definition:

$f'(2) = \lim_{h \to 0} \frac{(2+h)^2 - 4}{h} = \lim_{h \to 0} \frac{4 + 4h + h^2 - 4}{h} = \lim_{h \to 0} (4 + h) = 4$

Step 3: Write the tangent line equation:

$y - 4 = 4(x - 2) \Rightarrow y = 4x - 4$
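The three steps translate directly into code. In this sketch the caller supplies the derivative as a function; here we pass $2x$, which agrees with the value $f'(2) = 4$ computed above. The name `tangent_line` is hypothetical.

```python
def tangent_line(f, f_prime, a):
    """Return (slope, intercept) of the tangent line to y = f(x) at x = a."""
    slope = f_prime(a)
    # From y = f'(a)(x - a) + f(a), the constant term is f(a) - f'(a) * a
    intercept = f(a) - slope * a
    return slope, intercept


slope, intercept = tangent_line(lambda x: x ** 2, lambda x: 2 * x, 2)
print(slope, intercept)  # prints 4 -4, i.e. y = 4x - 4
```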

Physics Application: Velocity and Acceleration

The derivative was invented for physics. Here's how it connects position, velocity, and acceleration:

📍 Position: s(t). Where am I?

🏃 Velocity: v(t) = s'(t). How fast am I going?

🚀 Acceleration: a(t) = v'(t) = s''(t). How fast is my velocity changing?

Physics Application: Velocity from Position

The car moves at a constant speed of 2 m/s. The instantaneous velocity is the derivative of position — the limit of average velocity as Δt → 0.

[Interactive demo: car at constant speed. Sample state: t = 0 s, Δt = 1 s, average velocity Δs/Δt = 2 m/s from t = 0 to t = 1, instantaneous velocity ds/dt = 2 m/s at t = 0, difference = 0 m/s.]

The Fundamental Connection

Velocity is the derivative of position: $v(t) = \frac{ds}{dt} = \lim_{\Delta t \to 0} \frac{s(t + \Delta t) - s(t)}{\Delta t}$. This is why calculus was invented: to solve problems involving instantaneous rates of change in physics!
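To see the position, velocity, and acceleration chain numerically, here is a rough sketch using the ball from earlier, $s(t) = 40t - 5t^2$. It uses a central difference, $[f(t+h) - f(t-h)]/(2h)$, instead of the one-sided quotient from the definition; that is a swapped-in approximation chosen because it is more accurate for the same $h$. The step sizes are assumptions picked for illustration.

```python
def s(t):
    """Position (height in meters) at time t seconds."""
    return 40 * t - 5 * t ** 2


def derivative(f, t, h=1e-5):
    """Central difference estimate of f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)


def v(t):
    """Velocity: numerical derivative of position."""
    return derivative(s, t)


def a(t):
    """Acceleration: numerical derivative of velocity (larger step for stability)."""
    return derivative(v, t, h=1e-3)


print(f"v(2) = {v(2):.3f} m/s")
print(f"a(2) = {a(2):.3f} m/s^2")
```

The printed values should be close to 20 m/s and -10 m/s², which match the exact derivatives $v(t) = 40 - 10t$ and $a(t) = -10$.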

Computing Derivatives from the Definition

Let's use the limit definition to compute derivatives of common functions. Later, we'll learn shortcut rules that make this much faster.

Example 1: Derivative of a Linear Function

Find the derivative of $f(x) = 3x + 2$ .

$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} = \lim_{h \to 0} \frac{[3(x+h) + 2] - [3x + 2]}{h}$

$= \lim_{h \to 0} \frac{3x + 3h + 2 - 3x - 2}{h} = \lim_{h \to 0} \frac{3h}{h} = \lim_{h \to 0} 3 = 3$

Result: $f'(x) = 3$ — the slope of the line!

Example 2: Derivative of x²

Find the derivative of $f(x) = x^2$ .

$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h}$

$= \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x$

Result: $\frac{d}{dx}(x^2) = 2x$

Example 3: Derivative of 1/x

Find the derivative of $f(x) = \frac{1}{x}$ .

$f'(x) = \lim_{h \to 0} \frac{\frac{1}{x+h} - \frac{1}{x}}{h} = \lim_{h \to 0} \frac{x - (x+h)}{h \cdot x(x+h)}$

$= \lim_{h \to 0} \frac{-h}{h \cdot x(x+h)} = \lim_{h \to 0} \frac{-1}{x(x+h)} = \frac{-1}{x^2}$

Result: $\frac{d}{dx}\left(\frac{1}{x}\right) = -\frac{1}{x^2}$

Pattern Emerging

Notice that $\frac{d}{dx}(x^n) = nx^{n-1}$ works for n = 2 (giving 2x) and n = -1 (giving $-1 \cdot x^{-2} = -\frac{1}{x^2}$ ). This is the power rule, which we'll prove in a later section.
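The pattern can be spot-checked numerically before it is proved. This sketch compares a difference quotient with $nx^{n-1}$ for the two exponents worked out above; the test point $x = 1.5$ and the step $h = 10^{-6}$ are arbitrary choices, and the helper names are hypothetical.

```python
def difference_quotient(f, x, h=1e-6):
    """Forward difference approximation of f'(x)."""
    return (f(x + h) - f(x)) / h


def power_rule(n, x):
    """Candidate formula for the derivative of x^n."""
    return n * x ** (n - 1)


x = 1.5
for n in [2, -1]:
    approx = difference_quotient(lambda t: t ** n, x)
    print(f"n = {n:>2}: quotient = {approx:.5f}, n*x^(n-1) = {power_rule(n, x):.5f}")
```

Agreement at one point is evidence, not proof, but it is a quick way to catch a wrong formula.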

Machine Learning Applications

Derivatives are the heart of modern machine learning. Every time you train a neural network, you're computing millions of derivatives!

Gradient Descent: Finding Optimal Parameters

In machine learning, we want to minimize a loss function $L(\theta)$ that measures how wrong our model's predictions are. The derivative tells us which way to adjust parameters to reduce the loss.

The Gradient Descent Update Rule

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \cdot \frac{dL}{d\theta}$$

where $\alpha$ is the learning rate. We move in the opposite direction of the gradient because we want to decrease the loss.

Why the Derivative Points Uphill

If $\frac{dL}{d\theta} > 0$ : Loss increases when θ increases → we should decrease θ
If $\frac{dL}{d\theta} < 0$ : Loss decreases when θ increases → we should increase θ
If $\frac{dL}{d\theta} = 0$ : We're at a critical point (possibly a minimum!)
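The three cases can be illustrated with a single update step. This sketch assumes a toy loss $L(\theta) = \theta^2$, whose derivative is $2\theta$; the loss, the learning rate, and the function names are illustrative assumptions.

```python
def gradient_step(theta, grad, alpha=0.1):
    """One update: theta_new = theta_old - alpha * dL/dtheta."""
    return theta - alpha * grad


def dL(theta):
    """Derivative of the assumed toy loss L(theta) = theta^2."""
    return 2 * theta


# Positive gradient, negative gradient, and zero gradient
for theta in [3.0, -3.0, 0.0]:
    print(theta, "->", gradient_step(theta, dL(theta)))
```

Starting at 3.0 the step moves θ down, starting at -3.0 it moves θ up, and at 0.0 (the minimum of this loss) θ does not move.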

Backpropagation: Chain Rule at Scale

In neural networks with many layers, we use the chain rule (which we'll cover later) to propagate gradients backwards from the output to update all parameters. This is called backpropagation.
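As a preview, here is a minimal sketch of the idea on a made-up two-parameter model: prediction $= w_2 \cdot (w_1 \cdot x)$ with squared-error loss. The model, the parameter names, and the input values are all assumptions for illustration; real networks apply the same multiplication of local derivatives across many more layers.

```python
def forward(w1, w2, x, y):
    """Compute the hidden value, the prediction, and the squared-error loss."""
    hidden = w1 * x
    pred = w2 * hidden
    loss = (pred - y) ** 2
    return hidden, pred, loss


def backward(w1, w2, x, y):
    """Chain rule, applied from the loss back toward each parameter."""
    hidden, pred, _ = forward(w1, w2, x, y)
    dloss_dpred = 2 * (pred - y)        # derivative of (pred - y)^2
    dloss_dw2 = dloss_dpred * hidden    # pred = w2 * hidden
    dloss_dhidden = dloss_dpred * w2    # pass the gradient back one layer
    dloss_dw1 = dloss_dhidden * x       # hidden = w1 * x
    return dloss_dw1, dloss_dw2


w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
print(backward(w1, w2, x, y))  # prints (15.0, -5.0)

# Sanity check for w1 against a difference quotient
h = 1e-6
numeric = (forward(w1 + h, w2, x, y)[2] - forward(w1, w2, x, y)[2]) / h
print(numeric)
```

The difference quotient on the last lines should land close to the chain-rule value for `w1`, which ties backpropagation to the limit definition from this section.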

Why Derivatives Matter for AI

Modern language models like GPT have billions of parameters. Training involves computing the gradient of the loss with respect to each parameter — that's billions of derivatives per training step, computed thousands of times. Without calculus and efficient derivative computation, modern AI wouldn't exist.

Python Implementation

Computing Derivatives Numerically

Let's implement the limit definition of the derivative and see the convergence in action:

Numerical Derivative Computation (derivative_limit.py)

Average Rate of Change (average_rate_of_change)

This function computes the slope of the secant line connecting two points on the curve: (x, f(x)) and (x + Δx, f(x + Δx)). This is the difference quotient.

EXAMPLE

For f(x) = x² at x = 3 with Δx = 1: (16 - 9) / 1 = 7

Numerical Derivative (numerical_derivative)

We approximate the derivative by using a very small h. Shrinking h brings the quotient closer to the true derivative until floating-point roundoff starts to dominate, which for 64-bit floats happens around h = 1e-8. This is the limit definition in action.

EXAMPLE

With h = 1e-8, we get very close to the exact value

Our Test Function (f and f_prime_exact)

We use f(x) = x² because we know its exact derivative (2x), making it easy to verify our numerical approximations.

Point of Interest (x = 3)

We want to find f'(3). The true answer is 2(3) = 6. We'll see how different values of h give approximations that approach 6.

Decreasing Step Sizes (h_values)

By trying h values from 1 down to 10⁻⁸, we observe the limit process numerically. Over this range, the approximation generally improves as h shrinks.

Error Calculation

We compute the absolute difference between our approximation and the true derivative. Notice how the error shrinks as h decreases over this range.

import numpy as np
import matplotlib.pyplot as plt

def average_rate_of_change(f, x, delta_x):
    """
    Compute the average rate of change of f
    from x to x + delta_x.

    This is the slope of the secant line.
    """
    return (f(x + delta_x) - f(x)) / delta_x

def numerical_derivative(f, x, h=1e-8):
    """
    Approximate the derivative with a forward difference quotient.

    As h approaches 0, the quotient approaches f'(x). In floating
    point, an h much smaller than about 1e-8 lets roundoff dominate.
    """
    return (f(x + h) - f(x)) / h

# Define our function: f(x) = x^2
def f(x):
    return x ** 2

# True derivative: f'(x) = 2x (from power rule)
def f_prime_exact(x):
    return 2 * x

# Let's see how the difference quotient approaches the derivative
x = 3  # Point of interest

print(f"Finding the derivative of f(x) = x^2 at x = {x}")
print(f"True derivative f'({x}) = {f_prime_exact(x)}")
print()

# Try different values of h (delta_x)
h_values = [1, 0.5, 0.1, 0.01, 0.001, 0.0001, 1e-6, 1e-8]

print("h (Δx)        Average Rate    Error")
print("-" * 45)

for h in h_values:
    avg_rate = average_rate_of_change(f, x, h)
    error = abs(avg_rate - f_prime_exact(x))
    print(f"{h:<14.8f} {avg_rate:<15.8f} {error:.2e}")

# Visualize the convergence on log-log axes
plt.figure(figsize=(10, 4))
h_range = np.logspace(-8, 0, 100)
errors = [abs(average_rate_of_change(f, x, h) - f_prime_exact(x))
          for h in h_range]

plt.loglog(h_range, errors, 'b-', linewidth=2)
plt.xlabel('h (step size)')
plt.ylabel('Error')
plt.title('Error in Derivative Approximation vs Step Size')
plt.grid(True, alpha=0.3)
plt.show()

Gradient Descent Implementation

Here's how derivatives power optimization in machine learning:

Gradient Descent Using Derivatives (gradient_descent.py)

Gradient Descent Algorithm (gradient_descent)

This is the core optimization algorithm used in machine learning. It uses derivatives to find the minimum of a function by repeatedly moving in the direction of steepest descent.

The Update Rule

x_new = x_old - α × f'(x). We subtract the gradient (times learning rate) because the gradient points uphill. To minimize, we go downhill, in the opposite direction.

EXAMPLE

If f'(x) = 4, we decrease x. If f'(x) = -4, we increase x.

Convergence Check

When the derivative is nearly zero, we've reached a critical point (minimum, maximum, or saddle). For convex functions, this is the global minimum.

Loss Function Example (f)

In ML, f(x) would be the loss function and x would be the model parameters. We find the parameters that minimize prediction errors.

Derivative for Optimization (f_prime)

The derivative tells us: if x increases, does f increase (positive derivative) or decrease (negative derivative)? This guides our search for the minimum.

import numpy as np

def gradient_descent(f, f_prime, x0, learning_rate=0.1, n_iterations=100, tol=1e-10):
    """
    Find the minimum of f using gradient descent.

    The derivative tells us the direction of steepest ascent,
    so we move in the NEGATIVE gradient direction to minimize.

    Parameters:
    - f: The function to minimize
    - f_prime: The derivative of f
    - x0: Starting point
    - learning_rate: Step size (alpha)
    - n_iterations: Maximum number of steps
    - tol: Stop early once |f'(x)| falls below this value

    Returns: Path of x values during optimization
    """
    x = x0
    path = [x]

    for i in range(n_iterations):
        gradient = f_prime(x)  # Compute derivative at current x

        if abs(gradient) < tol:  # Convergence check: nearly flat, so stop
            break

        x = x - learning_rate * gradient  # Move opposite to gradient
        path.append(x)

    return np.array(path)

# Example: Minimize f(x) = (x - 2)^2 + 1
# This parabola has minimum at x = 2

def f(x):
    return (x - 2) ** 2 + 1

def f_prime(x):
    return 2 * (x - 2)  # Derivative: d/dx[(x-2)^2 + 1] = 2(x-2)

# Run gradient descent starting from x = -3
path = gradient_descent(f, f_prime, x0=-3, learning_rate=0.1)

print("Gradient Descent Path:")
print(f"Start: x = {path[0]:.4f}, f(x) = {f(path[0]):.4f}")
print(f"End:   x = {path[-1]:.4f}, f(x) = {f(path[-1]):.4f}")
print("True minimum: x = 2, f(x) = 1")
# With these defaults the tolerance is not reached within 100 steps,
# so this reports the number of updates taken rather than claiming convergence.
print(f"\nStopped after {len(path) - 1} update steps")

Common Pitfalls

Pitfall 1: Confusing Average and Instantaneous Rates

The difference quotient $\frac{f(x+h) - f(x)}{h}$ is the average rate over an interval. Only when we take the limit as $h \to 0$ do we get the instantaneous rate (the derivative).

Pitfall 2: Division by Zero

We can't simply plug in $h = 0$ because that gives $\frac{0}{0}$ , which is undefined. The limit process carefully avoids this by considering values approaching zero, not equal to zero.
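A quick experiment makes the point concrete. In this sketch, `difference_quotient` is a hypothetical helper and $f(x) = x^2$ at $x = 3$ is an arbitrary example; plugging in $h = 0$ raises an error, while small nonzero values of $h$ give results near the derivative.

```python
def difference_quotient(f, x, h):
    """[f(x + h) - f(x)] / h, with no special handling of h = 0."""
    return (f(x + h) - f(x)) / h


def f(x):
    return x ** 2


# h = 0 produces 0 / 0, which Python rejects
try:
    difference_quotient(f, 3, 0)
except ZeroDivisionError as exc:
    print("h = 0 fails:", exc)

# Nonzero h values approaching 0 are fine
for h in [0.1, 0.01, 0.001]:
    print(h, difference_quotient(f, 3, h))
```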

Pitfall 3: Not All Functions Are Differentiable

The limit defining the derivative must exist. Functions with corners (like $|x|$ at $x = 0$ ), jumps, or vertical tangents are not differentiable at those points. We'll explore this in the section on differentiability.
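The corner of $|x|$ at $x = 0$ shows up clearly in the difference quotients: approaching from the right and from the left gives two different values, so no single limit exists. This is a small sketch with a hypothetical helper name.

```python
def one_sided_quotient(f, a, h):
    """Difference quotient at a; use h > 0 for the right side, h < 0 for the left."""
    return (f(a + h) - f(a)) / h


for h in [0.1, 0.001, 1e-6]:
    right = one_sided_quotient(abs, 0, h)
    left = one_sided_quotient(abs, 0, -h)
    print(f"h = {h}: from the right {right}, from the left {left}")
```

The right-hand quotients are all 1.0 and the left-hand quotients are all -1.0, no matter how small $h$ gets.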

Numerical Precision Limits

When computing derivatives numerically, making $h$ too small causes roundoff errors. For a forward difference quotient, $h \approx 10^{-8}$ typically works well with 64-bit floating point. For complicated functions, such as neural network losses, use automatic differentiation instead.
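The tradeoff is easy to observe. This sketch measures the forward-difference error for $f(x) = x^2$ at $x = 3$ over a wide range of step sizes; the function, the point, and the name `forward_error` are illustrative assumptions.

```python
def forward_error(h, x=3.0):
    """Absolute error of the forward difference for f(x) = x^2 at x."""
    approx = ((x + h) ** 2 - x ** 2) / h
    return abs(approx - 2 * x)


for h in [1e-2, 1e-5, 1e-8, 1e-11, 1e-14]:
    print(f"h = {h:.0e}  error = {forward_error(h):.2e}")
```

The error should fall as $h$ shrinks toward about $10^{-8}$ and then rise again for smaller $h$, because subtracting two nearly equal floating-point numbers loses most of their significant digits.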

Test Your Understanding


What does the derivative f'(a) represent geometrically?

Summary

The derivative is the fundamental concept that distinguishes calculus from algebra. It captures the idea of instantaneous rate of change.

Key Concepts

Concept	Description
Average Rate of Change	Δy/Δx = slope of secant line
Instantaneous Rate of Change	Limit of average rate as interval → 0
Derivative f'(a)	lim_{h→0} [f(a+h) - f(a)] / h
Geometric Meaning	Slope of tangent line at the point
Physical Meaning	Velocity = derivative of position
ML Application	Gradient descent uses derivatives to minimize loss

Key Takeaways

The derivative solves the ancient problem of finding instantaneous rates of change
Secant lines (connecting two points) approach the tangent line (touching one point) as the interval shrinks
The derivative is defined as a limit of difference quotients
Geometrically, $f'(a)$ is the slope of the tangent line at $x = a$
In physics, velocity is the derivative of position; acceleration is the derivative of velocity
Machine learning depends on derivatives for gradient descent optimization
Not all functions have derivatives everywhere — corners, jumps, and vertical tangents cause problems

The Essence of the Derivative:

"The derivative captures how a function changes at a single instant — transforming the impossible question of change without duration into the precise limit of change over shrinking intervals."

Coming Next: In the next section, we'll explore The Derivative as a Function. Instead of computing $f'(a)$ at a single point, we'll define $f'(x)$ as a new function that gives the derivative at every point — and visualize how the derivative function relates to the original.