Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Define the derivative of a vector-valued function using limits
Compute derivatives by differentiating each component separately
Apply differentiation rules: sum, product, chain rule for vector functions
Interpret the derivative geometrically as the tangent vector to a curve
Calculate velocity, speed, and acceleration for motion problems
Find unit tangent vectors and understand their significance
Connect vector derivatives to gradients in machine learning

The Big Picture: Why Differentiate Vectors?

"The derivative tells us how things change — and in the vector world, change means not just 'how fast' but also 'in which direction.'"

In single-variable calculus, the derivative $f'(x)$ tells us the instantaneous rate of change of a function — how fast and in what sense (increasing or decreasing) the output changes as we nudge the input. With vector-valued functions, we face a richer question: when a point moves along a curve in space, how does its position vector change?

The answer is the derivative of a vector function, which gives us a new vector — the tangent vector to the curve. This tangent vector captures:

Direction: Which way is the point moving at this instant?
Speed: How fast is it moving (the magnitude of the tangent)?
Velocity: The complete picture — direction and speed together

The Central Idea

For a vector function $\mathbf{r}(t) = \langle f(t), g(t), h(t) \rangle$ , the derivative is computed by differentiating each component:

\mathbf{r}'(t) = \langle f'(t), g'(t), h'(t) \rangle

This elegant result lets us apply all our single-variable differentiation techniques component by component!

Where Vector Derivatives Appear

Physics

Velocity = derivative of position
Acceleration = derivative of velocity
Jerk, snap, and higher derivatives
Electric and magnetic field variations

Engineering

Robot arm kinematics
Flight path analysis
Structural deformation rates
Control system dynamics

Computer Graphics

Curve tangents for shading
Motion interpolation
Camera path smoothing
Particle system dynamics

Machine Learning

Gradients for optimization
Backpropagation (chain rule)
Neural network training
Optimization trajectories

Historical Origins

The calculus of vector functions developed alongside classical mechanics in the 17th-19th centuries, as mathematicians sought precise ways to describe motion in space.

Newton and Leibniz: The Foundations

Isaac Newton (1643–1727) developed calculus largely to solve physics problems. His "method of fluxions" treated velocity as the rate of change of position, which is essentially our modern concept of the derivative of a position vector, although vector notation came much later. His laws of motion require computing derivatives of vector quantities.

Gottfried Leibniz (1646–1716) developed much of the notation we still use today. The symbol $\frac{d\mathbf{r}}{dt}$ traces back to his systematic approach to infinitesimal calculus; the prime notation $\mathbf{r}'(t)$ is due to Joseph-Louis Lagrange.

The 19th Century Formalization

The formal treatment of vector derivatives emerged with the work of William Rowan Hamilton, Josiah Willard Gibbs, and Oliver Heaviside in the 1800s. They established that:

Vector functions can be differentiated component-wise
The derivative rules (product, chain) extend naturally to vectors
The geometric meaning is the tangent vector to the curve

From Physics to Machine Learning

The same mathematical framework Newton used to describe planetary motion now powers machine learning. When we compute gradients in neural networks, we're using vector calculus — differentiating a scalar loss function with respect to a vector of weights produces a gradient vector, just as differentiating position with respect to time produces a velocity vector.

The Definition: Derivative of a Vector Function

Definition: Derivative of a Vector Function

Let $\mathbf{r}(t)$ be a vector-valued function. The derivative of $\mathbf{r}$ at $t$ is:

\mathbf{r}'(t) = \lim_{\Delta t \to 0} \frac{\mathbf{r}(t + \Delta t) - \mathbf{r}(t)}{\Delta t}

provided this limit exists. The derivative is also denoted $\frac{d\mathbf{r}}{dt}$ .

This definition is identical in form to the scalar derivative — we're taking the limit of a difference quotient. The key insight is that subtracting vectors and dividing by a scalar still yields a vector.

Component-Wise Differentiation

The remarkable practical consequence is that we can differentiate component by component:

Theorem: Component-Wise Differentiation

If $\mathbf{r}(t) = \langle f(t), g(t), h(t) \rangle$ , then:

\mathbf{r}'(t) = \langle f'(t), g'(t), h'(t) \rangle

provided $f'(t)$ , $g'(t)$ , and $h'(t)$ all exist.

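This follows directly from limit laws: the limit of a vector-valued function is taken component by component, so the difference quotient for $\mathbf{r}$ converges exactly when the difference quotients for $f$ , $g$ , and $h$ each converge.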

Example: Circular Motion

Consider a particle moving on a unit circle: $\mathbf{r}(t) = \langle \cos(t), \sin(t) \rangle$

Derivative: $\mathbf{r}'(t) = \langle -\sin(t), \cos(t) \rangle$

At $t = 0$ : position is $(1, 0)$ and velocity is $(0, 1)$ — pointing straight up!

The velocity vector is tangent to the circle and perpendicular to the position vector.

Why Perpendicular?

For any curve on a sphere centered at the origin (including a circle centered at the origin), the velocity is perpendicular to the position vector. This is because $|\mathbf{r}(t)|^2 = \mathbf{r} \cdot \mathbf{r} = \text{constant}$ , so differentiating both sides (by the dot product rule covered later in this section) gives $2\mathbf{r} \cdot \mathbf{r}' = 0$ , which means $\mathbf{r} \perp \mathbf{r}'$ .
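As a quick numerical check of the perpendicularity argument, the sketch below samples the unit circle and evaluates $\mathbf{r} \cdot \mathbf{r}'$ at each sample. The function names `r` and `r_prime` and the number of samples are illustrative choices, not part of the original text.

```python
import numpy as np

def r(t):
    """Position on the unit circle."""
    return np.array([np.cos(t), np.sin(t)])

def r_prime(t):
    """Component-wise derivative of r(t)."""
    return np.array([-np.sin(t), np.cos(t)])

# Sample the circle and record r(t) . r'(t) at each sample.
t_samples = np.linspace(0.0, 2.0 * np.pi, 50)
dots = np.array([np.dot(r(t), r_prime(t)) for t in t_samples])

# Every dot product should vanish up to floating-point rounding.
max_abs_dot = np.max(np.abs(dots))
print(f"largest |r . r'| over the samples: {max_abs_dot:.2e}")
```

The same check would fail for a circle that is not centered at the origin, because then $|\mathbf{r}(t)|$ is no longer constant.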

Visualizing the Limit: Secant to Tangent

Just as the derivative in single-variable calculus arises from the limit of secant lines approaching a tangent line, the vector derivative arises from secant vectors approaching the tangent vector.

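The secant vector from $\mathbf{r}(t_0)$ to $\mathbf{r}(t_0 + \Delta t)$ is the difference $\mathbf{r}(t_0 + \Delta t) - \mathbf{r}(t_0)$ . Dividing it by $\Delta t$ gives the difference quotient:

\frac{\mathbf{r}(t_0 + \Delta t) - \mathbf{r}(t_0)}{\Delta t}

As $\Delta t \to 0$ , the secant vector itself shrinks to zero, but the difference quotient rotates and rescales to approach the tangent vector $\mathbf{r}'(t_0)$ .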

Use the interactive visualization below to watch the limit process in action:

[Interactive visualization: "The Limit Definition: Secant → Tangent". Controls: base point t₀ and step size Δt. Readouts: secant approximation, true r'(t₀), and error magnitude. Decreasing Δt moves the secant approximation toward the tangent vector.]

Understanding the Limit

The secant vector connects two points on the curve and approximates the direction of motion. As Δt → 0, the secant vector divided by Δt approaches the true tangent vector, which is the instantaneous rate of change of the position vector. This is exactly how we defined the scalar derivative, now applied to vectors.
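The limit process can also be watched numerically. The sketch below is a minimal illustration that assumes the unit circle $\mathbf{r}(t) = \langle \cos(t), \sin(t) \rangle$ and the base point $t_0 = 0.3$; both are choices made here for illustration, and the curve used by the interactive visualization may differ.

```python
import numpy as np

def r(t):
    """Assumed example curve: the unit circle."""
    return np.array([np.cos(t), np.sin(t)])

def r_prime(t):
    """Exact derivative, used as the reference value."""
    return np.array([-np.sin(t), np.cos(t)])

t0 = 0.3
exact = r_prime(t0)

errors = []
for dt in [0.3, 0.1, 0.01, 0.001]:
    quotient = (r(t0 + dt) - r(t0)) / dt      # difference quotient
    err = np.linalg.norm(quotient - exact)    # distance to the true tangent
    errors.append(err)
    print(f"dt = {dt:<6} quotient = {quotient}  error = {err:.5f}")
```

For this one-sided quotient the error shrinks roughly in proportion to Δt.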

Differentiation Rules for Vector Functions

All the familiar differentiation rules extend to vector functions. Here are the key rules:

Basic Rules

Rule	Formula	Notes
Sum Rule	d/dt[u + v] = u' + v'	Add component derivatives
Constant Multiple	d/dt[c · u] = c · u'	c is a scalar constant
Scalar Function Product	d/dt[f(t)u] = f'(t)u + f(t)u'	Product rule with scalar

Product Rules

There are three important product rules for vectors:

Dot Product Rule

\frac{d}{dt}[\mathbf{u} \cdot \mathbf{v}] = \mathbf{u}' \cdot \mathbf{v} + \mathbf{u} \cdot \mathbf{v}'

Note: The result is a scalar (derivative of a scalar is a scalar).

Cross Product Rule

\frac{d}{dt}[\mathbf{u} \times \mathbf{v}] = \mathbf{u}' \times \mathbf{v} + \mathbf{u} \times \mathbf{v}'

Note: Order matters! The cross product is not commutative.

Scalar-Vector Product Rule

\frac{d}{dt}[f(t)\mathbf{u}(t)] = f'(t)\mathbf{u}(t) + f(t)\mathbf{u}'(t)

This is the standard product rule with a scalar function.
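The dot and cross product rules can be checked against a finite-difference estimate. In the sketch below the vector functions `u` and `v`, the helper `central_diff`, and the evaluation point are all hypothetical choices made for illustration.

```python
import numpy as np

def u(t):
    return np.array([t, t**2, 1.0])

def u_prime(t):
    return np.array([1.0, 2.0 * t, 0.0])

def v(t):
    return np.array([np.sin(t), np.cos(t), t])

def v_prime(t):
    return np.array([np.cos(t), -np.sin(t), 1.0])

def central_diff(func, t, h=1e-6):
    """Central-difference estimate of d/dt func(t)."""
    return (func(t + h) - func(t - h)) / (2.0 * h)

t0 = 0.7

# Dot product rule: (u . v)' = u' . v + u . v'
dot_rule = np.dot(u_prime(t0), v(t0)) + np.dot(u(t0), v_prime(t0))
dot_numeric = central_diff(lambda t: np.dot(u(t), v(t)), t0)

# Cross product rule: (u x v)' = u' x v + u x v'  (order preserved)
cross_rule = np.cross(u_prime(t0), v(t0)) + np.cross(u(t0), v_prime(t0))
cross_numeric = central_diff(lambda t: np.cross(u(t), v(t)), t0)

print("dot rule:  ", dot_rule, "vs numeric", dot_numeric)
print("cross rule:", cross_rule, "vs numeric", cross_numeric)
```

Swapping the operands in either `np.cross` call would flip the sign of that term, which is why the order in the cross product rule must be kept.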

Chain Rule

If $\mathbf{r}(t)$ is a vector function and $t = g(s)$ is a scalar function, then:

\frac{d\mathbf{r}}{ds} = \frac{d\mathbf{r}}{dt} \, \frac{dt}{ds} = \mathbf{r}'(g(s)) \, g'(s)

The Chain Rule is Fundamental to ML

The chain rule for vectors is exactly what powers backpropagation in neural networks. When computing gradients, we chain together derivatives through multiple layers — each application of the chain rule propagates the gradient backward through the network.
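Returning to the chain rule formula for $t = g(s)$ , the sketch below compares it with a finite-difference derivative of the composed function. The helix `r` and the reparameterization `g(s) = s**2` are assumptions chosen for illustration. Note that the product of $\mathbf{r}'$ and $g'(s)$ is a vector times a scalar, not a dot product.

```python
import numpy as np

def r(t):
    return np.array([np.cos(t), np.sin(t), t])

def r_prime(t):
    return np.array([-np.sin(t), np.cos(t), 1.0])

def g(s):
    """Hypothetical reparameterization t = g(s)."""
    return s**2

def g_prime(s):
    return 2.0 * s

s0 = 0.8

# Chain rule: d/ds r(g(s)) = r'(g(s)) * g'(s)
chain_rule = r_prime(g(s0)) * g_prime(s0)

# Central-difference derivative of the composition s -> r(g(s)).
h = 1e-6
numeric = (r(g(s0 + h)) - r(g(s0 - h))) / (2.0 * h)

print("chain rule:", chain_rule)
print("numeric:   ", numeric)
```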

Geometric Interpretation: The Tangent Vector

The derivative $\mathbf{r}'(t)$ has a beautiful geometric meaning: it is the tangent vector to the curve at the point $\mathbf{r}(t)$ .

Geometric Meaning of r'(t)

Direction: $\mathbf{r}'(t)$ points in the direction of motion along the curve at time $t$
Magnitude: $|\mathbf{r}'(t)|$ equals the speed — how fast the point moves along the curve
Tangent Line: The line through $\mathbf{r}(t_0)$ with direction $\mathbf{r}'(t_0)$ is the tangent line to the curve

The Tangent Line

The parametric equation of the tangent line at $t = t_0$ is:

\mathbf{L}(s) = \mathbf{r}(t_0) + s \cdot \mathbf{r}'(t_0)

Here, $s$ is a parameter ranging over all real numbers. When $s = 0$ , we're at the point of tangency.
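A short sketch of the tangent line formula follows, assuming the helix $\mathbf{r}(t) = \langle \cos(t), \sin(t), t \rangle$ and $t_0 = 0$ as an example. The function name `tangent_line` is a hypothetical helper.

```python
import numpy as np

def r(t):
    return np.array([np.cos(t), np.sin(t), t])

def r_prime(t):
    return np.array([-np.sin(t), np.cos(t), 1.0])

def tangent_line(t0, s):
    """Point on the tangent line: L(s) = r(t0) + s * r'(t0)."""
    return r(t0) + s * r_prime(t0)

t0 = 0.0
point_of_tangency = tangent_line(t0, 0.0)   # equals r(t0)

# Near t0 the tangent line tracks the curve; the gap shrinks like s**2.
gap_large = np.linalg.norm(tangent_line(t0, 0.1) - r(t0 + 0.1))
gap_small = np.linalg.norm(tangent_line(t0, 0.01) - r(t0 + 0.01))

print("L(0) =", point_of_tangency)
print(f"gap at s = 0.1:  {gap_large:.2e}")
print(f"gap at s = 0.01: {gap_small:.2e}")
```

Shrinking `s` by a factor of 10 shrinks the gap by roughly a factor of 100, which is what it means for the line to be the best linear approximation to the curve at that point.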

Interactive Exploration

Explore how the tangent vector changes as you move along different curves. Notice how the tangent always points in the direction of motion.

[Interactive visualization: "Vector Function Derivative Visualizer". Controls: curve type, parameter t, vector scale, and toggles for the position vector r(t), tangent vector r'(t), and unit tangent T(t). Sample readout for r(t) = ⟨cos(t), sin(t)⟩ at t = 3.142: r(t) = (-1.000, 0.000), r'(t) = (0.000, -1.000), speed |r'(t)| = 1.000, T(t) = (0.000, -1.000).]

Key Insight

The tangent vector r'(t) always points in the direction of motion along the curve. Its magnitude represents the speed — how fast the point moves. The unit tangent T(t) has magnitude 1, capturing only the direction without speed information.

Velocity and Speed: The Physical Interpretation

When $\mathbf{r}(t)$ represents the position of a particle at time $t$ , the derivative has a direct physical meaning:

Quantity	Definition	Type
Velocity	v(t) = r'(t)	Vector (direction + magnitude)
Speed	|v(t)| = |r'(t)|	Scalar (magnitude only)
Acceleration	a(t) = v'(t) = r''(t)	Vector

Key Distinction: Velocity vs. Speed

Velocity v(t)

A vector
Has direction and magnitude
Components can be negative; the direction can reverse
$\mathbf{v}(t) = \mathbf{r}'(t)$

Speed |v(t)|

A scalar
Magnitude only (no direction)
Always non-negative
$|\mathbf{v}(t)| = |\mathbf{r}'(t)|$

This distinction is crucial: velocity tells you how fast and in what direction something is moving, while speed tells you only how fast.

Explore this distinction interactively below. Watch how the velocity vector changes direction around the ellipse while the speed (its magnitude) varies:

[Interactive visualization: "Velocity (Vector) vs. Speed (Scalar)". Control: time t in radians. Panel: speed over time as the object moves along the ellipse. Readout at t = 0: position r(t) = (3.000, 0.000), velocity v(t) = (0.000, 1.000), speed |v(t)| = 1.000.]

Physical Interpretation

Watch how the velocity vector changes as the particle moves along the ellipse. In this parameterization, the speed is lowest at the ends of the major axis (left and right), where the path bends most sharply, and highest at the top and bottom. The velocity vector is always tangent to the path, pointing in the direction of instantaneous motion.
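The speed variation can be reproduced in a few lines. This sketch assumes the ellipse is $\mathbf{r}(t) = \langle 3\cos(t), \sin(t) \rangle$ , which is consistent with the position (3, 0) and velocity (0, 1) shown at $t = 0$ but is not stated explicitly in the text.

```python
import numpy as np

def velocity(t):
    """Derivative of the assumed ellipse r(t) = <3 cos t, sin t>."""
    return np.array([-3.0 * np.sin(t), np.cos(t)])

def speed(t):
    """Speed is the magnitude of the velocity vector."""
    return np.linalg.norm(velocity(t))

speed_right = speed(0.0)          # at (3, 0), an end of the major axis
speed_top = speed(np.pi / 2.0)    # at (0, 1), an end of the minor axis

print(f"speed at the right end: {speed_right:.3f}")
print(f"speed at the top:       {speed_top:.3f}")
```

Under this assumption the speed ranges between 1 and 3 while the velocity direction turns through a full revolution.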

Unit Tangent Vector

Sometimes we want just the direction of motion, without the speed information. This is captured by the unit tangent vector:

Definition: Unit Tangent Vector

The unit tangent vector at $t$ is:

\mathbf{T}(t) = \frac{\mathbf{r}'(t)}{|\mathbf{r}'(t)|}

provided $\mathbf{r}'(t) \neq \mathbf{0}$ . By construction, $|\mathbf{T}(t)| = 1$ .

The unit tangent vector $\mathbf{T}(t)$ points in the direction of motion with a standardized length of 1. This is useful for:

Describing the direction of a curve independently of parameterization
Computing curvature (how fast the direction changes)
Building the Frenet-Serret frame (T, N, B) for curve analysis
Normalizing directions in computer graphics

Computing T(t)

For $\mathbf{r}(t) = \langle t^2, t^3 \rangle$ :

1. Find $\mathbf{r}'(t) = \langle 2t, 3t^2 \rangle$

2. Find magnitude: $|\mathbf{r}'(t)| = \sqrt{4t^2 + 9t^4} = |t|\sqrt{4 + 9t^2}$

3. Divide (valid for $t \neq 0$ , since $\mathbf{r}'(0) = \mathbf{0}$ ): $\mathbf{T}(t) = \frac{1}{|t|\sqrt{4 + 9t^2}} \langle 2t, 3t^2 \rangle$
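The three steps above can be sketched in code. The helper `unit_tangent` is a hypothetical function written for this example; returning `None` where the derivative vanishes is one possible design choice, not the only one.

```python
import numpy as np

def r_prime(t):
    """Derivative of r(t) = <t^2, t^3>."""
    return np.array([2.0 * t, 3.0 * t**2])

def unit_tangent(t):
    """Return T(t), or None where r'(t) = 0 and T is undefined."""
    v = r_prime(t)
    speed = np.linalg.norm(v)
    if speed == 0.0:
        return None
    return v / speed

T_at_1 = unit_tangent(1.0)    # <2, 3> / sqrt(13)
T_at_0 = unit_tangent(0.0)    # None: r'(0) is the zero vector

print("T(1) =", T_at_1, " length =", np.linalg.norm(T_at_1))
print("T(0) =", T_at_0)
```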

Higher-Order Derivatives

Just as with scalar functions, we can take multiple derivatives of vector functions:

Derivative	Physical Meaning	Interpretation
r(t)	Position	Where the particle is
r'(t) = v(t)	Velocity	How position changes
r''(t) = a(t)	Acceleration	How velocity changes
r'''(t) = j(t)	Jerk	How acceleration changes

Each level of derivative tells us about the rate of change of the previous quantity.

Example (Helix): For $\mathbf{r}(t) = \langle \cos(t), \sin(t), t \rangle$

• Velocity: $\mathbf{r}'(t) = \langle -\sin(t), \cos(t), 1 \rangle$

• Acceleration: $\mathbf{r}''(t) = \langle -\cos(t), -\sin(t), 0 \rangle$

• Jerk: $\mathbf{r}'''(t) = \langle \sin(t), -\cos(t), 0 \rangle$

Acceleration Points Inward

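For the helix, the acceleration vector $\mathbf{r}''(t)$ has no vertical component and points from the particle straight toward the z-axis, the axis of the helix. The same holds for any uniform circular motion, where the acceleration points toward the center of the circle. This is the centripetal acceleration that keeps the particle curving instead of flying off in a straight line.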
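The inward-pointing acceleration can be confirmed numerically for the helix $\mathbf{r}(t) = \langle \cos(t), \sin(t), t \rangle$ . The evaluation point `t0` and the step `h` below are arbitrary choices made for this sketch.

```python
import numpy as np

def r(t):
    return np.array([np.cos(t), np.sin(t), t])

def r_double_prime(t):
    return np.array([-np.cos(t), -np.sin(t), 0.0])

t0 = 1.2
position = r(t0)
acceleration = r_double_prime(t0)

# Vector from the particle to the nearest point on the z-axis.
to_axis = np.array([-position[0], -position[1], 0.0])

# Second central difference as an independent estimate of r''(t0).
h = 1e-4
accel_numeric = (r(t0 + h) - 2.0 * r(t0) + r(t0 - h)) / h**2

print("r''(t0):        ", acceleration)
print("toward the axis:", to_axis)
print("numeric r''(t0):", accel_numeric)
```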

Applications in Science and Engineering

1. Projectile Motion

A projectile launched with initial velocity $\mathbf{v}_0$ from position $\mathbf{r}_0$ under gravity follows:

\mathbf{r}(t) = \mathbf{r}_0 + \mathbf{v}_0 t + \frac{1}{2}\mathbf{g}t^2

Taking derivatives:

Velocity: $\mathbf{v}(t) = \mathbf{v}_0 + \mathbf{g}t$
Acceleration: $\mathbf{a}(t) = \mathbf{g}$ (constant)
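As an illustration of these formulas, the sketch below works in two dimensions and finds the apex of the trajectory as the instant when the vertical velocity component is zero. The launch point, launch velocity, and the value 9.81 m/s² for gravity are assumed example values.

```python
import numpy as np

g_vec = np.array([0.0, -9.81])   # gravity in m/s^2 (assumed value)
r0 = np.array([0.0, 0.0])        # launch point (assumed)
v0 = np.array([10.0, 10.0])      # launch velocity in m/s (assumed)

def position(t):
    """r(t) = r0 + v0 t + (1/2) g t^2"""
    return r0 + v0 * t + 0.5 * g_vec * t**2

def velocity(t):
    """v(t) = r'(t) = v0 + g t"""
    return v0 + g_vec * t

# The apex is where the vertical velocity component vanishes.
t_apex = -v0[1] / g_vec[1]
apex = position(t_apex)

print(f"apex reached at t = {t_apex:.3f} s, position = {apex}")
print("velocity at the apex:", velocity(t_apex))
```

The horizontal velocity component never changes because gravity has no horizontal component.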

2. Circular Motion

For uniform circular motion with radius $R$ and angular velocity $\omega$ :

Position: $\mathbf{r}(t) = R\langle \cos(\omega t), \sin(\omega t) \rangle$

Velocity: $\mathbf{v}(t) = R\omega\langle -\sin(\omega t), \cos(\omega t) \rangle$

Speed: $|\mathbf{v}| = R\omega$ (constant)

Acceleration: $\mathbf{a}(t) = -R\omega^2\langle \cos(\omega t), \sin(\omega t) \rangle = -\omega^2 \mathbf{r}$

The acceleration points toward the center (centripetal) with magnitude $R\omega^2$ .
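These circular-motion formulas can be checked directly. The radius `R = 2.0`, angular velocity `omega = 3.0`, and evaluation time `t0` below are illustrative values only.

```python
import numpy as np

R, omega = 2.0, 3.0   # illustrative radius and angular velocity

def r(t):
    return R * np.array([np.cos(omega * t), np.sin(omega * t)])

def v(t):
    return R * omega * np.array([-np.sin(omega * t), np.cos(omega * t)])

def a(t):
    return -R * omega**2 * np.array([np.cos(omega * t), np.sin(omega * t)])

t0 = 0.4
speed = np.linalg.norm(v(t0))        # should equal R * omega
accel_mag = np.linalg.norm(a(t0))    # should equal R * omega**2

print(f"speed = {speed:.4f}, |a| = {accel_mag:.4f}")
print("a(t0):          ", a(t0))
print("-omega^2 r(t0): ", -omega**2 * r(t0))
```

Because the speed is constant, the acceleration is perpendicular to the velocity; all of it goes into changing direction.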

3. Robotics: End Effector Velocity

In robotics, the Jacobian relates joint velocities to end effector velocity. If joint angles are $\mathbf{q}(t)$ and the end effector position is $\mathbf{r}(\mathbf{q})$ , then:

\mathbf{v}_{end} = \frac{d\mathbf{r}}{dt} = \frac{\partial \mathbf{r}}{\partial \mathbf{q}} \cdot \frac{d\mathbf{q}}{dt} = \mathbf{J}(\mathbf{q}) \cdot \dot{\mathbf{q}}

This is the chain rule for vectors, relating joint velocities $\dot{\mathbf{q}}$ to workspace velocity $\mathbf{v}_{end}$ .
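To make the Jacobian relation concrete, the sketch below uses a planar two-link arm, a standard textbook model that is assumed here and not taken from the original text. The link lengths, joint angles, and joint velocities are hypothetical values.

```python
import numpy as np

L1, L2 = 1.0, 0.5   # hypothetical link lengths

def end_effector(q):
    """Position of the tip of a planar two-link arm with joint angles q."""
    q1, q2 = q
    return np.array([
        L1 * np.cos(q1) + L2 * np.cos(q1 + q2),
        L1 * np.sin(q1) + L2 * np.sin(q1 + q2),
    ])

def jacobian(q):
    """Matrix of partial derivatives of the tip position w.r.t. q."""
    q1, q2 = q
    return np.array([
        [-L1 * np.sin(q1) - L2 * np.sin(q1 + q2), -L2 * np.sin(q1 + q2)],
        [ L1 * np.cos(q1) + L2 * np.cos(q1 + q2),  L2 * np.cos(q1 + q2)],
    ])

q = np.array([0.3, 0.9])         # joint angles in radians
q_dot = np.array([0.5, -0.2])    # joint velocities in rad/s

v_end = jacobian(q) @ q_dot      # chain rule: v = J(q) q_dot

# Finite-difference check: move the joints briefly along q_dot.
h = 1e-6
v_numeric = (end_effector(q + h * q_dot) - end_effector(q - h * q_dot)) / (2 * h)

print("J(q) q_dot:", v_end)
print("numeric:   ", v_numeric)
```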

Machine Learning Applications

Vector derivatives are the heart of machine learning optimization. Every time you train a neural network, you're computing vector derivatives.

The Gradient: Derivative of a Scalar Function

Given a loss function $L(\mathbf{w})$ that depends on a weight vector $\mathbf{w} = (w_1, w_2, ..., w_n)$ , the gradient is:

\nabla L = \frac{\partial L}{\partial \mathbf{w}} = \left\langle \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, ..., \frac{\partial L}{\partial w_n} \right\rangle

This gradient vector points in the direction of steepest increase of $L$ . To minimize the loss, we move in the opposite direction:

\mathbf{w}_{new} = \mathbf{w}_{old} - \eta \nabla L

where $\eta$ is the learning rate.
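A minimal sketch of the gradient and a single update step follows. The quadratic `loss`, the starting weights, and the learning rate are hypothetical and chosen only so the numbers are easy to follow.

```python
import numpy as np

def loss(w):
    """Hypothetical quadratic loss used only for illustration."""
    return w[0]**2 + 3.0 * w[1]**2 + w[0] * w[1]

def grad(w):
    """Analytic gradient: one partial derivative per weight."""
    return np.array([2.0 * w[0] + w[1], 6.0 * w[1] + w[0]])

def numeric_grad(f, w, h=1e-6):
    """Central-difference estimate of each partial derivative."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (f(w + step) - f(w - step)) / (2.0 * h)
    return g

w_old = np.array([1.0, -2.0])
eta = 0.1
w_new = w_old - eta * grad(w_old)   # one gradient descent step

print("analytic gradient:", grad(w_old))
print("numeric gradient: ", numeric_grad(loss, w_old))
print(f"loss before: {loss(w_old):.4f}, loss after: {loss(w_new):.4f}")
```

Comparing an analytic gradient against finite differences like this is a common sanity check before trusting a hand-derived formula.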

Backpropagation: Chain Rule in Action

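Neural networks are compositions of functions. Writing $\mathbf{z}_k$ for the output of layer $k$ , with $\mathbf{z}_1$ depending on the first-layer weights $\mathbf{w}_1$ and the loss $L$ depending on the final output $\mathbf{z}_n$ , the chain rule gives us:

\frac{\partial L}{\partial \mathbf{w}_1} = \frac{\partial L}{\partial \mathbf{z}_n} \cdot \frac{\partial \mathbf{z}_n}{\partial \mathbf{z}_{n-1}} \cdot ... \cdot \frac{\partial \mathbf{z}_2}{\partial \mathbf{z}_1} \cdot \frac{\partial \mathbf{z}_1}{\partial \mathbf{w}_1}

This is the vector chain rule applied repeatedly. The first factor is the gradient of the loss with respect to the final output, and each remaining factor is a Jacobian matrix. Backpropagation computes this product efficiently by multiplying from the loss end backward, so every intermediate result is a vector rather than a matrix.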
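The repeated chain rule can be traced by hand on a very small network. The sketch below assumes a two-layer model with a tanh hidden layer and a squared-error loss; the architecture, the weights, and the data are all hypothetical and serve only to show the backward multiplication of local derivatives.

```python
import numpy as np

x = np.array([0.5, -1.0])                    # input (illustrative)
y = 0.3                                      # target (illustrative)
W1 = np.array([[0.2, -0.4], [0.7, 0.1]])     # first-layer weights
w2 = np.array([0.6, -0.3])                   # second-layer weights

def forward(W1):
    """Return the loss and the intermediate values needed for backprop."""
    z1 = np.tanh(W1 @ x)     # hidden layer output
    z2 = w2 @ z1             # scalar network output
    return 0.5 * (z2 - y) ** 2, z1, z2

loss, z1, z2 = forward(W1)

# Backward pass: multiply local derivatives from the loss back to W1.
dL_dz2 = z2 - y                    # derivative of the squared error
dL_dz1 = dL_dz2 * w2               # through z2 = w2 . z1
dL_da = dL_dz1 * (1.0 - z1**2)     # through tanh, where a = W1 x
grad_W1 = np.outer(dL_da, x)       # through a = W1 x

# Finite-difference check of one entry of the gradient.
h = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 1] += h
W1_minus[0, 1] -= h
numeric_entry = (forward(W1_plus)[0] - forward(W1_minus)[0]) / (2 * h)

print("backprop entry:", grad_W1[0, 1])
print("numeric entry: ", numeric_entry)
```

Each backward line reuses the result of the previous one, so no full Jacobian matrix is ever formed.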

The Deep Connection

When you trace the optimization path of gradient descent in weight space, you get a curve — just like the space curves we've been studying! The "velocity" along this path is $-\eta \nabla L$ , and optimization is the process of following this curve downhill toward a minimum.

Python Implementation

Vector Derivatives in NumPy

Here's how to work with vector function derivatives in Python:

Vector Function Derivatives (vector_derivatives.py)

Position Vector Function

We define a helix: r(t) = ⟨cos(t), sin(t), t/2⟩. The x and y components trace a circle while z increases linearly, creating a spiral staircase shape.

Velocity (First Derivative)

The derivative r'(t) = ⟨-sin(t), cos(t), 0.5⟩ is computed by differentiating each component separately. This is the velocity vector.

Acceleration (Second Derivative)

The second derivative r''(t) = ⟨-cos(t), -sin(t), 0⟩ gives acceleration. Notice the z-component is 0 because velocity in z is constant.

Speed Calculation

Speed is the magnitude of velocity: |r'(t)| = √(sin²t + cos²t + 0.25) = √1.25. The speed is constant for this helix!

Unit Tangent Vector

T(t) = r'(t)/|r'(t)| gives the unit tangent: the direction of motion with magnitude 1. Essential for studying curve geometry.

Numerical Differentiation

Central differences approximate the derivative: (r(t+h) - r(t-h))/(2h). Finite differences like this are a common way to check hand-derived or automatically computed gradients. Automatic differentiation itself does not use finite differences; it applies the chain rule to elementary operations.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib versions

# ============================================
# DERIVATIVES OF VECTOR FUNCTIONS
# ============================================

def r(t):
    """Position vector: r(t) = ⟨cos(t), sin(t), t/2⟩ (helix)"""
    return np.array([np.cos(t), np.sin(t), t/2])

def r_prime(t):
    """Velocity vector (derivative): r'(t) = ⟨-sin(t), cos(t), 0.5⟩"""
    return np.array([-np.sin(t), np.cos(t), 0.5])

def r_double_prime(t):
    """Acceleration vector (second derivative): r''(t) = ⟨-cos(t), -sin(t), 0⟩"""
    return np.array([-np.cos(t), -np.sin(t), 0])

# Evaluate at a specific time
t0 = np.pi / 4
position = r(t0)
velocity = r_prime(t0)
acceleration = r_double_prime(t0)

print("At t = π/4:")
print(f"  Position r(t):     {position}")
print(f"  Velocity r'(t):    {velocity}")
print(f"  Acceleration r''(t): {acceleration}")

# Speed (magnitude of velocity)
speed = np.linalg.norm(velocity)
print(f"  Speed |r'(t)|:     {speed:.4f}")

# Unit tangent vector
T = velocity / speed
print(f"  Unit tangent T(t): {T}")

# ============================================
# NUMERICAL DIFFERENTIATION
# ============================================

def numerical_derivative(r_func, t, h=1e-5):
    """Approximate derivative using central differences."""
    return (r_func(t + h) - r_func(t - h)) / (2 * h)

# Compare numerical vs. analytical
numerical_vel = numerical_derivative(r, t0)
analytical_vel = r_prime(t0)

print("\n--- Numerical vs Analytical ---")
print(f"Numerical r'(t):  {numerical_vel}")
print(f"Analytical r'(t): {analytical_vel}")
print(f"Error: {np.linalg.norm(numerical_vel - analytical_vel):.2e}")

# ============================================
# VISUALIZATION
# ============================================

fig = plt.figure(figsize=(14, 5))

# 3D helix with tangent and acceleration vectors
ax1 = fig.add_subplot(131, projection='3d')

# Draw the helix
t_vals = np.linspace(0, 4*np.pi, 200)
x = np.cos(t_vals)
y = np.sin(t_vals)
z = t_vals / 2

ax1.plot(x, y, z, 'b-', linewidth=2, label='r(t) helix')

# Draw vectors at t0
pos = r(t0)
vel = r_prime(t0) * 0.5  # Scale for visibility
acc = r_double_prime(t0) * 0.5

ax1.quiver(*pos, *vel, color='orange', linewidth=2,
           label="r'(t) velocity", arrow_length_ratio=0.2)
ax1.quiver(*pos, *acc, color='red', linewidth=2,
           label="r''(t) acceleration", arrow_length_ratio=0.2)
ax1.scatter(*pos, color='green', s=100, zorder=5)

ax1.set_xlabel('X')
ax1.set_ylabel('Y')
ax1.set_zlabel('Z')
ax1.legend(loc='upper left')
ax1.set_title('Helix with Velocity and Acceleration')

# Speed over time
ax2 = fig.add_subplot(132)
speeds = [np.linalg.norm(r_prime(t)) for t in t_vals]
ax2.plot(t_vals, speeds, 'g-', linewidth=2)
ax2.axhline(y=np.sqrt(1.25), color='r', linestyle='--',
            label='Constant speed = √1.25')
ax2.set_xlabel('Time t')
ax2.set_ylabel('Speed |r\'(t)|')
ax2.set_title('Speed vs Time')
ax2.legend()

# Tangent vector components
ax3 = fig.add_subplot(133)
T_x = [-np.sin(t)/np.sqrt(1.25) for t in t_vals]
T_y = [np.cos(t)/np.sqrt(1.25) for t in t_vals]
T_z = [0.5/np.sqrt(1.25) for t in t_vals]

ax3.plot(t_vals, T_x, 'r-', label='T_x')
ax3.plot(t_vals, T_y, 'g-', label='T_y')
ax3.plot(t_vals, T_z, 'b-', label='T_z')
ax3.set_xlabel('Time t')
ax3.set_ylabel('Component')
ax3.set_title('Unit Tangent Vector Components')
ax3.legend()

plt.tight_layout()
plt.show()

Gradients in Machine Learning

Here's how vector derivatives appear in ML optimization:

Gradients and Optimization (gradient_descent.py)

Loss Function

The loss L(w) measures prediction error. It's a scalar function of the weight vector w. We want to find w that minimizes L.

The Gradient Vector

∂L/∂w = ⟨∂L/∂w₁, ∂L/∂w₂, ∂L/∂w₃⟩ is a vector pointing in the direction of steepest increase. Each component is a partial derivative.

Gradient Descent

We move in the NEGATIVE gradient direction to decrease loss. This is exactly like following the downhill slope of a surface.

Example: w_new = w_old - learning_rate × ∂L/∂w

The Update Step

w = w - η∇L is the core update. The gradient tells us which direction increases L; we go opposite to decrease it.

Optimization as a Curve

The sequence of weight vectors forms a path through weight space: a discrete version of a vector-valued function w(t)!

import numpy as np

# ============================================
# VECTOR DERIVATIVES IN MACHINE LEARNING
# ============================================

# In ML, we differentiate loss functions with respect to weight vectors
# The gradient is a vector of partial derivatives

def loss_function(w, X, y):
    """Mean squared error loss: L = (1/n) * ||Xw - y||²"""
    predictions = X @ w
    residuals = predictions - y
    return np.mean(residuals ** 2)

def gradient(w, X, y):
    """Gradient: ∂L/∂w = (2/n) * X.T @ (Xw - y)"""
    n = len(y)
    predictions = X @ w
    residuals = predictions - y
    return (2/n) * X.T @ residuals

# Example data: 3 features, 5 samples
np.random.seed(42)
X = np.random.randn(5, 3)  # 5 samples, 3 features
y = np.random.randn(5)      # 5 target values
w = np.array([1.0, -0.5, 0.2])  # Initial weights

print("--- Gradient Computation ---")
print(f"Weight vector w: {w}")
print(f"Loss L(w): {loss_function(w, X, y):.4f}")
print(f"Gradient ∂L/∂w: {gradient(w, X, y)}")

# ============================================
# GRADIENT DESCENT: Following the Negative Gradient
# ============================================

def gradient_descent(X, y, learning_rate=0.1, iterations=100):
    """Minimize loss by moving opposite to the gradient."""
    w = np.zeros(X.shape[1])  # Initialize at origin
    history = []

    for i in range(iterations):
        loss = loss_function(w, X, y)
        grad = gradient(w, X, y)
        history.append({'iter': i, 'loss': loss, 'w': w.copy(), 'grad': grad.copy()})

        # Key step: update in NEGATIVE gradient direction
        w = w - learning_rate * grad

        if i < 5:
            print(f"Iter {i}: loss = {loss:.4f}, |grad| = {np.linalg.norm(grad):.4f}")

    return w, history

print("\n--- Gradient Descent ---")
w_optimal, history = gradient_descent(X, y)
print(f"\nOptimal weights: {w_optimal}")
print(f"Final loss: {loss_function(w_optimal, X, y):.6f}")

# ============================================
# PATH AS A VECTOR-VALUED FUNCTION
# ============================================

# The optimization path w(t) is like a parametric curve!
# Each iteration t gives a new weight vector w(t)

print("\n--- Optimization Path as Vector Function ---")
print("Think of w(iteration) as a vector-valued function:")
print("  w(0) → w(1) → w(2) → ... → w_optimal")
print("\nThe 'velocity' is approximately -grad (direction of update)")
print("The path curves through weight space toward the minimum!")

# Path length (total distance traveled)
total_distance = 0
for i in range(1, len(history)):
    step = np.linalg.norm(history[i]['w'] - history[i-1]['w'])
    total_distance += step
print(f"\nTotal path length: {total_distance:.4f}")

Common Pitfalls

Pitfall 1: Confusing Speed and Velocity

Speed $|\mathbf{r}'(t)|$ is a scalar (always ≥ 0). Velocity $\mathbf{r}'(t)$ is a vector (can point in any direction). They're related but not the same!

Pitfall 2: Forgetting Order in Cross Products

When differentiating $\mathbf{u} \times \mathbf{v}$ , the order matters: $\frac{d}{dt}[\mathbf{u} \times \mathbf{v}] = \mathbf{u}' \times \mathbf{v} + \mathbf{u} \times \mathbf{v}'$ . Swapping the order changes the sign!

Pitfall 3: Division by Zero in Unit Tangent

The unit tangent $\mathbf{T}(t) = \mathbf{r}'(t)/|\mathbf{r}'(t)|$ is undefined when $\mathbf{r}'(t) = \mathbf{0}$ . This can happen at cusps or at stationary points where the particle momentarily stops.

Pitfall 4: Assuming Constant Speed

Just because an object moves along a curve doesn't mean its speed is constant. For $\mathbf{r}(t) = \langle t^2, t^3 \rangle$ , the speed $|\mathbf{r}'(t)| = |t|\sqrt{4 + 9t^2}$ varies with $t$ .

Pitfall 5: Reparameterization Changes r'(t)

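The same curve with different parameterizations has different velocity vectors. If $\mathbf{r}_1(t)$ and $\mathbf{r}_2(s)$ trace the same curve, $\mathbf{r}_1'(t) \neq \mathbf{r}_2'(s)$ in general. The unit tangent $\mathbf{T}$ , however, is the same at corresponding points, provided both parameterizations traverse the curve in the same direction and neither velocity is zero there.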
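This pitfall is easy to see numerically. The sketch below assumes two parameterizations of the unit circle, $\mathbf{r}_1(t) = \langle \cos(t), \sin(t) \rangle$ and $\mathbf{r}_2(s) = \langle \cos(2s), \sin(2s) \rangle$ , chosen here purely as an example; the second traverses the same circle twice as fast.

```python
import numpy as np

def r1_prime(t):
    """Velocity for r1(t) = <cos t, sin t>."""
    return np.array([-np.sin(t), np.cos(t)])

def r2_prime(s):
    """Velocity for r2(s) = <cos 2s, sin 2s>, the same circle at double speed."""
    return 2.0 * np.array([-np.sin(2.0 * s), np.cos(2.0 * s)])

t0 = 1.0
s0 = t0 / 2.0    # r2(s0) and r1(t0) are the same point on the circle

v1, v2 = r1_prime(t0), r2_prime(s0)
T1 = v1 / np.linalg.norm(v1)
T2 = v2 / np.linalg.norm(v2)

print("velocities:   ", v1, v2)
print("unit tangents:", T1, T2)
```

The velocities differ by a factor of 2, while the unit tangents agree because both parameterizations move counterclockwise.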

Test Your Understanding

Question 1 of 8

If r(t) = ⟨t², 3t, cos(t)⟩, what is r'(t)?

Summary

The derivative of a vector function extends the fundamental concept of instantaneous rate of change to curves in space. By differentiating component-wise, we obtain the tangent vector — a powerful tool for analyzing motion, geometry, and optimization.

Key Concepts

Concept	Description
Definition	r'(t) = lim_{Δt→0} [r(t+Δt) - r(t)]/Δt
Component form	r'(t) = ⟨f'(t), g'(t), h'(t)⟩
Velocity	v(t) = r'(t) (vector describing motion)
Speed	|v(t)| = |r'(t)| (scalar magnitude of velocity)
Unit tangent	T(t) = r'(t)/|r'(t)| (direction only, |T| = 1)
Acceleration	a(t) = r''(t) = v'(t)
Gradient	∇L = ⟨∂L/∂w₁, ..., ∂L/∂wₙ⟩ (for ML optimization)

Key Takeaways

The derivative of a vector function is computed component by component
$\mathbf{r}'(t)$ is the tangent vector to the curve — it points in the direction of motion
Velocity is a vector (direction + speed); speed is a scalar (just magnitude)
All differentiation rules (sum, product, chain) extend naturally to vectors
The unit tangent vector $\mathbf{T}(t)$ captures direction without speed
In machine learning, gradients are vector derivatives used for optimization
Backpropagation is the chain rule for vectors applied through neural network layers

The Essence of Vector Derivatives:

"The derivative of position is velocity. The derivative of velocity is acceleration. The derivative of a loss function is the gradient. Each tells us how to move forward — literally or figuratively — toward our goal."

Coming Next: In the next section, we'll explore Arc Length and Curvature — using the tangent vector to measure how long a curve is and how sharply it bends. These concepts complete our toolkit for analyzing space curves.