Why Derivatives Matter for Neural Networks
In the previous section, we learned how vectors and matrices organize and transform data. But there is a deeper question: how does a neural network learn? The answer is derivatives. Every time a neural network adjusts its weights during training, it is using derivatives to figure out which direction to change each weight to make the network's predictions better.
Here is the central idea. A neural network makes a prediction, compares it to the correct answer, and computes a loss — a single number measuring how wrong the prediction was. The question then becomes: how should each weight change to make the loss smaller? The derivative of the loss with respect to each weight answers exactly this question. It tells us the direction and rate at which the loss changes when we nudge each weight by a tiny amount.
The Central Insight: A derivative measures how a function's output changes when its input changes by a tiny amount. Neural networks use derivatives to find which direction to adjust each weight to reduce the prediction error. This process is called gradient descent.
This section builds your understanding from the ground up: first the derivative of a single variable, then partial derivatives for multiple variables, then the gradient vector that combines them, and finally the gradient descent algorithm that drives all of deep learning. We will implement everything in Python first, then see how PyTorch automates the entire process.
The Derivative: Measuring Instantaneous Change
Imagine you are driving a car and your GPS shows you are 100 km from your destination at noon, and 60 km at 1 PM. Your average speed was $(100 - 60)/1 = 40$ km/h. But that average hides the details: you may have been going 80 km/h on the highway and 10 km/h in traffic. The derivative gives you the instantaneous speed at any single moment, not the average over a period.
Mathematically, the derivative of a function $f$ at a point $x$ is defined as the limit:
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
This formula says: take two points on the function that are $h$ apart, compute the slope of the straight line connecting them (the secant line), then shrink $h$ to zero. As $h$ shrinks, the secant line rotates and becomes the tangent line, the line that just touches the curve at that single point. The slope of that tangent line is the derivative.
Geometric Intuition: Secant to Tangent
The interactive visualization below shows this process in action. The red dashed line is the secant line connecting two points on the curve. The green line is the true tangent line. As you decrease $h$ (or click "Animate"), watch the secant line rotate to match the tangent line. The secant slope converges to the derivative.
Try This: Select different functions (x², x³, sin(x), eˣ) and move the point along the curve. Notice how the derivative (tangent slope) changes at different locations. For x², the slope is always 2x: positive when x > 0, negative when x < 0, and zero at x = 0 (the bottom of the bowl).
Computing Derivatives: The Limit Process
Let us make this concrete. For $f(x) = x^2$ at the point $x = 2$, we know $f(2) = 4$. We want to find the derivative $f'(2)$. The formula says to compute $\frac{f(2 + h) - f(2)}{h}$ for smaller and smaller $h$:
| h | f(2 + h) | f(2 + h) - f(2) | Slope = Δy/h | Error from 4.0 |
|---|---|---|---|---|
| 1.0 | 9.0 | 5.0 | 5.000000 | 1.000000 |
| 0.1 | 4.41 | 0.41 | 4.100000 | 0.100000 |
| 0.01 | 4.0401 | 0.0401 | 4.010000 | 0.010000 |
| 0.001 | 4.004001 | 0.004001 | 4.001000 | 0.001000 |
| 0.0001 | 4.00040001 | 0.00040001 | 4.000100 | 0.000100 |
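The pattern is clear: as $h$ shrinks by 10×, the error also shrinks by 10×. This is no accident: $\frac{(2 + h)^2 - 4}{h} = \frac{4h + h^2}{h} = 4 + h$, so the error is exactly $h$. In the limit $h \to 0$, the slope approaches exactly 4.0. This is the derivative: $f'(2) = 4$.
We can verify this with calculus. The power rule says that the derivative of $x^n$ is $n x^{n-1}$. So for $f(x) = x^2$: $f'(x) = 2x$. At $x = 2$: $f'(2) = 4$.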
Python Implementation: Derivative from First Principles
Let us code this limit process directly. We compute the finite difference $\frac{f(x + h) - f(x)}{h}$ for shrinking values of $h$ and watch it converge:
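A minimal sketch of that limit process, assuming the same example $f(x) = x^2$ at $x = 2$ and the step sizes from the table above. The helper name `forward_difference` is our own choice for illustration.

```python
def f(x):
    """The example function f(x) = x^2."""
    return x ** 2


def forward_difference(func, x, h):
    """Slope of the secant line through (x, func(x)) and (x + h, func(x + h))."""
    return (func(x + h) - func(x)) / h


x0 = 2.0
true_derivative = 2 * x0  # power rule: d/dx x^2 = 2x

# Step sizes from the table, each 10x smaller than the last.
step_sizes = [1.0, 0.1, 0.01, 0.001, 0.0001]
slopes = [forward_difference(f, x0, h) for h in step_sizes]

print(f"{'h':>8} {'slope':>12} {'error':>12}")
for h, slope in zip(step_sizes, slopes):
    print(f"{h:>8} {slope:>12.6f} {abs(slope - true_derivative):>12.6f}")
```

The printed slopes should track the table row by row. In floating point, making $h$ extremely small eventually hurts accuracy because of rounding, which is one reason exact rules and automatic differentiation are preferred in practice.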
Essential Derivative Rules
Computing derivatives from the limit definition every time would be tedious. Instead, calculus gives us a set of rules that make finding derivatives mechanical. Here are the rules you will use most in deep learning:
The Power Rule
If $f(x) = x^n$, then $f'(x) = n x^{n-1}$. This is the most frequently used rule. The exponent comes down as a coefficient, and the exponent decreases by one.
| Function | Derivative | At x=2 |
|---|---|---|
| x² | 2x | 4.0 |
| x³ | 3x² | 12.0 |
| x⁴ | 4x³ | 32.0 |
| x¹ = x | 1 | 1.0 |
| x⁰ = 1 (constant) | 0 | 0.0 |
The Sum Rule
The derivative of a sum is the sum of the derivatives. If $u(x) = f(x) + g(x)$, then $u'(x) = f'(x) + g'(x)$. This lets you differentiate term by term. For example: $\frac{d}{dx}\left(x^2 + x^3\right) = 2x + 3x^2$.
The Constant Multiple Rule
Constants factor out of derivatives: if $g(x) = c \cdot f(x)$, then $g'(x) = c \cdot f'(x)$. Example: $\frac{d}{dx}\left(5x^2\right) = 5 \cdot 2x = 10x$.
Special Functions
| Function | Derivative | Why It Matters |
|---|---|---|
| eˣ | eˣ | Appears in softmax, sigmoid. Its own derivative! |
| ln(x) | 1/x | Appears in cross-entropy loss |
| sin(x) | cos(x) | Used in positional encodings |
| 1/x | -1/x² | Appears in normalization layers |
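One way to build trust in these rules is to check them numerically. The sketch below compares each rule against a symmetric finite difference at $x = 2$. The helper `central_difference` and the specific test functions are our own illustrative choices.

```python
import math


def central_difference(func, x, h=1e-5):
    """Approximate func'(x) with a symmetric finite difference."""
    return (func(x + h) - func(x - h)) / (2 * h)


x0 = 2.0

# Each entry: (label, function, derivative predicted by the rule).
rules = [
    ("x^3 (power rule)", lambda x: x ** 3, lambda x: 3 * x ** 2),
    ("x^2 + x^3 (sum rule)", lambda x: x ** 2 + x ** 3, lambda x: 2 * x + 3 * x ** 2),
    ("5x^2 (constant multiple)", lambda x: 5 * x ** 2, lambda x: 10 * x),
    ("e^x", math.exp, math.exp),
    ("ln(x)", math.log, lambda x: 1 / x),
    ("sin(x)", math.sin, math.cos),
    ("1/x", lambda x: 1 / x, lambda x: -1 / x ** 2),
]

max_error = 0.0
for label, func, rule_derivative in rules:
    numeric = central_difference(func, x0)
    exact = rule_derivative(x0)
    max_error = max(max_error, abs(numeric - exact))
    print(f"{label:<26} numeric={numeric:.6f}  rule={exact:.6f}")

print(f"largest disagreement: {max_error:.2e}")
```

The two columns should agree to several decimal places, which is the same kind of check often used to debug hand-written gradients.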
Why These Rules Matter: Every neural network layer is built from these basic operations — multiplication, addition, powers, and exponentials. The chain rule (covered in Section 4 of this chapter) combines these to differentiate complex compositions. Together, they let us compute the derivative of any neural network output with respect to any weight.
Partial Derivatives: Multiple Variables
A real neural network has thousands or millions of weights, not just one. Its loss function takes all weights as input: $L(w_1, w_2, \ldots, w_n)$. To update each weight, we need to know how the loss changes when that specific weight changes, while all other weights stay fixed. This is exactly what a partial derivative computes.
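Consider a function of two variables: $f(x, y) = x^2 + 3y^2$. This defines a 3D surface, an elliptical bowl. The partial derivative $\frac{\partial f}{\partial x}$ asks: if I nudge $x$ slightly while holding $y$ fixed, how much does $f$ change? To compute it, simply treat $y$ as a constant and differentiate with respect to $x$ (and likewise for $y$):
$$\frac{\partial f}{\partial x} = 2x, \qquad \frac{\partial f}{\partial y} = 6y$$
At the point $(2, 1)$: $\frac{\partial f}{\partial x} = 4$ and $\frac{\partial f}{\partial y} = 6$. The surface is steeper in the $y$-direction (slope 6 vs 4) because the $3y^2$ term has a larger coefficient.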
Code: Computing Partial Derivatives
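A minimal sketch of the "nudge one variable, hold the other fixed" idea. The surface $f(x, y) = x^2 + 3y^2$ and the point $(2, 1)$ are assumed here as an illustrative elliptical bowl, and the helper names `partial_x` and `partial_y` are our own.

```python
def f(x, y):
    """Assumed example surface: an elliptical bowl f(x, y) = x^2 + 3y^2."""
    return x ** 2 + 3 * y ** 2


def partial_x(func, x, y, h=1e-6):
    """Nudge x by a small amount while holding y fixed."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Nudge y by a small amount while holding x fixed."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


x0, y0 = 2.0, 1.0
df_dx = partial_x(f, x0, y0)
df_dy = partial_y(f, x0, y0)

# Analytic partials for this surface are 2x and 6y.
print(f"df/dx at ({x0}, {y0}): numeric={df_dx:.6f}, analytic={2 * x0:.6f}")
print(f"df/dy at ({x0}, {y0}): numeric={df_dy:.6f}, analytic={6 * y0:.6f}")
```

Each helper is just the single-variable finite difference applied along one axis, which is all a partial derivative is.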
The Gradient Vector
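The gradient is the vector that collects all partial derivatives into one object. For a function $f(x, y)$, the gradient is:
$$\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)$$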
The gradient has two crucial geometric properties:
- Direction: The gradient points in the direction of steepest ascent, the direction where $f$ increases fastest. To decrease $f$ (minimize the loss), move in the opposite direction: $-\nabla f$.
- Magnitude: $\|\nabla f\|$ tells you how steep the surface is at that point. Near a minimum, the gradient magnitude approaches zero because the surface flattens out.
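Both properties can be checked by brute force. The sketch below, which assumes the illustrative surface $f(x, y) = x^2 + 3y^2$ and the point $(2, 1)$, tries a small step in 360 directions and compares the best one with the gradient direction.

```python
import math


def f(x, y):
    """Assumed example surface: f(x, y) = x^2 + 3y^2."""
    return x ** 2 + 3 * y ** 2


def grad_f(x, y):
    """Analytic gradient of the example surface: (2x, 6y)."""
    return (2 * x, 6 * y)


x0, y0 = 2.0, 1.0
gx, gy = grad_f(x0, y0)
magnitude = math.hypot(gx, gy)

# Try unit directions all around the circle and measure how much f changes
# after a small step in each one.
step = 1e-3
best_angle, best_increase = None, -math.inf
for k in range(360):
    angle = math.radians(k)
    dx, dy = math.cos(angle), math.sin(angle)
    increase = f(x0 + step * dx, y0 + step * dy) - f(x0, y0)
    if increase > best_increase:
        best_angle, best_increase = angle, increase

gradient_angle = math.atan2(gy, gx)
print(f"gradient = ({gx}, {gy}), magnitude = {magnitude:.4f}")
print(f"gradient angle      = {math.degrees(gradient_angle):.1f} degrees")
print(f"steepest test angle = {math.degrees(best_angle):.1f} degrees")

# A small step against the gradient lowers f.
downhill = f(x0 - step * gx / magnitude, y0 - step * gy / magnitude)
print(f"f before = {f(x0, y0):.6f}, f after downhill step = {downhill:.6f}")
```

The best test direction should land within one degree of the gradient angle, limited only by the one-degree grid used in the search.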
Interactive: Gradient and Partial Derivatives on a Contour Plot
The visualization below shows a contour plot (top-down view of the surface) with the partial derivatives and gradient vector drawn at a movable point. Click or drag to move the point. Notice how the gradient (purple arrow) always points perpendicular to the contour lines — toward higher values.
Key Observation: The gradient is always perpendicular to the contour lines (level curves). Contour lines connect points of equal value, so moving along a contour means no change in $f$. The gradient points in the direction of maximum change, which must be perpendicular to the "no change" direction.
For a neural network with $n$ weights, the gradient is an $n$-dimensional vector:
$$\nabla L = \left( \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_n} \right)$$
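The perpendicularity claim can be checked with a dot product. As a sketch, assume the illustrative surface $f(x, y) = x^2 + 3y^2$: its contour at level $c$ is an ellipse that can be parametrized by an angle $t$, and the tangent to that ellipse should have zero dot product with the gradient. The level $c = 7$ and the function names are our own choices.

```python
import math

c = 7.0  # contour level f(x, y) = 7 for the assumed surface x^2 + 3y^2


def contour_point(t):
    """Point on the level curve x^2 + 3y^2 = c at parameter t."""
    return (math.sqrt(c) * math.cos(t), math.sqrt(c / 3) * math.sin(t))


def contour_tangent(t):
    """Derivative of contour_point with respect to t (direction along the contour)."""
    return (-math.sqrt(c) * math.sin(t), math.sqrt(c / 3) * math.cos(t))


def grad_f(x, y):
    """Analytic gradient of x^2 + 3y^2."""
    return (2 * x, 6 * y)


dot_products = []
for k in range(8):
    t = 2 * math.pi * k / 8
    x, y = contour_point(t)
    tx, ty = contour_tangent(t)
    gx, gy = grad_f(x, y)
    dot_products.append(gx * tx + gy * ty)
    print(f"t={t:.3f}  point=({x:+.3f}, {y:+.3f})  gradient . tangent = {dot_products[-1]:+.2e}")
```

Every dot product should be zero up to rounding, which is exactly what "perpendicular" means.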
Each component tells you how to adjust one specific weight to reduce the loss. The gradient gives you the complete recipe for improving all weights simultaneously.
Gradient Descent: Walking Downhill
Now we have all the pieces. Gradient descent is the algorithm that uses the gradient to iteratively minimize a function. The idea is beautifully simple: if the gradient tells you which way is uphill, go the other way.
The update rule is:
$$w_{\text{new}} = w_{\text{old}} - \eta \, \nabla L(w_{\text{old}})$$
where $\eta$ (eta) is the learning rate, a positive number that controls the step size. The minus sign ensures we move opposite to the gradient (downhill). Let us start with a single-variable example to build intuition.
Example: Minimizing a 1D Loss Function
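Consider $L(w) = (w - w^*)^2$. This is a parabola with minimum at $w = w^*$ where $L = 0$. The derivative is $L'(w) = 2(w - w^*)$. Starting at a point $w_0$ to the left of the minimum ($w_0 < w^*$):
- The gradient is $L'(w_0) = 2(w_0 - w^*) < 0$ (negative, meaning the minimum is to the right)
- The update: $w_1 = w_0 - \eta \, L'(w_0) > w_0$ (moved right!)
- Each subsequent step moves us closer to $w^*$, with smaller steps as we approach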
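To see these steps with concrete numbers, here is a small sketch. The loss $L(w) = (w - 3)^2$, the starting point $w = 0$, and the learning rate 0.1 are all hypothetical values chosen for illustration.

```python
def loss(w):
    """Hypothetical 1D loss: a parabola with its minimum at w = 3."""
    return (w - 3.0) ** 2


def loss_derivative(w):
    """dL/dw = 2(w - 3)."""
    return 2.0 * (w - 3.0)


w = 0.0              # assumed starting point, left of the minimum
learning_rate = 0.1  # assumed step size
history = [w]

for step in range(25):
    gradient = loss_derivative(w)
    w = w - learning_rate * gradient  # move opposite to the gradient
    history.append(w)
    if step < 5:
        print(f"step {step + 1}: gradient={gradient:+.4f}  w={w:.4f}  loss={loss(w):.4f}")

print(f"after {len(history) - 1} steps: w={w:.4f}")
```

The first gradient is negative, so the first update moves $w$ to the right. Because the gradient shrinks as $w$ nears the minimum, the steps shrink too, even though the learning rate never changes.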
Gradient Descent in Multiple Dimensions
The real power of gradient descent emerges with multiple variables. Instead of updating a single weight, we update an entire weight vector simultaneously. The update rule is the same, just with vectors:
$$\mathbf{w}_{\text{new}} = \mathbf{w}_{\text{old}} - \eta \, \nabla L(\mathbf{w}_{\text{old}})$$
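Let us apply this to our two-variable loss $f(x, y) = x^2 + 3y^2$. The gradient is $\nabla f = (2x, 6y)$. Starting from an initial point $(x_0, y_0)$ with learning rate 0.05, each step applies $(x, y) \leftarrow (x - 0.05 \cdot 2x,\; y - 0.05 \cdot 6y) = (0.9x,\; 0.7y)$, so both coordinates shrink toward the minimum at the origin.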
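A NumPy sketch of the vector update, assuming the illustrative loss $f(x, y) = x^2 + 3y^2$ and a hypothetical starting point $(2, 1)$. Only the learning rate 0.05 comes from the text above.

```python
import numpy as np


def loss(w):
    """Assumed example loss f(x, y) = x^2 + 3y^2, with w = [x, y]."""
    return w[0] ** 2 + 3 * w[1] ** 2


def gradient(w):
    """Analytic gradient (2x, 6y)."""
    return np.array([2 * w[0], 6 * w[1]])


w = np.array([2.0, 1.0])  # hypothetical starting point
learning_rate = 0.05
losses = [loss(w)]

for step in range(100):
    w = w - learning_rate * gradient(w)  # update both coordinates at once
    losses.append(loss(w))
    if step < 3 or step == 99:
        print(f"step {step + 1:3d}: w={w}  loss={losses[-1]:.6f}")
```

The $y$ coordinate shrinks faster than the $x$ coordinate because its gradient component is larger, so the path curves toward the $x$-axis before settling at the origin.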
Interactive 3D Visualization: Watch Gradient Descent Navigate a Loss Surface
The 3D visualization below shows gradient descent in action on a real loss surface. The orange ball follows the negative gradient downhill, leaving a yellow trail. The green arrow shows the descent direction. Try different surfaces, learning rates, and starting positions to build intuition for how gradient descent behaves.
Experiment: Try the Rosenbrock surface — it has a narrow curved valley where gradient descent oscillates side-to-side while slowly progressing along the valley floor. This demonstrates a real challenge in optimization: the gradient may not point directly toward the minimum. Advanced optimizers like Adam and RMSProp (Chapter 9) address this.
The Learning Rate: Step Size Matters
The learning rate $\eta$ is often the most important hyperparameter in neural network training. It controls the size of each update step:
- Too small: each step is tiny, convergence is painfully slow, and training may stall before reaching a good minimum.
- Just right: smooth convergence to the minimum in a reasonable number of steps.
- Too large: steps overshoot the minimum, causing oscillation or even divergence where the loss explodes to infinity.
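For a simple quadratic loss $L(w) = w^2$, the update becomes $w_{\text{new}} = w - \eta \cdot 2w = (1 - 2\eta)\,w$. This converges when $|1 - 2\eta| < 1$, which requires $0 < \eta < 1$. If $\eta > 1$, the weight oscillates with growing amplitude and the loss diverges.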
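The three regimes are easy to reproduce. The sketch below assumes the quadratic loss $L(w) = w^2$ (derivative $2w$), so each step multiplies the weight by $1 - 2\eta$. The four learning rates are hypothetical values picked to land in different regimes.

```python
def run_gradient_descent(learning_rate, w0=1.0, steps=20):
    """Gradient descent on L(w) = w^2, whose derivative is 2w."""
    w = w0
    trajectory = [w]
    for _ in range(steps):
        w = w - learning_rate * 2 * w  # same as (1 - 2 * learning_rate) * w
        trajectory.append(w)
    return trajectory


# Hypothetical learning rates chosen to show each regime.
results = {}
for learning_rate in [0.01, 0.3, 0.9, 1.1]:
    trajectory = run_gradient_descent(learning_rate)
    results[learning_rate] = trajectory
    factor = 1 - 2 * learning_rate
    print(f"eta={learning_rate:<5} factor={factor:+.2f}  "
          f"w after 20 steps={trajectory[-1]:+.6f}")
```

With a factor close to 1 the weight barely moves, with a small positive factor it converges quickly, with a negative factor inside $(-1, 0)$ it converges while flipping sign each step, and with a factor below $-1$ it grows without bound.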
Interactive: The Effect of Learning Rate
Use the slider below to experiment with different learning rates. Watch how the optimization trajectory and convergence behavior change dramatically with this single parameter.
PyTorch Autograd: Automatic Derivatives
Everything we have computed by hand — derivatives, partial derivatives, gradients — PyTorch does automatically. This is called automatic differentiation (autograd). Here is how it works:
- Create tensors with `requires_grad=True`. This tells PyTorch to track all operations on these tensors.
- Compute a function of those tensors (the forward pass). PyTorch builds a computational graph recording every operation.
- Call `.backward()` on the output. PyTorch traverses the graph in reverse, computing all partial derivatives using the chain rule (the backward pass).
- Read the gradients from `.grad`. Each input tensor's `.grad` attribute contains the derivative of the output with respect to that tensor.
No derivative formulas needed! For a neural network with millions of parameters, this is the difference between impossible and practical.
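The four steps can be sketched in a few lines. The function $f(x, y) = x^2 + 3y^2$ and the point $(2, 1)$ are assumed here purely as an example, so the results can be compared with the hand-computed partial derivatives $2x$ and $6y$.

```python
import torch

# Step 1: create tensors that track operations.
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)

# Step 2: forward pass on an assumed example function f(x, y) = x^2 + 3y^2.
f = x ** 2 + 3 * y ** 2

# Step 3: backward pass fills in the partial derivatives.
f.backward()

# Step 4: read the gradients.
print(f"f     = {f.item():.1f}")
print(f"df/dx = {x.grad.item():.1f}")  # analytic value: 2x = 4
print(f"df/dy = {y.grad.item():.1f}")  # analytic value: 6y = 6
```

Notice that the code never states a derivative rule. PyTorch derives the gradients from the recorded operations.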
Gradient Descent in PyTorch
Now let us combine autograd with gradient descent to replicate our NumPy example in PyTorch. The code follows the standard PyTorch training loop that you will see in every deep learning project:
- Forward pass: compute the loss from current weights
- Backward pass: call `loss.backward()` to compute gradients
- Update weights: subtract the learning rate times the gradient inside `torch.no_grad()`
- Zero gradients: call `.grad.zero_()` on each weight tensor for the next iteration
Important Detail (Zeroing Gradients): PyTorch accumulates gradients by default (each `.backward()` call adds to the existing `.grad`). You must call `.grad.zero_()` before each new backward pass. Forgetting this is one of the most common PyTorch bugs: the gradients silently accumulate, and training can diverge for no apparent reason.
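A sketch of that four-step loop, assuming the illustrative loss $x^2 + 3y^2$, a hypothetical starting point $(2, 1)$, and learning rate 0.05. The variable names are our own.

```python
import torch

w = torch.tensor([2.0, 1.0], requires_grad=True)  # hypothetical starting weights
learning_rate = 0.05
losses = []

for step in range(100):
    # 1. Forward pass: compute the loss from the current weights.
    loss = w[0] ** 2 + 3 * w[1] ** 2

    # 2. Backward pass: populate w.grad with dloss/dw.
    loss.backward()

    # 3. Update weights without recording the update in the graph.
    with torch.no_grad():
        w -= learning_rate * w.grad

    # 4. Zero the gradient so the next backward pass starts fresh.
    w.grad.zero_()

    losses.append(loss.item())

print(f"first loss = {losses[0]:.4f}")
print(f"last loss  = {losses[-1]:.2e}")
print(f"final w    = {w.detach().tolist()}")
```

Commenting out the `w.grad.zero_()` line is a quick way to see the accumulation problem: each step would then use the sum of all past gradients instead of the current one.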
In practice, PyTorch provides the `torch.optim` module with optimizers like SGD, Adam, and AdaGrad that handle the update step and gradient zeroing for you through `optimizer.step()` and `optimizer.zero_grad()`. We will explore these in Chapter 9 (Optimizers). But it is critical to understand what they are doing under the hood, which is exactly the gradient descent loop we just implemented.
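For comparison, here is a sketch of the same kind of loop written with `torch.optim.SGD`. The loss $x^2 + 3y^2$, the starting point, and the learning rate are again assumed example values.

```python
import torch

w = torch.tensor([2.0, 1.0], requires_grad=True)  # hypothetical starting weights
optimizer = torch.optim.SGD([w], lr=0.05)

for step in range(100):
    optimizer.zero_grad()                 # clear gradients from the previous step
    loss = w[0] ** 2 + 3 * w[1] ** 2      # forward pass
    loss.backward()                       # backward pass
    optimizer.step()                      # w <- w - lr * w.grad

final_loss = (w[0] ** 2 + 3 * w[1] ** 2).item()
print(f"final w    = {w.detach().tolist()}")
print(f"final loss = {final_loss:.2e}")
```

Plain SGD with no momentum applies the same rule as a manual update, $w \leftarrow w - \eta \nabla L$. Swapping in a different optimizer changes only the line that constructs it.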
Summary and Key Takeaways
This section covered the mathematical foundation of how neural networks learn. Here are the key concepts:
| Concept | Definition | Role in Neural Networks |
|---|---|---|
| Derivative f′(x) | Rate of change of f with respect to x | Tells how the loss changes when one weight changes |
| Partial derivative ∂f/∂x | Derivative with respect to one variable, others held constant | Isolates the effect of one specific weight on the loss |
| Gradient ∇f | Vector of all partial derivatives | Points in direction of steepest ascent; −∇f points downhill |
| Gradient descent | w_new = w_old − η⋅∇L | The algorithm that updates weights to minimize loss |
| Learning rate η | Step size multiplier | Controls convergence speed vs. stability |
| PyTorch autograd | Automatic differentiation via .backward() | Computes all gradients automatically, no formulas needed |
The Big Picture: Training a neural network is just gradient descent on a high-dimensional loss surface. The loss function measures prediction error, the gradient tells us how to adjust each weight to reduce that error, and the learning rate controls how aggressively we make those adjustments. In the next section, we will cover the probability basics that define the loss functions we optimize. In Section 4, we will see how the chain rule lets us compute gradients through the many layers of a deep network.