Why Derivatives? The Slope That Guides Learning
Imagine you are blindfolded, standing on a hilly landscape. Your goal is to reach the lowest point. You cannot see, but you can feel the ground beneath your feet. If the ground slopes downward to your right, you step right. If it slopes downward to your left, you step left. The steepness of the slope tells you how big a step to take.
This is exactly what a neural network does during training. The “landscape” is the loss function — a mathematical surface where the height represents how wrong the network's predictions are. The “slope” is the derivative, and the process of stepping downhill is called gradient descent.
The Central Idea: A derivative tells you how a function's output changes when you nudge its input. For a loss function, it answers: “If I slightly increase this weight, does the error go up or down, and by how much?” This is ALL a neural network needs to learn.
Formally, the derivative of a function $f$ at a point $x$ is defined as:
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
This is the slope of the tangent line at point $x$. When the derivative is positive, the function is increasing (going uphill). When negative, it is decreasing (going downhill). When zero, the function is flat, which marks a potential minimum or maximum.
Explore this interactively below. Drag the slider to move along the curve and watch how the tangent line (and its slope) changes at each point.
Computing Derivatives: Three Approaches
There are three ways to compute derivatives in practice, each with different trade-offs:
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Symbolic | Apply calculus rules (power rule, chain rule) by hand or with CAS | Exact answer, closed-form expression | Doesn’t scale to complex functions with millions of operations |
| Numerical | Approximate using f’(x) ≈ [f(x+h) - f(x)] / h | Simple to implement, works for any function | Slow (two function evaluations per parameter), rounding errors |
| Automatic (autograd) | Record operations during forward pass, replay chain rule backward | Exact, fast, scales to millions of parameters | Requires a framework (PyTorch, JAX, etc.) |
Neural networks use automatic differentiation (the third approach). PyTorch's autograd system implements this, giving you exact derivatives with a single function call. But let's first see the numerical approach in Python to build intuition, then see how autograd replaces it.
Numerical Derivative in Python
The simplest way to compute a derivative: evaluate the function at two very close points and compute the slope between them. Using a function whose analytical derivative is known lets us check the numerical estimate against the exact value.
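As a minimal sketch of the finite-difference idea, the snippet below assumes an illustrative function f(x) = x² evaluated at x = 3, where the analytical derivative is 2x = 6. The function, the point, and the helper name forward_difference are assumptions chosen for this example.

```python
def f(x):
    """Illustrative function (an assumption): f(x) = x^2, so f'(x) = 2x."""
    return x ** 2


def forward_difference(func, x, h=1e-5):
    """Approximate func'(x) with the forward difference [f(x+h) - f(x)] / h."""
    return (func(x + h) - func(x)) / h


x0 = 3.0
exact = 2 * x0                      # analytical derivative of x^2 at x0
approx = forward_difference(f, x0)  # numerical estimate

print(f"exact  = {exact}")
print(f"approx = {approx:.8f}")
print(f"error  = {abs(approx - exact):.2e}")
```

The estimate is close to 6 but not equal to it. Shrinking h reduces the approximation error only up to a point, after which rounding error in the subtraction takes over.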
The Same Derivative with PyTorch Autograd
Now watch autograd do the same thing in three lines. No finite differences, no choosing h, no approximation error from a finite step. Just: create a tensor with requires_grad=True, compute the function, call .backward(), and read the gradient from .grad.
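A minimal sketch of those steps, assuming an illustrative function f(x) = x² at x = 3 (both are assumptions chosen for this example; the analytical derivative there is 6):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # track operations on x
y = x ** 2                                 # forward pass: y = x^2
y.backward()                               # backward pass: compute dy/dx
print(x.grad)                              # gradient dy/dx evaluated at x = 3
```

The gradient is stored on the input tensor x, not on the output y, because x is the leaf tensor that asked for gradient tracking.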
Key Insight: The numerical approach computes an approximation whose accuracy depends on the step size h: too large and the slope is inaccurate, too small and rounding error dominates. Autograd applies the chain rule to each recorded operation during the backward pass, so its result is exact up to floating-point precision. For a network with millions of parameters, autograd gives all gradients in one backward pass, at a cost comparable to the forward pass itself.
The Chain Rule: Engine of Deep Learning
Neural networks are built from chains of functions: input passes through layer 1, the output passes through layer 2, then layer 3, and so on until the final prediction. To train the network, we need the derivative of the loss with respect to parameters in the very first layer — but the loss is many function compositions away.
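The chain rule solves this elegantly. If $y = f(g(x))$, then:
$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$$
In words: multiply the local derivatives at each link in the chain. Each function only needs to know its own derivative; the chain rule connects them. For a network with $n$ layers, $y = f_n(f_{n-1}(\cdots f_1(x)))$, the total derivative is the product of $n$ local derivatives:
$$\frac{dy}{dx} = \prod_{k=1}^{n} f_k'(a_{k-1}), \qquad a_0 = x, \quad a_k = f_k(a_{k-1})$$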
Chain Rule by Hand in Python
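Let's compute the chain rule manually for a composition of two simple functions: evaluate the inner function, then the outer one, and multiply the two local derivatives.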
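The sketch below works one such case by hand. It assumes an illustrative pair of functions, g(x) = 3x + 1 and f(u) = u², so that y = (3x + 1)². These functions and the point x = 2 are assumptions chosen for this example.

```python
def g(x):
    """Inner function (illustrative assumption): g(x) = 3x + 1, g'(x) = 3."""
    return 3 * x + 1


def f(u):
    """Outer function (illustrative assumption): f(u) = u^2, f'(u) = 2u."""
    return u ** 2


x = 2.0
u = g(x)                 # forward: inner value
y = f(u)                 # forward: outer value

du_dx = 3.0              # local derivative of g at x
dy_du = 2 * u            # local derivative of f at u
dy_dx = dy_du * du_dx    # chain rule: multiply the local derivatives

print(f"u = {u}, y = {y}")
print(f"dy/du = {dy_du}, du/dx = {du_dx}, dy/dx = {dy_dx}")
```

Each local derivative is evaluated at the value that actually flowed through that link (dy/du at u, du/dx at x), which is why the forward values have to be computed first.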
Chain Rule with PyTorch Autograd
Now let autograd do the same chain rule calculation. We write the same computation, but instead of computing local derivatives manually, we just call .backward() and PyTorch handles the chain rule internally.
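A minimal sketch, assuming an illustrative composition y = (3x + 1)² at x = 2 (an assumption chosen for this example). By hand, dy/dx = 2(3x + 1) · 3 = 42 at that point, which gives a value to check autograd against.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

u = 3 * x + 1      # inner function g(x)
y = u ** 2         # outer function f(u)

y.backward()       # autograd multiplies the local derivatives for us
print(x.grad)      # dy/dx at x = 2
```

No derivative formula appears anywhere in the code. PyTorch knows the local derivative of each primitive operation (multiply, add, power) and chains them during the backward pass.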
Explore different chain compositions interactively. Watch how local derivatives at each function multiply to produce the total derivative, and see how changing the input value affects the gradient flow.
Computational Graphs: PyTorch's Bookkeeping
How does PyTorch actually compute derivatives through chains of operations? The answer is the computational graph: a directed acyclic graph (DAG) that records every operation performed on tensors with requires_grad=True.
During the forward pass, as you execute operations like x * y or x ** 2, PyTorch silently builds this graph. Each operation creates a node (identified by grad_fn) that records what operation was performed and what its inputs were. When you call .backward(), PyTorch walks this graph in reverse, applying the chain rule at each node to compute all gradients.
Let's see this in action with z = x * y + x ** 2, where variable x feeds into the graph through two different paths.
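A sketch of this graph in code. The value x = 2.0 comes from the table in this section; y = 3.0 is inferred from the "3 + 4 = 7" entry there, and the intermediate names a and b follow the same table.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)   # leaf tensor
y = torch.tensor(3.0, requires_grad=True)   # leaf tensor (value inferred)

a = x * y        # path 1: x reaches z through a multiplication
b = x ** 2       # path 2: x reaches z through a power
z = a + b        # z = x * y + x ** 2

print(x.grad_fn)  # None, because x is a leaf created by the user
print(a.grad_fn)  # a MulBackward0 node
print(b.grad_fn)  # a PowBackward0 node
print(z.grad_fn)  # an AddBackward0 node

z.backward()
print(x.grad)     # sum over both paths: y + 2x
print(y.grad)     # single path: x
```

The gradient for x is the sum of the contributions from both paths, which is the multi-path rule in action.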
The visualization below lets you see this graph being built during the forward pass and then watch gradients flow backward during the backward pass. Try all three examples to see how different graph topologies work.
| Concept | What It Is | Example |
|---|---|---|
| Leaf tensor | Tensor created directly by user (not by an operation) | x = torch.tensor(2.0, requires_grad=True) |
| grad_fn | The graph node that created a tensor. None for leaves. | a.grad_fn = MulBackward0 |
| Forward pass | Computing the output. Builds the graph. | z = x * y + x ** 2 |
| Backward pass | Walking the graph in reverse to compute gradients. | z.backward() |
| Multi-path gradient | When a variable feeds into multiple paths, gradients are summed. | dz/dx = (through a) + (through b) = 3 + 4 = 7 |
The Autograd API: requires_grad, backward, grad
The autograd system has three essential pieces. Think of them as: the switch (requires_grad), the trigger (backward), and the result (grad).
- requires_grad=True: Tells PyTorch to track operations on this tensor. Set this on parameters (weights and biases), not on input data.
- .backward(): Triggers the backward pass. Walks the computational graph in reverse, computing all gradients using the chain rule. Called without arguments, it works only on a scalar (single-number) tensor such as a loss.
- .grad: After backward(), this attribute holds the gradient of the loss with respect to this parameter. It tells you: “which direction and how much should I change this parameter to reduce the loss?”
Let's see all three working together in a mini neural network: a single linear transformation with a squared error loss.
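A minimal sketch of such a mini network. The input, target, and initial parameter values are assumptions, chosen so that the weight gradient comes out near 0.4.

```python
import torch

# Data: no requires_grad, since we never update the data itself
x = torch.tensor(2.0)
target = torch.tensor(1.0)

# Parameters: the switch is on (values are illustrative assumptions)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

pred = w * x + b                 # linear transformation
loss = (pred - target) ** 2      # squared error, a scalar

loss.backward()                  # the trigger

# The result: gradients of the loss with respect to each parameter
print(w.grad)   # d(loss)/dw = 2 * (pred - target) * x
print(b.grad)   # d(loss)/db = 2 * (pred - target)
print(x.grad)   # None, because x was not tracked
```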
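The Gradient Tells You How to Improve: After backward(), w.grad = 0.4 means that near the current value, a small increase of ε in w raises the loss by about 0.4ε. So to decrease the loss, we should decrease w. The gradient descent update is:
$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$
where $L$ is the loss and $\eta$ is the learning rate.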
Partial Derivatives and the Gradient Vector
Real neural networks have thousands or millions of parameters, not just one. When a function depends on multiple variables, like $f(x, y)$, we need a separate derivative with respect to each variable. These are called partial derivatives.
The partial derivative $\partial f / \partial x$ treats all other variables as constants and differentiates only with respect to $x$. The collection of all partial derivatives forms the gradient vector:
$$\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)$$
The gradient vector points in the direction of steepest ascent. To minimize a function (reduce loss), we move in the opposite direction — the negative gradient. This is exactly what gradient descent does.
| Notation | Meaning | Example |
|---|---|---|
| ∂f/∂x | Partial derivative: differentiate f with respect to x, treating y as constant | f = x² + 2xy → ∂f/∂x = 2x + 2y |
| ∂f/∂y | Partial derivative: differentiate f with respect to y, treating x as constant | f = x² + 2xy → ∂f/∂y = 2x |
| ∇f (nabla f) | Gradient vector: the vector of all partial derivatives | ∇f = (2x + 2y, 2x) |
| -∇f | Negative gradient: direction of steepest DESCENT | The direction to move to reduce f fastest |
PyTorch computes all partial derivatives simultaneously. When you call .backward(), every leaf tensor with requires_grad=True gets its gradient populated. You saw this in the computational graph example: z.backward() gave us x.grad and y.grad in a single backward() call.
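As a quick check, the sketch below evaluates the table's example f = x² + 2xy with autograd. The evaluation point (x, y) = (1, 2) is an assumption chosen for illustration. The formulas in the table give ∂f/∂x = 2x + 2y = 6 and ∂f/∂y = 2x = 2 there.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

f = x ** 2 + 2 * x * y   # the function from the table above
f.backward()             # one call fills in every partial derivative

grad_f = (x.grad.item(), y.grad.item())   # the gradient vector at (1, 2)
print(grad_f)
```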
The Gradient Accumulation Trap
PyTorch has a behavior that surprises most beginners: calling .backward() does not replace the gradient in .grad; it adds to it. If you call backward() twice without zeroing the gradient between calls, the second gradient is added to the first.
This is by design (it enables techniques like gradient accumulation over several mini-batches), but it is also one of the most common bugs in PyTorch training loops. If you forget to zero gradients, each update uses the sum of all gradients computed so far rather than the current one.
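The accumulation is easy to reproduce. The sketch below assumes an illustrative loss w² at w = 2, whose true gradient is 4 on every call.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# Call backward() three times without resetting the gradient
for i in range(3):
    loss = w ** 2        # d(loss)/dw = 2w = 4 every time
    loss.backward()
    print(f"call {i + 1}: w.grad = {w.grad.item()}")   # 4.0, 8.0, 12.0

accumulated = w.grad.item()

# Reset, then compute the gradient once more
w.grad.zero_()
loss = w ** 2
loss.backward()
fresh = w.grad.item()
print(f"after zero_(): w.grad = {fresh}")
```

The loss and the parameter never changed, yet the stored gradient tripled. Only the value computed after zeroing matches the true derivative.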
The interactive demo below lets you experience this bug firsthand. Click “loss.backward()” repeatedly and watch the gradient grow without bound in the “Without zero_grad()” mode. Then switch to the correct mode and see how resetting keeps the gradient clean.
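- optimizer.zero_grad(): reset all parameter gradients to 0
- loss = loss_fn(model(inputs), targets): forward pass (loss_fn, model, inputs, and targets are placeholder names)
- loss.backward(): compute gradients
- optimizer.step(): update parameters using gradients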
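The four steps above can be sketched as a complete loop. The toy data (points on the line y = 2x + 1), the single-layer model, the learning rate, and the step count are all assumptions made for illustration.

```python
import torch

torch.manual_seed(0)  # reproducible initial weights

# Toy regression data (an assumption): targets lie on y = 2x + 1
inputs = torch.linspace(-1.0, 1.0, 20).unsqueeze(1)   # shape (20, 1)
targets = 2.0 * inputs + 1.0

model = torch.nn.Linear(1, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(200):
    optimizer.zero_grad()                     # 1. reset gradients
    loss = loss_fn(model(inputs), targets)    # 2. forward pass
    loss.backward()                           # 3. compute gradients
    optimizer.step()                          # 4. update parameters
    losses.append(loss.item())

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.6f}")
print(f"learned w = {model.weight.item():.3f}, b = {model.bias.item():.3f}")
```

Zeroing at the top of the loop guarantees that each backward() starts from a clean slate, whatever happened before the loop began.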
Controlling the Graph: no_grad() and detach()
Building a computational graph takes memory and computation. During inference (making predictions with a trained model), you don't need gradients — you're not training, just computing outputs. PyTorch provides two mechanisms to disable gradient tracking when you don't need it:
| Mechanism | What It Does | When to Use |
|---|---|---|
| torch.no_grad() | Context manager: disables gradient tracking for all operations inside the block | Inference, evaluation, parameter updates in training loops |
| tensor.detach() | Creates a new tensor with same data but disconnected from the graph | Using a tensor’s value without gradient flow, stopping gradient through part of network |
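Memory Impact: During training, autograd keeps the intermediate results of every recorded operation so the backward pass can use them. For large models such as GPT-3 (175 billion parameters), this saved state can run to tens of gigabytes for a single forward pass. Wrapping inference in torch.no_grad() skips that bookkeeping, so intermediate results can be freed as soon as they are no longer needed, which often cuts peak memory use substantially.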
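A small sketch of both mechanisms, using illustrative scalar values (assumptions). It shows that results computed under torch.no_grad() carry no graph, and that a detached tensor acts as a constant during the backward pass.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

y = w * 2                       # tracked: y is part of the graph
print(y.requires_grad)          # True

with torch.no_grad():
    y_inf = w * 2               # same arithmetic, but nothing is recorded
print(y_inf.requires_grad)      # False
print(y_inf.grad_fn)            # None

d = y.detach()                  # same value as y, cut off from the graph
print(d.requires_grad)          # False

# d is treated as a constant, so gradient flows only through y:
# dz/dw = d * dy/dw = 6 * 2 = 12 (without detach, z = y^2 would give 24)
z = y * d
z.backward()
print(w.grad)
```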
Gradient Descent with Autograd
Now we bring everything together. We'll build a complete gradient descent loop that uses autograd to minimize a function. The pattern is the same one used to train every neural network: forward pass, backward pass, update parameters, repeat.
Our goal: find the value of $x$ that minimizes $f(x) = (x - 3)^2$. The minimum is at $x = 3$ (where $f(3) = 0$). We start at $x = 0$ and let gradient descent find the minimum by following the negative gradient step by step.
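A minimal sketch of that loop, minimizing (x - 3)² from a start at x = 0. The learning rate of 0.1 is inferred from the step values quoted in this section, and the step count of 50 is an assumption.

```python
import torch

x = torch.tensor(0.0, requires_grad=True)   # starting point
lr = 0.1                                     # learning rate (inferred)

history = []
for step in range(50):
    loss = (x - 3) ** 2                      # forward pass
    history.append((x.item(), loss.item()))  # record before updating

    loss.backward()                          # backward pass: d(loss)/dx

    with torch.no_grad():                    # the update must not be tracked
        x -= lr * x.grad                     # step along the negative gradient
    x.grad.zero_()                           # reset for the next iteration

for step, (x_val, loss_val) in enumerate(history[:4]):
    print(f"step {step}: x = {x_val:.4f}, loss = {loss_val:.4f}")
print(f"final x = {x.item():.5f}")
```

The update sits inside torch.no_grad() because it is bookkeeping on the parameter, not part of the function being differentiated.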
Notice the pattern: the loss decreases at each step (9.0 → 5.76 → 3.69 → ...), and x approaches 3 (0 → 0.6 → 1.08 → ...). The gradient gets smaller as we approach the minimum, so the steps naturally shrink. This is why, with a suitably small learning rate, gradient descent converges smoothly on convex functions like this one.
3D Visualization: Gradient Descent on a Loss Surface
In real neural networks, the loss is a function of many parameters. With two parameters, we can visualize it as a 3D surface where height represents the loss. Watch the red ball follow the gradient downhill to find the minimum. The yellow arrow shows the direction of steepest descent at each point — this is exactly what autograd computes.
Summary and Bridge
In this section, we learned how PyTorch automatically computes derivatives — the gradients that drive neural network training. Here are the key concepts:
| Concept | What It Means | PyTorch API |
|---|---|---|
| Derivative | Rate of change — the slope of the function at a point | Computed automatically by autograd |
| Chain rule | Derivative through composed functions = product of local derivatives | Applied internally by .backward() |
| Computational graph | Record of operations that enables automatic differentiation | Built automatically during forward pass |
| requires_grad | Enable gradient tracking on a tensor (parameters only) | torch.tensor(..., requires_grad=True) |
| .backward() | Compute all gradients via reverse-mode autodiff | loss.backward() |
| .grad | Access the computed gradient of a leaf tensor | weight.grad |
| zero_grad() | Reset gradients to zero (MUST do before each backward) | optimizer.zero_grad() or tensor.grad.zero_() |
| no_grad() | Disable gradient tracking to save memory | with torch.no_grad(): |
| Gradient descent | Update parameters in negative gradient direction | w -= lr * w.grad |
The Training Loop Pattern: Every neural network training loop follows the same four steps: (1) forward pass to compute loss, (2) backward pass to compute gradients, (3) parameter update using gradients, (4) zero gradients for next iteration. Everything else — data loading, model architecture, learning rate schedules — is built around this core loop.
In the next chapter, we'll use these tools to build actual neural networks. You now have all the PyTorch essentials: tensors (Section 2-3), operations (Section 3), and automatic differentiation (this section). Chapter 3 will combine them into complete forward passes, loss computation, and training loops for real networks.