Chapter 2

Autograd Basics

Python and PyTorch Essentials

Why Derivatives? The Slope That Guides Learning

Imagine you are blindfolded, standing on a hilly landscape. Your goal is to reach the lowest point. You cannot see, but you can feel the ground beneath your feet. If the ground slopes downward to your right, you step right. If it slopes downward to your left, you step left. The steepness of the slope tells you how big a step to take.

This is exactly what a neural network does during training. The “landscape” is the loss function — a mathematical surface where the height represents how wrong the network's predictions are. The “slope” is the derivative, and the process of stepping downhill is called gradient descent.

The Central Idea: A derivative tells you how a function's output changes when you nudge its input. For a loss function, it answers: “If I slightly increase this weight, does the error go up or down, and by how much?” This is ALL a neural network needs to learn.

Formally, the derivative of a function $f(x)$ at a point $x$ is defined as:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

This is the slope of the tangent line at point $x$. When the derivative is positive, the function is increasing (going uphill). When negative, it is decreasing (going downhill). When zero, the function is flat: a potential minimum or maximum.

Explore this interactively below. Drag the slider to move along the curve and watch how the tangent line (and its slope) changes at each point.

[Interactive visualization: a curve with a draggable tangent line showing the slope at each point]
Try the f(x) = ½(x-3)² function. Notice that the derivative is zero at x = 3 — that is the minimum. The derivative is negative to the left of 3 (pointing right toward the minimum) and positive to the right (pointing left toward the minimum). This is the signal that gradient descent follows.
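
To make that signal concrete, here is a minimal sketch of gradient descent on f(x) = ½(x-3)², whose derivative is f'(x) = x - 3. The starting points, the learning rate of 0.1, and the step count are illustrative assumptions, not values taken from the visualization.

```python
def f(x):
    """f(x) = 0.5 * (x - 3)^2, minimized at x = 3."""
    return 0.5 * (x - 3) ** 2

def df(x):
    """Analytical derivative: f'(x) = x - 3."""
    return x - 3

def gradient_descent(x0, lr=0.1, steps=100):
    """Repeatedly step against the slope; lr and steps are illustrative choices."""
    x = x0
    for _ in range(steps):
        # A negative slope moves x to the right, a positive slope moves it left.
        x = x - lr * df(x)
    return x

x_from_left = gradient_descent(0.0)   # starts left of the minimum
x_from_right = gradient_descent(8.0)  # starts right of the minimum
print(f"from the left:  x = {x_from_left:.4f}, f(x) = {f(x_from_left):.2e}")
print(f"from the right: x = {x_from_right:.4f}, f(x) = {f(x_from_right):.2e}")
```

Both runs end close to x = 3, one approaching from below and one from above, because the sign of the derivative always points the update toward the minimum.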

Computing Derivatives: Three Approaches

There are three ways to compute derivatives in practice, each with different trade-offs:

| Approach | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Symbolic | Apply calculus rules (power rule, chain rule) by hand or with a computer algebra system (CAS) | Exact answer, closed-form expression | Doesn't scale to complex functions with millions of operations |
| Numerical | Approximate using f'(x) ≈ [f(x+h) - f(x)] / h | Simple to implement, works for any function | Slow (one extra function evaluation per parameter), truncation and rounding errors |
| Automatic (autograd) | Record operations during the forward pass, replay the chain rule backward | Exact up to floating-point precision, fast, scales to millions of parameters | Requires a framework (PyTorch, JAX, etc.) |

Neural networks use automatic differentiation (the third approach). PyTorch's autograd system implements this, giving you exact derivatives with a single function call. But let's first see the numerical approach in Python to build intuition, then see how autograd replaces it.

Numerical Derivative in Python

The simplest way to compute a derivative: evaluate the function at two very close points and compute the slope between them. For $f(x) = x^2 + 3x + 2$, the analytical derivative is $f'(x) = 2x + 3$, so at $x = 2$ it equals $7$. Let's verify numerically.

Numerical Derivative: The Limit Definition in Code
🐍numerical_derivative.py
Line 1: import numpy as np

We import NumPy for numerical computing. In this example we use plain Python math, but NumPy will be essential in later code blocks for array operations.

EXECUTION STATE
📚 numpy = Numerical computing library. Provides ndarray, mathematical functions, and linear algebra operations. We alias it as np by convention.
Line 3: def f(x) (the function we want to differentiate)

We define a simple quadratic function f(x) = x² + 3x + 2. This is a parabola that opens upward. We chose this because its derivative is easy to verify by hand: f’(x) = 2x + 3.

EXECUTION STATE
⬇ input: x = A single real number — the point at which we evaluate the function. For our example, x = 2.0.
⬆ returns = f(x) = x² + 3x + 2. At x=2: 4 + 6 + 2 = 12.0
Line 4: Docstring (our function definition)

The docstring documents what this function computes. f(x) = x² + 3x + 2 is a degree-2 polynomial (quadratic). Its graph is a parabola with minimum at x = -3/2.

Line 5: return x**2 + 3*x + 2

Computes and returns the value of f(x) = x² + 3x + 2. The ** operator is Python’s exponentiation. For x=2.0: 2² + 3×2 + 2 = 4 + 6 + 2 = 12.0.

EXECUTION STATE
x**2 = 2.0² = 4.0
3*x = 3 × 2.0 = 6.0
⬆ return: x**2 + 3*x + 2 = 4.0 + 6.0 + 2 = 12.0
Line 7: def numerical_derivative(f, x, h=1e-7) (computing the slope)

This function implements the limit definition of the derivative: f’(x) ≈ [f(x+h) - f(x)] / h. It draws a tiny secant line from x to x+h and computes its slope. As h → 0, this approaches the true derivative.

EXECUTION STATE
⬇ input: f = The function to differentiate. We pass in our f(x) = x² + 3x + 2. This is a Python function object.
⬇ input: x = The point at which to compute the derivative. For our example, x = 2.0.
⬇ input: h = 1e-7 = The tiny step size. h = 0.0000001. A smaller h reduces the truncation error of the secant approximation, but too small an h amplifies floating-point round-off. 1e-7 is a reasonable default for this example.
⬆ returns = The approximate derivative f’(x) as a float. For x=2: returns a value that prints as 7.000000 (matching the analytical answer to six decimal places).
Line 8: Docstring (the limit definition)

The derivative is defined as the limit: f’(x) = lim(h→0) [f(x+h) - f(x)] / h. We can’t actually take h=0 (division by zero), so we use a very small h instead.

Line 9: return (f(x + h) - f(x)) / h

The core computation: evaluate f at two very close points (x and x+h), take their difference, and divide by h. This gives the slope of the secant line, which approximates the slope of the tangent line (the derivative).

EXECUTION STATE
f(x + h) = f(2.0 + 1e-7) = f(2.0000001) = (2.0000001)² + 3×(2.0000001) + 2 = 12.0000007...
f(x) = f(2.0) = 12.0
f(x+h) - f(x) = 12.0000007 - 12.0 = 0.0000007 (the tiny rise)
h = 1e-7 = 0.0000001 (the tiny run)
⬆ return: (f(x+h) - f(x)) / h = 0.0000007 / 0.0000001 = 7.000000 (rise/run = slope)
Line 11: x = 2.0

We choose x = 2.0 as our evaluation point. The derivative of f(x) = x² + 3x + 2 is f’(x) = 2x + 3. At x=2: f’(2) = 2(2) + 3 = 7. We’ll verify this numerically.

EXECUTION STATE
x = 2.0 — the point where we want to know the slope
Line 13: print(f"f({x}) = {f(x)}")

Prints the function value at x=2. This is the height of the curve at our point.

EXECUTION STATE
output = f(2.0) = 12.0
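Line 14: print numerical derivative

Prints the numerically computed derivative. The secant-line approximation prints as 7.000000, agreeing with the exact answer to six decimal places. The underlying value is not exactly 7: the secant slope of this quadratic is 7 + h, plus a small floating-point round-off.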

EXECUTION STATE
numerical_derivative(f, 2.0) = 7.000000 — computed by the limit formula
Line 15: print analytical derivative

Prints the exact analytical derivative for comparison. We know f’(x) = 2x + 3 from calculus (power rule + constant rule). At x=2: 2(2) + 3 = 7.

EXECUTION STATE
2*x + 3 = 2(2.0) + 3 = 7.000000 — exact from calculus
→ match! = Numerical = 7.000000, Analytical = 7.000000. They agree! The limit definition works.
 1  import numpy as np
 2
 3  def f(x):
 4      """Our function: f(x) = x² + 3x + 2"""
 5      return x**2 + 3*x + 2
 6
 7  def numerical_derivative(f, x, h=1e-7):
 8      """Compute derivative using the limit definition."""
 9      return (f(x + h) - f(x)) / h
10
11  x = 2.0
12
13  print(f"f({x}) = {f(x)}")
14  print(f"f'({x}) numerical  = {numerical_derivative(f, x):.6f}")
15  print(f"f'({x}) analytical = {2*x + 3:.6f}")
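
The trade-off in choosing h can be observed directly. The following sketch compares the forward difference used above with a central difference over several step sizes. The central difference and the particular list of h values are additions for illustration and do not appear in the listing above.

```python
def f(x):
    """Same function as above: f(x) = x^2 + 3x + 2."""
    return x**2 + 3*x + 2

def forward_difference(f, x, h):
    """One-sided estimate; its truncation error shrinks in proportion to h."""
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h):
    """Two-sided estimate; its truncation error shrinks in proportion to h^2."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
exact = 2 * x + 3  # analytical derivative at x = 2

for h in (1e-1, 1e-4, 1e-7, 1e-10, 1e-13):
    fwd_err = abs(forward_difference(f, x, h) - exact)
    ctr_err = abs(central_difference(f, x, h) - exact)
    print(f"h = {h:.0e}  forward error = {fwd_err:.2e}  central error = {ctr_err:.2e}")
```

For large h the forward error is dominated by truncation (for this quadratic it is essentially h itself), while for very small h round-off takes over and the error tends to grow again. Autograd sidesteps the choice of h entirely.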

The Same Derivative with PyTorch Autograd

Now watch autograd do the same thing in three lines. No finite differences, no choosing $h$, no truncation error from a finite step. Just: create a tensor with requires_grad=True, compute the function, call .backward(), and read the gradient from .grad.

PyTorch Autograd: Exact Derivative in Three Lines
🐍autograd_first.py
Line 1: import torch

Import PyTorch. This gives us torch.tensor (like np.array but with gradient tracking), torch.autograd (automatic differentiation engine), and the entire deep learning toolkit.

EXECUTION STATE
📚 torch = PyTorch’s core module. Key for this section: torch.tensor() creates tensors, requires_grad=True enables gradient tracking, .backward() computes gradients.
Line 3: x = torch.tensor(2.0, requires_grad=True)

Creates a scalar tensor with value 2.0 and tells PyTorch: “track every operation on this tensor so you can compute derivatives later.” This is the fundamental switch that turns on autograd. Without requires_grad=True, PyTorch treats x as a plain number with no gradient tracking.

EXECUTION STATE
📚 torch.tensor() = Creates a new tensor. First arg is the data (a Python float here). Unlike np.array, can optionally track gradients for automatic differentiation.
⬇ arg 1: 2.0 = The scalar value. Creates a 0-dimensional tensor (a single number). Shape: torch.Size([]).
⬇ arg 2: requires_grad=True = The ON switch for autograd. When True, PyTorch records every operation on x into a computational graph. When False (default), no tracking happens and .grad stays None.
⬆ result: x = tensor(2.0, requires_grad=True) — a leaf tensor in the computational graph. ‘Leaf’ means it was created by the user (not by an operation on other tensors).
Line 5: y = x**2 + 3*x + 2

Computes y = f(x) = x² + 3x + 2. Because x has requires_grad=True, PyTorch records each operation: (1) x**2 creates a node with PowBackward, (2) 3*x creates MulBackward, (3) the additions create AddBackward nodes. The computational graph is built automatically during this line.

EXECUTION STATE
x**2 = 2.0² = 4.0 — recorded as PowBackward0
3*x = 3 × 2.0 = 6.0 — recorded as MulBackward0
x**2 + 3*x = 4.0 + 6.0 = 10.0 — recorded as AddBackward0
y = tensor(12.0, grad_fn=<AddBackward0>) — 10.0 + 2 = 12.0
y.grad_fn = AddBackward0 — this is the ‘breadcrumb’ that lets PyTorch trace backward through the computation
Line 7: y.backward()

THE key autograd call. This tells PyTorch: “Start from y, walk backward through the computational graph, and compute dy/dx using the chain rule.” PyTorch visits each node in reverse order, multiplying local derivatives together (chain rule), and stores the final result in x.grad. The entire backward pass happens in this single line — no manual calculus needed.

EXECUTION STATE
📚 .backward() = Tensor method: computes gradients of this tensor with respect to all leaf tensors that have requires_grad=True. For a scalar output (like loss or y here), no arguments needed. Internally calls torch.autograd.backward().
→ what happens internally = (1) Start at y with gradient 1.0 (dy/dy = 1); (2) walk backward through AddBackward, PowBackward, MulBackward; (3) apply the chain rule at each node; (4) accumulate the result into x.grad.
→ after this call = x.grad = 7.0 (because dy/dx = 2x + 3 = 2(2) + 3 = 7)
Line 9: print x value

Prints the value of x. The .item() method extracts the Python float from a scalar tensor.

EXECUTION STATE
📚 .item() = Tensor method: returns the scalar value as a standard Python number (int or float). Only works on tensors with exactly one element.
output = x = 2.0
Line 10: print y value

Prints y = 12.0. This is f(2.0) = 4 + 6 + 2 = 12.0.

EXECUTION STATE
output = y = x² + 3x + 2 = 12.0
Line 11: print dy/dx

This is the magic of autograd. x.grad contains the derivative dy/dx, computed automatically by backward(). The value 7.0 matches our analytical answer: dy/dx = 2x + 3 = 2(2) + 3 = 7. No manual differentiation needed!

EXECUTION STATE
x.grad = tensor(7.0) — this is dy/dx at x=2.0
📚 .grad = Attribute on leaf tensors: stores the gradient computed by .backward(). Is None before backward() is called. Accumulates across multiple backward() calls (a common source of bugs!).
output = dy/dx = 7.0
→ verification = Manual calculus: d/dx(x² + 3x + 2) = 2x + 3 = 2(2) + 3 = 7 ✔
 1  import torch
 2
 3  x = torch.tensor(2.0, requires_grad=True)
 4
 5  y = x**2 + 3*x + 2
 6
 7  y.backward()
 8
 9  print(f"x = {x.item()}")
10  print(f"y = x² + 3x + 2 = {y.item()}")
11  print(f"dy/dx = {x.grad.item()}")
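Key Insight: The numerical approach computes an approximation: it carries truncation error from the finite step h plus floating-point round-off, and it needs an extra function evaluation for every parameter. Autograd computes the derivative exactly, up to floating-point precision, by applying calculus rules (chain rule) automatically during the backward pass. For a network with millions of parameters, one backward pass yields all gradients at a cost that is a small constant multiple of one forward pass.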
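
The walkthrough notes that .grad accumulates across multiple backward() calls. A minimal sketch of that behavior, reusing the same function, is shown below. The variable names first, accumulated, and after_reset are illustrative, and resetting with x.grad.zero_() is one common way to clear the stored gradient.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# First backward pass: x.grad holds dy/dx = 2x + 3 = 7.
y = x**2 + 3*x + 2
y.backward()
first = x.grad.item()

# Second backward pass on a freshly built graph: the new gradient is ADDED to x.grad.
y = x**2 + 3*x + 2
y.backward()
accumulated = x.grad.item()

# Reset in place before the next pass so stale gradients do not leak in.
x.grad.zero_()
y = x**2 + 3*x + 2
y.backward()
after_reset = x.grad.item()

print(f"first = {first}, accumulated = {accumulated}, after reset = {after_reset}")
```

The second value is 14.0 rather than 7.0 because the two gradients were summed. This is why training loops clear gradients before each backward pass.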

The Chain Rule: Engine of Deep Learning

Neural networks are built from chains of functions: input passes through layer 1, the output passes through layer 2, then layer 3, and so on until the final prediction. To train the network, we need the derivative of the loss with respect to parameters in the very first layer — but the loss is many function compositions away.

The chain rule solves this elegantly. If $y = f(g(x))$, then:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} \quad \text{where } u = g(x)$$

In words: multiply the local derivatives at each link in the chain. Each function only needs to know its own derivative; the chain rule connects them. For a network with $n$ layers, the total derivative is the product of the local derivatives, one per link in the chain:

$$\frac{dy}{dx} = \frac{dy}{du_n} \cdot \frac{du_n}{du_{n-1}} \cdots \frac{du_2}{du_1} \cdot \frac{du_1}{dx}$$

Chain Rule by Hand in Python

Let's compute the chain rule manually for $y = f(g(x))$ where $g(x) = 3x + 1$ and $f(u) = u^2$. So $y = (3x + 1)^2$.

Chain Rule by Hand: Multiply Local Derivatives
🐍chain_rule_manual.py
Line 1: import numpy as np

Import NumPy. We use plain Python arithmetic here, but import numpy as a convention since it’s always available in scientific Python code.

EXECUTION STATE
📚 numpy = Numerical computing library. Not strictly needed for this scalar example, but included to establish the pattern.
Line 3: Comment, y = f(g(x)) (function composition)

We’re computing a composition of two functions: first apply g to x, then apply f to the result. This is exactly what happens in neural networks: layer 1 transforms the input, layer 2 transforms that output, and so on. The chain rule tells us how to differentiate through this composition.

Line 5: def g(x) (the inner function)

g(x) = 3x + 1 is the inner function. In a neural network, this would be like a linear layer: z = wx + b where w=3, b=1. Its derivative is simple: dg/dx = 3 (the slope of the line).

EXECUTION STATE
⬇ input: x = A scalar value. For our example: x = 2.0
⬆ returns = g(x) = 3x + 1. At x=2: 3(2) + 1 = 7.0
Line 6: return 3 * x + 1

Computes the linear function 3x + 1. This is the simplest possible neural network layer: multiply by a weight (3) and add a bias (1).

EXECUTION STATE
⬆ return: 3 * x + 1 = 3 × 2.0 + 1 = 7.0
Line 8: def f(u) (the outer function)

f(u) = u² is the outer function. In a neural network, this could be part of a loss function (e.g., MSE loss squares the error). Its derivative is df/du = 2u.

EXECUTION STATE
⬇ input: u = The output of g(x). For our example: u = g(2.0) = 7.0
⬆ returns = f(u) = u². At u=7: 7² = 49.0
Line 9: return u ** 2

Computes u squared. Python’s ** is the exponentiation operator.

EXECUTION STATE
⬆ return: u ** 2 = 7.0² = 49.0
Line 11: x = 2.0

Our input value. We’ll trace the computation forward (x → u → y) then trace the derivatives backward (dy/du × du/dx = dy/dx).

EXECUTION STATE
x = 2.0 — the starting point of our computation chain
Line 12: u = g(x)

Forward step 1: apply the inner function g. The value flows from x to u through the first link of the chain.

EXECUTION STATE
u = g(2.0) = 3 × 2.0 + 1 = 7.0
Line 13: y = f(u)

Forward step 2: apply the outer function f. Now we have the final output. The forward pass is complete: x=2 → u=7 → y=49.

EXECUTION STATE
y = f(7.0) = 7.0² = 49.0
Line 15: Comment, local derivatives

Now we compute the derivative of each function at the value it actually received. These are ‘local’ derivatives — each function only knows about its own input and output. The chain rule multiplies them together to get the global derivative.

Line 16: du_dx = 3

The derivative of g(x) = 3x + 1 with respect to x is simply 3 (the coefficient of x). This is the local derivative at the first link of the chain. It tells us: if x changes by ε, then u changes by 3ε.

EXECUTION STATE
du_dx = 3 (d/dx(3x + 1) = 3, the slope of the line u = 3x + 1)
Line 17: dy_du = 2 * u

The derivative of f(u) = u² with respect to u is 2u (power rule). At u=7: dy/du = 2(7) = 14. This is the local derivative at the second link. It tells us: if u changes by ε, then y changes by 14ε.

EXECUTION STATE
dy_du = 2 * u = 2 × 7.0 = 14.0 — d/du(u²) = 2u evaluated at u=7
Line 19: Comment, chain rule formula

The chain rule says: to get the derivative through a chain of functions, multiply all the local derivatives together. dy/dx = dy/du × du/dx. This is the fundamental algorithm behind backpropagation in neural networks.

Line 20: dy_dx = dy_du * du_dx

The chain rule in action: multiply the local derivatives of each link in the chain. 14 × 3 = 42. This tells us: if x changes by ε, then y changes by 42ε. A tiny change in x gets amplified by 3 (through g) and then by 14 (through f), for a total amplification of 42.

EXECUTION STATE
dy_du = 14.0 — how much f amplifies changes
du_dx = 3 — how much g amplifies changes
dy_dx = 14 × 3 = 42.0 — the total derivative dy/dx
Line 22: print x value

Displays the input value x = 2.0.

EXECUTION STATE
output = x = 2.0
Line 23: print u = g(x)

Displays the intermediate value u = 7.0 after the first function.

EXECUTION STATE
output = u = g(x) = 3x + 1 = 7.0
Line 24: print y = f(u)

Displays the final output y = 49.0.

EXECUTION STATE
output = y = f(u) = u² = 49.0
Line 25: print du/dx

The local derivative of the inner function: 3.

EXECUTION STATE
output = du/dx = 3
Line 26: print dy/du

The local derivative of the outer function: 14.0.

EXECUTION STATE
output = dy/du = 2u = 14.0
Line 27: print chain rule result

The complete chain rule result: dy/dx = 14 × 3 = 42.

EXECUTION STATE
output = dy/dx = dy/du × du/dx = 14.0 × 3 = 42.0
 1  import numpy as np
 2
 3  # y = f(g(x)) where g(x) = 3x + 1, f(u) = u²
 4
 5  def g(x):
 6      return 3 * x + 1
 7
 8  def f(u):
 9      return u ** 2
10
11  x = 2.0
12  u = g(x)
13  y = f(u)
14
15  # Local derivatives
16  du_dx = 3
17  dy_du = 2 * u
18
19  # Chain rule: dy/dx = dy/du × du/dx
20  dy_dx = dy_du * du_dx
21
22  print(f"x = {x}")
23  print(f"u = g(x) = 3x + 1 = {u}")
24  print(f"y = f(u) = u² = {y}")
25  print(f"du/dx = {du_dx}")
26  print(f"dy/du = 2u = {dy_du}")
27  print(f"dy/dx = dy/du × du/dx = {dy_du} × {du_dx} = {dy_dx}")
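
The same idea extends to longer chains. The sketch below represents each link as a (function, derivative) pair and multiplies the local derivatives, each evaluated at the value that link actually receives. The three links, the helper names forward_and_derivative and compose, and the finite-difference check are illustrative assumptions. The product is accumulated while moving forward, which for a scalar chain gives the same number as multiplying from the output backward.

```python
import math

# Each link is a (function, derivative) pair. These three links are illustrative.
links = [
    (lambda x: 3 * x + 1,   lambda x: 3.0),          # u1 = 3x + 1
    (lambda u: u ** 2,      lambda u: 2 * u),        # u2 = u1^2
    (lambda u: math.sin(u), lambda u: math.cos(u)),  # y  = sin(u2)
]

def forward_and_derivative(links, x):
    """Run the chain forward, multiplying local derivatives along the way."""
    value, total = x, 1.0
    for fn, dfn in links:
        total *= dfn(value)  # local derivative at the value this link receives
        value = fn(value)
    return value, total

def compose(links, x):
    """Apply every link in order without tracking derivatives."""
    for fn, _ in links:
        x = fn(x)
    return x

x = 0.5
y, dy_dx = forward_and_derivative(links, x)

# Independent check with a central finite difference.
h = 1e-6
numeric = (compose(links, x + h) - compose(links, x - h)) / (2 * h)
print(f"chain rule: {dy_dx:.6f}   finite difference: {numeric:.6f}")
```

Adding a fourth or fifth link only means appending another pair to the list; the multiplication loop does not change.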

Chain Rule with PyTorch Autograd

Now let autograd do the same chain rule calculation. We write the same computation, but instead of computing local derivatives manually, we just call .backward() and PyTorch handles the chain rule internally.

PyTorch Autograd Applies the Chain Rule Automatically
🐍chain_rule_autograd.py
Line 1: import torch

Import PyTorch. We’ll now let autograd compute the chain rule automatically — no manual derivative calculation needed.

EXECUTION STATE
📚 torch = PyTorch’s core module with automatic differentiation via torch.autograd.
Line 3: x = torch.tensor(2.0, requires_grad=True)

Create x = 2.0 with gradient tracking enabled. PyTorch will now record every operation that involves x.

EXECUTION STATE
x = tensor(2.0, requires_grad=True)
Line 5: u = 3 * x + 1 (PyTorch records this operation)

Computes g(x) = 3x + 1 = 7.0. PyTorch creates two graph nodes: MulBackward0 (for 3*x) and AddBackward0 (for +1). Each node stores the information needed to compute its local derivative during the backward pass.

EXECUTION STATE
3 * x = 3 × 2.0 = 6.0 — creates MulBackward0 node
u = 6.0 + 1 = 7.0 — creates AddBackward0 node
u.grad_fn = AddBackward0 — the graph node tracking this operation
Line 6: y = u ** 2 (PyTorch records this too)

Computes f(u) = u² = 49.0. Creates a PowBackward0 node. Now the entire computational graph is built: x → MulBackward → AddBackward → PowBackward. This graph is what backward() will traverse.

EXECUTION STATE
y = u ** 2 = 7.0² = 49.0
y.grad_fn = PowBackward0 — knows that y came from squaring u
Line 8: y.backward() (autograd computes the chain rule)

PyTorch walks backward through the graph:
  1. Start: dy/dy = 1
  2. PowBackward: dy/du = 2u = 2(7) = 14
  3. AddBackward: du/d(3x) = 1, so the running gradient stays 14
  4. MulBackward: d(3x)/dx = 3, so the running gradient becomes 14 × 3 = 42
Result: x.grad = 42.0

EXECUTION STATE
📚 .backward() = Traverses the computational graph in reverse, applying the chain rule at each node. Stores the result in x.grad.
→ backward trace = PowBackward: 2u = 14 → AddBackward: ×1 = 14 → MulBackward: ×3 = 42
Line 10: print x value

x = 2.0, unchanged by backward().

EXECUTION STATE
output = x = 2.0
Line 11: print u value

u = 7.0, the intermediate computation.

EXECUTION STATE
output = u = 3x + 1 = 7.0
Line 12: print y value

y = 49.0, the final output.

EXECUTION STATE
output = y = u² = 49.0
Line 13: print autograd result

x.grad = 42.0 — computed automatically by one call to y.backward(). This is exactly the same answer we got manually (14 × 3 = 42), but PyTorch computed it for us.

EXECUTION STATE
x.grad.item() = 42.0 — dy/dx computed by autograd
Line 14: print manual verification

We verify: dy/dx = 2u × 3 = 2(7)(3) = 42. Autograd and manual calculation match perfectly.

EXECUTION STATE
2 * u.item() * 3 = 2 × 7.0 × 3 = 42.0 — matches autograd!
 1  import torch
 2
 3  x = torch.tensor(2.0, requires_grad=True)
 4
 5  u = 3 * x + 1      # g(x) = 3x + 1 → u = 7.0
 6  y = u ** 2          # f(u) = u²    → y = 49.0
 7
 8  y.backward()
 9
10  print(f"x = {x.item()}")
11  print(f"u = 3x + 1 = {u.item()}")
12  print(f"y = u² = {y.item()}")
13  print(f"dy/dx (autograd) = {x.grad.item()}")
14  print(f"dy/dx (manual)   = 2 * u * 3 = {2 * u.item() * 3}")

Explore different chain compositions interactively. Watch how local derivatives at each function multiply to produce the total derivative, and see how changing $x$ affects the gradient flow.

[Interactive visualization: chain rule compositions and gradient flow]

Why the chain rule matters for deep learning: A neural network with 100 layers means the chain rule multiplies on the order of 100 local derivatives together. If their magnitudes are all slightly greater than 1, the product grows exponentially with depth (exploding gradients). If they are all slightly less than 1, the product shrinks toward zero (vanishing gradients). Solving these problems (with careful initialization, batch normalization, residual connections, and other techniques) is one of the central challenges of deep learning. We will explore these solutions in later chapters.
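
A few lines of arithmetic show how quickly this happens. The sketch assumes, purely for illustration, that every one of the 100 layers contributes the same local derivative; real networks have different values at every layer.

```python
depth = 100  # number of chained local derivatives, matching the 100-layer example

products = {}
for local in (1.1, 1.0, 0.9):
    # Simplifying assumption: every layer contributes the same local derivative.
    products[local] = local ** depth
    print(f"local derivative {local}: product over {depth} layers = {products[local]:.3e}")
```

A local derivative of 1.1 compounds to roughly 1.4e4, while 0.9 compounds to roughly 2.7e-5. Only a value of exactly 1.0 leaves the gradient unchanged.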

Computational Graphs: PyTorch's Bookkeeping

How does PyTorch actually compute derivatives through chains of operations? The answer is the computational graph: a directed acyclic graph (DAG) that records every operation performed on tensors with requires_grad=True.

During the forward pass, as you execute operations like $a = x \cdot y$ or $b = x^2$, PyTorch silently builds this graph. Each operation creates a node (identified by grad_fn) that records what operation was performed and what its inputs were. When you call .backward(), PyTorch walks this graph in reverse, applying the chain rule at each node to compute all gradients.

Let's see this in action with $z = x \cdot y + x^2$, where variable $x$ feeds into the graph through two different paths.

Exploring the Computational Graph
🐍computational_graph.py
Line 1: import torch

Import PyTorch for tensor creation and automatic differentiation.

EXECUTION STATE
📚 torch = PyTorch core module
Line 3: x = torch.tensor(2.0, requires_grad=True)

Create the first input tensor x = 2.0 with gradient tracking. This is a leaf tensor — it was directly created by the user, not by an operation on other tensors.

EXECUTION STATE
x = tensor(2.0, requires_grad=True)
x.is_leaf = True — leaf tensors are inputs to the graph
x.grad_fn = None — leaf tensors have no grad_fn (they weren’t created by an operation)
Line 4: y = torch.tensor(3.0, requires_grad=True)

Create the second input tensor y = 3.0. Now we have two leaf tensors, so z.backward() will compute dz/dx AND dz/dy (partial derivatives with respect to each input).

EXECUTION STATE
y = tensor(3.0, requires_grad=True)
Line 6: a = x * y (multiplication node)

Computes a = 2.0 × 3.0 = 6.0. PyTorch creates a MulBackward0 node in the graph. This node remembers: its inputs were x and y, and the local derivatives are da/dx = y = 3 and da/dy = x = 2.

EXECUTION STATE
a = x * y = 2.0 × 3.0 = 6.0
a.grad_fn = MulBackward0 — knows a = x*y, so da/dx = y, da/dy = x
a.is_leaf = False — a was created by an operation, not by the user
Line 7: b = x ** 2 (power node)

Computes b = 2.0² = 4.0. Creates a PowBackward0 node. Note: x feeds into BOTH a and b. This means x has two paths through the graph, and its total gradient is the sum of gradients from both paths.

EXECUTION STATE
b = x ** 2 = 2.0² = 4.0
b.grad_fn = PowBackward0 — knows b = x², so db/dx = 2x = 4.0
Line 8: z = a + b (addition node, output)

Computes z = 6.0 + 4.0 = 10.0. Creates an AddBackward0 node. This is our final output. The complete graph: x and y feed into a=x*y; x also feeds into b=x²; a and b feed into z=a+b.

EXECUTION STATE
z = a + b = 6.0 + 4.0 = 10.0
z.grad_fn = AddBackward0 — knows z = a + b, so dz/da = 1, dz/db = 1
Line 10: print x properties

Shows that x is a leaf tensor with requires_grad=True.

EXECUTION STATE
output = x = 2.0, requires_grad = True
Line 11: print y properties

Shows that y is also a leaf tensor with requires_grad=True.

EXECUTION STATE
output = y = 3.0, requires_grad = True
Line 12: print a properties

Shows a = 6.0 with grad_fn = MulBackward0. The grad_fn is how PyTorch knows how to backpropagate through this node.

EXECUTION STATE
output = a = x*y = 6.0, grad_fn = <MulBackward0 object>
Line 13: print b properties

Shows b = 4.0 with grad_fn = PowBackward0.

EXECUTION STATE
output = b = x² = 4.0, grad_fn = <PowBackward0 object>
Line 14: print z properties

Shows z = 10.0 with grad_fn = AddBackward0 — the root of our computational graph.

EXECUTION STATE
output = z = a+b = 10.0, grad_fn = <AddBackward0 object>
Line 16: z.backward() (compute all gradients)

Backward pass through the graph:
  1. Start: dz/dz = 1
  2. AddBackward: dz/da = 1, dz/db = 1
  3. MulBackward (for a = x*y): da/dx = y = 3, da/dy = x = 2
  4. PowBackward (for b = x²): db/dx = 2x = 4
  5. x has TWO paths, through a and through b: dz/dx = dz/da × da/dx + dz/db × db/dx = 1×3 + 1×4 = 7
  6. y has one path, through a: dz/dy = dz/da × da/dy = 1×2 = 2

EXECUTION STATE
→ x gradient (multi-path) = Path 1 (through a): dz/da × da/dx = 1 × 3 = 3; Path 2 (through b): dz/db × db/dx = 1 × 4 = 4; Total: 3 + 4 = 7.0
→ y gradient (single path) = Path (through a): dz/da × da/dy = 1 × 2 = 2.0
Line 18: print dz/dx

dz/dx = 7.0. Manual verification: z = xy + x², so dz/dx = y + 2x = 3 + 4 = 7. Autograd handles the multi-path gradient accumulation automatically.

EXECUTION STATE
x.grad = 7.0 — dz/dx = y + 2x = 3 + 4 = 7 ✔
Line 19: print dz/dy

dz/dy = 2.0. Manual verification: z = xy + x², so dz/dy = x = 2. Variable y only appears in one term (xy), so there’s only one path.

EXECUTION STATE
y.grad = 2.0 — dz/dy = x = 2 ✔
 1  import torch
 2
 3  x = torch.tensor(2.0, requires_grad=True)
 4  y = torch.tensor(3.0, requires_grad=True)
 5
 6  a = x * y          # a = 6.0
 7  b = x ** 2         # b = 4.0
 8  z = a + b          # z = 10.0
 9
10  print(f"x = {x.item()}, requires_grad = {x.requires_grad}")
11  print(f"y = {y.item()}, requires_grad = {y.requires_grad}")
12  print(f"a = x*y = {a.item()}, grad_fn = {a.grad_fn}")
13  print(f"b = x² = {b.item()}, grad_fn = {b.grad_fn}")
14  print(f"z = a+b = {z.item()}, grad_fn = {z.grad_fn}")
15
16  z.backward()
17
18  print(f"dz/dx = {x.grad.item()}")
19  print(f"dz/dy = {y.grad.item()}")

The visualization below lets you see this graph being built during the forward pass and then watch gradients flow backward during the backward pass. Try all three examples to see how different graph topologies work.

[Interactive visualization: computational graph built during the forward pass, with gradients flowing backward]

| Concept | What It Is | Example |
| --- | --- | --- |
| Leaf tensor | Tensor created directly by the user (not by an operation) | x = torch.tensor(2.0, requires_grad=True) |
| grad_fn | The graph node that created a tensor. None for leaves. | a.grad_fn = MulBackward0 |
| Forward pass | Computing the output. Builds the graph. | z = x * y + x ** 2 |
| Backward pass | Walking the graph in reverse to compute gradients. | z.backward() |
| Multi-path gradient | When a variable feeds into multiple paths, gradients are summed. | dz/dx = (through a) + (through b) = 3 + 4 = 7 |
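
The bookkeeping summarized in the table can be imitated in a few dozen lines of plain Python. The sketch below is a teaching toy, not how PyTorch is implemented: the Value class, its parents and local_grads fields, and the recursive topological sort are hypothetical names and choices made for illustration. It reproduces the multi-path sum for z = x*y + x².

```python
class Value:
    """A scalar that remembers how it was produced (hypothetical teaching class)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # input nodes of the operation
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, exponent):
        # The exponent is a plain number, so the only parent is self.
        local = exponent * self.data ** (exponent - 1)
        return Value(self.data ** exponent, (self,), (local,))

    def backward(self):
        # Order the graph so every node is processed after all nodes that consume it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dz/dz = 1
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local  # chain rule; += sums over paths


x = Value(2.0)
y = Value(3.0)
z = x * y + x ** 2
z.backward()
print(f"z = {z.data}, dz/dx = {x.grad}, dz/dy = {y.grad}")
```

The `+=` in the inner loop is what makes the multi-path case work: x receives 3 from the multiplication node and 4 from the power node, giving 7, while y receives a single contribution of 2.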

The Autograd API: requires_grad, backward, grad

The autograd system has three essential pieces. Think of them as: the switch (requires_grad), the trigger (backward), and the result (grad).

  1. requires_grad=True: Tells PyTorch to track operations on this tensor. Set this on parameters (weights and biases), not on input data.
  2. loss.backward(): Triggers the backward pass. Walks the computational graph in reverse, computing all gradients using the chain rule. Called without arguments, it works only on a scalar (single-number) tensor; a non-scalar tensor needs an explicit gradient argument.
  3. parameter.grad: After backward(), this attribute holds the gradient of the loss with respect to this parameter. It answers: “in which direction, and how steeply, does the loss change as I change this parameter?” Stepping against the gradient reduces the loss.
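
As a minimal sketch of the scalar rule in item 2: calling backward() with no arguments on a three-element tensor raises a RuntimeError, while reducing to a scalar first, or passing a gradient tensor of matching shape, both work. The tensor values and variable names here are illustrative.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = w ** 2  # three outputs, so there is no single scalar to differentiate

try:
    out.backward()  # no arguments on a non-scalar tensor
    raised = False
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

# Option 1: reduce to a scalar first (what a loss function normally does).
out.sum().backward()
grad_from_sum = w.grad.clone()

# Option 2: pass a `gradient` tensor with the same shape as the output.
w.grad.zero_()
out = w ** 2  # rebuild the graph; the previous backward pass freed its buffers
out.backward(gradient=torch.ones_like(out))
grad_from_arg = w.grad.clone()

print(grad_from_sum, grad_from_arg)
```

Both options give the gradient 2w. In practice a loss is already a scalar, so the argument-free call is the common case.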

Let's see all three working together in a mini neural network: a single linear transformation $z = wx + b$ with a squared error loss.

The Three Pillars of Autograd: requires_grad, backward(), .grad
🐍autograd_api.py
Line 1: import torch

Import PyTorch for the autograd API demonstration.

EXECUTION STATE
📚 torch = PyTorch core module
Line 3: Comment, Step 1 (learnable parameters)

In neural networks, weights and biases are the parameters we want to learn. They need requires_grad=True so we can compute how the loss changes with respect to each parameter. Input data does NOT need gradients.

Line 4: w = torch.tensor(0.5, requires_grad=True) (weight)

The weight parameter, initialized to 0.5. In a real network, this would be initialized randomly. We set requires_grad=True because we want to know: “how should I change w to reduce the loss?”

EXECUTION STATE
w = tensor(0.5, requires_grad=True) — the learnable weight
→ why requires_grad=True? = We need dloss/dw to update w during training. Without it, PyTorch won’t track how w affects the loss.
Line 5: b = torch.tensor(0.1, requires_grad=True) (bias)

The bias parameter, initialized to 0.1. Also needs gradient tracking for the same reason as w.

EXECUTION STATE
b = tensor(0.1, requires_grad=True) — the learnable bias
Line 6: x = torch.tensor(2.0) (input data, no grad)

The input data. Notice: NO requires_grad. Inputs are fixed observations — we don’t want to change the input data, we want to change the weights. By default, requires_grad=False.

EXECUTION STATE
x = tensor(2.0) — requires_grad=False (default)
→ why no grad? = x is data, not a parameter. We observe x; we learn w and b. Gradients flow backward to parameters, not to inputs.
Line 8: Comment, Step 2 (forward pass)

The forward pass computes the prediction and the loss. PyTorch builds the computational graph as we go.

Line 9: z = w * x + b

The linear transformation: z = 0.5 × 2.0 + 0.1 = 1.1. This is the prediction of our tiny ‘network’. The target is 1.0, so we’re slightly off.

EXECUTION STATE
w * x = 0.5 × 2.0 = 1.0
z = w*x + b = 1.0 + 0.1 = 1.1
z.grad_fn = AddBackward0
Line 10: loss = (z - 1.0) ** 2

Squared error on a single example: how far is our prediction (1.1) from the target (1.0)? loss = (1.1 - 1.0)² = 0.01. A small loss means we’re close to the target.

EXECUTION STATE
z - 1.0 = 1.1 - 1.0 = 0.1 — the prediction error
loss = 0.1² = 0.0100 — squared error
Line 12: print z value

Displays the prediction z = 1.1.

EXECUTION STATE
output = z = w*x + b = 1.1000
Line 13: print loss value

Displays the loss = 0.0100.

EXECUTION STATE
output = loss = (z - 1.0)² = 0.0100
Line 14: print w.grad before backward

Before calling backward(), no gradients have been computed yet. w.grad is None.

EXECUTION STATE
w.grad = None — gradients are only computed when you call .backward()
Line 15: print b.grad before backward

Similarly, b.grad is None before backward().

EXECUTION STATE
b.grad = None — not yet computed
Line 17: Comment, Step 3 (backward pass)

Now we call backward() to compute how each parameter should change to reduce the loss.

Line 18: loss.backward()  # compute gradients

Traverses the graph backward from loss, applying the chain rule at each node:
1. dloss/dz = 2(z - 1.0) = 2(0.1) = 0.2
2. dz/dw = x = 2.0 → dloss/dw = 0.2 × 2.0 = 0.4
3. dz/db = 1 → dloss/db = 0.2 × 1 = 0.2
Now w.grad = 0.4 and b.grad = 0.2.

EXECUTION STATE
→ dloss/dz = 2(z - 1.0) = 2(0.1) = 0.2
→ dz/dw = x = 2.0
→ dz/db = 1.0
→ dloss/dw = dloss/dz × dz/dw = 0.2 × 2.0 = 0.4000
→ dloss/db = dloss/dz × dz/db = 0.2 × 1.0 = 0.2000
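Line 20: print w.grad

w.grad = 0.4. This is the local slope of the loss with respect to w: a small increase Δw raises the loss by approximately 0.4 × Δw (for example, Δw = 0.01 raises it by about 0.004). The estimate only holds for small changes; increasing w by a full 1.0 would change the loss by far more than 0.4. So to DECREASE the loss, we should decrease w. The gradient descent update would be: w_new = w - lr × 0.4.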

EXECUTION STATE
w.grad = 0.4000 — dloss/dw. Positive gradient → decrease w to reduce loss.
Line 21: print b.grad

b.grad = 0.2. A small increase Δb raises the loss by approximately 0.2 × Δb. Again, to decrease the loss, we should decrease b.

EXECUTION STATE
b.grad = 0.2000 — dloss/db. Positive gradient → decrease b to reduce loss.
```python
import torch

# Step 1: Create learnable parameters
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x = torch.tensor(2.0)   # Input: no gradient needed

# Step 2: Forward pass
z = w * x + b
loss = (z - 1.0) ** 2

print(f"z = w*x + b = {z.item():.4f}")
print(f"loss = (z - 1.0)² = {loss.item():.4f}")
print(f"w.grad before backward: {w.grad}")
print(f"b.grad before backward: {b.grad}")

# Step 3: Backward pass
loss.backward()

print(f"w.grad = {w.grad.item():.4f}")
print(f"b.grad = {b.grad.item():.4f}")
```
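One way to gain confidence in the gradients reported by autograd for this example (w.grad = 0.4, b.grad = 0.2) is to approximate them with central finite differences in plain Python. This is a minimal sketch; the helper name `loss_fn` and the step size `h` are illustrative choices and not part of the original example.

```python
# Finite-difference check of the gradients from the one-neuron example.
# loss_fn and h are illustrative names/values, not from the original code.

def loss_fn(w, b, x=2.0, target=1.0):
    """Squared error of the model z = w*x + b against the target."""
    z = w * x + b
    return (z - target) ** 2

w, b, h = 0.5, 0.1, 1e-6

# Central difference: (f(p + h) - f(p - h)) / (2h)
dloss_dw = (loss_fn(w + h, b) - loss_fn(w - h, b)) / (2 * h)
dloss_db = (loss_fn(w, b + h) - loss_fn(w, b - h)) / (2 * h)

print(f"numeric dloss/dw = {dloss_dw:.4f}")
print(f"numeric dloss/db = {dloss_db:.4f}")
```

The numeric estimates should agree with the autograd values to several decimal places. Finite differences need extra function evaluations for every parameter, which is one reason autograd is preferred for models with many parameters.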
The Gradient Tells You How to Improve: After backward(), w.grad = 0.4 means “a small increase in w raises the loss at a rate of 0.4 per unit of w.” So to decrease the loss, we should decrease w. The gradient descent update is: $w_{\text{new}} = w - \eta \cdot \frac{\partial L}{\partial w}$, where $\eta$ is the learning rate.
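The update rule can be tried directly on the example's parameters. The sketch below applies a single gradient descent step to both w and b and re-evaluates the loss; the learning rate of 0.1 is an assumed value chosen for illustration.

```python
import torch

w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x = torch.tensor(2.0)
lr = 0.1  # learning rate (eta); an assumed value for illustration

# Forward and backward pass, as in the example
loss_before = (w * x + b - 1.0) ** 2
loss_before.backward()

# Apply p_new = p - lr * dL/dp to each parameter without tracking the update
with torch.no_grad():
    w -= lr * w.grad
    b -= lr * b.grad

# Re-evaluate the loss with the updated parameters
loss_after = (w * x + b - 1.0) ** 2

print(f"w = {w.item():.4f}, b = {b.item():.4f}")
print(f"loss before = {loss_before.item():.6f}, after = {loss_after.item():.6f}")
```

With these particular numbers the step lands almost exactly on the target, because 0.46 × 2 + 0.08 = 1.0. That is a coincidence of the chosen values; in general a single step only reduces the loss.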

Partial Derivatives and the Gradient Vector

Real neural networks have thousands or millions of parameters, not just one. When a function depends on multiple variables, like $f(x, y) = x^2 + 2xy + y^2$, we need a separate derivative with respect to each variable. These are called partial derivatives.

The partial derivative $\frac{\partial f}{\partial x}$ treats all other variables as constants and differentiates only with respect to $x$. The collection of all partial derivatives forms the gradient vector:

$$\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)$$

The gradient vector points in the direction of steepest ascent. To minimize a function (reduce loss), we move in the opposite direction — the negative gradient. This is exactly what gradient descent does.

| Notation | Meaning | Example |
| --- | --- | --- |
| ∂f/∂x | Partial derivative: differentiate f with respect to x, treating y as constant | f = x² + 2xy → ∂f/∂x = 2x + 2y |
| ∂f/∂y | Partial derivative: differentiate f with respect to y, treating x as constant | f = x² + 2xy → ∂f/∂y = 2x |
| ∇f (nabla f) | Gradient vector: the vector of all partial derivatives | ∇f = (2x + 2y, 2x) |
| -∇f | Negative gradient: direction of steepest DESCENT | The direction to move to reduce f fastest |
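The table's example, f = x² + 2xy, can be checked numerically. The sketch below estimates each partial derivative by nudging one coordinate while holding the other fixed, then takes a small step along the negative gradient. The helper `partial`, the evaluation point (1, 2), and the step size 0.01 are all illustrative assumptions.

```python
def f(x, y):
    """The table's example function: f = x^2 + 2xy."""
    return x**2 + 2 * x * y

def partial(func, point, index, h=1e-6):
    """Central-difference estimate of the partial derivative along one coordinate."""
    up, down = list(point), list(point)
    up[index] += h      # nudge only the chosen coordinate
    down[index] -= h    # every other coordinate stays constant
    return (func(*up) - func(*down)) / (2 * h)

point = (1.0, 2.0)
grad = (partial(f, point, 0), partial(f, point, 1))
print(f"numeric gradient at {point}: ({grad[0]:.4f}, {grad[1]:.4f})")
# Analytic gradient from the table: (2x + 2y, 2x) = (6, 2)

# A small step along the NEGATIVE gradient should reduce f
step = 0.01
new_point = (point[0] - step * grad[0], point[1] - step * grad[1])
print(f"f before = {f(*point):.4f}, f after = {f(*new_point):.4f}")
```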

PyTorch computes all partial derivatives simultaneously. When you call loss.backward(), every leaf tensor with requires_grad=True gets its gradient populated. You saw this in the computational graph example: $z = xy + x^2$ gave us $\frac{\partial z}{\partial x} = 7$ and $\frac{\partial z}{\partial y} = 2$ in a single backward() call.
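As a quick reminder of how that looks in code, here is a minimal sketch. The input values x = 2 and y = 3 are an assumption inferred from the quoted results (∂z/∂x = y + 2x = 7 and ∂z/∂y = x = 2); the earlier example may have been written differently.

```python
import torch

# Assumed inputs, chosen so that dz/dx = y + 2x = 7 and dz/dy = x = 2
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

z = x * y + x ** 2

# One backward call fills in the gradient of every leaf tensor
z.backward()

print(f"dz/dx = {x.grad.item()}")
print(f"dz/dy = {y.grad.item()}")
```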


The Gradient Accumulation Trap

PyTorch has a behavior that surprises many beginners: calling .backward() does not replace the gradient in .grad; it adds to it. If you call backward() twice without zeroing the gradient between calls, the second gradient is added to the first.

This is by design (it enables techniques like gradient accumulation over mini-batches), but it is also one of the most common bugs in PyTorch training loops. If you forget to zero gradients, each update uses the sum of the current gradient and every stale gradient before it.
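To see why accumulation is useful when it is intentional, the sketch below compares the gradient from one full batch with the gradient accumulated over two half-batches. The toy data, the helper `mse`, and the choice to scale each half-batch loss by 1/2 are illustrative assumptions.

```python
import torch

# Toy data (assumed for illustration): targets are exactly 2 * inputs
xs = torch.tensor([1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs

def mse(w, x, y):
    """Mean squared error of the one-parameter model w * x."""
    return ((w * x - y) ** 2).mean()

# Reference: gradient from the full batch in a single backward pass
w_full = torch.tensor(1.0, requires_grad=True)
mse(w_full, xs, ys).backward()

# Accumulated: two half-batches, no zeroing in between.
# Each loss is divided by 2 so the summed gradients match the full-batch mean.
w_acc = torch.tensor(1.0, requires_grad=True)
for x_chunk, y_chunk in zip(xs.chunk(2), ys.chunk(2)):
    (mse(w_acc, x_chunk, y_chunk) / 2).backward()

print(f"full-batch gradient:  {w_full.grad.item():.4f}")
print(f"accumulated gradient: {w_acc.grad.item():.4f}")
```

The two values should match, which is what lets a large batch be processed in smaller pieces when it does not fit in memory at once.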

Gradient Accumulation Bug and Fix
🐍gradient_accumulation.py
Line 1: import torch

Import PyTorch.

EXECUTION STATE
📚 torch = PyTorch core module
Line 3: x = torch.tensor(3.0, requires_grad=True)

Create x = 3.0 with gradient tracking. x.grad starts as None.

EXECUTION STATE
x = tensor(3.0, requires_grad=True)
x.grad = None (no backward yet)
Line 5: # First backward: y1 = x²

We’ll compute d(x²)/dx = 2x = 6.0.

Line 6: y1 = x ** 2

Computes x² = 9.0.

EXECUTION STATE
y1 = 3.0² = 9.0
Line 7: y1.backward()

Computes dy1/dx = 2x = 2(3) = 6.0. Since x.grad was None, it gets initialized to 6.0.

EXECUTION STATE
dy1/dx = 2x = 2 × 3.0 = 6.0
x.grad after = 6.0 (was None, now 6.0)
Line 8: print x.grad after 1st backward

x.grad = 6.0. This is correct: d(x²)/dx = 2(3) = 6.

EXECUTION STATE
output = After 1st backward: x.grad = 6.0 ✔
Line 10: # Second backward WITHOUT zeroing: BUG!

HERE is the trap. We’re about to compute d(x³)/dx = 3x² = 27.0. But x.grad still holds 6.0 from the first backward. PyTorch will ADD the new gradient to the existing one: 6 + 27 = 33. This is almost certainly not what you want!

Line 11: y2 = x ** 3

Computes x³ = 27.0.

EXECUTION STATE
y2 = 3.0³ = 27.0
Line 12: y2.backward()  # gradients ACCUMULATE!

Computes dy2/dx = 3x² = 3(9) = 27.0. But instead of replacing x.grad, PyTorch ADDS to it: x.grad = 6.0 + 27.0 = 33.0. This is the accumulation behavior — by design in PyTorch, but a common bug source.

EXECUTION STATE
dy2/dx = 3x² = 3 × 9 = 27.0 (the new gradient)
x.grad before = 6.0 (leftover from first backward)
x.grad after = 6.0 + 27.0 = 33.0 — WRONG if we wanted just dy2/dx!
Line 13: print x.grad after 2nd backward (33.0!)

x.grad = 33.0. We wanted 27.0 (the gradient of x³) but got 33.0 (the sum of both gradients). In a training loop, this bug means every update uses the running sum of all past gradients rather than the current one, which typically makes the steps too large and can cause training to diverge.

EXECUTION STATE
output = After 2nd backward: x.grad = 33.0 ❌ (should be 27.0)
Line 15: # The fix: zero the gradient first

The fix is simple: call .zero_() on the gradient tensor before each backward pass. This resets it to 0, so the new gradient doesn’t add to stale values.

Line 16: x.grad.zero_()

Resets x.grad to 0.0. The underscore suffix in PyTorch means ‘in-place operation’ — it modifies the tensor directly without creating a new one. In a training loop, you’d use optimizer.zero_grad() which does this for all parameters at once.

EXECUTION STATE
📚 .zero_() = In-place operation: sets all elements to 0. The underscore convention in PyTorch means ‘modify in place’ (like sort vs sorted in Python).
x.grad before = 33.0 (accumulated)
x.grad after = 0.0 (reset)
Line 17: y3 = x ** 3

Computes x³ = 27.0 again.

EXECUTION STATE
y3 = 3.0³ = 27.0
Line 18: y3.backward()  # now on a clean slate

Computes dy3/dx = 3x² = 27.0. Since x.grad was zeroed, the result is just 0 + 27 = 27.0. Correct!

EXECUTION STATE
x.grad = 0 + 27.0 = 27.0 — correct gradient!
Line 19: print x.grad after zero + backward

x.grad = 27.0. This is the correct gradient of x³ at x=3: d/dx(x³) = 3x² = 3(9) = 27.

EXECUTION STATE
output = After zero + backward: x.grad = 27.0 ✔
```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# First backward: y1 = x²
y1 = x ** 2
y1.backward()
print(f"After 1st backward: x.grad = {x.grad.item()}")

# Second backward WITHOUT zeroing: BUG!
y2 = x ** 3
y2.backward()
print(f"After 2nd backward: x.grad = {x.grad.item()}")

# The fix: zero the gradient first
x.grad.zero_()
y3 = x ** 3
y3.backward()
print(f"After zero + backward: x.grad = {x.grad.item()}")
```

The interactive demo below lets you experience this bug firsthand. Click “loss.backward()” repeatedly and watch the gradient grow without bound in the “Without zero_grad()” mode. Then switch to the correct mode and see how resetting keeps the gradient clean.

[Interactive demo: gradient accumulation]
The Rule: In every training loop iteration, you must zero gradients before calling backward(). The standard pattern is:
  • optimizer.zero_grad(): reset all parameter gradients (recent PyTorch versions set them to None by default rather than to 0)
  • loss = loss_fn(model(input), target): forward pass
  • loss.backward(): compute gradients
  • optimizer.step(): update parameters using gradients
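The four steps above can be sketched as a complete, runnable loop. Everything specific here is an assumption made for illustration: the toy data (y = 2x + 1), the single `nn.Linear` layer, the `nn.MSELoss` loss, the learning rate, and the number of steps.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the random initial weights reproducible

# Toy regression data (assumed for illustration): y = 2x + 1
inputs = torch.linspace(-1.0, 1.0, 20).unsqueeze(1)  # shape (20, 1)
targets = 2.0 * inputs + 1.0

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(200):
    optimizer.zero_grad()                    # 1. clear stale gradients
    loss = loss_fn(model(inputs), targets)   # 2. forward pass
    loss.backward()                          # 3. compute gradients
    optimizer.step()                         # 4. update parameters
    losses.append(loss.item())

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.6f}")
print(f"learned weight = {model.weight.item():.3f}, bias = {model.bias.item():.3f}")
```

On this toy problem the learned weight and bias should end up close to 2 and 1. Moving `optimizer.zero_grad()` out of the loop is a simple way to reproduce the accumulation bug and watch the loss misbehave.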

Controlling the Graph: no_grad() and detach()

Building a computational graph takes memory and computation. During inference (making predictions with a trained model), you don't need gradients — you're not training, just computing outputs. PyTorch provides two mechanisms to disable gradient tracking when you don't need it:

| Mechanism | What It Does | When to Use |
| --- | --- | --- |
| torch.no_grad() | Context manager: disables gradient tracking for all operations inside the block | Inference, evaluation, parameter updates in training loops |
| tensor.detach() | Creates a new tensor with same data but disconnected from the graph | Using a tensor’s value without gradient flow, stopping gradient through part of network |
Disabling Gradient Tracking: no_grad() and detach()
🐍no_grad_detach.py
Line 1: import torch

Import PyTorch.

EXECUTION STATE
📚 torch = PyTorch core module
Line 3: x = torch.tensor(2.0, requires_grad=True)

Create x with gradient tracking enabled.

EXECUTION STATE
x = tensor(2.0, requires_grad=True)
Line 5: # Normal operation: tracked

First, we show what happens normally: operations on a tensor with requires_grad=True create a computational graph.

Line 6: y = x ** 2

Normal operation: PyTorch tracks this and creates a PowBackward0 node. Memory is used to store the graph.

EXECUTION STATE
y = tensor(4.0, grad_fn=<PowBackward0>)
Line 7: print y.requires_grad

True — because y was created from an operation on x (which has requires_grad=True), y also has requires_grad=True. This propagates automatically.

EXECUTION STATE
output = y.requires_grad = True
Line 8: print y.grad_fn

PowBackward0 — the graph node that knows ‘y came from squaring x’.

EXECUTION STATE
output = y.grad_fn = <PowBackward0 object>
Line 10: # torch.no_grad(): disables tracking

torch.no_grad() is a context manager that temporarily disables gradient tracking for ALL operations inside its block. This is used during inference (making predictions) when you don’t need gradients, saving memory and computation.

Line 11: with torch.no_grad():

Enters the no-gradient context. All tensor operations inside this block will NOT build a computational graph, regardless of requires_grad settings. This is essential during inference — a model with millions of parameters would waste enormous memory tracking gradients it will never use.

EXECUTION STATE
📚 torch.no_grad() = Context manager: disables gradient computation inside the block. Reduces memory usage (no graph stored) and speeds up computation. Use for: inference, evaluation, parameter updates.
Line 12: y_no_grad = x ** 2  # inside no_grad

Same operation as before (x² = 4.0), but now NO graph is built. The result tensor has requires_grad=False and grad_fn=None.

EXECUTION STATE
y_no_grad = tensor(4.0) — no grad_fn, no graph. Pure number.
Line 13: print y_no_grad.requires_grad

False — inside no_grad(), the result doesn’t track gradients even though x does.

EXECUTION STATE
output = y_no_grad.requires_grad = False
Line 14: print y_no_grad.grad_fn

None — no computational graph node was created. You CANNOT call y_no_grad.backward().

EXECUTION STATE
output = y_no_grad.grad_fn = None
Line 16: # .detach(): breaks the graph connection

.detach() creates a new tensor that shares the same data but is disconnected from the computational graph. Useful when you need to use a value as a ‘constant’ without gradients flowing through it.

Line 17: x_detached = x.detach()

Creates a new tensor with the same value as x (2.0) but with requires_grad=False. The original x is unaffected. This is useful when you want to use x’s value as a fixed constant in another computation.

EXECUTION STATE
📚 .detach() = Returns a new tensor that shares the same underlying data but is detached from the computational graph. The original tensor is unchanged. Common use: getting a tensor’s value without gradient tracking.
x_detached = tensor(2.0) — same value, but requires_grad=False
Line 18: y_detached = x_detached ** 2

Since x_detached has requires_grad=False, this operation is not tracked. No graph node is created.

EXECUTION STATE
y_detached = tensor(4.0) — no graph, no grad_fn
Line 19: print x_detached.requires_grad

False — detached tensors don’t track gradients.

EXECUTION STATE
output = x_detached.requires_grad = False
Line 20: print y_detached.requires_grad

False — operations on non-tracking tensors produce non-tracking results.

EXECUTION STATE
output = y_detached.requires_grad = False
```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Normal operation: tracked
y = x ** 2
print(f"y.requires_grad = {y.requires_grad}")
print(f"y.grad_fn = {y.grad_fn}")

# torch.no_grad(): disables tracking
with torch.no_grad():
    y_no_grad = x ** 2
    print(f"y_no_grad.requires_grad = {y_no_grad.requires_grad}")
    print(f"y_no_grad.grad_fn = {y_no_grad.grad_fn}")

# .detach(): breaks the graph connection
x_detached = x.detach()
y_detached = x_detached ** 2
print(f"x_detached.requires_grad = {x_detached.requires_grad}")
print(f"y_detached.requires_grad = {y_detached.requires_grad}")
```
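The "stopping gradient through part of the network" use of detach() is easier to see with a computation where the same tensor appears twice. In this sketch (an illustrative example, not from the original walkthrough), detaching one factor of x · x makes autograd treat that factor as a constant, so the gradient drops from 2x to x.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Fully tracked: y = x * x, so dy/dx = 2x = 4
y = x * x
y.backward()
full_grad = x.grad.item()

x.grad.zero_()  # clear the accumulated gradient before the next backward

# One factor detached: it acts as the constant 2.0, so dy/dx = 2
y_partial = x * x.detach()
y_partial.backward()
partial_grad = x.grad.item()

# detach() does not copy: the new tensor shares storage with the original
shares_memory = x.detach().data_ptr() == x.data_ptr()

print(f"gradient with full tracking: {full_grad}")
print(f"gradient with one factor detached: {partial_grad}")
print(f"detached tensor shares memory with x: {shares_memory}")
```

Because the storage is shared, an in-place change to a detached tensor also changes the original's data; call .clone() on it first if an independent copy is needed.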
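Memory Impact: During a normal forward pass, autograd keeps the intermediate results it will need for the backward pass. For very large models (GPT-3, for instance, has 175 billion parameters), these saved activations can account for a substantial share of memory use. Wrapping inference in torch.no_grad() avoids storing them, since no graph nodes are recorded. The exact savings depend on the model, the batch size, and the input length.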

Gradient Descent with Autograd

Now we bring everything together. We'll build a complete gradient descent loop that uses autograd to minimize a function. The pattern is the same one used to train every neural network: forward pass, backward pass, update parameters, repeat.

Our goal: find the value of $x$ that minimizes $f(x) = (x - 3)^2$. The minimum is at $x = 3$ (where $f(3) = 0$). We start at $x = 0$ and let gradient descent find the minimum by following the negative gradient step by step.
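Because this objective is a simple quadratic, the gradient descent iterates can also be worked out by hand, which gives a useful reference for the numbers the loop prints. The derivation below is a sketch that writes $\eta$ for the learning rate and $x_k$ for the value of $x$ after $k$ updates (notation introduced here for illustration); the final line plugs in the starting point $x_0 = 0$ and the learning rate 0.1 used in the code.

```latex
\begin{aligned}
f'(x) &= 2(x - 3) \\
x_{k+1} &= x_k - \eta\, f'(x_k) = x_k - 2\eta\,(x_k - 3) \\
x_{k+1} - 3 &= (1 - 2\eta)\,(x_k - 3) \\
x_k - 3 &= (1 - 2\eta)^k\,(x_0 - 3) \\
x_{10} &= 3 - 3\,(0.8)^{10} \approx 2.6779 \quad (x_0 = 0,\ \eta = 0.1)
\end{aligned}
```

Each update multiplies the distance to the minimum by $1 - 2\eta$, so the iterates converge whenever $|1 - 2\eta| < 1$.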

Complete Gradient Descent Loop with Autograd
🐍gradient_descent.py
Line 1: import torch

Import PyTorch for tensors and automatic differentiation.

EXECUTION STATE
📚 torch = PyTorch core module
Line 3: # Goal: find x that minimizes f(x) = (x - 3)²

We want to find the value of x that makes (x - 3)² as small as possible. The answer is obviously x = 3 (where the squared error is 0), but we’ll let gradient descent find it automatically. This is the core of how neural networks learn — they minimize a loss function by following gradients.

Line 4: # Minimum is at x = 3 (where derivative = 0)

f’(x) = 2(x - 3). Setting f’(x) = 0: 2(x - 3) = 0 → x = 3. The derivative is negative when x < 3 (go right) and positive when x > 3 (go left). Gradient descent follows this signal to the minimum.

Line 6: x = torch.tensor(0.0, requires_grad=True)

Start at x = 0. This is far from the minimum at x = 3. We’ll watch gradient descent walk from 0 toward 3 step by step.

EXECUTION STATE
x = tensor(0.0, requires_grad=True) — our starting point, far from the minimum
Line 7: learning_rate = 0.1

The step size for gradient descent. Controls how big each update is. Too large: overshoots the minimum. Too small: takes forever. 0.1 is a reasonable starting point for this simple problem.

EXECUTION STATE
learning_rate = 0.1 — each step moves 10% of the gradient magnitude
Line 9: print header

Prints the title of our optimization run.

EXECUTION STATE
output = Gradient Descent: minimize f(x) = (x - 3)²
Line 10: print column headers

Column headers for the step-by-step output: step number, current x, function value, gradient.

EXECUTION STATE
output = Step x f(x) grad
Line 11: print separator

A line of dashes to separate the header from the data.

Line 13: for step in range(10):  # the training loop

Run 10 steps of gradient descent. Each iteration: (1) compute loss, (2) compute gradient via backward(), (3) update x, (4) zero gradient. This is the exact same structure as training a neural network — the only difference is that real networks have thousands of parameters instead of one.

LOOP TRACE · 10 iterations
step=0
x = 0.0000
loss = (0 - 3)² = 9.0000
grad = 2(0 - 3) = -6.0000
x_new = 0 - 0.1×(-6) = 0.6000
step=1
x = 0.6000
loss = (0.6 - 3)² = 5.7600
grad = 2(0.6 - 3) = -4.8000
x_new = 0.6 - 0.1×(-4.8) = 1.0800
step=2
x = 1.0800
loss = 3.6864
grad = -3.8400
x_new = 1.4640
step=3
x = 1.4640
loss = 2.3593
grad = -3.0720
x_new = 1.7712
step=4
x = 1.7712
loss = 1.5099
grad = -2.4576
x_new = 2.0170
step=5
x = 2.0170
loss = 0.9664
grad = -1.9661
x_new = 2.2136
step=6
x = 2.2136
loss = 0.6185
grad = -1.5729
x_new = 2.3709
step=7
x = 2.3709
loss = 0.3958
grad = -1.2583
x_new = 2.4967
step=8
x = 2.4967
loss = 0.2533
grad = -1.0066
x_new = 2.5973
step=9
x = 2.5973
loss = 0.1621
grad = -0.8053
x_new = 2.6779
Line 14: # Forward: compute loss

Step 1 of the training loop: compute the function value (the loss) using the current x.

Line 15: loss = (x - 3.0) ** 2

The loss function f(x) = (x - 3)². At step 0: loss = (0 - 3)² = 9.0. The graph is built: x → SubBackward0 → PowBackward0.

EXECUTION STATE
loss at step 0 = (0.0 - 3.0)² = 9.0
Line 17: # Backward: compute gradient

Step 2: use autograd to compute dloss/dx = 2(x - 3).

Line 18: loss.backward()

Computes the gradient of loss with respect to x. At step 0: dloss/dx = 2(0 - 3) = -6. The negative gradient tells us: x should INCREASE to reduce the loss (move toward 3).

EXECUTION STATE
dloss/dx at step 0 = 2(0 - 3) = -6.0 — negative means increase x
Line 20: grad = x.grad.item()

Extract the gradient value as a Python float for printing.

EXECUTION STATE
grad at step 0 = -6.0
Line 21: print step information

Prints the current state: step number, x position, loss value, and gradient.

EXECUTION STATE
output at step 0 = 0 0.0000 9.0000 -6.0000
Line 23: # Update: step in negative gradient direction

Step 3: move x in the direction that decreases the loss. The update rule is x_new = x - lr × gradient. The gradient points uphill, so subtracting it moves us downhill.

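Line 24: with torch.no_grad():

We wrap the parameter update in no_grad() so that this arithmetic is not recorded in the computational graph. It is purely a parameter update, not part of the forward pass. PyTorch also enforces this: an in-place operation on a leaf tensor that requires grad raises a RuntimeError outside a no_grad() block.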

EXECUTION STATE
📚 torch.no_grad() = Disables gradient tracking. The update x -= lr * grad is not a computation we want to differentiate through — it’s just moving a number.
Line 25: x -= learning_rate * x.grad

The gradient descent update rule. At step 0: x = 0.0 - 0.1 × (-6.0) = 0.0 + 0.6 = 0.6. The negative gradient (-6) means “go right”, so x increases from 0 toward 3.

EXECUTION STATE
learning_rate * x.grad = 0.1 × (-6.0) = -0.6
x = x - (-0.6) = 0.0 + 0.6 = 0.6
Line 27: # Zero gradient for next iteration

Step 4: reset the gradient to zero before the next backward() call. Without this, gradients would accumulate across iterations (the bug we saw earlier).

Line 28: x.grad.zero_()

Resets x.grad to 0. This is critical — without it, each backward() would add to the existing gradient, making the updates larger and larger until training explodes.

EXECUTION STATE
📚 .zero_() = Sets the gradient to 0 in-place. In real training, you’d call optimizer.zero_grad() which does this for all parameters.
Line 30: print final result

After 10 steps, x = 2.6779. Not quite at 3.0 yet, but steadily approaching it. With more steps or a somewhat larger learning rate, x would get closer to 3.0. The exponential convergence is visible: each step multiplies the remaining error by 1 - 2×lr = 0.8, so the error shrinks by 20% per step.

EXECUTION STATE
output = Final x = 2.6779 (target: 3.0)
→ convergence = After 10 steps: 89.3% of the way to 3.0. After 20 steps: x ≈ 2.9654. After 50 steps: x ≈ 2.99996.
```python
import torch

# Goal: find x that minimizes f(x) = (x - 3)²
# Minimum is at x = 3 (where derivative = 0)

x = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

print("Gradient Descent: minimize f(x) = (x - 3)²")
print(f"{'Step':>4} {'x':>8} {'f(x)':>8} {'grad':>8}")
print("-" * 36)

for step in range(10):
    # Forward: compute loss
    loss = (x - 3.0) ** 2

    # Backward: compute gradient
    loss.backward()

    grad = x.grad.item()
    print(f"{step:4d} {x.item():8.4f} {loss.item():8.4f} {grad:8.4f}")

    # Update: step in negative gradient direction
    with torch.no_grad():
        x -= learning_rate * x.grad

    # Zero gradient for next iteration
    x.grad.zero_()

print(f"\nFinal x = {x.item():.4f} (target: 3.0)")
```
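In practice the manual update and the manual zeroing are usually delegated to an optimizer. The sketch below repeats the same ten steps with `torch.optim.SGD` and compares the result with the value 3 − 3 · 0.8¹⁰, which follows from each step multiplying the distance to 3 by 1 − 2 × 0.1 = 0.8. It is an illustrative variant of the loop, not a replacement for it.

```python
import torch

x = torch.tensor(0.0, requires_grad=True)

# SGD with no momentum performs the same update: x <- x - lr * x.grad
optimizer = torch.optim.SGD([x], lr=0.1)

for step in range(10):
    optimizer.zero_grad()        # replaces x.grad.zero_()
    loss = (x - 3.0) ** 2        # forward
    loss.backward()              # backward
    optimizer.step()             # replaces the manual no_grad() update

# Each step shrinks the distance to 3 by a factor of 0.8, starting from 3
closed_form = 3.0 - 3.0 * 0.8 ** 10

print(f"SGD result:  {x.item():.4f}")
print(f"closed form: {closed_form:.4f}")
```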

Notice the pattern: the loss decreases at each step (9.0 → 5.76 → 3.69 → ...), and $x$ approaches 3 (0 → 0.6 → 1.08 → ...). The gradient gets smaller as we approach the minimum, so the steps naturally shrink. This is why gradient descent with a suitably small learning rate converges smoothly on a convex quadratic like this one.
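How smooth the convergence is depends on the learning rate. The sketch below reruns the same ten steps with several learning rates, using the hand-derived gradient 2(x − 3) in plain Python instead of autograd so the effect is easy to isolate. The function name and the particular learning rates are illustrative choices.

```python
def run_gradient_descent(lr, steps=10, x0=0.0):
    """Minimize f(x) = (x - 3)^2 with the analytic gradient 2(x - 3)."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * (x - 3.0)
        x -= lr * grad
    return x

# Too small, the value used in this section, an ideal value, and too large
results = {lr: run_gradient_descent(lr) for lr in (0.01, 0.1, 0.5, 1.1)}

for lr, x_final in results.items():
    print(f"lr = {lr:<4} -> x after 10 steps = {x_final:.4f}")
```

With lr = 0.01 the iterate has barely left the starting point after ten steps, lr = 0.5 reaches the minimum in a single step for this particular quadratic, and lr = 1.1 overshoots by more each time and moves away from 3.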

3D Visualization: Gradient Descent on a Loss Surface

In real neural networks, the loss is a function of many parameters. With two parameters, we can visualize it as a 3D surface where height represents the loss. Watch the red ball follow the gradient downhill to find the minimum. The yellow arrow shows the direction of steepest descent at each point — this is exactly what autograd computes.

[Interactive 3D visualization: gradient descent on a loss surface]
Try the “Elongated Valley” surface. Notice how the ball oscillates back and forth across the narrow valley. This is a fundamental problem with basic gradient descent on elongated loss surfaces. Adaptive optimizers like Adam and RMSProp are designed to reduce this problem; we'll cover them in a later chapter.
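The oscillation can be reproduced with a few lines of plain Python. The surface used here, f(x, y) = x² + 25y², is an assumed stand-in for an elongated valley (the visualization's actual surface may differ), and the starting point and learning rate are chosen only to make the effect visible.

```python
def grad(x, y):
    """Gradient of the assumed valley f(x, y) = x^2 + 25*y^2."""
    return 2.0 * x, 50.0 * y

x, y = 10.0, 1.0
lr = 0.038  # close to the largest stable value for the steep y direction

path = [(x, y)]
for _ in range(8):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy
    path.append((x, y))

for step, (px, py) in enumerate(path):
    print(f"step {step}: x = {px:8.4f}, y = {py:8.4f}")
```

The y coordinate flips sign on every step because the steep direction is overshot each time, while x creeps slowly along the flat valley floor. A single learning rate has to be small enough for the steep direction, which leaves it too small for the shallow one.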

Summary and Bridge

In this section, we learned how PyTorch automatically computes derivatives — the gradients that drive neural network training. Here are the key concepts:

| Concept | What It Means | PyTorch API |
| --- | --- | --- |
| Derivative | Rate of change: the slope of the function at a point | Computed automatically by autograd |
| Chain rule | Derivative through composed functions = product of local derivatives | Applied internally by .backward() |
| Computational graph | Record of operations that enables automatic differentiation | Built automatically during forward pass |
| requires_grad | Enable gradient tracking on a tensor (parameters only) | torch.tensor(..., requires_grad=True) |
| .backward() | Compute all gradients via reverse-mode autodiff | loss.backward() |
| .grad | Access the computed gradient of a leaf tensor | weight.grad |
| zero_grad() | Reset gradients to zero (MUST do before each backward) | optimizer.zero_grad() or tensor.grad.zero_() |
| no_grad() | Disable gradient tracking to save memory | with torch.no_grad(): |
| Gradient descent | Update parameters in negative gradient direction | w -= lr * w.grad |
The Training Loop Pattern: Every neural network training loop follows the same four steps: (1) forward pass to compute loss, (2) backward pass to compute gradients, (3) parameter update using gradients, (4) zero gradients for next iteration. Everything else — data loading, model architecture, learning rate schedules — is built around this core loop.

In the next chapter, we'll use these tools to build actual neural networks. You now have all the PyTorch essentials: tensors (Section 2-3), operations (Section 3), and automatic differentiation (this section). Chapter 3 will combine them into complete forward passes, loss computation, and training loops for real networks.
