Boo-AI — Master Artificial Intelligence by Building from Scratch

Why Derivatives? The Slope That Guides Learning

Imagine you are blindfolded, standing on a hilly landscape. Your goal is to reach the lowest point. You cannot see, but you can feel the ground beneath your feet. If the ground slopes downward to your right, you step right. If it slopes downward to your left, you step left. The steepness of the slope tells you how big a step to take.

This is exactly what a neural network does during training. The “landscape” is the loss function — a mathematical surface where the height represents how wrong the network's predictions are. The “slope” is the derivative, and the process of stepping downhill is called gradient descent.

The Central Idea: A derivative tells you how a function's output changes when you nudge its input. For a loss function, it answers: “If I slightly increase this weight, does the error go up or down, and by how much?” This is ALL a neural network needs to learn.

Formally, the derivative of a function $f(x)$ at a point $x$ is defined as:

$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$

This is the slope of the tangent line at point $x$. When the derivative is positive, the function is increasing (going uphill). When negative, it is decreasing (going downhill). When zero, the function is flat: a potential minimum or maximum.
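
As a quick worked instance of this definition, take $f(x) = x^2$ (a function picked here only for illustration). The difference quotient simplifies before the limit is taken:

$\frac{(x + h)^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h \;\longrightarrow\; 2x \quad \text{as } h \to 0$

So $f'(x) = 2x$: negative for $x < 0$, zero at $x = 0$, and positive for $x > 0$, which matches the downhill, flat, and uphill cases described above.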

Explore this interactively below. Drag the slider to move along the curve and watch how the tangent line (and its slope) changes at each point.

[Interactive derivative visualization]

Try the f(x) = ½(x-3)² function. Notice that the derivative is zero at x = 3, which is the minimum. To the left of 3 the derivative is negative, so stepping against it moves you right, toward the minimum. To the right of 3 it is positive, so stepping against it moves you left, again toward the minimum. This is the signal that gradient descent follows.
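
To make that signal concrete, here is a minimal sketch of gradient descent on f(x) = ½(x-3)², whose derivative is x - 3. The helper name `descend`, the learning rate of 0.1, and the 50 steps are assumptions chosen for illustration, not values from the text.

```python
def f(x):
    return 0.5 * (x - 3) ** 2

def f_prime(x):
    # Derivative of 0.5 * (x - 3)^2 is (x - 3)
    return x - 3

def descend(x0, learning_rate=0.1, steps=50):
    """Repeatedly step against the derivative, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * f_prime(x)
    return x

x_left = descend(-2.0)   # starts left of the minimum, moves right
x_right = descend(8.0)   # starts right of the minimum, moves left
print(f"from -2.0: x = {x_left:.4f}, f(x) = {f(x_left):.6f}")
print(f"from  8.0: x = {x_right:.4f}, f(x) = {f(x_right):.6f}")
```

Both runs end close to x = 3 from opposite sides, because the sign of the derivative always points the update toward the minimum.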

Computing Derivatives: Three Approaches

There are three ways to compute derivatives in practice, each with different trade-offs:

Approach	How It Works	Pros	Cons
Symbolic	Apply calculus rules (power rule, chain rule) by hand or with CAS	Exact answer, closed-form expression	Doesn’t scale to complex functions with millions of operations
Numerical	Approximate using f’(x) ≈ [f(x+h) - f(x)] / h	Simple to implement, works for any function	Slow (at least one extra function evaluation per parameter), truncation and rounding errors
Automatic (autograd)	Record operations during forward pass, replay chain rule backward	Exact up to floating-point rounding, fast, scales to millions of parameters	Requires a framework (PyTorch, JAX, etc.)

Neural networks use automatic differentiation (the third approach). PyTorch's autograd system implements this, giving you exact derivatives with a single function call. But let's first see the numerical approach in Python to build intuition, then see how autograd replaces it.

Numerical Derivative in Python

The simplest way to compute a derivative: evaluate the function at two very close points and compute the slope between them. For $f(x) = x^2 + 3x + 2$, the analytical derivative is $f'(x) = 2x + 3$, so at $x = 2$ we expect $f'(2) = 7$. Let's verify numerically.

Numerical Derivative: The Limit Definition in Code

🐍numerical_derivative.py


Line 1: import numpy as np

We import NumPy for numerical computing. In this example we use plain Python math, but NumPy will be essential in later code blocks for array operations.

EXECUTION STATE

📚 numpy = Numerical computing library. Provides ndarray, mathematical functions, and linear algebra operations. We alias it as np by convention.

Line 3: def f(x), the function we want to differentiate

We define a simple quadratic function f(x) = x² + 3x + 2. This is a parabola that opens upward. We chose this because its derivative is easy to verify by hand: f’(x) = 2x + 3.

EXECUTION STATE

⬇ input: x = A single real number — the point at which we evaluate the function. For our example, x = 2.0.

⬆ returns = f(x) = x² + 3x + 2. At x=2: 4 + 6 + 2 = 12.0

Line 4: docstring, our function definition

The docstring documents what this function computes. f(x) = x² + 3x + 2 is a degree-2 polynomial (quadratic). Its graph is a parabola with minimum at x = -3/2.

Line 5: return x**2 + 3*x + 2

Computes and returns the value of f(x) = x² + 3x + 2. The ** operator is Python’s exponentiation. For x=2.0: 2² + 3×2 + 2 = 4 + 6 + 2 = 12.0.

EXECUTION STATE

x**2 = 2.0² = 4.0

3*x = 3 × 2.0 = 6.0

⬆ return: x**2 + 3*x + 2 = 4.0 + 6.0 + 2 = 12.0

Line 7: def numerical_derivative(f, x, h=1e-7), computing the slope

This function implements the limit definition of the derivative: f’(x) ≈ [f(x+h) - f(x)] / h. It draws a tiny secant line from x to x+h and computes its slope. As h → 0, this approaches the true derivative.

EXECUTION STATE

⬇ input: f = The function to differentiate. We pass in our f(x) = x² + 3x + 2. This is a Python function object.

⬇ input: x = The point at which to compute the derivative. For our example, x = 2.0.

⬇ input: h = 1e-7 = The tiny step size. h = 0.0000001. Smaller h = more accurate, but too small causes floating-point errors. 1e-7 is a good default balance.

⬆ returns = The approximate derivative f’(x) as a float. For x=2: returns 7.000000 (matching the analytical answer).

Line 8: docstring, the limit definition

The derivative is defined as the limit: f’(x) = lim(h→0) [f(x+h) - f(x)] / h. We can’t actually take h=0 (division by zero), so we use a very small h instead.

Line 9: return (f(x + h) - f(x)) / h

The core computation: evaluate f at two very close points (x and x+h), take their difference, and divide by h. This gives the slope of the secant line, which approximates the slope of the tangent line (the derivative).

EXECUTION STATE

f(x + h) = f(2.0 + 1e-7) = f(2.0000001) = (2.0000001)² + 3×(2.0000001) + 2 = 12.0000007...

f(x) = f(2.0) = 12.0

f(x+h) - f(x) = 12.0000007 - 12.0 = 0.0000007 (the tiny rise)

h = 1e-7 = 0.0000001 (the tiny run)

⬆ return: (f(x+h) - f(x)) / h = 0.0000007 / 0.0000001 = 7.000000 (rise/run = slope)

Line 11: x = 2.0

We choose x = 2.0 as our evaluation point. The derivative of f(x) = x² + 3x + 2 is f’(x) = 2x + 3. At x=2: f’(2) = 2(2) + 3 = 7. We’ll verify this numerically.

EXECUTION STATE

x = 2.0 — the point where we want to know the slope

Line 13: print(f"f({x}) = {f(x)}")

Prints the function value at x=2. This is the height of the curve at our point.

EXECUTION STATE

output = f(2.0) = 12.0

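Line 14: print numerical derivative

Prints the numerically computed derivative. The secant-line approximation prints as 7.000000, agreeing with the exact answer to the six decimal places shown. The underlying value still differs from 7 by roughly h = 1e-7, the error of the forward-difference approximation.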

EXECUTION STATE

numerical_derivative(f, 2.0) = 7.000000 — computed by the limit formula

Line 15: print analytical derivative

Prints the exact analytical derivative for comparison. We know f’(x) = 2x + 3 from calculus (power rule + constant rule). At x=2: 2(2) + 3 = 7.

EXECUTION STATE

2*x + 3 = 2(2.0) + 3 = 7.000000 — exact from calculus

→ match! = Numerical = 7.000000, Analytical = 7.000000. They agree! The limit definition works.


import numpy as np

def f(x):
    """Our function: f(x) = x² + 3x + 2"""
    return x**2 + 3*x + 2

def numerical_derivative(f, x, h=1e-7):
    """Compute derivative using the limit definition."""
    return (f(x + h) - f(x)) / h

x = 2.0

print(f"f({x}) = {f(x)}")
print(f"f'({x}) numerical  = {numerical_derivative(f, x):.6f}")
print(f"f'({x}) analytical = {2*x + 3:.6f}")
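
The trade-off in choosing the step size h can be seen by sweeping it over several orders of magnitude. The sketch below is an illustration under stated assumptions: the helper names `forward_difference` and `central_difference` and the particular h values are chosen here, and the central-difference variant is an addition that does not appear in the code above.

```python
def f(x):
    return x**2 + 3*x + 2

def forward_difference(f, x, h):
    # Same formula as numerical_derivative above
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h):
    # Symmetric variant: samples the function on both sides of x
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
exact = 2 * x + 3  # analytical derivative at x = 2

errors = {}
for h in (1e-1, 1e-4, 1e-7, 1e-10, 1e-13):
    fwd_err = abs(forward_difference(f, x, h) - exact)
    ctr_err = abs(central_difference(f, x, h) - exact)
    errors[h] = (fwd_err, ctr_err)
    print(f"h = {h:.0e}  forward error = {fwd_err:.2e}  central error = {ctr_err:.2e}")
```

For large h the forward-difference error is dominated by the step itself (for this quadratic it is about h). For very small h, rounding in f(x + h) - f(x) takes over and the error grows again, which is why a mid-range default such as 1e-7 is a reasonable compromise.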

The Same Derivative with PyTorch Autograd

Now watch autograd do the same thing in three lines. No finite differences, no choosing $h$, and no truncation error: the result is exact up to floating-point rounding. Just create a tensor with $\texttt{requires\_grad=True}$, compute the function, call $\texttt{.backward()}$, and read the gradient from $\texttt{.grad}$.

PyTorch Autograd: Exact Derivative in Three Lines

🐍autograd_first.py


Line 1: import torch

Import PyTorch. This gives us torch.tensor (like np.array but with gradient tracking), torch.autograd (automatic differentiation engine), and the entire deep learning toolkit.

EXECUTION STATE

📚 torch = PyTorch’s core module. Key for this section: torch.tensor() creates tensors, requires_grad=True enables gradient tracking, .backward() computes gradients.

Line 3: x = torch.tensor(2.0, requires_grad=True)

Creates a scalar tensor with value 2.0 and tells PyTorch: “track every operation on this tensor so you can compute derivatives later.” This is the fundamental switch that turns on autograd. Without requires_grad=True, PyTorch treats x as a plain number with no gradient tracking.

EXECUTION STATE

📚 torch.tensor() = Creates a new tensor. First arg is the data (a Python float here). Unlike np.array, can optionally track gradients for automatic differentiation.

⬇ arg 1: 2.0 = The scalar value. Creates a 0-dimensional tensor (a single number). Shape: torch.Size([]).

⬇ arg 2: requires_grad=True = The ON switch for autograd. When True, PyTorch records every operation on x into a computational graph. When False (default), no tracking happens and .grad stays None.

⬆ result: x = tensor(2.0, requires_grad=True) — a leaf tensor in the computational graph. ‘Leaf’ means it was created by the user (not by an operation on other tensors).

Line 5: y = x**2 + 3*x + 2

Computes y = f(x) = x² + 3x + 2. Because x has requires_grad=True, PyTorch records each operation: (1) x**2 creates a node with PowBackward, (2) 3*x creates MulBackward, (3) the additions create AddBackward nodes. The computational graph is built automatically during this line.

EXECUTION STATE

x**2 = 2.0² = 4.0 — recorded as PowBackward0

3*x = 3 × 2.0 = 6.0 — recorded as MulBackward0

x**2 + 3*x = 4.0 + 6.0 = 10.0 — recorded as AddBackward0

y = tensor(12.0, grad_fn=<AddBackward0>) — 10.0 + 2 = 12.0

y.grad_fn = AddBackward0 — this is the ‘breadcrumb’ that lets PyTorch trace backward through the computation

Line 7: y.backward()

THE key autograd call. This tells PyTorch: “Start from y, walk backward through the computational graph, and compute dy/dx using the chain rule.” PyTorch visits each node in reverse order, multiplying local derivatives together (chain rule), and stores the final result in x.grad. The entire backward pass happens in this single line — no manual calculus needed.

EXECUTION STATE

📚 .backward() = Tensor method: computes gradients of this tensor with respect to all leaf tensors that have requires_grad=True. For a scalar output (like loss or y here), no arguments needed. Internally calls torch.autograd.backward().

→ what happens internally =
1. Start at y with gradient 1.0 (dy/dy = 1)
2. Walk backward through AddBackward, PowBackward, MulBackward
3. Apply chain rule at each node
4. Accumulate result into x.grad

→ after this call = x.grad = 7.0 (because dy/dx = 2x + 3 = 2(2) + 3 = 7)

Line 9: print x value

Prints the value of x. The .item() method extracts the Python float from a scalar tensor.

EXECUTION STATE

📚 .item() = Tensor method: returns the scalar value as a standard Python number (int or float). Only works on tensors with exactly one element.

output = x = 2.0

Line 10: print y value

Prints y = 12.0. This is f(2.0) = 4 + 6 + 2 = 12.0.

EXECUTION STATE

output = y = x² + 3x + 2 = 12.0

Line 11: print dy/dx

This is the magic of autograd. x.grad contains the derivative dy/dx, computed automatically by backward(). The value 7.0 matches our analytical answer: dy/dx = 2x + 3 = 2(2) + 3 = 7. No manual differentiation needed!

EXECUTION STATE

x.grad = tensor(7.0) — this is dy/dx at x=2.0

📚 .grad = Attribute on leaf tensors: stores the gradient computed by .backward(). Is None before backward() is called. Accumulates across multiple backward() calls (a common source of bugs!).

output = dy/dx = 7.0

→ verification = Manual calculus: d/dx(x² + 3x + 2) = 2x + 3 = 2(2) + 3 = 7 ✔


import torch

x = torch.tensor(2.0, requires_grad=True)

y = x**2 + 3*x + 2

y.backward()

print(f"x = {x.item()}")
print(f"y = x² + 3x + 2 = {y.item()}")
print(f"dy/dx = {x.grad.item()}")
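
The walkthrough notes that .grad accumulates across multiple backward() calls. A short sketch of that behavior, reusing the same function, may help; the variable names `first`, `accumulated`, and `after_reset` are illustrative choices.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# First backward pass: dy/dx = 2x + 3 = 7
y = x**2 + 3*x + 2
y.backward()
first = x.grad.item()

# Second backward pass on a freshly built graph: the new 7 is ADDED to x.grad
y = x**2 + 3*x + 2
y.backward()
accumulated = x.grad.item()

# Reset the stored gradient in place, then recompute
x.grad.zero_()
y = x**2 + 3*x + 2
y.backward()
after_reset = x.grad.item()

print(first, accumulated, after_reset)
```

This is why training loops clear gradients between steps: without the reset, each backward pass adds to whatever was already stored.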

Key Insight: The numerical approach computes an approximation that carries both truncation error (from the finite step $h$) and floating-point rounding error. Autograd applies the calculus rules (chain rule) automatically during the backward pass, so its result is exact up to floating-point rounding. For a network with millions of parameters, one backward pass yields all gradients at a cost that is typically a small constant multiple of one forward pass.

The Chain Rule: Engine of Deep Learning

Neural networks are built from chains of functions: input passes through layer 1, the output passes through layer 2, then layer 3, and so on until the final prediction. To train the network, we need the derivative of the loss with respect to parameters in the very first layer — but the loss is many function compositions away.

The chain rule solves this elegantly. If $y = f(g(x))$, then:

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} \quad \text{where } u = g(x)$

In words: multiply the local derivatives at each link in the chain. Each function only needs to know its own derivative; the chain rule connects them. For a network whose $n$ layers produce intermediate values $u_1, \dots, u_n$, with the output $y$ computed from $u_n$, the total derivative is the product of the local derivatives, one per link:

$\frac{dy}{dx} = \frac{dy}{du_n} \cdot \frac{du_n}{du_{n-1}} \cdots \frac{du_2}{du_1} \cdot \frac{du_1}{dx}$
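
The multi-link formula can be sketched in plain Python by pairing each function with its derivative and multiplying the local derivatives along the chain. The three links below and the helper name `forward_and_derivative` are assumptions made for illustration.

```python
# Each link is a (function, derivative) pair; these three are illustrative choices.
links = [
    (lambda v: 3 * v + 1, lambda v: 3.0),     # u1 = 3x + 1
    (lambda v: v ** 2,    lambda v: 2 * v),   # u2 = u1^2
    (lambda v: 0.5 * v,   lambda v: 0.5),     # y  = 0.5 * u2
]

def forward_and_derivative(x, links):
    """Return (output, d output / d x) by multiplying local derivatives."""
    value = x
    total = 1.0
    for fn, dfn in links:
        total *= dfn(value)   # local derivative evaluated at this link's input
        value = fn(value)     # pass the value on to the next link
    return value, total

y, dy_dx = forward_and_derivative(2.0, links)
print(f"y = {y}, dy/dx = {dy_dx}")
```

Here the product is accumulated from the input side while the values are computed. For scalars the order of multiplication does not matter, so the result is the same as multiplying from the output side, which is the direction backpropagation uses.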

Chain Rule by Hand in Python

Let's compute the chain rule manually for $y = f(g(x))$ where $g(x) = 3x + 1$ and $f(u) = u^2$, so $y = (3x + 1)^2$.

Chain Rule by Hand: Multiply Local Derivatives

🐍chain_rule_manual.py


Line 1: import numpy as np

Import NumPy. This scalar example uses only plain Python arithmetic, so the import is not strictly needed; it is included as a convention in scientific Python code.

EXECUTION STATE

📚 numpy = Numerical computing library. Not strictly needed for this scalar example, but included to establish the pattern.

Line 3: comment, y = f(g(x)) as function composition

We’re computing a composition of two functions: first apply g to x, then apply f to the result. This is exactly what happens in neural networks: layer 1 transforms the input, layer 2 transforms that output, and so on. The chain rule tells us how to differentiate through this composition.

Line 5: def g(x), the inner function

g(x) = 3x + 1 is the inner function. In a neural network, this would be like a linear layer: z = wx + b where w=3, b=1. Its derivative is simple: dg/dx = 3 (the slope of the line).

EXECUTION STATE

⬇ input: x = A scalar value. For our example: x = 2.0

⬆ returns = g(x) = 3x + 1. At x=2: 3(2) + 1 = 7.0

Line 6: return 3 * x + 1

Computes the linear function 3x + 1. This is the simplest possible neural network layer: multiply by a weight (3) and add a bias (1).

EXECUTION STATE

⬆ return: 3 * x + 1 = 3 × 2.0 + 1 = 7.0

Line 8: def f(u), the outer function

f(u) = u² is the outer function. In a neural network, this could be part of a loss function (e.g., MSE loss squares the error). Its derivative is df/du = 2u.

EXECUTION STATE

⬇ input: u = The output of g(x). For our example: u = g(2.0) = 7.0

⬆ returns = f(u) = u². At u=7: 7² = 49.0

Line 9: return u ** 2

Computes u squared. Python’s ** is the exponentiation operator.

EXECUTION STATE

⬆ return: u ** 2 = 7.0² = 49.0

Line 11: x = 2.0

Our input value. We’ll trace the computation forward (x → u → y) then trace the derivatives backward (dy/du × du/dx = dy/dx).

EXECUTION STATE

x = 2.0 — the starting point of our computation chain

Line 12: u = g(x)

Forward step 1: apply the inner function g. The value flows from x to u through the first link of the chain.

EXECUTION STATE

u = g(2.0) = 3 × 2.0 + 1 = 7.0

Line 13: y = f(u)

Forward step 2: apply the outer function f. Now we have the final output. The forward pass is complete: x=2 → u=7 → y=49.

EXECUTION STATE

y = f(7.0) = 7.0² = 49.0

Line 15: comment, local derivatives

Now we compute the derivative of each function at the value it actually received. These are ‘local’ derivatives — each function only knows about its own input and output. The chain rule multiplies them together to get the global derivative.

Line 16: du_dx = 3

The derivative of g(x) = 3x + 1 with respect to x is simply 3 (the coefficient of x). This is the local derivative at the first link of the chain. It tells us: if x changes by ε, then u changes by 3ε.

EXECUTION STATE

du_dx = 3 — d/dx(3x + 1) = 3. The slope of the line y = 3x + 1.

Line 17: dy_du = 2 * u

The derivative of f(u) = u² with respect to u is 2u (power rule). At u=7: dy/du = 2(7) = 14. This is the local derivative at the second link. It tells us: if u changes by ε, then y changes by 14ε.

EXECUTION STATE

dy_du = 2 * u = 2 × 7.0 = 14.0 — d/du(u²) = 2u evaluated at u=7

Line 19: comment, chain rule formula

The chain rule says: to get the derivative through a chain of functions, multiply all the local derivatives together. dy/dx = dy/du × du/dx. This is the fundamental algorithm behind backpropagation in neural networks.

Line 20: dy_dx = dy_du * du_dx

The chain rule in action: multiply the local derivatives of each link in the chain. 14 × 3 = 42. This tells us: if x changes by ε, then y changes by 42ε. A tiny change in x gets amplified by 3 (through g) and then by 14 (through f), for a total amplification of 42.

EXECUTION STATE

dy_du = 14.0 — how much f amplifies changes

du_dx = 3 — how much g amplifies changes

dy_dx = 14 × 3 = 42.0 — the total derivative dy/dx

Line 22: print x value

Displays the input value x = 2.0.

EXECUTION STATE

output = x = 2.0

Line 23: print u = g(x)

Displays the intermediate value u = 7.0 after the first function.

EXECUTION STATE

output = u = g(x) = 3x + 1 = 7.0

Line 24: print y = f(u)

Displays the final output y = 49.0.

EXECUTION STATE

output = y = f(u) = u² = 49.0

Line 25: print du/dx

The local derivative of the inner function: 3.

EXECUTION STATE

output = du/dx = 3

Line 26: print dy/du

The local derivative of the outer function: 14.0.

EXECUTION STATE

output = dy/du = 2u = 14.0

Line 27: print chain rule result

The complete chain rule result: dy/dx = 14 × 3 = 42.

EXECUTION STATE

output = dy/dx = dy/du × du/dx = 14.0 × 3 = 42.0


import numpy as np

# y = f(g(x)) where g(x) = 3x + 1, f(u) = u²

def g(x):
    return 3 * x + 1

def f(u):
    return u ** 2

x = 2.0
u = g(x)
y = f(u)

# Local derivatives
du_dx = 3
dy_du = 2 * u

# Chain rule: dy/dx = dy/du × du/dx
dy_dx = dy_du * du_dx

print(f"x = {x}")
print(f"u = g(x) = 3x + 1 = {u}")
print(f"y = f(u) = u² = {y}")
print(f"du/dx = {du_dx}")
print(f"dy/du = 2u = {dy_du}")
print(f"dy/dx = dy/du × du/dx = {dy_du} × {du_dx} = {dy_dx}")

Chain Rule with PyTorch Autograd

Now let autograd do the same chain rule calculation. We write the same computation, but instead of computing local derivatives manually, we just call $\texttt{.backward()}$ and PyTorch handles the chain rule internally.

PyTorch Autograd Applies the Chain Rule Automatically

🐍chain_rule_autograd.py


Line 1: import torch

Import PyTorch. We’ll now let autograd compute the chain rule automatically — no manual derivative calculation needed.

EXECUTION STATE

📚 torch = PyTorch’s core module with automatic differentiation via torch.autograd.

Line 3: x = torch.tensor(2.0, requires_grad=True)

Create x = 2.0 with gradient tracking enabled. PyTorch will now record every operation that involves x.

EXECUTION STATE

x = tensor(2.0, requires_grad=True)

Line 5: u = 3 * x + 1 (PyTorch records this operation)

Computes g(x) = 3x + 1 = 7.0. PyTorch creates two graph nodes: MulBackward0 (for 3*x) and AddBackward0 (for +1). Each node stores the information needed to compute its local derivative during the backward pass.

EXECUTION STATE

3 * x = 3 × 2.0 = 6.0 — creates MulBackward0 node

u = 6.0 + 1 = 7.0 — creates AddBackward0 node

u.grad_fn = AddBackward0 — the graph node tracking this operation

Line 6: y = u ** 2 (PyTorch records this too)

Computes f(u) = u² = 49.0. Creates a PowBackward0 node. Now the entire computational graph is built: x → MulBackward → AddBackward → PowBackward. This graph is what backward() will traverse.

EXECUTION STATE

y = u ** 2 = 7.0² = 49.0

y.grad_fn = PowBackward0 — knows that y came from squaring u

Line 8: y.backward() (autograd computes the chain rule)

PyTorch walks backward through the graph:
1. Start: dy/dy = 1
2. PowBackward: dy/du = 2u = 2(7) = 14
3. AddBackward: du/d(3x) = 1 → running grad = 14
4. MulBackward: d(3x)/dx = 3 → running grad = 14 × 3 = 42
Result: x.grad = 42.0

EXECUTION STATE

📚 .backward() = Traverses the computational graph in reverse, applying the chain rule at each node. Stores the result in x.grad.

→ backward trace = PowBackward: 2u = 14 → AddBackward: ×1 = 14 → MulBackward: ×3 = 42

Line 10: print x value

x = 2.0, unchanged by backward().

EXECUTION STATE

output = x = 2.0

Line 11: print u value

u = 7.0, the intermediate computation.

EXECUTION STATE

output = u = 3x + 1 = 7.0

Line 12: print y value

y = 49.0, the final output.

EXECUTION STATE

output = y = u² = 49.0

Line 13: print autograd result

x.grad = 42.0 — computed automatically by one call to y.backward(). This is exactly the same answer we got manually (14 × 3 = 42), but PyTorch computed it for us.

EXECUTION STATE

x.grad.item() = 42.0 — dy/dx computed by autograd

Line 14: print manual verification

We verify: dy/dx = 2u × 3 = 2(7)(3) = 42. Autograd and manual calculation match perfectly.

EXECUTION STATE

2 * u.item() * 3 = 2 × 7.0 × 3 = 42.0 — matches autograd!


import torch

x = torch.tensor(2.0, requires_grad=True)

u = 3 * x + 1      # g(x) = 3x + 1 → u = 7.0
y = u ** 2          # f(u) = u²    → y = 49.0

y.backward()

print(f"x = {x.item()}")
print(f"u = 3x + 1 = {u.item()}")
print(f"y = u² = {y.item()}")
print(f"dy/dx (autograd) = {x.grad.item()}")
print(f"dy/dx (manual)   = 2 * u * 3 = {2 * u.item() * 3}")
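
By default PyTorch keeps gradients only for leaf tensors, so the intermediate value dy/du = 14 from the backward trace is not stored anywhere you can read it. A minimal sketch, assuming you want to inspect it, is to call `retain_grad()` on the intermediate tensor before the backward pass; the names `dy_du` and `dy_dx` are illustrative.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
u = 3 * x + 1
u.retain_grad()      # keep the gradient of this non-leaf tensor after backward()
y = u ** 2

y.backward()

dy_du = u.grad.item()   # local derivative of the outer function, 2u
dy_dx = x.grad.item()   # full chain-rule result, 2u * 3
print(f"dy/du = {dy_du}, dy/dx = {dy_dx}")
```

Reading both values side by side shows the chain rule directly: the gradient at x is the gradient at u multiplied by the local derivative du/dx = 3.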

Explore different chain compositions interactively. Watch how local derivatives at each function multiply to produce the total derivative, and see how changing $x$ affects the gradient flow.

[Interactive chain rule visualization]

Why the chain rule matters for deep learning: A neural network with 100 layers means the chain rule multiplies 100 local derivatives together. If local derivatives are all slightly greater than 1, the product explodes (exploding gradients). If they're all slightly less than 1, the product vanishes (vanishing gradients). Solving these problems (with careful initialization, batch normalization, residual connections, and other techniques) is one of the central challenges of deep learning. We will explore these solutions in later chapters.
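
A few lines of arithmetic show how quickly this happens. The local derivative values 1.1 and 0.9 below are illustrative stand-ins for "slightly greater than 1" and "slightly less than 1", and the helper name is an assumption.

```python
depth = 100  # number of local derivatives multiplied together

def product_of_local_derivatives(local_derivative, depth):
    """Multiply the same local derivative `depth` times, as in a deep chain."""
    total = 1.0
    for _ in range(depth):
        total *= local_derivative
    return total

exploding = product_of_local_derivatives(1.1, depth)
vanishing = product_of_local_derivatives(0.9, depth)
print(f"1.1 multiplied {depth} times: {exploding:.3e}")
print(f"0.9 multiplied {depth} times: {vanishing:.3e}")
```

A 10% deviation per layer compounds into a factor of more than ten thousand in one direction and less than one ten-thousandth in the other.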

Computational Graphs: PyTorch's Bookkeeping

How does PyTorch actually compute derivatives through chains of operations? The answer is the computational graph: a directed acyclic graph (DAG) that records every operation performed on tensors with $\texttt{requires\_grad=True}$.

During the forward pass, as you execute operations like $a = x \cdot y$ or $b = x^2$, PyTorch silently builds this graph. Each operation creates a node (identified by $\texttt{grad\_fn}$) that records what operation was performed and what its inputs were. When you call $\texttt{.backward()}$, PyTorch walks this graph in reverse, applying the chain rule at each node to compute all gradients.

Let's see this in action with $z = x \cdot y + x^2$, where variable $x$ feeds into the graph through two different paths.

Exploring the Computational Graph

🐍computational_graph.py


Line 1: import torch

Import PyTorch for tensor creation and automatic differentiation.

EXECUTION STATE

📚 torch = PyTorch core module

Line 3: x = torch.tensor(2.0, requires_grad=True)

Create the first input tensor x = 2.0 with gradient tracking. This is a leaf tensor — it was directly created by the user, not by an operation on other tensors.

EXECUTION STATE

x = tensor(2.0, requires_grad=True)

x.is_leaf = True — leaf tensors are inputs to the graph

x.grad_fn = None — leaf tensors have no grad_fn (they weren’t created by an operation)

Line 4: y = torch.tensor(3.0, requires_grad=True)

Create the second input tensor y = 3.0. Now we have two leaf tensors, so z.backward() will compute dz/dx AND dz/dy (partial derivatives with respect to each input).

EXECUTION STATE

y = tensor(3.0, requires_grad=True)

Line 6: a = x * y (multiplication node)

Computes a = 2.0 × 3.0 = 6.0. PyTorch creates a MulBackward0 node in the graph. This node remembers: its inputs were x and y, and the local derivatives are da/dx = y = 3 and da/dy = x = 2.

EXECUTION STATE

a = x * y = 2.0 × 3.0 = 6.0

a.grad_fn = MulBackward0 — knows a = x*y, so da/dx = y, da/dy = x

a.is_leaf = False — a was created by an operation, not by the user

Line 7: b = x ** 2 (power node)

Computes b = 2.0² = 4.0. Creates a PowBackward0 node. Note: x feeds into BOTH a and b. This means x has two paths through the graph, and its total gradient is the sum of gradients from both paths.

EXECUTION STATE

b = x ** 2 = 2.0² = 4.0

b.grad_fn = PowBackward0 — knows b = x², so db/dx = 2x = 4.0

Line 8: z = a + b (addition node, the output)

Computes z = 6.0 + 4.0 = 10.0. Creates an AddBackward0 node. This is our final output. The complete graph: x and y feed into a=x*y; x also feeds into b=x²; a and b feed into z=a+b.

EXECUTION STATE

z = a + b = 6.0 + 4.0 = 10.0

z.grad_fn = AddBackward0 — knows z = a + b, so dz/da = 1, dz/db = 1

Line 10: print x properties

Shows that x is a leaf tensor with requires_grad=True.

EXECUTION STATE

output = x = 2.0, requires_grad = True

Line 11: print y properties

Shows that y is also a leaf tensor with requires_grad=True.

EXECUTION STATE

output = y = 3.0, requires_grad = True

Line 12: print a properties

Shows a = 6.0 with grad_fn = MulBackward0. The grad_fn is how PyTorch knows how to backpropagate through this node.

EXECUTION STATE

output = a = x*y = 6.0, grad_fn = <MulBackward0 object>

Line 13: print b properties

Shows b = 4.0 with grad_fn = PowBackward0.

EXECUTION STATE

output = b = x² = 4.0, grad_fn = <PowBackward0 object>

Line 14: print z properties

Shows z = 10.0 with grad_fn = AddBackward0 — the root of our computational graph.

EXECUTION STATE

output = z = a+b = 10.0, grad_fn = <AddBackward0 object>

Line 16: z.backward() (compute all gradients)

Backward pass through the graph:
1. Start: dz/dz = 1
2. AddBackward: dz/da = 1, dz/db = 1
3. MulBackward (for a=x*y): da/dx = y = 3, da/dy = x = 2
4. PowBackward (for b=x²): db/dx = 2x = 4
5. x has TWO paths, through a and through b:
   dz/dx = dz/da × da/dx + dz/db × db/dx = 1×3 + 1×4 = 7
   dz/dy = dz/da × da/dy = 1×2 = 2

EXECUTION STATE

→ x gradient (multi-path) =
Path 1 (through a): dz/da × da/dx = 1 × 3 = 3
Path 2 (through b): dz/db × db/dx = 1 × 4 = 4
Total: 3 + 4 = 7.0

→ y gradient (single path) = Path (through a): dz/da × da/dy = 1 × 2 = 2.0

Line 18: print dz/dx

dz/dx = 7.0. Manual verification: z = xy + x², so dz/dx = y + 2x = 3 + 4 = 7. Autograd handles the multi-path gradient accumulation automatically.

EXECUTION STATE

x.grad = 7.0 — dz/dx = y + 2x = 3 + 4 = 7 ✔

Line 19: print dz/dy

dz/dy = 2.0. Manual verification: z = xy + x², so dz/dy = x = 2. Variable y only appears in one term (xy), so there’s only one path.

EXECUTION STATE

y.grad = 2.0 — dz/dy = x = 2 ✔


import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

a = x * y          # a = 6.0
b = x ** 2         # b = 4.0
z = a + b          # z = 10.0

print(f"x = {x.item()}, requires_grad = {x.requires_grad}")
print(f"y = {y.item()}, requires_grad = {y.requires_grad}")
print(f"a = x*y = {a.item()}, grad_fn = {a.grad_fn}")
print(f"b = x² = {b.item()}, grad_fn = {b.grad_fn}")
print(f"z = a+b = {z.item()}, grad_fn = {z.grad_fn}")

z.backward()

print(f"dz/dx = {x.grad.item()}")
print(f"dz/dy = {y.grad.item()}")

The visualization below lets you see this graph being built during the forward pass and then watch gradients flow backward during the backward pass. Try all three examples to see how different graph topologies work.

[Interactive computational graph visualization]

Concept	What It Is	Example
Leaf tensor	Tensor created directly by user (not by an operation)	x = torch.tensor(2.0, requires_grad=True)
grad_fn	The graph node that created a tensor. None for leaves.	a.grad_fn = MulBackward0
Forward pass	Computing the output. Builds the graph.	z = x * y + x ** 2
Backward pass	Walking the graph in reverse to compute gradients.	z.backward()
Multi-path gradient	When a variable feeds into multiple paths, gradients are summed.	dz/dx = (through a) + (through b) = 3 + 4 = 7
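
The graph summarized in the table can also be inspected from code. The sketch below follows each node's `next_functions` attribute, starting from the output's grad_fn, and records the node class names. The helper `node_names` is hypothetical, `next_functions` is a low-level attribute, and exact node names may differ between PyTorch versions, so treat this as an exploratory aid only.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x * y + x ** 2

def node_names(fn, depth=0, out=None):
    """Collect (depth, class name) pairs by following grad_fn.next_functions."""
    if out is None:
        out = []
    if fn is None:          # inputs that do not require gradients have no node
        return out
    out.append((depth, type(fn).__name__))
    for parent, _ in fn.next_functions:
        node_names(parent, depth + 1, out)
    return out

names = node_names(z.grad_fn)
for depth, name in names:
    print("  " * depth + name)
```

The root is the addition node, its two parents are the multiplication and power nodes, and the leaves appear as gradient-accumulation nodes. The leaf for x is reached twice, once per path, which is where the summed multi-path gradient comes from.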

The Autograd API: requires_grad, backward, grad

The autograd system has three essential pieces. Think of them as: the switch (requires_grad), the trigger (backward), and the result (grad).

$\texttt{requires\_grad=True}$: Tells PyTorch to track operations on this tensor. Set this on parameters (weights and biases), not on input data.
$\texttt{loss.backward()}$: Triggers the backward pass. Walks the computational graph in reverse, computing all gradients using the chain rule. Called without arguments, it works only on a scalar (single-number) tensor; a non-scalar tensor needs an explicit $\texttt{gradient}$ argument.
$\texttt{parameter.grad}$: After backward(), this attribute holds the gradient of the loss with respect to this parameter. Its sign and size tell you which direction, and how strongly, to change the parameter to reduce the loss: step against the gradient.
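
The scalar restriction on backward() is easy to run into. A minimal sketch of the failure and two common ways around it follows; the tensor values and variable names are illustrative.

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Calling backward() with no arguments on a non-scalar output raises an error
y = w ** 2
try:
    y.backward()
    raised = False
except RuntimeError:
    raised = True

# Option 1: reduce to a scalar first
y = w ** 2
y.sum().backward()
grad_from_sum = w.grad.clone()

# Option 2: pass an explicit upstream gradient with the same shape as y
w.grad.zero_()
y = w ** 2
y.backward(gradient=torch.ones_like(y))
grad_from_arg = w.grad.clone()

print(raised, grad_from_sum, grad_from_arg)
```

Both options give the same gradient, 2w, because summing the outputs is equivalent to weighting each output's gradient by one.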

Let's see all three working together in a mini neural network: a single linear transformation $z = wx + b$ with a squared error loss.

The Three Pillars of Autograd: requires_grad, backward(), .grad

🐍autograd_api.py


Line 1: import torch

Import PyTorch for the autograd API demonstration.

EXECUTION STATE

📚 torch = PyTorch core module

Line 3: comment, Step 1 (learnable parameters)

In neural networks, weights and biases are the parameters we want to learn. They need requires_grad=True so we can compute how the loss changes with respect to each parameter. Input data does NOT need gradients.

Line 4: w = torch.tensor(0.5, requires_grad=True) (weight)

The weight parameter, initialized to 0.5. In a real network, this would be initialized randomly. We set requires_grad=True because we want to know: “how should I change w to reduce the loss?”

EXECUTION STATE

w = tensor(0.5, requires_grad=True) — the learnable weight

→ why requires_grad=True? = We need dloss/dw to update w during training. Without it, PyTorch won’t track how w affects the loss.

Line 5: b = torch.tensor(0.1, requires_grad=True) (bias)

The bias parameter, initialized to 0.1. Also needs gradient tracking for the same reason as w.

EXECUTION STATE

b = tensor(0.1, requires_grad=True) — the learnable bias

Line 6: x = torch.tensor(2.0) (input data, no grad)

The input data. Notice: NO requires_grad. Inputs are fixed observations — we don’t want to change the input data, we want to change the weights. By default, requires_grad=False.

EXECUTION STATE

x = tensor(2.0) — requires_grad=False (default)

→ why no grad? = x is data, not a parameter. We observe x; we learn w and b. Gradients flow backward to parameters, not to inputs.

Line 8: comment, Step 2 (forward pass)

The forward pass computes the prediction and the loss. PyTorch builds the computational graph as we go.

Line 9: z = w * x + b

The linear transformation: z = 0.5 × 2.0 + 0.1 = 1.1. This is the prediction of our tiny ‘network’. The target is 1.0, so we’re slightly off.

EXECUTION STATE

w * x = 0.5 × 2.0 = 1.0

z = w*x + b = 1.0 + 0.1 = 1.1

z.grad_fn = AddBackward0

Line 10: loss = (z - 1.0) ** 2

Squared error: how far is our prediction (1.1) from the target (1.0)? loss = (1.1 - 1.0)² = 0.01. A small loss means we’re close to the target.

EXECUTION STATE

z - 1.0 = 1.1 - 1.0 = 0.1 — the prediction error

loss = 0.1² = 0.0100 — squared error

Line 12: print z value

Displays the prediction z = 1.1.

EXECUTION STATE

output = z = w*x + b = 1.1000

Line 13: print loss value

Displays the loss = 0.0100.

EXECUTION STATE

output = loss = (z - 1.0)² = 0.0100

Line 14: print w.grad before backward

Before calling backward(), no gradients have been computed yet. w.grad is None.

EXECUTION STATE

w.grad = None — gradients are only computed when you call .backward()

Line 15: print b.grad before backward

Similarly, b.grad is None before backward().

EXECUTION STATE

b.grad = None — not yet computed

17Comment: Step 3 — Backward pass

Now we call backward() to compute how each parameter should change to reduce the loss.

18loss.backward() — compute gradients

Traverses the graph backward from loss: 1. dloss/dz = 2(z - 1.0) = 2(0.1) = 0.2 2. dz/dw = x = 2.0 → dloss/dw = 0.2 × 2.0 = 0.4 3. dz/db = 1 → dloss/db = 0.2 × 1 = 0.2 Now w.grad = 0.4 and b.grad = 0.2.

EXECUTION STATE

→ dloss/dz = 2(z - 1.0) = 2(0.1) = 0.2

→ dz/dw = x = 2.0

→ dz/db = 1.0

→ dloss/dw = dloss/dz × dz/dw = 0.2 × 2.0 = 0.4000

→ dloss/db = dloss/dz × dz/db = 0.2 × 1.0 = 0.2000

20print w.grad

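w.grad = 0.4. This is the local slope of the loss with respect to w: nudging w up by a small amount ε increases the loss by approximately 0.4ε (for example, ε = 0.001 raises the loss by about 0.0004). So to DECREASE the loss, we should decrease w. The gradient descent update would be: w_new = w - lr × 0.4.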

EXECUTION STATE

w.grad = 0.4000 — dloss/dw. Positive gradient → decrease w to reduce loss.

21print b.grad

b.grad = 0.2. Nudging b up by a small amount ε increases the loss by approximately 0.2ε. Again, to decrease loss, we should decrease b.

EXECUTION STATE

b.grad = 0.2000 — dloss/db. Positive gradient → decrease b to reduce loss.

import torch

# Step 1: Create learnable parameters
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x = torch.tensor(2.0)   # Input: no gradient needed

# Step 2: Forward pass
z = w * x + b
loss = (z - 1.0) ** 2

print(f"z = w*x + b = {z.item():.4f}")
print(f"loss = (z - 1.0)² = {loss.item():.4f}")
print(f"w.grad before backward: {w.grad}")
print(f"b.grad before backward: {b.grad}")

# Step 3: Backward pass
loss.backward()

print(f"w.grad = {w.grad.item():.4f}")
print(f"b.grad = {b.grad.item():.4f}")
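The gradients reported by this listing (0.4 for w, 0.2 for b) can be cross-checked without autograd. The sketch below is plain Python and is only an illustration: the helper name `loss_fn` and the step size `eps` are assumptions, not part of the original example. It applies the chain rule by hand and compares the result with a central finite difference.

```python
# Cross-check of the autograd result using plain Python (no PyTorch needed).
# Values match the listing: w = 0.5, b = 0.1, x = 2.0, target = 1.0.

def loss_fn(w, b, x=2.0, target=1.0):
    """Squared error of the one-neuron model z = w*x + b (hypothetical helper)."""
    z = w * x + b
    return (z - target) ** 2

w, b, x = 0.5, 0.1, 2.0

# Chain rule by hand: dloss/dz = 2(z - 1), dz/dw = x, dz/db = 1
z = w * x + b
dloss_dz = 2.0 * (z - 1.0)
grad_w_manual = dloss_dz * x
grad_b_manual = dloss_dz * 1.0

# Central finite differences: nudge one parameter by a small eps
eps = 1e-6
grad_w_numeric = (loss_fn(w + eps, b) - loss_fn(w - eps, b)) / (2 * eps)
grad_b_numeric = (loss_fn(w, b + eps) - loss_fn(w, b - eps)) / (2 * eps)

print(f"manual : dloss/dw = {grad_w_manual:.4f}, dloss/db = {grad_b_manual:.4f}")
print(f"numeric: dloss/dw = {grad_w_numeric:.4f}, dloss/db = {grad_b_numeric:.4f}")
```

Both routes should agree with what backward() stores in w.grad and b.grad, which is a useful sanity check whenever a gradient looks suspicious.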

The Gradient Tells You How to Improve: After backward(), w.grad = 0.4 is the local slope of the loss with respect to w: increasing w by a small amount $\epsilon$ increases the loss by approximately $0.4\,\epsilon$. So to decrease the loss, we should decrease w. The gradient descent update is: $w_{\text{new}} = w - \eta \cdot \frac{\partial L}{\partial w}$ where $\eta$ is the learning rate.
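To see the update rule act on the example's parameters, here is a minimal sketch of a single gradient descent step on w and b. The learning rate `lr = 0.05` is an arbitrary value chosen for illustration; the original example does not specify one.

```python
import torch

w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x = torch.tensor(2.0)
lr = 0.05  # learning rate (eta); an assumed, illustrative value

loss_before = (w * x + b - 1.0) ** 2
loss_before.backward()              # fills w.grad (0.4) and b.grad (0.2)

with torch.no_grad():               # the update itself must not be tracked
    w -= lr * w.grad                # 0.5 - 0.05 * 0.4 = 0.48
    b -= lr * b.grad                # 0.1 - 0.05 * 0.2 = 0.09
    loss_after = (w * x + b - 1.0) ** 2   # z = 1.05, so loss = 0.0025

print(f"loss before the step: {loss_before.item():.4f}")
print(f"loss after the step : {loss_after.item():.4f}")
```

One step in the negative gradient direction moves the prediction from 1.1 to 1.05 and cuts the loss to a quarter of its previous value.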

Partial Derivatives and the Gradient Vector

Real neural networks have thousands or millions of parameters, not just one. When a function depends on multiple variables, like $f(x, y) = x^2 + 2xy + y^2$ , we need a separate derivative with respect to each variable. These are called partial derivatives.

The partial derivative $\frac{\partial f}{\partial x}$ treats all other variables as constants and differentiates only with respect to $x$ . The collection of all partial derivatives forms the gradient vector:

$\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)$

The gradient vector points in the direction of steepest ascent. To minimize a function (reduce loss), we move in the opposite direction — the negative gradient. This is exactly what gradient descent does.

Notation	Meaning	Example
∂f/∂x	Partial derivative: differentiate f with respect to x, treating y as constant	f = x² + 2xy → ∂f/∂x = 2x + 2y
∂f/∂y	Partial derivative: differentiate f with respect to y, treating x as constant	f = x² + 2xy → ∂f/∂y = 2x
∇f (nabla f)	Gradient vector: the vector of all partial derivatives	∇f = (2x + 2y, 2x)
-∇f	Negative gradient: direction of steepest DESCENT	The direction to move to reduce f fastest

PyTorch computes all partial derivatives simultaneously. When you call $\texttt{loss.backward()}$ , every leaf tensor with $\texttt{requires\_grad=True}$ gets its gradient populated. You saw this in the computational graph example: $z = xy + x^2$ gave us $\frac{\partial z}{\partial x} = 7$ and $\frac{\partial z}{\partial y} = 2$ in a single backward() call.
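As a small illustration, the function $f(x, y) = x^2 + 2xy + y^2$ from the start of this section can be differentiated with autograd. The evaluation point (1, 2) is an arbitrary choice made for this sketch. Analytically, both partial derivatives equal $2x + 2y$, so both should come out as 6 at that point.

```python
import torch

# Evaluation point (1, 2) is an assumption chosen for illustration
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

f = x**2 + 2*x*y + y**2     # same as (x + y)^2, which is 9 at (1, 2)
f.backward()                # one call fills both x.grad and y.grad

# Analytic partials: df/dx = 2x + 2y, df/dy = 2x + 2y
gradient = torch.stack([x.grad, y.grad])
print(f"f = {f.item():.1f}")
print(f"gradient = {gradient.tolist()}")
```

Stacking the two `.grad` values gives the gradient vector $\nabla f$; its negative is the direction gradient descent would step in.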

The Gradient Accumulation Trap

PyTorch has a behavior that surprises every beginner: calling $\texttt{.backward()}$ does not replace the gradient in $\texttt{.grad}$ — it adds to it. This means if you call backward() twice without zeroing the gradient between calls, the second gradient gets added to the first.

This is by design (it enables advanced techniques like gradient accumulation over mini-batches), but it is also one of the most common bugs in PyTorch training loops. If you forget to zero gradients, every update uses the sum of the current gradient and all stale gradients from earlier iterations, so the values drift further from the true gradient over time.
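To make the "by design" part concrete, here is a hedged sketch of intentional gradient accumulation. The toy data, the one-parameter model y ≈ w·x, and the helper `mse` are all assumptions made for illustration. Two micro-batches are processed without zeroing in between, and each loss is divided by the number of micro-batches so that the accumulated gradient matches the full-batch gradient.

```python
import torch

# Hypothetical toy data for a one-parameter model y ≈ w * x
xs = torch.tensor([1.0, 2.0, 3.0, 4.0])
ys = torch.tensor([2.0, 4.0, 6.0, 8.0])

def mse(w, x, y):
    """Mean squared error of the model w * x against targets y."""
    return ((w * x - y) ** 2).mean()

# Reference: gradient of the loss over the full batch of 4 samples
w_full = torch.tensor(0.5, requires_grad=True)
mse(w_full, xs, ys).backward()

# Accumulation: two micro-batches of 2 samples, no zeroing in between.
# Dividing each loss by num_chunks makes the summed gradients equal
# the full-batch gradient.
w_acc = torch.tensor(0.5, requires_grad=True)
num_chunks = 2
for x_chunk, y_chunk in zip(xs.chunk(num_chunks), ys.chunk(num_chunks)):
    (mse(w_acc, x_chunk, y_chunk) / num_chunks).backward()

print(f"full-batch grad : {w_full.grad.item():.4f}")
print(f"accumulated grad: {w_acc.grad.item():.4f}")
```

The same mechanism that enables this technique is what causes the bug shown next: accumulation is only correct when you choose where it starts and stops.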

Gradient Accumulation Bug and Fix

🐍gradient_accumulation.py

1import torch

Import PyTorch.

EXECUTION STATE

📚 torch = PyTorch core module

3x = torch.tensor(3.0, requires_grad=True)

Create x = 3.0 with gradient tracking. x.grad starts as None.

EXECUTION STATE

x = tensor(3.0, requires_grad=True)

x.grad = None (no backward yet)

5Comment: First backward

We’ll compute d(x²)/dx = 2x = 6.0.

6y1 = x ** 2

Computes x² = 9.0.

EXECUTION STATE

y1 = 3.0² = 9.0

7y1.backward()

Computes dy1/dx = 2x = 2(3) = 6.0. Since x.grad was None, it gets initialized to 6.0.

EXECUTION STATE

dy1/dx = 2x = 2 × 3.0 = 6.0

x.grad after = 6.0 (was None, now 6.0)

8print x.grad after 1st backward

x.grad = 6.0. This is correct: d(x²)/dx = 2(3) = 6.

EXECUTION STATE

output = After 1st backward: x.grad = 6.0 ✔

10Comment: Second backward WITHOUT zeroing — BUG!

HERE is the trap. We’re about to compute d(x³)/dx = 3x² = 27.0. But x.grad still holds 6.0 from the first backward. PyTorch will ADD the new gradient to the existing one: 6 + 27 = 33. This is almost certainly not what you want!

11y2 = x ** 3

Computes x³ = 27.0.

EXECUTION STATE

y2 = 3.0³ = 27.0

12y2.backward() — gradients ACCUMULATE!

Computes dy2/dx = 3x² = 3(9) = 27.0. But instead of replacing x.grad, PyTorch ADDS to it: x.grad = 6.0 + 27.0 = 33.0. This is the accumulation behavior — by design in PyTorch, but a common bug source.

EXECUTION STATE

dy2/dx = 3x² = 3 × 9 = 27.0 (the new gradient)

x.grad before = 6.0 (leftover from first backward)

x.grad after = 6.0 + 27.0 = 33.0 — WRONG if we wanted just dy2/dx!

13print x.grad after 2nd backward — 33.0!

x.grad = 33.0. We wanted 27.0 (the gradient of x³) but got 33.0 (the sum of both gradients). In a training loop, this bug makes every update use the sum of all past gradients instead of the current one, which typically makes training unstable or causes it to diverge.

EXECUTION STATE

output = After 2nd backward: x.grad = 33.0 ❌ (should be 27.0)

15Comment: The fix — zero the gradient

The fix is simple: call .zero_() on the gradient tensor before each backward pass. This resets it to 0, so the new gradient doesn’t add to stale values.

16x.grad.zero_()

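Resets x.grad to 0.0. The underscore suffix in PyTorch means ‘in-place operation’: it modifies the tensor directly without creating a new one. In a training loop, you’d use optimizer.zero_grad(), which clears the gradients of all parameters at once (since PyTorch 2.0 it sets them to None by default instead of filling them with zeros; the next backward() starts fresh either way).

EXECUTION STATE

📚 .zero_() = In-place operation: sets all elements to 0. The trailing underscore marks methods that modify the tensor in place (similar to how list.sort() sorts in place while sorted() returns a new list in Python).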

x.grad before = 33.0 (accumulated)

x.grad after = 0.0 (reset)

17y3 = x ** 3

Computes x³ = 27.0 again.

EXECUTION STATE

y3 = 3.0³ = 27.0

18y3.backward() — now on a clean slate

Computes dy3/dx = 3x² = 27.0. Since x.grad was zeroed, the result is just 0 + 27 = 27.0. Correct!

EXECUTION STATE

x.grad = 0 + 27.0 = 27.0 — correct gradient!

19print x.grad after zero + backward

x.grad = 27.0. This is the correct gradient of x³ at x=3: d/dx(x³) = 3x² = 3(9) = 27.

EXECUTION STATE

output = After zero + backward: x.grad = 27.0 ✔

import torch

x = torch.tensor(3.0, requires_grad=True)

# First backward: y1 = x²
y1 = x ** 2
y1.backward()
print(f"After 1st backward: x.grad = {x.grad.item()}")

# Second backward WITHOUT zeroing: BUG!
y2 = x ** 3
y2.backward()
print(f"After 2nd backward: x.grad = {x.grad.item()}")

# The fix: zero the gradient first
x.grad.zero_()
y3 = x ** 3
y3.backward()
print(f"After zero + backward: x.grad = {x.grad.item()}")

The interactive demo below lets you experience this bug firsthand. Click “loss.backward()” repeatedly and watch the gradient grow without bound in the “Without zero_grad()” mode. Then switch to the correct mode and see how resetting keeps the gradient clean.

Loading gradient accumulation demo...

The Rule: In every training loop iteration, you must zero gradients before calling backward(). The standard pattern is:

$\texttt{optimizer.zero\_grad()}$: clear all parameter gradients
$\texttt{loss = loss\_fn(model(input), target)}$: forward pass (compute predictions, then the loss)
$\texttt{loss.backward()}$: compute gradients
$\texttt{optimizer.step()}$: update parameters using gradients
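Spelled out as runnable code, the pattern might look like the sketch below. Everything concrete in it is an assumption chosen for illustration: the toy data (y = 2x + 1), the `nn.Linear(1, 1)` model, the MSE loss, the learning rate, and the number of epochs.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy data: y = 2x + 1
inputs = torch.tensor([[0.0], [1.0], [2.0], [3.0]])
targets = 2.0 * inputs + 1.0

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

losses = []
for epoch in range(200):
    optimizer.zero_grad()                    # 1. clear stale gradients
    loss = loss_fn(model(inputs), targets)   # 2. forward pass: predictions, then loss
    loss.backward()                          # 3. backward pass: compute gradients
    optimizer.step()                         # 4. update parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

Note that the model returns predictions, not the loss; the loss comes from comparing those predictions with the targets.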

Controlling the Graph: no_grad() and detach()

Building a computational graph takes memory and computation. During inference (making predictions with a trained model), you don't need gradients — you're not training, just computing outputs. PyTorch provides two mechanisms to disable gradient tracking when you don't need it:

Mechanism	What It Does	When to Use
torch.no_grad()	Context manager: disables gradient tracking for all operations inside the block	Inference, evaluation, parameter updates in training loops
tensor.detach()	Creates a new tensor with same data but disconnected from the graph	Using a tensor’s value without gradient flow, stopping gradient through part of network

Disabling Gradient Tracking: no_grad() and detach()

🐍no_grad_detach.py

1import torch

Import PyTorch.

EXECUTION STATE

📚 torch = PyTorch core module

3x = torch.tensor(2.0, requires_grad=True)

Create x with gradient tracking enabled.

EXECUTION STATE

x = tensor(2.0, requires_grad=True)

5Comment: Normal operation — tracked

First, we show what happens normally: operations on a tensor with requires_grad=True create a computational graph.

6y = x ** 2

Normal operation: PyTorch tracks this and creates a PowBackward0 node. Memory is used to store the graph.

EXECUTION STATE

y = tensor(4.0, grad_fn=<PowBackward0>)

7print y.requires_grad

True — because y was created from an operation on x (which has requires_grad=True), y also has requires_grad=True. This propagates automatically.

EXECUTION STATE

output = y.requires_grad = True

8print y.grad_fn

PowBackward0 — the graph node that knows ‘y came from squaring x’.

EXECUTION STATE

output = y.grad_fn = <PowBackward0 object>

10Comment: torch.no_grad() — disables tracking

torch.no_grad() is a context manager that temporarily disables gradient tracking for ALL operations inside its block. This is used during inference (making predictions) when you don’t need gradients, saving memory and computation.

11with torch.no_grad():

Enters the no-gradient context. All tensor operations inside this block will NOT build a computational graph, regardless of requires_grad settings. This is essential during inference — a model with millions of parameters would waste enormous memory tracking gradients it will never use.

EXECUTION STATE

📚 torch.no_grad() = Context manager: disables gradient computation inside the block. Reduces memory usage (no graph stored) and speeds up computation. Use for: inference, evaluation, parameter updates.

12y_no_grad = x ** 2 (inside no_grad)

Same operation as before (x² = 4.0), but now NO graph is built. The result tensor has requires_grad=False and grad_fn=None.

EXECUTION STATE

y_no_grad = tensor(4.0) — no grad_fn, no graph. Pure number.

13print y_no_grad.requires_grad

False — inside no_grad(), the result doesn’t track gradients even though x does.

EXECUTION STATE

output = y_no_grad.requires_grad = False

14print y_no_grad.grad_fn

None — no computational graph node was created. You CANNOT call y_no_grad.backward().

EXECUTION STATE

output = y_no_grad.grad_fn = None

16Comment: .detach() — breaks the graph connection

.detach() creates a new tensor that shares the same data but is disconnected from the computational graph. Useful when you need to use a value as a ‘constant’ without gradients flowing through it.

17x_detached = x.detach()

Creates a new tensor with the same value as x (2.0) but with requires_grad=False. The original x is unaffected. This is useful when you want to use x’s value as a fixed constant in another computation.

EXECUTION STATE

📚 .detach() = Returns a new tensor that shares the same underlying data but is detached from the computational graph. The original tensor is unchanged. Common use: getting a tensor’s value without gradient tracking.

x_detached = tensor(2.0) — same value, but requires_grad=False

18y_detached = x_detached ** 2

Since x_detached has requires_grad=False, this operation is not tracked. No graph node is created.

EXECUTION STATE

y_detached = tensor(4.0) — no graph, no grad_fn

19print x_detached.requires_grad

False — detached tensors don’t track gradients.

EXECUTION STATE

output = x_detached.requires_grad = False

20print y_detached.requires_grad

False — operations on non-tracking tensors produce non-tracking results.

EXECUTION STATE

output = y_detached.requires_grad = False

import torch

x = torch.tensor(2.0, requires_grad=True)

# Normal operation: tracked
y = x ** 2
print(f"y.requires_grad = {y.requires_grad}")
print(f"y.grad_fn = {y.grad_fn}")

# torch.no_grad(): disables tracking
with torch.no_grad():
    y_no_grad = x ** 2
    print(f"y_no_grad.requires_grad = {y_no_grad.requires_grad}")
    print(f"y_no_grad.grad_fn = {y_no_grad.grad_fn}")

# .detach(): breaks the graph connection
x_detached = x.detach()
y_detached = x_detached ** 2
print(f"x_detached.requires_grad = {x_detached.requires_grad}")
print(f"y_detached.requires_grad = {y_detached.requires_grad}")
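The "stopping gradient through part of a network" use of detach() is easier to see when the detached tensor is mixed back into a tracked computation. The sketch below is an illustrative example, not part of the original listing: it computes x² · x twice, once fully tracked and once with the second factor detached so that it behaves like a constant.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Fully tracked: y = x^2 * x = x^3, so dy/dx = 3x^2 = 12
y_full = x ** 2 * x
y_full.backward()
grad_full = x.grad.item()

x.grad.zero_()   # clear the accumulated gradient before the second backward

# Second factor detached: it acts as the constant 2.0,
# so y = 2 * x^2 and dy/dx = 4x = 8
y_partial = x ** 2 * x.detach()
y_partial.backward()
grad_partial = x.grad.item()

print(f"fully tracked : dy/dx = {grad_full}")
print(f"with detach() : dy/dx = {grad_partial}")
```

Both expressions have the same forward value (8.0), but the gradient differs because no gradient flows through the detached factor.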

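Memory Impact: During a tracked forward pass, PyTorch keeps the intermediate results (activations) that backward() will need. For large models these can take as much memory as the weights themselves, or more. Wrapping inference in $\texttt{torch.no\_grad()}$ avoids storing them, so intermediate tensors can be freed as soon as they are no longer needed. The exact saving depends on the model, the batch size, and the input length.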

Gradient Descent with Autograd

Now we bring everything together. We'll build a complete gradient descent loop that uses autograd to minimize a function. The pattern is the same one used to train every neural network: forward pass, backward pass, update parameters, repeat.

Our goal: find the value of $x$ that minimizes $f(x) = (x - 3)^2$ . The minimum is at $x = 3$ (where $f(3) = 0$ ). We start at $x = 0$ and let gradient descent find the minimum by following the negative gradient step by step.

Complete Gradient Descent Loop with Autograd

🐍gradient_descent.py

1import torch

Import PyTorch for tensors and automatic differentiation.

EXECUTION STATE

📚 torch = PyTorch core module

3Comment: Goal — minimize f(x) = (x - 3)²

We want to find the value of x that makes (x - 3)² as small as possible. The answer is obviously x = 3 (where the squared error is 0), but we’ll let gradient descent find it automatically. This is the core of how neural networks learn — they minimize a loss function by following gradients.

4Comment: Minimum at x = 3

f’(x) = 2(x - 3). Setting f’(x) = 0: 2(x - 3) = 0 → x = 3. The derivative is negative when x < 3 (go right) and positive when x > 3 (go left). Gradient descent follows this signal to the minimum.

6x = torch.tensor(0.0, requires_grad=True)

Start at x = 0. This is far from the minimum at x = 3. We’ll watch gradient descent walk from 0 toward 3 step by step.

EXECUTION STATE

x = tensor(0.0, requires_grad=True) — our starting point, far from the minimum

7learning_rate = 0.1

The step size for gradient descent. Controls how big each update is. Too large: overshoots the minimum. Too small: takes forever. 0.1 is a reasonable starting point for this simple problem.

EXECUTION STATE

learning_rate = 0.1 — each step moves 10% of the gradient magnitude

9print header

Prints the title of our optimization run.

EXECUTION STATE

output = Gradient Descent: minimize f(x) = (x - 3)²

10print column headers

Column headers for the step-by-step output: step number, current x, function value, gradient.

EXECUTION STATE

output = Step x f(x) grad

11print separator

A line of dashes to separate the header from the data.

13for step in range(10): — the training loop

Run 10 steps of gradient descent. Each iteration: (1) compute loss, (2) compute gradient via backward(), (3) update x, (4) zero gradient. This is the exact same structure as training a neural network — the only difference is that real networks have thousands of parameters instead of one.

LOOP TRACE · 10 iterations

step=0

x = 0.0000

loss = (0 - 3)² = 9.0000

grad = 2(0 - 3) = -6.0000

x_new = 0 - 0.1×(-6) = 0.6000

step=1

x = 0.6000

loss = (0.6 - 3)² = 5.7600

grad = 2(0.6 - 3) = -4.8000

x_new = 0.6 - 0.1×(-4.8) = 1.0800

step=2

x = 1.0800

loss = 3.6864

grad = -3.8400

x_new = 1.4640

step=3

x = 1.4640

loss = 2.3593

grad = -3.0720

x_new = 1.7712

step=4

x = 1.7712

loss = 1.5099

grad = -2.4576

x_new = 2.0170

step=5

x = 2.0170

loss = 0.9664

grad = -1.9661

x_new = 2.2136

step=6

x = 2.2136

loss = 0.6185

grad = -1.5729

x_new = 2.3709

step=7

x = 2.3709

loss = 0.3958

grad = -1.2583

x_new = 2.4967

step=8

x = 2.4967

loss = 0.2533

grad = -1.0066

x_new = 2.5973

step=9

x = 2.5973

loss = 0.1621

grad = -0.8053

x_new = 2.6779

14Comment: Forward — compute loss

Step 1 of the training loop: compute the function value (the loss) using the current x.

15loss = (x - 3.0) ** 2

The loss function f(x) = (x - 3)². At step 0: loss = (0 - 3)² = 9.0. The graph is built: x → SubBackward → PowBackward.

EXECUTION STATE

loss at step 0 = (0.0 - 3.0)² = 9.0

17Comment: Backward — compute gradient

Step 2: use autograd to compute dloss/dx = 2(x - 3).

18loss.backward()

Computes the gradient of loss with respect to x. At step 0: dloss/dx = 2(0 - 3) = -6. The negative gradient tells us: x should INCREASE to reduce the loss (move toward 3).

EXECUTION STATE

dloss/dx at step 0 = 2(0 - 3) = -6.0 — negative means increase x

20grad = x.grad.item()

Extract the gradient value as a Python float for printing.

EXECUTION STATE

grad at step 0 = -6.0

21print step information

Prints the current state: step number, x position, loss value, and gradient.

EXECUTION STATE

output at step 0 = 0 0.0000 9.0000 -6.0000

23Comment: Update — step in negative gradient direction

Step 3: move x in the direction that decreases the loss. The update rule is x_new = x - lr × gradient. The gradient points uphill, so subtracting lr × gradient moves x downhill.

24with torch.no_grad():

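We wrap the parameter update in no_grad() because this arithmetic must not be tracked in the computational graph: it is a parameter update, not part of the forward pass. Outside no_grad(), PyTorch would reject the in-place update with a RuntimeError, because x is a leaf tensor that requires grad.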

EXECUTION STATE

📚 torch.no_grad() = Disables gradient tracking. The update x -= lr * grad is not a computation we want to differentiate through — it’s just moving a number.

25x -= learning_rate * x.grad

The gradient descent update rule. At step 0: x = 0.0 - 0.1 × (-6.0) = 0.0 + 0.6 = 0.6. The negative gradient (-6) means “go right”, so x increases from 0 toward 3.

EXECUTION STATE

learning_rate * x.grad = 0.1 × (-6.0) = -0.6

x = x - (-0.6) = 0.0 + 0.6 = 0.6

27Comment: Zero gradient for next iteration

Step 4: reset the gradient to zero before the next backward() call. Without this, gradients would accumulate across iterations (the bug we saw earlier).

28x.grad.zero_()

Resets x.grad to 0. This is critical: without it, each backward() would add to the existing gradient, so every update would use the sum of all past gradients and the steps would no longer follow the true slope.

EXECUTION STATE

📚 .zero_() = Sets the gradient to 0 in-place. In real training, you’d call optimizer.zero_grad() which does this for all parameters.

30print final result

After 10 steps, x = 2.6779. Not quite at 3.0 yet, but steadily approaching it. With more steps, or a moderately larger learning rate, x would get closer to 3.0. The convergence is geometric: each step multiplies the remaining distance to 3 by 1 - 2×lr = 0.8, so the error shrinks by 20% per step.

EXECUTION STATE

output = Final x = 2.6779 (target: 3.0)

→ convergence = From x_k = 3(1 - 0.8^k): after 10 steps, 89.3% of the way to 3.0. After 20 steps: x ≈ 2.9654. After 50 steps: x ≈ 3.0000 (within 0.0001 of the minimum).

import torch

# Goal: find x that minimizes f(x) = (x - 3)²
# Minimum is at x = 3 (where derivative = 0)

x = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

print("Gradient Descent: minimize f(x) = (x - 3)²")
print(f"{'Step':>4} {'x':>8} {'f(x)':>8} {'grad':>8}")
print("-" * 36)

for step in range(10):
    # Forward: compute loss
    loss = (x - 3.0) ** 2

    # Backward: compute gradient
    loss.backward()

    grad = x.grad.item()
    print(f"{step:4d} {x.item():8.4f} {loss.item():8.4f} {grad:8.4f}")

    # Update: step in negative gradient direction
    with torch.no_grad():
        x -= learning_rate * x.grad

    # Zero gradient for next iteration
    x.grad.zero_()

print(f"\nFinal x = {x.item():.4f} (target: 3.0)")

Notice the pattern: the loss decreases at each step (9.0 → 5.76 → 3.69 → ...), and $x$ approaches 3 (0 → 0.6 → 1.08 → ...). The gradient gets smaller as we approach the minimum, so the steps naturally shrink. This is why gradient descent converges smoothly on a well-behaved convex function like this one, provided the learning rate is small enough.
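The shrinking steps follow a simple formula. Because the gradient is $2(x - 3)$, each update gives $x_{k+1} - 3 = (1 - 2\eta)(x_k - 3)$, so with $\eta = 0.1$ and $x_0 = 0$ we get $x_k = 3\,(1 - 0.8^k)$. The plain-Python sketch below (the helper name `closed_form` is an assumption for illustration) checks that formula against the iteration and shows how far a longer run would get.

```python
learning_rate = 0.1
x = 0.0
history = [x]                      # history[k] is x after k updates
for _ in range(50):
    grad = 2.0 * (x - 3.0)         # analytic derivative of (x - 3)^2
    x -= learning_rate * grad
    history.append(x)

def closed_form(k, lr=0.1, x0=0.0, minimum=3.0):
    """x after k steps: the distance to the minimum is scaled by (1 - 2*lr) per step."""
    return minimum + (x0 - minimum) * (1.0 - 2.0 * lr) ** k

for k in (10, 20, 50):
    print(f"step {k:2d}: iterated x = {history[k]:.5f}, closed form = {closed_form(k):.5f}")
```

The factor $1 - 2\eta$ also explains the role of the learning rate: for this function any $\eta$ between 0 and 1 converges, $\eta = 0.5$ lands on the minimum in one step, and values above 0.5 overshoot and oscillate around it.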

3D Visualization: Gradient Descent on a Loss Surface

In real neural networks, the loss is a function of many parameters. With two parameters, we can visualize it as a 3D surface where height represents the loss. Watch the red ball follow the negative gradient downhill to find the minimum. The yellow arrow shows the direction of steepest descent at each point, which is the negative of the gradient that autograd computes.

Loading 3D gradient descent visualization...

Try the “Elongated Valley” surface. Notice how the ball oscillates back and forth across the narrow valley. This is a fundamental problem with basic gradient descent on elongated loss surfaces. Advanced optimizers like Adam and RMSProp greatly reduce this problem, and we'll cover them in a later chapter.
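The oscillation can be reproduced in a few lines of plain Python. The surface $f(x, y) = x^2 + 25y^2$, the starting point, and the learning rate below are all assumptions chosen to make the effect visible; they are not the settings used by the visualization.

```python
def grad(x, y):
    """Gradient of the hypothetical elongated bowl f(x, y) = x^2 + 25*y^2."""
    return 2.0 * x, 50.0 * y

lr = 0.038          # just below the stability limit (0.04) of the steep y direction
x, y = 4.0, 1.0     # assumed starting point
path = [(x, y)]
for _ in range(10):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy
    path.append((x, y))

for step, (px, py) in enumerate(path[:6]):
    print(f"step {step}: x = {px:7.4f}, y = {py:7.4f}")
```

Each step multiplies y by 1 - 50·lr = -0.9, so y flips sign every iteration (the zig-zag across the valley), while x is multiplied by 1 - 2·lr = 0.924 and creeps slowly along the valley floor. A single learning rate cannot suit both directions, which is the gap adaptive optimizers try to close.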

Summary and Bridge

In this section, we learned how PyTorch automatically computes derivatives — the gradients that drive neural network training. Here are the key concepts:

Concept	What It Means	PyTorch API
Derivative	Rate of change — the slope of the function at a point	Computed automatically by autograd
Chain rule	Derivative through composed functions = product of local derivatives	Applied internally by .backward()
Computational graph	Record of operations that enables automatic differentiation	Built automatically during forward pass
requires_grad	Enable gradient tracking on a tensor (parameters only)	torch.tensor(..., requires_grad=True)
.backward()	Compute all gradients via reverse-mode autodiff	loss.backward()
.grad	Access the computed gradient of a leaf tensor	weight.grad
zero_grad()	Reset gradients to zero (MUST do before each backward)	optimizer.zero_grad() or tensor.grad.zero_()
no_grad()	Disable gradient tracking to save memory	with torch.no_grad():
Gradient descent	Update parameters in negative gradient direction	w -= lr * w.grad

The Training Loop Pattern: Every neural network training loop follows the same four steps: (1) forward pass to compute loss, (2) backward pass to compute gradients, (3) parameter update using gradients, (4) zero gradients for next iteration. Everything else — data loading, model architecture, learning rate schedules — is built around this core loop.
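As a closing sketch, the hand-written loop for minimizing $f(x) = (x - 3)^2$ can be expressed with $\texttt{torch.optim.SGD}$, which performs the update and the gradient reset for us. This is one possible rewrite rather than the section's own code; it zeroes gradients at the start of each iteration instead of the end, which is equivalent here.

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.1)   # plain SGD: x <- x - lr * x.grad

for step in range(10):
    optimizer.zero_grad()          # clear the gradient from the previous step
    loss = (x - 3.0) ** 2          # forward pass
    loss.backward()                # backward pass
    optimizer.step()               # parameter update

print(f"Final x = {x.item():.4f} (target: 3.0)")
```

The optimizer replaces both the manual `with torch.no_grad(): x -= ...` update and the `x.grad.zero_()` call, and it reaches the same final x as the manual loop.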

In the next chapter, we'll use these tools to build actual neural networks. You now have all the PyTorch essentials: tensors (Section 2-3), operations (Section 3), and automatic differentiation (this section). Chapter 3 will combine them into complete forward passes, loss computation, and training loops for real networks.