Chapter 3

Derivatives and Gradients

Mathematics for Neural Networks

Why Derivatives Matter for Neural Networks

In the previous section, we learned how vectors and matrices organize and transform data. But there is a deeper question: how does a neural network learn? The answer is derivatives. Every time a neural network adjusts its weights during training, it is using derivatives to figure out which direction to change each weight to make the network's predictions better.

Here is the central idea. A neural network makes a prediction, compares it to the correct answer, and computes a loss — a single number measuring how wrong the prediction was. The question then becomes: how should each weight change to make the loss smaller? The derivative of the loss with respect to each weight answers exactly this question. It tells us the direction and rate at which the loss changes when we nudge each weight by a tiny amount.

The Central Insight: A derivative measures how a function's output changes when its input changes by a tiny amount. Neural networks use derivatives to find which direction to adjust each weight to reduce the prediction error. This process is called gradient descent.

This section builds your understanding from the ground up: first the derivative of a single variable, then partial derivatives for multiple variables, then the gradient vector that combines them, and finally the gradient descent algorithm that drives all of deep learning. We will implement everything in Python first, then see how PyTorch automates the entire process.


The Derivative: Measuring Instantaneous Change

Imagine you are driving a car and your GPS shows you are 100 km from your destination at noon, and 60 km at 1 PM. Your average speed was $\frac{100 - 60}{1} = 40$ km/h. But that average hides the details: you may have been going 80 km/h on the highway and 10 km/h in traffic. The derivative gives you the instantaneous speed at any single moment, not the average over a period.

Mathematically, the derivative of a function $f(x)$ at a point $x$ is defined as the limit:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

This formula says: take two points on the function that are $h$ apart, compute the slope of the straight line connecting them (the secant line), then shrink $h$ to zero. As $h$ shrinks, the secant line rotates and becomes the tangent line, the line that just touches the curve at that single point. The slope of that tangent line is the derivative.

Geometric Intuition: Secant to Tangent

The interactive visualization below shows this process in action. The red dashed line is the secant line connecting two points on the curve. The green line is the true tangent line. As you decrease $\Delta x$ (or click "Animate"), watch the secant line rotate to match the tangent line. The secant slope converges to the derivative.

[Interactive visualization: secant line converging to the tangent line]

Try This: Select different functions (x², x³, sin(x), eˣ) and move the point along the curve. Notice how the derivative (tangent slope) changes at different locations. For x², the slope is always 2x: positive when x > 0, negative when x < 0, and zero at x = 0 (the bottom of the bowl).

Computing Derivatives: The Limit Process

Let us make this concrete. For $f(x) = x^2$ at the point $x = 2$, we know $f(2) = 4$. We want to find the derivative $f'(2)$. The formula says to compute $\frac{f(2+h) - f(2)}{h}$ for smaller and smaller $h$:

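| h | f(2 + h) | f(2 + h) - f(2) | Slope = Δy/h | Error from 4.0 |
|---|---|---|---|---|
| 1.0 | 9.0 | 5.0 | 5.000000 | 1.000000 |
| 0.1 | 4.41 | 0.41 | 4.100000 | 0.100000 |
| 0.01 | 4.0401 | 0.0401 | 4.010000 | 0.010000 |
| 0.001 | 4.004001 | 0.004001 | 4.001000 | 0.001000 |
| 0.0001 | 4.00040001 | 0.00040001 | 4.000100 | 0.000100 |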

The pattern is clear: each time $h$ shrinks by 10×, the error also shrinks by 10×. In the limit $h \to 0$, the slope converges to exactly 4.0. This is the derivative: $f'(2) = 4$.
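
The tenfold pattern is not a coincidence. As a short worked step, expanding the difference quotient for $f(x) = x^2$ at $x = 2$ shows that the error is exactly $h$:

```latex
\frac{f(2+h) - f(2)}{h}
  = \frac{(2+h)^2 - 4}{h}
  = \frac{4h + h^2}{h}
  = 4 + h
```

So the secant slope overshoots the true derivative by exactly $h$, and that overshoot vanishes as $h \to 0$.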

We can verify this with calculus. The power rule says that the derivative of $x^n$ is $nx^{n-1}$. So for $f(x) = x^2$: $f'(x) = 2x$. At $x = 2$: $f'(2) = 2 \times 2 = 4$.

Python Implementation: Derivative from First Principles

Let us code this limit process directly. We compute the finite difference $\frac{f(x+h) - f(x)}{h}$ for shrinking values of $h$ and watch it converge:

Derivative from First Principles
🐍derivative_first_principles.py
`import numpy as np`

NumPy provides fast numerical arrays and math operations. We import it as np by convention. We will use it here for numerical derivative computation — though this first example uses plain Python to make the math transparent.

EXECUTION STATE
📚 numpy = Numerical computing library. Provides ndarray, vectorized math, linear algebra. We use it later for array-based derivative computation.
`# The function: f(x) = x²`

We choose a simple quadratic function as our test case because its derivative has a known closed-form answer: f′(x) = 2x. This lets us verify our numerical approximation against the true value.

`def f(x):` (define the function f(x) = x²)

Defines a Python function that computes x squared. In neural network terms, this is like a simple loss function — we want to find its minimum, and the derivative tells us which way is downhill.

EXECUTION STATE
⬇ input: x = A single number (float). Examples: x=2.0 gives f(2)=4.0, x=3.0 gives f(3)=9.0
⬆ returns = x² — a single float. The function curves upward like a bowl, with minimum at x=0.
`return x ** 2`

Python’s ** operator computes exponentiation. x**2 = x×x. For x=2.0: 2.0**2 = 4.0.

EXECUTION STATE
** (exponent operator) = Python’s power operator. a**b = a raised to the power b. Example: 3**2 = 9, 2**3 = 8
⬆ return: x ** 2 = When x=2.0: 2.0 ** 2 = 4.0
`# Pick a point`

We will compute the derivative at x=2.0. The derivative at this point tells us: if x increases by a tiny amount, how much does f(x) change? The answer should be f′(2) = 2×2 = 4.

`x = 2.0` (the point where we evaluate the derivative)

We fix x=2.0. At this point, f(2.0) = 4.0. The function is climbing steeply upward here — the slope should be positive and equal to 2×2 = 4.0.

EXECUTION STATE
x = 2.0 = The point on the x-axis where we want to know the instantaneous rate of change. f(2.0) = 2.0² = 4.0
`print(f"f({x}) = {f(x)}")`

Prints the function value at our chosen point.

EXECUTION STATE
output = f(2.0) = 4.0
`# Approximate the derivative with shrinking h`

The derivative is defined as the limit: f′(x) = lim(h→0) [f(x+h) - f(x)] / h. We cannot compute h=0 directly (division by zero), so instead we try progressively smaller values of h and watch the slope converge to the true derivative.

`for h in [1.0, 0.1, 0.01, 0.001, 0.0001]:`

We try five values of h, each 10× smaller than the previous. This demonstrates how the finite difference approximation converges to the true derivative as h shrinks toward zero.

LOOP TRACE · 5 iterations
h = 1.0000
f(x+h) = f(2.0 + 1.0) = f(3.0) = 9.0
slope = (9.0 - 4.0) / 1.0 = 5.000000 — far from true value 4.0
h = 0.1000
f(x+h) = f(2.0 + 0.1) = f(2.1) = 4.41
slope = (4.41 - 4.0) / 0.1 = 4.100000 — getting closer
h = 0.0100
f(x+h) = f(2.0 + 0.01) = f(2.01) = 4.0401
slope = (4.0401 - 4.0) / 0.01 = 4.010000
h = 0.0010
f(x+h) = f(2.0 + 0.001) = f(2.001) = 4.004001
slope = (4.004001 - 4.0) / 0.001 = 4.001000
h = 0.0001
f(x+h) = f(2.0 + 0.0001) = f(2.0001) = 4.00040001
slope = (4.00040001 - 4.0) / 0.0001 = 4.000100 — nearly exact!
`slope = (f(x + h) - f(x)) / h`

This is the difference quotient formula — the slope of the secant line through two points on the curve: (x, f(x)) and (x+h, f(x+h)). As h→0, this secant line becomes the tangent line, and its slope becomes the derivative.

EXECUTION STATE
f(x + h) - f(x) = The change in the function output (Δy). How much does f change when x increases by h?
/ h = Divide by the change in input (Δx = h). This gives the average rate of change over the interval [x, x+h].
⬆ slope = The finite difference approximation to f′(x). Approaches 4.0 as h→0.
`print(f"h = {h:.4f}, slope = {slope:.6f}")`

Prints each approximation. The pattern is clear: as h shrinks by 10×, the error also shrinks by 10×.

EXECUTION STATE
output (all iterations) =
h = 1.0000, slope = 5.000000
h = 0.1000, slope = 4.100000
h = 0.0100, slope = 4.010000
h = 0.0010, slope = 4.001000
h = 0.0001, slope = 4.000100
`# True derivative: f'(x) = 2x`

For f(x) = x², calculus gives us the exact derivative: f′(x) = 2x. At x=2, f′(2) = 4.0. Our numerical approximation converged to this value as h shrank.

`true_derivative = 2 * x`

Computes the exact derivative using the power rule: d/dx[x²] = 2x. At x=2.0: 2 × 2.0 = 4.0.

EXECUTION STATE
power rule = d/dx[xⁿ] = n·xⁿ⁻¹. For x²: n=2, so derivative = 2x¹ = 2x
true_derivative = 2 × 2.0 = 4.0 — the exact instantaneous rate of change at x=2
`print(f"True derivative f'({x}) = {true_derivative}")`

Prints the analytical derivative value for comparison with our numerical approximations.

EXECUTION STATE
output = True derivative f′(2.0) = 4.0
import numpy as np

# The function: f(x) = x²
def f(x):
    return x ** 2

# Pick a point
x = 2.0
print(f"f({x}) = {f(x)}")

# Approximate the derivative with shrinking h
for h in [1.0, 0.1, 0.01, 0.001, 0.0001]:
    slope = (f(x + h) - f(x)) / h
    print(f"h = {h:.4f}, slope = {slope:.6f}")

# True derivative: f'(x) = 2x
true_derivative = 2 * x
print(f"True derivative f'({x}) = {true_derivative}")

Essential Derivative Rules

Computing derivatives from the limit definition every time would be tedious. Instead, calculus gives us a set of rules that make finding derivatives mechanical. Here are the rules you will use most in deep learning:

The Power Rule

If $f(x) = x^n$, then $f'(x) = nx^{n-1}$. This is the most frequently used rule. The exponent comes down as a coefficient, and the exponent decreases by one.

| Function | Derivative | At x=2 |
|---|---|---|
| x² | 2x | 4.0 |
| x³ | 3x² | 12.0 |
| x⁴ | 4x³ | 32.0 |
| x¹ = x | 1 | 1.0 |
| x⁰ = 1 (constant) | 0 | 0.0 |
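
As a quick sanity check, the power rule values can be compared against a numerical estimate. The sketch below is illustrative only: `central_difference` and `power_rule` are helper names introduced here (they are not part of this section's listings), and the step size `h=1e-5` is an assumed default.

```python
def central_difference(f, x, h=1e-5):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

def power_rule(n, x):
    """Exact derivative of x**n given by the power rule: n * x**(n - 1)."""
    return n * x ** (n - 1)

x = 2.0
results = {}
for n in [0, 1, 2, 3, 4]:
    # Bind n as a default argument so each lambda keeps its own exponent.
    numeric = central_difference(lambda t, n=n: t ** n, x)
    exact = power_rule(n, x)
    results[n] = (numeric, exact)
    print(f"n = {n}: numeric = {numeric:.6f}, power rule = {exact:.6f}")
```

The two columns should agree to about six decimal places for every exponent, which is the kind of agreement you would expect when the analytical rule has been applied correctly.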

The Sum Rule

The derivative of a sum is the sum of the derivatives. If $f(x) = g(x) + h(x)$, then $f'(x) = g'(x) + h'(x)$. This lets you differentiate term by term. For example: $\frac{d}{dx}(x^2 + 3x + 7) = 2x + 3 + 0 = 2x + 3$.

The Constant Multiple Rule

Constants factor out of derivatives: if $f(x) = c \cdot g(x)$, then $f'(x) = c \cdot g'(x)$. Example: $\frac{d}{dx}(3x^2) = 3 \cdot 2x = 6x$.
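
The two worked examples for the sum rule and the constant multiple rule can be checked numerically in the same spirit. This is a minimal sketch; `central_difference` is a helper introduced here for illustration, and the evaluation point `x = 2.0` is an arbitrary choice.

```python
def central_difference(f, x, h=1e-5):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0

# Sum rule example: d/dx (x² + 3x + 7) = 2x + 3
sum_numeric = central_difference(lambda t: t**2 + 3 * t + 7, x)
sum_exact = 2 * x + 3

# Constant multiple example: d/dx (3x²) = 6x
scaled_numeric = central_difference(lambda t: 3 * t**2, x)
scaled_exact = 6 * x

print(f"Sum rule:          numeric = {sum_numeric:.6f}, exact = {sum_exact:.6f}")
print(f"Constant multiple: numeric = {scaled_numeric:.6f}, exact = {scaled_exact:.6f}")
```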

Special Functions

| Function | Derivative | Why It Matters |
|---|---|---|
| eˣ | eˣ | Appears in softmax, sigmoid. Its own derivative! |
| ln(x) | 1/x | Appears in cross-entropy loss |
| sin(x) | cos(x) | Used in positional encodings |
| 1/x | -1/x² | Appears in normalization layers |

Why These Rules Matter: Every neural network layer is built from these basic operations: multiplication, addition, powers, and exponentials. The chain rule (covered in Section 4 of this chapter) combines these to differentiate complex compositions. Together, they let us compute the derivative of any neural network output with respect to any weight.
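
The special-function derivatives listed above can be spot-checked with Python's standard `math` module. This is only a sketch under assumptions: `central_difference` is a helper introduced here, and the point `x = 2.0` is chosen arbitrarily (it must be positive so that `ln(x)` is defined).

```python
import math

def central_difference(f, x, h=1e-5):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
# Each entry: (label, function, exact derivative evaluated at x)
checks = [
    ("exp(x)", math.exp, math.exp(x)),
    ("ln(x)", math.log, 1 / x),
    ("sin(x)", math.sin, math.cos(x)),
    ("1/x", lambda t: 1 / t, -1 / x**2),
]

max_error = 0.0
for name, func, exact in checks:
    numeric = central_difference(func, x)
    max_error = max(max_error, abs(numeric - exact))
    print(f"{name:7s} numeric = {numeric:+.6f}, exact = {exact:+.6f}")
```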

Partial Derivatives: Multiple Variables

A real neural network has thousands or millions of weights, not just one. Its loss function takes all weights as input: $L(w_1, w_2, \ldots, w_n)$. To update each weight, we need to know how the loss changes when that specific weight changes, while all other weights stay fixed. This is exactly what a partial derivative computes.

Consider a function of two variables: $f(x, y) = x^2 + 3y^2$. This defines a 3D surface shaped like an elliptical bowl. The partial derivative $\frac{\partial f}{\partial x}$ asks: if I nudge $x$ slightly while holding $y$ fixed, how much does $f$ change? To compute it, simply treat $y$ as a constant and differentiate with respect to $x$:

$$\frac{\partial f}{\partial x} = 2x \quad \text{(treat } y \text{ as constant, } 3y^2 \text{ disappears)}$$

$$\frac{\partial f}{\partial y} = 6y \quad \text{(treat } x \text{ as constant, } x^2 \text{ disappears)}$$

At the point $(2, 1)$: $\frac{\partial f}{\partial x} = 4$ and $\frac{\partial f}{\partial y} = 6$. The surface is steeper in the $y$-direction (slope 6 vs 4) because the $3y^2$ term has a larger coefficient.

Code: Computing Partial Derivatives

Partial Derivatives and the Gradient Vector
🐍partial_derivatives.py
`import numpy as np`

NumPy provides np.array for the gradient vector and np.linalg.norm for computing its magnitude.

EXECUTION STATE
📚 numpy = We use np.array() to build the gradient vector and np.linalg.norm() to compute its length.
`# A function of TWO variables`

Neural network loss functions depend on many variables (all the weights). Here we start with just two variables to build intuition. The function f(x, y) = x² + 3y² creates an elliptical bowl shape — steeper in the y-direction because of the 3× coefficient.

`def f(x, y):` (a two-variable function)

This function takes two inputs and returns one output. It defines a surface in 3D space: for every (x, y) point on the ground plane, f gives the height. The minimum is at (0, 0) where f = 0.

EXECUTION STATE
⬇ input: x = First variable. Partial derivative df/dx tells how f changes when x changes (holding y fixed).
⬇ input: y = Second variable. Partial derivative df/dy tells how f changes when y changes (holding x fixed).
⬆ returns = x² + 3y² — always ≥ 0, equals 0 only at origin (0,0). The 3× makes the surface steeper along y.
`return x**2 + 3*y**2`

Two terms added together. The x² term creates curvature in the x-direction (coefficient 1), and 3y² creates steeper curvature in the y-direction (coefficient 3). This asymmetry means the gradient will be larger in the y-direction.

EXECUTION STATE
x**2 = Contribution from x. At x=2: 2² = 4.0
3*y**2 = Contribution from y, 3× steeper. At y=1: 3×1² = 3.0
⬆ return = 4.0 + 3.0 = 7.0
`# Evaluate at a specific point`

We pick the point (2.0, 1.0) to evaluate both the function value and its partial derivatives. This point is in the upper-right quadrant of the surface, away from the minimum at origin.

`x0, y0 = 2.0, 1.0` (the evaluation point)

Python tuple unpacking: assigns x0=2.0 and y0=1.0 simultaneously. This is our starting position on the loss surface.

EXECUTION STATE
x0 = 2.0 = The x-coordinate where we evaluate the gradient.
y0 = 1.0 = The y-coordinate where we evaluate the gradient.
`print(f"f({x0}, {y0}) = {f(x0, y0)}")`

Computes and prints f(2.0, 1.0) = 2² + 3×1² = 4 + 3 = 7.0. This is the height of the surface at our point.

EXECUTION STATE
output = f(2.0, 1.0) = 7.0
`# Partial derivative with respect to x (treat y as constant)`

A partial derivative asks: if I change only x (keeping y fixed), how fast does f change? We literally treat y as a constant number and differentiate only with respect to x. For f = x² + 3y², the 3y² term is constant, so df/dx = 2x.

`# df/dx = 2x`

Applying the power rule to x² gives 2x. The 3y² term has no x in it, so its derivative with respect to x is 0. Therefore df/dx = 2x + 0 = 2x.

`df_dx = 2 * x0`

Evaluates df/dx = 2x at x=2.0. The result 4.0 means: at point (2, 1), if we nudge x by a tiny amount ε, the function changes by approximately 4ε. Moving right increases f; moving left decreases f.

EXECUTION STATE
df/dx = 2x = At x=2.0: df/dx = 2 × 2.0 = 4.0
df_dx = 4.0 = Positive → f increases as x increases. Moving in the +x direction goes uphill.
`print(f"df/dx = 2x = 2*{x0} = {df_dx}")`

Displays the partial derivative with respect to x.

EXECUTION STATE
output = df/dx = 2x = 2*2.0 = 4.0
`# Partial derivative with respect to y (treat x as constant)`

Now we ask: if I change only y (keeping x fixed), how fast does f change? We treat x as constant and differentiate with respect to y. For f = x² + 3y²: the x² term vanishes, and d/dy[3y²] = 6y.

`# df/dy = 6y`

The power rule on 3y² gives 3×2y = 6y. The x² term is constant with respect to y, contributing 0.

`df_dy = 6 * y0`

Evaluates df/dy = 6y at y=1.0. The result 6.0 means: the surface is steeper in the y-direction than in x (6.0 vs 4.0). This is because of the 3× coefficient on y² — the surface curves more sharply along y.

EXECUTION STATE
df/dy = 6y = At y=1.0: df/dy = 6 × 1.0 = 6.0
df_dy = 6.0 = Larger than df/dx = 4.0 → the surface is steeper in the y-direction at this point.
`print(f"df/dy = 6y = 6*{y0} = {df_dy}")`

Displays the partial derivative with respect to y.

EXECUTION STATE
output = df/dy = 6y = 6*1.0 = 6.0
`# The gradient: vector of all partial derivatives`

The gradient is simply the vector that collects all partial derivatives into one object: ∇f = [df/dx, df/dy]. It points in the direction of steepest ascent — the direction where f increases fastest. For gradient descent, we go in the opposite direction.

`gradient = np.array([df_dx, df_dy])`

Packs the two partial derivatives into a NumPy array. This IS the gradient vector ∇f(2, 1) = [4.0, 6.0]. It points toward the upper-right, meaning f increases fastest in the direction that is 4 parts x and 6 parts y.

EXECUTION STATE
📚 np.array([...]) = Creates a 1-D NumPy array from a Python list. The resulting ndarray supports vectorized math.
gradient = [4.0, 6.0] — the gradient vector ∇f at (2, 1)
→ direction interpretation = The gradient points more strongly in the y-direction (6.0) than x (4.0), because the 3y² term makes the surface steeper along y.
`print(f"Gradient: {gradient}")`

Displays the gradient vector.

EXECUTION STATE
output = Gradient: [4. 6.]
`# Gradient magnitude: how steep is the surface here?`

The magnitude of the gradient tells us the steepness of the surface. A large magnitude means the surface is very steep; near a minimum, the magnitude approaches zero because the surface flattens out.

`grad_magnitude = np.linalg.norm(gradient)`

Computes |∇f| = √(4² + 6²) = √(16 + 36) = √52 ≈ 7.2111. This is the maximum rate of change in any direction at this point.

EXECUTION STATE
📚 np.linalg.norm() = Computes the Euclidean (L2) norm: √(sum of squares). For the vector [4, 6]: √(16+36) = √52.
grad_magnitude = 7.2111 — the maximum directional derivative. The surface is quite steep here.
`print(f"|gradient| = {grad_magnitude:.4f}")`

Displays the gradient magnitude formatted to 4 decimal places.

EXECUTION STATE
output = |gradient| = 7.2111
`# Numerical verification using central differences`

We verify our analytical derivatives by computing them numerically. Central differences give a more accurate approximation than forward differences: instead of [f(x+h) - f(x)]/h, we use [f(x+h) - f(x-h)]/(2h), which cancels the leading error term.

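`h = 1e-7` (tiny step for numerical differentiation)

We use h = 0.0000001. The central difference formula has truncation error proportional to h² (about 10⁻¹⁴ here), but floating-point roundoff grows roughly like machine epsilon divided by h (about 10⁻⁹ to 10⁻⁸ here), so roundoff, not truncation, limits the accuracy at this step size. For a quadratic such as this f the truncation term is exactly zero, and the result is still accurate well beyond the six decimals we print.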

EXECUTION STATE
h = 1e-7 = 0.0000001 — scientific notation. Balances accuracy (smaller h = less truncation error) against numerical stability (too-small h causes cancellation errors).
`num_df_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)`

Central difference for df/dx: evaluate f at x+h and x-h (both with the same y), then divide by 2h. This is like measuring slope by looking equally far in both directions from the point.

EXECUTION STATE
📚 central difference = [f(x+h) - f(x-h)] / (2h) — symmetric around x. More accurate than forward difference because odd-order error terms cancel.
f(x0+h, y0) = f(2.0000001, 1.0) = 2.0000001² + 3×1² = 7.00000040...
f(x0-h, y0) = f(1.9999999, 1.0) = 1.9999999² + 3×1² = 6.99999960...
num_df_dx = (7.0000004 - 6.9999996) / 0.0000002 = 4.000000 — matches analytical result exactly
`num_df_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)`

Central difference for df/dy: evaluate f at y+h and y-h (both with the same x). This time x is held constant while y varies.

EXECUTION STATE
f(x0, y0+h) = f(2.0, 1.0000001) = 4 + 3×(1.0000001)² = 7.00000060...
f(x0, y0-h) = f(2.0, 0.9999999) = 4 + 3×(0.9999999)² = 6.99999940...
num_df_dy = (7.0000006 - 6.9999994) / 0.0000002 = 6.000000 — matches analytical result exactly
`print(f"Numerical df/dx = {num_df_dx:.6f}")`

Confirms numerical and analytical derivatives match.

EXECUTION STATE
output = Numerical df/dx = 4.000000
`print(f"Numerical df/dy = {num_df_dy:.6f}")`

Confirms numerical and analytical derivatives match for y as well.

EXECUTION STATE
output = Numerical df/dy = 6.000000
import numpy as np

# A function of TWO variables
def f(x, y):
    return x**2 + 3*y**2

# Evaluate at a specific point
x0, y0 = 2.0, 1.0
print(f"f({x0}, {y0}) = {f(x0, y0)}")

# Partial derivative with respect to x (treat y as constant)
# df/dx = 2x
df_dx = 2 * x0
print(f"df/dx = 2x = 2*{x0} = {df_dx}")

# Partial derivative with respect to y (treat x as constant)
# df/dy = 6y
df_dy = 6 * y0
print(f"df/dy = 6y = 6*{y0} = {df_dy}")

# The gradient: vector of all partial derivatives
gradient = np.array([df_dx, df_dy])
print(f"Gradient: {gradient}")

# Gradient magnitude: how steep is the surface here?
grad_magnitude = np.linalg.norm(gradient)
print(f"|gradient| = {grad_magnitude:.4f}")

# Numerical verification using central differences
h = 1e-7
num_df_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
num_df_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)
print(f"Numerical df/dx = {num_df_dx:.6f}")
print(f"Numerical df/dy = {num_df_dy:.6f}")
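
The walkthrough states that central differences are more accurate than forward differences. For a quadratic the central difference is exact up to roundoff, so the comparison is easier to see on a different function. The sketch below assumes a cubic test function `f(x) = x**3`, chosen here purely for illustration.

```python
def f(x):
    return x ** 3          # cubic test function; exact derivative is 3x²

x = 2.0
exact = 3 * x ** 2         # 12.0

errors = []
for h in [0.1, 0.01, 0.001]:
    forward = (f(x + h) - f(x)) / h              # one-sided quotient
    central = (f(x + h) - f(x - h)) / (2 * h)    # symmetric quotient
    forward_err = abs(forward - exact)
    central_err = abs(central - exact)
    errors.append((h, forward_err, central_err))
    print(f"h = {h}: forward error = {forward_err:.2e}, central error = {central_err:.2e}")
```

For this cubic the forward error works out algebraically to 6h + h² and the central error to h², so shrinking h by 10× cuts the forward error by about 10× but the central error by 100×.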

The Gradient Vector

The gradient is the vector that collects all partial derivatives into one object. For a function $f(x, y)$, the gradient is:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$$

The gradient has two crucial geometric properties:

  1. Direction: The gradient points in the direction of steepest ascent, the direction where $f$ increases fastest. To decrease $f$ (minimize the loss), move in the opposite direction: $-\nabla f$.
  2. Magnitude: $|\nabla f|$ tells you how steep the surface is at that point. Near a minimum, the gradient magnitude approaches zero because the surface flattens out.
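
Both properties can be checked numerically at the point (2, 1). The sketch below samples many unit directions, estimates the rate of change of f along each one, and compares the steepest sampled direction with the normalized gradient. The 0.1 degree sampling grid and the step `h = 1e-6` are assumptions made for this illustration.

```python
import numpy as np

def f(x, y):
    return x**2 + 3 * y**2

point = np.array([2.0, 1.0])
gradient = np.array([2 * point[0], 6 * point[1]])   # analytic gradient [2x, 6y]
unit_gradient = gradient / np.linalg.norm(gradient)

# Sample unit directions around the full circle (every 0.1 degrees).
angles = np.linspace(0.0, 2 * np.pi, 3600, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Rate of change of f along each direction, estimated by central differences.
h = 1e-6
rates = np.array([
    (f(*(point + h * u)) - f(*(point - h * u))) / (2 * h)
    for u in directions
])

steepest = directions[np.argmax(rates)]
print("Steepest sampled direction:", steepest)
print("Normalized gradient:       ", unit_gradient)
print(f"Largest rate = {rates.max():.4f}, |gradient| = {np.linalg.norm(gradient):.4f}")
```

The steepest sampled direction should match the normalized gradient up to the grid resolution, and the largest rate should match the gradient magnitude.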

Interactive: Gradient and Partial Derivatives on a Contour Plot

The visualization below shows a contour plot (top-down view of the surface) with the partial derivatives and gradient vector drawn at a movable point. Click or drag to move the point. Notice how the gradient (purple arrow) always points perpendicular to the contour lines — toward higher values.

[Interactive visualization: contour plot with partial derivatives and gradient vector at a movable point]

Key Observation: The gradient is always perpendicular to the contour lines (level curves). Contour lines connect points of equal value, so moving along a contour means no change in $f$. The gradient points in the direction of maximum change, which must be perpendicular to the "no change" direction.
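
The perpendicularity claim can be verified for the section's function x² + 3y² at the point (2, 1). The sketch below parametrizes the contour through that point as an ellipse, which is a choice made here for illustration, and checks that the gradient has zero dot product with the contour's tangent.

```python
import numpy as np

def f(x, y):
    return x**2 + 3 * y**2

x0, y0 = 2.0, 1.0
gradient = np.array([2 * x0, 6 * y0])     # analytic gradient [2x, 6y]

# The contour through (2, 1) is the ellipse x² + 3y² = c with c = f(2, 1) = 7,
# parametrized as x(t) = a cos t, y(t) = b sin t.
c = f(x0, y0)
a, b = np.sqrt(c), np.sqrt(c / 3)
t0 = np.arctan2(y0 / b, x0 / a)           # parameter value at our point

# Tangent to the contour: derivative of (x(t), y(t)) with respect to t.
tangent = np.array([-a * np.sin(t0), b * np.cos(t0)])
dot = gradient @ tangent

# A small step along the tangent direction barely changes f.
eps = 1e-4
unit_tangent = tangent / np.linalg.norm(tangent)
change = f(x0 + eps * unit_tangent[0], y0 + eps * unit_tangent[1]) - f(x0, y0)

print(f"gradient . tangent = {dot:.2e}")
print(f"change in f after a step of {eps} along the contour = {change:.2e}")
```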

For a neural network with $n$ weights, the gradient is an $n$-dimensional vector:

$$\nabla L = \begin{bmatrix} \frac{\partial L}{\partial w_1} \\ \frac{\partial L}{\partial w_2} \\ \vdots \\ \frac{\partial L}{\partial w_n} \end{bmatrix}$$

Each component tells you how to adjust one specific weight to reduce the loss. The gradient gives you the complete recipe for improving all weights simultaneously.
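
To make the n-dimensional picture concrete, here is a minimal sketch that estimates every component of the gradient by perturbing one weight at a time. The names `numerical_gradient`, `toy_loss`, and `target` are hypothetical and introduced only for this example; real frameworks compute gradients with automatic differentiation rather than this loop.

```python
import numpy as np

def numerical_gradient(loss_fn, w, h=1e-5):
    """Central-difference estimate of the gradient of loss_fn at w."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h                      # perturb only weight i
        grad[i] = (loss_fn(w + step) - loss_fn(w - step)) / (2 * h)
    return grad

# Hypothetical toy loss over 4 weights: squared distance to a target vector.
target = np.array([1.0, -2.0, 0.5, 3.0])

def toy_loss(w):
    return np.sum((w - target) ** 2)

w = np.zeros(4)
grad = numerical_gradient(toy_loss, w)
exact = 2 * (w - target)                 # analytic gradient of the toy loss

print("Numerical gradient:", grad)
print("Analytic gradient: ", exact)
```

Each entry of the result is one partial derivative, so the cost of this approach grows with the number of weights: two loss evaluations per weight.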


Gradient Descent: Walking Downhill

Now we have all the pieces. Gradient descent is the algorithm that uses the gradient to iteratively minimize a function. The idea is beautifully simple: if the gradient tells you which way is uphill, go the other way.

The update rule is:

$$w_{\text{new}} = w_{\text{old}} - \eta \cdot \nabla L(w_{\text{old}})$$

where $\eta$ (eta) is the learning rate, a positive number that controls the step size. The minus sign ensures we move opposite to the gradient (downhill). Let us start with a single-variable example to build intuition.

Example: Minimizing a 1D Loss Function

Consider $L(w) = (w - 3)^2 + 1$. This is a parabola with minimum at $w = 3$ where $L = 1$. The derivative is $\frac{dL}{dw} = 2(w - 3)$. Starting at $w = 0$ with learning rate $\eta = 0.1$:

  • The gradient is $2(0 - 3) = -6$ (negative, meaning the minimum is to the right)
  • The update: $w_{\text{new}} = 0 - 0.1 \times (-6) = 0.6$ (moved right!)
  • Each subsequent step moves us closer to $w = 3$, with smaller steps as we approach
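
To see why the steps shrink, substitute the derivative into the update rule and track the distance to the minimum. This is a worked derivation for this specific loss, with $\eta$ denoting the learning rate and $w_k$ the weight after $k$ steps:

```latex
w_{k+1} - 3 = w_k - \eta \cdot 2(w_k - 3) - 3 = (1 - 2\eta)(w_k - 3)
\quad\Longrightarrow\quad
w_k = 3 + (1 - 2\eta)^k (w_0 - 3)
```

With $\eta = 0.1$ and $w_0 = 0$, each step multiplies the remaining distance by $0.8$, so after ten steps $w_{10} = 3 - 3 \cdot 0.8^{10} \approx 2.6779$.
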
Gradient Descent from Scratch (1D)
🐍gradient_descent_1d.py
`import numpy as np`

NumPy imported for any array operations. In this 1-D example we use plain Python, but the same pattern scales to vectors and matrices with NumPy arrays.

EXECUTION STATE
📚 numpy = Imported by convention. Not strictly needed for 1-D but establishes the pattern used in higher dimensions.
`# Loss function: L(w) = (w - 3)² + 1`

This is a simple loss function with a known minimum at w=3, where L=1. It represents the typical scenario in neural network training: we have some parameter w and want to find the value that minimizes the loss.

`# Minimum at w = 3, L = 1`

The +1 shifts the minimum up from 0 to 1 (irreducible loss). The (w-3)² term means the minimum is at w=3 where (3-3)²=0.

`def loss(w):` (quadratic loss function)

Defines a loss function of a single parameter w. The loss forms a U-shaped parabola centered at w=3. In a real neural network, w would be thousands of weights and the loss landscape would be much more complex.

EXECUTION STATE
⬇ input: w = A single weight parameter (float). Example: w=0.0 gives L=(0-3)²+1=10.0, w=3.0 gives L=(3-3)²+1=1.0
⬆ returns = (w-3)² + 1 — always ≥1.0, minimum at w=3
`return (w - 3)**2 + 1`

Computes the squared distance from w to the optimal value 3, plus a constant offset of 1. At w=0: (0-3)²+1 = 9+1 = 10.0.

EXECUTION STATE
(w - 3) = Distance from w to the optimum. At w=0: 0-3 = -3
(w - 3)**2 = Squared distance, always ≥0. At w=0: (-3)² = 9
⬆ return = At w=0: 9 + 1 = 10.0
`def loss_gradient(w):` (derivative of the loss)

The gradient (derivative for a single variable) tells us which direction to move w to reduce the loss. Negative gradient = move right (increase w). Positive gradient = move left (decrease w).

EXECUTION STATE
⬇ input: w = Current weight value
⬆ returns = 2(w-3) — the derivative dL/dw. Negative when w<3, zero at w=3, positive when w>3.
`return 2 * (w - 3)`

dL/dw = 2(w-3). Using the chain rule: d/dw[(w-3)²] = 2(w-3)×1 = 2(w-3). The +1 constant disappears (derivative of a constant is 0). At w=0: 2(0-3) = -6, meaning the loss is decreasing steeply to the right.

EXECUTION STATE
chain rule = d/dw[(w-3)²] = 2(w-3) × d/dw[w-3] = 2(w-3) × 1
⬆ return: 2*(w-3) = At w=0: 2×(0-3) = -6.0. Negative gradient means the minimum is to the RIGHT.
`# Initialize`

We start with an initial guess for w and a learning rate that controls step size. In real neural networks, weights are typically initialized randomly and the learning rate is a critical hyperparameter.

`w = 0.0` (start far from the minimum)

We deliberately start at w=0, which is far from the minimum at w=3. The loss here is L(0)=10.0 and the gradient is dL/dw=-6.0. Gradient descent will move w to the right toward 3.

EXECUTION STATE
w = 0.0 = Initial weight. Distance to optimum: |0-3| = 3.0. Loss: (0-3)²+1 = 10.0
`learning_rate = 0.1` (step size multiplier)

The learning rate (η) controls how big each step is. Too large = overshoot and oscillate. Too small = converge slowly. 0.1 is a moderate value: each step moves w by 0.1 × |gradient|.

EXECUTION STATE
learning_rate = 0.1 = Step size hyperparameter. At w=0: step = 0.1 × 6.0 = 0.6 (w moves from 0 to 0.6 in one step).
`# Run gradient descent for 10 steps`

We run a fixed number of iterations. In practice, you stop when the loss converges (stops changing) or the gradient magnitude drops below a threshold.

`for step in range(10):` (10 optimization steps)

Each iteration: (1) compute the loss, (2) compute the gradient, (3) update w by moving opposite to the gradient. Watch how w approaches 3.0 and loss approaches 1.0 with each step.

LOOP TRACE · 10 iterations
step=0
w = 0.0000
L = 10.0000
grad = -6.0000 (pointing left → minimum is right)
w_new = 0.0 - 0.1×(-6.0) = 0.0 + 0.6 = 0.6000
step=1
w = 0.6000
L = 6.7600 (down from 10.0!)
grad = -4.8000
w_new = 0.6 - 0.1×(-4.8) = 1.0800
step=2
w = 1.0800
L = 4.6864
grad = -3.8400 (gradient shrinking as we near minimum)
w_new = 1.0800 - 0.1×(-3.84) = 1.4640
step=3
w → L = 1.4640 → 3.3593
grad → w_new = -3.0720 → 1.7712
step=4
w → L = 1.7712 → 2.5099
grad → w_new = -2.4576 → 2.0170
step=5
w → L = 2.0170 → 1.9664
grad → w_new = -1.9661 → 2.2136
step=6
w → L = 2.2136 → 1.6185
grad → w_new = -1.5729 → 2.3709
step=7
w → L = 2.3709 → 1.3958
grad → w_new = -1.2583 → 2.4967
step=8
w → L = 2.4967 → 1.2533
grad → w_new = -1.0066 → 2.5973
step=9
w → L = 2.5973 → 1.1621
grad → w_new = -0.8053 → 2.6779
`L = loss(w)` (compute current loss)

Evaluates the loss at the current weight. We track this to verify the loss is decreasing with each step — that is the whole point of gradient descent.

EXECUTION STATE
L = loss(w) = Step 0: loss(0.0) = (0-3)²+1 = 10.0. Step 9: loss(2.5973) = (2.5973-3)²+1 = 1.1621
`grad = loss_gradient(w)` (compute gradient)

The gradient tells us which way is uphill. A negative gradient means the loss decreases to the right, so we should increase w. A positive gradient means decrease w.

EXECUTION STATE
grad = loss_gradient(w) = Step 0: 2(0-3) = -6.0. Step 9: 2(2.5973-3) = -0.8053. Notice: gradient magnitude shrinks as we approach the minimum.
`w_new = w - learning_rate * grad` (THE update rule)

This is the core of gradient descent: subtract the gradient (scaled by the learning rate) from the current weight. The minus sign means we move OPPOSITE to the gradient — downhill. This single line is what makes neural networks learn.

EXECUTION STATE
w = Current position on the loss curve
learning_rate * grad = Step size and direction. Step 0: 0.1 × (-6.0) = -0.6
w - learning_rate * grad = Step 0: 0.0 - (-0.6) = 0.0 + 0.6 = 0.6 (moves right toward minimum!)
⬆ w_new = The updated weight. Each step gets us closer to w=3.
`print(f"Step {step}: w={w:.4f}, L={L:.4f}, grad={grad:.4f}")`

Logs the state at each step. Key pattern: w increases toward 3, L decreases toward 1, gradient magnitude shrinks.

EXECUTION STATE
output (all steps) =
Step 0: w=0.0000, L=10.0000, grad=-6.0000
Step 1: w=0.6000, L=6.7600, grad=-4.8000
Step 2: w=1.0800, L=4.6864, grad=-3.8400
Step 3: w=1.4640, L=3.3593, grad=-3.0720
Step 4: w=1.7712, L=2.5099, grad=-2.4576
Step 5: w=2.0170, L=1.9664, grad=-1.9661
Step 6: w=2.2136, L=1.6185, grad=-1.5729
Step 7: w=2.3709, L=1.3958, grad=-1.2583
Step 8: w=2.4967, L=1.2533, grad=-1.0066
Step 9: w=2.5973, L=1.1621, grad=-0.8053
`w = w_new` (update the weight)

Replace the old weight with the new one. This completes one iteration of gradient descent.

EXECUTION STATE
w = w_new = After step 0: w changes from 0.0 to 0.6. After step 9: w = 2.6779
`print(f"Final: w = {w:.4f}, L = {loss(w):.4f}")`

After 10 steps, w has moved from 0.0 to 2.6779 (close to optimum 3.0) and loss has dropped from 10.0 to 1.1038 (close to minimum 1.0). More steps would bring us even closer.

EXECUTION STATE
output = Final: w = 2.6779, L = 1.1038
→ progress = Traveled 2.6779 out of 3.0 distance to optimum (89.3%). Loss reduced by 89% (10.0 → 1.1038).
import numpy as np

# Loss function: L(w) = (w - 3)² + 1
# Minimum at w = 3, L = 1
def loss(w):
    return (w - 3)**2 + 1

def loss_gradient(w):
    return 2 * (w - 3)

# Initialize
w = 0.0
learning_rate = 0.1

# Run gradient descent for 10 steps
for step in range(10):
    L = loss(w)
    grad = loss_gradient(w)
    w_new = w - learning_rate * grad
    print(f"Step {step}: w={w:.4f}, L={L:.4f}, grad={grad:.4f}")
    w = w_new

print(f"Final: w = {w:.4f}, L = {loss(w):.4f}")
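
The walkthrough notes that a learning rate that is too large overshoots and one that is too small converges slowly. The sketch below reruns the same 1D problem with several learning rates to make that visible. The helper `run_gradient_descent` and the specific rates 0.01, 0.9, and 1.1 are assumptions chosen for this illustration; only 0.1 comes from the example above.

```python
def loss_gradient(w):
    return 2 * (w - 3)       # derivative of L(w) = (w - 3)² + 1

def run_gradient_descent(learning_rate, steps=10, w=0.0):
    """Return the list of weights visited, starting from w."""
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * loss_gradient(w)
        history.append(w)
    return history

final_w = {}
for lr in [0.01, 0.1, 0.9, 1.1]:
    history = run_gradient_descent(lr)
    final_w[lr] = history[-1]
    first_steps = ", ".join(f"{value:.3f}" for value in history[:4])
    print(f"lr = {lr}: first weights = [{first_steps}], after 10 steps w = {history[-1]:.4f}")
```

With 0.01 the weight creeps toward 3, with 0.1 it approaches steadily, with 0.9 it jumps back and forth across the minimum while still closing in, and with 1.1 each step lands farther from 3 than the last.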

Gradient Descent in Multiple Dimensions

The real power of gradient descent emerges with multiple variables. Instead of updating a single weight, we update an entire weight vector simultaneously. The update rule is the same, just with vectors:

$$\mathbf{w}_{\text{new}} = \mathbf{w}_{\text{old}} - \eta \cdot \nabla L(\mathbf{w}_{\text{old}})$$

Let us apply this to our two-variable loss $L(x, y) = x^2 + 3y^2$. The gradient is $\nabla L = [2x, 6y]$. Starting at $(4, 3)$ with learning rate 0.05:

2D Gradient Descent from Scratch
🐍gradient_descent_2d.py
import numpy as np

NumPy is essential here — we use np.array for position and gradient vectors, and np.linalg.norm for gradient magnitude.

EXECUTION STATE
📚 numpy = Used for: np.array (vectors), np.linalg.norm (magnitude), and vectorized arithmetic (pos - lr*grad).
# Loss function of TWO variables: L(x, y) = x² + 3y²

Now we have two parameters to optimize simultaneously. This is closer to a real neural network where you update many weights at once. The gradient tells us which direction in 2D space leads downhill fastest.

def loss_2d(pos):  # Loss as a function of position vector

Takes a 2-element NumPy array [x, y] and returns the loss. Using a vector input (instead of separate x, y) mirrors how real optimization works — all parameters packed into one vector.

EXECUTION STATE
⬇ input: pos = np.array([x, y]) — a 2-D position vector on the loss surface
⬆ returns = x² + 3y² — the loss value (height of the surface)
x, y = pos  # Unpack the position vector

Python unpacking: extracts x and y from the array. pos=[4.0, 3.0] becomes x=4.0, y=3.0.

EXECUTION STATE
x, y = pos = Unpacks [4.0, 3.0] into x=4.0 and y=3.0
return x**2 + 3*y**2

At the starting point [4, 3]: 4² + 3×3² = 16 + 27 = 43.0. The 3× on y² creates an elliptical bowl, steeper in y than x.

EXECUTION STATE
⬆ return = At pos=[4, 3]: 16 + 27 = 43.0
def gradient_2d(pos):  # Compute the gradient vector

Returns ∇L = [dL/dx, dL/dy] = [2x, 6y]. This vector points in the direction of steepest ascent. We move opposite to it for descent.

EXECUTION STATE
⬇ input: pos = np.array([x, y]) — the current position
⬆ returns = np.array([2x, 6y]) — the gradient vector pointing uphill
x, y = pos

Unpacks the position into x and y components.

EXECUTION STATE
x, y = At step 0: x=4.0, y=3.0
return np.array([2*x, 6*y])  # The gradient vector

Packs the partial derivatives into a vector. At [4, 3]: gradient = [2×4, 6×3] = [8.0, 18.0]. The y-component (18.0) is much larger than x (8.0), so the gradient points mostly in the y-direction — the surface is steepest there.

EXECUTION STATE
📚 np.array([2*x, 6*y]) = Creates a 1-D array with both partial derivatives. This IS the gradient.
⬆ return = [8.0, 18.0] at pos=[4, 3]. The y-component is 2.25× larger.
# Starting point: far from the minimum at origin

We start at (4, 3), far from the minimum at (0, 0). The loss here is 43.0 — quite high.

pos = np.array([4.0, 3.0])  # Initial position

The starting point as a NumPy array. In a real network, this would be the initial weights (often random).

EXECUTION STATE
pos = [4.0, 3.0]
→ initial loss = L(4, 3) = 16 + 27 = 43.0
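learning_rate = 0.05  # Smaller for 2D stability

We use a smaller learning rate than the 1-D case because this surface is steeper: the gradient magnitude at the start is |[8, 18]| ≈ 19.7, and the y-direction has curvature 6. Each step multiplies y by (1 − 6·lr), so lr = 0.1 would still converge (factor 0.4), but any lr above 1/6 makes y overshoot zero, and any lr above 1/3 makes it diverge. The learning rate must be tuned to the problem.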

EXECUTION STATE
learning_rate = 0.05 = Each step moves by 0.05 × gradient. Step 0: 0.05 × [8, 18] = [0.4, 0.9]
# Run gradient descent

We run 8 steps. Key insight: the y-coordinate converges faster than x because the gradient is steeper in y (coefficient 3× in the loss). This asymmetry is common in real networks.

for step in range(8):  # 8 optimization steps

Each step updates both x and y simultaneously using the gradient vector. Watch how y converges faster than x.

LOOP TRACE · 8 iterations
step=0
pos = [4.0000, 3.0000]
L = 43.0000
grad = [8.0000, 18.0000]
new_pos = [4.0-0.4, 3.0-0.9] = [3.6000, 2.1000]
step=1
pos = [3.6000, 2.1000]
L = 26.1900 (down 39%)
grad = [7.2000, 12.6000]
new_pos = [3.2400, 1.4700]
step=2
pos → L = [3.2400, 1.4700] → 16.9803
grad → new_pos = [6.4800, 8.8200] → [2.9160, 1.0290]
step=3
pos → L = [2.9160, 1.0290] → 11.6796
grad → new_pos = [5.8320, 6.1740] → [2.6244, 0.7203]
step=4
pos → L = [2.6244, 0.7203] → 8.4440
grad → new_pos = [5.2488, 4.3218] → [2.3620, 0.5042]
step=5
pos → L = [2.3620, 0.5042] → 6.3415
grad → new_pos = [4.7239, 3.0253] → [2.1258, 0.3529]
step=6
pos → L = [2.1258, 0.3529] → 4.8926
grad → new_pos = [4.2515, 2.1177] → [1.9132, 0.2471]
step=7
pos → L = [1.9132, 0.2471] → 3.8434
grad → new_pos = [3.8264, 1.4824] → [1.7219, 0.1729]
L = loss_2d(pos)  # Current loss value

Evaluates the loss at the current position, before the update is applied. Inside the loop, L falls from 43.0 (step 0) to 3.84 (step 7); after the final update the loss is 3.05, a 93% reduction overall.

EXECUTION STATE
L = Step 0: 43.0, Step 4: 8.44, Step 7: 3.84
grad = gradient_2d(pos)  # Gradient at current position

Computes ∇L = [2x, 6y]. As we approach the origin, both components shrink, and the gradient magnitude decreases — steps become smaller automatically.

EXECUTION STATE
grad = Step 0: [8.0, 18.0], Step 7: [3.83, 1.48]. Note how y-gradient shrinks faster.
new_pos = pos - learning_rate * grad  # 2D update

Vector subtraction: pos - 0.05 * grad. NumPy handles this element-wise: x_new = x - 0.05*2x = x*(1-0.1) = 0.9x, and y_new = y - 0.05*6y = y*(1-0.3) = 0.7y. The y-coordinate shrinks by 30% per step while x shrinks by only 10%!

EXECUTION STATE
pos - learning_rate * grad = Element-wise vector operation. Each component updated independently.
x update factor = x_new = x - 0.05*(2x) = 0.9x — x shrinks by 10% per step
y update factor = y_new = y - 0.05*(6y) = 0.7y — y shrinks by 30% per step
→ asymmetry = y shrinks 3× as much per step as x (30% vs. 10%) because its curvature is 3× larger (second derivative 6 vs. 2). This is typical of real loss landscapes!
print(f"Step {step}: pos={pos}, L={L:.4f}, |grad|={np.linalg.norm(grad):.4f}")

Logs the position, loss, and gradient magnitude at each step.

EXECUTION STATE
output (first 3) =
Step 0: pos=[4. 3.], L=43.0000, |grad|=19.6977
Step 1: pos=[3.6 2.1], L=26.1900, |grad|=14.5121
Step 2: pos=[3.24 1.47], L=16.9803, |grad|=10.9445
pos = new_pos  # Update the position

Replace old position with the new one. Both x and y move simultaneously.

EXECUTION STATE
pos = new_pos = After step 7: pos = [1.7219, 0.1729]
print(f"Final: pos={pos}, L={loss_2d(pos):.4f}")

After 8 steps: position [1.7219, 0.1729], loss 3.0546. The y-coordinate (0.1729) is much closer to 0 than x (1.7219) because it converged faster. More steps would bring both to zero.

EXECUTION STATE
output = Final: pos=[1.72186884 0.17294403], L=3.0546
→ y converged faster = y: 3.0 → 0.17 (94% reduction). x: 4.0 → 1.72 (57% reduction). The steeper dimension converges first.
```python
import numpy as np

# Loss function of TWO variables: L(x, y) = x² + 3y²
def loss_2d(pos):
    x, y = pos
    return x**2 + 3*y**2

def gradient_2d(pos):
    x, y = pos
    return np.array([2*x, 6*y])

# Starting point: far from the minimum at origin
pos = np.array([4.0, 3.0])
learning_rate = 0.05

# Run gradient descent
for step in range(8):
    L = loss_2d(pos)
    grad = gradient_2d(pos)
    new_pos = pos - learning_rate * grad
    print(f"Step {step}: pos={pos}, L={L:.4f}, |grad|={np.linalg.norm(grad):.4f}")
    pos = new_pos

print(f"Final: pos={pos}, L={loss_2d(pos):.4f}")
```
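A hand-derived gradient such as [2x, 6y] is easy to get wrong, so it can be worth checking it numerically before running the optimizer. The sketch below is one way to do that with central differences in plain Python. The function names and the step size h are choices made for this illustration, not part of the listing above.

```python
def loss_2d(x, y):
    # Same loss as the listing: L(x, y) = x² + 3y²
    return x**2 + 3 * y**2

def analytic_gradient(x, y):
    # Hand-derived partial derivatives
    return [2 * x, 6 * y]

def numerical_gradient(f, x, y, h=1e-5):
    # Central differences: nudge one coordinate at a time, hold the other fixed
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return [dfdx, dfdy]

analytic = analytic_gradient(4.0, 3.0)
numeric = numerical_gradient(loss_2d, 4.0, 3.0)
print("analytic:", analytic)
print("numeric: ", [round(g, 4) for g in numeric])
```

At the starting point (4, 3) the two results should agree to several decimal places. A large mismatch usually points to an error in the derivative formula.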

Interactive 3D Visualization: Watch Gradient Descent Navigate a Loss Surface

The 3D visualization below shows gradient descent in action on a real loss surface. The orange ball follows the negative gradient downhill, leaving a yellow trail. The green arrow shows the descent direction. Try different surfaces, learning rates, and starting positions to build intuition for how gradient descent behaves.

Experiment: Try the Rosenbrock surface — it has a narrow curved valley where gradient descent oscillates side-to-side while slowly progressing along the valley floor. This demonstrates a real challenge in optimization: the gradient may not point directly toward the minimum. Advanced optimizers like Adam and RMSProp (Chapter 9) address this.
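For readers without access to the interactive widget, the Rosenbrock experiment can be approximated in a few lines. This sketch assumes the standard form f(x, y) = (1 − x)² + 100(y − x²)², whose minimum is at (1, 1). The surface used in the visualization may be scaled differently, and the starting point and learning rate here are arbitrary choices.

```python
def rosenbrock(x, y):
    # Standard Rosenbrock function (assumed form), minimum at (1, 1)
    return (1 - x) ** 2 + 100 * (y - x**2) ** 2

def rosenbrock_gradient(x, y):
    # Partial derivatives of the function above
    dfdx = -2 * (1 - x) - 400 * x * (y - x**2)
    dfdy = 200 * (y - x**2)
    return dfdx, dfdy

x, y = 0.0, 0.0          # assumed starting point
learning_rate = 0.001    # assumed; much larger values can become unstable here
initial_loss = rosenbrock(x, y)

for step in range(1000):
    gx, gy = rosenbrock_gradient(x, y)
    x, y = x - learning_rate * gx, y - learning_rate * gy

final_loss = rosenbrock(x, y)
print(f"after 1000 steps: x={x:.3f}, y={y:.3f}, loss={final_loss:.4f}")
```

Even after 1000 updates the point should still be some distance from (1, 1): the loss falls quickly at first, then progress along the valley floor becomes slow.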

The Learning Rate: Step Size Matters

The learning rate $\eta$ is one of the most important hyperparameters in neural network training. It controls the size of each update step:

  • Too small ($\eta \ll 1$): each step is tiny, convergence is painfully slow, and training may stall before reaching a good solution.
  • Just right: smooth convergence to the minimum in a reasonable number of steps.
  • Too large ($\eta$ above the stability threshold of the loss): steps overshoot the minimum, causing oscillation or even divergence, where the loss grows without bound.

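For a simple quadratic loss $L(w) = (w - w^*)^2$, the gradient is $2(w - w^*)$, so the update becomes $w_{\text{new}} = w(1 - 2\eta) + 2\eta w^*$. Subtracting $w^*$ from both sides gives $w_{\text{new}} - w^* = (1 - 2\eta)(w - w^*)$: each step multiplies the distance to the optimum by $1 - 2\eta$. This converges when $|1 - 2\eta| < 1$, which requires $0 < \eta < 1$. At $\eta = 1$ the weight bounces between two points forever without getting closer; for $\eta > 1$ it oscillates with growing amplitude and the loss diverges.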
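The convergence condition can be checked directly. The sketch below runs gradient descent on the quadratic $L(w) = (w - w^*)^2$ for several learning rates and reports how far the weight ends up from the optimum. The function name run and the values of w0, w_star, and steps are illustrative assumptions.

```python
def run(learning_rate, w0=0.0, w_star=3.0, steps=20):
    # Gradient descent on L(w) = (w - w_star)**2.
    # Returns the final distance |w - w_star|.
    w = w0
    for _ in range(steps):
        grad = 2 * (w - w_star)
        w = w - learning_rate * grad
    return abs(w - w_star)

for lr in [0.1, 0.5, 0.9, 1.0, 1.1]:
    print(f"eta={lr}: |w - w*| after 20 steps = {run(lr):.4g}")
```

With these settings the distance shrinks for η = 0.1 and η = 0.9 and drops to zero in a single step for η = 0.5. It stays at its initial value of 3 for η = 1.0, and grows for η = 1.1.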

Interactive: The Effect of Learning Rate

Use the slider below to experiment with different learning rates. Watch how the optimization trajectory and convergence behavior change dramatically with this single parameter.


PyTorch Autograd: Automatic Derivatives

Everything we have computed by hand — derivatives, partial derivatives, gradients — PyTorch does automatically. This is called automatic differentiation (autograd). Here is how it works:

  1. Create tensors with requires_grad=True. This tells PyTorch to track all operations on these tensors.
  2. Compute a function of those tensors (the forward pass). PyTorch builds a computational graph recording every operation.
  3. Call .backward() on the output. PyTorch traverses the graph in reverse, computing all partial derivatives using the chain rule (the backward pass).
  4. Read the gradients from tensor.grad. Each input tensor's .grad attribute contains the derivative of the output with respect to that tensor.

No derivative formulas needed! For a neural network with millions of parameters, this is the difference between impossible and practical.
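To make the four steps less magical, here is a toy reverse-mode differentiation engine in plain Python. It is a teaching sketch only: the Value class is hypothetical and supports just addition, multiplication, and constant powers. It is not how PyTorch is implemented internally, but it follows the same idea of recording each operation and then applying the chain rule backwards.

```python
class Value:
    """A scalar that remembers how it was computed (hypothetical helper, not a PyTorch class)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one entry per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    __radd__ = __add__
    __rmul__ = __mul__

    def __pow__(self, exponent):
        # Only constant (int or float) exponents are supported in this sketch
        local = exponent * self.data ** (exponent - 1)
        return Value(self.data ** exponent, (self,), (local,))

    def backward(self):
        # Topological order: every node comes after the nodes it was computed from
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

x = Value(2.0)
y = Value(1.0)
z = x**2 + 3 * y**2   # forward pass: builds the graph
z.backward()          # backward pass: fills in .grad
print(z.data, x.grad, y.grad)
```

The forward pass builds the graph as a side effect of ordinary arithmetic, and a single backward() call fills in the gradient for every input. Note that gradients are added with +=, which mirrors the accumulation behaviour of PyTorch's .grad.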

PyTorch Autograd: Automatic Differentiation
🐍pytorch_autograd.py
import torch

PyTorch is a deep learning framework with automatic differentiation (autograd). Instead of computing derivatives by hand or numerically, PyTorch tracks all operations and computes exact gradients automatically using the chain rule.

EXECUTION STATE
📚 torch = PyTorch library. Key features: tensors (like NumPy arrays but GPU-capable), autograd (automatic differentiation), nn (neural network layers). Import as torch by convention.
# === Automatic differentiation in PyTorch ===

Autograd is the engine that powers neural network training. Instead of deriving gradient formulas by hand (tedious, error-prone for complex networks), PyTorch records every operation you perform on tensors and automatically computes all derivatives when you call .backward().

# Create a tensor WITH gradient tracking

The requires_grad=True flag tells PyTorch to record all operations on this tensor so it can compute derivatives later. Without this flag, PyTorch treats the tensor as a constant (no gradient computation).

x = torch.tensor(2.0, requires_grad=True)

Creates a scalar tensor with value 2.0 and enables gradient tracking. PyTorch will build a computational graph of every operation involving x.

EXECUTION STATE
📚 torch.tensor(value, requires_grad) = Creates a PyTorch tensor from a Python number or list. requires_grad=True tells autograd to track operations for automatic differentiation.
⬇ value: 2.0 = The tensor’s value. A 0-dimensional (scalar) tensor.
⬇ requires_grad: True = Enables gradient computation. PyTorch records every operation (x**2, x+y, etc.) in a computational graph. When you call .backward(), it traverses this graph to compute derivatives.
x = tensor(2.0, requires_grad=True)
# Compute a function of x

When we write y = x**2, PyTorch doesn’t just compute 4.0 — it also records that y was created by squaring x. This recorded graph is used later to compute dy/dx.

y = x ** 2

Computes y = x² = 2.0² = 4.0. Behind the scenes, PyTorch creates a computational graph node: y = PowBackward(x, 2). This node knows that dy/dx = 2x.

EXECUTION STATE
** (power operator) = Same as Python’s ** but on a tracked tensor. PyTorch overloads this operator to record the computation.
y = tensor(4.0, grad_fn=<PowBackward0>) — the value 4.0 plus a record of how it was computed
→ grad_fn = PowBackward0 — tells autograd that y = x**2 and the local derivative is 2x
print(f"x = {x.item()}, y = x² = {y.item()}")

.item() extracts the Python number from a scalar tensor. Useful for printing — avoids the tensor(...) wrapper in output.

EXECUTION STATE
📚 .item() = Converts a scalar tensor to a plain Python float. Only works on tensors with exactly one element.
output = x = 2.0, y = x² = 4.0
# Compute the derivative: dy/dx

Now we call .backward() which traverses the computational graph in reverse and computes dy/dx using the chain rule. The result is stored in x.grad.

y.backward()  # Compute all gradients automatically

This single call computes dy/dx for every tensor that has requires_grad=True. PyTorch walks the graph backwards: y = x² → dy/dx = 2x = 2×2 = 4.0. The gradient is stored in x.grad.

EXECUTION STATE
📚 .backward() = Triggers reverse-mode automatic differentiation (backpropagation). Traverses the computational graph from output to inputs, computing gradients using the chain rule at each step.
→ graph traversal = y = x² → dy/dx = 2x. At x=2: dy/dx = 4.0. This is stored in x.grad.
→ after .backward() = x.grad = tensor(4.0) — the derivative dy/dx evaluated at x=2
print(f"dy/dx = {x.grad.item()}")

Accesses the computed gradient. x.grad contains dy/dx = 4.0, matching our analytical result f′(2) = 2×2 = 4.

EXECUTION STATE
📚 x.grad = Attribute that stores the gradient of the output with respect to x after .backward() is called. None before .backward().
output = dy/dx = 4.0 — matches the hand-computed derivative 2x = 2×2 = 4
# === Multivariate example ===

Now we compute gradients of a function with two inputs — exactly like computing the gradient vector ∇f = [df/dx, df/dy]. PyTorch handles this automatically: one .backward() call computes all partial derivatives.

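x = torch.tensor(2.0, requires_grad=True)

Creates a new tracked tensor for x. We use fresh tensors so that x.grad starts out empty: PyTorch accumulates gradients, so calling .backward() again with the old x would add the new gradient to the 4.0 already stored in x.grad.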

EXECUTION STATE
x = tensor(2.0, requires_grad=True) — fresh tensor, x.grad is None until .backward()
y = torch.tensor(1.0, requires_grad=True)

Creates a tracked tensor for y. Both x and y will have their .grad populated after calling .backward() on the output.

EXECUTION STATE
y = tensor(1.0, requires_grad=True)
# Same loss function: z = x² + 3y²

The same elliptical bowl function we used in NumPy. Now PyTorch will compute both partial derivatives automatically.

z = x**2 + 3*y**2

Builds a computational graph: z = Add(Pow(x, 2), Mul(3, Pow(y, 2))). Each operation is recorded. z = 2² + 3×1² = 4 + 3 = 7.0.

EXECUTION STATE
x**2 = 2.0² = 4.0
3*y**2 = 3 × 1.0² = 3.0
z = tensor(7.0, grad_fn=<AddBackward0>) — value 7.0 with recorded computation graph
print(f"z = x² + 3y² = {z.item()}")

Prints the loss value.

EXECUTION STATE
output = z = x² + 3y² = 7.0
# backward() computes ALL partial derivatives at once

One call to .backward() computes dz/dx AND dz/dy simultaneously. This is the power of reverse-mode automatic differentiation — no matter how many parameters, one backward pass computes all their gradients.

z.backward()  # Compute the full gradient

Traverses the graph backwards from z to x and y. Computes: dz/dx = 2x = 2×2 = 4.0, dz/dy = 6y = 6×1 = 6.0. Stores results in x.grad and y.grad.

EXECUTION STATE
📚 z.backward() = Backpropagation: follows the graph z → x**2 + 3*y**2 → x, y. Applies the chain rule at each node.
→ dz/dx = d/dx[x² + 3y²] = 2x = 2×2 = 4.0 → stored in x.grad
→ dz/dy = d/dy[x² + 3y²] = 6y = 6×1 = 6.0 → stored in y.grad
print(f"dz/dx = {x.grad.item()}")

Displays the partial derivative with respect to x.

EXECUTION STATE
output = dz/dx = 4.0 — matches our hand-computed result: 2×2 = 4
print(f"dz/dy = {y.grad.item()}")

Displays the partial derivative with respect to y.

EXECUTION STATE
output = dz/dy = 6.0 — matches our hand-computed result: 6×1 = 6
print(f"Gradient: [{x.grad.item()}, {y.grad.item()}]")

The complete gradient vector ∇z = [4.0, 6.0]. Identical to what we computed by hand with NumPy, but PyTorch did all the calculus automatically. For a neural network with millions of parameters, this saves enormous effort.

EXECUTION STATE
output = Gradient: [4.0, 6.0]
→ key insight = Same result as manual computation, but no derivative formulas needed! For neural networks with millions of weights, this is essential — computing gradients by hand would be impossible.
```python
import torch

# === Automatic differentiation in PyTorch ===

# Create a tensor WITH gradient tracking
x = torch.tensor(2.0, requires_grad=True)

# Compute a function of x
y = x ** 2
print(f"x = {x.item()}, y = x² = {y.item()}")

# Compute the derivative: dy/dx
y.backward()
print(f"dy/dx = {x.grad.item()}")

# === Multivariate example ===
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)

# Same loss function: z = x² + 3y²
z = x**2 + 3*y**2
print(f"z = x² + 3y² = {z.item()}")

# backward() computes ALL partial derivatives at once
z.backward()
print(f"dz/dx = {x.grad.item()}")
print(f"dz/dy = {y.grad.item()}")
print(f"Gradient: [{x.grad.item()}, {y.grad.item()}]")
```

Gradient Descent in PyTorch

Now let us combine autograd with gradient descent to replicate our NumPy example in PyTorch. The code follows the standard PyTorch training loop that you will see in every deep learning project:

  1. Forward pass: compute the loss from current weights
  2. Backward pass: call loss.backward() to compute gradients
  3. Update weights: subtract $\eta \cdot \text{grad}$ inside torch.no_grad()
  4. Zero gradients: call w.grad.zero_() for the next iteration
Gradient Descent in PyTorch
🐍pytorch_gradient_descent.py
import torch

PyTorch provides everything we need: tensors with gradient tracking, automatic differentiation, and the building blocks for gradient descent.

EXECUTION STATE
📚 torch = We use torch.tensor (tracked arrays), .backward() (auto-differentiation), torch.no_grad() (disable tracking), and .grad.zero_() (reset gradients).
# Same loss: L(x, y) = x² + 3y²

We replicate the same 2-D gradient descent example but now using PyTorch’s autograd instead of hand-computed gradients. The results will be identical.

w = torch.tensor([4.0, 3.0], requires_grad=True)

Creates a 1-D tensor with two elements representing [x, y]. requires_grad=True enables automatic gradient computation for both elements.

EXECUTION STATE
📚 torch.tensor([...], requires_grad=True) = Creates a tensor from a list with gradient tracking. The tensor is 1-D with shape (2,).
w = tensor([4.0, 3.0], requires_grad=True) — same starting point as NumPy version
learning_rate = 0.05

Same learning rate as our NumPy example. The results should be identical.

EXECUTION STATE
learning_rate = 0.05 = Step size hyperparameter. Same as NumPy version.
for step in range(8):  # 8 training steps

Each iteration follows the standard PyTorch training loop: forward pass → backward pass → update → zero gradients. This pattern is universal across all PyTorch training code.

LOOP TRACE · 8 iterations (steps 2–7 condensed)
step=0
w = [4.0000, 3.0000]
loss = 43.0000
w.grad = [8.0000, 18.0000]
new w = [3.6000, 2.1000]
step=1
w = [3.6000, 2.1000]
loss = 26.1900
w.grad = [7.2000, 12.6000]
new w = [3.2400, 1.4700]
step=2–7
trajectory = [3.24,1.47] → [2.92,1.03] → [2.62,0.72] → [2.36,0.50] → [2.13,0.35] → [1.91,0.25] → [1.72,0.17]
loss trajectory = 16.98 → 11.68 → 8.44 → 6.34 → 4.89 → 3.84 → 3.05
# Forward pass: compute loss

In neural network terminology, computing the output from the inputs is called the forward pass. Here, our “network” is just L = x² + 3y².

loss = w[0]**2 + 3*w[1]**2

Computes the loss and records the computation graph. w[0] is x, w[1] is y. At step 0: 4² + 3×3² = 16 + 27 = 43.0.

EXECUTION STATE
w[0] = The x-component. Step 0: 4.0
w[1] = The y-component. Step 0: 3.0
loss = tensor(43.0, grad_fn=<AddBackward0>). Step 7: 3.8434
# Backward pass: compute gradients

The backward pass traverses the computation graph in reverse, applying the chain rule to compute dL/dw[0] and dL/dw[1]. This is backpropagation.

loss.backward()  # Backpropagation

Computes gradients and stores them in w.grad. At step 0: w.grad = [8.0, 18.0] = [2×4, 6×3]. This is the same gradient we computed by hand in the NumPy version.

EXECUTION STATE
📚 .backward() = Runs backpropagation through the computation graph. Computes dL/dw for every tensor with requires_grad=True.
w.grad after backward() = Step 0: tensor([8.0, 18.0]). Same as our hand-computed [2x, 6y] = [2×4, 6×3].
# Update weights (no gradient tracking here!)

The weight update is bookkeeping, not part of the function we are differentiating, so autograd should not record it. torch.no_grad() tells autograd to skip tracking.

with torch.no_grad():

A context manager that disables gradient computation. Any tensor operations inside this block are not recorded in the computation graph. This is critical for the weight update step.

EXECUTION STATE
📚 torch.no_grad() = Context manager: temporarily disables gradient tracking. Without it, the in-place update w -= lr*grad raises a RuntimeError, because w is a leaf tensor that requires grad.
→ why needed = The update w -= lr*grad should NOT be part of the computation graph. It’s a parameter update, not a forward pass operation.
w -= learning_rate * w.grad  # Gradient descent step

The weight update: w = w - 0.05 × w.grad. This is element-wise: w[0] -= 0.05×8.0 = 0.4, w[1] -= 0.05×18.0 = 0.9. The -= operator modifies the tensor in-place.

EXECUTION STATE
learning_rate * w.grad = Step 0: 0.05 × [8.0, 18.0] = [0.4, 0.9]
w -= [0.4, 0.9] = [4.0-0.4, 3.0-0.9] = [3.6, 2.1]
print(f"Step {step}: w=[{w[0].item():.4f}, {w[1].item():.4f}], L={loss.item():.4f}")

Prints the updated weights and the loss from this step.

EXECUTION STATE
output (first 2) =
Step 0: w=[3.6000, 2.1000], L=43.0000
Step 1: w=[3.2400, 1.4700], L=26.1900
# CRITICAL: zero the gradients for next iteration

PyTorch ACCUMULATES gradients by default — each .backward() call ADDS to .grad instead of replacing it. This is by design (useful for gradient accumulation across mini-batches), but if we forget to zero, the gradients will be wrong.

w.grad.zero_()  # Reset gradients to zero

Sets w.grad to [0, 0]. The underscore suffix (zero_) means in-place operation. Without this line, step 1’s gradient would be step 0’s gradient + step 1’s gradient = incorrect!

EXECUTION STATE
📚 .zero_() = In-place operation (the _ suffix is PyTorch convention for in-place). Sets all elements to 0. Called .grad.zero_() to reset accumulated gradients.
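→ what if we forget? = Without zeroing: step 1 would compute grad=[7.2, 12.6] but .grad would contain [8+7.2, 18+12.6]=[15.2, 30.6], more than double the correct value. Every later update would use the running sum of all past gradients, so the weights would overshoot and swing around the minimum instead of settling.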
print(f"Final: L = {(w[0]**2 + 3*w[1]**2).item():.4f}")

Final loss after 8 steps: 3.0546. Identical to our NumPy result — proving that PyTorch’s automatic gradients give the same result as hand-computed ones, just with much less code.

EXECUTION STATE
output = Final: L = 3.0546
→ comparison = PyTorch: 3.0546. NumPy: 3.0546. Identical! Same math, different tools.
```python
import torch

# Same loss: L(x, y) = x² + 3y²
w = torch.tensor([4.0, 3.0], requires_grad=True)
learning_rate = 0.05

for step in range(8):
    # Forward pass: compute loss
    loss = w[0]**2 + 3*w[1]**2

    # Backward pass: compute gradients
    loss.backward()

    # Update weights (no gradient tracking here!)
    with torch.no_grad():
        w -= learning_rate * w.grad

    print(f"Step {step}: w=[{w[0].item():.4f}, {w[1].item():.4f}], L={loss.item():.4f}")

    # CRITICAL: zero the gradients for next iteration
    w.grad.zero_()

print(f"Final: L = {(w[0]**2 + 3*w[1]**2).item():.4f}")
```
Important Detail (Zeroing Gradients): PyTorch accumulates gradients by default (each .backward() adds to the existing .grad). You must reset them, for example with .grad.zero_(), before each new backward pass. Forgetting this is one of the most common PyTorch bugs: the gradients silently accumulate, every update uses a stale running sum, and training oscillates or diverges for no apparent reason.

In practice, PyTorch provides the torch.optim module with optimizers like SGD, Adam, and AdaGrad. An optimizer performs the update step when you call optimizer.step() and resets the stored gradients when you call optimizer.zero_grad(). We will explore these in Chapter 9 (Optimizers). But it is critical to understand what they are doing under the hood, which is exactly the gradient descent loop we just implemented.
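As a preview, the manual loop above can be rewritten with torch.optim.SGD. This is a minimal sketch assuming plain SGD with no momentum and the same starting point and learning rate; under those assumptions it should follow the same trajectory as the hand-written version.

```python
import torch

# Same starting point and learning rate as the manual loop
w = torch.tensor([4.0, 3.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.05)

for step in range(8):
    optimizer.zero_grad()            # reset gradients left over from the previous step
    loss = w[0]**2 + 3 * w[1]**2     # forward pass
    loss.backward()                  # backward pass: fills w.grad
    optimizer.step()                 # update: w <- w - lr * w.grad

final_loss = (w[0]**2 + 3 * w[1]**2).item()
print(f"Final: L = {final_loss:.4f}")
```

The optimizer replaces the torch.no_grad() block and the explicit w.grad.zero_() call, but zero_grad() still has to be called once per iteration; it does not happen automatically.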


Summary and Key Takeaways

This section covered the mathematical foundation of how neural networks learn. Here are the key concepts:

| Concept | Definition | Role in Neural Networks |
| --- | --- | --- |
| Derivative f′(x) | Rate of change of f with respect to x | Tells how the loss changes when one weight changes |
| Partial derivative ∂f/∂x | Derivative with respect to one variable, others held constant | Isolates the effect of one specific weight on the loss |
| Gradient ∇f | Vector of all partial derivatives | Points in direction of steepest ascent; −∇f points downhill |
| Gradient descent | w_new = w_old − η⋅∇L | The algorithm that updates weights to minimize loss |
| Learning rate η | Step size multiplier | Controls convergence speed vs. stability |
| PyTorch autograd | Automatic differentiation via .backward() | Computes all gradients automatically, no formulas needed |
The Big Picture: Training a neural network is just gradient descent on a high-dimensional loss surface. The loss function measures prediction error, the gradient tells us how to adjust each weight to reduce that error, and the learning rate controls how aggressively we make those adjustments. In the next section, we will cover the probability basics that define the loss functions we optimize. In Section 4, we will see how the chain rule lets us compute gradients through the many layers of a deep network.