Chapter 2

Introduction to PyTorch

Python and PyTorch Essentials

Why PyTorch?

In the previous section, we built a complete neural network layer in NumPy — matrix multiplications, bias additions, ReLU activations. The forward pass worked perfectly. But there was a problem we deliberately ignored: how do we update the weights?

To update weights, we need gradients — the derivatives of the loss with respect to each weight. For our tiny 2-neuron layer, computing gradients by hand is tedious but possible. For a real network with millions of parameters, it is impossible to do by hand and extremely error-prone to code manually. This is the problem PyTorch solves.

The One Sentence Summary: PyTorch is NumPy with automatic gradient computation. Every operation you already know (+, ×, @) works the same way, but PyTorch secretly records every computation so it can automatically calculate derivatives when you ask.

Here is what PyTorch adds on top of NumPy:

| Feature | NumPy | PyTorch |
|---|---|---|
| Core data structure | ndarray | Tensor (similar API, more features) |
| Automatic gradients | No | Yes, via requires_grad=True + .backward() |
| GPU acceleration | No | Yes, .to('cuda') moves tensors to GPU |
| Neural network layers | Build from scratch | torch.nn provides pre-built layers |
| Optimizers (SGD, Adam) | Build from scratch | torch.optim provides pre-built optimizers |
| Computation graph | None | Dynamic graph built on-the-fly |

The key insight is that PyTorch is less a replacement for NumPy than an extension of it: it covers the same core array operations and adds gradient tracking and GPU support. If you know NumPy, you already know how to write most PyTorch code. The syntax is deliberately designed to feel familiar. The main new concept is the computational graph: a record of every operation that PyTorch uses to compute gradients automatically.


From NumPy to Tensors: Your First Translation

A PyTorch tensor is the direct equivalent of a NumPy ndarray. It stores numbers in a multi-dimensional grid, supports the same indexing and slicing, and provides the same mathematical operations. The name "tensor" comes from mathematics — it simply means "a multi-dimensional array of numbers." A scalar is a 0D tensor, a vector is a 1D tensor, a matrix is a 2D tensor, and so on.
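As a small illustration of the scalar/vector/matrix terminology, the sketch below builds tensors of increasing dimensionality and prints their ndim and shape. The variable names are chosen only for this example.

```python
import torch

scalar = torch.tensor(3.14)                 # 0D tensor: a single number
vector = torch.tensor([1.0, 2.0, 3.0])      # 1D tensor: 3 elements
matrix = torch.tensor([[1.0, 2.0],
                       [3.0, 4.0]])         # 2D tensor: 2 rows x 2 columns
cube = torch.zeros(2, 3, 4)                 # 3D tensor filled with zeros

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")

# Indexing follows the NumPy convention: row 0, column 1
top_right = matrix[0, 1].item()
print(top_right)  # 2.0
```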

Let's translate our NumPy code from Section 1 into PyTorch, line by line:

From NumPy to PyTorch: Side by Side
🐍numpy_to_pytorch.py
import numpy as np

We import NumPy first to show the side-by-side comparison. Everything we built in the previous section — arrays, dot products, matrix multiply — has a direct PyTorch equivalent.

EXECUTION STATE
📚 numpy = The numerical computing library we used in Section 1. Provides ndarray, linear algebra, and math functions.
import torch

PyTorch is imported as just ‘torch’ — no alias needed. This single import gives you tensors, automatic differentiation, neural network layers, optimizers, and GPU support. It’s the equivalent of importing NumPy, but with superpowers for deep learning.

EXECUTION STATE
📚 torch = PyTorch’s core module. Provides: torch.Tensor (like np.ndarray), torch.autograd (automatic differentiation), torch.nn (neural network layers), torch.optim (optimizers like SGD, Adam).
→ why ‘torch’? = PyTorch’s name comes from ‘Py’ (Python) + ‘Torch’ (an earlier ML framework in Lua). The import name is just ‘torch’ by convention.
# NumPy: what you already know

We start with the familiar NumPy code from Section 1. The goal is to show that PyTorch's syntax is nearly identical: if you know NumPy, you already know most of the tensor syntax.

x_np = np.array([1.0, 2.0, 3.0])

Creates a NumPy ndarray — a 1D vector with 3 float64 elements. This is the familiar pattern from Section 1.

EXECUTION STATE
📚 np.array() = Creates an ndarray from a Python list. Infers dtype from values: floats → float64.
⬇ arg: [1.0, 2.0, 3.0] = Python list of 3 floats
⬆ result: x_np = [1. 2. 3.]
print(x_np)

Displays the NumPy array. Notice the compact format with trailing dots for floats.

EXECUTION STATE
output = [1. 2. 3.]
print(type(x_np))

Shows the Python type of the object. NumPy arrays are numpy.ndarray objects.

EXECUTION STATE
output = <class 'numpy.ndarray'>
# PyTorch: almost identical syntax

Now the key comparison. Watch how little changes when moving from NumPy to PyTorch. This is by design — PyTorch was built to feel like NumPy.

x_pt = torch.tensor([1.0, 2.0, 3.0])

Creates a PyTorch tensor — the fundamental data structure of PyTorch. Syntactically, just replace ‘np.array’ with ‘torch.tensor’. But under the hood, this tensor can track gradients, move to GPU, and participate in a computational graph.

EXECUTION STATE
📚 torch.tensor() = Creates a new PyTorch tensor from data (list, tuple, NumPy array, or scalar). Unlike np.array, defaults to float32 (not float64) — this saves memory and is standard for deep learning.
⬇ arg: [1.0, 2.0, 3.0] = Python list of 3 floats — same input as np.array
⬆ result: x_pt = tensor([1., 2., 3.])
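→ key difference = Default dtype is float32 (not float64 like NumPy). GPU-friendly: 32-bit values use half the memory, and 32-bit math is considerably faster than 64-bit on most GPUs (the exact ratio depends on the hardware). Can optionally track gradients via requires_grad=True.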
print(x_pt)

PyTorch prints tensors with the prefix ‘tensor(...)’ so you can always tell them apart from NumPy arrays at a glance.

EXECUTION STATE
output = tensor([1., 2., 3.])
print(type(x_pt))

The Python type is torch.Tensor — PyTorch’s equivalent of numpy.ndarray. Both are multi-dimensional array types, but Tensor adds gradient tracking and GPU support.

EXECUTION STATE
output = <class 'torch.Tensor'>
# Convert NumPy → PyTorch (shared memory!)

PyTorch and NumPy can share the same underlying memory. This means conversion is nearly free — no data copying. But it also means modifying one changes the other!

x_from_np = torch.from_numpy(x_np)

Converts a NumPy array to a PyTorch tensor WITHOUT copying data. The tensor and array point to the same memory. This is extremely efficient for large datasets but means changes to one affect the other.

EXECUTION STATE
📚 torch.from_numpy() = Creates a tensor that shares memory with the input NumPy array. Zero-copy conversion — O(1) time regardless of array size. The returned tensor inherits the NumPy dtype (float64 here).
⬇ arg: x_np = NumPy array [1. 2. 3.] with dtype float64
⬆ result: x_from_np = tensor([1., 2., 3.], dtype=torch.float64)
→ shared memory = If you do x_np[0] = 99, then x_from_np[0] also becomes 99! They point to the same memory. Use .clone() if you need an independent copy.
print(x_from_np)

Notice dtype=torch.float64 — it preserved the NumPy dtype. When you create tensors via torch.tensor(), the default is float32. When converting from NumPy, the dtype is preserved.

EXECUTION STATE
output = tensor([1., 2., 3.], dtype=torch.float64)
# Convert PyTorch → NumPy

The reverse conversion is just as easy. This is useful when you want to use NumPy-based plotting libraries (matplotlib) or other NumPy-dependent tools.

x_back = x_pt.numpy()

Converts a PyTorch tensor back to a NumPy array. Also shares memory (zero-copy). Only works for tensors on CPU — GPU tensors must be moved to CPU first with .cpu().

EXECUTION STATE
📚 .numpy() = Tensor method: returns a NumPy ndarray sharing the same memory. Only works if: (1) tensor is on CPU, (2) tensor does not require gradients. For grad tensors, use .detach().numpy().
⬆ result: x_back = [1. 2. 3.]
print(x_back)

Back to a familiar NumPy array. The round-trip NumPy → PyTorch → NumPy is seamless.

EXECUTION STATE
output = [1. 2. 3.]
# Tensor properties

PyTorch tensors have the same core properties as NumPy arrays (shape, dtype) plus one critical addition: device. This tells you whether the tensor lives in CPU memory or GPU memory.

print(x_pt.shape)

Shape works identically to NumPy. Returns a torch.Size object (behaves like a tuple). For a 1D tensor with 3 elements: torch.Size([3]).

EXECUTION STATE
📚 .shape = Returns torch.Size — a tuple-like object giving dimensions. Identical to NumPy’s .shape but returns torch.Size instead of tuple.
x_pt.shape = torch.Size([3]) — 1D tensor with 3 elements
print(x_pt.dtype)

The data type. PyTorch defaults to float32 (unlike NumPy’s float64). This is intentional: 32-bit arithmetic is faster on GPUs and sufficient for neural network training.

EXECUTION STATE
📚 .dtype = The data type of every element. Common PyTorch types: torch.float32 (default for floats), torch.float64, torch.int64, torch.bool.
x_pt.dtype = torch.float32 — 32-bit float. Each element uses 4 bytes (vs 8 bytes for float64). A 1000×1000 matrix = 4MB instead of 8MB.
print(x_pt.device)

The device property is what separates PyTorch from NumPy. It tells you WHERE the tensor lives: CPU memory or GPU memory. Moving tensors to GPU (x_pt.to('cuda')) enables massive parallelism.

EXECUTION STATE
📚 .device = Where the tensor’s data is stored. ‘cpu’ = system RAM. ‘cuda:0’ = first GPU. You can move between devices with .to('cuda') or .to('cpu').
x_pt.device = cpu — the tensor is in system memory. To move to GPU: x_pt.to('cuda'). All tensors in an operation must be on the same device.
# 2D tensor (matrix)

Just like NumPy, a list of lists creates a 2D tensor (a matrix). Same data from Section 1 — 4 people with 3 features each.

X = torch.tensor([[170.0, 65.0, 25.0], ...])

Creates a 2D tensor (matrix) from nested lists. Identical syntax to np.array with nested lists. Each inner list becomes a row. PyTorch verifies all rows have equal length.

EXECUTION STATE
⬇ arg: nested lists = 4 inner lists, each with 3 values — same data from Section 1: height, weight, age for 4 people.
⬆ result: X (4×3) =
  height  weight  age
  170.0    65.0  25.0
  160.0    55.0  30.0
  180.0    80.0  22.0
  155.0    50.0  35.0
print(X.shape)

torch.Size([4, 3]) — exactly the same shape interpretation as NumPy: 4 rows (data points) and 3 columns (features). In deep learning: (batch_size=4, input_features=3).

EXECUTION STATE
X.shape = torch.Size([4, 3]) — 4 data points, 3 features each. Same as NumPy’s (4, 3).
import numpy as np
import torch

# NumPy: what you already know
x_np = np.array([1.0, 2.0, 3.0])
print(x_np)           # [1. 2. 3.]
print(type(x_np))     # <class 'numpy.ndarray'>

# PyTorch: almost identical syntax
x_pt = torch.tensor([1.0, 2.0, 3.0])
print(x_pt)           # tensor([1., 2., 3.])
print(type(x_pt))     # <class 'torch.Tensor'>

# Convert NumPy → PyTorch (shared memory!)
x_from_np = torch.from_numpy(x_np)
print(x_from_np)      # tensor([1., 2., 3.], dtype=torch.float64)

# Convert PyTorch → NumPy
x_back = x_pt.numpy()
print(x_back)         # [1. 2. 3.]

# Tensor properties (like NumPy, plus 'device')
print(x_pt.shape)     # torch.Size([3])
print(x_pt.dtype)     # torch.float32
print(x_pt.device)    # cpu

# 2D tensor (matrix): same as np.array with nested lists
X = torch.tensor([
    [170.0, 65.0, 25.0],
    [160.0, 55.0, 30.0],
    [180.0, 80.0, 22.0],
    [155.0, 50.0, 35.0],
])
print(X.shape)        # torch.Size([4, 3])

Notice the pattern: replace np with torch (and array with tensor), and you have working PyTorch code. The shapes, indexing, and operations all transfer directly. The three new properties to remember are:

| Property | NumPy | PyTorch | Why It Matters |
|---|---|---|---|
| Default float type | float64 (8 bytes) | float32 (4 bytes) | Half the memory; 32-bit math is typically much faster than 64-bit on GPUs |
| device | Always CPU | CPU or CUDA (GPU) | GPU parallelism enables training on large data |
| requires_grad | Not available | True/False | Enables automatic gradient computation |
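The three properties can be inspected directly. The following is a minimal sketch; the fallback to CPU when no GPU is present is an assumption about how you might want to handle devices, not something the section prescribes.

```python
import numpy as np
import torch

x_np = np.array([1.0, 2.0, 3.0])
x_pt = torch.tensor([1.0, 2.0, 3.0])

# 1. Default float type: 8 bytes per element in NumPy, 4 in PyTorch
print(x_np.dtype, x_np.itemsize)          # float64 8
print(x_pt.dtype, x_pt.element_size())    # torch.float32 4
x64 = x_pt.to(torch.float64)              # explicit cast when 64-bit is needed

# 2. device: pick the GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x_dev = x_pt.to(device)
print(x_dev.device)

# 3. requires_grad: off by default, switched on per tensor
w = torch.tensor([0.5, -0.5], requires_grad=True)
print(x_pt.requires_grad, w.requires_grad)  # False True
```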

Conversion Cheat Sheet

NumPy to PyTorch: torch.from_numpy(arr) (shared memory, zero copy). PyTorch to NumPy: tensor.numpy() (shared memory, CPU only). For GPU tensors: tensor.cpu().numpy(). For gradient tensors: tensor.detach().numpy().
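The shared-memory behaviour is easy to miss, so here is a short sketch of the cheat sheet in action. It uses torch.tensor(arr) as one way to get an independent copy; the array values are arbitrary.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])
shared = torch.from_numpy(arr)   # zero-copy: same underlying memory as arr
copied = torch.tensor(arr)       # torch.tensor copies the data

arr[0] = 99.0                    # mutate the NumPy array in place
print(shared)                    # the change is visible through the tensor
print(copied)                    # the copy is unaffected

# A tensor that tracks gradients must be detached before conversion
w = torch.tensor([1.0, 2.0], requires_grad=True)
try:
    w.numpy()
    numpy_failed = False
except RuntimeError:
    numpy_failed = True
w_np = w.detach().numpy()
print(numpy_failed, w_np.dtype)
```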

Operations You Already Know

Every operation from Section 1 — element-wise arithmetic, dot products, matrix multiplication — works identically in PyTorch. The operators are the same, the behavior is the same, and the results are the same. The only difference is that when tensors have requires_grad=True, PyTorch records each operation in a computation graph.
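To see that recording only happens when a tensor asks for it, the sketch below runs the same multiplication twice, once with plain tensors and once with requires_grad=True on one operand. The variable names are illustrative.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

plain = a * b
print(plain.requires_grad, plain.grad_fn)   # False None

a_tracked = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
tracked = a_tracked * b
print(tracked.requires_grad)                # True
print(tracked.grad_fn)                      # a backward node for the multiply

# The numerical values are the same either way
same_values = torch.equal(plain, tracked.detach())
print(same_values)                          # True
```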

PyTorch Operations: Familiar Territory
🐍pytorch_operations.py
import torch

Import PyTorch. For pure tensor operations, this is all you need — no submodules required.

EXECUTION STATE
📚 torch = Core PyTorch module. Tensor creation and operations are all accessed from here: torch.tensor(), torch.dot(), torch.matmul(), etc.
a = torch.tensor([1.0, 2.0, 3.0])

Creates a 1D tensor (vector) with 3 elements. This is our first operand for demonstrating element-wise operations.

EXECUTION STATE
⬆ result: a = tensor([1., 2., 3.])
b = torch.tensor([4.0, 5.0, 6.0])

Creates a second vector with the same shape. Both tensors must have the same shape (or be broadcastable) for element-wise operations.

EXECUTION STATE
⬆ result: b = tensor([4., 5., 6.])
# Element-wise addition

Element-wise operations work exactly like NumPy: corresponding elements are paired and the operator is applied to each pair independently.

c = a + b

Adds corresponding elements: a[0]+b[0], a[1]+b[1], a[2]+b[2]. Identical to NumPy — same + operator, same behavior.

EXECUTION STATE
+ (addition) = Element-wise addition. Each position: c[i] = a[i] + b[i]. Requires same shape or broadcastable shapes.
→ calculation = c[0] = 1+4 = 5, c[1] = 2+5 = 7, c[2] = 3+6 = 9
⬆ result: c = tensor([5., 7., 9.])
print(c)

Displays the element-wise sum. Same result as np.array([1,2,3]) + np.array([4,5,6]).

EXECUTION STATE
output = tensor([5., 7., 9.])
# Element-wise multiplication

Also called the Hadamard product. NOT the dot product — this multiplies element-by-element, producing a vector of the same shape.

d = a * b

Multiplies corresponding elements: a[0]*b[0], a[1]*b[1], a[2]*b[2]. The result is a vector, not a scalar. In NumPy, this is the same as np.multiply(a, b).

EXECUTION STATE
* (multiplication) = Element-wise multiplication (Hadamard product). Each position: d[i] = a[i] * b[i]. NOT the dot product — result is a vector, not a scalar.
→ calculation = d[0] = 1×4 = 4, d[1] = 2×5 = 10, d[2] = 3×6 = 18
⬆ result: d = tensor([ 4., 10., 18.])
print(d)

The element-wise product. To get the dot product, you would sum these: 4 + 10 + 18 = 32.

EXECUTION STATE
output = tensor([ 4., 10., 18.])
# Dot product

The dot product is the sum of element-wise products — the core operation of a neuron. In NumPy we used np.dot() or @. In PyTorch, we use torch.dot() for vectors.

dot = torch.dot(a, b)

Computes the dot product: sum of element-wise products. For 1D vectors only — for matrices, use torch.matmul() or @. This is the single most important operation in neural networks: it’s how a neuron computes its weighted sum.

EXECUTION STATE
📚 torch.dot() = Computes the dot product of two 1D tensors: sum(a[i] * b[i]). Only works for 1D tensors — for 2D+ use torch.matmul() or @. Returns a scalar tensor (0-dimensional).
⬇ arg 1: a = tensor([1., 2., 3.])
⬇ arg 2: b = tensor([4., 5., 6.])
→ calculation = 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32
⬆ result: dot = tensor(32.) — a 0-dimensional (scalar) tensor
print(dot)

The dot product is a scalar (single number). PyTorch represents it as a 0-dimensional tensor: tensor(32.).

EXECUTION STATE
output = tensor(32.)
# Matrix multiplication with @

The @ operator works in PyTorch exactly as it does in NumPy. For 2D tensors, it performs standard matrix multiplication. This is the operation that lets a neural network process an entire batch through a layer in one shot.

X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

Creates a 2×2 matrix. In a neural network context, X could represent 2 data points, each with 2 features.

EXECUTION STATE
⬆ result: X (2×2) =
[[1.0, 2.0],
 [3.0, 4.0]]
W = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

A second 2×2 matrix. In a neural network, W would be the weight matrix of a layer.

EXECUTION STATE
⬆ result: W (2×2) =
[[5.0, 6.0],
 [7.0, 8.0]]
result = X @ W

Matrix multiplication using the @ operator. Each element result[i][j] is the dot product of row i of X with column j of W. Identical to np.matmul(X, W) or X @ W in NumPy.

EXECUTION STATE
@ (matmul operator) = Matrix multiplication. For 2D tensors: result[i][j] = sum(X[i][k] * W[k][j]) for all k. Requires inner dimensions to match: X(m×n) @ W(n×p) = result(m×p).
→ result[0][0] = X[0]·W[:,0] = 1×5 + 2×7 = 5 + 14 = 19
→ result[0][1] = X[0]·W[:,1] = 1×6 + 2×8 = 6 + 16 = 22
→ result[1][0] = X[1]·W[:,0] = 3×5 + 4×7 = 15 + 28 = 43
→ result[1][1] = X[1]·W[:,1] = 3×6 + 4×8 = 18 + 32 = 50
⬆ result: X @ W (2×2) =
[[19., 22.],
 [43., 50.]]
print(result)

The 2×2 result matrix. Every operation you knew in NumPy translates directly to PyTorch with the same operator.

EXECUTION STATE
output =
tensor([[19., 22.],
        [43., 50.]])
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# Element-wise addition (identical to NumPy)
c = a + b
print(c)             # tensor([5., 7., 9.])

# Element-wise multiplication
d = a * b
print(d)             # tensor([ 4., 10., 18.])

# Dot product: sum of element-wise products
dot = torch.dot(a, b)
print(dot)           # tensor(32.)

# Matrix multiplication with @
X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
W = torch.tensor([[5.0, 6.0],
                  [7.0, 8.0]])
result = X @ W
print(result)        # tensor([[19., 22.],
                     #         [43., 50.]])

Here is the translation table for the most common operations:

| Operation | NumPy | PyTorch | Result Shape |
|---|---|---|---|
| Element-wise add | a + b | a + b | Same as inputs |
| Element-wise multiply | a * b | a * b | Same as inputs |
| Dot product (1D) | np.dot(a, b) | torch.dot(a, b) | Scalar |
| Matrix multiply | X @ W or np.matmul(X, W) | X @ W or torch.matmul(X, W) | (m×n) @ (n×p) = (m×p) |
| Transpose | W.T | W.T or W.mT | Rows ↔ Columns |
| ReLU | np.maximum(0, x) | torch.relu(x) | Same as input |
| Sum | np.sum(x, axis=0) | torch.sum(x, dim=0) | Reduced along dim |

Key Insight: If your NumPy code runs correctly, the same logic in PyTorch will produce the same numbers, up to the small rounding differences between float64 and PyTorch's default float32. PyTorch was designed so that you focus on the math, not on learning a new API.
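The last three rows of the table (transpose, ReLU, sum) do not appear in the listing above, so here is a hedged side-by-side check. The matrix values are made up for the example, and torch.from_numpy is used so both sides work on float64 data.

```python
import numpy as np
import torch

W_np = np.array([[1.0, -2.0, 3.0],
                 [-4.0, 5.0, -6.0]])
W_pt = torch.from_numpy(W_np)        # same data, dtype float64

# Transpose: (2, 3) becomes (3, 2) in both libraries
print(W_np.T.shape, tuple(W_pt.T.shape))

# ReLU: negatives are clamped to zero
relu_np = np.maximum(0, W_np)
relu_pt = torch.relu(W_pt)

# Sum down the rows: NumPy says axis=0, PyTorch says dim=0
sum_np = np.sum(W_np, axis=0)
sum_pt = torch.sum(W_pt, dim=0)
print(sum_np, sum_pt)

print(np.allclose(relu_np, relu_pt.numpy()))  # True
print(np.allclose(sum_np, sum_pt.numpy()))    # True
```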

The Key Difference: Computational Graphs

So far, PyTorch looks like a NumPy clone. But there is one feature that changes everything: the computational graph. When you set requires_grad=True on a tensor, PyTorch starts building an invisible data structure that records every operation.

Think of it like this: NumPy is a calculator — you give it numbers, it gives you answers. PyTorch is a calculator with a tape recorder — it gives you the same answers, but it also records how it computed them. When you later ask "how would the answer change if I tweaked this input?", PyTorch can rewind the tape and tell you, automatically.

How the Graph Works

Consider a simple function: $y = x^2 + 3x + 1$. When you compute this in PyTorch with requires_grad=True, here is what happens behind the scenes:

  1. x is created as a leaf node in the graph (an input with no parent operations)
  2. x**2 is computed: PyTorch creates a PowBackward0 node that remembers "I squared x"
  3. 3*x is computed: a MulBackward0 node remembers "I multiplied x by 3"
  4. The sum and final addition create AddBackward0 nodes
  5. y gets a grad_fn attribute pointing to the last node, the entry point for traversing the graph backward
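The steps above can be inspected from Python. The sketch below walks the graph through grad_fn and next_functions; node_names is a hypothetical helper written for this example, and the exact node names may vary between PyTorch versions.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x + 1

# x is a leaf: it was created directly, not by an operation
print(x.is_leaf, x.grad_fn)      # True None

def node_names(node):
    """Collect backward-node names by following next_functions recursively."""
    if node is None:             # constants such as the "+ 1" have no node
        return []
    names = [node.name()]
    for parent, _ in node.next_functions:
        names.extend(node_names(parent))
    return names

names = node_names(y.grad_fn)
print(names)                     # starts at the final addition, ends at x
```

The first name in the list corresponds to y.grad_fn, the entry point for the backward pass.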


Each node in the backward pass computes a local derivative and multiplies it by the incoming gradient; this is the chain rule in action. The gradient at $x$ is $\frac{\partial y}{\partial x} = 2x + 3$, computed automatically by summing the contributions from both paths: $\frac{\partial(x^2)}{\partial x} = 2x$ and $\frac{\partial(3x)}{\partial x} = 3$.
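One way to sanity-check these formulas without any library is a central finite difference. This is a plain-Python sketch; central_difference and the step size h are choices made for this example, not part of PyTorch.

```python
def f(x):
    return x**2 + 3*x + 1

def central_difference(fn, x, h=1e-6):
    """Approximate the derivative of fn at x from two nearby evaluations."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

x0 = 2.0
numeric = central_difference(f, x0)
analytic = 2 * x0 + 3            # the formula derived above

# The two paths contribute separately and add up
path_square = central_difference(lambda x: x**2, x0)   # close to 2x = 4
path_linear = central_difference(lambda x: 3 * x, x0)  # close to 3

print(numeric, analytic)
print(path_square + path_linear)
```

The numerical estimates agree with the analytic values only up to floating-point error, which is why autograd's exact local derivatives are preferable in practice.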

Forward vs Backward

Forward pass: values flow from inputs to output (left to right). Each operation produces a new value. This is the computation you would do in NumPy. Backward pass: gradients flow from output to inputs (right to left). Each node multiplies the incoming gradient by its local derivative (chain rule). This is what PyTorch adds.
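To make the two passes tangible, here is a deliberately tiny reverse-mode sketch in plain Python. The Var class is hypothetical and written only for this illustration; it is not how PyTorch is implemented (PyTorch processes nodes in an ordered traversal rather than recursing path by path), but it applies the same chain-rule idea.

```python
class Var:
    """A scalar that remembers how it was computed (teaching sketch only)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents    # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(a+b)/da = 1 and d(a+b)/db = 1
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(a*b)/da = b and d(a*b)/db = a
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__

    def backward(self, upstream=1.0):
        # Accumulate the incoming gradient, then pass it to each parent
        # multiplied by the local derivative (chain rule).
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)


x = Var(2.0)
y = x * x + 3 * x + 1    # forward pass: values flow left to right
y.backward()             # backward pass: gradients flow right to left

print(y.value)  # 11.0
print(x.grad)   # 7.0
```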

Your First Gradient

Let's make it concrete. We will compute $y = x^2 + 3x + 1$ at $x = 2$, then ask PyTorch for the derivative. The analytical answer is $\frac{dy}{dx} = 2x + 3 = 7$. Let's see if PyTorch agrees:

Your First Automatic Gradient
🐍first_gradient.py
import torch

Standard PyTorch import. The autograd system (automatic differentiation) is built into the core module — no extra imports needed.

EXECUTION STATE
📚 torch = Core module. torch.autograd is included — gradient computation is always available.
# Track gradients

This is THE key moment. With requires_grad=True, PyTorch starts watching every operation on this tensor. It silently builds a graph that records how outputs are computed from inputs — so it can later compute derivatives automatically.

x = torch.tensor(2.0, requires_grad=True)

Creates a scalar tensor with value 2.0 and tells PyTorch: ‘I want gradients with respect to this tensor.’ From this point forward, every operation involving x will be recorded in a computational graph. This is the feature that makes PyTorch fundamentally different from NumPy.

EXECUTION STATE
📚 torch.tensor(data, requires_grad) = Creates a tensor. When requires_grad=True, PyTorch records all operations on this tensor into a computational graph, enabling automatic gradient computation via .backward().
⬇ arg 1: 2.0 = The value of x. A Python float becomes a 0-dimensional (scalar) tensor.
⬇ arg 2: requires_grad=True = The magic switch. When True, PyTorch attaches a gradient accumulator to this tensor and starts recording operations. When False (default), no graph is built — behaves like NumPy. Only float/complex tensors can require grad.
⬆ result: x = tensor(2., requires_grad=True)
→ what changed? = x now has a .grad attribute (initially None) and every operation on x creates a node in the computation graph.
print(x)

Notice requires_grad=True is shown in the output. PyTorch always tells you which tensors are tracking gradients.

EXECUTION STATE
output = tensor(2., requires_grad=True)
# Forward pass

The ‘forward pass’ is simply computing the output from the input. In neural networks, this means running data through the network to get predictions. Here, we compute a simple polynomial.

y = x**2 + 3*x + 1

Computes y = x² + 3x + 1 = 4 + 6 + 1 = 11. But something invisible is happening: PyTorch is building a computation graph behind the scenes. Each operation (**, *, +) creates a node in the graph that remembers what operation was performed and on which tensors.

EXECUTION STATE
x**2 = 2.0² = 4.0 — creates a PowBackward0 node in the graph
3*x = 3 × 2.0 = 6.0 — creates a MulBackward0 node in the graph
x**2 + 3*x = 4.0 + 6.0 = 10.0 — creates an AddBackward0 node
... + 1 = 10.0 + 1 = 11.0 — creates another AddBackward0 node
⬆ result: y = tensor(11., grad_fn=<AddBackward0>) — the output tensor with its grad_fn pointing to the last operation in the graph
→ grad_fn = Every tensor created by an operation on a requires_grad tensor gets a grad_fn attribute. This is the link back into the computation graph. y.grad_fn points to AddBackward0, which knows it came from adding two things.
print(y)

Notice grad_fn=<AddBackward0>. This is proof that PyTorch recorded the computation. The last operation was addition (+1), so the grad_fn is AddBackward0. You can trace the entire graph by following .grad_fn.next_functions recursively.

EXECUTION STATE
output = tensor(11., grad_fn=<AddBackward0>)
# Backward pass

Now the magic. With a single call to .backward(), PyTorch walks the computation graph in reverse, applying the chain rule at each node to compute the derivative dy/dx. This is what would take pages of calculus to do by hand.

y.backward()

Triggers automatic differentiation. PyTorch traverses the computation graph from y back to x, computing the gradient dy/dx using the chain rule. The gradient is accumulated into x.grad. After .backward(), the graph is destroyed (to free memory) unless you pass retain_graph=True.

EXECUTION STATE
📚 .backward() = Tensor method: computes gradients of this tensor with respect to all leaf tensors (those with requires_grad=True) in the computation graph. Uses reverse-mode automatic differentiation (backpropagation). Called without arguments, it works only on scalar tensors; for a non-scalar output you must pass a gradient argument of the same shape.
→ what happens inside =
  1. Start at y with gradient 1.0 (dy/dy = 1)
  2. Walk back through AddBackward0: gradient passes through unchanged
  3. Walk back through PowBackward0: multiply by 2x = 4.0
  4. Walk back through MulBackward0: multiply by 3
  5. Sum at x: 4.0 + 3.0 = 7.0
  6. Store 7.0 in x.grad
→ graph destroyed = After .backward(), the computation graph is freed. Calling y.backward() again would error unless retain_graph=True was passed.
# Gradient stored in x.grad

After .backward(), every leaf tensor with requires_grad=True that contributed to the output has its .grad attribute populated with the computed gradient.

print(x.grad)

The gradient dy/dx = 7.0. This is the answer to: ‘if I increase x by a tiny amount, how much does y change?’ At x=2.0, y changes 7 times as fast as x. This gradient tells us the direction and magnitude for updating x to minimize (or maximize) y.

EXECUTION STATE
📚 .grad = Attribute on leaf tensors (those created directly, not by operations). After .backward(), contains the accumulated gradient. Initially None before any backward pass.
x.grad = tensor(7.) — the derivative dy/dx evaluated at x=2.0
→ analytical verification = y = x² + 3x + 1, so dy/dx = 2x + 3. At x=2: dy/dx = 2(2) + 3 = 7 ✔️
# Verify dy/dx = 2x + 3 = 7

We can verify by hand: the derivative of x² is 2x, the derivative of 3x is 3, and the derivative of the constant 1 is 0. So dy/dx = 2x + 3 = 2(2) + 3 = 7. PyTorch got it exactly right, automatically.

import torch

# Create a tensor that TRACKS GRADIENTS
x = torch.tensor(2.0, requires_grad=True)
print(x)             # tensor(2., requires_grad=True)

# Forward pass: compute y = x² + 3x + 1
y = x**2 + 3*x + 1
print(y)             # tensor(11., grad_fn=<AddBackward0>)

# Backward pass: compute dy/dx automatically!
y.backward()

# The gradient is stored in x.grad
print(x.grad)        # tensor(7.)
# Verify: dy/dx = 2x + 3 = 2(2) + 3 = 7 ✓

Three lines of code. That is all it takes to compute a gradient that would otherwise require applying the chain rule and careful bookkeeping by hand. And this scales: the same three lines work for a function with a million parameters.
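As a rough check of the scaling claim, the sketch below applies the same three steps to a tensor with one million elements. The loss function (a sum of squares) is chosen only because its gradient, 2w, is easy to verify.

```python
import torch

torch.manual_seed(0)                             # reproducible random values

w = torch.randn(1_000_000, requires_grad=True)   # one million parameters
loss = (w ** 2).sum()                            # forward pass: a single scalar
loss.backward()                                  # backward pass

print(w.grad.shape)                              # one gradient per parameter
matches = torch.allclose(w.grad, 2 * w.detach())
print(matches)                                   # True: d(sum w^2)/dw = 2w
```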

The pattern is always the same:

  1. Create tensors with requires_grad=True
  2. Compute the output (forward pass)
  3. Call output.backward()
  4. Read the gradient from input.grad

Why This Matters: In a neural network, the "input" is the weights, the "computation" is the forward pass, and the "output" is the loss. By calling loss.backward(), PyTorch computes the gradient of the loss with respect to every weight, telling us exactly how to adjust each weight to reduce the error.
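As a hedged preview of where this leads, the sketch below uses the gradients for a single manual update step. The data, the mean-squared-error loss, the learning rate lr, and the helper mse are all assumptions made for this illustration; later sections may organize this differently.

```python
import torch

# Made-up data: 3 samples with 2 features, one target value per sample
X = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_true = torch.tensor([[1.0], [2.0], [3.0]])

W = torch.zeros(2, 1, requires_grad=True)   # learnable weights
b = torch.zeros(1, requires_grad=True)      # learnable bias

def mse():
    """Mean squared error of the linear model X @ W + b (illustrative helper)."""
    return ((X @ W + b - y_true) ** 2).mean()

loss_before = mse()
loss_before.backward()                      # fills W.grad and b.grad

lr = 0.01                                   # assumed learning rate
with torch.no_grad():                       # the update itself is not recorded
    W -= lr * W.grad
    b -= lr * b.grad
W.grad.zero_()                              # gradients accumulate, so reset them
b.grad.zero_()

loss_after = mse()
print(loss_before.item(), loss_after.item())
```

Moving each parameter a small step against its gradient lowers the loss here, which is the sense in which the gradient tells us how to adjust each weight.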

The Neuron Revisited: NumPy vs PyTorch

Now let's revisit the neuron from Section 1. We will write the exact same forward pass in both NumPy and PyTorch, side by side. The numbers will be identical. The only difference: the PyTorch version records a computational graph, so we can later compute gradients.

Same Neuron, Same Numbers, But Now With Gradient Tracking
🐍neuron_comparison.py
import numpy as np

Importing both libraries for the side-by-side comparison. This is the moment where we see the same neuron in NumPy and PyTorch.

EXECUTION STATE
📚 numpy = For the NumPy version of the neuron (no gradient tracking)
import torch

PyTorch for the enhanced version with gradient tracking.

EXECUTION STATE
📚 torch = For the PyTorch version (with gradient tracking)
# NumPy version from Section 1

First, we reproduce the exact neuron from Section 1 in NumPy. This establishes the baseline — the same computation, but without gradients.

X_np = np.array([[170.0, 65.0, 25.0], ...])

Input data: 4 people with 3 features each (height, weight, age). Same data from Section 1.

EXECUTION STATE
⬆ result: X_np (4×3) =
  height  weight  age
  170.0    65.0  25.0
  160.0    55.0  30.0
  180.0    80.0  22.0
  155.0    50.0  35.0
W_np = np.array([[0.01, -0.02, 0.03], [0.02, 0.01, -0.01]])

Weight matrix for 2 neurons, each with 3 inputs. Row 0 is neuron 1’s weights; row 1 is neuron 2’s weights.

EXECUTION STATE
⬆ result: W_np (2×3) =
          height  weight    age
neuron1   0.01   -0.02    0.03
neuron2   0.02    0.01   -0.01
b_np = np.array([0.0, 0.0])

Bias vector — one bias per neuron. Both set to 0.0 initially.

EXECUTION STATE
⬆ result: b_np = [0.0, 0.0]
Z_np = X_np @ W_np.T + b_np

The forward pass in NumPy: multiply input by transposed weights, add bias. X(4×3) @ W.T(3×2) + b(2) = Z(4×2). Each row of Z is one data point’s pre-activation output for both neurons.

EXECUTION STATE
@ W_np.T = W_np.T transposes (2×3) to (3×2). Matrix multiply X(4×3) @ W.T(3×2) = Z(4×2)
⬆ result: Z_np (4×2) =
          neuron1  neuron2
person0    1.15     3.80
person1    1.40     3.45
person2    0.86     4.18
person3    1.60     3.25
A_np = np.maximum(0, Z_np)

ReLU activation: replace negative values with 0, keep positive values unchanged. Since all values in Z_np are positive, A_np equals Z_np here.

EXECUTION STATE
📚 np.maximum(0, Z) = Element-wise maximum between 0 and each element. This IS the ReLU activation function: max(0, z).
⬆ result: A_np (4×2) =
          neuron1  neuron2
person0    1.15     3.80
person1    1.40     3.45
person2    0.86     4.18
person3    1.60     3.25
print("NumPy output:")

Label for the NumPy output to distinguish from the PyTorch output below.

print(A_np)

The NumPy result. Hold this in mind — the PyTorch version will produce the exact same numbers.

EXECUTION STATE
output =
[[1.15 3.8 ]
 [1.4  3.45]
 [0.86 4.18]
 [1.6  3.25]]
# PyTorch version

Now the same computation in PyTorch. The code is nearly identical — the only difference is requires_grad=True on the parameters we want to optimize.

X = torch.tensor([[170.0, 65.0, 25.0], ...])

Same input data, now as a PyTorch tensor. No requires_grad because we don’t need gradients with respect to the input data — we want to optimize the weights, not the data.

EXECUTION STATE
⬆ result: X (4×3) = Same data as X_np. No requires_grad — data is fixed, not learnable.
# The KEY difference: requires_grad=True!

This is the entire reason we’re using PyTorch instead of NumPy. With this single flag, PyTorch will automatically compute how changing W and b affects the output — the gradients we need for learning.

W = torch.tensor([[0.01, -0.02, 0.03], ...], requires_grad=True)

Same weight values as NumPy, but now PyTorch is tracking them. Every operation involving W will be recorded. After .backward(), W.grad will tell us exactly how to adjust each weight to reduce the error.

EXECUTION STATE
⬇ requires_grad=True = W is a LEARNABLE parameter. PyTorch will record all operations involving W and compute ∂loss/∂W when we call .backward().
⬆ result: W (2×3) =
tensor([[ 0.01, -0.02,  0.03],
        [ 0.02,  0.01, -0.01]], requires_grad=True)
b = torch.tensor([0.0, 0.0], requires_grad=True)

Bias is also learnable. Both W and b are ‘leaf tensors’ with requires_grad=True — they’re the parameters the network will optimize.

EXECUTION STATE
⬆ result: b = tensor([0., 0.], requires_grad=True)
Z = X @ W.T + b

Identical computation to the NumPy version. But now PyTorch is silently building a computation graph: it records that Z came from a matrix multiply of X and W.T, plus an addition of b. The numerical result is the same.

EXECUTION STATE
X @ W.T = Matrix multiply: X(4×3) @ W.T(3×2) = (4×2). Same arithmetic as NumPy.
+ b = Broadcasting: adds b(2) to each row of the (4×2) result
⬆ result: Z (4×2) =
          neuron1  neuron2
person0    1.15     3.80
person1    1.40     3.45
person2    0.86     4.18
person3    1.60     3.25
→ Z.grad_fn = AddBackward0 — the graph recorded: Z = (matmul result) + b
31A = torch.relu(Z)

ReLU activation using PyTorch’s built-in function. Equivalent to np.maximum(0, Z) but is also recorded in the computation graph. PyTorch knows how to differentiate through ReLU: gradient is 1 where Z > 0, and 0 where Z ≤ 0.

EXECUTION STATE
📚 torch.relu() = Applies ReLU (Rectified Linear Unit) element-wise: relu(z) = max(0, z). Returns 0 for negative inputs, passes positive inputs through unchanged. Recorded in the computation graph with grad_fn=ReluBackward0.
⬇ arg: Z (4×2) = Pre-activation values. All positive here, so ReLU doesn’t change them.
⬆ result: A (4×2) =
          neuron1  neuron2
person0    1.15     3.80
person1    1.40     3.45
person2    0.86     4.18
person3    1.60     3.25
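The ReLU derivative rule stated above (1 where the input is positive, 0 otherwise) can be checked directly with autograd. This is a minimal sketch on a made-up vector `z` that includes negative, zero, and positive entries; the values are illustrative and not taken from the example data.

```python
import torch

# Hypothetical inputs covering negative, zero, and positive values
z = torch.tensor([-2.0, 0.0, 0.5, 3.0], requires_grad=True)
a = torch.relu(z)

# backward() needs a scalar, so sum the activations first
a.sum().backward()

# Gradient is 0 where z <= 0 and 1 where z > 0
print(z.grad)  # tensor([0., 0., 1., 1.])
```

In the layer above every entry of Z is positive, so the gradient passes through ReLU unchanged for all eight values.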
32print("\nPyTorch output:")

Label for the PyTorch output.

33print(A)

Same numbers as the NumPy version! The math is identical. The difference is that PyTorch recorded the entire computation, so it can now compute gradients.

EXECUTION STATE
output =
tensor([[1.1500, 3.8000],
        [1.4000, 3.4500],
        [0.8600, 4.1800],
        [1.6000, 3.2500]], grad_fn=<ReluBackward0>)
34print("\ngrad_fn chain:")

Let’s peek at the computation graph PyTorch built.

35print(f" A.grad_fn = {A.grad_fn}")

A.grad_fn points to ReluBackward0 — the last operation that created A. You can follow .next_functions to walk back through the entire graph: Relu → Add → Matmul → W, b.

EXECUTION STATE
output = A.grad_fn = <ReluBackward0 object>
→ the chain = A ←(relu)← Z ←(add)← (X @ W.T) + b ←(matmul)← X, W.T
36print(f" Z.grad_fn = {Z.grad_fn}")

Z.grad_fn points to AddBackward0 — because Z was created by adding the matmul result and the bias b. Each tensor remembers exactly how it was born.

EXECUTION STATE
output = Z.grad_fn = <AddBackward0 object>
import numpy as np
import torch

# ── NumPy version (from Section 1) ──────────────
X_np = np.array([[170.0, 65.0, 25.0],
                 [160.0, 55.0, 30.0],
                 [180.0, 80.0, 22.0],
                 [155.0, 50.0, 35.0]])
W_np = np.array([[0.01, -0.02, 0.03],
                 [0.02,  0.01, -0.01]])
b_np = np.array([0.0, 0.0])

Z_np = X_np @ W_np.T + b_np
A_np = np.maximum(0, Z_np)      # ReLU
print("NumPy output:")
print(A_np)

# ── PyTorch version ─────────────────────────────
X = torch.tensor([[170.0, 65.0, 25.0],
                  [160.0, 55.0, 30.0],
                  [180.0, 80.0, 22.0],
                  [155.0, 50.0, 35.0]])

# The KEY difference: requires_grad=True!
W = torch.tensor([[0.01, -0.02, 0.03],
                  [0.02,  0.01, -0.01]],
                 requires_grad=True)
b = torch.tensor([0.0, 0.0], requires_grad=True)

Z = X @ W.T + b
A = torch.relu(Z)
print("\nPyTorch output:")
print(A)
print("\ngrad_fn chain:")
print(f"  A.grad_fn = {A.grad_fn}")
print(f"  Z.grad_fn = {Z.grad_fn}")

The outputs are identical:

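| Person | Neuron 1 (NumPy) | Neuron 1 (PyTorch) | Neuron 2 (NumPy) | Neuron 2 (PyTorch) |
| --- | --- | --- | --- | --- |
| Person 0 | 1.15 | 1.15 | 3.80 | 3.80 |
| Person 1 | 1.40 | 1.40 | 3.45 | 3.45 |
| Person 2 | 0.86 | 0.86 | 4.18 | 4.18 |
| Person 3 | 1.60 | 1.60 | 3.25 | 3.25 |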
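The match can also be checked programmatically rather than by eye. The sketch below rebuilds both versions and compares them with a tolerance, since `torch.tensor` stores Python floats as float32 while NumPy defaults to float64. The variable names `A_from_torch` and `outputs_match` and the tolerance `atol=1e-5` are choices made for this illustration.

```python
import numpy as np
import torch

X_np = np.array([[170.0, 65.0, 25.0],
                 [160.0, 55.0, 30.0],
                 [180.0, 80.0, 22.0],
                 [155.0, 50.0, 35.0]])
W_np = np.array([[0.01, -0.02, 0.03],
                 [0.02,  0.01, -0.01]])
b_np = np.array([0.0, 0.0])
A_np = np.maximum(0, X_np @ W_np.T + b_np)

# float32 matches what torch.tensor() produces from Python floats
X = torch.tensor(X_np, dtype=torch.float32)
W = torch.tensor(W_np, dtype=torch.float32, requires_grad=True)
b = torch.tensor(b_np, dtype=torch.float32, requires_grad=True)
A = torch.relu(X @ W.T + b)

# .detach() drops the graph so the result can be converted to NumPy
A_from_torch = A.detach().numpy()
print(A_np.dtype, A_from_torch.dtype)  # float64 float32

# Compare with a tolerance because the two precisions differ
outputs_match = np.allclose(A_np, A_from_torch, atol=1e-5)
print(outputs_match)
```

Exact equality (`==`) would be the wrong test here: the two libraries agree on the math, but the last few digits differ between float32 and float64.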

Same math, same results. But the PyTorch version has something NumPy cannot provide: A.grad_fn = ReluBackward0. That grad_fn is a thread we can pull to unravel the entire computation, from the output back to the weights, and compute gradients automatically.
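To see what pulling that thread looks like, the sketch below follows `.next_functions` from `A.grad_fn` back to the leaf tensors. The helper `walk` is a hypothetical name written for this illustration, and the exact node class names printed for the inner operations can differ between PyTorch versions.

```python
import torch

X = torch.tensor([[170.0, 65.0, 25.0],
                  [160.0, 55.0, 30.0]])
W = torch.tensor([[0.01, -0.02, 0.03],
                  [0.02,  0.01, -0.01]], requires_grad=True)
b = torch.tensor([0.0, 0.0], requires_grad=True)
A = torch.relu(X @ W.T + b)

def walk(node, depth=0, names=None):
    """Print the graph below `node` and return the visited class names."""
    if names is None:
        names = []
    if node is None:
        # Inputs that need no gradient (such as X) show up as None
        return names
    name = type(node).__name__
    names.append(name)
    print("  " * depth + name)
    for parent, _ in node.next_functions:
        walk(parent, depth + 1, names)
    return names

visited = walk(A.grad_fn)
```

The leaves of the printed tree are `AccumulateGrad` nodes, one per tensor created with `requires_grad=True`; these are the places where `.backward()` deposits the values that end up in `W.grad` and `b.grad`.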


One Step of Learning

We now have all the pieces. Let's put them together into a complete training step: forward pass, loss computation, backward pass, and weight update. This is the fundamental loop of all neural network training — everything else is optimization and scaling.

The Four-Step Training Loop

Every training iteration follows the same pattern:

| Step | What Happens | PyTorch Code |
| --- | --- | --- |
| 1. Forward | Compute predictions from inputs | y_pred = model(X) |
| 2. Loss | Measure how wrong the predictions are | loss = loss_fn(y_pred, y_true) |
| 3. Backward | Compute gradients automatically | loss.backward() |
| 4. Update | Adjust weights to reduce loss | W -= lr * W.grad |

Let's implement all four steps for our single neuron:

The Complete Training Step: Forward, Loss, Backward, Update
🐍one_step_of_learning.py
1import torch

Core PyTorch for tensors and autograd.

EXECUTION STATE
📚 torch = Core module: tensors, autograd, basic operations
2import torch.nn.functional as F

PyTorch’s functional module provides loss functions, activations, and other neural network operations as plain functions (not as class instances). We’ll use F.binary_cross_entropy for our loss.

EXECUTION STATE
📚 torch.nn.functional = Contains stateless versions of neural network operations: F.relu(), F.softmax(), F.binary_cross_entropy(), F.mse_loss(), etc. Aliased as F by convention.
as F = Universal alias — you’ll see ‘import torch.nn.functional as F’ in virtually every PyTorch codebase.
4Comment: Input data and true labels

We now have a supervised learning setup: input features X and the correct answers y_true. The goal: train the network to predict y_true from X.

5X = torch.tensor([[170.0, 65.0, 25.0], ...])

Same 4-person dataset. No requires_grad — data is fixed.

EXECUTION STATE
⬆ result: X (4×3) =
  height  weight  age
  170.0    65.0  25.0
  160.0    55.0  30.0
  180.0    80.0  22.0
  155.0    50.0  35.0
9y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])

The ground-truth labels. This is a binary classification problem: persons 0 and 2 belong to class 1, persons 1 and 3 belong to class 0. The network needs to learn to predict these labels from the features.

EXECUTION STATE
⬆ result: y_true = [1.0, 0.0, 1.0, 0.0] — person 0: class 1, person 1: class 0, person 2: class 1, person 3: class 0
11Comment: Learnable parameters

W and b are the values the network will adjust to make better predictions. Here W starts as small hand-picked values and b starts at zero; both will be updated by gradient descent.

12W = torch.tensor([[0.01, -0.02, 0.03]], requires_grad=True)

A single output neuron: 1 row of 3 weights (one per input feature). requires_grad=True because this is what we’re optimizing.

EXECUTION STATE
⬆ result: W (1×3) =
[[0.01, -0.02, 0.03]] — w_height=0.01, w_weight=-0.02, w_age=0.03
→ requires_grad=True = Learnable parameter. After .backward(), W.grad will contain ∂loss/∂W.
14b = torch.tensor([0.0], requires_grad=True)

Bias for the single output neuron. Also learnable.

EXECUTION STATE
⬆ result: b = tensor([0.], requires_grad=True)
16Comment: Step 1 — Forward pass

The forward pass computes predictions from inputs. The pipeline: linear transformation → sigmoid activation → probability output.

17Z = X @ W.T + b

Linear transformation: each data point’s weighted sum plus bias. X(4×3) @ W.T(3×1) + b(1) = Z(4×1). Each element is one person’s raw score.

EXECUTION STATE
X @ W.T = Matrix multiply: X(4×3) @ W.T(3×1) = (4×1)
→ person 0 = 170×0.01 + 65×(-0.02) + 25×0.03 = 1.70 - 1.30 + 0.75 = 1.15
→ person 1 = 160×0.01 + 55×(-0.02) + 30×0.03 = 1.60 - 1.10 + 0.90 = 1.40
→ person 2 = 180×0.01 + 80×(-0.02) + 22×0.03 = 1.80 - 1.60 + 0.66 = 0.86
→ person 3 = 155×0.01 + 50×(-0.02) + 35×0.03 = 1.55 - 1.00 + 1.05 = 1.60
⬆ result: Z (4×1) =
[[1.15],
 [1.40],
 [0.86],
 [1.60]]
18y_pred = torch.sigmoid(Z.squeeze())

Two operations: (1) .squeeze() removes the size-1 dimension, converting Z from shape (4,1) to (4). (2) torch.sigmoid() converts raw scores to probabilities between 0 and 1. Values close to 1 mean ‘class 1’, close to 0 mean ‘class 0’.

EXECUTION STATE
📚 .squeeze() = Removes all dimensions of size 1. Shape (4,1) becomes (4). This is needed because y_true has shape (4), and BCE loss requires matching shapes.
📚 torch.sigmoid() = Applies the sigmoid function element-wise: σ(z) = 1/(1+e^(-z)). Maps any real number to (0, 1). Used for binary classification: output represents probability of class 1.
→ sigmoid(1.15) = 1/(1+e^(-1.15)) = 1/(1+0.3166) = 0.7595
→ sigmoid(1.40) = 1/(1+e^(-1.40)) = 1/(1+0.2466) = 0.8022
→ sigmoid(0.86) = 1/(1+e^(-0.86)) = 1/(1+0.4232) = 0.7027
→ sigmoid(1.60) = 1/(1+e^(-1.60)) = 1/(1+0.2019) = 0.8320
⬆ result: y_pred = [0.7595, 0.8022, 0.7027, 0.8320]
→ interpretation = The network thinks ALL 4 people are class 1 (all > 0.5). But persons 1 and 3 are actually class 0! The network is wrong — it needs to learn.
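The shape requirement behind `.squeeze()` can be seen by trying the loss with and without it. This is a small sketch using the Z values from the walkthrough. Using `squeeze(1)` instead of a bare `squeeze()` is a variation introduced here, not the form used in the listing: it removes only dimension 1, so a batch containing a single row would keep its batch dimension.

```python
import torch
import torch.nn.functional as F

Z = torch.tensor([[1.15], [1.40], [0.86], [1.60]])   # shape (4, 1)
y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])          # shape (4,)

print(Z.shape, Z.squeeze().shape)

# Without squeezing, prediction and target shapes differ and the loss rejects them
try:
    F.binary_cross_entropy(torch.sigmoid(Z), y_true)
    shape_error = None
except (ValueError, RuntimeError) as err:
    shape_error = err
print(type(shape_error).__name__)

# squeeze(1) drops only the size-1 column dimension
y_pred = torch.sigmoid(Z.squeeze(1))
loss = F.binary_cross_entropy(y_pred, y_true)
print(y_pred.shape, round(loss.item(), 4))
```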
19print(f"Predictions: {y_pred.data}")

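Printing .data instead of the tensor directly hides the grad_fn for cleaner output. Note that .data is a legacy attribute; y_pred.detach() returns the same values without the graph and is the form recommended in current PyTorch.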

EXECUTION STATE
output = Predictions: tensor([0.7595, 0.8022, 0.7027, 0.8320])
21Comment: Step 2 — Compute loss

The loss function measures how wrong the predictions are. A lower loss means better predictions. We need a single number so we can compute its gradient.

22loss = F.binary_cross_entropy(y_pred, y_true)

Binary Cross-Entropy (BCE) loss: the standard loss for binary classification. It heavily penalizes confident wrong predictions. For each sample: -[y·log(p) + (1-y)·log(1-p)], then averaged over all samples.

EXECUTION STATE
📚 F.binary_cross_entropy() = Computes BCE loss: -mean(y·log(p) + (1-y)·log(1-p)). When y=1 and p is low, loss is high (penalizes missing a positive). When y=0 and p is high, loss is high (penalizes false positive).
⬇ arg 1: y_pred = [0.7595, 0.8022, 0.7027, 0.8320] — predicted probabilities
⬇ arg 2: y_true = [1.0, 0.0, 1.0, 0.0] — ground truth labels
→ person 0 (y=1, p=0.7595) = -[1·log(0.7595) + 0·log(0.2405)] = -log(0.7595) ≈ 0.2751 (low loss: correct and fairly confident)
→ person 1 (y=0, p=0.8022) = -[0·log(0.8022) + 1·log(0.1978)] = -log(0.1978) ≈ 1.6204 (HIGH loss: wrong and confident!)
→ person 2 (y=1, p=0.7027) = -[1·log(0.7027) + 0·log(0.2973)] = -log(0.7027) ≈ 0.3529 (low loss: correct)
→ person 3 (y=0, p=0.8320) = -[0·log(0.8320) + 1·log(0.1680)] = -log(0.1680) ≈ 1.7839 (HIGH loss: wrong and confident!)
⬆ result: loss = tensor(1.0081), the average of [0.2751, 1.6204, 0.3529, 1.7839] (per-sample values computed from the unrounded probabilities, log is the natural logarithm)
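The per-sample arithmetic above can be reproduced by writing the BCE formula out by hand and comparing it with the built-in function. This is a minimal sketch; `per_sample`, `manual_loss`, and `builtin_loss` are names chosen for this illustration.

```python
import torch
import torch.nn.functional as F

# Raw scores from the forward pass, turned into probabilities
y_pred = torch.sigmoid(torch.tensor([1.15, 1.40, 0.86, 1.60]))
y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Per-sample loss: -[y*log(p) + (1-y)*log(1-p)], natural logarithm
per_sample = -(y_true * torch.log(y_pred)
               + (1 - y_true) * torch.log(1 - y_pred))
manual_loss = per_sample.mean()

# The built-in version averages over samples by default
builtin_loss = F.binary_cross_entropy(y_pred, y_true)

print(per_sample)
print(manual_loss.item(), builtin_loss.item())
```

The two wrongly classified people (indices 1 and 3) contribute most of the total, which is why the gradients that follow push mainly against their predictions.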
23print(f"Loss: {loss.item():.4f}")

.item() extracts the Python float from a scalar tensor. Useful for printing and logging.

EXECUTION STATE
📚 .item() = Extracts the scalar value from a 0-dimensional tensor as a Python float. Only works on tensors with exactly one element.
output = Loss: 1.0081
25Comment: Step 3 — Backward pass

One line. That’s all it takes. PyTorch walks the computation graph backward and computes ∂loss/∂W and ∂loss/∂b automatically using the chain rule.

26loss.backward()

Triggers backpropagation through the entire computation graph: loss → BCE → sigmoid → linear → W, b. After this call, W.grad and b.grad contain the gradients of the loss with respect to W and b.

EXECUTION STATE
📚 .backward() = Computes gradients via reverse-mode autodiff (backpropagation). Walks the graph from loss back to all leaf tensors with requires_grad=True, applying the chain rule at each step.
→ chain rule path = loss → BCE derivative → sigmoid derivative → linear derivative → W.grad, b.grad
27print(f"W.grad: {W.grad}")

The gradient of the loss with respect to each weight. Positive gradient means ‘increasing this weight increases the loss’ (so decrease it). Negative gradient means the opposite.

EXECUTION STATE
W.grad (1×3) =
tensor([[40.7270, 11.5755, 10.1581]])
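→ interpretation = W.grad[0, 0] = 40.73: the loss is most sensitive to the height weight, so decreasing it reduces the loss fastest. The size of this gradient partly reflects the large raw height values (155 to 180), not only a badly chosen weight. W.grad[0, 1] = 11.58 and W.grad[0, 2] = 10.16: the weight and age coefficients should decrease as well.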
28print(f"b.grad: {b.grad}")

The gradient of the loss with respect to the bias.

EXECUTION STATE
b.grad = tensor([0.2741])
→ interpretation = Positive gradient: the bias (0.0) should decrease slightly to reduce the loss.
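These gradient values can be cross-checked against calculus. For a sigmoid followed by mean-reduced BCE, the chain rule collapses to the standard result ∂loss/∂z_i = (p_i − y_i)/N, which gives ∂loss/∂W = (p − y)ᵀX / N and ∂loss/∂b = mean(p − y). The sketch below compares that closed form with what autograd produced; `residual`, `manual_W_grad`, and `manual_b_grad` are names introduced for this check.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[170.0, 65.0, 25.0],
                  [160.0, 55.0, 30.0],
                  [180.0, 80.0, 22.0],
                  [155.0, 50.0, 35.0]])
y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])
W = torch.tensor([[0.01, -0.02, 0.03]], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

# Autograd route: forward pass, loss, backward
y_pred = torch.sigmoid((X @ W.T + b).squeeze(1))
F.binary_cross_entropy(y_pred, y_true).backward()

# Hand-derived route, computed without building a graph
with torch.no_grad():
    residual = y_pred - y_true                             # (p - y), shape (4,)
    manual_W_grad = (residual @ X / len(X)).unsqueeze(0)   # shape (1, 3)
    manual_b_grad = residual.mean().unsqueeze(0)           # shape (1,)

print(W.grad, manual_W_grad)
print(b.grad, manual_b_grad)
```

The formula also explains why the height gradient is the largest: each residual is multiplied by the raw feature value, and heights are the biggest numbers in X.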
30Comment: Step 4 — Update weights

Now we use the gradients to take a step in the direction that reduces the loss. This is gradient descent: w_new = w_old - learning_rate × gradient.

31lr = 0.0001

The learning rate controls how big each update step is. Too large: overshoots the minimum. Too small: learns too slowly. 0.0001 is conservative but safe for un-normalized data.

EXECUTION STATE
lr = 0.0001 = A small step size. With gradients around 40, the actual weight change is 0.0001 × 40 = 0.004 — a small but meaningful nudge.
32with torch.no_grad():

Critical: the weight update is not part of the model, so autograd must not track it. torch.no_grad() temporarily disables gradient tracking so the in-place update can modify W and b directly.

EXECUTION STATE
📚 torch.no_grad() = Context manager that disables gradient tracking inside its block. Without it, W -= lr * W.grad raises a RuntimeError, because PyTorch does not allow in-place modification of a leaf tensor that requires grad while tracking is enabled. The update is not part of the model; it is an optimization step.
33W -= lr * W.grad

The gradient descent update rule: W_new = W_old - lr × ∂loss/∂W. Each weight moves in the direction that DECREASES the loss.

EXECUTION STATE
lr * W.grad = 0.0001 × [40.73, 11.58, 10.16] = [0.00407, 0.00116, 0.00102]
W after update = [[0.0059, -0.0212, 0.0290]] — height weight decreased from 0.01 to 0.006 (biggest change, biggest gradient)
34b -= lr * b.grad

Same update rule for the bias: b_new = b_old - lr × ∂loss/∂b.

EXECUTION STATE
b after update = tensor([-2.7409e-05]), decreased from 0.0 by a tiny amount (0.0001 × 0.2741 ≈ 0.0000274)
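The role of torch.no_grad() in the update can be checked on a toy parameter. This is a minimal sketch with a made-up two-element tensor `w` and a step size of 0.1, both chosen only for illustration: the same in-place update is attempted once with tracking on and once inside the context manager.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
(w ** 2).sum().backward()          # w.grad = 2 * w = [2., 4.]

# Attempt the in-place update while gradient tracking is still on
try:
    w -= 0.1 * w.grad
    update_error = None
except RuntimeError as err:
    update_error = err
print(update_error)

# The same update inside no_grad() succeeds
with torch.no_grad():
    w -= 0.1 * w.grad

# w is still a leaf tensor that requires grad, ready for the next step
print(w, w.is_leaf, w.requires_grad)
```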
35print(f"W after: {W.data}")

The updated weights. Even this one tiny step changed the weights in the direction that reduces the loss.

EXECUTION STATE
output = W after: tensor([[ 0.0059, -0.0212, 0.0290]])
36print(f"b after: {b.data}")

The updated bias.

EXECUTION STATE
output = b after: tensor([-2.7409e-05])
38Comment: Clear gradients

PyTorch ACCUMULATES gradients by default — each .backward() call ADDS to .grad, it doesn’t replace it. This is useful for some advanced techniques, but usually you want to zero the gradients before the next iteration.

39W.grad.zero_()

Sets W.grad to all zeros, ready for the next iteration. The trailing underscore _ is PyTorch’s convention for in-place operations (modifies the tensor directly rather than creating a new one).

EXECUTION STATE
📚 .zero_() = In-place operation: fills the tensor with zeros. The _ suffix means ‘in-place’ in PyTorch: .add_(), .mul_(), .zero_() all modify the tensor directly.
→ why zero? = If we skip zeroing, the next .backward() ADDS to the existing gradient. After 2 iterations, W.grad would be grad_iter1 + grad_iter2 instead of just grad_iter2. This causes incorrect updates.
40b.grad.zero_()

Same for the bias gradient. In practice, you’ll use optimizer.zero_grad() which does this for all parameters at once.

EXECUTION STATE
b.grad.zero_() = Resets bias gradient to 0. Now both W.grad and b.grad are clean for the next training step.
import torch
import torch.nn.functional as F

# Input data and true labels
X = torch.tensor([[170.0, 65.0, 25.0],
                  [160.0, 55.0, 30.0],
                  [180.0, 80.0, 22.0],
                  [155.0, 50.0, 35.0]])
y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Learnable parameters
W = torch.tensor([[0.01, -0.02, 0.03]],
                 requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

# ── Step 1: Forward pass ──────────────────────
Z = X @ W.T + b
y_pred = torch.sigmoid(Z.squeeze())
print(f"Predictions: {y_pred.data}")

# ── Step 2: Compute loss ──────────────────────
loss = F.binary_cross_entropy(y_pred, y_true)
print(f"Loss: {loss.item():.4f}")

# ── Step 3: Backward pass ─────────────────────
loss.backward()
print(f"W.grad: {W.grad}")
print(f"b.grad: {b.grad}")

# ── Step 4: Update weights ────────────────────
lr = 0.0001
with torch.no_grad():
    W -= lr * W.grad
    b -= lr * b.grad
print(f"W after: {W.data}")
print(f"b after: {b.data}")

# Clear gradients for next iteration
W.grad.zero_()
b.grad.zero_()
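The accumulation behavior that makes the final zero_() calls necessary is easy to see on a single parameter. This is a minimal sketch with a made-up scalar `w = 3.0` and the function w², whose derivative at that point is 6.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

(w ** 2).backward()
first = w.grad.item()          # d(w^2)/dw at w=3 is 6

(w ** 2).backward()            # no zeroing in between
accumulated = w.grad.item()    # 6 + 6 = 12, the two gradients were summed

w.grad.zero_()                 # reset in place
(w ** 2).backward()
after_zero = w.grad.item()     # back to 6

print(first, accumulated, after_zero)  # 6.0 12.0 6.0
```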

After just one step, the loss decreased from 1.0081 to approximately 0.88 (recomputing the forward pass with the updated weights). The gradient pointed the weights in the right direction. Repeat this step hundreds or thousands of times, and the loss keeps shrinking as the weights move toward values that classify the data correctly.
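Repeating the step is a matter of wrapping the four stages in a loop. The sketch below reuses the same data, initial weights, and learning rate; the step count `num_steps = 200` and the `losses` list are illustrative choices, not values from the text.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[170.0, 65.0, 25.0],
                  [160.0, 55.0, 30.0],
                  [180.0, 80.0, 22.0],
                  [155.0, 50.0, 35.0]])
y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])
W = torch.tensor([[0.01, -0.02, 0.03]], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

lr = 0.0001
num_steps = 200   # illustrative; real training runs are usually much longer
losses = []

for step in range(num_steps):
    # 1. Forward
    y_pred = torch.sigmoid((X @ W.T + b).squeeze(1))
    # 2. Loss
    loss = F.binary_cross_entropy(y_pred, y_true)
    losses.append(loss.item())
    # 3. Backward
    loss.backward()
    # 4. Update, then clear gradients for the next iteration
    with torch.no_grad():
        W -= lr * W.grad
        b -= lr * b.grad
    W.grad.zero_()
    b.grad.zero_()

    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}")
```

The first two entries of `losses` correspond to the values quoted above: the loss before any update and the loss after one step.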

Try the interactive gradient descent visualizer below. It shows the same concept on a simple 1D loss curve L(w) = (w - 3)^2. Watch how the weight slides downhill toward the minimum:

[Interactive gradient descent visualization]

Experiment with the learning rate slider. Notice:

  • Too small (0.01): the weight inches toward 3.0 very slowly — it would take hundreds of steps
  • Just right (0.1): smooth, steady convergence in about 20 steps
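  • Too large (above 0.5): each step overshoots 3.0, so the weight oscillates around the minimum; above 1.0 the oscillations grow and the weight diverges. For this particular quadratic, a rate of exactly 0.5 lands on the minimum in a single step.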
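The same experiment can be run without the widget. For L(w) = (w - 3)^2 the gradient is 2(w - 3), so each update multiplies the distance to the minimum by (1 - 2·lr). The sketch below assumes a starting point of w = 0 and 20 steps, both chosen for illustration since the visualizer's settings are not given in the text; the function name `gradient_descent` is likewise hypothetical.

```python
def gradient_descent(lr, w0=0.0, steps=20):
    """Minimize L(w) = (w - 3)^2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)      # dL/dw
        w = w - lr * grad       # step against the gradient
    return w

# Small, moderate, oscillating, and diverging learning rates
for lr in (0.01, 0.1, 0.9, 1.1):
    print(f"lr={lr:<4}  w after 20 steps = {gradient_descent(lr):.4f}")
```

With lr = 0.9 the factor is -0.8, so the weight jumps from one side of 3.0 to the other while still closing in; with lr = 1.1 the factor is -1.2 and every jump lands farther away than the last.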

The Training Loop in Production

In real PyTorch code, you use torch.optim (e.g., optim.SGD or optim.Adam) instead of manually writing W -= lr * W.grad. The optimizer performs the update when you call optimizer.step() and clears the gradients when you call optimizer.zero_grad(). But the four-step structure (forward, loss, backward, update) never changes.
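As a rough illustration, here is the same single training step written with torch.optim.SGD. It assumes the data, initial weights, and learning rate from the manual version; plain SGD without momentum applies the same rule as W -= lr * W.grad, so the updated weights should agree with the manual result.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[170.0, 65.0, 25.0],
                  [160.0, 55.0, 30.0],
                  [180.0, 80.0, 22.0],
                  [155.0, 50.0, 35.0]])
y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])
W = torch.tensor([[0.01, -0.02, 0.03]], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

# The optimizer is told which tensors to update and how large a step to take
optimizer = torch.optim.SGD([W, b], lr=0.0001)

y_pred = torch.sigmoid((X @ W.T + b).squeeze(1))   # 1. forward
loss = F.binary_cross_entropy(y_pred, y_true)      # 2. loss
optimizer.zero_grad()                              # clear any old gradients
loss.backward()                                    # 3. backward
optimizer.step()                                   # 4. update, no no_grad() block needed

print(W.detach(), b.detach())
```

Swapping in a different optimizer, such as torch.optim.Adam, changes only the line that constructs `optimizer`; the four steps stay where they are.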

Summary and Bridge

In this section, we translated everything from NumPy to PyTorch and discovered the one feature that makes deep learning possible: automatic gradient computation. Here is what we covered:

| Concept | NumPy | PyTorch Equivalent |
| --- | --- | --- |
| Array/Tensor creation | np.array([1, 2, 3]) | torch.tensor([1, 2, 3]) |
| Element-wise ops | a + b, a * b | a + b, a * b (identical) |
| Dot product | np.dot(a, b) | torch.dot(a, b) |
| Matrix multiply | X @ W | X @ W (identical) |
| ReLU activation | np.maximum(0, x) | torch.relu(x) |
| Gradient tracking | Not possible | requires_grad=True |
| Compute gradients | Manual calculus | loss.backward() |
| Read gradients | Not possible | param.grad |

The training loop we built — forward, loss, backward, update — is the heartbeat of every neural network, from a single neuron to GPT. The architecture gets more complex, the loss functions get more sophisticated, and the optimizers get smarter, but the four-step pattern never changes.

In the next section, we will dive deeper into PyTorch tensors: creation methods (torch.zeros, torch.randn, torch.arange), reshaping, broadcasting rules, and advanced indexing. These are the building blocks you will use every day when constructing neural networks.

In Section 4, we will explore PyTorch's autograd system in depth: how the computation graph is built and destroyed, how gradients accumulate, and how to use torch.no_grad() and .detach() to control what gets tracked.
