Why PyTorch?
In the previous section, we built a complete neural network layer in NumPy — matrix multiplications, bias additions, ReLU activations. The forward pass worked perfectly. But there was a problem we deliberately ignored: how do we update the weights?
To update weights, we need gradients — the derivatives of the loss with respect to each weight. For our tiny 2-neuron layer, computing gradients by hand is tedious but possible. For a real network with millions of parameters, it is impossible to do by hand and extremely error-prone to code manually. This is the problem PyTorch solves.
The One Sentence Summary: PyTorch is NumPy with automatic gradient computation. Every operation you already know works the same way, but PyTorch records each computation on gradient-tracking tensors so it can calculate derivatives automatically when you ask.
Here is what PyTorch adds on top of NumPy:
| Feature | NumPy | PyTorch |
|---|---|---|
| Core data structure | ndarray | Tensor (very similar API, more features) |
| Automatic gradients | No | Yes — via requires_grad=True + .backward() |
| GPU acceleration | No | Yes — .to('cuda') moves tensors to GPU |
| Neural network layers | Build from scratch | torch.nn provides pre-built layers |
| Optimizers (SGD, Adam) | Build from scratch | torch.optim provides pre-built optimizers |
| Computation graph | None | Dynamic graph built on-the-fly |
The key insight is that PyTorch does not ask you to unlearn NumPy: for the array operations used here, it behaves like NumPy with extra capabilities. If you know NumPy, you already know how to write most PyTorch code, because the syntax is deliberately designed to feel familiar. The main new concept is the computational graph: a record of every operation that PyTorch uses to compute gradients automatically.
From NumPy to Tensors: Your First Translation
A PyTorch tensor is the direct equivalent of a NumPy ndarray. It stores numbers in a multi-dimensional grid, supports the same indexing and slicing, and provides the same mathematical operations. The name "tensor" comes from mathematics — it simply means "a multi-dimensional array of numbers." A scalar is a 0D tensor, a vector is a 1D tensor, a matrix is a 2D tensor, and so on.
Let's translate our NumPy code from Section 1 into PyTorch, line by line:
Notice the pattern: replace np with torch, and you have working PyTorch code. The shapes, indexing, and operations all transfer directly. The three new properties to remember are:
| Property | NumPy | PyTorch | Why It Matters |
|---|---|---|---|
| Default float type | float64 (8 bytes) | float32 (4 bytes) | Half the memory per value, and GPUs run 32-bit math much faster than 64-bit |
| device | Always CPU | CPU or CUDA (GPU) | GPU parallelism enables training on large data |
| requires_grad | Not available | True/False | Enables automatic gradient computation |
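The translation pattern and the three properties in the table can be sketched as follows. This is a minimal illustration with made-up values, not the Section 1 data; the variable names (a_np, a_pt, w, a_dev) are hypothetical.

```python
import numpy as np
import torch

# NumPy version: a small 2x2 array (illustrative values)
a_np = np.array([[1.0, 2.0], [3.0, 4.0]])

# PyTorch version: np -> torch, and array -> tensor
a_pt = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# Property 1: default float type differs
print(a_np.dtype)          # float64
print(a_pt.dtype)          # torch.float32

# Property 2: every tensor lives on a device (CPU unless you move it)
print(a_pt.device)         # cpu
device = "cuda" if torch.cuda.is_available() else "cpu"
a_dev = a_pt.to(device)    # stays on CPU when no GPU is available

# Property 3: gradient tracking is off by default and opt-in per tensor
print(a_pt.requires_grad)  # False
w = torch.tensor([0.5, -0.5], requires_grad=True)
print(w.requires_grad)     # True
```

Shapes and indexing carry over unchanged: a_pt[0, 1] and a_np[0, 1] both select the value 2.0.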
Conversion Cheat Sheet
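A minimal sketch of the most common conversions between NumPy arrays and PyTorch tensors. The values and variable names are illustrative assumptions; the calls used (torch.from_numpy, torch.tensor, Tensor.numpy, Tensor.detach) are standard PyTorch.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])                    # float64 ndarray

# NumPy -> PyTorch
t_shared = torch.from_numpy(arr)                   # shares memory, keeps float64
t_copy = torch.tensor(arr, dtype=torch.float32)    # independent copy, cast to float32

# Because t_shared shares memory with arr, edits to arr show up in it
arr[0] = 10.0
print(t_shared[0].item())   # 10.0
print(t_copy[0].item())     # 1.0

# PyTorch -> NumPy (CPU tensors only; the result shares memory with the tensor)
back = t_copy.numpy()

# Tensors that track gradients must be detached before conversion
w = torch.tensor([1.0, 2.0], requires_grad=True)
w_np = w.detach().numpy()
```

Use torch.from_numpy when you want to avoid a copy, and torch.tensor when you want an independent tensor with an explicit dtype.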
Operations You Already Know
Every operation from Section 1 (element-wise arithmetic, dot products, matrix multiplication) works the same way in PyTorch. The operators are the same, the behavior is the same, and the results agree up to floating-point precision. The only difference is that when tensors have requires_grad=True, PyTorch records each operation in a computation graph.
Here is the translation table for the most common operations:
| Operation | NumPy | PyTorch | Result Shape |
|---|---|---|---|
| Element-wise add | a + b | a + b | Same as inputs |
| Element-wise multiply | a * b | a * b | Same as inputs |
| Dot product (1D) | np.dot(a, b) | torch.dot(a, b) | Scalar |
| Matrix multiply | X @ W or np.matmul(X, W) | X @ W or torch.matmul(X, W) | (m×n) @ (n×p) = (m×p) |
| Transpose | W.T | W.T or W.mT | Rows ↔ Columns |
| ReLU | np.maximum(0, x) | torch.relu(x) | Same as input |
| Sum | np.sum(x, axis=0) | torch.sum(x, dim=0) | Reduced along dim |
Key Insight: If your NumPy code runs correctly, the same logic in PyTorch will produce the same numbers, up to small floating-point differences between float64 and float32. PyTorch was designed so that you focus on the math, not on learning a new API.
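One way to check the translation table is to run each operation in both libraries and compare the results. The sketch below uses small made-up matrices (X and W here are illustrative, not the Section 1 data) stored as float32 so both libraries work at the same precision.

```python
import numpy as np
import torch

X_np = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
W_np = np.array([[0.5, -1.0], [1.5, 2.0]], dtype=np.float32)
X_pt = torch.from_numpy(X_np)
W_pt = torch.from_numpy(W_np)

# Each entry pairs the NumPy result with the PyTorch result
pairs = {
    "add": (X_np + W_np, X_pt + W_pt),
    "multiply": (X_np * W_np, X_pt * W_pt),
    "dot": (np.dot(X_np[0], W_np[0]), torch.dot(X_pt[0], W_pt[0])),
    "matmul": (X_np @ W_np, X_pt @ W_pt),
    "transpose": (W_np.T, W_pt.T),
    "relu": (np.maximum(0, W_np), torch.relu(W_pt)),
    "sum": (np.sum(X_np, axis=0), torch.sum(X_pt, dim=0)),
}

# Compare every pair numerically
matches = {
    name: bool(np.allclose(r_np, r_pt.numpy()))
    for name, (r_np, r_pt) in pairs.items()
}
print(matches)
```

Note the one naming difference that shows up in practice: NumPy reductions take axis=, while PyTorch reductions take dim=.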
The Key Difference: Computational Graphs
So far, PyTorch looks like a NumPy clone. But there is one feature that changes everything: the computational graph. When you set requires_grad=True on a tensor, PyTorch starts building an invisible data structure that records every operation.
Think of it like this: NumPy is a calculator — you give it numbers, it gives you answers. PyTorch is a calculator with a tape recorder — it gives you the same answers, but it also records how it computed them. When you later ask "how would the answer change if I tweaked this input?", PyTorch can rewind the tape and tell you, automatically.
How the Graph Works
Consider a simple function: y = x**2 + 3*x + c, where c is a constant. When you compute this in PyTorch with requires_grad=True set on x, here is what happens behind the scenes:
- x is created as a leaf node in the graph (an input with no parent operations)
- x**2 is computed: PyTorch creates a node that remembers "I squared x"
- 3*x is computed: a node remembers "I multiplied x by 3"
- The sum and final addition create nodes
- y gets a grad_fn attribute pointing to the last node — the entry point for traversing the graph backward
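The steps above can be observed directly by inspecting the tensors. This is a small sketch that assumes the illustrative function y = x**2 + 3*x + 1 at x = 2.0; the constant and the input value are assumptions chosen for the example.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x + 1   # assumed example function

# x is a leaf: it was created by the user, not by an operation
print(x.is_leaf)     # True
print(x.grad_fn)     # None

# y is the result of operations, so it carries a grad_fn
print(y.is_leaf)     # False
print(y.grad_fn)     # the node for the final addition

# Each node links to the nodes that produced its inputs
print(y.grad_fn.next_functions)
```

Following next_functions repeatedly walks the graph from the output back toward the leaf x, which is the same traversal that the backward pass performs.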
Try the interactive visualizer below. Change the input value and watch the forward values update. Then click Backward to see gradients flow from the output back to the input:
Each node in the backward pass computes a local derivative and multiplies it by the incoming gradient: this is the chain rule in action. The gradient at x is 2*x + 3, computed automatically by summing the contributions from both paths: 2*x from the x**2 branch and 3 from the 3*x branch.
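To see the two paths separately, one option is torch.autograd.grad, which returns the gradient of a chosen output with respect to a chosen input. The sketch below assumes the illustrative function y = x**2 + 3*x + 1 at x = 2.0; the intermediate names a and b are hypothetical.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
a = x**2          # first path
b = 3 * x         # second path
y = a + b + 1     # assumed example function

# Gradient contributed by each path on its own
(grad_a,) = torch.autograd.grad(a, x, retain_graph=True)   # d(x**2)/dx = 2*x
(grad_b,) = torch.autograd.grad(b, x, retain_graph=True)   # d(3*x)/dx = 3

# Gradient of the full output: the sum of both contributions
(grad_y,) = torch.autograd.grad(y, x)

print(grad_a.item(), grad_b.item(), grad_y.item())   # 4.0 3.0 7.0
```

retain_graph=True keeps the recorded graph alive after the first two calls so that the final call can still traverse it.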
Forward vs Backward
Your First Gradient
Let's make it concrete. We will compute a function at a specific input, then ask PyTorch for the derivative and check whether it agrees with the analytical answer:
Three lines of code. That is all it takes to compute a gradient that would require the chain rule, product rule, and careful bookkeeping if done by hand. And this scales: the same three lines work for a function with a million parameters.
The pattern is always the same:
- Create tensors with requires_grad=True
- Compute the output (forward pass)
- Call .backward() on the output
- Read the gradient from the .grad attribute of each input tensor
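The four-step pattern can be sketched on a small vector input. The function y = sum(x**2 + 3*x) and the input values are assumptions chosen for illustration; the analytical gradient for each element is 2*x + 3.

```python
import torch

# 1. Create a tensor that tracks gradients
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# 2. Forward pass: reduce to a scalar so backward() needs no extra arguments
y = (x**2 + 3 * x).sum()

# 3. Backward pass: fills x.grad with dy/dx for every element
y.backward()

# 4. Read the gradient
print(x.grad)   # tensor([5., 7., 9.])
```

The same code works unchanged if x holds three values or three million, which is why the pattern scales to real networks.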
Why This Matters: In a neural network, the "input" is the weights, the "computation" is the forward pass, and the "output" is the loss. By calling loss.backward(), PyTorch computes the gradient of the loss with respect to every weight, telling us exactly how to adjust each weight to reduce the error.
The Neuron Revisited: NumPy vs PyTorch
Now let's revisit the neuron from Section 1. We will write the exact same forward pass in both NumPy and PyTorch, side by side. The numbers will be identical. The only difference: the PyTorch version records a computational graph, so we can later compute gradients.
The outputs are identical:
| Person | Neuron 1 (NumPy) | Neuron 1 (PyTorch) | Neuron 2 (NumPy) | Neuron 2 (PyTorch) |
|---|---|---|---|---|
| Person 0 | 1.15 | 1.15 | 3.80 | 3.80 |
| Person 1 | 1.40 | 1.40 | 3.45 | 3.45 |
| Person 2 | 0.86 | 0.86 | 4.18 | 4.18 |
| Person 3 | 1.60 | 1.60 | 3.25 | 3.25 |
Same math, same results. But the PyTorch version has something NumPy cannot provide: a grad_fn attribute on the output. That is a thread we can pull to unravel the entire computation, from output back to weights, and compute gradients automatically.
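A side-by-side forward pass might look like the sketch below. The inputs, weights, and biases are hypothetical stand-ins (4 people, 3 features, 2 neurons), not the Section 1 values, so the printed numbers will not match the table above; the point is that both libraries agree with each other.

```python
import numpy as np
import torch

# Hypothetical data: 4 people x 3 features, feeding 2 neurons
X_np = np.array([[0.5, 1.0, 0.2],
                 [1.0, 0.5, 0.8],
                 [0.1, 0.9, 0.4],
                 [0.7, 0.3, 0.6]], dtype=np.float32)
W_np = np.array([[0.4, -0.2],
                 [0.3, 0.8],
                 [-0.5, 0.1]], dtype=np.float32)
b_np = np.array([0.1, -0.1], dtype=np.float32)

# NumPy forward pass: linear layer followed by ReLU
out_np = np.maximum(0, X_np @ W_np + b_np)

# PyTorch forward pass: same math, with gradient tracking on W and b
X_pt = torch.tensor(X_np)
W_pt = torch.tensor(W_np, requires_grad=True)
b_pt = torch.tensor(b_np, requires_grad=True)
out_pt = torch.relu(X_pt @ W_pt + b_pt)

# Same numbers in both libraries
same = bool(np.allclose(out_np, out_pt.detach().numpy(), atol=1e-6))
print(same)             # True

# Only the PyTorch output remembers how it was computed
print(out_pt.grad_fn)   # the node for the ReLU operation
```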
One Step of Learning
We now have all the pieces. Let's put them together into a complete training step: forward pass, loss computation, backward pass, and weight update. This is the fundamental loop of all neural network training — everything else is optimization and scaling.
The Four-Step Training Loop
Every training iteration follows the same pattern:
| Step | What Happens | PyTorch Code |
|---|---|---|
| 1. Forward | Compute predictions from inputs | y_pred = model(X) |
| 2. Loss | Measure how wrong the predictions are | loss = loss_fn(y_pred, y_true) |
| 3. Backward | Compute gradients automatically | loss.backward() |
| 4. Update | Adjust weights to reduce loss | with torch.no_grad(): W -= lr * W.grad |
Let's implement all four steps for our single neuron:
After just one step, the loss decreased. The gradient pointed the weights in the right direction. Repeat this step hundreds or thousands of times, and the network converges to weights that correctly classify the data.
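A single training step for one neuron can be sketched as follows. The data, the zero initialization, the mean squared error loss, and the learning rate are all assumptions made for illustration, not the values used in Section 1.

```python
import torch

# Hypothetical data: 4 samples x 2 features, one target per sample
X = torch.tensor([[0.5, 1.0], [1.0, 0.5], [0.1, 0.9], [0.7, 0.3]])
y_true = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

# Parameters of a single linear neuron
W = torch.zeros(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1

# 1. Forward
y_pred = X @ W + b

# 2. Loss (mean squared error)
loss = ((y_pred - y_true) ** 2).mean()

# 3. Backward: fills W.grad and b.grad
loss.backward()

# 4. Update, outside the graph so the update itself is not recorded
with torch.no_grad():
    W -= lr * W.grad
    b -= lr * b.grad

# Reset gradients, because backward() accumulates into .grad
W.grad.zero_()
b.grad.zero_()

# Re-evaluate the loss with the updated parameters
with torch.no_grad():
    new_loss = ((X @ W + b - y_true) ** 2).mean()

print(loss.item(), new_loss.item())   # the second value is smaller
```

The torch.no_grad() block matters: updating a gradient-tracking leaf tensor in place outside of it raises a RuntimeError.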
Try the interactive gradient descent visualizer below. It shows the same concept on a simple 1D loss curve with its minimum at 3.0. Watch how the weight slides downhill toward the minimum:
Experiment with the learning rate slider. Notice:
- Too small (0.01): the weight inches toward 3.0 very slowly — it would take hundreds of steps
- Just right (0.1): smooth, steady convergence in about 20 steps
- Too large (0.5): the weight overshoots and oscillates around the minimum, or even diverges
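The effect of the learning rate can also be reproduced in code. The sketch below assumes a hypothetical quadratic loss L(w) = 2 * (w - 3)**2, which has its minimum at w = 3; the helper descend, the starting weight of 0.0, and the 20-step budget are illustrative choices. How a given rate behaves depends on the curvature of the loss, so other curves will give different thresholds.

```python
import torch

def descend(lr, steps=20, w0=0.0):
    """Run plain gradient descent on the assumed loss and return the final weight."""
    w = torch.tensor(w0, requires_grad=True)
    for _ in range(steps):
        loss = 2 * (w - 3.0) ** 2     # assumed loss curve, minimum at w = 3
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad          # gradient descent update
        w.grad.zero_()                # clear the accumulated gradient
    return w.item()

results = {lr: descend(lr) for lr in (0.01, 0.1, 0.5)}
for lr, w_final in results.items():
    print(f"lr={lr}: w after 20 steps = {w_final:.4f}")
```

On this curve, the smallest rate is still far from 3.0 after 20 steps, the middle rate lands very close to it, and the largest rate bounces between two points on either side of the minimum without settling.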
The Training Loop in Production
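In everyday PyTorch code, the manual pieces are usually replaced by torch.nn layers and torch.optim optimizers. The following is a minimal sketch under assumed choices (hypothetical data, a single nn.Linear layer, mean squared error, SGD with a learning rate of 0.1, 100 iterations), not a definitive training setup.

```python
import torch
from torch import nn

torch.manual_seed(0)   # reproducible weight initialization

# Hypothetical data: 4 samples x 2 features, one target per sample
X = torch.tensor([[0.5, 1.0], [1.0, 0.5], [0.1, 0.9], [0.7, 0.3]])
y_true = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

model = nn.Linear(in_features=2, out_features=1)          # weights and bias in one object
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(100):
    optimizer.zero_grad()             # clear gradients from the previous iteration
    y_pred = model(X)                 # 1. forward
    loss = loss_fn(y_pred, y_true)    # 2. loss
    loss.backward()                   # 3. backward
    optimizer.step()                  # 4. update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The four steps are the same as in the manual version; optimizer.step() performs the weight update and optimizer.zero_grad() handles the gradient reset.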
Summary and Bridge
In this section, we translated everything from NumPy to PyTorch and discovered the one feature that makes deep learning possible: automatic gradient computation. Here is what we covered:
| Concept | NumPy | PyTorch Equivalent |
|---|---|---|
| Array/Tensor creation | np.array([1, 2, 3]) | torch.tensor([1, 2, 3]) |
| Element-wise ops | a + b, a * b | a + b, a * b (identical) |
| Dot product | np.dot(a, b) | torch.dot(a, b) |
| Matrix multiply | X @ W | X @ W (identical) |
| ReLU activation | np.maximum(0, x) | torch.relu(x) |
| Gradient tracking | Not possible | requires_grad=True |
| Compute gradients | Manual calculus | loss.backward() |
| Read gradients | Not possible | param.grad |
The training loop we built — forward, loss, backward, update — is the heartbeat of every neural network, from a single neuron to GPT. The architecture gets more complex, the loss functions get more sophisticated, and the optimizers get smarter, but the four-step pattern never changes.
In the next section, we will dive deeper into PyTorch tensors: creation methods, reshaping, broadcasting rules, and advanced indexing. These are the building blocks you will use every day when constructing neural networks.
In Section 4, we will explore PyTorch's autograd system in depth: how the computation graph is built and destroyed, how gradients accumulate, and how to control what gets tracked.