Learning Objectives
By the end of this section, you will be able to:
- Build a neural network with nn.Module — the standard way in PyTorch
- Understand nn.Linear's weight convention — why it stores (out, in) not (in, out)
- Create a complete training dataset of all 16 possible 2×2 binary images
- Run a batch forward pass on the entire dataset at once
The bridge: You did the math by hand. You saw it in NumPy. Now let's see how PyTorch does the same thing — and you'll see the code mirrors the math line for line. Once you trust the code, you'll never need to compute by hand again (but you'll always know what's happening underneath).
PyTorch Implementation
Pay special attention to nn.Linear's weight storage convention: it's the most common source of confusion.
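As a minimal sketch of what such a network might look like, here is a 4 → 3 → 4 model built with nn.Module. The class name TinyNet, the layer names fc1 and fc2, the seed, and the sample input are all illustrative assumptions rather than the original listing, and leaving the output layer without an activation is also an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyNet(nn.Module):  # hypothetical name, chosen for this sketch
    """A 4 -> 3 -> 4 fully connected network."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 3)  # hidden layer: weight (3, 4), bias (3,)
        self.fc2 = nn.Linear(3, 4)  # output layer: weight (4, 3), bias (4,)

    def forward(self, x):
        h = F.relu(self.fc1(x))  # hidden activations
        return self.fc2(h)       # assumption: no activation on the output


torch.manual_seed(0)  # arbitrary seed, only so the random weights repeat
model = TinyNet()
x = torch.tensor([1.0, 0.0, 1.0, 0.0])  # one flattened 2x2 image (illustrative)
y = model(x)  # calling the module runs forward()
print(y)
```

Each line of forward corresponds to one step of the hand calculation: an affine map, a ReLU, then a second affine map.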
The nn.Linear Weight Convention
This is the #1 gotcha for beginners. Our math convention and PyTorch's convention are transposed:
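| Convention | $W$ shape | Computation | Used by |
|---|---|---|---|
| Our math | (in, out) | $z = xW + b$ | NumPy, textbooks |
| PyTorch | (out, in) | $z = xW^\top + b$ | nn.Linear |

Both produce identical results, because the PyTorch matrix is the transpose of ours. PyTorch stores the (out, in) matrix directly because its internal computation is $xW^\top + b$. When you print a layer's weight.shape and see torch.Size([3, 4]) for nn.Linear(4, 3), that's (out, in): each row is one neuron's input weights.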
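The equivalence of the two conventions can be checked directly. The sketch below uses a single nn.Linear(4, 3) layer with random weights; the seed and the input values are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)               # arbitrary seed for repeatable weights
layer = nn.Linear(4, 3)            # 4 inputs -> 3 outputs
print(layer.weight.shape)          # torch.Size([3, 4]), i.e. (out, in)
print(layer.bias.shape)            # torch.Size([3])

x = torch.tensor([[1.0, 0.0, 1.0, 1.0]])   # one example, shape (1, 4)

pytorch_out = layer(x)                            # what nn.Linear returns
manual_pt = x @ layer.weight.T + layer.bias       # PyTorch convention, written out

W = layer.weight.T.detach()        # our math convention: shape (in, out) = (4, 3)
b = layer.bias.detach()
manual_math = x @ W + b            # x W + b with the (in, out) matrix

print(torch.allclose(pytorch_out, manual_pt))     # True
print(torch.allclose(pytorch_out, manual_math))   # True
```

To copy hand-derived weights of shape (in, out) into a layer, transpose them first so they match the stored (out, in) layout.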
Quick Check
PyTorch's nn.Linear(4, 3) stores its weight tensor with shape...
Creating the Training Dataset
Our 2×2 binary images have 4 pixels, each 0 or 1, which gives $2^4 = 16$ possible images. We can enumerate the entire input space:
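One way to enumerate the inputs, sketched here with itertools.product, is to list every combination of four binary pixels and stack them into a tensor. The variable names pixels and X are assumptions, and since this passage does not say what the training targets are, only the inputs are built.

```python
from itertools import product

import torch

# Every combination of four binary pixels, in counting order 0000 ... 1111.
pixels = list(product([0, 1], repeat=4))
X = torch.tensor(pixels, dtype=torch.float32)   # one flattened image per row

print(X.shape)               # torch.Size([16, 4])
print(X[5].reshape(2, 2))    # row 5 is (0, 1, 0, 1), viewed as a 2x2 image
```

Using float32 matters here: nn.Linear parameters are float32 by default, so an integer tensor would raise a dtype error in the forward pass.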
Batch Forward Pass
Instead of feeding images one at a time, we stack all 16 into a matrix and process them simultaneously. The batch dimension (16) rides along through every matrix multiplication — the math is identical, just applied to all images in parallel:
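The sketch below writes the batched forward pass out as explicit matrix products so the batch dimension is visible at each step. The layer names fc1 and fc2 and the intermediate names z1 and h are hypothetical, the weights are random, and the absence of an output activation is an assumption.

```python
from itertools import product

import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for repeatable weights
X = torch.tensor(list(product([0, 1], repeat=4)), dtype=torch.float32)  # all images

fc1 = nn.Linear(4, 3)   # hypothetical layer names
fc2 = nn.Linear(3, 4)

with torch.no_grad():                       # forward only, no gradients needed
    z1 = X @ fc1.weight.T + fc1.bias        # bias broadcasts across the batch
    h = torch.relu(z1)                      # elementwise, shape unchanged
    out = h @ fc2.weight.T + fc2.bias

# Print the shape at each stage; the first dimension is always the batch size.
for name, t in [("X", X), ("z1", z1), ("h", h), ("out", out)]:
    print(name, tuple(t.shape))
```

Nothing in the code loops over images: each matrix product handles all rows at once, and the bias vector is broadcast to every row.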
Quick Check
If X has shape (16, 4) and the network is 4 → 3 → 4, what shape does model(X) produce?
What Changes During Training?
| Component | Fixed or Learned? | Details |
|---|---|---|
| Architecture | Fixed | You choose before training |
| Activation (ReLU) | Fixed | You choose before training |
| Weights | Learned | 24 numbers that change every step |
| Biases | Learned | 7 numbers that change every step |
| Training data | Fixed | The examples you provide |
| Forward pass | Fixed | The same sequence of operations every step |
Learning = finding the right 31 numbers. The structure, activations, number of layers — all fixed. Training only adjusts the weights and biases.
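The parameter counts in the table can be confirmed by asking PyTorch directly. This sketch builds the same 4 → 3 → 4 shape with nn.Sequential purely for brevity, which is an assumption about layout and not necessarily how the model is defined elsewhere.

```python
import torch.nn as nn

# Same 4 -> 3 -> 4 shape, written with nn.Sequential to keep the sketch short.
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 4))

# List every learnable tensor with its shape and element count.
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.numel())

weights = sum(p.numel() for n, p in model.named_parameters() if n.endswith("weight"))
biases = sum(p.numel() for n, p in model.named_parameters() if n.endswith("bias"))
print(weights, biases, weights + biases)   # 24 7 31
```

The ReLU contributes nothing to the count: it has no parameters, which is why it appears as fixed in the table.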
Preview of Chapter 8: How do we find those 31 numbers? We compute how the loss changes when we nudge each weight (the gradient), then adjust each weight to reduce the loss. That's backpropagation + gradient descent. We'll trace every gradient by hand, just like we traced every forward pass here.
The Computation Graph
Here's the complete forward pass as a computation graph: a directed acyclic graph (DAG) where each node is an operation and edges show data flow.
Every node in this graph stores its output during the forward pass. In Chapter 8, we'll flow gradients backward through these same edges to compute how each weight should change.
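PyTorch records a graph of this kind automatically whenever parameters require gradients. As a rough sketch, the helper below (walk is a hypothetical name) follows grad_fn and next_functions backward from the output and prints the node types it finds. The exact node names depend on the PyTorch version, so none are listed here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for repeatable weights
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 4))
x = torch.tensor([[1.0, 0.0, 1.0, 0.0]])   # illustrative input, batch of one
out = model(x)


def walk(fn, depth=0, names=None):
    """Depth-first walk over the autograd graph, collecting node type names."""
    if names is None:
        names = []
    if fn is None:            # inputs that need no gradient have no node
        return names
    names.append(type(fn).__name__)
    print("  " * depth + type(fn).__name__)
    for parent, _ in fn.next_functions:
        walk(parent, depth + 1, names)
    return names


names = walk(out.grad_fn)     # start at the output and move toward the leaves
```

The leaves of the printed tree correspond to the weight and bias tensors, which is where gradients are accumulated during the backward pass.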
Exercises
- Different input. Pick a different input image and compute the forward pass by hand, then verify with the PyTorch code.
- Change hidden size. Modify the network to use 5 hidden neurons. How does the parameter count change?
- Remove activation. Comment out F.relu in forward. The network still produces an output, but training will fail. Why?
- Batch vs single. Verify that row i of model(X) equals the output of the model on image i alone. Why is batching faster?
Summary
- PyTorch mirrors the math. model(x) calls forward, which runs the same computation we did by hand.
- nn.Linear stores weights transposed: shape (out, in), not (in, out). Internally it computes $xW^\top + b$.
- All 31 parameters verified. Hand calculation, NumPy, and PyTorch produce identical results.
- 16 training examples cover every possible 2×2 binary image.
- The network currently produces garbage. Training (Chapter 8) will fix this.