Chapter 7

Building Forward Pass in PyTorch

Forward Propagation

Learning Objectives

By the end of this section, you will be able to:

  1. Build a neural network with nn.Module — the standard way in PyTorch
  2. Understand nn.Linear's weight convention — why it stores (out, in) not (in, out)
  3. Create a complete training dataset of all 16 possible 2×2 binary images
  4. Run a batch forward pass on the entire dataset at once
The bridge: You did the math by hand. You saw it in NumPy. Now let's see how PyTorch does the same thing — and you'll see the code mirrors the math line for line. Once you trust the code, you'll never need to compute by hand again (but you'll always know what's happening underneath).

PyTorch Implementation

The walkthrough below gives the exact values and an explanation for each line. Pay special attention to nn.Linear's weight storage convention, which is the most common source of confusion.

Forward Pass — PyTorch nn.Module
🐍forward_pytorch.py
import torch

PyTorch is the deep learning framework. It provides tensors (like NumPy arrays but with GPU support and automatic differentiation), neural network modules, and optimizers.

EXECUTION STATE
torch = Core library: tensors, autograd, device management. torch.tensor() creates arrays, .backward() computes gradients
import torch.nn as nn

Neural network building blocks: layers (Linear, Conv2d), containers (Module, Sequential), loss functions (MSELoss, CrossEntropyLoss).

EXECUTION STATE
nn.Module = Base class for all neural network models. Provides .parameters(), .forward(), .train()/.eval()
nn.Linear(in, out) = Fully connected layer. Stores weight (out×in) and bias (out,). Computes: output = x @ W.T + b
import torch.nn.functional as F

Stateless functions (no learnable parameters): activations (F.relu, F.softmax), pooling, etc. Use when the operation has no weights to learn.

EXECUTION STATE
F.relu() = Functional ReLU: F.relu(x) = max(0, x) element-wise. Unlike nn.ReLU() (a module), this is a plain function call.
class DiagonalFlipNet(nn.Module):

Our neural network class. Inheriting from nn.Module gives us: automatic parameter tracking, .forward() dispatch, GPU transfer, save/load, and gradient computation for free.

EXECUTION STATE
📚 nn.Module = PyTorch base class for all models. You define layers in __init__ and computation in forward(). PyTorch auto-discovers all nn.Linear/nn.Conv2d layers and tracks their weights.
Architecture = Input(4) → Linear(4→3) → ReLU → Linear(3→4) → Output(4)
Parameters = 31 total: W1 (4×3=12) + b1 (3) + W2 (3×4=12) + b2 (4)
def __init__(self):

Constructor. Define all layers here — PyTorch scans for nn.Module attributes to build the parameter list.

super().__init__()

Call nn.Module's constructor. Required — it sets up internal bookkeeping for parameter registration, hooks, and device tracking.

self.layer1 = nn.Linear(4, 3)

First fully-connected layer: 4 inputs → 3 hidden neurons. Creates a weight matrix and bias vector.

EXECUTION STATE
📚 nn.Linear(in_features, out_features) = Creates a layer that computes: output = input @ weight.T + bias
⬇ in_features = 4 = 4 input features (our flattened 2×2 image)
⬇ out_features = 3 = 3 output neurons (hidden layer size)
weight shape = (3, 4) — NOTE: PyTorch stores it transposed! (out, in) not (in, out)
bias shape = (3,) — one bias per output neuron
⚠️ Convention = PyTorch nn.Linear stores W as (out, in). Forward: x @ W.T + b. Our math convention has W as (in, out). Both produce identical results.
self.layer2 = nn.Linear(3, 4)

Second layer: 3 hidden neurons → 4 outputs. Weight shape (4, 3), bias shape (4,).

EXECUTION STATE
weight shape = (4, 3) — 4 output neurons × 3 inputs from hidden layer = 12 weights
bias shape = (4,) — one per output
def forward(self, x):

Defines the forward pass computation. Called when you write model(x). As the operations run, PyTorch records them to build the computation graph used for backpropagation.

EXECUTION STATE
⬇ input: x = A tensor of shape (4,) for single input or (batch, 4) for batched. Example: [1.0, 0.0, 1.0, 1.0]
⬆ returns = Tensor of shape (4,) for a single input, or (batch, 4) for a batch: the network's prediction
h = F.relu(self.layer1(x))

Two operations in one line: (1) self.layer1(x) computes x @ W1.T + b1, (2) F.relu() clamps negatives to zero. This is our hidden layer.

EXECUTION STATE
self.layer1(x) = Calls nn.Linear.forward(): x @ weight.T + bias = [0.0, -0.1, 0.5]
F.relu(…) = max(0, [0.0, -0.1, 0.5]) = [0.0, 0.0, 0.5]
⬆ result: h = [0.0, 0.0, 0.5] — matches our hand calculation!
y_hat = self.layer2(h)

Output layer: h @ W2.T + b2. No activation on the output — raw values for regression.

EXECUTION STATE
self.layer2(h) = [0.0, 0.0, 0.5] @ W2.T + b2 = [0.1, -0.1, -0.05, -0.25]
⬆ result: y_hat = [0.1, -0.1, -0.05, -0.25] — matches!
return y_hat

Return the prediction tensor. PyTorch keeps the computation graph attached so .backward() can compute gradients later.

EXECUTION STATE
⬆ return: y_hat = [0.1, -0.1, -0.05, -0.25]
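The traced values above can be reproduced without PyTorch. The following is a minimal pure-Python sketch, assuming the weights and biases that this section loads into the model (in nn.Linear's row-per-neuron layout); the helper names `linear` and `relu` are illustrative, not part of any library.

```python
# Weights in nn.Linear layout: one row per output neuron (values from this section)
W1 = [[0.2, -0.3, 0.1, -0.4],
      [-0.5, 0.4, 0.3, 0.2],
      [0.1, -0.2, 0.5, -0.1]]
b1 = [0.1, -0.1, 0.0]
W2 = [[0.3, -0.1, 0.2],
      [-0.2, 0.5, -0.4],
      [0.4, -0.3, 0.1],
      [0.1, 0.2, -0.5]]
b2 = [0.0, 0.1, -0.1, 0.0]

def linear(x, weight, bias):
    """Compute x @ weight.T + bias for a single input vector (hypothetical helper)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, a) for a in v]

x = [1.0, 0.0, 1.0, 1.0]
z1 = linear(x, W1, b1)      # pre-activation, approximately [0.0, -0.1, 0.5]
h = relu(z1)                # hidden layer, approximately [0.0, 0.0, 0.5]
y_hat = linear(h, W2, b2)   # output, approximately [0.1, -0.1, -0.05, -0.25]

print([round(v, 4) for v in y_hat])
```

Values are rounded for display because floating-point sums such as 0.2 + 0.1 - 0.4 + 0.1 land very close to, but not exactly on, zero.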
model = DiagonalFlipNet()

Create an instance of our network. PyTorch randomly initializes all weights. We'll override them next with our specific values.

EXECUTION STATE
model = DiagonalFlipNet( (layer1): Linear(4→3, 15 params) (layer2): Linear(3→4, 16 params) ) — 31 total parameters
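with torch.no_grad():

Context manager that disables gradient tracking. We're manually setting weights, not training, so we don't want PyTorch to record these operations in its computation graph. The block is also required here: an in-place copy_ on a parameter that requires gradients raises a RuntimeError outside it.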

EXECUTION STATE
📚 torch.no_grad() = Temporarily disables autograd. Operations inside this block run faster and use less memory. Use for inference or manual weight manipulation.
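To see why the weight loading sits inside `torch.no_grad()`, here is a small sketch using a standalone `nn.Linear` (not the tutorial's model). It assumes current PyTorch behavior, where an in-place write to a leaf tensor that requires gradients is rejected with a `RuntimeError`.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 3)
new_bias = torch.tensor([0.1, -0.1, 0.0])

# Outside no_grad: in-place writes to a parameter that requires grad are rejected
try:
    layer.bias.copy_(new_bias)
    raised = False
except RuntimeError:
    raised = True
print(raised)

# Inside no_grad: the same copy succeeds and is not recorded by autograd
with torch.no_grad():
    layer.bias.copy_(new_bias)

# The parameter is still trainable afterwards
print(layer.bias.requires_grad)
print(torch.equal(layer.bias.detach(), new_bias))
```

The parameter keeps `requires_grad=True` after the block, so later training steps can still update it.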
model.layer1.weight.copy_(...): load W1

Load our hand-calculated weights. CRITICAL: nn.Linear stores weight as (out_features, in_features), so what we load is W1 transposed. Each row = one hidden neuron's weights.

EXECUTION STATE
⚠️ Shape convention = Our W1 is (4,3) — rows=inputs, cols=neurons. PyTorch layer1.weight is (3,4) — rows=neurons, cols=inputs. So we load the TRANSPOSED version.
layer1.weight (3×4) =
      x0    x1    x2    x3
h0   0.2  -0.3   0.1  -0.4
h1  -0.5   0.4   0.3   0.2
h2   0.1  -0.2   0.5  -0.1
model.layer1.bias.copy_(...): load b1

Load hidden layer biases. Shape (3,) — one per hidden neuron.

EXECUTION STATE
layer1.bias = [0.1, -0.1, 0.0]
model.layer2.weight.copy_(...): load W2

Load output layer weights. Again transposed: our W2 is (3,4), PyTorch stores (4,3). Each row = one output neuron.

EXECUTION STATE
layer2.weight (4×3) =
      h0    h1    h2
y0   0.3  -0.1   0.2
y1  -0.2   0.5  -0.4
y2   0.4  -0.3   0.1
y3   0.1   0.2  -0.5
model.layer2.bias.copy_(...): load b2

Load output biases. Shape (4,) — one per output neuron.

EXECUTION STATE
layer2.bias = [0.0, 0.1, -0.1, 0.0]
x = torch.tensor([1.0, 0.0, 1.0, 1.0])

Our input image [1,0,1,1] as a PyTorch tensor. Uses float32 by default.

EXECUTION STATE
x = tensor([1., 0., 1., 1.]) — shape (4,)
y_hat = model(x)

THE FORWARD PASS. Calling model(x) invokes model.forward(x) which runs: layer1 → ReLU → layer2. PyTorch builds a computation graph for backprop.

EXECUTION STATE
📚 model(x) = Calls nn.Module.__call__, which runs any registered hooks and then model.forward(x). It does not change training/eval mode. Prefer model(x) over calling model.forward(x) directly, because the direct call skips the hooks.
⬆ result: y_hat = tensor([0.1000, -0.1000, -0.0500, -0.2500]) — matches hand calculation!
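The difference between `model(x)` and `model.forward(x)` becomes visible once a hook is registered. The sketch below uses a hypothetical `TinyNet` stand-in rather than the tutorial's network, and `register_forward_hook` from `nn.Module`; the hook function name is illustrative.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical one-layer stand-in, used only to demonstrate hooks."""
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 3)

    def forward(self, x):
        return self.layer(x)

net = TinyNet()
hook_calls = []

def record_shape(module, inputs, output):
    # Forward hooks receive the module, its inputs, and its output
    hook_calls.append(tuple(output.shape))

handle = net.register_forward_hook(record_shape)
x = torch.tensor([1.0, 0.0, 1.0, 1.0])

net(x)                                  # goes through __call__, so the hook fires
calls_after_call = len(hook_calls)
net.forward(x)                          # bypasses __call__, so the hook is skipped
calls_after_forward = len(hook_calls)
handle.remove()

print(calls_after_call, calls_after_forward)  # 1 1
```

Both calls return the same tensor; only the hook bookkeeping differs, which is why tooling that relies on hooks can silently miss direct `forward` calls.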
target = torch.tensor([1.0, 1.0, 0.0, 1.0])

The diagonal flip of [1,0,1,1]. This is what the network should output.

EXECUTION STATE
target = tensor([1., 1., 0., 1.])
loss = torch.mean((y_hat - target) ** 2)

MSE loss computed manually. In production you'd use nn.MSELoss(), but this shows what it does under the hood.

EXECUTION STATE
y_hat - target = [-0.9, -1.1, -0.05, -1.25]
(…) ** 2 = [0.81, 1.21, 0.0025, 1.5625]
⬆ result: loss = tensor(0.8963) — same as NumPy!
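As a cross-check of the manual formula against the built-in loss, the sketch below compares `torch.mean((y_hat - target) ** 2)` with `nn.MSELoss()`, whose default reduction is the mean. The prediction values are typed in directly from the trace above rather than produced by the model.

```python
import torch
import torch.nn as nn

# Prediction and target from the running example
y_hat = torch.tensor([0.1, -0.1, -0.05, -0.25])
target = torch.tensor([1.0, 1.0, 0.0, 1.0])

manual = torch.mean((y_hat - target) ** 2)   # mean of the 4 squared errors
builtin = nn.MSELoss()(y_hat, target)        # default reduction="mean"

print(torch.allclose(manual, builtin))
print(round(manual.item(), 3))
```

The exact mean is (0.81 + 1.21 + 0.0025 + 1.5625) / 4 = 0.89625; float32 arithmetic lands within rounding distance of that value.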
print(f"Prediction: {y_hat.tolist()}")

.tolist() converts tensor to Python list for cleaner printing.

EXECUTION STATE
output ≈ Prediction: [0.1, -0.1, -0.05, -0.25] (float32 values converted to Python floats may print with extra trailing digits)
print(f"Target: {target.tolist()}")

Display target values.

EXECUTION STATE
output = Target: [1.0, 1.0, 0.0, 1.0]
print(f"MSE Loss: {loss.item():.4f}")

.item() extracts a Python scalar from a 0-dimensional tensor. Use this for printing single values.

EXECUTION STATE
📚 .item() = Converts a single-element tensor to a Python float. A 0-dimensional tensor can also be formatted directly in an f-string, but .item() makes the conversion explicit and returns a plain number with no computation graph attached.
output = MSE Loss: 0.8963
print(...): total parameter count

Count all learnable parameters. model.parameters() yields each weight/bias tensor, .numel() counts elements.

EXECUTION STATE
📚 .parameters() = Generator that yields all learnable tensors: layer1.weight(3×4=12), layer1.bias(3), layer2.weight(4×3=12), layer2.bias(4)
📚 .numel() = Number of elements in a tensor. tensor([1,2,3]).numel() = 3
output = Parameters: 31 — matches our count!
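The count of 31 follows from a simple rule: a fully connected layer has `in × out` weights plus `out` biases. A short sketch of that arithmetic, using a hypothetical helper `linear_param_count`:

```python
def linear_param_count(in_features: int, out_features: int) -> int:
    """Weights (out x in) plus one bias per output neuron."""
    return in_features * out_features + out_features

# (in_features, out_features) for each layer of the 4 -> 3 -> 4 network
layer_sizes = [(4, 3), (3, 4)]
per_layer = [linear_param_count(i, o) for i, o in layer_sizes]
total = sum(per_layer)

print(per_layer)  # [15, 16]
print(total)      # 31
```

The same helper can be reused to predict the count for other layer sizes before building the model.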
Full listing: forward_pytorch.py

import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Define the network ──
class DiagonalFlipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(4, 3)
        self.layer2 = nn.Linear(3, 4)

    def forward(self, x):
        h = F.relu(self.layer1(x))
        y_hat = self.layer2(h)
        return y_hat

# ── Create model & load our weights ──
model = DiagonalFlipNet()

with torch.no_grad():
    model.layer1.weight.copy_(torch.tensor([
        [ 0.2, -0.3,  0.1, -0.4],
        [-0.5,  0.4,  0.3,  0.2],
        [ 0.1, -0.2,  0.5, -0.1],
    ]))
    model.layer1.bias.copy_(
        torch.tensor([0.1, -0.1, 0.0]))
    model.layer2.weight.copy_(torch.tensor([
        [ 0.3, -0.1,  0.2],
        [-0.2,  0.5, -0.4],
        [ 0.4, -0.3,  0.1],
        [ 0.1,  0.2, -0.5],
    ]))
    model.layer2.bias.copy_(
        torch.tensor([0.0, 0.1, -0.1, 0.0]))

# ── Forward pass ──
x = torch.tensor([1.0, 0.0, 1.0, 1.0])
y_hat = model(x)
target = torch.tensor([1.0, 1.0, 0.0, 1.0])
loss = torch.mean((y_hat - target) ** 2)

print(f"Prediction: {y_hat.tolist()}")
print(f"Target:     {target.tolist()}")
print(f"MSE Loss:   {loss.item():.4f}")
print(f"Parameters: {sum(p.numel() for p in model.parameters())}")

The nn.Linear Weight Convention

This is the #1 gotcha for beginners. Our math convention and PyTorch's convention are transposed:

| Convention | W1 shape | Computation | Used by |
| Our math | (4, 3) | x · W + b | NumPy, textbooks |
| PyTorch | (3, 4) | x · W^T + b | nn.Linear |

Both produce identical results. PyTorch stores W^T directly because its internal computation is x @ weight.T + bias. When you print model.layer1.weight.shape and see (3, 4), that's (out, in): each row is one neuron's input weights.
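The equivalence of the two conventions can be checked directly. This sketch builds W1 in the math layout (4, 3), loads its transpose into a standalone `nn.Linear(4, 3)`, and compares both computations; the variable names `W_math`, `manual`, and `out` are illustrative.

```python
import torch
import torch.nn as nn

# W1 in the math convention: shape (in, out) = (4, 3), one column per hidden neuron
W_math = torch.tensor([
    [ 0.2, -0.5,  0.1],
    [-0.3,  0.4, -0.2],
    [ 0.1,  0.3,  0.5],
    [-0.4,  0.2, -0.1],
])
b = torch.tensor([0.1, -0.1, 0.0])

layer = nn.Linear(4, 3)
with torch.no_grad():
    layer.weight.copy_(W_math.T)   # nn.Linear expects (out, in) = (3, 4)
    layer.bias.copy_(b)

x = torch.tensor([1.0, 0.0, 1.0, 1.0])
manual = x @ W_math + b            # math convention: x · W + b
with torch.no_grad():
    out = layer(x)                 # PyTorch convention: x @ weight.T + bias

print(tuple(layer.weight.shape))   # (3, 4)
print(torch.allclose(manual, out, atol=1e-6))
```

A slightly loosened `atol` is used because the first hidden value is essentially zero, where tiny float32 differences between the two code paths would otherwise dominate the comparison.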

Quick Check

PyTorch's nn.Linear(4, 3) stores its weight tensor with shape...


Creating the Training Dataset

Our 2×2 binary images have 4 pixels, each 0 or 1, so there are 2^4 = 16 possible images. We can enumerate the entire input space:

Training Dataset — All 16 Images
🐍dataset.py
import torch

PyTorch for tensor operations.

def make_dataset():

Generates ALL possible 2×2 binary images (2⁴ = 16) and their diagonal flips. This is our complete training set — one of the few problems where we can enumerate the entire input space.

EXECUTION STATE
⬆ returns = (X, Y) — two tensors of shape (16, 4). X = inputs, Y = targets
Total examples = 16 — every possible combination of 4 binary pixels
for i in range(16):

Loop through numbers 0-15. Each number encodes one of the 16 possible 2×2 binary images via its 4-bit binary representation.

LOOP TRACE · 4 of 16 iterations shown
i=0 → 0000 → [0,0,0,0]
image = [0,0] / [0,0] → all black
i=6 → 0110 → [0,1,1,0]
image = [0,1] / [1,0] → anti-diagonal
i=11 → 1011 → [1,0,1,1]
image = [1,0] / [1,1] → our example!
i=15 → 1111 → [1,1,1,1]
image = [1,1] / [1,1] → all white
bits = [(i >> 3) & 1, ...]: binary decomposition

Extract the 4 bits of i using right-shift and bitwise AND. i >> 3 gives the most significant bit, i & 1 gives the least significant bit.

EXECUTION STATE
📚 >> (right shift) = Shifts bits right. 11 >> 3 = 1 (binary 1011 → 0001). Extracts higher bits.
📚 & 1 (bitwise AND) = Keeps only the last bit. 6 & 1 = 0, 7 & 1 = 1. Tests if a number is odd.
Example: i=11 (binary 1011) = 11>>3 & 1 = 1, 11>>2 & 1 = 0, 11>>1 & 1 = 1, 11 & 1 = 1 → [1,0,1,1]
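The four hand-written shifts generalize to any bit width. Here is a pure-Python sketch with a hypothetical helper `to_bits`, cross-checked against Python's built-in binary formatting:

```python
def to_bits(i: int, width: int = 4) -> list:
    """Most-significant bit first, same order as the manual shifts above."""
    return [(i >> shift) & 1 for shift in range(width - 1, -1, -1)]

for i in (0, 6, 11, 15):
    # format(i, "04b") gives the zero-padded 4-digit binary string
    print(i, format(i, "04b"), to_bits(i))
```

For `i = 11` this prints `11 1011 [1, 0, 1, 1]`, the running example.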
img = torch.tensor(bits, dtype=torch.float32)

Convert the 4 bits to a float tensor — the network input.

EXECUTION STATE
img = tensor([1., 0., 1., 1.]) for i=11
flipped = img.reshape(2,2).T.reshape(-1)

The diagonal flip in one line: reshape to 2×2, transpose (.T swaps rows/cols = diagonal flip), flatten back to 1D.

EXECUTION STATE
.reshape(2,2) = [1,0,1,1] → [[1,0],[1,1]]
.T (transpose) = [[1,0],[1,1]] → [[1,1],[0,1]]
.reshape(-1) = [[1,1],[0,1]] → [1,1,0,1]
⬆ result: flipped = tensor([1., 1., 0., 1.]) — our target!
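For a 2×2 image stored row by row as [a, b, c, d], the transpose leaves the diagonal (a, d) in place and swaps the off-diagonal pixels (b, c). A pure-Python sketch of that fixed permutation, with a hypothetical helper name:

```python
def diagonal_flip(pixels):
    """Transpose a row-major 2x2 image: [a, b, c, d] -> [a, c, b, d]."""
    a, b, c, d = pixels
    return [a, c, b, d]

print(diagonal_flip([1, 0, 1, 1]))                  # [1, 1, 0, 1]
print(diagonal_flip(diagonal_flip([1, 0, 1, 1])))   # [1, 0, 1, 1]
```

Flipping twice returns the original image, so the target function the network has to learn is its own inverse.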
return torch.stack(inputs), torch.stack(targets)

Stack the 16 individual tensors into batch tensors.

EXECUTION STATE
📚 torch.stack() = Combines a list of tensors into one tensor with a new first dimension. 16 tensors of shape (4,) → one tensor of shape (16, 4).
⬆ return: X = Shape (16, 4) — 16 input images
⬆ return: Y = Shape (16, 4) — 16 target flips
X, Y = make_dataset()

Generate the complete dataset. 16 input-output pairs covering every possible 2×2 binary image.

EXECUTION STATE
X = tensor of shape (16, 4) — all 16 possible inputs
Y = tensor of shape (16, 4) — corresponding diagonal flips
print(...): dataset info

Display the dataset shape.

EXECUTION STATE
output = Dataset: 16 samples, input shape torch.Size([16, 4]), target shape torch.Size([16, 4])
for i in [0, 6, 11, 15]:

Show 4 representative examples: all-black, anti-diagonal, our running example, all-white. Each example prints two lines: the top rows of input and target, then the bottom rows.

LOOP TRACE · 4 iterations
i=0: all-black
  [[0.0, 0.0]] → [[0.0, 0.0]]
  [[0.0, 0.0]]    [[0.0, 0.0]]
i=6: anti-diagonal
  [[0.0, 1.0]] → [[0.0, 1.0]]
  [[1.0, 0.0]]    [[1.0, 0.0]]
i=11: our example
  [[1.0, 0.0]] → [[1.0, 1.0]]
  [[1.0, 1.0]]    [[0.0, 1.0]]
i=15: all-white
  [[1.0, 1.0]] → [[1.0, 1.0]]
  [[1.0, 1.0]]    [[1.0, 1.0]]
Full listing: dataset.py

import torch

def make_dataset():
    """Generate all 16 possible 2x2 binary images
    with their diagonal-flipped targets."""
    inputs, targets = [], []
    for i in range(16):
        bits = [(i >> 3) & 1, (i >> 2) & 1,
                (i >> 1) & 1, i & 1]
        img = torch.tensor(bits, dtype=torch.float32)
        flipped = img.reshape(2, 2).T.reshape(-1)
        inputs.append(img)
        targets.append(flipped)
    return torch.stack(inputs), torch.stack(targets)

X, Y = make_dataset()
print(f"Dataset: {X.shape[0]} samples, "
      f"input shape {X.shape}, "
      f"target shape {Y.shape}")

# Show a few examples
for i in [0, 6, 11, 15]:
    inp = X[i].tolist()
    tgt = Y[i].tolist()
    print(f"  [{inp[:2]}] → [{tgt[:2]}]")
    print(f"  [{inp[2:]}]    [{tgt[2:]}]")

Only 16 samples! This is rare: in real ML you can't enumerate every possible input (ImageNet has about 14 million images, a tiny fraction of all possible images). For learning, though, a complete dataset lets us check the network's output on every possible input, so no part of its behavior is left untested.

Batch Forward Pass

Instead of feeding images one at a time, we stack all 16 into a matrix and process them simultaneously. The batch dimension (16) rides along through every matrix multiplication — the math is identical, just applied to all images in parallel:

Batch Forward Pass — All 16 Images at Once
🐍batch_forward.py
import torch

PyTorch for tensor operations.

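model = DiagonalFlipNet()

Same network architecture from the PyTorch section above. This is a fresh instance, so its weights are randomly initialized unless you load the hand-set values again.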

EXECUTION STATE
model = Input(4) → Linear(4→3) → ReLU → Linear(3→4) → Output(4)
X, Y = make_dataset(): all 16 images

X contains all 16 possible 2×2 binary images stacked as rows. Y contains their diagonal flips.

EXECUTION STATE
⬇ X shape = (16, 4) — 16 images, each flattened to 4 pixels
⬇ Y shape = (16, 4) — 16 target flips
predictions = model(X): the batch forward pass

Feed ALL 16 images through the network in one call. The batch dimension (16) rides along through every operation: (16,4) @ (4,3) = (16,3) → ReLU → (16,3) @ (3,4) = (16,4).

EXECUTION STATE
⬇ X = Shape (16, 4) — 16 input images
Layer 1: X @ W1.T + b1 = (16, 4) @ (4, 3) = (16, 3) — 16 sets of 3 hidden values
ReLU = (16, 3) → (16, 3) — element-wise, shape unchanged
Layer 2: h @ W2.T + b2 = (16, 3) @ (3, 4) = (16, 4) — 16 output predictions
⬆ result: predictions = Shape (16, 4) — one prediction per image
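The shape bookkeeping above can be traced with plain tensors. This sketch uses random stand-in weights in nn.Linear layout and a zero-filled placeholder batch, so only the shapes are meaningful, not the values.

```python
import torch

X = torch.zeros(16, 4)                        # placeholder batch: 16 images, 4 pixels each
W1, b1 = torch.randn(3, 4), torch.randn(3)    # layer 1 in (out, in) layout
W2, b2 = torch.randn(4, 3), torch.randn(4)    # layer 2 in (out, in) layout

z1 = X @ W1.T + b1      # (16, 4) @ (4, 3) -> (16, 3); b1 broadcasts across the 16 rows
h = torch.relu(z1)      # element-wise, shape unchanged
out = h @ W2.T + b2     # (16, 3) @ (3, 4) -> (16, 4)

print(tuple(z1.shape), tuple(h.shape), tuple(out.shape))  # (16, 3) (16, 3) (16, 4)
```

The bias vectors have no batch dimension; broadcasting adds the same bias to every row, which is why one set of parameters serves any batch size.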
batch_loss = torch.mean((predictions - Y) ** 2)

MSE loss averaged over ALL 16 images. Each image contributes its error to the total.

EXECUTION STATE
predictions - Y = Shape (16, 4) — error for each pixel of each image
(…) ** 2 = Shape (16, 4) — squared errors
⬆ result: batch_loss = Single scalar — average of all 64 squared errors (16 images × 4 pixels)
single = model(X[11]): single-image forward pass

X[11] is our running example [1,0,1,1]. Feed just this one image through the network.

EXECUTION STATE
X[11] = tensor([1., 0., 1., 1.]), shape (4,)
⬆ single = tensor([0.1, -0.1, -0.05, -0.25]) when the hand-set weights are loaded (a freshly constructed model gives different, random values), shape (4,)
batched = predictions[11]: extract from batch

Row 11 of the batch output. Should match the single-image result.

EXECUTION STATE
⬆ batched = same values as single
torch.allclose(single, batched): verify equality

Confirms both approaches produce the same results up to floating-point tolerance. Batching changes speed, not correctness.

EXECUTION STATE
📚 torch.allclose() = Returns True if all elements are equal within tolerance (defaults: rtol=1e-05, atol=1e-08)
output = True
print(...): predictions shape

Display the output shape.

EXECUTION STATE
output = Predictions shape: torch.Size([16, 4])
print(...): batch loss

The average loss across all 16 images with random weights.

EXECUTION STATE
output = Batch MSE loss: 0.6372 (example value from one run; it changes with each random initialization)
Full listing: batch_forward.py

import torch

# ── Setup (DiagonalFlipNet and make_dataset are defined in the listings above) ──
model = DiagonalFlipNet()  # fresh instance: weights are randomly initialized
X, Y = make_dataset()  # X: (16, 4), Y: (16, 4)

# ── Batch forward pass — all 16 images at once ──
predictions = model(X)
batch_loss = torch.mean((predictions - Y) ** 2)

# ── Verify: batch == single ──
single = model(X[11])
batched = predictions[11]
print(torch.allclose(single, batched))

print(f"Predictions shape: {predictions.shape}")
print(f"Batch MSE loss:    {batch_loss.item():.4f}")

Why batch? On a GPU, all 16 forward passes run as one parallel matrix multiply. For our tiny network the speedup is negligible, but for real networks (millions of parameters, thousands of inputs), batching is essential — it's the difference between minutes and hours.

Quick Check

If X has shape (16, 4) and the network is 4 → 3 → 4, what shape does model(X) produce?


What Changes During Training?

| Component | Fixed or learned? | Details |
| Architecture 4 → 3 → 4 | Fixed | You choose before training |
| Activation (ReLU) | Fixed | You choose before training |
| Weights W1, W2 | Learned | 24 numbers that change every step |
| Biases b1, b2 | Learned | 7 numbers that change every step |
| Training data (X, Y) | Fixed | The examples you provide |
| Forward pass | Fixed | Linear → ReLU → Linear |

Learning = finding the right 31 numbers. The structure, activations, number of layers — all fixed. Training only adjusts the weights and biases.
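The learned quantities can be listed programmatically. The sketch below uses an `nn.Sequential` stand-in with the same 4 → 3 → 4 shape (an assumption made to keep the example self-contained; the parameter names differ from those in `DiagonalFlipNet`).

```python
import torch.nn as nn

# Stand-in with the same layer sizes as the tutorial's network
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 4))

# Every entry here is something training will adjust
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)

# ReLU contributes nothing: it has no parameters
learned = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(learned)  # 31
```

Only the two weight matrices and two bias vectors appear in the listing; the activation and the architecture itself are fixed choices, not parameters.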

Preview of Chapter 8: How do we find those 31 numbers? We compute how the loss changes when we nudge each weight (the gradient), then adjust each weight to reduce the loss. That's backpropagation + gradient descent. We'll trace every gradient by hand, just like we traced every forward pass here.

The Computation Graph

Here's the complete forward pass as a computation graph: a directed acyclic graph (DAG) where each node is an operation and edges show data flow.

[Figure: computation graph of the forward pass for the running example]

Every node in this graph stores its output during the forward pass. In Chapter 8, we'll flow gradients backward through these same edges to compute how each weight should change.


Exercises

  1. Different input. Pick [0, 1, 1, 0] and compute the forward pass by hand, then verify with the PyTorch code.
  2. Change hidden size. Modify DiagonalFlipNet to use 5 hidden neurons. How does the parameter count change?
  3. Remove activation. Comment out F.relu in forward(). The network still produces an output, but training will fail. Why?
  4. Batch vs single. Verify that model(X)[11] equals model(X[11]). Why is batching faster?

Summary

  1. PyTorch mirrors the math. model(x) calls forward(), which runs the same computation we did by hand.
  2. nn.Linear stores weights transposed: shape (out, in), not (in, out). Internally it computes x @ W.T + b.
  3. All 31 parameters verified. Hand calculation, NumPy, and PyTorch produce identical results.
  4. 16 training examples cover every possible 2×2 binary image.
  5. The network currently produces garbage. Training (Chapter 8) will fix this.