Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

Build a neural network with nn.Module — the standard way in PyTorch
Understand nn.Linear's weight convention — why it stores (out, in) not (in, out)
Create a complete training dataset of all 16 possible 2×2 binary images
Run a batch forward pass on the entire dataset at once

The bridge: You did the math by hand. You saw it in NumPy. Now let's see how PyTorch does the same thing — and you'll see the code mirrors the math line for line. Once you trust the code, you'll never need to compute by hand again (but you'll always know what's happening underneath).

PyTorch Implementation

Pay special attention to $\texttt{nn.Linear}$'s weight storage convention: it is the most common source of confusion.

Forward Pass — PyTorch nn.Module

🐍forward_pytorch.py

import torch

PyTorch is the deep learning framework. It provides tensors (like NumPy arrays but with GPU support and automatic differentiation), neural network modules, and optimizers.

EXECUTION STATE

torch = Core library: tensors, autograd, device management. torch.tensor() creates arrays, .backward() computes gradients

import torch.nn as nn

Neural network building blocks: layers (Linear, Conv2d), containers (Module, Sequential), loss functions (MSELoss, CrossEntropyLoss).

EXECUTION STATE

nn.Module = Base class for all neural network models. Provides .parameters(), .forward(), .train()/.eval()

nn.Linear(in, out) = Fully connected layer. Stores weight (out×in) and bias (out,). Computes: output = x @ W.T + b

import torch.nn.functional as F

Stateless functions (no learnable parameters): activations (F.relu, F.softmax), pooling, etc. Use when the operation has no weights to learn.

EXECUTION STATE

F.relu() = Functional ReLU: F.relu(x) = max(0, x) element-wise. Unlike nn.ReLU() (a module), this is a plain function call.

class DiagonalFlipNet(nn.Module):

Our neural network class. Inheriting from nn.Module gives us: automatic parameter tracking, .forward() dispatch, GPU transfer, save/load, and gradient computation for free.

EXECUTION STATE

📚 nn.Module = PyTorch base class for all models. You define layers in __init__ and computation in forward(). PyTorch auto-discovers all nn.Linear/nn.Conv2d layers and tracks their weights.

Architecture = Input(4) → Linear(4→3) → ReLU → Linear(3→4) → Output(4). 31 parameters total: W1(4×3=12) + b1(3) + W2(3×4=12) + b2(4)

def __init__(self):

Constructor. Define all layers here: assigning an nn.Module to an attribute registers its parameters with the model.

super().__init__()

Call nn.Module's constructor. Required — it sets up internal bookkeeping for parameter registration, hooks, and device tracking.

self.layer1 = nn.Linear(4, 3)

First fully-connected layer: 4 inputs → 3 hidden neurons. Creates a weight matrix and bias vector.

EXECUTION STATE

📚 nn.Linear(in_features, out_features) = Creates a layer that computes: output = input @ weight.T + bias

⬇ in_features = 4 = 4 input features (our flattened 2×2 image)

⬇ out_features = 3 = 3 output neurons (hidden layer size)

weight shape = (3, 4) — NOTE: PyTorch stores it transposed! (out, in) not (in, out)

bias shape = (3,) — one bias per output neuron

⚠️ Convention = PyTorch nn.Linear stores W as (out, in). Forward: x @ W.T + b. Our math convention has W as (in, out). Both produce identical results.

self.layer2 = nn.Linear(3, 4)

Second layer: 3 hidden neurons → 4 outputs. Weight shape (4, 3), bias shape (4,).

EXECUTION STATE

weight shape = (4, 3) — 4 output neurons × 3 inputs from hidden layer = 12 weights

bias shape = (4,) — one per output

def forward(self, x):

Defines the forward pass computation. Called when you write model(x). As these operations run, autograd records them to build the computation graph for backpropagation.

EXECUTION STATE

⬇ input: x = A tensor of shape (4,) for single input or (batch, 4) for batched. Example: [1.0, 0.0, 1.0, 1.0]

⬆ returns = Tensor of shape (4,) for a single input, or (batch, 4) for a batch: the network's prediction

h = F.relu(self.layer1(x))

Two operations in one line: (1) self.layer1(x) computes x @ W1.T + b1, (2) F.relu() clamps negatives to zero. This is our hidden layer.

EXECUTION STATE

self.layer1(x) = Calls nn.Linear.forward(): x @ weight.T + bias = [0.0, -0.1, 0.5]

F.relu(…) = max(0, [0.0, -0.1, 0.5]) = [0.0, 0.0, 0.5]

⬆ result: h = [0.0, 0.0, 0.5] — matches our hand calculation!

y_hat = self.layer2(h)

Output layer: h @ W2.T + b2. No activation on the output — raw values for regression.

EXECUTION STATE

self.layer2(h) = [0.0, 0.0, 0.5] @ W2.T + b2 = [0.1, -0.1, -0.05, -0.25]

⬆ result: y_hat = [0.1, -0.1, -0.05, -0.25] — matches!

return y_hat

Return the prediction tensor. PyTorch keeps the computation graph attached so .backward() can compute gradients later.

EXECUTION STATE

⬆ return: y_hat = [0.1, -0.1, -0.05, -0.25]

model = DiagonalFlipNet()

Create an instance of our network. PyTorch randomly initializes all weights. We'll override them next with our specific values.

EXECUTION STATE

model = DiagonalFlipNet( (layer1): Linear(4→3, 15 params) (layer2): Linear(3→4, 16 params) ) — 31 total parameters

with torch.no_grad():

Context manager that disables gradient tracking. We're manually setting weights, not training — so we don't want PyTorch to record these operations in its computation graph.

EXECUTION STATE

📚 torch.no_grad() = Temporarily disables autograd. Operations inside this block run faster and use less memory. Use for inference or manual weight manipulation.

model.layer1.weight.copy_(...): Load W1

Load our hand-calculated weights. CRITICAL: nn.Linear stores weight as (out_features, in_features), so what we load is W1 transposed. Each row = one hidden neuron's weights.

EXECUTION STATE

⚠️ Shape convention = Our W1 is (4,3) — rows=inputs, cols=neurons. PyTorch layer1.weight is (3,4) — rows=neurons, cols=inputs. So we load the TRANSPOSED version.

layer1.weight (3×4) =

      x0    x1    x2    x3
h0   0.2  -0.3   0.1  -0.4
h1  -0.5   0.4   0.3   0.2
h2   0.1  -0.2   0.5  -0.1

model.layer1.bias.copy_(...): Load b1

Load hidden layer biases. Shape (3,) — one per hidden neuron.

EXECUTION STATE

layer1.bias = [0.1, -0.1, 0.0]

model.layer2.weight.copy_(...): Load W2

Load output layer weights. Again transposed: our W2 is (3,4), PyTorch stores (4,3). Each row = one output neuron.

EXECUTION STATE

layer2.weight (4×3) =

      h0    h1    h2
y0   0.3  -0.1   0.2
y1  -0.2   0.5  -0.4
y2   0.4  -0.3   0.1
y3   0.1   0.2  -0.5

model.layer2.bias.copy_(...): Load b2

Load output biases. Shape (4,) — one per output neuron.

EXECUTION STATE

layer2.bias = [0.0, 0.1, -0.1, 0.0]

x = torch.tensor([1.0, 0.0, 1.0, 1.0])

Our input image [1,0,1,1] as a PyTorch tensor. Uses float32 by default.

EXECUTION STATE

x = tensor([1., 0., 1., 1.]) — shape (4,)

y_hat = model(x)

THE FORWARD PASS. Calling model(x) invokes model.forward(x) which runs: layer1 → ReLU → layer2. PyTorch builds a computation graph for backprop.

EXECUTION STATE

📚 model(x) = Invokes nn.Module.__call__, which runs any registered hooks and then calls model.forward(x). Prefer model(x) over calling model.forward(x) directly, so hooks are not skipped.

⬆ result: y_hat = tensor([0.1000, -0.1000, -0.0500, -0.2500]) — matches hand calculation!

target = torch.tensor([1.0, 1.0, 0.0, 1.0])

The diagonal flip of [1,0,1,1]. This is what the network should output.

EXECUTION STATE

target = tensor([1., 1., 0., 1.])

loss = torch.mean((y_hat - target) ** 2)

MSE loss computed manually. In production you'd use nn.MSELoss(), but this shows what it does under the hood.

EXECUTION STATE

y_hat - target = [-0.9, -1.1, -0.05, -1.25]

(…) ** 2 = [0.81, 1.21, 0.0025, 1.5625]

⬆ result: loss = tensor(0.8963) — same as NumPy!

print(f"Prediction: {y_hat.tolist()}")

.tolist() converts tensor to Python list for cleaner printing.

EXECUTION STATE

output = Prediction: [0.1, -0.1, -0.05, -0.25]

print(f"Target: {target.tolist()}")

Display target values.

EXECUTION STATE

output = Target: [1.0, 1.0, 0.0, 1.0]

print(f"MSE Loss: {loss.item():.4f}")

.item() extracts a Python scalar from a 0-dimensional tensor. Use this for printing single values.

EXECUTION STATE

📚 .item() = Converts a single-element tensor to a plain Python number (here a float), detached from the tensor and its computation graph.

output = MSE Loss: 0.8963

print(...): total parameter count

Count all learnable parameters. model.parameters() yields each weight/bias tensor, .numel() counts elements.

EXECUTION STATE

📚 .parameters() = Generator that yields all learnable tensors: layer1.weight(3×4=12), layer1.bias(3), layer2.weight(4×3=12), layer2.bias(4)

📚 .numel() = Number of elements in a tensor. tensor([1,2,3]).numel() = 3

output = Parameters: 31 — matches our count!

import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Define the network ──
class DiagonalFlipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(4, 3)
        self.layer2 = nn.Linear(3, 4)

    def forward(self, x):
        h = F.relu(self.layer1(x))
        y_hat = self.layer2(h)
        return y_hat

# ── Create model & load our weights ──
model = DiagonalFlipNet()

with torch.no_grad():
    model.layer1.weight.copy_(torch.tensor([
        [ 0.2, -0.3,  0.1, -0.4],
        [-0.5,  0.4,  0.3,  0.2],
        [ 0.1, -0.2,  0.5, -0.1],
    ]))
    model.layer1.bias.copy_(
        torch.tensor([0.1, -0.1, 0.0]))
    model.layer2.weight.copy_(torch.tensor([
        [ 0.3, -0.1,  0.2],
        [-0.2,  0.5, -0.4],
        [ 0.4, -0.3,  0.1],
        [ 0.1,  0.2, -0.5],
    ]))
    model.layer2.bias.copy_(
        torch.tensor([0.0, 0.1, -0.1, 0.0]))

# ── Forward pass ──
x = torch.tensor([1.0, 0.0, 1.0, 1.0])
y_hat = model(x)
target = torch.tensor([1.0, 1.0, 0.0, 1.0])
loss = torch.mean((y_hat - target) ** 2)

print(f"Prediction: {y_hat.tolist()}")
print(f"Target:     {target.tolist()}")
print(f"MSE Loss:   {loss.item():.4f}")
print(f"Parameters: {sum(p.numel() for p in model.parameters())}")
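To check that the listing really does mirror the hand calculation, the same forward pass can be written with plain Python lists and no framework at all. This is a minimal sketch: the helpers `linear` and `relu` are illustrative names, not PyTorch functions, and the weights are copied from the listing above in PyTorch's (out, in) layout.

```python
# Weights in (out, in) layout: each row holds one neuron's input weights.
W1 = [[ 0.2, -0.3, 0.1, -0.4],
      [-0.5,  0.4, 0.3,  0.2],
      [ 0.1, -0.2, 0.5, -0.1]]
b1 = [0.1, -0.1, 0.0]
W2 = [[ 0.3, -0.1,  0.2],
      [-0.2,  0.5, -0.4],
      [ 0.4, -0.3,  0.1],
      [ 0.1,  0.2, -0.5]]
b2 = [0.0, 0.1, -0.1, 0.0]

def linear(x, W, b):
    """Illustrative helper: one dot product per row of W, plus that neuron's bias."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]

def relu(v):
    """Illustrative helper: clamp negatives to zero, element by element."""
    return [max(0.0, v_i) for v_i in v]

x = [1.0, 0.0, 1.0, 1.0]
target = [1.0, 1.0, 0.0, 1.0]

h = relu(linear(x, W1, b1))        # hidden layer
y_hat = linear(h, W2, b2)          # output layer, no activation
loss = sum((y - t) ** 2 for y, t in zip(y_hat, target)) / len(target)

print([round(v, 4) for v in h])
print([round(v, 4) for v in y_hat])
print(round(loss, 5))
```

Up to floating-point rounding, the printed values should agree with the hidden activations, prediction, and MSE shown in the walkthrough.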

The nn.Linear Weight Convention

This is the #1 gotcha for beginners. Our math convention and PyTorch's convention are transposed:

Convention	$\mathbf{W}_1$ shape	Computation	Used by
Our math	$(4, 3)$	$\mathbf{x} \cdot \mathbf{W} + \mathbf{b}$	NumPy, textbooks
PyTorch	$(3, 4)$	$\mathbf{x} \cdot \mathbf{W}^T + \mathbf{b}$	$\texttt{nn.Linear}$

Both produce identical results. PyTorch stores $\mathbf{W}^T$ directly because its internal computation is $\texttt{x @ weight.T + bias}$. When you print $\texttt{model.layer1.weight.shape}$ and see $\texttt{torch.Size([3, 4])}$, that's (out, in): each row is one neuron's input weights.

Quick Check

PyTorch's nn.Linear(4, 3) stores its weight tensor with shape...

Creating the Training Dataset

Our 2×2 binary images have 4 pixels, each 0 or 1 — that's $2^4 = 16$ possible images. We can enumerate the entire input space:

Training Dataset — All 16 Images

🐍dataset.py

import torch

PyTorch for tensor operations.

def make_dataset():

Generates ALL possible 2×2 binary images (2⁴ = 16) and their diagonal flips. This is our complete training set — one of the few problems where we can enumerate the entire input space.

EXECUTION STATE

⬆ returns = (X, Y) — two tensors of shape (16, 4). X = inputs, Y = targets

Total examples = 16 — every possible combination of 4 binary pixels

for i in range(16):

Loop through numbers 0-15. Each number encodes one of the 16 possible 2×2 binary images via its 4-bit binary representation.

LOOP TRACE · 4 of 16 iterations

i=0 → 0000 → [0,0,0,0]

image = [0,0] / [0,0] → all black

i=6 → 0110 → [0,1,1,0]

image = [0,1] / [1,0] → anti-diagonal

i=11 → 1011 → [1,0,1,1]

image = [1,0] / [1,1] → our example!

i=15 → 1111 → [1,1,1,1]

image = [1,1] / [1,1] → all white

bits = [(i >> 3) & 1, ...]: Binary decomposition

Extract the 4 bits of i using right-shift and bitwise AND. i >> 3 gives the most significant bit, i & 1 gives the least significant bit.

EXECUTION STATE

📚 >> (right shift) = Shifts bits right. 11 >> 3 = 1 (binary 1011 → 0001). Extracts higher bits.

📚 & 1 (bitwise AND) = Keeps only the last bit. 6 & 1 = 0, 7 & 1 = 1. Tests if a number is odd.

Example: i=11 (binary 1011) = 11>>3 & 1 = 1, 11>>2 & 1 = 0, 11>>1 & 1 = 1, 11 & 1 = 1 → [1,0,1,1]

img = torch.tensor(bits, dtype=torch.float32)

Convert the 4 bits to a float tensor — the network input.

EXECUTION STATE

img = tensor([1., 0., 1., 1.]) for i=11

flipped = img.reshape(2,2).T.reshape(-1)

The diagonal flip in one line: reshape to 2×2, transpose (.T swaps rows/cols = diagonal flip), flatten back to 1D.

EXECUTION STATE

.reshape(2,2) = [1,0,1,1] → [[1,0],[1,1]]

.T (transpose) = [[1,0],[1,1]] → [[1,1],[0,1]]

.reshape(-1) = [[1,1],[0,1]] → [1,1,0,1]

⬆ result: flipped = tensor([1., 1., 0., 1.]) — our target!

return torch.stack(inputs), torch.stack(targets)

Stack the 16 individual tensors into batch tensors.

EXECUTION STATE

📚 torch.stack() = Combines a list of tensors into one tensor with a new first dimension. 16 tensors of shape (4,) → one tensor of shape (16, 4).

⬆ return: X = Shape (16, 4) — 16 input images

⬆ return: Y = Shape (16, 4) — 16 target flips

X, Y = make_dataset()

Generate the complete dataset. 16 input-output pairs covering every possible 2×2 binary image.

EXECUTION STATE

X = tensor of shape (16, 4) — all 16 possible inputs

Y = tensor of shape (16, 4) — corresponding diagonal flips

print dataset info

Display the dataset shape.

EXECUTION STATE

output = Dataset: 16 samples, input shape torch.Size([16, 4]), target shape torch.Size([16, 4])

for i in [0, 6, 11, 15]:

Show 4 representative examples: all-black, anti-diagonal, our running example, all-white.

LOOP TRACE · 4 iterations

i=0: all-black

output = [[0,0]] → [[0,0]] / [[0,0]]    [[0,0]]

i=6: anti-diagonal

output = [[0,1]] → [[0,1]] / [[1,0]]    [[1,0]]

i=11: our example

output = [[1,0]] → [[1,1]] / [[1,1]]    [[0,1]]

i=15: all-white

output = [[1,1]] → [[1,1]] / [[1,1]]    [[1,1]]

import torch

def make_dataset():
    """Generate all 16 possible 2x2 binary images
    with their diagonal-flipped targets."""
    inputs, targets = [], []
    for i in range(16):
        bits = [(i >> 3) & 1, (i >> 2) & 1,
                (i >> 1) & 1, i & 1]
        img = torch.tensor(bits, dtype=torch.float32)
        flipped = img.reshape(2, 2).T.reshape(-1)
        inputs.append(img)
        targets.append(flipped)
    return torch.stack(inputs), torch.stack(targets)

X, Y = make_dataset()
print(f"Dataset: {X.shape[0]} samples, "
      f"input shape {X.shape}, "
      f"target shape {Y.shape}")

# Show a few examples
for i in [0, 6, 11, 15]:
    inp = X[i].tolist()
    tgt = Y[i].tolist()
    print(f"  [{inp[:2]}] → [{tgt[:2]}]")
    print(f"  [{inp[2:]}]    [{tgt[2:]}]")

Only 16 samples! This is rare: in real ML the input space is far too large to enumerate (ImageNet's roughly 14 million images are only a sample of all possible photos). But for learning, a complete dataset lets us check the network's output on every possible input, so no part of its behavior is left untested.
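The enumeration can also be cross-checked without tensors. The sketch below is an alternative to the listing, not a replacement: it uses `itertools.product` to generate the same 16 images in the same order, and a hypothetical helper `diagonal_flip` that swaps the two off-diagonal pixels.

```python
from itertools import product

# All 4-pixel binary images, in the same order as range(16) counted in binary.
images = [list(bits) for bits in product([0, 1], repeat=4)]

def diagonal_flip(img):
    """Transpose a flattened 2x2 image: [a, b, c, d] -> [a, c, b, d]."""
    a, b, c, d = img
    return [a, c, b, d]

targets = [diagonal_flip(img) for img in images]

print(len(images))                      # 16
print(images[11], "->", targets[11])    # [1, 0, 1, 1] -> [1, 1, 0, 1]

# Flipping twice must give back the original image.
assert all(diagonal_flip(diagonal_flip(img)) == img for img in images)

# Images whose off-diagonal pixels match are unchanged by the flip.
unchanged = [img for img in images if diagonal_flip(img) == img]
print(len(unchanged))                   # 8
```

Half of the dataset is symmetric about the diagonal, so for those 8 images the target equals the input; the network only has to move pixels for the other 8.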

Batch Forward Pass

Instead of feeding images one at a time, we stack all 16 into a matrix and process them simultaneously. The batch dimension (16) rides along through every matrix multiplication — the math is identical, just applied to all images in parallel:

Batch Forward Pass — All 16 Images at Once

🐍batch_forward.py

import torch

PyTorch for tensor operations.

model = DiagonalFlipNet()

A fresh instance of the same architecture from the PyTorch section above. Its weights are randomly initialized, not the hand-loaded values.

EXECUTION STATE

model = Input(4) → Linear(4→3) → ReLU → Linear(3→4) → Output(4)

X, Y = make_dataset(): All 16 images

X contains all 16 possible 2×2 binary images stacked as rows. Y contains their diagonal flips.

EXECUTION STATE

⬇ X shape = (16, 4) — 16 images, each flattened to 4 pixels

⬇ Y shape = (16, 4) — 16 target flips

predictions = model(X): THE BATCH FORWARD PASS

Feed ALL 16 images through the network in one call. The batch dimension (16) rides along through every operation: (16,4) @ (4,3) = (16,3) → ReLU → (16,3) @ (3,4) = (16,4).

EXECUTION STATE

⬇ X = Shape (16, 4) — 16 input images

Layer 1: X @ W1.T + b1 = (16, 4) @ (4, 3) = (16, 3) — 16 sets of 3 hidden values

ReLU = (16, 3) → (16, 3) — element-wise, shape unchanged

Layer 2: h @ W2.T + b2 = (16, 3) @ (3, 4) = (16, 4) — 16 output predictions

⬆ result: predictions = Shape (16, 4) — one prediction per image

batch_loss = torch.mean((predictions - Y) ** 2)

MSE loss averaged over ALL 16 images. Each image contributes its error to the total.

EXECUTION STATE

predictions - Y = Shape (16, 4) — error for each pixel of each image

(…) ** 2 = Shape (16, 4) — squared errors

⬆ result: batch_loss = Single scalar — average of all 64 squared errors (16 images × 4 pixels)

single = model(X[11]): Single image forward pass

X[11] is our running example [1,0,1,1]. Feed just this one image through the network.

EXECUTION STATE

X[11] = tensor([1., 0., 1., 1.]) — shape (4,)

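⬆ single = tensor of shape (4,). With this freshly created model the values depend on the random initialization; with the hand-loaded weights from earlier they would be [0.1, -0.1, -0.05, -0.25].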

batched = predictions[11]: Extract from batch

Row 11 of the batch output. Should be identical to the single-image result.

EXECUTION STATE

⬆ batched = tensor of shape (4,), the same values as single

torch.allclose(single, batched): Verify equality

Confirms both approaches produce the same results up to floating-point tolerance. Batching changes speed, not correctness.

EXECUTION STATE

📚 torch.allclose() = Returns True if all elements are equal within tolerance (defaults: rtol=1e-5, atol=1e-8)

output = True

print predictions shape

Display the output shape.

EXECUTION STATE

output = Predictions shape: torch.Size([16, 4])

print batch loss

The average loss across all 16 images with random weights.

EXECUTION STATE

output = Batch MSE loss: 0.6372 (one sample run; the value changes with each random initialization)

import torch

# ── Setup (reusing DiagonalFlipNet and make_dataset from above) ──
model = DiagonalFlipNet()  # fresh instance: random weights
X, Y = make_dataset()  # X: (16, 4), Y: (16, 4)

# ── Batch forward pass — all 16 images at once ──
predictions = model(X)
batch_loss = torch.mean((predictions - Y) ** 2)

# ── Verify: batch == single ──
single = model(X[11])
batched = predictions[11]
print(torch.allclose(single, batched))

print(f"Predictions shape: {predictions.shape}")
print(f"Batch MSE loss:    {batch_loss.item():.4f}")

Why batch? On a GPU, all 16 forward passes run as one parallel matrix multiply. For our tiny network the speedup is negligible, but for real networks (millions of parameters, thousands of inputs), batching is essential — it's the difference between minutes and hours.
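To make the shape bookkeeping concrete, the sketch below spells out the two matrix multiplications on the full batch, compares the result with a row-by-row loop, and checks the manual MSE against `F.mse_loss`. It is self-contained, so it rebuilds the dataset inline and uses two bare `nn.Linear` layers with random weights; the seed is arbitrary and only there for repeatability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability

layer1 = nn.Linear(4, 3)
layer2 = nn.Linear(3, 4)

# All 16 binary images as rows; targets swap the two off-diagonal pixels.
X = torch.tensor([[(i >> k) & 1 for k in (3, 2, 1, 0)] for i in range(16)],
                 dtype=torch.float32)
Y = X[:, [0, 2, 1, 3]]

with torch.no_grad():
    z1 = X @ layer1.weight.T + layer1.bias    # (16, 4) @ (4, 3) -> (16, 3)
    h = F.relu(z1)                            # element-wise, still (16, 3)
    out = h @ layer2.weight.T + layer2.bias   # (16, 3) @ (3, 4) -> (16, 4)

    # The same computation, one image at a time.
    rows = torch.stack([layer2(F.relu(layer1(x))) for x in X])

    manual_mse = torch.mean((out - Y) ** 2)
    builtin_mse = F.mse_loss(out, Y)          # default reduction is the mean

print(z1.shape, out.shape)
print(torch.allclose(out, rows, atol=1e-6))
print(torch.allclose(manual_mse, builtin_mse))
```

The bias vectors have shapes (3,) and (4,), so they broadcast across the 16 rows; no per-image copy of the bias is needed.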

Quick Check

If X has shape (16, 4) and the network is 4 → 3 → 4, what shape does model(X) produce?

What Changes During Training?

Component	Fixed or Learned?	Details
Architecture $4 \to 3 \to 4$	Fixed	You choose before training
Activation (ReLU)	Fixed	You choose before training
Weights $\mathbf{W}_1, \mathbf{W}_2$	Learned	24 numbers that change every step
Biases $\mathbf{b}_1, \mathbf{b}_2$	Learned	7 numbers that change every step
Training data $(\mathbf{X}, \mathbf{Y})$	Fixed	The examples you provide
Forward pass	Fixed	$\text{Linear} \to \text{ReLU} \to \text{Linear}$

Learning = finding the right 31 numbers. The structure, activations, number of layers — all fixed. Training only adjusts the weights and biases.
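The count of 31 follows from the layer sizes alone. As a small sketch, the hypothetical helper `count_parameters` below adds up weights and biases for any stack of fully connected layers.

```python
def count_parameters(layer_sizes):
    """Illustrative helper: weights plus biases for consecutive fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out   # weight matrix entries
        total += n_out          # one bias per output neuron
    return total

print(count_parameters([4, 3, 4]))   # 4*3 + 3 + 3*4 + 4 = 31
```

Passing a different list of sizes shows how quickly the count grows when a layer is widened.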

Preview of Chapter 8: How do we find those 31 numbers? We compute how the loss changes when we nudge each weight (the gradient), then adjust each weight to reduce the loss. That's backpropagation + gradient descent. We'll trace every gradient by hand, just like we traced every forward pass here.
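The idea of nudging a weight can be tried right now, before any backpropagation. The sketch below is a finite-difference estimate in plain Python, not the method Chapter 8 develops: it raises the first output bias by a small `eps` (an arbitrary choice) and measures how the loss on the running example responds. The helper names are illustrative.

```python
# Weights from the running example, in (out, in) layout.
W1 = [[0.2, -0.3, 0.1, -0.4], [-0.5, 0.4, 0.3, 0.2], [0.1, -0.2, 0.5, -0.1]]
b1 = [0.1, -0.1, 0.0]
W2 = [[0.3, -0.1, 0.2], [-0.2, 0.5, -0.4], [0.4, -0.3, 0.1], [0.1, 0.2, -0.5]]
b2 = [0.0, 0.1, -0.1, 0.0]
x = [1.0, 0.0, 1.0, 1.0]
target = [1.0, 1.0, 0.0, 1.0]

def linear(v, W, b):
    """Illustrative helper: one dot product per row of W, plus the bias."""
    return [sum(w * v_i for w, v_i in zip(row, v)) + b_j
            for row, b_j in zip(W, b)]

def loss_with(b2_values):
    """MSE of the forward pass, with the output biases passed in so one can be nudged."""
    h = [max(0.0, z) for z in linear(x, W1, b1)]
    y_hat = linear(h, W2, b2_values)
    return sum((y - t) ** 2 for y, t in zip(y_hat, target)) / len(target)

eps = 1e-4
nudged = [b2[0] + eps] + b2[1:]
slope = (loss_with(nudged) - loss_with(b2)) / eps

print(round(slope, 3))   # about -0.45: raising this bias lowers the loss
```

The estimate agrees with the direct calculation $2(\hat{y}_0 - y_0)/4 = 2(0.1 - 1)/4 = -0.45$. Repeating this for all 31 parameters would need 31 extra forward passes, which is why an exact and cheaper method is worth learning.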

The Computation Graph

Here's the complete forward pass as a computation graph: a directed acyclic graph (DAG) where each node is an operation and edges show data flow.

[Figure: computation graph of the forward pass, with node values from the running example]

Every node in this graph stores its output during the forward pass. In Chapter 8, we'll flow gradients backward through these same edges to compute how each weight should change.
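PyTorch exposes this graph through each tensor's `grad_fn` attribute. The sketch below uses two bare `nn.Linear` layers with random weights, since only the graph matters here and not the values, and contrasts a normal forward pass with one run under `torch.no_grad()`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layer1 = nn.Linear(4, 3)
layer2 = nn.Linear(3, 4)
x = torch.tensor([1.0, 0.0, 1.0, 1.0])
target = torch.tensor([1.0, 1.0, 0.0, 1.0])

# Normal forward pass: autograd records each operation as it runs.
y_hat = layer2(F.relu(layer1(x)))
loss = torch.mean((y_hat - target) ** 2)
print(y_hat.requires_grad)        # True
print(loss.grad_fn is not None)   # True

# Inside no_grad nothing is recorded, so there is no graph to walk back through.
with torch.no_grad():
    y_no_graph = layer2(F.relu(layer1(x)))
print(y_no_graph.requires_grad)   # False
print(y_no_graph.grad_fn)         # None
```

This is why the weight-loading code earlier ran inside `torch.no_grad()`: copying values into parameters is bookkeeping, not part of the computation to differentiate.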

Exercises

Different input. Pick $[0, 1, 1, 0]$ and compute the forward pass by hand, then verify with the PyTorch code.
Change hidden size. Modify $\texttt{DiagonalFlipNet}$ to use 5 hidden neurons. How does the parameter count change?
Remove activation. Comment out F.relu in $\texttt{forward()}$ . The output still exists but training will fail — why?
Batch vs single. Verify that $\texttt{model(X)[11]}$ equals $\texttt{model(X[11])}$ . Why is batching faster?

Summary

PyTorch mirrors the math. $\texttt{model(x)}$ calls $\texttt{forward()}$ which runs the same computation we did by hand.
nn.Linear stores weights transposed — shape (out, in) not (in, out). Internally computes $\texttt{x @ W.T + b}$ .
All 31 parameters verified. Hand calculation, NumPy, and PyTorch produce identical results.
16 training examples cover every possible 2×2 binary image.
The network currently produces garbage. Training (Chapter 8) will fix this.