Chapter 8

Implementing Backprop from Scratch

Learning Objectives

By the end of this section, you will be able to:

  1. Implement automatic differentiation from scratch using a simple Value class that tracks gradients
  2. Build forward and backward passes for arbitrary computational graphs
  3. Construct neurons, layers, and networks using your own autograd engine
  4. Train a neural network using gradient descent on your from-scratch implementation
  5. Understand how PyTorch's autograd relates to your implementation
  6. Debug gradient computations and avoid common implementation mistakes
Why This Matters: Every deep learning framework—PyTorch, TensorFlow, JAX—implements automatic differentiation at its core. By building it yourself, you'll understand exactly what happens when you call loss.backward(), making you a more effective practitioner and debugger.

The Big Picture

In the previous section, we derived the mathematics of backpropagation. Now we'll bring those equations to life by implementing them in code. Our goal is to build a miniature version of PyTorch's autograd system—a system that can automatically compute gradients for any computation we define.

What We're Building

We'll implement micrograd—a tiny automatic differentiation engine inspired by Andrej Karpathy's famous educational project. Despite being only ~100 lines of code, it captures the essence of how modern deep learning frameworks work.

Component | Purpose | Key Methods
Value | Wraps a scalar, stores gradient | __add__, __mul__, backward()
Neuron | Weighted sum + activation | __call__(x)
Layer | Collection of neurons | __call__(x)
MLP | Multi-layer perceptron | __call__(x), parameters()

The Core Insight

The key insight is that every mathematical operation knows two things:

  1. How to compute its output (forward pass): given inputs, produce output
  2. How to propagate gradients backward (backward pass): given \frac{\partial L}{\partial \text{output}}, compute \frac{\partial L}{\partial \text{inputs}}

If every operation knows these two things, we can chain them together to compute gradients through arbitrarily complex computations.
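Before wrapping anything in a class, this pairing can be sketched with plain Python functions. This is only an illustration of the idea (the function names here are invented for the sketch, not part of the engine we build below): each operation is a forward function plus a backward function, chained by hand for L = (a*b) + c.

```python
# Each operation as a (forward, backward) pair, using plain floats.
# forward: inputs -> output; backward: dL/d(output) -> dL/d(inputs).

def add_forward(a, b):
    return a + b

def add_backward(dout):
    return dout, dout          # d(a+b)/da = d(a+b)/db = 1

def mul_forward(a, b):
    return a * b

def mul_backward(a, b, dout):
    return b * dout, a * dout  # d(a*b)/da = b, d(a*b)/db = a

# Chain them by hand for L = (a*b) + c:
a, b, c = 2.0, 3.0, 4.0
ab = mul_forward(a, b)         # forward pass: ab = 6.0
L = add_forward(ab, c)         # L = 10.0

dab, dc = add_backward(1.0)    # backward pass, seeded with dL/dL = 1
da, db = mul_backward(a, b, dab)
# da = 3.0, db = 2.0, dc = 1.0 -- exactly what the chain rule predicts
```

An autograd engine automates precisely this bookkeeping: remembering which backward function pairs with which forward step, and in what order to call them.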


Building Blocks: The Value Class

Our Value class is the fundamental building block. It wraps a scalar number and tracks:

  • data: The actual numerical value
  • grad: The gradient \frac{\partial L}{\partial \text{this value}}, initially 0
  • _backward: A function that propagates gradients to inputs
  • _prev: The set of values that produced this value (for topological sort)

The Gradient Equation

For each operation, the backward function implements the chain rule. For example, if c = a + b, then:

\frac{\partial L}{\partial a} = \frac{\partial L}{\partial c} \cdot \frac{\partial c}{\partial a} = \frac{\partial L}{\partial c} \cdot 1
\frac{\partial L}{\partial b} = \frac{\partial L}{\partial c} \cdot \frac{\partial c}{\partial b} = \frac{\partial L}{\partial c} \cdot 1

Since \frac{\partial c}{\partial a} = \frac{\partial c}{\partial b} = 1 for addition, both inputs receive the output's gradient unchanged. This is why gradients "add up" at branches in the computational graph.


Interactive: Forward & Backward Pass

Let's visualize a complete forward and backward pass through a simple neuron. Step through each stage to see how values flow forward and gradients flow backward.

Backpropagation Step-by-Step

[Interactive diagram: inputs x₁ = 2, x₂ = 3 with weights w₁ = 0.5, w₂ = -0.5 and bias b = 0.1, target y = 1; the forward pass gives ŷ ≈ 0.401 and loss L ≈ 0.179.]

Initial State

Our simple neural network: two inputs weighted, summed with bias, passed through sigmoid, and compared to target.


Notice the Pattern

During the forward pass, we compute values left-to-right. During the backward pass, we compute gradients right-to-left. The gradient at each node tells us: "If I increase this value by a tiny amount, how much does the loss change?"

Implementing the Value Class

Let's implement our automatic differentiation engine step by step:

The Value Class: Our Autograd Engine
🐍micrograd/engine.py
Constructor: Each Value stores its numerical data, its gradient (initially 0), a backward function, and references to the input values that created it. The _prev set is used for topological sorting.

Gradient Initialization: Gradients start at 0 and accumulate during the backward pass; the += in the backward functions handles the case where a value is used multiple times.

Backward Function: The default lambda: None is replaced by each operation. Each replacement implements one step of the chain rule: multiply the incoming gradient by the local derivative and pass the result to the inputs.

Addition Backward: For c = a + b: ∂c/∂a = ∂c/∂b = 1, so both inputs receive the output's gradient unchanged. Example: if a is used in two additions, its gradient is the sum of the gradients from both paths.

Multiplication Backward: For c = a * b: ∂c/∂a = b and ∂c/∂b = a, so each input's gradient is the other input's value times the output gradient. Example: for a = 3, b = 4: ∂c/∂a = 4 and ∂c/∂b = 3.

Power Backward: For c = a^n: ∂c/∂a = n · a^(n-1). This also enables division, since a / b = a · b^(-1).

Tanh Backward: For c = tanh(a): ∂c/∂a = 1 - tanh(a)². We store t = tanh(a) during the forward pass and reuse it in backward for efficiency.

Topological Sort: backward() builds a topological ordering of the computational graph. This ensures nodes are processed in the correct order: each node only after all the nodes that depend on it.

Seed Gradient: The output's gradient is set to 1.0 (∂L/∂L = 1); every other gradient is then computed by chain-rule multiplication from there.

Backward Pass: Nodes are processed in reverse topological order; each calls its _backward function to propagate gradients to its inputs.
import math

class Value:
    """Stores a scalar value and its gradient.

    This is the building block of our autograd engine.
    Each Value knows how to propagate gradients backward.
    """

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0  # Derivative of loss w.r.t. this value

        # Internal variables for autograd graph
        self._backward = lambda: None  # Function to propagate gradients
        self._prev = set(_children)    # Input values that created this
        self._op = _op                 # Operation that created this
        self.label = label             # For visualization

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"

    def __add__(self, other):
        """Addition: c = a + b"""
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1
            # Use += because a value might be used multiple times
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        """Multiplication: c = a * b"""
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def __pow__(self, other):
        """Power: c = a ** n (n is a constant)"""
        assert isinstance(other, (int, float))
        out = Value(self.data ** other, (self,), f'**{other}')

        def _backward():
            # d(a^n)/da = n * a^(n-1)
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward

        return out

    def tanh(self):
        """Hyperbolic tangent activation"""
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')

        def _backward():
            # d(tanh(x))/dx = 1 - tanh(x)^2
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward

        return out

    def exp(self):
        """Exponential function"""
        out = Value(math.exp(self.data), (self,), 'exp')

        def _backward():
            # d(e^x)/dx = e^x
            self.grad += out.data * out.grad
        out._backward = _backward

        return out

    def backward(self):
        """Compute gradients for all values in the graph.

        Uses reverse-mode automatic differentiation (backpropagation).
        """
        # Build topological order of all nodes
        topo = []
        visited = set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)

        build_topo(self)

        # Start with gradient of 1.0 for the output
        self.grad = 1.0

        # Propagate gradients in reverse topological order
        for node in reversed(topo):
            node._backward()

    # Operator overloading for convenience
    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)

    def __truediv__(self, other):
        return self * other ** -1

    def __radd__(self, other):
        return self + other

    def __rmul__(self, other):
        return self * other

Quick Check

In the Value class, why do we use += instead of = when updating gradients in the backward functions?


Building Neurons and Layers

Now we can build neural network components using our Value class:

Neural Network Components
🐍micrograd/nn.py
Weight Initialization: Weights are initialized uniformly at random in [-1, 1]. Production code would use better schemes (Xavier, He), but this works for small networks.

Forward Pass: The neuron computes a weighted sum of its inputs (w · x + b), then applies the tanh activation. The sum() call uses Value's __add__, which builds the computation graph. Example: if x = [1, 2], w = [0.3, -0.4], b = 0.1, then act = 0.3·1 + (-0.4)·2 + 0.1 = -0.4.

Parameters Method: Returns all learnable parameters; the optimizer uses this to know which values to update during training.

Layer Forward: Each neuron in the layer processes the same input independently. If the layer has a single neuron, it returns a scalar; otherwise it returns a list.

MLP Architecture: The MLP chains layers together. MLP(2, [4, 4, 1]) creates a network with 2 inputs, two hidden layers of 4 neurons each, and 1 output: Layer 1 is 2→4, Layer 2 is 4→4, Layer 3 is 4→1.

MLP Forward: Data flows through the layers sequentially; each layer's output becomes the next layer's input. This is function composition.

Parameter Count: Each Layer(nin, nout) has nout·(nin + 1) parameters. For MLP(2, [4, 4, 1]): Layer 1 has 4·(2+1) = 12, Layer 2 has 4·(4+1) = 20, Layer 3 has 1·(4+1) = 5, for a total of 37.
import random

from micrograd.engine import Value

class Neuron:
    """A single neuron: weighted sum of inputs + bias, then activation."""

    def __init__(self, nin):
        # Random initialization of weights and bias
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # w · x + b
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

    def parameters(self):
        return self.w + [self.b]


class Layer:
    """A layer of neurons."""

    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]


class MLP:
    """Multi-Layer Perceptron."""

    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]


# Example: Create a 3-layer MLP
# Input: 2 features, Hidden: 4 neurons, Hidden: 4 neurons, Output: 1
model = MLP(2, [4, 4, 1])
print(f"Number of parameters: {len(model.parameters())}")
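The parameter count printed above can be verified with quick arithmetic: each Layer(nin, nout) holds nout neurons, each with nin weights plus one bias. The helper n_params below is just for this check, not part of the library:

```python
# Hand-check of the parameter count for MLP(2, [4, 4, 1]):
# each Layer(nin, nout) contributes nout * (nin + 1) parameters.
def n_params(nin, nouts):
    sizes = [nin] + nouts
    return sum(sizes[i + 1] * (sizes[i] + 1) for i in range(len(nouts)))

print(n_params(2, [4, 4, 1]))  # 12 + 20 + 5 = 37
```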

The Training Loop

Now we can train our network! The training loop has three main steps:

Complete Training Loop
🐍train.py
Training Data: XOR is the classic test case: it is not linearly separable, so a single perceptron cannot learn it; a hidden layer is required.

Forward Pass: Pass each input through the model. This builds a fresh computational graph for each sample; the loss at the end connects all of these graphs.

Loss Function: Sum of squared errors: L = Σ(ŷᵢ - yᵢ)². The sum connects all predictions into one computational graph rooted at the loss.

Zero Gradients: Critical: gradients accumulate with +=, so they must be reset to zero before each backward pass. Forgetting this is a common bug; without zeroing, gradients grow without bound across iterations.

Backward Pass: loss.backward() traverses the computational graph in reverse, calling each node's _backward function to compute gradients.

Parameter Update: Gradient descent moves each parameter opposite to its gradient; the learning rate controls the step size. Example: if ∂L/∂w = 0.5 and lr = 0.1, then w_new = w - 0.1 · 0.5.
from micrograd.engine import Value
from micrograd.nn import MLP

# Training data: XOR problem
xs = [
    [Value(0.0), Value(0.0)],
    [Value(0.0), Value(1.0)],
    [Value(1.0), Value(0.0)],
    [Value(1.0), Value(1.0)],
]
ys = [Value(0.0), Value(1.0), Value(1.0), Value(0.0)]  # XOR targets

# Create model
model = MLP(2, [4, 4, 1])

# Training loop
learning_rate = 0.1
for epoch in range(100):

    # STEP 1: Forward pass - compute predictions
    ypred = [model(x) for x in xs]

    # STEP 2: Compute loss (sum of squared errors)
    loss = sum((yp - yt) ** 2 for yp, yt in zip(ypred, ys))

    # STEP 3: Backward pass - compute gradients
    # First, zero out gradients from previous iteration
    for p in model.parameters():
        p.grad = 0.0

    # Then, compute new gradients
    loss.backward()

    # STEP 4: Update parameters (gradient descent)
    for p in model.parameters():
        p.data -= learning_rate * p.grad

    # Print progress
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: loss = {loss.data:.4f}")

# Test predictions
print("\nFinal predictions:")
for x, yt in zip(xs, ys):
    yp = model(x)
    print(f"  {[xi.data for xi in x]} -> {yp.data:.4f} (target: {yt.data})")

The Three Critical Steps

  1. Forward Pass: Compute predictions and loss, building the computational graph
  2. Zero Gradients: Reset all gradients to 0 (they accumulate by default)
  3. Backward Pass + Update: Compute gradients with loss.backward(), then update parameters

Interactive: Training XOR

Watch a neural network learn XOR in real-time. The network has two inputs, a hidden layer, and one output. Observe how the loss decreases and the network eventually classifies all four XOR cases correctly.

Gradient Flow: Training XOR

[Interactive visualization: at epoch 0 the untrained network outputs roughly 0.62-0.68 for all four XOR inputs (targets 0, 1, 1, 0); as training proceeds, the outputs converge toward the targets.]

Quick Check

Why can't a single perceptron (without hidden layers) solve XOR?


MicroGrad: See It All Together

This interactive demo shows the complete micrograd system in action. Adjust the inputs and weights to see how values propagate forward through the computation, and toggle gradients to see how they flow backward.

MicroGrad: Automatic Differentiation Engine


The Computation

n = x1 · w1 + x2 · w2 + b = 2 · (-3) + 0 · 1 + 6.8814 = 0.8814
o = tanh(n) = tanh(0.8814) = 0.7071


How Automatic Differentiation Works

  • Each Value stores its data and gradient
  • Operations build a directed acyclic graph (DAG)
  • Forward pass: compute values, build graph
  • Backward pass: traverse graph in reverse, apply chain rule
  • Each node knows how to propagate gradients through itself

Notice how:

  • Each node stores both its value (computed forward) and its gradient (computed backward)
  • Gradients multiply through the chain: \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial o} \cdot \frac{\partial o}{\partial n} \cdot \frac{\partial n}{\partial w_1}
  • The gradient of a weight tells us how sensitive the output is to a small nudge of that weight
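For the demo values above, that chain can be multiplied out by hand with plain floats, no autograd needed; this is the same arithmetic the engine performs node by node:

```python
import math

# Hand-computing the gradient chain for the demo:
# o = tanh(n), with n = x1*w1 + x2*w2 + b and the values shown above.
x1, w1, x2, w2, b = 2.0, -3.0, 0.0, 1.0, 6.8814

n = x1 * w1 + x2 * w2 + b    # ~0.8814
o = math.tanh(n)             # ~0.7071

do_dn = 1 - o ** 2           # local derivative of tanh, ~0.5
do_dw1 = do_dn * x1          # chain rule: do/dw1 = do/dn * dn/dw1, ~1.0
do_dx1 = do_dn * w1          # do/dx1 = do/dn * dn/dx1, ~-1.5
print(f"{do_dw1:.2f} {do_dx1:.2f}")
```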

From Scratch to PyTorch

Our micrograd implementation closely mirrors how PyTorch works. Here's the comparison:

Micrograd | PyTorch | Purpose
Value | torch.Tensor | Scalar/tensor with gradient tracking
value.data | tensor.data or tensor.item() | Raw numerical value
value.grad | tensor.grad | Accumulated gradient
value.backward() | tensor.backward() | Trigger gradient computation
p.data -= lr * p.grad | optimizer.step() | Update parameters
p.grad = 0 | optimizer.zero_grad() | Reset gradients

Here's the same training loop in PyTorch:

Same Training Loop in PyTorch
🐍train_pytorch.py
Model Definition: nn.Sequential chains layers together, the equivalent of our MLP class. nn.Linear corresponds to our Layer, and nn.Tanh to our tanh activation.

Optimizer: The optimizer wraps model.parameters() and handles the update step. SGD (stochastic gradient descent) is the simplest optimizer.

Zero Gradients: optimizer.zero_grad() is equivalent to our loop that sets p.grad = 0 for every parameter. PyTorch gradients also accumulate by default.

Backward Pass: loss.backward() works exactly like our implementation: it traverses the computational graph and computes gradients via the chain rule.

Parameter Update: optimizer.step() applies the update rule. For SGD: p = p - lr · p.grad. Other optimizers (Adam, RMSprop) use more complex update rules.
import torch
import torch.nn as nn

# Training data
X = torch.tensor([[0.0, 0.0], [0.0, 1.0],
                  [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Create model
model = nn.Sequential(
    nn.Linear(2, 4),
    nn.Tanh(),
    nn.Linear(4, 4),
    nn.Tanh(),
    nn.Linear(4, 1),
    nn.Tanh()
)

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Training loop
for epoch in range(100):
    # Forward pass
    y_pred = model(X)
    loss = criterion(y_pred, y)

    # Backward pass
    optimizer.zero_grad()  # Same as: for p in params: p.grad = 0
    loss.backward()        # Same as our loss.backward()

    # Update
    optimizer.step()       # Same as: for p in params: p.data -= lr * p.grad

    if epoch % 10 == 0:
        print(f"Epoch {epoch}: loss = {loss.item():.4f}")

You Now Understand loss.backward()

When you call loss.backward() in PyTorch, it does exactly what our implementation does: traverse the computational graph in reverse topological order, calling each operation's backward function to propagate gradients via the chain rule.

Common Implementation Pitfalls

Here are the most common mistakes when implementing or using backpropagation:

1. Forgetting to Zero Gradients

🐍pitfall1.py
# WRONG: Gradients accumulate!
for epoch in range(100):
    loss = compute_loss()
    loss.backward()  # Gradients ADD to existing values
    optimizer.step()

# CORRECT: Zero gradients before each backward pass
for epoch in range(100):
    optimizer.zero_grad()  # Reset gradients to 0
    loss = compute_loss()
    loss.backward()
    optimizer.step()

2. In-Place Operations Breaking the Graph

🐍pitfall2.py
# WRONG: In-place modification breaks autograd
x.data = x.data * 2  # Doesn't track the gradient!

# CORRECT: Create a new tensor
x = x * 2  # Creates a new Value/Tensor with proper gradient tracking

3. Using a Value Multiple Times

🐍pitfall3.py
# This is CORRECT - gradients accumulate
y = x + x  # x.grad will be 2.0 after backward()

# Common confusion: some expect x.grad = 1.0
# But the chain rule says: dy/dx = d(x+x)/dx = 1 + 1 = 2

4. Reusing Computation Graphs

🐍pitfall4.py
# WRONG: The graph is freed after backward()
loss.backward()
loss.backward()  # Error! Graph already freed

# CORRECT: retain_graph=True if you need multiple backward passes
loss.backward(retain_graph=True)
loss.backward()  # Works, but unusual

Debug with Numerical Gradients

If your gradients seem wrong, verify them with numerical differentiation: ∂L/∂w ≈ (L(w + ε) - L(w - ε)) / (2ε), where ε is small (e.g., 1e-5). If the analytical and numerical gradients don't match, there's a bug in your backward function.
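A minimal sketch of that check on a toy loss (the names loss and numerical_grad here are illustrative, not part of the engine):

```python
# Central-difference gradient check on a toy loss, L(w) = (3w - 2)^2.
# Analytical derivative: dL/dw = 2 * (3w - 2) * 3.
def loss(w):
    return (3 * w - 2) ** 2

def numerical_grad(f, w, eps=1e-5):
    # Central difference: (f(w + eps) - f(w - eps)) / (2 * eps)
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.7
analytical = 2 * (3 * w - 2) * 3     # 0.6
numerical = numerical_grad(loss, w)
print(abs(analytical - numerical) < 1e-6)  # should print True
```

The same pattern works for a Value-based network: perturb p.data by ±ε, recompute the loss, and compare the difference quotient against p.grad.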

Test Your Understanding

Backpropagation Quiz

Question 1 of 5

In backpropagation, what is the purpose of the chain rule?


Summary

We've built a complete automatic differentiation system from scratch, implementing the same principles that power PyTorch and TensorFlow.

Key Concepts

Concept | Key Insight | Implementation
Value class | Wraps a scalar with gradient tracking | data, grad, _backward, _prev
Forward pass | Compute outputs, build graph | Operations return new Values
Backward pass | Propagate gradients via chain rule | Reverse topological order
Gradient accumulation | Handles branching in the graph | Use += not = for gradients
Training loop | Forward → zero_grad → backward → update | Standard pattern in all frameworks

The Chain Rule in Code

Each operation's _backward function implements one step of the chain rule:

\text{input.grad} \mathrel{+}= \text{local\_derivative} \times \text{output.grad}
The Big Picture: Backpropagation is not magic—it's just the chain rule applied systematically. By implementing it yourself, you've demystified the core algorithm that enables modern deep learning.

Exercises

Conceptual Questions

  1. Why is reverse-mode automatic differentiation (backpropagation) more efficient than forward-mode for neural networks?
  2. Explain why gradients accumulate (+=) rather than replace (=). Give an example of a computation where this matters.
  3. If a computation uses a value x in three different operations that all contribute to the loss, what will x.grad equal after backward()?

Coding Exercises

  1. Add ReLU: Implement relu() for the Value class. Remember: \text{ReLU}(x) = \max(0, x), and its derivative is 1 if x > 0, else 0.
  2. Add Sigmoid: Implement sigmoid() using \sigma(x) = 1/(1 + e^{-x}). You can use the exp() method we already have.
  3. Numerical Gradient Check: Write a function that verifies gradients numerically: check_gradients(value, epsilon=1e-5).
  4. Train on Moons: Use sklearn's make_moons dataset and train your MLP to classify the two moons.

Challenge Exercise

Implement Cross-Entropy Loss: Add a log() operation to Value, then implement cross-entropy loss: L = -\sum_i y_i \log(\hat{y}_i). Train a classifier using cross-entropy instead of MSE.

Solution Hints

  • ReLU backward: self.grad += (self.data > 0) * out.grad
  • Sigmoid: Can be implemented as 1 / (1 + (-self).exp())
  • Log backward: self.grad += (1 / self.data) * out.grad

In the next section, we'll analyze gradient flow through deep networks, understanding phenomena like vanishing and exploding gradients, and how architectural choices affect learning.