Learning Objectives
By the end of this section, you will be able to:
- Implement automatic differentiation from scratch using a simple Value class that tracks gradients
- Build forward and backward passes for arbitrary computational graphs
- Construct neurons, layers, and networks using your own autograd engine
- Train a neural network using gradient descent on your from-scratch implementation
- Understand how PyTorch's autograd relates to your implementation
- Debug gradient computations and avoid common implementation mistakes
Why This Matters: Every deep learning framework—PyTorch, TensorFlow, JAX—implements automatic differentiation at its core. By building it yourself, you'll understand exactly what happens when you call loss.backward(), making you a more effective practitioner and debugger.
The Big Picture
In the previous section, we derived the mathematics of backpropagation. Now we'll bring those equations to life by implementing them in code. Our goal is to build a miniature version of PyTorch's autograd system—a system that can automatically compute gradients for any computation we define.
What We're Building
We'll implement micrograd—a tiny automatic differentiation engine inspired by Andrej Karpathy's famous educational project. Despite being only ~100 lines of code, it captures the essence of how modern deep learning frameworks work.
| Component | Purpose | Key Method |
|---|---|---|
| Value | Wraps a scalar, stores gradient | __add__, __mul__, backward() |
| Neuron | Weighted sum + activation | __call__(x) |
| Layer | Collection of neurons | __call__(x) |
| MLP | Multi-layer perceptron | __call__(x), parameters() |
The Core Insight
The key insight is that every mathematical operation knows two things:
- How to compute its output (forward pass): given inputs, produce output
- How to propagate gradients backward (backward pass): given the gradient of the loss with respect to its output (∂L/∂output), compute the gradient with respect to each input (∂L/∂input)
If every operation knows these two things, we can chain them together to compute gradients through arbitrarily complex computations.
Building Blocks: The Value Class
Our Value class is the fundamental building block. It wraps a scalar number and tracks:
- data: The actual numerical value
- grad: The gradient of the loss with respect to this value (∂L/∂v), initially 0
- _backward: A function that propagates gradients to inputs
- _prev: The set of values that produced this value (for topological sort)
The Gradient Equation
For each operation, the backward function implements the chain rule. For example, if c = a + b, then ∂c/∂a = 1 and ∂c/∂b = 1.
Since the local derivative of addition is 1 with respect to each input, both inputs receive the output's gradient unchanged: ∂L/∂a += ∂L/∂c and ∂L/∂b += ∂L/∂c. This is why gradients "add up" at branches in the computational graph.
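In code, this addition rule becomes a small closure attached to the output node. The sketch below shows only __add__ on a stripped-down Value class (the full class is built out later in this section); note the += that makes branch gradients accumulate:

```python
class Value:
    """Minimal scalar wrapper: just enough to show addition's backward rule."""
    def __init__(self, data, _prev=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_prev)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1, so each input
            # receives out.grad unchanged -- accumulated with +=
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

# Using the same value on both branches: y = x + x, so dy/dx = 2
x = Value(3.0)
y = x + x
y.grad = 1.0
y._backward()
print(x.grad)  # the two branches' gradients add up to 2.0
```

Because self and other are the same object here, the two += lines both hit x.grad, which is exactly the branching behavior described above.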
Interactive: Forward & Backward Pass
Let's visualize a complete forward and backward pass through a simple neuron. Step through each stage to see how values flow forward and gradients flow backward.
Backpropagation Step-by-Step
Our simple neural network: two inputs weighted, summed with bias, passed through sigmoid, and compared to target.
Implementing the Value Class
Let's implement our automatic differentiation engine step by step:
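Here is a sketch of the engine in the spirit of Karpathy's micrograd, trimmed to three operations (add, mul, tanh); extending it with more operations follows the same pattern:

```python
import math

class Value:
    """Wraps a scalar and records the operations that produced it."""
    def __init__(self, data, _prev=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # propagates grad to inputs
        self._prev = set(_prev)         # parent nodes in the graph

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad            # local derivative is 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(ab)/da = b
            other.grad += self.data * out.grad   # d(ab)/db = a
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t ** 2) * out.grad  # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort so each node runs after everything it feeds into
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._prev:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0                 # dL/dL = 1 seeds the recursion
        for v in reversed(topo):
            v._backward()
```

With this class, y = (a * b + c).tanh() builds the graph during the forward pass, and y.backward() fills in a.grad, b.grad, and c.grad automatically.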
Quick Check
In the Value class, why do we use += instead of = when updating gradients in the backward functions?
Building Neurons and Layers
Now we can build neural network components using our Value class:
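On top of Value, neurons, layers, and the MLP are just Python classes that hold parameters and define __call__. Below is one possible sketch (the Value class is repeated in compact form so the example stands alone; initialization choices such as uniform(-1, 1) weights and tanh activations are illustrative):

```python
import math, random

class Value:
    # Compact version of the Value class from above (add, mul, tanh only)
    def __init__(self, data, _prev=()):
        self.data, self.grad = data, 0.0
        self._backward, self._prev = (lambda: None), set(_prev)
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad; other.grad += out.grad
        out._backward = _backward
        return out
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out
    def backward(self):
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._prev:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

class Neuron:
    def __init__(self, n_in):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_in)]
        self.b = Value(0.0)
    def __call__(self, x):
        act = self.b                       # weighted sum + bias
        for wi, xi in zip(self.w, x):
            act = act + wi * xi
        return act.tanh()                  # squashed through tanh
    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, n_in, n_out):
        self.neurons = [Neuron(n_in) for _ in range(n_out)]
    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs
    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

class MLP:
    def __init__(self, n_in, sizes):       # e.g. MLP(2, [4, 1])
        dims = [n_in] + sizes
        self.layers = [Layer(dims[i], dims[i + 1]) for i in range(len(sizes))]
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```

For example, MLP(2, [4, 1]) builds a 2-input network with one 4-neuron hidden layer and a single output, and parameters() gathers every weight and bias for the training loop.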
The Training Loop
Now we can train our network! The training loop has three main steps:
The Three Critical Steps
- Forward Pass: Compute predictions and loss, building the computational graph
- Zero Gradients: Reset all gradients to 0 (they accumulate by default)
- Backward Pass + Update: Compute gradients with loss.backward(), then update parameters
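The three steps above can be sketched with the Value class alone. This toy example fits a single tanh neuron's weight to one target value (the learning rate, epoch count, and target are illustrative):

```python
import math

class Value:
    # Compact Value (add, mul, tanh) -- just enough to run the loop
    def __init__(self, data, _prev=()):
        self.data, self.grad = data, 0.0
        self._backward, self._prev = (lambda: None), set(_prev)
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad; other.grad += out.grad
        out._backward = _backward
        return out
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out
    def backward(self):
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._prev:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# Toy problem: fit w so that tanh(w * x) matches the target for one sample
x, target = 0.5, 0.8
w = Value(0.0)

for epoch in range(300):
    # 1. Forward pass: build the graph and compute the loss
    pred = (w * x).tanh()
    diff = pred + (-target)
    loss = diff * diff                 # squared error
    # 2. Zero gradients (they accumulate across backward calls)
    for p in (w,):
        p.grad = 0.0
    # 3. Backward pass + update
    loss.backward()
    w.data -= 0.5 * w.grad             # plain gradient descent

print(loss.data)
```

Each epoch builds a fresh graph in the forward pass; the parameter w persists across epochs, which is exactly why its gradient must be zeroed before each backward call.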
Interactive: Training XOR
Watch a neural network learn XOR in real-time. The network has two inputs, a hidden layer, and one output. Observe how the loss decreases and the network eventually classifies all four XOR cases correctly.
Quick Check
Why can't a single perceptron (without hidden layers) solve XOR?
MicroGrad: See It All Together
This interactive demo shows the complete micrograd system in action. Adjust the inputs and weights to see how values propagate forward through the computation, and toggle gradients to see how they flow backward.
How Automatic Differentiation Works
- Each Value stores its data and gradient
- Operations build a directed acyclic graph (DAG)
- Forward pass: compute values, build graph
- Backward pass: traverse graph in reverse, apply chain rule
- Each node knows how to propagate gradients through itself
Notice how:
- Each node stores both its value (computed forward) and its gradient (computed backward)
- Gradients multiply through the chain: ∂L/∂x = (∂L/∂y) · (∂y/∂x)
- The gradient of a weight tells us how much the output would change if we nudged that weight
From Scratch to PyTorch
Our micrograd implementation mirrors exactly how PyTorch works. Here's the comparison:
| Micrograd | PyTorch | Purpose |
|---|---|---|
| Value | torch.Tensor | Scalar/tensor with gradient tracking |
| value.data | tensor.data or tensor.item() | Raw numerical value |
| value.grad | tensor.grad | Accumulated gradient |
| value.backward() | tensor.backward() | Trigger gradient computation |
| p.data -= lr * p.grad | optimizer.step() | Update parameters |
| p.grad = 0 | optimizer.zero_grad() | Reset gradients |
Here's the same training loop in PyTorch:
You Now Understand loss.backward()
When you call loss.backward() in PyTorch, it does exactly what our implementation does: traverse the computational graph in reverse topological order, calling each operation's backward function to propagate gradients via the chain rule.
Common Implementation Pitfalls
Here are the most common mistakes when implementing or using backpropagation:
1. Forgetting to Zero Gradients
```python
# WRONG: Gradients accumulate!
for epoch in range(100):
    loss = compute_loss()
    loss.backward()   # Gradients ADD to existing values
    optimizer.step()

# CORRECT: Zero gradients before each backward pass
for epoch in range(100):
    optimizer.zero_grad()  # Reset gradients to 0
    loss = compute_loss()
    loss.backward()
    optimizer.step()
```
2. In-Place Operations Breaking the Graph
```python
# WRONG: In-place modification breaks autograd
x.data = x.data * 2  # Doesn't track gradient!

# CORRECT: Create a new tensor
x = x * 2  # Creates a new Value/Tensor with proper gradient tracking
```
3. Using a Value Multiple Times
```python
# This is CORRECT -- gradients accumulate
y = x + x  # x.grad will be 2.0 after backward

# Common confusion: some expect x.grad = 1.0
# But the chain rule says: dy/dx = d(x + x)/dx = 1 + 1 = 2
```
4. Reusing Computation Graphs
```python
# WRONG: The graph is freed after backward()
loss.backward()
loss.backward()  # Error! Graph already freed

# CORRECT: retain_graph=True if you need multiple backward passes
loss.backward(retain_graph=True)
loss.backward()  # Works, but unusual
```
Debug with Numerical Gradients
A reliable sanity check is the central-difference approximation: ∂L/∂w ≈ (L(w + ε) - L(w - ε)) / (2ε), where ε is small (e.g. 1e-5). If the analytical and numerical gradients don't match, there's a bug in your backward function.
Test Your Understanding
Backpropagation Quiz
Question 1 of 5: In backpropagation, what is the purpose of the chain rule?
Summary
We've built a complete automatic differentiation system from scratch, implementing the same principles that power PyTorch and TensorFlow.
Key Concepts
| Concept | Key Insight | Implementation |
|---|---|---|
| Value Class | Wraps scalar with gradient tracking | data, grad, _backward, _prev |
| Forward Pass | Compute outputs, build graph | Operations return new Values |
| Backward Pass | Propagate gradients via chain rule | Reverse topological order |
| Gradient Accumulation | Handles branching in graph | Use += not = for gradients |
| Training Loop | Forward → zero_grad → backward → update | Standard pattern in all frameworks |
The Chain Rule in Code
Each operation's _backward function implements one step of the chain rule:
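For instance, multiplication's _backward scales the incoming gradient by the other operand, which is the chain rule for a product. A minimal sketch:

```python
class Value:
    # Just enough structure to show one _backward step
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data)
        def _backward():
            # Chain rule for out = self * other:
            # dL/dself = dL/dout * dout/dself = out.grad * other.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

a, b = Value(2.0), Value(5.0)
c = a * b
c.grad = 1.0       # pretend c is the loss: dL/dc = 1
c._backward()
print(a.grad, b.grad)  # each input's gradient equals the other's data
```

One such local rule per operation, chained in reverse topological order, is the entire algorithm.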
The Big Picture: Backpropagation is not magic—it's just the chain rule applied systematically. By implementing it yourself, you've demystified the core algorithm that enables modern deep learning.
Exercises
Conceptual Questions
- Why is reverse-mode automatic differentiation (backpropagation) more efficient than forward-mode for neural networks?
- Explain why gradients accumulate (+=) rather than replace (=). Give an example of a computation where this matters.
- If a computation uses a value v in three different operations that all contribute to the loss, what will v.grad equal after backward()?
Coding Exercises
- Add ReLU: Implement relu() for the Value class. Remember: relu(x) = max(0, x), and its derivative is 1 if x > 0, else 0.
- Add Sigmoid: Implement sigmoid() using σ(x) = 1 / (1 + e^(−x)). You can use the exp() method we already have.
- Numerical Gradient Check: Write a function that verifies gradients numerically: check_gradients(value, epsilon=1e-5).
- Train on Moons: Use sklearn's make_moons dataset and train your MLP to classify the two moons.
Challenge Exercise
Implement Cross-Entropy Loss: Add a log() operation to Value, then implement cross-entropy loss: L = −Σᵢ yᵢ log(ŷᵢ). Train a classifier using cross-entropy instead of MSE.
Solution Hints
- ReLU backward: self.grad += (self.data > 0) * out.grad
- Sigmoid: can be implemented as 1 / (1 + (-self).exp())
- Log backward: self.grad += (1 / self.data) * out.grad
In the next section, we'll analyze gradient flow through deep networks, understanding phenomena like vanishing and exploding gradients, and how architectural choices affect learning.