Learning Objectives
By the end of this section, you will be able to:
- Understand automatic differentiation and why it's essential for deep learning
- Use the requires_grad flag to control gradient tracking
- Visualize computational graphs and understand how PyTorch builds them
- Apply the backward() method correctly to compute gradients
- Manage gradient accumulation and avoid common bugs
- Control gradient computation using no_grad(), detach(), and retain_graph
- Debug autograd issues in real training scenarios
Why This Matters: Autograd is the engine that powers all of deep learning. Without automatic differentiation, we would have to manually derive and implement gradients for every operation - an error-prone task that becomes impossible for complex models with millions of parameters. Understanding autograd is essential for debugging, optimizing, and extending neural networks.
The Big Picture
Training a neural network requires computing gradients of a loss function with respect to millions of parameters. Consider a simple network y = f(x; θ) with a scalar loss L(y).
To update parameters using gradient descent, we need ∂L/∂θ for every parameter θ.
Computing these manually would require applying the chain rule through every layer. For a model the size of GPT-4, reportedly with on the order of a trillion parameters, this is humanly impossible. Automatic differentiation solves this by computing exact gradients for any computation expressed in code.
A Brief History
The idea of automatic differentiation dates back to the 1960s with work by Robert Edwin Wengert. However, it wasn't until frameworks like Theano (2007), TensorFlow (2015), and PyTorch (2016) that autodiff became practical for deep learning. PyTorch's innovation was the dynamic computational graph - building the graph on-the-fly during execution rather than defining it statically beforehand.
Two Modes of Autodiff
| Mode | Direction | When It's Efficient | Used In |
|---|---|---|---|
| Forward Mode | Input → Output | Few inputs, many outputs | Jacobian computation |
| Reverse Mode | Output → Input | Many inputs, few outputs | Deep learning (backprop) |
PyTorch uses reverse-mode autodiff (also called backpropagation). This is ideal for neural networks where we have a single scalar loss but millions of parameters - we can compute all gradients in a single backward pass.
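A small sketch of why this matters in practice (the tensor shapes here are arbitrary): one call to `backward()` on a scalar loss fills in the gradient for every parameter at once.

```python
import torch

# A toy "model" with 505,000 parameters but a single scalar loss
w1 = torch.randn(1000, 500, requires_grad=True)
w2 = torch.randn(500, 10, requires_grad=True)
x = torch.randn(64, 1000)

loss = ((x @ w1) @ w2).pow(2).mean()  # one scalar output

# One reverse pass computes gradients for ALL parameters
loss.backward()
print(w1.grad.shape)  # torch.Size([1000, 500])
print(w2.grad.shape)  # torch.Size([500, 10])
```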
What is Autograd?
Autograd is PyTorch's automatic differentiation engine. It provides automatic computation of gradients for tensor operations. Here's the key insight:
Core Idea: Every tensor operation in PyTorch can be thought of as a function with a known derivative. By chaining these functions and their derivatives, PyTorch can compute the gradient of any composition of operations.
The Fundamental Components
- Tensors with requires_grad=True: These are the “watched” variables that we want gradients for.
- Computational Graph: A directed acyclic graph (DAG) that records all operations performed on tensors.
- grad_fn: Each tensor resulting from an operation stores a reference to the function that created it, enabling backward traversal.
- backward(): Triggers the gradient computation by traversing the graph in reverse.
The requires_grad Flag
The requires_grad attribute is the switch that tells PyTorch whether to track operations on a tensor for gradient computation.
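A quick sketch of the flag in action: it defaults to False, can be toggled in place with `requires_grad_()`, and propagates to results of operations.

```python
import torch

a = torch.randn(3)            # requires_grad defaults to False
print(a.requires_grad)        # False

a.requires_grad_(True)        # in-place toggle on an existing tensor
print(a.requires_grad)        # True

b = torch.randn(3, requires_grad=True)  # or set it at creation
c = a + b
print(c.requires_grad)        # True - tracking is "contagious"
```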
Leaf Tensors vs Non-Leaf Tensors
This distinction is crucial for understanding where gradients are stored:
| Property | Leaf Tensor | Non-Leaf Tensor |
|---|---|---|
| Creation | Created directly by user | Result of operations |
| is_leaf | True | False |
| grad_fn | None | Points to creating function |
| .grad storage | Gradients accumulated here | Not stored by default |
| Example | torch.tensor([1.0], requires_grad=True) | x + y, x * 2, etc. |
```python
import torch

x = torch.tensor([2.0], requires_grad=True)  # Leaf
y = x ** 2                                   # Non-leaf
z = y + 1                                    # Non-leaf

z.backward()

print(x.grad)  # tensor([4.]) - gradient stored here!
print(y.grad)  # None - non-leaf, not retained

# To retain non-leaf gradients:
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
y.retain_grad()  # Tell PyTorch to keep this gradient
z = y + 1
z.backward()
print(y.grad)  # tensor([1.]) - now it's stored!
```
Computational Graphs
At the heart of autograd is the computational graph - a data structure that records every operation performed on tensors. This graph has two key properties:
- Directed: Edges point from inputs to outputs (forward direction) and from outputs to inputs (backward direction for gradients).
- Acyclic: There are no cycles - you can't have a tensor that depends on itself through a chain of operations.
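You can inspect this DAG directly: each non-leaf tensor's `grad_fn` links to the function that created it, and each `grad_fn`'s `next_functions` links to its inputs, forming the backward edges.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
z = y + 1

# Each non-leaf tensor links to the function that created it...
print(z.grad_fn)                 # <AddBackward0 ...>
# ...and each grad_fn links back to its inputs' grad_fns
print(z.grad_fn.next_functions)  # ((<PowBackward0 ...>, 0), ...)
```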
Dynamic vs Static Graphs
PyTorch uses dynamic computational graphs (define-by-run), meaning the graph is built fresh during each forward pass:
| Aspect | Dynamic (PyTorch) | Static (TensorFlow 1.x) |
|---|---|---|
| Graph Creation | Built during execution | Defined before execution |
| Python Control Flow | Fully supported (if, for, while) | Limited (tf.cond, tf.while_loop) |
| Debugging | Standard Python debugger works | Requires special tools |
| Flexibility | Different graph each forward pass | Same graph every time |
| Optimization | Less compile-time optimization | More compile-time optimization |
```python
import torch

def forward(x, use_relu=True):
    """Dynamic graph changes based on Python conditions"""
    y = x ** 2

    # This Python if-statement works naturally!
    if use_relu:
        y = torch.relu(y)
    else:
        y = torch.tanh(y)

    return y.sum()

x = torch.randn(3, requires_grad=True)

# Two different graphs based on the condition
loss1 = forward(x, use_relu=True)
loss2 = forward(x, use_relu=False)  # Different graph!
```
Modern TensorFlow: TensorFlow 2.x switched to eager (define-by-run) execution by default, converging on the dynamic approach described here.
Interactive: Computational Graph Builder
The interactive visualizer here steps through a forward pass, showing how PyTorch builds the computational graph as each operation executes, then computes gradients during the backward pass. Its running example:
```python
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x  # dy/dx = 2x + 3 = 7 at x = 2
```
The backward() Method
The backward() method triggers gradient computation by traversing the computational graph in reverse order, applying the chain rule at each step.
The Chain Rule in Action
Let's trace through how autograd applies the chain rule:
```python
# Compute gradient of z = (x + y)^2 where x=2, y=3
x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)

# Forward pass: build the graph
s = x + y   # s = 5, grad_fn=AddBackward0
z = s ** 2  # z = 25, grad_fn=PowBackward0

# Backward pass: apply chain rule
z.backward()

# dz/dx = dz/ds * ds/dx = 2s * 1 = 2*5 = 10
print(x.grad)  # tensor([10.])

# dz/dy = dz/ds * ds/dy = 2s * 1 = 2*5 = 10
print(y.grad)  # tensor([10.])
```
The chain rule states that for z = f(s) with s = g(x), dz/dx = (dz/ds) · (ds/dx).
Gradient Accumulation
One of the most common sources of bugs in PyTorch is forgetting that gradients accumulate by default. Each call to backward() adds to existing gradients rather than replacing them.
Critical: Gradient Accumulation
Gradients are summed into a leaf tensor's .grad attribute on every backward pass. You MUST call optimizer.zero_grad() or manually zero gradients before each backward pass in a standard training loop.
```python
x = torch.tensor([2.0], requires_grad=True)

# First backward
y1 = x ** 2
y1.backward()
print(x.grad)  # tensor([4.])

# Second backward WITHOUT zeroing
y2 = x ** 2
y2.backward()
print(x.grad)  # tensor([8.]) - ACCUMULATED! 4 + 4 = 8

# This is a bug in most training loops!

# Correct approach: zero gradients
x.grad.zero_()  # or x.grad = None
y3 = x ** 2
y3.backward()
print(x.grad)  # tensor([4.]) - correct!
```
Why Accumulation?
Gradient accumulation is actually a feature, not a bug. It's useful for:
- Gradient accumulation across mini-batches: When GPU memory is limited, process multiple small batches and accumulate gradients before updating.
- Multi-task learning: Compute gradients from multiple loss functions and combine them before optimization.
```python
# Simulate a large batch using gradient accumulation
accumulation_steps = 4
optimizer.zero_grad()

for i, batch in enumerate(dataloader):
    output = model(batch)
    loss = criterion(output) / accumulation_steps  # Scale loss
    loss.backward()  # Accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # Update with accumulated gradients
        optimizer.zero_grad()  # Reset for next accumulation
```
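The multi-task case can be sketched the same way. Here the model, data, and both loss terms are stand-ins invented for illustration; the point is that two backward passes sum their gradients into the same `.grad` buffers before a single optimizer step.

```python
import torch
import torch.nn as nn

# Hypothetical two-task setup: one shared model, two loss terms
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)

optimizer.zero_grad()
out = model(x)
loss_a = out.pow(2).mean()   # stand-in for task A's loss
loss_b = out.abs().mean()    # stand-in for task B's loss

loss_a.backward(retain_graph=True)  # gradients from task A
loss_b.backward()                   # task B's gradients ADD to them
optimizer.step()                    # one update from the combined gradient
```

Note the `retain_graph=True` on the first call: both losses share the graph through `out`, so it must survive until the second backward pass.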
Controlling Gradient Computation
PyTorch provides several mechanisms to control when and how gradients are computed:
1. torch.no_grad()
A context manager that disables gradient computation. Use during inference to save memory and computation:
```python
model.eval()  # Set model to evaluation mode

# Inference without gradients
with torch.no_grad():
    predictions = model(test_data)
    # No computational graph is built
    # Tensors inside won't have grad_fn
    print(predictions.requires_grad)  # False

# Decorator version
@torch.no_grad()
def inference(model, data):
    return model(data)
```
2. tensor.detach()
Creates a new tensor that shares data but is detached from the computational graph:
```python
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# Detach y from the graph
y_detached = y.detach()

print(y.requires_grad)           # True
print(y_detached.requires_grad)  # False
print(y_detached.grad_fn)        # None

# Common use: stop gradient flow in part of a network
# (a method of an nn.Module that defines self.layer1 and self.layer2)
def forward_with_stop_gradient(self, x):
    # First part: gradients flow
    h1 = self.layer1(x)

    # Stop gradients here
    h1_stopped = h1.detach()

    # Second part: no gradients reach layer1
    h2 = self.layer2(h1_stopped)
    return h2
```
3. retain_graph
By default, the computational graph is freed after backward(). Use retain_graph=True to keep it for multiple backward passes:
```python
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# First backward - graph is retained
y.backward(retain_graph=True)
print(x.grad)  # tensor([4.])

x.grad.zero_()  # Reset gradient

# Second backward on same graph
y.backward()  # Works because graph was retained
print(x.grad)  # tensor([4.])

# Without retain_graph=True, the second backward would fail!
```
4. torch.enable_grad()
Re-enables gradient computation inside a no_grad block:
```python
with torch.no_grad():
    x = torch.tensor([1.0], requires_grad=True)
    y = x * 2  # No gradient tracking

    with torch.enable_grad():
        z = x * 3  # Gradient tracking re-enabled
        print(z.requires_grad)  # True
```
Quick Check
What happens if you call backward() twice on the same tensor without retain_graph=True?
Under the Hood
Let's peek under the hood to understand how autograd actually works:
The Function Class
Every operation in PyTorch is implemented as a subclass of torch.autograd.Function with two methods:
- forward(): Computes the output given inputs
- backward(): Computes gradients given the gradient of the output
```python
class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # ctx is a context object for saving info for backward
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output is dL/d(output)
        # We need to return dL/dx = dL/d(output) * d(output)/dx
        x, = ctx.saved_tensors
        return grad_output * 2 * x  # d(x^2)/dx = 2x

# Usage
x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)  # y = 9
y.backward()
print(x.grad)  # tensor([6.]) = 2 * 3
```
The Backward Pass Algorithm
When you call backward(), PyTorch:
- Starts at the output tensor with gradient = 1 (for a scalar) or the provided gradient
- Traverses the graph in reverse topological order
- At each node, calls the backward() method of its grad_fn
- Multiplies the incoming gradient by the local gradient (chain rule)
- Propagates the result to input nodes
- Accumulates gradients at leaf nodes in their .grad attribute
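The first step above explains why backward() on a non-scalar tensor needs an explicit gradient argument: there is no natural "1" to start from, so you must supply the seed vector for the vector-Jacobian product.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2            # non-scalar output

# y.backward() alone raises "grad can be implicitly created only
# for scalar outputs". Supply the seed vector v for v^T @ J:
v = torch.tensor([1.0, 1.0, 1.0])
y.backward(v)
print(x.grad)         # tensor([2., 4., 6.]) = v * dy/dx = 1 * 2x
```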
Practical Examples
Example 1: Neural Network Training Loop
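The code for this example is missing here; a minimal sketch of a standard loop, with synthetic data and illustrative shapes, might look like this:

```python
import torch
import torch.nn as nn

# Synthetic regression data (shapes chosen for illustration)
X = torch.randn(100, 10)
y_true = torch.randn(100, 1)

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):
    optimizer.zero_grad()             # 1. reset accumulated gradients
    y_pred = model(X)                 # 2. forward pass builds the graph
    loss = criterion(y_pred, y_true)
    loss.backward()                   # 3. reverse pass fills .grad
    optimizer.step()                  # 4. update parameters
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```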
Example 2: Custom Gradient Modification
```python
# Gradient clipping
x = torch.randn(100, requires_grad=True)
y = (x ** 2).sum()
y.backward()

# Clip gradients to prevent explosion
torch.nn.utils.clip_grad_norm_([x], max_norm=1.0)

# Or manually
x.grad.clamp_(-1, 1)

# Register a backward hook for custom gradient processing
def print_grad(grad):
    print(f"Gradient shape: {grad.shape}, mean: {grad.mean():.4f}")
    return grad  # Return modified gradient or None to use original

x = torch.randn(10, requires_grad=True)
x.register_hook(print_grad)
y = (x ** 2).sum()
y.backward()  # Hook is called during backward
```
Example 3: Higher-Order Gradients
```python
# Second derivative: d²y/dx²
x = torch.tensor([2.0], requires_grad=True)
y = x ** 3  # y = 8

# First derivative
grad1 = torch.autograd.grad(y, x, create_graph=True)[0]
print(grad1)  # tensor([12.]) = 3x² = 3*4 = 12

# Second derivative
grad2 = torch.autograd.grad(grad1, x)[0]
print(grad2)  # tensor([12.]) = 6x = 6*2 = 12

# Useful for: physics-informed neural networks, Hessian computation
```
Common Pitfalls
1. Forgetting zero_grad()
```python
# ❌ BUG: Gradients accumulate across iterations
for batch in dataloader:
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()  # Uses accumulated gradients!

# ✓ CORRECT
for batch in dataloader:
    optimizer.zero_grad()  # Reset first!
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
```
2. Modifying Tensors In-Place
```python
# ❌ BUG: In-place operations can break autograd
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
x.add_(1)  # RuntimeError! In-place ops on a leaf that requires grad are forbidden

# ✓ CORRECT: Use out-of-place operations
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
x = x + 1  # Creates a new tensor
# (Note: y still depends on the original x value)
```
3. Mixing Devices
```python
# ❌ BUG: Tensors on different devices
x = torch.tensor([1.0], device='cuda', requires_grad=True)
y = torch.tensor([2.0], device='cpu', requires_grad=True)
z = x + y  # RuntimeError!

# ✓ CORRECT: Same device
x = torch.tensor([1.0], device='cuda', requires_grad=True)
y = torch.tensor([2.0], device='cuda', requires_grad=True)
z = x + y  # Works
```
4. Converting to NumPy with Gradients
```python
# ❌ BUG: Can't convert a tensor with gradients to NumPy
x = torch.tensor([1.0], requires_grad=True)
y = x ** 2
y.numpy()  # RuntimeError!

# ✓ CORRECT: Detach first
y.detach().numpy()  # Works

# Or if on GPU:
y.detach().cpu().numpy()  # Works
```
Summary
In this section, we covered PyTorch's autograd system - the automatic differentiation engine that makes deep learning training possible:
| Concept | Key Points |
|---|---|
| requires_grad | Flag that enables gradient tracking; set True for learnable parameters |
| Computational Graph | Dynamic DAG built during forward pass; traversed in reverse for gradients |
| backward() | Triggers gradient computation; accumulates gradients in leaf tensor .grad |
| Gradient Accumulation | Gradients add up by default; use zero_grad() to reset |
| no_grad() | Disables gradient tracking; use for inference to save memory |
| detach() | Creates new tensor detached from graph; stops gradient flow |
| retain_graph | Keeps graph for multiple backward passes; needed for higher-order derivatives |
Key Takeaways
- Autograd builds graphs dynamically: Unlike static graph frameworks, PyTorch builds the computational graph during execution, enabling Pythonic control flow.
- Gradients accumulate by default: Always call optimizer.zero_grad() before backward() in training loops unless you intentionally want accumulation.
- Only leaf tensors store gradients: Use retain_grad() if you need gradients for intermediate computations.
- Use no_grad() for inference: This saves memory and computation by not building the computational graph.
- Understand the chain rule: Autograd applies the chain rule automatically, but understanding it helps debug gradient issues.
Exercises
Conceptual Questions
- Explain the difference between forward-mode and reverse-mode automatic differentiation. Why is reverse-mode preferred for deep learning?
- What is a leaf tensor? Why do only leaf tensors have their gradients stored in .grad?
- Describe a scenario where gradient accumulation (without zero_grad()) is actually the desired behavior.
Coding Exercises
- Manual Chain Rule: Compute the gradient of a composite function both manually using the chain rule and using autograd. Verify they match.
- Custom Autograd Function: Implement a custom autograd function for f(x) = |x| (absolute value). Handle the gradient at x = 0.
- Gradient Debugging: Given this buggy code (exercise_debug.py), identify and fix the issues:
```python
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for i in range(100):
    x = torch.randn(32, 10)
    y = model(x).sum()
    y.backward()
    optimizer.step()
    print(f"Loss: {y.item()}")
```
- Higher-Order Derivatives: Use autograd to compute the second and third derivatives of a function of your choice at x = 1.
Challenge Exercise
Build a Mini-Autograd: Implement a simplified autograd system that supports addition, multiplication, and power operations. Your implementation should:
- Define a
Tensorclass that stores values and gradients - Implement forward operations that build a computational graph
- Implement a
backward()method that computes gradients - Handle gradient accumulation for tensors used multiple times
```python
class MiniTensor:
    def __init__(self, value, requires_grad=False):
        self.value = value
        self.grad = 0.0
        self.requires_grad = requires_grad
        self._backward = lambda: None
        self._prev = set()

    def __add__(self, other):
        # Implement me!
        pass

    def __mul__(self, other):
        # Implement me!
        pass

    def backward(self):
        # Implement me!
        pass

# Test your implementation:
a = MiniTensor(2.0, requires_grad=True)
b = MiniTensor(3.0, requires_grad=True)
c = a + b  # c = 5
d = c * a  # d = 10
d.backward()
print(a.grad)  # Should be 7 (dc/da * a + c * 1 = 1*2 + 5*1 = 7)
print(b.grad)  # Should be 2 (dc/db * a = 1*2 = 2)
```
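The collapsed "Implementation Hint" was lost in extraction; as one possible hint, here is how the addition case and backward() might look (a sketch of one design, not the only one — `__mul__` is left to you):

```python
class MiniTensor:
    def __init__(self, value, requires_grad=False):
        self.value = value
        self.grad = 0.0
        self.requires_grad = requires_grad
        self._backward = lambda: None
        self._prev = set()

    def __add__(self, other):
        out = MiniTensor(self.value + other.value,
                         requires_grad=self.requires_grad or other.requires_grad)
        out._prev = {self, other}

        def _backward():
            # Local derivative of addition is 1 w.r.t. both inputs;
            # += implements gradient accumulation for reused tensors
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Seed the output gradient, then walk the graph in reverse
        # topological order so each node's grad is complete before use
        topo, visited = [], set()

        def build(t):
            if t not in visited:
                visited.add(t)
                for p in t._prev:
                    build(p)
                topo.append(t)

        build(self)
        self.grad = 1.0
        for t in reversed(topo):
            t._backward()

# Quick check on addition alone:
a = MiniTensor(2.0, requires_grad=True)
b = MiniTensor(3.0, requires_grad=True)
c = a + b
c.backward()
print(a.grad, b.grad)  # 1.0 1.0
```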
In the next section, we'll explore how to build custom autograd functions for operations that aren't supported by PyTorch out of the box, or when you need to optimize gradient computation for special cases.