Learning Objectives
By the end of this section, you will be able to:
- Understand automatic differentiation and why it's essential for deep learning
- Use the requires_grad flag to control gradient tracking
- Visualize computational graphs and understand how PyTorch builds them
- Apply the backward() method correctly to compute gradients
- Manage gradient accumulation and avoid common bugs
- Control gradient computation using no_grad(), detach(), and retain_graph
- Debug autograd issues in real training scenarios
Why This Matters: Autograd is the engine that powers all of deep learning. Without automatic differentiation, we would have to manually derive and implement gradients for every operation - an error-prone task that becomes impossible for complex models with millions of parameters. Understanding autograd is essential for debugging, optimizing, and extending neural networks.
The Big Picture
Training a neural network requires computing gradients of a loss function with respect to millions of parameters. Consider a simple network y = f(x; θ) with a scalar loss L(y).
To update parameters using gradient descent, we need ∂L/∂θ for every parameter θ.
Computing these manually would require applying the chain rule through every layer. For a model the size of GPT-4, reportedly with on the order of a trillion parameters, this is humanly impossible. Automatic differentiation solves this by computing exact gradients for any computation expressed in code.
A Brief History
The idea of automatic differentiation dates back to the 1960s with work by Robert Edwin Wengert. However, it wasn't until frameworks like Theano (2007), TensorFlow (2015), and PyTorch (2016) that autodiff became practical for deep learning. PyTorch's innovation was the dynamic computational graph - building the graph on-the-fly during execution rather than defining it statically beforehand.
Two Modes of Autodiff
| Mode | Direction | When It's Efficient | Used In |
|---|---|---|---|
| Forward Mode | Input → Output | Few inputs, many outputs | Jacobian computation |
| Reverse Mode | Output → Input | Many inputs, few outputs | Deep learning (backprop) |
PyTorch uses reverse-mode autodiff (also called backpropagation). This is ideal for neural networks where we have a single scalar loss but millions of parameters - we can compute all gradients in a single backward pass.
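A small sketch of why this matters in practice (the tensor shapes here are arbitrary): one call to `backward()` on a scalar loss fills in the gradient for every parameter at once.

```python
import torch

# A toy "model" with 505,000 parameters but a single scalar loss
w1 = torch.randn(1000, 500, requires_grad=True)
w2 = torch.randn(500, 10, requires_grad=True)
x = torch.randn(64, 1000)

loss = ((x @ w1) @ w2).pow(2).mean()  # one scalar output

# One reverse pass computes gradients for ALL parameters
loss.backward()
print(w1.grad.shape)  # torch.Size([1000, 500])
print(w2.grad.shape)  # torch.Size([500, 10])
```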
What is Autograd?
Autograd is PyTorch's automatic differentiation engine. It provides automatic computation of gradients for tensor operations. Here's the key insight:
Core Idea: Every tensor operation in PyTorch can be thought of as a function with a known derivative. By chaining these functions and their derivatives, PyTorch can compute the gradient of any composition of operations.
The Fundamental Components
- Tensors with requires_grad=True: These are the “watched” variables that we want gradients for.
- Computational Graph: A directed acyclic graph (DAG) that records all operations performed on tensors.
- grad_fn: Each tensor resulting from an operation stores a reference to the function that created it, enabling backward traversal.
- backward(): Triggers the gradient computation by traversing the graph in reverse.
The requires_grad Flag
The requires_grad attribute is the switch that tells PyTorch whether to track operations on a tensor for gradient computation.
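A quick sketch of the flag in action: it defaults to False, can be toggled in place with `requires_grad_()`, and propagates to results of operations.

```python
import torch

a = torch.randn(3)            # requires_grad defaults to False
print(a.requires_grad)        # False

a.requires_grad_(True)        # in-place toggle on an existing tensor
print(a.requires_grad)        # True

b = torch.randn(3, requires_grad=True)  # or set it at creation
c = a + b
print(c.requires_grad)        # True - tracking is "contagious"
```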
Leaf Tensors vs Non-Leaf Tensors
This distinction is crucial for understanding where gradients are stored:
| Property | Leaf Tensor | Non-Leaf Tensor |
|---|---|---|
| Creation | Created directly by user | Result of operations |
| is_leaf | True | False |
| grad_fn | None | Points to creating function |
| .grad storage | Gradients accumulated here | Not stored by default |
| Example | torch.tensor([1.0], requires_grad=True) | x + y, x * 2, etc. |
```python
import torch

x = torch.tensor([2.0], requires_grad=True)  # Leaf
y = x ** 2                                   # Non-leaf
z = y + 1                                    # Non-leaf

z.backward()

print(x.grad)  # tensor([4.]) - gradient stored here!
print(y.grad)  # None - non-leaf, not retained

# To retain non-leaf gradients:
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
y.retain_grad()  # Tell PyTorch to keep this gradient
z = y + 1
z.backward()
print(y.grad)  # tensor([1.]) - now it's stored!
```
Computational Graphs
At the heart of autograd is the computational graph - a data structure that records every operation performed on tensors. This graph has two key properties:
- Directed: Edges point from inputs to outputs (forward direction) and from outputs to inputs (backward direction for gradients).
- Acyclic: There are no cycles - you can't have a tensor that depends on itself through a chain of operations.
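You can inspect this DAG directly: each non-leaf tensor's `grad_fn` links to the function that created it, and each `grad_fn`'s `next_functions` links to its inputs, forming the backward edges.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
z = y + 1

# Each non-leaf tensor links to the function that created it...
print(z.grad_fn)                 # <AddBackward0 ...>
# ...and each grad_fn links back to its inputs' grad_fns
print(z.grad_fn.next_functions)  # ((<PowBackward0 ...>, 0), ...)
```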
Dynamic vs Static Graphs
PyTorch uses dynamic computational graphs (define-by-run), meaning the graph is built fresh during each forward pass:
| Aspect | Dynamic (PyTorch) | Static (TensorFlow 1.x) |
|---|---|---|
| Graph Creation | Built during execution | Defined before execution |
| Python Control Flow | Fully supported (if, for, while) | Limited (tf.cond, tf.while_loop) |
| Debugging | Standard Python debugger works | Requires special tools |
| Flexibility | Different graph each forward pass | Same graph every time |
| Optimization | Less compile-time optimization | More compile-time optimization |
```python
import torch

def forward(x, use_relu=True):
    """Dynamic graph changes based on Python conditions"""
    y = x ** 2

    # This Python if-statement works naturally!
    if use_relu:
        y = torch.relu(y)
    else:
        y = torch.tanh(y)

    return y.sum()

x = torch.randn(3, requires_grad=True)

# Two different graphs based on the condition
loss1 = forward(x, use_relu=True)
loss2 = forward(x, use_relu=False)  # Different graph!
```
Modern TensorFlow: TensorFlow 2.x switched to eager (define-by-run) execution by default, converging on the dynamic approach described here.
Interactive: Computational Graph Builder
The interactive visualizer here steps through a forward pass, showing how PyTorch builds the computational graph as each operation executes, then computes gradients during the backward pass. Its running example:
```python
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x  # dy/dx = 2x + 3 = 7 at x = 2
```
The backward() Method
The backward() method triggers gradient computation by traversing the computational graph in reverse order, applying the chain rule at each step.
The Chain Rule in Action
Let's trace through how autograd applies the chain rule:
```python
# Compute gradient of z = (x + y)^2 where x=2, y=3
x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)

# Forward pass: build the graph
s = x + y   # s = 5, grad_fn=AddBackward0
z = s ** 2  # z = 25, grad_fn=PowBackward0

# Backward pass: apply chain rule
z.backward()

# dz/dx = dz/ds * ds/dx = 2s * 1 = 2*5 = 10
print(x.grad)  # tensor([10.])

# dz/dy = dz/ds * ds/dy = 2s * 1 = 2*5 = 10
print(y.grad)  # tensor([10.])
```
The chain rule states that for z = f(s) with s = g(x), dz/dx = (dz/ds) · (ds/dx).
Gradient Accumulation
One of the most common sources of bugs in PyTorch is forgetting that gradients accumulate by default. Each call to backward() adds to existing gradients rather than replacing them.
Critical: Gradient Accumulation
Gradients are summed into a leaf tensor's .grad attribute on every backward pass. You MUST call optimizer.zero_grad() or manually zero gradients before each backward pass in a standard training loop.
```python
x = torch.tensor([2.0], requires_grad=True)

# First backward
y1 = x ** 2
y1.backward()
print(x.grad)  # tensor([4.])

# Second backward WITHOUT zeroing
y2 = x ** 2
y2.backward()
print(x.grad)  # tensor([8.]) - ACCUMULATED! 4 + 4 = 8

# This is a bug in most training loops!

# Correct approach: zero gradients
x.grad.zero_()  # or x.grad = None
y3 = x ** 2
y3.backward()
print(x.grad)  # tensor([4.]) - correct!
```
Why Accumulation?
Gradient accumulation is actually a feature, not a bug. It's useful for:
- Gradient accumulation across mini-batches: When GPU memory is limited, process multiple small batches and accumulate gradients before updating.
- Multi-task learning: Compute gradients from multiple loss functions and combine them before optimization.
```python
# Simulate a large batch using gradient accumulation
accumulation_steps = 4
optimizer.zero_grad()

for i, batch in enumerate(dataloader):
    output = model(batch)
    loss = criterion(output) / accumulation_steps  # Scale loss
    loss.backward()  # Accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # Update with accumulated gradients
        optimizer.zero_grad()  # Reset for next accumulation
```
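The multi-task case can be sketched the same way. Here the model, data, and both loss terms are stand-ins invented for illustration; the point is that two backward passes sum their gradients into the same `.grad` buffers before a single optimizer step.

```python
import torch
import torch.nn as nn

# Hypothetical two-task setup: one shared model, two loss terms
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)

optimizer.zero_grad()
out = model(x)
loss_a = out.pow(2).mean()   # stand-in for task A's loss
loss_b = out.abs().mean()    # stand-in for task B's loss

loss_a.backward(retain_graph=True)  # gradients from task A
loss_b.backward()                   # task B's gradients ADD to them
optimizer.step()                    # one update from the combined gradient
```

Note the `retain_graph=True` on the first call: both losses share the graph through `out`, so it must survive until the second backward pass.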
Controlling Gradient Computation
PyTorch provides several mechanisms to control when and how gradients are computed:
1. torch.no_grad()
A context manager that disables gradient computation. Use during inference to save memory and computation:
```python
model.eval()  # Set model to evaluation mode

# Inference without gradients
with torch.no_grad():
    predictions = model(test_data)
    # No computational graph is built
    # Tensors inside won't have grad_fn
    print(predictions.requires_grad)  # False

# Decorator version
@torch.no_grad()
def inference(model, data):
    return model(data)
```
2. tensor.detach()
Creates a new tensor that shares data but is detached from the computational graph:
```python
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# Detach y from the graph
y_detached = y.detach()

print(y.requires_grad)           # True
print(y_detached.requires_grad)  # False
print(y_detached.grad_fn)        # None

# Common use: stop gradient flow in part of a network
# (a method of an nn.Module that defines self.layer1 and self.layer2)
def forward_with_stop_gradient(self, x):
    # First part: gradients flow
    h1 = self.layer1(x)

    # Stop gradients here
    h1_stopped = h1.detach()

    # Second part: no gradients reach layer1
    h2 = self.layer2(h1_stopped)
    return h2
```
3. retain_graph
By default, the computational graph is freed after backward(). Use retain_graph=True to keep it for multiple backward passes:
```python
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# First backward - graph is retained
y.backward(retain_graph=True)
print(x.grad)  # tensor([4.])

x.grad.zero_()  # Reset gradient

# Second backward on same graph
y.backward()  # Works because graph was retained
print(x.grad)  # tensor([4.])

# Without retain_graph=True, the second backward would fail!
```
4. torch.enable_grad()
Re-enables gradient computation inside a no_grad block:
```python
with torch.no_grad():
    x = torch.tensor([1.0], requires_grad=True)
    y = x * 2  # No gradient tracking

    with torch.enable_grad():
        z = x * 3  # Gradient tracking re-enabled
        print(z.requires_grad)  # True
```
Quick Check
What happens if you call backward() twice on the same tensor without retain_graph=True?
Under the Hood
Let's peek under the hood to understand how autograd actually works:
The Function Class
Every operation in PyTorch is implemented as a subclass of torch.autograd.Function with two methods:
- forward(): Computes the output given inputs
- backward(): Computes gradients given the gradient of the output
```python
class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # ctx is a context object for saving info for backward
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output is dL/d(output)
        # We need to return dL/dx = dL/d(output) * d(output)/dx
        x, = ctx.saved_tensors
        return grad_output * 2 * x  # d(x^2)/dx = 2x

# Usage
x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)  # y = 9
y.backward()
print(x.grad)  # tensor([6.]) = 2 * 3
```
The Backward Pass Algorithm
When you call backward(), PyTorch:
- Starts at the output tensor with gradient = 1 (for a scalar) or the provided gradient
- Traverses the graph in reverse topological order
- At each node, calls the backward() method of its grad_fn
- Multiplies the incoming gradient by the local gradient (chain rule)
- Propagates the result to input nodes
- Accumulates gradients at leaf nodes in their .grad attribute
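The first step above explains why backward() on a non-scalar tensor needs an explicit gradient argument: there is no natural "1" to start from, so you must supply the seed vector for the vector-Jacobian product.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2            # non-scalar output

# y.backward() alone raises "grad can be implicitly created only
# for scalar outputs". Supply the seed vector v for v^T @ J:
v = torch.tensor([1.0, 1.0, 1.0])
y.backward(v)
print(x.grad)         # tensor([2., 4., 6.]) = v * dy/dx = 1 * 2x
```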
Practical Examples
Example 1: Neural Network Training Loop
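The code for this example is missing here; a minimal sketch of a standard loop, with synthetic data and illustrative shapes, might look like this:

```python
import torch
import torch.nn as nn

# Synthetic regression data (shapes chosen for illustration)
X = torch.randn(100, 10)
y_true = torch.randn(100, 1)

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):
    optimizer.zero_grad()             # 1. reset accumulated gradients
    y_pred = model(X)                 # 2. forward pass builds the graph
    loss = criterion(y_pred, y_true)
    loss.backward()                   # 3. reverse pass fills .grad
    optimizer.step()                  # 4. update parameters
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```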
Example 2: Custom Gradient Modification
```python
# Gradient clipping
x = torch.randn(100, requires_grad=True)
y = (x ** 2).sum()
y.backward()

# Clip gradients to prevent explosion
torch.nn.utils.clip_grad_norm_([x], max_norm=1.0)

# Or manually
x.grad.clamp_(-1, 1)

# Register a backward hook for custom gradient processing
def print_grad(grad):
    print(f"Gradient shape: {grad.shape}, mean: {grad.mean():.4f}")
    return grad  # Return modified gradient or None to use original

x = torch.randn(10, requires_grad=True)
x.register_hook(print_grad)
y = (x ** 2).sum()
y.backward()  # Hook is called during backward
```
Example 3: Higher-Order Gradients
```python
# Second derivative: d²y/dx²
x = torch.tensor([2.0], requires_grad=True)
y = x ** 3  # y = 8

# First derivative
grad1 = torch.autograd.grad(y, x, create_graph=True)[0]
print(grad1)  # tensor([12.]) = 3x² = 3*4 = 12

# Second derivative
grad2 = torch.autograd.grad(grad1, x)[0]
print(grad2)  # tensor([12.]) = 6x = 6*2 = 12

# Useful for: physics-informed neural networks, Hessian computation
```
Common Pitfalls
1. Forgetting zero_grad()
```python
# ❌ BUG: Gradients accumulate across iterations
for batch in dataloader:
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()  # Uses accumulated gradients!

# ✓ CORRECT
for batch in dataloader:
    optimizer.zero_grad()  # Reset first!
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
```
2. Modifying Tensors In-Place
```python
# ❌ BUG: In-place operations can break autograd
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
x.add_(1)  # RuntimeError! In-place ops on a leaf that requires grad are forbidden

# ✓ CORRECT: Use out-of-place operations
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
x = x + 1  # Creates a new tensor
# (Note: y still depends on the original x value)
```
3. Mixing Devices
```python
# ❌ BUG: Tensors on different devices
x = torch.tensor([1.0], device='cuda', requires_grad=True)
y = torch.tensor([2.0], device='cpu', requires_grad=True)
z = x + y  # RuntimeError!

# ✓ CORRECT: Same device
x = torch.tensor([1.0], device='cuda', requires_grad=True)
y = torch.tensor([2.0], device='cuda', requires_grad=True)
z = x + y  # Works
```
4. Converting to NumPy with Gradients
```python
# ❌ BUG: Can't convert a tensor with gradients to NumPy
x = torch.tensor([1.0], requires_grad=True)
y = x ** 2
y.numpy()  # RuntimeError!

# ✓ CORRECT: Detach first
y.detach().numpy()  # Works

# Or if on GPU:
y.detach().cpu().numpy()  # Works
```
Summary
In this section, we covered PyTorch's autograd system - the automatic differentiation engine that makes deep learning training possible:
| Concept | Key Points |
|---|---|
| requires_grad | Flag that enables gradient tracking; set True for learnable parameters |
| Computational Graph | Dynamic DAG built during forward pass; traversed in reverse for gradients |
| backward() | Triggers gradient computation; accumulates gradients in leaf tensor .grad |
| Gradient Accumulation | Gradients add up by default; use zero_grad() to reset |
| no_grad() | Disables gradient tracking; use for inference to save memory |
| detach() | Creates new tensor detached from graph; stops gradient flow |
| retain_graph | Keeps graph for multiple backward passes; needed for higher-order derivatives |
Key Takeaways
- Autograd builds graphs dynamically: Unlike static graph frameworks, PyTorch builds the computational graph during execution, enabling Pythonic control flow.
- Gradients accumulate by default: Always call optimizer.zero_grad() before backward() in training loops unless you intentionally want accumulation.
- Only leaf tensors store gradients: Use retain_grad() if you need gradients for intermediate computations.
- Use no_grad() for inference: This saves memory and computation by not building the computational graph.
- Understand the chain rule: Autograd applies the chain rule automatically, but understanding it helps debug gradient issues.
Exercises
Conceptual Questions
- Explain the difference between forward-mode and reverse-mode automatic differentiation. Why is reverse-mode preferred for deep learning?
- What is a leaf tensor? Why do only leaf tensors have their gradients stored in .grad?
- Describe a scenario where gradient accumulation (without zero_grad()) is actually the desired behavior.
Coding Exercises
- Manual Chain Rule: Compute the gradient of a composite function both manually using the chain rule and using autograd. Verify they match.
- Custom Autograd Function: Implement a custom autograd function for f(x) = |x| (absolute value). Handle the gradient at x = 0.
- Gradient Debugging: Given this buggy code (exercise_debug.py), identify and fix the issues:
```python
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for i in range(100):
    x = torch.randn(32, 10)
    y = model(x).sum()
    y.backward()
    optimizer.step()
    print(f"Loss: {y.item()}")
```
- Higher-Order Derivatives: Use autograd to compute the second and third derivatives of a function of your choice at x = 1.
Challenge Exercise
Build a Mini-Autograd: Implement a simplified autograd system that supports addition, multiplication, and power operations. Your implementation should:
- Define a
Tensorclass that stores values and gradients - Implement forward operations that build a computational graph
- Implement a
backward()method that computes gradients - Handle gradient accumulation for tensors used multiple times
```python
class MiniTensor:
    def __init__(self, value, requires_grad=False):
        self.value = value
        self.grad = 0.0
        self.requires_grad = requires_grad
        self._backward = lambda: None
        self._prev = set()

    def __add__(self, other):
        # Implement me!
        pass

    def __mul__(self, other):
        # Implement me!
        pass

    def backward(self):
        # Implement me!
        pass

# Test your implementation:
a = MiniTensor(2.0, requires_grad=True)
b = MiniTensor(3.0, requires_grad=True)
c = a + b  # c = 5
d = c * a  # d = 10
d.backward()
print(a.grad)  # Should be 7 (dc/da * a + c * 1 = 1*2 + 5*1 = 7)
print(b.grad)  # Should be 2 (dc/db * a = 1*2 = 2)
```
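The collapsed "Implementation Hint" was lost in extraction; as one possible hint, here is how the addition case and backward() might look (a sketch of one design, not the only one — `__mul__` is left to you):

```python
class MiniTensor:
    def __init__(self, value, requires_grad=False):
        self.value = value
        self.grad = 0.0
        self.requires_grad = requires_grad
        self._backward = lambda: None
        self._prev = set()

    def __add__(self, other):
        out = MiniTensor(self.value + other.value,
                         requires_grad=self.requires_grad or other.requires_grad)
        out._prev = {self, other}

        def _backward():
            # Local derivative of addition is 1 w.r.t. both inputs;
            # += implements gradient accumulation for reused tensors
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Seed the output gradient, then walk the graph in reverse
        # topological order so each node's grad is complete before use
        topo, visited = [], set()

        def build(t):
            if t not in visited:
                visited.add(t)
                for p in t._prev:
                    build(p)
                topo.append(t)

        build(self)
        self.grad = 1.0
        for t in reversed(topo):
            t._backward()

# Quick check on addition alone:
a = MiniTensor(2.0, requires_grad=True)
b = MiniTensor(3.0, requires_grad=True)
c = a + b
c.backward()
print(a.grad, b.grad)  # 1.0 1.0
```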
In the next section, we'll explore how to build custom autograd functions for operations that aren't supported by PyTorch out of the box, or when you need to optimize gradient computation for special cases.