Learning Objectives
By the end of this section, you will be able to:
- Understand computational graphs as the fundamental data structure for representing neural network computations
- Visualize forward and backward passes as information flowing through the graph in opposite directions
- Apply the chain rule on graphs to systematically compute gradients for any computation
- Explain how autograd systems work by recording operations and replaying them in reverse
- Analyze gradient flow patterns including addition, multiplication, and branching operations
- Understand the memory vs computation trade-offs in modern deep learning frameworks
Why This Matters: Computational graphs are the secret weapon of modern deep learning. They transform the complex problem of computing gradients through millions of parameters into a simple, systematic algorithm. Every major deep learning framework—PyTorch, TensorFlow, JAX—is built on this foundation.
The Big Picture
Imagine you need to find the derivative of a complex function involving dozens of operations and millions of parameters. Doing this by hand would be impossibly tedious and error-prone. Computational graphs provide an elegant solution: represent the computation as a directed graph, then compute gradients automatically using the chain rule.
The Origin Story
The idea of using graphs to represent computations dates back to the 1960s, but the breakthrough for neural networks came in the 1980s with the development of backpropagation. Researchers realized that by structuring computations as directed acyclic graphs (DAGs), they could:
- Compute outputs efficiently in a single forward pass
- Compute all gradients efficiently in a single backward pass
- Reuse intermediate values, avoiding redundant computation
This insight is what makes training neural networks with billions of parameters feasible. Without computational graphs and efficient backpropagation, modern AI would not exist.
Two Key Insights
| Insight | Explanation | Consequence |
|---|---|---|
| Modularity | Complex functions can be decomposed into simple operations | Each operation only needs a local gradient |
| Chain Rule | Gradients flow through the graph via multiplication | Global gradients computed from local ones |
What Is a Computational Graph?
A computational graph is a directed graph where:
- Nodes represent values (inputs, intermediate results, outputs)
- Edges represent operations or data flow between values
A Simple Example
Consider the function L = (wx + b − t)², the squared error loss for a single linear-regression example with prediction y = wx + b and target t. We can break this into elementary operations: z = wx, then y = z + b, then d = y − t, and finally L = d².
Each operation becomes a node in our graph, with edges showing the flow of data from inputs to outputs. This decomposition is key: each elementary operation has a simple, known gradient.
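The decomposition can be traced in a few lines of plain Python, using the same values as the interactive example later in this section (x=2, w=3, b=1, target=5):

```python
# Forward pass for L = (w*x + b - t)**2, one elementary operation per line.
x, w, b, t = 2.0, 3.0, 1.0, 5.0

z = w * x   # multiply node: z = 6
y = z + b   # add node:      y = 7 (the prediction)
d = y - t   # subtract node: d = 2
L = d ** 2  # square node:   L = 4

print(z, y, d, L)  # 6.0 7.0 2.0 4.0
```

Each line corresponds to one node of the graph, and each cached value (z, y, d) will be needed again during the backward pass.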
Why Graphs?
Using a graph structure provides several advantages:
- Automatic differentiation: Gradients computed systematically, not derived by hand
- Efficiency: Intermediate values cached during forward pass, reused in backward pass
- Generality: Works for any computation that can be expressed as a DAG
- Parallelization: Independent operations can be computed in parallel
Anatomy of a Computational Graph
Let's examine the components of a computational graph in detail:
Node Types
| Node Type | Description | Example |
|---|---|---|
| Input | External values provided to the computation | x, w, b, t (training data and parameters) |
| Operation | Computes a value from its inputs | +, ×, σ, ReLU, matmul |
| Output | Final result of the computation | Loss L, prediction ŷ |
Edge Information
Each edge in the graph carries two pieces of information:
- Forward direction: The value flowing from one node to the next
- Backward direction: The local gradient
The Gradient Function
Every operation in the graph has an associated gradient function (or grad_fn in PyTorch terminology). This function computes how the operation's output changes with respect to each input.
Local vs Global Gradients
A local gradient is the derivative of a single operation's output with respect to one of its immediate inputs (e.g., ∂z/∂x for z = wx). The global gradient ∂L/∂x is what we actually want: it chains local gradients along every path from the loss all the way back to x.
Interactive: Forward and Backward Pass
Explore the computational graph for L = (wx + b − t)². In the forward pass, values flow from inputs to outputs. In the backward pass, gradients flow from outputs back to inputs.
Interactive Computational Graph
Visualize forward pass (computing values) and backward pass (computing gradients)
Input values
Start with inputs: x=2, w=3, b=1, target=5
Computation Summary
Quick Check
In the backward pass, what is the gradient ∂L/∂z at the node z = wx when ∂L/∂y = 2(y-t) and y = z + b?
The Chain Rule on Graphs
The chain rule is the mathematical foundation of backpropagation. On computational graphs, it takes a particularly elegant form.
Single Path
For a simple chain x → u → y, where u = g(x) and y = f(u), the chain rule gives:
dy/dx = (dy/du) · (du/dx)
This is just multiplying local gradients along the path from x to y.
Multiple Paths
When a node has multiple outgoing edges (its value is used by multiple downstream operations), we sum the gradients from all paths.
This is the multivariate chain rule. If x influences L through multiple intermediate nodes z₁, …, zₙ:
∂L/∂x = Σᵢ (∂L/∂zᵢ) · (∂zᵢ/∂x)
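A quick numeric illustration of summing across paths (plain Python, verified with a finite-difference check): x reaches y through two intermediate nodes, and each path contributes one product of local gradients.

```python
# y = a + b with a = x**2 and b = 3*x, so x reaches y via two paths:
#   path 1: dy/da * da/dx = 1 * 2x
#   path 2: dy/db * db/dx = 1 * 3
# The multivariate chain rule sums them: dy/dx = 2x + 3.
x = 2.0
a = x ** 2
b = 3.0 * x
y = a + b

dy_dx = 2.0 * x + 3.0  # sum of per-path products

# Finite-difference check of the analytic gradient
eps = 1e-6
y_eps = (x + eps) ** 2 + 3.0 * (x + eps)
numeric = (y_eps - y) / eps
assert abs(dy_dx - numeric) < 1e-4
print(dy_dx)  # 7.0
```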
The Backward Pass Algorithm
The backward pass computes gradients efficiently by:
- Starting at the output with ∂L/∂L = 1
- Processing nodes in reverse topological order
- At each node, multiplying incoming gradient by local gradient
- Accumulating gradients when paths merge (multiple outgoing edges)
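The steps above can be sketched explicitly for L = (wx + b − t)², caching forward values and then walking the nodes in reverse order (a hand-unrolled sketch, not framework code):

```python
# Forward pass for L = (w*x + b - t)**2, caching intermediate values.
x, w, b, t = 2.0, 3.0, 1.0, 5.0
z = w * x
y = z + b
d = y - t
L = d ** 2

# Backward pass: seed with dL/dL = 1, then process nodes in reverse
# topological order, multiplying by each operation's local gradient.
grad = {"L": 1.0}
grad["d"] = grad["L"] * 2 * d  # square: local gradient is 2d
grad["y"] = grad["d"] * 1.0    # subtract: local gradient wrt y is 1
grad["z"] = grad["y"] * 1.0    # add: local gradient wrt z is 1
grad["b"] = grad["y"] * 1.0    # add: local gradient wrt b is 1
grad["w"] = grad["z"] * x      # multiply: swap in the other input
grad["x"] = grad["z"] * w

print(grad["w"], grad["x"], grad["b"])  # 8.0 12.0 4.0
```

One pass in each direction produces every gradient; no value is recomputed.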
Efficiency
A single backward pass computes the gradient with respect to every node at a cost comparable to one forward pass; the work scales with the number of edges in the graph. Computing the same gradients one input at a time would instead require a separate pass per parameter.
Interactive: Chain Rule Exploration
Explore how the chain rule applies to different computational graph structures. Click on different examples to see single-path chains, multi-path graphs, and neural network layers.
Chain Rule on Computational Graphs
Explore how the chain rule propagates gradients through different graph structures
f(g(x)) - single path, direct chain rule application
Applying the Chain Rule:
Key Insight: Local Gradients
Each edge carries a local gradient (∂output/∂input for that operation). Backpropagation multiplies these local gradients along each path from output to input.
Quick Check
If x feeds into both a = x² and b = x³, and the output is y = a + b, what is ∂y/∂x?
Key Gradient Flow Patterns
Understanding how gradients flow through common operations is essential for debugging neural networks and designing architectures.
Addition Gate
For z = x + y, the local gradients are ∂z/∂x = 1 and ∂z/∂y = 1.
Pattern: Addition acts as a gradient distributor. The incoming gradient is passed unchanged to both inputs.
Multiplication Gate
For z = x × y, the local gradients are ∂z/∂x = y and ∂z/∂y = x.
Pattern: Multiplication acts as a gradient swapper. Each input receives the gradient multiplied by the other input's value.
Max Gate
For z = max(x, y), the local gradient is ∂z/∂x = 1 if x > y (else 0), and symmetrically for y.
Pattern: Max acts as a gradient router. The full gradient flows only to the larger input; the smaller input receives zero gradient.
Copy/Fan-out
When a value x is copied to several consumers z₁, …, zₙ, each copy passes its gradient back and the contributions sum: ∂L/∂x = Σᵢ ∂L/∂zᵢ.
Pattern: Copying acts as a gradient accumulator. Gradients from all uses are summed.
| Operation | Forward | Backward Pattern |
|---|---|---|
| Add: z = x + y | z = x + y | ∂z/∂x = ∂z/∂y = 1 (distribute) |
| Multiply: z = xy | z = x × y | ∂z/∂x = y, ∂z/∂y = x (swap) |
| Max: z = max(x,y) | z = max(x,y) | Gradient to winner only (route) |
| Copy: z₁=z₂=x | Copy x | Sum all incoming gradients (accumulate) |
| ReLU: z = max(0,x) | z = max(0,x) | Pass if x>0, else 0 |
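All four patterns in the table can be confirmed numerically with finite differences, which is also a handy debugging trick when a hand-derived gradient looks suspicious:

```python
# Numerically confirm the gradient-flow patterns with finite differences.
def num_grad(f, x, y, eps=1e-6):
    """Approximate (df/dx, df/dy) at the point (x, y)."""
    return ((f(x + eps, y) - f(x, y)) / eps,
            (f(x, y + eps) - f(x, y)) / eps)

x, y = 3.0, 5.0

# Add distributes: both inputs receive gradient 1.
gx, gy = num_grad(lambda a, b: a + b, x, y)
assert abs(gx - 1) < 1e-4 and abs(gy - 1) < 1e-4

# Multiply swaps: each input receives the other input's value.
gx, gy = num_grad(lambda a, b: a * b, x, y)
assert abs(gx - y) < 1e-4 and abs(gy - x) < 1e-4

# Max routes: full gradient to the larger input only (y = 5 wins here).
gx, gy = num_grad(lambda a, b: max(a, b), x, y)
assert gx == 0.0 and abs(gy - 1) < 1e-4

# Fan-out accumulates: z1 = 2x and z2 = x**2 both use x, so dy/dx = 2 + 2x.
gx, _ = num_grad(lambda a, b: 2 * a + a ** 2, x, y)
assert abs(gx - (2 + 2 * x)) < 1e-4

print("all patterns confirmed")
```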
Dead ReLU Problem
See Chapter 9 Section 7
Autograd: Automatic Differentiation
Modern deep learning frameworks implement automatic differentiation (autograd) to compute gradients without manual derivation. Understanding how this works demystifies framework internals.
The Two Phases
- Forward Pass (Tape Recording): As operations execute, the framework records each operation, its inputs, and its output onto a "tape" (also called a "trace" or "computational graph").
- Backward Pass (Tape Playback): Starting from the output, the framework walks the tape in reverse, applying the chain rule at each step to compute gradients.
What Gets Stored?
For each operation, the tape stores:
- The operation type (add, multiply, matmul, etc.)
- References to input tensors
- The output tensor
- A gradient function (grad_fn) that knows how to compute local gradients
- Any intermediate values needed for the backward pass
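A toy tape can make this concrete. The sketch below is illustrative plain Python, not PyTorch's actual internals: each forward operation appends a record of its output and the local gradients with respect to its inputs; backward() walks the tape in reverse, multiplying and accumulating.

```python
# A toy autograd "tape". Each record is (output name, [(input name, local
# gradient value)]). backward() replays the tape in reverse, applying the
# chain rule and accumulating gradients at fan-out points.
values, tape = {}, []

def mul(out, a, b):
    values[out] = values[a] * values[b]
    tape.append((out, [(a, values[b]), (b, values[a])]))  # swap rule

def add(out, a, b):
    values[out] = values[a] + values[b]
    tape.append((out, [(a, 1.0), (b, 1.0)]))              # distribute rule

def square(out, a):
    values[out] = values[a] ** 2
    tape.append((out, [(a, 2 * values[a])]))

def backward(output):
    grads = {output: 1.0}                  # seed: dL/dL = 1
    for out, inputs in reversed(tape):     # reverse topological order
        for name, local in inputs:
            grads[name] = grads.get(name, 0.0) + grads[out] * local
    return grads

# L = (w*x + b - t)**2 with the target folded in as neg_t = -t.
values.update(x=2.0, w=3.0, b=1.0, neg_t=-5.0)
mul("z", "w", "x")        # z = 6
add("y", "z", "b")        # y = 7
add("d", "y", "neg_t")    # d = 2
square("L", "d")          # L = 4

g = backward("L")
print(g["w"], g["x"], g["b"])  # 8.0 12.0 4.0
```

Note how `grads.get(name, 0.0) + ...` implements gradient accumulation: if a value fed several operations, its gradient is the sum over all of them.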
Define-by-Run vs Define-and-Run
| Paradigm | Framework | How It Works |
|---|---|---|
| Define-by-Run | PyTorch, JAX | Graph built dynamically during execution |
| Define-and-Run | TensorFlow 1.x | Graph defined first, then executed |
PyTorch's define-by-run approach means the graph is constructed fresh for each forward pass, enabling dynamic computation (different graphs for different inputs) and easier debugging.
Interactive: How Autograd Works
Step through PyTorch code and see how the computational tape is built during the forward pass and traversed during the backward pass.
How Autograd Works: The Tape
Understand how PyTorch's automatic differentiation records operations and computes gradients
```python
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
```
Initialize Inputs
Create tensors with requires_grad=True to track gradients
When requires_grad=True, PyTorch will record operations on these tensors.
1. Forward Pass
Compute output values while recording operations to the tape
2. Build Graph
Store local gradients (grad_fn) at each node for backward pass
3. Backward Pass
Walk tape in reverse, multiply local gradients via chain rule
Quick Check
In PyTorch, when does the computational graph get built?
Implementation in PyTorch
Let's see how computational graphs work in PyTorch, from basic tensor operations to understanding the grad_fn chain.
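As a concrete sketch (assuming PyTorch is installed; the exact grad_fn class names printed can vary across versions), we can build the section's running example and inspect the recorded graph before running the backward pass:

```python
import torch

# Leaf tensors: requires_grad=True tells autograd to track operations.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
t = torch.tensor(5.0)  # target; no gradient needed

L = (w * x + b - t) ** 2          # forward pass builds the graph

print(L.grad_fn)                  # e.g. <PowBackward0 ...>
print(L.grad_fn.next_functions)   # parents of the final node

L.backward()                      # walk the graph in reverse
print(w.grad.item(), x.grad.item(), b.grad.item())  # 8.0 12.0 4.0
```

The gradients match the hand-computed backward pass from earlier: ∂L/∂w = 8, ∂L/∂x = 12, ∂L/∂b = 4.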
Gradient Accumulation
An important detail: gradients in PyTorch are accumulated, not replaced. If you call backward() multiple times without zeroing gradients, they add up:
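A minimal demonstration (assuming PyTorch is installed):

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

loss = (w * 2.0) ** 2   # loss = 4w**2, so dloss/dw = 8w = 24
loss.backward()
print(w.grad.item())    # 24.0

loss = (w * 2.0) ** 2   # same computation, fresh graph
loss.backward()
print(w.grad.item())    # 48.0 -- accumulated, not replaced!

w.grad.zero_()          # what optimizer.zero_grad() does per parameter
loss = (w * 2.0) ** 2
loss.backward()
print(w.grad.item())    # 24.0 again
```

Accumulation is deliberate (it enables, e.g., gradient accumulation across micro-batches), but it must be reset explicitly between optimizer steps.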
Training Loop Pattern
`optimizer.zero_grad()`, then `loss.backward()`, then `optimizer.step()`. This pattern prevents gradient accumulation bugs.
Common Operations and Their Gradients
Here are the gradient rules for common neural network operations. Understanding these helps debug gradient flow issues.
| Operation | Forward | Local Gradient |
|---|---|---|
| Add | z = x + y | ∂z/∂x = 1, ∂z/∂y = 1 |
| Subtract | z = x - y | ∂z/∂x = 1, ∂z/∂y = -1 |
| Multiply | z = xy | ∂z/∂x = y, ∂z/∂y = x |
| Divide | z = x/y | ∂z/∂x = 1/y, ∂z/∂y = -x/y² |
| Power | z = xⁿ | ∂z/∂x = nxⁿ⁻¹ |
| Exp | z = eˣ | ∂z/∂x = eˣ = z |
| Log | z = ln(x) | ∂z/∂x = 1/x |
| Sigmoid | z = σ(x) | ∂z/∂x = σ(x)(1-σ(x)) = z(1-z) |
| Tanh | z = tanh(x) | ∂z/∂x = 1 - tanh²(x) = 1 - z² |
| ReLU | z = max(0,x) | ∂z/∂x = 1 if x>0 else 0 |
| Softmax | zᵢ = eˣⁱ/Σeˣʲ | Complex (Jacobian matrix) |
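Two of the most useful rows, the sigmoid and tanh identities expressed in terms of the output z, can be checked in plain Python with a finite-difference comparison:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, eps = 0.7, 1e-6

# Sigmoid: derivative should equal z * (1 - z)
z = sigmoid(x)
numeric = (sigmoid(x + eps) - z) / eps
assert abs(numeric - z * (1 - z)) < 1e-5

# Tanh: derivative should equal 1 - z**2
z = math.tanh(x)
numeric = (math.tanh(x + eps) - z) / eps
assert abs(numeric - (1 - z * z)) < 1e-5

print("identities hold")
```

Expressing these derivatives in terms of the output z is what lets frameworks reuse the cached forward value instead of recomputing the activation.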
Matrix Operations
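For matrix multiplication Z = XW with upstream gradient G = ∂L/∂Z, the standard results are ∂L/∂X = GWᵀ and ∂L/∂W = XᵀG. This is the scalar "swap" pattern again, with transposes inserted so the shapes line up. A minimal plain-Python sketch checks this for 2×2 matrices with L = sum of Z's entries (so G is all ones):

```python
# Verify dL/dX = G @ W.T and dL/dW = X.T @ G for Z = X @ W (2x2 case).
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

X = [[1.0, 2.0], [3.0, 4.0]]
W = [[5.0, 6.0], [7.0, 8.0]]
G = [[1.0, 1.0], [1.0, 1.0]]  # dL/dZ when L = sum of Z's entries

dX = matmul(G, transpose(W))  # dL/dX = G W^T
dW = matmul(transpose(X), G)  # dL/dW = X^T G

# Finite-difference check on one entry of W
eps = 1e-6
L0 = sum(sum(row) for row in matmul(X, W))
W[0][1] += eps
L1 = sum(sum(row) for row in matmul(X, W))
assert abs((L1 - L0) / eps - dW[0][1]) < 1e-3

print(dW)  # [[4.0, 4.0], [6.0, 6.0]]
```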
Memory vs Computation Trade-offs
Building computational graphs comes with memory costs. Understanding these trade-offs is crucial for training large models.
What Gets Stored?
During the forward pass, activations (intermediate values) must be saved for the backward pass. For a deep network, the stored state includes:
- Activations: All intermediate outputs (grows with batch size and model width)
- Graph structure: Links between operations (relatively small)
- Intermediate values: Values needed for gradients (e.g., the batch mean and variance for BatchNorm)
Memory-Saving Techniques
| Technique | How It Works | Trade-off |
|---|---|---|
| Gradient checkpointing | Don't save all activations; recompute during backward | 2× compute, much less memory |
| Mixed precision | Use FP16 for forward, FP32 for gradients | ~2× memory savings |
| Activation compression | Compress stored activations | Some precision loss |
| torch.no_grad() | Don't build graph for inference | No gradients available |
```python
# Disable graph building for inference
with torch.no_grad():
    output = model(input)  # No grad_fn, saves memory

# Gradient checkpointing for memory-efficient training
from torch.utils.checkpoint import checkpoint
output = checkpoint(expensive_layer, input)
```
Knowledge Check
Test your understanding of computational graphs with this quiz:
Knowledge Check
In a computational graph, what does each node represent?
Summary
Computational graphs are the foundation of modern deep learning, enabling efficient and automatic gradient computation.
Key Concepts
| Concept | Key Insight | Formula/Example |
|---|---|---|
| Computational Graph | DAG representing computation | Nodes = values, Edges = operations |
| Forward Pass | Compute outputs, build graph | Values flow input → output |
| Backward Pass | Compute gradients via chain rule | Gradients flow output → input |
| Local Gradient | Derivative at single operation | ∂output/∂input for each edge |
| Chain Rule | Multiply along paths, sum across paths | ∂L/∂x = Σ(∏ local gradients) |
| Autograd | Automatic graph construction + traversal | requires_grad=True enables tracking |
Key Takeaways
- Computational graphs decompose complex functions into simple, differentiable operations
- The forward pass computes values while building the graph; backward pass computes gradients
- The chain rule on graphs: multiply local gradients along paths, sum across multiple paths
- Autograd systems record operations to a "tape" and replay in reverse for gradients
- Understanding gradient flow patterns (add distributes, multiply swaps, max routes) helps debugging
- Memory-computation trade-offs are important for training large models
Looking Ahead
Now that you understand computational graphs, you're ready to see the full backpropagation algorithm in the next section. We'll derive the exact equations for gradient computation in multi-layer networks and implement backprop from scratch.
Exercises
Conceptual Questions
- Draw the computational graph for f = log(σ(wx + b)), where σ is the sigmoid function. Label all intermediate nodes.
- For the graph in question 1, write out the local gradient at each edge. Then compute ∂f/∂w using the chain rule.
- Explain why the gradient at a node that splits into multiple branches must be the sum of gradients from all branches, not the product or average.
- If a ReLU neuron always outputs 0 (its input is always negative), what happens to its gradient? Why is this called a "dead neuron"?
Coding Exercises
- Manual Gradient Check: Create a simple computation in PyTorch, e.g., L = (wx + b − t)². Compute the gradients using autograd, then verify by computing each gradient manually.
- Graph Exploration: Build a small neural network (2 layers) in PyTorch. After a forward pass, explore the `grad_fn` chain by following `output.grad_fn.next_functions`. Print the operation names.
- Memory Analysis: Create tensors of increasing size with and without `requires_grad=True`. Measure memory usage to understand the overhead of gradient tracking.
Solution Hints
- For Q1: You'll need nodes for wx, wx+b, σ(·), and log(·)
- For Q3: Think of x contributing to L through multiple independent paths
- For coding: Use `torch.cuda.memory_allocated()` or the `memory_profiler` package
Challenge Exercise
Implement a Mini Autograd: Create a simple Variable class that:
- Stores a value and tracks whether it requires gradients
- Implements `__add__`, `__mul__`, and `__pow__` that return new Variables and record the operation
- Has a `backward()` method that computes gradients using the chain rule
Start with just scalar values before attempting vectors/matrices.
In the next section, we'll use computational graphs to derive the backpropagation algorithm for multi-layer neural networks—the core algorithm that makes deep learning possible.