Chapter 8

Backpropagation Derivation

Backpropagation from Scratch

Learning Objectives

By the end of this section, you will be able to:

  1. Understand the credit assignment problem and why we need an efficient algorithm to compute gradients
  2. Apply the chain rule to composite functions and understand how derivatives propagate backward
  3. Trace gradient flow through computation graphs and handle multiple paths
  4. Derive the four fundamental equations of backpropagation from first principles
  5. Express backpropagation in matrix form for efficient computation
  6. Connect the mathematical derivation to practical implementation
Why This Matters: Backpropagation is the algorithm that makes deep learning possible. Without it, training neural networks with millions of parameters would be computationally infeasible. Understanding its derivation reveals why neural networks can learn and demystifies the "black box" of gradient computation.

The Credit Assignment Problem

Imagine you're training a neural network with 1000 weights. When the network makes a prediction error, you want to update each weight to reduce this error. But how much should each weight change? This is the credit assignment problem: assigning credit (or blame) to each parameter for the network's output.

The Naive Approach: Numerical Gradients

One approach is to compute gradients numerically. For each weight $w_i$, we could estimate:

$$\frac{\partial L}{\partial w_i} \approx \frac{L(w_i + \epsilon) - L(w_i - \epsilon)}{2\epsilon}$$

This requires two forward passes per weight. For a network with $n$ weights, we need $2n$ forward passes, which is completely impractical for modern networks with millions of parameters.
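To make the cost concrete, here is a minimal sketch of the central-difference estimate; the quadratic toy loss is illustrative only and is not part of the network built later in this section:

```python
import numpy as np

def numerical_gradient(loss, w, eps=1e-5):
    """Estimate dL/dw_i by central differences: two loss evaluations per weight."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus = w.copy()
        w_plus.flat[i] += eps
        w_minus = w.copy()
        w_minus.flat[i] -= eps
        grad.flat[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)
    return grad

# Toy loss L(w) = 0.5 * ||w||^2, whose true gradient is w itself
w = np.array([1.0, -2.0, 3.0])
g = numerical_gradient(lambda v: 0.5 * np.sum(v ** 2), w)
print(g)  # ≈ [ 1. -2.  3.]
```

Every entry of the gradient costs two calls to the loss function, which is exactly the $2n$ forward passes counted above.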

Approach              Forward Passes           For 1M Parameters
Numerical gradients   2n                       2,000,000 passes
Backpropagation       2 (forward + backward)   2 passes

The Key Insight

Backpropagation exploits the structure of neural networks—the fact that they are composed of simple, differentiable operations connected in a graph. By applying the chain rule systematically, we can compute all gradients in a single backward pass.

Computational Efficiency

The cost of one backward pass is roughly the same as the cost of one forward pass. This means we can compute gradients for ALL parameters with just twice the computational cost of a single forward pass!

Chain Rule: The Mathematical Foundation

Backpropagation is fundamentally an application of the chain rule from calculus. Let's review this essential concept before diving into neural networks.

Single Variable Chain Rule

For a composite function $f(g(x))$, the chain rule states:

$$\frac{d}{dx}\left[f(g(x))\right] = f'(g(x)) \cdot g'(x) = \frac{df}{dg} \cdot \frac{dg}{dx}$$

In words: the derivative of the composite function equals the derivative of the outer function (evaluated at the inner function) times the derivative of the inner function.
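As a quick sanity check, the chain-rule result for $\sin(x^2)$ can be compared against a finite-difference estimate; the input value is arbitrary:

```python
import numpy as np

x = 2.0

# Chain rule: d/dx sin(x^2) = cos(x^2) * 2x  (outer derivative times inner derivative)
analytic = np.cos(x ** 2) * 2 * x

# Central-difference estimate of the same derivative
eps = 1e-6
numeric = (np.sin((x + eps) ** 2) - np.sin((x - eps) ** 2)) / (2 * eps)

print(analytic, numeric)  # both ≈ -2.6146
```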

Multivariate Chain Rule

For functions of multiple variables, if $z = f(u, v)$ where $u = g(x, y)$ and $v = h(x, y)$, then:

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial x} + \frac{\partial z}{\partial v} \cdot \frac{\partial v}{\partial x}$$

When a variable influences the output through multiple paths, we sum the contributions from each path.

The Backpropagation Principle: Multiply local gradients along each path, then sum across all paths from a variable to the output.

Interactive: Chain Rule Explorer

Explore how the chain rule works for composite functions. Adjust the input value to see how derivatives are computed step by step.

The Chain Rule: Derivatives of Composite Functions

Worked example: $f(g(x)) = \sin(x^2)$, where $g(x) = x^2$ and $f(u) = \sin(u)$.

At $x = 2$:
Function value: $f(g(2)) = \sin(4.000) = -0.7568$
Derivative (slope): $df/dx = \cos(4) \cdot 2 \cdot 2 = -2.6146$

Quick Check

For f(g(x)) = sin(x²) at x = π/4, if g'(x) = 2x and f'(u) = cos(u), which expression gives df/dx?


Gradient Flow in Computation Graphs

Neural networks can be represented as computation graphs where nodes represent operations and edges represent data flow. Understanding how gradients flow through these graphs is key to understanding backpropagation.

Forward and Backward Passes

Every computation graph supports two passes:

  1. Forward Pass: Compute the output by evaluating nodes in topological order (inputs → output)
  2. Backward Pass: Compute gradients by propagating errors in reverse topological order (output → inputs)

Local Gradients

Each operation in the graph has a local gradient—the derivative of its output with respect to each of its inputs. During the backward pass, we:

  1. Receive the gradient flowing into this node (from downstream)
  2. Multiply by the local gradient
  3. Pass the result upstream

The Sum Rule for Multiple Paths

When a variable is used multiple times (has multiple outgoing edges), the gradients from each path are summed:

$$\frac{\partial L}{\partial x} = \sum_{\text{paths}} \frac{\partial L}{\partial x}\bigg|_{\text{path}}$$

Why Sum?

Each path represents a way that $x$ influences the output. A small change in $x$ affects the output through ALL paths simultaneously, so the total effect is the sum.
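The sum rule can be traced by hand for a small graph such as $f(x, y) = (x + y) \cdot y$; this sketch assumes scalar inputs:

```python
# Forward pass for f(x, y) = (x + y) * y
x, y = 2.0, 3.0
a = x + y        # intermediate node: a = 5
f = a * y        # output node:       f = 15

# Backward pass: multiply local gradients along each path,
# then sum where a variable feeds multiple paths
df_da = y                 # local gradient of f = a * y with respect to a
df_dx = df_da * 1.0       # single path: x -> a -> f
df_dy = df_da * 1.0 + a   # two paths: y -> a -> f, plus y -> f directly

print(df_dx, df_dy)  # 3.0 8.0  (analytically: df/dx = y, df/dy = x + 2y)
```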

Interactive: Gradient Flow Demo

Watch how gradients flow backward through a computation graph. Notice how gradients are multiplied along paths and summed when paths merge.

Gradient Flow Through a Computation Graph

$f(x, y) = (x + y) \times y = xy + y^2$

Forward pass with $x = 2$, $y = 3$: the sum node gives $a = x + y = 5$, and the product node gives $f = a \times y = 15$.

Quick Check

In the graph f = (x + y) × y, if x = 2 and y = 3, what is ∂f/∂y?


Deriving Backpropagation

Now let's derive backpropagation for neural networks step by step, starting from the simplest case and building up complexity.

Single Neuron Case

Consider a single neuron with input $x$, weight $w$, bias $b$, and activation function $\sigma$:

$$z = wx + b, \quad a = \sigma(z), \quad L = \frac{1}{2}(a - y)^2$$

To find $\partial L / \partial w$, we apply the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$

Computing each term:

  • $\partial L / \partial a = a - y$ (derivative of squared error)
  • $\partial a / \partial z = \sigma'(z)$ (derivative of activation function)
  • $\partial z / \partial w = x$ (derivative of linear combination)

Therefore:

$$\frac{\partial L}{\partial w} = (a - y) \cdot \sigma'(z) \cdot x$$

Pattern Recognition

Notice the pattern: the gradient with respect to a weight equals the upstream gradient (from the loss) times the local gradient (activation derivative) times the input to that weight.
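This derivation can be verified numerically for a single sigmoid neuron; the specific values of x, w, b, and y below are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single neuron: z = w*x + b, a = sigmoid(z), L = 0.5*(a - y)^2
x, w, b, y = 0.5, 0.8, -0.2, 1.0   # illustrative values
z = w * x + b
a = sigmoid(z)

# The derived gradient: dL/dw = (a - y) * sigma'(z) * x, with sigma'(z) = a*(1 - a)
analytic = (a - y) * a * (1 - a) * x

# Finite-difference check on the same loss
def loss(w_):
    a_ = sigmoid(w_ * x + b)
    return 0.5 * (a_ - y) ** 2

eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(abs(analytic - numeric) < 1e-8)  # True
```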

Single Layer Case

Now consider a layer with multiple neurons. For the $j$-th neuron with inputs $\mathbf{x} = (x_1, \ldots, x_n)$:

$$z_j = \sum_{k=1}^{n} w_{jk} x_k + b_j, \quad a_j = \sigma(z_j)$$

The gradient with respect to weight $w_{jk}$:

$$\frac{\partial L}{\partial w_{jk}} = \frac{\partial L}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{jk}} = \frac{\partial L}{\partial a_j} \cdot \sigma'(z_j) \cdot x_k$$

We can define the error term (or delta) for neuron $j$:

$$\delta_j = \frac{\partial L}{\partial a_j} \cdot \sigma'(z_j) = \frac{\partial L}{\partial z_j}$$

Then the weight gradient becomes simply:

$$\frac{\partial L}{\partial w_{jk}} = \delta_j \cdot x_k$$

Multi-Layer Network

For a network with layers $l = 1, 2, \ldots, L$, we need to propagate error terms backward through the layers. Consider the error term for neuron $j$ in layer $l$:

$$\delta_j^{(l)} = \frac{\partial L}{\partial z_j^{(l)}}$$

By the chain rule, this error depends on how neuron $j$ in layer $l$ affects all neurons in layer $l+1$:

$$\delta_j^{(l)} = \frac{\partial L}{\partial a_j^{(l)}} \cdot \sigma'(z_j^{(l)})$$

where:

$$\frac{\partial L}{\partial a_j^{(l)}} = \sum_{i} \frac{\partial L}{\partial z_i^{(l+1)}} \cdot \frac{\partial z_i^{(l+1)}}{\partial a_j^{(l)}} = \sum_{i} \delta_i^{(l+1)} \cdot w_{ij}^{(l+1)}$$

This gives us the backward propagation equation:

$$\delta_j^{(l)} = \sigma'(z_j^{(l)}) \sum_{i} w_{ij}^{(l+1)} \delta_i^{(l+1)}$$
The Propagation Insight: The error at a hidden neuron is the weighted sum of errors from all neurons it connects to in the next layer, scaled by the activation derivative. Errors flow backward through the same weights used in the forward pass!

The Four Fundamental Equations

The entire backpropagation algorithm can be summarized in four equations, each presented below with its intuition:

The Four Fundamental Equations of Backpropagation

1. Output Layer Error

δ^L = ∇_a C ⊙ σ'(z^L)

The error at the output layer is the product of the cost gradient with respect to activations and the derivative of the activation function.

Intuition: Think of this as "how wrong are we?" multiplied by "how sensitive is the output?" The cost gradient tells us the direction to improve, and σ'(z) tells us how much the pre-activation affects the output.

2. Error Backpropagation

δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ'(z^l)

The error at layer l is computed by propagating the error from layer l+1 backward through the weights.

3. Weight Gradient

∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j

The gradient of the cost with respect to a weight is the product of the input activation and the output error.

4. Bias Gradient

∂C/∂b^l_j = δ^l_j

The gradient of the cost with respect to a bias equals the error at that neuron.

Notation:
δ^l = error at layer l
C = cost/loss function
σ = activation function
⊙ = element-wise multiplication
z^l = pre-activation at layer l
a^l = activation at layer l
W^l = weights at layer l
b^l = biases at layer l

Equation 1: Output Layer Error

$$\delta^{(L)} = \nabla_a C \odot \sigma'(z^{(L)})$$

Where $\nabla_a C$ is the gradient of the cost with respect to the output activations, and $\odot$ denotes element-wise multiplication. For mean squared error: $\nabla_a C = (a^{(L)} - y)$.

Equation 2: Error Backpropagation

$$\delta^{(l)} = ((W^{(l+1)})^T \delta^{(l+1)}) \odot \sigma'(z^{(l)})$$

This propagates the error from layer $l+1$ back to layer $l$. The transposed weight matrix carries the error signal backward through the same connections used in the forward pass.
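A minimal shape check makes the role of the transpose concrete; the layer sizes and random values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

n_l, n_next = 4, 3                             # layer l: 4 neurons, layer l+1: 3 neurons
W_next = rng.standard_normal((n_next, n_l))    # W^{(l+1)} maps layer l forward to l+1
delta_next = rng.standard_normal((n_next, 1))  # error at layer l+1
z_l = rng.standard_normal((n_l, 1))            # pre-activations at layer l

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

# Equation 2: the transpose carries a (3, 1) error back to a (4, 1) error
delta_l = (W_next.T @ delta_next) * sigmoid_prime(z_l)
print(delta_l.shape)  # (4, 1): one error term per neuron in layer l
```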

Equation 3: Weight Gradients

$$\frac{\partial C}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T$$

The weight gradient is an outer product of the error at a layer and the activations feeding into it. This makes intuitive sense: weights connecting active neurons to neurons with large errors should change the most.

Equation 4: Bias Gradients

$$\frac{\partial C}{\partial b^{(l)}} = \delta^{(l)}$$

The bias gradient equals the error term directly. Biases are like weights with a fixed input of 1.


Interactive: Backpropagation Step-by-Step

Watch backpropagation unfold step by step in a simple two-layer network. Observe how gradients are computed and flow backward from the loss to each weight.

Interactive Backpropagation

Worked example for a two-layer network with input $x = 0.50$, weights $w_1 = 0.8$, $w_2 = 1.2$, and target $y = 1$:

Forward pass: $z_1 = 0.200$, $a_1 = \sigma(z_1) = 0.550$; $z_2 = 0.160$, $a_2 = \sigma(z_2) = 0.540$; loss $L = \frac{1}{2}(a_2 - y)^2 = 0.1059$.

Backward pass: $\partial L/\partial a_2 = -0.4601$, $\partial L/\partial z_2 = -0.1143$, $\partial L/\partial w_2 = -0.0628$; $\partial L/\partial a_1 = -0.1372$, $\partial L/\partial z_1 = -0.0340$, $\partial L/\partial w_1 = -0.0170$.

Quick Check

In backpropagation, if a hidden neuron has activation a = 0.5 (sigmoid output), what is σ'(z) at that neuron?


Matrix Formulation

For efficient computation, we express backpropagation in matrix form. This enables vectorized operations and GPU acceleration.

Forward Pass (Vectorized)

For a batch of $m$ examples with input matrix $X \in \mathbb{R}^{n \times m}$:

$$Z^{(l)} = W^{(l)} A^{(l-1)} + b^{(l)}, \quad A^{(l)} = \sigma(Z^{(l)})$$

Where:

  • $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is the weight matrix
  • $A^{(l)} \in \mathbb{R}^{n_l \times m}$ is the activation matrix (each column is one example)
  • $b^{(l)} \in \mathbb{R}^{n_l \times 1}$ is broadcast across examples

Backward Pass (Vectorized)

The error terms and gradients for a batch:

$$\begin{aligned} \Delta^{(L)} &= \nabla_A C \odot \sigma'(Z^{(L)}) \\ \Delta^{(l)} &= (W^{(l+1)})^T \Delta^{(l+1)} \odot \sigma'(Z^{(l)}) \\ \frac{\partial C}{\partial W^{(l)}} &= \frac{1}{m} \Delta^{(l)} (A^{(l-1)})^T \\ \frac{\partial C}{\partial b^{(l)}} &= \frac{1}{m} \sum_{i=1}^{m} \Delta^{(l)}_i \end{aligned}$$

Averaging Over Batch

We divide by $m$ (batch size) to get the average gradient over the batch. This stabilizes training and makes learning rate selection independent of batch size.

Implementation Preview

Here's a preview of how backpropagation is implemented in code. We'll build the complete implementation in the next section.

Backpropagation Implementation
backprop_network.py
Network Architecture

The network stores layer sizes and initializes weights between consecutive layers. For example, layer_sizes=[784, 128, 10] creates a network for MNIST with 784 inputs, 128 hidden neurons, and 10 outputs.

17Xavier/He Initialization

Weights are initialized with variance scaled by 2/fan_in. This prevents gradients from vanishing or exploding in deep networks by ensuring activations have reasonable variance.

Sigmoid Activation

We clip z to [-500, 500] to prevent numerical overflow in the exponential. This is a practical consideration for numerical stability.

Sigmoid Derivative

The elegant property: σ'(z) = σ(z)(1 - σ(z)) = a(1-a). We can compute the derivative directly from the activation value, no need to store z!

Forward Pass Storage

We store ALL activations during the forward pass because we need them for the backward pass. This is the memory cost of backpropagation.

Equation 1: Output Error

δ^L = (a - y) ⊙ σ'(z^L). For MSE loss, ∇_a C = (a - y). We multiply element-wise by the activation derivative to get the error signal.

Equations 3 & 4: Output Gradients

Weight gradient = outer product of error and input activations. Bias gradient = average of errors (bias doesn't depend on input).

Equation 2: Error Backpropagation

The error is propagated backward by multiplying with the transposed weight matrix, then scaling by the activation derivative. This is the heart of backpropagation!

Example: If layer l+1 has error [0.1, -0.2] and W^T maps this to layer l, we get the "responsibility" of each hidden neuron for the downstream error.
import numpy as np

class BackpropNetwork:
    """Neural network with backpropagation implemented from scratch."""

    def __init__(self, layer_sizes):
        """Initialize network with given architecture.

        Args:
            layer_sizes: List of layer sizes, e.g., [784, 128, 10]
        """
        self.num_layers = len(layer_sizes)
        self.sizes = layer_sizes

        # Initialize weights with Xavier/He initialization
        self.weights = [
            np.random.randn(y, x) * np.sqrt(2.0 / x)
            for x, y in zip(layer_sizes[:-1], layer_sizes[1:])
        ]
        self.biases = [np.zeros((y, 1)) for y in layer_sizes[1:]]

    def sigmoid(self, z):
        """Sigmoid activation function."""
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        """Derivative of sigmoid: σ'(z) = a(1-a) where a = σ(z)."""
        return a * (1 - a)

    def forward(self, X):
        """Forward pass: compute activations for all layers.

        Args:
            X: Input data, shape (n_features, batch_size)

        Returns:
            Tuple of (activations, pre_activations) for all layers
        """
        activations = [X]
        zs = []  # Pre-activations

        a = X
        for w, b in zip(self.weights, self.biases):
            z = np.dot(w, a) + b  # Linear transformation
            zs.append(z)
            a = self.sigmoid(z)   # Activation
            activations.append(a)

        return activations, zs

    def backward(self, X, y):
        """Backward pass: compute gradients using backpropagation.

        Args:
            X: Input data, shape (n_features, batch_size)
            y: True labels, shape (n_outputs, batch_size)

        Returns:
            Tuple of (weight_gradients, bias_gradients)
        """
        m = X.shape[1]  # Batch size
        grad_w = [np.zeros_like(w) for w in self.weights]
        grad_b = [np.zeros_like(b) for b in self.biases]

        # Forward pass (store all activations)
        activations, zs = self.forward(X)

        # EQUATION 1: Output layer error
        # δ^L = ∇_a C ⊙ σ'(z^L)
        delta = (activations[-1] - y) * self.sigmoid_derivative(activations[-1])

        # EQUATIONS 3 & 4: Gradients for last layer
        grad_w[-1] = np.dot(delta, activations[-2].T) / m
        grad_b[-1] = np.mean(delta, axis=1, keepdims=True)

        # Backpropagate through hidden layers
        for l in range(2, self.num_layers):
            # EQUATION 2: Backpropagate error
            # δ^l = (W^{l+1})^T δ^{l+1} ⊙ σ'(z^l)
            delta = np.dot(self.weights[-l + 1].T, delta) * \
                    self.sigmoid_derivative(activations[-l])

            # EQUATIONS 3 & 4: Gradients for this layer
            grad_w[-l] = np.dot(delta, activations[-l - 1].T) / m
            grad_b[-l] = np.mean(delta, axis=1, keepdims=True)

        return grad_w, grad_b

    def compute_loss(self, X, y):
        """Compute mean squared error loss."""
        activations, _ = self.forward(X)
        predictions = activations[-1]
        return 0.5 * np.mean((predictions - y) ** 2)

# Example usage
network = BackpropNetwork([2, 3, 1])  # 2 inputs, 3 hidden, 1 output

# Random example
X = np.random.randn(2, 5)  # 5 examples
y = np.random.randn(1, 5)

# Compute gradients
grad_w, grad_b = network.backward(X, y)

print("Weight gradients shapes:", [g.shape for g in grad_w])
print("Bias gradients shapes:", [g.shape for g in grad_b])

Common Misconceptions

Let's address some common misconceptions about backpropagation:

Misconception 1: "Backpropagation is a learning algorithm"

Reality: Backpropagation is an algorithm for computing gradients efficiently. The learning algorithm is gradient descent (or its variants like Adam, SGD with momentum, etc.). Backpropagation provides the gradients that gradient descent uses to update weights.

Misconception 2: "Gradients flow through the network during backprop"

Reality: This is a useful mental model, but physically, backpropagation is just matrix multiplications computed in reverse order. The "flow" is a metaphor for the computational dependency structure.

Misconception 3: "We need to store all forward pass values"

Partially true: Standard backprop requires storing activations from the forward pass. However, techniques like gradient checkpointing can trade computation for memory by recomputing some activations during the backward pass.

Misconception 4: "Backpropagation doesn't work with ReLU because it's not differentiable at 0"

Reality: While ReLU isn't differentiable at exactly 0, we simply define the derivative as 0 (or sometimes 1) at that point. This works well in practice because hitting exactly 0 is rare with floating-point numbers.
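One common way to encode this convention is shown below; this is a sketch, and frameworks may choose the derivative at exactly 0 differently:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    # Convention: take the derivative at z == 0 to be 0
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))             # [0. 0. 3.]
print(relu_derivative(z))  # [0. 0. 1.]
```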

Dead Neurons

While the derivative at 0 isn't a problem, ReLU can cause "dead neurons"—neurons that always output 0 and therefore never update. This happens when a neuron's input is always negative. Techniques like Leaky ReLU address this issue.
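A minimal Leaky ReLU sketch, using the common default slope of 0.01 for negative inputs:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    # Negative inputs keep a small gradient alpha, so the neuron can still update
    return np.where(z > 0, 1.0, alpha)

z = np.array([-10.0, 0.5])
print(leaky_relu(z))             # [-0.1  0.5]
print(leaky_relu_derivative(z))  # [0.01 1.  ]
```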

Summary

Backpropagation is the efficient algorithm that makes training deep neural networks possible. Let's review the key concepts:

Concept             Key Insight                                  Mathematical Form
Chain Rule          Derivatives of composed functions multiply   df/dx = (df/dg)(dg/dx)
Local Gradients     Each operation knows its own derivatives     ∂output/∂input
Gradient Flow       Multiply along paths, sum across paths       Σ(product of local gradients)
Error Terms (δ)     Gradient of loss w.r.t. pre-activation       δ = ∂L/∂z
Weight Gradients    Error times input activation                 ∂L/∂w = δ × a_in
Error Backprop      Weighted sum of downstream errors            δ^l = (W^{l+1})^T δ^{l+1} ⊙ σ'

The Algorithm in Three Steps

  1. Forward Pass: Compute and store all activations from input to output
  2. Compute Output Error: Calculate $\delta^{(L)}$ from the loss gradient and activation derivative
  3. Backward Pass: Propagate errors backward, computing weight and bias gradients at each layer
The Big Picture: Backpropagation is simply the chain rule applied systematically to a computation graph. Its elegance lies in computing ALL gradients with just one additional pass through the network, enabling the training of networks with millions of parameters.

Exercises

Conceptual Questions

  1. Explain why the gradient with respect to a weight depends on both the error at that neuron AND the activation feeding into it. Give an intuitive explanation.
  2. In the error backpropagation equation, why do we use the transpose of the weight matrix?
  3. What happens to the gradient flow if a neuron's activation is in the saturated region of the sigmoid (close to 0 or 1)? How does this relate to the vanishing gradient problem?
  4. Why is it important that the forward and backward pass have approximately the same computational cost? What would happen if backward was much more expensive?

Mathematical Derivations

  1. Derive the backpropagation equations for a network using the cross-entropy loss instead of MSE. What changes in the output error term?
  2. Derive the gradient of the loss with respect to a bias term. Show that it equals the error term at that neuron.
  3. For a network with ReLU activations, write out the backward pass equations. What is σ'(z) for ReLU?

Coding Exercises

  1. Gradient Checking: Implement numerical gradient computation and use it to verify that your backpropagation implementation is correct. Compare the analytical gradient to the numerical gradient for each weight.
  2. Gradient Flow Analysis: Modify the BackpropNetwork class to track and visualize the magnitude of gradients at each layer during training. Plot how gradient magnitudes change over training epochs.
  3. Batch Processing: Extend the implementation to handle batches of arbitrary size efficiently using NumPy broadcasting.

Solution Hints

  • Weight gradient intuition: If the input is zero, the weight had no influence on the output, so its gradient should be zero. If the error is zero, the output was correct, so no update is needed.
  • Transpose: $W^{(l+1)}$ maps activations forward from layer $l$ to layer $l+1$; its transpose maps error signals backward from layer $l+1$ to layer $l$. It acts as the adjoint of the forward map, not an inverse.
  • Cross-entropy: With softmax + cross-entropy, the output error simplifies beautifully to (a - y), the same as MSE with sigmoid!

Challenge Exercise

Implement Gradient Checkpointing: Modify the BackpropNetwork class to use gradient checkpointing—instead of storing all activations, only store activations at every k-th layer and recompute the others during the backward pass. Measure the memory vs. computation trade-off.


In the next section, we'll implement backpropagation from scratch in pure Python/NumPy, and then see how PyTorch's autograd system does all of this automatically.