Cross-Correlation vs Convolution
In signal-processing textbooks, convolution flips the kernel before sliding. In deep-learning libraries it does not. Both operations are called “convolution” — which is a recipe for confusion the first time you read a paper.
The mathematical (flipped) version
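True convolution rotates the kernel 180° (i.e. flips it both horizontally and vertically) and then slides it across the input. For an input $I$ and a kernel $K$:

$$(I * K)(i, j) = \sum_{m} \sum_{n} I(i - m,\, j - n)\, K(m, n)$$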
The flip is what gives convolution the property of commutativity: $f * g = g * f$. It also makes convolution play nicely with the Fourier transform: convolution in time/space corresponds to pointwise multiplication in frequency.
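Both properties can be checked numerically on a small 1-D example. The sketch below uses NumPy's `np.convolve`, `np.correlate`, and `np.fft`; the arrays `f` and `g` are arbitrary illustrative values, not taken from the text.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])

# True (flipped) convolution is commutative.
fg = np.convolve(f, g)  # mode="full" is the default
gf = np.convolve(g, f)

# Cross-correlation is not: swapping the arguments reverses the result.
cc_fg = np.correlate(f, g, mode="full")
cc_gf = np.correlate(g, f, mode="full")

# Convolution theorem: zero-pad both signals to the full output length,
# multiply their spectra, and transform back.
n = len(f) + len(g) - 1
via_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print("f * g        :", fg)
print("g * f        :", gf)
print("corr(f, g)   :", cc_fg)
print("corr(g, f)   :", cc_gf)
print("FFT product  :", np.round(via_fft, 6))
```

The two convolution results agree with each other and with the FFT route, while the two cross-correlation results are mirror images of one another.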
The deep-learning (unflipped) version
Deep-learning frameworks skip the flip. The operation they implement is cross-correlation:

$$(I \star K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)$$
PyTorch, TensorFlow, JAX, MXNet, Caffe — all of them call this “convolution.” The reason it does not matter in practice is that the kernel weights are learned: whatever flipping we skip, gradient descent absorbs into the weights. A kernel that would detect edges as a cross-correlation can be re-expressed as a (different) kernel that detects the same edges as a true convolution.
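The relationship between the two operations can be sketched in a few lines of NumPy. The helper names `cross_correlate2d` and `convolve2d` are hypothetical, written for this illustration only, and the input and kernel values are arbitrary; the kernel is deliberately asymmetric so that the flip makes a visible difference.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid-mode 2-D cross-correlation: slide the kernel without flipping it."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """True 2-D convolution: rotate the kernel 180 degrees, then cross-correlate."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]])  # asymmetric on purpose

corr = cross_correlate2d(image, kernel)
conv = convolve2d(image, kernel)

# Cross-correlating with the pre-flipped kernel reproduces true convolution.
flipped_matches = np.allclose(cross_correlate2d(image, np.rot90(kernel, 2)), conv)

print("corr == conv:", np.allclose(corr, conv))
print("corr with flipped kernel == conv:", flipped_matches)
```

With an asymmetric kernel the two outputs differ, but a learned layer can simply store the rotated kernel, which is why the naming difference has no effect on what a network can represent.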
Terminology alert
| Aspect | True convolution | Cross-correlation (DL) |
|---|---|---|
| Kernel flipped? | Yes (rotate 180°) | No |
| Commutative? | Yes: f * g = g * f | No |
| Where used | Signal processing, math, DSP | Deep learning (PyTorch, TF, …) |
| Matters for training? | No — weights adapt | No — weights adapt |
| Speedup via FFT? | Yes (convolution theorem) | Yes (after a flip) |
Practical shortcut
Step-by-Step Computation (NumPy)
We will do one 2-D convolution end to end — by hand, then in NumPy, then in PyTorch — so that every number you see in the output is traceable to an explicit pair of window and kernel values.
Input: a 5×5 gradient image
```python
I = [[10,  20,  30,  40,  50],
     [20,  40,  60,  80, 100],
     [30,  60,  90, 120, 150],
     [40,  80, 120, 160, 200],
     [50, 100, 150, 200, 250]]
```

Kernel: 3×3 Sobel-X (vertical-edge detector)

```python
K = [[-1, 0, 1],
     [-2, 0, 2],
     [-1, 0, 1]]
```

One position by hand: output[0, 0]
The top-left window is the 3×3 block I[0:3, 0:3]:

```
Window I[0:3, 0:3]:     Kernel K:           Element-wise product:
[[10, 20, 30],          [[-1, 0, 1],        [[-10, 0,  30],
 [20, 40, 60],     ×     [-2, 0, 2],    =    [-40, 0, 120],
 [30, 60, 90]]           [-1, 0, 1]]         [-30, 0,  90]]

Sum = -10 + 0 + 30 - 40 + 0 + 120 - 30 + 0 + 90 = 160

→ output[0, 0] = 160
```

All nine positions: the full trace
The full loop repeats the same computation at all 9 output positions: for each (i, j), take the 3×3 window I[i:i+3, j:j+3], multiply it element-wise by K, and sum the result. Nothing is hidden.
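A minimal NumPy sketch of that loop follows. It is a reconstruction for illustration, not the original interactive listing, so its line numbers will not match any referenced there. `I` and `K` are the input and kernel defined above; the other variable names are assumptions.

```python
import numpy as np

I = np.array([[10,  20,  30,  40,  50],
              [20,  40,  60,  80, 100],
              [30,  60,  90, 120, 150],
              [40,  80, 120, 160, 200],
              [50, 100, 150, 200, 250]])

K = np.array([[-1, 0, 1],
              [-2, 0, 2],
              [-1, 0, 1]])

H, W = I.shape
kh, kw = K.shape
out = np.zeros((H - kh + 1, W - kw + 1), dtype=I.dtype)

for i in range(out.shape[0]):          # output row
    for j in range(out.shape[1]):      # output column
        window = I[i:i + kh, j:j + kw]  # 3×3 block under the kernel
        product = window * K            # element-wise product
        out[i, j] = product.sum()       # weighted sum for this position

print(out)
# [[160 160 160]
#  [240 240 240]
#  [320 320 320]]
```

Each row of the output is constant, but the rows differ from one another: the left-to-right brightness step in this image is larger in lower rows, so the Sobel-X response grows from 160 to 320 going down.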
Why is the output NOT constant?
Quick Check
At position (i, j) = (1, 1), which slice of the input does the 3×3 kernel cover?
The Same Computation in PyTorch
We now replay the same 2-D convolution with torch.nn.functional.conv2d. Every number should match the NumPy version to the last digit. The point is to see how the hand-rolled loop maps onto a production framework call.
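A sketch of that replay, assuming PyTorch is installed. `F.conv2d` expects a 4-D input of shape (batch, in_channels, H, W) and a 4-D weight of shape (out_channels, in_channels, kH, kW), so the 5×5 image and 3×3 kernel are reshaped with singleton batch and channel dimensions first.

```python
import torch
import torch.nn.functional as F

I = torch.tensor([[10.,  20.,  30.,  40.,  50.],
                  [20.,  40.,  60.,  80., 100.],
                  [30.,  60.,  90., 120., 150.],
                  [40.,  80., 120., 160., 200.],
                  [50., 100., 150., 200., 250.]])

K = torch.tensor([[-1., 0., 1.],
                  [-2., 0., 2.],
                  [-1., 0., 1.]])

x = I.reshape(1, 1, 5, 5)  # (batch, in_channels, H, W)
w = K.reshape(1, 1, 3, 3)  # (out_channels, in_channels, kH, kW)

out = F.conv2d(x, w)       # defaults: stride=1, padding=0, no bias
result = out.squeeze()     # drop the singleton batch and channel dims

print(result)
# tensor([[160., 160., 160.],
#         [240., 240., 240.],
#         [320., 320., 320.]])
```

The two nested Python loops of the hand-rolled version collapse into the single `F.conv2d` call; the reshapes exist only to satisfy the framework's batch-and-channel convention.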
When to use F.conv2d vs nn.Conv2d
F.conv2d is stateless: you pass the weights in on every call. nn.Conv2d is a Module that owns its weights (and bias), exposes them through .parameters() so an optimiser can update them, and includes them in checkpoints. Use nn.Conv2d for anything learnable; use F.conv2d for one-off operations or when you want a custom weight source.
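The difference can be sketched side by side, assuming PyTorch is installed. The input tensor is arbitrary, and copying a fixed Sobel kernel into the module is done here purely so the two outputs can be compared; a real learnable layer would keep its initialised weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).reshape(1, 1, 3, 3)

# Stateless: the caller supplies the weights on every call.
out_functional = F.conv2d(x, sobel_x)

# Stateful: the module owns its weight as a registered parameter.
layer = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
with torch.no_grad():
    layer.weight.copy_(sobel_x)  # overwrite the random init for comparison
out_module = layer(x)

param_names = [name for name, _ in layer.named_parameters()]

print("same output:", torch.allclose(out_functional, out_module))
print("module parameters:", param_names)
print("module output tracks gradients:", out_module.requires_grad)
print("functional output tracks gradients:", out_functional.requires_grad)
```

Because the module's weight is a parameter, its output is part of the autograd graph by default, whereas the functional call on plain tensors is not.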
Interactive Convolution Calculator

[Interactive widget: a 5×5 input image, a selectable 3×3 kernel (default: Sobel X, which detects vertical edges), and the resulting 3×3 output. Selecting an output cell shows its step-by-step calculation, "Animate" slides the kernel across the input, and a kernel dropdown shows how different weights produce different outputs.]
Key Insight
The convolution operation slides the kernel across the input, computing a weighted sum at each position. The same kernel weights are used everywhere: this is weight sharing. Output size per side = input size − kernel size + 1 = 5 − 3 + 1 = 3, so the output is 3×3.
The calculator's kernel dropdown includes:
- Identity: output equals input (useful as a sanity check).
- Vertical edge (Sobel X): high response where brightness changes left → right.
- Horizontal edge (Sobel Y): high response where brightness changes top → bottom.
- Box blur: smooths by averaging all nine neighbours.
- Sharpen: enhances local differences, hence edges.
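The kernels in this list can be tried on a tiny image with a single vertical edge. This is a sketch under stated assumptions: the helper `correlate_valid` and the test image are hypothetical, and the weights are the commonly used 3×3 versions of each kernel. Because no padding is applied, "output equals input" for the identity kernel means the output equals the central 3×3 region of the 5×5 input.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Slide `kernel` over `image` with no flip, no padding, stride 1."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical test image: dark left columns, bright right columns.
image = np.array([[0, 0, 10, 10, 10]] * 5, dtype=float)

kernels = {
    "identity": np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float),
    "sobel_x":  np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
    "sobel_y":  np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float),
    "box_blur": np.full((3, 3), 1.0 / 9.0),
    "sharpen":  np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
}

outputs = {name: correlate_valid(image, k) for name, k in kernels.items()}

for name, out in outputs.items():
    print(name)
    print(np.round(out, 2))
```

On this image Sobel X responds only in the columns whose window straddles the edge, Sobel Y is zero everywhere because nothing changes from top to bottom, box blur softens the step, and sharpen overshoots on both sides of it.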
Kernel Effects Gallery
Different weights → different outputs, applied to the same input. The gallery below is the whole idea of classical image processing in a box: each filter was hand-designed by someone studying a specific problem.
[Interactive gallery: Kernel Effect Gallery, showing the input, kernel, and output for a selected kernel. Kernels available for comparison: Identity, Sobel X (Vertical Edges), Sobel Y (Horizontal Edges), Box Blur, Gaussian Blur, Sharpen, Laplacian, Emboss.]

Example entry, Sobel X:
- Description: detects vertical edges by computing horizontal gradients. Left pixels are subtracted from right pixels.
- Formula: $G_x = \partial I / \partial x \approx I(x+1) - I(x-1)$
- Use case in deep learning: edge detection, feature extraction in CNNs.
Key Insight
Each kernel acts as a feature detector. In CNNs, instead of hand-designing these kernels, we let the network learn optimal kernels from data. The first layers often learn edge detectors similar to Sobel, while deeper layers learn more complex patterns.
Why each kernel works
| Kernel | Weight pattern | Why it works |
|---|---|---|
| Vertical edge (Sobel X) | negative left, positive right | subtracts left column from right column → response ∝ horizontal intensity change |
| Horizontal edge (Sobel Y) | negative top, positive bottom | same idea, 90° rotated — detects horizontal edges |
| Box blur (3×3) | all 1/9 | arithmetic mean of the 9-pixel neighbourhood — suppresses noise, blurs detail |
| Gaussian blur (3×3) | bell-curve weights | weighted mean with peak at the centre — smoother than box blur for the same radius |
| Sharpen | 5 at the centre, −1 at the 4-neighbours, 0 at corners | boosts the centre relative to its neighbours — enhances edges |
| Laplacian | −4 at the centre, 1 at the 4-neighbours | discrete second derivative — responds to edges of any orientation |
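One way to see the pattern in this table is to look at the sum of each kernel's weights. The sketch below assumes the common 3×3 weight choices; in particular the Gaussian entry uses the binomial approximation [[1, 2, 1], [2, 4, 2], [1, 2, 1]] / 16, which is an assumption, since the table only says "bell-curve weights".

```python
import numpy as np

kernels = {
    "sobel_x":       np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
    "sobel_y":       np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float),
    "box_blur":      np.full((3, 3), 1.0 / 9.0),
    # Assumed binomial approximation of a Gaussian.
    "gaussian_blur": np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0,
    "sharpen":       np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
    "laplacian":     np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float),
}

# A featureless 3×3 region of constant brightness 7.
flat_patch = np.full((3, 3), 7.0)

weight_sums = {name: float(k.sum()) for name, k in kernels.items()}
flat_response = {name: float(np.sum(flat_patch * k)) for name, k in kernels.items()}

for name in kernels:
    print(f"{name:14s} sum={weight_sums[name]:+.2f}  flat response={flat_response[name]:+.2f}")
```

Kernels whose weights sum to 0 (Sobel X, Sobel Y, Laplacian) give no response on a flat region, so they fire only where intensity changes. Kernels whose weights sum to 1 (box blur, Gaussian blur, sharpen) leave a flat region's brightness unchanged.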
From hand-designed to learned
Multi-Channel Convolution
Real images have 3 channels (RGB). Intermediate feature maps can have hundreds. How does convolution handle this? The answer is a single, tidy rule: one kernel spans every input channel and produces exactly one output channel. To get multiple output channels you use multiple kernels.
RGB convolution, spelled out
- Input: shape $(3, H, W)$, i.e. 3 colour channels.
- One kernel: shape $(3, k_h, k_w)$, with separate weights for each input channel, covering the same spatial window.
- Output: shape $(H - k_h + 1,\ W - k_w + 1)$, one scalar per position.
At each output position we compute three element-wise multiply-and-sum operations (one per input channel) and add them together to get one scalar. The diagram below is the visual version of that recipe.
[Interactive diagram: Multi-Channel (RGB) Convolution. A 4×4×3 RGB input image and one 3×3×3 kernel produce a 2×2×1 output feature map; the diagram steps through the calculation at position (0, 0).]
Key Insight 1: Single Kernel Spans All Channels
One 3×3×3 kernel covers all input channels and produces one output value per position. The kernel has separate weights for R, G, and B, but their contributions are summed.
Key Insight 2: Multiple Kernels = Multiple Outputs
To produce multiple output channels (feature maps), we use multiple kernels. 64 kernels → 64 output channels. Each kernel learns different features!
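Parameter Count Formula: $k_h \times k_w \times C_{in} \times C_{out} + C_{out}$ (kernel height × kernel width × input channels × output channels, plus one bias per output channel)
Example: 3 × 3 × 3 × 64 + 64 = 1,792 parameters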
Numerical trace: a 2×2 RGB patch by one 2×2×3 kernel
The code below walks the full arithmetic with tiny integer values so you can check every multiplication in your head.
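A minimal sketch of such a trace in NumPy. The patch and kernel values are hypothetical small integers chosen for easy mental arithmetic, and the arrays are stored channels-first, i.e. with shape (3, 2, 2).

```python
import numpy as np

# 2×2 RGB patch, channels-first: patch[c] is the 2×2 block of channel c.
patch = np.array([
    [[1, 2], [3, 4]],   # R
    [[0, 1], [1, 0]],   # G
    [[2, 2], [2, 2]],   # B
])

# One 2×2×3 kernel: a separate 2×2 weight block per input channel.
kernel = np.array([
    [[1, 0], [0, 1]],    # weights applied to R
    [[1, 1], [1, 1]],    # weights applied to G
    [[0, -1], [1, 0]],   # weights applied to B
])

# Step 1: element-wise multiply and sum within each channel.
per_channel = (patch * kernel).sum(axis=(1, 2))
# R: 1*1 + 2*0 + 3*0 + 4*1    = 5
# G: 0*1 + 1*1 + 1*1 + 0*1    = 2
# B: 2*0 + 2*(-1) + 2*1 + 2*0 = 0

# Step 2: add the three channel sums into one scalar.
output = per_channel.sum()

print("per-channel sums:", per_channel)  # [5 2 0]
print("output scalar:", output)          # 7
```

Since the kernel is the same size as the patch, there is only one output position; on a larger image the same two steps repeat at every position the kernel can occupy.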
Multiple output channels
To produce $C_{out}$ output feature maps we use $C_{out}$ independent multi-channel kernels. The full weight tensor of a conv layer therefore has shape (in PyTorch's layout):

$$(C_{out},\ C_{in},\ k_h,\ k_w)$$

The first layer of many CNNs chooses $C_{out} = 64$: 64 different kernels, each watching all 3 RGB channels. Different kernels converge to detect different features: horizontal edges, vertical edges, diagonals, colour blobs, …
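The shape bookkeeping can be sketched with a naive NumPy forward pass. The function `conv_layer_forward` is a hypothetical helper written for this illustration (no padding, stride 1), the 8×8 input size is arbitrary, and the random weights stand in for learned ones.

```python
import numpy as np

def conv_layer_forward(x, weight, bias):
    """Naive multi-channel conv layer (cross-correlation, valid mode, stride 1).

    x:      (C_in, H, W)
    weight: (C_out, C_in, k_h, k_w)
    bias:   (C_out,)
    returns (C_out, H - k_h + 1, W - k_w + 1)
    """
    c_out, c_in, kh, kw = weight.shape
    _, h, w = x.shape
    out = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for o in range(c_out):                      # one kernel per output channel
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                window = x[:, i:i + kh, j:j + kw]  # spans every input channel
                out[o, i, j] = np.sum(window * weight[o]) + bias[o]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))          # RGB input, 8×8
weight = rng.standard_normal((64, 3, 3, 3)) # 64 kernels, each 3×3×3
bias = np.zeros(64)

y = conv_layer_forward(x, weight, bias)
print("input shape: ", x.shape)       # (3, 8, 8)
print("weight shape:", weight.shape)  # (64, 3, 3, 3)
print("output shape:", y.shape)       # (64, 6, 6)
```

Each of the 64 kernels reads all 3 input channels and writes exactly one output channel, which is why the output has 64 channels regardless of the input's channel count.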
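Parameter count

$$\text{params} = k_h \times k_w \times C_{in} \times C_{out} + C_{out}$$

where $k_h \times k_w$ is the kernel size, $C_{in}$ and $C_{out}$ are the input and output channel counts, and the trailing $C_{out}$ is one bias per output channel.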
```python
# A canonical "first conv layer": RGB input, 64 filters, 3×3 kernels
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
params = sum(p.numel() for p in conv1.parameters())
print(f"Parameters: {params}")
# 3 × 3 × 3 × 64 + 64 = 1,728 + 64 = 1,792
```

Quick Check
How many parameters does nn.Conv2d(64, 128, kernel_size=3) have?
We now have the operation itself down cold — in math, in NumPy, and in PyTorch. Section 3 adds the three knobs that turn this raw operation into a practical CNN building block: stride, padding, and pooling. It also introduces the concept that explains why CNN depth matters — the receptive field.