Why Linear Algebra for Neural Networks?
Every neural network, from a simple two-layer classifier to GPT-4, performs the same core operation in every layer:
$$\mathbf{z} = W\mathbf{x} + \mathbf{b}$$
Here $W$ is a matrix of weights, $\mathbf{x}$ is a vector of inputs, $\mathbf{b}$ is a vector of biases, and $\mathbf{z}$ is the output vector. This single equation is pure linear algebra: matrix-vector multiplication followed by vector addition.
Understanding these operations gives you deep intuition for:
- Why networks have the shapes they do: the dimensions of $W$ determine how many neurons each layer has
- What "parameters" really are: every entry in $W$ and $\mathbf{b}$ is a learnable number
- How data flows through layers: each layer transforms the vector $\mathbf{x}$ into a new vector $\mathbf{z}$
- Why certain architectures work: attention, convolution, and residual connections are all built largely from matrix operations
The Big Picture: Neural networks are, at their mathematical core, a sequence of matrix multiplications and element-wise nonlinearities. If you understand matrices and vectors, you understand the skeleton of every neural network ever built.
This section covers exactly the linear algebra you need — no more, no less. We start with vectors, build up to matrices, and finish by tracing a complete forward pass through a neural network layer. Every concept comes with NumPy and PyTorch code so you can verify the math with your own hands.
Vectors: Direction and Magnitude
What Is a Vector?
A vector is an ordered list of numbers. That's it — no mystery. In neural networks, almost everything is a vector: the input features fed to a model, the weights of each neuron, the activations flowing between layers, and the gradients used to update those weights during training.
We write vectors as bold lowercase letters: $\mathbf{v}$. The individual numbers inside are called components, written with subscripts: $v_1, v_2, \ldots, v_n$. A vector with $n$ components lives in $n$-dimensional space.
For example, a 3-D vector $\mathbf{v} = [0.8, 0.3, 0.5]$ represents a point in 3-D space where the first axis has value 0.8, the second has 0.3, and the third has 0.5. In a neural network, these might be three pixel intensities from an image, or three features extracted by a previous layer.
Magnitude: How Big Is the Vector?
The magnitude (or length, or norm) of a vector tells you how far it is from the origin. For a vector $\mathbf{v} = [v_1, v_2, \ldots, v_n]$, the magnitude is:
$$\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$
This is the Pythagorean theorem generalized to $n$ dimensions. For a 2-D vector such as $[3, 4]$, the magnitude is $\sqrt{3^2 + 4^2} = 5$. For a 1000-D word embedding, the same formula applies, just with 1000 terms under the square root.
Unit Vectors: Direction Without Magnitude
A unit vector has magnitude exactly 1. You create one by dividing a nonzero vector by its magnitude: $\hat{\mathbf{v}} = \mathbf{v} / \|\mathbf{v}\|$. This process is called normalization. It strips away the "how much" and keeps only the "which direction."
Neural networks use normalization constantly. Cosine similarity relies directly on unit vectors, while batch normalization and layer normalization apply the related idea of rescaling activations to a standard scale.
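The magnitude and normalization formulas above can be checked with a few lines of NumPy. This is a minimal sketch; the vector `[3, 4]` is an illustrative value chosen here, not one taken from the text.

```python
import numpy as np

v = np.array([3.0, 4.0])  # illustrative 2-D vector (assumed values)

# Magnitude: square root of the sum of squared components
magnitude = np.sqrt(np.sum(v ** 2))

# np.linalg.norm computes the same quantity
builtin_magnitude = np.linalg.norm(v)

# Normalization: divide by the magnitude to get a unit vector
unit = v / magnitude

print(magnitude)             # 5.0
print(unit)                  # [0.6 0.8]
print(np.linalg.norm(unit))  # approximately 1
```

The unit vector keeps the direction of `v` but has length 1, up to floating-point rounding.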
Vector Arithmetic
Vectors support three fundamental operations:
- Addition: add corresponding components, $\mathbf{a} + \mathbf{b} = [a_1 + b_1, \ldots, a_n + b_n]$. This is how bias vectors are added to layer outputs.
- Subtraction: subtract corresponding components, $\mathbf{a} - \mathbf{b} = [a_1 - b_1, \ldots, a_n - b_n]$. This is how errors are computed (prediction minus target).
- Scalar multiplication: multiply every component by a number, $c\,\mathbf{v} = [c\,v_1, \ldots, c\,v_n]$. This is how the learning rate scales gradient updates.
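The three operations above map directly onto NumPy's array arithmetic. The sketch below uses made-up values; the names `prediction`, `target`, `gradient`, and `learning_rate` are hypothetical and only echo the uses mentioned in the list.

```python
import numpy as np

prediction = np.array([1.0, 2.0, 3.0])   # assumed example values
target = np.array([0.5, 2.5, 3.0])
bias = np.array([0.1, 0.1, 0.1])

# Addition: component by component (how a bias is added)
with_bias = prediction + bias

# Subtraction: component by component (prediction minus target)
error = prediction - target

# Scalar multiplication: every component scaled by the same number
learning_rate = 0.1
gradient = np.array([2.0, -4.0, 6.0])
scaled_step = learning_rate * gradient

print(with_bias)
print(error)
print(scaled_step)
```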
The same operations in PyTorch use tensors instead of arrays. The API is nearly identical, with a few naming differences:
| Operation | NumPy | PyTorch |
|---|---|---|
| Create vector | np.array([...]) | torch.tensor([...]) |
| Magnitude | np.linalg.norm(v) | torch.norm(v) |
| Dimensions | v.ndim (property) | v.dim() (method) |
| Default dtype | float64 | float32 |
The Dot Product
The dot product is the single most important operation in neural networks. Every neuron in a fully connected layer computes a dot product of its weights with its inputs. Attention mechanisms compute dot products between queries and keys. Some loss functions, such as cross-entropy with one-hot targets, can also be written as dot products. If you understand one operation deeply, make it this one.
The Formula
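Given two vectors $\mathbf{a}$ and $\mathbf{b}$ with the same number of components, their dot product is:
$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$
Multiply corresponding elements, then sum. For example, for $\mathbf{a} = [1, 2, 3]$ and $\mathbf{b} = [4, 5, 6]$, the dot product is $1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32$.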
The Geometric Meaning
The dot product has a beautiful geometric interpretation:
$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta$$
where $\theta$ is the angle between the two vectors. This connects the algebraic definition (multiply-and-sum) with geometry (angle). The sign of the dot product tells you the relationship between the vectors:
| Dot Product | Angle | Meaning |
|---|---|---|
| Positive (large) | θ < 90° | Vectors point in similar directions — high similarity |
| Zero | θ = 90° | Vectors are perpendicular — completely independent |
| Negative (large) | θ > 90° | Vectors point in opposite directions — anti-similar |
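The relationship between sign and angle in the table can be verified numerically. The helper `angle_deg` and the example vectors below are illustrative choices made for this sketch.

```python
import numpy as np

def angle_deg(a, b):
    """Angle between a and b in degrees, from a.b = |a||b|cos(theta)."""
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against tiny floating-point overshoot outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

a = np.array([1.0, 0.0])
similar = np.array([1.0, 1.0])         # points roughly the same way as a
perpendicular = np.array([0.0, 2.0])   # at a right angle to a
opposite = np.array([-1.0, 0.5])       # points mostly away from a

for name, b in [("similar", similar),
                ("perpendicular", perpendicular),
                ("opposite", opposite)]:
    print(name, np.dot(a, b), round(float(angle_deg(a, b)), 1))
```

The three dot products come out positive, zero, and negative, matching angles below, at, and above 90°.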
Why This Matters for Neural Networks
Every neuron computes $\mathbf{w} \cdot \mathbf{x}$, the dot product of its weight vector $\mathbf{w}$ with the input vector $\mathbf{x}$. A high dot product means "this input matches what this neuron is looking for." A low or negative dot product means the input is irrelevant or actively contradicts the neuron's pattern.
Key Insight: A neuron's weight vector defines a direction in input space. The dot product measures how much each input aligns with that direction. Training adjusts the weight vectors so they point toward the patterns the network needs to detect.
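To make the "alignment" idea concrete, here is a small NumPy sketch of one neuron's weight vector scored against three inputs. All values are invented for illustration, and `cosine_similarity` is a helper defined here, not a NumPy function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of the two vectors after normalizing each to unit length."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

w = np.array([0.5, -1.0, 2.0])           # hypothetical neuron weights

x_match = np.array([1.0, -2.0, 4.0])     # same direction as w (exactly 2 * w)
x_unrelated = np.array([2.0, 1.0, 0.0])  # perpendicular to w
x_against = -x_match                     # opposite direction

print(np.dot(w, x_match))      # 10.5
print(np.dot(w, x_unrelated))  # 0.0
print(np.dot(w, x_against))    # -10.5

# Cosine similarity ignores magnitude and keeps only direction
print(cosine_similarity(w, x_match))
```

Scaling `x_match` would change its raw dot product with `w` but leave the cosine similarity at 1 (up to rounding), which is why cosine similarity is preferred when only direction matters.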
PyTorch provides both torch.dot() for raw dot products and F.cosine_similarity() for normalized similarity. The cosine similarity is especially useful when you care about direction but not magnitude — for example, comparing word embeddings where you want to know if two words have similar meanings regardless of how frequently they appear.
Matrices: Tables of Numbers with Superpowers
What Is a Matrix?
A matrix is a 2-D grid of numbers arranged in rows and columns. If a vector is a list, a matrix is a table. In neural networks, the weight matrix stores all the weights for one layer — every connection between every input and every neuron, organized into a single rectangular block of numbers.
Shape Notation
An $m \times n$ matrix has $m$ rows and $n$ columns. We write $W \in \mathbb{R}^{m \times n}$ to say "$W$ is a real-valued matrix with $m$ rows and $n$ columns." In neural network layers, $m$ is the number of neurons (outputs) and $n$ is the number of inputs. A layer that maps 3 inputs to 2 neurons has a $2 \times 3$ weight matrix, which holds 6 learnable weights.
Accessing Elements
$W_{ij}$ is the element at row $i$, column $j$. Each row contains one neuron's complete set of weights. Each column contains the weights that all neurons assign to one particular input. This dual perspective is key to understanding how information flows through a network.
Transpose
The transpose $W^T$ swaps rows and columns: the first row becomes the first column, the second row becomes the second column, and so on. A matrix with shape $(m, n)$ becomes $(n, m)$. Transpose appears constantly in neural networks: during backpropagation, gradients flow through $W^T$, and PyTorch's nn.Linear internally computes $\mathbf{x}W^T + \mathbf{b}$ rather than $W\mathbf{x} + \mathbf{b}$.
Special Matrices
A few special matrices appear throughout deep learning:
| Matrix | Definition | Neural Network Use |
|---|---|---|
| Identity (I) | 1s on diagonal, 0s elsewhere. I × v = v | Residual connections: output = f(x) + I·x |
| Diagonal | Non-zero only on the diagonal | Per-feature scaling (like layer norm’s γ) |
| Zero matrix | All entries are 0 | Initializing biases to zero |
Key Insight: The identity matrix is the matrix equivalent of the number 1: multiplying any vector by $I$ returns the same vector unchanged. Residual connections in networks like ResNet and Transformers exploit this: by adding the input $\mathbf{x}$ to the layer's output, the network can easily learn to "do nothing" when that is the best strategy, which makes very deep networks much easier to train.
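The matrix concepts in this section (shape, indexing, rows versus columns, transpose, identity) can be sketched in NumPy as follows. The weight values are arbitrary and chosen only for illustration.

```python
import numpy as np

# A layer mapping 3 inputs to 2 neurons: 2 rows, 3 columns (assumed values)
W = np.array([[0.2, -0.5, 0.1],
              [0.7,  0.3, -0.4]])

print(W.shape)    # (2, 3)
print(W[0, 1])    # -0.5, the element at row 0, column 1
print(W[1])       # row 1: all of neuron 1's weights
print(W[:, 0])    # column 0: the weight every neuron gives to input 0

# Transpose swaps rows and columns: (2, 3) becomes (3, 2)
print(W.T.shape)  # (3, 2)

# The identity matrix leaves any vector unchanged
I = np.eye(3)
x = np.array([1.0, 2.0, 3.0])
print(I @ x)      # [1. 2. 3.]
```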
PyTorch matrices use the same indexing and transpose syntax. The key difference is that nn.Linear creates and manages weight matrices for you automatically:
| Operation | NumPy | PyTorch |
|---|---|---|
| Create matrix | np.array([[...], [...]]) | torch.tensor([[...], [...]]) |
| Single element | W[0, 1] | W[0, 1].item() |
| Transpose | W.T | W.T (or W.transpose(0, 1)) |
| Identity | np.eye(n) | torch.eye(n) |
Matrix-Vector Multiplication: The Core of Neural Networks
This is THE fundamental operation: $\mathbf{z} = W\mathbf{x} + \mathbf{b}$. Every neuron, every layer, every forward pass comes down to this single equation. If you understand matrix-vector multiplication, you understand the skeleton of every neural network.
How It Works — Row by Row
Each row of $W$ is one neuron's weight vector. Matrix-vector multiplication takes the dot product of each row with the input vector $\mathbf{x}$, producing one output per neuron:
- Row 0 $\cdot\ \mathbf{x}$ = neuron 0's pre-activation
- Row 1 $\cdot\ \mathbf{x}$ = neuron 1's pre-activation
- …and so on for every neuron in the layer
For $W$ with shape $(2, 3)$ and $\mathbf{x}$ with shape $(3,)$: row 0 (3 elements) dots with $\mathbf{x}$ (3 elements) to give neuron 0's output, and row 1 dots with $\mathbf{x}$ to give neuron 1's output. The result is a vector with 2 elements, one per neuron.
The Shape Rule
$(m \times n) \cdot (n) \to (m)$: the inner dimensions must match. The number of columns of $W$ must equal the number of elements of $\mathbf{x}$. The output has $m$ elements, one per row of $W$. If these dimensions don't match, you get a shape error, one of the most common bugs in deep learning code.
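A quick way to see the shape rule in action is to try one matching and one mismatched input. This sketch uses placeholder arrays of ones; the names `x_ok` and `x_bad` are hypothetical.

```python
import numpy as np

W = np.ones((2, 3))   # 2 neurons, 3 inputs
x_ok = np.ones(3)     # 3 elements: matches the columns of W
x_bad = np.ones(4)    # 4 elements: does not match

print((W @ x_ok).shape)  # (2,)

try:
    W @ x_bad
    mismatch_raised = False
except ValueError as err:
    # NumPy refuses to multiply when the inner dimensions differ
    mismatch_raised = True
    print("shape error:", err)
```

Printing `.shape` on both operands before a multiplication is a cheap habit that catches most of these errors early.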
Adding the Bias
After the dot products, each neuron adds its own bias $b_i$. The bias vector $\mathbf{b}$ has one element per neuron. Bias lets a neuron fire even when the input is all zeros: it shifts the neuron's activation threshold.
The Linear Transformation
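The multiplication $W\mathbf{x}$ is called a linear transformation because it preserves addition and scaling: $W(\mathbf{x} + \mathbf{y}) = W\mathbf{x} + W\mathbf{y}$ and $W(c\,\mathbf{x}) = c\,W\mathbf{x}$. (Strictly speaking, adding the bias makes the full operation affine rather than linear.) The weight matrix literally transforms the input vector into a new space: a 3-D input becomes a 2-D output, or a 768-D embedding becomes a 3072-D hidden representation. The matrix controls how the space is warped: which directions get stretched, which get compressed, and which get rotated.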
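The row-by-row description, the bias addition, and the linearity property can all be checked in NumPy. The weights, input, and bias below are invented values for a 3-input, 2-neuron layer.

```python
import numpy as np

W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5,  0.5]])   # shape (2, 3), assumed values
x = np.array([2.0, 4.0, 6.0])      # shape (3,)
b = np.array([0.1, -0.2])          # one bias per neuron

# Row by row: each neuron is a dot product plus its own bias
z_rows = np.array([np.dot(W[0], x) + b[0],
                   np.dot(W[1], x) + b[1]])

# The same computation as a single matrix-vector product
z = W @ x + b
print(z)  # approximately [-3.9  5.8]

# Linearity of W @ x: addition and scaling are preserved
y = np.array([1.0, -1.0, 0.5])
additive = np.allclose(W @ (x + y), W @ x + W @ y)
scalable = np.allclose(W @ (3.0 * x), 3.0 * (W @ x))
print(additive, scalable)  # True True
```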
In PyTorch, nn.Linear wraps this computation into a reusable layer that integrates with the optimizer for training. The layer stores .weight (the matrix $W$) and .bias (the vector $\mathbf{b}$), and its forward method computes $\mathbf{x}W^T + \mathbf{b}$ automatically when you call layer(x).
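A minimal sketch of that behavior, assuming PyTorch is installed: the layer's output is compared against the same computation written out by hand. The input values are arbitrary, and the weights are whatever PyTorch's default initialization produces.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random initial weights are repeatable

layer = nn.Linear(in_features=3, out_features=2)
x = torch.tensor([2.0, 4.0, 6.0])

z = layer(x)                                  # calls the layer's forward method
manual = x @ layer.weight.T + layer.bias      # the same math, written out

print(layer.weight.shape)   # torch.Size([2, 3]): one row per neuron
print(layer.bias.shape)     # torch.Size([2])
print(torch.allclose(z, manual))  # True

# 6 weights + 2 biases = 8 learnable parameters
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 8
```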
Matrix-Matrix Multiplication: Processing Batches
In practice, we don't process one input at a time; we process batches. Instead of one vector $\mathbf{x}$, we have a matrix $X$ where each row is one sample. If we have 4 images, each with 3 features, $X$ has shape $(4, 3)$: 4 rows, 3 columns.
The Shape Rule for Matrix Multiply
$(m \times n) \cdot (n \times p) \to (m \times p)$: the inner dimensions must match. The columns of the left matrix must equal the rows of the right matrix. The output takes the outer dimensions: rows from the left and columns from the right.
For our example: 4 samples with 3 features each ($X$: $4 \times 3$) times ($W^T$: $3 \times 2$) gives ($Z$: $4 \times 2$). Each row of $Z$ is one sample's output through both neurons, so all 4 samples are processed simultaneously.
Non-Commutativity: Order Matters
Unlike scalar multiplication, matrix multiplication is not commutative: $AB \neq BA$ in general. Swapping the order can change both the shapes and the values. In fact, if $A$ is $(4 \times 3)$ and $B$ is $(3 \times 2)$, then $BA$ won't even work, because the inner dimensions (2 and 4) don't match. Matrix order is a frequent source of bugs in deep learning code.
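Both the batch shape rule and non-commutativity are easy to confirm in NumPy. The arrays below are illustrative; `X` is simply the numbers 0 to 11 arranged as 4 samples of 3 features.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)  # 4 samples, 3 features
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5,  0.5]])              # 2 neurons, 3 inputs

# (4, 3) @ (3, 2) -> (4, 2): one output row per sample
Z = X @ W.T
print(Z.shape)  # (4, 2)

# Each row of Z equals the single-sample product W @ x
rows_match = all(np.allclose(Z[i], W @ X[i]) for i in range(4))
print(rows_match)  # True

# Swapping the order breaks the shape rule: (3, 2) @ (4, 3) is invalid
try:
    W.T @ X
    swapped_failed = False
except ValueError:
    swapped_failed = True

# Even for square matrices, where both orders are valid, results differ
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
print(A @ B)  # [[2. 1.] [4. 3.]]
print(B @ A)  # [[3. 4.] [1. 2.]]
```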
Why Batches Matter
GPUs are massively parallel processors, so multiplying a moderately sized matrix is often almost as fast as multiplying a single vector. Until the GPU is fully utilized, a batch of 256 samples can pass through a layer in roughly the same time as 1 sample, because the rows are processed in parallel. This is one reason training almost always uses batches: up to the point where the hardware saturates, the extra samples are close to free.
In PyTorch, nn.Linear handles batches automatically: the same layer(X) call works whether X is a single vector or a batch matrix. Internally, it computes $XW^T + \mathbf{b}$ regardless of batch size.
Putting It All Together: The Forward Pass
Now we have all the pieces to understand a complete neural network layer. The forward pass is the pipeline that transforms raw input into useful output — and it consists of exactly the operations we've learned:
- Input: a batch of samples (the matrix $X$)
- Linear transform: matrix multiply plus bias, $Z = XW^T + \mathbf{b}$ (the weighted sum)
- Activation: an element-wise nonlinear function applied to $Z$ (the decision gate)
- Output: the activated result, ready for the next layer
Each step is an operation we've already learned: matrix multiply, vector add, and element-wise function application. A neural network is just a chain of these layers, one feeding into the next.
The Big Picture: A neural network is just a chain of these simple operations. Each layer takes numbers in, multiplies, adds, and applies a nonlinear function. That's it. The magic comes from stacking many layers and learning the right weight values through training.
ReLU: The Most Popular Activation
The activation function we'll use is ReLU (Rectified Linear Unit): $\text{ReLU}(z) = \max(0, z)$. It's dead simple: pass positive values through, replace negatives with zero. Yet this tiny nonlinearity is what gives neural networks the power to learn complex patterns. Without it, stacked layers would collapse into a single linear transformation, because $W_2(W_1\mathbf{x}) = (W_2 W_1)\mathbf{x}$, and the network would be no more powerful than a linear model such as logistic regression.
ReLU effectively decides which neurons "fire" (positive output) and which stay silent (zero). This selective activation produces sparse representations, which many deep networks rely on.
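The collapse argument can be demonstrated with two tiny layers. The matrices below are hand-picked so the arithmetic is easy to follow; they are not from the text, and biases are omitted to keep the sketch short.

```python
import numpy as np

def relu(z):
    """Pass positive values through, replace negatives with zero."""
    return np.maximum(0, z)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]

W1 = np.array([[ 1.0, -1.0],
               [-1.0,  1.0]])   # first layer (assumed values)
W2 = np.array([[1.0, 1.0]])     # second layer
x = np.array([1.0, 0.0])

# Without an activation, two layers equal one combined matrix
stacked = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(stacked, collapsed)   # [0.] [0.]

# With ReLU in between, the result is no longer a single matrix product
with_relu = W2 @ relu(W1 @ x)
print(with_relu)            # [1.]
```

Here `W1 @ x` is `[1, -1]`; ReLU zeroes the negative entry, so the second layer sees `[1, 0]` instead, and the output changes from 0 to 1.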
The Full Pipeline with Numbers
Let's trace the complete forward pass with our 4-sample batch through one layer, watching every value transform step by step:
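A minimal NumPy sketch of that trace follows. The batch, weights, and bias are illustrative numbers chosen so each step can be checked by hand; they are assumptions, not values from the original walkthrough.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Step 1: input batch, 4 samples with 3 features each (assumed values)
X = np.array([[ 1.0,  2.0,  3.0],
              [ 0.0, -1.0,  1.0],
              [ 2.0,  0.0, -2.0],
              [-1.0,  1.0,  0.0]])

W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5,  0.5]])   # 2 neurons, 3 inputs
b = np.array([0.5, -1.0])          # one bias per neuron

# Step 2: linear transform, (4, 3) @ (3, 2) + (2,) -> (4, 2)
Z = X @ W.T + b
print(Z)
# [[-1.5  2. ]
#  [-0.5 -1. ]
#  [ 4.5 -1. ]
#  [-0.5 -1. ]]

# Step 3: activation, applied element by element
A = relu(Z)
print(A)
# [[0.  2. ]
#  [0.  0. ]
#  [4.5 0. ]
#  [0.  0. ]]
```

Only two of the eight pre-activations are positive, so only those two survive ReLU.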
In PyTorch, the forward pass is just two lines: layer(X) for the linear transform and F.relu(Z) for the activation. The F.relu function from torch.nn.functional computes the same values as a NumPy relu() built on np.maximum(0, Z), with one crucial difference: it supports autograd. During training, PyTorch automatically tracks that relu was applied and computes the correct gradients during backpropagation: the gradient is 1 where the input was positive and 0 elsewhere.
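The two-line forward pass and the ReLU gradient can be sketched as follows, assuming PyTorch is installed. The batch is random and the small tensor `z` is a made-up example used only to inspect gradients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

layer = nn.Linear(3, 2)
X = torch.randn(4, 3)   # a batch of 4 samples with 3 features

Z = layer(X)            # linear transform
A = F.relu(Z)           # activation
print(A.shape)          # torch.Size([4, 2])

# Autograd: gradient of relu is 1 for positive inputs, 0 for negative ones
z = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)
F.relu(z).sum().backward()
print(z.grad)           # tensor([0., 1., 1.])
```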
Element-wise Operations, Broadcasting, and Reshape
Beyond matrix multiplication, neural networks rely heavily on three more operations. These trip up beginners but are essential to understand: element-wise operations, broadcasting, and reshape. Once you see how they work, many "mysterious" lines of neural network code will suddenly make perfect sense.
Element-wise Operations
An element-wise operation applies a function to each element independently. If you have a matrix Z of shape $(m, n)$, an element-wise operation produces an output of the same shape $(m, n)$, where output position $(i, j)$ depends only on input position $(i, j)$. There is no mixing across positions, completely unlike matrix multiplication, where each output combines an entire row with an entire column.
The most common element-wise operations in neural networks are:
- ReLU activation: $\text{ReLU}(z) = \max(0, z)$. Keeps positives and zeroes out negatives. Commonly applied after linear layers.
- Sigmoid: $\sigma(z) = 1 / (1 + e^{-z})$. Squashes values to the range $(0, 1)$. Used for binary classification outputs.
- Element-wise addition: adding two same-shape matrices, where position $(i, j)$ in matrix A is added to position $(i, j)$ in matrix B.
Key Insight: Element-wise means position $(i, j)$ only cares about $(i, j)$ in the inputs. No mixing across positions. Matrix multiplication mixes across an entire row and column, which is what makes it powerful but computationally expensive.
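The three element-wise operations listed above look like this in NumPy. The matrix `Z` holds arbitrary example values.

```python
import numpy as np

Z = np.array([[-1.0,  2.0],
              [ 0.0, -3.0]])   # assumed example values

# ReLU: each entry is handled on its own
relu_out = np.maximum(0, Z)
print(relu_out)   # [[0. 2.] [0. 0.]]

# Sigmoid: squashes each entry into the open interval (0, 1)
sigmoid_out = 1.0 / (1.0 + np.exp(-Z))
print(sigmoid_out[1, 0])   # 0.5, since sigmoid(0) = 0.5

# Element-wise addition of two same-shape matrices
added = Z + np.ones_like(Z)
print(added)      # [[ 0.  3.] [ 1. -2.]]

# The output shape always matches the input shape
print(relu_out.shape, sigmoid_out.shape, added.shape)
```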
Broadcasting
Broadcasting is NumPy and PyTorch's mechanism for combining arrays of different shapes. Instead of requiring both arrays to be the same shape, the smaller array is "stretched" to match the larger one. The rule: align shapes from the right, and each dimension must either be equal or one of them must be 1 (or missing).
The most common use in neural networks: adding a bias vector of shape $(2,)$ to a batch result of shape $(4, 2)$. The bias is broadcast across all 4 rows, so every sample gets the same bias added.
| Operation | Shape A | Shape B | Result Shape | What Happens |
|---|---|---|---|---|
| A + B | (4, 2) | (4, 2) | (4, 2) | Same shape — element-wise addition, no broadcasting needed |
| Z + bias | (4, 2) | (2,) | (4, 2) | Bias (2,) stretched across 4 rows |
| X + col | (3, 3) | (3, 1) | (3, 3) | Column vector (3,1) stretched across 3 columns |
| a + b | (1,) | (4, 2) | (4, 2) | Scalar-like (1,) stretched to match entire matrix |
| FAIL | (4, 2) | (3,) | ERROR | Last dimensions 2 ≠ 3 — cannot broadcast |
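Each row of the table can be reproduced in NumPy. The arrays below are filled with placeholder values; only their shapes matter for this sketch.

```python
import numpy as np

Z = np.zeros((4, 2))
bias = np.array([10.0, 20.0])          # shape (2,)

# (4, 2) + (2,): the bias is stretched across all 4 rows
print((Z + bias).shape)   # (4, 2)
print((Z + bias)[3])      # [10. 20.]

# (3, 3) + (3, 1): the column is stretched across all 3 columns
X = np.zeros((3, 3))
col = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1)
print(X + col)
# [[1. 1. 1.]
#  [2. 2. 2.]
#  [3. 3. 3.]]

# (1,) + (4, 2): a one-element array is stretched over the whole matrix
print((np.array([5.0]) + Z).shape)     # (4, 2)

# (4, 2) + (3,): last dimensions 2 and 3 differ, so this fails
try:
    Z + np.ones(3)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)   # True
```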
Reshape
Reshaping changes an array's shape without changing its data. The same numbers are simply reinterpreted in a new arrangement. This is critical in neural networks, for example when flattening a 2-D image into a 1-D vector for a fully connected layer. A 28×28 image ($28 \times 28 = 784$ pixels) becomes a vector of length 784.
The -1 trick: passing -1 as a dimension tells NumPy or PyTorch to "figure it out automatically." If you have 12 elements and reshape to $(3, -1)$, the -1 becomes 4 because $12 / 3 = 4$. The most common use is reshape(-1), which flattens any array to 1-D.
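The flattening and the -1 trick can be tried directly in NumPy. The "image" here is just the integers 0 to 783, an assumption made so the element order is easy to inspect.

```python
import numpy as np

# A stand-in for a 28x28 image: the integers 0..783
image = np.arange(28 * 28).reshape(28, 28)

flat = image.reshape(-1)          # flatten to 1-D
print(flat.shape)                 # (784,)

# Same data, new arrangement: row 1, column 0 is element 28 of the flat vector
print(image[1, 0], flat[28])      # 28 28

# The -1 dimension is inferred: 12 elements / 3 rows = 4 columns
a = np.arange(12)
print(a.reshape(3, -1).shape)     # (3, 4)

# Flattening every image in a batch while keeping the batch dimension
batch = np.zeros((4, 28, 28))
print(batch.reshape(4, -1).shape) # (4, 784)
```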
In PyTorch, element-wise operations and broadcasting follow the same rules as in NumPy. The main differences are in reshaping: PyTorch offers both .view() and .reshape(). The .view() method needs a memory layout compatible with the new shape (in practice, usually a contiguous tensor) and returns a view: no data copy, shared storage, so modifying the view modifies the original. The .reshape() method always works, returning a view when it can and a copy when it cannot. Most freshly created tensors are contiguous, so .view() is common in existing code, while .reshape() is the safer choice when contiguity is uncertain.
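The difference between the two methods shows up as soon as a tensor is non-contiguous, for example after a transpose. The sketch below assumes PyTorch is installed and uses a small integer tensor for illustration.

```python
import torch

t = torch.arange(6).reshape(2, 3)   # contiguous tensor [[0, 1, 2], [3, 4, 5]]

# .view() shares storage: writing through the view changes the original
v = t.view(3, 2)
v[0, 0] = 100
print(t[0, 0].item())         # 100

# A transpose is not contiguous in memory
tt = t.T
print(tt.is_contiguous())     # False

# .view() cannot flatten it
try:
    tt.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)            # True

# .reshape() still works by making a copy
r = tt.reshape(6)
print(r.tolist())             # [100, 3, 1, 4, 2, 5]

# Because r is a copy here, changing it leaves t untouched
r[0] = -1
print(t[0, 0].item())         # 100
```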
You now have all the linear algebra you need for neural networks. Matrix multiplication combines inputs with weights, element-wise operations like ReLU introduce nonlinearity, broadcasting lets you add biases efficiently, and reshape lets you reorganize data between layers. Everything from here builds on these four operations — they are the building blocks of every neural network layer, every forward pass, and every training step.