Boo-AI — Master Artificial Intelligence by Building from Scratch

Why Linear Algebra for Neural Networks?

Every neural network, from a simple two-layer classifier to GPT-4, performs the same core operation in every layer:

$\mathbf{y} = W\mathbf{x} + \mathbf{b}$

Here $W$ is a matrix of weights, $\mathbf{x}$ is a vector of inputs, $\mathbf{b}$ is a vector of biases, and $\mathbf{y}$ is the output vector. This single equation is pure linear algebra — matrix-vector multiplication followed by vector addition.
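
As a minimal sketch of this equation in NumPy: the weight values are borrowed from the matrix example later in this section, while the input `x` and bias `b` are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical layer: 3 inputs -> 2 neurons (x and b are illustrative values)
W = np.array([[0.2, -0.5, 0.1],
              [0.8, 0.3, -0.4]])   # shape (2, 3): one row per neuron
x = np.array([1.0, 2.0, 3.0])      # shape (3,): input vector
b = np.array([0.1, -0.1])          # shape (2,): one bias per neuron

y = W @ x + b                      # matrix-vector product, then vector addition
print(y.shape)  # (2,)
print(y)
```

The output has one entry per neuron, so the shape of `y` is fixed by the number of rows in `W`.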

Understanding these operations gives you deep intuition for:

Why networks have the shapes they do — the dimensions of $W$ determine how many neurons each layer has
What "parameters" really are — every entry in $W$ and $\mathbf{b}$ is a learnable number
How data flows through layers — each layer transforms the vector $\mathbf{x}$ into a new vector $\mathbf{y}$
Why certain architectures work — attention, convolution, and residual connections are all built from matrix operations, combined with a few nonlinear steps such as softmax

The Big Picture: Neural networks are, at their mathematical core, a sequence of matrix multiplications and element-wise nonlinearities. If you understand matrices and vectors, you understand the skeleton of every neural network ever built.

This section covers exactly the linear algebra you need — no more, no less. We start with vectors, build up to matrices, and finish by tracing a complete forward pass through a neural network layer. Every concept comes with NumPy and PyTorch code so you can verify the math with your own hands.

Vectors: Direction and Magnitude

What Is a Vector?

A vector is an ordered list of numbers. That's it — no mystery. In neural networks, almost everything is a vector: the input features fed to a model, the weights of each neuron, the activations flowing between layers, and the gradients used to update those weights during training.

We write vectors as bold lowercase letters: $\mathbf{v}$. The individual numbers inside are called components, written with subscripts: $v_1, v_2, \ldots, v_n$. A vector with $n$ components lives in $n$-dimensional space.

For example, a 3-D vector $\mathbf{v} = [0.8, 0.3, 0.5]$ represents a point in 3-D space where the first axis has value 0.8, the second has 0.3, and the third has 0.5. In a neural network, these might be three pixel intensities from an image, or three features extracted by a previous layer.

Magnitude: How Big Is the Vector?

The magnitude (or length, or norm) of a vector tells you how far it is from the origin. For a vector $\mathbf{v} = [v_1, v_2, \ldots, v_n]$, the magnitude is:

$\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$

This is the Pythagorean theorem generalized to $n$ dimensions. For a 2-D vector $[3, 4]$, the magnitude is $\sqrt{9 + 16} = \sqrt{25} = 5$. For a 1000-D word-embedding vector, the same formula applies, just with 1000 terms under the square root.

Unit Vectors: Direction Without Magnitude

A unit vector has magnitude exactly 1. You create one by dividing a vector by its magnitude: $\hat{\mathbf{v}} = \mathbf{v} / \|\mathbf{v}\|$. This process is called normalization. It strips away the "how much" and keeps only the "which direction."

Neural networks normalize constantly. Cosine similarity compares unit vectors directly, while batch normalization and layer normalization apply a related idea: they rescale activations to zero mean and unit variance rather than to unit length.
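
Unit-length normalization and the standardization step inside layer normalization are related but not the same operation. The sketch below contrasts them on one vector; `eps` is an assumed small constant for numerical safety, and the learnable scale and shift that real layer normalization adds are left out.

```python
import numpy as np

v = np.array([0.8, 0.3, 0.5])

# Unit-vector (L2) normalization: rescale so the Euclidean length is 1
unit = v / np.linalg.norm(v)

# Standardization, the core step inside layer normalization:
# subtract the mean, then divide by the standard deviation.
eps = 1e-5  # assumed constant to avoid division by zero
standardized = (v - v.mean()) / np.sqrt(v.var() + eps)

print(np.linalg.norm(unit))   # approximately 1.0
print(standardized.mean())    # approximately 0.0
print(standardized.std())     # approximately 1.0
```

The first result keeps the direction of `v`; the second generally does not, because subtracting the mean shifts the vector.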

Vector Arithmetic

Vectors support three fundamental operations:

Addition — add corresponding components: $[a_1 + b_1, \, a_2 + b_2, \, \ldots]$. This is how bias vectors are added to layer outputs.
Subtraction — subtract corresponding components: $[a_1 - b_1, \, a_2 - b_2, \, \ldots]$. This is how errors are computed (prediction minus target).
Scalar multiplication — multiply every component by a number: $c \cdot \mathbf{a} = [c \cdot a_1, \, c \cdot a_2, \, \ldots]$. This is how the learning rate scales gradient updates.
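
Scalar multiplication and subtraction combine in a single gradient-descent step. The following is a sketch with made-up numbers: `weights`, `gradient`, and `learning_rate` are hypothetical values, not taken from any trained model.

```python
import numpy as np

# Hypothetical values for illustration only
weights = np.array([0.5, -0.2, 0.1])
gradient = np.array([0.4, -0.1, 0.3])
learning_rate = 0.1

# Scalar multiplication scales the gradient; subtraction applies the update
new_weights = weights - learning_rate * gradient
print(new_weights)
```

Each weight moves a small distance against its own gradient component, which is all that a plain gradient-descent update does.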


Python — Vector Basics with NumPy

🐍vectors.py


1import numpy as np

NumPy is the foundation of numerical computing in Python. It provides fast N-dimensional arrays (ndarray) and vectorized math operations that run in compiled C, not slow Python loops. Every neural network framework builds on these concepts.

EXECUTION STATE

📚 numpy = Library for numerical computing — ndarray type, linear algebra (np.linalg), element-wise operations, and broadcasting. Aliased as np by convention.

3# A vector: 3 pixel intensities from an image

We create a vector representing 3 pixel intensities from an image. In neural networks, input data is always represented as vectors — an image becomes a vector of pixel values, a sentence becomes a vector of word embeddings.

EXECUTION STATE

real-world context = RGB pixel: Red=0.8 (bright), Green=0.3 (dim), Blue=0.5 (medium). Each number is a feature that the network will process.

4pixel = np.array([0.8, 0.3, 0.5]) — Create a 1-D array

np.array() converts a Python list into a NumPy ndarray. This 1-D array IS a vector — an ordered list of 3 numbers. NumPy stores these contiguously in memory for fast math.

EXECUTION STATE

📚 np.array(list) = Converts a Python list to a NumPy ndarray. The dtype is inferred as float64 from the decimal values.

⬇ pixel = [0.8, 0.3, 0.5]

6print("Vector:", pixel)

Prints the vector contents. NumPy displays 1-D arrays in a single row with spaces between elements.

EXECUTION STATE

output = Vector: [0.8 0.3 0.5]

7print("Shape:", pixel.shape)

The .shape attribute returns a tuple describing the array dimensions. A 1-D array with 3 elements has shape (3,). The trailing comma marks this as a one-element tuple rather than the integer 3 in parentheses; a 0-D (scalar) array would have the empty shape ().

EXECUTION STATE

pixel.shape = (3,) — 1 dimension, 3 elements. A matrix would be (rows, cols), e.g. (3, 2).

8print("Dimensions:", pixel.ndim)

The .ndim attribute returns the number of dimensions (axes). A vector is 1-D, a matrix is 2-D, a batch of images is 4-D (batch, channels, height, width).

EXECUTION STATE

pixel.ndim = 1 — this is a 1-dimensional array (a vector). Matrices have ndim=2, tensors can have ndim=3, 4, or more.

10# Magnitude (length) of the vector

The magnitude tells us "how big" the vector is — its distance from the origin. For pixel intensities, a larger magnitude means brighter overall color. In neural networks, gradient magnitude tells you how much the loss wants weights to change.

11magnitude = np.linalg.norm(pixel) — Euclidean length

Computes the L2 (Euclidean) norm: √(0.8² + 0.3² + 0.5²) = √(0.64 + 0.09 + 0.25) = √0.98 ≈ 0.9899. This is the straight-line distance from the origin to the point (0.8, 0.3, 0.5) in 3-D space.

EXECUTION STATE

📚 np.linalg.norm(x) = Computes the L2 norm (Euclidean length) by default. Takes an array, returns a scalar. Can also compute L1, Linf, and matrix norms via the ord= parameter.

computation = √(0.8² + 0.3² + 0.5²) = √(0.64 + 0.09 + 0.25) = √0.98

magnitude = 0.9899 — nearly 1.0, so this pixel vector is close to unit length already.

12print("Magnitude:", magnitude)

Displays the computed magnitude.

EXECUTION STATE

output = Magnitude: 0.9899494936611665

14# Unit vector (direction only, length = 1)

A unit vector preserves the direction of the original vector but has magnitude exactly 1.0. This is called normalization: it removes the "how much" and keeps only the "which way." Neural networks use closely related rescaling steps extensively (layer norm and batch norm standardize activations to zero mean and unit variance).

15unit = pixel / magnitude — Normalize to unit length

Divides each component by the magnitude. NumPy broadcasts the scalar division across all 3 elements: [0.8/0.9899, 0.3/0.9899, 0.5/0.9899]. The result points in the same direction but has length 1.

EXECUTION STATE

computation = [0.8/0.9899, 0.3/0.9899, 0.5/0.9899]

unit = [0.8081, 0.3030, 0.5051] — same direction, length = 1.0

16print("Unit vector:", unit)

Displays the normalized vector. Notice the ratios between components are preserved: 0.8081/0.3030 ≈ 2.67 and 0.8/0.3 ≈ 2.67.

EXECUTION STATE

output = Unit vector: [0.80812204 0.30304576 0.50507627]

17print("Unit magnitude:", np.linalg.norm(unit))

Verifies that the unit vector has magnitude 1.0. This is the defining property of unit vectors — they lie on the unit circle (2-D) or unit sphere (3-D, N-D).

EXECUTION STATE

output = Unit magnitude: 0.9999999999999999 — effectively 1.0 (tiny floating-point rounding)

19# Vector arithmetic

Vectors support three fundamental operations: addition (combine two vectors), subtraction (difference between two vectors), and scalar multiplication (scale a vector). These three operations are all you need for neural network math.

20a = np.array([3.0, 2.0]) — First 2-D vector

A 2-D vector pointing right and up. Think of this as a neuron's weight vector: it "looks for" inputs that are strong in the first feature (3.0) and moderate in the second (2.0).

EXECUTION STATE

⬇ a = [3.0, 2.0]

21b = np.array([1.0, 4.0]) — Second 2-D vector

A 2-D vector pointing slightly right and strongly up. This vector emphasizes the second feature more than the first — a different "preference" than vector a.

EXECUTION STATE

⬇ b = [1.0, 4.0]

23add = a + b — Component-wise addition

Vector addition adds corresponding components: [3.0+1.0, 2.0+4.0] = [4.0, 6.0]. Geometrically, place vector b at the tip of vector a — the result points to where you end up. In neural networks, this is how bias vectors are added to layer outputs.

EXECUTION STATE

computation = [3.0+1.0, 2.0+4.0]

add = [4.0, 6.0]

24sub = a - b — Component-wise subtraction

Vector subtraction: [3.0-1.0, 2.0-4.0] = [2.0, -2.0]. The result vector points FROM b TO a. Its magnitude (2.83) is the distance between the two points. Neural networks use differences to compute errors.

EXECUTION STATE

computation = [3.0-1.0, 2.0-4.0]

sub = [2.0, -2.0]

25scaled = 2.5 * a — Scalar multiplication

Multiplying a vector by a scalar scales every component: [2.5×3.0, 2.5×2.0] = [7.5, 5.0]. The direction stays the same but the magnitude grows by 2.5×. In gradient descent, the learning rate is a scalar that scales the gradient vector.

EXECUTION STATE

computation = [2.5×3.0, 2.5×2.0]

scaled = [7.5, 5.0]

27print("a + b =", add)

Displays the addition result.

EXECUTION STATE

output = a + b = [4. 6.]

28print("a - b =", sub)

Displays the subtraction result.

EXECUTION STATE

output = a - b = [ 2. -2.]

29print("2.5 * a =", scaled)

Displays the scaled vector.

EXECUTION STATE

output = 2.5 * a = [7.5 5. ]


import numpy as np

# A vector: 3 pixel intensities from an image
pixel = np.array([0.8, 0.3, 0.5])

print("Vector:", pixel)
print("Shape:", pixel.shape)
print("Dimensions:", pixel.ndim)

# Magnitude (length) of the vector
magnitude = np.linalg.norm(pixel)
print("Magnitude:", magnitude)

# Unit vector (direction only, length = 1)
unit = pixel / magnitude
print("Unit vector:", unit)
print("Unit magnitude:", np.linalg.norm(unit))

# Vector arithmetic
a = np.array([3.0, 2.0])
b = np.array([1.0, 4.0])

add = a + b
sub = a - b
scaled = 2.5 * a

print("a + b =", add)
print("a - b =", sub)
print("2.5 * a =", scaled)

The same operations in PyTorch use tensors instead of arrays. The API is nearly identical, with a few naming differences:

Operation	NumPy	PyTorch
Create vector	np.array([...])	torch.tensor([...])
Magnitude	np.linalg.norm(v)	torch.norm(v) (deprecated; torch.linalg.vector_norm(v) is the maintained equivalent)
Dimensions	v.ndim (property)	v.dim() (method)
Default dtype	float64	float32
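
The dtype row is the one most likely to cause surprises when data moves between the two libraries. A small sketch of what each constructor produces, using only `np.array`, `torch.tensor`, `torch.from_numpy`, and `Tensor.to`:

```python
import numpy as np
import torch

values = [0.8, 0.3, 0.5]

arr = np.array(values)             # NumPy infers float64
t_list = torch.tensor(values)      # PyTorch defaults to float32 for Python floats
t_from_np = torch.from_numpy(arr)  # shares memory with arr and keeps float64

print(arr.dtype)        # float64
print(t_list.dtype)     # torch.float32
print(t_from_np.dtype)  # torch.float64

# Cast explicitly when mixing data from the two sources
t_cast = t_from_np.to(torch.float32)
print(t_cast.dtype)     # torch.float32
```

Casting explicitly avoids dtype-mismatch errors when a float64 tensor built from a NumPy array meets float32 model weights.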

PyTorch — The Same Operations as Tensors

🐍vectors_torch.py


1import torch

PyTorch is one of the most widely used deep learning frameworks. Its core data structure is the Tensor, which is like a NumPy array but with GPU acceleration and automatic differentiation built in. Every modern neural network is built with tensors.

EXECUTION STATE

📚 torch = PyTorch core library. Key difference from NumPy: tensors can live on GPU and track gradients for backpropagation automatically.

3# Create a 1-D tensor (vector)

In PyTorch, vectors are 1-D tensors. The word "tensor" is just a generalization: a scalar is a 0-D tensor, a vector is a 1-D tensor, a matrix is a 2-D tensor, and higher dimensions are N-D tensors.

4pixel = torch.tensor([0.8, 0.3, 0.5]) — Create a tensor

torch.tensor() converts a Python list to a PyTorch tensor. Same data as the NumPy version — 3 pixel intensities. The default dtype is torch.float32 (not float64 like NumPy), which is the standard for neural network training.

EXECUTION STATE

📚 torch.tensor(data) = Creates a new tensor from data (list, tuple, or ndarray). Infers dtype from input — float values become float32. Unlike np.array which defaults to float64.

⬇ pixel = tensor([0.8000, 0.3000, 0.5000]), dtype=torch.float32

6print("Tensor:", pixel)

PyTorch prints tensors with "tensor(...)" wrapper and shows 4 decimal places by default.

EXECUTION STATE

output = Tensor: tensor([0.8000, 0.3000, 0.5000])

7print("Shape:", pixel.shape)

Same as NumPy — .shape returns a torch.Size object (behaves like a tuple). A 1-D tensor with 3 elements has shape torch.Size([3]).

EXECUTION STATE

pixel.shape = torch.Size([3]) — equivalent to NumPy's (3,). Both mean: 1 dimension, 3 elements.

8print("Dimensions:", pixel.dim())

PyTorch uses .dim() as a method call, while NumPy uses .ndim as a property. Both return the number of dimensions. This is a common gotcha when switching between frameworks.

EXECUTION STATE

📚 .dim() = Returns the number of dimensions. PyTorch: pixel.dim() → 1. NumPy equivalent: pixel.ndim → 1. Note: dim() has parentheses, ndim does not.

10# Magnitude

Computing the Euclidean norm (magnitude) of the tensor. Same mathematical operation as NumPy, different function name.

11magnitude = torch.norm(pixel) — Euclidean length

Computes the L2 norm: √(0.8² + 0.3² + 0.5²) = √0.98 ≈ 0.9899. Note the API difference: torch.norm(tensor) vs np.linalg.norm(array). The result is a 0-D tensor (scalar tensor).

EXECUTION STATE

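📚 torch.norm(input) = Computes the L2 norm by default. Returns a scalar tensor. NumPy equivalent: np.linalg.norm(). Can also compute other norms via the p= parameter. The PyTorch documentation marks torch.norm as deprecated in favor of torch.linalg.vector_norm(), which gives the same result here.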

magnitude = tensor(0.9899) — same value as NumPy, but stored as a float32 tensor

12print("Magnitude:", magnitude)

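Displays the magnitude. The printed value looks shorter than NumPy's mainly because PyTorch shows 4 decimal places by default; the underlying float32 value also carries fewer significant digits (about 7) than NumPy's float64 (about 16).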

EXECUTION STATE

output = Magnitude: tensor(0.9899)

14# Unit vector

Normalizing the tensor to unit length, exactly as in NumPy.

15unit = pixel / magnitude — Normalize

Divides each component by the magnitude. PyTorch broadcasting works the same as NumPy — the scalar magnitude is applied to all 3 elements. The resulting tensor has magnitude 1.0.

EXECUTION STATE

unit = tensor([0.8081, 0.3030, 0.5051]) — same direction, unit length

16print("Unit vector:", unit)

Displays the normalized tensor.

EXECUTION STATE

output = Unit vector: tensor([0.8081, 0.3030, 0.5051])

18# Arithmetic

PyTorch tensor arithmetic uses the same operators as NumPy: +, -, *. The operations are element-wise and support broadcasting.

19a = torch.tensor([3.0, 2.0])

Creates a 1-D tensor with two elements (a 2-D vector). Same values as the NumPy version for direct comparison.

EXECUTION STATE

a = tensor([3.0, 2.0])

20b = torch.tensor([1.0, 4.0])

Second tensor for arithmetic operations.

EXECUTION STATE

b = tensor([1.0, 4.0])

22print("a + b =", a + b)

Element-wise addition: tensor([3.0+1.0, 2.0+4.0]) = tensor([4.0, 6.0]). Identical to NumPy.

EXECUTION STATE

output = a + b = tensor([4., 6.])

23print("a - b =", a - b)

Element-wise subtraction: tensor([3.0-1.0, 2.0-4.0]) = tensor([2.0, -2.0]).

EXECUTION STATE

output = a - b = tensor([ 2., -2.])

24print("2.5 * a =", 2.5 * a)

Scalar multiplication: tensor([2.5×3.0, 2.5×2.0]) = tensor([7.5, 5.0]). The scalar 2.5 is broadcast across all elements.

EXECUTION STATE

output = 2.5 * a = tensor([7.5000, 5.0000])


import torch

# Create a 1-D tensor (vector)
pixel = torch.tensor([0.8, 0.3, 0.5])

print("Tensor:", pixel)
print("Shape:", pixel.shape)
print("Dimensions:", pixel.dim())

# Magnitude
magnitude = torch.norm(pixel)
print("Magnitude:", magnitude)

# Unit vector
unit = pixel / magnitude
print("Unit vector:", unit)

# Arithmetic
a = torch.tensor([3.0, 2.0])
b = torch.tensor([1.0, 4.0])

print("a + b =", a + b)
print("a - b =", a - b)
print("2.5 * a =", 2.5 * a)

The Dot Product

The dot product is arguably the single most important operation in neural networks. Every neuron in a fully connected layer computes a dot product of its weights with its inputs. Attention mechanisms compute dot products between queries and keys. Some loss functions can be written as dot products too: cross-entropy against a one-hot target is the negative dot product of the target with the log-probabilities. If you understand one operation deeply, make it this one.

The Formula

Given two vectors $\mathbf{a}$ and $\mathbf{b}$ with the same number of components, their dot product is:

$\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$

Multiply corresponding elements, then sum. For $\mathbf{a} = [3, 2]$ and $\mathbf{b} = [1, 4]$, the dot product is $3 \times 1 + 2 \times 4 = 3 + 8 = 11$.

The Geometric Meaning

The dot product has a beautiful geometric interpretation:

$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \, \cos(\theta)$

where $\theta$ is the angle between the two vectors. This connects the algebraic definition (multiply-and-sum) with geometry (angle). The sign of the dot product tells you the relationship between the vectors:

Dot Product	Angle	Meaning
Positive	θ < 90°	Vectors point in broadly the same direction; for fixed lengths, larger values mean closer alignment
Zero	θ = 90°	Vectors are perpendicular (orthogonal); neither has any component along the other
Negative	θ > 90°	Vectors point in broadly opposing directions; exactly opposite only at θ = 180°

Why This Matters for Neural Networks

Every neuron computes $\mathbf{w} \cdot \mathbf{x}$, the dot product of its weight vector $\mathbf{w}$ with the input vector $\mathbf{x}$. A high dot product means "this input matches what this neuron is looking for." A low or negative dot product means the input is irrelevant or actively contradicts the neuron's pattern.

Key Insight: A neuron's weight vector defines a direction in input space. The dot product measures how much each input aligns with that direction. Training adjusts the weight vectors so they point toward the patterns the network needs to detect.
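
To see this alignment idea with numbers, the sketch below scores several inputs against one weight vector at once. The weight vector reuses the values of $\mathbf{a}$ from this section; the three input rows are hypothetical.

```python
import numpy as np

w = np.array([3.0, 2.0])   # the neuron's weight vector

# Three hypothetical inputs, one per row
inputs = np.array([
    [ 1.0,  4.0],
    [ 4.0,  1.0],
    [-1.0, -1.0],
])

# One dot product per row: how strongly each input aligns with w
scores = inputs @ w
best = int(np.argmax(scores))

print(scores)
print("Best-matching input:", inputs[best])
```

The second input scores highest because it is strongest in the first feature, which is the feature `w` weights most heavily.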


Python — The Dot Product and Geometric Interpretation

🐍dot_product.py


1import numpy as np

Importing NumPy for array creation, dot product computation, and trigonometric functions.

EXECUTION STATE

📚 numpy = Provides np.dot(), np.linalg.norm(), np.arccos(), and np.degrees() — all used in this example.

3a = np.array([3.0, 2.0]) — Weight-like vector

Think of this as a neuron's weight vector. It defines what the neuron "looks for" — inputs that are strong in the first feature (3.0) and moderate in the second (2.0).

EXECUTION STATE

⬇ a = [3.0, 2.0] — the neuron's preference direction

4b = np.array([1.0, 4.0]) — Input-like vector

Think of this as an input data point. This input has a weak first feature (1.0) but strong second feature (4.0) — somewhat aligned with vector a, but not perfectly.

EXECUTION STATE

⬇ b = [1.0, 4.0] — the input signal

6# Method 1: np.dot

The dot product is the fundamental operation in neural networks. Every neuron computes a dot product: weights · inputs. It measures how aligned two vectors are.

7dot = np.dot(a, b) — Compute the dot product

Multiplies corresponding elements and sums: (3.0×1.0) + (2.0×4.0) = 3.0 + 8.0 = 11.0. This single number summarizes how much vectors a and b agree.

EXECUTION STATE

📚 np.dot(a, b) = For 1-D arrays: computes the dot product (inner product). Multiplies element-wise and sums. For 2-D arrays: performs matrix multiplication. Returns a scalar for 1-D inputs.

computation = (3.0 × 1.0) + (2.0 × 4.0) = 3.0 + 8.0 = 11.0

⬆ dot = 11.0 — positive means the vectors generally point the same way

8print("a . b =", dot)

Displays the dot product result.

EXECUTION STATE

output = a . b = 11.0

10# Method 2: manual computation

Computing the dot product by hand to make the operation concrete. This is exactly what happens inside np.dot — just multiply and add.

11manual = a[0]*b[0] + a[1]*b[1] — Multiply-and-sum

Index into each vector, multiply corresponding elements, and sum. a[0]*b[0] = 3.0×1.0 = 3.0, a[1]*b[1] = 2.0×4.0 = 8.0, total = 11.0. This is the same as np.dot but slower — never do this in practice.

EXECUTION STATE

a[0]*b[0] = 3.0 × 1.0 = 3.0

a[1]*b[1] = 2.0 × 4.0 = 8.0

manual = 3.0 + 8.0 = 11.0 — matches np.dot exactly

12print("Manual:", manual)

Confirms the manual computation matches np.dot.

EXECUTION STATE

output = Manual: 11.0

14# Geometric interpretation

The dot product has a beautiful geometric meaning: a · b = ||a|| ||b|| cos(θ). It connects algebra (multiply-and-sum) with geometry (angle between vectors). This is why dot products measure similarity.

15mag_a = np.linalg.norm(a) — Length of vector a

Computes ||a|| = √(3.0² + 2.0²) = √(9 + 4) = √13 ≈ 3.6056.

EXECUTION STATE

📚 np.linalg.norm(a) = L2 norm of a. √(3² + 2²) = √13.

mag_a = 3.6056 — the length of vector a

16mag_b = np.linalg.norm(b) — Length of vector b

Computes ||b|| = √(1.0² + 4.0²) = √(1 + 16) = √17 ≈ 4.1231.

EXECUTION STATE

mag_b = 4.1231 — the length of vector b

17cos_theta = dot / (mag_a * mag_b) — Cosine of angle

Rearranging a · b = ||a|| ||b|| cos(θ), we get cos(θ) = a · b / (||a|| ||b||) = 11.0 / (3.6056 × 4.1231) = 11.0 / 14.8661 ≈ 0.7399. This is the cosine similarity — a value between -1 and 1.

EXECUTION STATE

computation = 11.0 / (3.6056 × 4.1231) = 11.0 / 14.8661

cos_theta = 0.7399 — close to 1 means the vectors are fairly aligned

18theta_deg = np.degrees(np.arccos(cos_theta)) — Angle in degrees

First, np.arccos(0.7399) converts cosine back to the angle in radians (≈ 0.7378 rad). Then np.degrees() converts radians to degrees: 0.7378 × (180/π) ≈ 42.27°.

EXECUTION STATE

📚 np.arccos(x) = Inverse cosine function. Takes a value in [-1, 1], returns angle in radians [0, π]. arccos(0.7399) = 0.7378 radians.

📚 np.degrees(rad) = Converts radians to degrees by multiplying by 180/π. 0.7378 × 57.2958 = 42.27°.

theta_deg = 42.27° — the vectors are separated by about 42 degrees

20print("|a| =", round(mag_a, 4))

Displays the magnitude of vector a, rounded to 4 decimal places.

EXECUTION STATE

output = |a| = 3.6056

21print("|b| =", round(mag_b, 4))

Displays the magnitude of vector b.

EXECUTION STATE

output = |b| = 4.1231

22print("cos(theta) =", round(cos_theta, 4))

Displays the cosine similarity between the two vectors.

EXECUTION STATE

output = cos(theta) = 0.7399

23print("theta =", round(theta_deg, 2), "degrees")

Displays the angle between vectors a and b.

EXECUTION STATE

output = theta = 42.27 degrees

25# Neural network use: similarity

Now we demonstrate why the dot product matters: it measures how similar two vectors are. We test three cases — same direction, perpendicular, and opposite — to build intuition for what neurons "see."

26similar = np.array([3.0, 2.0]) — Same direction as a

This vector is identical to a — maximum possible alignment. The dot product will be large and positive.

EXECUTION STATE

similar = [3.0, 2.0] — pointing in the exact same direction as a

27orthogonal = np.array([-2.0, 3.0]) — Perpendicular to a

This vector is perpendicular (90°) to a. Check: a · orthogonal = 3.0×(-2.0) + 2.0×3.0 = -6 + 6 = 0. Zero dot product means zero similarity — the two directions are completely independent.

EXECUTION STATE

orthogonal = [-2.0, 3.0] — perpendicular to a. Verify: 3×(-2) + 2×3 = 0

28opposite = np.array([-3.0, -2.0]) — Opposite direction

This is -a, pointing in exactly the opposite direction. The dot product will be large and negative — maximum anti-alignment.

EXECUTION STATE

opposite = [-3.0, -2.0] — the negation of a, pointing 180° away

30print("Similar:", np.dot(a, similar))

Dot product with the identical vector: 3.0×3.0 + 2.0×2.0 = 9 + 4 = 13.0. This equals ||a||² = 13, the largest dot product a can have with any vector of the same length.

EXECUTION STATE

output = Similar: 13.0 — large positive → same direction

31print("Orthogonal:", np.dot(a, orthogonal))

Dot product with the perpendicular vector: 3.0×(-2.0) + 2.0×3.0 = -6 + 6 = 0.0. Zero means the vectors are completely independent — neither similar nor opposite.

EXECUTION STATE

output = Orthogonal: 0.0 — zero → no similarity at all

32print("Opposite:", np.dot(a, opposite))

Dot product with the negated vector: 3.0×(-3.0) + 2.0×(-2.0) = -9 + (-4) = -13.0. Large negative means maximum anti-alignment — the vectors disagree as much as possible.

EXECUTION STATE

output = Opposite: -13.0 — large negative → opposite direction


import numpy as np

a = np.array([3.0, 2.0])
b = np.array([1.0, 4.0])

# Method 1: np.dot
dot = np.dot(a, b)
print("a . b =", dot)

# Method 2: manual computation
manual = a[0]*b[0] + a[1]*b[1]
print("Manual:", manual)

# Geometric interpretation
mag_a = np.linalg.norm(a)
mag_b = np.linalg.norm(b)
cos_theta = dot / (mag_a * mag_b)
theta_deg = np.degrees(np.arccos(cos_theta))

print("|a| =", round(mag_a, 4))
print("|b| =", round(mag_b, 4))
print("cos(theta) =", round(cos_theta, 4))
print("theta =", round(theta_deg, 2), "degrees")

# Neural network use: similarity
similar = np.array([3.0, 2.0])
orthogonal = np.array([-2.0, 3.0])
opposite = np.array([-3.0, -2.0])

print("Similar:", np.dot(a, similar))
print("Orthogonal:", np.dot(a, orthogonal))
print("Opposite:", np.dot(a, opposite))
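
One practical caveat when recovering the angle: floating-point rounding can push the cosine slightly outside [-1, 1] for parallel or opposite vectors, and np.arccos then returns NaN. A hedged sketch of a guard, where `angle_degrees` is a hypothetical helper name and not part of NumPy:

```python
import numpy as np

def angle_degrees(a, b):
    """Angle between two non-zero vectors, in degrees."""
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Rounding can push the ratio slightly outside [-1, 1]; clip before arccos
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

a = np.array([3.0, 2.0])
print(round(angle_degrees(a, np.array([1.0, 4.0])), 2))  # 42.27
print(angle_degrees(a, a))    # approximately 0
print(angle_degrees(a, -a))   # approximately 180
```

The helper assumes neither vector is all zeros, since a zero-length vector has no direction and the division would fail.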

PyTorch provides both torch.dot() for raw dot products and F.cosine_similarity() for normalized similarity. The cosine similarity is especially useful when you care about direction but not magnitude — for example, comparing word embeddings where you want to know if two words have similar meanings regardless of how frequently they appear.
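
The difference between the raw dot product and cosine similarity shows up as soon as one vector is rescaled. A minimal NumPy sketch, where `cosine_similarity` is a helper defined here for illustration:

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the two lengths
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([3.0, 2.0])
b = np.array([1.0, 4.0])
b_scaled = 10.0 * b   # same direction as b, ten times longer

print(np.dot(a, b), np.dot(a, b_scaled))   # 11.0 110.0
print(round(cosine_similarity(a, b), 4),
      round(cosine_similarity(a, b_scaled), 4))   # 0.7399 0.7399
```

The dot product grows with the length of `b`, while the cosine similarity depends only on the angle.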

PyTorch — Dot Product and Cosine Similarity

🐍dot_product_torch.py


1import torch

Importing PyTorch for tensor operations. We will use torch.dot() for the dot product and torch.nn.functional for cosine similarity.

EXECUTION STATE

📚 torch = PyTorch core. Provides torch.dot() for dot products and torch.nn.functional for higher-level operations like cosine similarity.

3a = torch.tensor([3.0, 2.0])

Creates a 1-D tensor. Same values as the NumPy version for direct comparison.

EXECUTION STATE

a = tensor([3.0, 2.0])

4b = torch.tensor([1.0, 4.0])

Second tensor for dot product computation.

EXECUTION STATE

b = tensor([1.0, 4.0])

6# Dot product

PyTorch has a dedicated torch.dot() function for 1-D tensor dot products. Unlike NumPy's np.dot(), torch.dot() ONLY works with 1-D tensors — for matrix multiplication, you use torch.matmul() or the @ operator.

7dot = torch.dot(a, b) — Dot product

Computes (3.0×1.0) + (2.0×4.0) = 11.0. API difference: torch.dot() only accepts 1-D inputs. For higher dimensions, use torch.matmul() or @. NumPy's np.dot() handles both, which can cause subtle bugs.

EXECUTION STATE

📚 torch.dot(a, b) = Dot product of two 1-D tensors. Raises RuntimeError if inputs are not 1-D. This strictness prevents accidental matrix multiplications.

⬆ dot = tensor(11.0) — same result as NumPy, stored as a scalar tensor

8print("a . b =", dot)

Displays the dot product result.

EXECUTION STATE

output = a . b = tensor(11.)

10# Cosine similarity (built-in!)

PyTorch provides cosine similarity as a built-in function. In NumPy we computed it manually (dot / (||a|| × ||b||)), but PyTorch makes it a one-liner. Cosine similarity is used heavily in embedding models, retrieval, and recommendation systems; standard attention uses scaled (unnormalized) dot products instead.

11import torch.nn.functional as F

The functional API for neural network operations. Contains stateless versions of layers, loss functions, and similarity metrics. Aliased as F by convention.

EXECUTION STATE

📚 torch.nn.functional = Stateless neural network functions: activations (F.relu), losses (F.cross_entropy), similarities (F.cosine_similarity), and more. Unlike nn.Module classes, these don't store parameters.

12cos_sim = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))

F.cosine_similarity reduces along dim=1 by default, so with the default setting the inputs need at least two dimensions. .unsqueeze(0) adds a batch dimension: [3.0, 2.0] → [[3.0, 2.0]]. (Passing dim=0 with the original 1-D tensors is an alternative.) The function then computes dot(a,b) / (||a|| × ||b||) = 11.0 / (3.6056 × 4.1231) ≈ 0.7399.

EXECUTION STATE

📚 .unsqueeze(0) = Adds a dimension at position 0. Shape changes from [2] to [1, 2]. With this "batch dimension" in place, F.cosine_similarity compares corresponding rows of the two 2-D tensors along its default dim=1.

📚 F.cosine_similarity(x1, x2) = Computes cosine similarity along dim=1 by default. Returns dot(x1,x2) / (||x1|| × ||x2||), a value in [-1, 1]. Handles the normalization automatically.

cos_sim = tensor([0.7399]) — same as our manual computation. 1.0 = identical, 0 = orthogonal, -1 = opposite.

13print("Cosine similarity:", cos_sim)

Displays the cosine similarity. This single number tells you how aligned the two vectors are, regardless of their magnitudes.

EXECUTION STATE

output = Cosine similarity: tensor([0.7399])


import torch

a = torch.tensor([3.0, 2.0])
b = torch.tensor([1.0, 4.0])

# Dot product
dot = torch.dot(a, b)
print("a . b =", dot)

# Cosine similarity (built-in!)
import torch.nn.functional as F
cos_sim = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))
print("Cosine similarity:", cos_sim)
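
The unsqueeze step is one of two ways to call F.cosine_similarity on plain vectors. The sketch below shows the `dim=0` alternative and a small batch; the second row of `batch_y` is a hypothetical extra pair added for illustration.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([3.0, 2.0])
b = torch.tensor([1.0, 4.0])

# Option 1: keep the tensors 1-D and reduce along dim=0
cos_1d = F.cosine_similarity(a, b, dim=0)   # 0-D (scalar) tensor
print(cos_1d.shape)   # torch.Size([])

# Option 2: a batch of pairs, one pair per row, reduced along dim=1 (the default)
batch_x = torch.tensor([[3.0, 2.0], [3.0, 2.0]])
batch_y = torch.tensor([[1.0, 4.0], [-3.0, -2.0]])
cos_batch = F.cosine_similarity(batch_x, batch_y)
print(cos_batch.shape)   # torch.Size([2])
```

The batched form is the reason the default is `dim=1`: in practice, embeddings usually arrive as rows of a 2-D tensor.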

Matrices: Tables of Numbers with Superpowers

What Is a Matrix?

A matrix is a 2-D grid of numbers arranged in rows and columns. If a vector is a list, a matrix is a table. In neural networks, the weight matrix $W$ stores all the weights for one layer — every connection between every input and every neuron, organized into a single rectangular block of numbers.

Shape Notation

An $m \times n$ matrix has $m$ rows and $n$ columns. We write $W \in \mathbb{R}^{m \times n}$ to say "$W$ is a real-valued matrix with $m$ rows and $n$ columns." In neural network layers, $m$ is the number of neurons (outputs) and $n$ is the number of inputs. A layer that maps 3 inputs to 2 neurons has a $2 \times 3$ weight matrix: 6 learnable weights.
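
The shape rule turns directly into a parameter count. A minimal sketch, assuming the layer also has one bias per neuron (so the total is the 6 weights plus 2 biases); the zero values are placeholders, not an initialization recommendation.

```python
import numpy as np

n_inputs, n_neurons = 3, 2

W = np.zeros((n_neurons, n_inputs))  # shape (m, n) = (2, 3)
b = np.zeros(n_neurons)              # one bias per neuron

n_params = W.size + b.size           # 6 weights + 2 biases
print(W.shape, n_params)  # (2, 3) 8
```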

Accessing Elements

$W[i, j]$ is the element at row $i$, column $j$. Each row contains one neuron's complete set of weights. Each column contains the weights that all neurons assign to one particular input. This dual perspective is key to understanding how information flows through a network.

Transpose

The transpose $W^T$ swaps rows and columns: the first row becomes the first column, the second row becomes the second column, and so on. A matrix with shape $(m, n)$ becomes $(n, m)$. Transpose appears constantly in neural networks: during backpropagation, gradients flow through $W^T$, and PyTorch's nn.Linear internally computes $x \cdot W^T$ rather than $W \cdot x$.
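
The two conventions give the same numbers; they differ only in whether samples are stored as rows or as columns. A NumPy sketch, where the batch `X` is a hypothetical pair of input samples:

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1],
              [0.8, 0.3, -0.4]])   # shape (2, 3)
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0]])    # hypothetical batch: 2 samples, 3 features

print(W.T.shape)                   # (3, 2)

# Row-vector convention: each row of X is one sample
Y_batch = X @ W.T                  # shape (2, 2)

# Column-vector convention, applied one sample at a time
Y_loop = np.stack([W @ x for x in X])

print(np.allclose(Y_batch, Y_loop))  # True
```

Storing samples as rows lets a whole batch pass through the layer in one matrix multiplication, which is why the transposed form is convenient.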

Special Matrices

A few special matrices appear throughout deep learning:

Matrix	Definition	Neural Network Use
Identity (I)	1s on diagonal, 0s elsewhere. I × v = v	Residual connections: output = f(x) + I·x
Diagonal	Non-zero only on the diagonal	Per-feature scaling (like layer norm’s γ)
Zero matrix	All entries are 0	Initializing biases to zero

Key Insight: The identity matrix $I$ is the matrix equivalent of the number 1 — multiplying any vector by $I$ returns the same vector unchanged. Residual connections in networks like ResNet and Transformers exploit this: by adding $I \cdot x$ to the layer's output, the network can easily learn to "do nothing" when that is the best strategy, which makes very deep networks trainable.
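
These properties are easy to check numerically. In the sketch below, `gamma` holds hypothetical per-feature scales and `f` is a stand-in for a layer's learned transformation that returns zeros, which models the "do nothing" case.

```python
import numpy as np

x = np.array([0.8, 0.3, 0.5])

I = np.eye(3)                          # 3x3 identity matrix
print(np.array_equal(I @ x, x))        # True

gamma = np.array([2.0, 0.5, 1.0])      # hypothetical per-feature scales
D = np.diag(gamma)                     # diagonal matrix with gamma on the diagonal
print(np.allclose(D @ x, gamma * x))   # True

def f(x):
    # Stand-in for a layer's learned transformation; returns zeros here
    return np.zeros_like(x)

out = f(x) + I @ x                     # residual connection
print(np.array_equal(out, x))          # True
```

In practice the identity path is implemented as a plain addition of `x`, without building `I` explicitly; the matrix form is shown here only to match the notation above.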

Python — Matrix Basics with NumPy

🐍matrices.py


1import numpy as np

NumPy provides the ndarray type that supports 2-D arrays (matrices) with fast C-level indexing, slicing, and linear algebra operations like transpose and matrix multiply.

EXECUTION STATE

📚 numpy = Library for numerical computing. Key matrix features: 2-D arrays, .T transpose, @ matrix multiply, np.eye() identity matrix, and advanced indexing like W[:, 1].

3# Weight matrix: 2 neurons, 3 inputs each

In a neural network layer, the weight matrix W has one ROW per neuron and one COLUMN per input. A layer with 2 neurons receiving 3 inputs has a 2×3 weight matrix — 6 learnable parameters total.

4W = np.array([[0.2, -0.5, 0.1], [0.8, 0.3, -0.4]]) — Create a 2×3 matrix

np.array() with a nested list creates a 2-D array. The outer list has 2 elements (rows), each inner list has 3 elements (columns). Row 0 is neuron 0’s weights, row 1 is neuron 1’s weights.

EXECUTION STATE

📚 np.array(nested_list) = Converts a list of lists into a 2-D ndarray. Each inner list becomes one row. All inner lists must have the same length.

⬇ W (2×3) =

  inp0   inp1   inp2
neuron0:  0.2   -0.5    0.1
neuron1:  0.8    0.3   -0.4

7print("W shape:", W.shape)

The .shape attribute returns (rows, columns) for a 2-D array. This is the most important property of any matrix — it tells you how many neurons (rows) and inputs (columns) the layer has.

EXECUTION STATE

W.shape = (2, 3) — 2 rows (neurons), 3 columns (inputs). In neural network notation: this layer maps 3 inputs to 2 outputs.

output = W shape: (2, 3)

8print("W:\n", W)

Prints the matrix with a newline before it so it displays as a formatted grid. NumPy automatically aligns columns for readability.

EXECUTION STATE

output =

W:
 [[ 0.2 -0.5  0.1]
 [ 0.8  0.3 -0.4]]

10# Accessing elements

NumPy 2-D arrays support three kinds of indexing: single element W[row, col], full row W[row], and full column W[:, col]. These are essential for understanding what each part of the weight matrix represents.

11print("W[0,1] =", W[0, 1]) — Single element access

W[0, 1] accesses row 0, column 1: the weight connecting input 1 to neuron 0. This single number controls how strongly the second input influences the first neuron’s output.

EXECUTION STATE

📚 W[row, col] = 2-D array indexing. First index selects the row, second selects the column. Both are 0-indexed.

W[0, 1] = -0.5 — a negative weight means input 1 inhibits neuron 0. When input 1 is large, it pushes neuron 0’s output down.

output = W[0,1] = -0.5

12print("Row 0 (neuron 1 weights):", W[0]) — Full row

W[0] returns the entire first row — all three weights for neuron 0. This is neuron 0’s weight vector, the vector it will dot-product with the input during the forward pass.

EXECUTION STATE

📚 W[row] = Selecting a single row returns a 1-D array (vector). W[0] returns row 0, W[1] returns row 1.

W[0] = [0.2, -0.5, 0.1] — neuron 0’s weight vector. It responds positively to input 0 and input 2, but negatively to input 1.

output = Row 0 (neuron 1 weights): [ 0.2 -0.5 0.1]

13print("Column 1 (all neurons, input 2):", W[:, 1]) — Full column

W[:, 1] returns column 1 — the weights that ALL neurons assign to input 1. The : means “all rows” and 1 selects column index 1.

EXECUTION STATE

📚 W[:, col] = Column slicing. The colon : selects all rows, and the integer selects a specific column. Returns a 1-D array with one element per row.

W[:, 1] = [-0.5, 0.3] — neuron 0 has weight -0.5 for input 1 (inhibitory), neuron 1 has weight 0.3 for input 1 (excitatory). Same input, opposite effects.

output = Column 1 (all neurons, input 2): [-0.5 0.3]

15# Transpose: swap rows and columns

The transpose flips a matrix along its diagonal — rows become columns and columns become rows. Shape (m, n) becomes (n, m). This operation is used constantly in neural networks, especially when computing gradients during backpropagation.

16W_T = W.T — Transpose the weight matrix

W.T swaps rows and columns. The 2×3 matrix becomes a 3×2 matrix. Row 0 [0.2, -0.5, 0.1] becomes column 0; row 1 [0.8, 0.3, -0.4] becomes column 1.

EXECUTION STATE

📚 .T = NumPy transpose property. For a 2-D array, swaps axes: shape (m, n) → (n, m). Equivalent to W.transpose() or np.transpose(W). Returns a view (no data copy).

⬆ W_T (3×2) =

       neuron0  neuron1
inp0:    0.2      0.8
inp1:   -0.5      0.3
inp2:    0.1     -0.4

17print("W.T shape:", W_T.shape)

Confirms the shape has flipped from (2, 3) to (3, 2).

EXECUTION STATE

W_T.shape = (3, 2) — was (2, 3). Rows and columns are swapped.

output = W.T shape: (3, 2)

18print("W.T:\n", W_T)

Displays the transposed matrix. Compare with the original: the first row of W [0.2, -0.5, 0.1] is now the first column of W.T.

EXECUTION STATE

output =

W.T:
 [[ 0.2  0.8]
 [-0.5  0.3]
 [ 0.1 -0.4]]

20# Identity matrix

The identity matrix I is the matrix equivalent of the number 1 — multiplying by I leaves the vector unchanged. It has 1s on the diagonal and 0s everywhere else. In neural networks, residual connections add the identity: output = f(x) + x, which is equivalent to output = f(x) + I·x.

21I = np.eye(3) — Create a 3×3 identity matrix

np.eye(n) creates an n×n matrix with 1.0 on the main diagonal and 0.0 everywhere else. The name "eye" is a play on the letter I, the standard symbol for the identity matrix.

EXECUTION STATE

📚 np.eye(n) = Creates an n×n identity matrix. np.eye(3) returns a 3×3 matrix. Can also create non-square matrices with np.eye(m, n). Returns float64 unless a dtype is given.

⬇ I (3×3) =

     c0   c1   c2
r0: 1.0  0.0  0.0
r1: 0.0  1.0  0.0
r2: 0.0  0.0  1.0

22print("I (3x3):\n", I)

Displays the identity matrix. Notice the diagonal pattern: 1s from top-left to bottom-right, 0s everywhere else.

EXECUTION STATE

output =

I (3x3):
 [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

24# I @ x = x (identity does nothing)

The defining property of the identity matrix: multiplying any vector by I returns the same vector unchanged. This is why it is called the “identity” — it is the “do nothing” transformation.

25x = np.array([0.8, 0.3, 0.5]) — Test vector

A 3-element vector that we will multiply by the identity matrix to verify the identity property.

EXECUTION STATE

⬇ x = [0.8, 0.3, 0.5]

26print("I @ x =", I @ x) — Identity multiplication

I @ x performs matrix-vector multiplication: each row of I is dot-producted with x. Row 0: 1×0.8 + 0×0.3 + 0×0.5 = 0.8. Row 1: 0×0.8 + 1×0.3 + 0×0.5 = 0.3. Row 2: 0×0.8 + 0×0.3 + 1×0.5 = 0.5. Result = x, unchanged.

EXECUTION STATE

📚 @ operator = Python’s matrix multiplication operator. For a (3×3) matrix and a (3,) vector, computes 3 dot products (one per row), returning a (3,) vector.

I @ x = [0.8, 0.3, 0.5] — identical to x. The identity matrix preserves the input.

output = I @ x = [0.8 0.3 0.5]

27print("x =", x) — Confirm x is unchanged

Prints x for direct comparison with I @ x. They are identical, confirming the identity property.

EXECUTION STATE

output = x = [0.8 0.3 0.5]

Full listing (matrices.py):

import numpy as np

# Weight matrix: 2 neurons, 3 inputs each
W = np.array([[0.2, -0.5, 0.1],
              [0.8,  0.3, -0.4]])

print("W shape:", W.shape)
print("W:\n", W)

# Accessing elements
print("W[0,1] =", W[0, 1])
print("Row 0 (neuron 1 weights):", W[0])  # label counts from 1; index 0 is the first neuron
print("Column 1 (all neurons, input 2):", W[:, 1])  # label counts from 1; index 1 is the second input

# Transpose: swap rows and columns
W_T = W.T
print("W.T shape:", W_T.shape)
print("W.T:\n", W_T)

# Identity matrix
I = np.eye(3)
print("I (3x3):\n", I)

# I @ x = x (identity does nothing)
x = np.array([0.8, 0.3, 0.5])
print("I @ x =", I @ x)
print("x     =", x)

PyTorch matrices use the same indexing and transpose syntax. The key difference is that nn.Linear creates and manages weight matrices for you automatically:

Operation	NumPy	PyTorch
Create matrix	np.array([[...], [...]])	torch.tensor([[...], [...]])
Single element	W[0, 1]	W[0, 1].item()
Transpose	W.T	W.T (or W.transpose(0, 1))
Identity	np.eye(n)	torch.eye(n)

PyTorch — The Same Operations as Tensors

🐍matrices_torch.py


1import torch

PyTorch provides the same matrix operations as NumPy but with GPU acceleration and automatic differentiation. The API is nearly identical for basic matrix creation, indexing, and transpose.

EXECUTION STATE

📚 torch = PyTorch core library. Key matrix operations: .shape, .T, @ operator, torch.eye(), and the same [row, col] indexing as NumPy.

3# Weight matrix as a tensor

Creating the same 2×3 weight matrix as the NumPy version, but as a PyTorch tensor. In practice, nn.Linear creates weight matrices automatically — this manual creation is for learning.

4W = torch.tensor([[0.2, -0.5, 0.1], [0.8, 0.3, -0.4]]) — 2×3 tensor

torch.tensor() with a nested list creates a 2-D tensor. Same data as the NumPy version. Default dtype is float32 (not float64 like NumPy).

EXECUTION STATE

📚 torch.tensor(nested_list) = Creates a tensor from a nested list. Each inner list becomes one row. Infers dtype from input values — float values become float32.

⬇ W =

tensor([[ 0.2000, -0.5000,  0.1000],
        [ 0.8000,  0.3000, -0.4000]])

7print("Shape:", W.shape)

Returns torch.Size([2, 3]) — equivalent to NumPy’s (2, 3) tuple.

EXECUTION STATE

W.shape = torch.Size([2, 3]) — same meaning as NumPy: 2 rows, 3 columns.

output = Shape: torch.Size([2, 3])

8print("W[0, 1]:", W[0, 1].item()) — Extract single element

W[0, 1] returns a 0-D tensor. The .item() method converts it to a plain Python float. This is needed because printing a 0-D tensor shows “tensor(-0.5000)”, but .item() gives just -0.5.

EXECUTION STATE

📚 .item() = Converts a single-element tensor to a Python scalar. Only works on tensors with exactly one element. W[0, 1] is a 0-D tensor, so .item() returns -0.5 as a float.

W[0, 1].item() = -0.5 — same element as NumPy. Indexing is identical across frameworks.

output = W[0, 1]: -0.5

9print("Row 0:", W[0])

Returns the first row as a 1-D tensor — neuron 0’s weights.

EXECUTION STATE

W[0] = tensor([ 0.2000, -0.5000, 0.1000])

output = Row 0: tensor([ 0.2000, -0.5000, 0.1000])

10print("Column 1:", W[:, 1])

Returns column 1 as a 1-D tensor — both neurons’ weights for input 1.

EXECUTION STATE

W[:, 1] = tensor([-0.5000, 0.3000]) — same slicing syntax as NumPy

output = Column 1: tensor([-0.5000, 0.3000])

12# Transpose

PyTorch supports .T for 2-D tensors, just like NumPy. For higher-dimensional tensors, use .transpose(dim0, dim1) or .permute().

13print("W.T shape:", W.T.shape)

The transpose flips shape (2, 3) to (3, 2), identical to NumPy.

EXECUTION STATE

W.T.shape = torch.Size([3, 2])

output = W.T shape: torch.Size([3, 2])

14print("W.T:\n", W.T)

Displays the transposed tensor.

EXECUTION STATE

output =

W.T:
 tensor([[ 0.2000,  0.8000],
        [-0.5000,  0.3000],
        [ 0.1000, -0.4000]])

16# Identity

Creating an identity matrix and verifying that I @ x = x, same as the NumPy version.

17I = torch.eye(3) — 3×3 identity tensor

torch.eye(n) creates an n×n identity tensor with 1s on the diagonal and 0s elsewhere. Same function name and behavior as np.eye().

EXECUTION STATE

📚 torch.eye(n) = Creates an n×n identity tensor. Returns float32 by default. Equivalent to np.eye(n).

I =

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])

18x = torch.tensor([0.8, 0.3, 0.5]) — Test vector

Same test vector as the NumPy version.

EXECUTION STATE

x = tensor([0.8000, 0.3000, 0.5000])

19print("I @ x:", I @ x) — Identity multiplication

The @ operator performs matrix-vector multiplication. I @ x returns the same vector x, confirming the identity property.

EXECUTION STATE

I @ x = tensor([0.8000, 0.3000, 0.5000]) — identical to x

output = I @ x: tensor([0.8000, 0.3000, 0.5000])

Full listing (matrices_torch.py):

import torch

# Weight matrix as a tensor
W = torch.tensor([[0.2, -0.5, 0.1],
                  [0.8,  0.3, -0.4]])

print("Shape:", W.shape)
print("W[0, 1]:", W[0, 1].item())
print("Row 0:", W[0])
print("Column 1:", W[:, 1])

# Transpose
print("W.T shape:", W.T.shape)
print("W.T:\n", W.T)

# Identity
I = torch.eye(3)
x = torch.tensor([0.8, 0.3, 0.5])
print("I @ x:", I @ x)

Matrix-Vector Multiplication: The Core of Neural Networks

This is THE fundamental operation: $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ . Every neuron, every layer, every forward pass comes down to this single equation. If you understand matrix-vector multiplication, you understand the skeleton of every neural network.

How It Works — Row by Row

Each row of $W$ is one neuron's weight vector. Matrix-vector multiplication takes the dot product of each row with the input vector $\mathbf{x}$ , producing one output per neuron:

Row 0 $\cdot$ $\mathbf{x}$ = neuron 0's pre-activation
Row 1 $\cdot$ $\mathbf{x}$ = neuron 1's pre-activation
…and so on for every neuron in the layer

For $W$ with shape $(2 \times 3)$ and $\mathbf{x}$ with shape $(3,)$: row 0 (3 elements) dots with $\mathbf{x}$ (3 elements) to give neuron 0's output, and row 1 dots with $\mathbf{x}$ to give neuron 1's output. The result is a vector with $m = 2$ elements, one per neuron.

The Shape Rule

$(m \times n) \; @ \; (n,) \rightarrow (m,)$: the inner dimensions must match. The number of columns of $W$ must equal the number of elements of $\mathbf{x}$. The output has $m$ elements, one per row of $W$. If these dimensions don't match, you get a shape error, one of the most common bugs in deep learning code.
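
A short sketch of the shape rule in NumPy, showing one product that satisfies it and one that does not. The vector `x_bad` is a deliberately wrong input made up for this illustration.

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1],
              [0.8,  0.3, -0.4]])   # shape (2, 3)

x_ok = np.array([0.8, 0.3, 0.5])    # shape (3,): matches W's 3 columns
print((W @ x_ok).shape)             # (2,)

x_bad = np.array([0.8, 0.3])        # shape (2,): does not match W's 3 columns
try:
    W @ x_bad
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print("Shape error:", err)
```

Printing `W.shape` and `x.shape` just before the failing line is usually the quickest way to track down this kind of error.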

Adding the Bias

After the dot products, each neuron adds its own bias $b_i$ . The bias vector $\mathbf{b}$ has one element per neuron. Bias lets a neuron fire even when the input is all zeros — it shifts the neuron's activation threshold.
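
The effect of the bias is easiest to see with an all-zero input. In this sketch, which reuses the weights and biases of this section's running example, every dot product is zero, so the output is exactly the bias vector.

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1],
              [0.8,  0.3, -0.4]])
b = np.array([0.1, -0.1])

x_zero = np.zeros(3)
z = W @ x_zero + b
print(z)                  # [ 0.1 -0.1]: with zero input, only the bias remains
```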

The Linear Transformation

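The product $W\mathbf{x}$ is called a linear transformation because it preserves addition and scaling: $W(\mathbf{u} + \mathbf{v}) = W\mathbf{u} + W\mathbf{v}$ and $W(c\,\mathbf{u}) = c\,W\mathbf{u}$. (Adding the bias $\mathbf{b}$ makes the full layer an affine map, though it is still commonly called a linear layer.) The weight matrix literally transforms the input vector into a new space: a 3-D input becomes a 2-D output, or a 768-D embedding becomes a 3072-D hidden representation. The matrix controls how the space is warped: which directions get stretched, which get compressed, and which get rotated.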
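
The "preserves addition and scaling" property can be checked numerically. In the sketch below, the second vector `v` and the scalars `alpha` and `beta` are arbitrary choices for illustration. The check passes for `W @ x` on its own; once a nonzero bias is added, additivity no longer holds.

```python
import numpy as np

W = np.array([[0.2, -0.5, 0.1],
              [0.8,  0.3, -0.4]])
b = np.array([0.1, -0.1])

u = np.array([0.8, 0.3, 0.5])
v = np.array([0.1, 0.9, 0.2])
alpha, beta = 2.0, -3.0          # arbitrary scalars

# Linearity of x -> W @ x: combining inputs before or after gives the same result.
lhs = W @ (alpha * u + beta * v)
rhs = alpha * (W @ u) + beta * (W @ v)
print(np.allclose(lhs, rhs))     # True

# With a nonzero bias, x -> W @ x + b no longer preserves addition.
def layer(x):
    return W @ x + b

print(np.allclose(layer(u + v), layer(u) + layer(v)))  # False (the two sides differ by b)
```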

The visualizer below shows a 2-D simplification — each matrix entry changes how the grid of input points is warped into output points. Try adjusting the matrix values to see stretching, rotation, shearing, and reflection:

[Interactive visualizer: matrix transformation]

Python — Matrix-Vector Multiplication (The Forward Pass)

🐍matrix_vector.py


1import numpy as np

NumPy provides the @ operator for matrix-vector multiplication and np.dot() for explicit dot products. Both are used here to demonstrate the forward pass.

EXECUTION STATE

📚 numpy = Provides @ (matrix multiply), np.dot() (dot product), and np.array() for creating matrices and vectors.

3# Layer: 2 neurons, 3 inputs

We define a complete neural network layer: weight matrix W, bias vector b, and input vector x. This is everything needed for a single forward pass computation: z = W @ x + b.

4W = np.array([[0.2, -0.5, 0.1], [0.8, 0.3, -0.4]]) — Weight matrix (2×3)

The weight matrix for a layer with 2 neurons and 3 inputs. Row 0 = neuron 0’s weights, row 1 = neuron 1’s weights. Each row will be dot-producted with the input vector.

EXECUTION STATE

⬇ W (2×3) =

  inp0   inp1   inp2
neuron0:  0.2   -0.5    0.1
neuron1:  0.8    0.3   -0.4

6b = np.array([0.1, -0.1]) — Bias vector (2 elements, one per neuron)

Each neuron has its own bias — a constant added after the dot product. Bias lets neurons fire even when the input is zero. Neuron 0 has bias 0.1 (slight positive push), neuron 1 has bias -0.1 (slight negative push).

EXECUTION STATE

⬇ b = [0.1, -0.1] — b[0] for neuron 0, b[1] for neuron 1

7x = np.array([0.8, 0.3, 0.5]) — Input vector (3 features)

The input to this layer — 3 feature values. In a real network, this could be pixel intensities, word embedding values, or the output of a previous layer.

EXECUTION STATE

⬇ x = [0.8, 0.3, 0.5] — 3 input features

9# The forward pass: z = W @ x + b

This is THE equation of neural networks. W @ x computes the dot product of each neuron’s weights with the input, and + b adds the bias. The result z is the pre-activation output — it will be passed through an activation function (like ReLU) in a real network.

10z = W @ x + b — The forward pass in one line

Step 1: W @ x computes [row0·x, row1·x] = [0.2×0.8 + (-0.5)×0.3 + 0.1×0.5, 0.8×0.8 + 0.3×0.3 + (-0.4)×0.5] = [0.06, 0.53]. Step 2: + b adds [0.1, -0.1], giving [0.16, 0.43].

EXECUTION STATE

📚 @ operator = Matrix-vector multiplication. For W(2×3) @ x(3,): computes the dot product of each row of W with x. Returns a vector with one element per row of W. Shape rule: (m×n) @ (n,) → (m,).

── Neuron 0 ── =

W[0] · x = 0.2×0.8 + (-0.5)×0.3 + 0.1×0.5 = 0.16 + (-0.15) + 0.05 = 0.06

0.06 + b[0] = 0.06 + 0.1 = 0.16

── Neuron 1 ── =

W[1] · x = 0.8×0.8 + 0.3×0.3 + (-0.4)×0.5 = 0.64 + 0.09 + (-0.20) = 0.53

0.53 + b[1] = 0.53 + (-0.1) = 0.43

⬆ z = [0.16, 0.43] — neuron 0 output = 0.16, neuron 1 output = 0.43. Neuron 1 fires more strongly for this input.

11print("z =", z)

Displays the forward pass result — the pre-activation output of both neurons.

EXECUTION STATE

output = z = [0.16 0.43]

13# What @ does, row by row:

To make the matrix-vector multiplication concrete, we now manually compute each neuron’s output step by step. This loop does exactly what W @ x + b does internally, but one neuron at a time.

14for i in range(len(W)): — Loop over neurons

Iterates over each row of W (each neuron). len(W) = 2, so i takes values 0 and 1. Each iteration computes one neuron’s output: dot product of its weight row with x, plus its bias.

EXECUTION STATE

📚 range(len(W)) = range(2) → [0, 1]. len(W) returns the number of rows (neurons). We iterate once per neuron.

LOOP TRACE · 2 iterations

i=0 (Neuron 0)

i = 0 — first neuron

row = W[0] = [0.2, -0.5, 0.1]

dot = np.dot(row, x) = 0.2×0.8 + (-0.5)×0.3 + 0.1×0.5 = 0.16 - 0.15 + 0.05 = 0.0600

result = dot + b[0] = 0.0600 + 0.1 = 0.1600

output = Neuron 0: [ 0.2 -0.5 0.1] . [0.8 0.3 0.5] = 0.0600, + b=0.1 -> 0.1600

i=1 (Neuron 1)

i = 1 — second neuron

row = W[1] = [0.8, 0.3, -0.4]

dot = np.dot(row, x) = 0.8×0.8 + 0.3×0.3 + (-0.4)×0.5 = 0.64 + 0.09 - 0.20 = 0.5300

result = dot + b[1] = 0.5300 + (-0.1) = 0.4300

output = Neuron 1: [ 0.8 0.3 -0.4] . [0.8 0.3 0.5] = 0.5300, + b=-0.1 -> 0.4300

15row = W[i] — Get neuron i’s weight vector

Extracts row i from the weight matrix — this is neuron i’s complete set of weights. Each element is the weight for one input.

EXECUTION STATE

row (when i=0) = [0.2, -0.5, 0.1] — neuron 0’s weights

row (when i=1) = [0.8, 0.3, -0.4] — neuron 1’s weights

16dot = np.dot(row, x) — Dot product of weights with input

The core computation: multiply each weight by its corresponding input and sum. This is what the neuron “sees” before the bias is added.

EXECUTION STATE

📚 np.dot(row, x) = Computes the dot product of two 1-D arrays: sum of element-wise products. Returns a scalar.

dot (when i=0) = 0.2×0.8 + (-0.5)×0.3 + 0.1×0.5 = 0.0600

dot (when i=1) = 0.8×0.8 + 0.3×0.3 + (-0.4)×0.5 = 0.5300

17result = dot + b[i] — Add the bias

Adds the neuron’s bias to the dot product. The bias shifts the neuron’s activation threshold — a positive bias makes the neuron more likely to fire.

EXECUTION STATE

result (when i=0) = 0.0600 + 0.1 = 0.1600

result (when i=1) = 0.5300 + (-0.1) = 0.4300

18print(f" Neuron {i}: ...") — Display computation

Prints the complete computation for each neuron: weights, input, dot product, bias, and final result. This matches what W @ x + b computes in one vectorized operation.

EXECUTION STATE

output (i=0) = Neuron 0: [ 0.2 -0.5 0.1] . [0.8 0.3 0.5] = 0.0600, + b=0.1 -> 0.1600

output (i=1) = Neuron 1: [ 0.8 0.3 -0.4] . [0.8 0.3 0.5] = 0.5300, + b=-0.1 -> 0.4300

Full listing (matrix_vector.py):

import numpy as np

# Layer: 2 neurons, 3 inputs
W = np.array([[0.2, -0.5, 0.1],
              [0.8,  0.3, -0.4]])
b = np.array([0.1, -0.1])
x = np.array([0.8, 0.3, 0.5])

# The forward pass: z = W @ x + b
z = W @ x + b
print("z =", z)

# What @ does, row by row:
for i in range(len(W)):
    row = W[i]
    dot = np.dot(row, x)
    result = dot + b[i]
    print(f"  Neuron {i}: {row} . {x} = {dot:.4f}, + b={b[i]} -> {result:.4f}")

In PyTorch, nn.Linear wraps this computation into a reusable layer that integrates with the optimizer for training. The layer stores .weight (the $W$ matrix) and .bias (the $\mathbf{b}$ vector), and its forward method computes $\mathbf{z} = \mathbf{x} \cdot W^T + \mathbf{b}$ automatically when you call layer(x):

PyTorch — nn.Linear and the Forward Pass

🐍matrix_vector_torch.py


1import torch

PyTorch core library, needed for tensor creation and the @ operator.

EXECUTION STATE

📚 torch = PyTorch core. Provides torch.tensor(), torch.no_grad(), and the @ matrix multiply operator.

2import torch.nn as nn

The neural network module that contains all layer types. nn.Linear is the fully-connected (dense) layer that computes z = W @ x + b internally.

EXECUTION STATE

📚 torch.nn = Neural network building blocks: nn.Linear, nn.Conv2d, nn.ReLU, nn.Module, etc. Aliased as nn by convention.

4# Using nn.Linear (the standard way)

In real PyTorch code, you rarely create weight matrices by hand. Instead, you use nn.Linear, which creates the weight matrix and bias vector automatically, handles the forward pass math, and integrates with the optimizer for training.

5layer = nn.Linear(3, 2, bias=True) — Create a linear layer

Creates a fully-connected layer with 3 inputs and 2 outputs. Internally, this creates a weight matrix W of shape (2, 3) and a bias vector b of shape (2,), both filled with random values initially.

EXECUTION STATE

📚 nn.Linear(in_features, out_features, bias) = Creates a linear transformation layer. Stores .weight (out×in matrix) and .bias (out vector). Forward: output = input @ W.T + b.

⬇ arg 1: in_features = 3 = Number of input features. Sets the number of columns in the weight matrix. Each input neuron connects to every output neuron.

⬇ arg 2: out_features = 2 = Number of output neurons. Sets the number of rows in the weight matrix and the length of the bias vector.

⬇ arg 3: bias = True = Whether to include a bias vector. Default is True. Setting False removes the bias and computes just W @ x.

layer.weight.shape = torch.Size([2, 3]), i.e. (out_features, in_features). Note: this is the same (outputs × inputs) layout as $W$ in the math notation; the transpose is applied only at compute time, in x @ W.T.

layer.bias.shape = torch.Size([2]) — one bias per output neuron

7# Set specific weights for comparison

nn.Linear initializes weights randomly. We override them with our known values so the output matches the NumPy version exactly. In real training, the optimizer updates these weights automatically.

8with torch.no_grad(): — Disable gradient tracking

torch.no_grad() tells PyTorch not to track operations for gradient computation. We use it here because we are manually setting weights, not training. Without it, the in-place copy_ on a parameter that requires gradients raises a RuntimeError, because autograd does not allow in-place modification of a leaf tensor that requires grad.

EXECUTION STATE

📚 torch.no_grad() = Context manager that disables gradient tracking. Use for: manual weight setting, inference (evaluation), and any operation where you don’t need backpropagation. Saves memory and computation.

9layer.weight.copy_(...) — Set weight matrix

The .copy_() method overwrites the tensor’s data in-place (the trailing underscore _ is PyTorch convention for in-place operations). We set W to our known values.

EXECUTION STATE

📚 .copy_(tensor) = In-place copy: replaces all values in the tensor with values from the argument. The _ suffix means in-place (modifies the tensor, returns the same tensor). Contrast with .clone() which creates a new tensor.

layer.weight after copy_ =

tensor([[ 0.2000, -0.5000,  0.1000],
        [ 0.8000,  0.3000, -0.4000]])

11layer.bias.copy_(...) — Set bias vector

Sets the bias vector to [0.1, -0.1] to match our NumPy example.

EXECUTION STATE

layer.bias after copy_ = tensor([ 0.1000, -0.1000])

13x = torch.tensor([0.8, 0.3, 0.5]) — Input tensor

Same input as the NumPy version. A 1-D tensor with 3 features.

EXECUTION STATE

x = tensor([0.8000, 0.3000, 0.5000])

15# Forward pass: z = W @ x + b

The forward pass through the layer. Calling layer(x) internally computes W @ x + b, but layer(x) also handles batched inputs, hooks, and other PyTorch machinery.

16z = layer(x) — Forward pass through the layer

Calling layer(x) invokes the layer’s forward() method, which computes F.linear(x, self.weight, self.bias) = x @ W.T + b. Note: PyTorch computes x @ W.T (not W @ x) because it expects batched inputs where x has shape (batch, in_features).

EXECUTION STATE

📚 layer(x) = Calls layer.__call__(x), which runs any registered forward pre-hooks, then layer.forward(x), then any registered forward hooks. forward() computes F.linear(x, weight, bias) = x @ weight.T + bias.

computation = x @ W.T + b = [0.8, 0.3, 0.5] @ W.T + [0.1, -0.1] = [0.06 + 0.1, 0.53 + (-0.1)] = [0.16, 0.43]

⬆ z = tensor([0.1600, 0.4300]) — matches NumPy exactly

17print("z =", z)

Displays the forward pass result.

EXECUTION STATE

output = z = tensor([0.1600, 0.4300], grad_fn=<AddBackward0>)

19# Manual equivalent

To verify, we compute W @ x + b manually using the stored weight and bias tensors. This should match layer(x) exactly.

20z_manual = layer.weight @ x + layer.bias — Manual computation

Directly multiplies the weight matrix by x and adds the bias. This is the raw math that layer.forward() does internally (via F.linear). The result is identical.

EXECUTION STATE

layer.weight @ x = tensor([0.0600, 0.5300]) — the dot products without bias

+ layer.bias = tensor([0.0600 + 0.1000, 0.5300 + (-0.1000)])

⬆ z_manual = tensor([0.1600, 0.4300]) — matches layer(x) exactly

21print("Manual:", z_manual)

Confirms the manual computation matches the layer’s forward pass.

EXECUTION STATE

output = Manual: tensor([0.1600, 0.4300], grad_fn=<AddBackward0>)

Full listing (matrix_vector_torch.py):

import torch
import torch.nn as nn

# Using nn.Linear (the standard way)
layer = nn.Linear(3, 2, bias=True)

# Set specific weights for comparison
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[0.2, -0.5, 0.1],
                                     [0.8,  0.3, -0.4]]))
    layer.bias.copy_(torch.tensor([0.1, -0.1]))

x = torch.tensor([0.8, 0.3, 0.5])

# Forward pass: z = W @ x + b
z = layer(x)
print("z =", z)

# Manual equivalent
z_manual = layer.weight @ x + layer.bias
print("Manual:", z_manual)

Matrix-Matrix Multiplication: Processing Batches

In practice, we don't process one input at a time — we process batches. Instead of one vector $\mathbf{x}$ , we have a matrix $X$ where each row is one sample. If we have 4 images, each with 3 features, $X$ has shape $(4 \times 3)$ — 4 rows, 3 columns.

The Shape Rule for Matrix Multiply

$(m \times k) \; @ \; (k \times n) \rightarrow (m \times n)$ — the inner dimensions must match. The $k$ columns of the left matrix must equal the $k$ rows of the right matrix. The output takes the outer dimensions: $m$ rows from the left and $n$ columns from the right.

For our example: 4 samples with 3 features each ( $4 \times 3$ ) times $W^T$ ( $3 \times 2$ ) gives $Z$ ( $4 \times 2$ ). Each row of $Z$ is one sample's output through both neurons — all 4 samples processed simultaneously.
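
As a sketch of the shape rule for this example, the snippet below multiplies the batch by $W^T$ without the bias and then checks one entry by hand. The indices `i` and `j` are arbitrary picks for illustration.

```python
import numpy as np

X = np.array([[0.8, 0.3, 0.5],
              [0.1, 0.9, 0.2],
              [0.6, 0.4, 0.7],
              [0.3, 0.8, 0.1]])       # (4, 3): 4 samples, 3 features
W = np.array([[0.2, -0.5, 0.1],
              [0.8,  0.3, -0.4]])     # (2, 3): 2 neurons, 3 inputs

Z = X @ W.T                            # (4, 3) @ (3, 2) -> (4, 2)
print(X.shape, W.T.shape, Z.shape)     # (4, 3) (3, 2) (4, 2)

# Entry Z[i, j] is the dot product of sample i with neuron j's weight row.
i, j = 2, 1
print(np.isclose(Z[i, j], np.dot(X[i], W[j])))   # True
```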

Non-Commutativity: Order Matters

Unlike scalar multiplication, $A \; @ \; B \neq B \; @ \; A$ in general. Swapping the order changes both the shapes and the values. In fact, if $A$ is $(4 \times 3)$ and $B$ is $(3 \times 2)$ , then $B \; @ \; A$ won't even work — the inner dimensions $2 \neq 4$ don't match. Matrix order is a frequent source of bugs in deep learning code.
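
Both halves of this point can be sketched in NumPy. The small integer matrices `A` and `B` and the all-ones matrices `P` and `Q` are made-up examples: the first pair shows that both orders can be defined and still disagree, the second that one order may not be defined at all.

```python
import numpy as np

# Square matrices: both orders are defined, but the results differ.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])
print(A @ B)    # [[2 1]
                #  [4 3]]
print(B @ A)    # [[3 4]
                #  [1 2]]

# Non-square matrices: one order may not be defined at all.
P = np.ones((4, 3))
Q = np.ones((3, 2))
print((P @ Q).shape)          # (4, 2)
try:
    Q @ P                     # (3, 2) @ (4, 3): inner dimensions 2 and 4 differ
    order_error = False
except ValueError:
    order_error = True
print(order_error)            # True
```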

Why Batches Matter

GPUs are massively parallel processors, so multiplying by a whole batch matrix often costs little more than multiplying by a single vector. As long as the batch fits within the GPU's compute and memory capacity, a batch of 256 samples can pass through a layer in roughly the same time as 1 sample, because the rows are processed in parallel. This is a major reason training almost always uses batches: up to that limit, the parallelism is close to free.
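
The benefit of batching can be sketched even on a CPU with NumPy. The snippet below compares a Python loop of matrix-vector products with a single matrix-matrix product. The sizes (256 samples, 768 inputs, 3072 neurons) and the random data are assumptions for illustration, and the printed timings will vary by machine, so no expected numbers are given.

```python
import time
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
X = rng.standard_normal((256, 768))     # hypothetical batch: 256 samples, 768 features
W = rng.standard_normal((3072, 768))    # hypothetical layer: 3072 neurons
b = rng.standard_normal(3072)

# One sample at a time: 256 separate matrix-vector products.
start = time.perf_counter()
Z_loop = np.stack([W @ x + b for x in X])
t_loop = time.perf_counter() - start

# Whole batch at once: a single matrix-matrix product.
start = time.perf_counter()
Z_batch = X @ W.T + b
t_batch = time.perf_counter() - start

print(np.allclose(Z_loop, Z_batch))     # True: same numbers either way
print(f"loop: {t_loop:.4f}s  batch: {t_batch:.4f}s")
```

This measures CPU behaviour only; it illustrates the equivalence of the two computations rather than GPU performance.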

The interactive visualizer below shows how matrix-matrix multiplication works step by step. Click "Play" to watch the multiplication animate automatically, or use "Step →" to advance one dot product at a time:

[Interactive visualizer: matrix multiplication]

Python — Matrix-Matrix Multiplication (Batch Processing)

🐍matrix_multiply.py


1import numpy as np

NumPy provides ndarray and the @ operator for matrix multiplication. All matrix math in this example runs as optimized C code, not Python loops.

EXECUTION STATE

📚 numpy = Numerical computing library — ndarray type, @ matrix multiply operator, broadcasting, and vectorized math. Aliased as np by convention.

3# Batch of 4 input samples, 3 features each

Instead of processing one input vector at a time, we stack 4 samples into a matrix. Each row is one sample with 3 features. This is how real training works — data comes in batches of 32, 64, or 256 samples processed simultaneously.

4X = np.array([[0.8, 0.3, 0.5], ...]) — Input batch matrix

Creates a 4×3 matrix where each row is one input sample and each column is one feature. Row 0 is the same input we used in the single-sample example: [0.8, 0.3, 0.5].

EXECUTION STATE

📚 np.array(nested_list) = Converts a list of lists into a 2-D ndarray (matrix). The outer list defines rows, inner lists define columns. All rows must have the same length.

⬇ X (4×3) =

       f0    f1    f2
s0  [0.80, 0.30, 0.50]
s1  [0.10, 0.90, 0.20]
s2  [0.60, 0.40, 0.70]
s3  [0.30, 0.80, 0.10]

→ X.shape = (4, 3) — 4 samples, 3 features each

9W = np.array([[0.2, -0.5, 0.1], [0.8, 0.3, -0.4]]) — Weight matrix

Same weight matrix from the single-sample example. Shape (2, 3): 2 output neurons, each with 3 weights (one per input feature).

EXECUTION STATE

⬇ W (2×3) =

         f0     f1     f2
n0  [ 0.20, -0.50,  0.10]
n1  [ 0.80,  0.30, -0.40]

→ W.shape = (2, 3) — 2 neurons, 3 input features each

12b = np.array([0.1, -0.1]) — Bias vector

One bias per output neuron. After matrix multiplication, b is added to every row via broadcasting.

EXECUTION STATE

⬇ b = [0.1, -0.1] — bias for neuron 0 is +0.1, bias for neuron 1 is -0.1

14# ALL 4 samples at once: Z = X @ W.T + b

This is the key insight: instead of looping over samples one by one, we multiply the entire batch matrix X by the transposed weight matrix W.T in a single operation. NumPy runs this as one optimized call on the CPU; the same expression on GPU tensors can be parallelized across thousands of cores.

15Z = X @ W.T + b — Batch forward pass

Matrix multiply X (4×3) by W.T (3×2) to get Z_raw (4×2), then add bias b (2,) to each row via broadcasting. Each row of Z is one sample’s output — both neurons computed simultaneously for all 4 samples.

EXECUTION STATE

@ (matrix multiply) = Python’s matrix multiplication operator. X(4×3) @ W.T(3×2) → (4×2). Inner dimensions must match: 3 = 3. Each element Z[i,j] = dot(X[i], W.T[:,j]).

W.T (3×2) =

        n0     n1
f0  [ 0.20,  0.80]
f1  [-0.50,  0.30]
f2  [ 0.10, -0.40]

── Row 0 (sample 0) ── =

Z_raw[0,0] = 0.8×0.2 + 0.3×(-0.5) + 0.5×0.1 = 0.16 - 0.15 + 0.05 = 0.06

Z_raw[0,1] = 0.8×0.8 + 0.3×0.3 + 0.5×(-0.4) = 0.64 + 0.09 - 0.20 = 0.53

Z[0] = Z_raw[0] + b = [0.06+0.1, 0.53+(-0.1)] = [0.16, 0.43]

── Row 1 (sample 1) ── =

Z_raw[1,0] = 0.1×0.2 + 0.9×(-0.5) + 0.2×0.1 = 0.02 - 0.45 + 0.02 = -0.41

Z_raw[1,1] = 0.1×0.8 + 0.9×0.3 + 0.2×(-0.4) = 0.08 + 0.27 - 0.08 = 0.27

Z[1] = Z_raw[1] + b = [-0.41+0.1, 0.27+(-0.1)] = [-0.31, 0.17]

── Row 2 (sample 2) ── =

Z_raw[2,0] = 0.6×0.2 + 0.4×(-0.5) + 0.7×0.1 = 0.12 - 0.20 + 0.07 = -0.01

Z_raw[2,1] = 0.6×0.8 + 0.4×0.3 + 0.7×(-0.4) = 0.48 + 0.12 - 0.28 = 0.32

Z[2] = Z_raw[2] + b = [-0.01+0.1, 0.32+(-0.1)] = [0.09, 0.22]

── Row 3 (sample 3) ── =

Z_raw[3,0] = 0.3×0.2 + 0.8×(-0.5) + 0.1×0.1 = 0.06 - 0.40 + 0.01 = -0.33

Z_raw[3,1] = 0.3×0.8 + 0.8×0.3 + 0.1×(-0.4) = 0.24 + 0.24 - 0.04 = 0.44

Z[3] = Z_raw[3] + b = [-0.33+0.1, 0.44+(-0.1)] = [-0.23, 0.34]

⬆ Z (4×2) =

       n0     n1
s0  [ 0.16,  0.43]
s1  [-0.31,  0.17]
s2  [ 0.09,  0.22]
s3  [-0.23,  0.34]

16print("X shape:", X.shape)

Displays the shape of the input batch matrix.

EXECUTION STATE

output = X shape: (4, 3)

17print("W.T shape:", W.T.shape)

The transposed weight matrix shape. W is (2,3), so W.T is (3,2): its 3 rows match X’s 3 columns, which is the inner dimension of the product.

EXECUTION STATE

output = W.T shape: (3, 2)

18print("Z shape:", Z.shape)

The output shape: 4 samples, 2 neurons each. (4×3) @ (3×2) → (4×2).

EXECUTION STATE

output = Z shape: (4, 2)

19print("Z:\n", Z)

Displays the full output matrix. Each row is one sample’s output through both neurons.

EXECUTION STATE

output =

Z:
 [[ 0.16  0.43]
 [-0.31  0.17]
 [ 0.09  0.22]
 [-0.23  0.34]]

21# Verify: row 0 matches single-sample result

To prove that batch processing gives the same results as single-sample processing, we compute W @ X[0] + b (the single-sample formula) and compare it to Z[0] (the first row of the batch result).

22z0 = W @ X[0] + b — Single-sample computation

Computes one sample the old way: W (2×3) @ X[0] (3,) + b (2,). This is the matrix-vector multiply from the previous section.

EXECUTION STATE

X[0] = [0.8, 0.3, 0.5] — the first sample extracted as a 1-D vector

W @ X[0] = [0.2×0.8+(-0.5)×0.3+0.1×0.5, 0.8×0.8+0.3×0.3+(-0.4)×0.5] = [0.06, 0.53]

⬆ z0 = W @ X[0] + b = [0.06+0.1, 0.53-0.1] = [0.16, 0.43]

23print("Single sample z0:", z0)

Displays the single-sample result for comparison.

EXECUTION STATE

output = Single sample z0: [0.16 0.43]

24print("Batch row 0: ", Z[0])

Displays the first row of the batch result. It matches z0, showing that the batch operation is equivalent to processing each sample individually.

EXECUTION STATE

output = Batch row 0: [0.16 0.43]

z0 == Z[0]? = True: [0.16, 0.43] = [0.16, 0.43] (up to floating-point rounding; np.allclose(z0, Z[0]) is the robust check). Batch processing gives the same results as single-sample processing, just faster.

import numpy as np

# Batch of 4 input samples, 3 features each
X = np.array([[0.8, 0.3, 0.5],
              [0.1, 0.9, 0.2],
              [0.6, 0.4, 0.7],
              [0.3, 0.8, 0.1]])

W = np.array([[0.2, -0.5, 0.1],
              [0.8,  0.3, -0.4]])

b = np.array([0.1, -0.1])

# ALL 4 samples at once: Z = X @ W.T + b
Z = X @ W.T + b
print("X shape:", X.shape)
print("W.T shape:", W.T.shape)
print("Z shape:", Z.shape)
print("Z:\n", Z)

# Verify: row 0 matches single-sample result
z0 = W @ X[0] + b
print("Single sample z0:", z0)
print("Batch row 0:     ", Z[0])
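
The walkthrough above checks only row 0 against the single-sample formula. As a minimal sketch, the same check can be extended to every row by building the result one sample at a time and comparing it with the batched product. The name Z_loop is introduced here for illustration, and np.allclose is used instead of == because floating-point results can differ in the last bits.

```python
import numpy as np

X = np.array([[0.8, 0.3, 0.5],
              [0.1, 0.9, 0.2],
              [0.6, 0.4, 0.7],
              [0.3, 0.8, 0.1]])
W = np.array([[0.2, -0.5, 0.1],
              [0.8,  0.3, -0.4]])
b = np.array([0.1, -0.1])

# One sample at a time: W (2x3) @ x (3,) + b (2,) for each row x of X
Z_loop = np.stack([W @ x + b for x in X])

# Whole batch in one call: X (4x3) @ W.T (3x2) + b (2,)
Z_batch = X @ W.T + b

print(Z_loop.shape, Z_batch.shape)   # (4, 2) (4, 2)
print(np.allclose(Z_loop, Z_batch))  # True
```

The loop makes four separate matrix-vector products, while the batched form issues a single matrix-matrix product, which is the form numerical libraries are optimized for.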

In PyTorch, nn.Linear handles batches automatically — the same layer(X) call works whether X is a single vector or a batch matrix. Internally, it computes $X \; @ \; W^T + \mathbf{b}$ regardless of batch size:

PyTorch — Batch Forward Pass with nn.Linear

🐍matrix_multiply_torch.py

1import torch

PyTorch core library for tensor operations and the @ matrix multiply operator.

EXECUTION STATE

📚 torch = PyTorch core. Provides torch.tensor(), torch.no_grad(), and the @ matrix multiply operator.

2import torch.nn as nn

Neural network module containing layer types. nn.Linear handles batched inputs automatically — it computes X @ W.T + b for any batch size.

EXECUTION STATE

📚 torch.nn = Neural network building blocks: nn.Linear, nn.Conv2d, nn.ReLU, nn.Module, etc. Aliased as nn by convention.

4# Batch of 4 samples

We create a batch of 4 input samples as a 2-D tensor. nn.Linear expects input shape (batch_size, in_features) and returns (batch_size, out_features).

5X = torch.tensor([[0.8, 0.3, 0.5], ...]) — Batch input tensor

Creates a 4×3 tensor where each row is one sample. Same data as the NumPy version.

EXECUTION STATE

📚 torch.tensor(data) = Creates a PyTorch tensor from a Python list or NumPy array. Tensors support GPU acceleration and automatic differentiation (autograd).

⬇ X (4×3) =

tensor([[0.8, 0.3, 0.5],
        [0.1, 0.9, 0.2],
        [0.6, 0.4, 0.7],
        [0.3, 0.8, 0.1]])

X.shape = torch.Size([4, 3]) — 4 samples, 3 features each

10layer = nn.Linear(3, 2, bias=True) — Create a linear layer

Creates a fully-connected layer with 3 inputs and 2 outputs. nn.Linear automatically handles any batch size — pass 1 sample or 1000, same code.

EXECUTION STATE

📚 nn.Linear(in_features, out_features, bias) = Creates a linear transformation layer. Internally stores .weight (2×3 matrix) and .bias (2-element vector). Forward: output = input @ W.T + b.

⬇ arg 1: in_features = 3 = Number of input features per sample. Each sample is a 3-element vector.

⬇ arg 2: out_features = 2 = Number of output neurons. The layer produces 2 values per sample.

⬇ arg 3: bias = True = Include a bias vector. Default is True. Each output neuron gets its own additive bias term.

11with torch.no_grad(): — Disable gradient tracking

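Disables autograd for the weight-setting operations. We are manually overriding weights, not training, so no gradient computation is needed. The context manager is also required here: an in-place copy_ on a parameter that requires grad raises a RuntimeError outside torch.no_grad().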

EXECUTION STATE

📚 torch.no_grad() = Context manager that disables gradient tracking. Saves memory and computation when you don’t need backpropagation.

12layer.weight.copy_(...) — Set weight matrix

Overwrites the randomly initialized weights with our known values so results match the NumPy version exactly.

EXECUTION STATE

📚 .copy_(tensor) = In-place operation (trailing _) that replaces all values. The source tensor must be broadcastable to the destination’s shape; here both are (2, 3).

layer.weight after copy_ =

tensor([[ 0.2, -0.5,  0.1],
        [ 0.8,  0.3, -0.4]])

14layer.bias.copy_(...) — Set bias vector

Sets the bias to [0.1, -0.1] to match the NumPy example.

EXECUTION STATE

layer.bias after copy_ = tensor([ 0.1, -0.1])

16# nn.Linear handles batches automatically!

This is the beauty of nn.Linear: the same layer(X) call works whether X is a single vector (3,) or a batch matrix (4×3). Internally it computes X @ W.T + b, and the matrix multiply naturally handles the batch dimension.

17Z = layer(X) — Batch forward pass

Passes all 4 samples through the layer simultaneously. Internally computes X(4×3) @ W.T(3×2) + b(2,) = Z(4×2). The bias is broadcast-added to each row.

EXECUTION STATE

📚 layer(X) = Calls layer.__call__(X), which runs any registered forward pre-hooks, then layer.forward(X), then any registered forward hooks. forward() computes F.linear(X, weight, bias) = X @ weight.T + bias.

computation = X(4×3) @ W.T(3×2) + b(2,) = Z(4×2)

⬆ Z (4×2) =

tensor([[ 0.16,  0.43],
        [-0.31,  0.17],
        [ 0.09,  0.22],
        [-0.23,  0.34]])

18print("Z shape:", Z.shape)

Displays the output shape: 4 samples, 2 outputs each.

EXECUTION STATE

output = Z shape: torch.Size([4, 2])

19print("Z:\n", Z)

Displays the full batch output. Every row matches the NumPy result.

EXECUTION STATE

output =

Z:
 tensor([[ 0.1600,  0.4300],
        [-0.3100,  0.1700],
        [ 0.0900,  0.2200],
        [-0.2300,  0.3400]], grad_fn=<AddmmBackward0>)

21# Equivalent manual computation

To verify, we compute the same operation manually using the @ operator and the layer’s stored weight and bias tensors.

22Z_manual = X @ layer.weight.T + layer.bias — Manual batch computation

Manually computes X @ W.T + b. The .T transpose on layer.weight converts (2,3) → (3,2) for the matrix multiply. The bias (2,) is broadcast-added to each of the 4 rows.

EXECUTION STATE

.T (transpose) = Swaps rows and columns. layer.weight is (2,3), so layer.weight.T is (3,2). This matches the inner dimension of X (4,3).

X @ layer.weight.T =

tensor([[ 0.06,  0.53],
        [-0.41,  0.27],
        [-0.01,  0.32],
        [-0.33,  0.44]]) — before bias

+ layer.bias = Broadcasting: [0.1, -0.1] is added to each row independently

⬆ Z_manual (4×2) =

tensor([[ 0.16,  0.43],
        [-0.31,  0.17],
        [ 0.09,  0.22],
        [-0.23,  0.34]]) — matches layer(X) exactly

23print("Manual:\n", Z_manual)

Displays the manual computation result. Confirms that layer(X) and X @ W.T + b produce identical outputs.

EXECUTION STATE

output =

Manual:
 tensor([[ 0.1600,  0.4300],
        [-0.3100,  0.1700],
        [ 0.0900,  0.2200],
        [-0.2300,  0.3400]], grad_fn=<AddBackward0>)

Z == Z_manual? = True: both methods produce the same results (up to floating-point rounding, so torch.allclose(Z, Z_manual) is the robust check). layer(X) is the standard way; manual computation is for understanding.

import torch
import torch.nn as nn

# Batch of 4 samples
X = torch.tensor([[0.8, 0.3, 0.5],
                  [0.1, 0.9, 0.2],
                  [0.6, 0.4, 0.7],
                  [0.3, 0.8, 0.1]])

layer = nn.Linear(3, 2, bias=True)
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[0.2, -0.5, 0.1],
                                     [0.8,  0.3, -0.4]]))
    layer.bias.copy_(torch.tensor([0.1, -0.1]))

# nn.Linear handles batches automatically!
Z = layer(X)
print("Z shape:", Z.shape)
print("Z:\n", Z)

# Equivalent manual computation
Z_manual = X @ layer.weight.T + layer.bias
print("Manual:\n", Z_manual)
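
The claim that the same layer call works for a single vector and for a batch can be checked directly. The sketch below assumes PyTorch is installed and reuses the weights from this section; the names x_single and X_batch are introduced here for illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[0.2, -0.5, 0.1],
                                     [0.8,  0.3, -0.4]]))
    layer.bias.copy_(torch.tensor([0.1, -0.1]))

x_single = torch.tensor([0.8, 0.3, 0.5])      # one sample, shape (3,)
X_batch = torch.tensor([[0.8, 0.3, 0.5],
                        [0.1, 0.9, 0.2]])     # two samples, shape (2, 3)

with torch.no_grad():                         # inference only, no graph needed
    z_single = layer(x_single)                # shape (2,)
    Z_batch = layer(X_batch)                  # shape (2, 2)

print(z_single.shape, Z_batch.shape)
# The single-sample output should agree with row 0 of the batch output
print(torch.allclose(z_single, Z_batch[0]))
```

Only the last dimension of the input has to equal in_features; any leading dimensions are treated as batch dimensions.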

Putting It All Together: The Forward Pass

Now we have all the pieces to understand a complete neural network layer. The forward pass is the pipeline that transforms raw input into useful output — and it consists of exactly the operations we've learned:

Input $X$ — a batch of samples (matrix)
Linear transform $Z = X \; @ \; W^T + \mathbf{b}$ — matrix multiply + bias (the weighted sum)
Activation $A = f(Z)$ — element-wise nonlinear function (the decision gate)
Output $A$ — the activated result, ready for the next layer

Each step is an operation we've already learned: matrix multiply, vector add, and element-wise function application. A neural network is just a chain of these layers, one feeding into the next.

The Big Picture: A neural network is just a chain of these simple operations. Each layer takes numbers in, multiplies, adds, and applies a nonlinear function. That's it. The magic comes from stacking many layers and learning the right weight values through training.
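
To make the "chain of layers" idea concrete, here is a minimal NumPy sketch that applies a list of (W, b) pairs one after another. The first layer uses the weights from this section. The second layer (W2, b2) is hypothetical, with values chosen only for illustration, and applying ReLU after every layer is a simplification, since real networks often treat the final layer differently.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0, z)

def forward(X, layers):
    """Apply each (W, b) pair in order: linear transform, then ReLU."""
    A = X
    for W, b in layers:
        A = relu(A @ W.T + b)
    return A

X = np.array([[0.8, 0.3, 0.5],
              [0.1, 0.9, 0.2],
              [0.6, 0.4, 0.7],
              [0.3, 0.8, 0.1]])

# Layer 1: the weights used throughout this section (3 inputs -> 2 outputs)
W1 = np.array([[0.2, -0.5, 0.1],
               [0.8,  0.3, -0.4]])
b1 = np.array([0.1, -0.1])

# Layer 2: hypothetical values for illustration only (2 inputs -> 1 output)
W2 = np.array([[0.5, 1.0]])
b2 = np.array([-0.2])

A1 = forward(X, [(W1, b1)])
A2 = forward(X, [(W1, b1), (W2, b2)])
print(A1.shape, A2.shape)   # (4, 2) (4, 1)
```

The output width of one layer (2) has to equal the input width of the next, which is why W2 has 2 columns.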

ReLU: The Most Popular Activation

The activation function we'll use is ReLU (Rectified Linear Unit): $\text{relu}(z) = \max(0, z)$. It's dead simple: pass positive values through, replace negatives with zero. Yet this tiny nonlinearity is what gives neural networks the power to learn complex patterns. Without it, stacking layers would collapse into a single linear transformation, and the network would be no more expressive than a single linear layer, the same model family as linear or logistic regression.

ReLU effectively decides which neurons "fire" (positive output) and which stay silent (zero). This selective activation creates the sparse, efficient representations that make deep learning work.
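
The collapse can be verified numerically. For two layers with no activation in between, $W_2(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1)\mathbf{x} + (W_2\mathbf{b}_1 + \mathbf{b}_2)$, which is a single linear layer. The sketch below uses this section's first-layer weights and a hypothetical second layer (W2, b2) whose values are chosen only for illustration.

```python
import numpy as np

X = np.array([[0.8, 0.3, 0.5],
              [0.1, 0.9, 0.2],
              [0.6, 0.4, 0.7],
              [0.3, 0.8, 0.1]])
W1 = np.array([[0.2, -0.5, 0.1],
               [0.8,  0.3, -0.4]])
b1 = np.array([0.1, -0.1])

# Hypothetical second layer, for illustration only
W2 = np.array([[0.5, 1.0]])
b2 = np.array([-0.2])

# Two linear layers with NO activation in between
Z_two = (X @ W1.T + b1) @ W2.T + b2

# The single equivalent layer
W_eq = W2 @ W1           # shape (1, 3)
b_eq = W2 @ b1 + b2      # shape (1,)
Z_one = X @ W_eq.T + b_eq

print(np.allclose(Z_two, Z_one))    # True: two layers collapsed into one

# With ReLU in between, the collapsed layer no longer reproduces the output
Z_relu = np.maximum(0, X @ W1.T + b1) @ W2.T + b2
print(np.allclose(Z_relu, Z_one))   # False
```

The second check fails because ReLU zeroes the negative pre-activations of samples 1 and 3, which no single weight matrix can reproduce for all inputs.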

The Full Pipeline with Numbers

Let's trace the complete forward pass with our 4-sample batch through one layer, watching every value transform step by step:

Python — The Complete Forward Pass (Linear + ReLU)

🐍forward_pass.py

1import numpy as np

NumPy for array operations: matrix multiply (@), element-wise maximum (np.maximum), comparison (!=), and summation (np.sum).

EXECUTION STATE

📚 numpy = Numerical computing library. Key functions used here: np.maximum() for element-wise max, np.sum() for counting, and the @ operator for matrix multiply.

3def relu(x) — ReLU activation function

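Defines the Rectified Linear Unit (ReLU) activation. ReLU is one of the most commonly used activations in deep learning because it is simple, fast, and helps mitigate the vanishing gradient problem: its gradient is exactly 1 for positive inputs, so it does not shrink as it passes backward through active neurons. It passes positive values unchanged and replaces negatives with zero.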

EXECUTION STATE

⬇ input: x = A NumPy array of any shape — could be a vector, matrix, or higher-dimensional tensor. Each element is processed independently.

→ ReLU formula = relu(x) = max(0, x) — if x > 0, output x; if x ≤ 0, output 0

→ Example = relu([0.5, -0.3, 0.0, 1.2]) = [0.5, 0.0, 0.0, 1.2]

⬆ returns = np.ndarray — same shape as input, with all negatives replaced by 0

4"""Activation: keep positives, zero out negatives."""

Docstring summarizing ReLU’s behavior. The word “activation” refers to its role in the neural network pipeline: it decides which neurons “fire” (positive output) and which stay silent (zero output).

5return np.maximum(0, x) — Element-wise max with 0

np.maximum(0, x) compares each element of x with 0 and keeps the larger value. This is a vectorized operation — no Python loops, runs in compiled C for speed.

EXECUTION STATE

📚 np.maximum(a, b) = Element-wise maximum of two arrays (or array and scalar). Compares each element independently: np.maximum(0, [-1, 2, -3]) = [0, 2, 0]. Different from np.max() which finds the single largest element.

⬇ arg 1: 0 = The threshold — any value below 0 becomes 0. This scalar is broadcast against every element in x.

⬇ arg 2: x = The input array. Each element is compared independently with 0.

⬆ return = Array where element[i] = max(0, x[i]). Positive values pass through unchanged, negatives become 0.

7# Full forward pass: input -> linear -> activate

The complete forward pass pipeline: (1) take input X, (2) apply linear transformation Z = X @ W.T + b, (3) apply activation A = relu(Z). This two-step process is what every layer in a neural network does.

8X = np.array([[0.8, 0.3, 0.5], ...]) — Input batch

Same 4-sample batch from the previous section. Each row is one input sample with 3 features.

EXECUTION STATE

⬇ X (4×3) =

       f0    f1    f2
s0  [0.80, 0.30, 0.50]
s1  [0.10, 0.90, 0.20]
s2  [0.60, 0.40, 0.70]
s3  [0.30, 0.80, 0.10]

13W = np.array([[0.2, -0.5, 0.1], [0.8, 0.3, -0.4]]) — Weights

Same weight matrix: 2 output neurons, 3 weights each.

EXECUTION STATE

⬇ W (2×3) =

         f0     f1     f2
n0  [ 0.20, -0.50,  0.10]
n1  [ 0.80,  0.30, -0.40]

15b = np.array([0.1, -0.1]) — Bias

One bias per output neuron, same as before.

EXECUTION STATE

⬇ b = [0.1, -0.1]

17# Step 1: Linear transformation

The first step of the forward pass: multiply the input by the weight matrix and add the bias. This computes the “pre-activation” values — the raw neuron outputs before the activation function decides which ones to keep.

18Z = X @ W.T + b — Pre-activation values

Computes Z = X(4×3) @ W.T(3×2) + b(2,). Z contains the raw outputs of each neuron before ReLU. Some values are negative — those neurons have evidence “against” their feature, and ReLU will silence them.

EXECUTION STATE

⬆ Z (4×2) — pre-activation =

       n0     n1
s0  [ 0.16,  0.43]
s1  [-0.31,  0.17]
s2  [ 0.09,  0.22]
s3  [-0.23,  0.34]

→ negative values = Z[1,0] = -0.31 and Z[3,0] = -0.23 — these two cells will become 0 after ReLU

19print("Pre-activation Z:\n", Z)

Displays the pre-activation matrix. Notice two negative values in column 0.

EXECUTION STATE

output =

Pre-activation Z:
 [[ 0.16  0.43]
 [-0.31  0.17]
 [ 0.09  0.22]
 [-0.23  0.34]]

21# Step 2: Activation

The second step: apply the nonlinear activation function. This is what gives neural networks the ability to learn nonlinear patterns. Without this step, stacking layers would collapse into a single linear transformation — deep networks would be no more powerful than shallow ones.

22A = relu(Z) — Post-activation values

Applies relu() to every element of Z. Positive values pass through unchanged; negative values become 0. Two cells change: Z[1,0] = -0.31 → 0, Z[3,0] = -0.23 → 0. The other 6 cells are positive, so they remain unchanged.

EXECUTION STATE

relu applied element-wise = relu(0.16)=0.16, relu(0.43)=0.43; relu(-0.31)=0.00, relu(0.17)=0.17; relu(0.09)=0.09, relu(0.22)=0.22; relu(-0.23)=0.00, relu(0.34)=0.34

⬆ A (4×2) — post-activation =

       n0     n1
s0  [ 0.16,  0.43]
s1  [ 0.00,  0.17]
s2  [ 0.09,  0.22]
s3  [ 0.00,  0.34]

→ what changed = Z[1,0]: -0.31 → 0.00 (neuron 0 silenced for sample 1); Z[3,0]: -0.23 → 0.00 (neuron 0 silenced for sample 3); all other cells: unchanged (already positive)

23print("Post-activation A:\n", A)

Displays the activated output. Compare with Z above — two cells flipped from negative to zero.

EXECUTION STATE

output =

Post-activation A:
 [[0.16 0.43]
 [0.   0.17]
 [0.09 0.22]
 [0.   0.34]]

25# What happened: negatives became 0

This line counts how many cells changed between Z and A. Any cell where Z was negative became 0 in A, so Z != A is True for those cells.

26print("Changed cells:", np.sum(Z != A), "out of", Z.size)

Z != A produces a boolean matrix (True where values differ). np.sum() counts the True values. Z.size is the total number of elements (4×2 = 8). Result: 2 out of 8 cells changed, so 25% of the activations in this batch were zeroed by ReLU (both belong to neuron 0).

EXECUTION STATE

📚 np.sum(boolean_array) = Counts True values. True is treated as 1, False as 0. np.sum([True, False, True]) = 2.

Z != A =

[[False, False],
 [ True, False],
 [False, False],
 [ True, False]] — True where ReLU changed the value

np.sum(Z != A) = 2 — two cells changed from negative to zero

📚 Z.size = Total number of elements in the array. For a (4,2) matrix: 4×2 = 8. Different from .shape which gives dimensions.

output = Changed cells: 2 out of 8

import numpy as np

def relu(x):
    """Activation: keep positives, zero out negatives."""
    return np.maximum(0, x)

# Full forward pass: input -> linear -> activate
X = np.array([[0.8, 0.3, 0.5],
              [0.1, 0.9, 0.2],
              [0.6, 0.4, 0.7],
              [0.3, 0.8, 0.1]])

W = np.array([[0.2, -0.5, 0.1],
              [0.8,  0.3, -0.4]])
b = np.array([0.1, -0.1])

# Step 1: Linear transformation
Z = X @ W.T + b
print("Pre-activation Z:\n", Z)

# Step 2: Activation
A = relu(Z)
print("Post-activation A:\n", A)

# What happened: negatives became 0
print("Changed cells:", np.sum(Z != A), "out of", Z.size)

In PyTorch, the forward pass is just two lines: layer(X) for the linear transform and F.relu(Z) for the activation. The F.relu function from torch.nn.functional is functionally identical to our NumPy relu(), but with one crucial difference: it supports autograd. During training, PyTorch automatically tracks that relu was applied and computes the correct gradients during backpropagation — the gradient is 1 where the input was positive and 0 where it was negative.

PyTorch — The Complete Forward Pass

🐍forward_pass_torch.py

1import torch

PyTorch core library for tensor creation and computation.

EXECUTION STATE

📚 torch = PyTorch core. Provides torch.tensor(), torch.no_grad(), and the @ operator.

2import torch.nn as nn

Neural network module containing layer types like nn.Linear.

EXECUTION STATE

📚 torch.nn = Neural network building blocks: nn.Linear, nn.ReLU, nn.Module, etc.

3import torch.nn.functional as F

Functional API for neural network operations. Contains stateless versions of activations (F.relu, F.softmax, F.sigmoid) and other operations. Unlike nn.ReLU() which is a module, F.relu() is a plain function — use it when you don’t need to store it as a layer.

EXECUTION STATE

📚 torch.nn.functional = Stateless functions: F.relu(x), F.softmax(x, dim), F.sigmoid(x), F.cross_entropy(pred, target), etc. Aliased as F by convention. These functions support autograd — gradients flow through them automatically during backpropagation.

5# Define a single layer

We create one fully-connected layer and set its weights to our known values for comparison with the NumPy version.

6layer = nn.Linear(3, 2) — Create layer (3 inputs → 2 outputs)

Creates a linear layer. The default bias=True is used, so both weight (2×3) and bias (2,) are created.

EXECUTION STATE

📚 nn.Linear(3, 2) = Fully-connected layer: 3 inputs, 2 outputs. Stores weight(2×3) and bias(2,). Forward: output = input @ W.T + b.

7with torch.no_grad(): — Disable gradient tracking

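We manually set weights inside this context manager so that autograd does not track the in-place copy_ calls (outside torch.no_grad(), in-place edits to a parameter that requires grad raise a RuntimeError).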

EXECUTION STATE

📚 torch.no_grad() = Context manager that disables autograd. Use for manual weight setting and inference.

8layer.weight.copy_(...) — Set weight matrix

Sets the weight matrix to our known values.

EXECUTION STATE

layer.weight after copy_ =

tensor([[ 0.2, -0.5,  0.1],
        [ 0.8,  0.3, -0.4]])

10layer.bias.copy_(...) — Set bias vector

Sets the bias to [0.1, -0.1].

EXECUTION STATE

layer.bias after copy_ = tensor([ 0.1, -0.1])

12X = torch.tensor([[0.8, 0.3, 0.5], ...]) — Batch input

Same 4-sample batch as the NumPy version.

EXECUTION STATE

⬇ X (4×3) =

tensor([[0.8, 0.3, 0.5],
        [0.1, 0.9, 0.2],
        [0.6, 0.4, 0.7],
        [0.3, 0.8, 0.1]])

17# The complete forward pass in 2 lines

The entire forward pass — linear transformation plus activation — takes just 2 lines in PyTorch. This is what every neural network layer does: transform, then activate.

18Z = layer(X) — Linear transformation (pre-activation)

Computes X @ W.T + b for all 4 samples at once. Z contains the raw pre-activation values — before the nonlinearity.

EXECUTION STATE

📚 layer(X) = Calls the layer’s forward() method: F.linear(X, weight, bias) = X @ weight.T + bias. Handles any batch size.

⬆ Z (4×2) =

tensor([[ 0.16,  0.43],
        [-0.31,  0.17],
        [ 0.09,  0.22],
        [-0.23,  0.34]])

19A = F.relu(Z) — ReLU activation (post-activation)

F.relu() is PyTorch’s ReLU function from torch.nn.functional. It applies max(0, x) element-wise, just like our NumPy relu() function. The key difference: F.relu() supports autograd, so during training, gradients automatically flow through it — PyTorch knows that relu’s gradient is 1 for positive inputs and 0 for negative inputs.

EXECUTION STATE

📚 F.relu(input) = Element-wise ReLU: max(0, x). Part of torch.nn.functional. Supports autograd — gradient is 1 where input > 0, and 0 where input ≤ 0. Equivalent to torch.clamp(input, min=0).

⬇ arg: Z (4×2) = The pre-activation tensor. Contains 2 negative values: Z[1,0]=-0.31 and Z[3,0]=-0.23.

⬆ A (4×2) =

tensor([[0.16, 0.43],
        [0.00, 0.17],
        [0.09, 0.22],
        [0.00, 0.34]])

→ F.relu vs nn.ReLU = F.relu(x) is a function — call it directly. nn.ReLU() is a module — instantiate it first, then call. Use F.relu for one-off calls, nn.ReLU() when building nn.Sequential layers.

→ autograd = During training, PyTorch records this operation. On backward pass, gradient flows through: dA/dZ = 1 where Z > 0, dA/dZ = 0 where Z ≤ 0. This is why ReLU can cause 'dead neurons' — if Z is always negative, the gradient is always 0.

20print("Z:\n", Z)

Displays pre-activation values with grad_fn attached (PyTorch tracks operations for backprop).

EXECUTION STATE

output =

Z:
 tensor([[ 0.1600,  0.4300],
        [-0.3100,  0.1700],
        [ 0.0900,  0.2200],
        [-0.2300,  0.3400]], grad_fn=<AddmmBackward0>)

21print("A:\n", A)

Displays post-activation values. Two cells changed from negative to zero. The grad_fn shows ReluBackward0 — PyTorch remembers this operation for backpropagation.

EXECUTION STATE

output =

A:
 tensor([[0.1600, 0.4300],
        [0.0000, 0.1700],
        [0.0900, 0.2200],
        [0.0000, 0.3400]], grad_fn=<ReluBackward0>)

import torch
import torch.nn as nn
import torch.nn.functional as F

# Define a single layer
layer = nn.Linear(3, 2)
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[0.2, -0.5, 0.1],
                                     [0.8,  0.3, -0.4]]))
    layer.bias.copy_(torch.tensor([0.1, -0.1]))

X = torch.tensor([[0.8, 0.3, 0.5],
                  [0.1, 0.9, 0.2],
                  [0.6, 0.4, 0.7],
                  [0.3, 0.8, 0.1]])

# The complete forward pass in 2 lines
Z = layer(X)
A = F.relu(Z)
print("Z:\n", Z)
print("A:\n", A)
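
The ReLU gradient rule that autograd applies (1 where the input is positive, 0 elsewhere) is simple enough to write by hand. The NumPy sketch below is only an illustration of that rule, not PyTorch's implementation. The upstream gradient dA is hypothetical: all ones, as it would be if the loss were simply the sum of A.

```python
import numpy as np

# Pre-activation values from the forward pass above
Z = np.array([[ 0.16, 0.43],
              [-0.31, 0.17],
              [ 0.09, 0.22],
              [-0.23, 0.34]])

# Local derivative of ReLU: 1 where Z > 0, 0 elsewhere
relu_grad = (Z > 0).astype(Z.dtype)

# Hypothetical upstream gradient dL/dA (all ones, e.g. L = A.sum())
dA = np.ones_like(Z)

# Chain rule, applied element-wise: dL/dZ = dL/dA * dA/dZ
dZ = dA * relu_grad
print(dZ)
# [[1. 1.]
#  [0. 1.]
#  [1. 1.]
#  [0. 1.]]
```

The zeros sit exactly where ReLU silenced neuron 0 (samples 1 and 3), so no gradient reaches that neuron's weights from those two samples.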

Element-wise Operations, Broadcasting, and Reshape

Beyond matrix multiplication, neural networks rely heavily on three more operations. These trip up beginners but are essential to understand: element-wise operations, broadcasting, and reshape. Once you see how they work, many "mysterious" lines of neural network code will suddenly make perfect sense.

Element-wise Operations

An element-wise operation applies a function to each element independently. If you have a matrix Z of shape $(4, 2)$ , an element-wise operation produces an output of the same shape $(4, 2)$ , where output position $[i, j]$ depends only on input position $[i, j]$ . There is no mixing across positions — completely unlike matrix multiplication, where each output combines an entire row with an entire column.

The most common element-wise operations in neural networks are:

ReLU activation: $\max(0, z)$ — keep positives, zero out negatives. Applied after every linear layer.
Sigmoid: $\frac{1}{1 + e^{-z}}$ — squash values to the range $(0, 1)$ . Used for binary classification outputs.
Element-wise addition: Adding two same-shape matrices — position $[i, j]$ in matrix A is added to position $[i, j]$ in matrix B.

Key Insight: Element-wise means position $[i, j]$ only cares about $[i, j]$ in the inputs. No mixing across positions. Matrix multiplication mixes across an entire row and column — that's what makes it powerful but computationally expensive.
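
The difference between "no mixing" and "mixing" can be observed by perturbing one input cell and checking which output cells move. This is a small sketch using this section's values; sigmoid is defined by hand from the formula above, and the names Z2, X2, changed_elementwise, and changed_matmul are introduced here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Element-wise 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

Z = np.array([[ 0.16, 0.43],
              [-0.31, 0.17],
              [ 0.09, 0.22],
              [-0.23, 0.34]])

# Element-wise: change one input cell, see which output cells change
Z2 = Z.copy()
Z2[1, 0] += 1.0
changed_elementwise = ~np.isclose(sigmoid(Z2), sigmoid(Z))
print(np.argwhere(changed_elementwise))   # [[1 0]]

# Matrix multiply: change one input cell, the whole output row changes
X = np.array([[0.8, 0.3, 0.5],
              [0.1, 0.9, 0.2],
              [0.6, 0.4, 0.7],
              [0.3, 0.8, 0.1]])
W = np.array([[0.2, -0.5, 0.1],
              [0.8,  0.3, -0.4]])
X2 = X.copy()
X2[1, 0] += 1.0
changed_matmul = ~np.isclose(X2 @ W.T, X @ W.T)
print(np.argwhere(changed_matmul))        # [[1 0], [1 1]] as two rows
```

In the element-wise case only position [1, 0] moves. In the matrix product, feature 0 of sample 1 feeds both neurons, so both outputs in row 1 move.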

Broadcasting

Broadcasting is NumPy and PyTorch's mechanism for combining arrays of different shapes. Instead of requiring both arrays to be the same shape, the smaller array is "stretched" to match the larger one. The rule: align shapes from the right, and each dimension must either be equal or one of them must be 1 (or missing).

The most common use in neural networks: adding a bias vector of shape $(2{,})$ to a batch result of shape $(4, 2)$ . The bias is broadcast across all 4 rows — every sample gets the same bias added.

Operation	Shape A	Shape B	Result Shape	What Happens
A + B	(4, 2)	(4, 2)	(4, 2)	Same shape — element-wise addition, no broadcasting needed
Z + bias	(4, 2)	(2,)	(4, 2)	Bias (2,) stretched across 4 rows
X + col	(3, 3)	(3, 1)	(3, 3)	Column vector (3,1) stretched across 3 columns
a + b	(1,)	(4, 2)	(4, 2)	Scalar-like (1,) stretched to match entire matrix
FAIL	(4, 2)	(3,)	ERROR	Last dimensions 2 ≠ 3 — cannot broadcast
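
Each row of the table can be checked by adding zero-filled arrays of the given shapes and reading the shape of the result. This is a minimal sketch: the cases list simply mirrors the table, and the incompatible pair is expected to raise a ValueError in NumPy.

```python
import numpy as np

# (shape A, shape B) pairs taken from the table above
cases = [
    ((4, 2), (4, 2)),
    ((4, 2), (2,)),
    ((3, 3), (3, 1)),
    ((1,),   (4, 2)),
    ((4, 2), (3,)),    # last dimensions 2 and 3 are incompatible
]

results = []
for shape_a, shape_b in cases:
    a, b = np.zeros(shape_a), np.zeros(shape_b)
    try:
        results.append((a + b).shape)
    except ValueError:
        # NumPy refuses to broadcast incompatible shapes
        results.append("ERROR")
    print(shape_a, "+", shape_b, "->", results[-1])
```

The first four cases print the result shapes listed in the table, and the last one prints ERROR.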

Reshape

Reshaping changes an array's shape without changing its data. The same numbers are simply reinterpreted in a new arrangement. This is critical in neural networks: for example, flattening a 2D image into a 1D vector for a fully connected layer. A 28×28 image ($784$ pixels) becomes a vector of length $784$.

The -1 trick: passing -1 as a dimension tells NumPy or PyTorch to "figure it out automatically." If you have 12 elements and reshape to $(3, -1)$ , the -1 becomes 4 because $12 / 3 = 4$ . The most common use is reshape(-1), which flattens any array to 1-D.
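
The -1 rule and the 28×28 flattening example can be tried out in a few lines. This sketch uses zero-filled arrays as stand-ins for real images; the batch of 4 images at the end is an added illustration of how the batch dimension is usually kept while the rest is flattened.

```python
import numpy as np

a = np.arange(12)                    # 12 elements: 0, 1, ..., 11
m = a.reshape(3, -1)                 # -1 is inferred as 12 / 3 = 4
print(m.shape)                       # (3, 4)

flat = m.reshape(-1)                 # flatten back to 1-D
print(flat.shape)                    # (12,)

image = np.zeros((28, 28))           # stand-in for a 28x28 grayscale image
vector = image.reshape(-1)
print(vector.shape)                  # (784,)

# A batch keeps its first dimension: (batch, 28, 28) -> (batch, 784)
batch = np.zeros((4, 28, 28))
flat_batch = batch.reshape(4, -1)
print(flat_batch.shape)              # (4, 784)
```

Only one dimension may be -1 in a reshape call, because NumPy can infer at most one unknown size from the element count.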

Python — Element-wise, Broadcasting, and Reshape

🐍element_ops.py

1import numpy as np

NumPy for array creation, element-wise operations (np.maximum), broadcasting (automatic shape alignment), and reshaping (.reshape). These three operations are used in every neural network alongside matrix multiplication.

EXECUTION STATE

📚 numpy = Numerical computing library. Key functions used here: np.maximum() for element-wise max, np.array() for array creation, np.arange() for ranges, and .reshape() for changing array dimensions.

3# ── Element-wise Operations ──

Element-wise operations apply a function to each array element independently. Position [i,j] in the output depends ONLY on position [i,j] in the input — no mixing across positions. This is fundamentally different from matrix multiplication, where each output element depends on an entire row and column.

4# Each element is processed independently

The key insight: in element-wise ops, each cell is computed in isolation. Z[0,0] only affects output[0,0]. Z[2,1] only affects output[2,1]. There is no interaction between different positions — unlike matrix multiply where every output element mixes multiple inputs.

5Z = np.array([[ 0.16, 0.43], ...]) — Pre-activation matrix

This is the same pre-activation matrix Z from our forward pass example. These are the raw neuron outputs BEFORE applying an activation function like ReLU. Some values are positive, some negative — the activation function will decide what to keep.

EXECUTION STATE

⬇ Z (4×2) =

         n0     n1
s0  [ 0.16,  0.43]
s1  [-0.31,  0.17]
s2  [ 0.09,  0.22]
s3  [-0.23,  0.34]

→ Z.shape = (4, 2) — 4 samples, 2 neurons. 8 total elements, each will be processed independently by ReLU.

11# ReLU: max(0, x) applied to EACH element

ReLU is the most common element-wise operation in neural networks. It is applied after the linear transformation (Z = X @ W.T + b) to introduce nonlinearity. Without activation functions, stacking multiple layers would collapse into a single linear transformation.

12A = np.maximum(0, Z) — Element-wise ReLU

np.maximum(0, Z) compares each of the 8 elements in Z with 0, keeping the larger value. Positive values pass through unchanged; negative values become 0. This is ReLU — the Rectified Linear Unit activation.

EXECUTION STATE

📚 np.maximum(a, b) = Element-wise maximum of two arrays (or array and scalar). Compares each position independently. Different from np.max() which finds the single largest element across an array.

⬇ arg 1: 0 = The threshold scalar. Broadcast against every element of Z. Any Z value below 0 becomes 0.

⬇ arg 2: Z (4×2) = The pre-activation matrix. Contains 2 negative values: Z[1,0]=-0.31 and Z[3,0]=-0.23.

── Row 0 (sample 0) ── =

max(0, 0.16) = 0.16 — positive, passes through unchanged

max(0, 0.43) = 0.43 — positive, passes through unchanged

── Row 1 (sample 1) ── =

max(0, -0.31) = 0.00 — NEGATIVE, clamped to zero ← neuron 0 is inactive (outputs 0) for this sample

max(0, 0.17) = 0.17 — positive, passes through unchanged

── Row 2 (sample 2) ── =

max(0, 0.09) = 0.09 — positive, passes through unchanged

max(0, 0.22) = 0.22 — positive, passes through unchanged

── Row 3 (sample 3) ── =

max(0, -0.23) = 0.00 — NEGATIVE, clamped to zero ← neuron 0 is inactive (outputs 0) for this sample

max(0, 0.34) = 0.34 — positive, passes through unchanged

⬆ A (4×2) =

         n0     n1
s0  [0.16,  0.43]
s1  [0.00,  0.17]
s2  [0.09,  0.22]
s3  [0.00,  0.34]

13print("Z:\n", Z)

Displays the original pre-activation matrix so we can compare with the ReLU output. Two entries are negative: Z[1,0]=-0.31 and Z[3,0]=-0.23.

EXECUTION STATE

output =

Z:
 [[ 0.16  0.43]
 [-0.31  0.17]
 [ 0.09  0.22]
 [-0.23  0.34]]

14print("ReLU(Z):\n", A)

Displays the post-activation matrix. The two negative values have become 0.00, while the six positive values are unchanged. This is the element-wise nature of ReLU — each position is handled independently.

EXECUTION STATE

output =

ReLU(Z):
 [[0.16 0.43]
 [0.   0.17]
 [0.09 0.22]
 [0.   0.34]]

16# ── Broadcasting ──

Broadcasting is NumPy’s mechanism for combining arrays of different shapes. Instead of requiring both arrays to be the same shape, NumPy automatically "stretches" the smaller array to match. The rule: dimensions must be equal, or one of them must be 1 (or missing).

17# Adding bias (2,) to batch (4, 2)

This is the most common broadcasting scenario in neural networks: adding a bias vector to a batch of outputs. The bias has shape (2,) — one value per neuron. The batch has shape (4, 2) — 4 samples, 2 neurons each. Broadcasting stretches the bias across all 4 rows.

18bias = np.array([0.1, -0.1]) — Bias vector (shape (2,))

A 1-D bias vector with 2 elements — one per output neuron. Neuron 0 gets bias +0.1 (shifted up), neuron 1 gets bias -0.1 (shifted down).

EXECUTION STATE

⬇ bias = [0.1, -0.1]

bias.shape = (2,) — a 1-D array with 2 elements. When added to a (4, 2) matrix, NumPy broadcasts this across 4 rows.

19result = Z + bias — Broadcasting in action

NumPy adds bias (2,) to Z (4, 2). Broadcasting rule: align shapes from the right. Z is (4, 2) and bias is (2,). The last dimension matches (2 == 2). Bias is missing the first dimension, so it is conceptually "stretched" to (4, 2) by repeating the same [0.1, -0.1] for every row. Each sample gets the SAME bias added to its output.

EXECUTION STATE

📚 Broadcasting rule = Align shapes right-to-left. Each dimension must be: (a) equal, or (b) one of them is 1 (or missing). The size-1 dimension is "stretched" to match. Z(4,2) + bias(2,) → bias treated as (1,2) → stretched to (4,2).

⬇ Z (4×2) =

         n0     n1
s0  [ 0.16,  0.43]
s1  [-0.31,  0.17]
s2  [ 0.09,  0.22]
s3  [-0.23,  0.34]

⬇ bias (stretched to 4×2) =

         n0     n1
s0  [ 0.10, -0.10]  ← same row
s1  [ 0.10, -0.10]  ← copied
s2  [ 0.10, -0.10]  ← copied
s3  [ 0.10, -0.10]  ← copied

── Row 0 (sample 0) ── =

Z[0,0] + bias[0] = 0.16 + 0.10 = 0.26

Z[0,1] + bias[1] = 0.43 + (-0.10) = 0.33

── Row 1 (sample 1) ── =

Z[1,0] + bias[0] = -0.31 + 0.10 = -0.21

Z[1,1] + bias[1] = 0.17 + (-0.10) = 0.07

── Row 2 (sample 2) ── =

Z[2,0] + bias[0] = 0.09 + 0.10 = 0.19

Z[2,1] + bias[1] = 0.22 + (-0.10) = 0.12

── Row 3 (sample 3) ── =

Z[3,0] + bias[0] = -0.23 + 0.10 = -0.13

Z[3,1] + bias[1] = 0.34 + (-0.10) = 0.24

⬆ result (4×2) =

         n0     n1
s0  [ 0.26,  0.33]
s1  [-0.21,  0.07]
s2  [ 0.19,  0.12]
s3  [-0.13,  0.24]

20print("Z + bias:\n", result)

Displays the broadcasted addition result. Every entry in column 0 increased by 0.1 (bias[0]); every entry in column 1 decreased by 0.1 (bias[1]).

EXECUTION STATE

output =

Z + bias:
 [[ 0.26  0.33]
 [-0.21  0.07]
 [ 0.19  0.12]
 [-0.13  0.24]]

21print("bias shape:", bias.shape, " Z shape:", Z.shape)

Shows the shapes to confirm the broadcasting worked: bias (2,) was broadcast across all 4 rows of Z (4, 2).

EXECUTION STATE

output = bias shape: (2,) Z shape: (4, 2)
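The "stretching" described above can be made explicit. The following sketch, which goes beyond the original listing, uses np.broadcast_to to build the stretched bias, checks that it gives the same sum, and shows what happens when shapes do not line up. The names bad_bias and per_sample are made up for the example.

```python
import numpy as np

Z = np.array([[ 0.16,  0.43],
              [-0.31,  0.17],
              [ 0.09,  0.22],
              [-0.23,  0.34]])
bias = np.array([0.1, -0.1])

# Explicit version of what broadcasting does implicitly:
# a read-only (4, 2) view in which every row is the same bias.
bias_stretched = np.broadcast_to(bias, Z.shape)
matches = np.array_equal(Z + bias, Z + bias_stretched)

# Ask NumPy for the broadcast result shape without doing any arithmetic.
result_shape = np.broadcast_shapes(Z.shape, bias.shape)

# A shape-(4,) vector fails: aligned from the right, 4 does not match 2.
bad_bias = np.array([0.1, 0.2, 0.3, 0.4])
try:
    Z + bad_bias
    failed = False
except ValueError:
    failed = True

# Adding a trailing axis gives shape (4, 1), which does broadcast:
# each row (sample) now receives its own offset.
per_sample = Z + bad_bias[:, np.newaxis]

print("explicit == implicit:", matches)
print("broadcast shape:", result_shape)
print("(4,) against (4, 2) raised ValueError:", failed)
print("per-sample offset shape:", per_sample.shape)
```

The last case is a reminder that the shape of the smaller array decides which axis gets repeated: (2,) repeats across rows, while (4, 1) repeats across columns.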

23# ── Reshape ──

Reshaping changes an array’s dimensions without changing its data. The same numbers are rearranged into a new shape. This is critical for neural networks — for example, flattening a 2D image grid into a 1D vector so it can be fed into a fully connected layer.

24# Flatten a 2x3 "image" into a 6-element vector

A common neural network operation: a convolutional layer outputs a 2D feature map, but the fully connected layer expects a 1D vector. Reshape (flattening) bridges this gap without losing any information: the same 6 numbers, just rearranged from 2 rows of 3 into a single 1-D vector of 6.

25image = np.array([[0.1, 0.5, 0.9], [0.2, 0.4, 0.8]]) — 2×3 image

A small 2×3 "image" with 6 pixel values. Think of this as 2 rows and 3 columns of pixel intensities. In a real network, this could be a feature map from a convolutional layer.

EXECUTION STATE

⬇ image (2×3) =

     c0   c1   c2
r0  [0.1, 0.5, 0.9]
r1  [0.2, 0.4, 0.8]

image.shape = (2, 3) — 2 rows, 3 columns, 6 total elements

27flat = image.reshape(-1) — Flatten to 1-D

Reshapes the 2×3 matrix into a 1-D vector with 6 elements. The -1 argument means "figure out this dimension automatically" — NumPy computes 2×3 = 6 elements, so -1 becomes 6. Elements are read in row-major (C) order: row 0 first, then row 1.

EXECUTION STATE

📚 .reshape(new_shape) = Returns the same data with a different shape: a view of the original array when possible, otherwise a copy. The total number of elements must stay the same. Elements are read row-by-row (row-major/C order) by default.

⬇ arg: -1 = A wildcard dimension. NumPy calculates it: total_elements / product_of_other_dims. Here: 6 / (nothing else) = 6. So reshape(-1) = reshape(6) = flatten to 1-D.

→ Reading order = Row 0: [0.1, 0.5, 0.9] → positions 0, 1, 2 Row 1: [0.2, 0.4, 0.8] → positions 3, 4, 5 Flat: [0.1, 0.5, 0.9, 0.2, 0.4, 0.8]

⬆ flat = [0.1, 0.5, 0.9, 0.2, 0.4, 0.8]

flat.shape = (6,) — 1-D vector with 6 elements. Same data, different arrangement.

28print("Image shape:", image.shape, "-> Flat shape:", flat.shape)

Confirms the shape change from 2-D to 1-D while preserving the same 6 elements.

EXECUTION STATE

output = Image shape: (2, 3) -> Flat shape: (6,)

29print("Flat:", flat)

Shows the flattened array. Elements appear in row-major order: row 0 elements first (0.1, 0.5, 0.9), then row 1 (0.2, 0.4, 0.8).

EXECUTION STATE

output = Flat: [0.1 0.5 0.9 0.2 0.4 0.8]
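To make the row-major reading order and the view behavior concrete, here is a small sketch that is not part of the original listing. It compares the default order with column-major order and checks whether the flattened result shares memory with the image. The variable names are illustrative.

```python
import numpy as np

image = np.array([[0.1, 0.5, 0.9],
                  [0.2, 0.4, 0.8]])

# Default (C / row-major) order: row 0 first, then row 1.
row_major = image.reshape(-1)

# Column-major (Fortran) order for comparison: column 0 first, then column 1, ...
col_major = image.reshape(-1, order="F")

# For a contiguous array like this one, reshape(-1) returns a view.
shares_view = np.shares_memory(image, row_major)

# .flatten() always returns a copy, so it never shares memory.
shares_copy = np.shares_memory(image, image.flatten())

print("row-major:   ", row_major)
print("column-major:", col_major)
print("reshape shares memory:", shares_view)
print("flatten shares memory:", shares_copy)
```

Because row_major is a view here, writing to it would also change image; use .flatten() or .copy() when an independent array is needed.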

31# Organize 12 values into a (3, 4) matrix

Reshape also works in reverse — taking a flat sequence and organizing it into a matrix. This is useful when data arrives as a 1-D stream but needs to be processed as a structured grid.

32data = np.arange(12) — Create [0, 1, 2, ..., 11]

np.arange(12) creates a 1-D array of 12 integers from 0 to 11. This is analogous to Python’s range(12) but returns a NumPy array instead of a Python range object.

EXECUTION STATE

📚 np.arange(n) = Creates a 1-D array of n consecutive integers starting from 0: [0, 1, 2, ..., n-1]. Like Python range() but returns an ndarray. Also accepts (start, stop, step).

⬇ arg: 12 = Number of elements to generate: 0, 1, 2, ..., 11.

⬆ data = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

data.shape = (12,) — 1-D array with 12 elements

33matrix = data.reshape(3, 4) — Reshape to 3 rows × 4 columns

Takes the 12-element flat array and arranges it into a 3×4 matrix. The first 4 elements fill row 0, the next 4 fill row 1, the last 4 fill row 2. Total elements: 3 × 4 = 12, matching the original.

EXECUTION STATE

📚 .reshape(rows, cols) = Rearranges elements into the given shape. Total elements must match: rows × cols must equal the original array size. Elements fill in row-major order (left to right, top to bottom).

⬇ arg 1: 3 = Number of rows in the output matrix.

⬇ arg 2: 4 = Number of columns in the output matrix. 3 × 4 = 12 elements, matching data.size.

→ Row filling order = Row 0: data[0:4] = [0, 1, 2, 3] Row 1: data[4:8] = [4, 5, 6, 7] Row 2: data[8:12] = [8, 9, 10, 11]

⬆ matrix (3×4) =

     c0  c1  c2  c3
r0  [ 0,  1,  2,  3]
r1  [ 4,  5,  6,  7]
r2  [ 8,  9, 10, 11]

34print("Reshaped (3x4):\n", matrix)

Displays the reshaped matrix. The same 12 numbers from [0..11], now organized as 3 rows of 4.

EXECUTION STATE

output =

Reshaped (3x4):
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
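The -1 wildcard also works alongside other dimensions, and reshape refuses shapes whose element count does not match. The sketch below illustrates both with the same 12-element array; the variable names are chosen for the example.

```python
import numpy as np

data = np.arange(12)

# -1 can stand in for any single dimension; NumPy infers it from the total size.
rows3 = data.reshape(3, -1)    # 12 / 3 = 4 columns
cols6 = data.reshape(-1, 6)    # 12 / 6 = 2 rows

# Reshaping to (3, 4) and flattening again recovers the original order.
round_trip = data.reshape(3, 4).reshape(-1)
recovered = np.array_equal(round_trip, data)

# 12 elements cannot be split into 5 equal rows, so this raises ValueError.
try:
    data.reshape(5, -1)
    failed = False
except ValueError:
    failed = True

print("reshape(3, -1) ->", rows3.shape)
print("reshape(-1, 6) ->", cols6.shape)
print("round trip recovers data:", recovered)
print("reshape(5, -1) raised ValueError:", failed)
```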

```python
import numpy as np

# ── Element-wise Operations ──
# Each element is processed independently
Z = np.array([[ 0.16,  0.43],
              [-0.31,  0.17],
              [ 0.09,  0.22],
              [-0.23,  0.34]])

# ReLU: max(0, x) applied to EACH element
A = np.maximum(0, Z)
print("Z:\n", Z)
print("ReLU(Z):\n", A)

# ── Broadcasting ──
# Adding bias (2,) to batch (4, 2)
bias = np.array([0.1, -0.1])
result = Z + bias
print("Z + bias:\n", result)
print("bias shape:", bias.shape, "  Z shape:", Z.shape)

# ── Reshape ──
# Flatten a 2x3 "image" into a 6-element vector
image = np.array([[0.1, 0.5, 0.9],
                  [0.2, 0.4, 0.8]])
flat = image.reshape(-1)
print("Image shape:", image.shape, "-> Flat shape:", flat.shape)
print("Flat:", flat)

# Organize 12 values into a (3, 4) matrix
data = np.arange(12)
matrix = data.reshape(3, 4)
print("Reshaped (3x4):\n", matrix)
```

In PyTorch, element-wise operations and broadcasting follow the same rules as NumPy. The main difference is in reshaping: PyTorch offers both .view() and .reshape(). The .view() method never copies: it returns a tensor that shares storage with the original (modifying the view modifies the original) and raises an error when the requested shape is incompatible with the tensor's memory layout, as can happen after a transpose. The .reshape() method succeeds whenever the element counts match; it returns a view when it can and a copy when it must. Freshly created tensors are contiguous, so .view() works on them directly; .reshape() is the safer default when the layout is uncertain.
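The difference between the two methods only shows up on a non-contiguous tensor. The following sketch, with illustrative variable names, transposes a small tensor to make it non-contiguous and then tries both methods.

```python
import torch

t = torch.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]], contiguous
tt = t.t()                          # transpose: shape (3, 2), same storage, new strides

contig_before = t.is_contiguous()
contig_after = tt.is_contiguous()

# .view() on a contiguous tensor shares storage with the original.
v = t.view(-1)
view_shares = v.data_ptr() == t.data_ptr()

# .view() on the transposed tensor cannot be expressed without a copy, so it fails.
try:
    tt.view(-1)
    view_failed = False
except RuntimeError:
    view_failed = True

# .reshape() handles the same case by copying into new storage.
flat = tt.reshape(-1)
reshape_shares = flat.data_ptr() == t.data_ptr()

print("contiguous before / after transpose:", contig_before, contig_after)
print("view on transposed tensor failed:", view_failed)
print("reshape result:", flat.tolist())
print("reshape shares storage:", reshape_shares)
```

Calling tt.contiguous().view(-1) is another way to get the same flattened result, at the cost of the same copy.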

PyTorch — Element-wise, Broadcasting, and Reshape

🐍element_ops_torch.py

1import torch

PyTorch core library for tensor creation, arithmetic, and autograd.

EXECUTION STATE

📚 torch = Core PyTorch. Provides torch.tensor(), torch.arange(), and the + operator for element-wise addition with broadcasting.

2import torch.nn.functional as F

Functional API for neural network operations. F.relu() is the stateless activation function we use here.

EXECUTION STATE

📚 torch.nn.functional = Stateless functions: F.relu(x), F.softmax(x, dim), F.sigmoid(x), etc. Aliased as F. All support autograd — gradients flow through them during backpropagation.

4Z = torch.tensor([[ 0.16, 0.43], ...]) — Pre-activation tensor

Creates the same 4×2 pre-activation matrix as the NumPy version, but as a PyTorch tensor. The values are identical — we can directly compare NumPy and PyTorch results.

EXECUTION STATE

⬇ Z (4×2) =

tensor([[ 0.16,  0.43],
        [-0.31,  0.17],
        [ 0.09,  0.22],
        [-0.23,  0.34]])

9# Element-wise ReLU

F.relu() applies max(0, x) to each element independently, just like np.maximum(0, Z). The difference: when the input tensor tracks gradients (requires_grad=True), F.relu() records the operation for autograd, enabling automatic gradient computation during training.

10A = F.relu(Z) — Element-wise ReLU activation

Applies ReLU element-wise: max(0, x) for each of the 8 elements. Identical result to np.maximum(0, Z), but with autograd support. The gradient is 1 where Z > 0 and 0 where Z ≤ 0.

EXECUTION STATE

📚 F.relu(input) = Element-wise ReLU from torch.nn.functional: max(0, x). Supports autograd — gradient is 1 for positive inputs, 0 for negative. Equivalent to torch.clamp(input, min=0).

⬇ arg: Z (4×2) = The input tensor. 2 negative values: Z[1,0]=-0.31 and Z[3,0]=-0.23. These become 0.

⬆ A (4×2) =

tensor([[0.16, 0.43],
        [0.00, 0.17],
        [0.09, 0.22],
        [0.00, 0.34]])

→ F.relu vs np.maximum = F.relu(Z): PyTorch function, supports autograd, works on tensors. np.maximum(0, Z): NumPy function, no autograd, works on ndarrays. Both compute max(0, x) element-wise — same math, different frameworks.

11print("ReLU(Z):\n", A)

Displays the post-activation tensor. The values match the NumPy output; only the print formatting differs (PyTorch shows four decimal places).

EXECUTION STATE

output =

ReLU(Z):
 tensor([[0.1600, 0.4300],
        [0.0000, 0.1700],
        [0.0900, 0.2200],
        [0.0000, 0.3400]])
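The gradient pattern described for F.relu (1 where Z > 0, 0 elsewhere) can be checked directly. This sketch differs from the listing in one respect, which is an assumption added for the demonstration: Z is created with requires_grad=True so that autograd tracks it.

```python
import torch
import torch.nn.functional as F

# Same values as the walkthrough, but with gradient tracking switched on
# (requires_grad=True is an addition for this demonstration).
Z = torch.tensor([[ 0.16,  0.43],
                  [-0.31,  0.17],
                  [ 0.09,  0.22],
                  [-0.23,  0.34]], requires_grad=True)

A = F.relu(Z)

# Summing gives a scalar whose gradient with respect to each A[i, j] is 1,
# so Z.grad ends up holding exactly the derivative of ReLU at each position.
A.sum().backward()

relu_grad = Z.grad
print("dReLU/dZ:\n", relu_grad)
```

The zeros in relu_grad sit at the two positions where Z was negative, which is why those entries receive no weight update signal through this activation.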

13# Broadcasting (works identically to NumPy)

PyTorch uses the same broadcasting rules as NumPy. When shapes don’t match, the smaller tensor is "stretched" to match the larger one. The rules are identical: align dimensions from the right, each must be equal or one must be 1 (or missing).

14bias = torch.tensor([0.1, -0.1]) — Bias tensor (shape (2,))

A 1-D bias tensor with 2 elements. When added to the (4, 2) matrix Z, PyTorch broadcasts it across all 4 rows — same behavior as NumPy.

EXECUTION STATE

⬇ bias = tensor([0.1, -0.1])

bias.shape = torch.Size([2]) — 1-D tensor with 2 elements

15result = Z + bias — Broadcasted addition

PyTorch broadcasts bias (2,) across all 4 rows of Z (4, 2). Each row gets [0.1, -0.1] added. The result is identical to the NumPy version.

EXECUTION STATE

→ Broadcasting = Z(4,2) + bias(2,) → bias treated as (1,2) → stretched to (4,2)

── Row 0 ── =

[0.16+0.1, 0.43-0.1] = [0.26, 0.33]

── Row 1 ── =

[-0.31+0.1, 0.17-0.1] = [-0.21, 0.07]

── Row 2 ── =

[0.09+0.1, 0.22-0.1] = [0.19, 0.12]

── Row 3 ── =

[-0.23+0.1, 0.34-0.1] = [-0.13, 0.24]

⬆ result (4×2) =

tensor([[ 0.26,  0.33],
        [-0.21,  0.07],
        [ 0.19,  0.12],
        [-0.13,  0.24]])

16print("Z + bias:\n", result)

Displays the broadcasted addition. Every entry in column 0 is shifted by +0.1, and every entry in column 1 is shifted by -0.1.

EXECUTION STATE

output =

Z + bias:
 tensor([[ 0.2600,  0.3300],
        [-0.2100,  0.0700],
        [ 0.1900,  0.1200],
        [-0.1300,  0.2400]])
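The "treated as (1,2), stretched to (4,2)" description maps onto two PyTorch calls, unsqueeze and expand. The sketch below spells that out and checks that the explicit version matches the implicit one. It is an illustration with made-up variable names, not a description of how PyTorch implements broadcasting internally.

```python
import torch

Z = torch.tensor([[ 0.16,  0.43],
                  [-0.31,  0.17],
                  [ 0.09,  0.22],
                  [-0.23,  0.34]])
bias = torch.tensor([0.1, -0.1])

# Step 1: (2,) -> (1, 2) by adding a leading axis.
bias_row = bias.unsqueeze(0)

# Step 2: (1, 2) -> (4, 2) by repeating the row without copying data.
bias_stretched = bias_row.expand(4, 2)

same = torch.equal(Z + bias, Z + bias_stretched)

# The expanded dimension has stride 0: all 4 rows point at the same memory.
row_stride = bias_stretched.stride()[0]

# Result shape of broadcasting the two shapes, computed without arithmetic.
out_shape = tuple(torch.broadcast_shapes(Z.shape, bias.shape))

print("explicit == implicit:", same)
print("stride of stretched dimension:", row_stride)
print("broadcast shape:", out_shape)
```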

18# Reshape / View

PyTorch offers two methods for reshaping: .view() and .reshape(). Both change the tensor’s shape without changing the data. The key difference: .view() never copies and fails if the memory layout does not permit it, while .reshape() always works and copies only when a view is not possible.

19image = torch.tensor([[0.1, 0.5, 0.9], [0.2, 0.4, 0.8]]) — 2×3 image

Same 2×3 image as the NumPy version, now as a PyTorch tensor.

EXECUTION STATE

⬇ image (2×3) =

tensor([[0.1, 0.5, 0.9],
        [0.2, 0.4, 0.8]])

21flat = image.view(-1) — Flatten using view

Flattens the 2×3 tensor into a 1-D tensor with 6 elements. .view(-1) is PyTorch’s equivalent of NumPy’s .reshape(-1). The -1 means "infer this dimension." Since the tensor is contiguous in memory, .view() creates a new view without copying data — it just reinterprets the same memory block.

EXECUTION STATE

📚 .view(shape) = Returns a new tensor with a different shape that SHARES the same underlying data. No memory copy, just a different "view" of the same storage. The new shape must be compatible with the tensor's existing strides; a contiguous tensor always qualifies.

⬇ arg: -1 = Wildcard dimension. PyTorch computes: 2 × 3 = 6 total elements → -1 becomes 6.

⬆ flat = tensor([0.1, 0.5, 0.9, 0.2, 0.4, 0.8])

→ .view() vs .reshape() = .view(): requires contiguous memory, returns a view (shared data, no copy, O(1)). .reshape(): works always, may return a view OR a copy depending on memory layout. Use .view() when you know the tensor is contiguous (most cases). Use .reshape() when unsure.

22print("Flat:", flat)

Displays the flattened tensor. Same 6 values in the same order as the NumPy version.

EXECUTION STATE

output = Flat: tensor([0.1000, 0.5000, 0.9000, 0.2000, 0.4000, 0.8000])

24# Reshape is also available

PyTorch provides .reshape() as a more flexible alternative to .view(). It works even when the tensor is not contiguous in memory (e.g., after a transpose). In most cases both give the same result.

25matrix = torch.arange(12).reshape(3, 4) — Create and reshape

Creates a 1-D tensor of 12 integers [0..11] using torch.arange(), then reshapes it to a 3×4 matrix. This is a common pattern: generate data, then organize it into the shape you need.

EXECUTION STATE

📚 torch.arange(n) = Creates a 1-D tensor of n consecutive integers: [0, 1, 2, ..., n-1]. PyTorch equivalent of np.arange(). Returns a tensor (not a Python range).

⬇ torch.arange(12) = tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

📚 .reshape(rows, cols) = Rearranges elements into the given shape. Total elements must match: 3 × 4 = 12 = original size. May return a view or copy.

⬆ matrix (3×4) =

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])

26print("Reshaped:\n", matrix)

Displays the reshaped 3×4 matrix. The values and their arrangement are identical to the NumPy version.

EXECUTION STATE

output =

Reshaped:
 tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])

```python
import torch
import torch.nn.functional as F

Z = torch.tensor([[ 0.16,  0.43],
                  [-0.31,  0.17],
                  [ 0.09,  0.22],
                  [-0.23,  0.34]])

# Element-wise ReLU
A = F.relu(Z)
print("ReLU(Z):\n", A)

# Broadcasting (works identically to NumPy)
bias = torch.tensor([0.1, -0.1])
result = Z + bias
print("Z + bias:\n", result)

# Reshape / View
image = torch.tensor([[0.1, 0.5, 0.9],
                      [0.2, 0.4, 0.8]])
flat = image.view(-1)
print("Flat:", flat)

# Reshape is also available
matrix = torch.arange(12).reshape(3, 4)
print("Reshaped:\n", matrix)
```

You now have all the linear algebra you need for neural networks. Matrix multiplication combines inputs with weights, element-wise operations like ReLU introduce nonlinearity, broadcasting lets you add biases efficiently, and reshape lets you reorganize data between layers. Everything from here builds on these four operations — they are the building blocks of every neural network layer, every forward pass, and every training step.
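As a closing sketch, the four operations can be chained into one layer's forward pass. Everything specific here is an assumption made for illustration: the batch of four 2×3 "images", the random weights, and the batch-as-rows convention X @ W (the row-wise counterpart of $\mathbf{y} = W\mathbf{x} + \mathbf{b}$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: a batch of 4 "images", each 2x3.
images = rng.random((4, 2, 3))

# Hypothetical parameters: 6 inputs -> 2 neurons, one bias per neuron.
W = rng.standard_normal((6, 2))
b = np.array([0.1, -0.1])

# 1. Reshape: flatten each image, keeping the batch dimension. (4, 2, 3) -> (4, 6)
X = images.reshape(4, -1)

# 2. Matrix multiplication: combine inputs with weights. (4, 6) @ (6, 2) -> (4, 2)
Z = X @ W

# 3. Broadcasting: add the (2,) bias to every row of the (4, 2) result.
Z = Z + b

# 4. Element-wise operation: ReLU zeroes out the negative pre-activations.
A = np.maximum(0, Z)

print("X shape:", X.shape)
print("Z shape:", Z.shape)
print("A shape:", A.shape)
print("All activations non-negative:", bool((A >= 0).all()))
```

Tracking the shape after each step, as the comments do, is usually the quickest way to find a mismatch when a layer fails to run.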