Chapter 2

Python Refresher for Neural Networks

Python and PyTorch Essentials

Why Python for Neural Networks

Before we build our first neural network, we need a language that speaks mathematics fluently. Python, paired with NumPy, gives us exactly that: the ability to write mathematical equations almost as they appear in textbooks, while executing them at near-C speed. Every major deep learning framework (PyTorch, TensorFlow, JAX) exposes a Python API that follows NumPy's array conventions.

In this section, we'll build up from arrays to a complete neural network layer, ensuring you deeply understand the four operations that account for the bulk of neural network computation:

  1. Array creation and indexing — representing data as matrices
  2. Vectorized operations — replacing loops with whole-array math
  3. The dot product — the fundamental operation inside every neuron
  4. Matrix multiplication — computing an entire layer in one operation

By the end, you'll be able to implement a neural network layer from scratch in NumPy and understand exactly what happens to your data at every step.


NumPy Arrays: The Language of Data

Neural networks operate on numbers arranged in structured containers. A single data point is a vector: an ordered list of numbers. A batch of data points is a matrix: a 2D grid where each row is one data point and each column is one feature. NumPy's ndarray is the Python object that represents these mathematical structures.

Consider a dataset where each person is described by three measurements: height (cm), weight (kg), and age (years). One person is a vector with 3 elements. Four people form a $4 \times 3$ matrix:

$$\mathbf{X} = \begin{bmatrix} 170 & 65 & 25 \\ 160 & 55 & 30 \\ 180 & 80 & 22 \\ 155 & 50 & 35 \end{bmatrix}$$

Each row is one person. Each column is one feature. This "row = sample, column = feature" convention is the standard layout for batched data in deep learning. Let's see how to create and work with this in NumPy:

Creating and Indexing NumPy Arrays
🐍numpy_basics.py
import numpy as np

NumPy (Numerical Python) is the foundation of all scientific computing in Python. It provides the ndarray — a fast, memory-efficient array type where all elements share the same type. Every neural network library (PyTorch, TensorFlow) builds on top of NumPy’s conventions. We alias it as ‘np’ by universal convention.

EXECUTION STATE
📚 numpy = Core library for N-dimensional arrays, linear algebra, random numbers, and mathematical functions. Its core is written in C, which typically makes array math 10–100× faster than equivalent Python loops over lists.
as np = Creates alias so we write np.array() instead of numpy.array(). Used in virtually every Python data science script.
Comment: A single data point

In neural networks, data is always represented as numbers. A single input to a network is a vector — an ordered list of numbers. Here, each person is described by 3 features: height, weight, and age. These become the 3 inputs to our neuron.

x = np.array([170.0, 65.0, 25.0])

Creates a 1-dimensional NumPy array (a vector) from a Python list. Unlike a Python list, every element must be the same type (float64 here). This is crucial for speed — NumPy stores numbers in contiguous memory, just like C arrays, enabling fast vectorized operations.

EXECUTION STATE
📚 np.array() = Converts a Python list (or list of lists) into a NumPy ndarray. The dtype is inferred from the input values: integers become the platform's default integer type (int64 on most systems), floats become float64.
⬇ arg: [170.0, 65.0, 25.0] = A Python list with 3 float values. Each represents one feature of a data point: height=170cm, weight=65kg, age=25 years.
⬆ result: x = [170. 65. 25.]
→ why floats? = Neural networks use floating-point arithmetic for gradient computation. Using 170.0 instead of 170 ensures float64 dtype. With an integer array, assigning a fractional value into the array silently truncates it, and in-place float arithmetic raises a casting error.
print(x)

Displays the array contents. NumPy prints vectors in a compact format with consistent spacing. Notice the trailing dots — that’s NumPy’s way of showing these are floats, not integers.

EXECUTION STATE
output = [170. 65. 25.]
print(x.shape)

The shape attribute is the single most important property of any array. It tells you the dimensionality — how many axes and how many elements along each axis. A shape of (3,) means: 1 axis with 3 elements. This is a 1D vector.

EXECUTION STATE
📚 .shape = A tuple of integers giving the size along each dimension. For a vector: (n,). For a matrix: (rows, cols). For a 3D tensor: (depth, rows, cols).
x.shape = (3,) — a 1D array with 3 elements. The trailing comma means it’s a tuple with one element, not the integer 3.
print(x.dtype)

The dtype (data type) tells you the numeric precision. float64 means 64-bit floating point — about 15 decimal digits of precision. Neural networks typically use float32 for speed, but NumPy defaults to float64 for maximum precision.

EXECUTION STATE
📚 .dtype = The data type of every element in the array. Common types: float64 (default), float32 (GPU-friendly), int64, bool.
x.dtype = float64 — each number uses 8 bytes. A 1000-dim vector = 8KB. A 1000×1000 matrix = 8MB.
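
The dtype rules described above can be checked directly. The following is a minimal sketch; the variable names `ints`, `floats`, and `floats32` are illustrative and not part of the original listing. It shows dtype inference, the silent truncation that happens when a fractional value is assigned into an integer array, and the memory saved by converting to float32.

```python
import numpy as np

ints = np.array([170, 65, 25])            # no decimal points: an integer dtype is inferred
floats = np.array([170.0, 65.0, 25.0])    # decimal points: float64 is inferred

print(ints.dtype.kind)    # 'i' (signed integer; the exact width is platform-dependent)
print(floats.dtype)       # float64

# Pitfall: assigning a fractional value into an integer array drops the fraction.
ints[0] = 170.7
print(ints[0])            # 170

# float32 uses 4 bytes per element instead of 8.
floats32 = floats.astype(np.float32)
print(floats.nbytes, floats32.nbytes)     # 24 12
```

This is why the examples in this section write 170.0 rather than 170: the array starts out as float64 and fractional results are preserved.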
Comment: A batch of 4 data points

In practice, neural networks process many data points at once (a ‘batch’). A batch of N vectors, each with D features, forms an N×D matrix. This is the fundamental data structure of deep learning: every layer receives and produces matrices.

X = np.array([[170.0, 65.0, 25.0], ...])

Creates a 2D NumPy array (a matrix) from a list of lists. Each inner list becomes a row. Convention: lowercase x for a single vector, uppercase X for a matrix of multiple data points. Each row is one person; each column is one feature.

EXECUTION STATE
⬇ arg: list of lists = 4 inner lists, each with 3 values. All rows must have equal length: recent NumPy versions raise a ValueError for jagged input.
⬆ result: X (4×3) =
  height  weight  age
  170.0    65.0  25.0
  160.0    55.0  30.0
  180.0    80.0  22.0
  155.0    50.0  35.0
print(X.shape)

Shape (4, 3) means: 4 rows (data points) and 3 columns (features). In neural network terminology: batch_size=4, input_dim=3. You’ll see this pattern everywhere: the first axis is the batch dimension, the rest are feature dimensions.

EXECUTION STATE
X.shape = (4, 3) — 4 data points, each with 3 features. In ML: (batch_size, num_features).
Comment: Access individual elements

NumPy supports powerful indexing — you can access individual elements, entire rows, entire columns, or arbitrary slices. This is critical for inspecting what your network sees at each stage.

print(X[0])

Indexing with a single integer selects an entire row. X[0] gives the first row — the first person’s complete feature vector. This is what a neuron ‘sees’ as input for one data point.

EXECUTION STATE
X[0] = [170. 65. 25.] — first person: height=170, weight=65, age=25
print(X[0, 1])

Two indices select a single element: X[row, col]. X[0, 1] means row 0, column 1 — the first person’s weight. NumPy uses 0-based indexing, so column 1 is the second feature.

EXECUTION STATE
X[0, 1] = 65.0 — first person’s weight (row=0, col=1)
print(X[:, 0])

The colon : means ‘all elements along this axis.’ X[:, 0] selects all rows, column 0 — extracting every person’s height into a vector. This slice-based access is how you’ll extract features, compute statistics, and debug your networks.

EXECUTION STATE
: (colon) = Selects ALL elements along that axis. X[:, 0] = ‘all rows, column 0’. X[0, :] = ‘row 0, all columns’ (same as X[0]).
X[:, 0] = [170. 160. 180. 155.] — all 4 people’s heights
import numpy as np

# A single data point: 3 features (height_cm, weight_kg, age)
x = np.array([170.0, 65.0, 25.0])
print(x)          # [170.  65.  25.]
print(x.shape)    # (3,)
print(x.dtype)    # float64

# A batch of 4 data points, each with 3 features
X = np.array([
    [170.0, 65.0, 25.0],
    [160.0, 55.0, 30.0],
    [180.0, 80.0, 22.0],
    [155.0, 50.0, 35.0],
])
print(X.shape)    # (4, 3)

# Access individual elements
print(X[0])       # [170.  65.  25.]  (first data point)
print(X[0, 1])    # 65.0  (first person's weight)
print(X[:, 0])    # [170. 160. 180. 155.]  (all heights)
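
Beyond single rows, columns, and elements, the same indexing syntax also handles ranges and conditions. The sketch below is an illustrative extension, not part of the original listing; the names `middle_rows`, `weight_and_age`, and `tall` are made up for this example.

```python
import numpy as np

X = np.array([
    [170.0, 65.0, 25.0],
    [160.0, 55.0, 30.0],
    [180.0, 80.0, 22.0],
    [155.0, 50.0, 35.0],
])

middle_rows = X[1:3]          # rows 1 and 2 (the end index is exclusive)
weight_and_age = X[:, 1:]     # all rows, columns 1 onward
tall = X[X[:, 0] > 165.0]     # boolean mask: keep rows whose height exceeds 165

print(middle_rows.shape)      # (2, 3)
print(weight_and_age.shape)   # (4, 2)
print(tall[:, 0])             # [170. 180.]
```

Checking the shape after each slice is a quick way to confirm that the selection kept the axis you intended.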

Shape is Everything

In neural network code, a large share of bugs come from shape mismatches. Before writing any operation, mentally track the shape of every array. The shape tells you what the array means: (4, 3) means "4 data points, each with 3 features." (2, 3) for a weight matrix means "2 neurons, each with 3 weights."
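
One way to make shape tracking explicit is to assert the expected shape at each step. The helper below, `check_shape`, is a hypothetical function written for this illustration; it is not a NumPy API. The zero-filled arrays stand in for real data and weights.

```python
import numpy as np

def check_shape(name, array, expected):
    """Hypothetical helper: raise a readable error if the shape is not what we expect."""
    if array.shape != expected:
        raise ValueError(f"{name}: expected shape {expected}, got {array.shape}")
    return array

X = np.zeros((4, 3))   # 4 data points, each with 3 features
W = np.zeros((2, 3))   # 2 neurons, each with 3 weights

check_shape("X", X, (4, 3))
check_shape("W", W, (2, 3))
Z = check_shape("Z", X @ W.T, (4, 2))   # 4 data points, 2 neuron outputs each
print(Z.shape)   # (4, 2)
```

A failed check reports which array went wrong and what shape it had, which is usually faster to debug than a broadcasting error raised several lines later.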

Vectorized Operations: Thinking Without Loops

The single most important mental shift for neural network programming is moving from loop thinking to array thinking. Instead of writing a loop to process one element at a time, we express operations on entire arrays at once. NumPy then executes the operation as optimized C code, often 10–100× faster than an equivalent Python loop, and sometimes more.

This matters enormously in deep learning. A typical training step might involve matrices with millions of elements. A Python loop over 10 million elements can take on the order of seconds; the same operation vectorized in NumPy typically takes tens of milliseconds. Exact timings depend on the operation and the hardware.
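
The size of the gap is easy to measure on your own machine. The following is a rough benchmark sketch: the array size, the arithmetic being timed, and the function names are arbitrary choices for illustration, and the numbers it prints will vary between systems.

```python
import timeit
import numpy as np

n = 200_000  # assumed size, kept small so the benchmark finishes quickly
values = np.random.default_rng(0).random(n)
values_list = values.tolist()

def loop_version():
    # One Python-level operation per element.
    return [(v - 0.5) * 2.0 for v in values_list]

def vectorized_version():
    # One whole-array expression, executed in compiled code.
    return (values - 0.5) * 2.0

loop_seconds = timeit.timeit(loop_version, number=3) / 3
vec_seconds = timeit.timeit(vectorized_version, number=3) / 3

print(f"loop:       {loop_seconds:.5f} s")
print(f"vectorized: {vec_seconds:.5f} s")
print(f"speedup:    about {loop_seconds / vec_seconds:.0f}x on this machine")
```

Both functions compute the same values; only the execution strategy differs.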

From Loops to Vectors

Let's normalize a set of height values to the $[0, 1]$ range using min-max normalization: $x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$. First the slow way with a loop, then the fast way with NumPy:

Loop vs Vectorized: Speed Through Array Thinking
🐍vectorized_ops.py
import numpy as np

Import NumPy. We’ll compare Python loops vs NumPy vectorized operations to see why vectorization matters for neural networks.

Comment: The slow way

Python loops iterate one element at a time. For each iteration, Python must: look up the variable type, find the right operation, handle dynamic dispatch. This overhead makes loops 100–1000× slower than C for numerical work.

heights = [170.0, 160.0, 180.0, 155.0]

A plain Python list. Each element is a full Python object with type info, reference count, and value: 24 bytes per float on 64-bit CPython, plus an 8-byte pointer in the list, versus 8 bytes per element in a NumPy float64 array. For 1 million elements, that’s roughly 32MB vs 8MB.

EXECUTION STATE
heights = [170.0, 160.0, 180.0, 155.0] — Python list of floats
normalized = []

Start with an empty list. We’ll append one result at a time — each append may trigger memory reallocation.

EXECUTION STATE
normalized = [] — empty Python list
for h in heights:

Loop through each height value. Python executes this loop in its interpreter, one element at a time. For 4 elements this is fine, but for millions of elements (common in deep learning), this becomes the bottleneck.

LOOP TRACE · 4 iterations
h=170.0
normalized so far = (h - 155) / 25 = 15/25 = 0.6 → [0.6]
h=160.0
normalized so far = (h - 155) / 25 = 5/25 = 0.2 → [0.6, 0.2]
h=180.0
normalized so far = (h - 155) / 25 = 25/25 = 1.0 → [0.6, 0.2, 1.0]
h=155.0
normalized so far = (h - 155) / 25 = 0/25 = 0.0 → [0.6, 0.2, 1.0, 0.0]
normalized.append((h - 155.0) / (180.0 - 155.0))

Min-max normalization: maps each value to the [0, 1] range. Formula: (x - min) / (max - min). Here min=155, max=180, so range=25. This is a common preprocessing step before feeding data to a neural network.

EXECUTION STATE
155.0 = The minimum height in our data
180.0 - 155.0 = 25.0 = The range (max - min). Dividing by this maps everything to [0, 1].
print(normalized)

The result after 4 loop iterations. Each height is now between 0 and 1.

EXECUTION STATE
normalized = [0.6, 0.2, 1.0, 0.0]
Comment: The fast way

NumPy’s vectorized operations apply the same operation to every element simultaneously using optimized C/Fortran code. No Python loop, no per-element overhead. The operation is expressed as a single mathematical expression on the entire array.

heights = np.array([170.0, 160.0, 180.0, 155.0])

Convert to a NumPy array. Now all 4 values sit in contiguous memory as raw float64 numbers, with no Python object overhead. This memory layout lets the CPU use SIMD instructions, which process several values with a single instruction.

EXECUTION STATE
heights = [170. 160. 180. 155.] — NumPy array, contiguous in memory
normalized = (heights - 155.0) / (180.0 - 155.0)

One line replaces the entire for loop. NumPy subtracts 155.0 from ALL elements at once (vectorized subtraction), then divides ALL results by 25.0 at once (vectorized division). Under the hood, this runs as a tight C loop with SIMD instructions — ~100× faster than the Python loop for large arrays.

EXECUTION STATE
heights - 155.0 = [15. 5. 25. 0.] — subtract 155 from each element simultaneously
/ 25.0 = Divide each element by 25.0 simultaneously
⬆ result: normalized = [0.6 0.2 1.0 0.0] — identical result, ~100× faster for large arrays
print(normalized)

Same result as the loop version, but computed in a single vectorized operation. This is the fundamental insight: in neural networks, we NEVER write loops over individual data elements — we express everything as array operations.

EXECUTION STATE
normalized = [0.6 0.2 1.0 0.0]
Comment: Element-wise operations

These element-wise operations are the building blocks of every neural network computation. Each operation applies independently to every element. Activation functions (ReLU, sigmoid), loss computation, gradient calculation — all are element-wise.

a = np.array([1.0, 2.0, 3.0])

A simple 3-element vector. Think of this as one neuron’s input or one row of a weight matrix.

EXECUTION STATE
a = [1.0, 2.0, 3.0]
b = np.array([4.0, 5.0, 6.0])

A second vector of equal shape. For element-wise operations, both arrays must have the same shape (or be broadcastable — covered later).

EXECUTION STATE
b = [4.0, 5.0, 6.0]
print(a + b)

Element-wise addition: [1+4, 2+5, 3+6]. In neural networks, this is how bias is added to the weighted sum: z = Wx + b.

EXECUTION STATE
a + b = [5.0, 7.0, 9.0] — element-wise: 1+4=5, 2+5=7, 3+6=9
print(a * b)

Element-wise multiplication (NOT dot product). Also called the Hadamard product. Each element of a is multiplied by the corresponding element of b. Used in gating mechanisms (LSTM, GRU) and attention masking.

EXECUTION STATE
a * b = [4.0, 10.0, 18.0] — element-wise: 1×4=4, 2×5=10, 3×6=18
→ NOT dot product = a * b returns a vector. np.dot(a, b) returns a scalar (4+10+18=32).
print(a ** 2)

Element-wise squaring. Used in loss functions like Mean Squared Error: MSE = mean((predicted - actual)²).

EXECUTION STATE
a ** 2 = [1.0, 4.0, 9.0] — element-wise: 1²=1, 2²=4, 3²=9
print(np.sqrt(a))

Element-wise square root. Used in normalization (e.g., dividing by standard deviation) and in the Adam optimizer’s computation.

EXECUTION STATE
📚 np.sqrt() = Computes √x for each element. Example: np.sqrt(4) = 2.0, np.sqrt(2) ≈ 1.414
np.sqrt(a) = [1.000, 1.414, 1.732] — √1=1, √2≈1.41, √3≈1.73
print(np.exp(a))

Element-wise exponential: e raised to the power of each element. This is the core of the softmax function, which converts raw scores into probabilities. Also appears in sigmoid: σ(x) = 1/(1+e^(-x)).

EXECUTION STATE
📚 np.exp() = Computes e^x for each element. e ≈ 2.71828. np.exp(0)=1, np.exp(1)≈2.718, np.exp(-∞)=0
np.exp(a) = [2.718, 7.389, 20.086] — e¹≈2.72, e²≈7.39, e³≈20.09
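
To make the connection between np.exp and the two functions mentioned above concrete, here is a minimal sketch of sigmoid and softmax built from element-wise operations. The function names and the max-subtraction step in `softmax` are choices made for this illustration (subtracting the maximum is a common way to avoid overflow and does not change the result); they are not taken from the original listing.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = z - np.max(z)      # guard against overflow in np.exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

a = np.array([1.0, 2.0, 3.0])

print(sigmoid(a))            # each value squashed into (0, 1)
print(softmax(a))            # larger scores get larger probabilities
print(np.sum(softmax(a)))    # sums to 1 (up to floating-point rounding)
```

Both functions contain no loops: np.exp, the division, and the sum all operate on the whole array.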
print(np.sum(a))

Sums all elements into a single scalar. Used in computing loss values, normalizing probabilities (softmax denominator), and aggregating gradients across a batch.

EXECUTION STATE
📚 np.sum() = Adds all elements. With axis=0: sums down the rows, giving one total per column. With axis=1: sums across the columns, giving one total per row. Without axis: sums everything into one number.
np.sum(a) = 6.0 — total: 1.0 + 2.0 + 3.0 = 6.0
print(np.mean(a))

Computes the arithmetic mean. Equal to np.sum(a) / len(a). Central to batch normalization, mean squared error, and computing average loss over a batch.

EXECUTION STATE
📚 np.mean() = Computes average = sum / count. With axis: average along that dimension. Without: global average.
np.mean(a) = 2.0 — average: 6.0 / 3 = 2.0
import numpy as np

# ── The slow way: Python loops ──
heights = [170.0, 160.0, 180.0, 155.0]
normalized = []
for h in heights:
    normalized.append((h - 155.0) / (180.0 - 155.0))
print(normalized)   # [0.6, 0.2, 1.0, 0.0]

# ── The fast way: NumPy vectorized ──
heights = np.array([170.0, 160.0, 180.0, 155.0])
normalized = (heights - 155.0) / (180.0 - 155.0)
print(normalized)   # [0.6 0.2 1.  0. ]

# Element-wise operations: the building blocks
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(a + b)        # [5. 7. 9.]     element-wise addition
print(a * b)        # [ 4. 10. 18.]  element-wise multiply
print(a ** 2)       # [1. 4. 9.]     element-wise square
print(np.sqrt(a))   # approx. 1.0, 1.41, 1.73    element-wise sqrt
print(np.exp(a))    # approx. 2.72, 7.39, 20.09  element-wise e^x
print(np.sum(a))    # 6.0  sum of all elements
print(np.mean(a))   # 2.0  average

The Vectorization Rule: If you're writing a for loop over array elements in neural network code, there's almost certainly a NumPy function that does it faster. Think in terms of whole-array operations: "subtract the minimum from the entire array," not "subtract the minimum from each element one by one."
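
The listing above hard-codes the minimum (155.0) and maximum (180.0). A sketch of the same idea with the statistics computed from the data, applied to every feature of the person matrix at once, might look like this. The variable names are illustrative, and the final line relies on broadcasting a (3,) vector across the rows of a (4, 3) matrix.

```python
import numpy as np

X = np.array([
    [170.0, 65.0, 25.0],
    [160.0, 55.0, 30.0],
    [180.0, 80.0, 22.0],
    [155.0, 50.0, 35.0],
])

# axis=0 collapses the rows: one statistic per column (per feature).
col_min = X.min(axis=0)      # shape (3,)
col_max = X.max(axis=0)      # shape (3,)
col_mean = X.mean(axis=0)    # shape (3,): 166.25, 62.5, 28.0

# axis=1 collapses the columns: one statistic per row (per person).
row_sum = X.sum(axis=1)      # shape (4,): 260, 245, 282, 240

# Min-max normalize all three features in one expression.
X_norm = (X - col_min) / (col_max - col_min)

print(col_mean)
print(row_sum)
print(X_norm[:, 0])   # normalized heights: 0.6, 0.2, 1.0, 0.0 (same as the loop)
```

After this step every column of X_norm lies in [0, 1], regardless of the original units.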

The Dot Product: The Core Operation

Every neuron in a neural network computes exactly one operation at its heart: the dot product of its weight vector with its input vector. Understanding the dot product — both algebraically and geometrically — is essential for understanding how neural networks process information.

Algebraic Definition

Given two vectors $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ and $\mathbf{b} = [b_1, b_2, \ldots, b_n]$, the dot product is the sum of element-wise products:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i \, b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$

The result is a single number (a scalar). For a neuron with three weights $\mathbf{w}$ and three inputs $\mathbf{x}$, the weighted sum is $\mathbf{w} \cdot \mathbf{x} = w_1 x_1 + w_2 x_2 + w_3 x_3$.

Geometric Interpretation

The dot product has a beautiful geometric meaning. It equals the product of the two vectors' magnitudes times the cosine of the angle between them:

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta$$

This tells us three things about the relationship between two vectors:

  • Positive ($\theta < 90^\circ$): the vectors point in roughly the same direction, so the input "matches" what the neuron is looking for
  • Zero ($\theta = 90^\circ$): the vectors are perpendicular, so the input is irrelevant to this neuron
  • Negative ($\theta > 90^\circ$): the vectors point in opposing directions, so the input is the opposite of what the neuron seeks
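
The three cases can be checked numerically by rearranging the geometric formula to solve for $\cos\theta$. The sketch below uses a hypothetical helper, `cosine_similarity`, together with the weight and input vectors that appear in this section's dot product example.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| |b|), from the geometric form of the dot product."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

w = np.array([0.8, -0.5, 0.2])
x = np.array([1.0, 0.5, -0.3])

cos_theta = cosine_similarity(w, x)
theta_deg = np.degrees(np.arccos(cos_theta))
print(f"cos(theta) = {cos_theta:.3f}, theta = {theta_deg:.1f} degrees")  # positive, below 90

perpendicular = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
print(perpendicular)   # 0.0
print(opposite)        # -1.0
```

Dividing by the two magnitudes removes the effect of vector length, leaving only the alignment between the two directions.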

[Interactive visualization: two draggable vectors showing how the dot product measures their alignment]

Computing the Dot Product in Python

Let's implement the dot product both manually (to understand the mechanics) and with NumPy (for the speed we'll use in practice):

The Dot Product — Manual vs NumPy
🐍dot_product.py
import numpy as np

Import NumPy for the vectorized dot product comparison.

Comment: Pure Python dot product

Let’s first implement the dot product from scratch to understand exactly what it computes, then show the NumPy one-liner that does the same thing far faster on large vectors.

def dot_product(a, b):

A function that computes the dot product of two vectors. The dot product is THE fundamental operation in neural networks — every neuron computes a dot product of its inputs with its weights.

EXECUTION STATE
⬇ input: a = First vector (e.g., weights). A list or array of numbers.
⬇ input: b = Second vector (e.g., inputs). Must have the same length as a.
⬆ returns = A single number (scalar): the sum of element-wise products. a·b = a[0]×b[0] + a[1]×b[1] + ... + a[n-1]×b[n-1]
Docstring: sum of element-wise products

This is the precise definition: multiply corresponding elements, then sum. For vectors of length n: a·b = Σ(a[i]×b[i]) for i=0 to n-1.

total = 0.0

Initialize an accumulator to zero. We’ll add each product to this running total.

EXECUTION STATE
total = 0.0 — starting value for accumulation
for i in range(len(a)):

Loop through each index. len(a) = 3, so i takes values 0, 1, 2. Each iteration multiplies one pair of corresponding elements.

LOOP TRACE · 3 iterations
i=0
a[0] × b[0] = 0.8 × 1.0 = 0.80
total = 0.0 + 0.80 = 0.80
i=1
a[1] × b[1] = (-0.5) × 0.5 = -0.25
total = 0.80 + (-0.25) = 0.55
i=2
a[2] × b[2] = 0.2 × (-0.3) = -0.06
total = 0.55 + (-0.06) = 0.49
total += a[i] * b[i]

Multiply element i of a with element i of b, then add to the running total. The += operator means total = total + (a[i] * b[i]).

EXECUTION STATE
a[i] * b[i] = The product of corresponding elements. Positive products mean both have the same sign (both positive or both negative). Negative products mean they disagree.
return total

Return the accumulated sum of all products. This single number tells us how ‘aligned’ the two vectors are — a large positive value means they point in similar directions.

EXECUTION STATE
⬆ return: total = 0.49 — the dot product of weights and inputs
weights = [0.8, -0.5, 0.2]

A neuron’s weights determine how much it ‘cares’ about each input. w[0]=0.8 means ‘pay strong attention to input 0.’ w[1]=-0.5 means ‘penalize input 1.’ w[2]=0.2 means ‘slightly consider input 2.’

EXECUTION STATE
weights = [0.8, -0.5, 0.2] — learned importance of each input feature
inputs = [1.0, 0.5, -0.3]

The input data flowing into this neuron. Each value is one feature of a data point (e.g., normalized height, weight, age).

EXECUTION STATE
inputs = [1.0, 0.5, -0.3] — one data point with 3 features
result = dot_product(weights, inputs)

Compute the weighted sum: 0.8×1.0 + (-0.5)×0.5 + 0.2×(-0.3) = 0.80 - 0.25 - 0.06 = 0.49. This is the neuron’s ‘pre-activation’ value before applying an activation function.

EXECUTION STATE
result = 0.49 — the neuron’s weighted sum of inputs
→ breakdown = (0.8)(1.0) + (-0.5)(0.5) + (0.2)(-0.3) = 0.80 + (-0.25) + (-0.06) = 0.49
print(f"Manual dot product: {result}")

f-string formatting: inserts the value of result into the string.

EXECUTION STATE
output = Manual dot product: 0.49 (floating-point rounding may show extra digits, such as 0.49000000000000005)
Comment: NumPy one function call

Now the same computation using NumPy — faster, cleaner, and mathematically equivalent.

w = np.array([0.8, -0.5, 0.2])

Same weights, now as a NumPy array for vectorized computation.

EXECUTION STATE
w = [0.8, -0.5, 0.2]
x = np.array([1.0, 0.5, -0.3])

Same inputs as a NumPy array.

EXECUTION STATE
x = [1.0, 0.5, -0.3]
Comment: Three equivalent ways

NumPy provides three syntactically different but mathematically identical ways to compute the dot product. All three produce approximately 0.49 (up to floating-point rounding).

print(np.dot(w, x))

np.dot() is the explicit dot product function. For 1D arrays, it computes the inner product. For 2D arrays, it performs matrix multiplication.

EXECUTION STATE
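📚 np.dot(a, b) = For 1D inputs: inner product (scalar). For 2D inputs: matrix multiply. For higher-dimensional inputs it sums over the last axis of a and the second-to-last axis of b, without broadcasting (np.matmul and @ broadcast over leading dimensions instead). This is the most explicit way to say ‘compute a dot product.’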
⬆ result = 0.49
print(w @ x)

The @ operator (Python 3.5+) is shorthand for matrix multiplication / dot product. It’s the most common syntax in modern code because it looks like the mathematical notation.

EXECUTION STATE
📚 @ operator = Matrix multiplication operator. For 1D arrays: dot product. For 2D: matmul. Gives the same result as np.dot() for 1D and 2D arrays, with cleaner syntax. Introduced in PEP 465.
⬆ result = 0.49
print(np.sum(w * x))

The ‘manual’ vectorized approach: first element-wise multiply (w * x), then sum. This makes the two steps explicit: multiply corresponding elements, then add them up. Conceptually clearest for understanding what a dot product actually does.

EXECUTION STATE
w * x = [0.80, -0.25, -0.06] — element-wise products
np.sum(...) = 0.80 + (-0.25) + (-0.06) = 0.49
⬆ result = 0.49
import numpy as np

# ── Pure Python: dot product by hand ──
def dot_product(a, b):
    """Compute dot product: sum of element-wise products."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

weights = [0.8, -0.5, 0.2]
inputs  = [1.0,  0.5, -0.3]

result = dot_product(weights, inputs)
print(f"Manual dot product: {result}")
# Manual dot product: 0.49 (may show extra digits from floating-point rounding)

# ── NumPy: one function call ──
w = np.array([0.8, -0.5, 0.2])
x = np.array([1.0,  0.5, -0.3])

# Three equivalent ways (each prints a value of about 0.49):
print(np.dot(w, x))
print(w @ x)
print(np.sum(w * x))

Why the Dot Product is the Core of Neural Networks

A neuron computes $z = \mathbf{w} \cdot \mathbf{x} + b$. The dot product $\mathbf{w} \cdot \mathbf{x}$ is a weighted sum of the inputs. Large positive weights amplify features the neuron considers important. Negative weights suppress unwanted features. During training, the network learns which weight to assign to each feature; adjusting those weights (and the bias) is the heart of the learning process.
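
As a small illustration of the neuron formula, the sketch below evaluates the pre-activation for the weights and inputs used in the dot product example. The bias value 0.1 is an assumption chosen for this example.

```python
import numpy as np

w = np.array([0.8, -0.5, 0.2])   # one neuron's weights
x = np.array([1.0, 0.5, -0.3])   # one data point
b = 0.1                          # assumed bias value for illustration

z = np.dot(w, x) + b             # weighted sum plus bias
print(f"z = {z:.2f}")            # z = 0.59

# Raising the weight on feature 0 raises z, because x[0] is positive.
w_stronger = np.array([1.2, -0.5, 0.2])
z_stronger = np.dot(w_stronger, x) + b
print(f"z with larger w[0] = {z_stronger:.2f}")   # z with larger w[0] = 0.99
```

Changing a single weight changes how strongly the corresponding feature pushes the output, which is the quantity training adjusts.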

Matrix Multiplication: Parallelizing Dot Products

A single neuron computes one dot product. But a neural network layer has many neurons, and we process many data points at once. Matrix multiplication is the operation that computes all dot products simultaneously — every neuron's response to every data point in one operation.

How Matrix Multiplication Works

Given matrix $\mathbf{A}$ of shape $(m \times n)$ and matrix $\mathbf{B}$ of shape $(n \times p)$, the product $\mathbf{C} = \mathbf{A}\mathbf{B}$ has shape $(m \times p)$, where each element is a dot product of a row of $\mathbf{A}$ with a column of $\mathbf{B}$:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} \, B_{kj} = (\text{row } i \text{ of } \mathbf{A}) \cdot (\text{column } j \text{ of } \mathbf{B})$$

The critical rule: the inner dimensions must match, $(m \times \mathbf{n}) \times (\mathbf{n} \times p) \to (m \times p)$.

[Interactive visualization: each output cell of the matrix product, highlighted together with the row and column that produce it]
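
To see the definition of $C_{ij}$ at work, here is a deliberately slow sketch that spells out the three nested loops and compares the result with the @ operator. The function `matmul_loops` and the small matrices A and B are illustrative and not part of the original listings.

```python
import numpy as np

def matmul_loops(A, B):
    """Matrix product from the definition: C[i, j] = sum over k of A[i, k] * B[k, j]."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError(f"inner dimensions differ: {n} vs {n_b}")
    C = np.zeros((m, p))
    for i in range(m):            # each row of A
        for j in range(p):        # each column of B
            for k in range(n):    # walk along the shared inner dimension
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (2, 3)
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])        # shape (3, 2)

C = matmul_loops(A, B)
print(C)                          # [[ 7.  8.]
                                  #  [16. 17.]]
print(np.allclose(C, A @ B))      # True
```

The loop version exists only to expose the arithmetic; in practice A @ B performs the same computation in optimized compiled code.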

Matrix Multiplication as a Neural Network Layer

In a neural network, the input matrix $\mathbf{X}$ has shape $(\text{batch\_size} \times \text{input\_dim})$ and the weight matrix $\mathbf{W}$ has shape $(\text{num\_neurons} \times \text{input\_dim})$. The output is:

$$\mathbf{Z} = \mathbf{X} \, \mathbf{W}^T$$

where $Z_{ij}$ is data point $i$ processed by neuron $j$. We transpose $\mathbf{W}$ because we store each neuron's weights as a row (convenient for reading), but matrix multiplication needs them as columns.

Matrix Multiplication as a Neural Network Layer
🐍matrix_multiply.py
import numpy as np

Import NumPy for matrix operations.

Comment: 4 data points, 3 features each

We’re simulating a mini-batch of 4 data points flowing through a layer with 2 neurons. Matrix multiplication lets us compute ALL neuron outputs for ALL data points in one operation.

X = np.array([...]), shape (4, 3)

The input matrix. Each row is one data point (already normalized). In neural network notation: X has shape (batch_size, input_dim) = (4, 3).

EXECUTION STATE
X (4×3) =
  feat0  feat1  feat2
   1.0    0.5   -0.3    ← data point 0
   0.2    1.5    0.7    ← data point 1
  -0.5    0.8    1.2    ← data point 2
   0.9   -0.2    0.4    ← data point 3
X.shape = (4, 3) — 4 samples, 3 features each
Comment: Weight matrix

A neural network layer with 2 neurons needs a weight matrix where each row contains one neuron’s weights. Row 0 = neuron 0’s weights for the 3 inputs. Row 1 = neuron 1’s weights.

W = np.array([...]), shape (2, 3)

The weight matrix. Each row is one neuron’s weight vector. W[0] = [0.8, -0.5, 0.2] means neuron 0 weighs the three inputs by these amounts. W[1] = [0.3, 0.7, -0.1] is neuron 1’s perspective.

EXECUTION STATE
W (2×3) =
         feat0  feat1  feat2
neuron0   0.8   -0.5    0.2
neuron1   0.3    0.7   -0.1
W.shape = (2, 3) — 2 neurons, each with 3 weights
→ neuron 0 strategy = W[0]=[0.8,-0.5,0.2]: strongly favors feature 0, penalizes feature 1, slightly likes feature 2
→ neuron 1 strategy = W[1]=[0.3,0.7,-0.1]: slightly likes feature 0, strongly favors feature 1, slightly penalizes feature 2
Comment: Matrix multiply

Here’s the key insight: X @ W.T computes EVERY dot product between EVERY data point and EVERY neuron in a single operation. It replaces a double-nested loop with one matrix multiplication.

Z = X @ W.T

Matrix multiplication: X (4×3) times W transposed (3×2) gives Z (4×2). Z[i, j] = dot product of data point i with neuron j’s weights. The transpose W.T is needed because W stores weights as rows, but matmul needs them as columns.

EXECUTION STATE
📚 @ (matmul) = Matrix multiplication. For (m×n) @ (n×p) = (m×p). Inner dimensions must match (both n). Each output element is a dot product of a row from the left and a column from the right.
W.T (3×2) =
Transpose of W: rows become columns.
         n0    n1
feat0   0.8   0.3
feat1  -0.5   0.7
feat2   0.2  -0.1
⬆ result: Z (4×2) =
         neuron0  neuron1
data 0    0.49     0.68
data 1   -0.45     1.04
data 2   -0.56     0.29
data 3    0.90     0.09
→ shape check = (4,3) @ (3,2) → (4,2) ✔ inner dims 3=3 match
print(Z)

The output matrix Z. Each row is one data point’s output from the layer. Each column is one neuron’s response to all data points.

EXECUTION STATE
Z =
[[ 0.49  0.68]
 [-0.45  1.04]
 [-0.56  0.29]
 [ 0.90  0.09]]
Comment: What Z[0, 0] means

Let’s verify by computing one element manually to prove the matrix multiply does exactly what we expect.

Z[0,0] = X[0] · W[0]

Z[0,0] is the dot product of the first data point [1.0, 0.5, -0.3] with neuron 0’s weights [0.8, -0.5, 0.2]. This is: 0.8×1.0 + (-0.5)×0.5 + 0.2×(-0.3) = 0.80 - 0.25 - 0.06 = 0.49.

EXECUTION STATE
X[0] = [1.0, 0.5, -0.3] — data point 0
W[0] = [0.8, -0.5, 0.2] — neuron 0 weights
np.dot(X[0], W[0]) = 0.49 = (0.8)(1.0) + (-0.5)(0.5) + (0.2)(-0.3)
Z[0,1] = X[0] · W[1]

Z[0,1] is data point 0 processed by neuron 1: [1.0, 0.5, -0.3] · [0.3, 0.7, -0.1] = 0.30 + 0.35 + 0.03 = 0.68. Notice neuron 1 gives a different response because it has different weights — it’s looking for different patterns in the data.

EXECUTION STATE
X[0] = [1.0, 0.5, -0.3] — same data point
W[1] = [0.3, 0.7, -0.1] — neuron 1 weights
np.dot(X[0], W[1]) = 0.68 = (0.3)(1.0) + (0.7)(0.5) + (-0.1)(-0.3)
import numpy as np

# 4 data points, each with 3 features
X = np.array([
    [1.0, 0.5, -0.3],
    [0.2, 1.5,  0.7],
    [-0.5, 0.8, 1.2],
    [0.9, -0.2, 0.4],
])  # shape: (4, 3)

# Weight matrix: 3 inputs → 2 neurons
W = np.array([
    [0.8, -0.5, 0.2],
    [0.3,  0.7, -0.1],
])  # shape: (2, 3)

# Matrix multiply: each neuron computes dot product with each input
Z = X @ W.T   # (4, 3) @ (3, 2) → (4, 2)
print(Z)
# Values rounded to 2 decimals:
# [[ 0.49  0.68]
#  [-0.45  1.04]
#  [-0.56  0.29]
#  [ 0.90  0.09]]

# What Z[0, 0] means:
print(f"Z[0,0] = X[0] · W[0] = {np.dot(X[0], W[0]):.2f}")
# Z[0,0] = neuron 0's response to data point 0 = 0.49

print(f"Z[0,1] = X[0] · W[1] = {np.dot(X[0], W[1]):.2f}")
# Z[0,1] = neuron 1's response to data point 0 = 0.68

| Notation | Shape | Meaning |
|---|---|---|
| X | (batch_size, input_dim) | Input data: each row is one data point |
| W | (num_neurons, input_dim) | Weights: each row is one neuron’s weights |
| Wᵀ | (input_dim, num_neurons) | Transposed for matmul compatibility |
| Z = XWᵀ | (batch_size, num_neurons) | Each element is one dot product: Z[i,j] = X[i]·W[j] |
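
The shape rule in the table can be tested directly: multiplying without the transpose fails because the inner dimensions (3 and 2) do not match. The sketch below uses arrays of ones as stand-ins for real data and weights.

```python
import numpy as np

X = np.ones((4, 3))   # (batch_size, input_dim)
W = np.ones((2, 3))   # (num_neurons, input_dim)

try:
    X @ W             # (4, 3) @ (2, 3): inner dimensions 3 and 2 differ
    failed = False
except ValueError as err:
    failed = True
    print("X @ W failed:", err)

Z = X @ W.T           # (4, 3) @ (3, 2): inner dimensions match
print(Z.shape)        # (4, 2)
```

Reading the two shapes side by side before multiplying is usually enough to tell whether a transpose is needed.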

Broadcasting: Automatic Shape Alignment

After computing $\mathbf{Z} = \mathbf{X}\mathbf{W}^T$, every neuron adds its bias: $\mathbf{Z} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$. But $\mathbf{X}\mathbf{W}^T$ has shape $(4, 2)$ and $\mathbf{b}$ has shape $(2,)$. How can you add a vector to a matrix?

The answer is broadcasting — NumPy's mechanism for performing operations on arrays of different shapes. NumPy aligns shapes from the right and "stretches" dimensions of size 1 (or missing) to match the other array:

| Step | Z shape | b shape | Action |
|---|---|---|---|
| Start | (4, 2) | (2,) | Align from right |
| Pad b | (4, 2) | (1, 2) | Missing dim treated as 1 |
| Stretch | (4, 2) | (4, 2) | Dim of size 1 → repeated 4 times |
| Add | (4, 2) | (4, 2) | Element-wise addition |

No memory is actually copied; NumPy handles the stretching virtually. The result: each row of $\mathbf{Z}$ gets the same bias vector added. This is exactly what we want: the same two biases applied to every data point in the batch.

Broadcasting: Adding Bias to a Batch
🐍broadcasting.py
1import numpy as np

Import NumPy for broadcasting demonstration.

3Comment: Z has shape (4, 2)

Z is the output of our previous matrix multiplication: 4 data points, each processed by 2 neurons. Now we need to add a bias to each neuron’s output.

4Z = np.array([...]) — shape (4, 2)

The pre-bias neuron outputs. Each row is one data point. Each column is one neuron’s response.

EXECUTION STATE
Z (4×2) =
        n0     n1
data0  0.49   0.68
data1 -0.45   1.04
data2 -0.56   0.29
data3  0.90   0.09
Z.shape = (4, 2)
11Comment: Bias vector

Each neuron has its own bias — a constant that shifts its output. The bias lets the neuron fire even when all inputs are zero.

12b = np.array([0.1, -0.2])

Two biases: b[0]=0.1 for neuron 0 (shifts output up), b[1]=-0.2 for neuron 1 (shifts output down). Shape is (2,) — just a 1D vector.

EXECUTION STATE
b = [0.1, -0.2] — one bias per neuron
b.shape = (2,) — does NOT match Z’s (4,2)... so how does Z + b work?
14Comment: Broadcasting rules

NumPy’s broadcasting rule: align shapes from the right. If one dimension is 1 (or missing), ‘stretch’ it to match the other. (4,2) + (2,): the (2,) becomes (1,2) then (4,2) by repeating the row 4 times. No memory is actually copied — NumPy does this virtually.

16Z_biased = Z + b

Broadcasting in action: b with shape (2,) is automatically added to every row of Z (shape 4×2). Each data point gets the same bias applied. This is equivalent to writing Z + np.array([[0.1,-0.2],[0.1,-0.2],[0.1,-0.2],[0.1,-0.2]]) but without actually creating that repeated matrix.

EXECUTION STATE
📚 broadcasting = NumPy’s rule: align shapes right-to-left. Dimensions match if they’re equal or one is 1. Missing dims are treated as 1. (4,2)+(2,) → (4,2)+(1,2) → (4,2)+(4,2) by virtual replication.
→ step-by-step = Z shape: (4, 2) b shape: (2,) → treated as (1, 2) → broadcast to (4, 2) Result: (4, 2)
⬆ result: Z_biased (4×2) =
        n0     n1
data0  0.59   0.48
data1 -0.35   0.84
data2 -0.46   0.09
data3  1.00  -0.11
17print(Z_biased)

The biased output. Compare with Z: neuron 0’s values all increased by 0.1, neuron 1’s all decreased by 0.2.

EXECUTION STATE
Z_biased =
[[ 0.59  0.48]
 [-0.35  0.84]
 [-0.46  0.09]
 [ 1.00 -0.11]]
22Comment: What happened row by row

The bias b=[0.1, -0.2] was added to every single row. This is broadcasting: a (2,) vector is ‘stretched’ to (4,2) by repeating it across the batch dimension. This pattern appears in every neural network layer: Z = X @ W.T + b.

23Row 0 breakdown

For data point 0: [0.49 + 0.1, 0.68 + (-0.2)] = [0.59, 0.48]. Neuron 0 was 0.49, now 0.59 (bias pushed it up). Neuron 1 was 0.68, now 0.48 (bias pulled it down).

EXECUTION STATE
Row 0 = [0.49 + 0.1, 0.68 - 0.2] = [0.59, 0.48]
24Row 1 breakdown

Same bias applied to data point 1: [-0.45 + 0.1, 1.04 - 0.2] = [-0.35, 0.84].

EXECUTION STATE
Row 1 = [-0.45 + 0.1, 1.04 - 0.2] = [-0.35, 0.84]
25Row 2 breakdown

Data point 2: [-0.56 + 0.1, 0.29 - 0.2] = [-0.46, 0.09].

EXECUTION STATE
Row 2 = [-0.56 + 0.1, 0.29 - 0.2] = [-0.46, 0.09]
26Row 3 breakdown

Data point 3: [0.90 + 0.1, 0.09 - 0.2] = [1.00, -0.11]. Notice how the bias shifted neuron 1’s output from positive (0.09) to negative (-0.11).

EXECUTION STATE
Row 3 = [0.90 + 0.1, 0.09 - 0.2] = [1.00, -0.11]
import numpy as np

# After matrix multiply: Z has shape (4, 2)
Z = np.array([
    [0.49,  0.68],
    [-0.45, 1.04],
    [-0.56, 0.29],
    [0.90,  0.09],
])

# Bias vector: one value per neuron, shape (2,)
b = np.array([0.1, -0.2])

# Broadcasting: (4, 2) + (2,) → (4, 2)
# NumPy "stretches" b to match Z's shape
Z_biased = Z + b
print(Z_biased)
# [[ 0.59  0.48]
#  [-0.35  0.84]
#  [-0.46  0.09]
#  [ 1.00 -0.11]]

# What happened? b was broadcast across all 4 rows:
# Row 0: [0.49+0.1, 0.68+(-0.2)] = [0.59, 0.48]
# Row 1: [-0.45+0.1, 1.04+(-0.2)] = [-0.35, 0.84]
# Row 2: [-0.56+0.1, 0.29+(-0.2)] = [-0.46, 0.09]
# Row 3: [0.90+0.1, 0.09+(-0.2)] = [1.00, -0.11]

Broadcasting Rules Summary

  1. Align shapes from the right: (4,2) and (2,) become (4,2) and (1,2)
  2. Dimensions match if they are equal OR one of them is 1
  3. A dimension of size 1 is "stretched" to match the other
  4. If dimensions don't match and neither is 1, it's an error
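
The four rules can be checked directly in NumPy. The sketch below is illustrative only: `b_view`, `bad`, and `per_row` are names introduced here, and the zero stride shown assumes the default float64 layout. It uses `np.broadcast` to ask for the result shape, `np.broadcast_to` to expose the stretched view, and a deliberately incompatible shape to trigger rule 4.

```python
import numpy as np

Z = np.zeros((4, 2))
b = np.array([0.1, -0.2])

# Rules 1-3: (4, 2) and (2,) are compatible and broadcast to (4, 2).
print(np.broadcast(Z, b).shape)        # (4, 2)

# broadcast_to returns a read-only view of b with the stretched shape.
# A stride of 0 along axis 0 means every row reuses the same two numbers.
b_view = np.broadcast_to(b, Z.shape)
print(b_view.shape)                    # (4, 2)
print(b_view.strides[0])               # 0

# Rule 4: (4, 2) and (4,) do not match when aligned from the right (2 vs 4).
bad = np.ones(4)
try:
    Z + bad
except ValueError as err:
    print("ValueError:", err)

# One way to add a per-row vector: give it an explicit trailing axis of size 1.
per_row = Z + bad[:, np.newaxis]       # (4, 2) + (4, 1) -> (4, 2)
print(per_row.shape)                   # (4, 2)
```

The zero stride is what "no memory is copied" means in practice: the view repeats the same underlying values instead of allocating four rows.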

Building a Neuron from Scratch

Now we can put everything together. A single artificial neuron performs three operations in sequence:

  1. Weighted sum: compute the dot product $\mathbf{w} \cdot \mathbf{x}$ of weights and inputs
  2. Add bias: shift the result by a constant $b$, giving $z = \mathbf{w} \cdot \mathbf{x} + b$
  3. Apply activation: pass $z$ through a non-linear function $\sigma$, producing the output $a = \sigma(z)$

This can be written as a single equation: $a = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$. The activation function $\sigma$ is what makes neural networks powerful: without it, stacking layers would just produce another linear function, incapable of learning curves, boundaries, or any non-linear pattern.
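
The claim that stacked layers without an activation collapse into one linear function can be verified numerically. The sketch below uses randomly generated, purely hypothetical weights (`W1`, `b1`, `W2`, `b2`) and relies on the identity $(\mathbf{X}\mathbf{W}_1^T + \mathbf{b}_1)\mathbf{W}_2^T + \mathbf{b}_2 = \mathbf{X}(\mathbf{W}_2\mathbf{W}_1)^T + (\mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))     # batch of 4 inputs, 3 features
W1 = rng.normal(size=(5, 3))    # hypothetical layer 1: 3 inputs -> 5 neurons
b1 = rng.normal(size=5)
W2 = rng.normal(size=(2, 5))    # hypothetical layer 2: 5 inputs -> 2 neurons
b2 = rng.normal(size=2)

# Two stacked layers with no activation in between.
two_layers = (X @ W1.T + b1) @ W2.T + b2

# The same mapping expressed as a single layer with merged parameters.
W_merged = W2 @ W1              # shape (2, 3)
b_merged = W2 @ b1 + b2         # shape (2,)
one_layer = X @ W_merged.T + b_merged

print(np.allclose(two_layers, one_layer))   # True

# Inserting ReLU between the layers breaks this equivalence in general,
# because ReLU zeroes the negative pre-activations before layer 2 sees them.
with_relu = np.maximum(0, X @ W1.T + b1) @ W2.T + b2
print(with_relu.shape)                      # (4, 2)
```

Because the two-layer stack equals a single merged layer, depth alone adds no expressive power; the non-linearity between layers is what does.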

Two common activation functions:

  • ReLU (Rectified Linear Unit): $\text{ReLU}(z) = \max(0, z)$. Lets positive values through and blocks negatives. Simple, fast, and the default choice for hidden layers.
  • Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$. Squashes any value into $(0, 1)$, so the output can be read as a probability. Used in output layers for binary classification.


Implementation: A Single Neuron

A Complete Single Neuron Implementation
🐍single_neuron.py
1import numpy as np

Import NumPy for vectorized computation.

3Comment: Activation functions

Activation functions introduce non-linearity into the network. Without them, stacking multiple layers of weighted sums would be equivalent to a single linear transformation — the network couldn’t learn curves, only straight lines.

4def relu(z):

ReLU (Rectified Linear Unit) is the most widely used activation function in modern neural networks. It’s simple, fast, and works remarkably well in practice.

EXECUTION STATE
⬇ input: z = The pre-activation value (weighted sum + bias). Can be any real number: positive, negative, or zero.
⬆ returns = max(0, z): keeps positive values unchanged, replaces negatives with 0. Mathematically: f(z) = z if z > 0, else 0.
5Docstring: if positive keep, if negative zero

This is the complete intuition: ReLU acts as a gate. It lets positive signals through unchanged and blocks negative signals completely. This sparsity (many neurons outputting 0) is actually beneficial — it creates efficient representations.

6return np.maximum(0, z)

np.maximum(0, z) computes element-wise max of 0 and z. For scalar z: returns z if z>0, else 0. For array z: applies to each element independently — perfect for vectorized computation across a batch.

EXECUTION STATE
📚 np.maximum(a, b) = Element-wise maximum of a and b. np.maximum(0, 3)=3, np.maximum(0, -2)=0. Different from np.max() which finds the single largest value in an array.
→ examples = relu(2.5) = 2.5, relu(0) = 0, relu(-1.3) = 0 relu([1,-2,3,-4]) = [1, 0, 3, 0]
8def sigmoid(z):

Sigmoid squashes any real number into the range (0, 1). Historically the default activation; now mainly used in output layers for binary classification (probability output).

EXECUTION STATE
⬇ input: z = Pre-activation value. Large positive → output near 1.0. Large negative → output near 0.0. Zero → output = 0.5.
⬆ returns = σ(z) = 1/(1+e^(-z)). Always between 0 and 1. Monotonically increasing. S-shaped curve.
9Docstring: squashes to (0, 1)

The sigmoid’s output can be interpreted as a probability. σ(z)=0.9 means ‘this neuron is 90% confident.’ That’s why it’s used in binary classifiers.

10return 1.0 / (1.0 + np.exp(-z))

The sigmoid formula: σ(z) = 1 / (1 + e^(-z)). When z is large positive, e^(-z) ≈ 0, so output ≈ 1/(1+0) = 1. When z is large negative, e^(-z) is huge, so output ≈ 1/huge ≈ 0. When z=0, e^0=1, so output = 1/2 = 0.5.

EXECUTION STATE
np.exp(-z) = e raised to the power of -z. Flipping the sign means: positive z → small exp, negative z → large exp.
→ example: z=0.59 = exp(-0.59) = 0.5543; 1 + 0.5543 = 1.5543; 1 / 1.5543 = 0.6434
12Comment: A single neuron

Now we combine everything: a neuron takes inputs, multiplies by weights, adds bias, and applies an activation function. This is the fundamental computational unit of every neural network.

13def neuron(x, w, b, activation=relu):

A complete single neuron implementation. This function encapsulates the entire forward pass of one neuron: weighted sum → bias → activation. Every neuron in every neural network does exactly this.

EXECUTION STATE
⬇ input: x (3,) = Input vector. [1.0, 0.5, -0.3] — one data point with 3 features.
⬇ input: w (3,) = Weight vector. [0.8, -0.5, 0.2] — learned importance of each feature.
⬇ input: b = Bias scalar. 0.1 — shifts the activation threshold.
⬇ input: activation = The activation function to apply. Default is ReLU. Can pass sigmoid, tanh, or any function f(z) → scalar.
⬆ returns = A single scalar: the neuron’s output after the full computation pipeline.
14Docstring: x, w, b, activation

The docstring documents the four components of a neuron’s computation. Note that x and w must have the same length — each weight corresponds to exactly one input.

19z = np.dot(w, x) + b

Step 1: The pre-activation value. Compute the weighted sum (dot product) and add bias. z = w·x + b = (0.8)(1.0) + (-0.5)(0.5) + (0.2)(-0.3) + 0.1 = 0.49 + 0.1 = 0.59. This z is the neuron’s ‘raw opinion’ before the activation function squashes it.

EXECUTION STATE
np.dot(w, x) = 0.49 — the weighted sum: 0.80 + (-0.25) + (-0.06) = 0.49
+ b = + 0.1 = 0.59 — bias shifts the threshold
z = 0.59 — the pre-activation value
20a = activation(z)

Step 2: Apply the activation function to z. With ReLU: max(0, 0.59) = 0.59 (positive, so unchanged). With sigmoid: 1/(1+e^(-0.59)) = 0.6434. The activation function introduces non-linearity; without it, the network could only learn linear relationships.

EXECUTION STATE
activation(z) = Calls whatever function was passed as the activation parameter. Default is relu.
→ if relu = relu(0.59) = max(0, 0.59) = 0.59
→ if sigmoid = sigmoid(0.59) = 1/(1+e^(-0.59)) = 0.6434
a = The neuron’s final output (activation value)
21return a

Return the neuron’s output. This single number flows to the next layer (or becomes the network’s final prediction if this is the output layer).

EXECUTION STATE
⬆ return: a = 0.59 (ReLU) or 0.6434 (sigmoid), the neuron’s activated output
23Comment: Test with our values

Let’s run the neuron with concrete values to verify everything works as expected.

24x = np.array([1.0, 0.5, -0.3])

The input: one data point with 3 features.

EXECUTION STATE
x = [1.0, 0.5, -0.3]
25w = np.array([0.8, -0.5, 0.2])

The weights: how much this neuron cares about each feature.

EXECUTION STATE
w = [0.8, -0.5, 0.2]
26b = 0.1

The bias: shifts the decision boundary. Without bias, the neuron can only represent hyperplanes through the origin.

EXECUTION STATE
b = 0.1
28Comment: Forward pass with ReLU

The ‘forward pass’ is when data flows through the network from input to output. Here we trace one neuron’s forward pass.

29output_relu = neuron(x, w, b, activation=relu)

Full computation: z = w·x + b = 0.49 + 0.1 = 0.59. Then relu(0.59) = max(0, 0.59) = 0.59. Since z is positive, ReLU passes it through unchanged.

EXECUTION STATE
z = np.dot(w,x) + b = 0.49 + 0.1 = 0.59
relu(0.59) = max(0, 0.59) = 0.59 — positive, so unchanged
output_relu = 0.5900
30print(f"ReLU output: {output_relu:.4f}")

:.4f formats the float to 4 decimal places.

EXECUTION STATE
output = ReLU output: 0.5900
33output_sig = neuron(x, w, b, activation=sigmoid)

Same neuron, same inputs, but using sigmoid activation. z = 0.59 (same weighted sum). sigmoid(0.59) = 1/(1+e^(-0.59)) = 1/(1+0.5543) = 1/1.5543 = 0.6434. The sigmoid compresses 0.59 to 0.6434, slightly above 0.5, indicating mild positive confidence.

EXECUTION STATE
z = np.dot(w,x) + b = 0.59 (same as before; the activation doesn’t change z)
sigmoid(0.59) = 1/(1+e^(-0.59)) = 1/1.5543 = 0.6434
output_sig = 0.6434
34print(f"Sigmoid output: {output_sig:.4f}")

The sigmoid output 0.6434 can be interpreted as a 64.34% probability. The same pre-activation (0.59) produces different outputs depending on the activation function.

EXECUTION STATE
output = Sigmoid output: 0.6434
import numpy as np

# ── Activation functions ──
def relu(z):
    """ReLU: if positive keep it, if negative make it 0."""
    return np.maximum(0, z)

def sigmoid(z):
    """Sigmoid: squashes any value to range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# ── A single neuron ──
def neuron(x, w, b, activation=relu):
    """
    x: input vector (n,)
    w: weight vector (n,), same size as x
    b: bias scalar
    activation: function to apply after weighted sum
    """
    z = np.dot(w, x) + b       # Step 1: weighted sum + bias
    a = activation(z)           # Step 2: apply activation
    return a

# Test with our values
x = np.array([1.0, 0.5, -0.3])
w = np.array([0.8, -0.5, 0.2])
b = 0.1

# Forward pass with ReLU
output_relu = neuron(x, w, b, activation=relu)
print(f"ReLU output: {output_relu:.4f}")
# ReLU output: 0.5900

# Forward pass with Sigmoid
output_sig = neuron(x, w, b, activation=sigmoid)
print(f"Sigmoid output: {output_sig:.4f}")
# Sigmoid output: 0.6434
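
One practical caveat with the sigmoid formula as written: for very negative z, `np.exp(-z)` overflows float64 and NumPy emits a RuntimeWarning, even though the final result (0.0) is still correct. A possible workaround, shown here as a sketch with a hypothetical helper name `sigmoid_stable`, is to exponentiate only non-positive numbers and pick the algebraically equivalent form for each sign.

```python
import numpy as np

def sigmoid_stable(z):
    """Sigmoid that only ever exponentiates non-positive numbers.

    Hypothetical helper for illustration; not part of the section's code.
    """
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))          # always in (0, 1], so it cannot overflow
    # z >= 0: 1 / (1 + e^-z);  z < 0: e^z / (1 + e^z). Both equal sigmoid(z).
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(f"{float(sigmoid_stable(0.59)):.4f}")           # 0.6434
print(sigmoid_stable(np.array([-1000.0, 1000.0])))    # [0. 1.]
```

For the moderate values used in this section the two versions give the same answer, so the simpler formula is fine for learning purposes.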

Scaling Up: From One Neuron to a Full Layer

A single neuron takes a vector and produces a scalar. A layer of $m$ neurons takes the same vector and produces $m$ scalars, one from each neuron. When we process a batch of $n$ data points through $m$ neurons, we get an $n \times m$ output matrix. Matrix multiplication handles all of this in one operation:

$$\mathbf{A} = \sigma(\mathbf{X} \mathbf{W}^T + \mathbf{b})$$

where $\mathbf{X}$ is $(n \times d)$, $\mathbf{W}$ is $(m \times d)$, $\mathbf{b}$ is $(m,)$ (broadcast across the $n$ rows), and $\sigma$ is applied element-wise.

A Complete Neural Network Layer
🐍layer_forward.py
1import numpy as np

Import NumPy for the complete layer implementation.

3def relu(Z):

ReLU activation that works on entire matrices. Since np.maximum is element-wise, it handles any shape — scalar, vector, or matrix.

EXECUTION STATE
⬇ input: Z = Can be any shape. For a full layer: shape (batch_size, num_neurons).
4return np.maximum(0, Z)

Element-wise ReLU on the entire matrix at once. Every negative value becomes 0; every positive value stays unchanged.

6def layer_forward(X, W, b):

The complete forward pass of one neural network layer. This single function does what every layer in every neural network does: linear transform followed by non-linear activation. Z = XWᵀ + b, then A = relu(Z).

EXECUTION STATE
⬇ input: X (4×3) =
  feat0  feat1  feat2
   1.0    0.5   -0.3
   0.2    1.5    0.7
  -0.5    0.8    1.2
   0.9   -0.2    0.4
⬇ input: W (2×3) =
         feat0  feat1  feat2
neuron0   0.8   -0.5    0.2
neuron1   0.3    0.7   -0.1
⬇ input: b (2,) = [0.1, -0.2] — one bias per neuron
⬆ returns = A (4×2): ReLU-activated output. Same shape as Z.
14Z = X @ W.T + b

The linear transformation: matrix multiply X (4×3) by W transposed (3×2) to get (4×2), then broadcast-add bias b (2,) to every row. This single line replaces 8 dot products (4 data points × 2 neurons) and 8 bias additions.

EXECUTION STATE
X @ W.T =
         n0     n1
data0   0.49   0.68
data1  -0.45   1.04
data2  -0.56   0.29
data3   0.90   0.09
+ b (broadcast) = Add [0.1, -0.2] to every row
Z (4×2) =
         n0     n1
data0   0.59   0.48
data1  -0.35   0.84
data2  -0.46   0.09
data3   1.00  -0.11
15A = relu(Z)

Apply ReLU to every element of Z. Negative values become 0 (the neuron ‘doesn’t fire’). Positive values pass through unchanged. Notice that Z[1,0]=-0.35, Z[2,0]=-0.46, and Z[3,1]=-0.11 all become 0.

EXECUTION STATE
A (4×2) =
         n0     n1
data0   0.59   0.48
data1   0.00   0.84  ← neuron 0 killed by ReLU
data2   0.00   0.09  ← neuron 0 killed by ReLU
data3   1.00   0.00  ← neuron 1 killed by ReLU
→ sparsity = 3 out of 8 values are zero (37.5% sparsity). In deep networks a large share of ReLU outputs is often zero, and this is usually treated as a feature, not a bug.
16return A

Return the activated output. This A becomes the input X for the next layer — that’s how neural networks stack layers: output of layer 1 feeds into layer 2, and so on.

EXECUTION STATE
⬆ return: A (4×2) = The layer’s output. Shape (batch_size, num_neurons). Ready to feed into the next layer.
18Comment: 4 data points, 3 features

Setting up concrete values to trace through the layer.

19X = np.array([...]) — shape (4, 3)

Same input batch as before.

EXECUTION STATE
X.shape = (4, 3) — 4 data points, 3 features each
26W = np.array([...]) — shape (2, 3)

Same weight matrix: 2 neurons, each with 3 weights.

EXECUTION STATE
W.shape = (2, 3) — 2 neurons, 3 weights each
30b = np.array([0.1, -0.2])

Bias vector: one bias per neuron.

EXECUTION STATE
b = [0.1, -0.2]
32Comment: One line processes entire batch

The beauty of vectorized computation: one function call processes all 4 data points through all 2 neurons. No loops needed.

33A = layer_forward(X, W, b)

The complete forward pass: X(4×3) → Z=XWᵀ+b(4×2) → A=relu(Z)(4×2). Fully connected layers of any size follow this same pattern; only the shapes and the choice of activation change.

EXECUTION STATE
Step 1: X @ W.T = (4×3) @ (3×2) = (4×2) — all dot products
Step 2: + b = Broadcast [0.1,-0.2] to every row
Step 3: relu() = max(0, each element)
A (4×2) =
[[0.59, 0.48],
 [0.00, 0.84],
 [0.00, 0.09],
 [1.00, 0.00]]
34print("Activations...")

Display the final activations.

35print(A)

The final output of our layer. 4 data points, each with 2 neuron outputs. This is the complete result of one neural network layer processing a batch of data.

EXECUTION STATE
A =
[[0.59  0.48]
 [0.00  0.84]
 [0.00  0.09]
 [1.00  0.00]]
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def layer_forward(X, W, b):
    """
    Full forward pass of one neural network layer.
    X: input data (batch_size, input_dim)
    W: weights    (num_neurons, input_dim)
    b: biases     (num_neurons,)
    Returns: activations (batch_size, num_neurons)
    """
    Z = X @ W.T + b    # Linear transform + broadcast bias
    A = relu(Z)         # Apply activation element-wise
    return A

# 4 data points, 3 features each
X = np.array([
    [1.0, 0.5, -0.3],
    [0.2, 1.5,  0.7],
    [-0.5, 0.8, 1.2],
    [0.9, -0.2, 0.4],
])

# 2 neurons, 3 weights each
W = np.array([
    [0.8, -0.5, 0.2],
    [0.3,  0.7, -0.1],
])
b = np.array([0.1, -0.2])

# One line: process entire batch through the layer
A = layer_forward(X, W, b)
print("Activations (4 data points × 2 neurons):")
print(A)
# [[0.59  0.48]
#  [0.00  0.84]
#  [0.00  0.09]
#  [1.00  0.00]]
The Key Insight: A fully connected neural network layer is three operations: matrix multiplication (all dot products), broadcasting addition (all biases), and element-wise activation (all non-linearities). That's it. The same three operations appear as building blocks in networks of every size, from a simple classifier to GPT-4.
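
Since a layer's output has the same "rows are data points" layout as its input, layers can be chained by feeding one layer's activations into the next. The sketch below reuses this section's X, W, and b for the first layer and adds a second layer whose parameters (`W2`, `b2`) are invented purely for illustration.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def layer_forward(X, W, b):
    return relu(X @ W.T + b)

X = np.array([
    [1.0, 0.5, -0.3],
    [0.2, 1.5,  0.7],
    [-0.5, 0.8, 1.2],
    [0.9, -0.2, 0.4],
])
W1 = np.array([
    [0.8, -0.5, 0.2],
    [0.3,  0.7, -0.1],
])
b1 = np.array([0.1, -0.2])

# Hypothetical second layer: 2 inputs -> 1 neuron (values chosen arbitrarily).
W2 = np.array([[1.0, -1.0]])
b2 = np.array([0.0])

A1 = layer_forward(X, W1, b1)    # (4, 3) -> (4, 2)
A2 = layer_forward(A1, W2, b2)   # (4, 2) -> (4, 1)

print(A1.shape, A2.shape)        # (4, 2) (4, 1)
```

The only constraint when chaining is that the second layer's input_dim equals the first layer's num_neurons; here both are 2.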

Summary and Bridge to PyTorch

In this section, we built a neural network layer from first principles using Python and NumPy. Here's what we covered:

| Concept | NumPy Operation | Role in Neural Networks |
|---|---|---|
| Vectors & matrices | np.array() | Represent data and weights |
| Vectorized operations | +, *, np.exp() | Element-wise computation without loops |
| Dot product | np.dot() or @ | One neuron's weighted sum |
| Matrix multiplication | X @ W.T | All neurons processing all data at once |
| Broadcasting | Z + b | Adding bias across the batch |
| Activation functions | np.maximum(0, z) | Introducing non-linearity |

The complete forward pass of one layer is a single line:

$$\mathbf{A} = \sigma(\mathbf{X}\mathbf{W}^T + \mathbf{b})$$

In the next section, we'll see how PyTorch provides the same operations with automatic differentiation — the ability to automatically compute gradients for learning. Everything we built here in NumPy has a direct PyTorch equivalent, but PyTorch adds the critical ingredient: it remembers how each output was computed so it can calculate how to adjust the weights to reduce error. That's the foundation of training.

Practice Exercise

Try modifying the layer code: change the weights, add a third neuron, or process a larger batch. At each step, predict the output shapes before running the code. Shape prediction is the most valuable skill in neural network debugging.
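
As one possible starting point for the exercise, the sketch below writes the predicted shape down first and then checks it against what NumPy produces. The batch size of 10 and the 5 neurons are arbitrary choices for illustration. It also shows the error you get when the transpose is forgotten.

```python
import numpy as np

X = np.zeros((10, 3))   # hypothetical larger batch: 10 data points, 3 features
W = np.zeros((5, 3))    # hypothetical wider layer: 5 neurons, 3 weights each
b = np.zeros(5)

# Prediction made before running anything: (batch_size, num_neurons).
predicted = (X.shape[0], W.shape[0])

Z = X @ W.T + b
print(Z.shape == predicted)      # True

# Forgetting the transpose: (10, 3) @ (5, 3) has mismatched inner dimensions.
try:
    X @ W
except ValueError as err:
    print("shape mismatch:", err)
```

Note that a square weight matrix (for example 3 neurons with 3 inputs) would not raise an error without the transpose, which makes that mistake harder to spot.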