Chapter 3
15 min read
Section 10 of 65

Probability Basics

Mathematics for Neural Networks

Why Probability Matters for Neural Networks

In the previous sections, we learned how vectors and matrices transform data (Section 1) and how derivatives measure change (Section 2). Now we add the third essential ingredient:probability \u2014 the mathematics of uncertainty.

Neural networks are, at their core, probability machines. When a classifier looks at an image and outputs \u201C85% cat, 10% dog, 5% bird,\u201D it is producing a probability distribution. When we train the network, we measure how far its predicted distribution is from the true answer using a probability-based metric calledcross-entropy. When we initialize weights, we draw them from aGaussian distribution. Probability is woven into every aspect of deep learning.

The Big Picture: This section builds a straight path from basic probability concepts to the cross-entropy loss function \u2014 the single most important equation in classification neural networks. By the end, you will understand not just how to compute the loss, but why it takes the form it does, rooted in information theory and maximum likelihood estimation.

Here is how probability concepts connect to neural networks:

Probability ConceptNeural Network Application
Probability distributionsModel outputs (softmax layer)
Expected valueLoss = expected error over training data
Gaussian distributionWeight initialization, batch normalization
Bayes’ theoremClassification: P(class | input)
Softmax functionConverting logits to probabilities
Cross-entropyTHE loss function for classification
Maximum likelihoodWhy cross-entropy is the right loss

Random Variables and Probability Distributions

A random variable is a variable whose value is determined by a random process. It maps each outcome of a random experiment to a number. For example, when you roll a die, the random variable XX takes the value of whichever face lands up: 1, 2, 3, 4, 5, or 6.

A probability distribution describes how likely each value of the random variable is. It assigns a probability P(X=k)P(X = k) to each possible outcome kk. Two rules must always hold:

  1. Every probability is between 0 and 1: 0P(X=k)10 \leq P(X = k) \leq 1
  2. All probabilities sum to 1: kP(X=k)=1\sum_k P(X = k) = 1 (something must happen)

Discrete vs. Continuous

Random variables come in two flavors:

  • Discrete: Takes a finite or countable set of values. Examples: die roll (1\u20136), coin flip (0 or 1), a classifier's output class (cat, dog, bird). The distribution is described by a probability mass function (PMF): P(X=k)P(X = k) gives the probability of each specific value.
  • Continuous: Takes any value in a range. Examples: a person's height, a neural network weight, the output of a regression model. The distribution is described by a probability density function (PDF): f(x)f(x). The probability of falling in an interval [a,b][a, b] is the area under the curve: P(aXb)=abf(x)dxP(a \leq X \leq b) = \int_a^b f(x)\,dx.
Why This Matters: A neural network classifier outputs a discrete probability distribution: one probability per class. This is a PMF \u2014 and softmax ensures it sums to 1. Neural network weights, on the other hand, are continuous random variables during initialization, drawn from a Gaussian PDF.

Expected Value and Variance

Two numbers summarize the most important properties of any distribution: where it's centered (expected value) and how spread out it is (variance).

Expected Value (Mean)

The expected value E[X]E[X] is the probability-weighted average of all possible outcomes. It answers: \u201CIf I repeated this experiment infinitely many times, what would the average outcome be?\u201D

For a discrete random variable:

E[X]=kkP(X=k)E[X] = \sum_k k \cdot P(X = k)

For example, a fair die: E[X]=116+216+316+416+516+616=3.5E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5. Note that 3.5 is not a value the die can actually show \u2014 the expected value doesn't have to be a possible outcome.

Variance

Variance Var[X]\text{Var}[X] measures how spread out the distribution is around the mean. It is the expected value of the squared deviations from the mean:

Var[X]=E[(Xμ)2]=k(kμ)2P(X=k)\text{Var}[X] = E[(X - \mu)^2] = \sum_k (k - \mu)^2 \cdot P(X = k)

The standard deviation σ=Var[X]\sigma = \sqrt{\text{Var}[X]} is the square root of the variance and has the same units as XX, making it more interpretable: \u201Ctypically, outcomes are about σ\sigma away from the mean.\u201D

Neural Network Connection: When training a neural network, the loss function computes the expected loss over the training data:L=1Ni=1N(f(xi),yi)L = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i), y_i) \u2014 this is a Monte Carlo estimate of E[]E[\ell]. Variance matters too: high-variance gradients (noisy) slow training. Batch normalization reduces internal variance, and techniques like gradient clipping control extreme gradient values.

Let's compute expected value and variance in Python:

Expected Value and Variance — Python
🐍expected_value.py
1import numpy as np

NumPy provides fast array operations for computing expected values, variances, and simulating random experiments. We use np.array for probability vectors, np.sum for weighted sums, and np.random for sampling.

EXECUTION STATE
📚 numpy = Numerical computing library. Key functions used here: np.array (create arrays), np.sum (weighted sums), np.sqrt (square root), np.random.randint (random sampling), np.mean/np.var (sample statistics).
3# A discrete random variable: fair 6-sided die

A random variable is a function that assigns a number to each outcome of a random experiment. A die roll is the simplest example: the random variable X takes values 1 through 6, each with equal probability 1/6. In neural networks, the output of a classification layer is a discrete random variable — it assigns a probability to each possible class.

4outcomes = np.array([1, 2, 3, 4, 5, 6])

The possible values our random variable X can take. For a fair die, these are the integers 1 through 6. In neural network terms, think of these as class labels — a 6-class classifier has 6 possible outcomes.

EXECUTION STATE
outcomes = [1, 2, 3, 4, 5, 6] — the sample space of a die roll. Each element is one possible realization of the random variable X.
📚 np.array() = Creates a NumPy ndarray from a Python list. ndarray supports vectorized math: outcomes * probabilities multiplies element-by-element without a loop.
5probabilities = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])

The probability mass function (PMF): P(X = k) for each outcome k. For a fair die, each face has equal probability 1/6 ≈ 0.1667. Two rules must hold: (1) every probability is between 0 and 1, (2) all probabilities sum to exactly 1.0.

EXECUTION STATE
probabilities = [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667] — uniform distribution: all outcomes equally likely.
PMF rule 1 = 0 ≤ P(X=k) ≤ 1 for every k. Probabilities cannot be negative or exceed 1.
PMF rule 2 = Σ P(X=k) = 1.0 — exactly one outcome must occur. Here: 6 × (1/6) = 1.0 ✓
7print("Outcomes:", outcomes)

Displays the array of possible outcomes.

EXECUTION STATE
output = Outcomes: [1 2 3 4 5 6]
8print("Probabilities:", np.round(probabilities, 4))

Displays probabilities rounded to 4 decimal places for readability.

EXECUTION STATE
📚 np.round(array, decimals) = Rounds each element of the array to the specified number of decimal places. np.round(1/6, 4) = 0.1667.
output = Probabilities: [0.1667 0.1667 0.1667 0.1667 0.1667 0.1667]
9print("Sum of probabilities:", probabilities.sum())

Verifies that probabilities sum to 1.0 — a fundamental requirement of any valid probability distribution.

EXECUTION STATE
probabilities.sum() = 0.1667 + 0.1667 + 0.1667 + 0.1667 + 0.1667 + 0.1667 = 1.0000
output = Sum of probabilities: 1.0
11# Expected value: E[X] = sum(x * P(x))

The expected value is the probability-weighted average of all outcomes. It answers: “if I repeated this experiment infinitely many times, what would the average outcome be?” For neural networks, the loss function computes the expected loss over the training data.

12expected_value = np.sum(outcomes * probabilities)

Computes E[X] = Σ x·P(x) using element-wise multiplication then summation. First, outcomes * probabilities multiplies each value by its probability. Then np.sum adds all the weighted values together.

EXECUTION STATE
outcomes * probabilities = [1×0.1667, 2×0.1667, 3×0.1667, 4×0.1667, 5×0.1667, 6×0.1667] = [0.1667, 0.3333, 0.5000, 0.6667, 0.8333, 1.0000]
📚 np.sum() = Sums all elements of an array. np.sum([0.1667, 0.3333, 0.5, 0.6667, 0.8333, 1.0]) = 3.5
expected_value = 3.5000 = The average outcome of a fair die is 3.5. Note: 3.5 is not a value the die can actually show — the expected value doesn’t have to be a possible outcome.
13print(f"E[X] = {expected_value:.4f}")

Prints the expected value.

EXECUTION STATE
output = E[X] = 3.5000
15# Variance: Var[X] = E[(X - mu)^2]

Variance measures how spread out the distribution is around the mean. It answers: “on average, how far is each outcome from the expected value?” Low variance means values cluster tightly around the mean; high variance means they’re widely dispersed.

16variance = np.sum((outcomes - expected_value)**2 * probabilities)

Computes Var[X] = Σ (x - μ)² · P(x). For each outcome: (1) compute deviation from mean, (2) square it, (3) weight by probability, (4) sum all terms.

EXECUTION STATE
outcomes - expected_value = [1-3.5, 2-3.5, 3-3.5, 4-3.5, 5-3.5, 6-3.5] = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
(outcomes - expected_value)**2 = [6.25, 2.25, 0.25, 0.25, 2.25, 6.25] — squaring makes all values positive
... * probabilities = [6.25/6, 2.25/6, 0.25/6, 0.25/6, 2.25/6, 6.25/6] = [1.0417, 0.3750, 0.0417, 0.0417, 0.3750, 1.0417]
variance = 2.9167 = Sum of weighted squared deviations. A variance of 2.9167 means outcomes typically deviate from the mean by about √2.9167 ≈ 1.71.
17print(f"Var[X] = {variance:.4f}")

Prints the variance.

EXECUTION STATE
output = Var[X] = 2.9167
19# Standard deviation: sqrt(Var[X])

Standard deviation is the square root of variance. It has the same units as the original data (unlike variance, which is in squared units), making it more interpretable.

20std_dev = np.sqrt(variance)

Computes σ = √Var[X]. The standard deviation tells us the “typical” deviation from the mean in the same units as X.

EXECUTION STATE
📚 np.sqrt() = Element-wise square root. np.sqrt(2.9167) = 1.7078.
std_dev = 1.7078 = On average, a die roll deviates from 3.5 by about 1.71 units. About 68% of outcomes fall within μ ± σ = [1.79, 5.21].
21print(f"Std[X] = {std_dev:.4f}")

Prints the standard deviation.

EXECUTION STATE
output = Std[X] = 1.7078
23# Simulate: roll the die 10,000 times

The law of large numbers says that as we take more samples, the sample mean converges to the true expected value. Let’s verify this by simulating 10,000 die rolls.

24np.random.seed(42)

Sets the random number generator seed for reproducibility. With seed 42, every run produces the same sequence of “random” numbers.

EXECUTION STATE
📚 np.random.seed(n) = Initializes the pseudo-random number generator to a known state. Same seed = same sequence every time. Essential for reproducible experiments.
seed = 42 = A commonly used seed value in ML research (a nod to The Hitchhiker’s Guide to the Galaxy).
25rolls = np.random.randint(1, 7, size=10000)

Generates 10,000 random integers between 1 (inclusive) and 7 (exclusive), simulating 10,000 die rolls.

EXECUTION STATE
📚 np.random.randint(low, high, size) = Generates random integers from low (inclusive) to high (exclusive). randint(1, 7) gives values in {1, 2, 3, 4, 5, 6}.
⬇ arg: low = 1 = Minimum value (inclusive). Die starts at 1.
⬇ arg: high = 7 = Maximum value (exclusive). We want up to 6, so high=7.
⬇ arg: size = 10000 = Number of samples to generate. More samples = closer to theoretical values.
rolls = Array of 10,000 integers, each between 1 and 6. Example first 10: [4, 1, 1, 3, 6, 4, 3, 3, 6, 5]
26print(f"Simulated E[X] = {np.mean(rolls):.4f}")

The sample mean of 10,000 rolls should be very close to the theoretical E[X] = 3.5.

EXECUTION STATE
📚 np.mean() = Computes the arithmetic mean (average) of an array. For n values: mean = sum / n.
output = Simulated E[X] ≈ 3.5000 — very close to the theoretical value of 3.5!
27print(f"Simulated Var[X] = {np.var(rolls):.4f}")

The sample variance should approximate the theoretical Var[X] = 2.9167.

EXECUTION STATE
📚 np.var() = Computes variance: mean of squared deviations from the mean. np.var(x) = np.mean((x - np.mean(x))**2).
output = Simulated Var[X] ≈ 2.9167 — matches the theoretical value closely.
6 lines without explanation
1import numpy as np
2
3# A discrete random variable: fair 6-sided die
4outcomes = np.array([1, 2, 3, 4, 5, 6])
5probabilities = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
6
7print("Outcomes:", outcomes)
8print("Probabilities:", np.round(probabilities, 4))
9print("Sum of probabilities:", probabilities.sum())
10
11# Expected value: E[X] = sum(x * P(x))
12expected_value = np.sum(outcomes * probabilities)
13print(f"\nE[X] = {expected_value:.4f}")
14
15# Variance: Var[X] = E[(X - mu)^2]
16variance = np.sum((outcomes - expected_value)**2 * probabilities)
17print(f"Var[X] = {variance:.4f}")
18
19# Standard deviation: sqrt(Var[X])
20std_dev = np.sqrt(variance)
21print(f"Std[X] = {std_dev:.4f}")
22
23# Simulate: roll the die 10,000 times
24np.random.seed(42)
25rolls = np.random.randint(1, 7, size=10000)
26print(f"\nSimulated E[X] = {np.mean(rolls):.4f}")
27print(f"Simulated Var[X] = {np.var(rolls):.4f}")

The Gaussian Distribution

The Gaussian distribution (also called the normal distribution) is the most important continuous distribution in all of machine learning. Its probability density function is the famous \u201Cbell curve\u201D:

f(x)=1σ2πexp ⁣((xμ)22σ2)f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

This formula has two parameters:

  • μ\mu (mu): The mean \u2014 the center of the bell curve. The distribution is perfectly symmetric around μ\mu.
  • σ\sigma (sigma): The standard deviation \u2014 controls the width. Small σ\sigma means a narrow, peaked bell; large σ\sigma means a wide, flat bell.

Why the Gaussian is Everywhere

The Gaussian distribution appears throughout neural networks for deep mathematical reasons:

  1. Central Limit Theorem: The sum of many independent random variables converges to a Gaussian, regardless of their individual distributions. Since a neuron computes z=iwixi+bz = \sum_i w_i x_i + b (a sum of many terms), its pre-activation output tends to be approximately Gaussian.
  2. Weight Initialization: We initialize neural network weights by drawing from a Gaussian distribution. Xavier initialization uses N(0,1/n)\mathcal{N}(0, 1/n) and He initialization uses N(0,2/n)\mathcal{N}(0, 2/n) where nn is the number of input neurons.
  3. Batch Normalization: Standardizes activations to have μ=0\mu = 0 and σ=1\sigma = 1, effectively making them follow a standard Gaussian. This stabilizes training dramatically.
  4. Regression Loss: The mean squared error (MSE) loss is equivalent to maximum likelihood estimation under the assumption that errors are Gaussian-distributed.

The 68-95-99.7 Rule

A powerful rule of thumb for any Gaussian distribution: approximately 68% of values fall within ±1σ\pm 1\sigma of the mean, 95% within ±2σ\pm 2\sigma, and 99.7% within ±3σ\pm 3\sigma. Anything beyond 3σ3\sigma is extremely rare.

Explore the Gaussian distribution interactively by adjusting the mean and standard deviation:

Loading Gaussian distribution visualization...

Now let's implement the Gaussian PDF from scratch and verify these properties:

Gaussian Distribution — Python
🐍gaussian_distribution.py
1import numpy as np

NumPy provides np.sqrt, np.pi, np.exp for the PDF formula, and np.random.normal for sampling from the Gaussian distribution.

EXECUTION STATE
📚 numpy = We use np.sqrt(2π) for the normalization constant, np.exp() for the exponential term, and np.random.normal() to draw samples.
3# The Gaussian (Normal) PDF

The Gaussian distribution is the most important continuous distribution in machine learning. It appears everywhere: weight initialization (Xavier/He), batch normalization (standardizing to zero mean, unit variance), the central limit theorem (sums of random variables become Gaussian), and the loss function for regression (MSE assumes Gaussian errors).

4def gaussian_pdf(x, mu, sigma) — Probability Density Function

Implements the Gaussian PDF formula. The PDF gives the “probability density” at each point — the height of the bell curve. Higher density means the random variable is more likely to take values near that point.

EXECUTION STATE
⬇ input: x = The point at which to evaluate the PDF. Can be any real number from -∞ to +∞.
⬇ input: mu (μ) = The mean — the center of the bell curve. The distribution is symmetric around μ. In neural networks, this is the average value of a neuron’s activation.
⬇ input: sigma (σ) = The standard deviation — controls the width of the bell curve. Small σ = narrow/peaked, large σ = wide/flat. In batch norm, we standardize to σ=1.
⬆ returns = The probability density at x. Note: this is NOT a probability (it can exceed 1.0 for narrow distributions). The probability of x falling in an interval [a,b] is the area under the curve from a to b.
5coeff = 1 / (sigma * np.sqrt(2 * np.pi))

The normalization constant ensures the total area under the bell curve equals 1.0. Without this, the curve would have the right shape but the wrong total probability.

EXECUTION STATE
📚 np.sqrt() = Square root function. np.sqrt(2π) = np.sqrt(6.2832) = 2.5066.
📚 np.pi = The mathematical constant π = 3.141592653589793.
2 * np.pi = 6.2832 — needed in the denominator of the Gaussian normalization.
sigma * np.sqrt(2*np.pi) = For σ=1: 1.0 × 2.5066 = 2.5066. Wider distributions (σ larger) have smaller peaks.
coeff = 0.3989 = For standard normal (σ=1): 1/2.5066 = 0.3989. This is the maximum height of the bell curve (at x=μ).
6exponent = -((x - mu)**2) / (2 * sigma**2)

The exponent controls the bell shape. It’s always ≤ 0 (because of the minus sign), equals 0 when x=μ (peak of the curve), and becomes more negative as x moves away from μ.

EXECUTION STATE
(x - mu) = Distance from the mean. At x=μ: deviation = 0. At x=μ+σ: deviation = σ.
(x - mu)**2 = Squared distance — always positive, symmetric around μ. Points equally far left and right have the same density.
2 * sigma**2 = For σ=1: 2×1² = 2. This controls the width: larger σ makes the exponent shrink slower, producing a wider bell.
exponent at x=0 (μ=0, σ=1) = -0²/2 = 0.0 → e⁰ = 1.0 (peak)
exponent at x=1 = -1²/2 = -0.5 → e^(-0.5) = 0.6065
exponent at x=2 = -4/2 = -2.0 → e^(-2) = 0.1353
7return coeff * np.exp(exponent)

Multiplies the normalization constant by the exponential term to get the final PDF value. The exp() function produces the bell shape: maximum at x=μ, falling off symmetrically.

EXECUTION STATE
📚 np.exp() = Computes e^x for each element. e ≈ 2.71828. np.exp(0)=1.0, np.exp(-0.5)=0.6065, np.exp(-2)=0.1353.
⬆ return at x=0 = 0.3989 × e⁰ = 0.3989 × 1.0 = 0.398942 (the peak)
⬆ return at x=1 = 0.3989 × e^(-0.5) = 0.3989 × 0.6065 = 0.241971
⬆ return at x=2 = 0.3989 × e^(-2) = 0.3989 × 0.1353 = 0.053991
9# Standard normal: mu=0, sigma=1

The standard normal distribution has mean 0 and standard deviation 1. It’s the “default” Gaussian that serves as a reference. Any Gaussian can be converted to standard normal by subtracting the mean and dividing by the standard deviation: z = (x - μ) / σ.

10mu, sigma = 0.0, 1.0

Python tuple unpacking assigns μ=0 and σ=1 simultaneously. These are the standard normal parameters.

EXECUTION STATE
mu = 0.0 = Mean = 0. The bell curve is centered at the origin.
sigma = 1.0 = Standard deviation = 1. About 68% of values fall between -1 and +1.
12# Evaluate PDF at specific points

We compute the probability density at five symmetric points around the mean to see the bell shape numerically.

13print("Standard Normal PDF values:")

Header for the PDF value table.

EXECUTION STATE
output = Standard Normal PDF values:
14for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:

Evaluate the PDF at -2σ, -1σ, 0 (mean), +1σ, and +2σ. The symmetry of the Gaussian means P(-k) = P(+k) for any k.

LOOP TRACE · 5 iterations
x = -2.0
gaussian_pdf(-2.0, 0, 1) = 0.053991 — far from mean, low density
x = -1.0
gaussian_pdf(-1.0, 0, 1) = 0.241971 — one std dev below mean
x = 0.0
gaussian_pdf(0.0, 0, 1) = 0.398942 — the peak (at the mean)
x = 1.0
gaussian_pdf(1.0, 0, 1) = 0.241971 — same as x=-1.0 (symmetry!)
x = 2.0
gaussian_pdf(2.0, 0, 1) = 0.053991 — same as x=-2.0 (symmetry!)
15p = gaussian_pdf(x, mu, sigma)

Calls our PDF function with the current x value and the standard normal parameters.

EXECUTION STATE
p = The probability density at the current x. Highest at x=0 (0.3989), lowest at x=±2 (0.0540).
16print(f" P(x={x:5.1f}) = {p:.6f}")

Prints each (x, density) pair with formatted alignment.

EXECUTION STATE
output (all iterations) =
  P(x= -2.0) = 0.053991
  P(x= -1.0) = 0.241971
  P(x=  0.0) = 0.398942
  P(x=  1.0) = 0.241971
  P(x=  2.0) = 0.053991
18# The 68-95-99.7 rule

The 68-95-99.7 rule (the empirical rule) states that for any Gaussian distribution, approximately 68% of values fall within ±1σ, 95% within ±2σ, and 99.7% within ±3σ of the mean. This is fundamental to understanding confidence intervals, weight initialization bounds, and anomaly detection.

19print("68-95-99.7 Rule:")

Header for the empirical rule display.

EXECUTION STATE
output = 68-95-99.7 Rule:
20print(f" P(-1 < X < 1) = 68.27%")

About 68% of all values from a standard normal distribution fall between -1 and +1.

EXECUTION STATE
±1σ interval = 68.27% of data lies within one standard deviation of the mean. For weight initialization, this means most initial weights will be close to zero.
21print(f" P(-2 < X < 2) = 95.45%")

About 95% of values fall within 2 standard deviations.

EXECUTION STATE
±2σ interval = 95.45% of data. Values outside this range are rare — only ~5% chance.
22print(f" P(-3 < X < 3) = 99.73%")

Almost all values (99.7%) fall within 3 standard deviations. Anything beyond ±3σ is an extreme outlier.

EXECUTION STATE
±3σ interval = 99.73% of data. Only 0.27% of values exceed 3 standard deviations. In anomaly detection, points beyond 3σ are often flagged as outliers.
24# Generate samples

Draw actual samples from the standard normal distribution to verify our theoretical predictions.

25np.random.seed(42)

Same seed as before for reproducibility.

EXECUTION STATE
seed = 42 = Ensures reproducible random samples.
26samples = np.random.normal(mu, sigma, size=5)

Draws 5 samples from the Gaussian distribution N(μ=0, σ=1). Each sample is an independent random number from the bell curve.

EXECUTION STATE
📚 np.random.normal(mu, sigma, size) = Generates random numbers from a Gaussian distribution. Parameters: mu=center, sigma=spread, size=how many. In neural networks, this is used for weight initialization.
⬇ arg: mu = 0.0 = Center of the distribution. Samples will be scattered around 0.
⬇ arg: sigma = 1.0 = Width of the distribution. Most samples will be between -2 and +2.
⬇ arg: size = 5 = Number of independent samples to draw.
samples = [0.4967, -0.1383, 0.6477, 1.5230, -0.2342] — all close to 0, one outlier (1.52 at ~1.5σ)
27print(f"5 samples: {np.round(samples, 4)}")

Displays the 5 drawn samples.

EXECUTION STATE
output = 5 samples: [ 0.4967 -0.1383 0.6477 1.523 -0.2342]
28print(f"Sample mean: {np.mean(samples):.4f}")

The sample mean of 5 samples. With so few samples, it won’t perfectly match μ=0, but more samples would converge.

EXECUTION STATE
output = Sample mean: 0.4588 — not exactly 0, but only 5 samples. With 10,000 samples, this would be ~0.00.
29print(f"Sample std: {np.std(samples):.4f}")

The sample standard deviation. Should approximate σ=1.0 with enough samples.

EXECUTION STATE
output = Sample std: approximately 0.65 — underestimate with only 5 samples. More samples → closer to 1.0.
5 lines without explanation
1import numpy as np
2
3# The Gaussian (Normal) PDF
4def gaussian_pdf(x, mu, sigma):
5    coeff = 1 / (sigma * np.sqrt(2 * np.pi))
6    exponent = -((x - mu)**2) / (2 * sigma**2)
7    return coeff * np.exp(exponent)
8
9# Standard normal: mu=0, sigma=1
10mu, sigma = 0.0, 1.0
11
12# Evaluate PDF at specific points
13print("Standard Normal PDF values:")
14for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
15    p = gaussian_pdf(x, mu, sigma)
16    print(f"  P(x={x:5.1f}) = {p:.6f}")
17
18# The 68-95-99.7 rule
19print("\n68-95-99.7 Rule:")
20print(f"  P(-1 < X < 1) = 68.27%")
21print(f"  P(-2 < X < 2) = 95.45%")
22print(f"  P(-3 < X < 3) = 99.73%")
23
24# Generate samples
25np.random.seed(42)
26samples = np.random.normal(mu, sigma, size=5)
27print(f"\n5 samples: {np.round(samples, 4)}")
28print(f"Sample mean: {np.mean(samples):.4f}")
29print(f"Sample std:  {np.std(samples):.4f}")

Conditional Probability and Bayes' Theorem

Conditional probability is the probability of an event given that another event has already occurred. It is written P(AB)P(A \mid B) and read as \u201Cthe probability of A given B.\u201D

The formula is: P(AB)=P(AB)P(B)P(A \mid B) = \frac{P(A \cap B)}{P(B)}

This is exactly what neural network classifiers do: they compute P(classinput)P(\text{class} \mid \text{input}) \u2014 the probability of each class given the input data. When a model sees an image and outputs \u201C85% cat,\u201D it is computing P(catimage)=0.85P(\text{cat} \mid \text{image}) = 0.85.

Bayes' Theorem

Bayes' theorem lets us invert conditional probabilities. If we know P(BA)P(B \mid A) (the likelihood), we can compute P(AB)P(A \mid B) (the posterior):

P(AB)=P(BA)P(A)P(B)P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}

The terms have specific names:

TermNameMeaning
P(A)PriorWhat we believed before seeing evidence
P(B | A)LikelihoodHow probable the evidence is if A is true
P(A | B)PosteriorUpdated belief after seeing evidence
P(B)EvidenceTotal probability of observing the evidence

Let's see Bayes' theorem in action with a famous counterintuitive example:

Bayes' Theorem — The Medical Test Paradox
🐍bayes_theorem.py
1import numpy as np

NumPy imported for consistency, though this example uses plain arithmetic. In practice, Bayes’ theorem computations in neural networks operate on tensors of probabilities.

EXECUTION STATE
📚 numpy = Not strictly needed here, but we keep the import for consistency with other code blocks.
3# Medical test: disease with 1% prevalence

This is the classic Bayes’ theorem example that reveals counterintuitive probability. A test with 99% accuracy seems perfect, but when the disease is rare (1% prevalence), most positive results are actually false positives. This same logic applies to neural networks: a classifier’s predictions must be interpreted in the context of class frequencies (priors).

4# Test accuracy: 99% true positive, 5% false positive

The test correctly identifies 99% of sick people (sensitivity) but incorrectly flags 5% of healthy people (false positive rate).

6# Prior probability

The prior probability P(disease) represents our belief BEFORE seeing any test result. In neural networks, priors encode what we know about class frequencies in the data.

7p_disease = 0.01 — Prior probability of disease

Only 1% of the population has this disease. This is the base rate — before any test, there’s a 1% chance a random person is sick.

EXECUTION STATE
p_disease = 0.01 = P(D) = 0.01. The prior: 1 in 100 people has the disease. This low base rate is what makes the result so surprising.
8p_no_disease = 0.99

The complement: 99% of people are healthy. P(healthy) = 1 - P(disease).

EXECUTION STATE
p_no_disease = 0.99 = P(¬D) = 0.99. The overwhelming majority of the population is healthy.
10# Likelihood: how accurate is the test?

The likelihood P(positive|disease) describes how probable the evidence (positive test) is under each hypothesis (disease vs. no disease). This is the test’s accuracy.

11p_pos_given_disease = 0.99 — True positive rate (sensitivity)

If you have the disease, the test correctly says “positive” 99% of the time. Only 1% of sick people get a false negative.

EXECUTION STATE
p_pos_given_disease = 0.99 = P(+|D) = 0.99. Sensitivity = 99%. The test is very good at catching disease when it exists.
12p_pos_given_no_disease = 0.05 — False positive rate

If you’re healthy, the test incorrectly says “positive” 5% of the time. This seems like a small error rate, but applied to 99% of the population, it generates a LOT of false positives.

EXECUTION STATE
p_pos_given_no_disease = 0.05 = P(+|¬D) = 0.05. False positive rate = 5%. Seems small, but 5% of 99 healthy people = 4.95 false positives.
14# Total probability of testing positive

The law of total probability: P(+) = P(+|D)·P(D) + P(+|¬D)·P(¬D). This sums over all ways someone can test positive: either they’re sick (true positive) or healthy (false positive).

15p_positive = P(+|D)·P(D) + P(+|¬D)·P(¬D)

Computes the total probability of a positive test across the entire population. This is the denominator of Bayes’ theorem.

EXECUTION STATE
p_pos_given_disease * p_disease = 0.99 × 0.01 = 0.0099 — true positives: sick people who test positive
p_pos_given_no_disease * p_no_disease = 0.05 × 0.99 = 0.0495 — false positives: healthy people who test positive
p_positive = 0.0594 = Total: 0.0099 + 0.0495 = 0.0594. About 5.94% of all people test positive. Notice: false positives (0.0495) dwarf true positives (0.0099)!
18print(f"P(positive) = {p_positive:.4f}")

Prints the total positive rate.

EXECUTION STATE
output = P(positive) = 0.0594
20# Bayes' theorem: P(disease | positive)

Now the key question: given a positive test result, what is the probability that the person actually has the disease? Bayes’ theorem inverts the conditional probability.

21p_disease_given_pos = P(+|D)·P(D) / P(+)

Bayes’ theorem: P(D|+) = P(+|D)·P(D) / P(+). We flip the conditional: from “how likely is a positive test given disease” to “how likely is disease given a positive test.”

EXECUTION STATE
numerator: P(+|D)·P(D) = 0.99 × 0.01 = 0.0099 — the joint probability of having disease AND testing positive
denominator: P(+) = 0.0594 — total probability of testing positive (true + false positives)
p_disease_given_pos = 0.1667 = 0.0099 / 0.0594 = 0.1667. Only 16.67%! A “99% accurate” test gives only a 1-in-6 chance of actually having the disease!
23print(f"P(disease | positive) = {p_disease_given_pos:.4f}")

The posterior probability: given a positive test, only 16.67% chance of disease.

EXECUTION STATE
output = P(disease | positive) = 0.1667
24print(f"That is only {p_disease_given_pos*100:.2f}%!")

Emphasizes the surprising result in percentage form.

EXECUTION STATE
output = That is only 16.67%!
26# Even with a 99% accurate test...

The key lesson: accuracy of a classifier depends on base rates. A neural network that’s 99% accurate on a class that makes up 1% of the data might still produce mostly false positives for that class.

27print(f"False positive contribution: ...")

Shows the false positive contribution to the total positive rate.

EXECUTION STATE
false positive contribution = 0.05 × 0.99 = 0.0495 — almost 5x larger than true positives!
28print(f"True positive contribution: ...")

Shows the true positive contribution.

EXECUTION STATE
true positive contribution = 0.99 × 0.01 = 0.0099 — swamped by false positives because the disease is so rare.
10 lines without explanation
1import numpy as np
2
3# Medical test: disease with 1% prevalence
4# Test accuracy: 99% true positive, 5% false positive
5
6# Prior probability
7p_disease = 0.01        # 1% of population has disease
8p_no_disease = 0.99     # 99% are healthy
9
10# Likelihood: how accurate is the test?
11p_pos_given_disease = 0.99    # true positive rate
12p_pos_given_no_disease = 0.05 # false positive rate
13
14# Total probability of testing positive (law of total probability)
15p_positive = (p_pos_given_disease * p_disease +
16              p_pos_given_no_disease * p_no_disease)
17
18print(f"P(positive) = {p_positive:.4f}")
19
20# Bayes' theorem: P(disease | positive)
21p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive
22
23print(f"P(disease | positive) = {p_disease_given_pos:.4f}")
24print(f"That is only {p_disease_given_pos*100:.2f}%!")
25
26# Even with a 99% accurate test, if the disease is rare,
27# most positive results are false positives!
28print(f"\nFalse positive contribution: {p_pos_given_no_disease * p_no_disease:.4f}")
29print(f"True positive contribution:  {p_pos_given_disease * p_disease:.4f}")
Key Insight: Even a 99% accurate test gives only a 16.7% chance of disease when the base rate is 1%. This is the base rate fallacy. In neural networks, the same principle applies: if a class is very rare in the training data, the model may achieve high accuracy by simply predicting the majority class. This is why we use balanced accuracy, precision/recall, and F1 score instead of raw accuracy for imbalanced datasets.

From Logits to Probabilities: The Softmax Function

Neural network classification layers output raw numbers called logits \u2014 they can be any real number (positive, negative, or zero) and don't form a valid probability distribution. The softmax function converts these raw scores into probabilities:

softmax(zi)=ezij=1Kezj\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

Softmax does three things simultaneously:

  1. Makes all values positive: The exponential ezie^{z_i} is always positive, even for negative logits. This ensures no negative \u201Cprobabilities.\u201D
  2. Normalizes to sum to 1: Dividing by the sum ezj\sum e^{z_j} guarantees the outputs form a valid probability distribution.
  3. Preserves ordering: Larger logits always get larger probabilities. The class with the highest logit gets the highest probability.

The Numerical Stability Trick

In practice, we subtract the maximum logit before computing softmax: softmax(zi)=softmax(zimax(z))\text{softmax}(z_i) = \text{softmax}(z_i - \max(z)). This prevents elarge number=e^{\text{large number}} = \infty (overflow) without changing the result. The proof is elegant: the max(z)\max(z) constant cancels in the numerator and denominator.

Temperature Scaling

Dividing logits by a temperature parameter TT before softmax controls the \u201Csharpness\u201D of the output distribution:softmax(zi/T)\text{softmax}(z_i / T).

  • T0T \to 0: Distribution becomes one-hot (all probability on the argmax). The model is maximally confident.
  • T=1T = 1: Standard softmax. The default.
  • TT \to \infty: Distribution becomes uniform (1/K1/K for each class). The model expresses maximum uncertainty.

Temperature scaling is used in knowledge distillation (training a small model from a large model's soft predictions) and language model sampling (higher temperature = more creative text generation).

Experiment with softmax interactively \u2014 adjust the logits and temperature to see how they affect the output probabilities:

Loading softmax visualization...

Now let's implement softmax from scratch with the numerical stability trick and temperature:

Softmax Function — From Logits to Probabilities
🐍softmax_function.py
1import numpy as np

NumPy provides np.exp (element-wise exponential), np.max (for numerical stability), and np.sum (normalization denominator).

EXECUTION STATE
📚 numpy = Key functions: np.exp() for exponentiation, np.max() for the stability trick, np.sum() for normalization.
3def softmax(logits, temperature=1.0) — Convert scores to probabilities

Softmax is THE function that bridges neural networks and probability. It takes a vector of arbitrary real numbers (logits) and converts them into a valid probability distribution: all values between 0 and 1, summing to exactly 1.0. Every classification neural network uses softmax as its final layer.

EXECUTION STATE
⬇ input: logits = Raw network outputs (can be any real number, positive or negative). Example: [2.0, 1.0, 0.1] — the network thinks “cat” is most likely.
⬇ input: temperature (default=1.0) = Controls the “sharpness” of the distribution. T<1 makes it peakier (more confident), T>1 makes it flatter (more uncertain). T→0 becomes argmax, T→∞ becomes uniform.
⬆ returns = np.ndarray of probabilities summing to 1.0. Same length as input logits.
4"""Convert raw scores to probabilities."""

Docstring summarizing the function’s purpose.

5# Scale by temperature

Temperature scaling divides logits by T before applying softmax. This is used in knowledge distillation, language model sampling, and controlling model confidence.

6scaled = logits / temperature

Element-wise division by temperature. With T=1.0, this is a no-op (logits unchanged). With T<1, differences are amplified. With T>1, differences are dampened.

EXECUTION STATE
logits / temperature = For T=1.0: [2.0, 1.0, 0.1] / 1.0 = [2.0, 1.0, 0.1] (unchanged) For T=0.5: [2.0, 1.0, 0.1] / 0.5 = [4.0, 2.0, 0.2] (amplified) For T=2.0: [2.0, 1.0, 0.1] / 2.0 = [1.0, 0.5, 0.05] (dampened)
scaled (T=1.0) = [2.0, 1.0, 0.1]
8# Subtract max for numerical stability

The “log-sum-exp trick.” Without this, exp(large number) = infinity (overflow). Subtracting the max ensures all exponents are ≤ 0, so exp values are between 0 and 1. This doesn’t change the result because softmax(z) = softmax(z - c) for any constant c.

9shifted = scaled - np.max(scaled)

Subtract the maximum value from all logits. The max becomes 0, all others become negative.

EXECUTION STATE
📚 np.max(scaled) = Finds the maximum element. np.max([2.0, 1.0, 0.1]) = 2.0. This is subtracted from all elements.
shifted = [2.0-2.0, 1.0-2.0, 0.1-2.0] = [0.0, -1.0, -1.9]
→ why this works = softmax(z-c) = exp(z_i-c)/Σexp(z_j-c) = exp(z_i)·exp(-c) / Σexp(z_j)·exp(-c) = exp(z_i)/Σexp(z_j) = softmax(z). The constant cancels!
11# Exponentiate: all values become positive

The exponential function e^x is always positive. This is why softmax produces valid probabilities — no negative values possible.

12exp_vals = np.exp(shifted)

Compute e^x for each shifted logit. Larger logits produce exponentially larger values, which is how softmax amplifies differences between scores.

EXECUTION STATE
📚 np.exp() = Element-wise exponential. e⁰=1.0, e^(-1)=0.3679, e^(-1.9)=0.1496. All results are positive since e^x > 0 for all x.
exp_vals = [e⁰, e^(-1), e^(-1.9)] = [1.0000, 0.3679, 0.1496]
→ note the amplification = Input difference: 2.0 vs 0.1 = ratio 20:1. After exp: 1.0 vs 0.15 = ratio 6.7:1. Softmax preserves ordering but the exponential amplifies gaps.
14# Normalize: divide by sum so they sum to 1

Divide each exponential by the total to get a valid probability distribution.

15probs = exp_vals / np.sum(exp_vals)

Each probability = exp(z_i) / Σ exp(z_j). This guarantees all values are between 0 and 1, and the sum is exactly 1.0.

EXECUTION STATE
📚 np.sum(exp_vals) = 1.0000 + 0.3679 + 0.1496 = 1.5174. This is the partition function Z — the normalizing constant.
probs = [1.0/1.5174, 0.3679/1.5174, 0.1496/1.5174] = [0.6590, 0.2424, 0.0986]
→ verify sum = 0.6590 + 0.2424 + 0.0986 = 1.0000 ✓
16return probs

Returns the probability vector. The highest probability corresponds to the model’s predicted class.

EXECUTION STATE
⬆ return: probs = [0.6590, 0.2424, 0.0986] — cat=65.9%, dog=24.2%, bird=9.9%. The model predicts “cat” with 65.9% confidence.
18# Network outputs (logits) for 3 classes

Simulating what a neural network’s last layer might output. These raw scores have no probabilistic meaning yet — softmax converts them.

19logits = np.array([2.0, 1.0, 0.1])

Three logit scores for three classes. The network’s raw output before softmax. Higher logit = network thinks this class is more likely.

EXECUTION STATE
logits = [2.0, 1.0, 0.1] — cat has the highest score (2.0), bird the lowest (0.1). But these aren’t probabilities — they don’t sum to 1 and can be negative.
20labels = ["cat", "dog", "bird"]

Human-readable class names corresponding to each logit index.

EXECUTION STATE
labels = ["cat", "dog", "bird"] — index 0=cat, 1=dog, 2=bird
22probs = softmax(logits)

Apply softmax with default temperature=1.0 to convert logits to probabilities.

EXECUTION STATE
probs = [0.6590, 0.2424, 0.0986]
23print("Class probabilities:")

Header for the output.

EXECUTION STATE
output = Class probabilities:
24for label, logit, prob in zip(labels, logits, probs):

Iterates over labels, logits, and probabilities together using zip to display each class’s logit and probability.

LOOP TRACE · 3 iterations
cat
logit=2.0 → P=0.6590 = Highest logit → highest probability. The model’s top pick.
dog
logit=1.0 → P=0.2424 = Second-highest logit. About 24% chance.
bird
logit=0.1 → P=0.0986 = Lowest logit. About 10% chance.
25print(f" {label:5s}: logit={logit:5.1f} -> P={prob:.4f}")

Formatted output showing the transformation from logit to probability.

EXECUTION STATE
output (all iterations) =
  cat  : logit=  2.0 -> P=0.6590
  dog  : logit=  1.0 -> P=0.2424
  bird : logit=  0.1 -> P=0.0986
26print(f" Sum = {probs.sum():.4f}")

Verifies probabilities sum to 1.0.

EXECUTION STATE
output = Sum = 1.0000 ✓
28# Temperature effect: controls "confidence"

Temperature T scales the logits before softmax. T<1 sharpens the distribution (more confident), T>1 flattens it (more uncertain). This is used in language model generation (high T for creativity, low T for accuracy) and knowledge distillation.

29print("Temperature effect:")

Header for temperature comparison.

EXECUTION STATE
output = Temperature effect:
30for T in [0.5, 1.0, 2.0, 5.0]:

Tests four temperature values to show the effect. Lower T concentrates probability on the winner; higher T spreads it evenly.

LOOP TRACE · 4 iterations
T = 0.5 (sharp)
softmax(logits/0.5) = [0.8638, 0.1169, 0.0193] — very confident: 86% on cat. Low temperature amplifies differences.
T = 1.0 (default)
softmax(logits/1.0) = [0.6590, 0.2424, 0.0986] — standard softmax output.
T = 2.0 (soft)
softmax(logits/2.0) = [0.5017, 0.3043, 0.1940] — more uniform, less confident.
T = 5.0 (very soft)
softmax(logits/5.0) = [0.3996, 0.3272, 0.2733] — nearly uniform. High T erases differences.
31p = softmax(logits, temperature=T)

Calls softmax with the current temperature value.

EXECUTION STATE
p = The probability vector at the current temperature. See iterations above for all values.
32print(f" T={T:.1f}: {np.round(p, 4)}")

Prints the temperature and resulting probability vector.

EXECUTION STATE
output (all iterations) =
  T=0.5: [0.8638 0.1169 0.0193]
  T=1.0: [0.659  0.2424 0.0986]
  T=2.0: [0.5017 0.3043 0.194 ]
  T=5.0: [0.3996 0.3272 0.2733]
7 lines without explanation
1import numpy as np
2
3def softmax(logits, temperature=1.0):
4    """Convert raw scores to probabilities."""
5    # Scale by temperature
6    scaled = logits / temperature
7
8    # Subtract max for numerical stability
9    shifted = scaled - np.max(scaled)
10
11    # Exponentiate: all values become positive
12    exp_vals = np.exp(shifted)
13
14    # Normalize: divide by sum so they sum to 1
15    probs = exp_vals / np.sum(exp_vals)
16    return probs
17
18# Network outputs (logits) for 3 classes
19logits = np.array([2.0, 1.0, 0.1])
20labels = ["cat", "dog", "bird"]
21
22probs = softmax(logits)
23print("Class probabilities:")
24for label, logit, prob in zip(labels, logits, probs):
25    print(f"  {label:5s}: logit={logit:5.1f} -> P={prob:.4f}")
26print(f"  Sum = {probs.sum():.4f}")
27
28# Temperature effect: controls "confidence"
29print("\nTemperature effect:")
30for T in [0.5, 1.0, 2.0, 5.0]:
31    p = softmax(logits, temperature=T)
32    print(f"  T={T:.1f}: {np.round(p, 4)}")

Cross-Entropy: Measuring Prediction Quality

Now we arrive at the most important equation in classification neural networks:cross-entropy loss. It measures how different the model's predicted probability distribution qq is from the true distribution pp:

H(p,q)=k=1Kp(k)logq(k)H(p, q) = -\sum_{k=1}^{K} p(k) \cdot \log q(k)

Since the true distribution pp is one-hot (only one class is correct), cross-entropy simplifies dramatically:

H(p,q)=logq(ytrue)H(p, q) = -\log q(y_{\text{true}})

This is just the negative log probability of the correct class. The properties are beautiful:

Predicted P(true class)Loss = -log(P)Interpretation
0.990.01Near-perfect prediction, almost zero loss
0.70.36Good prediction, moderate loss
0.50.69Coin flip — the model is unsure
0.12.30Bad prediction — high loss
0.014.61Terrible prediction — very high loss
0.0016.91Catastrophically wrong — extreme loss

Notice the asymmetric penalty: going from 0.99 to 0.7 adds only 0.35 to the loss, but going from 0.1 to 0.01 adds 2.31. Cross-entropy punishes confident wrong answers exponentially more than slightly uncertain correct answers. This creates a strong gradient signal to fix wrong predictions.

Information Theory Perspective

Cross-entropy comes from information theory. The entropy of a distribution pp is H(p)=p(k)logp(k)H(p) = -\sum p(k) \log p(k) \u2014 the minimum average number of bits needed to encode events from pp. The cross-entropy H(p,q)H(p, q) is the average number of bits needed when you use the encoding optimized for qq but the data actually comes from pp. The difference H(p,q)H(p)H(p, q) - H(p) is the KL divergence DKL(pq)D_{KL}(p \| q) \u2014 the \u201Cextra bits\u201D wasted by using the wrong distribution. Since H(p)H(p) is constant during training (the true labels don't change), minimizing cross-entropy is equivalent to minimizing KL divergence.

Explore cross-entropy interactively \u2014 adjust the predicted probabilities and see how the loss changes:

Loading cross-entropy visualization...

Let's implement cross-entropy and see how different predictions affect the loss:

Cross-Entropy Loss — Measuring Prediction Quality
🐍cross_entropy.py
1import numpy as np

NumPy provides np.log (natural logarithm) and np.sum for the cross-entropy computation.

EXECUTION STATE
📚 numpy = Key function: np.log() computes natural logarithm (ln). log(1)=0, log(0.5)=-0.693, log(0.1)=-2.303. Log of values near 0 goes to -∞.
3def cross_entropy(true_probs, pred_probs) — Measure prediction quality

Cross-entropy is THE loss function for classification in neural networks. It measures how different the predicted probability distribution is from the true distribution. Lower = better predictions. It comes from information theory: cross-entropy tells you how many bits you need to encode data from distribution p using the code optimized for distribution q.

EXECUTION STATE
⬇ input: true_probs (p) = The true distribution — usually one-hot: [1, 0, 0] meaning the true class is index 0. In information theory, this is the “source” distribution.
⬇ input: pred_probs (q) = The model’s predicted probabilities, e.g. [0.7, 0.2, 0.1]. These come from softmax. In information theory, this is the “model” distribution.
⬆ returns = A single non-negative number. 0 = perfect prediction (p = q). Larger = worse prediction. Maximum when predicted probability of the true class approaches 0.
4"""H(p, q) = -sum(p(x) * log(q(x)))"""

The cross-entropy formula. Since p is one-hot, this simplifies to -log(q[true_class]) — the negative log of the predicted probability for the correct class.

5eps = 1e-15 — Prevent log(0) = -infinity

A tiny constant added to predicted probabilities before taking the log. Without this, if the model predicts 0 probability for the true class, log(0) = -∞, which would crash the computation.

EXECUTION STATE
eps = 1e-15 = 0.000000000000001 — so small it doesn’t affect results, but prevents log(0). This is a standard numerical stability trick.
6return -np.sum(true_probs * np.log(pred_probs + eps))

The cross-entropy computation in one line: (1) add epsilon for safety, (2) take log of each predicted probability, (3) multiply by true probabilities (only the true class matters), (4) sum, (5) negate.

EXECUTION STATE
📚 np.log() = Natural logarithm (base e). np.log(1.0)=0.0, np.log(0.7)=-0.357, np.log(0.1)=-2.303. Smaller probabilities produce larger (more negative) log values.
pred_probs + eps = [0.7+1e-15, 0.2+1e-15, 0.1+1e-15] ≈ [0.7, 0.2, 0.1] — epsilon is negligible but prevents log(0).
np.log(pred_probs + eps) = [log(0.7), log(0.2), log(0.1)] = [-0.3567, -1.6094, -2.3026]
true_probs * np.log(...) = [1.0×(-0.3567), 0.0×(-1.6094), 0.0×(-2.3026)] = [-0.3567, 0.0, 0.0] — only the true class contributes!
-np.sum(...) = -(-0.3567 + 0 + 0) = 0.3567
⬆ return: 0.3567 = Cross-entropy loss = 0.3567. This is equivalent to -log(0.7). Lower = better.
8# True label: "cat" (one-hot encoded)

One-hot encoding: the true class gets probability 1.0, all others get 0.0. This is how we represent the ground truth for classification.

9true = np.array([1.0, 0.0, 0.0])

The true label as a probability vector. Since the true class is “cat” (index 0), it gets 1.0.

EXECUTION STATE
true = [1.0, 0.0, 0.0] — one-hot for “cat.” The true distribution is perfectly concentrated on one class.
10labels = ["cat", "dog", "bird"]

Class names for readability.

EXECUTION STATE
labels = ["cat", "dog", "bird"] — index mapping for the three classes
12# Scenario 1: Good prediction

The model is fairly confident about the correct class. Most probability mass is on “cat.”

13good_pred = np.array([0.7, 0.2, 0.1])

A reasonable prediction: 70% on cat (correct), 20% on dog, 10% on bird.

EXECUTION STATE
good_pred = [0.7, 0.2, 0.1] — correctly gives the most probability to “cat”
14loss_good = cross_entropy(true, good_pred)

Computes -log(0.7) since only the true class (cat, index 0) contributes.

EXECUTION STATE
loss_good = 0.3567 = H(p, q) = -1.0×log(0.7) = 0.3567. A moderate loss — correct but not highly confident.
15print(f"Good prediction {good_pred}:")

Header for the good prediction output.

EXECUTION STATE
output = Good prediction [0.7 0.2 0.1]:
16print(f" Loss = -log(0.7) = {loss_good:.4f}")

Shows the simplified formula: since p is one-hot, H(p,q) = -log(q[true_class]).

EXECUTION STATE
output = Loss = -log(0.7) = 0.3567
18# Scenario 2: Bad prediction

The model is wrong — it puts the most probability on “dog” instead of “cat.”

19bad_pred = np.array([0.1, 0.6, 0.3])

A bad prediction: only 10% on cat (correct), 60% on dog (wrong).

EXECUTION STATE
bad_pred = [0.1, 0.6, 0.3] — model incorrectly favors “dog” with 60%
20loss_bad = cross_entropy(true, bad_pred)

Computes -log(0.1) — a much higher loss because the model assigned low probability to the true class.

EXECUTION STATE
loss_bad = 2.3026 = H(p, q) = -log(0.1) = 2.3026. This is 6.5× worse than the good prediction! Cross-entropy heavily penalizes confident wrong answers.
21print(f"Bad prediction {bad_pred}:")

Header for bad prediction.

EXECUTION STATE
output = Bad prediction [0.1 0.6 0.3]:
22print(f" Loss = -log(0.1) = {loss_bad:.4f}")

The loss is much higher because -log(0.1) >> -log(0.7).

EXECUTION STATE
output = Loss = -log(0.1) = 2.3026
24# Scenario 3: Near-perfect prediction

The model is almost 100% confident and correct. This is what a well-trained model achieves.

25perfect_pred = np.array([0.99, 0.005, 0.005])

Nearly perfect: 99% on cat. The loss should be very close to zero.

EXECUTION STATE
perfect_pred = [0.99, 0.005, 0.005] — extremely confident correct prediction
26loss_perfect = cross_entropy(true, perfect_pred)

Computes -log(0.99), which is very close to 0.

EXECUTION STATE
loss_perfect = 0.0101 = H(p, q) = -log(0.99) = 0.0101. Nearly zero loss! The model’s prediction almost perfectly matches the true distribution.
27print(f"Perfect prediction {perfect_pred}:")

Header for perfect prediction.

EXECUTION STATE
output = Perfect prediction [0.99 0.005 0.005]:
28print(f" Loss = -log(0.99) = {loss_perfect:.4f}")

Near-zero loss.

EXECUTION STATE
output = Loss = -log(0.99) = 0.0101
30# Comparison

Comparing all three scenarios shows the key property of cross-entropy: it’s 0 for perfect predictions, small for good predictions, and very large for bad predictions — with an asymmetric, steep penalty as the predicted probability of the true class approaches zero.

31print("--- Summary ---")

Summary header.

EXECUTION STATE
output = --- Summary ---
32print(f"Perfect: {loss_perfect:.4f} (lowest)")

Perfect prediction has the lowest loss (0.0101).

EXECUTION STATE
output = Perfect: 0.0101 (lowest)
33print(f"Good: {loss_good:.4f}")

Good prediction has moderate loss (0.3567).

EXECUTION STATE
output = Good: 0.3567
34print(f"Bad: {loss_bad:.4f} (highest)")

Bad prediction has the highest loss (2.3026). Cross-entropy creates a strong gradient signal to fix wrong predictions.

EXECUTION STATE
output = Bad: 2.3026 (highest)
6 lines without explanation
1import numpy as np
2
3def cross_entropy(true_probs, pred_probs):
4    """H(p, q) = -sum(p(x) * log(q(x)))"""
5    eps = 1e-15  # prevent log(0)
6    return -np.sum(true_probs * np.log(pred_probs + eps))
7
8# True label: "cat" (one-hot encoded)
9true = np.array([1.0, 0.0, 0.0])
10labels = ["cat", "dog", "bird"]
11
12# Scenario 1: Good prediction
13good_pred = np.array([0.7, 0.2, 0.1])
14loss_good = cross_entropy(true, good_pred)
15print(f"Good prediction {good_pred}:")
16print(f"  Loss = -log(0.7) = {loss_good:.4f}")
17
18# Scenario 2: Bad prediction
19bad_pred = np.array([0.1, 0.6, 0.3])
20loss_bad = cross_entropy(true, bad_pred)
21print(f"\nBad prediction {bad_pred}:")
22print(f"  Loss = -log(0.1) = {loss_bad:.4f}")
23
24# Scenario 3: Near-perfect prediction
25perfect_pred = np.array([0.99, 0.005, 0.005])
26loss_perfect = cross_entropy(true, perfect_pred)
27print(f"\nPerfect prediction {perfect_pred}:")
28print(f"  Loss = -log(0.99) = {loss_perfect:.4f}")
29
30# Comparison
31print("\n--- Summary ---")
32print(f"Perfect:  {loss_perfect:.4f} (lowest)")
33print(f"Good:     {loss_good:.4f}")
34print(f"Bad:      {loss_bad:.4f} (highest)")

Maximum Likelihood: Why Cross-Entropy Works

Why is cross-entropy the loss function for classification? The answer comes from Maximum Likelihood Estimation (MLE) \u2014 the most principled way to fit a model to data.

The Likelihood Function

Given a dataset of NN examples {(x1,y1),,(xN,yN)}\{(x_1, y_1), \ldots, (x_N, y_N)\}, the likelihood is the probability of observing all the true labels given the model's predictions. Assuming independence:

L(θ)=i=1NPθ(yixi)\mathcal{L}(\theta) = \prod_{i=1}^{N} P_{\theta}(y_i \mid x_i)

Here Pθ(yixi)P_{\theta}(y_i \mid x_i) is the probability that the model (with parameters θ\theta) assigns to the correct class yiy_i for input xix_i. We want to find θ\theta that maximizes this likelihood \u2014 make the data as probable as possible under our model.

From Likelihood to Cross-Entropy

Products are numerically unstable and hard to optimize. Taking the log converts the product to a sum (since log(ab)=loga+logb\log(a \cdot b) = \log a + \log b):

logL(θ)=i=1NlogPθ(yixi)\log \mathcal{L}(\theta) = \sum_{i=1}^{N} \log P_{\theta}(y_i \mid x_i)

Maximizing this log-likelihood is equivalent to minimizing its negative:

1Ni=1NlogPθ(yixi)-\frac{1}{N} \sum_{i=1}^{N} \log P_{\theta}(y_i \mid x_i)

And this is exactly the cross-entropy loss! Each term logPθ(yixi)-\log P_{\theta}(y_i \mid x_i) is the negative log probability of the correct class \u2014 exactly what cross-entropy computes.

The Connection: Cross-entropy loss is not an arbitrary choice. It is the mathematically principled consequence of maximum likelihood estimation. When you minimize cross-entropy during training, you are maximizing the probability that the model assigns to the correct labels \u2014 you are finding the parameters that make the observed data most likely.

Why Not Use MSE for Classification?

You might wonder: why not use mean squared error (MSE) for classification? The answer is both theoretical and practical:

  1. MSE assumes Gaussian errors. MSE is the maximum likelihood estimator when errors follow a Gaussian distribution. But classification labels are categorical (one-hot), not Gaussian \u2014 so MSE is the wrong probabilistic model.
  2. Gradient saturation. With softmax + MSE, the gradient can vanish when the prediction is very wrong (the sigmoid-like saturation effect). With softmax + cross-entropy, the gradient is simply y^y\hat{y} - y (predicted minus true) \u2014 no saturation, always a clear learning signal.
  3. Sharper convergence. Cross-entropy creates steeper gradients for confident wrong predictions. If the model says 99% dog when the answer is cat, cross-entropy produces a massive gradient. MSE would produce a relatively gentle gradient for the same error.

Probability in PyTorch

PyTorch provides optimized, numerically stable implementations of softmax and cross-entropy that you should always use in practice (never implement them from scratch in production code). The key function is F.cross_entropy\texttt{F.cross\_entropy}, which combines softmax and negative log-likelihood in a single numerically stable operation.

Important: PyTorch's F.cross_entropy\texttt{F.cross\_entropy} takes raw logits, not probabilities. It applies softmax internally. If you pass probabilities that have already been through softmax, you'll get wrong results (double softmax).

Let's see the complete PyTorch pipeline: logits \u2192 softmax \u2192 cross-entropy \u2192 gradients:

Cross-Entropy in PyTorch — The Complete Pipeline
🐍pytorch_cross_entropy.py
1import torch

PyTorch provides tensor operations and automatic differentiation. We use it to compute cross-entropy loss and gradients.

EXECUTION STATE
📚 torch = PyTorch’s core library. Provides Tensor type (like NumPy arrays but with GPU support and autograd), and the automatic differentiation engine.
2import torch.nn.functional as F

PyTorch’s functional API contains loss functions and activation functions as pure functions (no learnable parameters).

EXECUTION STATE
📚 torch.nn.functional (F) = Stateless neural network functions. Key functions: F.softmax (logits → probabilities), F.cross_entropy (combined softmax + CE loss), F.relu, F.dropout.
4# Network outputs (raw logits, NOT probabilities)

Important distinction: neural networks output raw scores (logits), not probabilities. The logits can be any real number. PyTorch’s F.cross_entropy expects logits, not probabilities — it applies softmax internally.

5logits = torch.tensor([[2.0, 1.0, 0.1]])

Creates a 1×3 tensor of logits (batch_size=1, num_classes=3). The double brackets make this a 2D tensor as expected by F.cross_entropy.

EXECUTION STATE
📚 torch.tensor() = Creates a PyTorch tensor from Python data. Similar to np.array() but supports autograd. The input must match the expected shape: (batch_size, num_classes) for cross_entropy.
logits shape = (1, 3) — 1 sample, 3 classes. Logits [2.0, 1.0, 0.1] — same values we used in NumPy.
logits = tensor([[2.0, 1.0, 0.1]]) — these are NOT probabilities. They don’t sum to 1.
6target = torch.tensor([0]) — Class index for "cat"

The true class label as an integer index. PyTorch uses integer labels (not one-hot) for cross_entropy — it’s more memory-efficient.

EXECUTION STATE
📚 target format = PyTorch cross_entropy expects integer class indices, not one-hot vectors. target=0 means class 0 (cat). Shape: (batch_size,).
target = tensor([0]) = Class 0 = “cat.” This is equivalent to the one-hot vector [1, 0, 0] but more memory-efficient.
8# PyTorch combines softmax + cross-entropy in one step

F.cross_entropy does THREE things internally: (1) applies softmax to logits, (2) takes the log, (3) computes the negative log-likelihood. Combining these in one function is more numerically stable than doing them separately.

9loss = F.cross_entropy(logits, target)

Computes the cross-entropy loss in one optimized call. Internally: softmax([2.0, 1.0, 0.1]) = [0.6590, 0.2424, 0.0986], then -log(0.6590) = 0.4170.

EXECUTION STATE
📚 F.cross_entropy(input, target) = Combines LogSoftmax + NLLLoss. Input: raw logits (batch, classes). Target: integer class indices (batch,). Returns scalar loss averaged over the batch.
⬇ arg 1: logits = tensor([[2.0, 1.0, 0.1]]) — raw network output. F.cross_entropy applies softmax internally.
⬇ arg 2: target = tensor([0]) — the correct class index. F.cross_entropy extracts the logit at this index.
→ internal softmax = softmax([2.0, 1.0, 0.1]) = [0.6590, 0.2424, 0.0986]
→ loss = -log(0.6590) = 0.4170 — the negative log probability of the correct class
loss = 0.4170 = Note: this differs from our NumPy example (0.3567) because here we start from logits, not from pre-defined probabilities. The softmax of [2.0, 1.0, 0.1] gives P(cat)=0.659, not 0.7.
10print(f"Cross-entropy loss: {loss.item():.4f}")

Extracts the scalar value from the tensor for printing.

EXECUTION STATE
📚 .item() = Converts a single-element tensor to a Python float. Required because print() can’t directly format a tensor with :.4f.
output = Cross-entropy loss: 0.4170
12# Manual verification: softmax then negative log-likelihood

Let’s verify F.cross_entropy by doing the steps manually: first softmax, then -log.

13probs = F.softmax(logits, dim=-1)

Applies softmax along the last dimension to convert logits to probabilities.

EXECUTION STATE
📚 F.softmax(input, dim) = Applies the softmax function: softmax(x_i) = exp(x_i) / Σexp(x_j). Output sums to 1.0 along the specified dimension.
⬇ arg: dim=-1 = Apply softmax along the last dimension (classes). For shape (1,3): dim=-1 means each row becomes a probability distribution.
probs = tensor([[0.6590, 0.2424, 0.0986]]) — sum = 1.0 ✓
14print(f"Softmax probabilities: {probs.detach().numpy()}")

Converts the tensor to a NumPy array for readable printing.

EXECUTION STATE
📚 .detach() = Detaches the tensor from the computation graph. Required before .numpy() if the tensor has requires_grad=True or is part of a gradient computation.
📚 .numpy() = Converts a PyTorch tensor to a NumPy array. Only works on CPU tensors. Shares memory — modifying one modifies the other.
output = Softmax probabilities: [[0.659 0.2424 0.0986]]
16manual_loss = -torch.log(probs[0, target[0]])

Manually computes cross-entropy: take the predicted probability of the true class, then compute -log of it.

EXECUTION STATE
📚 torch.log() = Natural logarithm. torch.log(tensor(0.659)) = tensor(-0.4170).
probs[0, target[0]] = probs[0, 0] = 0.6590 — the probability assigned to class 0 (cat) in sample 0.
-torch.log(0.6590) = -(-0.4170) = 0.4170 — matches F.cross_entropy! ✓
17print(f"Manual -log(p[cat]): {manual_loss.item():.4f}")

Confirms the manual calculation matches F.cross_entropy.

EXECUTION STATE
output = Manual -log(p[cat]): 0.4170 — matches!
19# Gradient: how should logits change to reduce loss?

The gradient of cross-entropy loss with respect to logits tells us how to adjust the network’s output to reduce the loss. This is what drives learning in neural networks.

20logits_grad = torch.tensor([[2.0, 1.0, 0.1]], requires_grad=True)

Creates a new tensor with gradient tracking enabled. PyTorch will record all operations on this tensor to compute gradients via backpropagation.

EXECUTION STATE
📚 requires_grad=True = Tells PyTorch to track all operations on this tensor so gradients can be computed via .backward(). Without this, PyTorch doesn’t record the computation graph.
logits_grad = tensor([[2.0, 1.0, 0.1]], requires_grad=True) — same values, but now tracked for gradient computation.
21loss = F.cross_entropy(logits_grad, target)

Recomputes the loss with the gradient-enabled tensor. PyTorch builds a computation graph internally.

EXECUTION STATE
loss = tensor(0.4170, grad_fn=<NllLossBackward0>) — same loss value, but now has a grad_fn (computation graph attached).
22loss.backward()

Computes gradients by backpropagating through the computation graph. After this call, logits_grad.grad contains dloss/dlogits.

EXECUTION STATE
📚 .backward() = Triggers reverse-mode automatic differentiation. Computes gradients of the loss with respect to all tensors that have requires_grad=True. Gradients accumulate in the .grad attribute.
23print(f"Gradients: {logits_grad.grad.numpy()}")

The gradient of cross-entropy with respect to logits has a beautiful closed-form: gradient = softmax(logits) - one_hot(target). This is one of the most elegant results in all of neural network theory.

EXECUTION STATE
logits_grad.grad = [[-0.3410, 0.2424, 0.0986]] — the gradient of the loss with respect to each logit.
→ grad[cat] = -0.3410 = Negative gradient for the true class. Gradient descent will INCREASE this logit (push probability higher). grad = 0.6590 - 1.0 = -0.3410.
→ grad[dog] = 0.2424 = Positive gradient for wrong class. Gradient descent will DECREASE this logit (push probability lower). grad = 0.2424 - 0.0 = 0.2424.
→ grad[bird] = 0.0986 = Positive gradient for wrong class. grad = 0.0986 - 0.0 = 0.0986.
24print(f"Gradient = softmax - one_hot")

The cross-entropy gradient formula: ∇L/∇z = softmax(z) - y, where y is the one-hot target. This means the gradient is simply the difference between what the model predicted and what the answer should be. Elegant and efficient!

EXECUTION STATE
softmax - one_hot = [0.6590, 0.2424, 0.0986] - [1.0, 0.0, 0.0] = [-0.3410, 0.2424, 0.0986] ✔ matches autograd!
→ interpretation = The gradient pushes the model’s output toward the true distribution. For the correct class: increase logit. For wrong classes: decrease logits. The magnitude is proportional to the error.
5 lines without explanation
1import torch
2import torch.nn.functional as F
3
4# Network outputs (raw logits, NOT probabilities)
5logits = torch.tensor([[2.0, 1.0, 0.1]])
6target = torch.tensor([0])  # class index for "cat"
7
8# PyTorch combines softmax + cross-entropy in one step
9loss = F.cross_entropy(logits, target)
10print(f"Cross-entropy loss: {loss.item():.4f}")
11
12# Manual verification: softmax then negative log-likelihood
13probs = F.softmax(logits, dim=-1)
14print(f"Softmax probabilities: {probs.detach().numpy()}")
15
16manual_loss = -torch.log(probs[0, target[0]])
17print(f"Manual -log(p[cat]): {manual_loss.item():.4f}")
18
19# Gradient: how should logits change to reduce loss?
20logits_grad = torch.tensor([[2.0, 1.0, 0.1]], requires_grad=True)
21loss = F.cross_entropy(logits_grad, target)
22loss.backward()
23print(f"\nGradients: {logits_grad.grad.numpy()}")
24print(f"Gradient = softmax - one_hot")

The Beautiful Gradient

The gradient of cross-entropy loss with respect to the logits has a remarkably simple form: Lzi=y^iyi\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i, where y^\hat{y} is the softmax output and yy is the one-hot target. In our example:

ClassSoftmax (predicted)One-hot (true)Gradient
cat0.65901.0-0.3410 (increase this logit)
dog0.24240.0+0.2424 (decrease this logit)
bird0.09860.0+0.0986 (decrease this logit)

The gradient is simply the prediction error: how far each predicted probability is from the true value. This elegance is no accident \u2014 it is a direct consequence of the mathematical relationship between softmax and cross-entropy being dual to each other.


Summary and What's Next

We've built a complete path from basic probability to the cross-entropy loss function. Here's the chain of ideas:

  1. Random variables and distributions formalize uncertainty \u2014 each outcome gets a probability, and probabilities must sum to 1.
  2. Expected value and variance summarize distributions with two numbers: where it's centered and how spread out it is.
  3. The Gaussian distribution appears everywhere in neural networks: weight initialization, batch normalization, and the central limit theorem.
  4. Bayes' theorem lets us reason about P(classdata)P(\text{class} \mid \text{data}) \u2014 exactly what classifiers compute \u2014 and warns us about base rate effects.
  5. Softmax converts raw network outputs (logits) into a valid probability distribution, with temperature controlling confidence.
  6. Cross-entropy measures how different two distributions are. For one-hot labels, it simplifies to logP(correct class)-\log P(\text{correct class}).
  7. Maximum likelihood estimation proves that cross-entropy is the mathematically principled loss for classification \u2014 not an arbitrary choice.
What's Next: In the next section, we cover the chain rule \u2014 the mathematical engine that powers backpropagation. The chain rule lets us compute how the loss changes with respect to every single weight in the network, enabling gradient descent to work on deep networks with millions of parameters.
Key ConceptFormulaRole in Neural Networks
Expected ValueE[X] = Σ x·P(x)Loss = average over training data
Gaussian PDFf(x) = (1/σ√(2π)) exp(-(x-μ)²/2σ²)Weight init, batch norm
Bayes’ TheoremP(A|B) = P(B|A)·P(A) / P(B)P(class | input)
Softmaxsoftmax(zᵢ) = exp(zᵢ) / Σexp(zⱼ)Logits → probabilities
Cross-EntropyH(p,q) = -Σ p·log(q)Classification loss
CE Gradient∇L/∇z = softmax(z) - yThe learning signal
Loading comments...