Learning Objectives
By the end of this section, you will:
- Deeply understand what variance measures: the spread or dispersion of a distribution around its mean
- Know why we square deviations (and not use absolute values)
- Master both the definitional and computational formulas
- Understand standard deviation and why it's in the same units as the data
- Apply key properties: shift invariance, quadratic scaling, and additivity for independent variables
- Distinguish population vs sample variance (Bessel's correction)
- Connect variance to the bias-variance tradeoff in ML
- Recognize variance in loss functions, regularization, and uncertainty
- Avoid common pitfalls when working with variance
Historical Context
The Quest to Measure Uncertainty
While expected value tells us where a distribution is centered, scientists needed a way to quantify how spread out the data is. This became crucial for astronomy, where understanding measurement error was essential.
Carl Friedrich Gauss (1777-1855) developed least squares and worked extensively with error distributions. Karl Pearson (1857-1936) coined the term "standard deviation" in 1893, choosing it over earlier terms like "probable error."
Ronald Fisher (1890-1962) revolutionized the field by developing the analysis of variance (ANOVA) and proving why we divide by n - 1 instead of n for sample variance.
The Problem: Mean Isn't Enough
In the previous section, we learned that expected value tells us where the center of a distribution is. But consider these two scenarios:
Investment A
Returns: +8%, +12%, +10%, +10%, +10%
Mean: 10%
Investment B
Returns: -20%, +40%, +5%, +15%, +10%
Mean: 10%
Both investments have the same expected return of 10%. But would you treat them as equivalent? Clearly not! Investment B is much more risky—its returns swing wildly, while Investment A is stable.
The Core Insight: Expected value tells us where the distribution is centered. Variance tells us how spread out it is. Together, they give us a much richer picture of uncertainty.
What Does Variance Measure?
Variance = the average squared distance from the mean
It measures how "spread out" or "dispersed" the values of a random variable are around its expected value.
Think of variance as answering the question: "On average, how far are the values from the center?" (where "far" is measured in squared units).
Intuitive Picture
Low Variance
Values cluster tightly around the mean. The distribution is peaked and narrow.
Example: Heights of adult males in a country
High Variance
Values spread widely around the mean. The distribution is flat and wide.
Example: Annual incomes in a country
Interactive: Visualizing Spread
Use the slider below to see how variance affects the spread of a distribution. Notice how a larger variance makes the distribution wider and flatter.
Mathematical Definition
Definition: Variance

Var(X) = E[(X - μ)²]

where μ = E[X] is the expected value of X
Breaking Down the Formula
| Symbol | Meaning | Intuition |
|---|---|---|
| X | The random variable | The quantity whose spread we want to measure |
| μ = E[X] | Expected value (mean) | The center of the distribution |
| (X - μ) | Deviation from mean | How far X is from the center |
| (X - μ)² | Squared deviation | Distance squared (always positive) |
| E[...] | Expected value of | Average over all possible outcomes |
So variance is literally: the expected value of the squared distance from the mean. It's a weighted average of "how far things are from the center," where farther deviations are penalized more heavily (because of squaring).
Why Do We Square the Deviations?
This is one of the most common questions students ask. Why not just average the deviations directly? Or use absolute values |X - μ|?
Problem 1: Deviations Sum to Zero
If we try to compute E[X - μ], we get:

E[X - μ] = E[X] - μ = μ - μ = 0

The positive and negative deviations always cancel out! This tells us nothing about spread.
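A quick numeric check makes the cancellation concrete (a minimal sketch using NumPy with a small made-up dataset):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu = data.mean()  # 5.0

# Raw deviations always sum to (numerically) zero...
raw_sum = (data - mu).sum()

# ...but squared deviations capture the spread.
mean_sq_dev = ((data - mu) ** 2).mean()

print(f"sum of raw deviations:  {raw_sum:.10f}")  # ~0
print(f"mean squared deviation: {mean_sq_dev}")   # 4.0
```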
Two Solutions
Option 1: Mean Absolute Deviation (MAD)

MAD = E[|X - μ|]

Take absolute values to prevent cancellation.
Option 2: Variance (squared deviations)

Var(X) = E[(X - μ)²]

Square to make everything positive.
Why Squaring Wins
- Differentiability: x² is differentiable everywhere, but |x| has a kink at 0. This matters hugely for optimization (gradient descent!).
- Mathematical tractability: For independent random variables, Var(X + Y) = Var(X) + Var(Y). This beautiful additivity property doesn't hold for MAD.
- Connection to Euclidean geometry: Squared distance is the foundation of Euclidean space. This connects variance to least squares, PCA, and many ML algorithms.
- Central Limit Theorem: The normal distribution emerges naturally from summing independent random variables, and it's parameterized by variance (not MAD).
Deep Insight: Variance "won" because of its mathematical elegance, not because it's the only valid measure of spread. In robust statistics, MAD is sometimes preferred because it's less sensitive to outliers.
Interactive: Why Squaring Works
Drag the data points and observe how positive and negative deviations cancel when not squared, but squaring captures the true spread.
The Computational Shortcut
There's an equivalent formula that's often easier to compute:
Computational Formula

Var(X) = E[X²] - (E[X])²

"The mean of the squares minus the square of the mean"
Proof
Expand the square inside the expectation and use linearity:

Var(X) = E[(X - μ)²] = E[X² - 2μX + μ²] = E[X²] - 2μE[X] + μ² = E[X²] - 2μ² + μ² = E[X²] - μ²

Since μ = E[X], this is exactly E[X²] - (E[X])².
Discrete vs Continuous Formulas
Discrete Random Variable

Var(X) = Σᵢ (xᵢ - μ)² P(X = xᵢ)

Sum over all possible values, weighted by probabilities.
Continuous Random Variable

Var(X) = ∫ (x - μ)² f(x) dx

Integrate over the real line, weighted by density.
Example: Variance of a Fair Die
For a fair 6-sided die, P(X = k) = 1/6 for k = 1, 2, ..., 6.
First, the mean:

E[X] = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 21/6 = 3.5

Then the variance, using the computational formula:

E[X²] = (1 + 4 + 9 + 16 + 25 + 36) / 6 = 91/6
Var(X) = 91/6 - (3.5)² = 91/6 - 49/4 = 35/12 ≈ 2.92
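The die calculation is easy to verify numerically; here is a small sketch that computes the mean and both variance formulas from the probability table (the exact answer is 35/12 ≈ 2.9167):

```python
import numpy as np

faces = np.arange(1, 7)   # outcomes 1..6
p = np.full(6, 1 / 6)     # each with probability 1/6

mean = np.sum(faces * p)                     # E[X], should be 3.5
var_def = np.sum((faces - mean) ** 2 * p)    # E[(X - mu)^2]
var_comp = np.sum(faces**2 * p) - mean**2    # E[X^2] - E[X]^2

print(mean)       # ~3.5
print(var_def)    # ~2.9167
print(var_comp)   # same value as var_def
```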
Variance of Common Distributions
| Distribution | Parameters | Variance |
|---|---|---|
| Bernoulli | p | p(1 - p) |
| Binomial | n, p | np(1 - p) |
| Poisson | λ | λ |
| Geometric | p | (1 - p) / p² |
| Uniform(a, b) | a, b | (b - a)² / 12 |
| Exponential | λ | 1 / λ² |
| Normal | μ, σ² | σ² |
| Gamma | α, β | α / β² |
Standard Deviation: Back to Original Units
There's one annoying thing about variance: if X is measured in dollars, then Var(X) is in dollars squared. That's not intuitive!
Standard Deviation

SD(X) = σ = √Var(X)

Standard deviation is in the same units as the original data
Standard deviation (SD or σ) is simply the square root of variance. It brings us back to the original units, making it much more interpretable.
If X is in dollars:
- Var(X) is in dollars²
- SD(X) is in dollars
If X is in meters:
- Var(X) is in m²
- SD(X) is in meters
Rule of Thumb: Standard deviation tells you roughly how far a "typical" value is from the mean. If σ = 10 and μ = 50, a typical value sits about 10 units from 50, and most values fall within 20 units (two standard deviations).
Interactive: Understanding Units
See how variance and standard deviation relate, and how SD maintains interpretable units.
The 68-95-99.7 Rule
For normal distributions (bell curves), there's a beautiful pattern relating standard deviation to probability:
The Empirical Rule (68-95-99.7)
- About 68% of values fall within 1 standard deviation: P(μ - σ ≤ X ≤ μ + σ) ≈ 0.68
- About 95% fall within 2 standard deviations: P(μ - 2σ ≤ X ≤ μ + 2σ) ≈ 0.95
- About 99.7% fall within 3 standard deviations: P(μ - 3σ ≤ X ≤ μ + 3σ) ≈ 0.997
This means: if heights are normally distributed with μ = 170cm and σ = 10cm, then about 68% of people are between 160 and 180cm, 95% are between 150 and 190cm, and nearly everyone (99.7%) is between 140 and 200cm.
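The rule can be reproduced from the normal CDF (a sketch assuming SciPy is available, using the heights example above):

```python
from scipy import stats

norm = stats.norm(loc=170, scale=10)  # mu = 170 cm, sigma = 10 cm

for k in (1, 2, 3):
    # Probability of landing within k standard deviations of the mean
    prob = norm.cdf(170 + k * 10) - norm.cdf(170 - k * 10)
    print(f"P(within {k} sigma) = {prob:.4f}")
# ~0.6827, ~0.9545, ~0.9973
```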
Properties of Variance
Understanding these properties is essential for working with variance in practice:
1. Variance is Non-negative

Var(X) ≥ 0

Variance is zero only if X is constant (no randomness).
2. Variance of a Constant is Zero

Var(c) = 0

Constants don't vary!
3. Adding a Constant Doesn't Change Variance

Var(X + c) = Var(X)

Shifting the distribution doesn't change its spread.
4. Scaling by a Constant

Var(aX) = a² Var(X)

Multiplying X by a scales variance by a². Note the square!
5. Combined Linear Transformation

Var(aX + b) = a² Var(X)

Only scaling affects variance, not shifting.
6. Sum of Independent Random Variables

Var(X + Y) = Var(X) + Var(Y)

Only if X and Y are independent! Variances add.
7. General Sum (Not Independent)

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

The covariance term captures how X and Y vary together.
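These properties are easy to check by simulation (a minimal sketch with simulated standard normal draws; the constants a = 3 and b = 7 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)  # drawn independently of x
a, b = 3.0, 7.0

# Shifting leaves variance unchanged; scaling multiplies it by a^2.
print(np.var(x + b) / np.var(x))   # ~1
print(np.var(a * x) / np.var(x))   # ~9
# For independent X and Y, variances add (up to sampling noise).
print(np.var(x + y), np.var(x) + np.var(y))
```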
Interactive: Variance Properties
Adjust the scaling factor a and shift b to see how the variance changes. Notice that the shift b has no effect!
Population vs Sample Variance
In practice, we usually don't know the true population parameters. We estimate them from a sample. But there's a subtle issue...
Population Variance

σ² = (1/N) Σᵢ (xᵢ - μ)²

True variance (if we knew all N values)
Sample Variance

s² = (1/(n-1)) Σᵢ (xᵢ - x̄)²

Estimated variance from a sample of size n
Why n-1? (Bessel's Correction)
The n - 1 in the denominator is called Bessel's correction. It makes the sample variance an unbiased estimator of the population variance.
The intuition:
- We use the sample mean x̄ instead of the true mean μ
- The sample mean is calculated from the same data, so it's already "optimized" to be close to the data points
- This causes us to underestimate the true spread
- Dividing by n - 1 instead of n corrects for this bias
Real-World Applications
Finance: Risk Management
Portfolio Variance
In finance, variance is THE measure of risk. The famous Sharpe Ratio divides excess return by standard deviation:

Sharpe = (E[R] - R_f) / σ_R

where R_f is the risk-free rate. Higher Sharpe = better risk-adjusted returns
Quality Control: Six Sigma
Process Capability
Six Sigma methodology aims to reduce variance in manufacturing. A "six sigma" process has defects only outside 6 standard deviations—that's about 3.4 defects per million!
Physics: Measurement Uncertainty
Error Propagation
When combining measurements, errors propagate. If Z = X + Y, the uncertainty in Z is:

σ_Z = √(σ_X² + σ_Y²)

(assuming X and Y are independent)
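A Monte Carlo check of this propagation rule (a sketch with made-up uncertainties σ_X = 3 and σ_Y = 4, so σ_Z should come out near 5):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_x, sigma_y = 3.0, 4.0

x = rng.normal(0.0, sigma_x, size=200_000)
y = rng.normal(0.0, sigma_y, size=200_000)
z = x + y

print(np.std(z))                         # ~5.0 (empirical)
print(np.sqrt(sigma_x**2 + sigma_y**2))  # 5.0 (from the formula)
```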
Variance in Machine Learning
Variance appears everywhere in machine learning. Here are the key places:
1. Loss Functions
Mean Squared Error (MSE) is directly related to variance:

MSE = (1/n) Σᵢ (yᵢ - ŷᵢ)²

When the residuals have zero mean (as with a fitted intercept), this is the sample variance of the residuals. We use squared error because it's differentiable (for gradient descent) and penalizes large errors heavily.
2. Weight Initialization
Xavier/Glorot and He initialization carefully control the variance of initial weights:

Xavier: Var(W) = 2 / (n_in + n_out)    He: Var(W) = 2 / n_in

This prevents gradients from exploding or vanishing during training.
3. Batch Normalization
BatchNorm explicitly normalizes activations to have unit variance:

x̂ = (x - μ_B) / √(σ_B² + ε)
This stabilizes training by controlling the internal distribution of activations.
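The normalization step itself is a one-liner in NumPy (a minimal sketch of the standardization only, not a full BatchNorm layer with learned scale and shift parameters; the batch shape is made up):

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    """Standardize each feature of a batch to zero mean, unit variance."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

# A batch of 64 activations with 10 features, far from standardized
batch = np.random.default_rng(0).normal(5.0, 3.0, size=(64, 10))
normalized = batch_normalize(batch)

print(normalized.mean(axis=0))  # ~0 for every feature
print(normalized.var(axis=0))   # ~1 for every feature
```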
4. Regularization
L2 regularization (Ridge) reduces the variance of model predictions at the cost of introducing some bias. This is the bias-variance tradeoff in action!
The Bias-Variance Tradeoff
This is one of the most important concepts in machine learning. The expected prediction error can be decomposed as:
Bias-Variance Decomposition

E[(y - f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²
Bias
How far the average prediction is from the true value. High bias = underfitting.
Variance
How much predictions change across different training sets. High variance = overfitting.
Irreducible
Inherent noise in the data. Can't be reduced by any model.
The Tradeoff: Simple models have high bias but low variance. Complex models have low bias but high variance. The goal is to find the sweet spot!
Interactive: Bias-Variance Tradeoff
Adjust model complexity to see how bias and variance change. Watch for the sweet spot where total error is minimized!
Common Pitfalls
Var(X + Y) = Var(X) + Var(Y) holds only if X and Y are independent! In general, there's a covariance term: Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
Var(XY) ≠ Var(X) · Var(Y), even if X and Y are independent. The formula is much more complex.
Be careful about whether you're working with variance (σ²) or standard deviation (σ). Normal distributions are parameterized by σ², not σ.
When estimating from data, use n - 1 in the denominator, not n.
Some distributions (like Cauchy) have undefined variance because the integral doesn't converge. Always check!
Coefficient of Variation
Sometimes we want to compare variability across distributions with different scales. The coefficient of variation (CV) normalizes standard deviation by the mean:

CV = σ / μ
Example: If stock A has μ = $100, σ = $20 (CV = 20%) and stock B has μ = $10, σ = $5 (CV = 50%), then stock B is relatively more variable even though its standard deviation is smaller.
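The stock comparison above in a few lines (a minimal sketch; `coefficient_of_variation` is a hypothetical helper, not a library function):

```python
def coefficient_of_variation(sigma, mu):
    """CV = sigma / mu, often reported as a percentage."""
    return sigma / mu

# Stock A: mu = $100, sigma = $20 -> CV = 0.2 (20%)
print(coefficient_of_variation(20, 100))  # 0.2
# Stock B: mu = $10, sigma = $5 -> CV = 0.5 (50%), relatively more variable
print(coefficient_of_variation(5, 10))    # 0.5
```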
Python Implementation
import numpy as np
from scipy import stats

# Sample data
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# --- Method 1: From definition ---
def variance_definition(x):
    """Compute variance from the definition."""
    mu = np.mean(x)
    return np.mean((x - mu)**2)

# --- Method 2: Computational formula ---
def variance_computational(x):
    """Compute variance using E[X²] - E[X]²."""
    return np.mean(x**2) - np.mean(x)**2

# --- NumPy functions ---
# Population variance (divide by n)
pop_var = np.var(data)

# Sample variance (divide by n-1, Bessel's correction)
sample_var = np.var(data, ddof=1)

# Standard deviation
pop_std = np.std(data)
sample_std = np.std(data, ddof=1)

print(f"Population variance: {pop_var:.4f}")
print(f"Sample variance: {sample_var:.4f}")
print(f"Population std: {pop_std:.4f}")
print(f"Sample std: {sample_std:.4f}")

# --- Verify the formulas give the same result ---
print(f"\nDefinitional formula: {variance_definition(data):.4f}")
print(f"Computational formula: {variance_computational(data):.4f}")

# --- For distributions ---
# Normal distribution: Var = σ²
normal = stats.norm(loc=0, scale=2)  # μ=0, σ=2
print(f"\nNormal(0, 2) variance: {normal.var()}")  # Should be 4

# Poisson: Var = λ
poisson = stats.poisson(mu=5)  # λ=5
print(f"Poisson(5) variance: {poisson.var()}")  # Should be 5

# Bernoulli: Var = p(1-p)
bernoulli = stats.bernoulli(p=0.3)
print(f"Bernoulli(0.3) variance: {bernoulli.var()}")  # Should be 0.21

Test Your Understanding
Take this quiz to check your understanding of variance and standard deviation.
Summary
Key Takeaways
- Variance measures spread—the average squared distance from the mean
- We square deviations because raw deviations sum to zero, and squaring is differentiable
- Two formulas: the definitional E[(X - μ)²] and the computational E[X²] - (E[X])²
- Standard deviation = √variance, same units as data
- Key property: Var(aX + b) = a² Var(X)
- Sample variance uses n-1 (Bessel's correction)
- Bias-variance tradeoff is fundamental to ML model selection
- Variance appears everywhere in ML: loss functions, initialization, normalization, regularization
Final Thought: If expected value tells you where to aim, variance tells you how much you might miss by. Both are essential for understanding and quantifying uncertainty—the heart of statistics and machine learning.