Chapter 3: Expectation and Moments
Section 2 of 6

Variance and Standard Deviation

Learning Objectives

By the end of this section, you will:

  • Deeply understand what variance measures: the spread or dispersion of a distribution around its mean
  • Know why we square deviations (and not use absolute values)
  • Master both the definitional and computational formulas
  • Understand standard deviation and why it's in the same units as the data
  • Apply key properties: \text{Var}(aX + b) = a^2\text{Var}(X)
  • Distinguish population vs sample variance (Bessel's correction)
  • Connect variance to the bias-variance tradeoff in ML
  • Recognize variance in loss functions, regularization, and uncertainty
  • Avoid common pitfalls when working with variance

Historical Context

The Quest to Measure Uncertainty

While expected value tells us where a distribution is centered, scientists needed a way to quantify how spread out the data is. This became crucial for astronomy, where understanding measurement error was essential.

Carl Friedrich Gauss (1777-1855) developed least squares and worked extensively with error distributions. Karl Pearson (1857-1936) coined the term "standard deviation" in 1893, choosing it over earlier terms like "probable error."

Ronald Fisher (1890-1962) revolutionized the field by developing the analysis of variance (ANOVA) and proving why we divide by n-1 instead of n for sample variance.

  • 1809: Gauss publishes least squares
  • 1893: Pearson coins "standard deviation"
  • 1918: Fisher introduces the analysis of variance (ANOVA)

The Problem: Mean Isn't Enough

In the previous section, we learned that expected value tells us where the center of a distribution is. But consider these two scenarios:

Investment A

Returns: +8%, +12%, +10%, +10%, +10%

Mean: 10%

Investment B

Returns: -20%, +40%, +5%, +15%, +10%

Mean: 10%

Both investments have the same expected return of 10%. But would you treat them as equivalent? Clearly not! Investment B is much more risky—its returns swing wildly, while Investment A is stable.

The Core Insight: Expected value tells us where the distribution is centered. Variance tells us how spread out it is. Together, they give us a much richer picture of uncertainty.

What Does Variance Measure?

Variance = the average squared distance from the mean

It measures how "spread out" or "dispersed" the values of a random variable are around its expected value.

Think of variance as answering the question: "On average, how far are the values from the center?" (where "far" is measured in squared units).

Intuitive Picture

Low Variance

Values cluster tightly around the mean. The distribution is peaked and narrow.

Example: Heights of adult males in a country

High Variance

Values spread widely around the mean. The distribution is flat and wide.

Example: Annual incomes in a country


Interactive: Visualizing Spread

Use the slider below to see how variance affects the spread of a distribution. Notice how a larger variance makes the distribution wider and flatter.


Observation: As variance increases, the distribution becomes "flatter" and more spread out. The area under the curve always remains 1 (it's still a valid probability distribution), but the probability is spread over a wider range.

Mathematical Definition

Definition: Variance

\text{Var}(X) = E\left[(X - \mu)^2\right]

where \mu = E[X] is the expected value of X

Breaking Down the Formula

| Symbol | Meaning | Intuition |
|---|---|---|
| X | The random variable | The quantity whose spread we want to measure |
| μ = E[X] | Expected value (mean) | The center of the distribution |
| X − μ | Deviation from the mean | How far X is from the center |
| (X − μ)² | Squared deviation | Distance squared (always positive) |
| E[...] | Expected value of | Average over all possible outcomes |

So variance is literally: the expected value of the squared distance from the mean. It's a weighted average of "how far things are from the center," where farther deviations are penalized more heavily (because of squaring).

Notation: Variance is also written as \sigma^2 (sigma squared), \text{Var}(X), or V(X). All mean the same thing.

Why Do We Square the Deviations?

This is one of the most common questions students ask. Why not just average the deviations (X - \mu) directly? Or use absolute values |X - \mu|?

Problem 1: Deviations Sum to Zero

If we try to compute E[X - \mu], we get:

E[X - \mu] = E[X] - \mu = \mu - \mu = 0

The positive and negative deviations always cancel out! This tells us nothing about spread.

Two Solutions

Option 1: Mean Absolute Deviation (MAD)

E[|X - \mu|]

Take absolute values to prevent cancellation.

Option 2: Variance (squared deviations)

E[(X - \mu)^2]

Square to make everything positive.

Why Squaring Wins

  1. Differentiability: x^2 is differentiable everywhere, but |x| has a kink at 0. This matters hugely for optimization (gradient descent!).
  2. Mathematical tractability: For independent random variables, \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y). This beautiful additivity property doesn't hold for MAD.
  3. Connection to Euclidean geometry: Squared distance is the foundation of Euclidean space. This connects variance to least squares, PCA, and many ML algorithms.
  4. Central Limit Theorem: The normal distribution emerges naturally from summing independent random variables, and it's parameterized by variance (not MAD).
Deep Insight: Variance "won" because of its mathematical elegance, not because it's the only valid measure of spread. In robust statistics, MAD is sometimes preferred because it's less sensitive to outliers.
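Both of these points, the exact cancellation of raw deviations and the outlier sensitivity of squaring, are easy to verify numerically. A minimal sketch with made-up data:

```python
import numpy as np

# Made-up sample, purely for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
mu = x.mean()

# Raw deviations always average to exactly zero.
raw = np.mean(x - mu)

# The two candidate spread measures.
mad = np.mean(np.abs(x - mu))     # mean absolute deviation
var = np.mean((x - mu) ** 2)      # variance

print(raw, mad, var)  # 0.0  2.4  8.0

# Squaring penalizes outliers much more heavily: add one extreme point.
y = np.append(x, 100.0)
print(np.mean(np.abs(y - y.mean())))  # MAD grows modestly
print(np.mean((y - y.mean()) ** 2))   # variance explodes
```

Adding the single outlier roughly tenfolds MAD but multiplies the variance by more than a hundred, which is exactly the robustness concern mentioned above.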

Interactive: Why Squaring Works

Drag the data points and observe how positive and negative deviations cancel when not squared, but squaring captures the true spread.



The Computational Shortcut

There's an equivalent formula that's often easier to compute:

Computational Formula

\text{Var}(X) = E[X^2] - (E[X])^2

"The mean of the squares minus the square of the mean"

Proof

\text{Var}(X) = E[(X - \mu)^2]   (definition)
= E[X^2 - 2\mu X + \mu^2]   (expand the square)
= E[X^2] - 2\mu E[X] + \mu^2   (linearity of expectation)
= E[X^2] - 2\mu \cdot \mu + \mu^2   (since E[X] = \mu)
= E[X^2] - \mu^2   (simplify)
= E[X^2] - (E[X])^2
Always True: E[X^2] \geq (E[X])^2. This follows from variance being non-negative. Equality holds only when X is constant (no randomness).

Discrete vs Continuous Formulas

Discrete Random Variable

\text{Var}(X) = \sum_i (x_i - \mu)^2 P(X = x_i)

Sum over all possible values, weighted by probabilities.

Continuous Random Variable

\text{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx

Integrate over the real line, weighted by density.

Example: Variance of a Fair Die

For a fair 6-sided die, P(X = k) = 1/6 for k = 1, 2, 3, 4, 5, 6.

First, the mean: \mu = \frac{1+2+3+4+5+6}{6} = 3.5

Then the variance:

\text{Var}(X) = \frac{1}{6}\left[(1-3.5)^2 + (2-3.5)^2 + \cdots + (6-3.5)^2\right]
= \frac{1}{6}\left[6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25\right] = \frac{17.5}{6} \approx 2.917
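The die calculation is a good place to check that the definitional and computational formulas agree. A quick NumPy sketch:

```python
import numpy as np

# Fair die: values 1..6, each with probability 1/6.
vals = np.arange(1, 7)
probs = np.full(6, 1 / 6)

mu = np.sum(vals * probs)                       # 3.5
var_def = np.sum((vals - mu) ** 2 * probs)      # definitional: E[(X - mu)^2]
var_comp = np.sum(vals ** 2 * probs) - mu ** 2  # shortcut: E[X^2] - (E[X])^2

print(mu, var_def, var_comp)  # 3.5 and 35/12 ≈ 2.9167 twice
```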

Variance of Common Distributions

| Distribution | Parameters | Variance |
|---|---|---|
| Bernoulli | p | p(1 − p) |
| Binomial | n, p | np(1 − p) |
| Poisson | λ | λ |
| Geometric | p | (1 − p) / p² |
| Uniform(a, b) | a, b | (b − a)² / 12 |
| Exponential | λ | 1 / λ² |
| Normal | μ, σ² | σ² |
| Gamma | α, β | α / β² |
Poisson Special Case: For Poisson(λ), both the mean AND the variance equal λ. This equidispersion is a hallmark of the Poisson distribution.
Bernoulli Maximum: The variance p(1-p) is maximized when p = 0.5, giving variance 0.25. This is when uncertainty is highest (a coin flip is most unpredictable at 50-50).
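The Bernoulli claim is easy to confirm by scanning p over a grid; a small sketch:

```python
import numpy as np

# Bernoulli variance p(1 - p) over a grid of p values.
p = np.linspace(0, 1, 1001)
var = p * (1 - p)

print(p[np.argmax(var)], var.max())  # 0.5  0.25
```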

Standard Deviation: Back to Original Units

There's one annoying thing about variance: if X is measured in dollars, then \text{Var}(X) is in dollars squared. That's not intuitive!

Standard Deviation

\sigma = \sqrt{\text{Var}(X)} = \sqrt{E[(X - \mu)^2]}

Standard deviation is in the same units as the original data

Standard deviation (SD or σ) is simply the square root of variance. It brings us back to the original units, making it much more interpretable.

If X is in dollars:

  • Var(X) is in dollars²
  • SD(X) is in dollars

If X is in meters:

  • Var(X) is in meters²
  • SD(X) is in meters
Rule of Thumb: Standard deviation tells you roughly how far a "typical" value is from the mean. If μ = 50 and σ = 10, most values fall within one or two standard deviations of the mean, i.e., roughly between 30 and 70.

Interactive: Understanding Units

See how variance and standard deviation relate, and how SD maintains interpretable units.



The 68-95-99.7 Rule

For normal distributions (bell curves), there's a beautiful pattern relating standard deviation to probability:

The Empirical Rule (68-95-99.7)

  • 68% of values within 1σ of the mean
  • 95% within 2σ
  • 99.7% within 3σ

This means: if heights are normally distributed with μ = 170cm and σ = 10cm, then about 68% of people are between 160-180cm, 95% are between 150-190cm, and nearly everyone (99.7%) is between 140-200cm.

Beyond Normal: For ANY distribution (not just normal), Chebyshev's inequality guarantees that at least 1 - 1/k^2 of values are within k standard deviations of the mean. For k = 2: at least 75%. For k = 3: at least 89%.
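Both the empirical rule and Chebyshev's distribution-free bound can be checked by simulation. A sketch using the height example above (μ = 170 cm, σ = 10 cm):

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=100_000)

for k in (1, 2, 3):
    within = np.mean(np.abs(heights - 170) < k * 10)  # observed fraction
    chebyshev = 1 - 1 / k**2                          # bound for ANY distribution
    print(f"k={k}: {within:.3f} observed (Chebyshev guarantees >= {chebyshev:.2f})")
```

The observed fractions land near 0.68, 0.95, and 0.997, comfortably above the Chebyshev guarantees, which is expected: the normal is far better behaved than the worst case the inequality allows for.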

Properties of Variance

Understanding these properties is essential for working with variance in practice:

1. Variance is Non-negative

\text{Var}(X) \geq 0

Variance is zero only if X is constant (no randomness).

2. Variance of a Constant is Zero

\text{Var}(c) = 0

Constants don't vary!

3. Adding a Constant Doesn't Change Variance

\text{Var}(X + c) = \text{Var}(X)

Shifting the distribution doesn't change its spread.

4. Scaling by a Constant

\text{Var}(aX) = a^2 \text{Var}(X)

Multiplying X by a scales variance by a². Note the square!

5. Combined Linear Transformation

\text{Var}(aX + b) = a^2 \text{Var}(X)

Only scaling affects variance, not shifting.

6. Sum of Independent Random Variables

\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)

Only if X and Y are independent! Variances add.

7. General Sum (Not Independent)

\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)

The covariance term captures how X and Y vary together.
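Properties 3 through 7 can all be verified on simulated data; a sketch (the particular distributions chosen here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500_000)  # Var(x) ≈ 4
y = rng.normal(size=500_000)                  # independent of x, Var(y) ≈ 1

a, b = 3.0, 7.0
print(np.var(x + b) - np.var(x))                 # shifting: ≈ 0
print(np.var(a * x) / np.var(x))                 # scaling: ≈ a² = 9
print(np.var(x + y) - (np.var(x) + np.var(y)))   # independent sum: ≈ 0

# Correlated case: the 2·Cov(X, Y) term is needed.
w = x + rng.normal(size=x.size)  # w is built from x, so correlated with it
lhs = np.var(x + w)
rhs = np.var(x) + np.var(w) + 2 * np.cov(x, w, ddof=0)[0, 1]
print(lhs - rhs)                 # ≈ 0 (this one is an exact sample identity)
```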


Interactive: Variance Properties

Adjust the scaling factor a and shift b to see how the variance changes. Notice that b has no effect!



Population vs Sample Variance

In practice, we usually don't know the true population parameters. We estimate them from a sample. But there's a subtle issue...

Population Variance

\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2

True variance (if we knew all values)

Sample Variance

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2

Estimated variance from a sample

Why n-1? (Bessel's Correction)

The n-1 in the denominator is called Bessel's correction. It makes the sample variance an unbiased estimator of the population variance.

The intuition:

  • We use the sample mean \bar{x} instead of the true mean \mu
  • The sample mean is calculated from the same data, so it's already "optimized" to be close to the data points
  • This causes us to underestimate the true spread
  • Dividing by n-1 instead of n corrects for this bias
Degrees of Freedom: We have n data points but we "used up" one degree of freedom to estimate the mean. So we have n-1 degrees of freedom left for estimating variance.
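Bessel's correction can be seen directly by simulation: draw many small samples from a distribution whose variance we know, and average both estimators. A sketch with N(0, σ² = 4) and samples of size 5:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5           # small samples exaggerate the bias
true_var = 4.0  # σ² of N(0, 2²)

samples = rng.normal(scale=2.0, size=(200_000, n))
biased = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n-1 (Bessel)

print(biased)    # ≈ (n-1)/n · 4 = 3.2, systematically too low
print(unbiased)  # ≈ 4.0
```

The uncorrected estimator averages (n−1)/n times the true variance, exactly the shortfall the n−1 denominator repairs.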

Real-World Applications

Finance: Risk Management

Portfolio Variance

In finance, variance is THE measure of risk. The famous Sharpe Ratio divides excess return by standard deviation:

\text{Sharpe} = \frac{R_p - R_f}{\sigma_p}

Higher Sharpe = better risk-adjusted returns
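As a toy illustration, here is the Sharpe ratio of Investment A from the opening example, with a hypothetical 2% risk-free rate (both the rate and the tiny sample are purely for demonstration):

```python
import numpy as np

returns = np.array([0.08, 0.12, 0.10, 0.10, 0.10])  # Investment A
risk_free = 0.02                                    # hypothetical rate

excess = returns.mean() - risk_free
sharpe = excess / returns.std(ddof=1)
print(round(sharpe, 2))  # ≈ 5.66 (unrealistically high: five made-up returns)
```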

Quality Control: Six Sigma

Process Capability

Six Sigma methodology aims to reduce variance in manufacturing. In a "six sigma" process, the nearest specification limit sits six standard deviations from the process mean; with the conventional 1.5σ allowance for process drift, that corresponds to about 3.4 defects per million opportunities.

Physics: Measurement Uncertainty

Error Propagation

When combining measurements, errors propagate. If Z = X + Y, the uncertainty in Z is:

\sigma_Z = \sqrt{\sigma_X^2 + \sigma_Y^2}

(assuming X and Y are independent)
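A simulation check of the propagation formula (the σ values are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma_x, sigma_y = 3.0, 4.0

x = rng.normal(scale=sigma_x, size=1_000_000)
y = rng.normal(scale=sigma_y, size=1_000_000)  # independent of x

z = x + y
print(z.std())                     # observed spread of Z
print(np.hypot(sigma_x, sigma_y))  # predicted: sqrt(3² + 4²) = 5.0
```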


Variance in Machine Learning

Variance appears everywhere in machine learning. Here are the key places:

1. Loss Functions

Mean Squared Error (MSE) is directly related to variance:

\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2

This is the mean squared residual; when the residuals average to zero (as with a fitted intercept), it equals the variance of the residuals. We use squared error because it's differentiable (for gradient descent) and penalizes large errors heavily.

2. Weight Initialization

Xavier/Glorot and He initialization carefully control the variance of initial weights:

\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}

(shown here in the Glorot form; He initialization instead uses \text{Var}(W) = 2/n_{\text{in}})

This prevents gradients from exploding or vanishing during training.

3. Batch Normalization

BatchNorm explicitly normalizes activations to have unit variance:

\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}

This stabilizes training by controlling the internal distribution of activations.
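A minimal sketch of just the normalization step (forward pass only, omitting the learned scale and shift parameters γ and β that real BatchNorm layers also apply):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) of a batch to mean 0, variance ≈ 1."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
acts = rng.normal(loc=5.0, scale=3.0, size=(64, 8))  # batch of 64, 8 features
out = batch_norm(acts)

print(out.mean(axis=0).round(4))  # ≈ 0 for every feature
print(out.var(axis=0).round(4))   # ≈ 1 for every feature
```

The ε in the denominator is the same numerical-stability trick as in the formula above: it keeps the division safe when a feature's batch variance is near zero.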

4. Regularization

L2 regularization (Ridge) reduces the variance of model predictions at the cost of introducing some bias. This is the bias-variance tradeoff in action!


The Bias-Variance Tradeoff

This is one of the most important concepts in machine learning. The expected prediction error can be decomposed as:

Bias-Variance Decomposition

\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}

Bias

How far the average prediction is from the true value. High bias = underfitting.

Variance

How much predictions change across different training sets. High variance = overfitting.

Irreducible

Inherent noise in the data. Can't be reduced by any model.

The Tradeoff: Simple models have high bias but low variance. Complex models have low bias but high variance. The goal is to find the sweet spot!
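The decomposition can be watched in action with a small simulation: fit polynomials of different degrees to fresh noisy training sets and track how the predictions at one test point scatter. Everything here (the sine target, noise level, and sample sizes) is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # "true" function (arbitrary choice)
x_test, noise, n_train = 0.3, 0.3, 15

def predictions(degree, n_trials=1000):
    """Fit a degree-d polynomial to fresh training sets; predict at x_test."""
    preds = []
    for _ in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(scale=noise, size=n_train)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    return np.array(preds)

results = {}
for d in (1, 3, 6):
    p = predictions(d)
    results[d] = ((p.mean() - f(x_test)) ** 2, p.var())  # (bias², variance)
    print(f"degree {d}: bias² = {results[d][0]:.4f}, variance = {results[d][1]:.4f}")
```

A straight line (degree 1) cannot follow the sine, so its bias² dominates; the degree-6 fit tracks the curve closely but its predictions swing much more from one training set to the next.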


Interactive: Bias-Variance Tradeoff

Adjust model complexity to see how bias and variance change. Watch for the sweet spot where total error is minimized!



Common Pitfalls

Pitfall 1: Variance of Sum ≠ Sum of Variances (in general)

\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) only if X and Y are independent (or at least uncorrelated)! In general, there's a covariance term.

Pitfall 2: Variance of Product

\text{Var}(XY) \neq \text{Var}(X) \cdot \text{Var}(Y), even if X and Y are independent. For independent X and Y, the correct formula also involves the means: \text{Var}(XY) = \text{Var}(X)\text{Var}(Y) + \text{Var}(X)E[Y]^2 + \text{Var}(Y)E[X]^2.

Pitfall 3: Confusing σ and σ²

Be careful about whether you're working with variance (σ²) or standard deviation (σ). The standard notation N(μ, σ²) parameterizes the normal distribution by its variance, while many software libraries (e.g., SciPy's scale argument) expect the standard deviation.

Pitfall 4: Using Population Formula for Samples

When estimating from data, use n-1 in the denominator, not n.

Pitfall 5: Variance Doesn't Always Exist

Some distributions (like Cauchy) have undefined variance because the integral doesn't converge. Always check!
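You can watch this failure numerically: the sample variance of Cauchy draws never settles down as the sample grows, because rare enormous values keep dominating. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample variance of Cauchy draws does NOT converge as n grows.
for n in (10**3, 10**4, 10**5, 10**6):
    draws = rng.standard_cauchy(n)
    print(n, draws.var())

# By contrast, the normal's sample variance settles near its true value of 1.
print(rng.standard_normal(10**6).var())
```

Rerunning with different seeds gives wildly different Cauchy numbers at every sample size, which is exactly what "the integral doesn't converge" looks like in practice.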


Coefficient of Variation

Sometimes we want to compare variability across distributions with different scales. The coefficient of variation (CV) normalizes standard deviation by the mean:

\text{CV} = \frac{\sigma}{\mu} \times 100\%

Example: If stock A has μ = $100, σ = $20 (CV = 20%) and stock B has μ = $10, σ = $5 (CV = 50%), then stock B is relatively more variable even though its standard deviation is smaller.

CV only makes sense when the mean is positive and represents a true zero point (ratio scale). It's meaningless for temperature in Celsius, for example.
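A tiny helper makes the stock comparison concrete (the guard clause encodes the ratio-scale caveat just mentioned):

```python
def coeff_of_variation(mean, std):
    """CV as a percentage; only meaningful for ratio-scale data with mean > 0."""
    if mean <= 0:
        raise ValueError("CV requires a positive mean on a ratio scale")
    return std / mean * 100

print(coeff_of_variation(100, 20))  # stock A: 20.0 (%)
print(coeff_of_variation(10, 5))    # stock B: 50.0 (%), relatively more variable
```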

Python Implementation

```python
import numpy as np
from scipy import stats

# Sample data
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# --- Method 1: From definition ---
def variance_definition(x):
    """Compute variance from the definition."""
    mu = np.mean(x)
    return np.mean((x - mu)**2)

# --- Method 2: Computational formula ---
def variance_computational(x):
    """Compute variance using E[X²] - E[X]²."""
    return np.mean(x**2) - np.mean(x)**2

# --- NumPy functions ---
# Population variance (divide by n)
pop_var = np.var(data)

# Sample variance (divide by n-1, Bessel's correction)
sample_var = np.var(data, ddof=1)

# Standard deviation
pop_std = np.std(data)
sample_std = np.std(data, ddof=1)

print(f"Population variance: {pop_var:.4f}")
print(f"Sample variance: {sample_var:.4f}")
print(f"Population std: {pop_std:.4f}")
print(f"Sample std: {sample_std:.4f}")

# --- Verify the formulas give the same result ---
print(f"\nDefinitional formula: {variance_definition(data):.4f}")
print(f"Computational formula: {variance_computational(data):.4f}")

# --- For distributions ---
# Normal distribution: Var = σ²
normal = stats.norm(loc=0, scale=2)  # μ=0, σ=2
print(f"\nNormal(0, 2²) variance: {normal.var()}")  # Should be 4

# Poisson: Var = λ
poisson = stats.poisson(mu=5)  # λ=5
print(f"Poisson(5) variance: {poisson.var()}")  # Should be 5

# Bernoulli: Var = p(1-p)
bernoulli = stats.bernoulli(p=0.3)
print(f"Bernoulli(0.3) variance: {bernoulli.var()}")  # Should be 0.21
```

Test Your Understanding

Take this quiz to check your understanding of variance and standard deviation.



Summary

Key Takeaways

  1. Variance measures spread—the average squared distance from the mean
  2. We square deviations because raw deviations sum to zero, and squaring is differentiable
  3. Two formulas: definitional E[(X-\mu)^2] and computational E[X^2] - (E[X])^2
  4. Standard deviation = √variance, same units as data
  5. Key property: \text{Var}(aX+b) = a^2\text{Var}(X)
  6. Sample variance uses n-1 (Bessel's correction)
  7. Bias-variance tradeoff is fundamental to ML model selection
  8. Variance appears everywhere in ML: loss functions, initialization, normalization, regularization
Final Thought: If expected value tells you where to aim, variance tells you how much you might miss by. Both are essential for understanding and quantifying uncertainty—the heart of statistics and machine learning.
Next Up: In the next section, we'll explore higher moments—skewness and kurtosis—which tell us about the shape of distributions beyond just center and spread.