Learning Objectives
Before You Start
You should understand what an estimator is from Section 01. Specifically, you should know that an estimator $\hat{\theta}$ is a function of the sample data that produces an estimate of the unknown parameter $\theta_0$.
By the end of this section, you will be able to:
Understand what it means for an estimator to systematically over- or under-estimate the truth
Understand how much an estimator "wiggles" across different samples
Master the beautiful formula: MSE = Bias² + Variance
Recognize the bias-variance tradeoff, a principle that runs through much of statistics and machine learning
Connect bias-variance to underfitting, overfitting, and model selection
By the end of this section, you'll understand the two fundamental sources of error in any estimation procedure and how they combine into a single quality metric (MSE). This knowledge is essential for understanding why some estimators are preferred over others — even if they're biased!
The Big Picture: How Good Is Your Estimator?
The Central Question: Given multiple ways to estimate a parameter, how do we decide which estimator is "best"?
In Section 01, we learned that an estimator $\hat{\theta}$ is a function that takes sample data and produces an estimate of the unknown true parameter $\theta_0$. But here's the crucial insight:
The Estimator Is a Random Variable
Because $\hat{\theta}$ depends on random sample data, the estimator itself is a random variable. If you collected a different sample, you'd get a different estimate!
This randomness raises a fundamental question: How do we evaluate the quality of an estimator that produces different values each time?
Remember our coffee shop example? The true average wait time is $\mu$ minutes (but we don't know this). We sample $n$ customers and compute the sample mean $\bar{X}$.
Different samples give different estimates! How do we evaluate the quality of this estimator?
We need metrics that capture two distinct types of error:
Does the estimator consistently over- or under-estimate the truth? This is measured by BIAS.
How much does the estimator fluctuate from sample to sample? This is measured by VARIANCE.
The Fundamental Insight
Total Error = Systematic Error + Random Error. This is the essence of the bias-variance decomposition that we'll derive in this section.
What Is Bias?
Intuitive Understanding of Bias
Imagine you own a bathroom scale that displays weights. Unknown to you, the scale has a manufacturing defect that makes it always show 2 kg more than the actual weight.
| True Weight | Scale Reading | Error |
|---|---|---|
| 70 kg | 72 kg | +2 kg |
| 65 kg | 67 kg | +2 kg |
| 80 kg | 82 kg | +2 kg |
Key observation: No matter how many times you weigh yourself, the scale will always overestimate by 2 kg. More measurements won't help! This +2 kg is the bias.
Bias is the systematic, consistent error that doesn't go away with more data. It represents where your estimator is "aiming" relative to the truth.
Here's why this matters: A biased estimator gives you a false sense of confidence. You might collect more and more data, seeing your estimates converge — but they're converging to the wrong value!
Mathematical Definition
Formally, the bias of an estimator is the difference between its expected value and the true parameter:
$$\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta_0$$
An estimator is called unbiased if its bias is zero:
$$E[\hat{\theta}] = \theta_0$$
What Unbiased Really Means
An unbiased estimator hits the target on average. If you could repeat the experiment infinitely many times with different samples, the average of all your estimates would equal the true parameter.
This does NOT mean that any single estimate will be exactly correct!
| Bias Value | Interpretation | Example |
|---|---|---|
| Bias > 0 | Estimator overestimates on average | Scale adds 2 kg |
| Bias < 0 | Estimator underestimates on average | Survey that misses certain demographics |
| Bias = 0 | Estimator is unbiased — hits target on average | Sample mean for the population mean |
Bias Examples
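As one hedged illustration (a textbook case chosen for this sketch, not necessarily the example the section originally showed), consider estimating the variance of a normal population by dividing the sum of squared deviations by n versus by n − 1. The function name `variance_estimator_bias`, the seed, and the parameter values are all illustrative assumptions.

```python
import numpy as np

def variance_estimator_bias(true_var=4.0, n=5, n_simulations=200_000, seed=0):
    """Approximate the bias of two variance estimators by repeated sampling."""
    rng = np.random.default_rng(seed)
    # Each row is one independent sample of size n from N(0, true_var).
    samples = rng.normal(0.0, np.sqrt(true_var), size=(n_simulations, n))
    divide_by_n = samples.var(axis=1, ddof=0)          # divides by n
    divide_by_n_minus_1 = samples.var(axis=1, ddof=1)  # divides by n - 1
    return {
        "bias_divide_by_n": divide_by_n.mean() - true_var,
        "bias_divide_by_n_minus_1": divide_by_n_minus_1.mean() - true_var,
        "theoretical_bias_divide_by_n": -true_var / n,
    }

bias_results = variance_estimator_bias()
for name, value in bias_results.items():
    print(f"{name}: {value:+.4f}")
```

Under these assumptions the divide-by-n estimator underestimates the variance by roughly σ²/n on average, while the divide-by-(n − 1) version averages out close to the true value.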
What Is Variance?
Intuitive Understanding of Variance
Now let's consider a different scenario. Imagine two archers shooting at a target:
Archer A (low variance): Arrows cluster tightly together. Even if they're slightly off-center, the shots are predictable and consistent.
Archer B (high variance): Arrows are scattered all over the target. Each shot is unpredictable, so you never know where the next arrow will land.
Variance measures the "spread" or "uncertainty" in your estimator. It tells you how much the estimate would change if you collected a different sample.
Unlike bias, variance can be reduced by collecting more data. With more observations, your estimate becomes more stable.
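Formally, the variance of an estimator is its expected squared deviation from its own expected value:
$$\mathrm{Var}(\hat{\theta}) = E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]$$
Equivalently, using the alternative formula:
$$\mathrm{Var}(\hat{\theta}) = E[\hat{\theta}^2] - (E[\hat{\theta}])^2$$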
Variance Examples
For the sample mean $\bar{X}$ of $n$ i.i.d. observations with variance $\sigma^2$, $\mathrm{Var}(\bar{X}) = \sigma^2/n$:
| Sample Size (n) | Var(X̄) | Interpretation |
|---|---|---|
| n = 10 | σ²/10 | Moderate spread |
| n = 100 | σ²/100 | Much tighter |
| n = 1000 | σ²/1000 | Very precise |
| n → ∞ | → 0 | Perfect precision! |
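The σ²/n rule in the table can be checked by simulation. The sketch below is a minimal illustration assuming normally distributed data with σ² = 4; the function name `sample_mean_variance` and the simulation sizes are choices made for this example.

```python
import numpy as np

def sample_mean_variance(n, true_var=4.0, n_simulations=2000, seed=0):
    """Empirical variance of the sample mean across repeated samples of size n."""
    rng = np.random.default_rng(seed)
    # Each row is one sample of size n; take the mean of every row.
    samples = rng.normal(0.0, np.sqrt(true_var), size=(n_simulations, n))
    return samples.mean(axis=1).var()

for n in (10, 100, 1000):
    empirical = sample_mean_variance(n)
    print(f"n={n:>4}: empirical Var = {empirical:.5f}, sigma^2/n = {4.0 / n:.5f}")
```

The empirical values should track σ²/n up to Monte Carlo noise, which itself shrinks as `n_simulations` grows.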
Imagine each estimator as a shooter aiming at a target. The bullseye is the true parameter. Each shot represents an estimate from a different sample.
The Ideal Estimator
Shots cluster tightly around the bullseye
Consistent but Wrong
Shots cluster tightly but miss the center
Right on Average, but Scattered
Shots are spread out but centered on bullseye
The Worst Case
Shots are spread out AND miss the center
Sample Size Effects: Variance Shrinks, Bias Stays
A critical insight is that more data reduces variance but cannot fix bias. For the sample mean $\bar{X}$:
$$E[\bar{X}] = \mu \quad (\text{bias} = 0), \qquad \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$$
The variance decreases proportionally to $1/n$, meaning the standard error (standard deviation of the estimator) decreases as $1/\sqrt{n}$. To halve your standard error, you need to quadruple your sample size!
For the sample mean estimator with population variance σ² = 4:
| n | Var(X̄) = σ²/n | Std Error = σ/√n | Variance reduction vs. n = 5 |
|---|---|---|---|
| 5 | 0.8000 | 0.8944 | 0% |
| 10 | 0.4000 | 0.6325 | 50% |
| 20 | 0.2000 | 0.4472 | 75% |
| 50 | 0.0800 | 0.2828 | 90% |
| 100 | 0.0400 | 0.2000 | 95% |
| 200 | 0.0200 | 0.1414 | 98% |
| 500 | 0.0080 | 0.0894 | 99% |
Notice: Bias stays at 0 regardless of n. Only variance decreases with more data!
Bias Is Immune to Sample Size
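If your estimation method has a built-in bias, like the scale's constant +2 kg, collecting more data will not eliminate it. Such an estimator converges to the wrong value as $n \to \infty$. You must fix the methodology itself to remove bias. (Some estimators have a bias that shrinks as $n$ grows, such as the variance estimator that divides by $n$; the warning applies to bias that does not depend on $n$.)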
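The bathroom-scale example can be turned into a small simulation. This is a sketch under stated assumptions: the true weight of 70 kg and the +2 kg offset come from the example above, while the reading noise of 0.5 kg and the name `average_reading` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_weight = 70.0      # kg, the quantity we want to estimate
scale_offset = 2.0      # kg, the constant defect from the bathroom-scale example
noise_sd = 0.5          # kg, assumed reading-to-reading noise (illustrative)

def average_reading(n_readings):
    """Average n noisy readings from the miscalibrated scale."""
    readings = true_weight + scale_offset + rng.normal(0.0, noise_sd, n_readings)
    return readings.mean()

for n in (10, 1_000, 100_000):
    print(f"n={n:>6}: average reading = {average_reading(n):.3f} kg")
```

As n grows the average settles down, but it settles near 72 kg rather than 70 kg: the noise averages out while the offset does not.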
What Is Mean Squared Error (MSE)?
Now we need a single number that captures the overall quality of an estimator, combining both bias and variance. This is the Mean Squared Error (MSE).
MSE measures the average squared distance between our estimate and the truth — the "typical" error magnitude, squared.
$$\mathrm{MSE}(\hat{\theta}) = E\big[(\hat{\theta} - \theta_0)^2\big]$$
This is the expected squared distance from the estimate to the true parameter value.
Why squared error? Several reasons:
- Penalizes large errors more: An error of 10 contributes 100, but an error of 2 only contributes 4
- Mathematical convenience: Differentiable and leads to elegant decomposition
- Always non-negative: No cancellation between over- and under-estimates
The Beautiful Decomposition: MSE = Bias² + Variance
Here's one of the most important results in all of statistics:
The Bias-Variance Decomposition
$$\mathrm{MSE}(\hat{\theta}) = \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta})$$
Understanding the Decomposition
This formula says that the total "badness" of an estimator (measured by MSE) comes from exactly two sources:
- Bias²: The squared systematic error — how far the average estimate is from truth
- Variance: The random fluctuation — how spread out the estimates are
Red dots = individual estimates from different samples. Green line = true value. Blue line = average of estimates.
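The identity MSE = Bias² + Variance follows from adding and subtracting $E[\hat{\theta}]$ inside the square. A short derivation, using only the definitions of bias and variance from this section:

```latex
\begin{aligned}
\mathrm{MSE}(\hat{\theta})
  &= E\big[(\hat{\theta} - \theta_0)^2\big] \\
  &= E\big[\big((\hat{\theta} - E[\hat{\theta}]) + (E[\hat{\theta}] - \theta_0)\big)^2\big] \\
  &= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]
     + 2\,(E[\hat{\theta}] - \theta_0)\,E\big[\hat{\theta} - E[\hat{\theta}]\big]
     + (E[\hat{\theta}] - \theta_0)^2 \\
  &= \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 .
\end{aligned}
```

The cross term vanishes because $E[\hat{\theta} - E[\hat{\theta}]] = 0$, and $E[\hat{\theta}] - \theta_0$ is a constant, so it can be pulled out of the expectation.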
MSE as Risk: A Deeper View
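In statistical decision theory, we evaluate estimators using risk functions. The MSE is simply the risk when we use squared error as our loss function:
$$R(\theta_0, \hat{\theta}) = E\big[L(\theta_0, \hat{\theta})\big] = E\big[(\hat{\theta} - \theta_0)^2\big] = \mathrm{MSE}(\hat{\theta})$$
where $L(\theta_0, \hat{\theta}) = (\hat{\theta} - \theta_0)^2$ is the squared error loss.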
Generalizing to Vector Estimators
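So far we've focused on estimating a single parameter $\theta_0$. In practice, we often estimate vectors of parameters $\boldsymbol{\theta}_0 \in \mathbb{R}^p$, such as all coefficients in a regression model.
$$\mathrm{Bias}(\hat{\boldsymbol{\theta}}) = E[\hat{\boldsymbol{\theta}}] - \boldsymbol{\theta}_0$$
Each component has its own bias. The estimator is unbiased if all components are unbiased.
$$\mathrm{Cov}(\hat{\boldsymbol{\theta}}) = E\big[(\hat{\boldsymbol{\theta}} - E[\hat{\boldsymbol{\theta}}])(\hat{\boldsymbol{\theta}} - E[\hat{\boldsymbol{\theta}}])^\top\big]$$
The covariance matrix captures both individual variances (diagonal) and covariances between components (off-diagonal).
$$\mathrm{MSE}(\hat{\boldsymbol{\theta}}) = E\big[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)^\top\big] = \mathrm{Cov}(\hat{\boldsymbol{\theta}}) + \mathrm{Bias}(\hat{\boldsymbol{\theta}})\,\mathrm{Bias}(\hat{\boldsymbol{\theta}})^\top$$
The matrix analog of MSE = Variance + Bias².
$$\operatorname{tr}\,\mathrm{MSE}(\hat{\boldsymbol{\theta}}) = E\big[\lVert \hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0 \rVert^2\big] = \operatorname{tr}\,\mathrm{Cov}(\hat{\boldsymbol{\theta}}) + \lVert \mathrm{Bias}(\hat{\boldsymbol{\theta}}) \rVert^2$$
The trace gives a single number summarizing total estimation quality.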
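The matrix form of the decomposition can be checked numerically. The sketch below assumes a two-dimensional mean vector estimated by a deliberately shrunken sample mean (0.8 times the sample mean), so that the bias is nonzero; the parameter values and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_0 = np.array([1.0, -2.0])            # true parameter vector (illustrative)
n, n_simulations = 20, 5000

# One estimate per simulated sample: a shrunken sample mean, so bias is nonzero.
samples = rng.normal(theta_0, 1.0, size=(n_simulations, n, 2))
estimates = 0.8 * samples.mean(axis=1)     # shape (n_simulations, 2)

bias_vec = estimates.mean(axis=0) - theta_0
centered = estimates - estimates.mean(axis=0)
cov_mat = centered.T @ centered / n_simulations      # empirical covariance matrix
errors = estimates - theta_0
mse_mat = errors.T @ errors / n_simulations          # empirical MSE matrix

decomposed = cov_mat + np.outer(bias_vec, bias_vec)
print("MSE matrix:\n", mse_mat)
print("Cov + bias bias^T:\n", decomposed)
print("trace(MSE) =", np.trace(mse_mat))
print("trace(Cov) + ||bias||^2 =", np.trace(cov_mat) + bias_vec @ bias_vec)
```

The two matrices agree up to floating-point rounding, because the decomposition is an algebraic identity that also holds for the empirical moments.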
When Does Covariance Matter?
For inference about a single component, only its marginal variance matters. But for simultaneous inference (e.g., confidence regions, hypothesis tests about multiple parameters), the full covariance structure is essential. Ignoring correlations can lead to overconfident or misleading conclusions.
Technical Assumptions & Connections
For bias, variance, and MSE to be well-defined, we need the relevant moments to exist:
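- Bias: Requires $E[|\hat{\theta}|] < \infty$ (first moment exists)
- Variance & MSE: Requires $E[\hat{\theta}^2] < \infty$ (second moment exists)
For heavy-tailed distributions (e.g., Cauchy), the sample mean has no finite first or second moment, so its bias is undefined and its MSE is not finite. In such cases, robust estimators (like the sample median) and other loss functions (like absolute error) are more appropriate.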
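A quick simulation makes the Cauchy case concrete. Because variances are not usable here, this sketch measures spread with the interquartile range instead; the helper `iqr`, the sample sizes, and the comparison with the sample median are choices made for this illustration.

```python
import numpy as np

def iqr(values):
    """Interquartile range: a spread measure that exists even without moments."""
    q25, q75 = np.percentile(values, [25, 75])
    return q75 - q25

rng = np.random.default_rng(0)
n_simulations = 2000
cauchy_spread = {}
for n in (10, 100, 1000):
    # Each row is one sample of size n from a standard Cauchy distribution.
    samples = rng.standard_cauchy(size=(n_simulations, n))
    cauchy_spread[n] = (iqr(samples.mean(axis=1)), iqr(np.median(samples, axis=1)))
    print(f"n={n:>4}: IQR of sample mean = {cauchy_spread[n][0]:.3f}, "
          f"IQR of sample median = {cauchy_spread[n][1]:.3f}")
```

The spread of the sample mean does not shrink as n grows (the mean of n standard Cauchy draws is itself standard Cauchy), while the spread of the sample median does.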
In the next section on Consistency and Efficiency, we'll see that:
- Consistency: If MSE → 0 as n → ∞, the estimator converges to the true value
- Efficiency: Achieving lowest possible variance among unbiased estimators
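There's a fundamental limit to how small the variance of an unbiased estimator can be (the Cramér-Rao lower bound):
$$\mathrm{Var}(\hat{\theta}) \ge \frac{1}{I_n(\theta_0)}$$
where $I_n(\theta_0)$ is the Fisher Information of the sample (equal to $n\,I(\theta_0)$ for $n$ i.i.d. observations). We'll explore this in Chapter 12 (Fisher Information and CRLB).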
Worked Example: Bernoulli Mean Estimation
Let's apply bias, variance, and MSE to a concrete problem: estimating the probability of success from coin flips.
You flip a coin n = 10 times and observe X = 3 heads. The true probability is unknown (say, p = 0.4). Compare two estimators:
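The two estimators compared in the original example are not listed above. As a stand-in, the sketch below assumes a common pairing: the sample proportion X/n and an add-one smoothed estimator (X + 1)/(n + 2). Both choices, and the function name `bernoulli_estimator_metrics`, are assumptions of this sketch. Because X is binomial, the bias, variance, and MSE can be computed exactly rather than simulated.

```python
def bernoulli_estimator_metrics(p, n):
    """Exact bias, variance, and MSE for two estimators of a Bernoulli p.

    Both estimators are assumptions of this sketch:
      mle      = X / n              (sample proportion)
      smoothed = (X + 1) / (n + 2)  (add-one smoothing, shrinks toward 0.5)
    where X ~ Binomial(n, p) is the number of heads.
    """
    var_x = n * p * (1 - p)                      # Var(X) for a binomial count
    metrics = {}
    # Sample proportion: E[X/n] = p, Var(X/n) = p(1-p)/n.
    metrics["mle"] = {"bias": 0.0, "variance": var_x / n**2}
    # Smoothed: E = (np + 1)/(n + 2), Var = Var(X)/(n + 2)^2.
    metrics["smoothed"] = {
        "bias": (n * p + 1) / (n + 2) - p,
        "variance": var_x / (n + 2) ** 2,
    }
    for m in metrics.values():
        m["mse"] = m["bias"] ** 2 + m["variance"]
    return metrics

bernoulli_metrics = bernoulli_estimator_metrics(p=0.4, n=10)
print("Estimates from X = 3 heads:", 3 / 10, "and", (3 + 1) / (10 + 2))
for name, m in bernoulli_metrics.items():
    print(f"{name:>8}: bias={m['bias']:+.4f}  var={m['variance']:.4f}  mse={m['mse']:.4f}")
```

With p = 0.4 and n = 10 the smoothed estimator has the lower MSE despite its bias. The ranking depends on the true p: for p close to 0 or 1 the sample proportion comes out ahead.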
The Bias-Variance Tradeoff
Here's a profound insight that underlies all of statistics and machine learning:
In most real-world scenarios, reducing bias increases variance, and reducing variance increases bias.
This is the famous bias-variance tradeoff. It explains why simple and complex models fail in opposite ways:
Simple models
- High bias: Too restrictive, miss patterns
- Low variance: Stable across samples
- Result: Underfitting
Complex models
- Low bias: Can fit any pattern
- High variance: Sensitive to sample
- Result: Overfitting
| Model | Bias | Variance | Typical failure mode |
|---|---|---|---|
| Constant prediction (always output mean) | High | Near zero | Underfitting |
| Linear regression | Medium | Low | Depends on problem |
| Polynomial regression (degree 10) | Low | High | Overfitting risk |
| Deep neural network (no regularization) | Very Low | Very High | Severe overfitting |
The Sweet Spot
The goal is to find the optimal tradeoff — a model that's complex enough to capture real patterns (low bias) but not so complex that it fits noise (low variance). This minimizes the total MSE.
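The tradeoff can be seen directly by fitting polynomials of increasing degree to many noisy samples of the same curve. The sketch below is a minimal illustration under assumed settings: a sine curve as the "true" function, a fixed grid of 20 x-values, Gaussian noise, and the hypothetical helper name `bias_variance_by_degree`. It uses `numpy.polynomial.Polynomial.fit` for the least-squares fits.

```python
import numpy as np
from numpy.polynomial import Polynomial

def bias_variance_by_degree(degrees=(0, 1, 3, 9), n_points=20, noise_sd=0.3,
                            n_simulations=500, seed=0):
    """Average squared bias and variance of polynomial fits at fixed x values."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    true_f = np.sin(2 * np.pi * x)               # assumed "true" function
    results = {}
    for degree in degrees:
        fits = np.empty((n_simulations, n_points))
        for s in range(n_simulations):
            y = true_f + rng.normal(0.0, noise_sd, n_points)   # fresh noisy sample
            fits[s] = Polynomial.fit(x, y, degree)(x)           # fitted values at x
        mean_fit = fits.mean(axis=0)             # average prediction across samples
        results[degree] = {
            "bias_sq": np.mean((mean_fit - true_f) ** 2),
            "variance": np.mean(fits.var(axis=0)),
        }
    return results

degree_results = bias_variance_by_degree()
for degree, r in degree_results.items():
    print(f"degree {degree}: bias^2={r['bias_sq']:.4f}  variance={r['variance']:.4f}  "
          f"sum={r['bias_sq'] + r['variance']:.4f}")
```

In this setup the squared bias falls and the variance rises as the degree increases, and their sum is smallest at an intermediate degree.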
Can a biased estimator be better than an unbiased one? Yes! This is a crucial insight that many beginners miss.
An unbiased estimator with high variance can have higher MSE than a biased estimator with low variance. What matters is the total error, not just the bias.
Example: Ridge Regression
- Ordinary Least Squares (OLS): Unbiased but can have high variance
- Ridge Regression: Deliberately adds bias to dramatically reduce variance
- Result: Ridge often has lower MSE despite being biased!
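For a fixed design matrix, the bias and variance of the OLS and ridge coefficient estimators have closed forms, so the comparison can be made without simulation. The sketch below assumes a linear model y = Xβ + noise with known noise variance; the design matrix, the coefficients, λ = 1, and the helper name `coefficient_mse` are illustrative choices.

```python
import numpy as np

def coefficient_mse(X, beta, noise_var, lam):
    """Exact bias^2, variance, and MSE of the ridge coefficient estimator.

    For a fixed design X and y = X @ beta + noise, the ridge estimator is
    (X^T X + lam I)^{-1} X^T y. Setting lam = 0 gives OLS.
    """
    p = X.shape[1]
    gram = X.T @ X
    inv = np.linalg.inv(gram + lam * np.eye(p))
    bias = inv @ gram @ beta - beta              # E[estimator] - beta
    cov = noise_var * inv @ gram @ inv           # covariance of the estimator
    bias_sq, variance = bias @ bias, np.trace(cov)
    return bias_sq, variance, bias_sq + variance

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))                     # small illustrative design matrix
X[:, 2] = X[:, 1] + 0.1 * rng.normal(size=15)    # make two columns nearly collinear
beta = np.array([1.0, 0.5, -0.5])                # illustrative true coefficients
ols = coefficient_mse(X, beta, noise_var=4.0, lam=0.0)
ridge = coefficient_mse(X, beta, noise_var=4.0, lam=1.0)
print("OLS   bias^2=%.4f  variance=%.4f  MSE=%.4f" % ols)
print("Ridge bias^2=%.4f  variance=%.4f  MSE=%.4f" % ridge)
```

Here ridge trades a small squared bias for a large drop in variance. How large λ can be before the added bias outweighs the variance reduction depends on the noise level and the true coefficients, which are unknown in practice.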
ML Techniques and the Bias-Variance Tradeoff
Modern machine learning provides several powerful techniques to navigate the bias-variance tradeoff. Each technique works by shifting the balance between bias and variance in a controlled way.
| Technique | Effect on Bias | Effect on Variance | How It Works |
|---|---|---|---|
| Ridge Regression (L2) | ↑ Slightly increases | ↓ Significantly decreases | Shrinks coefficients toward zero; penalizes large weights with λ||β||² |
| LASSO (L1) | ↑ Increases (some features = 0) | ↓ Decreases significantly | Sparse solutions; sets irrelevant features exactly to zero with λ||β||₁ |
| Elastic Net | ↑ Moderate increase | ↓ Decreases | Combines L1 + L2; balances sparsity and grouping |
| Bagging / Random Forests | ≈ Unchanged | ↓↓ Dramatically decreases | Averages many high-variance models; Var(avg) = Var(single)/B for B uncorrelated models, with smaller gains when the models are correlated |
| Boosting | ↓ Decreases | ↑ May increase (if overfit) | Sequentially corrects residual errors; additive model |
| Early Stopping | ↑ Slightly increases | ↓ Decreases | Stops training before overfitting; implicit regularization |
| Dropout (Neural Nets) | ↑ Slightly increases | ↓ Decreases significantly | Randomly drops neurons; approximates ensemble of networks |
Cross-validation is how we estimate the expected prediction error (the true risk) from finite data. It directly estimates the error (for squared loss, the MSE) we would see on unseen test data.
- k-Fold CV: Split data into k parts, train on k-1, test on 1, average the results
- Leave-One-Out (LOOCV): k = n; nearly unbiased but high variance
- Typical choice: k = 5 or 10 balances bias and variance of the error estimate itself
$$\widehat{\text{Err}}_{\text{CV}} = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, \hat{f}^{-\kappa(i)}(x_i)\big)$$
where $\hat{f}^{-\kappa(i)}$ is the model trained without the fold $\kappa(i)$ containing observation $i$.
Cross-validation helps us select the optimal regularization strength (like λ in Ridge/LASSO) by finding the value that minimizes estimated test error — the point of optimal bias-variance balance.
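A minimal sketch of this selection procedure, assuming squared-error loss, a closed-form ridge fit without an intercept, and synthetic data. The helper names `ridge_fit` and `k_fold_cv_error`, the λ grid, and the data-generating settings are all illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept, for simplicity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def k_fold_cv_error(X, y, lam, k=5, seed=0):
    """Average squared prediction error over k held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # assigns each observation to a fold
    squared_errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        coef = ridge_fit(X[train], y[train], lam)        # model trained without this fold
        squared_errors.append((y[held_out] - X[held_out] @ coef) ** 2)
    return np.concatenate(squared_errors).mean()          # (1/n) * sum of losses

# Synthetic data; all values here are illustrative.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=2.0, size=60)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: k_fold_cv_error(X, y, lam) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)
for lam, err in cv_errors.items():
    print(f"lambda={lam:>6}: CV error = {err:.3f}")
print("Selected lambda:", best_lam)
```

Using the same seed for every λ keeps the fold assignment fixed, so differences in CV error reflect the regularization strength rather than a different split.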
Key Insight: Regularization techniques (Ridge, LASSO, Dropout, Early Stopping) all work by deliberately increasing bias to reduce variance by a larger amount, resulting in lower overall MSE. Ensemble methods like Bagging reduce variance through averaging. Cross-validation helps us find the sweet spot.
Real-World Applications
Python Implementation
Let's see these concepts in code. We'll compute bias, variance, and MSE for different estimators:
```python
import numpy as np

def simulate_estimator_properties(true_mean=5.0, true_var=4.0, n_samples=30,
                                  n_simulations=10000):
    """
    Simulate sampling to compute bias, variance, and MSE of the sample mean.
    """
    estimates = []

    for _ in range(n_simulations):
        # Draw a sample
        sample = np.random.normal(true_mean, np.sqrt(true_var), n_samples)
        # Compute sample mean (our estimator)
        estimate = sample.mean()
        estimates.append(estimate)

    estimates = np.array(estimates)

    # Compute properties (Monte Carlo approximations of the expectations)
    expected_value = estimates.mean()            # E[θ̂]
    bias = expected_value - true_mean            # E[θ̂] - θ₀
    variance = estimates.var()                   # Var(θ̂)
    mse = np.mean((estimates - true_mean)**2)    # E[(θ̂ - θ₀)²]

    return {
        'expected_value': expected_value,
        'bias': bias,
        'bias_squared': bias**2,
        'variance': variance,
        'mse': mse,
        'mse_decomposition': bias**2 + variance,  # Should equal MSE!
        'estimates': estimates
    }

# Run simulation
results = simulate_estimator_properties()

print("═══════════════════════════════════════════")
print(" SAMPLE MEAN ESTIMATOR PROPERTIES")
print("═══════════════════════════════════════════")
print(f" E[θ̂] (Expected Value): {results['expected_value']:.4f}")
print(" θ₀ (True Mean): 5.0000")
print(f" Bias: {results['bias']:.4f}")
print(f" Bias²: {results['bias_squared']:.6f}")
print(f" Variance: {results['variance']:.4f}")
print(f" MSE (direct): {results['mse']:.4f}")
print(f" MSE (Bias² + Var): {results['mse_decomposition']:.4f}")
print("═══════════════════════════════════════════")
print(" ✓ Verification: MSE ≈ Bias² + Var")
print("═══════════════════════════════════════════")
```

```python
import numpy as np

# Comparing two estimators: Sample Mean vs Biased Estimator
def compare_estimators(true_mean=5.0, true_var=4.0, n_samples=10,
                       n_simulations=10000, shrink_target=4.0):
    """
    Compare:
    1. Sample mean (unbiased)
    2. Shrinkage estimator: θ̂_shrink = 0.9 * X̄ + 0.1 * shrink_target
       (biased but lower variance)
    """
    unbiased_estimates = []
    biased_estimates = []

    for _ in range(n_simulations):
        sample = np.random.normal(true_mean, np.sqrt(true_var), n_samples)

        # Unbiased: sample mean
        unbiased_est = sample.mean()

        # Biased: shrinkage towards a fixed guess (4.0 by default).
        # The guess differs from the true mean, so the estimator is
        # genuinely biased: E[θ̂_shrink] = 0.9 * 5.0 + 0.1 * 4.0 = 4.9.
        # Scaling X̄ by 0.9 multiplies its variance by 0.81.
        biased_est = 0.9 * sample.mean() + 0.1 * shrink_target

        unbiased_estimates.append(unbiased_est)
        biased_estimates.append(biased_est)

    unbiased = np.array(unbiased_estimates)
    biased = np.array(biased_estimates)

    # Compute metrics
    def get_metrics(estimates, true_value):
        exp_val = estimates.mean()
        bias = exp_val - true_value
        var = estimates.var()
        mse = np.mean((estimates - true_value)**2)
        return exp_val, bias, var, mse

    u_exp, u_bias, u_var, u_mse = get_metrics(unbiased, true_mean)
    b_exp, b_bias, b_var, b_mse = get_metrics(biased, true_mean)

    print("\n" + "═"*60)
    print(" ESTIMATOR COMPARISON: Unbiased vs Shrinkage")
    print("═"*60)
    print(f" {'Metric':<20} {'Unbiased':>15} {'Shrinkage':>15}")
    print("-"*60)
    print(f" {'E[θ̂]':<20} {u_exp:>15.4f} {b_exp:>15.4f}")
    print(f" {'Bias':<20} {u_bias:>15.4f} {b_bias:>15.4f}")
    print(f" {'Variance':<20} {u_var:>15.4f} {b_var:>15.4f}")
    print(f" {'MSE':<20} {u_mse:>15.4f} {b_mse:>15.4f}")
    print("═"*60)

    winner = "Unbiased" if u_mse < b_mse else "Shrinkage"
    print(f" Winner (Lower MSE): {winner}")
    print("═"*60)

compare_estimators()
```
Key Takeaway from the Code
The simulation demonstrates that a biased estimator can have lower MSE than an unbiased one when the variance reduction outweighs the added bias. This is the principle behind regularization techniques like Ridge Regression and LASSO.
Key Insights
Bias measures whether you're hitting the right target on average. Variance measures how tightly your shots cluster together. You need both to be low for a good estimator.
MSE = Bias² + Variance is a mathematical identity, not an approximation. Every point of MSE comes from exactly one of these two sources.
A biased estimator with much lower variance can have lower MSE. This is why regularization (Ridge, LASSO) intentionally adds bias to dramatically reduce variance.
Collecting more samples shrinks variance (for the sample mean, in proportion to 1/n, so the standard error shrinks as 1/√n), but a systematic bias built into the method is not removed by collecting more data. Bias is a property of the estimation method itself.
Underfitting = high bias (model too simple). Overfitting = high variance (model too complex). The optimal model minimizes total error (MSE).
Summary
Bias: The systematic error of an estimator: $\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta_0$. It measures where the estimator "aims" relative to the truth. Cannot be reduced by more data.
Variance: The spread of the estimator across different samples: $\mathrm{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$. It measures how much the estimator "wiggles". Can be reduced by more data.
MSE: The total expected squared error: $\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta_0)^2]$. Decomposes exactly as: MSE = Bias² + Variance.
Bias-Variance Tradeoff: Reducing bias often increases variance and vice versa. The goal is to minimize total MSE, which may involve accepting some bias to greatly reduce variance (regularization).
In the next section, we'll explore Consistency and Efficiency — two additional properties that tell us how estimators behave as sample size grows and how to identify the "best possible" estimator among unbiased ones.
| Symbol | Name | Meaning |
|---|---|---|
| θ₀ | True parameter | The unknown value we're trying to estimate |
| θ̂ | Estimator/Estimate | A function of sample data that produces estimates |
| E[θ̂] | Expected value | Average value of the estimator over all samples |
| Bias(θ̂) | Bias | E[θ̂] - θ₀ (systematic error) |
| Var(θ̂) | Variance | E[(θ̂ - E[θ̂])²] (spread of estimates) |
| MSE(θ̂) | Mean Squared Error | E[(θ̂ - θ₀)²] = Bias² + Variance |