Chapter 11

Bias, Variance, and MSE

Point Estimation

Learning Objectives

Before You Start

You should understand what an estimator is from Section 01. Specifically, you should know that an estimator $\hat{\theta}$ is a function of the sample data that produces an estimate of the unknown parameter $\theta_0$.

By the end of this section, you will be able to:

🎯
Define Bias of an Estimator

Understand what it means for an estimator to systematically over- or under-estimate the truth

📊
Define Variance of an Estimator

Understand how much an estimator "wiggles" across different samples

📐
Calculate and Decompose MSE

Master the beautiful formula: MSE = Bias² + Variance

⚖️
Explain the Bias-Variance Tradeoff

The fundamental principle that governs all of statistics and machine learning

🚀
Apply These Concepts to ML

Connect bias-variance to underfitting, overfitting, and model selection

🏆What You'll Master

By the end of this section, you'll understand the two fundamental sources of error in any estimation procedure and how they combine into a single quality metric (MSE). This knowledge is essential for understanding why some estimators are preferred over others — even if they're biased!


The Big Picture: How Good Is Your Estimator?

The Central Question: Given multiple ways to estimate a parameter, how do we decide which estimator is "best"?

In Section 01, we learned that an estimator $\hat{\theta}$ is a function that takes sample data and produces an estimate of the unknown true parameter $\theta_0$. But here's the crucial insight:

The Estimator Is a Random Variable

Because $\hat{\theta}$ depends on random sample data, the estimator itself is a random variable. If you collected a different sample, you'd get a different estimate!

This randomness raises a fundamental question: How do we evaluate the quality of an estimator that produces different values each time?

Back to the Coffee Shop

Remember our coffee shop example? The true average wait time is $\mu = 2.847$ minutes (but we don't know this). We sample $n$ customers and compute the sample mean $\bar{X}$.

Sample 1 (n=20): $\bar{X}_1 = 2.91$ min
Sample 2 (n=20): $\bar{X}_2 = 2.73$ min
Sample 3 (n=20): $\bar{X}_3 = 2.85$ min

Different samples give different estimates! How do we evaluate the quality of this estimator?
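To make the sample-to-sample variation concrete, here is a minimal simulation sketch. It assumes wait times are exponentially distributed with mean 2.847 minutes; the section only states the true mean, so the distribution, the seed, and the helper name `sample_mean_wait` are illustrative assumptions.

```python
import numpy as np

# Assumption: wait times are exponential with mean 2.847 minutes.
# The text only gives the true mean, not the distribution.
TRUE_MEAN = 2.847
rng = np.random.default_rng(seed=0)

def sample_mean_wait(n, rng):
    """Draw n wait times and return their sample mean (one estimate)."""
    waits = rng.exponential(scale=TRUE_MEAN, size=n)
    return waits.mean()

# Three independent samples of n = 20 customers give three different estimates.
estimates = [sample_mean_wait(20, rng) for _ in range(3)]
for i, est in enumerate(estimates, start=1):
    print(f"Sample {i} (n=20): sample mean = {est:.2f} min")
```

Each call plays the role of visiting the coffee shop on a different day: same procedure, different data, different estimate.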

We need metrics that capture two distinct types of error:

🎯Systematic Error

Does the estimator consistently over- or under-estimate the truth? This is measured by BIAS.

"Where is your aim pointed?"
🎲Random Error

How much does the estimator fluctuate from sample to sample? This is measured by VARIANCE.

"How shaky is your hand?"

The Fundamental Insight

Total Error = (Systematic Error)² + Random Error, where the random error is measured by the variance. This is the essence of the bias-variance decomposition that we'll derive in this section.


What Is Bias?

Intuitive Understanding of Bias

Imagine you own a bathroom scale that displays weights. Unknown to you, the scale has a manufacturing defect that makes it always show 2 kg more than the actual weight.

⚖️The Miscalibrated Scale Analogy
| True Weight | Scale Reading | Error |
|---|---|---|
| 70 kg | 72 kg | +2 kg |
| 65 kg | 67 kg | +2 kg |
| 80 kg | 82 kg | +2 kg |

Key observation: No matter how many times you weigh yourself, the scale will always overestimate by 2 kg. More measurements won't help! This +2 kg is the bias.

Bias is the systematic, consistent error of the estimation procedure. A fixed bias like the scale's +2 kg doesn't go away with more data. It represents where your estimator is "aiming" relative to the truth.

Here's why this matters: A biased estimator gives you a false sense of confidence. You might collect more and more data, seeing your estimates converge — but they're converging to the wrong value!

Mathematical Definition

Formally, the bias of an estimator $\hat{\theta}$ is the difference between its expected value and the true parameter:

$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta_0$$

  • $E[\hat{\theta}]$: Expected value of the estimator (where it aims on average)
  • $\theta_0$: The true parameter value (the target)
  • $\text{Bias}$: The systematic error (can be positive, negative, or zero)

An estimator is called unbiased if its bias is zero:

Unbiased: $E[\hat{\theta}] = \theta_0$

What Unbiased Really Means

An unbiased estimator hits the target on average. If you could repeat the experiment infinitely many times with different samples, the average of all your estimates would equal the true parameter.

This does NOT mean that any single estimate will be exactly correct!

| Bias Value | Interpretation | Example |
|---|---|---|
| Bias > 0 | Estimator overestimates on average | Scale adds 2 kg |
| Bias < 0 | Estimator underestimates on average | Survey that misses certain demographics |
| Bias = 0 | Estimator is unbiased (hits target on average) | Sample mean for the population mean |

Bias Examples
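One classic illustration, sketched here under stated assumptions: the variance estimator that divides by n underestimates σ² on average, with bias −σ²/n, while the version that divides by n − 1 is unbiased. The population (normal with σ² = 4), the sample size, and the seed are hypothetical choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
true_var = 4.0           # sigma^2 of the population (hypothetical value)
n = 5                    # small sample so the bias is visible
n_simulations = 200_000  # number of repeated samples

# Each row is one sample of size n from N(0, sigma^2).
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_simulations, n))

var_divide_n = samples.var(axis=1, ddof=0)          # divides by n (biased)
var_divide_n_minus_1 = samples.var(axis=1, ddof=1)  # divides by n - 1 (unbiased)

# Empirical bias = average estimate minus the true value.
bias_n = var_divide_n.mean() - true_var
bias_n_minus_1 = var_divide_n_minus_1.mean() - true_var

print(f"Divide by n:     empirical bias = {bias_n:+.3f} (theory: {-true_var / n:+.3f})")
print(f"Divide by n - 1: empirical bias = {bias_n_minus_1:+.3f} (theory: +0.000)")
```

Because this particular bias equals −σ²/n, it shrinks as n grows. That is a property of this estimator, and it contrasts with the scale's fixed +2 kg offset.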


What Is Variance?

Intuitive Understanding of Variance

Now let's consider a different scenario. Imagine two archers shooting at a target:

🏹Archer A: Steady Hand

Arrows cluster tightly together. Even if they're slightly off-center, the shots are predictable and consistent.

Low Variance
🏹Archer B: Shaky Hand

Arrows are scattered all over the target. Each shot is unpredictable — you never know where the next arrow will land.

High Variance
Variance measures the "spread" or "uncertainty" in your estimator. It tells you how much the estimate would change if you collected a different sample.

Unlike a fixed systematic bias, variance can typically be reduced by collecting more data. With more observations, your estimate becomes more stable.

$$\text{Var}(\hat{\theta}) = E\left[(\hat{\theta} - E[\hat{\theta}])^2\right]$$

  • $\hat{\theta} - E[\hat{\theta}]$: How far each estimate is from the average estimate
  • $E[(\cdots)^2]$: Average squared deviation (expected squared distance)

Equivalently, using the alternative formula:

$$\text{Var}(\hat{\theta}) = E[\hat{\theta}^2] - (E[\hat{\theta}])^2$$
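The two variance formulas agree because $E[\hat{\theta}]$ is a constant, so it can be pulled out of the expectation. A short derivation, using only linearity of expectation:

```latex
\begin{aligned}
\text{Var}(\hat{\theta})
  &= E\left[(\hat{\theta} - E[\hat{\theta}])^2\right] \\
  &= E\left[\hat{\theta}^2 - 2\,\hat{\theta}\,E[\hat{\theta}] + (E[\hat{\theta}])^2\right] \\
  &= E[\hat{\theta}^2] - 2\,(E[\hat{\theta}])^2 + (E[\hat{\theta}])^2 \\
  &= E[\hat{\theta}^2] - (E[\hat{\theta}])^2
\end{aligned}
```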

Variance Examples

📊Variance of the Sample Mean

For the sample mean $\bar{X}$ of $n$ i.i.d. observations with variance $\sigma^2$:

$$\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$$

Key insight: The variance decreases as n increases! With more data, the sample mean becomes more precise.

| Sample Size (n) | Var($\bar{X}$) | Interpretation |
|---|---|---|
| n = 10 | σ²/10 | Moderate spread |
| n = 100 | σ²/100 | Much tighter |
| n = 1000 | σ²/1000 | Very precise |
| n → ∞ | → 0 | Perfect precision! |
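The σ²/n rule can be checked by simulation. The sketch below assumes a normal population with σ² = 4 (a hypothetical choice; the formula only needs i.i.d. observations with finite variance) and compares the empirical variance of the sample mean with the theoretical value.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
sigma2 = 4.0           # population variance (hypothetical value)
n_simulations = 5_000  # number of repeated samples per sample size

empirical_var = {}
for n in (10, 100, 1000):
    # Each row is one sample of size n; each row mean is one estimate.
    samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_simulations, n))
    sample_means = samples.mean(axis=1)
    empirical_var[n] = sample_means.var()
    print(f"n = {n:>4}: empirical Var = {empirical_var[n]:.5f}, "
          f"theory sigma^2/n = {sigma2 / n:.5f}")
```

The empirical values are themselves Monte Carlo estimates, so they match the theory only approximately.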
🎯The Target Practice Analogy

Imagine each estimator as a shooter aiming at a target. The bullseye is the true parameter. Each shot represents an estimate from a different sample.

🏆Low Bias, Low Variance

The Ideal Estimator

Bias: Low
Var: Low

Shots cluster tightly around the bullseye

🎯High Bias, Low Variance

Consistent but Wrong

Bias: High
Var: Low

Shots cluster tightly but miss the center

🎲Low Bias, High Variance

Right on Average, but Scattered

Bias: Low
Var: High

Shots are spread out but centered on bullseye

😱High Bias, High Variance

The Worst Case

Bias: High
Var: High

Shots are spread out AND miss the center

Sample Size Effects: Variance Shrinks, Bias Stays

A critical insight is that more data reduces variance but does not, by itself, fix a systematic bias. For the sample mean $\bar{X}$:

$$\text{Var}(\bar{X}) = \frac{\sigma^2}{n} \xrightarrow{n \to \infty} 0$$

The variance decreases proportionally to $1/n$, meaning the standard error (standard deviation of the estimator) decreases as $1/\sqrt{n}$. To halve your standard error, you need to quadruple your sample size!

📉Sample Size Effects on Variance

For the sample mean estimator with population variance σ² = 4:

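[Interactive widget: sample-size slider from n = 5 to n = 500. State shown: Bias = 0 (fixed), Var(X̄) = σ²/n = 0.4000 (shrinks with n), MSE = Bias² + Var = 0.4000.]

| n | Var(X̄) = σ²/n | Std Error = σ/√n | Variance reduction vs. n = 5 |
|---|---|---|---|
| 5 | 0.8000 | 0.8944 | 0% |
| 10 | 0.4000 | 0.6325 | 50% |
| 20 | 0.2000 | 0.4472 | 75% |
| 50 | 0.0800 | 0.2828 | 90% |
| 100 | 0.0400 | 0.2000 | 95% |
| 200 | 0.0200 | 0.1414 | 98% |
| 500 | 0.0080 | 0.0894 | 99% |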

Notice: Bias stays at 0 regardless of n. Only variance decreases with more data!

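More Data Does Not Fix a Systematic Bias

If your estimation method has a systematic bias that does not shrink with n (like the miscalibrated scale), collecting more data will not eliminate it: the estimator converges to the wrong value as $n \to \infty$. Some estimators have a bias that vanishes as n grows, but that is a property of the estimator's formula, not of the data volume. In every other case, you must fix the methodology itself to remove bias.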
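The miscalibrated scale makes this easy to simulate. In the sketch below, the true weight and the +2 kg offset come from the scale analogy; the reading-to-reading noise level and the seed are hypothetical additions so that there is some variance to average away.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
true_weight = 70.0   # kg, from the scale table
scale_offset = 2.0   # kg, the systematic bias of the scale
noise_sd = 0.5       # kg, hypothetical reading-to-reading noise

average_reading = {}
for n in (10, 1_000, 100_000):
    # Every reading carries the same offset plus fresh random noise.
    readings = true_weight + scale_offset + rng.normal(0.0, noise_sd, size=n)
    average_reading[n] = readings.mean()
    print(f"n = {n:>6}: average reading = {average_reading[n]:.3f} kg "
          f"(error = {average_reading[n] - true_weight:+.3f} kg)")
```

Averaging more readings removes the noise but leaves the offset: the average settles near 72 kg, not 70 kg.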


What Is Mean Squared Error (MSE)?

Now we need a single number that captures the overall quality of an estimator, combining both bias and variance. This is the Mean Squared Error (MSE).

MSE measures the average squared distance between our estimate and the truth — the "typical" error magnitude, squared.

$$\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta_0)^2\right]$$

The expected squared distance from the estimate to the true parameter value.

Why squared error? Several reasons:

  1. Penalizes large errors more: An error of 10 contributes 100, but an error of 2 only contributes 4
  2. Mathematical convenience: Differentiable and leads to elegant decomposition
  3. Non-negative: Over- and under-estimates cannot cancel each other out

The Beautiful Decomposition: MSE = Bias² + Variance

Here's one of the most important results in all of statistics:

The Bias-Variance Decomposition

$$\text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$$

Total Error = Systematic Error² + Random Error

Understanding the Decomposition

This formula says that the total "badness" of an estimator (measured by MSE) comes from exactly two sources:

  • Bias²: The squared systematic error — how far the average estimate is from truth
  • Variance: The random fluctuation — how spread out the estimates are
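The decomposition follows from adding and subtracting $E[\hat{\theta}]$ inside the square. A short derivation sketch, using only linearity of expectation and the fact that $E[\hat{\theta}] - \theta_0$ is a constant:

```latex
\begin{aligned}
\text{MSE}(\hat{\theta})
  &= E\left[(\hat{\theta} - \theta_0)^2\right] \\
  &= E\left[\big((\hat{\theta} - E[\hat{\theta}]) + (E[\hat{\theta}] - \theta_0)\big)^2\right] \\
  &= E\left[(\hat{\theta} - E[\hat{\theta}])^2\right]
     + 2\,(E[\hat{\theta}] - \theta_0)\,E\left[\hat{\theta} - E[\hat{\theta}]\right]
     + (E[\hat{\theta}] - \theta_0)^2 \\
  &= \text{Var}(\hat{\theta}) + 0 + \text{Bias}(\hat{\theta})^2
\end{aligned}
```

The cross term vanishes because $E\left[\hat{\theta} - E[\hat{\theta}]\right] = 0$.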

🎯Interactive Bias-Variance Explorer
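[Interactive figure: a bias slider (Underestimate / Unbiased / Overestimate) and a variance slider (Low (Precise) / High (Spread)). The plot marks the true value θ₀ and the average estimate E[θ̂]. State shown: Bias² = 0.00, Variance = 1.00, MSE = Bias² + Var = 1.00.]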

Red dots = individual estimates from different samples. Green line = true value. Blue line = average of estimates.

MSE as Risk: A Deeper View

📊MSE in Statistical Decision Theory

In statistical decision theory, we evaluate estimators using risk functions. The MSE is simply the risk when we use squared error as our loss function:

$$R(\theta_0, \hat{\theta}) = E_{\theta_0}\left[L(\theta_0, \hat{\theta})\right] = E_{\theta_0}\left[(\hat{\theta} - \theta_0)^2\right] = \text{MSE}(\hat{\theta})$$

where $L(\theta_0, \hat{\theta}) = (\hat{\theta} - \theta_0)^2$ is the squared error loss.

Generalizing to Vector Estimators

So far we've focused on estimating a single parameter $\theta \in \mathbb{R}$. In practice, we often estimate vectors of parameters $\boldsymbol{\theta} \in \mathbb{R}^p$, such as all coefficients in a regression model.

📐Vector Extensions of Bias, Variance, and MSE
Bias Vector:

$$\text{Bias}(\hat{\boldsymbol{\theta}}) = E[\hat{\boldsymbol{\theta}}] - \boldsymbol{\theta}_0 \in \mathbb{R}^p$$

Each component has its own bias. The estimator is unbiased if all components are unbiased.

Covariance Matrix:

$$\boldsymbol{\Sigma}_{\hat{\theta}} = \text{Cov}(\hat{\boldsymbol{\theta}}) = E\left[(\hat{\boldsymbol{\theta}} - E[\hat{\boldsymbol{\theta}}])(\hat{\boldsymbol{\theta}} - E[\hat{\boldsymbol{\theta}}])^\top\right] \in \mathbb{R}^{p \times p}$$

The covariance matrix captures both individual variances (diagonal) and covariances between components (off-diagonal).

MSE Matrix:

$$\text{MSE}(\hat{\boldsymbol{\theta}}) = E\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)^\top\right] = \boldsymbol{\Sigma}_{\hat{\theta}} + \text{Bias}(\hat{\boldsymbol{\theta}})\,\text{Bias}(\hat{\boldsymbol{\theta}})^\top$$

The matrix analog of MSE = Variance + Bias².

Scalar Summary (Total MSE):

$$\text{Total MSE} = \text{tr}\left(\text{MSE}(\hat{\boldsymbol{\theta}})\right) = \sum_{j=1}^{p} \text{MSE}(\hat{\theta}_j) = \|\text{Bias}(\hat{\boldsymbol{\theta}})\|^2 + \text{tr}(\boldsymbol{\Sigma}_{\hat{\theta}})$$

The trace gives a single number summarizing total estimation quality.
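The matrix identities can be checked numerically. In the sketch below, the true parameter vector, the noise covariance, and the "shrink by 0.9" estimator are all hypothetical choices made only to produce an estimator with both bias and correlated components.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
theta_0 = np.array([1.0, -2.0, 0.5])   # hypothetical true parameter vector
n_simulations = 50_000

# Hypothetical estimator: shrink a noisy, correlated measurement by 0.9.
noise_cov = np.array([[1.0, 0.6, 0.0],
                      [0.6, 2.0, 0.3],
                      [0.0, 0.3, 0.5]])
noisy = rng.multivariate_normal(mean=theta_0, cov=noise_cov, size=n_simulations)
estimates = 0.9 * noisy                 # shape (n_simulations, p)

# Empirical bias vector, covariance matrix, and MSE matrix (all divide by N).
bias = estimates.mean(axis=0) - theta_0
centered = estimates - estimates.mean(axis=0)
cov_matrix = centered.T @ centered / n_simulations
errors = estimates - theta_0
mse_matrix = errors.T @ errors / n_simulations

total_mse = np.trace(mse_matrix)
decomposed = bias @ bias + np.trace(cov_matrix)
print(f"tr(MSE matrix)       = {total_mse:.4f}")
print(f"||Bias||^2 + tr(Cov) = {decomposed:.4f}")
```

With empirical moments that all divide by the number of simulations, the identity holds up to floating-point rounding, not just approximately.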

When Does Covariance Matter?

For inference about a single component, only its marginal variance matters. But for quantities that combine several components (e.g., a prediction that is a linear combination of coefficients) and for simultaneous inference (e.g., confidence regions, hypothesis tests about multiple parameters), the full covariance structure is essential. Ignoring correlations can lead to overconfident or misleading conclusions.

⚠️Existence of Moments

For bias, variance, and MSE to be well-defined, we need the relevant moments to exist:

  • Bias: Requires $E[|\hat{\theta}|] < \infty$ (first moment exists)
  • Variance & MSE: Require $E[\hat{\theta}^2] < \infty$ (second moment exists)

For heavy-tailed distributions such as the Cauchy, the sample mean has neither a finite mean nor a finite variance, so its bias and MSE are undefined! In such cases, other estimators (like the sample median) or other loss functions are more appropriate.
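A small simulation sketch of the Cauchy case. Because the variance does not exist, the spread of the sample mean is measured here with the interquartile range (IQR), which is always defined. The sample sizes, simulation count, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
n_simulations = 2_000

def iqr_of_sample_means(draw, n):
    """Interquartile range of the sample mean across repeated samples of size n."""
    means = draw(size=(n_simulations, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

iqr = {}
for name, draw in (("Normal", rng.standard_normal), ("Cauchy", rng.standard_cauchy)):
    iqr[name] = (iqr_of_sample_means(draw, 10), iqr_of_sample_means(draw, 1000))
    print(f"{name}: IQR of sample mean, n=10: {iqr[name][0]:.3f}, "
          f"n=1000: {iqr[name][1]:.3f}")
```

For the normal population the spread shrinks roughly tenfold when n goes from 10 to 1000. For the Cauchy population it barely moves, because the mean of n standard Cauchy draws is itself standard Cauchy.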

🔗Connection to Section 03

In the next section on Consistency and Efficiency, we'll see that:

  • Consistency: MSE → 0 as n → ∞ guarantees that the estimator is consistent
  • Efficiency: Achieving lowest possible variance among unbiased estimators
📉Cramér-Rao Lower Bound

There's a fundamental limit to how small the variance of an unbiased estimator can be:

$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$

where $I(\theta)$ is the Fisher Information. We'll explore this in Chapter 12 (Fisher Information and CRLB).
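As a quick numerical illustration, consider n i.i.d. Bernoulli(p) observations. Reading $I(\theta)$ as the information in the whole sample, the Fisher information is n / (p(1 − p)), and the sample proportion attains the bound exactly. The values p = 0.4 and n = 10 and the helper name below are assumptions chosen for the demo.

```python
def fisher_information_bernoulli(p, n):
    """Fisher information about p in n i.i.d. Bernoulli(p) observations."""
    return n / (p * (1.0 - p))

p, n = 0.4, 10   # hypothetical values for illustration
crlb = 1.0 / fisher_information_bernoulli(p, n)
var_sample_proportion = p * (1.0 - p) / n   # Var(X/n) for X ~ Binomial(n, p)

print(f"Cramér-Rao bound:         {crlb:.4f}")
print(f"Var of sample proportion: {var_sample_proportion:.4f}")
```

Both numbers equal p(1 − p)/n, so no unbiased estimator of p can have smaller variance than the sample proportion.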

Worked Example: Bernoulli Mean Estimation

Let's apply bias, variance, and MSE to a concrete problem: estimating the probability of success $p$ from coin flips.

🪙Problem Setup

You flip a coin n = 10 times and observe X = 3 heads. The true probability is unknown (say, p = 0.4). Compare two estimators:

1. Sample Proportion (Unbiased)

$$\hat{p}_1 = \frac{X}{n} = \frac{3}{10} = 0.30$$

2. Laplace-Smoothed (Biased)

$$\hat{p}_2 = \frac{X + 1}{n + 2} = \frac{4}{12} \approx 0.333$$
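Both estimators have closed-form bias, variance, and MSE because X follows a Binomial(n, p) distribution, with E[X] = np and Var(X) = np(1 − p). The sketch below evaluates those formulas at the problem's values; the function names are hypothetical.

```python
def sample_proportion_metrics(p, n):
    """Exact bias, variance, MSE of X/n when X ~ Binomial(n, p)."""
    bias = 0.0
    variance = p * (1 - p) / n
    return bias, variance, bias**2 + variance

def laplace_metrics(p, n):
    """Exact bias, variance, MSE of (X + 1)/(n + 2) when X ~ Binomial(n, p)."""
    bias = (n * p + 1) / (n + 2) - p          # equals (1 - 2p)/(n + 2)
    variance = n * p * (1 - p) / (n + 2) ** 2
    return bias, variance, bias**2 + variance

p, n = 0.4, 10   # the values used in the problem setup
for name, metrics in (("Sample proportion", sample_proportion_metrics),
                      ("Laplace-smoothed", laplace_metrics)):
    bias, variance, mse = metrics(p, n)
    print(f"{name:<18} bias = {bias:+.4f}  var = {variance:.4f}  MSE = {mse:.4f}")
```

At p = 0.4 and n = 10 the Laplace-smoothed estimator has MSE ≈ 0.0169 against 0.0240 for the sample proportion, so the biased estimator wins. The ranking depends on p: near the extremes (for example p = 0.05) the pull toward 1/2 costs more than the variance reduction saves, and the sample proportion has the lower MSE.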


The Bias-Variance Tradeoff

Here's a profound insight that underlies all of statistics and machine learning:

In most real-world scenarios, reducing bias increases variance, and reducing variance increases bias.

This is the famous bias-variance tradeoff. It explains why:

🔧Simple Models
  • High bias: Too restrictive, miss patterns
  • Low variance: Stable across samples
  • Result: Underfitting
🔬Complex Models
  • Low bias: Can fit any pattern
  • High variance: Sensitive to sample
  • Result: Overfitting
🤖The ML Connection: Model Complexity
| Model | Bias | Variance | Risk |
|---|---|---|---|
| Constant prediction (always output mean) | High | Very low | Underfitting |
| Linear regression | Medium | Low | Depends on problem |
| Polynomial regression (degree 10) | Low | High | Overfitting risk |
| Deep neural network (no regularization) | Very low | Very high | High overfitting risk |

The Sweet Spot

The goal is to find the optimal tradeoff — a model that's complex enough to capture real patterns (low bias) but not so complex that it fits noise (low variance). This minimizes the total MSE.
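The tradeoff can be measured directly by refitting models of different complexity on many noisy training sets. The sketch below is one way to do that under stated assumptions: the ground-truth function sin(2πx), the noise level, the fixed design, and the polynomial degrees are all hypothetical choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(seed=6)

def true_function(x):
    return np.sin(2 * np.pi * x)   # hypothetical ground truth

n_train, noise_sd, n_simulations = 30, 0.3, 500
x_train = np.linspace(0.0, 1.0, n_train)   # fixed design
x_test = np.linspace(0.05, 0.95, 50)       # points where the fits are evaluated

results = {}
for degree in (1, 3, 9):
    predictions = np.empty((n_simulations, x_test.size))
    for s in range(n_simulations):
        # New training set each time: same x values, fresh noise.
        y_train = true_function(x_train) + rng.normal(0.0, noise_sd, n_train)
        fit = Polynomial.fit(x_train, y_train, deg=degree)
        predictions[s] = fit(x_test)
    # Bias^2 and variance of the prediction, averaged over the test points.
    mean_prediction = predictions.mean(axis=0)
    bias_sq = np.mean((mean_prediction - true_function(x_test)) ** 2)
    variance = np.mean(predictions.var(axis=0))
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, "
          f"sum = {bias_sq + variance:.4f}")
```

In this setup the straight line carries most of its error as bias, while the degree-9 polynomial carries most of its error as variance.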

🤔Wait — Can a Biased Estimator Be BETTER Than an Unbiased One?

Yes! This is a crucial insight that many beginners miss.

An unbiased estimator with high variance can have higher MSE than a biased estimator with low variance. What matters is the total error, not just the bias.

Example: Ridge Regression

  • Ordinary Least Squares (OLS): Unbiased but can have high variance
  • Ridge Regression: Deliberately adds bias to dramatically reduce variance
  • Result: Ridge often has lower MSE despite being biased!
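A minimal sketch of this comparison, using the closed-form solutions (XᵀX)⁻¹Xᵀy for OLS and (XᵀX + λI)⁻¹Xᵀy for ridge. The design with two nearly collinear features, the true coefficients, the noise level, and λ = 1 are hypothetical choices picked to make OLS unstable; there is no intercept term.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n, noise_sd, lam, n_simulations = 30, 1.0, 1.0, 2_000
beta_true = np.array([1.0, 1.0])   # hypothetical true coefficients

# Fixed design with two nearly collinear features (hypothetical setup).
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2])
gram = X.T @ X

ols_estimates = np.empty((n_simulations, 2))
ridge_estimates = np.empty((n_simulations, 2))
for s in range(n_simulations):
    # Same design, fresh noise: one simulated dataset per iteration.
    y = X @ beta_true + rng.normal(0.0, noise_sd, n)
    ols_estimates[s] = np.linalg.solve(gram, X.T @ y)
    ridge_estimates[s] = np.linalg.solve(gram + lam * np.eye(2), X.T @ y)

def total_mse(estimates):
    """Sum over coefficients of the mean squared estimation error."""
    return np.mean(np.sum((estimates - beta_true) ** 2, axis=1))

ols_mse, ridge_mse = total_mse(ols_estimates), total_mse(ridge_estimates)
print(f"OLS   total MSE: {ols_mse:.3f}")
print(f"Ridge total MSE: {ridge_mse:.3f}")
```

The gap is large here because collinearity inflates the OLS variance. With well-conditioned features or a poorly chosen λ, ridge can lose, which is why λ is usually tuned.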

ML Techniques and the Bias-Variance Tradeoff

Modern machine learning provides several powerful techniques to navigate the bias-variance tradeoff. Each technique works by shifting the balance between bias and variance in a controlled way.

| Technique | Effect on Bias | Effect on Variance | How It Works |
|---|---|---|---|
| Ridge Regression (L2) | ↑ Slightly increases | ↓ Significantly decreases | Shrinks coefficients toward zero; penalizes large weights with λ‖β‖² |
| LASSO (L1) | ↑ Increases (some features = 0) | ↓ Decreases significantly | Sparse solutions; can set coefficients exactly to zero with λ‖β‖₁ |
| Elastic Net | ↑ Moderate increase | ↓ Decreases | Combines L1 + L2; balances sparsity and grouping |
| Bagging / Random Forests | ≈ Unchanged | ↓↓ Dramatically decreases | Averages many high-variance models; Var(avg) = Var(single)/n only if the n models are independent, and correlation between them limits the gain |
| Boosting | ↓ Decreases | ↑ May increase (if overfit) | Sequentially corrects residual errors; additive model |
| Early Stopping | ↑ Slightly increases | ↓ Decreases | Stops training before overfitting; implicit regularization |
| Dropout (Neural Nets) | ↑ Slightly increases | ↓ Decreases significantly | Randomly drops neurons; approximates ensemble of networks |
🔄Cross-Validation: Estimating Expected Prediction Error

Cross-validation is how we estimate the expected prediction error (the true risk) from finite data. It directly estimates the MSE we would see on unseen test data.

  • k-Fold CV: Split data into k parts, train on k-1, test on 1, average the results
  • Leave-One-Out (LOOCV): k = n; nearly unbiased but high variance
  • Typical choice: k = 5 or 10 balances bias and variance of the error estimate itself

$$\widehat{\text{Err}}_{\text{CV}} = \frac{1}{n} \sum_{i=1}^{n} L\left(y_i, \hat{f}^{-\kappa(i)}(x_i)\right)$$

where $\hat{f}^{-\kappa(i)}$ is the model trained without the fold containing observation $i$.

Cross-validation helps us select the optimal regularization strength (like λ in Ridge/LASSO) by finding the value that minimizes estimated test error — the point of optimal bias-variance balance.
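The CV formula can be implemented in a few lines of NumPy. The sketch below uses squared-error loss and selects a polynomial degree rather than a regularization strength, purely to keep the example short; the data-generating function, noise level, k = 5, and candidate degrees are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(seed=8)
n, k, noise_sd = 60, 5, 0.3
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_sd, n)   # hypothetical data

# kappa(i): fold index of observation i, assigned via a random permutation.
fold_of = np.empty(n, dtype=int)
fold_of[rng.permutation(n)] = np.arange(n) % k

def cv_error(degree):
    """k-fold CV estimate of squared-error prediction risk for one degree."""
    squared_errors = np.empty(n)
    for fold in range(k):
        held_out = fold_of == fold
        # Fit on the other k - 1 folds; a fixed domain keeps fits comparable.
        fit = Polynomial.fit(x[~held_out], y[~held_out], deg=degree,
                             domain=[0.0, 1.0])
        squared_errors[held_out] = (y[held_out] - fit(x[held_out])) ** 2
    return squared_errors.mean()

cv_errors = {degree: cv_error(degree) for degree in (1, 3, 9)}
best_degree = min(cv_errors, key=cv_errors.get)
for degree, err in cv_errors.items():
    print(f"degree {degree}: CV error = {err:.4f}")
print(f"Selected degree: {best_degree}")
```

The same loop works for tuning λ in Ridge or LASSO: replace the degree grid with a λ grid and the polynomial fit with the regularized fit.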

Key Insight: Regularization techniques (Ridge, LASSO, Dropout, Early Stopping) all work by deliberately increasing bias to reduce variance by a larger amount, resulting in lower overall MSE. Ensemble methods like Bagging reduce variance through averaging. Cross-validation helps us find the sweet spot.

Real-World Applications


Python Implementation

Let's see these concepts in code. We'll compute bias, variance, and MSE for different estimators:

🐍python
import numpy as np

def simulate_estimator_properties(true_mean=5.0, true_var=4.0, n_samples=30,
                                   n_simulations=10000):
    """
    Simulate sampling to compute bias, variance, and MSE of the sample mean.
    """
    estimates = []

    for _ in range(n_simulations):
        # Draw a sample
        sample = np.random.normal(true_mean, np.sqrt(true_var), n_samples)
        # Compute sample mean (our estimator)
        estimate = sample.mean()
        estimates.append(estimate)

    estimates = np.array(estimates)

    # Compute properties (Monte Carlo approximations of the expectations)
    expected_value = estimates.mean()  # E[θ̂]
    bias = expected_value - true_mean   # E[θ̂] - θ₀
    variance = estimates.var()          # Var(θ̂); ddof=0 keeps the identity exact
    mse = np.mean((estimates - true_mean)**2)  # E[(θ̂ - θ₀)²]

    return {
        'true_mean': true_mean,
        'expected_value': expected_value,
        'bias': bias,
        'bias_squared': bias**2,
        'variance': variance,
        'mse': mse,
        'mse_decomposition': bias**2 + variance,  # Should equal MSE!
        'estimates': estimates
    }

# Run simulation
results = simulate_estimator_properties()

print("═══════════════════════════════════════════")
print("  SAMPLE MEAN ESTIMATOR PROPERTIES")
print("═══════════════════════════════════════════")
print(f"  E[θ̂] (Expected Value):    {results['expected_value']:.4f}")
print(f"  θ₀ (True Mean):           {results['true_mean']:.4f}")
print(f"  Bias:                     {results['bias']:.4f}")
print(f"  Bias²:                    {results['bias_squared']:.6f}")
print(f"  Variance:                 {results['variance']:.4f}")
print(f"  MSE (direct):             {results['mse']:.4f}")
print(f"  MSE (Bias² + Var):        {results['mse_decomposition']:.4f}")
print("═══════════════════════════════════════════")
# Check the decomposition numerically instead of only asserting it in text
assert np.isclose(results['mse'], results['mse_decomposition'])
print("  ✓ Verification: MSE ≈ Bias² + Var")
print("═══════════════════════════════════════════")
🐍python
# Comparing two estimators: Sample Mean vs Biased Estimator
import numpy as np

def compare_estimators(true_mean=5.0, true_var=4.0, n_samples=10,
                       n_simulations=10000, shrink_weight=0.9, shrink_target=4.5):
    """
    Compare:
    1. Sample mean (unbiased)
    2. Shrinkage estimator: θ̂_shrink = w * X̄ + (1 - w) * target
       (biased whenever target != true_mean, but lower variance)
    """
    unbiased_estimates = []
    biased_estimates = []

    for _ in range(n_simulations):
        sample = np.random.normal(true_mean, np.sqrt(true_var), n_samples)

        # Unbiased: sample mean
        unbiased_est = sample.mean()

        # Biased: shrinkage towards shrink_target.
        # The default target (4.5) deliberately differs from true_mean (5.0);
        # shrinking towards 5.0 itself would add no bias at all.
        biased_est = shrink_weight * sample.mean() + (1 - shrink_weight) * shrink_target

        unbiased_estimates.append(unbiased_est)
        biased_estimates.append(biased_est)

    unbiased = np.array(unbiased_estimates)
    biased = np.array(biased_estimates)

    # Compute metrics
    def get_metrics(estimates, true_value):
        exp_val = estimates.mean()
        bias = exp_val - true_value
        var = estimates.var()
        mse = np.mean((estimates - true_value)**2)
        return exp_val, bias, var, mse

    u_exp, u_bias, u_var, u_mse = get_metrics(unbiased, true_mean)
    b_exp, b_bias, b_var, b_mse = get_metrics(biased, true_mean)

    print("\n" + "═"*60)
    print("  ESTIMATOR COMPARISON: Unbiased vs Shrinkage")
    print("═"*60)
    print(f"  {'Metric':<20} {'Unbiased':>15} {'Shrinkage':>15}")
    print("-"*60)
    print(f"  {'E[θ̂]':<20} {u_exp:>15.4f} {b_exp:>15.4f}")
    print(f"  {'Bias':<20} {u_bias:>15.4f} {b_bias:>15.4f}")
    print(f"  {'Variance':<20} {u_var:>15.4f} {b_var:>15.4f}")
    print(f"  {'MSE':<20} {u_mse:>15.4f} {b_mse:>15.4f}")
    print("═"*60)

    winner = "Unbiased" if u_mse < b_mse else "Shrinkage"
    print(f"  Winner (Lower MSE): {winner}")
    print("═"*60)

compare_estimators()

Key Takeaway from the Code

The simulation demonstrates that a biased estimator can have lower MSE than an unbiased one when the variance reduction outweighs the added bias. This is the principle behind regularization techniques like Ridge Regression and LASSO.


Key Insights

1
Bias Is About Accuracy, Variance Is About Precision

Bias measures whether you're hitting the right target on average. Variance measures how tightly your shots cluster together. You need both to be low for a good estimator.

2
The MSE Decomposition Is Exact, Not Approximate

MSE = Bias² + Variance is a mathematical identity, not an approximation. Every point of MSE comes from exactly one of these two sources.

3
Unbiased Isn't Always Best

A biased estimator with much lower variance can have lower MSE. This is why regularization (Ridge, LASSO) intentionally adds bias to dramatically reduce variance.

4
More Data Reduces Variance, Not Bias

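Collecting more samples shrinks the variance of the sample mean in proportion to 1/n (and the standard error in proportion to 1/√n), but a systematic bias like the miscalibrated scale remains no matter how much data you collect. Bias is a property of the estimation method itself.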

5
The Tradeoff Explains Underfitting vs Overfitting

Underfitting = high bias (model too simple). Overfitting = high variance (model too complex). The optimal model minimizes total error (MSE).


Summary

📝Section Summary
Bias

The systematic error of an estimator: $\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta_0$. It measures where the estimator "aims" relative to the truth. A systematic bias in the method is not removed just by collecting more data.

Variance

The spread of the estimator across different samples: $\text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$. It measures how much the estimator "wiggles". Can be reduced by more data.

Mean Squared Error (MSE)

The total expected squared error: $\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta_0)^2]$. Decomposes exactly as: MSE = Bias² + Variance.

Bias-Variance Tradeoff

Reducing bias often increases variance and vice versa. The goal is to minimize total MSE, which may involve accepting some bias to greatly reduce variance (regularization).

🔮What's Next?

In the next section, we'll explore Consistency and Efficiency — two additional properties that tell us how estimators behave as sample size grows and how to identify the "best possible" estimator among unbiased ones.

📚Symbol Glossary
| Symbol | Name | Meaning |
|---|---|---|
| $\theta_0$ | True parameter | The unknown value we're trying to estimate |
| $\hat{\theta}$ | Estimator/Estimate | A function of sample data that produces estimates |
| $E[\hat{\theta}]$ | Expected value | Average value of the estimator over all samples |
| $\text{Bias}(\hat{\theta})$ | Bias | E[θ̂] - θ₀ (systematic error) |
| $\text{Var}(\hat{\theta})$ | Variance | E[(θ̂ - E[θ̂])²] (spread of estimates) |
| $\text{MSE}(\hat{\theta})$ | Mean Squared Error | E[(θ̂ - θ₀)²] = Bias² + Variance |