Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

Before You Start

You should understand what an estimator is from Section 01. Specifically, you should know that an estimator $\hat{\theta}$ is a function of the sample data that produces an estimate of the unknown parameter $\theta_0$ .

By the end of this section, you will be able to:

🎯

Define Bias of an Estimator

Understand what it means for an estimator to systematically over- or under-estimate the truth

📊

Define Variance of an Estimator

Understand how much an estimator "wiggles" across different samples

📐

Calculate and Decompose MSE

Master the beautiful formula: MSE = Bias² + Variance

⚖️

Explain the Bias-Variance Tradeoff

The fundamental principle that governs all of statistics and machine learning

🚀

Apply These Concepts to ML

Connect bias-variance to underfitting, overfitting, and model selection

🏆What You'll Master

By the end of this section, you'll understand the two fundamental sources of error in any estimation procedure and how they combine into a single quality metric (MSE). This knowledge is essential for understanding why some estimators are preferred over others — even if they're biased!

The Big Picture: How Good Is Your Estimator?

The Central Question: Given multiple ways to estimate a parameter, how do we decide which estimator is "best"?

In Section 01, we learned that an estimator $\hat{\theta}$ is a function that takes sample data and produces an estimate of the unknown true parameter $\theta_0$ . But here's the crucial insight:

The Estimator Is a Random Variable

Because $\hat{\theta}$ depends on random sample data, the estimator itself is a random variable. If you collected a different sample, you'd get a different estimate!

This randomness raises a fundamental question: How do we evaluate the quality of an estimator that produces different values each time?

☕Back to the Coffee Shop

Remember our coffee shop example? The true average wait time is $\mu = 2.847$ minutes (but we don't know this). We sample n customers and compute the sample mean $\bar{X}$ .

Sample 1 (n=20): $\bar{X}_1 = 2.91$ min

Sample 2 (n=20): $\bar{X}_2 = 2.73$ min

Sample 3 (n=20): $\bar{X}_3 = 2.85$ min

Different samples give different estimates! How do we evaluate the quality of this estimator?

We need metrics that capture two distinct types of error:

🎯Systematic Error

Does the estimator consistently over- or under-estimate the truth? This is measured by BIAS.

"Where is your aim pointed?"

🎲Random Error

How much does the estimator fluctuate from sample to sample? This is measured by VARIANCE.

"How shaky is your hand?"

The Fundamental Insight

Total Error = (Systematic Error)² + Random Error, where the random error is measured by the variance. This is the essence of the bias-variance decomposition that we'll derive in this section.

What Is Bias?

Intuitive Understanding of Bias

Imagine you own a bathroom scale that displays weights. Unknown to you, the scale has a manufacturing defect that makes it always show 2 kg more than the actual weight.

⚖️The Miscalibrated Scale Analogy

True Weight	Scale Reading	Error
70 kg	72 kg	+2 kg
65 kg	67 kg	+2 kg
80 kg	82 kg	+2 kg

Key observation: No matter how many times you weigh yourself, the scale will always overestimate by 2 kg. More measurements won't help! This +2 kg is the bias.

Bias is the systematic, consistent error that doesn't go away with more data. It represents where your estimator is "aiming" relative to the truth.

Here's why this matters: A biased estimator gives you a false sense of confidence. You might collect more and more data, seeing your estimates converge — but they're converging to the wrong value!
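A minimal simulation of the miscalibrated scale makes this concrete. Only the +2 kg offset and the 70 kg true weight come from the example above; the 0.5 kg reading noise and the random seed are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
true_weight = 70.0               # kg, from the table above
scale_offset = 2.0               # kg, the scale's systematic error
noise_sd = 0.5                   # kg, assumed random jitter per reading

def average_reading(n_readings):
    """Average of n noisy readings from the miscalibrated scale."""
    readings = true_weight + scale_offset + rng.normal(0.0, noise_sd, n_readings)
    return readings.mean()

averages = {n: average_reading(n) for n in (10, 1_000, 100_000)}
for n, avg in averages.items():
    print(f"n = {n:>7}: average = {avg:.3f} kg, error = {avg - true_weight:+.3f} kg")
```

As the number of readings grows, the average settles down, but it settles near 72 kg rather than 70 kg. Averaging removes the jitter, not the offset.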

Mathematical Definition

Formally, the bias of an estimator $\hat{\theta}$ is the difference between its expected value and the true parameter:

$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta_0$

$E[\hat{\theta}]$: Expected value of the estimator (where it aims on average)
$\theta_0$: The true parameter value (the target)
$\text{Bias}(\hat{\theta})$: The systematic error (can be positive, negative, or zero)

An estimator is called unbiased if its bias is zero:

$\text{Unbiased: } E[\hat{\theta}] = \theta_0$

What Unbiased Really Means

An unbiased estimator hits the target on average. If you could repeat the experiment infinitely many times with different samples, the average of all your estimates would equal the true parameter.

This does NOT mean that any single estimate will be exactly correct!

Bias Value	Interpretation	Example
Bias > 0	Estimator overestimates on average	Scale adds 2 kg
Bias < 0	Estimator underestimates on average	Survey that misses certain demographics
Bias = 0	Estimator is unbiased — hits target on average	Sample mean for the population mean


What Is Variance?

Intuitive Understanding of Variance

Now let's consider a different scenario. Imagine two archers shooting at a target:

🏹Archer A: Steady Hand

Arrows cluster tightly together. Even if they're slightly off-center, the shots are predictable and consistent.

Low Variance

🏹Archer B: Shaky Hand

Arrows are scattered all over the target. Each shot is unpredictable — you never know where the next arrow will land.

High Variance

Variance measures the "spread" or "uncertainty" in your estimator. It tells you how much the estimate would change if you collected a different sample.

Unlike bias, variance can be reduced by collecting more data. With more observations, your estimate becomes more stable.

$\text{Var}(\hat{\theta}) = E\left[(\hat{\theta} - E[\hat{\theta}])^2\right]$

$\hat{\theta} - E[\hat{\theta}]$: How far each estimate is from the average estimate
$E[(\cdots)^2]$: Average squared deviation (expected squared distance)

Equivalently, using the alternative formula:

$\text{Var}(\hat{\theta}) = E[\hat{\theta}^2] - (E[\hat{\theta}])^2$

Variance Examples

📊Variance of the Sample Mean

For the sample mean $\bar{X}$ of n i.i.d. observations with variance $\sigma^2$ :

$\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$

Key insight: The variance decreases as n increases! With more data, the sample mean becomes more precise.

Sample Size (n)	Var( $\bar{X}$ )	Interpretation
n = 10	σ²/10	Moderate spread
n = 100	σ²/100	Much tighter
n = 1000	σ²/1000	Very precise
n → ∞	→ 0	Perfect precision!
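The $\sigma^2/n$ rule can be checked empirically. The sketch below assumes normally distributed data with $\sigma^2 = 4$ (the distribution, seed, and number of repetitions are illustrative choices, not part of the result) and compares the observed variance of many sample means against the formula.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0          # assumed population variance
n_repeats = 5_000     # number of simulated samples per sample size

empirical = {}
for n in (10, 100, 1000):
    # Each row is one sample of size n; each row mean is one draw of the sample mean.
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_repeats, n))
    empirical[n] = samples.mean(axis=1).var()
    print(f"n = {n:>4}: empirical Var = {empirical[n]:.5f}, sigma^2/n = {sigma2 / n:.5f}")
```

The empirical values should land within a few percent of $\sigma^2/n$; the remaining gap is simulation noise from using a finite number of repetitions.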

🎯The Target Practice Analogy

Imagine each estimator as a shooter aiming at a target. The bullseye is the true parameter. Each shot represents an estimate from a different sample.

🏆Low Bias, Low Variance

The Ideal Estimator

Bias: Low

Var: Low

Shots cluster tightly around the bullseye

🎯High Bias, Low Variance

Consistent but Wrong

Bias: High

Var: Low

Shots cluster tightly but miss the center

🎲Low Bias, High Variance

Right on Average, but Scattered

Bias: Low

Var: High

Shots are spread out but centered on bullseye

😱High Bias, High Variance

The Worst Case

Bias: High

Var: High

Shots are spread out AND miss the center

Sample Size Effects: Variance Shrinks, Bias Stays

A critical insight is that more data reduces variance but cannot fix bias. For the sample mean $\bar{X}$ :

$\text{Var}(\bar{X}) = \frac{\sigma^2}{n} \xrightarrow{n \to \infty} 0$

The variance decreases proportionally to $1/n$ , meaning the standard error (standard deviation of the estimator) decreases as $1/\sqrt{n}$ . To halve your standard error, you need to quadruple your sample size!

📉Sample Size Effects on Variance

For the sample mean estimator with population variance σ² = 4:


n	Var(X̄) = σ²/n	Std Error = σ/√n	Variance Reduction vs. n = 5
5	0.8000	0.8944	0%
10	0.4000	0.6325	50%
20	0.2000	0.4472	75%
50	0.0800	0.2828	90%
100	0.0400	0.2000	95%
200	0.0200	0.1414	98%
500	0.0080	0.0894	99%

Notice: Bias stays at 0 regardless of n. Only variance decreases with more data!
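The table can be reproduced in a few lines, which also makes the "quadruple the sample size to halve the standard error" rule easy to check. This is a small sketch; the variable names are illustrative, and the reduction column is computed relative to n = 5.

```python
import math

sigma2 = 4.0                                  # population variance used in the table
sample_sizes = [5, 10, 20, 50, 100, 200, 500]
baseline = sigma2 / sample_sizes[0]           # variance at n = 5

rows = []
for n in sample_sizes:
    var = sigma2 / n                          # Var of the sample mean
    std_err = math.sqrt(var)                  # sigma / sqrt(n)
    reduction = 1.0 - var / baseline          # variance reduction relative to n = 5
    rows.append((n, var, std_err, reduction))
    print(f"{n:>4} {var:>8.4f} {std_err:>8.4f} {reduction:>7.1%}")
```

Going from n = 5 to n = 20 multiplies the sample size by four and cuts the standard error exactly in half. The unrounded reduction at n = 200 is 97.5%, which the table shows as 98%.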

Bias Is Immune to Sample Size

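If your estimation method has a built-in bias that does not depend on n (like the scale's +2 kg), collecting more data will not eliminate it: the estimates converge to the wrong value as $n \to \infty$ . You must fix the methodology itself to remove such bias. Some estimators behave differently: the Laplace-smoothed proportion in the worked example below has bias $(1 - 2p)/(n + 2)$ , which shrinks as n grows.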

What Is Mean Squared Error (MSE)?

Now we need a single number that captures the overall quality of an estimator, combining both bias and variance. This is the Mean Squared Error (MSE).

MSE measures the average squared distance between our estimate and the truth — the "typical" error magnitude, squared.

$\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta_0)^2\right]$

The expected squared distance from the estimate to the true parameter value.

Why squared error? Several reasons:

Penalizes large errors more: An error of 10 contributes 100, but an error of 2 only contributes 4
Mathematical convenience: Differentiable and leads to elegant decomposition
Always non-negative: Over- and under-estimates cannot cancel each other out

The Beautiful Decomposition: MSE = Bias² + Variance

Here's one of the most important results in all of statistics:

✨ The Bias-Variance Decomposition

$\text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$

Total Error = Systematic Error² + Random Error

Understanding the Decomposition

This formula says that the total "badness" of an estimator (measured by MSE) comes from exactly two sources:

Bias²: The squared systematic error — how far the average estimate is from truth
Variance: The random fluctuation — how spread out the estimates are
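The identity follows in a few lines by adding and subtracting $E[\hat{\theta}]$ inside the square. Write $m = E[\hat{\theta}]$ for brevity (the symbol $m$ is introduced here only for this derivation):

$\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - m + m - \theta_0)^2\right]$

$= E\left[(\hat{\theta} - m)^2\right] + 2(m - \theta_0)\,E[\hat{\theta} - m] + (m - \theta_0)^2$

$= \text{Var}(\hat{\theta}) + 0 + \text{Bias}(\hat{\theta})^2$

The cross term vanishes because $m - \theta_0$ is a constant and $E[\hat{\theta} - m] = E[\hat{\theta}] - m = 0$ . The first term is the definition of variance, and the last term is the squared bias.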

🎯Interactive Bias-Variance Explorer

[Interactive figure: two sliders set the bias (underestimate, unbiased, overestimate) and the variance (low/precise to high/spread). Red dots are individual estimates from different samples, the green line is the true value θ₀, and the blue line is the average of the estimates, E[θ̂]. At the default settings, Bias = 0.0 and Variance = 1.0, giving Bias² = 0.00, Variance = 1.00, and MSE = Bias² + Var = 1.00.]

MSE as Risk: A Deeper View

📊MSE in Statistical Decision Theory

In statistical decision theory, we evaluate estimators using risk functions. The MSE is simply the risk when we use squared error as our loss function:

$R(\theta_0, \hat{\theta}) = E_{\theta_0}\left[L(\theta_0, \hat{\theta})\right] = E_{\theta_0}\left[(\hat{\theta} - \theta_0)^2\right] = \text{MSE}$

where $L(\theta_0, \hat{\theta}) = (\hat{\theta} - \theta_0)^2$ is the squared error loss.

Generalizing to Vector Estimators

So far we've focused on estimating a single parameter $\theta \in \mathbb{R}$ . In practice, we often estimate vectors of parameters $\boldsymbol{\theta} \in \mathbb{R}^p$ , such as all coefficients in a regression model.

📐Vector Extensions of Bias, Variance, and MSE

Bias Vector:

$\text{Bias}(\hat{\boldsymbol{\theta}}) = E[\hat{\boldsymbol{\theta}}] - \boldsymbol{\theta}_0 \in \mathbb{R}^p$

Each component has its own bias. The estimator is unbiased if all components are unbiased.

Covariance Matrix:

$\boldsymbol{\Sigma}_{\hat{\theta}} = \text{Cov}(\hat{\boldsymbol{\theta}}) = E\left[(\hat{\boldsymbol{\theta}} - E[\hat{\boldsymbol{\theta}}])(\hat{\boldsymbol{\theta}} - E[\hat{\boldsymbol{\theta}}])^\top\right] \in \mathbb{R}^{p \times p}$

The covariance matrix captures both individual variances (diagonal) and correlations between components (off-diagonal).

MSE Matrix:

$\text{MSE}(\hat{\boldsymbol{\theta}}) = E\left[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)^\top\right] = \boldsymbol{\Sigma}_{\hat{\theta}} + \text{Bias}\cdot\text{Bias}^\top$

The matrix analog of MSE = Variance + Bias².

Scalar Summary (Total MSE):

$\text{Total MSE} = \text{tr}(\text{MSE}) = \sum_{j=1}^{p} \text{MSE}(\hat{\theta}_j) = \|\text{Bias}\|^2 + \text{tr}(\boldsymbol{\Sigma}_{\hat{\theta}})$

The trace gives a single number summarizing total estimation quality.
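The vector identities can be checked numerically. The sketch below simulates draws of a three-dimensional estimator with a deliberately injected bias and correlated components; the parameter values, the `mixing` matrix, and the seed are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = np.array([1.0, -2.0, 0.5])     # hypothetical true parameter vector
shift = np.array([0.2, 0.0, -0.1])          # hypothetical bias injected on purpose
mixing = np.array([[1.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 1.0]])         # makes the components correlated

# 50,000 simulated draws of the vector estimator, one per row.
noise = rng.normal(size=(50_000, 3)) @ mixing.T
estimates = theta_true + shift + noise

bias = estimates.mean(axis=0) - theta_true
centered = estimates - estimates.mean(axis=0)
cov = centered.T @ centered / len(estimates)       # covariance matrix (1/N normalisation)
errors = estimates - theta_true
mse_matrix = errors.T @ errors / len(estimates)    # average of (est - true)(est - true)^T

total_mse = np.trace(mse_matrix)
decomposed = bias @ bias + np.trace(cov)
print(f"tr(MSE) = {total_mse:.6f},  ||Bias||^2 + tr(Cov) = {decomposed:.6f}")
```

With the 1/N normalisation the two numbers agree up to floating-point rounding, and `mse_matrix` equals `cov + np.outer(bias, bias)` entry by entry, mirroring the matrix formula above.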

When Does Covariance Matter?

For inference about a single component, only its marginal variance matters. But for simultaneous inference (e.g., confidence regions, hypothesis tests about multiple parameters), the full covariance structure is essential. Ignoring correlations can lead to overconfident or misleading conclusions.

Technical Assumptions & Connections

⚠️Existence of Moments

For bias, variance, and MSE to be well-defined, we need the relevant moments to exist:

Bias: Requires $E[|\hat{\theta}|] < \infty$ (first moment exists)
Variance & MSE: Requires $E[\hat{\theta}^2] < \infty$ (second moment exists)

For heavy-tailed distributions (e.g., Cauchy), the sample mean has no finite variance, so its MSE is not finite! In such cases, other loss functions (like absolute error) are more appropriate.
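A quick experiment shows what the missing moments mean in practice. Because the variance does not exist for the Cauchy case, this sketch measures spread with the interquartile range (IQR) of the sample mean instead; that choice, the seed, and the number of repetitions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_repeats = 2_000

def iqr_of_sample_means(draw, n):
    """Interquartile range of the sample mean across n_repeats simulated samples."""
    means = draw(size=(n_repeats, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

spread = {}
for name, draw in [("normal", rng.standard_normal), ("cauchy", rng.standard_cauchy)]:
    spread[name] = {n: iqr_of_sample_means(draw, n) for n in (10, 1000)}
    print(name, {n: round(v, 3) for n, v in spread[name].items()})
```

For normal data the spread of the sample mean shrinks by roughly a factor of ten between n = 10 and n = 1000. For Cauchy data it barely moves, because the mean of n standard Cauchy draws is itself standard Cauchy.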

🔗Connection to Section 03

In the next section on Consistency and Efficiency, we'll see that:

Consistency: MSE → 0 as n → ∞
Efficiency: Achieving lowest possible variance among unbiased estimators

📉Cramér-Rao Lower Bound

There's a fundamental limit to how small the variance of an unbiased estimator can be:

$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$

where $I(\theta)$ is the Fisher Information. We'll explore this in Chapter 12 (Fisher Information and CRLB).

Worked Example: Bernoulli Mean Estimation

Let's apply bias, variance, and MSE to a concrete problem: estimating the probability of success $p$ from coin flips.

🪙Problem Setup

You flip a coin n = 10 times and observe X = 3 heads. The true probability is unknown (say, p = 0.4). Compare two estimators:

1. Sample Proportion (Unbiased)

$\hat{p}_1 = \frac{X}{n} = \frac{3}{10} = 0.30$

2. Laplace-Smoothed (Biased)

$\hat{p}_2 = \frac{X + 1}{n + 2} = \frac{4}{12} = 0.333$
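To compare the two estimators, use $X \sim \text{Binomial}(n, p)$ , so $E[X] = np$ and $\text{Var}(X) = np(1-p)$ . The sample proportion then has bias 0 and variance $p(1-p)/n$ , while the Laplace-smoothed estimator has bias $(1 - 2p)/(n + 2)$ and variance $np(1-p)/(n+2)^2$ . The sketch below evaluates these at the assumed values n = 10 and p = 0.4, and cross-checks the MSE by summing over all possible outcomes (the helper `exact_mse` is a hypothetical name introduced here).

```python
from math import comb

n, p = 10, 0.4   # sample size and the (normally unknown) true probability

# Estimator 1: sample proportion X/n.
bias_1 = 0.0
var_1 = p * (1 - p) / n
mse_1 = bias_1**2 + var_1

# Estimator 2: Laplace-smoothed (X + 1)/(n + 2).
bias_2 = (n * p + 1) / (n + 2) - p
var_2 = n * p * (1 - p) / (n + 2) ** 2
mse_2 = bias_2**2 + var_2

def exact_mse(estimator):
    """MSE computed directly by summing squared error over the Binomial(n, p) outcomes."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) * (estimator(x) - p) ** 2
               for x in range(n + 1))

print(f"Sample proportion: bias = {bias_1:.4f}, var = {var_1:.4f}, MSE = {mse_1:.4f}")
print(f"Laplace-smoothed:  bias = {bias_2:.4f}, var = {var_2:.4f}, MSE = {mse_2:.4f}")
print(f"Direct check:      {exact_mse(lambda x: x / n):.4f} vs {exact_mse(lambda x: (x + 1) / (n + 2)):.4f}")
```

At p = 0.4 the smoothed estimator accepts a small bias (about 0.017) in exchange for a lower variance, and its MSE (about 0.0169) beats the sample proportion's (0.0240). The ranking depends on p: when p is close to 0 or 1, the smoothed estimator's bias can dominate and the sample proportion wins.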

The Bias-Variance Tradeoff

Here's a profound insight that underlies all of statistics and machine learning:

In most real-world scenarios, reducing bias increases variance, and reducing variance increases bias.

This is the famous bias-variance tradeoff. It explains why:

🔧Simple Models

High bias: Too restrictive, miss patterns
Low variance: Stable across samples
Result: Underfitting

🔬Complex Models

Low bias: Can fit any pattern
High variance: Sensitive to sample
Result: Overfitting

🤖The ML Connection: Model Complexity

Model	Bias	Variance	Risk
Constant prediction (always output mean)	High	Zero	Underfitting
Linear regression	Medium	Low	Depends on problem
Polynomial regression (degree 10)	Low	High	Overfitting risk
Deep neural network (no regularization)	Very Low	Very High	Severe overfitting

The Sweet Spot

The goal is to find the optimal tradeoff — a model that's complex enough to capture real patterns (low bias) but not so complex that it fits noise (low variance). This minimizes the total MSE.

🤔Wait — Can a Biased Estimator Be BETTER Than an Unbiased One?

Yes! This is a crucial insight that many beginners miss.

An unbiased estimator with high variance can have higher MSE than a biased estimator with low variance. What matters is the total error, not just the bias.

Example: Ridge Regression

Ordinary Least Squares (OLS): Unbiased but can have high variance
Ridge Regression: Deliberately adds bias to dramatically reduce variance
Result: Ridge often has lower MSE despite being biased!
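The comparison can be made exact for a fixed design matrix, because both estimators are linear in y. The sketch below uses the standard closed forms (OLS covariance $\sigma^2 (X^\top X)^{-1}$ ; ridge estimate $(X^\top X + \lambda I)^{-1} X^\top y$ ) on synthetic, strongly correlated features. The coefficients, noise variance, penalty, and seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2, lam = 30, 1.0, 0.5        # sample size, noise variance, ridge penalty (all assumed)
beta = np.array([1.0, 1.0])          # hypothetical true coefficients

# Two strongly correlated features: the setting where OLS variance is large.
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2])
gram = X.T @ X

# OLS: unbiased, so its total MSE is the trace of its covariance matrix.
ols_mse = sigma2 * np.trace(np.linalg.inv(gram))

# Ridge: E[beta_hat] = A X'X beta and Cov = sigma^2 A X'X A, with A = (X'X + lam I)^-1.
A = np.linalg.inv(gram + lam * np.eye(2))
ridge_bias = A @ gram @ beta - beta
ridge_cov = sigma2 * (A @ gram @ A)
ridge_mse = ridge_bias @ ridge_bias + np.trace(ridge_cov)

print(f"OLS   total MSE: {ols_mse:.4f} (bias 0)")
print(f"Ridge total MSE: {ridge_mse:.4f} (squared bias {ridge_bias @ ridge_bias:.4f})")
```

Ridge is biased here, yet its total MSE is lower. For this particular setup that outcome is guaranteed, since the penalty satisfies $\lambda < 2\sigma^2 / \|\beta\|^2$ ; with a much larger penalty the bias term would eventually take over.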

ML Techniques and the Bias-Variance Tradeoff

Modern machine learning provides several powerful techniques to navigate the bias-variance tradeoff. Each technique works by shifting the balance between bias and variance in a controlled way.

Technique	Effect on Bias	Effect on Variance	How It Works
Ridge Regression (L2)	↑ Slightly increases	↓ Significantly decreases	Shrinks coefficients toward zero; penalizes large weights with λ‖β‖²
LASSO (L1)	↑ Increases (some features = 0)	↓ Decreases significantly	Sparse solutions; sets irrelevant features exactly to zero with λ‖β‖₁
Elastic Net	↑ Moderate increase	↓ Decreases	Combines L1 + L2; balances sparsity and grouping
Bagging / Random Forests	≈ Unchanged	↓↓ Dramatically decreases	Averages many high-variance models; for B independent models Var(avg) = Var(single)/B, and correlated models gain less
Boosting	↓ Decreases	↑ May increase (if overfit)	Sequentially corrects residual errors; additive model
Early Stopping	↑ Slightly increases	↓ Decreases	Stops training before overfitting; implicit regularization
Dropout (Neural Nets)	↑ Slightly increases	↓ Decreases significantly	Randomly drops neurons; approximates ensemble of networks

🔄Cross-Validation: Estimating Expected Prediction Error

Cross-validation is how we estimate the expected prediction error (the true risk) from finite data. It directly estimates the MSE we would see on unseen test data.

k-Fold CV: Split data into k parts, train on k-1, test on 1, average the results
Leave-One-Out (LOOCV): k = n; nearly unbiased but high variance
Typical choice: k = 5 or 10 balances bias and variance of the error estimate itself

$\widehat{\text{Err}}_{\text{CV}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{f}^{-\kappa(i)}(x_i))$

where $\kappa(i)$ is the fold containing observation i and $\hat{f}^{-\kappa(i)}$ is the model trained without that fold.

Cross-validation helps us select the optimal regularization strength (like λ in Ridge/LASSO) by finding the value that minimizes estimated test error — the point of optimal bias-variance balance.

Key Insight: Regularization techniques (Ridge, LASSO, Dropout, Early Stopping) all work by deliberately increasing bias to reduce variance by a larger amount, resulting in lower overall MSE. Ensemble methods like Bagging reduce variance through averaging. Cross-validation helps us find the sweet spot.
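Here is a minimal sketch of k-fold cross-validation used to pick the ridge penalty, written with NumPy only. The helper names `ridge_fit` and `kfold_cv_error`, the synthetic data, and the grid of λ values are all assumptions for illustration; a library implementation would normally be used in practice.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution without intercept: (X'X + lam I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_cv_error(X, y, lam, k=5, seed=0):
    """Average held-out squared error over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    fold_errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        beta_hat = ridge_fit(X[train], y[train], lam)
        fold_errors.append(np.mean((y[held_out] - X[held_out] @ beta_hat) ** 2))
    return float(np.mean(fold_errors))

# Synthetic data (assumed): 40 observations, 15 features, only 3 of them relevant.
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 15))
beta_true = np.zeros(15)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(scale=1.0, size=40)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: kfold_cv_error(X, y, lam) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)

for lam, err in cv_errors.items():
    print(f"lambda = {lam:>6}: CV error = {err:.4f}")
print(f"Selected lambda: {best_lam}")
```

The same seed is reused for every λ so that all candidates are scored on identical folds, which keeps the comparison fair. Because the folds here have equal size, averaging the per-fold errors matches the per-observation average in the formula above.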


Python Implementation

Let's see these concepts in code. We'll compute bias, variance, and MSE for different estimators:

🐍python

import numpy as np

def simulate_estimator_properties(true_mean=5.0, true_var=4.0, n_samples=30,
                                   n_simulations=10000):
    """
    Simulate sampling to compute bias, variance, and MSE of the sample mean.
    """
    estimates = []

    for _ in range(n_simulations):
        # Draw a sample
        sample = np.random.normal(true_mean, np.sqrt(true_var), n_samples)
        # Compute sample mean (our estimator)
        estimate = sample.mean()
        estimates.append(estimate)

    estimates = np.array(estimates)

    # Compute properties (Monte Carlo approximations of the true quantities)
    expected_value = estimates.mean()  # E[θ̂]
    bias = expected_value - true_mean   # E[θ̂] - θ₀
    variance = estimates.var()          # Var(θ̂), using the default 1/N normalisation
    mse = np.mean((estimates - true_mean)**2)  # E[(θ̂ - θ₀)²]

    return {
        'expected_value': expected_value,
        'bias': bias,
        'bias_squared': bias**2,
        'variance': variance,
        'mse': mse,
        'mse_decomposition': bias**2 + variance,  # Should equal MSE!
        'estimates': estimates
    }

# Run simulation
results = simulate_estimator_properties()

print("═══════════════════════════════════════════")
print("  SAMPLE MEAN ESTIMATOR PROPERTIES")
print("═══════════════════════════════════════════")
print(f"  E[θ̂] (Expected Value):    {results['expected_value']:.4f}")
print(f"  θ₀ (True Mean):           5.0000")
print(f"  Bias:                     {results['bias']:.4f}")
print(f"  Bias²:                    {results['bias_squared']:.6f}")
print(f"  Variance:                 {results['variance']:.4f}")
print(f"  MSE (direct):             {results['mse']:.4f}")
print(f"  MSE (Bias² + Var):        {results['mse_decomposition']:.4f}")
print("═══════════════════════════════════════════")
print(f"  ✓ Verification: MSE ≈ Bias² + Var")
print("═══════════════════════════════════════════")

🐍python

# Comparing two estimators: Sample Mean vs Biased Estimator
import numpy as np

def compare_estimators(true_mean=5.0, true_var=4.0, n_samples=10,
                       n_simulations=10000):
    """
    Compare:
    1. Sample mean (unbiased)
    2. Shrinkage estimator: θ̂_shrink = 0.9 * X̄ + 0.4 (biased but lower variance)
    """
    unbiased_estimates = []
    biased_estimates = []

    for _ in range(n_simulations):
        sample = np.random.normal(true_mean, np.sqrt(true_var), n_samples)

        # Unbiased: sample mean
        unbiased_est = sample.mean()

        # Biased: shrink towards a fixed guess of 4.0, i.e. 0.9 * X̄ + 0.1 * 4.0.
        # The guess differs from the default true mean (5.0), so this adds bias
        # (about -0.1) but reduces variance (by a factor of 0.81).
        biased_est = 0.9 * sample.mean() + 0.4

        unbiased_estimates.append(unbiased_est)
        biased_estimates.append(biased_est)

    unbiased = np.array(unbiased_estimates)
    biased = np.array(biased_estimates)

    # Compute metrics
    def get_metrics(estimates, true_value):
        exp_val = estimates.mean()
        bias = exp_val - true_value
        var = estimates.var()
        mse = np.mean((estimates - true_value)**2)
        return exp_val, bias, var, mse

    u_exp, u_bias, u_var, u_mse = get_metrics(unbiased, true_mean)
    b_exp, b_bias, b_var, b_mse = get_metrics(biased, true_mean)

    print("\n" + "═"*60)
    print("  ESTIMATOR COMPARISON: Unbiased vs Shrinkage")
    print("═"*60)
    print(f"  {'Metric':<20} {'Unbiased':>15} {'Shrinkage':>15}")
    print("-"*60)
    print(f"  {'E[θ̂]':<20} {u_exp:>15.4f} {b_exp:>15.4f}")
    print(f"  {'Bias':<20} {u_bias:>15.4f} {b_bias:>15.4f}")
    print(f"  {'Variance':<20} {u_var:>15.4f} {b_var:>15.4f}")
    print(f"  {'MSE':<20} {u_mse:>15.4f} {b_mse:>15.4f}")
    print("═"*60)

    winner = "Unbiased" if u_mse < b_mse else "Shrinkage"
    print(f"  Winner (Lower MSE): {winner}")
    print("═"*60)

compare_estimators()

Key Takeaway from the Code

The simulation demonstrates that a biased estimator can have lower MSE than an unbiased one when the variance reduction outweighs the added bias. This is the principle behind regularization techniques like Ridge Regression and LASSO.

Key Insights

Bias Is About Accuracy, Variance Is About Precision

Bias measures whether you're hitting the right target on average. Variance measures how tightly your shots cluster together. You need both to be low for a good estimator.

The MSE Decomposition Is Exact, Not Approximate

MSE = Bias² + Variance is a mathematical identity, not an approximation. Every point of MSE comes from exactly one of these two sources.

Unbiased Isn't Always Best

A biased estimator with much lower variance can have lower MSE. This is why regularization (Ridge, LASSO) intentionally adds bias to dramatically reduce variance.

More Data Reduces Variance, Not Bias

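Collecting more samples shrinks the variance of the sample mean in proportion to 1/n (and its standard error in proportion to 1/√n), but a bias built into the estimation method does not go away no matter how much data you collect. Bias is a property of the estimation method itself.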

The Tradeoff Explains Underfitting vs Overfitting

Underfitting = high bias (model too simple). Overfitting = high variance (model too complex). The optimal model minimizes total error (MSE).

Summary

📝Section Summary

Bias

The systematic error of an estimator: $\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta_0$ . It measures where the estimator "aims" relative to the truth. A bias that is built into the method cannot be reduced by more data.

Variance

The spread of the estimator across different samples: $\text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$ . It measures how much the estimator "wiggles". Can be reduced by more data.

Mean Squared Error (MSE)

The total expected squared error: $\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta_0)^2]$ . Decomposes exactly as: MSE = Bias² + Variance.

Bias-Variance Tradeoff

Reducing bias often increases variance and vice versa. The goal is to minimize total MSE, which may involve accepting some bias to greatly reduce variance (regularization).

🔮What's Next?

In the next section, we'll explore Consistency and Efficiency — two additional properties that tell us how estimators behave as sample size grows and how to identify the "best possible" estimator among unbiased ones.

📚Symbol Glossary

Symbol	Name	Meaning
$\theta_0$	True parameter	The unknown value we're trying to estimate
$\hat{\theta}$	Estimator/Estimate	A function of sample data that produces estimates
$E[\hat{\theta}]$	Expected value	Average value of the estimator over all samples
$\text{Bias}(\hat{\theta})$	Bias	E[θ̂] - θ₀ — systematic error
$\text{Var}(\hat{\theta})$	Variance	E[(θ̂ - E[θ̂])²] — spread of estimates
$\text{MSE}(\hat{\theta})$	Mean Squared Error	E[(θ̂ - θ₀)²] = Bias² + Variance