Chapter 11

Decision Theory and Prediction

Point Estimation

Learning Objectives

Before You Start

This section provides the conceptual foundation for all of point estimation. You should be comfortable with expected values, probability distributions, and basic optimization concepts.

By the end of this section, you will be able to:

  • Understand decision theory: the framework for making optimal choices under uncertainty
  • Master loss functions: how to quantify the cost of making wrong decisions
  • Compare risk functions: frequentist risk, Bayes risk, and minimax approaches
  • Distinguish estimation from prediction: why predicting new values requires different thinking
  • Connect to point estimation: see how MSE, bias, and variance arise from decision theory


The Big Picture: Why Decision Theory?

Statistics is about making decisions under uncertainty. Decision theory provides the mathematical framework for choosing the "best" action when we don't know the true state of the world.

Before we dive into estimators, bias, and variance, we need to answer a fundamental question: What does it mean for an estimator to be "good"?

Different people might have different answers:

  • "An estimator that's right on average" (unbiasedness)
  • "An estimator that's usually close to the truth" (low variance)
  • "An estimator that minimizes my expected loss" (optimal decision)

Decision theory gives us a unified framework to think about all these properties. It tells us that the "best" estimator depends on:

  1. What we lose when we're wrong (the loss function)
  2. How we average that loss (the risk function)
  3. What we know beforehand (prior information)

The Core Insight

Every estimator property you will study (bias, variance, MSE, consistency, efficiency, sufficiency) is a decision-theoretic concept in disguise.

  • Bias and variance describe how the risk decomposes.
  • Consistency describes how the risk behaves as data grows.
  • Efficiency compares risk against theoretical lower bounds.
  • Sufficiency and completeness identify when risk cannot be improved.

Decision theory is not an optional interpretation; it is the mathematical spine of statistical inference.

The Estimation Machine Analogy

Think of an estimator as a machine:

You feed it raw data → it outputs a guess about an unknown truth.

Just as a physical machine can be evaluated for accuracy and precision, an estimator can be evaluated using several fundamental criteria:

Bias: Systematic Error

Question: Is the machine centered on the truth or consistently off-target?

  • If it guesses too high on average → positive bias
  • If it guesses too low on average → negative bias

Interpretation: Bias measures systematic error.

Variance: Random Scatter

Question: How much do the machine's outputs fluctuate from run to run?

  • Tight clustering → low variance
  • Wildly different answers → high variance

Interpretation: Variance measures random instability.

MSE: Total Error

Question: Overall, how wrong is the machine on average?

$$\text{MSE} = \text{Bias}^2 + \text{Variance}$$

Interpretation: MSE combines systematic error and random error into one score.

Consistency: Learning with More Data

Question: As we feed the machine more and more data, does it eventually lock onto the true value?

  • If yes → consistent estimator
  • If no → inconsistent estimator

Interpretation: Consistency is a long-run guarantee, not a finite-sample promise.

Efficiency: Best Possible Precision

Question: Among all unbiased machines, does this one have the tightest grouping?

  • If it achieves the smallest possible variance among unbiased estimators, it is efficient

Interpretation: Efficiency means no other unbiased estimator is more precise.

Sufficiency: No Wasted Information

Question: Does the machine extract all useful information from the data, or does it throw some away?

  • If nothing is lost → sufficient
  • If relevant information is discarded → insufficient

Interpretation: Sufficiency is about lossless information compression.

Bottom Line

A near-perfect estimation machine would be:

  • Unbiased → centered on the truth
  • Low variance → stable across samples
  • Low MSE → small total error
  • Consistent → converges with more data
  • Efficient → best possible precision
  • Sufficient → wastes no information

This is exactly what optimal statistical estimation aims to achieve.
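
The identity MSE = Bias² + Variance from the machine analogy can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are N(θ, 1), and `shrunk_mean` (the sample mean multiplied by 0.8) is a hypothetical, deliberately biased estimator that does not appear in the text.

```python
import random
import statistics

def simulate_estimates(estimator, theta, n, runs, seed=0):
    """Apply `estimator` to `runs` independent samples of size n from N(theta, 1)."""
    rng = random.Random(seed)
    return [estimator([rng.gauss(theta, 1.0) for _ in range(n)]) for _ in range(runs)]

def shrunk_mean(sample, c=0.8):
    """Hypothetical biased estimator: the sample mean shrunk toward zero."""
    return c * statistics.fmean(sample)

theta = 2.0
estimates = simulate_estimates(shrunk_mean, theta, n=10, runs=5000)

# Empirical versions of the three quantities in the decomposition.
bias = statistics.fmean(estimates) - theta
variance = statistics.pvariance(estimates)
mse = statistics.fmean((e - theta) ** 2 for e in estimates)

print(f"bias^2 + variance = {bias ** 2 + variance:.4f}")
print(f"MSE               = {mse:.4f}")
```

The two printed numbers agree up to floating-point rounding. Under these assumptions the theoretical values are bias = -0.4 and variance = 0.064, so the MSE should come out near 0.224.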

Machine Learning Perspective

The concepts from classical estimation theory map directly onto modern machine learning. Understanding this connection helps you see that ML is applied decision theory.

Classical Estimation ↔ Machine Learning Intuition

| Statistical Concept | Estimator-Machine Meaning | Machine Learning Interpretation |
| --- | --- | --- |
| Bias | Is the machine systematically off-target? | Underfitting: model too simple, misses true structure |
| Variance | How much do outputs fluctuate across samples? | Overfitting: model too sensitive to noise |
| MSE / Risk | Total error combining bias and variance | Generalization error on unseen test data |
| Consistency | Does the machine improve with more data? | Model converges as dataset grows |
| Efficiency | Among unbiased machines, is this the tightest? | Best possible accuracy for given data + model class |
| Sufficiency | Is any useful information being thrown away? | Feature bottleneck / information loss |

The Core ML Insight

Training a neural network can be viewed as tuning an estimation machine to minimize decision-theoretic risk under a chosen loss. The expected risk cannot be observed directly, so training minimizes the empirical risk as a proxy.

  • Loss function → training objective
  • Risk (expected loss) → true generalization error
  • Empirical risk → training loss
  • Regularization → bias-variance control
  • Feature engineering / representation learning → sufficiency

Bias-Variance in ML Language

High Bias = Underfitting
  • Model is too rigid
  • Misses patterns in data
  • Training error stays high even with more training
  • High error on both train and test

High Variance = Overfitting
  • Model is too flexible
  • Fits noise in training data
  • Huge train-test gap
  • Low train error, high test error

Optimal Model = Balanced
  • Right model complexity
  • Balanced bias and variance
  • Minimum test error, i.e. minimum MSE / risk

The goal of model selection: find the sweet spot where $\text{Bias}^2 + \text{Variance}$ is minimized.

Why This Matters for ML Engineers

With every hyperparameter you tune, every architecture choice you make, and every regularization technique you apply, you are navigating the bias-variance tradeoff. Decision theory gives you the mathematical foundation to understand why these techniques work.

Unifying Principle

Every estimator is a machine that turns data into decisions.

Every statistical property (bias, variance, MSE, consistency, efficiency, sufficiency) is just a different way of scoring how well that machine behaves under uncertainty.


What Is Decision Theory?

Intuitive Understanding

Imagine you're a doctor diagnosing a patient. You observe symptoms (data) but don't know the true disease (parameter). You must choose a treatment (action). Different actions have different consequences depending on the true disease.

The same logic applies to estimation:

  • Data = your observations $X_1, X_2, \ldots, X_n$
  • Unknown state = true parameter $\theta$
  • Action = your estimate $\hat{\theta}$
  • Loss = how "wrong" your estimate is

Types of Statistical Problems

The information we extract from data takes different forms depending on our goals. Decision theory provides a unified framework for all of them:

Estimation

Goal: Produce "best guesses" of unknown parameters

Action space: All possible parameter values

Examples: Fraction defective $\theta$, population mean $\mu$, regression coefficients $\beta$

Testing

Goal: Decide whether the data support a hypothesis

Action space: {Accept $H_0$, Reject $H_0$}

Examples: Drug effectiveness vs placebo, A/B test significance, quality control pass/fail

Ranking

Goal: Order items from best to worst

Action space: All $k!$ possible orderings of $k$ items

Examples: Consumer reports ranking brands, search result ordering, tournament seeding

Prediction

Goal: Forecast future observations given covariates

Action space: Predictor functions $\mu(\cdot)$, which give predicted values $\hat{Y} = \mu(\mathbf{z})$

Examples: Patient response given (age, sex, dose), house price given features, demand forecasting

The Common Thread

In all cases, the analysis doesn't stop at specifying an estimate, test, ranking, or prediction. We must also evaluate how well our procedure performs. This requires criteria of performance, which is exactly what decision theory provides.

Why Decision Theory Matters

Given so many possible procedures (sample mean vs median, different test statistics, various models), how do we choose? Decision theory provides the framework to answer this systematically.

A Priori Performance

When: Before looking at data

Purpose: Study design, sample size determination

Question: "How well can the best procedure do?"

Example: Determining how many patients we need to detect a treatment effect with 80% power

A Posteriori Performance

When: After data is collected

Purpose: Assess reliability of our estimate

Question: "How reliable is this particular estimate?"

Example: Confidence intervals, standard errors, posterior credible intervals

The Four Purposes of Decision Theory

The decision-theoretic framework helps us:

  1. Clarify objectives: What exactly are we trying to achieve? Estimation, testing, ranking, or prediction?
  2. Identify possible actions: What decisions can we make? What is the action space?
  3. Assess risk, accuracy, and reliability: How do we measure "how well" a procedure performs? What are the relevant metrics?
  4. Guide procedure selection: Given objectives and performance criteria, which procedure should we use?

The Fundamental Question: In estimation we care how far off we are; in testing, what mistakes we've made; in ranking, which orderings are wrong. Decision theory gives us the mathematical language to express and minimize these errors.

Formal Framework

A statistical decision problem consists of three ingredients:

  1. State space $\Theta$: the set of possible true parameter values. Examples: $\Theta = (0, 1)$ for probabilities, $\Theta = \mathbb{R}$ for means, $\Theta = \mathbb{R}^+$ for variances.
  2. Action space $\mathcal{A}$: the set of possible decisions. For point estimation, $\mathcal{A} = \Theta$ (we choose an estimate from the same space as the parameter).
  3. Loss function $L(\theta, a)$: a function $L: \Theta \times \mathcal{A} \to [0, \infty)$ measuring the cost of taking action $a$ when the true state is $\theta$. Lower loss is better.

Decision Procedures

A decision procedure (or decision rule) is a function $\delta: \mathcal{X} \to \mathcal{A}$ that maps any possible data outcome to an action. When we observe $\mathbf{X} = \mathbf{x}$, we take action $\delta(\mathbf{x})$.

$$\delta: \mathcal{X} \to \mathcal{A} \quad \text{where} \quad \delta(\mathbf{x}) = \text{action taken when } \mathbf{X} = \mathbf{x}$$

Examples of Decision Rules

Estimation: Two Competing Rules

For estimating the population mean $\mu$ from data $X_1, \ldots, X_n$:

  • $\delta_1(\mathbf{x}) = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ (sample mean)
  • $\delta_2(\mathbf{x}) = \tilde{x}$ (sample median)

Which is better? That depends on the loss function and the true distribution!

Testing: Two-Sample Problem

Testing $H_0: \mu_X = \mu_Y$ vs $H_1: \mu_X \neq \mu_Y$ with data from two groups:

$$\delta(\mathbf{x}, \mathbf{y}) = \begin{cases} 0 & \text{if } \frac{|\bar{x} - \bar{y}|}{\hat{\sigma}} < c \\ 1 & \text{if } \frac{|\bar{x} - \bar{y}|}{\hat{\sigma}} \geq c \end{cases}$$

Here 0 means "accept $H_0$" and 1 means "reject $H_0$". The critical value $c$ controls the tradeoff between Type I and Type II errors.

Prediction: Linear Regression

Given training data $\{(\mathbf{z}_i, y_i)\}_{i=1}^n$, predict $Y$ for new $\mathbf{z}$:

  • $\delta(\mathbf{z}) = \hat{\beta}_0 + \hat{\beta}_1 z_1 + \cdots + \hat{\beta}_p z_p$

The action here is an entire function! The space of all candidate predictor functions is in general infinite-dimensional; restricting to linear predictors reduces the choice to the $p + 1$ coefficients.

Key Insight

The notation $\delta(\mathbf{x})$ emphasizes that our decision is a function of the data. We don't just pick a number; we specify a rule that tells us what to do for any possible dataset we might observe.
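
The idea that a decision rule is a function of the data translates directly into code. The following is a minimal sketch, not a reference implementation: the function names are made up for illustration, and the choice of $\hat{\sigma}$ (the unpooled standard error of the difference in means) and the default critical value 1.96 are assumptions, since the text leaves both unspecified.

```python
import math
import statistics

def delta_mean(x):
    """Estimation rule delta_1: the sample mean."""
    return statistics.fmean(x)

def delta_median(x):
    """Estimation rule delta_2: the sample median."""
    return statistics.median(x)

def delta_two_sample(x, y, c=1.96):
    """Testing rule: return 1 (reject H0) if the standardized mean gap reaches c, else 0."""
    # Assumption: sigma_hat is the unpooled standard error of x_bar - y_bar.
    sigma_hat = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    gap = abs(statistics.fmean(x) - statistics.fmean(y))
    return 1 if gap / sigma_hat >= c else 0

data = [1.0, 2.0, 3.0, 10.0]   # one large value pulls the mean, not the median
print(delta_mean(data))         # 4.0
print(delta_median(data))       # 2.5
```

The same dataset produces different actions under the two estimation rules, which is why a criterion such as risk is needed to compare them.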

The Decision Recipe

One-Glance Decision Recipe

  1. Specify $\Theta$ and $\mathcal{A}$. What are the possible states? What decisions can you take?
  2. Pick a loss function $L(\theta, a)$.
    • Squared error: large errors matter disproportionately
    • Absolute error: need robustness to outliers
    • Asymmetric: over- and under-estimating have different costs
    • 0-1 loss: classification or hypothesis testing
  3. Compute the risk $R(\theta, \delta)$. Average the loss over the sampling distribution: $\mathbb{E}_\theta[L(\theta, \delta(X))]$
  4. Choose your decision rule.
    • Bayes: have a prior? Minimize the Bayes risk
    • Minimax: protect against the worst case
    • Frequentist: compare the pointwise risk $R(\theta, \delta)$ across $\theta$


Loss Functions

The loss function is the heart of decision theory. It quantifies: "How bad is it to choose action $a$ when the truth is $\theta$?"

Common Loss Functions

[Interactive figure: Compare Loss Functions. Axes: $\theta$ (true), $a$ (estimate), error. Panels: Squared Error $(\theta - a)^2$, Absolute Error $|\theta - a|$, Huber Loss (robust), 0-1 Loss (exact match). Sample readout at error = 2: 4.00, 2.00, 1.50, 1.]

Observation: Squared error penalizes large errors much more than absolute error. Going from error = 2 to error = 4, squared loss quadruples while absolute loss only doubles!

| Loss Function | Formula $L(\theta, a)$ | Properties | Use When... |
| --- | --- | --- | --- |
| Squared Error | $(\theta - a)^2$ | Differentiable, penalizes large errors heavily | Errors of similar magnitude, computational convenience |
| Absolute Error | $\lvert\theta - a\rvert$ | Robust to outliers, non-differentiable at 0 | Large errors shouldn't dominate |
| 0-1 Loss | 0 if $a = \theta$, 1 otherwise | Used for classification/testing | Only exact correctness matters |
| Asymmetric | $c_1(\theta - a)^+ + c_2(a - \theta)^+$ | Different costs for over/under-estimation | Consequences differ by direction of error |
| Huber Loss | $\frac{1}{2}(\theta - a)^2$ if $\lvert\theta - a\rvert \le \delta$, and $\delta(\lvert\theta - a\rvert - \frac{1}{2}\delta)$ otherwise | Combines benefits of squared and absolute | Robustness with differentiability |

In the Huber row, $\delta > 0$ is a threshold parameter, not a decision rule.
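
The loss functions in the table can be written as short Python functions. This is a sketch for illustration: the function names are invented here, the Huber threshold is called `threshold` to avoid a clash with the decision rule $\delta$, and its default of 1.0 is an assumption chosen because it reproduces the 1.50 shown in the interactive readout.

```python
def squared_error(theta, a):
    return (theta - a) ** 2

def absolute_error(theta, a):
    return abs(theta - a)

def zero_one(theta, a):
    return 0 if a == theta else 1

def asymmetric(theta, a, c1, c2):
    """c1 per unit of underestimation (a < theta), c2 per unit of overestimation."""
    return c1 * max(theta - a, 0) + c2 * max(a - theta, 0)

def huber(theta, a, threshold=1.0):
    """Quadratic for small errors, linear beyond the threshold."""
    err = abs(theta - a)
    if err <= threshold:
        return 0.5 * err ** 2
    return threshold * (err - 0.5 * threshold)

theta, a = 5.0, 3.0                   # an error of 2
print(squared_error(theta, a))        # 4.0
print(absolute_error(theta, a))       # 2.0
print(huber(theta, a))                # 1.5
print(zero_one(theta, a))             # 1
print(asymmetric(theta, a, 5, 1))     # 10.0, underestimation is charged at c1 = 5
```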

Advanced Loss Functions

Beyond the basic loss functions, several specialized losses arise in practice:

Loss Functions for Vector Parameters

When estimating a $d$-dimensional parameter $\boldsymbol{\nu} = (\nu_1, \ldots, \nu_d)$ with estimate $\mathbf{a} = (a_1, \ldots, a_d)$:

Squared Euclidean Distance

$$L(\boldsymbol{\nu}, \mathbf{a}) = \frac{1}{d}\sum_{j=1}^d (a_j - \nu_j)^2 = \frac{1}{d}\|\mathbf{a} - \boldsymbol{\nu}\|_2^2$$

Most common choice; decomposes into a sum of univariate losses

Absolute Distance (L1)

$$L(\boldsymbol{\nu}, \mathbf{a}) = \frac{1}{d}\sum_{j=1}^d |a_j - \nu_j| = \frac{1}{d}\|\mathbf{a} - \boldsymbol{\nu}\|_1$$

Robust to outliers; the same norm leads to sparse solutions when used as a regularization penalty

Supremum Distance (L∞)

$$L(\boldsymbol{\nu}, \mathbf{a}) = \max_{j=1,\ldots,d} |a_j - \nu_j| = \|\mathbf{a} - \boldsymbol{\nu}\|_\infty$$

Worst-case error across all components; minimax flavor

Prediction Loss (Integrated Squared Error)

For prediction problems where the true function is $\mu(\mathbf{z})$, our predictor is $a(\mathbf{z})$, and $Q$ is a distribution over the covariates:

$$L(P, a) = \int (\mu(\mathbf{z}) - a(\mathbf{z}))^2 \, dQ(\mathbf{z})$$

If $Q$ is the empirical distribution of the training covariates:

$$L(P, a) = \frac{1}{n}\sum_{j=1}^n (\mu(\mathbf{z}_j) - a(\mathbf{z}_j))^2$$

This is the in-sample mean squared prediction error. Since $\mu$ is unknown, regression minimizes its observable analogue, with $y_j$ in place of $\mu(\mathbf{z}_j)$.

Worked Example: The Newsvendor Problem

The newsvendor problem is a classic example of asymmetric loss in action. A vendor must decide how many newspapers to stock before knowing the day's demand.

Newsvendor: Asymmetric Loss in Action

Setup:

  • Each newspaper costs 1 dollar to buy and sells for 2 dollars
  • Understock cost (lost sale): $c_u = 1$ dollar of profit missed
  • Overstock cost (unsold paper): $c_o = 1$ dollar of purchase price lost

The Optimal Quantile Formula:

$$\alpha^* = \frac{c_u}{c_u + c_o} = \frac{1}{1 + 1} = 0.5$$

Order the 50th percentile (median) of demand when costs are equal.

Now suppose stockouts are worse:

  • Understock cost: $c_u = 5$ dollars (angry customer, lost reputation)
  • Overstock cost: $c_o = 1$ dollar (just the paper cost)

$$\alpha^* = \frac{c_u}{c_u + c_o} = \frac{5}{5 + 1} \approx 0.833$$

Order the 83rd percentile of demand: stock more to avoid stockouts!

The General Principle

Under asymmetric loss $L(\theta, a) = c_u(\theta - a)^+ + c_o(a - \theta)^+$, the action that minimizes expected loss is the $\alpha$-quantile of the distribution of $\theta$ (here, demand), where $\alpha = c_u/(c_u + c_o)$.
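
The quantile rule can be checked by brute force. The sketch below assumes a hypothetical demand distribution (the integers 1 to 100, equally likely), which is not from the text, and simply searches for the order size with the smallest total asymmetric loss.

```python
def total_loss(order, demands, c_u, c_o):
    """Sum of asymmetric losses over the demand scenarios for a given order size."""
    return sum(c_u * max(d - order, 0) + c_o * max(order - d, 0) for d in demands)

def best_order(demands, c_u, c_o):
    """Brute-force minimizer over the observed demand values."""
    return min(sorted(set(demands)), key=lambda q: total_loss(q, demands, c_u, c_o))

# Hypothetical demand scenarios: 1, 2, ..., 100, all equally likely.
demands = list(range(1, 101))

print(best_order(demands, c_u=5, c_o=1))   # 84, the smallest q with F(q) >= 5/6
print(best_order(demands, c_u=1, c_o=1))   # 50 (51 ties): any median minimizes the loss
```

With $c_u = 5$ and $c_o = 1$ the search lands on 84, the smallest order whose cumulative demand probability reaches $5/6 \approx 0.833$, which matches the quantile formula.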

Loss Functions in ML Practice

The loss functions from decision theory appear throughout machine learning under different names:

| Decision Theory Loss | ML Training Loss | Eval Metric | Use Case |
| --- | --- | --- | --- |
| Squared Error $(\theta - a)^2$ | MSE Loss | RMSE, $R^2$ | Regression with normal errors |
| Absolute Error $\lvert\theta - a\rvert$ | L1 / MAE Loss | MAE, MedAE | Robust regression, sparse solutions |
| 0-1 Loss | Classification Error | Accuracy, Error Rate | Hard classification decisions |
| Log Loss $-\log P(\theta \mid a)$ | Cross-Entropy / NLL | Log Loss, Perplexity | Probabilistic classification |
| Huber Loss | Smooth L1 Loss | Huber metric | Object detection, robust regression |
| Asymmetric Loss | Quantile Loss / Pinball | Quantile coverage | Demand forecasting, prediction intervals |

Theory ↔ Practice Connection

When you minimize cross-entropy loss in a neural network, you are computing a maximum likelihood estimate under the model's predictive distribution. When you minimize MSE on the training set, you are minimizing the empirical version of expected squared error loss. The frameworks connect!

Choosing a Loss Function

Key Insight

The choice of loss function determines the optimal estimator! In the Bayesian setting, the optimal estimator under squared error loss is the posterior mean; under absolute error loss, it is the posterior median.


Risk Functions

The loss $L(\theta, \delta(X))$ is random because it depends on the random data $X$. We need a way to summarize the "typical" or "expected" loss. This is the risk function.

1. Why Do We Even Need a Risk Function?

Let's start from the most basic problem:

You choose an estimator $\delta(X)$.
You plug in your observed data $X$.
You get one number $\delta(X)$.

But here's the problem:

  • That number is produced by random data.
  • If you repeated the experiment, you would get a different dataset.
  • That means your estimator would output a different value every time.

So now ask this:

"How do I judge whether my estimator is good, if every repetition gives a different error?"

You cannot judge an estimator by a single outcome.
You must judge it by its long-run behavior under randomness.

That is exactly what the risk function is:

The risk function is the average long-run penalty your decision rule will pay if the true parameter is $\theta$.

It converts:

  • Random loss → deterministic performance curve
  • One noisy outcome → a stable performance summary

Without risk, you cannot compare estimators scientifically.

3. What Does Risk Tell Us?

The risk function answers this fundamental question:

"If the true world were $\theta$, how painful would it be to use this estimator forever?"

So risk tells you:

| What You Want | What Risk Tells You |
| --- | --- |
| Accuracy | Average closeness to truth |
| Reliability | How stable performance is |
| Robustness | Sensitivity to randomness |
| Safety | Expected damage |
| Optimality | Whether another rule does better |

Risk turns intuition into a measurable object.

Final One-Sentence Truth
The risk function is the bridge between mathematical uncertainty and real-world consequence.

Frequentist Risk

The frequentist risk (or simply "risk") averages the loss over the sampling distribution of the data:

$$R(\theta, \delta) = \mathbb{E}_\theta[L(\theta, \delta(X))] = \int L(\theta, \delta(x)) f(x \mid \theta)\, dx$$

Interpretation: "If $\theta$ is the true parameter, what's my expected loss if I use decision rule $\delta$ repeatedly?"

For squared error loss, the risk has a special name:

$$R(\theta, \delta) = \mathbb{E}_\theta[(\theta - \delta(X))^2] = \text{MSE}_\theta(\delta)$$

From Risk to MSE to Bias-Variance

The MSE depends on two fundamental quantities, written here for a generic estimator $\hat{\nu}$ of a parameter $\nu$:

Bias: $\text{Bias}(\hat{\nu}) = \mathbb{E}[\hat{\nu}] - \nu$ (systematic deviation)
Variance: $\text{Var}(\hat{\nu}) = \mathbb{E}[(\hat{\nu} - \mathbb{E}[\hat{\nu}])^2]$ (fluctuation)
The Famous Decomposition: $\text{MSE}(\hat{\nu}) = \text{Bias}^2(\hat{\nu}) + \text{Var}(\hat{\nu})$, that is, total error = systematic + random

Preview: This decomposition is the central result. We'll explore the bias-variance tradeoff in Section 2.
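
For readers who want to see where the decomposition comes from, here is the standard short derivation, using the same symbols $\hat{\nu}$ and $\nu$ as above. The only step that needs care is the cross term, which vanishes because $\mathbb{E}[\hat{\nu} - \mathbb{E}[\hat{\nu}]] = 0$ and $\mathbb{E}[\hat{\nu}] - \nu$ is a constant.

```latex
\begin{aligned}
\text{MSE}(\hat{\nu})
  &= \mathbb{E}\big[(\hat{\nu} - \nu)^2\big] \\
  &= \mathbb{E}\big[\big((\hat{\nu} - \mathbb{E}[\hat{\nu}]) + (\mathbb{E}[\hat{\nu}] - \nu)\big)^2\big] \\
  &= \mathbb{E}\big[(\hat{\nu} - \mathbb{E}[\hat{\nu}])^2\big]
     + 2\,(\mathbb{E}[\hat{\nu}] - \nu)\,\mathbb{E}\big[\hat{\nu} - \mathbb{E}[\hat{\nu}]\big]
     + (\mathbb{E}[\hat{\nu}] - \nu)^2 \\
  &= \text{Var}(\hat{\nu}) + 0 + \text{Bias}^2(\hat{\nu}).
\end{aligned}
```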


Interactive: Risk Comparison (Normal Data)

Compare the risk (expected squared error loss) of the sample mean vs sample median for estimating the mean of a Normal distribution.

Assumptions: $X_i \sim N(\theta, \sigma^2 = 1)$ i.i.d. | $\sigma^2$ known | Squared error loss | Monte Carlo (1000 runs)

[Interactive figure. Panels: Sample Mean Risk (theory: $1/n = 0.1000$), Sample Median Risk (theory: $\approx (\pi/2)/n = 0.1571$), Relative Efficiency. The theory values correspond to $n = 10$.]

Decision Theory Insight

Under squared error loss and Normal data, the sample mean has lower risk than the median. Decision theory helps us make this comparison rigorous.
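
The comparison can be reproduced with a few lines of simulation. This is a rough sketch under the same assumptions as the interactive panel ($N(\theta, 1)$ data, squared error loss); the sample size $n = 10$ is inferred from the theory value $1/n = 0.1$, while the number of runs, the seed, and the function name are choices made here for illustration.

```python
import random
import statistics

def monte_carlo_risk(estimator, theta=0.0, n=10, runs=20000, seed=1):
    """Estimate R(theta, delta) = E[(theta - delta(X))^2] by simulation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += (theta - estimator(sample)) ** 2
    return total / runs

# The same seed means both rules are scored on identical simulated datasets.
risk_mean = monte_carlo_risk(statistics.fmean)
risk_median = monte_carlo_risk(statistics.median)

print(f"sample mean risk   ~ {risk_mean:.4f}  (theory 1/n = 0.1)")
print(f"sample median risk ~ {risk_median:.4f}")
print(f"ratio median/mean  ~ {risk_median / risk_mean:.2f}")
```

The simulated median risk at $n = 10$ need not match $(\pi/2)/n$ closely, since that expression is a large-sample approximation.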

Bayes Risk and Minimax Risk: Two Ways to Choose an Optimal Decision Rule

The frequentist risk

$$R(\theta, \delta) = \mathbb{E}_\theta[L(\theta, \delta(X))]$$

is a function of the unknown true parameter $\theta$. Since $\theta$ is not known in practice, the risk alone does not immediately tell us how to select one estimator over another. This leads to a fundamental question:

How should we choose an optimal decision rule when performance depends on an unknown truth?

Decision theory provides two principled answers to this question: the Bayes approach and the minimax approach.

Bayes Risk (Average-Case Optimality)

In the Bayesian framework, the parameter $\theta$ is treated as a random variable with prior distribution $\pi(\theta)$. The performance of a decision rule is measured by its Bayes risk:

$$r(\delta) = \mathbb{E}_{\theta \sim \pi}[R(\theta, \delta)] = \int R(\theta, \delta) \, \pi(\theta) \, d\theta$$

A Bayes decision rule is defined as:

$$\delta_B = \arg\min_\delta r(\delta)$$

Interpretation
Bayes risk is the expected long-run loss, averaged over our prior beliefs about which world is likely to be true.

Thus, Bayes optimality is an average-case optimality, weighted by subjective or empirical beliefs.

Minimax Risk (Worst-Case Optimality)

In the minimax framework, no prior distribution is assumed. Instead, Nature is treated as an adversary who may choose the worst possible value of $\theta$. The relevant performance measure is the maximal risk:

$$\sup_{\theta \in \Theta} R(\theta, \delta)$$

A minimax decision rule is defined as:

$$\delta_M = \arg\min_\delta \sup_{\theta \in \Theta} R(\theta, \delta)$$

Interpretation
Minimax risk measures how bad things could get in the worst possible world. The minimax rule is the one with the smallest guaranteed maximum damage.

This is a worst-case optimality principle.

One Unified Decision-Theoretic View

Both Bayes and minimax arise from a single abstract principle:

$$\delta^* = \arg\min_\delta \mathcal{R}(\delta)$$

where the functional $\mathcal{R}(\delta)$ is either:

  • an expectation over $\theta$ (Bayes), or
  • a supremum over $\theta$ (minimax).

Thus, Bayes and minimax are not competing theories; they are two different ways of aggregating the same underlying risk function.

  • Bayes: minimize average risk under a belief
  • Minimax: minimize worst possible risk under uncertainty

Bayes risk teaches us how to act intelligently when we believe something about the world.

Minimax risk teaches us how to act safely when we don't trust the world at all.

Formal Comparison:

Bayes Approach

Put a prior distribution $\pi(\theta)$ on the parameter and minimize the Bayes risk:

$$r(\pi, \delta) = \mathbb{E}_\pi[R(\theta, \delta)] = \int R(\theta, \delta) \pi(\theta)\, d\theta$$

Average risk over your prior beliefs about $\theta$.

Formal Definition

In the Bayesian framework, $\theta$ is random. The Bayes risk is:

$$r(\delta) = \mathbb{E}[R(\theta, \delta)] = \mathbb{E}[L(\theta, \delta(X))]$$

(expectation form with loss function; the last expectation is over both $\theta$ and $X$)

For discrete $\theta$:

$$r(\delta) = \sum_\theta R(\theta, \delta) \pi(\theta)$$

For continuous $\theta$:

$$r(\delta) = \int R(\theta, \delta) \pi(\theta)\, d\theta$$

Minimax Approach

Minimize the worst-case risk over all $\theta$:

$$\delta^* = \arg\min_\delta \max_\theta R(\theta, \delta)$$

Prepare for the adversarial scenario where nature picks the worst $\theta$.

Formal Definition

We prefer $\delta$ to $\delta'$ iff:

$$\sup_\theta R(\theta, \delta) < \sup_\theta R(\theta, \delta')$$

A procedure $\delta^*$ is minimax if:

$$\sup_\theta R(\theta, \delta^*) = \inf_\delta \sup_\theta R(\theta, \delta)$$

That is, $\delta^*$ minimizes the maximum risk.

Worked Example: Oil Drilling Decision

Suppose an expert believes the probability of finding oil is $\theta$, which can take two values: $\theta_1$ (low yield) or $\theta_2$ (high yield). The expert assigns prior probabilities:

$$\pi(\theta_1) = 0.2, \quad \pi(\theta_2) = 0.8$$

The Bayes risk of any procedure $\delta$ is:

$$r(\delta) = 0.2 \cdot R(\theta_1, \delta) + 0.8 \cdot R(\theta_2, \delta)$$

Consider 9 possible decision procedures. Their risks are:

| Procedure $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bayes risk $r(\delta_i)$ | 9.6 | 7.48 | 8.38 | 4.92 | 2.8 | 3.7 | 7.02 | 4.9 | 5.8 |
| $\max\{R(\theta_1, \delta_i), R(\theta_2, \delta_i)\}$ | 12 | 7.6 | 9.6 | 5.4 | 10 | 6.5 | 8.4 | 8.5 | 6 |

Bayes Rule

$\delta_5$ has minimum Bayes risk $r(\delta_5) = 2.8$. This is the unique Bayes rule for this prior.

Minimax Rule

$\delta_4$ has minimum max-risk = 5.4. This is the minimax rule.

Key Observation

The Bayes rule ($\delta_5$) and minimax rule ($\delta_4$) are different! The Bayes rule optimizes average performance under the prior, while minimax protects against the worst case.
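
Selecting the two rules from the table is a pair of argmin operations. The sketch below copies the nine Bayes risks and nine maximum risks from the table; the helper name `argmin_procedure` is made up for illustration.

```python
# Values copied from the table, indexed by procedure number 1..9.
bayes_risk = [9.6, 7.48, 8.38, 4.92, 2.8, 3.7, 7.02, 4.9, 5.8]
max_risk = [12, 7.6, 9.6, 5.4, 10, 6.5, 8.4, 8.5, 6]

def argmin_procedure(criterion):
    """Return the 1-based index of the procedure with the smallest criterion value."""
    return min(range(1, len(criterion) + 1), key=lambda i: criterion[i - 1])

bayes_rule = argmin_procedure(bayes_risk)     # average-case criterion
minimax_rule = argmin_procedure(max_risk)     # worst-case criterion

print(f"Bayes rule:   delta_{bayes_rule}")     # delta_5
print(f"Minimax rule: delta_{minimax_rule}")   # delta_4
```

The same risk table gives different winners only because the two criteria aggregate the risks differently (a weighted average versus a maximum).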

Game Theory Interpretation of Minimax

The minimax criterion comes from two-person zero-sum game theory (von Neumann):

  • Player I, Nature: picks $\theta \in \Theta$ (possibly adversarially)
  • Player II, Statistician: picks $\delta \in \mathcal{D}$ (decision procedure)

The statistician "pays" Nature the risk $R(\theta, \delta)$. The maximum risk of $\delta^*$ is the upper pure value of the game.

Minimax is Very Conservative

This criterion aims to give maximum protection against the worst that can happen: Nature choosing a $\theta$ that makes the risk as large as possible.

The principle is compelling if you believe the parameter is being chosen by a malevolent opponent who knows your decision procedure. However, most statisticians find minimax too conservative as a general rule, though it can lead to very reasonable procedures in adversarial or safety-critical settings.

Randomized Procedures Can Lower Maximum Risk

A key insight from game theory: randomizing between procedures can reduce maximum risk!

Example: In the oil drilling problem, suppose we flip a fair coin and use $\delta_4$ if heads, $\delta_6$ if tails. The expected risk of this randomized procedure is:

$$\frac{1}{2}R(\theta, \delta_4) + \frac{1}{2}R(\theta, \delta_6) = \begin{cases} 4.75 & \text{if } \theta = \theta_1 \\ 4.20 & \text{if } \theta = \theta_2 \end{cases}$$

The maximum risk is now 4.75, which is lower than the 5.4 achieved by $\delta_4$, the best of the nine nonrandomized procedures.

Practical Implication

When computing minimax procedures, we should consider randomized rules (mixed strategies), not just deterministic ones. The minimax theorem guarantees that under suitable conditions, there exists a (possibly randomized) minimax procedure.
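
The coin-flip calculation can be reproduced and extended to other mixing weights. One caveat: the individual risks of $\delta_4$ and $\delta_6$ are not listed in the text. The values used below are backed out from the table (Bayes risk and maximum risk under the 0.2/0.8 prior) together with the coin-flip risks 4.75 and 4.20, so treat them as an inference rather than quoted data. The grid search is likewise only a sketch, and it considers mixtures of these two procedures only.

```python
# Risks (R(theta_1, delta), R(theta_2, delta)); inferred from the table, see the note above.
risk_d4 = (3.0, 5.4)
risk_d6 = (6.5, 3.0)

def mixed_risk(p):
    """Risk vector when delta_4 is used with probability p and delta_6 otherwise."""
    return tuple(p * r4 + (1 - p) * r6 for r4, r6 in zip(risk_d4, risk_d6))

def max_risk(p):
    """Worst-case risk of the mixture over the two states."""
    return max(mixed_risk(p))

fair_coin = mixed_risk(0.5)
print(fair_coin)   # approximately (4.75, 4.2)

# Scan mixing weights on a grid to see whether another weight does even better.
best_p = min((k / 1000 for k in range(1001)), key=max_risk)
print(best_p, round(max_risk(best_p), 3))
```

Under these inferred risks, the worst case is smallest where the two state-wise risks are equal, at a weight of roughly 0.59 on $\delta_4$, which brings the maximum risk a little below the fair coin's 4.75.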


Interactive: Bayes Optimal Decision

The Bayes estimator under squared error loss is the posterior mean. Watch how it balances prior belief and observed data.

Assumptions: Normal-Normal conjugate | Prior: $\theta \sim N(\mu_0, \tau^2)$ | Likelihood: $X \mid \theta \sim N(\theta, \sigma^2 = 1)$ | Single observation

[Interactive figure. Curves: Prior $\mu_0$, Data $X$, Posterior (Bayes). Sample readout: Bayes estimate 1.500, posterior variance 0.500.]

Try: Increase the prior variance → the Bayes estimate moves toward the data. Decrease the prior variance → the prior belief dominates.
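
For the Normal-Normal setup, the posterior mean and variance have a closed form, which makes the behavior described above easy to verify. This is a small sketch: the function name is invented, and the inputs ($x = 3$, $\mu_0 = 0$, $\tau^2 = 1$) are hypothetical values chosen only because they reproduce the readout of 1.500 and 0.500; the page does not state which inputs produced those numbers.

```python
def normal_normal_posterior(x, mu0, tau2, sigma2=1.0):
    """Posterior mean and variance of theta for one observation x ~ N(theta, sigma2)
    under the prior theta ~ N(mu0, tau2)."""
    weight_on_data = tau2 / (tau2 + sigma2)
    post_mean = weight_on_data * x + (1 - weight_on_data) * mu0
    post_var = tau2 * sigma2 / (tau2 + sigma2)
    return post_mean, post_var

# Hypothetical inputs chosen to match the readout (1.500, 0.500).
print(normal_normal_posterior(x=3.0, mu0=0.0, tau2=1.0))     # (1.5, 0.5)

# A diffuse prior lets the data dominate; a tight prior pulls the estimate to mu0.
print(normal_normal_posterior(x=3.0, mu0=0.0, tau2=100.0))
print(normal_normal_posterior(x=3.0, mu0=0.0, tau2=0.01))
```

The posterior mean is a weighted average of the observation and the prior mean, with the weight on the data equal to $\tau^2 / (\tau^2 + \sigma^2)$.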
The Fundamental Problem: No Uniformly Best Rule

We say procedure $\delta$ improves $\delta'$ if:

$$R(\theta, \delta) \leq R(\theta, \delta') \quad \text{for all } \theta, \text{ with strict inequality for some } \theta$$

Key insight: There is typically no single rule that improves all others!

Example: Estimating $\theta \in \mathbb{R}$ when $X \sim N(\theta, \sigma_0^2)$, under squared error loss

  • Consider the "absurd rule" $\delta^*(X) = 0$ (ignore data entirely)
  • Its risk is $R(\theta, \delta^*) = \text{MSE}_\theta(\delta^*) = \theta^2$
  • At $\theta = 0$ this risk is exactly 0. Any rule $\delta$ that is never worse must also have zero risk there, and $E_0[\delta^2(X)] = 0$ only if $\delta(X) = 0$ (with probability 1). So no rule improves on $\delta^*$.

Even terrible rules can be unbeatable at some $\theta$ values!

Admissibility: Ruling Out Bad Procedures

Inadmissible

A rule $\delta$ is inadmissible if there exists another rule $\delta'$ that improves it.

Why use $\delta$ when $\delta'$ is never worse and sometimes better?

Admissible

A rule $\delta$ is admissible if no rule improves it (i.e., it's not inadmissible).

Admissible rules are "Pareto optimal" in risk space.

Practical Implication

We should restrict attention to admissible procedures: inadmissible ones are dominated and should never be used. But admissibility alone is not enough (the absurd rule above is admissible), so among admissible procedures we still need Bayes or minimax criteria to choose!


Frequentist vs Bayesian Approach

Understanding the philosophical and practical differences between frequentist and Bayesian approaches is fundamental to mastering statistical inference. Let's explore these two paradigms side by side.

1

The Core Intuition (One-Line Difference)

โœ… Frequentist:

"The parameter is a fixed but unknown truth. Only the data is random."

โœ… Bayesian:

"The parameter itself is uncertain. I represent my uncertainty with a probability distribution."

That single difference changes everything.

2

How Each One Sees "Truth"

🎯 Frequentist View of Truth
  • There is one true value of the parameter.
  • Example:
    The true mean height of adult men in the US is one fixed number.
  • You just don't know it.
  • Your estimator is judged by:
    • What happens over imaginary repeated experiments

You never say:

"The probability that μ = 172.3 is 0.7" ❌
Because μ is not random to a frequentist.
🧠 Bayesian View of Truth
  • The parameter is unknown AND treated as random.
  • You express your uncertainty as a distribution.
  • Example:
    Before data: "I believe μ is around 170–175 with high probability."
    After data: "Now I believe μ is tightly around 173."

You do say:

"There is a 95% probability that μ lies between 172 and 174." ✅

That sentence is illegal in frequentist statistics, but natural in Bayesian inference.

3

How Each One Treats Uncertainty

Question | Frequentist | Bayesian
--- | --- | ---
What is random? | The data | The data + parameters
What is fixed? | The parameter $\theta$ | The observed data, once you condition on it
What does probability mean? | Long-run frequency | Degree of belief
What is confidence? | Coverage over repeated samples | Direct probability statement about $\theta$, given the data

Critical Distinction: Coverage ≠ Probability of θ

Confidence Intervals (Frequentist):

Guarantee long-run coverage: "95% of intervals constructed this way will contain $\theta$." The interval is random; $\theta$ is fixed.

Credible Intervals (Bayesian):

Give probability of $\theta$ conditional on data: "$P(\theta \in [a,b] \mid X) = 0.95$." The interval is fixed (given data); $\theta$ is random.

⚠️ Reading a CI as "the probability that θ is in this interval" is a common but incorrect interpretation. Only Bayesian credible intervals allow this reading, at the cost of requiring a prior.

4

How Decisions Are Made

✅ Frequentist Decision Logic
  1. Assume $\theta$ is fixed.
  2. Assume repeated sampling.
  3. Choose procedure with:
    • Low MSE
    • Correct confidence interval coverage
    • Controlled Type-I error

Key idea:

"If I repeated this experiment 1 million times, my method would behave correctly."
✅ Bayesian Decision Logic
  1. Start with a prior belief about $\theta$.
  2. Collect data.
  3. Update belief using Bayes' theorem.
  4. Choose action that minimizes posterior expected loss.

Key idea:

"Given what I know right now, what is the best decision?"
Decision-Theoretic Summary: $\delta^* = \arg\min_\delta \text{Risk}$

Paradigm | Risk Definition | What It Minimizes | Typical Criteria
--- | --- | --- | ---
Frequentist | $R(\theta, \delta) = E_\theta[L(\theta, \delta(X))]$ | Pointwise risk for each $\theta$ | MSE, Type-I/II error, CI coverage
Bayes | $r(\pi, \delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta$ | Average risk under prior π | Posterior expected loss
Minimax | $\max_\theta R(\theta, \delta)$ | Worst-case risk over all θ | Robust to adversarial θ

All three paradigms are valid decision-theoretic frameworks; they just optimize different objectives.
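To make the three criteria concrete, the sketch below evaluates them exactly for two estimators of a binomial proportion under squared-error loss: the MLE $X/n$ and the shrinkage rule $(X + \sqrt{n}/2)/(n + \sqrt{n})$, the classical minimax estimator for this problem. The sample size, the grid over θ, the uniform prior used for the average risk, and all function names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def risk_curve(estimator, n, thetas):
    """Exact squared-error risk R(theta, delta) for X ~ Binomial(n, theta)."""
    x = np.arange(n + 1)
    est = estimator(x, n)                                   # delta(x) for every outcome
    pmf = stats.binom.pmf(x[None, :], n, thetas[:, None])   # shape (len(thetas), n + 1)
    return (pmf * (est[None, :] - thetas[:, None]) ** 2).sum(axis=1)

def mle(x, n):
    return x / n

def minimax_rule(x, n):
    return (x + np.sqrt(n) / 2) / (n + np.sqrt(n))

n = 25
thetas = np.linspace(0.0, 1.0, 201)
summary = {}
for name, rule in [("MLE", mle), ("minimax", minimax_rule)]:
    r = risk_curve(rule, n, thetas)
    average = r.mean()   # grid approximation of the average risk under a Uniform(0, 1) prior
    worst = r.max()      # worst-case risk over the grid
    summary[name] = (average, worst)
    print(f"{name:8s} average risk = {average:.5f}, worst-case risk = {worst:.5f}")
```

Under these assumptions the MLE has the smaller average risk while the shrinkage rule has the smaller worst-case risk, so which one is "best" depends on the criterion.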

5

Real-World Examples

🏭 Example 1: Factory Defect Rate (Worked with Numbers)

You want to estimate defect probability $\theta$. Data: n = 100 samples, x = 3 defects observed.

✅ Frequentist Analysis
  • Point estimate: $\hat{\theta} = 3/100 = 0.03$
  • 95% CI (Wald):

    $\hat{\theta} \pm 1.96\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}} = 0.03 \pm 0.033$

  • Result: (0.000, 0.063), with the negative lower endpoint truncated at 0; or use the exact Clopper-Pearson interval: (0.006, 0.085)

Legal interpretation: "If I repeated this sampling procedure infinitely, 95% of such intervals would contain the true $\theta$."

❌ Cannot say: "There's a 95% probability $\theta$ is in this interval."

✅ Bayesian Analysis
  • Prior: $\theta \sim \text{Beta}(2, 50)$ (prior mean ≈ 0.038, encodes "around 4%")
  • Likelihood: $X \mid \theta \sim \text{Binomial}(100, \theta)$
  • Posterior:

    $\theta \mid X \sim \text{Beta}(2+3, 50+97) = \text{Beta}(5, 147)$

  • 95% Credible Interval: (0.011, 0.063)

Legal interpretation: "Given the data and my prior, there is a 95% probability that $\theta$ lies in (0.011, 0.063)."

✅ Can make direct probability statements about $\theta$.

📊 Interval Comparison

Interval Type | 95% Interval | What It Means
--- | --- | ---
Frequentist CI (Clopper-Pearson) | (0.006, 0.085) | 95% of such intervals cover true θ in repeated sampling
Bayesian Credible (Beta(2,50) prior) | (0.011, 0.063) | $P(\theta \in \text{interval} \mid \text{data}) = 0.95$

Note: The intervals differ because the Bayesian incorporates prior information (pulling toward ~4%), while the frequentist uses only the data.
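The intervals in this example can be recomputed with a few lines of SciPy. This is a sketch under one assumption worth flagging: the credible interval below is the equal-tailed one, and the text does not say which kind it reports, so the printed endpoints may differ slightly from the table.

```python
import numpy as np
from scipy import stats

n, x, level = 100, 3, 0.95
alpha = 1 - level
theta_hat = x / n

# Wald interval (normal approximation); it can produce a negative lower endpoint.
z = stats.norm.ppf(1 - alpha / 2)
half = z * np.sqrt(theta_hat * (1 - theta_hat) / n)
wald = (theta_hat - half, theta_hat + half)

# Clopper-Pearson (exact) interval via Beta quantiles.
cp = (stats.beta.ppf(alpha / 2, x, n - x + 1),
      stats.beta.ppf(1 - alpha / 2, x + 1, n - x))

# Equal-tailed credible interval from the Beta(2 + x, 50 + n - x) posterior.
a_post, b_post = 2 + x, 50 + (n - x)
cred = stats.beta.ppf([alpha / 2, 1 - alpha / 2], a_post, b_post)

print("Wald:            ", wald)
print("Clopper-Pearson: ", cp)
print("Credible:        ", tuple(cred))
```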

🎛️ Prior Sensitivity: How Priors Shift Posteriors

Same data (n=100, x=3), different priors → different posteriors:

Prior | Prior Belief | Posterior | 95% Credible Interval
--- | --- | --- | ---
$\text{Beta}(1, 1)$ | Flat/uninformative | $\text{Beta}(4, 98)$ | (0.011, 0.074)
$\text{Beta}(2, 50)$ | Weakly informative (~4%) | $\text{Beta}(5, 147)$ | (0.011, 0.063)
$\text{Beta}(10, 200)$ | Strong prior (~5%) | $\text{Beta}(13, 297)$ | (0.024, 0.068)

⚠️ Key insight: Strong priors dominate small samples. With n=100, the strong prior pulls the interval away from the MLE (0.03). This is a feature when prior knowledge is reliable, but a bug when the prior is misspecified.
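A prior sensitivity check of this kind is easy to script. The sketch below repeats the conjugate Beta-Binomial update for the three priors in the table; the dictionary labels are made up for readability, and the intervals are equal-tailed (an assumption, since the text does not specify the interval type).

```python
from scipy import stats

n, x = 100, 3
priors = {"flat Beta(1, 1)": (1, 1),
          "weak Beta(2, 50)": (2, 50),
          "strong Beta(10, 200)": (10, 200)}

results = {}
for label, (a, b) in priors.items():
    a_post, b_post = a + x, b + (n - x)          # conjugate Beta-Binomial update
    post = stats.beta(a_post, b_post)
    lo, hi = post.interval(0.95)                 # equal-tailed 95% interval
    results[label] = (a_post, b_post, post.mean(), (lo, hi))
    print(f"{label:22s} -> Beta({a_post}, {b_post}), mean={post.mean():.4f}, "
          f"95% interval=({lo:.3f}, {hi:.3f})")
```

Comparing the posterior means against the MLE of 0.03 shows how far each prior moves the answer.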

💉 Example 2: Medical Drug Trial (Life-or-Death Decisions)
✅ Frequentist Doctor
  • Hypothesis test:
    • $H_0$: No effect
    • $H_1$: Drug helps
  • If p < 0.05 → approve

This controls:

"How often would I wrongly approve useless drugs if I repeated trials forever?"
✅ Bayesian Doctor
  • Already knows:
    • Similar drugs
    • Biological constraints
  • Uses prior.
  • After trial:
    "There is an 87% probability this drug reduces mortality by at least 10%."

This answers:

"What should I do today, given all information?"

✅ That's decision-theoretic optimality.

🤖 Example 3: Machine Learning Model
✅ Frequentist Training
  • Fit model.
  • Report:
    • Test accuracy
    • Confidence intervals via bootstrapping
    • Hypothesis tests on coefficients

Used in:

  • Classical statistics
  • Regulatory environments
  • Scientific publishing
✅ Bayesian Training
  • Model weights have distributions.
  • Predictions have credible intervals.
  • Uncertainty-aware outputs:
    "There is a 92% probability that this patient has disease."

Used in:

  • Medical AI
  • Robotics
  • Reinforcement learning
  • Active learning
  • Safety-critical AI
6

When Should You Use Which? (Practical Rulebook)

✅ Use Frequentist When:

✓ You want:

  • Hypothesis testing
  • p-values
  • Long-run guarantees
  • Regulatory approval
  • Scientific reproducibility

✓ You believe:

  • "Truth is fixed"
  • "I don't want to specify a prior"
  • "Only data should speak"

📌 Examples:

  • FDA drug trials
  • Manufacturing quality control
  • Academic hypothesis testing
✅ Use Bayesian When:

✓ You want:

  • Direct probability statements about parameters
  • Optimal decisions under uncertainty
  • Uncertainty-aware AI
  • Small-data problems
  • Sequential learning

✓ You believe:

  • "Prior knowledge matters"
  • "Uncertainty itself should be modeled"

📌 Examples:

  • Medical diagnosis
  • Autonomous systems
  • Financial risk modeling
  • Reinforcement learning
  • LLM uncertainty estimation
7

The Deep Unification Insight (Advanced)

Here is a point that is easy to miss:

✅ Frequentist methods optimize worst-case or pointwise risk.
✅ Bayesian methods optimize average risk under a prior.

They are both solving the same decision-theoretic problem, just with different ways of handling uncertainty.

In fact:

  • Ridge regression = Bayesian MAP under Gaussian prior
  • LASSO = Bayesian MAP under Laplace prior
  • Dropout = can be interpreted as approximate Bayesian inference
  • Ensemble methods = frequentist uncertainty approximation
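The first of these equivalences can be checked numerically. The sketch below assumes a linear model with Gaussian noise of variance `sigma2` and an independent zero-mean Gaussian prior of variance `tau2` on each weight; the simulated data and the specific values are illustrative assumptions. With penalty $\lambda = \sigma^2/\tau^2$, the ridge solution and the MAP estimate should coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 0.25, 4.0                      # noise variance, prior variance (assumed values)
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Ridge: argmin ||y - Xw||^2 + lam * ||w||^2, with lam = sigma2 / tau2.
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP under w ~ N(0, tau2 * I) and y | w ~ N(Xw, sigma2 * I):
# the posterior is Gaussian, so its mode equals its mean.
precision = X.T @ X / sigma2 + np.eye(d) / tau2
w_map = np.linalg.solve(precision, X.T @ y / sigma2)

print("ridge:", w_ridge)
print("MAP:  ", w_map)
print("match:", np.allclose(w_ridge, w_map))
```

Multiplying the MAP normal equations through by `sigma2` gives exactly the ridge normal equations, which is the algebra behind the match.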

Final Summary

Frequentists trust repeated experiments.
Bayesians trust probability as a language of belief.
Bayesian reasoning is especially attractive when a decision must be made now, from the data at hand, under uncertainty.

8

Honest Trade-Offs

⚠️ Frequentist Challenges
  • Small samples: Asymptotic guarantees may not hold; coverage can be poor.
  • No prior info: Cannot easily incorporate domain knowledge.
  • No direct probability on θ: CIs answer "what would happen in repeated samples?" not "where is θ?"
  • Multiple comparisons: Requires careful correction (Bonferroni, FDR).
⚠️ Bayesian Challenges
  • Prior sensitivity: Results depend on prior choice, especially with small n.
  • Computation: Exact posteriors often intractable; requires MCMC/VI.
  • Subjectivity criticism: Two analysts with different priors get different answers.
  • Improper priors: Must verify posterior is proper (integrates to 1).
9

When Priors Are Hard to Specify

Not sure what prior to use? Several approaches can help:

📊 Objective/Reference Priors

Jeffreys prior, reference priors: designed to be "non-informative" or "minimally informative." Let the data dominate.

🔄 Empirical Bayes

Estimate hyperparameters from the data itself. Common in hierarchical models. "Let the data inform the prior."

⚠️ Improper Priors

Some "priors" don't integrate to 1 (e.g., $\pi(\theta) \propto 1$). Must check posterior propriety!

Prior Sensitivity Analysis

Always run your analysis with multiple priors (informative, weakly informative, diffuse). If conclusions change dramatically, your inference is prior-dependent: get more data or be explicit about prior assumptions.

10

Computation in Practice

Frequentist Computation
  • Exact methods: t-tests, F-tests, exact binomial CIs
  • Asymptotics: z-tests, Wald intervals, likelihood ratio tests
  • Resampling: Bootstrap for CIs, permutation tests

Usually fast; well-supported in standard software.

Bayesian Computation
  • Conjugate priors: Closed-form posteriors (Beta-Binomial, Normal-Normal)
  • MCMC: Stan, PyMC, JAGS (sample from the posterior)
  • Variational Inference: Fast approximations (mean-field, ADVI)
  • Deep ensembles: Approximate uncertainty in neural networks

Can be slow; requires convergence diagnostics.

11

Calibration: Checking Your Methods

Frequentist Calibration

Simulation studies: Generate data from known θ, check if your 95% CI actually covers θ in ~95% of simulations. Test under misspecification to assess robustness.
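A coverage study of this kind might look like the sketch below, which reuses the defect-rate setting (θ = 0.03, n = 100) and compares the Wald interval with the Clopper-Pearson interval. The helper names, the number of simulations, and the seed are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

def wald_interval(x, n, level=0.95):
    """Normal-approximation interval for a binomial proportion."""
    z = stats.norm.ppf(0.5 + level / 2)
    p = x / n
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

def clopper_pearson_interval(x, n, level=0.95):
    """Exact interval via Beta quantiles; endpoints fixed at 0 or 1 in the edge cases."""
    a = 1 - level
    lo = np.where(x > 0, stats.beta.ppf(a / 2, x, n - x + 1), 0.0)
    hi = np.where(x < n, stats.beta.ppf(1 - a / 2, x + 1, n - x), 1.0)
    return lo, hi

def empirical_coverage(interval_fn, theta, n, n_sims=20000, seed=0):
    """Fraction of simulated intervals that contain the true theta."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(n, theta, size=n_sims)
    lo, hi = interval_fn(x, n)
    return np.mean((lo <= theta) & (theta <= hi))

wald_cov = empirical_coverage(wald_interval, theta=0.03, n=100)
cp_cov = empirical_coverage(clopper_pearson_interval, theta=0.03, n=100)
print(f"Wald coverage:            {wald_cov:.3f}")
print(f"Clopper-Pearson coverage: {cp_cov:.3f}")
```

For a rare event like this one, the Wald interval tends to fall well short of its nominal 95% while the exact interval is conservative, which is the kind of gap a calibration study is meant to expose.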

Bayesian Calibration

Posterior predictive checks: Simulate data from the posterior and compare to observed data. If the posterior can't reproduce key features of your data, the model (or prior) is misspecified.

Both paradigms need validation

Neither frequentist nor Bayesian methods are "automatic." Both require checking assumptions: model correctness, prior reasonableness, convergence (for MCMC), and coverage/calibration.


Paradigm Comparison: When to Use What

Paradigm | Optimizes | Pros | Cons | Use When...
--- | --- | --- | --- | ---
Frequentist | $R(\theta, \delta)$ for each $\theta$ | No prior needed; objective; well-understood theory | No single "best" $\delta$ if $R$ varies with $\theta$; can't combine info | Regulatory settings; need $\theta$-specific guarantees
Bayes | $r(\pi, \delta) = \mathbb{E}_\pi[R(\theta,\delta)]$ | Coherent decisions; incorporates prior; single optimal $\delta$ | Requires prior; sensitive to prior choice; computation | Have prior info; want probabilistic statements
Minimax | $\max_\theta R(\theta, \delta)$ | Robust to worst case; no prior needed | Conservative; may be too pessimistic; hard to compute | Adversarial settings; safety-critical applications

Practical Guidance

In practice, most ML uses frequentist evaluation (test set metrics) but Bayes-like reasoning (regularization = prior, ensembles = posterior averaging). Minimax appears in robust optimization and adversarial training.


Prediction

Prediction is not about learning a number; it is about learning how randomness will unfold in the future.

In many real-world problems, we observe a vector of covariates $\mathbf{Z} \in \mathbb{R}^d$ and wish to predict an unseen response $Y \in \mathbb{R}$. This prediction task arises throughout science and engineering:

Education: Predict first-year GPA from entrance exam scores
Finance: Predict portfolio value from market history
Meteorology: Predict rainfall from weather patterns
Energy: Predict demand from temperature forecasts

We assume the joint distribution of $(\mathbf{Z}, Y)$ is known (or estimated from data). Our goal is to find a predictor $g(\mathbf{Z})$ that is as close as possible to the true future outcome $Y$.

◆ Prediction as a Decision Problem

Prediction fits exactly into the decision-theoretic framework:

  • State of nature: The joint distribution of $(\mathbf{Z}, Y)$
  • Action: A function $g$ that maps $\mathbf{Z} \mapsto g(\mathbf{Z})$
  • Loss function: A penalty measuring prediction error
  • Risk: Expected prediction error
◆ Mean Squared Prediction Error (MSPE)

A natural measure of prediction quality is the squared error: $(g(\mathbf{Z}) - Y)^2$. Since the future outcome $Y$ is random, we measure performance using the mean squared prediction error:

$$\Delta^2(Y, g(\mathbf{Z})) = \mathbb{E}\left[(g(\mathbf{Z}) - Y)^2\right]$$

This is the prediction analogue of MSE in estimation.

◆ Fundamental Optimality Result (Key Theorem)

Among all possible predictors, the function that minimizes MSPE is:

$$g^*(\mathbf{Z}) = \mathbb{E}[Y \mid \mathbf{Z}]$$

Interpretation: The optimal predictor under squared error loss is the conditional mean of Y given Z.
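The reasoning behind this result fits in a short calculation. For any predictor $g$, add and subtract $\mathbb{E}[Y \mid \mathbf{Z}]$ inside the square and apply the tower property. This is the standard argument, sketched here under the assumption that $\mathbb{E}[Y^2] < \infty$:

```latex
\begin{aligned}
\mathbb{E}\left[(Y - g(\mathbf{Z}))^2\right]
&= \mathbb{E}\left[(Y - \mathbb{E}[Y \mid \mathbf{Z}])^2\right]
 + \mathbb{E}\left[(\mathbb{E}[Y \mid \mathbf{Z}] - g(\mathbf{Z}))^2\right] \\
&\quad + 2\,\mathbb{E}\Big[(\mathbb{E}[Y \mid \mathbf{Z}] - g(\mathbf{Z}))\,
   \underbrace{\mathbb{E}\left[\,Y - \mathbb{E}[Y \mid \mathbf{Z}] \,\middle|\, \mathbf{Z}\right]}_{=\,0}\Big]
\end{aligned}
```

The first term does not depend on $g$, and the second is nonnegative and equals zero exactly when $g(\mathbf{Z}) = \mathbb{E}[Y \mid \mathbf{Z}]$ almost surely, so the conditional mean minimizes the MSPE.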

This theorem is the mathematical foundation of:

  • Linear regression
  • Neural network regression
  • Gaussian processes
  • Deep learning with squared loss
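◆ Connection to Bayesian Decision Theory

Under squared loss, prediction has the same structure as a Bayes decision: conditioning on $\mathbf{Z}$ plays the role of conditioning on the data, and the optimal predictor minimizes the conditional expected loss pointwise:

$$g^*(\mathbf{Z}) = \arg\min_g \mathbb{E}\left[(Y - g(\mathbf{Z}))^2 \mid \mathbf{Z}\right]$$

Thus:

  • Prediction = posterior Bayes decision
  • Conditional MSPE given $\mathbf{Z}$ = posterior expected loss

This is why a model trained with MSE loss can be read as approximating the conditional mean $\mathbb{E}[Y \mid \mathbf{Z}]$: it targets the same minimizer, even though no prior over parameters is involved.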

◆ Classes of Predictors (Hypothesis Spaces)

We may search over:

✓ Nonparametric class:

$$\mathcal{G}_{NP} = \{\text{all measurable functions } g(\mathbf{Z})\}$$

✓ Linear class:

$$\mathcal{G}_L = \left\{ g(\mathbf{Z}) = a + \sum_{j=1}^{d} b_j Z_j \right\}$$

Restricting to $\mathcal{G}_L$ leads to linear regression.

Restricting to neural networks leads to deep learning.

Machine learning with squared loss = empirical MSPE minimization over a restricted predictor class.
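The cost of restricting the predictor class can be seen in a small simulation. The sketch below assumes a made-up data-generating process with $\mathbb{E}[Y \mid Z] = Z^2$ and noise variance 0.25, fits the best linear predictor by least squares, and compares its in-sample MSPE with that of the true conditional mean (which is available only because the data are simulated).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.uniform(-2, 2, size=n)
y = z ** 2 + rng.normal(scale=0.5, size=n)     # E[Y | Z] = Z^2, noise variance 0.25

# Empirical MSPE minimizer over the linear class g(z) = a + b z (least squares).
A = np.column_stack([np.ones(n), z])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
mspe_linear = np.mean((A @ coef - y) ** 2)

# The conditional mean, known here only because we chose the simulation ourselves.
mspe_cond_mean = np.mean((z ** 2 - y) ** 2)

print(f"linear class MSPE:     {mspe_linear:.3f}")
print(f"conditional mean MSPE: {mspe_cond_mean:.3f}  (noise variance is 0.25)")
```

The gap between the two numbers is approximation error from the restricted class; the conditional mean's MSPE sits near the noise variance, which no predictor can beat.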


◆ Prediction vs Estimation (The Deep Conceptual Difference)

Aspect | Estimation | Prediction
--- | --- | ---
Goal | Learn a fixed unknown parameter $\theta$ | Predict a future random outcome $X_{\text{new}}$
Randomness | Only the data is random | The future outcome is random even if $\theta$ is known
Target | A constant | A random variable
Error vanishes as $n \to \infty$? | Yes (for good estimators) | No: irreducible noise remains

Estimation uncertainty can vanish. Prediction uncertainty never vanishes.

This irreducible error is often called the Bayes error (or irreducible error) in machine learning.

Estimation

Goal: Learn a fixed unknown quantity θ

"What is the true population mean?"

θ is fixed; only our uncertainty about it changes with data.

🔮 Prediction

Goal: Predict a future random observation $X_{new}$

"What will the next customer spend?"

$X_{new}$ is random even if we knew $\theta$ perfectly!

Worked Example: Estimation vs Prediction

See the key difference: estimating the population mean vs predicting a new observation.

Assumptions: $X_i \sim N(\mu, \sigma^2)$ i.i.d. | $\sigma^2$ unknown, estimated by $s^2$ | Plug-in intervals using the normal critical value 1.96 (not Bayesian)
Data: 4.2, 5.1, 4.8, 5.5, 4.9

Estimation: "What is the true population mean μ?"
  • $\hat{\mu}$ = 4.900
  • 95% CI: [4.484, 5.316]
  • Width: 0.832

Prediction: "Where will the next observation fall?"
  • $\tilde{X}_{new}$ = 4.900
  • 95% PI: [3.882, 5.918]
  • Width: 2.037
💡 Key Insight

The prediction interval is always wider than the confidence interval! Why? Prediction uncertainty = estimation uncertainty + inherent variability of the new observation:

  1. Estimation uncertainty: We don't know $\mu$ exactly
  2. Inherent randomness: Even if we knew $\mu$, $X_{new}$ is random

Because $X_{new}$ is independent of the sample, the variance of the prediction error is

$$\text{Var}(X_{new} - \bar{X}) = \underbrace{\text{Var}(\bar{X})}_{\text{estimation}} + \underbrace{\sigma^2}_{\text{inherent}} = \frac{\sigma^2}{n} + \sigma^2 = \sigma^2\left(1 + \frac{1}{n}\right)$$

Common Confusion

A confidence interval for $\mu$ and a prediction interval for $X_{new}$ look similar but mean different things:

  • 95% CI: "We're 95% confident the TRUE MEAN lies in this interval"
  • 95% PI: "We're 95% confident the NEXT OBSERVATION lies in this interval"

The prediction interval is always wider because it accounts for individual variability.
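The two interval types differ only in the standard error they use, which a short computation makes visible. The sketch below uses the five data points from the example; the helper `intervals` is a hypothetical name. The z-based version plugs $s$ into the normal critical value, while the t-based version is the exact small-sample interval under normality.

```python
import numpy as np
from scipy import stats

data = np.array([4.2, 5.1, 4.8, 5.5, 4.9])
n, mean, s = len(data), data.mean(), data.std(ddof=1)

def intervals(crit):
    """Return (CI, PI) for a given critical value."""
    ci_half = crit * s / np.sqrt(n)            # estimation uncertainty only
    pi_half = crit * s * np.sqrt(1 + 1 / n)    # estimation + inherent variability
    return (mean - ci_half, mean + ci_half), (mean - pi_half, mean + pi_half)

z_ci, z_pi = intervals(stats.norm.ppf(0.975))          # plug-in normal critical value
t_ci, t_pi = intervals(stats.t.ppf(0.975, df=n - 1))   # exact under normality, wider

print("z-based CI:", np.round(z_ci, 3), " PI:", np.round(z_pi, 3))
print("t-based CI:", np.round(t_ci, 3), " PI:", np.round(t_pi, 3))
```

With only five observations the t-based intervals are noticeably wider than the z-based ones, so the plug-in numbers should be read as optimistic.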

Predictive Distributions

In many applications, our goal is not merely to produce a single numerical prediction, but to characterize the full uncertainty of a future outcome. The object that encodes this uncertainty is the predictive distribution.

While a point predictor answers:

"What value do I guess will occur?"

the predictive distribution answers the more fundamental question:

✓ "What range of outcomes could occur, and with what probabilities?"

◆ Frequentist Predictive Distribution (Plug-In)

In the frequentist framework, the model is:

$$f(x \mid \theta)$$

and the unknown parameter $\theta$ is first estimated by $\hat{\theta}$. The plug-in predictive distribution is then:

$$\hat{f}(x_{\text{new}}) = f(x_{\text{new}} \mid \hat{\theta})$$

Interpretation: The plug-in predictive treats the estimated parameter as if it were the true parameter and ignores uncertainty in $\hat{\theta}$.

Thus, it accounts only for observation noise, but not parameter uncertainty. This makes plug-in predictions:

  • Sharp
  • Optimistic
  • Potentially overconfident in small samples
◆ Bayesian Posterior Predictive Distribution

In the Bayesian framework, $\theta$ is a random variable with posterior distribution $\pi(\theta \mid X_1, \ldots, X_n)$. The posterior predictive distribution is:

$$f(x_{\text{new}} \mid X_1, \ldots, X_n) = \int f(x_{\text{new}} \mid \theta) \, \pi(\theta \mid X_1, \ldots, X_n) \, d\theta$$

Interpretation: Bayesian prediction averages over all plausible parameter values, weighted by how strongly the data support each value.

This properly accounts for:

  • ✓ Observation noise
  • ✓ Parameter uncertainty
  • ✓ Prior uncertainty (when data are limited)

As a result, Bayesian predictive distributions are typically wider and, when the model and prior are reasonable, better calibrated.

◆ Why Bayesian Predictive Distributions Are More Honest

The plug-in approach implicitly assumes:

$$\hat{\theta} \approx \theta \quad \text{with certainty}$$

The Bayesian approach explicitly acknowledges:

$$\theta \text{ is still uncertain after observing data}$$

Hence:

$$\text{Bayesian predictive variance} = \text{noise variance} + \text{parameter uncertainty variance}$$

This is why Bayesian predictions are safer for:

• Risk management
• Medicine
• Safety-critical systems
• Financial decision-making
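The variance decomposition above can be made concrete in the conjugate Normal-Normal model. The sketch below assumes a known data variance and a Gaussian prior on the mean; the default values and the function name are illustrative assumptions.

```python
def predictive_variances(n, data_var=1.0, prior_var=10.0):
    """Plug-in vs posterior predictive variance for a Normal mean with known data_var."""
    posterior_var = 1.0 / (1.0 / prior_var + n / data_var)   # conjugate Normal-Normal update
    plug_in = data_var                                       # ignores uncertainty in the mean
    posterior_predictive = data_var + posterior_var          # noise + parameter uncertainty
    return plug_in, posterior_predictive

for n in [1, 5, 25, 100]:
    plug, post = predictive_variances(n)
    print(f"n={n:3d}: plug-in var = {plug:.3f}, posterior predictive var = {post:.3f}")
```

The extra term shrinks as n grows, so the two approaches agree in large samples and differ most when data are scarce.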
โ—†Predictive Distributions as Decision-Theoretic Objects

The predictive distribution is not just a probabilistic summary; it is the complete input required to make optimal future decisions under uncertainty. Any rational decision rule for actions involving the future (insurance pricing, thresholding, alarm systems, portfolio allocation) must be based on a predictive distribution, not a point estimate.


Common Pitfalls

⚠️ Don't Make These Mistakes!

1. Confusing CI with PI

A 95% confidence interval for μ tells you where the parameter likely is. A 95% prediction interval tells you where the next observation is likely to fall. PIs are always wider!

2. Using Squared Error Loss with Outliers

Squared error heavily penalizes large errors. If your data has outliers or heavy tails, absolute error or Huber loss may be more appropriate. Match your loss to your problem!

3. Plug-in Prediction Underestimates Uncertainty

Using $f(x_{new} \mid \hat{\theta})$ treats your estimate as if it were the true $\theta$. This ignores estimation uncertainty and makes your prediction intervals too narrow. Use the posterior predictive (or a prediction interval that accounts for estimation error) instead!

4. Mixing Up Prior and Data Variance

In Bayesian estimation, the prior variance $\tau^2$ represents uncertainty about $\theta$ before data. The data variance $\sigma^2$ is the noise in observations. These are different quantities, so don't confuse them!

5. Ignoring Asymmetry in Real Costs

Don't default to squared error when overestimating and underestimating have different costs. Always ask: "What's the real-world consequence of each type of error?"
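To see what asymmetric costs do to the optimal decision, the sketch below uses a piecewise-linear loss that charges `c_under` per unit of underestimate and `c_over` per unit of overestimate. Under such a loss the best action is the $c_{under}/(c_{under} + c_{over})$ quantile of the uncertainty distribution rather than its mean. The normal draws standing in for uncertainty about θ, the grid, and the cost values are all assumptions made for illustration.

```python
import numpy as np

def expected_asymmetric_loss(a, draws, c_under=1.0, c_over=2.0):
    """Average loss of action a: c_under per unit of underestimate, c_over per unit of overestimate."""
    under = np.maximum(draws - a, 0.0)
    over = np.maximum(a - draws, 0.0)
    return np.mean(c_under * under + c_over * over)

rng = np.random.default_rng(0)
draws = rng.normal(loc=10.0, scale=2.0, size=20000)   # stand-in for uncertainty about theta

# Brute-force search for the action with the smallest average loss.
grid = np.linspace(5.0, 15.0, 1001)
losses = np.array([expected_asymmetric_loss(a, draws) for a in grid])
best_action = grid[np.argmin(losses)]

c_under, c_over = 1.0, 2.0
quantile_action = np.quantile(draws, c_under / (c_under + c_over))

print(f"grid-search minimizer:        {best_action:.2f}")
print(f"{c_under / (c_under + c_over):.3f}-quantile of the draws:   {quantile_action:.2f}")
print(f"mean (squared-error optimum): {draws.mean():.2f}")
```

Because overestimating is twice as costly here, the optimal action sits below the mean, which is exactly the shift that squared error would miss.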


Connection to Point Estimation

Now we can see how decision theory provides the foundation for everything in this chapter:

Concept | Decision Theory View | What It Tells Us
--- | --- | ---
Estimator $\hat{\theta}$ | Decision rule $\delta(X)$ | Maps data to estimates
MSE | Risk under squared error loss | $\mathbb{E}[(\theta - \hat{\theta})^2]$
Bias | Systematic error in the decision | $\mathbb{E}[\hat{\theta}] - \theta$
Variance | Variability of the decision | $\text{Var}(\hat{\theta})$
Unbiased Estimator | Decision that's correct on average | $\mathbb{E}[\hat{\theta}] = \theta$
UMVUE | Best unbiased decision | Minimum variance among unbiased
Bayes Estimator | Optimal decision given prior | Minimizes Bayes risk
MLE | Asymptotically optimal decision | Minimizes KL divergence from the empirical distribution to the model

The Big Picture

All the properties we study in point estimation are about finding optimal decisions under different loss functions and different notions of optimality.

  • MSE = risk under squared error loss
  • Unbiasedness = zero systematic error
  • Efficiency = achieving the minimum possible risk within a class (e.g., minimum variance among unbiased estimators)
  • Sufficiency = using all relevant information for the decision

Confidence Bounds as Decision Theory

Decision theory provides a powerful lens for understanding confidence bounds and intervals, an important hybrid of testing and estimation.

📊 Motivating Example: Accounts Receivable Audit

An accounting firm examines accounts receivable for a company based on a random sample. They want an upper bound on the total amount owed $\nu$.

If $X$ represents the amount owed in the sample, they seek $\bar{\nu}(X)$ such that:

$$P[\bar{\nu}(X) \geq \nu] \geq 1 - \alpha$$

This $\bar{\nu}(X)$ is called a $(1-\alpha)$ upper confidence bound on $\nu$.

The Decision-Theoretic Formulation

How does this fit into decision theory? We can view the upper confidence bound as a decision procedure with action space $\mathcal{A} = \mathbb{R}$ and a specific loss function.

❌ Naive Loss (Has Problems)

$$L(P, a) = \begin{cases} 0 & \text{if } a \geq \nu(P) \\ 1 & \text{if } a < \nu(P) \end{cases}$$

Problem: Taking $\bar{\nu} \equiv \infty$ achieves risk = 0! A bound that says "at most infinity" is useless.

✅ Better Loss (Balances Goals)

$$L(P, a) = \begin{cases} a - \nu(P) & \text{if } a \geq \nu(P) \\ c & \text{if } a < \nu(P) \end{cases}$$

Why better: Penalizes overestimation (loose bounds) in proportion to the excess, while a large constant $c > 0$ heavily penalizes undercoverage.

The Key Insight

Though upper bounding is the primary goal, it's also important to get close to the truth. Knowing "at most ∞ dollars" is technically correct but useless. The decision-theoretic framework naturally accommodates both goals by choosing an appropriate loss function.

🎯 The Practical Approach

Rather than folding both goals into a single loss through a penalty constant (a Lagrangian-style formulation), practitioners typically:

  1. Fix the coverage probability: Require $P[\bar{\nu}(X) \geq \nu] \geq 1 - \alpha$ for all $P$ (e.g., α = 0.05)
  2. Then minimize the "excess": Among all procedures satisfying (1), minimize $R(P, \bar{\nu}) = \mathbb{E}[(\bar{\nu}(X) - \nu(P))_+]$

where $x_+ = x \cdot \mathbf{1}(x \geq 0)$ is the positive part.
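The two-step recipe can be illustrated by simulation in a deliberately simple setting: a normal mean with known standard deviation, which is an assumption made for this sketch and not the audit problem itself. Bounds of the form $\bar{X} + m\,\sigma/\sqrt{n}$ are compared on coverage and on expected excess; the function name and parameter values are hypothetical.

```python
import numpy as np
from scipy import stats

def evaluate_upper_bound(multiplier, nu=10.0, sigma=2.0, n=25, n_sims=50000, seed=0):
    """Coverage and expected excess of the bound X_bar + multiplier * sigma / sqrt(n)."""
    rng = np.random.default_rng(seed)
    x_bar = rng.normal(nu, sigma / np.sqrt(n), size=n_sims)   # sampling distribution of the mean
    bound = x_bar + multiplier * sigma / np.sqrt(n)
    coverage = np.mean(bound >= nu)
    excess = np.mean(np.maximum(bound - nu, 0.0))             # E[(bound - nu)_+]
    return coverage, excess

alpha = 0.05
tight = evaluate_upper_bound(stats.norm.ppf(1 - alpha))   # exact 95% bound in this model
loose = evaluate_upper_bound(3.0)                          # over-covers

print(f"tight bound: coverage = {tight[0]:.3f}, expected excess = {tight[1]:.3f}")
print(f"loose bound: coverage = {loose[0]:.3f}, expected excess = {loose[1]:.3f}")
```

Both bounds meet the 95% coverage requirement, but the looser one pays for its extra coverage with a larger expected excess, which is what step 2 is designed to rule out.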

Extension to Confidence Intervals

The same decision-theoretic logic extends to confidence intervals. A confidence interval $[\underline{\nu}(X), \bar{\nu}(X)]$ for $\nu$ satisfies:

$$P[\underline{\nu}(X) \leq \nu(P) \leq \bar{\nu}(X)] \geq 1 - \alpha \quad \text{for all } P \in \mathcal{P}$$

Visualizing the Tradeoffs

  • Too Wide: High coverage, but uninformative. [−∞, +∞] has 100% coverage!
  • Just Right: Correct coverage, minimal width. Optimal decision-theoretic balance.
  • Too Narrow: Precise, but wrong too often. Below nominal coverage.

Goal: Minimize interval width while maintaining ≥ (1-α) coverage

Concept | Decision Theory Formulation | What We Optimize
--- | --- | ---
Upper Bound | Action $a = \bar{\nu}(X)$ | Minimize $\mathbb{E}[\bar{\nu}(X) - \nu]$ s.t. coverage ≥ 1-α
Lower Bound | Action $a = \underline{\nu}(X)$ | Maximize $\mathbb{E}[\underline{\nu}(X)]$ s.t. coverage ≥ 1-α
Two-Sided CI | Action $a = [\underline{\nu}, \bar{\nu}]$ | Minimize $\mathbb{E}[\bar{\nu} - \underline{\nu}]$ s.t. coverage ≥ 1-α

Why This Matters

Understanding confidence intervals as decision procedures explains why we construct them the way we do: we're finding the narrowest intervals that still achieve the required coverage. This is a constrained optimization problem: pure decision theory!


Symbol Glossary

Symbol | Name | Meaning
--- | --- | ---
$\theta$ | Parameter | The unknown true value we want to estimate
$\Theta$ | Parameter Space | Set of all possible $\theta$ values
$a$ | Action | A decision or estimate we choose
$\mathcal{A}$ | Action Space | Set of all possible actions
$L(\theta, a)$ | Loss Function | Cost of choosing action $a$ when truth is $\theta$
$\delta(X)$ | Decision Rule | Function mapping data $X$ to an action
$R(\theta, \delta)$ | Risk Function | Expected loss: $\mathbb{E}_\theta[L(\theta, \delta(X))]$
$r(\pi, \delta)$ | Bayes Risk | Expected risk under prior: $\mathbb{E}_\pi[R(\theta, \delta)]$
$\pi(\theta)$ | Prior Distribution | Belief about $\theta$ before seeing data
$X_{new}$ | Future Observation | A new random value to be predicted

Python Implementation

Here's a complete implementation of key decision theory concepts:

๐Ÿpython
1import numpy as np
2from scipy import stats
3from typing import Callable, Tuple
4
5# =============================================================================
6# LOSS FUNCTIONS
7# =============================================================================
8
9def squared_error_loss(theta: float, estimate: float) -> float:
10    """Squared error loss: L(ฮธ, a) = (ฮธ - a)ยฒ"""
11    return (theta - estimate) ** 2
12
13def absolute_error_loss(theta: float, estimate: float) -> float:
14    """Absolute error loss: L(ฮธ, a) = |ฮธ - a|"""
15    return np.abs(theta - estimate)
16
17def asymmetric_loss(theta: float, estimate: float,
18                    c_under: float = 1.0, c_over: float = 2.0) -> float:
19    """
20    Asymmetric loss: different costs for over/under-estimation
21    L(ฮธ, a) = c_under * max(ฮธ - a, 0) + c_over * max(a - ฮธ, 0)
22    """
23    if estimate < theta:
24        return c_under * (theta - estimate)
25    else:
26        return c_over * (estimate - theta)
27
28def huber_loss(theta: float, estimate: float, delta: float = 1.0) -> float:
29    """Huber loss: quadratic for small errors, linear for large"""
30    error = np.abs(theta - estimate)
31    if error <= delta:
32        return 0.5 * error ** 2
33    else:
34        return delta * error - 0.5 * delta ** 2
35
36# =============================================================================
37# RISK FUNCTIONS
38# =============================================================================
39
40def frequentist_risk(true_theta: float,
41                     estimator: Callable[[np.ndarray], float],
42                     sample_size: int,
43                     loss_fn: Callable[[float, float], float],
44                     n_simulations: int = 10000) -> float:
45    """
46    Compute frequentist risk via simulation.
47
48    R(ฮธ, ฮด) = E_ฮธ[L(ฮธ, ฮด(X))]
49    """
50    losses = []
51    for _ in range(n_simulations):
52        # Generate data from N(theta, 1)
53        data = np.random.normal(true_theta, 1, sample_size)
54        estimate = estimator(data)
55        losses.append(loss_fn(true_theta, estimate))
56    return np.mean(losses)
57
58def bayes_risk(estimator: Callable[[np.ndarray], float],
59               prior_mean: float,
60               prior_std: float,
61               sample_size: int,
62               loss_fn: Callable[[float, float], float],
63               n_simulations: int = 10000) -> float:
64    """
65    Compute Bayes risk via simulation.
66
67    r(ฯ€, ฮด) = E_ฯ€[R(ฮธ, ฮด)]
68    """
69    total_loss = 0
70    for _ in range(n_simulations):
71        # Sample ฮธ from prior
72        theta = np.random.normal(prior_mean, prior_std)
73        # Generate data from N(theta, 1)
74        data = np.random.normal(theta, 1, sample_size)
75        estimate = estimator(data)
76        total_loss += loss_fn(theta, estimate)
77    return total_loss / n_simulations
78
79# =============================================================================
80# ESTIMATORS
81# =============================================================================
82
83def sample_mean(data: np.ndarray) -> float:
84    """Sample mean estimator"""
85    return np.mean(data)
86
87def sample_median(data: np.ndarray) -> float:
88    """Sample median estimator"""
89    return np.median(data)
90
91def bayes_estimator_normal(data: np.ndarray,
92                           prior_mean: float,
93                           prior_var: float,
94                           data_var: float = 1.0) -> float:
95    """
96    Bayes estimator for Normal mean (conjugate prior).
97    Posterior mean minimizes Bayes risk under squared error loss.
98    """
99    n = len(data)
100    posterior_precision = 1/prior_var + n/data_var
101    posterior_mean = (prior_mean/prior_var + np.sum(data)/data_var) / posterior_precision
102    return posterior_mean
103
104def shrinkage_estimator(data: np.ndarray, shrinkage: float = 0.5) -> float:
105    """James-Stein style shrinkage toward zero"""
106    return shrinkage * np.mean(data)
107
# =============================================================================
# PREDICTION
# =============================================================================

def prediction_interval(data: np.ndarray,
                        confidence: float = 0.95) -> Tuple[float, float]:
    """
    Compute prediction interval for new observation.
    Assumes Normal data with unknown mean and variance.
    """
    n = len(data)
    mean = np.mean(data)
    s = np.std(data, ddof=1)  # Sample std

    # t-distribution with n-1 degrees of freedom
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n-1)

    # Prediction SE: sqrt(s² * (1 + 1/n))
    pred_se = s * np.sqrt(1 + 1/n)

    lower = mean - t_crit * pred_se
    upper = mean + t_crit * pred_se

    return lower, upper

def posterior_predictive_normal(data: np.ndarray,
                                prior_mean: float,
                                prior_var: float,
                                data_var: float = 1.0) -> Tuple[float, float]:
    """
    Compute posterior predictive distribution parameters.
    Returns (mean, variance) of X_new | X_1, ..., X_n
    """
    n = len(data)

    # Posterior parameters
    posterior_precision = 1/prior_var + n/data_var
    posterior_var = 1 / posterior_precision
    posterior_mean = posterior_var * (prior_mean/prior_var + np.sum(data)/data_var)

    # Predictive distribution
    predictive_mean = posterior_mean
    predictive_var = data_var + posterior_var

    return predictive_mean, predictive_var

# =============================================================================
# EXAMPLE USAGE
# =============================================================================

if __name__ == "__main__":
    np.random.seed(42)

    # Compare estimator risks
    print("=" * 60)
    print("RISK COMPARISON: Mean vs Median for Normal Data")
    print("=" * 60)

    true_theta = 5.0
    for n in [5, 10, 25, 50]:
        risk_mean = frequentist_risk(true_theta, sample_mean, n, squared_error_loss)
        risk_median = frequentist_risk(true_theta, sample_median, n, squared_error_loss)
        print(f"n={n:2d}: Mean Risk = {risk_mean:.4f}, Median Risk = {risk_median:.4f}")
        print(f"       Mean is {risk_median/risk_mean:.2f}x more efficient")

    print()
    print("=" * 60)
    print("BAYES RISK COMPARISON")
    print("=" * 60)

    # Compare MLE vs Bayes estimator
    n = 10
    for prior_std in [0.5, 1.0, 2.0, 5.0]:
        mle_bayes_risk = bayes_risk(sample_mean, 0, prior_std, n, squared_error_loss)
        bayes_est = lambda x: bayes_estimator_normal(x, 0, prior_std**2)
        bayes_bayes_risk = bayes_risk(bayes_est, 0, prior_std, n, squared_error_loss)
        print(f"Prior σ={prior_std}: MLE Bayes Risk = {mle_bayes_risk:.4f}, "
              f"Bayes Est. Risk = {bayes_bayes_risk:.4f}")

    print()
    print("=" * 60)
    print("PREDICTION vs ESTIMATION")
    print("=" * 60)

    data = np.array([4.2, 5.1, 4.8, 5.5, 4.9, 5.2, 4.7, 5.3])

    # Estimation: t-based CI, since σ is estimated and n is small.
    # Using the same critical value as prediction_interval keeps the
    # two widths directly comparable.
    est_mean = np.mean(data)
    est_se = np.std(data, ddof=1) / np.sqrt(len(data))
    t_crit = stats.t.ppf(0.975, df=len(data) - 1)
    ci_lower = est_mean - t_crit * est_se
    ci_upper = est_mean + t_crit * est_se

    # Prediction
    pi_lower, pi_upper = prediction_interval(data)

    print(f"Data: {data}")
    print(f"\nEstimation (95% CI for μ): [{ci_lower:.3f}, {ci_upper:.3f}]")
    print(f"CI Width: {ci_upper - ci_lower:.3f}")
    print(f"\nPrediction (95% PI for X_new): [{pi_lower:.3f}, {pi_upper:.3f}]")
    print(f"PI Width: {pi_upper - pi_lower:.3f}")
    print(f"\nPI is {(pi_upper - pi_lower)/(ci_upper - ci_lower):.2f}x wider than CI")
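
The width comparison at the end of the script has a simple closed form, which may be a useful sanity check on the printed ratio. The sketch below assumes Normal data and that both intervals use the same critical value $t^*$ (the $t_{n-1}$ quantile at the chosen confidence level); $\bar{x}$ and $s$ denote the sample mean and sample standard deviation.

```latex
\[
\text{CI for } \mu:\quad \bar{x} \pm t^{*}\,\frac{s}{\sqrt{n}},
\qquad
\text{PI for } X_{\text{new}}:\quad \bar{x} \pm t^{*}\, s\,\sqrt{1 + \frac{1}{n}}
\]
\[
\frac{\text{PI width}}{\text{CI width}}
  = \frac{\sqrt{1 + 1/n}}{1/\sqrt{n}}
  = \sqrt{n + 1}
\]
```

For the eight observations used above, this gives $\sqrt{9} = 3$ under that assumption. If the two intervals use different critical values (for example 1.96 for one and a $t$ quantile for the other), the ratio is scaled by the ratio of those critical values.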

Key Insights

1
Decision Theory is the Foundation

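All estimation concepts (bias, variance, MSE, efficiency) arise from decision theory. The "best" estimator depends on your loss function and on how you average the loss: pointwise in θ (frequentist risk), over a prior (Bayes risk), or at the worst case (minimax).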

2
Loss Function Determines Optimal Estimator

Squared error → posterior mean. Absolute error → posterior median. Asymmetric linear loss → posterior quantiles. Choose your loss based on the real-world consequences of each kind of error.

3
Prediction ≠ Estimation

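Predicting a future observation carries more uncertainty than estimating a parameter, because the observation's own noise is added to the estimation error. At the same confidence level and on the same data, a prediction interval is always wider than the confidence interval for the mean.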

4
MSE = Risk Under Squared Error Loss

The Mean Squared Error is the frequentist risk when using squared error loss. The bias-variance decomposition follows directly from this.

5
Bayes and Minimax Connect

Bayes estimators under least favorable priors are often minimax. For Normal mean estimation under squared error loss, the sample mean is generalized Bayes (the limit of Bayes estimators as the prior becomes flat) and also minimax.
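
The flat-prior limit in the last insight can be checked numerically. The sketch below is illustrative only: `posterior_mean_normal` is a hypothetical helper that rewrites the conjugate Normal posterior mean as a weighted average of the sample mean and the prior mean, assuming a known data variance of 1 and a prior mean of 0.

```python
import numpy as np

def posterior_mean_normal(data, prior_mean, prior_var, data_var=1.0):
    """Posterior mean of a Normal mean under a Normal(prior_mean, prior_var) prior.

    Returns (posterior_mean, weight_on_sample_mean).
    """
    n = len(data)
    # Share of the total precision that comes from the data
    w = (n / data_var) / (1.0 / prior_var + n / data_var)
    return w * np.mean(data) + (1.0 - w) * prior_mean, w

data = np.array([4.2, 5.1, 4.8, 5.5, 4.9, 5.2, 4.7, 5.3])

# As the prior variance grows, the weight on the data approaches 1
# and the Bayes estimate approaches the sample mean.
for prior_var in [0.1, 5.0, 1e6]:
    est, w = posterior_mean_normal(data, prior_mean=0.0, prior_var=prior_var)
    print(f"prior var = {prior_var:>9g}: weight on data = {w:.4f}, "
          f"posterior mean = {est:.4f}")

print(f"sample mean = {np.mean(data):.4f}")
```

The weighted-average form makes the shrinkage explicit: a small prior variance pulls the estimate toward the prior mean, and a very large one leaves essentially the sample mean.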


Try It Yourself

Solidify your understanding by experimenting with the interactive demos. Here's a structured exploration:

🧪 Hands-On Checklist
Explore Loss Function Sensitivity

In the Loss Function Demo, set error = 1, then error = 3. How much does squared error increase vs absolute error? (Hint: squared goes 1→9, absolute goes 1→3)

Verify Risk Decreases with Sample Size

In the Risk Demo, slide n from 10 to 50. Confirm risk drops roughly as 1/n for the sample mean. Does the median's risk also drop at the same rate?

Compare CI Width to PI Width

In the Prediction Demo, add more data points. Watch the CI width shrink toward zero while the PI width levels off. Why? (The σ² term doesn't go away even with infinite data!)

Test Bayes Shrinkage

In the Bayes Demo, set prior variance very small (0.1) vs very large (5.0). Where does the posterior mean land? Verify: small τ² → prior dominates; large τ² → data dominates.

Design Your Own Loss

Think of a real problem where underestimating is 3x worse than overestimating. What quantile should you use? (Answer: α = 3/(3+1) = 0.75 = 75th percentile)
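
The quantile answer in the last exercise can be verified by brute force. The following is a minimal sketch, not part of the demos: `asymmetric_loss` and `best_action` are hypothetical helpers, the Normal draws stand in for whatever distribution describes the unknown quantity, and the minimizer is found by a simple grid search.

```python
import numpy as np

def asymmetric_loss(theta, a, under_cost, over_cost):
    """Linear loss: under_cost per unit when a < theta, over_cost per unit when a > theta."""
    err = theta - a
    return np.where(err > 0, under_cost * err, -over_cost * err)

def best_action(draws, under_cost, over_cost, n_grid=2001):
    """Grid-search the single action that minimizes the average loss over the draws."""
    grid = np.linspace(draws.min(), draws.max(), n_grid)
    avg = [asymmetric_loss(draws, a, under_cost, over_cost).mean() for a in grid]
    return float(grid[int(np.argmin(avg))])

rng = np.random.default_rng(1)
draws = rng.normal(loc=10.0, scale=2.0, size=5_000)  # assumed distribution, for illustration

for under_cost, over_cost in [(1.0, 1.0), (3.0, 1.0)]:
    alpha = under_cost / (under_cost + over_cost)
    a_star = best_action(draws, under_cost, over_cost)
    print(f"costs {under_cost:.0f}:{over_cost:.0f} -> minimizer = {a_star:.3f}, "
          f"{alpha:.0%} quantile = {np.quantile(draws, alpha):.3f}")
```

With equal costs the loss reduces to absolute error and the minimizer sits at the median; with a 3:1 cost ratio it moves up to roughly the 75th percentile, matching α = 3/(3+1).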


Summary

In this section, we've built the decision-theoretic foundation for point estimation:

  1. Decision theory framework: States (θ), actions (a), and loss L(θ, a)
  2. Loss functions: Squared error, absolute error, asymmetric, and 0-1 loss each lead to different optimal estimators
  3. Risk functions: Frequentist risk R(θ, δ), Bayes risk r(π, δ), and minimax risk provide different ways to evaluate estimators
  4. Prediction vs estimation: Prediction has extra uncertainty from the inherent randomness of future observations
  5. Connection to point estimation: MSE, bias, variance are all decision-theoretic concepts
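
The link between MSE, bias, and variance in the last point can be checked with a short simulation. This is a sketch under assumed settings (Normal data with variance 1, true mean 5, samples of size 10); the 0.8 shrinkage factor is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
true_theta, n, n_sims = 5.0, 10, 20_000

# Each row is one simulated dataset of size n from Normal(true_theta, 1)
samples = rng.normal(loc=true_theta, scale=1.0, size=(n_sims, n))

for name, factor in [("sample mean", 1.0), ("0.8 * sample mean", 0.8)]:
    estimates = factor * samples.mean(axis=1)
    # Monte Carlo estimate of the frequentist risk under squared error loss
    mse = np.mean((estimates - true_theta) ** 2)
    bias = estimates.mean() - true_theta
    variance = estimates.var()  # ddof=0, so the decomposition holds exactly
    print(f"{name:>18}: MSE = {mse:.4f}, bias^2 = {bias**2:.4f}, "
          f"variance = {variance:.4f}, bias^2 + variance = {bias**2 + variance:.4f}")
```

At this value of θ, shrinking toward zero lowers the variance but adds a large squared bias, so the shrunken estimator has the higher risk; the trade-off would look different for θ close to zero.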
🚀 What's Next?

Now that you understand why we care about different estimator properties, we'll dive deep into the specific concepts:

  • Section 1: Estimators and their properties (the parametric framework)
  • Section 2: Bias, Variance, and MSE (the famous decomposition)
  • Section 3: Consistency and Efficiency (large-sample behavior)
  • Section 4: Sufficiency (using all the information in your data)
  • Section 5: Completeness and Ancillarity (finding optimal estimators)