Decision Theory and Prediction | Chapter 11 - Point Estimation | Probability & Statistics for AI/ML

Learning Objectives

Before You Start

This section provides the conceptual foundation for all of point estimation. You should be comfortable with expected values, probability distributions, and basic optimization concepts.

By the end of this section, you will be able to:

🎯 Understand Decision Theory: The framework for making optimal choices under uncertainty
📊 Master Loss Functions: How to quantify the cost of making wrong decisions
⚖️ Compare Risk Functions: Frequentist risk, Bayes risk, and minimax approaches
🔮 Distinguish Estimation from Prediction: Why predicting new values requires different thinking
🔗 Connect to Point Estimation: See how MSE, bias, and variance arise from decision theory

The Big Picture: Why Decision Theory?

Statistics is about making decisions under uncertainty. Decision theory provides the mathematical framework for choosing the "best" action when we don't know the true state of the world.

Before we dive into estimators, bias, and variance, we need to answer a fundamental question: What does it mean for an estimator to be "good"?

Different people might have different answers:

"An estimator that's right on average" (unbiasedness)
"An estimator that's usually close to the truth" (low variance)
"An estimator that minimizes my expected loss" (optimal decision)

Decision theory gives us a unified framework to think about all these properties. It tells us that the "best" estimator depends on:

What we lose when we're wrong (the loss function)
How we average that loss (the risk function)
What we know beforehand (prior information)

The Core Insight

Every estimator property you will study — bias, variance, MSE, consistency, efficiency, sufficiency — is a decision-theoretic concept in disguise.

"Bias & variance describe how the risk decomposes."
"Consistency describes how the risk behaves as data grows."
"Efficiency compares risk against theoretical lower bounds."
"Sufficiency and completeness identify when risk cannot be improved."

Decision theory is not an optional interpretation — it is the mathematical spine of statistical inference.

🤖The Estimation Machine Analogy

Think of an estimator as a machine:

You feed it raw data → it outputs a guess about an unknown truth.

Just like a physical machine can be evaluated for accuracy and precision, an estimator can be evaluated using several fundamental criteria:

🎯Bias — Systematic Error

Question: Is the machine centered on the truth or consistently off-target?

If it always guesses too high → positive bias
If it always guesses too low → negative bias

Interpretation: Bias measures systematic error.

🎲Variance — Random Scatter

Question: How much do the machine's outputs fluctuate from run to run?

Tight clustering → low variance
Wildly different answers → high variance

Interpretation: Variance measures random instability.

📏MSE — Total Error

Question: Overall, how wrong is the machine on average?

$\text{MSE} = \text{Bias}^2 + \text{Variance}$

Interpretation: MSE balances systematic error + random error into one score.

📈Consistency — Learning with More Data

Question: As we feed the machine more and more data, does it eventually lock onto the true value?

If yes → consistent estimator
If no → inconsistent estimator

Interpretation: Consistency is a long-run guarantee, not a finite-sample promise.

🏆Efficiency — Best Possible Precision

Question: Among all unbiased machines, does this one have the tightest grouping?

If it achieves the smallest possible variance, it is efficient

Interpretation: Efficiency means no other unbiased estimator is more precise.

🧠Sufficiency — No Wasted Information

Question: Does the machine extract all useful information from the data — or does it throw some away?

If nothing is lost → sufficient
If relevant information is discarded → insufficient

Interpretation: Sufficiency is about perfect information compression.

✅Bottom Line

A near-perfect estimation machine would be:

Unbiased → centered on the truth
Low variance → stable across samples
Low MSE → small total error
Consistent → converges with more data
Efficient → best possible precision
Sufficient → wastes no information

This is exactly what optimal statistical estimation aims to achieve.

Machine Learning Perspective

The concepts from classical estimation theory map directly onto modern machine learning. Understanding this connection helps you see that ML is applied decision theory.

Classical Estimation ↔ Machine Learning Intuition

Statistical Concept	Estimator-Machine Meaning	Machine Learning Interpretation
Bias	Is the machine systematically off-target?	Underfitting — model too simple, misses true structure
Variance	How much do outputs fluctuate across samples?	Overfitting — model too sensitive to noise
MSE / Risk	Total error combining bias & variance	Generalization error on unseen test data
Consistency	Does the machine improve with more data?	Model converges as dataset grows
Efficiency	Among unbiased machines, is this the tightest?	Best possible accuracy for given data + model class
Sufficiency	Is any useful information being thrown away?	Feature bottleneck / information loss

🔥The Core ML Insight

Training a neural network amounts to tuning an estimation machine to minimize empirical risk under a chosen loss, as a proxy for the expected (decision-theoretic) risk it will incur on new data.

▸Loss function → training objective

▸Risk (expected loss) → true generalization error

▸Empirical risk → training loss

▸Regularization → bias–variance control

▸Feature engineering / representation learning → sufficiency

🎯Bias–Variance in ML Language

📉High Bias

= Underfitting

• Model is too rigid
• Misses patterns in data
• Training error improves little with more training
• High error on both train & test

📈High Variance

= Overfitting

• Model is too flexible
• Fits noise in training data
• Huge train–test gap
• Low train error, high test error

✨Optimal Model

= Balanced

• Right model complexity
• Balanced bias + variance
• Minimum test error
• = Minimum MSE / Risk

The goal of model selection: Find the sweet spot where $\text{Bias}^2 + \text{Variance}$ is minimized.

Why This Matters for ML Engineers

Every hyperparameter you tune, every architecture choice you make, every regularization technique you apply — you are navigating the bias-variance tradeoff. Decision theory gives you the mathematical foundation to understand why these techniques work.

🔗Unifying Principle

Every estimator is a machine that turns data into decisions.

Every statistical property — bias, variance, MSE, consistency, efficiency, sufficiency — is just a different way of scoring how well that machine behaves under uncertainty.

What Is Decision Theory?

Intuitive Understanding

Imagine you're a doctor diagnosing a patient. You observe symptoms (data) but don't know the true disease (parameter). You must choose a treatment (action). Different actions have different consequences depending on the true disease.

The same logic applies to estimation:

Data = your observations $X_1, X_2, \ldots, X_n$
Unknown state = true parameter $\theta$
Action = your estimate $\hat{\theta}$
Loss = how "wrong" your estimate is

Types of Statistical Problems

The information we extract from data takes different forms depending on our goals. Decision theory provides a unified framework for all of them:

📐Estimation

Goal: Produce "best guesses" of unknown parameters

Action space: All possible parameter values

Examples: Fraction defective $\theta$, population mean $\mu$, regression coefficients $\beta$

⚖️Testing

Goal: Decide if data supports a hypothesis or not

Action space: {Accept $H_0$ , Reject $H_0$ }

Examples: Drug effectiveness vs placebo, A/B test significance, quality control pass/fail

🏆Ranking

Goal: Order items from best to worst

Action space: All $k!$ possible orderings of $k$ items

Examples: Consumer reports ranking brands, search result ordering, tournament seeding

🔮Prediction

Goal: Forecast future observations given covariates

Action space: Predicted values $\hat{Y} = \mu(\mathbf{z})$

Examples: Patient response given (age, sex, dose), house price given features, demand forecasting

The Common Thread

In all cases, the analysis doesn't stop at specifying an estimate, test, ranking, or prediction. We must also evaluate how well our procedure performs. This requires criteria of performance — which is exactly what decision theory provides.

Why Decision Theory Matters

Given so many possible procedures (sample mean vs median, different test statistics, various models), how do we choose? Decision theory provides the framework to answer this systematically.

🔬A Priori Performance

When: Before looking at data

Purpose: Study design, sample size determination

Question: "How well can the best procedure do?"

Example: Determining how many patients we need to detect a treatment effect with 80% power

📊A Posteriori Performance

When: After data is collected

Purpose: Assess reliability of our estimate

Question: "How reliable is this particular estimate?"

Example: Confidence intervals, standard errors, posterior credible intervals

🎯The Four Purposes of Decision Theory

The decision theoretic framework helps us:

Clarify Objectives

What exactly are we trying to achieve? Estimation, testing, ranking, or prediction?

Identify Possible Actions

What decisions can we make? What is the action space?

Assess Risk, Accuracy & Reliability

How do we measure "how well" a procedure performs? What are the relevant metrics?

Guide Procedure Selection

Given objectives and performance criteria, which procedure should we use?

The Fundamental Question: In estimation we care how far off we are; in testing, what mistakes we've made; in ranking, which orderings are wrong. Decision theory gives us the mathematical language to express and minimize these errors.

Formal Framework

A statistical decision problem consists of three ingredients:

State Space $\Theta$

The set of possible true parameter values. Examples: $\Theta = (0, 1)$ for probabilities, $\Theta = \mathbb{R}$ for means, $\Theta = \mathbb{R}^+$ for variances.

Action Space $\mathcal{A}$

The set of possible decisions. For point estimation, $\mathcal{A} = \Theta$ (we choose an estimate from the same space as the parameter).

Loss Function $L(\theta, a)$

A function $L: \Theta \times \mathcal{A} \to \mathbb{R}^+$ measuring the cost of taking action $a$ when the true state is $\theta$. Lower loss = better.

Decision Procedures

A decision procedure (or decision rule) is a function $\delta: \mathcal{X} \to \mathcal{A}$ that maps any possible data outcome to an action. When we observe $\mathbf{X} = \mathbf{x}$ , we take action $\delta(\mathbf{x})$ .

$\delta: \mathcal{X} \to \mathcal{A} \quad \text{where} \quad \delta(\mathbf{x}) = \text{action taken when } \mathbf{X} = \mathbf{x}$

📝Examples of Decision Rules

Estimation: Two Competing Rules

For estimating the population mean $\mu$ from data $X_1, \ldots, X_n$ :

▸ $\delta_1(\mathbf{x}) = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ (sample mean)

▸ $\delta_2(\mathbf{x}) = \tilde{x}$ (sample median)

Which is better? That depends on the loss function and the true distribution!

Testing: Two-Sample Problem

Testing $H_0: \mu_X = \mu_Y$ vs $H_1: \mu_X \neq \mu_Y$ with data from two groups:

$\delta(\mathbf{x}, \mathbf{y}) = \begin{cases} 0 & \text{if } \frac{|\bar{x} - \bar{y}|}{\hat{\sigma}} < c \\ 1 & \text{if } \frac{|\bar{x} - \bar{y}|}{\hat{\sigma}} \geq c \end{cases}$

The critical value $c$ controls the tradeoff between Type I and Type II errors.

Prediction: Linear Regression

Given training data $\{(\mathbf{z}_i, y_i)\}_{i=1}^n$ , predict $Y$ for new $\mathbf{z}$ :

▸ $\delta(\mathbf{z}) = \hat{\beta}_0 + \hat{\beta}_1 z_1 + \cdots + \hat{\beta}_p z_p$

The decision rule is an entire function! The action space is infinite-dimensional.

Key Insight

The notation $\delta(\mathbf{x})$ emphasizes that our decision is a function of the data. We don't just pick a number — we specify a rule that tells us what to do for any possible dataset we might observe.
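
The rules above can be written directly as functions of the data. The following is a minimal Python sketch, not a definitive implementation: the function names, the default critical value `c = 1.96`, and the choice of $\hat{\sigma}$ as the unpooled standard error of $\bar{x} - \bar{y}$ are illustrative assumptions that the text does not fix.

```python
import math
import statistics

def delta_mean(x):
    """Decision rule delta_1: estimate mu by the sample mean."""
    return statistics.fmean(x)

def delta_median(x):
    """Decision rule delta_2: estimate mu by the sample median."""
    return statistics.median(x)

def delta_test(x, y, c=1.96):
    """Two-sample rule: 1 (reject H0) if |xbar - ybar| / sigma_hat >= c, else 0.

    Assumption: sigma_hat is the unpooled standard error of xbar - ybar.
    """
    sigma_hat = math.sqrt(
        statistics.variance(x) / len(x) + statistics.variance(y) / len(y)
    )
    statistic = abs(statistics.fmean(x) - statistics.fmean(y)) / sigma_hat
    return 1 if statistic >= c else 0

# Hypothetical data: one outlier pulls the mean but not the median
sample = [2.1, 1.9, 2.4, 2.0, 9.5]
print(delta_mean(sample), delta_median(sample))
print(delta_test([10, 11, 12, 13, 14], [1, 2, 3, 4, 5]))
```

Each function accepts any dataset and returns an action, which is exactly what makes it a rule rather than a single number.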

The Decision Recipe

📋One-Glance Decision Recipe

Specify Θ and A

What are the possible states? What decisions can you take?

Pick a Loss Function $L(\theta, a)$

Squared error: large errors matter disproportionately
Absolute error: need robustness to outliers
Asymmetric: over/under-estimating have different costs
0-1 loss: classification or hypothesis testing

Compute Risk $R(\theta, \delta)$

Average the loss over the sampling distribution: $R(\theta, \delta) = \mathbb{E}_\theta[L(\theta, \delta(X))]$

Choose Your Decision Rule

Bayes: have prior? Minimize Bayes risk
Minimax: protect against worst case
Frequentist: evaluate pointwise risk $R(\theta, \delta)$

Loss Functions

The loss function is the heart of decision theory. It quantifies: "How bad is it to choose action a when the truth is θ?"

Common Loss Functions

📊 Interactive: Compare Loss Functions

Example setting: true parameter $\theta = 5$, estimate $a = 7$, so the error has magnitude 2.

Squared Error $(\theta - a)^2$: 4.00
Absolute Error $|\theta - a|$: 2.00
Huber Loss (robust): 1.50
0-1 Loss (exact match required)

Observation: Squared error penalizes large errors much more than absolute error. Try setting error = 2 vs error = 4: squared loss quadruples while absolute loss only doubles!

Loss Function	Formula $L(\theta, a)$	Properties	Use When...
Squared Error	$(\theta - a)^2$	Differentiable, penalizes large errors heavily	Errors of similar magnitude, computational convenience
Absolute Error	$|\theta - a|$	Robust to outliers, non-differentiable at 0	Large errors shouldn't dominate
0-1 Loss	0 if $a = \theta$, 1 otherwise	Used for classification/testing	Only exact correctness matters
Asymmetric	$c_1(\theta-a)^+ + c_2(a-\theta)^+$	Different costs for over/under-estimation	Consequences differ by direction of error
Huber Loss	$\frac{1}{2}(\theta-a)^2$ if $|\theta-a| \le \delta$, else $\delta\left(|\theta-a| - \frac{1}{2}\delta\right)$ (here $\delta > 0$ is a threshold, not a decision rule)	Combines benefits of squared and absolute	Robustness with differentiability
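
The formulas in the table translate almost line for line into code. The sketch below is illustrative only: the function names are hypothetical, the Huber threshold `delta = 1.0` is an assumed default, and the linear branch of the Huber loss beyond the threshold follows the standard definition.

```python
def squared_error(theta, a):
    return (theta - a) ** 2

def absolute_error(theta, a):
    return abs(theta - a)

def zero_one(theta, a):
    return 0 if a == theta else 1

def asymmetric(theta, a, c1, c2):
    # c1 weights underestimation (theta > a), c2 weights overestimation (a > theta)
    return c1 * max(theta - a, 0) + c2 * max(a - theta, 0)

def huber(theta, a, delta=1.0):
    # Quadratic for small errors, linear once the error exceeds the threshold delta
    err = abs(theta - a)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

theta, a = 5, 7  # the setting used in the comparison above
losses = {
    "squared": squared_error(theta, a),
    "absolute": absolute_error(theta, a),
    "huber": huber(theta, a),
    "zero_one": zero_one(theta, a),
}
print(losses)
```

Doubling the error from 2 to 4 multiplies `squared_error` by four but `absolute_error` only by two, which is the disproportionate penalty described above.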

Advanced Loss Functions

Beyond the basic loss functions, several specialized losses arise in practice:

📊Loss Functions for Vector Parameters

When estimating a $d$ -dimensional parameter $\boldsymbol{\nu} = (\nu_1, \ldots, \nu_d)$ with estimate $\mathbf{a} = (a_1, \ldots, a_d)$ :

Squared Euclidean Distance

$L(\boldsymbol{\nu}, \mathbf{a}) = \frac{1}{d}\sum_{j=1}^d (a_j - \nu_j)^2 = \frac{1}{d}\|\mathbf{a} - \boldsymbol{\nu}\|_2^2$

Most common choice; decomposes into sum of univariate losses

Absolute Distance (L1)

$L(\boldsymbol{\nu}, \mathbf{a}) = \frac{1}{d}\sum_{j=1}^d |a_j - \nu_j| = \frac{1}{d}\|\mathbf{a} - \boldsymbol{\nu}\|_1$

Robust to outliers; the same norm, used as a regularization penalty, leads to sparse solutions

Supremum Distance (L∞)

$L(\boldsymbol{\nu}, \mathbf{a}) = \max_{j=1,\ldots,d} |a_j - \nu_j| = \|\mathbf{a} - \boldsymbol{\nu}\|_\infty$

Worst-case error across all components; minimax flavor

🔮Prediction Loss (Integrated Squared Error)

For prediction problems where the true function is $\mu(\mathbf{z})$ and our predictor is $a(\mathbf{z})$ :

$L(P, a) = \int (\mu(\mathbf{z}) - a(\mathbf{z}))^2 \, dQ(\mathbf{z})$

If $Q$ is the empirical distribution of the training covariates:

$L(P, a) = \frac{1}{n}\sum_{j=1}^n (\mu(\mathbf{z}_j) - a(\mathbf{z}_j))^2$

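This is the mean squared prediction error at the training covariates. Regression minimizes its observable counterpart, in which the unknown $\mu(\mathbf{z}_j)$ is replaced by the observed response $y_j$.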

Worked Example: The Newsvendor Problem

The newsvendor problem is a classic example of asymmetric loss in action. A vendor must decide how many newspapers to stock before knowing the day's demand.

📰Newsvendor: Asymmetric Loss in Action

Setup:

Each newspaper costs $1 to buy and sells for $2
Understock cost (lost sale): $c_u$ = $1 profit missed
Overstock cost (unsold paper): $c_o$ = $1 purchase price lost

The Optimal Quantile Formula:

$\alpha^* = \frac{c_u}{c_u + c_o} = \frac{1}{1 + 1} = 0.5$

Order the 50th percentile (median) of demand when costs are equal.

Now suppose stockouts are worse:

Understock cost: $c_u$ = $5 (angry customer, lost reputation)
Overstock cost: $c_o$ = $1 (just the paper cost)

$\alpha^* = \frac{c_u}{c_u + c_o} = \frac{5}{5 + 1} = 0.833$

Order the 83rd percentile of demand — stock more to avoid stockouts!

The General Principle

Under asymmetric loss $L(\theta, a) = c_u(\theta-a)^+ + c_o(a-\theta)^+$, the action that minimizes expected loss is the $\alpha$-quantile of the distribution of $\theta$ (here, demand), where $\alpha = c_u/(c_u + c_o)$.
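
A small numerical check of the quantile rule can make it concrete. This is a sketch under stated assumptions: the ten-day demand history is made up for illustration, the helper names are hypothetical, and the "distribution of demand" is taken to be the empirical distribution of that history.

```python
import math

def critical_ratio(c_u, c_o):
    """Optimal quantile level alpha = c_u / (c_u + c_o)."""
    return c_u / (c_u + c_o)

def expected_loss(order, demands, c_u, c_o):
    """Average asymmetric loss of stocking `order` units against past demands."""
    total = sum(c_u * max(d - order, 0) + c_o * max(order - d, 0) for d in demands)
    return total / len(demands)

# Hypothetical demand history, for illustration only
demands = sorted([12, 15, 18, 20, 22, 25, 27, 30, 34, 40])
c_u, c_o = 5, 1  # stockouts cost five times as much as leftovers

alpha = critical_ratio(c_u, c_o)
# Empirical alpha-quantile: the ceil(alpha * n)-th smallest demand
quantile_order = demands[math.ceil(alpha * len(demands)) - 1]
# Brute force: try every observed demand level as the order quantity
brute_force_order = min(demands, key=lambda q: expected_loss(q, demands, c_u, c_o))
print(alpha, quantile_order, brute_force_order)
```

The brute-force search and the quantile formula select the same order quantity on this data, which is what the general principle predicts.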

Loss Functions in ML Practice

The loss functions from decision theory appear throughout machine learning under different names:

Decision Theory Loss	ML Training Loss	Eval Metric	Use Case
Squared Error $(\theta-a)^2$	MSE Loss	RMSE, $R^2$	Regression with normal errors
Absolute Error $|\theta-a|$	L1 / MAE Loss	MAE, MedAE	Robust regression, sparse solutions
0-1 Loss	Classification Error	Accuracy, Error Rate	Hard classification decisions
Log Loss $-\log P(\theta \mid a)$	Cross-Entropy / NLL	Log Loss, Perplexity	Probabilistic classification
Huber Loss	Smooth L1 Loss	Huber metric	Object detection, robust regression
Asymmetric Loss	Quantile Loss / Pinball	Quantile coverage	Demand forecasting, prediction intervals

Theory ↔ Practice Connection

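When you minimize cross-entropy loss in a neural network, you are computing a maximum likelihood estimate under a categorical model. When you minimize MSE, you are minimizing empirical squared error risk, which coincides with maximum likelihood under Gaussian noise of constant variance. The frameworks connect!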

Choosing a Loss Function

Key Insight

The choice of loss function determines the optimal estimator! Under squared error loss, the Bayes estimator is the posterior mean. Under absolute error loss, it is the posterior median.

Risk Functions

The loss L(θ, δ(X)) is random because it depends on the random data X. We need a way to summarize the "typical" or "expected" loss. This is the risk function.

1. Why Do We Even Need a Risk Function?

Let's start from the most basic problem:

You choose an estimator $\delta(X)$ .
You plug in your observed data $X$ .
You get one number $\delta(X)$ .

But here's the problem:

✅ That number is produced by random data.
✅ If you repeated the experiment, you would get a different dataset.
✅ That means your estimator would output a different value every time.

So now ask this:

"How do I judge whether my estimator is good, if every repetition gives a different error?"

You cannot judge an estimator by a single outcome.
You must judge it by its long-run behavior under randomness.

That is exactly what the risk function is:

✅ The risk function is the average long-run penalty your decision rule will pay if the true parameter is $\theta$ .

It converts:

Random loss → deterministic performance curve
One noisy outcome → a stable performance guarantee

Without risk, you cannot compare estimators scientifically.

3. What Does Risk Tell Us?

The risk function answers this fundamental question:

"If the true world were $\theta$ , how painful would it be to use this estimator forever?"

So risk tells you:

What You Want	What Risk Tells You
Accuracy	Average closeness to truth
Reliability	How stable performance is
Robustness	Sensitivity to randomness
Safety	Expected damage
Optimality	Whether another rule does better

Risk turns intuition into a measurable object.

✅Final One-Sentence Truth

The risk function is the bridge between mathematical uncertainty and real-world consequence.

Frequentist Risk

The frequentist risk (or simply "risk") averages the loss over the sampling distribution of the data:

$R(\theta, \delta) = \mathbb{E}_\theta[L(\theta, \delta(X))] = \int L(\theta, \delta(x)) f(x|\theta)\, dx$

Interpretation: "If θ is the true parameter, what's my expected loss if I use decision rule δ repeatedly?"

For squared error loss, the risk has a special name:

$R(\theta, \delta) = \mathbb{E}_\theta[(\theta - \delta(X))^2] = \text{MSE}_\theta(\delta)$

🎯From Risk to MSE to Bias-Variance

The MSE depends on two fundamental quantities:

Bias: $\text{Bias}(\hat{\nu}) = \mathbb{E}[\hat{\nu}] - \nu$ (systematic deviation)

Variance: $\text{Var}(\hat{\nu}) = \mathbb{E}[(\hat{\nu} - \mathbb{E}[\hat{\nu}])^2]$ (fluctuation)

The Famous Decomposition:

$\text{MSE}(\hat{\nu}) = \text{Bias}^2(\hat{\nu}) + \text{Var}(\hat{\nu})$

Total error = Systematic + Random

Preview: This decomposition is the central result. We'll explore the bias-variance tradeoff in Section 2.
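
The decomposition can be checked numerically. The sketch below is a hedged illustration: the estimator (the sample mean shrunk by a factor of 0.8), the true value `theta = 2.0`, and the simulation sizes are all assumptions chosen to make the bias visible.

```python
import random
import statistics

random.seed(0)
theta, n, runs = 2.0, 10, 5000

# A deliberately biased estimator: shrink the sample mean toward zero
estimates = []
for _ in range(runs):
    sample = [random.gauss(theta, 1.0) for _ in range(n)]
    estimates.append(0.8 * statistics.fmean(sample))

# Monte Carlo versions of MSE, bias, and variance
mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
bias = statistics.fmean(estimates) - theta
variance = statistics.pvariance(estimates)

print("MSE:              ", mse)
print("Bias^2 + Variance:", bias ** 2 + variance)
```

The two printed numbers agree up to floating-point error, because the identity holds exactly for the simulated moments, not just in expectation.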

🎲 Interactive: Risk Comparison (Normal Data)

Compare the risk (expected squared error loss) of the sample mean vs sample median for estimating the mean of a Normal distribution.

Assumptions: $X_i \sim N(\theta, \sigma^2 = 1)$ i.i.d. | $\sigma^2$ known | Squared error loss | Monte Carlo (1000 runs)

Sample Size (n): 10

Sample Mean Risk: theory $1/n = 0.1000$
Sample Median Risk: simulated 0.1400; theory $\approx (\pi/2)/n = 0.1571$

Decision Theory Insight

Under squared error loss and Normal data, the sample mean has lower risk than the median. Decision theory helps us make this comparison rigorous.
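
The comparison can be reproduced with a few lines of simulation. This is a minimal sketch, assuming $\theta = 0$, $\sigma^2 = 1$, $n = 10$, and a hypothetical helper `monte_carlo_risk`; the run count and seed are arbitrary choices.

```python
import random
import statistics

def monte_carlo_risk(estimator, theta=0.0, n=10, runs=20000, seed=1):
    """Approximate R(theta, estimator) under squared error loss by simulation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += (estimator(sample) - theta) ** 2
    return total / runs

# The same seed is used for both, so both estimators see identical datasets
risk_mean = monte_carlo_risk(statistics.fmean)
risk_median = monte_carlo_risk(statistics.median)

print("sample mean risk:  ", risk_mean)
print("sample median risk:", risk_median)
print("ratio median/mean: ", risk_median / risk_mean)
```

The simulated risk of the mean should land near the theoretical $1/n = 0.1$, and the median's risk should be visibly larger. The $(\pi/2)/n$ figure is an asymptotic approximation, so the simulated value at $n = 10$ need not match it closely.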

Bayes Risk and Minimax Risk: Two Ways to Choose an Optimal Decision Rule

The frequentist risk

$R(\theta, \delta) = \mathbb{E}_\theta[L(\theta, \delta(X))]$

is a function of the unknown true parameter $\theta$ . Since $\theta$ is not known in practice, the risk alone does not immediately tell us how to select one estimator over another. This leads to a fundamental question:

How should we choose an optimal decision rule when performance depends on an unknown truth?

Decision theory provides two principled answers to this question: the Bayes approach and the minimax approach.

◆Bayes Risk (Average-Case Optimality)

In the Bayesian framework, the parameter $\theta$ is treated as a random variable with prior distribution $\pi(\theta)$ . The performance of a decision rule is measured by its Bayes risk:

$r(\delta) = \mathbb{E}_{\theta \sim \pi}[R(\theta, \delta)] = \int R(\theta, \delta) \, \pi(\theta) \, d\theta$

A Bayes decision rule is defined as:

$\delta_B = \arg\min_\delta r(\delta)$

Interpretation

Bayes risk is the expected long-run loss, averaged over our prior beliefs about which world is likely to be true.

Thus, Bayes optimality is an average-case optimality, weighted by subjective or empirical beliefs.
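
To see the averaging at work, consider a hypothetical family of linear rules $\delta_c(X) = cX$ for a single observation $X \sim N(\theta, 1)$ with prior $\theta \sim N(0, 1)$. Under squared error loss the frequentist risk is $R(\theta, \delta_c) = c^2 + (1-c)^2\theta^2$ (variance plus squared bias), and averaging over the prior gives $r(\delta_c) = c^2 + (1-c)^2$. The sketch below, with assumed function names and a coarse grid, searches for the best $c$:

```python
def frequentist_risk(theta, c, sigma2=1.0):
    # Variance c^2 * sigma2 plus squared bias ((c - 1) * theta)^2
    return c ** 2 * sigma2 + (1 - c) ** 2 * theta ** 2

def bayes_risk(c, tau2=1.0, sigma2=1.0):
    # Average of frequentist_risk over theta ~ N(0, tau2), using E[theta^2] = tau2
    return c ** 2 * sigma2 + (1 - c) ** 2 * tau2

grid = [i / 100 for i in range(101)]
best_c = min(grid, key=bayes_risk)

print("best c:", best_c, "Bayes risk:", bayes_risk(best_c))
# The Bayes-optimal c does well near the prior mean and poorly far from it
print("risk at theta=0:", frequentist_risk(0.0, best_c))
print("risk at theta=3:", frequentist_risk(3.0, best_c))
```

The search is restricted to linear rules, which is an assumption of this sketch. Within that family the minimizer shrinks the observation halfway toward the prior mean, trading higher risk at large $|\theta|$ for lower risk where the prior puts most of its mass.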

◆Minimax Risk (Worst-Case Optimality)

In the minimax framework, no prior distribution is assumed. Instead, Nature is treated as an adversary who may choose the worst possible value of $\theta$ . The relevant performance measure is the maximal risk:

$\sup_{\theta \in \Theta} R(\theta, \delta)$

A minimax decision rule is defined as:

$\delta_M = \arg\min_\delta \sup_{\theta \in \Theta} R(\theta, \delta)$

Interpretation

Minimax risk measures how bad things could get in the worst possible world. The minimax rule is the one with the smallest guaranteed maximum damage.

This is a worst-case optimality principle.

◆One Unified Decision-Theoretic View

Both Bayes and minimax arise from a single abstract principle:

$\delta^* = \arg\min_\delta \mathcal{R}(\delta)$

where the functional $\mathcal{R}(\delta)$ is either:

an expectation over $\theta$ (Bayes), or
a supremum over $\theta$ (minimax).

Thus, Bayes and minimax are not competing theories — they are two different ways of aggregating the same underlying risk function.

Bayes

minimize average risk under a belief

Minimax

minimize worst possible risk under uncertainty

Bayes risk teaches us how to act intelligently when we believe something about the world.

Minimax risk teaches us how to act safely when we don't trust the world at all.

Formal Comparison:

🎲Bayes Approach	🛡️Minimax Approach
Put a prior distribution π(θ) on the parameter and minimize the Bayes risk:	Minimize the worst-case risk over all θ:
$r(\pi, \delta) = \mathbb{E}_\pi[R(\theta, \delta)] = \int R(\theta, \delta) \pi(\theta)\, d\theta$ Average risk over your prior beliefs about θ.	$\delta^* = \arg\min_\delta \max_\theta R(\theta, \delta)$ Prepare for the adversarial scenario where nature picks the worst θ.
Formal Definition	Formal Definition
In the Bayesian framework, $\theta$ is random. The Bayes risk is: $r(\delta) = E[R(\theta, \delta)] = E[L(\theta, \delta(X))]$ (expectation over both $\theta$ and $X$)	We prefer $\delta$ to $\delta'$ iff: $\sup_\theta R(\theta, \delta) < \sup_\theta R(\theta, \delta')$
For discrete $\theta$: $r(\delta) = \sum_\theta R(\theta, \delta) \pi(\theta)$ For continuous $\theta$: $r(\delta) = \int R(\theta, \delta) \pi(\theta)\, d\theta$	A procedure $\delta^*$ is minimax if: $\sup_\theta R(\theta, \delta^*) = \inf_\delta \sup_\theta R(\theta, \delta)$ That is, $\delta^*$ minimizes the maximum risk.

🛢️Worked Example: Oil Drilling Decision

Suppose an expert believes the probability of finding oil is $\theta$ , which can take two values: $\theta_1$ (low yield) or $\theta_2$ (high yield). The expert assigns prior probabilities:

$\pi(\theta_1) = 0.2, \quad \pi(\theta_2) = 0.8$

The Bayes risk of any procedure $\delta$ is:

$r(\delta) = 0.2 \cdot R(\theta_1, \delta) + 0.8 \cdot R(\theta_2, \delta)$

Consider 9 possible decision procedures. Their risks are:

Procedure $i$	1	2	3	4	5	6	7	8	9
Bayes risk $r(\delta_i)$	9.6	7.48	8.38	4.92	2.8	3.7	7.02	4.9	5.8
$\max\{R(\theta_1, \delta_i), R(\theta_2, \delta_i)\}$	12	7.6	9.6	5.4	10	6.5	8.4	8.5	6

🎲 Bayes Rule

$\delta_5$ has minimum Bayes risk $r(\delta_5) = 2.8$ . This is the unique Bayes rule for this prior.

🛡️ Minimax Rule

$\delta_4$ has minimum max-risk = 5.4. This is the minimax rule.

Key Observation

The Bayes rule ( $\delta_5$ ) and minimax rule ( $\delta_4$ ) are different! The Bayes rule optimizes average performance under the prior, while minimax protects against the worst case.
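
Selecting the two rules from the table is a pair of `argmin` operations. A minimal sketch using only the values listed above (the dictionary names are illustrative):

```python
# Values copied from the table: procedure index -> risk summary
bayes_risk = {1: 9.6, 2: 7.48, 3: 8.38, 4: 4.92, 5: 2.8,
              6: 3.7, 7: 7.02, 8: 4.9, 9: 5.8}
max_risk = {1: 12, 2: 7.6, 3: 9.6, 4: 5.4, 5: 10,
            6: 6.5, 7: 8.4, 8: 8.5, 9: 6}

# Bayes rule: smallest prior-weighted average risk
bayes_rule = min(bayes_risk, key=bayes_risk.get)
# Minimax rule (among these nine): smallest worst-case risk
minimax_rule = min(max_risk, key=max_risk.get)

print("Bayes rule:   delta_%d" % bayes_rule, bayes_risk[bayes_rule])
print("Minimax rule: delta_%d" % minimax_rule, max_risk[minimax_rule])
```

Note that $\delta_5$, the Bayes rule, has a worst-case risk of 10, the second-largest in the table: it wins on average only because the prior puts little weight on the state where it performs badly.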

🎮Game Theory Interpretation of Minimax

The minimax criterion comes from two-person zero-sum game theory (von Neumann):

🌍

Player I: Nature

Picks $\theta \in \Theta$ (possibly adversarially)

👨‍🔬

Player II: Statistician

Picks $\delta \in \mathcal{D}$ (decision procedure)

The statistician "pays" Nature the risk $R(\theta, \delta)$ . The maximum risk of $\delta^*$ is the upper pure value of the game.

Minimax is Very Conservative

This criterion aims to give maximum protection against the worst that can happen—Nature choosing a $\theta$ that makes risk as large as possible.

The principle is compelling if you believe the parameter is being chosen by a malevolent opponent who knows your decision procedure. However, most statisticians find minimax too conservative as a general rule, though it can lead to very reasonable procedures in adversarial or safety-critical settings.

🎲Randomized Procedures Can Lower Maximum Risk

A key insight from game theory: randomizing between procedures can reduce maximum risk!

Example: In the oil drilling problem, suppose we flip a fair coin and use $\delta_4$ if heads, $\delta_6$ if tails. The expected risk of this randomized procedure is:

$\frac{1}{2}R(\theta, \delta_4) + \frac{1}{2}R(\theta, \delta_6) = \begin{cases} 4.75 & \text{if } \theta = \theta_1 \\ 4.20 & \text{if } \theta = \theta_2 \end{cases}$

The maximum risk is now 4.75, which is lower than the minimax value of 5.4 achieved by $\delta_4$ alone!

Practical Implication

When computing minimax procedures, we should consider randomized rules (mixed strategies), not just deterministic ones. The minimax theorem guarantees that under suitable conditions, there exists a (possibly randomized) minimax procedure.
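
The coin-flip calculation can be replayed in code. One caveat: the individual risk pairs for $\delta_4$ and $\delta_6$ are not listed in the text. The values below are an assumption, chosen because they reproduce both the table entries (Bayes risks 4.92 and 3.7, maximum risks 5.4 and 6.5) and the coin-flip risks quoted above.

```python
# Assumed risk pairs (R(theta_1, delta), R(theta_2, delta)); inferred, not from the text
risk_d4 = (3.0, 5.4)
risk_d6 = (6.5, 3.0)

def mixed_risk(p):
    """Risk at (theta_1, theta_2) when delta_4 is used with probability p, else delta_6."""
    return tuple(p * r4 + (1 - p) * r6 for r4, r6 in zip(risk_d4, risk_d6))

# Fair coin, as in the example
fair_coin = mixed_risk(0.5)
print("fair coin risks:", fair_coin, "max:", max(fair_coin))

# Search over mixing probabilities for the smallest worst-case risk
grid = [i / 1000 for i in range(1001)]
best_p = min(grid, key=lambda p: max(mixed_risk(p)))
print("best p on grid:", best_p, "max risk:", max(mixed_risk(best_p)))
```

Under these assumed values, a coin biased slightly toward $\delta_4$ lowers the maximum risk a little further than the fair coin. The search mixes only these two procedures, so it does not by itself identify the overall minimax rule.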

🎯 Interactive: Bayes Optimal Decision

The Bayes estimator under squared error loss is the posterior mean. Watch how it balances prior belief and observed data.

Assumptions: Normal-Normal conjugate | Prior: $\theta \sim N(\mu_0, \tau^2)$ | Likelihood: $X|\theta \sim N(\theta, \sigma^2=1)$ | Single observation

Prior Mean: 0
Prior Variance: 1
Observation X: 3

[Plot: prior centered at $\mu_0$, data point $X$, and posterior (Bayes)]

Bayes Estimate: 1.500
Posterior Variance: 0.500

Try: Increase prior variance → Bayes estimate moves toward the data. Decrease prior variance → Prior belief dominates.
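
The numbers above follow from the standard Normal-Normal conjugate update, in which precisions (inverse variances) add and the posterior mean is a precision-weighted average of the prior mean and the observation. A minimal sketch, with a hypothetical function name:

```python
def normal_posterior(prior_mean, prior_var, x, noise_var=1.0):
    """Posterior mean and variance of theta after one observation x ~ N(theta, noise_var)."""
    precision = 1.0 / prior_var + 1.0 / noise_var      # precisions add
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + x / noise_var)
    return post_mean, post_var

# Setting from the example: prior N(0, 1), observation X = 3
print(normal_posterior(0.0, 1.0, 3.0))

# Widening the prior moves the Bayes estimate toward the data
for tau2 in (0.1, 1.0, 10.0):
    print("prior variance", tau2, "-> Bayes estimate", normal_posterior(0.0, tau2, 3.0)[0])
```

With equal prior and noise variances the estimate sits exactly halfway between the prior mean and the observation.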

⚠️The Fundamental Problem: No Uniformly Best Rule

We say procedure $\delta$ improves $\delta'$ if:

$R(\theta, \delta) \leq R(\theta, \delta') \quad \text{for all } \theta, \text{ with strict inequality for some } \theta$

Key insight: There is typically no single rule that improves all others!

Example: Estimating $\theta \in \mathbb{R}$ when $X \sim N(\theta, \sigma_0^2)$

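Consider the "absurd rule" $\delta^*(X) = 0$ (ignore data entirely)
Its risk under squared error loss is $R(\theta, \delta^*) = \text{MSE}_\theta(\delta^*) = \theta^2$
At $\theta = 0$ its risk is 0, and a competing rule $\delta$ can match that only if $E_0[\delta^2(X)] = 0$, that is, only if $\delta(X) = 0$ with probability 1. No rule can beat it there.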

Even terrible rules can be unbeatable at some θ values!
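
Tabulating two risk functions side by side shows the crossing. The sketch assumes $\sigma_0^2 = 1$ and compares the constant rule with the natural rule $\delta(X) = X$, whose risk under squared error is its variance $\sigma_0^2$ for every $\theta$; the function names are illustrative.

```python
def risk_constant_zero(theta):
    # delta(X) = 0: no variance, squared bias theta^2
    return theta ** 2

def risk_identity(theta, sigma0_sq=1.0):
    # delta(X) = X: unbiased, so the risk is the variance at every theta
    return sigma0_sq

for theta in (0.0, 0.5, 1.0, 2.0):
    r0, r1 = risk_constant_zero(theta), risk_identity(theta)
    better = "constant rule" if r0 < r1 else "delta(X) = X" if r1 < r0 else "tie"
    print(f"theta={theta}: R(0-rule)={r0}, R(X-rule)={r1} -> {better}")
```

Neither risk curve lies below the other everywhere, so neither rule improves the other in the sense defined above.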

✅Admissibility: Ruling Out Bad Procedures

❌ Inadmissible

A rule $\delta$ is inadmissible if there exists another rule $\delta'$ that improves it.

Why use δ when δ' is never worse and sometimes better?

✓ Admissible

A rule $\delta$ is admissible if no rule improves it (i.e., it's not inadmissible).

Admissible rules are "Pareto optimal" in risk space.

Practical Implication

We should restrict attention to admissible procedures — inadmissible ones are dominated and should never be used. But among admissible procedures, we still need Bayes or minimax criteria to choose!

Frequentist vs Bayesian Approach

Understanding the philosophical and practical differences between frequentist and Bayesian approaches is fundamental to mastering statistical inference. Let's explore these two paradigms side by side.

The Core Intuition (One-Line Difference)

✅ Frequentist:

"The parameter is a fixed but unknown truth. Only the data is random."

✅ Bayesian:

"The parameter itself is uncertain. I represent my uncertainty with a probability distribution."

That single difference changes everything.

How Each One Sees "Truth"

🎯 Frequentist View of Truth

There is one true value of the parameter.
Example:
The true mean height of adult men in the US is one fixed number.
You just don't know it.
Your estimator is judged by:
- What happens over imaginary repeated experiments

You never say:

"The probability that μ = 172.3 is 0.7" ❌
Because μ is not random to a frequentist.

🧠 Bayesian View of Truth

The parameter is unknown AND treated as random.
You express your uncertainty as a distribution.
Example:
Before data: "I believe μ is around 170–175 with high probability."
After data: "Now I believe μ is tightly around 173."

You do say:

"There is a 95% probability that μ lies between 172 and 174." ✅

That sentence is illegal in frequentist statistics, but natural in Bayesian inference.

How Each One Treats Uncertainty

Question	Frequentist	Bayesian
What is random?	The data	The data + parameters
What is fixed?	The parameter $\theta$	The observed data, once collected
What does probability mean?	Long-run frequency	Degree of belief
What is confidence?	Coverage over repeated samples	Direct probability of truth

Critical Distinction: Coverage ≠ Probability of θ

Confidence Intervals (Frequentist):

Guarantee long-run coverage: "95% of intervals constructed this way will contain $\theta$ ." The interval is random; $\theta$ is fixed.

Credible Intervals (Bayesian):

Give probability of $\theta$ conditional on data: " $P(\theta \in [a,b] | X) = 0.95$ ." The interval is fixed (given data); $\theta$ is random.

⚠️ Reading a CI as "the probability that θ is in the interval" is common but incorrect. Only Bayesian credible intervals allow that reading, at the cost of requiring a prior.

How Decisions Are Made

✅ Frequentist Decision Logic

Assume $\theta$ is fixed.
Assume repeated sampling.
Choose procedure with:
- Low MSE
- Correct confidence interval coverage
- Controlled Type-I error

Key idea:

"If I repeated this experiment 1 million times, my method would behave correctly."

✅ Bayesian Decision Logic

Start with a prior belief about $\theta$ .
Collect data.
Update belief using Bayes' theorem.
Choose action that minimizes posterior expected loss.

Key idea:

"Given what I know right now, what is the best decision?"

📐 Decision-Theoretic Summary:

\delta^* = \arg\min_\delta \text{Risk}

Paradigm	Risk Definition	What It Minimizes	Typical Criteria
Frequentist	$R(\theta, \delta) = E_\theta[L(\theta, \delta(X))]$	Pointwise risk for each $\theta$	MSE, Type-I/II error, CI coverage
Bayes	$r(\pi, \delta) = \int R(\theta, \delta) \pi(\theta) d\theta$	Average risk under prior π	Posterior expected loss
Minimax	$\max_\theta R(\theta, \delta)$	Worst-case risk over all θ	Robust to adversarial θ

All three paradigms are valid decision-theoretic frameworks—they just optimize different objectives.
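To make the three criteria concrete, here is a minimal sketch for a Binomial(n, θ) model under squared error loss. It compares the MLE X/n with the classical constant-risk estimator (X + √n/2)/(n + √n). The function names, the grid of θ values, and the uniform prior used for the average are illustrative assumptions, not part of the text above.

```python
import numpy as np
from scipy import stats

def exact_risk(estimator, n, thetas):
    """Exact squared-error risk R(theta, delta) for X ~ Binomial(n, theta)."""
    x = np.arange(n + 1)                       # every possible outcome
    est = estimator(x, n)                      # delta(x) for each outcome
    pmf = stats.binom.pmf(x[None, :], n, thetas[:, None])
    return (pmf * (est[None, :] - thetas[:, None]) ** 2).sum(axis=1)

def mle(x, n):
    return x / n

def constant_risk(x, n):
    # Classical minimax estimator for the binomial under squared error
    return (x + np.sqrt(n) / 2) / (n + np.sqrt(n))

n = 100
thetas = np.linspace(0.001, 0.999, 999)        # assumed grid over the parameter space
risk_mle = exact_risk(mle, n, thetas)
risk_cr = exact_risk(constant_risk, n, thetas)

# Minimax criterion: worst-case risk over the grid
max_mle, max_cr = risk_mle.max(), risk_cr.max()
# Bayes criterion: average risk under a (roughly) uniform prior on the grid
bayes_mle, bayes_cr = risk_mle.mean(), risk_cr.mean()

print(f"max risk   : MLE={max_mle:.5f}, constant-risk={max_cr:.5f}")
print(f"Bayes risk : MLE={bayes_mle:.5f}, constant-risk={bayes_cr:.5f}")
```

Under these assumptions the MLE has the smaller average risk while the constant-risk estimator has the smaller worst-case risk, so which rule counts as "best" depends on the criterion chosen.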

Real-World Examples

🏭 Example 1: Factory Defect Rate (Worked with Numbers)

You want to estimate defect probability $\theta$ . Data: n = 100 samples, x = 3 defects observed.

✅ Frequentist Analysis

Point estimate: $\hat{\theta} = 3/100 = 0.03$
95% CI (Wald):
$\hat{\theta} \pm 1.96\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}} = 0.03 \pm 0.033$
Result: (0.000, 0.063) or use exact Clopper-Pearson: (0.006, 0.085)

Legal interpretation: "If I repeated this sampling procedure infinitely, 95% of such intervals would contain the true $\theta$ ."

❌ Cannot say: "There's a 95% probability $\theta$ is in this interval."

✅ Bayesian Analysis

Prior: $\theta \sim \text{Beta}(2, 50)$ (prior mean ≈ 0.038, encodes "around 4%")
Likelihood: $X | \theta \sim \text{Binomial}(100, \theta)$
Posterior:
$\theta | X \sim \text{Beta}(2+3, 50+97) = \text{Beta}(5, 147)$
95% Credible Interval: (0.011, 0.063)

Legal interpretation: "Given the data and my prior, there is a 95% probability that $\theta$ lies in (0.011, 0.063)."

✅ Can make direct probability statements about $\theta$ .

📊 Interval Comparison

Interval Type	95% Interval	What It Means
Frequentist CI (Clopper-Pearson)	(0.006, 0.085)	95% of such intervals cover true θ in repeated sampling
Bayesian Credible (Beta(2,50) prior)	(0.011, 0.063)	P(θ ∈ interval \| data) = 0.95

Note: The intervals differ because the Bayesian incorporates prior information (pulling toward ~4%), while the frequentist uses only the data.
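The intervals in this example can be recomputed with `scipy.stats`. The sketch below assumes an equal-tailed credible interval; the text does not say which kind of credible interval it reports, so the last digit of the Bayesian bounds may differ slightly from the table.

```python
import numpy as np
from scipy import stats

n, x = 100, 3
theta_hat = x / n

# Wald interval (normal approximation), truncated to [0, 1]
z = stats.norm.ppf(0.975)
half = z * np.sqrt(theta_hat * (1 - theta_hat) / n)
wald = (max(0.0, theta_hat - half), min(1.0, theta_hat + half))

# Clopper-Pearson exact interval, written in terms of Beta quantiles
cp = (stats.beta.ppf(0.025, x, n - x + 1),
      stats.beta.ppf(0.975, x + 1, n - x))

# Equal-tailed 95% credible interval under the Beta(2, 50) prior
a0, b0 = 2, 50
posterior = stats.beta(a0 + x, b0 + n - x)     # Beta(5, 147)
credible = posterior.ppf([0.025, 0.975])

print("Wald            :", np.round(wald, 3))
print("Clopper-Pearson :", np.round(cp, 3))
print("Credible        :", np.round(credible, 3))
```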

🎛️ Prior Sensitivity: How Priors Shift Posteriors

Same data (n=100, x=3), different priors → different posteriors:

Prior	Prior Belief	Posterior	95% Credible Interval
$\text{Beta}(1, 1)$	Flat/uninformative	$\text{Beta}(4, 98)$	(0.011, 0.074)
$\text{Beta}(2, 50)$	Weakly informative (~4%)	$\text{Beta}(5, 147)$	(0.011, 0.063)
$\text{Beta}(10, 200)$	Strong prior (~5%)	$\text{Beta}(13, 297)$	(0.024, 0.068)

⚠️ Key insight: Strong priors dominate small samples. With n=100, the strong prior pulls the interval away from the MLE (0.03). This is a feature when prior knowledge is reliable, but a bug when the prior is misspecified.
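One way to see how much each prior "pulls" is to write the posterior mean as a weighted average of the prior mean and the MLE, with weight (a + b)/(a + b + n) on the prior. The short sketch below checks that identity for the three priors in the table; the variable names are illustrative.

```python
n, x = 100, 3
mle = x / n

priors = {"Beta(1, 1)": (1, 1), "Beta(2, 50)": (2, 50), "Beta(10, 200)": (10, 200)}
results = {}

for name, (a, b) in priors.items():
    prior_mean = a / (a + b)
    weight = (a + b) / (a + b + n)             # weight placed on the prior mean
    post_mean = (a + x) / (a + b + n)          # mean of Beta(a + x, b + n - x)
    # Posterior mean = weight * prior mean + (1 - weight) * MLE
    assert abs(post_mean - (weight * prior_mean + (1 - weight) * mle)) < 1e-12
    results[name] = (weight, post_mean)
    print(f"{name:14s} prior weight={weight:.2f}  posterior mean={post_mean:.4f}")
```

The prior weight grows with a + b, which acts like a count of "pseudo-observations" contributed by the prior.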

💉 Example 2: Medical Drug Trial (Life-or-Death Decisions)

✅ Frequentist Doctor

Hypothesis test:
- $H_0$ : No effect
- $H_1$ : Drug helps
If p < 0.05 → approve

This controls:

"How often would I wrongly approve useless drugs if I repeated trials forever?"

✅ Bayesian Doctor

Already knows:
- Similar drugs
- Biological constraints
Uses prior.
After trial:
"There is an 87% probability this drug reduces mortality by at least 10%."

This answers:

"What should I do today, given all information?"

✅ That's decision-theoretic optimality.

🤖 Example 3: Machine Learning Model

✅ Frequentist Training

Fit model.
Report:
- Test accuracy
- Confidence intervals via bootstrapping
- Hypothesis tests on coefficients

Used in:

Classical statistics
Regulatory environments
Scientific publishing

✅ Bayesian Training

Model weights have distributions.
Predictions have credible intervals.
Uncertainty-aware outputs:
"There is a 92% probability that this patient has disease."

Used in:

Medical AI
Robotics
Reinforcement learning
Active learning
Safety-critical AI

When Should You Use Which? (Practical Rulebook)

✅ Use Frequentist When:

✓ You want:

Hypothesis testing
p-values
Long-run guarantees
Regulatory approval
Scientific reproducibility

✓ You believe:

"Truth is fixed"
"I don't want to specify a prior"
"Only data should speak"

📌 Examples:

FDA drug trials
Manufacturing quality control
Academic hypothesis testing

✅ Use Bayesian When:

✓ You want:

Direct probability statements about parameters
Optimal decisions under uncertainty
Uncertainty-aware AI
Small-data problems
Sequential learning

✓ You believe:

"Prior knowledge matters"
"Uncertainty itself should be modeled"

📌 Examples:

Medical diagnosis
Autonomous systems
Financial risk modeling
Reinforcement learning
LLM uncertainty estimation

The Deep Unification Insight (Advanced)

Here is the point that is easy to miss:

✅ Frequentist methods optimize worst-case or pointwise risk.
✅ Bayesian methods optimize average risk under a prior.

They are both solving the same decision-theoretic problem, just with different ways of handling uncertainty.

In fact:

Ridge regression = Bayesian MAP under Gaussian prior
LASSO = Bayesian MAP under Laplace prior
Dropout = approximate Bayesian inference
Ensemble methods = frequentist uncertainty approximation
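The first correspondence in this list can be checked numerically. The sketch below assumes a linear model with Gaussian noise of variance σ² and an independent N(0, τ²) prior on each coefficient; under those assumptions the ridge penalty is λ = σ²/τ². The simulated data and the specific values of σ² and τ² are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 1.0, 4.0                        # noise variance, prior variance (assumed)
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Ridge estimate with penalty lambda = sigma^2 / tau^2
lam = sigma2 / tau2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Posterior mean (which is also the MAP) for beta ~ N(0, tau^2 I),
# y | beta ~ N(X beta, sigma^2 I)
precision = X.T @ X / sigma2 + np.eye(d) / tau2
beta_map = np.linalg.solve(precision, X.T @ y / sigma2)

print(np.allclose(beta_ridge, beta_map))
```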

Final One-Sentence Summary

Frequentists trust repeated experiments.
Bayesians trust probability as a language of belief.
Engineers and AI systems increasingly rely on Bayesian reasoning when decisions matter in real time under uncertainty.

Honest Trade-Offs

⚠️ Frequentist Challenges

Small samples: Asymptotic guarantees may not hold; coverage can be poor.
No prior info: Cannot easily incorporate domain knowledge.
No direct probability on θ: CIs answer "what would happen in repeated samples?" not "where is θ?"
Multiple comparisons: Requires careful correction (Bonferroni, FDR).

⚠️ Bayesian Challenges

Prior sensitivity: Results depend on prior choice, especially with small n.
Computation: Exact posteriors often intractable; requires MCMC/VI.
Subjectivity criticism: Two analysts with different priors get different answers.
Improper priors: Must verify posterior is proper (integrates to 1).

When Priors Are Hard to Specify

Not sure what prior to use? Several approaches can help:

📊 Objective/Reference Priors

Jeffreys prior, reference priors—designed to be "non-informative" or "minimally informative." Let the data dominate.

🔄 Empirical Bayes

Estimate hyperparameters from the data itself. Common in hierarchical models. "Let the data inform the prior."

⚠️ Improper Priors

Some "priors" don't integrate to 1 (e.g., $\pi(\theta) \propto 1$ ). Must check posterior propriety!

Prior Sensitivity Analysis

Always run your analysis with multiple priors (informative, weakly informative, diffuse). If conclusions change dramatically, your inference is prior-dependent—get more data or be explicit about prior assumptions.

Computation in Practice

Frequentist Computation

Exact methods: t-tests, F-tests, exact binomial CIs
Asymptotics: z-tests, Wald intervals, likelihood ratio tests
Resampling: Bootstrap for CIs, permutation tests

Usually fast; well-supported in standard software.

Bayesian Computation

Conjugate priors: Closed-form posteriors (Beta-Binomial, Normal-Normal)
MCMC: Stan, PyMC, JAGS—sample from posterior
Variational Inference: Fast approximations (mean-field, ADVI)
Deep ensembles: Approximate uncertainty in neural networks

Can be slow; requires convergence diagnostics.

Calibration: Checking Your Methods

Frequentist Calibration

Simulation studies: Generate data from known θ, check if your 95% CI actually covers θ in ~95% of simulations. Test under misspecification to assess robustness.
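A coverage study of this kind can be sketched in a few lines. The example below reuses the defect-rate setting (θ = 0.03, n = 100) and the Wald interval; the number of simulations and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, n_sims = 0.03, 100, 20_000
z = 1.96

x = rng.binomial(n, theta, size=n_sims)        # one defect count per simulated study
p_hat = x / n
half = z * np.sqrt(p_hat * (1 - p_hat) / n)    # Wald half-width for each study
covered = (p_hat - half <= theta) & (theta <= p_hat + half)
coverage = covered.mean()

print(f"Empirical coverage of the nominal 95% Wald interval: {coverage:.3f}")
```

For a small θ like this one, the Wald interval tends to cover less often than its nominal level, which is the sort of problem a simulation study is meant to reveal.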

Bayesian Calibration

Posterior predictive checks: Simulate data from the posterior and compare to observed data. If the posterior can't reproduce key features of your data, the model (or prior) is misspecified.

Both paradigms need validation

Neither frequentist nor Bayesian methods are "automatic." Both require checking assumptions: model correctness, prior reasonableness, convergence (for MCMC), and coverage/calibration.

Paradigm Comparison: When to Use What

Paradigm	Optimizes	Pros	Cons	Use When...
Frequentist	$R(\theta, \delta)$ for each $\theta$	No prior needed; objective; well-understood theory	No single "best" $\delta$ if $R$ varies with $\theta$ ; can't combine info	Regulatory settings; need $\theta$ -specific guarantees
Bayes	$r(\pi, \delta) = \mathbb{E}_\pi[R(\theta,\delta)]$	Coherent decisions; incorporates prior; single optimal $\delta$	Requires prior; sensitive to prior choice; computation	Have prior info; want probabilistic statements
Minimax	$\max_\theta R(\theta, \delta)$	Robust to worst case; no prior needed	Conservative; may be too pessimistic; hard to compute	Adversarial settings; safety-critical applications

Practical Guidance

In practice, most ML uses frequentist evaluation (test set metrics) but Bayes-like reasoning (regularization = prior, ensembles = posterior averaging). Minimax appears in robust optimization and adversarial training.

Prediction

Prediction is not about learning a number — it is about learning how randomness will unfold in the future.

In many real-world problems, we observe a vector of covariates $\mathbf{Z} \in \mathbb{R}^d$ and wish to predict an unseen response $Y \in \mathbb{R}$ . This prediction task arises throughout science and engineering:

Education: Predict first-year GPA from entrance exam scores

Finance: Predict portfolio value from market history

Meteorology: Predict rainfall from weather patterns

Energy: Predict demand from temperature forecasts

We assume the joint distribution of $(\mathbf{Z}, Y)$ is known (or estimated from data). Our goal is to find a predictor $g(\mathbf{Z})$ that is as close as possible to the true future outcome $Y$ .

◆Prediction as a Decision Problem

Prediction fits exactly into the decision-theoretic framework:

State of nature: The joint distribution of $(\mathbf{Z}, Y)$
Action: A function $g$ that maps $\mathbf{Z} \mapsto g(\mathbf{Z})$
Loss function: A penalty measuring prediction error
Risk: Expected prediction error

◆Mean Squared Prediction Error (MSPE)

A natural measure of prediction quality is the squared error: $(g(\mathbf{Z}) - Y)^2$ . Since the future outcome $Y$ is random, we measure performance using the mean squared prediction error:

$\Delta^2(Y, g(\mathbf{Z})) = \mathbb{E}\left[(g(\mathbf{Z}) - Y)^2\right]$

This is the prediction analogue of MSE in estimation.

◆Fundamental Optimality Result (Key Theorem)

Among all possible predictors, the function that minimizes MSPE is:

$g^*(\mathbf{Z}) = \mathbb{E}[Y \mid \mathbf{Z}]$

Interpretation: The optimal predictor under squared error loss is the conditional mean of Y given Z.

This theorem is the mathematical foundation of:

Linear regression
Neural network regression
Gaussian processes
Deep learning with squared loss
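A quick simulation can illustrate the optimality result. The sketch assumes a made-up data-generating process, Y = Z² + noise with unit noise variance, so that E[Y | Z] = Z² is known exactly, and compares it with a shifted version and with the best straight-line fit.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200_000
z = rng.uniform(-2, 2, size=m)
y = z ** 2 + rng.normal(scale=1.0, size=m)     # E[Y | Z] = Z^2, noise variance 1

def mspe(pred):
    """Empirical mean squared prediction error."""
    return np.mean((pred - y) ** 2)

mspe_cond_mean = mspe(z ** 2)                  # g*(Z) = E[Y | Z]
mspe_shifted = mspe(z ** 2 + 0.5)              # g* plus a constant offset

slope, intercept = np.polyfit(z, y, 1)         # least-squares line (linear class)
mspe_linear = mspe(intercept + slope * z)

print(f"conditional mean: {mspe_cond_mean:.3f}")
print(f"shifted by 0.5  : {mspe_shifted:.3f}")
print(f"best linear     : {mspe_linear:.3f}")
```

The conditional mean attains an MSPE close to the noise variance, and any other predictor pays an extra cost equal to its mean squared distance from E[Y | Z].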

◆Connection to Bayesian Decision Theory

Under squared loss, prediction is identical to Bayesian decision-making:

$g^*(\mathbf{Z}) = \arg\min_g \mathbb{E}\left[(Y - g(\mathbf{Z}))^2 \mid \mathbf{Z}\right]$

Thus:

Prediction = posterior Bayes decision
MSPE = posterior expected risk

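This parallel is why a network trained with MSE loss can be read as approximating the conditional mean $\mathbb{E}[Y \mid \mathbf{Z}]$ , the same quantity a Bayes decision under squared loss would return. It does not by itself make the training procedure Bayesian.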

◆Classes of Predictors (Hypothesis Spaces)

We may search over:

✓ Nonparametric class:

$\mathcal{G}_{NP} = \{\text{all measurable functions } g(\mathbf{Z})\}$

✓ Linear class:

$\mathcal{G}_L = \left\{ g(\mathbf{Z}) = a + \sum_{j=1}^{d} b_j Z_j \right\}$

Restricting to $\mathcal{G}_L$ leads to linear regression.

Restricting to neural networks leads to deep learning.

Machine learning = empirical MSPE minimization over a restricted predictor class.

◆Prediction vs Estimation (The Deep Conceptual Difference)

	Estimation	Prediction
Goal	Learn a fixed unknown parameter $\theta$	Predict a future random outcome $X_{\text{new}}$
Randomness	Only the data is random	The future outcome is random even if $\theta$ is known
Target	A constant	A random variable
Error vanishes as $n \to \infty$?	Yes (for consistent estimators)	No: irreducible noise remains

Estimation uncertainty can vanish. Prediction uncertainty never vanishes.

This irreducible error is called Bayes error in machine learning.

📍Estimation

Goal: Learn a fixed unknown quantity θ

"What is the true population mean?"

θ is fixed; only our uncertainty about it changes with data.

🔮Prediction

Goal: Predict a future random observation $X_{new}$

"What will the next customer spend?"

$X_{new}$ is random even if we knew $\theta$ perfectly!

Example: Estimation vs Prediction

See the key difference: estimating the population mean vs predicting a new observation.

Assumptions: $X_i \sim N(\mu, \sigma^2)$ i.i.d.; $\sigma^2$ unknown and estimated from the sample; plug-in intervals using the normal critical value 1.96 (not Bayesian).

Data: 4.2, 5.1, 4.8, 5.5, 4.9

📍Estimation

"What is the true population mean μ?"

$\hat{\mu} = 4.900$

95% CI: [4.484, 5.316]

Width: 0.832

🔮Prediction

"Where will the next observation fall?"

$\tilde{X}_{new} = 4.900$

95% PI: [3.882, 5.918]

Width: 2.037

💡Key Insight

The prediction interval is always wider than the confidence interval! Why? Prediction uncertainty = estimation uncertainty + inherent variability of the new observation. The two sources are:

Estimation uncertainty: We don't know $\theta$ exactly
Inherent randomness: Even if we knew $\theta$ , $X_{new}$ is random

Because $X_{new}$ is independent of the sample, the variance of the prediction error $X_{new} - \bar{X}$ is the sum of the two:

$\text{Var}(X_{new} - \bar{X}) = \underbrace{\text{Var}(\bar{X})}_{\text{estimation}} + \underbrace{\sigma^2}_{\text{inherent}} = \frac{\sigma^2}{n} + \sigma^2 = \sigma^2\left(1 + \frac{1}{n}\right)$

Common Confusion

A confidence interval for $\mu$ and a prediction interval for $X_{new}$ look similar but mean different things:

95% CI: "We're 95% confident the TRUE MEAN lies in this interval"
95% PI: "We're 95% confident the NEXT OBSERVATION lies in this interval"

The prediction interval is always wider because it accounts for individual variability.
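The two kinds of interval can be computed side by side for the five data points 4.2, 5.1, 4.8, 5.5, 4.9 used in the estimation vs prediction example. This is a minimal sketch: the helper name `intervals` is made up, and both a plug-in normal critical value (1.96) and a t critical value are shown, since the latter is the usual choice when σ is estimated from a small sample.

```python
import numpy as np
from scipy import stats

data = np.array([4.2, 5.1, 4.8, 5.5, 4.9])
n = len(data)
mean = data.mean()
s = data.std(ddof=1)                           # sample standard deviation

def intervals(crit):
    """Return (CI for mu, PI for X_new) for a given critical value."""
    ci_half = crit * s / np.sqrt(n)            # uncertainty about mu only
    pi_half = crit * s * np.sqrt(1 + 1 / n)    # adds variability of X_new
    return (mean - ci_half, mean + ci_half), (mean - pi_half, mean + pi_half)

ci_z, pi_z = intervals(1.96)                               # plug-in normal value
ci_t, pi_t = intervals(stats.t.ppf(0.975, df=n - 1))       # t with n - 1 df

print("z-based CI:", np.round(ci_z, 3), " PI:", np.round(pi_z, 3))
print("t-based CI:", np.round(ci_t, 3), " PI:", np.round(pi_t, 3))
```

Whatever critical value is used, the ratio of PI width to CI width is $\sqrt{n + 1}$, so the gap grows with the sample size even though both intervals shrink.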

Predictive Distributions

In many applications, our goal is not merely to produce a single numerical prediction, but to characterize the full uncertainty of a future outcome. The object that encodes this uncertainty is the predictive distribution.

While a point predictor answers:

"What value do I guess will occur?"

the predictive distribution answers the more fundamental question:

✓ "What range of outcomes could occur, and with what probabilities?"

◆Frequentist Predictive Distribution (Plug-In)

In the frequentist framework, the model is:

$f(x \mid \theta)$

and the unknown parameter $\theta$ is first estimated by $\hat{\theta}$ . The plug-in predictive distribution is then:

$\hat{f}(x_{\text{new}}) = f(x_{\text{new}} \mid \hat{\theta})$

Interpretation: The plug-in predictive treats the estimated parameter as if it were the true parameter and ignores uncertainty in $\hat{\theta}$ .

Thus, it accounts only for observation noise, but not parameter uncertainty. This makes plug-in predictions:

Sharp
Optimistic
Potentially overconfident in small samples

◆Bayesian Posterior Predictive Distribution

In the Bayesian framework, $\theta$ is a random variable with posterior distribution $\pi(\theta \mid X_1, \ldots, X_n)$ . The posterior predictive distribution is:

$f(x_{\text{new}} \mid X_1, \ldots, X_n) = \int f(x_{\text{new}} \mid \theta) \, \pi(\theta \mid X_1, \ldots, X_n) \, d\theta$

Interpretation: Bayesian prediction averages over all plausible parameter values, weighted by how strongly the data support each value.

This properly accounts for:

✓ Observation noise
✓ Parameter uncertainty
✓ Prior uncertainty (when data are limited)

As a result, Bayesian predictive distributions are typically wider and better calibrated.
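The integral defining the posterior predictive can be approximated by simulation: draw θ from the posterior, then draw a new observation given that θ. The sketch below assumes a Normal model with known noise variance and a Normal prior; the values of `sigma2`, `mu0`, and `tau2` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.array([4.2, 5.1, 4.8, 5.5, 4.9])
sigma2 = 0.25                                  # assumed known noise variance
mu0, tau2 = 5.0, 1.0                           # assumed N(mu0, tau2) prior on theta

n = len(data)
post_var = 1.0 / (1.0 / tau2 + n / sigma2)     # conjugate Normal-Normal update
post_mean = post_var * (mu0 / tau2 + data.sum() / sigma2)

# Monte Carlo version of the integral: theta ~ posterior, x_new ~ f(. | theta)
theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=200_000)
x_new = rng.normal(theta_draws, np.sqrt(sigma2))

plugin_var = sigma2                            # plug-in predictive ignores post_var
predictive_var = x_new.var()

print(f"plug-in variance             : {plugin_var:.4f}")
print(f"posterior predictive variance: {predictive_var:.4f}")
print(f"sigma2 + posterior variance  : {sigma2 + post_var:.4f}")
```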

◆Why Bayesian Predictive Distributions Are More Honest

The plug-in approach implicitly assumes:

$\hat{\theta} \approx \theta \quad \text{with certainty}$

The Bayesian approach explicitly acknowledges:

$\theta \text{ is still uncertain after observing data}$

Hence:

$\text{Bayesian predictive variance} = \text{noise variance} + \text{parameter uncertainty variance}$

This is why Bayesian predictions are safer for:

• Risk management

• Medicine

• Safety-critical systems

• Financial decision-making

◆Predictive Distributions as Decision-Theoretic Objects

The predictive distribution is not just a probabilistic summary — it is the complete input required to make optimal future decisions under uncertainty. Any rational decision rule for actions involving the future (insurance pricing, thresholding, alarm systems, portfolio allocation) must be based on a predictive distribution, not a point estimate.

Common Pitfalls

⚠️Don't Make These Mistakes!

Confusing CI with PI

A 95% confidence interval for μ is constructed to cover the fixed parameter in 95% of repeated samples. A 95% prediction interval is constructed to cover the next observation with the same frequency. PIs are always wider!

Using Squared Error Loss with Outliers

Squared error heavily penalizes large errors. If your data has outliers or heavy tails, absolute error or Huber loss may be more appropriate. Match your loss to your problem!

Plug-in Prediction Underestimates Uncertainty

Using $f(x_{new} | \hat{\theta})$ treats your estimate as if it were the true $\theta$ . This ignores estimation uncertainty and makes your prediction intervals too narrow. Use the posterior predictive instead!

Mixing Up Prior and Data Variance

In Bayesian estimation, the prior variance $\tau^2$ represents uncertainty about $\theta$ before data. The data variance $\sigma^2$ is the noise in observations. These are different quantities — don't confuse them!

Ignoring Asymmetry in Real Costs

Don't default to squared error when overestimating and underestimating have different costs. Always ask: "What's the real-world consequence of each type of error?"
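To see what an asymmetric cost does to the optimal action, the sketch below minimizes an asymmetric linear loss by grid search and compares the result with the quantile c_under / (c_under + c_over). The simulated outcomes, the cost ratio of 3 to 1, and the grid are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
outcomes = rng.normal(loc=100.0, scale=15.0, size=50_000)  # stand-in for uncertain demand
c_under, c_over = 3.0, 1.0                     # underestimating costs 3x more

def expected_loss(a):
    under = np.maximum(outcomes - a, 0.0)      # shortfall when a is too low
    over = np.maximum(a - outcomes, 0.0)       # excess when a is too high
    return np.mean(c_under * under + c_over * over)

grid = np.linspace(60.0, 140.0, 801)           # candidate actions, step 0.1
losses = np.array([expected_loss(a) for a in grid])
best_action = grid[losses.argmin()]

target_quantile = c_under / (c_under + c_over)             # 0.75
quantile_action = np.quantile(outcomes, target_quantile)

print(f"grid-search minimizer: {best_action:.2f}")
print(f"75th percentile      : {quantile_action:.2f}")
print(f"mean (squared error) : {outcomes.mean():.2f}")
```

The minimizer sits above the mean, which is what squared error would have chosen: when shortfalls are more expensive, the best action deliberately overshoots.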

Connection to Point Estimation

Now we can see how decision theory provides the foundation for everything in this chapter:

Concept	Decision Theory View	What It Tells Us
Estimator $\hat{\theta}$	Decision rule $\delta(X)$	Maps data to estimates
MSE	Risk under squared error loss	$\mathbb{E}[(\theta - \hat{\theta})^2]$
Bias	Systematic error in the decision	$\mathbb{E}[\hat{\theta}] - \theta$
Variance	Variability of the decision	$\text{Var}(\hat{\theta})$
Unbiased Estimator	Decision that's correct on average	$\mathbb{E}[\hat{\theta}] = \theta$
UMVUE	Best unbiased decision	Minimum variance among unbiased
Bayes Estimator	Optimal decision given prior	Minimizes Bayes risk
MLE	Asymptotically optimal decision	Minimizes KL divergence

The Big Picture

All the properties we study in point estimation are about finding optimal decisions under different loss functions and different notions of optimality.

MSE = risk under squared error loss
Unbiasedness = zero systematic error
Efficiency = achieving the minimum possible risk
Sufficiency = using all relevant information for the decision
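The statement "MSE = risk under squared error loss" and its split into bias and variance can be checked by simulation. The sketch uses a shrunken sample mean as the estimator; the shrinkage factor `c` and the simulation settings are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, sigma, n, n_sims = 2.0, 1.0, 10, 100_000
c = 0.8                                        # hypothetical shrinkage factor

samples = rng.normal(theta, sigma, size=(n_sims, n))
estimates = c * samples.mean(axis=1)           # shrunken sample mean, one per dataset

mse = np.mean((estimates - theta) ** 2)        # simulated risk under squared error
bias = estimates.mean() - theta                # systematic error
variance = estimates.var()                     # ddof=0 so the identity below is exact

print(f"MSE               : {mse:.4f}")
print(f"bias^2 + variance : {bias ** 2 + variance:.4f}")
```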

Confidence Bounds as Decision Theory

Decision theory provides a powerful lens for understanding confidence bounds and intervals — an important hybrid of testing and estimation.

📊Motivating Example: Accounts Receivable Audit

An accounting firm examines accounts receivable for a company based on a random sample. They want an upper bound on the total amount owed $\nu$ .

If $X$ represents the amount owed in the sample, they seek $\bar{\nu}(X)$ such that:

$P[\bar{\nu}(X) \geq \nu] \geq 1 - \alpha$

This $\bar{\nu}(X)$ is called a (1-α) upper confidence bound on $\nu$ .

The Decision-Theoretic Formulation

How does this fit into decision theory? We can view the upper confidence bound as a decision procedure with action space $\mathcal{A} = \mathbb{R}$ and a specific loss function.

❌Naive Loss (Has Problems)

$L(P, a) = \begin{cases} 0 & \text{if } a \geq \nu(P) \\ 1 & \text{if } a < \nu(P) \end{cases}$

Problem: Taking $\bar{\nu} \equiv \infty$ achieves risk = 0! A bound that says "at most infinity" is useless.

✅Better Loss (Balances Goals)

$L(P, a) = \begin{cases} a - \nu(P) & \text{if } a \geq \nu(P) \\ c & \text{if } a < \nu(P) \end{cases}$

Why better: Penalizes overestimation (loose bounds) while heavily penalizing undercoverage.

The Key Insight

Though upper bounding is the primary goal, it's also important to get close to the truth. Knowing "at most ∞ dollars" is technically correct but useless. The decision-theoretic framework naturally accommodates both goals by choosing an appropriate loss function.

🎯The Practical Approach

Rather than using Lagrangian optimization, practitioners typically:

Fix the coverage probability: Require $P[\bar{\nu}(X) \geq \nu] \geq 1 - \alpha$ for all $P$ (e.g., α = 0.05)
Then minimize the "excess": Among all procedures satisfying the coverage requirement above, minimize $R(P, \bar{\nu}) = \mathbb{E}[(\bar{\nu}(X) - \nu(P))_+]$

where $x_+ = x \cdot \mathbf{1}(x \geq 0)$ is the positive part.
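The two-step recipe can be illustrated with a simple case that is not the audit problem itself: an upper bound for a Normal mean with known standard deviation, of the form x̄ + k·σ/√n. The sketch simulates coverage and expected excess for two values of the multiplier k; all numerical settings are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
nu, sigma, n, n_sims = 10.0, 2.0, 25, 50_000   # true value and known sd (assumed)
alpha = 0.05

# Sampling distribution of the sample mean under the assumed Normal model
xbar = rng.normal(nu, sigma / np.sqrt(n), size=n_sims)

def evaluate(multiplier):
    """Coverage and expected excess of the bound xbar + multiplier * sigma / sqrt(n)."""
    bound = xbar + multiplier * sigma / np.sqrt(n)
    coverage = np.mean(bound >= nu)
    excess = np.mean(np.maximum(bound - nu, 0.0))          # E[(bound - nu)_+]
    return coverage, excess

cov_tight, exc_tight = evaluate(stats.norm.ppf(1 - alpha)) # smallest k meeting 1 - alpha
cov_loose, exc_loose = evaluate(3.0)                       # overly cautious bound

print(f"k=z_(1-alpha): coverage={cov_tight:.3f}, excess={exc_tight:.3f}")
print(f"k=3.0        : coverage={cov_loose:.3f}, excess={exc_loose:.3f}")
```

Both bounds satisfy the coverage constraint, but the looser one pays for its extra coverage with a larger expected excess, which is the trade-off the constrained formulation is designed to resolve.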

Extension to Confidence Intervals

The same decision-theoretic logic extends to confidence intervals. A confidence interval $[\underline{\nu}(X), \bar{\nu}(X)]$ for $\nu$ satisfies:

$P[\underline{\nu}(X) \leq \nu(P) \leq \bar{\nu}(X)] \geq 1 - \alpha \quad \text{for all } P \in \mathcal{P}$

📐Visualizing the Tradeoffs

📏

Too Wide

High coverage, but uninformative

[−∞, +∞] has 100% coverage!

✨

Just Right

Correct coverage, minimal width

Optimal decision-theoretic balance

⚠️

Too Narrow

Precise, but wrong too often

Below nominal coverage

Goal: Minimize interval width while maintaining ≥ (1-α) coverage

Concept	Decision Theory Formulation	What We Optimize
Upper Bound	Action $a = \bar{\nu}(X)$	Minimize $\mathbb{E}[\bar{\nu}(X) - \nu]$ s.t. coverage ≥ 1-α
Lower Bound	Action $a = \underline{\nu}(X)$	Maximize $\mathbb{E}[\underline{\nu}(X)]$ s.t. coverage ≥ 1-α
Two-Sided CI	Action $a = [\underline{\nu}, \bar{\nu}]$	Minimize $\mathbb{E}[\bar{\nu} - \underline{\nu}]$ s.t. coverage ≥ 1-α

Why This Matters

Understanding confidence intervals as decision procedures explains why we construct them the way we do: we're finding the narrowest intervals that still achieve the required coverage. This is a constrained optimization problem — pure decision theory!

Symbol Glossary

Symbol	Name	Meaning
$\theta$	Parameter	The unknown true value we want to estimate
$\Theta$	Parameter Space	Set of all possible $\theta$ values
$a$	Action	A decision or estimate we choose
$\mathcal{A}$	Action Space	Set of all possible actions
$L(\theta, a)$	Loss Function	Cost of choosing action $a$ when truth is $\theta$
$\delta(X)$	Decision Rule	Function mapping data $X$ to an action
$R(\theta, \delta)$	Risk Function	Expected loss: $\mathbb{E}_\theta[L(\theta, \delta(X))]$
$r(\pi, \delta)$	Bayes Risk	Expected risk under prior: $\mathbb{E}_\pi[R(\theta, \delta)]$
$\pi(\theta)$	Prior Distribution	Belief about $\theta$ before seeing data
$X_{new}$	Future Observation	A new random value to be predicted

Python Implementation

Here's a complete implementation of key decision theory concepts:

🐍python

import numpy as np
from scipy import stats
from typing import Callable, Tuple

# =============================================================================
# LOSS FUNCTIONS
# =============================================================================

def squared_error_loss(theta: float, estimate: float) -> float:
    """Squared error loss: L(θ, a) = (θ - a)²"""
    return (theta - estimate) ** 2

def absolute_error_loss(theta: float, estimate: float) -> float:
    """Absolute error loss: L(θ, a) = |θ - a|"""
    return np.abs(theta - estimate)

def asymmetric_loss(theta: float, estimate: float,
                    c_under: float = 1.0, c_over: float = 2.0) -> float:
    """
    Asymmetric loss: different costs for over/under-estimation
    L(θ, a) = c_under * max(θ - a, 0) + c_over * max(a - θ, 0)
    """
    if estimate < theta:
        return c_under * (theta - estimate)
    else:
        return c_over * (estimate - theta)

def huber_loss(theta: float, estimate: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic for small errors, linear for large"""
    error = np.abs(theta - estimate)
    if error <= delta:
        return 0.5 * error ** 2
    else:
        return delta * error - 0.5 * delta ** 2

# =============================================================================
# RISK FUNCTIONS
# =============================================================================

def frequentist_risk(true_theta: float,
                     estimator: Callable[[np.ndarray], float],
                     sample_size: int,
                     loss_fn: Callable[[float, float], float],
                     n_simulations: int = 10000) -> float:
    """
    Compute frequentist risk via simulation (data model: N(θ, 1)).

    R(θ, δ) = E_θ[L(θ, δ(X))]
    """
    losses = []
    for _ in range(n_simulations):
        # Generate data from N(theta, 1)
        data = np.random.normal(true_theta, 1, sample_size)
        estimate = estimator(data)
        losses.append(loss_fn(true_theta, estimate))
    return np.mean(losses)

def bayes_risk(estimator: Callable[[np.ndarray], float],
               prior_mean: float,
               prior_std: float,
               sample_size: int,
               loss_fn: Callable[[float, float], float],
               n_simulations: int = 10000) -> float:
    """
    Compute Bayes risk via simulation (Normal prior, data model: N(θ, 1)).

    r(π, δ) = E_π[R(θ, δ)]
    """
    total_loss = 0
    for _ in range(n_simulations):
        # Sample θ from prior
        theta = np.random.normal(prior_mean, prior_std)
        # Generate data from N(theta, 1)
        data = np.random.normal(theta, 1, sample_size)
        estimate = estimator(data)
        total_loss += loss_fn(theta, estimate)
    return total_loss / n_simulations

# =============================================================================
# ESTIMATORS
# =============================================================================

def sample_mean(data: np.ndarray) -> float:
    """Sample mean estimator"""
    return np.mean(data)

def sample_median(data: np.ndarray) -> float:
    """Sample median estimator"""
    return np.median(data)

def bayes_estimator_normal(data: np.ndarray,
                           prior_mean: float,
                           prior_var: float,
                           data_var: float = 1.0) -> float:
    """
    Bayes estimator for Normal mean (conjugate prior).
    Posterior mean minimizes Bayes risk under squared error loss.
    """
    n = len(data)
    posterior_precision = 1/prior_var + n/data_var
    posterior_mean = (prior_mean/prior_var + np.sum(data)/data_var) / posterior_precision
    return posterior_mean

def shrinkage_estimator(data: np.ndarray, shrinkage: float = 0.5) -> float:
    """Fixed linear shrinkage of the sample mean toward zero"""
    return shrinkage * np.mean(data)

# =============================================================================
# PREDICTION
# =============================================================================

def prediction_interval(data: np.ndarray,
                        confidence: float = 0.95) -> Tuple[float, float]:
    """
    Compute prediction interval for new observation.
    Assumes Normal data with unknown mean and variance.
    """
    n = len(data)
    mean = np.mean(data)
    s = np.std(data, ddof=1)  # Sample std

    # t-distribution with n-1 degrees of freedom
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n-1)

    # Prediction SE: sqrt(s² * (1 + 1/n))
    pred_se = s * np.sqrt(1 + 1/n)

    lower = mean - t_crit * pred_se
    upper = mean + t_crit * pred_se

    return lower, upper

def posterior_predictive_normal(data: np.ndarray,
                                prior_mean: float,
                                prior_var: float,
                                data_var: float = 1.0) -> Tuple[float, float]:
    """
    Compute posterior predictive distribution parameters.
    Returns (mean, variance) of X_new | X_1, ..., X_n
    """
    n = len(data)

    # Posterior parameters
    posterior_precision = 1/prior_var + n/data_var
    posterior_var = 1 / posterior_precision
    posterior_mean = posterior_var * (prior_mean/prior_var + np.sum(data)/data_var)

    # Predictive distribution
    predictive_mean = posterior_mean
    predictive_var = data_var + posterior_var

    return predictive_mean, predictive_var

# =============================================================================
# EXAMPLE USAGE
# =============================================================================

if __name__ == "__main__":
    np.random.seed(42)

    # Compare estimator risks
    print("=" * 60)
    print("RISK COMPARISON: Mean vs Median for Normal Data")
    print("=" * 60)

    true_theta = 5.0
    for n in [5, 10, 25, 50]:
        risk_mean = frequentist_risk(true_theta, sample_mean, n, squared_error_loss)
        risk_median = frequentist_risk(true_theta, sample_median, n, squared_error_loss)
        print(f"n={n:2d}: Mean Risk = {risk_mean:.4f}, Median Risk = {risk_median:.4f}")
        print(f"       Mean is {risk_median/risk_mean:.2f}x more efficient")

    print()
    print("=" * 60)
    print("BAYES RISK COMPARISON")
    print("=" * 60)

    # Compare MLE vs Bayes estimator
    n = 10
    for prior_std in [0.5, 1.0, 2.0, 5.0]:
        mle_bayes_risk = bayes_risk(sample_mean, 0, prior_std, n, squared_error_loss)
        bayes_est = lambda x: bayes_estimator_normal(x, 0, prior_std**2)
        bayes_bayes_risk = bayes_risk(bayes_est, 0, prior_std, n, squared_error_loss)
        print(f"Prior σ={prior_std}: MLE Bayes Risk = {mle_bayes_risk:.4f}, "
              f"Bayes Est. Risk = {bayes_bayes_risk:.4f}")

    print()
    print("=" * 60)
    print("PREDICTION vs ESTIMATION")
    print("=" * 60)

    data = np.array([4.2, 5.1, 4.8, 5.5, 4.9, 5.2, 4.7, 5.3])

    # Estimation: use the same t critical value as prediction_interval,
    # so the two widths are directly comparable
    est_mean = np.mean(data)
    est_se = np.std(data, ddof=1) / np.sqrt(len(data))
    t_crit = stats.t.ppf(0.975, df=len(data) - 1)
    ci_lower = est_mean - t_crit * est_se
    ci_upper = est_mean + t_crit * est_se

    # Prediction
    pi_lower, pi_upper = prediction_interval(data)

    print(f"Data: {data}")
    print(f"\nEstimation (95% CI for μ): [{ci_lower:.3f}, {ci_upper:.3f}]")
    print(f"CI Width: {ci_upper - ci_lower:.3f}")
    print(f"\nPrediction (95% PI for X_new): [{pi_lower:.3f}, {pi_upper:.3f}]")
    print(f"PI Width: {pi_upper - pi_lower:.3f}")
    print(f"\nPI is {(pi_upper - pi_lower)/(ci_upper - ci_lower):.2f}x wider than CI")

Key Insights

Decision Theory is the Foundation

All estimation concepts (bias, variance, MSE, efficiency) arise from decision theory. The "best" estimator depends on your loss function and how you average risk.

Loss Function Determines Optimal Estimator

Squared error → posterior mean. Absolute error → posterior median. Asymmetric loss → quantiles. Choose your loss based on the real-world consequences.

Prediction ≠ Estimation

Predicting a future observation has MORE uncertainty than estimating a parameter. Prediction intervals are always wider than confidence intervals.

MSE = Risk Under Squared Error Loss

The Mean Squared Error is the frequentist risk when using squared error loss. The bias-variance decomposition follows directly from this.

Bayes and Minimax Connect

Bayes estimators under least favorable priors are often minimax. For Normal mean estimation, the sample mean is minimax and is generalized Bayes under the improper flat prior (a limit of proper Bayes estimators).

Try It Yourself

Solidify your understanding by experimenting with the interactive demos. Here's a structured exploration:

🧪Hands-On Checklist

Explore Loss Function Sensitivity

In the Loss Function Demo, set error = 1, then error = 3. How much does squared error increase vs absolute error? (Hint: squared goes 1→9, absolute goes 1→3)

Verify Risk Decreases with Sample Size

In the Risk Demo, slide n from 10 to 50. Confirm risk drops roughly as 1/n for the sample mean. Does the median's risk also drop at the same rate?

Compare CI Width to PI Width

In the Prediction Demo, add more data points. Watch the PI shrink more slowly than the CI. Why? (The $\sigma^2$ term doesn't go away even with infinite data!)

Test Bayes Shrinkage

In the Bayes Demo, set prior variance very small (0.1) vs very large (5.0). Where does the posterior land? Verify: small $\tau^2$ → prior dominates; large $\tau^2$ → data dominates.

Design Your Own Loss

Think of a real problem where underestimating is 3x worse than overestimating. What quantile should you use? (Answer: α = 3/(3+1) = 0.75 = 75th percentile)
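
If the interactive demos are not available, the sample-size exercise from this checklist can be approximated offline. The following is a minimal Monte Carlo sketch, assuming Normal data with hypothetical values `theta = 0` and `sigma = 1` and squared error loss; the helper name `simulated_risk` is introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, reps = 0.0, 1.0, 20_000  # hypothetical simulation settings


def simulated_risk(estimator, n):
    """Monte Carlo estimate of E[(estimator(X) - theta)^2] for n Normal draws."""
    samples = rng.normal(theta, sigma, size=(reps, n))
    estimates = estimator(samples, axis=1)  # one estimate per simulated dataset
    return np.mean((estimates - theta) ** 2)


risks = {}
for n in (10, 50):
    risks[n] = (simulated_risk(np.mean, n), simulated_risk(np.median, n))
    print(f"n = {n:>2}: risk(mean) = {risks[n][0]:.4f}, "
          f"risk(median) = {risks[n][1]:.4f}, sigma^2/n = {sigma**2 / n:.4f}")
```

The sample mean's simulated risk should track $\sigma^2/n$ closely. For Normal data the median's risk also falls at roughly the $1/n$ rate, but with a larger constant (about $\pi/2$ times the mean's risk for large $n$).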

Summary

In this section, we've built the decision-theoretic foundation for point estimation:

Decision theory framework: States (θ), actions (a), and loss L(θ, a)
Loss functions: Squared error, absolute error, asymmetric, and 0-1 loss each lead to different optimal estimators
Risk functions: Frequentist risk R(θ, δ), Bayes risk r(π, δ), and minimax risk provide different ways to evaluate estimators
Prediction vs estimation: Prediction has extra uncertainty from the inherent randomness of future observations
Connection to point estimation: MSE, bias, variance are all decision-theoretic concepts
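
The last point, that MSE is the risk under squared error loss and splits into bias squared plus variance, can be verified by simulation. This is a sketch under stated assumptions: Normal data with hypothetical `theta = 2` and `sigma = 1`, and a deliberately biased estimator that multiplies the sample mean by a hypothetical factor `shrink = 0.8`.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma, n, reps = 2.0, 1.0, 20, 50_000  # hypothetical settings
shrink = 0.8  # hypothetical shrinkage factor, chosen to introduce bias

samples = rng.normal(theta, sigma, size=(reps, n))
estimates = shrink * samples.mean(axis=1)  # one estimate per simulated dataset

mse = np.mean((estimates - theta) ** 2)  # simulated risk under squared error
bias = estimates.mean() - theta
variance = estimates.var()  # ddof=0, so the decomposition holds exactly

print(f"MSE               = {mse:.4f}")
print(f"bias^2 + variance = {bias**2 + variance:.4f}")
print(f"theory            = {((shrink - 1) * theta) ** 2 + shrink**2 * sigma**2 / n:.4f}")
```

The first two printed values agree up to floating-point error because the decomposition is an algebraic identity for the simulated estimates; the third line is the theoretical value $(c - 1)^2\theta^2 + c^2\sigma^2/n$ for a shrinkage factor $c$.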

🚀What's Next?

Now that you understand why we care about different estimator properties, we'll dive deep into the specific concepts:

Section 1: Estimators and their properties — the parametric framework
Section 2: Bias, Variance, and MSE — the famous decomposition
Section 3: Consistency and Efficiency — large-sample behavior
Section 4: Sufficiency — using all the information in your data
Section 5: Completeness and Ancillarity — finding optimal estimators
