Learning Objectives
Before You Start
This section provides the conceptual foundation for all of point estimation. You should be comfortable with expected values, probability distributions, and basic optimization concepts.
By the end of this section, you will be able to:
The framework for making optimal choices under uncertainty
How to quantify the cost of making wrong decisions
Frequentist risk, Bayes risk, and minimax approaches
Why predicting new values requires different thinking
See how MSE, bias, and variance arise from decision theory
The Big Picture: Why Decision Theory?
Statistics is about making decisions under uncertainty. Decision theory provides the mathematical framework for choosing the "best" action when we don't know the true state of the world.
Before we dive into estimators, bias, and variance, we need to answer a fundamental question: What does it mean for an estimator to be "good"?
Different people might have different answers:
- "An estimator that's right on average" (unbiasedness)
- "An estimator that's usually close to the truth" (low variance)
- "An estimator that minimizes my expected loss" (optimal decision)
Decision theory gives us a unified framework to think about all these properties. It tells us that the "best" estimator depends on:
- What we lose when we're wrong (the loss function)
- How we average that loss (the risk function)
- What we know beforehand (prior information)
The Core Insight
Every estimator property you will study (bias, variance, MSE, consistency, efficiency, sufficiency) is a decision-theoretic concept in disguise.
- Bias and variance describe how the risk decomposes.
- Consistency describes how the risk behaves as data grows.
- Efficiency compares risk against theoretical lower bounds.
- Sufficiency and completeness identify when risk cannot be improved.
Decision theory is not an optional interpretation: it is the mathematical spine of statistical inference.
Think of an estimator as a machine:
You feed it raw data → it outputs a guess about an unknown truth.
Just like a physical machine can be evaluated for accuracy and precision, an estimator can be evaluated using several fundamental criteria:
Bias
Question: Is the machine centered on the truth or consistently off-target?
- If it guesses too high on average → positive bias
- If it guesses too low on average → negative bias
Interpretation: Bias measures systematic error.
Variance
Question: How much do the machine's outputs fluctuate from run to run?
- Tight clustering → low variance
- Wildly different answers → high variance
Interpretation: Variance measures random instability.
Mean Squared Error (MSE)
Question: Overall, how wrong is the machine on average?
Interpretation: MSE combines systematic error and random error into one score (MSE = bias² + variance).
Consistency
Question: As we feed the machine more and more data, does it eventually lock onto the true value?
- If yes → consistent estimator
- If no → inconsistent estimator
Interpretation: Consistency is a long-run guarantee, not a finite-sample promise.
Efficiency
Question: Among all unbiased machines, does this one have the tightest grouping?
- If it achieves the smallest possible variance, it is efficient
Interpretation: Efficiency means no other unbiased estimator is more precise.
Sufficiency
Question: Does the machine extract all useful information from the data, or does it throw some away?
- If nothing is lost → sufficient
- If relevant information is discarded → insufficient
Interpretation: Sufficiency is about compressing the data without losing any information about the parameter.
A near-perfect estimation machine would be:
- Unbiased: centered on the truth
- Low variance: stable across samples
- Low MSE: small total error
- Consistent: converges with more data
- Efficient: best possible precision
- Sufficient: wastes no information
This is exactly what optimal statistical estimation aims to achieve.
Machine Learning Perspective
The concepts from classical estimation theory map directly onto modern machine learning. Understanding this connection helps you see that ML is applied decision theory.
| Statistical Concept | Estimator-Machine Meaning | Machine Learning Interpretation |
|---|---|---|
| Bias | Is the machine systematically off-target? | Underfitting: model too simple, misses true structure |
| Variance | How much do outputs fluctuate across samples? | Overfitting: model too sensitive to noise |
| MSE / Risk | Total error combining bias & variance | Generalization error on unseen test data |
| Consistency | Does the machine improve with more data? | Model converges as dataset grows |
| Efficiency | Among unbiased machines, is this the tightest? | Best possible accuracy for given data + model class |
| Sufficiency | Is any useful information being thrown away? | Feature bottleneck / information loss |
Training a neural network can be viewed as tuning an estimation machine to minimize an empirical estimate of decision-theoretic risk under a chosen loss.
High bias (underfitting):
- Model is too rigid
- Misses patterns in data
- Training error improves little
- High error on both train & test
High variance (overfitting):
- Model is too flexible
- Fits noise in training data
- Huge train-test gap
- Low train error, high test error
Balanced fit:
- Right model complexity
- Balanced bias + variance
- Minimum test error
- = Minimum MSE / Risk
Why This Matters for ML Engineers
Every hyperparameter you tune, every architecture choice you make, every regularization technique you apply: in each case you are navigating the bias-variance tradeoff. Decision theory gives you the mathematical foundation to understand why these techniques work.
Every estimator is a machine that turns data into decisions.
Every statistical property (bias, variance, MSE, consistency, efficiency, sufficiency) is just a different way of scoring how well that machine behaves under uncertainty.
What Is Decision Theory?
Intuitive Understanding
Imagine you're a doctor diagnosing a patient. You observe symptoms (data) but don't know the true disease (parameter). You must choose a treatment (action). Different actions have different consequences depending on the true disease.
The same logic applies to estimation:
- Data = your observations
- Unknown state = true parameter
- Action = your estimate
- Loss = how "wrong" your estimate is
Types of Statistical Problems
The information we extract from data takes different forms depending on our goals. Decision theory provides a unified framework for all of them:
Estimation
Goal: Produce "best guesses" of unknown parameters
Action space: All possible parameter values
Hypothesis testing
Goal: Decide if data supports a hypothesis or not
Action space: {Accept $H_0$, Reject $H_0$}
Ranking
Goal: Order items from best to worst
Action space: All possible orderings of items
Prediction
Goal: Forecast future observations given covariates
Action space: Predicted values
The Common Thread
In all cases, the analysis doesn't stop at specifying an estimate, test, ranking, or prediction. We must also evaluate how well our procedure performs. This requires criteria of performance, which is exactly what decision theory provides.
Why Decision Theory Matters
Given so many possible procedures (sample mean vs median, different test statistics, various models), how do we choose? Decision theory provides the framework to answer this systematically.
When: Before looking at data
Purpose: Study design, sample size determination
Question: "How well can the best procedure do?"
When: After data is collected
Purpose: Assess reliability of our estimate
Question: "How reliable is this particular estimate?"
The decision theoretic framework helps us answer four questions:
- What exactly are we trying to achieve? Estimation, testing, ranking, or prediction?
- What decisions can we make? What is the action space?
- How do we measure "how well" a procedure performs? What are the relevant metrics?
- Given objectives and performance criteria, which procedure should we use?
The Fundamental Question: In estimation we care how far off we are; in testing, what mistakes we've made; in ranking, which orderings are wrong. Decision theory gives us the mathematical language to express and minimize these errors.
Formal Framework
A statistical decision problem consists of three ingredients:
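Parameter space $\Theta$: The set of possible true parameter values. Examples: $\Theta = [0, 1]$ for probabilities, $\Theta = \mathbb{R}$ for means, $\Theta = (0, \infty)$ for variances.
Action space $\mathcal{A}$: The set of possible decisions. For point estimation, $\mathcal{A} = \Theta$ (we choose an estimate from the same space as the parameter).
Loss function $L(\theta, a)$: A function measuring the cost of taking action $a$ when the true state is $\theta$. Lower loss = better.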
Decision Procedures
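A decision procedure (or decision rule) is a function $\delta: \mathcal{X} \to \mathcal{A}$ that maps any possible data outcome to an action. When we observe $X = x$, we take action $\delta(x)$.
For estimating the population mean $\mu$ from data $X_1, \ldots, X_n$, two candidate rules are the sample mean $\delta_1(X) = \bar{X}$ and the sample median $\delta_2(X) = \operatorname{median}(X_1, \ldots, X_n)$.
Which is better? That depends on the loss function and the true distribution!
Testing $H_0$ vs $H_1$ with data from two groups: a typical rule rejects $H_0$ when a test statistic $T(X)$ exceeds a critical value $c$.
The critical value $c$ controls the tradeoff between Type I and Type II errors.
Given training data $(Z_1, Y_1), \ldots, (Z_n, Y_n)$, predict $Y$ for new $Z$ with a fitted function $\hat{g}(Z)$.
The decision rule is an entire function! The action space is infinite-dimensional.
Key Insight
The notation $\delta(X)$ emphasizes that our decision is a function of the data. We don't just pick a number: we specify a rule that tells us what to do for any possible dataset we might observe.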
The Decision Recipe
Define the problem: What are the possible states? What decisions can you take?
Choose a loss function:
- Squared error: large errors matter disproportionately
- Absolute error: need robustness to outliers
- Asymmetric: over/under-estimating have different costs
- 0-1 loss: classification or hypothesis testing
Compute the risk: Average the loss over the sampling distribution: $R(\theta, \delta) = E_\theta[L(\theta, \delta(X))]$
Choose an optimality criterion:
- Bayes: have prior? Minimize Bayes risk
- Minimax: protect against worst case
- Frequentist: evaluate pointwise risk
Loss Functions
The loss function is the heart of decision theory. It quantifies: "How bad is it to choose action a when the truth is θ?"
Common Loss Functions
| Loss Function | Formula | Properties | Use When... |
|---|---|---|---|
| Squared Error | $(\theta - a)^2$ | Differentiable, penalizes large errors heavily | Errors of similar magnitude, computational convenience |
| Absolute Error | $\lvert \theta - a \rvert$ | Robust to outliers, non-differentiable at 0 | Large errors shouldn't dominate |
| 0-1 Loss | 0 if $a = \theta$, 1 otherwise | Used for classification/testing | Only exact correctness matters |
| Asymmetric | $c_u (\theta - a)_+ + c_o (a - \theta)_+$ | Different costs for over/under-estimation | Consequences differ by direction of error |
| Huber Loss | $\tfrac{1}{2}(\theta - a)^2$ if $\lvert \theta - a \rvert \le k$, else $k \lvert \theta - a \rvert - \tfrac{1}{2}k^2$ | Combines benefits of squared and absolute | Robustness with differentiability |
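The losses in the table can be written as small functions. This is a minimal sketch, not a reference implementation: the parameter names `c_under`, `c_over`, and the Huber threshold `k` are illustrative assumptions, and the asymmetric loss is taken to be piecewise linear.

```python
def squared_error(theta, a):
    """Squared error: large errors are penalized quadratically."""
    return (theta - a) ** 2


def absolute_error(theta, a):
    """Absolute error: penalty grows linearly with the size of the error."""
    return abs(theta - a)


def zero_one(theta, a):
    """0-1 loss: no cost if exactly right, unit cost otherwise."""
    return 0.0 if a == theta else 1.0


def asymmetric(theta, a, c_under=1.0, c_over=1.0):
    """Piecewise-linear loss with separate costs (assumed names) for
    under-estimating (a < theta) and over-estimating (a > theta)."""
    return c_under * max(theta - a, 0.0) + c_over * max(a - theta, 0.0)


def huber(theta, a, k=1.0):
    """Huber loss: quadratic for small errors, linear beyond threshold k."""
    r = abs(theta - a)
    return 0.5 * r * r if r <= k else k * (r - 0.5 * k)


# Compare how each loss grows as the action moves away from theta = 0.
theta = 0.0
for a in (0.5, 1.0, 3.0):
    print(a, squared_error(theta, a), absolute_error(theta, a), huber(theta, a))
```

Running the loop shows the qualitative difference: squared error grows fastest for the largest error, while Huber tracks squared error near zero and absolute error further out.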
Advanced Loss Functions
Beyond the basic loss functions, several specialized losses arise in practice:
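When estimating a $d$-dimensional parameter $\theta = (\theta_1, \ldots, \theta_d)$ with estimate $a = (a_1, \ldots, a_d)$:
- Squared ($L_2$) loss: $L(\theta, a) = \sum_{j=1}^{d} (\theta_j - a_j)^2$. Most common choice; decomposes into sum of univariate losses
- Absolute ($L_1$) loss: $L(\theta, a) = \sum_{j=1}^{d} |\theta_j - a_j|$. Robust to outliers; leads to sparse solutions in regularization
- Supremum ($L_\infty$) loss: $L(\theta, a) = \max_{j} |\theta_j - a_j|$. Worst-case error across all components; minimax flavor
For prediction problems where the true function is $\mu(z)$ and our predictor is $a(z)$, a natural loss integrates the squared error against a distribution $Q$ over the covariates:
$$L(\mu, a) = \int \big(\mu(z) - a(z)\big)^2 \, dQ(z)$$
If $Q$ is the empirical distribution of the training covariates $z_1, \ldots, z_n$:
$$L(\mu, a) = \frac{1}{n} \sum_{i=1}^{n} \big(\mu(z_i) - a(z_i)\big)^2$$
This is the mean squared prediction error, the quantity that least-squares regression targets.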
Worked Example: The Newsvendor Problem
The newsvendor problem is a classic example of asymmetric loss in action. A vendor must decide how many newspapers to stock before knowing the day's demand.
Setup:
- Each newspaper costs $1 to buy and sells for $2
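- Understock cost (lost sale): c_u = $1 profit missed
- Overstock cost (unsold paper): c_o = $1 purchase price lost
The Optimal Quantile Formula: if demand has distribution function F, the optimal order quantity is q* = F⁻¹(c_u / (c_u + c_o)). Here c_u / (c_u + c_o) = 1 / (1 + 1) = 0.5.
Order the 50th percentile (median) of demand when costs are equal.
Now suppose stockouts are worse:
- Understock cost: c_u = $5 (angry customer, lost reputation)
- Overstock cost: c_o = $1 (just the paper cost)
The critical ratio becomes c_u / (c_u + c_o) = 5 / (5 + 1) ≈ 0.83.
Order the 83rd percentile of demand: stock more to avoid stockouts!
The General Principle
Under the asymmetric loss L(θ, a) = c_u · max(θ − a, 0) + c_o · max(a − θ, 0), the optimal estimate is the τ-quantile of the distribution of θ, where τ = c_u / (c_u + c_o).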
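The quantile rule can be checked numerically. The sketch below assumes per-unit understock and overstock costs (named `c_under` and `c_over` here) and a made-up demand history drawn uniformly between 50 and 150; neither the data nor the helper names come from the original example. It searches for the order quantity with the lowest average cost and compares it with the empirical quantile at the critical ratio `c_under / (c_under + c_over)`.

```python
import math
import random


def average_cost(order, demands, c_under, c_over):
    """Average newsvendor cost of stocking `order` units over past demands."""
    total = 0
    for d in demands:
        total += c_under * max(d - order, 0) + c_over * max(order - d, 0)
    return total / len(demands)


def best_order(demands, c_under, c_over):
    """Brute-force search over observed demand values for the cheapest order."""
    candidates = sorted(set(demands))
    return min(candidates, key=lambda q: average_cost(q, demands, c_under, c_over))


def empirical_quantile(demands, tau):
    """Smallest observed value whose empirical CDF is at least tau."""
    ordered = sorted(demands)
    index = max(0, math.ceil(tau * len(ordered)) - 1)
    return ordered[index]


random.seed(0)
# Hypothetical demand history (illustrative only).
demands = [random.randint(50, 150) for _ in range(2000)]

for c_under, c_over in [(1, 1), (5, 1)]:
    tau = c_under / (c_under + c_over)
    print(c_under, c_over, round(tau, 3),
          best_order(demands, c_under, c_over),
          empirical_quantile(demands, tau))
```

With costs of 5 and 1 the brute-force optimum coincides with the empirical 5/6 quantile, which is the "83rd percentile" rule in action.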
Loss Functions in ML Practice
The loss functions from decision theory appear throughout machine learning under different names:
| Decision Theory Loss | ML Training Loss | Eval Metric | Use Case |
|---|---|---|---|
| Squared Error | MSE Loss | RMSE, $R^2$ | Regression with normal errors |
| Absolute Error | L1 / MAE Loss | MAE, MedAE | Robust regression, sparse solutions |
| 0-1 Loss | Classification Error | Accuracy, Error Rate | Hard classification decisions |
| Log Loss | Cross-Entropy / NLL | Log Loss, Perplexity | Probabilistic classification |
| Huber Loss | Smooth L1 Loss | Huber metric | Object detection, robust regression |
| Asymmetric Loss | Quantile Loss / Pinball | Quantile coverage | Demand forecasting, prediction intervals |
Theory-Practice Connection
When you minimize cross-entropy loss in a neural network, you're computing a maximum likelihood estimate (MLE) under a categorical model. When you minimize MSE on training data, you're minimizing an empirical estimate of expected squared error loss. The frameworks connect!
Choosing a Loss Function
Key Insight
The choice of loss function determines the optimal estimator! Under squared error loss, the Bayes estimator is the posterior mean. Under absolute error loss, it's the posterior median.
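A quick numerical check of this claim: the sketch below draws samples from a hypothetical Beta(3, 9) posterior (chosen only for illustration), then searches a grid of candidate actions for the one with the smallest average loss. Under squared error the winner should sit next to the sample mean, and under absolute error next to the sample median, up to the grid spacing.

```python
import random
import statistics

random.seed(0)
# Draws standing in for a posterior distribution (hypothetical Beta(3, 9)).
draws = [random.betavariate(3, 9) for _ in range(2000)]

# Candidate actions on a grid with spacing 0.005.
grid = [i / 200 for i in range(201)]


def expected_loss(action, loss):
    """Monte Carlo estimate of the posterior expected loss of `action`."""
    return sum(loss(theta, action) for theta in draws) / len(draws)


best_sq = min(grid, key=lambda a: expected_loss(a, lambda t, x: (t - x) ** 2))
best_abs = min(grid, key=lambda a: expected_loss(a, lambda t, x: abs(t - x)))

print("squared error :", best_sq, "vs mean  ", statistics.mean(draws))
print("absolute error:", best_abs, "vs median", statistics.median(draws))
```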
Risk Functions
The loss L(θ, δ(X)) is random because it depends on the random data X. We need a way to summarize the "typical" or "expected" loss. This is the risk function.
Let's start from the most basic problem:
You choose an estimator δ.
You plug in your observed data x.
You get one number δ(x).
But here's the problem:
- That number is produced by random data.
- If you repeated the experiment, you would get a different dataset.
- That means your estimator would output a different value every time.
So now ask this:
"How do I judge whether my estimator is good, if every repetition gives a different error?"
You cannot judge an estimator by a single outcome.
You must judge it by its long-run behavior under randomness.
That is exactly what the risk function is:
The risk function R(θ, δ) is the average long-run penalty your decision rule δ will pay if the true parameter is θ.
It converts:
- Random loss → deterministic performance curve
- One noisy outcome → a stable performance guarantee
Without risk, you cannot compare estimators scientifically.
The risk function answers this fundamental question:
"If the true world were θ, how painful would it be to use this estimator forever?"
So risk tells you:
| What You Want | What Risk Tells You |
|---|---|
| Accuracy | Average closeness to truth |
| Reliability | How stable performance is |
| Robustness | Sensitivity to randomness |
| Safety | Expected damage |
| Optimality | Whether another rule does better |
Risk turns intuition into a measurable object.
The risk function is the bridge between mathematical uncertainty and real-world consequence.
Frequentist Risk
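The frequentist risk (or simply "risk") averages the loss over the sampling distribution of the data:
$$R(\theta, \delta) = E_\theta\big[L(\theta, \delta(X))\big]$$
Interpretation: "If θ is the true parameter, what's my expected loss if I use decision rule δ repeatedly?"
For squared error loss, the risk has a special name, the mean squared error (MSE):
$$\mathrm{MSE}_\theta(\delta) = E_\theta\big[(\delta(X) - \theta)^2\big]$$
The MSE depends on two fundamental quantities, the bias and the variance:
$$\mathrm{MSE}_\theta(\delta) = \big(E_\theta[\delta(X)] - \theta\big)^2 + \mathrm{Var}_\theta\big(\delta(X)\big) = \mathrm{Bias}^2 + \mathrm{Variance}$$
Preview: This decomposition is the central result. We'll explore the bias-variance tradeoff in Section 2.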
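The decomposition of MSE into squared bias plus variance can be verified by simulation. The sketch below uses a deliberately biased estimator, a sample mean shrunk by a factor of 0.8; this estimator and all numerical settings are illustrative assumptions rather than anything prescribed by the text.

```python
import random
import statistics

random.seed(0)
theta, sigma, n, reps = 2.0, 1.0, 10, 5000


def shrunken_mean(sample, c=0.8):
    """A biased estimator (assumed for illustration): shrink the mean toward 0."""
    return c * statistics.fmean(sample)


# Apply the estimator to many independent datasets drawn at the true theta.
estimates = [
    shrunken_mean([random.gauss(theta, sigma) for _ in range(n)])
    for _ in range(reps)
]

mse = statistics.fmean((e - theta) ** 2 for e in estimates)
bias = statistics.fmean(estimates) - theta
variance = statistics.pvariance(estimates)

print("MSE              :", mse)
print("bias^2 + variance:", bias ** 2 + variance)
```

The two printed numbers agree up to floating-point error, because the identity holds exactly for the empirical distribution of the simulated estimates.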
Decision Theory Insight
Under squared error loss and Normal data, the sample mean has lower risk than the median. Decision theory helps us make this comparison rigorous.
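A Monte Carlo sketch of this comparison, assuming standard Normal noise, a sample size of 25, and 4000 replications (all illustrative choices): the risk of each estimator is approximated by its average squared error across simulated datasets.

```python
import random
import statistics


def monte_carlo_risk(estimator, theta, n, reps, rng):
    """Approximate E[(estimator(X) - theta)^2] for N(theta, 1) samples of size n."""
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += (estimator(sample) - theta) ** 2
    return total / reps


rng = random.Random(0)
risk_mean = monte_carlo_risk(statistics.fmean, theta=0.0, n=25, reps=4000, rng=rng)
risk_median = monte_carlo_risk(statistics.median, theta=0.0, n=25, reps=4000, rng=rng)

print("risk of sample mean  :", risk_mean)    # theory: 1/25 = 0.04
print("risk of sample median:", risk_median)  # larger under Normal data
```

The same harness can be pointed at heavy-tailed data to see the ranking change, which is the sense in which the better estimator depends on the true distribution.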
Bayes Risk and Minimax Risk: Two Ways to Choose an Optimal Decision Rule
The frequentist risk
$$R(\theta, \delta) = E_\theta\big[L(\theta, \delta(X))\big]$$
is a function of the unknown true parameter $\theta$. Since $\theta$ is not known in practice, the risk alone does not immediately tell us how to select one estimator over another. This leads to a fundamental question:
How should we choose an optimal decision rule when performance depends on an unknown truth?
Decision theory provides two principled answers to this question: the Bayes approach and the minimax approach.
In the Bayesian framework, the parameter $\theta$ is treated as a random variable with prior distribution $\pi$. The performance of a decision rule $\delta$ is measured by its Bayes risk:
$$r(\pi, \delta) = E_\pi\big[R(\theta, \delta)\big] = \int R(\theta, \delta)\, \pi(\theta)\, d\theta$$
A Bayes decision rule is defined as:
$$\delta_\pi = \arg\min_\delta r(\pi, \delta)$$
Bayes risk is the expected long-run loss, averaged over our prior beliefs about which world is likely to be true.
Thus, Bayes optimality is an average-case optimality, weighted by subjective or empirical beliefs.
In the minimax framework, no prior distribution is assumed. Instead, Nature is treated as an adversary who may choose the worst possible value of $\theta$. The relevant performance measure is the maximal risk:
$$\bar{R}(\delta) = \sup_{\theta \in \Theta} R(\theta, \delta)$$
A minimax decision rule is defined as:
$$\delta^* = \arg\min_\delta \sup_{\theta \in \Theta} R(\theta, \delta)$$
Minimax risk measures how bad things could get in the worst possible world. The minimax rule is the one with the smallest guaranteed maximum damage.
This is a worst-case optimality principle.
Both Bayes and minimax arise from a single abstract principle:
$$\delta^{\mathrm{opt}} = \arg\min_\delta \Phi\big(R(\cdot, \delta)\big)$$
where the functional $\Phi$ is either:
- an expectation over $\theta$ (Bayes), or
- a supremum over $\theta$ (minimax).
Thus, Bayes and minimax are not competing theories: they are two different ways of aggregating the same underlying risk function.
Bayes: minimize average risk under a belief
Minimax: minimize worst possible risk under uncertainty
Bayes risk teaches us how to act intelligently when we believe something about the world.
Minimax risk teaches us how to act safely when we don't trust the world at all.
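Formal Comparison:
| Bayes Approach | Minimax Approach |
|---|---|
| Put a prior distribution $\pi(\theta)$ on the parameter and minimize the Bayes risk: $r(\delta) = E_\pi[R(\theta, \delta)]$ | Minimize the worst-case risk over all $\theta$: $\sup_\theta R(\theta, \delta)$ |
| Average risk over your prior beliefs about $\theta$. | Prepare for the adversarial scenario where nature picks the worst $\theta$. |
| Formal Definition | Formal Definition |
| In the Bayesian framework, $\theta$ is random. The Bayes risk is: $r(\delta) = E[R(\theta, \delta)] = E[L(\theta, \delta(X))]$ (expectation form with loss function) | We prefer $\delta$ to $\delta'$ iff: $\sup_\theta R(\theta, \delta) < \sup_\theta R(\theta, \delta')$ |
| For discrete $\theta$: $r(\delta) = \sum_\theta R(\theta, \delta)\, \pi(\theta)$. For continuous $\theta$: $r(\delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta$ | A procedure $\delta^*$ is minimax if: $\sup_\theta R(\theta, \delta^*) = \inf_\delta \sup_\theta R(\theta, \delta)$. That is, $\delta^*$ minimizes the maximum risk. |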
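Suppose an expert has beliefs about the chance of finding oil, where the unknown state $\theta$ can take two values: $\theta_1$ (low yield) or $\theta_2$ (high yield). The expert assigns prior probabilities $\pi(\theta_1)$ and $\pi(\theta_2)$, with $\pi(\theta_1) + \pi(\theta_2) = 1$.
The Bayes risk of any procedure $\delta$ is:
$$r(\delta) = \pi(\theta_1)\, R(\theta_1, \delta) + \pi(\theta_2)\, R(\theta_2, \delta)$$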
Consider 9 possible decision procedures. Their risks are:
| Procedure | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Bayes risk | 9.6 | 7.48 | 8.38 | 4.92 | 2.8 | 3.7 | 7.02 | 4.9 | 5.8 |
| Max risk | 12 | 7.6 | 9.6 | 5.4 | 10 | 6.5 | 8.4 | 8.5 | 6 |
$\delta_5$ has minimum Bayes risk $r(\delta_5) = 2.8$. This is the unique Bayes rule for this prior.
$\delta_4$ has minimum max-risk = 5.4. This is the minimax rule among these nine procedures.
Key Observation
The Bayes rule ($\delta_5$) and minimax rule ($\delta_4$) are different! The Bayes rule optimizes average performance under the prior, while minimax protects against the worst case.
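Selecting the two rules from the table is a pair of argmin operations. The sketch below copies the table's numbers and treats the unlabeled second row as each procedure's maximum risk, an assumption that matches the stated minimax value of 5.4; the dictionary names are illustrative.

```python
# Risks for procedures 1..9, copied from the table above.
bayes_risk = {1: 9.6, 2: 7.48, 3: 8.38, 4: 4.92, 5: 2.8,
              6: 3.7, 7: 7.02, 8: 4.9, 9: 5.8}
# Assumed to be the maximum risk over theta for each procedure.
max_risk = {1: 12, 2: 7.6, 3: 9.6, 4: 5.4, 5: 10,
            6: 6.5, 7: 8.4, 8: 8.5, 9: 6}

# Bayes rule: smallest prior-averaged risk.
bayes_rule = min(bayes_risk, key=bayes_risk.get)
# Minimax rule: smallest worst-case risk.
minimax_rule = min(max_risk, key=max_risk.get)

print("Bayes rule  : procedure", bayes_rule, "with Bayes risk", bayes_risk[bayes_rule])
print("Minimax rule: procedure", minimax_rule, "with max risk", max_risk[minimax_rule])
```

Note that the Bayes-optimal procedure has a comparatively large worst-case risk in the table, which is why the two criteria disagree.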
The minimax criterion comes from two-person zero-sum game theory (von Neumann):
Nature: Picks $\theta$ (possibly adversarially)
Statistician: Picks $\delta$ (decision procedure)
The statistician "pays" Nature the risk $R(\theta, \delta)$. The maximum risk of a minimax rule $\delta^*$ is the upper pure value of the game.
Minimax is Very Conservative
This criterion aims to give maximum protection against the worst that can happen: Nature choosing a $\theta$ that makes the risk as large as possible.
The principle is compelling if you believe the parameter is being chosen by a malevolent opponent who knows your decision procedure. However, most statisticians find minimax too conservative as a general rule, though it can lead to very reasonable procedures in adversarial or safety-critical settings.
A key insight from game theory: randomizing between procedures can reduce maximum risk!
Example: In the oil drilling problem, suppose we flip a fair coin and use $\delta_4$ if heads, $\delta_6$ if tails. The expected risk of this randomized procedure $\tilde{\delta}$ is:
$$R(\theta, \tilde{\delta}) = \tfrac{1}{2} R(\theta, \delta_4) + \tfrac{1}{2} R(\theta, \delta_6)$$
The maximum risk is now 4.75, which is lower than the minimax value of 5.4 achieved by $\delta_4$ alone!
Practical Implication
When computing minimax procedures, we should consider randomized rules (mixed strategies), not just deterministic ones. The minimax theorem guarantees that under suitable conditions, there exists a (possibly randomized) minimax procedure.
The Bayes estimator under squared error loss is the posterior mean, which balances prior belief against the observed data.
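We say procedure $\delta$ improves $\delta'$ if:
$$R(\theta, \delta) \le R(\theta, \delta') \ \text{for all } \theta, \quad \text{with strict inequality for some } \theta$$
Key insight: There is typically no single rule that improves all others!
Example: Estimating $\theta$ from data $X$ under squared error loss
- Consider the "absurd rule" $\delta_0(X) = \theta_0$ (ignore data entirely)
- Its MSE is $(\theta - \theta_0)^2$
- At $\theta = \theta_0$ its risk is 0. No rule can do better there, because another rule $\delta'$ has $R(\theta_0, \delta') = 0$ only if $\delta'(X) = \theta_0$ with probability 1 when $\theta = \theta_0$
Even terrible rules can be unbeatable at some θ values!
A rule $\delta$ is inadmissible if there exists another rule $\delta'$ that improves it.
Why use δ when δ' is never worse and sometimes better?
A rule is admissible if no rule improves it (i.e., it's not inadmissible).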
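To make the point concrete, the sketch below compares the risk of a constant guess with the risk of the sample mean under squared error loss. It assumes independent observations with known variance `sigma**2` (so the sample mean has risk `sigma**2 / n`) and a fixed guess `theta0`; these modeling choices and names are illustrative and not taken from the example above.

```python
def risk_constant(theta, theta0):
    """MSE of the rule that always answers theta0, ignoring the data."""
    return (theta - theta0) ** 2


def risk_sample_mean(sigma, n):
    """MSE of the sample mean of n observations with variance sigma^2."""
    return sigma ** 2 / n


theta0, sigma, n = 0.0, 1.0, 25
for theta in (0.0, 0.1, 0.3, 0.5, 1.0):
    r_const = risk_constant(theta, theta0)
    r_mean = risk_sample_mean(sigma, n)
    winner = "constant" if r_const < r_mean else "sample mean"
    print(f"theta={theta:.1f}  constant={r_const:.3f}  mean={r_mean:.3f}  -> {winner}")
```

The constant rule wins only in a narrow band around `theta0`, so neither rule improves the other everywhere.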
Admissible rules are "Pareto optimal" in risk space.
Practical Implication
We should restrict attention to admissible procedures: inadmissible ones are dominated and should never be used. But among admissible procedures, we still need Bayes or minimax criteria to choose!
Frequentist vs Bayesian Approach
Understanding the philosophical and practical differences between frequentist and Bayesian approaches is fundamental to mastering statistical inference. Let's explore these two paradigms side by side.
The Core Intuition (One-Line Difference)
Frequentist: "The parameter is a fixed but unknown truth. Only the data is random."
Bayesian: "The parameter itself is uncertain. I represent my uncertainty with a probability distribution."
That single difference changes everything.
How Each One Sees "Truth"
Frequentist:
- There is one true value of the parameter.
- Example: The true mean height of adult men in the US is one fixed number.
- You just don't know it.
- Your estimator is judged by:
- What happens over imaginary repeated experiments
You never say:
"The probability that μ = 172.3 is 0.7" ✗
Because μ is not random to a frequentist.
Bayesian:
- The parameter is unknown AND treated as random.
- You express your uncertainty as a distribution.
- Example: Before data: "I believe μ is around 170-175 with high probability."
After data: "Now I believe μ is tightly around 173."
You do say:
"There is a 95% probability that μ lies between 172 and 174." ✓
That sentence is illegal in frequentist statistics, but natural in Bayesian inference.
How Each One Treats Uncertainty
| Question | Frequentist | Bayesian |
|---|---|---|
| What is random? | The data | The data + parameters |
| What is fixed? | The parameter | The observed data (once conditioned on) |
| What does probability mean? | Long-run frequency | Degree of belief |
| What is confidence? | Coverage over repeated samples | Direct probability of truth |
Critical Distinction: Coverage ≠ Probability of θ
Confidence Intervals (Frequentist):
Guarantee long-run coverage: "95% of intervals constructed this way will contain θ." The interval is random; θ is fixed.
Credible Intervals (Bayesian):
Give probability of θ conditional on data: "P(θ ∈ interval | data) = 0.95." The interval is fixed (given data); θ is random.
Warning: Interpreting a CI as "the probability that θ is in the interval" is a common but incorrect interpretation. Only Bayesian credible intervals allow this interpretation, at the cost of requiring a prior.
How Decisions Are Made
- Assume θ is fixed.
- Assume repeated sampling.
- Choose procedure with:
- Low MSE
- Correct confidence interval coverage
- Controlled Type-I error
Key idea:
"If I repeated this experiment 1 million times, my method would behave correctly."
- Start with a prior belief about θ.
- Collect data.
- Update belief using Bayes' theorem.
- Choose action that minimizes posterior expected loss.
Key idea:
"Given what I know right now, what is the best decision?"
| Paradigm | Risk Definition | What It Minimizes | Typical Criteria |
|---|---|---|---|
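| Frequentist | $R(\theta, \delta) = E_\theta[L(\theta, \delta(X))]$ | Pointwise risk for each $\theta$ | MSE, Type-I/II error, CI coverage |
| Bayes | $r(\pi, \delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta$ | Average risk under prior $\pi$ | Posterior expected loss |
| Minimax | $\sup_\theta R(\theta, \delta)$ | Worst-case risk over all $\theta$ | Robust to adversarial $\theta$ |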
All three paradigms are valid decision-theoretic frameworksโthey just optimize different objectives.
Real-World Examples
You want to estimate defect probability θ. Data: n = 100 samples, x = 3 defects observed.
Frequentist analysis:
- Point estimate: θ̂ = x/n = 3/100 = 0.03
- 95% CI (Wald): θ̂ ± 1.96 · sqrt(θ̂(1 − θ̂)/n) = 0.03 ± 0.033
- Result: (0.000, 0.063) after truncating the lower limit at 0, or use exact Clopper-Pearson: (0.006, 0.085)
Legal interpretation: "If I repeated this sampling procedure infinitely, 95% of such intervals would contain the true θ."
✗ Cannot say: "There's a 95% probability θ is in this interval."
Bayesian analysis:
- Prior: θ ~ Beta(2, 50) (prior mean ≈ 0.038, encodes "around 4%")
- Likelihood: x | θ ~ Binomial(100, θ)
- Posterior: θ | x ~ Beta(2 + 3, 50 + 97) = Beta(5, 147)
- 95% Credible Interval: (0.011, 0.063)
Legal interpretation: "Given the data and my prior, there is a 95% probability that θ lies in (0.011, 0.063)."
✓ Can make direct probability statements about θ.
| Interval Type | 95% Interval | What It Means |
|---|---|---|
| Frequentist CI (Clopper-Pearson) | (0.006, 0.085) | 95% of such intervals cover true ฮธ in repeated sampling |
| Bayesian Credible (Beta(2,50) prior) | (0.011, 0.063) | P(θ ∈ interval \| data) = 0.95 |
Note: The intervals differ because the Bayesian incorporates prior information (pulling toward ~4%), while the frequentist uses only the data.
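Both analyses of the defect data (n = 100, x = 3) can be reproduced approximately with the standard library. This is a sketch under stated assumptions: it uses the Beta(2, 50) prior named in the comparison table, the usual conjugate Beta-Binomial update, and a Monte Carlo approximation of an equal-tailed credible interval, so the simulated endpoints need not match the quoted interval exactly. The variable names are illustrative.

```python
import math
import random

n, x = 100, 3

# Frequentist: Wald interval around the sample proportion, truncated to [0, 1].
p_hat = x / n
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
wald = (max(0.0, p_hat - half_width), min(1.0, p_hat + half_width))

# Bayesian: conjugate update Beta(a0, b0) -> Beta(a0 + x, b0 + n - x).
a0, b0 = 2, 50
a_post, b_post = a0 + x, b0 + n - x
posterior_mean = a_post / (a_post + b_post)

# Approximate an equal-tailed 95% credible interval from posterior draws.
random.seed(1)
draws = sorted(random.betavariate(a_post, b_post) for _ in range(100_000))
credible = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))])

print("Wald 95% CI          :", tuple(round(v, 3) for v in wald))
print("Posterior            : Beta(%d, %d)" % (a_post, b_post))
print("Posterior mean       :", round(posterior_mean, 4))
print("95% credible interval:", tuple(round(v, 3) for v in credible))
```

Swapping in different values of `a0` and `b0` is a simple way to run the prior sensitivity comparison by hand.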
Same data (n=100, x=3), different priors โ different posteriors:
| Prior | Prior Belief | Posterior | 95% Credible Interval |
|---|---|---|---|
| | Flat/uninformative | | (0.011, 0.074) |
| Beta(2, 50) | Weakly informative (~4%) | Beta(5, 147) | (0.011, 0.063) |
| | Strong prior (~5%) | | (0.024, 0.068) |
Key insight: Strong priors dominate small samples. With n=100, the strong prior pulls the interval away from the MLE (0.03). This is a feature when prior knowledge is reliable, but a bug when the prior is misspecified.
Frequentist approach:
- Hypothesis test:
- $H_0$: No effect
- $H_1$: Drug helps
- If p < 0.05 → approve
This controls:
"How often would I wrongly approve useless drugs if I repeated trials forever?"
Bayesian approach:
- Already knows:
- Similar drugs
- Biological constraints
- Uses prior.
- After trial: "There is an 87% probability this drug reduces mortality by at least 10%."
This answers:
"What should I do today, given all information?"
That's decision-theoretic optimality.
- Fit model.
- Report:
- Test accuracy
- Confidence intervals via bootstrapping
- Hypothesis tests on coefficients
Used in:
- Classical statistics
- Regulatory environments
- Scientific publishing
- Model weights have distributions.
- Predictions have credible intervals.
- Uncertainty-aware outputs: "There is a 92% probability that this patient has disease."
Used in:
- Medical AI
- Robotics
- Reinforcement learning
- Active learning
- Safety-critical AI
When Should You Use Which? (Practical Rulebook)
Use frequentist methods when:
You want:
- Hypothesis testing
- p-values
- Long-run guarantees
- Regulatory approval
- Scientific reproducibility
You believe:
- "Truth is fixed"
- "I don't want to specify a prior"
- "Only data should speak"
Examples:
- FDA drug trials
- Manufacturing quality control
- Academic hypothesis testing
Use Bayesian methods when:
You want:
- Direct probability statements about parameters
- Optimal decisions under uncertainty
- Uncertainty-aware AI
- Small-data problems
- Sequential learning
You believe:
- "Prior knowledge matters"
- "Uncertainty itself should be modeled"
Examples:
- Medical diagnosis
- Autonomous systems
- Financial risk modeling
- Reinforcement learning
- LLM uncertainty estimation
The Deep Unification Insight (Advanced)
Here is the truth most PhD students miss:
- Frequentist methods optimize worst-case or pointwise risk.
- Bayesian methods optimize average risk under a prior.
They are both solving the same decision-theoretic problem, just with different ways of handling uncertainty.
In fact:
- Ridge regression = Bayesian MAP under Gaussian prior
- LASSO = Bayesian MAP under Laplace prior
- Dropout can be interpreted as approximate Bayesian inference
- Ensemble methods can be read as a frequentist approximation to predictive uncertainty
Final One-Sentence Summary
Frequentists trust repeated experiments.
Bayesians trust probability as a language of belief.
Engineers and AI systems increasingly rely on Bayesian reasoning when decisions matter in real time under uncertainty.
Honest Trade-Offs
- Small samples: Asymptotic guarantees may not hold; coverage can be poor.
- No prior info: Cannot easily incorporate domain knowledge.
- No direct probability on θ: CIs answer "what would happen in repeated samples?" not "where is θ?"
- Multiple comparisons: Requires careful correction (Bonferroni, FDR).
- Prior sensitivity: Results depend on prior choice, especially with small n.
- Computation: Exact posteriors often intractable; requires MCMC/VI.
- Subjectivity criticism: Two analysts with different priors get different answers.
- Improper priors: Must verify posterior is proper (integrates to 1).
When Priors Are Hard to Specify
Not sure what prior to use? Several approaches can help:
Objective (default) priors: Jeffreys prior and reference priors are designed to be "non-informative" or "minimally informative." Let the data dominate.
Empirical Bayes: Estimate hyperparameters from the data itself. Common in hierarchical models. "Let the data inform the prior."
Improper priors: Some "priors" don't integrate to 1 (e.g., a flat prior over the whole real line). Must check posterior propriety!
Prior Sensitivity Analysis
Always run your analysis with multiple priors (informative, weakly informative, diffuse). If conclusions change dramatically, your inference is prior-dependent: get more data or be explicit about prior assumptions.
Computation in Practice
- Exact methods: t-tests, F-tests, exact binomial CIs
- Asymptotics: z-tests, Wald intervals, likelihood ratio tests
- Resampling: Bootstrap for CIs, permutation tests
Usually fast; well-supported in standard software.
- Conjugate priors: Closed-form posteriors (Beta-Binomial, Normal-Normal)
- MCMC: Stan, PyMC, JAGS (sample from the posterior)
- Variational Inference: Fast approximations (mean-field, ADVI)
- Deep ensembles: Approximate uncertainty in neural networks
Can be slow; requires convergence diagnostics.
Calibration: Checking Your Methods
Simulation studies: Generate data from known θ, check if your 95% CI actually covers θ in ~95% of simulations. Test under misspecification to assess robustness.
Posterior predictive checks: Simulate data from the posterior and compare to observed data. If the posterior can't reproduce key features of your data, the model (or prior) is misspecified.
Both paradigms need validation
Neither frequentist nor Bayesian methods are "automatic." Both require checking assumptions: model correctness, prior reasonableness, convergence (for MCMC), and coverage/calibration.
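The simulation-study check described above can be sketched in a few lines. This example assumes Normal data with known standard deviation and a z-interval for the mean; the function name `coverage` and all settings are illustrative. It counts how often the interval contains the true value across repeated simulated datasets.

```python
import random
from statistics import NormalDist, fmean


def coverage(theta, sigma, n, reps, level=0.95, seed=0):
    """Fraction of simulated z-intervals for the mean that contain theta."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(reps):
        xbar = fmean(rng.gauss(theta, sigma) for _ in range(n))
        if xbar - half_width <= theta <= xbar + half_width:
            hits += 1
    return hits / reps


print("empirical coverage:", coverage(theta=1.0, sigma=2.0, n=20, reps=5000))
```

To probe robustness, one could generate the data from a different distribution than the interval assumes and watch whether the empirical coverage drifts away from the nominal level.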
Paradigm Comparison: When to Use What
| Paradigm | Optimizes | Pros | Cons | Use When... |
|---|---|---|---|---|
| Frequentist | $R(\theta, \delta)$ for each $\theta$ | No prior needed; objective; well-understood theory | No single "best" if $R$ varies with $\theta$; can't combine info | Regulatory settings; need $\theta$-specific guarantees |
| Bayes | $r(\pi, \delta) = E_\pi[R(\theta, \delta)]$ | Coherent decisions; incorporates prior; single optimal | Requires prior; sensitive to prior choice; computation | Have prior info; want probabilistic statements |
| Minimax | $\sup_\theta R(\theta, \delta)$ | Robust to worst case; no prior needed | Conservative; may be too pessimistic; hard to compute | Adversarial settings; safety-critical applications |
Practical Guidance
In practice, most ML uses frequentist evaluation (test set metrics) but Bayes-like reasoning (regularization = prior, ensembles = posterior averaging). Minimax appears in robust optimization and adversarial training.
Prediction
Prediction is not about learning a number: it is about learning how randomness will unfold in the future.
In many real-world problems, we observe a vector of covariates $Z$ and wish to predict an unseen response $Y$. This prediction task arises throughout science and engineering.
We assume the joint distribution of $(Z, Y)$ is known (or estimated from data). Our goal is to find a predictor $g(Z)$ that is as close as possible to the true future outcome $Y$.
Prediction fits exactly into the decision-theoretic framework:
- State of nature: The joint distribution of $(Z, Y)$
- Action: A function $g$ that maps $Z \mapsto g(Z)$
- Loss function: A penalty measuring prediction error
- Risk: Expected prediction error
A natural measure of prediction quality is the squared error: $(Y - g(Z))^2$. Since the future outcome is random, we measure performance using the mean squared prediction error:
$$\mathrm{MSPE}(g) = E\big[(Y - g(Z))^2\big]$$
This is the prediction analogue of MSE in estimation.
Among all possible predictors, the function that minimizes MSPE is:
$$g^*(z) = E[Y \mid Z = z]$$
Interpretation: The optimal predictor under squared error loss is the conditional mean of Y given Z.
This theorem is the mathematical foundation of:
- Linear regression
- Neural network regression
- Gaussian processes
- Deep learning with squared loss
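Under squared loss, prediction is identical to Bayesian decision-making: for each observed value $z$, the optimal prediction minimizes the conditional expected loss,
$$g^*(z) = \arg\min_a E\big[(Y - a)^2 \mid Z = z\big] = E[Y \mid Z = z]$$
Thus:
- Prediction = posterior Bayes decision
- MSPE = posterior expected risk, averaged over $Z$
This equivalence is why a model trained with MSE loss can be read as approximating the conditional mean, which is the Bayes-optimal predictor under squared error.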
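We may search over all predictors $g$, or over a restricted class $\mathcal{G}$ of predictors.
Restricting to linear functions $g(Z) = a + b^\top Z$ leads to linear regression.
Restricting to neural networks leads to deep learning.
Supervised machine learning with squared loss = empirical MSPE minimization over a restricted predictor class.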
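The effect of restricting the predictor class can be sketched with a small simulation. The toy model below (covariate uniform on [-2, 2], conditional mean Z + Z², noise standard deviation 0.5) is an assumption made for illustration, and the variable names are hypothetical. A linear predictor is fitted by least squares, which is empirical MSPE minimization over the linear class, and compared on fresh data with a constant predictor and with the true conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Draw (Z, Y) with E[Y | Z] = Z + Z^2 and noise variance 0.25 (toy model)."""
    z = rng.uniform(-2.0, 2.0, size=n)
    y = z + z**2 + rng.normal(0.0, 0.5, size=n)
    return z, y

z_train, y_train = simulate(2000)
z_test, y_test = simulate(5000)

# Restricted class: linear predictors g(z) = a + b*z, fitted by least squares,
# i.e. by minimizing the empirical MSPE on the training sample.
design = np.column_stack([np.ones_like(z_train), z_train])
(a_hat, b_hat), *_ = np.linalg.lstsq(design, y_train, rcond=None)

def mspe(pred):
    """Empirical mean squared prediction error on the held-out sample."""
    return float(np.mean((y_test - pred) ** 2))

mspe_constant = mspe(np.full_like(y_test, y_train.mean()))  # class of constants
mspe_linear = mspe(a_hat + b_hat * z_test)                  # class of linear functions
mspe_oracle = mspe(z_test + z_test**2)                      # true conditional mean

print(f"constant: {mspe_constant:.3f}")
print(f"linear:   {mspe_linear:.3f}")
print(f"oracle:   {mspe_oracle:.3f}  (noise variance is 0.25)")
```

Richer classes get closer to the conditional mean, but even the oracle predictor is left with the noise variance.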
| | Estimation | Prediction |
|---|---|---|
| Goal | Learn a fixed unknown parameter $\theta$ | Predict a future random outcome $Y$ |
| Randomness | Only the data is random | The future outcome is random even if $\theta$ is known |
| Target | A constant | A random variable |
| Error $\to 0$ as $n \to \infty$? | Yes (for good estimators) | No: irreducible noise remains |
Estimation uncertainty can vanish. Prediction uncertainty never vanishes.
This irreducible error corresponds to what machine learning calls the Bayes error: the risk of the best possible predictor.
Estimation goal: Learn a fixed unknown quantity θ
"What is the true population mean?"
θ is fixed; only our uncertainty about it changes with data.
Prediction goal: Predict a future random observation $X_{\text{new}}$
"What will the next customer spend?"
$X_{\text{new}}$ is random even if we knew θ perfectly!
Prediction vs Estimation
Interactive: Estimation vs Prediction
See the key difference: estimating the population mean vs predicting a new observation.
The prediction interval is always wider than the confidence interval! Why? Prediction uncertainty = estimation uncertainty + inherent variability of new observation.
The key insight: prediction uncertainty has two sources:
- Estimation uncertainty: We don't know μ exactly
- Inherent randomness: Even if we knew μ, $X_{\text{new}}$ is random
Common Confusion
A confidence interval for μ and a prediction interval for $X_{\text{new}}$ look similar but mean different things:
- 95% CI: "We're 95% confident the TRUE MEAN lies in this interval"
- 95% PI: "We're 95% confident the NEXT OBSERVATION lies in this interval"
The prediction interval is always wider because it accounts for individual variability.
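The two sources of uncertainty can be tabulated directly. This sketch assumes Normal data with a known noise standard deviation σ = 1, so that both intervals use the same normal critical value; with unknown σ a t critical value would replace it. The confidence-interval half-width is z·σ/√n and the prediction-interval half-width is z·σ·√(1 + 1/n).

```python
import numpy as np
from statistics import NormalDist

sigma = 1.0                          # assumed known noise standard deviation
z = NormalDist().inv_cdf(0.975)      # two-sided 95% normal critical value
sample_sizes = [5, 50, 500, 5000]

# CI half-width reflects estimation uncertainty only.
ci_half = [z * sigma / np.sqrt(n) for n in sample_sizes]
# PI half-width adds the variance of the new observation itself.
pi_half = [z * sigma * np.sqrt(1 + 1 / n) for n in sample_sizes]

for n, ci, pi in zip(sample_sizes, ci_half, pi_half):
    print(f"n={n:5d}  CI half-width={ci:.3f}  PI half-width={pi:.3f}")
```

As n grows the confidence interval collapses toward zero width, while the prediction interval levels off near z·σ.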
Predictive Distributions
In many applications, our goal is not merely to produce a single numerical prediction, but to characterize the full uncertainty of a future outcome. The object that encodes this uncertainty is the predictive distribution.
While a point predictor answers:
"What value do I guess will occur?"
the predictive distribution answers the more fundamental question:
"What range of outcomes could occur, and with what probabilities?"
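In the frequentist framework, the model is:
$$X_1, \dots, X_n, X_{\text{new}} \overset{\text{iid}}{\sim} p(x \mid \theta)$$
and the unknown parameter $\theta$ is first estimated by $\hat\theta$. The plug-in predictive distribution is then:
$$p(x_{\text{new}} \mid \hat\theta)$$
Interpretation: The plug-in predictive treats the estimated parameter as if it were the true parameter and ignores uncertainty in $\hat\theta$.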
Thus, it accounts only for observation noise, but not parameter uncertainty. This makes plug-in predictions:
- Sharp
- Optimistic
- Potentially overconfident in small samples
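In the Bayesian framework, $\theta$ is a random variable with posterior distribution $\pi(\theta \mid x)$. The posterior predictive distribution is:
$$p(x_{\text{new}} \mid x) = \int p(x_{\text{new}} \mid \theta)\, \pi(\theta \mid x)\, d\theta$$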
Interpretation: Bayesian prediction averages over all plausible parameter values, weighted by how strongly the data support each value.
This properly accounts for:
- Observation noise
- Parameter uncertainty
- Prior uncertainty (when data are limited)
As a result, Bayesian predictive distributions are typically wider and better calibrated.
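The plug-in approach implicitly assumes:
$$\theta = \hat\theta \quad \text{(no remaining parameter uncertainty)}$$
The Bayesian approach explicitly acknowledges:
$$\theta \mid x \sim \pi(\theta \mid x)$$
Hence, by the law of total variance:
$$\operatorname{Var}(X_{\text{new}} \mid x) = E\big[\operatorname{Var}(X_{\text{new}} \mid \theta) \mid x\big] + \operatorname{Var}\big(E[X_{\text{new}} \mid \theta] \mid x\big)$$
where the first term is observation noise and the second is the parameter uncertainty that the plug-in approach drops.
This is why Bayesian predictions are safer in small samples, where parameter uncertainty is a large share of the total.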
The predictive distribution is not just a probabilistic summary; it is the complete input required to make optimal future decisions under uncertainty. Any rational decision rule for actions involving the future (insurance pricing, thresholding, alarm systems, portfolio allocation) must be based on a predictive distribution, not a point estimate.
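The overconfidence of plug-in prediction in small samples can be checked by simulation. The sketch below assumes a Normal model with known noise variance 1, a conjugate N(0, 4) prior on the mean, and n = 3 observations; these numbers are illustrative. It draws the mean from the prior, then measures how often nominal 95% intervals from the plug-in and posterior predictive distributions contain the new observation.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n, sigma2 = 3, 1.0                      # small sample, known noise variance (assumed)
prior_mean, prior_var = 0.0, 4.0        # assumed conjugate Normal prior on theta
n_sims = 100_000
z = NormalDist().inv_cdf(0.975)

# Draw theta from the prior, then a sample and one future observation.
theta = rng.normal(prior_mean, np.sqrt(prior_var), size=n_sims)
data = rng.normal(theta[:, None], np.sqrt(sigma2), size=(n_sims, n))
x_new = rng.normal(theta, np.sqrt(sigma2))
xbar = data.mean(axis=1)

# Plug-in predictive: N(xbar, sigma2), ignores uncertainty in xbar.
plug_half = z * np.sqrt(sigma2)
plug_cover = np.mean(np.abs(x_new - xbar) <= plug_half)

# Posterior predictive: N(posterior mean, sigma2 + posterior variance).
post_var = 1.0 / (1.0 / prior_var + n / sigma2)
post_mean = post_var * (prior_mean / prior_var + n * xbar / sigma2)
pred_half = z * np.sqrt(sigma2 + post_var)
pred_cover = np.mean(np.abs(x_new - post_mean) <= pred_half)

print(f"plug-in coverage:              {plug_cover:.3f}")
print(f"posterior predictive coverage: {pred_cover:.3f}")
```

Under these assumptions the plug-in interval is narrower and covers noticeably less often than its nominal 95%, while the posterior predictive interval is calibrated when averaged over the prior.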
Common Pitfalls
A 95% confidence interval for μ tells you where the parameter likely is. A 95% prediction interval tells you where the next observation will likely fall. PIs are always wider!
Squared error heavily penalizes large errors. If your data has outliers or heavy tails, absolute error or Huber loss may be more appropriate. Match your loss to your problem!
Using the plug-in predictive $p(x_{\text{new}} \mid \hat\theta)$ treats your estimate $\hat\theta$ as if it were the true θ. This ignores estimation uncertainty and makes your prediction intervals too narrow. Use the posterior predictive instead!
In Bayesian estimation, the prior variance represents uncertainty about θ before data. The data variance is the noise in observations. These are different quantities; don't confuse them!
Don't default to squared error when overestimating and underestimating have different costs. Always ask: "What's the real-world consequence of each type of error?"
Connection to Point Estimation
Now we can see how decision theory provides the foundation for everything in this chapter:
| Concept | Decision Theory View | What It Tells Us |
|---|---|---|
| Estimator | Decision rule $\delta(X)$ | Maps data to estimates |
| MSE | Risk under squared error loss | $E_\theta[(\hat\theta - \theta)^2]$ |
| Bias | Systematic error in the decision | $E_\theta[\hat\theta] - \theta$ |
| Variance | Variability of the decision | $\operatorname{Var}_\theta(\hat\theta)$ |
| Unbiased Estimator | Decision that's correct on average | $E_\theta[\hat\theta] = \theta$ |
| UMVUE | Best unbiased decision | Minimum variance among unbiased |
| Bayes Estimator | Optimal decision given prior | Minimizes Bayes risk |
| MLE | Asymptotically optimal decision | Minimizes KL divergence from the empirical distribution to the model |
The Big Picture
All the properties we study in point estimation are about finding optimal decisions under different loss functions and different notions of optimality.
- MSE = risk under squared error loss
- Unbiasedness = zero systematic error
- Efficiency = achieving the minimum possible risk
- Sufficiency = using all relevant information for the decision
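The statement that MSE is the risk under squared error loss, and that it splits into squared bias plus variance, can be checked numerically. This is a minimal sketch assuming N(θ, 1) data with θ = 2, n = 10, and a deliberately biased estimator 0.8 · (sample mean); the estimator and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, c = 2.0, 10, 0.8     # true mean, sample size, hypothetical shrinkage factor
n_sims = 200_000

# Each row is one simulated data set; each estimate is c times its sample mean.
samples = rng.normal(theta, 1.0, size=(n_sims, n))
estimates = c * samples.mean(axis=1)

mse = np.mean((estimates - theta) ** 2)   # Monte Carlo risk under squared error
bias = estimates.mean() - theta           # systematic error
variance = estimates.var()                # ddof=0, so the identity is exact in-sample

print(f"MSE               = {mse:.4f}")
print(f"bias^2 + variance = {bias**2 + variance:.4f}")
print(f"theory            = {((c - 1) * theta) ** 2 + c**2 / n:.4f}")
```

The first two numbers agree up to floating-point error, and both are close to the closed-form value (c − 1)²θ² + c²/n.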
Confidence Bounds as Decision Theory
Decision theory provides a powerful lens for understanding confidence bounds and intervals, an important hybrid of testing and estimation.
An accounting firm examines accounts receivable for a company based on a random sample. They want an upper bound on the total amount owed $\theta$.
If $X$ represents the amount owed in the sample, they seek $\bar\theta(X)$ such that:
$$P_\theta\big(\theta \le \bar\theta(X)\big) \ge 1 - \alpha \quad \text{for all } \theta$$
This is called a (1-α) upper confidence bound on $\theta$.
The Decision-Theoretic Formulation
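How does this fit into decision theory? We can view the upper confidence bound as a decision procedure with action space $\mathcal{A} = \mathbb{R}$ and a specific loss function.
Naive loss: $L(\theta, a) = \mathbf{1}\{a < \theta\}$, so the risk is the probability that the bound falls below the truth.
Problem: Taking $a = \infty$ achieves risk = 0! A bound that says "at most infinity" is useless.
Better loss: $L(\theta, a) = a - \theta$ if $a \ge \theta$, and $L(\theta, a) = c$ if $a < \theta$, for a large constant $c > 0$.
Why better: Penalizes overestimation (loose bounds) while heavily penalizing undercoverage.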
The Key Insight
Though upper bounding is the primary goal, it's also important to get close to the truth. Knowing "at most ∞ dollars" is technically correct but useless. The decision-theoretic framework naturally accommodates both goals by choosing an appropriate loss function.
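Rather than using Lagrangian optimization, practitioners typically:
- Fix the coverage probability: Require $P_\theta(\bar\theta(X) \ge \theta) \ge 1 - \alpha$ for all $\theta$ (e.g., α = 0.05)
- Then minimize the "excess": Among all procedures satisfying the coverage requirement, minimize $E_\theta\big[(\bar\theta(X) - \theta)^+\big]$
where $x^+ = \max(x, 0)$ is the positive part.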
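The two-step recipe can be illustrated for a Normal mean with known standard deviation, where the standard upper bound is x̄ + z·σ/√n with z the (1 − α) normal quantile. The sketch below is an assumed toy setting (θ = 100, σ = 15, n = 25, all hypothetical) that estimates coverage and expected excess for that bound and for a deliberately looser one.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
theta, sigma, n, alpha = 100.0, 15.0, 25, 0.05   # assumed toy values
n_sims = 100_000
z = NormalDist().inv_cdf(1 - alpha)              # one-sided critical value

# Sampling distribution of the sample mean under the assumed Normal model.
xbar = rng.normal(theta, sigma / np.sqrt(n), size=n_sims)

def summarize(bound):
    """Return (coverage, expected excess) for an array of simulated upper bounds."""
    coverage = np.mean(bound >= theta)
    excess = np.mean(np.maximum(bound - theta, 0.0))   # E[(bound - theta)^+]
    return coverage, excess

tight = xbar + z * sigma / np.sqrt(n)        # standard (1 - alpha) upper bound
loose = xbar + 2 * z * sigma / np.sqrt(n)    # over-conservative alternative

cov_tight, exc_tight = summarize(tight)
cov_loose, exc_loose = summarize(loose)
print(f"tight: coverage={cov_tight:.3f}  expected excess={exc_tight:.2f}")
print(f"loose: coverage={cov_loose:.3f}  expected excess={exc_loose:.2f}")
```

Both bounds meet the coverage requirement, but the looser one pays for its extra coverage with roughly twice the expected excess, which is why the excess is the quantity to minimize once coverage is fixed.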
Extension to Confidence Intervals
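The same decision-theoretic logic extends to confidence intervals. A (1-α) confidence interval $[\underline\theta(X), \bar\theta(X)]$ for $\theta$ satisfies:
$$P_\theta\big(\underline\theta(X) \le \theta \le \bar\theta(X)\big) \ge 1 - \alpha \quad \text{for all } \theta$$
Goal: Minimize interval width while maintaining ≥ (1-α) coverage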
| Concept | Decision Theory Formulation | What We Optimize |
|---|---|---|
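| Upper Bound | Action $a = \bar\theta(X)$ | Minimize $E_\theta[(\bar\theta(X) - \theta)^+]$ s.t. coverage ≥ 1-α |
| Lower Bound | Action $a = \underline\theta(X)$ | Maximize $E_\theta[\underline\theta(X)]$ s.t. coverage ≥ 1-α |
| Two-Sided CI | Action $a = [\underline\theta(X), \bar\theta(X)]$ | Minimize $E_\theta[\bar\theta(X) - \underline\theta(X)]$ s.t. coverage ≥ 1-α |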
Why This Matters
Understanding confidence intervals as decision procedures explains why we construct them the way we do: we're finding the narrowest intervals that still achieve the required coverage. This is a constrained optimization problem: pure decision theory!
Symbol Glossary
| Symbol | Name | Meaning |
|---|---|---|
| $\theta$ | Parameter | The unknown true value we want to estimate |
| $\Theta$ | Parameter Space | Set of all possible $\theta$ values |
| $a$ | Action | A decision or estimate we choose |
| $\mathcal{A}$ | Action Space | Set of all possible actions |
| $L(\theta, a)$ | Loss Function | Cost of choosing action $a$ when truth is $\theta$ |
| $\delta(X)$ | Decision Rule | Function mapping data to an action |
| $R(\theta, \delta)$ | Risk Function | Expected loss: $E_\theta[L(\theta, \delta(X))]$ |
| $r(\pi, \delta)$ | Bayes Risk | Expected risk under prior: $E_\pi[R(\theta, \delta)]$ |
| $\pi(\theta)$ | Prior Distribution | Belief about $\theta$ before seeing data |
| $X_{\text{new}}$ | Future Observation | A new random value to be predicted |
Python Implementation
Here's a complete implementation of key decision theory concepts:
import numpy as np
from scipy import stats
from typing import Callable, Tuple

# =============================================================================
# LOSS FUNCTIONS
# =============================================================================

def squared_error_loss(theta: float, estimate: float) -> float:
    """Squared error loss: L(θ, a) = (θ - a)²"""
    return (theta - estimate) ** 2

def absolute_error_loss(theta: float, estimate: float) -> float:
    """Absolute error loss: L(θ, a) = |θ - a|"""
    return np.abs(theta - estimate)

def asymmetric_loss(theta: float, estimate: float,
                    c_under: float = 1.0, c_over: float = 2.0) -> float:
    """
    Asymmetric loss: different costs for over/under-estimation
    L(θ, a) = c_under * max(θ - a, 0) + c_over * max(a - θ, 0)
    """
    if estimate < theta:
        return c_under * (theta - estimate)
    else:
        return c_over * (estimate - theta)

def huber_loss(theta: float, estimate: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic for small errors, linear for large"""
    error = np.abs(theta - estimate)
    if error <= delta:
        return 0.5 * error ** 2
    else:
        return delta * error - 0.5 * delta ** 2

# =============================================================================
# RISK FUNCTIONS
# =============================================================================

def frequentist_risk(true_theta: float,
                     estimator: Callable[[np.ndarray], float],
                     sample_size: int,
                     loss_fn: Callable[[float, float], float],
                     n_simulations: int = 10000) -> float:
    """
    Compute frequentist risk via simulation.

    R(θ, δ) = E_θ[L(θ, δ(X))]
    """
    losses = []
    for _ in range(n_simulations):
        # Generate data from N(theta, 1)
        data = np.random.normal(true_theta, 1, sample_size)
        estimate = estimator(data)
        losses.append(loss_fn(true_theta, estimate))
    return float(np.mean(losses))

def bayes_risk(estimator: Callable[[np.ndarray], float],
               prior_mean: float,
               prior_std: float,
               sample_size: int,
               loss_fn: Callable[[float, float], float],
               n_simulations: int = 10000) -> float:
    """
    Compute Bayes risk via simulation.

    r(π, δ) = E_π[R(θ, δ)]
    """
    total_loss = 0.0
    for _ in range(n_simulations):
        # Sample θ from the prior
        theta = np.random.normal(prior_mean, prior_std)
        # Generate data from N(theta, 1)
        data = np.random.normal(theta, 1, sample_size)
        estimate = estimator(data)
        total_loss += loss_fn(theta, estimate)
    return total_loss / n_simulations

# =============================================================================
# ESTIMATORS
# =============================================================================

def sample_mean(data: np.ndarray) -> float:
    """Sample mean estimator"""
    return np.mean(data)

def sample_median(data: np.ndarray) -> float:
    """Sample median estimator"""
    return np.median(data)

def bayes_estimator_normal(data: np.ndarray,
                           prior_mean: float,
                           prior_var: float,
                           data_var: float = 1.0) -> float:
    """
    Bayes estimator for Normal mean (conjugate prior).
    Posterior mean minimizes Bayes risk under squared error loss.
    """
    n = len(data)
    posterior_precision = 1 / prior_var + n / data_var
    posterior_mean = (prior_mean / prior_var + np.sum(data) / data_var) / posterior_precision
    return posterior_mean

def shrinkage_estimator(data: np.ndarray, shrinkage: float = 0.5) -> float:
    """Linear shrinkage toward zero with a fixed factor.

    Note: the James-Stein estimator uses a data-dependent factor; this is the
    simpler fixed-factor version.
    """
    return shrinkage * np.mean(data)

# =============================================================================
# PREDICTION
# =============================================================================

def prediction_interval(data: np.ndarray,
                        confidence: float = 0.95) -> Tuple[float, float]:
    """
    Compute prediction interval for new observation.
    Assumes Normal data with unknown mean and variance.
    """
    n = len(data)
    mean = np.mean(data)
    s = np.std(data, ddof=1)  # Sample std

    # t-distribution with n-1 degrees of freedom
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)

    # Prediction SE: sqrt(s² * (1 + 1/n))
    pred_se = s * np.sqrt(1 + 1 / n)

    lower = mean - t_crit * pred_se
    upper = mean + t_crit * pred_se

    return lower, upper

def posterior_predictive_normal(data: np.ndarray,
                                prior_mean: float,
                                prior_var: float,
                                data_var: float = 1.0) -> Tuple[float, float]:
    """
    Compute posterior predictive distribution parameters.
    Returns (mean, variance) of X_new | X_1, ..., X_n
    """
    n = len(data)

    # Posterior parameters
    posterior_precision = 1 / prior_var + n / data_var
    posterior_var = 1 / posterior_precision
    posterior_mean = posterior_var * (prior_mean / prior_var + np.sum(data) / data_var)

    # Predictive distribution: observation noise plus parameter uncertainty
    predictive_mean = posterior_mean
    predictive_var = data_var + posterior_var

    return predictive_mean, predictive_var

# =============================================================================
# EXAMPLE USAGE
# =============================================================================

if __name__ == "__main__":
    np.random.seed(42)

    # Compare estimator risks
    print("=" * 60)
    print("RISK COMPARISON: Mean vs Median for Normal Data")
    print("=" * 60)

    true_theta = 5.0
    for n in [5, 10, 25, 50]:
        risk_mean = frequentist_risk(true_theta, sample_mean, n, squared_error_loss)
        risk_median = frequentist_risk(true_theta, sample_median, n, squared_error_loss)
        print(f"n={n:2d}: Mean Risk = {risk_mean:.4f}, Median Risk = {risk_median:.4f}")
        print(f"      Mean is {risk_median/risk_mean:.2f}x more efficient")

    print()
    print("=" * 60)
    print("BAYES RISK COMPARISON")
    print("=" * 60)

    # Compare MLE vs Bayes estimator
    n = 10
    for prior_std in [0.5, 1.0, 2.0, 5.0]:
        mle_bayes_risk = bayes_risk(sample_mean, 0, prior_std, n, squared_error_loss)
        # Bind the prior variance as a default argument to avoid late binding
        bayes_est = lambda x, v=prior_std**2: bayes_estimator_normal(x, 0, v)
        bayes_bayes_risk = bayes_risk(bayes_est, 0, prior_std, n, squared_error_loss)
        print(f"Prior σ={prior_std}: MLE Bayes Risk = {mle_bayes_risk:.4f}, "
              f"Bayes Est. Risk = {bayes_bayes_risk:.4f}")

    print()
    print("=" * 60)
    print("PREDICTION vs ESTIMATION")
    print("=" * 60)

    data = np.array([4.2, 5.1, 4.8, 5.5, 4.9, 5.2, 4.7, 5.3])

    # Estimation: t-based CI, so it is comparable with the t-based PI (n is small)
    n_obs = len(data)
    est_mean = np.mean(data)
    est_se = np.std(data, ddof=1) / np.sqrt(n_obs)
    t_crit = stats.t.ppf(0.975, df=n_obs - 1)
    ci_lower = est_mean - t_crit * est_se
    ci_upper = est_mean + t_crit * est_se

    # Prediction
    pi_lower, pi_upper = prediction_interval(data)

    print(f"Data: {data}")
    print(f"\nEstimation (95% CI for μ): [{ci_lower:.3f}, {ci_upper:.3f}]")
    print(f"CI Width: {ci_upper - ci_lower:.3f}")
    print(f"\nPrediction (95% PI for X_new): [{pi_lower:.3f}, {pi_upper:.3f}]")
    print(f"PI Width: {pi_upper - pi_lower:.3f}")
    print(f"\nPI is {(pi_upper - pi_lower)/(ci_upper - ci_lower):.2f}x wider than CI")
Key Insights
All estimation concepts (bias, variance, MSE, efficiency) arise from decision theory. The "best" estimator depends on your loss function and how you average risk.
Squared error → posterior mean. Absolute error → posterior median. Asymmetric loss → quantiles. Choose your loss based on the real-world consequences.
Predicting a future observation has MORE uncertainty than estimating a parameter. Prediction intervals are always wider than confidence intervals.
The Mean Squared Error is the frequentist risk when using squared error loss. The bias-variance decomposition follows directly from this.
Bayes estimators under least favorable priors are often minimax. The sample mean is both generalized Bayes (under an improper flat prior) and minimax for Normal mean estimation with squared error loss.
Try It Yourself
Solidify your understanding by experimenting with the interactive demos. Here's a structured exploration:
In the Loss Function Demo, set error = 1, then error = 3. How much does squared error increase vs absolute error? (Hint: squared goes 1→9, absolute goes 1→3)
In the Risk Demo, slide n from 10 to 50. Confirm risk drops roughly as 1/n for the sample mean. Does the median's risk also drop at the same rate?
In the Prediction Demo, add more data points. Watch the PI shrink slower than the CI. Why? (The σ² term doesn't go away even with infinite data!)
In the Bayes Demo, set prior variance very small (0.1) vs very large (5.0). Where does the posterior land? Verify: small → prior dominates; large → data dominates.
Think of a real problem where underestimating is 3x worse than overestimating. What quantile should you use? (Answer: α = 3/(3+1) = 0.75 = 75th percentile)
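The quantile answer in the last exercise can be checked numerically. This sketch assumes a hypothetical set of outcomes drawn from a Gamma distribution (any distribution would do) and an asymmetric linear loss where underestimating costs 3 per unit and overestimating costs 1 per unit. It scans candidate actions at the 1st to 99th empirical percentiles and reports which one has the lowest average loss.

```python
import numpy as np

rng = np.random.default_rng(4)
outcomes = rng.gamma(shape=2.0, scale=50.0, size=20_000)  # hypothetical outcome draws
c_under, c_over = 3.0, 1.0                                # underestimating is 3x worse

def expected_loss(a):
    """Average asymmetric linear loss of choosing action a for every outcome."""
    under = np.maximum(outcomes - a, 0.0)   # amount by which a is too low
    over = np.maximum(a - outcomes, 0.0)    # amount by which a is too high
    return np.mean(c_under * under + c_over * over)

levels = np.linspace(0.01, 0.99, 99)
candidates = np.quantile(outcomes, levels)
losses = np.array([expected_loss(a) for a in candidates])

best_level = levels[losses.argmin()]
print(f"Loss-minimizing quantile level: {best_level:.2f}")
print(f"Formula c_under / (c_under + c_over): {c_under / (c_under + c_over):.2f}")
```

Changing `c_under` and `c_over` moves the best level to c_under / (c_under + c_over), so equal costs recover the median.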
Summary
In this section, we've built the decision-theoretic foundation for point estimation:
- Decision theory framework: States (θ), actions (a), and loss L(θ, a)
- Loss functions: Squared error, absolute error, asymmetric, and 0-1 loss each lead to different optimal estimators
- Risk functions: Frequentist risk R(θ, δ), Bayes risk r(π, δ), and minimax risk provide different ways to evaluate estimators
- Prediction vs estimation: Prediction has extra uncertainty from the inherent randomness of future observations
- Connection to point estimation: MSE, bias, variance are all decision-theoretic concepts
Now that you understand why we care about different estimator properties, we'll dive deep into the specific concepts:
- Section 1: Estimators and their properties (the parametric framework)
- Section 2: Bias, Variance, and MSE (the famous decomposition)
- Section 3: Consistency and Efficiency (large-sample behavior)
- Section 4: Sufficiency (using all the information in your data)
- Section 5: Completeness and Ancillarity (finding optimal estimators)