Learning Objectives
Before You Start
You should be comfortable with probability distributions, random variables, and basic calculus (derivatives and optimization). Familiarity with expected values will also help.
By the end of this section, you will be able to:
- Describe the parametric framework: what it means to work with a family of distributions indexed by parameters
- Define an estimator: a function that maps observations to parameter estimates (data in, estimate out)
- Explain contrast functions: why they measure "discrepancy" between candidate parameters and the truth
- Construct estimates by minimizing the empirical contrast, that is, by finding the "best fit" parameter
- Solve estimating equations using gradient conditions and general $\Psi$-functions
- Apply the framework to real estimation problems, including Maximum Likelihood Estimation (MLE)
By the end of this section, you'll understand the complete estimation pipeline and be ready to implement a maximum likelihood estimator from scratch. You'll see that estimators ranging from simple sample means to neural network training follow the same fundamental pattern.
The Big Picture: Why Estimation Matters
Statistics is fundamentally about learning from data. We observe data, but we want to know about the underlying process that generated it.
A coffee shop manager wants to know the true average wait time for customers. Measuring every single customer forever is impossible, so they decide to sample customers and estimate the average.
Interactive demo: Sample Size vs Estimate Accuracy
The estimate moves toward the true mean (2.847) as the sample size grows. With only n = 10 samples, there is still considerable uncertainty.
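The effect of sample size can be sketched in a few lines of Python. This is a minimal simulation, assuming wait times are exponentially distributed with mean 2.847 minutes; the exponential model, the seed, and the helper name `sample_mean_estimate` are illustrative assumptions, not part of the original demo.

```python
import numpy as np

TRUE_MEAN = 2.847  # the "hidden truth" from the coffee shop example

def sample_mean_estimate(n: int, rng: np.random.Generator) -> float:
    """Draw n simulated wait times and return their sample mean."""
    # Assumption: exponential wait times; any distribution with this mean would do.
    waits = rng.exponential(scale=TRUE_MEAN, size=n)
    return float(waits.mean())

rng = np.random.default_rng(0)
sample_sizes = [10, 100, 1_000, 10_000, 100_000]
estimates = {n: sample_mean_estimate(n, rng) for n in sample_sizes}

for n, est in estimates.items():
    print(f"n = {n:>6}: estimate = {est:.3f}, error = {abs(est - TRUE_MEAN):.3f}")
```

The error typically shrinks at a rate of roughly 1/√n, although any single run can fluctuate.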
The Detective Analogy
Think of estimation like being a detective. You arrive at a crime scene and find clues (your data). You can't rewind time to see exactly what happened (the true parameters), but you can use the clues to reconstruct what most likely occurred.
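| Detective work | Statistical estimation |
|---|---|
| Clues at crime scene → Evidence | Sample observations → Data $X$ |
| Reconstruct what happened → Theory | Apply estimator → Estimate $\hat\theta = T(X)$ |
| Better methods → Closer to truth | Good estimator → $\hat\theta$ close to $\theta_0$ |
| Can never be 100% certain → Uncertainty | Quantify uncertainty → Confidence intervals |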
The better your detective methods (estimators), the closer your reconstruction gets to the truth. And just like in detective work, some methods are provably better than others.
A Concrete Example: The Coffee Shop Mystery
Imagine you're studying customer wait times at a coffee shop. You collect data: 2.3 min, 1.8 min, 4.1 min, 3.2 min...
The central question: What is the "true" average wait time for ALL customers (past, present, and future)?
Here's the key insight: that "true average" exists somewhere out there as a fixed number — maybe it's exactly 2.847 minutes. We'll never know it precisely, but we can get closer and closer with more data and smarter methods.
The entire estimation process: from hidden truth, through observable data, to our best guess
- Your sample: [2.3, 1.8, 4.1, 3.2, ...]
- Sample mean: 2.85 min
- Sample size: n = 100
- True population mean: $\mu$ = ???
- True variance: $\sigma^2$ = ???
- The actual data-generating process
The One Formula to Remember
$$\hat\theta = T(X)$$
An estimator is just a function of your data. That's it. You put data in, you get a parameter estimate out. The hard part is choosing which function $T(\cdot)$ to use, and that is what this chapter teaches you.
Why Should You Care? (The ML Connection)
Every time you train a machine learning model, you're doing estimation:
| ML Task | What You're Estimating |
|---|---|
| Linear Regression | Slope and intercept parameters |
| Neural Network Training | Millions of weight parameters |
| GPT/LLM Training | Next-token probability distribution |
| Bayesian Inference | Posterior distribution over parameters |
When you run model.fit(X_train, y_train) in scikit-learn, which of the "Three Questions of Estimation" is scikit-learn answering for you?
The Universal Pattern
All estimation follows the same pattern: Data → Estimator → Estimate. Whether you're computing a sample mean or training GPT-4, you're applying a function to data to get parameter estimates.
The Three Questions of Estimation
Every estimation problem comes down to three questions:
- What should we estimate? — Defining the parameter(s) of interest
- How should we estimate it? — Choosing an estimator (this chapter!)
- How good is our estimate? — Quantifying uncertainty (next chapters)
Real-World Estimation: From Problems to Mathematical Models
Estimation isn't just abstract mathematics; it solves concrete problems every day, and real-world challenges map directly onto the estimation framework we're learning.
The Universal Pattern Across All Examples
Every such problem follows the same structure: (1) a true parameter $\theta_0$ exists but is unknown, (2) the data $X$ are a sample from the population, (3) a contrast function $\rho(X, \theta)$ measures how poorly $\theta$ fits the data, and (4) the estimate $\hat\theta$ minimizes the contrast. This is the universal language of estimation.
The Estimation Problem (Formal)
Given observed data $X$, construct a function $\hat\theta(X)$ that gives us a "good guess" for the unknown parameter $\theta$.
What Makes a Guess "Good"?
This is the million-dollar question! We want estimators that are: (1) unbiased — correct on average, (2) consistent — improve with more data, and (3) efficient — have minimal variance. We'll formalize these properties throughout this chapter.
The Parametric Framework
Before we can estimate anything, we need to set up our mathematical framework precisely. Let's decode the notation:
The Setup: Observation Space and Probability Families
$$X \in \mathcal{X}, \qquad X \sim P \in \mathcal{P}$$
Let's break this down symbol by symbol:
| Symbol | Name | What It Means |
|---|---|---|
| $X$ | Observation vector | The data we actually observe (e.g., n wait times) |
| $\mathcal{X}$ | Sample space | All possible values $X$ could take |
| $P$ | Probability distribution | The unknown law governing how $X$ is generated |
| $\mathcal{P}$ | Probability family | The set of all candidate distributions we consider |
Think of the Sample Space as the Data Universe
If you're measuring wait times (positive numbers), then $\mathcal{X} = (0, \infty)^n$: all possible n-tuples of positive real numbers.
The Parametric Assumption
In the parametric case, we assume the true distribution belongs to a specific family indexed by parameters:
$$\mathcal{P} = \{P_\theta : \theta \in \Theta\}$$
| Symbol | Name | What It Means |
|---|---|---|
| $\theta$ | Parameter vector | The unknown quantities we want to estimate |
| $\Theta$ | Parameter space | All possible values $\theta$ could take |
| $P_\theta$ | Parametric distribution | The distribution when the parameter equals $\theta$ |
Example: Normal Distribution
For normally distributed data: $\theta = (\mu, \sigma^2)$ and $\Theta = \mathbb{R} \times (0, \infty)$ (mean can be any real number, variance must be positive).
The key insight: Once we specify $\theta$, we completely determine the probability distribution $P_\theta$. The estimation problem becomes: Which $\theta$ generated our data?
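The idea that $\theta$ picks out one member of the family can be sketched in code. This is a minimal illustration for the normal family using `scipy.stats.norm`; the helper names `in_parameter_space` and `P` are hypothetical and chosen only to mirror the notation above.

```python
import numpy as np
from scipy import stats

def in_parameter_space(theta) -> bool:
    """Check theta = (mu, sigma2) against Theta = R x (0, inf)."""
    mu, sigma2 = theta
    return bool(np.isfinite(mu) and sigma2 > 0)

def P(theta):
    """Return the distribution P_theta for the normal family."""
    if not in_parameter_space(theta):
        raise ValueError(f"theta = {theta} is outside the parameter space")
    mu, sigma2 = theta
    # scipy parameterizes the normal by its standard deviation, not its variance
    return stats.norm(loc=mu, scale=np.sqrt(sigma2))

dist = P((5.0, 4.0))
print(f"mean = {dist.mean():.2f}, variance = {dist.var():.2f}")
```

Every valid $\theta$ yields exactly one distribution object, and values outside $\Theta$ (such as a negative variance) are rejected.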
What Is an Estimator and Estimate?
An estimator is a machine (function) that turns data into guesses. An estimate is the actual guess.
The estimator predicts the true parameter of the population using sample data from that population.
Formally, an estimator is a function $\hat\theta(X)$ of the observation vector $X$ that produces an estimate of the unknown parameter $\theta$.
$$\hat\theta = \hat\theta(X) \in \Theta$$
This notation emphasizes three crucial points:
- $\hat\theta$ is a function: it takes data as input and produces a parameter estimate as output
- $\hat\theta$ depends only on $X$: we can only use observable data, not the true (unknown) $\theta$
- $\hat\theta$ lives in $\Theta$: the estimate should be a valid parameter value
The Hat Notation
The "hat" symbol always denotes an estimate or estimator. When you see $\hat\theta$, think: "this is our data-based guess for $\theta$."
Estimator vs Estimate
A subtle but important distinction:
| Term | Symbol | What It Is |
|---|---|---|
| Estimator | $\hat\theta(X)$ | The function/rule itself (random, before seeing data) |
| Estimate | $\hat\theta(x)$ | The specific value obtained after observing $X = x$ (fixed number) |
The estimator $\hat\theta(X)$ is a random variable because $X$ is random. The estimate $\hat\theta(x)$ is a specific number computed from realized data $x$.
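The distinction can be seen by simulation. The sketch below assumes normally distributed data and uses the sample mean as the estimator; the function name `T`, the true value 5.0, and the sample sizes are illustrative choices.

```python
import numpy as np

def T(x: np.ndarray) -> float:
    """The estimator: a rule mapping any data set to a number (here, the sample mean)."""
    return float(np.mean(x))

rng = np.random.default_rng(1)
theta_0 = 5.0  # true parameter, known only because we are simulating

# One realized data set gives one estimate: a fixed number.
x_observed = rng.normal(loc=theta_0, scale=2.0, size=50)
estimate = T(x_observed)

# Re-drawing the data many times shows that the estimator is a random variable.
replications = np.array(
    [T(rng.normal(loc=theta_0, scale=2.0, size=50)) for _ in range(2_000)]
)

print(f"estimate from the observed data: {estimate:.3f}")
print(f"estimator across data sets: mean = {replications.mean():.3f}, "
      f"sd = {replications.std():.3f}")
```

The single estimate is one draw from the distribution summarized on the last line.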
Contrast Functions: Measuring Discrepancy
How do we find a "good" estimator? The key idea is to define a contrast function that measures how "far" any candidate parameter $\theta$ is from the truth.
Definition of Contrast Function
A contrast function $\rho$ (Greek letter "rho") is a function $\rho: \mathcal{X} \times \Theta \to \mathbb{R}$ that takes:
- An observation $x$ from the sample space $\mathcal{X}$
- A candidate parameter $\theta$ from the parameter space $\Theta$
And produces a real number measuring the "discrepancy" between $\theta$ and the truth.
Intuition for the Contrast Function
Think of $\rho(x, \theta)$ as answering: "How incompatible is this candidate parameter $\theta$ with the observed data $x$?"
- Small $\rho(x, \theta)$ means $\theta$ is compatible with data
- Large $\rho(x, \theta)$ means $\theta$ is incompatible with data
Wait — How Can We Compare Data and Parameters?
You might be confused: $x$ is data (numbers we observed), and $\theta$ is a parameter (an abstract quantity describing a distribution). These are completely different types of objects! How can a function take both and produce a meaningful "discrepancy"?
The Key Insight
The contrast function does NOT directly compare $x$ and $\theta$. Instead, it asks:
"If $\theta$ were the true parameter, how surprising or unlikely would this observed data $x$ be?"
Here's how it works:
- The parameter $\theta$ defines a probability model: it tells us what the data should look like if $\theta$ were true
- The data $x$ is what we actually observed: the real numbers we collected
- The function $\rho(x, \theta)$ measures the "fit": how well does what we saw match what we'd expect if $\theta$ were true?
Concrete Examples
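Two standard examples make this concrete: the squared error $\rho(x, \theta) = (x - \theta)^2$, which is small when $\theta$ is close to the observation, and the negative log-likelihood $\rho(x, \theta) = -\log p(x, \theta)$, which is small when the model with parameter $\theta$ assigns high density to the observation.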
The Bridge Between Data and Parameters
The contrast function acts as a bridge between the world of data ($x$) and the world of parameters ($\theta$):
- $\theta$ generates expectations: "what data should look like"
- $x$ is reality: "what data actually looks like"
- $\rho(x, \theta)$ measures the gap: "how well does expectation match reality?"
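To see a contrast function act as this bridge, the sketch below evaluates two common contrasts on the coffee shop wait times for a few candidate values of $\theta$. The function names and the unit-variance normal model are illustrative assumptions.

```python
import numpy as np

def squared_error_contrast(x: np.ndarray, theta: float) -> float:
    """Average of (x_i - theta)^2: large when theta is far from the data."""
    return float(np.mean((x - theta) ** 2))

def normal_nll_contrast(x: np.ndarray, theta: float) -> float:
    """Average negative log-likelihood of x under Normal(theta, 1)."""
    return float(np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (x - theta) ** 2))

x = np.array([2.3, 1.8, 4.1, 3.2])  # wait times from the coffee shop example

for theta in (0.0, 2.85, 6.0):
    print(f"theta = {theta:4.2f}: "
          f"squared error = {squared_error_contrast(x, theta):6.3f}, "
          f"NLL = {normal_nll_contrast(x, theta):6.3f}")
```

Both contrasts are smallest for the candidate near the centre of the data and grow as the candidate moves away from it.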
The Key Requirement
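For $\rho$ to be a useful contrast function, we need a special property. Define the population discrepancy:
$$D(\theta_0, \theta) = E_{\theta_0}\big[\rho(X, \theta)\big]$$
This measures the average discrepancy when the true parameter is $\theta_0$ and we're evaluating at $\theta$.
The Fundamental Requirement
For $\rho$ to be a valid contrast function, we require that $\theta_0$ uniquely minimizes $D(\theta_0, \cdot)$:
$$D(\theta_0, \theta) > D(\theta_0, \theta_0) \quad \text{for all } \theta \neq \theta_0$$
In plain English: When averaged over the true distribution, the discrepancy is smallest exactly at the true parameter.
This requirement ensures that if we knew the truth (the distribution $P_{\theta_0}$), the contrast function would correctly identify $\theta_0$ as the best choice.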
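As a quick check of this requirement, take the squared-error contrast $\rho(x, \theta) = (x - \theta)^2$ and assume, for illustration, that $\theta_0$ is the mean of $X$ and that $X$ has finite variance. Expanding around $\theta_0$ gives:

```latex
\begin{aligned}
D(\theta_0, \theta)
  &= E_{\theta_0}\big[(X - \theta)^2\big] \\
  &= E_{\theta_0}\big[(X - \theta_0)^2\big]
     + 2(\theta_0 - \theta)\,E_{\theta_0}[X - \theta_0]
     + (\theta_0 - \theta)^2 \\
  &= \operatorname{Var}_{\theta_0}(X) + (\theta_0 - \theta)^2 .
\end{aligned}
```

The variance term does not depend on $\theta$, and the squared term is zero only at $\theta = \theta_0$, so the requirement holds for this contrast.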
The Discrepancy Function
Let's understand the discrepancy function more deeply:
$$D(\theta_0, \theta) = E_{\theta_0}\big[\rho(X, \theta)\big] = \int_{\mathcal{X}} \rho(x, \theta)\, p(x, \theta_0)\, dx$$
💡 Intuitive Meaning of the Discrepancy Function
The discrepancy function $D(\theta_0, \theta)$ measures how bad a guessed model $\theta$ is on average, when the world is actually governed by the true parameter $\theta_0$.
It evaluates the loss $\rho(x, \theta)$ for every possible observation $x$, weighted by how frequently that observation occurs in reality.
In machine learning, this is the ideal objective we want to minimize, and training loss is simply its finite-data approximation.
For discrete distributions, the integral becomes a sum:
$$D(\theta_0, \theta) = \sum_{x \in \mathcal{X}} \rho(x, \theta)\, p(x, \theta_0)$$
It is the expected pain of believing $\theta$, when the world is actually governed by $\theta_0$.
Imagine God knows the true parameter $\theta_0$.
God repeatedly simulates infinite datasets from the true distribution $P_{\theta_0}$.
For each simulated data point $x$, God evaluates how wrong your guess $\theta$ is using $\rho(x, \theta)$.
The discrepancy $D(\theta_0, \theta)$ is the average of that mistake over all possible worlds.
This explains why:
- It uses the true distribution $P_{\theta_0}$
- It integrates over all possible observations $x$
- It is a population-level measure, not a sample-level one
A physical analogy:
- $\theta_0$ = true physical parameter of a system
- $\theta$ = your estimated model
- $x$ = sensor reading
- $\rho(x, \theta)$ = prediction error
Then $D(\theta_0, \theta)$ = expected prediction error over infinite future measurements.
Why average under the true distribution? Because discrepancy is not about how good the model thinks it is; it is about how reality judges the model.
The true distribution $P_{\theta_0}$ tells us which observations actually occur in the real world. We care about performance on real data, not hypothetical data from our model.
In machine learning, we never know $P_{\theta_0}$. So we replace this population discrepancy with a sample average:
$$\frac{1}{n} \sum_{i=1}^{n} \rho(x_i, \theta)$$
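The relationship between the population discrepancy and its sample version can be checked numerically. The sketch below assumes $X \sim \text{Normal}(\theta_0, 1)$ with the squared-error contrast, for which the discrepancy has the closed form $1 + (\theta_0 - \theta)^2$; the specific values and sample sizes are illustrative.

```python
import numpy as np

def rho(x: np.ndarray, theta: float) -> np.ndarray:
    """Squared-error contrast, evaluated pointwise."""
    return (x - theta) ** 2

rng = np.random.default_rng(2)
theta_0, theta = 3.0, 4.0  # true parameter and a (wrong) candidate

# Closed form for X ~ Normal(theta_0, 1): D = Var(X) + (theta_0 - theta)^2
D_exact = 1.0 + (theta_0 - theta) ** 2

# "Infinite" data approximated by a very large simulated sample
D_monte_carlo = float(rho(rng.normal(theta_0, 1.0, size=1_000_000), theta).mean())

# What we have in practice: a small sample
D_empirical = float(rho(rng.normal(theta_0, 1.0, size=30), theta).mean())

print(f"exact discrepancy:        {D_exact:.3f}")
print(f"Monte Carlo (n = 10^6):   {D_monte_carlo:.3f}")
print(f"sample average (n = 30):  {D_empirical:.3f}")
```

The large-sample average sits close to the exact value, while the small-sample average is a noisier approximation of the same quantity.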
🔗 This Is Your PyTorch/TensorFlow Loss Function!
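Many deep learning training objectives are special cases of the contrast function $\rho(x, \theta)$. Standard forms of four common losses:
Cross-Entropy Loss: used in language models and classifiers
$$\rho\big((x, y), \theta\big) = -\log p_\theta(y \mid x)$$
When GPT predicts the next token, it minimizes this over millions of text samples.
Mean Squared Error: used in regression and reconstruction
$$\rho\big((x, y), \theta\big) = \lVert y - f_\theta(x) \rVert^2$$
Neural networks learn by minimizing prediction error over training data.
Contrastive Loss: used in representation learning
$$\rho(x, \theta) = -\log \frac{\exp\big(\mathrm{sim}_\theta(x, x^{+})/\tau\big)}{\sum_{j} \exp\big(\mathrm{sim}_\theta(x, x_j)/\tau\big)}$$
CLIP learns image-text alignment by contrasting positive pairs against negatives.
KL Divergence + Reconstruction: used in generative models
$$\rho(x, \theta) = -E_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \mathrm{KL}\big(q_\phi(z \mid x) \,\Vert\, p(z)\big)$$
VAEs minimize this variational bound, and diffusion models are trained with bounds of a similar kind.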
The Profound Connection
✓ Population discrepancy = True expected loss (what we ideally want)
✓ Empirical risk = Training loss (what we actually compute)
✓ Generalization = How close empirical risk is to population discrepancy
In PyTorch, this is literally:
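```python
import torch
from torch import nn

# Population discrepancy (ideal, but impossible to compute):
# D(θ₀, θ) = E_{x~P_θ₀}[ρ(x, θ)]

# Toy setup so the snippet runs end to end (illustrative values only)
model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
training_data = [(torch.randn(1), torch.randn(1)) for _ in range(32)]

# Empirical risk (what we actually compute):
optimizer.zero_grad()                      # clear gradients from any earlier step
loss = 0.0
for x, y in training_data:
    loss = loss + criterion(model(x), y)   # ρ(xᵢ, θ)
loss = loss / len(training_data)           # (1/n) Σ ρ(xᵢ, θ)

# Minimize: one gradient step on the empirical risk
loss.backward()
optimizer.step()
```
🎯 Classical estimation theory and modern deep learning share the same mathematical framework!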
Interactive Geometric Visualization
The interactive figures at this point show the discrepancy "bowl" in two views.
The Discrepancy "Bowl": 2D Visualization
The discrepancy function $D(\theta_0, \theta)$ measures how "wrong" a candidate $\theta$ is when $\theta_0$ is the true parameter. Plotted against $\theta$, it forms a bowl shape with its minimum at $\theta = \theta_0$.
Key Insight
As $\theta$ moves away from $\theta_0$, the discrepancy grows. The "bowl" shape ensures there's a unique minimum at $\theta = \theta_0$.
The Discrepancy "Bowl": 3D Visualization
With two parameters $(\theta_1, \theta_2)$, the discrepancy function forms a 3D bowl. The minimum sits at the true parameter values $(\theta_{01}, \theta_{02})$.
What You're Seeing
The 2D curve shows the discrepancy for a single parameter (e.g., estimating a mean). The 3D surface shows what happens with two parameters: the bowl becomes a paraboloid, and the minimum sits at the true parameter values.
In the 2D view, the sample contrast is "noisy" compared to the true discrepancy. With more data, the sample contrast converges to the true discrepancy: this is the law of large numbers at work!
Minimum Contrast Estimates
Since we can't compute $D(\theta_0, \theta)$ (we don't know $\theta_0$), we need a clever workaround. Here's the key insight:
Since $\rho(X, \theta)$ is an unbiased estimate of $D(\theta_0, \theta)$, we can minimize $\rho(X, \theta)$ instead of $D(\theta_0, \theta)$!
The Minimum Contrast Estimator
Define the minimum contrast estimate as:
$$\hat\theta = \arg\min_{\theta \in \Theta} \rho(X, \theta)$$
In words: $\hat\theta$ is the parameter value that minimizes the contrast function for the observed data $X$.
| Component | Meaning |
|---|---|
| $\arg\min$ | "argument that minimizes": returns the $\theta$, not the minimum value |
| $\theta \in \Theta$ | Search over all valid parameter values |
| $\rho(X, \theta)$ | Contrast function for observed $X$ |
Why This Works
The fundamental idea is beautifully simple:
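- If we could compute $D(\theta_0, \theta)$ and minimize it, we'd find $\theta_0$ (the truth)
- We can't compute $D(\theta_0, \theta)$ directly ($\theta_0$ is unknown)
- But $\rho(X, \theta)$ is an unbiased estimate of $D(\theta_0, \theta)$: for each fixed $\theta$, its expectation under $P_{\theta_0}$ is exactly $D(\theta_0, \theta)$
- So minimizing $\rho(X, \theta)$ should give us something close to $\theta_0$
The Sample Version
With observations $X_1, \dots, X_n$, we often use the average contrast:
$$\bar\rho(X, \theta) = \frac{1}{n} \sum_{i=1}^{n} \rho(X_i, \theta)$$
Minimizing this average is the basis for many estimation methods.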
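Different contrasts lead to different estimators, which a short numerical sketch can show. Below, the average squared-error contrast and the average absolute-error contrast are each minimized with `scipy.optimize.minimize_scalar`; the data (the coffee shop waits plus one made-up outlier) and the helper name `average_contrast` are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def average_contrast(theta: float, x: np.ndarray, kind: str) -> float:
    """Average contrast (1/n) * sum_i rho(x_i, theta) for two choices of rho."""
    if kind == "squared":
        return float(np.mean((x - theta) ** 2))
    if kind == "absolute":
        return float(np.mean(np.abs(x - theta)))
    raise ValueError(f"unknown contrast: {kind}")

x = np.array([2.3, 1.8, 4.1, 3.2, 9.5])  # coffee shop waits plus one made-up outlier

results = {}
for kind in ("squared", "absolute"):
    res = minimize_scalar(average_contrast, bounds=(x.min(), x.max()),
                          args=(x, kind), method="bounded")
    results[kind] = float(res.x)

print(f"squared-error minimizer:  {results['squared']:.3f} (sample mean   = {np.mean(x):.3f})")
print(f"absolute-error minimizer: {results['absolute']:.3f} (sample median = {np.median(x):.3f})")
```

Minimizing squared error recovers the sample mean, while minimizing absolute error recovers the sample median, which is less affected by the outlier.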
Estimating Equations
Now we introduce a powerful technique: instead of directly minimizing the contrast, we use calculus to find where the derivative equals zero.
The Gradient Condition
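Suppose $\Theta$ is Euclidean ($\Theta \subseteq \mathbb{R}^d$), $\theta_0$ is an interior point, and $D(\theta_0, \theta)$ is smooth in $\theta$. Then at the minimum:
$$\nabla_\theta D(\theta_0, \theta)\big|_{\theta = \theta_0} = 0$$
where $\nabla_\theta$ denotes the gradient (vector of partial derivatives):
$$\nabla_\theta = \left(\frac{\partial}{\partial \theta_1}, \dots, \frac{\partial}{\partial \theta_d}\right)^{\top}$$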
Why the Gradient?
At a minimum of a smooth function, the surface is "flat" — all partial derivatives are zero. The gradient collects all these derivatives into one vector equation.
The Estimating Equation
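Since we can't evaluate $D(\theta_0, \theta)$ directly, we use the sample version. The estimating equation is:
$$\nabla_\theta \rho(X, \hat\theta) = 0$$
In words: Find $\hat\theta$ such that the gradient of the contrast function (evaluated at $\hat\theta$) equals zero.
This Is a System of Equations
If $\Theta \subseteq \mathbb{R}^d$, then this gives us $d$ equations in $d$ unknowns:
$$\frac{\partial \rho}{\partial \theta_j}(X, \hat\theta) = 0, \qquad j = 1, \dots, d$$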
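A one-parameter case shows the gradient condition turning minimization into root-finding. The sketch below assumes exponentially distributed data with rate $\lambda$ and uses the negative log-likelihood $\rho(x, \lambda) = \sum_i (\lambda x_i - \log \lambda)$ as the contrast; the simulated data, the bracket passed to `brentq`, and the function name are illustrative choices.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
x = rng.exponential(scale=1 / 0.5, size=200)  # simulated data with true rate 0.5

def contrast_gradient(lam: float) -> float:
    """Derivative in lam of rho(x, lam) = sum_i (lam * x_i - log(lam))."""
    return float(np.sum(x - 1.0 / lam))

# The gradient is negative for small lam and positive for large lam,
# so the interval [1e-6, 100] brackets a root.
lam_hat = brentq(contrast_gradient, 1e-6, 100.0)

print(f"root of the estimating equation: {lam_hat:.4f}")
print(f"closed form 1 / mean(x):         {1 / x.mean():.4f}")
```

Setting the gradient to zero by hand gives $\hat\lambda = 1/\bar{x}$, which the numerical root should match.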
General Estimating Equations
There's an even more general framework that doesn't require starting from a contrast function. We directly specify estimating functions.
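The $\Psi$-Function Approach
"More generally, suppose we are given a function $\Psi: \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}^d$ ..."
Pure Symbolic Meaning
$\Psi$ is a function that takes two inputs, an observation $x \in \mathcal{X}$ and a parameter $\theta \in \mathbb{R}^d$,
and outputs a d-dimensional vector in $\mathbb{R}^d$:
$$\Psi: \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}^d$$
So Ψ is like:
$$\Psi(x, \theta) = \big(\psi_1(x, \theta), \dots, \psi_d(x, \theta)\big)^{\top}$$
Intuition
Instead of using gradients ($\nabla_\theta \rho$), this says:
Let's use any vector-valued function $\Psi$ whose expectation is 0 at the true $\theta_0$.
Think of $\Psi$ as a system of equations that the true θ must satisfy on average.
Examples:
- moment conditions
- score equations (log-likelihood derivative)
- sample equations from GMM
- regression normal equations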
What is Ψ (Psi)? — The "Error Detector"
Ψ is an "error detector" — a function that measures how "wrong" a candidate parameter θ is, given the observed data X.
The Key Idea: When θ equals the true parameter θ₀, the function Ψ(X, θ) should average to zero:
$$E_{\theta_0}\big[\Psi(X, \theta_0)\big] = 0$$
Think of it like a thermostat:
- If the room temperature (θ) matches the target (θ₀), the error signal is zero → Ψ = 0
- If the room is too cold, the error is negative → Ψ < 0 (heat more!)
- If the room is too hot, the error is positive → Ψ > 0 (cool down!)
Formally, $\Psi$ is a vector-valued function that takes data and a parameter, and outputs an "error vector":
$$\Psi(X, \theta) = \big(\psi_1(X, \theta), \dots, \psi_d(X, \theta)\big)^{\top}$$
where each component $\psi_j$ checks a different aspect of the parameter.
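Two Parameters (d = 2)
Use case: Normal distribution, estimating mean (μ) and variance (σ²), with Ψ(X, θ) = (X - μ, (X - μ)² - σ²)
Data: X = 5.2. Parameters: μ = 3.0, σ² = 2.0.
Output: Ψ = (5.2 - 3.0, 2.2² - 2.0) = (2.2, 2.84)
Three Parameters (d = 3)
Use case: Linear regression, y = β₀ + β₁x₁ + β₂x₂, with Ψ = (y - ŷ)·(1, x₁, x₂)
Data: y = 8, x₁ = 2, x₂ = 3. Parameters: β₀ = 1, β₁ = 2, β₂ = 1.
Output: the prediction is ŷ = 1 + 2·2 + 1·3 = 8, so the residual is 0 and Ψ = (0, 0, 0). All zeros: θ solves the estimating equation for this observation.
Notice: The output dimension matches the number of parameters (d). Each ψⱼ tells you how "off" the j-th parameter is. When all outputs are zero, you've found the estimate!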
| Symbol | Meaning | Intuition |
|---|---|---|
| $\Psi$ | Vector of estimating functions | The "error detector" that tells us how wrong θ is |
| $\psi_j$ | The $j$-th component function | Checks if the j-th parameter component is correct |
| $d$ | Dimension of parameter space | Number of parameters to estimate (d equations for d unknowns) |
| $\mathcal{X}$ | Data space | All possible values the data X can take |
📊 Concrete Example: Estimating the Mean
Suppose we want to estimate the mean μ of a distribution. A natural Ψ-function is:
Ψ(X, μ) = X - μ
| If... | Then Ψ = | Meaning |
|---|---|---|
| X = 5, μ = 5 | 0 | ✅ μ is correct! |
| X = 7, μ = 5 | +2 | ⬆️ μ is too low |
| X = 3, μ = 5 | -2 | ⬇️ μ is too high |
Key property: E[Ψ(X, μ₀)] = E[X - μ₀] = μ₀ - μ₀ = 0 when μ = μ₀ (the true mean).
Population Condition
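Define the population version:
$$V(\theta_0, \theta) = E_{\theta_0}\big[\Psi(X, \theta)\big]$$
We require that $V(\theta_0, \theta) = 0$ has $\theta = \theta_0$ as its unique solution.
This means: when the true parameter is $\theta_0$, the expected value of $\Psi(X, \theta)$ is zero only when $\theta = \theta_0$.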
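The population condition can be checked by simulation for the mean example, where $\Psi(X, \mu) = X - \mu$ and hence $V(\mu_0, \mu) = \mu_0 - \mu$. The sketch below assumes normal data with $\mu_0 = 5$; a large simulated sample stands in for the expectation, and the candidate values are illustrative.

```python
import numpy as np

def psi(x: np.ndarray, mu: float) -> np.ndarray:
    """Estimating function for a mean: psi(x, mu) = x - mu."""
    return x - mu

rng = np.random.default_rng(4)
mu_0 = 5.0
x = rng.normal(loc=mu_0, scale=2.0, size=500_000)  # large sample approximates E under mu_0

# Monte Carlo approximation of V(mu_0, mu) = E[psi(X, mu)] = mu_0 - mu
V_approx = {mu: float(psi(x, mu).mean()) for mu in (3.0, 5.0, 7.0)}

for mu, v in V_approx.items():
    print(f"mu = {mu}: V is approximately {v:+.3f}")
```

Only the candidate equal to the true mean gives an average near zero; the others are off by about $\mu_0 - \mu$.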
The Estimating Equation Estimate
We define $\hat\theta$ as the solution to:
$$\Psi(X, \hat\theta) = 0$$
This is $d$ equations in $d$ unknowns: one equation for each component of $\theta$.
Connection to Contrast Functions
When $\Psi(X, \theta) = \nabla_\theta \rho(X, \theta)$, the estimating equation approach reduces to the minimum contrast approach. But the $\Psi$-function framework is more general!
Intuitive Examples
Let's make these abstract concepts concrete with familiar examples.
Example 1: Method of Moments
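Setup: Estimate the mean $\mu$ from data $X_1, \dots, X_n$.
Estimating function: $\psi(x, \mu) = x - \mu$
Estimating equation: $\sum_{i=1}^{n} (X_i - \hat\mu) = 0$
Solution: $\hat\mu = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}$
The sample mean! This is the simplest estimating equation estimate.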
Example 2: Maximum Likelihood (Preview)
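Setup: Observations $X_1, \dots, X_n$ with density $p(x, \theta)$.
Contrast function (negative log-likelihood): $\rho(X, \theta) = -\sum_{i=1}^{n} \log p(X_i, \theta)$
Estimating equation (score equation): $\sum_{i=1}^{n} \nabla_\theta \log p(X_i, \hat\theta) = 0$
This is the famous score equation, the foundation of Maximum Likelihood Estimation (covered in detail in Chapter 12).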
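As a short worked instance of the score equation, assume (for illustration only) that the observations are Poisson with rate $\lambda$, so that $p(x, \lambda) = e^{-\lambda} \lambda^{x} / x!$. Then:

```latex
\begin{aligned}
\log p(x, \lambda) &= -\lambda + x \log \lambda - \log x! \\
\sum_{i=1}^{n} \frac{\partial}{\partial \lambda} \log p(X_i, \hat\lambda)
  &= \sum_{i=1}^{n} \left( \frac{X_i}{\hat\lambda} - 1 \right) = 0 \\
\hat\lambda &= \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}
\end{aligned}
```

So for this model the score equation has a closed-form solution (provided $\bar{X} > 0$), and it coincides with the sample mean.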
Example 3: Least Squares
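Setup: Linear regression $y = Z\beta + \varepsilon$, with response vector $y$ and design matrix $Z$ whose rows are $z_i^{\top}$.
Contrast function (sum of squared errors): $\rho(\beta) = \sum_{i=1}^{n} (y_i - z_i^{\top}\beta)^2 = \lVert y - Z\beta \rVert^2$
Estimating equation (normal equations): $Z^{\top}(y - Z\hat\beta) = 0$
Solution: $\hat\beta = (Z^{\top} Z)^{-1} Z^{\top} y$ (when $Z^{\top} Z$ is invertible)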
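The normal equations can be solved directly with linear algebra. The sketch below uses simulated data with true coefficients (2, 3); the names `Z`, `beta_hat`, and `psi_at_solution` are illustrative, and `np.linalg.lstsq` is used only as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
z = rng.uniform(0, 10, size=n)
Z = np.column_stack([np.ones(n), z])           # design matrix with an intercept column
y = 2.0 + 3.0 * z + rng.normal(0, 1, size=n)   # simulated data with true beta = (2, 3)

# Normal equations: Z^T (y - Z beta) = 0  <=>  (Z^T Z) beta = Z^T y
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)

# The estimating function evaluated at the solution should vanish (up to rounding)
psi_at_solution = Z.T @ (y - Z @ beta_hat)

# Cross-check against NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(Z, y, rcond=None)

print(f"beta_hat (normal equations): {beta_hat}")
print(f"beta_hat (lstsq):            {beta_lstsq}")
print(f"estimating function at beta_hat: {psi_at_solution}")
```

Solving the linear system avoids forming an explicit matrix inverse, which is usually the more stable choice numerically.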
Complete Symbol Glossary
Here's a comprehensive reference for all the symbols in this section:
| Symbol | Name | Meaning |
|---|---|---|
| $X$ | Observation | The data we observe (random vector) |
| $x$ | Realized data | Specific observed values (lowercase = fixed) |
| $\mathcal{X}$ | Sample space | All possible values $X$ could take |
| $P$ | Distribution | Probability law governing $X$ |
| $\mathcal{P}$ | Distribution family | Set of candidate distributions |
| $\theta$ | Parameter | Unknown quantity we want to estimate |
| $\theta_0$ | True parameter | The actual parameter value (unknown) |
| $\Theta$ | Parameter space | All valid parameter values |
| $P_\theta$ | Parametric distribution | Distribution when parameter is $\theta$ |
| $\hat\theta$ | Estimator/estimate | Our guess for $\theta$ based on data |
| $\rho(x, \theta)$ | Contrast function | Measures incompatibility of $\theta$ with $x$ |
| $D(\theta_0, \theta)$ | Discrepancy | Expected contrast under true $\theta_0$ |
| $E_\theta$ | Expectation | Expected value under distribution $P_\theta$ |
| $\nabla_\theta$ | Gradient | Vector of partial derivatives w.r.t. $\theta$ |
| $\Psi(x, \theta)$ | Estimating function | Vector-valued function defining equations |
| $V(\theta_0, \theta)$ | Population moment | Expected value of $\Psi$ under $\theta_0$ |
Python Implementation
Implementing a General Estimating Equation Solver
```python
import numpy as np
from scipy.optimize import fsolve
from typing import Callable


def solve_estimating_equation(
    psi: Callable[[np.ndarray, np.ndarray], np.ndarray],
    X: np.ndarray,
    theta_init: np.ndarray,
) -> np.ndarray:
    """
    Solve the estimating equation sum_i Psi(x_i, theta) = 0.

    Parameters:
    -----------
    psi : Callable
        Estimating function Psi(x_i, theta) -> R^d, applied to one observation
    X : np.ndarray
        Observed data
    theta_init : np.ndarray
        Initial guess for theta

    Returns:
    --------
    np.ndarray : Solution theta_hat
    """
    # Define the equation to solve: sum over observations
    def equation(theta: np.ndarray) -> np.ndarray:
        return np.sum([psi(x_i, theta) for x_i in X], axis=0)

    # Solve using scipy's fsolve
    theta_hat = fsolve(equation, theta_init)
    return theta_hat


# Example 1: Method of Moments for Normal(mu, sigma^2)
def psi_normal(x: float, theta: np.ndarray) -> np.ndarray:
    """
    Estimating function for Normal(mu, sigma^2).
    theta = (mu, sigma^2)
    """
    mu, sigma2 = theta
    return np.array([
        x - mu,                # For mu: E[X - mu] = 0
        (x - mu)**2 - sigma2   # For sigma^2: E[(X-mu)^2 - sigma^2] = 0
    ])


# Generate data from Normal(5, 4)
np.random.seed(42)
X = np.random.normal(loc=5, scale=2, size=100)

# Solve
theta_hat = solve_estimating_equation(psi_normal, X, theta_init=np.array([0.0, 1.0]))
print("True theta = (5, 4)")
print(f"Estimated theta_hat = ({theta_hat[0]:.4f}, {theta_hat[1]:.4f})")
```
Minimum Contrast Estimation
```python
import numpy as np
from scipy.optimize import minimize
from typing import Callable, Optional, Sequence, Tuple


def minimum_contrast_estimate(
    rho: Callable[[np.ndarray, np.ndarray], float],
    X: np.ndarray,
    theta_init: np.ndarray,
    bounds: Optional[Sequence[Tuple[float, float]]] = None,
) -> np.ndarray:
    """
    Find theta that minimizes the contrast function rho(X, theta).

    Parameters:
    -----------
    rho : Callable
        Contrast function rho(X, theta) -> R
    X : np.ndarray
        Observed data
    theta_init : np.ndarray
        Initial guess
    bounds : sequence of (min, max) pairs, optional
        Parameter bounds

    Returns:
    --------
    np.ndarray : Minimum contrast estimate theta_hat
    """
    def objective(theta: np.ndarray) -> float:
        return rho(X, theta)

    result = minimize(objective, theta_init, bounds=bounds, method='L-BFGS-B')
    return result.x


# Example: Least Squares Regression
def least_squares_contrast(X: np.ndarray, beta: np.ndarray) -> float:
    """
    Contrast function: sum of squared residuals.
    X has columns [1, x1, x2, ..., y] (last column is response)
    """
    design = X[:, :-1]   # Features
    y = X[:, -1]         # Response
    y_pred = design @ beta
    return np.sum((y - y_pred)**2)


# Generate linear regression data: y = 2 + 3*x + noise
np.random.seed(42)
n = 100
x = np.random.uniform(0, 10, n)
y = 2 + 3*x + np.random.normal(0, 1, n)
X = np.column_stack([np.ones(n), x, y])  # [1, x, y]

# Estimate
beta_hat = minimum_contrast_estimate(
    least_squares_contrast, X,
    theta_init=np.array([0.0, 0.0])
)
print("True beta = (2, 3)")
print(f"Estimated beta_hat = ({beta_hat[0]:.4f}, {beta_hat[1]:.4f})")
```
Visualizing the Discrepancy Function
```python
import numpy as np
import matplotlib.pyplot as plt

# Setup: Estimate mean mu of Normal(mu, 1)
true_mu = 3.0
n = 50

# Generate data
np.random.seed(42)
X = np.random.normal(loc=true_mu, scale=1, size=n)

# Define contrast function (negative log-likelihood per observation, up to a constant)
def contrast(x_i, mu):
    """rho(x, mu) = (x - mu)^2 / 2 (negative log-likelihood up to an additive constant)"""
    return (x_i - mu)**2 / 2

# Compute sample contrast for range of mu values
mu_range = np.linspace(0, 6, 200)
sample_contrast = [np.mean([contrast(x_i, mu) for x_i in X])
                   for mu in mu_range]

# The true discrepancy D(mu_0, mu) = E[(X - mu)^2/2] = ((mu - mu_0)^2 + 1)/2
true_discrepancy = ((mu_range - true_mu)**2 + 1) / 2

# Plot (raw f-strings keep the LaTeX backslashes intact)
plt.figure(figsize=(10, 6))
plt.plot(mu_range, sample_contrast, 'b-', linewidth=2,
         label=r'Sample contrast $\bar{\rho}(X, \mu)$')
plt.plot(mu_range, true_discrepancy, 'r--', linewidth=2,
         label=r'True discrepancy $D(\mu_0, \mu)$')
plt.axvline(true_mu, color='g', linestyle=':', linewidth=2,
            label=rf'True $\mu_0$ = {true_mu}')
plt.axvline(np.mean(X), color='purple', linestyle=':', linewidth=2,
            label=rf'Estimate $\hat{{\mu}}$ = {np.mean(X):.3f}')

plt.xlabel(r'$\mu$', fontsize=14)
plt.ylabel('Contrast / Discrepancy', fontsize=14)
plt.title('Sample Contrast vs Population Discrepancy', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('discrepancy_visualization.png', dpi=150)
plt.show()

# The minimum of sample contrast gives our estimate (up to the grid resolution)
mu_hat = mu_range[np.argmin(sample_contrast)]
print(f"True mu = {true_mu}")
print(f"Sample mean = {np.mean(X):.4f}")
print(f"Minimizer of sample contrast = {mu_hat:.4f}")
```
Key Insights
The Estimation Recipe
How to Construct an Estimator
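- Choose a contrast function $\rho(X, \theta)$ that measures how "incompatible" $\theta$ is with the data
- Verify the key property: $D(\theta_0, \theta) = E_{\theta_0}[\rho(X, \theta)]$ is minimized at $\theta = \theta_0$
- Option A: Directly minimize $\rho(X, \theta)$ over $\theta \in \Theta$
- Option B: Solve the estimating equation $\nabla_\theta \rho(X, \hat\theta) = 0$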
Why This Framework Is Powerful
- Unifying: Maximum Likelihood, Least Squares, Method of Moments are all special cases
- Flexible: Can handle complex models by choosing appropriate contrast functions
- Analyzable: Properties of $\hat\theta$ (bias, variance, consistency) follow from properties of $\rho$
- Computationally tractable: Estimating equations are often easier to solve than direct optimization
Caution: Uniqueness
Not all contrast functions have unique minima! When $\rho(X, \theta)$ has multiple local minima, different starting points may give different estimates. Always check that your estimator is well-defined.
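A small example makes the caution concrete. The sketch below uses the negative log-likelihood of a Cauchy location model on two made-up observations far apart, a setting where the contrast has two separate local minima; the data, the starting points, and the function name are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([-5.0, 5.0])  # two made-up observations, far apart

def cauchy_contrast(theta: np.ndarray) -> float:
    """Negative log-likelihood (up to a constant) for a Cauchy location model."""
    return float(np.sum(np.log1p((x - theta[0]) ** 2)))

# Same contrast, same data, two different starting points
fit_left = minimize(cauchy_contrast, x0=np.array([-4.0]))
fit_right = minimize(cauchy_contrast, x0=np.array([4.0]))

print(f"start at -4: theta_hat = {fit_left.x[0]:+.3f}, contrast = {fit_left.fun:.3f}")
print(f"start at +4: theta_hat = {fit_right.x[0]:+.3f}, contrast = {fit_right.fun:.3f}")
```

For this data the stationary points can be found by hand: $\theta = 0$ is a local maximum and $\theta = \pm\sqrt{24}$ are two equally good minima, so a local optimizer returns whichever one is nearer its starting point.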
Summary
This section introduced the foundational framework for estimation theory. Here are the key takeaways:
Core Concepts
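| Concept | Key Formula | Intuition |
|---|---|---|
| Parametric model | $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ | Data comes from a distribution indexed by $\theta$ |
| Estimator | $\hat\theta = \hat\theta(X)$ | Recipe for guessing $\theta$ from data |
| Contrast function | $\rho(x, \theta)$ | Measures incompatibility of $\theta$ with $x$ |
| Discrepancy | $D(\theta_0, \theta) = E_{\theta_0}[\rho(X, \theta)]$ | Average contrast under true $\theta_0$ |
| Min contrast estimate | $\hat\theta = \arg\min_{\theta \in \Theta} \rho(X, \theta)$ | $\theta$ that best fits the data |
| Estimating equation | $\nabla_\theta \rho(X, \hat\theta) = 0$ | Find where gradient vanishes |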
The Big Ideas
- Estimation is about finding the parameter that generated our data — we use observable quantities to infer unobservable truths
- Contrast functions formalize "goodness of fit" — smaller contrast means better compatibility with data
- The key requirement is identification: $D(\theta_0, \theta)$ must be uniquely minimized at $\theta = \theta_0$
- Estimating equations are often easier to solve — differentiation turns optimization into root-finding
Coming Next: In the next section, we'll explore the key properties of estimators: bias, variance, and mean squared error (MSE). These concepts help us evaluate how "good" an estimator is and choose between competing estimation strategies.