Chapter 11

Estimators and Their Properties

Point Estimation

Learning Objectives

Before You Start

You should be comfortable with probability distributions, random variables, and basic calculus (derivatives and optimization). Familiarity with expected values will also help.

By the end of this section, you will be able to:

🎯
Understand the Parametric Framework

What it means to work with a family of distributions indexed by parameters

🔧
Define an Estimator

A function that maps observations to parameter estimates — data in, estimate out

📏
Explain Contrast Functions

Why they measure "discrepancy" between candidate parameters and the truth

📉
Derive Minimum Contrast Estimates

By minimizing the empirical contrast — finding the "best fit" parameter

✏️
Set Up Estimating Equations

Using gradient conditions and general $\Psi$-functions

🚀
Apply These Concepts

To real estimation problems including Maximum Likelihood Estimation (MLE)

🏆What You'll Build

By the end of this section, you'll understand the complete estimation pipeline and be ready to implement an MLE estimator from scratch. You'll see that all estimators — from simple sample means to neural network training — follow the same fundamental pattern.


The Big Picture: Why Estimation Matters

Statistics is fundamentally about learning from data. We observe data, but we want to know about the underlying process that generated it.
The Problem: Coffee Shop Wait Times

A coffee shop manager wants to know the true average wait time for customers. Measuring every single customer forever is impossible, so they decide to sample customers and estimate the average.

Unknown Truth
True mean μ = 2.847 minutes (hidden from us)
What We Do
Sample n customers and compute θ̂ = sample mean
The Question
How close is θ̂ to μ? Does more data help?
🎮Try It Yourself: See Estimation in Action

Sample Size vs Estimate Accuracy

Watch how your estimate improves as you collect more data. The true mean is 2.847 — can you get close?

[Interactive chart: sample mean θ̂ plotted against sample size n (selectable from 5 to 10,000), with the true mean μ = 2.847 marked. Snapshot at n = 10: sample mean 2.595, true mean 2.847, error 0.252. The error decreases as n increases (Law of Large Numbers).]
Key Insight

With only n = 10 samples, there's still considerable uncertainty. Increase the sample size to watch the estimate converge toward the true value.
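The shrinking error can also be checked offline with a small simulation. This is only a sketch: the exponential wait-time model, the seed, and the helper name `mean_abs_error` are assumptions made for illustration, not part of the interactive demo.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_MEAN = 2.847  # the hidden "true" mean wait time used in this section

def mean_abs_error(n, reps=200):
    """Average |sample mean - TRUE_MEAN| over `reps` simulated samples of size n."""
    # Assumption: wait times are exponential with mean TRUE_MEAN.
    samples = rng.exponential(scale=TRUE_MEAN, size=(reps, n))
    return np.abs(samples.mean(axis=1) - TRUE_MEAN).mean()

sizes = [5, 10, 50, 100, 500, 1000, 5000, 10000]
errors = {n: mean_abs_error(n) for n in sizes}
for n, err in errors.items():
    print(f"n = {n:>5}: typical error = {err:.3f}")
```

Averaging over many repetitions smooths out the luck of any single sample, so the printed errors should fall roughly like 1/√n.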

The Detective Analogy

Think of estimation like being a detective. You arrive at a crime scene and find clues (your data). You can't rewind time to see exactly what happened (the true parameters), but you can use the clues to reconstruct what most likely occurred.

🔍Detective Work
  • Clues at crime scene → Evidence
  • Reconstruct what happened → Theory
  • Better methods → Closer to truth
  • Can never be 100% certain → Uncertainty
📊Statistical Estimation
  • Sample observations → $X_1, X_2, \ldots, X_n$
  • Apply estimator → $\hat{\theta}(X)$
  • Good estimator → $\hat{\theta} \approx \theta_0$
  • Quantify uncertainty → Confidence intervals

The better your detective methods (estimators), the closer your reconstruction gets to the truth. And just like in detective work, some methods are provably better than others.

A Concrete Example: The Coffee Shop Mystery

Imagine you're studying customer wait times at a coffee shop. You collect data: 2.3 min, 1.8 min, 4.1 min, 3.2 min...

The central question: What is the "true" average wait time for ALL customers (past, present, and future)?

Here's the key insight: that "true average" exists somewhere out there as a fixed number — maybe it's exactly 2.847 minutes. We'll never know it precisely, but we can get closer and closer with more data and smarter methods.

🧠The Mental Model: Estimation Pipeline
Reality ($\theta_0$, the hidden truth) → generates data (sampling) → we observe $X$ (your data) → estimator $T(\cdot)$ (your method) → estimate $\hat{\theta}$ (your guess), with the goal $\hat{\theta} \approx \theta_0$

The entire estimation process: from hidden truth, through observable data, to our best guess

🎯
The Two Worlds of Estimation
What We Have (Observable)
  • Your sample: [2.3, 1.8, 4.1, 3.2, ...]
  • Sample mean: 2.85 min
  • Sample size: n = 100
What We Want (Hidden)
  • True population mean: $\mu$ = ???
  • True variance: $\sigma^2$ = ???
  • The actual data-generating process
🔗Estimation builds the bridge from observable to hidden

The One Formula to Remember

$$\hat{\theta} = T(X_1, X_2, \ldots, X_n)$$

An estimator is just a function of your data. That's it. You put data in, you get a parameter estimate out. The magic is in choosing which function T(·) to use — that's what this chapter teaches you.
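To make "an estimator is just a function" concrete, here is a minimal sketch with three different choices of T applied to the four wait times quoted in this section. The function names are illustrative, and nothing here says which choice is best.

```python
import statistics

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_median(xs):
    return statistics.median(xs)

def midrange(xs):
    return (min(xs) + max(xs)) / 2

wait_times = [2.3, 1.8, 4.1, 3.2]  # minutes

# Same data in, a different estimate out for each choice of T.
for T in (sample_mean, sample_median, midrange):
    print(f"{T.__name__}: {T(wait_times):.3f}")
```

All three are legitimate estimators of a "typical" wait time; comparing them is exactly the question of estimator quality raised later in the chapter.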

Why Should You Care? (The ML Connection)

Every time you train a machine learning model, you're doing estimation:

| ML Task | What You're Estimating |
|---|---|
| Linear Regression | Slope and intercept parameters |
| Neural Network Training | Millions of weight parameters |
| GPT/LLM Training | Next-token probability distribution |
| Bayesian Inference | Posterior distribution over parameters |
🤔Quick Check

When you run model.fit(X_train, y_train) in scikit-learn, which of the "Three Questions of Estimation" (listed later in this section) is scikit-learn answering for you?

Answer: Question 2, "How should we estimate it?" scikit-learn has already chosen the estimator for you (e.g., Ordinary Least Squares for LinearRegression). You provide the data, and it applies the predefined estimation algorithm.

The Universal Pattern

All estimation follows the same pattern: Data → Estimator → Estimate. Whether you're computing a sample mean or training GPT-4, you're applying a function to data to get parameter estimates.

The Three Questions of Estimation

Every estimation problem comes down to three questions:

  1. What should we estimate? — Defining the parameter(s) of interest
  2. How should we estimate it? — Choosing an estimator (this chapter!)
  3. How good is our estimate? — Quantifying uncertainty (next chapters)

Real-World Estimation: From Problems to Mathematical Models

Estimation isn't just abstract mathematics: it solves concrete problems every day, and real-world challenges map directly onto the estimation framework we're learning.

The Universal Pattern Across All Examples

Every estimation problem follows the same structure: (1) a true parameter $\theta_0$ exists but is unknown, (2) the data $X$ is a sample from the population, (3) a contrast function $\rho(X, \theta)$ measures how poorly $\theta$ fits the data, and (4) the estimate $\hat{\theta}$ minimizes the contrast. This is the universal language of estimation.

The Estimation Problem (Formal)

Given observed data $X = (X_1, X_2, \ldots, X_n)$, construct a function $\hat{\theta}(X)$ that gives us a "good guess" for the unknown parameter $\theta$.

What Makes a Guess "Good"?

This is the million-dollar question! We want estimators that are (1) unbiased: correct on average, (2) consistent: converging to the true value as the sample size grows, and (3) efficient: having the smallest variance among comparable estimators. We'll formalize these properties throughout this chapter.


The Parametric Framework

Before we can estimate anything, we need to set up our mathematical framework precisely. Let's decode the notation:

The Setup: Observation Space and Probability Families

$$X \in \mathcal{X}, \quad X \sim P \in \mathcal{P}$$

Let's break this down symbol by symbol:

| Symbol | Name | What It Means |
|---|---|---|
| $X$ | Observation vector | The data we actually observe (e.g., n wait times) |
| $\mathcal{X}$ | Sample space | All possible values $X$ could take |
| $P$ | Probability distribution | The unknown law governing how $X$ is generated |
| $\mathcal{P}$ | Probability family | The set of all candidate distributions we consider |

Think of the Sample Space as the Data Universe

If you're measuring wait times (positive numbers), then $\mathcal{X} = \mathbb{R}^n_{+}$, the set of all n-tuples of positive real numbers.

The Parametric Assumption

In the parametric case, we assume the true distribution $P$ belongs to a specific family indexed by parameters:

$$\mathcal{P} = \{P_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \Theta\}$$

| Symbol | Name | What It Means |
|---|---|---|
| $\theta$ | Parameter vector | The unknown quantities we want to estimate |
| $\Theta$ | Parameter space | All possible values $\theta$ could take |
| $P_{\theta}$ | Parametric distribution | The distribution when the parameter equals $\theta$ |

Example: Normal Distribution

For normally distributed data: $\boldsymbol{\theta} = (\mu, \sigma^2)$, and $\Theta = \mathbb{R} \times \mathbb{R}^{+}$ (the mean can be any real number, the variance must be positive).

The key insight: Once we specify $\theta$, we completely determine the probability distribution. The estimation problem becomes: Which $\theta$ generated our data?
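One way to picture the map from $\theta$ to $P_\theta$ is as a function that returns a distribution object. The sketch below uses SciPy's frozen normal distribution; the helper name `normal_family` and the example parameter values are assumptions for illustration.

```python
import math
from scipy import stats

def normal_family(theta):
    """Map theta = (mu, sigma2) in Theta = R x R+ to the distribution P_theta."""
    mu, sigma2 = theta
    if sigma2 <= 0:
        raise ValueError("sigma2 must be positive: theta lies outside Theta")
    return stats.norm(loc=mu, scale=math.sqrt(sigma2))

# Fixing theta pins down every probability statement about X.
p = normal_family((2.847, 1.5))
print("mean:", p.mean(), "variance:", p.var())
print("P(X <= 4.0):", p.cdf(4.0))
```

Each valid $\theta$ yields one member of the family, and values outside $\Theta$ are rejected, which mirrors the role of the parameter space.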


What Is an Estimator and Estimate?

An estimator $\hat{\theta}(X)$ is a machine (function) that turns data $X$ into guesses. An estimate $\hat{\theta}$ is the actual guess.

The estimator predicts the true parameter $\theta$ of the population using sample data $X$ from that population.

Formally, an estimator $\hat{\boldsymbol{\theta}}(X)$ is a function of the observation vector $X$ that produces an estimate of the unknown parameter $\theta$.

$$\hat{\boldsymbol{\theta}} : \mathcal{X} \to \Theta$$

This notation emphasizes three crucial points:

  1. $\hat{\theta}$ is a function: it takes data as input and produces a parameter estimate as output
  2. $\hat{\theta}$ depends only on $X$: we can only use observable data, not the true (unknown) $\theta$
  3. $\hat{\theta}$ lives in $\Theta$: the estimate should be a valid parameter value

The Hat Notation

The "hat" symbol always denotes an estimate or estimator. When you see $\hat{\theta}$, think: "this is our data-based guess for $\theta$."

Estimator vs Estimate

A subtle but important distinction:

| Term | Symbol | What It Is |
|---|---|---|
| Estimator | $\hat{\theta}(X)$ | The function/rule itself (random, before seeing data) |
| Estimate | $\hat{\theta}(x)$ | The specific value obtained after observing $x$ (fixed number) |

The estimator $\hat{\theta}(X)$ is a random variable because $X$ is random. The estimate $\hat{\theta}(x)$ is a specific number computed from realized data $x$.
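The distinction can be seen in a short simulation, sketched here under assumed values (a Normal population with mean 2.847 and standard deviation 1, samples of size 25). One dataset gives one fixed estimate, while redrawing the data many times reveals the estimator's own distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, sigma, n = 2.847, 1.0, 25  # hypothetical truth and sample size

def estimator(x):
    """The rule T: here, the sample mean."""
    return x.mean()

# One realized dataset x gives one estimate: a fixed number.
x = rng.normal(mu0, sigma, size=n)
estimate = estimator(x)

# Redrawing X many times shows the estimator behaving as a random variable.
estimates = np.array([estimator(rng.normal(mu0, sigma, size=n))
                      for _ in range(5000)])

print(f"one estimate: {estimate:.3f}")
print(f"spread of the estimator: {estimates.std():.3f} "
      f"(theory: {sigma / np.sqrt(n):.3f})")
```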


Contrast Functions: Measuring Discrepancy

How do we find a "good" estimator? The key idea is to define a contrast function that measures how "far" any candidate parameter is from the truth.

Definition of Contrast Function

$$\rho : \mathcal{X} \times \Theta \to \mathbb{R}$$

A contrast function $\rho$ (Greek letter "rho") is a function that takes:

  • An observation $X$ from the sample space $\mathcal{X}$
  • A candidate parameter $\theta$ from the parameter space $\Theta$

And produces a real number measuring the "discrepancy" between $\theta$ and the truth.

Intuition for the Contrast Function

Think of $\rho(X, \theta)$ as answering: "How incompatible is this candidate parameter $\theta$ with the observed data $X$?"

  • Small $\rho(X, \theta)$ means $\theta$ is compatible with data $X$
  • Large $\rho(X, \theta)$ means $\theta$ is incompatible with data $X$

Wait — How Can We Compare Data and Parameters?

You might be confused: $X$ is data (numbers we observed), and $\theta$ is a parameter (an abstract quantity describing a distribution). These are completely different types of objects! How can a function take both and produce a meaningful "discrepancy"?

The Key Insight

The contrast function $\rho(X, \theta)$ does NOT directly compare $X$ and $\theta$. Instead, it asks:

"If $\theta$ were the true parameter, how surprising or unlikely would this observed data $X$ be?"

Here's how it works:

  1. The parameter $\theta$ defines a probability model: it tells us what the data should look like if $\theta$ were true
  2. The data $X$ is what we actually observed: the real numbers we collected
  3. The function $\rho$ measures the "fit": how well does what we saw match what we'd expect if $\theta$ were true?


The Bridge Between Data and Parameters

The contrast function $\rho$ acts as a bridge between the world of data ($X$) and the world of parameters ($\theta$):

  • $\theta$ generates expectations: "what data should look like"
  • $X$ is reality: "what data actually looks like"
  • $\rho(X, \theta)$ measures the gap: "how well does expectation match reality?"
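As a small illustration of the bridge, the sketch below scores a few candidate values of $\theta$ against the four wait times quoted earlier, using a squared-error contrast. The choice of contrast and the candidate values are assumptions for the example.

```python
import numpy as np

x = np.array([2.3, 1.8, 4.1, 3.2])  # observed wait times (minutes)

def rho(x, theta):
    """Squared-error contrast: average squared gap between the data and theta."""
    return float(np.mean((x - theta) ** 2))

# Candidates far from the bulk of the data are judged less compatible.
for theta in (1.0, 2.0, 2.85, 4.0):
    print(f"theta = {theta:4.2f} -> rho = {rho(x, theta):.3f}")
```

The candidate 2.85 (the sample mean of these four values) receives the smallest contrast of those listed.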

The Key Requirement

For $\rho$ to be a useful contrast function, we need a special property. Define the population discrepancy:

$$D(\theta_0, \theta) \equiv E_{\theta_0}[\rho(X, \theta)]$$

This measures the average discrepancy when the true parameter is $\theta_0$ and we're evaluating at $\theta$.

The Fundamental Requirement

For $\rho$ to be a valid contrast function, we require:

$$D(\theta_0, \theta) \text{ is uniquely minimized at } \theta = \theta_0$$

In plain English: When averaged over the true distribution, the discrepancy is smallest exactly at the true parameter.

This requirement ensures that if we knew the truth ($\theta_0$), the contrast function would correctly identify it as the best choice.
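A short worked case shows the requirement in action. Assume, purely for illustration, a squared-error contrast $\rho(X, \theta) = (X - \theta)^2$ and a model in which $X$ has mean $\theta_0$ and variance $\sigma^2$:

```latex
\begin{aligned}
D(\theta_0, \theta)
  &= E_{\theta_0}\big[(X - \theta)^2\big] \\
  &= E_{\theta_0}\big[(X - \theta_0)^2\big]
     + 2(\theta_0 - \theta)\,E_{\theta_0}[X - \theta_0]
     + (\theta_0 - \theta)^2 \\
  &= \sigma^2 + (\theta - \theta_0)^2 .
\end{aligned}
```

The middle term vanishes because $E_{\theta_0}[X - \theta_0] = 0$. The result is a parabola in $\theta$ whose unique minimum sits at $\theta = \theta_0$, so this contrast satisfies the requirement under the stated assumptions.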


The Discrepancy Function $D(\theta_0, \theta)$

Let's understand the discrepancy function more deeply:

$$D(\theta_0, \theta) \equiv E_{\theta_0}[\rho(X, \theta)] = \int_{\mathcal{X}} \rho(x, \theta) \, f(x; \theta_0) \, dx$$

  • $\int_{\mathcal{X}} \cdots \, dx$: integrate over all possible observations in $\mathcal{X}$
  • $\rho(x, \theta)$: contrast at candidate parameter $\theta$
  • $f(x; \theta_0)$: density of the data under the true parameter $\theta_0$

In words: We weight the contrast $\rho(x, \theta)$ by how likely each observation $x$ is under the true distribution, then sum/integrate over all possibilities.

💡 Intuitive Meaning of the Discrepancy Function

The discrepancy function measures how bad a guessed model $\theta$ is on average, when the world is actually governed by the true parameter $\theta_0$.

It evaluates the loss for every possible observation, weighted by how frequently that observation occurs in reality.

In machine learning, this is the ideal objective we want to minimize, and training loss is simply its finite-data approximation.

For discrete distributions, the integral becomes a sum:

$$D(\theta_0, \theta) = \sum_{x \in \mathcal{X}} \rho(x, \theta) \, P_{\theta_0}(X = x)$$

Intuition

Put differently, $D(\theta_0, \theta)$ is the expected pain of believing $\theta$ when the world is actually governed by $\theta_0$.

🧠 God's Eye Interpretation

Imagine God knows the true parameter $\theta_0$.

God repeatedly simulates infinitely many datasets from the true distribution $f(x; \theta_0)$.

For each simulated data point $x$, God evaluates how wrong your guess $\theta$ is using $\rho(x, \theta)$.

The discrepancy $D(\theta_0, \theta)$ is the average of that mistake over all possible worlds.

This explains why:

  • It uses the true distribution
  • It integrates over all possible observations
  • It is a population-level measure, not a sample-level one
⚙️ Engineering Analogy

$\theta_0$ = true physical parameter of a system

$\theta$ = your estimated model

$x$ = sensor reading

$\rho(x, \theta)$ = prediction error

Then $D(\theta_0, \theta)$ = expected prediction error over infinite future measurements.

"If I keep using model $\theta$ forever, how badly will I perform on average?"

Why do we integrate using the true distribution $f(x; \theta_0)$?

Because discrepancy is not about how good the model thinks it is — it is about how reality judges the model.

The true distribution tells us which observations actually occur in the real world. We care about performance on real data, not hypothetical data from our model.

🤖 Machine Learning View

In machine learning, we never know $\theta_0$. So we replace this population discrepancy with a sample average:

$$D(\theta_0, \theta) \approx \frac{1}{n} \sum_{i=1}^{n} \rho(x_i, \theta)$$
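The quality of this approximation can be checked numerically in a case where the population discrepancy has a closed form. The sketch below assumes data from Normal($\theta_0$, 1) and the contrast $\rho(x, \theta) = (x - \theta)^2 / 2$, for which $D(\theta_0, \theta) = ((\theta - \theta_0)^2 + 1)/2$; the particular values of $\theta_0$ and $\theta$ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, theta = 3.0, 4.0  # hypothetical truth and one candidate

def rho(x, theta):
    return (x - theta) ** 2 / 2

# Closed form for Normal(theta0, 1) data with this contrast.
population_D = ((theta - theta0) ** 2 + 1) / 2

empirical = {}
for n in (10, 1000, 200_000):
    x = rng.normal(theta0, 1.0, size=n)
    empirical[n] = float(rho(x, theta).mean())
    print(f"n = {n:>7}: sample average = {empirical[n]:.4f} "
          f"(population D = {population_D:.4f})")
```

With small n the sample average can be far off; by n = 200,000 it should sit very close to the population value.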

🔗 This Is Your PyTorch/TensorFlow Loss Function!

Many standard deep learning training losses can be read as special cases of the contrast function $\rho(x, \theta)$:

GPT / BERT / Classification

Cross-Entropy Loss: used in language models and classifiers

$$\rho(x, \theta) = -\sum_{c=1}^{C} y_c \log \hat{y}_c(\theta)$$

When a GPT-style model is trained to predict the next token, it minimizes the average of this loss over a very large corpus of text.
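For a single example, the cross-entropy contrast reduces to the negative log of the probability assigned to the correct class. A minimal NumPy sketch, with made-up probability vectors for illustration:

```python
import numpy as np

def cross_entropy(y_onehot, y_hat):
    """rho = -sum_c y_c * log(y_hat_c) for one example."""
    return float(-np.sum(y_onehot * np.log(y_hat)))

y = np.array([0.0, 1.0, 0.0])            # true class is index 1
confident = np.array([0.05, 0.90, 0.05])  # hypothetical model outputs
unsure = np.array([0.30, 0.40, 0.30])
wrong = np.array([0.80, 0.10, 0.10])

for name, y_hat in [("confident", confident), ("unsure", unsure), ("wrong", wrong)]:
    print(f"{name:>9}: rho = {cross_entropy(y, y_hat):.3f}")
```

The contrast grows as the model puts less probability on the observed class, which is the "incompatibility" reading of $\rho$.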

Regression / Autoencoders

Mean Squared Error: used in regression and reconstruction

$$\rho(x, \theta) = \|y - f_\theta(x)\|^2$$

Neural networks learn by minimizing prediction error over training data.

CLIP / SimCLR / Embeddings

Contrastive Loss: used in representation learning

$$\rho(x, \theta) = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_k \exp(\text{sim}(z_i, z_k)/\tau)}$$

CLIP learns image-text alignment by contrasting positive pairs against negatives.

VAE / Diffusion Models

KL Divergence + Reconstruction: used in generative models

$$\rho(x, \theta) = -\mathbb{E}_{q}[\log p_\theta(x|z)] + D_{KL}(q_\phi(z|x) \| p(z))$$

VAEs learn to generate by minimizing this variational bound (the negative ELBO); diffusion models are trained on a closely related bound.

The Profound Connection

Population discrepancy $D(\theta_0, \theta)$ = True expected loss (what we ideally want)

Empirical risk $\frac{1}{n}\sum_i \rho(x_i, \theta)$ = Training loss (what we actually compute)

Generalization = How close empirical risk is to population discrepancy

Generalization = How close empirical risk is to population discrepancy

In PyTorch, this is literally:

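🐍training_loop.py
import torch
from torch import nn

# Toy stand-ins so the snippet runs (illustrative only)
torch.manual_seed(0)
inputs = torch.randn(100, 3)
targets = inputs @ torch.tensor([1.0, -2.0, 0.5])
training_data = list(zip(inputs, targets))
model, criterion = nn.Linear(3, 1), nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Population discrepancy (ideal, but impossible): D(θ₀, θ) = E_{x~P_θ₀}[ρ(x, θ)]
# Empirical risk (what we actually compute):
optimizer.zero_grad()                # clear gradients from any previous step
loss = 0.0
for x, y in training_data:
    loss = loss + criterion(model(x).squeeze(), y)  # ρ(xᵢ, θ)
loss = loss / len(training_data)     # (1/n) Σ ρ(xᵢ, θ)

# Minimize: one gradient step on the empirical risk
loss.backward()
optimizer.step()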

🎯 Classical estimation theory and modern deep learning share the same mathematical framework!

Interactive Geometric Visualization

Let's visualize this "bowl" concept interactively! Move the sliders to see how the discrepancy function behaves as you explore different parameter values.

The Discrepancy "Bowl" — 2D Visualization

The discrepancy function D(θ₀, θ) measures how "wrong" a candidate θ is when θ₀ is the true parameter. It forms a bowl shape with the minimum at θ = θ₀.

[Interactive plot: the true discrepancy D(θ₀, θ) against the candidate parameter θ, with the current θ marked (D = 4.00 in this snapshot) and the minimum at θ₀.]

Key Insight

As θ moves away from θ₀, the discrepancy grows. The "bowl" shape ensures there's a unique minimum at θ = θ₀. Move the slider to find it!

The Discrepancy "Bowl" — 3D Visualization

With two parameters (θ₁, θ₂), the discrepancy function forms a 3D bowl. The minimum sits at the true parameter values (θ₀₁, θ₀₂). Drag to rotate, scroll to zoom.

[Interactive 3D plot: the bowl surface D(θ₀, θ) over (θ₁, θ₂), showing the minimum at θ₀, the current θ (D = 2.00 in this snapshot), and any clicked point.]

Try This

  • Click on the surface to see coordinates at that point
  • Move θ₁ and θ₂ sliders to explore the surface
  • Watch how D grows as you move away from θ₀
  • Rotate the view by dragging to see the bowl shape from different angles
  • When θ = θ₀, you're at the bottom of the bowl (D = 0)

What You're Seeing

The 2D curve shows the discrepancy for a single parameter (e.g., estimating a mean). The 3D surface shows what happens with two parameters — the bowl becomes a paraboloid, and the minimum sits at the true parameter values.

Notice how the sample contrast (dashed orange line in 2D) is "noisy" compared to the true discrepancy (solid green). With more data, the sample contrast converges to the true discrepancy — this is the law of large numbers at work!


Minimum Contrast Estimates

Since we can't compute $D(\theta_0, \theta)$ (we don't know $\theta_0$), we need a clever workaround. Here's the key insight:

Since $\rho(X, \theta)$ is an unbiased estimate of $D(\theta_0, \theta)$, we can minimize $\rho$ instead of $D$!

The Minimum Contrast Estimator

Define the minimum contrast estimate as:

$$\hat{\theta}(X) = \arg\min_{\theta \in \Theta} \rho(X, \theta)$$

In words: $\hat{\theta}(X)$ is the parameter value that minimizes the contrast function for the observed data $X$.

| Component | Meaning |
|---|---|
| $\arg\min$ | "argument that minimizes": returns the $\theta$, not the minimum value |
| $\theta \in \Theta$ | Search over all valid parameter values |
| $\rho(X, \theta)$ | Contrast function for observed $X$ |

Why This Works

The fundamental idea is beautifully simple:

  1. If we could compute $D(\theta_0, \theta)$ and minimize it, we'd find $\theta_0$ (the truth)
  2. We can't compute $D$ directly ($\theta_0$ is unknown)
  3. But for each fixed $\theta$, $E_{\theta_0}[\rho(X, \theta)] = D(\theta_0, \theta)$ by definition, so $\rho(X, \theta)$ is an unbiased estimate of $D$
  4. So minimizing $\rho(X, \theta)$ should give us something close to $\theta_0$

The Sample Version

With $n$ observations $X_1, \ldots, X_n$, we often use the average contrast:

$$\bar{\rho}(X, \theta) = \frac{1}{n}\sum_{i=1}^{n}\rho(X_i, \theta)$$

Minimizing this average is the basis for many estimation methods.
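Different contrasts lead to different familiar estimators. The sketch below minimizes the average contrast numerically with SciPy's `minimize_scalar` for two assumed choices, squared error and absolute error, on a small made-up sample; the helper names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([2.3, 1.8, 4.1, 3.2, 2.9])  # small illustrative sample

def squared(x, theta):
    return (x - theta) ** 2

def absolute(x, theta):
    return np.abs(x - theta)

def avg_contrast(theta, rho):
    """Average contrast (1/n) * sum_i rho(x_i, theta)."""
    return float(np.mean(rho(x, theta)))

def min_contrast(rho):
    """Numerically minimize the average contrast over [min(x), max(x)]."""
    res = minimize_scalar(avg_contrast, bounds=(x.min(), x.max()),
                          args=(rho,), method="bounded")
    return res.x

theta_sq = min_contrast(squared)
theta_abs = min_contrast(absolute)
print(f"squared error  -> {theta_sq:.4f} (sample mean   = {x.mean():.4f})")
print(f"absolute error -> {theta_abs:.4f} (sample median = {np.median(x):.4f})")
```

The squared-error minimizer lands on the sample mean and the absolute-error minimizer on the sample median, up to the optimizer's tolerance.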


Estimating Equations

Now we introduce a powerful technique: instead of directly minimizing the contrast, we use calculus to find where the derivative equals zero.

The Gradient Condition

Suppose $\Theta$ is Euclidean ($\Theta \subset \mathbb{R}^d$), $\theta_0$ is an interior point, and $D$ is smooth. Then at the minimum:

$$\nabla_{\theta} D(\theta_0, \theta) \Big|_{\theta=\theta_0} = 0$$

where $\nabla_{\theta}$ denotes the gradient (vector of partial derivatives):

$$\nabla_{\theta} = \left(\frac{\partial}{\partial \theta_1}, \ldots, \frac{\partial}{\partial \theta_d}\right)$$

Why the Gradient?

At a minimum of a smooth function, the surface is "flat": all partial derivatives are zero. The gradient collects all these derivatives into one vector equation.

The Estimating Equation

Since we can't evaluate $D$ directly, we use the sample version. The estimating equation is:

$$\nabla_{\theta}\rho(X, \hat{\theta}) = 0$$

In words: Find $\hat{\theta}$ such that the gradient of the contrast function (evaluated at $\hat{\theta}$) equals zero.

This Is a System of Equations

If $\theta \in \mathbb{R}^d$, then this gives us $d$ equations in $d$ unknowns:

$$\frac{\partial \rho(X, \theta)}{\partial \theta_j}\Big|_{\theta=\hat{\theta}} = 0 \quad \text{for } j = 1, \ldots, d$$
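For a concrete two-equation system, take the average Normal negative log-likelihood as the contrast, with $\theta = (\mu, \sigma^2)$. The sketch below, on a small made-up sample, evaluates the analytic gradient and checks that it vanishes at the sample mean and the (divide-by-n) sample variance.

```python
import numpy as np

x = np.array([2.3, 1.8, 4.1, 3.2, 2.9])  # small illustrative sample

def grad_rho(x, theta):
    """Gradient of the average Normal negative log-likelihood in (mu, sigma2)."""
    mu, sigma2 = theta
    d_mu = -np.mean(x - mu) / sigma2
    d_sigma2 = 1 / (2 * sigma2) - np.mean((x - mu) ** 2) / (2 * sigma2 ** 2)
    return np.array([d_mu, d_sigma2])

theta_hat = np.array([x.mean(), x.var()])  # np.var divides by n by default

print("gradient at theta_hat:  ", grad_rho(x, theta_hat))
print("gradient at (2.0, 1.0): ", grad_rho(x, np.array([2.0, 1.0])))
```

Both components are zero (up to rounding) at `theta_hat`, while an arbitrary candidate leaves a nonzero gradient.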


General Estimating Equations

There's an even more general framework that doesn't require starting from a contrast function. We directly specify estimating functions.

The $\Psi$-Function Approach

"More generally, suppose we are given a function $\Psi : \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}^d$..."

Pure Symbolic Meaning

$\Psi$ is a function that takes two inputs:

  1. $X \in \mathcal{X}$ (data)
  2. $\theta \in \mathbb{R}^d$ (parameter vector)

and outputs a d-dimensional vector in $\mathbb{R}^d$:

$$\Psi(X, \theta) = \begin{pmatrix} \psi_1(X, \theta) \\ \vdots \\ \psi_d(X, \theta) \end{pmatrix}$$

So Ψ is like:

ψ₁(X, θ) = some equation
ψ₂(X, θ) = another equation
...
ψ_d(X, θ) = last equation

Intuition

Instead of using gradients ($\nabla\rho$), this says:

Let's use any vector-valued function $\Psi(X, \theta)$ whose expectation is 0 at the true $\theta_0$.

Think of $\Psi$ as a system of equations the true θ must satisfy.

Examples:

  • moment conditions
  • score equations (log-likelihood derivative)
  • sample moment equations from GMM (generalized method of moments)
  • regression normal equations
🎯

What is Ψ (Psi)? — The "Error Detector"

Ψ is an "error detector" — a function that measures how "wrong" a candidate parameter θ is, given the observed data X.

The Key Idea: When θ equals the true parameter θ₀, the function Ψ(X, θ) should average to zero:

E[Ψ(X, θ₀)] = 0  ←  "No error signal at the truth"

Think of it like a thermostat:

  • If the room temperature (θ) matches the target (θ₀), the error signal is zero → Ψ = 0
  • If the room is too cold, the error is negative → Ψ < 0 (heat more!)
  • If the room is too hot, the error is positive → Ψ > 0 (cool down!)

Formally, $\Psi$ is a vector-valued function that takes data and a parameter, and outputs an "error vector":

$$\Psi : \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}^d$$

where $\Psi \equiv (\psi_1, \ldots, \psi_d)^T$, and each component $\psi_j$ checks a different aspect of the parameter.

Two Parameters (d = 2)

  • Input data: X = 5.2
  • Input parameters: θ = (μ, σ²) = (3.0, 2.0)
  • Output (2D error vector): Ψ(X, θ) = (2.20, 0.42)

Use case: Normal distribution, estimating mean (μ) and variance (σ²)

Three Parameters (d = 3)

  • Input data: y = 8, x₁ = 2, x₂ = 3
  • Input parameters: θ = (β₀, β₁, β₂) = (1, 2, 1)
  • Output (3D error vector): Ψ(X, θ) = (0.00, 0.00, 0.00)

All zeros → θ solves the estimating equation for this observation: the prediction 1 + 2·2 + 1·3 = 8 matches y exactly.

Use case: Linear regression, y = β₀ + β₁x₁ + β₂x₂

Notice: The output dimension matches the number of parameters (d). Each ψⱼ tells you how "off" the j-th parameter is. When all outputs are zero, you've found the estimate!

| Symbol | Meaning | Intuition |
|---|---|---|
| $\Psi$ | Vector of estimating functions | The "error detector" that tells us how wrong θ is |
| $\psi_j$ | The $j$-th component function | Checks if the j-th parameter component is correct |
| $d$ | Dimension of parameter space | Number of parameters to estimate (d equations for d unknowns) |
| $\mathcal{X}$ | Data space | All possible values the data X can take |

📊 Concrete Example: Estimating the Mean

Suppose we want to estimate the mean μ of a distribution. A natural Ψ-function is:

Ψ(X, μ) = X - μ

| If... | Then Ψ = | Meaning |
|---|---|---|
| X = 5, μ = 5 | 0 | ✅ μ is correct! |
| X = 7, μ = 5 | +2 | ⬆️ μ is too low |
| X = 3, μ = 5 | -2 | ⬇️ μ is too high |

Key property: E[Ψ(X, μ₀)] = E[X - μ₀] = μ₀ - μ₀ = 0 when μ = μ₀ (the true mean).

Population Condition

Define the population version:

$$V(\theta_0, \theta) = E_{\theta_0}[\Psi(X, \theta)]$$

We require that $V(\theta_0, \theta) = 0$ has $\theta = \theta_0$ as its unique solution.

This means: when the true parameter is $\theta_0$, the expected value of $\Psi(X, \theta)$ is zero only when $\theta = \theta_0$.

The Estimating Equation Estimate

We define $\hat{\theta}$ as the solution to:

$$\Psi(X, \hat{\theta}) = 0$$

This is $d$ equations in $d$ unknowns, one equation for each component of $\theta$.
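When the estimating equation has no convenient closed form, a root finder can solve it. As a hedged sketch for d = 1, assume exponential wait times with rate λ, so that $E[X] = 1/\lambda$ and $\psi(x, \lambda) = x - 1/\lambda$ has mean zero at the true rate; the sample and the bracketing interval are made up for illustration.

```python
import numpy as np
from scipy.optimize import brentq

x = np.array([2.3, 1.8, 4.1, 3.2, 2.9])  # wait times in minutes (illustrative)

def psi_sum(lam):
    """Sum over observations of psi(x_i, lam) = x_i - 1/lam."""
    return float(np.sum(x - 1.0 / lam))

# psi_sum is negative for tiny rates and positive for large ones,
# so the interval [1e-3, 10] brackets the root.
lam_hat = brentq(psi_sum, 1e-3, 10.0)

print(f"root of the estimating equation: {lam_hat:.5f}")
print(f"closed form 1 / mean(x):         {1 / x.mean():.5f}")
```

Here the closed form is available as a cross-check; the same pattern applies when it is not.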

Connection to Contrast Functions

When $\Psi(X, \theta) = \nabla_{\theta} \rho(X, \theta)$, the estimating equation approach reduces to the minimum contrast approach. But the $\Psi$-function framework is more general!


Intuitive Examples

Let's make these abstract concepts concrete with familiar examples.

Example 1: Method of Moments

Setup: Estimate the mean $\mu$ from data $X_1, \ldots, X_n$.

Estimating function: $\psi(X_i, \mu) = X_i - \mu$

Estimating equation: $\sum_{i=1}^n (X_i - \hat{\mu}) = 0$

Solution: $\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$

The sample mean! This is the simplest estimating equation estimate.

Example 2: Maximum Likelihood (Preview)

Setup: Observations with density $f(x; \theta)$.

Contrast function (negative log-likelihood):

$$\rho(X, \theta) = -\sum_{i=1}^n \log f(X_i; \theta)$$

Estimating equation (score equation):

$$\sum_{i=1}^n \frac{\partial \log f(X_i; \theta)}{\partial \theta}\Big|_{\theta=\hat{\theta}} = 0$$

This is the famous score equation, the foundation of Maximum Likelihood Estimation (covered in detail in Chapter 12).
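As a worked instance of the score equation, assume (for illustration only) Bernoulli observations with $f(x; \theta) = \theta^{x}(1 - \theta)^{1 - x}$ for $x \in \{0, 1\}$ and $0 < \theta < 1$:

```latex
\begin{aligned}
\log f(X_i; \theta) &= X_i \log\theta + (1 - X_i)\log(1 - \theta), \\
\sum_{i=1}^n \frac{\partial \log f(X_i; \theta)}{\partial \theta}
  &= \sum_{i=1}^n \left( \frac{X_i}{\theta} - \frac{1 - X_i}{1 - \theta} \right) = 0 \\
\Longrightarrow \quad (1 - \hat{\theta}) \sum_{i=1}^n X_i
  &= \hat{\theta} \Big( n - \sum_{i=1}^n X_i \Big)
\quad \Longrightarrow \quad
\hat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i = \bar{X}.
\end{aligned}
```

Under this model the score equation returns the sample proportion, the same answer the method of moments gives for a Bernoulli mean.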

Example 3: Least Squares

Setup: Linear regression $Y = X\beta + \varepsilon$.

Contrast function (sum of squared errors):

$$\rho(Y, \beta) = \sum_{i=1}^n (Y_i - X_i^T\beta)^2$$

Estimating equation (normal equations):

$$X^T(Y - X\hat{\beta}) = 0$$

Solution: $\hat{\beta} = (X^TX)^{-1}X^TY$
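The closed-form solution and the normal equations can be checked directly in NumPy. This sketch simulates data from an assumed model y = 2 + 3x + noise and solves $X^TX\beta = X^TY$ with `np.linalg.solve` rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])      # design matrix with intercept column
Y = 2 + 3 * x + rng.normal(0, 1, size=n)  # assumed true beta = (2, 3)

# Solve the normal equations X^T X beta = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The estimating equation X^T (Y - X beta_hat) = 0 holds up to rounding.
normal_eq_residual = X.T @ (Y - X @ beta_hat)

print("beta_hat:", beta_hat)
print("X^T (Y - X beta_hat):", normal_eq_residual)
```

The residual vector is orthogonal to every column of the design matrix, which is exactly what the normal equations state.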


Complete Symbol Glossary

Here's a comprehensive reference for all the symbols in this section:

| Symbol | Name | Meaning |
|---|---|---|
| $X$ | Observation | The data we observe (random vector) |
| $x$ | Realized data | Specific observed values (lowercase = fixed) |
| $\mathcal{X}$ | Sample space | All possible values $X$ could take |
| $P$ | Distribution | Probability law governing $X$ |
| $\mathcal{P}$ | Distribution family | Set of candidate distributions |
| $\theta$ | Parameter | Unknown quantity we want to estimate |
| $\theta_0$ | True parameter | The actual parameter value (unknown) |
| $\Theta$ | Parameter space | All valid parameter values |
| $P_{\theta}$ | Parametric distribution | Distribution when parameter is $\theta$ |
| $\hat{\theta}$ | Estimator/estimate | Our guess for $\theta$ based on data |
| $\rho$ | Contrast function | Measures incompatibility of $\theta$ with $X$ |
| $D(\theta_0, \theta)$ | Discrepancy | Expected contrast under true $\theta_0$ |
| $E_{\theta_0}$ | Expectation | Expected value under distribution $P_{\theta_0}$ |
| $\nabla_{\theta}$ | Gradient | Vector of partial derivatives w.r.t. $\theta$ |
| $\Psi$ | Estimating function | Vector-valued function defining equations |
| $V(\theta_0, \theta)$ | Population moment | Expected value of $\Psi$ under $\theta_0$ |

Python Implementation

Implementing a General Estimating Equation Solver

🐍estimating_equations.py
import numpy as np
from scipy.optimize import fsolve
from typing import Callable

def solve_estimating_equation(
    psi: Callable[[float, np.ndarray], np.ndarray],
    X: np.ndarray,
    theta_init: np.ndarray
) -> np.ndarray:
    """
    Solve the estimating equation sum_i Psi(X_i, theta) = 0.

    Parameters:
    -----------
    psi : Callable
        Estimating function Psi(x_i, theta) -> R^d for one observation
    X : np.ndarray
        Observed data
    theta_init : np.ndarray
        Initial guess for theta

    Returns:
    --------
    np.ndarray : Solution theta_hat
    """
    # Define the equation to solve: sum over observations
    def equation(theta: np.ndarray) -> np.ndarray:
        return np.sum([psi(x_i, theta) for x_i in X], axis=0)

    # Solve using scipy's fsolve
    theta_hat = fsolve(equation, theta_init)
    return theta_hat


# Example 1: Method of Moments for Normal(mu, sigma^2)
def psi_normal(x: float, theta: np.ndarray) -> np.ndarray:
    """
    Estimating function for Normal(mu, sigma^2).
    theta = (mu, sigma^2)
    """
    mu, sigma2 = theta
    return np.array([
        x - mu,                    # For mu: E[X - mu] = 0
        (x - mu)**2 - sigma2       # For sigma^2: E[(X-mu)^2 - sigma^2] = 0
    ])


# Generate data from Normal(5, 4), i.e. standard deviation 2
np.random.seed(42)
X = np.random.normal(loc=5, scale=2, size=100)

# Solve
theta_hat = solve_estimating_equation(psi_normal, X, theta_init=np.array([0.0, 1.0]))
print("True theta = (5, 4)")
print(f"Estimated theta_hat = ({theta_hat[0]:.4f}, {theta_hat[1]:.4f})")

Minimum Contrast Estimation

🐍minimum_contrast.py
import numpy as np
from scipy.optimize import minimize
from typing import Callable, Optional, Sequence, Tuple

def minimum_contrast_estimate(
    rho: Callable[[np.ndarray, np.ndarray], float],
    X: np.ndarray,
    theta_init: np.ndarray,
    bounds: Optional[Sequence[Tuple[float, float]]] = None
) -> np.ndarray:
    """
    Find theta that minimizes the contrast function rho(X, theta).

    Parameters:
    -----------
    rho : Callable
        Contrast function rho(X, theta) -> R
    X : np.ndarray
        Observed data
    theta_init : np.ndarray
        Initial guess
    bounds : sequence of (min, max) pairs, optional
        Parameter bounds, one pair per component of theta

    Returns:
    --------
    np.ndarray : Minimum contrast estimate theta_hat
    """
    def objective(theta: np.ndarray) -> float:
        return rho(X, theta)

    result = minimize(objective, theta_init, bounds=bounds, method='L-BFGS-B')
    return result.x


# Example: Least Squares Regression
def least_squares_contrast(X: np.ndarray, beta: np.ndarray) -> float:
    """
    Contrast function: sum of squared residuals.
    X has columns [1, x1, x2, ..., y] (last column is response)
    """
    design = X[:, :-1]  # Features
    y = X[:, -1]        # Response
    y_pred = design @ beta
    return np.sum((y - y_pred)**2)


# Generate linear regression data: y = 2 + 3*x + noise
np.random.seed(42)
n = 100
x = np.random.uniform(0, 10, n)
y = 2 + 3*x + np.random.normal(0, 1, n)
X = np.column_stack([np.ones(n), x, y])  # [1, x, y]

# Estimate
beta_hat = minimum_contrast_estimate(
    least_squares_contrast, X,
    theta_init=np.array([0.0, 0.0])
)
print("True beta = (2, 3)")
print(f"Estimated beta_hat = ({beta_hat[0]:.4f}, {beta_hat[1]:.4f})")
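
Because the least squares contrast is quadratic in beta, its minimizer can also be obtained without an iterative optimizer. The following is a minimal sketch, assuming the same simulated regression data as above, that uses NumPy's `np.linalg.lstsq` and then checks that the gradient of the contrast vanishes at the solution. The variable names are illustrative.

```python
import numpy as np

# Same simulated data as the example above (assumed): y = 2 + 3*x + noise
np.random.seed(42)
n = 100
x = np.random.uniform(0, 10, n)
y = 2 + 3 * x + np.random.normal(0, 1, n)
design = np.column_stack([np.ones(n), x])  # columns [1, x]

# Direct least squares solution of min_beta ||y - design @ beta||^2
beta_closed, _, _, _ = np.linalg.lstsq(design, y, rcond=None)

# Gradient of the sum of squared residuals: -2 * design^T (y - design @ beta)
grad = -2 * design.T @ (y - design @ beta_closed)

print(f"Closed-form beta_hat = ({beta_closed[0]:.4f}, {beta_closed[1]:.4f})")
print(f"Gradient of contrast at beta_hat = {grad}")
```

A gradient that is zero up to floating-point error confirms that the same point satisfies the estimating equation for this contrast.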

Visualizing the Discrepancy Function

🐍visualize_discrepancy.py
import numpy as np
import matplotlib.pyplot as plt

# Setup: Estimate mean mu of Normal(mu, 1)
true_mu = 3.0
n = 50

# Generate data
np.random.seed(42)
X = np.random.normal(loc=true_mu, scale=1, size=n)

# Define contrast function (negative log-likelihood per observation)
def contrast(x_i, mu):
    """rho(x, mu) = (x - mu)^2 / 2 (negative log-likelihood up to an additive constant)"""
    return (x_i - mu)**2 / 2

# Compute sample contrast for range of mu values
mu_range = np.linspace(0, 6, 200)
sample_contrast = [np.mean([contrast(x_i, mu) for x_i in X])
                   for mu in mu_range]

# The true discrepancy D(mu_0, mu) = E[(X - mu)^2/2] = ((mu - mu_0)^2 + 1)/2
true_discrepancy = ((mu_range - true_mu)**2 + 1) / 2

# Plot (raw f-strings keep LaTeX backslashes such as \mu and \hat intact)
plt.figure(figsize=(10, 6))
plt.plot(mu_range, sample_contrast, 'b-', linewidth=2,
         label=r'Sample contrast $\bar{\rho}(X, \mu)$')
plt.plot(mu_range, true_discrepancy, 'r--', linewidth=2,
         label=r'True discrepancy $D(\mu_0, \mu)$')
plt.axvline(true_mu, color='g', linestyle=':', linewidth=2,
            label=rf'True $\mu_0$ = {true_mu}')
plt.axvline(np.mean(X), color='purple', linestyle=':', linewidth=2,
            label=rf'Estimate $\hat{{\mu}}$ = {np.mean(X):.3f}')

plt.xlabel(r'$\mu$', fontsize=14)
plt.ylabel('Contrast / Discrepancy', fontsize=14)
plt.title('Sample Contrast vs Population Discrepancy', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('discrepancy_visualization.png', dpi=150)
plt.show()

# The minimum of sample contrast (over the grid) gives our estimate
mu_hat = mu_range[np.argmin(sample_contrast)]
print(f"True mu = {true_mu}")
print(f"Sample mean = {np.mean(X):.4f}")
print(f"Minimizer of sample contrast = {mu_hat:.4f}")
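
The sample contrast curve has the same parabolic shape as the true discrepancy, only centred at the sample mean instead of at the true mean. Expanding the square gives $\bar{\rho}(X, \mu) = \frac{(\mu - \bar{X})^2 + S^2}{2}$, where $\bar{X}$ is the sample mean and $S^2 = \frac{1}{n}\sum_i (X_i - \bar{X})^2$ (notation introduced here for illustration). The sketch below checks this identity numerically, assuming the same simulated data and grid as the script above.

```python
import numpy as np

# Same setup as the visualization script above (assumed)
np.random.seed(42)
X = np.random.normal(loc=3.0, scale=1, size=50)
mu_range = np.linspace(0, 6, 200)

# Vectorized sample contrast: rows index candidate mu, columns index observations
sample_contrast = ((X[None, :] - mu_range[:, None]) ** 2 / 2).mean(axis=1)

# Closed form: ((mu - x_bar)^2 + S^2) / 2, with S^2 using divisor n
x_bar = X.mean()
s2 = X.var()
closed_form = ((mu_range - x_bar) ** 2 + s2) / 2

# The grid minimizer can miss the exact minimizer x_bar by at most half a grid step
grid_step = mu_range[1] - mu_range[0]
mu_grid = mu_range[np.argmin(sample_contrast)]
print(f"Max |sample contrast - closed form| = {np.max(np.abs(sample_contrast - closed_form)):.2e}")
print(f"|grid minimizer - sample mean| = {abs(mu_grid - x_bar):.4f} (half grid step = {grid_step / 2:.4f})")
```

This is why the grid-based minimizer printed by the script agrees with the sample mean only up to the grid resolution.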

Key Insights

The Estimation Recipe

How to Construct an Estimator

  1. Choose a contrast function $\rho(X, \theta)$ that measures how "incompatible" $\theta$ is with the data
  2. Verify the key property: $D(\theta_0, \theta)$ is minimized at $\theta = \theta_0$
  3. Option A: Directly minimize $\rho(X, \theta)$ over $\theta$
  4. Option B: Solve the estimating equation $\nabla \rho(X, \hat{\theta}) = 0$
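
Options A and B can be compared on a small example. The sketch below is an illustration under stated assumptions, not a general recipe. It takes an Exponential model with rate $\lambda$, uses the average negative log-likelihood $-\log\lambda + \lambda\bar{X}$ as the contrast, and estimates $\lambda$ once by direct minimization and once by root-finding on the derivative. The data, bounds, and function names are all hypothetical.

```python
import numpy as np
from scipy.optimize import brentq, minimize_scalar

# Simulated data from an Exponential distribution with true rate 2 (assumed setup)
rng = np.random.default_rng(0)
X = rng.exponential(scale=1 / 2.0, size=500)
x_bar = X.mean()


def contrast(lam: float) -> float:
    """Average negative log-likelihood of Exponential(rate=lam)."""
    return -np.log(lam) + lam * x_bar


def contrast_grad(lam: float) -> float:
    """Derivative of the contrast with respect to lam."""
    return -1.0 / lam + x_bar


# Option A: minimize the contrast directly over a bounded interval
lam_a = minimize_scalar(contrast, bounds=(1e-3, 50.0), method="bounded").x

# Option B: solve the estimating equation contrast_grad(lam) = 0
lam_b = brentq(contrast_grad, 1e-3, 50.0)

print(f"Option A (minimize):   {lam_a:.4f}")
print(f"Option B (root-find):  {lam_b:.4f}")
print(f"Closed form 1 / x_bar: {1 / x_bar:.4f}")
```

Both routes should agree with the closed-form answer $1/\bar{X}$ up to solver tolerance, because this contrast is smooth and convex with a single stationary point.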

Why This Framework Is Powerful

  • Unifying: Maximum Likelihood, Least Squares, Method of Moments are all special cases
  • Flexible: Can handle complex models by choosing appropriate contrast functions
  • Analyzable: Properties of $\hat{\theta}$ (bias, variance, consistency) follow from properties of $\rho$
  • Computationally tractable: Estimating equations are often easier to solve than direct optimization problems

Caution: Uniqueness

Not all contrast functions have unique minima! When $\rho(X, \theta)$ has multiple local minima, different starting points may give different estimates. Always check that your estimator is well-defined.
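
A small illustration of this caution, assuming a Cauchy location model and a hand-picked data set with two well-separated clusters (both are choices made here for illustration): the contrast $\sum_i \log(1 + (X_i - \theta)^2)$ then has one local minimum near each cluster, and a local optimizer settles in whichever one is closer to its starting point.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: two tight clusters, symmetric around zero
X = np.array([-5.2, -5.0, -4.8, 4.8, 5.0, 5.2])


def cauchy_contrast(theta: np.ndarray) -> float:
    """Negative log-likelihood of a Cauchy location model (up to a constant)."""
    return float(np.sum(np.log1p((X - theta[0]) ** 2)))


# Run the same local optimizer from two different starting points
starts = [-6.0, 6.0]
estimates = [
    minimize(cauchy_contrast, np.array([s]), method="L-BFGS-B").x[0]
    for s in starts
]

for s, est in zip(starts, estimates):
    value = cauchy_contrast(np.array([est]))
    print(f"start = {s:+.1f} -> estimate = {est:+.4f}, contrast = {value:.4f}")
```

Running the optimizer from several starting points and comparing the resulting contrast values is one simple way to detect this situation.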


Summary

This section introduced the foundational framework for estimation theory. Here are the key takeaways:

Core Concepts

| Concept | Key Formula | Intuition |
|---|---|---|
| Parametric model | $\mathcal{P} = \{P_{\theta} : \theta \in \Theta\}$ | Data comes from a distribution indexed by $\theta$ |
| Estimator | $\hat{\theta}(X) : \mathcal{X} \to \Theta$ | Recipe for guessing $\theta$ from data |
| Contrast function | $\rho : \mathcal{X} \times \Theta \to \mathbb{R}$ | Measures incompatibility of $\theta$ with $X$ |
| Discrepancy | $D(\theta_0, \theta) = E_{\theta_0}[\rho(X, \theta)]$ | Average contrast under true $\theta_0$ |
| Min contrast estimate | $\hat{\theta} = \arg\min \rho(X, \theta)$ | $\theta$ that best fits the data |
| Estimating equation | $\nabla \rho(X, \hat{\theta}) = 0$ | Find $\hat{\theta}$ where gradient vanishes |

The Big Ideas

  1. Estimation is about finding the parameter that generated our data — we use observable quantities to infer unobservable truths
  2. Contrast functions formalize "goodness of fit" — smaller contrast means better compatibility with data
  3. The key requirement is identification: $D(\theta_0, \theta)$ must be uniquely minimized at $\theta = \theta_0$
  4. Estimating equations are often easier to solve — differentiation turns optimization into root-finding
Coming Next: In the next section, we'll explore the key properties of estimators: bias, variance, and mean squared error (MSE). These concepts help us evaluate how "good" an estimator is and choose between competing estimation strategies.