Estimators and Their Properties | Chapter 11 - Point Estimation | Probability & Statistics for AI/ML

Learning Objectives

Before You Start

You should be comfortable with probability distributions, random variables, and basic calculus (derivatives and optimization). Familiarity with expected values will also help.

By the end of this section, you will be able to:

🎯 Understand the Parametric Framework: what it means to work with a family of distributions indexed by parameters

🔧 Define an Estimator: a function that maps observations to parameter estimates (data in, estimate out)

📏 Explain Contrast Functions: why they measure "discrepancy" between candidate parameters and the truth

📉 Derive Minimum Contrast Estimates: by minimizing the empirical contrast, that is, finding the "best fit" parameter

✏️ Set Up Estimating Equations: using gradient conditions and general $\Psi$-functions

🚀 Apply These Concepts: to real estimation problems, including Maximum Likelihood Estimation (MLE)

🏆 What You'll Build

By the end of this section, you'll understand the complete estimation pipeline and be ready to implement an MLE estimator from scratch. You'll see that all estimators — from simple sample means to neural network training — follow the same fundamental pattern.

The Big Picture: Why Estimation Matters

Statistics is fundamentally about learning from data. We observe data, but we want to know about the underlying process that generated it.

☕ The Problem: Coffee Shop Wait Times

A coffee shop manager wants to know the true average wait time for customers. Measuring every single customer forever is impossible, so they decide to sample customers and estimate the average.

Unknown Truth

True mean μ = 2.847 minutes (hidden from us)

What We Do

Sample n customers and compute θ̂ = sample mean

The Question

How close is θ̂ to μ? Does more data help?

🎮 Try It Yourself: See Estimation in Action

[Interactive demo: Sample Size vs Estimate Accuracy. A slider sets the sample size n (axis ticks at 5, 2.5k, 5k, 7.5k, 10k; presets 100, 500, 1000, 5000, 10000) and the panel reports the sample mean, the true mean, and the error.]

Watch how your estimate improves as you collect more data. The true mean is 2.847.

Example reading at n = 10: sample mean 2.595, true mean 2.847, error 0.252. Error decreases as n increases (Law of Large Numbers).

Key Insight

With only n = 10 samples, there's still considerable uncertainty. Increase the sample size to watch the estimate converge toward the true value.
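The same behavior can be reproduced in a few lines. This is only a sketch: it assumes wait times follow an exponential distribution with mean 2.847, and the seed and the names `true_mean` and `errors` are illustrative choices rather than anything specified by the course.

```python
import numpy as np

# Assumption for illustration: wait times are exponential with mean 2.847 minutes.
true_mean = 2.847
rng = np.random.default_rng(0)

errors = {}
for n in [10, 100, 1_000, 10_000, 100_000]:
    sample = rng.exponential(scale=true_mean, size=n)
    estimate = sample.mean()                 # theta_hat = sample mean
    errors[n] = abs(estimate - true_mean)    # absolute estimation error
    print(f"n = {n:>6}: estimate = {estimate:.3f}, error = {errors[n]:.3f}")
```

The error need not shrink at every step for a given seed, but its typical size falls roughly like $1/\sqrt{n}$.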

The Detective Analogy

Think of estimation like being a detective. You arrive at a crime scene and find clues (your data). You can't rewind time to see exactly what happened (the true parameters), but you can use the clues to reconstruct what most likely occurred.

🔍 Detective Work

Clues at crime scene → Evidence
Reconstruct what happened → Theory
Better methods → Closer to truth
Can never be 100% certain → Uncertainty

📊 Statistical Estimation

Sample observations → $X_1, X_2, \ldots, X_n$
Apply estimator → $\hat{\theta}(X)$
Good estimator → $\hat{\theta} \approx \theta_0$
Quantify uncertainty → Confidence intervals

The better your detective methods (estimators), the closer your reconstruction gets to the truth. And just like in detective work, some methods are provably better than others.

A Concrete Example: The Coffee Shop Mystery

Imagine you're studying customer wait times at a coffee shop. You collect data: 2.3 min, 1.8 min, 4.1 min, 3.2 min...

The central question: What is the "true" average wait time for ALL customers (past, present, and future)?

Here's the key insight: that "true average" exists somewhere out there as a fixed number — maybe it's exactly 2.847 minutes. We'll never know it precisely, but we can get closer and closer with more data and smarter methods.

🧠 The Mental Model: Estimation Pipeline

Reality ($\theta_0$): hidden truth → Generates Data: sampling → We Observe ($X$): your data → Estimator $T(\cdot)$: your method → Estimate ($\hat{\theta}$): your guess ≈ $\theta_0$ (the goal!)

The entire estimation process: from hidden truth, through observable data, to our best guess

🎯 The Two Worlds of Estimation

What We Have (Observable)

Your sample: [2.3, 1.8, 4.1, 3.2, ...]
Sample mean: 2.85 min
Sample size: n = 100

🌉 Estimator

What We Want (Hidden)

True population mean: $\mu$ = ???
True variance: $\sigma^2$ = ???
The actual data-generating process

🔗 Estimation builds the bridge from observable to hidden

The One Formula to Remember

$\hat{\theta} = T(X_1, X_2, \ldots, X_n)$

An estimator is just a function of your data. That's it. You put data in, you get a parameter estimate out. The magic is in choosing which function T(·) to use — that's what this chapter teaches you.
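To make "an estimator is a function of the data" concrete, here is a minimal sketch with two candidate choices of $T(\cdot)$ applied to the four wait times quoted earlier. The function names `sample_mean` and `sample_median` are illustrative, not part of any library.

```python
import numpy as np

def sample_mean(x: np.ndarray) -> float:
    """T(X) = average of the observations."""
    return float(np.mean(x))

def sample_median(x: np.ndarray) -> float:
    """T(X) = middle value of the sorted observations."""
    return float(np.median(x))

# The four wait times quoted earlier in this section (minutes).
x = np.array([2.3, 1.8, 4.1, 3.2])

print(f"sample mean   = {sample_mean(x):.2f}")
print(f"sample median = {sample_median(x):.2f}")
```

Both are valid estimators of a "typical" wait time. They simply use different functions $T$, and so can return different estimates from the same data.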

Why Should You Care? (The ML Connection)

Every time you train a machine learning model, you're doing estimation:

ML Task	What You're Estimating
Linear Regression	Slope and intercept parameters
Neural Network Training	Millions of weight parameters
GPT/LLM Training	Next-token probability distribution
Bayesian Inference	Posterior distribution over parameters

🤔 Quick Check

When you run model.fit(X_train, y_train) in scikit-learn, which of the "Three Questions of Estimation" (listed in the next subsection) is scikit-learn answering for you?

Answer

Question 2: How should we estimate it? — scikit-learn has already chosen the estimator for you (e.g., Ordinary Least Squares for LinearRegression). You provide the data, and it applies the pre-defined estimation algorithm.

The Universal Pattern

All estimation follows the same pattern: Data → Estimator → Estimate. Whether you're computing a sample mean or training GPT-4, you're applying a function to data to get parameter estimates.

The Three Questions of Estimation

Every estimation problem comes down to three questions:

What should we estimate? — Defining the parameter(s) of interest
How should we estimate it? — Choosing an estimator (this chapter!)
How good is our estimate? — Quantifying uncertainty (next chapters)

Real-World Estimation: From Problems to Mathematical Models

Estimation isn't just abstract mathematics; it solves concrete problems every day. Real-world challenges, such as the coffee shop's wait times, map directly onto the estimation framework we're learning.

The Universal Pattern Across All Examples

Every estimation problem follows the same structure: (1) a true parameter θ₀ exists but is unknown, (2) the data X is a sample from the population, (3) a contrast function ρ(X, θ) measures how poorly θ fits the data, and (4) the estimate $\hat{\theta}$ minimizes the contrast. This is the universal language of estimation.

The Estimation Problem (Formal)

Given observed data $X = (X_1, X_2, \ldots, X_n)$ , construct a function $\hat{\theta}(X)$ that gives us a "good guess" for the unknown parameter $\theta$ .

What Makes a Guess "Good"?

This is the million-dollar question! We want estimators that are: (1) unbiased — correct on average, (2) consistent — improve with more data, and (3) efficient — have minimal variance. We'll formalize these properties throughout this chapter.

The Parametric Framework

Before we can estimate anything, we need to set up our mathematical framework precisely. Let's decode the notation:

The Setup: Observation Space and Probability Families

$X \in \mathcal{X}, \quad X \sim P \in \mathcal{P}$

Let's break this down symbol by symbol:

Symbol	Name	What It Means
$X$	Observation vector	The data we actually observe (e.g., n wait times)
$\mathcal{X}$	Sample space	All possible values $X$ could take
$P$	Probability distribution	The unknown law governing how $X$ is generated
$\mathcal{P}$	Probability family	The set of all candidate distributions we consider

Think of the Sample Space as the Data Universe

If you're measuring wait times (positive numbers), then $\mathcal{X} = \mathbb{R}^n_{+}$ — all possible n-tuples of positive real numbers.

The Parametric Assumption

In the parametric case, we assume the true distribution $P$ belongs to a specific family indexed by parameters:

$\mathcal{P} = \{P_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \Theta\}$

Symbol	Name	What It Means
$\theta$	Parameter vector	The unknown quantities we want to estimate
$\Theta$	Parameter space	All possible values $\theta$ could take
$P_{\theta}$	Parametric distribution	The distribution when parameter equals $\theta$

Example: Normal Distribution

For normally distributed data: $\boldsymbol{\theta} = (\mu, \sigma^2)$ , and $\Theta = \mathbb{R} \times \mathbb{R}^{+}$ (mean can be any real number, variance must be positive).

The key insight: Once we specify $\theta$ , we completely determine the probability distribution. The estimation problem becomes: Which $\theta$ generated our data?
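One way to picture a parametric family in code is as a function that maps a parameter value to a distribution object. The sketch below does this for the normal example, using `scipy.stats.norm`. The helper names `in_parameter_space` and `P` are hypothetical and chosen only to mirror the notation $\Theta$ and $P_\theta$.

```python
import numpy as np
from scipy import stats

def in_parameter_space(theta) -> bool:
    """Check that theta = (mu, sigma2) lies in Theta = R x (0, inf)."""
    mu, sigma2 = theta
    return bool(np.isfinite(mu) and sigma2 > 0)

def P(theta):
    """Return the distribution P_theta as a frozen scipy.stats object."""
    if not in_parameter_space(theta):
        raise ValueError(f"theta = {theta} is outside the parameter space")
    mu, sigma2 = theta
    # scipy parameterizes the normal by its standard deviation, not its variance.
    return stats.norm(loc=mu, scale=np.sqrt(sigma2))

dist = P((5.0, 4.0))
print(dist.mean(), dist.var())
print(in_parameter_space((5.0, -1.0)))
```

Specifying `theta` fixes every property of the distribution, which is exactly what "indexed by parameters" means.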

What Is an Estimator and Estimate?

An estimator $\hat{\theta}(X)$ is a machine (function) that turns data $X$ into guesses. An estimate $\hat{\theta}$ is the actual guess.

The estimator predicts the true parameter $\theta$ of the population using sample data $X$ from that population.

Formally, an estimator $\hat{\boldsymbol{\theta}}(X)$ is a function of the observation vector $X$ that produces an estimate of the unknown parameter $\theta$ .

$\hat{\boldsymbol{\theta}} : \mathcal{X} \to \Theta$

This notation emphasizes three crucial points:

$\hat{\theta}$ is a function — It takes data as input and produces a parameter estimate as output
$\hat{\theta}$ depends only on $X$ — We can only use observable data, not the true (unknown) $\theta$
$\hat{\theta}$ lives in $\Theta$ — The estimate should be a valid parameter value

The Hat Notation

The "hat" symbol always denotes an estimate or estimator. When you see $\hat{\theta}$ , think: "this is our data-based guess for $\theta$ ."

Estimator vs Estimate

A subtle but important distinction:

Term	Symbol	What It Is
Estimator	$\hat{\theta}(X)$	The function/rule itself (random, before seeing data)
Estimate	$\hat{\theta}(x)$	The specific value obtained after observing $x$ (fixed number)

The estimator $\hat{\theta}(X)$ is a random variable because $X$ is random. The estimate $\hat{\theta}(x)$ is a specific number computed from realized data $x$ .
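The distinction can be seen by simulation. The following sketch assumes normally distributed data with illustrative values `mu0 = 5`, `sigma = 2`, and `n = 25`. One dataset yields one estimate, while repeating the experiment reveals the distribution of the estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, sigma, n = 5.0, 2.0, 25      # illustrative true values and sample size

# One realized dataset x gives one estimate: a fixed number.
x = rng.normal(mu0, sigma, size=n)
estimate = x.mean()

# Repeating the experiment shows the estimator's sampling distribution.
reps = 5_000
estimates = rng.normal(mu0, sigma, size=(reps, n)).mean(axis=1)

print(f"single estimate          = {estimate:.3f}")
print(f"spread of the estimator  = {estimates.std():.3f}")
print(f"theory: sigma / sqrt(n)  = {sigma / np.sqrt(n):.3f}")
```

The spread across repetitions is a property of the estimator (the random variable), not of any single estimate.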

Contrast Functions: Measuring Discrepancy

How do we find a "good" estimator? The key idea is to define a contrast function that measures how "far" any candidate parameter is from the truth.

Definition of Contrast Function

$\rho : \mathcal{X} \times \Theta \to \mathbb{R}$

A contrast function $\rho$ (Greek letter "rho") is a function that takes:

An observation $X$ from the sample space $\mathcal{X}$
A candidate parameter $\theta$ from the parameter space $\Theta$

And produces a real number measuring the "discrepancy" between $\theta$ and the truth.

Intuition for the Contrast Function

Think of $\rho(X, \theta)$ as answering: "How incompatible is this candidate parameter $\theta$ with the observed data $X$ ?"

Small $\rho(X, \theta)$ means $\theta$ is compatible with data $X$
Large $\rho(X, \theta)$ means $\theta$ is incompatible with data $X$

Wait — How Can We Compare Data and Parameters?

You might be confused: $X$ is data (numbers we observed), and $\theta$ is a parameter (an abstract quantity describing a distribution). These are completely different types of objects! How can a function take both and produce a meaningful "discrepancy"?

The Key Insight

The contrast function $\rho(X, \theta)$ does NOT directly compare $X$ and $\theta$ . Instead, it asks:

"If $\theta$ were the true parameter, how surprising or unlikely would this observed data $X$ be?"

Here's how it works:

The parameter $\theta$ defines a probability model — it tells us what the data should look like if $\theta$ were true
The data $X$ is what we actually observed — the real numbers we collected
The function $\rho$ measures the "fit" — how well does what we saw match what we'd expect if $\theta$ were true?

Concrete Examples

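Let's make this concrete. Two contrasts used later in this section are the squared error, $\rho(x, \mu) = (x - \mu)^2$, which is small when the candidate mean $\mu$ is close to the observation, and the negative log-likelihood, $\rho(x, \theta) = -\log f(x; \theta)$, which is small when the observation is likely under $\theta$.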

The Bridge Between Data and Parameters

The contrast function $\rho$ acts as a bridge between the world of data ( $X$ ) and the world of parameters ( $\theta$ ):

$\theta$ generates expectations — "what data should look like"
$X$ is reality — "what data actually looks like"
$\rho(X, \theta)$ measures the gap — "how well does expectation match reality?"

The Key Requirement

For $\rho$ to be a useful contrast function, we need a special property. Define the population discrepancy:

$D(\theta_0, \theta) \equiv E_{\theta_0}[\rho(X, \theta)]$

This measures the average discrepancy when the true parameter is $\theta_0$ and we're evaluating at $\theta$ .

The Fundamental Requirement

For $\rho$ to be a valid contrast function, we require:

$D(\theta_0, \theta) \text{ is uniquely minimized for } \theta = \theta_0$

In plain English: When averaged over the true distribution, the discrepancy is smallest exactly at the true parameter.

This requirement ensures that if we knew the truth ( $\theta_0$ ), the contrast function would correctly identify it as the best choice.

The Discrepancy Function $D(\theta_0, \theta)$

Let's understand the discrepancy function more deeply:

$D(\theta_0, \theta) \equiv E_{\theta_0}[\rho(X, \theta)] = \int_{\mathcal{X}} \rho(x, \theta) \, f(x; \theta_0) \, dx$

Term	Meaning
$\int_{\mathcal{X}} \cdots \, dx$	Integrate over all possible observations in $\mathcal{X}$
$\rho(x, \theta)$	Contrast at candidate parameter $\theta$
$f(x; \theta_0)$	Population density with true parameter $\theta_0$

In words: We weight the contrast $\rho(x, \theta)$ by how likely each observation $x$ is under the true distribution, then sum/integrate over all possibilities.

💡 Intuitive Meaning of the Discrepancy Function

The discrepancy function measures how bad a guessed model $\theta$ is on average, when the world is actually governed by the true parameter $\theta_0$ .

It evaluates the loss for every possible observation, weighted by how frequently that observation occurs in reality.

In machine learning, this is the ideal objective we want to minimize — and training loss is simply its finite-data approximation.

For discrete distributions, the integral becomes a sum:

$D(\theta_0, \theta) = \sum_{x \in \mathcal{X}} \rho(x, \theta) \, P_{\theta_0}(X = x)$
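For a distribution with only two outcomes, the sum can be evaluated exactly. The sketch below assumes a Bernoulli model with the negative log-likelihood as contrast and an illustrative true value `p0 = 0.3`. It evaluates $D(p_0, p)$ on a grid and checks where it is smallest.

```python
import numpy as np

def rho(x: int, p: float) -> float:
    """Negative log-likelihood of one Bernoulli(p) observation x in {0, 1}."""
    return -(x * np.log(p) + (1 - x) * np.log(1 - p))

def discrepancy(p0: float, p: float) -> float:
    """D(p0, p) = sum over x in {0, 1} of rho(x, p) * P_{p0}(X = x)."""
    prob = {0: 1 - p0, 1: p0}
    return sum(rho(x, p) * prob[x] for x in (0, 1))

p0 = 0.3                                   # illustrative true parameter
grid = np.linspace(0.01, 0.99, 99)         # candidate values of p
D = np.array([discrepancy(p0, p) for p in grid])
p_best = grid[np.argmin(D)]

print(f"D is smallest at p = {p_best:.2f}")
```

The grid minimizer coincides with `p0`, which is the "uniquely minimized at the truth" requirement in action for this particular contrast.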

Intuition

$D(\theta_0, \theta)$ is the expected pain of believing $\theta$ when the world is actually governed by $\theta_0$.

🧠 God's Eye Interpretation

Imagine God knows the true parameter $\theta_0$ .

God repeatedly simulates infinitely many datasets from the true distribution $f(x; \theta_0)$.

For each simulated data point $x$ , God evaluates how wrong your guess $\theta$ is using $\rho(x, \theta)$ .

The discrepancy $D(\theta_0, \theta)$ is the average of that mistake over all possible worlds.

This explains why:

It uses the true distribution
It integrates over all possible observations
It is a population-level measure, not a sample-level one

⚙️ Engineering Analogy

$\theta_0$ = true physical parameter of a system

$\theta$ = your estimated model

$x$ = sensor reading

$\rho(x, \theta)$ = prediction error

Then $D(\theta_0, \theta)$ = expected prediction error over infinitely many future measurements.

"If I keep using model $\theta$ forever, how badly will I perform on average?"

❓ Why do we integrate using the true distribution $f(x; \theta_0)$?

Because discrepancy is not about how good the model thinks it is — it is about how reality judges the model.

The true distribution tells us which observations actually occur in the real world. We care about performance on real data, not hypothetical data from our model.

🤖 Machine Learning View

In machine learning, we never know $\theta_0$ . So we replace this population discrepancy with a sample average:

$D(\theta_0, \theta) \approx \frac{1}{n} \sum_{i=1}^{n} \rho(x_i, \theta)$

🔗 This Is Your PyTorch/TensorFlow Loss Function!

Each of the standard deep learning training objectives below is a contrast function $\rho(x, \theta)$:

GPT / BERT / Classification

Cross-Entropy Loss — Used in language models and classifiers

$\rho(x, \theta) = -\sum_{c=1}^{C} y_c \log \hat{y}_c(\theta)$

When GPT predicts the next token, it minimizes this over millions of text samples.

Regression / Autoencoders

Mean Squared Error — Used in regression and reconstruction

$\rho(x, \theta) = \|y - f_\theta(x)\|^2$

Neural networks learn by minimizing prediction error over training data.

CLIP / SimCLR / Embeddings

Contrastive Loss — Used in representation learning

$\rho(x, \theta) = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_k \exp(\text{sim}(z_i, z_k)/\tau)}$

CLIP learns image-text alignment by contrasting positive pairs against negatives.

VAE / Diffusion Models

KL Divergence + Reconstruction — Used in generative models

$\rho(x, \theta) = -\mathbb{E}_{q}[\log p_\theta(x|z)] + D_{KL}(q_\phi(z|x) \| p(z))$

VAEs and diffusion models learn to generate by minimizing this variational bound.

The Profound Connection

✓ Population discrepancy $D(\theta_0, \theta)$ = True expected loss (what we ideally want)

✓ Empirical risk $\frac{1}{n}\sum_i \rho(x_i, \theta)$ = Training loss (what we actually compute)

✓ Generalization = How close empirical risk is to population discrepancy

In PyTorch, this is literally:

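🐍 training_loop.py

import torch
from torch import nn

# Toy setup for illustration only: a linear model on random data
model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
training_data = [(torch.randn(1), torch.randn(1)) for _ in range(32)]

# Population discrepancy (ideal, but impossible to compute):
# D(θ₀, θ) = E_{x~P_θ₀}[ρ(x, θ)]

# Empirical risk (what we actually compute):
optimizer.zero_grad()                     # clear gradients from any previous step
loss = 0.0
for x, y in training_data:
    loss = loss + criterion(model(x), y)  # ρ(xᵢ, θ)
loss = loss / len(training_data)          # (1/n) Σ ρ(xᵢ, θ)

# Minimize: one gradient step on θ
loss.backward()
optimizer.step()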

🎯 Classical estimation theory and modern deep learning share the same mathematical framework!

Interactive Geometric Visualization

The discrepancy can be pictured as a "bowl" over the parameter space.

[Interactive figure: The Discrepancy "Bowl", 2D Visualization. Sliders: true parameter θ₀ (shown at 2.0) and candidate θ (shown at 0.0), with a toggle to show the sample contrast ρ(X, θ). Legend: true discrepancy D(θ₀, θ), current θ, minimum at θ₀.]

The discrepancy function D(θ₀, θ) measures how "wrong" a candidate θ is when θ₀ is the true parameter. It forms a bowl shape with the minimum at θ = θ₀.

Key Insight

As θ moves away from θ₀, the discrepancy grows. The "bowl" shape ensures there's a unique minimum at θ = θ₀.

[Interactive figure: The Discrepancy "Bowl", 3D Visualization. Sliders: θ₀₁, θ₀₂, θ₁, θ₂, with a wireframe toggle. Legend: bowl surface D(θ₀, θ), minimum at θ₀, current θ, clicked point. Example reading: θ₀ = (0.0, 0.0), θ = (1.0, 1.0), D(θ₀, θ) = 2.00.]

With two parameters (θ₁, θ₂), the discrepancy function forms a 3D bowl. The minimum sits at the true parameter values (θ₀₁, θ₀₂). D grows as θ moves away from θ₀, and in this visualization D = 0 at the bottom of the bowl, where θ = θ₀.

What You're Seeing

The 2D curve shows the discrepancy for a single parameter (e.g., estimating a mean). The 3D surface shows what happens with two parameters — the bowl becomes a paraboloid, and the minimum sits at the true parameter values.

Notice how the sample contrast (dashed orange line in 2D) is "noisy" compared to the true discrepancy (solid green). With more data, the sample contrast converges to the true discrepancy — this is the law of large numbers at work!
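The convergence can also be checked numerically. The sketch below assumes $X \sim \text{Normal}(\mu_0, 1)$ with the contrast $\rho(x, \mu) = (x - \mu)^2 / 2$, for which $D(\mu_0, \mu) = ((\mu - \mu_0)^2 + 1)/2$. The value `mu0 = 3.0`, the grid, and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
mu0 = 3.0                                   # illustrative true mean
mu_grid = np.linspace(0.0, 6.0, 61)

# Population discrepancy for rho(x, mu) = (x - mu)^2 / 2 with X ~ Normal(mu0, 1).
D = ((mu_grid - mu0) ** 2 + 1.0) / 2.0

gaps = {}
for n in [10, 1_000, 100_000]:
    x = rng.normal(mu0, 1.0, size=n)
    # Average contrast evaluated at every candidate mu on the grid.
    rho_bar = np.array([np.mean((x - mu) ** 2 / 2.0) for mu in mu_grid])
    gaps[n] = float(np.max(np.abs(rho_bar - D)))
    print(f"n = {n:>6}: max |sample contrast - D| = {gaps[n]:.4f}")
```

The largest gap over the grid shrinks as `n` grows, which is the sense in which the noisy sample curve settles onto the population bowl.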

Minimum Contrast Estimates

Since we can't compute $D(\theta_0, \theta)$ (we don't know $\theta_0$ ), we need a clever workaround. Here's the key insight:

For each fixed $\theta$, $E_{\theta_0}[\rho(X, \theta)] = D(\theta_0, \theta)$ by definition, so $\rho(X, \theta)$ is an unbiased estimate of $D(\theta_0, \theta)$ and we can minimize $\rho$ instead of $D$!

The Minimum Contrast Estimator

Define the minimum contrast estimate as:

$\hat{\theta}(X) = \arg\min_{\theta \in \Theta} \rho(X, \theta)$

In words: $\hat{\theta}(X)$ is the parameter value that minimizes the contrast function for the observed data $X$ .

Component	Meaning
$\arg\min$	"argument that minimizes" — returns the $\theta$ , not the minimum value
$\theta \in \Theta$	Search over all valid parameter values
$\rho(X, \theta)$	Contrast function for observed $X$

Why This Works

The fundamental idea is beautifully simple:

If we could compute $D(\theta_0, \theta)$ and minimize it, we'd find $\theta_0$ (the truth)
We can't compute $D$ directly ( $\theta_0$ is unknown)
But $\rho(X, \theta)$ is an "unbiased estimate" of $D$ in a weak sense: for each fixed $\theta$, its expectation under $\theta_0$ equals $D(\theta_0, \theta)$
So minimizing $\rho(X, \theta)$ should give us something close to $\theta_0$

The Sample Version

With $n$ observations $X_1, \ldots, X_n$ , we often use the average contrast:

$\bar{\rho}(X, \theta) = \frac{1}{n}\sum_{i=1}^{n}\rho(X_i, \theta)$

Minimizing this average is the basis for many estimation methods.
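As a small check of this idea, the sketch below minimizes the average squared-error contrast numerically with `scipy.optimize.minimize_scalar` and compares the result with the sample mean. The choice of squared error and the four data values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([2.3, 1.8, 4.1, 3.2])          # wait times from this section

def rho_bar(theta: float) -> float:
    """Average squared-error contrast (1/n) * sum (x_i - theta)^2."""
    return float(np.mean((x - theta) ** 2))

result = minimize_scalar(rho_bar)            # numerical argmin over theta
theta_hat = result.x

print(f"numerical minimizer = {theta_hat:.4f}")
print(f"sample mean         = {x.mean():.4f}")
```

For this contrast the minimum contrast estimate is the sample mean, so the optimizer and the closed form agree up to numerical tolerance.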

Estimating Equations

Now we introduce a powerful technique: instead of directly minimizing the contrast, we use calculus to find where the derivative equals zero.

The Gradient Condition

Suppose $\Theta$ is Euclidean ( $\Theta \subset \mathbb{R}^d$ ), $\theta_0$ is an interior point, and $D$ is smooth. Then at the minimum:

$\nabla_{\theta} D(\theta_0, \theta) \Big|_{\theta=\theta_0} = 0$

where $\nabla_{\theta}$ denotes the gradient (vector of partial derivatives):

$\nabla_{\theta} = \left(\frac{\partial}{\partial \theta_1}, \ldots, \frac{\partial}{\partial \theta_d}\right)$

Why the Gradient?

At a minimum of a smooth function, the surface is "flat" — all partial derivatives are zero. The gradient collects all these derivatives into one vector equation.

The Estimating Equation

Since we can't evaluate $D$ directly, we use the sample version. The estimating equation is:

$\nabla_{\theta}\rho(X, \hat{\theta}) = 0$

In words: Find $\hat{\theta}$ such that the gradient of the contrast function (evaluated at $\hat{\theta}$ ) equals zero.

This Is a System of Equations

If $\theta \in \mathbb{R}^d$ , then this gives us $d$ equations in $d$ unknowns:

$\frac{\partial \rho(X, \theta)}{\partial \theta_j}\Big|_{\theta=\hat{\theta}} = 0 \quad \text{for } j = 1, \ldots, d$
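In the one-dimensional case ($d = 1$) the system reduces to a single equation that a root finder can solve. The sketch below assumes an exponential model with rate $\lambda$, whose negative log-likelihood contrast is $\rho(X, \lambda) = -n \log \lambda + \lambda \sum_i X_i$. The model choice and the bracketing interval are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import root_scalar

x = np.array([2.3, 1.8, 4.1, 3.2])          # wait times from this section
n = len(x)

def grad_rho(lam: float) -> float:
    """Derivative in lam of rho(X, lam) = -n*log(lam) + lam*sum(x)."""
    return -n / lam + x.sum()

# Solve the estimating equation grad_rho(lam_hat) = 0 on a bracketing interval.
sol = root_scalar(grad_rho, bracket=[1e-3, 100.0], method="brentq")
lam_hat = sol.root

print(f"root of the estimating equation = {lam_hat:.4f}")
print(f"closed form 1 / sample mean     = {1 / x.mean():.4f}")
```

Setting the derivative to zero by hand gives $\hat{\lambda} = 1/\bar{X}$, which the numerical root reproduces.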

General Estimating Equations

There's an even more general framework that doesn't require starting from a contrast function. We directly specify estimating functions.

The $\Psi$ -Function Approach

"More generally, suppose we are given a function $\Psi : \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}^d$ ..."

✓ Pure Symbolic Meaning

$\Psi$ is a function that takes two inputs:

• $X \in \mathcal{X}$ (data)
• $\theta \in \mathbb{R}^d$ (parameter vector)

and outputs a d-dimensional vector in $\mathbb{R}^d$:

$\Psi(X, \theta) = \begin{pmatrix} \psi_1(X, \theta) \\ \vdots \\ \psi_d(X, \theta) \end{pmatrix}$

So Ψ is like:

• ψ₁(X, θ) = some equation
• ψ₂(X, θ) = another equation
• ...
• ψ_d(X, θ) = last equation

✓ Intuition

Instead of using gradients ($\nabla\rho$), this says:

Let's use any vector-valued function $\Psi(X, \theta)$ whose expectation is 0 at the true $\theta_0$.

Think of $\Psi$ as a system of equations the true θ must satisfy.

Examples:

• moment conditions
• score equations (log-likelihood derivative)
• sample moment equations from the generalized method of moments (GMM)
• regression normal equations

🎯 What is Ψ (Psi)? The "Error Detector"

Ψ is an "error detector" — a function that measures how "wrong" a candidate parameter θ is, given the observed data X.

The Key Idea: When θ equals the true parameter θ₀, the function Ψ(X, θ) should average to zero:

E[Ψ(X, θ₀)] = 0 ← "No error signal at the truth"

Think of it like a thermostat:

• If the room temperature (θ) matches the target (θ₀), the error signal is zero → Ψ = 0
• If the room is too cold, the error is negative → Ψ < 0 (heat more!)
• If the room is too hot, the error is positive → Ψ > 0 (cool down!)

Formally, $\Psi$ is a vector-valued function that takes data and a parameter, and outputs an "error vector":

$\Psi : \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}^d$

where $\Psi \equiv (\psi_1, \ldots, \psi_d)^T$ — each component $\psi_j$ checks a different aspect of the parameter.

Two Parameters (d = 2)

Input: data X = 5.2; parameters θ = (μ, σ²) = (3.0, 2.0)
Output (2D error vector): Ψ(X, θ) = (2.20, 0.42)
Use case: Normal distribution, estimating mean (μ) and variance (σ²)

Three Parameters (d = 3)

Input: data X = (y, x₁, x₂) = (8, 2, 3); parameters θ = (β₀, β₁, β₂) = (1, 2, 1)
Output (3D error vector): Ψ(X, θ) = (0.00, 0.00, 0.00)
✓ All zeros → θ solves the estimating equation (for least squares with Gaussian errors, this is the MLE)
Use case: Linear regression, y = β₀ + β₁x₁ + β₂x₂

Notice: The output dimension matches the number of parameters (d). Each ψⱼ tells you how "off" the j-th parameter is. When all outputs are zero, you've found the estimate!

Symbol	Meaning	Intuition
$\Psi$	Vector of estimating functions	The "error detector" that tells us how wrong θ is
$\psi_j$	The $j$ -th component function	Checks if the j-th parameter component is correct
$d$	Dimension of parameter space	Number of parameters to estimate (d equations for d unknowns)
$\mathcal{X}$	Data space	All possible values the data X can take

📊 Concrete Example: Estimating the Mean

Suppose we want to estimate the mean μ of a distribution. A natural Ψ-function is:

Ψ(X, μ) = X - μ

If...	Then Ψ =	Meaning
X = 5, μ = 5	0	✅ μ is correct!
X = 7, μ = 5	+2	⬆️ μ is too low
X = 3, μ = 5	-2	⬇️ μ is too high

Key property: E[Ψ(X, μ₀)] = E[X - μ₀] = μ₀ - μ₀ = 0 when μ = μ₀ (the true mean).

Population Condition

Define the population version:

$V(\theta_0, \theta) = E_{\theta_0}[\Psi(X, \theta)]$

We require that $V(\theta_0, \theta) = 0$ has $\theta_0$ as its unique solution.

This means: when the true parameter is $\theta_0$ , the expected value of $\Psi(X, \theta)$ is zero only when $\theta = \theta_0$ .
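The population condition can be checked approximately by Monte Carlo. The sketch below uses a moment-based $\Psi$ for a normal model, with illustrative true values $\theta_0 = (5, 4)$ and a large simulated sample standing in for the expectation. The function name `psi` and the "wrong" parameter $(4, 4)$ are hypothetical choices.

```python
import numpy as np

def psi(x: np.ndarray, theta) -> np.ndarray:
    """Moment-based estimating function for Normal(mu, sigma2); one row per observation."""
    mu, sigma2 = theta
    return np.column_stack([x - mu, (x - mu) ** 2 - sigma2])

rng = np.random.default_rng(3)
theta0 = (5.0, 4.0)                          # illustrative true (mu, sigma2)
x = rng.normal(theta0[0], np.sqrt(theta0[1]), size=200_000)

V_true = psi(x, theta0).mean(axis=0)         # Monte Carlo estimate of V(theta0, theta0)
V_wrong = psi(x, (4.0, 4.0)).mean(axis=0)    # V(theta0, theta) at a wrong theta

print("at the truth  :", np.round(V_true, 3))
print("at wrong theta:", np.round(V_wrong, 3))
```

Both components average to roughly zero at $\theta_0$, while the wrong parameter leaves a clearly nonzero "error signal".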

The Estimating Equation Estimate

We define $\hat{\theta}$ as the solution to:

$\Psi(X, \hat{\theta}) = 0$

This is $d$ equations in $d$ unknowns — one equation for each component of $\theta$ .

Connection to Contrast Functions

When $\Psi(X, \theta) = \nabla_{\theta} \rho(X, \theta)$ , the estimating equation approach reduces to the minimum contrast approach. But the $\Psi$ -function framework is more general!

Intuitive Examples

Let's make these abstract concepts concrete with familiar examples.

Example 1: Method of Moments

Setup: Estimate the mean $\mu$ from data $X_1, \ldots, X_n$ .

Estimating function: $\psi(X_i, \mu) = X_i - \mu$

Estimating equation: $\sum_{i=1}^n (X_i - \hat{\mu}) = 0$

Solution: $\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$

The sample mean! This is the simplest estimating equation estimate.

Example 2: Maximum Likelihood (Preview)

Setup: Observations with density $f(x; \theta)$ .

Contrast function (negative log-likelihood):

$\rho(X, \theta) = -\sum_{i=1}^n \log f(X_i; \theta)$

Estimating equation (score equation):

$\sum_{i=1}^n \frac{\partial \log f(X_i; \theta)}{\partial \theta}\Big|_{\theta=\hat{\theta}} = 0$

This is the famous score equation — the foundation of Maximum Likelihood Estimation (covered in detail in Chapter 12).

Example 3: Least Squares

Setup: Linear regression $Y = X\beta + \varepsilon$, where $X$ here denotes the design matrix whose $i$-th row is $X_i^T$.

Contrast function (sum of squared errors):

$\rho(Y, \beta) = \sum_{i=1}^n (Y_i - X_i^T\beta)^2$

Estimating equation (normal equations):

$X^T(Y - X\hat{\beta}) = 0$

Solution: $\hat{\beta} = (X^TX)^{-1}X^TY$ (assuming $X^TX$ is invertible)
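A minimal sketch of the normal equations in NumPy follows. It uses noise-free toy data generated from $y = 2 + 3x$ (an illustrative choice) so that the solution is known exactly, and it solves the linear system rather than forming the inverse.

```python
import numpy as np

# Noise-free toy data from y = 2 + 3*x, so the solution is known exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept column
Y = 2.0 + 3.0 * x

# Normal equations: (X^T X) beta_hat = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The estimating equation X^T (Y - X beta_hat) = 0 should hold up to rounding.
residual_check = X.T @ (Y - X @ beta_hat)

print("beta_hat      =", beta_hat)
print("X^T residuals =", residual_check)
```

Solving the system with `np.linalg.solve` is generally preferable to computing $(X^TX)^{-1}$ explicitly, and `np.linalg.lstsq` is a more robust option when $X^TX$ is close to singular.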

Complete Symbol Glossary

Here's a comprehensive reference for all the symbols in this section:

Symbol	Name	Meaning
$X$	Observation	The data we observe (random vector)
$x$	Realized data	Specific observed values (lowercase = fixed)
$\mathcal{X}$	Sample space	All possible values $X$ could take
$P$	Distribution	Probability law governing $X$
$\mathcal{P}$	Distribution family	Set of candidate distributions
$\theta$	Parameter	Unknown quantity we want to estimate
$\theta_0$	True parameter	The actual parameter value (unknown)
$\Theta$	Parameter space	All valid parameter values
$P_{\theta}$	Parametric distribution	Distribution when parameter is $\theta$
$\hat{\theta}$	Estimator/estimate	Our guess for $\theta$ based on data
$\rho$	Contrast function	Measures incompatibility of $\theta$ with $X$
$D(\theta_0, \theta)$	Discrepancy	Expected contrast under true $\theta_0$
$E_{\theta_0}$	Expectation	Expected value under distribution $P_{\theta_0}$
$\nabla_{\theta}$	Gradient	Vector of partial derivatives w.r.t. $\theta$
$\Psi$	Estimating function	Vector-valued function defining equations
$V(\theta_0, \theta)$	Population moment	Expected value of $\Psi$ under $\theta_0$

Python Implementation

Implementing a General Estimating Equation Solver

🐍 estimating_equations.py

import numpy as np
from scipy.optimize import fsolve
from typing import Callable

def solve_estimating_equation(
    psi: Callable[[float, np.ndarray], np.ndarray],
    X: np.ndarray,
    theta_init: np.ndarray
) -> np.ndarray:
    """
    Solve the estimating equation sum_i Psi(X_i, theta) = 0.

    Parameters:
    -----------
    psi : Callable
        Estimating function Psi(x_i, theta) -> R^d for one observation
    X : np.ndarray
        Observed data
    theta_init : np.ndarray
        Initial guess for theta

    Returns:
    --------
    np.ndarray : Solution theta_hat
    """
    # Define the equation to solve: sum of psi over observations
    def equation(theta: np.ndarray) -> np.ndarray:
        return np.sum([psi(x_i, theta) for x_i in X], axis=0)

    # Solve using scipy's fsolve
    theta_hat = fsolve(equation, theta_init)
    return theta_hat


# Example 1: Method of Moments for Normal(mu, sigma^2)
def psi_normal(x: float, theta: np.ndarray) -> np.ndarray:
    """
    Estimating function for Normal(mu, sigma^2).
    theta = (mu, sigma^2)
    """
    mu, sigma2 = theta
    return np.array([
        x - mu,                    # For mu: E[X - mu] = 0
        (x - mu)**2 - sigma2       # For sigma^2: E[(X-mu)^2 - sigma^2] = 0
    ])


# Generate data from Normal(5, 4); scale is the standard deviation, sqrt(4) = 2
np.random.seed(42)
X = np.random.normal(loc=5, scale=2, size=100)

# Solve; the variance estimate is the 1/n (not 1/(n-1)) sample variance
theta_hat = solve_estimating_equation(psi_normal, X, theta_init=np.array([0.0, 1.0]))
print("True theta = (5, 4)")
print(f"Estimated theta_hat = ({theta_hat[0]:.4f}, {theta_hat[1]:.4f})")

Minimum Contrast Estimation

🐍 minimum_contrast.py

import numpy as np
from scipy.optimize import minimize
from typing import Callable, Optional, Sequence

def minimum_contrast_estimate(
    rho: Callable[[np.ndarray, np.ndarray], float],
    X: np.ndarray,
    theta_init: np.ndarray,
    bounds: Optional[Sequence] = None
) -> np.ndarray:
    """
    Find theta that minimizes the contrast function rho(X, theta).

    Parameters:
    -----------
    rho : Callable
        Contrast function rho(X, theta) -> R
    X : np.ndarray
        Observed data
    theta_init : np.ndarray
        Initial guess
    bounds : sequence of (min, max) pairs, optional
        Parameter bounds, one pair per component of theta

    Returns:
    --------
    np.ndarray : Minimum contrast estimate theta_hat
    """
    def objective(theta: np.ndarray) -> float:
        return rho(X, theta)

    result = minimize(objective, theta_init, bounds=bounds, method='L-BFGS-B')
    return result.x


# Example: Least Squares Regression
def least_squares_contrast(X: np.ndarray, beta: np.ndarray) -> float:
    """
    Contrast function: sum of squared residuals.
    X has columns [1, x1, x2, ..., y] (last column is response)
    """
    design = X[:, :-1]  # Features (including the intercept column)
    y = X[:, -1]        # Response
    y_pred = design @ beta
    return np.sum((y - y_pred)**2)


# Generate linear regression data: y = 2 + 3*x + noise
np.random.seed(42)
n = 100
x = np.random.uniform(0, 10, n)
y = 2 + 3*x + np.random.normal(0, 1, n)
X = np.column_stack([np.ones(n), x, y])  # [1, x, y]

# Estimate
beta_hat = minimum_contrast_estimate(
    least_squares_contrast, X,
    theta_init=np.array([0.0, 0.0])
)
print("True beta = (2, 3)")
print(f"Estimated beta_hat = ({beta_hat[0]:.4f}, {beta_hat[1]:.4f})")

Visualizing the Discrepancy Function

🐍visualize_discrepancy.py

import numpy as np
import matplotlib.pyplot as plt

# Setup: Estimate mean mu of Normal(mu, 1)
true_mu = 3.0
n = 50

# Generate data
np.random.seed(42)
X = np.random.normal(loc=true_mu, scale=1, size=n)

# Define contrast function (negative log-likelihood per observation)
def contrast(x_i, mu):
    """rho(x, mu) = (x - mu)^2 / 2 (negative log-likelihood up to an additive constant)"""
    return (x_i - mu)**2 / 2

# Compute sample contrast for range of mu values (contrast broadcasts over X)
mu_range = np.linspace(0, 6, 200)
sample_contrast = np.array([np.mean(contrast(X, mu)) for mu in mu_range])

# The true discrepancy D(mu_0, mu) = E[(X - mu)^2/2] = ((mu - mu_0)^2 + 1)/2
true_discrepancy = ((mu_range - true_mu)**2 + 1) / 2

# Plot (raw f-strings keep the LaTeX backslashes intact)
plt.figure(figsize=(10, 6))
plt.plot(mu_range, sample_contrast, 'b-', linewidth=2,
         label=r'Sample contrast $\bar{\rho}(X, \mu)$')
plt.plot(mu_range, true_discrepancy, 'r--', linewidth=2,
         label=r'True discrepancy $D(\mu_0, \mu)$')
plt.axvline(true_mu, color='g', linestyle=':', linewidth=2,
            label=rf'True $\mu_0$ = {true_mu}')
plt.axvline(np.mean(X), color='purple', linestyle=':', linewidth=2,
            label=rf'Estimate $\hat{{\mu}}$ = {np.mean(X):.3f}')

plt.xlabel(r'$\mu$', fontsize=14)
plt.ylabel('Contrast / Discrepancy', fontsize=14)
plt.title('Sample Contrast vs Population Discrepancy', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('discrepancy_visualization.png', dpi=150)
plt.show()

# The minimum of sample contrast gives our estimate.
# The grid minimizer matches the sample mean only up to the grid spacing.
mu_hat = mu_range[np.argmin(sample_contrast)]
print(f"True mu = {true_mu}")
print(f"Sample mean = {np.mean(X):.4f}")
print(f"Minimizer of sample contrast = {mu_hat:.4f}")
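The closed form used for the true discrepancy in the script above comes from expanding the square around $\mu_0$. A short derivation, assuming $X \sim N(\mu_0, 1)$ as in the script:

$$
\begin{aligned}
D(\mu_0, \mu) &= E_{\mu_0}\!\left[\tfrac{1}{2}(X - \mu)^2\right] \\
&= \tfrac{1}{2}\, E_{\mu_0}\!\left[(X - \mu_0)^2 + 2(X - \mu_0)(\mu_0 - \mu) + (\mu_0 - \mu)^2\right] \\
&= \tfrac{1}{2}\left[\,1 + 0 + (\mu_0 - \mu)^2\,\right] = \frac{(\mu - \mu_0)^2 + 1}{2}.
\end{aligned}
$$

The first term is the variance of $X$, and the cross term vanishes because $E_{\mu_0}[X - \mu_0] = 0$. The minimum is at $\mu = \mu_0$ with value $1/2$, which is why the dashed curve bottoms out at $1/2$ rather than at zero.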

Key Insights

The Estimation Recipe

How to Construct an Estimator

Choose a contrast function $\rho(X, \theta)$ that measures how "incompatible" $\theta$ is with the data
Verify the key property: $D(\theta_0, \theta)$ is minimized at $\theta = \theta_0$
Option A: Directly minimize $\rho(X, \theta)$ over $\theta$
Option B: Solve the estimating equation $\nabla \rho(X, \hat{\theta}) = 0$
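To see that Option A and Option B lead to the same place, here is a minimal sketch on an exponential model. The data, the rate 0.5, the search bracket, and the helper names `rho` and `grad_rho` are assumptions made for this illustration, not part of the section's earlier code.

```python
import numpy as np
from scipy.optimize import brentq, minimize_scalar

# Illustrative data: exponential waiting times with true rate 0.5 (assumed)
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=500)
n, total = X.size, X.sum()

def rho(lam: float) -> float:
    """Contrast: negative log-likelihood of Exponential(rate=lam)."""
    return -n * np.log(lam) + lam * total

def grad_rho(lam: float) -> float:
    """Derivative of rho with respect to lam."""
    return -n / lam + total

# Option A: minimize the contrast directly over a bracket assumed to hold the minimizer
lam_min = minimize_scalar(rho, bounds=(1e-6, 10.0), method="bounded").x

# Option B: solve the estimating equation grad_rho(lam) = 0 by root-finding
lam_root = brentq(grad_rho, 1e-6, 10.0)

print(f"Option A (minimize):     {lam_min:.6f}")
print(f"Option B (root-finding): {lam_root:.6f}")
print(f"Closed form 1/mean(X):   {1.0 / X.mean():.6f}")
```

Both routes should agree with the closed-form solution $\hat{\lambda} = 1/\bar{X}$ up to solver tolerance, because this contrast is convex in $\lambda$ and so has a single stationary point.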

Why This Framework Is Powerful

Unifying: Maximum Likelihood, Least Squares, and Method of Moments are all special cases
Flexible: Can handle complex models by choosing appropriate contrast functions
Analyzable: Properties of $\hat{\theta}$ (bias, variance, consistency) follow from properties of $\rho$
Computationally tractable: Estimating equations are often easier to solve than the original optimization problem

Caution: Uniqueness

Not all contrast functions have unique minima! When $\rho(X, \theta)$ has multiple local minima, different starting points may give different estimates. Always check that your estimator is well-defined.
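A small sketch of how this can happen, using a Cauchy location contrast on two far-apart observations. The data values, step size, and the helper `gradient_descent` are hypothetical choices for this illustration.

```python
import numpy as np

# Two far-apart observations (illustrative values chosen for this sketch)
X = np.array([-5.0, 5.0])

def rho(theta: float) -> float:
    """Cauchy location contrast: sum of log(1 + (x_i - theta)^2)."""
    return float(np.sum(np.log1p((X - theta) ** 2)))

def grad_rho(theta: float) -> float:
    """Derivative of rho with respect to theta."""
    r = theta - X
    return float(np.sum(2.0 * r / (1.0 + r ** 2)))

def gradient_descent(theta0: float, step: float = 0.2, n_iter: int = 500) -> float:
    """Plain fixed-step gradient descent; settles in the basin containing theta0."""
    theta = theta0
    for _ in range(n_iter):
        theta -= step * grad_rho(theta)
    return theta

theta_left = gradient_descent(-4.0)
theta_right = gradient_descent(4.0)

print(f"Start at -4 -> theta_hat = {theta_left:.4f}, rho = {rho(theta_left):.4f}")
print(f"Start at +4 -> theta_hat = {theta_right:.4f}, rho = {rho(theta_right):.4f}")
```

For this data the stationary points can be worked out by hand: the gradient is proportional to $\theta(\theta^2 - 24)$, so there are two minima at $\pm\sqrt{24}$ and a local maximum at $0$. The two runs end at different estimates with the same contrast value, so the minimum contrast estimate is not uniquely defined here.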

Summary

This section introduced the foundational framework for estimation theory. Here are the key takeaways:

Core Concepts

Concept	Key Formula	Intuition
Parametric model	$\mathcal{P} = \{P_{\theta} : \theta \in \Theta\}$	Data comes from a distribution indexed by $\theta$
Estimator	$\hat{\theta}(X) : \mathcal{X} \to \Theta$	Recipe for guessing $\theta$ from data
Contrast function	$\rho : \mathcal{X} \times \Theta \to \mathbb{R}$	Measures incompatibility of $\theta$ with $X$
Discrepancy	$D(\theta_0, \theta) = E_{\theta_0}[\rho(X,\theta)]$	Average contrast under true $\theta_0$
Min contrast estimate	$\hat{\theta} = \arg\min_{\theta \in \Theta} \rho(X, \theta)$	$\theta$ that best fits the data
Estimating equation	$\nabla \rho(X, \hat{\theta}) = 0$	Find $\hat{\theta}$ where gradient vanishes

The Big Ideas

Estimation is about finding the parameter that generated our data — we use observable quantities to infer unobservable truths
Contrast functions formalize "goodness of fit" — smaller contrast means better compatibility with data
The key requirement is identification — $D(\theta_0, \theta)$ must be uniquely minimized at $\theta_0$
Estimating equations are often easier to solve — differentiation turns optimization into root-finding
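Least squares is a case where the last point is easy to check: setting the gradient of the squared-error contrast to zero gives a linear system (the normal equations), so no iterative optimizer is needed. A minimal sketch, assuming simulated data from $y = 2 + 3x + \text{noise}$ and a design matrix named `A` for this example:

```python
import numpy as np

# Simulated regression data: y = 2 + 3*x + noise (illustrative)
rng = np.random.default_rng(42)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)
A = np.column_stack([np.ones(n), x])  # design matrix with intercept column

# Estimating equation: the gradient of ||y - A beta||^2 is -2 A^T (y - A beta).
# Setting it to zero gives the normal equations (A^T A) beta = A^T y.
beta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Cross-check against NumPy's least-squares routine, which minimizes the same contrast
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

# The gradient at the solution should vanish up to floating-point error
grad = -2 * A.T @ (y - A @ beta_normal)

print(f"Normal equations: ({beta_normal[0]:.4f}, {beta_normal[1]:.4f})")
print(f"np.linalg.lstsq:  ({beta_lstsq[0]:.4f}, {beta_lstsq[1]:.4f})")
print(f"Max |gradient| at solution: {np.max(np.abs(grad)):.2e}")
```

Solving the normal equations directly is fine for a small, well-conditioned problem like this one; for ill-conditioned designs, a QR- or SVD-based routine such as `np.linalg.lstsq` is usually the safer choice.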

Coming Next: In the next section, we'll explore the key properties of estimators: bias, variance, and mean squared error (MSE). These concepts help us evaluate how "good" an estimator is and choose between competing estimation strategies.
