Chapter 23

Gauss-Markov Theorem

Linear Regression

Learning Objectives

By the end of this section, you will be able to:

  1. Understand BLUE: Explain what "Best Linear Unbiased Estimator" means and why OLS achieves this property
  2. State the assumptions: List and interpret the four classical linear regression assumptions required for the Gauss-Markov theorem
  3. Recognize violations: Identify when assumptions are violated and understand the consequences for OLS estimation
  4. Apply remedies: Know which alternative estimators to use when different assumptions fail (WLS, GLS, IV)
  5. Connect to ML: Understand how the Gauss-Markov theorem relates to regularization, deep learning, and modern machine learning

The Big Picture

The Gauss-Markov theorem is one of the most important results in statistical theory. It tells us that under certain conditions, the ordinary least squares (OLS) estimator is not just a good estimator—it is the best possible linear unbiased estimator. This result provides the theoretical foundation for why linear regression has been the workhorse of statistics for over two centuries.

The Central Question: Among all possible ways to estimate regression coefficients using linear combinations of the data, which one has the smallest variance? The Gauss-Markov theorem answers: OLS, provided certain conditions hold.

Historical Origins

The theorem bears the names of two mathematical giants who contributed to its development:

| Mathematician | Contribution | Year |
|---|---|---|
| Carl Friedrich Gauss | Developed the method of least squares for astronomical calculations | 1795-1809 |
| Andrey Markov | Proved the optimality of OLS among linear unbiased estimators | 1912 |
| Aitken | Extended to generalized least squares (GLS) | 1935 |

Gauss originally developed least squares to predict the orbit of the dwarf planet Ceres. When Ceres disappeared behind the Sun in 1801, astronomers had only a few observations to predict where it would reappear. Using his method, Gauss predicted the location with remarkable accuracy—demonstrating the practical power of least squares estimation.

Why This Theorem Matters

The Gauss-Markov theorem matters because it answers a fundamental question in estimation theory: How do we choose among estimators? There are infinitely many ways to estimate a regression coefficient. The theorem tells us that if we want:

  • Unbiasedness: Our estimates should be correct on average (no systematic error)
  • Linearity: Our estimator should be a linear function of the observations
  • Efficiency: We want the smallest possible variance

...then OLS is the unique optimal choice. This is why linear regression remains the default method in countless applications, from economics to biology to engineering.

Key Insight for ML Engineers: The Gauss-Markov theorem explains why minimizing squared error gives the most precise linear unbiased coefficient estimates when the model is correctly specified and the errors are well behaved. When assumptions fail, we need regularization, robust methods, or nonlinear models, which are topics that define modern machine learning.

The Classical Linear Regression Assumptions

The Gauss-Markov theorem holds under a specific set of conditions known as the classical linear regression assumptions (also called the Gauss-Markov conditions). Understanding these assumptions is crucial because when they fail, OLS loses its optimality.

Assumption 1: Linearity in Parameters

The true relationship between the dependent variable $Y$ and the independent variables $X$ is linear in the parameters:

$$Y = X\beta + \varepsilon$$

This means:

  • $Y$ is an $n \times 1$ vector of observations
  • $X$ is an $n \times p$ design matrix (including the intercept column)
  • $\beta$ is a $p \times 1$ vector of unknown parameters
  • $\varepsilon$ is an $n \times 1$ vector of random errors
Note: "Linear in parameters" allows nonlinear transformations of X. For example, $Y = \beta_0 + \beta_1 X + \beta_2 X^2$ is linear in parameters even though it is a quadratic function of X.
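The note above can be illustrated with a minimal sketch: a quadratic model is fitted by ordinary least squares simply by putting $X$ and $X^2$ into the design matrix. The data, seed, and coefficient values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic signal plus Gaussian noise.
x = rng.uniform(-3, 3, size=200)
beta_true = np.array([1.0, 2.0, -0.5])  # [beta_0, beta_1, beta_2]
y = beta_true[0] + beta_true[1] * x + beta_true[2] * x**2 + rng.normal(0, 0.5, size=200)

# Columns 1, x, x^2: nonlinear in x, but still linear in beta.
X = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares solution of Y = X beta + eps.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", beta_hat)
```

The estimator never needs to know that the third column is a square; it only sees a matrix of regressors.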

Assumption 2: Strict Exogeneity

The errors have zero conditional mean given the regressors:

$$E[\varepsilon \mid X] = 0$$

This is the most important assumption. It implies:

  • The regressors X are not correlated with the error term
  • There are no omitted variables that are correlated with X
  • There is no simultaneity or reverse causality
  • Measurement error in X is absent
Critical: When strict exogeneity fails (endogeneity), OLS is biased and, in general, inconsistent. Unlike violations of the spherical-errors assumption, which only cost efficiency, this failure is not cured by collecting more data.
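A small simulation can make the omitted-variable case concrete. The sketch below assumes a made-up data-generating process in which an unobserved variable `z` drives both `x` and `y`; all coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical data-generating process: z is a confounder.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)                  # x is correlated with z
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)  # true slope on x is 2.0

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Short regression omits z, so z sits in the error term and E[eps | x] != 0.
slope_short = ols(np.column_stack([np.ones(n), x]), y)[1]

# Long regression includes z, which restores exogeneity.
slope_long = ols(np.column_stack([np.ones(n), x, z]), y)[1]

print("slope without z:", slope_short)
print("slope with z:   ", slope_long)
```

Under this setup the short-regression slope converges to $2 + 3 \cdot \text{Cov}(x, z)/\text{Var}(x) = 2 + 3 \cdot 0.8/1.64 \approx 3.46$ rather than 2, and increasing `n` does not remove the gap.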

Assumption 3: Spherical Errors

The error terms are homoscedastic (constant variance) and uncorrelated:

$$E[\varepsilon\varepsilon' \mid X] = \sigma^2 I_n$$

This "spherical" condition combines two requirements:

| Condition | Mathematical Form | Interpretation |
|---|---|---|
| Homoscedasticity | Var(εᵢ|X) = σ² for all i | Error variance is constant across observations |
| No Autocorrelation | Cov(εᵢ, εⱼ|X) = 0 for i ≠ j | Errors are uncorrelated with each other |

Why "Spherical"? Viewed as a point in n-dimensional space, the error vector has the same spread in every direction, so its covariance ellipsoid is a sphere. The variance-covariance matrix $\sigma^2 I_n$ means equal variance in all directions with no correlation.
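To make the contrast concrete, the display below writes out the spherical covariance matrix next to two common non-spherical forms. The AR(1) pattern with parameter $\rho$ is an illustrative assumption, not something the theorem itself specifies.

```latex
% Spherical errors (Assumption 3): equal diagonal, zero off-diagonal
\operatorname{Var}(\varepsilon \mid X) = \sigma^2 I_n =
\begin{pmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{pmatrix}

% Heteroscedasticity: unequal diagonal entries
\operatorname{Var}(\varepsilon \mid X) = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)

% Autocorrelation, e.g. stationary AR(1) errors with |\rho| < 1:
% nonzero off-diagonal entries that decay with the lag |i - j|
\operatorname{Cov}(\varepsilon_i, \varepsilon_j \mid X) = \sigma^2 \rho^{|i - j|}
```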

Assumption 4: Full Rank (No Perfect Multicollinearity)

The design matrix X has full column rank:

$$\text{rank}(X) = p$$

This ensures:

  • The matrix $X'X$ is invertible
  • No regressor is a perfect linear combination of other regressors
  • The OLS estimator $\hat{\beta} = (X'X)^{-1}X'Y$ is well-defined and unique
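A quick way to check this assumption numerically is to compare the rank of the design matrix with its number of columns. The sketch below uses made-up regressors, one of which is deliberately an exact linear combination of the others.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Full-rank design: intercept plus two unrelated regressors.
X_ok = np.column_stack([np.ones(n), x1, x2])

# Rank-deficient design: the third column equals 2 * x1 + 3 * (intercept column).
X_bad = np.column_stack([np.ones(n), x1, 2.0 * x1 + 3.0])

rank_ok = np.linalg.matrix_rank(X_ok)
rank_bad = np.linalg.matrix_rank(X_bad)
cond_bad = np.linalg.cond(X_bad.T @ X_bad)

print("rank of full-rank design:", rank_ok, "of", X_ok.shape[1])
print("rank of collinear design:", rank_bad, "of", X_bad.shape[1])
print("condition number of X'X for collinear design:", cond_bad)
```

In practice near-collinearity is more common than exact collinearity; a very large condition number for $X'X$ is a warning sign even when the reported rank equals $p$.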

The Gauss-Markov Theorem

What is BLUE?

BLUE stands for Best Linear Unbiased Estimator. Let's unpack each word:

| Term | Meaning | Implication |
|---|---|---|
| Best | Minimum variance | Among all comparable estimators, OLS has the smallest variance |
| Linear | Linear function of Y | The estimator β̂ = AY for some matrix A depending only on X |
| Unbiased | E[β̂] = β | On average, the estimator equals the true parameter value |
| Estimator | Function of data | A rule for computing an estimate from observed data |

Formal Statement

Gauss-Markov Theorem: Under assumptions 1-4, the OLS estimator $\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$ is BLUE. That is, for any other linear unbiased estimator $\tilde{\beta}$, we have:

$$\text{Var}(\hat{\beta}_{OLS}) \leq \text{Var}(\tilde{\beta})$$

where the inequality means $\text{Var}(\tilde{\beta}) - \text{Var}(\hat{\beta}_{OLS})$ is positive semi-definite.

The variance-covariance matrix of the OLS estimator is:

$$\text{Var}(\hat{\beta}_{OLS}) = \sigma^2 (X'X)^{-1}$$

This is the minimum achievable variance among all linear unbiased estimators.
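The variance formula can be checked empirically. The sketch below holds a made-up design matrix fixed, redraws the errors many times, and compares the sample covariance of the OLS estimates with $\sigma^2 (X'X)^{-1}$; the sample size, error scale, and coefficients are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 40, 1.5
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])  # fixed design
beta = np.array([1.0, 2.0])

XtX_inv = np.linalg.inv(X.T @ X)
theoretical = sigma**2 * XtX_inv

# Redraw the errors many times while holding X fixed.
n_sims = 20000
eps = rng.normal(0, sigma, size=(n_sims, n))
Y = X @ beta + eps                # shape (n_sims, n); X @ beta broadcasts over rows

# Row k holds the OLS estimate for simulation k: y_k' X (X'X)^{-1}.
beta_hats = Y @ X @ XtX_inv
empirical = np.cov(beta_hats, rowvar=False)

print("theoretical covariance:\n", theoretical)
print("empirical covariance:\n", empirical)
```

The two matrices should agree up to Monte Carlo error, and the average of `beta_hats` should sit close to `beta`, which is the unbiasedness half of BLUE.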

What This Does NOT Say: The theorem does not claim OLS is the best among ALL estimators—only among linear unbiased ones. There may exist biased estimators (like ridge regression) or nonlinear estimators with smaller mean squared error. This is the bias-variance tradeoff.

Interactive: BLUE Property Explorer

This visualization demonstrates the BLUE property by comparing OLS to alternative slope estimators. Run simulations to see that OLS consistently has the smallest variance:

[Interactive figure: BLUE Property Explorer. Panels: "Regression Comparison" (Y against X, with the true line and the OLS, Pairwise Average, and Theil-Sen fits) and "Slope Estimator Distribution" (100 simulations).]

Sample run (true slope β = 2, 100 simulations):

| Estimator | Mean slope | Variance | Variance relative to OLS |
|---|---|---|---|
| OLS (BLUE) | 2.0093 | 0.013257 | 1.00× |
| Pairwise Average | 0.6049 | 280.836042 | 21184.54× |
| Theil-Sen | 2.0105 | 0.013411 | 1.01× |

The simulation shows three estimators:

  • OLS (BLUE): Minimizes sum of squared residuals, achieves minimum variance among linear unbiased estimators
  • Pairwise Average: Averages all pairwise slopes; linear and unbiased but inefficient
  • Theil-Sen: Median of pairwise slopes; robust, but nonlinear in Y and so outside the class the theorem covers
Key Observation: All three estimators are centered on the true slope in expectation, but OLS has the smallest variance. The efficiency ratio shows how much more variable each alternative is compared to OLS. The pairwise average can be far worse because pairs with nearly equal X values produce extreme slopes, so its mean over a modest number of simulations is itself very noisy.

Proof Sketch

Strategy

The proof strategy is elegant: we show that any linear unbiased estimator can be written as OLS plus an additional term, and this additional term only adds variance—it can never reduce it.

Key Steps

  1. Represent any linear estimator: Any linear estimator can be written as $\tilde{\beta} = CY$ for some matrix $C$ that may depend on $X$ but not on $Y$.
  2. Decompose C: Write $C = (X'X)^{-1}X' + D$, where $D$ captures the deviation from OLS.
  3. Apply unbiasedness condition: For $\tilde{\beta}$ to be unbiased, we need $E[\tilde{\beta} \mid X] = \beta$ for every $\beta$. Since $E[\tilde{\beta} \mid X] = CX\beta = \beta + DX\beta$, this requires $DX = 0$.
  4. Calculate variance: Using $DX = 0$, the cross terms vanish:
    $$\text{Var}(\tilde{\beta}) = \sigma^2 CC' = \sigma^2 (X'X)^{-1} + \sigma^2 DD'$$
  5. Conclude: Since $DD'$ is positive semi-definite, the variance of $\tilde{\beta}$ is at least as large as the OLS variance. Equality holds only when $D = 0$, meaning $\tilde{\beta} = \hat{\beta}_{OLS}$.
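The algebra behind steps 3 and 4 can be written out in full. The derivation below uses only the symbols already defined above ($C$, $D$, $X$, $\sigma^2$) and conditions on $X$ throughout.

```latex
\begin{aligned}
E[\tilde{\beta} \mid X]
  &= C X \beta
   = \left[(X'X)^{-1}X' + D\right] X \beta
   = \beta + D X \beta
  \quad\Longrightarrow\quad DX = 0 \\[4pt]
CC'
  &= \left[(X'X)^{-1}X' + D\right]\left[X(X'X)^{-1} + D'\right] \\
  &= (X'X)^{-1} + (X'X)^{-1}(DX)' + (DX)(X'X)^{-1} + DD' \\
  &= (X'X)^{-1} + DD' \\[4pt]
\text{Var}(\tilde{\beta} \mid X)
  &= C \,\text{Var}(\varepsilon \mid X)\, C'
   = \sigma^2 CC'
   = \sigma^2 (X'X)^{-1} + \sigma^2 DD'
\end{aligned}
```

The spherical-errors assumption enters in the last line, where $\text{Var}(\varepsilon \mid X) = \sigma^2 I_n$ lets $\sigma^2$ factor out; without it the cross terms no longer cancel so neatly, which is why GLS rather than OLS is efficient in that case.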

When Assumptions Fail

Understanding what happens when the Gauss-Markov assumptions are violated is essential for applied work. Different violations have different consequences and require different remedies:

| Violation | OLS Bias? | OLS Efficient? | Standard Errors Valid? | Solution |
|---|---|---|---|---|
| Heteroscedasticity | No | No | No | WLS or robust SE |
| Autocorrelation | No | No | No | GLS or HAC SE |
| Endogeneity | Yes | N/A | N/A | IV/2SLS |
| Nonlinearity | Yes | N/A | N/A | Correct specification |
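The heteroscedasticity row can be sketched in a few lines of NumPy. The example below assumes a made-up process whose error standard deviation grows with $x^2$, then computes classical standard errors, White (HC0) robust standard errors, and a WLS fit that uses the known variance pattern. In real data the weights are rarely known and must be modelled or estimated.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])

# Hypothetical heteroscedastic errors: standard deviation proportional to x^2.
y = 1.0 + 2.0 * x + rng.normal(0, 0.1 * x**2)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical standard errors assume Var(eps | X) = sigma^2 I.
sigma2_hat = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# White (HC0) sandwich: (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

# WLS: divide each row by its error standard deviation (up to a constant), here x^2.
w = 1.0 / x**2
beta_wls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]

print("OLS slope:", beta_hat[1], "WLS slope:", beta_wls[1])
print("classical SE:", se_classical[1], "robust SE:", se_robust[1])
```

Both slope estimates are centered on the true value, which matches the "No" in the bias column; what changes is the reported uncertainty and, for WLS, the precision of the estimate itself.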

Interactive: Assumption Violations

Explore what happens to OLS when different assumptions are violated:

[Interactive figure: Assumption Violation Explorer. Panels: "Data and Fitted Line" (Y against X, with the true line and the OLS fit) and "Residual Plot" (residuals ε̂ against X). In the default state, with all assumptions satisfied, the errors are i.i.d. with mean zero and constant variance, OLS is BLUE, and standard OLS inference needs no fix. A sample run reports true β₁ = 1.500, OLS β̂₁ = 1.522 (reported bias +0.0221), and SE(β̂₁) = 0.0635.]

Critical Distinction: Heteroscedasticity and autocorrelation affect efficiency: OLS is still unbiased, but it is no longer optimal and its usual standard errors are invalid. Endogeneity and misspecification affect consistency: OLS is fundamentally flawed and won't converge to the true value.

Applications in Machine Learning

The Gauss-Markov theorem has profound implications for modern machine learning, even though ML often deliberately violates its assumptions.

Connection to Regularization

Ridge regression and LASSO deliberately introduce bias to reduce variance. This does not contradict Gauss-Markov: both estimators are biased, so they fall outside the class the theorem covers.

$$\text{MSE} = \text{Bias}^2 + \text{Variance}$$

  • OLS (BLUE): Zero bias, minimum variance among linear unbiased estimators
  • Ridge/LASSO: Small bias, much smaller variance, so they can have lower MSE
The Bias-Variance Tradeoff: Gauss-Markov tells us the best we can do without bias. Modern ML asks: what if we accept a little bias to get much lower variance? This is the essence of regularization.
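The tradeoff can be seen in a small Monte Carlo experiment. The sketch below assumes a made-up setting with few observations, noisy errors, and small true coefficients, and compares OLS with closed-form ridge at one hand-picked penalty `lam`. The penalty is not tuned, and other settings can favor OLS.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma, lam = 30, 10, 3.0, 5.0
X = rng.normal(size=(n, p))   # fixed design, no intercept for simplicity
beta = np.full(p, 0.5)        # hypothetical true coefficients

XtX = X.T @ X
A_ols = np.linalg.solve(XtX, X.T)                      # (X'X)^{-1} X'
A_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T)  # (X'X + lam I)^{-1} X'

# Redraw the errors many times while holding X fixed.
n_sims = 5000
Y = X @ beta + rng.normal(0, sigma, size=(n_sims, n))
est_ols = Y @ A_ols.T      # each row is one OLS estimate
est_ridge = Y @ A_ridge.T  # each row is one ridge estimate

def summarize(est):
    """Return (squared bias, variance, MSE), each summed over coefficients."""
    bias_sq = np.sum((est.mean(axis=0) - beta) ** 2)
    variance = np.sum(est.var(axis=0))
    return bias_sq, variance, bias_sq + variance

bias_ols, var_ols, mse_ols = summarize(est_ols)
bias_ridge, var_ridge, mse_ridge = summarize(est_ridge)

print(f"OLS:   bias^2={bias_ols:.4f}  variance={var_ols:.4f}  MSE={mse_ols:.4f}")
print(f"Ridge: bias^2={bias_ridge:.4f}  variance={var_ridge:.4f}  MSE={mse_ridge:.4f}")
```

Ridge shows a visibly nonzero squared bias but a much smaller variance, and in this configuration the sum is lower than for OLS. That is consistent with Gauss-Markov, which only ranks unbiased linear estimators.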

Implications for Deep Learning

Deep learning violates virtually every Gauss-Markov assumption:

| Assumption | Deep Learning Reality |
|---|---|
| Linearity | Neural networks are highly nonlinear |
| Homoscedasticity | Error variance often varies across input space |
| Independence | Batch normalization and attention create dependencies |
| Known functional form | Architecture is learned/designed heuristically |

Yet understanding Gauss-Markov helps deep learning practitioners:

  • Initialization: Xavier/He initialization aims to keep activation variance roughly constant across layers
  • Loss functions: MSE loss corresponds to maximum likelihood under homoscedastic Gaussian errors
  • Uncertainty: Heteroscedastic networks learn input-dependent variance
  • Regularization: Weight decay applies the same L2 penalty as ridge regression
When to Use OLS: Despite deep learning's power, OLS remains a strong choice when: (1) the true relationship is linear, (2) you have limited data, (3) interpretability matters, or (4) you need valid inference (confidence intervals, hypothesis tests).
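The link between MSE loss and homoscedastic Gaussian errors mentioned in the list above follows from writing out the Gaussian negative log-likelihood. In the display below, $f_\theta$ and $\sigma_\theta$ are hypothetical notation for a network's mean and standard-deviation outputs.

```latex
% Constant variance: minimizing the NLL over theta is the same as minimizing squared error
-\log p(y \mid x)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - f_\theta(x_i)\bigr)^2
  + \frac{n}{2} \log\bigl(2\pi\sigma^2\bigr)

% Input-dependent variance: each squared error is reweighted by 1 / sigma_theta^2(x_i),
% and the log term penalizes inflating the predicted variance
-\log p(y \mid x)
  = \sum_{i=1}^{n} \left[
      \frac{\bigl(y_i - f_\theta(x_i)\bigr)^2}{2\,\sigma_\theta^2(x_i)}
      + \frac{1}{2} \log \sigma_\theta^2(x_i)
    \right]
  + \frac{n}{2} \log(2\pi)
```

With constant $\sigma^2$ the second term does not depend on $\theta$, so only the sum of squares matters. The heteroscedastic version is the neural-network analogue of weighted least squares.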

Python Implementation

Here's a complete implementation of OLS with Gauss-Markov verification:

OLS Estimator with BLUE Verification
gauss_markov.py

  • Import NumPy: NumPy provides the linear algebra operations for OLS estimation and variance calculations.
  • Class definition: We implement an OLS estimator that computes the quantities needed to verify Gauss-Markov properties.
  • Constructor: Initialize placeholders for coefficients, standard errors, and variance estimates.
  • Fit method: The main fitting method, which computes OLS estimates using the normal equations: β̂ = (X'X)⁻¹X'y.
  • Add intercept column: Prepend a column of ones to X for the intercept term. This creates the design matrix, so X becomes [[1, x₁], [1, x₂], ...].
  • Store dimensions: n is the sample size and p is the number of parameters (including the intercept). Degrees of freedom = n - p.
  • Normal equations: Compute X'X and X'y, then solve the normal equations. np.linalg.solve avoids forming an explicit inverse, which is more accurate than computing (X'X)⁻¹ and multiplying. When X'X is ill-conditioned, an SVD-based solver such as np.linalg.lstsq applied to X directly is more robust.
  • Compute residuals: Residuals ε̂ = y - Xβ̂ are used to estimate the error variance σ².
  • Estimate error variance: σ̂² = SSE/(n-p) is an unbiased estimator of the error variance. This is crucial for standard errors.
  • Variance-covariance matrix: Var(β̂) = σ²(X'X)⁻¹, estimated by plugging in σ̂². This matrix contains the variance of each coefficient on the diagonal.
  • Standard errors: Extract the diagonal elements and take the square root, SE(β̂ⱼ) = sqrt(Var(β̂)ⱼⱼ), to get the standard error of each coefficient.
  • Compare estimators: This method demonstrates BLUE by comparing the OLS variance to that of an alternative linear unbiased estimator.
  • Monte Carlo simulation: Run many simulations with the same X but different error realizations to estimate variance empirically.
  • Compare variances: Calculate the empirical variance of each estimator. OLS should have smaller variance than the alternative.
```python
import numpy as np


# Complete OLS Implementation with Gauss-Markov Verification
class OLSEstimator:
    def __init__(self):
        self.coefficients = None
        self.std_errors = None
        self.sigma_squared = None
        self.var_cov_matrix = None
        self.dof = None

    def fit(self, X, y):
        # Add intercept column
        X_design = np.column_stack([np.ones(len(X)), X])

        # Store dimensions
        n, p = X_design.shape
        self.dof = n - p  # Degrees of freedom

        # Solve normal equations: β̂ = (X'X)⁻¹X'y
        XtX = X_design.T @ X_design
        Xty = X_design.T @ y
        self.coefficients = np.linalg.solve(XtX, Xty)

        # Compute residuals
        y_hat = X_design @ self.coefficients
        residuals = y - y_hat

        # Estimate error variance: σ̂² = SSE/(n-p)
        SSE = np.sum(residuals**2)
        self.sigma_squared = SSE / self.dof

        # Estimated variance-covariance matrix: σ̂²(X'X)⁻¹
        XtX_inv = np.linalg.inv(XtX)
        self.var_cov_matrix = self.sigma_squared * XtX_inv

        # Standard errors
        self.std_errors = np.sqrt(np.diag(self.var_cov_matrix))

        return self

    def compare_to_alternative(self, X, n_simulations=1000):
        """Compare OLS slope variance to the pairwise-slope average (single regressor only)."""
        x = np.asarray(X, dtype=float).ravel()
        n = len(x)
        X_design = np.column_stack([np.ones(n), x])

        # The fitted line plays the role of the "true" regression function
        y_mean = X_design @ self.coefficients

        # All index pairs (i, j) with i < j and distinct x values
        i, j = np.triu_indices(n, k=1)
        keep = x[j] != x[i]
        i, j = i[keep], j[keep]
        dx = x[j] - x[i]

        ols_slopes = []
        pairwise_slopes = []

        # Monte Carlo simulation: same X, fresh errors each round
        for _ in range(n_simulations):
            # Generate new errors with the estimated variance
            errors = np.random.normal(0, np.sqrt(self.sigma_squared), n)
            y_sim = y_mean + errors

            # OLS estimate
            beta_ols = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_sim)
            ols_slopes.append(beta_ols[1])

            # Alternative: average of all pairwise slopes (linear in y, unbiased)
            pairwise_slopes.append(np.mean((y_sim[j] - y_sim[i]) / dx))

        # Compare variances
        ols_var = np.var(ols_slopes)
        pairwise_var = np.var(pairwise_slopes)
        return {
            'OLS_variance': ols_var,
            'Pairwise_variance': pairwise_var,
            'Efficiency_ratio': pairwise_var / ols_var,
        }


# Example usage
np.random.seed(42)
n = 100
X = np.random.uniform(0, 10, n)
true_beta = np.array([2.0, 1.5])  # [intercept, slope]
y = true_beta[0] + true_beta[1] * X + np.random.normal(0, 2, n)

ols = OLSEstimator().fit(X, y)
print(f"OLS Coefficients: {ols.coefficients}")
print(f"Standard Errors: {ols.std_errors}")

# Verify BLUE property
comparison = ols.compare_to_alternative(X)
print("\nBLUE Verification:")
print(f"OLS Variance: {comparison['OLS_variance']:.6f}")
print(f"Alternative Variance: {comparison['Pairwise_variance']:.6f}")
print(f"Efficiency Ratio: {comparison['Efficiency_ratio']:.2f}x")
```

The implementation includes:

  • Standard OLS estimation using the normal equations
  • Variance-covariance matrix computation for inference
  • Monte Carlo simulation to empirically verify the BLUE property

Knowledge Check

Test your understanding of the Gauss-Markov theorem with these questions:

[Interactive quiz: 8 questions. Question 1: What does BLUE stand for in the context of the Gauss-Markov theorem?]

Summary

The Gauss-Markov theorem is a cornerstone of statistical theory that establishes the optimality of OLS under specific conditions:

Key Takeaways:
  1. Under the classical assumptions (linearity, strict exogeneity, spherical errors, full rank), OLS is BLUE—the Best Linear Unbiased Estimator.
  2. "Best" means minimum variance among all linear unbiased estimators, not among all possible estimators.
  3. When assumptions fail: heteroscedasticity/autocorrelation cause inefficiency; endogeneity causes bias and inconsistency.
  4. Modern ML deliberately trades bias for variance reduction through regularization—a sophisticated violation of unbiasedness.
  5. Understanding these conditions helps practitioners choose between OLS, GLS, IV, or nonlinear methods.
Looking Ahead: In the next section, we'll explore regression diagnostics—how to detect when Gauss-Markov assumptions are violated using residual plots, statistical tests, and other diagnostic tools.