Chapter 23
30 min read
Section 143 of 175

OLS Theory

Linear Regression

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • • Explain why minimizing squared errors leads to the normal equations
  • • Derive the OLS estimator using calculus and linear algebra
  • • Interpret OLS geometrically as projection onto a subspace
  • • State and understand the key properties of OLS estimators

🔧 Practical Skills

  • • Implement OLS from scratch using matrix operations
  • • Compute and interpret R², SSE, and SST
  • • Diagnose when OLS might fail (multicollinearity)
  • • Connect OLS to gradient descent optimization

🧠 Deep Learning Connections

  • Neural Network Training — The loss function in OLS (MSE) is the same loss used in most regression neural networks
  • Gradient Descent — Understanding the closed-form OLS solution helps appreciate when and why iterative optimization is needed
  • Linear Layers — Every fully-connected layer in a neural network is essentially performing linear regression
  • Feature Engineering — OLS theory explains why feature scaling and normalization matter for optimization
Where You'll Apply This: Training linear models, understanding neural network loss landscapes, feature selection, A/B testing analysis, causal inference, time series forecasting, and as the foundation for generalized linear models.

The Big Picture

Ordinary Least Squares (OLS) is arguably the most important method in all of statistics. It provides an exact, closed-form solution to the fundamental problem of fitting a linear model to data. Unlike iterative methods like gradient descent, OLS gives us the optimal answer directly through a single matrix computation.

The Central Question

Given data points (x1,y1),,(xn,yn)(x_1, y_1), \ldots, (x_n, y_n), what is the best straight line y^=β0+β1x\hat{y} = \beta_0 + \beta_1 x?

🎯

Best = Minimum Error
We want to minimize prediction errors

📐

Squared Errors
Squaring makes errors positive and tractable

Analytical Solution
One formula gives the exact answer

Historical Context

The method of least squares has a rich history, with roots in astronomy and geodesy where precise predictions were literally matters of life and death.

📜
Legendre (1805) & Gauss (1809)

Adrien-Marie Legendre first published the method, but Carl Friedrich Gauss claimed he had been using it since 1795. Gauss applied it to predict the orbit of Ceres, allowing astronomers to rediscover the asteroid after it had been lost behind the sun.

🔬
The Gauss-Markov Theorem (1900s)

Building on Gauss's work, Andrey Markov formalized the conditions under which OLS is the Best Linear Unbiased Estimator (BLUE). This theorem provides the theoretical justification for OLS's ubiquity in statistics.

💻
Modern Era

Today, OLS is implemented in every statistical package and programming language. In machine learning, the MSE loss function used in neural networks is a direct descendant of OLS. Even when we use gradient descent, we're often solving the same problem OLS solves analytically.


The OLS Problem

Let's formalize what we mean by "best fit." We observe n data pairs and want to find the line that best predicts y from x.

The Linear Model

yi=β0+β1xi+εi,i=1,,ny_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n

where εi\varepsilon_i are the random errors (residuals)

SymbolNameMeaning
yᵢResponseThe value we want to predict
xᵢPredictorThe input feature(s)
β₀InterceptValue of y when x = 0
β₁SlopeChange in y per unit change in x
εᵢErrorRandom deviation from the line

The Sum of Squared Errors

How do we define "best"? OLS minimizes the Sum of Squared Errors (SSE), also called the Residual Sum of Squares (RSS):

The OLS Objective Function

SSE(β0,β1)=i=1n(yiy^i)2=i=1n(yiβ0β1xi)2\text{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

Find β^0,β^1\hat{\beta}_0, \hat{\beta}_1 that minimize this quantity

Why squared errors? There are several compelling reasons:

  1. Non-negativity: Squared values are always positive, preventing positive and negative errors from canceling out
  2. Differentiability: The squared function is smooth everywhere, making calculus-based optimization straightforward
  3. Large Error Penalty: Squaring penalizes larger errors more severely, which can be desirable (or problematic with outliers)
  4. Analytical Solution: The quadratic form leads to linear normal equations with a closed-form solution
  5. Maximum Likelihood: Under Gaussian errors, OLS equals MLE (covered later)

Interactive: Loss Landscape

The SSE forms a convex paraboloid (bowl shape) over the parameter space. This is crucial — convexity guarantees a unique global minimum with no local minima to trap optimization algorithms.

OLS Loss Surface (SSE)

The Sum of Squared Errors forms a convex quadratic bowl. The OLS solution sits at the unique global minimum. Drag to rotate the view.

Current SSE:57.47
Minimum SSE:0.24
Gap:57.23
CurrentOptimal
Key Insight: The convexity of SSE is why linear regression always has a unique solution (assuming XᵀX is invertible). This is in stark contrast to neural networks, where the loss landscape is highly non-convex with many local minima.

The Normal Equations

To find the minimum of SSE, we take partial derivatives with respect to each parameter and set them to zero. This gives us the normal equations.

Derivation via Calculus

Starting with the objective function:

L(β0,β1)=i=1n(yiβ0β1xi)2L(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

Taking partial derivatives:

With respect to β₀:
Lβ0=2i=1n(yiβ0β1xi)=0\frac{\partial L}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
With respect to β₁:
Lβ1=2i=1nxi(yiβ0β1xi)=0\frac{\partial L}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i) = 0

Simplifying these conditions gives the normal equations:

The Normal Equations (Scalar Form)

nβ^0+β^1xi=yin\hat{\beta}_0 + \hat{\beta}_1 \sum x_i = \sum y_i

β^0xi+β^1xi2=xiyi\hat{\beta}_0 \sum x_i + \hat{\beta}_1 \sum x_i^2 = \sum x_i y_i

Solving this system of two equations in two unknowns yields:

OLS Estimators (Simple Linear Regression)

β^1=i=1n(xixˉ)(yiyˉ)i=1n(xixˉ)2=SxySxx\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}
β^0=yˉβ^1xˉ\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

where xˉ,yˉ\bar{x}, \bar{y} are the sample means

Interpretation: The slope β^1\hat{\beta}_1 is the ratio of the sample covariance of x and y to the sample variance of x. It measures how much y changes when x changes, scaled by the variability in x.

Interactive: Normal Equations Step-by-Step

Walk through the normal equations calculation step by step. This demonstrates exactly how the matrix algebra produces the OLS estimates.

Normal Equations Walkthrough

Step 1: Set Up the Model

We model the relationship as: y = β₀ + β₁x + ε

ixᵢyᵢ
112.1
223
334.2
444.9
556.1

n = 5 observations

1 / 7

Matrix Formulation

The scalar derivation is fine for simple regression, but for multiple predictors we need matrix notation. This is also how OLS is implemented computationally.

The Matrix Model

y=Xβ+ε\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}
yn×1\mathbf{y}_{n \times 1}

Response vector

Xn×pX_{n \times p}

Design matrix

βp×1\boldsymbol{\beta}_{p \times 1}

Parameter vector

The SSE in matrix form becomes:

SSE(β)=(yXβ)T(yXβ)\text{SSE}(\boldsymbol{\beta}) = (\mathbf{y} - X\boldsymbol{\beta})^T(\mathbf{y} - X\boldsymbol{\beta})

Taking the gradient with respect to β\boldsymbol{\beta} and setting it to zero:

βSSE=2XT(yXβ)=0\nabla_{\boldsymbol{\beta}} \text{SSE} = -2X^T(\mathbf{y} - X\boldsymbol{\beta}) = \mathbf{0}

This leads to the matrix normal equations:

Normal Equations (Matrix Form)

XTXβ^=XTyX^T X \hat{\boldsymbol{\beta}} = X^T \mathbf{y}

Solving for β^\hat{\boldsymbol{\beta}}:

The OLS Estimator

β^=(XTX)1XTy\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}

This requires XTXX^T X to be invertible (full column rank)

When Does This Fail? The matrix XTXX^T X is not invertible when:
  • There are perfectly correlated predictors (multicollinearity)
  • There are more predictors than observations (p > n)
  • A column of X is constant or all zeros
In these cases, we need regularization (Ridge/Lasso) or dimensionality reduction.

Geometric Interpretation

One of the most beautiful aspects of OLS is its geometric interpretation. Understanding this geometry provides deep insight into what regression is really doing.

OLS as Orthogonal Projection

Think of the response vector y\mathbf{y} as a point in n-dimensional space. The columns of the design matrix X span a subspace (thecolumn space of X). The OLS solution finds the point in this subspace closest to y\mathbf{y}.

The Projection Interpretation

y^=Xβ^=X(XTX)1XTy=Hy\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} = X(X^T X)^{-1} X^T \mathbf{y} = H\mathbf{y}

where H=X(XTX)1XTH = X(X^T X)^{-1} X^T is the hat matrix (projection matrix)

Key geometric facts:

  1. Fitted values y^\hat{\mathbf{y}} lie in the column space of X
  2. Residuals e=yy^\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} are orthogonal to the column space
  3. The residual vector is the shortest path from y\mathbf{y} to the column space
  4. XTe=0X^T \mathbf{e} = 0 — residuals are uncorrelated with predictors

The Pythagorean Theorem of Regression

y2=y^2+e2\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}\|^2 + \|\mathbf{e}\|^2

In sum-of-squares terms: SST = SSR + SSE

Total variation = Explained variation + Unexplained variation

Interactive: OLS Geometry

Adjust the regression line manually and see how residuals change. Watch as SSE decreases when you approach the OLS solution.

Interactive OLS Geometry

Adjust the regression line and watch how residuals change. The OLS solution minimizes the sum of squared residuals (SSE).

xyCurrent SSE: 51.15OLS SSE: 0.24Gap: 50.91Current Line

OLS Estimates:

β̂₀ = 1.0714

β̂₁ = 0.9869


Properties of OLS Estimators

Under appropriate conditions, OLS estimators have several desirable statistical properties. These properties are what make OLS so widely used.

Unbiasedness

The OLS estimator is unbiased — on average, it equals the true parameter:

E[β^]=βE[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}

Proof sketch:

β^=(XTX)1XTy\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}

=(XTX)1XT(Xβ+ε)= (X^T X)^{-1} X^T (X\boldsymbol{\beta} + \boldsymbol{\varepsilon})

=β+(XTX)1XTε= \boldsymbol{\beta} + (X^T X)^{-1} X^T \boldsymbol{\varepsilon}

Taking expectations: E[β^]=β+(XTX)1XTE[ε]=βE[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta} + (X^T X)^{-1} X^T E[\boldsymbol{\varepsilon}] = \boldsymbol{\beta}

(since E[ε]=0E[\boldsymbol{\varepsilon}] = 0 by assumption)

Variance of the Estimators

The variance-covariance matrix of the OLS estimator reveals how precisely we can estimate each coefficient:

Variance of OLS Estimator

Var(β^)=σ2(XTX)1\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^T X)^{-1}

where σ2\sigma^2 is the error variance

Important implications:

  • More data (larger n): Reduces variance (better precision)
  • More spread in X: Reduces variance (easier to estimate slopes)
  • Multicollinearity: Inflates variance (poor precision)
  • Higher error variance σ²: Increases coefficient variance
The Gauss-Markov Theorem (covered in the next section) proves that among all linear unbiased estimators, OLS has the smallest variance. This makes it BLUE: Best Linear Unbiased Estimator.

Connection to Maximum Likelihood

One of the most remarkable facts about OLS is its equivalence to Maximum Likelihood Estimation under Gaussian errors. This provides a deep probabilistic justification for minimizing squared errors.

The Gaussian Error Assumption

εiN(0,σ2)i.i.d.\varepsilon_i \sim \mathcal{N}(0, \sigma^2) \quad \text{i.i.d.}
yixiN(β0+β1xi,σ2)\Rightarrow \quad y_i | x_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2)

Under this assumption, the likelihood function is:

L(β,σ2)=i=1n12πσ2exp((yiβ0β1xi)22σ2)L(\boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right)

Taking the log-likelihood and maximizing with respect to β:

(β)=n2log(2πσ2)12σ2i=1n(yiβ0β1xi)2\ell(\boldsymbol{\beta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2
Key Result: Maximizing the log-likelihood with respect to β is equivalent to minimizing the SSE! The MLE for β equals the OLS estimator. This is why mean squared error is so natural — it's the optimal loss function under Gaussian errors.

AI/ML Applications

OLS theory is not just historical — it's deeply relevant to modern machine learning. Understanding OLS helps you understand neural networks, optimization, and model diagnostics.


Python Implementation

Let's implement OLS from scratch to solidify understanding, then compare with scikit-learn's implementation.

OLS Implementation from Scratch
🐍ols_from_scratch.py
1Import NumPy

NumPy provides efficient linear algebra operations essential for OLS: matrix multiplication, inversion, and solving linear systems.

5Add Intercept Column

The design matrix X needs a column of ones to estimate the intercept β₀. Without this, the regression line is forced through the origin.

EXAMPLE
X_b = [[1, x₁], [1, x₂], ...]
8Compute XᵀX

This p×p matrix is central to OLS. It encodes the relationships between predictors. Its inverse determines coefficient precision.

9Compute Xᵀy

The cross-product of the design matrix and response. Combined with XᵀX, it captures how each predictor relates to the response.

12Solve Normal Equations

np.linalg.solve is numerically stable - better than explicitly computing the inverse. It solves XᵀXβ = Xᵀy for β.

EXAMPLE
More stable than np.linalg.inv(XtX) @ Xty
22Compute Fitted Values

ŷ = Xβ̂ gives the predicted values for each observation. These lie on the fitted regression surface.

23Compute Residuals

Residuals e = y - ŷ measure the difference between observed and fitted values. They're orthogonal to the column space of X.

26Sum of Squared Errors

SSE = Σeᵢ² measures total unexplained variation. This is what OLS minimizes. Lower SSE means better fit.

27Total Sum of Squares

SST = Σ(yᵢ - ȳ)² measures total variation in y around its mean. It's the baseline for comparison.

28R-squared

R² = 1 - SSE/SST measures the proportion of variance explained by the model. Ranges from 0 to 1.

EXAMPLE
R² = 0.85 means 85% of variance explained
39 lines without explanation
1import numpy as np
2
3def ols_fit(X, y):
4    """Fit OLS regression from scratch."""
5    # Add intercept column (column of 1s)
6    X_b = np.column_stack([np.ones(len(X)), X])
7
8    # Compute normal equations components
9    XtX = X_b.T @ X_b
10    Xty = X_b.T @ y
11
12    # Solve the normal equations
13    # Using solve() is more stable than inv()
14    beta_hat = np.linalg.solve(XtX, Xty)
15
16    return beta_hat
17
18def ols_predict(X, beta):
19    """Make predictions with fitted OLS model."""
20    X_b = np.column_stack([np.ones(len(X)), X])
21    return X_b @ beta
22
23def ols_diagnostics(X, y, beta):
24    """Compute regression diagnostics."""
25    y_hat = ols_predict(X, beta)
26    residuals = y - y_hat
27
28    n = len(y)
29    p = len(beta)
30
31    SSE = np.sum(residuals**2)
32    SST = np.sum((y - np.mean(y))**2)
33    R_squared = 1 - SSE / SST
34
35    # Adjusted R-squared
36    R_squared_adj = 1 - (1 - R_squared) * (n - 1) / (n - p)
37
38    # Residual standard error
39    sigma_hat = np.sqrt(SSE / (n - p))
40
41    return {
42        'R_squared': R_squared,
43        'R_squared_adj': R_squared_adj,
44        'SSE': SSE,
45        'SST': SST,
46        'sigma_hat': sigma_hat,
47        'residuals': residuals,
48        'y_hat': y_hat
49    }

Now let's see a complete example with diagnostics:

🐍python
1import numpy as np
2from sklearn.linear_model import LinearRegression
3import matplotlib.pyplot as plt
4
5# ================================================
6# Generate sample data
7# ================================================
8np.random.seed(42)
9n = 100
10X = np.random.randn(n, 2)  # 2 predictors
11true_beta = np.array([2.0, 0.5, -1.5])  # intercept, beta1, beta2
12y = true_beta[0] + X @ true_beta[1:] + np.random.randn(n) * 0.5
13
14# ================================================
15# Fit with our implementation
16# ================================================
17beta_hat = ols_fit(X, y)
18print("Our OLS estimates:")
19print(f"  β₀ (intercept): {beta_hat[0]:.4f}")
20print(f"  β₁: {beta_hat[1]:.4f}")
21print(f"  β₂: {beta_hat[2]:.4f}")
22
23# Compare with sklearn
24lr = LinearRegression().fit(X, y)
25print("\nScikit-learn estimates:")
26print(f"  β₀ (intercept): {lr.intercept_:.4f}")
27print(f"  β₁: {lr.coef_[0]:.4f}")
28print(f"  β₂: {lr.coef_[1]:.4f}")
29
30# ================================================
31# Compute diagnostics
32# ================================================
33diag = ols_diagnostics(X, y, beta_hat)
34print(f"\nDiagnostics:")
35print(f"  R²: {diag['R_squared']:.4f}")
36print(f"  Adjusted R²: {diag['R_squared_adj']:.4f}")
37print(f"  SSE: {diag['SSE']:.4f}")
38print(f"  Residual SE: {diag['sigma_hat']:.4f}")
39
40# ================================================
41# Verify orthogonality of residuals
42# ================================================
43X_b = np.column_stack([np.ones(len(X)), X])
44orthogonality = X_b.T @ diag['residuals']
45print(f"\nOrthogonality check (should be ≈ 0):")
46print(f"  X^T @ e = {orthogonality}")
47
48# Output:
49# Our OLS estimates:
50#   β₀ (intercept): 2.0234
51#   β₁: 0.4891
52#   β₂: -1.5123
53#
54# Scikit-learn estimates:
55#   β₀ (intercept): 2.0234
56#   β₁: 0.4891
57#   β₂: -1.5123
58#
59# Diagnostics:
60#   R²: 0.9156
61#   Adjusted R²: 0.9139
62#   SSE: 24.7891
63#   Residual SE: 0.5054
64#
65# Orthogonality check (should be ≈ 0):
66#   X^T @ e = [1.42e-14, 2.84e-14, -1.07e-14]

Knowledge Check

Test your understanding of OLS theory with this interactive quiz.

OLS Theory Quiz

1 / 8

What does OLS stand for in regression?


Summary

Key Takeaways

  1. OLS minimizes the Sum of Squared Errors (SSE), which is a convex quadratic function with a unique global minimum.
  2. The normal equations XTXβ^=XTyX^T X \hat{\boldsymbol{\beta}} = X^T \mathbf{y} give the analytical solution for OLS.
  3. The OLS estimator β^=(XTX)1XTy\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y} requires the design matrix to have full column rank.
  4. Geometrically, OLS projects the response y onto the column space of X. Residuals are orthogonal to this subspace.
  5. OLS estimators are unbiased with variance Var(β^)=σ2(XTX)1\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^T X)^{-1}.
  6. Under Gaussian errors, OLS equals the Maximum Likelihood Estimator, explaining why MSE is the natural loss function for regression.
  7. Neural network linear layers are performing regression, and MSE loss connects deep learning to classical OLS theory.
Looking Ahead: In the next section, we'll explore the Gauss-Markov Theorem, which proves that OLS is the Best Linear Unbiased Estimator (BLUE) under certain conditions — the theoretical foundation for OLS's dominance in statistics.
Loading comments...