Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Explain why minimizing squared errors leads to the normal equations
- Derive the OLS estimator using calculus and linear algebra
- Interpret OLS geometrically as projection onto a subspace
- State and understand the key properties of OLS estimators
🔧 Practical Skills
- Implement OLS from scratch using matrix operations
- Compute and interpret R², SSE, and SST
- Diagnose when OLS might fail (multicollinearity)
- Connect OLS to gradient descent optimization
🧠 Deep Learning Connections
- Neural Network Training: The loss function in OLS (MSE) is the same loss used in most regression neural networks
- Gradient Descent: Understanding the closed-form OLS solution helps appreciate when and why iterative optimization is needed
- Linear Layers: A fully-connected layer computes the same affine map (weights times inputs plus a bias) that a linear regression model uses for prediction
- Feature Engineering: OLS theory explains why feature scaling and normalization matter for optimization
Where You'll Apply This: Training linear models, understanding neural network loss landscapes, feature selection, A/B testing analysis, causal inference, time series forecasting, and as the foundation for generalized linear models.
The Big Picture
Ordinary Least Squares (OLS) is arguably the most important method in all of statistics. It provides an exact, closed-form solution to the fundamental problem of fitting a linear model to data. Unlike iterative methods like gradient descent, OLS gives us the optimal answer directly through a single matrix computation.
The Central Question
Given data points $(x_1, y_1), \dots, (x_n, y_n)$, what is the best straight line $y = \beta_0 + \beta_1 x$?
Best = Minimum Error
We want to minimize prediction errors
Squared Errors
Squaring makes errors positive and tractable
Analytical Solution
One formula gives the exact answer
Historical Context
The method of least squares has a rich history, with roots in astronomy and geodesy, where precise predictions were essential for navigation and surveying.
Legendre (1805) & Gauss (1809)
Adrien-Marie Legendre first published the method, but Carl Friedrich Gauss claimed he had been using it since 1795. Gauss applied it to predict the orbit of Ceres, allowing astronomers to rediscover the asteroid after it had been lost behind the sun.
The Gauss-Markov Theorem (1900s)
Building on Gauss's work, Andrey Markov formalized the conditions under which OLS is the Best Linear Unbiased Estimator (BLUE). This theorem provides the theoretical justification for OLS's ubiquity in statistics.
Modern Era
Today, OLS is implemented in every statistical package and programming language. In machine learning, the MSE loss function used in neural networks is a direct descendant of OLS. Even when we use gradient descent, we're often solving the same problem OLS solves analytically.
The OLS Problem
Let's formalize what we mean by "best fit." We observe n data pairs $(x_i, y_i)$, $i = 1, \dots, n$, and want to find the line that best predicts y from x.
The Linear Model
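$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \dots, n$$
where $\varepsilon_i$ are the random errors. They are unobserved; after fitting, they are estimated by the residuals $e_i = y_i - \hat{y}_i$.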
| Symbol | Name | Meaning |
|---|---|---|
| yᵢ | Response | The value we want to predict |
| xᵢ | Predictor | The input feature(s) |
| β₀ | Intercept | Expected value of y when x = 0 |
| β₁ | Slope | Expected change in y per unit change in x |
| εᵢ | Error | Random deviation from the line |
The Sum of Squared Errors
How do we define "best"? OLS minimizes the Sum of Squared Errors (SSE), also called the Residual Sum of Squares (RSS):
The OLS Objective Function
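$$\text{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
Find $\hat{\beta}_0, \hat{\beta}_1$ that minimize this quantity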
Why squared errors? There are several compelling reasons:
- Non-negativity: Squared values are never negative, preventing positive and negative errors from canceling out
- Differentiability: The squared function is smooth everywhere, making calculus-based optimization straightforward
- Large Error Penalty: Squaring penalizes larger errors more severely, which can be desirable (or problematic with outliers)
- Analytical Solution: The quadratic form leads to linear normal equations with a closed-form solution
- Maximum Likelihood: Under Gaussian errors, OLS equals MLE (covered later)
Interactive: Loss Landscape
The SSE forms a convex paraboloid (bowl shape) over the parameter space. This is crucial: convexity rules out local minima that could trap optimization algorithms, and when the design matrix has full column rank the bowl is strictly convex, so the global minimum is unique.
Figure: OLS Loss Surface (SSE). The Sum of Squared Errors forms a convex quadratic bowl, and the OLS solution sits at the unique global minimum.
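The convexity claim can be checked numerically. The sketch below is illustrative only: it reuses the five data points from this section's walkthrough table, and the helper name `sse` is an assumption introduced here. It relies on the fact that the Hessian of the SSE with respect to the parameters is $2X^\top X$, so positive eigenvalues indicate a strictly convex bowl.

```python
import numpy as np

# Illustrative data: the five (x, y) points from the walkthrough table
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.0, 4.2, 4.9, 6.1])

def sse(b0, b1):
    """Sum of squared errors for the candidate line y = b0 + b1 * x."""
    return np.sum((y - b0 - b1 * x) ** 2)

# Design matrix with an intercept column; the Hessian of SSE is 2 * X^T X
X = np.column_stack([np.ones_like(x), x])
hessian = 2 * X.T @ X

# eigvalsh is appropriate because the Hessian is symmetric
eigenvalues = np.linalg.eigvalsh(hessian)

print("Hessian eigenvalues:", eigenvalues)  # all positive => strictly convex
print("SSE at (b0=1.0, b1=1.0):", sse(1.0, 1.0))
print("SSE at (b0=0.0, b1=2.0):", sse(0.0, 2.0))
```

Because the Hessian does not depend on the parameters, the curvature is the same everywhere on the surface, which is what makes the bowl a paraboloid.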
The Normal Equations
To find the minimum of SSE, we take partial derivatives with respect to each parameter and set them to zero. This gives us the normal equations.
Derivation via Calculus
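Starting with the objective function:
$$\text{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
Taking partial derivatives:
With respect to β₀:
$$\frac{\partial\, \text{SSE}}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$$
With respect to β₁:
$$\frac{\partial\, \text{SSE}}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0$$
Simplifying these conditions gives the normal equations:
The Normal Equations (Scalar Form)
$$n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$
Solving this system of two equations in two unknowns yields:
OLS Estimators (Simple Linear Regression)
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ are the sample means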
Interactive: Normal Equations Step-by-Step
Walk through the normal equations calculation step by step. This demonstrates exactly how the matrix algebra produces the OLS estimates.
Normal Equations Walkthrough
Step 1: Set Up the Model
We model the relationship as: y = β₀ + β₁x + ε
| i | xᵢ | yᵢ |
|---|---|---|
| 1 | 1 | 2.1 |
| 2 | 2 | 3 |
| 3 | 3 | 4.2 |
| 4 | 4 | 4.9 |
| 5 | 5 | 6.1 |
n = 5 observations
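As a quick check on the table above, the closed-form simple-regression estimators (slope = sum of cross-deviations divided by sum of squared x-deviations; intercept = mean of y minus slope times mean of x) can be applied directly to these five observations. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import numpy as np

# The five observations from the table
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.0, 4.2, 4.9, 6.1])

x_bar, y_bar = x.mean(), y.mean()

# Sums of cross-deviations and squared deviations
s_xy = np.sum((x - x_bar) * (y - y_bar))
s_xx = np.sum((x - x_bar) ** 2)

beta1_hat = s_xy / s_xx                # slope
beta0_hat = y_bar - beta1_hat * x_bar  # intercept

print(f"slope = {beta1_hat:.4f}, intercept = {beta0_hat:.4f}")
# slope = 0.9900, intercept = 1.0900
```

For this data the sample means are 3 and 4.06, the cross-deviation sum is 9.9, and the squared-deviation sum is 10, which gives the slope 0.99 and intercept 1.09.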
Matrix Formulation
The scalar derivation is fine for simple regression, but for multiple predictors we need matrix notation. This is also how OLS is implemented computationally.
The Matrix Model
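$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
Response vector: $\mathbf{y} \in \mathbb{R}^{n}$
Design matrix: $X \in \mathbb{R}^{n \times (p+1)}$, whose first column is all ones (for the intercept)
Parameter vector: $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)^\top \in \mathbb{R}^{p+1}$
The SSE in matrix form becomes:
$$\text{SSE}(\boldsymbol{\beta}) = (\mathbf{y} - X\boldsymbol{\beta})^\top (\mathbf{y} - X\boldsymbol{\beta}) = \|\mathbf{y} - X\boldsymbol{\beta}\|^2$$
Taking the gradient with respect to $\boldsymbol{\beta}$ and setting it to zero:
$$\nabla_{\boldsymbol{\beta}}\, \text{SSE} = -2 X^\top (\mathbf{y} - X\boldsymbol{\beta}) = \mathbf{0}$$
This leads to the matrix normal equations:
Normal Equations (Matrix Form)
$$X^\top X \hat{\boldsymbol{\beta}} = X^\top \mathbf{y}$$
Solving for $\hat{\boldsymbol{\beta}}$:
The OLS Estimator
$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
This requires $X^\top X$ to be invertible, which holds exactly when X has full column rank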
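The matrix $X^\top X$ is not invertible, and the OLS estimate is not unique, when:
- There are perfectly correlated predictors (multicollinearity)
- There are more parameters than observations (p + 1 > n when an intercept is included)
- A column of X is all zeros, or is constant while the model already includes an intercept column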
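The matrix form and its failure mode can be sketched with NumPy. The data below is simulated purely for illustration (the coefficients 1.0, 2.0, -0.5 are assumptions of this example). `np.linalg.solve` handles the normal equations for a full-rank design, `np.linalg.matrix_rank` flags a rank-deficient one, and `np.linalg.lstsq` still returns a minimum-norm solution in that case.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated response with assumed true coefficients (1.0, 2.0, -0.5)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Full-rank design: intercept, x1, x2
X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves X^T X beta = X^T y
rank_ok = np.linalg.matrix_rank(X)

# Rank-deficient design: the extra column is an exact multiple of x1
X_bad = np.column_stack([np.ones(n), x1, x2, 3.0 * x1])
rank_bad = np.linalg.matrix_rank(X_bad)

# lstsq uses an SVD and returns the minimum-norm least-squares solution
beta_min_norm, _, rank_lstsq, _ = np.linalg.lstsq(X_bad, y, rcond=None)

print("full-rank estimate:", beta_hat)
print("rank of X:", rank_ok, "| columns:", X.shape[1])
print("rank of X_bad:", rank_bad, "| columns:", X_bad.shape[1])
```

In the rank-deficient case the individual coefficients are not identified, but the fitted values are: both designs span the same column space, so `X_bad @ beta_min_norm` matches `X @ beta_hat`.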
Geometric Interpretation
One of the most beautiful aspects of OLS is its geometric interpretation. Understanding this geometry provides deep insight into what regression is really doing.
OLS as Orthogonal Projection
Think of the response vector $\mathbf{y}$ as a point in n-dimensional space. The columns of the design matrix X span a subspace (the column space of X). The OLS solution finds the point $\hat{\mathbf{y}}$ in this subspace closest to $\mathbf{y}$.
The Projection Interpretation
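$$\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} = X (X^\top X)^{-1} X^\top \mathbf{y} = H\mathbf{y}$$
where $H = X (X^\top X)^{-1} X^\top$ is the hat matrix (projection matrix)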
Key geometric facts:
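- Fitted values $\hat{\mathbf{y}} = H\mathbf{y}$ lie in the column space of X
- Residuals $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ are orthogonal to the column space
- The residual vector is the shortest path from $\mathbf{y}$ to the column space
- $X^\top \mathbf{e} = \mathbf{0}$: residuals are orthogonal to every column of X, so they are uncorrelated with the predictors when an intercept is included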
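The geometric facts above can be verified numerically. The following is a small sketch on random data (the data and seed are arbitrary choices for illustration): it builds the hat matrix $H = X(X^\top X)^{-1}X^\top$ and checks that it is symmetric and idempotent, as a projection must be, and that the residuals are orthogonal to the columns of X.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T, computed with solve rather than an explicit inverse
H = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = H @ y   # fitted values: projection of y onto col(X)
e = y - y_hat   # residuals: component of y orthogonal to col(X)

print("H symmetric:    ", np.allclose(H, H.T))
print("H idempotent:   ", np.allclose(H @ H, H))
print("X^T e = 0:      ", np.allclose(X.T @ e, 0))
print("trace(H) = cols:", np.isclose(np.trace(H), X.shape[1]))
```

The trace of H equals the number of columns of X (here 3), which is the dimension of the subspace being projected onto.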
The Pythagorean Theorem of Regression
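$$\|\mathbf{y} - \bar{y}\mathbf{1}\|^2 = \|\hat{\mathbf{y}} - \bar{y}\mathbf{1}\|^2 + \|\mathbf{y} - \hat{\mathbf{y}}\|^2$$
In sum-of-squares terms: SST = SSR + SSE (this decomposition holds when the model includes an intercept)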
Total variation = Explained variation + Unexplained variation
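A short numerical check of the decomposition, on simulated data chosen only for illustration, might look like the following. It computes SST, SSR, and SSE for a model with an intercept and derives R² from them.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.8 * x + rng.normal(scale=1.0, size=n)  # assumed data-generating line

X = np.column_stack([np.ones(n), x])  # intercept is required for SST = SSR + SSE
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation

r_squared = 1 - sse / sst

print(f"SST = {sst:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
print(f"R² = {r_squared:.4f}")
```

R² is the explained share of the total variation, so `1 - sse / sst` and `ssr / sst` agree whenever the decomposition holds.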
Interactive: OLS Geometry
Adjust the regression line manually and see how residuals change. Watch as SSE decreases when you approach the OLS solution.
OLS Estimates:
β̂₀ = 1.0714
β̂₁ = 0.9869
Properties of OLS Estimators
Under appropriate conditions, OLS estimators have several desirable statistical properties. These properties are what make OLS so widely used.
Unbiasedness
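The OLS estimator is unbiased: on average, it equals the true parameter:
$$E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$$
Proof sketch:
$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y} = (X^\top X)^{-1} X^\top (X\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \boldsymbol{\beta} + (X^\top X)^{-1} X^\top \boldsymbol{\varepsilon}$$
Taking expectations:
$$E[\hat{\boldsymbol{\beta}} \mid X] = \boldsymbol{\beta} + (X^\top X)^{-1} X^\top E[\boldsymbol{\varepsilon} \mid X] = \boldsymbol{\beta}$$
(since $E[\boldsymbol{\varepsilon} \mid X] = \mathbf{0}$ by assumption)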
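Unbiasedness is a statement about repeated sampling, so a Monte Carlo simulation is a natural way to illustrate it. The sketch below holds an arbitrary design matrix fixed, redraws zero-mean errors many times, and averages the resulting estimates. The true coefficients, sample size, and number of simulations are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sims = 50, 2000
true_beta = np.array([2.0, 0.5, -1.5])  # assumed true parameters

# Fixed design: intercept plus two predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

estimates = np.empty((n_sims, 3))
for s in range(n_sims):
    eps = rng.normal(scale=1.0, size=n)  # fresh zero-mean errors each replication
    y = X @ true_beta + eps
    estimates[s] = np.linalg.solve(X.T @ X, X.T @ y)

mean_estimate = estimates.mean(axis=0)

print("true beta:        ", true_beta)
print("mean of estimates:", mean_estimate)
```

Any single estimate misses the true value, but the average over replications should settle close to it, which is what unbiasedness means.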
Variance of the Estimators
The variance-covariance matrix of the OLS estimator reveals how precisely we can estimate each coefficient:
Variance of OLS Estimator
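$$\text{Var}(\hat{\boldsymbol{\beta}} \mid X) = \sigma^2 (X^\top X)^{-1}$$
where $\sigma^2 = \text{Var}(\varepsilon_i)$ is the error variance, assumed constant across observations with uncorrelated errors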
Important implications:
- More data (larger n): Reduces variance (better precision)
- More spread in X: Reduces variance (easier to estimate slopes)
- Multicollinearity: Inflates variance (poor precision)
- Higher error variance σ²: Increases coefficient variance
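The multicollinearity effect can be seen directly from the formula $\sigma^2 (X^\top X)^{-1}$. The sketch below, on simulated predictors, compares the coefficient variances for an independent second predictor against a nearly collinear one. The helper name `coef_variances` and the noise level 0.05 are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2 = 200, 1.0

x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                  # unrelated to x1
x2_collinear = x1 + 0.05 * rng.normal(size=n)  # almost a copy of x1

def coef_variances(x1, x2, sigma2):
    """Diagonal of sigma^2 (X^T X)^{-1} for the design [1, x1, x2]."""
    X = np.column_stack([np.ones(len(x1)), x1, x2])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return np.diag(cov)

var_indep = coef_variances(x1, x2_indep, sigma2)
var_collinear = coef_variances(x1, x2_collinear, sigma2)

print("variances (independent predictors):", var_indep)
print("variances (near-collinear):        ", var_collinear)
print("inflation of Var(beta_1):", var_collinear[1] / var_indep[1])
```

The error variance and sample size are identical in both cases; only the correlation between predictors changes, and that alone inflates the slope variances.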
Connection to Maximum Likelihood
One of the most remarkable facts about OLS is its equivalence to Maximum Likelihood Estimation under Gaussian errors. This provides a deep probabilistic justification for minimizing squared errors.
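The Gaussian Error Assumption
$$\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)$$
Under this assumption, the likelihood function is:
$$L(\boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2}{2\sigma^2} \right)$$
Taking the log-likelihood and maximizing with respect to β:
$$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2$$
Only the last term depends on β, and it is a negative multiple of the SSE. Maximizing the log-likelihood over β is therefore the same as minimizing the SSE, so $\hat{\boldsymbol{\beta}}_{\text{MLE}} = \hat{\boldsymbol{\beta}}_{\text{OLS}}$.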
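The equivalence can be illustrated numerically without an optimizer. The sketch below evaluates the Gaussian negative log-likelihood, $\frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\text{SSE}$, at the OLS estimate and at many randomly perturbed coefficient vectors, with σ² held fixed. The simulated data and the function name `neg_log_likelihood` are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 0.5, -1.5]) + rng.normal(scale=0.5, size=n)

def neg_log_likelihood(beta, sigma2):
    """Gaussian negative log-likelihood of the linear model at (beta, sigma2)."""
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
sigma2 = 0.25  # held fixed; the minimizer over beta does not depend on it

nll_at_ols = neg_log_likelihood(beta_ols, sigma2)

# Evaluate the NLL at 1000 random perturbations of the OLS estimate
perturbed = [
    neg_log_likelihood(beta_ols + rng.normal(scale=0.1, size=3), sigma2)
    for _ in range(1000)
]
ols_is_best = all(nll_at_ols <= value for value in perturbed)

print("NLL at OLS estimate:", nll_at_ols)
print("smallest perturbed NLL:", min(perturbed))
print("OLS attains the lowest NLL:", ols_is_best)
```

This is a spot check rather than a proof, but it matches the algebra: every perturbation increases the SSE and therefore the negative log-likelihood.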
AI/ML Applications
OLS theory is not just historical — it's deeply relevant to modern machine learning. Understanding OLS helps you understand neural networks, optimization, and model diagnostics.
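One way to see the link to optimization is to minimize the MSE with plain gradient descent and compare the result with the closed-form solution. The following is a minimal sketch on simulated data; the learning rate and iteration count are hand-picked assumptions that happen to work for standardized predictors like these, not general recommendations.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 0.5, -1.5]) + rng.normal(scale=0.5, size=n)

# Closed-form OLS solution via the normal equations
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on MSE(beta) = (1/n) * ||y - X beta||^2
beta_gd = np.zeros(3)
learning_rate = 0.1
for _ in range(1000):
    grad = (2.0 / n) * X.T @ (X @ beta_gd - y)  # gradient of the MSE
    beta_gd -= learning_rate * grad

print("closed form:     ", beta_ols)
print("gradient descent:", beta_gd)
print("max abs difference:", np.max(np.abs(beta_gd - beta_ols)))
```

Both approaches target the same convex objective, so they agree up to numerical tolerance. Gradient descent becomes the practical choice when forming and solving the normal equations is too expensive or when the model is no longer linear in its parameters.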
Python Implementation
Let's implement OLS from scratch to solidify understanding, then compare with scikit-learn's implementation.
Now let's see a complete example with diagnostics:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# ================================================
# Helper functions
# NOTE: the example calls ols_fit and ols_diagnostics without defining them.
# The versions here are an assumed minimal sketch, written to match
# how they are called and which dictionary keys are read below.
# ================================================
def ols_fit(X, y):
    """Return OLS coefficients [intercept, slopes...] via the normal equations."""
    X_b = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    # Solve X^T X beta = X^T y directly instead of forming an explicit inverse
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

def ols_diagnostics(X, y, beta_hat):
    """Return residuals, SSE, R², adjusted R², and the residual standard error."""
    X_b = np.column_stack([np.ones(len(X)), X])
    n, k = X_b.shape  # k counts the intercept
    residuals = y - X_b @ beta_hat
    sse = residuals @ residuals
    sst = np.sum((y - y.mean()) ** 2)
    r_squared = 1 - sse / sst
    return {
        "residuals": residuals,
        "SSE": sse,
        "R_squared": r_squared,
        "R_squared_adj": 1 - (1 - r_squared) * (n - 1) / (n - k),
        "sigma_hat": np.sqrt(sse / (n - k)),
    }

# ================================================
# Generate sample data
# ================================================
np.random.seed(42)
n = 100
X = np.random.randn(n, 2)  # 2 predictors
true_beta = np.array([2.0, 0.5, -1.5])  # intercept, beta1, beta2
y = true_beta[0] + X @ true_beta[1:] + np.random.randn(n) * 0.5

# ================================================
# Fit with our implementation
# ================================================
beta_hat = ols_fit(X, y)
print("Our OLS estimates:")
print(f" β₀ (intercept): {beta_hat[0]:.4f}")
print(f" β₁: {beta_hat[1]:.4f}")
print(f" β₂: {beta_hat[2]:.4f}")

# Compare with sklearn
lr = LinearRegression().fit(X, y)
print("\nScikit-learn estimates:")
print(f" β₀ (intercept): {lr.intercept_:.4f}")
print(f" β₁: {lr.coef_[0]:.4f}")
print(f" β₂: {lr.coef_[1]:.4f}")

# ================================================
# Compute diagnostics
# ================================================
diag = ols_diagnostics(X, y, beta_hat)
print("\nDiagnostics:")
print(f" R²: {diag['R_squared']:.4f}")
print(f" Adjusted R²: {diag['R_squared_adj']:.4f}")
print(f" SSE: {diag['SSE']:.4f}")
print(f" Residual SE: {diag['sigma_hat']:.4f}")

# ================================================
# Verify orthogonality of residuals
# ================================================
X_b = np.column_stack([np.ones(len(X)), X])
orthogonality = X_b.T @ diag['residuals']
print("\nOrthogonality check (should be ≈ 0):")
print(f" X^T @ e = {orthogonality}")

# Example output (your exact digits may differ):
# Our OLS estimates:
#  β₀ (intercept): 2.0234
#  β₁: 0.4891
#  β₂: -1.5123
#
# Scikit-learn estimates:
#  β₀ (intercept): 2.0234
#  β₁: 0.4891
#  β₂: -1.5123
#
# Diagnostics:
#  R²: 0.9156
#  Adjusted R²: 0.9139
#  SSE: 24.7891
#  Residual SE: 0.5054
#
# Orthogonality check (should be ≈ 0):
#  X^T @ e = [1.42e-14, 2.84e-14, -1.07e-14]
```
Knowledge Check
Test your understanding of OLS theory with this interactive quiz.
OLS Theory Quiz
Question 1 of 8: What does OLS stand for in regression?
Summary
Key Takeaways
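- OLS minimizes the Sum of Squared Errors (SSE), a convex quadratic function whose global minimum is unique when the design matrix has full column rank.
- The normal equations $X^\top X \hat{\boldsymbol{\beta}} = X^\top \mathbf{y}$ give the analytical solution $\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$.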
- The OLS estimator requires the design matrix to have full column rank.
- Geometrically, OLS projects the response y onto the column space of X. Residuals are orthogonal to this subspace.
- OLS estimators are unbiased with variance $\sigma^2 (X^\top X)^{-1}$.
- Under Gaussian errors, OLS equals the Maximum Likelihood Estimator, explaining why MSE is the natural loss function for regression.
- Neural network linear layers compute the same affine map that a linear regression model uses for prediction, and MSE loss connects deep learning to classical OLS theory.
Looking Ahead: In the next section, we'll explore the Gauss-Markov Theorem, which proves that OLS is the Best Linear Unbiased Estimator (BLUE) under certain conditions — the theoretical foundation for OLS's dominance in statistics.