Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Explain why minimizing squared errors leads to the normal equations
• Derive the OLS estimator using calculus and linear algebra
• Interpret OLS geometrically as projection onto a subspace
• State and understand the key properties of OLS estimators

🔧 Practical Skills

• Implement OLS from scratch using matrix operations
• Compute and interpret R², SSE, and SST
• Diagnose when OLS might fail (multicollinearity)
• Connect OLS to gradient descent optimization

🧠 Deep Learning Connections

Neural Network Training — The loss function in OLS (MSE) is the same loss used in most regression neural networks
Gradient Descent — Understanding the closed-form OLS solution helps appreciate when and why iterative optimization is needed
Linear Layers — Every fully-connected layer in a neural network is essentially performing linear regression
Feature Engineering — OLS theory explains why feature scaling and normalization matter for optimization

Where You'll Apply This: Training linear models, understanding neural network loss landscapes, feature selection, A/B testing analysis, causal inference, time series forecasting, and as the foundation for generalized linear models.

The Big Picture

Ordinary Least Squares (OLS) is arguably the most important method in all of statistics. It provides an exact, closed-form solution to the fundamental problem of fitting a linear model to data. Unlike iterative methods like gradient descent, OLS gives us the optimal answer directly through a single matrix computation.

The Central Question

Given data points $(x_1, y_1), \ldots, (x_n, y_n)$ , what is the best straight line $\hat{y} = \beta_0 + \beta_1 x$ ?

🎯

Best = Minimum Error
We want to minimize prediction errors

📐

Squared Errors
Squaring makes errors positive and tractable

✨

Analytical Solution
One formula gives the exact answer

Historical Context

The method of least squares has a rich history, with roots in astronomy and geodesy where precise predictions were literally matters of life and death.

📜

Legendre (1805) & Gauss (1809)

Adrien-Marie Legendre first published the method, but Carl Friedrich Gauss claimed he had been using it since 1795. Gauss applied it to predict the orbit of Ceres, allowing astronomers to rediscover the asteroid after it had been lost behind the sun.

🔬

The Gauss-Markov Theorem (1900s)

Building on Gauss's work, Andrey Markov formalized the conditions under which OLS is the Best Linear Unbiased Estimator (BLUE). This theorem provides the theoretical justification for OLS's ubiquity in statistics.

💻

Modern Era

Today, OLS is implemented in every statistical package and programming language. In machine learning, the MSE loss function used in neural networks is a direct descendant of OLS. Even when we use gradient descent, we're often solving the same problem OLS solves analytically.

The OLS Problem

Let's formalize what we mean by "best fit." We observe n data pairs and want to find the line that best predicts y from x.

The Linear Model

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n

where $\varepsilon_i$ are the random errors (residuals)

Symbol	Name	Meaning
yᵢ	Response	The value we want to predict
xᵢ	Predictor	The input feature(s)
β₀	Intercept	Value of y when x = 0
β₁	Slope	Change in y per unit change in x
εᵢ	Error	Random deviation from the line

The Sum of Squared Errors

How do we define "best"? OLS minimizes the Sum of Squared Errors (SSE), also called the Residual Sum of Squares (RSS):

The OLS Objective Function

\text{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

Find $\hat{\beta}_0, \hat{\beta}_1$ that minimize this quantity

Why squared errors? There are several compelling reasons:

Non-negativity: Squared values are always positive, preventing positive and negative errors from canceling out
Differentiability: The squared function is smooth everywhere, making calculus-based optimization straightforward
Large Error Penalty: Squaring penalizes larger errors more severely, which can be desirable (or problematic with outliers)
Analytical Solution: The quadratic form leads to linear normal equations with a closed-form solution
Maximum Likelihood: Under Gaussian errors, OLS equals MLE (covered later)

Interactive: Loss Landscape

The SSE forms a convex paraboloid (bowl shape) over the parameter space. This is crucial — convexity guarantees a unique global minimum with no local minima to trap optimization algorithms.

OLS Loss Surface (SSE)

The Sum of Squared Errors forms a convex quadratic bowl. The OLS solution sits at the unique global minimum. Drag to rotate the view.

Current β₀: 2.00

Current β₁: 0.30

Show ContoursShow Gradient Path

Current SSE:57.47

Minimum SSE:0.24

Gap:57.23

CurrentOptimal

Key Insight: The convexity of SSE is why linear regression always has a unique solution (assuming XᵀX is invertible). This is in stark contrast to neural networks, where the loss landscape is highly non-convex with many local minima.

The Normal Equations

To find the minimum of SSE, we take partial derivatives with respect to each parameter and set them to zero. This gives us the normal equations.

Derivation via Calculus

Starting with the objective function:

L(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

Taking partial derivatives:

With respect to β₀:

\frac{\partial L}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0

With respect to β₁:

\frac{\partial L}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i) = 0

Simplifying these conditions gives the normal equations:

The Normal Equations (Scalar Form)

$n\hat{\beta}_0 + \hat{\beta}_1 \sum x_i = \sum y_i$

$\hat{\beta}_0 \sum x_i + \hat{\beta}_1 \sum x_i^2 = \sum x_i y_i$

Solving this system of two equations in two unknowns yields:

OLS Estimators (Simple Linear Regression)

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

where $\bar{x}, \bar{y}$ are the sample means

Interpretation: The slope

\hat{\beta}_1

is the ratio of the sample covariance of x and y to the sample variance of x. It measures how much y changes when x changes, scaled by the variability in x.

Interactive: Normal Equations Step-by-Step

Walk through the normal equations calculation step by step. This demonstrates exactly how the matrix algebra produces the OLS estimates.

Normal Equations Walkthrough

Show Details

Step 1: Set Up the Model

We model the relationship as: y = β₀ + β₁x + ε

i	xᵢ	yᵢ
1	1	2.1
2	2	3
3	3	4.2
4	4	4.9
5	5	6.1

n = 5 observations

1 / 7

Matrix Formulation

The scalar derivation is fine for simple regression, but for multiple predictors we need matrix notation. This is also how OLS is implemented computationally.

The Matrix Model

\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}

\mathbf{y}_{n \times 1}

Response vector

X_{n \times p}

Design matrix

\boldsymbol{\beta}_{p \times 1}

Parameter vector

The SSE in matrix form becomes:

\text{SSE}(\boldsymbol{\beta}) = (\mathbf{y} - X\boldsymbol{\beta})^T(\mathbf{y} - X\boldsymbol{\beta})

Taking the gradient with respect to $\boldsymbol{\beta}$ and setting it to zero:

\nabla_{\boldsymbol{\beta}} \text{SSE} = -2X^T(\mathbf{y} - X\boldsymbol{\beta}) = \mathbf{0}

This leads to the matrix normal equations:

Normal Equations (Matrix Form)

X^T X \hat{\boldsymbol{\beta}} = X^T \mathbf{y}

Solving for $\hat{\boldsymbol{\beta}}$ :

The OLS Estimator

\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}

This requires $X^T X$ to be invertible (full column rank)

When Does This Fail? The matrix

X^T X

is not invertible when:

There are perfectly correlated predictors (multicollinearity)
There are more predictors than observations (p > n)
A column of X is constant or all zeros

In these cases, we need regularization (Ridge/Lasso) or dimensionality reduction.

Geometric Interpretation

One of the most beautiful aspects of OLS is its geometric interpretation. Understanding this geometry provides deep insight into what regression is really doing.

OLS as Orthogonal Projection

Think of the response vector $\mathbf{y}$ as a point in n-dimensional space. The columns of the design matrix X span a subspace (thecolumn space of X). The OLS solution finds the point in this subspace closest to $\mathbf{y}$ .

The Projection Interpretation

$\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} = X(X^T X)^{-1} X^T \mathbf{y} = H\mathbf{y}$

where $H = X(X^T X)^{-1} X^T$ is the hat matrix (projection matrix)

Key geometric facts:

Fitted values $\hat{\mathbf{y}}$ lie in the column space of X
Residuals $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ are orthogonal to the column space
The residual vector is the shortest path from $\mathbf{y}$ to the column space
$X^T \mathbf{e} = 0$ — residuals are uncorrelated with predictors

The Pythagorean Theorem of Regression

\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}\|^2 + \|\mathbf{e}\|^2

In sum-of-squares terms: SST = SSR + SSE

Total variation = Explained variation + Unexplained variation

Interactive: OLS Geometry

Adjust the regression line manually and see how residuals change. Watch as SSE decreases when you approach the OLS solution.

Interactive OLS Geometry

Adjust the regression line and watch how residuals change. The OLS solution minimizes the sum of squared residuals (SSE).

β₀ (Intercept): 1.00

β₁ (Slope): 0.50

Show ResidualsShow OLS Reference

OLS Estimates:

β̂₀ = 1.0714

β̂₁ = 0.9869

Properties of OLS Estimators

Under appropriate conditions, OLS estimators have several desirable statistical properties. These properties are what make OLS so widely used.

Unbiasedness

The OLS estimator is unbiased — on average, it equals the true parameter:

E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}

Proof sketch:

$\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}$

$= (X^T X)^{-1} X^T (X\boldsymbol{\beta} + \boldsymbol{\varepsilon})$

$= \boldsymbol{\beta} + (X^T X)^{-1} X^T \boldsymbol{\varepsilon}$

Taking expectations: $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta} + (X^T X)^{-1} X^T E[\boldsymbol{\varepsilon}] = \boldsymbol{\beta}$

(since $E[\boldsymbol{\varepsilon}] = 0$ by assumption)

Variance of the Estimators

The variance-covariance matrix of the OLS estimator reveals how precisely we can estimate each coefficient:

Variance of OLS Estimator

\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^T X)^{-1}

where $\sigma^2$ is the error variance

Important implications:

More data (larger n): Reduces variance (better precision)
More spread in X: Reduces variance (easier to estimate slopes)
Multicollinearity: Inflates variance (poor precision)
Higher error variance σ²: Increases coefficient variance

The Gauss-Markov Theorem (covered in the next section) proves that among all linear unbiased estimators, OLS has the smallest variance. This makes it BLUE: Best Linear Unbiased Estimator.

Connection to Maximum Likelihood

One of the most remarkable facts about OLS is its equivalence to Maximum Likelihood Estimation under Gaussian errors. This provides a deep probabilistic justification for minimizing squared errors.

The Gaussian Error Assumption

\varepsilon_i \sim \mathcal{N}(0, \sigma^2) \quad \text{i.i.d.}

\Rightarrow \quad y_i | x_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i, \sigma^2)

Under this assumption, the likelihood function is:

L(\boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right)

Taking the log-likelihood and maximizing with respect to β:

\ell(\boldsymbol{\beta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2

Key Result: Maximizing the log-likelihood with respect to β is equivalent to minimizing the SSE! The MLE for β equals the OLS estimator. This is why mean squared error is so natural — it's the optimal loss function under Gaussian errors.

AI/ML Applications

OLS theory is not just historical — it's deeply relevant to modern machine learning. Understanding OLS helps you understand neural networks, optimization, and model diagnostics.

Python Implementation

Let's implement OLS from scratch to solidify understanding, then compare with scikit-learn's implementation.

OLS Implementation from Scratch

🐍ols_from_scratch.py

Explanation(10)

Code(49)

1Import NumPy

NumPy provides efficient linear algebra operations essential for OLS: matrix multiplication, inversion, and solving linear systems.

5Add Intercept Column

The design matrix X needs a column of ones to estimate the intercept β₀. Without this, the regression line is forced through the origin.

EXAMPLE

X_b = [[1, x₁], [1, x₂], ...]

8Compute XᵀX

This p×p matrix is central to OLS. It encodes the relationships between predictors. Its inverse determines coefficient precision.

9Compute Xᵀy

The cross-product of the design matrix and response. Combined with XᵀX, it captures how each predictor relates to the response.

12Solve Normal Equations

np.linalg.solve is numerically stable - better than explicitly computing the inverse. It solves XᵀXβ = Xᵀy for β.

EXAMPLE

More stable than np.linalg.inv(XtX) @ Xty

22Compute Fitted Values

ŷ = Xβ̂ gives the predicted values for each observation. These lie on the fitted regression surface.

23Compute Residuals

Residuals e = y - ŷ measure the difference between observed and fitted values. They're orthogonal to the column space of X.

26Sum of Squared Errors

SSE = Σeᵢ² measures total unexplained variation. This is what OLS minimizes. Lower SSE means better fit.

27Total Sum of Squares

SST = Σ(yᵢ - ȳ)² measures total variation in y around its mean. It's the baseline for comparison.

28R-squared

R² = 1 - SSE/SST measures the proportion of variance explained by the model. Ranges from 0 to 1.

EXAMPLE

R² = 0.85 means 85% of variance explained

39 lines without explanation

1import numpy as np
2
3def ols_fit(X, y):
4    """Fit OLS regression from scratch."""
5    # Add intercept column (column of 1s)
6    X_b = np.column_stack([np.ones(len(X)), X])
7
8    # Compute normal equations components
9    XtX = X_b.T @ X_b
10    Xty = X_b.T @ y
11
12    # Solve the normal equations
13    # Using solve() is more stable than inv()
14    beta_hat = np.linalg.solve(XtX, Xty)
15
16    return beta_hat
17
18def ols_predict(X, beta):
19    """Make predictions with fitted OLS model."""
20    X_b = np.column_stack([np.ones(len(X)), X])
21    return X_b @ beta
22
23def ols_diagnostics(X, y, beta):
24    """Compute regression diagnostics."""
25    y_hat = ols_predict(X, beta)
26    residuals = y - y_hat
27
28    n = len(y)
29    p = len(beta)
30
31    SSE = np.sum(residuals**2)
32    SST = np.sum((y - np.mean(y))**2)
33    R_squared = 1 - SSE / SST
34
35    # Adjusted R-squared
36    R_squared_adj = 1 - (1 - R_squared) * (n - 1) / (n - p)
37
38    # Residual standard error
39    sigma_hat = np.sqrt(SSE / (n - p))
40
41    return {
42        'R_squared': R_squared,
43        'R_squared_adj': R_squared_adj,
44        'SSE': SSE,
45        'SST': SST,
46        'sigma_hat': sigma_hat,
47        'residuals': residuals,
48        'y_hat': y_hat
49    }

Now let's see a complete example with diagnostics:

🐍python

1import numpy as np
2from sklearn.linear_model import LinearRegression
3import matplotlib.pyplot as plt
4
5# ================================================
6# Generate sample data
7# ================================================
8np.random.seed(42)
9n = 100
10X = np.random.randn(n, 2)  # 2 predictors
11true_beta = np.array([2.0, 0.5, -1.5])  # intercept, beta1, beta2
12y = true_beta[0] + X @ true_beta[1:] + np.random.randn(n) * 0.5
13
14# ================================================
15# Fit with our implementation
16# ================================================
17beta_hat = ols_fit(X, y)
18print("Our OLS estimates:")
19print(f"  β₀ (intercept): {beta_hat[0]:.4f}")
20print(f"  β₁: {beta_hat[1]:.4f}")
21print(f"  β₂: {beta_hat[2]:.4f}")
22
23# Compare with sklearn
24lr = LinearRegression().fit(X, y)
25print("\nScikit-learn estimates:")
26print(f"  β₀ (intercept): {lr.intercept_:.4f}")
27print(f"  β₁: {lr.coef_[0]:.4f}")
28print(f"  β₂: {lr.coef_[1]:.4f}")
29
30# ================================================
31# Compute diagnostics
32# ================================================
33diag = ols_diagnostics(X, y, beta_hat)
34print(f"\nDiagnostics:")
35print(f"  R²: {diag['R_squared']:.4f}")
36print(f"  Adjusted R²: {diag['R_squared_adj']:.4f}")
37print(f"  SSE: {diag['SSE']:.4f}")
38print(f"  Residual SE: {diag['sigma_hat']:.4f}")
39
40# ================================================
41# Verify orthogonality of residuals
42# ================================================
43X_b = np.column_stack([np.ones(len(X)), X])
44orthogonality = X_b.T @ diag['residuals']
45print(f"\nOrthogonality check (should be ≈ 0):")
46print(f"  X^T @ e = {orthogonality}")
47
48# Output:
49# Our OLS estimates:
50#   β₀ (intercept): 2.0234
51#   β₁: 0.4891
52#   β₂: -1.5123
53#
54# Scikit-learn estimates:
55#   β₀ (intercept): 2.0234
56#   β₁: 0.4891
57#   β₂: -1.5123
58#
59# Diagnostics:
60#   R²: 0.9156
61#   Adjusted R²: 0.9139
62#   SSE: 24.7891
63#   Residual SE: 0.5054
64#
65# Orthogonality check (should be ≈ 0):
66#   X^T @ e = [1.42e-14, 2.84e-14, -1.07e-14]

Knowledge Check

Test your understanding of OLS theory with this interactive quiz.

OLS Theory Quiz

1 / 8

What does OLS stand for in regression?

Summary

Key Takeaways

OLS minimizes the Sum of Squared Errors (SSE), which is a convex quadratic function with a unique global minimum.
The normal equations $X^T X \hat{\boldsymbol{\beta}} = X^T \mathbf{y}$ give the analytical solution for OLS.
The OLS estimator $\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}$ requires the design matrix to have full column rank.
Geometrically, OLS projects the response y onto the column space of X. Residuals are orthogonal to this subspace.
OLS estimators are unbiased with variance $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^T X)^{-1}$ .
Under Gaussian errors, OLS equals the Maximum Likelihood Estimator, explaining why MSE is the natural loss function for regression.
Neural network linear layers are performing regression, and MSE loss connects deep learning to classical OLS theory.

Looking Ahead: In the next section, we'll explore the Gauss-Markov Theorem, which proves that OLS is the Best Linear Unbiased Estimator (BLUE) under certain conditions — the theoretical foundation for OLS's dominance in statistics.

Learning Objectives

📚 Core Knowledge

🔧 Practical Skills

🧠 Deep Learning Connections

The Big Picture

The Central Question

Historical Context

Legendre (1805) & Gauss (1809)

The Gauss-Markov Theorem (1900s)

Modern Era

The OLS Problem

The Linear Model

The Sum of Squared Errors

The OLS Objective Function

Interactive: Loss Landscape

OLS Loss Surface (SSE)

The Normal Equations

Derivation via Calculus

With respect to β₀:

With respect to β₁:

The Normal Equations (Scalar Form)

OLS Estimators (Simple Linear Regression)

Interactive: Normal Equations Step-by-Step

Normal Equations Walkthrough

Step 1: Set Up the Model

Matrix Formulation

The Matrix Model

Normal Equations (Matrix Form)

The OLS Estimator

Geometric Interpretation

OLS as Orthogonal Projection

The Projection Interpretation

Interactive: OLS Geometry

Interactive OLS Geometry

Properties of OLS Estimators

Unbiasedness

Variance of the Estimators

Variance of OLS Estimator

Connection to Maximum Likelihood

The Gaussian Error Assumption

AI/ML Applications

📉MSE Loss in Neural Networks

🎢Gradient Descent and Normal Equations

🧊Linear Layers as Regression

🔍Feature Importance and Interpretability

Python Implementation

Knowledge Check

OLS Theory Quiz

Summary

Key Takeaways