Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Express multiple regression in matrix notation
• Derive the OLS estimator using calculus
• Understand the geometric interpretation as projection
• Interpret regression coefficients correctly

🔧 Practical Skills

• Implement multiple regression from scratch
• Diagnose and address multicollinearity
• Compute and interpret VIF (Variance Inflation Factor)
• Apply the Hat Matrix for leverage analysis

🧠 Deep Learning Connections

• Linear layers — A neural network layer y = Wx + b is exactly multiple regression without activation
• Gradient descent — Training with MSE loss converges to the OLS solution for linear models
• Regularization — L2 regularization (weight decay) is Ridge regression; L1 is Lasso
• Feature preprocessing — Understanding collinearity helps design better input features

Where You'll Apply This: Economic forecasting, medical outcome prediction, marketing attribution, scientific experiments, feature importance analysis, and as the foundation for understanding regularized and generalized linear models.

The Big Picture

Multiple linear regression extends simple linear regression from one predictor to many. Instead of fitting a line through points, we fit a hyperplane through points in a higher-dimensional space. The mathematics becomes elegant when expressed in matrix form, revealing deep connections to linear algebra and optimization.

The Central Question

Given multiple predictor variables $X_1, X_2, \\ldots, X_p$ , how do we find the best linear combination to predict a response $Y$ ? And what does "best" mean precisely?

Y = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\cdots + \\beta_p X_p + \\varepsilon

The OLS (Ordinary Least Squares) solution minimizes the sum of squared residuals—the squared vertical distances between observed and predicted values. This leads to the famous Normal Equations, which have a beautiful closed-form solution in matrix notation.

Historical Context

📜

Carl Friedrich Gauss (1795)

Gauss developed least squares at age 18 to track the asteroid Ceres. He showed that if errors are normally distributed, minimizing squared errors gives the most probable estimate—connecting OLS to maximum likelihood estimation.

📊

Adrien-Marie Legendre (1805)

Published the first formal description of least squares. The priority dispute with Gauss became one of mathematics' famous controversies. Both contributed essential insights—Legendre to publication, Gauss to theory.

From Simple to Multiple Regression

In simple linear regression, we model the relationship between one predictor and one response:

Y = \\beta_0 + \\beta_1 X + \\varepsilon

Multiple regression generalizes this to $p$ predictors:

Y_i = \\beta_0 + \\beta_1 X_{i1} + \\beta_2 X_{i2} + \\cdots + \\beta_p X_{ip} + \\varepsilon_i

for each observation $i = 1, 2, \\ldots, n$ .

Geometric Intuition

The key geometric insight is that multiple regression fits a hyperplane to the data:

1 predictor: Fit a line in 2D (the familiar scatter plot with trend line)
2 predictors: Fit a plane in 3D (imagine a tilted sheet of paper through a cloud of points)
p predictors: Fit a p-dimensional hyperplane in (p+1)-dimensional space

Interactive: 3D Regression Plane

Explore how the regression plane fits data with two predictors. Adjust the true coefficients and noise level to see how the fitted plane changes. Enable residuals to visualize the vertical distances being minimized.

3D Multiple Regression Visualizer

Intercept

\beta_0

2.0

Slope

\beta_1

1.5

Slope

\beta_2

-0.8

Noise

\sigma

0.30

Rotate X-20°

Rotate Z45°

Points40

Regression Equation

\hat{Y} = 2.0 + 1.5X_1 -0.8X_2

The regression plane represents the predicted value of Y for any combination of X₁ and X₂.

Model Statistics

97.7%

R² (Variance Explained)

0.277

RMSE

Key Insight: The Geometry of Multiple Regression

In simple linear regression, we fit a line. With two predictors, we fit a plane. With p predictors, we fit a p-dimensional hyperplane in (p+1)-dimensional space. The coefficients $\beta_1$ and $\beta_2$ determine the slope of the plane in each predictor direction—they tell us how Y changes when each predictor increases by 1 unit, holding all other predictors constant.

Matrix Formulation

The power of multiple regression becomes apparent in matrix notation. Instead of writing out each equation individually, we express everything compactly:

The Matrix Form

\\mathbf{y} = \\mathbf{X}\\boldsymbol{\\beta} + \\boldsymbol{\\varepsilon}

n × 1 response

n × (p+1) design matrix

(p+1) × 1 coefficients

n × 1 errors

The Design Matrix

The design matrix $\\mathbf{X}$ is the heart of regression. Each row represents one observation, and each column represents a predictor:

\\mathbf{X} = \\begin{bmatrix} 1 & X_{11} & X_{12} & \\cdots & X_{1p} \\\\ 1 & X_{21} & X_{22} & \\cdots & X_{2p} \\\\ \\vdots & \\vdots & \\vdots & \\ddots & \\vdots \\\\ 1 & X_{n1} & X_{n2} & \\cdots & X_{np} \\end{bmatrix}

The column of ones allows the intercept $\\beta_0$ to be included in the matrix multiplication. This elegant trick unifies the treatment of intercept and slope coefficients.

Interactive: OLS Derivation Step by Step

Walk through the complete mathematical derivation of the OLS estimator. Each step shows the key equation and explains the intuition behind the matrix operations.

OLS Derivation: Step-by-Step

The Model Setup

Express multiple regression in matrix form

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}

We have n observations and p predictors. y is an n×1 vector of responses, X is an n×(p+1) design matrix (including a column of 1s for the intercept), β is a (p+1)×1 vector of coefficients, and ε is an n×1 vector of errors.

💡

The design matrix X has one row per observation and one column per predictor (plus intercept). Each row represents one data point.

Step 1 of 7

Matrix Dimensions Reference

n × 1

n × (p+1)

(p+1) × 1

XᵀX

(p+1) × (p+1)

n × n

The Normal Equations

The derivation leads to the famous Normal Equations:

Normal Equations

\\mathbf{X}^T\\mathbf{X}\\boldsymbol{\\hat{\\beta}} = \\mathbf{X}^T\\mathbf{y}

Solving for $\\boldsymbol{\\hat{\\beta}}$ yields the OLS estimator:

\\boldsymbol{\\hat{\\beta}} = (\\mathbf{X}^T\\mathbf{X})^{-1}\\mathbf{X}^T\\mathbf{y}

Why "Normal" Equations? Not because of the normal distribution! The name comes from geometry: the residual vector

\\mathbf{e} = \\mathbf{y} - \\mathbf{\\hat{y}}

is normal (perpendicular) to the column space of X. The least squares solution is the orthogonal projection of y onto this column space.

Interpreting Coefficients

In multiple regression, each coefficient has a specific interpretation that differs from simple regression:

The Partial Effect: $\\beta_j$ represents the expected change in Y when $X_j$ increases by one unit, holding all other predictors constant.

This "holding constant" interpretation (ceteris paribus) is crucial and often misunderstood. It's different from the simple correlation between $X_j$ and Y.

Example: House Prices

Suppose we model house price as:

\\text{Price} = 50000 + 150 \\times \\text{sqft} - 2500 \\times \\text{age}

• Each additional square foot adds $150 to price (holding age constant)
• Each additional year of age reduces price by $2,500 (holding size constant)
• The intercept $50,000 is the predicted price for a 0 sqft, 0 year old house (extrapolation—interpret with caution!)

Interactive: Coefficient Explorer

Explore how to interpret regression coefficients in different real-world scenarios. Adjust the predictor values and see how the model generates predictions.

Coefficient Interpretation Explorer

Predicting house prices from size and age

Regression Model

\widehat{\text{price}} = 50,000 + 150\cdot\text{sqft} -2500\cdot\text{age}

Size(sqft)

2000 sq ft

1000 sq ft3000 sq ft

Coefficient

\beta_{1}

+150$ per sq ft

Age(age)

10 years

0 years50 years

Coefficient

\beta_{2}

-2500$ per years

Predicted Price

$325k

Calculation

50,000 + 150 × 2000 - 2500 × 10

📝 Interpretation

For a 2,000 sq ft house that is 10 years old, the predicted price is $325k. Each additional square foot adds $150 to the price (holding age constant). Each additional year of age reduces the price by $2,500 (holding size constant).

⚠️ Critical: "Holding All Else Constant"

Each coefficient represents the partial effect of that variable on the response, controlling for all other variables. This is called the ceteris paribus interpretation.

Warning: This interpretation assumes the predictors can actually vary independently. If predictors are correlated (e.g., experience and education often increase together), the "holding constant" interpretation becomes hypothetical—we may never observe someone with high experience but low education.

Multicollinearity

Multicollinearity occurs when predictors are highly correlated with each other. This creates problems for coefficient estimation:

❌ Problems with High Collinearity

• Standard errors of coefficients inflate dramatically
• Coefficient estimates become unstable across samples
• Individual effects become impossible to disentangle
• Sign of coefficient may flip unexpectedly

✓ What Still Works

• Coefficients remain unbiased (BLUE property)
• Overall model R² is unaffected
• Predictions for new data still work well
• Combined effect of correlated predictors is fine

The Variance Inflation Factor (VIF) quantifies how much the variance of a coefficient estimate is inflated due to collinearity:

\\text{VIF}_j = \\frac{1}{1 - R^2_j}

where $R^2_j$ is the R² from regressing $X_j$ on all other predictors

Interactive: Multicollinearity Demo

Explore how predictor correlation affects coefficient estimation. Increase the correlation between predictors and observe how standard errors inflate while the true coefficients remain unchanged.

Multicollinearity Diagnostic Demo

Correlation

\rho_{12}

0.30

Sample Size50

True

\beta_1

2.0

True

\beta_2

1.5

Predictor Correlation: X₁ vs X₂

VIF = 1.02 (Low)

Variance Inflation Factor: coefficients have 1.0× larger variance than if predictors were uncorrelated.

VIF = 1/(1 - r²) where r² is correlation between predictors

Coefficient Estimates

True

Estimate

\beta_1

2.00

1.87

±0.07

\beta_2

1.50

1.46

±0.06

Model R² = 97.0%

What Happens with High Correlation?

• Standard errors of coefficients inflate
• Coefficient estimates become unstable
• T-statistics become small (less significant)
• Individual effects become hard to separate
• But overall R² can still be high!

VIF Interpretation Guidelines

• VIF < 5: Generally acceptable
• 5 ≤ VIF < 10: Moderate collinearity, worth investigating
• VIF ≥ 10: Severe collinearity, action needed
• VIF = ∞: Perfect collinearity (XᵀX singular)

🧠 Connection to Deep Learning

In neural networks, feature collinearity manifests as redundant neuronsthat learn similar representations. This is why techniques like dropoutand weight decay (L2 regularization) help—they encourage the network to learn independent features. Batch normalization also helps by decorrelating activations. Ridge regression (adding λI to XᵀX) is the linear algebra equivalent of weight decay.

Statistical Inference

Under the Gauss-Markov assumptions, OLS estimators have remarkable properties:

Assumption	Mathematical Form	Consequence
Linearity	E[Y\|X] = Xβ	Model is correctly specified
Full rank	rank(X) = p + 1	Unique solution exists
Exogeneity	E[ε\|X] = 0	Unbiased estimates
Homoscedasticity	Var(ε\|X) = σ²I	Efficient standard errors
No autocorrelation	Cov(εᵢ, εⱼ) = 0 for i ≠ j	Valid inference

Under these assumptions, OLS is the Best Linear Unbiased Estimator (BLUE). Adding the assumption of normally distributed errors enables t-tests and F-tests:

t_j = \\frac{\\hat{\\beta}_j}{\\text{SE}(\\hat{\\beta}_j)} \\sim t_{n-p-1}

Test statistic for H₀: βⱼ = 0 follows a t-distribution with n-p-1 degrees of freedom

Connections to Machine Learning

Multiple linear regression is the foundation for understanding many machine learning concepts:

🧠 Linear Layer = Multiple Regression

A neural network linear layer $\\mathbf{y} = \\mathbf{W}\\mathbf{x} + \\mathbf{b}$ is exactly multiple regression! W contains the coefficients, b is the bias (intercept). Training with MSE loss via gradient descent converges to the OLS solution.

⚖️ Regularization

Ridge regression adds L2 penalty: $\\hat{\\beta}_{\\text{ridge}} = (\\mathbf{X}^T\\mathbf{X} + \\lambda\\mathbf{I})^{-1}\\mathbf{X}^T\\mathbf{y}$

This is identical to weight decay in neural networks! It shrinks coefficients toward zero, reducing overfitting and handling multicollinearity.

📈 Feature Engineering

Understanding collinearity and coefficient interpretation helps design better features. Correlated features waste model capacity; understanding partial effects helps interpret feature importance in complex models.

🎯 Loss Functions

MSE loss has the OLS solution as its global minimum. Understanding why squared error (not absolute error) leads to closed-form solutions explains the prevalence of MSE in machine learning.

Python Implementation

Let's implement multiple linear regression from scratch, including diagnostics for multicollinearity. This implementation follows the scikit-learn API style.

Multiple Linear Regression from Scratch

🐍multiple_regression.py

Explanation(14)

Code(100)

1Import NumPy

NumPy provides efficient array operations and linear algebra routines essential for matrix-based regression.

3Class Definition

We implement multiple regression as a scikit-learn style class with fit(), predict(), and diagnostic methods.

8Constructor

fit_intercept controls whether to include a constant term (β₀). Most models should include an intercept unless the data is centered.

15Fit Method

The fit method solves the Normal Equations to find the OLS estimates. This is where the matrix algebra happens.

22Design Matrix Construction

If fitting an intercept, we prepend a column of ones to X. This makes X_design have shape (n, p+1) where the first column corresponds to β₀.

EXAMPLE

X = [[x11, x12], [x21, x22]] → X_design = [[1, x11, x12], [1, x21, x22]]

27Form XᵀX and Xᵀy

XᵀX is a (p+1)×(p+1) symmetric positive semi-definite matrix. Xᵀy is a (p+1)×1 vector. These are the components of the Normal Equations.

31Solve Normal Equations

np.linalg.solve is more numerically stable than computing (XᵀX)⁻¹ explicitly. It solves XᵀXβ = Xᵀy for β using LU decomposition.

34Extract Coefficients

The first element of β is the intercept (if included), and the remaining elements are the slope coefficients for each predictor.

42Compute R²

R² (coefficient of determination) measures the proportion of variance in y explained by the model. R² = 1 - SSres/SStot, ranging from 0 to 1.

49Predict Method

Predictions are computed as ŷ = Xβ + β₀. This is a simple matrix-vector multiplication plus a scalar.

54Standard Errors

Standard errors quantify uncertainty in coefficient estimates. They depend on residual variance and the inverse of XᵀX.

64Residual MSE

Mean Squared Error of residuals estimates σ². We divide by n - (p+1) degrees of freedom for an unbiased estimate.

68Variance-Covariance Matrix

Var(β̂) = σ²(XᵀX)⁻¹. The diagonal elements give variances of individual coefficients. Off-diagonal elements show covariances between coefficient estimates.

74VIF Computation

Variance Inflation Factor measures multicollinearity. For each predictor, we regress it on all others and compute VIF = 1/(1-R²). VIF > 10 indicates problematic collinearity.

86 lines without explanation

1import numpy as np
2
3class MultipleLinearRegression:
4    """
5    Multiple Linear Regression using OLS (Ordinary Least Squares).
6    Solves: β̂ = (XᵀX)⁻¹Xᵀy
7    """
8
9    def __init__(self, fit_intercept=True):
10        self.fit_intercept = fit_intercept
11        self.coefficients_ = None
12        self.intercept_ = None
13        self.residuals_ = None
14        self.r_squared_ = None
15
16    def fit(self, X, y):
17        """Fit the model using the Normal Equations."""
18        X = np.asarray(X)
19        y = np.asarray(y).flatten()
20        n, p = X.shape
21
22        # Add intercept column if requested
23        if self.fit_intercept:
24            X_design = np.column_stack([np.ones(n), X])
25        else:
26            X_design = X
27
28        # Solve Normal Equations: β = (XᵀX)⁻¹Xᵀy
29        XtX = X_design.T @ X_design
30        Xty = X_design.T @ y
31
32        # Use solve instead of explicit inverse (more stable)
33        beta = np.linalg.solve(XtX, Xty)
34
35        # Extract intercept and coefficients
36        if self.fit_intercept:
37            self.intercept_ = beta[0]
38            self.coefficients_ = beta[1:]
39        else:
40            self.intercept_ = 0.0
41            self.coefficients_ = beta
42
43        # Compute residuals and R²
44        y_pred = self.predict(X)
45        self.residuals_ = y - y_pred
46        ss_res = np.sum(self.residuals_ ** 2)
47        ss_tot = np.sum((y - np.mean(y)) ** 2)
48        self.r_squared_ = 1 - ss_res / ss_tot
49
50        return self
51
52    def predict(self, X):
53        """Generate predictions for new data."""
54        X = np.asarray(X)
55        return X @ self.coefficients_ + self.intercept_
56
57    def get_standard_errors(self, X):
58        """Compute standard errors of coefficients."""
59        X = np.asarray(X)
60        n, p = X.shape
61
62        if self.fit_intercept:
63            X_design = np.column_stack([np.ones(n), X])
64        else:
65            X_design = X
66
67        # Residual variance (MSE)
68        df = n - X_design.shape[1]  # Degrees of freedom
69        mse = np.sum(self.residuals_ ** 2) / df
70
71        # Variance-covariance matrix of β
72        XtX_inv = np.linalg.inv(X_design.T @ X_design)
73        var_beta = mse * XtX_inv
74
75        # Standard errors are sqrt of diagonal
76        return np.sqrt(np.diag(var_beta))
77
78    def compute_vif(self, X):
79        """Compute Variance Inflation Factors for each predictor."""
80        X = np.asarray(X)
81        n, p = X.shape
82        vif = np.zeros(p)
83
84        for i in range(p):
85            # Regress X_i on all other predictors
86            mask = np.ones(p, dtype=bool)
87            mask[i] = False
88            X_other = X[:, mask]
89            X_i = X[:, i]
90
91            # Fit regression
92            if X_other.shape[1] > 0:
93                model = MultipleLinearRegression(fit_intercept=True)
94                model.fit(X_other, X_i)
95                r_squared = model.r_squared_
96                vif[i] = 1 / (1 - r_squared) if r_squared < 1 else np.inf
97            else:
98                vif[i] = 1.0
99
100        return vif

Production Use: For real applications, use from sklearn.linear_model import LinearRegression or statsmodels.OLS. They handle numerical stability, edge cases, and provide comprehensive diagnostics.

Knowledge Check

Test your understanding of multiple linear regression with these questions:

Knowledge Check: Multiple Regression

Score: 0/7

In the OLS solution β̂ = (XᵀX)⁻¹Xᵀy, what does the matrix (XᵀX)⁻¹Xᵀ represent geometrically?

Summary

Key Takeaways

✅Multiple regression fits a hyperplane: Y = β₀ + β₁X₁ + ... + βₚXₚ + ε

✅Matrix form: y = Xβ + ε; OLS solution: β̂ = (XᵀX)⁻¹Xᵀy

✅Each βⱼ is a partial effect—change in Y per unit Xⱼ, holding others constant

✅Normal equations come from setting the gradient of SSR to zero

✅Multicollinearity inflates standard errors but doesn't bias coefficients

✅VIF > 10 indicates problematic collinearity requiring action

✅The Hat Matrix H = X(XᵀX)⁻¹Xᵀ projects y onto the column space of X

✅Linear layers in neural networks are exactly multiple regression

Looking Ahead

In the next section, we'll dive deep into the OLS Theory—the statistical properties of the OLS estimator under different assumptions. We'll prove why OLS is BLUE (Best Linear Unbiased Estimator) under the Gauss-Markov assumptions and explore what happens when assumptions are violated.