Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Express multiple regression in matrix notation
- Derive the OLS estimator using calculus
- Understand the geometric interpretation as projection
- Interpret regression coefficients correctly
🔧 Practical Skills
- Implement multiple regression from scratch
- Diagnose and address multicollinearity
- Compute and interpret VIF (Variance Inflation Factor)
- Apply the Hat Matrix for leverage analysis
🧠 Deep Learning Connections
- Linear layers: A neural network layer y = Wx + b without an activation has exactly the form of multiple regression
- Gradient descent: For a linear model with full-rank X and a suitable learning rate, training with MSE loss converges to the OLS solution
- Regularization: L2 regularization (weight decay) corresponds to Ridge regression; L1 regularization corresponds to Lasso
- Feature preprocessing: Understanding collinearity helps design better input features
Where You'll Apply This: Economic forecasting, medical outcome prediction, marketing attribution, scientific experiments, feature importance analysis, and as the foundation for understanding regularized and generalized linear models.
The Big Picture
Multiple linear regression extends simple linear regression from one predictor to many. Instead of fitting a line through points, we fit a hyperplane through points in a higher-dimensional space. The mathematics becomes elegant when expressed in matrix form, revealing deep connections to linear algebra and optimization.
The Central Question
Given multiple predictor variables X₁, X₂, …, Xₚ, how do we find the best linear combination to predict a response Y? And what does "best" mean precisely?
The OLS (Ordinary Least Squares) solution minimizes the sum of squared residuals—the squared vertical distances between observed and predicted values. This leads to the famous Normal Equations, which have a beautiful closed-form solution in matrix notation.
Historical Context
Carl Friedrich Gauss (1795)
Gauss later claimed to have devised least squares in 1795, at age 18. He applied it in 1801 to recover the orbit of the newly discovered asteroid Ceres. He also showed that if errors are normally distributed, minimizing squared errors gives the most probable estimate, connecting OLS to maximum likelihood estimation.
Adrien-Marie Legendre (1805)
Published the first formal description of least squares. The priority dispute with Gauss became one of mathematics' famous controversies. Both contributed essential insights—Legendre to publication, Gauss to theory.
From Simple to Multiple Regression
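In simple linear regression, we model the relationship between one predictor and one response:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
Multiple regression generalizes this to p predictors:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$$
for each observation i = 1, …, n.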
Geometric Intuition
The key geometric insight is that multiple regression fits a hyperplane to the data:
- 1 predictor: Fit a line in 2D (the familiar scatter plot with trend line)
- 2 predictors: Fit a plane in 3D (imagine a tilted sheet of paper through a cloud of points)
- p predictors: Fit a p-dimensional hyperplane in (p+1)-dimensional space
Interactive: 3D Regression Plane
Explore how the regression plane fits data with two predictors. Adjust the true coefficients and noise level to see how the fitted plane changes. Enable residuals to visualize the vertical distances being minimized. The regression plane represents the predicted value of Y for any combination of X₁ and X₂.
Key Insight: The Geometry of Multiple Regression
In simple linear regression, we fit a line. With two predictors, we fit a plane. With p predictors, we fit a p-dimensional hyperplane in (p+1)-dimensional space. The coefficients β₁ and β₂ determine the slope of the plane in each predictor direction: they tell us how Y changes when each predictor increases by 1 unit, holding all other predictors constant.
Matrix Formulation
The power of multiple regression becomes apparent in matrix notation. Instead of writing out each equation individually, we express everything compactly:
The Matrix Form
$$y = X\beta + \varepsilon$$
The Design Matrix
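The design matrix X is the heart of regression. Each row represents one observation, and each column after the leading column of ones represents a predictor:
$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$
The column of ones allows the intercept β₀ to be included in the matrix multiplication. This elegant trick unifies the treatment of intercept and slope coefficients.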
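As a minimal sketch, the design matrix can be assembled in NumPy by prepending a column of ones to the raw predictor array. The data values and the variable names (`predictors`, `X`) are made up for illustration.

```python
import numpy as np

# Hypothetical raw predictors: n = 4 observations, p = 2 predictors.
# The numbers are illustrative only.
predictors = np.array([
    [1400.0, 20.0],
    [2000.0, 10.0],
    [1700.0, 35.0],
    [2500.0,  5.0],
])

n, p = predictors.shape

# Prepend a column of ones so the intercept is handled by the same
# matrix multiplication as the slope coefficients.
X = np.column_stack([np.ones(n), predictors])

print(X.shape)  # (4, 3), i.e. n x (p + 1)
```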
Interactive: OLS Derivation Step by Step
Walk through the complete mathematical derivation of the OLS estimator. Each step shows the key equation and explains the intuition behind the matrix operations.
The Model Setup
Express multiple regression in matrix form
We have n observations and p predictors. y is an n×1 vector of responses, X is an n×(p+1) design matrix (including a column of 1s for the intercept), β is a (p+1)×1 vector of coefficients, and ε is an n×1 vector of errors.
The design matrix X has one row per observation and one column per predictor (plus intercept). Each row represents one data point.
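Matrix Dimensions Reference
| Symbol | Meaning | Dimensions |
|---|---|---|
| y | Response vector | n × 1 |
| X | Design matrix (including the column of 1s) | n × (p+1) |
| β | Coefficient vector | (p+1) × 1 |
| ε | Error vector | n × 1 |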
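The derivation can be condensed into a few lines. What follows is the standard calculus argument written with the symbols defined above; it assumes X has full column rank so that XᵀX is invertible, and S(β) is a name introduced here for the sum of squared residuals.

```latex
\begin{aligned}
S(\beta) &= (y - X\beta)^\top (y - X\beta)
          = y^\top y - 2\,\beta^\top X^\top y + \beta^\top X^\top X \beta \\
\frac{\partial S}{\partial \beta} &= -2\,X^\top y + 2\,X^\top X \beta \\
0 &= -2\,X^\top y + 2\,X^\top X \hat\beta
   \;\Longrightarrow\; X^\top X \hat\beta = X^\top y \\
\hat\beta &= (X^\top X)^{-1} X^\top y
\end{aligned}
```

The second derivative is 2XᵀX, which is positive definite when X has full column rank, so this stationary point is the unique minimum.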
The Normal Equations
The derivation leads to the famous Normal Equations:
$$X^\top X \hat\beta = X^\top y$$
Solving for β̂ (assuming XᵀX is invertible) yields the OLS estimator:
$$\hat\beta = (X^\top X)^{-1} X^\top y$$
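The closed form can be checked numerically. The sketch below simulates data (the coefficients, noise level, and seed are assumptions chosen for illustration), solves the Normal Equations XᵀXβ̂ = Xᵀy, and compares the result with NumPy's least-squares routine. It also verifies the property that the Normal Equations encode: the residuals are orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumed for illustration): n = 200 observations, p = 2 predictors.
n = 200
predictors = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), predictors])
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve the Normal Equations X^T X beta = X^T y.
# np.linalg.solve avoids forming the explicit inverse of X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Rearranged, the Normal Equations read X^T (y - X beta_hat) = 0.
residuals = y - X @ beta_hat

print(np.allclose(beta_hat, beta_lstsq))
print(np.allclose(X.T @ residuals, 0.0, atol=1e-8))
```

Solving the linear system rather than inverting XᵀX is a common design choice: it is cheaper and usually more accurate when predictors are nearly collinear.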
Interpreting Coefficients
In multiple regression, each coefficient has a specific interpretation that differs from simple regression:
The Partial Effect: βⱼ represents the expected change in Y when Xⱼ increases by one unit, holding all other predictors constant.
This "holding constant" interpretation (ceteris paribus) is crucial and often misunderstood. It differs from the simple (marginal) association between Xⱼ and Y, which also absorbs the effects of any other predictors correlated with Xⱼ.
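A small simulation makes the distinction concrete. In this sketch (all numbers are assumptions for illustration), x2 is built to be correlated with x1, and the true partial effect of x1 is 2. Regressing y on x1 alone gives a slope near 2 + 3 × 0.8 = 4.4, because x1 also picks up the effect of x2; including x2 recovers a slope near 2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# x2 moves with x1 (assumed data-generating process).
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

ones = np.ones(n)
simple = ols(np.column_stack([ones, x1]), y)        # y on x1 only
multiple = ols(np.column_stack([ones, x1, x2]), y)  # y on x1 and x2

print("slope of x1, simple regression:  ", simple[1])
print("slope of x1, multiple regression:", multiple[1])
```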
Example: House Prices
Suppose we model house price as:
$$\widehat{\text{Price}} = 50{,}000 + 150 \cdot \text{SqFt} - 2{,}500 \cdot \text{Age}$$
- Each additional square foot adds $150 to price (holding age constant)
- Each additional year of age reduces price by $2,500 (holding size constant)
- The intercept $50,000 is the predicted price for a 0 sqft, 0 year old house (an extrapolation, so interpret with caution!)
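The interpretations above can be reproduced with a few lines of arithmetic. The function name `predict_price` is hypothetical; the coefficients are the ones from the house price example.

```python
def predict_price(sqft: float, age_years: float) -> float:
    """Predicted price from the example model: 50,000 + 150*sqft - 2,500*age."""
    return 50_000 + 150 * sqft - 2_500 * age_years

base = predict_price(2_000, 10)
print(base)  # 325000

# Partial effects: change one predictor by one unit, hold the other fixed.
print(predict_price(2_001, 10) - base)  # 150
print(predict_price(2_000, 11) - base)  # -2500
```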
Interactive: Coefficient Explorer
Explore how to interpret regression coefficients in different real-world scenarios. Adjust the predictor values and see how the model generates predictions.
Predicting house prices from size and age
📝 Interpretation
For a 2,000 sq ft house that is 10 years old, the predicted price is $325k. Each additional square foot adds $150 to the price (holding age constant). Each additional year of age reduces the price by $2,500 (holding size constant).
⚠️ Critical: "Holding All Else Constant"
Each coefficient represents the partial effect of that variable on the response, controlling for all other variables. This is called the ceteris paribus interpretation.
Warning: This interpretation assumes the predictors can actually vary independently. If predictors are correlated (e.g., experience and education often increase together), the "holding constant" interpretation becomes hypothetical—we may never observe someone with high experience but low education.
Multicollinearity
Multicollinearity occurs when predictors are highly correlated with each other. This creates problems for coefficient estimation:
❌ Problems with High Collinearity
- Standard errors of coefficients inflate dramatically
- Coefficient estimates become unstable across samples
- Individual effects become impossible to disentangle
- Sign of coefficient may flip unexpectedly
✓ What Still Works
- Coefficient estimates remain unbiased (OLS is still BLUE)
- Overall model R² is unaffected
- Predictions still work well for new data that follow the same correlation pattern among predictors
- The combined effect of correlated predictors is still estimated well
The Variance Inflation Factor (VIF) quantifies how much the variance of a coefficient estimate is inflated due to collinearity:
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$
where Rⱼ² is the R² from regressing Xⱼ on all other predictors.
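The definition translates directly into code: regress each predictor on the others, take that regression's R², and compute 1 / (1 − R²). The sketch below is a minimal NumPy version; the function name `vif` and the simulated data are assumptions for illustration, and it does not guard against perfect collinearity (where R² = 1 and the VIF is infinite).

```python
import numpy as np

def vif(predictors: np.ndarray) -> np.ndarray:
    """VIF for each column of an (n, p) predictor array (no intercept column)."""
    n, p = predictors.shape
    out = np.empty(p)
    for j in range(p):
        target = predictors[:, j]
        others = np.delete(predictors, j, axis=1)
        # Auxiliary regression of predictor j on all other predictors (with intercept).
        Z = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
        resid = target - Z @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Illustrative data: x1 and x2 have correlation about 0.9, x3 is independent.
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # first two entries well above 1, third close to 1
```

With a population correlation of 0.9 between x1 and x2, the theoretical VIF for each is 1 / (1 − 0.81) ≈ 5.3; the sample values will differ slightly.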
Interactive: Multicollinearity Demo
Explore how predictor correlation affects coefficient estimation. Increase the correlation between predictors and observe how standard errors inflate while the true coefficients remain unchanged.
What Happens with High Correlation?
- Standard errors of coefficients inflate
- Coefficient estimates become unstable
- t-statistics become small (less significant)
- Individual effects become hard to separate
- But overall R² can still be high!
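These effects can be reproduced without the interactive demo. The sketch below repeatedly draws samples in which the two predictors have correlation `rho`, refits OLS each time, and records the estimate of the first slope. The function name `slope_spread`, the true coefficients, and the simulation sizes are assumptions for illustration.

```python
import numpy as np

def slope_spread(rho, n=100, n_sims=500, seed=0):
    """Mean and standard deviation of the first slope estimate across simulated samples."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = np.empty(n_sims)
    for s in range(n_sims):
        Z = rng.multivariate_normal(np.zeros(2), cov, size=n)
        X = np.column_stack([np.ones(n), Z])
        # True model (assumed): intercept 1, both slopes equal to 2, unit noise.
        y = 1.0 + 2.0 * Z[:, 0] + 2.0 * Z[:, 1] + rng.normal(size=n)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates[s] = beta_hat[1]
    return estimates.mean(), estimates.std()

mean_low, sd_low = slope_spread(rho=0.0)
mean_high, sd_high = slope_spread(rho=0.95)

print(f"rho=0.00: mean={mean_low:.3f}, sd={sd_low:.3f}")
print(f"rho=0.95: mean={mean_high:.3f}, sd={sd_high:.3f}")
```

In both settings the average estimate stays near the true value of 2 (unbiasedness), while the spread at rho = 0.95 is several times larger, roughly in line with the square root of the VIF.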
VIF Interpretation Guidelines
- VIF < 5: Generally acceptable
- 5 ≤ VIF < 10: Moderate collinearity, worth investigating
- VIF ≥ 10: Severe collinearity, action needed
- VIF = ∞: Perfect collinearity (XᵀX singular)
🧠 Connection to Deep Learning
In neural networks, feature collinearity manifests as redundant neurons that learn similar representations. Techniques such as dropout and weight decay (L2 regularization) help here: dropout discourages units from co-adapting, and weight decay shrinks weights so that redundant directions are not assigned large, offsetting values. Batch normalization standardizes the scale of activations, which improves conditioning, although it does not by itself decorrelate them. Ridge regression (adding λI to XᵀX) is the linear-algebra counterpart of weight decay.
Statistical Inference
Under the Gauss-Markov assumptions, OLS estimators have remarkable properties:
| Assumption | Mathematical Form | Consequence |
|---|---|---|
| Linearity | E[Y \| X] = Xβ | Model is correctly specified |
| Full rank | rank(X) = p + 1 | Unique solution exists |
| Exogeneity | E[ε \| X] = 0 | Unbiased estimates |
| Homoscedasticity | Var(εᵢ \| X) = σ² for all i | Usual standard error formulas are valid |
| No autocorrelation | Cov(εᵢ, εⱼ \| X) = 0 for i ≠ j | With homoscedasticity, Var(ε \| X) = σ²I and OLS is efficient |
Under these assumptions, OLS is the Best Linear Unbiased Estimator (BLUE). Adding the assumption of normally distributed errors enables t-tests and F-tests:
$$t_j = \frac{\hat\beta_j}{\mathrm{SE}(\hat\beta_j)}$$
Under H₀: βⱼ = 0, this test statistic follows a t-distribution with n − p − 1 degrees of freedom.
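The t-statistic for each coefficient is its estimate divided by its standard error, where the standard errors come from the diagonal of σ̂²(XᵀX)⁻¹ and σ̂² is the residual sum of squares divided by n − p − 1. The sketch below computes these quantities with NumPy on simulated data; the true coefficients and seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 2
predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])

# True coefficients (assumed): the second slope is exactly zero.
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

# Unbiased estimate of the error variance uses n - p - 1 degrees of freedom.
dof = n - p - 1
sigma2_hat = residuals @ residuals / dof

# Estimated covariance of beta_hat: sigma^2 * (X^T X)^{-1}.
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov_beta))
t_stats = beta_hat / std_err

for name, b, se, t in zip(["intercept", "x1", "x2"], beta_hat, std_err, t_stats):
    print(f"{name:>9}: estimate={b: .3f}  se={se:.3f}  t={t: .2f}")
```

Turning a t-statistic into a p-value requires the t-distribution's tail probability, for example via `scipy.stats.t.sf`, which is left out here to keep the sketch NumPy-only.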
Connections to Machine Learning
Multiple linear regression is the foundation for understanding many machine learning concepts:
🧠 Linear Layer = Multiple Regression
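A neural network linear layer y = Wx + b has exactly the form of multiple regression: W contains the coefficients and b is the bias (intercept). For such a layer, training with MSE loss via gradient descent converges to the OLS solution, provided X has full column rank and the learning rate is small enough.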
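The convergence claim is easy to check on a small problem. The sketch below runs plain full-batch gradient descent on the MSE loss of a linear model and compares the result with the least-squares solution. The learning rate, iteration count, and simulated data are assumptions chosen so that the iteration is stable for this particular problem.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
predictors = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), predictors])
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.3, size=n)

# Reference: closed-form least-squares solution.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Full-batch gradient descent on MSE(beta) = (1/n) * ||X beta - y||^2.
beta = np.zeros(X.shape[1])
learning_rate = 0.1
for _ in range(2000):
    grad = (2.0 / n) * X.T @ (X @ beta - y)
    beta -= learning_rate * grad

print(np.allclose(beta, beta_ols, atol=1e-8))
```

Because the MSE loss of a linear model is a convex quadratic, gradient descent has no local minima to get stuck in; only the step size and the conditioning of XᵀX affect how quickly it reaches the OLS solution.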
⚖️ Regularization
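Ridge regression adds an L2 penalty to the least-squares objective:
$$\hat\beta_{\text{ridge}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert^2 = (X^\top X + \lambda I)^{-1} X^\top y$$
This is equivalent to weight decay in neural networks trained with plain gradient descent. It shrinks coefficients toward zero, reducing overfitting and stabilizing estimates under multicollinearity.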
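A minimal sketch of the ridge closed form, (XᵀX + λI)⁻¹Xᵀy, shows the shrinkage on nearly collinear predictors. To keep it short, the data are centered and no intercept is fitted, so every coefficient is penalized; the function name `ridge_fit`, the value λ = 10, and the simulated data are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients for centered X and y (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# Center predictors and response so the intercept can be dropped.
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
y = y - y.mean()

beta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 reduces to OLS
beta_ridge = ridge_fit(X, y, lam=10.0)

print("OLS:  ", beta_ols, "norm:", np.linalg.norm(beta_ols))
print("Ridge:", beta_ridge, "norm:", np.linalg.norm(beta_ridge))
```

With predictors this collinear, the OLS coefficients can land far from the true values of 1 and 1 and vary strongly from sample to sample, while the ridge estimates have a smaller norm.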
📈 Feature Engineering
Understanding collinearity and coefficient interpretation helps design better features. Correlated features waste model capacity; understanding partial effects helps interpret feature importance in complex models.
🎯 Loss Functions
MSE loss has the OLS solution as its global minimum. Understanding why squared error (not absolute error) leads to closed-form solutions explains the prevalence of MSE in machine learning.
Python Implementation
Let's implement multiple linear regression from scratch, including diagnostics for multicollinearity. This implementation follows the scikit-learn API style.
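A minimal sketch of such an implementation, assuming only NumPy, might look like the following. The class name `LinearRegressionScratch` is hypothetical; the `fit` / `predict` / `score` methods and the `coef_` / `intercept_` attributes mirror scikit-learn's naming conventions, and the sketch covers estimation only, not the multicollinearity diagnostics.

```python
import numpy as np

class LinearRegressionScratch:
    """Minimal OLS estimator with a scikit-learn-like interface (illustrative)."""

    def __init__(self, fit_intercept: bool = True):
        self.fit_intercept = fit_intercept

    @staticmethod
    def _as_2d(X):
        X = np.asarray(X, dtype=float)
        return X.reshape(-1, 1) if X.ndim == 1 else X

    def fit(self, X, y):
        X = self._as_2d(X)
        y = np.asarray(y, dtype=float)
        if self.fit_intercept:
            # Column of ones for the intercept.
            A = np.column_stack([np.ones(X.shape[0]), X])
        else:
            A = X
        # lstsq minimizes ||A beta - y||^2 without forming (A^T A)^{-1}.
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        if self.fit_intercept:
            self.intercept_, self.coef_ = beta[0], beta[1:]
        else:
            self.intercept_, self.coef_ = 0.0, beta
        return self

    def predict(self, X):
        return self._as_2d(X) @ self.coef_ + self.intercept_

    def score(self, X, y):
        """Coefficient of determination R^2."""
        y = np.asarray(y, dtype=float)
        ss_res = np.sum((y - self.predict(X)) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

# Usage on noise-free data (assumed), so the true coefficients are recovered.
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

model = LinearRegressionScratch().fit(X_demo, y_demo)
print(model.intercept_, model.coef_, model.score(X_demo, y_demo))
```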
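In practice, prefer scikit-learn's LinearRegression (from sklearn.linear_model import LinearRegression) or statsmodels' OLS (statsmodels.api.OLS). They handle numerical stability and edge cases, and statsmodels in particular provides comprehensive diagnostics.
Knowledge Check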
Test your understanding of multiple linear regression with these questions:
In the OLS solution β̂ = (XᵀX)⁻¹Xᵀy, what does the matrix (XᵀX)⁻¹Xᵀ represent geometrically?
Summary
Key Takeaways
Looking Ahead
In the next section, we'll dive deep into the OLS Theory—the statistical properties of the OLS estimator under different assumptions. We'll prove why OLS is BLUE (Best Linear Unbiased Estimator) under the Gauss-Markov assumptions and explore what happens when assumptions are violated.