
Simple Linear Regression

Linear Regression

Learning Objectives

By the end of this section, you will be able to:

📐 Mathematical Understanding

  • Derive the OLS estimators from first principles
  • Understand why minimizing squared errors yields a unique solution
  • Explain the geometric meaning of regression coefficients
  • Interpret R² and understand its limitations

🔧 Practical Skills

  • Implement simple linear regression from scratch
  • Diagnose model problems using residual plots
  • Interpret regression coefficients in real contexts
  • Recognize when linear regression is appropriate

🧠 Deep Learning Connections

  • Neural Networks — Linear regression is a single-layer network with identity activation and MSE loss
  • Gradient Descent — The OLS loss surface is convex, guaranteeing convergence
  • Regularization — Ridge and Lasso extend OLS to prevent overfitting
  • Feature Engineering — Understanding linear relationships guides feature design
Where You'll Apply This: Predicting house prices, forecasting sales, understanding causal relationships, A/B test analysis, and as the foundation for understanding more complex models like neural networks and gradient-based optimization.

The Big Picture

Simple linear regression is arguably the most fundamental statistical model—and understanding it deeply unlocks the door to all of machine learning. At its core, it answers a question humans have asked for centuries: Given that two quantities are related, how can we best describe and predict one from the other?

The Core Problem

We observe pairs of values $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. We want to find the best straight line that describes the relationship between X and Y: a line that we can use to predict Y for any new value of X.

$$\hat{y} = \beta_0 + \beta_1 x$$

The regression line: intercept $\beta_0$ plus slope $\beta_1$ times $x$

Historical Context

The word "regression" comes from Sir Francis Galton's 1886 study of heredity. He noticed that children of very tall parents tended to be shorter than their parents, and children of very short parents tended to be taller—heights "regressed" toward the population mean. While studying this phenomenon, he developed the mathematical framework that became linear regression.

📜
Carl Friedrich Gauss (1809)

Developed the method of least squares to predict the orbit of the asteroid Ceres. His work established the mathematical foundation of OLS that we still use today.

🔬
Francis Galton (1886)

Coined the term "regression" while studying inheritance patterns. His work with Karl Pearson established regression as a fundamental statistical tool.

Where Regression Is Used

| Domain | Example Application | X (Input) | Y (Output) |
|---|---|---|---|
| Economics | Demand forecasting | Price | Quantity sold |
| Medicine | Drug dosage response | Dosage (mg) | Blood pressure reduction |
| Real Estate | House price prediction | Square footage | Sale price ($) |
| Marketing | Ad spend optimization | Ad budget ($) | Revenue ($) |
| Climate | Temperature trends | Year | Global temperature (°C) |
| ML/AI | Learning rate tuning | Learning rate | Validation loss |

The Mathematical Model

Simple linear regression models the relationship between a single independent variable X (also called the predictor, feature, or regressor) and a dependent variable Y (the response or target). The model assumes this relationship is linear in nature, plus some random noise.

Model Components

The Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

for $i = 1, 2, \ldots, n$

Let's break down each component:

Y
Dependent Variable (Response)

The variable we want to predict or explain. In ML terms, this is the target or label. Examples: house price, test score, click-through rate.

X
Independent Variable (Predictor)

The variable we use to make predictions. In ML terms, this is the feature or input. Examples: square footage, study hours, ad spend.

β₀
Intercept (Bias Term)

The expected value of Y when X = 0. In neural network terms, this is the bias. Geometrically, it's where the regression line crosses the Y-axis.

β₁
Slope (Weight)

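The expected change in Y for each one-unit increase in X. This is usually the parameter of most interest: its sign gives the direction of the relationship and its magnitude gives the rate of change. The strength of the association is measured by the correlation or R², not by the slope alone.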

ε
Error Term (Noise)

Random noise that captures everything the model doesn't explain: measurement error, omitted variables, and inherent randomness. We assume $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. The error term itself is unobserved; the residual $\hat{\varepsilon}_i = y_i - \hat{y}_i$ computed after fitting is its estimate.
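
To make the components concrete, here is a minimal sketch that simulates data from the model. The parameter values (β₀ = 1, β₁ = 2, σ = 1.5), the sample size, the range of X, and the random seed are all illustrative assumptions, not values taken from this section.

```python
import numpy as np

# Illustrative "true" parameters (assumed for this sketch)
beta_0, beta_1, sigma = 1.0, 2.0, 1.5
n = 200

rng = np.random.default_rng(seed=0)
x = rng.uniform(0.0, 10.0, size=n)      # predictor values X_i
eps = rng.normal(0.0, sigma, size=n)    # error term, eps_i ~ N(0, sigma^2)
signal = beta_0 + beta_1 * x            # noise-free part, E[Y | X = x]
y = signal + eps                        # response Y_i = beta_0 + beta_1 * X_i + eps_i

print("sample standard deviation of the errors:", eps.std(ddof=1))
```

In real data only `x` and `y` are observed; `signal` and `eps` are available here only because the data were simulated.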

The Classical Assumptions

For OLS to provide the best possible estimates (BLUE: Best Linear Unbiased Estimator), we need certain assumptions to hold:

  1. Linearity: The true relationship between X and Y is linear. Check by examining the scatter plot and residual plots.
  2. Independence: The errors $\varepsilon_i$ are independent of each other. This can fail with time series or clustered data.
  3. Homoscedasticity: The error variance $\sigma^2$ is constant across all X values. The scatter should be uniform, not funnel-shaped.
  4. Normality: Errors are normally distributed. This is mainly important for inference (confidence intervals, hypothesis tests).
  5. No measurement error in X: The predictor variable is measured without error. This is often violated in practice.
Important: OLS still provides the best linear unbiased estimates even if normality fails, as long as the other assumptions hold (Gauss-Markov theorem). Normality is only needed for valid p-values and confidence intervals.

Ordinary Least Squares Derivation

Now we tackle the central question: How do we find the best values for β₀ and β₁? The answer lies in defining what "best" means mathematically, then using calculus to find the optimal parameters.

The Loss Function

We define "best" as minimizing the Sum of Squared Errors (SSE)—the total squared distance between observed Y values and our predictions.

The OLS Objective Function

$$\text{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

We seek $\hat{\beta}_0, \hat{\beta}_1$ that minimize this quantity

Why squared errors? There are several good reasons:

  • Differentiability: Squared errors are smooth, allowing us to use calculus. Absolute errors have a kink at zero.
  • Penalizes large errors: Squaring emphasizes outliers, which may or may not be desirable depending on context.
  • Unique solution: The SSE surface is a convex paraboloid, guaranteeing a single global minimum (provided the x values are not all identical).
  • Connection to Gaussian MLE: If errors are normally distributed, OLS is equivalent to maximum likelihood estimation.
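
As a small illustration of the objective, the sketch below evaluates the SSE for a few candidate lines. The five-point dataset and the candidate parameters are made up for easy arithmetic; the middle candidate happens to be the OLS solution for this dataset.

```python
import numpy as np

def sse(beta_0, beta_1, x, y):
    """Sum of squared errors for the line y_hat = beta_0 + beta_1 * x."""
    residuals = y - (beta_0 + beta_1 * x)
    return float(np.sum(residuals ** 2))

# Made-up data (an assumption of this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

candidates = [(0.0, 1.0), (2.2, 0.6), (3.0, 0.5)]
for b0, b1 in candidates:
    print(f"beta_0={b0}, beta_1={b1}: SSE={sse(b0, b1, x, y):.2f}")
# SSE values: 9.00, 2.40, 3.75
```

Any line other than (2.2, 0.6) gives a larger SSE on this dataset, which is what "best" means under the OLS criterion.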

Closed-Form Solution

To find the minimum, we take partial derivatives with respect to β₀ and β₁, set them to zero, and solve the resulting system of equations (called the normal equations).

Step 1: Derivative with respect to β₀

$$\frac{\partial \text{SSE}}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$$

This gives us: $\sum y_i = n\beta_0 + \beta_1 \sum x_i$, which means $\bar{y} = \beta_0 + \beta_1 \bar{x}$

Step 2: Derivative with respect to β₁

$$\frac{\partial \text{SSE}}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i) = 0$$

This gives us: $\sum x_i y_i = \beta_0 \sum x_i + \beta_1 \sum x_i^2$
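
Getting from these two normal equations to the estimators is a short substitution. One way to write it out, in the same notation:

```latex
\begin{aligned}
\beta_0 &= \bar{y} - \beta_1 \bar{x}
  && \text{(from Step 1)} \\
\sum x_i y_i &= (\bar{y} - \beta_1 \bar{x}) \sum x_i + \beta_1 \sum x_i^2
  && \text{(substitute into Step 2)} \\
\sum x_i y_i - n\bar{x}\bar{y} &= \beta_1 \left( \sum x_i^2 - n\bar{x}^2 \right)
  && \text{(use } \textstyle\sum x_i = n\bar{x}\text{)} \\
\beta_1 &= \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}
  = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{aligned}
```

The last equality uses the identities $\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$ and $\sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$.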

Step 3: The OLS Estimators

Slope:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}$$

Intercept:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
Key Insight: The slope formula $\hat{\beta}_1 = \text{Cov}(X,Y)/\text{Var}(X)$ shows that the slope is the covariance (how X and Y move together) normalized by the spread of X. Because the variance is positive, the slope always has the same sign as the covariance: positive covariance gives a positive slope, negative covariance a negative one.
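
The two forms of the slope formula can be checked numerically. The sketch below uses a made-up five-point dataset (an assumption for illustration) and NumPy's `np.cov` and `np.var`; the only subtlety is that the covariance and variance must use the same normalization (both n, or both n - 1) so that the factor cancels.

```python
import numpy as np

# Made-up data (an assumption of this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviation form: sum of cross-products over sum of squared deviations
slope_dev = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Covariance / variance form, both with the n - 1 normalization
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)
slope_cov = cov_xy / var_x

intercept = y.mean() - slope_dev * x.mean()
print(slope_dev, slope_cov, intercept)   # slope 0.6 both ways, intercept 2.2
```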

Geometric Interpretation

Geometry provides powerful intuition for understanding regression. Here are the key visual insights:

  • The regression line passes through (x̄, ȳ): The "center of mass" of the data always lies on the fitted line. This is a mathematical consequence of the normal equations.
  • Residuals are vertical distances: OLS minimizes vertical (not perpendicular) distances from points to the line. This is because we're predicting Y from X, not finding the best line in abstract space.
  • Sum of residuals equals zero: The positive residuals (points above the line) exactly balance the negative residuals (points below).
  • The SSE surface is a bowl: Plotting SSE as a function of (β₀, β₁) produces a convex paraboloid with a unique minimum—the OLS solution.
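
The first and third properties above can be verified numerically, along with a related fact that follows from the second normal equation: the residuals are orthogonal to x. This is a sketch on simulated data; the generating parameters and seed are assumptions.

```python
import numpy as np

# Simulated data (parameters and seed are assumptions of this sketch)
rng = np.random.default_rng(seed=1)
x = rng.uniform(0.0, 10.0, size=50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, size=50)

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
residuals = y - (intercept + slope * x)

# 1) The fitted line passes through (x_bar, y_bar)
passes_through_mean = np.isclose(intercept + slope * x.mean(), y.mean())
# 2) Residuals sum to zero (first normal equation)
sum_residuals = residuals.sum()
# 3) Residuals are orthogonal to x (second normal equation)
sum_x_residuals = np.sum(x * residuals)

print(passes_through_mean, sum_residuals, sum_x_residuals)
```

Both sums come out as zero up to floating-point rounding.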

Interactive: Regression Explorer

Explore how data characteristics affect the regression line. Adjust the true slope, intercept, and noise level to see how OLS finds the best fit. Toggle residuals to visualize what OLS is minimizing.

[Interactive figure: Regression Explorer. Scatter plot of data points with the fitted line; axes: X (independent variable), Y (dependent variable). Controls: data parameters and display options. Sample readout: ŷ = 0.645 + 2.048x, R² = 0.904, RMSE = 1.589, so for each unit increase in X, Y increases by 2.048 units.]

Interactive: Loss Surface

The SSE forms a convex surface over the parameter space. Watch gradient descent descend into the minimum, or manually explore different parameter values to understand why OLS finds the optimal solution.

[Interactive figure: OLS Loss Surface Visualizer. SSE contour plot over β₁ (slope) and β₀ (intercept), marking the current and optimal parameters, with manual parameter control and a gradient descent animation. Sample readout: current SSE = 699.43, optimal SSE = 37.04 at the OLS solution β₀* = 0.957, β₁* = 2.042.]

Key Insight

The SSE surface is a convex paraboloid. This means there's exactly one minimum (the OLS solution), and gradient descent with a sufficiently small step size will find it regardless of where you start. This convexity is what makes linear regression so elegant!
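
A minimal sketch of gradient descent on this surface, compared against the closed-form solution. The simulated data, the starting point, the learning rate, and the iteration count are all assumptions chosen for illustration; the loss is the mean squared error (SSE divided by n), which has the same minimizer as the SSE.

```python
import numpy as np

# Simulated data (parameters and seed are assumptions of this sketch)
rng = np.random.default_rng(seed=2)
x = rng.uniform(0.0, 10.0, size=100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, size=100)

# Closed-form OLS solution for comparison
slope_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_ols = y.mean() - slope_ols * x.mean()

# Gradient descent on MSE = SSE / n
b0, b1 = -5.0, 8.0        # arbitrary starting point
learning_rate = 0.01      # assumed; too large a value makes the iterates diverge
for _ in range(5000):
    error = (b0 + b1 * x) - y
    grad_b0 = 2.0 * error.mean()          # d(MSE)/d(b0)
    grad_b1 = 2.0 * (error * x).mean()    # d(MSE)/d(b1)
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

print("gradient descent:", b0, b1)
print("closed form:     ", intercept_ols, slope_ols)
```

The intercept converges much more slowly than the slope here because the bowl is elongated when x is not centered; centering or standardizing x is a common way to speed this up.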


Interpreting Coefficients

Interpretation is where regression becomes useful for real-world decision making. Here's how to read the coefficients:

Slope Interpretation (β̂₁)

"For each one-unit increase in X, we expect Y to change by β̂₁ units, on average, holding all else constant."

Example: If β̂₁ = 150 in a house price model with X = square footage, we interpret: "Each additional square foot is associated with a $150 increase in price."

Intercept Interpretation (β̂₀)

"The expected value of Y when X equals zero."

Caution: Often the intercept has no meaningful interpretation if X = 0 is outside the data range or physically impossible. A house with 0 square feet doesn't exist, so the intercept is just a mathematical anchoring point.

Correlation ≠ Causation: Regression shows association, not causation. Even a strong relationship (high R²) doesn't prove that X causes Y. For causal claims, you need randomized experiments or careful causal inference methods.

Goodness of Fit: R²

How well does our model fit the data? The coefficient of determination, R² (R-squared), provides a standardized measure from 0 to 1.

The R² Formula

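$$R^2 = 1 - \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

  • SSR: residual sum of squares, the variation left unexplained (the same quantity as the SSE minimized above)
  • SST: total sum of squares, the total variation of Y around its mean
  • R²: the proportion of the total variation that the model explains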

How to interpret R²:

  • R² = 0: The model explains nothing. Predicting the mean ȳ would be just as good.
  • R² = 0.5: The model explains 50% of the variance in Y.
  • R² = 1: Perfect fit—all points lie exactly on the regression line.
R² Limitations: R² never decreases when predictors are added, even useless ones. For multiple regression, use adjusted R² instead. Also, a high R² doesn't guarantee the model is correct: you could have a high R² with a misspecified model.
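
The sketch below computes R² directly from its definition and illustrates the first limitation: adding a predictor that is pure noise does not lower R². The simulated data and seed are assumptions, and the least-squares fits use `np.linalg.lstsq` rather than the closed-form formulas so that the two-predictor case can be handled the same way.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSR / SST."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Simulated data (parameters and seed are assumptions of this sketch)
rng = np.random.default_rng(seed=3)
n = 60
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 3.0, size=n)
junk = rng.normal(size=n)                 # predictor unrelated to y

# Least-squares fits with and without the junk column
A_small = np.column_stack([np.ones(n), x])
A_big = np.column_stack([np.ones(n), x, junk])
coef_small, *_ = np.linalg.lstsq(A_small, y, rcond=None)
coef_big, *_ = np.linalg.lstsq(A_big, y, rcond=None)

r2_small = r_squared(y, A_small @ coef_small)
r2_big = r_squared(y, A_big @ coef_big)
r2_mean_only = r_squared(y, np.full(n, y.mean()))   # predicting the mean gives 0
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2         # equals r2_small in simple regression

print(r2_small, r2_big, r2_mean_only, r2_from_corr)
```

In simple linear regression with an intercept, R² equals the squared sample correlation between X and Y, which is what the last quantity checks.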

Residual Analysis

After fitting a regression, always examine the residuals (ε̂ᵢ = yᵢ - ŷᵢ). Residual plots reveal whether the model assumptions are satisfied and can diagnose problems that summary statistics miss.

| Pattern | Indicates | Remedy |
|---|---|---|
| Random scatter around zero | Model is appropriate ✓ | None needed |
| Funnel shape (widening) | Heteroscedasticity | Transform Y, use weighted LS |
| Curved pattern | Non-linear relationship | Add polynomial terms or transform X |
| Runs of positive/negative | Autocorrelation (time series) | Add lagged terms, use GLS |
| Few extreme outliers | Influential points | Investigate outliers, use robust regression |

Interactive: Residual Patterns

Explore different residual patterns and learn to diagnose regression problems visually. Understanding these patterns is essential for building reliable models.

[Interactive figure: Residual Analysis Patterns. Panel 1: residuals (y - ŷ) vs. fitted values (ŷ). Panel 2: histogram of residuals with a normal curve overlaid (blue dashed line, the shape expected under the OLS assumptions). Sample pattern: well-behaved residuals, randomly scattered around zero with constant variance; no violations detected, so the linear model is appropriate. Sample readout: mean residual = 0.0000 (should be ≈ 0), standard deviation σ̂ = 1.6064, skewness = 0.1171 (should be ≈ 0), Durbin-Watson = 1.7650 (≈ 2 indicates no autocorrelation).]
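
Summary statistics like those reported for the residual plots (mean, standard deviation, skewness, Durbin-Watson) can be computed in a few lines. This is a sketch on simulated data with independent errors; the helper name `residual_summary`, the data, and the choice of `ddof=2` for the standard deviation (two estimated parameters) are assumptions, and the interactive tool may define its statistics slightly differently.

```python
import numpy as np

def residual_summary(x, y):
    """Fit OLS and return simple residual diagnostics (hypothetical helper)."""
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    e = y - (intercept + slope * x)

    centered = e - e.mean()
    skewness = np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5
    # Durbin-Watson: squared successive differences over squared residuals.
    # It depends on the order of the observations (e.g. time order).
    durbin_watson = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
    return {
        "mean": e.mean(),
        "std": e.std(ddof=2),
        "skewness": skewness,
        "durbin_watson": durbin_watson,
    }

# Simulated data with independent, constant-variance errors (assumed values)
rng = np.random.default_rng(seed=4)
x = np.linspace(0.0, 10.0, 200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, size=200)

summary = residual_summary(x, y)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The Durbin-Watson statistic always lies between 0 and 4; values well below 2 suggest positive autocorrelation and values well above 2 suggest negative autocorrelation.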

Connection to Deep Learning

Simple linear regression might seem basic, but it's the foundation upon which all of deep learning is built. Understanding these connections illuminates why neural networks work.

🧠 Single Neuron = Linear Regression

A single neuron with linear activation and MSE loss is exactly linear regression. The weight is β₁, the bias is β₀, and training with gradient descent converges to the OLS solution.

📉 Convexity Guarantees Convergence

The MSE loss surface for linear models is convex (bowl-shaped), meaning gradient descent always finds the global minimum. Deep learning loses this guarantee when we add nonlinearities.

🔗 Regularization = Modified OLS

Ridge regression adds an L2 penalty; Lasso adds an L1 penalty. These correspond to weight decay and L1 regularization in neural networks (for plain gradient descent, L2 regularization and weight decay coincide), which discourage overfitting by penalizing large weights.

📐 Normal Equations vs Gradient Descent

OLS has a closed-form solution, but for large-scale problems or online learning, gradient descent (the neural network approach) is more practical. The math is the same—just different solution methods.
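
To illustrate the regularization point in the one-predictor case: if the L2 penalty is applied to the slope only (an assumption of this sketch; the intercept is left unpenalized), minimizing SSE + λβ₁² gives the slope Sxy / (Sxx + λ), which shrinks toward zero as λ grows. The function name `ridge_simple`, the data, and the λ values are made up for illustration.

```python
import numpy as np

def ridge_simple(x, y, lam):
    """Minimize SSE + lam * beta_1**2 for one predictor (intercept unpenalized)."""
    s_xy = np.sum((x - x.mean()) * (y - y.mean()))
    s_xx = np.sum((x - x.mean()) ** 2)
    slope = s_xy / (s_xx + lam)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Made-up data (an assumption of this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

slopes = {lam: ridge_simple(x, y, lam)[1] for lam in (0.0, 10.0, 100.0)}
print(slopes)   # lam = 0 recovers the OLS slope 0.6; lam = 10 gives 0.3
```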

The Relationship: Linear Regression ↔ Neural Network

Linear Regression

ŷ = β₀ + β₁x

Single Neuron (PyTorch)

ŷ = nn.Linear(1, 1)(x)

Python Implementation

Let's implement simple linear regression from scratch to solidify our understanding. This implementation follows the scikit-learn API pattern.

Simple Linear Regression from Scratch
🐍simple_linear_regression.py
Import NumPy

NumPy provides vectorized operations essential for efficient linear algebra computations in regression.

Class Definition

Encapsulating regression as a class allows us to store fitted parameters and apply them to new data, following the scikit-learn API pattern.

Constructor

Initialize placeholders for the coefficients (intercept β₀ and slope β₁) that will be learned from data.

Fit Method

The fit method learns the optimal coefficients from training data using the OLS closed-form solution. This is where the 'learning' happens.

Sample Size

Store the number of samples n for reference. The closed-form formulas below do not need it explicitly, because the factor of n cancels between the numerator and the denominator.

Compute Means

Calculate x̄ and ȳ, the means of X and Y. The regression line always passes through the point (x̄, ȳ).

Example: If X = [1,2,3,4,5], then x̄ = 3

Numerator: Cross-Product Sum

Compute Σ(xᵢ - x̄)(yᵢ - ȳ), which measures how X and Y vary together. This is n times the (population) covariance of X and Y.

Denominator: Sum of Squared Deviations

Compute Σ(xᵢ - x̄)², which measures the spread of X around its mean. This is n times the (population) variance of X.

OLS Slope Formula

β̂₁ = Cov(X,Y)/Var(X). The slope equals the covariance divided by the variance: how much Y changes per unit of X.

Example: If Cov(X,Y)=10 and Var(X)=5, then β̂₁=2

OLS Intercept Formula

β̂₀ = ȳ - β̂₁x̄. The intercept ensures the regression line passes through (x̄, ȳ). It is the predicted Y when X = 0; the predicted Y when X equals its mean is ȳ.

Predict Method

Apply the learned linear model to new data. For any input X, predict ŷ = β̂₀ + β̂₁X.

Vectorized Prediction

NumPy broadcasting allows us to predict for single values or entire arrays efficiently. This is crucial for large-scale ML applications.

R² Score Method

Compute the coefficient of determination, which measures how well our model explains the variance in Y.

Predictions

First, generate predictions for all supplied points using our fitted model.

Residual Sum of Squares (SSR)

SSR = Σ(yᵢ - ŷᵢ)² measures total squared error. This is what OLS minimizes. Lower is better.

Total Sum of Squares (SST)

SST = Σ(yᵢ - ȳ)² measures the total variation in Y. This is the baseline: how much error we'd have if we just predicted the mean.

R² Formula

R² = 1 - SSR/SST. The proportion of variance explained. R²=1 means perfect fit; R²=0 means no better than predicting the mean.

Example: R²=0.85 means 85% of Y's variance is explained by X
```python
import numpy as np


class SimpleLinearRegression:
    """Simple linear regression fitted with the OLS closed-form solution."""

    def __init__(self):
        self.intercept_ = None  # β̂₀
        self.coef_ = None       # β̂₁
        self.n_samples_ = None

    def fit(self, X, y):
        """Fit the model using the OLS closed-form solution."""
        # Accept lists as well as arrays, and flatten to 1-D
        X = np.asarray(X, dtype=float).ravel()
        y = np.asarray(y, dtype=float).ravel()
        self.n_samples_ = len(X)

        # Step 1: Compute means
        x_mean = np.mean(X)
        y_mean = np.mean(y)

        # Step 2: Compute numerator Σ(xᵢ - x̄)(yᵢ - ȳ)
        numerator = np.sum((X - x_mean) * (y - y_mean))

        # Step 3: Compute denominator Σ(xᵢ - x̄)²
        denominator = np.sum((X - x_mean) ** 2)
        if denominator == 0:
            raise ValueError("All X values are identical; the slope is undefined.")

        # Step 4: Calculate slope β̂₁ = Cov(X,Y) / Var(X)
        self.coef_ = numerator / denominator

        # Step 5: Calculate intercept β̂₀ = ȳ - β̂₁x̄
        self.intercept_ = y_mean - self.coef_ * x_mean

        return self

    def predict(self, X):
        """Predict using the linear model."""
        return self.intercept_ + self.coef_ * np.asarray(X, dtype=float)

    def score(self, X, y):
        """Return R² coefficient of determination."""
        y = np.asarray(y, dtype=float).ravel()
        y_pred = self.predict(X).ravel()

        # Sum of squared residuals
        ss_res = np.sum((y - y_pred) ** 2)

        # Total sum of squares
        ss_tot = np.sum((y - np.mean(y)) ** 2)

        # R² = 1 - SS_res / SS_tot
        return 1 - (ss_res / ss_tot)
```
In Practice: Use sklearn.linear_model.LinearRegression or statsmodels.api.OLS for production code. Our implementation is for learning; the library versions use numerically stable solvers and validate their inputs, and statsmodels additionally reports standard errors, p-values, and other diagnostics.
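
A dependency-light way to sanity-check a from-scratch fit is to compare it with NumPy's own `np.polyfit` with degree 1. The sketch below does this on a made-up five-point dataset (an assumption for illustration); it is a cross-check, not a substitute for the library estimators mentioned above.

```python
import numpy as np

# Made-up data (an assumption of this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# From-scratch closed form
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit returns coefficients from the highest degree to the lowest
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(slope, intercept)         # 0.6 and 2.2, up to rounding
print(slope_np, intercept_np)   # should agree up to rounding
```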

Knowledge Check

Test your understanding of simple linear regression with these questions:

In simple linear regression Y = β₀ + β₁X + ε, what does β₁ represent?


Summary

Key Takeaways

  • Simple linear regression models Y = β₀ + β₁X + ε, finding the best straight line through the data.
  • OLS minimizes the Sum of Squared Errors, giving closed-form solutions for β̂₀ and β̂₁.
  • The slope β̂₁ = Cov(X,Y)/Var(X) measures how Y changes per unit X.
  • R² measures the proportion of variance explained, ranging from 0 to 1.
  • Residual plots diagnose violations of linearity, homoscedasticity, and independence.
  • The regression line always passes through (x̄, ȳ), the center of the data.
  • The SSE surface is convex, guaranteeing gradient descent finds the global minimum.
  • Linear regression is a single-layer neural network with linear activation and MSE loss.

Looking Ahead

In the next section, we'll extend to Multiple Linear Regression, the case with multiple predictors. This requires matrix notation and introduces important concepts like multicollinearity, but the core OLS framework remains the same. The jump from simple to multiple regression is the same conceptual leap as going from a neuron with one input to a neuron with several inputs.
