Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Understand the four key assumptions of linear regression (LINE)
• Interpret all four standard diagnostic plots
• Distinguish between leverage and influence
• Calculate and interpret Cook's distance

🔧 Practical Skills

• Detect violations of regression assumptions from plots
• Perform formal statistical tests for each assumption
• Identify influential observations that need attention
• Implement diagnostic checks in Python

🧠 Deep Learning Connections

• Residual Analysis in Neural Networks — Understanding prediction errors informs architecture choices and loss function design
• Outlier Detection — Identifying influential training examples helps with data cleaning and curriculum learning
• Heteroscedastic Regression — Neural networks can learn to predict varying uncertainty, generalizing homoscedasticity
• Model Uncertainty — Diagnostics for neural networks borrow from classical regression analysis

Where You'll Apply This: Model validation in any regression task, quality assurance in machine learning pipelines, debugging poor model performance, identifying data quality issues, and building trustworthy predictive systems.

The Big Picture

Fitting a regression model is only half the job. The other half—arguably more important—is checking whether the model is valid. Regression diagnostics answer critical questions: Are the assumptions satisfied? Are there observations that unduly influence our results? Can we trust our predictions and inferences?

The Core Insight

Linear regression makes assumptions about the data-generating process. When these assumptions are violated, our coefficient estimates may be biased, our standard errors may be wrong, and our predictions may be unreliable. Diagnostics help us detect these problems before they cause real-world harm.

The LINE mnemonic: Linearity, Independence, Normality, Equal variance (homoscedasticity)

Why Diagnostics Matter

Consider what can go wrong when assumptions are violated:

Violation	Consequence	Risk
Nonlinearity	Biased predictions, especially at extremes	Model systematically misses the true relationship
Non-independence	Underestimated standard errors	Overconfident predictions, spurious significance
Non-normality	Invalid p-values and confidence intervals	Wrong conclusions about statistical significance
Heteroscedasticity	Inefficient estimates, wrong standard errors	Confidence intervals with wrong coverage
Influential outliers	Results driven by few observations	Conclusions change dramatically with small data changes

Historical Context

📜

Francis Anscombe (1973)

Anscombe's Quartet dramatically showed why visualization is essential. Four datasets with nearly identical summary statistics (mean, variance, correlation, regression line) look completely different when plotted: one has an outlier, one is nonlinear, and so on. This helped make graphical diagnostics standard practice.

🔬

R. Dennis Cook (1977)

Cook introduced his famous distance measure to identify influential observations. By combining leverage and residual information, Cook's distance became the standard way to detect points that unduly affect regression results—still widely used today.

The Regression Assumptions

The classical linear regression model assumes that the true data-generating process is:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \epsilon_i$$

where the errors $\epsilon_i$ satisfy the four conditions described next.

Linearity

The relationship between predictors and response is linear in parameters. This doesn't mean the relationship must look like a straight line—we can include polynomial terms, interactions, or transformations. What it means is that $E[Y|X]$ is a linear combination of the parameters.
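To make "linear in parameters" concrete, here is a minimal sketch using made-up, noise-free data (the coefficients 1, 2, and -0.5 are chosen only for illustration). The relationship is curved in $x$, yet ordinary least squares recovers it once $x^2$ is added as a column of the design matrix.

```python
import numpy as np

# Illustrative noise-free data from a curved relationship: y = 1 + 2x - 0.5x^2
x = np.linspace(-3, 3, 25)
y = 1.0 + 2.0 * x - 0.5 * x**2

# Design matrix with columns [1, x, x^2]. The model is nonlinear in x
# but linear in the coefficients, so least squares still applies.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)  # estimated (intercept, linear, quadratic) coefficients
```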

Detection: Look at the Residuals vs Fitted plot. A systematic pattern (curves, U-shapes) indicates the linear model is missing structure.
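The visual check can be approximated numerically. The sketch below simulates curved data, fits both a straight line and a quadratic, and summarizes the U-shape in the residuals with a simple correlation. The helper `curvature_score` is a hypothetical name introduced here for illustration, not a standard diagnostic, and the simulated coefficients are assumptions.

```python
import numpy as np

def fit_ols(X, y):
    """Return fitted values and residuals from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted

def curvature_score(residuals, fitted):
    """Correlation between residuals and squared, centered fitted values.

    A value far from zero means the residuals bend (U or inverted U)
    across the range of fitted values.
    """
    centered_sq = (fitted - fitted.mean()) ** 2
    return np.corrcoef(residuals, centered_sq)[0, 1]

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 100)
y = 1 + 2 * x + 0.5 * x**2 + rng.normal(0, 0.2, size=x.size)  # truly quadratic

ones = np.ones_like(x)
fitted_line, resid_line = fit_ols(np.column_stack([ones, x]), y)
fitted_quad, resid_quad = fit_ols(np.column_stack([ones, x, x**2]), y)

score_line = curvature_score(resid_line, fitted_line)  # misspecified model
score_quad = curvature_score(resid_quad, fitted_quad)  # correctly specified model
print(f"straight line: {score_line:.2f}, quadratic: {score_quad:.2f}")
```

The straight-line fit leaves a strong U-shaped pattern, while the quadratic fit leaves residuals with no comparable structure.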

Independence

The errors $\epsilon_i$ are independent of each other. This assumption is often violated with time series data (autocorrelation) or spatial data (spatial correlation). When errors are positively correlated, which is the typical case, OLS standard errors underestimate the true uncertainty.

Detection: Plot residuals in order (time or index). Look for runs of positive or negative residuals. The Durbin-Watson test formally tests for first-order autocorrelation.
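As a rough illustration of the Durbin-Watson statistic, the sketch below compares residuals from a model with independent errors against one with AR(1) errors. The autocorrelation of 0.8 and the regression coefficients are assumptions made for the demo.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    diff = np.diff(residuals)
    return np.sum(diff**2) / np.sum(residuals**2)

def ols_residuals(x, y):
    """Residuals from a simple linear regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(0)
n = 500
x = np.linspace(0, 10, n)

# Independent errors
iid_errors = rng.normal(size=n)

# AR(1) errors: each error carries over 80% of the previous one
ar1_errors = np.empty(n)
ar1_errors[0] = rng.normal()
for t in range(1, n):
    ar1_errors[t] = 0.8 * ar1_errors[t - 1] + rng.normal()

dw_iid = durbin_watson(ols_residuals(x, 1 + 2 * x + iid_errors))
dw_ar1 = durbin_watson(ols_residuals(x, 1 + 2 * x + ar1_errors))
print(f"independent errors: DW = {dw_iid:.2f}, AR(1) errors: DW = {dw_ar1:.2f}")
```

For first-order autocorrelation $\rho$, the statistic is approximately $2(1 - \rho)$, so values well below 2 point to positive autocorrelation.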

Normality

The errors follow a normal distribution: $\epsilon_i \sim N(0, \sigma^2)$. This assumption is needed for exact hypothesis tests and confidence intervals in small samples. However, it's the least important assumption: OLS estimates remain unbiased and consistent without normality, and in large samples inference is approximately valid by the central limit theorem.

Detection: The Normal Q-Q plot shows residuals against theoretical normal quantiles. Points should follow the diagonal line. Deviations at the tails suggest heavy or light tails.
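The coordinates behind a Q-Q plot can be computed without drawing anything. This sketch uses `scipy.stats.probplot` on two simulated samples (standing in for residuals): one normal and one heavy-tailed. The sample sizes and the choice of a t distribution with 2 degrees of freedom are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_resid = rng.normal(size=300)            # well-behaved "residuals"
heavy_resid = rng.standard_t(df=2, size=300)   # heavy-tailed "residuals"

# probplot returns the Q-Q coordinates plus a least-squares line through them;
# r is the correlation between ordered data and theoretical quantiles.
(theo_q, ordered), (slope, intercept, r_normal) = stats.probplot(normal_resid, dist="norm")
_, (_, _, r_heavy) = stats.probplot(heavy_resid, dist="norm")

print(f"Q-Q correlation, normal sample: {r_normal:.3f}")
print(f"Q-Q correlation, heavy-tailed sample: {r_heavy:.3f}")
```

A correlation close to 1 means the points hug the reference line; heavy tails pull the extreme points away from it and lower the correlation.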

Equal Variance (Homoscedasticity)

The error variance is constant: $\text{Var}(\epsilon_i) = \sigma^2$ for all observations. Heteroscedasticity (non-constant variance) makes OLS estimates inefficient and standard errors incorrect.

Detection: The Scale-Location plot shows $\sqrt{|\text{standardized residuals}|}$ vs fitted values. A funnel shape indicates variance changes with predicted values.
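A minimal sketch of the Scale-Location quantities follows, assuming a simple linear regression and simulated data whose noise level either stays constant or grows with $x$. The helper name `scale_location` is hypothetical, and the correlation used as a summary is only a crude stand-in for looking at the plot.

```python
import numpy as np

def scale_location(x, y):
    """Return fitted values and sqrt(|standardized residuals|) for y ~ 1 + x."""
    X = np.column_stack([np.ones_like(x), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    fitted = X @ beta
    resid = y - fitted
    # Diagonal of the hat matrix without forming the full n x n matrix
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    s = np.sqrt(resid @ resid / (len(y) - X.shape[1]))
    std_resid = resid / (s * np.sqrt(1 - h))
    return fitted, np.sqrt(np.abs(std_resid))

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 300)
y_const = 1 + 2 * x + rng.normal(0, 1.0, 300)   # constant error variance
y_funnel = 1 + 2 * x + rng.normal(0, 0.5 * x)   # error spread grows with x

trend_const = np.corrcoef(*scale_location(x, y_const))[0, 1]
trend_funnel = np.corrcoef(*scale_location(x, y_funnel))[0, 1]
print(f"constant variance: {trend_const:.2f}, funnel: {trend_funnel:.2f}")
```

An upward trend in $\sqrt{|r_i|}$ against fitted values is the numerical counterpart of the funnel shape.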

Residual Analysis

Residuals are the foundation of regression diagnostics. The observed residual for observation $i$ is:

$$e_i = y_i - \hat{y}_i = y_i - \mathbf{x}_i^T \hat{\boldsymbol{\beta}}$$

Residuals are estimates of the true errors $\epsilon_i$ . If the model is correct, residuals should behave like independent, homoscedastic, normal random variables with mean zero. Any systematic pattern suggests a problem.

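Standardized vs. Raw Residuals: Raw residuals have unequal variances, $\text{Var}(e_i) = \sigma^2(1 - h_{ii})$, where $h_{ii}$ is the leverage of observation $i$ (the $i$-th diagonal element of the hat matrix). Points with high leverage therefore have smaller residual variance. Standardized residuals

$$r_i = \frac{e_i}{s\sqrt{1-h_{ii}}}, \quad s = \sqrt{\text{MSE}}$$

correct for this, making them comparable across observations. Prefer standardized residuals for diagnostic plots.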
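The variance claim can be checked by simulation. The sketch below assumes a fixed design with one deliberately extreme $x$ value and an error standard deviation of 2; both are illustrative choices. It uses the fact that the residual vector equals $(\mathbf{I} - \mathbf{H})\boldsymbol{\epsilon}$, so the true coefficients do not matter.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = np.append(np.linspace(0, 10, 19), 30.0)   # the last point has high leverage
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
h = np.diag(H)                                # leverages

# Each row of eps is one simulated data set; residuals are eps @ (I - H)
eps = rng.normal(0, sigma, size=(20000, n))
raw = eps @ (np.eye(n) - H)

empirical_var = raw.var(axis=0)               # variance of each raw residual
theoretical_var = sigma**2 * (1 - h)          # sigma^2 (1 - h_ii)

# Standardize within each simulated data set using its own s = sqrt(MSE)
s = np.sqrt(np.sum(raw**2, axis=1, keepdims=True) / (n - p))
standardized = raw / (s * np.sqrt(1 - h))
standardized_var = standardized.var(axis=0)

print("leverage of last point:", round(h[-1], 3))
print("raw residual variance, last point:", round(empirical_var[-1], 3))
print("standardized residual variance, last point:", round(standardized_var[-1], 3))
```

The raw residual at the high-leverage point varies much less than the others, while the standardized residuals have roughly unit variance everywhere.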

Interactive: The Four Diagnostic Plots

The classic "four-plot" diagnostic display shows everything you need. Adjust the parameters below to simulate different types of assumption violations, then observe how they manifest in each plot.

[Interactive figure: Regression Diagnostic Plots. Controls: sample size (default 50), heteroscedasticity, nonlinearity, and outlier strength (each default 0.0). Legend: normal point; outlier (|standardized residual| > 2); high leverage point.]

Interactive: Assumption Checker

This interactive tool lets you simulate specific types of assumption violations and see how they affect both the diagnostic plots and formal statistical tests.

[Interactive figure: Regression Assumption Checker. For the selected violation it reports a pass/fail status for each assumption: linearity (residuals should show no systematic pattern against fitted values), homoscedasticity (Breusch-Pagan slope), normality (Shapiro-Wilk statistic), and independence (Durbin-Watson statistic, ideal ≈ 2). With no violation selected, the errors are i.i.d. normal with constant variance and all four checks pass.]

Leverage and Influence

Not all observations are created equal. Some observations have more potential to affect the regression line than others. We need to distinguish between:

Leverage (Potential Influence)

How far are the observation's predictor values from the center of the data? High leverage means unusual X values, giving the point the potential to "pull" the regression line.

Influence (Actual Effect)

How much does removing the observation change the regression? Influential points actually move the line substantially. They typically combine noticeable leverage with a response that doesn't follow the pattern.

What is Leverage?

The hat matrix $\mathbf{H}$ maps observed values to fitted values:

$$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}, \quad \text{where } \mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$

The diagonal element $h_{ii}$ is the leverage of observation $i$ . For simple linear regression:

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}$$

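Notice that leverage depends only on X values, not Y. Points far from $\bar{x}$ have high leverage. The leverages sum to $p$, where $p$ here counts all estimated coefficients including the intercept (the number of columns of $\mathbf{X}$), so the average leverage is $p/n$. A common rule of thumb flags points with $h_{ii} > 2p/n$ as high leverage.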
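A small worked example, using made-up $x$ values with one extreme point, shows that the hat-matrix diagonal agrees with the simple-regression formula, that the leverages sum to $p$, and how the $2p/n$ rule picks out the unusual observation.

```python
import numpy as np

# Illustrative predictor values; the last one is far from the rest
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
n, p = X.shape                              # p counts the intercept too

# Leverage from the hat matrix diagonal
h_matrix = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Leverage from the simple linear regression formula
h_formula = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

threshold = 2 * p / n
high_leverage = np.where(h_matrix > threshold)[0]

print("leverages:", np.round(h_matrix, 3))
print("sum of leverages:", round(h_matrix.sum(), 3))
print("flagged indices:", high_leverage)
```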

Interactive: Leverage and Influence Explorer

Drag the special point around to see how leverage and influence change. A point can have high leverage without being influential (if it follows the pattern), or be influential through a combination of leverage and residual.

[Interactive figure: Leverage and Influence Explorer. Example readout with the special point at X = 5.0, Y = 7.5: leverage h = 0.1842 (threshold 2p/n = 0.1905) and Cook's D = 0.4382 (threshold 4/n = 0.1905). The fitted line is y = 1.93 + 1.56x without the special point and y = 2.27 + 1.41x with it, a slope change of -9.6%. The point is classified as an outlier: high Cook's D but low leverage, because it doesn't follow the pattern while having a moderate X value.]

Leverage (h)

Measures how far a point's X value is from the mean. High leverage points have unusual predictor values. They could pull the regression line but may not if they follow the overall pattern. Range: [1/n, 1].

Influence (Cook's D)

Measures how much the regression coefficients change when a point is removed. Combines both leverage and residual size. High influence points actually change the regression line substantially.
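The difference between potential and actual effect can be seen by adding one extreme-$x$ point to a clean data set, once on the underlying line and once far below it. The data, the position $x = 20$, and the helper `slope` are assumptions made for this sketch.

```python
import numpy as np

def slope(x, y):
    """Slope of the least-squares line of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2 + 1.5 * x + rng.normal(0, 0.5, x.size)

slope_base = slope(x, y)

# Same extreme x value in both cases, so the leverage is identical
x_new = np.append(x, 20.0)
slope_on = slope(x_new, np.append(y, 2 + 1.5 * 20.0))   # follows the pattern
slope_off = slope(x_new, np.append(y, 10.0))            # far below the pattern

print(f"baseline slope: {slope_base:.3f}")
print(f"with on-pattern point: {slope_on:.3f}")
print(f"with off-pattern point: {slope_off:.3f}")
```

Both added points have the same leverage, but only the one that departs from the pattern moves the fitted slope noticeably.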

Cook's Distance

Cook's Distance is the most important single measure of influence. It quantifies how much all fitted values change when observation $i$ is deleted:

$$D_i = \frac{\sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2}{p \cdot \text{MSE}} = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

where $\hat{y}_{j(i)}$ is the fitted value for observation $j$ when observation $i$ is removed.

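The second form shows that Cook's distance is a product of two factors: the squared standardized residual $r_i$ and a leverage term $h_{ii}/(1 - h_{ii})$. A point is most influential when both are large. A point with high leverage but a near-zero residual has a small $D_i$, and a point with low leverage needs a very large residual to reach a large $D_i$.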
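The equality of the two forms can be verified numerically. The sketch below, on simulated data with assumed coefficients, computes Cook's distance once from the closed form and once by literally refitting the model with each observation left out.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficient estimates."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0, 10, n)
y = 2 + 1.5 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

# Full-data fit and the ingredients of the closed form
fitted = X @ fit(X, y)
resid = y - fitted
mse = resid @ resid / (n - p)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
r = resid / np.sqrt(mse * (1 - h))            # standardized residuals
d_closed = (r**2 / p) * (h / (1 - h))

# Deletion form: refit without observation i, compare all n fitted values
d_deletion = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    fitted_without_i = X @ fit(X[keep], y[keep])
    d_deletion[i] = np.sum((fitted - fitted_without_i) ** 2) / (p * mse)

print("max absolute difference:", np.max(np.abs(d_closed - d_deletion)))
```

The closed form avoids $n$ separate refits, which is why it is the version used in practice.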

Threshold	Rule	Interpretation
D > 4/n	Common rule of thumb	Point worth investigating
D > 1	More conservative	Point likely problematic
D > F(0.5, p, n-p)	50th percentile of F	Formal statistical threshold
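For a given sample size, the three thresholds in the table can be computed as below. The function name `cooks_thresholds` is hypothetical, and $n = 25$, $p = 2$ are example values; the F-based cutoff uses `scipy.stats.f.ppf` with $p$ and $n - p$ degrees of freedom.

```python
from scipy import stats

def cooks_thresholds(n, p):
    """Common cutoffs for Cook's distance with n observations and p coefficients."""
    return {
        "4/n": 4 / n,                              # rule of thumb
        "1": 1.0,                                  # more conservative
        "F median": stats.f.ppf(0.5, p, n - p),    # 50th percentile of F(p, n - p)
    }

thresholds = cooks_thresholds(n=25, p=2)
for name, value in thresholds.items():
    print(f"{name}: {value:.3f}")
```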

Interactive: Cook's Distance Demo

Click on points in either the scatter plot or bar chart to exclude them and see how the regression line changes. Observations with high Cook's distance will cause larger changes.

[Interactive figure: Cook's Distance Explorer. A scatter plot and a bar chart of Cook's distance by observation for a sample of 25 points, with the original and current fitted lines overlaid. With all points included the fit is y = 2.389 + 1.158x. Points with Cook's D > 4/n = 0.160 are flagged as influential: #2 (D = 0.185), #15 (D = 0.189), and #25 (D = 0.498).]

Applications in Machine Learning

While modern ML often uses more complex models than linear regression, diagnostic thinking remains essential. Here's how these concepts transfer:

🔍 Residual Analysis for Neural Networks

Plotting predicted vs actual values reveals systematic biases. Patterns in residuals suggest the network is missing important structure—perhaps you need more layers, different architecture, or additional features.

🎯 Influential Training Examples

Methods like influence functions (Koh & Liang, 2017) extend Cook's distance ideas to deep learning. They identify which training examples most affect a particular prediction—useful for debugging, data cleaning, and understanding model behavior.

📊 Heteroscedastic Neural Networks

Instead of assuming constant variance, modern networks can output both $\mu(x)$ and $\sigma^2(x)$. This is heteroscedastic regression: the network learns where uncertainty is high vs low, which can yield input-dependent uncertainty estimates.
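Such networks are commonly trained with a Gaussian negative log-likelihood. The sketch below shows only that loss in NumPy, with no network involved: the "predictions" are the true simulated mean plus either the true input-dependent variance or a single constant variance. The function name `gaussian_nll` and the simulated noise model are assumptions for illustration.

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Mean Gaussian negative log-likelihood given predicted mean and log-variance."""
    return np.mean(
        0.5 * (log_var + (y - mu) ** 2 / np.exp(log_var)) + 0.5 * np.log(2 * np.pi)
    )

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 5, 2000)
sigma = 0.3 * x                        # noise level grows with x
y = 2 * x + rng.normal(0, sigma)
mu = 2 * x                             # stand-in for a perfect mean prediction

# Input-dependent variance (what a heteroscedastic model aims to learn)
nll_hetero = gaussian_nll(y, mu, np.log(sigma**2))

# Best single constant variance (the homoscedastic assumption)
const_var = np.mean((y - mu) ** 2)
nll_homo = gaussian_nll(y, mu, np.full_like(x, np.log(const_var)))

print(f"NLL with varying variance: {nll_hetero:.3f}")
print(f"NLL with constant variance: {nll_homo:.3f}")
```

Predicting the log-variance rather than the variance itself is a common design choice because it keeps the variance positive without any constraint.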

⚠️ Outlier Detection in Feature Space

High leverage points have unusual feature combinations. In ML, these might be out-of-distribution examples where the model should express uncertainty rather than make confident predictions. Detecting them is crucial for safe deployment.

Lesson for Deep Learning: Even with powerful nonlinear models, always examine residuals. A pattern suggests the model is systematically wrong in ways that more capacity alone won't fix—you may need different features, architecture, or data.

Python Implementation

Here's a complete implementation of regression diagnostics from scratch. This class computes all the key diagnostic measures and formal statistical tests.

Comprehensive Regression Diagnostics

regression_diagnostics.py

• Import Libraries: NumPy for numerical operations, scipy.stats for statistical tests, and matplotlib for visualization.
• Class Definition: We create a class to perform comprehensive regression diagnostics, storing fitted model information and computing various diagnostic statistics.
• Constructor: Takes the response variable y and design matrix X (including a column of ones for the intercept). Fits the model and calculates all diagnostic measures.
• Fit OLS Model: Computes β = (X'X)^(-1)X'y using the normal equations. This gives us the least squares estimates of the regression coefficients.
• Calculate Residuals: Residuals e = y - ŷ are the differences between observed and predicted values. They are the foundation of all diagnostic tests.
• Hat Matrix Diagonal: The diagonal of the hat matrix H = X(X'X)^(-1)X' gives the leverage of each observation. Leverage measures how far a point's X value is from the mean. Example (simple linear regression): h_ii = 1/n + (x_i - x̄)² / Σ(x_j - x̄)²
• Mean Squared Error: MSE = SSE/(n-p) is the unbiased estimate of error variance σ². We need n-p degrees of freedom to account for estimating p parameters.
• Standardized Residuals: Divide each residual by its estimated standard deviation: r_i = e_i / (s√(1-h_ii)). This accounts for the fact that residuals have different variances.
• Cook's Distance: Measures how much all fitted values change when observation i is removed. Combines both leverage and residual size into a single influence measure. Example: D_i = (r_i²/p) × (h_ii/(1-h_ii))
• Durbin-Watson Test: Tests for first-order autocorrelation in residuals. DW ≈ 2 means no autocorrelation, DW < 2 means positive autocorrelation, DW > 2 means negative.
• Breusch-Pagan Test: Tests for heteroscedasticity by regressing squared residuals on predictors. A significant relationship indicates non-constant variance.
• Shapiro-Wilk Test: Tests if residuals come from a normal distribution. Low p-value suggests non-normality, which affects inference but not coefficient estimates.
• Influential Points: Identifies observations with Cook's D > 4/n (common threshold). These points substantially affect the regression coefficients.
• High Leverage Points: Identifies observations with leverage > 2p/n (twice the average leverage). These have unusual predictor values.
• Generate Summary: Creates a comprehensive diagnostic report with test statistics, p-values, and flagged observations for easy interpretation.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt  # imported for plotting; not used by the class itself

# Comprehensive Regression Diagnostics Class
class RegressionDiagnostics:

    def __init__(self, y, X):
        # X is the design matrix and must already include a column of ones
        self.y = np.asarray(y, dtype=float)
        self.X = np.asarray(X, dtype=float)
        self.n, self.p = self.X.shape  # p counts the intercept column

        # Fit OLS model via the normal equations
        self.XtX_inv = np.linalg.inv(self.X.T @ self.X)
        self.beta = self.XtX_inv @ self.X.T @ self.y
        self.fitted = self.X @ self.beta

        # Calculate residuals
        self.residuals = self.y - self.fitted

        # Hat matrix diagonal (leverage)
        H = self.X @ self.XtX_inv @ self.X.T
        self.leverage = np.diag(H)

        # MSE and standard error
        self.sse = np.sum(self.residuals**2)
        self.mse = self.sse / (self.n - self.p)
        self.se = np.sqrt(self.mse)

        # Standardized residuals
        self.std_residuals = self.residuals / (
            self.se * np.sqrt(1 - self.leverage)
        )

        # Cook's distance
        self.cooks_d = (self.std_residuals**2 / self.p) * (
            self.leverage / (1 - self.leverage)
        )

    def durbin_watson(self):
        """Test for autocorrelation in residuals."""
        diff = np.diff(self.residuals)
        dw = np.sum(diff**2) / np.sum(self.residuals**2)
        return dw

    def breusch_pagan_test(self):
        """Test for heteroscedasticity."""
        # Regress squared residuals on X
        sq_resid = self.residuals**2
        gamma = self.XtX_inv @ self.X.T @ sq_resid
        fitted_sq = self.X @ gamma
        ss_reg = np.sum((fitted_sq - sq_resid.mean())**2)
        ss_tot = np.sum((sq_resid - sq_resid.mean())**2)
        r2 = ss_reg / ss_tot
        # LM statistic n * R^2, chi-square with (p - 1) degrees of freedom
        bp_stat = self.n * r2
        p_value = stats.chi2.sf(bp_stat, self.p - 1)
        return bp_stat, p_value

    def shapiro_wilk_test(self):
        """Test normality of residuals."""
        stat, p_value = stats.shapiro(self.residuals)
        return stat, p_value

    def get_influential_points(self, threshold=None):
        """Find observations with high Cook's distance."""
        if threshold is None:
            threshold = 4 / self.n
        return np.where(self.cooks_d > threshold)[0]

    def get_high_leverage_points(self, threshold=None):
        """Find high leverage observations."""
        if threshold is None:
            threshold = 2 * self.p / self.n
        return np.where(self.leverage > threshold)[0]

    def summary(self):
        """Generate diagnostic summary."""
        dw = self.durbin_watson()
        bp_stat, bp_pval = self.breusch_pagan_test()
        sw_stat, sw_pval = self.shapiro_wilk_test()

        print("=== Regression Diagnostics ===")
        print(f"Durbin-Watson: {dw:.4f}")
        print(f"Breusch-Pagan: {bp_stat:.4f} (p={bp_pval:.4f})")
        print(f"Shapiro-Wilk:  {sw_stat:.4f} (p={sw_pval:.4f})")
        print(f"Influential:   {self.get_influential_points()}")
        print(f"High Leverage: {self.get_high_leverage_points()}")

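Using statsmodels: In practice, use import statsmodels.api as sm, then results = sm.OLS(y, X).fit(). Calling results.summary() reports the coefficient table along with the Durbin-Watson statistic and normality tests, and results.get_influence() returns an object that provides Cook's distance (cooks_distance), leverage (hat_matrix_diag), and more.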

Knowledge Check

Test your understanding of regression diagnostics with these questions:

Question 1 of 8: In the Residuals vs Fitted plot, a funnel-shaped pattern indicates:

Summary

Key Takeaways

✅ Regression diagnostics check the LINE assumptions: Linearity, Independence, Normality, and Equal variance.

✅ The four diagnostic plots reveal assumption violations visually: patterns indicate problems.

✅ Leverage measures how unusual an observation's X values are; influence measures actual effect on results.

✅ Cook's distance combines leverage and residual to identify truly influential points.

✅ Formal tests (Durbin-Watson, Breusch-Pagan, Shapiro-Wilk) complement visual diagnostics.

✅ Standardized residuals account for varying variances; use them for diagnostics.

✅ High leverage doesn't mean high influence: the point must also deviate from the pattern.

✅ These concepts extend to deep learning: residual analysis, influence functions, heteroscedastic regression.

Looking Ahead

In the next section, we'll explore Regularization: Ridge and Lasso, techniques that address some of the problems diagnostics reveal (like multicollinearity) while also preventing overfitting. You'll see how adding a penalty term transforms the optimization problem and leads to more stable, interpretable models.