Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Understand the four key assumptions of linear regression (LINE)
- Interpret all four standard diagnostic plots
- Distinguish between leverage and influence
- Calculate and interpret Cook's distance
🔧 Practical Skills
- Detect violations of regression assumptions from plots
- Perform formal statistical tests for each assumption
- Identify influential observations that need attention
- Implement diagnostic checks in Python
🧠 Deep Learning Connections
- Residual Analysis in Neural Networks: understanding prediction errors informs architecture choices and loss function design
- Outlier Detection: identifying influential training examples helps with data cleaning and curriculum learning
- Heteroscedastic Regression: neural networks can learn to predict input-dependent uncertainty, relaxing the homoscedasticity assumption
- Model Uncertainty: diagnostics for neural networks borrow from classical regression analysis
Where You'll Apply This: Model validation in any regression task, quality assurance in machine learning pipelines, debugging poor model performance, identifying data quality issues, and building trustworthy predictive systems.
The Big Picture
Fitting a regression model is only half the job. The other half—arguably more important—is checking whether the model is valid. Regression diagnostics answer critical questions: Are the assumptions satisfied? Are there observations that unduly influence our results? Can we trust our predictions and inferences?
The Core Insight
Linear regression makes assumptions about the data-generating process. When these assumptions are violated, our coefficient estimates may be biased, our standard errors may be wrong, and our predictions may be unreliable. Diagnostics help us detect these problems before they cause real-world harm.
The LINE mnemonic: Linearity, Independence, Normality, Equal variance (homoscedasticity).
Why Diagnostics Matter
Consider what can go wrong when assumptions are violated:
| Violation | Consequence | Risk |
|---|---|---|
| Nonlinearity | Biased predictions, especially at extremes | Model systematically misses the true relationship |
| Non-independence | Underestimated standard errors | Overconfident predictions, spurious significance |
| Non-normality | Unreliable p-values and confidence intervals, mainly in small samples | Wrong conclusions about statistical significance |
| Heteroscedasticity | Inefficient estimates, wrong standard errors | Confidence intervals with wrong coverage |
| Influential outliers | Results driven by few observations | Conclusions change dramatically with small data changes |
Historical Context
Francis Anscombe (1973)
Anscombe's Quartet dramatically showed why visualization is essential. Four datasets with nearly identical summary statistics (mean, variance, correlation, regression line) looked completely different when plotted: one had an outlier, one was nonlinear, and so on. This helped make graphical diagnostics a standard practice.
R. Dennis Cook (1977)
Cook introduced his famous distance measure to identify influential observations. By combining leverage and residual information, Cook's distance became the standard way to detect points that unduly affect regression results—still widely used today.
The Regression Assumptions
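The classical linear regression model assumes that the true data-generating process is:
$$y = X\beta + \varepsilon$$
where $X$ is the $n \times p$ design matrix (including the intercept column), $\beta$ is the vector of $p$ parameters, and the errors $\varepsilon_i$ satisfy specific conditions.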
Linearity
The relationship between predictors and response is linear in the parameters. This doesn't mean the relationship must look like a straight line: we can include polynomial terms, interactions, or transformations. What it means is that the conditional mean $E[y \mid X]$ is a linear combination of the parameters.
Detection: Look at the Residuals vs Fitted plot. A systematic pattern (curves, U-shapes) indicates the linear model is missing structure.
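As a rough numerical companion to the Residuals vs Fitted plot, the sketch below fits a straight line to simulated quadratic data and measures how strongly the residuals track the squared, centered fitted values. The simulated data, the helper names `fit_ols` and `curvature_signal`, and the choice of this particular summary are illustrative assumptions, not a standard test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a quadratic signal that a straight-line model misses.
x = np.linspace(-3, 3, 200)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(0, 1, size=x.size)

def fit_ols(X, y):
    """Return OLS coefficients, fitted values, and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return beta, fitted, y - fitted

# Misspecified model: intercept + x only.
X_line = np.column_stack([np.ones_like(x), x])
_, fitted_line, resid_line = fit_ols(X_line, y)

# Better model: intercept + x + x^2 (still linear in the parameters).
X_quad = np.column_stack([np.ones_like(x), x, x**2])
_, fitted_quad, resid_quad = fit_ols(X_quad, y)

def curvature_signal(fitted, resid):
    """Correlation between residuals and squared, centered fitted values.

    Residuals are uncorrelated with the fitted values themselves by
    construction, so a leftover U-shape shows up against their square.
    """
    centered_sq = (fitted - fitted.mean()) ** 2
    return np.corrcoef(centered_sq, resid)[0, 1]

print("line model:", round(curvature_signal(fitted_line, resid_line), 3))
print("quad model:", round(curvature_signal(fitted_quad, resid_quad), 3))
```

A value far from zero for the straight-line fit corresponds to the U-shape you would see in the plot; adding the squared term should bring it close to zero.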
Independence
The errors are independent of each other. This assumption is often violated with time series data (autocorrelation) or spatial data (spatial correlation). When errors are positively correlated, as is typical in these settings, OLS standard errors underestimate the true uncertainty.
Detection: Plot residuals in order (time or index). Look for runs of positive or negative residuals. The Durbin-Watson test formally tests for first-order autocorrelation.
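The Durbin-Watson statistic is simple enough to compute by hand. The sketch below applies it to two simulated error series (independent noise and an AR(1) process); for brevity it works on the simulated errors directly rather than on residuals from a fitted model, and the value `rho = 0.8` is an arbitrary illustrative choice.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals. Values near 2 suggest no
    first-order autocorrelation; values toward 0 suggest positive and
    values toward 4 suggest negative autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
n = 500

# Independent errors.
e_indep = rng.normal(size=n)

# AR(1) errors with positive autocorrelation.
rho = 0.8
e_ar1 = np.zeros(n)
for t in range(1, n):
    e_ar1[t] = rho * e_ar1[t - 1] + rng.normal()

print("independent:", round(durbin_watson(e_indep), 2))
print("AR(1)      :", round(durbin_watson(e_ar1), 2))
```

For large samples the statistic is approximately 2(1 - r), where r is the lag-1 sample autocorrelation, which is why positively correlated errors push it well below 2.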
Normality
The errors follow a normal distribution: $\varepsilon_i \sim N(0, \sigma^2)$. This assumption is needed for exact hypothesis tests and confidence intervals in small samples. However, it's the least important assumption: OLS estimates are still unbiased and consistent even without normality.
Detection: The Normal Q-Q plot shows residuals against theoretical normal quantiles. Points should follow the diagonal line. Deviations at the tails suggest heavy or light tails.
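To make the Q-Q construction concrete, here is a minimal sketch that builds the plot's points by hand and summarizes them with their correlation. The plotting positions `(i - 0.5) / n` are one common convention (an assumption here; other conventions exist), and the function names are illustrative.

```python
import numpy as np
from statistics import NormalDist

def qq_points(resid):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    resid = np.asarray(resid, dtype=float)
    z = np.sort((resid - resid.mean()) / resid.std(ddof=1))
    n = z.size
    # Plotting positions (i - 0.5) / n; assumed convention.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, z

def qq_correlation(resid):
    """Correlation of the Q-Q points; close to 1 when residuals look normal."""
    theo, sample = qq_points(resid)
    return np.corrcoef(theo, sample)[0, 1]

rng = np.random.default_rng(2)
normal_resid = rng.normal(size=300)
heavy_resid = rng.standard_t(df=2, size=300)  # heavy-tailed alternative

print("normal    :", round(qq_correlation(normal_resid), 4))
print("heavy tail:", round(qq_correlation(heavy_resid), 4))
```

Plotting `theo` against `z` gives the Q-Q plot itself; heavy tails bend the ends away from the diagonal and lower the correlation.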
Equal Variance (Homoscedasticity)
The error variance is constant: $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for all observations. Heteroscedasticity (non-constant variance) makes OLS estimates inefficient and standard errors incorrect.
Detection: The Scale-Location plot shows $\sqrt{|r_i|}$, the square root of the absolute standardized residuals, vs fitted values. A funnel shape indicates variance changes with predicted values.
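One formal counterpart to the funnel shape is a Breusch-Pagan-style test. The sketch below uses the studentized (Koenker) variant, which regresses the squared residuals on the predictors and uses n times the R-squared of that auxiliary regression. The simulated data and function names are illustrative assumptions, and the critical value is hard-coded for the single-predictor case.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from an OLS fit of y on X (X includes the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def breusch_pagan_lm(X, y):
    """Koenker's studentized Breusch-Pagan statistic: n * R^2 from
    regressing the squared residuals on the columns of X."""
    e2 = ols_residuals(X, y) ** 2
    aux_resid = ols_residuals(X, e2)
    r2 = 1.0 - np.sum(aux_resid ** 2) / np.sum((e2 - e2.mean()) ** 2)
    return X.shape[0] * r2

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])

y_const = 2 + 0.5 * x + rng.normal(0, 1.0, size=n)       # constant variance
y_funnel = 2 + 0.5 * x + rng.normal(0, 0.3 * x, size=n)  # sd grows with x

CHI2_1DF_5PCT = 3.841  # 95th percentile of chi-square with 1 degree of freedom
for name, y in [("constant", y_const), ("funnel", y_funnel)]:
    lm = breusch_pagan_lm(X, y)
    print(f"{name:8s} LM = {lm:6.2f}  reject at 5%: {lm > CHI2_1DF_5PCT}")
```

Under constant variance the statistic is approximately chi-square with one degree of freedom per non-intercept predictor, so a large value signals that the residual spread depends on the predictors.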
Residual Analysis
Residuals are the foundation of regression diagnostics. The observed residual for observation $i$ is:
$$e_i = y_i - \hat{y}_i$$
Residuals are estimates of the true errors $\varepsilon_i$. If the model is correct, residuals should behave approximately like independent, homoscedastic, normal random variables with mean zero. Any systematic pattern suggests a problem.
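A quick numerical check of what residuals look like when the model is correct, as a minimal sketch on simulated data (the coefficients and noise level are arbitrary choices). With an intercept in the model, OLS residuals average to zero and are orthogonal to every column of the design matrix, and the residual mean square estimates the error variance.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 200, 1.5
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])   # n x p design matrix, here p = 2
y = 3.0 + 2.0 * x + rng.normal(0, sigma, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted                     # observed minus fitted

p = X.shape[1]
s2 = np.sum(resid ** 2) / (n - p)      # residual mean square

print("mean residual      :", resid.mean())
print("X^T e              :", X.T @ resid)
print("s^2 vs true sigma^2:", round(s2, 3), "vs", sigma ** 2)
```

The first two quantities are zero up to floating-point error regardless of whether the model is right, which is why diagnostics look for patterns in the residuals rather than at their mean.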
Interactive: The Four Diagnostic Plots
[Interactive figure: Regression Diagnostic Plots. The classic four-plot diagnostic display, with adjustable parameters for simulating different types of assumption violations and observing how they manifest in each plot.]
Interactive: Assumption Checker
[Interactive figure: Regression Assumption Checker. Simulates specific types of assumption violations and shows how they affect both the diagnostic plots and formal statistical tests.]
- Linearity: residuals should show no systematic pattern when plotted against fitted values.
- Homoscedasticity: residual variance should be constant (reported as a BP slope).
- Normality: residuals should follow a normal distribution (reported as SW).
- Independence: residuals should be independent (reported as DW, ideal ≈ 2).
Default setting (No Violation): all four checks pass, with BP slope 0.048, SW ≈ 0.990, and DW = 1.909. In this ideal case, all assumptions are satisfied and the errors are i.i.d. normal with constant variance.
Leverage and Influence
Not all observations are created equal. Some observations have more potential to affect the regression line than others. We need to distinguish between:
Leverage (Potential Influence)
How far are the observation's predictor values from the center? High leverage means unusual X values, giving the point potential to "pull" the regression line.
Influence (Actual Effect)
How much does removing the observation change the regression? Influential points actually move the line substantially: they typically combine high leverage with a response that doesn't follow the pattern.
What is Leverage?
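The hat matrix $H$ maps observed values to fitted values:
$$\hat{y} = Hy, \qquad H = X(X^\top X)^{-1}X^\top$$
The diagonal element $h_{ii}$ is the leverage of observation $i$. For simple linear regression:
$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$$
Notice that leverage depends only on X values, not Y. Points far from $\bar{x}$ have high leverage. The average leverage is $p/n$, where $p$ is the number of parameters (including the intercept), so a common rule of thumb flags points with $h_{ii} > 2p/n$ as high leverage.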
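The leverage calculation can be sketched in a few lines. The example below computes the diagonal of the hat matrix H = X (X^T X)^{-1} X^T for simulated data with one deliberately unusual x value, and checks it against the closed form for simple linear regression. The data and the 2p/n cutoff (a common rule of thumb) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.append(rng.normal(5, 1, size=29), 15.0)  # last point has an unusual x
n = x.size
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

# Hat matrix; its diagonal holds the leverages.
# (Forming the full n x n matrix is fine for small n only.)
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Closed form for simple linear regression.
dev2 = (x - x.mean()) ** 2
h_simple = 1 / n + dev2 / dev2.sum()

print("matches closed form:", np.allclose(h, h_simple))
print("sum of leverages   :", round(h.sum(), 6), "with p =", p)
print("high-leverage index:", np.where(h > 2 * p / n)[0])
```

Note that no response values appear anywhere in this calculation: leverage is a property of the predictors alone.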
Interactive: Leverage and Influence Explorer
[Interactive figure: Leverage and Influence Explorer. A draggable special point, with panels showing its position, its statistics, the regression coefficients, and a point classification.]
A point can have high leverage without being influential (if it follows the pattern), or be influential through a combination of leverage and residual. For example, the explorer classifies a point with high Cook's D but low leverage as an outlier: it doesn't follow the pattern but has a moderate X value.
Leverage (h)
Measures how far a point's X value is from the mean. High leverage points have unusual predictor values. They could pull the regression line but may not if they follow the overall pattern. Range: [1/n, 1].
Influence (Cook's D)
Measures how much the regression coefficients change when a point is removed. Combines both leverage and residual size. High influence points actually change the regression line substantially.
Cook's Distance
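Cook's Distance is the most widely used single measure of influence. It quantifies how much all fitted values change when observation $i$ is deleted:
$$D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,s^2} = \frac{r_i^2}{p}\cdot\frac{h_{ii}}{1 - h_{ii}}$$
where $\hat{y}_{j(i)}$ is the fitted value for observation $j$ when observation $i$ is removed, $p$ is the number of parameters, $s^2$ is the residual mean square, and $r_i = e_i / (s\sqrt{1 - h_{ii}})$ is the standardized residual.
The second form shows that Cook's distance combines the standardized residual $r_i$ with the leverage $h_{ii}$. This is why the most influential points usually have BOTH a large residual AND high leverage: if either factor is small, $D_i$ stays small unless the other is very large.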
| Threshold | Rule | Interpretation |
|---|---|---|
| D > 4/n | Common rule of thumb | Point worth investigating |
| D > 1 | Stricter cutoff (flags fewer points) | Point likely problematic |
| D > F(0.5, p, n-p) | 50th percentile of F | Formal statistical threshold |
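Cook's distance can be computed either from its leave-one-out definition or from the closed form based on standardized residuals and leverage. The sketch below does both on simulated data with one planted influential point and checks that they agree. The function names and the data are illustrative assumptions; the 4/n cutoff is the rule of thumb from the table.

```python
import numpy as np

def cooks_distance(X, y):
    """Closed form: D_i = r_i^2 / p * h_i / (1 - h_i)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
    s2 = np.sum(resid ** 2) / (n - p)
    r = resid / np.sqrt(s2 * (1 - h))  # standardized residuals
    return r ** 2 / p * h / (1 - h)

def cooks_distance_loo(X, y):
    """Definition: refit without observation i and compare all fitted values."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    s2 = np.sum((y - fitted) ** 2) / (n - p)
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        d[i] = np.sum((fitted - X @ beta_i) ** 2) / (p * s2)
    return d

rng = np.random.default_rng(6)
x = np.append(rng.uniform(0, 10, size=24), 20.0)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=x.size)
y[-1] += 8.0                           # high leverage and off-pattern
X = np.column_stack([np.ones(x.size), x])

d = cooks_distance(X, y)
print("forms agree  :", np.allclose(d, cooks_distance_loo(X, y)))
print("flagged (4/n):", np.where(d > 4 / x.size)[0])
```

The closed form needs only one fit, which is why it is preferred in practice; the leave-one-out loop is included only to show that the two expressions measure the same thing.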
Interactive: Cook's Distance Demo
[Interactive figure: Cook's Distance Explorer. A scatter plot and a bar chart of Cook's distance by observation; clicking a point in either one excludes it and shows how the regression line changes in a regression comparison panel. The demo lists points with Cook's D > 0.160 as influential.]
Observations with high Cook's distance cause larger changes when excluded.
Applications in Machine Learning
While modern ML often uses more complex models than linear regression, diagnostic thinking remains essential. Here's how these concepts transfer:
🔍 Residual Analysis for Neural Networks
Plotting predicted vs actual values reveals systematic biases. Patterns in residuals suggest the network is missing important structure—perhaps you need more layers, different architecture, or additional features.
🎯 Influential Training Examples
Methods like influence functions (Koh & Liang, 2017) extend Cook's distance ideas to deep learning. They identify which training examples most affect a particular prediction—useful for debugging, data cleaning, and understanding model behavior.
📊 Heteroscedastic Neural Networks
Instead of assuming constant variance, modern networks can output both a predicted mean $\mu(x)$ and a predicted variance $\sigma^2(x)$. This is heteroscedastic regression: the network learns where uncertainty is high vs low, providing input-dependent uncertainty estimates.
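Such networks are commonly trained with a Gaussian negative log-likelihood. The sketch below is not a neural network: it uses the true mean as a stand-in for the network's mean output and compares the loss under a single shared variance with the loss under an input-dependent variance. The log-variance parameterization, the function name `gaussian_nll`, and the simulated data are assumptions for illustration.

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Mean negative log-likelihood of y under N(mu, exp(log_var)).

    Predicting the log-variance keeps the variance positive without
    constraints (assumed parameterization).
    """
    return np.mean(0.5 * (np.log(2 * np.pi) + log_var
                          + (y - mu) ** 2 / np.exp(log_var)))

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=2000)
true_sd = 0.3 * x                      # noise level grows with x
y = 2.0 + 0.5 * x + rng.normal(0, true_sd)

mu = 2.0 + 0.5 * x                     # stand-in for the network's mean output

# Homoscedastic: one shared variance for every input.
shared_log_var = np.full_like(x, np.log(np.var(y - mu)))
nll_const = gaussian_nll(y, mu, shared_log_var)

# Heteroscedastic: variance depends on the input.
nll_hetero = gaussian_nll(y, mu, 2 * np.log(true_sd))

print("NLL, constant variance:", round(nll_const, 3))
print("NLL, input-dependent  :", round(nll_hetero, 3))
```

The lower loss for the input-dependent variance is what gives a network the incentive to learn where the noise is large and where it is small.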
⚠️ Outlier Detection in Feature Space
High leverage points have unusual feature combinations. In ML, these might be out-of-distribution examples where the model should express uncertainty rather than make confident predictions. Detecting them is crucial for safe deployment.
Python Implementation
Here's a complete implementation of regression diagnostics from scratch. This class computes all the key diagnostic measures and formal statistical tests.
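A minimal sketch of what such a class might look like, using NumPy only. The class name `RegressionDiagnostics`, its attribute names, and the rule-of-thumb cutoffs in `flags` are illustrative assumptions; it covers leverage, standardized residuals, Cook's distance, and the Durbin-Watson statistic rather than every test discussed above, and it expects a design matrix that already includes the intercept column.

```python
import numpy as np

class RegressionDiagnostics:
    """OLS fit plus basic diagnostics. X must include the intercept column."""

    def __init__(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y, dtype=float)
        self.n, self.p = self.X.shape
        self.beta, *_ = np.linalg.lstsq(self.X, self.y, rcond=None)
        self.fitted = self.X @ self.beta
        self.resid = self.y - self.fitted
        self.s2 = np.sum(self.resid ** 2) / (self.n - self.p)
        # Leverages: diagonal of X (X^T X)^{-1} X^T without forming the n x n matrix.
        xtx_inv = np.linalg.inv(self.X.T @ self.X)
        self.leverage = np.einsum("ij,jk,ik->i", self.X, xtx_inv, self.X)

    @property
    def standardized(self):
        """Residuals scaled by their estimated standard deviation."""
        return self.resid / np.sqrt(self.s2 * (1 - self.leverage))

    @property
    def cooks_d(self):
        h = self.leverage
        return self.standardized ** 2 / self.p * h / (1 - h)

    @property
    def durbin_watson(self):
        return np.sum(np.diff(self.resid) ** 2) / np.sum(self.resid ** 2)

    def flags(self):
        """Indices exceeding the rule-of-thumb cutoffs 2p/n and 4/n."""
        return {
            "high_leverage": np.where(self.leverage > 2 * self.p / self.n)[0],
            "influential": np.where(self.cooks_d > 4 / self.n)[0],
        }

# Usage on simulated data with one planted problem point (index 39).
rng = np.random.default_rng(8)
x = np.append(rng.uniform(0, 10, size=39), 25.0)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=x.size)
y[-1] -= 10.0
diag = RegressionDiagnostics(np.column_stack([np.ones(x.size), x]), y)
print(diag.flags())
print("Durbin-Watson:", round(diag.durbin_watson, 2))
```

Keeping the fit in the constructor and the diagnostics as properties means each measure is derived from the same residuals and leverages, so the numbers stay consistent with one another.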
In practice, use `import statsmodels.api as sm`, then `sm.OLS(y, X).fit().summary()` for a summary that includes several diagnostic statistics. The fitted results' `get_influence()` method provides Cook's distance, leverage, and more.
Knowledge Check
Test your understanding of regression diagnostics with these questions:
In the Residuals vs Fitted plot, a funnel-shaped pattern indicates:
Summary
Key Takeaways
Looking Ahead
In the next section, we'll explore Regularization: Ridge and Lasso, techniques that address some of the problems diagnostics reveal (like multicollinearity) while also preventing overfitting. You'll see how adding a penalty term transforms the optimization problem and leads to more stable, interpretable models.