Learning Objectives
By the end of this section, you will be able to:
📐 Mathematical Understanding
- Derive Ridge and Lasso objective functions from first principles
- Explain why L1 produces sparsity while L2 does not
- Understand the geometric interpretation of constraint regions
- Derive the closed-form solution for Ridge regression
🔧 Practical Skills
- Implement Ridge and Lasso regression from scratch
- Use cross-validation to select optimal regularization strength
- Choose between Ridge, Lasso, and Elastic Net for different problems
- Interpret regularization paths to understand feature importance
🧠 Deep Learning Connections
- Weight Decay: L2 regularization in neural networks applies the Ridge penalty to each layer's weights
- Sparse Networks: L1 regularization promotes weight pruning, leading to efficient architectures
- Bayesian Interpretation: L2 corresponds to a Gaussian prior on weights and L1 to a Laplace prior, enabling probabilistic neural networks
- Feature Selection: Lasso-style sparsity penalties appear in sparse attention mechanisms and neural architecture search
Where You'll Apply This: Preventing overfitting in any regression model, automatic feature selection, building interpretable models, weight decay in deep learning, and handling multicollinearity in high-dimensional data.
The Big Picture
Ordinary Least Squares (OLS) finds the coefficients that minimize the sum of squared errors. But what if this solution is too good—fitting the training data perfectly but failing to generalize? This is the problem of overfitting, and regularization is one of the most elegant solutions ever devised.
The Core Insight
Regularization adds a penalty term to the loss function that discourages large coefficient values. By accepting a small amount of bias (shrinking coefficients from their OLS values), we can achieve a much larger reduction in variance, leading to better predictions on new data.
$$\text{Loss}(\beta) = \|y - X\beta\|_2^2 + \lambda \cdot \text{Penalty}(\beta)$$
The penalty term controls how much we constrain the coefficients.
Historical Context
Andrey Tikhonov (1943)
Developed Tikhonov regularization for solving ill-posed inverse problems in physics. His work laid the mathematical foundation for Ridge regression. The same technique appears in signal processing, image reconstruction, and machine learning.
Arthur Hoerl & Robert Kennard (1970)
Introduced Ridge regression to statistics, showing how it stabilizes estimates when predictors are highly correlated (multicollinearity). Their paper "Ridge Regression: Biased Estimation for Nonorthogonal Problems" is a cornerstone of modern regression.
Robert Tibshirani (1996)
Introduced the Lasso (Least Absolute Shrinkage and Selection Operator), which uses L1 penalty to simultaneously perform regularization AND variable selection. This revolutionized high-dimensional statistics and directly influenced sparse methods in machine learning.
The Overfitting Problem
Consider what happens when we have many features relative to our sample size. OLS can find coefficients that fit the training data perfectly but fail dramatically on new data. The coefficients become unstable—small changes in the data cause wild swings in estimates.
| Scenario | OLS Behavior | Regularization Solution |
|---|---|---|
| p > n (more features than samples) | Infinitely many solutions, arbitrarily large coefficients | Unique, stable solution with bounded coefficients |
| Multicollinearity | Huge variance, coefficient signs may flip | Shrinks correlated coefficients, stabilizes estimates |
| Noisy features | Fits noise as signal, overfits | Shrinks noisy coefficients toward zero |
| Small sample size | High variance, poor generalization | Trades bias for lower variance, better predictions |
Ridge Regression (L2 Regularization)
Ridge regression adds a penalty proportional to the sum of squared coefficients. This is equivalent to constraining the coefficients to lie within an L2 ball centered at the origin.
Mathematical Formulation
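Ridge Regression Objective
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$
Equivalently:
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
Let's break down each component:
Regularization Parameter $\lambda \ge 0$
Controls the trade-off between fit and complexity. $\lambda = 0$ gives OLS; $\lambda \to \infty$ shrinks all coefficients to zero.
L2 Penalty (Squared Norm)
$\|\beta\|_2^2 = \sum_j \beta_j^2$ is smooth and differentiable everywhere, allowing for a closed-form solution. The gradient is simply $2\beta$.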
Closed-Form Solution
One of Ridge regression's great advantages is that it has a closed-form solution, requiring only matrix inversion:
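Ridge Closed-Form Solution
$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$
Compare to OLS:
$$\hat{\beta}^{\text{OLS}} = (X^\top X)^{-1} X^\top y$$
The key difference is the addition of $\lambda I$ to the $X^\top X$ matrix. This has profound consequences:
- Invertibility: Even if $X^\top X$ is singular (which always happens when $p > n$), adding $\lambda I$ with $\lambda > 0$ makes it invertible.
- Numerical Stability: Small eigenvalues $d_j^2$ of $X^\top X$ become $d_j^2 + \lambda$, preventing division by near-zero values.
- Shrinkage: Coefficients are systematically shrunk toward zero, with the amount of shrinkage controlled by $\lambda$.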
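The invertibility point can be checked numerically. The sketch below is illustrative only: the data are random, the sizes `n`, `p` and the value `lam = 1.0` are arbitrary assumptions, and the intercept is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                        # more features than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

gram = X.T @ X                       # p x p matrix with rank at most n
rank = np.linalg.matrix_rank(gram)   # rank < p, so gram has no inverse

lam = 1.0
reg = gram + lam * np.eye(p)         # adding lam * I lifts every eigenvalue by lam

# Solve (X^T X + lam * I) beta = X^T y instead of forming the inverse explicitly.
beta_ridge = np.linalg.solve(reg, X.T @ y)

min_eig = np.linalg.eigvalsh(reg).min()
print(f"rank of X^T X: {rank} (p = {p}), smallest regularized eigenvalue: {min_eig:.3f}")
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical choice; it gives the same coefficients up to rounding.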
Shrinkage and Eigenvalues
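To understand Ridge more deeply, express the solution using the singular value decomposition of $X$. If $X = U D V^\top$ with singular values $d_j$, then the Ridge coefficients are:
$$\hat{\beta}^{\text{ridge}} = \sum_{j} v_j \, \frac{d_j^2}{d_j^2 + \lambda} \, \frac{u_j^\top y}{d_j}$$
The term $d_j^2 / (d_j^2 + \lambda)$ is the shrinkage factor applied to the OLS component $u_j^\top y / d_j$. When $d_j$ is small (corresponding to directions of low variance in $X$), the shrinkage is strong. When $d_j$ is large, shrinkage is mild.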
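As a sanity check on the SVD view, the following sketch compares the SVD-based Ridge solution with the direct solve. The data and the value `lam = 2.0` are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 2.0

# Thin SVD: X = U @ diag(d) @ Vt, with d sorted in decreasing order.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

shrink = d**2 / (d**2 + lam)                 # one shrinkage factor per direction
beta_svd = Vt.T @ (shrink * (U.T @ y) / d)   # shrunken OLS components, mapped back

beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("shrinkage factors:", np.round(shrink, 3))
print("max abs difference:", np.abs(beta_svd - beta_direct).max())
```

Because `d` is returned in decreasing order, the shrinkage factors also decrease: the low-variance directions are shrunk the most.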
Lasso Regression (L1 Regularization)
Lasso (Least Absolute Shrinkage and Selection Operator) uses the sum of absolute values of coefficients as the penalty. This seemingly small change has dramatic consequences.
Mathematical Formulation
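Lasso Regression Objective
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$
Equivalently:
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$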
Unlike Ridge, Lasso does not have a closed-form solution because the L1 norm is not differentiable at zero. We must use iterative algorithms like coordinate descent.
Why Lasso Produces Sparsity
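The magic of Lasso lies in the soft thresholding operation. For a single variable (a predictor scaled so that $x^\top x = 1$, with the objective written as $\tfrac{1}{2}\text{RSS} + \lambda|\beta|$), the Lasso solution has the form:
Soft Thresholding Operator
$$\hat{\beta}^{\text{lasso}} = \text{sign}\big(\hat{\beta}^{\text{OLS}}\big) \left(\big|\hat{\beta}^{\text{OLS}}\big| - \lambda\right)_+$$
Also written as:
$$S_\lambda(z) = \begin{cases} z - \lambda & \text{if } z > \lambda \\ 0 & \text{if } |z| \le \lambda \\ z + \lambda & \text{if } z < -\lambda \end{cases}$$
This formula reveals everything:
- If $|\hat{\beta}^{\text{OLS}}| \le \lambda$, the coefficient is set exactly to zero. The feature is eliminated from the model.
- If $|\hat{\beta}^{\text{OLS}}| > \lambda$, the coefficient is shrunk by exactly $\lambda$ toward zero.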
- The sign of the coefficient is preserved (positive stays positive, negative stays negative).
Ridge (L2) Shrinkage
Coefficients shrink proportionally: $\hat{\beta}^{\text{ridge}} = \hat{\beta}^{\text{OLS}} / (1 + \lambda)$ for an orthonormal design. They get arbitrarily close to zero but never exactly zero.
Lasso (L1) Shrinkage
Coefficients shrink by a constant amount, and small ones become exactly zero. This is automatic feature selection!
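The two shrinkage rules can be compared side by side. This is a minimal sketch assuming an orthonormal design, so each coefficient can be treated independently. The OLS values in `beta_ols` and `lam = 0.5` are made up, and the exact threshold depends on how the objective is scaled.

```python
import numpy as np

def soft_threshold(z, t):
    """Lasso-style shrinkage: move z toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, lam):
    """Ridge-style shrinkage for an orthonormal design: rescale by 1 / (1 + lam)."""
    return z / (1.0 + lam)

beta_ols = np.array([3.0, -1.5, 0.4, -0.2])   # hypothetical OLS coefficients
lam = 0.5

beta_lasso = soft_threshold(beta_ols, lam)
beta_ridge = ridge_shrink(beta_ols, lam)

print("OLS:  ", beta_ols)
print("Lasso:", beta_lasso)   # the two small coefficients become exactly zero
print("Ridge:", beta_ridge)   # every coefficient is scaled down, none is zero
```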
Geometric Interpretation
The difference between Ridge and Lasso becomes crystal clear when we view them geometrically. Both problems can be written as constrained optimization:
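Ridge Constraint
$$\min_{\beta} \|y - X\beta\|_2^2 \quad \text{subject to} \quad \sum_{j} \beta_j^2 \le t$$
A sphere (circle in 2D)
Lasso Constraint
$$\min_{\beta} \|y - X\beta\|_2^2 \quad \text{subject to} \quad \sum_{j} |\beta_j| \le t$$
A diamond (rotated square in 2D)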
The regularized solution is found where the elliptical contours of the RSS (centered at the OLS solution) first touch the constraint region. For the smooth circle of Ridge, this almost always happens on the curved boundary, giving a non-sparse solution. For the diamond of Lasso, the contours often hit a corner, where one or more coordinates are exactly zero.
Interactive: Ridge vs Lasso Geometry
Explore how the different constraint shapes lead to different solutions. Adjust the OLS solution and constraint size to see when Lasso produces zeros at the corners.
[Interactive figure: Geometric Interpretation, Ridge vs Lasso. Panels: Ridge Solution (both coefficients shrunk proportionally) and Lasso Solution (on an edge or at a corner, depending on the OLS ratio).]
Why the Geometry Matters
The elliptical contours represent equal loss (RSS). The regularized solution is where the smallest contour touches the constraint region. For the L2 circle, this is almost always on the smooth boundary. For the L1 diamond, it often hits a corner where one or more coefficients are exactly zero. This is why Lasso performs automatic feature selection!
Bias-Variance Tradeoff
Regularization is fundamentally about the bias-variance tradeoff. By introducing bias (shrinking coefficients away from their OLS values), we can achieve a larger reduction in variance, leading to lower overall prediction error.
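Prediction Error Decomposition
$$\mathbb{E}\left[\big(y - \hat{f}(x)\big)^2\right] = \text{Bias}^2\big[\hat{f}(x)\big] + \text{Var}\big[\hat{f}(x)\big] + \sigma^2$$
Here $\sigma^2$ is the irreducible noise variance. As regularization strength $\lambda$ increases:
- Bias increases: We're forcing coefficients away from their unbiased OLS estimates.
- Variance decreases: The estimates become less sensitive to the particular training sample.
- MSE has a U-shape: There's an optimal $\lambda$ where the total error is minimized.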
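The trend in the list above can be estimated with a small Monte Carlo experiment. This is a sketch under stated assumptions: a fixed random design, made-up true coefficients `beta_true`, Gaussian noise with standard deviation `sigma`, and prediction at a single test point `x0`. None of these come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 30, 10, 2.0
beta_true = rng.normal(size=p)   # hypothetical "true" coefficients
X = rng.normal(size=(n, p))      # design held fixed across repetitions
x0 = rng.normal(size=p)          # a single test point
f0 = x0 @ beta_true              # noiseless target at the test point

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def bias_variance(lam, n_rep=2000):
    """Estimate squared bias and variance of the Ridge prediction at x0."""
    preds = np.empty(n_rep)
    for r in range(n_rep):
        y = X @ beta_true + sigma * rng.normal(size=n)   # fresh training noise
        preds[r] = x0 @ ridge_fit(X, y, lam)
    return (preds.mean() - f0) ** 2, preds.var()

results = {lam: bias_variance(lam) for lam in [0.0, 1.0, 10.0, 100.0]}
for lam, (bias_sq, var) in results.items():
    print(f"lambda={lam:6.1f}  bias^2={bias_sq:.4f}  variance={var:.4f}")
```

With `lam = 0.0` the fit is plain OLS, so the estimated squared bias should be close to zero, while the variance is at its largest.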
Interactive: Bias-Variance Demo
Watch how regularization trades bias for variance. The thin lines show individual fits from different training samples (variance), while the thick lines show the average (bias is the deviation from the true line).
[Interactive figure: Bias-Variance Tradeoff with Regularization. Panels: OLS (Unregularized) and Ridge (Regularized).]
The Bias-Variance Tradeoff
Notice how the blue OLS lines spread out widely (high variance) but center on the true line (low bias). The orange Ridge lines cluster tightly (low variance) but are slightly off from the true line (more bias). When variance reduction exceeds bias increase, regularization wins!
Regularization Paths
A regularization path shows how coefficients change as $\lambda$ varies from 0 (OLS) to infinity (all zeros). This visualization is invaluable for understanding which features are most important and how they relate to each other.
Ridge Path
All coefficients shrink smoothly toward zero together. Coefficients never become exactly zero, so the model always uses all features.
Lasso Path
Coefficients hit zero at different values of $\lambda$. Features "drop out" of the model one by one, revealing a hierarchy of importance. The path is piecewise linear.
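Both paths can be traced in a few lines when the design is orthonormal, because each coefficient then has a closed-form path. The sketch below makes that simplifying assumption. The OLS values and the grid of `lambdas` are invented for illustration, and the thresholds follow the soft-thresholding and $1/(1+\lambda)$ formulas above.

```python
import numpy as np

beta_ols = np.array([4.0, -2.5, 1.0, 0.3])   # hypothetical OLS coefficients
lambdas = np.linspace(0.0, 5.0, 51)

# Ridge path: every coefficient is rescaled by 1 / (1 + lambda).
ridge_path = np.array([beta_ols / (1.0 + lam) for lam in lambdas])

# Lasso path: soft-threshold each coefficient (piecewise linear in lambda).
lasso_path = np.array([
    np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)
    for lam in lambdas
])

# Number of features still in the Lasso model at each lambda.
n_active = (lasso_path != 0).sum(axis=1)

for lam, k in zip(lambdas[::10], n_active[::10]):
    print(f"lambda={lam:.1f}  active Lasso features={k}")
```

Features leave the Lasso model in order of increasing $|\hat{\beta}^{\text{OLS}}|$, whereas every row of `ridge_path` keeps all four coefficients nonzero.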
Interactive: Regularization Path Explorer
Explore how Ridge and Lasso shrink coefficients differently. Notice how Lasso sets coefficients to zero while Ridge only approaches zero.
[Interactive figure: Regularization Path Explorer, showing how coefficients shrink as regularization strength increases.]
Ridge Behavior
Ridge regression shrinks all coefficients smoothly toward zero but never exactly to zero. Useful when you believe all features have some predictive power.
Selecting the Regularization Parameter
The regularization parameter $\lambda$ is a hyperparameter that controls the bias-variance tradeoff. Choosing it well is crucial for good performance.
- Cross-Validation: The gold standard. Split data into k folds, fit models for each $\lambda$ value on k-1 folds, evaluate on the held-out fold. Choose the $\lambda$ with lowest average validation error.
- 1-Standard-Error Rule: Instead of the minimum-error $\lambda$, choose the largest $\lambda$ whose error is within 1 SE of the minimum. This gives a simpler, more interpretable model with similar performance.
- Information Criteria: AIC, BIC, and related criteria balance fit against complexity without requiring held-out data. Useful for very large datasets where CV is expensive.
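The first two strategies can be sketched with plain NumPy. The helper names `ridge_fit` and `cv_ridge`, the synthetic data, and the grid of `lambdas` are all assumptions made for this example. Intercepts and feature scaling are ignored for brevity.

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_ridge(X, y, lambdas, k=5, seed=0):
    """Return the mean and standard error of k-fold validation MSE per lambda."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = np.empty((len(lambdas), k))
    for f, val_idx in enumerate(folds):
        train_idx = np.concatenate([folds[g] for g in range(k) if g != f])
        for i, lam in enumerate(lambdas):
            beta = ridge_fit(X[train_idx], y[train_idx], lam)
            resid = y[val_idx] - X[val_idx] @ beta
            errors[i, f] = np.mean(resid ** 2)
    return errors.mean(axis=1), errors.std(axis=1, ddof=1) / np.sqrt(k)

# Synthetic data: 5 informative features out of 30 (illustrative only).
rng = np.random.default_rng(42)
n, p = 60, 30
X = rng.normal(size=(n, p))
beta_true = np.concatenate([rng.normal(size=5), np.zeros(p - 5)])
y = X @ beta_true + 2.0 * rng.normal(size=n)

lambdas = np.logspace(-2, 3, 30)
mean_err, se_err = cv_ridge(X, y, lambdas)

i_min = int(np.argmin(mean_err))
lam_min = lambdas[i_min]                          # minimum-CV-error choice
threshold = mean_err[i_min] + se_err[i_min]
lam_1se = lambdas[mean_err <= threshold].max()    # 1-SE rule: largest lambda within 1 SE

print(f"lambda (min CV error): {lam_min:.3g}")
print(f"lambda (1-SE rule):    {lam_1se:.3g}")
```

By construction the 1-SE choice is never smaller than the minimum-error choice, so it always gives a model that is at least as regularized.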
Interactive: Cross-Validation for Lambda Selection
Explore how cross-validation finds the optimal regularization strength. Notice the U-shaped validation curve and how the 1-SE rule selects a more regularized model.
[Interactive figure: Cross-Validation for Lambda Selection. Markers: Optimal Lambda (Min CV Error) and 1-SE Lambda (More Regularized).]
The 1-Standard-Error Rule
While the minimum CV error gives the "optimal" lambda, practitioners often use the 1-SE rule: choose the largest lambda (simplest model) whose CV error is within 1 standard error of the minimum. This leads to simpler, more interpretable models with similar predictive performance.
Elastic Net: Best of Both Worlds
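What if you want the sparsity of Lasso but also need to handle groups of correlated features? Elastic Net combines both L1 and L2 penalties:
Elastic Net Objective
$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$
Often parameterized as:
$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \left[ \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right], \qquad \alpha \in [0, 1]$$
Here $\alpha = 1$ recovers Lasso and $\alpha = 0$ recovers Ridge.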
| Method | L1 | L2 | Sparsity | Handles Correlation |
|---|---|---|---|---|
| Ridge | 0 | 100% | No | Yes (shrinks together) |
| Lasso | 100% | 0 | Yes | No (picks one arbitrarily) |
| Elastic Net | Mixed | Mixed | Yes | Yes (groups stay together) |
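The "Handles Correlation" column can be illustrated with two identical features. The sketch below uses a simple cyclic coordinate descent written for this example (the function name `elastic_net_cd` and its scaling of the penalty are assumptions, not a library API). It is not an optimized solver.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iter=2000, tol=1e-12):
    """Coordinate descent for
    ||y - X b||^2 + lam * (alpha * ||b||_1 + (1 - alpha) * ||b||_2^2).
    Assumes no all-zero columns in X."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y - X @ beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Correlation of feature j with the partial residual (feature j removed).
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam * alpha / 2.0) / (col_sq[j] + lam * (1.0 - alpha))
            change = new_bj - beta[j]
            resid -= X[:, j] * change
            beta[j] = new_bj
            max_change = max(max_change, abs(change))
        if max_change < tol:
            break
    return beta

rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, x])              # two identical (perfectly correlated) features
y = 3.0 * x + 0.1 * rng.normal(size=100)

beta_lasso = elastic_net_cd(X, y, lam=20.0, alpha=1.0)   # pure L1 penalty
beta_enet = elastic_net_cd(X, y, lam=20.0, alpha=0.5)    # mixed penalty

print("Lasso:      ", np.round(beta_lasso, 4))
print("Elastic Net:", np.round(beta_enet, 4))
```

With the pure L1 penalty, the solver puts essentially all the weight on whichever copy it visits first. The L2 part of the mixed penalty makes the problem strictly convex, so the two copies end up sharing the weight equally.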
Deep Learning Connection
Regularization in linear models directly translates to deep learning. Understanding these connections illuminates why neural network training uses similar techniques.
🔧 Weight Decay = Ridge
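Adding weight_decay to an optimizer implements L2 regularization. The loss effectively becomes $\mathcal{L}(w) + \lambda \|w\|_2^2$ (up to a constant factor in how $\lambda$ is scaled), the same penalty Ridge regression uses, here applied to each layer's weights.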
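For plain gradient descent, the link between weight decay and an L2 penalty can be verified directly. The sketch below uses NumPy rather than a deep learning framework. The linear model, learning rate `lr`, and decay rate `wd` are arbitrary assumptions. With adaptive optimizers such as Adam the two formulations no longer coincide, which is the motivation for decoupled weight decay (AdamW).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
lr, wd = 0.01, 0.1

def data_grad(w):
    """Gradient of the mean squared error of a linear model."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

w_decay = np.ones(3)
w_l2 = np.ones(3)
for _ in range(200):
    # (a) Weight decay: shrink the weights, then take the usual gradient step.
    w_decay = (1.0 - lr * wd) * w_decay - lr * data_grad(w_decay)
    # (b) Explicit L2 penalty: add the gradient of (wd / 2) * ||w||^2 to the loss gradient.
    w_l2 = w_l2 - lr * (data_grad(w_l2) + wd * w_l2)

print("weight decay:", w_decay)
print("L2 penalty:  ", w_l2)
```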
✂️ L1 for Pruning
L1 regularization on neural network weights encourages sparse networks. Small weights become exactly zero, effectively pruning connections. This is used for network compression and efficient inference.
📊 Bayesian Interpretation
L2 regularization is equivalent to placing a Gaussian prior on weights. L1 is a Laplace prior. This connection enables Bayesian neural networks and uncertainty quantification.
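The equivalence follows from a short maximum a posteriori (MAP) derivation. The sketch below assumes a Gaussian likelihood $y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and independent priors on each coefficient. The symbols $\sigma$, $\tau$, and $b$ are notation introduced here for illustration.

```latex
\begin{aligned}
\hat{\beta}_{\text{MAP}}
  &= \arg\max_{\beta} \; \log p(y \mid X, \beta) + \log p(\beta) \\[4pt]
\text{Gaussian prior } \beta_j \sim \mathcal{N}(0, \tau^2): \quad
\hat{\beta}_{\text{MAP}}
  &= \arg\min_{\beta} \; \frac{1}{2\sigma^2} \|y - X\beta\|_2^2 + \frac{1}{2\tau^2} \|\beta\|_2^2 \\
  &= \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2,
  \qquad \lambda = \frac{\sigma^2}{\tau^2} \\[4pt]
\text{Laplace prior } p(\beta_j) \propto e^{-|\beta_j| / b}: \quad
\hat{\beta}_{\text{MAP}}
  &= \arg\min_{\beta} \; \frac{1}{2\sigma^2} \|y - X\beta\|_2^2 + \frac{1}{b} \|\beta\|_1 \\
  &= \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1,
  \qquad \lambda = \frac{2\sigma^2}{b}
\end{aligned}
```

In both cases a tighter prior (smaller $\tau$ or $b$) corresponds to a larger $\lambda$, that is, stronger regularization.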
🎯 Feature Selection
Sparse attention mechanisms use L1-like penalties to select important features. Group Lasso selects entire groups of features (or neurons) together, used in structured pruning.
PyTorch Example: Weight Decay
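```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model so the snippet runs

# L2 regularization in PyTorch
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01,  # plays the role of lambda
)
# Adam adds weight_decay * w to each gradient, which matches an
# L2 penalty of (weight_decay / 2) * sum(w^2) on the loss
```
Python Implementation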
Here are complete implementations of Ridge and Lasso regression from scratch. Ridge uses the closed-form solution while Lasso requires coordinate descent.
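A minimal from-scratch sketch of both estimators follows. It assumes centered data with no intercept and uses the objectives $\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$ and $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$. The function names, the synthetic data, and `lam = 40.0` are choices made for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form Ridge: solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_fit(X, y, lam, n_iter=1000, tol=1e-10):
    """Cyclic coordinate descent for ||y - X b||^2 + lam * ||b||_1."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)        # x_j^T x_j for each column
    resid = y - X @ beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            # Threshold is lam / 2 because the RSS carries no factor of 1/2.
            new_bj = soft_threshold(rho, lam / 2.0) / col_sq[j]
            change = new_bj - beta[j]
            if change != 0.0:
                resid -= X[:, j] * change
                beta[j] = new_bj
                max_change = max(max_change, abs(change))
        if max_change < tol:
            break
    return beta

# Synthetic, centered data with three informative features out of eight.
rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
beta_true = np.array([2.0, -3.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)
y -= y.mean()

lam = 40.0
beta_ridge = ridge_fit(X, y, lam)
beta_lasso = lasso_fit(X, y, lam)

print("Ridge:", np.round(beta_ridge, 3))
print("Lasso:", np.round(beta_lasso, 3))
```

The Ridge estimate keeps all eight coefficients nonzero, while the Lasso estimate typically zeroes out some of the uninformative ones.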
In practice, use sklearn.linear_model.Ridge, sklearn.linear_model.Lasso, or sklearn.linear_model.ElasticNet. For large-scale problems, use glmnet (via glmnet-python), which implements highly optimized coordinate descent with warm starts.
Knowledge Check
Test your understanding of regularization with these questions:
Question 1 of 8: What is the primary purpose of regularization in regression?
Summary
Key Takeaways
Looking Ahead
In the next chapter, we'll explore Generalized Linear Models (GLMs), which extend regression to non-normal response distributions like binary outcomes (logistic regression) and counts (Poisson regression). The regularization techniques you've learned here apply directly to GLMs, making Ridge and Lasso logistic regression possible.