Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📐 Mathematical Understanding

• Derive Ridge and Lasso objective functions from first principles
• Explain why L1 produces sparsity while L2 does not
• Understand the geometric interpretation of constraint regions
• Derive the closed-form solution for Ridge regression

🔧 Practical Skills

• Implement Ridge and Lasso regression from scratch
• Use cross-validation to select optimal regularization strength
• Choose between Ridge, Lasso, and Elastic Net for different problems
• Interpret regularization paths to understand feature importance

🧠 Deep Learning Connections

• Weight Decay: L2 regularization in neural networks applies the Ridge penalty to each layer's weights
• Sparse Networks: L1 regularization pushes weights to exactly zero, which supports pruning and more efficient architectures
• Bayesian Interpretation: L2 corresponds to a Gaussian prior on weights and L1 to a Laplace prior, which connects regularization to probabilistic neural networks
• Feature Selection: Lasso-style sparsity penalties appear in structured pruning, some sparse attention variants, and neural architecture search

Where You'll Apply This: Preventing overfitting in any regression model, automatic feature selection, building interpretable models, weight decay in deep learning, and handling multicollinearity in high-dimensional data.

The Big Picture

Ordinary Least Squares (OLS) finds the coefficients that minimize the sum of squared errors. But what if this solution is too good—fitting the training data perfectly but failing to generalize? This is the problem of overfitting, and regularization is one of the most elegant solutions ever devised.

The Core Insight

Regularization adds a penalty term to the loss function that discourages large coefficient values. By accepting a small amount of bias (shrinking coefficients from their OLS values), we can achieve a much larger reduction in variance, leading to better predictions on new data.

$\text{Total Loss} = \underbrace{\text{RSS}}_{\text{Data Fit}} + \underbrace{\lambda \cdot \text{Penalty}(\boldsymbol{\beta})}_{\text{Complexity Cost}}$

The multiplier $\lambda$ controls how strongly the penalty constrains the coefficients.
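As a minimal sketch of this penalized loss, assuming NumPy, the function below evaluates RSS plus a weighted penalty. The name `penalized_loss` and the toy data are illustrative and not part of the original material.

```python
import numpy as np

def penalized_loss(X, y, beta, lam, penalty="l2"):
    """Return RSS + lam * penalty(beta) for an L2 (squared) or L1 (absolute) penalty."""
    rss = np.sum((y - X @ beta) ** 2)          # data-fit term
    if penalty == "l2":
        cost = np.sum(beta ** 2)               # sum of squared coefficients
    elif penalty == "l1":
        cost = np.sum(np.abs(beta))            # sum of absolute coefficients
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return rss + lam * cost

# Toy data (made up for illustration): beta fits y exactly, so RSS = 0
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])

loss_fit_only = penalized_loss(X, y, beta, lam=0.0)
loss_l2 = penalized_loss(X, y, beta, lam=0.5, penalty="l2")
loss_l1 = penalized_loss(X, y, beta, lam=0.5, penalty="l1")
```

Because the fit is exact here, everything above zero in `loss_l2` and `loss_l1` is complexity cost.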

Historical Context

📜

Andrey Tikhonov (1943)

Developed Tikhonov regularization for solving ill-posed inverse problems in physics. His work laid the mathematical foundation for Ridge regression. The same technique appears in signal processing, image reconstruction, and machine learning.

🔬

Arthur Hoerl & Robert Kennard (1970)

Introduced Ridge regression to statistics, showing how it stabilizes estimates when predictors are highly correlated (multicollinearity). Their paper "Ridge Regression: Biased Estimation for Nonorthogonal Problems" is a cornerstone of modern regression.

🎯

Robert Tibshirani (1996)

Introduced the Lasso (Least Absolute Shrinkage and Selection Operator), which uses L1 penalty to simultaneously perform regularization AND variable selection. This revolutionized high-dimensional statistics and directly influenced sparse methods in machine learning.

The Overfitting Problem

Consider what happens when we have many features relative to our sample size. OLS can find coefficients that fit the training data perfectly but fail dramatically on new data. The coefficients become unstable—small changes in the data cause wild swings in estimates.

Scenario	OLS Behavior	Regularization Solution
p > n (more features than samples)	Infinite solutions, arbitrarily large coefficients	Unique, stable solution with bounded coefficients
Multicollinearity	Huge variance, coefficient signs may flip	Shrinks correlated coefficients, stabilizes estimates
Noisy features	Fits noise as signal, overfits	Shrinks noisy coefficients toward zero
Small sample size	High variance, poor generalization	Trades bias for lower variance, better predictions

Ridge Regression (L2 Regularization)

Ridge regression adds a penalty proportional to the sum of squared coefficients. This is equivalent to constraining the coefficients to lie within an L2 ball centered at the origin.

Mathematical Formulation

Ridge Regression Objective

$\hat{\boldsymbol{\beta}}^{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$

Equivalently: $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2$

Let's break down each component:

Regularization Parameter

Controls the trade-off between fit and complexity. $\lambda = 0$ gives OLS; $\lambda \to \infty$ shrinks all coefficients to zero.

L2 Penalty (Squared Norm)

$\|\boldsymbol{\beta}\|_2^2 = \sum_j \beta_j^2$ is smooth and differentiable everywhere, allowing for a closed-form solution. The gradient is simply $2\boldsymbol{\beta}$.

Note on the Intercept: The intercept $\beta_0$ is typically not penalized because it represents the overall level of the response and doesn't contribute to overfitting. Either center the data before fitting or exclude the intercept from the penalty term.
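The two ways of handling the intercept lead to the same fit. The sketch below, assuming NumPy and synthetic data (all variable names are illustrative), compares an unpenalized intercept column with centering followed by recovering the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 3, 2.0
X = rng.normal(loc=5.0, size=(n, p))      # columns with non-zero means
y = 10.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Route 1: add a column of ones and leave its coefficient unpenalized
X_aug = np.column_stack([np.ones(n), X])
P = np.eye(p + 1)
P[0, 0] = 0.0                             # no penalty on the intercept
theta = np.linalg.solve(X_aug.T @ X_aug + lam * P, X_aug.T @ y)

# Route 2: center X and y, fit without an intercept, recover it afterwards
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean
beta_c = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
intercept_c = y_mean - x_mean @ beta_c

same_slopes = np.allclose(theta[1:], beta_c)
same_intercept = np.isclose(theta[0], intercept_c)
```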

Closed-Form Solution

One of Ridge regression's great advantages is that it has a closed-form solution, requiring only matrix inversion:

Ridge Closed-Form Solution

$\hat{\boldsymbol{\beta}}^{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$

Compare to OLS: $\hat{\boldsymbol{\beta}}^{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
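The Ridge formula follows from setting the gradient of the Ridge objective to zero. A short derivation sketch, using the matrix form of the objective given earlier and ignoring the intercept:

```latex
\begin{aligned}
J(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^T\boldsymbol{\beta} \\
\nabla_{\boldsymbol{\beta}} J &= -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + 2\lambda\boldsymbol{\beta} = \mathbf{0} \\
(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\,\boldsymbol{\beta} &= \mathbf{X}^T\mathbf{y} \\
\hat{\boldsymbol{\beta}}^{\text{ridge}} &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}
\end{aligned}
```

The Hessian is $2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})$, which is positive definite for $\lambda > 0$, so this stationary point is the unique minimizer.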

The key difference is the addition of $\lambda \mathbf{I}$ to the matrix $\mathbf{X}^T\mathbf{X}$. This has profound consequences:

Invertibility: Even if $\mathbf{X}^T\mathbf{X}$ is singular (as it always is when $p > n$), adding $\lambda \mathbf{I}$ with $\lambda > 0$ makes it invertible.
Numerical Stability: Each eigenvalue $\lambda_i$ of $\mathbf{X}^T\mathbf{X}$ becomes $\lambda_i + \lambda$ (here $\lambda_i$ is an eigenvalue and $\lambda$ is the regularization parameter), so no eigenvalue falls below $\lambda$ and division by near-zero values is avoided.
Shrinkage: Coefficients are systematically shrunk toward zero, with the amount of shrinkage controlled by $\lambda$.
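The invertibility and eigenvalue claims can be checked numerically. A small sketch, assuming NumPy and random data with more features than samples (the sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 5, 10, 0.1                  # p > n, so X'X cannot have full rank
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X
rank_XtX = np.linalg.matrix_rank(XtX)   # at most n, hence XtX is singular

A = XtX + lam * np.eye(p)               # the Ridge system matrix
beta_ridge = np.linalg.solve(A, X.T @ y)

eig_XtX = np.linalg.eigvalsh(XtX)       # ascending eigenvalues of X'X
eig_A = np.linalg.eigvalsh(A)           # each one is shifted up by lam
```

The smallest eigenvalue of `A` is about `lam`, so the linear system is solvable even though `XtX` alone is not invertible.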

Shrinkage and Eigenvalues

To understand Ridge more deeply, express the solution using the singular value decomposition of $\mathbf{X}$. If $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, then the Ridge coefficients are:

$\hat{\boldsymbol{\beta}}^{\text{ridge}} = \sum_{j=1}^{p} \mathbf{v}_j \frac{d_j}{d_j^2 + \lambda} \mathbf{u}_j^T \mathbf{y}$

OLS corresponds to $\lambda = 0$, where the weight on each direction is $1/d_j$. Ridge replaces that weight with $\frac{d_j}{d_j^2 + \lambda}$, which multiplies the OLS component by the shrinkage factor $\frac{d_j^2}{d_j^2 + \lambda}$. When $d_j$ is small (corresponding to directions of low variance in $\mathbf{X}$), this factor is close to zero and the shrinkage is strong. When $d_j$ is large, the factor is close to one and shrinkage is mild.

Intuition: Ridge shrinks most in directions where the data provides little information (small singular values). This is precisely where OLS estimates are most unstable, making Ridge an elegant bias-variance tradeoff.
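The SVD expression can be compared against the closed-form solution directly. A minimal sketch, assuming NumPy and random data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 40, 4, 3.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Thin SVD: X = U @ diag(d) @ Vt, with d sorted from largest to smallest
U, d, Vt = np.linalg.svd(X, full_matrices=False)

beta_svd = Vt.T @ ((d / (d**2 + lam)) * (U.T @ y))       # Ridge via SVD
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ols = Vt.T @ ((U.T @ y) / d)                        # OLS via SVD

# Shrinkage of each principal direction relative to OLS
shrink = d**2 / (d**2 + lam)
```

Since `d` is sorted in decreasing order, `shrink` is also decreasing: the directions with the smallest singular values are shrunk the most.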

Lasso Regression (L1 Regularization)

Lasso (Least Absolute Shrinkage and Selection Operator) uses the sum of absolute values of coefficients as the penalty. This seemingly small change has dramatic consequences.

Mathematical Formulation

Lasso Regression Objective

$\hat{\boldsymbol{\beta}}^{\text{lasso}} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$

Equivalently: $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1$

Unlike Ridge, Lasso does not have a closed-form solution because the L1 norm is not differentiable at zero. We must use iterative algorithms like coordinate descent.

Why Lasso Produces Sparsity

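The magic of Lasso lies in the soft thresholding operation. For a single variable scaled so that $\mathbf{x}^T\mathbf{x} = 1$ (or, more generally, an orthonormal design), and with the RSS scaled by 1/2 (a common convention that only rescales $\lambda$), the Lasso solution has the form: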

Soft Thresholding Operator

$\hat{\beta}^{\text{lasso}} = \text{sign}(\hat{\beta}^{\text{OLS}}) \cdot \max(|\hat{\beta}^{\text{OLS}}| - \lambda, 0)$

Also written as: $S(\hat{\beta}^{\text{OLS}}, \lambda)$

This formula reveals everything:

If $|\hat{\beta}^{\text{OLS}}| < \lambda$, the coefficient is set exactly to zero. The feature is eliminated from the model.
If $|\hat{\beta}^{\text{OLS}}| > \lambda$, the coefficient is shrunk by exactly $\lambda$ toward zero.
The sign of the coefficient is preserved (positive stays positive, negative stays negative).
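The three cases above fit in a one-line function. A minimal sketch, assuming NumPy; the name `soft_threshold` and the example values are illustrative:

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical OLS coefficients: two large, two smaller than lam in magnitude
beta_ols = np.array([3.0, -0.4, 0.9, -2.5])
beta_lasso = soft_threshold(beta_ols, lam=1.0)
```

With `lam=1.0`, the entries of magnitude 0.4 and 0.9 are zeroed, while 3.0 and -2.5 each move one unit toward zero and keep their signs.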

Ridge (L2) Shrinkage

Coefficients shrink proportionally. For an orthonormal design, $\hat{\beta}^{\text{ridge}} = \frac{1}{1+\lambda}\hat{\beta}^{\text{OLS}}$. They get arbitrarily close to zero but never exactly zero.

Lasso (L1) Shrinkage

Coefficients shrink by a constant amount, and small ones become exactly zero. This is automatic feature selection!
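The contrast between the two kinds of shrinkage can be seen on a design with orthonormal columns. This is a sketch under stated assumptions: NumPy, a design built with a QR factorization, a noise-free response, and the 1/2-scaled RSS convention for the Lasso threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 1.0
Q, _ = np.linalg.qr(rng.normal(size=(n, 4)))   # columns satisfy Q.T @ Q = I
y = Q @ np.array([3.0, -0.4, 0.9, -2.5])       # noise-free response (illustrative)

beta_ols = Q.T @ y                             # OLS for an orthonormal design
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(4), Q.T @ y)
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

ridge_matches = np.allclose(beta_ridge, beta_ols / (1.0 + lam))
n_zero_ridge = int(np.sum(beta_ridge == 0.0))  # Ridge keeps every feature
n_zero_lasso = int(np.sum(beta_lasso == 0.0))  # Lasso drops the small ones
```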

Geometric Interpretation

The difference between Ridge and Lasso becomes crystal clear when we view them geometrically. Both problems can be written as constrained optimization:

Ridge Constraint

$\sum_{j=1}^p \beta_j^2 \leq t$

A ball (a disk in 2D), bounded by a sphere of radius $\sqrt{t}$

Lasso Constraint

$\sum_{j=1}^p |\beta_j| \leq t$

A diamond (rotated square in 2D)

The regularized solution is found where the elliptical contours of the RSS (centered at the OLS solution) first touch the constraint region. For the smooth circle of Ridge, this almost always happens at a point on the curved boundary where neither coordinate is zero, giving a non-sparse solution. For the diamond of Lasso, the contours often hit a corner, where one or more coordinates are exactly zero.
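In the special case of circular RSS contours (an orthonormal design), the constrained solution is the Euclidean projection of the OLS point onto the constraint region. The sketch below assumes that special case, uses NumPy, and takes `r` as the radius of both regions. The L1 projection uses a standard sort-based method, and the function names are illustrative.

```python
import numpy as np

def project_l2_ball(v, r):
    """Closest point to v with ||b||_2 <= r: rescale if v lies outside."""
    norm = np.linalg.norm(v)
    return v.copy() if norm <= r else v * (r / norm)

def project_l1_ball(v, r):
    """Closest point to v with ||b||_1 <= r (sort-based projection)."""
    if np.abs(v).sum() <= r:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - r)[0][-1]     # last index that stays active
    theta = (css[rho] - r) / (rho + 1)           # common shrinkage amount
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

r = 1.5
ols_a = np.array([2.5, 0.5])    # one dominant coordinate
ols_b = np.array([2.5, 1.8])    # two comparable coordinates
ridge_a, lasso_a = project_l2_ball(ols_a, r), project_l1_ball(ols_a, r)
ridge_b, lasso_b = project_l2_ball(ols_b, r), project_l1_ball(ols_b, r)
```

For `ols_a` the L1 projection lands on a corner of the diamond (second coordinate exactly zero), while the L2 projection keeps both coordinates non-zero. For `ols_b` the L1 projection lands on an edge, which illustrates that sparsity is common but not guaranteed.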

Interactive: Ridge vs Lasso Geometry

Explore how the different constraint shapes lead to different solutions. Adjust the OLS solution and constraint size to see when Lasso produces zeros at the corners.

[Interactive figure: Geometric Interpretation: Ridge vs Lasso. Why Lasso produces sparse solutions: corners vs smooth boundaries. Controls: constraint radius t, OLS solution, loss-contour toggle. Snapshot at t = 1.50 with OLS solution (2.5, 1.8): Ridge gives (1.217, 0.876), with both coefficients shrunk proportionally; Lasso gives (0.872, 0.628), a point on an edge of the diamond rather than a corner.]

Why the Geometry Matters

The elliptical contours represent equal loss (RSS). The regularized solution is where the smallest contour touches the constraint region. For the L2 circle, this is almost always on the smooth boundary. For the L1 diamond, it often hits a corner where one or more coefficients are exactly zero. This is why Lasso performs automatic feature selection!

Bias-Variance Tradeoff

Regularization is fundamentally about the bias-variance tradeoff. By introducing bias (shrinking coefficients away from their OLS values), we can achieve a larger reduction in variance, leading to lower overall prediction error.

Prediction Error Decomposition

$\text{MSE} = \underbrace{\text{Bias}^2}_{\text{Increases with } \lambda} + \underbrace{\text{Variance}}_{\text{Decreases with } \lambda} + \text{Irreducible Error}$

As regularization strength $\lambda$ increases:

Bias increases: We're pulling coefficients away from the OLS estimates, which are unbiased for the true coefficients when the linear model is correct.
Variance decreases: The estimates become less sensitive to the particular training sample.
MSE has a U-shape: There's an optimal $\lambda$ where the total error is minimized.
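These trends can be reproduced with a small simulation. The sketch below assumes NumPy, a fixed random design, and Gaussian noise; it measures squared bias and variance of the Ridge coefficient estimates over repeated noise draws. All sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, n_rep = 20, 10, 2.0, 2000
X = rng.normal(size=(n, p))                   # fixed design
beta_true = rng.normal(size=p)
noise = rng.normal(scale=sigma, size=(n_rep, n))
Y = X @ beta_true + noise                     # one simulated response per row

lambdas = [0.0, 1.0, 10.0, 100.0]
bias_sq, variance = [], []
for lam in lambdas:
    # H maps a response vector to its Ridge coefficient estimate
    H = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    estimates = Y @ H.T                       # shape (n_rep, p)
    bias_sq.append(np.sum((estimates.mean(axis=0) - beta_true) ** 2))
    variance.append(np.sum(estimates.var(axis=0)))
```

Summing `bias_sq` and `variance` for each value in `lambdas` gives the estimation error of the coefficients, which is the quantity whose minimum defines the best trade-off in this setup.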

Interactive: Bias-Variance Demo

Watch how regularization trades bias for variance. The thin lines show individual fits from different training samples (variance), while the thick lines show the average (bias is the deviation from the true line).

[Interactive figure: Bias-Variance Tradeoff with Regularization. Controls: regularization strength (lambda), noise level, sample size. Snapshot at lambda = 1.00, noise level 0.50, sample size 20: OLS has bias 0.022, variance 0.011, MSE 0.012; Ridge has bias 0.026, variance 0.011, MSE 0.012. At these settings OLS has the lower MSE; the demo suggests increasing the noise or decreasing the sample size.]

The Bias-Variance Tradeoff

Notice how the blue OLS lines spread out widely (high variance) but center on the true line (low bias). The orange Ridge lines cluster tightly (low variance) but are slightly off from the true line (more bias). When variance reduction exceeds bias increase, regularization wins!

Regularization Paths

A regularization path shows how coefficients change as $\lambda$ varies from 0 (OLS) to infinity (all zeros). This visualization is invaluable for understanding which features are most important and how they relate to each other.

Ridge Path

All coefficients shrink smoothly toward zero together. Coefficients never become exactly zero, so the model always uses all features.

Lasso Path

Coefficients hit zero at different values of $\lambda$. Features "drop out" of the model one by one, revealing a hierarchy of importance. The path is piecewise linear.
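Both kinds of path can be traced cheaply when the design has orthonormal columns, because each solution is then a simple function of the OLS estimate. The sketch below makes that assumption (and uses the 1/2-scaled RSS convention for Lasso), with NumPy and the five "true" coefficient values shown in the path explorer on this page; the noise level and grid are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))      # orthonormal columns
beta_true = np.array([2.5, -1.8, 1.2, 0.3, -0.1])
y = Q @ beta_true + rng.normal(scale=0.05, size=n)
beta_ols = Q.T @ y

lambdas = np.logspace(-3, 2, 50)                  # 0.001 (weak) to 100 (strong)

# Ridge path: proportional shrinkage, no coefficient is ever exactly zero
ridge_path = np.array([beta_ols / (1.0 + lam) for lam in lambdas])

# Lasso path: soft thresholding, coefficients drop out one at a time
lasso_path = np.array([
    np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)
    for lam in lambdas
])
n_active = (lasso_path != 0).sum(axis=1)          # features still in the model
```

Reading `n_active` from left to right shows the order in which features leave the Lasso model as the penalty grows.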

Interactive: Regularization Path Explorer

Explore how Ridge and Lasso shrink coefficients differently. Notice how Lasso sets coefficients to zero while Ridge only approaches zero.

[Interactive figure: Regularization Path Explorer. Coefficients versus regularization strength on a log scale from 0.001 (weak) to 100 (strong), with an optional MSE curve; the reported optimal strength is 0.132. Snapshot of the displayed coefficients at strength 1.000 (true values in parentheses): Age 1.692 (2.5), Income -1.079 (-1.8), Education 0.645 (1.2), Experience 0.146 (0.3), Location -0.045 (-0.1).]

Ridge Behavior

Ridge regression shrinks all coefficients smoothly toward zero but never exactly to zero. Useful when you believe all features have some predictive power.

Selecting the Regularization Parameter

The regularization parameter $\lambda$ is a hyperparameter that controls the bias-variance tradeoff. Choosing it well is crucial for good performance.

Cross-Validation: The gold standard. Split data into k folds, fit models for each $\lambda$ value on k-1 folds, evaluate on the held-out fold. Choose the $\lambda$ with the lowest average validation error.
1-Standard-Error Rule: Instead of the minimum-error $\lambda$, choose the largest $\lambda$ within 1 SE of the minimum. This gives a simpler, more interpretable model with similar performance.
Information Criteria: AIC, BIC, and related criteria balance fit against complexity without requiring held-out data. Useful for very large datasets where CV is expensive.
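The cross-validation procedure and the 1-SE rule can be sketched for Ridge in a few lines. This is a minimal illustration assuming NumPy and synthetic data; the helper names `ridge_fit` and `cv_ridge` are hypothetical, and the intercept is handled by centering within each training fold.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge with an unpenalized intercept, via centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return beta, y_mean - x_mean @ beta

def cv_ridge(X, y, lambdas, k=5, seed=0):
    """Return (lambda_min, lambda_1se, mean CV error, SE of CV error)."""
    lambdas = np.asarray(lambdas)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = np.empty((len(lambdas), k))
    for f, val_idx in enumerate(folds):
        train_idx = np.concatenate([folds[g] for g in range(k) if g != f])
        for i, lam in enumerate(lambdas):
            beta, b0 = ridge_fit(X[train_idx], y[train_idx], lam)
            pred = b0 + X[val_idx] @ beta
            errors[i, f] = np.mean((y[val_idx] - pred) ** 2)
    mean_err = errors.mean(axis=1)
    se_err = errors.std(axis=1, ddof=1) / np.sqrt(k)
    i_min = int(np.argmin(mean_err))
    # 1-SE rule: largest lambda whose mean error is within one SE of the minimum
    within = np.nonzero(mean_err <= mean_err[i_min] + se_err[i_min])[0]
    i_1se = int(within[np.argmax(lambdas[within])])
    return lambdas[i_min], lambdas[i_1se], mean_err, se_err

# Synthetic data: 3 informative features out of 15 (illustrative)
rng = np.random.default_rng(7)
n, p = 60, 15
X = rng.normal(size=(n, p))
beta_true = np.concatenate([[2.0, -1.5, 1.0], np.zeros(p - 3)])
y = 1.0 + X @ beta_true + rng.normal(scale=1.0, size=n)

lambdas = np.logspace(-2, 3, 30)
lam_min, lam_1se, mean_err, se_err = cv_ridge(X, y, lambdas, k=5)
```

By construction `lam_1se` is never smaller than `lam_min`, so the 1-SE choice is the more heavily regularized of the two.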

Interactive: Cross-Validation for Lambda Selection

Explore how cross-validation finds the optimal regularization strength. Notice the U-shaped validation curve and how the 1-SE rule selects a more regularized model.

[Interactive figure: Cross-Validation for Lambda Selection. Controls: number of folds k, data noise level. Snapshot at k = 5 and noise level 1.00: the minimum-CV-error choice is lambda = 2.8118 (CV MSE 1.1497); the 1-SE choice is lambda = 3.3932 (CV MSE 1.1628).]

The 1-Standard-Error Rule

While the minimum CV error gives the "optimal" lambda, practitioners often use the 1-SE rule: choose the largest lambda (simplest model) whose CV error is within 1 standard error of the minimum. This leads to simpler, more interpretable models with similar predictive performance.

Elastic Net: Best of Both Worlds

What if you want the sparsity of Lasso but also need to handle groups of correlated features? Elastic Net combines both L1 and L2 penalties:

Elastic Net Objective

$\hat{\boldsymbol{\beta}}^{\text{EN}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2 \right\}$

Often parameterized as: $\lambda[(1-\alpha)\|\boldsymbol{\beta}\|_2^2/2 + \alpha\|\boldsymbol{\beta}\|_1]$

Method	L1 Penalty Share	L2 Penalty Share	Sparsity	Handles Correlation
Ridge	0	100%	No	Yes (shrinks together)
Lasso	100%	0	Yes	No (picks one arbitrarily)
Elastic Net	Mixed	Mixed	Yes	Yes (groups stay together)

When to use Elastic Net: When you have groups of correlated features and want some sparsity but don't want the model to arbitrarily select just one from each group. Common in genomics, text analysis, and any domain with highly correlated predictors.
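The grouping behavior can be illustrated with a small coordinate descent solver for the $\lambda_1$, $\lambda_2$ form of the Elastic Net objective. This is a sketch, not a production solver: it assumes NumPy, omits the intercept, and uses a hypothetical function name `elastic_net_cd`. Setting the derivative for one coordinate to zero gives the update $\beta_j = S(\rho_j, \lambda_1/2) / (z_j + \lambda_2)$, where $\rho_j$ is the inner product of column $j$ with the partial residual and $z_j$ is the squared norm of column $j$.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_iter=500, tol=1e-10):
    """Minimize ||y - X b||^2 + lam1 * ||b||_1 + lam2 * ||b||_2^2 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    z = (X ** 2).sum(axis=0)          # squared column norms
    resid = y.copy()                  # equals y - X @ beta while beta = 0
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Inner product of column j with the partial residual
            rho = X[:, j] @ resid + z[j] * beta[j]
            new = soft_threshold(rho, lam1 / 2.0) / (z[j] + lam2)
            delta = new - beta[j]
            if delta != 0.0:
                resid -= X[:, j] * delta   # keep the residual up to date
                beta[j] = new
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

# Two identical columns plus one unrelated column (illustrative data)
rng = np.random.default_rng(8)
n = 100
x = rng.normal(size=n)
X = np.column_stack([x, x, rng.normal(size=n)])
y = 3.0 * x + rng.normal(scale=0.1, size=n)

b_lasso = elastic_net_cd(X, y, lam1=1.0, lam2=0.0)   # pure L1 penalty
b_en = elastic_net_cd(X, y, lam1=1.0, lam2=10.0)     # mixed penalty
```

With the pure L1 penalty, this solver puts all the weight on the first of the two identical columns, an arbitrary choice driven by update order. With the L2 term added, the two duplicates receive equal coefficients.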

Deep Learning Connection

Regularization in linear models directly translates to deep learning. Understanding these connections illuminates why neural network training uses similar techniques.

🔧 Weight Decay = Ridge

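Setting weight_decay in an optimizer such as torch.optim.SGD adds $\lambda w$ to each weight's gradient, which is the gradient of the penalized loss $L + \frac{\lambda}{2}\sum w^2$. Up to the factor of 1/2, this is the Ridge penalty applied to each layer's weights. With adaptive optimizers such as Adam, this coupled L2 penalty is not the same as decoupled weight decay (as in AdamW).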

✂️ L1 for Pruning

L1 regularization on neural network weights encourages sparse networks. Small weights become exactly zero, effectively pruning connections. This is used for network compression and efficient inference.

📊 Bayesian Interpretation

L2 regularization is equivalent to placing a Gaussian prior on weights. L1 is a Laplace prior. This connection enables Bayesian neural networks and uncertainty quantification.

🎯 Feature Selection

Some sparse attention variants use L1-like penalties to select important features. Group Lasso selects entire groups of features (or neurons) together and is used in structured pruning.
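The Bayesian interpretation above can be checked numerically for the Gaussian case. With noise variance $\sigma^2$ and prior variance $\tau^2$, the negative log posterior is $\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \frac{1}{2\tau^2}\|\boldsymbol{\beta}\|_2^2$ up to a constant, which is the Ridge objective with $\lambda = \sigma^2/\tau^2$. The sketch below assumes NumPy and random data, with illustrative values for `sigma` and `tau`.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 4
sigma, tau = 0.5, 2.0                 # noise std and prior std (illustrative)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

lam = sigma**2 / tau**2               # implied Ridge penalty
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def neg_log_posterior_grad(beta):
    """Gradient of the negative log posterior under Gaussian noise and prior."""
    return -(X.T @ (y - X @ beta)) / sigma**2 + beta / tau**2

# The gradient vanishes at the Ridge estimate, so Ridge is the MAP estimate
grad_at_ridge = neg_log_posterior_grad(beta_ridge)
```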

PyTorch Example: Weight Decay

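# L2 regularization in PyTorch
import torch

model = torch.nn.Linear(10, 1)  # placeholder model for illustration

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01  # This is lambda
)

# weight_decay adds lambda * w to each gradient, i.e. the gradient of
# (lambda / 2) * sum(w^2) added to the loss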
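For plain gradient descent, the link between an L2 penalty and weight decay is a one-line identity. The sketch below uses NumPy rather than PyTorch so that it runs without a deep learning framework; the gradient vector is a random stand-in for a real backpropagated gradient, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
w = rng.normal(size=5)               # current weights
grad_loss = rng.normal(size=5)       # stand-in for dL/dw from backpropagation
lr, lam = 0.1, 0.01

# (a) Gradient step on the penalized loss L + (lam / 2) * sum(w^2)
w_l2 = w - lr * (grad_loss + lam * w)

# (b) Weight decay: shrink the weights, then take a step on L alone
w_decay = (1.0 - lr * lam) * w - lr * grad_loss

steps_match = np.allclose(w_l2, w_decay)
```

The identity relies on the update being a plain gradient step. Adaptive optimizers rescale the penalty gradient along with the loss gradient, which is why decoupled variants such as torch.optim.AdamW apply the decay separately.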

Scaling Matters: In deep learning, features often have very different scales. Unlike linear regression where we standardize inputs, neural networks learn their own scaling. But regularization still penalizes large weights equally, so proper initialization and normalization layers become important.

Python Implementation

Here are complete implementations of Ridge and Lasso regression from scratch. Ridge uses the closed-form solution while Lasso requires coordinate descent.

Ridge and Lasso from Scratch

regularization.py

Import Libraries: NumPy for linear algebra and numerical operations.
Ridge Regression Class: Implements Ridge regression with an L2 penalty. The penalty term lambda * ||beta||^2 shrinks coefficients toward zero.
Constructor: Stores the regularization parameter alpha (commonly called lambda in statistics). Higher values mean more regularization.
Fit Method: Learns the regularized coefficients from training data using the closed-form Ridge solution.
Add Intercept Column: Prepends a column of ones to X to absorb the intercept term into the coefficient vector. The intercept is not regularized.
Ridge Closed-Form Solution: beta = (X'X + alpha*I)^(-1) X'y. The identity matrix I is modified so that the intercept (first coefficient) is not penalized. Example: when alpha=0, this reduces to OLS, (X'X)^(-1) X'y.
Extract Intercept and Coefficients: Separates the intercept (first element) from the feature coefficients for convenient access.
Predict Method: Applies the learned model to new data: y_hat = intercept + X @ coefficients.
Lasso Regression Class: Implements Lasso regression with an L1 penalty. Uses coordinate descent since there is no closed-form solution.
Coordinate Descent Algorithm: Iteratively optimizes one coefficient at a time while holding the others fixed. This exploits the structure of the L1 penalty.
Initialize Coefficients: Start with all zeros. The algorithm iteratively refines them.
Coordinate Descent Loop: Cycle through each coefficient and update it using the soft-thresholding operator, which is the key to Lasso's sparsity.
Compute Partial Residual: For feature j, compute the residual with the contribution of feature j removed and all other coefficients held fixed. This is what the update for coefficient j fits.
Soft Thresholding: The core Lasso update: if |rho_j| <= alpha * n, set beta_j = 0. Otherwise, shrink rho_j toward zero by alpha * n and divide by z_j, the squared norm of column j. The factor n appears because this implementation scales the RSS by 1/(2n). Example: beta_j = S(rho_j, alpha * n) / z_j, where S(rho, t) = sign(rho) * max(|rho| - t, 0).
Check Convergence: Stop when no coefficient changes by more than the tolerance between iterations, which indicates the algorithm has reached a stable solution.

import numpy as np


# Ridge Regression (L2 Regularization)
class RidgeRegression:
    """Ridge regression via the closed-form solution."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Regularization strength (lambda)
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X, y):
        # Add intercept column
        n, p = X.shape
        X_aug = np.column_stack([np.ones(n), X])

        # Ridge closed-form: (X'X + alpha*I)^(-1) X'y
        I = np.eye(p + 1)
        I[0, 0] = 0  # Don't regularize intercept
        XtX = X_aug.T @ X_aug
        Xty = X_aug.T @ y
        # Solve the linear system instead of forming an explicit inverse
        theta = np.linalg.solve(XtX + self.alpha * I, Xty)

        # Extract intercept and coefficients
        self.intercept_ = theta[0]
        self.coef_ = theta[1:]
        return self

    def predict(self, X):
        return self.intercept_ + X @ self.coef_


# Lasso Regression (L1 Regularization)
class LassoRegression:
    """Lasso via coordinate descent on (1/(2n)) * RSS + alpha * ||beta||_1."""

    def __init__(self, alpha=1.0, max_iter=1000, tol=1e-4):
        self.alpha = alpha
        self.max_iter = max_iter
        self.tol = tol
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X, y):
        n, p = X.shape

        # Center the data so the intercept is not penalized
        X_mean = X.mean(axis=0)
        y_mean = y.mean()
        X_centered = X - X_mean
        y_centered = y - y_mean

        # Initialize coefficients
        beta = np.zeros(p)

        # Coordinate descent
        for iteration in range(self.max_iter):
            beta_old = beta.copy()

            for j in range(p):
                # Squared norm of column j
                z_j = (X_centered[:, j] ** 2).sum()
                if z_j == 0:
                    beta[j] = 0  # Constant column carries no information
                    continue

                # Compute partial residual (feature j's contribution removed)
                residual = y_centered - X_centered @ beta
                residual += X_centered[:, j] * beta[j]

                # Compute rho_j
                rho_j = X_centered[:, j] @ residual

                # Soft thresholding with threshold alpha * n
                if rho_j < -self.alpha * n:
                    beta[j] = (rho_j + self.alpha * n) / z_j
                elif rho_j > self.alpha * n:
                    beta[j] = (rho_j - self.alpha * n) / z_j
                else:
                    beta[j] = 0  # Sparsity!

            # Check convergence
            if np.max(np.abs(beta - beta_old)) < self.tol:
                break

        self.coef_ = beta
        self.intercept_ = y_mean - X_mean @ beta
        return self

    def predict(self, X):
        return self.intercept_ + X @ self.coef_

In Practice: Use sklearn.linear_model.Ridge, sklearn.linear_model.Lasso, or sklearn.linear_model.ElasticNet. For large-scale problems, glmnet (available in Python via glmnet-python) implements highly optimized coordinate descent with warm starts.


Summary

Key Takeaways

✅ Regularization adds a penalty to the loss function that constrains coefficient magnitudes, preventing overfitting.

✅ Ridge (L2) shrinks all coefficients smoothly toward zero but never exactly to zero. It has a closed-form solution.

✅ Lasso (L1) can set coefficients exactly to zero, performing automatic feature selection. It requires iterative optimization.

✅ The geometric interpretation explains why: L1's diamond has corners where sparsity occurs.

✅ Regularization trades bias for variance. The optimal lambda balances this tradeoff.

✅ Cross-validation is the standard method for selecting lambda. The 1-SE rule favors simpler models.

✅ Elastic Net combines L1 and L2, providing sparsity while handling correlated features.

✅ Weight decay in deep learning corresponds to L2 regularization (the two coincide for plain SGD); L1 encourages sparsity, which supports network pruning.

Looking Ahead

In the next chapter, we'll explore Generalized Linear Models (GLMs), which extend regression to non-normal response distributions like binary outcomes (logistic regression) and counts (Poisson regression). The regularization techniques you've learned here apply directly to GLMs, making Ridge and Lasso logistic regression possible.