Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Explain why regularization helps in high dimensions
• Connect regularized MLE to Bayesian inference with priors
• Compare Ridge, LASSO, and Elastic Net
• Understand the bias-variance tradeoff in regularization

🔧 Practical Skills

• Choose appropriate regularization for your problem
• Select λ using cross-validation
• Interpret regularization paths
• Implement regularized estimators in Python

🧠 Deep Learning Connections

• Weight decay: L2 regularization in neural networks
• Dropout: Implicit regularization through stochastic training
• Early stopping: Regularization via training duration
• Sparsity in transformers: Attention pruning and efficiency

Why Regularization?

Standard MLE has a fundamental problem: overfitting. When the number of parameters p is large relative to sample size n, MLE produces estimates with high variance — they fit the training data perfectly but generalize poorly.

The High-Dimensional Challenge

Classical Regime (n ≫ p)

Many observations, few parameters. MLE works well — low bias, low variance. Asymptotic theory applies.

✓

High-Dimensional (p ≈ n or p > n)

Parameters comparable to or exceeding observations. MLE often fails — estimates are unstable, variance explodes, may not even be unique!

⚠️

The Core Insight: Regularization introduces bias deliberately to reduce variance. If parameters are expected to be "small" or "sparse," we can encode this prior belief to stabilize estimation.

Penalized Maximum Likelihood

Regularized MLE adds a penalty term to the likelihood:

Penalized MLE Objective

\hat{\boldsymbol{\theta}}_\lambda = \arg\max_{\boldsymbol{\theta}} \left[ \log L(\boldsymbol{\theta}; \mathbf{X}) - \lambda \cdot \text{Penalty}(\boldsymbol{\theta}) \right]

Or equivalently, minimize: -log L + λ · Penalty

Key components:

Log-likelihood: Measures how well parameters fit the data (same as standard MLE)
Penalty: Discourages "complex" parameter values (large, many non-zero, etc.)
λ ≥ 0: Regularization strength — λ=0 gives standard MLE, λ→∞ shrinks to zero

Ridge Regression (L2 Regularization)

Ridge regression uses the squared L2 norm as penalty:

Ridge Regression

\hat{\boldsymbol{\beta}}_\text{ridge} = \arg\min_{\boldsymbol{\beta}} \left[ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \right]

= (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}

Key properties:

Property	Description
Closed form	Unique solution exists even when p > n (unlike OLS)
Shrinkage	All coefficients shrink toward zero, but none exactly zero
Bias-variance	Introduces bias, but reduces variance
Stability	(X^T X + λI) always invertible for λ > 0

Bayesian Interpretation

Ridge has a beautiful Bayesian interpretation:

Bayesian Ridge = MAP with Gaussian Prior

Likelihood: y | β ~ N(Xβ, σ²I)

Prior: β ~ N(0, τ²I) (spherical Gaussian)

Posterior mode (MAP): β̂_ridge with λ = σ²/τ²

Regularization strength λ encodes prior belief about parameter magnitudes!

Small λ (large τ²): Weak prior, trust data more → closer to MLE.
Large λ (small τ²): Strong prior, shrink toward zero more.

LASSO (L1 Regularization)

The LASSO (Least Absolute Shrinkage and Selection Operator) uses the L1 norm:

LASSO

\hat{\boldsymbol{\beta}}_\text{lasso} = \arg\min_{\boldsymbol{\beta}} \left[ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1 \right]

where $\|\boldsymbol{\beta}\|_1 = \sum_j |\beta_j|$

The key difference from Ridge: LASSO produces sparse solutions — some coefficients are exactly zero!

Why L1 Induces Sparsity

Geometry of L1 vs. L2

L2 (Ridge)

Constraint region is a sphere. The elliptical loss contours typically touch the sphere away from axes → coefficients shrink but stay non-zero.

L1 (LASSO)

Constraint region is a diamond (corners at axes). Loss contours often touch corners → coefficients become exactly zero (sparse!).

Aspect	Ridge (L2)	LASSO (L1)
Sparsity	No — all coefficients non-zero	Yes — some exactly zero
Solution	Closed form	Requires optimization (coordinate descent)
Correlated features	Keeps all, shrinks together	Arbitrarily selects one
Bayesian prior	Gaussian	Laplace (double exponential)
Use case	Prediction, multicollinearity	Feature selection, interpretation

Elastic Net

Elastic Net combines L1 and L2 penalties — the best of both worlds:

Elastic Net

\hat{\boldsymbol{\beta}}_\text{EN} = \arg\min_{\boldsymbol{\beta}} \left[ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2 \right]

Often parameterized as: λ(α‖β‖₁ + (1-α)‖β‖₂²) where α ∈ [0,1] controls the mix

Advantages:

Grouping effect: Correlated features are selected/rejected together
Sparsity + stability: L1 provides selection, L2 provides stability
p > n: Can select more than n features (unlike LASSO alone)

Choosing λ via Cross-Validation

The regularization parameter λ controls the bias-variance tradeoff. We choose it via cross-validation:

K-Fold Cross-Validation for λ

Create a grid of λ values (typically on log scale)
For each λ: do K-fold CV, compute average prediction error
Choose λ that minimizes CV error (or use "one-standard-error rule")
Refit on full data with chosen λ

One-Standard-Error Rule: Instead of λ that minimizes CV error, choose the largest λ (most regularization) within one SE of the minimum. This gives simpler, more interpretable models with similar predictive performance.

Regularization Paths

A regularization path shows how coefficients change as λ varies:

Interpreting Regularization Paths

• λ = 0 (left): Full MLE — all coefficients at their unrestricted values

• λ → ∞ (right): Full shrinkage — all coefficients → 0

• LASSO path: Coefficients enter at specific λ values (sparse)

• Ridge path: All coefficients shrink continuously (no selection)

• Vertical line: Optimal λ from cross-validation

Path stability: In LASSO, the order in which features enter the model depends on correlations. Highly correlated features may swap order between different samples — use Elastic Net for more stable selection.

AI/ML Applications

🧠 Weight Decay in Neural Networks

L2 regularization on weights is called weight decay. It's ubiquitous in deep learning:

L = L_\text{data} + \frac{\lambda}{2}\sum_i w_i^2

🎲 Dropout as Regularization

Randomly zeroing activations during training is an implicit regularizer. It's approximately equivalent to an ensemble of sparse networks, preventing co-adaptation.

⏱️ Early Stopping

Stopping training before convergence acts as regularization. Early in training, the model learns general patterns; later it memorizes noise. Equivalent to L2 for linear models!

✂️ Sparse Attention / Pruning

L1-style sparsity is used to prune neural networks — removing weights or attention heads that are close to zero reduces model size while maintaining performance.

Regularization ↔ Bayesian Priors

Regularization	Equivalent Prior
L2 (Ridge)	Gaussian prior: β ~ N(0, τ²I)
L1 (LASSO)	Laplace prior: β ~ Laplace(0, b)
Elastic Net	Mixture of Gaussian and Laplace
Dropout	Spike-and-slab (approximate)
Weight decay + L1	Horseshoe prior (approximate)

Python Implementation

🐍python

1import numpy as np
2from scipy.optimize import minimize
3from sklearn.linear_model import Ridge, Lasso, ElasticNet
4from sklearn.model_selection import cross_val_score
5from typing import Tuple
6
7# ============================================
8# Ridge Regression from Scratch
9# ============================================
10
11def ridge_regression(
12    X: np.ndarray,
13    y: np.ndarray,
14    lambda_: float
15) -> np.ndarray:
16    """
17    Ridge regression closed-form solution.
18
19    Parameters
20    ----------
21    X : array of shape (n, p)
22        Design matrix
23    y : array of shape (n,)
24        Response vector
25    lambda_ : float
26        Regularization parameter
27
28    Returns
29    -------
30    beta : array of shape (p,)
31        Ridge regression coefficients
32    """
33    n, p = X.shape
34    # (X^T X + λI)^{-1} X^T y
35    XtX = X.T @ X
36    regularized = XtX + lambda_ * np.eye(p)
37    beta = np.linalg.solve(regularized, X.T @ y)
38    return beta
39
40
41# ============================================
42# LASSO via Coordinate Descent
43# ============================================
44
45def soft_threshold(x: float, lambda_: float) -> float:
46    """Soft thresholding operator for LASSO."""
47    if x > lambda_:
48        return x - lambda_
49    elif x < -lambda_:
50        return x + lambda_
51    else:
52        return 0.0
53
54
55def lasso_coordinate_descent(
56    X: np.ndarray,
57    y: np.ndarray,
58    lambda_: float,
59    max_iter: int = 1000,
60    tol: float = 1e-6
61) -> np.ndarray:
62    """
63    LASSO via coordinate descent.
64
65    Parameters
66    ----------
67    X : array of shape (n, p)
68        Design matrix (should be standardized)
69    y : array of shape (n,)
70        Response vector
71    lambda_ : float
72        Regularization parameter
73    max_iter : int
74        Maximum iterations
75    tol : float
76        Convergence tolerance
77
78    Returns
79    -------
80    beta : array of shape (p,)
81        LASSO coefficients
82    """
83    n, p = X.shape
84    beta = np.zeros(p)
85
86    # Precompute X^T X diagonal (for standardized X, this is n)
87    X_col_norms_sq = np.sum(X**2, axis=0)
88
89    for iteration in range(max_iter):
90        beta_old = beta.copy()
91
92        for j in range(p):
93            # Compute partial residual (excluding feature j)
94            r_j = y - X @ beta + X[:, j] * beta[j]
95
96            # Compute OLS estimate for feature j
97            rho_j = X[:, j] @ r_j
98
99            # Apply soft thresholding
100            beta[j] = soft_threshold(rho_j, n * lambda_) / X_col_norms_sq[j]
101
102        # Check convergence
103        if np.max(np.abs(beta - beta_old)) < tol:
104            break
105
106    return beta
107
108
109# ============================================
110# Cross-Validation for Lambda Selection
111# ============================================
112
113def cv_select_lambda(
114    X: np.ndarray,
115    y: np.ndarray,
116    lambdas: np.ndarray,
117    method: str = 'ridge',
118    cv: int = 5
119) -> Tuple[float, np.ndarray]:
120    """
121    Select optimal lambda via cross-validation.
122
123    Parameters
124    ----------
125    X : array of shape (n, p)
126        Design matrix
127    y : array of shape (n,)
128        Response vector
129    lambdas : array
130        Grid of lambda values to try
131    method : str
132        'ridge', 'lasso', or 'elasticnet'
133    cv : int
134        Number of CV folds
135
136    Returns
137    -------
138    best_lambda : float
139        Optimal lambda
140    cv_errors : array
141        CV error for each lambda
142    """
143    cv_errors = []
144
145    for lam in lambdas:
146        if method == 'ridge':
147            model = Ridge(alpha=lam)
148        elif method == 'lasso':
149            model = Lasso(alpha=lam, max_iter=10000)
150        else:
151            model = ElasticNet(alpha=lam, l1_ratio=0.5, max_iter=10000)
152
153        # Negative MSE (sklearn convention)
154        scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
155        cv_errors.append(-scores.mean())
156
157    cv_errors = np.array(cv_errors)
158    best_idx = np.argmin(cv_errors)
159    best_lambda = lambdas[best_idx]
160
161    return best_lambda, cv_errors
162
163
164# ============================================
165# Regularization Path
166# ============================================
167
168def compute_regularization_path(
169    X: np.ndarray,
170    y: np.ndarray,
171    lambdas: np.ndarray,
172    method: str = 'ridge'
173) -> np.ndarray:
174    """
175    Compute coefficients for a range of lambda values.
176
177    Parameters
178    ----------
179    X : array of shape (n, p)
180        Design matrix
181    y : array of shape (n,)
182        Response vector
183    lambdas : array
184        Grid of lambda values
185    method : str
186        'ridge' or 'lasso'
187
188    Returns
189    -------
190    paths : array of shape (len(lambdas), p)
191        Coefficients at each lambda
192    """
193    p = X.shape[1]
194    paths = np.zeros((len(lambdas), p))
195
196    for i, lam in enumerate(lambdas):
197        if method == 'ridge':
198            paths[i] = ridge_regression(X, y, lam)
199        else:
200            paths[i] = lasso_coordinate_descent(X, y, lam)
201
202    return paths
203
204
205# ============================================
206# Demonstration
207# ============================================
208
209if __name__ == "__main__":
210    np.random.seed(42)
211
212    print("=" * 60)
213    print("REGULARIZED MLE: RIDGE, LASSO, ELASTIC NET")
214    print("=" * 60)
215
216    # Generate data with some irrelevant features
217    n, p = 100, 20
218    p_relevant = 5
219
220    X = np.random.randn(n, p)
221    beta_true = np.zeros(p)
222    beta_true[:p_relevant] = np.array([3, -2, 1.5, -1, 0.5])
223
224    y = X @ beta_true + np.random.randn(n) * 0.5
225
226    print(f"\nData: n={n}, p={p}, {p_relevant} relevant features")
227    print(f"True non-zero coefficients: {beta_true[:p_relevant]}")
228
229    # Ridge regression
230    print("\n--- Ridge Regression ---")
231    beta_ridge = ridge_regression(X, y, lambda_=1.0)
232    print(f"Ridge coefficients (λ=1): {beta_ridge[:p_relevant]}")
233    print(f"Sum of irrelevant coefficients: {np.sum(np.abs(beta_ridge[p_relevant:])):.4f}")
234
235    # LASSO
236    print("\n--- LASSO ---")
237    beta_lasso = lasso_coordinate_descent(X, y, lambda_=0.1)
238    n_nonzero = np.sum(np.abs(beta_lasso) > 1e-6)
239    print(f"LASSO coefficients (λ=0.1): {beta_lasso[:p_relevant]}")
240    print(f"Number of non-zero coefficients: {n_nonzero}")
241
242    # Cross-validation for lambda selection
243    print("\n--- Cross-Validation for λ Selection ---")
244    lambdas = np.logspace(-3, 1, 50)
245
246    best_lam_ridge, cv_ridge = cv_select_lambda(X, y, lambdas, method='ridge')
247    best_lam_lasso, cv_lasso = cv_select_lambda(X, y, lambdas, method='lasso')
248
249    print(f"Best λ for Ridge: {best_lam_ridge:.4f}")
250    print(f"Best λ for LASSO: {best_lam_lasso:.4f}")
251
252    # Fit with optimal lambda
253    beta_ridge_opt = ridge_regression(X, y, best_lam_ridge)
254    beta_lasso_opt = lasso_coordinate_descent(X, y, best_lam_lasso)
255
256    print("\n--- Comparison at Optimal λ ---")
257    print(f"{'Method':<15} {'MSE':<10} {'Sparsity':<10}")
258    print("-" * 35)
259
260    mse_ridge = np.mean((X @ beta_ridge_opt - X @ beta_true)**2)
261    mse_lasso = np.mean((X @ beta_lasso_opt - X @ beta_true)**2)
262
263    sparsity_ridge = np.sum(np.abs(beta_ridge_opt) < 1e-6)
264    sparsity_lasso = np.sum(np.abs(beta_lasso_opt) < 1e-6)
265
266    print(f"{'Ridge':<15} {mse_ridge:<10.4f} {sparsity_ridge:<10}")
267    print(f"{'LASSO':<15} {mse_lasso:<10.4f} {sparsity_lasso:<10}")

Summary

Key Takeaways

Regularization trades bias for reduced variance, essential when p is large relative to n.
Ridge (L2) shrinks all coefficients toward zero but keeps all non-zero. Best for prediction and handling multicollinearity.
LASSO (L1) sets some coefficients exactly to zero (sparse). Best for feature selection and interpretability.
Elastic Net combines L1 and L2, getting sparsity with stability for correlated features.
Bayesian interpretation: Regularization = MAP estimation with specific priors (Gaussian for L2, Laplace for L1).
λ selection: Use cross-validation with "one-SE rule" for simpler, more robust models.

Chapter Complete! You've now mastered the major methods of estimation: Method of Moments, Maximum Likelihood, EM Algorithm, Rao-Blackwell improvement, and regularized estimation. These tools form the foundation for all statistical modeling in machine learning.