Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- Explain why regularization helps in high dimensions
- Connect regularized MLE to Bayesian inference with priors
- Compare Ridge, LASSO, and Elastic Net
- Understand the bias-variance tradeoff in regularization
🔧 Practical Skills
- Choose appropriate regularization for your problem
- Select λ using cross-validation
- Interpret regularization paths
- Implement regularized estimators in Python
🧠 Deep Learning Connections
- Weight decay: L2 regularization in neural networks
- Dropout: Implicit regularization through stochastic training
- Early stopping: Regularization via training duration
- Sparsity in transformers: Attention pruning and efficiency
Why Regularization?
Standard MLE has a fundamental problem: overfitting. When the number of parameters p is large relative to the sample size n, MLE estimates have high variance: they can fit the training data almost perfectly yet generalize poorly.
The High-Dimensional Challenge
Classical Regime (n ≫ p)
Many observations, few parameters. MLE works well — low bias, low variance. Asymptotic theory applies.
High-Dimensional (p ≈ n or p > n)
Parameters comparable to or exceeding observations. MLE often fails: estimates are unstable, variance explodes, and for p > n the solution is not even unique.
The Core Insight: Regularization introduces bias deliberately to reduce variance. If parameters are expected to be "small" or "sparse," we can encode this prior belief to stabilize estimation.
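To make the variance explosion concrete, the sketch below evaluates the textbook formula Var(β̂) = σ²(X^T X)⁻¹ for ordinary least squares (the MLE of a Gaussian linear model) on simulated designs. The Gaussian design, the sample size n = 50, and the helper name `ols_total_variance` are illustrative assumptions, not part of the original text.

```python
import numpy as np

def ols_total_variance(n, p, sigma=1.0, seed=0):
    """Summed variance of the OLS coefficients for one simulated Gaussian design."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # Var(beta_hat) = sigma^2 (X^T X)^{-1}; the trace adds up the p coefficient variances
    return sigma**2 * np.trace(np.linalg.inv(X.T @ X))

n = 50
variances = {p: ols_total_variance(n, p) for p in (5, 25, 45)}
for p, v in variances.items():
    print(f"n={n}, p={p}: total OLS variance = {v:.3f}")
```

As p approaches n, the smallest eigenvalues of X^T X approach zero and the total variance grows sharply, which is the instability that regularization is meant to tame.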
Penalized Maximum Likelihood
Regularized MLE subtracts a penalty term from the log-likelihood:
Penalized MLE Objective
θ̂_λ = argmax_θ [ log L(θ) − λ · Penalty(θ) ]
Or equivalently, minimize: −log L(θ) + λ · Penalty(θ)
Key components:
- Log-likelihood: Measures how well parameters fit the data (same as standard MLE)
- Penalty: Discourages "complex" parameter values (large, many non-zero, etc.)
- λ ≥ 0: Regularization strength. λ = 0 gives standard MLE; λ → ∞ shrinks the penalized parameters to zero
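The three components above can be sketched numerically. The example below assumes a Gaussian linear model with σ² = 1 and a squared L2 penalty, and minimizes the penalized negative log-likelihood with a generic optimizer; the function name `penalized_nll` and the simulated data are hypothetical choices made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_nll(beta, X, y, lam):
    """Negative Gaussian log-likelihood (sigma^2 = 1, constants dropped) plus lam * ||beta||^2."""
    resid = y - X @ beta
    nll = 0.5 * resid @ resid
    penalty = beta @ beta
    return nll + lam * penalty

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(40)

# lam = 0 is plain MLE; larger lam pulls the estimate toward zero
fits = {lam: minimize(penalized_nll, np.zeros(3), args=(X, y, lam)).x
        for lam in (0.0, 10.0, 1000.0)}
for lam, beta_hat in fits.items():
    print(f"lam={lam:>7}: beta_hat={np.round(beta_hat, 3)}, norm={np.linalg.norm(beta_hat):.3f}")
```

Any other penalty (L1, Elastic Net) slots into the same structure by changing the `penalty` line, although non-smooth penalties generally need a specialized solver rather than a generic one.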
Ridge Regression (L2 Regularization)
Ridge regression uses the squared L2 norm as penalty:
Ridge Regression
β̂_ridge = argmin_β { ‖y − Xβ‖₂² + λ‖β‖₂² } = (X^T X + λI)⁻¹ X^T y
Key properties:
| Property | Description |
|---|---|
| Closed form | Unique solution exists even when p > n (unlike OLS) |
| Shrinkage | All coefficients shrink toward zero, but none exactly zero |
| Bias-variance | Introduces bias, but reduces variance |
| Stability | (X^T X + λI) always invertible for λ > 0 |
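A small numerical check of the stability and shrinkage rows, under the assumption of a random Gaussian design with p > n. It also uses the singular value decomposition X = U D V^T, under which the ridge fit scales each component of y along U by d² / (d² + λ); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 10, 30, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

gram = X.T @ X
rank = np.linalg.matrix_rank(gram)  # at most n, so gram is singular when p > n
print(f"rank(X^T X) = {rank} < p = {p}: OLS has no unique solution")

# Adding lam * I makes the system positive definite, so it can always be solved
beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)

# SVD view: ridge shrinks the fit along each left singular vector by d^2 / (d^2 + lam)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)
fitted_svd = U @ (shrink * (U.T @ y))
print("shrinkage factors:", np.round(shrink, 3))
```

Directions with small singular values are shrunk the most, which is why Ridge helps under multicollinearity.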
Bayesian Interpretation
Ridge has a beautiful Bayesian interpretation:
Bayesian Ridge = MAP with Gaussian Prior
Likelihood: y | β ~ N(Xβ, σ²I)
Prior: β ~ N(0, τ²I) (spherical Gaussian)
Posterior mode (MAP): β̂_ridge with λ = σ²/τ²
Regularization strength λ encodes prior belief about parameter magnitudes!
Large λ (small τ²): Strong prior, shrink toward zero more.
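The correspondence λ = σ²/τ² can be verified directly. The sketch below computes the posterior mode of the conjugate Gaussian model (which, for a Gaussian posterior, equals the posterior mean) and compares it with the ridge solution; the simulated data and the particular values of σ² and τ² are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
sigma2, tau2 = 0.8, 2.0  # noise variance and prior variance (illustrative values)
X = rng.standard_normal((n, p))
y = X @ rng.normal(0.0, np.sqrt(tau2), p) + rng.normal(0.0, np.sqrt(sigma2), n)

# Gaussian posterior: precision = X^T X / sigma^2 + I / tau^2, mode = precision^{-1} X^T y / sigma^2
post_precision = X.T @ X / sigma2 + np.eye(p) / tau2
beta_map = np.linalg.solve(post_precision, X.T @ y / sigma2)

# Ridge with lambda = sigma^2 / tau^2
lam = sigma2 / tau2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("MAP:  ", np.round(beta_map, 4))
print("Ridge:", np.round(beta_ridge, 4))
```

Multiplying the posterior-mode equation through by σ² gives exactly the ridge normal equations, so the two vectors agree up to floating-point error.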
LASSO (L1 Regularization)
The LASSO (Least Absolute Shrinkage and Selection Operator) uses the L1 norm:
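LASSO
β̂_lasso = argmin_β { (1/(2n))‖y − Xβ‖₂² + λ‖β‖₁ }
where ‖β‖₁ = Σⱼ |βⱼ| is the sum of absolute coefficient values. The 1/(2n) scaling of the squared error is a convention (the one used by the implementation later in this section); other scalings only rescale λ.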
The key difference from Ridge: LASSO produces sparse solutions — some coefficients are exactly zero!
Why L1 Induces Sparsity
Geometry of L1 vs. L2
L2 (Ridge)
Constraint region is a ball (a disk in two dimensions). The elliptical loss contours typically touch it away from the axes → coefficients shrink but stay non-zero.
L1 (LASSO)
Constraint region is a diamond (corners at axes). Loss contours often touch corners → coefficients become exactly zero (sparse!).
| Aspect | Ridge (L2) | LASSO (L1) |
|---|---|---|
| Sparsity | No — all coefficients non-zero | Yes — some exactly zero |
| Solution | Closed form | Requires optimization (coordinate descent) |
| Correlated features | Keeps all, shrinks together | Arbitrarily selects one |
| Bayesian prior | Gaussian | Laplace (double exponential) |
| Use case | Prediction, multicollinearity | Feature selection, interpretation |
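The sparsity contrast in the table is easiest to see in the special case of an orthonormal design (X^T X = I), where both estimators reduce to simple functions of the OLS coefficients. The sketch below assumes that special case; the function names and the example coefficients are hypothetical.

```python
import numpy as np

def ridge_shrink(b_ols, lam):
    # With X^T X = I, minimizing ||y - Xb||^2 + lam * ||b||^2 rescales every coefficient
    return b_ols / (1.0 + lam)

def lasso_shrink(b_ols, lam):
    # With X^T X = I, minimizing 0.5 * ||y - Xb||^2 + lam * ||b||_1 soft-thresholds each coefficient
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

b_ols = np.array([3.0, -1.2, 0.4, -0.1])
lam = 0.5
b_ridge = ridge_shrink(b_ols, lam)
b_lasso = lasso_shrink(b_ols, lam)
print("ridge:", b_ridge)
print("lasso:", b_lasso)
```

Ridge multiplies every coefficient by the same factor, so nothing reaches zero; LASSO subtracts a fixed amount and clips at zero, so the two coefficients smaller than λ in magnitude are removed entirely.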
Elastic Net
Elastic Net combines the L1 and L2 penalties, pairing LASSO's sparsity with Ridge's stability:
Elastic Net
β̂_EN = argmin_β { ‖y − Xβ‖₂² + λ₁‖β‖₁ + λ₂‖β‖₂² }
Often parameterized as: λ(α‖β‖₁ + (1-α)‖β‖₂²) where α ∈ [0,1] controls the mix
Advantages:
- Grouping effect: Correlated features are selected/rejected together
- Sparsity + stability: L1 provides selection, L2 provides stability
- p > n: Can select more than n features (unlike LASSO alone)
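The grouping effect can be illustrated with an extreme case: two features that are exact copies of each other. This is a sketch using scikit-learn's `Lasso` and `ElasticNet` estimators; the duplicated-column setup, the penalty values, and the tight `tol` are assumptions chosen to make the effect easy to see.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
n = 200
x = rng.standard_normal(n)
# Columns 0 and 1 are exact duplicates; column 2 is an independent feature
X = np.column_stack([x, x, rng.standard_normal(n)])
y = 2.0 * x + 0.5 * X[:, 2] + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1, tol=1e-10, max_iter=10000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=10000).fit(X, y)

print("LASSO coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```

Because the Elastic Net objective is strictly convex, identical features must receive identical coefficients. The LASSO objective is indifferent to how the weight is split between the copies, so which copy carries it depends on the solver.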
Choosing λ via Cross-Validation
The regularization parameter λ controls the bias-variance tradeoff. We choose it via cross-validation:
K-Fold Cross-Validation for λ
- Create a grid of λ values (typically on log scale)
- For each λ: do K-fold CV, compute average prediction error
- Choose λ that minimizes CV error (or use "one-standard-error rule")
- Refit on full data with chosen λ
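The steps above, including the one-standard-error rule, can be sketched as follows. The example uses scikit-learn's `Lasso`, `KFold`, and `cross_val_score`; the function name `one_se_rule`, the grid, and the simulated data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

def one_se_rule(X, y, lambdas, n_splits=5, seed=0):
    """Return (lambda minimizing CV error, largest lambda within one SE of that minimum)."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    means, ses = [], []
    for lam in lambdas:
        model = Lasso(alpha=lam, max_iter=10000)
        fold_mse = -cross_val_score(model, X, y, cv=folds, scoring="neg_mean_squared_error")
        means.append(fold_mse.mean())
        ses.append(fold_mse.std(ddof=1) / np.sqrt(n_splits))  # standard error across folds
    means, ses = np.array(means), np.array(ses)
    best = np.argmin(means)
    threshold = means[best] + ses[best]
    # Most regularized model whose CV error is statistically close to the best
    lam_1se = np.max(lambdas[means <= threshold])
    return lambdas[best], lam_1se

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.0]
y = X @ beta_true + rng.standard_normal(100)

lambdas = np.logspace(-3, 0, 20)  # grid on a log scale
lam_min, lam_1se = one_se_rule(X, y, lambdas)
final_model = Lasso(alpha=lam_1se, max_iter=10000).fit(X, y)  # refit on the full data
print(f"lambda_min={lam_min:.4f}, lambda_1se={lam_1se:.4f}")
```

The one-SE choice is never smaller than the minimizing λ, so it yields a model that is at least as sparse.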
Regularization Paths
A regularization path shows how coefficients change as λ varies:
Interpreting Regularization Paths
- λ = 0 (left): Full MLE, with all coefficients at their unrestricted values
- λ → ∞ (right): Full shrinkage, with all coefficients → 0
- LASSO path: Coefficients enter at specific λ values (sparse)
- Ridge path: All coefficients shrink continuously (no selection)
- Vertical line: Optimal λ from cross-validation
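A LASSO path can be computed numerically and summarized by counting the active (non-zero) coefficients at each λ. This sketch uses scikit-learn's `lasso_path` with an explicit grid; the simulated data and grid limits are assumptions.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

grid = np.logspace(-2, 1, 30)
alphas, coefs, _ = lasso_path(X, y, alphas=grid)  # coefs has shape (p, len(alphas))

# Sort by increasing lambda so the columns read left to right like a path plot
order = np.argsort(alphas)
alphas, coefs = alphas[order], coefs[:, order]
n_active = np.count_nonzero(coefs, axis=0)

for lam, k in zip(alphas[::6], n_active[::6]):
    print(f"lambda={lam:8.3f}: {k} non-zero coefficients")
```

Plotting each row of `coefs` against `alphas` on a log axis gives the usual path figure: coefficients drop out one at a time as λ grows, until all are zero.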
AI/ML Applications
🧠 Weight Decay in Neural Networks
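L2 regularization on weights is called weight decay, and it is ubiquitous in deep learning. The two coincide for plain SGD; with adaptive optimizers such as Adam, decoupled weight decay (AdamW) behaves differently from an L2 penalty added to the loss.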
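For plain gradient descent, the equivalence is a one-line identity: a step on loss + (λ/2)‖w‖² equals multiplying the weights by (1 − ηλ) and then taking the usual loss step. The sketch below checks this with NumPy; the random vector `grad_loss` is a stand-in for a real loss gradient, and the learning rate and λ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
w = rng.standard_normal(6)
grad_loss = rng.standard_normal(6)  # stand-in for the gradient of the data loss at w
lr, lam = 0.1, 0.01

# (a) Gradient step on loss + (lam / 2) * ||w||^2
w_l2 = w - lr * (grad_loss + lam * w)

# (b) Weight decay: shrink the weights, then take the plain loss step
w_decay = (1.0 - lr * lam) * w - lr * grad_loss

print(np.allclose(w_l2, w_decay))
```

The identity relies on the update being a plain gradient step; optimizers that rescale gradients per parameter break it.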
🎲 Dropout as Regularization
Randomly zeroing activations during training is an implicit regularizer. It's approximately equivalent to an ensemble of sparse networks, preventing co-adaptation.
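A minimal NumPy sketch of the mechanism, assuming the common "inverted dropout" convention in which surviving activations are rescaled during training so that nothing needs to change at evaluation time. The function name `dropout` and its arguments are illustrative and do not correspond to any particular framework's API.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob, rescale the survivors."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(6)
h = np.ones(100_000)
h_train = dropout(h, drop_prob=0.3, rng=rng)
h_eval = dropout(h, drop_prob=0.3, rng=rng, training=False)

print(f"fraction zeroed in training: {(h_train == 0).mean():.3f}")
print(f"mean activation in training: {h_train.mean():.3f}")
```

Each training step samples a different mask, so each step effectively trains a different thinned sub-network.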
⏱️ Early Stopping
Stopping training before convergence acts as regularization. Early in training, the model learns general patterns; later it starts to memorize noise. For linear models trained by gradient descent, early stopping behaves approximately like L2 regularization.
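The link to shrinkage can be seen in a linear least-squares problem: gradient descent started at zero produces iterates whose norm grows with the number of steps and approaches the OLS solution from below, much as a ridge estimate does when λ decreases. The setup below (simulated data, step size 1/λ_max, checkpoints at 5, 50, and 500 steps) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Step size 1 / (largest eigenvalue of X^T X) guarantees stable convergence
step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()

beta = np.zeros(p)
norms = {}
for t in range(1, 501):
    beta -= step * (X.T @ (X @ beta - y))  # gradient of 0.5 * ||y - X beta||^2
    if t in (5, 50, 500):
        norms[t] = np.linalg.norm(beta)

ols_norm = np.linalg.norm(np.linalg.lstsq(X, y, rcond=None)[0])
for t, value in norms.items():
    print(f"after {t:3d} steps: ||beta|| = {value:.4f}")
print(f"OLS solution:    ||beta|| = {ols_norm:.4f}")
```

Stopping after fewer steps therefore plays a role similar to choosing a larger λ.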
✂️ Sparse Attention / Pruning
L1-style sparsity is used to prune neural networks — removing weights or attention heads that are close to zero reduces model size while maintaining performance.
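The simplest version of this idea is magnitude pruning: set the smallest-magnitude weights to zero. The sketch below is a generic NumPy illustration, not the procedure of any specific pruning library; the function name `magnitude_prune` and the 75% sparsity level are assumptions.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest absolute value."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude; ties at the threshold are pruned as well
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(8)
W = rng.standard_normal((20, 20))
W_pruned = magnitude_prune(W, sparsity=0.75)
print(f"non-zero weights: {np.count_nonzero(W)} -> {np.count_nonzero(W_pruned)}")
```

In practice pruning is usually followed by further training so that the remaining weights can compensate for the removed ones.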
Regularization ↔ Bayesian Priors
| Regularization | Equivalent Prior |
|---|---|
| L2 (Ridge) | Gaussian prior: β ~ N(0, τ²I) |
| L1 (LASSO) | Laplace prior: β ~ Laplace(0, b) |
| Elastic Net | Product of Gaussian and Laplace densities: p(β) ∝ exp(−λ₁‖β‖₁ − λ₂‖β‖₂²) |
| Dropout | Spike-and-slab (approximate) |
| Weight decay + L1 | Gaussian × Laplace (as Elastic Net); only loosely related to the horseshoe prior |
Python Implementation
```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
from typing import Tuple

# ============================================
# Ridge Regression from Scratch
# ============================================

def ridge_regression(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float
) -> np.ndarray:
    """
    Ridge regression closed-form solution (no intercept).

    Minimizes ||y - X beta||^2 + lambda_ * ||beta||^2.

    Parameters
    ----------
    X : array of shape (n, p)
        Design matrix
    y : array of shape (n,)
        Response vector
    lambda_ : float
        Regularization parameter

    Returns
    -------
    beta : array of shape (p,)
        Ridge regression coefficients
    """
    p = X.shape[1]
    # Solve (X^T X + λI) beta = X^T y rather than forming the inverse explicitly
    regularized = X.T @ X + lambda_ * np.eye(p)
    beta = np.linalg.solve(regularized, X.T @ y)
    return beta


# ============================================
# LASSO via Coordinate Descent
# ============================================

def soft_threshold(x: float, lambda_: float) -> float:
    """Soft thresholding operator for LASSO."""
    if x > lambda_:
        return x - lambda_
    elif x < -lambda_:
        return x + lambda_
    else:
        return 0.0


def lasso_coordinate_descent(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float,
    max_iter: int = 1000,
    tol: float = 1e-6
) -> np.ndarray:
    """
    LASSO via coordinate descent (no intercept).

    Minimizes (1 / (2n)) * ||y - X beta||^2 + lambda_ * ||beta||_1.

    Parameters
    ----------
    X : array of shape (n, p)
        Design matrix (should be standardized)
    y : array of shape (n,)
        Response vector
    lambda_ : float
        Regularization parameter
    max_iter : int
        Maximum iterations
    tol : float
        Convergence tolerance

    Returns
    -------
    beta : array of shape (p,)
        LASSO coefficients
    """
    n, p = X.shape
    beta = np.zeros(p)

    # Precompute X^T X diagonal (for standardized X, this is n)
    X_col_norms_sq = np.sum(X**2, axis=0)

    for iteration in range(max_iter):
        beta_old = beta.copy()

        for j in range(p):
            # Compute partial residual (excluding feature j)
            r_j = y - X @ beta + X[:, j] * beta[j]

            # Inner product of feature j with the partial residual
            rho_j = X[:, j] @ r_j

            # Soft-threshold, then rescale by the squared column norm
            beta[j] = soft_threshold(rho_j, n * lambda_) / X_col_norms_sq[j]

        # Check convergence
        if np.max(np.abs(beta - beta_old)) < tol:
            break

    return beta


# ============================================
# Cross-Validation for Lambda Selection
# ============================================

def cv_select_lambda(
    X: np.ndarray,
    y: np.ndarray,
    lambdas: np.ndarray,
    method: str = 'ridge',
    cv: int = 5
) -> Tuple[float, np.ndarray]:
    """
    Select optimal lambda via cross-validation.

    Parameters
    ----------
    X : array of shape (n, p)
        Design matrix
    y : array of shape (n,)
        Response vector
    lambdas : array
        Grid of lambda values to try
    method : str
        'ridge', 'lasso', or 'elasticnet'
    cv : int
        Number of CV folds

    Returns
    -------
    best_lambda : float
        Optimal lambda
    cv_errors : array
        CV error for each lambda
    """
    cv_errors = []

    for lam in lambdas:
        # fit_intercept=False matches the intercept-free estimators defined above
        if method == 'ridge':
            model = Ridge(alpha=lam, fit_intercept=False)
        elif method == 'lasso':
            model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        else:
            model = ElasticNet(alpha=lam, l1_ratio=0.5, fit_intercept=False, max_iter=10000)

        # sklearn returns negative MSE, so flip the sign to get an error
        scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
        cv_errors.append(-scores.mean())

    cv_errors = np.array(cv_errors)
    best_idx = np.argmin(cv_errors)
    best_lambda = lambdas[best_idx]

    return best_lambda, cv_errors


# ============================================
# Regularization Path
# ============================================

def compute_regularization_path(
    X: np.ndarray,
    y: np.ndarray,
    lambdas: np.ndarray,
    method: str = 'ridge'
) -> np.ndarray:
    """
    Compute coefficients for a range of lambda values.

    Parameters
    ----------
    X : array of shape (n, p)
        Design matrix
    y : array of shape (n,)
        Response vector
    lambdas : array
        Grid of lambda values
    method : str
        'ridge' or 'lasso'

    Returns
    -------
    paths : array of shape (len(lambdas), p)
        Coefficients at each lambda
    """
    p = X.shape[1]
    paths = np.zeros((len(lambdas), p))

    for i, lam in enumerate(lambdas):
        if method == 'ridge':
            paths[i] = ridge_regression(X, y, lam)
        else:
            paths[i] = lasso_coordinate_descent(X, y, lam)

    return paths


# ============================================
# Demonstration
# ============================================

if __name__ == "__main__":
    np.random.seed(42)

    print("=" * 60)
    print("REGULARIZED MLE: RIDGE, LASSO, ELASTIC NET")
    print("=" * 60)

    # Generate data with some irrelevant features
    n, p = 100, 20
    p_relevant = 5

    X = np.random.randn(n, p)
    beta_true = np.zeros(p)
    beta_true[:p_relevant] = np.array([3, -2, 1.5, -1, 0.5])

    y = X @ beta_true + np.random.randn(n) * 0.5

    print(f"\nData: n={n}, p={p}, {p_relevant} relevant features")
    print(f"True non-zero coefficients: {beta_true[:p_relevant]}")

    # Ridge regression
    print("\n--- Ridge Regression ---")
    beta_ridge = ridge_regression(X, y, lambda_=1.0)
    print(f"Ridge coefficients (λ=1): {beta_ridge[:p_relevant]}")
    print(f"Sum of irrelevant coefficients: {np.sum(np.abs(beta_ridge[p_relevant:])):.4f}")

    # LASSO
    print("\n--- LASSO ---")
    beta_lasso = lasso_coordinate_descent(X, y, lambda_=0.1)
    n_nonzero = np.sum(np.abs(beta_lasso) > 1e-6)
    print(f"LASSO coefficients (λ=0.1): {beta_lasso[:p_relevant]}")
    print(f"Number of non-zero coefficients: {n_nonzero}")

    # Cross-validation for lambda selection
    print("\n--- Cross-Validation for λ Selection ---")
    lambdas = np.logspace(-3, 1, 50)

    best_lam_ridge, cv_ridge = cv_select_lambda(X, y, lambdas, method='ridge')
    best_lam_lasso, cv_lasso = cv_select_lambda(X, y, lambdas, method='lasso')

    print(f"Best λ for Ridge: {best_lam_ridge:.4f}")
    print(f"Best λ for LASSO: {best_lam_lasso:.4f}")

    # Fit with optimal lambda
    beta_ridge_opt = ridge_regression(X, y, best_lam_ridge)
    beta_lasso_opt = lasso_coordinate_descent(X, y, best_lam_lasso)

    print("\n--- Comparison at Optimal λ ---")
    print(f"{'Method':<15} {'MSE':<10} {'Sparsity':<10}")
    print("-" * 35)

    # In-sample error against the noiseless signal X @ beta_true
    mse_ridge = np.mean((X @ beta_ridge_opt - X @ beta_true)**2)
    mse_lasso = np.mean((X @ beta_lasso_opt - X @ beta_true)**2)

    # Sparsity = number of coefficients that are (numerically) zero
    sparsity_ridge = np.sum(np.abs(beta_ridge_opt) < 1e-6)
    sparsity_lasso = np.sum(np.abs(beta_lasso_opt) < 1e-6)

    print(f"{'Ridge':<15} {mse_ridge:<10.4f} {sparsity_ridge:<10}")
    print(f"{'LASSO':<15} {mse_lasso:<10.4f} {sparsity_lasso:<10}")
```
Summary
Key Takeaways
- Regularization trades bias for reduced variance, essential when p is large relative to n.
- Ridge (L2) shrinks all coefficients toward zero but keeps all non-zero. Best for prediction and handling multicollinearity.
- LASSO (L1) sets some coefficients exactly to zero (sparse). Best for feature selection and interpretability.
- Elastic Net combines L1 and L2, getting sparsity with stability for correlated features.
- Bayesian interpretation: Regularization = MAP estimation with specific priors (Gaussian for L2, Laplace for L1).
- λ selection: Use cross-validation with "one-SE rule" for simpler, more robust models.
Chapter Complete! You've now mastered the major methods of estimation: Method of Moments, Maximum Likelihood, EM Algorithm, Rao-Blackwell improvement, and regularized estimation. These tools form the foundation for all statistical modeling in machine learning.