Learning Objectives
By the end of this section, you will be able to:
📚 Core Knowledge
- • Extend point estimation concepts to vector parameters
- • Understand the Fisher Information Matrix
- • Apply the multivariate Cramér-Rao bound
- • Characterize joint efficiency of estimators
🔧 Practical Skills
- • Compute covariance matrices of vector estimators
- • Calculate Fisher Information matrices
- • Apply multivariate MLE to real problems
- • Interpret asymptotic covariance matrices
🧠 Deep Learning Connections
- • Neural network training: Millions of parameters estimated jointly
- • Natural gradient: Uses Fisher Information Matrix for optimization
- • Laplace approximation: Multivariate Gaussian posterior via Hessian
- • Multivariate regression: Covariance structure of predictions
Why Multivariate Estimation?
So far, we've focused on estimating a single parameter θ. But in most real applications, we estimate multiple parameters simultaneously:
Normal Distribution
Estimate both μ and σ²:
Linear Regression
Estimate all coefficients:
Multinomial
Estimate category probabilities:
Neural Networks
Estimate millions of weights:
The key questions extend naturally:
- What is the joint covariance structure of our estimates?
- What is the best achievable precision (multivariate CRLB)?
- How do we characterize joint efficiency?
- What is sufficient for a vector parameter?
Vector Parameters and Estimators
Let be a vector of p parameters. An estimator is now a vector-valued function:
Unbiasedness extends component-wise:
Covariance Matrix of Estimators
Instead of a single variance, we now have a covariance matrix:
Covariance Matrix
The covariance matrix captures:
- Diagonal entries: Individual variances Var(θ̂ⱼ)
- Off-diagonal entries: How estimates of different parameters co-vary
- Correlations: Derived as Cor(θ̂ᵢ, θ̂ⱼ) = Cov(θ̂ᵢ, θ̂ⱼ) / √(Var(θ̂ᵢ)Var(θ̂ⱼ))
Multivariate Cramér-Rao Bound
Fisher Information Matrix
The Fisher Information Matrix I(θ) is a p × p matrix that generalizes scalar Fisher Information:
Fisher Information Matrix
Or equivalently (under regularity conditions):
In matrix form with the score vector :
The Matrix Inequality
The multivariate Cramér-Rao Lower Bound states:
Multivariate CRLB
For any unbiased estimator θ̂ of θ:
where A ≥ B means A - B is positive semi-definite.
This matrix inequality implies several scalar inequalities:
- Marginal bounds:
- Linear combinations: For any vector a,
- Total variance:
Multivariate MLE Properties
Asymptotic Normality
The multivariate MLE θ̂ₙ has an asymptotic multivariate normaldistribution:
Asymptotic Distribution of MLE
Or approximately:
This means:
- The MLE is asymptotically efficient — achieves the CRLB
- We can construct confidence regions using the multivariate normal
- The inverse Fisher Information gives the asymptotic covariance
Confidence ellipsoids: A (1-α) confidence region for θ is:
Multivariate Sufficiency
The concept of sufficiency extends naturally. A statistic T(X)is sufficient for the vector parameter θ if the conditional distribution of X given T does not depend on θ.
Examples of Multivariate Sufficient Statistics
| Distribution | Parameter θ | Sufficient Statistic T |
|---|---|---|
| Normal(μ, σ²) | (μ, σ²) | (ΣXᵢ, ΣXᵢ²) |
| Multinomial(n, π) | (π₁, ..., πₖ) | (n₁, ..., nₖ) |
| Multivariate Normal | (μ, Σ) | (ΣXᵢ, ΣXᵢXᵢᵀ) |
| Regression | (β, σ²) | (XᵀX, Xᵀy, yᵀy) |
AI/ML Applications
🧭 Natural Gradient Descent
Standard gradient descent ignores parameter correlations. Natural gradient uses the Fisher Information Matrix to account for the geometry of parameter space:
🔍 Laplace Approximation
Approximate the posterior with a multivariate Gaussian centered at the MAP estimate, with covariance equal to the inverse Hessian:
📊 Multivariate Regression
In linear regression Y = Xβ + ε, the coefficient covariance is:
High correlation in X → high covariance in β̂ (multicollinearity)
🎲 Uncertainty in Deep Learning
For neural networks with millions of parameters, the full covariance matrix is intractable. Methods like:
- Diagonal approximation (mean-field VI)
- Low-rank approximations (KFAC)
- MC Dropout for uncertainty
Python Implementation
1import numpy as np
2from scipy import stats
3from typing import Tuple
4
5# ============================================
6# Fisher Information Matrix Examples
7# ============================================
8
9def fisher_information_normal(mu: float, sigma2: float, n: int = 1) -> np.ndarray:
10 """
11 Fisher Information Matrix for Normal(μ, σ²).
12
13 Parameters for θ = (μ, σ²).
14
15 Returns
16 -------
17 I : array of shape (2, 2)
18 Fisher Information Matrix (for n observations)
19 """
20 # For single observation
21 I_single = np.array([
22 [1/sigma2, 0],
23 [0, 1/(2*sigma2**2)]
24 ])
25 # For n observations, multiply by n
26 return n * I_single
27
28
29def fisher_information_regression(X: np.ndarray, sigma2: float) -> np.ndarray:
30 """
31 Fisher Information Matrix for linear regression.
32
33 Parameters
34 ----------
35 X : array of shape (n, p)
36 Design matrix
37 sigma2 : float
38 Error variance
39
40 Returns
41 -------
42 I : array of shape (p, p)
43 Fisher Information Matrix for β
44 """
45 return (1/sigma2) * (X.T @ X)
46
47
48def fisher_information_multinomial(pi: np.ndarray, n: int = 1) -> np.ndarray:
49 """
50 Fisher Information Matrix for Multinomial(n, π).
51
52 Parameters
53 ----------
54 pi : array of shape (K,)
55 Category probabilities (sum to 1)
56 n : int
57 Number of trials
58
59 Returns
60 -------
61 I : array of shape (K-1, K-1)
62 Fisher Information Matrix for (π₁, ..., π_{K-1})
63 """
64 K = len(pi)
65 # We use K-1 parameters due to constraint sum(π) = 1
66 I = np.zeros((K-1, K-1))
67 for i in range(K-1):
68 for j in range(K-1):
69 if i == j:
70 I[i, j] = n * (1/pi[i] + 1/pi[K-1])
71 else:
72 I[i, j] = n / pi[K-1]
73 return I
74
75
76# ============================================
77# Multivariate CRLB
78# ============================================
79
80def crlb_variances(fisher_info: np.ndarray) -> np.ndarray:
81 """
82 Compute CRLB lower bounds for each parameter variance.
83
84 Parameters
85 ----------
86 fisher_info : array of shape (p, p)
87 Fisher Information Matrix
88
89 Returns
90 -------
91 bounds : array of shape (p,)
92 Lower bounds for Var(θ̂ⱼ) for each parameter
93 """
94 inv_fisher = np.linalg.inv(fisher_info)
95 return np.diag(inv_fisher)
96
97
98def efficiency(cov_estimator: np.ndarray, fisher_info: np.ndarray) -> float:
99 """
100 Compute generalized efficiency (ratio of determinants).
101
102 Parameters
103 ----------
104 cov_estimator : array of shape (p, p)
105 Covariance matrix of the estimator
106 fisher_info : array of shape (p, p)
107 Fisher Information Matrix
108
109 Returns
110 -------
111 eff : float
112 Generalized efficiency (1 = achieves CRLB)
113 """
114 crlb = np.linalg.inv(fisher_info)
115 return np.linalg.det(crlb) / np.linalg.det(cov_estimator)
116
117
118# ============================================
119# Multivariate MLE for Normal
120# ============================================
121
122def mle_normal(X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
123 """
124 MLE for Normal(μ, σ²) with asymptotic covariance.
125
126 Parameters
127 ----------
128 X : array of shape (n,)
129 Observed data
130
131 Returns
132 -------
133 theta_hat : array of shape (2,)
134 MLE estimates [μ̂, σ̂²]
135 cov_hat : array of shape (2, 2)
136 Estimated asymptotic covariance matrix
137 """
138 n = len(X)
139
140 # MLEs
141 mu_hat = np.mean(X)
142 sigma2_hat = np.var(X, ddof=0) # MLE uses n, not n-1
143
144 theta_hat = np.array([mu_hat, sigma2_hat])
145
146 # Fisher Information at MLE
147 I_hat = fisher_information_normal(mu_hat, sigma2_hat, n=n)
148
149 # Asymptotic covariance
150 cov_hat = np.linalg.inv(I_hat)
151
152 return theta_hat, cov_hat
153
154
155def mle_regression(X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
156 """
157 MLE for linear regression.
158
159 Parameters
160 ----------
161 X : array of shape (n, p)
162 Design matrix
163 y : array of shape (n,)
164 Response vector
165
166 Returns
167 -------
168 beta_hat : array of shape (p,)
169 MLE coefficient estimates
170 cov_hat : array of shape (p, p)
171 Estimated covariance matrix
172 """
173 n, p = X.shape
174
175 # MLE for β
176 XtX_inv = np.linalg.inv(X.T @ X)
177 beta_hat = XtX_inv @ X.T @ y
178
179 # MLE for σ²
180 residuals = y - X @ beta_hat
181 sigma2_hat = np.sum(residuals**2) / n # MLE uses n
182
183 # Covariance of β̂
184 cov_hat = sigma2_hat * XtX_inv
185
186 return beta_hat, cov_hat
187
188
189# ============================================
190# Confidence Ellipsoid
191# ============================================
192
193def confidence_ellipse_params(
194 theta_hat: np.ndarray,
195 cov: np.ndarray,
196 confidence: float = 0.95
197) -> dict:
198 """
199 Compute parameters for confidence ellipse (2D case).
200
201 Parameters
202 ----------
203 theta_hat : array of shape (2,)
204 Point estimate
205 cov : array of shape (2, 2)
206 Covariance matrix
207 confidence : float
208 Confidence level
209
210 Returns
211 -------
212 params : dict
213 Center, eigenvalues, eigenvectors, and chi-square cutoff
214 """
215 # Chi-square cutoff
216 chi2_val = stats.chi2.ppf(confidence, df=2)
217
218 # Eigendecomposition
219 eigenvalues, eigenvectors = np.linalg.eigh(cov)
220
221 # Semi-axis lengths
222 semi_axes = np.sqrt(chi2_val * eigenvalues)
223
224 # Rotation angle
225 angle = np.arctan2(eigenvectors[1, 1], eigenvectors[0, 1])
226
227 return {
228 'center': theta_hat,
229 'eigenvalues': eigenvalues,
230 'eigenvectors': eigenvectors,
231 'semi_axes': semi_axes,
232 'angle_rad': angle,
233 'angle_deg': np.degrees(angle),
234 'chi2_cutoff': chi2_val
235 }
236
237
238# ============================================
239# Demonstration
240# ============================================
241
242if __name__ == "__main__":
243 np.random.seed(42)
244
245 print("=" * 60)
246 print("MULTIVARIATE ESTIMATION")
247 print("=" * 60)
248
249 # Example 1: Normal distribution
250 print("\n--- Normal(μ, σ²) MLE ---")
251 mu_true, sigma2_true = 10.0, 4.0
252 n = 100
253 X = np.random.normal(mu_true, np.sqrt(sigma2_true), n)
254
255 theta_hat, cov_hat = mle_normal(X)
256
257 print(f"True parameters: μ={mu_true}, σ²={sigma2_true}")
258 print(f"MLE estimates: μ̂={theta_hat[0]:.4f}, σ̂²={theta_hat[1]:.4f}")
259 print(f"\nAsymptotic covariance matrix:")
260 print(cov_hat)
261
262 # CRLB
263 I_true = fisher_information_normal(mu_true, sigma2_true, n)
264 crlb = crlb_variances(I_true)
265 print(f"\nCRLB for variances: Var(μ̂)≥{crlb[0]:.4f}, Var(σ̂²)≥{crlb[1]:.4f}")
266 print(f"Actual asymptotic variances: {np.diag(cov_hat)}")
267
268 # Example 2: Regression
269 print("\n--- Linear Regression MLE ---")
270 n, p = 100, 3
271 X_reg = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
272 beta_true = np.array([2.0, 1.5, -0.5])
273 y = X_reg @ beta_true + np.random.normal(0, 1.5, n)
274
275 beta_hat, cov_beta = mle_regression(X_reg, y)
276
277 print(f"True β: {beta_true}")
278 print(f"MLE β̂: {beta_hat}")
279 print(f"\nCovariance matrix of β̂:")
280 print(cov_beta)
281 print(f"\nStandard errors: {np.sqrt(np.diag(cov_beta))}")
282
283 # Confidence ellipse
284 print("\n--- Confidence Ellipse (first 2 params) ---")
285 ellipse = confidence_ellipse_params(beta_hat[:2], cov_beta[:2, :2])
286 print(f"Center: {ellipse['center']}")
287 print(f"Semi-axes: {ellipse['semi_axes']}")
288 print(f"Rotation angle: {ellipse['angle_deg']:.1f}°")Summary
Key Takeaways
- Vector parameters are estimated jointly, with quality characterized by the covariance matrix rather than scalar variance.
- The Fisher Information Matrix I(θ) generalizes scalar Fisher Information, with entries [I]ᵢⱼ = E[∂logf/∂θᵢ · ∂logf/∂θⱼ].
- The multivariate CRLB states Cov(θ̂) ≥ I(θ)⁻¹ in the matrix (positive semi-definite) sense.
- Multivariate MLEs are asymptotically normal with covariance I(θ)⁻¹/n, achieving the CRLB.
- Deep learning applications include natural gradient descent, Laplace approximation, and understanding parameter correlations.
Looking Ahead: In Chapter 12, we'll learn practical methods for finding estimators: Method of Moments, Maximum Likelihood Estimation, the EM algorithm, and the Rao-Blackwell improvement theorem.