Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Extend point estimation concepts to vector parameters
• Understand the Fisher Information Matrix
• Apply the multivariate Cramér-Rao bound
• Characterize joint efficiency of estimators

🔧 Practical Skills

• Compute covariance matrices of vector estimators
• Calculate Fisher Information matrices
• Apply multivariate MLE to real problems
• Interpret asymptotic covariance matrices

🧠 Deep Learning Connections

• Neural network training: Millions of parameters estimated jointly
• Natural gradient: Uses Fisher Information Matrix for optimization
• Laplace approximation: Multivariate Gaussian posterior via Hessian
• Multivariate regression: Covariance structure of predictions

Why Multivariate Estimation?

So far, we've focused on estimating a single parameter θ. But in most real applications, we estimate multiple parameters simultaneously:

Normal Distribution

Estimate both μ and σ²: $\boldsymbol{\theta} = (\mu, \sigma^2)^T$

Linear Regression

Estimate all coefficients: $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T$

Multinomial

Estimate category probabilities: $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)^T$

Neural Networks

Estimate millions of weights: $\boldsymbol{w} = (w_1, \ldots, w_d)^T$

The key questions extend naturally:

What is the joint covariance structure of our estimates?
What is the best achievable precision (multivariate CRLB)?
How do we characterize joint efficiency?
What is sufficient for a vector parameter?

Vector Parameters and Estimators

Let $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)^T \in \mathbb{R}^p$ be a vector of p parameters. An estimator is now a vector-valued function:

\hat{\boldsymbol{\theta}} = (\hat{\theta}_1, \ldots, \hat{\theta}_p)^T

Unbiasedness extends component-wise:

E[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}_0 \quad \Leftrightarrow \quad E[\hat{\theta}_j] = \theta_{0j} \text{ for all } j = 1, \ldots, p

Covariance Matrix of Estimators

Instead of a single variance, we now have a covariance matrix:

Covariance Matrix

\text{Cov}(\hat{\boldsymbol{\theta}}) = E[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)^T]

= \begin{pmatrix} \text{Var}(\hat{\theta}_1) & \text{Cov}(\hat{\theta}_1, \hat{\theta}_2) & \cdots \\ \text{Cov}(\hat{\theta}_2, \hat{\theta}_1) & \text{Var}(\hat{\theta}_2) & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}

The covariance matrix captures:

Diagonal entries: Individual variances Var(θ̂ⱼ)
Off-diagonal entries: How estimates of different parameters co-vary
Correlations: Derived as Cor(θ̂ᵢ, θ̂ⱼ) = Cov(θ̂ᵢ, θ̂ⱼ) / √(Var(θ̂ᵢ)Var(θ̂ⱼ))

High correlation between estimates means they are hard to distinguish — if you overestimate one, you likely overestimate the other. This is common when parameters are confounded (e.g., μ and σ in some models).

Multivariate Cramér-Rao Bound

Fisher Information Matrix

The Fisher Information Matrix I(θ) is a p × p matrix that generalizes scalar Fisher Information:

Fisher Information Matrix

[\mathbf{I}(\boldsymbol{\theta})]_{ij} = E\left[\frac{\partial \log f(X|\boldsymbol{\theta})}{\partial \theta_i} \cdot \frac{\partial \log f(X|\boldsymbol{\theta})}{\partial \theta_j}\right]

Or equivalently (under regularity conditions):

[\mathbf{I}(\boldsymbol{\theta})]_{ij} = -E\left[\frac{\partial^2 \log f(X|\boldsymbol{\theta})}{\partial \theta_i \partial \theta_j}\right]

In matrix form with the score vector $\mathbf{s}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \log f(X|\boldsymbol{\theta})$ :

\mathbf{I}(\boldsymbol{\theta}) = E[\mathbf{s}(\boldsymbol{\theta}) \mathbf{s}(\boldsymbol{\theta})^T] = -E[\nabla^2_{\boldsymbol{\theta}} \log f(X|\boldsymbol{\theta})]

The Matrix Inequality

The multivariate Cramér-Rao Lower Bound states:

Multivariate CRLB

For any unbiased estimator θ̂ of θ:

\text{Cov}(\hat{\boldsymbol{\theta}}) \geq \mathbf{I}(\boldsymbol{\theta})^{-1}

where A ≥ B means A - B is positive semi-definite.

This matrix inequality implies several scalar inequalities:

Marginal bounds: $\text{Var}(\hat{\theta}_j) \geq [\mathbf{I}(\boldsymbol{\theta})^{-1}]_{jj}$
Linear combinations: For any vector a, $\text{Var}(\mathbf{a}^T\hat{\boldsymbol{\theta}}) \geq \mathbf{a}^T \mathbf{I}(\boldsymbol{\theta})^{-1} \mathbf{a}$
Total variance: $\text{tr}(\text{Cov}(\hat{\boldsymbol{\theta}})) \geq \text{tr}(\mathbf{I}(\boldsymbol{\theta})^{-1})$

Important: The bound for each θⱼ involves the inverseof the full Fisher Information Matrix, not just the jth diagonal element. This accounts for the information about θⱼ that comes indirectly through other parameters.

Multivariate MLE Properties

Asymptotic Normality

The multivariate MLE θ̂ₙ has an asymptotic multivariate normaldistribution:

Asymptotic Distribution of MLE

\sqrt{n}(\hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \mathbf{I}(\boldsymbol{\theta}_0)^{-1})

Or approximately: $\hat{\boldsymbol{\theta}}_n \sim \mathcal{N}\left(\boldsymbol{\theta}_0, \frac{1}{n}\mathbf{I}(\boldsymbol{\theta}_0)^{-1}\right)$

This means:

The MLE is asymptotically efficient — achieves the CRLB
We can construct confidence regions using the multivariate normal
The inverse Fisher Information gives the asymptotic covariance

Confidence ellipsoids: A (1-α) confidence region for θ is:

\{{\boldsymbol{\theta}}: n(\hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta})^T \mathbf{I}(\hat{\boldsymbol{\theta}}_n) (\hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta}) \leq \chi^2_{p, 1-\alpha}\}

Multivariate Sufficiency

The concept of sufficiency extends naturally. A statistic T(X)is sufficient for the vector parameter θ if the conditional distribution of X given T does not depend on θ.

Examples of Multivariate Sufficient Statistics

Distribution	Parameter θ	Sufficient Statistic T
Normal(μ, σ²)	(μ, σ²)	(ΣXᵢ, ΣXᵢ²)
Multinomial(n, π)	(π₁, ..., πₖ)	(n₁, ..., nₖ)
Multivariate Normal	(μ, Σ)	(ΣXᵢ, ΣXᵢXᵢᵀ)
Regression	(β, σ²)	(XᵀX, Xᵀy, yᵀy)

Dimension matching: For a p-dimensional parameter θ, the minimal sufficient statistic typically has dimension p (or close to it). If you have fewer sufficient statistics than parameters, the model may not be identifiable!

AI/ML Applications

🧭 Natural Gradient Descent

Standard gradient descent ignores parameter correlations. Natural gradient uses the Fisher Information Matrix to account for the geometry of parameter space:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \mathbf{I}(\boldsymbol{\theta}_t)^{-1} \nabla L(\boldsymbol{\theta}_t)

🔍 Laplace Approximation

Approximate the posterior with a multivariate Gaussian centered at the MAP estimate, with covariance equal to the inverse Hessian:

p(\boldsymbol{\theta}|\mathbf{X}) \approx \mathcal{N}(\hat{\boldsymbol{\theta}}_{\text{MAP}}, \mathbf{H}^{-1})

📊 Multivariate Regression

In linear regression Y = Xβ + ε, the coefficient covariance is:

\text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}

High correlation in X → high covariance in β̂ (multicollinearity)

🎲 Uncertainty in Deep Learning

For neural networks with millions of parameters, the full covariance matrix is intractable. Methods like:

Diagonal approximation (mean-field VI)
Low-rank approximations (KFAC)
MC Dropout for uncertainty

Python Implementation

🐍python

1import numpy as np
2from scipy import stats
3from typing import Tuple
4
5# ============================================
6# Fisher Information Matrix Examples
7# ============================================
8
9def fisher_information_normal(mu: float, sigma2: float, n: int = 1) -> np.ndarray:
10    """
11    Fisher Information Matrix for Normal(μ, σ²).
12
13    Parameters for θ = (μ, σ²).
14
15    Returns
16    -------
17    I : array of shape (2, 2)
18        Fisher Information Matrix (for n observations)
19    """
20    # For single observation
21    I_single = np.array([
22        [1/sigma2, 0],
23        [0, 1/(2*sigma2**2)]
24    ])
25    # For n observations, multiply by n
26    return n * I_single
27
28
29def fisher_information_regression(X: np.ndarray, sigma2: float) -> np.ndarray:
30    """
31    Fisher Information Matrix for linear regression.
32
33    Parameters
34    ----------
35    X : array of shape (n, p)
36        Design matrix
37    sigma2 : float
38        Error variance
39
40    Returns
41    -------
42    I : array of shape (p, p)
43        Fisher Information Matrix for β
44    """
45    return (1/sigma2) * (X.T @ X)
46
47
48def fisher_information_multinomial(pi: np.ndarray, n: int = 1) -> np.ndarray:
49    """
50    Fisher Information Matrix for Multinomial(n, π).
51
52    Parameters
53    ----------
54    pi : array of shape (K,)
55        Category probabilities (sum to 1)
56    n : int
57        Number of trials
58
59    Returns
60    -------
61    I : array of shape (K-1, K-1)
62        Fisher Information Matrix for (π₁, ..., π_{K-1})
63    """
64    K = len(pi)
65    # We use K-1 parameters due to constraint sum(π) = 1
66    I = np.zeros((K-1, K-1))
67    for i in range(K-1):
68        for j in range(K-1):
69            if i == j:
70                I[i, j] = n * (1/pi[i] + 1/pi[K-1])
71            else:
72                I[i, j] = n / pi[K-1]
73    return I
74
75
76# ============================================
77# Multivariate CRLB
78# ============================================
79
80def crlb_variances(fisher_info: np.ndarray) -> np.ndarray:
81    """
82    Compute CRLB lower bounds for each parameter variance.
83
84    Parameters
85    ----------
86    fisher_info : array of shape (p, p)
87        Fisher Information Matrix
88
89    Returns
90    -------
91    bounds : array of shape (p,)
92        Lower bounds for Var(θ̂ⱼ) for each parameter
93    """
94    inv_fisher = np.linalg.inv(fisher_info)
95    return np.diag(inv_fisher)
96
97
98def efficiency(cov_estimator: np.ndarray, fisher_info: np.ndarray) -> float:
99    """
100    Compute generalized efficiency (ratio of determinants).
101
102    Parameters
103    ----------
104    cov_estimator : array of shape (p, p)
105        Covariance matrix of the estimator
106    fisher_info : array of shape (p, p)
107        Fisher Information Matrix
108
109    Returns
110    -------
111    eff : float
112        Generalized efficiency (1 = achieves CRLB)
113    """
114    crlb = np.linalg.inv(fisher_info)
115    return np.linalg.det(crlb) / np.linalg.det(cov_estimator)
116
117
118# ============================================
119# Multivariate MLE for Normal
120# ============================================
121
122def mle_normal(X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
123    """
124    MLE for Normal(μ, σ²) with asymptotic covariance.
125
126    Parameters
127    ----------
128    X : array of shape (n,)
129        Observed data
130
131    Returns
132    -------
133    theta_hat : array of shape (2,)
134        MLE estimates [μ̂, σ̂²]
135    cov_hat : array of shape (2, 2)
136        Estimated asymptotic covariance matrix
137    """
138    n = len(X)
139
140    # MLEs
141    mu_hat = np.mean(X)
142    sigma2_hat = np.var(X, ddof=0)  # MLE uses n, not n-1
143
144    theta_hat = np.array([mu_hat, sigma2_hat])
145
146    # Fisher Information at MLE
147    I_hat = fisher_information_normal(mu_hat, sigma2_hat, n=n)
148
149    # Asymptotic covariance
150    cov_hat = np.linalg.inv(I_hat)
151
152    return theta_hat, cov_hat
153
154
155def mle_regression(X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
156    """
157    MLE for linear regression.
158
159    Parameters
160    ----------
161    X : array of shape (n, p)
162        Design matrix
163    y : array of shape (n,)
164        Response vector
165
166    Returns
167    -------
168    beta_hat : array of shape (p,)
169        MLE coefficient estimates
170    cov_hat : array of shape (p, p)
171        Estimated covariance matrix
172    """
173    n, p = X.shape
174
175    # MLE for β
176    XtX_inv = np.linalg.inv(X.T @ X)
177    beta_hat = XtX_inv @ X.T @ y
178
179    # MLE for σ²
180    residuals = y - X @ beta_hat
181    sigma2_hat = np.sum(residuals**2) / n  # MLE uses n
182
183    # Covariance of β̂
184    cov_hat = sigma2_hat * XtX_inv
185
186    return beta_hat, cov_hat
187
188
189# ============================================
190# Confidence Ellipsoid
191# ============================================
192
193def confidence_ellipse_params(
194    theta_hat: np.ndarray,
195    cov: np.ndarray,
196    confidence: float = 0.95
197) -> dict:
198    """
199    Compute parameters for confidence ellipse (2D case).
200
201    Parameters
202    ----------
203    theta_hat : array of shape (2,)
204        Point estimate
205    cov : array of shape (2, 2)
206        Covariance matrix
207    confidence : float
208        Confidence level
209
210    Returns
211    -------
212    params : dict
213        Center, eigenvalues, eigenvectors, and chi-square cutoff
214    """
215    # Chi-square cutoff
216    chi2_val = stats.chi2.ppf(confidence, df=2)
217
218    # Eigendecomposition
219    eigenvalues, eigenvectors = np.linalg.eigh(cov)
220
221    # Semi-axis lengths
222    semi_axes = np.sqrt(chi2_val * eigenvalues)
223
224    # Rotation angle
225    angle = np.arctan2(eigenvectors[1, 1], eigenvectors[0, 1])
226
227    return {
228        'center': theta_hat,
229        'eigenvalues': eigenvalues,
230        'eigenvectors': eigenvectors,
231        'semi_axes': semi_axes,
232        'angle_rad': angle,
233        'angle_deg': np.degrees(angle),
234        'chi2_cutoff': chi2_val
235    }
236
237
238# ============================================
239# Demonstration
240# ============================================
241
242if __name__ == "__main__":
243    np.random.seed(42)
244
245    print("=" * 60)
246    print("MULTIVARIATE ESTIMATION")
247    print("=" * 60)
248
249    # Example 1: Normal distribution
250    print("\n--- Normal(μ, σ²) MLE ---")
251    mu_true, sigma2_true = 10.0, 4.0
252    n = 100
253    X = np.random.normal(mu_true, np.sqrt(sigma2_true), n)
254
255    theta_hat, cov_hat = mle_normal(X)
256
257    print(f"True parameters: μ={mu_true}, σ²={sigma2_true}")
258    print(f"MLE estimates: μ̂={theta_hat[0]:.4f}, σ̂²={theta_hat[1]:.4f}")
259    print(f"\nAsymptotic covariance matrix:")
260    print(cov_hat)
261
262    # CRLB
263    I_true = fisher_information_normal(mu_true, sigma2_true, n)
264    crlb = crlb_variances(I_true)
265    print(f"\nCRLB for variances: Var(μ̂)≥{crlb[0]:.4f}, Var(σ̂²)≥{crlb[1]:.4f}")
266    print(f"Actual asymptotic variances: {np.diag(cov_hat)}")
267
268    # Example 2: Regression
269    print("\n--- Linear Regression MLE ---")
270    n, p = 100, 3
271    X_reg = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
272    beta_true = np.array([2.0, 1.5, -0.5])
273    y = X_reg @ beta_true + np.random.normal(0, 1.5, n)
274
275    beta_hat, cov_beta = mle_regression(X_reg, y)
276
277    print(f"True β: {beta_true}")
278    print(f"MLE β̂: {beta_hat}")
279    print(f"\nCovariance matrix of β̂:")
280    print(cov_beta)
281    print(f"\nStandard errors: {np.sqrt(np.diag(cov_beta))}")
282
283    # Confidence ellipse
284    print("\n--- Confidence Ellipse (first 2 params) ---")
285    ellipse = confidence_ellipse_params(beta_hat[:2], cov_beta[:2, :2])
286    print(f"Center: {ellipse['center']}")
287    print(f"Semi-axes: {ellipse['semi_axes']}")
288    print(f"Rotation angle: {ellipse['angle_deg']:.1f}°")

Summary

Key Takeaways

Vector parameters are estimated jointly, with quality characterized by the covariance matrix rather than scalar variance.
The Fisher Information Matrix I(θ) generalizes scalar Fisher Information, with entries [I]ᵢⱼ = E[∂logf/∂θᵢ · ∂logf/∂θⱼ].
The multivariate CRLB states Cov(θ̂) ≥ I(θ)⁻¹ in the matrix (positive semi-definite) sense.
Multivariate MLEs are asymptotically normal with covariance I(θ)⁻¹/n, achieving the CRLB.
Deep learning applications include natural gradient descent, Laplace approximation, and understanding parameter correlations.

Looking Ahead: In Chapter 12, we'll learn practical methods for finding estimators: Method of Moments, Maximum Likelihood Estimation, the EM algorithm, and the Rao-Blackwell improvement theorem.