Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

The multivariate normal (MVN) distribution is the most important continuous distribution in statistics and machine learning. By the end of this section, you will master the MVN and understand why it appears everywhere in modern AI systems.

Define the MVN mathematically and understand every component of its PDF: the mean vector $\boldsymbol{\mu}$ , the covariance matrix $\boldsymbol{\Sigma}$ , and the normalizing constant
Visualize and interpret the elliptical contours of the MVN, understanding how eigenvalues and eigenvectors determine the shape and orientation
Compute marginal and conditional distributions of MVN, recognizing that they remain Gaussian with closed-form parameters
Understand the affine transformation property: linear transformations of MVN are still MVN, which is fundamental for neural network analysis
Appreciate high-dimensional behavior: the concentration of measure phenomenon and its implications for machine learning
Apply MVN concepts in practical contexts: Gaussian processes, VAEs, Bayesian deep learning, and probabilistic PCA
Implement MVN sampling and density evaluation efficiently using Cholesky decomposition and understand the Mahalanobis distance

Why This is the Most Important Distribution

The MVN is to multivariate statistics what the univariate normal is to single-variable statistics—but even more fundamental. It arises naturally from the Central Limit Theorem for sums of random vectors, describes thermal noise in electronic systems, and serves as the foundation for Gaussian processes, factor analysis, and most Bayesian machine learning.

In deep learning specifically: weight initializations are Gaussian, dropout noise is related to Gaussian noise injection, reparameterization tricks in VAEs use Gaussian distributions, and even the training dynamics of wide neural networks converge to Gaussian processes!

The Big Picture: Why MVN Matters

"The multivariate normal is the hydrogen atom of statistics—simple enough to analyze completely, yet rich enough to capture the essential phenomena."

The multivariate normal extends the familiar bell curve to $d$ dimensions. Just as the univariate normal $N(\mu, \sigma^2)$ is parameterized by a mean and variance, the MVN $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is parameterized by:

Mean vector $\boldsymbol{\mu} \in \mathbb{R}^d$ : The center of the distribution, the expected value $E[\mathbf{X}]$
Covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ : A symmetric positive definite matrix that encodes variances and correlations

The covariance matrix is the key to understanding MVN geometry. Its entries tell us:

\Sigma_{ij} = \text{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]

Diagonal entries $\Sigma_{ii}$ : The variance of each component
Off-diagonal entries $\Sigma_{ij}$ : Covariances between pairs of components

The Geometry of MVN

The contours of constant probability density are ellipsoids centered at $\boldsymbol{\mu}$ . The shape and orientation of these ellipsoids are determined by the eigenvalues and eigenvectors of $\boldsymbol{\Sigma}$ :

Eigenvectors: Point along the principal axes of the ellipsoid
Eigenvalues: Determine the length of each principal axis (variance along that direction)

Historical Origins: From Gauss to Modern AI

Carl Friedrich Gauss (1777-1855)

The normal distribution is often called the "Gaussian distribution" after Carl Friedrich Gauss, who derived it in the context of astronomical measurement errors around 1809. Gauss showed that if errors arise from the sum of many small independent effects, their distribution approaches the normal—the first statement of what we now call the Central Limit Theorem.

Francis Galton (1822-1911) and Karl Pearson (1857-1936)

Galton extended univariate ideas to the bivariate case, discovering regression to the mean and developing the concept of correlation. His student Karl Pearson formalized the bivariate normal and pioneered multivariate analysis, developing Principal Component Analysis (PCA).

R.A. Fisher and Modern Statistics

R.A. Fisher (1890-1962) developed the theoretical foundations for maximum likelihood estimation and multivariate analysis, showing how MVN structure enables powerful inference procedures. The Fisher information matrix—central to modern neural network optimization—derives its geometric meaning from the MVN.

The MVN in AI History

The MVN has been pivotal in AI development:

1960s: Linear discriminant analysis (LDA) assumes MVN class-conditional distributions
1990s: Gaussian mixture models and the EM algorithm
2000s: Gaussian processes for nonparametric Bayesian learning
2013: VAEs use MVN priors and posteriors with reparameterization tricks
2018+: Neural tangent kernel theory shows wide networks → Gaussian processes

Mathematical Definition

The Probability Density Function

A random vector $\mathbf{X} = (X_1, \ldots, X_d)^T$ follows a d-dimensional multivariate normal distribution, written $\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ , if its probability density function is:

f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)

Let's understand each component:

Component	Meaning	Role
(2π)^(d/2)	Dimension factor	Normalizes the integral to 1
\|Σ\|^(1/2)	Square root of determinant	Accounts for the 'volume' of the covariance ellipsoid
(x - μ)	Deviation from mean	How far the point is from the center
Σ⁻¹	Precision matrix	Inverse of covariance; gives 'natural' metric
Quadratic form	(x-μ)ᵀΣ⁻¹(x-μ)	Squared Mahalanobis distance

The Mahalanobis Distance

The exponent contains the squared Mahalanobis distance:

D^2(\mathbf{x}) = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})

This is the "natural" distance for MVN—it measures how many standard deviations a point is from the mean, accounting for correlations. Points on the same contour ellipse have the same Mahalanobis distance.

Key Insight: D² ~ χ²(d)

If $\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ , then the squared Mahalanobis distance follows a chi-squared distribution:

D^2 = (\mathbf{X} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}) \sim \chi^2_d

This is how we construct confidence ellipsoids: the boundary where $D^2 = \chi^2_{d,\alpha}$ .

Interactive PDF Explorer

Explore how the MVN parameters affect the distribution. Adjust the mean vector, variances, and correlation to see how the contour ellipses change. Toggle the principal axes to see the eigenvector directions, and enable marginal distributions to see the projections.

Multivariate Normal PDF Explorer

μ_X (Mean X): 0.0

μ_Y (Mean Y): 0.0

ρ (Correlation): 0.50

σ_X: 1.00

σ_Y: 1.00

Show Principal AxesShow Marginals

Covariance Matrix Σ

\boldsymbol{\Sigma} = \begin{pmatrix} 1.00 & 0.50 \\ 0.50 & 1.00 \end{pmatrix}

det(Σ) = 0.750

Key Properties

• Eigenvalues: λ₁ = 1.500, λ₂ = 0.500
• Principal Angle: 45.0°
• Correlation: ρ = 0.50
• Shape: Strongly elliptical

What to Try

Set ρ = 0 to see circular contours (when σ_X = σ_Y)
Increase |ρ| toward ±1 to see elongated ellipses
Make σ_X ≠ σ_Y with ρ = 0 to see axis-aligned ellipses
Watch how the eigenvalue labels change as you adjust parameters
Enable marginals to see that they remain Gaussian regardless of ρ!

Geometry: Ellipses and Eigenvalues

The shape of MVN contours is determined by the eigendecomposition of the covariance matrix:

\boldsymbol{\Sigma} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^T

where $\mathbf{Q}$ contains the eigenvectors as columns, and $\boldsymbol{\Lambda}$ is diagonal with eigenvalues $\lambda_1, \ldots, \lambda_d$ .

Eigenvectors give the directions of the principal axes
Eigenvalues give the variances along each principal axis
√λ gives the standard deviation along each axis

How Covariance Shapes the Ellipse

Step 1: Unit Circle

Start with the standard bivariate normal N(0, I) which has circular contours.

Covariance Matrix

Σ = [1.00, 0.00
0.00, 1.00]

Eigenvalue λ₁

1.000

Major axis length: 2.00σ

Eigenvalue λ₂

1.000

Minor axis length: 2.00σ

Key Insight

The eigenvectors of Σ point along the principal axes of the ellipse, and the eigenvalues determine the length of each axis. The ratio λ₁/λ₂ = 1.00 tells us how elongated the ellipse is. When ρ = 0, the eigenvectors align with the coordinate axes.

Principal Component Analysis (PCA)

PCA is exactly this eigendecomposition! The principal components are the eigenvectors, ordered by eigenvalue. PCA rotates data so that the MVN becomes axis-aligned, with variances λ₁ ≥ λ₂ ≥ ... ≥ λ_d along the axes.

Key Formulas for Bivariate Case

For the 2D case with covariance matrix:

\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}

The eigenvalues are:

\lambda_{1,2} = \frac{\sigma_X^2 + \sigma_Y^2}{2} \pm \sqrt{\left(\frac{\sigma_X^2 - \sigma_Y^2}{2}\right)^2 + \rho^2\sigma_X^2\sigma_Y^2}

And the principal axis makes angle θ with the x-axis:

\tan(2\theta) = \frac{2\rho\sigma_X\sigma_Y}{\sigma_X^2 - \sigma_Y^2}

Key Properties of MVN

1. Marginal Distributions

Every marginal distribution of an MVN is also normal:

X_i \sim N(\mu_i, \Sigma_{ii})

More generally, any subset of components has a multivariate normal distribution with the corresponding subvector of means and submatrix of covariances.

Independence vs Zero Correlation

For MVN (and only for MVN), uncorrelated implies independent. If $\text{Cov}(X_i, X_j) = 0$ for all $i \neq j$ , then the components are mutually independent.

This is a special property! For general distributions, zero correlation does not imply independence.

2. Affine Transformation Property

This is perhaps the most powerful property. If $\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $\mathbf{A}$ is a matrix with $\mathbf{b}$ a vector, then:

\mathbf{A}\mathbf{X} + \mathbf{b} \sim N(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T)

Why This Matters for Neural Networks

A linear layer in a neural network computes $\mathbf{W}\mathbf{x} + \mathbf{b}$ . If the input is Gaussian (which approximately holds for many initializations and activations), the output before the nonlinearity is also Gaussian! This enables:

Analysis of weight initialization schemes (Xavier, He)
Understanding of batch normalization effects
Neural tangent kernel theory for infinite-width networks

3. Sum of Independent Gaussians

If $\mathbf{X} \sim N(\boldsymbol{\mu}_X, \boldsymbol{\Sigma}_X)$ and $\mathbf{Y} \sim N(\boldsymbol{\mu}_Y, \boldsymbol{\Sigma}_Y)$ are independent:

\mathbf{X} + \mathbf{Y} \sim N(\boldsymbol{\mu}_X + \boldsymbol{\mu}_Y, \boldsymbol{\Sigma}_X + \boldsymbol{\Sigma}_Y)

Conditional Distributions

One of the most beautiful properties of MVN: conditionals are also Gaussian, with closed-form parameters. Partition $\mathbf{X}$ as:

\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}

Then the conditional distribution is:

\mathbf{X}_1 | \mathbf{X}_2 = \mathbf{x}_2 \sim N(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2})

with:

\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)

\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}

Conditional Distribution of MVN

Observed X = 1.50

ρ = 0.70

σ_X = 1.0

σ_Y = 1.0

Conditional Distribution Formula

Y | X = x \sim N\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X), \; \sigma_Y^2(1 - \rho^2)\right)

Current Values

Conditional Mean:1.050

Conditional Std:0.714

Marginal Variance:1.000

Variance Reduction:49.0%

Key Insight: Variance Reduction

Observing X reduces our uncertainty about Y. The conditional variance σ²_Y|X = σ²_Y(1 - ρ²) is always smaller than the marginal variance σ²_Y. With ρ = 0.70, knowing X explains 49.0% of the variance in Y!

Key Observations

Conditional mean is a linear function of the observed values—this is why linear regression is optimal under Gaussian assumptions!
Conditional variance does not depend on the observed value—it's constant (homoscedastic). Observing $\mathbf{X}_2$ always reduces variance by the same amount.
The formula $\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}$ appears everywhere in statistics: it's the regression coefficient matrix!

High Dimensions: The Concentration of Measure

In machine learning, we often work with very high-dimensional data (images with millions of pixels, language models with billions of parameters). The MVN exhibits surprising behavior in high dimensions that is crucial to understand.

High-Dimensional MVN: The Curse and Blessing

As dimension d increases, the multivariate normal exhibits fascinating behaviors that are crucial for understanding modern deep learning systems operating in high-dimensional spaces.

Dimension d:

Concentration on Shell (d = 2)

Concentration of Measure

Expected Distance

√2 ≈ 1.41

E[||X||]

Shell Concentration

19.0%

in outer 10% shell

Relative Std Dev

0.707

σ/μ ratio

Typical Angle

≈ 90°

between random vectors

The Concentration of Measure Phenomenon

In high dimensions, almost all the probability mass of a standard MVN concentrates on a thin spherical shell at distance √d from the origin. At d = 100, over 99.9% of samples lie in the outer 10% shell! This is why high-dimensional data behaves so differently from our 2D/3D intuitions.

Implications for Deep Learning

• Weight Initialization: Xavier/He initialization scales by 1/√d to control activation magnitudes
• Batch Normalization: Re-centers activations to combat the concentration phenomenon
• Nearest Neighbor Search: All points become equidistant in high-d, making similarity search harder
• Random Projections: The Johnson-Lindenstrauss lemma exploits concentration for dimensionality reduction

Key High-Dimensional Phenomena

Shell Concentration: For standard MVN in $d$ dimensions, almost all samples lie on a thin shell at distance $\sqrt{d}$ from the origin. The origin—the mode of the distribution—has near-zero probability!
Orthogonality: Random pairs of samples become nearly orthogonal as $d \to \infty$ . The expected cosine between two random unit vectors goes to 0.
Curse of Dimensionality: The volume of the unit ball shrinks exponentially. Most of the volume concentrates in the "corners" of the hypercube.

Implications for Deep Learning

These phenomena directly impact neural network design:

Weight Initialization: Xavier/He initialization scales weights by $1/\sqrt{d}$ to keep activation variances constant across layers
Batch Normalization: Re-centers and re-scales activations to combat distribution shift caused by concentration
Layer Normalization: Normalizes across feature dimensions, leveraging the shell concentration phenomenon

AI/ML Applications

1. Variational Autoencoders (VAEs)

VAEs use MVN as both prior and approximate posterior:

\mathbf{z} \sim N(\mathbf{0}, \mathbf{I}) \quad \text{(prior)}

q_\phi(\mathbf{z}|\mathbf{x}) = N(\boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))) \quad \text{(encoder)}

The reparameterization trick uses the affine property: sample $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \mathbf{I})$ .

2. Gaussian Processes

A GP defines a prior over functions where any finite collection of function values is MVN:

\begin{pmatrix} f(\mathbf{x}_1) \\ \vdots \\ f(\mathbf{x}_n) \end{pmatrix} \sim N\left(\mathbf{0}, \begin{pmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_n) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_n, \mathbf{x}_1) & \cdots & k(\mathbf{x}_n, \mathbf{x}_n) \end{pmatrix}\right)

The conditional distribution formula gives us posterior predictions with uncertainty!

3. Bayesian Neural Networks

Weight distributions are often modeled as MVN:

\mathbf{w} \sim N(\boldsymbol{\mu}_w, \boldsymbol{\Sigma}_w)

The Laplace approximation approximates the posterior as Gaussian using the Hessian.

4. Probabilistic PCA

PPCA assumes latent factors are MVN:

\mathbf{z} \sim N(\mathbf{0}, \mathbf{I}), \quad \mathbf{x} | \mathbf{z} \sim N(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2\mathbf{I})

Marginalizing over $\mathbf{z}$ , we get $\mathbf{x} \sim N(\boldsymbol{\mu}, \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I})$ .

5. Fisher Information and Natural Gradient

For MVN, the Fisher information has special structure enabling efficient natural gradient descent. This underpins K-FAC and other second-order optimization methods for neural networks.

Python Implementation

Basic MVN Operations

Multivariate Normal: Basic Operations

🐍mvn_basics.py

Explanation(7)

Code(49)

5Mean Vector

The mean vector μ specifies the center of the distribution in d-dimensional space. Each component is the mean of the corresponding marginal distribution.

12Covariance Matrix Construction

The covariance matrix Σ is symmetric positive definite. Diagonal elements are variances (σ²), off-diagonal elements are covariances (ρσ_xσ_y). This matrix encodes all second-order information about the distribution.

19SciPy MVN Object

scipy.stats.multivariate_normal creates a frozen distribution object. You can then call .pdf() for density, .rvs() for sampling, .logpdf() for log-density, and more.

28PDF Evaluation

The PDF value tells us the probability density at a specific point. For MVN, this uses the formula involving the determinant and quadratic form (x-μ)ᵀΣ⁻¹(x-μ).

31Eigendecomposition

The eigenvalues of Σ are the variances along the principal axes, and the eigenvectors point along those axes. This is the key to understanding the elliptical contours!

36Principal Axis Angle

The angle of the first eigenvector from the x-axis tells us how the ellipse is rotated. When ρ=0, the eigenvectors align with the coordinate axes.

44Conditional Distribution

For bivariate normal, Y|X=x is also normal. The conditional mean shifts proportionally to (x - μ_X), and the conditional variance shrinks by factor (1 - ρ²).

42 lines without explanation

1import numpy as np
2from scipy.stats import multivariate_normal
3import matplotlib.pyplot as plt
4
5# Define the multivariate normal distribution
6# Mean vector (2D)
7mu = np.array([1.0, 2.0])
8
9# Covariance matrix (2x2, symmetric positive definite)
10# Variances on diagonal, covariances off-diagonal
11sigma_x, sigma_y = 1.5, 1.0
12rho = 0.7  # Correlation coefficient
13cov = rho * sigma_x * sigma_y
14
15Sigma = np.array([
16    [sigma_x**2, cov],
17    [cov, sigma_y**2]
18])
19
20# Create the multivariate normal distribution
21mvn = multivariate_normal(mean=mu, cov=Sigma)
22
23# Sample from the distribution
24np.random.seed(42)
25samples = mvn.rvs(size=1000)
26
27# Evaluate PDF at a specific point
28point = np.array([1.5, 2.5])
29pdf_value = mvn.pdf(point)
30print(f"PDF at {point}: {pdf_value:.4f}")
31
32# Compute eigenvalues and eigenvectors of Sigma
33eigenvalues, eigenvectors = np.linalg.eig(Sigma)
34print(f"\nEigenvalues: {eigenvalues}")
35print(f"Eigenvectors:\n{eigenvectors}")
36
37# Principal axis angle
38angle = np.arctan2(eigenvectors[1, 0], eigenvectors[0, 0])
39print(f"Principal axis angle: {np.degrees(angle):.1f}°")
40
41# Verify: marginal distributions
42print(f"\nMarginal X: N({mu[0]}, {Sigma[0,0]:.3f})")
43print(f"Marginal Y: N({mu[1]}, {Sigma[1,1]:.3f})")
44
45# Conditional distribution Y | X = x_obs
46x_obs = 2.0
47cond_mean = mu[1] + rho * (sigma_y / sigma_x) * (x_obs - mu[0])
48cond_var = sigma_y**2 * (1 - rho**2)
49print(f"\nConditional Y|X={x_obs}: N({cond_mean:.3f}, {cond_var:.3f})")

Sampling via Cholesky Decomposition

Efficient MVN Sampling

🐍mvn_sampling.py

Explanation(5)

Code(43)

5Cholesky Sampling Method

The key insight: if Z ~ N(0, I) and Σ = LLᵀ (Cholesky), then μ + LZ ~ N(μ, Σ). This transforms independent standard normals into correlated MVN samples.

12Cholesky Decomposition

np.linalg.cholesky finds the lower triangular matrix L such that LLᵀ = Σ. This only works if Σ is positive definite (all eigenvalues > 0).

15Standard Normal Samples

We start with d independent standard normal samples Z ~ N(0, I). These are easy to generate using Box-Muller or inverse transform.

18Affine Transformation

The transformation X = μ + LZ applies a linear map that introduces the correlations encoded in Σ, then shifts to the correct mean.

31Verification

We check that our samples have the correct statistics. With enough samples, the sample mean and covariance should closely match the true parameters.

38 lines without explanation

1import numpy as np
2import matplotlib.pyplot as plt
3
4def sample_mvn_cholesky(mu, Sigma, n_samples):
5    """
6    Sample from MVN using Cholesky decomposition.
7    This is how most libraries implement MVN sampling.
8
9    X = μ + L @ Z, where Z ~ N(0, I) and L @ L.T = Σ
10    """
11    d = len(mu)
12
13    # Cholesky decomposition: Σ = L @ L.T
14    L = np.linalg.cholesky(Sigma)
15
16    # Sample standard normal vectors
17    Z = np.random.randn(n_samples, d)
18
19    # Transform: X = μ + L @ Z
20    X = mu + Z @ L.T
21
22    return X
23
24# Example: Bivariate normal
25mu = np.array([0, 0])
26Sigma = np.array([[1.0, 0.6],
27                  [0.6, 1.0]])
28
29# Generate samples
30np.random.seed(42)
31samples = sample_mvn_cholesky(mu, Sigma, 2000)
32
33# Verify statistics
34print("Sample mean:", samples.mean(axis=0))
35print("Sample covariance:\n", np.cov(samples.T))
36print("True covariance:\n", Sigma)
37
38# The Cholesky factor L
39L = np.linalg.cholesky(Sigma)
40print("\nCholesky factor L:")
41print(L)
42print("\nVerify L @ L.T = Σ:")
43print(L @ L.T)

Mahalanobis Distance and Confidence Ellipses

Mahalanobis Distance

🐍mahalanobis.py

Explanation(4)

Code(44)

7Mahalanobis Distance Formula

Unlike Euclidean distance, Mahalanobis distance accounts for correlations and different variances. Points along the principal axis (direction of maximum variance) are 'closer' in this metric.

11Matrix Inverse

We need Σ⁻¹ (the precision matrix). For large d, computing this directly is expensive; iterative methods or Cholesky solve are preferred.

24Same Euclidean, Different Mahalanobis

Two points at the same Euclidean distance from the mean can have very different Mahalanobis distances. The point along the principal axis is 'more typical' for the distribution.

38Chi-Square Connection

The squared Mahalanobis distance follows a chi-squared distribution with d degrees of freedom. This is used to construct confidence ellipses: the boundary where D² = χ²_{d,α}.

40 lines without explanation

1import numpy as np
2from scipy.stats import chi2
3
4def mahalanobis_distance(x, mu, Sigma):
5    """
6    Compute Mahalanobis distance from x to μ.
7
8    D² = (x - μ)ᵀ Σ⁻¹ (x - μ)
9
10    This is the "natural" distance metric for MVN.
11    """
12    diff = x - mu
13    Sigma_inv = np.linalg.inv(Sigma)
14    return np.sqrt(diff @ Sigma_inv @ diff)
15
16# Example
17mu = np.array([0, 0])
18Sigma = np.array([[1.0, 0.8],
19                  [0.8, 1.0]])
20
21# Compare two points at same Euclidean distance
22point1 = np.array([1.5, 0.0])   # Along x-axis
23point2 = np.array([1.06, 1.06]) # Along principal axis
24
25# Euclidean distances
26eucl_1 = np.linalg.norm(point1 - mu)
27eucl_2 = np.linalg.norm(point2 - mu)
28
29# Mahalanobis distances
30maha_1 = mahalanobis_distance(point1, mu, Sigma)
31maha_2 = mahalanobis_distance(point2, mu, Sigma)
32
33print(f"Point 1: {point1}")
34print(f"  Euclidean: {eucl_1:.3f}, Mahalanobis: {maha_1:.3f}")
35print(f"\nPoint 2: {point2}")
36print(f"  Euclidean: {eucl_2:.3f}, Mahalanobis: {maha_2:.3f}")
37
38# For MVN: D² follows chi-squared distribution!
39# Probability that D² <= threshold
40d = 2  # dimensions
41threshold = 2.0
42prob = chi2.cdf(threshold**2, df=d)
43print(f"\nP(D² <= {threshold}²) = {prob:.3f}")
44print(f"This defines a {prob*100:.1f}% confidence ellipse")

Common Pitfalls

Pitfall 1: Non-Positive Definite Covariance

The covariance matrix must be positive definite for the MVN PDF to be valid. Common causes of non-PD matrices:

Rounding errors in computed covariances
More variables than observations (singular matrix)
Perfectly correlated features

Fix: Add small regularization $\Sigma + \epsilon I$ , or use robust covariance estimation.

Pitfall 2: Confusing Correlation and Independence

For MVN only, zero correlation implies independence. But if you're not certain your data is Gaussian, uncorrelated variables can still be dependent!

Pitfall 3: Forgetting the Jacobian

When transforming MVN random variables, the PDF transforms with a Jacobian factor. For $\mathbf{Y} = g(\mathbf{X})$ :

f_Y(\mathbf{y}) = f_X(g^{-1}(\mathbf{y})) \cdot \left|\det\left(\frac{\partial g^{-1}}{\partial \mathbf{y}}\right)\right|

Pitfall 4: High-Dimensional Covariance Estimation

Estimating $\Sigma$ requires $O(d^2)$ parameters. With $n < d$ samples, the sample covariance is singular.

Solutions: Shrinkage estimators (Ledoit-Wolf), factor models, diagonal approximations.

Pitfall 5: Assuming Gaussianity

Real data is often not Gaussian: heavy tails, skewness, multimodality. Always check normality assumptions, especially in the tails where departures matter most.

Test Your Understanding

Question 1 of 8

For a bivariate normal (X, Y) with correlation ρ, what is the marginal distribution of X?

Summary

The multivariate normal distribution is foundational to modern statistics and machine learning. Here are the key takeaways:

Mathematical Foundations

PDF: $f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$
Parameters: Mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$
Contours: Ellipsoids whose axes align with eigenvectors of $\boldsymbol{\Sigma}$

Key Properties

Marginals are Gaussian: Any subset of components is MVN
Conditionals are Gaussian: With linear mean and constant variance
Affine transformations preserve Gaussianity: $\mathbf{AX} + \mathbf{b} \sim N(\mathbf{A}\boldsymbol{\mu}+\mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T)$
Uncorrelated = Independent: Only for MVN!

High-Dimensional Behavior

Probability concentrates on a shell at distance $\sqrt{d}$ from origin
Random vectors become nearly orthogonal
This affects weight initialization, normalization, and optimization in neural networks

AI/ML Applications

VAEs: MVN priors and posteriors with reparameterization
Gaussian Processes: Prior over functions, exact posterior inference
Bayesian Neural Networks: Gaussian weight posteriors
Probabilistic PCA: Latent variable models
Natural Gradient: Fisher information geometry

The Central Message

The multivariate normal is special because it is closed under marginalization, conditioning, and linear transformations. This closure makes analytical calculations possible and is why Gaussian assumptions are so prevalent in machine learning.

Understanding MVN deeply—its geometry, its behavior in high dimensions, and its computational properties—is essential for anyone working in probabilistic machine learning.