Chapter 7
35 min read
Section 52 of 175

Multivariate Normal Distribution

Multivariate Distributions

Learning Objectives

The multivariate normal (MVN) distribution is the most important continuous distribution in statistics and machine learning. By the end of this section, you will master the MVN and understand why it appears everywhere in modern AI systems.

  1. Define the MVN mathematically and understand every component of its PDF: the mean vector μ\boldsymbol{\mu}, the covariance matrix Σ\boldsymbol{\Sigma}, and the normalizing constant
  2. Visualize and interpret the elliptical contours of the MVN, understanding how eigenvalues and eigenvectors determine the shape and orientation
  3. Compute marginal and conditional distributions of MVN, recognizing that they remain Gaussian with closed-form parameters
  4. Understand the affine transformation property: linear transformations of MVN are still MVN, which is fundamental for neural network analysis
  5. Appreciate high-dimensional behavior: the concentration of measure phenomenon and its implications for machine learning
  6. Apply MVN concepts in practical contexts: Gaussian processes, VAEs, Bayesian deep learning, and probabilistic PCA
  7. Implement MVN sampling and density evaluation efficiently using Cholesky decomposition and understand the Mahalanobis distance

Why This is the Most Important Distribution

The MVN is to multivariate statistics what the univariate normal is to single-variable statistics—but even more fundamental. It arises naturally from the Central Limit Theorem for sums of random vectors, describes thermal noise in electronic systems, and serves as the foundation for Gaussian processes, factor analysis, and most Bayesian machine learning.

In deep learning specifically: weight initializations are Gaussian, dropout noise is related to Gaussian noise injection, reparameterization tricks in VAEs use Gaussian distributions, and even the training dynamics of wide neural networks converge to Gaussian processes!


The Big Picture: Why MVN Matters

"The multivariate normal is the hydrogen atom of statistics—simple enough to analyze completely, yet rich enough to capture the essential phenomena."

The multivariate normal extends the familiar bell curve to dd dimensions. Just as the univariate normal N(μ,σ2)N(\mu, \sigma^2) is parameterized by a mean and variance, the MVN N(μ,Σ)N(\boldsymbol{\mu}, \boldsymbol{\Sigma}) is parameterized by:

  • Mean vector μRd\boldsymbol{\mu} \in \mathbb{R}^d: The center of the distribution, the expected value E[X]E[\mathbf{X}]
  • Covariance matrix ΣRd×d\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}: A symmetric positive definite matrix that encodes variances and correlations

The covariance matrix is the key to understanding MVN geometry. Its entries tell us:

Σij=Cov(Xi,Xj)=E[(Xiμi)(Xjμj)]\Sigma_{ij} = \text{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]
  • Diagonal entries Σii\Sigma_{ii}: The variance of each component
  • Off-diagonal entries Σij\Sigma_{ij}: Covariances between pairs of components

The Geometry of MVN

The contours of constant probability density are ellipsoids centered atμ\boldsymbol{\mu}. The shape and orientation of these ellipsoids are determined by the eigenvalues and eigenvectors of Σ\boldsymbol{\Sigma}:

  • Eigenvectors: Point along the principal axes of the ellipsoid
  • Eigenvalues: Determine the length of each principal axis (variance along that direction)

Historical Origins: From Gauss to Modern AI

Carl Friedrich Gauss (1777-1855)

The normal distribution is often called the "Gaussian distribution" after Carl Friedrich Gauss, who derived it in the context of astronomical measurement errors around 1809. Gauss showed that if errors arise from the sum of many small independent effects, their distribution approaches the normal—the first statement of what we now call the Central Limit Theorem.

Francis Galton (1822-1911) and Karl Pearson (1857-1936)

Galton extended univariate ideas to the bivariate case, discovering regression to the mean and developing the concept of correlation. His student Karl Pearson formalized the bivariate normal and pioneered multivariate analysis, developing Principal Component Analysis (PCA).

R.A. Fisher and Modern Statistics

R.A. Fisher (1890-1962) developed the theoretical foundations for maximum likelihood estimation and multivariate analysis, showing how MVN structure enables powerful inference procedures. The Fisher information matrix—central to modern neural network optimization—derives its geometric meaning from the MVN.

The MVN in AI History

The MVN has been pivotal in AI development:

  • 1960s: Linear discriminant analysis (LDA) assumes MVN class-conditional distributions
  • 1990s: Gaussian mixture models and the EM algorithm
  • 2000s: Gaussian processes for nonparametric Bayesian learning
  • 2013: VAEs use MVN priors and posteriors with reparameterization tricks
  • 2018+: Neural tangent kernel theory shows wide networks → Gaussian processes

Mathematical Definition

The Probability Density Function

A random vector X=(X1,,Xd)T\mathbf{X} = (X_1, \ldots, X_d)^T follows a d-dimensional multivariate normal distribution, writtenXN(μ,Σ)\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}), if its probability density function is:

f(x)=1(2π)d/2Σ1/2exp(12(xμ)TΣ1(xμ))f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)

Let's understand each component:

ComponentMeaningRole
(2π)^(d/2)Dimension factorNormalizes the integral to 1
|Σ|^(1/2)Square root of determinantAccounts for the 'volume' of the covariance ellipsoid
(x - μ)Deviation from meanHow far the point is from the center
Σ⁻¹Precision matrixInverse of covariance; gives 'natural' metric
Quadratic form(x-μ)ᵀΣ⁻¹(x-μ)Squared Mahalanobis distance

The Mahalanobis Distance

The exponent contains the squared Mahalanobis distance:

D2(x)=(xμ)TΣ1(xμ)D^2(\mathbf{x}) = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})

This is the "natural" distance for MVN—it measures how many standard deviations a point is from the mean, accounting for correlations. Points on the same contour ellipse have the same Mahalanobis distance.

Key Insight: D² ~ χ²(d)

If XN(μ,Σ)\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}), then the squared Mahalanobis distance follows a chi-squared distribution:

D2=(Xμ)TΣ1(Xμ)χd2D^2 = (\mathbf{X} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}) \sim \chi^2_d

This is how we construct confidence ellipsoids: the boundary where D2=χd,α2D^2 = \chi^2_{d,\alpha}.


Interactive PDF Explorer

Explore how the MVN parameters affect the distribution. Adjust the mean vector, variances, and correlation to see how the contour ellipses change. Toggle the principal axes to see the eigenvector directions, and enable marginal distributions to see the projections.

Multivariate Normal PDF Explorer

λ₁ = 1.50λ₂ = 0.50μ = (0.0, 0.0)-3-3-2-2-1-100112233XYContour Levels1σ2σ3σ
Covariance Matrix Σ
Σ=(1.000.500.501.00)\boldsymbol{\Sigma} = \begin{pmatrix} 1.00 & 0.50 \\ 0.50 & 1.00 \end{pmatrix}
det(Σ) = 0.750
Key Properties
  • Eigenvalues: λ₁ = 1.500, λ₂ = 0.500
  • Principal Angle: 45.0°
  • Correlation: ρ = 0.50
  • Shape: Strongly elliptical

What to Try

  • Set ρ = 0 to see circular contours (when σ_X = σ_Y)
  • Increase |ρ| toward ±1 to see elongated ellipses
  • Make σ_X ≠ σ_Y with ρ = 0 to see axis-aligned ellipses
  • Watch how the eigenvalue labels change as you adjust parameters
  • Enable marginals to see that they remain Gaussian regardless of ρ!

Geometry: Ellipses and Eigenvalues

The shape of MVN contours is determined by the eigendecomposition of the covariance matrix:

Σ=QΛQT\boldsymbol{\Sigma} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^T

where Q\mathbf{Q} contains the eigenvectors as columns, andΛ\boldsymbol{\Lambda} is diagonal with eigenvaluesλ1,,λd\lambda_1, \ldots, \lambda_d.

  • Eigenvectors give the directions of the principal axes
  • Eigenvalues give the variances along each principal axis
  • √λ gives the standard deviation along each axis

How Covariance Shapes the Ellipse

Step 1: Unit Circle
Start with the standard bivariate normal N(0, I) which has circular contours.
-2-20022
Covariance Matrix
Σ = [1.00, 0.00
    0.00, 1.00]
Eigenvalue λ₁
1.000
Major axis length: 2.00σ
Eigenvalue λ₂
1.000
Minor axis length: 2.00σ
Key Insight
The eigenvectors of Σ point along the principal axes of the ellipse, and the eigenvalues determine the length of each axis. The ratio λ₁/λ₂ = 1.00 tells us how elongated the ellipse is. When ρ = 0, the eigenvectors align with the coordinate axes.

Principal Component Analysis (PCA)

PCA is exactly this eigendecomposition! The principal components are the eigenvectors, ordered by eigenvalue. PCA rotates data so that the MVN becomes axis-aligned, with variances λ₁ ≥ λ₂ ≥ ... ≥ λ_d along the axes.

Key Formulas for Bivariate Case

For the 2D case with covariance matrix:

Σ=(σX2ρσXσYρσXσYσY2)\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}

The eigenvalues are:

λ1,2=σX2+σY22±(σX2σY22)2+ρ2σX2σY2\lambda_{1,2} = \frac{\sigma_X^2 + \sigma_Y^2}{2} \pm \sqrt{\left(\frac{\sigma_X^2 - \sigma_Y^2}{2}\right)^2 + \rho^2\sigma_X^2\sigma_Y^2}

And the principal axis makes angle θ with the x-axis:

tan(2θ)=2ρσXσYσX2σY2\tan(2\theta) = \frac{2\rho\sigma_X\sigma_Y}{\sigma_X^2 - \sigma_Y^2}

Key Properties of MVN

1. Marginal Distributions

Every marginal distribution of an MVN is also normal:

XiN(μi,Σii)X_i \sim N(\mu_i, \Sigma_{ii})

More generally, any subset of components has a multivariate normal distribution with the corresponding subvector of means and submatrix of covariances.

Independence vs Zero Correlation

For MVN (and only for MVN), uncorrelated implies independent. If Cov(Xi,Xj)=0\text{Cov}(X_i, X_j) = 0 for all iji \neq j, then the components are mutually independent.

This is a special property! For general distributions, zero correlation does not imply independence.

2. Affine Transformation Property

This is perhaps the most powerful property. If XN(μ,Σ)\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})and A\mathbf{A} is a matrix with b\mathbf{b} a vector, then:

AX+bN(Aμ+b,AΣAT)\mathbf{A}\mathbf{X} + \mathbf{b} \sim N(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T)

Why This Matters for Neural Networks

A linear layer in a neural network computes Wx+b\mathbf{W}\mathbf{x} + \mathbf{b}. If the input is Gaussian (which approximately holds for many initializations and activations), the output before the nonlinearity is also Gaussian! This enables:

  • Analysis of weight initialization schemes (Xavier, He)
  • Understanding of batch normalization effects
  • Neural tangent kernel theory for infinite-width networks

3. Sum of Independent Gaussians

If XN(μX,ΣX)\mathbf{X} \sim N(\boldsymbol{\mu}_X, \boldsymbol{\Sigma}_X) andYN(μY,ΣY)\mathbf{Y} \sim N(\boldsymbol{\mu}_Y, \boldsymbol{\Sigma}_Y) are independent:

X+YN(μX+μY,ΣX+ΣY)\mathbf{X} + \mathbf{Y} \sim N(\boldsymbol{\mu}_X + \boldsymbol{\mu}_Y, \boldsymbol{\Sigma}_X + \boldsymbol{\Sigma}_Y)

Conditional Distributions

One of the most beautiful properties of MVN: conditionals are also Gaussian, with closed-form parameters. Partition X\mathbf{X} as:

X=(X1X2),μ=(μ1μ2),Σ=(Σ11Σ12Σ21Σ22)\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}

Then the conditional distribution is:

X1X2=x2N(μ12,Σ12)\mathbf{X}_1 | \mathbf{X}_2 = \mathbf{x}_2 \sim N(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2})

with:

μ12=μ1+Σ12Σ221(x2μ2)\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)
Σ12=Σ11Σ12Σ221Σ21\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}

Conditional Distribution of MVN

XYX = 1.5f(Y | X = 1.5)μY|X = 1.05-202ConditionalMarginal
Conditional Distribution Formula
YX=xN(μY+ρσYσX(xμX),  σY2(1ρ2))Y | X = x \sim N\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X), \; \sigma_Y^2(1 - \rho^2)\right)
Current Values
Conditional Mean:1.050
Conditional Std:0.714
Marginal Variance:1.000
Variance Reduction:49.0%
Key Insight: Variance Reduction
Observing X reduces our uncertainty about Y. The conditional variance σ²Y|X = σ²Y(1 - ρ²) is always smaller than the marginal variance σ²Y. With ρ = 0.70, knowing X explains 49.0% of the variance in Y!

Key Observations

  • Conditional mean is a linear function of the observed values—this is why linear regression is optimal under Gaussian assumptions!
  • Conditional variance does not depend on the observed value—it's constant (homoscedastic). Observing X2\mathbf{X}_2 always reduces variance by the same amount.
  • The formula Σ12Σ221\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} appears everywhere in statistics: it's the regression coefficient matrix!

High Dimensions: The Concentration of Measure

In machine learning, we often work with very high-dimensional data (images with millions of pixels, language models with billions of parameters). The MVN exhibits surprising behavior in high dimensions that is crucial to understand.

High-Dimensional MVN: The Curse and Blessing

As dimension d increases, the multivariate normal exhibits fascinating behaviors that are crucial for understanding modern deep learning systems operating in high-dimensional spaces.

Dimension d:
Concentration on Shell (d = 2)
E[||X||] = √d90% boundary
Concentration of Measure
0%50%100%1235102050100Dimension dVolume in outer 10% shell
Expected Distance
21.41
E[||X||]
Shell Concentration
19.0%
in outer 10% shell
Relative Std Dev
0.707
σ/μ ratio
Typical Angle
≈ 90°
between random vectors
The Concentration of Measure Phenomenon
In high dimensions, almost all the probability mass of a standard MVN concentrates on a thin spherical shell at distance √d from the origin. At d = 100, over 99.9% of samples lie in the outer 10% shell! This is why high-dimensional data behaves so differently from our 2D/3D intuitions.
Implications for Deep Learning
  • Weight Initialization: Xavier/He initialization scales by 1/√d to control activation magnitudes
  • Batch Normalization: Re-centers activations to combat the concentration phenomenon
  • Nearest Neighbor Search: All points become equidistant in high-d, making similarity search harder
  • Random Projections: The Johnson-Lindenstrauss lemma exploits concentration for dimensionality reduction

Key High-Dimensional Phenomena

  1. Shell Concentration: For standard MVN in dd dimensions, almost all samples lie on a thin shell at distance d\sqrt{d} from the origin. The origin—the mode of the distribution—has near-zero probability!
  2. Orthogonality: Random pairs of samples become nearly orthogonal as dd \to \infty. The expected cosine between two random unit vectors goes to 0.
  3. Curse of Dimensionality: The volume of the unit ball shrinks exponentially. Most of the volume concentrates in the "corners" of the hypercube.

Implications for Deep Learning

These phenomena directly impact neural network design:

  • Weight Initialization: Xavier/He initialization scales weights by 1/d1/\sqrt{d}to keep activation variances constant across layers
  • Batch Normalization: Re-centers and re-scales activations to combat distribution shift caused by concentration
  • Layer Normalization: Normalizes across feature dimensions, leveraging the shell concentration phenomenon

AI/ML Applications

1. Variational Autoencoders (VAEs)

VAEs use MVN as both prior and approximate posterior:

zN(0,I)(prior)\mathbf{z} \sim N(\mathbf{0}, \mathbf{I}) \quad \text{(prior)}
qϕ(zx)=N(μϕ(x),diag(σϕ2(x)))(encoder)q_\phi(\mathbf{z}|\mathbf{x}) = N(\boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))) \quad \text{(encoder)}

The reparameterization trick uses the affine property: samplez=μ+σϵ\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}where ϵN(0,I)\boldsymbol{\epsilon} \sim N(\mathbf{0}, \mathbf{I}).

2. Gaussian Processes

A GP defines a prior over functions where any finite collection of function values is MVN:

(f(x1)f(xn))N(0,(k(x1,x1)k(x1,xn)k(xn,x1)k(xn,xn)))\begin{pmatrix} f(\mathbf{x}_1) \\ \vdots \\ f(\mathbf{x}_n) \end{pmatrix} \sim N\left(\mathbf{0}, \begin{pmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_n) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_n, \mathbf{x}_1) & \cdots & k(\mathbf{x}_n, \mathbf{x}_n) \end{pmatrix}\right)

The conditional distribution formula gives us posterior predictions with uncertainty!

3. Bayesian Neural Networks

Weight distributions are often modeled as MVN:

wN(μw,Σw)\mathbf{w} \sim N(\boldsymbol{\mu}_w, \boldsymbol{\Sigma}_w)

The Laplace approximation approximates the posterior as Gaussian using the Hessian.

4. Probabilistic PCA

PPCA assumes latent factors are MVN:

zN(0,I),xzN(Wz+μ,σ2I)\mathbf{z} \sim N(\mathbf{0}, \mathbf{I}), \quad \mathbf{x} | \mathbf{z} \sim N(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2\mathbf{I})

Marginalizing over z\mathbf{z}, we get xN(μ,WWT+σ2I)\mathbf{x} \sim N(\boldsymbol{\mu}, \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}).

5. Fisher Information and Natural Gradient

For MVN, the Fisher information has special structure enabling efficient natural gradient descent. This underpins K-FAC and other second-order optimization methods for neural networks.


Python Implementation

Basic MVN Operations

Multivariate Normal: Basic Operations
🐍mvn_basics.py
5Mean Vector

The mean vector μ specifies the center of the distribution in d-dimensional space. Each component is the mean of the corresponding marginal distribution.

12Covariance Matrix Construction

The covariance matrix Σ is symmetric positive definite. Diagonal elements are variances (σ²), off-diagonal elements are covariances (ρσ_xσ_y). This matrix encodes all second-order information about the distribution.

19SciPy MVN Object

scipy.stats.multivariate_normal creates a frozen distribution object. You can then call .pdf() for density, .rvs() for sampling, .logpdf() for log-density, and more.

28PDF Evaluation

The PDF value tells us the probability density at a specific point. For MVN, this uses the formula involving the determinant and quadratic form (x-μ)ᵀΣ⁻¹(x-μ).

31Eigendecomposition

The eigenvalues of Σ are the variances along the principal axes, and the eigenvectors point along those axes. This is the key to understanding the elliptical contours!

36Principal Axis Angle

The angle of the first eigenvector from the x-axis tells us how the ellipse is rotated. When ρ=0, the eigenvectors align with the coordinate axes.

44Conditional Distribution

For bivariate normal, Y|X=x is also normal. The conditional mean shifts proportionally to (x - μ_X), and the conditional variance shrinks by factor (1 - ρ²).

42 lines without explanation
1import numpy as np
2from scipy.stats import multivariate_normal
3import matplotlib.pyplot as plt
4
5# Define the multivariate normal distribution
6# Mean vector (2D)
7mu = np.array([1.0, 2.0])
8
9# Covariance matrix (2x2, symmetric positive definite)
10# Variances on diagonal, covariances off-diagonal
11sigma_x, sigma_y = 1.5, 1.0
12rho = 0.7  # Correlation coefficient
13cov = rho * sigma_x * sigma_y
14
15Sigma = np.array([
16    [sigma_x**2, cov],
17    [cov, sigma_y**2]
18])
19
20# Create the multivariate normal distribution
21mvn = multivariate_normal(mean=mu, cov=Sigma)
22
23# Sample from the distribution
24np.random.seed(42)
25samples = mvn.rvs(size=1000)
26
27# Evaluate PDF at a specific point
28point = np.array([1.5, 2.5])
29pdf_value = mvn.pdf(point)
30print(f"PDF at {point}: {pdf_value:.4f}")
31
32# Compute eigenvalues and eigenvectors of Sigma
33eigenvalues, eigenvectors = np.linalg.eig(Sigma)
34print(f"\nEigenvalues: {eigenvalues}")
35print(f"Eigenvectors:\n{eigenvectors}")
36
37# Principal axis angle
38angle = np.arctan2(eigenvectors[1, 0], eigenvectors[0, 0])
39print(f"Principal axis angle: {np.degrees(angle):.1f}°")
40
41# Verify: marginal distributions
42print(f"\nMarginal X: N({mu[0]}, {Sigma[0,0]:.3f})")
43print(f"Marginal Y: N({mu[1]}, {Sigma[1,1]:.3f})")
44
45# Conditional distribution Y | X = x_obs
46x_obs = 2.0
47cond_mean = mu[1] + rho * (sigma_y / sigma_x) * (x_obs - mu[0])
48cond_var = sigma_y**2 * (1 - rho**2)
49print(f"\nConditional Y|X={x_obs}: N({cond_mean:.3f}, {cond_var:.3f})")

Sampling via Cholesky Decomposition

Efficient MVN Sampling
🐍mvn_sampling.py
5Cholesky Sampling Method

The key insight: if Z ~ N(0, I) and Σ = LLᵀ (Cholesky), then μ + LZ ~ N(μ, Σ). This transforms independent standard normals into correlated MVN samples.

12Cholesky Decomposition

np.linalg.cholesky finds the lower triangular matrix L such that LLᵀ = Σ. This only works if Σ is positive definite (all eigenvalues > 0).

15Standard Normal Samples

We start with d independent standard normal samples Z ~ N(0, I). These are easy to generate using Box-Muller or inverse transform.

18Affine Transformation

The transformation X = μ + LZ applies a linear map that introduces the correlations encoded in Σ, then shifts to the correct mean.

31Verification

We check that our samples have the correct statistics. With enough samples, the sample mean and covariance should closely match the true parameters.

38 lines without explanation
1import numpy as np
2import matplotlib.pyplot as plt
3
4def sample_mvn_cholesky(mu, Sigma, n_samples):
5    """
6    Sample from MVN using Cholesky decomposition.
7    This is how most libraries implement MVN sampling.
8
9    X = μ + L @ Z, where Z ~ N(0, I) and L @ L.T = Σ
10    """
11    d = len(mu)
12
13    # Cholesky decomposition: Σ = L @ L.T
14    L = np.linalg.cholesky(Sigma)
15
16    # Sample standard normal vectors
17    Z = np.random.randn(n_samples, d)
18
19    # Transform: X = μ + L @ Z
20    X = mu + Z @ L.T
21
22    return X
23
24# Example: Bivariate normal
25mu = np.array([0, 0])
26Sigma = np.array([[1.0, 0.6],
27                  [0.6, 1.0]])
28
29# Generate samples
30np.random.seed(42)
31samples = sample_mvn_cholesky(mu, Sigma, 2000)
32
33# Verify statistics
34print("Sample mean:", samples.mean(axis=0))
35print("Sample covariance:\n", np.cov(samples.T))
36print("True covariance:\n", Sigma)
37
38# The Cholesky factor L
39L = np.linalg.cholesky(Sigma)
40print("\nCholesky factor L:")
41print(L)
42print("\nVerify L @ L.T = Σ:")
43print(L @ L.T)

Mahalanobis Distance and Confidence Ellipses

Mahalanobis Distance
🐍mahalanobis.py
7Mahalanobis Distance Formula

Unlike Euclidean distance, Mahalanobis distance accounts for correlations and different variances. Points along the principal axis (direction of maximum variance) are 'closer' in this metric.

11Matrix Inverse

We need Σ⁻¹ (the precision matrix). For large d, computing this directly is expensive; iterative methods or Cholesky solve are preferred.

24Same Euclidean, Different Mahalanobis

Two points at the same Euclidean distance from the mean can have very different Mahalanobis distances. The point along the principal axis is 'more typical' for the distribution.

38Chi-Square Connection

The squared Mahalanobis distance follows a chi-squared distribution with d degrees of freedom. This is used to construct confidence ellipses: the boundary where D² = χ²_{d,α}.

40 lines without explanation
1import numpy as np
2from scipy.stats import chi2
3
4def mahalanobis_distance(x, mu, Sigma):
5    """
6    Compute Mahalanobis distance from x to μ.
7
8    D² = (x - μ)ᵀ Σ⁻¹ (x - μ)
9
10    This is the "natural" distance metric for MVN.
11    """
12    diff = x - mu
13    Sigma_inv = np.linalg.inv(Sigma)
14    return np.sqrt(diff @ Sigma_inv @ diff)
15
16# Example
17mu = np.array([0, 0])
18Sigma = np.array([[1.0, 0.8],
19                  [0.8, 1.0]])
20
21# Compare two points at same Euclidean distance
22point1 = np.array([1.5, 0.0])   # Along x-axis
23point2 = np.array([1.06, 1.06]) # Along principal axis
24
25# Euclidean distances
26eucl_1 = np.linalg.norm(point1 - mu)
27eucl_2 = np.linalg.norm(point2 - mu)
28
29# Mahalanobis distances
30maha_1 = mahalanobis_distance(point1, mu, Sigma)
31maha_2 = mahalanobis_distance(point2, mu, Sigma)
32
33print(f"Point 1: {point1}")
34print(f"  Euclidean: {eucl_1:.3f}, Mahalanobis: {maha_1:.3f}")
35print(f"\nPoint 2: {point2}")
36print(f"  Euclidean: {eucl_2:.3f}, Mahalanobis: {maha_2:.3f}")
37
38# For MVN: D² follows chi-squared distribution!
39# Probability that D² <= threshold
40d = 2  # dimensions
41threshold = 2.0
42prob = chi2.cdf(threshold**2, df=d)
43print(f"\nP(D² <= {threshold}²) = {prob:.3f}")
44print(f"This defines a {prob*100:.1f}% confidence ellipse")

Common Pitfalls

Pitfall 1: Non-Positive Definite Covariance

The covariance matrix must be positive definite for the MVN PDF to be valid. Common causes of non-PD matrices:

  • Rounding errors in computed covariances
  • More variables than observations (singular matrix)
  • Perfectly correlated features

Fix: Add small regularization Σ+ϵI\Sigma + \epsilon I, or use robust covariance estimation.

Pitfall 2: Confusing Correlation and Independence

For MVN only, zero correlation implies independence. But if you're not certain your data is Gaussian, uncorrelated variables can still be dependent!

Pitfall 3: Forgetting the Jacobian

When transforming MVN random variables, the PDF transforms with a Jacobian factor. For Y=g(X)\mathbf{Y} = g(\mathbf{X}):

fY(y)=fX(g1(y))det(g1y)f_Y(\mathbf{y}) = f_X(g^{-1}(\mathbf{y})) \cdot \left|\det\left(\frac{\partial g^{-1}}{\partial \mathbf{y}}\right)\right|

Pitfall 4: High-Dimensional Covariance Estimation

Estimating Σ\Sigma requires O(d2)O(d^2) parameters. With n<dn < d samples, the sample covariance is singular.

Solutions: Shrinkage estimators (Ledoit-Wolf), factor models, diagonal approximations.

Pitfall 5: Assuming Gaussianity

Real data is often not Gaussian: heavy tails, skewness, multimodality. Always check normality assumptions, especially in the tails where departures matter most.


Test Your Understanding

Question 1 of 8

For a bivariate normal (X, Y) with correlation ρ, what is the marginal distribution of X?


Summary

The multivariate normal distribution is foundational to modern statistics and machine learning. Here are the key takeaways:

Mathematical Foundations

  • PDF: f(x)=1(2π)d/2Σ1/2exp(12(xμ)TΣ1(xμ))f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
  • Parameters: Mean vector μ\boldsymbol{\mu} and covariance matrix Σ\boldsymbol{\Sigma}
  • Contours: Ellipsoids whose axes align with eigenvectors of Σ\boldsymbol{\Sigma}

Key Properties

  • Marginals are Gaussian: Any subset of components is MVN
  • Conditionals are Gaussian: With linear mean and constant variance
  • Affine transformations preserve Gaussianity: AX+bN(Aμ+b,AΣAT)\mathbf{AX} + \mathbf{b} \sim N(\mathbf{A}\boldsymbol{\mu}+\mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T)
  • Uncorrelated = Independent: Only for MVN!

High-Dimensional Behavior

  • Probability concentrates on a shell at distance d\sqrt{d} from origin
  • Random vectors become nearly orthogonal
  • This affects weight initialization, normalization, and optimization in neural networks

AI/ML Applications

  • VAEs: MVN priors and posteriors with reparameterization
  • Gaussian Processes: Prior over functions, exact posterior inference
  • Bayesian Neural Networks: Gaussian weight posteriors
  • Probabilistic PCA: Latent variable models
  • Natural Gradient: Fisher information geometry

The Central Message

The multivariate normal is special because it is closed under marginalization, conditioning, and linear transformations. This closure makes analytical calculations possible and is why Gaussian assumptions are so prevalent in machine learning.

Understanding MVN deeply—its geometry, its behavior in high dimensions, and its computational properties—is essential for anyone working in probabilistic machine learning.

Loading comments...