Chapter 7

Conditional Distributions of MVN

Multivariate Distributions

Learning Objectives

Conditional distributions of the multivariate normal (MVN) are among the most powerful and widely-used results in all of statistics and machine learning. By the end of this section, you will:

  1. Master the formulas for conditional mean and conditional covariance of the multivariate normal distribution, understanding every term intuitively
  2. Visualize how conditioning on observed variables "slices" the joint distribution, producing a new Gaussian with shifted mean and reduced variance
  3. Derive the deep connection between conditional MVN and linear regression, showing that regression coefficients emerge naturally from the covariance structure
  4. Understand how Gaussian Processes use conditional MVN for function-space inference, enabling uncertainty-aware predictions
  5. Recognize the Kalman filter as sequential conditioning on observations, making state estimation principled
  6. Apply these concepts to modern deep learning: variational inference, latent variable models, and Bayesian neural networks
  7. Implement conditional MVN computations efficiently in Python using matrix operations

Why This is Foundational

The conditional MVN formulas are not just theoretical curiosities—they are the computational engine behind:

  • Gaussian Process regression and Bayesian optimization
  • Kalman filters for robotics, navigation, and time series
  • Linear and multivariate regression
  • Variational Autoencoders (VAEs) and flow-based models
  • Bayesian linear regression and uncertainty quantification

Master this section, and you'll have the key to understanding a vast landscape of probabilistic methods.


The Big Picture: What Happens When We Condition

"Conditioning is the soul of statistics." — Joe Blitzstein

Consider a multivariate normal distribution over several variables. When we observe (condition on) some of these variables, what happens to the distribution of the remaining variables?

The remarkable answer for the multivariate normal is:

The MVN Conditioning Miracle

  1. The conditional distribution is still Gaussian—conditioning preserves normality
  2. The conditional mean is a linear function of the observed values
  3. The conditional covariance is constant—it doesn't depend on the specific observed values, only on which variables were observed
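  4. The conditional variance never exceeds the marginal variance: it shrinks whenever the observed variables are correlated with the unobserved ones, and is unchanged when they are independent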

This is extraordinary! For most distributions, conditioning leads to complex, non-standard forms. But for the MVN, we get clean, closed-form formulas that are computationally tractable.

The Geometric Intuition

Imagine a 2D Gaussian as a "probability cloud" shaped like an ellipse. When we condition on $X = x_0$:

  1. We "slice" the cloud with a vertical plane at $x = x_0$
  2. The cross-section, once renormalized to integrate to 1, is another Gaussian (1D in this case)
  3. If the original ellipse was tilted (correlated) and $x_0 \neq \mu_X$, the slice is centered away from the marginal mean of $Y$
  4. The slice is never wider than the marginal distribution of $Y$, and strictly narrower whenever $\rho \neq 0$
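
The slicing picture can be checked numerically. The sketch below is only an illustration under assumed parameters (zero means, unit variances, $\rho = 0.7$, $x_0 = 1$, none of which come from the text): it evaluates the joint density along the line $x = x_0$, renormalizes it, and compares the slice's mean and variance with the closed-form conditional values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (assumed for this sketch)
mu_x, mu_y = 0.0, 0.0
sigma_x, sigma_y, rho = 1.0, 1.0, 0.7
x0 = 1.0

cov = np.array([[sigma_x**2, rho * sigma_x * sigma_y],
                [rho * sigma_x * sigma_y, sigma_y**2]])
joint = multivariate_normal(mean=[mu_x, mu_y], cov=cov)

# Evaluate the joint density along the vertical line x = x0
y_grid = np.linspace(-8.0, 8.0, 4001)
dy = y_grid[1] - y_grid[0]
points = np.column_stack([np.full_like(y_grid, x0), y_grid])
slice_density = joint.pdf(points)

# Renormalize the slice so it integrates to 1 (simple Riemann sum)
slice_density /= slice_density.sum() * dy

slice_mean = np.sum(y_grid * slice_density) * dy
slice_var = np.sum((y_grid - slice_mean) ** 2 * slice_density) * dy

# Closed-form conditional mean and variance
formula_mean = mu_y + rho * sigma_y / sigma_x * (x0 - mu_x)
formula_var = sigma_y**2 * (1 - rho**2)

print(f"slice mean {slice_mean:.4f}  vs formula {formula_mean:.4f}")
print(f"slice var  {slice_var:.4f}  vs formula {formula_var:.4f}")
```

Changing `x0` moves the slice mean but leaves the slice variance untouched, which is the geometric content of the list above.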

Historical Context: From Gauss to Machine Learning

The properties of conditional Gaussian distributions have a rich history spanning over 200 years.

Carl Friedrich Gauss (1777-1855)

Gauss developed the method of least squares for astronomical observations, implicitly using properties of conditional normal distributions. His work on error analysis laid the foundation for understanding how correlated measurements could be optimally combined.

Francis Galton (1822-1911)

Galton discovered regression toward the mean while studying heredity. He noticed that tall parents tended to have children closer to average height. This is precisely the behavior of the conditional mean in a bivariate normal:

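$$E[Y|X=x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)$$

When $|\rho| < 1$, the predicted $Y$ lies only $|\rho|$ times as many standard deviations from $\mu_Y$ as $x$ lies from $\mu_X$. The conditional mean is therefore "regressed" toward $\mu_Y$ compared with the naive prediction that $Y$ is as extreme as $X$.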

Karl Pearson (1857-1936)

Pearson formalized the correlation coefficient and developed the full theory of the bivariate normal distribution, including the explicit formulas for conditional distributions that we use today.

Rudolf Kálmán (1930-2016)

Kálmán revolutionized control theory by showing that optimal state estimation in linear dynamical systems is achieved by sequential Gaussian conditioning. The Kalman filter, published in 1960, uses the MVN conditional formulas at every time step.

Modern Relevance

Today, conditional MVN is the mathematical foundation for Gaussian Processes, which power Bayesian optimization (used to tune hyperparameters in deep learning), geostatistics, and uncertainty quantification in neural networks.


Bivariate Normal Conditioning: The Foundation

Let's start with the simplest case: two correlated normal random variables $(X, Y)$ with joint distribution:

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix} \right)$$

where $\rho \in [-1, 1]$ is the correlation coefficient.

The Conditional Distribution

Given that we observe $X = x$, the conditional distribution of $Y$ is:

$$Y | X = x \sim N\left( \mu_{Y|X}, \sigma_{Y|X}^2 \right)$$

where the conditional parameters are:

Conditional Mean

$$\mu_{Y|X} = E[Y|X=x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)$$

Let's understand each term:

  • $\mu_Y$: The starting point, the unconditional mean of $Y$
  • $(x - \mu_X)$: How far the observed $X$ is from its mean (in raw units)
  • $(x - \mu_X)/\sigma_X$: How far $X$ is from its mean (in standard deviations)
  • $\rho$: The correlation, or how much $Y$ should "follow" $X$
  • $\rho \sigma_Y / \sigma_X$: The regression coefficient, or how many units $Y$ shifts per unit increase in $X$

Conditional Variance

$$\sigma_{Y|X}^2 = \text{Var}(Y|X) = \sigma_Y^2 (1 - \rho^2)$$

This formula encapsulates a profound insight:

  • The conditional variance does not depend on $x$: the uncertainty reduction is the same regardless of which $X$ value we observe
  • The factor $(1 - \rho^2)$ always lies between 0 and 1, so conditioning never increases the variance
  • When $\rho = 0$: $\sigma_{Y|X}^2 = \sigma_Y^2$ (no reduction, because $X$ provides no information)
  • When $|\rho| = 1$: $\sigma_{Y|X}^2 = 0$ (perfect prediction, because $Y$ is deterministic given $X$)
  • $\rho^2$ is the fraction of variance explained, which is exactly $R^2$ in simple linear regression

The Key Insight

The conditional variance formula $\sigma_Y^2(1 - \rho^2)$ tells us:

  • $\rho^2$ = fraction of Y's variance explained by knowing X
  • $1 - \rho^2$ = fraction of Y's variance remaining after knowing X

If $\rho = 0.7$, then $\rho^2 = 0.49$, meaning knowing $X$ explains 49% of Y's variance!


Interactive Exploration: See Conditioning in Action

The visualization below shows a bivariate normal distribution. Adjust the correlation and the observed X value to see how the conditional distribution of Y changes.

[Interactive figure: Conditional Distribution of Bivariate Normal. Axes: X and Y. Legend: marginal f(Y), conditional f(Y|X), observed X. Example readout at X = 1.0: conditional mean E[Y|X=1.0] = 0.700, conditional standard deviation $\sigma_{Y|X} = \sigma_Y \sqrt{1 - \rho^2} = 0.714$, variance explained $\rho^2 = 49.0\%$.]

Key Insight: When we condition on $X = x$, the conditional distribution $Y|X$ is still normal, but with:

  • A shifted mean that depends linearly on the observed value of X
  • A variance reduced by the factor $(1 - \rho^2)$, independent of which X value we observe
  • Higher $|\rho|$ means knowing X gives more information about Y

What to Observe

  • As |ρ| increases: The conditional distribution becomes narrower (less uncertainty)
  • As X moves away from 0: The conditional mean shifts in the direction of correlation
  • The green curve (marginal): Never changes—it's the unconditional distribution of Y
  • The red curve (conditional): Shifts position and narrows based on ρ and observed X

General Multivariate Normal Conditioning

Now let's extend to the general case with $n$ variables. We partition the random vector into two groups:

$$\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix} \right)$$

where:

  • $\mathbf{X}_1$: The variables we want to predict (dimension $p$)
  • $\mathbf{X}_2$: The observed (conditioning) variables (dimension $q$, with $p + q = n$)
  • $\boldsymbol{\Sigma}_{11}$: Covariance of $\mathbf{X}_1$ ($p \times p$)
  • $\boldsymbol{\Sigma}_{22}$: Covariance of $\mathbf{X}_2$ ($q \times q$)
  • $\boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21}^T$: Cross-covariance ($p \times q$)

The General Conditional Formulas

Given observation $\mathbf{X}_2 = \mathbf{x}_2$, the conditional distribution of $\mathbf{X}_1$ is:

$$\mathbf{X}_1 | \mathbf{X}_2 = \mathbf{x}_2 \sim N\left( \bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}} \right)$$

where:

$$\bar{\boldsymbol{\mu}} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2)$$
$$\bar{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$$
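
As a quick consistency check (a worked special case, not a derivation of the general result), take $p = q = 1$ with $\mathbf{X}_1 = Y$ as the target and $\mathbf{X}_2 = X$ as the observed variable. Then $\boldsymbol{\Sigma}_{11} = \sigma_Y^2$, $\boldsymbol{\Sigma}_{22} = \sigma_X^2$, and $\boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21} = \rho \sigma_X \sigma_Y$, and the general formulas collapse to the bivariate ones from the previous section:

$$
\begin{aligned}
\bar{\mu} &= \mu_Y + \frac{\rho \sigma_X \sigma_Y}{\sigma_X^2}(x - \mu_X) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X), \\
\bar{\Sigma} &= \sigma_Y^2 - \frac{(\rho \sigma_X \sigma_Y)^2}{\sigma_X^2} = \sigma_Y^2 (1 - \rho^2).
\end{aligned}
$$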

Understanding the Matrix Formulas

Let's interpret each term:

Conditional Mean

$$\bar{\boldsymbol{\mu}} = \boldsymbol{\mu}_1 + \underbrace{\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}}_{\text{Regression matrix}} \underbrace{(\mathbf{x}_2 - \boldsymbol{\mu}_2)}_{\text{Deviation from mean}}$$

  • $\boldsymbol{\mu}_1$: Start from the unconditional mean of $\mathbf{X}_1$
  • $\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}$: The multivariate regression coefficient matrix ($p \times q$)
  • $(\mathbf{x}_2 - \boldsymbol{\mu}_2)$: How far the observations deviate from their mean
  • The mean shifts linearly with the observed values

Conditional Covariance

$$\bar{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}_{11} - \underbrace{\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}}_{\text{Variance reduction}}$$

  • $\boldsymbol{\Sigma}_{11}$: Start from the unconditional covariance of $\mathbf{X}_1$
  • $\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$: Variance reduction from knowing $\mathbf{X}_2$
  • This is a positive semi-definite matrix, so $\bar{\boldsymbol{\Sigma}} \preceq \boldsymbol{\Sigma}_{11}$
  • The result is independent of the specific observed values $\mathbf{x}_2$

Schur Complement

The conditional covariance $\bar{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$ is called the Schur complement of $\boldsymbol{\Sigma}_{22}$ in $\boldsymbol{\Sigma}$. It appears throughout linear algebra and statistics.
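
The claims about the Schur complement can be spot-checked numerically. The following sketch uses a randomly generated covariance matrix (an assumption made purely for illustration) to confirm that the subtracted term is positive semi-definite, that the conditional covariance is positive definite, and that it agrees with a standard identity: the Schur complement equals the inverse of the corresponding block of the precision matrix $\boldsymbol{\Sigma}^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive definite covariance (illustrative only)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 0.5 * np.eye(5)

idx1, idx2 = [0, 1], [2, 3, 4]   # X_1 = first two variables, X_2 = the rest
S11 = Sigma[np.ix_(idx1, idx1)]
S12 = Sigma[np.ix_(idx1, idx2)]
S22 = Sigma[np.ix_(idx2, idx2)]

# Schur complement of Sigma_22: Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
reduction = S12 @ np.linalg.solve(S22, S12.T)
Sigma_bar = S11 - reduction

# The subtracted term is PSD, so Sigma_bar <= Sigma_11 in the Loewner order
reduction_eigs = np.linalg.eigvalsh(reduction)
sigma_bar_eigs = np.linalg.eigvalsh(Sigma_bar)
print("eigenvalues of variance reduction:", reduction_eigs)
print("eigenvalues of Sigma_bar:         ", sigma_bar_eigs)

# Sigma_bar also equals the inverse of the (1,1) block of the precision matrix
Lambda = np.linalg.inv(Sigma)
Sigma_bar_from_precision = np.linalg.inv(Lambda[np.ix_(idx1, idx1)])
print("matches precision-block inverse:",
      np.allclose(Sigma_bar, Sigma_bar_from_precision))
```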


Connection to Linear Regression

One of the most beautiful results in statistics is that linear regression emerges naturally from the conditional MVN.

The Deep Connection

Consider the bivariate normal $(X, Y)$. The conditional expectation is:

$$E[Y|X=x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X) = \underbrace{\left(\mu_Y - \rho \frac{\sigma_Y}{\sigma_X} \mu_X\right)}_{\beta_0} + \underbrace{\rho \frac{\sigma_Y}{\sigma_X}}_{\beta_1} x$$

This is exactly the linear regression line $E[Y|X] = \beta_0 + \beta_1 X$!

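| Quantity | Formula | Meaning |
|---|---|---|
| Slope | $\beta_1 = \rho \sigma_Y / \sigma_X$ | Change in $E[Y]$ per unit increase in $X$ |
| Intercept | $\beta_0 = \mu_Y - \beta_1 \mu_X$ | $E[Y]$ when $X = 0$ |
| Residual variance | $\sigma^2_{Y \mid X} = \sigma_Y^2 (1 - \rho^2)$ | Unexplained variance |
| $R^2$ | $\rho^2$ | Fraction of variance explained |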
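
To see these quantities emerge from data, the sketch below draws samples from a bivariate normal with made-up parameters (assumptions for illustration only), fits an ordinary least squares line with `np.polyfit`, and compares the fitted slope, intercept, residual variance, and $R^2$ with the values predicted by the conditional MVN formulas.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (assumed for this sketch)
mu_x, mu_y = 2.0, -1.0
sigma_x, sigma_y, rho = 1.5, 2.0, 0.8

cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=200_000).T

# Ordinary least squares fit of y on x (polyfit returns highest degree first)
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
residuals = y - (intercept_hat + slope_hat * x)
resid_var_hat = residuals.var()
r_squared_hat = 1 - resid_var_hat / y.var()

# Population values from the conditional MVN formulas
beta_1 = rho * sigma_y / sigma_x
beta_0 = mu_y - beta_1 * mu_x
resid_var = sigma_y**2 * (1 - rho**2)

print(f"slope:        fitted {slope_hat:.3f}   formula {beta_1:.3f}")
print(f"intercept:    fitted {intercept_hat:.3f}  formula {beta_0:.3f}")
print(f"residual var: fitted {resid_var_hat:.3f}   formula {resid_var:.3f}")
print(f"R^2:          fitted {r_squared_hat:.3f}   formula {rho**2:.3f}")
```

The agreement is only approximate because the fitted values are estimates from a finite sample.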

Interactive Linear Regression Demo

The visualization below shows data from a bivariate normal and the regression line. The conditional distribution at any X value is shown as an orange curve.

[Interactive figure: Linear Regression as Conditional MVN. Axes: X and Y. Legend: data (bivariate normal), regression line E[Y|X], conditional density f(Y|X=x). Example readout at X = 1.5: regression slope $\beta_1 = \rho \sigma_Y / \sigma_X = 0.800$, conditional mean E[Y|X] = 1.200, residual standard deviation $\sigma_{Y|X} = \sigma_Y \sqrt{1 - \rho^2} = 0.600$.]


Key Takeaway

Linear regression is not just a curve-fitting technique: for jointly normal $(X, Y)$, the regression line is the conditional expectation $E[Y|X]$. Under this model, the OLS estimates coincide with the maximum likelihood estimates of the slope and intercept.


Gaussian Processes: MVN Conditioning in Function Space

"A Gaussian Process is a distribution over functions, where any finite collection of function values is jointly Gaussian."

Gaussian Processes (GPs) are one of the most powerful applications of conditional MVN. The key insight: GP regression is just MVN conditioning in infinite dimensions.

How GPs Work

A GP defines a prior distribution over functions $f(x)$. When we observe training data $(X, \mathbf{y})$, we condition on these observations to get a posterior distribution over functions.

At any finite set of test points $X_*$, the joint distribution of the training and test function values is:

$$\begin{pmatrix} \mathbf{f} \\ \mathbf{f}_* \end{pmatrix} \sim N\left( \mathbf{0}, \begin{pmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix} \right)$$

where $K(\cdot, \cdot)$ is the kernel function that defines the prior covariance structure.

The GP Posterior (Conditional MVN!)

Given observations $\mathbf{y} = \mathbf{f} + \boldsymbol{\epsilon}$ (with noise $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma_n^2 I)$), the posterior is:

$$\mathbf{f}_* | X, \mathbf{y}, X_* \sim N\left( \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_* \right)$$

where:

$$\boldsymbol{\mu}_* = K(X_*, X) \left[ K(X, X) + \sigma_n^2 I \right]^{-1} \mathbf{y}$$
$$\boldsymbol{\Sigma}_* = K(X_*, X_*) - K(X_*, X) \left[ K(X, X) + \sigma_n^2 I \right]^{-1} K(X, X_*)$$

These are exactly the conditional MVN formulas with zero prior mean: $\mathbf{X}_1 = \mathbf{f}_*$, $\mathbf{X}_2 = \mathbf{y}$, $\boldsymbol{\Sigma}_{11} = K(X_*, X_*)$, $\boldsymbol{\Sigma}_{12} = K(X_*, X)$, and $\boldsymbol{\Sigma}_{22} = K(X, X) + \sigma_n^2 I$.
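
One way to convince yourself of this equivalence is to build the joint covariance of $(\mathbf{y}, \mathbf{f}_*)$ explicitly and pass it to a generic conditioning routine. The sketch below does that with a squared-exponential kernel and toy data; the helper names `rbf` and `condition`, the data, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

def rbf(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * d**2 / length_scale**2)

def condition(mu, Sigma, idx1, idx2, x2):
    """Generic MVN conditioning: block idx1 given block idx2 = x2."""
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    mean = mu[idx1] + S12 @ np.linalg.solve(S22, x2 - mu[idx2])
    cov = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return mean, cov

# Toy data (assumed for illustration)
X = np.array([-2.0, -1.0, 0.0, 1.5])
y = np.sin(X)
X_star = np.array([-0.5, 0.7, 2.5])
noise_var = 0.01
n, m = len(X), len(X_star)

# Joint covariance of (y, f_*): the noise enters only the y block
Sigma = np.block([
    [rbf(X, X) + noise_var * np.eye(n), rbf(X, X_star)],
    [rbf(X_star, X),                    rbf(X_star, X_star)],
])
mu = np.zeros(n + m)
idx_y = np.arange(n)
idx_star = np.arange(n, n + m)

# Route 1: generic conditioning of f_* on y
mean_generic, cov_generic = condition(mu, Sigma, idx_star, idx_y, y)

# Route 2: the GP posterior written in kernel notation
Ky = rbf(X, X) + noise_var * np.eye(n)
mean_gp = rbf(X_star, X) @ np.linalg.solve(Ky, y)
cov_gp = rbf(X_star, X_star) - rbf(X_star, X) @ np.linalg.solve(Ky, rbf(X, X_star))

print("means agree:      ", np.allclose(mean_generic, mean_gp))
print("covariances agree:", np.allclose(cov_generic, cov_gp))
```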

[Interactive figure: Gaussian Process Regression. Axes: x and f(x). Legend: posterior mean, 95% CI ($\pm 2\sigma$), training data. Two sliders control the smoothness and the amplitude of the function. Clicking on the plot adds a training point, and clicking on an existing point removes it; the posterior mean and uncertainty then update using the conditional MVN formulas.]

GP Posterior = Conditional MVN

Given training data $(X, \mathbf{y})$, the posterior at a single test point $x_*$ is:

Posterior Mean:

$$\mu_* = k(x_*, X) [K(X,X) + \sigma_n^2 I]^{-1} \mathbf{y}$$

Posterior Variance:

$$\sigma_*^2 = k(x_*, x_*) - k(x_*, X) [K(X,X) + \sigma_n^2 I]^{-1} k(X, x_*)$$

This is the single-test-point case of the conditional MVN formulas, with the kernel $k(\cdot, \cdot)$ defining the prior covariance structure.


Key Properties of GP Posterior

  • Near training points: Low uncertainty (we've observed data here)
  • Far from training points: High uncertainty (reverting to prior)
  • Mean passes through data: Interpolates exactly (with noise-free observations)
  • Smoothness controlled by kernel: Length scale determines function smoothness

Kalman Filter: Sequential Conditioning

The Kalman filter is perhaps the most celebrated application of conditional Gaussians in engineering. It solves the problem of optimal state estimation in linear dynamical systems with Gaussian noise.

The Problem Setting

Consider a system with hidden state $\mathbf{x}_t$ that evolves according to:

$$\mathbf{x}_{t+1} = F \mathbf{x}_t + \mathbf{w}_t \quad \text{(State transition)}$$
$$\mathbf{z}_t = H \mathbf{x}_t + \mathbf{v}_t \quad \text{(Measurement)}$$

where $\mathbf{w}_t \sim N(\mathbf{0}, Q)$ is process noise and $\mathbf{v}_t \sim N(\mathbf{0}, R)$ is measurement noise.

The Kalman Filter Algorithm

At each time step, the Kalman filter performs two operations:

  1. Predict (Prior): Propagate the state estimate forward in time
    $\hat{\mathbf{x}}_{t|t-1} = F \hat{\mathbf{x}}_{t-1|t-1}$
    $P_{t|t-1} = F P_{t-1|t-1} F^T + Q$
  2. Update (Condition!): Incorporate new measurement via MVN conditioning
    $K_t = P_{t|t-1} H^T (H P_{t|t-1} H^T + R)^{-1} \quad \text{(Kalman gain)}$
    $\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + K_t (\mathbf{z}_t - H \hat{\mathbf{x}}_{t|t-1})$
    $P_{t|t} = (I - K_t H) P_{t|t-1}$
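
The two steps above can be sketched for the simplest possible system: a scalar random-walk position observed through a noisy sensor. The model constants (`F`, `H`, `Q`, `R`), the initial variance, and the number of steps are assumptions chosen for illustration, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Scalar model (assumed for illustration): random-walk position, noisy readings
F, H = 1.0, 1.0
Q, R = 0.01, 1.0
n_steps = 200

# Simulate the hidden state and the measurements
x_true = np.zeros(n_steps)
z = np.zeros(n_steps)
x = 0.0
for t in range(n_steps):
    x = F * x + rng.normal(0.0, np.sqrt(Q))
    x_true[t] = x
    z[t] = H * x + rng.normal(0.0, np.sqrt(R))

# Run the filter
x_hat, P = 0.0, 10.0          # initial estimate and its variance
estimates = np.zeros(n_steps)
variances = np.zeros(n_steps)
for t in range(n_steps):
    # Predict: propagate mean and variance through the dynamics
    x_pred = F * x_hat
    P_pred = F * P * F + Q
    # Update: condition the predicted state on the new measurement
    K = P_pred * H / (H * P_pred * H + R)
    x_hat = x_pred + K * (z[t] - H * x_pred)
    P = (1 - K * H) * P_pred
    estimates[t] = x_hat
    variances[t] = P

rmse_raw = np.sqrt(np.mean((z - x_true) ** 2))
rmse_kf = np.sqrt(np.mean((estimates - x_true) ** 2))
print(f"RMSE of raw measurements: {rmse_raw:.3f}")
print(f"RMSE of Kalman estimates: {rmse_kf:.3f}")
print(f"posterior variance: first step {variances[0]:.3f}, last step {variances[-1]:.4f}")
```

Note that the sequence of posterior variances does not depend on the measurements at all, which mirrors the fact that the conditional covariance of an MVN does not depend on the observed values.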

The Update Step is Conditional MVN

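The Kalman update formulas are exactly the conditional MVN formulas! Under the predicted distribution, the state and the measurement are jointly Gaussian with $\Sigma_{12} = \text{Cov}(\mathbf{x}_t, \mathbf{z}_t) = P H^T$ and $\Sigma_{22} = \text{Cov}(\mathbf{z}_t) = H P H^T + R$, where $P = P_{t|t-1}$. The Kalman gain $K$ is therefore:

$$K = \Sigma_{12} \Sigma_{22}^{-1} = P H^T (H P H^T + R)^{-1}$$

This is the optimal weight for combining the prior prediction with the new measurement.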
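
A single update step can be computed both ways to confirm the correspondence. In this sketch the two-dimensional state (position and velocity), the position-only sensor, and every numerical value are assumptions made for illustration.

```python
import numpy as np

# Illustrative 2-D state (position, velocity) with a position-only sensor
x_pred = np.array([1.0, 0.5])                 # predicted mean x_{t|t-1}
P_pred = np.array([[2.0, 0.3], [0.3, 1.0]])   # predicted covariance P_{t|t-1}
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
z = np.array([1.8])

# Route 1: the Kalman update equations
S = H @ P_pred @ H.T + R                      # innovation covariance
K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
x_kf = x_pred + K @ (z - H @ x_pred)
P_kf = (np.eye(2) - K @ H) @ P_pred

# Route 2: condition the joint Gaussian of (x, z) on the observed z
Sigma_11 = P_pred                 # Cov(x)
Sigma_12 = P_pred @ H.T           # Cov(x, z)
Sigma_22 = S                      # Cov(z)
x_cond = x_pred + Sigma_12 @ np.linalg.solve(Sigma_22, z - H @ x_pred)
P_cond = Sigma_11 - Sigma_12 @ np.linalg.solve(Sigma_22, Sigma_12.T)

print("means agree:      ", np.allclose(x_kf, x_cond))
print("covariances agree:", np.allclose(P_kf, P_cond))
```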

[Interactive figure: Kalman Filter: Conditioning in Action. Axes: time step and position. Legend: true position, Kalman estimate, noisy measurements. Readouts: true position, estimated position, position uncertainty ($\sigma$), estimation error.]



Deep Learning Applications

Conditional MVN is not just classical statistics—it's fundamental to modern deep learning.

1. Variational Autoencoders (VAEs)

VAEs use conditional Gaussians in two places:

  • Encoder: $q(\mathbf{z}|\mathbf{x}) = N(\boldsymbol{\mu}_\phi(\mathbf{x}), \boldsymbol{\sigma}_\phi^2(\mathbf{x}))$, which maps data to a latent distribution
  • Decoder: $p(\mathbf{x}|\mathbf{z}) = N(\boldsymbol{\mu}_\theta(\mathbf{z}), \boldsymbol{\sigma}_\theta^2)$, which generates data from the latent

The reparameterization trick leverages the fact that conditional Gaussians can be sampled as $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim N(\mathbf{0}, I)$.
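
A minimal NumPy sketch of this sampling scheme follows. The encoder outputs `mu` and `log_var` are made-up numbers standing in for what a trained encoder would produce for one input; parameterizing the variance through its logarithm is a common convention assumed here, not something stated in the text.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical encoder outputs for one input x (made-up numbers)
mu = np.array([0.5, -1.0, 2.0])
log_var = np.array([0.0, -2.0, 1.0])
sigma = np.exp(0.5 * log_var)

# Reparameterization: all randomness lives in eps, which is independent of
# mu and sigma, so z is a deterministic function of (mu, sigma) given eps
eps = rng.standard_normal((100_000, mu.size))
z = mu + sigma * eps

print("target mean:", mu,    " sample mean:", z.mean(axis=0).round(3))
print("target std: ", sigma.round(3), " sample std: ", z.std(axis=0).round(3))
```

Because `z` depends on `mu` and `sigma` only through elementwise arithmetic, an automatic differentiation framework can propagate gradients through the sampling step.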

2. Normalizing Flows

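Flow-based models transform a simple base distribution (often Gaussian) through invertible transformations. An affine coupling layer leaves one block $\mathbf{x}_1$ unchanged and applies an elementwise scale and shift, both computed from $\mathbf{x}_1$, to the other block:

$$\mathbf{x}_2 = \mathbf{z}_2 \odot \exp(s(\mathbf{x}_1)) + t(\mathbf{x}_1)$$

When the base variable is $\mathbf{z}_2 \sim N(\mathbf{0}, I)$, this makes $\mathbf{x}_2 | \mathbf{x}_1$ a conditional Gaussian with mean $t(\mathbf{x}_1)$ and covariance $\text{diag}(\exp(2 s(\mathbf{x}_1)))$.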
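
As a rough sketch of a coupling layer, the code below passes one block through unchanged and scales and shifts the other block by functions of the first. The functions `s` and `t` are toy stand-ins for learned networks (hypothetical choices), and the helper names are invented for this example. It checks that the layer inverts exactly and that, for a fixed value of the first block, the second block is Gaussian with mean `t(x1)` and standard deviation `exp(s(x1))`.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy stand-ins for the learned scale and shift networks (hypothetical)
def s(x1):
    return np.tanh(x1)

def t(x1):
    return 0.5 * x1**2

def coupling_forward(z1, z2):
    """Pass block 1 through unchanged; scale and shift block 2 given block 1."""
    x1 = z1
    x2 = z2 * np.exp(s(x1)) + t(x1)
    return x1, x2

def coupling_inverse(x1, x2):
    """Exact inverse: undo the shift, then the scale."""
    z1 = x1
    z2 = (x2 - t(x1)) * np.exp(-s(x1))
    return z1, z2

# Invertibility on random inputs
z1 = rng.standard_normal(50_000)
z2 = rng.standard_normal(50_000)
x1, x2 = coupling_forward(z1, z2)
z1_rec, z2_rec = coupling_inverse(x1, x2)
print("inverse recovers inputs:", np.allclose(z1, z1_rec) and np.allclose(z2, z2_rec))

# Conditional distribution of x2 given a fixed x1 = 0.8
x1_fixed = np.full(50_000, 0.8)
_, x2_given = coupling_forward(x1_fixed, rng.standard_normal(50_000))
print(f"mean of x2 | x1=0.8: {x2_given.mean():.3f}  (t(0.8) = {t(0.8):.3f})")
print(f"std  of x2 | x1=0.8: {x2_given.std():.3f}  (exp(s(0.8)) = {np.exp(s(0.8)):.3f})")
```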

3. Bayesian Neural Networks

In Bayesian deep learning, we place priors on weights and compute posteriors. The approximate posterior is often Gaussian:

$$q(\mathbf{W}) = N(\boldsymbol{\mu}_W, \text{diag}(\boldsymbol{\sigma}_W^2))$$

Predictions involve conditioning on inputs, integrating out weights.

4. Attention Mechanisms

While standard attention isn't explicitly Gaussian, probabilistic attention mechanisms can be interpreted as computing conditional expectations over keys given queries, with Gaussian assumptions on representations.

5. Latent Variable Models

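Many generative models (for example GMMs and HMMs with Gaussian emissions) use conditional Gaussians for observations given latent states. The E-step in EM computes $p(\mathbf{z}|\mathbf{x})$, which for a Gaussian mixture is the vector of responsibilities: the posterior probability of each component, proportional to its mixing weight times its Gaussian density at $\mathbf{x}$.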
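
For a Gaussian mixture, the E-step amounts to a few lines of array arithmetic. The sketch below uses a one-dimensional, two-component mixture whose weights, means, and standard deviations are invented for illustration, and computes the posterior component probabilities for a handful of points.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D mixture with two components (made-up parameters)
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 1.0])
stds = np.array([0.8, 1.2])

x = np.array([-3.0, -0.5, 0.0, 2.0])

# Unnormalized posteriors pi_k * N(x_i | mu_k, sigma_k^2),
# shape (n_points, n_components) via broadcasting
weighted = weights * norm.pdf(x[:, None], loc=means, scale=stds)

# Normalize each row to obtain p(z = k | x_i)
resp = weighted / weighted.sum(axis=1, keepdims=True)

for xi, r in zip(x, resp):
    print(f"x = {xi:5.1f}:  p(z=0|x) = {r[0]:.3f},  p(z=1|x) = {r[1]:.3f}")
```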


Python Implementation

Bivariate Normal Conditioning

```python
# bivariate_conditional.py
import numpy as np

# Bivariate Normal Conditional Distribution
# Given (X, Y) ~ N(mu, Sigma), find distribution of Y | X = x_obs

# Parameters: the marginal means, standard deviations, and correlation
# fully specify the joint distribution
mu_x, mu_y = 0, 0
sigma_x, sigma_y = 1, 1
rho = 0.7  # Correlation coefficient

# Observed value: the "given" part of Y | X = x_obs
x_obs = 1.5

# Conditional distribution formulas
# Y | X = x_obs ~ N(mu_cond, sigma_cond^2)
# The mean shifts from mu_y by rho * (sigma_y / sigma_x) * (x_obs - mu_x),
# which is the linear regression formula. The std dev does not depend on x_obs.
mu_cond = mu_y + rho * (sigma_y / sigma_x) * (x_obs - mu_x)
sigma_cond = sigma_y * np.sqrt(1 - rho**2)

print(f"Conditional Distribution Y | X = {x_obs:.2f}:")
print(f"  Mean: E[Y|X={x_obs}] = {mu_cond:.4f}")
print(f"  Std Dev: sqrt(Var(Y|X)) = {sigma_cond:.4f}")
# rho^2 is the fraction of Y's variance explained by X (R^2 in regression)
print(f"  Variance Reduction: {rho**2 * 100:.1f}%")

# Verify by computing from the joint covariance matrix, ordered as (X, Y)
cov_matrix = np.array([[sigma_x**2, rho * sigma_x * sigma_y],
                       [rho * sigma_x * sigma_y, sigma_y**2]])

# Regression coefficient for Y | X:
# beta = Cov(X, Y) / Var(X) = rho * sigma_y / sigma_x
# (Var(X) is the [0, 0] entry because X is the conditioning variable)
beta = cov_matrix[0, 1] / cov_matrix[0, 0]
print(f"\nVerification: regression coefficient = {beta:.4f}")

# Compare with original Y distribution
print(f"\nOriginal Y ~ N({mu_y}, {sigma_y**2})")
print(f"Conditional Y|X ~ N({mu_cond:.3f}, {sigma_cond**2:.3f})")
print(f"Knowing X reduces Y's variance by factor {1 - rho**2:.3f}")
```

General MVN Conditioning

```python
# mvn_conditional.py
import numpy as np
from scipy.stats import multivariate_normal

# General MVN Conditional Distribution
# Partition X = [X_1, X_2]^T with:
#   X ~ N(mu, Sigma) where mu = [mu_1, mu_2], Sigma = [[S_11, S_12], [S_21, S_22]]
# Then: X_1 | X_2 = x_2 ~ N(mu_bar, Sigma_bar)

def mvn_conditional(mu, Sigma, x_2, indices_1, indices_2):
    """
    Compute conditional MVN distribution.

    Parameters:
    - mu: Full mean vector
    - Sigma: Full covariance matrix
    - x_2: Observed values for X_2
    - indices_1: Indices of the variables to predict (X_1)
    - indices_2: Indices of observed (conditioning) variables (X_2)

    Returns:
    - mu_bar: Conditional mean
    - Sigma_bar: Conditional covariance
    """
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    x_2 = np.asarray(x_2, dtype=float)

    # Extract sub-blocks: Sigma_11 and Sigma_22 are the covariances of X_1 and
    # X_2; Sigma_12 (= Sigma_21^T) is the cross-covariance between the groups
    mu_1, mu_2 = mu[indices_1], mu[indices_2]
    Sigma_11 = Sigma[np.ix_(indices_1, indices_1)]
    Sigma_21 = Sigma[np.ix_(indices_2, indices_1)]
    Sigma_22 = Sigma[np.ix_(indices_2, indices_2)]

    # Regression matrix Sigma_12 @ Sigma_22^{-1}, obtained with a linear solve
    # rather than an explicit inverse
    gain = np.linalg.solve(Sigma_22, Sigma_21).T

    # Conditional mean: a linear function of the observed values
    mu_bar = mu_1 + gain @ (x_2 - mu_2)

    # Conditional covariance: independent of the observed values
    Sigma_bar = Sigma_11 - gain @ Sigma_21

    return mu_bar, Sigma_bar

# Example: 4 correlated variables; observe X_2 and X_3, predict X_1 and X_4
mu = [0, 1, 2, 3]
Sigma = [
    [1.0, 0.5, 0.3, 0.1],
    [0.5, 2.0, 0.7, 0.2],
    [0.3, 0.7, 1.5, 0.4],
    [0.1, 0.2, 0.4, 1.0]
]

# Observe X_2 = 0.5, X_3 = 1.0 (zero-based indices 1 and 2)
x_observed = [0.5, 1.0]
indices_1 = [0, 3]  # Variables to predict (X_1, X_4)
indices_2 = [1, 2]  # Observed variables (X_2, X_3)

mu_cond, Sigma_cond = mvn_conditional(mu, Sigma, x_observed, indices_1, indices_2)

print("Conditional Distribution [X_1, X_4] | [X_2=0.5, X_3=1.0]:")
print(f"  Conditional mean: {mu_cond}")
print(f"  Conditional covariance:\n{Sigma_cond}")

# Empirical verification: sample from the full distribution and keep samples
# whose X_2 and X_3 fall near the observed values. Many draws are needed
# because only a small fraction survive the filter; the match is approximate.
full_dist = multivariate_normal(mean=mu, cov=Sigma)
samples = full_dist.rvs(size=200_000, random_state=0)

mask = (np.abs(samples[:, 1] - 0.5) < 0.1) & (np.abs(samples[:, 2] - 1.0) < 0.1)
filtered = samples[mask][:, [0, 3]]

print(f"\nEmpirical (from {len(filtered)} filtered samples):")
print(f"  Mean: {np.mean(filtered, axis=0)}")
print(f"  Covariance:\n{np.cov(filtered.T)}")
```

Gaussian Process Implementation

```python
# gaussian_process.py
import numpy as np

# Gaussian Process = Infinite-dimensional MVN
# Any finite subset of function values is multivariate normal, and the
# GP posterior is just conditional MVN

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """RBF (Squared Exponential) kernel: nearby inputs are highly correlated."""
    return variance * np.exp(-0.5 * (x1 - x2)**2 / length_scale**2)

def gp_posterior(X_train, y_train, X_test,
                 length_scale=1.0, variance=1.0, noise=0.1):
    """
    Compute GP posterior using conditional MVN formulas.

    Parameters:
    - X_train: Training inputs (n,)
    - y_train: Training outputs (n,)
    - X_test: Test inputs (m,)
    - length_scale, variance: Kernel hyperparameters
    - noise: Observation noise variance

    Returns:
    - mu_star: Posterior mean at test points
    - cov_star: Posterior covariance at test points
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)

    def kernel_matrix(a, b):
        # Broadcasting evaluates the kernel on every pair (a_i, b_j)
        return rbf_kernel(a[:, None], b[None, :], length_scale, variance)

    # Build covariance matrices:
    #   K           train-train plus noise on the diagonal (Sigma_22)
    #   K_star      test-train (Sigma_12)
    #   K_star_star test-test (Sigma_11)
    K = kernel_matrix(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = kernel_matrix(X_test, X_train)
    K_star_star = kernel_matrix(X_test, X_test)

    # Conditional MVN formulas!
    # mu_bar = mu_1 + Sigma_12 @ Sigma_22^{-1} @ (x_2 - mu_2),
    # with zero prior mean and x_2 = y_train

    # Posterior mean: K_* @ K^{-1} @ y (linear solve instead of explicit inverse)
    mu_star = K_star @ np.linalg.solve(K, y_train)

    # Posterior covariance: K_** - K_* @ K^{-1} @ K_*^T
    cov_star = K_star_star - K_star @ np.linalg.solve(K, K_star.T)

    return mu_star, cov_star

# Example: Fit GP to noisy observations
np.random.seed(42)
X_train = np.array([-2, -1, 0, 1, 2.5])
y_train = np.sin(X_train) + 0.1 * np.random.randn(len(X_train))

X_test = np.linspace(-3, 3, 100)
mu, cov = gp_posterior(X_train, y_train, X_test)

# The diagonal of cov gives the variance at each test point: low near the
# training data, high far from it. Clip tiny negative round-off before sqrt.
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

print("GP Posterior (at selected test points):")
for i in [0, 25, 50, 75, 99]:
    print(f"  x={X_test[i]:.2f}: mean={mu[i]:.3f}, std={std[i]:.3f}")

print("\nKey insight: The GP posterior formulas are EXACTLY")
print("the conditional MVN formulas!")
```

Common Pitfalls and Misconceptions

Pitfall 1: Confusing Conditional with Marginal

Wrong: "The conditional $Y|X$ has the same variance as the marginal $Y$."

Correct: $\text{Var}(Y|X) = \sigma_Y^2(1-\rho^2) \leq \sigma_Y^2 = \text{Var}(Y)$. Conditioning always reduces (or maintains) variance.

Pitfall 2: Assuming Conditional Variance Depends on Observed Value

Wrong: "If we observe $X=2$ vs $X=0$, the conditional variance of $Y$ is different."

Correct: For MVN, $\text{Var}(Y|X=x)$ is the same for all $x$. Only the conditional mean shifts.

Pitfall 3: Forgetting the Schur Complement Structure

When computing $\bar{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$, ensure:

  • $\boldsymbol{\Sigma}_{22}$ is invertible (positive definite)
  • $\boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21}^T$ (symmetry)
  • The result is positive semi-definite

Pitfall 4: Applying MVN Formulas to Non-Gaussian Data

The beautiful closed-form conditional formulas only apply to Gaussian distributions. For non-Gaussian data, conditionals can have complex, non-linear shapes.

Pitfall 5: Numerical Instability in Matrix Inversion

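When $\boldsymbol{\Sigma}_{22}$ is large or ill-conditioned, forming its inverse explicitly is slow and can be numerically unstable. Use a Cholesky decomposition instead:

$\boldsymbol{\Sigma}_{22} = LL^T$, then solve $L \mathbf{z} = (\mathbf{x}_2 - \boldsymbol{\mu}_2)$ via forward substitution and $L^T \mathbf{w} = \mathbf{z}$ via back substitution. This gives $\mathbf{w} = \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2)$ without ever forming the inverse.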
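
A Cholesky-based version of the conditioning formulas might look like the sketch below, which relies on SciPy's `cho_factor` and `cho_solve`. The function name `mvn_conditional_chol` and its argument layout are choices made for this example; the covariance matrix and observed values are the ones from the 4D example in the Python Implementation section, and the result is compared against the explicit-inverse computation.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def mvn_conditional_chol(mu_1, mu_2, S11, S12, S22, x_2):
    """Conditional mean and covariance of X_1 given X_2 = x_2,
    computed without forming S22^{-1} explicitly."""
    factor = cho_factor(S22, lower=True)       # S22 = L L^T
    alpha = cho_solve(factor, x_2 - mu_2)      # S22^{-1} (x_2 - mu_2)
    V = cho_solve(factor, S12.T)               # S22^{-1} S21
    mu_bar = mu_1 + S12 @ alpha
    Sigma_bar = S11 - S12 @ V
    return mu_bar, Sigma_bar

# Same 4D example as in the implementation section
mu = np.array([0.0, 1.0, 2.0, 3.0])
Sigma = np.array([
    [1.0, 0.5, 0.3, 0.1],
    [0.5, 2.0, 0.7, 0.2],
    [0.3, 0.7, 1.5, 0.4],
    [0.1, 0.2, 0.4, 1.0],
])
idx1, idx2 = [0, 3], [1, 2]
x_2 = np.array([0.5, 1.0])

S11 = Sigma[np.ix_(idx1, idx1)]
S12 = Sigma[np.ix_(idx1, idx2)]
S22 = Sigma[np.ix_(idx2, idx2)]

mu_bar, Sigma_bar = mvn_conditional_chol(mu[idx1], mu[idx2], S11, S12, S22, x_2)

# Reference computation with an explicit inverse, for comparison only
S22_inv = np.linalg.inv(S22)
mu_ref = mu[idx1] + S12 @ S22_inv @ (x_2 - mu[idx2])
Sigma_ref = S11 - S12 @ S22_inv @ S12.T

print("conditional mean:", mu_bar)
print("conditional covariance:\n", Sigma_bar)
print("agrees with explicit inverse:",
      np.allclose(mu_bar, mu_ref) and np.allclose(Sigma_bar, Sigma_ref))
```

On a small, well-conditioned matrix like this one the two routes agree to machine precision; the benefit of the factorization shows up as $\boldsymbol{\Sigma}_{22}$ grows or becomes nearly singular.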


Summary: The Power of Gaussian Conditioning

You've now mastered one of the most powerful tools in probability and statistics. Let's recap:

The Key Formulas

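| Quantity | Bivariate | General MVN |
|---|---|---|
| Conditional mean | $\mu_Y + \rho (\sigma_Y / \sigma_X)(x - \mu_X)$ | $\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2)$ |
| Conditional covariance | $\sigma_Y^2 (1 - \rho^2)$ | $\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$ |
| Variance reduction | $\rho^2$ (as a fraction of $\sigma_Y^2$) | $\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$ (relative to $\boldsymbol{\Sigma}_{11}$) |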

Key Properties

  1. Closure: Conditioning one block of a jointly Gaussian vector on another block yields a Gaussian
  2. Linear mean: The conditional mean is a linear function of observed values
  3. Constant variance: Conditional covariance doesn't depend on observed values
  4. Variance reduction: Conditioning never increases variance (information gain)

Applications Mastered

  • Linear Regression: The regression line is $E[Y|X]$, and $R^2 = \rho^2$
  • Gaussian Processes: GP posterior = conditional MVN in function space
  • Kalman Filter: State estimation = sequential Gaussian conditioning
  • Deep Learning: VAEs, flows, Bayesian NNs all use conditional Gaussians

The Ultimate Insight

The conditional MVN formulas are not just mathematical curiosities—they are the computational backbone of modern probabilistic machine learning. Whenever you see:

  • A posterior that's Gaussian
  • An update step involving Kalman-like gains
  • A regression coefficient matrix
  • Uncertainty that shrinks near observations

You're seeing the conditional MVN formulas in action. Master this section, and you've unlocked a vast landscape of probabilistic methods.
