Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

Conditional distributions of the multivariate normal (MVN) are among the most powerful and widely used results in statistics and machine learning. By the end of this section, you will:

Master the formulas for conditional mean and conditional covariance of the multivariate normal distribution, understanding every term intuitively
Visualize how conditioning on observed variables "slices" the joint distribution, producing a new Gaussian with shifted mean and reduced variance
Derive the deep connection between conditional MVN and linear regression, showing that regression coefficients emerge naturally from the covariance structure
Understand how Gaussian Processes use conditional MVN for function-space inference, enabling uncertainty-aware predictions
Recognize the Kalman filter as sequential conditioning on observations, making state estimation principled
Apply these concepts to modern deep learning: variational inference, latent variable models, and Bayesian neural networks
Implement conditional MVN computations efficiently in Python using matrix operations

Why This is Foundational

The conditional MVN formulas are not just theoretical curiosities—they are the computational engine behind:

Gaussian Process regression and Bayesian optimization
Kalman filters for robotics, navigation, and time series
Linear and multivariate regression
Variational Autoencoders (VAEs) and flow-based models
Bayesian linear regression and uncertainty quantification

Master this section, and you'll have the key to understanding a vast landscape of probabilistic methods.

The Big Picture: What Happens When We Condition

"Conditioning is the soul of statistics." — Joe Blitzstein

Consider a multivariate normal distribution over several variables. When we observe (condition on) some of these variables, what happens to the distribution of the remaining variables?

The remarkable answer for the multivariate normal is:

The MVN Conditioning Miracle

The conditional distribution is still Gaussian—conditioning preserves normality
The conditional mean is a linear function of the observed values
The conditional covariance is constant—it doesn't depend on the specific observed values, only on which variables were observed
The conditional variance is never increased: it shrinks whenever the observed variables are correlated with the remaining ones, and is unchanged if they are independent

This is extraordinary! For most distributions, conditioning leads to complex, non-standard forms. But for the MVN, we get clean, closed-form formulas that are computationally tractable.

The Geometric Intuition

Imagine a 2D Gaussian as a "probability cloud" shaped like an ellipse. When we condition on $X = x_0$ :

We "slice" the cloud with a vertical plane at $x = x_0$
The cross-section is another Gaussian (1D in this case)
If the original ellipse was tilted (correlated), the slice is centered away from the marginal mean of Y
The slice is never wider than the full marginal distribution of Y, and it is strictly narrower whenever $\rho \neq 0$
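The slicing picture can be checked numerically. The sketch below assumes standard marginals and illustrative values $\rho = 0.7$ and $x_0 = 1.0$ (chosen here, not taken from the text). It evaluates the joint density along the line $x = x_0$, renormalizes it, and compares the mean and standard deviation of the slice with the closed-form conditional values.

```python
import numpy as np

# Illustrative parameters (assumptions for this sketch)
rho, x0 = 0.7, 1.0
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def joint_density(x, y):
    """Bivariate normal density at (x, y), where x is a scalar and y an array."""
    d = np.stack([np.full_like(y, x) - mu[0], y - mu[1]], axis=-1)
    quad = np.einsum("...i,ij,...j->...", d, Sigma_inv, d)
    norm_const = 2 * np.pi * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Slice the joint density along x = x0 and renormalize so it integrates to 1
y = np.linspace(-8.0, 8.0, 4001)
dy = y[1] - y[0]
slice_vals = joint_density(x0, y)
slice_pdf = slice_vals / (slice_vals.sum() * dy)

# Moments of the slice, computed by a simple Riemann sum
slice_mean = (y * slice_pdf).sum() * dy
slice_std = np.sqrt(((y - slice_mean) ** 2 * slice_pdf).sum() * dy)

print(f"slice mean = {slice_mean:.4f}  (formula: {rho * x0:.4f})")
print(f"slice std  = {slice_std:.4f}  (formula: {np.sqrt(1 - rho**2):.4f})")
```

The renormalization step is what turns a cross-section of the joint density into a proper conditional density.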

Historical Context: From Gauss to Machine Learning

The properties of conditional Gaussian distributions have a rich history spanning over 200 years.

Carl Friedrich Gauss (1777-1855)

Gauss developed the method of least squares for astronomical observations, implicitly using properties of conditional normal distributions. His work on error analysis laid the foundation for understanding how correlated measurements could be optimally combined.

Francis Galton (1822-1911)

Galton discovered regression toward the mean while studying heredity. He noticed that tall parents tended to have children closer to average height. This is precisely the behavior of the conditional mean in a bivariate normal:

E[Y|X=x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)

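When $|\rho| < 1$ , the conditional mean is "regressed" toward $\mu_Y$ : an $X$ that lies $k$ standard deviations from its mean predicts a $Y$ that lies only $\rho k$ standard deviations from its mean, smaller in magnitude than the naive prediction of $k$ .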

Karl Pearson (1857-1936)

Pearson formalized the correlation coefficient and developed the full theory of the bivariate normal distribution, including the explicit formulas for conditional distributions that we use today.

Rudolf Kálmán (1930-2016)

Kálmán revolutionized control theory by showing that optimal state estimation in linear dynamical systems is achieved by sequential Gaussian conditioning. The Kalman filter, published in 1960, uses the MVN conditional formulas at every time step.

Modern Relevance

Today, conditional MVN is the mathematical foundation for Gaussian Processes, which power Bayesian optimization (used to tune hyperparameters in deep learning), geostatistics, and uncertainty quantification in neural networks.

Bivariate Normal Conditioning: The Foundation

Let's start with the simplest case: two correlated normal random variables $(X, Y)$ with joint distribution:

\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix} \right)

where $\rho \in [-1, 1]$ is the correlation coefficient.

The Conditional Distribution

Given that we observe $X = x$ , the conditional distribution of $Y$ is:

Y | X = x \sim N\left( \mu_{Y|X}, \sigma_{Y|X}^2 \right)

where the conditional parameters are:

Conditional Mean

\mu_{Y|X} = E[Y|X=x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)

Let's understand each term:

$\mu_Y$ : The starting point—the unconditional mean of $Y$
$(x - \mu_X)$ : How far the observed $X$ is from its mean (in raw units)
$(x - \mu_X)/\sigma_X$ : How far $X$ is from its mean (in standard deviations)
$\rho$ : The correlation—how much $Y$ should "follow" $X$
$\rho \sigma_Y / \sigma_X$ : The regression coefficient—how many units $Y$ shifts per unit increase in $X$

Conditional Variance

\sigma_{Y|X}^2 = \text{Var}(Y|X) = \sigma_Y^2 (1 - \rho^2)

This formula encapsulates a profound insight:

The conditional variance does not depend on x—the uncertainty reduction is the same regardless of which $X$ value we observe
The factor $(1 - \rho^2)$ is always between 0 and 1, so conditioning never increases the variance
When $\rho = 0$ : $\sigma_{Y|X}^2 = \sigma_Y^2$ (no reduction— $X$ provides no information)
When $|\rho| = 1$ : $\sigma_{Y|X}^2 = 0$ (perfect prediction— $Y$ is deterministic given $X$ )
$\rho^2$ is the fraction of variance explained—this is exactly $R^2$ in linear regression!

The Key Insight

The conditional variance formula $\sigma_Y^2(1 - \rho^2)$ tells us:

$\rho^2$ = fraction of Y's variance explained by knowing X
$1 - \rho^2$ = fraction of Y's variance remaining after knowing X

If $\rho = 0.7$ , then $\rho^2 = 0.49$ , meaning knowing $X$ explains 49% of Y's variance!
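One way to see where the factor $(1 - \rho^2)$ comes from is to split $Y$ into its best linear prediction from $X$ plus a residual. The short derivation below uses only the definitions above; the symbols $\beta$ and $\varepsilon$ are introduced here for illustration.

```latex
\begin{aligned}
Y &= \mu_Y + \beta (X - \mu_X) + \varepsilon, \qquad \beta = \rho \frac{\sigma_Y}{\sigma_X}, \\
\text{Cov}(\varepsilon, X) &= \text{Cov}(Y, X) - \beta \, \text{Var}(X)
  = \rho \sigma_X \sigma_Y - \rho \frac{\sigma_Y}{\sigma_X} \sigma_X^2 = 0, \\
\text{Var}(Y) &= \beta^2 \, \text{Var}(X) + \text{Var}(\varepsilon)
  \;\Longrightarrow\;
  \text{Var}(\varepsilon) = \sigma_Y^2 - \rho^2 \sigma_Y^2 = \sigma_Y^2 (1 - \rho^2).
\end{aligned}
```

Because $X$ and $\varepsilon$ are jointly Gaussian and uncorrelated, they are independent, so the spread of $\varepsilon$ (and hence the conditional variance of $Y$) is the same for every observed $x$.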

Interactive Exploration: See Conditioning in Action

The visualization below shows a bivariate normal distribution. Adjust the correlation and the observed X value to see how the conditional distribution of Y changes.

[Interactive figure: Conditional Distribution of Bivariate Normal. Controls: correlation $\rho$ (default 0.70) and observed X value (default 1.00). Readouts at these defaults: conditional mean $E[Y|X=1.0] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X) = 0.700$ , conditional standard deviation $\sigma_{Y|X} = \sigma_Y \sqrt{1 - \rho^2} = 0.714$ , and variance reduction $\rho^2 = 49.0\%$ (fraction of variance explained).]

Key Insight: When we condition on $X = x$ , the conditional distribution $Y|X$ is still normal, but with:

A shifted mean that depends linearly on the observed value of X
A reduced variance by factor $(1 - \rho^2)$ , independent of which X value we observe
Higher $|\rho|$ means knowing X gives more information about Y

What to Observe

As |ρ| increases: The conditional distribution becomes narrower (less uncertainty)
As X moves away from 0: The conditional mean shifts in the direction of correlation
The green curve (marginal): Never changes—it's the unconditional distribution of Y
The red curve (conditional): Shifts position and narrows based on ρ and observed X

General Multivariate Normal Conditioning

Now let's extend to the general case with $n$ variables. We partition the random vector into two groups:

\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix} \right)

where:

$\mathbf{X}_1$ : The variables we want to predict (dimension $p$ )
$\mathbf{X}_2$ : The observed (conditioning) variables (dimension $q$ )
$\boldsymbol{\Sigma}_{11}$ : Covariance of $\mathbf{X}_1$ ( $p \times p$ )
$\boldsymbol{\Sigma}_{22}$ : Covariance of $\mathbf{X}_2$ ( $q \times q$ )
$\boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21}^T$ : Cross-covariance ( $p \times q$ )

The General Conditional Formulas

Given observation $\mathbf{X}_2 = \mathbf{x}_2$ , the conditional distribution of $\mathbf{X}_1$ is:

\mathbf{X}_1 | \mathbf{X}_2 = \mathbf{x}_2 \sim N\left( \bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}} \right)

where:

\bar{\boldsymbol{\mu}} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2)

\bar{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}

Understanding the Matrix Formulas

Let's interpret each term:

Conditional Mean

\bar{\boldsymbol{\mu}} = \boldsymbol{\mu}_1 + \underbrace{\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}}_{\text{Regression matrix}} \underbrace{(\mathbf{x}_2 - \boldsymbol{\mu}_2)}_{\text{Deviation from mean}}

$\boldsymbol{\mu}_1$ : Start from the unconditional mean of $\mathbf{X}_1$
$\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}$ : The multivariate regression coefficient matrix ( $p \times q$ )
$(\mathbf{x}_2 - \boldsymbol{\mu}_2)$ : How far the observations deviate from their mean
The mean shifts linearly with the observed values

Conditional Covariance

\bar{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}_{11} - \underbrace{\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}}_{\text{Variance reduction}}

$\boldsymbol{\Sigma}_{11}$ : Start from the unconditional covariance of $\mathbf{X}_1$
$\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$ : Variance reduction from knowing $\mathbf{X}_2$
This is a positive semi-definite matrix, so $\bar{\boldsymbol{\Sigma}} \preceq \boldsymbol{\Sigma}_{11}$
The result is independent of the specific observed values $\mathbf{x}_2$

Schur Complement

The conditional covariance $\bar{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$ is called the Schur complement of $\boldsymbol{\Sigma}_{22}$ in $\boldsymbol{\Sigma}$ . It appears throughout linear algebra and statistics.
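A useful cross-check, sketched below with an arbitrary 3×3 covariance chosen for illustration, is that the Schur complement equals the inverse of the corresponding block of the precision matrix $\boldsymbol{\Sigma}^{-1}$. Both routes should give the same conditional covariance, and the subtracted term should be positive semi-definite.

```python
import numpy as np

# Arbitrary symmetric positive definite covariance (illustrative values)
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.4],
                  [0.3, 0.4, 1.5]])
idx1 = [0]       # X_1: variable to predict
idx2 = [1, 2]    # X_2: observed variables

S11 = Sigma[np.ix_(idx1, idx1)]
S12 = Sigma[np.ix_(idx1, idx2)]
S21 = Sigma[np.ix_(idx2, idx1)]
S22 = Sigma[np.ix_(idx2, idx2)]

# Route 1: Schur complement of Sigma_22 in Sigma
schur = S11 - S12 @ np.linalg.solve(S22, S21)

# Route 2: invert the (1,1) block of the precision matrix
precision = np.linalg.inv(Sigma)
from_precision = np.linalg.inv(precision[np.ix_(idx1, idx1)])

# The variance-reduction term Sigma_11 - Sigma_bar should be PSD
reduction_is_psd = bool(np.all(np.linalg.eigvalsh(S11 - schur) >= -1e-12))

print("Schur complement:     ", schur)
print("From precision matrix:", from_precision)
print("Reduction term is PSD:", reduction_is_psd)
```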

Connection to Linear Regression

One of the most beautiful results in statistics is that linear regression emerges naturally from the conditional MVN.

The Deep Connection

Consider the bivariate normal $(X, Y)$ . The conditional expectation is:

E[Y|X=x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X) = \underbrace{\left(\mu_Y - \rho \frac{\sigma_Y}{\sigma_X} \mu_X\right)}_{\beta_0} + \underbrace{\rho \frac{\sigma_Y}{\sigma_X}}_{\beta_1} x

This is exactly the linear regression line $E[Y|X] = \beta_0 + \beta_1 X$ !

Quantity	Formula	Meaning
Slope	β₁ = ρ σ_Y / σ_X	Change in E[Y] per unit increase in X
Intercept	β₀ = μ_Y - β₁ μ_X	E[Y] when X = 0
Residual Variance	σ²_{Y|X} = σ²_Y(1 - ρ²)	Unexplained variance
R²	ρ²	Fraction of variance explained
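The rows of this table can be checked by simulation. The sketch below draws samples from a bivariate normal with illustrative parameters (chosen here, not from the text), fits ordinary least squares with `np.polyfit`, and compares the fitted slope, intercept, residual variance, and $R^2$ with the population formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population parameters (assumptions for this sketch)
mu_x, mu_y = 1.0, 2.0
sigma_x, sigma_y = 2.0, 3.0
rho = 0.6

cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=200_000).T

# Ordinary least squares fit of y on x (np.polyfit returns [slope, intercept])
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
residuals = y - (intercept_hat + slope_hat * x)
resid_var_hat = residuals.var()
r2_hat = 1 - resid_var_hat / y.var()

# Population values from the conditional MVN formulas
slope = rho * sigma_y / sigma_x
intercept = mu_y - slope * mu_x
resid_var = sigma_y**2 * (1 - rho**2)

print(f"slope:        OLS {slope_hat:.3f}  vs formula {slope:.3f}")
print(f"intercept:    OLS {intercept_hat:.3f}  vs formula {intercept:.3f}")
print(f"residual var: OLS {resid_var_hat:.3f}  vs formula {resid_var:.3f}")
print(f"R^2:          OLS {r2_hat:.3f}  vs formula {rho**2:.3f}")
```

The estimates agree with the formulas only up to sampling error, which shrinks as the sample size grows.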

Interactive Linear Regression Demo

The visualization below shows data from a bivariate normal and the regression line. The conditional distribution at any X value is shown as an orange curve.

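[Interactive figure: Linear Regression as Conditional MVN. Controls: correlation $\rho$ (default 0.80), number of points (default 30), and query X (default 1.50). Readouts at these defaults: regression slope $\beta_1 = \rho \frac{\sigma_Y}{\sigma_X} = 0.800$ , conditional mean at X = 1.5 of $E[Y|X] = 1.200$ , and residual standard deviation $\sigma_{Y|X} = \sigma_Y\sqrt{1-\rho^2} = 0.600$ .]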

The demo makes the same point as the derivation above: the regression coefficient $\beta_1 = \rho \sigma_Y / \sigma_X$ comes directly from the conditional MVN formula, and $R^2 = \rho^2$ is the fraction of variance explained (64% at $\rho = 0.80$ ).

Key Takeaway

Linear regression is not just a curve-fitting technique: for jointly Gaussian $(X, Y)$ , the regression line is the conditional expectation $E[Y|X]$ . Under Gaussian noise, the OLS estimator coincides with the maximum likelihood estimate, that is, the parameters that make the observed data most likely under this model.

Gaussian Processes: MVN Conditioning in Function Space

"A Gaussian Process is a distribution over functions, where any finite collection of function values is jointly Gaussian."

Gaussian Processes (GPs) are one of the most powerful applications of conditional MVN. The key insight: GP regression is just MVN conditioning in infinite dimensions.

How GPs Work

A GP defines a prior distribution over functions $f(x)$ . When we observe training data $(X, \mathbf{y})$ , we condition on these observations to get a posterior distribution over functions.

At any finite set of test points $X_*$ , the joint distribution of function values is:

\begin{pmatrix} \mathbf{f} \\ \mathbf{f}_* \end{pmatrix} \sim N\left( \mathbf{0}, \begin{pmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix} \right)

where $K(\cdot, \cdot)$ is the kernel function that defines the prior covariance structure.

The GP Posterior (Conditional MVN!)

Given observations $\mathbf{y} = \mathbf{f} + \boldsymbol{\epsilon}$ (with noise $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma_n^2 I)$ ), the posterior is:

\mathbf{f}_* | X, \mathbf{y}, X_* \sim N\left( \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_* \right)

where:

\boldsymbol{\mu}_* = K(X_*, X) \left[ K(X, X) + \sigma_n^2 I \right]^{-1} \mathbf{y}

\boldsymbol{\Sigma}_* = K(X_*, X_*) - K(X_*, X) \left[ K(X, X) + \sigma_n^2 I \right]^{-1} K(X, X_*)

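These are exactly the conditional MVN formulas with zero prior mean: $\boldsymbol{\Sigma}_{11} = K(X_*, X_*)$ , $\boldsymbol{\Sigma}_{12} = K(X_*, X)$ , and $\boldsymbol{\Sigma}_{22} = K(X, X) + \sigma_n^2 I$ (the covariance of the noisy observations $\mathbf{y}$ ).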

[Interactive figure: Gaussian Process Regression. Controls: length scale $\ell$ (default 1.00, controls the smoothness of the function) and signal variance $\sigma_f^2$ (default 1.00, controls the amplitude of the function). Clicking anywhere on the plot adds a training point; clicking an existing point removes it. The plot shows the posterior mean, a $\pm 2\sigma$ band, and the training data. Readouts: number of training points, average posterior standard deviation (0.530 in the captured state), and uncertainty reduction (47% in the captured state).]

GP Posterior = Conditional MVN

Given training data $(X, \mathbf{y})$ , the posterior at a single test point $x_*$ is:

Posterior Mean:

\mu_* = k(x_*, X) [K(X,X) + \sigma_n^2 I]^{-1} \mathbf{y}

Posterior Variance:

\sigma_*^2 = k(x_*, x_*) - k(x_*, X) [K(X,X) + \sigma_n^2 I]^{-1} k(X, x_*)

These are the conditional MVN formulas for a single predicted variable. The kernel $k(\cdot, \cdot)$ defines the prior covariance structure.

Key Properties of GP Posterior

Near training points: Low uncertainty (we've observed data here)
Far from training points: High uncertainty (reverting to prior)
Mean passes through data: Interpolates exactly (with noise-free observations)
Smoothness controlled by kernel: Length scale determines function smoothness
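The first three properties can be seen in a few lines. The sketch below is a minimal illustration, assuming an RBF kernel with unit length scale and a tiny jitter term standing in for noise-free observations (both choices are assumptions made here). It evaluates the posterior at two training locations and at one point far from all data.

```python
import numpy as np

def rbf(a, b, length_scale=1.0, variance=1.0):
    """RBF kernel matrix between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

X_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(X_train)
# Test inputs: two training locations and one point far from all data
X_test = np.array([-1.0, 0.0, 10.0])

jitter = 1e-10  # tiny diagonal term for numerical stability (near noise-free)
K = rbf(X_train, X_train) + jitter * np.eye(len(X_train))
K_s = rbf(X_test, X_train)
K_ss = rbf(X_test, X_test)

# Conditional MVN formulas with zero prior mean
mean = K_s @ np.linalg.solve(K, y_train)
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # clip tiny negative round-off

for x, m, s in zip(X_test, mean, std):
    print(f"x = {x:5.1f}: mean = {m: .4f}, std = {s:.4f}")
```

At the training locations the mean reproduces the observed values with near-zero standard deviation, while at the distant point the posterior falls back to the prior mean of 0 and prior standard deviation of 1.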

Kalman Filter: Sequential Conditioning

The Kalman filter is perhaps the most celebrated application of conditional Gaussians in engineering. It solves the problem of optimal state estimation in linear dynamical systems with Gaussian noise.

The Problem Setting

Consider a system with hidden state $\mathbf{x}_t$ that evolves according to:

\mathbf{x}_{t+1} = F \mathbf{x}_t + \mathbf{w}_t \quad \text{(State transition)}

\mathbf{z}_t = H \mathbf{x}_t + \mathbf{v}_t \quad \text{(Measurement)}

where $\mathbf{w}_t \sim N(\mathbf{0}, Q)$ is process noise and $\mathbf{v}_t \sim N(\mathbf{0}, R)$ is measurement noise.

The Kalman Filter Algorithm

At each time step, the Kalman filter performs two operations:

Predict (Prior): Propagate the state estimate forward in time
$\hat{\mathbf{x}}_{t|t-1} = F \hat{\mathbf{x}}_{t-1|t-1}$
$P_{t|t-1} = F P_{t-1|t-1} F^T + Q$
Update (Condition!): Incorporate new measurement via MVN conditioning
$K_t = P_{t|t-1} H^T (H P_{t|t-1} H^T + R)^{-1} \quad \text{(Kalman gain)}$
$\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + K_t (\mathbf{z}_t - H \hat{\mathbf{x}}_{t|t-1})$
$P_{t|t} = (I - K_t H) P_{t|t-1}$
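The predict and update steps above can be sketched for a scalar state. The example below assumes a random-walk model with $F = H = 1$ and illustrative noise levels and initial variance (all values chosen here for demonstration); it simulates the system and runs the filter alongside it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar random-walk model (illustrative values)
F, H = 1.0, 1.0      # state transition and measurement models
Q, R = 0.1, 1.0      # process and measurement noise variances

x_true = 0.0
x_est, P = 0.0, 10.0  # initial estimate and its variance

for t in range(50):
    # Simulate the true system and a noisy measurement
    x_true = F * x_true + rng.normal(0.0, np.sqrt(Q))
    z = H * x_true + rng.normal(0.0, np.sqrt(R))

    # Predict: propagate the estimate and its variance forward
    x_pred = F * x_est
    P_pred = F * P * F + Q

    # Update: condition the predicted Gaussian on the measurement z
    K = P_pred * H / (H * P_pred * H + R)
    x_est = x_pred + K * (z - H * x_pred)
    P = (1.0 - K * H) * P_pred

print(f"final estimate = {x_est:.3f}, true state = {x_true:.3f}")
print(f"final variance P = {P:.4f}, final gain K = {K:.4f}")
```

Note that the variance $P$ and the gain $K$ evolve independently of the measurements themselves, mirroring the fact that the conditional covariance does not depend on the observed values.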

The Update Step is Conditional MVN

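The Kalman update formulas are exactly the conditional MVN formulas! Given the prediction, the state and the measurement are jointly Gaussian with $\text{Cov}(\mathbf{x}_t, \mathbf{z}_t) = P H^T$ and $\text{Cov}(\mathbf{z}_t) = H P H^T + R$ (writing $P$ for $P_{t|t-1}$ ), so the Kalman gain $K$ is: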

K = \Sigma_{12} \Sigma_{22}^{-1} = P H^T (H P H^T + R)^{-1}

This is the optimal weight for combining prior prediction with new measurement.
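This equivalence is easy to verify numerically. The sketch below uses an illustrative two-dimensional state with a position-only measurement (all numbers are made up for the check). It computes the update once with the Kalman equations and once by conditioning the joint Gaussian of state and measurement, using $\Sigma_{12} = P H^T$ and $\Sigma_{22} = H P H^T + R$.

```python
import numpy as np

# Illustrative 2-D state (position, velocity) with a position-only measurement
x_pred = np.array([1.0, 0.5])             # predicted state mean
P_pred = np.array([[2.0, 0.3],
                   [0.3, 1.0]])           # predicted state covariance
H = np.array([[1.0, 0.0]])                # measurement matrix
R = np.array([[0.5]])                     # measurement noise covariance
z = np.array([2.2])                       # observed measurement

# Kalman update
S = H @ P_pred @ H.T + R
K = P_pred @ H.T @ np.linalg.inv(S)
x_kf = x_pred + K @ (z - H @ x_pred)
P_kf = (np.eye(2) - K @ H) @ P_pred

# Same result by conditioning the joint Gaussian of (x, z) on z
Sigma_12 = P_pred @ H.T                   # Cov(x, z)
Sigma_22 = S                              # Cov(z, z)
x_cond = x_pred + Sigma_12 @ np.linalg.solve(Sigma_22, z - H @ x_pred)
P_cond = P_pred - Sigma_12 @ np.linalg.solve(Sigma_22, Sigma_12.T)

print("means agree:      ", np.allclose(x_kf, x_cond))
print("covariances agree:", np.allclose(P_kf, P_cond))
```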

[Interactive figure: Kalman Filter: Conditioning in Action. Controls: process noise Q (default 0.10) and measurement noise R (default 1.00). Readouts in the initial state: true position 0.00, estimated position 0.00, position uncertainty $\sigma$ = 3.162, and estimation error 0.00.]


Deep Learning Applications

Conditional MVN is not just classical statistics—it's fundamental to modern deep learning.

1. Variational Autoencoders (VAEs)

VAEs use conditional Gaussians in two places:

Encoder: $q(\mathbf{z}|\mathbf{x}) = N(\boldsymbol{\mu}_\phi(\mathbf{x}), \boldsymbol{\sigma}_\phi^2(\mathbf{x}))$ — maps data to latent distribution
Decoder: $p(\mathbf{x}|\mathbf{z}) = N(\boldsymbol{\mu}_\theta(\mathbf{z}), \boldsymbol{\sigma}_\theta^2)$ — generates data from latent

The reparameterization trick leverages the fact that conditional Gaussians can be sampled as $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim N(\mathbf{0}, I)$ .
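A minimal NumPy sketch of the sampling step follows. The mean and log-variance values are hypothetical stand-ins for encoder outputs, and NumPy has no automatic differentiation, so this only illustrates how the randomness is isolated in $\boldsymbol{\epsilon}$; in a framework with autodiff, gradients would flow through $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input (illustrative values)
mu = np.array([0.5, -1.0, 2.0])
log_var = np.array([0.0, -2.0, 1.0])
sigma = np.exp(0.5 * log_var)

# Reparameterization: the randomness lives in eps, not in mu or sigma
eps = rng.standard_normal(size=(100_000, mu.size))
z = mu + sigma * eps

print("sample mean:", z.mean(axis=0).round(3))
print("target mean:", mu)
print("sample std: ", z.std(axis=0).round(3))
print("target std: ", sigma.round(3))
```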

2. Normalizing Flows

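Flow-based models transform a simple base distribution (often Gaussian) through invertible transformations. With a Gaussian base variable $\mathbf{z}_2 \sim N(\mathbf{0}, I)$ , an affine coupling layer defines a conditional Gaussian for one half of the variables given the other:

\mathbf{x}_2 = \mathbf{z}_2 \odot \exp(s(\mathbf{x}_1)) + t(\mathbf{x}_1) \quad \Longrightarrow \quad \mathbf{x}_2 | \mathbf{x}_1 \sim N\left( t(\mathbf{x}_1), \text{diag}(\exp(2 s(\mathbf{x}_1))) \right)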
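A sketch of an affine coupling transform in its common form, $\mathbf{x}_2 = \mathbf{z}_2 \odot \exp(s(\mathbf{x}_1)) + t(\mathbf{x}_1)$ with $\mathbf{x}_1 = \mathbf{z}_1$, is shown below. The functions `s` and `t` are toy stand-ins for the neural networks a real flow would use; they are hypothetical and chosen only so the example runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scale and shift functions standing in for neural networks (hypothetical)
def s(x1):
    return 0.5 * np.tanh(x1)

def t(x1):
    return x1 ** 2

def coupling_forward(z1, z2):
    """Pass the first half through unchanged; scale and shift the second half."""
    x1 = z1
    x2 = z2 * np.exp(s(x1)) + t(x1)
    return x1, x2

def coupling_inverse(x1, x2):
    """Exact inverse; s and t themselves never need to be inverted."""
    z1 = x1
    z2 = (x2 - t(x1)) * np.exp(-s(x1))
    return z1, z2

z1 = rng.standard_normal(5)
z2 = rng.standard_normal(5)
x1, x2 = coupling_forward(z1, z2)
z1_rec, z2_rec = coupling_inverse(x1, x2)

print("inverse recovers z2:", np.allclose(z2, z2_rec))
# Given x1, x2 is Gaussian with mean t(x1) and standard deviation exp(s(x1))
print("conditional mean t(x1):    ", t(x1).round(3))
print("conditional std exp(s(x1)):", np.exp(s(x1)).round(3))
```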

3. Bayesian Neural Networks

In Bayesian deep learning, we place priors on weights and compute posteriors. The approximate posterior is often Gaussian:

q(\mathbf{W}) = N(\boldsymbol{\mu}_W, \text{diag}(\boldsymbol{\sigma}_W^2))

Predictions involve conditioning on inputs, integrating out weights.
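For intuition, consider the simplest possible case: a single linear output $y = \mathbf{w}^T \mathbf{x}$ with a factorized Gaussian $q(\mathbf{w})$. This is a deliberate simplification, not a full neural network, and the parameter values below are hypothetical. Integrating out the weights then gives a Gaussian prediction in closed form, which the sketch compares with a Monte Carlo estimate of the kind used when no closed form exists.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variational parameters for a 3-weight linear layer
mu_w = np.array([0.8, -0.3, 1.5])
sigma_w = np.array([0.2, 0.5, 0.1])
x = np.array([1.0, 2.0, -1.0])   # a single input

# Closed form: y = w @ x is Gaussian when w is Gaussian
pred_mean = mu_w @ x
pred_var = np.sum((x * sigma_w) ** 2)

# Monte Carlo: sample weights from q(W), push the input through, then average
w_samples = mu_w + sigma_w * rng.standard_normal(size=(200_000, 3))
y_samples = w_samples @ x

print(f"predictive mean: closed form {pred_mean:.3f}, Monte Carlo {y_samples.mean():.3f}")
print(f"predictive var:  closed form {pred_var:.3f}, Monte Carlo {y_samples.var():.3f}")
```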

4. Attention Mechanisms

While standard attention isn't explicitly Gaussian, probabilistic attention mechanisms can be interpreted as computing conditional expectations over keys given queries, with Gaussian assumptions on representations.

5. Latent Variable Models

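Many latent variable models, such as Gaussian mixture models (GMMs) and HMMs with Gaussian emissions, use conditional Gaussians for observations given latent states. The E-step in EM computes $p(\mathbf{z}|\mathbf{x})$ , which for a Gaussian mixture reduces to the responsibilities $p(z = k|\mathbf{x}) \propto \pi_k N(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ , normalized over the components, where $\pi_k$ are the mixture weights.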
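As a concrete instance, the E-step of a Gaussian mixture applies Bayes' rule to the component densities. The sketch below assumes a hypothetical two-component, one-dimensional mixture with made-up weights, means, and standard deviations, and computes the posterior probability of each component for a few data points.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical two-component 1-D mixture
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 1.0])
stds = np.array([0.5, 1.0])

x = np.array([-2.1, 0.0, 1.2])   # observed data points

# E-step: responsibilities p(z = k | x_i) via Bayes' rule
joint = weights * normal_pdf(x[:, None], means, stds)   # shape (n_points, n_components)
resp = joint / joint.sum(axis=1, keepdims=True)

print(resp.round(4))
```

Each row of `resp` sums to 1; points close to a component's mean are assigned almost entirely to that component.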

Python Implementation

Bivariate Normal Conditioning

bivariate_conditional.py

Notes on the code:

Bivariate normal parameters: Define the marginal means, standard deviations, and correlation. These fully specify the joint distribution.
Observed value: x_obs is the value at which we condition. This is the "given" part of Y | X = x_obs.
Conditional mean formula: The conditional mean shifts from mu_y by rho * (sigma_y / sigma_x) * (x_obs - mu_x). This is the linear regression formula.
Conditional standard deviation: The conditional std dev is sigma_y * sqrt(1 - rho^2). It does not depend on x_obs, so the uncertainty reduction is the same everywhere.
Variance reduction: rho^2 is the fraction of Y's variance explained by knowing X. This equals R^2 in linear regression.
Regression coefficient: The slope beta = Cov(X, Y) / Var(X) = rho * sigma_y / sigma_x is the optimal linear predictor coefficient.

import numpy as np

# Bivariate Normal Conditional Distribution
# Given (X, Y) ~ N(mu, Sigma), find the distribution of Y | X = x_obs

# Parameters
mu_x, mu_y = 0, 0
sigma_x, sigma_y = 1, 1
rho = 0.7  # Correlation coefficient

# Observed value
x_obs = 1.5

# Conditional distribution formulas
# Y | X = x_obs ~ N(mu_cond, sigma_cond^2)
mu_cond = mu_y + rho * (sigma_y / sigma_x) * (x_obs - mu_x)
sigma_cond = sigma_y * np.sqrt(1 - rho**2)

print(f"Conditional Distribution Y | X = {x_obs:.2f}:")
print(f"  Mean: E[Y|X={x_obs}] = {mu_cond:.4f}")
print(f"  Std Dev: sqrt(Var(Y|X)) = {sigma_cond:.4f}")
print(f"  Variance Reduction: {rho**2 * 100:.1f}%")

# Verify against the joint covariance matrix, ordered as (X, Y)
cov_matrix = [[sigma_x**2, rho * sigma_x * sigma_y],
              [rho * sigma_x * sigma_y, sigma_y**2]]

# For Y | X, the general formula Sigma_12 @ Sigma_22^{-1} becomes
# Cov(Y, X) / Var(X) = rho * sigma_y / sigma_x
beta = cov_matrix[1][0] / cov_matrix[0][0]
print(f"\nVerification: regression coefficient = {beta:.4f}")

# Compare with original Y distribution
print(f"\nOriginal Y ~ N({mu_y}, {sigma_y**2})")
print(f"Conditional Y|X ~ N({mu_cond:.3f}, {sigma_cond**2:.3f})")
print(f"Knowing X scales Y's variance by a factor of {1 - rho**2:.3f}")

General MVN Conditioning

mvn_conditional.py

Notes on the code:

Partitioned MVN: We partition X into two groups: X_1 (variables to predict) and X_2 (observed variables). The covariance matrix is partitioned accordingly.
Extract submatrices: Sigma_11 is the covariance of X_1, Sigma_22 is the covariance of X_2, and Sigma_12 (= Sigma_21^T) is the cross-covariance between the groups.
Conditional mean formula: mu_bar = mu_1 + Sigma_12 @ Sigma_22^{-1} @ (x_2 - mu_2). This is a linear function of the observed values.
Conditional covariance formula: Sigma_bar = Sigma_11 - Sigma_12 @ Sigma_22^{-1} @ Sigma_21. This is independent of the observed values, so the uncertainty reduction is fixed.
Example, 4D MVN: We have 4 correlated variables. We observe X_2 and X_3, and want the conditional distribution of X_1 and X_4.
Empirical verification: We verify the formulas by sampling from the full distribution and filtering samples near the observed values. The empirical statistics should approximately match our conditional formulas.

import numpy as np

# General MVN Conditional Distribution
# Partition X = [X_1, X_2]^T with:
#   X ~ N(mu, Sigma) where mu = [mu_1, mu_2], Sigma = [[S_11, S_12], [S_21, S_22]]
# Then: X_1 | X_2 = x_2 ~ N(mu_bar, Sigma_bar)

def mvn_conditional(mu, Sigma, x_2, indices_1, indices_2):
    """
    Compute the conditional MVN distribution of X_1 given X_2 = x_2.

    Parameters:
    - mu: Full mean vector
    - Sigma: Full covariance matrix
    - x_2: Observed values for X_2
    - indices_1: Indices of the variables to predict (X_1)
    - indices_2: Indices of the observed (conditioning) variables (X_2)

    Returns:
    - mu_bar: Conditional mean
    - Sigma_bar: Conditional covariance
    """
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    x_2 = np.asarray(x_2, dtype=float)

    # Extract sub-vectors and submatrices
    mu_1 = mu[indices_1]
    mu_2 = mu[indices_2]

    Sigma_11 = Sigma[np.ix_(indices_1, indices_1)]
    Sigma_12 = Sigma[np.ix_(indices_1, indices_2)]
    Sigma_21 = Sigma[np.ix_(indices_2, indices_1)]
    Sigma_22 = Sigma[np.ix_(indices_2, indices_2)]

    # Solve linear systems instead of forming Sigma_22^{-1} explicitly

    # Conditional mean: mu_1 + Sigma_12 @ Sigma_22^{-1} @ (x_2 - mu_2)
    mu_bar = mu_1 + Sigma_12 @ np.linalg.solve(Sigma_22, x_2 - mu_2)

    # Conditional covariance: Sigma_11 - Sigma_12 @ Sigma_22^{-1} @ Sigma_21
    Sigma_bar = Sigma_11 - Sigma_12 @ np.linalg.solve(Sigma_22, Sigma_21)

    return mu_bar, Sigma_bar

# Example: 4D MVN, condition on variables 2 and 3
mu = [0, 1, 2, 3]
Sigma = [
    [1.0, 0.5, 0.3, 0.1],
    [0.5, 2.0, 0.7, 0.2],
    [0.3, 0.7, 1.5, 0.4],
    [0.1, 0.2, 0.4, 1.0]
]

# Observe X_2 = 0.5, X_3 = 1.0 (indices 1 and 2)
x_observed = [0.5, 1.0]
indices_1 = [0, 3]  # Variables to predict (X_1, X_4)
indices_2 = [1, 2]  # Observed variables (X_2, X_3)

mu_cond, Sigma_cond = mvn_conditional(mu, Sigma, x_observed, indices_1, indices_2)

print("Conditional Distribution [X_1, X_4] | [X_2=0.5, X_3=1.0]:")
print(f"  Conditional mean: {mu_cond}")
print(f"  Conditional covariance:\n{Sigma_cond}")

# Verify: sample from the full distribution (seeded for reproducibility).
# Only a small fraction of samples pass the filter, so draw many of them.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=500_000)

# Keep samples whose observed coordinates fall close to the observed values
tol = 0.1
mask = ((np.abs(samples[:, 1] - x_observed[0]) < tol)
        & (np.abs(samples[:, 2] - x_observed[1]) < tol))
filtered = samples[mask][:, indices_1]

print(f"\nEmpirical (from {mask.sum()} samples near the observed values):")
print(f"  Mean: {np.mean(filtered, axis=0)}")
print(f"  Covariance:\n{np.cov(filtered.T)}")

Gaussian Process Implementation

gaussian_process.py

Notes on the code:

GP = Infinite MVN: A Gaussian Process is a distribution over functions, where any finite subset of function values is multivariate normal. The kernel defines the covariance structure.
RBF kernel: The RBF (Radial Basis Function) kernel defines how correlated f(x_i) and f(x_j) are based on the distance |x_i - x_j|. Nearby points are highly correlated.
Build covariance matrices: K is the train-train covariance (Sigma_22), K_star is the test-train covariance (Sigma_12), and K_star_star is the test-test covariance (Sigma_11).
Conditional MVN, posterior mean: mu_star = K_star @ K^{-1} @ y_train. This is exactly the conditional MVN mean formula with prior mean 0.
Conditional MVN, posterior covariance: cov_star = K_star_star - K_star @ K^{-1} @ K_star^T. This is exactly the conditional MVN covariance formula.
Uncertainty quantification: The diagonal of cov_star gives the variance at each test point. Far from training data, uncertainty is high. Near training data, uncertainty is low.

import numpy as np

# Gaussian Process = Infinite-dimensional MVN
# The key insight: GP posterior is just conditional MVN

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """RBF (Squared Exponential) kernel matrix between 1-D input arrays x1 and x2."""
    sq_dist = (np.asarray(x1)[:, None] - np.asarray(x2)[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(X_train, y_train, X_test,
                 length_scale=1.0, variance=1.0, noise=0.1):
    """
    Compute GP posterior using conditional MVN formulas.

    Parameters:
    - X_train: Training inputs (n,)
    - y_train: Training outputs (n,)
    - X_test: Test inputs (m,)
    - length_scale, variance: Kernel hyperparameters
    - noise: Observation noise variance

    Returns:
    - mu_star: Posterior mean at test points
    - cov_star: Posterior covariance at test points
    """
    n = len(X_train)

    # Build covariance matrices (vectorized over all pairs of inputs)
    K = rbf_kernel(X_train, X_train, length_scale, variance)
    K = K + noise * np.eye(n)  # Add noise on the diagonal (Sigma_22)
    K_star = rbf_kernel(X_test, X_train, length_scale, variance)      # Sigma_12
    K_star_star = rbf_kernel(X_test, X_test, length_scale, variance)  # Sigma_11

    # Conditional MVN formulas!
    # This is EXACTLY: mu_bar = mu + Sigma_12 @ Sigma_22^{-1} @ (x - mu)
    # With mu = 0 (GP prior mean) and x = y_train.
    # Linear systems are solved directly instead of forming K^{-1}.

    # Posterior mean: K_* @ K^{-1} @ y
    mu_star = K_star @ np.linalg.solve(K, y_train)

    # Posterior covariance: K_** - K_* @ K^{-1} @ K_*^T
    cov_star = K_star_star - K_star @ np.linalg.solve(K, K_star.T)

    return mu_star, cov_star

# Example: Fit GP to noisy observations
np.random.seed(42)
X_train = np.array([-2, -1, 0, 1, 2.5])
y_train = np.sin(X_train) + 0.1 * np.random.randn(len(X_train))

# The default noise variance (0.1) is a modeling choice; the simulated
# noise above has standard deviation 0.1, i.e. variance 0.01.
X_test = np.linspace(-3, 3, 100)
mu, cov = gp_posterior(X_train, y_train, X_test)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # guard against tiny negative round-off

print("GP Posterior (at selected test points):")
for i in [0, 25, 50, 75, 99]:
    print(f"  x={X_test[i]:.2f}: mean={mu[i]:.3f}, std={std[i]:.3f}")

print("\nKey insight: The GP posterior formulas are EXACTLY")
print("the conditional MVN formulas!")

Common Pitfalls and Misconceptions

Pitfall 1: Confusing Conditional with Marginal

Wrong: "The conditional $Y|X$ has the same variance as the marginal $Y$ ."

Correct: $\text{Var}(Y|X) = \sigma_Y^2(1-\rho^2) \leq \sigma_Y^2 = \text{Var}(Y)$ . For jointly Gaussian variables, conditioning never increases variance, and it leaves the variance unchanged only when $\rho = 0$ .

Pitfall 2: Assuming Conditional Variance Depends on Observed Value

Wrong: "If we observe $X=2$ vs $X=0$ , the conditional variance of $Y$ is different."

Correct: For MVN, $\text{Var}(Y|X=x)$ is the same for all $x$ . Only the conditional mean shifts.

Pitfall 3: Forgetting the Schur Complement Structure

When computing $\bar{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$ , ensure:

$\boldsymbol{\Sigma}_{22}$ is invertible (positive definite)
$\boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21}^T$ (symmetry)
The result is positive semi-definite

Pitfall 4: Applying MVN Formulas to Non-Gaussian Data

The beautiful closed-form conditional formulas only apply to Gaussian distributions. For non-Gaussian data, conditionals can have complex, non-linear shapes.
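A small simulation illustrates what can go wrong. The pair below is a constructed example (not from the text): $X$ is standard normal and $Y = X \cdot \varepsilon$ with independent standard normal $\varepsilon$, so $X$ and $Y$ are uncorrelated yet $\text{Var}(Y|X=x) = x^2$ depends strongly on the observed value. The MVN formulas would wrongly predict a constant conditional variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# A non-Gaussian pair: X is normal, but Y's spread scales with |X|
x = rng.standard_normal(n)
y = x * rng.standard_normal(n)

def conditional_var(x0, half_width=0.05):
    """Empirical Var(Y | X near x0), using samples in a narrow window around x0."""
    mask = np.abs(x - x0) < half_width
    return y[mask].var()

corr = np.corrcoef(x, y)[0, 1]
var_at_0 = conditional_var(0.0)
var_at_2 = conditional_var(2.0)

print(f"correlation(X, Y)   = {corr:.4f}  (near 0)")
print(f"Var(Y | X near 0.0) = {var_at_0:.4f}")
print(f"Var(Y | X near 2.0) = {var_at_2:.4f}  (theory: about 4)")
```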

Pitfall 5: Numerical Instability in Matrix Inversion

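For large or ill-conditioned $\boldsymbol{\Sigma}_{22}$ , explicitly inverting the matrix is slow and can be numerically unstable. Use a Cholesky decomposition instead:

$\boldsymbol{\Sigma}_{22} = LL^T$ , then solve $L \mathbf{z} = (\mathbf{x}_2 - \boldsymbol{\mu}_2)$ via forward substitution, followed by $L^T \boldsymbol{\alpha} = \mathbf{z}$ via back substitution, so that $\boldsymbol{\alpha} = \boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$ without ever forming the inverse.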
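A sketch of this approach using SciPy's `cho_factor` and `cho_solve` is shown below. The helper name `mvn_conditional_cholesky` is made up for this example, and the test values reuse the 4D mean and covariance from the earlier mvn_conditional example; the result is compared against the explicit-inverse formula.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def mvn_conditional_cholesky(mu_1, mu_2, S11, S12, S22, x_2):
    """Conditional mean and covariance via a Cholesky factorization of S22."""
    factor = cho_factor(S22, lower=True)      # S22 = L @ L.T
    alpha = cho_solve(factor, x_2 - mu_2)     # solves S22 @ alpha = x_2 - mu_2
    V = cho_solve(factor, S12.T)              # solves S22 @ V = S21
    return mu_1 + S12 @ alpha, S11 - S12 @ V

mu = np.array([0.0, 1.0, 2.0, 3.0])
Sigma = np.array([[1.0, 0.5, 0.3, 0.1],
                  [0.5, 2.0, 0.7, 0.2],
                  [0.3, 0.7, 1.5, 0.4],
                  [0.1, 0.2, 0.4, 1.0]])
idx1, idx2 = [0, 3], [1, 2]
x_2 = np.array([0.5, 1.0])

S11 = Sigma[np.ix_(idx1, idx1)]
S12 = Sigma[np.ix_(idx1, idx2)]
S22 = Sigma[np.ix_(idx2, idx2)]

mu_bar, Sigma_bar = mvn_conditional_cholesky(mu[idx1], mu[idx2], S11, S12, S22, x_2)

# Reference result using the explicit inverse, for comparison only
S22_inv = np.linalg.inv(S22)
mu_ref = mu[idx1] + S12 @ S22_inv @ (x_2 - mu[idx2])
Sigma_ref = S11 - S12 @ S22_inv @ S12.T

print("means agree:      ", np.allclose(mu_bar, mu_ref))
print("covariances agree:", np.allclose(Sigma_bar, Sigma_ref))
```

`cho_factor` also fails loudly if $\boldsymbol{\Sigma}_{22}$ is not positive definite, which doubles as a sanity check on the covariance.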

Summary: The Power of Gaussian Conditioning

You've now mastered one of the most powerful tools in probability and statistics. Let's recap:

The Key Formulas

Quantity	Bivariate	General MVN
Conditional Mean	μ_Y + ρ(σ_Y/σ_X)(x - μ_X)	μ₁ + Σ₁₂ Σ₂₂⁻¹ (x₂ - μ₂)
Conditional Covariance	σ²_Y(1 - ρ²)	Σ₁₁ - Σ₁₂ Σ₂₂⁻¹ Σ₂₁
Variance Reduction	ρ² σ²_Y (a fraction ρ² of σ²_Y)	Σ₁₂ Σ₂₂⁻¹ Σ₂₁ (subtracted from Σ₁₁)

Key Properties

Closure: Conditioning a jointly Gaussian vector on a subset of its components yields another Gaussian
Linear mean: The conditional mean is a linear function of observed values
Constant variance: Conditional covariance doesn't depend on observed values
Variance reduction: Conditioning never increases variance (information gain)

Applications Mastered

Linear Regression: The regression line is $E[Y|X]$ , and $R^2 = \rho^2$
Gaussian Processes: GP posterior = conditional MVN in function space
Kalman Filter: State estimation = sequential Gaussian conditioning
Deep Learning: VAEs, flows, Bayesian NNs all use conditional Gaussians

The Ultimate Insight

The conditional MVN formulas are not just mathematical curiosities—they are the computational backbone of modern probabilistic machine learning. Whenever you see:

A posterior that's Gaussian
An update step involving Kalman-like gains
A regression coefficient matrix
Uncertainty that shrinks near observations

You're seeing the conditional MVN formulas in action. Master this section, and you've unlocked a vast landscape of probabilistic methods.