
Gaussian Distributions Deep Dive


Learning Objectives

By the end of this section, you will:

  1. Master the univariate Gaussian PDF and understand every term in the formula
  2. Extend to multivariate Gaussians with covariance matrices and understand their geometric interpretation
  3. Apply key properties including closure under linear transformations and conditioning
  4. Implement the reparameterization trick that enables gradient-based learning through sampling
  5. Connect every concept to diffusion models where Gaussians define both forward and reverse processes

Why Gaussians Dominate Diffusion

Diffusion models are built almost entirely on Gaussian distributions. The forward process adds Gaussian noise: $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$. The reverse process predicts Gaussian parameters: $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$. Understanding Gaussians deeply is not optional—it's essential.

The Big Picture

The Gaussian (or Normal) distribution is the most important distribution in all of statistics and machine learning. It was first studied by Abraham de Moivre in 1733 and later developed extensively by Carl Friedrich Gauss, who used it to analyze astronomical data.

Why is the Gaussian so ubiquitous? There are three fundamental reasons:

  1. Central Limit Theorem: The sum of many independent random variables converges to a Gaussian, regardless of their individual distributions. Real-world measurements often arise from many small, independent effects—hence the "normal" name.
  2. Maximum Entropy: Among all distributions with a given mean and variance, the Gaussian has maximum entropy. It makes the fewest assumptions beyond these constraints, making it the "least informative" choice.
  3. Mathematical Convenience: Gaussians are closed under linear operations. Sums of Gaussians are Gaussian. Products of Gaussians are (proportional to) Gaussians. Conditionals and marginals of joint Gaussians are also Gaussian.

Connection to Physics

The Gaussian arises naturally in diffusion physics. When particles undergo Brownian motion (random collisions), their positions after time t follow a Gaussian distribution. This physical process inspired the name "diffusion models"—we're simulating information diffusing into noise.

The Univariate Gaussian

The Probability Density Function

A random variable $X$ follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, written $X \sim \mathcal{N}(\mu, \sigma^2)$, if its probability density function (PDF) is:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Let's understand each component:

| Symbol | Name | Role | Diffusion Connection |
|---|---|---|---|
| $1/\sqrt{2\pi\sigma^2}$ | Normalizing constant | Ensures the PDF integrates to 1 | Appears in the ELBO derivation |
| $\mu$ | Mean | Center of the distribution | In diffusion: $\sqrt{\bar{\alpha}_t}\,x_0$ |
| $\sigma^2$ | Variance | Spread/width of the distribution | In diffusion: $(1 - \bar{\alpha}_t)$ |
| $(x - \mu)^2$ | Squared deviation | Distance from the mean | Measures noise magnitude |
| $\exp(-\cdot)$ | Exponential decay | Tails decay rapidly | Enables tractable integrals |

Standard Normal

When $\mu = 0$ and $\sigma^2 = 1$, we get the standard normal $\mathcal{N}(0, 1)$. Any Gaussian can be transformed to a standard normal via $Z = (X - \mu)/\sigma$. This is called standardization.
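Standardization is easy to check empirically. A minimal sketch (the parameter values here are arbitrary):

```python
import torch

torch.manual_seed(0)

# Draw samples from N(mu, sigma^2), then standardize: Z = (X - mu) / sigma
mu, sigma = 3.0, 2.0
x = mu + sigma * torch.randn(100_000)
z = (x - mu) / sigma

# The standardized samples are approximately standard normal
print(z.mean().item(), z.std().item())  # close to 0 and 1
```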

Interactive Visualization: Explore how changing $\mu$ and $\sigma$ affects the Gaussian PDF. Notice how the area under the curve always equals 1.

Log-Probability

In practice, we almost always work with log-probabilities rather than probabilities to avoid numerical underflow. Taking the log of the Gaussian PDF:

$$\log p(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$$

The log-probability reveals the quadratic structure of the Gaussian. This is why minimizing squared error is equivalent to maximum likelihood estimation with Gaussian noise.

The MSE Connection

In diffusion training, we minimize $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$, the squared error between the true and predicted noise. Up to an additive constant and a scale factor, this is exactly the negative log-likelihood of a Gaussian with unit variance!
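The correspondence can be verified numerically: with unit variance, the Gaussian negative log-likelihood equals half the squared error plus the constant $\frac{1}{2}\log(2\pi)$. A small sketch (the values are arbitrary):

```python
import math

import torch
from torch.distributions import Normal

mu = torch.tensor(0.3)  # "prediction"
x = torch.tensor(1.2)   # "target"

# Gaussian negative log-likelihood with unit variance
nll = -Normal(mu, 1.0).log_prob(x)

# Half squared error plus the constant 0.5 * log(2*pi)
mse_form = 0.5 * (x - mu) ** 2 + 0.5 * math.log(2 * math.pi)

print(nll.item(), mse_form.item())  # identical up to floating point
```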

Multivariate Gaussian

Definition

For a random vector $\mathbf{x} \in \mathbb{R}^d$, the multivariate Gaussian with mean vector $\boldsymbol{\mu} \in \mathbb{R}^d$ and covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ has PDF:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$$

Each component has a natural interpretation:

| Component | Symbol | Meaning |
|---|---|---|
| Normalizing constant | $(2\pi)^{d/2}\lvert\boldsymbol{\Sigma}\rvert^{1/2}$ | Depends on the dimension $d$ and the determinant of the covariance |
| Mean vector | $\boldsymbol{\mu} \in \mathbb{R}^d$ | Center of the distribution in $d$-dimensional space |
| Covariance matrix | $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ | Encodes variances (diagonal) and correlations (off-diagonal) |
| Mahalanobis distance | $(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$ | Quadratic form measuring distance from the mean, scaled by the covariance |

Covariance Matrix Requirements

The covariance matrix $\boldsymbol{\Sigma}$ must be:
  • Symmetric: $\Sigma_{ij} = \Sigma_{ji}$
  • Positive semi-definite: $\mathbf{v}^\top \boldsymbol{\Sigma} \mathbf{v} \geq 0$ for all $\mathbf{v}$ (strictly positive definite for the PDF above, since the formula requires $\boldsymbol{\Sigma}^{-1}$ to exist)
These ensure the PDF is well-defined and integrates to 1.
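Both requirements can be checked numerically with an eigenvalue test. A sketch (the helper name is ours):

```python
import torch

def is_valid_covariance(sigma: torch.Tensor, tol: float = 1e-8) -> bool:
    """Check symmetry and positive semi-definiteness (illustrative helper)."""
    symmetric = torch.allclose(sigma, sigma.T, atol=tol)
    # eigvalsh assumes a symmetric matrix and returns real eigenvalues
    psd = bool((torch.linalg.eigvalsh(sigma) >= -tol).all())
    return symmetric and psd

good = torch.tensor([[1.0, 0.5], [0.5, 1.0]])
bad = torch.tensor([[1.0, 2.0], [2.0, 1.0]])  # symmetric but indefinite

print(is_valid_covariance(good), is_valid_covariance(bad))  # True False
```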

Geometric Interpretation

The contours of constant probability density form ellipsoids in d-dimensional space:

  • Principal axes are the eigenvectors of $\boldsymbol{\Sigma}$
  • Axis lengths are proportional to $\sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues
  • Spherical when $\boldsymbol{\Sigma} = \sigma^2 I$ (isotropic)
  • Axis-aligned ellipse when $\boldsymbol{\Sigma}$ is diagonal
  • Rotated ellipse when off-diagonal elements are non-zero
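The ellipse geometry falls directly out of an eigendecomposition. A minimal sketch with an arbitrary covariance matrix:

```python
import torch

sigma = torch.tensor([[2.0, 1.2],
                      [1.2, 1.0]])

# Principal axes are the eigenvectors; half-axis lengths scale with the
# square roots of the eigenvalues.
eigvals, eigvecs = torch.linalg.eigh(sigma)  # eigenvalues in ascending order

print(eigvals)              # 0.2 and 2.8 for this matrix
print(torch.sqrt(eigvals))  # proportional axis lengths
```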

Interactive 3D Visualization: Explore the multivariate Gaussian in 3D. Rotate the view, adjust mean, variance, and correlation to see how the distribution shape changes.

Implementation

Multivariate Gaussian in PyTorch
PyTorch's `MultivariateNormal` distribution class mirrors the mathematical notation: it takes a mean vector and a covariance matrix and provides sampling and log-probability evaluation. The mean vector defines the center of the distribution (in diffusion, the forward process pushes it toward zero), and the covariance matrix captures variances (diagonal) and correlations (off-diagonal); a diagonal covariance means independent dimensions.

```python
import torch
from torch.distributions import MultivariateNormal

# Define parameters: the covariance matrix must be positive semi-definite
mean = torch.tensor([0.0, 0.0])
covariance = torch.tensor([
    [1.0, 0.5],
    [0.5, 1.0]
])

# Create the distribution object. In diffusion, we create many such
# distributions for different timesteps.
mvn = MultivariateNormal(mean, covariance)

# Sample from the distribution. In diffusion training, we sample noise
# epsilon ~ N(0, I) millions of times.
samples = mvn.sample((1000,))  # Shape: (1000, 2)

# Compute log p(x) -- essential for objectives that minimize negative
# log-likelihood.
point = torch.tensor([0.5, 0.5])
log_prob = mvn.log_prob(point)

# Reparameterization: instead of sampling z ~ N(mu, Sigma) directly,
# sample epsilon ~ N(0, I) and compute z = mean + L @ epsilon, where
# L @ L.T = covariance. Same distribution as mvn.sample(), but
# gradients can flow through mean and L.
epsilon = torch.randn(2)
L = torch.linalg.cholesky(covariance)
z = mean + L @ epsilon
```

Interactive 2D Contours: Adjust correlation and variances to see how the elliptical contours change. In diffusion models with isotropic noise, contours are circles.


Key Properties of Gaussians

Property 1: Linear Transformations

If $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $\mathbf{y} = A\mathbf{x} + \mathbf{b}$, then:

$$\mathbf{y} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b}, A\boldsymbol{\Sigma}A^\top)$$

Diffusion Application

The forward process $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ is exactly a linear transformation! Starting from $x_0$ (treated as fixed) and $\epsilon \sim \mathcal{N}(0, I)$, the result $x_t$ is Gaussian with mean $\sqrt{\bar{\alpha}_t}\,x_0$ and variance $(1-\bar{\alpha}_t)I$.
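The linear-transformation property is easy to verify by Monte Carlo. A sketch with arbitrary choices of $A$, $\mathbf{b}$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$:

```python
import torch

torch.manual_seed(0)

mu = torch.tensor([1.0, -1.0])
Sigma = torch.tensor([[1.0, 0.3], [0.3, 2.0]])
A = torch.tensor([[2.0, 0.0], [1.0, 1.0]])
b = torch.tensor([0.5, 0.5])

# Sample x ~ N(mu, Sigma) via the Cholesky factor, then transform
L = torch.linalg.cholesky(Sigma)
x = mu + torch.randn(200_000, 2) @ L.T
y = x @ A.T + b

print(y.mean(dim=0))   # close to A @ mu + b
print(torch.cov(y.T))  # close to A @ Sigma @ A.T
```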

Property 2: Marginals and Conditionals

For a joint Gaussian

$$\begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}\right):$$

  • Marginals: $\mathbf{x}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$
  • Conditionals: $\mathbf{x}_1 \mid \mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2})$ where:
    • $\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$
    • $\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$
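The conditioning formulas translate directly into code. A sketch for a bivariate case with correlation 0.5 (all values arbitrary):

```python
import torch

# Blocks of the joint distribution over (x1, x2)
mu1 = torch.tensor([0.0])
mu2 = torch.tensor([0.0])
S11 = torch.tensor([[1.0]])
S12 = torch.tensor([[0.5]])
S22 = torch.tensor([[1.0]])

v = torch.tensor([1.0])  # observed value of x2

# Apply the conditional mean and covariance formulas
S22_inv = torch.linalg.inv(S22)
mu_cond = mu1 + S12 @ S22_inv @ (v - mu2)
S_cond = S11 - S12 @ S22_inv @ S12.T

print(mu_cond.item(), S_cond.item())  # 0.5 and 0.75
```

Note how observing $x_2$ both shifts the mean toward the observation and shrinks the variance (from 1 to 0.75).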

The Key Insight for Diffusion

The true reverse conditional $q(x_{t-1} \mid x_t, x_0)$ is Gaussian with a closed-form mean and variance! This is possible because, conditioned on $x_0$, the pair $(x_{t-1}, x_t)$ is jointly Gaussian. The neural network learns to approximate this conditional.

Property 3: Products of Gaussians

The product of two Gaussian PDFs is proportional to another Gaussian:

$$\mathcal{N}(x; \mu_1, \sigma_1^2) \cdot \mathcal{N}(x; \mu_2, \sigma_2^2) \propto \mathcal{N}(x; \mu_*, \sigma_*^2)$$

where the precision-weighted combination is:

  • $\sigma_*^{-2} = \sigma_1^{-2} + \sigma_2^{-2}$ (precisions add)
  • $\mu_* = \sigma_*^2(\sigma_1^{-2}\mu_1 + \sigma_2^{-2}\mu_2)$ (weighted by precision)
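A small sketch of the precision-weighted combination (the helper name is ours):

```python
def product_of_gaussians(mu1, var1, mu2, var2):
    """Parameters of the (normalized) product of two univariate Gaussians."""
    prec = 1.0 / var1 + 1.0 / var2        # precisions add
    var = 1.0 / prec
    mu = var * (mu1 / var1 + mu2 / var2)  # precision-weighted mean
    return mu, var

# Equal variances: the combined mean is the simple average,
# and the combined variance is halved
mu_star, var_star = product_of_gaussians(0.0, 1.0, 2.0, 1.0)
print(mu_star, var_star)  # 1.0 and 0.5
```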

Precision = Confidence

Precision $\tau = 1/\sigma^2$ measures confidence in an estimate. When combining information, we weight by precision—more confident estimates get more weight. This is exactly how Bayesian updates work, and it appears in the derivation of $q(x_{t-1} \mid x_t, x_0)$.

Property 4: Sum of Independent Gaussians

If $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent:

$$X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$$

Forward Process Derivation

This property lets us derive the marginal $q(x_t \mid x_0)$ directly! Instead of composing $t$ noising steps one at a time, we can jump straight to timestep $t$ by accumulating the variances.
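This can be checked numerically: accumulating the per-step noise variance under $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon$ reproduces the closed-form $1 - \bar{\alpha}_t$. A sketch with an illustrative linear schedule (the schedule values are assumptions):

```python
import torch

T = 100
betas = torch.linspace(1e-4, 0.02, T)  # illustrative linear schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

# Recursion for the noise variance: v_t = alpha_t * v_{t-1} + beta_t
# (the old noise is scaled by alpha_t, and fresh noise with variance
# beta_t is added independently, so the variances sum)
var = torch.tensor(0.0)
for alpha, beta in zip(alphas, betas):
    var = alpha * var + beta

print(var.item(), (1 - alpha_bar[-1]).item())  # the two agree
```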

The Reparameterization Trick

One of the most important techniques in modern generative modeling is the reparameterization trick. The problem it solves: how do we backpropagate gradients through random sampling?

The Problem

Suppose we want to optimize parameters $\theta$ of a distribution $p_\theta(z)$ with respect to some loss $\mathcal{L}(z)$. We need:

$$\nabla_\theta\, \mathbb{E}_{z \sim p_\theta}[\mathcal{L}(z)]$$

But sampling $z \sim p_\theta$ is a stochastic, non-differentiable operation! The gradient cannot flow through the sampling step.

The Solution

For Gaussians, we can reparameterize the sampling:

  1. Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly...
  2. Sample $\epsilon \sim \mathcal{N}(0, 1)$ (independent of the parameters)
  3. Compute $z = \mu + \sigma \cdot \epsilon$ (deterministic given $\epsilon$)

Now the expectation becomes:

$$\mathbb{E}_{z \sim p_\theta}[\mathcal{L}(z)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}[\mathcal{L}(\mu + \sigma \cdot \epsilon)]$$

And we can differentiate through $z = \mu + \sigma \cdot \epsilon$!

Reparameterization Trick Implementation
In a VAE or diffusion model, `mu` and `log_sigma` are outputs of neural networks, and we need gradients through them. We parameterize $\log\sigma$ rather than $\sigma$ so that $\sigma > 0$ holds without constraints—a common trick in variational autoencoders.

```python
import torch

# Parameters from a neural network (require gradients)
mu = torch.tensor([1.0], requires_grad=True)
log_sigma = torch.tensor([0.0], requires_grad=True)  # log for positivity
sigma = torch.exp(log_sigma)

# Sample epsilon ~ N(0, 1): the source of randomness. Crucially,
# epsilon does not depend on mu or sigma.
epsilon = torch.randn_like(mu)

# Reparameterize: z = mu + sigma * epsilon. Since epsilon is fixed
# during backprop, gradients flow through mu and sigma:
# dL/d(mu) = dL/dz * dz/d(mu) = dL/dz * 1
z = mu + sigma * epsilon

# Compute some loss and backpropagate
loss = (z ** 2).sum()  # Example loss
loss.backward()

# Gradients now flow through mu and sigma!
print(f"Gradient w.r.t. mu: {mu.grad}")
print(f"Gradient w.r.t. log_sigma: {log_sigma.grad}")
```

Without reparameterization, we cannot backpropagate through sampling; this trick is essential for both VAEs and diffusion models.

Multivariate Case

For multivariate Gaussians, use the Cholesky decomposition: $\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}$, where $LL^\top = \boldsymbol{\Sigma}$ and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$. In diffusion models with isotropic noise ($\Sigma = \sigma^2 I$), this simplifies to $\mathbf{z} = \boldsymbol{\mu} + \sigma\boldsymbol{\epsilon}$.
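The multivariate version can be checked by comparing sample statistics against the target parameters. A sketch with an arbitrary covariance:

```python
import torch

torch.manual_seed(0)

mu = torch.tensor([1.0, -2.0])
Sigma = torch.tensor([[2.0, 0.8], [0.8, 1.0]])
L = torch.linalg.cholesky(Sigma)  # L @ L.T == Sigma

# z = mu + L @ epsilon, batched: each row of eps is one epsilon ~ N(0, I)
eps = torch.randn(200_000, 2)
z = mu + eps @ L.T

print(z.mean(dim=0))   # close to mu
print(torch.cov(z.T))  # close to Sigma
```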

Bonus: The Central Limit Theorem

The CLT explains why Gaussians appear everywhere. If $X_1, X_2, \ldots, X_n$ are independent and identically distributed (i.i.d.) with mean $\mu$ and variance $\sigma^2$, then as $n \to \infty$:

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)$$

Interactive CLT Demonstration: Start with any distribution (uniform, exponential, etc.) and see how the sum converges to a Gaussian.
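The CLT can also be sketched numerically: standardized means of uniform samples behave like a standard normal. (Uniform(0, 1) has mean $1/2$ and variance $1/12$; the sample sizes below are arbitrary.)

```python
import torch

torch.manual_seed(0)

n, trials = 500, 20_000
samples = torch.rand(trials, n)  # Uniform(0, 1) draws
means = samples.mean(dim=1)

# Standardize the sample means: (mean - mu) / (sigma / sqrt(n))
z = (means - 0.5) / ((1.0 / 12) ** 0.5 / n ** 0.5)

print(z.mean().item(), z.std().item())  # close to 0 and 1
```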


Gaussians in Diffusion Models

Let's connect everything to diffusion models explicitly:

| Diffusion Equation | Gaussian Form | What It Means |
|---|---|---|
| $q(x_t \mid x_{t-1})$ | $\mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$ | Single step of adding noise |
| $q(x_t \mid x_0)$ | $\mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t) I)$ | Marginal at any timestep $t$ |
| $q(x_{t-1} \mid x_t, x_0)$ | $\mathcal{N}(\tilde{\mu}_t, \tilde{\sigma}_t^2 I)$ | True reverse posterior (tractable!) |
| $p_\theta(x_{t-1} \mid x_t)$ | $\mathcal{N}(\mu_\theta(x_t, t), \sigma_t^2 I)$ | Learned reverse process |

Why This Matters

  1. Forward process is fully determined: We can compute $q(x_t \mid x_0)$ in closed form for any $t$, enabling efficient training.
  2. Reverse posterior is tractable: Given $x_0$, we know the exact $q(x_{t-1} \mid x_t, x_0)$. This becomes our training target.
  3. KL divergences have closed forms: Between two Gaussians, the KL divergence is analytic, enabling efficient ELBO computation.
  4. Reparameterization enables training: We sample $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and train to predict $\epsilon$.

The Epsilon Prediction Target

Instead of predicting the mean $\mu_\theta$ directly, DDPM trains the network to predict the noise $\epsilon$. Since $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, knowing $\epsilon$ lets us recover everything else. This noise-prediction parameterization leads to better empirical results.
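The inversion is a one-liner to verify: given $x_t$, $\bar{\alpha}_t$, and the true $\epsilon$, we recover $x_0$ exactly. A minimal sketch (the value of $\bar{\alpha}_t$ is arbitrary):

```python
import torch

torch.manual_seed(0)

alpha_bar = torch.tensor(0.4)  # arbitrary cumulative signal level
x0 = torch.randn(5)
eps = torch.randn(5)

# Forward: x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps
xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * eps

# Invert: x_0 = (x_t - sqrt(1 - abar) * eps) / sqrt(abar)
x0_recovered = (xt - torch.sqrt(1 - alpha_bar) * eps) / torch.sqrt(alpha_bar)

print(torch.allclose(x0, x0_recovered, atol=1e-5))  # True
```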

Summary

In this section, we developed a deep understanding of Gaussian distributions:

  1. Univariate Gaussian: The PDF has a normalizing constant and exponential of quadratic deviation from the mean. Log-probability reveals the connection to squared error loss.
  2. Multivariate Gaussian: Extends to vectors with mean vectors and covariance matrices. Contours form ellipsoids whose axes are determined by eigenvectors.
  3. Key Properties: Closure under linear transformations, tractable marginals and conditionals, products combine via precision weighting, sums of independent Gaussians are Gaussian.
  4. Reparameterization Trick: Enables gradient-based learning by separating randomness ($\epsilon$) from parameters ($\mu, \sigma$).
  5. Diffusion Connection: Both forward and reverse processes are defined by Gaussians, making the entire framework analytically tractable.

Exercises

Conceptual Questions

  1. Explain why the Gaussian distribution maximizes entropy among all distributions with fixed mean and variance. What does this imply for modeling uncertainty?
  2. Given a 2D Gaussian with correlation $\rho = 0.9$, describe qualitatively what the contours look like. What happens as $\rho \to 1$?
  3. Why does the reparameterization trick work for Gaussians but not for all distributions (e.g., discrete distributions)?
  4. In the diffusion forward process, why do we use $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1-\bar{\alpha}_t}$ as coefficients rather than arbitrary functions?

Computational Exercises

  1. Implement a function that computes the KL divergence between two univariate Gaussians $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$ using the closed-form formula. Verify empirically using Monte Carlo sampling.
  2. Given a bivariate Gaussian with $\mu = [0, 0]$ and $\Sigma = \begin{bmatrix} 1 & 0.7 \\ 0.7 & 1 \end{bmatrix}$, compute the conditional distribution $p(x_1 \mid x_2 = 1.5)$. Sample from both the joint and the conditional to verify.
  3. Implement the reparameterization trick for a VAE encoder that outputs $\mu$ and $\log\sigma$. Verify that gradients flow correctly by computing $\partial L/\partial \mu$ for a simple loss.
  4. Simulate the diffusion forward process for a 1D distribution. Start with samples from a mixture of Gaussians, apply the forward process for $T = 100$ steps with a linear noise schedule, and visualize how the distribution evolves toward $\mathcal{N}(0, 1)$.

Challenge Problem

Derive the posterior $q(x_{t-1} \mid x_t, x_0)$ for the diffusion process:

  1. Start with $q(x_t \mid x_{t-1})$ and $q(x_{t-1} \mid x_0)$
  2. Use Bayes' theorem: $q(x_{t-1} \mid x_t, x_0) \propto q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)$
  3. Show that the product of two Gaussians gives a Gaussian with specific mean and variance
  4. Verify your result matches the formula in the DDPM paper