Learning Objectives
By the end of this section, you will:
- Master the univariate Gaussian PDF and understand every term in the formula
- Extend to multivariate Gaussians with covariance matrices and understand their geometric interpretation
- Apply key properties including closure under linear transformations and conditioning
- Implement the reparameterization trick that enables gradient-based learning through sampling
- Connect every concept to diffusion models where Gaussians define both forward and reverse processes
Why Gaussians Dominate Diffusion
The Big Picture
The Gaussian (or Normal) distribution is the most important distribution in all of statistics and machine learning. It was first studied by Abraham de Moivre in 1733 and later developed extensively by Carl Friedrich Gauss, who used it to analyze astronomical data.
Why is the Gaussian so ubiquitous? There are three fundamental reasons:
- Central Limit Theorem: The sum of many independent random variables converges to a Gaussian, regardless of their individual distributions. Real-world measurements often arise from many small, independent effects—hence the "normal" name.
- Maximum Entropy: Among all distributions with a given mean and variance, the Gaussian has maximum entropy. It makes the fewest assumptions beyond these constraints, making it the "least informative" choice.
- Mathematical Convenience: Gaussians are closed under linear operations. Sums of Gaussians are Gaussian. Products of Gaussians are (proportional to) Gaussians. Conditionals and marginals of joint Gaussians are also Gaussian.
Connection to Physics
The Univariate Gaussian
The Probability Density Function
A random variable X follows a Gaussian distribution with mean mu and variance sigma^2, written X ~ N(mu, sigma^2), if its probability density function (PDF) is:

p(x) = 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))
Let's understand each component:
| Symbol | Name | Role | Diffusion Connection |
|---|---|---|---|
| 1/sqrt(2*pi*sigma^2) | Normalizing constant | Ensures integral = 1 | Appears in ELBO derivation |
| mu | Mean | Center of the distribution | In diffusion: sqrt(alpha_bar_t) * x_0 |
| sigma^2 | Variance | Spread/width of the distribution | In diffusion: (1 - alpha_bar_t) |
| (x - mu)^2 | Squared deviation | Distance from mean | Measures noise magnitude |
| exp(-...) | Exponential decay | Tails decay rapidly | Enables tractable integrals |
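The pieces in the table above can be checked numerically. Below is a minimal numpy sketch (the function name `gaussian_pdf` is my own, not from the text) that builds the PDF from its normalizing constant and exponential term, then verifies that the density integrates to 1:

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Univariate Gaussian PDF: normalizing constant times the exponential
    of the scaled squared deviation from the mean."""
    norm_const = 1.0 / np.sqrt(2 * np.pi * sigma**2)
    return norm_const * np.exp(-((x - mu) ** 2) / (2 * sigma**2))

# Numerically check that the density integrates to 1 (Riemann sum).
xs = np.linspace(-20.0, 20.0, 400_001)
dx = xs[1] - xs[0]
area = np.sum(gaussian_pdf(xs, mu=1.0, sigma=2.0)) * dx
print(area)  # very close to 1.0
```

Changing mu shifts the curve; changing sigma stretches it, with the normalizing constant shrinking so the area stays 1.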
Standard Normal
The special case N(0, 1), with mu = 0 and sigma = 1, is called the standard normal. Any Gaussian sample can be written as x = mu + sigma * z with z ~ N(0, 1), a fact exploited later by the reparameterization trick.
Interactive Visualization: Explore how changing mu and sigma affects the Gaussian PDF. Notice how the area under the curve always equals 1.
Log-Probability
In practice, we almost always work with log-probabilities rather than probabilities to avoid numerical underflow. Taking the log of the Gaussian PDF:

log p(x) = -1/2 * log(2*pi*sigma^2) - (x - mu)^2 / (2*sigma^2)
The log-probability reveals the quadratic structure of the Gaussian. This is why minimizing squared error is equivalent to maximum likelihood estimation with Gaussian noise.
The MSE Connection
With sigma held fixed, maximizing the log-likelihood over mu is exactly minimizing the squared error (x - mu)^2: mean squared error loss is Gaussian maximum likelihood in disguise.
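A quick numerical sketch of this equivalence (variable names are my own): for fixed sigma, the negative Gaussian log-likelihood and the MSE are minimized by the same mu, namely the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=10_000)   # data = signal + Gaussian noise

def neg_log_lik(mu, sigma=1.0):
    # Negative Gaussian log-likelihood, summed over the data.
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mu) ** 2 / (2 * sigma**2))

def mse(mu):
    return np.mean((x - mu) ** 2)

# Grid search: both objectives pick the same mu (the sample mean).
mus = np.linspace(0.0, 6.0, 601)
best_nll = mus[np.argmin([neg_log_lik(m) for m in mus])]
best_mse = mus[np.argmin([mse(m) for m in mus])]
print(best_nll, best_mse)
```

The two objectives differ only by a constant and a positive scale factor, so they share every minimizer.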
Multivariate Gaussian
Definition
For a random vector x in R^d, the multivariate Gaussian with mean vector mu and covariance matrix Sigma has PDF:

p(x) = 1 / ((2*pi)^(d/2) * det(Sigma)^(1/2)) * exp(-1/2 * (x - mu)^T Sigma^(-1) (x - mu))
Each component has a natural interpretation:
| Component | Symbol | Meaning |
|---|---|---|
| Normalizing constant | 1 / ((2*pi)^(d/2) * det(Sigma)^(1/2)) | Depends on dimension d and determinant of covariance |
| Mean vector | mu (d-dimensional) | Center of the distribution in d-dimensional space |
| Covariance matrix | Sigma (d x d) | Encodes variances (diagonal) and correlations (off-diagonal) |
| Mahalanobis distance | (x-mu)^T Sigma^(-1) (x-mu) | Quadratic form measuring distance from mean, scaled by covariance |
Covariance Matrix Requirements
- Symmetric: Sigma = Sigma^T
- Positive semi-definite: x^T Sigma x >= 0 for all x
Geometric Interpretation
The contours of constant probability density form ellipsoids in d-dimensional space:
- Principal axes are the eigenvectors of Sigma
- Axis lengths are proportional to sqrt(lambda_i), where lambda_i are the eigenvalues of Sigma
- Spherical when Sigma = sigma^2 * I (isotropic)
- Axis-aligned ellipse when Sigma is diagonal
- Rotated ellipse when off-diagonal elements are non-zero
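The eigenstructure described above is easy to inspect directly. A small numpy sketch (my own example values): `eigh` returns the principal axes as eigenvector columns, and the contour axis lengths scale with the square roots of the eigenvalues.

```python
import numpy as np

Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])   # non-zero off-diagonal -> rotated ellipse

eigvals, eigvecs = np.linalg.eigh(Sigma)   # columns of eigvecs = principal axes
axis_lengths = np.sqrt(eigvals)            # contour axes scale with sqrt(lambda_i)

# Isotropic case: equal eigenvalues, so contours are circles.
iso_vals = np.linalg.eigvalsh(0.5 * np.eye(2))
print(axis_lengths)
print(iso_vals)
```

The eigendecomposition also reconstructs Sigma exactly, which is the basis of the geometric picture.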
Interactive 3D Visualization: Explore the multivariate Gaussian in 3D. Rotate the view, adjust mean, variance, and correlation to see how the distribution shape changes.
Implementation
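One possible implementation of the multivariate log-density (the Cholesky-based approach and the name `mvn_logpdf` are my choices, not mandated by the text): it combines the normalizing constant, log-determinant, and Mahalanobis distance from the table above.

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma) via a Cholesky factorization:
    -d/2 log(2 pi) - 1/2 log det(Sigma) - 1/2 Mahalanobis distance."""
    d = mu.shape[0]
    L = np.linalg.cholesky(Sigma)        # Sigma = L @ L.T
    z = np.linalg.solve(L, x - mu)       # z = L^{-1} (x - mu)
    maha = z @ z                         # (x - mu)^T Sigma^{-1} (x - mu)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2 * np.pi) + log_det + maha)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
x = np.array([0.3, -0.1])
val = mvn_logpdf(x, mu, Sigma)

# Cross-check against the direct (less numerically stable) formula.
diff = x - mu
direct = -0.5 * (2 * np.log(2 * np.pi)
                 + np.log(np.linalg.det(Sigma))
                 + diff @ np.linalg.inv(Sigma) @ diff)
print(val, direct)
```

The Cholesky route avoids explicitly inverting Sigma, which matters in high dimensions.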
Interactive 2D Contours: Adjust correlation and variances to see how the elliptical contours change. In diffusion models with isotropic noise, contours are circles.
Key Properties of Gaussians
Property 1: Linear Transformations
If x ~ N(mu, Sigma) and y = A x + b, then:

y ~ N(A mu + b, A Sigma A^T)
Diffusion Application
Each forward step scales a Gaussian variable and adds independent Gaussian noise, so given x_0 the noised sample x_t remains Gaussian at every step.
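The closure property can be checked by Monte Carlo (example matrices are mine): transform Gaussian samples by y = A x + b and compare the empirical mean and covariance with A mu + b and A Sigma A^T.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([0.5, 0.5])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b                      # y = A x + b, applied row-wise

emp_mean = y.mean(axis=0)
emp_cov = np.cov(y.T)
print(emp_mean, A @ mu + b)          # empirical vs predicted mean
print(emp_cov)
print(A @ Sigma @ A.T)               # predicted covariance
```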
Property 2: Marginals and Conditionals
For a joint Gaussian over (x_a, x_b) with mean (mu_a, mu_b) and covariance blocks Sigma_aa, Sigma_ab, Sigma_bb:
- Marginals: x_a ~ N(mu_a, Sigma_aa)
- Conditionals: x_a | x_b ~ N(mu_{a|b}, Sigma_{a|b}), where:
  - mu_{a|b} = mu_a + Sigma_ab Sigma_bb^(-1) (x_b - mu_b)
  - Sigma_{a|b} = Sigma_aa - Sigma_ab Sigma_bb^(-1) Sigma_ba
The Key Insight for Diffusion
The reverse posterior q(x_{t-1} | x_t, x_0) is a Gaussian conditional of a joint Gaussian, which is why it admits a closed-form mean and variance.
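The conditioning formulas can be wrapped in a small helper (the function name and example numbers are mine). Conditioning shifts the mean by the correlation-weighted surprise in the observed block and shrinks the variance:

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_a, idx_b, y_b):
    """Distribution of x[idx_a] given x[idx_b] = y_b, for x ~ N(mu, Sigma)."""
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    gain = S_ab @ np.linalg.inv(S_bb)             # Sigma_ab Sigma_bb^{-1}
    cond_mu = mu[idx_a] + gain @ (y_b - mu[idx_b])
    cond_Sigma = S_aa - gain @ S_ab.T
    return cond_mu, cond_Sigma

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])
m, S = condition_gaussian(mu, Sigma, [0], [1], np.array([2.0]))
print(m, S)   # mean shifted toward the observation, variance reduced
```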
Property 3: Products of Gaussians
The product of two Gaussian PDFs is proportional to another Gaussian:

N(x; mu_1, sigma_1^2) * N(x; mu_2, sigma_2^2) is proportional to N(x; mu, sigma^2)

where the precision-weighted combination is:
- 1/sigma^2 = 1/sigma_1^2 + 1/sigma_2^2 (precisions add)
- mu = sigma^2 * (mu_1/sigma_1^2 + mu_2/sigma_2^2) (means weighted by precision)
Precision = Confidence
The precision 1/sigma^2 acts as a confidence weight: the factor with the smaller variance pulls the combined mean more strongly toward its own mean.
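A sketch of the precision-weighted combination (helper names are mine), plus a numerical check that the pointwise product really is a constant multiple of the combined Gaussian:

```python
import numpy as np

def gaussian_product(mu1, var1, mu2, var2):
    """Mean and variance of the Gaussian proportional to N(mu1, var1) * N(mu2, var2)."""
    prec = 1.0 / var1 + 1.0 / var2           # precisions add
    var = 1.0 / prec
    mu = var * (mu1 / var1 + mu2 / var2)     # precision-weighted mean
    return mu, var

mu, var = gaussian_product(0.0, 1.0, 4.0, 1.0)

def pdf(x, m, v):
    return np.exp(-((x - m) ** 2) / (2 * v)) / np.sqrt(2 * np.pi * v)

# Ratio of the product to the combined Gaussian: constant in x.
xs = np.linspace(-2.0, 6.0, 9)
ratio = pdf(xs, 0.0, 1.0) * pdf(xs, 4.0, 1.0) / pdf(xs, mu, var)
print(mu, var)
print(ratio)
```

With equal variances, the combined mean sits halfway between the two and the variance is halved, matching the "precision = confidence" reading.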
Property 4: Sum of Independent Gaussians
If X ~ N(mu_1, sigma_1^2) and Y ~ N(mu_2, sigma_2^2) are independent:

X + Y ~ N(mu_1 + mu_2, sigma_1^2 + sigma_2^2)
Forward Process Derivation
This property is what collapses many forward diffusion steps into one: repeatedly scaling and adding independent Gaussian noise yields the closed-form marginal q(x_t | x_0).
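As an illustration of the collapse (schedule values beta1, beta2 are my own toy choices), two noising steps of the form "scale by sqrt(1-beta), add sqrt(beta) noise" produce the same Gaussian as a single step with the product coefficient:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
x0 = 2.0                       # a fixed starting point
beta1, beta2 = 0.1, 0.2
a1, a2 = 1 - beta1, 1 - beta2

# Two sequential noising steps with independent Gaussian noise.
x1 = np.sqrt(a1) * x0 + np.sqrt(beta1) * rng.normal(size=n)
x2 = np.sqrt(a2) * x1 + np.sqrt(beta2) * rng.normal(size=n)

# Closure under linear maps + independent sums predicts one Gaussian:
pred_mean = np.sqrt(a1 * a2) * x0      # coefficients multiply
pred_var = 1 - a1 * a2                 # variances combine to 1 - prod(alpha)
print(x2.mean(), pred_mean)
print(x2.var(), pred_var)
```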
The Reparameterization Trick
One of the most important techniques in modern generative modeling is the reparameterization trick. The problem it solves: how do we backpropagate gradients through random sampling?
The Problem
Suppose we want to optimize the parameters theta of a distribution p_theta(z) with respect to some loss L. We need:

grad_theta E_{z ~ p_theta}[L(z)]
But sampling is a non-differentiable operation! The gradient cannot flow through the sampling step.
The Solution
For Gaussians, we can reparameterize the sampling:
- Instead of sampling z ~ N(mu_theta, sigma_theta^2) directly...
- Sample eps ~ N(0, 1) (independent of the parameters)
- Compute z = mu_theta + sigma_theta * eps (deterministic given eps)
Now the expectation becomes:

grad_theta E_{z ~ p_theta}[L(z)] = E_{eps ~ N(0,1)}[grad_theta L(mu_theta + sigma_theta * eps)]

And we can differentiate through mu_theta and sigma_theta!
Multivariate Case
For z ~ N(mu, Sigma), sample eps ~ N(0, I) and compute z = mu + L eps, where L is the Cholesky factor satisfying L L^T = Sigma.
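A minimal numpy sketch of reparameterized gradients (the quadratic loss L(z) = z^2 is my own toy example): because z = mu + sigma * eps is deterministic given eps, the gradient of the expected loss can be estimated by averaging per-sample gradients over eps.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.5, 0.7
eps = rng.normal(size=1_000_000)      # randomness, independent of parameters
z = mu + sigma * eps                  # deterministic transform of eps

# Toy loss L(z) = z^2, so E[L] = mu^2 + sigma^2 analytically.
# Reparameterized gradients: dL/dmu = 2z * dz/dmu = 2z,
#                            dL/dsigma = 2z * dz/dsigma = 2z * eps.
grad_mu = np.mean(2 * z)              # true value: 2 * mu
grad_sigma = np.mean(2 * z * eps)     # true value: 2 * sigma
print(grad_mu, 2 * mu)
print(grad_sigma, 2 * sigma)
```

The same estimator is what autodiff frameworks compute when z is built as mu + sigma * eps inside the computation graph.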
Bonus: The Central Limit Theorem
The CLT explains why Gaussians appear everywhere. If X_1, ..., X_n are independent and identically distributed (i.i.d.) with mean mu and variance sigma^2, then as n -> infinity:

sqrt(n) * (X_bar_n - mu) -> N(0, sigma^2) in distribution
Interactive CLT Demonstration: Start with any distribution (uniform, exponential, etc.) and see how the sum converges to a Gaussian.
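The demonstration above can be reproduced in a few lines (the exponential starting distribution is one possible choice): even from a heavily skewed distribution, the standardized sample mean looks Gaussian.

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 500, 20_000

# Start from a very non-Gaussian distribution: exponential with mean 1, var 1.
samples = rng.exponential(scale=1.0, size=(trials, n))

# Standardized sums: sqrt(n) * (sample mean - mu). CLT predicts ~ N(0, sigma^2).
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0)
print(z.mean(), z.std())     # close to 0 and 1
print(np.mean(z < 0))        # close to 0.5 despite the skewed start
```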
Gaussians in Diffusion Models
Let's connect everything to diffusion models explicitly:
| Diffusion Equation | Gaussian Form | What It Means |
|---|---|---|
| q(x_t \| x_{t-1}) | N(sqrt(1-beta_t) x_{t-1}, beta_t I) | Single step of adding noise |
| q(x_t \| x_0) | N(sqrt(alpha_bar_t) x_0, (1-alpha_bar_t) I) | Marginal at any timestep t |
| q(x_{t-1} \| x_t, x_0) | N(mu_tilde_t, sigma_tilde_t^2 I) | True reverse posterior (tractable!) |
| p_theta(x_{t-1} \| x_t) | N(mu_theta(x_t, t), sigma_t^2 I) | Learned reverse process |
Why This Matters
- Forward process is fully determined: We can compute q(x_t | x_0) in closed form for any t, enabling efficient training.
- Reverse posterior is tractable: Given x_0, we know the exact q(x_{t-1} | x_t, x_0). This becomes our training target.
- KL divergences have closed forms: Between two Gaussians, the KL divergence is analytic, enabling efficient ELBO computation.
- Reparameterization enables training: We sample x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) * eps and train a network eps_theta(x_t, t) to predict eps.
The Epsilon Prediction Target
Because x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) * eps, predicting the noise eps is equivalent (given x_t and t) to predicting x_0, and the training loss reduces to a simple MSE on eps.
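The closed-form forward sampling can be sketched directly (the linear schedule endpoints and T = 100 are illustrative choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 100
betas = np.linspace(1e-4, 0.02, T)       # a linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) in closed form using alpha_bar_t."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

x0 = rng.normal(loc=3.0, scale=0.1, size=100_000)   # toy "data" distribution
eps = rng.normal(size=x0.shape)
xT = q_sample(x0, T - 1, eps)
# As alpha_bar_t shrinks, the mean is scaled toward 0 and variance toward 1.
print(xT.mean(), xT.std())
```

Training then pairs each (x_t, t) with the eps that produced it, which is the epsilon-prediction target.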
Summary
In this section, we developed a deep understanding of Gaussian distributions:
- Univariate Gaussian: The PDF has a normalizing constant and exponential of quadratic deviation from the mean. Log-probability reveals the connection to squared error loss.
- Multivariate Gaussian: Extends to vectors with mean vectors and covariance matrices. Contours form ellipsoids whose axes are determined by eigenvectors.
- Key Properties: Closure under linear transformations, tractable marginals and conditionals, products combine via precision weighting, sums of independent Gaussians are Gaussian.
- Reparameterization Trick: Enables gradient-based learning by separating randomness (eps) from parameters (mu_theta, sigma_theta).
- Diffusion Connection: Both forward and reverse processes are defined by Gaussians, making the entire framework analytically tractable.
Exercises
Conceptual Questions
- Explain why the Gaussian distribution maximizes entropy among all distributions with fixed mean and variance. What does this imply for modeling uncertainty?
- Given a 2D Gaussian with correlation rho, describe qualitatively what the contours look like. What happens as rho approaches +1 or -1?
- Why does the reparameterization trick work for Gaussians but not for all distributions (e.g., discrete distributions)?
- In the diffusion forward process, why do we use sqrt(1 - beta_t) and sqrt(beta_t) as coefficients rather than arbitrary functions?
Computational Exercises
- Implement a function that computes the KL divergence between two univariate Gaussians N(mu_1, sigma_1^2) and N(mu_2, sigma_2^2) using the closed-form formula. Verify empirically using Monte Carlo sampling.
- Given a bivariate Gaussian with a specified mean vector and covariance matrix, compute the conditional distribution of the first component given the second. Sample from both the joint and the conditional to verify.
- Implement the reparameterization trick for a VAE encoder that outputs mu and log sigma^2. Verify that gradients flow correctly by computing the gradient of a simple loss with respect to the encoder outputs.
- Simulate the diffusion forward process for a 1D distribution. Start with samples from a mixture of Gaussians, apply the forward process for T=100 steps with a linear noise schedule, and visualize how the distribution evolves toward N(0, 1).
Challenge Problem
Derive the posterior q(x_{t-1} | x_t, x_0) for the diffusion process:
- Start with q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I) and q(x_{t-1} | x_0) = N(sqrt(alpha_bar_{t-1}) x_0, (1 - alpha_bar_{t-1}) I)
- Use Bayes' theorem: q(x_{t-1} | x_t, x_0) = q(x_t | x_{t-1}) * q(x_{t-1} | x_0) / q(x_t | x_0), noting the Markov property q(x_t | x_{t-1}, x_0) = q(x_t | x_{t-1})
- Show that the product of two Gaussians gives a Gaussian with specific mean and variance
- Verify your result matches the formula in the DDPM paper