
Gaussian Distributions Deep Dive


Learning Objectives

By the end of this section, you will:

  1. Master the univariate Gaussian PDF and understand every term in the formula
  2. Extend to multivariate Gaussians with covariance matrices and understand their geometric interpretation
  3. Apply key properties including closure under linear transformations and conditioning
  4. Implement the reparameterization trick that enables gradient-based learning through sampling
  5. Connect every concept to diffusion models where Gaussians define both forward and reverse processes

Why Gaussians Dominate Diffusion

Diffusion models are built almost entirely on Gaussian distributions. The forward process adds Gaussian noise: $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$. The reverse process predicts Gaussian parameters: $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$. Understanding Gaussians deeply is not optional—it's essential.

The Big Picture

The Gaussian (or Normal) distribution is the most important distribution in all of statistics and machine learning. It was first studied by Abraham de Moivre in 1733 and later developed extensively by Carl Friedrich Gauss, who used it to analyze astronomical data.

Why is the Gaussian so ubiquitous? There are three fundamental reasons:

  1. Central Limit Theorem: The sum of many independent random variables converges to a Gaussian, regardless of their individual distributions. Real-world measurements often arise from many small, independent effects—hence the "normal" name.
  2. Maximum Entropy: Among all distributions with a given mean and variance, the Gaussian has maximum entropy. It makes the fewest assumptions beyond these constraints, making it the "least informative" choice.
  3. Mathematical Convenience: Gaussians are closed under linear operations. Sums of Gaussians are Gaussian. Products of Gaussians are (proportional to) Gaussians. Conditionals and marginals of joint Gaussians are also Gaussian.

Connection to Physics

The Gaussian arises naturally in diffusion physics. When particles undergo Brownian motion (random collisions), their positions after time t follow a Gaussian distribution. This physical process inspired the name "diffusion models"—we're simulating information diffusing into noise.

The Univariate Gaussian

The Probability Density Function

A random variable $X$ follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, written $X \sim \mathcal{N}(\mu, \sigma^2)$, if its probability density function (PDF) is:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Let's understand each component:

| Symbol | Name | Role | Diffusion Connection |
|---|---|---|---|
| $1/\sqrt{2\pi\sigma^2}$ | Normalizing constant | Ensures the PDF integrates to 1 | Appears in the ELBO derivation |
| $\mu$ | Mean | Center of the distribution | In diffusion: $\sqrt{\bar{\alpha}_t}\,x_0$ |
| $\sigma^2$ | Variance | Spread/width of the distribution | In diffusion: $(1 - \bar{\alpha}_t)$ |
| $(x - \mu)^2$ | Squared deviation | Distance from the mean | Measures noise magnitude |
| $\exp(-\cdot)$ | Exponential decay | Tails decay rapidly | Enables tractable integrals |

Standard Normal

When $\mu = 0$ and $\sigma^2 = 1$, we get the standard normal $\mathcal{N}(0, 1)$. Any Gaussian can be transformed to a standard normal via $Z = (X - \mu)/\sigma$. This is called standardization.
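Standardization is easy to check empirically. A minimal sketch (the parameter values here are arbitrary):

```python
import torch

torch.manual_seed(0)

# Draw samples from N(mu, sigma^2), then standardize: Z = (X - mu) / sigma
mu, sigma = 3.0, 2.0
x = mu + sigma * torch.randn(100_000)
z = (x - mu) / sigma

# The standardized samples are approximately standard normal
print(z.mean().item(), z.std().item())  # close to 0 and 1
```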

Interactive Visualization: Explore how changing $\mu$ and $\sigma$ affects the Gaussian PDF. Notice how the area under the curve always equals 1.

Log-Probability

In practice, we almost always work with log-probabilities rather than probabilities to avoid numerical underflow. Taking the log of the Gaussian PDF:

$$\log p(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$$

The log-probability reveals the quadratic structure of the Gaussian. This is why minimizing squared error is equivalent to maximum likelihood estimation with Gaussian noise.

The MSE Connection

In diffusion training, we minimize $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$, the squared error between the true and predicted noise. Up to an additive constant and a scale factor, this is exactly the negative log-likelihood of a Gaussian with unit variance!
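The correspondence can be verified numerically: with unit variance, the Gaussian negative log-likelihood equals half the squared error plus the constant $\frac{1}{2}\log(2\pi)$. A small sketch (the values are arbitrary):

```python
import math

import torch
from torch.distributions import Normal

mu = torch.tensor(0.3)  # "prediction"
x = torch.tensor(1.2)   # "target"

# Gaussian negative log-likelihood with unit variance
nll = -Normal(mu, 1.0).log_prob(x)

# Half squared error plus the constant 0.5 * log(2*pi)
mse_form = 0.5 * (x - mu) ** 2 + 0.5 * math.log(2 * math.pi)

print(nll.item(), mse_form.item())  # identical up to floating point
```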

Multivariate Gaussian

Definition

For a random vector $\mathbf{x} \in \mathbb{R}^d$, the multivariate Gaussian with mean vector $\boldsymbol{\mu} \in \mathbb{R}^d$ and covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ has PDF:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$$

Each component has a natural interpretation:

| Component | Symbol | Meaning |
|---|---|---|
| Normalizing constant | $(2\pi)^{d/2}\lvert\boldsymbol{\Sigma}\rvert^{1/2}$ | Depends on the dimension $d$ and the determinant of the covariance |
| Mean vector | $\boldsymbol{\mu} \in \mathbb{R}^d$ | Center of the distribution in $d$-dimensional space |
| Covariance matrix | $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ | Encodes variances (diagonal) and correlations (off-diagonal) |
| Mahalanobis distance | $(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$ | Quadratic form measuring distance from the mean, scaled by the covariance |

Covariance Matrix Requirements

The covariance matrix $\boldsymbol{\Sigma}$ must be:
  • Symmetric: $\Sigma_{ij} = \Sigma_{ji}$
  • Positive semi-definite: $\mathbf{v}^\top \boldsymbol{\Sigma} \mathbf{v} \geq 0$ for all $\mathbf{v}$ (strictly positive definite for the PDF above, since the formula requires $\boldsymbol{\Sigma}^{-1}$ to exist)
These ensure the PDF is well-defined and integrates to 1.
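Both requirements can be checked numerically with an eigenvalue test. A sketch (the helper name is ours):

```python
import torch

def is_valid_covariance(sigma: torch.Tensor, tol: float = 1e-8) -> bool:
    """Check symmetry and positive semi-definiteness (illustrative helper)."""
    symmetric = torch.allclose(sigma, sigma.T, atol=tol)
    # eigvalsh assumes a symmetric matrix and returns real eigenvalues
    psd = bool((torch.linalg.eigvalsh(sigma) >= -tol).all())
    return symmetric and psd

good = torch.tensor([[1.0, 0.5], [0.5, 1.0]])
bad = torch.tensor([[1.0, 2.0], [2.0, 1.0]])  # symmetric but indefinite

print(is_valid_covariance(good), is_valid_covariance(bad))  # True False
```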

Geometric Interpretation

The contours of constant probability density form ellipsoids in d-dimensional space:

  • Principal axes are the eigenvectors of $\boldsymbol{\Sigma}$
  • Axis lengths are proportional to $\sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues
  • Spherical when $\boldsymbol{\Sigma} = \sigma^2 I$ (isotropic)
  • Axis-aligned ellipse when $\boldsymbol{\Sigma}$ is diagonal
  • Rotated ellipse when off-diagonal elements are non-zero
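The ellipse geometry falls directly out of an eigendecomposition. A minimal sketch with an arbitrary covariance matrix:

```python
import torch

sigma = torch.tensor([[2.0, 1.2],
                      [1.2, 1.0]])

# Principal axes are the eigenvectors; half-axis lengths scale with the
# square roots of the eigenvalues.
eigvals, eigvecs = torch.linalg.eigh(sigma)  # eigenvalues in ascending order

print(eigvals)              # 0.2 and 2.8 for this matrix
print(torch.sqrt(eigvals))  # proportional axis lengths
```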

Interactive 3D Visualization: Explore the multivariate Gaussian in 3D. Rotate the view, adjust mean, variance, and correlation to see how the distribution shape changes.

Implementation

Multivariate Gaussian in PyTorch
PyTorch's `MultivariateNormal` distribution class mirrors the mathematical notation: it takes a mean vector and a covariance matrix and provides sampling and log-probability evaluation. The mean vector defines the center of the distribution (in diffusion, the forward process pushes it toward zero), and the covariance matrix captures variances (diagonal) and correlations (off-diagonal); a diagonal covariance means independent dimensions.

```python
import torch
from torch.distributions import MultivariateNormal

# Define parameters: the covariance matrix must be positive semi-definite
mean = torch.tensor([0.0, 0.0])
covariance = torch.tensor([
    [1.0, 0.5],
    [0.5, 1.0]
])

# Create the distribution object. In diffusion, we create many such
# distributions for different timesteps.
mvn = MultivariateNormal(mean, covariance)

# Sample from the distribution. In diffusion training, we sample noise
# epsilon ~ N(0, I) millions of times.
samples = mvn.sample((1000,))  # Shape: (1000, 2)

# Compute log p(x) -- essential for objectives that minimize negative
# log-likelihood.
point = torch.tensor([0.5, 0.5])
log_prob = mvn.log_prob(point)

# Reparameterization: instead of sampling z ~ N(mu, Sigma) directly,
# sample epsilon ~ N(0, I) and compute z = mean + L @ epsilon, where
# L @ L.T = covariance. Same distribution as mvn.sample(), but
# gradients can flow through mean and L.
epsilon = torch.randn(2)
L = torch.linalg.cholesky(covariance)
z = mean + L @ epsilon
```

Interactive 2D Contours: Adjust correlation and variances to see how the elliptical contours change. In diffusion models with isotropic noise, contours are circles.


Key Properties of Gaussians

Property 1: Linear Transformations

If $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $\mathbf{y} = A\mathbf{x} + \mathbf{b}$, then:

$$\mathbf{y} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b}, A\boldsymbol{\Sigma}A^\top)$$

Diffusion Application

The forward process $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ is exactly a linear transformation! Starting from $x_0$ (treated as fixed) and $\epsilon \sim \mathcal{N}(0, I)$, the result $x_t$ is Gaussian with mean $\sqrt{\bar{\alpha}_t}\,x_0$ and variance $(1-\bar{\alpha}_t)I$.
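The linear-transformation property is easy to verify by Monte Carlo. A sketch with arbitrary choices of $A$, $\mathbf{b}$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$:

```python
import torch

torch.manual_seed(0)

mu = torch.tensor([1.0, -1.0])
Sigma = torch.tensor([[1.0, 0.3], [0.3, 2.0]])
A = torch.tensor([[2.0, 0.0], [1.0, 1.0]])
b = torch.tensor([0.5, 0.5])

# Sample x ~ N(mu, Sigma) via the Cholesky factor, then transform
L = torch.linalg.cholesky(Sigma)
x = mu + torch.randn(200_000, 2) @ L.T
y = x @ A.T + b

print(y.mean(dim=0))   # close to A @ mu + b
print(torch.cov(y.T))  # close to A @ Sigma @ A.T
```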

Property 2: Marginals and Conditionals

For a joint Gaussian

$$\begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}\right):$$

  • Marginals: $\mathbf{x}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$
  • Conditionals: $\mathbf{x}_1 \mid \mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2})$ where:
    • $\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$
    • $\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$
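The conditioning formulas translate directly into code. A sketch for a bivariate case with correlation 0.5 (all values arbitrary):

```python
import torch

# Blocks of the joint distribution over (x1, x2)
mu1 = torch.tensor([0.0])
mu2 = torch.tensor([0.0])
S11 = torch.tensor([[1.0]])
S12 = torch.tensor([[0.5]])
S22 = torch.tensor([[1.0]])

v = torch.tensor([1.0])  # observed value of x2

# Apply the conditional mean and covariance formulas
S22_inv = torch.linalg.inv(S22)
mu_cond = mu1 + S12 @ S22_inv @ (v - mu2)
S_cond = S11 - S12 @ S22_inv @ S12.T

print(mu_cond.item(), S_cond.item())  # 0.5 and 0.75
```

Note how observing $x_2$ both shifts the mean toward the observation and shrinks the variance (from 1 to 0.75).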

The Key Insight for Diffusion

The true reverse conditional $q(x_{t-1} \mid x_t, x_0)$ is Gaussian with a closed-form mean and variance! This is possible because, conditioned on $x_0$, the pair $(x_{t-1}, x_t)$ is jointly Gaussian. The neural network learns to approximate this conditional.

Property 3: Products of Gaussians

The product of two Gaussian PDFs is proportional to another Gaussian:

$$\mathcal{N}(x; \mu_1, \sigma_1^2) \cdot \mathcal{N}(x; \mu_2, \sigma_2^2) \propto \mathcal{N}(x; \mu_*, \sigma_*^2)$$

where the precision-weighted combination is:

  • $\sigma_*^{-2} = \sigma_1^{-2} + \sigma_2^{-2}$ (precisions add)
  • $\mu_* = \sigma_*^2(\sigma_1^{-2}\mu_1 + \sigma_2^{-2}\mu_2)$ (weighted by precision)
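A small sketch of the precision-weighted combination (the helper name is ours):

```python
def product_of_gaussians(mu1, var1, mu2, var2):
    """Parameters of the (normalized) product of two univariate Gaussians."""
    prec = 1.0 / var1 + 1.0 / var2        # precisions add
    var = 1.0 / prec
    mu = var * (mu1 / var1 + mu2 / var2)  # precision-weighted mean
    return mu, var

# Equal variances: the combined mean is the simple average,
# and the combined variance is halved
mu_star, var_star = product_of_gaussians(0.0, 1.0, 2.0, 1.0)
print(mu_star, var_star)  # 1.0 and 0.5
```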

Precision = Confidence

Precision $\tau = 1/\sigma^2$ measures confidence in an estimate. When combining information, we weight by precision—more confident estimates get more weight. This is exactly how Bayesian updates work, and it appears in the derivation of $q(x_{t-1} \mid x_t, x_0)$.

Property 4: Sum of Independent Gaussians

If $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent:

$$X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$$

Forward Process Derivation

This property lets us derive the marginal $q(x_t \mid x_0)$ directly! Instead of composing $t$ noising steps one at a time, we can jump straight to timestep $t$ by accumulating the variances.
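This can be checked numerically: accumulating the per-step noise variance under $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon$ reproduces the closed-form $1 - \bar{\alpha}_t$. A sketch with an illustrative linear schedule (the schedule values are assumptions):

```python
import torch

T = 100
betas = torch.linspace(1e-4, 0.02, T)  # illustrative linear schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

# Recursion for the noise variance: v_t = alpha_t * v_{t-1} + beta_t
# (the old noise is scaled by alpha_t, and fresh noise with variance
# beta_t is added independently, so the variances sum)
var = torch.tensor(0.0)
for alpha, beta in zip(alphas, betas):
    var = alpha * var + beta

print(var.item(), (1 - alpha_bar[-1]).item())  # the two agree
```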

The Reparameterization Trick

One of the most important techniques in modern generative modeling is the reparameterization trick. The problem it solves: how do we backpropagate gradients through random sampling?

The Problem

Suppose we want to optimize parameters $\theta$ of a distribution $p_\theta(z)$ with respect to some loss $\mathcal{L}(z)$. We need:

$$\nabla_\theta\, \mathbb{E}_{z \sim p_\theta}[\mathcal{L}(z)]$$

But sampling $z \sim p_\theta$ is a stochastic, non-differentiable operation! The gradient cannot flow through the sampling step.

The Solution

For Gaussians, we can reparameterize the sampling:

  1. Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly...
  2. Sample $\epsilon \sim \mathcal{N}(0, 1)$ (independent of the parameters)
  3. Compute $z = \mu + \sigma \cdot \epsilon$ (deterministic given $\epsilon$)

Now the expectation becomes:

$$\mathbb{E}_{z \sim p_\theta}[\mathcal{L}(z)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}[\mathcal{L}(\mu + \sigma \cdot \epsilon)]$$

And we can differentiate through $z = \mu + \sigma \cdot \epsilon$!

Reparameterization Trick Implementation
In a VAE or diffusion model, `mu` and `log_sigma` are outputs of neural networks, and we need gradients through them. We parameterize $\log\sigma$ rather than $\sigma$ so that $\sigma > 0$ holds without constraints—a common trick in variational autoencoders.

```python
import torch

# Parameters from a neural network (require gradients)
mu = torch.tensor([1.0], requires_grad=True)
log_sigma = torch.tensor([0.0], requires_grad=True)  # log for positivity
sigma = torch.exp(log_sigma)

# Sample epsilon ~ N(0, 1): the source of randomness. Crucially,
# epsilon does not depend on mu or sigma.
epsilon = torch.randn_like(mu)

# Reparameterize: z = mu + sigma * epsilon. Since epsilon is fixed
# during backprop, gradients flow through mu and sigma:
# dL/d(mu) = dL/dz * dz/d(mu) = dL/dz * 1
z = mu + sigma * epsilon

# Compute some loss and backpropagate
loss = (z ** 2).sum()  # Example loss
loss.backward()

# Gradients now flow through mu and sigma!
print(f"Gradient w.r.t. mu: {mu.grad}")
print(f"Gradient w.r.t. log_sigma: {log_sigma.grad}")
```

Without reparameterization, we cannot backpropagate through sampling; this trick is essential for both VAEs and diffusion models.

Multivariate Case

For multivariate Gaussians, use the Cholesky decomposition: $\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}$, where $LL^\top = \boldsymbol{\Sigma}$ and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$. In diffusion models with isotropic noise ($\Sigma = \sigma^2 I$), this simplifies to $\mathbf{z} = \boldsymbol{\mu} + \sigma\boldsymbol{\epsilon}$.
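The multivariate version can be checked by comparing sample statistics against the target parameters. A sketch with an arbitrary covariance:

```python
import torch

torch.manual_seed(0)

mu = torch.tensor([1.0, -2.0])
Sigma = torch.tensor([[2.0, 0.8], [0.8, 1.0]])
L = torch.linalg.cholesky(Sigma)  # L @ L.T == Sigma

# z = mu + L @ epsilon, batched: each row of eps is one epsilon ~ N(0, I)
eps = torch.randn(200_000, 2)
z = mu + eps @ L.T

print(z.mean(dim=0))   # close to mu
print(torch.cov(z.T))  # close to Sigma
```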

Bonus: The Central Limit Theorem

The CLT explains why Gaussians appear everywhere. If $X_1, X_2, \ldots, X_n$ are independent and identically distributed (i.i.d.) with mean $\mu$ and variance $\sigma^2$, then as $n \to \infty$:

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)$$

Interactive CLT Demonstration: Start with any distribution (uniform, exponential, etc.) and see how the sum converges to a Gaussian.
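The CLT can also be sketched numerically: standardized means of uniform samples behave like a standard normal. (Uniform(0, 1) has mean $1/2$ and variance $1/12$; the sample sizes below are arbitrary.)

```python
import torch

torch.manual_seed(0)

n, trials = 500, 20_000
samples = torch.rand(trials, n)  # Uniform(0, 1) draws
means = samples.mean(dim=1)

# Standardize the sample means: (mean - mu) / (sigma / sqrt(n))
z = (means - 0.5) / ((1.0 / 12) ** 0.5 / n ** 0.5)

print(z.mean().item(), z.std().item())  # close to 0 and 1
```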


Gaussians in Diffusion Models

Let's connect everything to diffusion models explicitly:

| Diffusion Equation | Gaussian Form | What It Means |
|---|---|---|
| $q(x_t \mid x_{t-1})$ | $\mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$ | Single step of adding noise |
| $q(x_t \mid x_0)$ | $\mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t) I)$ | Marginal at any timestep $t$ |
| $q(x_{t-1} \mid x_t, x_0)$ | $\mathcal{N}(\tilde{\mu}_t, \tilde{\sigma}_t^2 I)$ | True reverse posterior (tractable!) |
| $p_\theta(x_{t-1} \mid x_t)$ | $\mathcal{N}(\mu_\theta(x_t, t), \sigma_t^2 I)$ | Learned reverse process |

Why This Matters

  1. Forward process is fully determined: We can compute $q(x_t \mid x_0)$ in closed form for any $t$, enabling efficient training.
  2. Reverse posterior is tractable: Given $x_0$, we know the exact $q(x_{t-1} \mid x_t, x_0)$. This becomes our training target.
  3. KL divergences have closed forms: Between two Gaussians, the KL divergence is analytic, enabling efficient ELBO computation.
  4. Reparameterization enables training: We sample $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and train to predict $\epsilon$.

The Epsilon Prediction Target

Instead of predicting the mean $\mu_\theta$ directly, DDPM trains the network to predict the noise $\epsilon$. Since $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, knowing $\epsilon$ lets us recover everything else. This noise-prediction parameterization leads to better empirical results.
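The inversion is a one-liner to verify: given $x_t$, $\bar{\alpha}_t$, and the true $\epsilon$, we recover $x_0$ exactly. A minimal sketch (the value of $\bar{\alpha}_t$ is arbitrary):

```python
import torch

torch.manual_seed(0)

alpha_bar = torch.tensor(0.4)  # arbitrary cumulative signal level
x0 = torch.randn(5)
eps = torch.randn(5)

# Forward: x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps
xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * eps

# Invert: x_0 = (x_t - sqrt(1 - abar) * eps) / sqrt(abar)
x0_recovered = (xt - torch.sqrt(1 - alpha_bar) * eps) / torch.sqrt(alpha_bar)

print(torch.allclose(x0, x0_recovered, atol=1e-5))  # True
```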

Summary

In this section, we developed a deep understanding of Gaussian distributions:

  1. Univariate Gaussian: The PDF has a normalizing constant and exponential of quadratic deviation from the mean. Log-probability reveals the connection to squared error loss.
  2. Multivariate Gaussian: Extends to vectors with mean vectors and covariance matrices. Contours form ellipsoids whose axes are determined by eigenvectors.
  3. Key Properties: Closure under linear transformations, tractable marginals and conditionals, products combine via precision weighting, sums of independent Gaussians are Gaussian.
  4. Reparameterization Trick: Enables gradient-based learning by separating randomness ($\epsilon$) from parameters ($\mu, \sigma$).
  5. Diffusion Connection: Both forward and reverse processes are defined by Gaussians, making the entire framework analytically tractable.

Exercises

Conceptual Questions

  1. Explain why the Gaussian distribution maximizes entropy among all distributions with fixed mean and variance. What does this imply for modeling uncertainty?
  2. Given a 2D Gaussian with correlation $\rho = 0.9$, describe qualitatively what the contours look like. What happens as $\rho \to 1$?
  3. Why does the reparameterization trick work for Gaussians but not for all distributions (e.g., discrete distributions)?
  4. In the diffusion forward process, why do we use $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1-\bar{\alpha}_t}$ as coefficients rather than arbitrary functions?

Computational Exercises

  1. Implement a function that computes the KL divergence between two univariate Gaussians $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$ using the closed-form formula. Verify empirically using Monte Carlo sampling.
  2. Given a bivariate Gaussian with $\mu = [0, 0]$ and $\Sigma = \begin{bmatrix} 1 & 0.7 \\ 0.7 & 1 \end{bmatrix}$, compute the conditional distribution $p(x_1 \mid x_2 = 1.5)$. Sample from both the joint and the conditional to verify.
  3. Implement the reparameterization trick for a VAE encoder that outputs $\mu$ and $\log\sigma$. Verify that gradients flow correctly by computing $\partial L/\partial \mu$ for a simple loss.
  4. Simulate the diffusion forward process for a 1D distribution. Start with samples from a mixture of Gaussians, apply the forward process for $T = 100$ steps with a linear noise schedule, and visualize how the distribution evolves toward $\mathcal{N}(0, 1)$.

Challenge Problem

Derive the posterior $q(x_{t-1} \mid x_t, x_0)$ for the diffusion process:

  1. Start with $q(x_t \mid x_{t-1})$ and $q(x_{t-1} \mid x_0)$
  2. Use Bayes' theorem: $q(x_{t-1} \mid x_t, x_0) \propto q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)$
  3. Show that the product of two Gaussians gives a Gaussian with specific mean and variance
  4. Verify your result matches the formula in the DDPM paper