Chapter 2

Properties of the Forward Process

The Forward Diffusion Process

Learning Objectives

By the end of this section, you will be able to:

  1. Prove that the forward process is variance-preserving when the data starts with unit variance
  2. Explain why $\mathbf{x}_T$ converges to $\mathcal{N}(\mathbf{0}, \mathbf{I})$
  3. Describe how information about $\mathbf{x}_0$ is destroyed over time
  4. Understand why the reverse process is tractable when conditioned on $\mathbf{x}_0$

Variance Preservation

A key property of the forward process is that it preserves variance when the input data has unit variance. This is crucial for numerical stability during training.

The Mathematical Proof

Starting from the reparameterization:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$$

If $\text{Var}[\mathbf{x}_0] = \sigma_0^2\mathbf{I}$ and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is independent of $\mathbf{x}_0$, then:

$$\text{Var}[\mathbf{x}_t] = \bar{\alpha}_t \cdot \text{Var}[\mathbf{x}_0] + (1-\bar{\alpha}_t) \cdot \text{Var}[\boldsymbol{\epsilon}] = \bar{\alpha}_t \sigma_0^2 \mathbf{I} + (1-\bar{\alpha}_t)\mathbf{I}$$

Special Case: Unit Variance Data

If $\sigma_0^2 = 1$ (data normalized to unit variance):

$$\text{Var}[\mathbf{x}_t] = \bar{\alpha}_t \mathbf{I} + (1-\bar{\alpha}_t)\mathbf{I} = \mathbf{I}$$

The variance is exactly preserved at every timestep! This is why diffusion models typically work with data normalized to have unit variance.
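This is easy to verify numerically. The sketch below simulates the closed-form marginal at several timesteps; the linear $\beta$ schedule endpoints ($10^{-4}$ to $0.02$ over $T = 1000$ steps) are illustrative assumptions, not values taken from this section:

```python
import numpy as np

# Assumed linear beta schedule (illustrative endpoints, T = 1000)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative product: abar_t

rng = np.random.default_rng(0)
x0 = rng.standard_normal(100_000)  # unit-variance "data"

for t in [0, 99, 499, 999]:
    eps = rng.standard_normal(x0.shape)
    # Closed-form marginal: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    # Empirical variance stays close to 1 at every timestep
    print(t, round(float(xt.var()), 3))
```

Running this prints a variance near 1.0 for every timestep, confirming the identity above for unit-variance inputs.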

Why This Matters

  • Numerical stability: Activations stay in a reasonable range
  • Consistent scale: The network sees similar magnitudes at all timesteps
  • Simple prior: The endpoint is standard Gaussian, easy to sample

Convergence to Gaussian Prior

As $t \to T$ (with sufficiently large $T$ and an appropriate schedule), we have $\bar{\alpha}_T \to 0$. This means:

$$q(\mathbf{x}_T \mid \mathbf{x}_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_T}\,\mathbf{x}_0,\ (1-\bar{\alpha}_T)\mathbf{I}\big) \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Quantifying Convergence

For typical schedules with $T = 1000$:

| Schedule | $\bar{\alpha}_T$ | Mean shift | Variance ratio |
|----------|------------------|------------|----------------|
| Linear   | ~6×10⁻⁶ | ~0.0025 × $\mathbf{x}_0$ | ~0.999994 |
| Cosine   | ~2×10⁻⁴ | ~0.014 × $\mathbf{x}_0$  | ~0.9998   |

In practice, this is close enough to $\mathcal{N}(\mathbf{0}, \mathbf{I})$ that the approximation introduces negligible error.
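One way to quantify "close enough" is the per-dimension KL divergence between $q(\mathbf{x}_T \mid \mathbf{x}_0)$ and $\mathcal{N}(0, 1)$. The helper below is a sketch that plugs in the $\bar{\alpha}_T$ values from the table (the function name and the choice of a unit-magnitude $x_0$ coordinate are assumptions of this example):

```python
import numpy as np

def convergence_stats(abar_T, x0=1.0):
    """Mean shift, variance, and per-dim KL( N(mean, var) || N(0, 1) )
    for q(x_T | x_0) with a unit-magnitude x_0 coordinate."""
    mean = np.sqrt(abar_T) * x0          # sqrt(abar_T) * x_0
    var = 1.0 - abar_T                   # 1 - abar_T
    # Standard Gaussian KL formula, per dimension
    kl = 0.5 * (var + mean**2 - 1.0 - np.log(var))
    return mean, var, kl

# abar_T values taken from the table above
for name, abar_T in [("linear", 6e-6), ("cosine", 2e-4)]:
    mean, var, kl = convergence_stats(abar_T)
    print(f"{name}: mean shift={mean:.4f}, variance={var:.6f}, KL/dim={kl:.1e}")
```

Both schedules give a KL of roughly $10^{-6}$ to $10^{-4}$ nats per dimension, which is why sampling $\mathbf{x}_T$ from the standard Gaussian prior works so well.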

Why This Matters: The generation process starts by sampling $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. This is only valid because $q(\mathbf{x}_T)$ is approximately standard Gaussian regardless of the data distribution $q(\mathbf{x}_0)$.

The Markov Property

The forward process satisfies the Markov property:

$$q(\mathbf{x}_t \mid \mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{t-1}) = q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$

Given $\mathbf{x}_{t-1}$, the distribution of $\mathbf{x}_t$ is independent of all earlier states.

Implications

  1. Factorization: The joint distribution factors as $q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$
  2. Local dependence: Each step only needs the previous state
  3. Enables ELBO: The variational bound derivation relies on this factorization
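The factorization can be checked empirically: running the chain one step at a time, $\mathbf{x}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon}$, produces the same distribution as sampling the closed-form marginal directly. A minimal sketch, assuming an illustrative linear schedule:

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.02, T)  # assumed schedule, for illustration
alpha_bars = np.cumprod(1 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(50_000)  # unit-variance data

# Step-by-step Markov chain: x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps
x = x0.copy()
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# One-shot sample from the closed-form marginal q(x_T | x_0)
x_direct = (np.sqrt(alpha_bars[-1]) * x0
            + np.sqrt(1 - alpha_bars[-1]) * rng.standard_normal(x0.shape))

# Both routes preserve unit variance and agree in distribution
print(round(float(x.var()), 3), round(float(x_direct.var()), 3))  # both ≈ 1
```

In training code this equivalence is what lets us jump straight to any timestep $t$ with a single Gaussian sample instead of iterating the chain.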

Non-Markov Extensions

Interestingly, the reverse process for generation does not need to be Markov. DDIM and other fast samplers exploit this by taking larger steps that skip intermediate states.

Information Destruction

The forward process systematically destroys information about the original data. This can be quantified using information-theoretic measures.

Mutual Information Decay

The mutual information between $\mathbf{x}_0$ and $\mathbf{x}_t$ decreases with $t$:

$$I(\mathbf{x}_0; \mathbf{x}_t) \leq I(\mathbf{x}_0; \mathbf{x}_{t-1})$$

This follows from the data processing inequality: processing data cannot increase the information it contains about the original.

Signal-to-Noise Ratio Decay

The SNR provides a more intuitive measure:

$$\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$$

  • $t = 0$: $\text{SNR} = \infty$ (pure signal)
  • $t = T$: $\text{SNR} \approx 0$ (pure noise)

The schedule controls how quickly SNR decays, and thus how quickly information is destroyed.
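The decay is easy to visualize numerically. This sketch evaluates $\text{SNR}(t)$ for an assumed linear $\beta$ schedule (the endpoints are illustrative):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alpha_bars = np.cumprod(1 - betas)

# SNR(t) = abar_t / (1 - abar_t)
snr = alpha_bars / (1 - alpha_bars)

# SNR spans many orders of magnitude from near-pure signal to near-pure noise
for t in [0, 249, 499, 749, 999]:
    print(t, f"{snr[t]:.3e}")
```

Because $\bar{\alpha}_t$ is a product of factors strictly less than 1, the SNR is strictly decreasing in $t$; choosing a different schedule reshapes this curve without changing its endpoints.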


Conditional Reversibility

While the forward process destroys information, a remarkable property enables us to reverse it: the posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is tractable.

The Posterior Distribution

Using Bayes' rule and the Gaussian forms:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\ \tilde{\beta}_t \mathbf{I}\big)$$

where the posterior mean and variance have closed forms (derived in Chapter 3):

$$\tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t$$

$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

The Key Insight: If we knew $\mathbf{x}_0$, we could perfectly reverse the forward process. Since we don't know $\mathbf{x}_0$ during generation, we train a neural network to predict it (or, equivalently, to predict the noise $\boldsymbol{\epsilon}$).
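A useful sanity check on these formulas: sampling $\mathbf{x}_t \sim q(\mathbf{x}_t \mid \mathbf{x}_0)$ and then $\mathbf{x}_{t-1} \sim q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ must, after marginalizing over $\mathbf{x}_t$, reproduce the closed-form marginal $q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)$. The sketch below checks this for one scalar coordinate; the schedule, the timestep $t = 500$, and the value $x_0 = 1.7$ are all arbitrary choices for illustration:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed schedule
alphas = 1 - betas
alpha_bars = np.cumprod(alphas)

t = 500  # arbitrary interior timestep (0-indexed)
ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
beta_t, alpha_t = betas[t], alphas[t]

rng = np.random.default_rng(0)
x0 = 1.7       # a fixed scalar "data point"
n = 200_000    # number of Monte Carlo samples

# Forward: x_t ~ q(x_t | x_0)
xt = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * rng.standard_normal(n)

# Posterior mean and variance from the closed forms above
mu = (np.sqrt(ab_prev) * beta_t / (1 - ab_t)) * x0 \
   + (np.sqrt(alpha_t) * (1 - ab_prev) / (1 - ab_t)) * xt
beta_tilde = (1 - ab_prev) / (1 - ab_t) * beta_t
xprev = mu + np.sqrt(beta_tilde) * rng.standard_normal(n)

# Marginalizing over x_t should recover q(x_{t-1} | x_0):
print(round(float(xprev.mean()), 3), round(np.sqrt(ab_prev) * x0, 3))
print(round(float(xprev.var()), 3),  round(1 - ab_prev, 3))
```

The empirical mean and variance of the recovered $x_{t-1}$ samples match $\sqrt{\bar{\alpha}_{t-1}}\,x_0$ and $1 - \bar{\alpha}_{t-1}$, confirming the posterior is consistent with the forward marginals.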

Key Takeaways

  1. Variance preservation: With unit-variance data, $\text{Var}[\mathbf{x}_t] = \mathbf{I}$ at all timesteps
  2. Convergence: $q(\mathbf{x}_T) \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$ regardless of the data distribution
  3. Markov: Each step depends only on the previous state
  4. Information destruction: SNR decreases monotonically
  5. Tractable posterior: $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is Gaussian

Looking Ahead: In the next section, we'll put everything together with a complete, production-ready implementation of the forward diffusion process.