Chapter 2

Properties of the Forward Process

The Forward Diffusion Process

Learning Objectives

By the end of this section, you will be able to:

  1. Prove that the forward process is variance-preserving when the data starts with unit variance
  2. Explain why $\mathbf{x}_T$ converges to $\mathcal{N}(\mathbf{0}, \mathbf{I})$
  3. Describe how information about $\mathbf{x}_0$ is destroyed over time
  4. Understand why the reverse process is tractable when conditioned on $\mathbf{x}_0$

Variance Preservation

A key property of the forward process is that it preserves variance when the input data has unit variance. This is crucial for numerical stability during training.

The Mathematical Proof

Starting from the reparameterization:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$$

If $\text{Var}[\mathbf{x}_0] = \sigma_0^2\mathbf{I}$ and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is independent of $\mathbf{x}_0$, then:

$$\text{Var}[\mathbf{x}_t] = \bar{\alpha}_t \cdot \text{Var}[\mathbf{x}_0] + (1-\bar{\alpha}_t) \cdot \text{Var}[\boldsymbol{\epsilon}] = \bar{\alpha}_t \sigma_0^2 \mathbf{I} + (1-\bar{\alpha}_t)\mathbf{I}$$

Special Case: Unit Variance Data

If $\sigma_0^2 = 1$ (data normalized to unit variance):

$$\text{Var}[\mathbf{x}_t] = \bar{\alpha}_t \mathbf{I} + (1-\bar{\alpha}_t)\mathbf{I} = \mathbf{I}$$

The variance is exactly preserved at every timestep! This is why diffusion models typically work with data normalized to have unit variance.
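This is easy to verify numerically. The sketch below simulates the closed-form marginal at several timesteps; the linear $\beta$ schedule endpoints ($10^{-4}$ to $0.02$ over $T = 1000$ steps) are illustrative assumptions, not values taken from this section:

```python
import numpy as np

# Assumed linear beta schedule (illustrative endpoints, T = 1000)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative product: abar_t

rng = np.random.default_rng(0)
x0 = rng.standard_normal(100_000)  # unit-variance "data"

for t in [0, 99, 499, 999]:
    eps = rng.standard_normal(x0.shape)
    # Closed-form marginal: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    # Empirical variance stays close to 1 at every timestep
    print(t, round(float(xt.var()), 3))
```

Running this prints a variance near 1.0 for every timestep, confirming the identity above for unit-variance inputs.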

Why This Matters

  • Numerical stability: Activations stay in a reasonable range
  • Consistent scale: The network sees similar magnitudes at all timesteps
  • Simple prior: The endpoint is standard Gaussian, easy to sample

Convergence to Gaussian Prior

As $t \to T$ (with sufficiently large $T$ and an appropriate schedule), we have $\bar{\alpha}_T \to 0$. This means:

$$q(\mathbf{x}_T \mid \mathbf{x}_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_T}\,\mathbf{x}_0,\ (1-\bar{\alpha}_T)\mathbf{I}\big) \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Quantifying Convergence

For typical schedules with $T = 1000$:

| Schedule | $\bar{\alpha}_T$ | Mean shift | Variance ratio |
|----------|------------------|------------|----------------|
| Linear   | ~6×10⁻⁶ | ~0.0025 × $\mathbf{x}_0$ | ~0.999994 |
| Cosine   | ~2×10⁻⁴ | ~0.014 × $\mathbf{x}_0$  | ~0.9998   |

In practice, this is close enough to $\mathcal{N}(\mathbf{0}, \mathbf{I})$ that the approximation introduces negligible error.
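One way to quantify "close enough" is the per-dimension KL divergence between $q(\mathbf{x}_T \mid \mathbf{x}_0)$ and $\mathcal{N}(0, 1)$. The helper below is a sketch that plugs in the $\bar{\alpha}_T$ values from the table (the function name and the choice of a unit-magnitude $x_0$ coordinate are assumptions of this example):

```python
import numpy as np

def convergence_stats(abar_T, x0=1.0):
    """Mean shift, variance, and per-dim KL( N(mean, var) || N(0, 1) )
    for q(x_T | x_0) with a unit-magnitude x_0 coordinate."""
    mean = np.sqrt(abar_T) * x0          # sqrt(abar_T) * x_0
    var = 1.0 - abar_T                   # 1 - abar_T
    # Standard Gaussian KL formula, per dimension
    kl = 0.5 * (var + mean**2 - 1.0 - np.log(var))
    return mean, var, kl

# abar_T values taken from the table above
for name, abar_T in [("linear", 6e-6), ("cosine", 2e-4)]:
    mean, var, kl = convergence_stats(abar_T)
    print(f"{name}: mean shift={mean:.4f}, variance={var:.6f}, KL/dim={kl:.1e}")
```

Both schedules give a KL of roughly $10^{-6}$ to $10^{-4}$ nats per dimension, which is why sampling $\mathbf{x}_T$ from the standard Gaussian prior works so well.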

Why This Matters: The generation process starts by sampling $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. This is only valid because $q(\mathbf{x}_T)$ is approximately standard Gaussian regardless of the data distribution $q(\mathbf{x}_0)$.

The Markov Property

The forward process satisfies the Markov property:

$$q(\mathbf{x}_t \mid \mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{t-1}) = q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$

Given $\mathbf{x}_{t-1}$, the distribution of $\mathbf{x}_t$ is independent of all earlier states.

Implications

  1. Factorization: The joint distribution factors as $q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$
  2. Local dependence: Each step only needs the previous state
  3. Enables ELBO: The variational bound derivation relies on this factorization
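The factorization can be checked empirically: running the chain one step at a time, $\mathbf{x}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon}$, produces the same distribution as sampling the closed-form marginal directly. A minimal sketch, assuming an illustrative linear schedule:

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.02, T)  # assumed schedule, for illustration
alpha_bars = np.cumprod(1 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(50_000)  # unit-variance data

# Step-by-step Markov chain: x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps
x = x0.copy()
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# One-shot sample from the closed-form marginal q(x_T | x_0)
x_direct = (np.sqrt(alpha_bars[-1]) * x0
            + np.sqrt(1 - alpha_bars[-1]) * rng.standard_normal(x0.shape))

# Both routes preserve unit variance and agree in distribution
print(round(float(x.var()), 3), round(float(x_direct.var()), 3))  # both ≈ 1
```

In training code this equivalence is what lets us jump straight to any timestep $t$ with a single Gaussian sample instead of iterating the chain.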

Non-Markov Extensions

Interestingly, the reverse process for generation does not need to be Markov. DDIM and other fast samplers exploit this by taking larger steps that skip intermediate states.

Information Destruction

The forward process systematically destroys information about the original data. This can be quantified using information-theoretic measures.

Mutual Information Decay

The mutual information between $\mathbf{x}_0$ and $\mathbf{x}_t$ decreases with $t$:

$$I(\mathbf{x}_0; \mathbf{x}_t) \leq I(\mathbf{x}_0; \mathbf{x}_{t-1})$$

This follows from the data processing inequality: processing data cannot increase the information it contains about the original.

Signal-to-Noise Ratio Decay

The SNR provides a more intuitive measure:

$$\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$$

  • $t = 0$: $\text{SNR} = \infty$ (pure signal)
  • $t = T$: $\text{SNR} \approx 0$ (pure noise)

The schedule controls how quickly SNR decays, and thus how quickly information is destroyed.
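The decay is easy to visualize numerically. This sketch evaluates $\text{SNR}(t)$ for an assumed linear $\beta$ schedule (the endpoints are illustrative):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alpha_bars = np.cumprod(1 - betas)

# SNR(t) = abar_t / (1 - abar_t)
snr = alpha_bars / (1 - alpha_bars)

# SNR spans many orders of magnitude from near-pure signal to near-pure noise
for t in [0, 249, 499, 749, 999]:
    print(t, f"{snr[t]:.3e}")
```

Because $\bar{\alpha}_t$ is a product of factors strictly less than 1, the SNR is strictly decreasing in $t$; choosing a different schedule reshapes this curve without changing its endpoints.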


Conditional Reversibility

While the forward process destroys information, a remarkable property enables us to reverse it: the posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is tractable.

The Posterior Distribution

Using Bayes' rule and the Gaussian forms:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\ \tilde{\beta}_t \mathbf{I}\big)$$

where the posterior mean and variance have closed forms (derived in Chapter 3):

$$\tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t$$

$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

The Key Insight: If we knew $\mathbf{x}_0$, we could perfectly reverse the forward process. Since we don't know $\mathbf{x}_0$ during generation, we train a neural network to predict it (or, equivalently, to predict the noise $\boldsymbol{\epsilon}$).
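A useful sanity check on these formulas: sampling $\mathbf{x}_t \sim q(\mathbf{x}_t \mid \mathbf{x}_0)$ and then $\mathbf{x}_{t-1} \sim q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ must, after marginalizing over $\mathbf{x}_t$, reproduce the closed-form marginal $q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)$. The sketch below checks this for one scalar coordinate; the schedule, the timestep $t = 500$, and the value $x_0 = 1.7$ are all arbitrary choices for illustration:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed schedule
alphas = 1 - betas
alpha_bars = np.cumprod(alphas)

t = 500  # arbitrary interior timestep (0-indexed)
ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
beta_t, alpha_t = betas[t], alphas[t]

rng = np.random.default_rng(0)
x0 = 1.7       # a fixed scalar "data point"
n = 200_000    # number of Monte Carlo samples

# Forward: x_t ~ q(x_t | x_0)
xt = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * rng.standard_normal(n)

# Posterior mean and variance from the closed forms above
mu = (np.sqrt(ab_prev) * beta_t / (1 - ab_t)) * x0 \
   + (np.sqrt(alpha_t) * (1 - ab_prev) / (1 - ab_t)) * xt
beta_tilde = (1 - ab_prev) / (1 - ab_t) * beta_t
xprev = mu + np.sqrt(beta_tilde) * rng.standard_normal(n)

# Marginalizing over x_t should recover q(x_{t-1} | x_0):
print(round(float(xprev.mean()), 3), round(np.sqrt(ab_prev) * x0, 3))
print(round(float(xprev.var()), 3),  round(1 - ab_prev, 3))
```

The empirical mean and variance of the recovered $x_{t-1}$ samples match $\sqrt{\bar{\alpha}_{t-1}}\,x_0$ and $1 - \bar{\alpha}_{t-1}$, confirming the posterior is consistent with the forward marginals.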

Key Takeaways

  1. Variance preservation: With unit-variance data, $\text{Var}[\mathbf{x}_t] = \mathbf{I}$ at all timesteps
  2. Convergence: $q(\mathbf{x}_T) \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$ regardless of the data distribution
  3. Markov: Each step depends only on the previous state
  4. Information destruction: SNR decreases monotonically
  5. Tractable posterior: $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is Gaussian

Looking Ahead: In the next section, we'll put everything together with a complete, production-ready implementation of the forward diffusion process.