Learning Objectives
By the end of this section, you will be able to:
- Prove that the forward process is variance-preserving when data starts with unit variance
- Explain why $q(x_T \mid x_0)$ converges to $\mathcal{N}(0, I)$
- Describe how information about $x_0$ is destroyed over time
- Understand why the reverse process is tractable when conditioned on $x_0$
Variance Preservation
A key property of the forward process is that it preserves variance when the input data has unit variance. This is crucial for numerical stability during training.
The Mathematical Proof
Starting from the reparameterization:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

If $\epsilon \sim \mathcal{N}(0, I)$ and $\epsilon$ is independent of $x_0$, then:

$$\mathrm{Var}(x_t) = \bar{\alpha}_t\, \mathrm{Var}(x_0) + (1 - \bar{\alpha}_t)\, I$$

Special Case: Unit Variance Data
If $\mathrm{Var}(x_0) = I$, then:

$$\mathrm{Var}(x_t) = \bar{\alpha}_t\, I + (1 - \bar{\alpha}_t)\, I = I$$

The variance is exactly preserved at every timestep! This is why diffusion models typically work with data normalized to have unit variance.
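This property is easy to check numerically. The sketch below assumes a linear $\beta$ schedule from $10^{-4}$ to $0.02$ with $T = 1000$ (the common DDPM defaults, not values taken from this chapter) and verifies that unit-variance data keeps unit variance at every timestep:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # assumed linear schedule (DDPM defaults)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product: alpha_bar[t]

x0 = rng.standard_normal(100_000)     # unit-variance data

for t in [0, 99, 499, 999]:
    eps = rng.standard_normal(x0.shape)
    # Reparameterized forward sample: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    print(f"t={t + 1:4d}  Var(x_t) = {xt.var():.4f}")   # stays close to 1.0
```

Any other schedule gives the same result, since $\bar{\alpha}_t + (1 - \bar{\alpha}_t) = 1$ regardless of how the $\beta_t$ are chosen.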
Why This Matters
- Numerical stability: Activations stay in a reasonable range
- Consistent scale: The network sees similar magnitudes at all timesteps
- Simple prior: The endpoint is standard Gaussian, easy to sample
Convergence to Gaussian Prior
As $t \to T$ (with sufficiently large $T$ and an appropriate schedule), we have $\bar{\alpha}_T \to 0$. This means:

$$q(x_T \mid x_0) = \mathcal{N}\!\left(x_T;\ \sqrt{\bar{\alpha}_T}\, x_0,\ (1 - \bar{\alpha}_T)\, I\right) \approx \mathcal{N}(0, I)$$
Quantifying Convergence
For typical schedules with large $T$:
| Schedule | α̅_T | Mean shift | Variance ratio |
|---|---|---|---|
| Linear | ~6×10⁻⁶ | ~0.0025 × x₀ | ~0.999994 |
| Cosine | ~2×10⁻⁴ | ~0.014 × x₀ | ~0.9998 |
In practice, this is close enough to $\mathcal{N}(0, I)$ that the approximation introduces negligible error.
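To see where figures like these come from, the sketch below computes $\bar{\alpha}_T$ for a linear schedule, assuming the common $\beta$ range $10^{-4}$ to $0.02$ with $T = 1000$; the exact values depend on the precise schedule parameters, so they need not match the table digit for digit:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # assumed linear schedule
alpha_bar_T = np.cumprod(1.0 - betas)[-1]     # final cumulative product

# q(x_T | x_0) has mean sqrt(alpha_bar_T) * x0 and variance (1 - alpha_bar_T) I
print(f"alpha_bar_T    = {alpha_bar_T:.2e}")
print(f"mean shift     = {np.sqrt(alpha_bar_T):.4f} * x0")
print(f"variance ratio = {1.0 - alpha_bar_T:.6f}")
```

The residual mean shift is below one percent of $x_0$, which is why sampling $x_T$ from a standard Gaussian is safe in practice.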
Why This Matters: The generation process starts by sampling $x_T \sim \mathcal{N}(0, I)$. This is only valid because $q(x_T)$ is approximately standard Gaussian regardless of the data distribution $q(x_0)$.
The Markov Property
The forward process satisfies the Markov property:

$$q(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_0) = q(x_t \mid x_{t-1})$$

Given $x_{t-1}$, the distribution of $x_t$ is independent of all earlier states.
Implications
- Factorization: The joint distribution factors as $q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$
- Local dependence: Each step only needs the previous state
- Enables ELBO: The variational bound derivation relies on this factorization
Non-Markov Extensions
Not every diffusion formulation keeps this property: DDIM, for example, defines a non-Markovian forward process that preserves the same marginals $q(x_t \mid x_0)$, which is what enables its deterministic, accelerated sampling.
Information Destruction
The forward process systematically destroys information about the original data. This can be quantified using information-theoretic measures.
Mutual Information Decay
The mutual information between $x_0$ and $x_t$ decreases with $t$:

$$I(x_0; x_t) \ge I(x_0; x_{t+1})$$
This follows from the data processing inequality: processing data cannot increase the information it contains about the original.
Signal-to-Noise Ratio Decay
The signal-to-noise ratio (SNR) provides a more intuitive measure:

$$\mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$$

- $t = 0$: $\mathrm{SNR} \to \infty$ (pure signal)
- $t = T$: $\mathrm{SNR} \to 0$ (pure noise)

The schedule $\{\beta_t\}$ controls how quickly the SNR decays, and thus how quickly information is destroyed.
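The monotone decay is easy to verify for a concrete schedule. The sketch below assumes a linear $\beta$ schedule with $T = 1000$ (an illustrative choice, not fixed by this section):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
snr = alpha_bar / (1.0 - alpha_bar)     # SNR(t) = abar_t / (1 - abar_t)

# SNR decreases at every step: information about x0 is only ever lost
assert np.all(np.diff(snr) < 0)

for t in [0, 249, 499, 749, 999]:
    print(f"t={t + 1:4d}  SNR = {snr[t]:.3e}")
```

Plotting `snr` on a log scale is a useful way to compare schedules: a cosine schedule spends more steps at moderate SNR than the linear one.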
Conditional Reversibility
While the forward process destroys information, a remarkable property enables us to reverse it: the posterior $q(x_{t-1} \mid x_t, x_0)$ is tractable.
The Posterior Distribution
Using Bayes' rule and the Gaussian forms:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)$$

where the posterior mean and variance have closed forms (derived in Chapter 3):

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$$

The Key Insight: If we knew $x_0$, we could perfectly reverse the forward process. Since we don't know $x_0$ during generation, we train a neural network to predict it (or equivalently, to predict the noise $\epsilon$).
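The closed-form posterior can be sanity-checked by Monte Carlo in one dimension. The sketch below (the scalar $x_0 = 0.7$, the small $T$, and the linear schedule are all illustrative assumptions) forward-samples pairs $(x_{t-1}, x_t)$ given a fixed $x_0$, then compares the empirical conditional statistics of $x_{t-1}$ against the standard DDPM posterior formulas:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100
betas = np.linspace(1e-4, 0.02, T)    # assumed linear schedule (toy scale)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

t = 50        # an intermediate step (array index t plays the role of timestep t)
x0 = 0.7      # a fixed scalar data point

# Monte Carlo: forward-sample pairs (x_{t-1}, x_t) conditioned on x0
n = 2_000_000
x_prev = (np.sqrt(alpha_bar[t - 1]) * x0
          + np.sqrt(1 - alpha_bar[t - 1]) * rng.standard_normal(n))
x_t = np.sqrt(alphas[t]) * x_prev + np.sqrt(betas[t]) * rng.standard_normal(n)

# Closed-form posterior q(x_{t-1} | x_t, x0) evaluated at one x_t value
xt_value = 0.5
mu_post = (np.sqrt(alpha_bar[t - 1]) * betas[t] / (1 - alpha_bar[t]) * x0
           + np.sqrt(alphas[t]) * (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]) * xt_value)
var_post = (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]) * betas[t]

# Empirical posterior: keep samples whose x_t landed near xt_value
mask = np.abs(x_t - xt_value) < 0.01
print(f"posterior mean: closed-form {mu_post:.4f}  vs  empirical {x_prev[mask].mean():.4f}")
print(f"posterior var:  closed-form {var_post:.5f}  vs  empirical {x_prev[mask].var():.5f}")
```

The two estimates agree to Monte Carlo accuracy, confirming that the Gaussian posterior exactly describes "one step backwards" when $x_0$ is known.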
Key Takeaways
- Variance preservation: With unit variance data, $\mathrm{Var}(x_t) = I$ at all timesteps
- Convergence: $q(x_T \mid x_0) \approx \mathcal{N}(0, I)$ regardless of data distribution
- Markov: Each step depends only on the previous state
- Information destruction: SNR decreases monotonically
- Tractable posterior: $q(x_{t-1} \mid x_t, x_0)$ is Gaussian
Looking Ahead: In the next section, we'll put everything together with a complete, production-ready implementation of the forward diffusion process.