Chapter 2

The Noise Schedule

The Forward Diffusion Process

Learning Objectives

By the end of this section, you will be able to:

  1. Explain why the noise schedule $\{\beta_t\}_{t=1}^T$ critically affects generation quality
  2. Derive the relationship between $\beta_t$ and $\bar{\alpha}_t$
  3. Compare linear and cosine schedules and understand their trade-offs
  4. Implement multiple noise schedules in PyTorch

Why the Schedule Matters

The noise schedule $\{\beta_t\}_{t=1}^T$ determines how quickly we destroy information during the forward process. This seemingly simple hyperparameter has profound effects on:

  • Sample Quality: Too aggressive noise destroys fine details; too slow noise makes generation harder to learn
  • Training Efficiency: The schedule affects which timesteps contribute most to the loss, impacting convergence
  • Generation Speed: Different schedules enable different sampling strategies, some much faster than others
The Key Insight: The noise schedule controls the "information destruction curve." We want to destroy information gradually enough that each reverse step is learnable, but completely enough that we reach pure noise.

From Beta to Alpha Bar

Recall from Section 2.1 that the single-step transition uses $\beta_t$. But what we really care about is the cumulative effect: how much signal remains after $t$ steps. This is captured by $\bar{\alpha}_t$:

$$\alpha_t = 1 - \beta_t$$

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s = \prod_{s=1}^{t} (1 - \beta_s)$$

The quantity $\bar{\alpha}_t$ tells us what fraction of the original signal variance remains at timestep $t$. When $\bar{\alpha}_t \approx 0$, almost all signal is lost.
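To see how quickly the product shrinks, here is a tiny standalone check with a constant $\beta = 0.02$ (a toy value for illustration; real schedules use much smaller values early on):

```python
# With a constant beta, alpha_bar decays geometrically: alpha_bar_t = (1 - beta)^t.
beta = 0.02
alpha_bar = 1.0
for t in range(1, 101):
    alpha_bar *= 1 - beta

print(round(alpha_bar, 4))  # 0.1326: ~87% of the signal variance is gone after 100 steps
```

Even a 2% noise injection per step wipes out most of the signal within a hundred steps, which is why real schedules start with much smaller $\beta_t$.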


The Linear Schedule

The original DDPM paper (Ho et al., 2020) used a linear schedule:

$$\beta_t = \beta_{\text{start}} + \frac{t-1}{T-1}\left(\beta_{\text{end}} - \beta_{\text{start}}\right)$$

with typical values $\beta_{\text{start}} = 10^{-4}$ and $\beta_{\text{end}} = 0.02$.

| Property | Value | Interpretation |
|---|---|---|
| β₁ (first step) | 0.0001 | Very small noise initially |
| β_T (last step) | 0.02 | Still relatively small per step |
| Total steps T | 1000 | Many small steps accumulate to destroy the signal |
| ᾱ_T | ≈ 4 × 10⁻⁵ | Almost no signal remains at the end |

Problem with Linear Schedule

The linear schedule has a significant flaw: $\bar{\alpha}_t$ decays too quickly in early timesteps. Because $\bar{\alpha}_t$ is a product, even small $\beta_t$ values compound:

$$\bar{\alpha}_{100} \approx 0.90, \qquad \bar{\alpha}_{500} \approx 0.08, \qquad \bar{\alpha}_{1000} \approx 4 \times 10^{-5}$$

This means that by timestep 500, over 90% of the signal is already gone! The model must learn most of the high-frequency detail in very few effective timesteps.
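The decay can be computed directly from the definition (standard DDPM settings: $T = 1000$, $\beta$ from $10^{-4}$ to $0.02$):

```python
import torch

betas = torch.linspace(0.0001, 0.02, 1000)
alpha_bars = torch.cumprod(1 - betas, dim=0)

# Index t-1 corresponds to timestep t.
print(alpha_bars[99])   # ~0.90 at t=100
print(alpha_bars[499])  # ~0.08 at t=500
print(alpha_bars[999])  # ~4e-5 at t=1000
```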


The Cosine Schedule

The Improved DDPM paper (Nichol & Dhariwal, 2021) proposed the cosine schedule, which directly specifies $\bar{\alpha}_t$ rather than $\beta_t$:

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad \text{where } f(t) = \cos^2\!\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$$

The small offset $s = 0.008$ prevents $\beta_t$ from being too small near $t = 0$.
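As a quick check of the formula (with the paper's defaults $T = 1000$, $s = 0.008$), the cosine schedule retains roughly half the signal at the midpoint:

```python
import math

T, s = 1000, 0.008

def f(t: float) -> float:
    # Cosine-squared curve: f(0) is close to 1, f(T) approaches 0.
    return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2

alpha_bar_500 = f(500) / f(0)
print(round(alpha_bar_500, 4))  # 0.4938: about half the signal variance remains at t = T/2
```

Compare this with the linear schedule, which is already down to roughly 0.08 at the same timestep.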

Why Cosine Works Better

The cosine function creates a smooth S-curve for $\bar{\alpha}_t$:

  • Slow start: $\bar{\alpha}_t$ stays close to 1 for early timesteps, preserving fine details longer
  • Gradual middle: Smooth decay through middle timesteps
  • Complete destruction: Still reaches near-zero by $t = T$

Deriving Beta from Alpha Bar

Once we specify $\bar{\alpha}_t$, we can derive $\beta_t$ using:

$$\beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}$$

This works because $\bar{\alpha}_t = \bar{\alpha}_{t-1} \cdot (1 - \beta_t)$.
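The two directions are exact inverses of each other, which is easy to confirm numerically:

```python
import torch

# Forward: betas -> alpha_bars via the cumulative product.
betas = torch.linspace(0.0001, 0.02, 1000, dtype=torch.float64)
alpha_bars = torch.cumprod(1 - betas, dim=0)

# Backward: alpha_bars -> betas, using the convention alpha_bar_0 = 1.
prev = torch.cat([torch.ones(1, dtype=torch.float64), alpha_bars[:-1]])
recovered = 1 - alpha_bars / prev

print(torch.allclose(recovered, betas))  # True
```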

Schedule Comparison

The figure below compares the two schedules. Notice how the cosine schedule preserves signal ($\bar{\alpha}_t$) for longer during early timesteps:

[Figure: noise schedule comparison with three panels plotted against timestep t: ᾱ_t (signal retention), β_t (noise added per step), and SNR = ᾱ_t/(1−ᾱ_t) on a log scale.]

Values at t = 500

| Schedule | β_t | ᾱ_t | √ᾱ_t | √(1−ᾱ_t) | SNR |
|---|---|---|---|---|---|
| Linear (DDPM) | 0.010050 | 0.077992 | 0.279271 | 0.960212 | 0.0846 |
| Cosine (Improved DDPM) | 0.003146 | 0.493844 | 0.702740 | 0.711447 | 0.9757 |
Linear Schedule

Original DDPM. $\bar{\alpha}_t$ decays too quickly early on, potentially losing high-frequency details. Simple but not optimal.

Cosine Schedule

Improved DDPM. Slower decay preserves more signal structure. Better image quality, especially at high resolutions.

Implementation

Here is a complete implementation of multiple noise schedules:

Noise Schedule Implementations
noise_schedules.py:

```python
import torch
import math


def linear_beta_schedule(T: int, beta_start: float = 0.0001, beta_end: float = 0.02) -> torch.Tensor:
    """
    Linear noise schedule from DDPM (Ho et al., 2020).

    beta_t increases linearly from beta_start to beta_end.
    torch.linspace creates T evenly spaced values, e.g.
    linspace(0.0001, 0.02, 1000) gives [0.0001, 0.00012, ..., 0.02].

    Args:
        T: Total number of timesteps
        beta_start: Starting noise variance
        beta_end: Ending noise variance

    Returns:
        betas: Tensor of shape [T] with beta values
    """
    return torch.linspace(beta_start, beta_end, T)


def cosine_beta_schedule(T: int, s: float = 0.008) -> torch.Tensor:
    """
    Cosine noise schedule from Improved DDPM (Nichol & Dhariwal, 2021).

    Designed to preserve more signal in early timesteps. The small
    offset s keeps the schedule away from exactly 0 or 1, which would
    cause numerical issues.

    Args:
        T: Total number of timesteps
        s: Small offset to prevent singularities (default 0.008)

    Returns:
        betas: Tensor of shape [T] with beta values
    """
    # The cosine-squared function f(t): f(0) is close to 1, f(T) approaches 0.
    def f(t):
        return math.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2

    # Compute alpha_bar directly from f, then derive betas from
    # alpha_bar_t = alpha_bar_{t-1} * (1 - beta_t).
    alpha_bars = torch.tensor([f(t) / f(0) for t in range(T + 1)])
    betas = 1 - alpha_bars[1:] / alpha_bars[:-1]

    # Clip betas: values too close to 0 or 1 cause numerical instability.
    return torch.clamp(betas, min=0.0001, max=0.999)


def get_schedule_parameters(betas: torch.Tensor) -> dict:
    """
    Compute all derived parameters from a beta schedule.

    Done once and cached; these quantities are reused throughout
    training and sampling.

    Args:
        betas: Tensor of beta values [T]

    Returns:
        Dictionary with alpha, alpha_bar, sqrt terms, etc.
    """
    alphas = 1.0 - betas
    # Cumulative product: alpha_bar_t = prod_{s=1}^t alpha_s, the
    # fraction of original signal variance remaining at step t.
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Pre-compute useful quantities
    sqrt_alphas = torch.sqrt(alphas)
    sqrt_alpha_bars = torch.sqrt(alpha_bars)
    sqrt_one_minus_alpha_bars = torch.sqrt(1.0 - alpha_bars)

    # Signal-to-noise ratio: high SNR = mostly signal, low SNR = mostly noise.
    snr = alpha_bars / (1.0 - alpha_bars)

    return {
        "betas": betas,
        "alphas": alphas,
        "alpha_bars": alpha_bars,
        "sqrt_alphas": sqrt_alphas,
        "sqrt_alpha_bars": sqrt_alpha_bars,
        "sqrt_one_minus_alpha_bars": sqrt_one_minus_alpha_bars,
        "snr": snr,
    }


# Compare schedules
T = 1000
linear_params = get_schedule_parameters(linear_beta_schedule(T))
cosine_params = get_schedule_parameters(cosine_beta_schedule(T))

print("At t=500:")
print(f"  Linear: alpha_bar = {linear_params['alpha_bars'][500]:.4f}")
print(f"  Cosine: alpha_bar = {cosine_params['alpha_bars'][500]:.4f}")
```

Signal-to-Noise Ratio Perspective

A powerful way to understand noise schedules is through the Signal-to-Noise Ratio (SNR):

$$\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$$

The SNR measures the ratio of signal variance to noise variance at timestep $t$:

  • SNR → ∞: Pure signal (t = 0)
  • SNR = 1: Equal signal and noise
  • SNR → 0: Pure noise (t = T)

Log-SNR

In practice, we often work with $\log \text{SNR}(t)$ because the SNR spans many orders of magnitude:

$$\log \text{SNR}(t) = \log \bar{\alpha}_t - \log(1 - \bar{\alpha}_t)$$

The log-SNR typically ranges from about +10 (mostly signal) to −10 (mostly noise). A good noise schedule should have log-SNR decrease approximately linearly with $t$.
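For the DDPM linear schedule, the endpoints land close to this range (a quick check with the standard settings):

```python
import torch

betas = torch.linspace(0.0001, 0.02, 1000)
alpha_bars = torch.cumprod(1 - betas, dim=0)
log_snr = torch.log(alpha_bars) - torch.log1p(-alpha_bars)

print(log_snr[0])   # ~ +9.2: mostly signal at t = 1
print(log_snr[-1])  # ~ -10.1: mostly noise at t = T
```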

Modern Perspective: Recent work (e.g., Karras et al., 2022) argues that diffusion models should be parameterized directly in terms of SNR or log-SNR, as this provides a more principled view of the denoising task.

Choosing a Schedule

Which schedule should you use? Here are practical guidelines:

| Schedule | Best For | Key Trade-off |
|---|---|---|
| Linear | Small images (32×32), quick experiments | Fast but may lose high-frequency details |
| Cosine | High-resolution images, production models | Better quality but more complex |
| Sigmoid | Custom applications | Tunable middle transition |
| Learned | Maximum performance | Adds training complexity |
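The sigmoid schedule is less standardized than the other two; the sketch below shows one common parameterization (the cutoffs −6 and 6 and the function name are illustrative choices, not from a specific paper):

```python
import torch

def sigmoid_beta_schedule(T: int, beta_start: float = 0.0001, beta_end: float = 0.02,
                          lo: float = -6.0, hi: float = 6.0) -> torch.Tensor:
    """Sigmoid-shaped schedule: slow noise growth at both ends, faster in the middle."""
    x = torch.linspace(lo, hi, T)
    sig = torch.sigmoid(x)
    # Rescale the sigmoid so the betas exactly span [beta_start, beta_end].
    sig = (sig - sig.min()) / (sig.max() - sig.min())
    return beta_start + sig * (beta_end - beta_start)

betas = sigmoid_beta_schedule(1000)
print(betas[0], betas[-1])  # endpoints hit beta_start and beta_end
```

Widening or narrowing the `[lo, hi]` window tunes how sharp the middle transition is.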

General Recommendations

  1. Start with cosine for most applications - it works well across image resolutions
  2. Verify endpoint: Ensure $\bar{\alpha}_T < 10^{-4}$ so the endpoint is effectively standard Gaussian
  3. Check SNR distribution: Log-SNR should span the range uniformly for balanced training
  4. Consider your data: Images with fine details may benefit from slower early decay
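Recommendation 2 is easy to automate; the helper below (`check_endpoint` is an illustrative name, not a library function) verifies that the forward process actually reaches near-pure noise:

```python
import torch

def check_endpoint(betas: torch.Tensor, tol: float = 1e-4) -> bool:
    """Return True if alpha_bar_T < tol, i.e. x_T is effectively pure Gaussian noise."""
    alpha_bar_T = torch.cumprod(1 - betas, dim=0)[-1]
    return bool(alpha_bar_T < tol)

# The standard DDPM linear schedule passes this check.
print(check_endpoint(torch.linspace(0.0001, 0.02, 1000)))  # True
```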

Key Takeaways

  1. Schedule is critical: The choice of $\{\beta_t\}$ directly affects sample quality
  2. Alpha bar matters most: $\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)$ determines how much signal remains
  3. Linear is simple but flawed: Signal decays too quickly in early steps
  4. Cosine preserves signal: Designed so $\bar{\alpha}_t$ decays more gradually
  5. SNR perspective: $\text{SNR}(t) = \bar{\alpha}_t/(1-\bar{\alpha}_t)$ provides an intuitive interpretation
Looking Ahead: In the next section, we'll derive the closed-form expression for sampling at any timestep directly from $\mathbf{x}_0$, which is the key to efficient training.