Chapter 2

The Noise Schedule

The Forward Diffusion Process

Learning Objectives

By the end of this section, you will be able to:

  1. Explain why the noise schedule $\{\beta_t\}_{t=1}^T$ critically affects generation quality
  2. Derive the relationship between $\beta_t$ and $\bar{\alpha}_t$
  3. Compare linear and cosine schedules and understand their trade-offs
  4. Implement multiple noise schedules in PyTorch

Why the Schedule Matters

The noise schedule $\{\beta_t\}_{t=1}^T$ determines how quickly we destroy information during the forward process. This seemingly simple hyperparameter has profound effects on:

  • Sample Quality: Too aggressive noise destroys fine details; too slow noise makes generation harder to learn
  • Training Efficiency: The schedule affects which timesteps contribute most to the loss, impacting convergence
  • Generation Speed: Different schedules enable different sampling strategies, some much faster than others
The Key Insight: The noise schedule controls the "information destruction curve." We want to destroy information gradually enough that each reverse step is learnable, but completely enough that we reach pure noise.

From Beta to Alpha Bar

Recall from Section 2.1 that the single-step transition uses $\beta_t$. But what we really care about is the cumulative effect: how much signal remains after $t$ steps. This is captured by $\bar{\alpha}_t$:

$$\alpha_t = 1 - \beta_t$$

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s = \prod_{s=1}^{t} (1 - \beta_s)$$

The quantity $\bar{\alpha}_t$ tells us what fraction of the original signal variance remains at timestep $t$. When $\bar{\alpha}_t \approx 0$, almost all signal is lost.
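To see how quickly the product shrinks, here is a tiny standalone check with a constant $\beta = 0.02$ (a toy value for illustration; real schedules use much smaller values early on):

```python
# With a constant beta, alpha_bar decays geometrically: alpha_bar_t = (1 - beta)^t.
beta = 0.02
alpha_bar = 1.0
for t in range(1, 101):
    alpha_bar *= 1 - beta

print(round(alpha_bar, 4))  # 0.1326: ~87% of the signal variance is gone after 100 steps
```

Even a 2% noise injection per step wipes out most of the signal within a hundred steps, which is why real schedules start with much smaller $\beta_t$.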


The Linear Schedule

The original DDPM paper (Ho et al., 2020) used a linear schedule:

$$\beta_t = \beta_{\text{start}} + \frac{t-1}{T-1}\left(\beta_{\text{end}} - \beta_{\text{start}}\right)$$

with typical values $\beta_{\text{start}} = 10^{-4}$ and $\beta_{\text{end}} = 0.02$.

| Property | Value | Interpretation |
|---|---|---|
| β₁ (first step) | 0.0001 | Very small noise initially |
| β_T (last step) | 0.02 | Still relatively small per step |
| Total steps T | 1000 | Many small steps accumulate to destroy the signal |
| ᾱ_T | ≈ 4 × 10⁻⁵ | Almost no signal remains at the end |

Problem with Linear Schedule

The linear schedule has a significant flaw: $\bar{\alpha}_t$ decays too quickly in early timesteps. Because $\bar{\alpha}_t$ is a product, even small $\beta_t$ values compound:

$$\bar{\alpha}_{100} \approx 0.90, \qquad \bar{\alpha}_{500} \approx 0.08, \qquad \bar{\alpha}_{1000} \approx 4 \times 10^{-5}$$

This means that by timestep 500, over 90% of the signal is already gone! The model must learn most of the high-frequency detail in very few effective timesteps.
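The decay can be computed directly from the definition (standard DDPM settings: $T = 1000$, $\beta$ from $10^{-4}$ to $0.02$):

```python
import torch

betas = torch.linspace(0.0001, 0.02, 1000)
alpha_bars = torch.cumprod(1 - betas, dim=0)

# Index t-1 corresponds to timestep t.
print(alpha_bars[99])   # ~0.90 at t=100
print(alpha_bars[499])  # ~0.08 at t=500
print(alpha_bars[999])  # ~4e-5 at t=1000
```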


The Cosine Schedule

The Improved DDPM paper (Nichol & Dhariwal, 2021) proposed the cosine schedule, which directly specifies $\bar{\alpha}_t$ rather than $\beta_t$:

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad \text{where } f(t) = \cos^2\!\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$$

The small offset $s = 0.008$ prevents $\beta_t$ from being too small near $t = 0$.
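As a quick check of the formula (with the paper's defaults $T = 1000$, $s = 0.008$), the cosine schedule retains roughly half the signal at the midpoint:

```python
import math

T, s = 1000, 0.008

def f(t: float) -> float:
    # Cosine-squared curve: f(0) is close to 1, f(T) approaches 0.
    return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2

alpha_bar_500 = f(500) / f(0)
print(round(alpha_bar_500, 4))  # 0.4938: about half the signal variance remains at t = T/2
```

Compare this with the linear schedule, which is already down to roughly 0.08 at the same timestep.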

Why Cosine Works Better

The cosine function creates a smooth S-curve for $\bar{\alpha}_t$:

  • Slow start: $\bar{\alpha}_t$ stays close to 1 for early timesteps, preserving fine details longer
  • Gradual middle: Smooth decay through middle timesteps
  • Complete destruction: Still reaches near-zero by $t = T$

Deriving Beta from Alpha Bar

Once we specify $\bar{\alpha}_t$, we can derive $\beta_t$ using:

$$\beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}$$

This works because $\bar{\alpha}_t = \bar{\alpha}_{t-1} \cdot (1 - \beta_t)$.
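The two directions are exact inverses of each other, which is easy to confirm numerically:

```python
import torch

# Forward: betas -> alpha_bars via the cumulative product.
betas = torch.linspace(0.0001, 0.02, 1000, dtype=torch.float64)
alpha_bars = torch.cumprod(1 - betas, dim=0)

# Backward: alpha_bars -> betas, using the convention alpha_bar_0 = 1.
prev = torch.cat([torch.ones(1, dtype=torch.float64), alpha_bars[:-1]])
recovered = 1 - alpha_bars / prev

print(torch.allclose(recovered, betas))  # True
```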

Schedule Comparison

The figure below compares the two schedules. Notice how the cosine schedule preserves signal ($\bar{\alpha}_t$) for longer during early timesteps:

[Figure: noise schedule comparison with three panels plotted against timestep t: ᾱ_t (signal retention), β_t (noise added per step), and SNR = ᾱ_t/(1−ᾱ_t) on a log scale.]

Values at t = 500

| Schedule | β_t | ᾱ_t | √ᾱ_t | √(1−ᾱ_t) | SNR |
|---|---|---|---|---|---|
| Linear (DDPM) | 0.010050 | 0.077992 | 0.279271 | 0.960212 | 0.0846 |
| Cosine (Improved DDPM) | 0.003146 | 0.493844 | 0.702740 | 0.711447 | 0.9757 |
Linear Schedule

Original DDPM. $\bar{\alpha}_t$ decays too quickly early on, potentially losing high-frequency details. Simple but not optimal.

Cosine Schedule

Improved DDPM. Slower decay preserves more signal structure. Better image quality, especially at high resolutions.

Implementation

Here is a complete implementation of multiple noise schedules:

Noise Schedule Implementations
noise_schedules.py:

```python
import torch
import math


def linear_beta_schedule(T: int, beta_start: float = 0.0001, beta_end: float = 0.02) -> torch.Tensor:
    """
    Linear noise schedule from DDPM (Ho et al., 2020).

    beta_t increases linearly from beta_start to beta_end.
    torch.linspace creates T evenly spaced values, e.g.
    linspace(0.0001, 0.02, 1000) gives [0.0001, 0.00012, ..., 0.02].

    Args:
        T: Total number of timesteps
        beta_start: Starting noise variance
        beta_end: Ending noise variance

    Returns:
        betas: Tensor of shape [T] with beta values
    """
    return torch.linspace(beta_start, beta_end, T)


def cosine_beta_schedule(T: int, s: float = 0.008) -> torch.Tensor:
    """
    Cosine noise schedule from Improved DDPM (Nichol & Dhariwal, 2021).

    Designed to preserve more signal in early timesteps. The small
    offset s keeps the schedule away from exactly 0 or 1, which would
    cause numerical issues.

    Args:
        T: Total number of timesteps
        s: Small offset to prevent singularities (default 0.008)

    Returns:
        betas: Tensor of shape [T] with beta values
    """
    # The cosine-squared function f(t): f(0) is close to 1, f(T) approaches 0.
    def f(t):
        return math.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2

    # Compute alpha_bar directly from f, then derive betas from
    # alpha_bar_t = alpha_bar_{t-1} * (1 - beta_t).
    alpha_bars = torch.tensor([f(t) / f(0) for t in range(T + 1)])
    betas = 1 - alpha_bars[1:] / alpha_bars[:-1]

    # Clip betas: values too close to 0 or 1 cause numerical instability.
    return torch.clamp(betas, min=0.0001, max=0.999)


def get_schedule_parameters(betas: torch.Tensor) -> dict:
    """
    Compute all derived parameters from a beta schedule.

    Done once and cached; these quantities are reused throughout
    training and sampling.

    Args:
        betas: Tensor of beta values [T]

    Returns:
        Dictionary with alpha, alpha_bar, sqrt terms, etc.
    """
    alphas = 1.0 - betas
    # Cumulative product: alpha_bar_t = prod_{s=1}^t alpha_s, the
    # fraction of original signal variance remaining at step t.
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Pre-compute useful quantities
    sqrt_alphas = torch.sqrt(alphas)
    sqrt_alpha_bars = torch.sqrt(alpha_bars)
    sqrt_one_minus_alpha_bars = torch.sqrt(1.0 - alpha_bars)

    # Signal-to-noise ratio: high SNR = mostly signal, low SNR = mostly noise.
    snr = alpha_bars / (1.0 - alpha_bars)

    return {
        "betas": betas,
        "alphas": alphas,
        "alpha_bars": alpha_bars,
        "sqrt_alphas": sqrt_alphas,
        "sqrt_alpha_bars": sqrt_alpha_bars,
        "sqrt_one_minus_alpha_bars": sqrt_one_minus_alpha_bars,
        "snr": snr,
    }


# Compare schedules
T = 1000
linear_params = get_schedule_parameters(linear_beta_schedule(T))
cosine_params = get_schedule_parameters(cosine_beta_schedule(T))

print("At t=500:")
print(f"  Linear: alpha_bar = {linear_params['alpha_bars'][500]:.4f}")
print(f"  Cosine: alpha_bar = {cosine_params['alpha_bars'][500]:.4f}")
```

Signal-to-Noise Ratio Perspective

A powerful way to understand noise schedules is through the Signal-to-Noise Ratio (SNR):

$$\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$$

The SNR measures the ratio of signal variance to noise variance at timestep $t$:

  • SNR → ∞: Pure signal (t = 0)
  • SNR = 1: Equal signal and noise
  • SNR → 0: Pure noise (t = T)

Log-SNR

In practice, we often work with $\log \text{SNR}(t)$ because the SNR spans many orders of magnitude:

$$\log \text{SNR}(t) = \log \bar{\alpha}_t - \log(1 - \bar{\alpha}_t)$$

The log-SNR typically ranges from about +10 (mostly signal) to −10 (mostly noise). A good noise schedule should have log-SNR decrease approximately linearly with $t$.
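For the DDPM linear schedule, the endpoints land close to this range (a quick check with the standard settings):

```python
import torch

betas = torch.linspace(0.0001, 0.02, 1000)
alpha_bars = torch.cumprod(1 - betas, dim=0)
log_snr = torch.log(alpha_bars) - torch.log1p(-alpha_bars)

print(log_snr[0])   # ~ +9.2: mostly signal at t = 1
print(log_snr[-1])  # ~ -10.1: mostly noise at t = T
```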

Modern Perspective: Recent work (e.g., Karras et al., 2022) argues that diffusion models should be parameterized directly in terms of SNR or log-SNR, as this provides a more principled view of the denoising task.

Choosing a Schedule

Which schedule should you use? Here are practical guidelines:

| Schedule | Best For | Key Trade-off |
|---|---|---|
| Linear | Small images (32×32), quick experiments | Fast but may lose high-frequency details |
| Cosine | High-resolution images, production models | Better quality but more complex |
| Sigmoid | Custom applications | Tunable middle transition |
| Learned | Maximum performance | Adds training complexity |
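The sigmoid schedule is less standardized than the other two; the sketch below shows one common parameterization (the cutoffs −6 and 6 and the function name are illustrative choices, not from a specific paper):

```python
import torch

def sigmoid_beta_schedule(T: int, beta_start: float = 0.0001, beta_end: float = 0.02,
                          lo: float = -6.0, hi: float = 6.0) -> torch.Tensor:
    """Sigmoid-shaped schedule: slow noise growth at both ends, faster in the middle."""
    x = torch.linspace(lo, hi, T)
    sig = torch.sigmoid(x)
    # Rescale the sigmoid so the betas exactly span [beta_start, beta_end].
    sig = (sig - sig.min()) / (sig.max() - sig.min())
    return beta_start + sig * (beta_end - beta_start)

betas = sigmoid_beta_schedule(1000)
print(betas[0], betas[-1])  # endpoints hit beta_start and beta_end
```

Widening or narrowing the `[lo, hi]` window tunes how sharp the middle transition is.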

General Recommendations

  1. Start with cosine for most applications - it works well across image resolutions
  2. Verify endpoint: Ensure $\bar{\alpha}_T < 10^{-4}$ so the endpoint is effectively standard Gaussian
  3. Check SNR distribution: Log-SNR should span the range uniformly for balanced training
  4. Consider your data: Images with fine details may benefit from slower early decay
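Recommendation 2 is easy to automate; the helper below (`check_endpoint` is an illustrative name, not a library function) verifies that the forward process actually reaches near-pure noise:

```python
import torch

def check_endpoint(betas: torch.Tensor, tol: float = 1e-4) -> bool:
    """Return True if alpha_bar_T < tol, i.e. x_T is effectively pure Gaussian noise."""
    alpha_bar_T = torch.cumprod(1 - betas, dim=0)[-1]
    return bool(alpha_bar_T < tol)

# The standard DDPM linear schedule passes this check.
print(check_endpoint(torch.linspace(0.0001, 0.02, 1000)))  # True
```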

Key Takeaways

  1. Schedule is critical: The choice of $\{\beta_t\}$ directly affects sample quality
  2. Alpha bar matters most: $\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)$ determines how much signal remains
  3. Linear is simple but flawed: Signal decays too quickly in early steps
  4. Cosine preserves signal: Designed so $\bar{\alpha}_t$ decays more gradually
  5. SNR perspective: $\text{SNR}(t) = \bar{\alpha}_t/(1-\bar{\alpha}_t)$ provides an intuitive interpretation
Looking Ahead: In the next section, we'll derive the closed-form expression for sampling at any timestep directly from $\mathbf{x}_0$, which is the key to efficient training.