Chapter 3

Score Matching Perspective

The Reverse Diffusion Process

Learning Objectives

By the end of this section, you will be able to:

  1. Define the score function $\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$
  2. Explain denoising score matching and why it works
  3. Derive Tweedie's formula and its implications
  4. Connect epsilon-prediction to score estimation

The Score Function

The score function is the gradient of the log probability density with respect to the data:

$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$$

Why the Score?

Unlike the density $p(\mathbf{x})$, the score:

  • Points toward high density: The gradient indicates the direction of steepest increase in log-probability
  • Avoids normalization: Since $\nabla \log p = \nabla \log \tilde{p}$, where $\tilde{p}$ is the unnormalized density, the intractable normalizing constant drops out
  • Enables sampling: Langevin dynamics uses the score to sample from a distribution

Score for a Gaussian

For a Gaussian $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \sigma^2\mathbf{I})$:

$$\log p(\mathbf{x}) = -\frac{\|\mathbf{x} - \boldsymbol{\mu}\|^2}{2\sigma^2} + C$$

$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) = -\frac{\mathbf{x} - \boldsymbol{\mu}}{\sigma^2}$$

Key Insight: The score of a Gaussian points from $\mathbf{x}$ toward the mean $\boldsymbol{\mu}$, scaled by the inverse variance. It tells you "which way to go" to reach higher probability.
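The closed-form Gaussian score is easy to sanity-check with autograd. A minimal sketch (the mean, variance, and test point are arbitrary illustrative values):

```python
import torch

# Gaussian parameters and a test point (illustrative values)
mu = torch.tensor([1.0, -2.0])
sigma = 2.0
x = torch.tensor([0.5, 0.5], requires_grad=True)

# log N(x; mu, sigma^2 I) up to an additive constant
log_p = -((x - mu) ** 2).sum() / (2 * sigma ** 2)
log_p.backward()

# The autograd gradient matches the closed form -(x - mu) / sigma^2
analytic_score = -(x.detach() - mu) / sigma ** 2
assert torch.allclose(x.grad, analytic_score)
```

Note that the additive constant $C$ contributes nothing to the gradient, which is exactly the "avoids normalization" point above.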

Denoising Score Matching

Score matching (Hyvärinen, 2005) trains a network to approximate the score function. The naive objective:

$$\mathbb{E}_{p(\mathbf{x})}\left[\|\mathbf{s}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log p(\mathbf{x})\|^2\right]$$

is intractable because we don't know $\nabla_{\mathbf{x}} \log p(\mathbf{x})$.

The Denoising Score Matching Trick

Vincent (2011) showed we can use a perturbed version:

$$\mathbb{E}_{p(\mathbf{x})q(\tilde{\mathbf{x}}|\mathbf{x})}\left[\|\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q(\tilde{\mathbf{x}}|\mathbf{x})\|^2\right]$$

where $q(\tilde{\mathbf{x}}|\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2\mathbf{I})$ adds Gaussian noise.

The Target Score

For the perturbation kernel, the score is:

$$\nabla_{\tilde{\mathbf{x}}} \log q(\tilde{\mathbf{x}}|\mathbf{x}) = -\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^2} = -\frac{\boldsymbol{\epsilon}}{\sigma^2}$$

where $\boldsymbol{\epsilon} = \tilde{\mathbf{x}} - \mathbf{x}$ is the added noise. This is tractable: we know exactly what noise we added!

Key Result

Denoising score matching learns to estimate the score of the noisy distribution $q(\tilde{\mathbf{x}}) = \int q(\tilde{\mathbf{x}}|\mathbf{x})p(\mathbf{x})\,d\mathbf{x}$. With multiple noise levels, we can learn scores at all scales!
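Since the perturbed sample can be written $\tilde{\mathbf{x}} = \mathbf{x} + \sigma\mathbf{z}$ with unit Gaussian $\mathbf{z}$, the target can be computed either from the perturbed sample or directly from the noise. A quick numeric check of the two equivalent forms (σ and the "data" are illustrative):

```python
import torch

torch.manual_seed(0)
sigma = 0.5
x = torch.randn(4, 2)        # "clean" data points (illustrative)
z = torch.randn_like(x)      # unit Gaussian noise
x_tilde = x + sigma * z      # a sample from q(x~|x) = N(x~; x, sigma^2 I)

# The tractable DSM target, written two equivalent ways
eps = x_tilde - x                 # the noise as defined in the text
target = -eps / sigma ** 2        # -(x~ - x) / sigma^2
target_alt = -z / sigma           # same quantity in terms of unit noise
assert torch.allclose(target, target_alt)
```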

Tweedie's Formula

Tweedie's formula provides a remarkable connection between the score and optimal denoising:

$$\mathbb{E}[\mathbf{x}_0|\mathbf{x}_t] = \frac{\mathbf{x}_t + (1-\bar{\alpha}_t)\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)}{\sqrt{\bar{\alpha}_t}}$$

Derivation Sketch

Given the forward process $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$:

  1. The conditional distribution is $p(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$
  2. By Bayes' rule: $p(\mathbf{x}_0|\mathbf{x}_t) \propto p(\mathbf{x}_t|\mathbf{x}_0)p(\mathbf{x}_0)$
  3. The posterior mean can be expressed using the score of the marginal $p(\mathbf{x}_t)$

Tweedie's Formula: The optimal denoiser (the MMSE estimator) can be expressed in terms of the score function. If we know the score, we know how to optimally denoise!

Rearranging Tweedie's Formula

Solving for the score:

$$\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) = \frac{\sqrt{\bar{\alpha}_t}\,\mathbb{E}[\mathbf{x}_0|\mathbf{x}_t] - \mathbf{x}_t}{1-\bar{\alpha}_t}$$

Or in terms of expected noise:

$$\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) = -\frac{\mathbb{E}[\boldsymbol{\epsilon}|\mathbf{x}_t]}{\sqrt{1-\bar{\alpha}_t}}$$
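When the data distribution is Gaussian, both sides of Tweedie's formula are available in closed form, so the identity can be checked directly. A one-dimensional sketch, assuming $x_0 \sim \mathcal{N}(0, 1)$ (then $x_t \sim \mathcal{N}(0, 1)$ and the marginal score is simply $-x_t$); the value of $\bar{\alpha}_t$ and the observation are illustrative:

```python
import math

a = 0.7     # alpha_bar_t (illustrative)
x_t = 1.3   # an observed noisy value (illustrative)

# Marginal score: x_t ~ N(0, 1), so grad log p(x_t) = -x_t
score = -x_t

# Tweedie: E[x_0 | x_t] = (x_t + (1 - a) * score) / sqrt(a)
tweedie = (x_t + (1 - a) * score) / math.sqrt(a)

# Standard Gaussian conditioning gives E[x_0 | x_t] = sqrt(a) * x_t
posterior_mean = math.sqrt(a) * x_t
assert abs(tweedie - posterior_mean) < 1e-12
```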


The Epsilon-Score Connection

Now we can connect epsilon-prediction to score matching. Recall:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$$

The Fundamental Relationship

From Tweedie's formula:

$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \approx -\sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$$

Or equivalently:

$$\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$

| Perspective | Network Predicts | Interpretation |
|---|---|---|
| Epsilon-prediction | $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ | Noise that was added |
| Score-based | $-\boldsymbol{\epsilon}_\theta / \sqrt{1-\bar{\alpha}_t}$ | Gradient of the log density |
| Denoising | $(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta) / \sqrt{\bar{\alpha}_t}$ | Denoised estimate of $\mathbf{x}_0$ |

The Unified View: Training to predict the noise is exactly equivalent to learning the score function at different noise levels. The only difference is a time-dependent scaling.

A Unified View

This connection reveals why diffusion models work so well:

From Three Perspectives

  1. Variational (ELBO): Maximize a lower bound on log-likelihood by matching the reverse process to the tractable posterior
  2. Denoising: Train a hierarchy of denoisers at different noise levels, then chain them together
  3. Score-based: Learn the score function at multiple noise levels, then use Langevin dynamics or reverse SDE to sample

Why Multiple Noise Levels?

  • Low noise (t near 0): Score captures fine details, local structure
  • High noise (t near T): Score captures global structure, coarse features
  • Multi-scale denoising: Annealed sampling from coarse to fine

Historical Connection

Score-based generative models (Song & Ermon, 2019) and DDPMs (Ho et al., 2020) were developed independently but turned out to be the same thing! Song et al. (2021) unified them through the SDE framework.

Langevin Dynamics Connection

Given the score, we can sample using Langevin dynamics:

$$\mathbf{x}_{i+1} = \mathbf{x}_i + \frac{\eta}{2}\nabla_{\mathbf{x}} \log p(\mathbf{x}_i) + \sqrt{\eta}\,\mathbf{z}_i$$

where $\mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\eta$ is the step size.

This is essentially what the DDPM sampling algorithm does: it follows the learned score while injecting fresh noise at each step.
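To see the update rule in action, here is a minimal Langevin sampler driven by the analytic score of a 2-D Gaussian (the target mean, variance, step size, and iteration counts are all illustrative choices, not tuned values):

```python
import math
import torch

torch.manual_seed(0)

# Target distribution N(mu, sigma^2 I) and its analytic score -(x - mu)/sigma^2
mu = torch.tensor([2.0, -1.0])
sigma = 0.5
eta = 0.01  # step size

x = torch.zeros(1000, 2)  # 1000 parallel chains, started at the origin
for _ in range(2000):
    score = -(x - mu) / sigma ** 2
    x = x + 0.5 * eta * score + math.sqrt(eta) * torch.randn_like(x)

print(x.mean(dim=0))  # should be close to mu
```

With a small step size the chain's stationary distribution approaches $\mathcal{N}(\boldsymbol{\mu}, \sigma^2\mathbf{I})$; annealed Langevin sampling runs this same procedure at decreasing noise levels using the learned score in place of the analytic one.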


PyTorch Implementation

Let's implement the score-based perspective and verify the connection:

```python
import torch
import torch.nn as nn
from typing import Tuple


class ScoreBasedDiffusion:
    """Diffusion model from the score-based perspective."""

    def __init__(self, betas: torch.Tensor):
        self.betas = betas
        self.alphas = 1.0 - betas
        self.alphas_bar = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_bar = torch.sqrt(self.alphas_bar)
        self.sqrt_one_minus_alphas_bar = torch.sqrt(1.0 - self.alphas_bar)

    def _extract(self, tensor: torch.Tensor, t: torch.Tensor, x: torch.Tensor):
        """Extract per-timestep values and reshape for broadcasting."""
        values = tensor.to(x.device)[t]
        while values.dim() < x.dim():
            values = values.unsqueeze(-1)
        return values

    def eps_to_score(
        self,
        eps: torch.Tensor,
        t: torch.Tensor
    ) -> torch.Tensor:
        """
        Convert an epsilon prediction to a score.

        score = -eps / sqrt(1 - alpha_bar_t)
        """
        sqrt_one_minus_alpha_bar = self._extract(
            self.sqrt_one_minus_alphas_bar, t, eps
        )
        return -eps / sqrt_one_minus_alpha_bar

    def score_to_eps(
        self,
        score: torch.Tensor,
        t: torch.Tensor
    ) -> torch.Tensor:
        """
        Convert a score to an epsilon prediction.

        eps = -score * sqrt(1 - alpha_bar_t)
        """
        sqrt_one_minus_alpha_bar = self._extract(
            self.sqrt_one_minus_alphas_bar, t, score
        )
        return -score * sqrt_one_minus_alpha_bar

    def tweedie_x0_estimate(
        self,
        x_t: torch.Tensor,
        score: torch.Tensor,
        t: torch.Tensor
    ) -> torch.Tensor:
        """
        Tweedie's formula: estimate x_0 from x_t and the score.

        E[x_0 | x_t] = (x_t + (1 - alpha_bar_t) * score) / sqrt(alpha_bar_t)
        """
        sqrt_alpha_bar = self._extract(self.sqrt_alphas_bar, t, x_t)
        one_minus_alpha_bar = self._extract(
            1.0 - self.alphas_bar, t, x_t
        )

        return (x_t + one_minus_alpha_bar * score) / sqrt_alpha_bar

    def langevin_step(
        self,
        x: torch.Tensor,
        score: torch.Tensor,
        step_size: float = 0.01
    ) -> torch.Tensor:
        """
        Single Langevin dynamics step.

        x_{i+1} = x_i + (step_size / 2) * score + sqrt(step_size) * noise
        """
        noise = torch.randn_like(x)
        return x + 0.5 * step_size * score + torch.sqrt(
            torch.tensor(step_size)
        ) * noise


class ScoreMatchingLoss:
    """
    Denoising score matching loss.

    Equivalent to the epsilon-prediction loss up to scaling.
    """

    def __init__(self, betas: torch.Tensor):
        self.diffusion = ScoreBasedDiffusion(betas)

    def compute_loss(
        self,
        score_model: nn.Module,
        x_start: torch.Tensor,
        t: torch.Tensor
    ) -> Tuple[torch.Tensor, dict]:
        """
        Compute the denoising score matching loss.

        The target is: -eps / sqrt(1 - alpha_bar_t)
        """
        # Sample noise
        eps = torch.randn_like(x_start)

        # Create the noisy sample
        sqrt_alpha_bar = self.diffusion._extract(
            self.diffusion.sqrt_alphas_bar, t, x_start
        )
        sqrt_one_minus_alpha_bar = self.diffusion._extract(
            self.diffusion.sqrt_one_minus_alphas_bar, t, x_start
        )
        x_t = sqrt_alpha_bar * x_start + sqrt_one_minus_alpha_bar * eps

        # Target score: -eps / sqrt(1 - alpha_bar_t)
        target_score = -eps / sqrt_one_minus_alpha_bar

        # Predicted score
        pred_score = score_model(x_t, t)

        # Score matching loss
        loss = ((pred_score - target_score) ** 2).mean()

        # Also compute the equivalent epsilon loss for comparison
        pred_eps = self.diffusion.score_to_eps(pred_score, t)
        eps_loss = ((pred_eps - eps) ** 2).mean()

        return loss, {
            "score_loss": loss.item(),
            "eps_loss": eps_loss.item(),
            "score_norm": pred_score.norm().item(),
            "eps_norm": pred_eps.norm().item(),
        }


def verify_equivalence():
    """Verify that epsilon-prediction and score matching are equivalent."""
    T = 1000
    betas = torch.linspace(0.0001, 0.02, T)
    diffusion = ScoreBasedDiffusion(betas)

    # Random test data
    x_start = torch.randn(4, 3, 32, 32)
    eps = torch.randn_like(x_start)
    t = torch.randint(0, T, (4,))

    # Create x_t
    sqrt_alpha_bar = diffusion._extract(diffusion.sqrt_alphas_bar, t, x_start)
    sqrt_one_minus_alpha_bar = diffusion._extract(
        diffusion.sqrt_one_minus_alphas_bar, t, x_start
    )
    x_t = sqrt_alpha_bar * x_start + sqrt_one_minus_alpha_bar * eps

    # Convert eps to score and back
    score = diffusion.eps_to_score(eps, t)
    eps_reconstructed = diffusion.score_to_eps(score, t)

    print("Epsilon reconstruction error:",
          (eps - eps_reconstructed).abs().max().item())

    # Use Tweedie's formula
    x0_tweedie = diffusion.tweedie_x0_estimate(x_t, score, t)
    x0_from_eps = (x_t - sqrt_one_minus_alpha_bar * eps) / sqrt_alpha_bar

    print("x0 estimation error (Tweedie vs eps):",
          (x0_tweedie - x0_from_eps).abs().max().item())

    # With the exact noise, Tweedie's estimate recovers x_start
    print("x0 estimation error (vs ground truth):",
          (x0_tweedie - x_start).abs().max().item())


if __name__ == "__main__":
    verify_equivalence()
```

Key Takeaways

  1. Score function: $\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$ points toward high-probability regions
  2. Denoising score matching: Learn the score of noisy data by predicting the negative noise direction
  3. Tweedie's formula: $\mathbb{E}[\mathbf{x}_0|\mathbf{x}_t] = \frac{\mathbf{x}_t + (1-\bar{\alpha}_t)\nabla \log p(\mathbf{x}_t)}{\sqrt{\bar{\alpha}_t}}$
  4. The connection: $\boldsymbol{\epsilon}_\theta \approx -\sqrt{1-\bar{\alpha}_t}\,\nabla \log p(\mathbf{x}_t)$
  5. Unified framework: ELBO, denoising, and score-based perspectives all describe the same model

Looking Ahead: In Chapter 4, we'll dive deeper into the loss function, exploring different weighting strategies and understanding when each is appropriate.