Learning Objectives
By the end of this section, you will be able to:
- Define the score function
- Explain denoising score matching and why it works
- Derive Tweedie's formula and its implications
- Connect epsilon-prediction to score estimation
The Score Function
The score function is the gradient of the log probability density with respect to the data:

$$s(x) = \nabla_x \log p(x)$$
Why the Score?
Unlike the density $p(x)$ itself, the score $\nabla_x \log p(x)$:
- Points toward high density: The gradient indicates the direction of steepest increase in log-probability
- Avoids normalization: Since $p(x) = \tilde{p}(x)/Z$, we have $\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x)$, where $\tilde{p}$ is unnormalized; the constant $Z$ vanishes under the gradient
- Enables sampling: Langevin dynamics uses the score to sample from a distribution
Score for a Gaussian
For a Gaussian $p(x) = \mathcal{N}(x; \mu, \sigma^2 I)$:

$$\nabla_x \log p(x) = -\frac{x - \mu}{\sigma^2} = \frac{\mu - x}{\sigma^2}$$

Key Insight: The score of a Gaussian points from $x$ toward the mean $\mu$, scaled by the inverse variance. It tells you "which way to go" to reach higher probability.
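As a quick sanity check (a minimal sketch with arbitrary illustrative values, not from the text above), we can compare the closed-form Gaussian score against PyTorch autograd applied to the log-density:

```python
import torch

# Gaussian parameters (arbitrary illustrative values)
mu, sigma = torch.tensor(2.0), torch.tensor(0.5)

# Differentiate log N(x; mu, sigma^2) with respect to x
x = torch.tensor(1.0, requires_grad=True)
log_p = -0.5 * ((x - mu) / sigma) ** 2 - torch.log(sigma * (2 * torch.pi) ** 0.5)
log_p.backward()

# Closed form: (mu - x) / sigma^2
analytic_score = (mu - x.detach()) / sigma**2
print(x.grad.item(), analytic_score.item())
```

Both values agree: the gradient computed by autograd matches $(\mu - x)/\sigma^2$ exactly.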
Denoising Score Matching
Score matching (Hyvärinen, 2005) trains a network $s_\theta(x)$ to approximate the score function. The naive objective:

$$\mathcal{L} = \mathbb{E}_{p(x)}\left[\left\| s_\theta(x) - \nabla_x \log p(x) \right\|^2\right]$$

is intractable because we don't know the true score $\nabla_x \log p(x)$.
The Denoising Score Matching Trick
Vincent (2011) showed we can instead match the score of a perturbed version of the data:

$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{x_0 \sim p(x_0)}\, \mathbb{E}_{x_t \sim q(x_t \mid x_0)}\left[\left\| s_\theta(x_t, t) - \nabla_{x_t} \log q(x_t \mid x_0) \right\|^2\right]$$

where $q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\, (1 - \bar\alpha_t) I\right)$ adds Gaussian noise.
The Target Score
For the Gaussian perturbation kernel, the score is:

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1 - \bar\alpha_t} = -\frac{\epsilon}{\sqrt{1 - \bar\alpha_t}}$$

where $\epsilon \sim \mathcal{N}(0, I)$ is the noise used to create $x_t$. This is tractable - we know the noise we added!
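To make the target concrete, here is a small autograd check (illustrative, with an arbitrary $\bar\alpha_t$) that the gradient of $\log q(x_t \mid x_0)$ really equals $-\epsilon / \sqrt{1 - \bar\alpha_t}$:

```python
import torch

torch.manual_seed(0)
abar = torch.tensor(0.7)  # arbitrary alpha_bar_t for illustration
x0, eps = torch.randn(()), torch.randn(())

# Forward perturbation: x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps
x_t = (abar.sqrt() * x0 + (1 - abar).sqrt() * eps).detach().requires_grad_(True)

# log q(x_t | x_0) up to an additive constant (Gaussian log-density)
log_q = -0.5 * (x_t - abar.sqrt() * x0) ** 2 / (1 - abar)
log_q.backward()

target = -eps / (1 - abar).sqrt()
print(x_t.grad.item(), target.item())
```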
Key Result: Minimizing the denoising objective yields, at the optimum, $s_\theta(x_t, t) \approx \nabla_{x_t} \log q(x_t)$ - the score of the noisy marginal - even though the training target only ever uses the tractable conditional score.
Tweedie's Formula
Tweedie's formula provides a remarkable connection between the score and optimal denoising:

$$\mathbb{E}[x_0 \mid x_t] = \frac{x_t + (1 - \bar\alpha_t)\, \nabla_{x_t} \log q(x_t)}{\sqrt{\bar\alpha_t}}$$
Derivation Sketch
Given the forward process $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$:
- The conditional distribution is $q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\, (1 - \bar\alpha_t) I\right)$
- By Bayes' rule: $q(x_0 \mid x_t) \propto q(x_t \mid x_0)\, p(x_0)$
- Differentiating the marginal $q(x_t) = \int q(x_t \mid x_0)\, p(x_0)\, dx_0$ shows that the posterior mean $\mathbb{E}[x_0 \mid x_t]$ can be expressed using the score of the marginal
Tweedie's Formula: The optimal denoiser (MMSE estimator) can be expressed in terms of the score function. If we know the score, we know how to optimally denoise!
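Tweedie's formula can be checked in closed form for a toy case. The sketch below (my own illustrative setup, not from the text) takes $x_0 \sim \mathcal{N}(0, 1)$, so the noisy marginal is also $\mathcal{N}(0, 1)$ with score $-x_t$, and Tweedie's formula simplifies to $\mathbb{E}[x_0 \mid x_t] = \sqrt{\bar\alpha_t}\, x_t$; a Monte Carlo estimate of the posterior mean agrees:

```python
import torch

torch.manual_seed(0)
abar = 0.6
x0 = torch.randn(200_000)                      # x_0 ~ N(0, 1)
eps = torch.randn(200_000)
x_t = abar**0.5 * x0 + (1 - abar)**0.5 * eps   # marginal is N(0, 1)

# Tweedie with the analytic marginal score, nabla log q(x_t) = -x_t
score = -x_t
tweedie = (x_t + (1 - abar) * score) / abar**0.5   # simplifies to sqrt(abar) * x_t

# Monte Carlo posterior mean of x_0 near a fixed observation x_t ~ 1.0
mask = (x_t - 1.0).abs() < 0.05
mc_posterior_mean = x0[mask].mean()
print(mc_posterior_mean.item(), abar**0.5 * 1.0)
```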
Rearranging Tweedie's Formula
Solving for the score:

$$\nabla_{x_t} \log q(x_t) = \frac{\sqrt{\bar\alpha_t}\, \mathbb{E}[x_0 \mid x_t] - x_t}{1 - \bar\alpha_t}$$

Or in terms of the expected noise:

$$\nabla_{x_t} \log q(x_t) = -\frac{\mathbb{E}[\epsilon \mid x_t]}{\sqrt{1 - \bar\alpha_t}}$$
The Epsilon-Score Connection
Now we can connect epsilon-prediction to score matching. Recall the forward process: $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$.
The Fundamental Relationship
From Tweedie's formula:

$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}}$$

Or equivalently:

$$\epsilon_\theta(x_t, t) = -\sqrt{1 - \bar\alpha_t}\; s_\theta(x_t, t)$$
| Perspective | Network Predicts | Interpretation |
|---|---|---|
| Epsilon-prediction | $\epsilon_\theta(x_t, t)$ | Noise that was added |
| Score-based | $-\epsilon_\theta(x_t, t) / \sqrt{1 - \bar\alpha_t}$ | Gradient of log density |
| Denoising | $\left(x_t - \sqrt{1 - \bar\alpha_t}\, \epsilon_\theta(x_t, t)\right) / \sqrt{\bar\alpha_t}$ | Denoised estimate of $x_0$ |
The Unified View: Training to predict the noise is exactly equivalent to learning the score function at different noise levels. The only difference is a time-dependent scaling.
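The scaling claim can be verified numerically. In this sketch (arbitrary values, with a hypothetical imperfect prediction), the score-matching MSE is exactly the epsilon MSE divided by $1 - \bar\alpha_t$:

```python
import torch

torch.manual_seed(0)
abar = torch.tensor(0.8)  # arbitrary alpha_bar_t
eps_true = torch.randn(1000)
eps_pred = eps_true + 0.1 * torch.randn(1000)  # hypothetical imperfect prediction

# Convert both to scores with the time-dependent scaling
scale = (1 - abar).sqrt()
score_true, score_pred = -eps_true / scale, -eps_pred / scale

eps_loss = ((eps_pred - eps_true) ** 2).mean()
score_loss = ((score_pred - score_true) ** 2).mean()
print(score_loss.item(), (eps_loss / (1 - abar)).item())
```

The two printed values are identical: the objectives differ only by the constant factor $1/(1 - \bar\alpha_t)$ at each timestep.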
A Unified View
This connection reveals why diffusion models work so well:
From Three Perspectives
- Variational (ELBO): Maximize a lower bound on log-likelihood by matching the reverse process to the tractable posterior
- Denoising: Train a hierarchy of denoisers at different noise levels, then chain them together
- Score-based: Learn the score function at multiple noise levels, then use Langevin dynamics or reverse SDE to sample
Why Multiple Noise Levels?
- Low noise (t near 0): Score captures fine details, local structure
- High noise (t near T): Score captures global structure, coarse features
- Multi-scale denoising: Annealed sampling from coarse to fine
Historical Connection: Score-based generative models (Song & Ermon, 2019) and DDPMs (Ho et al., 2020) were developed largely independently; the relationship above shows they are two views of the same framework.
Langevin Dynamics Connection
Given the score, we can sample using Langevin dynamics:

$$x_{i+1} = x_i + \frac{\eta}{2}\, \nabla_x \log p(x_i) + \sqrt{\eta}\, z_i$$

where $z_i \sim \mathcal{N}(0, I)$ and $\eta$ is the step size.
This is essentially what the DDPM sampling algorithm does - it follows the learned score with added noise!
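As a minimal demonstration (a toy 1D target of my choosing, not the DDPM sampler itself), running Langevin dynamics with the analytic score of $\mathcal{N}(3, 1)$ drives many independent chains to that distribution:

```python
import torch

torch.manual_seed(0)
mu, sigma = 3.0, 1.0     # target distribution N(3, 1)
eta = 0.01               # step size
x = torch.zeros(10_000)  # 10k independent chains, all starting at 0

for _ in range(2000):
    score = (mu - x) / sigma**2                                # analytic Gaussian score
    x = x + 0.5 * eta * score + eta**0.5 * torch.randn_like(x)

print(x.mean().item(), x.var().item())  # should approach 3 and 1
```

Note that the chains start far from the target yet converge, because the score always points uphill in log-probability.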
PyTorch Implementation
Let's implement the score-based perspective and verify the connection:
```python
import torch
import torch.nn as nn
from typing import Tuple


class ScoreBasedDiffusion:
    """Diffusion model from the score-based perspective."""

    def __init__(self, betas: torch.Tensor):
        self.betas = betas
        self.alphas = 1.0 - betas
        self.alphas_bar = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_bar = torch.sqrt(self.alphas_bar)
        self.sqrt_one_minus_alphas_bar = torch.sqrt(1.0 - self.alphas_bar)

    def _extract(self, tensor: torch.Tensor, t: torch.Tensor, x: torch.Tensor):
        """Extract per-timestep values and reshape for broadcasting."""
        values = tensor.to(x.device)[t]
        while values.dim() < x.dim():
            values = values.unsqueeze(-1)
        return values

    def eps_to_score(
        self,
        eps: torch.Tensor,
        t: torch.Tensor
    ) -> torch.Tensor:
        """
        Convert epsilon prediction to score.

        score = -eps / sqrt(1 - alpha_bar_t)
        """
        sqrt_one_minus_alpha_bar = self._extract(
            self.sqrt_one_minus_alphas_bar, t, eps
        )
        return -eps / sqrt_one_minus_alpha_bar

    def score_to_eps(
        self,
        score: torch.Tensor,
        t: torch.Tensor
    ) -> torch.Tensor:
        """
        Convert score to epsilon prediction.

        eps = -score * sqrt(1 - alpha_bar_t)
        """
        sqrt_one_minus_alpha_bar = self._extract(
            self.sqrt_one_minus_alphas_bar, t, score
        )
        return -score * sqrt_one_minus_alpha_bar

    def tweedie_x0_estimate(
        self,
        x_t: torch.Tensor,
        score: torch.Tensor,
        t: torch.Tensor
    ) -> torch.Tensor:
        """
        Tweedie's formula: estimate x_0 from x_t and score.

        E[x_0 | x_t] = (x_t + (1 - alpha_bar_t) * score) / sqrt(alpha_bar_t)
        """
        sqrt_alpha_bar = self._extract(self.sqrt_alphas_bar, t, x_t)
        one_minus_alpha_bar = self._extract(
            1.0 - self.alphas_bar, t, x_t
        )

        return (x_t + one_minus_alpha_bar * score) / sqrt_alpha_bar

    def langevin_step(
        self,
        x: torch.Tensor,
        score: torch.Tensor,
        step_size: float = 0.01
    ) -> torch.Tensor:
        """
        Single Langevin dynamics step.

        x_{i+1} = x_i + (step_size/2) * score + sqrt(step_size) * noise
        """
        noise = torch.randn_like(x)
        return x + 0.5 * step_size * score + torch.sqrt(
            torch.tensor(step_size)
        ) * noise


class ScoreMatchingLoss:
    """
    Denoising score matching loss.

    Equivalent to epsilon-prediction loss up to scaling.
    """

    def __init__(self, betas: torch.Tensor):
        self.diffusion = ScoreBasedDiffusion(betas)
        self.sqrt_one_minus_alphas_bar = torch.sqrt(
            1.0 - torch.cumprod(1.0 - betas, dim=0)
        )

    def compute_loss(
        self,
        score_model: nn.Module,
        x_start: torch.Tensor,
        t: torch.Tensor
    ) -> Tuple[torch.Tensor, dict]:
        """
        Compute denoising score matching loss.

        The target is: -eps / sqrt(1 - alpha_bar_t)
        """
        # Sample noise
        eps = torch.randn_like(x_start)

        # Create noisy sample
        sqrt_alpha_bar = self.diffusion._extract(
            self.diffusion.sqrt_alphas_bar, t, x_start
        )
        sqrt_one_minus_alpha_bar = self.diffusion._extract(
            self.diffusion.sqrt_one_minus_alphas_bar, t, x_start
        )
        x_t = sqrt_alpha_bar * x_start + sqrt_one_minus_alpha_bar * eps

        # Target score: -eps / sqrt(1 - alpha_bar_t)
        target_score = -eps / sqrt_one_minus_alpha_bar

        # Predicted score
        pred_score = score_model(x_t, t)

        # Score matching loss
        loss = ((pred_score - target_score) ** 2).mean()

        # Also compute equivalent epsilon loss for comparison
        pred_eps = self.diffusion.score_to_eps(pred_score, t)
        eps_loss = ((pred_eps - eps) ** 2).mean()

        return loss, {
            "score_loss": loss.item(),
            "eps_loss": eps_loss.item(),
            "score_norm": pred_score.norm().item(),
            "eps_norm": pred_eps.norm().item(),
        }


def verify_equivalence():
    """Verify that epsilon-prediction and score-matching are equivalent."""

    T = 1000
    betas = torch.linspace(0.0001, 0.02, T)
    diffusion = ScoreBasedDiffusion(betas)

    # Random test data
    x_start = torch.randn(4, 3, 32, 32)
    eps = torch.randn_like(x_start)
    t = torch.randint(0, T, (4,))

    # Create x_t
    sqrt_alpha_bar = diffusion._extract(diffusion.sqrt_alphas_bar, t, x_start)
    sqrt_one_minus_alpha_bar = diffusion._extract(
        diffusion.sqrt_one_minus_alphas_bar, t, x_start
    )
    x_t = sqrt_alpha_bar * x_start + sqrt_one_minus_alpha_bar * eps

    # Convert eps to score and back
    score = diffusion.eps_to_score(eps, t)
    eps_reconstructed = diffusion.score_to_eps(score, t)

    print("Epsilon reconstruction error:",
          (eps - eps_reconstructed).abs().max().item())

    # Use Tweedie's formula
    x0_tweedie = diffusion.tweedie_x0_estimate(x_t, score, t)
    x0_from_eps = (x_t - sqrt_one_minus_alpha_bar * eps) / sqrt_alpha_bar

    print("x0 estimation error (Tweedie vs eps):",
          (x0_tweedie - x0_from_eps).abs().max().item())

    # Both should give the same x_start
    print("x0 estimation error (vs ground truth):",
          (x0_tweedie - x_start).abs().max().item())


if __name__ == "__main__":
    verify_equivalence()
```

Key Takeaways
- Score function: $\nabla_x \log p(x)$ points toward high-probability regions
- Denoising score matching: Learn the score of noisy data by predicting the negative noise direction
- Tweedie's formula: $\mathbb{E}[x_0 \mid x_t] = \left(x_t + (1 - \bar\alpha_t)\, \nabla_{x_t} \log q(x_t)\right) / \sqrt{\bar\alpha_t}$
- The connection: $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t) / \sqrt{1 - \bar\alpha_t}$
- Unified framework: ELBO, denoising, and score-based perspectives all describe the same model
Looking Ahead: In Chapter 4, we'll dive deeper into the loss function, exploring different weighting strategies and understanding when each is appropriate.